Two new studies of OA indexing

“Are Open Access Monographs Discoverable in Library Catalogs?”, portal: Libraries and the Academy, Volume 17, No. 1, January 2017…

The analysis indicates that only a small percentage of college and university library catalogs in the United States and Canada consistently enable discovery and access for the test sample.

“The open access aggregators challenge: how well do they identify free full text?”, Medium post, 7th January 2017. Looks at BASE and CORE…

“when OAI-PMH (which is the standard way of harvesting open access repositories) [was established,] no provision was made to have a standard way or a mandatory field to indicate if the item is free to access.” [But today] “many have in fact more metadata-only records than full-text records.”

[BASE] “is only able to see 75 free records in National University of Singapore’s IR, 654 free records in Nanyang Technology University’s IR, 143 free records in Singapore Management University’s IR. I did not do a check to see if there were false positives in BASE’s identification of full text but [assuming] they are 100% correct, we see only a full text identification ratio of 0.6%, 3.8% and 2.7% respectively!” […] “the results for CORE are as dismal as BASE.”
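To illustrate why aggregators struggle here: with no mandatory access field in an OAI-PMH Dublin Core record, a harvester is left to guess from indirect clues. Below is a minimal Python sketch of one such guess. The two sample records, the `guess_has_full_text` name and the PDF heuristic are all assumptions made for illustration, and are not how BASE or CORE actually work.

```python
import xml.etree.ElementTree as ET

# Namespace of the simple Dublin Core elements carried in oai_dc records.
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical records, invented for illustration only.
FULL_TEXT_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example thesis</dc:title>
  <dc:identifier>https://repository.example.edu/handle/123/456</dc:identifier>
  <dc:identifier>https://repository.example.edu/files/456/thesis.pdf</dc:identifier>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>"""

METADATA_ONLY_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example journal article</dc:title>
  <dc:identifier>https://repository.example.edu/handle/123/789</dc:identifier>
</oai_dc:dc>"""


def guess_has_full_text(record_xml):
    """Guess whether a record points at a downloadable file.

    Heuristic (an assumption): a dc:identifier ending in .pdf, or a
    dc:format mentioning pdf, suggests that full text is attached.
    """
    root = ET.fromstring(record_xml)
    identifiers = [el.text or "" for el in root.iter(DC + "identifier")]
    formats = [el.text or "" for el in root.iter(DC + "format")]
    has_pdf_link = any(i.strip().lower().endswith(".pdf") for i in identifiers)
    has_pdf_format = any("pdf" in f.lower() for f in formats)
    return has_pdf_link or has_pdf_format


print(guess_has_full_text(FULL_TEXT_RECORD))
print(guess_has_full_text(METADATA_ONLY_RECORD))
```

A guess of this kind will miss full text that sits behind a landing page, and may wrongly flag records that declare a PDF format without exposing the file, which is consistent with the low identification ratios quoted above.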

See also: “From open access metadata to open access content: two principles for increased visibility of open access content”, conference paper presented at Open Repositories 2013, 8th-12th July 2013, Charlottetown, Canada.

… only 27.6% of research outputs in repositories are linked to content that can be downloaded by automatic means and analysed (e.g. indexed). […] the median repository will only provide machine readable content for 13% of its deposited resources. [but] it is likely that these statistics are in fact rather optimistic …

Visit Britain’s open stash of 15,000 “copyright-free” hi-res pictures

Visit Britain provides nearly 15,000 selected copyright-free images, with a search box. The selection is obviously highly curated and high-quality, and no registration is required to download. If you have a pop-up blocker, you’ll need to whitelist the site to get at the hi-res magazine-quality image download link. There are a few noticeable gaps in coverage, such as the city of Stoke-on-Trent, a major ceramics tourism hub (one picture, on a search for “Stoke-on-Trent”).


Repositories and Creative Commons license metadata

“Assigning Creative Commons Licenses to Research Metadata: Issues and Cases”, 19th September 2016…

“From a recent analysis, out of a sample of around 2500 publication repository services in OpenDOAR ([those] supporting the OAI-PMH protocol standard), only 9 expose metadata license information: 3 with CC-0, 2 with CC-BY, and 4 which require a permission for commercial use, 3 with CC-0 and 1 with CC-BY.”

Nine. Not nine percent, just… nine. And one can assume that the other 1,100 repositories in OpenDOAR are even less likely to host CC license information for metadata in some form or other.

GRAFT updates

GRAFT has been updated, for the first time since last October. GRAFT enables a swift Google search across the world’s repositories, searching records and full-text alike. GRAFT currently searches around 1,600 more repositories than OpenDOAR, and does so via a thoroughly cleaned and up-to-date set of index URLs. Please access GRAFT via the page linked above, rather than any browser bookmark, to enjoy the newly added range.

Added to JURN

New Eastern Europe.

Laboratory Phonology.

Humanum Review (Quarterly Review of the John Paul II Institute).

Jewish Observer (1966—) (Difficult to index, sadly, despite indexing the TOCs at*/ and also the PDFs at*/*/JO*.pdf – where * is a wildcard. Someone might care to make a proper TOCs blog for the title, which would be better indexed by Google?)

Selected out-of-print issues of Ars Orientalis and Ars Islamica as full-text (the same volumes were already partly covered by the Smithsonian, but only via bare record pages interfacing with a book-player).

Now indexing the Burlington magazine (art history) full-text volumes directly on

Better indexing of the publications of The Metropolitan Museum of Art.

PLOS articles are now less verbose in JURN search results. JURN now focusses results on the core article, by actively excluding sub-pages for ‘figures’ / ‘citations’ / ‘supplementary’ / ‘comments’. A similar measure has been taken to make Nature’s open article content less verbose, by excluding the ‘tables’ pages for their articles.
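The general idea of excluding sub-pages by wildcard URL pattern can be sketched in Python. This is only an illustration: the patterns and example URLs below are hypothetical, and are not JURN’s actual exclusion rules or the real URL layout of any publisher.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns, where * is a wildcard.
EXCLUDE_PATTERNS = [
    "*/article/figures*",
    "*/article/citation*",
    "*/article/supplementary*",
    "*/article/comments*",
]


def keep_url(url):
    """Return True if the URL matches none of the exclusion patterns."""
    return not any(fnmatchcase(url, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical URLs on a placeholder domain.
urls = [
    "https://journal.example.org/article?id=10.0000/example.001",
    "https://journal.example.org/article/figures?id=10.0000/example.001",
    "https://journal.example.org/article/comments?id=10.0000/example.001",
]

core_pages = [u for u in urls if keep_url(u)]
print(core_pages)
```

Only the first URL survives the filter, so a search result would show the core article once rather than several near-duplicate sub-pages.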

Online publications of the Swiss Federal Institute for Forest, Snow and Landscape Research and their journal Diagonal.

Transvaal Museum Monographs.

2017 Edge Question – a Kindle ebook conversion

The 2017 Edge Question responses have just been released. Over 200 of the world’s finest minds answer “What scientific term or concept ought to be more widely known?”. As usual the combined single mega-page weighs in at around the length of two novels, on which the likes of Instapaper will choke. So Kindle ereader owners may want the unabridged unofficial .mobi ebook conversion for the Kindle.


The tyranny of “relevance” sorting

The tyranny of “relevance” sorting is rather wearing. Why is “relevance” the unchangeable default for various forms of search result? The results are so very rarely “relevant” (Google Search aside), and more often than not I’m looking for a “by date” ordering. I’ve been to the site before, and now I just want to see what’s new. If there’s one innovation I’d like to see in 2017 it’s a robust browser add-on, one which can be taught to identify a site’s relevance/date toggle and then auto-switch to “by date”.

Excel example sheet: Sort a list to retain only Names and remove the all-lowercase words

Here’s a working Microsoft Excel 2007 .xlsx file (11kb) that has a simple formula to split a word list according to the case of each word’s starting letter. For instance, you have a list that runs…


You want to remove all the words that do not start with a capital letter, since they are not likely to be personal names or place-names or species etc. Excel can’t do this ‘out of the box’, at least not with the various Sort buttons available in Excel 2007. Nor can plugins like ASAP Utilities. This spreadsheet results in a list with the all-lowercase words pushed down to the bottom of the sorted list, thus…


It won’t work properly if you also have words in your list with a capital letter after the first letter, such as “naZgul”. Those words will be flagged as if they start with a capital letter. Numbers, on the other hand, are fine.
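For those without Excel, the same kind of split can be sketched in a few lines of Python. This is not the spreadsheet’s formula, just a rough equivalent. The function name and the sample words are made up for illustration. Because it tests only the first character, “naZgul” is treated as lowercase here, and entries starting with a numeral are also pushed to the bottom.

```python
def split_by_initial_case(words):
    """Put words starting with a capital letter first, everything else after.

    The original order is kept within each of the two groups.
    """
    capitalised = [w for w in words if w[:1].isupper()]
    others = [w for w in words if not w[:1].isupper()]
    return capitalised + others


# Made-up sample list.
words = ["Gandalf", "wizard", "naZgul", "Mordor", "ring", "1984"]
result = split_by_initial_case(words)
print(result)
# ['Gandalf', 'Mordor', 'wizard', 'naZgul', 'ring', '1984']
```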


A survey of automated book index making software

Want to home-brew a classic “back of the book” index from a Word file, ideally using freeware? Here are all the current software options I could find:

* TExtract can handle a wide variety of input files and seems to be favoured by pro book indexers. From $79 (use for a single title) to $595 (buy outright). Seems likely to take a while to learn.

* WordEmbed v3.11. £80. An MS Word macro that helps to automate the process where your pre-made book index gets slotted in as an intrinsic ‘living/linked’ part of the MS Word document. It seems to be well regarded as a helping hand, but is not an automated maker of the index in the first place. Not likely to be used by amateurs, but it might be something you could tell your hired low-cost ebook freelancer about. They might be interested in learning how to use it and thus adding to their skills-base.

* PDF Index Generator. $69.95, with a free demo limited to the first ten pages of the book. Create a basic automatic index, and then trim back and supplement it as needed. Note that it requires that you install Java to run it, and having Java installed on your PC these days is a very major security risk.

* Index Generator 5.5 is un-crippled freeware for PDFs. It’s more basic than PDF Index Generator (above) but is quite capable and easy to use. I found that it doesn’t require Java to launch or work. For Windows, Mac and Linux. Could make it a lot easier to hand off the indexing to a low-cost ebook freelancer, and get something worthwhile back. Could be used in conjunction with the free Calibre (see below).

* For a simple table of word | language | times used, the free Calibre ebook management and conversion software can give you a quick output of all the words in an ebook. Calibre’s simple word table can then be exported to .csv and thus sorted in MS Excel. To access it from inside Calibre: load your ebook and convert to ePub (it only works with the ePub format) | click the tiny top-right “more” arrows | drop down the extra hidden toolbar | Edit Book | Tools | Reports | Words | Save…

The Word file’s word capitalisation is retained in the resulting Calibre list. On loading into Excel and sorting for capitalised words, one may thus quickly create a rough checklist of important name items, for reference use when selecting words with the likes of Index Generator (which regrettably has no such ‘show capitalised name words only’ function).
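If Excel is not to hand, the exported .csv can also be filtered with a short Python script. The sketch below assumes the export has a header row followed by three columns in the order word, language, times used. That layout is an assumption based on the table described above, so check your own file first. The sample data and function name are invented for illustration.

```python
import csv
import io


def capitalised_words(csv_file):
    """Return (word, count) pairs for capitalised words, most frequent first.

    Assumes a header row, with the word in column 1 and the count in column 3.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the assumed header row
    found = []
    for row in reader:
        if len(row) < 3:
            continue  # ignore short or blank rows
        word, count = row[0], row[2]
        if word[:1].isupper() and count.isdigit():
            found.append((word, int(count)))
    # Sort by count (descending), then alphabetically for ties.
    return sorted(found, key=lambda item: (-item[1], item[0]))


# Invented sample standing in for an exported file.
SAMPLE = (
    "Word,Language,Times used\n"
    "Gandalf,English,12\n"
    "wizard,English,30\n"
    "Mordor,English,7\n"
    "The,English,250\n"
)

names = capitalised_words(io.StringIO(SAMPLE))
print(names)
# [('The', 250), ('Gandalf', 12), ('Mordor', 7)]
```

For a real file, pass `open("words.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` sample (the filename is a placeholder). Note that sentence-initial words such as “The” come through as well, so the output is still only a rough checklist.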

Google goes deeper

It seems that JURN’s search results have become even more precise over the last year, if a new report by Searchmetrics is to be believed…

“the study found the URLs for pages that feature in the top 20 search results are about 15% longer on average than in 2015. Searchmetrics said this is likely because Google is better able to identify and display the precise pages that answer the search intention, and these pages are more likely to have longer URLs because they possibly lie buried deeper within websites.”