The tyranny of “relevance” sorting is rather wearing. Why is “relevance” the unchangeable default for various forms of search result? The results are so very rarely “relevant” (Google Search aside), and more often than not I’m looking for a “by date” ordering. I’ve been to the site before, and now I just want to see what’s new. If there’s one innovation I’d like to see in 2017 it’s a robust browser add-on, one which can be taught to identify a site’s relevance/date toggle and then auto-switch it to “by date”.
Here’s a working Microsoft Excel 2007 .xlsx file (11kb) that has a simple formula to split a word list according to the case of each word’s starting letter. For instance, you have a list that runs…
You want to remove all the words that do not start with a capital letter, since they are not likely to be personal names or place-names or species etc. Excel can’t do this ‘out of the box’, at least not with the various Sort buttons available in Excel 2007. Nor can plugins like ASAP Utilities. This spreadsheet results in a list with the all-lowercase words pushed down to the bottom of the sorted list, thus…
It won’t work properly if you also have words in your list with a capital letter after the first letter, such as “naZgul”. Those words will be flagged as if they start with a capital letter. Numbers, on the other hand, are fine.
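For those who would rather script it than use the spreadsheet, the same idea can be sketched in Python. This is a minimal sketch and not the spreadsheet's formula: the function name and the sample list are made up for illustration, and it tests only the first character of each word, so unlike the spreadsheet it treats “naZgul” as lowercase.

```python
def split_by_initial_case(words):
    """Return (capitalised, other), keeping the input order within each group."""
    capitalised, other = [], []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank lines
        # Only the first character is tested; digits count as "not uppercase".
        if word[0].isupper():
            capitalised.append(word)
        else:
            other.append(word)
    return capitalised, other


# Hypothetical sample list, for illustration only.
sample = ["Gandalf", "wizard", "Mordor", "naZgul", "ring", "1066"]
caps, rest = split_by_initial_case(sample)

# Capitalised words first, everything else pushed to the bottom.
print(caps + rest)
```

Numbers end up in the lower group here, alongside the lowercase words.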
Want to home-brew a classic “back of the book” index from a Word file, ideally using freeware? Here are all the current software options I could find:
* TExtract can handle a wide variety of input files and seems to be favoured by pro book indexers. From $79 (use for a single title) to $595 (buy outright). Seems likely to take a while to learn.
* WordEmbed v3.11. £80. An MS Word macro that helps to automate the process where your pre-made book index gets slotted in as an intrinsic ‘living/linked’ part of the MS Word document. It seems to be well regarded as a helping hand, but is not an automated maker of the index in the first place. Not likely to be used by amateurs, but it might be something you could tell your hired low-cost ebook freelancer about. They might be interested in learning how to use it and thus adding to their skills-base.
* PDF Index Generator. $69.95, with a free demo limited to the first ten pages of the book. Create a basic automatic index, and then trim back and supplement it as needed. Note that it requires that you install Java to run it, and having Java installed on your PC these days is a very major security risk.
* Index Generator 5.5 is un-crippled freeware for PDFs. It’s more basic than PDF Index Generator (above) but is quite capable and easy to use. I found that it doesn’t require Java to launch or work. For Windows, Mac and Linux. Could make it a lot easier to hand off the indexing to a low-cost ebook freelancer, and get something worthwhile back. Could be used in conjunction with the free Calibre (see below).
* For a simple table of word | language | times used, the free Calibre ebook management and conversion software can give you a quick output of all the words in an ebook. Calibre’s simple word table can then be exported to .csv and sorted in MS Excel. To access it from inside Calibre: load your ebook and convert it to ePub (this only works with the ePub format) | click the tiny top-right “more” arrows | drop down the extra hidden toolbar | Edit Book | Tools | Reports | Words | Save…
The Word file’s word capitalisation is retained in the resulting Calibre list. On loading into Excel and sorting for capitalised words, one may thus quickly create a rough checklist of important name items, for reference use when selecting words with the likes of Index Generator (which regrettably has no such ‘show capitalised name words only’ function).
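The capitalised-word checklist could also be pulled straight from the exported .csv without Excel. The following is a rough sketch only: the column header “Word” and the sample rows are assumptions made for illustration, so check the header row of your own Calibre export and adjust the `word_column` argument to match.

```python
import csv
import io


def capitalised_words(csv_text, word_column="Word"):
    """Return the unique words that start with a capital letter, sorted A-Z.

    `word_column` is an assumed header name, not confirmed against Calibre.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    names = {
        row[word_column]
        for row in reader
        if row[word_column][:1].isupper()
    }
    return sorted(names)


# Hypothetical export contents, for illustration only.
sample_csv = (
    "Word,Language,Times used\n"
    "Shire,eng,12\n"
    "the,eng,340\n"
    "Bree,eng,4\n"
    "shire,eng,1\n"
)

print(capitalised_words(sample_csv))
```

For a real export, read the file's text with `open(path, newline="", encoding="utf-8").read()` and pass that in.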
It seems that JURN’s search results have become even more precise over the last year, if a new report by Searchmetrics is to be believed…
“the study found the URLs for pages that feature in the top 20 search results are about 15% longer on average than in 2015. Searchmetrics said this is likely because Google is better able to identify and display the precise pages that answer the search intention, and these pages are more likely to have longer URLs because they possibly lie buried deeper within websites.”
Exhibition (journal of the U.S. National Association for Museum Exhibition, with a two-issue partial paywall)
Conservar Patrimonio (Portuguese art conservation journal, partly in English)
Fixed indexing of the scielo.org aggregation sites, to make them less verbose in search results. Specifically, several of the Scielo sites recently introduced an ‘export’ page for each and every citation. These ‘export’ pages are now blocked from JURN’s results.
The launch of Metadata 2020 is reported to have slipped to early 2017. They’re apparently hoping that the big publishers will release all their metadata for open public use, and will flag their open access articles with uniform publicly-discoverable tags. Good luck with that one.
For those interested in end-of-year OA tallies, I can report that this blog recorded a total of 340 journals added to JURN in 2016. Nearly all those titles publish in English on topics in the humanities or the natural world. If the 340 were combined with the worthy foreign-language journal URLs also added in 2016, then the total OA journals added to JURN might be around 500. Which means it’s been a somewhat slower year than 2015, which added 450 new titles published in English.
JURN’s annual full link-check + repair is now complete. The checking of the indexed URLs is normally done in August/September, so this year it has been running a few months late, mostly because the work itself took a few months, on and off. URL presence on Google Search is checked against the indexed path at http://www.site.com/journal/articles/pdfs/.. etc and not against the site root at http://www.site.com/ etc.
This checking is in addition to the weekly linkbot-enabled checking of the homepage URLs in the Directory.
I see there’s a new 2016 study of the DOAJ in New Library World (Vol. 117, 11/12, pages 746-755). The researchers found that in the DOAJ…
“roughly 20-25% of the [journal homepage] URLs redirected to another URL” but that “only 2.11% of 9,073 journals [proved] to be inaccessible”
… once the redirects were followed.
Two automated tests were done (using home-brewed Excel wizardry, rather than dedicated linkbot software) of all 9,073 titles, one month apart, pinging each journal’s homepage. They followed this up with a manual check on all the URLs of the still-inaccessible journals.
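The two-pass approach described above can be sketched in Python. To be clear, this is not what the researchers did (they used Excel); it is only a minimal illustration using the standard library's `urllib`, and the function names, the 20-second timeout and the example URLs are all my own assumptions.

```python
import urllib.error
import urllib.request


def check_homepage(url, opener=urllib.request.urlopen, timeout=20):
    """Classify one homepage URL as 'ok', 'redirected' or 'inaccessible'."""
    try:
        # urlopen follows HTTP redirects; geturl() reports where we ended up.
        with opener(url, timeout=timeout) as response:
            final_url = response.geturl()
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP 4xx/5xx errors, DNS failures, timeouts and malformed URLs.
        return "inaccessible"
    if final_url.rstrip("/") == url.rstrip("/"):
        return "ok"
    return "redirected"


def still_inaccessible(first_pass, second_pass):
    """URLs that failed in both passes: the candidates for a manual check."""
    return sorted(
        url
        for url, status in first_pass.items()
        if status == "inaccessible" and second_pass.get(url) == "inaccessible"
    )


# Hypothetical results from two passes run a month apart.
first = {
    "http://a.example/": "ok",
    "http://b.example/": "inaccessible",
    "http://c.example/": "inaccessible",
}
second = {
    "http://a.example/": "ok",
    "http://b.example/": "inaccessible",
    "http://c.example/": "redirected",
}

print(still_inaccessible(first, second))
```

In a real run each pass would be built with something like `{url: check_homepage(url) for url in urls}`. Only the URLs that fail both times go forward for checking by hand.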
The research seems to have been quite thorough, although I’d observe that a homepage URL is far less likely to be broken than the deeper direct article URLs on the DOAJ’s table-of-content pages. Article page / PDF URLs can be easily broken, for instance by the journal moving from WordPress to OJS or vice versa. A similar test might usefully be run on a sample of DOAJ article URLs, although I must say that I haven’t noticed any problem on the DOAJ in that respect.
I see that Bentham Open (aka Bentham Science Publishers, not directly indexed in JURN) provided 67 of the inaccessible titles. For some reason they are still in the DOAJ after the recent purge, but my quick tests on the DOAJ’s Bentham URLs found all those tested to be unresponsive. That was last night, and they were tested again today and again found to be unresponsive. So I’m not too worried about their popping up in JURN results (via the DOAJ indexing) and I presume that the DOAJ will have them out fairly soon for 404-ing.