The tyranny of “relevance” sorting is rather wearing. Why is “relevance” the unchangeable default for so many forms of search result? Because such results are so very rarely “relevant” (Google Search aside), and more often than not I’m looking for a “by date” ordering. I’ve been to the site before, and now I just want to see what’s new. If there’s one innovation I’d like to see in 2017 it’s a robust browser add-on, one which can be taught to identify a site’s relevance/date toggle and then auto-switch to “by date”.
Here’s a working Microsoft Excel 2007 .xlsx file (11kb) that has a simple formula to split a word list according to the case of each word’s starting letter. For instance, you have a list that runs…
You want to remove all the words that do not start with a capital letter, since they are not likely to be personal names, place-names, species names, etc. Excel can’t do this ‘out of the box’, at least not with the various Sort buttons available in Excel 2007. Nor can plugins like ASAP Utilities. This spreadsheet produces a sorted list with all the lowercase-initial words pushed down to the bottom, thus…
It won’t work properly if you also have words in your list with a capital letter after the first letter, such as “naZgul”. Those words will be flagged as if they start with a capital letter. Numbers, on the other hand, are fine.
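For anyone without Excel to hand, the same capitals-first sort can be sketched in a few lines of Python. This is only a minimal equivalent of what the spreadsheet’s formula achieves; note that, because it checks only the first character, it doesn’t share the spreadsheet’s “naZgul” quirk:

```python
def sort_capitals_first(words):
    """Sort a word list so that words starting with a capital letter
    (or a number) come first, alphabetically, and all words starting
    with a lowercase letter are pushed down to the bottom."""
    def key(word):
        starts_lower = word[:1].islower()  # False sorts before True
        return (starts_lower, word.lower())
    return sorted(words, key=key)

words = ["aardvark", "Nazgul", "badger", "Mordor", "42nd"]
print(sort_capitals_first(words))
# Capitalised words and numbers first, lowercase words at the bottom.
```

One could then simply delete the bottom block of lowercase-initial words, as with the spreadsheet.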
Updated: 13th July 2020.
Want to home-brew a classic “back of the book” index from a Word file, ideally using freeware? Here are all the current software options I could find:
* TExtract can handle a wide variety of input files and seems to be favoured by pro book indexers. From $79 (use for a single title) to $595 (buy outright). Seems likely to take a while to learn.
* WordEmbed. £80. An MS Word macro that helps to automate the process by which your pre-made book index gets slotted in as an intrinsic ‘living/linked’ part of the MS Word document. It seems well regarded as a helping hand, but it is not an automated maker of the index in the first place. It’s not likely to be used by amateurs, but it might be something you could tell your hired low-cost ebook freelancer about, since they might be interested in learning how to use it and thus adding to their skills-base.
* PDF Index Generator. $69.95, with a free demo limited to the first ten pages of the book. Create a basic automatic index, and then trim back and supplement it as needed.
Version 2.4 added a new feature, a… “new query template has been added to allow indexing capitalized phrases” which works this way: get to “Step 2” in the initial PDF import | “Include words” | Click on pencil icon | “Add Query” | Choose “Capitalised Phrases” from the dropdown | this then forms Query 1 | Make sure Query 1 is ticked, and “Index these words only” | OK.
You now have a vastly more useful starting point for a first-pass at an index than otherwise, with all your place-names and personal names done…
There’s also a filter to get the “surnames, forenames” switched over. You can stack filters and/or run multiple indexes and then merge them (video tutorial link: see video at the 3 minute mark) and thus work in stages.
You’d then un-tick the irrelevancies and cut out the mis-steps, and then go through your book manually and add to the index various concepts and ideas which readers might want to look up. That wouldn’t be the end of making a polished index, but it’d be a big chunk of the grunt-work done.
A note on Java:
However useful such automation is, note that PDF Index Generator requires that you install Java to run it, and having Java installed on your PC these days is a very major and ongoing security risk…
Network World reported that in 2014 U.S. Homeland Security… “recommended users uninstall Java completely” throughout the USA. In 2014 PC Magazine advised “Users should either uninstall Java, disable it entirely in the browser, or take other steps to protect themselves from attacks against Java.” In 2015 InfoWorld magazine wrote that… “in 2015, it’s really, really tempting [for a network admin] to simply uninstall Java from user machines.” In 2017 even JavaWorld wrote, of yet more new and critical vulnerabilities, that… “Users should uninstall Java from their systems”.
Still… one might safely install Java on an old laptop and run the program from there, where it would be quarantined from your main PC, provided the laptop has sufficient memory. Or, for a one-time use on your main PC, you might: i) download the standalone Java installer; ii) disconnect from the Internet; iii) install Java and then PDF Index Generator; iv) do your indexing output and refining work; v) completely uninstall Java and then re-connect to the Internet. Only with the standalone (full, about 58Mb) Java installer and the Internet disconnected does the installer NOT collect and send your system fingerprint to a remote location at Oracle, the makers of Java. After the install you should also look through the Java Security settings and disable things like Web browser integration (most Web browser makers block all Java plugins by default, but it’s best to check).
Update, July 2020: As of PDF Index Generator 2.9…
The Windows edition of the program now comes with Java embedded inside it, so you don’t have to worry about installing the right Java edition to run the program.
* Index Generator is un-crippled freeware for PDFs. It’s more basic than PDF Index Generator (above), lacking things like Phrase Query filters, but is quite capable and easy to use. I found that it doesn’t require an install of Java to launch or work. It’s available for Windows, Mac and Linux (the latter two do seem to require Java?). The very major drawback is that it currently appears to lack any Query ability to select only capitalised items such as Names and Place Names, and seems to actually case-shift every word in its pick-list to lower-case! Still, it’s in active development, and we may well see it catching up with PDF Index Generator over time.
* For a simple table of word | language | times used, the free Calibre ebook management and conversion software can also give you a quick output of all the words in an ebook. Calibre’s simple word table can then be exported to .csv and thus sorted in MS Excel. To access it from inside Calibre: load your ebook and convert it to ePub (the report only works with the ePub format) | click the tiny top-right “more” arrows | drop down the extra hidden toolbar | Edit Book | Tools | Reports | Words | Save…
The Word file’s word capitalisation is retained in the resulting Calibre list. On loading into Excel and sorting for capitalised words, one may thus quickly create a rough checklist of important name items, for reference use when selecting words with the likes of Index Generator (which regrettably appears to have no such ‘show capitalised name words only’ function).
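The Excel step could equally be sketched in Python, filtering the exported .csv down to its capitalised words. The word | language | times-used column layout below is an assumption, so adjust the column index to match your actual Calibre export:

```python
import csv
import io

def capitalised_rows(csv_text, word_column=0):
    """From a Calibre 'Words' report exported to CSV, keep only the
    rows whose word starts with a capital letter -- a rough first
    pass at a checklist of names and place-names.
    word_column: index of the word column (assumed to be the first)."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row for row in reader if row and row[word_column][:1].isupper()]

# Illustrative sample in the assumed word,language,count layout:
sample = "Mordor,eng,12\naardvark,eng,3\nShire,eng,7\n"
for row in capitalised_rows(sample):
    print(row)
```

The surviving rows form the rough “important name items” checklist described above.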
* Indiscripts’ IndexMatic 2 plugin for Adobe InDesign (which is Adobe’s flagship DTP software).
Possibly someone will eventually whip up a script to automatically check if a word or phrase in an index has a corresponding Wikipedia or Infogalactic page, thus offering another way to filter a word-list down to the more important items.
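As a rough sketch of how such a script might start, the standard MediaWiki api.php endpoint can be asked whether a given title exists (following redirects). Everything here beyond that query endpoint is illustrative, and the lookup naturally requires a network connection:

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_query_url(title):
    """Build a MediaWiki API query URL asking whether `title` exists."""
    params = {"action": "query", "titles": title,
              "format": "json", "redirects": "1"}
    return API + "?" + urllib.parse.urlencode(params)

def has_wikipedia_page(title):
    """True if English Wikipedia has a page for `title`, following
    redirects. Missing pages carry a 'missing' key in the response."""
    with urllib.request.urlopen(build_query_url(title), timeout=10) as resp:
        data = json.load(resp)
    pages = data["query"]["pages"]
    return all("missing" not in page for page in pages.values())
```

Run over an index word-list, this would flag which entries have no corresponding page, offering one crude importance filter.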
It seems that JURN’s search results have become even more precise over the last year, if a new report by Searchmetrics is to be believed…
“the study found the URLs for pages that feature in the top 20 search results are about 15% longer on average than in 2015. Searchmetrics said this is likely because Google is better able to identify and display the precise pages that answer the search intention, and these pages are more likely to have longer URLs because they possibly lie buried deeper within websites.”
Exhibition (journal of the U.S. National Association for Museum Exhibition, with a two issue partial paywall)
Conservar Patrimonio (Portuguese art conservation journal, partly in English)
Fixed indexing of the scielo.org aggregation sites, to make them less verbose in search results. Specifically, several of the Scielo sites recently introduced an ‘export’ page for each and every citation. These ‘export’ pages are now blocked from JURN’s results.
The launch of Metadata 2020 is reported to have slipped to early 2017. They’re apparently hoping that the big publishers will release all their metadata for open public use, and will flag their open access articles with uniform publicly-discoverable tags. Good luck with that one.
For those interested in end-of-year OA tallies, I can report that this blog recorded a total of 340 journals added to JURN in 2016. Nearly all those titles publish in English on topics in the humanities or the natural world. If the 340 were combined with the worthy foreign-language journal URLs also added in 2016, the total OA journals added to JURN might be around 500. Which means it’s been a somewhat slower year than 2015, which added 450 new titles published in English.
JURN’s annual full link-check + repair is now complete. Checking of the indexed URLs is normally done in August/September, so this year it ran a few months late, mostly because the work itself took a few months, on and off. URL presence on Google Search is checked down to the indexed path, at http://www.site.com/journal/articles/pdfs/.. etc, and not just to http://www.site.com/ etc.
This checking is in addition to the weekly linkbot-enabled checking of the homepage URLs in the Directory.
I see there’s a new 2016 study of the DOAJ in New Library World (Vol. 117, 11/12, pages 746-755). The researchers found that in the DOAJ…
“roughly 20-25% of the [journal homepage] URLs redirected to another URL” but that “only 2.11% of 9,073 journals [proved] to be inaccessible”
… once the redirects were followed.
Two automated tests were run (using home-brewed Excel wizardry, rather than dedicated linkbot software) across all 9,073 titles, one month apart, pinging each journal’s homepage. The researchers followed this up with a manual check of the URLs of the still-inaccessible journals.
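The study’s ping-the-homepage-and-follow-redirects method might be sketched like this in Python. The study itself used Excel rather than code, so this is just an illustration of the approach, with the function name and return shape my own:

```python
import urllib.request

def check_homepage(url, timeout=15):
    """Fetch a journal homepage, following any redirects (as urllib
    does by default). Returns (accessible, redirected, final_url),
    so redirected-but-reachable journals are not counted as dead."""
    request = urllib.request.Request(url, headers={"User-Agent": "link-check"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            final_url = resp.geturl()
            return True, final_url != url, final_url
    except Exception:
        return False, False, url

# As in the study: run the whole list twice, a month apart, then
# manually re-check whatever is still inaccessible.
```

Comparing the final URL against the original is what separates the “roughly 20–25% redirected” figure from the much smaller “2.11% inaccessible” one.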
The research seems to have been quite thorough, although I’d observe that a homepage URL is far less likely to be broken than the deeper direct article URLs on the DOAJ’s table-of-contents pages. Article page / PDF URLs can easily be broken, for instance by the journal moving from WordPress to OJS or vice versa. A similar test might usefully be run on a sample of DOAJ article URLs, although I must say that I haven’t noticed any problem with the DOAJ in that respect.
I see that Bentham Open (aka Bentham Science Publishers, not directly indexed in JURN) provided 67 of the inaccessible titles. For some reason they are still in the DOAJ after the recent purge, but my quick tests on the DOAJ’s Bentham URLs found all those tested to be unresponsive. That was last night, and they were tested again today and again found to be unresponsive. So I’m not too worried about their popping up in JURN results (via the DOAJ indexing) and I presume that the DOAJ will have them out fairly soon for 404-ing.