Users of the Firefox web browser and the general-purpose Greasemonkey scripting addon now have a handy Google URL harvester script for extracting URLs from searches. Install the harvester script and, when you next go to the Google home-page, you’ll see a new button: “Harvest Urls”…

Try it with a search such as…

   “jstor transmission” “british history” filetype:pdf

On some searches you have to wait between ten seconds and a minute, and a pulsing grey “working” symbol shows that the script is still busy. The above search took less than a second, and returned around 100 links to PDF files in a simple one-URL-per-line format.

It seems to have no problem with the “URL wrapping” that Google sometimes adds. Google URL Harvester is potentially very useful for those seeking to quickly make a combined and de-duplicated list to initially populate a subject-specific Google Custom Search Engine. It means you’re no longer reliant on finding “bunkers” of PDFs stored at particular websites.
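For anyone who wants to script that combining and de-duplicating step, here is a minimal Python sketch. It assumes the harvested lists have been saved as plain text files, and that any wrapped links use Google’s usual /url?q=… redirect form; both are assumptions on my part, not features of the harvester script itself.

```python
# Minimal sketch: merge several harvested URL lists, unwrap Google's
# "/url?q=..." redirect wrapping where present, and de-duplicate the result.
# File names and the wrapping pattern are illustrative assumptions.
from urllib.parse import urlparse, parse_qs
import sys

def unwrap(url: str) -> str:
    """Return the target URL if this looks like a Google redirect link, else the URL unchanged."""
    parsed = urlparse(url)
    if parsed.path == "/url" and "google." in parsed.netloc:
        query = parse_qs(parsed.query)
        target = query.get("q") or query.get("url")
        if target:
            return target[0]
    return url

def combine(list_files):
    """Merge several one-URL-per-line files, unwrapping and de-duplicating as we go."""
    seen, combined = set(), []
    for name in list_files:
        with open(name, encoding="utf-8") as f:
            for line in f:
                url = unwrap(line.strip())
                if url and url not in seen:
                    seen.add(url)
                    combined.append(url)
    return combined

if __name__ == "__main__":
    # e.g.  python combine_lists.py harvest1.txt harvest2.txt > cse_urls.txt
    for url in combine(sys.argv[1:]):
        print(url)
```

The output is one URL per line, ready to paste into the Custom Search Engine’s site list.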

Alternatively, a researcher could feed the list to one of the popular download managers (the commercial Teleport Pro, or the free WinHTTrack), download all the PDFs, and then index them fully with software such as the free Google Desktop, Copernic Desktop Search, or Zotero with the PDF2txt addon, or with something more powerful such as the commercial dtSearch. That way you can be sure you really are searching deep into the full text of the files, where the Googlebot may not have been.
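If you would rather not use a full download manager, the fetching step itself is simple enough to sketch with nothing but the Python standard library. The file and folder names here (pdf_urls.txt, pdfs) are only illustrative:

```python
# Minimal sketch of the "feed the list to a download manager" step, using only
# the Python standard library in place of Teleport Pro or WinHTTrack.
# "pdf_urls.txt" and the "pdfs" folder are illustrative names.
import os
import urllib.request
from urllib.parse import urlparse

def download_pdfs(list_file="pdf_urls.txt", out_dir="pdfs"):
    os.makedirs(out_dir, exist_ok=True)
    with open(list_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls, 1):
        name = os.path.basename(urlparse(url).path) or f"file_{i}.pdf"
        dest = os.path.join(out_dir, f"{i:04d}_{name}")
        try:
            urllib.request.urlretrieve(url, dest)
            print(f"saved {url} -> {dest}")
        except Exception as exc:  # dead links are common in harvested lists
            print(f"skipped {url}: {exc}")

if __name__ == "__main__":
    download_pdfs()
```

Once the folder is filled, it can be pointed at Google Desktop, Copernic, Zotero or dtSearch for the full-text indexing described above.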

There are a few drawbacks. The script doesn’t harvest the titles (the “anchors”) of the links in the search results, though there is a workaround (one possible approach is sketched below). It also limits you to 1,000 URLs per search, which is Google’s hard limit on any one search. And this method of gathering Google Custom Search Engine content can fill up your allocation of 5,000 URLs very quickly, and fairly imprecisely, unless you know how to search well.
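One possible workaround for the missing titles, assuming you are willing to save a results page locally first, is to pull out each link together with its anchor text yourself. This is only a sketch of the general idea, not the harvester script’s own behaviour, and results.html is an illustrative name:

```python
# One possible workaround (an assumption, not the harvester script's own method):
# save the Google results page locally, then extract each link together with its
# anchor text using only the standard library.
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    """Collect (anchor text, href) pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []    # finished (text, href) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

if __name__ == "__main__":
    parser = LinkHarvester()
    with open("results.html", encoding="utf-8") as f:
        parser.feed(f.read())
    for text, href in parser.links:
        if href.lower().endswith(".pdf"):  # keep only the PDF links
            print(f"{text}\t{href}")
```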

And as far as I’m aware there’s no software (including Zotero) that can load a simple list of PDF URLs, go and harvest the PDFs from it, and then automatically create accurate static records for each file while retaining the original link. Some of the personal web crawlers available will do something similar, but they create dynamic search boxes rather than static listing/directory pages.
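To show roughly what such software would have to do, here is a sketch that takes the original URL list and the folder of PDFs fetched by the earlier sketch, and writes a static HTML listing page with one record per file, the original link retained. The file names are again illustrative, and a real tool would also want titles, dates and notes:

```python
# Rough sketch of the missing piece described above: given the original URL list
# and the folder of PDFs saved by the earlier download sketch, write a static
# HTML listing page, one record per file, keeping the original link.
# "pdf_urls.txt", "pdfs" and "library.html" are illustrative names.
import os
from urllib.parse import urlparse

def build_static_index(list_file="pdf_urls.txt", pdf_dir="pdfs", page="library.html"):
    with open(list_file, encoding="utf-8") as f:
        urls = [line.strip() for line in f if line.strip()]
    local_files = sorted(os.listdir(pdf_dir)) if os.path.isdir(pdf_dir) else []
    rows = []
    for i, url in enumerate(urls, 1):
        name = os.path.basename(urlparse(url).path) or f"file_{i}.pdf"
        # the earlier download sketch saved files as "<index>_<name>"
        local = next((f for f in local_files if f.startswith(f"{i:04d}_")), None)
        if local:
            rows.append(f'<li><a href="{os.path.join(pdf_dir, local)}">{name}</a> '
                        f'(original: <a href="{url}">{url}</a>)</li>')
        else:
            rows.append(f'<li>not downloaded: <a href="{url}">{url}</a></li>')
    with open(page, "w", encoding="utf-8") as f:
        f.write("<html><body>\n<ul>\n" + "\n".join(rows) + "\n</ul>\n</body></html>")

if __name__ == "__main__":
    build_static_index()
```

The result is a plain static listing/directory page of the kind the personal web crawlers don’t give you, which can itself be crawled, indexed or added to a Custom Search Engine.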
