How to extract Google Search results – with URL, title and snippet – in a CSV file

Hurrah! I found a viable way to automatically, reliably, and fairly simply grab a CSV of Google Search results. With URL, title (anchor) text, and even the sample snippet. This is, of course, only intended for academic use — to speedily build useful lists of subject-specific links.

1. Download the free MozBar addon for Firefox. It’s SEO stuff for spam-munching webmasters, but it’s free and it works. Note that the CSV export feature is only present in the Firefox toolbar. Not the Google Chrome version.

2. Temporarily turn off any Firefox addons you might have for modifying the appearance of Google Search results, such as GoogleMonkeyR.

3. Go to Google Search, go to Search Settings, and turn on Google Instant if you have it disabled. Turn the number of results to 100. Save. Now do a test search.

   No SERP Control Panel showing up? Click on the new SEOMoz toolbar (it’s sitting up near the top of your browser), click on the grey cogs, and select Google…

   

   The SERP Control Panel overlay should now appear over to the right of the search results. Note that, if you have Google Instant turned off, you may also need to repeat this step for each new search or page in order to get the data queued up correctly for a fresh CSV output.

4. On the SERP control panel, click on “Export to CSV”…

Note that we can also do this with Bing and Yahoo, and perhaps others if you can make profiles for them. Possibly it might work with Google Scholar?

5. Open the resulting CSV file with Excel…



You even get the description/snippet from the search results, although prefaced with some junk. Simply delete everything in front of the keyword “Undo” in the relevant column, by using Sobolsoft’s Excel Remove (Delete, Replace) Text, Spaces & Characters From Cells Addin for Excel…

Also delete the columns with the SEO junk in them. You now have three clean columns: URL, title, and snippet. Use a formula to convert these to pretty linked HTML in a fourth column, or paste them into a mega-file of subject-specific results for further weeding and sorting.
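If you'd rather skip the Excel add-in, the same clean-up can be roughly sketched in Python. Treat this as a sketch under assumptions, not a recipe. The column names ("URL", "Title", "Description", "Junk column") are hypothetical, so check the header row of your own export first. It also throws away the “Undo” marker itself along with the junk in front of it, which is my own choice.

```python
import csv
import html
import io


def clean_snippet(text, marker="Undo"):
    """Drop everything up to and including the first occurrence of marker."""
    head, sep, tail = text.partition(marker)
    # If the marker is absent, leave the snippet alone (apart from whitespace).
    return tail.strip() if sep else text.strip()


def clean_rows(rows, url_col="URL", title_col="Title", snippet_col="Description"):
    """Keep only URL, title and snippet, and add a linked-HTML fourth column.

    The default column names are assumptions, not the exporter's real headers.
    """
    cleaned = []
    for row in rows:
        url = row[url_col].strip()
        title = row[title_col].strip()
        snippet = clean_snippet(row[snippet_col])
        link = '<a href="{}">{}</a>'.format(
            html.escape(url, quote=True), html.escape(title)
        )
        cleaned.append({"url": url, "title": title, "snippet": snippet, "html": link})
    return cleaned


# A made-up one-row export, standing in for the real CSV file.
sample = io.StringIO(
    "URL,Title,Description,Junk column\n"
    "http://example.org/a,Film noir & cities,junk junk Undo A short snippet.,40\n"
)
results = clean_rows(csv.DictReader(sample))
print(results[0]["html"])
```

To run it on a real file, swap the `io.StringIO` sample for `open("export.csv", newline="", encoding="utf-8")`, where the filename is whatever you saved the export as.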

None of the above is as robust or simple as the now-broken Google Extract Data & Text was, and it’s to be hoped that Sobolsoft fixes this software soon for Windows 7 + IE9.

TPB Physibles

The Web’s biggest pirate galleon has just announced a new search category: “Physibles”, a fancy name for digital 3D objects…

“Data objects that are able (and feasible) to become physical. We believe that things like three dimensional printers, scanners and such are just the first step. We believe that in the nearby future you will print your spare sparts for your vehicles. You will download your sneakers within 20 years.”

Google 3D Warehouse has of course been quietly doing something very similar for some years now. All their models are free (inc. commercial use) too, but legit. They even give you awesome software, Google SketchUp, for free to manipulate and alter the objects.

Google Search harvesting becomes more difficult

Sad to say, but the three Google Search harvesting utilities from Sobolsoft no longer work on Windows 7 with Internet Explorer 9. The utilities are: Google Save Search Results; Google Extract Data & Text; and Excel Import Multiple Google Search Results.

I’m guessing that to use these today one would need to blow the dust off an old Windows Vista PC with something like IE6 or IE7 installed, although the problem might be due to newer versions of Visual Basic runtimes or similar. The utilities don’t run well on Windows XP (I tried one, on an old laptop) because the GUI layouts are truncated in it, vital ‘save’ buttons are unreachable, and the software can’t be re-sized.

Among possible fallback options, none of the Google URL Harvester scripts for Greasemonkey work now. The clunky Outwit Hub still can’t seem to get past Google Search’s URL obfuscation and other clutter, making it fairly useless for the task. SEO software like URLHarvester and Scrapebox doesn’t seem to care about link titles or extracts, just raw URLs and PageRank.

Still working is the basic per-page method of using Multilinks (Firefox) or Linkclump (Chrome), and then my combinatory Excel spreadsheet.

** Update: found a new, free way to do it, that also harvests snippets.

Dynamic Collections

Oxford’s Dynamic Collections is a forthcoming WordPress plugin that seems to still be in private beta, but which sounds interesting. Basically, it harvests OER [Open Educational Resource] records into WordPress from across a range of sources, but filters them by keyword(s). The results, presumably with a bit of hand tweaking, are quickly-built subject lists of such resources.

My guess would be that one could probably do something similar with repository record feeds: use Excel to sort simple CSV records by the presence of keyword(s), then export only the relevant records as CSV, then load these into Omeka.
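That guess can be roughed out without Excel. Here's a minimal Python sketch of the keyword filter, assuming the records arrive as a simple CSV. The field names ("title", "description", "url") and the sample rows are invented for illustration. Getting the filtered file into Omeka is a separate step, not shown.

```python
import csv
import io


def filter_records(rows, keywords, fields=("title", "description")):
    """Return the rows where any keyword appears in any of the named fields."""
    wanted = [k.lower() for k in keywords]
    kept = []
    for row in rows:
        # Join the searchable fields into one lower-cased string.
        haystack = " ".join(row.get(f, "") or "" for f in fields).lower()
        if any(k in haystack for k in wanted):
            kept.append(row)
    return kept


def to_csv(rows, fieldnames):
    """Serialise the kept rows back to CSV text, header included."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up repository records, standing in for a real feed dump.
sample = io.StringIO(
    "title,description,url\n"
    "Film noir lighting,Lecture notes,http://example.org/1\n"
    "Organic chemistry,Lab handbook,http://example.org/2\n"
    "Urban history,Includes a unit on film noir,http://example.org/3\n"
)
reader = csv.DictReader(sample)
matches = filter_records(reader, ["film noir"])
print(to_csv(matches, reader.fieldnames))
```

This is plain substring matching, so it will still need the same hand weeding as any keyword filter.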

Footnotes plugin for WordPress

A nice new footnotes plugin for WordPress. It uses simple square brackets, which must have a number at the start of them. It accepts HTML links inside the brackets. I’d love to see this plugin come as standard with the free WordPress.com -hosted blogs…

To get the smaller font size on the footnotes, paste this CSS into your theme’s styles CSS, probably at the foot of the font section (that worked for me). The plugin doesn’t add this CSS automatically.

New global academic repository search tool

I’ve spent a few hours making a new experimental world academic repository search tool, 2012 version. It searches 2,756 repositories. Enjoy.

How it was made: Lists from ROAR, OpenDOAR, BASE, Open Archives, and repository software providers were all URL-extracted and combined in Excel — then comma delimited on the / and the URL path trimmed back a little (not too far), returned to their normal state, de-duplicated, and the resulting list generally cleaned by hand/eye with the aid of a few very useful Sobolsoft addins for Excel. Any special OAI-PMH query/harvester URLs were excluded.

2,756 clean repository URLs remained. These were topped and tailed to turn them into hyperlinks, and all were put into a rough-and-ready on-the-fly Google Search engine.
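For anyone wanting to repeat the process without Excel, the trim / de-duplicate / hyperlink steps might look something like this in Python. It is only a sketch. I did the real thing by hand and eye, so the "keep one path segment" rule here is an assumed stand-in for "trimmed back a little", and the sample URLs are invented.

```python
from urllib.parse import urlsplit


def trim_url(url, keep_segments=1):
    """Cut a URL back to scheme, host and the first few path segments."""
    parts = urlsplit(url.strip())
    segments = [s for s in parts.path.split("/") if s]
    path = "/".join(segments[:keep_segments])
    # Lower-case the host so the same repository isn't counted twice.
    return "{}://{}/{}".format(parts.scheme, parts.netloc.lower(), path)


def build_links(urls, keep_segments=1):
    """Trim each URL, drop duplicates (keeping first-seen order), wrap as links."""
    seen = set()
    links = []
    for url in urls:
        trimmed = trim_url(url, keep_segments)
        if trimmed not in seen:
            seen.add(trimmed)
            links.append('<a href="{0}">{0}</a>'.format(trimmed))
    return links


# Hypothetical repository URLs, as they might come out of a directory listing.
raw = [
    "http://eprints.example.ac.uk/view/year/2011.html",
    "http://EPRINTS.example.ac.uk/view/subjects/",
    "http://repo.example.edu/dspace/handle/123/456",
]
links = build_links(raw)
print("\n".join(links))
```

How far to trim is a judgement call per repository: too little and you get duplicates, too far and you lose the path the repository actually lives on.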

Drawbacks: Users need to be aware that…

* the search results may sometimes default to being drawn from the main Google database, usually after the first few pages of results.

* you may see Google’s ads with results, if you don’t run AdBlock in your browser.

* the search tool is limited to what the Googlebot is able to “see”

* it may sometimes throw a tantrum and refuse to work.

* it likes quote marks. Idly typing in film noir will get a pitifully small number of results, but “film noir” will get a hefty number.

So it’s not perfect, but… you are able to use useful Google Search modifiers such as filetype:pdf or filetype:doc to discover full-text.
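If you're feeding the tool a batch of searches, the quoting and the filetype: modifier can be added automatically. A tiny Python sketch follows. The helper name `build_query` is my own invention, and it only builds the query text; it doesn't talk to Google.

```python
def build_query(phrase, filetype=None):
    """Quote multi-word phrases and optionally append a filetype: modifier."""
    # Strip any straight or curly quotes the user already typed.
    phrase = phrase.strip().strip('"\u201c\u201d')
    query = '"{}"'.format(phrase) if " " in phrase else phrase
    if filetype:
        query += " filetype:{}".format(filetype.lstrip("."))
    return query


print(build_query("film noir", "pdf"))
```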

Re-usage: If anyone wants to use these URLs for a proper Google Custom Search Engine, feel free. Anyone can make their own 5,000-URL Google CSE, free.

U.S. boom in nonprofit startups is coming, says survey of older folks

A recent survey by Civic Ventures concluded that six million of the USA’s 1960s baby boomers seriously intend, upon retirement, to use their experience to develop new non-profit organisations. And these may not look like the creaky old non-profits that we’ve known until now. They’re likely to be seriously Internet-enabled, and reasonably well funded from private sources. So, here’s a question. Some of the effort will be local (saving stray kitty cats, developing local theatres, creating new woodlands, etc) but how could some of it be directed toward open-access scholarly content? Could structured national programmes be developed to stimulate and guide useful scholarly initiatives by retirees, perhaps based on alumni associations and running alongside things like tax breaks and the promotion of legacies left to help fund open access journals and archive digitisation? And how about your local university gives free library and journals access to any retiree who starts a suitable non-profit, and then invites them all to a free annual TED-like networking event just for them?

Omeka – like WordPress but for creating online academic collections

Omeka: a complete WordPress-like digital collections management system, for academics. It’s free, from the Center for History and New Media at George Mason University. It’s easy to install and use, and has themes, and plugins, and media support, just like WordPress.

Plugins include…

* OAI-PMH repository metadata harvester and CSV import

* Allow users to add a comment and rating to any record. Also add social media buttons.

* Add Library of Congress Subject Headings to your records

* Have your collection records be readable for Zotero users

The Five Stars of Online Journal Articles

David Shotton proposes The Five Stars of Online Journal Articles

“I propose five factors — peer review, open access, enriched content, available datasets and machine-readable metadata — as the Five Stars of Online Journal Articles.”

From a search perspective, I might suggest we need to add another star for “Googlyness”, when all the following factors are present…

* search-engine friendliness (i.e.: make sure the article title shows up as the clickable link in search results, not something like “43w94.taryyt.indd”)

* RSS feeds for linked tables-of-contents

* embedding of the journal title and home URL in each individual PDF or HTML article page (so they can be easily tracked back, after they get casually downloaded to a student’s hard-drive)
