JURN returns


Sticky post: 1st May 2018.

Ooops. I left off all JURN activity for a month, to write a book (Tolkien, 180,000 words), and… the jurn.org webspace has vanished. The webspace hosting service got badly hacked, a while back, and the account details became disconnected from the credit-card details. The site’s still all there, just made inaccessible by the provider. I’m now considering my options, re: switching hosting/domain.

Anyway, while I get it sorted out, JURN is still accessible here:

JURN Search

This is a link to the ‘raw’ CSE page which is maintained by Google, and of course it never goes down. I see that it now offers the options for sort-by-date and image-search, which the fancy front-page was able to offer. It’s not so pretty, and the URL is not easy to remember, but it does the job.

The Directory of 3,000 arts & humanities journals in JURN can be had on this blog as a saved PDF.

And finally, GRAFT, my beta ‘all known repositories’ search-engine is still accessible, again via the Google-hosted version…

GRAFT : repository search, searching across full-text and records alike.

Update: With a UserScript addon you can integrate JURN right into Google Search. For instructions and links, see my blog post: JURN ‘in a UserScript’.

Update: You can also add JURN to your Bookmarks bar as an Itty.bitty link. An Itty.bitty link is the Web page, encapsulated within the bookmark itself.


Error rates for Google Scholar citation parsing

Another new prodding of Google Scholar, this time from the latest First Monday: “Testing Google Scholar bibliographic data: Estimating error rates for Google Scholar citation parsing”.

While data quality is good for journal articles and conference proceedings, books and edited collections are often wrongly described or have incomplete data. We identify a particular problem with material from online repositories [where there appears to be] considerable inhomogeneity in the implementation of data standards [and] a mismatch between repository software and the harvesting protocols employed by Google Scholar.

One of Scholar’s other problems is that it includes Google Books results. While 30% of the time its Google Books inclusions can be useful, there is no way to exclude Books results. One might want to exclude them because Scholar still can’t seem to distinguish a proper book from a robot-produced shovelware ebook that assembles public-domain content. Scholar has no ‘edition authority’ which states that the Joshi-edited and annotated Penguin Classics edition of H.P. Lovecraft’s “Dexter Ward” is the gold-standard, with a text that has been fully corrected of the many textual errors, omissions and editing mistakes of previous decades, unlike the public-domain shovelware ebooks that flood Amazon and (often) Google Books.

A basic undergraduate level search, for instance, for Lovecraft “Dexter Ward”, demonstrates the problem on the first page. Joshi is nowhere to be seen, and the searcher is hammered by links to shovelware ebooks (or worse), often with citation counts that suggest they are legitimate.

Added to JURN

Textile & Leather Review

Journal of Danubian Studies and Research (Regional Studies – The Danube)

Space Settlement Journal (National Space Society)

Digital Narratives of Asia (oral history / interview series)

Folia Orientalia (linguistics)

Linguistica Silesiana (linguistics)

Rocznik Orientalistyczny / Yearbook of Oriental Studies

Biodiversidade Brasileira (Brazilian nature conservation. Partially in English)

Revista CEPSUL – Biodiversidade e Conservacao Marinha (Marine nature conservation, Brazil. Partially in English)

Google Scholar at 389 million

Michael Gusenbauer, “Google Scholar to overshadow them all? Comparing the sizes of 12 academic search engines and bibliographic databases”, Scientometrics, November 2018.

The findings provide first-time size estimates of ProQuest and EBSCOHost and indicate that Google Scholar’s size might have been underestimated so far by more than 50%. By our estimation Google Scholar, with 389 million records, is currently the most comprehensive academic search engine.

With the later proviso that there are likely to be many duplicates and near-duplicates, with such tools reporting…

the number of all indexed records on a database, not the number of unique records indexed. This means duplicates, incorrect links, or incorrectly indexed records are all included in the size metrics provided by ASEBDs.

As you can see, the article coins the ugly and unreadable “ASEBDs” for “academic search engines and bibliographic databases”. MASTs might be more mellifluous — Massive Academic Search Tools.

Block autosuggestions from the Google Scholar search box

For those who know what they’re looking for, and how to type… here’s how to block the dumb auto-suggestions from appearing on the Google Scholar search-box:

1. In the UBlock Origin Web browser addon, open the My Filters list (Go: Icon | Slider Controls Icon | My filters tab).

2. Paste in the line…


3. Save the List and exit it. Reload Google Scholar, and the flickery and distracting (and almost always very wrong) drop-down suggestions are gone.

A variant of the above ‘block line’ will probably also work in similarly advanced ad-blocker addons.

News you can lose…

It looks like stories from U.S. news outlets, those that blank UK and European visitors, are now simply being removed from the Google News results. Spotted today, under the Google News results…

Annoying for those inclined to turn on their VPN and see the news story regardless. But, in practice, the publications still blocking overseas visitors are such low-grade regional newspapers that it’s no loss.

Clips and flicks

All U.S. film-makers can now crack anti-copying technologies on content ($ paywalled at law.com), if they need that content for ‘fair use’ use in a new production…

“Digital Millennium Copyright Act (DMCA) exemptions aren’t just for documentary filmmakers any more. The U.S. Copyright Office and Library of Congress last week broadened a DMCA exception to now allow more filmmakers to circumvent anti-copying technology and rip short video clips for purposes of commentary and criticism.”

However, it isn’t a free-for-all. Note that the PDF for the rules states that this new measure is specifically for…

“where the clip is used for parody or its biographical or historically significant nature”.

In a drama movie, the “commentary and criticism” would thus presumably be seen to be implied by the nature of the scene, rather than done in a directly academic or journalistic manner. For instance, I can imagine a dramatised scene of dancing on the beach as the Apollo 11 rocket lifts off behind the dancers. This scene would be a sort of implied commentary on the optimism engendered in the nation by the historically significant moment of sending men to the Moon. And if the high-res source needed for that was only available from Time-Life rather than NASA, then their Blu-ray disc could be cracked and a clip used as the background in the composite. Actually these days it’s probably easier to do it with 3D models and a copy of Vue, but some may want the original footage, and historical personages can’t simply be conjured up in the same way.

Also, as the word “clip” is used and video is assumed in the PDF’s text, that leaves hazy the cracking of content protection to obtain a high-res still picture. A film-maker might need such a still for a Ken Burns “pan and scan” type film, and could perhaps argue that the still was required as an irreplaceable source needed to make the film’s video “clip”. But that’s probably something to be clarified in a future round of rule changes.

Practical blog search at the end of 2018

It’s the back end of 2018 and there’s still no really useful and comprehensive search tool for recent blog posts, other than the main Google Search. And even that is iffy. Given that we’re approaching Halloween, I decided to do a quick group test with the simple keyword Lovecraft. He’s a good choice because so much utter trash floods onto the Web in his name. If a search can deal with Lovecraft, it should be able to handle much else.

* Google News: Can filter by ‘blogs’ and by ‘date’, but the results are laughable — are there really only eight blog posts on Lovecraft in October 2018, from worthy long-form and timely-news bloggers? I think not.

* Google Search: The inblogtitle:keyword modifier is no longer useful in search, as it now returns only 10 irrelevant results when used with Lovecraft. One used to be able to find sites that Google ‘knew’ were blogs, and had a keyword in their main blog title. Google Search has also removed ?tbm=blg from their URL options.

* WordPress.com internal cross-blog search: Simple to use, the results look pretty, but it obviously has very mediocre coverage of its own blogs. Many expected and well-respected blogs do not appear at all. Users need to be aware that they are not seeing results from the entire range of non-spam WordPress.com hosted blogs.

I would suspect that DuckDuckGo may be using this WordPress.com results set as a de facto anti-spam whitelist, since that would explain its curious big gaps in the coverage for WordPress.com blogs. The same may be true of the dismal Bing — the only saving grace for which is the excellence of the Bing News | Most Recent results, which you can RSS-ise by adding &format=rss to the URL. By comparison, NewsNow is nowhere.

* You Got Blogs, a Google CSE: Fairly good at pulling the top three currently-active blogs to the top of the results, but thereafter turns to mush. If the user then sorts by date on a single keyword, the results are far less useful, mainly because You Got Blogs is indexing all *.wordpress.com/* pages rather than just the blog posts via *.wordpress.com/20*/*. You Got Blogs is reliant on Google Search, since it’s a CSE, and thus for many blogs Google will only show the most recently-indexed post or else just the front page (e.g. you make seven posts a week, but Google will only show searchers the post it has most recently indexed, and the others will be un-findable). It’s thus an impossible balancing act for You Got Blogs (or any other blog-focussed CSE): if they don’t do a global index of *.wordpress.com/* then they miss a whole lot of results.

* Social Mention. Search restricted just to ‘Blogs’. Pathetic results from ‘Blogs’. No results at all, for ‘Microblogs’. Top three results were very similar to the WordPress.com internal, then a huge gap in time. My guess is they’re blending together the WordPress.com and Bing APIs, and to no great effect.

* DuckDuckGo: Should, theoretically, be good. But is mediocre. It all-but ignores key Lovecraft blogs, blogs which rank very highly in Google Search. I should note that the Duck is excellent in many other respects, especially the relevance of its Image Search. But it still lacks breadth and depth.

* Instant RSS Search Engine. No longer appears to work, even when tested in multiple browsers.

For niche news gatherers wishing to supplement their RSS feedreader and break out of the tiny-minded Twitterbubble, the best option at the end of 2018 is thus to set up a bookmarks folder in your Web browser with the following:

site:wordpress.com/2018/10/ “Lovecraft” -zombie -game -movie
site:blogspot.com/2018/10/ “Lovecraft” -zombie -game -movie

Vary according to your desired keyword and knockout words, obviously. These URLs will work because all blog posts on Blogger and WordPress have the date embedded in their URL.
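As a rough illustration of that date-in-URL point, here is a minimal Python sketch that tells a dated post URL from a front page or an ‘about’ page. The URL shapes in the comments are assumed typical forms for WordPress.com and Blogger, and the name post_month is just a hypothetical helper for this example.

```python
import re

# Assumed URL shapes (illustrative, not exhaustive):
#   WordPress.com: https://example.wordpress.com/2018/10/15/post-title/
#   Blogger:       https://example.blogspot.com/2018/10/post-title.html
POST_URL = re.compile(
    r"^https?://[\w.-]+\.(?:wordpress|blogspot)\.com/(\d{4})/(\d{2})/\S+"
)

def post_month(url):
    """Return (year, month) if the URL looks like a dated blog post, else None."""
    match = POST_URL.match(url)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A front page or static page has no /YYYY/MM/ segment and so returns None, which is the same distinction the site: queries above rely on.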

These bookmarks should be set to run on Google Search and DuckDuckGo and Yandex (the latter with a &lang=en English only filter in the URL). Right-click on the finished Bookmarks folder, select “Open All” and they all load.

Of course, this doesn’t pick up self-hosted blogs, only the free ones. And, obviously you’ll have to manually go in and incrementally change the date numbering in the target URLs, at the end of each month. Thus it’s not a perfect solution.
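The monthly hand-editing could be avoided with a small script that rebuilds the search URLs for the current month. What follows is only a sketch in Python: the Yandex lang=en filter is the one mentioned above, but the three base search URLs are assumptions about how each engine currently accepts a query, and the function names are made up for the example.

```python
from datetime import date
from urllib.parse import quote_plus

# Search URL templates. The Yandex "lang=en" filter comes from the post;
# the base URLs are assumptions about each engine's query format.
ENGINES = {
    "google": "https://www.google.com/search?q={q}",
    "duckduckgo": "https://duckduckgo.com/?q={q}",
    "yandex": "https://yandex.com/search/?text={q}&lang=en",
}
HOSTS = ("wordpress.com", "blogspot.com")

def build_query(host, keyword, knockouts, when):
    """Build one 'site:host/YYYY/MM/ "keyword" -word ...' query string."""
    minus = " ".join(f"-{word}" for word in knockouts)
    return f'site:{host}/{when:%Y/%m}/ "{keyword}" {minus}'.strip()

def monthly_search_urls(keyword, knockouts, when=None):
    """Return every host/engine search URL for the given month (default: today)."""
    when = when or date.today()
    urls = []
    for host in HOSTS:
        query = quote_plus(build_query(host, keyword, knockouts, when))
        for template in ENGINES.values():
            urls.append(template.format(q=query))
    return urls
```

Calling monthly_search_urls("Lovecraft", ["zombie", "game", "movie"]) gives six URLs for the current month, and each could then be handed to Python’s webbrowser.open_new_tab to mimic the “Open All” step.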

Once the searches have loaded, switching through to a “week” or “24 hour” view will require the copious use of Google Hit Hider by Domain, to weed the spam and unwanted results. Google Hit Hider knocks out unwanted domains from search results, and does it very well. (Google Hit Hider can run on Yandex, it just needs the results reloaded, in order for its blocking buttons to appear).

Even having set up such a Bookmarks folder, we also still have the problem of Google Search sometimes only offering the front page of a timely and frequently updated blog, rather than its most recent post URLs. In practice though, for a ‘last 24 hours’ search, you don’t actually need a site: modifier…

site:wordpress.com “Lovecraft” -zombie -game -movie

All you need is the ‘last 24 hours’ filter alone, and Google Search will lift the best content into the first two pages of results. Kind of useful, as it can thus catch self-hosted blogs, albeit jumbled among legacy news sources and updating catalog sites etc. Even so, you’ll want Google Hit Hider when working at the 24 hour level.

Also useful, inside your new folder, will be a similarly hard-coded Google Images search URL for the last 24 hours or week…

“keyword” -pinterest -youtube -reddit -twitter -wikipedia -tumblr -instagram

… and so on. It only takes a few seconds to visually check the results, and such timely visual results are often useful re: new books, conference posters etc. Keep eBay listings in the mix as they can suggest interesting blog post topics, about old vintage stuff. Again, we’re not keying the search to blogs only, and thus Google Hit Hider is your friend here (it also works on Google Images results – block on Google Search, and it’s also blocked on Images).
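For anyone who would rather generate that hard-coded Images URL than build it by hand, here is a hedged Python sketch. It assumes that tbm=isch selects Google Images and that tbs=qdr:d or tbs=qdr:w applies the last-day or last-week filter. These parameters are not formally documented by Google and may change, much as ?tbm=blg was removed. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Knockout sites taken from the query in the post.
KNOCKOUTS = ("pinterest", "youtube", "reddit", "twitter",
             "wikipedia", "tumblr", "instagram")

def recent_images_url(keyword, period="d", knockouts=KNOCKOUTS):
    """Build a Google Images URL limited to the last day ('d') or week ('w').

    Assumption: 'tbm=isch' selects image search and 'tbs=qdr:<period>'
    applies the recency filter; Google may change either without notice.
    """
    if period not in ("d", "w"):
        raise ValueError("period must be 'd' or 'w'")
    minus = " ".join(f"-{word}" for word in knockouts)
    query = f'"{keyword}" {minus}'
    params = {"q": query, "tbm": "isch", "tbs": f"qdr:{period}"}
    return "https://www.google.com/search?" + urlencode(params)
```

The result of recent_images_url("Lovecraft", "w") can be saved straight into the Bookmarks folder alongside the other searches.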

There are of course also a whole bunch of “request a demo” agency services which claim to offer social media sentiment tracking. They seem to be of the ‘if you have to ask the price, you can’t afford it’ sort. There’s one free and public service worth a look, Social Searcher. Very slow to load a search, but it’s pretty and it works. It’s no use for blogs, but seems useful if you want to quickly glance across recent Facebook and Twitter posts. It covers some other ephemeral sharing sites, but their signal gets swamped by Facebook and Twitter. Not that that matters much as it’s almost all blather and parroting, of no news value.

Text Cleanup 2.0 – now free

I’m pleased to see that Text Cleanup 2.0 is now freeware. It’s Windows desktop software from 2003 that “fixes” text automatically when you copy-paste it. For instance, by unwrapping a chunk of text that has hard line-breaks. Text Cleanup has a nice balance of power and ease-of-use, can save user presets, and still runs fine on a Windows 8.x desktop.
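To show what “unwrapping” means in practice, here is a tiny Python sketch of the general idea. This is not how Text Cleanup itself works internally (that is unknown to me); it simply assumes that blank lines mark paragraph breaks and that every other line-break is a hard wrap to be joined.

```python
import re

def unwrap(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    unwrapped = [
        " ".join(line.strip() for line in paragraph.splitlines())
        for paragraph in paragraphs
    ]
    return "\n\n".join(unwrapped)
```

Each paragraph comes back as one long line, with a single blank line kept between paragraphs.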