JURN returns


Sticky post: 1st May 2018.

Oops. I set aside all JURN activity for a month to write a book (Tolkien, 180,000 words), and… the jurn.org webspace has vanished. The webspace hosting service was badly hacked a while back, and the account details became disconnected from the credit-card details. The site’s still all there, just made inaccessible by the provider. I’m now considering my options re: switching hosting/domain.

Anyway, while I get it sorted out, JURN is still accessible here:

JURN Search

This is a link to the ‘raw’ CSE page maintained by Google, which of course never goes down. I see it now offers sort-by-date and image-search options, the same ones the fancy front-page offered. The URL isn’t so pretty or easy to remember, but it does the job.

The Directory of 3,000 arts & humanities journals in JURN can be had on this blog as a saved PDF.

And finally, GRAFT, my beta ‘all known repositories’ search-engine is still accessible, again via the Google-hosted version…

GRAFT : repository search, searching across full-text and records alike.

Update: With a UserScript addon you can integrate JURN right into Google Search. For instructions and links, see my blog post: JURN ‘in a UserScript’.

Update: You can also add JURN to your Bookmarks bar as an Itty.bitty link. An Itty.bitty link is the Web page, encapsulated within the bookmark itself.


Hathi’s toolset now runs on all its content

Hathi now offers free public tools that provide…

“access to the text of the complete 16.7-million-item HathiTrust corpus for non-consumptive research, such as data mining and computational analysis, including items protected by copyright.”

Previously the tools could only run over Hathi’s public domain content.

JURN’s re-check at 80%

80% of JURN’s entire URL list has now been re-checked for the continuing presence of the URL path in Google Search. I check the specific URL path being indexed, not just the basic domain (e.g. for ITJ: The Intel Technology Journal, http://www.intel.com/content/www/us/en/research/ rather than http://www.intel.com). Broken URLs are being fixed or deleted as required.
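A related check can be scripted. The sketch below is my own illustration, not JURN’s actual workflow (which checks Google’s index rather than the live site): it confirms that a specific URL path, not just the bare domain, still answers with a working HTTP status.

```python
# Minimal link-rot check for full URL paths (illustrative sketch only).
from urllib.parse import urlsplit
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def path_of(url: str) -> str:
    """Return the path part, to confirm we test more than the bare domain."""
    return urlsplit(url).path or "/"

def is_live(url: str, timeout: float = 10.0) -> bool:
    """True if the specific URL path still answers with a 2xx/3xx status."""
    req = Request(url, method="HEAD", headers={"User-Agent": "link-check"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except HTTPError as e:
        return 200 <= e.code < 400
    except URLError:
        return False
```

Usage would be e.g. `is_live("http://www.intel.com/content/www/us/en/research/")` over the whole URL list, flagging the `False` results for fixing or deletion. Note that a page can be live yet dropped from Google’s index, so this is a complement to an index check, not a replacement.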

Google’s new Dataset Search tool

Google has a new Dataset Search tool. It looks good.

An initial test search for Krita (the open-source paint software) didn’t pick up anything, which suggests the tool really is limited to datasets and is not also sweeping in general file-names from FTP servers.

A wide search for Antarctica Cephalopods then gave a good set of 25 results, all of which were record pages that appeared to place their dataset under CC or to be public domain (NASA etc). There doesn’t appear to be any way to then load a further set of results, or to do a further keyword search within the record-pages of the results.

Tutorial: assemble non-overlapping tiles in Photoshop

How to capture zoomified image tiles and semi-automatically re-assemble them into a single image with Photoshop, even when there is no overlap between the tiles (which rules out Photoshop’s Photomerge feature).

First, make sure your target picture is of an age and status to be in the public domain and can legally be liberated. Also note that Wikimedia Commons has a de-zoomify advice page offering various dezoomifying services and tips. These options may be quicker and more accurate than my method, but if the Wikimedia options don’t work, try this…

1. Install the Save All Images extension for Opera (or an addon with similar functionality that works in your Web browser).

2. Visit your target page. Zoomify the image and pan around until all tiles have loaded. Then capture all the loaded images on the page with ‘Save All Images’. As you can see, it’s quite sophisticated in its filters, though unfortunately you can’t save your settings as a repeatable preset for a particular website…

OK, ‘Save All Images’ then packs all the loaded tiles into a zip file.

3. Extract your saved .zip of images. View the resulting folder as thumbnail images. Delete all images that are not part of the tile set. Rename .jpeg files to .jpg if needed, with Winsome File Renamer or similar. Also rename into alphanumeric order if needed: tiles download in their tiling sequence, so a sort-by-date makes a 1… 2… 3… re-naming possible even if the filenames are obfuscated. You want to end up with a folder of .jpg tiles in a logical alphanumeric loading order. Make a note of how many rows and columns make up the complete image (e.g. three tiles across and four tiles down).
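This renaming step can also be scripted. A minimal sketch (my own, assuming only the tiles remain in the folder; Winsome File Renamer or similar does the same job by hand): it sorts by modification time, which reflects the download order and hence the tiling sequence, then renames to a zero-padded sequence that sorts alphanumerically.

```python
# Rename a folder of downloaded tiles into alphanumeric tiling order.
from pathlib import Path

def number_tiles(folder: str) -> list:
    """Rename .jpg/.jpeg tiles to 001.jpg, 002.jpg, ... by download order.

    Returns (old_name, new_name) pairs in the order applied."""
    tiles = sorted(
        (p for p in Path(folder).iterdir() if p.suffix.lower() in (".jpg", ".jpeg")),
        key=lambda p: p.stat().st_mtime,  # download order = tiling order
    )
    renamed = []
    for i, p in enumerate(tiles, start=1):
        target = p.with_name(f"{i:03d}.jpg")  # zero-padded so it sorts correctly
        p.rename(target)
        renamed.append((p.name, target.name))
    return renamed
```

Zero-padding matters: plain `1.jpg … 10.jpg` would sort as 1, 10, 2 and scramble the grid.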

4. Get Paul Rigott’s Photoshop stitcher script File Stitcher.zip (mirror) and unzip it. This script can handle non-overlapped tiles by using an ‘alphanumeric load-order’ option.

5. Load Photoshop. Do not open a new image. Just go: File | Scripts | Browse and then find and load Paul’s script.

Set your numbers for the tiles across / down, and then point the script at your target folder. The images load and are automatically distributed across a newly opened image, with the script expanding the canvas as needed. As you can see here, the result is not perfect, but 85% of the work has been done automatically. Most tiles have been accurately snapped together into the main image, but a few have been assembled into strips and remain as outliers.
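The automatic placement here boils down to simple grid arithmetic, which is worth seeing spelled out (my own illustration of the general technique, not Paul Rigott’s actual code): tile i in the alphanumeric sequence lands at column i mod cols and row i div cols.

```python
# Where each tile lands on the assembled canvas, in load order.
def tile_offsets(n_tiles: int, cols: int, tile_w: int, tile_h: int):
    """Pixel (x, y) offsets for each tile, reading across then down."""
    return [((i % cols) * tile_w, (i // cols) * tile_h) for i in range(n_tiles)]

# e.g. three tiles across and four down, each tile 256x256 pixels,
# gives a 768 x 1024 final canvas:
offsets = tile_offsets(12, cols=3, tile_w=256, tile_h=256)
```

This is also why the strict alphanumeric load order in step 3 matters: any tile out of sequence lands at the wrong grid offset.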

Just multi-select a few relevant layers (Shift, select with right mouse-click, repeat to add the next layer to the group). Then snap the image together. More recent editions of Photoshop should help with that, if Snap is turned on.

WorldBrain for Chrome

WorldBrain for Chrome : “Full-text search of your Web browsing history and bookmarks. Find previously visited websites & PDFs in seconds.” Works in Opera too, and presumably any browser which supports Chrome extensions and addons.

On install it offered to import my last 90 days of visited URLs from my History, though it fatally ‘hung’ at 2% and couldn’t get past that even after a few hours. However, that 2% was all I needed, since it was working through the URLs in reverse date order and had thus grabbed the last few days. I cancelled and was left with what I actually wanted: not 90 days’ worth of browsing, but just the last few days to start me off.

You can also blacklist sites that don’t need to be cached locally, and Google Maps is blacklisted by default. One very important entry to add before you do anything else is Google and DuckDuckGo searches: re-fetching them all in an automated fashion may get you blocked by those services. Once the initial import is done, you can unblock the main search-engines and they will cache naturally as you browse.

You’ll also want to visit the Privacy settings and make sure everything there is set the way you want it.

It only saves the text, stripped of HTML. Partial searches for the filenames of pictures and .zips therefore presumably won’t work, since those live in the HTML code. One potential problem is that there appears to be no rolling ‘delete page files after 90 days’ setting, so presumably your local cache just grows and grows, which may not be so good for those with over-stuffed hard-drives.

You also get a personal annotation and tagging tool as a discreet sidebar button. This also gives you a way to get to the Search interface, if you don’t want the creepy ‘staring eyes’ WorldBrain icon on your Bookmarks bar.