Venezuelan Scielo – URL change

It appears that the Venezuelan Scielo at www.scielo.org.ve/ is now just ve.scielo.org/. DuckDuckGo still indexes the old URL, which no longer seems to work. I get a pass-through to the Wayback Machine at Archive.org on hitting such dead URLs, though that’s the result of my browser plugin.

Google Search is indexing the new URL only, which works. When Google Search changes, it’s usually a sign that the old URL really is kaput.

It’s probably best to keep an eye on the other Scielo aggregators, to see if they make the same change and thus break older URL paths and Web links. They don’t appear to have made such a change, so far.

My guess is that the .ve change is the result of buying one of those ‘vanity’ fixes which remove the www. from a URL. After some hard-sell from a salesman, such fixes usually turn out to be very expensive to maintain, and after a time the URL often defaults back to normal.

In the meantime JURN is also directly indexing Venezuela’s journals at produccioncientificaluz.org, saber.ucv.ve/ojs/ and erevistas.saber.ula.ve. The nation is starving, but its journals are still online and reachable, for now. The latter two appear to carry different sets of journals, being from the Central University of Venezuela and the University of the Andes respectively.

Yahoo Groups to close

The Yahoo empire continues to crash and burn. The old Yahoo Groups will shut down completely on 15th December 2020. If you had a Group there with content that’s still useful, now’s the time to back it up and upload the .ZIP file to Archive.org in perpetuity. Although I imagine that Archive.org itself may already be ‘on the job’ in that respect.

Working Excel spreadsheet: Take a list of home-page URLs, harvest the HTML, extract a snippet of data from each

I’m pleased to present a free ‘ISSN harvester’ for Excel 2007 or higher.

What you need: You have a long list of home-page URLs, one per line. You want a small snippet of data captured from each HTML page. The target data is not in any kind of repeating HTML table or tag, and could be anywhere on each page.

Usage: A long list of home-page URLs is pasted into the first column. The sheet then checks each URL in turn, and extracts each page’s HTML source into an adjacent cell. A formula in the end column then looks at the captured HTML and extracts the first instance of “ISSN” plus the 70 following characters. Where no result is found, the formula leaves a general label as a placeholder.
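
For the curious, the extraction formula is along these lines (a sketch only: the exact formula in the download may differ slightly, and this assumes the captured HTML lands in column B):

=IFERROR(MID(B2, FIND("ISSN", B2), 74), "no ISSN found on page")

FIND locates the first “ISSN” in the captured HTML, MID then grabs 74 characters from that point (“ISSN” itself plus the 70 that follow), and IFERROR supplies the placeholder label when the page has no “ISSN” at all.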

Download: ISSN-and-data-checker-working.xlsm

Works in Windows with Excel 2007. May require Internet Explorer to be installed. Tested and working fine on an 800+ URL list. For each URL only the loaded page is captured, not the entire website.

It should be adaptable to capture any snippet of data; just vary the formula. Theoretically, you could also add extra columns to capture other data from the same HTML, such as “i s s n” or “eISSN”.


Credit: This is derived and expanded from the free “Bulk URL status checker in Excel sheet”, which checked a list of home-page URLs for 404s and also, rather usefully, extracted each page’s HTML to a cell while it was about it. I would have had no idea how to set up that ‘HTML per cell’ bit without his working example. That spreadsheet was kindly shared on the TechTweaks blog by ‘Conscience’ in April 2017. Here it has been adapted by me to also extract data.

Added to JURN

Journal of L.M. Montgomery Studies (the life and work of the famous author of Anne of Green Gables)

Technical Bulletins of the Williamstown + Atlanta Art Conservation Center

Archiv Orientalni, 1929-2010. Later known as ArOr: Quarterly Journal of African and Asian Studies. Partly indexed by JURN via the URL kramerius.lib.cas.cz/periodical/ — which is not ideal, in terms of either target or Google Search’s current indexing, but is the best that can be done at present.

WIPO Magazine (World Intellectual Property Organisation; JURN was already indexing the WIPO Journal).

Quantitative Science Studies (MIT, “theoretical and empirical research on science and the scientific workforce”)

Philosophy of Medicine


Luminaria (interdisciplinary natural biology, Brazil, latest issues are partly in English)

+

Three more UK repositories, and the per-article record pages of IA Scholar at fatcat.wiki/release/ (the latter fairly poorly indexed by Google Search, at present).

An initial harvest of ISSNs has now been integrated into the openECO directory. The process of adding ISSNs is not yet complete.

Working Excel spreadsheet: Align two lists without fuzzy lookup

Here’s my possibly-useful working Excel list-sorter, made for Excel 2007 and higher.

Situation: You have a long list of items in column A. You’ve copied out this list to run it through a process elsewhere, perhaps in some arcane Windows freeware that is the only thing that can do a particular job for free. This process has added a snippet of wanted new data at the end of each item. Hurrah!

But… possibly the process also discarded some lines, when no new data was found. Or perhaps a ‘helpful’ intern has later added a few lines here and there to the new list. Your new processed list is thus rather awkwardly jumbled up. You can no longer easily align your valuable new data snippets against the old list.

Use: Paste your jumbled and expanded list in Column E, and Column C will automatically sort and auto-align it alongside the original list. No ‘fuzzy lookup’ engine is required.
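
One way to do this kind of alignment with standard lookup formulas, as a sketch only (the workbook’s own formulas may differ; this assumes the original list is in column A, the processed list is in column E, and that the process only ever appended new data to the end of each surviving line):

=IFERROR(INDEX($E:$E, MATCH(A2&"*", $E:$E, 0)), "")

MATCH with the trailing wildcard finds the first entry in column E that begins with the column A text, INDEX returns that whole processed line into column C, and IFERROR leaves a blank where the process discarded the line.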

Download: match_and_sort_without_fuzzy_lookup.xlsx

Two vital Google Search UserScripts, fixed

Newly fixed vital UserScripts for use with Google Search:

Google Search Sidebar

Google Search restore URLs (undo breadcrumbs). This restores readable URL-paths in search results, a vital aid to avoiding the growing amount of spam in Google Search.

Add the following to the top of the Breadcrumbs script, to stop it working on Google Books.


// @exclude http*://www.google.*tbm=bks*
// @exclude http*://www.google.*.*tbm=bks*

Some free tools to extract data from fetched HTML

Here are some relatively simple free Windows desktop tools to ‘extract an item of data from fetched HTML’. They were found while considering if it might be possible to append ISSNs to the JURN directories in a semi-automatic manner.

My target task was: you have a big list of URLs, and the HTML pages for these are to be automatically fetched. Their text is then regex-ed (or the Excel equivalent) to extract a tiny snippet of text data from each page. In this case, any line following the first instance of the word “ISSN” on each home-page. Ideally, each extracted text snippet is then automatically appended to its source URL.
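
For what it’s worth, the ‘line following ISSN’ idea can be expressed as a simple regular expression. A sketch only, and the exact syntax will vary with each tool’s regex flavour:

ISSN[^\r\n]*\r?\n([^\r\n]*)

The capture group holds the line after the first line containing “ISSN”; the Excel equivalent would be built from FIND and MID, as in the harvester spreadsheet above.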

1. Excel Scrape HTML Add-In, free from Analyst Cave. I can’t do anything with it in Excel 2007, so I assume it needs Excel 2016 or higher (2016 was the first version to have Power Query built in, as the ‘Get & Transform’ features).

2. WebExtractor360 1.0. Simple Windows abandonware from 2009, lacking any Help on the key question: how do you format your big list of URLs so that they can be automatically processed? It also looks as though it cannot be limited to just the first-encountered home-page. Still, someone might figure out that bit of the WebExtractor360 puzzle, or pick up the open-source code at SourceForge and develop it for easier batch processing and expanded output options.

3. DEiXTo. Genuine Windows freeware from Greece, for “Web data extraction made easy!” The baffling interface and example-free techie manual strongly suggest otherwise though, and you’ll likely need to read the manual very carefully to get it working. There’s also a 2013 academic paper on DEiXTo from the authors.

4. Update: the open-source freeware Web-Harvest 2.x from 2010, written in Java with a clean Windows GUI and a good manual. It seems like a good alternative to DEiXTo. It still works and has many examples and templates, but there is no template to run through a list of URLs and grab a fragment of data from each home-page. Despite the name, it’s a data extractor, not a site harvester.

5. Update: I made one for Excel 2007, and it’s free. Take a list of home-page URLs, harvest the HTML, extract a snippet of data from each.

For paid Windows desktop software that doesn’t require a PhD in Spreadsheet Wrangling, and which indeed assumes you’re not working in Excel, look at Sobolsoft’s $20 Extract Data & Text From Multiple Web Sites Software and BotSol’s Web Extractor. The first, from Sobolsoft, requires Internet Explorer, and you have to delve into two of IE’s settings to stop it freaking out with process-stopping alerts every time it meets a Twitter button etc. Its search is not ideal, as it cannot be limited to just the first-encountered ‘home’ page. Its output is not ideal either, as it cannot offer “Source URL = no result” as a line in the results. The latter software, from BotSol, has the great advantage that it can limit itself to the home-page, and will also try another two nearby pages (“About” etc.) if it can’t find the target data on the home-page. It’s designed to extract phone numbers, but can be configured to grab anything. A free version processes a list of 10 URLs at a time, and an ‘unlimited URLs’ version is $50 (though regrettably time-bombed).

There are browser-based tools like the long-standing OutWit Hub, and newer free cloud services such as Octoparse, but they appear focused on ripping competitors’ e-commerce listings and plugging them into your boss’s database. Also, Octoparse’s “List of URLs” feature apparently requires all the pages to have exactly the same HTML elements.

Subject to change

“Subject indexing in humanities: a comparison between a local university repository and an international bibliographic service”, Journal of Documentation, May 2020.

… the use of subject index terms in humanities journal articles [is] not supported in either the world’s largest commercial abstract and citation database Scopus or the local repository of a public university in Sweden. The indexing policies in the two services do not seem to address the needs of humanities scholars for highly granular subject index terms with appropriate facets; no controlled vocabularies for any humanities discipline are used whatsoever.

A robust fix for reaching the Classic Editor, for free WordPress.com blogs

I’m pleased to see that the vital WordPress.com edit post redirects UserScript has been updated, and now handles the changed arrangements at the free WordPress.com blogs. It’s working fine for all functions (start a new post, edit a post from the side-link on an existing post, edit a post from the wp-admin list, etc.). It briskly takes you and your post to the Classic Editor, rather than to the awful Block editor.

I had coded a working Lua script for the StrokesPlus mouse-gestures freeware, as a workaround for the problem, but it’s now no longer needed. Here it is anyway, for what it’s worth…



-- A LUA SCRIPT for a STROKESPLUS mouse-gesture.
-- TITLE: Auto-load the Classic Editor at WordPress.com
-- DATE: October 2020.
--
-- Your Web browser is at ../wp-admin/edit.php and you do the mouse gesture.
-- First the script pauses, to ensure wp-admin has time to fully load itself
acDelay(1500)
-- select and copy the current browser URL
acActivateWindow(nil, gex, gey)
acSendKeys("^l{DELAY 100}^c")
url=acGetClipboardText()
-- process the browser URL, trimming it back
new_url=string.gsub(url,"(.+)/.+/?","%1")
acSetClipboardText(new_url)
-- load the new trimmed URL in the browser
acSendKeys("^v{DELAY 100}{ENTER}")
-- re-use the clipboard, which still holds the trimmed URL now loaded in the browser
url2=acGetClipboardText()
-- append the posting URL and thus effectively go to New Post
new_url2=string.gsub(url2,".+/?","%1/post-new.php")
acSetClipboardText(new_url2)
acSendKeys("^v{DELAY 100}{ENTER}")
-- delay 7.5 seconds to allow the sluggish Block editor to load
acDelay(7500)
-- type the word draft in the post title, and Ctrl + S to save as a Draft post
acSendKeys("draft")
acSendKeys("^s")
-- pause 3 seconds for WordPress to switch to the new numbered URL
acDelay(3000)
acActivateWindow(nil, gex, gey)
-- copy this new URL to the clipboard
acSendKeys("^l{DELAY 100}^c")
url3=acGetClipboardText()
-- append the vital &classic-editor slug to the end of the URL
new_url3=string.gsub(url3,".+/?","%1&classic-editor")
acSetClipboardText(new_url3)
-- take the Draft post into the Classic Editor and finish.
acSendKeys("^v{DELAY 500}{ENTER}")


And to handle the additional “Edit” side-link on posts, you’d use a second Lua script with its core being…

-- look at the current URL, keep only the post number
new_url=string.gsub(url,"[^0-9]","")

… then prepend and append the required URL structure around the post number, to get a working URL back again, then load that URL.


Will either of these solutions last beyond 2021? Perhaps not, as I suspect the Classic Editor will then be killed off totally, as previously announced for that date, rather than just effectively hidden from the mass of users. As such it’s probably best to start learning the free Open Live Writer and try to use free WordPress.com blogs that way. That assumes, however, that in 2021 WordPress.com doesn’t also block offline editing with such blogging software.