News from JURN

~ search tool for open access content

Category Archives: Regex

Use Search Regex to delete part of a URL, in WordPress

13 Monday Feb 2023

Posted by David Haden in JURN tips and tricks, Regex

Problem: You need a working regex to delete an /unwanted section/ of a URL, on a self-hosted installation of WordPress. The /unwanted section/ is different in each URL, and you have hundreds or even thousands of such URLs to deal with.

Solution: If your /unwanted section/ string is always preceded by the same repeating section of the URL, then you’re in luck. The popular free WordPress plugin ‘Search Regex’ and this small regex will do it. The regex is seen here working as expected in the RegularExpressions101 sandbox…

What it’s doing: The repeating URLpart2 in the URL path is found and marked, along with its following / slash. Whatever follows it is also marked for deletion, even if that is different in every URL. The ‘marking for deletion’ stops at the next / in the URL.

In WordPress: Here’s how it’s applied in Search Regex in WordPress. Because this example is also deleting the repeating URLpart2/, that gets added back via the “Replace” box.
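The regex itself appears only in the screenshots, so it is not quoted here. As a rough sketch of the same idea in Python, assuming a pattern of "the repeating part, a slash, then everything up to and including the next slash" (the pattern, the function name and the example URL are all illustrative, not the ones used in the post):

```python
import re

# Assumption: "URLpart2" stands in for whatever path section repeats in
# every URL. The pattern marks that section, its slash, and the varying
# section after it, stopping at the next slash.
UNWANTED = re.compile(r"URLpart2/[^/]+/")


def strip_unwanted_section(url: str) -> str:
    """Delete the varying section that follows URLpart2/.

    The match also swallows URLpart2/ itself, so the replacement
    text puts that part back, as the "Replace" box does.
    """
    return UNWANTED.sub("URLpart2/", url, count=1)


if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(strip_unwanted_section(
        "https://example.com/URLpart1/URLpart2/abc-123/page.html"
    ))
```

URLs that do not contain the repeating part are returned unchanged, which is the behaviour you would want to confirm in the Search Regex previews too.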

Warnings: Always preview first, using “Search”. Only if you are absolutely happy with how the URL now looks in the previews, should you then press “Replace All”. Always make a backup of the WordPress install before attempting such changes.

Put a Python in your Web browser

04 Wednesday May 2022

Posted by David Haden in Regex, Spotted in the news

Embed Python scripts in HTML with PyScript…

embed Python programs directly in HTML pages and execute them within the browser without any server-based requirements

What could possibly go wrong?

New category for posts: ‘Regex’

27 Wednesday Apr 2022

Posted by David Haden in My general observations, Regex

A new category for posts on this blog: Regex. I’ve gone back and tagged old posts with it.

Here’s a regex to handle deleting image thumbnail suffixes in WordPress

12 Tuesday Apr 2022

Posted by David Haden in JURN tips and tricks, Regex

When placing an image in a blog post, on a server installation of WordPress you often get something like this code…

The main image is only linked to, and an ersatz auto-generated thumbnail is what’s shown on the page. One may wish, for various reasons, to change an .XML archive of a WordPress blog so as to only have the filename. If one could remove the suffix seen here in yellow…

… then the blog post will only need to call the original image. The HTML code will take care of the resizing on the blog post.

Such snipping is useful if you only archived the .XML and the original images. You may no longer have or never harvested the multifarious thumbnails that were auto-generated by your server installation of WordPress.

Such filename snipping can be done, and with a simple regex formula, in Notepad++. But first… check what images you have in your local archive for the blog, since it’s just possible that the harvesting collected the thumbnails and not the originals.

Definitely have the originals, with no suffixes? Ok, then let’s proceed. Thankfully all the added filename suffixes have certain repeating elements, even if they have differing pixel dimensions. Thus they can be handled by a regex.

In Notepad++ the following working and tested regex will search and delete the thumbnail extensions, even if they all have different pixel dimensions:

FIND: \|*-([0-9]+)x([0-9]+)\.jpg\|*

REPLACE: .jpg

In plain English: find any - that is followed by something of the general form numberxnumber and then .jpg, and delete all of it, including the - and the .jpg. Replace the deleted string with .jpg and then repeat the process down through the whole .XML file.

Possibly regex gurus will shriek and swoon at my formulation seen above, but… it works for me.
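For anyone working outside Notepad++, the same find-and-replace can be sketched in Python. This is a minimal version under a couple of assumptions: the function name and the sample line are made up for illustration, and the sketch leaves out the \|* at each end of the FIND pattern, so any | characters sitting next to a suffix would be left in place.

```python
import re

# A hyphen, number x number, then .jpg (e.g. "-500x350.jpg").
THUMB_SUFFIX = re.compile(r"-\d+x\d+\.jpg")


def strip_thumbnail_suffixes(xml_text: str) -> str:
    """Replace every -WIDTHxHEIGHT.jpg suffix with a plain .jpg."""
    return THUMB_SUFFIX.sub(".jpg", xml_text)


if __name__ == "__main__":
    # Hypothetical line from an export, for illustration only.
    sample = '<img src="fluffybunny-500x350.jpg" width="500" />'
    print(strip_thumbnail_suffixes(sample))
```

Filenames that merely contain a hyphen, such as my-photo.jpg, are not touched, because the pattern insists on digits either side of the x.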


If you want this regex to run automatically on any post you make on WordPress, as you blog, then the Regex Replace extension for the Chrome browser will do the job. It automatically removes the suffix after you “Update” (i.e. save) the post, not when you first insert the image. Thus you’ll need to press “Update” twice on your blog post.

Tested and working. You probably also want to tell Regex Replace to run only on your target blog or website.

You don’t usually need this on free WordPress.com blogs, because they do things differently when placing an image on a blog post.

If you have a fixed width on your blog posts, you can also prevent WordPress from generating unwanted thumbnails thus…

… here I have gone to Settings | Media in the Dashboard, and set the sizes to 0 to prevent all except “medium” from being generated. The “medium” width is set to the width of the blog post, so I get the HTML sizing code, but the regex ensures all my posts both display and link to the source image. Obviously in this use-case you’ll try not to inflict file sizes above 500kb on your readers.

How to search-replace all image paths in a .XML WordPress export

11 Monday Apr 2022

Posted by David Haden in JURN tips and tricks, Regex

Here’s how to search-replace all image paths in a .XML WordPress blog export, using the freeware Notepad++.

The idea here is to point them all to a folder of /oldimages/ with your cache of old blog images in it. Otherwise, if there’s no live blog for the new WordPress install to go and fetch them from, they won’t show on the blog posts. So you ideally want all your images pointing to mysite/blog/oldimages and that’s where you upload all your archived blog images.

The \d+ bit matches any run of digits, so it stands for ‘any date number’, e.g. the 01 and the 02 in /01/02/. It’s a sort of ‘wildcard for dates’. This does not work inside of WordPress, for instance when you have a .XML already imported and a regex plugin to call on. It appears to be Notepad++ specific.
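The \d+ shorthand is also understood by Python’s re module, so the same path swap can be sketched there. The Notepad++ search string is not reproduced in this text, so the pattern below is an assumption: it supposes the old images sat in dated /year/month/ folders under a single base address. OLD_BASE is a hypothetical address and NEW_BASE is simply the mysite/blog/oldimages target written out as a URL.

```python
import re

# Assumptions for illustration: a made-up old address, and the new
# /oldimages/ folder mentioned in the post.
OLD_BASE = "https://oldblog.example/wp-content/uploads/"
NEW_BASE = "https://mysite/blog/oldimages/"

# The old base, followed by two dated folders such as 2015/07/.
DATED_PATH = re.compile(re.escape(OLD_BASE) + r"\d+/\d+/")


def repoint_images(xml_text: str) -> str:
    """Point every dated image path at the single /oldimages/ folder."""
    return DATED_PATH.sub(NEW_BASE, xml_text)


if __name__ == "__main__":
    sample = 'src="https://oldblog.example/wp-content/uploads/2015/07/fluffybunny.jpg"'
    print(repoint_images(sample))
```

Only the folder part of each path is rewritten. The filenames are left alone, which is why the method depends on the archived images still having their original names.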

This method assumes you haven’t been letting WordPress rename your image files by size, e.g. you upload fluffybunny.jpg and WordPress inserts it as a shrunken fluffybunny-500x350.jpg, along with a link to the main and properly named image file. If it has been doing that, then deeper surgery is required.

With a free WordPress blog you may not have this problem. For instance, the above image is placed in the post in shrunken form. But this is done by a bit of code that gets appended to the file path…

… and the image itself is not duplicated, shrunken and renamed. Even when, as a 585 pixel wide image, it only appears in the blog post at 529 pixels.


Also useful here is the old Windows freeware WXRsplitter, which will split a single WordPress .XML archive into smaller but valid chunks. Often, a large single .XML (known to WordPress as a WXR file) will not upload once it gets beyond about 18MB or so. The freeware still works fine. Once the .XML is chunked, you just upload and import each piece in the numbered sequence.

Freeware to convert a WordPress blog .XML export to Word

27 Sunday Mar 2022

Posted by David Haden in JURN tips and tricks, Regex

≈ 2 Comments

Hurrah. Desktop software has been found that will robustly convert all of a WordPress blog .XML export to Word. Specifically, a free blog hosted at the .com WordPress, which doesn’t permit or offer any fancy ebook conversion plugins. The solution is good old Windows freeware, as usual. Though the various freeware directories know nothing of it, and it was only found after hours of digging.

The XML to Doc software is the snappily titled wpxslgui…

wpxslgui is a Windows application which converts an XML File generated by the WordPress Export function into an HTML or Word HTML document.

It’s Windows freeware in v.1.04 (June 2020) from Devio IT Services, being the worthy and generous Herbert Oppolzer of Austria. Tested and working here. Very simple usage, with a Windows GUI.

Also includes the option to… “Convert WordPress XML to a single HTML file allowing filter by category (JavaScript)” but the “Word HTML” output saves as a .DOC file.

However if you cleverly just re-name the .DOC to .HTML it then works fine as a Web page in a Web browser, and thus calls in the images. I’m assuming here that your browser is allowed to go online, but Word is not.

The overall aim here is to get the blog to a clean ebook format for Kindle, removing as much gunky code and blog-cruft as possible. Archivists may also be interested here. As such there are some initial changes you may want to make…

1). First you may want to delete the superfluous 24-hour timestamp by editing the WordPress .XML output itself. What you’re targeting looks like this…

Since timestamps have a distinctive pattern, a regex can deal with them. The regex to delete all timestamps in Notepad++ is…

 \d{2}:\d{2}:\d{2}

…and note the single space at the start of the regex.

Running this removes all timestamps but leaves the date intact. Possibly a regex could also re-work the main date to something nicer (e.g. 12th July 2015), but it would likely be a very complex regex.
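Outside Notepad++, both steps can be sketched in Python: the same timestamp regex, plus a replacement function that takes on the "nicer date" idea without needing one very complex regex. This assumes the dates left behind look like 2015-07-12 (year-month-day), which is not confirmed here since the example is not reproduced in this text. All the function names are made up for illustration.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Same regex as above, including the single leading space.
TIMESTAMP = re.compile(r" \d{2}:\d{2}:\d{2}")
# Assumption: remaining dates are in YYYY-MM-DD form.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def strip_timestamps(text: str) -> str:
    """Remove every ' HH:MM:SS' but leave the date intact."""
    return TIMESTAMP.sub("", text)


def ordinal(day: int) -> str:
    """Turn 1 into '1st', 12 into '12th', 22 into '22nd', and so on."""
    if 10 <= day % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def _pretty(match: re.Match) -> str:
    year, month, day = (int(part) for part in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return match.group(0)  # not a plausible date, leave it alone
    return f"{ordinal(day)} {MONTHS[month - 1]} {year}"


def prettify_dates(text: str) -> str:
    """Rewrite YYYY-MM-DD dates as e.g. '12th July 2015'."""
    return ISO_DATE.sub(_pretty, text)


if __name__ == "__main__":
    print(prettify_dates(strip_timestamps("2015-07-12 14:05:33")))
```

The date rewriting is only suitable for a copy destined for ebook conversion, since changing the date fields could stop the .XML from importing back into WordPress properly.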

2). After conversion wpxslgui adds italics on post titles. The CSS stylesheet is embedded at the top of the output HTML, and so changing italics is just a matter of tweaking font-style:italic on H2. Bold might be better.

3). While you’re in the CSS you may want to have external links be something other than blue and purple. In which case edit the colours of a:link and a:visited in the header CSS.

4). wpxslgui adds numbering of posts, in front of each post title. The code spans multiple lines and looks like this…

Awkward, but not impossible for a regex to fix. This Search-Replace regex for Notepad++ will replace all of these with a tilde or whatever other elegant typographer’s HTML mark you might want.

5). Once you have output in semi-cleaned HTML, Ctrl + A should “Select all…” from the browser and then you paste to whatever you’re using. If you were keen on YouTube embeds you will then need to manually go through and delete the WordPress code for these. The WordPress YouTube ‘slug’ insert is presented raw in the post. Without a WordPress installation the code can’t call the video. The same will likely be true for any other fancy embeds of maps, charts, podcasts etc.

Finally, note also that wpxslgui only deals with posts, not pages.

One-click to remove a verbose site from Google

11 Thursday Mar 2021

Posted by David Haden in How to improve academic search, JURN tips and tricks, Regex

One-click to remove a verbose site from Google Search results, a new UserScript. Preset for Wikipedia, but the URL can be easily changed to be any verbose website. It should ideally be a website that you usually want to remove from search results, but sometimes want to keep. The script is thus more flexible than a regular list-based site blocker.

It works by re-running the current search, but only an instant after some regex has cunningly inserted the command    into the URL.
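The UserScript itself is not reproduced here, and the exact command it inserts has not survived in this text. Purely as an illustration of the general technique, here is a Python sketch that assumes the standard -site: exclusion operator is appended to the q parameter of the search URL. The function name, the default site and the example URL are all assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def exclude_site(search_url: str, site: str = "wikipedia.org") -> str:
    """Return the search URL with '-site:<site>' added to the query.

    Assumption: the search terms live in a 'q' parameter. If the
    exclusion is already there, the URL comes back effectively unchanged.
    """
    parts = urlsplit(search_url)
    operator = f"-site:{site}"
    new_params = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "q" and operator not in value.split():
            value = f"{value} {operator}"
        new_params.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(new_params)))


if __name__ == "__main__":
    print(exclude_site("https://www.google.com/search?q=open+access"))
```

A browser script would then load the rewritten URL, which re-runs the current search with the verbose site removed.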


Also, yes, I’m aware that my ‘add JURN as a link to Google Search’ UserScript has stopped working. Google has re-labelled the divs on the text links just below the search box. A similar script that allows the current search to be passed to Scholar has also stopped working, as have several similar menu scripts. I’m waiting for one of these scripts to update, and thus to show me how it needs to be fixed.

Some free tools to extract data from fetched HTML

07 Wednesday Oct 2020

Posted by David Haden in JURN tips and tricks, Regex

Here are some relatively simple free Windows desktop tools to ‘extract an item of data from fetched HTML’. They were found while considering if it might be possible to append ISSNs to the JURN directories in a semi-automatic manner.

My target task was: you have a big list of URLs, and the HTML pages for these are to be automatically fetched. Their text is then regex-ed (or the Excel equivalent) to extract a tiny snippet of text data from each page. In this case, any line following the first instance of the word “ISSN” on each home-page. Ideally, each extracted text snippet is then automatically appended to its source URL.

1. Excel Scrape HTML Add-In, free from Analyst Cave. I can’t do anything with it in Excel 2007, so I assume it needs Excel 2016 or higher (2016 introduced the new features Power Query and Get & Transform).

2. Download WebExtractor360 1.0. Simple Windows abandonware from 2009, and lacking any Help in terms of… how do you format your big list of URLs so they can be automatically processed? It also looks like it cannot be limited to just the first-encountered home-page. Still, someone might figure out that bit of the WebExtractor360 puzzle, or pick up the open source code at SourceForge and develop it for easier batch processing and expanded output options.

3. DEiXTo. Genuine Windows freeware from Greece, for “Web data extraction made easy!” The baffling interface and example-free techie manual strongly suggest otherwise though, and you’ll likely need to read the manual very carefully to get it working. There’s also a 2013 academic paper on DEiXTo from the authors.

4. Update: the open source freeware Web-Harvest 2.x from 2010, Java with a clean Windows GUI and a good manual. Seems like a good alternative to DEiXTo. Still works and has many examples and templates, but no template to run through a list of URLs and grab a fragment of data from each home-page. Despite the name it’s a data extractor, not a site harvester.

5. Update: I made one for Excel 2007, and it’s free. Take a list of home-page URLs, harvest the HTML, extract a snippet of data from each.
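For those comfortable with a little scripting, the target task can also be sketched in Python using only the standard library. This is not the method used by any of the tools listed above. It is a minimal sketch that assumes "the line following ISSN" means the rest of the text line after the first occurrence of the word, strips HTML tags crudely with a regex, and writes "no result" against any URL where nothing is found. All the function names are hypothetical.

```python
import re
from html import unescape
from urllib.request import urlopen

TAG = re.compile(r"<[^>]+>")  # crude tag stripper, fine for a rough pass


def snippet_after(html, keyword="ISSN", max_len=40):
    """Return the text after the first keyword on its line, or None."""
    text = unescape(TAG.sub(" ", html))
    for line in text.splitlines():
        pos = line.find(keyword)
        if pos != -1:
            rest = line[pos + len(keyword):].strip(" :\t")
            return rest[:max_len].strip()
    return None


def fetch_html(url):
    """Fetch one page; return an empty string if the fetch fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return ""


def harvest(urls, fetch=fetch_html):
    """Yield (url, snippet) pairs, with 'no result' where nothing is found."""
    for url in urls:
        snippet = snippet_after(fetch(url))
        yield url, snippet if snippet else "no result"


if __name__ == "__main__":
    # Hypothetical list; replace with real home-page URLs.
    for url, snippet in harvest(["https://journal.example/"]):
        print(f"{url}\t{snippet}")
```

The fetch function is passed in as a parameter so the extraction logic can be tried out on saved HTML before being let loose on a big list of live URLs.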

For paid Windows desktop software that doesn’t require a PhD in Spreadsheet Wrangling, and which indeed assumes you’re not working in Excel, look at Sobolsoft’s $20 Extract Data & Text From Multiple Web Sites Software and BotSol’s Web Extractor. The first, from Sobolsoft, requires Internet Explorer, and requires that you delve into two of Explorer’s settings to stop IE freaking out with process-stopping alerts every time it meets a Twitter button etc. Search is not ideal, as it cannot be limited to just the first-encountered ‘home’ page. Output is not ideal either, as it cannot offer Source URL = no result as a line in the results. The latter software, from BotSol, has the great advantage that it can limit itself to the home-page and will also try another 2 nearby pages (“About” etc.) if it can’t find the target data on the home-page. It’s designed to extract phone numbers, but can be configured to get anything. It’s free for a version that processes a list of 10 URLs at a time, and is $50 for an ‘unlimited URLs’ version (that is regrettably time-bombed).

There are browser-based tools like the long-standing OutWitHub and new free Cloud services such as Octoparse, but they appear focused on ripping competitor ecommerce listings and plugging them into your boss’s database. Also, apparently Octoparse’s “List of URLs” feature requires all the pages to have exactly the same HTML elements.

Free: My Little Regex Cookbook, for Notepad++

27 Sunday Sep 2020

Posted by David Haden in JURN tips and tricks, Regex

≈ 3 Comments

New, My Little Regex Cookbook as a printable eight-page PDF. It has numerous working examples of useful regex for Notepad++ users working with data extraction and text lists. All tested and working in Notepad++.

This is my expanded and now prettified 1.3 PDF version of what first appeared here as the post “Some useful regex commands for Notepad++” in May 2019.

Download: little_regex_cookbook_2020.pdf

Freeware: TextWorx

25 Friday Sep 2020

Posted by David Haden in JURN tips and tricks, Regex

There’s a relatively new entry in genuine Windows freeware for complex text-manipulation, and this hadn’t been found when I made my summer 2019 survey of Freeware for cleaning and manipulation of text lists.

It’s TextWorx by bgmCoder, a “Universal Text Manipulator”. It lives up to the name, in terms of being able to use it with any text-editor. Highlight the text block you want to work with. Press a keyboard shortcut. Up pops a well-organised tool offering a huge range of “advanced text-manipulation routines”.

The default keyboard shortcut required to trigger the menu is a bit of a contortionist show-stopper, or else it requires you to remove your hand from the mouse:

Win key + K or Win key + shift + K.

But the shortcut is not hardwired and can be changed in the .INI file. And it’s easy enough to trigger a keyboard shortcut with a mouse-gesture. Choose a gesture that ends up somewhere suitable on the screen, since your mouse-cursor position is where the TextWorx interface will appear.

What it doesn’t seem to have is regex functions. It can’t thus function as a handy regex ‘key-ring’. For instance it can’t do things like “Extract all text found between KEYWORD1 and KEYWORD2 to a new List”. For that you’d want regex or Sobolsoft’s £20 “Extract Data Between Two Strings” utility software, which saves the extracted substrings as a list. Or you could save £20 by doing the same with this tested-and-working regex and a copy of the free Notepad++…

FIND:         .*?KEYWORD1(.*?)KEYWORD2|.+
REPLACE:    \1\r\n
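
The same extraction can be sketched in Python, for those who would rather script it. This is a minimal version with a made-up function name. Like the Notepad++ regex in its default mode, it assumes both keywords sit on the same line, since . does not match a line break unless told to.

```python
import re


def extract_between(text: str, start: str, end: str) -> list[str]:
    """Return every substring found between start and end, as a list."""
    # re.escape keeps any punctuation in the keywords from being
    # treated as regex syntax. (.*?) is the same lazy capture as above.
    pattern = re.compile(re.escape(start) + r"(.*?)" + re.escape(end))
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "aa KEYWORD1 first KEYWORD2 bb KEYWORD1 second KEYWORD2 cc"
    for item in extract_between(sample, "KEYWORD1", "KEYWORD2"):
        print(item)
```

Rather than deleting everything else in the file, as the FIND and REPLACE pair does, this collects only the captured pieces, which amounts to the same list in the end.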
