Here are some relatively simple free Windows desktop tools to ‘extract an item of data from fetched HTML’. They were found while considering if it might be possible to append ISSNs to the JURN directories in a semi-automatic manner.
My target task was: you have a big list of URLs, and the HTML pages for these are to be automatically fetched. Their text is then regex-ed (or the Excel equivalent) to extract a tiny snippet of text data from each page. In this case, any line following the first instance of the word “ISSN” on each home-page. Ideally, each extracted text snippet is then automatically appended to its source URL.
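The pipeline described above can be sketched in a few lines of Python. This is a generic stand-in, not any of the tools reviewed below; it assumes the snippet wanted is an ISSN-shaped code (NNNN-NNNC, final character possibly X) rather than a whole line of text, and the URL list is a placeholder for your real one:

```python
import re
import urllib.request

# ISSNs look like 1234-5678; the final check character may be X.
ISSN_RE = re.compile(r"ISSN[\s:]*(\d{4}-\d{3}[\dXx])")

def find_issn(html):
    """Return the first ISSN-like code in a blob of HTML, or None."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag strip
    m = ISSN_RE.search(text)
    return m.group(1) if m else None

def fetch(url):
    """Fetch a home-page; returns '' on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return ""

# Batch run: append each extracted snippet (or 'no result') to its source URL.
urls = ["https://example.org/"]  # replace with your big list of home-pages
for url in urls:
    issn = find_issn(fetch(url))
    print(f"{url}\t{issn or 'no result'}")
```

Note the `no result` line per failed URL — the lack of exactly that in the output is one of the complaints about the paid tools below.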
1. Excel Scrape HTML Add-In, free from Analyst Cave. I can’t do anything with it in Excel 2007, so I assume it needs Excel 2016 or higher (2016 was the version that built in Power Query, as the Get & Transform features).
2. WebExtractor360 1.0. Simple Windows abandonware from 2009, lacking any Help on the key point: how do you format your big list of URLs so they can be automatically processed? It also looks like it cannot be limited to just the first-encountered home-page. Still, someone might figure out that bit of the WebExtractor360 puzzle, or pick up the open source code at SourceForge and develop it for easier batch processing and expanded output options.
3. DEiXTo. Genuine Windows freeware from Greece, for “Web data extraction made easy!” The baffling interface and example-free techie manual strongly suggest otherwise though, and you’ll likely need to read the manual very carefully to get it working. There’s also a 2013 academic paper on DEiXTo from the authors.
4. Update: the open source freeware Web-Harvest 2.x from 2010, Java with a clean Windows GUI and a good manual. Seems like a good alternative to DEiXTo. Still works and has many examples and templates, but no template to run through a list of URLs and grab a fragment of data from each home-page. Despite the name it’s a data extractor, not a site harvester.
5. Update: I made one for Excel 2007, and it’s free. It takes a list of home-page URLs, harvests the HTML, and extracts a snippet of data from each.
For paid Windows desktop software that doesn’t require a PhD in Spreadsheet Wrangling, and which indeed assumes you’re not working in Excel, look at Sobolsoft’s $20 Extract Data & Text From Multiple Web Sites Software and BotSol’s Web Extractor. The Sobolsoft tool requires Internet Explorer, and you’ll need to delve into two of IE’s settings to stop it throwing process-halting alerts every time it meets a Twitter button etc. Its search is not ideal, as it cannot be limited to just the first-encountered ‘home’ page. Output is not ideal either, as it cannot emit a “Source URL = no result” line in the results. BotSol’s Web Extractor has the great advantage that it can limit itself to the home-page, and will also try another two nearby pages (“About” etc.) if it can’t find the target data on the home-page. It’s designed to extract phone numbers, but can be configured to grab anything. A version that processes a list of 10 URLs at a time is free, and an ‘unlimited URLs’ version (regrettably time-bombed) is $50.
There are browser-based tools like the long-standing OutWitHub and new free Cloud services such as Octoparse, but they appear focused on ripping competitor ecommerce listings and plugging them into your boss’s database. Also, apparently Octoparse’s “List of URLs” feature requires all the pages to have exactly the same HTML elements.