GRAFT has been updated, with a new batch of repository URLs added. Search across full-text and records alike, in 4,723 repositories.
JURN’s coverage of Moroccan and Ukrainian open access journals is now much improved.
Here are some introductory notes on a preliminary search, relating to some new software I was testing and installing as a favour.
* IMSLP/Petrucci Music Library. Free Public Domain Sheet Music, with 140,000 works at May 2019. No preview, and a 20 second delay on each PDF download while an appeal for donations is displayed. It seems they’re not closely tracking Archive.org, as I couldn’t find a couple of newer public domain items I know are on Archive.org.
* Musopen is the other large repository of free sheet music. You get a visual preview of the sheets, and PDF download links are open and quick. But the results from its keyword search are very poor. You’ll do better with a site: search in Google or DuckDuckGo…
* Metadata tagging on Archive.org is patchy for sheet music, though an item will often have ‘sheet music’ or ‘sheetmusic’ or ‘score’ in its title. In theory each item should have a ‘sheet music’ topic tag, but often the uploaders don’t tag their upload with that. The best initial wide searches are:
sheetmusic AND mediatype:texts keywords
“sheet music” AND mediatype:texts keywords
Which together suggest Archive.org currently has about 10,000 sheet music booklets, nearly all of an age to be in the public domain. Including much 1910s and 20s rag-time, marching music, and popular ditties.
* ChoralWiki has a very large collection of public domain choral sheet music.
* The Mutopia Project has a small collection of classical works.
* Band Music has a large collection of American marching-band sheet music.
* American university libraries also have very large online collections, usually general popular and light classical music.
I don’t know of any unified one-box search that will search across all of the above. It might be an interesting project to create one, but I won’t be doing that — so feel free to give it a go.
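For anyone who does want to give it a go, here is a minimal Python sketch of a starting point. It only builds search URLs and fetches nothing. The list of site domains is my assumption, as are the function names. The Archive.org query merges the two wide searches above into one, and is sent to Archive.org's `advancedsearch.php` endpoint with JSON output.

```python
from urllib.parse import urlencode

# Assumed domains for the sites listed above; check each before relying on it.
SITE_DOMAINS = ["imslp.org", "musopen.org", "cpdl.org", "mutopiaproject.org"]


def archive_org_url(keywords: str, rows: int = 50) -> str:
    """Build an Archive.org advanced-search URL for sheet music texts."""
    # Merges the two wide searches into a single query (an assumption of this sketch).
    query = f'("sheet music" OR sheetmusic) AND mediatype:texts AND ({keywords})'
    params = [
        ("q", query),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("rows", rows),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)


def site_search_url(domain: str, keywords: str) -> str:
    """Build a DuckDuckGo site: search URL for one repository."""
    return "https://duckduckgo.com/?" + urlencode({"q": f"site:{domain} {keywords}"})


def one_box(keywords: str) -> dict[str, str]:
    """Return one search URL per source for the same keywords."""
    urls = {"archive.org": archive_org_url(keywords)}
    for domain in SITE_DOMAINS:
        urls[domain] = site_search_url(domain, keywords)
    return urls


if __name__ == "__main__":
    for source, url in one_box("ragtime").items():
        print(source, url)
```

A real one-box search would still need to fetch and merge the results, which is where most of the work lies.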
* For audio preview of the music score, you’ll first need to get away from your default MIDI sounds driver. Doing this used to be fiendishly complex, but is now extremely easy with the fine bit of Windows freeware called VirtualMIDISynth. Install, and it takes over from your Windows MIDI sounds driver, and then you add a free .SF2 soundfont to it — so that it has the instruments it needs to play a score with.
Then you need to OCR your sheet music into software player-readable form. For sheet music, OCR is termed “OMR” (optical music recognition). If you’re using the leading paid Sibelius sheet-music player and writer (termed ‘scorewriters’ by those in the trade), then you’ll find that works seamlessly with the leading PhotoScore “OMR” software. PhotoScore uses the SharpEye SDK, widely said by eagle-eyed music industry testers to be the most accurate at picking out and interpreting all the complex fine detail of a music score.
Then you save your successful scan to an .OPT file and Sibelius can load and play this as if it were a .MP3 file. Your new VirtualMIDISynth audio driver and soundfont make it sound lovely.
I’d like to recommend some free open source OMR -> Scorewriter software, but I had no success with it and the OMR end of the process is especially lacking and clunky (Java). As someone who’s been away from the world of music for many years, I had the feeling that the robust commercial impetus in the music professions means that it’s hard for open source software makers to stand the pace.
YouTube are once again restricting “sort by date”. They haven’t actually turned it off totally, like they did last time, but the results show it’s obviously being heavily restricted.
Coastal Review, The (papers of the Southeast Coastal Conference on Languages and Literatures)
Irish Studies South (the Irish in the South of the USA)
Air Force Journal of Indo-Pacific Affairs (U.S. Air Force)
The OpenClipArt site has been down for 21 days now, apparently felled by a heavy distributed denial-of-service (DDoS) attack. It has 150,000 bits of clipart, all under CC Zero. Archive.org has a partial mirror of the site, but it’s no use for keyword searching or (it seems) getting the actual image files.
While Wikimedia Commons holds 2,000 images tagged with ‘OpenClipArt’, they didn’t ingest all 150,000 bits of the OpenClipArt clipart. In fact, no-one seems to have done so, and there are also no recent tar.gz archives containing all 150,000 items. The 0.18 and 0.19 releases were 2010 and 2011, and while a 310Mb author-sorted 2.0 release followed, there doesn’t appear to have been a more recent 3.0 release of the archive.
Thus it seems to me that the site Public Domain Files by Open Clip Art Library is the best fallback until OpenClipArt is back up again. It has a searchable partial mirror of 13,778 OpenClipArt image files, with the latest of these dated to mid-summer 2014, and the site has no Shutterstock-ery, pop-ups or ‘mailing-list blocking overlay’ nastiness that I could see (while running an ad-blocker).
Once OpenClipArt is back online it would probably be a good idea to archive and distribute a big compressed mirror of the summer 2019 contents, if only in .SVG format.
Situation: When Windows opens a .txt file it will only launch the 64-bit version of Notepad++. How does the user force Windows to open .txt files with the 32-bit Notepad++, when both the 32-bit and 64-bit Notepad++ are installed? The usual routes of Windows file association and using the Notepad++ Preferences settings have both failed.
Reason for wanting 32-bit: The user may prefer the older version for light editing, for the time being, because it supports useful plugins such as MultiClip Viewer and others.
Solution: Open the Windows Registry editor (Start: ‘Regedit’) and navigate down to…
There replace the 64-bit file path with the 32-bit path. Which means, for me…
C:\Program Files (x86)\Notepad++\notepad++.exe
… and then exit the Registry Editor.
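For those who would rather check the edit before typing it into Regedit, here is a minimal Python sketch of the path swap itself. It does not touch the Registry at all; it only rewrites a command string. The 64-bit path and the quoted-path-plus-"%1" shape of the value are my assumptions, and `swap_editor_path` is a made-up helper name.

```python
def swap_editor_path(command: str, old_exe: str, new_exe: str) -> str:
    """Return the command string with old_exe replaced by new_exe.

    The match is case-insensitive, since Windows paths are not case-sensitive.
    """
    idx = command.lower().find(old_exe.lower())
    if idx == -1:
        raise ValueError("old path not found in command value")
    return command[:idx] + new_exe + command[idx + len(old_exe):]


# Assumed 64-bit install path; yours may differ.
OLD = r"C:\Program Files\Notepad++\notepad++.exe"
# The 32-bit path given above.
NEW = r"C:\Program Files (x86)\Notepad++\notepad++.exe"

# Assumed shape of the existing value: quoted path, then the file placeholder.
current = f'"{OLD}" "%1"'

if __name__ == "__main__":
    print(swap_editor_path(current, OLD, NEW))
```

Only the path changes. Any quotes and the trailing placeholder are kept exactly as they were.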
Perhaps it’s just the influence of Inkle’s new Heaven’s Vault game, with the epigraphy of its mysterious alien inscriptions, but I’ve taken a bit of a shine to regex. My first failed tests with the Notepad++ regex were obviously with the ‘wrong type’ of regex, as I now know there are slightly different versions for Windows, Linux etc. But I’ve now found commands that do work for me.
The following were found by scouring forums and were then tested while learning more about Notepad++ and how it works (it’s a lot deeper than it looks). They’re actual working practical examples, tested and working with the latest Notepad++ on Windows 64-bit. My testing suggests that exactly the same macros run differently in the old 32-bit vs. new 64-bit Notepad++. Since (so far as I can tell) plugin activity cannot be recorded in macros, I assume the difference is due to regex support.
Please note that I am clueless about writing these things, only knowing how to search for, find and test them. So don’t ask me to advise you on their devising or tweaking. My many thanks especially to guy038 at the Notepad++ forum, and many and various others, for writing these and helping others find solutions. I found that the search-engine Yippy, based on Bing, is especially good at finding these things, and will almost inevitably lead you to guy038. But, so far as I know, he has not made a regex ‘keyring’ or a ‘cookbook’ or suchlike. Hence my need to collect some working examples here under practical headings.
All but one of these regexes (regexii?) run in the ‘Find’ or ‘Replace’ box in Notepad++. One needs to run in the ‘Mark’ tab in the same box…
On your keyboard, it’s useful to know that Ctrl + Home will take your text cursor (‘caret’) back to the top of the Notepad++ page, which may be useful if you are building these commands into a recorded macro.
Lastly, using regex to fiddle with public HTML seems to be frowned on, so I suggest the following are useful for certain offline text cleaning and data-swivelling operations, not for mission-critical coding or live pages.
The list has to be posted here in plain text as a .PDF. I had a blog post all done and polished, but then found that WordPress.com blogs make an utter mess of posted regex code, even when wrapping it with the code tags which are supposed to protect snippets of code! So here are the working regexes in a handy four-page .PDF file…
The apparently-new front-end for the CC Search – Image Search, for speedily finding re-usable Creative Commons images. There are said to be 289 million pictures here, mostly via Common Crawl apparently, from sources including DeviantArt. But Flickr is apparently not yet completely incorporated. Given the size it’s delightfully quick, and as you can see here it’s possible to ‘stack and chain’ the filters by selecting them repeatedly…
The drawback, compared to Google Images, is that in this incarnation of CC Search (the old one is still available) there’s no size filter and the relevancy ranking of results appears to be ‘easily distracted’. My very first search for Mongolian folk song got me a whole lot of Latvian folk dance, scrolling down into a vast amount of Indian folk music. Very nice, but it took a lot of scrolling to eventually get down to some Mongolian content…
CC Search is under continual improvement through 2019 and more features are planned. Looking down the list in their forward plan I see that searching for CC texts is said to be coming to the new interface later in the summer (“incorporate open texts from major providers”), along with another design makeover (a new “distinct visual look and feel for landing page”).
There’s also talk of future delivery of a front-end for displaying “3D designs”, which suggests that 2020 or 2021 could see a very useful feature, a unified search for all CC 3D model files. I’d suggest that what’s vital in such a tool is a ‘can be re-textured’ search filter, as 3D models are not much use for quick re-use if (as is often the case with CC freebies) the material zones are either missing entirely or screwed up, which means they can’t be re-textured without specialist software and arcane skills. Perhaps a public user-feedback button could be used to indicate “I had success with this great model” / “Don’t waste your time on this”.
Pandoc is a useful universal document converter utility. Yet using the free Pandoc on Windows is often assumed to be about typing arcane command lines into a command prompt, and hoping for the best.
But now there’s a nice free and open source front-end for Windows, for those who just want the markdown. The 2019 PanWriter is a delightfully elegant and simple markdown writing pad for the rest of us.1 It’s a speedy Windows desktop install, and it didn’t even need to be told where Pandoc was on my PC. It just ‘knew’.
Even if you don’t need another sweet text editor, PanWriter also serves as a document importer and converter utility. Just load up a HTML page you saved from the Web, and it instantly converts all the HTML code to markdown. I loaded a few really complex pages and it didn’t blink, instantly presenting a clean markdown conversion.
Why would one want to convert HTML to markdown, you ask? Because it places the HTML and other elements onto single lines, while retaining Web links in place. Such lines are far easier to extract data blocks from than fiendishly nested HTML code that sprawls across multiple lines.3
Once the markdown text is in Notepad++, a macro can have its way with it. This can ‘Find and Mark’ the repeating line-blocks2 containing the data you want to extract and clean (e.g. search-engine results). These can then be copied out to a new tab (Top Menu: Search | Bookmarks | Copy Bookmarked lines). Then the macro can run various operations that fix up the text a bit more,4 before finally saving the cleaned list back to HTML via the user-friendly MarkdownViewerPlusPlus plugin for Notepad++.5
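To show why single-line markdown is so much easier to work with, here is a rough Python equivalent of the ‘mark and copy out’ step. It is only a sketch, not the Notepad++ macro itself. The link pattern and the sample text are my own assumptions, and it keeps only the first link found on each line.

```python
import re

# Matches a markdown link of the form [title](http...); a deliberately simple pattern.
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


def mark_link_lines(markdown: str) -> list[tuple[str, str]]:
    """Return (title, url) for every line that contains a markdown link."""
    rows = []
    for line in markdown.splitlines():
        match = LINK.search(line)  # first link on the line only
        if match:
            rows.append((match.group(1).strip(), match.group(2)))
    return rows


# Made-up sample standing in for a converted search-results page.
sample = """\
# Results
Some navigation text
- [First result](https://example.org/a) short description
- [Second result](https://example.org/b) another description
Footer text
"""

if __name__ == "__main__":
    for title, url in mark_link_lines(sample):
        print(title, url)
```

Because each result sits on one line, a single-line pattern is enough and nothing needs to match across line breaks.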
So the basic workflow here is:
1. Get your search results and save your Web page(s) as usual. There’s no need to painstakingly select-copy-paste just part of the page.
2. Use a joiner utility (TXTcollector) to join all the saved pages.6 Open the saved HTML with PanWriter and it instantly auto-converts into markdown.
3. Open the saved markdown with Notepad++ and run your cleaning and text-sorting custom macro on it.
4. Copy-paste the resulting linked data list as HTML to your blog etc. (Or to .CSV for Excel import and sorting).
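For those who prefer a script to a stack of GUI tools, steps 2 and 4 might be sketched in Python roughly as follows. This is an illustration under assumptions, not what TXTcollector or PanWriter actually do internally. The function names are made up, the Pandoc command is only built and not run, and it uses the standard `-f`, `-t` and `-o` options.

```python
import csv
import io
from pathlib import Path


def join_pages(paths) -> str:
    """Join several saved pages into one text, in the order given (step 2)."""
    return "\n".join(
        Path(p).read_text(encoding="utf-8", errors="replace") for p in paths
    )


def pandoc_command(src, dst) -> list[str]:
    """Build (but do not run) a Pandoc call converting HTML to markdown."""
    return ["pandoc", "-f", "html", "-t", "markdown", str(src), "-o", str(dst)]


def rows_to_csv(rows) -> str:
    """Turn (title, url) pairs into CSV text for spreadsheet import (step 4)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["title", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(pandoc_command("joined.html", "joined.md")))
    print(rows_to_csv([("First result", "https://example.org/a")]))
```

With Pandoc installed, the built command could be handed to `subprocess.run`. The cleaning in step 3 would sit between the conversion and the CSV export.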
The above is a more advanced and robust version of my recent home-brew workflow, which suggested a browser addon and manual copy-paste. That was more suitable for occasional use by bloggers and academics who can’t afford sophisticated data scrapers (and the proxies to run them).
This workflow has the advantage that: i) it’s all free software; ii) it doesn’t need you to pay for and burn through proxies in your paid scraping software; iii) as long as you have the HTML in your browser it can be grabbed, and it basically doesn’t matter how complex and nested the page code is, as it’s all going into Markdown; and iv) collection can be automated with Windows automation software (JitBit) etc, and processing can be automated with Notepad++ macros. But it is obviously not suited to automated scraping of millions of records from multiple shopping sites — if you’re into that game then you should have the cash to buy in those datasets.
1. There are also two old GUIs for Pandoc, over on GitHub. One there is nice and simple, and has batch… but it crashes and fails on 64-bit Windows, as the developer admits in his readme. The other GUI at GitHub was tried and runs on 64-bit Windows, but seems far less user-friendly. There are also a half dozen Python script projects that do this.
2. Marking and exporting lines in Notepad++ can’t currently be done for multi-line nested HTML code, which is why a HTML-to-Markdown conversion is so useful. While multi-line block marking can be done between two keywords [ Find | Mark | Regex with Newline | then paste in…
…this only places a single mark at the top of each marked and highlighted block. It does not run a line of marks down the entire block.
3. Yes, I know about XPath, but with a complex Web page it's: i) fiendishly tricky to do the initial puzzling out of what needs to be captured; ii) often fails to then grab what’s needed; iii) and has even more difficulty in aligning data fields when used as a browser addon.
4. Note that in macros, multiline search-replace needs to be done with \n commands, not plugins. Also note that Ctrl + Home will get your cursor back to the top of the text.
5. Sadly one can’t yet use Notepad++ as the initial importer/converter, as it has no such plugin at present. I’ve looked. There are a couple of possible Python scripts but support for the Python plugin in the latest Notepad++ is a bit of a mess at present, with plugin structures being swopped around and then reverted. So that’s not really an option, unless you want to fall right back to version 5.9 or thereabouts to use a script.
6. Update: If you have no absolute need to keep the saved HTML pages as backup, then Clipboard Magic is a lovely little Windows freeware that keeps copies of each clipboard, then when done you “Copy all clips to clipboard” and paste to Notepad++. Or if you still use an older 32-bit Notepad++ you can use the fine MultiClipboard plugin.