Hurrah. Desktop software has been found that will robustly convert the whole of a WordPress blog .XML export to Word. Specifically, the export from a free blog hosted at WordPress.com, which doesn’t permit or offer any fancy ebook conversion plugins. The solution is good old Windows freeware, as usual, though the various freeware directories know nothing of it and it was only found after hours of digging.
The XML to Doc software is the snappily titled wpxslgui…
wpxslgui is a Windows application which converts an XML File generated by the WordPress Export function into an HTML or Word HTML document.
It’s Windows freeware in v.1.04 (June 2020) from Devio IT Services, being the worthy and generous Herbert Oppolzer of Austria. Tested and working here. Very simple usage, with a Windows GUI.
It also includes the option to… “Convert WordPress XML to a single HTML file allowing filter by category (JavaScript)”. Note that the “Word HTML” output saves as a .DOC file.
However if you cleverly just re-name the .DOC to .HTML it then works fine as a Web page in a Web browser, and thus calls in the images. I’m assuming here that your browser is allowed to go online, but Word is not.
The overall aim here is to get the blog to a clean ebook format for Kindle, removing as much gunky code and blog-cruft as possible. Archivists may also be interested here. As such there are some initial changes you may want to make…
1). First you may want to delete the superfluous 24-hour timestamp by editing the WordPress .XML output itself. What you’re targeting looks like this…
Since timestamps have a unique pattern, a regex can deal with them. In Notepad++, the regex to find all the timestamps (leave the replacement empty, so that they are simply deleted) is…
 \d{2}:\d{2}:\d{2}
…and note the single space at the start of the regex, before the first \d. It matches the space that separates the date from the time.
Running this removes all timestamps but leaves the date intact. Possibly a regex could also re-work the main date to something nicer (e.g. 12th July 2015), but it would likely be a very complex regex.
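For those who would rather script it, here is a minimal Python sketch of the same timestamp removal, plus one possible way to get the nicer date form without a monster regex: a small function does the ordinal suffix and month name. It assumes the dates left behind are in the YYYY-MM-DD form (as in the export’s wp:post_date field); the names clean_dates, pretty_date and ordinal are made up for this illustration, and it has not been tested against what wpxslgui expects as input, so work on a copy of the export.

```python
import re
from datetime import date

# Same pattern as the Notepad++ regex: one space, then HH:MM:SS.
TIMESTAMP = re.compile(r" \d{2}:\d{2}:\d{2}")
# Assumed shape of the date that is left behind: YYYY-MM-DD.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def ordinal(day: int) -> str:
    """Return the day with its English suffix, e.g. 1st, 2nd, 3rd, 12th."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def pretty_date(match: re.Match) -> str:
    """Turn a matched YYYY-MM-DD into e.g. '12th July 2015'."""
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # validation only
    except ValueError:
        return match.group(0)   # not a real date: leave it untouched
    return f"{ordinal(day)} {MONTHS[month - 1]} {year}"


def clean_dates(text: str) -> str:
    """Delete every timestamp, then reformat the remaining dates."""
    text = TIMESTAMP.sub("", text)
    return ISO_DATE.sub(pretty_date, text)
```

To use it, read the export with open(path, encoding="utf-8"), pass the text through clean_dates, and write the result to a new file. Note that it reformats every YYYY-MM-DD string it finds, including any that appear inside the text of a post.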
2). The output from wpxslgui sets post titles in italics. The CSS stylesheet is embedded at the top of the output HTML, and so changing the italics is just a matter of tweaking font-style:italic on H2. Bold might be better.
3). While you’re in the CSS you may want to have external links be something other than blue and purple. In which case edit the colours of a:link and a:visited in the header CSS.
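If the conversion is going to be re-run a few times, the two stylesheet tweaks in steps 2 and 3 could be scripted rather than hand-edited. The following is only a sketch: it assumes the embedded CSS has plain rules of the form h2 {…}, a:link {…} and a:visited {…}, which may not match the real wpxslgui stylesheet exactly. The function names and the grey #444444 are hypothetical choices for illustration.

```python
import re


def swap_in_rule(html: str, selector: str, pattern: str, replacement: str) -> str:
    """Apply a regex substitution only inside the body of `selector { ... }`."""
    rule = re.compile(
        r"(?<![\w.#-])(" + re.escape(selector) + r"\s*\{)([^}]*)(\})",
        re.IGNORECASE,
    )

    def fix(match: re.Match) -> str:
        body = re.sub(pattern, replacement, match.group(2), flags=re.IGNORECASE)
        return match.group(1) + body + match.group(3)

    return rule.sub(fix, html)


def restyle(html: str, link_colour: str = "#444444") -> str:
    """Make H2 post titles bold instead of italic, and recolour the links."""
    html = swap_in_rule(html, "h2", r"font-style\s*:\s*italic",
                        "font-style:normal; font-weight:bold")
    for selector in ("a:link", "a:visited"):
        # The lookbehind stops this from touching background-color and the like.
        html = swap_in_rule(html, selector, r"(?<![\w-])color\s*:\s*[^;}]+",
                            "color:" + link_colour)
    return html
```

Rules that share a selector list (h2, h3 {…} for instance) are not handled here, so check the header CSS by eye after running it.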
4). wpxslgui adds numbering of posts, in front of each post title. The code spans multiple lines and looks like this…
Awkward, but not impossible for a regex to fix. This Search-Replace regex for Notepad++ will replace all of these with a tilde or whatever other elegant typographer’s HTML mark you might want.
5). Once you have output in semi-cleaned HTML, Ctrl + A should “Select all…” in the browser, and you can then copy and paste into whatever you’re using. If you were keen on YouTube embeds you will then need to manually go through and delete the WordPress code for these. The WordPress YouTube ‘slug’ insert is presented raw in the post, because without a WordPress installation the code can’t call the video. The same will likely be true for any other fancy embeds of maps, charts, podcasts etc.
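The manual hunt for YouTube code might be shortened with a few lines of Python run over the HTML before it goes to the browser. This sketch assumes the raw inserts take the square-bracket shortcode form, such as [youtube https://…]. That is an assumption about how the embeds were originally added, so check a few posts first; embeds pasted as bare URLs will not be caught. The function names are invented for the example.

```python
import re

# Assumed form of the raw embed code: [youtube ...anything up to the closing bracket...]
YOUTUBE_SHORTCODE = re.compile(r"\[youtube[^\]]*\]", re.IGNORECASE)


def find_youtube(text: str) -> list:
    """List every shortcode found, so they can be reviewed before deletion."""
    return YOUTUBE_SHORTCODE.findall(text)


def strip_youtube(text: str, placeholder: str = "") -> str:
    """Remove each shortcode, or swap it for a placeholder such as '[video]'."""
    return YOUTUBE_SHORTCODE.sub(placeholder, text)
```

Running find_youtube first gives a list to check against, which is useful if some videos deserve a replacement text link rather than a straight deletion.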
Finally, note also that wpxslgui only deals with posts, not pages.