News from JURN

~ search tool for open access content

Monthly Archives: June 2014

How to archive a free WordPress.com blog with images

21 Saturday Jun 2014

Posted by David Haden in JURN tips and tricks, Regex

One of the problems in making a local archival backup of a free WordPress.com blog is that users are not allowed to bulk export their images and other uploaded files. They get just the archive .xml file, which has all the HTML of the blog posts inside it. WordPress.com unhelpfully suggests: “uploads and images may need to be manually transferred to the new blog”. That’s possible for the lazy blogger who only ever squeezed out six posts before exhausting their intellectual energies, but not so useful for uber-bloggers with thousands of posts and images.

For those who are self-hosting a WordPress install, archiving all images is a simple matter. Just copy over the relevant folder by FTP access. But for free WordPress.com users that’s not an option.

Similarly, those moving from a live WordPress.com blog to a new self-hosted WordPress blog are also in luck. Import the .xml backup of your blog and the new self-hosted WordPress install should go fetch the old blog’s live images and import them, even reworking all their links to conform to the new site URL. Once everything has been ported across, the old WordPress.com blog can then be deleted.

However, there may be instances where someone wants to make a more long-term local archive of a free WordPress.com blog, especially one that is set to be deleted. A literary executor, for instance, may want to properly archive then close a writer’s substantial blog. Perhaps there are legal problems with the estate that means the blog needs to come down. Perhaps they intend to publish it in book form or online again at some time in the future, but… they’re not sure yet.

But they do know that they want the archive to remain more-or-less portable and flexible into the future. I’m assuming that that person doesn’t have the time or the technical savvy to: buy web space; get the host to activate the database on their website space; get a hosted WordPress install set up and configured with the database; then save the blog out from that. Or to set up a local MySQL etc install on their desktop, something which is dangerously unstable in terms of later moving it to a new PC or a fresh OS install.

In such a case the easiest option for doing this appears to be…

1. Download and install website ripper (or in more polite parlance, “mirroring”) software, such as the excellent free HTTrack Website Copier. Use its simple wizard to make a full local mirror of your blog. You’re only doing this to get at the images, and have them accurately mirrored inside their correctly named sub-folders.

Unfortunately the downloading of your target blog may take quite some time, even for a relatively small blog. A test run with JURN’s substantial blog took a ridiculous 90 minutes to mirror, using HTTrack 64bit Windows and standard broadband, including 18,000 “ooh, ooh, share this post on CrapUpon!” and similar WordPress fluff-files.

2. Then download an export backup .xml of your blog, from your blog’s own Dashboard (Dashboard | Tools | Export | Export | Complete | Download Export File). This export will be a text only .xml file, which won’t include any of the blog’s images. (What to do when the export email never arrives)

3. Copy out the images folder (look for a folder titled yourblogname.files.wordpress.com) from the local ‘mirror’ of your blog that HTTrack made. Place this below the location of your blog’s exported .xml file.

4. You now have a relatively clean and simple backup archival copy of your blog, with the folders of blog images aligned (in terms of everything but the base URL) with the URL references contained in the .xml archive file.

5. Make a copy of the blog’s main index.html page, so as to capture any sidebar blogroll links. Perhaps also take a screenshot, and also download the .zip of the template that was used by the blog. Place these items with the .xml and images folders.

6. Save and zip an archive of the blog .xml and the blog images, plus the index.html, the template .zip, and the screenshot.

The advantage of doing it this way is that the blog is now much more portable across longer periods of time. If, five or ten years down the line, once the author’s estate has been sorted out, you want to put the blog online again, or port it into a book or timeline or whatever, you still have a single-file local .xml copy with code that’s fully accessible for search/replace with a simple text editor. You’d upload HTTrack’s folder(s) of the archived images somewhere, then tweak the archive’s .xml via search-and-replace of the image links (perhaps by using the free Notepad++, which can cleanly handle and save huge .xml files without injecting them full of Microsoft Office bloat on saving), such that the .xml archive image links all point to your new online images folder. A new self-hosted WordPress install should then go fetch those images and import them, reworking all the links to conform to the new site URL.
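For those who would rather script the search-and-replace than do it by hand, here is a minimal Python sketch of the same idea. It is not the method used for this post (that was Notepad++). The function name, the yourblogname host and the example.org address are placeholders invented for illustration.

```python
def repoint_image_links(xml_text: str, old_host: str, new_base: str) -> str:
    """Swap the old image host (http or https) for new_base, keeping the sub-folders."""
    for scheme in ("https://", "http://"):
        xml_text = xml_text.replace(scheme + old_host, new_base)
    return xml_text


# Hypothetical one-line sample standing in for the exported .xml text.
sample = '<img src="https://yourblogname.files.wordpress.com/2014/06/cat.jpg" />'

print(repoint_image_links(
    sample,
    "yourblogname.files.wordpress.com",     # the old images host
    "https://example.org/archive-images",   # placeholder for the new images folder
))
```

On a real export you would read the whole .xml into a string first, run the function, and save the result under a new file name so that the untouched original stays in the archive.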


Update, March 2019.

The new dashboard now has a newly added “Export Media Library” option…


Update, September 2021. Easier method and better than HTTrack.

The following assumes i) your images were all given unique names at upload (i.e. not continually 1.jpg, 2.jpg, mycat.jpg, mydog.jpg etc); and ii) the exported .ZIP file containing your images never completes its download, ever. A very common occurrence. You are left with no way to get your images archived.

1. From the newer WordPress Dashboard (not the old one), download your .XML files containing posts and pages.
2. Copy them somewhere, rename as .HTML
3. Run them through Sobolsoft Extract Links from Multiple HTML files, or a similar link extractor.
4. Copy the resulting links list to the clipboard, paste to Excel.
5. Sort A-Z and delete all links that are not https:// yoursite.files.wordpress.com/ Note the “.files.” here. This is where the images live. Some older blogs will have both http and https links to images, which when sorted will appear in different sections of the list.
6. Copy out the resulting list to Notepad++ and ‘search-replace delete’ the ending slash. You will have…

https:// yoursite.files.wordpress.com /2021/10/yourimage.jpg/

And you want it without the trailing slash…

https:// yoursite.files.wordpress.com /2021/10/yourimage.jpg

7. Save the final list as list.txt. This has all the links to your images.
8. Install the free DownThemAll! browser extension in your Web browser, or open it if you already have it.
9. Right-click on its download panel and choose “Load from List”. “Rename on duplicate”. It’s very robust and can handle 10,000+ links with no problem.
10. Start downloading. Getting all the images from a large ten-year blog may take many hours.
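
Steps 2 to 7 above can also be roughed out in a few lines of Python, for anyone who does not have the link extractor or Excel to hand. This is only a sketch under assumptions: the function names are my own, "yoursite" is a placeholder, and dropping any ?w=300 style suffix from a link is an extra guess that is not part of the steps above. The downloading itself is still left to DownThemAll!.

```python
import re


def extract_image_links(export_text: str, site: str) -> list[str]:
    """Return the sorted, de-duplicated image links found in export_text."""
    host = re.escape(f"{site}.files.wordpress.com")
    # http or https, then everything up to whitespace, a quote or a bracket.
    pattern = re.compile(r"https?://" + host + r"""/[^\s"'<>\])]+""", re.IGNORECASE)
    links = set()
    for match in pattern.finditer(export_text):
        url = match.group(0)
        url = url.split("?", 1)[0]    # assumption: drop any ?w=300 style suffix
        links.add(url.rstrip("/"))    # step 6: remove the trailing slash
    return sorted(links)              # step 5: sort A-Z


def save_list(links: list[str], path: str = "list.txt") -> None:
    """Step 7: write one link per line, ready for 'Load from List'."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(links) + "\n")


# Hypothetical snippet standing in for the text of an exported .XML file.
sample = (
    '<a href="https://yoursite.files.wordpress.com/2021/10/yourimage.jpg/">'
    '<img src="http://yoursite.files.wordpress.com/2012/01/old.png?w=300" /></a> '
    '<a href="https://example.com/some-post/">not an image link</a>'
)

for link in extract_image_links(sample, "yoursite"):
    print(link)
```

To use it for real, read each exported .XML file into a string, pass the combined text to extract_image_links, then hand the result to save_list.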

In an emergency you can then reinstall WordPress, import the .XML, install the same template as before, and also install the Search Replace Regex plugin. FTP upload a folder to your server called oldimages with the images, and then use the plugin to replace the paths to the images so that they all read https:// my_new_blog /oldimages/yourimage.jpg. Provided there are no or very few image name conflicts, the site should look and work as before.


One further problem may occur for some who have been hosting on rented server-space. You may have image links that look like this…

This will be because a WordPress install loves to fill your server space with utterly pointless multiply-resized copies of the images you upload. Renamed resized images. Sigh. So, even though you have successfully got the linked image.jpg back on your blog, you will not have the image-193×300.jpg which makes its initial appearance in your blog post. Thus you need to check if this is the case for you, before you begin. If this is the case then you will also need to ensure that step 3 (see above) captures the img src links as well as the a href links to images.
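
A quick way to check for this is to test each file name in your links list for a size suffix. The Python sketch below assumes the suffix takes the form -WIDTHxHEIGHT with a plain letter x just before the file extension (the × shown above is likely just typographic tidying). The function names are invented for illustration.

```python
import re

# Assumed naming pattern for resized copies: name-193x300.jpg
RESIZED = re.compile(r"-\d+x\d+(?=\.[A-Za-z0-9]+$)")


def is_resized(filename: str) -> bool:
    """True if the file name ends in a -WIDTHxHEIGHT suffix before its extension."""
    return RESIZED.search(filename) is not None


def original_name(filename: str) -> str:
    """Strip the size suffix to get the name of the full-size upload."""
    return RESIZED.sub("", filename)


for name in ("image-193x300.jpg", "image.jpg", "840-north-michigan-c1929.jpg"):
    print(name, is_resized(name), original_name(name))
```

Run over a full links list, this tells you how many of your post images are resized copies, and what the full-size original should be called if you decide to re-link to that instead.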

Another problem with re-linking images is capitalisation. Your blog may have links to…

840-north-michigan-c1929.jpg

… which is not the same as an image you might have archived locally as…

840-North-Michigan-c1929.jpg
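
One way around the capitalisation mismatch is to compare names in lower case, then use whatever spelling the archived file really has. Here is a minimal Python sketch of that idea. The function names are made up for illustration, and it assumes no two archived files differ only by capitalisation.

```python
def build_case_map(local_names):
    """Map each lower-cased file name to the name as actually archived."""
    return {name.lower(): name for name in local_names}


def resolve(linked_name, case_map):
    """Return the archived spelling of linked_name, or None if it is missing."""
    return case_map.get(linked_name.lower())


archived = ["840-North-Michigan-c1929.jpg"]
case_map = build_case_map(archived)

print(resolve("840-north-michigan-c1929.jpg", case_map))
print(resolve("not-in-the-archive.jpg", case_map))
```

Anything that comes back as None is an image the blog links to but the archive does not hold, which is worth knowing before the old blog is deleted.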

To re-name all image paths in the .XML to a static directory, use a regular-expression search and replace in Notepad++…

Notepad++ sees the \d bit as ‘any digit’, so it matches the date numbers in each image path.
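
The same kind of pattern can be tried out in Python before running it on the real archive. This is a sketch only, not the exact expression used in Notepad++. It assumes image paths of the /2021/10/yourimage.jpg kind shown earlier, and the function name plus the "yoursite" and "my_new_blog" values are placeholders.

```python
import re


def flatten_image_paths(xml_text: str, site: str, new_base: str) -> str:
    """Point dated image links at one static folder, keeping the file names."""
    host = re.escape(f"{site}.files.wordpress.com")
    # \d{4} matches the four-digit year folder, \d{2} the two-digit month folder.
    pattern = re.compile(r"https?://" + host + r"/\d{4}/\d{2}/")
    target = new_base.rstrip("/") + "/"
    # A function is used as the replacement so target is inserted literally.
    return pattern.sub(lambda match: target, xml_text)


sample = '<img src="https://yoursite.files.wordpress.com/2021/10/yourimage.jpg" />'
print(flatten_image_paths(sample, "yoursite", "https://my_new_blog/oldimages"))
```

Because the year and month folders are dropped, two images with the same file name from different months would collide, which is why unique image names matter so much here.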

South America “virtually nonexistent” in Google Scholar

19 Thursday Jun 2014

Posted by David Haden in JURN's Google watch, Ooops!

“The dark side of open access in Google and Google Scholar: the case of Latin-American repositories”…

“the [study of the] presence and visibility of [a total of 137] Latin American repositories in Google and Google Scholar […] indicate[s] that the indexing ratio is low in Google, and virtually nonexistent in Google Scholar [with] a complete lack of correspondence between the repository records and the data produced by these two search tools.”

JURN is doing much better, in that regard, with a little help from Red Federada des Repositorios (which is comprehensively indexed by the main Google) and the general ‘open everything’ attitude to publishing scholarship in South America.

Trooclick

12 Thursday Jun 2014

Posted by David Haden in Spotted in the news

Trooclick, a new attempt at an auto-fisker for news facts, as a browser plug-in. Silly name, and still in invitation-only alpha. But it’s an interesting indication that it might be possible to make it work, with a little human curation along the way.

Beall wringing practice

12 Thursday Jun 2014

Posted by David Haden in Spotted in the news

The July 2014 issue of Cites & Insights swings the bell-ropes at the Beall list and the DOAJ, and listens for interesting overlaps and more — with an aim of making…

“the clear case that publishers on Beall’s list are not typical of OA [open access] as a whole or of DOAJ”

Few withdrawals from the World Bank

09 Monday Jun 2014

Posted by David Haden in How to improve academic search, Spotted in the news

Why we need both discoverability and long Plain English summaries (as well as short abstracts) for open academic work… “The solutions to all our problems may be buried in PDFs that nobody reads”. Admittedly, we are talking about World Bank reports, but in the ‘send a Congressman to sleep’ stakes I guess those can go head-to-head with many other academic papers.

On the NESTA

05 Thursday Jun 2014

Posted by David Haden in Spotted in the news

Fluff up your resume with an internship at Nesta in London…

Nice to know

05 Thursday Jun 2014

Posted by David Haden in Economics of Open Access, JURN's Google watch, Spotted in the news

The Court of Justice of the European Union (CJEU) declared today that UK and European Internet users are not acting illegally when simply browsing copyrighted material online.

The equivalent of the USA’s Supreme Court established that users engaged in “Temporary acts of reproduction … which are transient or incidental” (Article 5.1 of the EU Copyright Directive) — such as files automatically copied to a Web browser’s temporary cache and displayed on screen — must not be considered to be making illegal copies. This ruling now applies throughout the UK and Europe.

Earlier this year the EU ruled that hyperlinking to public content is not illegal, and this new ruling seems like the other side of that coin.

“Publish and be damned…”

05 Thursday Jun 2014

Posted by David Haden in Ooops!

If your shiny new journal is to be published by a commercial megapublisher, it may not be prudent to lead off the first issue with a paper detailing…

“the large profits made by commercial publishers on the back of academics’ labours”

Eco titles repaired

04 Wednesday Jun 2014

Posted by David Haden in Ecology additions

I used Linkbot to check all the Web links in JURN’s link list of ecology related titles in English, and made repairs. Please refresh any local copies that you may be keeping.

GeoDeepDive

04 Wednesday Jun 2014

Posted by David Haden in How to improve academic search, Spotted in the news

GeoDeepDive is software that helps…

geo-scientists extract data that is buried in the text, tables, and figures of journal articles and web sites […] As of today, GeoDeepDive has processed over 36K research papers and 134K web pages
