Google indexes images from PDF files. The service is fairly limited at present, possibly because the pictures all seem to be drawn from a small set of 500 PDFs stored at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/. But I’d guess that, as Google’s machine-learning algorithms get tuned up on this, we may start to see the service expanded to extract and serve images from wild PDFs. I wonder if there will be a Creative Commons filter for images from open access research PDFs? I also wonder if this may increase the size of the image pool accessible via JURN’s new Image Search feature?
[ Hat-tip: ResearchBuzz ]
Repozitar is a unified search tool for Czech open repositories. By default their keyword search only returns records which offer full-text. A nice touch, and it makes one wonder why the English-speaking world’s repository search tools seem to have such trouble offering this simple useful feature.
Repozitar is associated with a searchable nationwide registry of Czech theses, seemingly part of a Masaryk University project to help detect plagiarism in theses and student papers. English abstracts appear to be common in the very detailed record pages.
“[In a sample of] 3.5 million scholarly articles published between 1997 and 2012 [there is an] alarming link rot ratio for all three corpora: 13% of arXiv, 22% of Elsevier, and 14% of PMC articles published in 2012 suffer from link rot. These numbers only increase for older articles, for example, for articles published in 2005 the corresponding numbers are 18%, 41%, and 36%.”
Lots of press chatter in the last few days about a not-yet-public new academic search engine from Helsinki Institute for Information Technology called SciNet. It seems it’s been in development for some years. Here’s a screen capture of the UI sliders seen briefly in the video…
I seem vaguely to remember similar style experimental search interfaces, maybe ten years ago now.
But the sliders made me think I’d like to see Google offer such a set of fine-tuning sliders, to change a variety of their currently fixed or on/off search parameters. Although I guess that might then be gamed by the SEO hucksters to winkle out a few of the secrets of Google’s algorithm weightings.
A new embedded search tool for non-fiction writers, Bing Insights for MS Office. It only seems to work in MS Office Live, rather than as a plugin for older desktop installations of Office. Sadly I just couldn’t find the Insights feature at all in Office Live, when I went to test it. So perhaps it’s not yet been rolled out to the UK.
But it seems a neat idea, meaning that checking a basic fact no longer entails bouncing out of Word and into a Web browser. The search process also apparently inherits semantic nudges, drawn from the other words and phrases detected in the document. One wonders if the semantic data that Microsoft gain from this will, in time, improve the Bing Search service itself.
I’d expect the Open Source Office software suites to add this sort of fact-checking feature to their Word Processor soon, if they haven’t already (I couldn’t immediately find something similar for Open Office, LibreOffice, etc). Although their natural choice of partner, Wikipedia, might not be the most trustworthy source of facts.
A key element of online search literacy appears to be going backward, rather than forward. Results from 1,200 U.S. librarians surveyed in May 2014 appear to show a …
… 29.3 percent increase, over the past two years, in the perception that students have a rudimentary understanding of web evaluation. “[…] librarians feel students are now using the open web for research less than they did in 2012,” the report says, “[and] when students are on the open web, their evaluation skills are more lackluster.” […] 36.1 percent of the students surveyed felt that they had an advanced understanding of website evaluation, whereas only two percent of librarians considered their students to have a high degree of skill in the same area.
The respondents were librarians from across the core educational spectrum, from elementary through to four-year academic institutions. 31 percent were based in high schools.
Scholar Ninja, new from Jure Triglav…
I’ve started building a distributed search engine for scholarly literature. … What makes Scholar Ninja unique is that all of its functions (indexing, searching, and distributed server) are contained within a browser extension. [and thus hardened against censorship] “What?”, I can hear you say, “How can that be? Since when can a browser be a server?” Since 3 years ago, when the almighty WebRTC was born. … [Scholar Ninja] is completely contained within a browser extension: install it from the Chrome Web Store. … beware that this is alpha software and may break completely.
Why we need both discoverability and long Plain English summaries (as well as short abstracts) for open academic work… “The solutions to all our problems may be buried in PDFs that nobody reads”. Admittedly, we are talking about World Bank reports, but in the ‘send a Congressman to sleep’ stakes I guess those can go head-to-head with many other academic papers.
GeoDeepDive is software that helps…
geo-scientists extract data that is buried in the text, tables, and figures of journal articles and web sites […] As of today, GeoDeepDive has processed over 36K research papers and 134K web pages
David Prosser at Jisc blogs on the need for action on discoverability…
… 40% of researchers kicked off their project with a trawl through the Internet for material, while only 2% preferred to make a visit to a physical library space. [yet] nearly half of all items within digitised collections are not discoverable via major search engines by their name or title [and, even worse] digitised collections become harder and harder to find over time, for a variety of complex reasons.
Oaddo is an early alpha of a cool new search tool. Imagine that Wikipedia and Pinterest combined to give autocomplete a usability makeover, with Trello acting as the makeup girl. The aim is to help you do deep ‘research search’ when you don’t really know what you’re searching for.
It has an interesting way of allowing your search terms to interact with clustered semantic tags, for drilling down to the best search result. Sort of like a Google autocomplete / autosuggest that’s slowed way down and is largely under your control, and is curated by humans — and as a consequence is not dumb.
Oaddo has a nice clean interface too, which is neatly poised between power and simplicity. The developer Tim Borny has obviously been looking at Trello and Pinterest for inspiration. Although at the moment the discarding of search modifier tags takes two clicks, instead of a fun one-click “fling it to the discard tray” movement.
The other innovation is that it aims to have a democratic user-driven model. That aspect might take Oaddo a long way, provided there’s a critical mass of people, and provided a mechanism can be found to rein in the inevitable SEO spivs, ideological censors, and WikiPolice types.
* Users will ‘vote’ on content, curate content and the database of related terms.
* The community will drive the addition of new features.
So, very interesting. Amid the sea of recent search launches, this is actually one to watch. Here’s Tim Borny’s full explanation…
New long interview with Kathleen Shearer, Executive Director of COAR, on repositories. With a strong focus on discoverability as seen from a broad strategic perspective. From the intro and questions…
“locating and accessing content in OA repositories remains a hit and miss affair, and while many researchers now turn to Google and Google Scholar when looking for research papers, Google Scholar has not been as receptive to indexing repository collections as OA advocates had hoped. … 15 years after the Santa Fe meeting they [researchers] still find it extremely difficult, if not impossible, to search effectively in and across OA repositories”
From the interview…
“… ‘mega-journals’ are essentially repositories with overlay services. We should be participating in projects that demonstrate the added value of repositories and repository networks across the research life cycle.” (Kathleen Shearer)
The Bing search engine is now offering predictions…
“… teams within Bing have been experimenting with useful ways that we can harness the power of Bing to model outcomes of events. … Today we are bringing these insights directly to our search results pages. Based on a variety of different signals including search queries and social input from Facebook and Twitter, we are unveiling an experiment we’ve built to give you our prediction of the outcome of a given event.”
The front cover of the latest Smithsonian magazine also heralds the Future Studies meme…
I had a quick look at the full list of Schema.org tags, which are now available in Google CSEs. They can be used to filter the CSE’s site list, serving to “Restrict pages from the above site list to only those that contain [chosen] Schema.org types”. Handy if you have a huge single site of HTML/CSS/XML that you can grep, and you want to prepare it for selective CSE search without having to juggle directories and file names.
It looks to me like those tagging open access scholarly articles would need to be able to chain Schema.org tags into something like…
CreativeWork: ScholarlyArticle: TransferAction: DownloadAction: GiveAction:
Whereas paywall publishers might need something like:
CreativeWork: ScholarlyArticle: TransferAction: DownloadAction: SellAction:
But at present there seems to be only the basic undifferentiated…
Even if there were workable OA additions to Schema.org, there would still be the huge problems of: i) persuading people to add the tags to all their ongoing content at the article level, and to do so correctly and consistently; and ii) having them go back and accurately tag perhaps two decades or more of existing open access articles.
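For anyone wanting to check which pages on a big grep-able site already carry a given Schema.org type, here is a minimal sketch in Python. It assumes the pages declare their types with microdata itemtype attributes (JSON-LD and RDFa markup are not handled), and the function name schema_types and the sample HTML are purely my own illustration, nothing to do with Google’s CSE tooling.

```python
from html.parser import HTMLParser


class SchemaTypeCollector(HTMLParser):
    """Collect Schema.org type names found in microdata itemtype attributes."""

    def __init__(self):
        super().__init__()
        self.types = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Valueless attributes such as itemscope arrive with value None.
            if name == "itemtype" and value:
                # itemtype may hold several space-separated URLs.
                for url in value.split():
                    if "schema.org/" in url:
                        self.types.add(url.rsplit("/", 1)[-1])


def schema_types(html):
    """Return the set of Schema.org type names declared in an HTML string."""
    collector = SchemaTypeCollector()
    collector.feed(html)
    return collector.types


# Hypothetical sample page fragment, for illustration only.
sample = (
    '<div itemscope itemtype="http://schema.org/ScholarlyArticle">'
    '<span itemprop="name">Example title</span></div>'
)
print(schema_types(sample))
```

Run over a directory of saved pages, something like this would at least show how few pages carry any ScholarlyArticle markup at all, before one worries about finer OA distinctions.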
I found a 2013 article from geoscientists who had tested Google Scholar: “Literature searches with Google Scholar: Knowing what you are and are not getting”. Although the body of the paper states that their test phrase was “wildfire-related debris flows”, the data shows they actually tested Scholar with the keywords wildfire-related debris flows. They reportedly found that…
“free articles were available in PDF format for 88% of citations returned by Google Scholar. They were available from open-access journals or via links to organizational sites where authors had posted their publications.”
However, if you actually look at their linked search-results data file, the above statement needs additional clarification, since it’s clear that paywall articles from Elsevier, Springer and the like, appearing in their Scholar results, were being counted toward those “free articles”. It turns out that many of these were “free” only via a DigiTop proxy overlay for Scholar that is, in the words of DigiTop, “available to USDA employees only”. Nice if you work under the U.S. Department of Agriculture umbrella, but it seems that those outside have to pay.
Does Google Scholar perhaps need to add some kind of “paywall box detector” to its scraper bots? Then perhaps something like [PDF] [-||-] could be added on the right-hand column of the Scholar results, to indicate a PDF that’s “available maybe” — but which will prove to have a paywall that needs to be either backed out from or negotiated? And perhaps [PDF] [-~-] could indicate a genuine direct link to a bona fide PDF file?
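A very rough sketch of what the simplest possible “paywall box detector” might look like, in Python. It rests on one fact, that a genuine PDF file starts with the bytes %PDF-, and otherwise on assumptions of mine: the function name classify_pdf_link is hypothetical, and it expects to be handed the first few bytes of whatever a “PDF” link actually returned. I have no idea how Scholar’s own bots work, so this is only an illustration of the idea.

```python
def classify_pdf_link(body_start):
    """Label a fetched response using the markers suggested above.

    body_start: the first bytes of the response body (hypothetical input;
    how those bytes are fetched is left out of this sketch).
    """
    # A genuine PDF file begins with the magic bytes "%PDF-".
    if body_start.lstrip()[:5] == b"%PDF-":
        return "[PDF] [-~-]"   # a direct link to a bona fide PDF
    # Anything else (typically an HTML landing or login page) is "available maybe".
    return "[PDF] [-||-]"


print(classify_pdf_link(b"%PDF-1.4\n%..."))
print(classify_pdf_link(b"<!DOCTYPE html><html><head>"))
```

A real detector would need more than this, since a publisher can serve a one-page PDF that is itself just a purchase notice, but a byte check would catch the common case where the “PDF” link lands on an HTML paywall page.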
Anyway… this is what geoscientists are talking about when they refer to wildfire-related debris flows. Seems like it might be a geological process that intelligent farmers, hiker-campers, and treeline homesteaders around the world would like to learn some precise details about…
Giant mudslides, basically.
Incidentally, the same wildfire-related debris flows search in JURN needs to be tightened up just a little for strong results. Using wildfire-related “debris flows” works better, though the first six pages of good results do stray just a little (to pick up what seem to be three articles about prehistoric ‘dinosaur-era’ debris flow events). Yet even on this test JURN appears to be doing about twice as well as Google Scholar in terms of getting open articles, once Scholar’s ‘false-positive’ paywall PDFs from Elsevier & co. are subtracted from Scholar’s results.
Ten years ago, today…
JISC ITT commission: A study to forecast a delivery, management & access model for eprints & open access journals within Further and Higher Education. … Access should be streamlined and free at the point of use, irrespective of the source of content.
10th March 2004 12:00
Joseph Esposito has usefully had a peek inside a very expensive commercial market report titled Global Social Science & Humanities Publishing 2013-2014.
Social/Humanities publishing is found to be perhaps 25% of the size of Science/Technology/Medicine, at around $5bn. That actually strikes me as something of an achievement, when you consider that we have far smaller research funding inputs and a smaller technical/training infrastructure to call on. But perhaps the $5bn figure is given a strong boost by teacher training textbooks, social work manuals and the like?
Joseph highlights the report’s finding of a highly fragmented market. This market fragmentation is one of the reasons I’m sceptical about the success of a ‘one metadata to rule them all’ solution to OA indexing and discovery. It seems that DOAJ-listed OA journal titles can’t even find their way in full-text into the largest of commercial databases (such as EBSCO Complete) at levels higher than just over 20%. When last heard of, Web of Science / Scopus seemed to be barely scraping 1,000 OA arts and humanities titles indexed. One art history study found that Google Scholar could index only half the DOAJ’s OA art history titles. A dastardly conspiracy to keep OA titles out of these big indexes seems unlikely. So I suspect it’s largely due to many OA editors in the arts and humanities not giving a fig about providing the means to automatically index their content. Their widespread lack of something as basic as RSS feeds seems to confirm that. Add to that the fact that only 56% of DOAJ journals can supply the DOAJ with article metadata. Persuading non-librarian types to do something as simple as tagging all their back-issue content with some simple new machine-readable OA tag thus seems rather a long shot. Persuading mainstream publishers to do the same? Well… I suspect one would wait forever for that. Nor are librarians likely to be of much use after the fact of publication, since they seem to have mostly failed to apply even their own metadata standards to open content, and open repository metadata quality is reported to be dire.
Wouter has hacked out a Google Scholar API workflow today, sort of. I suspect the reason Scholar has never offered an API is the agreements Google has with the large commercial journal publishers and citation database providers.
A new study, “Google Scholar and DSpace”…
“The average indexing ratio [in Google Scholar] for our sample of 10 recent DSpace repositories is 64.8%”
I wonder if the interface presentation has an influence? http://circle.ubc.ca/ is totally hardcore in presentation and keywording, and is indexed at 99%. Whereas http://dash.harvard.edu/ has a more student-friendly blog-like look and feel to it, and is indexed at just 26% despite the harvard.edu domain. But perhaps not, as I guess it’s more likely due to the presence or otherwise of good machine-readable metadata.