1) Google often doesn’t seem to index quite everything at a site. Nor does it always index everything on a page or in a PDF file. Or perhaps it does index everything, but the algorithm that shapes each set of search results jettisons a few results for various reasons? The other possibility is that Google’s results are drawn from a pool of ‘shards’ of previous results, rather than direct from the core crawl data.
Solution: Google “Caffeine” and subsequent revamps?
2) Results from the main Google search can sometimes differ from those in your CSE. Your CSE will occasionally give radically fewer results from a site than the main Google does. Google doesn’t explain why this is, or the mechanism behind it. Perhaps there are several different versions of the Google index. Results are often much better when using a more sophisticated search method than simple keywords, such as searching for “phrases”. Sometimes you have to give up on trying to get your CSE to “see” the PDFs you want (although these are visible to the main Google), and instead find a way to index just the linked table-of-contents pages (which will usually show up in your CSE).
Solution: A lot of extra work. Google could offer a “full Google” CSE to worthy non-profits.
3) Academics love to store the real content at some location that has a different URL than their home-page does. An unoptimised CSE may thus index a website containing ten pages, but not the 10,000 articles that they point to.
Solution: A lot of extra work, of the sort that JURN has undertaken, to find and then optimise the real “content location URL”.
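As a rough illustration of what finding the “content location URL” involves, the sketch below takes a handful of sample article URLs and works out the deepest directory they all share. The function name, the example URLs and the “host/path/*” output style are assumptions made for illustration; this is not JURN’s actual method.

```python
from urllib.parse import urlsplit
import posixpath


def content_location_pattern(article_urls):
    """Return a 'host/path/*' pattern covering all sample URLs, or None.

    None is returned when the list is empty or the articles sit on
    more than one host, since no single pattern would cover them.
    """
    parts = [urlsplit(u) for u in article_urls]
    hosts = {p.netloc.lower() for p in parts}
    if len(hosts) != 1:
        return None

    # Compare directory segments only, ignoring each article's file name.
    dirs = [posixpath.dirname(p.path).strip("/").split("/") for p in parts]
    common = []
    for segments in zip(*dirs):
        if len(set(segments)) != 1:
            break
        common.append(segments[0])

    host = hosts.pop()
    prefix = "/".join(s for s in common if s)
    return f"{host}/{prefix}/*" if prefix else f"{host}/*"


samples = [
    "http://archive.example.edu/journals/hist/v1/a1.pdf",
    "http://archive.example.edu/journals/hist/v2/a7.pdf",
]
pattern = content_location_pattern(samples)
```

The resulting pattern still needs checking by hand, since a shared directory can also hold material you don’t want indexed.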
4) Initial URL gathering can be arduous. Techies and web editorial staff at universities love to juggle directory structures, often for no discernible reason, and thus break links. Link-rot is severe in ejournal lists from more than two years ago, and lists over four years old often have around 80% dead links.
Solution: Techies need to set up robust redirects if they really have to break URLs. Links-list pages could also carry “self-destruct tags” that delete the page after a certain date, if it hasn’t been updated for more than two years.
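No such “self-destruct tag” exists as a standard, so the following is only a sketch of the date check that might sit behind one. The function name and the 730-day threshold (roughly two years) are assumptions.

```python
from datetime import date


def is_stale(last_updated, today, max_age_days=730):
    """True if the page has gone longer than max_age_days without an update."""
    return (today - last_updated).days > max_age_days


# Hypothetical links-list pages and their last-updated dates.
pages = {
    "old-ejournal-list.html": date(2006, 11, 3),
    "current-ejournal-list.html": date(2009, 6, 1),
}
today = date(2010, 1, 1)
to_retire = [name for name, updated in pages.items() if is_stale(updated, today)]
```

A gentler variant would flag the page for review rather than delete it.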
5) Google CSEs cannot pick out specific content (e.g. a run of journal issues) from the meaningless database-driven URLs commonly found in academic repositories, since there is no repeating URL structure to grab onto. It’s a question of indexing “all or nothing”.
Solution: URL re-mapping services that are recognised and can be “unwrapped” by Google? Plain HTML “overlay” TOCs.
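The “all or nothing” problem can be shown with a small sketch. Python’s fnmatch stands in here for wildcard URL patterns. It is an assumption that this behaves like Google’s own pattern matching, and the URLs are invented.

```python
from fnmatch import fnmatchcase


def select(urls, pattern):
    """Return the URLs that match a wildcard pattern such as 'host/path/*'."""
    return [u for u in urls if fnmatchcase(u, pattern)]


# Directory-style URLs: one volume can be picked out by its path.
structured = [
    "journal.example.org/vol12/issue1/smith.pdf",
    "journal.example.org/vol12/issue2/jones.pdf",
    "journal.example.org/vol13/issue1/lee.pdf",
]

# Database-driven URLs: the IDs say nothing about journal or issue.
opaque = [
    "repo.example.edu/view?id=88213",
    "repo.example.edu/view?id=10457",
    "repo.example.edu/view?id=88214",
]

volume_12 = select(structured, "journal.example.org/vol12/*")
everything = select(opaque, "repo.example.edu/*")
```

With the structured URLs a single pattern isolates volume 12. With the opaque ones, the only pattern that matches more than one item matches the whole repository.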
6) Editors don’t enforce proper file-names on published documents, which means many CSE search results are titled in the Google results as something like “&63! print_only sh4d7gh.indd” rather than “My Useful Title”. Nor do people add the home location URL and website title to the body of their document — which means that scholars can waste several minutes per article trying to find out where it came from. Some students may never manage to find the journal title for the article they downloaded.
Solution: Better publication standards at open access and independent ejournals.
7) Large Google CSEs are easy to make, but take a lot of hand-crafting to properly optimise and maintain. “Dead” CSEs from late 2006, when the CSE service first appeared, litter the web. Most of these were also un-optimised. Despite the potential of CSEs, it’s really hard to find large subject-specific CSEs that are both optimised and maintained. Most people now seem to use CSEs for indexing a single site or a small cluster of sites that they own.
Solution: Users should remove old circa-2006 CSEs from the web. Subject-specific academic and business groups should consider building a collaborative CSE rather than a wiki.
8) Google’s search result ranking doesn’t work as well as it might in tightly defined academic searches. The ranking appears to “spread the results” evenly across a variety of sites, and thus you’ll rarely see results from just one site dominating the first ten hits, although that may be exactly what a tight academic search requires.
Solution: For some types of CSE, this could probably be solved by delving into the optimisation features that Google offers for linked CSEs. Update: Google appears to have tweaked the algorithms to fix this problem.
9) Google searches have a problem with finding text at the end of long article titles, of the kind which are common in academia.
Solution: Authors and publishers should work to keep article and page titles under 50 characters.
10) You can’t have your CSE do a “search within search results”.
Solution: Manually build a set of pages containing the result URLs you want indexed, then get Google to see these as static pages which can then be added to your CSE.
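The manual step above could be partly scripted. This is a minimal sketch that turns a hand-picked list of result URLs into a plain static HTML page, which would then be uploaded somewhere Google can crawl it. The function name, page title and URLs are all hypothetical.

```python
from html import escape


def build_links_page(title, results):
    """Build a static HTML page from a list of (url, label) pairs."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(label)}</a></li>'
        for url, label in results
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{escape(title)}</title></head>\n'
        f"<body>\n<h1>{escape(title)}</h1>\n<ul>\n{items}\n</ul>\n</body></html>\n"
    )


page = build_links_page(
    "Saved results: medieval trade",
    [("http://journal.example.org/view?id=12&lang=en", "Trade routes <draft>")],
)
```

Whether and how quickly Google picks up such a page is outside the script’s control.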