A quick check of the front-page statement “Our corpus currently includes only computer science papers” on Paul Allen’s Semantic Scholar shows that it’s no longer quite true. “Our corpus is mostly computer science papers and… a whole lot of other stuff that the A.I. dragged in” might be a more apt statement.

Semantic Scholar is definitely now ranging more widely in science, looking for fulltext PDFs. I’d guess that its A.I. is working outward from highly-cited papers and ferreting among their citations to try to dig up the fulltext for each. That would explain what appears to be the eclectic nature of Semantic Scholar’s spread away from computer science. On searches for ecology and other not-computer-sci stuff I very easily found a PowerPoint in PDF, a workshop presentation, even a saved print-to-PDF of a book reviews page in Science, as well as PDF papers from MDPI and ResearchGate, plus really obscurely self-archived and departmentally archived PDFs.

That kind of scattergun approach and lack of judicious curation seems to me to be the sign of a self-learning A.I. in action.

Interestingly, Semantic Scholar also seems to be taking a very bold ‘sod the source’ approach. Each record page has no indication at all of where the fulltext was found, and they’re hosting the harvested PDFs themselves (circa 250,000 at present, going by an approximate count from Google). It may be just as well they’re not flagging up the sources, since they can include… oh dear… a paper that is $35.95 from Elsevier’s ScienceDirect, but free from Semantic Scholar.

Yes, Google finds Semantic Scholar hosting and giving away some 28,000 articles from Elsevier journals as PDFs, and none of the few PDFs I checked at random were flagged at Elsevier as being OA CC-BY articles. Which doesn’t mean they’re not OA. Elsevier has been known to be happy to take money for articles that should be OA. Anyway, Google this and see what you think…

site:https://pdfs.semanticscholar.org/ sciencedirect