A quick check of the front-page statement “Our corpus currently includes only computer science papers” on Paul Allen’s Semantic Scholar shows that it’s no longer quite true. “Our corpus is mostly computer science papers and… a whole lot of other stuff that the A.I. dragged in” might be a more apt statement.

Semantic Scholar is now definitely ranging more widely across science, looking for fulltext PDFs. I’d guess that its A.I. is working outward from highly-cited papers and ferreting among their citations to try to dig up the fulltext for each. That would explain the seemingly eclectic nature of Semantic Scholar’s spread away from computer science. On searches for ecology and other not-computer-sci stuff I very easily found a PowerPoint saved as a PDF, a workshop presentation, even a saved print-to-PDF of a book reviews page in Science


… as well as PDF papers from MDPI and ResearchGate, plus some really obscure self-archived and departmentally archived PDFs. That kind of scattergun approach and lack of judicious curation seems to me to be the sign of a self-learning baby A.I. in action.