Title: Discovering research articles containing evolutionary timetrees by machine learning
Motivation: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree project has been manually locating, curating, and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically.

Results: We have developed an optimized machine learning (ML) system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TimeTree resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding the mentioning of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TimeTree database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs.

Availability: https://github.com/marija-stanojevic/time-tree-classification

Contact: {marija.stanojevic, s.kumar, zoran.obradovic}@temple.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
Authors:
Award ID(s):
1932765
Publication Date:
NSF-PAR ID:
10394917
Journal Name:
Bioinformatics
Volume:
39
ISSN:
1367-4811
Sponsoring Org:
National Science Foundation
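
The excerpt-based classification described in the abstract lends itself to a short illustration. Below is a minimal sketch of that idea: collect text windows around figure mentions and score them with a fine-tuned BERT sequence classifier, flagging the article if any excerpt is positive. The regex, window size, and checkpoint path are assumptions for illustration, not the authors' published configuration; the actual TTF code is in the linked repository.

```python
import re
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pattern for figure mentions; the exact pattern TTF uses is an assumption here.
FIGURE_RE = re.compile(r"\b(?:Fig\.|Figure)\s*\d+", re.IGNORECASE)
WINDOW = 500  # characters of context kept on each side of a mention (assumed)

def figure_excerpts(full_text: str) -> list:
    """Collect text windows surrounding every figure mention in an article."""
    return [
        full_text[max(0, m.start() - WINDOW): m.end() + WINDOW]
        for m in FIGURE_RE.finditer(full_text)
    ]

def contains_timetree(full_text: str, model, tokenizer) -> bool:
    """Flag an article as timetree-containing if any excerpt scores positive."""
    excerpts = figure_excerpts(full_text)
    if not excerpts:
        return False
    batch = tokenizer(excerpts, truncation=True, padding=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits  # shape: (num_excerpts, 2)
    return bool((logits.argmax(dim=-1) == 1).any())

# Usage (the checkpoint path is hypothetical):
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# model = AutoModelForSequenceClassification.from_pretrained("path/to/ttf-checkpoint")
```
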
More Like this
  1. Abstract Purpose

    The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in name spelling of individual scholars further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors sharing the same surname and first initial of their first name. We illustrate our approach using three case studies.

    Design/methodology/approach

    The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net, i.e., a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his or her first name (e.g., a search for ‘John Doe’ would assume the form: ‘Doe, J’). Results for this search typically contain all of the scholarship legitimately belonging to this author in the given database (i.e., all of his or her true positives), along with a large amount of noise, or scholarship not belonging to this author (i.e., a large number of false positives). From this corpus we proceed to iteratively weed out false positives and retain true positives. Author identifiers provide a good starting point; e.g., if ‘Doe, J’ and ‘Doe, John’ share the same author identifier, this would be sufficient for us to conclude these are one and the same individual. We find email addresses similarly adequate; e.g., if two author names which share the same surname and same first initial have an email address in common, we conclude these authors are the same person. Author identifier and email address data are not always available, however. When this occurs, other fields are used to address the author uncertainty problem. Commonalities among author data other than unique identifiers and email addresses are less conclusive for name consolidation purposes. For example, if ‘Doe, John’ and ‘Doe, J’ have an affiliation in common, do we conclude that these names belong to the same person? They may or may not; a single affiliation may employ two or more faculty members sharing the same surname and first initial. Similarly, it’s conceivable that two individuals with the same last name and first initial publish in the same journal, publish with the same co-authors, and/or cite the same references. Should we then ignore commonalities among these fields and conclude they’re too imprecise for name consolidation purposes? It is our position that such commonalities are indeed valuable for addressing the author uncertainty problem, but more so when used in combination. Our approach makes use of automation as well as manual inspection, relying initially on author identifiers, then commonalities among fielded data other than author identifiers, and finally manual verification.

    To achieve name consolidation independent of author identifier matches, we have developed a procedure that is used with bibliometric software called VantagePoint (see www.thevantagepoint.com). While the application of our technique does not exclusively depend on VantagePoint, it is the software we found most efficient in this study. The script we developed is designed to implement our name disambiguation procedure in a way that significantly reduces manual effort on the user’s part. Those who seek to replicate our procedure independent of VantagePoint can do so by manually following the method we outline, but we note that the manual application of our procedure takes a significant amount of time and effort, especially when working with larger datasets. Our script begins by prompting the user for a surname and a first initial (for any author of interest). It then prompts the user to select a WOS field on which to consolidate author names. After this the user is prompted to point to the name of the authors field, and finally asked to identify a specific author name (referred to by the script as the primary author) within this field whom the user knows to be a true positive (a suggested approach is to point to an author name associated with one of the records that has the author’s ORCID iD or email address attached to it). The script proceeds to identify and combine all author names sharing the primary author’s surname and first initial of his or her first name who share commonalities in the WOS field on which the user was prompted to consolidate author names. This typically results in a significant reduction in the initial dataset size. After the procedure completes, the user is usually left with a much smaller (and more manageable) dataset to manually inspect (and/or apply additional name disambiguation techniques to).

    Research limitations

    Match field coverage can be an issue. When field coverage is paltry, dataset reduction is not as significant, which results in more manual inspection on the user’s part. Our procedure doesn’t lend itself to scholars who have had a legal family name change (after marriage, for example). Moreover, the technique we advance is (sometimes, but not always) likely to have a difficult time dealing with scholars who have changed careers or fields dramatically, as well as scholars whose work is highly interdisciplinary.

    Practical implications

    The procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research, especially when the name under consideration is a more common family name. It is more effective when match field coverage is high and a number of match fields exist.

    Originality/value

    Once again, the procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research. It combines preexisting with more recent approaches, harnessing the benefits of both.

    Findings

    Our study applies the name disambiguation procedure we advance to three case studies. Ideal match fields are not the same for each of our case studies. We find that match field effectiveness is in large part a function of field coverage. The original dataset sizes, the timeframes analyzed, and the subject areas in which the authors publish also differ across the three case studies. Our procedure is more effective when applied to our third case study, both in terms of list reduction and 100% retention of true positives. We attribute this to excellent match field coverage, especially in more specific match fields, as well as a more modest/manageable number of publications. While machine learning is considered authoritative by many, we do not see it as practical or replicable. The procedure advanced herein is practical, replicable, and relatively user-friendly. It might be categorized into a space between ORCID and machine learning. Machine learning approaches typically look for commonalities among citation data, which is not always available, structured, or easy to work with. The procedure we advance is intended to be applied across numerous fields in a dataset of interest (e.g., emails, co-authors, affiliations, etc.), resulting in multiple rounds of reduction. Results indicate that effective match fields include author identifiers, emails, source titles, co-authors, and ISSNs. While the script we present is not likely to result in a dataset consisting solely of true positives (at least for more common surnames), it does significantly reduce manual effort on the user’s part. Dataset reduction (after our procedure is applied) is in large part a function of (a) field availability and (b) field coverage.
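
    As a rough illustration of the consolidation idea in this abstract (not the VantagePoint script itself), the sketch below grows the primary author's cluster one match field at a time, retaining any record that shares a value with the cluster in that field; everything outside the returned cluster is what would remain for manual inspection. Column names and the semicolon-delimited multi-value convention are assumptions about a hypothetical flat WOS export.

    ```python
    import pandas as pd

    # Match fields the abstract reports as effective; names are assumptions.
    MATCH_FIELDS = ["orcid", "email", "source_title", "coauthors", "issn"]

    def consolidate(records: pd.DataFrame, primary_idx: int) -> pd.DataFrame:
        """Grow the primary author's cluster one field at a time
        (one 'round' of reduction per match field)."""
        cluster = {primary_idx}

        def values(cell) -> set:
            # Split a semicolon-delimited cell into a set of clean values.
            if pd.isna(cell):
                return set()
            return {v.strip() for v in str(cell).split(";") if v.strip()}

        for field in MATCH_FIELDS:
            known = set().union(*(values(records.loc[i, field]) for i in cluster))
            for idx, row in records.iterrows():
                if values(row[field]) & known:
                    cluster.add(idx)
        return records.loc[sorted(cluster)]

    # Usage: start from a broad 'Doe, J' download and point at one record
    # known to be a true positive (e.g., one carrying the author's ORCID iD).
    # kept = consolidate(wos_results, primary_idx=0)
    ```
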
  2. Abstract We present the fifth edition of the TimeTree of Life resource (TToL5), a product of the timetree of life project that aims to synthesize published molecular timetrees and make evolutionary knowledge easily accessible to all. Using the TToL5 web portal, users can retrieve published studies and divergence times between species, the timeline of a species’ evolution beginning with the origin of life, and the timetree for a given evolutionary group at the desired taxonomic rank. TToL5 contains divergence time information on 137,306 species, 41% more than the previous edition. The TToL5 web interface is now Americans with Disabilities Act-compliant and mobile-friendly, a result of comprehensive source code refactoring. TToL5 also offers programmatic access to species divergence times and timelines through an application programming interface, which is accessible at timetree.temple.edu/api. TToL5 is publicly available at timetree.org.
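
    The abstract mentions programmatic access at timetree.temple.edu/api. A minimal sketch of such a query follows; only the base URL comes from the abstract, while the endpoint path and response shape below are hypothetical placeholders, not the documented API.

    ```python
    import requests

    BASE = "https://timetree.temple.edu/api"  # from the abstract

    def divergence_time(taxon_a: str, taxon_b: str) -> dict:
        """Fetch divergence-time data for a species pair (hypothetical route)."""
        resp = requests.get(f"{BASE}/pairwise/{taxon_a}/{taxon_b}", timeout=30)
        resp.raise_for_status()
        return resp.json()

    # Usage (illustrative only; consult the API documentation for real routes):
    # divergence_time("Homo sapiens", "Pan troglodytes")
    ```
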
  3. It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keyword-to-ontology-concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produces RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component, and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform.

    The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken. We also report on how a custom color palette of RGB values, obtained from real specimens or high-quality specimen images, can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire system of the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies.
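
    To make the RDF output this abstract describes concrete, here is a minimal sketch of one recorded character expressed as triples, in the spirit of an Entity-Quality statement. The namespace, term IRIs, and measurement values are illustrative placeholders, not the platform's actual ontology.

    ```python
    from rdflib import Graph, Literal, Namespace, RDF

    # Hypothetical namespace standing in for the author-driven ontology.
    EX = Namespace("http://example.org/carex-ontology/")

    g = Graph()
    obs = EX.observation_1
    g.add((obs, RDF.type, EX.CharacterObservation))
    g.add((obs, EX.entity, EX.perigynium_beak))  # the Entity being described
    g.add((obs, EX.quality, EX.length))          # the Quality measured
    g.add((obs, EX.value, Literal("4.2")))       # measurement value (made up)
    g.add((obs, EX.unit, Literal("mm")))

    print(g.serialize(format="turtle"))
    ```
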
  4. Introduction: Vaso-occlusive crises (VOCs) are a leading cause of morbidity and early mortality in individuals with sickle cell disease (SCD). These crises are triggered by sickle red blood cell (sRBC) aggregation in blood vessels and are influenced by factors such as enhanced sRBC and white blood cell (WBC) adhesion to inflamed endothelium. Advances in microfluidic biomarker assays (i.e., SCD Biochip systems) have led to clinical studies of blood cell adhesion onto endothelial proteins, including fibronectin, laminin, P-selectin, and ICAM-1, functionalized in microchannels. These microfluidic assays allow mimicking the physiological aspects of human microvasculature and help characterize biomechanical properties of adhered sRBCs under flow. However, analysis of the microfluidic biomarker assay data has so far relied on manual cell counting and exhaustive visual morphological characterization of cells by trained personnel. Integrating deep learning algorithms with microscopic imaging of adhesion-protein-functionalized microfluidic channels can accelerate and standardize accurate classification of blood cells in microfluidic biomarker assays. Here we present a deep learning approach, built into a general-purpose analytical tool, covering a wide range of conditions: channels functionalized with different proteins (laminin or P-selectin), with varying degrees of adhesion by both sRBCs and WBCs, and in both normoxic and hypoxic environments.

    Methods: Our neural networks were trained on a repository of manually labeled SCD Biochip microfluidic biomarker assay whole-channel images. Each channel contained adhered cells pertaining to clinical whole blood under a constant shear stress of 0.1 Pa, mimicking physiological levels in post-capillary venules. The machine learning (ML) framework consists of two phases: Phase I segments pixels belonging to blood cells adhered to the microfluidic channel surface, while Phase II associates pixel clusters with specific cell types (sRBCs or WBCs). Phase I is implemented through an ensemble of seven generative fully convolutional neural networks, and Phase II is an ensemble of five neural networks based on a ResNet50 backbone. Each pixel cluster is given a probability of belonging to one of three classes: adhered sRBC, adhered WBC, or non-adhered/other.

    Results and Discussion: We applied our trained ML framework to 107 novel whole-channel images not used during training and compared the results against counts from human experts. As seen in Fig. 1A, there was excellent agreement in counts across all protein and cell types investigated: sRBCs adhered to laminin, sRBCs adhered to P-selectin, and WBCs adhered to P-selectin. Not only was the approach able to handle surfaces functionalized with different proteins, but it also performed well for high-cell-density images (up to 5000 cells per image) in both normoxic and hypoxic conditions (Fig. 1B). The average uncertainty for the ML counts, obtained from accuracy metrics on the test dataset, was 3%. This uncertainty is a significant improvement on the 20% average uncertainty of the human counts, estimated from the variance in repeated manual analyses of the images. Moreover, manual classification of each image may take up to 2 hours, versus about 6 minutes per image for the ML analysis. Thus, ML provides greater consistency in the classification at a fraction of the processing time. To assess which features the network used to distinguish adhered cells, we generated class activation maps (Fig. 1C-E). These heat maps indicate the regions of focus for the algorithm in making each classification decision. Intriguingly, the highlighted features were similar to those used by human experts: the dimple in partially sickled RBCs, the sharp endpoints of highly sickled RBCs, and the uniform curvature of the WBCs. Overall, the robust performance of the ML approach in our study sets the stage for generalizing it to other endothelial proteins and experimental conditions, a first step toward a universal microfluidic ML framework targeting blood disorders. Such a framework would not only be able to integrate advanced biophysical characterization into fast, point-of-care diagnostic devices, but also provide a standardized and reliable way of monitoring patients undergoing targeted therapies and curative interventions, including stem cell and gene-based therapies for SCD.

    Disclosures: Gurkan: Dx Now Inc.: Patents & Royalties; Xatek Inc.: Patents & Royalties; BioChip Labs: Patents & Royalties; Hemex Health, Inc.: Consultancy, Current Employment, Patents & Royalties, Research Funding.
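
    The two-phase pipeline in this abstract can be summarized in a short sketch: Phase I averages pixel-wise foreground probabilities from the segmentation ensemble, and Phase II averages class probabilities from the classifier ensemble over each segmented pixel cluster. The model callables and the crop-extraction helper below are placeholders; training, preprocessing, and connected-component extraction are omitted.

    ```python
    import numpy as np

    CLASSES = ["adhered_sRBC", "adhered_WBC", "other"]

    def phase1_segment(image: np.ndarray, seg_models) -> np.ndarray:
        """Phase I: average foreground maps from the segmentation ensemble
        (seven FCNs in the abstract) and threshold to a cell-pixel mask."""
        prob = np.mean([m(image) for m in seg_models], axis=0)  # HxW in [0, 1]
        return prob > 0.5

    def phase2_classify(crop: np.ndarray, cls_models) -> str:
        """Phase II: average class probabilities from the classifier ensemble
        (five ResNet50-based networks in the abstract) for one pixel cluster."""
        probs = np.mean([m(crop) for m in cls_models], axis=0)  # shape (3,)
        return CLASSES[int(np.argmax(probs))]

    def count_cells(image, seg_models, cls_models, crops_from_mask) -> dict:
        """Count adhered sRBCs and WBCs in one whole-channel image."""
        mask = phase1_segment(image, seg_models)
        counts = dict.fromkeys(CLASSES, 0)
        for crop in crops_from_mask(image, mask):  # connected-component crops
            counts[phase2_classify(crop, cls_models)] += 1
        return counts
    ```
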