
Title: Discovering research articles containing evolutionary timetrees by machine learning
Abstract Motivation

Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree (TT) project has been manually locating, curating and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically.

Results

We have developed an optimized machine learning system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TT resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding mentions of figures. The new method is implemented in the TimeTreeFinder (TTF) tool, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TT database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs.
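As a rough illustration of the excerpt idea described above, the following Python sketch collects text windows around figure mentions, which could then be passed to a classifier instead of the whole article. The regular expression, the `window` size, and the function name `figure_excerpts` are assumptions made for this example; the abstract does not state how TTF selects or sizes its excerpts.

```python
import re
from typing import List

# Assumed pattern for figure mentions such as "Fig. 2", "Figs 1", or "Figure 3".
FIGURE_MENTION = re.compile(r"\bfig(?:ure)?s?\.?\s*\d+", re.IGNORECASE)


def figure_excerpts(text: str, window: int = 200) -> List[str]:
    """Return character windows around each figure mention.

    `window` (hypothetical default) is the number of characters kept on
    each side of a mention. Overlapping windows are merged into one excerpt.
    """
    spans = []
    for match in FIGURE_MENTION.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        if spans and start <= spans[-1][1]:
            # This window overlaps the previous one, so extend it.
            spans[-1] = (spans[-1][0], max(spans[-1][1], end))
        else:
            spans.append((start, end))
    return [text[s:e].strip() for s, e in spans]


if __name__ == "__main__":
    article = (
        "We sampled 40 species. Divergence times are shown in Fig. 2, "
        "a dated phylogeny calibrated with fossils."
    )
    for excerpt in figure_excerpts(article, window=40):
        print(excerpt)
```

Merging overlapping windows keeps closely spaced mentions (for example, two panels of the same figure) in a single excerpt rather than duplicating text.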

Availability and implementation

https://github.com/marija-stanojevic/time-tree-classification.

Supplementary information

Supplementary data are available at Bioinformatics online.

NSF-PAR ID:
10394306
Journal Name:
Bioinformatics
Volume:
39
Issue:
1
ISSN:
1367-4803
Publisher:
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Motivation: Timetrees depict evolutionary relationships between species and the geological times of their divergence. Hundreds of research articles containing timetrees are published in scientific journals every year. The TimeTree project has been manually locating, curating, and synthesizing timetrees from these articles for almost two decades into a TimeTree of Life, delivered through a unique, user-friendly web interface (timetree.org). The manual process of finding articles containing timetrees is becoming increasingly expensive and time-consuming. So, we have explored the effectiveness of text-mining approaches and developed optimizations to find research articles containing timetrees automatically. Results: We have developed an optimized machine learning (ML) system to determine if a research article contains an evolutionary timetree appropriate for inclusion in the TimeTree resource. We found that BERT classification fine-tuned on whole-text articles achieved an F1 score of 0.67, which we increased to 0.88 by text-mining article excerpts surrounding mentions of figures. The new method is implemented in the TimeTreeFinder tool, TTF, which automatically processes millions of articles to discover timetree-containing articles. We estimate that the TTF tool would produce twice as many timetree-containing articles as those discovered manually, whose inclusion in the TimeTree database would potentially double the knowledge accessible to a wider community. Manual inspection showed that the precision on out-of-distribution recently published articles is 87%. This automation will speed up the collection and curation of timetrees with much lower human and time costs. Availability: https://github.com/marija-stanojevic/time-tree-classification Contact: {marija.stanojevic, s.kumar, zoran.obradovic}@temple.edu Supplementary information: Supplementary data are available at Bioinformatics online.
  2. Abstract Motivation

    Cellular Electron CryoTomography (CECT) enables 3D visualization of cellular organization in a near-native state and at sub-molecular resolution, making it a powerful tool for analyzing structures of macromolecular complexes and their spatial organizations inside single cells. However, the high degree of structural complexity, together with practical imaging limitations, makes the systematic de novo discovery of structures within cells challenging. It would likely require averaging and classifying millions of subtomograms potentially containing hundreds of highly heterogeneous structural classes. Although it is no longer difficult to acquire CECT data containing such numbers of subtomograms due to advances in data acquisition automation, existing computational approaches have very limited scalability or discrimination ability, making them incapable of processing such amounts of data.

    Results

    To complement existing approaches, in this article we propose a new approach for subdividing subtomograms into smaller but relatively homogeneous subsets. The structures in these subsets can then be separately recovered using existing computation-intensive methods. Our approach is based on supervised structural feature extraction using deep learning, in combination with unsupervised clustering and reference-free classification. Our experiments show that, compared with existing unsupervised rotation-invariant feature and pose-normalization-based approaches, our new approach achieves significant improvements in both discrimination ability and scalability. More importantly, our new approach is able to discover new structural classes and recover structures that do not exist in training data.

    Availability and Implementation

    Source code freely available at http://www.cs.cmu.edu/~mxu1/software.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  3. Abstract Summary

    Multiple sequence alignment is a basic part of many bioinformatics pipelines, including phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions in the evolutionary history of the sequences and because of the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP one of the first methods to achieve good accuracy, and WITCH a recent improvement on UPP for accuracy. In this article, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) with a polynomial-time exact algorithm using Smith–Waterman. Our new method, WITCH-NG (i.e. ‘next generation WITCH’), achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG.

    Availability and implementation

    The datasets used in this study are from prior publications and are freely available in public repositories, as indicated in the Supplementary Materials.

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

  4. Abstract

    This article examines the automatic identification of Web registers, that is, text varieties such as news articles and reviews. Most studies have focused on corpora restricted to include only preselected classes with well-defined characteristics. These corpora feature only a subset of documents found on the unrestricted open Web, for which register identification has been particularly difficult because the range of linguistic variation on the Web is known to be substantial. As part of this study, we present the first open release of the Corpus of Online Registers of English (CORE), which is drawn from the unrestricted open Web and, currently, is the largest collection of manually annotated Web registers. Furthermore, we demonstrate that the CORE registers can be automatically identified with competitive results, with the best performance being an F1-score of 68% with the deep learning model BERT. The best performance was achieved using two modeling strategies. The first one involved modeling the registers using propagated register labels, that is, repeating the main register label along with its corresponding subregister label in a multilabel model. In the second one, we explored how the length of the document affects model performance, discovering that the beginning provided superior classification accuracy. Overall, the current study presents a systematic approach for the automatic identification of a large number of Web registers from the unrestricted Web, hence providing new pathways for future studies.

  5. Abstract We present the fifth edition of the TimeTree of Life resource (TToL5), a product of the timetree of life project that aims to synthesize published molecular timetrees and make evolutionary knowledge easily accessible to all. Using the TToL5 web portal, users can retrieve published studies and divergence times between species, the timeline of a species’ evolution beginning with the origin of life, and the timetree for a given evolutionary group at the desired taxonomic rank. TToL5 contains divergence time information on 137,306 species, 41% more than the previous edition. The TToL5 web interface is now Americans with Disabilities Act-compliant and mobile-friendly, a result of comprehensive source code refactoring. TToL5 also offers programmatic access to species divergence times and timelines through an application programming interface, which is accessible at timetree.temple.edu/api. TToL5 is publicly available at timetree.org.