This content will become publicly available on July 1, 2025

Title: Applications of machine learning in phylogenetics
Machine learning has increasingly been applied to a wide range of questions in phylogenetic inference. Supervised machine learning approaches that rely on simulated training data have been used to infer tree topologies and branch lengths, to select substitution models, and to perform downstream inferences of introgression and diversification. Here, we review how researchers have used several promising machine learning approaches to make phylogenetic inferences. Despite the promise of these methods, several barriers prevent supervised machine learning from reaching its full potential in phylogenetics. We discuss these barriers and potential paths forward. In the future, we expect that the application of careful network designs and data encodings will allow supervised machine learning to accommodate the complex processes that continue to confound traditional phylogenetic methods.
Award ID(s): 1936187
NSF-PAR ID: 10510513
Author(s) / Creator(s):
Publisher / Repository: Elsevier
Date Published:
Journal Name: Molecular Phylogenetics and Evolution
Volume: 196
Issue: C
ISSN: 1055-7903
Page Range / eLocation ID: 108066
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. In tribology, a considerable number of computational and experimental approaches to understand the interfacial characteristics of material surfaces in motion and tribological behaviors of materials have been considered to date. Despite being useful in providing important insights on the tribological properties of a system at different length scales, a vast amount of data generated from these state-of-the-art techniques remains underutilized due to lack of analysis methods or limitations of existing analysis techniques. In principle, this data can be used to address intractable tribological problems, including structure–property relationships in tribological systems and efficient lubricant design, in a cost- and time-effective manner with the aid of machine learning. Specifically, data-driven machine learning methods have shown potential in unraveling complicated processes through the development of structure–property/functionality relationships based on the collected data. For example, neural networks are incredibly effective in modeling non-linear correlations and identifying primary hidden patterns associated with these phenomena. Here we present several exemplary studies that have demonstrated the proficiency of machine learning in understanding these critical factors. The successful implementation of neural networks and of supervised and stochastic learning approaches in identifying structure–property relationships has shed light on how machine learning may be used in certain tribological applications. Moreover, ranging from the design of lubricants, composites, and experimental processes to studying fretting wear and frictional mechanisms, machine learning has been embraced either independently or integrated with optimization algorithms by scientists to study tribology. Accordingly, this review aims at providing a perspective on the recent advances in the applications of machine learning in tribology.
    This review of the referenced simulation approaches and subsequent applications of machine learning in experimental and computational tribology should motivate researchers to adopt machine learning as an efficient approach to studying tribology.
  2. Abstract Motivation

    The application of machine learning approaches in phylogenetics has been impeded by the vast model space associated with inference. Supervised machine learning approaches require data from across this space to train models. Because of this, previous approaches have typically been limited to inferring relationships among unrooted quartets of taxa, where there are only three possible topologies. Here, we explore the potential of generative adversarial networks (GANs) to address this limitation. GANs consist of a generator and a discriminator: at each step, the generator aims to create data that is similar to real data, while the discriminator attempts to distinguish generated and real data. By using an evolutionary model as the generator, we use GANs to make evolutionary inferences. Since a new model can be considered at each iteration, heuristic searches of complex model spaces are possible. Thus, GANs offer a potential solution to the challenges of applying machine learning in phylogenetics.

    Results

    We developed phyloGAN, a GAN that infers phylogenetic relationships among species. phyloGAN takes as input a concatenated alignment, or a set of gene alignments, and infers a phylogenetic tree either considering or ignoring gene tree heterogeneity. We explored the performance of phyloGAN for up to 15 taxa in the concatenation case and 6 taxa when considering gene tree heterogeneity. Error rates are relatively low in these simple cases. However, run times are slow and performance metrics suggest issues during training. Future work should explore novel architectures that may result in more stable and efficient GANs for phylogenetics.

    Availability and implementation

    phyloGAN is available on GitHub: https://github.com/meganlsmith/phyloGAN/.

  3. Clustering is a machine learning paradigm that divides sample subjects into a number of groups such that subjects in the same group are more similar to one another than to those in other groups. With advances in information acquisition technologies, samples can frequently be viewed from different angles or in different modalities, generating multi-view data. Multi-view clustering (MVC), which clusters subjects into subgroups using multi-view data, has attracted increasing attention. Although MVC methods have developed rapidly, few surveys summarize and analyze the current progress. Therefore, we propose a novel taxonomy of the MVC approaches. As with other machine learning methods, we categorize them into generative and discriminative classes. Within the discriminative class, based on how multiple views are integrated, we split the methods further into five groups: Common Eigenvector Matrix, Common Coefficient Matrix, Common Indicator Matrix, Direct Combination, and Combination After Projection. Furthermore, we discuss the relationships between MVC and some related topics: multi-view representation, ensemble clustering, multi-task clustering, and multi-view supervised and semi-supervised learning. Several representative real-world applications are elaborated for practitioners. Some commonly used multi-view datasets are introduced, and several representative MVC algorithms from each group are run to compare them and to analyze how and why they perform as they do on those datasets. To promote future development of MVC approaches, we point out several open problems that may require further investigation and thorough examination.
  4. The past decade witnessed rapid development in the measurement and monitoring technologies for food science. Among these technologies, spectroscopy has been widely used for the analysis of food quality, safety, and nutritional properties. Due to the complexity of food systems and the lack of comprehensive predictive models, rapid and simple measurements to predict complex properties in food systems are largely missing. Machine Learning (ML) has shown great potential to improve the classification and prediction of these properties. However, the barriers to collecting large datasets for ML applications still persist. In this paper, we explore different approaches to data annotation and model training to improve data efficiency for ML applications. Specifically, we leverage Active Learning (AL) and Semi-Supervised Learning (SSL) and investigate four approaches: baseline passive learning, AL, SSL, and a hybrid of AL and SSL. To evaluate these approaches, we collect two spectroscopy datasets, one for predicting plasma dosage and one for detecting foodborne pathogens. Our experimental results show that, compared to the de facto passive learning approach, advanced approaches (AL, SSL, and the hybrid) can greatly reduce the number of labeled samples, in some cases by more than half.
  5. Townsend, Jeffrey (Ed.)
    Abstract Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs using clustering approaches and retain families with a single sequence per species. This limits the amount of data available by excluding larger families. Recent advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several methods for species tree inference are robust to the inclusion of paralogs and could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference by examining relationships among 26 primate species in detail and by analyzing five additional data sets. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data. We explore several species tree inference methods, finding that identical trees are returned across nearly all subsets of the data and methods for primates. The relationships among Platyrrhini remain contentious; however, the species tree inference method matters more than the subset of data used. Using data from larger gene families drastically increases the number of genes available and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression in primates. For the other data sets, topological inferences are consistent whether single-copy families or orthologs extracted using decomposition approaches are analyzed. Using larger gene families is a promising approach to include more data in phylogenomics without sacrificing accuracy, at least when high-quality genomes are available. 
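
The second related record above notes that unrooted quartets admit only three possible topologies, which is one reason earlier supervised approaches were limited to them. As a rough illustration of how quickly that space grows with more taxa, the sketch below counts unrooted, fully bifurcating topologies using the standard (2n - 5)!! formula. The function name is hypothetical and is not taken from any of the works listed here.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Count unrooted, fully bifurcating tree topologies for n_taxa labeled tips.

    Uses the double factorial (2n - 5)!! = 1 * 3 * 5 * ... * (2n - 5).
    This is an illustrative helper, not code from any cited work.
    """
    if n_taxa < 3:
        raise ValueError("an unrooted binary tree needs at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (the loop is empty for n = 3).
    for odd in range(3, 2 * n_taxa - 4, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (4, 5, 6, 15):
        print(n, count_unrooted_topologies(n))
```

For four taxa this gives the three quartet topologies mentioned above; for 15 taxa, the largest concatenation case reported for phyloGAN, it gives 7,905,853,580,625 candidate topologies, which suggests why training data cannot simply be simulated for every tree.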