NSF PAR Search | NSF Public Access Repository

Fast, parallel, and cache-friendly suffix array construction

https://doi.org/10.1186/s13015-024-00263-5

Khan, Jamshed; Rubel, Tobias; Molloy, Erin; Dhulipala, Laxman; Patro, Rob (December 2024, Algorithms for Molecular Biology)

Full Text Available
Detectability of Varied Hybridization Scenarios Using Genome-Scale Hybrid Detection Methods

https://doi.org/10.18061/bssb.v3i1.9284

Bjorner, Marianne B; Molloy, Erin K; Dewey, Colin N; Solis-Lemus, Claudia (October 2024, Bulletin of the Society of Systematic Biologists)

Hybridization events complicate the accurate reconstruction of phylogenies, as they lead to patterns of genetic heritability that are unexpected under traditional, bifurcating models of species trees. This phenomenon has led to the development of methods to infer these varied hybridization events, both methods that reconstruct networks directly and summary methods that predict individual hybridization events from a subset of taxa. However, a lack of empirical comparisons between methods – especially those pertaining to large networks with varied hybridization scenarios – hinders their practical use. Here, we provide a comprehensive review of popular summary methods: TICR, MSCquartets, HyDe, Patterson’s D-Statistic (ABBA-BABA), D3, and Dp. TICR and MSCquartets are based on quartet concordance factors gathered from gene tree topologies, whereas HyDe, Patterson’s D-Statistic, D3, and Dp use site pattern frequencies to identify hybridization events between sets of three taxa. We then use simulated data to address questions of method accuracy and ideal use scenarios by testing methods against complex networks which depict gene flow events that differ in depth (timing), quantity (single vs. multiple, overlapping hybridizations), and rate of gene flow (γ). We find that deeper or multiple hybridization events may introduce noise and weaken the signal of hybridization, leading to higher relative false negative rates across all methods. Despite some forms of hybridization eluding quartet-based detection methods, MSCquartets displays high precision in most scenarios. While HyDe results in high false negative rates when tested on hybridizations involving extinct or unsampled ghost lineages, HyDe is the only method able to identify the direction of hybridization, distinguishing the source parental lineages from recipient hybrid lineages. Lastly, we test the methods on a dataset of ultraconserved elements from the bee subfamily Nomiinae, finding possible hybridization events between clades which correspond to regions of poor support in the species tree estimated in a previous study.
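As an illustration of what a site-pattern statistic looks like, the sketch below computes the standard form of Patterson's D, D = (ABBA - BABA) / (ABBA + BABA), for three ingroup taxa and an outgroup. It is a minimal sketch and not the implementation evaluated in the paper: the function name `patterson_d`, the plain-string alignment input, and the choice to skip every site that is not strictly biallelic are assumptions made for this example.

```python
def patterson_d(p1, p2, p3, outgroup):
    """Return Patterson's D from four aligned sequences of equal length.

    Assumption for this sketch: the outgroup allele is treated as ancestral
    ("A") and the other allele at a biallelic site as derived ("B").
    """
    abba = 0  # sites where P2 and P3 share the derived allele
    baba = 0  # sites where P1 and P3 share the derived allele
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        if len({a, b, c, o}) != 2:
            continue  # keep only sites with exactly two distinct characters
        if a == o and b == c and b != o:
            abba += 1
        elif b == o and a == c and a != o:
            baba += 1
    informative = abba + baba
    if informative == 0:
        raise ValueError("no ABBA or BABA sites in the alignment")
    return (abba - baba) / informative


# Toy alignment: three ABBA sites followed by one BABA site.
d = patterson_d("AAAT", "TTTA", "TTTT", "AAAA")
print(d)
```

Values of D near zero are what equal ABBA and BABA counts produce; an excess of either pattern moves D toward +1 or -1. Judging significance requires a resampling procedure that this sketch omits.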
Full Text Available
Fast, Parallel, and Cache-Friendly Suffix Array Construction

https://doi.org/10.4230/LIPIcs.WABI.2023.16

Khan, Jamshed; Rubel, Tobias; Dhulipala, Laxman; Molloy, Erin; Patro, Rob (August 2023, 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023))
Belazzougui, Djamal; Ouangraoua, Aïda (Ed.)
String indexes such as the suffix array (SA) and the closely related longest common prefix (LCP) array are fundamental objects in bioinformatics and have a wide variety of applications. Despite their importance in practice, few scalable parallel algorithms for constructing these are known, and the existing algorithms can be highly non-trivial to implement and parallelize. In this paper we present CaPS-SA, a simple and scalable parallel algorithm for constructing these string indexes inspired by samplesort. Due to its design, CaPS-SA has excellent memory-locality and thus incurs fewer cache misses and achieves strong performance on modern multicore systems with deep cache hierarchies. We show that despite its simple design, CaPS-SA outperforms existing state-of-the-art parallel SA and LCP-array construction algorithms on modern hardware. Finally, motivated by applications in modern aligners where the query strings have bounded lengths, we introduce the notion of a bounded-context SA and show that CaPS-SA can easily be extended to exploit this structure to obtain further speedups.
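For readers unfamiliar with these indexes, the sketch below builds a suffix array and its LCP array for a short string. It is a minimal, sequential illustration and not CaPS-SA: the suffix array is built by naive sorting, the LCP array uses the well-known Kasai et al. linear-time scan, and `bounded_context_suffix_array` reflects only an assumed reading of the bounded-context idea (suffixes compared on at most their first `k` characters). All function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order."""
    # Naive construction: compare whole suffixes. Fine for short strings only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[r] = longest common prefix length of suffixes sa[r-1] and sa[r]."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n  # lcp[0] stays 0 because it has no predecessor
    h = 0
    for i in range(n):  # visit suffixes in text order (Kasai et al.)
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


def bounded_context_suffix_array(text, k):
    """Assumed variant: order suffixes by their first k characters only."""
    # Python's sort is stable, so ties keep their order of appearance.
    return sorted(range(len(text)), key=lambda i: text[i:i + k])


sa = suffix_array("banana")
print(sa)                                     # [5, 3, 1, 0, 4, 2]
print(lcp_array("banana", sa))                # [0, 1, 3, 0, 0, 2]
print(bounded_context_suffix_array("banana", 2))
```

Bounding the comparison length caps the work per comparison at `k` characters, which is one plausible source of the further speedups the abstract mentions when query lengths are bounded.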
Full Text Available
Inferring population structure in biobank-scale genomic data

https://doi.org/10.1016/j.ajhg.2022.02.015

Chiu, Alec M.; Molloy, Erin K.; Tan, Zilong; Talwalkar, Ameet; Sankararaman, Sriram (April 2022, The American Journal of Human Genetics)

Full Text Available
Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

https://doi.org/10.1089/cmb.2020.0424

Legried, Brandon; Molloy, Erin K.; Warnow, Tandy; Roch, Sébastien (May 2021, Journal of Computational Biology)

Full Text Available
ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy

https://doi.org/10.1093/molbev/msaa139

Zhang, Chao; Scornavacca, Celine; Molloy, Erin K; Mirarab, Siavash (September 2020, Molecular Biology and Evolution)
Thorne, Jeffrey (Ed.)
Phylogenetic inference from genome-wide data (phylogenomics) has revolutionized the study of evolution because it enables accounting for discordance among evolutionary histories across the genome. To this end, summary methods have been developed to allow accurate and scalable inference of species trees from gene trees. However, most of these methods, including the widely used ASTRAL, can only handle single-copy gene trees and do not attempt to model gene duplication and gene loss. As a result, most phylogenomic studies have focused on single-copy genes and have discarded large parts of the data. Here, we first propose a measure of quartet similarity between single-copy and multicopy trees that accounts for orthology and paralogy. We then introduce a method called ASTRAL-Pro (ASTRAL for PaRalogs and Orthologs) to find the species tree that optimizes our quartet similarity measure using dynamic programming. By studying its performance on an extensive collection of simulated data sets and on real data sets, we show that ASTRAL-Pro is more accurate than alternative methods.
Full Text Available
Polynomial-Time Statistical Estimation of Species Trees Under Gene Duplication and Loss

Legried, Brandon; Molloy, Erin K.; Warnow, Tandy; Roch, Sebastien (January 2020, International Conference on Research in Computational Molecular Biology (RECOMB 2020))

Phylogenomics (the estimation of species trees from multi-locus datasets) is a common step in many biological studies. However, this estimation is challenged by the fact that genes can evolve under processes, including incomplete lineage sorting (ILS) and gene duplication and loss (GDL), that make their trees different from the species tree. In this paper, we address the challenge of estimating the species tree under GDL. We show that species trees are identifiable under a standard stochastic model for GDL, and that the polynomial-time algorithm ASTRAL-multi, a recent development in the ASTRAL suite of methods, is statistically consistent under this GDL model. We also provide a simulation study evaluating ASTRAL-multi for species tree estimation under GDL. All scripts and datasets used in this study are available on the Illinois Data Bank: https://doi.org/10.13012/B2IDB-2626814_V1.
Full Text Available
ILS-Aware Analysis of Low-Homoplasy Retroelement Insertions: Inference of Species Trees and Introgression Using Quartets

https://doi.org/10.1093/jhered/esz076

Springer, Mark S; Molloy, Erin K; Sloan, Daniel B; Simmons, Mark P; Gatesy, John (December 2019, Journal of Heredity)
Murphy, William (Ed.)
DNA sequence alignments have provided the majority of data for inferring phylogenetic relationships with both concatenation and coalescent methods. However, DNA sequences are susceptible to extensive homoplasy, especially for deep divergences in the Tree of Life. Retroelement insertions have emerged as a powerful alternative to sequences for deciphering evolutionary relationships because these data are nearly homoplasy-free. In addition, retroelement insertions satisfy the “no intralocus-recombination” assumption of summary coalescent methods because they are singular events and better approximate neutrality relative to DNA loci commonly sampled in phylogenomic studies. Retroelements have traditionally been analyzed with parsimony, distance, and network methods. Here, we analyze retroelement data sets for vertebrate clades (Placentalia, Laurasiatheria, Balaenopteroidea, Palaeognathae) with 2 ILS-aware methods that operate by extracting, weighting, and then assembling unrooted quartets into a species tree. The first approach constructs a species tree from retroelement bipartitions with ASTRAL, and the second method is based on split-decomposition with parsimony. We also develop a Quartet-Asymmetry test to detect hybridization using retroelements. Both ILS-aware methods recovered the same species-tree topology for each data set. The ASTRAL species trees for Laurasiatheria have consecutive short branch lengths in the anomaly zone whereas Palaeognathae is outside of this zone. For the Balaenopteroidea data set, which includes rorquals (Balaenopteridae) and gray whale (Eschrichtiidae), both ILS-aware methods resolved balaenopterids as paraphyletic. Application of the Quartet-Asymmetry test to this data set detected 19 different quartets of species for which historical introgression may be inferred. Evidence for introgression was not detected in the other data sets.
Full Text Available
Complete sequencing of ape genomes

https://doi.org/10.1101/2024.07.31.605654

Yoo, DongAhn; Rhie, Arang; Hebbar, Prajna; Antonacci, Francesca; Logsdon, Glennis A; Solar, Steven J; Antipov, Dmitry; Pickett, Brandon D; Safonova, Yana; Montinaro, Francesco; et al (July 2024, bioRxiv)

We present haplotype-resolved reference genomes and comparative analyses of six ape species, namely: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan, and siamang. We achieve chromosome-level contiguity with unparalleled sequence accuracy (<1 error in 500,000 base pairs), completely sequencing 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, providing more in-depth evolutionary insights. Comparative analyses, including human, allow us to investigate the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference. This includes newly minted gene families within lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes, and subterminal heterochromatin. This resource should serve as a definitive baseline for all future evolutionary studies of humans and our closest living ape relatives.
Full Text Available
