Title: MIDAS2: Metagenomic Intra-species Diversity Analysis System
Abstract Summary

The Metagenomic Intra-Species Diversity Analysis System (MIDAS) is a scalable metagenomic pipeline that identifies single nucleotide variants (SNVs) and gene copy number variants in microbial populations. Here, we present MIDAS2, which addresses the computational challenges presented by increasingly large reference genome databases, while adding functionality for building custom databases and leveraging paired-end reads to improve SNV accuracy. This fast and scalable reengineering of the MIDAS pipeline enables thousands of metagenomic samples to be efficiently genotyped.

Availability and implementation

The source code is available at https://github.com/czbiohub/MIDAS2.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
NSF-PAR ID:
10388946
Publisher / Repository:
Oxford University Press
Journal Name:
Bioinformatics
Volume:
39
Issue:
1
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Fraser, Claire M. (Ed.)
    ABSTRACT

    Metagenomics is a powerful method for interpreting the ecological roles and physiological capabilities of mixed microbial communities. Yet, many tools for processing metagenomic data are neither designed to consider eukaryotes nor built for the increasing amount of sequence data. EukHeist is an automated pipeline to retrieve eukaryotic and prokaryotic metagenome-assembled genomes (MAGs) from large-scale metagenomic sequence data sets. We developed the EukHeist workflow specifically to process large amounts of metagenomic and/or metatranscriptomic sequence data in an automated and reproducible fashion. Here, we applied EukHeist to the large-size fraction data (0.8–2,000 µm) from Tara Oceans to recover both eukaryotic and prokaryotic MAGs, which we refer to as TOPAZ (Tara Oceans Particle-Associated MAGs). The TOPAZ MAGs consisted of >900 environmentally relevant eukaryotic MAGs and >4,000 bacterial and archaeal MAGs. The bacterial and archaeal TOPAZ MAGs expand upon the phylogenetic diversity of likely particle- and host-associated taxa. We use these MAGs to demonstrate an approach to infer the putative trophic mode of the recovered eukaryotic MAGs. We also identify ecological cohorts of co-occurring MAGs, which are driven by specific environmental factors and putative host-microbe associations. These data together add to a growing number of resources of environmentally relevant eukaryotic genomic information. Complementary and expanded databases of MAGs, such as those provided through scalable pipelines like EukHeist, stand to advance our understanding of eukaryotic diversity through increased coverage of genomic representatives across the tree of life.

    IMPORTANCE

    Single-celled eukaryotes play ecologically significant roles in the marine environment, yet fundamental questions about their biodiversity, ecological function, and interactions remain. Environmental sequencing enables researchers to document naturally occurring protistan communities, without culturing bias, yet metagenomic and metatranscriptomic sequencing approaches cannot separate individual species from communities. To more completely capture the genomic content of mixed protistan populations, we can create bins of sequences that represent the same organism (metagenome-assembled genomes [MAGs]). We developed the EukHeist pipeline, which automates the binning of population-level eukaryotic and prokaryotic genomes from metagenomic reads. We show exciting insight into what protistan communities are present and their trophic roles in the ocean. Scalable computational tools, like EukHeist, may accelerate the identification of meaningful genetic signatures from large data sets and complement researchers’ efforts to leverage MAG databases for addressing ecological questions, resolving evolutionary relationships, and discovering potentially novel biodiversity.

     
  2. Abstract Motivation

    Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired when too few samples are available to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using only composition information. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts, inevitably generated by incorrect ligations of DNA fragments between species, link contigs from different genomes and weaken the purity of the final draft genomic bins. Therefore, it is imperative to develop a binning pipeline that overcomes the shortcomings of both types of binning methods on a single sample.

    Results

    We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a strategy of fragmentation for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.

    Availability and implementation

    HiFine is available at https://github.com/dyxstat/HiFine.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract Motivation

    Metagenomic and metatranscriptomic analyses can provide an abundance of information related to microbial communities. However, straightforward analysis of these data does not provide optimal results; the two data types must be integrated to thoroughly investigate these microbiomes and their environmental interactions.

    Results

    Here, we present MetaQUBIC, an integrated biclustering-based computational pipeline for gene module detection that integrates both metagenomic and metatranscriptomic data. Additionally, we used this pipeline to investigate 735 paired DNA and RNA human gut microbiome samples, resulting in a comprehensive hybrid gene expression matrix of 2.3 million cross-species genes in the 735 human fecal samples and 155 functionally enriched gene modules. We believe both the MetaQUBIC pipeline and the generated comprehensive human gut hybrid expression matrix will facilitate further investigations into multiple levels of microbiome studies.

    Availability and implementation

    The package is freely available at https://github.com/OSU-BMBL/metaqubic.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract Motivation

    Disruption of protein–protein interactions can mitigate antibody recognition of therapeutic proteins, yield monomeric forms of oligomeric proteins, and elucidate signaling mechanisms, among other applications. While designing affinity-enhancing mutations remains generally quite challenging, both statistically and physically based computational methods can precisely identify affinity-reducing mutations. In order to leverage this ability to design variants of a target protein with disrupted interactions, we developed the DisruPPI protein design method (DISRUpting Protein–Protein Interactions) to optimize combinations of mutations simultaneously for both disruption and stability, so that incorporated disruptive mutations do not inadvertently affect the target protein adversely.

    Results

    Two existing methods for predicting mutational effects on binding, FoldX and INT5, were demonstrated to be quite precise in selecting disruptive mutations from the SKEMPI and AB-Bind databases of experimentally determined changes in binding free energy. DisruPPI was implemented to use an INT5-based disruption score integrated with an AMBER-based stability assessment and was applied to disrupt protein interactions in a set of different targets representing diverse applications. In retrospective evaluation with three different case studies, comparison of DisruPPI-designed variants to published experimental data showed that DisruPPI was able to identify more diverse interaction-disrupting and stability-preserving variants more efficiently and effectively than previous approaches. In prospective application to an interaction between enhanced green fluorescent protein (EGFP) and a nanobody, DisruPPI was used to design five EGFP variants, all of which were shown to have significantly reduced nanobody binding while maintaining function and thermostability. This demonstrates that DisruPPI may be readily utilized for effective removal of known epitopes of therapeutically relevant proteins.

    Availability and implementation

    DisruPPI is implemented in the EpiSweep package, freely available under an academic use license.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. Abstract Motivation

    Genotyping a set of variants from a database is an important step for identifying known genetic traits and disease-related variants within an individual. The growing size of variant databases as well as the high depth of sequencing data poses an efficiency challenge. In clinical applications, where time is crucial, alignment-based methods are often not fast enough. To fill the gap, Shajii et al. proposed LAVA, an alignment-free genotyping method that can genotype single nucleotide polymorphisms (SNPs) more quickly; however, substantial room remains for improvement in running time and accuracy.

    Results

    We present the VarGeno method for SNP genotyping from Illumina whole genome sequencing data. VarGeno builds upon LAVA by improving the speed of k-mer querying as well as the accuracy of the genotyping strategy. We evaluate VarGeno on several read datasets using different genotyping SNP lists. VarGeno performs 7–13 times faster than LAVA with similar memory usage, while improving accuracy.

    Availability and implementation

    VarGeno is freely available at: https://github.com/medvedevgroup/vargeno.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
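
The abstract above refers to alignment-free SNP genotyping by k-mer querying without describing it. The following is a toy sketch of that general idea only, not the algorithm used by LAVA or VarGeno: it indexes the k-mers that overlap each known SNP for both alleles, counts exact k-mer hits in the reads, and calls a genotype from the allele counts. All function names, the `min_fraction` threshold, and the example sequences are hypothetical, and the sketch ignores sequencing errors, reverse complements, and k-mers that also occur elsewhere in the genome.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(reference, snps, k):
    """Map each k-mer overlapping a SNP to (position, allele).

    snps maps a 0-based position to (ref_base, alt_base).
    Toy assumption: indexed k-mers are unique to one SNP and one allele.
    """
    index = {}
    for pos, (ref_base, alt_base) in snps.items():
        for allele in (ref_base, alt_base):
            # Place the allele into the reference at the SNP position.
            seq = reference[:pos] + allele + reference[pos + 1:]
            # Window holding every k-mer that covers the SNP position.
            lo = max(0, pos - k + 1)
            hi = min(len(seq), pos + k)
            for kmer in kmers(seq[lo:hi], k):
                index[kmer] = (pos, allele)
    return index


def genotype(reads, index, snps, k, min_fraction=0.2):
    """Call a genotype per SNP from exact k-mer hits in the reads.

    min_fraction is a hypothetical cutoff: an allele supported by less
    than this fraction of the hits at a site is treated as absent.
    """
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            hit = index.get(kmer)
            if hit is not None:
                counts[hit] += 1

    calls = {}
    for pos, (ref_base, alt_base) in snps.items():
        ref_n = counts[(pos, ref_base)]
        alt_n = counts[(pos, alt_base)]
        total = ref_n + alt_n
        if total == 0:
            calls[pos] = None  # no read k-mer covers this site
        elif alt_n / total < min_fraction:
            calls[pos] = ref_base + ref_base
        elif ref_n / total < min_fraction:
            calls[pos] = alt_base + alt_base
        else:
            calls[pos] = ref_base + alt_base
    return calls


if __name__ == "__main__":
    reference = "ACGTACGGTCATTGCA"
    snps = {7: ("G", "A")}  # made-up SNP: G>A at 0-based position 7
    index = build_index(reference, snps, k=5)
    reads = ["GTACGGTCAT", "GTACGATCAT"]  # one read per allele
    print(genotype(reads, index, snps, k=5))
```

Because each read is scanned with dictionary lookups rather than aligned to the reference, the cost per read grows with the read length and not with the genome size, which is the usual motivation for alignment-free genotyping.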

     