Title: Critical Assessment of Metagenome Interpretation: the second round of challenges
Abstract: Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains were still challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses.
Award ID(s):
1845890 1919691 1936791 2107108
PAR ID:
10334129
Date Published:
Journal Name:
Nature Methods
Volume:
19
Issue:
4
ISSN:
1548-7091
Page Range / eLocation ID:
429 to 440
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Viruses influence global patterns of microbial diversity and nutrient cycles. Though viral metagenomics (viromics), specifically targeting dsDNA viruses, has been critical for revealing viral roles across diverse ecosystems, its analyses differ in many ways from those used for microbes. To date, viromics benchmarking has covered read pre-processing, assembly, relative abundance, read mapping thresholds and diversity estimation, but other steps would benefit from benchmarking and standardization. Here we use in silico-generated datasets and an extensive literature survey to evaluate and highlight how dataset composition (i.e., viromes vs bulk metagenomes) and assembly fragmentation impact (i) viral contig identification tools, (ii) virus taxonomic classification, and (iii) identification and curation of auxiliary metabolic genes (AMGs). Results: The in silico benchmarking of five commonly used virus identification tools shows that gene-content-based tools consistently performed well for long (≥3 kbp) contigs, while k-mer- and blast-based tools were uniquely able to detect viruses from short (≤3 kbp) contigs. Notably, however, the performance increase of k-mer- and blast-based tools for short contigs was obtained at the cost of increased false positives (sometimes up to ∼5% for virome and ∼75% for bulk samples), particularly when eukaryotic or mobile genetic element sequences were included in the test datasets. For viral classification, variously sized genome fragments were assessed using gene-sharing network analytics to quantify drop-offs in taxonomic assignments, which revealed correct assignations ranging from ∼95% (whole genomes) down to ∼80% (3 kbp sized genome fragments). A similar trend was also observed for other viral classification tools such as VPF-class, ViPTree and VIRIDIC, suggesting that caution is warranted when classifying short genome fragments and not full genomes. Finally, we highlight how fragmented assemblies can lead to erroneous identification of AMGs and outline a best-practices workflow to curate candidate AMGs in viral genomes assembled from metagenomes. Conclusion: Together, these benchmarking experiments and annotation guidelines should aid researchers seeking to best detect, classify, and characterize the myriad viruses ‘hidden’ in diverse sequence datasets.
  2. Viruses play crucial roles in the ecology of microbial communities, yet they remain relatively understudied in their native environments. Despite many advancements in high-throughput whole-genome sequencing (WGS), sequence assembly, and annotation of viruses, the reconstruction of full-length viral genomes directly from metagenomic sequencing is possible only for the most abundant phages and requires long-read sequencing technologies. Additionally, the prediction of their cellular hosts remains difficult from conventional metagenomic sequencing alone. To address these gaps in the field and to accelerate the study of viruses directly in their native microbiomes, we developed an end-to-end bioinformatics platform for viral genome reconstruction and host attribution from metagenomic data using proximity-ligation sequencing (i.e., Hi-C). We demonstrate the capabilities of the platform by recovering and characterizing the metavirome of a variety of metagenomes, including a fecal microbiome that has also been sequenced with accurate long reads, allowing for the assessment and benchmarking of the new methods. The platform can accurately extract numerous near-complete viral genomes even from highly fragmented short-read assemblies and can reliably predict their cellular hosts with minimal false positives. To our knowledge, this is the first software for performing these tasks. Being significantly cheaper than long-read sequencing of comparable depth, the incorporation of proximity-ligation sequencing in microbiome research shows promise to greatly accelerate future advancements in the field. 
  3. Background: Reliable and specific biomarkers that can distinguish autism spectrum disorders (ASDs) from commonly co-occurring attention-deficit/hyperactivity disorder (ADHD) are lacking, causing misses and delays in diagnosis, and reducing access to interventions and quality of life. Aims: To examine whether an innovative, brief (1-min), videogame method called Computerised Assessment of Motor Imitation (CAMI), can identify ASD-specific imitation differences compared with neurotypical children and children with ADHD. Method: This cross-sectional study used CAMI alongside standardised parent-report (Social Responsiveness Scale, Second Edition) and observational measures of autism (Autism Diagnostic Observation Schedule-Second Edition; ADOS-2), ADHD (Conners) and motor ability (Physical and Neurological Examination for Soft Signs). The sample comprised 183 children aged 7–13 years, with ADHD (without ASD), with ASD (with and without ADHD) and who were neurotypical. Results: Regardless of co-occurring ADHD, children with ASD showed poorer CAMI performance than neurotypical children (P < 0.0001; adjusted R² = 0.28), whereas children with ADHD and neurotypical children showed similar CAMI performance. Receiver operating curve and support vector machine analyses showed that CAMI distinguishes ASD from both neurotypical children (80% true positive rate) and children with ADHD (70% true positive rate), with a high success rate significantly above chance. Among children with ASD, poor CAMI performance was associated with increased autism traits, particularly ADOS-2 measures of social affect and restricted and repetitive behaviours (adjusted R² = 0.23), but not with ADHD traits or motor ability. Conclusions: Four levels of analyses confirm that poor imitation measured by the low-cost and scalable CAMI method specifically distinguishes ASD not only from neurotypical development, but also from commonly co-occurring ADHD.
  4. Background: Motor imitation difficulties are pervasive in children with Autism Spectrum Disorder (ASD). Previous research demonstrated the validity and reliability of an algorithm called Computerized Assessment of Motor Imitation (CAMI) using 3D depth cameras. However, incorporating CAMI into serious games and making it accessible in clinic and home settings requires a more scalable approach that uses “off-the-shelf” 2D cameras. Method: In a brief (one-minute) task, children (23 ASD, 17 typically developing [TD]) imitated a model’s dance movements while simultaneously being recorded using Kinect Xbox motion tracking technology (Kinect 3D) and a single 2D camera. Pose-estimation software (OpenPose 2D) was used on the 2D camera video to fit a skeleton to the imitating child. Motor imitation scores computed from the fully automated OpenPose 2D CAMI method were compared to scores computed from the Kinect 3D CAMI and Human Observation Coding (HOC) methods. Results: Motor imitation scores obtained from the OpenPose 2D CAMI method were significantly correlated with scores obtained from the Kinect 3D CAMI method (r40 = 0.82, p < 0.001) and the HOC method (r40 = 0.80, p < 0.001). Both 2D and 3D CAMI methods showed better discriminative ability than the HOC, with the Kinect 3D CAMI method outperforming the OpenPose 2D CAMI method (area under the ROC curve (AUC): HOC = 0.799, 2D CAMI = 0.876, 3D CAMI = 0.94). Finally, all motor imitation scores were significantly associated with social communication impairment (all p ≤ 0.003). Conclusions: This pilot study demonstrated that motor imitation can be automatically quantified using a single 2D camera.
  5. Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie. 