Creators/Authors contains: "Prosperi, Mattia"


  1. Mobile sequencing technologies, including Oxford Nanopore’s MinION, Mk1C, and SmidgION, are bringing genomics into the palm of a hand, opening unprecedented opportunities in clinical and ecological research and translational applications. While sequencers now need only a USB outlet and provide on-board preprocessing (e.g., base calling), the main data analysis phases are tied to an available broadband Internet connection and cloud computing. Yet the ubiquity of tablets and smartphones, along with their increase in computational power, makes them a perfect candidate for enabling mobile/edge bioinformatics analytics. Also, in on-site experimental settings, tablets and smartphones are preferable to standard computers due to their resilience to humidity or spills and their ease of sterilization. We here present an experimental study on power dissipation, aiming at reducing the battery consumption that currently impedes the execution of intensive bioinformatics analytics pipelines. In particular, we investigated the effects of assorted data structures (including hash tables, vectors, balanced trees, and tries) employed in one of the most common tasks of a bioinformatics pipeline: k-mer representation and counting. By employing a thermal camera, we show how different k-mer-handling data structures impact the power dissipation on a smartphone, finding that a cache-oblivious data structure reduces power dissipation (up to 26% better than others). In conclusion, the choice of data structures in mobile bioinformatics must consider not only computing efficiency (e.g., succinct data structures to reduce RAM usage), but also the power consumption of mobile devices, which rely heavily on batteries in order to function.
    Free, publicly-accessible full text available December 14, 2023
  2. The SARS-CoV-2 pandemic has presented in periodic waves and multiple variants, some of which came to dominate over time through increased transmissibility. SARS-CoV-2 is still adapting in the human population, so it is crucial to understand its evolutionary patterns and dynamics ahead of time. In this work, we analyzed transmission clusters and the topology of SARS-CoV-2 phylogenies at the global, regional (North America) and clade-specific (Delta and Omicron) epidemic scales. We used Nextstrain’s nCov open global all-time phylogeny (September 2022, 2,698 strains, 2,243 for North America, 499 for Delta21A, and 543 for Omicron20M), with Nextstrain’s clade annotation and Pango lineages. Transmission clusters were identified using Phylopart and DYNAMITE, and several tree imbalance measures were calculated, including staircase-ness and the Sackin and Colless indices. We found that the phylogenetic clustering profiles of the global epidemic show the highest diversification at a distance threshold of 3% (divergence of 10, where the tree sampled median is 49). Phylopart and DYNAMITE clusters agree moderately to highly with the Pango nomenclature and Nextstrain’s clades. At the regional and clade-specific scales, transmission clustering profiles tend to flatten, and similar clusters are found at distance thresholds between 0.05% and 25%. All the considered phylogenies exhibit high tree imbalance relative to what is expected in random phylogenies, suggesting short infection times and antigenic drift, perhaps due to a progressive transition from innate to adaptive immunity in the population.
    Free, publicly-accessible full text available December 6, 2023
  3. Nanopore technology enables portable, real-time sequencing of microbial populations from clinical and ecological samples. An emerging healthcare application for Nanopore is point-of-care, timely identification of antibiotic resistance genes (ARGs) to help develop targeted treatments of bacterial infections and to monitor resistant outbreaks in the environment. While several computational tools exist for classifying ARGs from sequencing data, to date (2022) none have been developed for mobile devices. We present here KARGAMobile, a mobile app for portable, real-time, easily interpretable analysis of ARGs from Nanopore sequencing. KARGAMobile is a port of an existing ARG identification tool named KARGA; it retains the same algorithmic structure, but it is optimized for mobile devices. Specifically, KARGAMobile employs a compressed ARG reference database and different internal data structures to reduce RAM usage. The KARGAMobile app features a friendly graphical user interface that guides users through file browsing, loading, parameter setup, and process execution. More importantly, the output files are post-processed to create visual, printable and shareable reports, helping users interpret the ARG findings. The difference in classification performance between KARGAMobile and KARGA is minimal (96.2% vs. 96.9% f-measure on semi-synthetic datasets of 1 million reads with known resistance ground truth). Using real Nanopore experiments, KARGAMobile processes on average 1 GB of data every 23–48 min (targeted sequencing - metagenomics), with peak RAM usage below 500 MB, independent of input file size, and an average temperature of 49°C after 1 h of continuous data processing. KARGAMobile is written in Java and is available under the MIT license.
    Free, publicly-accessible full text available October 17, 2023
  4. Abstract Background: Metagenomic data can be used to profile high-importance genes within microbiomes. However, current metagenomic workflows produce data that suffer from low sensitivity and an inability to accurately reconstruct partial or full genomes, particularly those in low abundance. These limitations preclude colocalization analysis, i.e., characterizing the genomic context of genes and functions within a metagenomic sample. Genomic context is especially crucial for functions associated with horizontal gene transfer (HGT) via mobile genetic elements (MGEs), for example antimicrobial resistance (AMR). To overcome this current limitation of metagenomics, we present a method for comprehensive and accurate reconstruction of antimicrobial resistance genes (ARGs) and MGEs from metagenomic DNA, termed target-enriched long-read sequencing (TELSeq). Results: Using technical replicates of diverse sample types, we compared TELSeq performance to that of non-enriched PacBio and short-read Illumina sequencing. TELSeq achieved much higher ARG recovery (>1,000-fold) and sensitivity than the other methods across diverse metagenomes, revealing an extensive resistome profile comprising many low-abundance ARGs, including some with public health importance. Using the long reads generated by TELSeq, we identified numerous MGEs and cargo genes flanking the low-abundance ARGs, indicating that these ARGs could be transferred across bacterial taxa via HGT. Conclusions: TELSeq can provide a nuanced view of the genomic context of microbial resistomes and thus has wide-ranging applications in public, animal, and human health, as well as environmental surveillance and monitoring of AMR. Thus, this technique represents a fundamental advancement for microbiome research and application.
    Free, publicly-accessible full text available December 1, 2023
  5. Free, publicly-accessible full text available August 1, 2023
  6. Background: In the wake of the COVID-19 pandemic, scientists have scrambled to collect and analyze SARS-CoV-2 genomic data to inform public health responses to COVID-19 in real-time. Open-source phylogenetic and data visualization platforms for monitoring SARS-CoV-2 genomic epidemiology have rapidly gained popularity for their ability to illuminate spatial-temporal transmission patterns worldwide. However, the utility of such tools to inform public health decision-making for COVID-19 in real-time remains to be explored. Objective: The objective of this study was to convene experts in public health, infectious diseases, virology, and bioinformatics (many of whom were actively engaged in the COVID-19 response at the time of their participation) to discuss the application of phylodynamic tools to inform pandemic responses. Methods: A series of four virtual focus group discussions were hosted between June 2020 and June 2021, covering the pre- and post-variant and vaccination eras of the COVID-19 crisis. Audio recordings were transcribed verbatim, and an iterative, thematic qualitative framework was used for analysis. Results: Of the 41 individuals invited, 23 total participants (56.1%) agreed to participate. Across the four focus group sessions, 15 (65%) of the participants were female, 17 (74%) were white, and 5 (22%) were black. Participants were described as molecular epidemiologists (ME, n=9), clinician-researchers (n=3), infectious disease experts (ID, n=4), and public health professionals (PH) at the local (n=4), state (n=2), and federal (n=1) levels. Collectively, participants felt that successful uptake of phylodynamic tools relies on the strength of academic-public health partnerships. They called for interoperability standards in sequence data sharing and cited many resource issues that must be addressed, including timeliness and cost, in addition to improving issues related to sampling bias and the translation of phylodynamic findings into public health action. Conclusions: This was the first qualitative study to characterize the perspectives of key experts regarding the utility of phylodynamic tools for the public health response to COVID-19. The focus group participants identified key areas for improvement of existing and future phylogenetic and data visualization platforms for monitoring SARS-CoV-2 genomic epidemiology. This information is critical to both policymakers and developers as they consider how to handle existing and emerging SARS-CoV-2 variants during the ongoing crisis.
    Free, publicly-accessible full text available February 27, 2024
  7. Free, publicly-accessible full text available August 1, 2023
  8. Abstract

    Antimicrobial resistance (AMR) is a growing threat to public health and farming at large. In clinical and veterinary practice, timely characterization of the antibiotic susceptibility profile of bacterial infections is a crucial step in optimizing treatment. High-throughput sequencing is a promising option for clinical point-of-care and ecological surveillance, opening the opportunity to develop genotyping-based AMR determination as a possibly faster alternative to phenotypic testing. In the present work, we compare the performance of state-of-the-art methods for detection of AMR using high-throughput sequencing data from clinical settings. We consider five computational approaches based on alignment (AMRPlusPlus), deep learning (DeepARG), k-mer genomic signatures (KARGA, ResFinder) or hidden Markov models (Meta-MARC). We use an extensive collection of 585 isolates with available AMR profiles determined by phenotypic tests across nine antibiotic classes. We show how the prediction landscape of AMR classifiers is highly heterogeneous, with balanced accuracy varying from 0.40 to 0.92. Although some algorithms (ResFinder, KARGA and AMRPlusPlus) exhibit overall better balanced accuracy than others, the high per-AMR-class variance and related findings suggest that: (1) all algorithms might be subject to sampling bias both in data repositories used for training and in experimental/clinical settings; and (2) a portion of clinical samples might contain uncharacterized AMR genes to which the algorithms, mostly trained on known AMR genes, fail to generalize. These results lead us to formulate practical advice for software configuration and application, and give suggestions for future study designs to further develop AMR prediction tools from proof-of-concept to bedside.
  9. Abstract Background: Identification of motifs and quantification of their occurrences are important for the study of genetic diseases, gene evolution, transcription sites, and other biological mechanisms. Exact formulae for estimating count distributions of motifs under Markovian assumptions have high computational complexity and are impractical to use on large motif sets. Approximated formulae, e.g. based on compound Poisson, are faster, but reliable p value calculation remains challenging. Here, we introduce ‘motif_prob’, a fast implementation of an exact formula for motif count distribution through progressive approximation with arbitrary precision. Our implementation speeds up the exact calculation, usually impractical, making it feasible and positioned to substitute for currently employed heuristics. Results: We implement motif_prob in both Perl and C++, using an efficient error-bound iterative process for the exact formula, and provide a comparison with state-of-the-art tools (e.g. MoSDi) in terms of precision and run time benchmarks, along with a real-world use case on bacterial motif characterization. Our software is able to process a million motifs (13–31 bases) over genome lengths of 5 million bases within a minute on a regular laptop, and the run times for both the Perl and C++ code are several orders of magnitude smaller (50–1000× faster) than MoSDi, even when using their fast compound Poisson approximation (60–120× faster). In the real-world use cases, we first show the consistency of motif_prob with MoSDi, and then how the p-value quantification is crucial for enrichment quantification when bacteria have different GC content, using motifs found in antimicrobial resistance genes. The software and the source code are available under the MIT license. Conclusions: The motif_prob software is a multi-platform and efficient open source solution for calculating exact frequency distributions of motifs. It can be integrated with motif discovery/characterization tools for quantifying enrichment and deviation from expected frequency ranges with exact p values, without loss in data processing efficiency.
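
Illustrative sketch for record 9: the abstract contrasts observed motif counts with the counts expected under a background model, and notes that GC content affects enrichment. The Python snippet below is a minimal sketch of that idea only, assuming a zero-order background (independent bases with fixed frequencies). This is a deliberately simpler model than the Markovian exact formula described for motif_prob, and it is not that tool's method or API; all function names here are hypothetical.

```python
from collections import Counter
from typing import Dict


def count_overlapping(sequence: str, motif: str) -> int:
    """Count occurrences of motif in sequence, allowing overlaps."""
    if not motif:
        raise ValueError("motif must be non-empty")
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Advance by one base (not by len(motif)) so overlaps are counted.
        start = sequence.find(motif, start + 1)
    return count


def base_frequencies(sequence: str) -> Dict[str, float]:
    """Empirical frequency of each symbol in a non-empty sequence."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    counts = Counter(sequence)
    total = len(sequence)
    return {base: n / total for base, n in counts.items()}


def expected_count(motif: str, seq_len: int, freqs: Dict[str, float]) -> float:
    """Expected occurrences under a zero-order model (assumption: bases
    are independent and identically distributed with frequencies freqs)."""
    positions = seq_len - len(motif) + 1  # number of possible start sites
    if positions <= 0:
        return 0.0
    prob = 1.0
    for base in motif:
        prob *= freqs.get(base, 0.0)
    return positions * prob


if __name__ == "__main__":
    genome = "ACGTACGTGGCCGGCCACGT"  # toy sequence, not real data
    motif = "GGCC"
    observed = count_overlapping(genome, motif)
    expected = expected_count(motif, len(genome), base_frequencies(genome))
    print(f"observed={observed} expected={expected:.3f}")
```

Because the expected count depends on the base frequencies, the same motif can appear enriched in a low-GC genome and unremarkable in a high-GC one, which is why a background-aware p value matters in the use case the abstract describes. Turning the observed-versus-expected gap into an exact p value requires the full count distribution, which this sketch does not compute.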