skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.

Search for: All records

Creators/Authors contains: "Schatz, Michael C."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract Background In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read. Results Existing tools for the read classification problem are often alignment or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a “screen”) of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read’s similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy. Conclusions The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. We then compare these approaches to existing alignment, index and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at . 
    more » « less
  2. Free, publicly-accessible full text available January 1, 2024
  3. Free, publicly-accessible full text available January 19, 2024
  4. Abstract

    Advancing crop genomics requires efficient genetic systems enabled by high-quality personalized genome assemblies. Here, we introduce RagTag, a toolset for automating assembly scaffolding and patching, and we establish chromosome-scale reference genomes for the widely used tomato genotype M82 along with Sweet-100, a new rapid-cycling genotype that we developed to accelerate functional genomics and genome editing in tomato. This work outlines strategies to rapidly expand genetic systems and genomic resources in other plant species.

    more » « less
  5. Abstract The highly diverse Solanaceae family contains several widely studied models and crop species. Fully exploring, appreciating, and exploiting this diversity requires additional model systems. Particularly promising are orphan fruit crops in the genus Physalis, which occupy a key evolutionary position in the Solanaceae and capture understudied variation in traits such as inflorescence complexity, fruit ripening and metabolites, disease and insect resistance, self-compatibility, and most notable, the striking inflated calyx syndrome (ICS), an evolutionary novelty found across angiosperms where sepals grow exceptionally large to encapsulate fruits in a protective husk. We recently developed transformation and genome editing in Physalis grisea (groundcherry). However, to systematically explore and unlock the potential of this and related Physalis as genetic systems, high-quality genome assemblies are needed. Here, we present chromosome-scale references for P. grisea and its close relative Physalis pruinosa and use these resources to study natural and engineered variations in floral traits. We first rapidly identified a natural structural variant in a bHLH gene that causes petal color variation. Further, and against expectations, we found that CRISPR–Cas9-targeted mutagenesis of 11 MADS-box genes, including purported essential regulators of ICS, had no effect on inflation. In a forward genetics screen, we identified huskless, which lacks ICS due to mutation of an AP2-like gene that causes sepals and petals to merge into a single whorl of mixed identity. These resources and findings elevate Physalis to a new Solanaceae model system and establish a paradigm in the search for factors driving ICS. 
    more » « less
  6. Slotte, Tanja (Ed.)
    Abstract Intracellular transfers of mitochondrial DNA continue to shape nuclear genomes. Chromosome 2 of the model plant Arabidopsis thaliana contains one of the largest known nuclear insertions of mitochondrial DNA (numts). Estimated at over 600 kb in size, this numt is larger than the entire Arabidopsis mitochondrial genome. The primary Arabidopsis nuclear reference genome contains less than half of the numt because of its structural complexity and repetitiveness. Recent data sets generated with improved long-read sequencing technologies (PacBio HiFi) provide an opportunity to finally determine the accurate sequence and structure of this numt. We performed a de novo assembly using sequencing data from recent initiatives to span the Arabidopsis centromeres, producing a gap-free sequence of the Chromosome 2 numt, which is 641 kb in length and has 99.933% nucleotide sequence identity with the actual mitochondrial genome. The numt assembly is consistent with the repetitive structure previously predicted from fiber-based fluorescent in situ hybridization. Nanopore sequencing data indicate that the numt has high levels of cytosine methylation, helping to explain its biased spectrum of nucleotide sequence divergence and supporting previous inferences that it is transcriptionally inactive. The original numt insertion appears to have involved multiple mitochondrial DNA copies with alternative structures that subsequently underwent an additional duplication event within the nuclear genome. This work provides insights into numt evolution, addresses one of the last unresolved regions of the Arabidopsis reference genome, and represents a resource for distinguishing between highly similar numt and mitochondrial sequences in studies of transcription, epigenetic modifications, and de novo mutations. 
    more » « less
  7. Abstract

    There are many short-read variant-calling tools, with different strengths and weaknesses. We present a tool, Minos, which combines outputs from arbitrary variant callers, increasing recall without loss of precision. We benchmark on 62 samples from three bacterial species and an outbreak of 385Mycobacterium tuberculosissamples. Minos also enables joint genotyping; we demonstrate on a large (N=13k)M. tuberculosiscohort, building a map of non-synonymous SNPs and indels in a region where all such variants are assumed to cause rifampicin resistance. We quantify the correlation with phenotypic resistance and then replicate in a second cohort (N=10k).

    more » « less
  8. Abstract Genome copy number is an important source of genetic variation in health and disease. In cancer, Copy Number Alterations (CNAs) can be inferred from short-read sequencing data, enabling genomics-based precision oncology. Emerging Nanopore sequencing technologies offer the potential for broader clinical utility, for example in smaller hospitals, due to lower instrument cost, higher portability, and ease of use. Nonetheless, Nanopore sequencing devices are limited in the number of retrievable sequencing reads/molecules compared to short-read sequencing platforms, limiting CNA inference accuracy. To address this limitation, we targeted the sequencing of short-length DNA molecules loaded at optimized concentration in an effort to increase sequence read/molecule yield from a single nanopore run. We show that short-molecule nanopore sequencing reproducibly returns high read counts and allows high quality CNA inference. We demonstrate the clinical relevance of this approach by accurately inferring CNAs in acute myeloid leukemia samples. The data shows that, compared to traditional approaches such as chromosome analysis/cytogenetics, short molecule nanopore sequencing returns more sensitive, accurate copy number information in a cost effective and expeditious manner, including for multiplex samples. Our results provide a framework for short-molecule nanopore sequencing with applications in research and medicine, which includes but is not limited to, CNAs. 
    more » « less
  9. Large genomic insertions and deletions are a potent source of functional variation, but are challenging to resolve with short-read sequencing, limiting knowledge of the role of such structural variants (SVs) in human evolution. Here, we used a graph-based method to genotype long-read-discovered SVs in short-read data from diverse human genomes. We then applied an admixture-aware method to identify 220 SVs exhibiting extreme patterns of frequency differentiation – a signature of local adaptation. The top two variants traced to the immunoglobulin heavy chain locus, tagging a haplotype that swept to near fixation in certain southeast Asian populations, but is rare in other global populations. Further investigation revealed evidence that the haplotype traces to gene flow from Neanderthals, corroborating the role of immune-related genes as prominent targets of adaptive introgression. Our study demonstrates how recent technical advances can help resolve signatures of key evolutionary events that remained obscured within technically challenging regions of the genome. 
    more » « less