

Search for: All records

Award ID contains: 1920103

  1. Abstract

    Advancing crop genomics requires efficient genetic systems enabled by high-quality personalized genome assemblies. Here, we introduce RagTag, a toolset for automating assembly scaffolding and patching, and we establish chromosome-scale reference genomes for the widely used tomato genotype M82 along with Sweet-100, a new rapid-cycling genotype that we developed to accelerate functional genomics and genome editing in tomato. This work outlines strategies to rapidly expand genetic systems and genomic resources in other plant species.

     
  2. Abstract Proteins that drive processes like clathrin-mediated endocytosis (CME) are expressed at copy numbers within a cell and across cell types varying from hundreds (e.g. auxilin) to millions (e.g. clathrin). These variations contain important information about function, but without integration with the interaction network, they cannot capture how supply and demand for each protein depends on binding to shared and distinct partners. Here we construct the interface-resolved network of 82 proteins involved in CME and establish a metric, a stoichiometric balance ratio (SBR), that quantifies whether each protein in the network has an abundance that is sub- or super-stoichiometric dependent on the global competition for binding. We find that highly abundant proteins (like clathrin) are super-stoichiometric, but that not all super-stoichiometric proteins are highly abundant, across three cell populations (HeLa, fibroblast, and neuronal synaptosomes). Most strikingly, within all cells there is significant competition to bind shared sites on clathrin and the central AP-2 adaptor by other adaptor proteins, resulting in most being in excess supply. Our network and systematic analysis, including response to perturbations of network components, show how competition for shared binding sites results in functionally similar proteins having widely varying stoichiometries, due to variations in both abundance and their unique network of binding partners. 
  3. Abstract

    Background: In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read.

    Results: Existing tools for the read classification problem are often alignment or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a “screen”) of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read’s similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy.

    Conclusions: The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. We then compare these approaches to existing alignment, index and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at https://github.com/arun96/sketching.
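The screen-then-classify idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the repository linked above); it assumes a simple bottom-s MinHash as the sketching step, and every name in it (`kmers`, `hash_kmer`, `build_screen`, `classify_read`) and every parameter default (`k`, `sketch_size`) is hypothetical, chosen only for illustration.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_screen(genomes, k=8, sketch_size=64):
    """Build the screen: for each genome keep its sketch_size smallest
    k-mer hashes (a bottom-s MinHash sketch)."""
    screen = {}
    for name, seq in genomes.items():
        hashes = {hash_kmer(km) for km in kmers(seq, k)}
        screen[name] = set(sorted(hashes)[:sketch_size])
    return screen


def classify_read(read, screen, k=8):
    """Compare a read's k-mer hashes against every sketch in the screen.

    Returns (best_source, hit_count); best_source is None when the read
    shares no hashed k-mer with any sketch.
    """
    read_hashes = {hash_kmer(km) for km in kmers(read, k)}
    best, best_hits = None, 0
    for name, sketch in screen.items():
        hits = len(read_hashes & sketch)
        if hits > best_hits:
            best, best_hits = name, hits
    return best, best_hits
```

In use, `build_screen` would be called once on the candidate source genomes, and reads would then be streamed through `classify_read` one at a time, so only the sketches need to stay in memory. Counting shared hashes is the simplest possible similarity score; a real tool would also need a threshold for reporting a read as unclassified and some tolerance for long-read error rates.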
  4. Abstract The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society [1,2]. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals [3,4]. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome [5]. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity [6]. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yield the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent–child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.
  5. The natural scale separation in the restricted nonlinear (RNL) modelling approach is exploited to build upon recent studies, e.g., Wangsawijaya (2020), that have used scale separation to provide insight into mechanisms underlying secondary motions in turbulent flow over spanwise heterogeneous roughness. In the RNL decomposition the large-scale comprises the streamwise averaged mean and the small-scales are defined through a dynamical restriction that leads to computational tractability, while providing good agreement with salient flow features. In agreement with the experimental work, our results indicate that energy of the large-scales is amplified over the low roughness region due to the secondary flow. The small-scales are shown to play a dominant role in the Reynolds stresses responsible for generation of the secondary flow. Conditional averaging of the RNL mean field reveals stronger momentum pathways over low roughness regions experiencing downwash in instances that differ from the time-averaged trends. Further analysis of the large scale indicates that meandering of low speed streaks in the RNL flow is in response to secondary flow momentum mixing. 
  6. INTRODUCTION: One of the central applications of the human reference genome has been to serve as a baseline for comparison in nearly all human genomic studies. Unfortunately, many difficult regions of the reference genome have remained unresolved for decades and are affected by collapsed duplications, missing sequences, and other issues. Relative to the current human reference genome, GRCh38, the Telomere-to-Telomere CHM13 (T2T-CHM13) genome closes all remaining gaps, adds nearly 200 million base pairs (Mbp) of sequence, corrects thousands of structural errors, and unlocks the most complex regions of the human genome for scientific inquiry.

    RATIONALE: We demonstrate how the T2T-CHM13 reference genome universally improves read mapping and variant identification in a globally diverse cohort. This cohort includes all 3202 samples from the expanded 1000 Genomes Project (1KGP), sequenced with short reads, as well as 17 globally diverse samples sequenced with long reads. By applying state-of-the-art methods for calling single-nucleotide variants (SNVs) and structural variants (SVs), we document the strengths and limitations of T2T-CHM13 relative to its predecessors and highlight its promise for revealing new biological insights within technically challenging regions of the genome.

    RESULTS: Across the 1KGP samples, we found more than 1 million additional high-quality variants genome-wide using T2T-CHM13 than with GRCh38. Within previously unresolved regions of the genome, we identified hundreds of thousands of variants per sample, a promising opportunity for evolutionary and biomedical discovery. T2T-CHM13 improves the Mendelian concordance rate among trios and eliminates tens of thousands of spurious SNVs per sample, including a reduction of false positives in 269 challenging, medically relevant genes by up to a factor of 12. These corrections are in large part due to improvements to 70 protein-coding genes in >9 Mbp of inaccurate sequence caused by falsely collapsed or duplicated regions in GRCh38. Using the T2T-CHM13 genome also yields a more comprehensive view of SVs genome-wide, with a greatly improved balance of insertions and deletions. Finally, by providing numerous resources for T2T-CHM13 (including 1KGP genotypes, accessibility masks, and prominent annotation databases), our work will facilitate the transition to T2T-CHM13 from the current reference genome.

    CONCLUSION: The vast improvements in variant discovery across samples of diverse ancestries position T2T-CHM13 to succeed as the next prevailing reference for human genetics. T2T-CHM13 thus offers a model for the construction and study of high-quality reference genomes from globally diverse individuals, such as is now being pursued through collaboration with the Human Pangenome Reference Consortium. As a foundation, our work underscores the benefits of an accurate and complete reference genome for revealing diversity across human populations.

    Genomic features and resources available for T2T-CHM13. Comparisons to GRCh38 reveal broad improvements in SNVs, indels, and SVs discovered across diverse human populations by means of short-read (1KGP) and long-read sequencing (LRS). These improvements are due to resolution of complex genomic loci (nonsyntenic and previously unresolved), duplication errors, and discordant haplotypes, including those in medically relevant genes.
  7. Since its initial release in 2000, the human reference genome has covered only the euchromatic fraction of the genome, leaving important heterochromatic regions unfinished. Addressing the remaining 8% of the genome, the Telomere-to-Telomere (T2T) Consortium presents a complete 3.055 billion–base pair sequence of a human genome, T2T-CHM13, that includes gapless assemblies for all chromosomes except Y, corrects errors in the prior references, and introduces nearly 200 million base pairs of sequence containing 1956 gene predictions, 99 of which are predicted to be protein coding. The completed regions include all centromeric satellite arrays, recent segmental duplications, and the short arms of all five acrocentric chromosomes, unlocking these complex regions of the genome to variational and functional studies. 
  8. Kasson, Peter M. (Ed.)
    Clathrin-coated structures must assemble on cell membranes to internalize receptors, with the clathrin protein only linked to the membrane via adaptor proteins. These structures can grow surprisingly large, containing over 20 clathrin, yet they often fail to form productive vesicles, instead aborting and disassembling. We show that clathrin structures of this size can both form and disassemble spontaneously when adaptor protein availability is low, despite high abundance of clathrin. Here, we combine recent in vitro kinetic measurements with microscopic reaction-diffusion simulations and theory to differentiate mechanisms of stable vs unstable clathrin assembly on membranes. While in vitro conditions drive assembly of robust, stable lattices, we show that concentrations, geometry, and dimensional reduction in physiologic-like conditions do not support nucleation if only the key adaptor AP-2 is included, due to its insufficient abundance. Nucleation requires a stoichiometry of adaptor to clathrin that exceeds 1:1, meaning additional adaptor types are necessary to form lattices successfully and efficiently. We show that the critical nucleus contains ~25 clathrin, remarkably similar to sizes of the transient and abortive structures observed in vivo. Lastly, we quantify the cost of bending the membrane under our curved clathrin lattices using a continuum membrane model. We find that the cost of bending the membrane could be largely offset by the energetic benefit of forming curved rather than flat structures, with numbers comparable to experiments. Our model predicts how adaptor density can tune clathrin-coated structures from the transient to the stable, showing that active energy consumption is therefore not required for lattice disassembly or remodeling during growth, which is a critical advance towards predicting productive vesicle formation.
  9. Centromeres attach chromosomes to spindle microtubules during cell division and, despite this conserved role, show paradoxically rapid evolution and are typified by complex repeats. We used long-read sequencing to generate the Col-CEN Arabidopsis thaliana genome assembly that resolves all five centromeres. The centromeres consist of megabase-scale tandemly repeated satellite arrays, which support CENTROMERE SPECIFIC HISTONE H3 (CENH3) occupancy and are densely DNA methylated, with satellite variants private to each chromosome. CENH3 preferentially occupies satellites that show the least amount of divergence and occur in higher-order repeats. The centromeres are invaded by ATHILA retrotransposons, which disrupt genetic and epigenetic organization. Centromeric crossover recombination is suppressed, yet low levels of meiotic DNA double-strand breaks occur that are regulated by DNA methylation. We propose that Arabidopsis centromeres are evolving through cycles of satellite homogenization and retrotransposon-driven diversification. 