Title: Large-scale sequence comparisons with sourmash
The sourmash software package uses MinHash-based sketching to create “signatures”, compressed representations of DNA, RNA, and protein sequences, that can be stored, searched, explored, and taxonomically annotated. sourmash signatures can be used to estimate sequence similarity between very large data sets quickly and in low memory, and can be used to search large databases of genomes for matches to query genomes and metagenomes. sourmash is implemented in C++, Rust, and Python, and is freely available under the BSD license at http://github.com/dib-lab/sourmash.
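The idea behind MinHash-based sketching can be illustrated with a minimal bottom-k sketch in plain Python. This is only a sketch under stated assumptions and not sourmash's own implementation or API: the function names (`kmers`, `hash_kmer`, `sketch`, `jaccard_estimate`), the choice of BLAKE2b as the hash, and the defaults `k=21` and `num=500` are all illustrative, and reverse-complement handling is omitted.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence (uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer. BLAKE2b is an illustrative choice."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k=21, num=500):
    """Bottom-k MinHash sketch: the `num` smallest distinct k-mer hashes."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return sorted(hashes)[:num]


def jaccard_estimate(a, b, num=500):
    """Estimate Jaccard similarity of two k-mer sets from their sketches.

    Takes the `num` smallest hashes of the merged sketches and reports the
    fraction of them that occur in both sketches.
    """
    merged = sorted(set(a) | set(b))[:num]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


if __name__ == "__main__":
    s1 = sketch("ACGTACGTGG", k=4)
    s2 = sketch("ACGTACGTCC", k=4)
    print(jaccard_estimate(s1, s2))
```

Because each sketch keeps at most `num` hashes regardless of input length, comparing two sketches costs roughly the same for a short plasmid as for a large metagenome, which is what makes low-memory comparison of very large data sets feasible. When a sequence has fewer than `num` distinct k-mers, as in the toy usage above, the sketch holds every hash and the estimate equals the exact Jaccard similarity.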
Award ID(s):
1711984
PAR ID:
10192442
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
F1000Research
Volume:
8
ISSN:
2046-1402
Page Range / eLocation ID:
1006
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Giant viruses are widespread in the biosphere and play important roles in biogeochemical cycling and host genome evolution. Also known as nucleo-cytoplasmic large DNA viruses (NCLDVs), these eukaryotic viruses harbor the largest and most complex viral genomes known. Studies have shown that NCLDVs are frequently abundant in metagenomic datasets, and that sequences derived from these viruses can also be found endogenized in diverse eukaryotic genomes. The accurate detection of sequences derived from NCLDVs is therefore of great importance, but this task is challenging owing to both the high level of sequence divergence between NCLDV families and the extraordinarily high diversity of genes encoded in their genomes, including some encoding for metabolic or translation-related functions that are typically found only in cellular lineages. Here, we present ViralRecall, a bioinformatic tool for the identification of NCLDV signatures in ‘omic data. This tool leverages a library of giant virus orthologous groups (GVOGs) to identify sequences that bear signatures of NCLDVs. We demonstrate that this tool can effectively identify NCLDV sequences with high sensitivity and specificity. Moreover, we show that it can be useful both for removing contaminating sequences in metagenome-assembled viral genomes as well as the identification of eukaryotic genomic loci that derived from NCLDV. ViralRecall is written in Python 3.5 and is freely available on GitHub: https://github.com/faylward/viralrecall. 
  2. Coevolution is common and frequently governs host–pathogen interaction outcomes. Phenotypes underlying these interactions often manifest as the combined products of the genomes of interacting species, yet traditional quantitative trait mapping approaches ignore these intergenomic interactions. Devil facial tumor disease (DFTD), an infectious cancer afflicting Tasmanian devils (Sarcophilus harrisii), has decimated devil populations due to universal host susceptibility and a fatality rate approaching 100%. Here, we used a recently developed joint genome-wide association study (i.e., co-GWAS) approach, 15 y of mark-recapture data, and 960 genomes to identify intergenomic signatures of coevolution between devils and DFTD. Using a traditional GWA approach, we found that both devil and DFTD genomes explained a substantial proportion of variance in how quickly susceptible devils became infected, although genomic architectures differed across devils and DFTD; the devil genome had fewer loci of large effect whereas the DFTD genome had a more polygenic architecture. Using a co-GWA approach, devil–DFTD intergenomic interactions explained ~3× more variation in how quickly susceptible devils became infected than either genome alone, and the top genotype-by-genotype interactions were significantly enriched for cancer genes and signatures of selection. A devil regulatory mutation was associated with differential expression of a candidate cancer gene and showed putative allele matching effects with two DFTD coding sequence variants. Our results highlight the need to account for intergenomic interactions when investigating host–pathogen (co)evolution and emphasize the importance of such interactions when considering devil management strategies. 
  3. In dry summer months, stream baseflow sourced from groundwater is essential to support aquatic ecosystems and anthropogenic water use. Hydrologic signatures, or metrics describing unique features of streamflow timeseries, are useful for quantifying and predicting these valuable baseflow and groundwater storage resources across continental scales. Hydrologic signatures can be predicted based on catchment attributes summarising climate and landscape and can be used to characterise baseflow and groundwater processes that cannot be directly measured. While past watershed‐scale studies suggest that landscape attributes are important controls on baseflow and storage processes, recent regional‐to‐global scale modelling studies have instead found that landscape attributes have weaker relationships with hydrologic signatures of these processes than expected compared to climate attributes. In this study, we quantify two landscape attributes, average geologic age and the proportion of catchment area covered by wetlands. We investigate if incorporating these additional predictors into existing large‐sample attribute datasets strengthens continental‐scale, empirical relationships between landscape attributes and hydrologic signatures. We quantify 14 hydrologic signatures related to baseflow and groundwater processes in catchments across the contiguous United States, evaluate the relationships between the new catchment attributes and hydrologic signatures with correlation analysis and use the new attributes to predict hydrologic signatures with random forest models. We found that the average geologic age of catchments was a highly influential predictor of hydrologic signatures, especially for signatures describing baseflow magnitude in catchments, and had greater importance than existing attributes of the subsurface. In contrast, we found that the proportion of wetlands in catchments had limited influence on our hydrologic signature predictions. 
    We recommend incorporating catchment geologic age into large‐sample catchment datasets to improve predictions of baseflow and storage hydrologic signatures and processes across continental scales.
  4. Large genomic insertions and deletions are a potent source of functional variation, but are challenging to resolve with short-read sequencing, limiting knowledge of the role of such structural variants (SVs) in human evolution. Here, we used a graph-based method to genotype long-read-discovered SVs in short-read data from diverse human genomes. We then applied an admixture-aware method to identify 220 SVs exhibiting extreme patterns of frequency differentiation – a signature of local adaptation. The top two variants traced to the immunoglobulin heavy chain locus, tagging a haplotype that swept to near fixation in certain southeast Asian populations, but is rare in other global populations. Further investigation revealed evidence that the haplotype traces to gene flow from Neanderthals, corroborating the role of immune-related genes as prominent targets of adaptive introgression. Our study demonstrates how recent technical advances can help resolve signatures of key evolutionary events that remained obscured within technically challenging regions of the genome. 
  5. Borgs are huge extrachromosomal elements (ECE) of anaerobic methane-consuming “Candidatus Methanoperedens” archaea. Here, we used nanopore sequencing to validate published complete genomes curated from short reads and to reconstruct new genomes. 13 complete and four near-complete linear genomes share 40 genes that define a largely syntenous genome backbone. We use these conserved genes to identify new Borgs from peatland soil and to delineate Borg phylogeny, revealing two major clades. Remarkably, Borg genes encoding nanowire-like electron-transferring cytochromes and cell surface proteins are more highly expressed than those of host Methanoperedens, indicating that Borgs augment the Methanoperedens activity in situ. We reconstructed the first complete 4.00 Mbp genome for a Methanoperedens that is inferred to be a Borg host and predicted its methylation motifs, which differ from pervasive TC and CC methylation motifs of the Borgs. Thus, methylation may enable Methanoperedens to distinguish their genomes from those of Borgs. Very high Borg to Methanoperedens ratios and structural predictions suggest that Borgs may be capable of encapsulation. The findings clearly define Borgs as a distinct class of ECE with shared genomic signatures, establish their diversification from a common ancestor with genetic inheritance, and raise the possibility of periodic existence outside of host cells.