Title: SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning
Abstract The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen.
Award ID(s):
2126387
PAR ID:
10378506
Journal Name:
Genome Biology
Volume:
23
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The ability of unevolved amino acid sequences to become biological catalysts was key to the emergence of life on Earth. However, billions of years of evolution separate complex modern enzymes from their simpler early ancestors. To probe how unevolved sequences can develop new functions, we use ultrahigh-throughput droplet microfluidics to screen for phosphoesterase activity amidst a library of more than one million sequences based on a de novo designed 4-helix bundle. Characterization of hits revealed that acquisition of function involved a large jump in sequence space enriching for truncations that removed >40% of the protein chain. Biophysical characterization of a catalytically active truncated protein revealed that it dimerizes into an α-helical structure, with the gain of function accompanied by increased structural dynamics. The identified phosphodiesterase is a manganese-dependent metalloenzyme that hydrolyses a range of phosphodiesters. It is most active towards cyclic AMP, with a rate acceleration of ~10⁹ and a catalytic proficiency of >10¹⁴ M⁻¹, comparable to larger enzymes shaped by billions of years of evolution.
  2. Abstract PULs (polysaccharide utilization loci) are discrete gene clusters of CAZymes (Carbohydrate Active EnZymes) and other genes that work together to digest and utilize carbohydrate substrates. While PULs have been extensively characterized in Bacteroidetes, there exist PULs from other bacterial phyla, as well as archaea and metagenomes, that remain to be catalogued in a database for efficient retrieval. We have developed an online database dbCAN-PUL (http://bcb.unl.edu/dbCAN_PUL/) to display experimentally verified CAZyme-containing PULs from literature with pertinent metadata, sequences, and annotation. Compared to other online CAZyme and PUL resources, dbCAN-PUL has the following new features: (i) batch download of PUL data by target substrate, species/genome, genus, or experimental characterization method; (ii) annotation for each PUL that displays associated metadata such as substrate(s), experimental characterization method(s), and protein sequence information; (iii) links to external annotation pages for CAZymes (CAZy), transporters (UniProt), and other genes; (iv) display of homologous gene clusters in GenBank sequences via the integrated MultiGeneBlast tool; and (v) an integrated BLASTX service available for users to query their sequences against PUL proteins in dbCAN-PUL. With these features, dbCAN-PUL will be an important repository for CAZyme and PUL research, complementing our other web servers and databases (dbCAN2, dbCAN-seq).
  3. BACKGROUND: Genomic surveillance allows identification of circulating SARS-CoV-2 variants. We provide an update on the evolution of SARS-CoV-2 in Rhode Island (RI). METHODS: All publicly available SARS-CoV-2 RI sequences were retrieved from https://www.gisaid.org. Genomic analyses were conducted to identify variants of concern (VOC), variants being monitored (VBM), or non-VOC/non-VBM, and investigate their evolution. RESULTS: Overall, 17,340 SARS-CoV-2 RI sequences were available between 2/2020–5/2022 across five (globally recognized) major waves, including 1,462 (8%) sequences from 36 non-VOC/non-VBM until 5/2021; 10,565 (61%) sequences from 8 VBM between 5/2021–12/2021, most commonly Delta; and 5,313 (31%) sequences from the VOC Omicron from 12/2021 onwards. Genomic analyses demonstrated 71 Delta and 44 Omicron sub-lineages, with occurrence of variant-defining mutations in other variants. CONCLUSION: Statewide SARS-CoV-2 genomic surveillance allows for continued characterization of circulating variants and monitoring of viral evolution, which inform the local health force and guide public health on mitigation efforts against COVID-19. KEYWORDS: COVID-19, SARS-CoV-2, variants, genomic sequencing, Rhode Island
  4. Characterizing computational demand of Cyber-Physical Systems (CPS) is critical for guaranteeing that multiple hard real-time tasks may be scheduled on shared resources without missing deadlines. In a CPS involving repetition such as industrial automation systems found in chemical process control or robotic manufacturing, sensors and actuators used as part of the industrial process may be conditionally enabled (and disabled) as a sequence of repeated steps is executed. In robotic manufacturing, for example, these steps may be the movement of a robotic arm through some trajectories followed by activation of end-effector sensors and actuators at the end of each completed motion. The conditional enabling of sensors and actuators produces a sequence of Monotonically Ascending Execution times (MAE) with lower WCET when the sensors are disabled and higher WCET when enabled. Since these systems may have several predefined steps to follow before repeating the entire sequence, each unique step may result in several consecutive sequences of MAE. The repetition of these unique sequences of MAE results in a repeating WCET sequence. In the absence of an efficient demand characterization technique for repeating WCET sequences composed of subsequences with monotonically increasing execution time, this work proposes a new task model to describe the behavior of real-world systems which generate large repeating WCET sequences with subsequences of monotonically increasing execution times. In comparison to the most applicable current model, the Generalized Multiframe model (GMF), an empirically and theoretically faster method for characterizing the demand is provided. The demand characterization algorithm is evaluated through a case study of a robotic arm and simulation of 10,000 randomly generated tasks where, on average, the proposed approach is 231 and 179 times faster than the state-of-the-art in the case study and simulation, respectively.
  5. Abstract Synthetic DNA motifs form the basis of nucleic acid nanotechnology. The biochemical and biophysical properties of these motifs determine their applications. Here, we present a detailed characterization of switchback DNA, a globally left-handed structure composed of two parallel DNA strands. Compared to a conventional duplex, switchback DNA shows lower thermodynamic stability and requires higher magnesium concentration for assembly but exhibits enhanced biostability against some nucleases. Strand competition and strand displacement experiments show that component sequences have an absolute preference for duplex complements instead of their switchback partners. Further, we hypothesize a potential role for switchback DNA as an alternate structure in sequences containing short tandem repeats. Together with small molecule binding experiments and cell studies, our results open new avenues for switchback DNA in biology and nanotechnology. 