Search for: All records

Creators/Authors contains: "Salzberg, Steven L"


A genome sequence for the threatened whitebark pine

https://doi.org/10.1093/g3journal/jkae061

Neale, David B. ; Zimin, Aleksey V. ; Meltzer, Amy ; Bhattarai, Akriti ; Amee, Maurice ; Figueroa Corona, Laura ; Allen, Brian J. ; Puiu, Daniela ; Wright, Jessica ; De La Torre, Amanda R. ; et al ( March 2024 , G3: Genes, Genomes, Genetics)

Abstract
Whitebark pine (WBP, Pinus albicaulis) is a white pine of subalpine regions in the Western contiguous United States and Canada. WBP has become critically threatened throughout a significant part of its natural range due to mortality from the introduced fungal pathogen white pine blister rust (WPBR, Cronartium ribicola) and additional threats from mountain pine beetle (Dendroctonus ponderosae), wildfire, and maladaptation due to changing climate. Vast acreages of WBP have suffered nearly complete mortality. Genomic technologies can contribute to a faster, more cost-effective approach to the traditional practices of identifying disease-resistant, climate-adapted seed sources for restoration. With deep-coverage Illumina short reads of haploid megagametophyte tissue and Oxford Nanopore long reads of diploid needle tissue, followed by a hybrid, multistep assembly approach, we produced a final assembly containing 27.6 Gb of sequence in 92,740 contigs (N50 537,007 bp) and 34,716 scaffolds (N50 2.0 Gb). Approximately 87.2% (24.0 Gb) of total sequence was placed on the 12 WBP chromosomes. Annotation yielded 25,362 protein-coding genes, and over 77% of the genome was characterized as repeats. WBP has demonstrated the greatest variation in resistance to WPBR among the North American white pines. Candidate genes for quantitative resistance include disease resistance genes known as nucleotide-binding leucine-rich repeat receptors (NLRs). A combination of protein domain alignments and direct genome scanning was employed to fully describe the 3 subclasses of NLRs. Our high-quality reference sequence and annotation provide a marked improvement in NLR identification compared to previous assessments that leveraged de novo-assembled transcriptomes.
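The contig and scaffold N50 values quoted in this abstract follow the usual assembly-statistics definition: the length L such that sequences of length L or longer hold at least half of the assembled bases. The sketch below illustrates that definition only; the function name `n50` and the toy lengths are hypothetical and are not taken from the paper or its pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Standard definition (assumed here): sort lengths from longest to
    shortest and return the first length at which the running total
    reaches at least half of the summed length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


# Toy example with made-up lengths (not WBP data).
print(n50([2, 3, 4, 5, 6]))
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness, which is why the abstract also reports the fraction of sequence placed on chromosomes.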

Investigating open reading frames in known and novel transcripts using ORFanage

https://doi.org/10.1038/s43588-023-00496-1

Varabyou, Ales ; Erdogdu, Beril ; Salzberg, Steven L. ; Pertea, Mihaela ( July 2023 , Nature Computational Science)

Free, publicly-accessible full text available July 31, 2024
CHESS 3: an improved, comprehensive catalog of human genes and transcripts based on large-scale expression data, phylogenetic analysis, and protein structure

https://doi.org/10.1186/s13059-023-03088-4

Varabyou, Ales ; Sommer, Markus J. ; Erdogdu, Beril ; Shinder, Ida ; Minkin, Ilia ; Chao, Kuan-Hao ; Park, Sukhwan ; Heinz, Jakob ; Pockrandt, Christopher ; Shumate, Alaina ; et al ( October 2023 , Genome Biology)

Abstract
CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at http://ccb.jhu.edu/chess.

The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual

https://doi.org/10.1093/g3journal/jkac321

Chao, Kuan-Hao ; Zimin, Aleksey V. ; Pertea, Mihaela ; Salzberg, Steven L. ( January 2023 , G3: Genes, Genomes, Genetics)
Emerson, J. J. (Ed.)

Abstract
We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein-coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.

The SAMBA tool uses long reads to improve the contiguity of genome assemblies

https://doi.org/10.1371/journal.pcbi.1009860

Zimin, Aleksey V. ; Salzberg, Steven L. ( February 2022 , PLOS Computational Biology)
Shao, Mingfu (Ed.)
Third-generation sequencing technologies can generate very long reads with relatively high error rates. The lengths of the reads, which sometimes exceed one million bases, make them invaluable for resolving complex repeats that cannot be assembled using shorter reads. Many high-quality genome assemblies have already been produced, curated, and annotated using the previous generation of sequencing data, and full re-assembly of these genomes with long reads is not always practical or cost-effective. One strategy to upgrade existing assemblies is to generate additional coverage using long-read data, and add that to the previously assembled contigs. SAMBA is a tool that is designed to scaffold and gap-fill existing genome assemblies with additional long-read data, resulting in substantially greater contiguity. SAMBA is the only tool of its kind that also computes and fills in the sequence for all spanned gaps in the scaffolds, yielding much longer contigs. Here we compare SAMBA to several similar tools capable of re-scaffolding assemblies using long-read data, and we show that SAMBA yields better contiguity and introduces fewer errors than competing methods. SAMBA is open-source software that is distributed at https://github.com/alekseyzimin/masurca .
Full Text Available
PhyloCSF++: a fast and user-friendly implementation of PhyloCSF with annotation tools

https://doi.org/10.1093/bioinformatics/btab756

Pockrandt, Christopher ; Steinegger, Martin ; Salzberg, Steven L. ( November 2021 , Bioinformatics)
Martelli, Pier Luigi (Ed.)

Abstract
Summary
PhyloCSF++ is an efficient and parallelized C++ implementation of the popular PhyloCSF method to distinguish protein-coding and non-coding regions in a genome based on multiple sequence alignments (MSAs). It can score alignments or produce browser tracks for entire genomes in the wig file format. Additionally, PhyloCSF++ annotates coding sequences in GFF/GTF files using precomputed tracks or computes and scores MSAs on the fly with MMseqs2.
Availability and implementation
PhyloCSF++ is released under the AGPLv3 license. Binaries and source code are available at https://github.com/cpockrandt/PhyloCSFpp. The software can be installed through bioconda. A variety of tracks can be accessed through ftp://ftp.ccb.jhu.edu/pub/software/phylocsfpp/.

Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments

https://doi.org/10.1101/gr.266213.120

Varabyou, Ales ; Salzberg, Steven L. ; Pertea, Mihaela ( February 2021 , Genome Research)
Full Text Available
Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank

https://doi.org/10.1186/s13059-020-02023-1

Steinegger, Martin ; Salzberg, Steven L. ( December 2020 , Genome Biology)

Full Text Available
Ultrafast and accurate 16S rRNA microbial community analysis using Kraken 2

https://doi.org/10.1186/s40168-020-00900-2

Lu, Jennifer ; Salzberg, Steven L. ( December 2020 , Microbiome)

Full Text Available
SkewIT: The Skew Index Test for large-scale GC Skew analysis of bacterial genomes

https://doi.org/10.1371/journal.pcbi.1008439

Lu, Jennifer ; Salzberg, Steven L. ( December 2020 , PLOS Computational Biology)
Rzhetsky, Andrey (Ed.)
GC skew is a phenomenon observed in many bacterial genomes, wherein the two replication strands of the same chromosome contain different proportions of guanine and cytosine nucleotides. Here we demonstrate that this phenomenon, which was first discovered in the mid-1990s, can be used today as an analysis tool for the 15,000+ complete bacterial genomes in NCBI’s Refseq library. In order to analyze all 15,000+ genomes, we introduce a new method, SkewIT (Skew Index Test), that calculates a single metric representing the degree of GC skew for a genome. Using this metric, we demonstrate how GC skew patterns are conserved within certain bacterial phyla, e.g. Firmicutes, but show different patterns in other phylogenetic groups such as Actinobacteria. We also discovered that outlier values of SkewIT highlight potential bacterial mis-assemblies. Using our newly defined metric, we identify multiple mis-assembled chromosomal sequences in previously published complete bacterial genomes. We provide a SkewIT web app https://jenniferlu717.shinyapps.io/SkewIT/ that calculates SkewI for any user-provided bacterial sequence. The web app also provides an interactive interface for the data generated in this paper, allowing users to further investigate the SkewI values and thresholds of the Refseq-97 complete bacterial genomes. Individual scripts for analysis of bacterial genomes are provided in the following repository: https://github.com/jenniferlu717/SkewIT .
Full Text Available
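
The GC skew described in the SkewIT abstract above is commonly quantified per window as (G - C) / (G + C). The sketch below computes that conventional windowed value only. It is not the SkewI metric introduced in the paper, and the function name `gc_skew`, the window size, and the toy sequence are all hypothetical choices for illustration.

```python
def gc_skew(seq, window=1000):
    """Return the conventional GC skew, (G - C) / (G + C), for each
    consecutive non-overlapping window of `seq`.

    Windows with no G or C are reported as 0.0. This is the textbook
    windowed skew, not the SkewI statistic from the SkewIT paper.
    """
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Toy sequence: a G-rich window followed by a C-rich window.
print(gc_skew("GGGGCCCC", window=4))
```

In a typical bacterial chromosome the sign of this value tends to flip near the replication origin and terminus, which is the pattern the abstract refers to when it describes skew differing between the two replication strands.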
