NSF PAR Search | NSF Public Access Repository

The first gapless, reference-quality, fully annotated genome from a Southern Han Chinese individual

https://doi.org/10.1093/g3journal/jkac321

Chao, Kuan-Hao; Zimin, Aleksey V.; Pertea, Mihaela; Salzberg, Steven L. (January 2023, G3: Genes, Genomes, Genetics)
Emerson, J. J. (Ed.)

Abstract: We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein-coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.
Investigating open reading frames in known and novel transcripts using ORFanage

https://doi.org/10.1038/s43588-023-00496-1

Varabyou, Ales; Erdogdu, Beril; Salzberg, Steven L.; Pertea, Mihaela (July 2023, Nature Computational Science)

Improved transcriptome assembly using a hybrid of long and short reads with StringTie

https://doi.org/10.1371/journal.pcbi.1009730

Shumate A, Wong B (June 2022, PLoS Computational Biology)

Short-read RNA sequencing and long-read RNA sequencing each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they are rarely able to span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. Here we present a new release of StringTie that performs hybrid-read assembly. By taking advantage of the strengths of both long and short reads, hybrid-read assembly with StringTie is more accurate than long-read only or short-read only assembly, and on some datasets it can more than double the number of correctly assembled transcripts, while obtaining substantially higher precision than the long-read data assembly alone. Here we demonstrate the improved accuracy on simulated data and real data from Arabidopsis thaliana, Mus musculus, and human. We also show that hybrid-read assembly is more accurate than correcting long reads prior to assembly while also being substantially faster. StringTie is freely available as open source software at https://github.com/gpertea/stringtie.
TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets

https://doi.org/10.1093/bioinformatics/btab342

Varabyou, Ales; Pertea, Geo; Pockrandt, Christopher; Pertea, Mihaela (May 2021, Bioinformatics)
Ponty, Yann (Ed.)
Summary: Although the ability to programmatically summarize and visually inspect sequencing data is an integral part of genome analysis, currently available methods are not capable of handling large numbers of samples. In particular, making a visual comparison of transcriptional landscapes between two sets of thousands of RNA-seq samples is limited by available computational resources, which can be overwhelmed due to the sheer size of the data. In this work, we present TieBrush, a software package designed to process very large sequencing datasets (RNA, whole-genome, exome, etc.) into a form that enables quick visual and computational inspection. TieBrush can also be used as a method for aggregating data for downstream computational analysis, and is compatible with most software tools that take aligned reads as input. Availability and implementation: TieBrush is provided as a C++ package under the MIT License. Precompiled binaries, source code and example data are available on GitHub (https://github.com/alevar/tiebrush). Supplementary information: Supplementary data are available at Bioinformatics online.
Effects of transcriptional noise on estimates of gene and transcript expression in RNA sequencing experiments

https://doi.org/10.1101/gr.266213.120

Varabyou, Ales; Salzberg, Steven L.; Pertea, Mihaela (February 2021, Genome Research)
GFF Utilities: GffRead and GffCompare

https://doi.org/10.12688/f1000research.23297.1

Pertea, Geo; Pertea, Mihaela (January 2020, F1000Research)

Summary: GTF (Gene Transfer Format) and GFF (General Feature Format) are popular file formats used by bioinformatics programs to represent and exchange information about various genomic features, such as gene and transcript locations and structure. GffRead and GffCompare are open source programs that provide extensive and efficient solutions to manipulate files in a GTF or GFF format. While GffRead can convert, sort, filter, transform, or cluster genomic features, GffCompare can be used to compare and merge different gene annotations. Availability and implementation: GFF utilities are implemented in C++ for Linux and OS X and released as open source under an MIT license (https://github.com/gpertea/gffread, https://github.com/gpertea/gffcompare).
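For readers unfamiliar with the format this abstract refers to, the following is a minimal Python sketch of how one GFF3 record can be read. It is not taken from GffRead or GffCompare, which are C++ programs; the names `Feature` and `parse_gff3_line` and the sample record are hypothetical and exist only for illustration. It assumes the standard nine tab-separated columns and GFF3-style `key=value` attributes; GTF uses a different attribute syntax (`key "value";`) that this sketch does not handle.

```python
from typing import NamedTuple, Optional


class Feature(NamedTuple):
    """One genomic feature from a GFF3 line (hypothetical illustrative type)."""
    seqid: str
    source: str
    type: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: dict


def parse_gff3_line(line: str) -> Optional[Feature]:
    """Parse a single GFF3 line; return None for blank and comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # Column 9 holds semicolon-separated key=value pairs in GFF3.
    attributes = {}
    for field in attrs.split(";"):
        field = field.strip()
        if not field:
            continue
        key, _, value = field.partition("=")
        attributes[key] = value

    return Feature(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),  # "." means no value
        strand,
        None if phase == "." else int(phase),
        attributes,
    )


# Hypothetical record, made up for illustration.
record = "chr1\texample\tmRNA\t100\t500\t.\t+\t.\tID=tx1;Parent=gene1"
feature = parse_gff3_line(record)
print(feature.type, feature.start, feature.end, feature.attributes["Parent"])
```

The `Parent` attribute is what links a transcript to its gene and an exon to its transcript, which is the structure that tools operating on annotations need to reconstruct.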
Transcriptome assembly from long-read RNA-seq alignments with StringTie2

https://doi.org/10.1186/s13059-019-1910-1

Kovaka, Sam; Zimin, Aleksey V.; Pertea, Geo M.; Razaghi, Roham; Salzberg, Steven L.; Pertea, Mihaela (December 2019, Genome Biology)

Human contamination in bacterial genomes has created thousands of spurious proteins

https://doi.org/10.1101/gr.245373.118

Breitwieser, Florian P; Pertea, Mihaela; Zimin, Aleksey; Salzberg, Steven L (January 2019, Genome Research)

