NSF PAR Search | NSF Public Access Repository


AncestryGrapher toolkit: Python command-line pipelines to visualize global- and local-ancestry inferences from the RFMIX version 2 software

https://doi.org/10.1093/bioinformatics/btae616

Lisi, Alessandro; Campbell, Michael C.; Kendziorski, Christina (ed.) (October 2024, Bioinformatics)

Abstract
Summary: Admixture is a fundamental process that has shaped levels and patterns of genetic variation in human populations. RFMIX version 2 (RFMIX2) utilizes a robust modeling approach to identify the genetic ancestries in admixed populations. However, this software does not have a built-in method to visually summarize the results of analyses. Here, we introduce the AncestryGrapher toolkit, which converts the numerical output of RFMIX2 into graphical representations of global and local ancestry (i.e. the per-individual ancestry components and the genetic ancestry along chromosomes, respectively).
Results: To demonstrate the utility of our methods, we applied the AncestryGrapher toolkit to visualize the global and local ancestry of individuals in the North African Mozabite Berber population from the Human Genome Diversity Panel. Our results showed that the Mozabite Berbers derived their ancestry from the Middle East, Europe, and sub-Saharan Africa (global ancestry). We also found that the population origin of ancestry varied considerably along chromosomes (local ancestry). For example, we observed variance in local ancestry in the genomic region on Chromosome 2 containing the regulatory sequence in the MCM6 gene associated with lactase persistence, a human trait tied to the cultural development of adult milk consumption. Overall, the AncestryGrapher toolkit facilitates the exploration, interpretation, and reporting of ancestry patterns in human populations.
Availability and implementation: The AncestryGrapher toolkit is free and open source on https://github.com/alisi1989/RFmix2-Pipeline-to-plot.
The Naïve Bayes classifier++ for metagenomic taxonomic classification—query evaluation

https://doi.org/10.1093/bioinformatics/btae743

Duan, Haozhe Neil; Hearne, Gavin; Polikar, Robi; Rosen, Gail L.; Kendziorski, Christina (ed.) (December 2024, Bioinformatics)

Abstract
Motivation: This study examines the query performance of the NBC++ (Incremental Naive Bayes Classifier) program for variations in canonicality, k-mer size, databases, and input sample data size. We demonstrate that both NBC++ and Kraken2 are influenced by database depth, with macro measures improving as depth increases. However, fully capturing the diversity of life, especially viruses, remains a challenge.
Results: NBC++ can competitively profile the superkingdom content of metagenomic samples using a small training database. NBC++ spends less time training and can use a fraction of the memory that Kraken2 requires, but at the cost of long query times. Major NBC++ enhancements include accommodating canonical k-mer storage (leading to significant storage savings) and adaptable, optimized memory allocation that accelerates query analysis and enables the software to be run on nearly any system. Additionally, the output now includes log-likelihood values for each training genome, providing users with valuable confidence information.
Availability and implementation: Source code and Dockerfile are available at http://github.com/EESI/Naive_Bayes.
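Illustrative note: the storage saving attributed to canonical k-mers comes from counting a k-mer and its reverse complement under a single key. The sketch below shows that general idea only; it is not taken from the NBC++ source, and the function names (`reverse_complement`, `canonical`, `count_kmers`) are hypothetical.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_kmers(seq: str, k: int, use_canonical: bool = True) -> Counter:
    """Count k-mers in seq, optionally collapsing each onto its canonical form.

    Windows containing characters other than A, C, G, T are skipped
    (an assumption made for this sketch).
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue
        counts[canonical(kmer) if use_canonical else kmer] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGT"
    print(len(count_kmers(seq, 2, use_canonical=False)))  # distinct raw 2-mers
    print(len(count_kmers(seq, 2, use_canonical=True)))   # distinct canonical 2-mers
```

With canonical counting, "AC" and "GT" share one key, so fewer distinct entries need to be stored.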
A supervised Bayesian factor model for the identification of multi-omics signatures

https://doi.org/10.1093/bioinformatics/btae202

Gygi, Jeremy P.; Konstorum, Anna; Pawar, Shrikant; Aron, Edel; Kleinstein, Steven H.; Guan, Leying; Kendziorski, Christina (ed.) (April 2024, Bioinformatics)

Abstract
Motivation: Predictive biological signatures provide utility as biomarkers for disease diagnosis and prognosis, as well as prediction of responses to vaccination or therapy. These signatures are identified from high-throughput profiling assays through a combination of dimensionality reduction and machine learning techniques. The genes, proteins, metabolites, and other biological analytes that compose signatures also generate hypotheses on the underlying mechanisms driving biological responses, thus improving biological understanding. Dimensionality reduction is a critical step in signature discovery to address the large number of analytes in omics datasets, especially for multi-omics profiling studies with tens of thousands of measurements. Latent factor models, which can account for the structural heterogeneity across diverse assays, effectively integrate multi-omics data and reduce dimensionality to a small number of factors that capture correlations and associations among measurements. These factors provide biologically interpretable features for predictive modeling. However, multi-omics integration and predictive modeling are generally performed independently in sequential steps, leading to suboptimal factor construction. Combining these steps can yield better multi-omics signatures that are more predictive while still being biologically meaningful.
Results: We developed a supervised variational Bayesian factor model that extracts multi-omics signatures from high-throughput profiling datasets that can span multiple data types. Signature-based multiPle-omics intEgration via lAtent factoRs (SPEAR) adaptively determines factor rank, emphasis on factor structure, data relevance and feature sparsity. The method improves the reconstruction of underlying factors in synthetic examples and prediction accuracy of coronavirus disease 2019 severity and breast cancer tumor subtypes.
Availability and implementation: SPEAR is a publicly available R-package hosted at https://bitbucket.org/kleinstein/SPEAR.
CNAsim: improved simulation of single-cell copy number profiles and DNA-seq data from tumors

https://doi.org/10.1093/bioinformatics/btad434

Weiner, Samson; Bansal, Mukul S.; Kendziorski, Christina (ed.) (July 2023, Bioinformatics)

Abstract
Summary: CNAsim is a software package for improved simulation of single-cell copy number alteration (CNA) data from tumors. CNAsim can be used to efficiently generate single-cell copy number profiles for thousands of simulated tumor cells under a more realistic error model and a broader range of possible CNA mechanisms compared with existing simulators. The error model implemented in CNAsim accounts for the specific biases of single-cell sequencing that lead to read count fluctuation and poor resolution of CNA detection. For improved realism over existing simulators, CNAsim can (i) generate WGD, whole-chromosomal CNAs, and chromosome-arm CNAs, (ii) simulate subclonal population structure defined by the accumulation of chromosomal CNAs, and (iii) dilute the sampled cell population with both normal diploid cells and pseudo-diploid cells. The software can also generate DNA-seq data for sampled cells.
Availability and implementation: CNAsim is written in Python and is freely available as open-source software from https://github.com/samsonweiner/CNAsim.
ExplorATE: a new pipeline to explore active transposable elements from RNA-seq data

https://doi.org/10.1093/bioinformatics/btac354

Femenias, Martin M.; Santos, Juan C.; Sites, Jack W., Jr.; Avila, Luciano J.; Morando, Mariana; Kendziorski, Christina (ed.) (May 2022, Bioinformatics)

Abstract
Motivation: Transposable elements (TEs) are ubiquitous in genomes and many remain active. TEs comprise an important fraction of the transcriptomes with potential effects on the host genome, either by generating deleterious mutations or promoting evolutionary novelties. However, their functional study is limited by the difficulty in their identification and quantification, particularly in non-model organisms.
Results: We developed a new pipeline [explore active transposable elements (ExplorATE)] implemented in R and bash that allows the quantification of active TEs in both model and non-model organisms. ExplorATE creates TE-specific indexes and uses Selective Alignment (SA) to filter out co-transcribed transposons within genes based on alignment scores. Moreover, our software incorporates Wicker-like criteria to refine a set of target TEs and avoid spurious mapping. Based on simulated and real data, we show that the SA strategy adopted by ExplorATE achieved better estimates of non-co-transcribed elements than other available alignment-based or mapping-based software. ExplorATE results showed high congruence with alignment-based tools with and without a reference genome, yet ExplorATE required less execution time. Likewise, ExplorATE expands and complements most previous TE analyses by incorporating the co-transcription and multi-mapping effects during quantification, and provides a seamless integration with other downstream tools within the R environment.
Availability and implementation: Source code is available at https://github.com/FemeniasM/ExplorATEproject and https://github.com/FemeniasM/ExplorATE_shell_script. Data available on request.
Supplementary information: Supplementary data are available at Bioinformatics online.