Title: PCA outperforms popular hidden variable inference methods for molecular QTL mapping
Abstract

Background: Estimating and accounting for hidden variables is widely practiced as an important step in molecular quantitative trait locus (molecular QTL, henceforth "QTL") analysis for improving the power of QTL identification. However, few benchmark studies have been performed to evaluate the efficacy of the various methods developed for this purpose.

Results: Here we benchmark popular hidden variable inference methods, including surrogate variable analysis (SVA), probabilistic estimation of expression residuals (PEER), and hidden covariates with prior (HCP), against principal component analysis (PCA), a well-established dimension reduction and factor discovery method, via 362 synthetic and 110 real data sets. We show that PCA not only underlies the statistical methodology behind the popular methods but is also orders of magnitude faster, better performing, and much easier to interpret and use.

Conclusions: To help researchers use PCA in their QTL analysis, we provide an R package along with a detailed guide, both of which are freely available at https://github.com/heatherjzhou/PCAForQTL. We believe that using PCA rather than SVA, PEER, or HCP will substantially improve and simplify hidden variable inference in QTL mapping as well as increase the transparency and reproducibility of QTL research.
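The approach the abstract recommends, computing the top principal components of the expression matrix and supplying them as hidden covariates in QTL association testing, can be sketched generically as follows. This is a minimal illustration on simulated data using only NumPy; it does not reproduce the PCAForQTL package's actual API, and the variable names, the simulated matrix, and the choice of k = 10 components are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_genes, k = 100, 500, 10  # k: number of hidden covariates to infer

# Simulated gene expression matrix (samples x genes), column-centered,
# standing in for a real normalized expression data set.
Y = rng.normal(size=(n_samples, n_genes))
Y -= Y.mean(axis=0)

# PCA via SVD: the columns of U scaled by the singular values are the
# sample-level principal components (scores).
U, S, Vt = np.linalg.svd(Y, full_matrices=False)
covariates = U[:, :k] * S[:k]  # top-k PCs, used as hidden covariates

# One simple way to use them: regress the covariates out of each gene's
# expression before (or jointly within) the per-variant association tests.
beta, *_ = np.linalg.lstsq(covariates, Y, rcond=None)
Y_resid = Y - covariates @ beta  # residual expression, covariates removed
```

In practice, the residualization step is typically folded into the QTL regression itself (expression ~ genotype + covariates), and the number of components k is chosen by an elbow-type criterion rather than fixed in advance.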
@article{osti_10373537,
place = {Country unknown/Code not available},
title = {PCA outperforms popular hidden variable inference methods for molecular QTL mapping},
url = {https://par.nsf.gov/biblio/10373537},
DOI = {10.1186/s13059-022-02761-4},
  abstractNote = {Background: Estimating and accounting for hidden variables is widely practiced as an important step in molecular quantitative trait locus (molecular QTL, henceforth "QTL") analysis for improving the power of QTL identification. However, few benchmark studies have been performed to evaluate the efficacy of the various methods developed for this purpose. Results: Here we benchmark popular hidden variable inference methods, including surrogate variable analysis (SVA), probabilistic estimation of expression residuals (PEER), and hidden covariates with prior (HCP), against principal component analysis (PCA), a well-established dimension reduction and factor discovery method, via 362 synthetic and 110 real data sets. We show that PCA not only underlies the statistical methodology behind the popular methods but is also orders of magnitude faster, better performing, and much easier to interpret and use. Conclusions: To help researchers use PCA in their QTL analysis, we provide an R package along with a detailed guide, both of which are freely available at https://github.com/heatherjzhou/PCAForQTL. We believe that using PCA rather than SVA, PEER, or HCP will substantially improve and simplify hidden variable inference in QTL mapping as well as increase the transparency and reproducibility of QTL research.},
journal = {Genome Biology},
volume = {23},
number = {1},
publisher = {Springer Science + Business Media},
author = {Zhou, Heather J. and Li, Lei and Li, Yumei and Li, Wei and Li, Jingyi Jessica},
}