Title: A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data
Abstract
Background
Low-depth sequencing allows researchers to increase sample size at the expense of lower accuracy. To incorporate uncertainties while maintaining statistical power, we introduce a nonlinear dimension reduction method to analyze the population structure of low-depth sequencing data.
Results
The method optimizes the choice of nonlinear transformations of dosages to maximize the Ky Fan norm of the covariance matrix. The transformation incorporates the uncertainty in distinguishing heterozygotes from common homozygotes at loci with a rare allele, and is more linear when both variants are common.
Conclusions
We apply the method to samples from two indigenous Siberian populations and accurately reveal hidden population structure using only a single chromosome. The package is available at https://github.com/yiwenstat/MCPCA_PopGen.
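The objective named above can be stated compactly: the Ky Fan k-norm of a matrix is the sum of its k largest singular values. Below is a minimal NumPy sketch of that quantity on a toy dosage matrix (variable names are hypothetical; this is not the MCPCA_PopGen implementation):

```python
import numpy as np

def ky_fan_norm(cov, k):
    """Ky Fan k-norm: the sum of the k largest singular values.

    For a covariance matrix (symmetric positive semi-definite) these
    coincide with the k largest eigenvalues.
    """
    singular_values = np.linalg.svd(cov, compute_uv=False)  # sorted descending
    return float(np.sum(singular_values[:k]))

# Toy example: covariance of centered genotype dosages (0/1/2), samples x loci.
rng = np.random.default_rng(0)
dosages = rng.integers(0, 3, size=(100, 50)).astype(float)
dosages -= dosages.mean(axis=0)                 # center each locus
cov = dosages.T @ dosages / (dosages.shape[0] - 1)
print(ky_fan_norm(cov, k=5))                    # the quantity being maximized
```

The method then searches over nonlinear transformations of the dosages to make this quantity as large as possible, concentrating variance into the leading components.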
Background
Genetic barcoding provides a high-throughput way to simultaneously track the frequencies of large numbers of competing and evolving microbial lineages. However, making inferences about the nature of the evolution that is taking place remains a difficult task.
Results
Here we describe an algorithm for the inference of fitness effects and establishment times of beneficial mutations from barcode sequencing data, which builds upon a Bayesian inference method by enforcing self-consistency between the population mean fitness and the individual effects of mutations within lineages. By testing our inference method on a simulation of 40,000 barcoded lineages evolving in serial batch culture, we find that this new method outperforms its predecessor, identifying more adaptive mutations and more accurately inferring their mutational parameters.
Conclusion
Our new algorithm is particularly suited to inference of mutational parameters when read depth is low. We have made Python code for our serial dilution evolution simulations, as well as both the old and new inference methods, available on GitHub (https://github.com/FangfeiLi05/FitMut2), in the hope that it can find broader use by the microbial evolution community.
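The self-consistency idea above can be pictured with a toy deterministic model (an illustration of the principle only, not the FitMut2 algorithm): in a simple model, d log f_i/dt = s_i - xbar(t), so lineage fitnesses and the frequency-weighted population mean fitness can be iterated to a mutual fixed point:

```python
import numpy as np

def self_consistent_fitness(freqs, n_iter=50):
    """Iterate lineage fitnesses s_i and mean fitness xbar(t) to a fixed point.

    freqs: (lineages, timepoints) array of barcode frequencies.
    """
    n, T = freqs.shape
    t = np.arange(T, dtype=float)
    logf = np.log(freqs)
    s = np.zeros(n)
    for _ in range(n_iter):
        xbar = freqs.T @ s / freqs.sum(axis=0)  # frequency-weighted mean fitness
        # Cumulative mean fitness (trapezoid rule), added back so the slope
        # of log f_i + integral(xbar) estimates s_i.
        cum_xbar = np.concatenate(([0.0], np.cumsum((xbar[1:] + xbar[:-1]) / 2)))
        s = np.polyfit(t, (logf + cum_xbar).T, 1)[0]  # slope per lineage
    return s

# Demo: two lineages with true fitnesses 0.0 and 0.1 per cycle.
tt = np.arange(8, dtype=float)
raw = np.vstack([np.exp(0.0 * tt), np.exp(0.1 * tt)])
est = self_consistent_fitness(raw / raw.sum(axis=0))
print(est[1] - est[0])  # recovers the fitness difference of ~0.1
```

The actual method works with noisy read counts in a Bayesian framework; the sketch only shows why the mean fitness and the individual effects must be solved for jointly.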
Clark, Lindsay V.; Mays, Wittney; Lipka, Alexander E.; Sacks, Erik J. (BMC Bioinformatics)
Abstract
Background
Given the economic and environmental importance of allopolyploids and other species with highly duplicated genomes, there is a need for methods to distinguish paralogs, i.e. duplicate sequences within a genome, from Mendelian loci, i.e. single copy sequences that pair at meiosis. The ratio of observed to expected heterozygosity is an effective tool for filtering loci but requires genotyping to be performed first at a high computational cost, whereas counting the number of sequence tags detected per genotype is computationally quick but very ineffective in inbred or polyploid populations. Therefore, new methods are needed for filtering paralogs.
Results
We introduce a novel statistic, Hind/HE, that uses the probability that two reads sampled from a genotype will belong to different alleles, instead of observed heterozygosity. The expected value of Hind/HE is the same across all loci in a dataset, regardless of read depth or allele frequency. In contrast to methods based on observed heterozygosity, it can be estimated and used for filtering loci prior to genotype calling. In addition to filtering paralogs, it can be used to filter loci with null alleles or high overdispersion, and to identify individuals with unexpected ploidy or hybrid status. We demonstrate that the statistic is useful at read depths as low as five to ten, well below the depth needed for accurate genotype calling in polyploid and outcrossing species.
Conclusions
Our methodology for estimating Hind/HE across loci and individuals, as well as determining reasonable thresholds for filtering loci, is implemented in polyRAD v1.6, available at https://github.com/lvclark/polyRAD. In large sequencing datasets, we anticipate that the ability to filter markers and identify problematic individuals prior to genotype calling will save researchers considerable computational time.
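The definition above suggests a direct estimator: for each individual, compute the probability that two reads drawn without replacement carry different alleles, average across individuals, and divide by the expected heterozygosity from population allele frequencies. A sketch from a read-count matrix (not the polyRAD implementation):

```python
import numpy as np

def hind_he(counts):
    """Hind/HE for one locus from an (individuals x alleles) read-count matrix."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)
    keep = n >= 2                                 # two reads needed to draw a pair
    c, n = counts[keep], n[keep]
    # Hind: per individual, P(two reads drawn without replacement differ)
    hind = (n ** 2 - (c ** 2).sum(axis=1)) / (n * (n - 1))
    p = counts.sum(axis=0) / counts.sum()         # population allele frequencies
    he = 1.0 - (p ** 2).sum()                     # expected heterozygosity
    return hind.mean() / he

# One heterozygous diploid individual with a 5/5 read split:
print(hind_he([[5, 5]]))
```

For a well-behaved Mendelian locus in a non-inbred diploid population the expectation is about 1/2; paralogous loci inflate the ratio, which is what makes it useful for filtering.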
Watowich, Marina M.; Chiou, Kenneth L.; Graves, Brian; Montague, Michael J.; Brent, Lauren J. N.; Higham, James P.; Horvath, Julie E.; Lu, Amy; Martinez, Melween I.; Platt, Michael L.; et al. (Molecular Ecology Resources)
Abstract
Monitoring genetic diversity in wild populations is a central goal of ecological and evolutionary genetics and is critical for conservation biology. However, genetic studies of nonmodel organisms generally lack access to species‐specific genotyping methods (e.g. array‐based genotyping) and must instead use sequencing‐based approaches. Although costs are decreasing, high‐coverage whole‐genome sequencing (WGS), which produces the highest confidence genotypes, remains expensive. More economical reduced representation sequencing approaches fail to capture much of the genome, which can hinder downstream inference. Low‐coverage WGS combined with imputation using a high‐confidence reference panel is a cost‐effective alternative, but the accuracy of genotyping using low‐coverage WGS and imputation in nonmodel populations is still largely uncharacterized. Here, we empirically tested the accuracy of low‐coverage sequencing (0.1–10×) and imputation in two natural populations, one with a large (n = 741) reference panel, rhesus macaques (Macaca mulatta), and one with a smaller (n = 68) reference panel, gelada monkeys (Theropithecus gelada). Using samples sequenced to coverage as low as 0.5×, we could impute genotypes at >95% of the sites in the reference panel with high accuracy (median r2 ≥ 0.92). We show that low‐coverage imputed genotypes can reliably calculate genetic relatedness and population structure. Based on these data, we also provide best practices and recommendations for researchers who wish to deploy this approach in other populations, with all code available on GitHub (https://github.com/mwatowich/LoCSI‐for‐non‐model‐species). Our results endorse accurate and effective genotype imputation from low‐coverage sequencing, enabling the cost‐effective generation of population‐scale genetic datasets necessary for tackling many pressing challenges of wildlife conservation.
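The accuracy metric reported above, r2, is the squared Pearson correlation between true and imputed allele dosages, computed per site and summarized by its median across sites. A minimal sketch with made-up numbers:

```python
import numpy as np

def imputation_r2(true_dosage, imputed_dosage):
    """Squared Pearson correlation between true and imputed allele dosages."""
    r = np.corrcoef(true_dosage, imputed_dosage)[0, 1]
    return r * r

# Hypothetical check: rows are sites, columns are individuals.
true_g = np.array([[0, 1, 2, 1], [2, 2, 1, 0]], dtype=float)
imp_g = np.array([[0.1, 0.9, 1.8, 1.2], [1.9, 2.0, 0.8, 0.2]])
site_r2 = [imputation_r2(a, b) for a, b in zip(true_g, imp_g)]
print(np.median(site_r2))
```

In practice the true dosages come from held-out high-coverage genotypes, and imputed dosages from the low-coverage pipeline; a median r2 of 0.92 means imputation explains over 90% of dosage variance at a typical site.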
van Boheemen, Lotte A.; Bou‐Assi, Sarah; Uesugi, Akane; Hodgins, Kathryn A. (Ecology and Evolution)
Abstract
Rapid adaptation can aid invasive populations in their competitive success. Resource allocation trade‐off hypotheses predict higher resource availability or the lack of natural enemies in introduced ranges allow for increased growth and reproduction, thus contributing to invasive success. Evidence for such hypotheses is however equivocal and tests among multiple ranges over productivity gradients are required to provide a better understanding of the general applicability of these theories.
Using common gardens, we investigated the adaptive divergence of various constitutive and inducible defence‐related traits between the native North American and introduced European and Australian ranges, while controlling for divergence due to latitudinal trait clines, individual resource budgets, and population differentiation, using >11,000 SNPs.
Rapid, repeated clinal adaptation in defence‐related traits was apparent despite distinct demographic histories. We also identified divergence among ranges in some defence‐related traits, although differences in energy budgets among ranges may explain some, but not all, of this divergence. We did not identify a general reduction in defence in concert with an increase in growth among the multiple introduced ranges, as predicted by trade‐off hypotheses.
Synthesis: The rapid spread of invasive species is affected by a multitude of factors, likely including adaptation to climate and escape from natural enemies. Unravelling the mechanisms underlying invasives' success enhances understanding of eco‐evolutionary theory and is essential to inform management strategies in the face of ongoing climate change.
OPEN RESEARCH BADGES
This article has been awarded Open Materials, Open Data, and Preregistered Research Designs Badges. All materials and data are publicly accessible via the Open Science Framework at https://doi.org/10.6084/m9.figshare.8028875.v1, https://github.com/lotteanna/defence_adaptation, and https://doi.org/10.1101/435271.
Background
Structural variation (SV), which ranges from 50 bp to ~3 Mb in size, is an important type of genetic variation. A deletion is a type of SV in which a part of a chromosome or a sequence of DNA is lost during DNA replication. Three types of signals, namely discordant read-pairs, read depth, and split reads, are commonly used for SV detection from high-throughput sequence data. Many tools have been developed for detecting SVs by using one or more of these signals.
Results
In this paper, we develop a new method called EigenDel for detecting germline submicroscopic genomic deletions. EigenDel first takes advantage of discordant read-pairs and clipped reads to obtain initial deletion candidates, and then clusters similar candidates using unsupervised learning methods. After that, EigenDel uses a carefully designed approach to call true deletions from each cluster. We conduct various experiments to evaluate the performance of EigenDel on low-coverage sequence data.
Conclusions
Our results show that EigenDel outperforms other major methods in balancing accuracy and sensitivity while reducing bias. EigenDel can be downloaded from https://github.com/lxwgcool/EigenDel.
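The clustering step in the pipeline above can be pictured with a minimal k-means over hypothetical per-candidate features such as discordant-pair and clipped-read support (EigenDel's actual features and clustering method may differ):

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means; stands in for the unsupervised clustering step."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each candidate to its nearest center, then recompute centers.
        labels = ((X[:, None, :] - centers) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# Hypothetical candidate features: [discordant-pair support, clipped-read support].
cands = np.array([[12, 9], [11, 10], [1, 0], [2, 1], [10, 11]], dtype=float)
labels = kmeans(cands, k=2)
print(labels)
```

Well-supported candidates separate from weakly supported ones, after which a per-cluster calling rule can decide which candidates are true deletions.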
Zhang, Miao; Liu, Yiwen; Zhou, Hua; Watkins, Joseph; Zhou, Jin.
"A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data". BMC Bioinformatics 22 (1). Springer Science + Business Media. https://doi.org/10.1186/s12859-021-04265-7. https://par.nsf.gov/biblio/10252957.
@article{osti_10252957,
title = {A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data},
url = {https://par.nsf.gov/biblio/10252957},
DOI = {10.1186/s12859-021-04265-7},
abstractNote = {Abstract Background: Low-depth sequencing allows researchers to increase sample size at the expense of lower accuracy. To incorporate uncertainties while maintaining statistical power, we introduce a nonlinear dimension reduction method to analyze the population structure of low-depth sequencing data. Results: The method optimizes the choice of nonlinear transformations of dosages to maximize the Ky Fan norm of the covariance matrix. The transformation incorporates the uncertainty in distinguishing heterozygotes from common homozygotes at loci with a rare allele, and is more linear when both variants are common. Conclusions: We apply the method to samples from two indigenous Siberian populations and accurately reveal hidden population structure using only a single chromosome. The package is available at https://github.com/yiwenstat/MCPCA_PopGen.},
journal = {BMC Bioinformatics},
volume = {22},
number = {1},
publisher = {Springer Science + Business Media},
author = {Zhang, Miao and Liu, Yiwen and Zhou, Hua and Watkins, Joseph and Zhou, Jin},
}