skip to main content

Title: Post‐selection inference for changepoint detection algorithms with application to copy number variation data

Changepoint detection methods are used in many areas of science and engineering, for example, in the analysis of copy number variation data to detect abnormalities in copy numbers along the genome. Despite the broad array of available tools, methodology for quantifying our uncertainty in the strength (or the presence) of given changepointspost‐selectionare lacking. Post‐selection inference offers a framework to fill this gap, but the most straightforward application of these methods results in low‐powered hypothesis tests and leaves open several important questions about practical usability. In this work, we carefully tailor post‐selection inference methods toward changepoint detection, focusing on copy number variation data. To accomplish this, we study commonly used changepoint algorithms: binary segmentation, as well as two of its most popular variants, wild and circular, and the fused lasso. We implement some of the latest developments in post‐selection inference theory, mainly auxiliary randomization. This improves the power, which requires implementations of Markov chain Monte Carlo algorithms (importance sampling and hit‐and‐run sampling) to carry out our tests. We also provide recommendations for improving practical useability, detailed simulations, and example analyses on array comparative genomic hybridization as well as sequencing data.

more » « less
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Date Published:
Journal Name:
Page Range / eLocation ID:
p. 1037-1049
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The interface between field biology and technology is energizing the collection of vast quantities of environmental data. Passive acoustic monitoring, the use of unattended recording devices to capture environmental sound, is an example where technological advances have facilitated an influx of data that routinely exceeds the capacity for analysis. Computational advances, particularly the integration of machine learning approaches, will support data extraction efforts. However, the analysis and interpretation of these data will require parallel growth in conceptual and technical approaches for data analysis. Here, we use a large hand‐annotated dataset to showcase analysis approaches that will become increasingly useful as datasets grow and data extraction can be partially automated.

    We propose and demonstrate seven technical approaches for analyzing bioacoustic data. These include the following: (1) generating species lists and descriptions of vocal variation, (2) assessing how abiotic factors (e.g., rain and wind) impact vocalization rates, (3) testing for differences in community vocalization activity across sites and habitat types, (4) quantifying the phenology of vocal activity, (5) testing for spatiotemporal correlations in vocalizations within species, (6) among species, and (7) using rarefaction analysis to quantify diversity and optimize bioacoustic sampling.

    To demonstrate these approaches, we sampled in 2016 and 2018 and used hand annotations of 129,866 bird vocalizations from two forests in New Hampshire, USA, including sites in the Hubbard Brook Experiment Forest where bioacoustic data could be integrated with more than 50 years of observer‐based avian studies. Acoustic monitoring revealed differences in community patterns in vocalization activity between forests of different ages, as well as between nearby similar watersheds. Of numerous environmental variables that were evaluated, background noise was most clearly related to vocalization rates. The songbird community included one cluster of species where vocalization rates declined as ambient noise increased and another cluster where vocalization rates declined over the nesting season. In some common species, the number of vocalizations produced per day was correlated at scales of up to 15 km. Rarefaction analyses showed that adding sampling sites increased species detections more than adding sampling days.

    Although our analyses used hand‐annotated data, the methods will extend readily to large‐scale automated detection of vocalization events. Such data are likely to become increasingly available as autonomous recording units become more advanced, affordable, and power efficient. Passive acoustic monitoring with human or automated identification at the species level offers growing potential to complement observer‐based studies of avian ecology.

    more » « less
  2. Abstract Aim

    Lineage fusion (merging of two or more populations of a species resulting in a single panmictic group) is a special case of secondary contact. It has the potential to counteract diversification and speciation, or to facilitate it through creation of novel genotypes. Understanding the prevalence of lineage fusion in nature requires reliable detection of it, such that efficient summary statistics are needed. Here, we report on simulations that characterized the initial intensity and subsequent decay of signatures of past fusion for 17 summary statistics applicable to DNA sequence haplotype data.




    Diploid out‐crossing species.


    We considered a range of scenarios that could reveal the impacts of different combinations of read length versus number of loci (arrangement of DNA sequence data), and whether or not pre‐fusion populations experienced bottlenecks coinciding with their divergence (historical context of fusion). Post‐fusion gene pools were sampled along 10 successive time points representing increasing lag times following merging of sister populations, and summary statistic values were recalculated at each.


    Many summary statistics were able to detect signatures of complete merging of populations after a sampling lag time of 1.5Negenerations, but the most informative ones included two neutrality tests and four diversity metrics, withZnS(a linkage disequilibrium‐based neutrality test) being particularly powerful. Correlation was relatively low among the two neutrality tests and two of the diversity metrics. There were clear benefits of many short (200‐bp × 200) loci over a handful of long (4‐kb × 10) loci. Also, only the latter genetic dataset type showed impacts of bottlenecks during divergence upon the number of informative summary statistics.

    Main conclusions

    This work contributes to identifying cases of lineage fusion, and advances phylogeography by enabling more nuanced reconstructions of how individual species, or multiple members of an ecological community, responded to past environmental change.

    more » « less
  3. de Visser, J. Arjan (Ed.)
    The rate of adaptive evolution depends on the rate at which beneficial mutations are introduced into a population and the fitness effects of those mutations. The rate of beneficial mutations and their expected fitness effects is often difficult to empirically quantify. As these 2 parameters determine the pace of evolutionary change in a population, the dynamics of adaptive evolution may enable inference of their values. Copy number variants (CNVs) are a pervasive source of heritable variation that can facilitate rapid adaptive evolution. Previously, we developed a locus-specific fluorescent CNV reporter to quantify CNV dynamics in evolving populations maintained in nutrient-limiting conditions using chemostats. Here, we use CNV adaptation dynamics to estimate the rate at which beneficial CNVs are introduced through de novo mutation and their fitness effects using simulation-based likelihood–free inference approaches. We tested the suitability of 2 evolutionary models: a standard Wright–Fisher model and a chemostat model. We evaluated 2 likelihood-free inference algorithms: the well-established Approximate Bayesian Computation with Sequential Monte Carlo (ABC-SMC) algorithm, and the recently developed Neural Posterior Estimation (NPE) algorithm, which applies an artificial neural network to directly estimate the posterior distribution. By systematically evaluating the suitability of different inference methods and models, we show that NPE has several advantages over ABC-SMC and that a Wright–Fisher evolutionary model suffices in most cases. Using our validated inference framework, we estimate the CNV formation rate at the GAP1 locus in the yeast Saccharomyces cerevisiae to be 10 −4.7 to 10 −4 CNVs per cell division and a fitness coefficient of 0.04 to 0.1 per generation for GAP1 CNVs in glutamine-limited chemostats. We experimentally validated our inference-based estimates using 2 distinct experimental methods—barcode lineage tracking and pairwise fitness assays—which provide independent confirmation of the accuracy of our approach. Our results are consistent with a beneficial CNV supply rate that is 10-fold greater than the estimated rates of beneficial single-nucleotide mutations, explaining the outsized importance of CNVs in rapid adaptive evolution. More generally, our study demonstrates the utility of novel neural network–based likelihood–free inference methods for inferring the rates and effects of evolutionary processes from empirical data with possible applications ranging from tumor to viral evolution. 
    more » « less
  4. Abstract

    Integrative analyses based on statistically relevant associations between genomics and a wealth of intermediary phenotypes (such as imaging) provide vital insights into their clinical relevance in terms of the disease mechanisms. Estimates for uncertainty in the resulting integrative models are however unreliable unless inference accounts for the selection of these associations with accuracy. In this paper, we develop selection‐aware Bayesian methods, which (1) counteract the impact of model selection bias through a “selection‐aware posterior” in a flexible class of integrative Bayesian models post a selection of promising variables via ℓ1‐regularized algorithms; (2) strike an inevitable trade‐off between the quality of model selection and inferential power when the same data set is used for both selection and uncertainty estimation. Central to our methodological development, a carefully constructed conditional likelihood function deployed with a reparameterization mapping provides tractable updates when gradient‐based Markov chain Monte Carlo (MCMC) sampling is used for estimating uncertainties from the selection‐aware posterior. Applying our methods to a radiogenomic analysis, we successfully recover several important gene pathways and estimate uncertainties for their associations with patient survival times.

    more » « less
  5. Abstract Background

    The topology of metabolic networks is both well-studied and remarkably well-conserved across many species. The regulation of these networks, however, is much more poorly characterized, though it is known to be divergent across organisms—two characteristics that make it difficult to model metabolic networks accurately. While many computational methods have been built to unravel transcriptional regulation, there have been few approaches developed for systems-scale analysis and study of metabolic regulation. Here, we present a stepwise machine learning framework that applies established algorithms to identify regulatory interactions in metabolic systems based on metabolic data: stepwise classification of unknown regulation, or SCOUR.


    We evaluated our framework on both noiseless and noisy data, using several models of varying sizes and topologies to show that our approach is generalizable. We found that, when testing on data under the most realistic conditions (low sampling frequency and high noise), SCOUR could identify reaction fluxes controlled only by the concentration of a single metabolite (its primary substrate) with high accuracy. The positive predictive value (PPV) for identifying reactions controlled by the concentration of two metabolites ranged from 32 to 88% for noiseless data, 9.2 to 49% for either low sampling frequency/low noise or high sampling frequency/high noise data, and 6.6–27% for low sampling frequency/high noise data, with results typically sufficiently high for lab validation to be a practical endeavor. While the PPVs for reactions controlled by three metabolites were lower, they were still in most cases significantly better than random classification.


    SCOUR uses a novel approach to synthetically generate the training data needed to identify regulators of reaction fluxes in a given metabolic system, enabling metabolomics and fluxomics data to be leveraged for regulatory structure inference. By identifying and triaging the most likely candidate regulatory interactions, SCOUR can drastically reduce the amount of time needed to identify and experimentally validate metabolic regulatory interactions. As high-throughput experimental methods for testing these interactions are further developed, SCOUR will provide critical impact in the development of predictive metabolic models in new organisms and pathways.

    more » « less