

This content will become publicly available on May 18, 2024

Title: Segmenting and Genotyping Large, Polymorphic Inversions
Large, polymorphic inversions can contribute to population structure and enable mutually exclusive adaptations to survive in the same population. Current methods for detecting inversions from single-nucleotide polymorphisms (SNPs) called from population genomics data require an experienced human user to prepare the data and interpret the results. Ideally, these methods would be completely automated yet robust enough for inexperienced users. Towards this goal, automated approaches for segmentation of inversions and inference of sample genotypes are introduced and evaluated on chromosomes from flies, mosquitoes, and prairie sunflowers.
Award ID(s):
1947257
NSF-PAR ID:
10463483
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
2023 IEEE International Conference on Electro Information Technology (eIT)
Page Range / eLocation ID:
153 to 162
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background Large (>1 Mb), polymorphic inversions have substantial impacts on population structure and maintenance of genotypes. These large inversions can be detected from single nucleotide polymorphism (SNP) data using unsupervised learning techniques like PCA. Construction and analysis of a feature matrix from millions of SNPs requires a large amount of memory and limits the sizes of data sets that can be analyzed. Methods We propose using feature hashing to construct a feature matrix from a VCF file of SNPs in order to reduce memory usage. The matrix is constructed in a streaming fashion such that the entire VCF file is never loaded into memory at one time. Results When evaluated on Anopheles mosquito and Drosophila fly data sets, our approach reduced memory usage by 97% with minimal reductions in accuracy for inversion detection and localization tasks. Conclusion With these changes, inversions in larger data sets can be analyzed easily and efficiently on common laptop and desktop computers. Our method is publicly available through our open-source inversion analysis software, Asaph. 
  2. Abstract

    Continent‐scale observations of seismic phenomena have provided multi‐scale constraints of the Earth's interior. Of those analyzed, array‐based observations of slowness vector properties (backazimuth and horizontal slowness) and multipathing have yet to be made on a continental scale. Slowness vector measurements give inferences on mantle heterogeneity properties such as velocity perturbation and velocity gradient strength and quantify their effect on the wavefield. Multipathing is a consequence of waves interacting with strong velocity gradients resulting in two arrivals with different slowness vector properties and times. The mantle structure beneath the contiguous United States has been thoroughly analyzed by previous seismic studies and is data‐rich, making it an excellent testing ground to both analyze mantle structure with our approach and compare with other imaging techniques. We apply an automated array‐analysis technique to an SKS data set to create the first continent‐scale data set of multipathing and slowness vector measurements. We analyze the divergence of the slowness vector deviation field to highlight seismically slow and fast regions. Our results resolve several slow mantle anomalies beneath Yellowstone and the Appalachian mountains, and fast anomalies throughout the mantle. Many of the anomalies cause multipathing in frequency bands 0.15–0.30 and 0.20–0.40 Hz, which suggests that velocity transitions over at most 500 km exist. Comparing our observations to synthetics created from tomography models, we find model NA13 (Bedle et al., 2021, https://doi.org/10.1029/2021GC009674) fits our data best but differences still remain. We therefore suggest slowness vector measurements should be used as an additional constraint in tomographic inversions and will lead to better resolved models of the mantle.

     
  3. Abstract

    Patterns of genetic diversity within species contain information about the history of that species, including how it has responded to historical climate change and how easily the organism is able to disperse across its habitat. More than 40,000 phylogeographic and population genetic investigations have been published to date, each collecting genetic data from hundreds of samples. Despite these millions of data points, meta‐analyses are challenging because the synthesis of results across hundreds of studies, each using different methods and forms of analysis, is a daunting and time‐consuming task. It is more efficient to proceed by repurposing existing data and using automated data analysis. To facilitate data repurposing, we created a database (phylogatR) that aggregates data from different sources and conducts automated multiple sequence alignments and data curation to provide users with nearly ready‐to‐analyse sets of data for thousands of species. Two types of scientific research will be made easier by phylogatR: large meta‐analyses of thousands of species that can address classic questions in evolutionary biology and ecology, and student‐ or citizen‐science‐based investigations that will introduce a broad range of people to the analysis of genetic data. phylogatR enhances the value of existing data via the creation of software and web‐based tools that enable these data to be recycled and reanalysed and increase accessibility to big data for research laboratories and classroom instructors with limited computational expertise and resources.

     
  4. Abstract

    Modern in situ digital imaging systems collect vast numbers of images of marine organisms and suspended particles. Automated methods to classify objects in these images, largely supervised machine learning techniques, are now used to deal with this onslaught of biological data. Though such techniques can minimize the human cost of analyzing the data, they also have important limitations. In training automated classifiers, we implicitly program them with an inflexible understanding of the environment they are observing. When the relationship between the classifier and the population changes, the computer's performance degrades, potentially decreasing the accuracy of the estimate of community composition. This limitation of automated classifiers is known as “dataset shift.” Here, we describe techniques for addressing dataset shift. We then apply them to the output of a binary deep neural network searching for diatom chains in data generated by the Scripps Plankton Camera System (SPCS) on the Scripps Pier. In particular, we describe a supervised quantification approach to adjust a classifier's output using a small number of human‐corrected images to estimate the system error in a time frame of interest. This method yielded an 80% improvement in mean absolute error over the raw classifier output on a set of 41 independent samples from the SPCS. The technique can be extended to adjust the output of multi‐category classifiers and other in situ observing systems.

     
  5. Purugganan, Michael (Ed.)
    Abstract Structural variants (SVs) are a largely unstudied feature of plant genome evolution, despite the fact that SVs contribute substantially to phenotypes. In this study, we discovered SVs across a population sample of 347 high-coverage, resequenced genomes of Asian rice (Oryza sativa) and its wild ancestor (O. rufipogon). In addition to this short-read data set, we also inferred SVs from whole-genome assemblies and long-read data. Comparisons among data sets revealed different features of genome variability. For example, genome alignment identified a large (∼4.3 Mb) inversion in indica rice varieties relative to japonica varieties, and long-read analyses suggest that ∼9% of genes from the outgroup (O. longistaminata) are hemizygous. We focused, however, on the resequencing sample to investigate the population genomics of SVs. Clustering analyses with SVs recapitulated the rice cultivar groups that were also inferred from SNPs. However, the site-frequency spectrum of each SV type—which included inversions, duplications, deletions, translocations, and mobile element insertions—was skewed toward lower frequency variants than synonymous SNPs, suggesting that SVs may be predominantly deleterious. Among transposable elements, SINE and mariner insertions were found at especially low frequency. We also used SVs to study domestication by contrasting between rice and O. rufipogon. Cultivated genomes contained ∼25% more derived SVs and mobile element insertions than O. rufipogon, indicating that SVs contribute to the cost of domestication in rice. Peaks of SV divergence were enriched for known domestication genes, but we also detected hundreds of genes gained and lost during domestication, some of which were enriched for traits of agronomic interest. 
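
The streaming feature-hashing idea summarized in item 1 above can be illustrated with a minimal Python sketch. This is not the Asaph implementation: the function names (`hash_feature`, `hashed_matrix_from_vcf`), the `chrom:pos:allele` feature key, the signed-hashing scheme, the use of MD5 as a stable hash, and the default bucket count are all assumptions chosen for illustration. The sketch reads VCF records one at a time and accumulates allele counts into a fixed-width matrix, so memory depends on the number of samples and buckets rather than on the number of SNPs.

```python
import hashlib
import io


def hash_feature(key, n_buckets):
    """Map a feature key to (column index, sign) with a stable hash.

    The sign is the usual "signed hashing trick": colliding features are
    as likely to cancel as to add, which keeps collisions unbiased.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "little")
    index = (value >> 1) % n_buckets
    sign = 1.0 if value & 1 else -1.0
    return index, sign


def hashed_matrix_from_vcf(lines, n_buckets=1024):
    """Build a samples x n_buckets matrix from an iterable of VCF lines.

    Only one record is held in memory at a time. Assumes the genotype (GT)
    is the first FORMAT subfield, as is conventional in VCF files.
    """
    matrix = None
    for line in lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            n_samples = len(fields) - 9  # 9 fixed columns precede samples
            matrix = [[0.0] * n_buckets for _ in range(n_samples)]
            continue
        if matrix is None:
            raise ValueError("#CHROM header line not found before records")
        chrom, pos = fields[0], fields[1]
        for s, call in enumerate(fields[9:]):
            genotype = call.split(":")[0].replace("|", "/")
            for allele in genotype.split("/"):
                if allele == ".":
                    continue  # missing call contributes nothing
                index, sign = hash_feature(f"{chrom}:{pos}:{allele}", n_buckets)
                matrix[s][index] += sign
    return matrix


# Hypothetical two-sample VCF used only to exercise the sketch.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "2L\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n"
    "2L\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t./.\n"
)
features = hashed_matrix_from_vcf(io.StringIO(example_vcf), n_buckets=16)
```

In practice the same function could be passed an open file handle instead of a `StringIO` object, and the resulting matrix handed to a PCA routine. A smaller `n_buckets` saves memory at the cost of more hash collisions.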