Title: Tensor Decomposition-based Feature Extraction and Classification to Detect Natural Selection from Genomic Data
Abstract

Inferences of adaptive events are important for learning about traits, such as human digestion of lactose after infancy and the rapid spread of viral variants. Early efforts toward identifying footprints of natural selection from genomic data involved development of summary-statistic and likelihood methods. However, such techniques are grounded in simple patterns or theoretical models that limit the complexity of settings they can explore. Due to the renaissance in artificial intelligence, machine learning methods have taken center stage in recent efforts to detect natural selection, with strategies such as convolutional neural networks applied to images of haplotypes. Yet, limitations of such techniques include estimation of large numbers of model parameters under nonconvex settings and feature identification without regard to location within an image. An alternative approach is to use tensor decomposition to extract features from multidimensional data while preserving the latent structure of the data, and to feed these features to machine learning models. Here, we adopt this framework and present a novel approach termed T-REx, which extracts features from images of haplotypes across sampled individuals using tensor decomposition, and then makes predictions from these features using classical machine learning methods. As a proof of concept, we explore the performance of T-REx on simulated neutral and selective sweep scenarios and find that it has high power and accuracy to discriminate sweeps from neutrality, robustness to common technical hurdles, and easy visualization of feature importance. Therefore, T-REx is a powerful addition to the toolkit for detecting adaptive processes from genomic data.
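The general framework in the abstract (decompose a stack of haplotype images, then classify the extracted features) can be sketched roughly as follows. This is an illustration under stated assumptions, not the T-REx implementation: the abstract does not name the decomposition or the classifier, so a truncated higher-order SVD and a nearest-centroid rule are used here as stand-ins, and the toy simulator, function names, and ranks are all hypothetical.

```python
import numpy as np

def simulate_images(n, n_hap=20, n_snp=30, sweep=False, rng=None):
    """Toy stand-in for haplotype images: binary (haplotypes x SNPs) matrices."""
    rng = np.random.default_rng() if rng is None else rng
    imgs = (rng.random((n, n_hap, n_snp)) < 0.5).astype(float)
    if sweep:
        # Assumed, crude sweep signature: loss of diversity in central columns.
        lo, hi = n_snp // 3, 2 * n_snp // 3
        imgs[:, :, lo:hi] = (rng.random((n, n_hap, hi - lo)) < 0.05).astype(float)
    return imgs

def fit_factors(X, r_hap, r_snp):
    """Truncated HOSVD factors for the haplotype and SNP modes of X."""
    n, h, s = X.shape
    unfold_hap = X.transpose(1, 0, 2).reshape(h, n * s)  # mode-haplotype unfolding
    unfold_snp = X.transpose(2, 0, 1).reshape(s, n * h)  # mode-SNP unfolding
    U_hap = np.linalg.svd(unfold_hap, full_matrices=False)[0][:, :r_hap]
    U_snp = np.linalg.svd(unfold_snp, full_matrices=False)[0][:, :r_snp]
    return U_hap, U_snp

def extract_features(X, U_hap, U_snp):
    """Project each image onto the factors; the small core slice is the feature vector."""
    core = np.einsum("nhs,hi,sj->nij", X, U_hap, U_snp)
    return core.reshape(len(X), -1)

def fit_centroids(F, y):
    """Nearest-centroid classifier: store one mean feature vector per class."""
    classes = np.unique(y)
    return classes, np.stack([F[y == c].mean(axis=0) for c in classes])

def predict(F, classes, centroids):
    """Assign each feature vector to the class with the closest centroid."""
    dist = ((F[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

rng = np.random.default_rng(0)
X_train = np.concatenate([simulate_images(100, rng=rng),
                          simulate_images(100, sweep=True, rng=rng)])
y_train = np.repeat([0, 1], 100)  # 0 = neutral, 1 = sweep
X_test = np.concatenate([simulate_images(50, rng=rng),
                         simulate_images(50, sweep=True, rng=rng)])
y_test = np.repeat([0, 1], 50)

# Factors are learned from training data only, then reused for the test set.
U_hap, U_snp = fit_factors(X_train, r_hap=5, r_snp=5)
features_train = extract_features(X_train, U_hap, U_snp)
features_test = extract_features(X_test, U_hap, U_snp)

classes, centroids = fit_centroids(features_train, y_train)
accuracy = (predict(features_test, classes, centroids) == y_test).mean()
print(f"toy test accuracy: {accuracy:.2f}")
```

Because the factor matrices keep the haplotype and SNP axes separate, each feature can be traced back to a pattern along one axis of the image, which is one plausible reading of the "preserving the latent structure" and "easy visualization" points in the abstract.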

 
Award ID(s):
2001063
PAR ID:
10564619
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Molecular Biology and Evolution
Date Published:
Journal Name:
Molecular Biology and Evolution
Volume:
40
Issue:
10
ISSN:
0737-4038
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease with which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that application of these α-molecule techniques to extract features from image representations of haplotype alignments yields high true positive rate and accuracy to detect hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling that of contemporary deep learning approaches for detecting sweeps.

     
  2. Kim, Yuseob (Ed.)
    Abstract Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data. 
  3. Higher-order tensors have received increased attention across science and engineering. While most tensor decomposition methods are developed for a single tensor observation, scientific studies often collect side information, in the form of node features and interactions thereof, together with the tensor data. Such data problems are common in neuroimaging, network analysis, and spatial-temporal modeling. Identifying the relationship between a high-dimensional tensor and side information is important yet challenging. Here, we develop a tensor decomposition method that incorporates multiple feature matrices as side information. Unlike unsupervised tensor decomposition, our supervised decomposition captures the effective dimension reduction of the data tensor confined to the feature space of interest. An efficient alternating optimization algorithm with provable spectral initialization is further developed. Our proposal handles a broad range of data types, including continuous, count, and binary observations. We apply the method to diffusion tensor imaging data from the Human Connectome Project and multi-relational political network data. We identify the key global connectivity pattern and pinpoint the local regions that are associated with available features. The package and data used are available at https://CRAN.R-project.org/package=tensorregress. Supplementary materials for this article are available online.
  4. Scanned images of patent or historical documents often contain localized zigzag noise introduced by the digitizing process; yet when viewed as a whole image, global structures are apparent to humans, but not to machines. Existing denoising methods work well for natural images, but not for binary diagram images, which makes feature extraction difficult for computer vision and machine learning methods and algorithms. We propose a topological graph-based representation to tackle this denoising problem. The graph representation emphasizes the shapes and topology of diagram images, making it ideal for use in machine learning applications such as classification and matching of scientific diagram images. Our approach and algorithms provide essential structure and lay an important foundation for computer vision tasks such as scene graph-based applications, because topological relations and spatial arrangement among objects in images are captured and stored in our skeleton graph. In addition, while the parameters for almost all pixel-based methods are not adaptive, our method is robust in that it requires only one parameter, which is adaptive. Experimental comparisons with existing methods show the effectiveness of our approach.
  5. The increasing uncertainty of distributed energy resources raises the risk of transient events in power systems. To capture event dynamics, Phasor Measurement Unit (PMU) data is widely utilized due to its high resolution. Notably, Machine Learning (ML) methods can process PMU data with feature learning techniques to identify events. However, existing ML-based methods face the following challenges due to salient characteristics from both the measurement and the label sides: (1) PMU streams have a large size with redundancy and correlations across temporal, spatial, and measurement type dimensions. Nevertheless, existing work cannot effectively uncover the structural correlations to remove redundancy and learn useful features. (2) The number of event labels is limited, but most models focus on learning with labeled data, and therefore risk being non-robust to different system conditions. To overcome the above issues, we propose an approach called Kernelized Tensor Decomposition and Classification with Semi-supervision (KTDC-Se). Firstly, we show that the key is to tensorize data storage, filter information via decomposition, and learn discriminative features via classification. This leads to an efficient exploration of structural correlations via high-dimensional tensors. Secondly, the proposed KTDC-Se can incorporate rich unlabeled data to seek decomposed tensors invariant to varying operational conditions. Thirdly, we make KTDC-Se a joint model of decomposition and classification so that there are no biased selections of the two steps. Finally, to boost the model accuracy, we add kernels for non-linear feature learning. We demonstrate the superiority of KTDC-Se over the state-of-the-art methods for event identification using PMU data.
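As a rough illustration of the idea in item 2 above (converting a one-dimensional summary-statistic array into a two-dimensional spectral image), the sketch below applies a simple continuous wavelet transform to a toy signal. The Ricker wavelet, the set of widths, and the simulated diversity trough are assumptions made for illustration; the listed abstract does not specify which wavelet or parameters were used.

```python
import numpy as np

def ricker(points, a):
    """Ricker (Mexican hat) wavelet with width a, sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_image(signal, widths):
    """Stack one convolution per wavelet width into a (scale x position) image."""
    rows = []
    for a in widths:
        n_pts = min(10 * int(a), len(signal))  # kernel never longer than the signal
        rows.append(np.convolve(signal, ricker(n_pts, a), mode="same"))
    return np.array(rows)

# Toy summary statistic over 128 genomic windows: a flat baseline with a
# trough at the center, loosely mimicking reduced diversity near a sweep.
rng = np.random.default_rng(1)
positions = np.arange(128)
stat = 1.0 - 0.8 * np.exp(-((positions - 64) / 6.0) ** 2)
stat = stat + rng.normal(0.0, 0.02, size=positions.size)

widths = np.arange(1, 9)
image = cwt_image(stat - stat.mean(), widths)  # center to limit edge effects

# Position of the strongest response at the scale matching the trough width.
peak = int(np.abs(image[5]).argmax())
print(image.shape, peak)
```

Each row of `image` is the signal viewed at one scale, so a two-dimensional model such as a convolutional network can see where along the genome a pattern occurs and how broad it is at the same time.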