Fast, low-memory detection and localization of large, polymorphic inversions from SNPs

Nowling, Ronald J.; Fallas-Moya, Fabian; Sadovnik, Amir; Emrich, Scott; Aleck, Matthew; Leskiewicz, Daniel; Peters, John G.

doi:10.7717/peerj.12831

Citation Details


Background: Large (>1 Mb), polymorphic inversions have substantial impacts on population structure and maintenance of genotypes. These large inversions can be detected from single nucleotide polymorphism (SNP) data using unsupervised learning techniques like PCA. Constructing and analyzing a feature matrix from millions of SNPs requires a large amount of memory, which limits the size of the data sets that can be analyzed.

Methods: We propose using feature hashing to construct a feature matrix from a VCF file of SNPs in order to reduce memory usage. The matrix is constructed in a streaming fashion such that the entire VCF file is never loaded into memory at one time.

Results: When evaluated on Anopheles mosquito and Drosophila fly data sets, our approach reduced memory usage by 97% with minimal reductions in accuracy for inversion detection and localization tasks.

Conclusion: With these changes, inversions in larger data sets can be analyzed easily and efficiently on common laptop and desktop computers. Our method is publicly available through our open-source inversion analysis software, Asaph.
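The streaming, hashed construction described in the Methods can be illustrated with a minimal sketch. This is not Asaph's implementation: the function names (`bucket_for`, `hashed_feature_matrix`), the `chrom:pos:allele` feature key, the use of MD5 as the hash, and the default bucket count are all assumptions chosen for illustration, and the sketch omits the sign hash that many feature-hashing variants use. It reads a VCF one record at a time and adds each observed allele to a fixed-width row per sample, so memory depends on the number of samples and buckets rather than on the number of SNPs.

```python
import hashlib
import io


def bucket_for(key, n_buckets):
    """Map a feature key to a column index with a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process and would give different columns on each run.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % n_buckets


def hashed_feature_matrix(vcf_lines, n_buckets=1024):
    """Build a samples x n_buckets count matrix from an iterable of VCF lines.

    Hypothetical helper: each (chromosome, position, allele) triple is one
    feature, and every allele copy a sample carries adds 1 to the hashed column.
    """
    matrix = None
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no genotype columns on this record
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        genotypes = fields[9:]
        if matrix is None:
            # One fixed-width row per sample, allocated once.
            matrix = [[0.0] * n_buckets for _ in genotypes]
        alleles = [ref] + alt.split(",")
        for sample_idx, call in enumerate(genotypes):
            # The GT subfield comes first; treat phased and unphased alike.
            gt = call.split(":")[0].replace("|", "/")
            for allele_idx in gt.split("/"):
                if not allele_idx.isdigit():
                    continue  # missing call such as "."
                allele = alleles[int(allele_idx)]
                col = bucket_for(f"{chrom}:{pos}:{allele}", n_buckets)
                matrix[sample_idx][col] += 1.0
    return matrix or []


# Tiny in-memory VCF with two samples and two SNPs, for demonstration only.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n"
    "2L\t100\t.\tA\tG\t.\t.\t.\tGT\t0/0\t0/1\n"
    "2L\t200\t.\tC\tT\t.\t.\t.\tGT\t1|1\t./.\n"
)
example_matrix = hashed_feature_matrix(io.StringIO(example_vcf), n_buckets=16)
```

Passing an open file handle instead of `io.StringIO` keeps the same one-record-at-a-time behavior. The resulting matrix could then be given to a PCA routine; a smaller `n_buckets` saves memory at the cost of more hash collisions between SNPs.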

Award ID(s): 1947257

PAR ID: 10329957

Author(s) / Creator(s): Nowling, Ronald J.; Fallas-Moya, Fabian; Sadovnik, Amir; Emrich, Scott; Aleck, Matthew; Leskiewicz, Daniel; Peters, John G.

Date Published: 2022-01-01

Journal Name: PeerJ

Volume: 10

ISSN: 2167-8359

Page Range / eLocation ID: e12831

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Journal Article:
https://doi.org/10.7717/peerj.12831
