Title: The parallelism motifs of genomic data analysis
Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
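One of the motifs the abstract calls out, hashing, appears throughout genomic analysis in kernels such as k-mer counting. The sketch below is an illustrative single-node version, not code from the paper; a parallel implementation would partition the hash table across nodes and use the asynchronous remote updates the abstract describes:

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count all length-k substrings (k-mers) of a DNA sequence
    using a hash table -- the 'hashing' motif of genomic analysis."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        counts[sequence[i:i + k]] += 1
    return counts

# Example: 3-mers of a short read.
counts = count_kmers("ACGTACGT", 3)
# "ACG" and "CGT" each occur twice in this read.
```

The sorting motif offers an alternative formulation of the same kernel: generate all k-mers, sort them, and count runs of equal neighbors, which trades random memory access for communication-friendly scans.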
Award ID(s):
1823034
PAR ID:
10192460
Author(s) / Creator(s):
Date Published:
Journal Name:
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Volume:
378
Issue:
2166
ISSN:
1364-503X
Page Range / eLocation ID:
20190394
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sharma, Divakar (Ed.)
    Nontuberculous mycobacteria (NTM) are ubiquitous environmental opportunistic pathogens that can cause chronic lung disease. Within the United States, Hawai’i has the highest incidence of NTM lung disease, though the precise reasons are yet to be fully elucidated. One possibility is the high prevalence of NTM in the Hawai’i environment acting as a potential reservoir for opportunistic NTM infections. Through our previous initiatives to collect and characterize NTM in Hawai’i, community scientists have collected thousands of environmental samples for sequencing. Here, these community scientists were invited for the first time into a high school lab in O’ahu for a genomic sequencing workshop, where participants sequenced four of the collected isolate genomic samples using the Oxford Nanopore Technologies MinION sequencer. Participants generated high-quality long-read data that, when combined with short-read Illumina data, yielded complete bacterial genomic assemblies suitable for in-depth analysis. The gene annotation analysis identified a suite of genes that might help NTM thrive in the Hawai’i environment. Further, the sequencing data revealed evidence of co-occurring methylobacteria, suggesting that in some cases methylobacteria and NTM may coexist in the same niche, challenging previously accepted paradigms. The sequencing efforts presented here generated novel insights regarding the potential survival strategies and microbial interactions of NTM in the geographic hot spot of Hawai’i. We highlight the contributions of community scientists and present an activity that can be reimplemented as a workshop or classroom activity by other research groups to engage their local communities.
  2. DNA sequencing plays an important role in the bioinformatics research community. It matters for all organisms, and for humans in particular, from multiple perspectives: understanding how specific mutations increase or decrease the risk of developing a disease or condition, and finding the implications of and connections between genotype and phenotype. Advancements in high-throughput sequencing techniques, tools, and equipment have generated big genomic datasets, owing to the tremendous decrease in sequencing costs. However, these advancements have posed great challenges for genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Because data size plays a central role in these challenges, data minimization techniques have recently attracted much interest in the bioinformatics research community, and it is critical to develop new ways to minimize data size. This paper presents a new real-time data minimization mechanism for big genomic datasets that shortens transfer time in a more secure manner, even in the event of a data breach. Our method applies random sampling from Fourier transform theory to real-time generated big genomic datasets in both the FASTA and FASTQ formats, and assigns the shortest possible codewords to the most frequent characters in the datasets. Our results indicate that the proposed data minimization algorithm achieves up to 79% size reduction on FASTA datasets while being 98-fold faster and more secure than the standard data-encoding method, and up to 45% size reduction on FASTQ datasets while being 57-fold faster than the standard data-encoding approach. Based on these results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.
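The frequency-based codeword assignment this abstract describes (shortest codes for the most frequent characters) is the idea behind classic Huffman coding. The sketch below is a generic Huffman encoder for illustration only; it is not the authors' algorithm, which additionally uses random sampling from Fourier transform theory:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Assign shorter bit codewords to more frequent characters
    (classic Huffman coding; a simplified stand-in for the
    frequency-based encoding step described above)."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, {char: code-so-far}).
    # The integer tie-breaker keeps tuple comparison away from the dicts.
    heap = [(f, i, {c: ""}) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in c1.items()}
        merged.update({c: "1" + code for c, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("AAAAAACGTTGCAA")
# The most frequent base ('A') receives the shortest codeword.
```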
  3.
    QNSTOP consists of serial and parallel (OpenMP) Fortran 2003 codes for the quasi-Newton stochastic optimization method of Castle and Trosset for stochastic search problems. A complete description of QNSTOP for both local search with stochastic objective and global search with “noisy” deterministic objective is given here, to the best of our knowledge, for the first time. For stochastic search problems, some convergence theory exists for particular algorithmic choices and parameter values. Both the parallel driver subroutine, which offers several parallel decomposition strategies, and the serial driver subroutine can be used for local stochastic search or global deterministic search, based on an input switch. Some performance data for computational systems biology problems is given. 
  4. Natural selection leaves detectable patterns of altered spatial diversity within genomes, and identifying affected regions is crucial for understanding species evolution. Recently, machine learning approaches applied to raw population genomic data have been developed to uncover these adaptive signatures. Convolutional neural networks (CNNs) are particularly effective for this task, as they handle large data arrays while maintaining element correlations. However, shallow CNNs may miss complex patterns due to their limited capacity, while deep CNNs can capture these patterns but require extensive data and computational power. Transfer learning addresses these challenges by utilizing a deep CNN pretrained on a large dataset as a feature extraction tool for downstream classification and evolutionary parameter prediction. This approach reduces extensive training data generation requirements and computational needs while maintaining high performance. In this study, we developed TrIdent, a tool that uses transfer learning to enhance detection of adaptive genomic regions from image representations of multilocus variation. We evaluated TrIdent across various genetic, demographic, and adaptive settings, as well as with unphased data and other confounding factors. TrIdent demonstrated improved detection of adaptive regions compared to recent methods using similar data representations. We further explored model interpretability through class activation maps and adapted TrIdent to infer selection parameters for identified adaptive candidates. Using whole-genome haplotype data from European and African populations, TrIdent effectively recapitulated known sweep candidates and identified novel cancer- and other disease-associated genes as potential sweeps.
  5. High-throughput sequencing data enables the comprehensive study of genomes and the variation therein. Essential for the interpretation of this genomic data is a thorough understanding of the computational methods used for processing and analysis. Whereas “gold-standard” empirical datasets exist for this purpose in humans, synthetic (i.e., simulated) sequencing data can offer important insights into the capabilities and limitations of computational pipelines for any arbitrary species and/or study design—yet, the ability of read simulator software to emulate genomic characteristics of empirical datasets remains poorly understood. We here compare the performance of six popular short-read simulators—ART, DWGSIM, InSilicoSeq, Mason, NEAT, and wgsim—and discuss important considerations for selecting suitable models for benchmarking.
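The kind of read simulator benchmarked above can be sketched in a few lines: sample start positions from a reference, extract fixed-length reads, and inject substitution errors. This toy version, with a uniform per-base error rate, is illustrative only; the six simulators compared use far richer empirical error and quality models:

```python
import random

def simulate_reads(genome, n_reads, read_len, error_rate, seed=0):
    """Toy short-read simulator: sample uniform start positions,
    extract fixed-length reads, and inject substitution errors at a
    fixed per-base rate (real simulators use empirical error models)."""
    rng = random.Random(seed)  # seeded for reproducibility
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Substitute with a different base.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads

reads = simulate_reads("ACGT" * 50, n_reads=10, read_len=20, error_rate=0.01)
```

With `error_rate=0.0` every simulated read is an exact substring of the reference, which makes the simulator easy to sanity-check before turning errors on.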