DNA sequencing plays an important role in the bioinformatics research community. It matters to the study of all organisms, and of humans in particular, from multiple perspectives: understanding how specific mutations increase or decrease the risk of developing a disease or condition, and tracing the connections between genotype and phenotype. Advances in high-throughput sequencing techniques, tools, and equipment have generated big genomic datasets thanks to the tremendous decrease in sequencing costs. However, these advances pose great challenges for genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the resulting datasets raise major problems of time, size, and privacy. Because data size is central to these challenges, data minimization techniques have recently attracted much interest in the bioinformatics research community, and it is critical to develop new ways to minimize data size. This paper presents a new real-time data minimization mechanism for big genomic datasets that shortens transfer time in a more secure manner, even in the event of a data breach. Our method involves the application of the random …
The parallelism motifs of genomic data analysis
Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
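The abstract argues that sorting and hashing are missing from established motif lists. A minimal sketch of the hashing motif in genomics (an illustration assumed here, not taken from the paper) is k-mer counting, where reads are decomposed into overlapping substrings and accumulated in a hash table; in a distributed setting each such update becomes exactly the kind of irregular, asynchronous access to a shared data structure described above.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count k-mers across a set of reads using a hash table.

    Hash-based accumulation like this underlies profiling and
    assembly steps; on a parallel machine each increment becomes
    an asynchronous, irregular update to shared state.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ACGTAC", "CGTACG"]
kmers = count_kmers(reads, 3)
# "CGT" appears once in each read, so kmers["CGT"] == 2
```

A sorting-based alternative (sort all k-mers, then count runs) trades the hash table's random access for a communication pattern that is regular but bandwidth-heavy, which is one reason both motifs matter.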
- Journal Name: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
- Sponsoring Org: National Science Foundation
More Like this
Phylogenetic Structure of Plant Communities: Are Polyploids Distantly Related to Co-occurring Diploids? Polyploidy is widely acknowledged to have played an important role in the evolution and diversification of vascular plants. However, the influence of genome duplication on population-level dynamics, and its cascading effects at the community level, remain unclear. In part, this is due to persistent uncertainty over the extent of polyploid phenotypic variation and over the interactions between polyploids and co-occurring species, which highlights the need to integrate polyploid research at the population and community levels. Here, we investigate how community-level patterns of phylogenetic relatedness might influence escape from minority cytotype exclusion, a classic population-genetics hypothesis about polyploid establishment, and population-level species interactions. Focusing on two plant families in which polyploidy has evolved multiple times, Brassicaceae and Rosaceae, we build upon the hypothesis that the greater allelic and phenotypic diversity of polyploids allows them to inhabit a geographic range different from that of their diploid progenitors and close relatives. Using a phylogenetic framework, we specifically test (1) whether polyploid species are more distantly related to diploids within the same community than co-occurring diploids are to one another, and (2) whether polyploid species tend to exhibit greater ecological success than diploids, using species abundance in communities as an indicator of successful establishment. …
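Hypothesis (1) above reduces to comparing mean pairwise phylogenetic distances within a community: polyploid-to-diploid distances versus diploid-to-diploid distances. A toy sketch of that comparison, using an entirely hypothetical patristic-distance matrix (the names `d1`, `d2`, `p1` and the numbers are invented for illustration, not from the study):

```python
from itertools import combinations

def mean_pairwise(dist, group_a, group_b=None):
    """Mean phylogenetic distance within one group, or between two groups."""
    if group_b is None:
        pairs = list(combinations(group_a, 2))
    else:
        pairs = [(a, b) for a in group_a for b in group_b]
    return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

# Hypothetical patristic distances for one community:
# two diploids (d1, d2) and one polyploid (p1)
dist = {
    frozenset(("d1", "d2")): 2.0,
    frozenset(("d1", "p1")): 5.0,
    frozenset(("d2", "p1")): 6.0,
}
within_diploid = mean_pairwise(dist, ["d1", "d2"])           # 2.0
poly_to_diploid = mean_pairwise(dist, ["p1"], ["d1", "d2"])  # 5.5
```

In a real analysis these distances would come from a dated phylogeny, and the comparison would be repeated across many communities with a null model for significance.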
We introduce the Operational Genomic Unit (OGU), a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning, while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in one synthetic and two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome datasets. Such patterns remain detectable even at very low metagenomic sequencing depths. Compared with the taxonomic-unit-based analyses implemented in currently adopted metagenomics tools, and with the analysis of 16S rRNA gene amplicon sequence variants, this method is superior at informing biologically relevant insights, including stronger correlation with body environment and host sex in the Human Microbiome Project dataset, and more accurate prediction of human age from gut microbiomes in the Finnish population. We provide Woltka, a …
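The core data structure implied here is a sample-by-genome feature table built directly from alignment hits. A minimal sketch of that aggregation for one sample, with an assumed fractional split for multi-mapped reads (one plausible policy, not necessarily the one Woltka uses):

```python
from collections import defaultdict

def ogu_counts(hits):
    """Aggregate read-to-genome alignment hits into per-genome counts.

    `hits` maps each read ID to the set of reference genomes it aligns
    to.  A read hitting several genomes contributes an equal fraction
    to each; the resulting vector is one row of an OGU feature table.
    """
    counts = defaultdict(float)
    for read, genomes in hits.items():
        share = 1.0 / len(genomes)
        for g in genomes:
            counts[g] += share
    return dict(counts)

hits = {"r1": {"G1"}, "r2": {"G1", "G2"}, "r3": {"G2"}}
table = ogu_counts(hits)
# r2 is split 0.5/0.5, so both G1 and G2 end up at 1.5
```

Because each feature is a reference genome placed on a phylogenomic tree, such a table plugs directly into tree-aware methods like UniFrac.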
Abstract Motivation Clinical sequencing aims to identify somatic mutations in cancer cells for accurate diagnosis and treatment. However, most widely used clinical assays lack patient-matched control DNA, and additional analysis is needed to distinguish somatic variants from unfiltered germline variants. Such computational analyses require accurate assessment of tumor cell content in individual specimens. Histological estimates often do not corroborate results from computational methods, which are primarily designed for matched tumor–normal data and can be confounded by genomic heterogeneity and the presence of sub-clonal mutations. Methods All-FIT is an iterative weighted least-squares method that estimates specimen tumor purity from the allele frequencies of variants detected in high-depth, targeted, clinical sequencing data. Results Using simulated and clinical data, we demonstrate All-FIT's accuracy and improved performance against leading computational approaches, highlighting the importance of interpreting purity estimates based on the expected biology of tumors. Availability and Implementation Freely available at http://software.khiabanian-lab.org. Supplementary information Supplementary data are available at Bioinformatics online.
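To make the weighted least-squares idea concrete: for a clonal, heterozygous, copy-neutral somatic variant in a specimen of purity p, the expected allele frequency is p/2. The didactic sketch below grid-searches a depth-weighted least-squares fit under exactly those simplifying assumptions; All-FIT itself models multiple zygosity and copy-number states and iteratively re-weights, so this is an illustration of the principle, not the published algorithm.

```python
def estimate_purity(vafs, depths):
    """Estimate tumor purity from somatic variant allele frequencies.

    Assumes every variant is clonal, heterozygous, and copy-neutral,
    so its expected VAF is purity / 2.  Each variant is weighted by
    its sequencing depth; purity is found by a simple grid search
    over the weighted squared error.
    """
    best_p, best_err = None, float("inf")
    for i in range(1, 101):
        p = i / 100.0
        err = sum(d * (v - p / 2.0) ** 2 for v, d in zip(vafs, depths))
        if err < best_err:
            best_p, best_err = p, err
    return best_p

# Three variants at ~500x depth with VAFs clustered near 0.30
purity = estimate_purity([0.29, 0.30, 0.31], [500, 500, 500])
# → 0.60
```

With equal depths the fit collapses to twice the mean VAF; the depth weights matter precisely in targeted panels, where coverage (and hence VAF reliability) varies widely across loci.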
The massive surge in the amount of observational field data demands richer and more meaningful collaboration between data scientists and geoscientists. This document was written by members of the Working Group on Case Studies of the NSF-funded RCN on Intelligent Systems Research To Support Geosciences (IS-GEO, https://is-geo.org/) to describe our vision to build and enhance such collaboration through the use of specially designed benchmark datasets. Benchmark datasets serve as summary descriptions of problem areas, providing a simple interface between disciplines without requiring extensive background knowledge. Benchmark data are intended to address a number of overarching goals. First, they are concrete, identifiable, and public, which results in a natural coordination of research efforts across multiple disciplines and institutions. Second, they provide multi-fold opportunities for objective comparison of various algorithms in terms of computational cost, accuracy, utility, and other measurable standards, to address a particular question in geoscience. Third, as materials for education, benchmark data cultivate future human capital and interest in geoscience problems and data science methods. Finally, a concerted effort to produce and publish benchmarks has the potential to spur the development of new data science methods, while providing deeper insights into many fundamental problems in modern geosciences. That …