Title: To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics
Abstract: As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
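As a rough illustration of the MinHash sketching idea mentioned in the abstract (this is not code from the paper), the following minimal Python sketch estimates the Jaccard similarity between the k-mer sets of two sequences. All function names are hypothetical, and the k-mer size, the number of hash functions, and the use of seeded BLAKE2b as the hash family are arbitrary assumptions chosen for readability rather than speed.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq, k=8, num_hashes=128):
    """Keep, for each seeded hash function, the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    return [min(hash64(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates the Jaccard index."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(seq_a, seq_b, k=8):
    """Exact Jaccard index of the two k-mer sets, for comparison."""
    ka, kb = kmers(seq_a, k), kmers(seq_b, k)
    return len(ka & kb) / len(ka | kb)
```

Each signature has a fixed size (`num_hashes` integers) regardless of sequence length, which is what makes comparisons cheap on large collections. The estimate becomes more accurate as `num_hashes` grows, at the cost of larger sketches.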
Award ID(s):
1911094 1838177 1730574
PAR ID:
10205630
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Nucleic Acids Research
Volume:
48
Issue:
10
ISSN:
0305-1048
Page Range / eLocation ID:
5217 to 5234
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As a key component of inherent optical properties (IOPs) in ocean color remote sensing, phytoplankton absorption coefficient (aphy), especially in hyperspectral, greatly enhances our understanding of phytoplankton community composition (PCC). The recent launches of NASA’s hyperspectral missions, such as EMIT and PACE, have generated an urgent need for hyperspectral algorithms for studying phytoplankton. Retrieving aphy from ocean color remote sensing in coastal waters has been extremely challenging due to complex optical properties. Traditional methods often fail under these circumstances, while improved machine-learning approaches are hindered by data scarcity, heterogeneity, and noise from data collection. In response, this study introduces a novel machine learning framework for hyperspectral retrievals of aphy based on the mixture-of-experts (MOEs), named PhA-MOE. Various preprocessing methods for hyperspectral training data are explored, with the combination of robust and logarithmic scalers identified as optimal. The proposed PhA-MOE for aphy prediction is tailored to both past and current hyperspectral missions, including EMIT and PACE. Extensive experiments reveal the importance of data preprocessing and improved performance of PhA-MOE in estimating aphy as well as in handling data heterogeneity. Notably, this study marks the first application of a machine learning–based MOE model to real PACE-OCI hyperspectral imagery, validated using match-up field data. This application enables the exploration of spatiotemporal variations in aphy within an optically complex estuarine environment. 
  2. Defects and dopants play critical roles in defining the properties of a material. Achieving a mechanistic understanding of how such properties arise is challenging with current experimental methods, and computational approaches suffer from significant modeling limitations that frequently require a posteriori fitting. Consequently, the pace of dopant discovery as a means of tuning material properties for a particular application has been slow. However, recent advances in computation have enabled researchers to move away from semiempirical schemes to reposition density functional theory as a predictive tool and improve the accessibility of highly accurate first-principles methods to all researchers. This Perspective discusses some of these recent achievements that provide more accurate first-principles geometric, thermodynamic, optical, and electronic properties simultaneously. Advancements related to supercells, basis sets, functionals, and optimization protocols, as well as suggestions for evaluating the quality of a computational model through comparison to experimental data, are discussed. Moreover, recent computational results in the fields of energy materials, heterogeneous catalysis, and quantum informatics are reviewed along with an evaluation of current frontiers and opportunities in the field of computational materials chemistry. 
  3. In recent years, advances in image processing and machine learning have fueled a paradigm shift in detecting genomic regions under natural selection. Early machine learning techniques employed population-genetic summary statistics as features, which focus on specific genomic patterns expected by adaptive and neutral processes. Though such engineered features are important when training data are limited, the ease with which simulated data can now be generated has led to the recent development of approaches that take in image representations of haplotype alignments and automatically extract important features using convolutional neural networks. Digital image processing methods termed α-molecules are a class of techniques for multiscale representation of objects that can extract a diverse set of features from images. One such α-molecule method, termed wavelet decomposition, lends greater control over high-frequency components of images. Another α-molecule method, termed curvelet decomposition, is an extension of the wavelet concept that considers events occurring along curves within images. We show that application of these α-molecule techniques to extract features from image representations of haplotype alignments yields high true positive rate and accuracy to detect hard and soft selective sweep signatures from genomic data with both linear and nonlinear machine learning classifiers. Moreover, we find that such models are easy to visualize and interpret, with performance rivaling that of contemporary deep learning approaches for detecting sweeps.
  4. Fungal taxonomy and ecology have been revolutionized by the application of molecular methods and both have increasing connections to genomics and functional biology. However, data streams from traditional specimen- and culture-based systematics are not yet fully integrated with those from metagenomic and metatranscriptomic studies, which limits understanding of the taxonomic diversity and metabolic properties of fungal communities. This article reviews current resources, needs, and opportunities for sequence-based classification and identification (SBCI) in fungi as well as related efforts in prokaryotes. To realize the full potential of fungal SBCI it will be necessary to make advances in multiple areas. Improvements in sequencing methods, including long-read and single-cell technologies, will empower fungal molecular ecologists to look beyond ITS and current shotgun metagenomics approaches. Data quality and accessibility will be enhanced by attention to data and metadata standards and rigorous enforcement of policies for deposition of data and workflows. Taxonomic communities will need to develop best practices for molecular characterization in their focal clades, while also contributing to globally useful datasets including ITS. Changes to nomenclatural rules are needed to enable valid publication of sequence-based taxon descriptions. Finally, cultural shifts are necessary to promote adoption of SBCI and to accord professional credit to individuals who contribute to community resources.
  5. Living in a data-driven world with rapidly growing machine learning techniques, it is apparent that utilizing these methods is necessary to achieve state-of-the-art performance in object detection. Recent novel approaches in the deep-learning field have boasted real-time object segmentation methods given the algorithm is connected to a large validation dataset. Knowing that these algorithms are restricted to a given dataset, it is apparent that the need for data generating algorithms is on the rise. As some object detection problems may suffice with a statically trained deep-learning model, it is true that others will not. Given the no free lunch theorem, we know that no machine learning algorithm can truly generalize to data it has not been trained on; therefore, deep learning models trained on images of cats will not necessarily classify dogs correctly. With modern deep learning libraries being ported for mobile devices, a wide range of utility has been made apparent for plant researchers around the world. One such usage of these real-time approaches is to count and classify seed kernels, replacing monotonous, human-error-ridden tasks. Plant scientists around the world have daily jobs of counting seeds by hand or using multi-thousand dollar devices to automate the task. It is apparent that many third world countries, where such consumer devices do not exist or require too many resources, could benefit from such an automated task. PhenoApps, an organization started within Kansas State University, has been supplying a subset of these countries with modern phones for such uses. With the following seed segmentation algorithm and the usage of modern mobile devices, scientists can count seeds with the click of a button and produce results in split-seconds. The algorithms proposed in this paper achieve multiple novel implementations. Mainly, Rice’s Theorem was used to show that object detection in clusters is an undecidable task for Turing Machines. Along with this, the novel implementations include an Android application which can segment seed kernels and a machine learning algorithm which can accurately generate contour data sets. The data generator provided in this paper is an effective start for the later usage of deep learning models and is the first step for a real-time dynamic and static seed counter.