Title: Sketching and Sublinear Data Structures in Genomics
Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full-text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.
Award ID(s):
1763680 1750472
PAR ID:
10132819
Journal Name:
Annual Review of Biomedical Data Science
Volume:
2
Issue:
1
ISSN:
2574-3414
Page Range / eLocation ID:
93 to 118
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Recent advances in high‐throughput methods of molecular analyses have led to an explosion of studies generating large‐scale ecological data sets. The effect has been particularly noticeable in the field of microbial ecology, where new experimental approaches have provided in‐depth assessments of the composition, functions and dynamic changes of complex microbial communities. Because even a single high‐throughput experiment produces a large amount of data, powerful statistical techniques of multivariate analysis are well suited to analyse and interpret these data sets. Many different multivariate techniques are available, and often it is not clear which method should be applied to a particular data set. In this review, we describe and compare the most widely used multivariate statistical techniques including exploratory, interpretive and discriminatory procedures. We consider several important limitations and assumptions of these methods, and we present examples of how these approaches have been utilized in recent studies to provide insight into the ecology of the microbial world. Finally, we offer suggestions for the selection of appropriate methods based on the research question and data set structure.
  2. Transmission electron microscopy (TEM), and its counterpart, scanning TEM (STEM), are powerful materials characterization tools capable of probing crystal structure, composition, charge distribution, electronic structure, and bonding down to the atomic scale. Recent (S)TEM instrumentation developments such as electron beam aberration-correction as well as faster and more efficient signal detection systems have given rise to new and more powerful experimental methods, some of which (e.g., 4D-STEM, spectrum-imaging, in situ/operando (S)TEM) facilitate the capture of high-dimensional datasets that contain spatially-resolved structural, spectroscopic, time- and/or stimulus-dependent information across the sub-angstrom to several-micrometer length scale. Thus, through the variety of analysis methods available in the modern (S)TEM and its continual development towards high-dimensional data capture, it is well-suited to the challenge of characterizing isometric mixed-metal oxides such as pyrochlores, fluorites, and other complex oxides that reside on a continuum of chemical and spatial ordering. In this review, we present a suite of imaging and diffraction (S)TEM techniques that are uniquely suited to probe the many types, length-scales, and degrees of disorder in complex oxides, with a focus on disorder common to pyrochlores, fluorites and the expansive library of intermediate structures they may adopt. The application of these techniques to various complex oxides will be reviewed to demonstrate their capabilities and limitations in resolving the continuum of structural and chemical ordering in these systems.
  3. Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (Eds.)
    In the last decade, there has been increasing interest in topological data analysis, a new methodology for using geometric structures in data for inference and learning. A central theme in the area is the idea of persistence, which in its most basic form studies how measures of shape change as a scale parameter varies. There are now a number of frameworks that support statistics and machine learning in this context. However, in many applications there are several different parameters one might wish to vary: for example, scale and density. In contrast to the one-parameter setting, techniques for applying statistics and machine learning in the setting of multiparameter persistence are not well understood due to the lack of a concise representation of the results. We introduce a new descriptor for multiparameter persistence, which we call the Multiparameter Persistence Image, that is suitable for machine learning and statistical frameworks, is robust to perturbations in the data, has finer resolution than existing descriptors based on slicing, and can be efficiently computed on data sets of realistic size. Moreover, we demonstrate its efficacy by comparing its performance to other multiparameter descriptors on several classification tasks. 
  4. In graph machine learning, data collection, sharing, and analysis often involve multiple parties, each of which may require varying levels of data security and privacy. To this end, preserving privacy is of great importance in protecting sensitive information. In the era of big data, the relationships among data entities have become unprecedentedly complex, and more applications utilize advanced data structures (i.e., graphs) that can support network structures and relevant attribute information. To date, many graph-based AI models have been proposed (e.g., graph neural networks) for various domain tasks, like computer vision and natural language processing. In this paper, we focus on reviewing privacy-preserving techniques of graph machine learning. We systematically review related works from the data to the computational aspects. We first review methods for generating privacy-preserving graph data. Then we describe methods for transmitting privacy-preserved information (e.g., graph model parameters) to realize the optimization-based computation when data sharing among multiple parties is risky or impossible. In addition to discussing relevant theoretical methodology and software tools, we also discuss current challenges and highlight several possible future research opportunities for privacy-preserving graph machine learning. Finally, we envision a unified and comprehensive secure graph machine learning system.
  5. Abstract

    Transcription initiation is regulated in a highly organized fashion to ensure proper cellular functions. Accurate identification of transcription start sites (TSSs) and quantitative characterization of transcription initiation activities are fundamental steps for studies of regulated transcription and core promoter structures. Several high-throughput techniques have been developed to sequence the very 5′ end of RNA transcripts (TSS sequencing) on the genome scale. Bioinformatics tools are essential for processing, analysis, and visualization of TSS sequencing data. Here, we present TSSr, an R package that provides rich functions for mapping TSSs and characterizing the structures and activities of core promoters based on all types of TSS sequencing data. Specifically, TSSr implements several newly developed algorithms for accurately identifying TSSs from mapped sequencing reads and inferring core promoters, which are prerequisites for subsequent functional analyses of TSS data. Furthermore, TSSr also enables users to export various types of TSS data that can be visualized in a genome browser for inspection of promoter activities in association with other genomic features, and to generate publication-ready TSS graphs. These user-friendly features could greatly facilitate studies of transcription initiation based on TSS sequencing data. The source code and detailed documentation of TSSr can be freely accessed at https://github.com/Linlab-slu/TSSr.