

Title: Locality-Sensitive Bucketing Functions for the Edit Distance
Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. Here we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be (d₁, d₂)-sensitive if any two sequences within an edit distance of d₁ are mapped into at least one shared bucket, and any two sequences with distance at least d₂ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of (d₁, d₂) and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds on these two parameters in different settings and show that some of our constructed LSB functions are optimal. These results provide theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights into the hardness of designing ungapped LSH functions.
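The definition above is concrete enough to check by brute force on a toy example. The sketch below (not from the paper) measures the best (d₁, d₂) that a naive k-mer bucketing function achieves over all length-4 sequences on the alphabet {A, C, G, T}; the choice of k-mer buckets, the sequence length, and the alphabet are illustrative assumptions. On this toy set the computed d₂ comes out larger than the maximum possible edit distance, i.e., the naive function never fully separates distant pairs, which is the kind of gap the paper's LSB constructions target.

    from itertools import combinations, product

    def edit_distance(s, t):
        # Textbook dynamic-programming Levenshtein distance.
        prev = list(range(len(t) + 1))
        for i, a in enumerate(s, 1):
            cur = [i]
            for j, b in enumerate(t, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
            prev = cur
        return prev[-1]

    def buckets(s, k=2):
        # Illustrative bucketing function: each distinct k-mer of s names
        # one bucket that s is assigned to.
        return {s[i:i + k] for i in range(len(s) - k + 1)}

    # Brute-force the largest d1 and the smallest d2 this function achieves
    # on all length-4 sequences over {A, C, G, T}.
    seqs = ["".join(p) for p in product("ACGT", repeat=4)]
    pairs = [(edit_distance(s, t), bool(buckets(s) & buckets(t)))
             for s, t in combinations(seqs, 2)]
    d1 = min(d for d, shared in pairs if not shared) - 1  # closeness guarantee
    d2 = max(d for d, shared in pairs if shared) + 1      # separation guarantee
    print(f"({d1}, {d2})-sensitive on this toy set")      # here d2 > 4, the max distance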
Award ID(s):
2019797
NSF-PAR ID:
10417130
Editor(s):
Boucher, Christina; Rahmann, Sven
Journal Name:
22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Volume:
242
ISSN:
1868-8969
Page Range / eLocation ID:
22:1--22:14
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Background

    Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue by tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps.

    Results

    In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be $(d_1, d_2)$-sensitive if any two sequences within an edit distance of $d_1$ are mapped into at least one shared bucket, and any two sequences with distance at least $d_2$ are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of $(d_1, d_2)$ and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal.

    Conclusion

    These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights into the hardness of designing ungapped LSH functions.

     
  2. INTRODUCTION

    Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons.

    RATIONALE

    Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation-based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved: in an overwhelming number of cases, nucleotides turn over at a high rate while a similar combination of transcription factor binding sites and other sequence features is maintained across millions of years of evolution, preserving the function of the enhancer in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods enable reliable sequence-based predictions of enhancer function across species in specific tissues and cell types.

    RESULTS

    To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. More specifically, convolutional neural networks learn the tissue- or cell type-specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue- or cell type-specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species' phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety, and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value of applying TACIT to both bulk tissue and specific minority cell type populations. To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated.

    CONCLUSION

    TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution.

    Figure caption: Tissue-Aware Conservation Inference Toolkit (TACIT) associates genetic differences between species with phenotypes. TACIT works by generating open chromatin data from a few species in a tissue related to a phenotype, using the sequences underlying open and closed chromatin regions to train a machine learning model for predicting tissue-specific open chromatin, and associating open chromatin predictions across dozens of mammals with the phenotype. [Species silhouettes are from PhyloPic]
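    The RESULTS section describes, at a high level, training convolutional networks to map candidate-enhancer sequence to tissue-specific activity. The sketch below is a minimal illustration of that kind of model, not the TACIT architecture: the layer sizes, filter widths, and 500-bp input length are assumptions made for the example.

        import torch
        import torch.nn as nn

        class EnhancerCNN(nn.Module):
            def __init__(self):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Conv1d(4, 64, kernel_size=15, padding=7),  # motif-like filters
                    nn.ReLU(),
                    nn.MaxPool1d(10),
                    nn.Conv1d(64, 64, kernel_size=5, padding=2),  # combinations of motifs
                    nn.ReLU(),
                    nn.AdaptiveMaxPool1d(1),                      # length-invariant summary
                    nn.Flatten(),
                    nn.Linear(64, 1),                             # open vs. closed logit
                )

            def forward(self, x):  # x: (batch, 4, seq_len) one-hot DNA
                return self.net(x)

        def one_hot(seq):
            # Encode an ACGT string as a (4, len) float tensor.
            idx = {"A": 0, "C": 1, "G": 2, "T": 3}
            x = torch.zeros(4, len(seq))
            for i, base in enumerate(seq):
                x[idx[base], i] = 1.0
            return x

        model = EnhancerCNN()
        x = one_hot("ACGT" * 125).unsqueeze(0)  # one toy 500-bp candidate enhancer
        print(torch.sigmoid(model(x)).item())   # predicted open-chromatin probability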
  3. Approximation is a technique that optimizes the balance between application outcome quality and its resource usage. Trading quality for performance has been investigated for single-application scenarios, but not for environments where multiple approximate applications may run concurrently on the same machine, interfering with each other by sharing machine resources. Applying existing single-application techniques to this multi-programming environment may lead to configuration space size explosion, or result in poor overall application quality outcomes. Our new RAPID-M system is the first cross-application configuration management framework. It reduces the problem size by clustering configurations of individual applications into local "similarity buckets". The global cross-application configuration selection is based on these local bucket spaces. RAPID-M dynamically assigns buckets to applications such that overall quality is maximized while respecting individual application cost budgets. Once assigned a bucket, reconfigurations within buckets may be performed locally with minimal impact on global selections. Experimental results using six configurable applications show that even large configuration spaces of complex applications can be clustered into a small number of buckets, resulting in search space size reductions of up to 9 orders of magnitude. RAPID-M constructs performance cost models with an average prediction error of ≤3%. For our application execution traces, RAPID-M dynamically selects configurations that lower the budget violation rate by 33.9%, with an average budget exceeding rate of 6.6%, as compared to other possible approaches. RAPID-M successfully finishes 22.75% more executions, which translates to a 1.52X global output quality increase under high system loads. The overhead of RAPID-M is within ≤1% of application execution times.
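    As a rough illustration of the bucketing idea described above, the sketch below clusters each application's configurations into a few "similarity buckets" and then picks, per application, the highest-quality bucket whose expected cost fits that application's budget. The two-dimensional (quality, cost) features, the k-means clustering, and the greedy selection rule are illustrative assumptions, not RAPID-M's actual models or algorithm.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)

        def bucketize(configs, n_buckets=4):
            # configs: (n, 2) array of (quality, cost) per configuration.
            labels = KMeans(n_clusters=n_buckets, n_init=10, random_state=0).fit_predict(configs)
            return [configs[labels == b] for b in range(n_buckets)]

        def pick_bucket(buckets, budget):
            # Greedy rule: among buckets whose mean cost fits the budget, take
            # the highest mean quality; fall back to the cheapest bucket.
            feasible = [b for b in buckets if b[:, 1].mean() <= budget]
            pool = feasible or [min(buckets, key=lambda b: b[:, 1].mean())]
            return max(pool, key=lambda b: b[:, 0].mean())

        # Toy setup: three applications, 200 random configurations each.
        apps = [rng.random((200, 2)) for _ in range(3)]
        budgets = [0.3, 0.5, 0.7]
        for i, (configs, budget) in enumerate(zip(apps, budgets)):
            chosen = pick_bucket(bucketize(configs), budget)
            print(f"app {i}: bucket quality {chosen[:, 0].mean():.2f}, "
                  f"cost {chosen[:, 1].mean():.2f} (budget {budget})")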
  4. Localization of wireless transmitters based on channel state information (CSI) fingerprinting finds widespread use in indoor as well as outdoor scenarios. Fingerprinting localization first builds a database containing CSI with measured location information. One then searches for the most similar CSI in this database to approximate the position of wireless transmitters. In this paper, we investigate the efficacy of locality-sensitive hashing (LSH) to reduce the complexity of the nearest-neighbor search (NNS) required by conventional fingerprinting localization systems. More specifically, we propose a low-complexity and memory-efficient LSH function based on the sum-to-one (STOne) transform and use approximate hash matches. We evaluate the accuracy and complexity (in terms of the number of searches and storage requirements) of our approach for line-of-sight (LoS) and non-LoS channels, and we show that LSH enables low-complexity fingerprinting localization with accuracy comparable to methods relying on exact NNS or deep neural networks.
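    Below is a minimal sketch of the LSH-accelerated search this abstract describes. Random-hyperplane (SimHash) signatures stand in for the paper's STOne-transform-based hash, and the CSI fingerprints and positions are synthetic; the point is the mechanic of hashing the database once so a query is compared only against its own bucket rather than the full database.

        import numpy as np
        from collections import defaultdict

        rng = np.random.default_rng(1)
        dim, n_db, n_bits = 64, 5000, 12

        planes = rng.standard_normal((n_bits, dim))  # shared random hyperplanes

        def lsh(v):
            # n_bits-bit signature: the sign pattern of v against the hyperplanes.
            return tuple((planes @ v) > 0)

        # Build the fingerprint database: CSI vector -> measured position.
        csi_db = rng.standard_normal((n_db, dim))
        pos_db = rng.uniform(0, 100, (n_db, 2))
        table = defaultdict(list)
        for i, v in enumerate(csi_db):
            table[lsh(v)].append(i)

        def localize(query):
            # Search only the query's bucket instead of the full database.
            candidates = table[lsh(query)] or range(n_db)  # fall back to full NNS
            best = min(candidates, key=lambda i: np.linalg.norm(csi_db[i] - query))
            return pos_db[best], len(list(candidates))

        query = csi_db[42] + 0.05 * rng.standard_normal(dim)  # noisy repeat measurement
        pos, searched = localize(query)
        print(f"estimated position {pos}, searched {searched} of {n_db} entries")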
  5. We study the problem of classifier derandomization in machine learning: given a stochastic binary classifier f:X→[0,1], sample a deterministic classifier f̂:X→{0,1} that approximates the output of f in aggregate over any data distribution. Recent work revealed how to efficiently derandomize a stochastic classifier with strong output approximation guarantees, but at the cost of individual fairness; that is, if f treated similar inputs similarly, f̂ did not. In this paper, we initiate a systematic study of classifier derandomization with metric fairness guarantees. We show that the prior derandomization approach is almost maximally metric-unfair, and that a simple "random threshold" derandomization achieves optimal fairness preservation but with weaker output approximation. We then devise a derandomization procedure that provides an appealing tradeoff between these two: if f is α-metric fair according to a metric d with a locality-sensitive hash (LSH) family, then our derandomized f̂ is, with high probability, O(α)-metric fair and a close approximation of f. We also prove generic results applicable to all (fair and unfair) classifier derandomization procedures, including a bias-variance decomposition and reductions between various notions of metric fairness.
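    The "random threshold" derandomization mentioned above is simple enough to state in a few lines: draw one threshold t uniformly from [0, 1] and round every score f(x) against that shared t. Two inputs with scores p and q then receive different labels with probability exactly |p - q| over the draw of t, which is what preserves metric fairness. The toy classifier f below is an assumption for the demo.

        import random

        def derandomize(f, seed=0):
            t = random.Random(seed).random()  # one shared threshold for all inputs
            return lambda x: 1 if f(x) >= t else 0

        f = lambda x: min(max(x, 0.0), 1.0)   # toy stochastic classifier scores
        f_hat = derandomize(f)

        # Similar inputs get similar scores, hence (usually) identical labels:
        # the chance over t that 0.60 and 0.62 are split is |0.62 - 0.60| = 0.02.
        print(f_hat(0.60), f_hat(0.62), f_hat(0.95))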