skip to main content

This content will become publicly available on December 6, 2022

Title: Neural Distance Embeddings for Biological Sequences
The development of data-dependent heuristics and representations for biological sequences that reflect their evolutionary distance is critical for large-scale biological research. However, popular machine learning approaches, based on continuous Euclidean spaces, have struggled with the discrete combinatorial formulation of the edit distance that models evolution and the hierarchical relationship that characterises real-world datasets. We present Neural Distance Embeddings (NeuroSEED), a general framework to embed sequences in geometric vector spaces, and illustrate the effectiveness of the hyperbolic space that captures the hierarchical structure and provides an average 38% reduction in embedding RMSE against the best competing geometry. The capacity of the framework and the significance of these improvements are then demonstrated devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked with common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets. As an example for hierarchical clustering, the proposed pretrained and from-scratch methods match the quality of competing baselines with 30x and 15x runtime reduction, respectively.
; ; ; ; ;
Award ID(s):
1835598 1934578 1918940 2030477
Publication Date:
Journal Name:
Advances in neural information processing systems
Sponsoring Org:
National Science Foundation
More Like this
  1. Event perception tasks such as recognizing and localizing actions in streaming videos are essential for scaling to real-world application contexts. We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning to localize actions in streaming videos without the need for training labels and outlines for the objects in the video. We propose a framework driven by the notion of hierarchical predictive learning to construct actor-centered features by attention-based contextualization. The key idea is that predictable features or objects do not attract attention and hence do not contribute to the action of interest. Experiments on three benchmark datasets show that the approach can learn robust representations for localizing actions using only one epoch of training, i.e., a single pass through the streaming video. We show that the proposed approach outperforms unsupervised and weakly supervised baselines while offering competitive performance to fully supervised approaches. Additionally, we extend the model to multi-actor settings to recognize group activities while localizing the multiple, plausible actors. We also show that it generalizes to out-of-domain data with limited performance degradation.
  2. Item-based models are among the most popular collaborative filtering approaches for building recommender systems. Random walks can provide a powerful tool for harvesting the rich network of interactions captured within these models. They can exploit indirect relations between the items, mitigate the effects of sparsity, ensure wider itemspace coverage, as well as increase the diversity of recommendation lists. Their potential however, can be hindered by the tendency of the walks to rapidly concentrate towards the central nodes of the graph, thereby significantly restricting the range of K -step distributions that can be exploited for personalized recommendations. In this work, we introduce RecWalk ; a novel random walk-based method that leverages the spectral properties of nearly uncoupled Markov chains to provably lift this limitation and prolong the influence of users’ past preferences on the successive steps of the walk—thereby allowing the walker to explore the underlying network more fruitfully. A comprehensive set of experiments on real-world datasets verify the theoretically predicted properties of the proposed approach and indicate that they are directly linked to significant improvements in top- n recommendation accuracy. They also highlight RecWalk’s potential in providing a framework for boosting the performance of item-based models. RecWalk achieves state-of-the-art top-more »n recommendation quality outperforming several competing approaches, including recently proposed methods that rely on deep neural networks.« less
  3. Node classification is of great importance among various graph mining tasks. In practice, real-world graphs generally follow the long-tail distribution, where a large number of classes only consist of limited labeled nodes. Although Graph Neural Networks (GNNs) have achieved significant improvements in node classification, their performance decreases substantially in such a few-shot scenario. The main reason can be attributed to the vast generalization gap between meta-training and meta-test due to the task variance caused by different node/class distributions in meta-tasks (i.e., node-level and class-level variance). Therefore, to effectively alleviate the impact of task variance, we propose a task-adaptive node classification framework under the few-shot learning setting. Specifically, we first accumulate meta-knowledge across classes with abundant labeled nodes. Then we transfer such knowledge to the classes with limited labeled nodes via our proposed task-adaptive modules. In particular, to accommodate the different node/class distributions among meta-tasks, we propose three essential modules to perform node-level, class-level, and task-level adaptations in each meta-task, respectively. In this way, our framework can conduct adaptations to different meta-tasks and thus advance the model generalization performance on meta-test tasks. Extensive experiments on four prevalent node classification datasets demonstrate the superiority of our framework over the state-of-the-art baselines. Ourmore »code is provided at« less
  4. Weak supervision (WS) frameworks are a popular way to bypass hand-labeling large datasets for training data-hungry models. These approaches synthesize multiple noisy but cheaply-acquired estimates of labels into a set of high-quality pseudo-labels for downstream training. However, the synthesis technique is specific to a particular kind of label, such as binary labels or sequences, and each new label type requires manually designing a new synthesis algorithm. Instead, we propose a universal technique that enables weak supervision over any label type while still offering desirable properties, including practical flexibility, computational efficiency, and theoretical guarantees. We apply this technique to important problems previously not tackled by WS frameworks including learning to rank, regression, and learning in hyperbolic space. Theoretically, our synthesis approach produces a consistent estimators for learning some challenging but important generalizations of the exponential family model. Experimentally, we validate our framework and show improvement over baselines in diverse settings including real-world learning-to-rank and regression problems along with learning on hyperbolic manifolds.
  5. We propose Deep Estimators of Features (DEFs), a learning-based framework for predicting sharp geometric features in sampled 3D shapes. Differently from existing data-driven methods, which reduce this problem to feature classification, we propose to regress a scalar field representing the distance from point samples to the closest feature line on local patches. Our approach is the first that scales to massive point clouds by fusing distance-to-feature estimates obtained on individual patches. We extensively evaluate our approach against related state-of-the-art methods on newly proposed synthetic and real-world 3D CAD model benchmarks. Our approach not only outperforms these (with improvements in Recall and False Positives Rates), but generalizes to real-world scans after training our model on synthetic data and fine-tuning it on a small dataset of scanned data. We demonstrate a downstream application, where we reconstruct an explicit representation of straight and curved sharp feature lines from range scan data. We make code, pre-trained models, and our training and evaluation datasets available at