skip to main content

Title: Revisiting Local Neighborhood Methods in Machine Learning
Several machine learning methods leverage the idea of locality by using k-nearest neighbor (KNN) techniques to design better pattern recognition models. However, the choice of KNN parameters such as k is often made experimentally, e.g., via cross-validation, leading to local neighborhoods without a clear geometric interpretation. In this paper, we replace KNN with our recently introduced polytope neighborhood scheme - Non Negative Kernel regression (NNK). NNK formulates neighborhood selection as a sparse signal approximation problem and is adaptive to the local distribution of samples in the neighborhood of the data point of interest. We analyze the benefits of local neighborhood construction based on NNK. In particular, we study the generalization properties of local interpolation using NNK and present data dependent bounds in the non asymptotic setting. The applicability of NNK in transductive few shot learning setting and for measuring distance between two datasets is demonstrated. NNK exhibits robust, superior performance in comparison to standard locally weighted neighborhood methods.  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 IEEE Data Science and Learning Workshop (DSLW)
Page Range / eLocation ID:
1 to 6
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    In this work we address the adequacy of two machine learning methods to tackle the problem of wind velocity estimation in the lowermost region of the atmosphere using on-board inertial drone data within an outdoor setting. We fed these data, and accompanying wind tower measurements, into a K-nearest neighbor (KNN) algorithm and a long short-term memory (LSTM) neural network to predict future windspeeds, by exploiting the stabilization response of two hovering drones in a wind field. Of the two approaches, we found that LSTM proved to be the most capable supervised learning model during more capricious wind conditions, and made competent windspeed predictions with an average root mean square error of 0.61 m·s−1 averaged across two drones, when trained on at least 20 min of flight data. During calmer conditions, a linear regression model demonstrated acceptable performance, but under more variable wind regimes the LSTM performed considerably better than the linear model, and generally comparable to more sophisticated methods. Our approach departs from other multi-rotor-based windspeed estimation schemes by circumventing the use of complex and specific dynamic models, to instead directly learn the relationship between drone attitude and fluctuating windspeeds. This exhibits utility in a range of otherwise prohibitive environments, like mountainous terrain or off-shore sites. 
    more » « less
  2. Nie, Qing (Ed.)
    The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g ., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. Code and example data for DEWÄKSS is available at . 
    more » « less
  3. Data poisoning aims to compromise a machine learning based software component by contaminating its training set to change its prediction results for test inputs. Existing methods for deciding data-poisoning robustness have either poor accuracy or long running time and, more importantly, they can only certify some of the truly-robust cases, but remain inconclusive when certification fails. In other words, they cannot falsify the truly-non-robust cases. To overcome this limitation, we propose a systematic testing based method, which can falsify as well as certify data-poisoning robustness for a widely used supervised-learning technique named k-nearest neighbors (KNN). Our method is faster and more accurate than the baseline enumeration method, due to a novel over-approximate analysis in the abstract domain, to quickly narrow down the search space, and systematic testing in the concrete domain, to find the actual violations. We have evaluated our method on a set of supervised-learning datasets. Our results show that the method significantly outperforms state-of-the-art techniques, and can decide data-poisoning robustness of KNN prediction results for most of the test inputs. 
    more » « less
  4. Abstract Background

    A cell exhibits a variety of responses to internal and external cues. These responses are possible, in part, due to the presence of an elaborate gene regulatory network (GRN) in every single cell. In the past 20 years, many groups worked on reconstructing the topological structure of GRNs from large-scale gene expression data using a variety of inference algorithms. Insights gained about participating players in GRNs may ultimately lead to therapeutic benefits. Mutual information (MI) is a widely used metric within this inference/reconstruction pipeline as it can detect any correlation (linear and non-linear) between any number of variables (n-dimensions). However, the use of MI with continuous data (for example, normalized fluorescence intensity measurement of gene expression levels) is sensitive to data size, correlation strength and underlying distributions, and often requires laborious and, at times, ad hoc optimization.


    In this work, we first show that estimating MI of a bi- and tri-variate Gaussian distribution usingk-nearest neighbor (kNN) MI estimation results in significant error reduction as compared to commonly used methods based on fixed binning. Second, we demonstrate that implementing the MI-based kNN Kraskov–Stoögbauer–Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR). Finally, through extensive in-silico benchmarking we show that a new inference algorithm CMIA (Conditional Mutual Information Augmentation), inspired by CLR, in combination with the KSG-MI estimator, outperforms commonly used methods.


    Using three canonical datasets containing 15 synthetic networks, the newly developed method for GRN reconstruction—which combines CMIA, and the KSG-MI estimator—achieves an improvement of 20–35% in precision-recall measures over the current gold standard in the field. This new method will enable researchers to discover new gene interactions or better choose gene candidates for experimental validations.

    more » « less
  5. Abstract

    X-ray binaries (XRBs) consist of a compact object that accretes material from an orbiting secondary star. The most secure method we have for determining if the compact object is a black hole is to determine its mass: This is limited to bright objects and requires substantial time-intensive spectroscopic monitoring. With new X-ray sources being discovered with different X-ray observatories, developing efficient, robust means to classify compact objects becomes increasingly important. We compare three machine-learning classification methods (Bayesian Gaussian Processes (BGPs), K-Nearest Neighbors (KNN), Support Vector Machines) for determining whether the compact objects are neutron stars or black holes (BHs) in XRB systems. Each machine-learning method uses spatial patterns that exist between systems of the same type in 3D color–color–intensity diagrams. We used lightcurves extracted using 6 yr of data with MAXI/GSC for 44 representative sources. We find that all three methods are highly accurate in distinguishing pulsing from nonpulsing neutron stars (NPNS) with 95% of NPNS and 100% of pulsars accurately predicted. All three methods have high accuracy in distinguishing BHs from pulsars (92%) but continue to confuse BHs with a subclass of NPNS, called bursters, with KNN doing the best at only 50% accuracy for predicting BHs. The precision of all three methods is high, providing equivalent results over 5–10 independent runs. In future work, we will suggest a fourth dimension be incorporated to mitigate the confusion of BHs with bursters. This work paves the way toward more robust methods to efficiently distinguish BHs, NPNS, and pulsars.

    more » « less