skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A Hyperplane-Based Algorithm for Semi-Supervised Dimension Reduction
We consider the semi-supervised dimension reduction problem: given a high dimensional dataset with a small number of labeled data and huge number of unlabeled data, the goal is to find the low-dimensional embedding that yields good classification results. Most of the previous algorithms for this task are linkage-based algorithms. They try to enforce the must-link and cannot-link constraints in dimension reduction, leading to a nearest neighbor classifier in low dimensional space. In this paper, we propose a new hyperplane-based semi-supervised dimension reduction method---the main objective is to learn the low-dimensional features that can both approximate the original data and form a good separating hyperplane. We formulate this as a non-convex optimization problem and propose an efficient algorithm to solve it. The algorithm can scale to problems with millions of features and can easily incorporate non-negative constraints in order to learn interpretable non-negative features. Experiments on real world datasets demonstrate that our hyperplane-based dimension reduction method outperforms state-of-art linkage-based methods when very few labels are available.  more » « less
Award ID(s):
1719097
PAR ID:
10058227
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE International Conference on Data Mining (ICDM)
Page Range / eLocation ID:
101 to 110
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The problem of clustering points on a union of subspaces finds numerous applications in machine learning and computer vision, and it has been extensively studied in the past two decades. When the subspaces are low-dimensional, the problem can be formulated as a convex sparse optimization problem, for which numerous accurate, efficient and robust methods exist. When the subspaces are of high relative dimension (e.g., hyperplanes), the problem is intrinsically non-convex, and existing methods either lack theory, are computationally costly, lack robustness to outliers, or learn hyperplanes one at a time. In this paper, we propose Hyperplane ARangentment Descent (HARD), a method that robustly learns all the hyperplanes simultaneously by solving a novel non-convex non-smooth ℓ1 minimization problem. We provide geometric conditions under which the ground-truth hyperplane arrangement is a coordinate-wise minimizer of our objective. Furthermore, we devise efficient algorithms, and give conditions under which they converge to coordinate-wise minimizes. We provide empirical evidence that HARD surpasses state-of-the-art methods and further show an interesting experiment in clustering deep features on CIFAR-10. 
    more » « less
  2. In this paper, we address the problem of detecting and learning anomalies in high-dimensional data-streams in real-time. Following a data-driven approach, we propose an online and multivariate anomaly detection method that is suitable for the timely and accurate detection of anomalies. We propose our method for both semi-supervised and supervised settings. By combining the semi-supervised and supervised algorithms, we present a self-supervised online learning algorithm in which the semi-supervised algorithm trains the supervised algorithm to improve its detection performance over time. The methods are comprehensively analyzed in terms of computational complexity, asymptotic optimality, and false alarm rate. The performances of the proposed algorithms are also evaluated using real-world cybersecurity datasets, that show a significant improvement over the state-of-the-art results. 
    more » « less
  3. Chambers, Erin W.; Gudmundsson, Joachim (Ed.)
    Datasets with non-trivial large scale topology can be hard to embed in low-dimensional Euclidean space with existing dimensionality reduction algorithms. We propose to model topologically complex datasets using vector bundles, in such a way that the base space accounts for the large scale topology, while the fibers account for the local geometry. This allows one to reduce the dimensionality of the fibers, while preserving the large scale topology. We formalize this point of view and, as an application, we describe a dimensionality reduction algorithm based on topological inference for vector bundles. The algorithm takes as input a dataset together with an initial representation in Euclidean space, assumed to recover part of its large scale topology, and outputs a new representation that integrates local representations obtained through local linear dimensionality reduction. We demonstrate this algorithm on examples coming from dynamical systems and chemistry. In these examples, our algorithm is able to learn topologically faithful embeddings of the data in lower target dimension than various well known metric-based dimensionality reduction algorithms. 
    more » « less
  4. Imaging data-based prognostic models focus on using an asset’s degradation images to predict its time to failure (TTF). Most image-based prognostic models have two common limitations. First, they require degradation images to be complete (i.e., images are observed continuously and regularly over time). Second, they usually employ an unsupervised dimension reduction method to extract low-dimensional features and then use the features for TTF prediction. Because unsupervised dimension reduction is conducted on the degradation images without the involvement of TTFs, there is no guarantee that the extracted features are effective for failure time prediction. To address these challenges, this article develops a supervised tensor dimension reduction-based prognostic model. The model first proposes a supervised dimension reduction method for tensor data. It uses historical TTFs to guide the detection of a tensor subspace to extract low-dimensional features from high-dimensional incomplete degradation imaging data. Next, the extracted features are used to construct a prognostic model based on (log)-location-scale regression. An optimization algorithm for parameter estimation is proposed, and analytical solutions are discussed. Simulated data and a real-world data set are used to validate the performance of the proposed model. History: Bianca Maria Colosimo served as the senior editor for this article Funding: This work was supported by National Science Foundation [2229245]. Data Ethics & Reproducibility Note: The code capsule is available on Code Ocean at https://github.com/czhou9/Code-and-Data-for-IJDS and in the e-Companion to this article (available at https://doi.org/10.1287/ijds.2022.x022 ). 
    more » « less
  5. Manifold embedding algorithms map high-dimensional data down to coordinates in a much lower-dimensional space. One of the aims of dimension reduction is to find intrinsic coordinates that describe the data manifold. The coordinates returned by the embedding algorithm are abstract, and finding their physical or domain-related meaning is not formalized and often left to domain experts. This paper studies the problem of recovering the meaning of the new low-dimensional representation in an automatic, principled fashion. We propose a method to explain embedding coordinates of a manifold as non-linear compositions of functions from a user-defined dictionary. We show that this problem can be set up as a sparse linear Group Lasso recovery problem, find sufficient recovery conditions, and demonstrate its effectiveness on data 
    more » « less