Machine learning (ML)-based data-driven methods have promoted the progress of modeling in many engineering domains. These methods can achieve high prediction and generalization performance for large, high-quality datasets. However, ML methods can yield biased predictions if the observed data (i.e., the response variable y) are corrupted by outliers. This paper addresses this problem with a novel, robust ML approach that is formulated as an optimization problem by coupling locally weighted least-squares support vector machines for regression (LWLS-SVMR) with a weight function. The weight is a function of the residuals and is updated iteratively within the proposed approach, significantly reducing the negative interference of outliers. A new efficient hybrid algorithm is developed to solve the optimization problem. The proposed approach is assessed and validated by comparison with relevant ML approaches on both one-dimensional simulated datasets corrupted by various outliers and multi-dimensional real-world engineering datasets, including datasets used for predicting the lateral strength of reinforced concrete (RC) columns, the fuel consumption of automobiles, the rise time of a servomechanism, and dielectric breakdown strength. Finally, the proposed method is applied to produce a data-driven solver for computational mechanics with a nonlinear material dataset corrupted by outliers. The results all show that the proposed method is robust against both non-extreme and extreme outliers and improves the predictive performance needed to solve various engineering problems.
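The abstract does not give the LWLS-SVMR formulation, but the core idea of a residual-dependent weight that is updated iteratively can be sketched with a generic iteratively reweighted least-squares (IRLS) line fit. The Huber-style weight function, tuning constant, and synthetic data below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def huber_weights(r, k=1.345):
    """Huber weight: 1 for small residuals, k/|r| for large ones."""
    a = np.abs(r)
    w = np.ones_like(a)
    mask = a > k
    w[mask] = k / a[mask]
    return w

def irls_line_fit(x, y, n_iter=20):
    """Fit y = a*x + b, downweighting outliers via residual-based weights."""
    X = np.column_stack([x, np.ones_like(x)])
    w = np.ones_like(y)
    for _ in range(n_iter):
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        r = y - X @ coef
        scale = np.median(np.abs(r)) / 0.6745 + 1e-12  # robust sigma estimate
        w = huber_weights(r / scale)
    return coef

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0 + 0.01 * rng.standard_normal(50)
y[:5] += 10.0                       # inject extreme outliers
a, b = irls_line_fit(x, y)          # recovers slope ~2 and intercept ~1
```

An ordinary least-squares fit on the same data would be pulled strongly toward the five corrupted points; the reweighting loop drives their influence toward zero.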
Robust semi-supervised non-negative matrix factorization for binary subspace learning
Abstract Non-negative matrix factorization and its extensions have been applied in many areas (e.g., dimensionality reduction and clustering). When the original data are corrupted by outliers and noise, most non-negative matrix factorization methods cannot achieve robust factorization or learn a subspace with binary codes. This paper puts forward a robust semi-supervised non-negative matrix factorization method for binary subspace learning, called RSNMF, for image clustering. To improve clustering performance on datasets contaminated by outliers and noise, we propose a weighted constraint on the noise matrix and incorporate manifold learning into non-negative matrix factorization. Moreover, we utilize a discrete hashing learning method to constrain the learned subspace, yielding a binary subspace from the original data. Experimental results validate the robustness and effectiveness of RSNMF in binary subspace learning and image clustering on face datasets corrupted by salt-and-pepper noise and contiguous occlusion.
- Award ID(s): 2011927
- PAR ID: 10238282
- Date Published:
- Journal Name: Complex & Intelligent Systems
- ISSN: 2199-4536
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Traditional neuroimaging analysis, such as clustering data collected for Alzheimer's disease (AD), usually relies on data from a single imaging modality. However, recent advances in technology and equipment give us the opportunity to analyze diseases better by collecting and employing data from different imaging and genetic modalities, which may enhance predictive performance. To perform better clustering in AD analysis, in this paper we conduct a new study that makes use of data from different modalities/views. To achieve this goal, we propose a simple yet efficient method based on Non-negative Matrix Factorization (NMF) that not only achieves better prediction performance but also handles data missing from some views. Experimental results on the ADNI dataset demonstrate the effectiveness of the proposed method.
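The abstract does not specify how the views are coupled; a common construction for multi-view NMF is a coefficient matrix H shared across views, so that every view votes on the same cluster structure. The shared-H model, update rules, and synthetic two-cluster data below are assumptions for illustration, not necessarily the paper's method:

```python
import numpy as np

def multiview_nmf(views, rank, n_iter=300, seed=0):
    """Joint NMF of several views X_v ~= U_v @ H with one shared H."""
    rng = np.random.default_rng(seed)
    n = views[0].shape[1]
    Us = [rng.random((X.shape[0], rank)) + 0.1 for X in views]
    H = rng.random((rank, n)) + 0.1
    eps = 1e-12
    for _ in range(n_iter):
        for v, X in enumerate(views):
            Us[v] *= (X @ H.T) / (Us[v] @ H @ H.T + eps)   # per-view basis update
        num = sum(U.T @ X for U, X in zip(Us, views))
        den = sum(U.T @ U @ H for U in Us) + eps
        H *= num / den                                     # shared coefficient update
    return Us, H

# Two synthetic views sharing the same two-cluster structure over 40 samples.
rng = np.random.default_rng(3)
H_true = np.hstack([np.tile([[1.0], [0.05]], 20), np.tile([[0.05], [1.0]], 20)])
X1 = (rng.random((25, 2)) + 0.1) @ H_true
X2 = (rng.random((15, 2)) + 0.1) @ H_true
Us, H = multiview_nmf([X1, X2], rank=2)

Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)  # remove row scaling
labels = Hn.argmax(axis=0)                                    # cluster assignment
```

Handling a missing view, as the abstract mentions, would amount to dropping that view's term from the sums for the affected samples; that extension is omitted here for brevity.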
We propose a neural network for unsupervised anomaly detection with a novel robust subspace recovery layer (RSR layer). This layer seeks to extract the underlying subspace from a latent representation of the given data and removes outliers that lie away from this subspace. It is used within an autoencoder. The encoder maps the data into a latent space, from which the RSR layer extracts the subspace. The decoder then smoothly maps back the underlying subspace to a “manifold” close to the original inliers. Inliers and outliers are distinguished according to the distances between the original and mapped positions (small for inliers and large for outliers).  Extensive numerical experiments with both image and document datasets demonstrate state-of-the-art precision and recall.
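The RSR layer is learned jointly with the autoencoder; a training-free linear analogue of its inlier/outlier criterion is to score each point by its distance to the best-fit low-dimensional subspace, computed here via an SVD. Dimensions, data, and the choice of subspace estimator are illustrative:

```python
import numpy as np

def subspace_scores(Z, d):
    """Distance of each point from the best-fit d-dimensional linear
    subspace of the (centered) data -- large scores suggest outliers."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    P = Vt[:d].T @ Vt[:d]          # orthogonal projector onto the subspace
    residual = Zc - Zc @ P         # component orthogonal to the subspace
    return np.linalg.norm(residual, axis=1)

rng = np.random.default_rng(1)
basis = rng.standard_normal((2, 10))
inliers = rng.standard_normal((100, 2)) @ basis   # lie in a 2-D subspace of R^10
outliers = 5.0 * rng.standard_normal((5, 10))     # generic positions, off-subspace
scores = subspace_scores(np.vstack([inliers, outliers]), d=2)
```

As in the paper's criterion, inliers get small scores and outliers large ones; the neural RSR layer replaces this fixed SVD subspace with one learned robustly inside the latent space.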
Discovering and clustering subspaces in high-dimensional data is a fundamental problem of machine learning with a wide range of applications in data mining, computer vision, and pattern recognition. Earlier methods divided the problem into two separate stages of finding the similarity matrix and finding clusters. Similar to some recent works, we integrate these two steps using a joint optimization approach. We make the following contributions: (i) we estimate the reliability of the cluster assignment for each point before assigning a point to a subspace. We group the data points into two groups of “certain” and “uncertain”, with the assignment of the latter group delayed until their subspace association certainty improves. (ii) We demonstrate that delayed association is better suited for clustering subspaces that have ambiguities, i.e., when subspaces intersect or data are contaminated with outliers/noise. (iii) We demonstrate experimentally that such delayed probabilistic association leads to a more accurate self-representation and final clusters. The proposed method has higher accuracy both for points that exclusively lie in one subspace and for those that are on the intersection of subspaces. (iv) We show that delayed association leads to a large reduction in computational cost, since it allows for incremental spectral clustering.
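The certain/uncertain split in contribution (i) can be illustrated with a minimal residual-margin rule: a point is associated with a subspace only when its best fit is clearly better than the runner-up, otherwise the decision is deferred. The margin test below is a simplification of the paper's probabilistic reliability estimate:

```python
import numpy as np

def assign_with_delay(X, bases, margin=0.5):
    """Assign each point to its nearest subspace, but defer ('delay') points
    whose best residual is not clearly smaller than the second best."""
    res = np.stack([
        np.linalg.norm(X - X @ (B @ B.T), axis=1)   # residual to each subspace
        for B in bases                              # B: orthonormal basis, (d, k)
    ])
    best, second = np.sort(res, axis=0)[:2]
    labels = res.argmin(axis=0).astype(float)
    labels[best >= margin * second] = np.nan        # NaN marks a deferred point
    return labels

# Two 1-D subspaces in R^2: the x-axis and the y-axis.
bases = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
X = np.array([[2.0, 0.1],    # clearly on the x-axis
              [0.1, 2.0],    # clearly on the y-axis
              [1.0, 1.0]])   # on the bisector: ambiguous, gets deferred
labels = assign_with_delay(X, bases)
```

In the full method, the deferred points are revisited once the confidently assigned points have sharpened the subspace estimates, which is exactly the situation near subspace intersections that the abstract describes.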
What learning algorithms can be run directly on compressively sensed data? In this work, we consider the question of accurately and efficiently computing low-rank matrix or tensor factorizations given data compressed via random projections. We examine the approach of first performing factorization in the compressed domain, and then reconstructing the original high-dimensional factors from the recovered (compressed) factors. In both the matrix and tensor settings, we establish conditions under which this natural approach will provably recover the original factors. While it is well known that random projections preserve a number of geometric properties of a dataset, our work can be viewed as showing that they can also preserve certain solutions of non-convex, NP-hard problems like non-negative matrix factorization. We support these theoretical results with experiments on synthetic data and demonstrate the practical applicability of compressed factorization on real-world gene expression and EEG time series datasets.
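A small numerical illustration of why factoring in the compressed domain can work (this is not the paper's recovery algorithm): the row space of a low-rank matrix survives a random row projection almost surely once the sketch size exceeds the rank, so the right factor's subspace can be estimated from the sketch alone. All sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r, k = 200, 50, 3, 15                      # ambient rows, cols, rank, sketch size
X = rng.random((m, r)) @ rng.random((r, n))      # non-negative, exactly rank-3 data

S = rng.standard_normal((k, m)) / np.sqrt(k)     # random projection (k << m)
Y = S @ X                                        # compressed data

def rowspace(M, r):
    """Orthonormal basis for the top-r right singular subspace of M."""
    _, _, Vt = np.linalg.svd(M, full_matrices=False)
    return Vt[:r]

V_full = rowspace(X, r)                          # from the original data
V_comp = rowspace(Y, r)                          # from the compressed data only

# Largest principal angle between the two row spaces (0 means identical).
cosines = np.linalg.svd(V_full @ V_comp.T, compute_uv=False)
gap = np.sqrt(max(0.0, 1.0 - cosines.min() ** 2))
```

The paper's contribution goes further, showing that structured solutions such as non-negative factors can themselves be recovered from the compressed factors under suitable conditions.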