Tensor factorization methods have recently gained increased popularity. A key feature that renders tensors attractive is their ability to directly model multi-relational data. In this work, we propose ParaSketch, a parallel tensor factorization algorithm that enables massive parallelism for handling large tensors. The idea is to compress the large tensor into multiple small tensors, decompose each small tensor in parallel, and combine the results to reconstruct the desired latent factors. Prior art in this direction entails potentially very high complexity in the (Gaussian) compression and final combining stages. By adopting sketching matrices for compression, the proposed method enjoys a dramatic reduction in compression complexity and features a much lighter combining step. Moreover, theoretical analysis shows that the compressed tensors inherit latent identifiability under mild conditions, establishing the correctness of the overall approach. Numerical experiments corroborate the theory and demonstrate the effectiveness of the proposed algorithm.
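The compress-decompose-combine idea lends itself to a short numerical illustration. The Python sketch below is not the ParaSketch implementation: it uses plain Gaussian compression matrices, tensorly's generic CP routine, and arbitrary dimensions, purely to show that a multi-mode-compressed low-rank tensor can still be decomposed faithfully, mirroring the identifiability claim above (all names and sizes are illustrative assumptions).

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from tensorly.cp_tensor import cp_to_tensor
from tensorly.tenalg import multi_mode_dot

rng = np.random.default_rng(0)
I, J, K, R = 200, 150, 100, 5        # original tensor dimensions and CP rank
m1, m2, m3 = 40, 30, 25              # compressed dimensions (each >= R)

# Ground-truth factors and the full low-rank tensor X = [[A, B, C]].
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
X = cp_to_tensor((np.ones(R), [A, B, C]))

# Mode-wise compression matrices (Gaussian here; the paper replaces these with
# sketching matrices to make this multiplication much cheaper).
S1, S2, S3 = (rng.standard_normal((m, d)) / np.sqrt(m)
              for m, d in ((m1, I), (m2, J), (m3, K)))

# Compress: Y = X x1 S1 x2 S2 x3 S3, a small m1 x m2 x m3 tensor.
Y = multi_mode_dot(X, [S1, S2, S3], modes=[0, 1, 2])

# CP-decompose the compressed tensor; its factors should match
# (S1 A, S2 B, S3 C) up to column permutation and scaling, which is the
# "inherited identifiability" property the abstract refers to.
weights, (Ah, Bh, Ch) = parafac(tl.tensor(Y), rank=R, n_iter_max=500, tol=1e-10)

# Sanity check: the compressed tensor is reproduced from the recovered factors.
rel_err = tl.norm(Y - cp_to_tensor((weights, [Ah, Bh, Ch]))) / tl.norm(Y)
print(f"relative reconstruction error on the compressed tensor: {rel_err:.2e}")
```

In the paper's setting, many such compressed replicas would be factored in parallel and their results combined to recover the full-size latent factors; this sketch covers only a single replica.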
Compressed Factorization: Fast and Accurate Low-Rank Factorization of Compressively-Sensed Data
What learning algorithms can be run directly on compressively-sensed data? In this work, we consider the question of accurately and efficiently computing low-rank matrix or tensor factorizations given data compressed via random projections. We examine the approach of first performing factorization in the compressed domain, and then reconstructing the original high-dimensional factors from the recovered (compressed) factors. In both the matrix and tensor settings, we establish conditions under which this natural approach will provably recover the original factors. While it is well-known that random projections preserve a number of geometric properties of a dataset, our work can be viewed as showing that they can also preserve certain solutions of non-convex, NP-hard problems like non-negative matrix factorization. We support these theoretical results with experiments on synthetic data and demonstrate the practical applicability of compressed factorization on real-world gene expression and EEG time series datasets.
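To make the two-stage recipe concrete, here is a small Python sketch of the second stage only: recovering a sparse, non-negative factor from its randomly projected counterpart with a non-negative Lasso. It is a toy stand-in, not the paper's algorithm or experimental setup; the compressed-domain factorization is skipped by projecting a known dictionary directly, and all dimensions, sparsity levels, and solver settings are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
d, m, r, k = 2000, 200, 8, 25   # ambient dim, compressed dim, rank, nonzeros per factor

# A sparse, non-negative "dictionary" H (r x d): each row has k positive entries.
H = np.zeros((r, d))
for i in range(r):
    support = rng.choice(d, size=k, replace=False)
    H[i, support] = rng.uniform(0.5, 2.0, size=k)

# Random projection Phi (m x d); the rows of H @ Phi.T stand in for the factor
# that a compressed-domain factorization would return.
Phi = rng.standard_normal((m, d))
H_compressed = H @ Phi.T

# Reconstruct each high-dimensional row by sparse, non-negative regression:
# roughly, minimize ||Phi h - h_compressed||^2 + alpha * ||h||_1 with h >= 0.
H_rec = np.zeros_like(H)
for i in range(r):
    reg = Lasso(alpha=1e-3, positive=True, max_iter=50_000)
    reg.fit(Phi, H_compressed[i])
    H_rec[i] = reg.coef_

rel_err = np.linalg.norm(H_rec - H) / np.linalg.norm(H)
print(f"relative recovery error of the sparse factor: {rel_err:.2e}")
```

The sparsity and non-negativity of the factor are what make this underdetermined recovery well posed, which is the role played by the conditions the abstract alludes to.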
- PAR ID: 10098846
- Date Published:
- Journal Name: International Conference on Machine Learning (ICML)
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Non-negative matrix factorization and its extensions have been applied to various areas (e.g., dimensionality reduction and clustering). When the original data are corrupted by outliers and noise, most non-negative matrix factorization methods cannot achieve robust factorization and learn a subspace with binary codes. This paper puts forward a robust semi-supervised non-negative matrix factorization method for binary subspace learning, called RSNMF, for image clustering. For better clustering performance on datasets contaminated by outliers and noise, we propose a weighted constraint on the noise matrix and incorporate manifold learning into non-negative matrix factorization. Moreover, we utilize a discrete hashing learning method to constrain the learned subspace, which yields a binary subspace from the original data. Experimental results validate the robustness and effectiveness of RSNMF in binary subspace learning and image clustering on a face dataset corrupted by salt-and-pepper noise and contiguous occlusion.
- Sparse tensor factorization is a popular tool in multi-way data analysis and is used in applications such as cybersecurity, recommender systems, and social network analysis. In many of these applications, the tensor is not known a priori and instead arrives in a streaming fashion for a potentially unbounded amount of time. Existing approaches for streaming sparse tensors are not practical for unbounded streaming because they rely on maintaining the full factorization of the data, which grows linearly with time. In this work, we present CP-stream, an algorithm for streaming factorization in the model of the canonical polyadic decomposition which does not grow linearly in time or space, and is thus practical for long-term streaming. Additionally, CP-stream incorporates user-specified constraints such as non-negativity which aid in the stability and interpretability of the factorization. An evaluation of CP-stream demonstrates that it converges faster than state-of-the-art streaming algorithms while achieving lower reconstruction error by an order of magnitude. We also evaluate it on real-world sparse datasets and demonstrate its usability in both network traffic analysis and discussion tracking. Our evaluation uses exclusively public datasets and our source code is released to the public as part of SPLATT, an open source high-performance tensor factorization toolkit.
- Social science approaches to missing values predict avoided, unrequested, or lost information from dense data sets, typically surveys. The authors propose a matrix factorization approach to missing data imputation that (1) identifies underlying factors to model similarities across respondents and responses and (2) regularizes across factors to reduce their overinfluence for optimal data reconstruction (a minimal sketch of this masked-factorization idea appears after this list). This approach may enable social scientists to draw new conclusions from sparse data sets with a large number of features, for example, historical or archival sources, online surveys with high attrition rates, or data sets created from Web scraping, which confound traditional imputation techniques. The authors introduce matrix factorization techniques, detail their probabilistic interpretation, and demonstrate these techniques' consistency with Rubin's multiple imputation framework. Via simulations using artificial data and data from real-world subsets of the General Social Survey and National Longitudinal Study of Youth, the authors show cases for which matrix factorization techniques may be preferred. These findings recommend the use of matrix factorization for data reconstruction in several settings, particularly when data are Boolean and categorical and when large proportions of the data are missing.
- Spatiotemporal systems are ubiquitous in a large number of scientific areas, representing underlying knowledge and patterns in the data. Here, a fundamental question usually arises as to how to understand and characterize these spatiotemporal systems with a certain data-driven machine learning framework. In this work, we introduce an unsupervised pattern discovery framework, namely, dynamic autoregressive tensor factorization. Our framework is essentially built on the fact that spatiotemporal systems can be well described by time-varying autoregression on multivariate or even multidimensional data. In the modeling process, tensor factorization is seamlessly integrated into the time-varying autoregression for discovering spatial and temporal modes/patterns from the spatiotemporal systems, in which the spatial factor matrix is assumed to be orthogonal. To evaluate the framework, we apply it to several real-world spatiotemporal datasets, including fluid flow dynamics, international import/export merchandise trade, and urban human mobility. On the international trade dataset with dimensions {country/region, product type, year}, our framework can produce interpretable import/export patterns of countries/regions, while the low-dimensional product patterns are also important for classifying import/export merchandise and understanding systematic differences between import and export. On the ridesharing mobility dataset with dimensions {origin, destination, time}, our framework is helpful for identifying the shift of spatial patterns of urban human mobility that changed between 2019 and 2022. Empirical experiments demonstrate that our framework can discover interpretable and meaningful patterns from spatiotemporal systems that are both time-varying and multidimensional.
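As noted in the missing-data item above, regularized matrix factorization for imputation can be sketched in a few lines of Python: alternate ridge-regression updates of respondent and item factors fit only to the observed cells, then fill in missing cells from the low-rank reconstruction. The snippet below is an illustrative toy with assumed dimensions, regularization strength, and synthetic data, not the authors' procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_resp, n_items, rank, lam = 500, 40, 5, 0.1

# Synthetic "responses": low-rank structure plus noise, with roughly 30% missing at random.
U_true = rng.standard_normal((n_resp, rank))
V_true = rng.standard_normal((n_items, rank))
X = U_true @ V_true.T + 0.1 * rng.standard_normal((n_resp, n_items))
observed = rng.random(X.shape) > 0.3          # True where a response was recorded

# Alternating ridge regressions: fit respondent factors U and item factors V
# on the observed cells only; the penalty lam shrinks factors to limit overfitting.
U = 0.1 * rng.standard_normal((n_resp, rank))
V = 0.1 * rng.standard_normal((n_items, rank))
for _ in range(30):
    for i in range(n_resp):                    # update each respondent's factor
        obs = observed[i]
        A = V[obs].T @ V[obs] + lam * np.eye(rank)
        U[i] = np.linalg.solve(A, V[obs].T @ X[i, obs])
    for j in range(n_items):                   # update each item's factor
        obs = observed[:, j]
        A = U[obs].T @ U[obs] + lam * np.eye(rank)
        V[j] = np.linalg.solve(A, U[obs].T @ X[obs, j])

# Impute the missing cells from the fitted factors and report error on them.
X_hat = U @ V.T
rmse = np.sqrt(np.mean((X_hat[~observed] - (U_true @ V_true.T)[~observed]) ** 2))
print(f"RMSE on missing cells (vs. the noiseless signal): {rmse:.3f}")
```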