Title: Probabilistic methods for approximate archetypal analysis
Abstract: Archetypal analysis (AA) is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of AA in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data, respectively. We prove that, provided the data are approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of AA. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of AA. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
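The record contains no code; the following is a minimal sketch of the two preprocessing ideas the abstract describes, assuming a Gaussian random projection stands in for the dimension-reduction step and points extreme along random directions stand in for the approximate convex hull vertices. Function names and parameters are illustrative, not taken from the paper.

```python
# Hypothetical sketch of the two preprocessing steps named in the abstract:
# (1) dimension reduction via a Gaussian random projection
#     (Johnson-Lindenstrauss style), and
# (2) cardinality reduction by keeping only points that are extreme along
#     random directions, i.e. approximate vertices of the convex hull.
import numpy as np

rng = np.random.default_rng(0)

def reduce_dimension(X, k):
    """Project n points in R^d down to R^k with a random Gaussian map."""
    d = X.shape[1]
    G = rng.normal(size=(d, k)) / np.sqrt(k)  # JL-style scaling
    return X @ G

def reduce_cardinality(X, n_directions):
    """Keep points maximizing <x, u> over random unit directions u;
    these are candidate vertices of conv(X)."""
    U = rng.normal(size=(n_directions, X.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    idx = np.unique(np.argmax(X @ U.T, axis=0))
    return X[idx]

# Usage: shrink the data first, then hand the small set to any AA solver.
X = rng.normal(size=(10_000, 500))                    # toy data
X_small = reduce_cardinality(reduce_dimension(X, 20), 200)
```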
Award ID(s):
1752202, 2136198
PAR ID:
10335149
Author(s) / Creator(s):
Date Published:
Journal Name:
Information and Inference: A Journal of the IMA
ISSN:
2049-8772
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. Archetypal analysis (AA) is a versatile data analysis method to cluster distinct features within a data set. Here, we demonstrate a framework showing the power of AA to spatio-temporally resolve events in calcium imaging, an imaging modality commonly used in neurobiology and neuroscience to capture neuronal communication patterns. After validating our AA-based approach on synthetic data sets, we were able to characterize neuronal communication patterns in recorded calcium waves. Clinical relevance: Transient calcium events play an essential role in brain cell communication, growth, and network formation, as well as in neurodegeneration. To reliably interpret calcium events from personalized medicine data, where patterns may differ from patient to patient, appropriate image processing and signal analysis methods need to be developed for optimal network characterization. (A minimal AA solver sketch follows this list.)
2. We present a novel method for identifying topological features of chromatin domains in live cells using single-particle tracking and topological data analysis (TDA). By applying TDA to particle trajectories, we can effectively detect complex spatial patterns, such as loops, that are often missed by traditional time series analysis. Using simulations of polymer bead–spring chains, we have validated the accuracy of our method and determined its limitations for detecting loops. Our approach offers a promising avenue for exploring the topological complexity of chromatin in living cells using TDA techniques. (A persistent-homology loop-detection sketch follows this list.)
3. In this paper, we examine the properties of the Jones polynomial using dimensionality reduction learning techniques combined with ideas from topological data analysis. Our data set consists of more than 10 million knots up to 17 crossings and two other special families up to 2001 crossings. We introduce and describe a method for using filtrations to analyze infinite data sets where representative sampling is impossible or impractical, an essential requirement for working with knots and the data from knot invariants. In particular, this method provides a new approach for analyzing knot invariants using Principal Component Analysis. Using this approach on the Jones polynomial data, we find that the data can be viewed as lying in an approximately three-dimensional subspace, that this description is surprisingly stable with respect to the filtration by crossing number, and that the results suggest further structures to be examined and understood. (A PCA sketch follows this list.)
4. An emerging method for data analysis is called Topological Data Analysis (TDA). TDA is based in the mathematical field of topology and examines the properties of spaces under continuous deformation. One of the key tools used for TDA is persistent homology, which considers the connectivity of points in a d-dimensional point cloud at different spatial resolutions to identify topological properties (holes, loops, and voids) in the space. Persistent homology then classifies the topological features by their persistence through the range of spatial connectivity. Unfortunately, the memory and run-time complexity of computing persistent homology is exponential, and current tools can only process a few thousand points in R^3. Fortunately, the use of data reduction techniques enables persistent homology to be applied to much larger point clouds. Techniques to reduce the data range from random sampling of points to clustering the data and using the cluster centroids as the reduced data. While several data reduction approaches appear to preserve the large topological features present in the original point cloud, no systematic study comparing the efficacy of different data clustering techniques in preserving the persistent homology results has been performed. This paper explores the question of topology-preserving data reductions and describes formally when and how topological features can be mischaracterized or lost by data reduction techniques. The paper also performs an experimental assessment of data reduction techniques and their effects on the persistent homology results. In particular, data reduction by random selection is compared to cluster centroids extracted from different data clustering algorithms. (A sketch comparing these two reductions follows this list.)
5. The arrival time prediction of coronal mass ejections (CMEs) is an area of active research. Many methods with varying levels of complexity have been developed to predict CME arrival. However, the mean absolute error (MAE) of predictions remains above 12 hr, even with the increasing complexity of methods. In this work we develop a new method for CME arrival time prediction that uses magnetohydrodynamic simulations involving data-constrained flux-rope-based CMEs, which are introduced into a data-driven solar wind background. We found that for the six CMEs studied in this work the MAE in arrival time was ∼8 hr. We further improved our arrival time predictions by using ensemble modeling and comparing the ensemble solutions with STEREO-A and STEREO-B heliospheric imager (HI) data. This was done by using our simulations to create synthetic J-maps. A machine-learning (ML) method, lasso regression, was used for this comparison. Using this approach, we could reduce the MAE to ∼4 hr. Another ML method, based on neural networks (NNs), made it possible to reduce the MAE to ∼5 hr for the cases where HI data from both STEREO-A and STEREO-B were available. NNs are capable of providing a similar MAE when only STEREO-A data are used. Our methods also resulted in very encouraging values of the standard deviation (precision) of the arrival time. The methods discussed in this paper demonstrate significant improvements in CME arrival time predictions. Our work highlights the importance of using ML techniques in combination with data-constrained magnetohydrodynamic modeling to improve space weather predictions. (A lasso-regression sketch follows this list.)
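For item 1: a minimal illustrative AA solver via alternating projected gradient with simplex projections, not the framework used in that work. Data rows are assumed to be flattened calcium-imaging frames; step size and iteration count are arbitrary.

```python
# Minimal archetypal analysis sketch: fit X ~ A @ B @ X with A (n x k)
# and B (k x n) row-stochastic, by projected gradient descent.
import numpy as np

def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, k = V.shape
    U = np.sort(V, axis=1)[:, ::-1]
    css = np.cumsum(U, axis=1) - 1.0
    rho = ((U - css / np.arange(1, k + 1)) > 0).sum(axis=1)
    theta = css[np.arange(n), rho - 1] / rho
    return np.maximum(V - theta[:, None], 0.0)

def archetypal_analysis(X, k, n_iter=500, lr=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    A = project_rows_to_simplex(rng.random((n, k)))
    B = project_rows_to_simplex(rng.random((k, n)))
    for _ in range(n_iter):
        Z = B @ X                                     # current archetypes
        # Gradients of ||X - A Z||_F^2 (constant factor absorbed into lr);
        # a fixed step size keeps the sketch short, a line search is safer.
        A = project_rows_to_simplex(A - lr * (A @ Z - X) @ Z.T)
        B = project_rows_to_simplex(B - lr * A.T @ (A @ (B @ X) - X) @ X.T)
    return A, B @ X  # weights and archetypes
```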
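For item 2: a sketch of loop detection in a particle trajectory with persistent homology, assuming the `ripser` Python package (ripser.py) is available; the cited work's actual pipeline and trajectory preprocessing may differ.

```python
# Treat trajectory positions as a point cloud and compute H1 persistence;
# a long-lived H1 bar indicates a loop in the explored region.
import numpy as np
from ripser import ripser  # assumed available (pip install ripser)

# Toy trajectory: a noisy loop in 2D, e.g. a tracked locus circling a domain.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 200)
traj = np.c_[np.cos(t), np.sin(t)] + 0.05 * rng.normal(size=(200, 2))

dgms = ripser(traj, maxdim=1)["dgms"]
h1 = dgms[1]                            # birth/death pairs of 1-dim holes
persistence = h1[:, 1] - h1[:, 0]
print("most persistent H1 feature:", persistence.max() if len(h1) else "none")
```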
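For item 3: a sketch of the PCA step, assuming each knot's Jones polynomial has been encoded as a fixed-length coefficient vector; the paper's encoding and filtration machinery are omitted, and the placeholder data below is random.

```python
# PCA on polynomial-coefficient vectors; on the real Jones data the paper
# reports that roughly three components capture most of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
coeff_vectors = rng.normal(size=(1000, 40))   # placeholder, not real knot data

pca = PCA(n_components=10).fit(coeff_vectors)
# Cumulative variance explained by the first three components
# (will be small here because the placeholder data is isotropic noise).
print(np.cumsum(pca.explained_variance_ratio_)[:3])
```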
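For item 4: a sketch comparing the two data reductions discussed there, random sampling versus k-means centroids, as inputs to persistent homology. It assumes `ripser` and scikit-learn; the point cloud and parameters are illustrative.

```python
import numpy as np
from ripser import ripser               # assumed available
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 5000)
# Noisy circle embedded in R^3: one dominant H1 feature to preserve.
cloud = np.c_[np.cos(theta), np.sin(theta), 0.1 * rng.normal(size=5000)]

m = 200  # reduced point-set size

# Reduction 1: random selection of points.
sample = cloud[rng.choice(len(cloud), size=m, replace=False)]

# Reduction 2: k-means cluster centroids as the reduced data.
centroids = KMeans(n_clusters=m, n_init=1, random_state=0).fit(cloud).cluster_centers_

for name, pts in [("random sample", sample), ("k-means centroids", centroids)]:
    h1 = ripser(pts, maxdim=1)["dgms"][1]
    longest = (h1[:, 1] - h1[:, 0]).max()
    print(name, "longest H1 bar:", round(longest, 3))  # both should see the circle
```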
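For item 5: a sketch of the lasso-regression idea, fitting a sparse linear model from J-map-derived features to an arrival-time correction. The feature names, shapes, and synthetic targets below are invented for illustration; the cited work's actual inputs come from its simulation/HI comparisons.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_events, n_features = 60, 12
jmap_features = rng.normal(size=(n_events, n_features))   # placeholder features
# Placeholder target: arrival-time error in hours with a sparse linear signal.
arrival_error_hr = jmap_features @ rng.normal(size=n_features) \
                   + rng.normal(scale=1.0, size=n_events)

# L1 penalty drives uninformative feature weights to exactly zero.
model = Lasso(alpha=0.1).fit(jmap_features, arrival_error_hr)
print("nonzero coefficients:", np.count_nonzero(model.coef_))
predicted_correction = model.predict(jmap_features[:1])   # correct one event
```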