Title: Probabilistic methods for approximate archetypal analysis
Abstract

Archetypal analysis (AA) is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of AA in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and the representation cardinality of the data, respectively. We prove that, provided the data are approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of AA. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of AA. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
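The two preprocessing ideas lend themselves to a compact illustration. Below is a minimal numpy sketch, not the paper's exact construction: dimension reduction via a Gaussian Johnson-Lindenstrauss projection, and cardinality reduction by keeping only points that are extreme along random directions (a cheap stand-in for a few-vertex approximation of the convex hull). The function names and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projection(X, k):
    """Project n points in R^d down to R^k with a Gaussian
    Johnson-Lindenstrauss map (a standard dimension-reduction step)."""
    G = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ G

def hull_vertex_subset(X, m):
    """Keep the points that maximize m random linear functionals --
    a cheap proxy for approximating conv(X) by a few vertices."""
    dirs = rng.normal(size=(m, X.shape[1]))
    idx = np.unique(np.argmax(X @ dirs.T, axis=0))
    return X[idx]

# toy usage: 10,000 points lying near a 5-dimensional subspace of R^200
X = rng.normal(size=(10_000, 5)) @ rng.normal(size=(5, 200))
X_small = hull_vertex_subset(random_projection(X, 20), 500)
print(X_small.shape)  # far fewer points and dimensions than X
```

After the reduction, any off-the-shelf AA solver can be run on the much smaller point set, with the resulting archetypes interpreted in the reduced space.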

 
Award ID(s): 1752202, 2136198
NSF-PAR ID: 10367051
Author(s) / Creator(s):
Publisher / Repository: Oxford University Press
Date Published:
Journal Name: Information and Inference: A Journal of the IMA
Volume: 12
Issue: 1
ISSN: 2049-8772
Page Range / eLocation ID: p. 466-493
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Rationale

    Nitrogen isotopic compositions (δ15N) of source and trophic amino acids (AAs) are crucial tracers of N sources and trophic enrichments in diverse fields, including archeology, astrobiochemistry, ecology, oceanography, and the paleo-sciences. The current analytical technique using gas chromatography-combustion-isotope ratio mass spectrometry (GC/C/IRMS) requires derivatization, which is not compatible with some key AAs. Another approach using high-performance liquid chromatography-elemental analyzer-IRMS (HPLC/EA/IRMS) may experience coelution with other compounds in certain types of samples, and the highly sensitive nano-EA/IRMS instrumentation is not widely available.

    Methods

    We present a method for high‐precision δ15N measurements of AAs (δ15N‐AA) optimized for canonical source AA‐phenylalanine (Phe) and trophic AA‐glutamic acid (Glu). This offline approach entails purification and separation via high‐pressure ion‐exchange chromatography (IC) with automated fraction collection, the sequential chemical conversion of AA to nitrite and then to nitrous oxide (N2O), and the final determination of δ15N of the produced N2O via purge‐and‐trap continuous‐flow isotope ratio mass spectrometry (PT/CF/IRMS).

    Results

    The cross-plots of δ15N of Glu and Phe standards (four different natural-abundance levels) generated by this method against their accepted values have a linear regression slope of 1 and small intercepts, demonstrating high accuracy. Precisions were 0.36‰–0.67‰ for the Phe standards and 0.27‰–0.35‰ for the Glu standards. Our method and the GC/C/IRMS approach produced equivalent δ15N values, within error, for two lab standards (McCarthy Lab AA mixture and cyanobacteria). We further tested our method on a wide range of natural sample matrices and obtained reasonable results.

    Conclusions

    Our method provides a reliable alternative to the current methods for δ15N-AA measurement, as IC- or HPLC-based techniques that can collect underivatized AAs are widely available. Our chemical approach, which converts AAs to N2O, can be easily implemented in laboratories currently analyzing δ15N of N2O using PT/CF/IRMS. This method will help promote the use of δ15N-AA in important studies of N cycling and trophic ecology across a wide range of research areas.

     
  2. ABSTRACT

    We compare how several forms of multicriteria decision analysis (MCDA) can enhance the practice of alternatives assessment (AA). We report on a workshop in which 12 practitioners from US corporations, government agencies, NGOs, and consulting organizations applied different MCDA techniques to 3 AA case studies to understand how they improved the decision process. Participants were asked to select a preferred alternative in each case using a different decision-analysis approach: their unaided decision-making method, individual or lightly facilitated group multiattribute value theory (MAVT), and more extensively facilitated group structured decision making (SDM). Surveys conducted after each exercise revealed that participants were positive toward the use of formal decision-making methods for AA, reporting meaningful increases in their understanding of the trade-offs involved and of their own values. Participants also reported challenges with each approach. While the MCDA techniques were reported to enhance transparency and communication, they did not consistently lead to higher satisfaction with a decision and/or outcome, and they were not more likely to be adopted within participants' organizations than unaided approaches. More formal decision-making methods show promise in the context of AA, but practitioners will need more guidance to use such tools successfully. Practitioners will also need to define what constitutes "success"; different approaches may be called for depending on whether the objective is increased understanding, satisfaction with the outcome, satisfaction with the process, or something else. Integr Environ Assess Manag 2021;17:27–41. © 2020 SETAC
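    At its core, the MAVT step reduces to an additive aggregation of weighted criterion scores. The sketch below is a minimal illustration of that aggregation with entirely hypothetical alternatives, criteria, scores, and weights; it is not the facilitation protocol used in the study.

```python
import numpy as np

# hypothetical AA example: 3 chemical alternatives scored on 3 criteria,
# each criterion already normalized to [0, 1] (higher = better)
criteria = ["toxicity", "performance", "cost"]
scores = np.array([
    [0.9, 0.6, 0.4],   # alternative A
    [0.5, 0.9, 0.7],   # alternative B
    [0.7, 0.7, 0.9],   # alternative C
])
weights = np.array([0.5, 0.3, 0.2])  # elicited from stakeholders; sum to 1

overall = scores @ weights           # additive MAVT aggregation
best = ["A", "B", "C"][int(np.argmax(overall))]
print(dict(zip("ABC", overall.round(3))), "->", best)
```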

     
  3. Abstract

    The arrival-time prediction of coronal mass ejections (CMEs) is an area of active research. Many methods with varying levels of complexity have been developed to predict CME arrival. However, the mean absolute error (MAE) of predictions remains above 12 hr, even with the increasing complexity of methods. In this work we develop a new method for CME arrival-time prediction that uses magnetohydrodynamic simulations involving data-constrained flux-rope-based CMEs, which are introduced into a data-driven solar wind background. We found that for the six CMEs studied in this work the MAE in arrival time was ∼8 hr. We further improved our arrival-time predictions by using ensemble modeling and comparing the ensemble solutions with STEREO-A and STEREO-B heliospheric imager (HI) data. This was done by using our simulations to create synthetic J-maps. A machine-learning (ML) method, lasso regression, was used for this comparison. Using this approach, we could reduce the MAE to ∼4 hr. Another ML method, based on neural networks (NNs), made it possible to reduce the MAE to ∼5 hr for cases in which HI data from both STEREO-A and STEREO-B were available. The NN approach provides a similar MAE when only STEREO-A data are used. Our methods also yielded encouraging standard deviations (precision) in arrival time. The methods discussed in this paper demonstrate significant improvements in CME arrival-time predictions. Our work highlights the importance of combining ML techniques with data-constrained magnetohydrodynamic modeling to improve space weather predictions.
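    As a rough illustration of the ensemble-weighting idea, the sketch below uses scikit-learn's lasso to assign sparse, non-negative weights to synthetic ensemble tracks so that their combination reproduces an "observed" time-elongation profile, then forms an arrival-time estimate from the weighted members. All data here are synthetic stand-ins; this is not the authors' pipeline.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)

# stand-ins: time-elongation tracks from synthetic J-maps for 50 ensemble
# members (columns) at 120 times (rows), plus an observed HI track
tracks = np.cumsum(rng.random((120, 50)), axis=0)      # ensemble tracks
truth_w = rng.dirichlet(np.ones(50))                   # unknown mixture
observed = tracks @ truth_w + rng.normal(0, 0.1, 120)  # "observed" track

# lasso picks sparse weights over members that reproduce the observation
model = Lasso(alpha=0.01, positive=True, fit_intercept=False)
model.fit(tracks, observed)
w = model.coef_ / model.coef_.sum()

# arrival-time estimate: weight each member's simulated arrival time
arrival_times = rng.normal(90.0, 6.0, 50)  # hours after eruption (synthetic)
print(f"predicted arrival: {arrival_times @ w:.1f} hr")
```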

     
  4. Abstract

    Landmark‐based geometric morphometrics has emerged as an essential discipline for the quantitative analysis of size and shape in ecology and evolution. With the ever‐increasing density of digitized landmarks, the possible development of a fully automated method of landmark placement has attracted considerable attention. Despite the recent progress in image registration techniques, which could provide a pathway to automation, three‐dimensional (3D) morphometric data are still mainly gathered by trained experts. For the most part, the large infrastructure requirements necessary to perform image‐based registration, together with its system specificity and its overall speed, have prevented its wide dissemination.

    Here, we propose and implement a general and lightweight point cloud-based approach to automatically collect high-dimensional landmark data on 3D surfaces (Automated Landmarking through Point cloud Alignment and Correspondence Analysis, or ALPACA). Our framework possesses several advantages compared with image-based approaches. First, it offers comparable landmarking accuracy, despite relying on a single, random reference specimen and much sparser sampling of the structure's surface. Second, it can be run efficiently on consumer-grade personal computers. Finally, it is general and can be applied at the intraspecific level to any biological structure of interest, regardless of whether anatomical atlases are available.
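    To make the correspondence idea concrete, here is a bare-bones numpy/scipy sketch: rigidly align the reference cloud to a target with a minimal ICP loop, then transfer each reference landmark to its nearest target point. This is only a toy version of the concept; the actual ALPACA pipeline also includes deformable registration steps, and all names below are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_landmarks(ref_pts, ref_landmarks, target_pts, iters=20):
    """Toy correspondence transfer: align ref_pts to target_pts with a
    bare-bones ICP loop, then map each landmark to its nearest target point."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(target_pts)
    src = ref_pts.copy()
    for _ in range(iters):
        _, idx = tree.query(src)                  # nearest-neighbor matches
        matched = target_pts[idx]
        # best-fit rotation/translation (Kabsch) for the current matches
        mu_s, mu_m = src.mean(0), matched.mean(0)
        U, _, Vt = np.linalg.svd((src - mu_s).T @ (matched - mu_m))
        S = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ S @ U.T
        src = (src - mu_s) @ R_step.T + mu_m
        R, t = R_step @ R, R_step @ (t - mu_s) + mu_m
    moved = ref_landmarks @ R.T + t               # landmarks follow the fit
    _, idx = tree.query(moved)
    return target_pts[idx]

# toy usage: the target is a rotated, shifted copy of the reference
rng = np.random.default_rng(3)
ref = rng.normal(size=(2000, 3))
a = 0.3
Rz = np.array([[np.cos(a), -np.sin(a), 0], [np.sin(a), np.cos(a), 0], [0, 0, 1]])
target = ref @ Rz.T + np.array([1.0, 0.0, 0.0])
print(transfer_landmarks(ref, ref[:5], target)[:2])
```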

    Our validation procedures indicate that the method can recover intraspecific patterns of morphological variation that are largely comparable to those obtained by manual digitization, indicating that the use of an automated landmarking approach should not result in different conclusions regarding the nature of multivariate patterns of morphological variation.

    The proposed point cloud‐based approach has the potential to increase the scale and reproducibility of morphometrics research. To allow ALPACA to be used out‐of‐the‐box by users with no prior programming experience, we implemented it as a SlicerMorph module. SlicerMorph is an extension that enables geometric morphometrics data collection and 3D specimen analysis within the open‐source 3D Slicer biomedical visualization ecosystem. We expect that convenient access to this platform will make ALPACA broadly applicable within ecology and evolution.

     
    An emerging method for data analysis is called Topological Data Analysis (TDA). TDA is based in the mathematical field of topology and examines the properties of spaces under continuous deformation. One of the key tools used for TDA is persistent homology, which considers the connectivity of points in a d-dimensional point cloud at different spatial resolutions to identify topological properties (holes, loops, and voids) in the space. Persistent homology then classifies the topological features by their persistence through the range of spatial connectivity.

    Unfortunately, the memory and run-time complexity of computing persistent homology is exponential, and current tools can only process a few thousand points in ℝ³. Fortunately, the use of data reduction techniques enables persistent homology to be applied to much larger point clouds. Techniques to reduce the data range from random sampling of points to clustering the data and using the cluster centroids as the reduced data. While several data reduction approaches appear to preserve the large topological features present in the original point cloud, no systematic study comparing the efficacy of different data clustering techniques in preserving the persistent homology results has been performed.

    This paper explores the question of topology-preserving data reductions and describes formally when and how topological features can be mischaracterized or lost by data reduction techniques. The paper also performs an experimental assessment of data reduction techniques and their effects on the persistent homology results. In particular, data reduction by random selection is compared with cluster centroids extracted from different data clustering algorithms.
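    A minimal sketch of the kind of comparison the paper describes, assuming the third-party ripser package (pip install ripser) for persistent homology and scikit-learn for clustering: reduce a noisy circle both by random sampling and by k-means centroids, then check how long the dominant H1 (loop) feature persists under each reduction.

```python
import numpy as np
from sklearn.cluster import KMeans
from ripser import ripser  # assumed installed: pip install ripser

rng = np.random.default_rng(2)

# noisy circle: one long-lived H1 loop should survive any good reduction
theta = rng.uniform(0, 2 * np.pi, 5000)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (5000, 2))

# reduction 1: random sampling of 200 points
sample = X[rng.choice(len(X), 200, replace=False)]
# reduction 2: 200 k-means cluster centroids
centroids = KMeans(n_clusters=200, n_init=10, random_state=0).fit(X).cluster_centers_

for name, pts in [("random", sample), ("centroids", centroids)]:
    dgm1 = ripser(pts, maxdim=1)["dgms"][1]  # H1 persistence diagram
    life = dgm1[:, 1] - dgm1[:, 0]
    print(name, "longest H1 bar:", life.max().round(3))
```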