

Title: Inferring End‐Members From Geoscience Data Using Simplex Projected Gradient Descent‐Archetypal Analysis
Abstract End‐member mixing analysis (EMMA) is widely used to analyze geoscience data for their end‐members and mixing proportions. Many traditional EMMA methods depend on known end‐members, which are sometimes uncertain or unknown. Unsupervised EMMA methods infer end‐members from data, but many existing ones do not strictly enforce the necessary constraints and lack full mathematical interpretability. Here, we introduce a novel unsupervised machine learning method, simplex projected gradient descent‐archetypal analysis (SPGD‐AA), which uses the ML model archetypal analysis to infer end‐members intuitively and interpretably without prior knowledge. SPGD‐AA uses extreme corners in data as end‐members or “archetypes,” and represents data as mixtures of end‐members. This method is most suitable for linear (conservative) mixing problems when samples with similar characteristics to end‐members are present in the data. Validation on synthetic and real data sets, including river chemistry, deep‐sea sediment elemental composition, and hyperspectral imaging, shows that SPGD‐AA effectively recovers end‐members consistent with domain expertise and outperforms conventional approaches. SPGD‐AA is applicable to a wide range of geoscience data sets and beyond.
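The core idea the abstract describes — representing each sample as a convex mixture of archetypes that are themselves convex combinations of samples, with gradient steps projected back onto the probability simplex — can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the general technique, not the authors' SPGD‐AA code; the step sizes and initialization are assumptions.

```python
import numpy as np

def project_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex
    (the standard sorting-based algorithm)."""
    U = np.sort(V, axis=1)[:, ::-1]                 # sort each row descending
    css = np.cumsum(U, axis=1)
    k = np.arange(1, V.shape[1] + 1)
    rho = np.sum(U + (1.0 - css) / k > 0, axis=1)   # size of the support
    theta = (css[np.arange(len(V)), rho - 1] - 1.0) / rho
    return np.maximum(V - theta[:, None], 0.0)

def archetypal_analysis(X, n_archetypes, n_iter=500, seed=0):
    """Fit X ~ A @ Z with Z = B @ X, rows of A and B on the simplex,
    by alternating projected gradient steps with 1/L step sizes."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    A = project_simplex(rng.random((n, n_archetypes)))   # mixing proportions
    B = project_simplex(rng.random((n_archetypes, n)))   # archetype weights
    losses = []
    for _ in range(n_iter):
        Z = B @ X                        # archetypes = convex combos of samples
        losses.append(float(np.sum((A @ Z - X) ** 2)))
        # gradient step in A, projected back onto the simplex
        R = A @ Z - X
        A = project_simplex(A - (R @ Z.T) / (np.sum(Z * Z) + 1e-12))
        # gradient step in B, projected back onto the simplex
        R = A @ (B @ X) - X
        B = project_simplex(B - (A.T @ R @ X.T) /
                            (np.sum(A * A) * np.sum(X * X) + 1e-12))
    return A, B @ X, losses
```

On synthetic data generated as convex mixtures of a few known vertices, the recovered archetypes approach the extreme corners of the data cloud, and the mixing proportions in `A` stay on the simplex by construction.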
Award ID(s):
2209866 2209863 2209865 2209864
PAR ID:
10586549
Author(s) / Creator(s):
Publisher / Repository:
DOI PREFIX: 10.1029
Date Published:
Journal Name:
Journal of Geophysical Research: Machine Learning and Computation
Volume:
2
Issue:
2
ISSN:
2993-5210
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. End-member mixing analysis (EMMA) is a method of interpreting stream water chemistry variations and is widely used for chemical hydrograph separation. It is based on the assumption that stream water is a conservative mixture of varying contributions from well-characterized source solutions (end-members). These end-members are typically identified by collecting samples of potential end-member source waters from within the watershed and comparing these to the observations. Here we introduce a complementary data-driven method (convex hull end-member mixing analysis – CHEMMA) to infer the end-member compositions and their associated uncertainties from the stream water observations alone. The method involves two steps. The first uses convex hull nonnegative matrix factorization (CH-NMF) to infer possible end-member compositions by searching for a simplex that optimally encloses the stream water observations. The second step uses constrained K-means clustering (COP-KMEANS) to classify the results from repeated applications of CH-NMF and analyzes the uncertainty associated with the algorithm. In an example application utilizing the 1986 to 1988 Panola Mountain Research Watershed dataset, CHEMMA is able to robustly reproduce the three field-measured end-members found in previous research using only the stream water chemical observations. CHEMMA also suggests that a fourth and a fifth end-member can be (less robustly) identified. We examine uncertainties in end-member identification arising from the non-uniqueness of the CH-NMF solutions (which is related to the data structure) and from the number of samples, using both real and synthetic data. The results suggest that the mixing space can be identified robustly when the dataset includes samples that contain extremely small contributions of one end-member; samples containing extremely large contributions from one end-member are not necessary but do reduce uncertainty about the end-member composition.
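When the end-members are already known — the traditional EMMA setting that both abstracts contrast with — the mixing proportions for a stream sample follow from a small linear system: one mass-balance row plus one row per conservative tracer. A minimal sketch with hypothetical tracer concentrations (the numbers are illustrative, not from either study):

```python
import numpy as np

# Hypothetical concentrations of two conservative tracers (columns)
# in three end-members (rows); illustrative values only.
E = np.array([
    [1.0, 120.0],   # e.g. groundwater
    [6.0,  20.0],   # e.g. soil water
    [3.0,  60.0],   # e.g. overland flow
])
stream = np.array([3.0, 70.0])      # observed stream-water concentrations

# Rows of the system: mass balance (fractions sum to 1), then one per tracer.
M = np.vstack([np.ones(3), E.T])
b = np.concatenate([[1.0], stream])
fractions = np.linalg.solve(M, b)   # -> [0.3, 0.2, 0.5]
```

With more tracers than end-members the system becomes overdetermined and is usually solved by least squares, ideally with the fractions constrained to be nonnegative; physically meaningful solutions land in [0, 1].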
  2. Abstract Archetypal analysis (AA) is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of AA in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques to reduce the dimension and representation cardinality of the data, respectively. We prove that provided data are approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of AA. Moreover, the solution of the reduced problem is near-optimal in terms of prediction errors. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of AA. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets. 
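The first preprocessing technique described — checking whether the data approximately live in a low-dimensional linear subspace and reducing to it before running AA — is, in its simplest form, a truncated SVD. A sketch; the 0.99 energy threshold is an arbitrary choice for illustration:

```python
import numpy as np

def reduce_dimension(X, energy=0.99):
    """Project centered data onto the fewest leading right singular
    vectors that capture the requested fraction of squared energy."""
    mu = X.mean(axis=0)
    Xc = X - mu
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(frac, energy)) + 1   # smallest r reaching the threshold
    return Xc @ Vt[:r].T, Vt[:r], mu             # reduced data, basis, mean
```

Archetypal analysis then runs in the r-dimensional coordinates, and recovered archetypes map back to the original space via `Z_reduced @ basis + mu`; by construction the discarded squared energy is at most `1 - energy`.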
  3. Archetypal analysis (AA) is a versatile data analysis method to cluster distinct features within a data set. Here, we demonstrate a framework showing the power of AA to spatio-temporally resolve events in calcium imaging, an imaging modality commonly used in neurobiology and neuroscience to capture neuronal communication patterns. After validation of our AA-based approach on synthetic data sets, we were able to characterize neuronal communication patterns in recorded calcium waves. Clinical relevance: Transient calcium events play an essential role in brain cell communication, growth, and network formation, as well as in neurodegeneration. To reliably interpret calcium events from personalized medicine data, where patterns may differ from patient to patient, appropriate image processing and signal analysis methods need to be developed for optimal network characterization.
  4. Abstract Zooplankton contribute a major component of the vertical flux of particulate organic matter to the ocean interior by packaging consumed food and waste into large, dense fecal pellets that sink quickly. Existing methods for quantifying the contribution of fecal pellets to particulate organic matter use either visual identification or lipid biomarkers, but these methods may exclude fecal material that is not morphologically distinct, or may include zooplankton carcasses in addition to fecal pellets. Based on results from seven pairs of wild‐caught zooplankton and their fecal pellets, we assess the ability of compound‐specific isotope analysis of amino acids (CSIA‐AA) to chemically distinguish fecal pellets as an end‐member material within particulate organic matter. Nitrogen CSIA‐AA is an improvement on previous uses of bulk stable isotope ratios, which cannot distinguish between differences in baseline isotope ratios and fractionation due to metabolic processing. We suggest that the relative trophic position of zooplankton and their fecal pellets, as calculated using CSIA‐AA, can provide a metric for estimating the dietary absorption efficiency of zooplankton. Using this metric, the zooplankton examined here had widely ranging dietary absorption efficiencies, where lower dietary absorption may equate to higher proportions of fecal packaging of undigested material. The nitrogen isotope ratios of threonine and alanine statistically distinguished the zooplankton fecal pellets from literature‐derived examples of phytoplankton, zooplankton biomass, and microbially degraded organic matter. We suggest that δ15N values of threonine and alanine could be used in mixing models to quantify the contribution of fecal pellets to particulate organic matter. 
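The proposed use of amino-acid δ15N values in a mixing model reduces, in the simplest two-end-member case, to linear unmixing of the sample value between the two end-member values. The numbers below are hypothetical placeholders, not measurements from the study:

```python
# Hypothetical amino-acid delta-15N values (per mil), illustration only.
d15n_fecal = 10.0    # assumed fecal-pellet end-member (e.g. threonine)
d15n_phyto = 2.0     # assumed phytoplankton end-member
d15n_sample = 5.2    # assumed particulate organic matter sample

# Fraction of the sample attributable to fecal material under linear mixing.
f_fecal = (d15n_sample - d15n_phyto) / (d15n_fecal - d15n_phyto)
# -> 0.4
```

With two amino acids (threonine and alanine) and more than two end-members, this becomes the same simplex-constrained linear system used in end-member mixing analysis more generally.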
  5. Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein “sentences”? Our analysis suggests that some unsupervised methods perform as well or better than existing supervised methods. Unsupervised methods are also faster and can, thus, be useful in large-scale variant evaluations. For all other methods, however, their performance varies by both evaluation metrics and by the type of variant effect being predicted. We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins where unsupervised methods hold the most promise.