
Title: Probabilistic methods for approximate archetypal analysis

Archetypal analysis (AA) is an unsupervised learning method for exploratory data analysis. One major challenge that limits the applicability of AA in practice is the inherent computational complexity of the existing algorithms. In this paper, we provide a novel approximation approach to partially address this issue. Utilizing probabilistic ideas from high-dimensional geometry, we introduce two preprocessing techniques that reduce the dimension and the representation cardinality of the data, respectively. We prove that, provided the data are approximately embedded in a low-dimensional linear subspace and the convex hull of the corresponding representations is well approximated by a polytope with a few vertices, our method can effectively reduce the scaling of AA. Moreover, the solution of the reduced problem is near-optimal in terms of prediction error. Our approach can be combined with other acceleration techniques to further mitigate the intrinsic complexity of AA. We demonstrate the usefulness of our results by applying our method to summarize several moderately large-scale datasets.
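The paper's algorithms are not given in this abstract; as a hedged, minimal sketch of the dimension-reduction ingredient it builds on (a Johnson-Lindenstrauss-style Gaussian random projection; the function name and sizes below are our own illustration, not the authors' code):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project n points in R^d down to R^k with a Gaussian random map.

    By the Johnson-Lindenstrauss lemma, pairwise distances are
    approximately preserved when k is on the order of log(n)/eps^2,
    so convex-hull structure survives the reduction approximately.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Scale so that E[||x G||^2] = ||x||^2 for each row x.
    G = rng.standard_normal((d, k)) / np.sqrt(k)
    return X @ G

# Toy usage: 1000 points in 500 dimensions reduced to 50.
X = np.random.default_rng(1).standard_normal((1000, 500))
Y = random_projection(X, k=50)
print(Y.shape)  # (1000, 50)
```

AA would then be run on the reduced matrix, with the learned archetypes mapped back to the original coordinates for interpretation.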

Journal Name: Information and Inference: A Journal of the IMA
Page Range: p. 466-493
Publisher: Oxford University Press
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    The arrival time prediction of coronal mass ejections (CMEs) is an area of active research. Many methods with varying levels of complexity have been developed to predict CME arrival. However, the mean absolute error (MAE) of predictions remains above 12 hr, even with the increasing complexity of methods. In this work we develop a new method for CME arrival time prediction that uses magnetohydrodynamic simulations involving data-constrained flux-rope-based CMEs, which are introduced into a data-driven solar wind background. We found that for the six CMEs studied in this work the MAE in arrival time was ∼8 hr. We further improved our arrival time predictions by using ensemble modeling and comparing the ensemble solutions with STEREO-A and STEREO-B heliospheric imager (HI) data. This was done by using our simulations to create synthetic J-maps. A machine-learning (ML) method, lasso regression, was used for this comparison. Using this approach, we could reduce the MAE to ∼4 hr. Another ML method, based on neural networks (NNs), made it possible to reduce the MAE to ∼5 hr for the cases when HI data from both STEREO-A and STEREO-B were available. NNs are capable of providing a similar MAE when only the STEREO-A data are used. Our methods also resulted in very encouraging values of the standard deviation (precision) of arrival time. The methods discussed in this paper demonstrate significant improvements in CME arrival time predictions. Our work highlights the importance of using ML techniques in combination with data-constrained magnetohydrodynamic modeling to improve space weather predictions.
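The abstract does not specify the lasso implementation; as a plain-NumPy sketch of the generic technique (ISTA, iterative soft-thresholding for L1-penalized least squares), where the toy data standing in for the paper's J-map features are entirely hypothetical:

```python
import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by ISTA.

    The L1 penalty drives most coefficients exactly to zero,
    selecting a sparse subset of features.
    """
    n, m = X.shape
    w = np.zeros(m)
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n        # gradient of the smooth part
        w = w - step * grad
        # Proximal step: soft-thresholding at step * alpha.
        w = np.sign(w) * np.maximum(np.abs(w) - step * alpha, 0.0)
    return w

# Toy usage: only the first feature carries signal.
w = lasso_ista(np.eye(4), np.array([2.0, 0.0, 0.0, 0.0]), alpha=0.1)
print(w)  # approximately [1.6, 0, 0, 0]
```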

  2. Abstract

    Non-invasive and label-free spectral microscopy (spectromicroscopy) techniques can provide quantitative biochemical information complementary to genomic sequencing, transcriptomic profiling, and proteomic analyses. However, spectromicroscopy techniques generate high-dimensional data; acquisition of a single spectral image can range from tens of minutes to hours, depending on the desired spatial resolution and the image size. This substantially limits the timescales of observable transient biological processes. To address this challenge and move spectromicroscopy towards efficient real-time spatiochemical imaging, we developed a grid-less autonomous adaptive sampling method. Our method substantially decreases image acquisition time while increasing sampling density in regions of steeper physico-chemical gradients. When implemented with scanning Fourier Transform infrared spectromicroscopy experiments, this grid-less adaptive sampling approach outperformed standard uniform grid sampling in a two-component chemical model system and in a complex biological sample, Caenorhabditis elegans. We quantitatively and qualitatively assess the efficiency of data acquisition using performance metrics and multivariate infrared spectral analysis, respectively.
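The authors' sampler is not described in enough detail here to reproduce; as a 1-D caricature of the idea only (all names and the test function below are ours), an adaptive sampler can place each new measurement inside the interval where the signal currently changes fastest:

```python
import numpy as np

def adaptive_sample(f, lo, hi, n_init=5, n_total=20):
    """Greedy 1-D adaptive sampling: bisect the interval with the
    largest observed jump in f, so sampling density grows where
    the (estimated) gradient is steepest."""
    xs = list(np.linspace(lo, hi, n_init))
    ys = [f(x) for x in xs]
    while len(xs) < n_total:
        order = np.argsort(xs)
        xs_s, ys_s = np.array(xs)[order], np.array(ys)[order]
        i = int(np.argmax(np.abs(np.diff(ys_s))))  # steepest interval
        x_new = 0.5 * (xs_s[i] + xs_s[i + 1])
        xs.append(float(x_new))
        ys.append(f(x_new))
    return np.sort(np.array(xs))

# Toy usage: a sharp transition at x = 0 attracts most of the samples.
pts = adaptive_sample(lambda x: float(np.tanh(20 * x)), -1.0, 1.0)
```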

  3. Abstract

    Dimensionless numbers and scaling laws provide elegant insights into the characteristic properties of physical systems. Classical dimensional analysis and similitude theory fail to identify a set of unique dimensionless numbers for a highly multi-variable system with incomplete governing equations. This paper introduces a mechanistic data-driven approach that embeds the principle of dimensional invariance into a two-level machine learning scheme to automatically discover dominant dimensionless numbers and governing laws (including scaling laws and differential equations) from scarce measurement data. The proposed methodology, called dimensionless learning, is a physics-based dimension reduction technique. It can reduce high-dimensional parameter spaces to descriptions involving only a few physically interpretable dimensionless parameters, greatly simplifying complex process design and system optimization. We demonstrate the algorithm by solving several challenging engineering problems with noisy experimental measurements (not synthetic data) collected from the literature. Examples include turbulent Rayleigh-Bénard convection, vapor depression dynamics in laser melting of metals, and porosity formation in 3D printing. Lastly, we show that the proposed approach can identify dimensionally homogeneous differential equations with dimensionless number(s) by leveraging sparsity-promoting techniques.
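The two-level scheme itself is not reproduced here; its classical ingredient, recovering dimensionless groups from the null space of the dimension matrix (the Buckingham pi theorem), can be sketched as follows, with the example variables chosen by us:

```python
import numpy as np

def pi_groups(dim_matrix):
    """Basis of dimensionless exponent vectors (pi groups).

    Columns of dim_matrix are variables, rows are base dimensions
    (e.g. M, L, T).  A null-space vector k means prod(vars**k)
    is dimensionless (Buckingham pi theorem).
    """
    A = np.asarray(dim_matrix, dtype=float)
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > 1e-10))
    return vt[rank:].T  # one column per independent pi group

# Example: velocity v = [L T^-1], length l = [L], viscosity nu = [L^2 T^-1].
# Rows are L and T; the single pi group is v*l/nu, the Reynolds number.
A = np.array([[1.0, 1.0, 2.0],
              [-1.0, 0.0, -1.0]])
k = pi_groups(A)[:, 0]
print(k / k[0])  # proportional to [1, 1, -1]
```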

  4. Abstract

    We develop a new heavy quark transport model, QLBT, to simulate the dynamical propagation of heavy quarks inside the quark-gluon plasma (QGP) created in relativistic heavy-ion collisions. Our QLBT model is based on the linear Boltzmann transport (LBT) model with the ideal QGP replaced by a collection of quasi-particles to account for the non-perturbative interactions among quarks and gluons of the hot QGP. The thermal masses of quasi-particles are fitted to the equation of state from lattice QCD simulations using the Bayesian statistical analysis method. Combining QLBT with our advanced hybrid fragmentation-coalescence hadronization approach, we calculate the nuclear modification factor $R_\mathrm{AA}$ and the elliptic flow $v_2$ of $D$ mesons at the Relativistic Heavy-Ion Collider and the Large Hadron Collider. By comparing our QLBT calculation to the experimental data on the $D$ meson $R_\mathrm{AA}$ and $v_2$, we extract the heavy quark transport parameter $\hat{q}$ and diffusion coefficient $D_\mathrm{s}$ in the temperature range of $1$-$4~T_\mathrm{c}$, and compare them with the lattice QCD results and other phenomenological studies.

  5. Abstract

    Motivation

    Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide single nucleotide polymorphism (SNP) variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets. Linear mixed models (LMMs) have emerged as a key tool for heritability estimation, where the parameters of the LMMs, i.e. the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens.


    We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments estimator that has a runtime complexity of O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB/max(log₃N, log₃M)).
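The O(NMB) scaling has a standard interpretation: each of the B random probes needs only matrix-vector products with the N x M genotype matrix, never the N x N relatedness matrix itself. A hedged Hutchinson-style sketch of that ingredient (our illustration, not the authors' RHE-reg code):

```python
import numpy as np

def trace_K2_estimate(X, B=10, seed=0):
    """Estimate tr(K^2) for K = X X^T / M without forming K.

    Each probe z costs two multiplications by X, i.e. O(NM), so
    the total cost is O(NMB); E[z^T K^2 z] = tr(K^2) for
    Rademacher probes (Hutchinson's estimator).
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    total = 0.0
    for _ in range(B):
        z = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe
        Kz = X @ (X.T @ z) / m               # K @ z in O(NM)
        total += Kz @ Kz                     # z^T K^2 z
    return total / B
```

Such trace estimates feed the method-of-moments normal equations from which the variance components are solved.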

    We demonstrate the scalability and accuracy of our method on simulated as well as on empirical data. On standard hardware, our method computes heritability on a dataset of 500 000 individuals and 100 000 SNPs in 38 min.

    Availability and implementation

    The RHE-reg software is made freely available to the research community at:
