skip to main content

Title: DeepHull: Fast Convex Hull Approximation in High Dimensions
Computing or approximating the convex hull of a dataset plays a role in a wide range of applications, including economics, statistics, and physics, to name just a few. However, convex hull computation and approximation is exponentially complex, in terms of both memory and computation, as the ambient space dimension increases. In this paper, we propose DeepHull, a new convex hull approximation algorithm based on convex deep networks (DNs) with continuous piecewise-affine nonlinearities and nonnegative weights. The idea is that binary classification between true data samples and adversarially generated samples with such a DN naturally induces a polytope decision boundary that approximates the true data convex hull. A range of exploratory experiments demonstrates that DeepHull efficiently produces a meaningful convex hull approximation, even in a high-dimensional ambient space.
; ;
Award ID(s):
1911094 1838177 1730574
Publication Date:
Journal Name:
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. End-member mixing analysis (EMMA) is a method of interpreting stream water chemistry variations and is widely used for chemical hydrograph separation. It is based on the assumption that stream water is a conservative mixture of varying contributions from well-characterized source solutions (end-members). These end-members are typically identified by collecting samples of potential end-member source waters from within the watershed and comparing these to the observations. Here we introduce a complementary data-driven method (convex hull end-member mixing analysis – CHEMMA) to infer the end-member compositions and their associated uncertainties from the stream water observations alone. The method involves two steps. The first uses convex hull nonnegative matrix factorization (CH-NMF) to infer possible end-member compositions by searching for a simplex that optimally encloses the stream water observations. The second step uses constrained K-means clustering (COP-KMEANS) to classify the results from repeated applications of CH-NMF and analyzes the uncertainty associated with the algorithm. In an example application utilizing the 1986 to 1988 Panola Mountain Research Watershed dataset, CHEMMA is able to robustly reproduce the three field-measured end-members found in previous research using only the stream water chemical observations. CHEMMA also suggests that a fourth and a fifth end-member can be (less robustly) identified. We examine uncertainties inmore »end-member identification arising from non-uniqueness, which is related to the data structure, of the CH-NMF solutions, and from the number of samples using both real and synthetic data. The results suggest that the mixing space can be identified robustly when the dataset includes samples that contain extremely small contributions of one end-member, i.e., samples containing extremely large contributions from one end-member are not necessary but do reduce uncertainty about the end-member composition.« less
  2. Summary This paper is concerned with empirical likelihood inference on the population mean when the dimension $p$ and the sample size $n$ satisfy $p/n\rightarrow c\in [1,\infty)$. As shown in Tsao (2004), the empirical likelihood method fails with high probability when $p/n>1/2$ because the convex hull of the $n$ observations in $\mathbb{R}^p$ becomes too small to cover the true mean value. Moreover, when $p> n$, the sample covariance matrix becomes singular, and this results in the breakdown of the first sandwich approximation for the log empirical likelihood ratio. To deal with these two challenges, we propose a new strategy of adding two artificial data points to the observed data. We establish the asymptotic normality of the proposed empirical likelihood ratio test. The proposed test statistic does not involve the inverse of the sample covariance matrix. Furthermore, its form is explicit, so the test can easily be carried out with low computational cost. Our numerical comparison shows that the proposed test outperforms some existing tests for high-dimensional mean vectors in terms of power. We also illustrate the proposed procedure with an empirical analysis of stock data.
  3. For a set P of n points in the unit ball b ⊆ R d , consider the problem of finding a small subset T ⊆ P such that its convex-hull ε-approximates the convex-hull of the original set. Specifically, the Hausdorff distance between the convex hull of T and the convex hull of P should be at most ε. We present an efficient algorithm to compute such an ε ′ -approximation of size kalg, where ε ′ is a function of ε, and kalg is a function of the minimum size kopt of such an ε-approximation. Surprisingly, there is no dependence on the dimension d in either of the bounds. Furthermore, every point of P can be ε- approximated by a convex-combination of points of T that is O(1/ε2 )-sparse. Our result can be viewed as a method for sparse, convex autoencoding: approximately representing the data in a compact way using sparse combinations of a small subset T of the original data. The new algorithm can be kernelized, and it preserves sparsity in the original input.
  4. Summary

    The upper bounds on the coverage probabilities of the confidence regions based on blockwise empirical likelihood and non-standard expansive empirical likelihood methods for time series data are investigated via studying the probability of violating the convex hull constraint. The large sample bounds are derived on the basis of the pivotal limit of the blockwise empirical log-likelihood ratio obtained under fixed b asymptotics, which has recently been shown to provide a more accurate approximation to the finite sample distribution than the conventional χ2-approximation. Our theoretical and numerical findings suggest that both the finite sample and the large sample upper bounds for coverage probabilities are strictly less than 1 and the blockwise empirical likelihood confidence region can exhibit serious undercoverage when the dimension of moment conditions is moderate or large, the time series dependence is positively strong or the block size is large relative to the sample size. A similar finite sample coverage problem occurs for non-standard expansive empirical likelihood. To alleviate the coverage bound problem, we propose to penalize both empirical likelihood methods by relaxing the convex hull constraint. Numerical simulations and data illustrations demonstrate the effectiveness of our proposed remedies in terms of delivering confidence sets with more accuratemore »coverage. Some technical details and additional simulation results are included in on-line supplemental material.

    « less
  5. We develop a variant of the multinomial logit model with impatient customers and study assortment optimization and pricing problems under this choice model. In our choice model, a customer incrementally views the assortment of available products in multiple stages. The patience level of a customer determines the maximum number of stages in which the customer is willing to view the assortments of products. In each stage, if the product with the largest utility provides larger utility than a minimum acceptable utility, which we refer to as the utility of the outside option, then the customer purchases that product right away. Otherwise, the customer views the assortment of products in the next stage as long as the customer’s patience level allows the customer to do so. Under the assumption that the utilities have the Gumbel distribution and are independent, we give a closed-form expression for the choice probabilities. For the assortment-optimization problem, we develop a polynomial-time algorithm to find the revenue-maximizing sequence of assortments to offer. For the pricing problem, we show that, if the sequence of offered assortments is fixed, then we can solve a convex program to find the revenue-maximizing prices, with which the decision variables are the probabilitiesmore »that a customer reaches different stages. We build on this result to give a 0.878-approximation algorithm when both the sequence of assortments and the prices are decision variables. We consider the assortment-optimization problem when each product occupies some space and there is a constraint on the total space consumption of the offered products. We give a fully polynomial-time approximation scheme for this constrained problem. We use a data set from Expedia to demonstrate that incorporating patience levels, as in our model, can improve purchase predictions. We also check the practical performance of our approximation schemes in terms of both the quality of solutions and the computation times.« less