skip to main content


Title: A Mixed-Integer Fractional Optimization Approach to Best Subset Selection
We consider the best subset selection problem in linear regression—that is, finding a parsimonious subset of the regression variables that provides the best fit to the data according to some predefined criterion. We are primarily concerned with alternatives to cross-validation methods that do not require data partitioning and involve a range of information criteria extensively studied in the statistical literature. We show that the problem of interest can be modeled using fractional mixed-integer optimization, which can be tackled by leveraging recent advances in modern optimization solvers. The proposed algorithms involve solving a sequence of mixed-integer quadratic optimization problems (or their convexifications) and can be implemented with off-the-shelf solvers. We report encouraging results in our computational experiments, with respect to both the optimization and statistical performance. Summary of Contribution: This paper considers feature selection problems with information criteria. We show that by adopting a fractional optimization perspective (a well-known field in nonlinear optimization and operations research), it is possible to leverage recent advances in mixed-integer quadratic optimization technology to tackle traditional statistical problems long considered intractable. We present extensive computational experiments, with both synthetic and real data, illustrating that the new fractional optimization approach is orders of magnitude faster than existing approaches in the literature.  more » « less
Award ID(s):
1818700
NSF-PAR ID:
10289598
Author(s) / Creator(s):
;
Date Published:
Journal Name:
INFORMS Journal on Computing
ISSN:
1091-9856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The L 0 -regularized least squares problem (a.k.a. best subsets) is central to sparse statistical learning and has attracted significant attention across the wider statistics, machine learning, and optimization communities. Recent work has shown that modern mixed integer optimization (MIO) solvers can be used to address small to moderate instances of this problem. In spite of the usefulness of L 0 -based estimators and generic MIO solvers, there is a steep computational price to pay when compared with popular sparse learning algorithms (e.g., based on L 1 regularization). In this paper, we aim to push the frontiers of computation for a family of L 0 -regularized problems with additional convex penalties. We propose a new hierarchy of necessary optimality conditions for these problems. We develop fast algorithms, based on coordinate descent and local combinatorial optimization, that are guaranteed to converge to solutions satisfying these optimality conditions. From a statistical viewpoint, an interesting story emerges. When the signal strength is high, our combinatorial optimization algorithms have an edge in challenging statistical settings. When the signal is lower, pure L 0 benefits from additional convex regularization. We empirically demonstrate that our family of L 0 -based estimators can outperform the state-of-the-art sparse learning algorithms in terms of a combination of prediction, estimation, and variable selection metrics under various regimes (e.g., different signal strengths, feature correlations, number of samples and features). Our new open-source sparse learning toolkit L0Learn (available on CRAN and GitHub) reaches up to a threefold speedup (with p up to 10 6 ) when compared with competing toolkits such as glmnet and ncvreg. 
    more » « less
  2. null (Ed.)
    Under the linear regression framework, we study the variable selection problem when the underlying model is assumed to have a small number of nonzero coefficients. Non-convex penalties in speci c forms are well-studied in the literature for sparse estimation. A recent work, Ahn, Pang, and Xin (2017), has pointed out that nearly all existing non-convex penalties can be represented as difference-of-convex (DC) functions, which are the difference of two convex functions, while itself may not be convex. There is a large existing literature on optimization problems when their objectives and/or constraints involve DC functions. Efficient numerical solutions have been proposed. Under the DC framework, directional-stationary (d-stationary) solutions are considered, and they are usually not unique. In this paper, we show that under some mild conditions, a certain subset of d-stationary solutions in an optimization problem (with a DC objective) has some ideal statistical properties: namely, asymptotic estimation consistency, asymptotic model selection consistency, asymptotic efficiency. Our assumptions are either weaker than or comparable with those conditions that have been adopted in other existing works. This work shows that DC is a nice framework to offer a uni ed approach to these existing works where non-convex penalties are involved. Our work bridges the communities of optimization and statistics. 
    more » « less
  3. Space mission planning and spacecraft design are tightly coupled and need to be considered together for optimal performance; however, this integrated optimization problem results in a large-scale Mixed-Integer Nonlinear Programming (MINLP) problem, which is challenging to solve. In response to this challenge, this paper proposes a new solution approach to this MINLP problem by iterative solving a set of coupled subproblems via the augmented Lagrangian coordination approach following the philosophy of Multi-disciplinary Design Optimization (MDO). The proposed approach leverages the unique structure of the problem that enables its decomposition into a set of coupled subproblems of different types: a Mixed-Integer Quadratic Programming (MIQP) subproblem for mission planning and one or more Nonlinear Programming (NLP) subproblem(s) for spacecraft design. Since specialized MIQP or NLP solvers can be applied to each subproblem, the proposed approach can efficiently solve the otherwise intractable integrated MINLP problem. An automatic and effective method to find an initial solution for this iterative approach is also proposed so that the optimization can be performed without the need for a user-defined initial guess. In the demonstration case study, a human lunar exploration mission sequence is optimized with a subsystem-level parametric spacecraft design model. Compared to the state-of-the-art method, the proposed formulation can obtain a better solution with a shorter computational time even without parallelization. For larger problems, the proposed solution approach can also be easily parallelizable and thus is expected to be further advantageous and scalable. 
    more » « less
  4. We consider the high-dimensional linear regression problem, where the algorithmic goal is to efficiently infer an unknown feature vector $\beta^*\in\mathbb{R}^p$ from its linear measurements, using a small number $n$ of samples. Unlike most of the literature, we make no sparsity assumption on $\beta^*$, but instead adopt a different regularization: In the noiseless setting, we assume $\beta^*$ consists of entries, which are either rational numbers with a common denominator $Q\in\mathbb{Z}^+$ (referred to as $Q-$rationality); or irrational numbers taking values in a rationally independent set of bounded cardinality, known to learner; collectively called as the mixed-range assumption. Using a novel combination of the Partial Sum of Least Squares (PSLQ) integer relation detection, and the Lenstra-Lenstra-Lov\'asz (LLL) lattice basis reduction algorithms, we propose a polynomial-time algorithm which provably recovers a $\beta^*\in\mathbb{R}^p$ enjoying the mixed-range assumption, from its linear measurements $Y=X\beta^*\in\mathbb{R}^n$ for a large class of distributions for the random entries of $X$, even with one measurement ($n=1$). In the noisy setting, we propose a polynomial-time, lattice-based algorithm, which recovers a $\beta^*\in\mathbb{R}^p$ enjoying the $Q-$rationality property, from its noisy measurements $Y=X\beta^*+W\in\mathbb{R}^n$, even from a single sample ($n=1$). We further establish that for large $Q$, and normal noise, this algorithm tolerates information-theoretically optimal level of noise. We then apply these ideas to develop a polynomial-time, single-sample algorithm for the phase retrieval problem. Our methods address the single-sample ($n=1$) regime, where the sparsity-based methods such as the Least Absolute Shrinkage and Selection Operator (LASSO) and the Basis Pursuit are known to fail. Furthermore, our results also reveal algorithmic connections between the high-dimensional linear regression problem, and the integer relation detection, randomized subset-sum, and shortest vector problems. 
    more » « less
  5. Abstract

    This paper studies several solution paths of sparse quadratic minimization problems as a function of the weighing parameter of the bi-objective of estimation loss versus solution sparsity. Three such paths are considered: the “$$\ell _0$$0-path” where the discontinuous$$\ell _0$$0-function provides the exact sparsity count; the “$$\ell _1$$1-path” where the$$\ell _1$$1-function provides a convex surrogate of sparsity count; and the “capped$$\ell _1$$1-path” where the nonconvex nondifferentiable capped$$\ell _1$$1-function aims to enhance the$$\ell _1$$1-approximation. Serving different purposes, each of these three formulations is different from each other, both analytically and computationally. Our results deepen the understanding of (old and new) properties of the associated paths, highlight the pros, cons, and tradeoffs of these sparse optimization models, and provide numerical evidence to support the practical superiority of the capped$$\ell _1$$1-path. Our study of the capped$$\ell _1$$1-path is interesting in its own right as the path pertains to computable directionally stationary (= strongly locally minimizing in this context, as opposed to globally optimal) solutions of a parametric nonconvex nondifferentiable optimization problem. Motivated by classical parametric quadratic programming theory and reinforced by modern statistical learning studies, both casting an exponential perspective in fully describing such solution paths, we also aim to address the question of whether some of them can be fully traced in strongly polynomial time in the problem dimensions. A major conclusion of this paper is that a path of directional stationary solutions of the capped$$\ell _1$$1-regularized problem offers interesting theoretical properties and practical compromise between the$$\ell _0$$0-path and the$$\ell _1$$1-path. Indeed, while the$$\ell _0$$0-path is computationally prohibitive and greatly handicapped by the repeated solution of mixed-integer nonlinear programs, the quality of$$\ell _1$$1-path, in terms of the two criteria—loss and sparsity—in the estimation objective, is inferior to the capped$$\ell _1$$1-path; the latter can be obtained efficiently by a combination of a parametric pivoting-like scheme supplemented by an algorithm that takes advantage of the Z-matrix structure of the loss function.

     
    more » « less