- Editor(s): Meila, Marina
- Journal Name: Proceedings of the 38th International Conference on Machine Learning
- Medium: X
- Sponsoring Org: National Science Foundation
The L0-regularized least squares problem (a.k.a. best subsets) is central to sparse statistical learning and has attracted significant attention across the wider statistics, machine learning, and optimization communities. Recent work has shown that modern mixed integer optimization (MIO) solvers can be used to address small to moderate instances of this problem. In spite of the usefulness of L0-based estimators and generic MIO solvers, there is a steep computational price to pay when compared with popular sparse learning algorithms (e.g., based on L1 regularization). In this paper, we aim to push the frontiers of computation for a family of L0-regularized problems with additional convex penalties. We propose a new hierarchy of necessary optimality conditions for these problems. We develop fast algorithms, based on coordinate descent and local combinatorial optimization, that are guaranteed to converge to solutions satisfying these optimality conditions. From a statistical viewpoint, an interesting story emerges. When the signal strength is high, our combinatorial optimization algorithms have an edge in challenging statistical settings. When the signal is lower, pure L0 benefits from additional convex regularization. We empirically demonstrate that our family of L0-based estimators can outperform the state-of-the-art sparse learning algorithms in terms of a combination of prediction, estimation, and variable selection metrics under various regimes (e.g., different signal strengths, feature correlations, number of samples and features). Our new open-source sparse learning toolkit L0Learn (available on CRAN and GitHub) reaches up to a threefold speedup (with p up to 10^6) when compared with competing toolkits such as glmnet and ncvreg.
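To make the coordinate-descent component concrete, here is a minimal sketch of cyclic coordinate descent for the L0-plus-ridge objective 0.5*||y - Xb||^2 + lam0*||b||_0 + lam2*||b||^2, assuming unit-normalized columns. The coordinate update reduces to a hard-thresholding rule: the coordinate is kept only when its partial correlation exceeds sqrt(2*lam0*(1 + 2*lam2)). This is an illustrative sketch, not the L0Learn implementation, which additionally runs local combinatorial search:

```python
import numpy as np

def cd_l0l2(X, y, lam0, lam2, n_iter=100):
    """Cyclic coordinate descent for
        min_b 0.5*||y - X b||^2 + lam0*||b||_0 + lam2*||b||^2,
    assuming the columns of X have unit Euclidean norm."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                               # residual y - X b
    thresh = np.sqrt(2.0 * lam0 * (1.0 + 2.0 * lam2))
    for _ in range(n_iter):
        for j in range(p):
            c = X[:, j] @ r + b[j]             # partial correlation (unit columns)
            bj = c / (1.0 + 2.0 * lam2) if abs(c) > thresh else 0.0
            if bj != b[j]:
                r += X[:, j] * (b[j] - bj)     # keep the residual in sync
                b[j] = bj
    return b

# demo on synthetic noiseless data with a 2-sparse ground truth
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
X /= np.linalg.norm(X, axis=0)
b_true = np.zeros(10)
b_true[0], b_true[1] = 5.0, -4.0
y = X @ b_true
b_hat = cd_l0l2(X, y, lam0=0.5, lam2=1e-3)
```

Each update either zeroes a coordinate or sets it to its ridge-regularized least-squares value, so the objective never increases; this is the sense in which such algorithms converge to points satisfying coordinate-wise necessary optimality conditions.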
We investigate a data-driven approach to constructing uncertainty sets for robust optimization problems, where the uncertain problem parameters are modeled as random variables whose joint probability distribution is not known. Relying only on independent samples drawn from this distribution, we provide a nonparametric method to estimate uncertainty sets whose probability mass is guaranteed to approximate a given target mass within a given tolerance with high confidence. The nonparametric estimators that we consider are also shown to obey distribution-free finite-sample performance bounds that imply their convergence in probability to the given target mass. In addition to being efficient to compute, the proposed estimators result in uncertainty sets that yield computationally tractable robust optimization problems for a large family of constraint functions.
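As a simple illustration of the idea (not the paper's estimator, and without its finite-sample guarantees), one common nonparametric construction builds an axis-aligned box from per-coordinate empirical quantiles, splitting the allowed miss mass across the box faces so that roughly a target fraction of the sample mass lands inside:

```python
import numpy as np

def quantile_box(samples, target=0.9):
    """Axis-aligned uncertainty set [lo, hi] from empirical quantiles,
    aiming for roughly `target` coverage of the sampled mass
    (illustrative sketch; a union bound splits the miss mass
    evenly across the 2*d faces of the box)."""
    d = samples.shape[1]
    alpha = (1.0 - target) / (2.0 * d)
    lo = np.quantile(samples, alpha, axis=0)
    hi = np.quantile(samples, 1.0 - alpha, axis=0)
    return lo, hi

rng = np.random.default_rng(1)
S = rng.standard_normal((2000, 3))
lo, hi = quantile_box(S, target=0.9)
coverage = np.mean(np.all((S >= lo) & (S <= hi), axis=1))
```

A box-shaped set is attractive precisely for the tractability reason the abstract mentions: for linear constraints, the worst case over a box is attained at a corner, so the robust counterpart stays a linear program.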
Interest in magnetic fields on the ancient Earth and other planetary bodies has motivated the paleomagnetic analysis of complex rocks such as meteorites that carry heterogeneous magnetizations at <<1 mm scales. The net magnetic moment of natural remanent magnetization (NRM) in such small samples is often below the detection threshold of common cryogenic magnetometers. The quantum diamond microscope (QDM) is an emerging magnetic imaging technology with ~1 μm resolution and can, in principle, recover magnetizations as weak as 10^-17 Am^2. However, the typically 1–100 μm sample‐to‐sensor distance of QDM measurements can result in complex (nondipolar) magnetic field maps, from which the net magnetic moment cannot be determined using a simple algorithm. Here we generate synthetic magnetic field maps to quantify the errors introduced by sample nondipolarity and by map processing procedures such as upward continuation. We find that inversions based on least squares dipole fits of upward continued data can recover the net moment of complex samples with <5% to 10% error for maps with signal‐to‐noise ratio (SNR) in the range typical of current generation QDMs. We validate these error estimates experimentally using comparisons between QDM maps and between QDM and SQUID microscope data, concluding that, within the limitations described here, the QDM is a robust technique for recovering the net magnetic moment of weakly magnetized samples. More sophisticated net moment fitting algorithms in the future can be combined with upward continuation methods described here to improve accuracy.
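The core of a least squares dipole fit can be sketched as follows: for a dipole at a fixed, known position, the vertical field Bz sampled on a plane is linear in the moment vector, so the moment follows from ordinary least squares. This is a toy version under strong assumptions (single dipole, known location, no noise, no upward continuation); real inversions must also fit the source position:

```python
import numpy as np

MU0_4PI = 1e-7  # mu0 / (4*pi), in T*m/A

def dipole_bz(m, xs, ys, z):
    """Vertical field of a point dipole m (A*m^2) at the origin,
    sampled on an x-y grid at sensor height z (m)."""
    X, Y = np.meshgrid(xs, ys)
    r2 = X**2 + Y**2 + z**2
    mdotr = m[0] * X + m[1] * Y + m[2] * z
    return MU0_4PI * (3.0 * z * mdotr / r2**2.5 - m[2] / r2**1.5)

# synthetic map from a known moment over a 100 um field of view
xs = np.linspace(-50e-6, 50e-6, 41)
ys = np.linspace(-50e-6, 50e-6, 41)
z = 5e-6                                   # 5 um sample-to-sensor distance
m_true = np.array([2e-14, -1e-14, 3e-14])  # A*m^2
bz = dipole_bz(m_true, xs, ys, z)

# Bz is linear in m for a fixed source position, so build the design
# matrix from unit moments and solve by linear least squares.
A = np.column_stack([dipole_bz(e, xs, ys, z).ravel() for e in np.eye(3)])
m_fit, *_ = np.linalg.lstsq(A, bz.ravel(), rcond=None)
```

On noiseless synthetic data this inversion recovers the moment essentially exactly; the paper's error figures concern the harder case of nondipolar sources, finite SNR, and processed (upward continued) maps.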
Few studies have systematically investigated the effects of subsetting strategies on soil modelling or explored the potential of emergent methods from other fields not previously applied to pedometrics. This study considers smallholder agricultural villages in southern India that have been understudied in terms of chemometric modelling intended to support soil health, fertility and management. Therefore, the objective was to investigate the application of visible near‐infrared spectroscopy and chemometrics to predict soil properties in this setting. In addition, this study evaluated the effects of methods of calibration subsetting and new parametric models on the prediction of soil properties. These novel methods were transferred from the genomics field to soil science. Three strategic subsetting methods were used to produce calibration subsets that consider the variation in the soil properties, the spectra and both together; this is in addition to standard random calibration subsetting. Partial least squares regression (PLSR) and two methods from genomics that impose variable reduction were used for modelling; the latter were sparse PLSR (SPLSR) and the heteroscedastic effects model (HEM). Soil samples were collected from two villages and analysed for texture, soil carbon and available macro‐ and micro‐nutrients. The results showed that soil texture and carbon could be predicted moderately to strongly, whereas plant nutrient properties were predicted poorly to moderately. Random subsetting and subsetting by property distribution were more appropriate when spectra varied less overall, whereas subsetting that incorporates variation in spectra and properties improved results when spectral variation increased. The SPLSR and HEM models improved results over PLSR in some cases, or at least maintained prediction strength while using fewer predictors. Subsetting methods improved prediction results in 75% of cases. 
- This study filled an important research gap by systematically studying local subsetting behaviour under different degrees of spectral and attribute variation.
- Explored new calibration subsetting methods and chemometric models in soil spectral modelling.
- Compared the methods and models for 17 soil properties in an understudied area of India.
- Random subsetting was not always optimal; subsetting matters and depends on data characteristics.
- Sparse models from genomics performed better than a standard method in 75% of cases.
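To illustrate what spectra-aware calibration subsetting looks like in practice, here is a sketch of the standard Kennard-Stone algorithm, which spreads the calibration set over spectral space by repeatedly adding the sample farthest from everything already selected. This is a generic illustration of variation-aware subsetting, not necessarily the exact scheme used in the study:

```python
import numpy as np

def kennard_stone(X, k):
    """Select k calibration samples spread over spectral space: start
    from the pair farthest apart, then repeatedly add the sample whose
    nearest selected neighbour is farthest away (max-min criterion)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    chosen = [i, j]
    remaining = set(range(len(X))) - set(chosen)
    while len(chosen) < k:
        rem = sorted(remaining)
        # distance from each remaining sample to its nearest chosen sample
        nearest = d[np.ix_(rem, chosen)].min(axis=1)
        nxt = rem[int(np.argmax(nearest))]
        chosen.append(nxt)
        remaining.remove(nxt)
    return chosen

rng = np.random.default_rng(2)
spectra = rng.standard_normal((60, 200))   # 60 spectra, 200 wavelengths
cal = kennard_stone(spectra, 15)
```

Subsetting on properties works the same way with the property vector in place of the spectra; schemes that consider both typically run the criterion on a concatenation of (scaled) spectra and properties.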
The assumption that training and testing samples are generated from the same distribution does not always hold for real-world machine-learning applications. The procedure of tackling this discrepancy between the training (source) and testing (target) domains is known as domain adaptation. We propose an unsupervised version of domain adaptation that considers the presence of only unlabelled data in the target domain. Our approach centres on finding correspondences between samples of each domain. The correspondences are obtained by treating the source and target samples as graphs and using a convex criterion to match them. The criteria used are first-order and second-order similarities between the graphs as well as a class-based regularization. We have also developed a computationally efficient routine for the convex optimization, thus allowing the proposed method to be used widely. To verify the effectiveness of the proposed method, computer simulations were conducted on synthetic, image classification and sentiment classification datasets. Results validated that the proposed local sample-to-sample matching method outperforms traditional moment-matching methods and is competitive with respect to current local domain-adaptation methods.
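The paper's convex criterion combines first-order and second-order graph similarities with class-based regularization; as a much-reduced illustration of the first-order part only, a soft sample-to-sample correspondence can be computed by Sinkhorn iterations on an entropically regularized transport problem over pairwise distances:

```python
import numpy as np

def sinkhorn_match(Xs, Xt, reg=0.1, n_iter=200):
    """Soft correspondence matrix between source and target samples from
    first-order (pairwise distance) information only, via Sinkhorn
    scaling with uniform marginals on both domains."""
    C = np.linalg.norm(Xs[:, None, :] - Xt[None, :, :], axis=-1) ** 2
    K = np.exp(-C / reg)
    ns, nt = C.shape
    a, b = np.ones(ns) / ns, np.ones(nt) / nt   # uniform marginals
    v = np.ones(nt) / nt
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(3)
Xs = rng.standard_normal((8, 2))
Xt = Xs + 0.05 * rng.standard_normal((8, 2))    # perturbed copy of the source
P = sinkhorn_match(Xs, Xt)
```

Each entry of P is the amount of source mass matched to a target sample; the full method additionally enforces second-order (pairwise-structure) consistency between the two graphs, which this sketch omits.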