Title: Homogeneity Structure Learning in Large-scale Panel Data with Heavy-tailed Errors
Large-scale panel data are ubiquitous in many modern data science applications. Conventional panel data analysis methods fail to address the new challenges that arise from the innovation of the data collection platforms on which applications operate, such as individual impacts of covariates, endogeneity, embedded low-dimensional structure, and heavy-tailed errors. In response to these challenges, this paper studies large-scale panel data with an interactive effects model. This model accounts for the individual impact of covariates on each spatial node and relaxes the exogeneity condition by allowing latent factors to affect both covariates and errors. In addition, we waive the sub-Gaussian assumption and allow the errors to be heavy-tailed. Further, we propose a data-driven procedure to learn a parsimonious yet flexible homogeneity structure embedded in the high-dimensional individual impacts of covariates. The homogeneity structure assumes that there exists a partition of the regression coefficients such that the coefficients are the same within each group but differ between groups. The homogeneity structure is flexible in that it contains many widely assumed low-dimensional structures (sparsity, global impact, etc.) as special cases. Non-asymptotic properties are established to justify the proposed learning procedure. Extensive numerical experiments demonstrate the advantage of the proposed procedure over conventional methods, especially when the data are generated from heavy-tailed distributions.
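
To make the homogeneity structure concrete, the following minimal sketch (not the paper's actual learning procedure) simulates panel data whose per-node coefficients take only a few shared values under heavy-tailed errors, then recovers the grouping by clustering robust per-node estimates; the panel dimensions, group values, and the use of HuberRegressor and KMeans are assumptions made purely for illustration.

    # Illustrative sketch only: per-node coefficients share a few common values, so
    # clustering rough, robustly estimated per-node coefficients exposes the groups.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import HuberRegressor  # robust to heavy-tailed errors

    rng = np.random.default_rng(0)
    n_nodes, T = 50, 200                          # assumed panel dimensions
    group_values = np.array([-1.0, 0.0, 2.0])     # assumed shared coefficient values
    true_group = rng.integers(0, 3, size=n_nodes)
    beta_true = group_values[true_group]          # one scalar coefficient per node

    # One covariate per node; errors drawn from a heavy-tailed Student-t distribution.
    X = rng.normal(size=(n_nodes, T))
    errors = rng.standard_t(df=2.5, size=(n_nodes, T))
    Y = beta_true[:, None] * X + errors

    # Step 1: rough per-node estimates via Huber regression.
    beta_hat = np.array([
        HuberRegressor().fit(X[i].reshape(-1, 1), Y[i]).coef_[0]
        for i in range(n_nodes)
    ])

    # Step 2: cluster the estimates to learn the homogeneity partition.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
        beta_hat.reshape(-1, 1))
    print("estimated group centers:",
          sorted(round(float(beta_hat[labels == k].mean()), 2) for k in range(3)))

In this picture, sparsity corresponds to the special case where one of the shared values is zero, and a global impact to the case of a single group.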
Award ID(s):
2015539 1953196 1820702
NSF-PAR ID:
10217339
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of machine learning research
Volume:
22
ISSN:
1533-7928
Page Range / eLocation ID:
1-42
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    We propose a multivariate sparse group lasso method for variable selection and estimation with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure on the regression coefficient matrix. It is well suited to many biological studies that detect associations between multiple traits and multiple predictors, with each trait and each predictor embedded in biological functional groups such as genes, pathways, or brain regions. The method effectively removes unimportant groups as well as unimportant individual coefficients within important groups, particularly in large-p, small-n problems, and flexibly handles complex group structures such as overlapping, nested, or multilevel hierarchical structures. It is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
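
    As a rough illustration of the kind of objective such a method penalizes, the sketch below writes down a squared-error loss plus group-wise and element-wise penalties for a coefficient matrix with non-overlapping row groups; it is a simplified stand-in, not the proposed estimator, and the group layout and penalty weights are assumptions chosen for the example.

        # Simplified sketch of a multivariate sparse group lasso objective (illustration
        # only): squared-error loss + group-wise Frobenius penalties + element-wise L1.
        import numpy as np

        def sparse_group_lasso_objective(B, X, Y, groups, lam_group, lam_l1):
            """B: (p, q) coefficient matrix; groups: list of row-index arrays."""
            resid = Y - X @ B
            loss = 0.5 * np.sum(resid ** 2)
            group_pen = sum(np.linalg.norm(B[g, :]) for g in groups)  # removes whole groups
            l1_pen = np.sum(np.abs(B))                                # removes single entries
            return loss + lam_group * group_pen + lam_l1 * l1_pen

        # Toy usage: 6 predictors in 2 groups, 3 response variables.
        rng = np.random.default_rng(1)
        X, B = rng.normal(size=(30, 6)), rng.normal(size=(6, 3))
        Y = X @ B + 0.1 * rng.normal(size=(30, 3))
        groups = [np.arange(0, 3), np.arange(3, 6)]
        print(sparse_group_lasso_objective(B, X, Y, groups, lam_group=1.0, lam_l1=0.1))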

     
  2. Abstract

    Why the new findings matter

    The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.

    Implications for educational researchers and policy makers

    The progressing digitisation of societies around the globe and the impact of the SARS-CoV-2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have shown the necessity of providing effective educational processes that can support sometimes-overwhelmed teachers in imparting knowledge digitally, as many governments and policy makers now plan. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for the individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi-)automated assessments, algorithmic grading, personalised feedback and adaptive learning approaches. However, these promises are strongly tied to at least a basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences.

    Demonstrating both the promises and the challenges inherent in the collection and analysis of large educational data with machine learning, this paper covers the essential topics that applying these methods requires and provides easy-to-follow resources and code to facilitate adoption.

     
  3. We study high-dimensional signal recovery from non-linear measurements with design vectors having an elliptically symmetric distribution. Special attention is devoted to the situation when the unknown signal belongs to a set of low statistical complexity, while both the measurements and the design vectors are heavy-tailed. We propose and analyze a new estimator that adapts to the structure of the problem, while being robust both to possible model misspecification, characterized by arbitrary non-linearity of the measurements, and to data corruption modeled by heavy-tailed distributions. Moreover, this estimator has low computational complexity. Our results are expressed in the form of exponential concentration inequalities for the error of the proposed estimator. On the technical side, our proofs rely on generic chaining methods and illustrate the power of this approach for statistical applications. The theory is supported by numerical experiments demonstrating that our estimator outperforms existing alternatives when the data are heavy-tailed.
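
    A hedged sketch of the general setting (not the estimator analyzed in this work): with one-bit measurements y = sign(<x, theta>) plus heavy-tailed noise and a heavy-tailed design, simply averaging the terms y_i * x_i recovers the direction of theta, and truncating each term before averaging limits the influence of outliers; the truncation level and distributions below are assumptions chosen for the example.

        # Illustration only (not the paper's estimator): direction recovery from
        # non-linear, heavy-tailed measurements, with truncation to tame outliers.
        import numpy as np

        rng = np.random.default_rng(2)
        n, d = 2000, 50
        theta = np.zeros(d)
        theta[:5] = 1.0 / np.sqrt(5)                     # sparse, unit-norm signal

        X = rng.standard_t(df=3, size=(n, d))            # heavy-tailed design vectors
        y = np.sign(X @ theta) + 0.5 * rng.standard_t(df=3, size=n)  # non-linear link + noise

        Z = y[:, None] * X                               # per-observation terms y_i * x_i
        naive = Z.mean(axis=0)                           # plain averaged estimator
        tau = 3.0 * np.median(np.abs(Z))                 # crude truncation level (assumption)
        robust = np.clip(Z, -tau, tau).mean(axis=0)      # truncated (robustified) average

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        print("cosine with true signal - naive: %.3f, truncated: %.3f"
              % (cosine(naive, theta), cosine(robust, theta)))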
  4. SUMMARY

    The advent of fast sensing technologies allows for real-time model updates in many applications where the model parameters are uncertain. Once the observations are collected, Bayesian algorithms offer a pathway for real-time inversion (i.e., updating model parameters/inputs) because of the flexibility of the Bayesian framework against non-uniqueness and uncertainties. However, Bayesian algorithms rely on repeated evaluation of the computational models, and deep learning (DL) based proxies can be useful to address this computational bottleneck. In this paper, we study the effects of the approximate nature of the deep learned models and the associated model errors during the inversion of borehole electromagnetic (EM) measurements, which are usually obtained from logging while drilling. We rely on iterative ensemble smoothers as an effective algorithm for real-time inversion due to their parallel nature and relatively low computational cost. The real-time inversion of EM measurements is used to determine the subsurface geology and properties, which are critical for real-time adjustments of the well trajectory (geosteering). The use of a deep neural network (DNN) as a forward model allows us to perform thousands of model evaluations within seconds, which is very useful for quantifying uncertainties and non-uniqueness in real time. While significant efforts are usually made to ensure the accuracy of DL models, it is widely known that DNNs can contain model errors in regions not covered by the training data; these errors are unknown and specific to the training set. When DL models are utilized during the inversion of EM measurements, such model errors could manifest themselves as a bias in the estimated input parameters and, as a consequence, might result in low-quality geosteering decisions. We present numerical results highlighting the challenges associated with the inversion of EM measurements while neglecting model error. We further demonstrate the utility of a recently proposed flexible iterative ensemble smoother in reducing the effect of model bias by capturing the unknown model errors, thus improving the quality of the estimated subsurface properties for geosteering operations. Moreover, we describe a procedure for identifying inversion multimodality and propose possible solutions to alleviate it in real time.
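
    The ensemble-smoother idea can be sketched in a few lines: an ensemble of parameter vectors is pushed through a (cheap, possibly DNN-based) forward model and corrected toward the observed data with a Kalman-type update. The code below is a generic single-step version with a toy stand-in forward model; it is not the flexible iterative smoother referred to above, and all sizes and noise levels are assumptions.

        # Generic single-step ensemble smoother with a toy surrogate forward model
        # (illustration only; the paper relies on iterative/flexible variants and a DNN proxy).
        import numpy as np

        rng = np.random.default_rng(3)

        def forward(m):
            """Stand-in forward model mapping 2 parameters to 2 measurements."""
            return np.array([m[0] + m[1] ** 2, np.sin(m[0]) + m[1]])

        m_true = np.array([1.0, 0.5])
        noise_std = 0.05
        d_obs = forward(m_true) + noise_std * rng.normal(size=2)

        # Prior ensemble of parameter vectors and their predicted data.
        N = 200
        M = rng.normal(size=(N, 2))
        D = np.array([forward(m) for m in M])

        # Kalman-type update: M += (d_obs + e - D) @ K.T, with K = C_md (C_dd + C_e)^-1.
        C_e = (noise_std ** 2) * np.eye(2)
        Dm, Dd = M - M.mean(axis=0), D - D.mean(axis=0)
        C_md = Dm.T @ Dd / (N - 1)
        C_dd = Dd.T @ Dd / (N - 1)
        K = C_md @ np.linalg.inv(C_dd + C_e)
        perturbed_obs = d_obs + noise_std * rng.normal(size=(N, 2))
        M_post = M + (perturbed_obs - D) @ K.T

        print("posterior mean:", M_post.mean(axis=0).round(3), "true:", m_true)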

     
  5. Abstract

    Predicting the proportion of the water year a given stream will remain at or above various flow thresholds is critically important for making sound water management decisions. Flow duration curves (FDCs) succinctly capture this information using all data available over some historical period, while annual flow duration curves (AFDCs) instead use data from each individual water year. Analyzing the population of AFDCs, and in particular the tails of this distribution, can allow water managers to better prepare for years with extreme streamflow conditions. However, long time series of observations are necessary to capture interannual streamflow variations and are problematic to obtain in rapidly changing and poorly gauged catchments. By incorporating a process-based model to construct AFDCs from daily rainfall statistics and flow recession characteristics, the proposed approach is a first step toward addressing this challenge. Results indicate that prediction performance varies substantially across flow quantiles and that the current model fails to properly capture the interannual variability of low flows. Numerical analyses attributed these errors to nonlinearity in the storage-discharge relation rather than to cross-scale streamflow correlations or non-Poissonian rainfall, explaining the origin of the commonly observed heavy-tailed behavior in low-flow quantiles. We present a case study on hydroelectric power generation, showing that faithfully capturing both interannual streamflow variability and recession nonlinearity has important implications for installation profitability.
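
    For readers unfamiliar with the object itself, the sketch below builds an empirical period-of-record FDC and per-year low-flow quantiles from a synthetic daily series; the lognormal flows and the plotting-position formula are assumptions made for illustration and are not part of the process-based model described above.

        # Minimal sketch: empirical flow duration curve (FDC) and annual low-flow
        # quantiles from a synthetic daily streamflow record (illustration only).
        import numpy as np

        rng = np.random.default_rng(4)
        n_years, days = 20, 365
        flows = rng.lognormal(mean=1.0, sigma=0.8, size=(n_years, days))  # synthetic flows

        def flow_duration_curve(q):
            """Return exceedance probabilities and flows sorted from high to low."""
            q_sorted = np.sort(q)[::-1]
            exceedance = np.arange(1, q_sorted.size + 1) / (q_sorted.size + 1)  # Weibull position
            return exceedance, q_sorted

        # Period-of-record FDC pools all years; AFDC statistics are computed year by year.
        p_all, q_all = flow_duration_curve(flows.ravel())
        q95_record = q_all[np.searchsorted(p_all, 0.95)]          # flow exceeded ~95% of the time
        annual_q95 = [np.quantile(year, 0.05) for year in flows]  # same quantile, per water year

        print("period-of-record Q95: %.2f" % q95_record)
        print("interannual range of annual Q95: %.2f to %.2f" % (min(annual_q95), max(annual_q95)))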

     