
Title: Unweighted estimation based on optimal sample under measurement constraints

Subsampling is a practical approach to tackling massive data: it selects the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is challenging, and an optimal sampling approach under measurement constraints was developed to meet this challenge. That method reweights the objective function by the inverses of the optimal sampling probabilities, which assigns smaller weights to the more informative data points, so its estimation efficiency can still be improved. In this paper, we propose an unweighted estimating procedure based on optimal subsamples that yields a more efficient estimator. Using martingale techniques, we obtain the unconditional asymptotic distribution of the estimator without conditioning on the pilot estimate, an aspect that has received little attention in the existing subsampling literature. Both asymptotic and numerical results show that the unweighted estimator is more efficient in parameter estimation.
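As a rough, hypothetical sketch of the contrast the abstract draws (the logistic-regression setting and all function names here are illustrative assumptions, not taken from the paper): an inverse-probability-weighted objective divides each subsampled point's loss by its sampling probability, so points sampled with high probability for being informative receive small weights, whereas an unweighted objective treats all subsampled points equally.

```python
import numpy as np

def weighted_loss(beta, X, y, p):
    """Inverse-probability-weighted negative log-likelihood on a subsample.
    Points with large sampling probability p (the informative ones under
    optimal subsampling) receive small weights 1/p."""
    eta = X @ beta
    return np.mean((np.log1p(np.exp(eta)) - y * eta) / p)

def unweighted_loss(beta, X, y):
    """Plain negative log-likelihood on the same subsample; the bias this
    introduces must be corrected separately (e.g. via a pilot estimate)."""
    eta = X @ beta
    return np.mean(np.log1p(np.exp(eta)) - y * eta)
```

With all sampling probabilities equal to one, the two objectives coincide; the paper's point is that when the probabilities are unequal, dropping the weights (with a suitable bias correction) can lower the asymptotic variance.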

Publisher / Repository: Wiley Blackwell (John Wiley & Sons)
Journal Name: Canadian Journal of Statistics
Sponsoring Org: National Science Foundation
More Like this
  1. In this paper, we propose an improved estimation method for logistic regression based on subsamples taken according to the optimal subsampling probabilities developed in Wang et al. (2018). Both asymptotic and numerical results show that the new estimator has higher estimation efficiency. We also develop a new algorithm based on Poisson subsampling, which does not require approximating all the optimal subsampling probabilities at once. This is computationally advantageous when the available random-access memory cannot hold the full data. Interestingly, the asymptotic distributions also show that Poisson subsampling produces a more efficient estimator if the sampling ratio, the ratio of the subsample size to the full data sample size, does not converge to zero. We also obtain the unconditional asymptotic distribution for the estimator based on Poisson subsampling. Pilot estimators are required to calculate subsampling probabilities and to correct biases in unweighted estimators; interestingly, even if the pilot estimators are inconsistent, the proposed method still produces consistent and asymptotically normal estimators.
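A minimal, hypothetical sketch of the Poisson-subsampling idea described above (the function name and single-pass interface are illustrative assumptions, not the paper's implementation): each record is kept independently with its own inclusion probability, so the probabilities never have to be materialized, or the full data held in memory, all at once.

```python
import numpy as np

def poisson_subsample(stream, prob_fn, rng):
    """One pass over the data: keep each record independently with
    probability prob_fn(record); no normalization over the full data
    is needed, and only the kept records are stored."""
    kept, weights = [], []
    for rec in stream:
        p = min(1.0, prob_fn(rec))
        if rng.random() < p:
            kept.append(rec)
            weights.append(1.0 / p)  # inverse-probability weight
    return kept, weights
```

Because the inclusion decisions are independent, the realized subsample size is random rather than fixed, which is the characteristic feature of Poisson subsampling.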
  2. Summary: We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling probabilities. One version minimizes the trace of the asymptotic variance-covariance matrix of a linearly transformed parameter estimator; the other minimizes that of the original parameter estimator. The former does not depend on the densities of the responses given the covariates and is easy to implement. Algorithms based on the optimal subsampling probabilities are proposed, and the asymptotic distributions and asymptotic optimality of the resulting estimators are established. Furthermore, we propose an iterative subsampling procedure based on the optimal subsampling probabilities for the linearly transformed parameter estimation, which scales well with available computational resources. In addition, this procedure yields standard errors for the parameter estimators without estimating the densities of the responses given the covariates. We provide numerical examples based on both simulated and real data to illustrate the proposed method.
  3. Recent research has developed several Monte Carlo methods for estimating the normalization constant (partition function) based on the idea of annealing: sampling successively from a path of distributions that interpolates between a tractable "proposal" distribution and the unnormalized "target" distribution. Prominent estimators in this family include annealed importance sampling and annealed noise-contrastive estimation (NCE). Such methods hinge on a number of design choices: which estimator to use, which path of distributions to use, and whether to use a path at all; so far, there is no definitive theory on which choices are efficient. Here, we evaluate each design choice by the asymptotic estimation error it produces. First, we show that using NCE is more efficient than the importance sampling estimator, but that in the limit of infinitesimal path steps the difference vanishes. Second, we find that using the geometric path brings the estimation error down from an exponential to a polynomial function of the parameter distance between the target and proposal distributions. Third, we find that the arithmetic path, while rarely used, can offer optimality properties over the universally used geometric path; in fact, in a particular limit, the optimal path is arithmetic. Based on this theory, we finally propose a two-step estimator to approximate the optimal path in an efficient way.
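As an illustrative sketch of the geometric path and the annealed-importance-sampling weights discussed above (a generic textbook construction assuming unnormalized log-densities; not the paper's proposed estimator):

```python
import numpy as np

def log_geometric_path(log_p0, log_p1, t):
    """Unnormalized log-density on the geometric path between a proposal
    (t = 0) and a target (t = 1): log pi_t = (1 - t) log p0 + t log p1."""
    return (1.0 - t) * log_p0 + t * log_p1

def ais_log_weight(x_path, log_p0, log_p1, ts):
    """Annealed-importance-sampling log-weight for one particle, where
    x_path[k] is (approximately) a draw from the path distribution at
    temperature ts[k]; log_p0 and log_p1 map a state to its log-density."""
    lw = 0.0
    for k in range(1, len(ts)):
        x = x_path[k - 1]  # state drawn at the previous temperature
        lw += (log_geometric_path(log_p0(x), log_p1(x), ts[k])
               - log_geometric_path(log_p0(x), log_p1(x), ts[k - 1]))
    return lw
```

With a single step (ts = [0, 1]) this reduces to ordinary importance sampling; refining the temperature grid is what trades the exponential error growth for polynomial growth in the result quoted above.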
  4. Abstract

    Calibration weighting has been widely used to correct selection biases in nonprobability sampling, missing data and causal inference. The main idea is to calibrate the biased sample to the benchmark by adjusting the subject weights. However, hard calibration can produce enormous weights when an exact calibration is enforced on a large set of extraneous covariates. This article proposes a soft calibration scheme, where the outcome and the selection indicator follow mixed-effect models. The scheme imposes an exact calibration on the fixed effects and an approximate calibration on the random effects. On the one hand, our soft calibration has an intrinsic connection with best linear unbiased prediction, which results in a more efficient estimation compared to hard calibration. On the other hand, soft calibration weighting estimation can be envisioned as penalized propensity score weight estimation, with the penalty term motivated by the mixed-effect structure. The asymptotic distribution and a valid variance estimator are derived for soft calibration. We demonstrate the superiority of the proposed estimator over other competitors in simulation studies and using a real-world data application on the effect of BMI screening on childhood obesity.

  5. Abstract

    Camera traps deployed in grids or stratified random designs are a well‐established survey tool for wildlife but there has been little evaluation of study design parameters.

    We used an empirical subsampling approach involving 2,225 camera deployments run at 41 study areas around the world to evaluate three aspects of camera trap study design (number of sites, duration and season of sampling) and their influence on the estimation of three ecological metrics (species richness, occupancy and detection rate) for mammals.

    We found that 25–35 camera sites were needed for precise estimates of species richness, depending on scale of the study. The precision of species‐level estimates of occupancy (ψ) was highly sensitive to occupancy level, with <20 camera sites needed for precise estimates of common (ψ > 0.75) species, but more than 150 camera sites likely needed for rare (ψ < 0.25) species. Species detection rates were more difficult to estimate precisely at the grid level due to spatial heterogeneity, presumably driven by unaccounted habitat variability factors within the study area. Running a camera at a site for 2 weeks was most efficient for detecting new species, but 3–4 weeks were needed for precise estimates of local detection rate, with no gains in precision observed after 1 month. Metrics for all mammal communities were sensitive to seasonality, with 37%–50% of the species at the sites we examined fluctuating significantly in their occupancy or detection rates over the year. This effect was more pronounced in temperate sites, where seasonally sensitive species varied in relative abundance by an average factor of 4–5, and some species were completely absent in one season due to hibernation or migration.

    We recommend the following guidelines to efficiently obtain precise estimates of species richness, occupancy and detection rates with camera trap arrays: run each camera for 3–5 weeks across 40–60 sites per array. We recommend comparisons of detection rates be model based and include local covariates to help account for small‐scale variation. Furthermore, comparisons across study areas or times must account for seasonality, which could have strong impacts on mammal communities in both tropical and temperate sites.
