
Title: Variance Estimation in the Analysis of Microarray Data

Microarrays are one of the most widely used high throughput technologies. One of the main problems in the area is that conventional estimates of the variances that are required in the t-statistic and other statistics are unreliable owing to the small number of replications. Various methods have been proposed in the literature to overcome this lack of degrees of freedom. In this context, it is commonly observed that the variance increases proportionally with the intensity level, which has led many researchers to assume that the variance is a function of the mean. Here we concentrate on estimation of the variance as a function of an unknown mean in two models: the constant coefficient of variation model and the quadratic variance–mean model. Because the means are unknown and estimated with few degrees of freedom, naive methods that use the sample mean in place of the true mean are generally biased because of the errors-in-variables phenomenon. We propose three methods for overcoming this bias. The first two are variations on the so-called heteroscedastic simulation–extrapolation estimator, modified to estimate the variance function consistently. The third class of estimators is entirely different, being based on semiparametric information calculations. Simulations show the power of our methods and their lack of bias compared with the naive method that ignores the measurement error. The methodology is illustrated by using microarray data from leukaemia patients.
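The errors-in-variables bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' simulation–extrapolation or semiparametric estimator; it assumes a constant coefficient of variation model written as Var(Y) = theta * mu^2 with normal errors, and all function names, the range of the means, and the parameter values are hypothetical choices made for illustration. The naive estimator plugs the sample mean in for the true mean; the second estimator applies a simple method-of-moments correction that is valid only under these assumptions.

```python
import numpy as np


def simulate(n_genes=50000, m=3, theta=0.25, seed=0):
    """Simulate m replicates per gene with Var(Y) = theta * mu**2 (assumed model)."""
    rng = np.random.default_rng(seed)
    mu = rng.uniform(5.0, 15.0, size=n_genes)      # unknown true means
    sd = np.sqrt(theta) * mu                       # constant coefficient of variation
    return mu[:, None] + sd[:, None] * rng.standard_normal((n_genes, m))


def naive_theta(y):
    """Plug the sample mean in for the true mean (ignores measurement error)."""
    ybar = y.mean(axis=1)
    s2 = y.var(axis=1, ddof=1)
    return s2.sum() / (ybar ** 2).sum()


def moment_corrected_theta(y):
    """Method-of-moments correction (illustrative, not the paper's estimator).

    Since E[ybar**2] = mu**2 * (1 + theta / m), the naive ratio r converges to
    theta / (1 + theta / m); solving for theta gives r / (1 - r / m).
    """
    m = y.shape[1]
    r = naive_theta(y)
    return r / (1.0 - r / m)


y = simulate()
print("naive:", naive_theta(y))
print("moment-corrected:", moment_corrected_theta(y))
```

With only three replicates per gene, the naive ratio is pulled below the true value of theta because the squared sample mean overestimates the squared true mean on average; the correction removes that bias in this idealized setting.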

Publisher / Repository:
Oxford University Press
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Page Range / eLocation ID:
p. 425-445
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    Distributed lag models (DLMs) have been widely used in environmental epidemiology to quantify the lagged effects of air pollution on a health outcome of interest such as mortality and morbidity. Most previous DLM approaches consider only one pollutant at a time. We propose a distributed lag interaction model to characterize the joint lagged effect of two pollutants. One natural way to model the interaction surface is by assuming that the underlying basis functions are tensor products of the basis functions that generate the main effect distributed lag functions. We extend Tukey's 1 degree-of-freedom interaction structure to the two-dimensional DLM context. We also consider shrinkage versions of the two to allow departure from the specified Tukey interaction structure and achieve a bias–variance trade-off. We derive the marginal lag effects of one pollutant when the other pollutant is fixed at certain quantiles. In a simulation study, we show that the shrinkage methods have better average performance in terms of mean-squared error across various scenarios. We illustrate the methods proposed by using the ‘National morbidity, mortality, and air pollution study’ data to model the joint effects of particulate matter and ozone on mortality count in Chicago, Illinois, from 1987 to 2000.

  2. Abstract

    Effective theories describing black hole exteriors resemble open quantum systems inasmuch as many unmeasurable degrees of freedom beyond the horizon interact with those we can see. A solvable Caldeira‐Leggett type model of a quantum field that mixes with many unmeasured thermal degrees of freedom on a shared surface was proposed in arXiv:2106.09854 to provide a benchmark against which more complete black hole calculations might be compared. We here use this model to test two types of field‐theoretic approximation schemes that also lend themselves to describing black hole behaviour: Open EFT techniques (as applied to the fields themselves, rather than Unruh‐DeWitt detectors) and mean‐field methods. Mean‐field methods are of interest because the effective Hamiltonians to which they lead can be nonlocal; a possible source for the nonlocality that is sometimes entertained as being possible for black holes in the near‐horizon regime. Open EFTs compute the evolution of the field state, allowing discussion of thermalization and decoherence even when these occur at such late times that perturbative methods fail (as they often do). Applying both of these methods to a solvable system identifies their domains of validity and shows how their predictions relate to more garden‐variety perturbative tools.

  3. Abstract

    High-resolution profiles of vertical velocity obtained from two different surface-following autonomous platforms, Surface Wave Instrument Floats with Tracking (SWIFTs) and a Liquid Robotics SV3 Wave Glider, are used to compute dissipation rate profiles ϵ(z) between 0.5 and 5 m depth via the structure function method. The main contribution of this work is to update previous SWIFT methods to account for bias due to surface gravity waves, which are ubiquitous in the near-surface region. We present a technique where the data are prefiltered by removing profiles of wave orbital velocities obtained via empirical orthogonal function (EOF) analysis of the data prior to computing the structure function. Our analysis builds on previous work to remove wave bias in which analytic modifications are made to the structure function model. However, we find the analytic approach less able to resolve the strong vertical gradients in ϵ(z) near the surface. The strength of the EOF filtering technique is that it does not require any assumptions about the structure of nonturbulent shear, and does not add any additional degrees of freedom in the least squares fit to the model of the structure function. In comparison to the analytic method, ϵ(z) estimates obtained via empirical filtering have substantially reduced noise and a clearer dependence on near-surface wind speed.

  4. The weighted nearest neighbors (WNN) estimator has been popularly used as a flexible and easy-to-implement nonparametric tool for mean regression estimation. The bagging technique is an elegant way to form WNN estimators with weights automatically generated to the nearest neighbors (Steele, 2009; Biau et al., 2010); we name the resulting estimator as the distributional nearest neighbors (DNN) for easy reference. Yet, there is a lack of distributional results for such an estimator, limiting its application to statistical inference. Moreover, when the mean regression function has higher-order smoothness, DNN does not achieve the optimal nonparametric convergence rate, mainly because of the bias issue. In this work, we provide an in-depth technical analysis of the DNN, based on which we suggest a bias reduction approach for the DNN estimator by linearly combining two DNN estimators with different subsampling scales, resulting in the novel two-scale DNN (TDNN) estimator. The two-scale DNN estimator has an equivalent representation of WNN with weights admitting explicit forms and some being negative. We prove that, thanks to the use of negative weights, the two-scale DNN estimator enjoys the optimal nonparametric rate of convergence in estimating the regression function under the fourth order smoothness condition. We further go beyond estimation and establish that the DNN and two-scale DNN are both asymptotically normal as the subsampling scales and sample size diverge to infinity. For the practical implementation, we also provide variance estimators and a distribution estimator using the jackknife and bootstrap techniques for the two-scale DNN. These estimators can be exploited for constructing valid confidence intervals for nonparametric inference of the regression function. The theoretical results and appealing finite-sample performance of the suggested two-scale DNN method are illustrated with several simulation examples and a real data application.
  5. Summary

    Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to a serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared. Their performances can be improved by the refitted cross-validation method proposed.
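The data-splitting idea in this summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes marginal-correlation screening as the variable selection step (the summary also mentions the lasso and smoothly clipped absolute deviation), a null model with true noise variance 1, and hypothetical function names and problem sizes. Variables are selected on one half of the data and the variance is estimated by an ordinary least squares refit on the other half; the two resulting estimates are averaged.

```python
import numpy as np


def top_k_by_correlation(X, y, k):
    """Select the k columns with the largest absolute marginal correlation (assumed selector)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    return np.argsort(score)[-k:]


def refit_variance(X, y, selected):
    """OLS refit on the selected columns; residual variance with degrees-of-freedom correction."""
    n = len(y)
    design = np.column_stack([np.ones(n), X[:, selected]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return resid @ resid / (n - design.shape[1])


def naive_two_stage(X, y, k):
    """Select and refit on the same data; spurious correlations deflate the estimate."""
    return refit_variance(X, y, top_k_by_correlation(X, y, k))


def refitted_cv(X, y, k, rng):
    """Select on one half, refit on the other, swap roles, and average."""
    n = len(y)
    perm = rng.permutation(n)
    a, b = perm[: n // 2], perm[n // 2:]
    est_ab = refit_variance(X[b], y[b], top_k_by_correlation(X[a], y[a], k))
    est_ba = refit_variance(X[a], y[a], top_k_by_correlation(X[b], y[b], k))
    return 0.5 * (est_ab + est_ba)


rng = np.random.default_rng(1)
n, p, k = 200, 2000, 10
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)          # null model: no predictor matters, noise variance is 1
naive = naive_two_stage(X, y, k)
rcv = refitted_cv(X, y, k, rng)
print("naive two-stage:", naive)
print("refitted cross-validation:", rcv)
```

In this setting every selected variable is spurious, so refitting on the same data fits part of the noise, whereas variables chosen on one half have no special relationship with the noise in the other half.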
