Title: Optimal Allocation of Replicates for Measurement Evaluation Studies
Abstract

Optimal experimental design is important for the efficient use of modern high-throughput technologies such as microarrays and proteomics. Multiple factors, including the reliability of the measurement system, which itself must be estimated from prior experimental work, can influence design decisions. In this study, we describe how to determine the optimal number of replicate measures (technical replicates) for each biological sample (biological replicate). Different allocations of biological and technical replicates were evaluated by minimizing the variance of the ratio of technical variance (measurement error) to total variance (the sum of sampling error and measurement error). We demonstrate that if the number of biological replicates and the number of technical replicates per biological sample can vary while the total number of available measures is fixed, then the optimal allocation for measurement evaluation experiments uses two technical replicates for each biological replicate. We therefore recommend two technical replicates per biological replicate when the goal is to evaluate the reproducibility of measurements.
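As an illustration of the quantities the abstract refers to, the following minimal sketch lists the ways a fixed number of measures can be split into biological and technical replicates, and estimates the ratio of technical variance to total variance from a balanced data set. It assumes a balanced one-way random-effects layout and a standard ANOVA (method-of-moments) estimator; the abstract does not state which estimator the study used, and the function names `allocations` and `technical_variance_ratio` are hypothetical.

```python
from statistics import mean


def allocations(total_measures):
    """Return (biological, technical) pairs whose product equals total_measures.

    Assumption: at least two biological samples and two technical replicates
    per sample, so that both variance components can be estimated.
    """
    pairs = []
    for technical in range(2, total_measures + 1):
        if total_measures % technical == 0:
            biological = total_measures // technical
            if biological >= 2:
                pairs.append((biological, technical))
    return pairs


def technical_variance_ratio(data):
    """Estimate technical variance / (technical + biological variance).

    data: list of lists; each inner list holds the technical replicates of
    one biological sample (balanced design assumed).
    Uses one-way ANOVA mean squares, an illustrative choice of estimator.
    """
    n = len(data)
    k = len(data[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need a balanced design with n >= 2 and k >= 2")

    sample_means = [mean(row) for row in data]
    grand_mean = mean(sample_means)

    # Between-sample and within-sample mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in sample_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(data, sample_means) for x in row
    ) / (n * (k - 1))

    var_technical = ms_within
    # Truncate at zero: the moment estimator can go negative by chance.
    var_biological = max((ms_between - ms_within) / k, 0.0)

    total = var_technical + var_biological
    if total == 0:
        raise ValueError("all measurements are identical; ratio is undefined")
    return var_technical / total


if __name__ == "__main__":
    print(allocations(12))
    print(technical_variance_ratio([[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]]))
```

To compare allocations in the spirit of the abstract, one could simulate data for each pair returned by `allocations` and compare the spread of the estimated ratio across simulations; that step is left out here because the outcome depends on modelling assumptions the abstract does not specify.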

 
PAR ID:
10506164
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Genomics, Proteomics & Bioinformatics
Volume:
4
Issue:
3
ISSN:
1672-0229
Format(s):
Medium: X Size: p. 196-202
Sponsoring Org:
National Science Foundation
More Like this
  1. Most linear experimental design problems assume homogeneous variance, although heteroskedastic noise is present in many realistic settings. Let a learner have access to a finite set of measurement vectors that can be probed to receive noisy linear responses. We propose, analyze, and empirically evaluate a novel design for uniformly bounding the estimation error of the variance parameters. We demonstrate this method on two adaptive experimental design problems under heteroskedastic noise, fixed-confidence transductive best-arm identification and level-set identification, and prove the first instance-dependent lower bounds in these settings. Lastly, we construct near-optimal algorithms and empirically demonstrate the large improvements in sample complexity gained from accounting for heteroskedastic variance in these designs.
  2. Optimal designs minimize the number of experimental runs (samples) needed to accurately estimate model parameters, resulting in algorithms that, for instance, efficiently minimize parameter estimate variance. Governed by knowledge of past observations, adaptive approaches adjust sampling constraints online as model parameter estimates are refined, continually maximizing expected information gained or variance reduced. We apply adaptive Bayesian inference to estimate transition rates of Markov chains, a common class of models for stochastic processes in nature. Unlike most previous studies, our sequential Bayesian optimal design is updated with each observation and can be simply extended beyond two-state models to birth–death processes and multistate models. By iteratively finding the best time to obtain each sample, our adaptive algorithm maximally reduces variance, resulting in lower overall error in ground truth parameter estimates across a wide range of Markov chain parameterizations and conformations. 
  3. Experimental design is a classical area in statistics and has also found new applications in machine learning. In the combinatorial experimental design problem, the aim is to estimate an unknown m-dimensional vector x from linear measurements, where Gaussian noise is introduced in each measurement. The goal is to pick k out of the given n experiments so as to make the most accurate estimate of the unknown parameter x. Given a set S of chosen experiments, the maximum likelihood estimate x0 can be obtained by a least squares computation. One of the robust measures of error estimation is the D-optimality criterion, which aims to minimize the generalized variance of the estimator. This corresponds to minimizing the volume of the standard confidence ellipsoid for the estimation error x − x0. The problem gives rise to two natural variants depending on whether repetitions of experiments are allowed or not. The latter variant, while being more general, has also found applications in geographical location of sensors. We show a close connection between approximation algorithms for the D-optimal design problem and constructions of approximately m-wise positively correlated distributions. This connection allows us to obtain the first approximation algorithms for the D-optimal design problem with and without repetitions. We then consider the case when the number of experiments chosen is much larger than the dimension m and show that one can obtain asymptotically optimal algorithms in this case.
  4. Abstract Motivation

    High-throughput mRNA sequencing (RNA-Seq) is a powerful tool for quantifying gene expression. Identification of transcript isoforms that are differentially expressed in different conditions, such as in patients and healthy subjects, can provide insights into the molecular basis of diseases. Current transcript quantification approaches, however, do not take advantage of the shared information in the biological replicates, potentially decreasing sensitivity and accuracy.

    Results

    We present a novel hierarchical Bayesian model called Differentially Expressed Isoform detection from Multiple biological replicates (DEIsoM) for identifying differentially expressed (DE) isoforms from multiple biological replicates representing two conditions, e.g. multiple samples from healthy and diseased subjects. DEIsoM first estimates isoform expression within each condition by (1) capturing common patterns from sample replicates while allowing individual differences, and (2) modeling the uncertainty introduced by ambiguous read mapping in each replicate. Specifically, we introduce a Dirichlet prior distribution to capture the common expression pattern of replicates from the same condition, and treat the isoform expression of individual replicates as samples from this distribution. Ambiguous read mapping is modeled as a multinomial distribution, and ambiguous reads are assigned to the most probable isoform in each replicate. Additionally, DEIsoM couples an efficient variational inference and a post-analysis method to improve the accuracy and speed of identification of DE isoforms over alternative methods. Application of DEIsoM to a hepatocellular carcinoma (HCC) dataset identifies biologically relevant DE isoforms. The relevance of these genes/isoforms to HCC is supported by principal component analysis (PCA), read coverage visualization, and the biological literature.

    Availability and implementation

    The software is available at https://github.com/hao-peng/DEIsoM

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. Abstract Summary

    Due to their sparsity and high dimensionality, microbiome data are routinely summarized into pairwise distances that capture compositional differences. Many biological insights can be gained by analyzing the distance matrix in relation to covariates. A microbiome sampling method that characterizes the inter-sample relationship more reproducibly is expected to yield higher statistical power. Traditionally, the intraclass correlation coefficient (ICC) has been used to quantify the degree of reproducibility for a univariate measurement using technical replicates. In this work, we extend the traditional ICC to distance measures and propose a distance-based ICC (dICC). We derive the asymptotic distribution of the sample-based dICC to facilitate statistical inference. We illustrate dICC using a real dataset from a metagenomic reproducibility study.

    Availability and implementation

    dICC is implemented in the R CRAN package GUniFrac.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     