Title: Turning the information-sharing dial: Efficient inference from different data sources
A fundamental aspect of statistics is the integration of data from different sources. Classically, Fisher and others were focused on how to integrate homogeneous (or only mildly heterogeneous) sets of data. More recently, as data are becoming more accessible, the question of whether data sets from different sources should be integrated is becoming more relevant. The current literature treats this as a question with only two answers: integrate or don’t. Here we take a different approach, motivated by information-sharing principles coming from the shrinkage estimation literature. In particular, we deviate from the do/don’t perspective and propose a dial parameter that controls the extent to which two data sources are integrated. How far this dial parameter should be turned is shown to depend, for example, on the informativeness of the different data sources as measured by Fisher information. In the context of generalized linear models, this more nuanced data integration framework leads to relatively simple parameter estimates and valid tests/confidence intervals. Moreover, we demonstrate both theoretically and empirically that setting the dial parameter according to our recommendation leads to more efficient estimation compared to other binary data integration schemes.
Award ID(s):
2337943 2051225
PAR ID:
10576242
Author(s) / Creator(s):
;
Publisher / Repository:
The Institute of Mathematical Statistics and the Bernoulli Society
Date Published:
Journal Name:
Electronic Journal of Statistics
Volume:
18
Issue:
2
ISSN:
1935-7524
Page Range / eLocation ID:
2974-3020
Subject(s) / Keyword(s):
Data enrichment; Generalized linear models; Kullback–Leibler divergence; Ridge regression; Transfer learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Multimodal integration combines information from different sources or modalities to gain a more comprehensive understanding of a phenomenon. The challenges in multi-omics data analysis lie in the complexity, high dimensionality, and heterogeneity of the data, which demand sophisticated computational tools and visualization methods for proper interpretation and visualization of multi-omics data. In this paper, we propose a novel method, termed Orthogonal Multimodality Integration and Clustering (OMIC), for analyzing CITE-seq. Our approach enables researchers to integrate multiple sources of information while accounting for the dependence among them. We demonstrate the effectiveness of our approach using CITE-seq data sets for cell clustering. Our results show that our approach outperforms existing methods in terms of accuracy, computational efficiency, and interpretability. We conclude that our proposed OMIC method provides a powerful tool for multimodal data analysis that greatly improves the feasibility and reliability of integrated data.
  2. Before formal education begins, children typically acquire a vocabulary of thousands of words. This learning process requires the use of many different information sources in their social environment, including their current state of knowledge and the context in which they hear words used. How is this information integrated? We specify a developmental model according to which children consider information sources in an age-specific way and integrate them via Bayesian inference. This model accurately predicted 2–5-year-old children’s word learning across a range of experimental conditions in which they had to integrate three information sources. Model comparison suggests that the central locus of development is an increased sensitivity to individual information sources, rather than changes in integration ability. This work presents a developmental theory of information integration during language learning and illustrates how formal models can be used to make a quantitative test of the predictive and explanatory power of competing theories.
  3. Emerging lithium-ion battery systems require high-fidelity electrochemical models for advanced control, diagnostics, and design. Accordingly, battery parameter estimation is an active research domain where novel algorithms are being developed to calibrate complex models from input-output data. Amidst these efforts, little focus has been placed on the fundamental mechanisms governing estimation accuracy, spurring the question, why is an estimate accurate or inaccurate? In response, we derive a generalized estimation error equation under the commonly adopted least-squares objective function, which reveals that the error can be represented as a combination of system uncertainties (i.e., in model, measurement, and parameter) and uncertainty-propagating sensitivity structures in the data. We then relate the error equation to conventional error analysis criteria, such as the Fisher information matrix, Cramér-Rao bound, and parameter sensitivity, to assess the benefits and limitations of each. The error equation is validated through several uni- and bivariate estimations of lithium-ion battery electrochemical parameters using experimental data. These results are also analyzed with the error equation to study the error compositions and parameter identifiability under different data. Finally, we show that adding target parameters to the estimation without increasing the amount of data intrinsically reduces the robustness of the results to system uncertainties. 
  4. Ruiz, Francisco; Dy, Jennifer; van de Meent, Jan-Willem (Ed.)
    We study discrete distribution estimation under user-level local differential privacy (LDP). In user-level $$\varepsilon$$-LDP, each user has $$m\ge1$$ samples and the privacy of all $$m$$ samples must be preserved simultaneously. We resolve the following dilemma: While on the one hand having more samples per user should provide more information about the underlying distribution, on the other hand, guaranteeing the privacy of all $$m$$ samples should make the estimation task more difficult. We obtain tight bounds for this problem under almost all parameter regimes. Perhaps surprisingly, we show that in suitable parameter regimes, having $$m$$ samples per user is equivalent to having $$m$$ times more users, each with only one sample. Our results demonstrate interesting phase transitions for $$m$$ and the privacy parameter $$\varepsilon$$ in the estimation risk. Finally, connecting with recent results on shuffled DP, we show that combined with random shuffling, our algorithm leads to optimal error guarantees (up to logarithmic factors) under the central model of user-level DP in certain parameter regimes. We provide several simulations to verify our theoretical findings. 
  5. Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data, yielding user-friendly preference queries that are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.