

Title: Discussion of ‘A Unified Framework for De-Duplication and Population Size Estimation’ by Tancredi, Steorts, and Liseo.
Population size estimation techniques, such as multiple-systems or capture-recapture estimation, typically require multiple samples from the study population, along with information on which individuals are included in which samples. In many contexts, these samples come from existing data sources that contain certain information on the individuals but no unique identifiers. The goal of record linkage and duplicate detection techniques is to identify unique individuals across and within samples based on the information collected on them, which might correspond to basic partial identifiers, such as given and family name, and other demographic information. Therefore, record linkage and duplicate detection are often needed to generate the input for a given population size estimation technique that a researcher might want to use. Linkage decisions, however, are subject to uncertainty when partial identifiers are limited or contain errors and missingness, and therefore, intuitively, uncertainty in the linkage and deduplication process should be taken into account at the population size estimation stage.
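As a hedged illustration of the two-sample case (not the discussed paper's method), the classical Lincoln-Petersen estimator, in Chapman's bias-corrected form, infers total population size from two overlapping samples; the function name and the numbers below are ours:

```python
# Hypothetical illustration: two-sample capture-recapture.
# Chapman's bias-corrected version of the Lincoln-Petersen estimator.
def chapman_estimate(n1, n2, m):
    """n1, n2: sizes of the two samples; m: individuals identified in both.
    Returns an integer estimate of the total population size."""
    if m < 0 or m > min(n1, n2):
        raise ValueError("overlap must satisfy 0 <= m <= min(n1, n2)")
    return (n1 + 1) * (n2 + 1) // (m + 1) - 1

# e.g. samples of 200 and 150 individuals, with 30 linked across both:
estimate = chapman_estimate(200, 150, 30)  # -> 978
```

Note that the overlap count m is exactly what record linkage and duplicate detection produce, so linkage errors feed directly into the estimate — which is why the linkage uncertainty discussed above matters.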
Award ID(s): 1852841
PAR ID: 10389235
Journal Name: Bayesian Analysis
Volume: 15
Issue: 2
ISSN: 1931-6690
Page Range / eLocation ID: 659-663
Sponsoring Org: National Science Foundation
More Like this
  1. Merging datafiles containing information on overlapping sets of entities is a challenging task in the absence of unique identifiers, and is further complicated when some entities are duplicated in the datafiles. Most approaches to this problem have focused on linking two files assumed to be free of duplicates, or on detecting which records in a single file are duplicates. However, it is common in practice to encounter scenarios that fit somewhere in between or beyond these two settings. We propose a Bayesian approach for the general setting of multifile record linkage and duplicate detection. We use a novel partition representation to propose a structured prior for partitions that can incorporate prior information about the data collection processes of the datafiles in a flexible manner, and extend previous models for comparison data to accommodate the multifile setting. We also introduce a family of loss functions to derive Bayes estimates of partitions that allow uncertain portions of the partitions to be left unresolved. The performance of our proposed methodology is explored through extensive simulations. Supplementary materials for this article are available online. 
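As a toy illustration of what a "partition of records" means in the multifile setting (this naive exact-match grouping is a crude stand-in for the Bayesian model; all names and data are ours):

```python
# Naive stand-in for multifile record linkage + duplicate detection:
# cluster records from any number of files by exact agreement on key fields.
# The resulting partition is encoded as a record-id -> cluster-label map.
def naive_partition(records, key_fields):
    labels, seen = {}, {}
    for rid, rec in records.items():
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen[key] = len(seen)  # open a new cluster
        labels[rid] = seen[key]
    return labels

recs = {
    ("file1", 0): {"name": "ana", "year": 1980},
    ("file2", 0): {"name": "ana", "year": 1980},  # cross-file link
    ("file2", 1): {"name": "ana", "year": 1980},  # within-file duplicate
    ("file3", 0): {"name": "bo",  "year": 1975},
}
labels = naive_partition(recs, ["name", "year"])
```

The Bayesian approach replaces this deterministic rule with a posterior distribution over such partitions, which is what allows uncertain portions to be left unresolved.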
  2. Summary: Computerised record linkage methods help us combine multiple data sets from different sources when a single data set with all necessary information is unavailable or when data collection on additional variables is time consuming and extremely costly. Linkage errors are inevitable in the linked data set because of the unavailability of error‐free unique identifiers. A small amount of linkage errors can lead to substantial bias and increased variability in estimating parameters of a statistical model. In this paper, we propose a unified theory for statistical analysis with linked data. Our proposed method, unlike the ones available for secondary data analysis of linked data, exploits record linkage process data as an alternative to taking a costly sample to evaluate error rates from the record linkage procedure. A jackknife method is introduced to estimate bias, covariance matrix and mean squared error of our proposed estimators. Simulation results are presented to evaluate the performance of the proposed estimators that account for linkage errors. 
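The delete-one jackknife mentioned above can be sketched generically; this minimal version (our notation, applied to a toy estimator rather than the paper's linkage-error-adjusted ones) returns a bias-corrected estimate and a variance estimate:

```python
# Delete-one jackknife for bias and variance of a generic estimator.
def jackknife(data, estimator):
    n = len(data)
    theta_hat = estimator(data)
    # leave-one-out replicates of the estimator
    loo = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    theta_bar = sum(loo) / n
    bias = (n - 1) * (theta_bar - theta_hat)
    variance = (n - 1) / n * sum((t - theta_bar) ** 2 for t in loo)
    return theta_hat - bias, variance  # bias-corrected estimate, variance

mean = lambda xs: sum(xs) / len(xs)
est, var = jackknife([1.0, 2.0, 3.0, 4.0], mean)  # est = 2.5 (mean is unbiased)
```

For the sample mean, the jackknife bias correction is zero and the variance estimate reduces to the usual s²/n, which is a quick sanity check on the implementation.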
  3. Abstract: The nocturnal aye-aye, Daubentonia madagascariensis, is one of the most elusive lemurs on the island of Madagascar. The timing of its activity and arboreal lifestyle has generally made it difficult to obtain accurate assessments of population size using traditional census methods. Therefore, alternative estimates provided by population genetic inference are essential for yielding much-needed information for conservation measures and for enabling ecological and evolutionary studies of this species. Here, we utilize genomic data from 17 individuals—including 5 newly sequenced, high-coverage genomes—to estimate this demographic history. Essential to this estimation are recently published annotations of the aye-aye genome which allow for variation at putatively neutral genomic regions to be included in the estimation procedures, and regions subject to selective constraints, or in linkage to such sites, to be excluded owing to the biasing effects of selection on demographic inference. By comparing a variety of demographic estimation tools to develop a well-supported model of population history, we find strong support for two demes, separating northern Madagascar from the rest of the island. Additionally, we find that the aye-aye has experienced two severe reductions in population size. The first occurred rapidly, ∼3,000 to 5,000 years ago, and likely corresponded with the arrival of humans to Madagascar. The second occurred over the past few decades and is likely related to substantial habitat loss, suggesting that the species is still undergoing population decline and remains at great risk for extinction. 
  4. We consider causal inference for observational studies with data spread over two files. One file includes the treatment, outcome, and some covariates measured on a set of individuals, and the other file includes additional causally relevant covariates measured on a partially overlapping set of individuals. By linking records in the two databases, the analyst can control for more covariates, thereby reducing the risk of bias compared to using one file alone. When analysts do not have access to a unique identifier that enables perfect, error-free linkages, they typically rely on probabilistic record linkage to construct a single linked data set, and estimate causal effects using these linked data. This typical practice does not propagate uncertainty from imperfect linkages to the causal inferences. Further, it does not take advantage of relationships among the variables to improve the linkage quality. We address these shortcomings by fusing regression-assisted, Bayesian probabilistic record linkage with causal inference. The Markov chain Monte Carlo sampler generates multiple plausible linked data files as byproducts that analysts can use for multiple imputation inferences. Here, we show results for two causal estimators based on propensity score overlap weights. Using simulations and data from the Italy Survey on Household Income and Wealth, we show that our approach can improve the accuracy of estimated treatment effects. 
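The "multiple plausible linked data files" device above leads to multiple-imputation-style pooling. A minimal sketch of Rubin's combining rules (our notation and numbers; the paper's actual estimators use propensity score overlap weights, not shown here):

```python
# Rubin's rules: pool an estimate computed on M plausible linked data sets.
def rubin_combine(estimates, variances):
    M = len(estimates)
    q_bar = sum(estimates) / M                   # pooled point estimate
    u_bar = sum(variances) / M                   # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (M - 1)  # between-imputation
    total_variance = u_bar + (1 + 1 / M) * b     # linkage uncertainty enters via b
    return q_bar, total_variance

# e.g. a treatment effect estimated on three plausible linked files:
effect, var = rubin_combine([1.0, 1.2, 0.9], [0.04, 0.05, 0.04])
```

The between-imputation term b is what carries the linkage uncertainty into the final standard error, which is exactly what the single-linked-file practice omits.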
  5. Wren, Jonathan (Ed.)
    Abstract
    Motivation: Mathematical models in systems biology help generate hypotheses, guide experimental design, and infer the dynamics of gene regulatory networks. These models are characterized by phenomenological or mechanistic parameters, which are typically hard to measure. Therefore, efficient parameter estimation is central to model development. Global optimization techniques, such as evolutionary algorithms (EAs), are applied to estimate model parameters by inverse modeling, i.e. calibrating models by minimizing a function that evaluates a measure of the error between model predictions and experimental data. EAs search for the "fittest individuals" by generating a large population of individuals using strategies like recombination and mutation over multiple "generations." Typically, only a few individuals from each generation are used to create new individuals in the next generation. Improved Evolutionary Strategy by Stochastic Ranking (ISRES), proposed by Runarsson and Yao, is one such EA that is widely used in systems biology to estimate parameters. ISRES uses information from at most a pair of individuals in any generation to create a new population to minimize the error. In this article, we propose an efficient evolutionary strategy, ISRES+, which builds on ISRES by combining information from all individuals across the population and across all generations to develop a better understanding of the fitness landscape.
    Results: ISRES+ uses the additional information generated by the algorithm during evolution to approximate the local neighborhood around the best-fit individual using linear least-squares fits in one and two dimensions, enabling efficient parameter estimation. ISRES+ outperforms ISRES and results in fitter individuals with a tighter distribution over multiple runs, such that a typical run of ISRES+ estimates parameters with a higher goodness-of-fit compared with ISRES. 
    Availability and implementation: Algorithm and implementation on GitHub: https://github.com/gtreeves/isres-plus-bandodkar-2022. 
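As a hedged sketch of the general device described above, here is a bare-bones (mu, lambda) evolution strategy with Gaussian mutation. This is a deliberately simplified relative of ISRES for illustration only: it omits recombination, stochastic ranking, and the ISRES+ least-squares machinery, and all names are ours:

```python
import random

def es_minimize(loss, dim, mu=5, lam=20, sigma=0.3, generations=60, seed=0):
    """Bare-bones (mu, lambda) evolution strategy: each generation, keep the
    mu fittest individuals and mutate them to produce lam new offspring."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(lam)]
    for _ in range(generations):
        parents = sorted(pop, key=loss)[:mu]      # rank by model-vs-data error
        pop = [[x + rng.gauss(0.0, sigma) for x in rng.choice(parents)]
               for _ in range(lam)]               # mutation only, no recombination
        sigma *= 0.95                             # simple step-size decay
    return min(pop, key=loss)

# Toy "inverse modeling" objective with known optimum at (0.5, -0.25):
def sq_error(p):
    return (p[0] - 0.5) ** 2 + (p[1] + 0.25) ** 2

best = es_minimize(sq_error, dim=2)
```

Like ISRES, this calibrates parameters by minimizing an error function between predictions and data; the ISRES+ contribution is to reuse the evaluations accumulated across all generations, rather than discarding them as this sketch does.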