Abstract This paper demonstrates the advantages of sharing information about unknown features of covariates across multiple model components in various nonparametric regression problems including multivariate, heteroscedastic, and semicontinuous responses. In this paper, we present a methodology which allows for information to be shared nonparametrically across various model components using Bayesian sum‐of‐tree models. Our simulation results demonstrate that sharing of information across related model components is often very beneficial, particularly in sparse high‐dimensional problems in which variable selection must be conducted. We illustrate our methodology by analyzing medical expenditure data from the Medical Expenditure Panel Survey (MEPS). To facilitate the Bayesian nonparametric regression analysis, we develop two novel models for analyzing the MEPS data using Bayesian additive regression trees—a heteroskedastic log‐normal hurdle model with a “shrink‐toward‐homoskedasticity” prior and a gamma hurdle model.
Bayesian synthetic prediction of state level poverty using Indian Household Consumer Expenditure Survey Data
Goal 1 of the 2030 Agenda for Sustainable Development, adopted by all United Nations member states in 2015, is to end poverty in all its forms everywhere. The major indicator for monitoring the goal is the so-called headcount ratio or poverty rate, i.e., the proportion or percentage of people living in poverty. In India, where nearly a quarter of the population still lives below the poverty line, poverty needs to be monitored more closely and at shorter intervals (e.g., every year) to evaluate the effectiveness of the planning, programs, and actions undertaken by the governments to eradicate poverty. Poverty rate computation for India depends on two basic ingredients: rural and urban poverty lines for the different states and union territories, and average Monthly Per-capita Consumer Expenditure (MPCE). While MPCE can be obtained every year, usually from the Consumer Expenditure Survey on shorter schedules, with a few exceptions where the information is obtained from another survey, determining poverty lines is a highly complex, costly, and time-consuming process. Poverty lines are essentially determined by a panel of experts who draw their conclusions partly from their subjective opinions and partly from data from multiple sources. The main data source the panel uses is the Consumer Expenditure Survey data with a detailed schedule, which are usually available every five years or so. In this paper, we undertake a feasibility study to explore whether estimates of headcount ratios, or poverty ratios, in intervening years can be provided in the absence of poverty lines by relating poverty ratios to average MPCE through a statistical model. The fitted model can then be used to predict poverty rates for intervening years based on average MPCE. In this work, we explore a few models using Bayesian methodology. We call this ‘synthetic prediction’ because it rests on the synthetic assumption of model invariance over years, which is often used in the small area literature.
While the data-based assessment of our Bayesian synthetic prediction procedure is encouraging, there is great potential for improving the models presented in this paper, e.g., by incorporating more auxiliary data as they become available. In any case, we expect that our preliminary work in this important area will encourage researchers to consider statistical modeling as a possible way to at least partially solve a problem for which no objective solution is currently available.
- Award ID(s):
- 1758808
- PAR ID:
- 10396343
- Date Published:
- Journal Name:
- Statistical Journal of the IAOS
- Volume:
- 38
- Issue:
- 4
- ISSN:
- 1874-7655
- Page Range / eLocation ID:
- 1325 to 1335
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Poverty maps derived from satellite imagery are increasingly used to inform high-stakes policy decisions, such as the allocation of humanitarian aid and the distribution of government resources. Such poverty maps are typically constructed by training machine learning algorithms on a relatively modest amount of “ground truth” data from surveys, and then predicting poverty levels in areas where imagery exists but surveys do not. Using survey and satellite data from ten countries, this paper investigates disparities in representation, systematic biases in prediction errors, and fairness concerns in satellite-based poverty mapping across urban and rural lines, and shows how these phenomena affect the validity of policies based on predicted maps. Our findings highlight the importance of careful error and bias analysis before using satellite-based poverty maps in real-world policy decisions.
-
Abstract. Aims: To examine changes in drinking behavior among United States (US) adults between March 10 and July 21, 2020, a critical period during the COVID‐19 pandemic. Design: Longitudinal, internet‐based panel survey. Setting: The Understanding America Study (UAS), a nationally representative panel of US adults age 18 or older. Participants: A total of 4298 US adults who reported alcohol use. Measurements: Changes in number of reported drinking days from March 11, 2020 through July 21, 2020 in the overall sample and stratified by sex, age, race/ethnicity, household structure, poverty status, and census region. Findings: Compared with March 11, the number of drinking days per week was significantly higher on April 1 by an average of 0.36 days (95% CI = 0.30, 0.43), on May 1 by an average of 0.55 days (95% CI = 0.47, 0.63), on June 1 by an average of 0.41 days (95% CI = 0.33, 0.49), and on July 1 by an average of 0.39 days (95% CI = 0.31, 0.48). Males, White participants, and older adults reported sustained increases in drinking days, whereas female participants and individuals living under the federal poverty line had attenuated drinking days in the latter part of the study period. Conclusions: Between March and mid‐July 2020, adults in the United States reported increases in the number of drinking days, with sustained increases observed among males, White participants, and older adults.
-
Motivated by privacy concerns in long-term longitudinal studies in medical and social science research, we study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a model where, in every time step, each individual reports a new data element, and the goal of the synthesizer is to incrementally update a synthetic dataset in a consistent way to capture a rich class of statistical properties. We give continual synthetic data generation algorithms that preserve two basic types of queries: fixed time window queries and cumulative time queries. We show nearly tight upper bounds on the error rates of these algorithms and demonstrate their empirical performance on realistically sized datasets from the U.S. Census Bureau's Survey of Income and Program Participation.