Title: Hunting for Discriminatory Proxies in Linear Regression Models
A machine learning model may exhibit discrimination when used to make decisions involving people. One potential cause for such outcomes is that the model uses a statistical proxy for a protected demographic attribute. In this paper we formulate a definition of proxy use for the setting of linear regression and present algorithms for detecting proxies. Our definition follows recent work on proxies in classification models, and characterizes a model's constituent behavior that: 1) correlates closely with a protected random variable, and 2) is causally influential in the overall behavior of the model. We show that proxies in linear regression models can be efficiently identified by solving a second-order cone program, and further extend this result to account for situations where the use of a certain input variable is justified as a "business necessity". Finally, we present empirical results on two law enforcement datasets that exhibit varying degrees of racial disparity in prediction outcomes, demonstrating that proxies shed useful light on the causes of discriminatory behavior in models.
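The two conditions in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give the exact measures, so the sketch assumes that association is the squared Pearson correlation between a component and the protected variable, and that influence is the component's share of the model output's variance. The function names, the synthetic data, and the `alpha` weighting are all hypothetical, and the sketch only scores components supplied by hand rather than searching for them with a second-order cone program.

```python
import random
from statistics import mean, variance


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def component_scores(rows, beta, alpha, z):
    """Score one component of the linear model y = sum_j beta[j] * x[j].

    The component keeps the fraction alpha[j] of each term (assumed form).
    Returns (association, influence) under the assumed definitions:
      association = squared correlation between component and z
      influence   = Var(component) / Var(model output)
    """
    p = [sum(a * b * x for a, b, x in zip(alpha, beta, row)) for row in rows]
    y = [sum(b * x for b, x in zip(beta, row)) for row in rows]
    association = covariance(p, z) ** 2 / (variance(p) * variance(z))
    influence = variance(p) / variance(y)
    return association, influence


# Synthetic data (hypothetical): feature 0 tracks the protected attribute z,
# feature 1 is independent noise.
random.seed(0)
n = 2000
z = [float(random.randint(0, 1)) for _ in range(n)]
rows = [(zi + random.gauss(0, 0.1), random.gauss(0, 1)) for zi in z]
beta = (1.0, 1.0)

assoc0, infl0 = component_scores(rows, beta, (1.0, 0.0), z)
assoc1, infl1 = component_scores(rows, beta, (0.0, 1.0), z)
print(f"feature 0: association={assoc0:.3f}, influence={infl0:.3f}")
print(f"feature 1: association={assoc1:.3f}, influence={infl1:.3f}")
```

Under these assumptions, a component is flagged only when both scores exceed chosen thresholds. Here the first feature is strongly associated with `z` but accounts for a minority of the output variance, while the second is influential but not associated.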
Award ID(s):
1704845
PAR ID:
10095671
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Page Range / eLocation ID:
4568-4578
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. During the mid‐Holocene (MH: ∼6,000 years Before Present) and Last Interglacial (LIG: ∼129,000–116,000 years Before Present), differences in the seasonal and latitudinal distribution of insolation drove Northern Hemisphere high‐latitude warming comparable to that projected for the end of the 21st century in low emissions scenarios. Paleoclimate proxy records point to distinct but regionally variable hydroclimatic changes during these past warm intervals. However, model simulations have generally disagreed on North American regional moisture patterns during the MH and LIG. To investigate how closely the latest generation of models associated with the Paleoclimate Model Intercomparison Project (PMIP4) reproduces proxy‐inferred moisture patterns during recent warm periods, we compare hydroclimate output from 17 PMIP4 models with newly updated compilations of moisture‐sensitive North American proxy records during the MH and LIG. Agreement is lower for the MH, with models producing wet anomalies across the western United States (US) where most proxies indicate increased aridity relative to the preindustrial period. The models that agree most closely with the LIG proxy compilation display relative wetness in the eastern US and Alaska, and dryness in the northwest and central US. An assessment of atmospheric dynamics using an ensemble of the three LIG simulations that best agree with the proxies suggests that weaker winter North Pacific pressure gradients and steeper summer North Pacific and Atlantic gradients drive LIG precipitation patterns. Our updated compilations and proxy‐model comparisons offer a tool for benchmarking climate models and their performance in simulating climate states that are warmer than present.
  2. Geothermal heat flow beneath the Greenland and Antarctic ice sheets is an important boundary condition for ice sheet dynamics, but is rarely measured directly and therefore is inferred indirectly from proxies (e.g. seismic structure, magnetic Curie depth, surface topography). We seek to improve the understanding of the relationship between heat flow and one such proxy, seismic structure, and determine how well heat flow data can be predicted from the structure (the characterization problem). We also seek to quantify the extent to which this relationship can be extrapolated from one continent to another (the transportability problem). To address these problems, we use direct heat flow observations and new seismic structural information in the contiguous United States and Europe, and construct three Machine Learning models of the relationship with different levels of complexity (Linear Regression, Decision Tree and Random Forest). We compare these models in terms of their interpretability, the predicted heat flow accuracy within a continent and the accuracy of the extrapolation between Europe and the United States. The Random Forest and Decision Tree models are the most accurate within a continent, while the Linear Regression and Decision Tree models are the most accurate upon extrapolation between continents. The Decision Tree model uniquely illuminates the regional variations of the relationship between heat flow and seismic structure. From the Decision Tree model, uppermost mantle shear wave speed, crustal shear wave speed and Moho depth together explain more than half of the observed heat flow variations in both the United States [$r^2 \approx 0.6$ (coefficient of determination), $\mathrm{RMSE} \approx 8\,\mathrm{mW}\,\mathrm{m}^{-2}$ (Root Mean Squared Error)] and Europe ($r^2 \approx 0.5$, $\mathrm{RMSE} \approx 13\,\mathrm{mW}\,\mathrm{m}^{-2}$), such that uppermost mantle shear wave speed is the most important. Extrapolating the U.S.-trained models to Europe reasonably predicts the geographical distribution of heat flow [$\rho = 0.48$ (correlation coefficient)], but not the absolute amplitude of the variations ($r^2 = 0.17$), similarly from Europe to the United States ($\rho = 0.66$, $r^2 = 0.24$). The deterioration of accuracy upon extrapolation is caused by differences between the continents in how seismic structure is imaged, the heat flow data and intrinsic crustal radiogenic heat production. Our methods have the potential to improve the reliability and resolution of heat flow inferences across Antarctica and the validation and cross-validation procedures we present can be applied to heat flow proxies other than seismic structure, which may help resolve inconsistencies between existing subglacial heat flow values inferred using different proxies.
  3. Global climate change is altering patterns of temperature variation, with unpredictable consequences for species and ecosystems. The Metabolic Theory of Ecology (MTE) provides a powerful framework for predicting climate change impacts on ectotherm metabolic performance. MTE postulates that physiological and ecological processes are limited by organism metabolic rates, which scale predictably with body mass and temperature. The purpose of this study was to determine if different metabolic proxies generate different empirical estimates of key MTE model parameters for the aquatic frog Xenopus laevis when allowed to exhibit normal diving behavior. We used a novel methodological approach in combining a flow-through respirometry setup with the open-source Arduino platform to measure mass and temperature effects on 4 different proxies for whole-body metabolism (total O2 consumption, cutaneous O2 consumption, pulmonary O2 consumption, and ventilation frequency), following thermal acclimation to one of 3 temperatures (8°C, 17°C, or 26°C). Different metabolic proxies generated different mass-scaling exponents (b) and activation energy (EA) estimates, highlighting the importance of metabolic proxy selection when parameterizing MTE-derived models. Animals acclimated to 17°C had higher O2 consumption across all temperatures, but thermal acclimation did not influence estimates of key MTE parameters EA and b. Cutaneous respiration generated lower MTE parameters than pulmonary respiration, consistent with temperature and mass constraints on dissolved oxygen availability, SA:V ratios, and diffusion distances across skin. Our results show that the choice of metabolic proxy can have a big impact on empirical estimates for key MTE model parameters.
  4. The Last Millennium Reanalysis (LMR) utilizes an ensemble methodology to assimilate paleoclimate data for the production of annually resolved climate field reconstructions of the Common Era. Two key elements are the focus of this work: the set of assimilated proxy records and the forward models that map climate variables to proxy measurements. Results based on an updated proxy database and seasonal regression-based forward models are compared to the LMR prototype, which was based on a smaller set of proxy records and simpler proxy models formulated as univariate linear regressions against annual temperature. Validation against various instrumental-era gridded analyses shows that the new reconstructions of surface air temperature and 500 hPa geopotential height are significantly improved (from 10 % to more than 100 %), while improvements in reconstruction of the Palmer Drought Severity Index are more modest. Additional experiments designed to isolate the sources of improvement reveal the importance of the updated proxy records, including coral records for improving tropical reconstructions, and tree-ring density records for temperature reconstructions, particularly in high northern latitudes. Proxy forward models that account for seasonal responses, and dependence on both temperature and moisture for tree-ring width, also contribute to improvements in reconstructed thermodynamic and hydroclimate variables in midlatitudes. The variability of temperature at multidecadal to centennial scales is also shown to be sensitive to the set of assimilated proxies, especially to the inclusion of primarily moisture-sensitive tree-ring-width records.
  5. Climate field reconstructions (CFRs) combine modern observational data with paleoclimatic proxies to estimate climate variables over spatiotemporal grids during time periods when widespread observations of climatic conditions do not exist. The Common Era (CE) has been a period over which many seasonally‐ and annually‐resolved CFRs have been produced on regional to global scales. CFRs over the CE were first produced in the 1970s using dendroclimatic records and linear regression‐based approaches. Since that time, many new CFRs have been produced using a wide range of proxy data sets and reconstruction techniques. We assess the early history of research on CFRs for the CE, which provides context for our review of advances in CFR research over the last two decades. We review efforts to derive gridded hydroclimatic CFRs over continental regions using networks of tree‐ring proxies. We subsequently explore work to produce hemispheric‐ and global‐scale CFRs of surface temperature using multi‐proxy data sets, before specifically reviewing recently‐developed data assimilation techniques and how they have been used to produce simultaneous reconstructions of multiple climatic fields globally. We then review efforts to develop standardized and digitized databases of proxy networks for use in CFR research, before concluding with some thoughts on important next steps for CFR development.