NSF PAR Search | NSF Public Access Repository

Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

https://doi.org/10.1145/3665252.3665267

Rosenblatt, Lucas; Herman, Bernease; Holovenko, Anastasia; Lee, Wonkwon; Loftus, Joshua; McKinnie, Elizabeth; Rumezhak, Taras; Stadnik, Andrii; Howe, Bill; Stoyanovich, Julia (May 2024, ACM SIGMOD Record)

Differential privacy (DP) data synthesizers are increasingly proposed to afford public release of sensitive information, offering theoretical guarantees for privacy (and, in some cases, utility), but limited empirical evidence of utility in practical settings. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, multivariate correlations, the accuracy of trained classifiers, or performance over a query workload. The ability of these results to generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. Our methodology consists of reproducing empirical conclusions of peer-reviewed papers on real, publicly available data, then re-running these experiments on DP synthetic data and comparing the results.
Full Text Available
Epistemic Parity: Reproducibility as an Evaluation Metric for Differential Privacy

https://doi.org/10.14778/3611479.3611517

Rosenblatt, Lucas; Herman, Bernease; Holovenko, Anastasia; Lee, Wonkwon; Loftus, Joshua; McKinnie, Elizabeth; Rumezhak, Taras; Stadnik, Andrii; Howe, Bill; Stoyanovich, Julia (July 2023, Proceedings of the VLDB Endowment)

Differential privacy (DP) data synthesizers are increasingly proposed to afford public release of sensitive information, offering theoretical guarantees for privacy (and, in some cases, utility), but limited empirical evidence of utility in practical settings. Utility is typically measured as the error on representative proxy tasks, such as descriptive statistics, multivariate correlations, the accuracy of trained classifiers, or performance over a query workload. The ability of these results to generalize to practitioners' experience has been questioned in a number of settings, including the U.S. Census. In this paper, we propose an evaluation methodology for synthetic data that avoids assumptions about the representativeness of proxy tasks, instead measuring the likelihood that published conclusions would change had the authors used synthetic data, a condition we call epistemic parity. Our methodology consists of reproducing empirical conclusions of peer-reviewed papers on real, publicly available data, then re-running these experiments on DP synthetic data and comparing the results. We instantiate our methodology over a benchmark of recent peer-reviewed papers that analyze public datasets in the ICPSR social science repository. We model quantitative claims computationally to automate the experimental workflow, and model qualitative claims by reproducing visualizations and comparing the results manually. We then generate DP synthetic datasets using multiple state-of-the-art mechanisms, and estimate the likelihood that these conclusions will hold. We find that, for reasonable privacy regimes, state-of-the-art DP synthesizers are able to achieve high epistemic parity for several papers in our benchmark. However, some papers, and particularly some specific findings, are difficult to reproduce for any of the synthesizers. Given these results, we advocate for a new class of mechanisms that can reorder the priorities for DP data synthesis: favor stronger guarantees for utility (as measured by epistemic parity) and offer privacy protection with a focus on application-specific threat models and risk-assessment.
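The workflow described in this abstract (check a finding on real data, re-run it on many synthetic datasets, estimate how often the conclusion holds) can be illustrated with a minimal sketch. Everything named here is hypothetical: `finding_holds` is an invented claim, and `toy_synthesizer` is a bootstrap-plus-noise stand-in that is not differentially private and is not one of the mechanisms evaluated in the paper.

```python
import random
import statistics

def finding_holds(rows):
    """Hypothetical quantitative claim: group 'a' has a higher mean outcome than group 'b'."""
    a = [y for g, y in rows if g == "a"]
    b = [y for g, y in rows if g == "b"]
    # A synthetic sample can lose a group entirely; treat that as "not reproduced".
    if not a or not b:
        return False
    return statistics.mean(a) > statistics.mean(b)

def toy_synthesizer(rows, noise_scale, rng):
    """Stand-in for a DP synthesizer: bootstrap resample plus Gaussian noise.
    ASSUMPTION: this is NOT differentially private; it only mimics a noisy release."""
    sample = [rng.choice(rows) for _ in rows]
    return [(g, y + rng.gauss(0.0, noise_scale)) for g, y in sample]

def epistemic_parity(rows, finding, synthesizer, n_trials=200, seed=0):
    """Fraction of synthetic datasets on which the finding agrees with the real-data result."""
    rng = random.Random(seed)
    real_result = finding(rows)
    agree = sum(finding(synthesizer(rows, rng)) == real_result for _ in range(n_trials))
    return agree / n_trials

# Invented "real" data: two groups whose true means differ by 1.
rng0 = random.Random(42)
real = [("a", rng0.gauss(1.0, 1.0)) for _ in range(100)]
real += [("b", rng0.gauss(0.0, 1.0)) for _ in range(100)]

low_noise = epistemic_parity(real, finding_holds, lambda r, g: toy_synthesizer(r, 0.1, g))
high_noise = epistemic_parity(real, finding_holds, lambda r, g: toy_synthesizer(r, 50.0, g))
print(low_noise, high_noise)
```

In this toy setting the noise scale plays the role that a stricter privacy regime would play for a real synthesizer: the more the release is perturbed, the less often the original conclusion survives.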
Full Text Available
Surj: Ontological Learning for Fast, Accurate, and Robust Hierarchical Multi-label Classification

https://doi.org/10.1145/3487553.3524723

Yang, Sean T.; Howe, Bill (April 2022, WWW '22: Companion Proceedings of the Web Conference 2022)

Full Text Available
Responsible data management

https://doi.org/10.1145/3488717

Stoyanovich, Julia; Abiteboul, Serge; Howe, Bill; Jagadish, H. V.; Schelter, Sebastian (June 2022, Communications of the ACM)

Perspectives on the role and responsibility of the data-management research community in designing, developing, using, and overseeing automated decision systems.
Full Text Available
Integrative urban AI to expand coverage, access, and equity of urban data

https://doi.org/10.1140/epjs/s11734-022-00475-z

Howe, Bill; Brown, Jackson Maxfield; Han, Bin; Herman, Bernease; Weber, Nic; Yan, An; Yang, Sean; Yang, Yiwei (July 2022, The European Physical Journal Special Topics)

Full Text Available
EquiTensors: Learning Fair Integrations of Heterogeneous Urban Data

https://doi.org/10.1145/3448016.3452777

Yan, An; Howe, Bill (June 2021, SIGMOD/PODS '21: Proceedings of the 2021 International Conference on Management of Data)
Neural methods are state-of-the-art for urban prediction problems such as transportation resource demand, accident risk, crowd mobility, and public safety. Model performance can be improved by integrating exogenous features from open data repositories (e.g., weather, housing prices, traffic, etc.), but these uncurated sources are often too noisy, incomplete, and biased to use directly. We propose to learn integrated representations, called EquiTensors, from heterogeneous datasets that can be reused across a variety of tasks. We align datasets to a consistent spatio-temporal domain, then describe an unsupervised model based on convolutional denoising autoencoders to learn shared representations. We extend this core integrative model with adaptive weighting to prevent certain datasets from dominating the signal. To combat discriminatory bias, we use adversarial learning to remove correlations with a sensitive attribute (e.g., race or income). Experiments with 23 input datasets and 4 real applications show that EquiTensors could help mitigate the effects of the sensitive information embodied in the biased data. Meanwhile, applications using EquiTensors outperform models that ignore exogenous features and are competitive with "oracle" models that use hand-selected datasets.
Full Text Available
COVID-19 Brings Data Equity Challenges to the Fore

https://doi.org/10.1145/3440889

Jagadish, H. V.; Stoyanovich, Julia; Howe, Bill (March 2021, Digital Government: Research and Practice)
The COVID-19 pandemic is compelling us to make crucial data-driven decisions quickly, bringing together diverse and unreliable sources of information without the usual quality control mechanisms we may employ. These decisions are consequential at multiple levels: They can inform local, state, and national government policy, be used to schedule access to physical resources such as elevators and workspaces within an organization, and inform contact tracing and quarantine actions for individuals. In all these cases, significant inequities are likely to arise and to be propagated and reinforced by data-driven decision systems. In this article, we propose a framework, called FIDES, for surfacing and reasoning about data equity in these systems.
Full Text Available
The Many Facets of Data Equity

Jagadish, H. V.; Stoyanovich, Julia; Howe, Bill (January 2021, Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference)
Costa, Constantinos; Pitoura, Evaggelia (Ed.)
Data-driven systems can be unfair in many different ways. All too often, as data scientists, we focus narrowly on one technical aspect of fairness. In this paper, we attempt to address equity broadly, and identify the many different ways in which it is manifest in data-driven systems.
Full Text Available
JECL: Joint Embedding and Cluster Learning for Image-Text Pairs

https://doi.org/10.1109/ICPR48806.2021.9412667

Yang, Sean T.; Huang, Kuan-Hao; Howe, Bill (January 2021, 2020 25th International Conference on Pattern Recognition (ICPR))
We propose JECL, a method for clustering image-caption pairs by training parallel encoders with regularized clustering and alignment objectives, simultaneously learning both representations and cluster assignments. These image-caption pairs arise frequently in high-value applications where structured training data is expensive to produce, but free-text descriptions are common. JECL trains by minimizing the Kullback-Leibler divergence between the distributions of the images and text and a combined joint target distribution, and by optimizing the Jensen-Shannon divergence between the soft cluster assignments of the images and text. Regularizers are also applied to JECL to prevent trivial solutions. Experiments show that JECL outperforms both single-view and multi-view methods on large benchmark image-caption datasets, and is remarkably robust to missing captions and varying data sizes.
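The two divergences named in this abstract can be computed for discrete distributions as in the minimal sketch below. The soft cluster assignments are invented numbers for a single hypothetical image-caption pair, and the code only illustrates the standard definitions; it is not JECL's training objective.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists.
    Terms with p_i == 0 contribute 0; q_i must be positive wherever p_i is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical soft assignments of one image and its caption to three clusters.
image_assign = [0.7, 0.2, 0.1]
text_assign = [0.6, 0.3, 0.1]

print(kl_divergence(image_assign, text_assign))
print(js_divergence(image_assign, text_assign))
```

Unlike KL, the Jensen-Shannon divergence is symmetric and bounded above by log 2 (with natural logarithms), which makes it a natural choice for comparing two views that play equal roles.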
Full Text Available
Fairness-Aware Demand Prediction for New Mobility

https://doi.org/10.1609/aaai.v34i01.5458

Yan, An; Howe, Bill (June 2020, Proceedings of the AAAI Conference on Artificial Intelligence)

Emerging transportation modes, including car-sharing, bike-sharing, and ride-hailing, are transforming urban mobility yet have been shown to reinforce socioeconomic inequity. These services rely on accurate demand prediction, but the demand data on which these models are trained reflect biases around demographics, socioeconomic conditions, and entrenched geographic patterns. To address these biases and improve fairness, we present FairST, a fairness-aware demand prediction model for spatiotemporal urban applications, with emphasis on new mobility. We use 1D (time-varying, space-constant), 2D (space-varying, time-constant) and 3D (both time- and space-varying) convolutional branches to integrate heterogeneous features, while including fairness metrics as a form of regularization to improve equity across demographic groups. We propose two spatiotemporal fairness metrics, region-based fairness gap (RFG), applicable when demographic information is provided as a constant for a region, and individual-based fairness gap (IFG), applicable when a continuous distribution of demographic information is available. Experimental results on bike share and ride share datasets show that FairST can reduce inequity in demand prediction for multiple sensitive attributes (i.e., race, age, and education level), while achieving better accuracy than even state-of-the-art fairness-oblivious methods.
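The idea of including a fairness metric as a regularization term can be sketched generically as a prediction loss plus a weighted gap between demographic groups. The gap used below (the spread of mean predicted demand across groups) is a hypothetical stand-in chosen for illustration; the abstract does not give the definitions of RFG or IFG, and this is not either of them.

```python
def mean(values):
    return sum(values) / len(values)

def fairness_regularized_loss(y_true, y_pred, groups, lam):
    """Return (total loss, MSE, group gap) for a toy fairness-regularized objective.

    ASSUMPTION: the gap is max minus min of the per-group mean prediction,
    a generic placeholder and not the paper's RFG or IFG metric.
    """
    mse = mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])
    by_group = {}
    for g, p in zip(groups, y_pred):
        by_group.setdefault(g, []).append(p)
    group_means = [mean(v) for v in by_group.values()]
    gap = max(group_means) - min(group_means)
    # lam trades accuracy against equity: lam = 0 recovers plain MSE.
    return mse + lam * gap, mse, gap

# Invented demand values for four regions in two demographic groups.
y_true = [10.0, 12.0, 4.0, 6.0]
y_pred = [9.0, 13.0, 5.0, 5.0]
groups = ["x", "x", "y", "y"]

loss, mse, gap = fairness_regularized_loss(y_true, y_pred, groups, lam=0.5)
print(loss, mse, gap)
```

Raising `lam` pushes a model trained on this objective toward predictions with a smaller gap between groups, at some cost in raw accuracy.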
Full Text Available
