115 privacy policies from the OPP-115 corpus have been re-annotated with the specific data retention periods they disclose, aligned with the GDPR requirements set out in Art. 13(2)(a). The retention periods have been categorized into six distinct cases:
- C0: No data retention period is indicated in the privacy policy/segment.
- C1: A specific data retention period is indicated (e.g., days, weeks, months...).
- C2: It is indicated that the data will be stored indefinitely.
- C3: A criterion is stated from which a defined storage period can be inferred (e.g., as long as the user has an active account).
- C4: It is indicated that personal data will be stored for an unspecified period, for fraud-prevention, legal, or security reasons.
- C5: It is indicated that personal data will be stored for an unspecified period, for purposes other than fraud prevention, legal compliance, or security.
Note: If the privacy policy or segment matches more than one case, the case with the highest value was annotated (e.g., if both C2 and C4 apply, C4 is annotated). The resulting ground-truth dataset then served as validation for our proposed ChatGPT-based method, whose results are also included in this dataset.
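The tie-breaking rule above (highest-numbered case wins) can be sketched as a small hypothetical helper; the function name is an illustration, not part of the dataset's tooling:

```python
def resolve_case(matched_cases):
    """Return the single annotated case for a policy segment.

    matched_cases: iterable of case labels such as {"C2", "C4"}.
    Falls back to "C0" when no retention period is indicated.
    """
    if not matched_cases:
        return "C0"
    # Pick the case with the highest numeric suffix (labels are C0-C5).
    return max(matched_cases, key=lambda c: int(c[1:]))

print(resolve_case({"C2", "C4"}))  # C4, matching the example above
```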
Columns description:
- policy_id: ID of the policy in the OPP-115 dataset
- policy_name: Domain of the privacy policy
- policy_text: Privacy policy text collected at the time of OPP-115 dataset creation
- info_type_value: Type of personal data to which the data retention refers
- retention_period: Retention period annotated by OPP-115 annotators
- actual_case: Our annotated case, ranging from C0 to C5
- GPT_case: ChatGPT's classification of the case identified in the segment
- actual_Comply_GDPR: Boolean; True if the policy apparently complies with GDPR (cases C1-C5), False if not (case C0)
- GPT_Comply_GDPR: Boolean; True if ChatGPT judged the policy to apparently comply with GDPR (cases C1-C5), False if not (case C0)
- paragraphs_retention_period: List of the paragraphs annotated as Data Retention by OPP-115 annotators, with our red text describing the relevant information used for our annotation decision
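Given this schema, comparing the manual annotations against the ChatGPT output reduces to two agreement rates. The following sketch uses invented toy rows in place of the real file, so only the column names and the C0-vs-C1-C5 compliance rule come from the description above:

```python
import pandas as pd

# Toy rows standing in for the real dataset; the values are invented.
df = pd.DataFrame({
    "actual_case": ["C0", "C3", "C4", "C1"],
    "GPT_case":    ["C0", "C3", "C5", "C1"],
})

# Derive the compliance flags as described: True for C1-C5, False for C0.
df["actual_Comply_GDPR"] = df["actual_case"] != "C0"
df["GPT_Comply_GDPR"] = df["GPT_case"] != "C0"

# Agreement on the exact case and on the coarser GDPR-compliance flag.
case_agreement = (df["actual_case"] == df["GPT_case"]).mean()                 # 0.75
gdpr_agreement = (df["actual_Comply_GDPR"] == df["GPT_Comply_GDPR"]).mean()   # 1.0
```

Note that the two rates can differ: in the toy data, one segment is misclassified C4 vs. C5, yet both labels imply apparent GDPR compliance, so the compliance flags still agree everywhere.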
This content will become publicly available on July 25, 2026
Taxonomizing Synthetic Data for Law
Synthetic data is increasingly important in data usage and AI design, creating novel legal and policy dilemmas. All too often, discussions of synthetic data treat it as entirely distinct from “real,” collected data, overlooking the risks posed by different kinds and uses of synthetic data. This piece comments on Michal Gal and Orla Lynskey’s work, which persuasively argues that synthetic data will transform information privacy, market competition, and data quality. While the risks posed by synthetic data depend on its connection to collected data, we argue that background knowledge and assumptions about ground truth used to create it are at least as important. We bring that focus to Gal and Lynskey’s taxonomy of synthetic data, arguing that it is essential to grasp synthetic data’s legal and policy implications. As such, we divide synthetic data into (1) transformed data, which modifies collected data to preserve certain statistical properties for an end use; (2) augmented data, which relies on assumptions to bolster a collected dataset’s fidelity to the ground truth; and (3) simulated data, which relies almost entirely on background knowledge and ground-truth assumptions. As policymakers weigh whether to incentivize, mandate, or discourage the use of synthetic data, they should consider the validity of the ground-truth assumptions used in producing that data.
- Award ID(s):
- 2131532
- PAR ID:
- 10649359
- Publisher / Repository:
- SSRN
- Date Published:
- Journal Name:
- Iowa law review online
- ISSN:
- 2766-3051
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Off-policy evaluation, and the complementary problem of policy learning, use historical data collected under a logging policy to estimate and/or optimize the value of a target policy. Methods for these tasks typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. Absent such an overlap assumption, existing work either relies on a well-specified model or optimizes needlessly conservative bounds. In this work, we develop methods for no-overlap policy evaluation without a well-specified model, relying instead on non-parametric assumptions on the expected outcome, with a particular focus on Lipschitz smoothness. Under such assumptions we are able to provide sharp bounds on the off-policy value, along with asymptotically optimal estimators of those bounds. For Lipschitz smoothness, we construct a pair of linear programs that upper and lower bound the contribution of the no-overlap region to the off-policy value. We show that these programs have a concise closed-form solution, and that their solutions converge under the Lipschitz assumption to the sharp partial identification bounds at a minimax optimal rate, up to log factors. We demonstrate the effectiveness of our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions.
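To illustrate the kind of closed-form envelope such Lipschitz programs reduce to (this is a one-dimensional sketch, not the authors' estimator): if the expected outcome f is L-Lipschitz, then observed points (x_i, y_i) confine f at an unobserved x to the interval [max_i y_i - L|x - x_i|, min_i y_i + L|x - x_i|].

```python
import numpy as np

def lipschitz_bounds(x, xs, ys, L):
    """Sharp pointwise bounds on an L-Lipschitz f(x) given samples (xs, ys)."""
    d = np.abs(xs - x)          # distances to logged covariates
    lower = np.max(ys - L * d)  # tightest lower envelope over all samples
    upper = np.min(ys + L * d)  # tightest upper envelope over all samples
    return lower, upper

xs = np.array([0.0, 1.0])
ys = np.array([0.0, 1.0])
lo, hi = lipschitz_bounds(0.5, xs, ys, L=1.0)  # both 0.5: f is pinned exactly
```

The interval widens as x moves away from the logged data, which mirrors how the no-overlap region's contribution to the off-policy value becomes only partially identified.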
Learning policies in simulation is promising for reducing human effort when training robot controllers. This is especially true for soft robots that are more adaptive and safe but also more difficult to accurately model and control. The sim2real gap is the main barrier to successfully transferring policies from simulation to a real robot. System identification can be applied to reduce this gap, but traditional identification methods require a lot of manual tuning. Data-driven alternatives can tune dynamical models directly from data but are often data hungry, which again demands human effort in collecting data. This work proposes a data-driven, end-to-end differentiable simulator focused on the exciting but challenging domain of tensegrity robots. To the best of the authors' knowledge, this is the first differentiable physics engine for tensegrity robots that supports cable, contact, and actuation modeling. The aim is to develop a reasonably simplified, data-driven simulation, which can learn approximate dynamics with limited ground-truth data. The dynamics must be accurate enough to generate policies that can be transferred back to the ground-truth system. As a first step in this direction, the current work demonstrates sim2sim transfer, where the unknown physical model of MuJoCo acts as a ground-truth system. Two different tensegrity robots, a 6-bar and a 3-bar tensegrity, are used for evaluation and learning of locomotion policies. The results indicate that, when the differentiable engine is used for training, only 0.25% of the ground-truth data are needed to train a policy that works on the ground-truth system, compared to training the policy directly on that system.
Ground truth depth information is necessary for many computer vision tasks. Collecting this information is challenging, especially for outdoor scenes. In this work, we propose utilizing single-view depth prediction neural networks pre-trained on synthetic scenes to generate relative depth, which we call pseudo-depth. This approach is a less expensive option, as the pre-trained neural network obtains accurate depth information from synthetic scenes, which does not require any expensive sensor equipment and takes less time. We measure the usefulness of pseudo-depth from pre-trained neural networks by training indoor/outdoor binary classifiers with and without it. We also compare the difference in accuracy between using pseudo-depth and ground truth depth. We experimentally show that adding pseudo-depth to training achieves a 4.4% performance boost over the non-depth baseline model on DIODE, a large standard test dataset, retaining 63.8% of the performance boost achieved from training a classifier on RGB and ground truth depth. It also boosts performance by 1.3% on another dataset, SUN397, for which ground truth depth is not available. Our result shows that it is possible to take information obtained from a model pre-trained on synthetic scenes and successfully apply it beyond the synthetic domain to real-world data.
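The input construction this describes — stacking predicted pseudo-depth alongside RGB as an extra channel for the classifier — can be sketched as follows; `depth_model` is a hypothetical stand-in for any single-view depth network pre-trained on synthetic scenes:

```python
import numpy as np

def add_pseudo_depth(rgb, depth_model):
    """rgb: (H, W, 3) float array. Returns an (H, W, 4) RGB-D-style input
    where the fourth channel is relative depth predicted by depth_model."""
    pseudo_depth = depth_model(rgb)  # (H, W) relative depth map
    return np.concatenate([rgb, pseudo_depth[..., None]], axis=-1)

# Placeholder "depth network" for illustration only: mean over channels.
fake_model = lambda img: img.mean(axis=-1)
x = add_pseudo_depth(np.zeros((4, 4, 3)), fake_model)  # shape (4, 4, 4)
```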
Villata, S. (Ed.) The European Union’s General Data Protection Regulation (GDPR) has compelled businesses and other organizations to update their privacy policies to state specific information about their data practices. Simultaneously, researchers in natural language processing (NLP) have developed corpora and annotation schemes for extracting salient information from privacy policies, often independently of specific laws. To connect existing NLP research on privacy policies with the GDPR, we introduce a mapping from GDPR provisions to the OPP-115 annotation scheme, which serves as the basis for a growing number of projects to automatically classify privacy policy text. We show that assumptions made in the annotation scheme about the essential topics for a privacy policy reflect many of the same topics that the GDPR requires in these documents. This suggests that OPP-115 continues to be representative of the anatomy of a legally compliant privacy policy, and that the legal assumptions behind it represent the elements of data processing that ought to be disclosed within a policy for transparency. The correspondences we show between OPP-115 and the GDPR suggest the feasibility of bridging existing computational and legal research on privacy policies, benefiting both areas.
