skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.


Title: Sharing Social Network Data: Differentially Private Estimation of Exponential Family Random-Graph Models
Summary

Motivated by a real life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to release and analyse synthetic graphs to protect privacy of individual relationships captured by the social network while maintaining the validity of statistical results. A case-study using a version of the Enron e-mail corpus data set demonstrates the application and usefulness of the proposed techniques in solving the challenging problem of maintaining privacy and supporting open access to network data to ensure reproducibility of existing studies and discovering new scientific insights that can be obtained by analysing such data. We use a simple yet effective randomized response mechanism to generate synthetic networks under ε-edge differential privacy and then use likelihood-based inference for missing data and Markov chain Monte Carlo techniques to fit exponential family random-graph models to the generated synthetic networks.

 
more » « less
NSF-PAR ID:
10398875
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series C: Applied Statistics
Volume:
66
Issue:
3
ISSN:
0035-9254
Page Range / eLocation ID:
p. 481-500
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In recent years, Online Social Networks (OSN) have become popular content-sharing environments. With the emergence of smartphones with high-quality cameras, people like to share photos of their life moments on OSNs. The photos, however, often contain private information that people do not intend to share with others (e.g., their sensitive relationship). Solely relying on OSN users to manually process photos to protect their relationship can be tedious and error-prone. Therefore, we designed a system to automatically discover sensitive relations in a photo to be shared online and preserve the relations by face blocking techniques. We first used the Decision Tree model to learn sensitive relations from the photos labeled private or public by OSN users. Then we defined a face blocking problem and developed a linear programming model to optimize the tradeoff between preserving relationship privacy and maintaining the photo utility. In this paper, we generated synthetic data and used it to evaluate our system performance in terms of privacy protection and photo utility loss. 
    more » « less
  2. Security monitoring is crucial for maintaining a strong IT infrastructure by protecting against emerging threats, identifying vulnerabilities, and detecting potential points of failure. It involves deploying advanced tools to continuously monitor networks, systems, and configurations. However, organizations face challenges in adapting modern techniques like Machine Learning (ML) due to privacy and security risks associated with sharing internal data. Compliance with regulations like GDPR further complicates data sharing. To promote external knowledge sharing, a secure and privacy-preserving method for organizations to share data is necessary. Privacy-preserving data generation involves creating new data that maintains privacy while preserving key characteristics and properties of the original data so that it is still useful in creating downstream models of attacks. Generative models, such as Generative Adversarial Networks (GAN), have been proposed as a solution for privacy preserving synthetic data generation. However, standard GANs are limited in their capabilities to generate realistic system data. System data have inherent constraints, e.g., the list of legitimate I.P. addresses and port numbers are limited, and protocols dictate a valid sequence of network events. Standard generative models do not account for such constraints and do not utilize domain knowledge in their generation process. Additionally, they are limited by the attribute values present in the training data. This poses a major privacy risk, as sensitive discrete attribute values are repeated by GANs. To address these limitations, we propose a novel model for Knowledge Infused Privacy Preserving Data Generation. A privacy preserving Generative Adversarial Network (GAN) is trained on system data for generating synthetic datasets that can replace original data for downstream tasks while protecting sensitive data. Knowledge from domain-specific knowledge graphs is used to guide the data generation process, check for the validity of generated values, and enrich the dataset by diversifying the values of attributes. We specifically demonstrate this model by synthesizing network data captured by the network capture tool, Wireshark. We establish that the synthetic dataset holds up to the constraints of the network-specific datasets and can replace the original dataset in downstream tasks. 
    more » « less
  3. Abstract

    Due to natural phenomena like global warming and climate change, agricultural production is increasingly faced with threats that transcend farm boundaries. Management practices at the landscape or community level are often required to adequately respond to these new challenges (e.g., pest migration). Such decision-making at a community or beyond-farm level—i.e., practices that are jointly developed by farmers within a community—can be aided by computing and communications technology. In this study, we employ a translational research process to examine the social and behavioral drivers of adoption of smart and connected farm networks among commodity crop farmers in the United States. We implement focus groups and questionnaires to bring to the fore views on the use of digital technologies in collaborative contexts. We find that participating farmers are concerned with several issues about the potential features of the network (e.g., the ability to ensure data validity while maintaining data privacy) and the nature of their interactions with the various stakeholders involved in the network management. The participatory approach we adopt helps provide insights into the process of developing technologies that are both actionable and trusted by potential end users.

     
    more » « less
  4. The emergence of mobile apps (e.g., location-based services, geo-social networks, ride-sharing) led to the collection of vast amounts of trajectory data that greatly benefit the understanding of individual mobility. One problem of particular interest is next-location prediction, which facilitates location-based advertising, point-of-interest recommendation, traffic optimization,etc. However, using individual trajectories to build prediction models introduces serious privacy concerns, since exact whereabouts of users can disclose sensitive information such as their health status or lifestyle choices. Several research efforts focused on privacy-preserving next-location prediction, but they have serious limitations: some use outdated privacy models (e.g., k-anonymity), while others employ learning models with limited expressivity (e.g., matrix factorization). More recent approaches(e.g., DP-SGD) integrate the powerful differential privacy model with neural networks, but they provide only generic and difficult-to-tune methods that do not perform well on location data, which is inherently skewed and sparse.We propose a technique that builds upon DP-SGD, but adapts it for the requirements of next-location prediction. We focus on user-level privacy, a strong privacy guarantee that protects users regardless of how much data they contribute. Central to our approach is the use of the skip-gram model, and its negative sampling technique. Our work is the first to propose differentially-private learning with skip-grams. In addition, we devise data grouping techniques within the skip-gram framework that pool together trajectories from multiple users in order to accelerate learning and improve model accuracy. Experiments conducted on real datasets demonstrate that our approach significantly boosts prediction accuracy compared to existing DP-SGD techniques. 
    more » « less
  5. The emergence of mobile apps (e.g., location-based services,geo-social networks, ride-sharing) led to the collection of vast amounts of trajectory data that greatly benefit the understanding of individual mobility. One problem of particular interest is next-location prediction, which facilitates location-based advertising, point-of-interest recommendation, traffic optimization,etc. However, using individual trajectories to build prediction models introduces serious privacy concerns, since exact whereabouts of users can disclose sensitive information such as their health status or lifestyle choices. Several research efforts focused on privacy-preserving next-location prediction, but they have serious limitations: some use outdated privacy models (e.g., k-anonymity), while others employ learning models with limited expressivity (e.g., matrix factorization). More recent approaches(e.g., DP-SGD) integrate the powerful differential privacy model with neural networks, but they provide only generic and difficult-to-tune methods that do not perform well on location data, which is inherently skewed and sparse.We propose a technique that builds upon DP-SGD, but adapts it for the requirements of next-location prediction. We focus on user-level privacy, a strong privacy guarantee that protects users regardless of how much data they contribute. Central toour approach is the use of the skip-gram model, and its negative sampling technique. Our work is the first to propose differentially-private learning with skip-grams. In addition, we devise data grouping techniques within the skip-gram framework that pool together trajectories from multiple users in order to acceleratelearning and improve model accuracy. Experiments conducted on real datasets demonstrate that our approach significantly boosts prediction accuracy compared to existing DP-SGD techniques. 
    more » « less