
This content will become publicly available on August 9, 2024

Title: PrivTrace: Differentially Private Trajectory Synthesis by Adaptive Markov Models
Publishing trajectory data (individuals' movement information) is very useful, but it also raises privacy concerns. To address these concerns, we apply differential privacy, the standard technique for data privacy, together with a Markov chain model to generate synthetic trajectories. Observing that existing studies all rely on Markov chain models, we propose a framework to analyze how the Markov chain model is used in this problem. Based on this analysis, we design an effective algorithm, PrivTrace, that adaptively uses first-order and second-order Markov models. We evaluate PrivTrace and existing methods on synthetic and real-world datasets to demonstrate the superiority of our method.
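The core mechanism the abstract describes, a Markov chain whose transition statistics are perturbed to satisfy differential privacy before synthetic trajectories are sampled, can be sketched as follows. This is a minimal first-order illustration only, not the PrivTrace algorithm itself (which adaptively mixes first- and second-order models); the noise scale `1/epsilon` and all function names are illustrative assumptions, since the true sensitivity depends on how many transitions one user contributes.

```python
import math
import random
from typing import List

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sample from a Laplace(0, scale) distribution.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def synthesize(trajs: List[List[str]], states: List[str],
               epsilon: float, n_out: int, length: int) -> List[List[str]]:
    # Count first-order transitions observed in the real trajectories.
    counts = {s: {t: 0.0 for t in states} for s in states}
    for traj in trajs:
        for a, b in zip(traj, traj[1:]):
            counts[a][b] += 1.0
    # Perturb each count with Laplace noise (scale 1/epsilon is illustrative),
    # clamping to a small positive floor so every row stays a valid distribution.
    for s in states:
        for t in states:
            counts[s][t] = max(counts[s][t] + laplace_noise(1.0 / epsilon), 1e-6)
    # Sample synthetic trajectories from the noisy Markov chain.
    out = []
    for _ in range(n_out):
        cur = random.choice(states)
        traj = [cur]
        for _ in range(length - 1):
            cur = random.choices(states, weights=[counts[cur][t] for t in states])[0]
            traj.append(cur)
        out.append(traj)
    return out
```

Only the noisy counts are derived from the sensitive data, so everything sampled downstream inherits the privacy guarantee of the count perturbation.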
Location: Anaheim, CA, USA
Sponsoring Org: National Science Foundation
More Like this
  1. Summary

Motivated by the real-life problem of sharing social network data that contain sensitive personal information, we propose a novel approach to releasing and analysing synthetic graphs that protects the privacy of the individual relationships captured by the social network while maintaining the validity of statistical results. We use a simple yet effective randomized response mechanism to generate synthetic networks under ε-edge differential privacy, and then use likelihood-based inference for missing data and Markov chain Monte Carlo techniques to fit exponential-family random graph models to the generated synthetic networks. A case study using a version of the Enron e-mail corpus demonstrates the application and usefulness of the proposed techniques in solving the challenging problem of maintaining privacy while supporting open access to network data, both to ensure reproducibility of existing studies and to enable new scientific insights from analysing such data.
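The randomized response step described above can be sketched directly: each potential edge's presence bit is kept with probability e^ε / (1 + e^ε) and flipped otherwise, which is the standard mechanism satisfying ε-edge differential privacy. The function name and the adjacency-matrix representation are our own; the paper's inference machinery on top of the noisy graph is not reproduced here.

```python
import math
import random
from typing import List

def randomized_response_edges(adj: List[List[int]], epsilon: float) -> List[List[int]]:
    # Keep each edge indicator with probability e^eps / (1 + e^eps),
    # flip it otherwise; applied independently to every unordered pair,
    # this satisfies eps-edge differential privacy.
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    n = len(adj)
    noisy = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            bit = adj[i][j] if random.random() < p_keep else 1 - adj[i][j]
            noisy[i][j] = noisy[j][i] = bit  # keep the output symmetric
    return noisy
```

Because each pair is perturbed independently, downstream model fitting must treat the released graph as a noisy observation of the true one, which is exactly what the likelihood-based missing-data inference in the paper does.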

  2. Nowadays, large-scale graph data is being generated in a variety of real-world applications, from social networks to co-authorship networks, from protein-protein interaction networks to road traffic networks. Many existing works on graph mining focus on vertices and edges, with the first-order Markov chain as the underlying model. They fail to explore high-order network structures, which are of key importance in many high-impact domains. For example, in bank customer personally identifiable information (PII) networks, star structures often correspond to sets of synthetic identities; in financial transaction networks, loop structures may indicate the existence of money laundering. In this paper, we focus on mining user-specified high-order network structures and aim to find a structure-rich subgraph that can be separated from the rest of the graph while breaking as few such structures as possible. A key challenge in finding a structure-rich subgraph is the prohibitive computational cost. To address this problem, inspired by the family of local graph clustering algorithms that efficiently identify a low-conductance cut without exploring the entire graph, we generalize the key idea to model high-order network structures. In particular, we start with a generic definition of high-order conductance and define the high-order diffusion core, which is based on a high-order random walk induced by the user-specified high-order network structure. We then propose a novel High-Order Structure-Preserving LOcal Cut (HOSPLOC) algorithm, which runs in polylogarithmic time with respect to the number of edges in the graph. It starts with a seed vertex and iteratively explores its neighborhood until a subgraph with small high-order conductance is found. Furthermore, we analyze its performance in terms of both effectiveness and efficiency. The experimental results on both synthetic graphs and real graphs demonstrate the effectiveness and efficiency of our proposed HOSPLOC algorithm.
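As a concrete reference point for the objective that HOSPLOC generalizes, here is the standard first-order edge conductance of a vertex set: the number of cut edges divided by the smaller side's volume. The paper's high-order variant counts user-specified structures (e.g. stars or loops) cut by the partition rather than single edges; that generalization, and the local diffusion that finds the cut, are not reproduced in this sketch.

```python
from typing import Dict, List, Set

def conductance(adj: Dict[int, List[int]], subset: Set[int]) -> float:
    # Edge conductance phi(S) = cut(S, V \ S) / min(vol(S), vol(V \ S)),
    # where vol(S) is the sum of degrees of vertices in S.
    S = set(subset)
    cut = sum(1 for u in S for v in adj[u] if v not in S)
    vol_s = sum(len(adj[u]) for u in S)
    vol_rest = sum(len(adj[u]) for u in adj if u not in S)
    denom = min(vol_s, vol_rest)
    return cut / denom if denom else 0.0
```

A low-conductance set is one that is well connected internally but sparsely connected to the rest of the graph; local clustering algorithms grow such a set from a seed vertex without ever touching most of the graph, which is the efficiency idea HOSPLOC lifts to the high-order setting.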
  3. Abstract

Following the reanalysis of individual experimental runs of some widely cited studies (Jain et al., 2018), we revisit the global data analysis of Korenaga and Karato (2008) with a significantly improved version of their Markov chain Monte Carlo inversion. Their algorithm, previously corrected by Mullet et al. to minimize potential parameter bias, is further modified here to more efficiently estimate interrun biases in global data sets. Using the refined Markov chain Monte Carlo inversion technique, we simultaneously analyze experimental data on the deformation of olivine aggregates compiled from different studies. Realistic composite rheological models, including both diffusion and dislocation creep, are adopted, and the role of dislocation-accommodated grain boundary sliding is also investigated. Furthermore, the influence of interrun biases on inversion results is studied using experimental and synthetic data. Our analysis shows that existing data can tightly constrain the grain-size exponent for diffusion creep at ∼2, which is different from the commonly assumed value (p = 3). Different data sets and model assumptions, however, yield nonoverlapping estimates of other flow-law parameters, and the flow-law parameters for grain boundary sliding are poorly resolved in most cases. We thus provide a few plausible candidate flow-law models for olivine rheology to facilitate future geodynamic modeling. The availability of more data that explore a wider range of experimental conditions, especially higher pressures, is essential to improve our understanding of upper mantle rheology.
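The composite rheological model referred to above combines diffusion and dislocation creep additively; a standard flow-law form (shown for orientation, not necessarily the paper's exact parameterization) is:

```latex
\dot{\varepsilon} = \dot{\varepsilon}_{\mathrm{diff}} + \dot{\varepsilon}_{\mathrm{disl}},
\qquad
\dot{\varepsilon}_i = A_i \,\sigma^{n_i}\, d^{-p_i} \exp\!\left(-\frac{E_i + P V_i}{R T}\right),
```

where $\sigma$ is stress, $d$ grain size, $P$ pressure, $T$ temperature, and $R$ the gas constant. Diffusion creep has stress exponent $n = 1$ and the grain-size exponent $p$ that the analysis constrains near 2, whereas dislocation creep is grain-size insensitive ($p = 0$); the inversion estimates the prefactors $A_i$, exponents, activation energies $E_i$, and activation volumes $V_i$ from the compiled data.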

  4. We propose a Bayesian decision making framework for control of Markov Decision Processes (MDPs) with unknown dynamics and large, possibly continuous, state, action, and parameter spaces in data-poor environments. Most of the existing adaptive controllers for MDPs with unknown dynamics are based on the reinforcement learning framework and rely on large data sets acquired by sustained direct interaction with the system or via a simulator. This is not feasible in many applications, due to ethical, economic, and physical constraints. The proposed framework addresses the data poverty issue by decomposing the problem into an offline planning stage that does not rely on sustained direct interaction with the system or simulator and an online execution stage. In the offline stage, parallel Gaussian process temporal difference (GPTD) learning techniques are employed for near-optimal Bayesian approximation of the expected discounted reward over a sample drawn from the prior distribution of unknown parameters. In the online stage, the action with the maximum expected return with respect to the posterior distribution of the parameters is selected. This is achieved by an approximation of the posterior distribution using a Markov Chain Monte Carlo (MCMC) algorithm, followed by constructing multiple Gaussian processes over the parameter space for efficient prediction of the means of the expected return at the MCMC samples. The effectiveness of the proposed framework is demonstrated using a simple dynamical system model with continuous state and action spaces, as well as a more complex model for a metastatic melanoma gene regulatory network observed through noisy synthetic gene expression data.
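The online stage described above, sample the parameter posterior with MCMC and then pick the action whose return is best when averaged over those posterior draws, can be sketched generically. This uses a plain random-walk Metropolis sampler and a posterior-averaging action rule, not the paper's GPTD machinery or its Gaussian-process return surrogates; every name here is illustrative.

```python
import math
import random
from typing import Callable, List, Sequence

def metropolis_hastings(log_post: Callable[[float], float], x0: float,
                        n_samples: int, step: float = 0.5) -> List[float]:
    # Random-walk Metropolis: propose a Gaussian perturbation and accept
    # with probability min(1, posterior ratio); returns the chain of draws.
    x, lp, samples = x0, log_post(x0), []
    for _ in range(n_samples):
        prop = x + random.gauss(0.0, step)
        lp_prop = log_post(prop)
        if lp_prop >= lp or random.random() < math.exp(lp_prop - lp):
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

def select_action(actions: Sequence[float],
                  expected_return: Callable[[float, float], float],
                  theta_samples: Sequence[float]) -> float:
    # Online stage: maximize the posterior-averaged expected return,
    # approximating the posterior expectation by a mean over MCMC draws.
    def avg(a: float) -> float:
        return sum(expected_return(a, th) for th in theta_samples) / len(theta_samples)
    return max(actions, key=avg)
```

In the paper, `expected_return` is itself an expensive object approximated offline by GPTD; this sketch only shows how posterior draws and a return estimate combine into an action choice.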
  5. We consider the task of learning a parametric Continuous Time Markov Chain (CTMC) sequence model without examples of sequences, where the training data consists entirely of aggregate steady-state statistics. Making the problem harder, we assume that the states we wish to predict are unobserved in the training data. Specifically, given a parametric model over the transition rates of a CTMC and some known transition rates, we wish to extrapolate its steady-state distribution to states that are unobserved. A technical roadblock to learning a CTMC from its steady state has been that the chain rule used to compute gradients does not work over the arbitrarily long sequences necessary to reach steady state, from which the aggregate statistics are sampled. To overcome this optimization challenge, we propose ∞-SGD, a principled stochastic gradient descent method that uses randomly-stopped estimators to avoid the infinite sums required by the steady-state computation, while learning even when only a subset of the CTMC states can be observed. We apply ∞-SGD to a real-world testbed and to synthetic experiments, showcasing its accuracy, its ability to extrapolate the steady-state distribution to unobserved states under unobserved conditions (heavy loads, when training under light loads), and its success in difficult scenarios where even a tailor-made extension of existing methods fails.
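The randomly-stopped estimator idea behind ∞-SGD can be illustrated on a plain infinite sum: stop at a geometric random time and reweight each surviving term by its inverse survival probability, which keeps the estimator unbiased despite never evaluating the full sum. This is the generic "Russian roulette" construction, not the ∞-SGD gradient update itself; the function and parameter names are ours.

```python
import random
from typing import Callable

def randomly_stopped_sum(term: Callable[[int], float], q: float = 0.5,
                         rng=random) -> float:
    # Unbiased estimator of sum_{k>=0} term(k): after emitting term k,
    # stop with probability q. Term k survives with probability (1-q)**k,
    # so dividing it by that probability makes the expectation exact.
    est, k, p_survive = 0.0, 0, 1.0
    while True:
        est += term(k) / p_survive
        if rng.random() < q:  # geometric stopping time
            return est
        k += 1
        p_survive *= (1.0 - q)
```

In ∞-SGD the role of `term(k)` is played by the terms of the (infinite) steady-state gradient expression, so each SGD step touches only a finite, randomly truncated prefix while the gradient remains unbiased in expectation.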