skip to main content


Title: Mycelium: Large-Scale Distributed Graph Queries with Differential Privacy
This paper introduces Mycelium, the first system to process differentially private queries over large graphs that are distributed across millions of user devices. Such graphs occur, for instance, when tracking the spread of diseases or malware. Today, the only practical way to query such graphs is to upload them to a central aggregator, which requires a great deal of trust from users and rules out certain types of studies entirely. With Mycelium, users' private data never leaves their personal devices unencrypted, and each user receives strong privacy guarantees. Mycelium does require the help of a central aggregator with access to a data center, but the aggregator merely facilitates the computation by providing bandwidth and computation power; it never learns the topology of the graph or the underlying data. Mycelium accomplishes this with a combination of homomorphic encryption, a verifiable secret redistribution scheme, and a mix network based on telescoping circuits. Our evaluation shows that Mycelium can answer a range of different questions from the medical literature with millions of devices.  more » « less
Award ID(s):
2045861 2107147 2124184
NSF-PAR ID:
10312007
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
ACM SIGOPS Symposium on Operating Systems Principles (SOSP)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Federated learning (FL) is a highly pursued machine learning technique that can train a model centrally while keeping data distributed. Distributed computation makes FL attractive for bandwidth limited applications especially in wireless communications. There can be a large number of distributed edge devices connected to a central parameter server (PS) and iteratively download/upload data from/to the PS. Due to limited bandwidth, only a subset of connected devices can be scheduled in each round. There are usually millions of parameters in the state-of-art machine learning models such as deep learning, resulting in a high computation complexity as well as a high communication burden on collecting/distributing data for training. To improve communication efficiency and make the training model converge faster, we propose a new scheduling policy and power allocation scheme using non-orthogonal multiple access (NOMA) settings to maximize the weighted sum data rate under practical constraints during the entire learning process. NOMA allows multiple users to transmit on the same channel simultaneously. The user scheduling problem is transformed into a maximum-weight independent set problem that can be solved using graph theory. Simulation results show that the proposed scheduling and power allocation scheme can help achieve a higher FL testing accuracy in NOMA based wireless networks than other existing schemes within the same learning time. 
    more » « less
  2. When collecting information, local differential privacy (LDP) alleviates privacy concerns of users because their private information is randomized before being sent it to the central aggregator. LDP imposes large amount of noise as each user executes the randomization independently. To address this issue, recent work introduced an intermediate server with the assumption that this intermediate server does not collude with the aggregator. Under this assumption, less noise can be added to achieve the same privacy guarantee as LDP, thus improving utility for the data collection task. This paper investigates this multiple-party setting of LDP. We analyze the system model and identify potential adversaries. We then make two improvements: a new algorithm that achieves a better privacy-utility tradeoff; and a novel protocol that provides better protection against various attacks. Finally, we perform experiments to compare different methods and demonstrate the benefits of using our proposed method. 
    more » « less
  3. Abstract

    The examination of multivariate brain morphometry patterns has gained attention in recent years, especially for their powerful exploratory capabilities in the study of differences between patients and controls. Among the many existing methods and tools for the analysis of brain anatomy based on structural magnetic resonance imaging data, data‐driven source‐based morphometry (SBM) focuses on the exploratory detection of such patterns. Here, we implement a semi‐blind extension of SBM, called constrained source‐based morphometry (constrained SBM), which enables the extraction of maximally independent reference‐alike sources using the constrained independent component analysis (ICA) approach. To do this, we combine SBM with a set of reference components covering the full brain, derived from a large independent data set (UKBiobank), to provide a fully automated SBM framework. This also allows us to implement a federated version of constrained SBM (cSBM) to allow analysis of data that is not locally accessible. In our proposed decentralized constrained source‐based morphometry (dcSBM), the original data never leaves the local site. Each site operates constrained ICA on its private local data using a common distributed computation platform. Next, an aggregator/master node aggregates the results estimated from each local site and applies statistical analysis to estimate the significance of the sources. Finally, we utilize two additional multisite patient data sets to validate our model by comparing the resulting group difference estimates from both cSBM and dcSBM.

     
    more » « less
  4. We present and analyze UDM, a new protocol for user discovery in anonymous communication systems that minimizes the information disclosed to the system and users. Unlike existing systems, including those based on private set intersection, UDM learns nothing about the contact lists and social graphs of the users, is not vulnerable to off-line dictionary attacks that expose contact lists, does not reveal platform identifiers to users without the owner’s explicit permission, and enjoys low computation and communication complexity. UDM solves the following user-discovery problem. User Alice wishes to communicate with Bob over an anonymous communication system, such as cMix or Tor. Initially, each party knows each other’s public contact identifier (e.g., email address or phone number), but neither knows the other’s private platform identifier in the communication system. If both parties wish to communicate with each other, UDM enables them to establish a shared key and learn each other’s private platform identifier. UDM uses an untrusted user-discovery system, which processes and stores only public information, hashed values, or values encrypted with keys it does not know. Therefore, UDM cannot learn any information about the social graphs of its users. Using the anonymous communication system, each pair of users who wish to communicate with each other uploads to the user-discovery system their private platform identifier, encrypted with their shared key. Indexing their request by a truncated cryptographic hash of their shared key, each user can then download each other’s encrypted private platform identifier. 
    more » « less
  5. In this paper, we examine private-sector collection and use of metadata and telemetry information and provide three main contributions: First, we lay out the extent to which “non-content”—the hidden parts of Internet communications (aspects the user does not explicitly enter) and telemetry—are highly revelatory of personal behavior. We show that, privacy policies notwithstanding, users rarely know that the metadata and telemetry information is being collected and almost never know the uses to which it is being put. Second, we show that consumers, even if they knew the uses to which this type of personal information were being put, lack effective means to control the use of this type of data. The standard tool of notice-and-choice has well known problems, including the user’s lack of information with which to make a choice; and then, even if the user had sufficient information, doing so is not practical.49 These are greatly exacerbated by the nature of the interchanges for communications metadata and telemetry information. Each new transmission—each click on an internal link on a webpage, for example—may carry different implications for a user in terms of privacy. The current regimen, notice-and-choice, presents a completely unworkable set of requests for a user, who could well be responding many times a minute regarding whether to allow the use of metadata beyond the purposes of content delivery and display. This is especially the case for telemetry, where the ability to understand both present and future use of the data provided from the sensors requires a deeper understanding of what information these devices can provide than anyone but a trained engineer would know. Third, while there has been academic and industry research on telemetry’s use, there has been little exploration of the policy and legal implications stemming from that use. We provide this factor, while at the same time addressing the closely related issues raised by industry’s use of communications metadata to track user interests and behavior 
    more » « less