Title: Meaningful Data Erasure in the Presence of Dependencies
Data regulations like GDPR require systems to support data erasure but leave the definition of erasure open to interpretation. This ambiguity makes compliance challenging, especially in databases where data dependencies can lead to erased data being inferred from remaining data. We formally define a precise notion of data erasure that ensures any inference about deleted data, through dependencies, remains bounded to what could have been inferred before its insertion. We design erasure mechanisms that enforce this guarantee at minimal cost. Additionally, we explore strategies to balance cost and throughput, batch multiple erasures, and proactively compute data retention times when possible. We demonstrate the practicality and scalability of our algorithms using both real and synthetic datasets.
Award ID(s):
2133391 2420846 1952247
PAR ID:
10677220
Author(s) / Creator(s):
; ; ; ; ; ; ;
Publisher / Repository:
ACM Proceedings of the VLDB Endowment, Volume 18, Issue 10
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
18
Issue:
10
ISSN:
2150-8097
Page Range / eLocation ID:
3435 to 3448
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Nowadays erasure coding is one of the most significant techniques in cloud storage systems, providing both quick parallel I/O processing and high fault tolerance for massive data accesses. In these systems, triple disk failure tolerant arrays (3DFTs) are a typical configuration, supported by several classic erasure codes such as Reed-Solomon (RS) codes, Local Reconstruction Codes (LRC), and Minimum Storage Regeneration (MSR) codes. In an online recovery process, the foreground application workloads and the background recovery workloads are handled simultaneously, which requires a comprehensive understanding of the characteristics of both types of workload. Although several techniques have been proposed to accelerate the I/O requests of online recovery processes, they are typically unilateral because they do not consider the two workloads together to achieve cost-effective performance. To address this problem, we propose Erasure Codes Fusion (EC-Fusion), an efficient hybrid erasure coding framework for cloud storage systems. EC-Fusion is a combination of RS and MSR codes, which dynamically selects the appropriate code based on its properties. On one hand, for write-intensive application workloads or recovery workloads with a low risk of data loss, EC-Fusion uses RS code to decrease the computational overhead and storage cost concurrently. On the other hand, for read-intensive workloads or workloads with frequent reconstruction, MSR code is a proper choice. Therefore, better overall application and recovery performance can be achieved in a cost-effective fashion. To demonstrate the effectiveness of EC-Fusion, several experiments are conducted in Hadoop systems. The results show that, compared with the traditional hybrid erasure coding techniques, EC-Fusion accelerates the application response time by up to 1.77×, and reduces the reconstruction time by up to 69.10%.
  2. We study the problem of answering queries when (part of) the data may be sensitive and should not be leaked to the querier. Simply restricting the computation to the non-sensitive part of the data may leak sensitive data through inference based on data dependencies. While inference control from data dependencies during query processing has been studied in the literature, existing solutions either detect and deny queries causing leakage, or use a weak security model that only protects against exact reconstruction of the sensitive data. In this paper, we adopt a stronger security model based on full deniability that prevents any information about sensitive data from being inferred from query answers. We identify conditions under which full deniability can be achieved and develop an efficient algorithm that minimally hides non-sensitive cells during query processing to achieve full deniability. We experimentally show that our approach is practical and scales to an increasing proportion of sensitive data, as well as to increasing database size.
  3. Abelló, A; Vassiliadis, P; Romero, O; Wrembel, R; Bugiotti, F; Gamper, J; Vargas-Solar, G; Zumpano, E (Ed.)
    Constructing knowledge graphs from heterogeneous data sources and evaluating their quality and consistency are important research questions in the field of knowledge graphs. We propose mapping rules to guide users in translating data from relational and graph sources into a meaningful knowledge graph and design a user-friendly language to specify the mapping rules. Given the mapping rules and constraints on source data, equivalent constraints on the target graph can be inferred; these are referred to as data source constraints. Besides this type of constraint, we design two other types: user-specified constraints and general rules that a high-quality knowledge graph should adhere to. We translate the three types of constraints into uniform expressions in the form of graph functional dependencies and extended graph dependencies, which can be used for consistency checking. Our approach provides a systematic way to build and evaluate knowledge graphs from diverse data sources.
  4. Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain required I/O performance. We argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform for object caching. We present InfiniCache, a first-of-its-kind in-memory object caching system that is completely built and deployed atop ephemeral serverless functions. InfiniCache exploits and orchestrates serverless functions' memory resources to enable elastic pay-per-use caching. InfiniCache's design combines erasure coding, intelligent billed duration control, and an efficient data backup mechanism to maximize data availability and cost-effectiveness while balancing the risk of losing cached state and performance. We implement InfiniCache on AWS Lambda and show that it: (1) achieves 31 – 96× tenant-side cost savings compared to AWS ElastiCache for a large-object-only production workload, (2) can effectively provide 95.4% data availability for each one-hour window, and (3) enables performance comparable to that seen in a typical in-memory cache.
  5. The prevalence of disaggregated storage in public clouds has led to increased latency in modern OLAP cloud databases, particularly when handling ad-hoc and highly-selective queries on large objects. To address this, cloud databases have adopted computation pushdown, executing query predicates closer to the storage layer. However, existing pushdown solutions are inefficient in erasure-coded storage. Cloud storage employs erasure coding that partitions analytics file objects into fixed-sized blocks and distributes them across storage nodes. Consequently, when a specific part of the object is queried, the storage system must reassemble the object across nodes, incurring significant network latency. In this work, we present Fusion, an object store for analytics that is optimized for query pushdown on erasure-coded data. It co-designs its erasure coding and file placement topologies, taking into account popular analytics file formats (e.g., Parquet). Fusion employs a novel stripe construction algorithm that prevents fragmentation of computable units within an object, and minimizes storage overhead during erasure coding. Compared to existing erasure-coded stores, Fusion improves median and tail latency by 64% and 81%, respectively, on TPC-H, and up to 40% and 48% respectively, on real-world SQL queries. Fusion achieves this while incurring a modest 1.2% storage overhead compared to the optimal.