Title: What Is Lurking in Your Backups?
Best practices in data management and privacy mandate that old data must be irreversibly destroyed. However, for performance reasons, old (deleted or updated) data is not immediately purged from active database storage. Database backups, which typically copy table and index pages rather than logical rows, greatly exacerbate the privacy problem posed by this surviving data: copying deleted data into backups means that unknown quantities of old data can be stored indefinitely. In this paper, we quantify the amount of deleted data retained in backups by four major representative databases, comparing the default behavior with an explicit defrag operation. We review the defrag options available in these databases and discuss their impact on eliminating old data from backups. We demonstrate that each database has a defrag mechanism that can eliminate most of the old deleted data (although in Oracle, pre-update content may survive defrag). Finally, we outline the factors that organizations should consider when deciding whether to apply defrag before executing their backups.
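The effect described in the abstract can be illustrated with a minimal sketch. It uses SQLite purely as a convenient stand-in, which is an assumption made here and is not necessarily one of the four databases the paper evaluates. The table name, the filler rows, and the `SECRET` value are all hypothetical. The sketch deletes a row, checks whether its bytes are still present in the database file (the same bytes a page-level backup would copy), and then repeats the check after SQLite's `VACUUM`, which plays the role of a defrag operation.

```python
import os
import sqlite3
import tempfile

# Hypothetical sensitive value used only for this illustration.
SECRET = b"SSN-123-45-6789"


def file_contains(path, needle):
    """Return True if the raw bytes of the file include `needle`."""
    with open(path, "rb") as f:
        return needle in f.read()


def run_demo():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "demo.db")

    # isolation_level=None puts the connection in autocommit mode,
    # so transactions are controlled explicitly below.
    conn = sqlite3.connect(path, isolation_level=None)

    # Some SQLite builds zero deleted content by default; turn that off
    # so the sketch shows the behavior the abstract describes.
    conn.execute("PRAGMA secure_delete = OFF")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, note TEXT)")

    conn.execute("BEGIN")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, f"filler-{i}") for i in range(1, 51)],
    )
    conn.execute("INSERT INTO customers VALUES (?, ?)", (1000, SECRET.decode()))
    conn.execute("COMMIT")

    # Logical delete: the row is no longer visible to queries.
    conn.execute("DELETE FROM customers WHERE id = 1000")
    visible = conn.execute(
        "SELECT COUNT(*) FROM customers WHERE id = 1000"
    ).fetchone()[0]

    # A page-level backup would copy these same file bytes.
    in_file_after_delete = file_contains(path, SECRET)

    # VACUUM rebuilds the database file from the live rows only.
    conn.execute("VACUUM")
    in_file_after_vacuum = file_contains(path, SECRET)

    conn.close()
    return visible, in_file_after_delete, in_file_after_vacuum


if __name__ == "__main__":
    visible, after_delete, after_vacuum = run_demo()
    print("rows visible to queries:", visible)
    print("secret bytes in file after DELETE:", after_delete)
    print("secret bytes in file after VACUUM:", after_vacuum)
```

Copying the file between the `DELETE` and the `VACUUM` corresponds to backing up without defragmenting first. The sketch checks only the main database file and says nothing about journals, logs, or how other database systems behave.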
Award ID(s):
2016548
NSF-PAR ID:
10310751
Journal Name:
IFIP International Conference on ICT Systems Security and Privacy Protection
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Preserving the history of storage states is critical to ensuring system reliability and security. It facilitates system functions such as debugging, data recovery, and forensics. Existing software-based approaches such as data journaling, logging, and backups not only introduce performance and storage costs but are also vulnerable to malware attacks, because adversaries can obtain kernel privileges to terminate or destroy them. In this paper, we present Project Almanac, which includes (1) a time-travel solid-state drive (SSD) named TimeSSD that retains a history of storage states in hardware for a window of time, and (2) a toolkit named TimeKits that provides storage-state query and rollback functions. TimeSSD tracks the history of storage states in the hardware device, without relying on explicit backups, by exploiting the property that flash retains old copies of data when they are updated or deleted. We implement TimeSSD with a programmable SSD and develop TimeKits for several typical system applications. Experiments with a variety of real-world case studies demonstrate that TimeSSD can retain all the storage states for eight weeks, with negligible performance overhead, while providing the device-level time-travel property.
  2. Data retention laws establish rules intended to protect privacy. These define both retention durations (how long data must be kept) and purging deadlines (when the data must be destroyed in storage). To comply with the laws and to minimize liability, companies should destroy data that must be purged or is no longer needed. However, database backups generally cannot be edited to purge “expired” data and erasing the entire backup is impractical. To maintain compliance, data curators need a mechanism to support targeted destruction of data in backups. In this paper, we present a cryptographic erasure framework that can purge data from all database backups. Our approach can be transparently integrated into existing database backup processes. We demonstrate how different purge policies can be defined through views and enforced by triggers without violating database constraints. 
  3. Data compliance laws establish rules intended to protect privacy. These define both retention durations (how long data must be kept) and purging deadlines (when the data must be destroyed in storage). To comply with the laws and to minimize liability, companies must destroy data that must be purged or is no longer needed. However, database backups generally cannot be edited to purge “expired” data and erasing the entire backup is impractical. To maintain compliance, data curators need a mechanism to support targeted destruction of data in backups. In this paper, we present a cryptographic erasure framework that can purge data from across database backups. We demonstrate how different purge policies can be defined through views and enforced without violating database constraints.
  4. The increasing use of databases to store critical and sensitive information in many organizations has led to an increase in the rate at which databases are exploited in computer crimes. While several techniques and tools are available for database forensics, they mostly assume a priori database preparation, such as relying on tamper-detection software being in place or on detailed logging. Investigators instead need forensic tools and techniques that work on poorly configured databases and make no assumptions about the extent of damage in a database. In this paper, we present DBCarver, a tool for reconstructing database content from a database image without using any log or system metadata. The tool uses page carving to reconstruct both queryable data and non-queryable data (deleted data). We describe how the two kinds of data can be combined to enable a variety of forensic analysis questions hitherto unavailable to forensic investigators. We show the generality and efficiency of our tool across several databases through a set of robust experiments.
  5. Asynchronously replicated primary-backup databases are commonly deployed to improve availability and offload read-only transactions. To both apply replicated writes from the primary and serve read-only transactions, the backups implement a cloned concurrency control protocol. The protocol ensures read-only transactions always return a snapshot of state that previously existed on the primary. This compels the backup to exactly copy the commit order resulting from the primary's concurrency control. Existing cloned concurrency control protocols guarantee this by limiting the backup's parallelism. As a result, the primary's concurrency control executes some workloads with more parallelism than these protocols. In this paper, we prove that this parallelism gap leads to unbounded replication lag, in which writes can take arbitrarily long to replicate to the backup, and which has led to catastrophic failures in production systems. We then design C5, the first cloned concurrency control protocol to provide bounded replication lag. We implement two versions of C5. Our evaluation in MyRocks, a widely deployed database, demonstrates that C5 provides bounded replication lag. Our evaluation in Cicada, a recent in-memory database, demonstrates that C5 keeps up with even the fastest of primaries.

     