
Title: Making AI Forget You: Data Deletion in Machine Learning
Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used; the EU’s Right to Be Forgotten regulation is an example of this effort. In this paper we initiate a framework for studying what to do when it is no longer permissible to deploy models derived from specific user data. In particular, we formulate the problem of efficiently deleting individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual’s data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably efficient deletion algorithms which achieve an average of over 100x improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.
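The paper's own deletion algorithms are not reproduced here, but the general divide-and-conquer flavor of efficient deletion can be illustrated with a short sketch: partition the data into shards, cluster each shard independently, and refit only the affected shard when a point must be deleted. Everything below (the `ShardedKMeans` name, the shard count, the use of scikit-learn's `KMeans`) is an illustrative assumption, not the authors' implementation.

```python
# Illustrative sketch only: sharded k-means with cheap per-point deletion.
# Deleting a point refits a single shard, roughly 1/n_shards of the data,
# instead of retraining on the full dataset.
import numpy as np
from sklearn.cluster import KMeans


class ShardedKMeans:
    def __init__(self, n_clusters=8, n_shards=16, seed=0):
        self.n_clusters = n_clusters
        self.n_shards = n_shards
        self.rng = np.random.default_rng(seed)

    def fit(self, X):
        # Randomly assign points to shards and fit one k-means model per shard.
        labels = self.rng.integers(self.n_shards, size=len(X))
        self.shards = [X[labels == s] for s in range(self.n_shards)]
        self.models = [KMeans(n_clusters=self.n_clusters, n_init=10).fit(s)
                       for s in self.shards]
        return self

    def delete(self, shard_id, row_in_shard):
        # Remove one point and refit only the shard that contained it.
        self.shards[shard_id] = np.delete(self.shards[shard_id], row_in_shard, axis=0)
        self.models[shard_id] = KMeans(n_clusters=self.n_clusters,
                                       n_init=10).fit(self.shards[shard_id])

    def centroids(self):
        # Pooled per-shard centroids; a second-level clustering could merge these.
        return np.vstack([m.cluster_centers_ for m in self.models])


if __name__ == "__main__":
    X = np.random.default_rng(1).normal(size=(2000, 5))
    model = ShardedKMeans(n_clusters=5, n_shards=10).fit(X)
    model.delete(shard_id=3, row_in_shard=0)   # cheap deletion of one point
    print(model.centroids().shape)             # (n_shards * n_clusters, 5)
```

The trade-off in any scheme like this is statistical quality versus deletion cost: pooled per-shard centroids generally trail a single k-means++ run, which is the comparison the abstract reports alongside the roughly 100x deletion speedups.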
Authors:
Award ID(s):
1813049 1704417
Publication Date:
NSF-PAR ID:
10159347
Journal Name:
Advances in Neural Information Processing Systems
ISSN:
1049-5258
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract
    Excessive phosphorus (P) applications to croplands can contribute to eutrophication of surface waters through surface runoff and subsurface (leaching) losses. We analyzed leaching losses of total dissolved P (TDP) from no-till corn, hybrid poplar (Populus nigra × P. maximowiczii), switchgrass (Panicum virgatum), miscanthus (Miscanthus giganteus), native grasses, and restored prairie, all planted in 2008 on former cropland in Michigan, USA. All crops except corn (13 kg P ha⁻¹ year⁻¹) were grown without P fertilization. Biomass was harvested at the end of each growing season except for poplar. Soil water at 1.2 m depth was sampled weekly to biweekly for TDP determination during March–November 2009–2016 using tension lysimeters. Soil test P (STP; 0–25 cm depth) was measured every autumn. Soil water TDP concentrations were usually below levels at which eutrophication of surface waters is frequently observed (> 0.02 mg L⁻¹) but often higher than in deep groundwater or nearby streams and lakes. Rates of P leaching, estimated from measured concentrations and modeled drainage, did not differ statistically among cropping systems across years; 7-year cropping system means ranged from 0.035 to 0.072 kg P ha⁻¹ year⁻¹ with large interannual variation. Leached P was positively related to STP, which decreased over the 7 years in all systems. These results indicate that both P-fertilized and unfertilized cropping systems may …
  2. Obeid, I.; Selesnick, I.; Picone, J. (Eds.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], which manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment: performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is a difficult task to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining a historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations.
The overall impact of the issues described above is significant, as error rates can fluctuate by as much as 25% due to these computational effects. Cross-validation is one technique used to mitigate this, but that is expensive since you need to do multiple runs over the data, which further taxes a computing infrastructure already running at max capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, it adds an element of randomness into the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA’s cuDNN implementation provides algorithms that increase the performance and help the model train more quickly, but they are non-deterministic algorithms [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it gives us a hint of how our model is performing per experiment, and whether the changes we make are effective. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) that is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data from the last experiment to make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. (A minimal sketch of these seeding, determinism, and precision controls appears after this list.) We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments, and introduce new ways to manage such computational issues.
  3. Ho, Simon (Ed.)
    Abstract. Widely used approaches for extracting phylogenetic information from aligned sets of molecular sequences rely upon probabilistic models of nucleotide substitution or amino-acid replacement. The phylogenetic information that can be extracted depends on the number of columns in the sequence alignment and will be decreased when the alignment contains gaps due to insertion or deletion events. Motivated by the measurement of information loss, we suggest assessment of the effective sequence length (ESL) of an aligned data set. The ESL can differ from the actual number of columns in a sequence alignment because of the presence of alignment gaps. Furthermore, the estimation of phylogenetic information is affected by model misspecification. Inevitably, the actual process of molecular evolution differs from the probabilistic models employed to describe this process. This disparity means the amount of phylogenetic information in an actual sequence alignment will differ from the amount in a simulated data set of equal size, which motivated us to develop a new test for model adequacy. Via theory and empirical data analysis, we show how to disentangle the effects of gaps and model misspecification. By comparing the Fisher information of actual and simulated sequences, we identify which alignment sites and tree branches are most affected by gaps and model misspecification. [Fisher information; gaps; insertion; deletion; indel; model adequacy; goodness-of-fit test; sequence alignment.]
  4. Recent advances in DNA sequencing technology and DNA storage systems have rekindled the interest in deletion channels. Multiple recent works have looked at variants of sequence reconstruction over a single and over multiple deletion channels, a notoriously difficult problem due to its highly combinatorial nature. Although works in theoretical computer science have provided algorithms which guarantee perfect reconstruction with multiple independent observations from the deletion channel, they are only applicable in the large blocklength regime and, more restrictively, when the number of observations is also large. Indeed, with only a few observations, perfect reconstruction of the input sequence may not even be possible in most cases. In such situations, maximum likelihood (ML) and maximum a posteriori (MAP) estimates for the deletion channels are natural questions that arise, and these have remained open to the best of our knowledge. In this work, we take steps to answer the two aforementioned questions. Specifically: 1. We show that solving for the ML estimate over the single deletion channel (which can be cast as a discrete optimization problem) is equivalent to solving its relaxation, a continuous optimization problem; 2. We exactly compute the symbolwise posterior distributions (under some assumptions on the priors) for both the single and the multiple deletion channels. As part of our contributions, we also introduce tools to visualize and analyze error events, which we believe could be useful in other related problems concerning deletion channels. (A toy example of ML/MAP estimation over a single deletion channel appears after this list.)
  5. Abstract. We present here results from the Geoengineering Model Intercomparison Project (GeoMIP) simulations for the experiments G6sulfur and G6solar for six Earth system models participating in the Climate Model Intercomparison Project (CMIP) Phase 6. The aim of the experiments is to reduce the warming that results from a high-tier emission scenario (Shared Socioeconomic Pathways SSP5-8.5) to that resulting from a medium-tier emission scenario (SSP2-4.5). These simulations aim to analyze the response of climate models to a reduction in incoming surface radiation as a means to reduce global surface temperatures, and they do so either by simulating a stratospheric sulfate aerosol layer or, in a more idealized way, through a uniform reduction in the solar constant in the model. We find that over the final two decades of this century there are considerable inter-model spreads in the needed injection amounts of sulfate (29 ± 9 Tg-SO2/yr between 2081 and 2100), in the latitudinal distribution of the aerosol cloud and in the stratospheric temperature changes resulting from the added aerosol layer. Even in the simpler G6solar experiment, there is a spread in the needed solar dimming to achieve the same global temperature target (1.91 ± 0.44 %). The analyzed models already show significant differences in the response to the increasing CO2 concentrations for global mean temperatures and global mean precipitation (2.05 K ± 0.42 K and 2.28 ± 0.80 %, respectively, for SSP5-8.5 minus SSP2-4.5 averaged over 2081–2100). With aerosol injection, the differences in how the aerosols spread further change some of the underlying uncertainties, such as the global mean precipitation response (−3.79 ± 0.76 % for G6sulfur compared to −2.07 ± 0.40 % for G6solar against SSP2-4.5 between 2081 and 2100). These differences in the behavior of the aerosols also result in a larger uncertainty in the regional surface temperature response among models in the case of the G6sulfur simulations, suggesting the need to devise various, more specific experiments to single out and resolve particular sources of uncertainty. The spread in the modeled response suggests that a degree of caution is necessary when using these results for assessing specific impacts of geoengineering in various aspects of the Earth system. However, all models agree that compared to a scenario with unmitigated warming, stratospheric aerosol geoengineering has the potential to both globally and locally reduce the increase in surface temperatures.
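As a companion to the reproducibility discussion in item 2 above, the following is a minimal sketch of the kind of seeding, determinism, and precision controls described there, assuming TensorFlow 2.x. The exact settings used on the Neuronix cluster are not given in the abstract, so every value and call below is illustrative.

```python
# Illustrative reproducibility controls for a TensorFlow 2.x training script:
# seed every RNG, request deterministic kernels, pin float precision, and fix
# the order in which the data is presented.
import os
import random

import numpy as np
import tensorflow as tf

SEED = 1337

# PYTHONHASHSEED normally has to be exported before the interpreter starts;
# it is shown here only for completeness.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy RNG (shuffles, initialization helpers)
tf.random.set_seed(SEED)   # TensorFlow's global RNG (initializers, dropout)

# Request deterministic TensorFlow/cuDNN kernels where they exist (TF >= 2.9).
tf.config.experimental.enable_op_determinism()

# Keep computations in 32-bit floats; NumPy/Python otherwise default to 64-bit.
tf.keras.backend.set_floatx("float32")

# Fix the data order: shuffle once with a fixed seed and do not reshuffle
# between epochs, so every run sees the samples in the same sequence.
features = np.random.default_rng(SEED).normal(size=(1000, 16)).astype("float32")
labels = (features.sum(axis=1) > 0).astype("int32")
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000, seed=SEED, reshuffle_each_iteration=False)
    .batch(32)
)
```

Even with all of these controls in place, the abstract notes that GPU scheduling and cuDNN's non-deterministic algorithms can still introduce run-to-run variation, which is why the authors compare average performance across experiments rather than individual epochs.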
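As a companion to item 4, here is a hedged toy example of what ML/MAP estimation over a single i.i.d. deletion channel means, using brute force over short binary inputs. It only illustrates the estimation problem itself; the continuous relaxation and exact symbolwise posterior computations described in that work are not reproduced, and all names below are hypothetical.

```python
# Toy illustration (not the cited paper's method) of ML/MAP estimation over a
# single deletion channel with i.i.d. deletions. For input x and observation y,
# P(y | x) = p^(len(x) - len(y)) * (1 - p)^len(y) * E(x, y), where E(x, y) is
# the number of distinct ways y embeds in x as a subsequence. With a uniform
# prior over inputs of known length, the MAP and ML estimates coincide.
from functools import lru_cache
from itertools import product


def embedding_count(x: str, y: str) -> int:
    """Count the distinct subsequence embeddings of y in x (standard DP)."""
    @lru_cache(maxsize=None)
    def count(i: int, j: int) -> int:
        if j == len(y):
            return 1          # all of y has been matched
        if i == len(x):
            return 0          # x exhausted before matching y
        total = count(i + 1, j)               # x[i] was deleted
        if x[i] == y[j]:
            total += count(i + 1, j + 1)      # x[i] survived as y[j]
        return total

    return count(0, 0)


def map_estimate(y: str, n: int, p: float = 0.1):
    """Brute-force MAP estimate over all binary inputs of length n."""
    best_x, best_score = None, -1.0
    for bits in product("01", repeat=n):
        x = "".join(bits)
        likelihood = (p ** (n - len(y))) * ((1 - p) ** len(y)) * embedding_count(x, y)
        if likelihood > best_score:
            best_x, best_score = x, likelihood
    return best_x, best_score


if __name__ == "__main__":
    # Observation "0110" from an unknown length-5 input, deletion probability 0.1.
    print(map_estimate("0110", n=5))
```

The brute force is exponential in the input length, which is exactly why the relaxation and posterior results summarized in item 4 matter; the toy is useful only for building intuition on very short sequences.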