skip to main content


Title: Robustly-reliable learners under poisoning attacks
Data poisoning attacks, in which an adversary corrupts a training set with the goal of inducing specific desired mistakes, have raised substantial concern: even just the possibility of such an attack can make a user no longer trust the results of a learning system. In this work, we analyze when strong robustness guarantees can be achieved even in the face of such attacks. We define and show how to provide robustly-reliable predictions, in which the predicted label is guaranteed to be correct so long as the adversary has not exceeded a given corruption budget, even in the presence of instance targeted attacks, where the adversary aims to cause a failure on specific test examples. Our guarantees are substantially stronger than those in prior approaches, which were only able to provide certificates that the prediction of the learning algorithm does not change, as opposed to certifying that the prediction is correct, as we do here. Remarkably, we provide a complete characterization of learnability in this setting, in particular, nearly-tight matching upper and lower bounds on the region that can be certified, as well as efficient algorithms for computing this region given an ERM oracle. Moreover, for the case of linear separators over logconcave distributions, we provide efficient truly polynomial time algorithms (i.e., non-oracle algorithms) for such robustly-reliable predictions. We also extend these results to the active setting where the algorithm adaptively asks for labels of specific informative examples, and the difficulty is that the adversary might even be adaptive to this interaction, as well as to the agnostic learning setting where there is no perfect classifier even over the uncorrupted data.  more » « less
Award ID(s):
1815011
NSF-PAR ID:
10377059
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Conference on Learning Theory
Volume:
35
Page Range / eLocation ID:
4498-4534
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, I. (Ed.)
    The Neural Engineering Data Consortium (NEDC) is developing the Temple University Digital Pathology Corpus (TUDP), an open source database of high-resolution images from scanned pathology samples [1], as part of its National Science Foundation-funded Major Research Instrumentation grant titled “MRI: High Performance Digital Pathology Using Big Data and Machine Learning” [2]. The long-term goal of this project is to release one million images. We have currently scanned over 100,000 images and are in the process of annotating breast tissue data for our first official corpus release, v1.0.0. This release contains 3,505 annotated images of breast tissue including 74 patients with cancerous diagnoses (out of a total of 296 patients). In this poster, we will present an analysis of this corpus and discuss the challenges we have faced in efficiently producing high quality annotations of breast tissue. It is well known that state of the art algorithms in machine learning require vast amounts of data. Fields such as speech recognition [3], image recognition [4] and text processing [5] are able to deliver impressive performance with complex deep learning models because they have developed large corpora to support training of extremely high-dimensional models (e.g., billions of parameters). Other fields that do not have access to such data resources must rely on techniques in which existing models can be adapted to new datasets [6]. A preliminary version of this breast corpus release was tested in a pilot study using a baseline machine learning system, ResNet18 [7], that leverages several open-source Python tools. The pilot corpus was divided into three sets: train, development, and evaluation. Portions of these slides were manually annotated [1] using the nine labels in Table 1 [8] to identify five to ten examples of pathological features on each slide. Not every pathological feature is annotated, meaning excluded areas can include focuses particular to these labels that are not used for training. A summary of the number of patches within each label is given in Table 2. To maintain a balanced training set, 1,000 patches of each label were used to train the machine learning model. Throughout all sets, only annotated patches were involved in model development. The performance of this model in identifying all the patches in the evaluation set can be seen in the confusion matrix of classification accuracy in Table 3. The highest performing labels were background, 97% correct identification, and artifact, 76% correct identification. A correlation exists between labels with more than 6,000 development patches and accurate performance on the evaluation set. Additionally, these results indicated a need to further refine the annotation of invasive ductal carcinoma (“indc”), inflammation (“infl”), nonneoplastic features (“nneo”), normal (“norm”) and suspicious (“susp”). This pilot experiment motivated changes to the corpus that will be discussed in detail in this poster presentation. To increase the accuracy of the machine learning model, we modified how we addressed underperforming labels. One common source of error arose with how non-background labels were converted into patches. Large areas of background within other labels were isolated within a patch resulting in connective tissue misrepresenting a non-background label. In response, the annotation overlay margins were revised to exclude benign connective tissue in non-background labels. Corresponding patient reports and supporting immunohistochemical stains further guided annotation reviews. The microscopic diagnoses given by the primary pathologist in these reports detail the pathological findings within each tissue site, but not within each specific slide. The microscopic diagnoses informed revisions specifically targeting annotated regions classified as cancerous, ensuring that the labels “indc” and “dcis” were used only in situations where a micropathologist diagnosed it as such. Further differentiation of cancerous and precancerous labels, as well as the location of their focus on a slide, could be accomplished with supplemental immunohistochemically (IHC) stained slides. When distinguishing whether a focus is a nonneoplastic feature versus a cancerous growth, pathologists employ antigen targeting stains to the tissue in question to confirm the diagnosis. For example, a nonneoplastic feature of usual ductal hyperplasia will display diffuse staining for cytokeratin 5 (CK5) and no diffuse staining for estrogen receptor (ER), while a cancerous growth of ductal carcinoma in situ will have negative or focally positive staining for CK5 and diffuse staining for ER [9]. Many tissue samples contain cancerous and non-cancerous features with morphological overlaps that cause variability between annotators. The informative fields IHC slides provide could play an integral role in machine model pathology diagnostics. Following the revisions made on all the annotations, a second experiment was run using ResNet18. Compared to the pilot study, an increase of model prediction accuracy was seen for the labels indc, infl, nneo, norm, and null. This increase is correlated with an increase in annotated area and annotation accuracy. Model performance in identifying the suspicious label decreased by 25% due to the decrease of 57% in the total annotated area described by this label. A summary of the model performance is given in Table 4, which shows the new prediction accuracy and the absolute change in error rate compared to Table 3. The breast tissue subset we are developing includes 3,505 annotated breast pathology slides from 296 patients. The average size of a scanned SVS file is 363 MB. The annotations are stored in an XML format. A CSV version of the annotation file is also available which provides a flat, or simple, annotation that is easy for machine learning researchers to access and interface to their systems. Each patient is identified by an anonymized medical reference number. Within each patient’s directory, one or more sessions are identified, also anonymized to the first of the month in which the sample was taken. These sessions are broken into groupings of tissue taken on that date (in this case, breast tissue). A deidentified patient report stored as a flat text file is also available. Within these slides there are a total of 16,971 total annotated regions with an average of 4.84 annotations per slide. Among those annotations, 8,035 are non-cancerous (normal, background, null, and artifact,) 6,222 are carcinogenic signs (inflammation, nonneoplastic and suspicious,) and 2,714 are cancerous labels (ductal carcinoma in situ and invasive ductal carcinoma in situ.) The individual patients are split up into three sets: train, development, and evaluation. Of the 74 cancerous patients, 20 were allotted for both the development and evaluation sets, while the remain 34 were allotted for train. The remaining 222 patients were split up to preserve the overall distribution of labels within the corpus. This was done in hope of creating control sets for comparable studies. Overall, the development and evaluation sets each have 80 patients, while the training set has 136 patients. In a related component of this project, slides from the Fox Chase Cancer Center (FCCC) Biosample Repository (https://www.foxchase.org/research/facilities/genetic-research-facilities/biosample-repository -facility) are being digitized in addition to slides provided by Temple University Hospital. This data includes 18 different types of tissue including approximately 38.5% urinary tissue and 16.5% gynecological tissue. These slides and the metadata provided with them are already anonymized and include diagnoses in a spreadsheet with sample and patient ID. We plan to release over 13,000 unannotated slides from the FCCC Corpus simultaneously with v1.0.0 of TUDP. Details of this release will also be discussed in this poster. Few digitally annotated databases of pathology samples like TUDP exist due to the extensive data collection and processing required. The breast corpus subset should be released by November 2021. By December 2021 we should also release the unannotated FCCC data. We are currently annotating urinary tract data as well. We expect to release about 5,600 processed TUH slides in this subset. We have an additional 53,000 unprocessed TUH slides digitized. Corpora of this size will stimulate the development of a new generation of deep learning technology. In clinical settings where resources are limited, an assistive diagnoses model could support pathologists’ workload and even help prioritize suspected cancerous cases. ACKNOWLEDGMENTS This material is supported by the National Science Foundation under grants nos. CNS-1726188 and 1925494. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. REFERENCES [1] N. Shawki et al., “The Temple University Digital Pathology Corpus,” in Signal Processing in Medicine and Biology: Emerging Trends in Research and Applications, 1st ed., I. Obeid, I. Selesnick, and J. Picone, Eds. New York City, New York, USA: Springer, 2020, pp. 67 104. https://www.springer.com/gp/book/9783030368432. [2] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning.” Major Research Instrumentation (MRI), Division of Computer and Network Systems, Award No. 1726188, January 1, 2018 – December 31, 2021. https://www. isip.piconepress.com/projects/nsf_dpath/. [3] A. Gulati et al., “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2020, pp. 5036-5040. https://doi.org/10.21437/interspeech.2020-3015. [4] C.-J. Wu et al., “Machine Learning at Facebook: Understanding Inference at the Edge,” in Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344. https://ieeexplore.ieee.org/document/8675201. [5] I. Caswell and B. Liang, “Recent Advances in Google Translate,” Google AI Blog: The latest from Google Research, 2020. [Online]. Available: https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html. [Accessed: 01-Aug-2021]. [6] V. Khalkhali, N. Shawki, V. Shah, M. Golmohammadi, I. Obeid, and J. Picone, “Low Latency Real-Time Seizure Detection Using Transfer Deep Learning,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2021, pp. 1 7. https://www.isip. piconepress.com/publications/conference_proceedings/2021/ieee_spmb/eeg_transfer_learning/. [7] J. Picone, T. Farkas, I. Obeid, and Y. Persidsky, “MRI: High Performance Digital Pathology Using Big Data and Machine Learning,” Philadelphia, Pennsylvania, USA, 2020. https://www.isip.piconepress.com/publications/reports/2020/nsf/mri_dpath/. [8] I. Hunt, S. Husain, J. Simons, I. Obeid, and J. Picone, “Recent Advances in the Temple University Digital Pathology Corpus,” in Proceedings of the IEEE Signal Processing in Medicine and Biology Symposium (SPMB), 2019, pp. 1–4. https://ieeexplore.ieee.org/document/9037859. [9] A. P. Martinez, C. Cohen, K. Z. Hanley, and X. (Bill) Li, “Estrogen Receptor and Cytokeratin 5 Are Reliable Markers to Separate Usual Ductal Hyperplasia From Atypical Ductal Hyperplasia and Low-Grade Ductal Carcinoma In Situ,” Arch. Pathol. Lab. Med., vol. 140, no. 7, pp. 686–689, Apr. 2016. https://doi.org/10.5858/arpa.2015-0238-OA. 
    more » « less
  2. We present a general, efficient technique for providing contextual predictions that are "multivalid" in various senses, against an online sequence of adversarially chosen examples (x,y). This means that the resulting estimates correctly predict various statistics of the labels y not just marginally --- as averaged over the sequence of examples --- but also conditionally on x in G for any G belonging to an arbitrary intersecting collection of groups. We provide three instantiations of this framework. The first is mean prediction, which corresponds to an online algorithm satisfying the notion of multicalibration from Hebert-Johnson et al. The second is variance and higher moment prediction, which corresponds to an online algorithm satisfying the notion of mean-conditioned moment multicalibration from Jung et al. Finally, we define a new notion of prediction interval multivalidity, and give an algorithm for finding prediction intervals which satisfy it. Because our algorithms handle adversarially chosen examples, they can equally well be used to predict statistics of the residuals of arbitrary point prediction methods, giving rise to very general techniques for quantifying the uncertainty of predictions of black box algorithms, even in an online adversarial setting. When instantiated for prediction intervals, this solves a similar problem as conformal prediction, but in an adversarial environment and with multivalidity guarantees stronger than simple marginal coverage guarantees. 
    more » « less
  3. We study online convex optimization with switching costs, a practically important but also extremely challenging problem due to the lack of complete offline information. By tapping into the power of machine learning (ML) based optimizers, ML-augmented online algorithms (also referred to as expert calibration in this paper) have been emerging as state of the art, with provable worst-case performance guarantees. Nonetheless, by using the standard practice of training an ML model as a standalone optimizer and plugging it into an ML-augmented algorithm, the average cost performance can be highly unsatisfactory. In order to address the "how to learn" challenge, we propose EC-L2O (expert-calibrated learning to optimize), which trains an ML-based optimizer by explicitly taking into account the downstream expert calibrator. To accomplish this, we propose a new differentiable expert calibrator that generalizes regularized online balanced descent and offers a provably better competitive ratio than pure ML predictions when the prediction error is large. For training, our loss function is a weighted sum of two different losses --- one minimizing the average ML prediction error for better robustness, and the other one minimizing the post-calibration average cost. We also provide theoretical analysis for EC-L2O, highlighting that expert calibration can be even beneficial for the average cost performance and that the high-percentile tail ratio of the cost achieved by EC-L2O to that of the offline optimal oracle (i.e., tail cost ratio) can be bounded. Finally, we test EC-L2O by running simulations for sustainable datacenter demand response. Our results demonstrate that EC-L2O can empirically achieve a lower average cost as well as a lower competitive ratio than the existing baseline algorithms. 
    more » « less
  4. The research area of algorithms with predictions has seen recent success showing how to incorporate machine learning into algorithm design to improve performance when the predictions are correct, while retaining worst-case guarantees when they are not. Most previous work has assumed that the algorithm has access to a single predictor. However, in practice, there are many machine learning methods available, often with incomparable generalization guarantees, making it hard to pick a best method a priori. In this work we consider scenarios where multiple predictors are available to the algorithm and the question is how to best utilize them. Ideally, we would like the algorithm's performance to depend on the quality of the {\em best} predictor. However, utilizing more predictions comes with a cost, since we now have to identify which prediction is best. We study the use of multiple predictors for a number of fundamental problems, including matching, load balancing, and non-clairvoyant scheduling, which have been well-studied in the single predictor setting. For each of these problems we introduce new algorithms that take advantage of multiple predictors, and prove bounds on the resulting performance. 
    more » « less
  5. Large-scale quantum computing is a significant threat to classical public-key cryptography. In strong "quantum access" security models, numerous symmetric-key cryptosystems are also vulnerable. We consider classical encryption in a model which grants the adversary quantum oracle access to encryption and decryption, but where the latter is restricted to non-adaptive (i.e., pre-challenge) queries only. We define this model formally using appropriate notions of ciphertext indistinguishability and semantic security (which are equivalent by standard arguments) and call it QCCA1 in analogy to the classical CCA1 security model. Using a bound on quantum random-access codes, we show that the standard PRF- and PRP-based encryption schemes are QCCA1-secure when instantiated with quantum-secure primitives. We then revisit standard IND-CPA-secure Learning with Errors (LWE) encryption and show that leaking just one quantum decryption query (and no other queries or leakage of any kind) allows the adversary to recover the full secret key with constant success probability. In the classical setting, by contrast, recovering the key uses a linear number of decryption queries, and this is optimal. The algorithm at the core of our attack is a (large-modulus version of) the well-known Bernstein-Vazirani algorithm. We emphasize that our results should *not* be interpreted as a weakness of these cryptosystems in their stated security setting (i.e., post-quantum chosen-plaintext secrecy). Rather, our results mean that, if these cryptosystems are exposed to chosen-ciphertext attacks (e.g., as a result of deployment in an inappropriate real-world setting) then quantum attacks are even more devastating than classical ones. 
    more » « less