In offline multiagent reinforcement learning (MARL), agents estimate policies from a given dataset. We study rewardpoisoning attacks in this setting where an exogenous attacker modifies the rewards in the dataset before the agents see the dataset. The attacker wants to guide each agent into a nefarious target policy while minimizing the Lp norm of the reward modification. Unlike attacks on singleagent RL, we show that the attacker can install the target policy as a Markov Perfect Dominant Strategy Equilibrium (MPDSE), which rational agents are guaranteed to follow. This attack can be significantly cheaper than separate singleagent attacks. We show that the attack works on various MARL agents including uncertaintyaware learners, and we exhibit linear programs to efficiently solve the attack problem. We also study the relationship between the structure of the datasets and the minimal attack cost. Our work paves the way for studying defense in offline MARL.
more »
« less
This content will become publicly available on December 11, 2024
What Distributions are Robust to Indiscriminate Poisoning Attacks for Linear Learners?
We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error. Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we further investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners. For theoretical Gaussian distributions, we rigorously characterize the behavior of an optimal poisoning attack, defined as the poisoning strategy that attains the maximum risk of the induced model at a given poisoning budget. Our results prove that linear learners can indeed be robust to indiscriminate poisoning if the classwise data distributions are wellseparated with low variance and the size of the constraint set containing all permissible poisoning points is also small. These findings largely explain the drastic variation in empirical attack performance of the stateoftheart poisoning attacks on linear learners across benchmark datasets, making an important initial step towards understanding the underlying reasons some learning tasks are vulnerable to data poisoning attacks.
more »
« less
 Award ID(s):
 1804603
 NSFPAR ID:
 10474939
 Publisher / Repository:
 Neural Information Processing Systems Foundation
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 Volume:
 2023
 ISSN:
 10495258
 Format(s):
 Medium: X
 Location:
 New Orleans, LA
 Sponsoring Org:
 National Science Foundation
More Like this


A backdoor data poisoning attack is an adversarial attack wherein the attacker injects several watermarked, mislabeled training examples into a training set. The watermark does not impact the testtime performance of the model on typical data; however, the model reliably errs on watermarked examples. To gain a better foundational understanding of backdoor data poisoning attacks, we present a formal theoretical framework within which one can discuss backdoor data poisoning attacks for classification problems. We then use this to analyze important statistical and computational issues surrounding these attacks. On the statistical front, we identify a parameter we call the memorization capacity that captures the intrinsic vulnerability of a learning problem to a backdoor attack. This allows us to argue about the robustness of several natural learning problems to backdoor attacks. Our results favoring the attacker involve presenting explicit constructions of backdoor attacks, and our robustness results show that some natural problem settings cannot yield successful backdoor attacks. From a computational standpoint, we show that under certain assumptions, adversarial training can detect the presence of backdoors in a training set. We then show that under similar assumptions, two closely related problems we call backdoor filtering and robust generalization are nearly equivalent. This implies that it is both asymptotically necessary and sufficient to design algorithms that can identify watermarked examples in the training set in order to obtain a learning algorithm that both generalizes well to unseen data and is robust to backdoors.more » « less

The pervasiveness of neural networks (NNs) in critical computer vision and image processing applications makes them very attractive for adversarial manipulation. A large body of existing research thoroughly investigates two broad categories of attacks targeting the integrity of NN models. The first category of attacks, commonly called Adversarial Examples, perturbs the model's inference by carefully adding noise into input examples. In the second category of attacks, adversaries try to manipulate the model during the training process by implanting Trojan backdoors. Researchers show that such attacks pose severe threats to the growing applications of NNs and propose several defenses against each attack type individually. However, such onesided defense approaches leave potentially unknown risks in realworld scenarios when an adversary can unify different attacks to create new and more lethal ones bypassing existing defenses. In this work, we show how to jointly exploit adversarial perturbation and model poisoning vulnerabilities to practically launch a new stealthy attack, dubbed AdvTrojan. AdvTrojan is stealthy because it can be activated only when: 1) a carefully crafted adversarial perturbation is injected into the input examples during inference, and 2) a Trojan backdoor is implanted during the training process of the model. We leverage adversarial noise in the input space to move Trojaninfected examples across the model decision boundary, making it difficult to detect. The stealthiness behavior of AdvTrojan fools the users into accidentally trusting the infected model as a robust classifier against adversarial examples. AdvTrojan can be implemented by only poisoning the training data similar to conventional Trojan backdoor attacks. Our thorough analysis and extensive experiments on several benchmark datasets show that AdvTrojan can bypass existing defenses with a success rate close to 100% in most of our experimental scenarios and can be extended to attack federated learning as well as highresolution images.more » « less

Belkin, M. ; Kpotufe, S. (Ed.)We study the problem of robust learning under cleanlabel datapoisoning attacks, where the attacker injects (an arbitrary set of) correctlylabeled examples to the training set to fool the algorithm into making mistakes on specific test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on ε in its PAC sample complexity, i.e., O(1/ε). On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to add at most t >0 poison examples, the optimal robust learning sample complexity grows almost linearly with t.more » « less

Unsupervised domain adaptation (UDA) enables crossdomain learning without target domain labels by transferring knowledge from a labeled source domain whose distribution differs from the target. However, UDA is not always successful and several accounts of ‘negative transfer’ have been reported in the literature. In this work, we prove a simple lower bound on the target domain error that complements the existing upper bound. Our bound shows the insufficiency of minimizing source domain error and marginal distribution mismatch for a guaranteed reduction in the target domain error, due to the possible increase of induced labeling function mismatch. This insufficiency is further illustrated through simple distributions for which the same UDA approach succeeds, fails, and may succeed or fail with an equal chance. Motivated from this, we propose novel data poisoning attacks to fool UDA methods into learning representations that produce large target domain errors. We evaluate the effect of these attacks on popular UDA methods using benchmark datasets where they have been previously shown to be successful. Our results show that poisoning can significantly decrease the target domain accuracy, dropping it to almost 0% in some cases, with the addition of only 10% poisoned data in the source domain. The failure of UDA methods demonstrates the limitations of UDA at guaranteeing crossdomain generalization consistent with the lower bound. Thus, evaluation of UDA methods in adversarial settings such as data poisoning can provide a better sense of their robustness in scenarios unfavorable for UDA.more » « less