NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Learning and Forgetting Unsafe Examples in Large Language Models

Zhao, Jiachen; Deng, Zhun; Madras, David; Zou, James; Ren, Mengye (July 2024, International Conference on Machine Learning (ICML))

As the number of large language models (LLMs) released to the public grows, there is a pressing need to understand the safety implications associated with these models learning from third-party custom finetuning data. We explore the behavior of LLMs finetuned on noisy custom data containing unsafe content, represented by datasets that contain biases, toxicity, and harmfulness, finding that while aligned LLMs can readily learn this unsafe content, they also tend to forget it more significantly than other examples when subsequently finetuned on safer content. Drawing inspiration from the discrepancies in forgetting, we introduce the “ForgetFilter” algorithm, which filters unsafe data based on how strong the model’s forgetting signal is for that data. We demonstrate that the ForgetFilter algorithm ensures safety in customized finetuning without compromising downstream task performance, unlike sequential safety finetuning. ForgetFilter outperforms alternative strategies like replay and moral self-correction in curbing LLMs’ ability to assimilate unsafe content during custom finetuning, e.g. 75% lower than not applying any safety measures and 62% lower than using self-correction in toxicity score.
more » « less
Full Text Available
Reinforcement Learning with Stepwise Fairness Constraints

Deng, Zhun; Sun, He; Wu, Steven; Zhang, Linjun; Parkes, David (April 2023, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics)

Full Text Available
Reinforcement Learning with Stepwise Fairness Constraints

Deng, Zhun; Sun, He; Wu, Steven; Zhang, Linjun; Parkes, David (April 2023, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics)

Full Text Available
HappyMap : A Generalized Multicalibration Method

Deng, Zhun; Dwork, Cynthia; Zhang, Linjun (January 2023, 14th Innovations in Theoretical Computer Science Conference (ITCS 2023))

Full Text Available
Understanding Multimodal Contrastive Learning and Incorporating Unpaired Data

Nakada, Ryumei; Gulluk, Halil Ibrahim; Deng, Zhun; Ji, Wenlong; Zou, James; Zhang, Linjun (April 2023, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics)

Full Text Available
FIFA: Making Fairness More Generalizable in Classifiers Trained on Imbalanced Data

Deng, Zhun; Zhang, Jiayao; Zhang, Linjun; Ye, Ting; Coley, Yates; Su, Weijie; Zou, James (April 2023, International Conference on Learning Representations 2023)

Full Text Available
When and How Mixup Improves Calibration

Zhang, Linjun; Deng, Zhun; Kawaguchi, Kenji; Zou, James (July 2022, International Conference on Machine Learning)

Full Text Available
Understanding Dynamics of Nonlinear Representation Learning and Its Application

https://doi.org/10.1162/neco_a_01483

Kawaguchi, Kenji; Zhang, Linjun; Deng, Zhun (March 2022, Neural Computation)

Abstract Representations of the world environment play a crucial role in artificial intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a deep neural network learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. This happens implicitly during training through minimizing a supervised or unsupervised loss. In this letter, we study the dynamics of such implicit nonlinear representation learning. We identify a pair of a new assumption and a novel condition, called the on-model structure assumption and the data architecture alignment condition. Under the on-model structure assumption, the data architecture alignment condition is shown to be sufficient for the global convergence and necessary for global optimality. Moreover, our theory explains how and when increasing network size does and does not improve the training behaviors in the practical regime. Our results provide practical guidance for designing a model structure; for example, the on-model structure assumption can be used as a justification for using a particular model structure instead of others. As an application, we then derive a new training framework, which satisfies the data architecture alignment condition without assuming it by automatically modifying any given training algorithm dependent on data and architecture. Given a standard training algorithm, the framework running its modified version is empirically shown to maintain competitive (practical) test performances while providing global convergence guarantees for deep residual neural networks with convolutions, skip connections, and batch normalization with standard benchmark data sets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN.
more » « less
Full Text Available
Improving Adversarial Robustness via Unlabeled Out-of-Domain Data

Deng, Zhun; Zhang, Linjun; Ghorbani, Amirata; Zou, James (April 2021, Proceedings of Machine Learning Research)

Data augmentation by incorporating cheap unlabeled data from multiple domains is a powerful way to improve prediction especially when there is limited labeled data. In this work, we investigate how adversarial robustness can be enhanced by leveraging out-of-domain unlabeled data. We demonstrate that for broad classes of distributions and classifiers, there exists a sample complexity gap between standard and robust classification. We quantify the extent to which this gap can be bridged by leveraging unlabeled samples from a shifted domain by providing both upper and lower bounds. Moreover, we show settings where we achieve better adversarial robustness when the unlabeled data come from a shifted domain rather than the same domain as the labeled data. We also investigate how to leverage out-of-domain data when some structural information, such as sparsity, is shared between labeled and unlabeled domains. Experimentally, we augment object recognition datasets (CIFAR-10, CINIC-10, and SVHN) with easy-to-obtain and unlabeled out-of-domain data and demonstrate substantial improvement in the model’s robustness against l_infty adversarial attacks on the original domain.
more » « less
Full Text Available
How Does Mixup Help With Robustness and Generalization

Zhang, Linjun; Deng, Zhun; Kawaguchi, Kenji; Ghorbani, Amirata; Zou, James (April 2021, The International Conference on Learning Representations)

Full Text Available

« Prev Next »

Search for: All records