Title: LANCE: stress-testing visual models by generating language-guided counterfactual images
We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE). Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights. We benchmark the performance of a diverse set of pretrained models on our generated data and observe significant and consistent performance drops. We further analyze model sensitivity across different types of edits, and demonstrate its applicability to surfacing previously unknown class-level model biases in ImageNet.
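To make the described pipeline concrete, here is a minimal sketch of a language-guided counterfactual stress test of the kind the abstract outlines. It is not the authors' released code: caption_image, perturb_caption, and edit_image are hypothetical placeholders for an image captioner, an LLM-based caption perturber, and a text-guided image editor, and the flip-rate metric is an illustrative choice.

```python
# Minimal sketch of a language-guided counterfactual stress test.
# The three helper callables are hypothetical placeholders, not the authors' code.
import torch

def stress_test(model, dataset, caption_image, perturb_caption, edit_image):
    """Measure how often a label-preserving, language-guided edit flips a correct prediction."""
    flips, total = 0, 0
    for image, label in dataset:                      # dataset yields (tensor image, int label)
        caption = caption_image(image)                # hypothetical: off-the-shelf captioner
        new_caption = perturb_caption(caption)        # hypothetical: LLM edits one attribute
        edited = edit_image(image, caption, new_caption)  # hypothetical: text-guided editor

        with torch.no_grad():
            pred_orig = model(image.unsqueeze(0)).argmax(dim=-1).item()
            pred_edit = model(edited.unsqueeze(0)).argmax(dim=-1).item()

        if pred_orig == label:                        # only count images the model got right
            total += 1
            if pred_edit != label:                    # a flip indicates sensitivity to the edit
                flips += 1
    return flips / max(total, 1)
```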
Award ID(s):
2144194
PAR ID:
10514116
Author(s) / Creator(s):
Publisher / Repository:
Neural Information Processing Systems
Date Published:
Journal Name:
International Conference on Neural Information Processing Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Alam, Mohammad S.; Asari, Vijayan K. (Ed.)
    Iris recognition is one of the well-known areas of biometric research. However, in real-world scenarios, subjects may not always provide fully open eyes, which can negatively impact the performance of existing systems. Detecting blinking eyes in iris images is therefore crucial to ensuring reliable biometric data. In this paper, we propose a deep learning-based method using a convolutional neural network to classify blinking eyes in off-angle iris images into four categories: fully-blinked, half-blinked, half-opened, and fully-opened. The dataset used in our experiments includes 6500 images of 113 subjects and contains a mixture of frontal and off-angle views of the eyes, with gaze angles ranging from -50 to 50 degrees. We train and test our approach using both frontal and off-angle images and achieve high classification performance for both types of images. Compared to training the network with only frontal images, our approach shows significantly better performance when tested on off-angle images. These findings suggest that training the model with a more diverse set of off-angle images can improve its performance on off-angle blink detection, which is crucial for real-world applications where iris images are often captured at different angles. Overall, the deep learning-based blink detection method can be used as a standalone algorithm or integrated into existing standoff biometrics frameworks to improve their accuracy and reliability, particularly in scenarios where subjects may blink.
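As a purely illustrative starting point, a four-way eye-state classifier like the one described above can be prototyped with a small convolutional network in PyTorch; the input resolution, channel counts, and depth below are assumptions, since the item does not specify the architecture.

```python
# Minimal PyTorch sketch of a 4-class eye-state classifier
# (fully-blinked, half-blinked, half-opened, fully-opened).
# Input size and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BlinkClassifier(nn.Module):
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Linear(64 * 4 * 4, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: grayscale eye crops, shape (batch, 1, H, W)
        return self.classifier(self.features(x).flatten(1))

# Example: class scores for a batch of 8 grayscale 128x128 eye crops -> shape (8, 4).
logits = BlinkClassifier()(torch.randn(8, 1, 128, 128))
```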
  2. Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task-specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.
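The core split-construction idea, coordinate ascent over a sub-task-specific utility, can be sketched roughly as follows. This is not the authors' released tooling: the swap move, iteration budget, and the unseen-speaker utility are illustrative stand-ins.

```python
# Rough sketch of building a held-out split by coordinate ascent on a
# sub-task-specific utility (e.g., rewarding unseen speakers in the test set).
# The move and utility shown are illustrative stand-ins.
import random

def coordinate_ascent_split(examples, utility, test_size, iters=1000, seed=0):
    """Greedily swap examples between train and test to increase utility(test, train)."""
    rng = random.Random(seed)
    pool = list(examples)
    rng.shuffle(pool)
    test, train = pool[:test_size], pool[test_size:]
    best = utility(test, train)
    for _ in range(iters):
        i, j = rng.randrange(len(test)), rng.randrange(len(train))
        test[i], train[j] = train[j], test[i]          # propose swapping one pair
        score = utility(test, train)
        if score > best:
            best = score                               # keep the improving swap
        else:
            test[i], train[j] = train[j], test[i]      # revert a non-improving swap
    return test, train

# Illustrative utility: fraction of test examples whose speaker is unseen in training.
def unseen_speaker_utility(test, train):
    train_speakers = {ex["speaker"] for ex in train}
    return sum(ex["speaker"] not in train_speakers for ex in test) / max(len(test), 1)
```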
  3. This paper introduces a novel approach for learning natural language descriptions of scenery in Minecraft. We apply techniques from Computer Vision and Natural Language Processing to create an AI framework called MineObserver for assessing the accuracy of learner-generated descriptions of science-related images. The ultimate purpose of the system is to automatically assess the accuracy of learner observations, written in natural language, during science learning activities that take place in Minecraft. Eventually, MineObserver will be used as part of a pedagogical agent framework to provide in-game support for learning. Preliminary results are mixed but promising, with approximately 62% of the images in our test set properly classified by our image captioning approach. Broadly, our work suggests that computer vision techniques work as expected in Minecraft and can serve as a basis for assessing learner observations.
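A rough sketch of the "caption the screenshot, then compare it against the learner's observation" idea is given below; generate_caption is a hypothetical stand-in for the captioning model, and the word-overlap score and threshold are crude illustrative proxies rather than the paper's actual assessment metric.

```python
# Illustrative sketch: compare a model-generated caption of a Minecraft
# screenshot with a learner's written observation using word overlap.
# generate_caption is a hypothetical stand-in; the metric and threshold are assumptions.
def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)  # Jaccard similarity over words

def assess_observation(image, learner_text: str, generate_caption, threshold: float = 0.3) -> bool:
    """Return True if the learner's observation roughly matches the generated caption."""
    caption = generate_caption(image)
    return token_overlap(caption, learner_text) >= threshold
```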
  4. In this paper, we propose and compare two novel deep generative model-based approaches for the design representation, reconstruction, and generation of porous metamaterials characterized by complex and fully connected solid and pore networks. A highly diverse porous metamaterial database is curated, with each sample represented by solid and pore phase graphs and a voxel image. All metamaterial samples adhere to the requirement of complete connectivity in both pore and solid phases. The first approach employs a Dual Decoder Variational Graph Autoencoder to generate both solid phase and pore phase graphs. The second approach employs a Variational Graph Autoencoder for reconstructing/generating the nodes in the solid phase and pore phase graphs and a Transformer-based Large Language Model (LLM) for reconstructing/generating the connections, i.e., the edges among the nodes. A comparative study is conducted, and we find that both approaches achieve high accuracy in reconstructing node features, while the LLM exhibits superior performance in reconstructing edge features. Reconstruction accuracy is also validated by voxel-to-voxel comparison between the reconstructions and the original images in the test set. Additionally, discussions on the advantages and limitations of using LLMs in metamaterial design generation, along with the rationale behind their utilization, are provided.
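A dual-decoder variational graph autoencoder of the kind described above can be outlined in plain PyTorch as in the sketch below; the shared encoder, inner-product edge decoders, and all dimensions are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal PyTorch sketch of a dual-decoder variational graph autoencoder:
# a shared encoder with one latent vector per node, and two inner-product
# decoders reconstructing solid-phase and pore-phase adjacencies.
# All dimensions and decoder choices are illustrative assumptions.
import torch
import torch.nn as nn

class DualDecoderVGAE(nn.Module):
    def __init__(self, in_dim: int, hidden: int = 64, latent: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        # Separate projections feed the solid-phase and pore-phase decoders.
        self.solid_head = nn.Linear(latent, latent)
        self.pore_head = nn.Linear(latent, latent)

    def forward(self, node_feats: torch.Tensor):
        # node_feats: (num_nodes, in_dim) features for one metamaterial's graph nodes
        h = self.encoder(node_feats)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        zs, zp = self.solid_head(z), self.pore_head(z)
        adj_solid = torch.sigmoid(zs @ zs.t())  # predicted solid-phase edge probabilities
        adj_pore = torch.sigmoid(zp @ zp.t())   # predicted pore-phase edge probabilities
        return adj_solid, adj_pore, mu, logvar

# Example: reconstruct both phase graphs for a sample with 32 nodes and 8-dim features.
adj_s, adj_p, mu, logvar = DualDecoderVGAE(in_dim=8)(torch.randn(32, 8))
```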