Title: LANCE: stress-testing visual models by generating language-guided counterfactual images
We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE). Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights. We benchmark the performance of a diverse set of pretrained models on our generated data and observe significant and consistent performance drops. We further analyze model sensitivity across different types of edits, and demonstrate its applicability to surfacing previously unknown class-level model biases in ImageNet.
Award ID(s): 2144194
PAR ID: 10514116
Publisher / Repository: Neural Information Processing Systems
Journal Name: International Conference on Neural Information Processing Systems
Format(s): Medium: X
Sponsoring Org: National Science Foundation
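To make the described pipeline concrete, here is a minimal, illustrative sketch of a LANCE-style stress test. The captioner, LLM-based caption perturber, text-based image editor, and classifier under test are passed in as callables; all names and the toy stand-ins below are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of a LANCE-style stress test (not the authors' code).
# The captioner, LLM-based caption perturber, text-based image editor, and
# classifier are supplied as callables; the names here are hypothetical.

from typing import Callable, Iterable, List, Tuple


def stress_test(
    classify: Callable[[object], str],      # image -> predicted label
    caption: Callable[[object], str],       # image -> caption
    perturb: Callable[[str], List[str]],    # caption -> edited captions (via an LLM)
    edit: Callable[[object, str], object],  # (image, edited caption) -> counterfactual image
    images: Iterable[object],
) -> Tuple[int, int]:
    """Count how often the model's prediction flips on counterfactual edits."""
    total, flips = 0, 0
    for img in images:
        original_pred = classify(img)
        for new_caption in perturb(caption(img)):
            counterfactual = edit(img, new_caption)
            total += 1
            if classify(counterfactual) != original_pred:
                flips += 1
    return flips, total


if __name__ == "__main__":
    # Trivial string-based stand-ins so the sketch runs end to end.
    flips, total = stress_test(
        classify=lambda img: "dog" if "snow" not in img else "wolf",
        caption=lambda img: img,
        perturb=lambda cap: [cap + " in snow"],
        edit=lambda img, cap: cap,
        images=["a dog on grass", "a dog on a beach"],
    )
    print(f"prediction flips: {flips}/{total}")
```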
More Like this
  1. Alam, Mohammad S.; Asari, Vijayan K. (Ed.)
    Iris recognition is one of the well-known areas of biometric research. In real-world scenarios, however, subjects may not always present fully open eyes, which can degrade the performance of existing systems; detecting blinking eyes in iris images is therefore crucial for ensuring reliable biometric data. In this paper, we propose a deep learning-based method using a convolutional neural network to classify blinking eyes in off-angle iris images into four categories: fully-blinked, half-blinked, half-opened, and fully-opened. The dataset used in our experiments includes 6,500 images of 113 subjects and covers both frontal and off-angle views of the eyes at gaze angles from -50 to 50 degrees. We train and test our approach on both frontal and off-angle images and achieve high classification performance for both. Compared to training the network with only frontal images, our approach performs significantly better when tested on off-angle images. These findings suggest that training the model with a more diverse set of off-angle images can improve its performance for off-angle blink detection, which is crucial for real-world applications where iris images are often captured at different angles. Overall, the deep learning-based blink detection method can be used as a standalone algorithm or integrated into existing standoff biometrics frameworks to improve their accuracy and reliability, particularly in scenarios where subjects may blink.
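The abstract does not specify the network used; the sketch below is a minimal PyTorch stand-in for a four-class eye-state classifier (fully-blinked, half-blinked, half-opened, fully-opened), with layer sizes and input resolution chosen purely for illustration.

```python
# Minimal PyTorch sketch of a 4-class eye-state classifier; the architecture
# and input size are illustrative assumptions, not the paper's actual network.
import torch
import torch.nn as nn

CLASSES = ["fully-blinked", "half-blinked", "half-opened", "fully-opened"]


class BlinkStateCNN(nn.Module):
    def __init__(self, num_classes: int = len(CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))


if __name__ == "__main__":
    model = BlinkStateCNN()
    dummy = torch.randn(2, 1, 128, 128)  # batch of grayscale eye crops
    probs = model(dummy).softmax(dim=-1)
    print([CLASSES[i] for i in probs.argmax(dim=-1).tolist()])
```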
  2. Image segmentation of the liver is an important step in treatment planning for liver cancer. However, manual segmentation at a large scale is not practical, leading to increasing reliance on deep learning models to automatically segment the liver. This manuscript develops a generalizable deep learning model to segment the liver on T1-weighted MR images. In particular, three distinct deep learning architectures (nnUNet, PocketNet, Swin UNETR) were considered using data gathered from six geographically different institutions. A total of 819 T1-weighted MR images were gathered from both public and internal sources. Our experiments compared each architecture’s testing performance when trained both intra-institutionally and inter-institutionally. Models trained using nnUNet and its PocketNet variant achieved mean Dice-Sørensen similarity coefficients > 0.9 on both intra- and inter-institutional test set data. The performance of these models suggests that nnUNet and PocketNet liver segmentation models trained on a large and diverse collection of T1-weighted MR images would, on average, achieve good intra-institutional segmentation performance.
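For reference, a short NumPy sketch of the Dice-Sørensen similarity coefficient reported above, computed for binary masks; the segmentation architectures themselves (nnUNet, PocketNet, Swin UNETR) are not reproduced here, and the toy masks are purely illustrative.

```python
# Dice-Sørensen similarity coefficient for binary segmentation masks:
# DSC = 2|A ∩ B| / (|A| + |B|). Illustrative only; not the paper's code.
import numpy as np


def dice_coefficient(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)


if __name__ == "__main__":
    a = np.zeros((64, 64), dtype=bool)
    b = np.zeros((64, 64), dtype=bool)
    a[16:48, 16:48] = True  # predicted mask (toy example)
    b[20:52, 16:48] = True  # ground-truth mask, shifted by 4 pixels
    print(f"Dice = {dice_coefficient(a, b):.3f}")
```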
  3. Decomposable tasks are complex and comprise a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task-specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps of up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.
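A rough sketch of the split-construction idea described above: greedy coordinate ascent that swaps examples between train and test while a user-supplied, sub-task-specific utility improves. The utility function and data below are toy stand-ins, not the paper's implementation.

```python
# Illustrative coordinate-ascent sketch for building a sub-task-specific test
# split: greedily swap examples between train and test while the swap improves
# a user-supplied utility function. Not the paper's implementation.
import random
from typing import Callable, List, Sequence, Set, Tuple


def coordinate_ascent_split(
    examples: Sequence[object],
    utility: Callable[[Set[int]], float],  # scores a candidate test set (by index)
    test_size: int,
    max_rounds: int = 10,
    seed: int = 0,
) -> Tuple[List[object], List[object]]:
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    test = set(rng.sample(indices, test_size))
    best = utility(test)

    for _ in range(max_rounds):
        improved = False
        for out_idx in list(test):
            for in_idx in indices:
                if in_idx in test:
                    continue
                candidate = (test - {out_idx}) | {in_idx}
                score = utility(candidate)
                if score > best:
                    test, best, improved = candidate, score, True
                    break
        if not improved:
            break

    train = [examples[i] for i in indices if i not in test]
    held_out = [examples[i] for i in test]
    return train, held_out


if __name__ == "__main__":
    # Toy utility: prefer test sets whose speakers are unseen in training,
    # mimicking a held-out-speaker split for the speech sub-task.
    data = [("utt%d" % i, "speaker%d" % (i % 4)) for i in range(12)]

    def speaker_utility(test_ids: Set[int]) -> float:
        test_speakers = {data[i][1] for i in test_ids}
        train_speakers = {s for i, (_, s) in enumerate(data) if i not in test_ids}
        return -len(test_speakers & train_speakers)  # fewer shared speakers is better

    train, test = coordinate_ascent_split(data, speaker_utility, test_size=3)
    print("held-out test examples:", test)
```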
  4. This paper introduces a novel approach for learning natural language descriptions of scenery in Minecraft. We apply techniques from Computer Vision and Natural Language Processing to create an AI framework called MineObserver for assessing the accuracy of learner-generated descriptions of science-related images. The ultimate purpose of the system is to automatically assess the accuracy of learner observations, written in natural language, made during science learning activities that take place in Minecraft. Eventually, MineObserver will be used as part of a pedagogical agent framework for providing in-game support for learning. Preliminary results are mixed but promising, with approximately 62% of images in our test set properly classified by our image captioning approach. Broadly, our work suggests that computer vision techniques work as expected in Minecraft and can serve as a basis for assessing learner observations.
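The abstract does not state how learner text is compared with the system's output; purely as an illustrative stand-in, the sketch below scores a learner observation against a machine-generated caption using token overlap. The real MineObserver pipeline relies on a learned image-captioning model, which is not reproduced here.

```python
# Purely illustrative stand-in for caption-based assessment: score a learner's
# written observation against a machine-generated caption of the same image
# using token overlap (Jaccard similarity). Not the MineObserver implementation.
import re


def tokenize(text: str) -> set:
    return set(re.findall(r"[a-z']+", text.lower()))


def observation_score(learner_text: str, generated_caption: str) -> float:
    a, b = tokenize(learner_text), tokenize(generated_caption)
    return len(a & b) / len(a | b) if (a | b) else 0.0


if __name__ == "__main__":
    caption = "a village next to a river surrounded by oak trees"
    observation = "I see a river running past a village with lots of trees"
    print(f"overlap score: {observation_score(observation, caption):.2f}")
```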