skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Self-Consistency of Large Language Models under Ambiguity
Large language models (LLMs) that do not give consistent answers across contexts are problematic when used for tasks with expectations of consistency–e.g. question-answering, explanations, etc. Our work presents an evaluation benchmark for self-consistency in cases of under-specification where two or more answers can be correct. We conduct a series of behavioral experiments on the OpenAI model suite using an ambiguous integer sequence completion task. We find that average consistency ranges from 67% to 82%, far higher than would be predicted if a model’s consistency was random, and increases as model capability improves. Furthermore, we show that models tend to maintain self-consistency across a series of robustness checks, including prompting speaker changes and sequence length changes. These results suggest that self- consistency arises as an emergent capability without specifically training for it. Despite this, we find that models are uncalibrated when judging their own consistency, with models display- ing both over- and under-confidence. We also propose a nonparametric test for determining from token output distribution whether a model assigns non-trivial probability to alternative answers. Using this test, we find that despite increases in self-consistency, models usually place significant weight on alternative, inconsistent answers. This distribution of probability mass provides evidence that even highly self- consistent models internally compute multiple possible responses.  more » « less
Award ID(s):
2046556
PAR ID:
10544337
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
ISSN:
1530-9312
Page Range / eLocation ID:
89 to 105
Format(s):
Medium: X
Location:
Singapore
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning. 
    more » « less
  2. Making a categorical judgment can systematically bias our subsequent perception of the world. We show that these biases are well explained by a self-consistent Bayesian observer whose perceptual inference process is causally conditioned on the preceding choice. We quantitatively validated the model and its key assumptions with a targeted set of three psychophysical experiments, focusing on a task sequence where subjects first had to make a categorical orientation judgment before estimating the actual orientation of a visual stimulus. Subjects exhibited a high degree of consistency between categorical judgment and estimate, which is difficult to reconcile with alternative models in the face of late, memory related noise. The observed bias patterns resemble the well-known changes in subjective preferences associated with cognitive dissonance, which suggests that the brain’s inference processes may be governed by a universal self-consistency constraint that avoids entertaining ‘dissonant’ interpretations of the evidence. 
    more » « less
  3. Large language models (LLMs) have achieved widespread success on a variety of in-context few shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criteria for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self consistency that are particularly important for multi-step reasoning – hypothetical consistency (a model’s ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model’s final outputs when intermediate sub-steps are replaced with the model’s outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks. 
    more » « less
  4. null (Ed.)
    Robust estimation of camera motion under the presence of outlier noise is a fundamental problem in robotics and computer vision. Despite existing efforts that focus on detecting motion and scene degeneracies, the best existing approach that builds on Random Consensus Sampling (RANSAC) still has non-negligible failure rate. Since a single failure canlead to the failure of the entire visual simultaneous localization and mapping, it is important to further improve the robust estimation algorithm. We propose a new robust camera motion estimator (RCME) by incorporating two main changes: a model-sample consistency test at the model instantiation stepand an inlier set quality test that verifies model-inlier consistency using differential entropy. We have implemented our RCME algorithm and tested it under many public datasets. The results have shown a consistent reduction in failure rate when comparing to the RANSAC-based Gold Standard approach and two recent variations of RANSAC methods. 
    more » « less
  5. Code Large Language Models (Code LLMs) are being increasingly employed in real-life applications, so evaluating them is critical. While the conventional accuracy evaluates the performance of Code LLMs on a set of individual tasks, their self-consistency across different tasks is overlooked. Intuitively, a trustworthy model should be self-consistent when generating natural language specifications for its own code and generating code for its own specifications. Failure to preserve self-consistency reveals a lack of understanding of the shared semantics underlying natural language and programming language, and therefore undermines the trustworthiness of a model. In this paper, we first formally define the self-consistency of Code LLMs and then design a framework, IdentityChain, which effectively and efficiently evaluates the self-consistency and conventional accuracy of a model at the same time. We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. Furthermore, we show that IdentityChain can be used as a model debugging tool to expose weaknesses of Code LLMs by demonstrating three major weaknesses that we identify in current models using IdentityChain. Our code is available at https://github.com/marcusm117/IdentityChain. 
    more » « less