Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Chen, A; Phang, J; Parrish, A; Padmakumar, V; Zhao, C; Bowman, SR; Cho, K

Citation Details

Large language models (LLMs) have achieved widespread success on a variety of in-context few shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criteria for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self consistency that are particularly important for multi-step reasoning – hypothetical consistency (a model’s ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model’s final outputs when intermediate sub-steps are replaced with the model’s outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks. more »

Award ID(s):: 2046556

PAR ID:: 10542787

Author(s) / Creator(s):: Chen, A; Phang, J; Parrish, A; Padmakumar, V; Zhao, C; Bowman, SR; Cho, K

Publisher / Repository:: OpenReview

Date Published:: 2024-01-01

Journal Name:: Transactions on machine learning research

ISSN:: 2835-8856

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
The DOI is not currently available.

More Like this