This content will become publicly available on July 27, 2026

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
As AI use becomes more common, it is important to measure not just whether these systems are correct, but also whether they know when they are incorrect. We propose a new metric to measure this mismatch between correctness and confidence, compare computer ability with human ability, and show that computers have a long way to go before they are well calibrated.
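The summary above does not spell out the proposed metric. As a generic stand-in for a "mismatch between correctness and confidence", the sketch below computes the standard binned expected calibration error (ECE) over a model's confidence scores; this is ordinary ECE, not the GRACE metric, and the function and variable names are purely illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: frequency-weighted average, over confidence bins,
    of |accuracy in bin - mean confidence in bin|. A generic proxy for the
    correctness/confidence mismatch; not the metric proposed in the paper."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to a bin; confidence 1.0 falls into the last bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy example: highly confident but often wrong answers yield a large mismatch.
print(expected_calibration_error([0.9, 0.95, 0.99, 0.8], [1, 0, 0, 1]))
```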
- Award ID(s): 2403436
- PAR ID: 10608225
- Publisher / Repository: Association for Computational Linguistics
- Date Published:
- ISSN: 0736-587X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We study calibration measures in a sequential prediction setup. In addition to rewarding accurate predictions (completeness) and penalizing incorrect ones (soundness), an important desideratum of calibration measures is truthfulness, a minimal condition for the forecaster not to be incentivized to exploit the system. Formally, a calibration measure is truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. We conduct a taxonomy of existing calibration measures. Perhaps surprisingly, all of them are far from being truthful. We introduce a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE), which is complete and sound, and under which truthful prediction is optimal up to a constant multiplicative factor. In contrast, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. (A toy simulation of this last point appears after this list.)
- Calibration measures quantify how much a forecaster's predictions violate calibration, which requires that forecasts be unbiased conditional on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the two properties, but not both. We introduce a new calibration measure termed subsampled step calibration, StepCE^sub, that is both decision-theoretic and truthful. In particular, on any product distribution, StepCE^sub is truthful up to an O(1) factor, whereas prior decision-theoretic calibration measures suffer from an e^{-Ω(T)} vs. Ω(√T) truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude c > 0, StepCE^sub is truthful up to an O(√(log(1/c))) factor, while prior decision-theoretic measures have an e^{-Ω(T)} vs. Ω(T^{1/3}) truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting. (A note formalizing the truthfulness gap appears after this list.)
- Soil heat flux plates (SHFPs) are widely used to measure soil heat flux (G_s). However, G_s is often underestimated by SHFP measurements (G_p). Although calibration methods are used, they are not always effective. The objective of this study is to evaluate the effectiveness of a field calibration method applied to various SHFPs installed in a full-canopy maize field. A 5-day measurement period with wet and dry soil conditions was used for calibration, while 80-day and 60-day measurement periods were used for evaluation. Uncorrected SHFP values (G_p) underestimated the actual reference G_s determined by the gradient method (G_s_grad) by 42%–64%. G_p values in the evaluation period were corrected (G_p_corr) by dividing them by the ratio G_p/G_s_grad determined over the calibration period. After the correction, G_p_corr agreed well with G_s_grad, with G_p_corr/G_s_grad of four of the six SHFPs being 0.90–1.01 and values overall improving to 74%–98%. The field calibration performed approximately the same for the wet and dry calibration periods, and whether the calibration and evaluation periods were consecutive in time or separated by relatively long intervals, indicating that this method accounted for almost all errors in the SHFPs. This is largely due to the slight variation in soil thermal conductivity and the linearity between soil temperature gradients from the SHFP and the gradient method under relatively stable soil moisture conditions. This study deepens our understanding and improves the accuracy of soil heat flux measurements. Calibration of SHFPs under various land covers and weather conditions is warranted in future studies. (A sketch of the ratio correction appears after this list.)
- Combining uncertainty information with AI recommendations supports calibration with domain knowledge. The use of Artificial Intelligence (AI) decision support is increasing in high-stakes contexts, such as healthcare, defense, and finance. Uncertainty information may help users better leverage AI predictions, especially when combined with their domain knowledge. We conducted a human-subject experiment with an online sample to examine the effects of presenting uncertainty information with AI recommendations. The experimental stimuli and task, which included identifying plant and animal images, came from an existing image recognition deep learning model, a popular approach to AI. The uncertainty information was predicted probabilities for whether each label was the true label, presented numerically and visually. In the study, we tested the effect of AI recommendations in a within-subject comparison and uncertainty information in a between-subject comparison. The results suggest that AI recommendations increased both participants' accuracy and confidence. Further, providing uncertainty information significantly increased accuracy but not confidence, suggesting that it may be effective for reducing overconfidence. In this task, participants tended to have higher domain knowledge for animals than plants, based on a self-reported measure of domain knowledge. Participants with more domain knowledge were appropriately less confident when uncertainty information was provided. This suggests that people use AI and uncertainty information differently, such as treating the AI as an expert versus a second opinion, depending on their level of domain knowledge. These results suggest that, if presented appropriately, uncertainty information can potentially decrease the overconfidence induced by using AI recommendations.
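Regarding the first related item above (SSCE): a minimal simulation, not taken from that paper, of the "polynomial penalty" that truthful prediction incurs under a conventional calibration measure. It assumes i.i.d. Bernoulli(1/2) outcomes and uses the ordinary binned calibration error as the stand-in measure; the truthful forecast is then 0.5 every round, and its total penalty grows roughly like √T.

```python
import numpy as np

def binned_calibration_error(preds, outcomes):
    """Ordinary binned calibration error: for each distinct predicted value,
    |mean outcome - prediction|, weighted by how often that value was predicted."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    return sum((preds == p).mean() * abs(outcomes[preds == p].mean() - p)
               for p in np.unique(preds))

rng = np.random.default_rng(0)
for T in (100, 10_000, 1_000_000):
    y = rng.integers(0, 2, size=T)        # i.i.d. Bernoulli(1/2) outcomes
    truthful = np.full(T, 0.5)            # truthful forecast = conditional expectation
    # The unnormalized penalty |#ones - T/2| scales like sqrt(T), i.e. polynomial in T.
    print(T, T * binned_calibration_error(truthful, y))
```

The SSCE abstract's claim is that, unlike such measures, its penalty for the truthful forecaster is within a constant factor of the best achievable; this toy script does not attempt to reproduce that part.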
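Regarding the second related item (subsampled step calibration): a sketch, in our own notation rather than the paper's, of one natural way to read the "truthfulness gap" figures quoted above.

```latex
% Hypothetical formalization (our notation, not necessarily the paper's):
% for a calibration measure CE and a distribution over outcomes y_1, ..., y_T,
% compare the truthful forecaster's expected error with the best achievable one.
\[
  \underbrace{\mathbb{E}\bigl[\mathrm{CE}(p^{\mathrm{truth}},\, y_{1:T})\bigr]}_{\text{truthful error}}
  \quad\text{vs.}\quad
  \underbrace{\min_{p}\ \mathbb{E}\bigl[\mathrm{CE}(p,\, y_{1:T})\bigr]}_{\text{optimal error}},
  \qquad
  p^{\mathrm{truth}}_t = \mathbb{E}\bigl[y_t \mid y_1, \dots, y_{t-1}\bigr].
\]
% On this reading, an "e^{-Omega(T)} vs. Omega(sqrt(T)) truthfulness gap" means the
% optimal error can be exponentially small while the truthful error is Omega(sqrt(T));
% "truthful up to an O(1) factor" means the two quantities differ by at most a constant.
```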
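Regarding the soil-heat-flux item: a sketch of the ratio-based field correction it describes. Variable names, units, and the use of period totals for the ratio are assumptions on our part; the paper's exact averaging may differ.

```python
import numpy as np

def field_correct(gp_eval, gp_cal, gs_grad_cal):
    """Correct heat-flux-plate readings with a calibration-period ratio.

    gp_cal      : plate flux G_p over the calibration period (e.g., W/m^2)
    gs_grad_cal : reference flux G_s_grad from the gradient method, same period
    gp_eval     : plate flux over the evaluation period, to be corrected
    The correction divides G_p by the calibration-period ratio G_p / G_s_grad,
    so a plate that reads systematically low is scaled back up.
    """
    ratio = np.sum(gp_cal) / np.sum(gs_grad_cal)   # assumed: ratio of period totals
    return np.asarray(gp_eval, dtype=float) / ratio

# Toy example: a plate reading ~50% low gets doubled.
gp_cal = np.array([20.0, 25.0, 30.0])
gs_grad_cal = np.array([40.0, 50.0, 60.0])
print(field_correct([22.0, 28.0], gp_cal, gs_grad_cal))   # -> [44. 56.]
```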