This content will become publicly available on July 27, 2026

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
As AI use becomes more common, it is important to measure not just whether these systems are correct, but also whether they know when they are incorrect. We propose a new metric to measure this mismatch between correctness and confidence, compare computer ability with human ability, and show that computers have a long way to go before they are well calibrated.
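The summary above does not spell out the proposed metric. As a generic stand-in for a "mismatch between correctness and confidence", the sketch below computes the standard binned expected calibration error (ECE) over a model's confidence scores; this is ordinary ECE, not the GRACE metric, and the function and variable names are purely illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: frequency-weighted average, over confidence bins,
    of |accuracy in bin - mean confidence in bin|. A generic proxy for the
    correctness/confidence mismatch; not the metric proposed in the paper."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to a bin; confidence 1.0 falls into the last bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy example: highly confident but often wrong answers yield a large mismatch.
print(expected_calibration_error([0.9, 0.95, 0.99, 0.8], [1, 0, 0, 1]))
```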
- Award ID(s): 2403436
- PAR ID: 10608225
- Publisher / Repository: Association for Computational Linguistics
- Date Published:
- ISSN: 0736-587X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We study calibration measures in a sequential prediction setup. In addition to rewarding accurate predictions (completeness) and penalizing incorrect ones (soundness), an important desideratum of calibration measures is truthfulness, a minimal condition for the forecaster not to be incentivized to exploit the system. Formally, a calibration measure is truthful if the forecaster (approximately) minimizes the expected penalty by predicting the conditional expectation of the next outcome, given the prior distribution of outcomes. We conduct a taxonomy of existing calibration measures. Perhaps surprisingly, all of them are far from being truthful. We introduce a new calibration measure termed the Subsampled Smooth Calibration Error (SSCE), which is complete and sound, and under which truthful prediction is optimal up to a constant multiplicative factor. In contrast, under existing calibration measures, there are simple distributions on which a polylogarithmic (or even zero) penalty is achievable, while truthful prediction leads to a polynomial penalty. (A toy simulation of this last point appears after this list.)
- Calibration measures quantify how much a forecaster's predictions violate calibration, which requires that forecasts be unbiased conditional on the forecasted probabilities. Two important desiderata for a calibration measure are its decision-theoretic implications (i.e., downstream decision-makers that best respond to the forecasts are always no-regret) and its truthfulness (i.e., a forecaster approximately minimizes error by always reporting the true probabilities). Existing measures satisfy at most one of the two properties, but not both. We introduce a new calibration measure termed subsampled step calibration, StepCE^sub, that is both decision-theoretic and truthful. In particular, on any product distribution, StepCE^sub is truthful up to an O(1) factor, whereas prior decision-theoretic calibration measures suffer from an e^{-Ω(T)} vs. Ω(√T) truthfulness gap. Moreover, in any smoothed setting where the conditional probability of each event is perturbed by a noise of magnitude c > 0, StepCE^sub is truthful up to an O(√(log(1/c))) factor, while prior decision-theoretic measures have an e^{-Ω(T)} vs. Ω(T^{1/3}) truthfulness gap. We also prove a general impossibility result for truthful decision-theoretic forecasting: any complete and decision-theoretic calibration measure must be discontinuous and non-truthful in the non-smoothed setting. (A note formalizing the truthfulness gap appears after this list.)
- Soil heat flux plates (SHFPs) are widely used to measure soil heat flux (G_s). However, G_s is often underestimated by SHFP measurements (G_p). Although calibration methods are used, they are not always effective. The objective of this study is to evaluate the effectiveness of a field calibration method applied to various SHFPs installed in a full-canopy maize field. A 5-day measurement period with wet and dry soil conditions was used for calibration, while 80-day and 60-day measurement periods were used for evaluation. Uncorrected SHFP values (G_p) underestimated the actual reference G_s determined by the gradient method (G_s_grad) by 42%–64%. G_p values in the evaluation period were corrected (G_p_corr) by dividing them by the ratio G_p/G_s_grad determined over the calibration period. After the correction, G_p_corr agreed well with G_s_grad, with G_p_corr/G_s_grad of four of the six SHFPs being 0.90–1.01 and values overall improving to 74%–98%. The field calibration performed approximately the same for the wet and dry calibration periods, and whether the calibration and evaluation periods were consecutive in time or separated by relatively long intervals, indicating that this method accounted for almost all errors in the SHFPs. This is largely due to the slight variation in soil thermal conductivity and the linearity between soil temperature gradients from the SHFP and the gradient method under relatively stable soil moisture conditions. This study deepens our understanding and improves the accuracy of soil heat flux measurements. Calibration of SHFPs under various land covers and weather conditions is warranted in future studies. (A sketch of the ratio correction appears after this list.)
- Combining uncertainty information with AI recommendations supports calibration with domain knowledge. The use of Artificial Intelligence (AI) decision support is increasing in high-stakes contexts, such as healthcare, defense, and finance. Uncertainty information may help users better leverage AI predictions, especially when combined with their domain knowledge. We conducted a human-subject experiment with an online sample to examine the effects of presenting uncertainty information with AI recommendations. The experimental stimuli and task, which included identifying plant and animal images, came from an existing image recognition deep learning model, a popular approach to AI. The uncertainty information was predicted probabilities for whether each label was the true label, presented numerically and visually. In the study, we tested the effect of AI recommendations in a within-subject comparison and uncertainty information in a between-subject comparison. The results suggest that AI recommendations increased both participants' accuracy and confidence. Further, providing uncertainty information significantly increased accuracy but not confidence, suggesting that it may be effective for reducing overconfidence. In this task, participants tended to have higher domain knowledge for animals than plants, based on a self-reported measure of domain knowledge. Participants with more domain knowledge were appropriately less confident when uncertainty information was provided. This suggests that people use AI and uncertainty information differently, such as treating the AI as an expert versus a second opinion, depending on their level of domain knowledge. These results suggest that, if presented appropriately, uncertainty information can potentially decrease the overconfidence induced by using AI recommendations.
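Regarding the first related item above (SSCE): a minimal simulation, not taken from that paper, of the "polynomial penalty" that truthful prediction incurs under a conventional calibration measure. It assumes i.i.d. Bernoulli(1/2) outcomes and uses the ordinary binned calibration error as the stand-in measure; the truthful forecast is then 0.5 every round, and its total penalty grows roughly like √T.

```python
import numpy as np

def binned_calibration_error(preds, outcomes):
    """Ordinary binned calibration error: for each distinct predicted value,
    |mean outcome - prediction|, weighted by how often that value was predicted."""
    preds, outcomes = np.asarray(preds, float), np.asarray(outcomes, float)
    return sum((preds == p).mean() * abs(outcomes[preds == p].mean() - p)
               for p in np.unique(preds))

rng = np.random.default_rng(0)
for T in (100, 10_000, 1_000_000):
    y = rng.integers(0, 2, size=T)        # i.i.d. Bernoulli(1/2) outcomes
    truthful = np.full(T, 0.5)            # truthful forecast = conditional expectation
    # The unnormalized penalty |#ones - T/2| scales like sqrt(T), i.e. polynomial in T.
    print(T, T * binned_calibration_error(truthful, y))
```

The SSCE abstract's claim is that, unlike such measures, its penalty for the truthful forecaster is within a constant factor of the best achievable; this toy script does not attempt to reproduce that part.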
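Regarding the second related item (subsampled step calibration): a sketch, in our own notation rather than the paper's, of one natural way to read the "truthfulness gap" figures quoted above.

```latex
% Hypothetical formalization (our notation, not necessarily the paper's):
% for a calibration measure CE and a distribution over outcomes y_1, ..., y_T,
% compare the truthful forecaster's expected error with the best achievable one.
\[
  \underbrace{\mathbb{E}\bigl[\mathrm{CE}(p^{\mathrm{truth}},\, y_{1:T})\bigr]}_{\text{truthful error}}
  \quad\text{vs.}\quad
  \underbrace{\min_{p}\ \mathbb{E}\bigl[\mathrm{CE}(p,\, y_{1:T})\bigr]}_{\text{optimal error}},
  \qquad
  p^{\mathrm{truth}}_t = \mathbb{E}\bigl[y_t \mid y_1, \dots, y_{t-1}\bigr].
\]
% On this reading, an "e^{-Omega(T)} vs. Omega(sqrt(T)) truthfulness gap" means the
% optimal error can be exponentially small while the truthful error is Omega(sqrt(T));
% "truthful up to an O(1) factor" means the two quantities differ by at most a constant.
```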
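Regarding the soil-heat-flux item: a sketch of the ratio-based field correction it describes. Variable names, units, and the use of period totals for the ratio are assumptions on our part; the paper's exact averaging may differ.

```python
import numpy as np

def field_correct(gp_eval, gp_cal, gs_grad_cal):
    """Correct heat-flux-plate readings with a calibration-period ratio.

    gp_cal      : plate flux G_p over the calibration period (e.g., W/m^2)
    gs_grad_cal : reference flux G_s_grad from the gradient method, same period
    gp_eval     : plate flux over the evaluation period, to be corrected
    The correction divides G_p by the calibration-period ratio G_p / G_s_grad,
    so a plate that reads systematically low is scaled back up.
    """
    ratio = np.sum(gp_cal) / np.sum(gs_grad_cal)   # assumed: ratio of period totals
    return np.asarray(gp_eval, dtype=float) / ratio

# Toy example: a plate reading ~50% low gets doubled.
gp_cal = np.array([20.0, 25.0, 30.0])
gs_grad_cal = np.array([40.0, 50.0, 60.0])
print(field_correct([22.0, 28.0], gp_cal, gs_grad_cal))   # -> [44. 56.]
```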