The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant miscalibration as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing calibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based calibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.
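For reference, the score-based ECE mentioned above is typically estimated by binning predictions by confidence and averaging the per-bin gap between confidence and accuracy. A minimal sketch (the binning scheme and toy data are illustrative, not the paper's setup):

```python
import numpy as np

def expected_calibration_error(confidences, labels, n_bins=10):
    """Estimate ECE: weighted average of |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # include the right edge in the last bin
        mask = (confidences >= lo) & ((confidences < hi) if hi < 1.0 else (confidences <= hi))
        if mask.sum() == 0:
            continue
        avg_conf = confidences[mask].mean()
        accuracy = labels[mask].mean()
        ece += mask.mean() * abs(accuracy - avg_conf)
    return ece

# Toy check: predictions at 0.8 confidence that are right 80% of the time are well calibrated.
conf = [0.8] * 10
y = [1] * 8 + [0] * 2
print(round(expected_calibration_error(conf, y), 6))  # 0.0
```

A variable-based analysis, as the abstract argues, would slice this same computation by a feature of interest rather than by confidence alone.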
This content will become publicly available on January 1, 2026
An Elementary Predictor Obtaining 2√T Distance to Calibration
Blasiok et al. [2023] proposed distance to calibration as a natural measure of calibration error that, unlike expected calibration error (ECE), is continuous. Recently, Qiao and Zheng [2024] gave a non-constructive argument establishing the existence of an online predictor that can obtain O(√T) distance to calibration in the adversarial setting, which is known to be impossible for ECE. They left as an open problem finding an explicit, efficient algorithm. We resolve this problem and give an extremely simple, efficient, deterministic algorithm that obtains distance to calibration at most 2√T.
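The continuity gap between the two measures can be seen in a classic toy example: nudging a perfectly calibrated constant forecast by a tiny amount on alternating rounds moves ECE from 0 to about 1/2, even though the forecasts barely change. A minimal illustration (unbinned ECE on synthetic data; not the paper's algorithm):

```python
from collections import defaultdict

def ece_exact(preds, outcomes):
    """ECE with one bucket per distinct predicted value (no binning)."""
    groups = defaultdict(list)
    for p, y in zip(preds, outcomes):
        groups[p].append(y)
    n = len(preds)
    return sum(len(ys) / n * abs(sum(ys) / len(ys) - p)
               for p, ys in groups.items())

outcomes = [0, 1] * 50
flat = [0.5] * 100                                 # always predict 1/2: ECE = 0
nudged = [0.51 if y else 0.49 for y in outcomes]   # ε-close forecasts, large ECE
print(round(ece_exact(flat, outcomes), 4))    # 0.0
print(round(ece_exact(nudged, outcomes), 4))  # 0.49
```

Distance to calibration, by contrast, assigns both forecast sequences a small error, which is what makes the 2√T guarantee meaningful.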
- PAR ID: 10596503
- Publisher / Repository: SODA 2025
- Medium: X
- Sponsoring Org: National Science Foundation
More Like this
This paper gives some theory and efficient design of binary block codes capable of controlling the deletions of the symbol “0” (referred to as 0-deletions) and/or the insertions of the symbol “0” (referred to as 0-insertions). This problem of controlling 0-deletions and/or 0-insertions (referred to as 0-errors) is shown to be equivalent to the efficient design of L1-metric asymmetric error control codes over the natural alphabet, ℕ. In this way, it is shown that t 0-insertion correcting codes are actually capable of controlling much more; namely, they can correct t 0-errors, detect (t+1) 0-errors and, simultaneously, detect all occurrences of only 0-deletions or only 0-insertions in every received word (briefly, they are t-Symmetric 0-Error Correcting/(t+1)-Symmetric 0-Error Detecting/All Unidirectional 0-Error Detecting (t-Sy0EC/(t+1)-Sy0ED/AU0ED) codes). From the relations with the L1 distance error control codes, new improved bounds are given for the optimal t 0-error correcting codes. Optimal non-systematic code designs are given. Decoding can be efficiently performed by algebraic means using the Extended Euclidean Algorithm (EEA).
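The equivalence with the L1 metric over ℕ can be seen by reading a binary word as the vector of its 0-run lengths, with the 1s as delimiters: a 0-deletion or 0-insertion changes exactly one coordinate by one, so 0-errors become ±1 coordinate changes. A small illustration (helper names are ours, not the paper's):

```python
def zero_run_lengths(word):
    """Map a binary string to the lengths of its maximal 0-runs,
    using the 1s as delimiters (k ones give k+1 runs, possibly empty)."""
    return [len(run) for run in word.split("1")]

def l1_distance(u, v):
    """L1 distance between two run-length vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

x = "0100110"   # 0-runs: 1, 2, 0, 1
y = "010110"    # one 0-deletion in the second run: 1, 1, 0, 1
print(zero_run_lengths(x))  # [1, 2, 0, 1]
print(zero_run_lengths(y))  # [1, 1, 0, 1]
print(l1_distance(zero_run_lengths(x), zero_run_lengths(y)))  # 1
```

Note the mapping only compares words with the same number of 1s, which 0-errors never change.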
The uncertainty in modeling emotions makes speech emotion recognition (SER) systems less reliable. An intuitive way to increase trust in SER is to reject predictions with low confidence. This approach assumes that an SER system is well calibrated, where highly confident predictions are often right and low-confidence predictions are often wrong. Hence, it is desirable to calibrate the confidence of SER classifiers. We evaluate the reliability of SER systems by exploring the relationship between confidence and accuracy, using the expected calibration error (ECE) metric. We develop a multi-label variant of the post-hoc temperature scaling (TS) method to calibrate SER systems, while preserving their accuracy. The best method combines an emotion co-occurrence weight penalty function, a class-balanced objective function, and the proposed multi-label TS calibration method. The experiments show the effectiveness of our developed multi-label calibration method in terms of accuracy and ECE.
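For context, standard single-label temperature scaling, which the multi-label variant above builds on, divides the logits by a scalar T fitted on validation data to minimize negative log-likelihood. A hedged sketch using grid search (the paper's multi-label formulation and penalty terms are not reproduced here):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the temperature-scaled probabilities."""
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 200)):
    """Pick the single scalar T minimizing validation NLL (grid search for simplicity)."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Synthetic check: labels are drawn from softmax(z), so logits 3*z are
# overconfident, and the fitted temperature should land near the factor 3.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=(2000, 4))
labels = np.array([rng.choice(4, p=p) for p in softmax(z)])
T = fit_temperature(3.0 * z, labels)
print(abs(T - 3.0) < 0.5)  # True
```

Because T only rescales logits, the argmax class is unchanged, which is why temperature scaling preserves accuracy while changing confidence.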
We study the problem of making calibrated probabilistic forecasts for a binary sequence generated by an adversarial nature. Following the seminal paper of Foster and Vohra (1998), nature is often modeled as an adaptive adversary who sees all activity of the forecaster except the randomization that the forecaster may deploy. A number of papers have proposed randomized forecasting strategies that achieve an ϵ-calibration error rate of O(1/√T), which we prove is tight in general. On the other hand, it is well known that it is not possible to be calibrated without randomization, or if nature also sees the forecaster's randomization; in both cases the calibration error could be Ω(1). Inspired by the equally seminal works on the "power of two choices" and imprecise probability theory, we study a small variant of the standard online calibration problem. The adversary gives the forecaster the option of making two nearby probabilistic forecasts, or equivalently an interval forecast of small width, and the endpoint closest to the revealed outcome is used to judge calibration. This power of two choices, or imprecise forecast, accords the forecaster with significant power -- we show that a faster ϵ-calibration rate of O(1/T) can be achieved even without deploying any randomization.
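The judging rule in this variant is simple to state: the forecaster commits to an interval, the outcome is revealed, and the endpoint closest to the outcome is the one recorded for calibration. A minimal sketch of that rule (function name is ours):

```python
def judge(interval, outcome):
    """Record the endpoint of the interval forecast closest to the revealed outcome."""
    lo, hi = interval
    return hi if abs(hi - outcome) < abs(lo - outcome) else lo

# The adversary reveals the binary outcome only after the interval is fixed;
# calibration is then assessed on the recorded endpoints.
print(judge((0.45, 0.55), 1))  # 0.55
print(judge((0.45, 0.55), 0))  # 0.45
```

The slack between the two endpoints is exactly what lets a deterministic forecaster avoid the Ω(1) lower bound of the standard setting.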
This paper considers a basic problem at the interface of submodular optimization and online learning. In the online unconstrained submodular maximization problem, there is a universe [n] = {1, 2, . . . , n} and a sequence of T nonnegative (not necessarily monotone) submodular functions arrive over time. The goal is to design a computationally efficient online algorithm, which chooses a subset of [n] at each time step as a function only of the past, such that the accumulated value of the chosen subsets is as close as possible to the maximum total value of a fixed subset in hindsight. Our main result is a polynomial time no-1/2-regret algorithm for this problem, meaning that for every sequence of nonnegative submodular functions, the algorithm’s expected total value is at least 1/2 times that of the best subset in hindsight, up to an error term sublinear in T. The factor of 1/2 cannot be improved upon by any polynomial-time online algorithm when the submodular functions are presented as value oracles.
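The 1/2 benchmark mirrors the offline setting, where the randomized "double greedy" algorithm of Buchbinder et al. attains a 1/2-approximation in expectation for unconstrained submodular maximization. A sketch of that offline algorithm on a toy cut function (the paper's online algorithm is more involved and is not reproduced here):

```python
import random

def double_greedy(n, f):
    """Randomized double greedy: 1/2-approximation in expectation for
    nonnegative (possibly non-monotone) submodular f over subsets of {0..n-1}."""
    X, Y = set(), set(range(n))
    for i in range(n):
        a = f(X | {i}) - f(X)      # marginal gain of adding i to X
        b = f(Y - {i}) - f(Y)      # marginal gain of removing i from Y
        a, b = max(a, 0.0), max(b, 0.0)
        if a + b == 0 or random.random() < a / (a + b):
            X.add(i)
        else:
            Y.discard(i)
    return X  # X == Y after processing every element

# Example: cut function of a small graph (nonnegative, non-monotone, submodular).
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
def cut(S):
    return sum(1 for u, v in edges if (u in S) != (v in S))

random.seed(0)
S = double_greedy(4, cut)
print(cut(S) >= 1)  # a nontrivial cut is found: True
```

The randomized add-or-remove coin flip is essential: the paper notes that no polynomial-time online algorithm can beat the same 1/2 factor in the value-oracle model.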
