- NSF-PAR ID: 10460957
- Date Published:
- Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 37
- Issue: 7
- ISSN: 2159-5399
- Page Range / eLocation ID: 8211 to 8219
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
A model is considered well-calibrated when its probability estimates align with the actual likelihood of its outputs being correct. Calibrating language models (LMs) is crucial, as it plays a vital role in detecting and mitigating hallucinations and in building more trustworthy models. However, standard calibration techniques may not be suited to LM calibration. For instance, post-processing methods such as temperature scaling do not reorder the candidate generations. On the other hand, training-based methods require fine-tuning the entire model, which is impractical for large-scale LMs. We present LITCAB, a lightweight calibration mechanism consisting of a single linear layer that takes the input text representation and predicts a bias term, which is then added to the LM output logits. LITCAB improves model calibration while adding fewer than 2% of the original model parameters. For evaluation, we construct CAT, a benchmark of eight text generation tasks covering responses ranging from short phrases to paragraphs. We test LITCAB with Llama2-7B, where it improves calibration across all tasks, reducing the average ECE score by as much as 30%. We further conduct a comprehensive evaluation with multiple popular open-source LMs from the GPT and LLaMA families, yielding the following key findings: (i) Larger models within the same family exhibit better calibration on tasks with short generations, but not necessarily on longer ones. (ii) GPT-family models show superior calibration compared to LLaMA, Llama2, and Vicuna models, despite having far fewer parameters. (iii) Fine-tuning a pretrained model (e.g., LLaMA) on narrow-purpose data (e.g., conversations) may lead to worse calibration, highlighting the importance of the fine-tuning setup for calibrating LMs.
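As a concrete illustration of the mechanism described in this abstract, a minimal PyTorch-style sketch is given below: one trainable linear layer maps the LM's hidden representation to a vocabulary-sized bias that is added to the frozen model's logits. The class and parameter names (`LitCabBias`, `hidden_size`, `vocab_size`) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a LitCab-style calibration layer (not the authors' code):
# a single linear layer predicts a per-token bias from the input-text
# representation, and the bias is added to the frozen LM's output logits.
import torch
import torch.nn as nn

class LitCabBias(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        # The only trainable parameters: one linear map, a small fraction
        # of the size of the underlying LM.
        self.bias_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, lm_logits: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from the frozen LM
        # lm_logits:     (batch, seq_len, vocab_size)
        bias = self.bias_head(hidden_states)   # predicted bias term
        return lm_logits + bias                # calibrated logits
```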
-
Abstract We consider the proportional hazards model in which the covariates include the discretized categories of a continuous time‐dependent exposure variable measured with error. Naively ignoring the measurement error in the analysis may cause biased estimation and erroneous inference. Although various approaches have been proposed to deal with measurement error when the hazard depends linearly on the time‐dependent variable, it has not yet been investigated how to correct when the hazard depends on the discretized categories of the time‐dependent variable. To fill this gap in the literature, we propose a smoothed corrected score approach based on approximation of the discretized categories after smoothing the indicator function. The consistency and asymptotic normality of the proposed estimator are established. The observation times of the time‐dependent variable are allowed to be informative. For comparison, we also extend to this setting two approximate approaches, the regression calibration and the risk‐set regression calibration. The methods are assessed by simulation studies and by application to data from an HIV clinical trial.
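For readers unfamiliar with the setup, one plausible way to write the model sketched in this abstract is below; the cutpoints, the additive error form, and all symbols are illustrative notation rather than the paper's own.

```latex
% Hazard depending on discretized categories of a time-dependent
% exposure X(t); the exposure is observed only as W(t), with error U(t).
\lambda\{t \mid X(t), Z\}
  = \lambda_0(t)\,
    \exp\!\Big\{\textstyle\sum_{k=1}^{K} \beta_k\,
      I\{c_{k-1} < X(t) \le c_k\} + \gamma^{\top} Z\Big\},
\qquad
W(t) = X(t) + U(t).
```

Under this reading, the smoothed corrected score approach replaces the discontinuous indicator I{·} with a smooth approximation so that a corrected score equation can be constructed despite the measurement error.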
-
The uncertainty in modeling emotions makes speech emotion recognition (SER) systems less reliable. An intuitive way to increase trust in SER is to reject predictions with low confidence. This approach assumes that an SER system is well calibrated, where highly confident predictions are often right and low-confidence predictions are often wrong. Hence, it is desirable to calibrate the confidence of SER classifiers. We evaluate the reliability of SER systems by exploring the relationship between confidence and accuracy, using the expected calibration error (ECE) metric. We develop a multi-label variant of the post-hoc temperature scaling (TS) method to calibrate SER systems while preserving their accuracy. The best method combines an emotion co-occurrence weight penalty function, a class-balanced objective function, and the proposed multi-label TS calibration method. The experiments show the effectiveness of our multi-label calibration method in terms of accuracy and ECE.
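For reference, a minimal sketch of the two ingredients named above, ECE and standard (single-label) temperature scaling, is shown below; the multi-label variant and the co-occurrence penalty from the paper are not reproduced, and the function names are our own.

```python
# Minimal sketch of expected calibration error (ECE) and single-label
# temperature scaling; this is illustrative, not the paper's method.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = correct[in_bin].mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(acc - conf)
    return ece

def temperature_scale(logits, temperature: float):
    """Post-hoc rescaling of logits; the predicted class is unchanged."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)   # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)
```

In practice the temperature is fit on a held-out validation set by minimizing the negative log-likelihood, which lowers ECE without changing accuracy.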
-
Summary We consider functional measurement error models, i.e. models where covariates are measured with error and yet no distributional assumptions are made about the mismeasured variable. We propose and study a score-type local test and an orthogonal series-based, omnibus goodness-of-fit test in this context, where no likelihood function is available or calculated—i.e. all the tests are proposed in the semiparametric model framework. We demonstrate that our tests have optimality properties and computational advantages that are similar to those of the classical score tests in the parametric model framework. The test procedures are applicable to several semiparametric extensions of measurement error models, including when the measurement error distribution is estimated non-parametrically as well as for generalized partially linear models. The performance of the local score-type and omnibus goodness-of-fit tests is demonstrated through simulation studies and analysis of a nutrition data set.
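As an illustration of the setting, a common way to write an additive, functional measurement error model with a generalized partially linear mean is shown below; the notation is ours and reflects only one of the semiparametric extensions the abstract mentions.

```latex
% X is observed only through W; no distribution is assumed for X
% (the "functional" viewpoint).  g is a known link and \theta(\cdot)
% an unknown smooth function of the error-free covariate Z.
W = X + U, \qquad
g\{E(Y \mid X, Z)\} = X\beta + \theta(Z).
```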
-
ABSTRACT Precision calibration poses challenges to experiments probing the redshifted 21-cm signal of neutral hydrogen from the Cosmic Dawn and Epoch of Reionization (z ∼ 30–6). In both interferometric and global signal experiments, systematic calibration is the leading source of error. Though many aspects of calibration have been studied, the overlap between the two types of instruments has received less attention. We investigate the sky-based calibration of total power measurements with a HERA dish and an EDGES-style antenna to understand the role of autocorrelations in the calibration of an interferometer and the role of the sky in calibrating a total power instrument. Using simulations, we study scenarios such as time-variable gains, an incomplete sky calibration model, and an inaccurate primary beam model. We find that temporal gain drifts, sky model incompleteness, and beam inaccuracies cause biases in the receiver gain amplitude and the receiver temperature estimates. In some cases, these biases mix spectral structure between the beam and the sky, resulting in spectrally variable gain errors. Applying the calibration method to HERA and EDGES data, we find good agreement with calibration via the more standard methods. Although the instrumental gains are consistent with beam and sky errors similar in scale to those simulated, the receiver temperatures show significant deviations from expected values. While we show that it is possible to partially mitigate biases due to model inaccuracies by incorporating a time-dependent gain model in calibration, the resulting errors on calibration products are larger and more correlated. Completely addressing these biases will require more accurate sky and primary beam models.
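To make the calibration problem concrete, a simplified total-power measurement equation of the kind such sky-based calibration solves is given below; the symbols are illustrative and not taken from the paper.

```latex
% Recorded antenna temperature: beam-weighted sky plus receiver
% temperature, scaled by a (possibly time- and frequency-dependent)
% receiver gain.  Sky-based calibration fits g and T_rx by comparing
% T_meas against a sky model convolved with the primary beam model.
T_{\mathrm{meas}}(\nu, t)
  = g(\nu, t)\,\big[\,T_{\mathrm{sky}}(\nu, t) + T_{\mathrm{rx}}(\nu)\,\big].
```

Errors in the assumed sky or beam then propagate directly into the fitted gain and receiver temperature, which is the bias mechanism the abstract describes.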