

Title: Evaluating the Impact of Uncertainty on Risk Prediction: Towards More Robust Prediction Models
Risk prediction models are crucial for assessing the pretest probability of cancer and are applied to stratify patient management strategies. These models are frequently based on multivariate regression analysis, requiring that all risk factors be specified, and do not convey the confidence in their predictions. We present a framework for uncertainty analysis that accounts for variability in input values. Uncertain or missing values are replaced with a range of plausible values. These ranges are used to compute individualized risk confidence intervals. We demonstrate our approach using the Gail model to evaluate the impact of uncertainty on management decisions. Up to 13% of cases had a risk interval that crossed the decision threshold (e.g., 1.67% 5-year absolute risk), making their classification uncertain. A small number of cases changed from low- to high-risk when missing values were present. Our analysis underscores the need for better communication of input assumptions that influence the resulting predictions.
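A minimal sketch of the interval idea, assuming a toy multiplicative risk function in place of the actual Gail model; all relative-risk values below are illustrative, not published Gail coefficients:

```python
# Hypothetical simplified risk function standing in for the Gail model:
# each factor contributes a multiplicative relative risk to a baseline.
def five_year_risk(age_rr, biopsies_rr, relatives_rr, baseline=0.009):
    return baseline * age_rr * biopsies_rr * relatives_rr

# A missing input (here, number of first-degree relatives with breast
# cancer) is replaced by its range of plausible relative risks.
plausible_relatives_rr = [1.0, 2.6, 6.8]   # 0, 1, or 2+ relatives (illustrative)
known = dict(age_rr=1.2, biopsies_rr=1.3)  # inputs that were actually observed

risks = [five_year_risk(relatives_rr=rr, **known) for rr in plausible_relatives_rr]
lo, hi = min(risks), max(risks)

THRESHOLD = 0.0167  # 1.67% 5-year absolute risk decision threshold
if lo >= THRESHOLD:
    label = "high risk"
elif hi < THRESHOLD:
    label = "low risk"
else:
    label = "uncertain: interval straddles the threshold"
print(f"risk interval [{lo:.4f}, {hi:.4f}] -> {label}")
```

Cases whose whole interval falls on one side of the threshold keep a definite classification; the "uncertain" cases are the ones the abstract counts.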
Award ID(s):
1722516
NSF-PAR ID:
10064294
Journal Name:
AMIA ... Annual Symposium proceedings
ISSN:
1559-4076
Sponsoring Org:
National Science Foundation
More Like this
  1. In situ sensors that collect high-frequency data are used increasingly to monitor aquatic environments. These sensors are prone to technical errors, resulting in unrecorded observations and/or anomalous values that are subsequently removed and create gaps in time series data. We present a framework based on generalized additive and auto-regressive models to recover these missing data. To mimic sporadically missing (i) single observations and (ii) periods of contiguous observations, we randomly removed (i) point data and (ii) day- and week-long sequences of data from a two-year time series of nitrate concentration data collected from Arikaree River, USA, where synoptically collected water temperature, turbidity, conductance, elevation, and dissolved oxygen data were available. In 72% of cases with missing point data, predicted values were within the sensor precision interval of the original value, although predictive ability declined when sequences of missing data occurred. Precision also depended on the availability of other water quality covariates. When covariates were available, even a sudden, event-based peak in nitrate concentration was reconstructed well. By providing a promising method for accurate prediction of missing data, the utility and confidence in summary statistics and statistical trends will increase, thereby assisting the effective monitoring and management of fresh waters and other at-risk ecosystems. 
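A minimal sketch of covariate-based gap filling, assuming ordinary least squares in place of the generalized additive and auto-regressive models the study uses; the covariate name and all values are illustrative:

```python
# Fit a linear model of nitrate concentration on an available covariate
# (here, specific conductance) using observed points, then predict gaps.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

conductance = [410.0, 420.0, 430.0, 440.0, 450.0]
nitrate     = [1.0,   1.2,   None,  1.6,   1.8]   # None marks a sensor gap

obs = [(c, n) for c, n in zip(conductance, nitrate) if n is not None]
slope, intercept = fit_line([c for c, _ in obs], [n for _, n in obs])

# Keep observed values; impute the missing one from the covariate.
filled = [n if n is not None else slope * c + intercept
          for c, n in zip(conductance, nitrate)]
```

For contiguous day- or week-long gaps, where this simple approach degrades, the paper's auto-regressive structure and additional covariates carry the prediction instead.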
  2. Abstract

    The ambient solar wind plays a significant role in propagating interplanetary coronal mass ejections and is an important driver of space weather geomagnetic storms. A computationally efficient and widely used method to predict the ambient solar wind radial velocity near Earth involves coupling three models: Potential Field Source Surface, Wang‐Sheeley‐Arge (WSA), and Heliospheric Upwind eXtrapolation. However, the model chain has 11 uncertain parameters that are mainly non‐physical due to empirical relations and simplified physics assumptions. We, therefore, propose a comprehensive uncertainty quantification (UQ) framework that is able to successfully quantify and reduce parametric uncertainties in the model chain. The UQ framework utilizes variance‐based global sensitivity analysis followed by Bayesian inference via Markov chain Monte Carlo to learn the posterior densities of the most influential parameters. The sensitivity analysis results indicate that the five most influential parameters are all WSA parameters. Additionally, we show that the posterior densities of such influential parameters vary greatly from one Carrington rotation to the next. These influential parameters appear to compensate for missing physics in the model chain, highlighting the need to enhance the robustness of the model chain to the choice of WSA parameters. The ensemble predictions generated from the learned posterior densities significantly reduce the uncertainty in solar wind velocity predictions near Earth.
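A minimal Metropolis-Hastings sketch of the Bayesian-inference step: learning the posterior of a single hypothetical model parameter from a noisy observed wind speed. The forward model, prior bounds, and all numbers are illustrative stand-ins, not the actual WSA parameterization:

```python
import math
import random

random.seed(0)

def model(theta):
    return 350.0 + 50.0 * theta        # toy forward model (km/s)

observed, sigma = 450.0, 10.0          # synthetic observation and noise level

def log_post(theta):
    if not 0.0 <= theta <= 5.0:        # uniform prior on [0, 5]
        return -math.inf
    # Gaussian likelihood of the observation given the model prediction
    return -0.5 * ((model(theta) - observed) / sigma) ** 2

theta, samples = 1.0, []
for _ in range(5000):
    prop = theta + random.gauss(0.0, 0.3)          # random-walk proposal
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop                               # accept the proposal
    samples.append(theta)

posterior_mean = sum(samples[1000:]) / len(samples[1000:])  # discard burn-in
```

In the actual framework this sampling runs only over the parameters the variance-based sensitivity analysis flagged as influential, which keeps the MCMC dimension manageable.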

  3. Accurate traffic speed prediction is critical to many applications, from routing and urban planning to infrastructure management. With sufficient training data where all spatio-temporal patterns are well-represented, machine learning models such as Spatial-Temporal Graph Convolutional Networks (STGCN), can make reasonably accurate predictions. However, existing methods fail when the training data distribution (e.g., traffic patterns on regular days) is different from test distribution (e.g., traffic patterns on special days). We address this challenge by proposing a traffic-law-informed network called Reaction-Diffusion Graph Ordinary Differential Equation (RDGODE) network, which incorporates a physical model of traffic speed evolution based on a reliable and interpretable reaction-diffusion equation that allows the RDGODE to adapt to unseen traffic patterns. We show that with mismatched training data, RDGODE is more robust than the state-of-the-art machine learning methods in the following cases. (1) When the test dataset exhibits spatio-temporal patterns not represented in the training dataset, the performance of RDGODE is more consistent and reliable. (2) When the test dataset has missing data, RDGODE can maintain its accuracy by intrinsically imputing the missing values.
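A minimal sketch of one explicit Euler step of a graph reaction-diffusion system, the kind of physics RDGODE builds in: speeds diffuse along road links (via the graph Laplacian) while a local reaction term pulls each node toward a free-flow speed. The coefficients, graph, and free-flow value are illustrative, not the paper's learned parameters:

```python
def rd_step(speeds, edges, dt=0.1, diff=0.5, react=0.2, free_flow=60.0):
    n = len(speeds)
    lap = [0.0] * n
    for i, j in edges:                 # accumulate Laplacian action L @ x
        lap[i] += speeds[i] - speeds[j]
        lap[j] += speeds[j] - speeds[i]
    # dx/dt = -diff * L x  (diffusion)  +  react * (free_flow - x)  (reaction)
    return [x + dt * (-diff * lap[k] + react * (free_flow - x))
            for k, x in enumerate(speeds)]

# A 3-node chain of road segments; node 1 starts congested at 20 km/h.
speeds = [55.0, 20.0, 50.0]
edges = [(0, 1), (1, 2)]
for _ in range(50):
    speeds = rd_step(speeds, edges)
```

Because each node's update is a physical balance of its neighbors and a recovery term, a node with a missing observation still receives a plausible value from the dynamics, which is the intuition behind RDGODE's intrinsic imputation.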
  4. Abstract

    Estimating uncertainty in flood model predictions is important for many applications, including risk assessment and flood forecasting. We focus on uncertainty in physics‐based urban flooding models. We consider the effects of the model's complexity and uncertainty in key input parameters. The effect of rainfall intensity on the uncertainty in water depth predictions is also studied. As a test study, we choose the Interconnected Channel and Pond Routing (ICPR) model of a part of the city of Minneapolis. The uncertainty in the ICPR model's predictions of the floodwater depth is quantified in terms of the ensemble variance using the multilevel Monte Carlo (MC) simulation method. Our results show that uncertainties in the studied domain are highly localized. Model simplifications, such as disregarding the groundwater flow, lead to overly confident predictions, that is, predictions that are both less accurate and less uncertain than those of the more complex model. We find that for the same number of uncertain parameters, increasing the model resolution reduces uncertainty in the model predictions (and increases the MC method's computational cost). We employ the multilevel MC method to reduce the cost of estimating uncertainty in a high‐resolution ICPR model. Finally, we use the ensemble estimates of the mean and covariance of the flood depth for real‐time flood depth forecasting using the physics‐informed Gaussian process regression method. We show that even with few measurements, the proposed framework results in a more accurate forecast than that provided by the mean prediction of the ICPR model.
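A minimal two-level sketch of the multilevel Monte Carlo idea used to cut the cost of uncertainty estimation: many cheap coarse-model samples estimate the bulk of the mean, and a few paired fine/coarse samples estimate the correction E[fine - coarse]. The two "models" here are toy functions, not the ICPR flood model:

```python
import random

random.seed(1)

def coarse(z):                  # cheap low-resolution surrogate
    return z * z

def fine(z):                    # expensive high-resolution model
    return z * z + 0.1 * z

n_coarse, n_fine = 20000, 500
mean_coarse = sum(coarse(random.gauss(0, 1)) for _ in range(n_coarse)) / n_coarse

corr = 0.0
for _ in range(n_fine):         # paired samples: same input to both levels
    z = random.gauss(0, 1)
    corr += fine(z) - coarse(z)
mean_corr = corr / n_fine

mlmc_estimate = mean_coarse + mean_corr   # estimates E[fine(Z)], Z ~ N(0, 1)
```

The saving comes from the correction term having much smaller variance than the fine model itself, so only a few expensive high-resolution runs are needed.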

  5. SUMMARY

    Operational earthquake forecasting for risk management and communication during seismic sequences depends on our ability to select an optimal forecasting model. To do this, we need to compare the performance of competing models in prospective experiments, and to rank their performance according to the outcome using a fair, reproducible and reliable method, usually in a low-probability environment. The Collaboratory for the Study of Earthquake Predictability conducts prospective earthquake forecasting experiments around the globe. In this framework, it is crucial that the metrics used to rank the competing forecasts are ‘proper’, meaning that, on average, they prefer the data-generating model. We prove that the Parimutuel Gambling score, proposed, and in some cases applied, as a metric for comparing probabilistic seismicity forecasts, is in general ‘improper’. In the special case where it is proper, we show it can still be used improperly. We demonstrate the conclusions both analytically and graphically, providing a set of simulation-based techniques that can be used to assess whether a score is proper or not. They require only a data-generating model and at least two forecasts to be compared. We compare the Parimutuel Gambling score’s performance with two commonly used proper scores (the Brier and logarithmic scores) using confidence intervals to account for the uncertainty around the observed score difference. We suggest that using confidence intervals enables a rigorous approach to distinguish between the predictive skills of candidate forecasts, in addition to their rankings. Our analysis shows that the Parimutuel Gambling score is biased, and the direction of the bias depends on the forecasts taking part in the experiment. Our findings suggest that the Parimutuel Gambling score should not be used to distinguish between multiple competing forecasts, and that care should be taken when only two are being compared.
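A minimal sketch of the simulation-based properness check the summary describes: generate outcomes from a known data-generating model, score two competing probabilistic forecasts with the Brier and logarithmic scores, and confirm that the data-generating forecast wins on average. The forecast probabilities are illustrative:

```python
import math
import random

random.seed(2)

p_true, p_rival = 0.1, 0.3       # per-bin event probabilities; p_true generates
n = 20000
outcomes = [random.random() < p_true for _ in range(n)]

def brier(p, y):                 # quadratic score: lower is better
    return (p - (1.0 if y else 0.0)) ** 2

def log_score(p, y):             # logarithmic score: higher is better
    return math.log(p if y else 1.0 - p)

brier_true  = sum(brier(p_true, y)  for y in outcomes) / n
brier_rival = sum(brier(p_rival, y) for y in outcomes) / n
log_true  = sum(log_score(p_true, y)  for y in outcomes) / n
log_rival = sum(log_score(p_rival, y) for y in outcomes) / n
```

For a proper score this preference for the data-generating forecast holds for any rival; an improper score like the Parimutuel Gambling score can instead favor the rival depending on which forecasts take part.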
