<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents</title></titleStmt>
			<publicationStmt>
				<publisher>ICML</publisher>
				<date>06/05/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10598920</idno>
					<idno type="doi"></idno>
					
					<author>Shayan Kiyani</author><author>George Pappas</author><author>Aaron Roth</author><author>Hamed Hassani</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[A fundamental question in data-driven decision making is how to quantify the uncertainty of predictions to inform risk-sensitive downstream actions, as often required in domains such as medicine. We develop a decision-theoretic foundation linking prediction sets to risk-averse decision-making, addressing three questions: (1) What is the correct notion of uncertainty quantification for risk-averse decision makers? We prove that prediction sets are optimal for decision makers who wish to optimize their value at risk. (2) What is the optimal policy that a risk-averse decision maker should use to map prediction sets to actions? We show that a simple max-min decision policy is optimal for risk-averse decision makers. Finally, (3) How can we derive prediction sets that are optimal for such decision makers? We provide an exact characterization in the population regime and a distribution-free finite-sample construction. These insights lead to Risk-Averse Calibration (RAC), a principled algorithm that is both practical, exploiting black-box predictions to enhance downstream utility, and safe, adhering to user-defined risk thresholds. We experimentally demonstrate RAC's advantages in medical diagnosis and recommendation systems, showing that it substantially improves the trade-off between safety and utility, delivering higher utility than existing methods while avoiding critical errors.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Predictions are frequently used to inform actions. For example, in clinical medicine, patient data are used to predict diagnoses and outcomes when choosing treatments. In high-stakes cases, where an incorrect treatment decision could lead to serious complications or death, it is crucial not to rely solely on a model's predictions. Instead, decisions must account for the uncertainty in these predictions, opting for more conservative interventions when that uncertainty makes the potential outcomes (e.g., complications, side effects) highly variable. Connecting uncertain predictions to actionable, principled decisions is a significant challenge in safety-critical domains, including medical diagnosis, finance, robotics, and control, and requires balancing safety with utility. One extreme is to avoid any action entirely, sacrificing prediction's practical value for absolute safety; the other is to aggressively exploit predictions to maximize expected utility, accepting significant downside risk, i.e., realizing poor outcomes with substantial probability. Balancing this trade-off calls for an optimal approach to risk-sensitive decision making. To this end, we focus on the following question:</p><p>What is the optimal interface between prediction and action that allows for navigating the trade-off between safety and utility in high-stakes applications?</p><p>The optimal design of an action policy crucially depends on how uncertainty is quantified. Among various methods, a widely adopted approach, spurred by advances in conformal prediction, is to produce prediction sets rather than point estimates. But what exactly are prediction sets good for? Which decision-making processes make them the right language for communicating uncertainty? And, given such a process, what is the optimal rule for transforming prediction sets into actions? 
To address these questions, we first introduce our setting and notation. We consider a feature space X and a label set Y, endowed with a distribution (x, y) ∼ D. A downstream decision maker has an action set A and a utility function u : A × Y → R that maps actions a and realized labels y to utilities u(a, y), which the decision maker seeks to maximize. Upon observing x ∈ X, the decision maker must take an action a ∈ A without observing the true label y, relying instead on predictions about y. Within this framework, we aim to answer the above questions.</p><p>In seeking answers, it is instructive to reflect on what we can say about calibrated forecasts, an alternative way of quantifying uncertainty with well-established decision-theoretic</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>[Figure 1 diagram: a pre-trained ML model feeds prediction set design (optimal prediction sets, Thms. 3.2 and 4.1), which feeds action policy design (optimal action policy, Prop. 2.2 and Thm. 2.3); the resulting action is taken via Risk-Averse Calibration (RAC), which carries a safety guarantee (Corollary 4.2).]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 1</head><p>Figure <ref type="figure">1</ref>. RAC pipeline, an interface between prediction and action for high-stakes applications.</p><p>foundations, but one that has its own limitations. Suppose we are in a multiclass classification setting, and we represent labels y ∈ Y using one-hot vectors in the k-dimensional probability simplex. A forecasting rule f : X → [0, 1]^k is calibrated if, for every prediction p, we have E[y | f(x) = p] = p, meaning it is unbiased given the forecast. Then a simple consequence of calibration <ref type="bibr">(Foster and Vohra, 1997;</ref><ref type="bibr">Zhao et al., 2021;</ref><ref type="bibr">Noarov et al., 2023)</ref> is that, for any expectation-maximizing decision maker, choosing the action that would maximize expected utility as if the forecast were correct is optimal among all policies mapping forecasts to actions. Formally, if f is calibrated, then applying BR_u(f(x)) = arg max_{a ∈ A} E_{y ∼ f(x)}[u(a, y)] achieves higher expected utility than any other policy mapping forecasts to actions. In this sense, calibration is the right language for communicating uncertainty to expectation-maximizing, i.e., risk-neutral, agents, and the right rule for such agents to ingest calibrated forecasts is to act as if they are correct specifications of the label distribution.</p><p>In contrast, we seek the right interface between predictions and actions for risk-averse agents. Let a(·) : X → A be an action policy. We call ω(·) : X → R a utility certificate if it satisfies the following safety guarantee: Pr( u(a(X), Y) ≥ ω(X) ) ≥ 1 − ε. (1)</p><p>In words, with probability at least 1 − ε, the utility of an agent following the policy a(x) is guaranteed to be at least ω(x). 
Naturally, we aim to maximize the average value of the utility certificate ω subject to satisfying the requirement in (1); i.e., the risk-averse agent seeks to maximize the average quantile of its utility, commonly referred to as the value at risk in the financial risk literature (see Section 2 for details on the problem formulation). This objective yields the optimal balance between safety and utility, achieved by finding the pair (a, ω) that satisfies the safety constraint while maximizing the average utility certificate.</p><p>In practice, however, the true probability distribution that connects the actions to their utility values is unknown. Instead, the decision maker must rely on (uncertain) predictions to best balance the trade-off between safety and utility.</p><p>The core challenge in this regard is to develop the right notion of uncertainty quantification for the predictions, and optimal action policies based on such uncertainty measures.</p><p>We show that prediction sets are the right medium for communicating uncertainty to risk-averse decision makers who seek high-probability guarantees on their realized utility, i.e., the quantiles of their utility distribution as formulated in (1). Specifically, we prove that optimizing action policies to maximize utility while satisfying (1) is fundamentally equivalent to designing prediction sets optimally, followed by a simple max-min decision rule. This establishes prediction sets as a sufficient statistic for safe action policies, encapsulating all necessary information for risk-averse decision making. We then derive an explicit formulation for the optimal prediction sets, which serves as the foundation for a finite-sample algorithm providing distribution-free safety guarantees. Put together, these results characterize the optimal interface between predictions and actions for risk-averse decision making, as depicted in Figure <ref type="figure">1</ref>. In more detail:</p><p>1. 
Max-min decision rule. When given prediction sets C(x) with only a marginal coverage guarantee, risk-averse decision makers should choose their action by maximizing worst-case utility over all labels y ∈ C(x).</p><p>We prove this max-min policy is minimax optimal over all data distributions satisfying the marginal coverage guarantee (Proposition 2.2).</p><p>2. Prediction-set equivalence. The optimal pair of action policy and utility certificate can be obtained by applying the max-min decision rule to a suitably designed prediction set with marginal coverage (Theorem 2.3). This establishes that prediction sets are a sufficient statistic for safe decision making.</p><p>3. Optimal design of prediction sets. We formulate Risk Averse Conformal Prediction Optimization (Section 2.2) to find prediction sets that maximize the target utility quantile under the max-min policy. Using duality theory, we derive an explicit, one-dimensional characterization of the optimal sets (Theorem 3.2), which underpins our finite-sample construction.</p><p>4. Finite-sample algorithm. We propose Risk-Averse Calibration (RAC) (Section 4), which can exploit any black-box predictive model to derive action policies and utility certificates while providing a distribution-free safety guarantee (1). This guarantee holds for any given utility function.</p><p>5. Experiments. In Section 5, we compare RAC with several conformal-prediction methods <ref type="bibr">(Cortes-Gomez et al., 2024;</ref><ref type="bibr">Romano et al., 2020;</ref><ref type="bibr">Sadinle et al., 2019)</ref> and best-response baselines. Across multiple tasks, such as medical diagnosis, RAC achieves a superior trade-off between safety and utility, delivering higher utility at each user-specified risk threshold.</p></div>
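To make the max-min rule in item 1 concrete, here is a minimal, self-contained sketch. The treatment and diagnosis names and the utility numbers are hypothetical, chosen only for illustration; the paper works with abstract action and label sets.

```python
# Illustrative sketch of the max-min decision rule (hypothetical
# actions, labels, and utility values; not from the paper).

def max_min_action(utility, actions, prediction_set):
    """Return (a_RA, omega_RA): the action maximizing worst-case
    utility over the prediction set, and the certified utility
    omega_RA = max_a min_{y in C} u(a, y)."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        worst = min(utility[(a, y)] for y in prediction_set)
        if worst > best_value:
            best_action, best_value = a, worst
    return best_action, best_value

# Toy utility table: three treatments, three possible diagnoses.
actions = ["conservative", "standard", "aggressive"]
utility = {
    ("conservative", "mild"): 0.6, ("conservative", "moderate"): 0.5,
    ("conservative", "severe"): 0.4,
    ("standard", "mild"): 0.8, ("standard", "moderate"): 0.7,
    ("standard", "severe"): 0.1,
    ("aggressive", "mild"): 1.0, ("aggressive", "moderate"): 0.2,
    ("aggressive", "severe"): 0.0,
}

print(max_min_action(utility, actions, {"mild", "moderate", "severe"}))
# -> ('conservative', 0.4): a wide set forces a cautious choice
print(max_min_action(utility, actions, {"mild"}))
# -> ('aggressive', 1.0): a confident, narrow set permits a bold one
```

Note how widening the prediction set pushes the rule toward the conservative action; this is exactly the safety-utility trade-off that the set design must balance.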
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Related Work</head><p>Conformal prediction (CP), introduced by <ref type="bibr">Vovk et al. (2005)</ref>, provides a flexible framework for constructing prediction sets with finite-sample guarantees <ref type="bibr">(Lei et al., 2018;</ref><ref type="bibr">Shafer and Vovk, 2008)</ref>. Recent research has explored adapting CP to various decision-making problems. Here, we briefly discuss the most relevant works and provide a thorough discussion in Appendix A.</p><p>Risk Control. A growing line of research extends CP beyond coverage constraints to control more general risk measures <ref type="bibr">(Lindemann et al., 2023;</ref><ref type="bibr">Angelopoulos et al., 2022;</ref><ref type="bibr">2021;</ref><ref type="bibr">Cortes-Gomez et al., 2024;</ref><ref type="bibr">Lekeufack et al., 2024)</ref>. <ref type="bibr">Angelopoulos et al. (2022)</ref> propose conformal risk control for risk measures over prediction sets, and <ref type="bibr">Cortes-Gomez et al. (2024)</ref> extend this by constructing sets that satisfy coverage while achieving low risk. However, they do not explicitly discuss which actions their sets should inform or how to design these sets to best serve the decision maker. <ref type="bibr">Lindemann et al. (2023)</ref> apply conformal prediction to safe planning, and <ref type="bibr">Lekeufack et al. (2024)</ref> focus on decisions parameterized by a single scalar, calibrated to control risk. However, they restrict their action policy to a predefined low-dimensional family, leaving open the question of how to jointly optimize over policy design and uncertainty quantification for risk-averse utility.</p><p>In this paper, we fill this gap by addressing three core questions for a risk-averse decision maker: (1) What is the correct notion of uncertainty quantification? We prove that prediction sets are optimal for high-stakes decisions.</p><p>(2) How can we design these optimal sets? 
We provide an exact population-level characterization and a distribution-free, finite-sample construction.</p><p>(3) What is the optimal policy given these sets? We show that a simple max-min rule is optimal for risk-averse utility. In Section 5, we implement the most recent approach in this direction, <ref type="bibr">Cortes-Gomez et al. (2024)</ref>, and demonstrate that our framework yields significantly more effective action policies.</p><p>Risk Aversion in Economics. Decision-making under risk aversion is fundamental in economics, beginning with Bernoulli's expected utility theory <ref type="bibr">(Bernoulli, 1954)</ref> and formalized by Von <ref type="bibr">Neumann and Morgenstern's axiomatic model (von Neumann and Morgenstern, 1944)</ref>. Pratt <ref type="bibr">(Pratt, 1964)</ref> and Arrow <ref type="bibr">(Arrow, 1965)</ref> introduced precise measures of risk aversion (Arrow-Pratt coefficients), while <ref type="bibr">Hadar and Russell (Hadar and Russell, 1969)</ref> and Hanoch and Levy <ref type="bibr">(Hanoch and Levy, 1969)</ref> developed stochastic dominance criteria. <ref type="bibr">Rothschild and Stiglitz (Rothschild and Stiglitz, 1970)</ref> further refined risk comparison through mean-preserving spreads. Recent extensions have addressed robust criteria such as maximin and minimax-regret under ambiguity <ref type="bibr">(Manski, 2000;</ref><ref type="bibr">2004;</ref><ref type="bibr">Manski and Tetenov, 2007;</ref><ref type="bibr">Manski, 2011)</ref> (see also the recent survey <ref type="bibr">(Royset, 2024)</ref>). Unlike these classical frameworks, our approach emphasizes data-driven learning and distribution-free uncertainty quantification, providing risk-averse guarantees applicable to any black-box pretrained model.</p><p>Domain-Specific CP Methodologies. 
Decision making with CP has also been explored in specific domains such as robust optimization <ref type="bibr">(Patel et al., 2024b;</ref><ref type="bibr">Johnstone and Cox, 2021;</ref><ref type="bibr">Yeh et al., 2024)</ref>, medical tasks <ref type="bibr">(Banerji et al., 2023;</ref><ref type="bibr">Vazquez and Facelli, 2022)</ref>, power and energy systems <ref type="bibr">(Renkema et al., 2024)</ref>, formal verification <ref type="bibr">(Lindemann et al., 2024)</ref>, and chance-constrained optimization <ref type="bibr">(Zhao et al., 2024)</ref>. While our framework could potentially be extended to these settings, each may involve additional domain-specific challenges beyond the scope of this work. Additionally, recent works have also explored the application of CP sets in decision making in the context of counterfactual inference <ref type="bibr">(Lei and Candès, 2021;</ref><ref type="bibr">Yin et al., 2024;</ref><ref type="bibr">Jin et al., 2023)</ref>. We, however, focus on risk-averse decision making using prediction sets. In particular, we show that prediction sets are a sufficient statistic for risk-averse agents that aim to optimize their value at risk.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">The Preliminaries of Risk-Averse Decision Making</head><p>In this section, we formalize the central objective of a risk-averse decision maker. Recall that in our setting, upon observing x ∈ X, the decision maker has to take an action a ∈ A. The decision maker does not observe the true label y, but its utility depends on both the action a and the label y, as captured by a given utility function u.</p><p>We focus on risk-averse decision making, where the goal is to choose actions that ensure a sufficiently high utility with high probability over the randomness of the label. That is, risk aversion prioritizes minimizing the likelihood of low-utility outcomes, even at the cost of overlooking higher but uncertain utilities. Formally, given a risk tolerance threshold ε, a decision maker facing x ∈ X assigns each action a ∈ A a value: v_ε(a; x) = sup{ ω ∈ R : Pr( u(a, Y) ≥ ω ) ≥ 1 − ε }, (2) where Y ∼ p(y | x). This standard risk measure, known in the financial risk literature as Value at Risk (VaR) <ref type="bibr">(Duffie and Pan, 1997)</ref>, represents the largest value such that, if action a is taken, the utility is at least v_ε(a; x) with probability 1 − ε. Thus, the risk-averse decision maker selects the action maximizing v_ε(a; x), ensuring the highest guaranteed utility.</p><p>The above risk-averse utility should be contrasted with the best expected utility max_{a} E[u(a, Y) | X = x]. The latter leads to actions that maximize the average utility, whereas the former aims to maximize the worst-case utility that occurs with probability 1 − ε. Hence the former is more risk-averse, at the cost of being more conservative. It is important to mention that the economic literature extensively explores various other notions of risk aversion, such as Conditional Value-at-Risk (CVaR) (see e.g. <ref type="bibr">(Royset, 2024)</ref>). 
However, here we only focus on the aforementioned risk measure; the exploration of these alternative risk notions remains beyond the scope of this work.</p><p>Marginal Version. The quantity in ( <ref type="formula">2</ref>) is a point-wise or conditional quantity; i.e., to find the best action according to (2), the decision maker requires access to the conditional distribution p(y|x). In practice, such distributions are unknown, and guarantees of the form (2) are often intractable when only a finite sample of the distribution is available. An analogous situation arises in conformal prediction (CP), where obtaining fully conditional coverage guarantees is known to be impossible from a finite sample of data. Consequently, conformal prediction focuses on relaxed, i.e., marginal (or "group conditional", which still marginalize over part of the distribution <ref type="bibr">(Bastani et al., 2022;</ref><ref type="bibr">Jung et al., 2023)</ref>) coverage guarantees, which are statistically tractable.</p><p>By analogy, we will now introduce the marginal version of (2). First we rewrite the objective. For a given x ∈ X, the value v_ε(x) in ( <ref type="formula">2</ref>) can be equivalently written as follows</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Maximize</head><p>v_ε(x) = max over a ∈ A, ω ∈ R of ω, subject to Pr( u(a, Y) ≥ ω | X = x ) ≥ 1 − ε.</p><p>Let us examine the constraint in the above optimization more carefully. We are looking for action-value pairs (a, ω) such that we are guaranteed, with probability at least 1 − ε, that when taking action a the resulting utility is at least ω.</p><p>Of course, to maximize utility, we should maximize over the choice of the action a and the value ω, which results in the above optimization. Now, the risk-averse constraint in the above optimization has the following marginal counterpart: Pr( u(a(X), Y) ≥ ω(X) ) ≥ 1 − ε, (3) where the function a(·) : X → A is a decision policy that<ref type="foot">foot_2</ref> maps features to actions such that the utility is guaranteed to be at least ω(X), for a function ω(·) : X → R, with probability at least 1 − ε marginalized over X. Now, rather than optimizing over a single value of a and ω for each x separately, we jointly optimize over policies a(·) and value functions ω(·)<ref type="foot">foot_3</ref> which map X to actions and values, respectively. This results in the following marginal version of the decision maker's optimization problem:</p><p>Risk Averse Decision Policy Optimization (RA-DPO): maximize over a(·) : X → A and ω(·) : X → R the objective E_X[ω(X)], subject to Pr( u(a(X), Y) ≥ ω(X) ) ≥ 1 − ε.</p><p>Remark 2.1. While our primary focus is on the marginal formulation of risk-averse optimization, one can also consider the more advanced setting of group-conditional validity <ref type="bibr">(Jung et al., 2023;</ref><ref type="bibr">Gibbs et al., 2023)</ref>. Specifically, for arbitrary groups g_1, . . . , g_m ⊆ X, the marginal constraint in RA-DPO generalizes to: Pr( u(a(X), Y) ≥ ω(X) | X ∈ g_i ) ≥ 1 − ε for every i. Such constraints enable finer control over risk across subpopulations, which is critical in applications requiring group-specific guarantees. We leave the exploration of this objective to future work and believe our findings provide a principled first step toward that direction.</p></div>
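The value-at-risk objective in (2) is straightforward to compute when the conditional label distribution is finite. The following sketch, with entirely hypothetical utilities and probabilities, contrasts the VaR-maximizing action with the expectation-maximizing one.

```python
# Sketch of the value-at-risk objective in (2) for a finite conditional
# label distribution. All numbers are hypothetical.

def var_of_action(utils_probs, eps):
    """Largest omega with Pr(u(a, Y) >= omega | x) >= 1 - eps, where
    utils_probs lists (utility, probability) pairs for u(a, Y) | x."""
    mass = 0.0
    for u, p in sorted(utils_probs, reverse=True):  # highest utility first
        mass += p
        if mass >= 1 - eps:
            return u
    raise ValueError("probabilities must accumulate to at least 1 - eps")

def risk_averse_action(actions_dist, eps):
    """Action maximizing the epsilon-VaR of its utility."""
    return max(actions_dist, key=lambda a: var_of_action(actions_dist[a], eps))

# "risky" has the higher expected utility (0.87), but its utility drops
# below 0.7 with probability 0.15 > eps, so the VaR criterion rejects it.
actions_dist = {
    "risky": [(1.0, 0.85), (0.2, 0.10), (0.0, 0.05)],
    "safe":  [(0.7, 1.00)],
}
print(var_of_action(actions_dist["risky"], 0.1))  # -> 0.2
print(risk_averse_action(actions_dist, 0.1))      # -> 'safe' (VaR 0.7)
```

A risk-neutral agent maximizing expectation would pick "risky" (0.87 vs. 0.7); the ε = 0.1 VaR criterion picks "safe", illustrating the conservatism discussed above.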
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">A Prediction Set Perspective</head><p>Recall that in our setting the (feature, label) pair is generated according to a distribution. The decision maker only observes the feature x, based on which it chooses its action a. However, the realized utility depends on both the action a and the label y. The decision maker does not observe the label, but we assume that it has access to a predictor that provides predictions about the label y given the input feature x. More specifically, we assume that the predictor provides prediction sets of the form C(x) ⊆ Y, x ∈ X, that are guaranteed to contain the true label with high probability. We assume that the prediction sets satisfy the marginal coverage guarantee, i.e., Pr( Y ∈ C(X) ) ≥ 1 − ε. (4)</p><p>Given this framework, two immediate questions arise: (i) Assuming the only information that the decision maker has about the true label is through the prediction sets, how should it choose its actions to maximize (risk-averse) utility? (ii) How should the prediction sets be designed to not only be marginally valid according to (4) but also maximize the utility achieved by the decision maker?</p><p>We will proceed with answering question (i) now, and will provide an answer to question (ii) in the subsequent sections.</p><p>Assuming that the decision maker can only take actions based on the prediction sets, i.e., it has no other information about the label distribution, its optimal decision rule takes a simple and natural form: it has to play the action a that maximizes its utility u(a, y) in the worst case over labels y ∈ C(x). 
We denote this optimal risk-averse (RA) decision rule by a_RA : 2^Y → A, and the corresponding utility certificate by ω_RA : 2^Y → R:</p><p>a_RA(C) = arg max_{a ∈ A} min_{y ∈ C} u(a, y), (5)</p><p>ω_RA(C) = max_{a ∈ A} min_{y ∈ C} u(a, y). (6)</p><p>We will show that this decision rule is minimax optimal over the set of all distributions that are consistent with the marginal guarantee (4). Assume that the decision maker is given access to a set function C : X → 2^Y. Let us also define 𝒫 as the set of all data distributions that are consistent with the marginal guarantee; i.e., the set of all distributions P over (X, Y) such that Pr_P( Y ∈ C(X) ) ≥ 1 − ε. Let ϑ : 2^Y → A be a policy that takes as input the prediction set C(x) and outputs an action. Aligned with RA-DPO, the value of policy ϑ with respect to a joint distribution p(x, y) can then be defined as the largest average utility certificate E_X[ω(X)] for which Pr( u(ϑ(C(X)), Y) ≥ ω(X) ) ≥ 1 − ε holds under p. We are now interested in the policy that is minimax optimal, meaning that it performs well with respect to the worst-case distribution in 𝒫. That is to say, we want to find the policy ϑ* solving</p><p>ϑ* = arg max_ϑ min_{P ∈ 𝒫} Value_P(ϑ). (7)</p><p>Proposition 2.2. Assume ε &lt; 0.5 and let ϑ* be the optimal solution to (7). Then we have ϑ*(C(x)) = a_RA(C(x)) for all x ∈ X.</p><p>To summarize, Proposition 2.2 states that when the risk-averse decision maker decides based on a prediction set C that contains the actual label with high probability, there is a simple yet minimax-optimal policy, a_RA(C(x)), that guarantees the minimum utility of ω_RA(C(x)) with high probability.</p><p>We now focus on how to design prediction sets that are most useful for the decision maker among all prediction sets that provide a valid marginal guarantee.</p></div>
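The mechanism behind this safety claim can be checked numerically: whenever the realized label lands in C(x), the max-min action earns at least ω_RA(C(x)) by definition of the inner min, so marginal coverage of 1 − ε transfers directly to the utility guarantee. A Monte Carlo sanity check, with an entirely hypothetical distribution, set-valued predictor, and utility table:

```python
import random

# Monte Carlo sanity check: marginal coverage of the sets transfers to
# the utility guarantee under the max-min rule. The label distribution,
# prediction sets, and utilities below are hypothetical.

random.seed(0)
actions = ["hold", "act"]
labels = ["good", "ok", "bad"]
utility = {("hold", "good"): 0.5, ("hold", "ok"): 0.5, ("hold", "bad"): 0.4,
           ("act", "good"): 1.0, ("act", "ok"): 0.6, ("act", "bad"): 0.0}

def max_min(C):
    vals = {a: min(utility[(a, y)] for y in C) for a in actions}
    a_ra = max(vals, key=vals.get)
    return a_ra, vals[a_ra]

trials, covered, guaranteed = 10_000, 0, 0
for _ in range(trials):
    y = random.choices(labels, weights=[0.6, 0.3, 0.1])[0]
    # A crude set-valued predictor that shrinks to {"good"} ~10% of the
    # time; its marginal coverage is still at least 0.9.
    C = set(labels) if random.random() < 0.9 else {"good"}
    a_ra, omega = max_min(C)
    covered += (y in C)
    guaranteed += (utility[(a_ra, y)] >= omega)

print(covered / trials, guaranteed / trials)  # both well above 0.9
```

Every covered round is also a guaranteed round, so the empirical frequency of u(a_RA, Y) ≥ ω_RA is at least the empirical coverage, matching the guarantee in (1).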
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">An Equivalent Formulation via Prediction Sets</head><p>In the previous section we argued that, when deciding based on prediction sets, the (minimax) optimal policy a_RA and its associated value ω_RA are given in (5). Hence, assuming that the decision maker plays a_RA, the prediction sets C(x) should be designed to maximize the resulting utility of the decision maker while ensuring marginal coverage; i.e., the following optimization:</p><p>Risk Averse Conformal Prediction Optimization (RA-CPO): maximize over C(·) : X → 2^Y the objective E_X[ max_{a ∈ A} min_{y ∈ C(X)} u(a, y) ], subject to Pr( Y ∈ C(X) ) ≥ 1 − ε.</p><p>One might expect that the result of RA-CPO, i.e., optimizing the utility using prediction sets, would lead to a lower utility compared to the original optimization RA-DPO. This is because: (i) The policy given in ( <ref type="formula">5</ref>) is a specific policy designed to be valid even for the worst-case distribution for which the prediction sets are marginally valid (see Proposition 2.2). Hence, this policy could be overly conservative;</p><p>(ii) In RA-DPO the optimal action and value functions are obtained assuming full information about the data distribution, whereas in RA-CPO that information must be filtered through a (properly designed) prediction set representation. One might expect a priori that passing from the actual distribution to a lossy prediction set representation would discard information that is critical to finding the optimal policy. However, the following theorem shows, perhaps surprisingly, that this is not the case: the optimal action policy for any distribution can be represented as a max-min rule over a prediction set. Theorem 2.3. RA-DPO and RA-CPO are equivalent. In other words, from any optimal solution of RA-DPO, denoted by (a*(x), ω*(x)), we can construct an optimal solution C*(x) to RA-CPO with the same utility, i.e., E_X[ω_RA(C*(X))] = E_X[ω*(X)].</p><p>
Also, from any optimal solution of RA-CPO we can construct an optimal solution for RA-DPO with the same utility.</p><p>Implications. Prediction sets are a fundamental object in risk-averse decision making. In particular, the optimal strategy of a risk-averse decision maker can be formulated as playing a max-min strategy over a well-designed prediction set. To fully characterize such optimal policies, the first step is to derive the optimal solution to RA-CPO.</p></div>
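One direction of the Theorem 2.3 equivalence can be sketched directly: given an optimal pair (a*(x), ω*(x)), collect the labels whose utility under a*(x) is at least ω*(x); the max-min rule on that set recovers the same action and certificate. All names and numbers below are hypothetical, including the assumed optimal pair.

```python
# Sketch of the policy-to-set direction of Theorem 2.3. The utility
# table and the assumed optimal pair (a*, omega*) are hypothetical.

def set_from_policy(utility, labels, a_star, omega_star):
    """C*(x) = { y : u(a*(x), y) >= omega*(x) }."""
    return {y for y in labels if utility[(a_star, y)] >= omega_star}

def max_min(utility, actions, C):
    vals = {a: min(utility[(a, y)] for y in C) for a in actions}
    a_ra = max(vals, key=vals.get)
    return a_ra, vals[a_ra]

actions = ["conservative", "standard", "aggressive"]
labels = ["mild", "moderate", "severe"]
utility = {
    ("conservative", "mild"): 0.6, ("conservative", "moderate"): 0.5,
    ("conservative", "severe"): 0.4,
    ("standard", "mild"): 0.8, ("standard", "moderate"): 0.7,
    ("standard", "severe"): 0.1,
    ("aggressive", "mild"): 1.0, ("aggressive", "moderate"): 0.2,
    ("aggressive", "severe"): 0.0,
}

# Suppose the optimal policy at this x plays "standard" with
# certificate 0.7 (a hypothetical optimum, for illustration).
C_star = set_from_policy(utility, labels, "standard", 0.7)
print(sorted(C_star))                     # -> ['mild', 'moderate']
print(max_min(utility, actions, C_star))  # -> ('standard', 0.7)
```

The max-min rule applied to C* reproduces the assumed optimal action and value, which is the sense in which the set is a sufficient statistic here.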
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">The Optimal Prediction Sets</head><p>We characterize the optimal solution (i.e., prediction sets) for RA-CPO given in (2.2) in terms of the conditional distribution p(y | x). We begin by introducing the fundamental</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2</head><p>Figure <ref type="figure">2</ref>. Left: an example with three actions A = {a1, a2, a3} and four labels Y = {y1, y2, y3, y4}, writing Pi := p(yi | x). For each action aj, the illustrated value is the (1 − t)-quantile of u(aj, Y); ω(x, t) is the maximum of these quantiles over the actions, and a(x, t) is the maximizing action. Right: illustration of how the function g(x, ϖ) is obtained from ω(x, t) for a given x.</p><p>notions that relate optimal utility to coverage. We define the functions ω : X × [0, 1] → R and a : X × [0, 1] → A as</p><p>ω(x, t) = max_{a ∈ A} sup{ v ∈ R : Pr( u(a, Y) ≥ v | X = x ) ≥ t }, (8)</p><p>a(x, t) = arg max_{a ∈ A} sup{ v ∈ R : Pr( u(a, Y) ≥ v | X = x ) ≥ t }. (9)</p><p>In words, given a feature x ∈ X and a probability coverage value t ∈ [0, 1], ω(x, t) is computed as follows (see also Figure <ref type="figure">2</ref>): For each action a, we first find the (1 − t)-quantile of the random variable u(a, Y), with Y distributed according to p(y|x). This quantile value is the largest utility achievable with probability at least t when we take action a. By maximizing these (1 − t)-quantiles over the choice of the action a, we obtain ω(x, t). 
In words, for x ∈ X, the value ω(x, t) represents the optimal (risk-averse) utility achievable under a conditional coverage assignment t, and the maximizing action is denoted by a(x, t).</p><p>Let us now explain how the function ω(x, t) plays a role in finding an optimal solution for RA-CPO. Fix an instance x and assume that we would like to assign conditional coverage probability t to x. For this specific instance x, we would like to construct a prediction set C(x) with coverage at least t, i.e., Pr(Y ∈ C(x) | X = x) ≥ t, where the probability is over the conditional distribution p(y|x). We ask: How should C(x) be designed to maximize the objective of RA-CPO? The following proposition provides the answer.</p><p>Proposition 3.1. Fix an instance x ∈ X and a coverage value t ∈ [0, 1]. Then, among all the sets C ⊆ Y that have coverage at least t, i.e., Pr(Y ∈ C | X = x) ≥ t, the following set has the largest risk-averse utility value ω_RA(C) = max_{a ∈ A} min_{y ∈ C} u(a, y): C(x, t) = { y ∈ Y : u(a(x, t), y) ≥ ω(x, t) }. (11)</p><p>Further, we have ω_RA(C(x, t)) = ω(x, t).</p><p>The optimal sets for RA-CPO (2.2) can now be obtained via the following re-parametrization in terms of the coverage probabilities assigned to each x ∈ X. In order to satisfy the marginal constraint of RA-CPO, we need to assign to each x a coverage value t(x) such that E_X[t(X)] ≥ 1 − ε. By the above proposition, if an instance x is assigned t units of (probability) coverage, then it can add the maximum utility amount of ω(x, t) to the objective, and its corresponding prediction set, which is optimal given t units of coverage assigned to x, is given in (11). 
Hence, to find the optimal prediction sets, we should find the assignment t(x) that optimally distributes the (1 − ε) units of probability over the feature space X so that the expected utility is maximized. This step is captured by the following equivalent reformulation of RA-CPO:</p><p>maximize over t : X → [0, 1]: E_X[ ω(X, t(X)) ], subject to: E_X[ t(X) ] ≥ 1 − ε. (12)</p><p>Once the optimal solution t*(x) to the above reparametrization of RA-CPO is found, the optimal policy/actions, denoted by a*(x) = a(x, t*(x)), are derived according to (9), and the optimal prediction set is given by: C*(x) = C(x, t*(x)). (13)</p><p>Let us summarize what we have done so far: We proved that RA-DPO (2) and RA-CPO (2.2) are equivalent. Then, to solve RA-CPO, we used the reparametrization in (<ref type="formula">12</ref>), which we will now solve.</p><p>Using tools from duality theory, we can show that the optimization problem (12) admits a solution with a simple "one-dimensional" structure in terms of a scalar parameter ϖ ∈ R and an assignment function g : X × R → [0, 1]. An illustration of the function g(x, ϖ) is provided in Figure <ref type="figure">2</ref>. One can observe that g(x, ·) is connected to the convex-conjugate transform of the function ω(x, ·).</p><p>Theorem 3.2. Assume that the marginal distribution of X is continuous. Then, the optimal solution of (12) has the form t*(x) = g(x, ϖ*) for a value ϖ* ∈ R. Consequently, the optimal prediction sets for RA-CPO are obtained by using t*(x) in (13). Further, the value of ϖ* is a solution to the following equation in the scalar ϖ: E_X[ g(X, ϖ) ] = 1 − ε.</p><p>The main implication of the above theorem is that it provides a simple characterization of the optimal sets given access to the data distribution: (i) Find the scalar ϖ* solving E_X[ g(X, ϖ) ] = 1 − ε; (ii) Compute the coverage assignment t*(x) = g(x, ϖ*) from (<ref type="formula">14</ref>); (iii) The optimal prediction set for x, C*(x), is then given by (<ref type="formula">13</ref>).
The scalar characterization via ϖ is particularly useful when only approximate conditional probabilities are available. By substituting p(y | x) with an approximation in all definitions, we can still apply Theorem 3.2 to find a ϖ that ensures valid coverage for the corresponding prediction sets. This simple scalar calibration then yields prediction sets whose risk-averse utility improves (and eventually becomes optimal) as the quality of the estimated probabilities improves.</p></div>
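Since ϖ* is characterized by the scalar equation E_X[g(X, ϖ)] = 1 − ε, it can be found numerically once g is computable. The sketch below treats g as a black-box callable and additionally assumes (an assumption of this sketch, not a statement of the paper) that the empirical mean of g is continuous and non-increasing in ϖ, so that bisection applies:

```python
def calibrate_scalar(g, xs, eps, lo=-100.0, hi=100.0, iters=60):
    """Solve (1/n) * sum_x g(x, w) = 1 - eps for the scalar w by bisection.
    Assumes (for this sketch only) that the empirical mean of g is
    continuous and non-increasing in w; g is a black-box callable."""
    target = 1.0 - eps
    def mean_g(w):
        return sum(g(x, w) for x in xs) / len(xs)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_g(mid) >= target:
            lo = mid   # average coverage at mid still >= target; move right
        else:
            hi = mid
    return lo
```

With a sample {x_i} standing in for the distribution of X, the returned scalar plays the role of ϖ* in the population characterization.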
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">The Main Algorithm: Risk Averse Calibration (RAC)</head><p>In Section 3, we derived the structure of the optimal prediction sets for the RA-CPO problem. These sets are defined by the functions ω(x, t) and a(x, t) given in (<ref type="formula">9</ref>), which fundamentally relate coverage to utility and actions, as well as the assignment function g(x, ϖ) introduced in (<ref type="formula">14</ref>). These quantities are defined based on the true conditional distribution, which is often unknown in practice.</p><p>In this section, we consider the finite-sample setting. We assume access to calibration samples {(X_i, Y_i)}_{i=1}^n and a predictive model f, which assigns a |Y|-dimensional probability vector f_x to each x ∈ X. The output f_x represents approximate label probabilities, such as those from a pre-trained model's softmax layer. We denote by f_x(y) the probability assigned to label y for input x.</p><p>Using the model f, we will estimate the functions ω, a, and g, defined in (<ref type="formula">9</ref>) and (<ref type="formula">14</ref>), by substituting the true conditional probabilities with their estimated counterparts obtained via f. Concretely, ω̂(x, t) and â(x, t) are obtained from (9) with p(y | x) replaced by f_x(y). (15)</p><p>Algorithm 1 Risk Averse Calibration (RAC)</p><p>From the result of Theorem 3.2 we know that the optimal prediction sets admit a "one-dimensional" structure in terms of the scalar parameter ϖ ∈ R, and the optimal conditional coverage assignment is derived using the function g(x, ϖ). Hence, to simplify notation, we analogously define ĝ(x, ϖ) from (14) with p(y | x) replaced by f_x(y). (17)</p><p>Following (13), the prediction sets take the form Ĉ(x) = { y ∈ Y : u(â(x, t̂(x)), y) ≥ ω̂(x, t̂(x)) }, where t̂(x) = ĝ(x, ϖ) for the calibrated scalar ϖ.</p><p>We can now present our main algorithm.</p><p>Theorem 4.1.
Assuming that the calibration data {(X_i, Y_i)}_{i=1}^n and (X_{n+1}, Y_{n+1}) are exchangeable, we have Pr[ Y_{n+1} ∈ C_RAC(X_{n+1}) ] ≥ 1 − ε.</p><p>Put differently, Theorem 4.1 states that the prediction sets constructed by RAC have the so-called distribution-free coverage guarantee. Recalling the definitions (5), we can now state the following corollary.</p><p>Corollary 4.2. Assuming that the calibration data {(X_i, Y_i)}_{i=1}^n and (X_{n+1}, Y_{n+1}) are exchangeable, we have Pr[ u(a_RA(C_RAC(X_{n+1})), Y_{n+1}) ≥ ω_RA(C_RAC(X_{n+1})) ] ≥ 1 − ε.</p><p>Putting the pieces together, Corollary 4.2 ensures that a simple max-min decision policy over RAC-constructed prediction sets provides a pair of action policy and utility certificate, namely a_RA(C_RAC(X_test)) and ω_RA(C_RAC(X_test)), providing a distribution-free safety guarantee according to (1). Moreover, Theorem 3.2 highlights RAC's practical relevance in terms of exploiting the predictive model. Specifically, RAC's utility performance depends on the quality of the predictive model f: if f closely estimates the true conditional probabilities, then the model-based definitions in (15) and (<ref type="formula">17</ref>) approximate their true counterparts in (<ref type="formula">9</ref>) and (<ref type="formula">14</ref>), ensuring that RAC-informed decisions align closely with the optimal ones, as guaranteed by Theorem 3.2.</p></div>
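Theorem 4.1's guarantee rests on the standard exchangeability argument behind split conformal prediction. Algorithm 1's exact statistic is not reproduced above, so the sketch below shows only the generic calibration mechanism that yields such distribution-free coverage: compute a score per calibration pair and threshold at an empirical quantile adjusted for the (n + 1)-th point. The ranking rule is standard; treating "score of the true label" as the statistic is our illustrative stand-in, not necessarily RAC's.

```python
import math

def conformal_threshold(cal_scores, eps):
    """Return the ceil((n + 1) * (1 - eps))-th smallest calibration score.
    Declaring a test point covered when its score is <= this threshold
    gives Pr[coverage] >= 1 - eps under exchangeability."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - eps))
    return sorted(cal_scores)[min(k, n) - 1]
```

The threshold is slightly conservative by design: the (n + 1) factor accounts for the test point's own rank among the calibration scores.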
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>In this section, given a pre-trained model which assigns probability f_x(y) to each input-label pair (x, y), we compare RAC with two groups of baselines:</p><p>Calibration + Best-Response. We calibrate the model on the calibration data using a strengthened version of decision calibration <ref type="bibr">(Zhao et al., 2021)</ref>, specifically the variant from <ref type="bibr">(Noarov et al., 2023)</ref>, which provides swap regret bounds. We then apply the best-response policy: best-response(x) = arg max_{a ∈ A} E_{y ∼ f_x}[ u(a, y) ]. While this method may achieve higher average utility, it fully trusts the model and is prone to critical errors.</p><p>Conformal Prediction + Max-Min. We construct (1 − ε)-valid prediction sets using split conformal prediction with three different scoring rules. The decision policy then applies the max-min rule from Section 2: a_RA(C(x)) = arg max_{a ∈ A} min_{y ∈ C(x)} u(a, y), which we proved is the optimal strategy when deciding based on prediction sets:</p><p>• score-1 <ref type="bibr">(Sadinle et al., 2019)</ref>: s(x, y) = 1 − f_x(y),</p><p>• score-2 <ref type="bibr">(Romano et al., 2020)</ref>: the adaptive score based on cumulative label probabilities,</p><p>• score-3 (Cortes-Gomez et al., 2024): a greedy scoring rule tailored to the max-min policy.</p><p>By varying ε, we can control the degree of conservativeness, trading off average utility against the avoidance of catastrophic errors. We compare in terms of safety and utility using the following metrics: (a) Average realized max-min value: The mean of the worst-case utility across the prediction sets (i.e., the average of ω_RA in (<ref type="formula">5</ref>)). (b) Fraction of critical mistakes: For samples with a critical ground-truth label, we report the fraction of cases in which each method chooses the worst action. (c) Average realized utility: The empirical mean of the realized utilities across all test samples.
(d) Realized miscoverage: The fraction of test samples for which the true label is not in the prediction set.</p></div>
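The four metrics above translate into a few lines of code over per-sample records; the dictionary keys and the convention that the "worst action" for a critical label is the utility-minimizing one are our hypothetical modeling of the description, not the paper's exact implementation.

```python
def evaluate(records, U, critical_labels):
    """records: dicts with keys 'C' (prediction set of labels), 'y' (true
    label), 'a' (chosen action); U[a][y] is the utility matrix."""
    n = len(records)
    n_actions = len(U)
    # worst action for each critical label: the utility-minimizing row
    worst = {y: min(range(n_actions), key=lambda a: U[a][y])
             for y in critical_labels}
    crit = [r for r in records if r['y'] in critical_labels]
    maxmin = [max(min(U[a][y] for y in r['C']) for a in range(n_actions))
              for r in records if r['C']]
    return {
        'avg_maxmin': sum(maxmin) / max(len(maxmin), 1),                # (a)
        'critical_mistakes': sum(r['a'] == worst[r['y']] for r in crit)
                             / max(len(crit), 1),                       # (b)
        'avg_utility': sum(U[r['a']][r['y']] for r in records) / n,     # (c)
        'miscoverage': sum(r['y'] not in r['C'] for r in records) / n,  # (d)
    }
```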
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Medical Diagnosis</head><p>In this experiment, we explore decision making in medical diagnosis and treatment as a risk-sensitive application.</p><p>We use the COVID-19 Radiography Database <ref type="bibr">(Chowdhury et al., 2020;</ref><ref type="bibr">Rahman et al., 2021)</ref>, containing chest X-ray images of four classes: Normal, Pneumonia, COVID-19, and Lung Opacity. The data are randomly split into training (70%), calibration (10%), and test (20%) sets. We then fine-tune an Inception v3 model <ref type="bibr">(Szegedy et al., 2015;</ref><ref type="bibr">2016)</ref> (pretrained by Google on ImageNet) by retraining the higher layers while preserving the early-layer features.</p><p>To capture clinical priorities, we employ the utility matrix in Table <ref type="table">1</ref>, which maps each true condition (rows) to a set of actions (columns). Although we use this specific matrix, our setup can accommodate any alternative design. (Further details on the AI-assisted construction appear in Appendix C.) All baselines are then calibrated to connect the model's predictions to these four actions. After training, we vary the nominal miscoverage parameter ε during calibration to study its impact on performance. As shown in Figure <ref type="figure">3</ref>(a), our method achieves the best trade-off curve among baselines, providing higher worst-case utilities for every nominal ε. Equivalently, it offers stronger utility certificates at each high-probability threshold. In Figure <ref type="figure">3</ref>(c), it also consistently outperforms other prediction-set-based methods in terms of average utility.</p><p>While the best-response method attains the highest overall average utility, Figure <ref type="figure">3</ref>(b) highlights its susceptibility to critical mistakes.
For example, in COVID-19 cases, best-response chooses the "no action" option over 60% of the time, i.e., it recommends the wrong treatment for a large fraction of patients with COVID-19. Our risk-averse policy (RAC) drives this error rate below 10% (at ε = 0.02), incurring only a modest (under 5%) drop in average utility. Finally, Figure <ref type="figure">3</ref>(d) confirms that all prediction-set-based baselines achieve their target miscoverage levels, ensuring that the associated high-probability utility guarantees remain statistically valid. Additionally, in Figure <ref type="figure">4</ref>, we report the full distribution of the utility of the actions taken by RAC for different values of ε. There, it is even clearer that as we increase 1 − ε, RAC avoids the extremely low-utility actions at the cost of missing out on some of the highest-utility ones, by resorting to conservative decisions.</p><p>The plots reported in this experiment can serve a broader purpose beyond evaluation: they provide a practical interface for choosing the right level of risk aversion in real-world deployments. By inspecting trade-offs across different ε values, e.g. by looking at plots similar to Figures <ref type="figure">3</ref> and <ref type="figure">4</ref>, practitioners can tune the system to their needs, for instance favoring safety over utility in high-stakes settings like medicine, or vice versa in lower-risk applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Recommender Systems</head><p>We next consider a risk-sensitive recommendation scenario using the MovieLens dataset. Each data point is a user-movie pair (x, y), where x = (user features, movie features) and the label y ∈ {1, 2, 3, 4, 5} is the user's rating. We split the data into training (80%), calibration (10%), and test (10%) sets, and train a neural network classifier f (details in the Appendix) to estimate the probability distribution f_x(y).</p><p>At test time, the policy must decide whether to recommend or not recommend a movie. We use the utility function in Table <ref type="table">2</ref>: if a movie with true rating y is recommended, the utility is y − 3, while not recommending yields 0.</p><p>Table 2. Utility matrix for the MovieLens recommendation task.</p><p>We vary the nominal miscoverage ε during calibration and measure performance on test data. As shown in Figure <ref type="figure">3</ref>(a), our method achieves the best trade-off among baselines, offering stronger utility certificates (worst-case utility) at all ε levels. Figure <ref type="figure">3</ref>(c) also shows that our approach outperforms other CP-based methods in average utility.</p><p>Although the best-response method achieves the highest overall average utility, Figure <ref type="figure">3</ref>(b) reveals its vulnerability to "critical mistakes", frequently recommending movies rated 1 or 2. Such failures can undermine user trust and harm a company's ability to retain its customers. In contrast, RAC (ε = 0.05) cuts these critical errors by 75%, while incurring only a modest (15%) reduction in average utility.</p></div>
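Under the Table 2 utilities, the max-min rule over a set of candidate ratings admits a closed form: recommending has worst-case utility min(C) − 3 and abstaining has utility 0, so recommending is max-min optimal exactly when every rating in the set is at least 3. A small sketch (the set contents are hypothetical, and ties at min(C) = 3 are broken toward recommending):

```python
def utility(action, y):
    """Table 2: recommending a movie with true rating y yields y - 3;
    not recommending yields 0."""
    return y - 3 if action == 'recommend' else 0

def max_min_action(C):
    """Max-min decision over a prediction set C of plausible ratings."""
    return max(('recommend', 'not_recommend'),
               key=lambda a: min(utility(a, y) for y in C))
```

This makes the safety-utility trade-off concrete: a wider set (larger 1 − ε) is more likely to contain a low rating and therefore more often abstains.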
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussion and Future Work</head><p>In this paper, we established the decision-theoretic foundations of conformal prediction, showing that valid prediction sets act as sufficient statistics for risk-averse agents optimizing their value at risk. We developed an algorithmic interface linking predictions from any black-box model to actions with marginal, distribution-free safety guarantees.</p><p>Although our focus has been primarily on marginal safety guarantees, we acknowledge the practical importance of stronger conditional guarantees. These include group-conditional (based on covariate characteristics), label-conditional (based on true labels), and action-conditional safety (based on chosen actions). Extending our results systematically to these more nuanced scenarios presents promising directions for future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Extended related works</head><p>The foundational idea of prediction sets can be traced back to early studies by <ref type="bibr">Wilks (1941)</ref>; <ref type="bibr">Wald (1943)</ref>; <ref type="bibr">Scheffe and Tukey (1945)</ref>; <ref type="bibr">Tukey (1947)</ref>. The initial concepts of conformal prediction (CP) were introduced in <ref type="bibr">Saunders et al. (1999)</ref>; <ref type="bibr">Vovk et al. (1999;</ref><ref type="bibr">2005)</ref>. With the advancement of machine learning, conformal prediction has become a widely adopted framework for constructing prediction sets <ref type="bibr">(Vovk, 2013;</ref><ref type="bibr">Papadopoulos et al., 2002;</ref><ref type="bibr">Lei et al., 2018;</ref><ref type="bibr">Romano et al., 2020;</ref><ref type="bibr">2019;</ref><ref type="bibr">Park et al., 2022;</ref><ref type="bibr">Angelopoulos et al., 2020)</ref>. There has been a growing body of work aiming to adapt conformal prediction methods for a range of decision-making problems. In the following, we will discuss the ones relevant to the present work.</p><p>Risk Control. A growing line of research extends CP beyond coverage constraints to control more general risk measures <ref type="bibr">(Lindemann et al., 2023;</ref><ref type="bibr">Angelopoulos et al., 2022;</ref><ref type="bibr">2021;</ref><ref type="bibr">Cortes-Gomez et al., 2024;</ref><ref type="bibr">Lekeufack et al., 2024;</ref><ref type="bibr">Zecchin and Simeone, 2024a;</ref><ref type="bibr">Blot et al., 2024;</ref><ref type="bibr">Zecchin and Simeone, 2024b)</ref>. In particular, <ref type="bibr">Angelopoulos et al. (2022)</ref> propose conformal risk control for monotone risk measures over prediction sets, and <ref type="bibr">Cortes-Gomez et al. (2024)</ref> extend this by constructing sets that satisfy coverage while achieving low risk. 
However, these works do not explicitly discuss which actions their sets should inform or how to design these sets to best serve the decision maker. Moreover, <ref type="bibr">Lindemann et al. (2023)</ref> apply conformal prediction to safe planning, and <ref type="bibr">Lekeufack et al. (2024)</ref> focus on decisions parameterized by a single scalar, calibrated to control risk. However, these works restrict their action policy to a predefined low-dimensional family, leaving open the question of how to jointly optimize over policy design and uncertainty quantification for risk-averse utility.</p><p>In this paper, we fill this gap by addressing three core questions for a risk-averse decision maker: (1) What is the correct notion of uncertainty quantification? We prove that prediction sets are optimal for high-stakes decisions. (2) How can we design these optimal sets? We provide an exact population-level characterization and a distribution-free, finite-sample construction. (3) What is the optimal policy given these sets? We show that a simple max-min rule is optimal for risk-averse utility. In Section 5, we implement the most recent approach in this direction <ref type="bibr">(Cortes-Gomez et al., 2024)</ref>, and demonstrate that our framework yields significantly more effective action policies.</p><p>On top of the fundamental differences we mentioned, there are also technical differences. After proving the equivalence of the risk-averse objective defined in Section 2 to the prediction-set optimization called RA-CPO in Section 2.2, one might think we could define a risk function of the form l(C) = − max_{a ∈ A} min_{y ∈ C(x)} u(a, y), and then apply risk-controlling methods to control this risk. However, controlling this risk alone is meaningless, as it is always possible to control it by outputting trivial sets. Hence, the risk must be controlled in combination with coverage guarantees.
The only risk-controlling framework that additionally allows for a coverage constraint is the work of <ref type="bibr">Cortes-Gomez et al. (2024)</ref>, which we compare against in Section 5, demonstrating superior performance in handling the safety-utility trade-off. Furthermore, the loss function l defined for a generic utility function u lacks any (approximate) separability or sub-modularity property, which are essential for the algorithmic development of <ref type="bibr">Cortes-Gomez et al. (2024)</ref>. We, however, work directly with the max-min objective and do not rely on any such assumptions. For readers familiar with nested conformal prediction <ref type="bibr">(Gupta et al., 2022)</ref>, perhaps another way to elaborate on this important technical difference is to look at Section 3, where in Theorem 3.2 we characterize the optimal prediction sets over the population. It is clear there that the optimal sets do not necessarily form a nested sequence as we sweep the miscoverage threshold ε. This is in contrast to finding the optimal sets for minimum average prediction-set size (or any other separable objective). There, the optimal characterization is of the form p(y|x) ≥ q (or more generally of the form s(x, y) ≤ q for some score function s), where q is tuned to satisfy the marginal coverage constraint <ref type="bibr">(Lei et al., 2013;</ref><ref type="bibr">Sadinle et al., 2019;</ref><ref type="bibr">Kiyani et al., 2024b)</ref>. This distinction hints at the sub-optimality of algorithms that rely on monotonicity properties of the risk, e.g. thresholding a score function, in obtaining the best risk-averse action policies and safety guarantees.</p><p>Robust Optimization.
The max-min policy that we discuss in Section 2.1 also naturally arises at the intersection of uncertainty quantification and robust optimization <ref type="bibr">(Patel et al., 2024b;</ref><ref type="bibr">Johnstone and Cox, 2021;</ref><ref type="bibr">Chenreddy and Delage, 2024;</ref><ref type="bibr">Li et al., 2025;</ref><ref type="bibr">Yeh et al., 2024;</ref><ref type="bibr">Cao, 2024;</ref><ref type="bibr">Wang et al., 2023;</ref><ref type="bibr">Lou et al., 2024;</ref><ref type="bibr">Patel et al., 2024a;</ref><ref type="bibr">Elmachtoub et al., 2023;</ref><ref type="bibr">Lin et al., 2024;</ref><ref type="bibr">Chan et al., 2024;</ref><ref type="bibr">2023;</ref><ref type="bibr">Chan and Kaw, 2020)</ref>. In robust optimization, decision-making under uncertainty is typically formulated as a minimax problem, where an optimal decision is sought against worst-case realizations within an uncertainty set. Despite the structural resemblance of these works to our framework, in that they involve optimization over an uncertainty set, their scope and objectives differ fundamentally from ours. We fix any black-box predictive model and any utility function, and in contrast to existing approaches, we jointly characterize the optimal notions of uncertainty quantification and action policy. Specifically, we ask: (1) What is the appropriate uncertainty quantification for risk-averse decision makers? We answer that prediction sets are optimal for achieving high-probability utility guarantees. (2) How should these prediction sets be optimally constructed? We provide a distribution-free, finite-sample construction that characterizes the optimal sets. (3) What is the optimal decision policy given these sets? We prove that the max-min rule is provably optimal for risk-averse agents.
In doing so, our Risk-Averse Calibration (RAC) method offers a principled alternative to uncertainty sets based on heuristic conformity-score designs, thereby contributing to the growing intersection of conformal prediction and robust optimization. On a more technical note, in Section 3 we show that the optimal prediction sets that lead to optimal safe action policies, when used in tandem with the max-min rule, do not necessarily take the form of thresholding a score function (i.e., s(x, y) ≤ q for some score function s). There, we characterize an alternative form that in fact captures the optimal prediction sets in the context of risk-averse decision-making. That is to say, our results hint at a principled alternative to conventional score-based prediction sets in the robust-optimization pipeline, avoiding this source of suboptimality.</p><p>Risk Aversion in Economics. Decision-making under risk aversion is a foundational topic in economics, shaped by seminal contributions. Bernoulli <ref type="bibr">(Bernoulli, 1954)</ref> introduced expected utility theory, explaining risk aversion via diminishing marginal utility. Von <ref type="bibr">Neumann and Morgenstern (von Neumann and Morgenstern, 1944)</ref> formalized this with an axiomatic model of rational choice under uncertainty. Pratt <ref type="bibr">(Pratt, 1964)</ref> and Arrow <ref type="bibr">(Arrow, 1965)</ref> developed the Arrow-Pratt coefficients, providing precise measures of risk aversion and distinguishing between increasing and decreasing risk sensitivity. <ref type="bibr">Hadar and Russell (Hadar and Russell, 1969)</ref>, along with Hanoch and Levy <ref type="bibr">(Hanoch and Levy, 1969)</ref>, introduced stochastic dominance to rank risky alternatives for risk-averse agents. <ref type="bibr">Rothschild and Stiglitz (Rothschild and Stiglitz, 1970)</ref> deepened this framework by defining mean-preserving spreads, a formal notion of increased risk.
More recent extensions introduced robust decision-making criteria, such as maximin and minimax-regret, applicable under ambiguous or unknown probabilities <ref type="bibr">(Manski, 2000;</ref><ref type="bibr">2004;</ref><ref type="bibr">Manski and Tetenov, 2007;</ref><ref type="bibr">Manski, 2011)</ref>. Collectively, these works established the theoretical underpinnings of risk aversion that continue to influence modern economic theory (for a recent survey, see <ref type="bibr">(Royset, 2024)</ref>). In contrast to these works, our work focuses on the data-driven learning and uncertainty quantification aspects of risk-averse decision making. We develop distribution-free methods capable of leveraging any black-box pretrained model, accompanied by risk-aversion guarantees.</p><p>Further Related Work. The potential connection of CP ideas to decision making has also been explored in <ref type="bibr">Vovk and Bendtsen (2018)</ref>, from the point of view of conformal predictive distributions. Conformal predictive distributions produce calibrated distributions rather than prediction sets; see e.g. <ref type="bibr">(Vovk et al., 2017;</ref><ref type="bibr">2018;</ref><ref type="bibr">2020)</ref>. Therefore, they are best compared with calibrated forecasts, as the methodologies developed in <ref type="bibr">Vovk and Bendtsen (2018)</ref> also target expectation-maximizing, i.e. risk-neutral, agents. Additionally, recent works have explored the application of CP sets to decision making in the context of counterfactual inference <ref type="bibr">(Lei and Candès, 2021;</ref><ref type="bibr">Yin et al., 2024;</ref><ref type="bibr">Jin et al., 2023)</ref>. We, however, focus on risk-averse decision making using prediction sets. In particular, we show that prediction sets are a sufficient statistic for risk-averse agents that aim to optimize their value at risk.
Alternatively, Bayesian methods for risk-averse decision-making often employ Gaussian Processes (GPs) to optimize measures like Value-at-Risk and Conditional Value-at-Risk; see, e.g., <ref type="bibr">(Sui et al., 2015;</ref><ref type="bibr">Nguyen et al., 2021;</ref><ref type="bibr">Demirel et al., 2022;</ref><ref type="bibr">Cakmak et al., 2020;</ref><ref type="bibr">Lin et al., 2022;</ref><ref type="bibr">Baudry et al., 2021;</ref><ref type="bibr">Jia et al., 2024)</ref>. These approaches rely on accurate Bayesian posterior distributions, thus implicitly assuming well-specified probabilistic models. Our conformal approach complements rather than competes with Bayesian methods: our theoretical results (up to Section 3) can be directly employed even in Bayesian settings. In fact, when Bayesian approximations are reliable, one can take advantage of our population-level derivation of the optimal prediction sets and then calibrate the sets with finite samples under the Bayesian model, without employing the finite-sample calibration of Section 4. Alternatively, even when the accuracy of the Bayesian assumptions is uncertain, one can still start from Bayesian posteriors and further calibrate prediction sets using our approach, ensuring robust safety guarantees.</p><p>Although our primary aim is to develop a general framework to construct prediction sets for high-stakes decision-making, we note that conformal prediction sets have been explored in a wide range of specific high-stakes applications and domains. For instance, CP methods have been adapted and used in medical tasks <ref type="bibr">(Banerji et al., 2023)</ref>, power and energy systems <ref type="bibr">(Renkema et al., 2024)</ref>, formal verification and control <ref type="bibr">(Lindemann et al., 2024)</ref>, and chance-constrained optimization <ref type="bibr">(Zhao et al., 2024)</ref>.</p><p>We construct a "worst-case" distribution in P for ϑ.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pick any</head><p>Under p*, we have Y ∈ C(X) almost surely (since C(x) is nonempty and we place all mass on a label in C(x)). Hence p* ∈ P, because the marginal coverage constraint Pr(Y ∈ C(X)) ≥ 1 − ε is satisfied. Moreover, since Y is chosen (with probability 1) to be the worst-case label within C(x), for this specific x, no matter how we choose ϑ, its achievable value is at most min_{y ∈ C(x)} u(ϑ(C(x)), y). In other words, no policy ϑ can achieve a value larger than this infimum for the inner minimization in (7). For x ∈ X with C(x) nonempty, define ω(x) = max_{a ∈ A} min_{y ∈ C(x)} u(a, y).</p><p>For those x ∈ X such that C(x) is empty, put ω(x) = max_{a ∈ A} max_{y ∈ Y} u(a, y). We claim that with probability at least 1 − ε, the policy a_RA(C(x)) achieves a utility of at least ω(x). Indeed, on the event {Y ∈ C(X)} (which has probability at least 1 − ε by assumption), it holds that u(a_RA(C(X)), Y) ≥ min_{y ∈ C(X)} u(a_RA(C(X)), y) = ω(X).</p><p>Thus, setting the target utility at each x to ω(x) satisfies Pr[ u(a_RA(C(X)), Y) ≥ ω(X) ] ≥ 1 − ε. Since p ∈ P was arbitrary, this lower bound holds uniformly over P. Comparing with the upper bound in Part 1 establishes that a_RA attains the best possible (minimax) value. Hence a_RA(C(x)) = arg max_{a ∈ A} min_{y ∈ C(x)} u(a, y) solves the minimax problem (7).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2. Proof of Theorem 2.3</head><p>We give a constructive proof by showing how, from each solution of RA-DPO, we can construct a feasible solution of RA-CPO without losing any utility, and vice versa. Applying this to the optimal solutions of both problems yields the result of the theorem.</p><p>(I) From RA-DPO to RA-CPO. Suppose we have a feasible solution (a(·), ω(·)) to the RA-DPO problem. Here, u_max = max_a max_y u(a, y), and, as mentioned in Section 2, since ω is a utility certificate its value at any x should be at most u_max. Since (a, ω) is a feasible solution of RA-DPO, it satisfies the following: Pr_{X,Y}[ u(a(X), Y) ≥ ω(X) ] ≥ 1 − ε.</p><p>Define a prediction set C(x) = { y : u(a(x), y) ≥ ω(x) }. In words, C(x) is the set of labels y for which the utility u(a(x), y) is at least ω(x). By definition, we have {Y ∈ C(X)} = { u(a(X), Y) ≥ ω(X) }. As a result, Pr[ Y ∈ C(X) ] = Pr[ u(a(X), Y) ≥ ω(X) ] ≥ 1 − ε. Hence, C(·) satisfies the marginal coverage constraint of RA-CPO.</p><p>Next, we improve the prediction sets C to new prediction sets C̄ which satisfy the marginal guarantee but can potentially have a larger value under the objective of RA-CPO. The basic idea is to consider points x ∈ X such that C(x) is empty and augment those empty sets with an additional element. Recall that we defined u_max := max_{a ∈ A} max_{y ∈ Y} u(a, y); hence, there exists at least one (action, label) pair, which we call (a_max, y_max), such that u_max = u(a_max, y_max). Now, let us define X_empty := { x ∈ X : C(x) = ∅ }, where ∅ denotes the empty set. We then update C(·) to C̄(·) as follows:</p><p>- if x ∈ X_empty: C̄(x) = {y_max}, - if x ∉ X_empty: C̄(x) = C(x).</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3. Proof of Proposition 3.1</head><p>Step 1: Any set C with coverage ≥ t has risk-averse utility at most ω(x, t).
Take an arbitrary set C ⊆ Y satisfying Pr(Y ∈ C | X = x) ≥ t. Then, for any action a ∈ A, min_{y ∈ C} u(a, y) ≤ Quantile_{1−t}( u(a, Y) | X = x ). (The reason is that with probability at least t, Y lies in C, and so the (1 − t)-quantile of u(a, Y) cannot be smaller than the smallest utility on this event.) Taking the maximum over a yields ω_RA(C) = max_{a ∈ A} min_{y ∈ C} u(a, y) ≤ max_{a ∈ A} Quantile_{1−t}( u(a, Y) | X = x ) = ω(x, t). Hence no set with coverage at least t can achieve risk-averse utility larger than ω(x, t).</p><p>Step 2: The set C(x, t) attains coverage t and achieves ω(x, t). Consider C(x, t) = { y : u(a(x, t), y) ≥ ω(x, t) }. By the definition of the (1 − t)-quantile, we have Pr( u(a(x, t), Y) ≥ ω(x, t) | X = x ) ≥ t, which implies Pr( Y ∈ C(x, t) | X = x ) ≥ t. Moreover, for every y ∈ C(x, t), by construction u(a(x, t), y) ≥ ω(x, t), so min_{y ∈ C(x,t)} u(a(x, t), y) ≥ ω(x, t). Thus ω_RA(C(x, t)) = max_{a ∈ A} min_{y ∈ C(x,t)} u(a, y) ≥ min_{y ∈ C(x,t)} u(a(x, t), y) ≥ ω(x, t).</p><p>Combining both steps shows that C(x, t) is an optimal choice among all sets with coverage at least t, and that its risk-averse utility equals ω(x, t).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4. Proof of Theorem 3.2</head><p>We start from the reparametrization of RA-CPO given in (12): maximize over t : X → [0, 1] the objective E_X[ ω(X, t(X)) ], subject to: E_X[ t(X) ] ≥ 1 − ε. (Reparametrization of RA-CPO)</p><p>We will further reparametrize this optimization problem and find equivalent relaxations. To do so, let us define the step function ϱ(x, t) := 1{ t ≤ t(x) }. (19)</p><p>We will also need the derivative of the function ω(x, t) with respect to its second argument t. Since ω can be discontinuous, we consider its generalized derivative (i.e., allowing delta functions), denoted ω′(x, t); we can simply think of ω′(x, t) as the derivative (d/dt) ω(x, t). We can then rewrite the objective of our optimization problem as E_X[ ω(X, t(X)) ] = u_max + E_X[ ∫_0^1 ω′(X, t) ϱ(X, t) dt ], where we used the fact that ω(x, 0) = u_max for any x ∈ X by definition, and that ω(x, t) − ω(x, 0) = ∫_0^t ω′(x, s) ds. Similarly, we can rewrite the constraint as E_X[ t(X) ] = E_X[ ∫_0^1 ϱ(X, t) dt ].</p><p>Given the above notation and relations, we can write down an equivalent reparametrization of (Reparametrization of RA-CPO). The optimization variable here is the function ϱ(x, t), which is a step function according to (19). We further note that any such step function defined on the unit interval can be equivalently thought of as a non-increasing function on the unit interval which only takes values in the set {0, 1}. Hence we arrive at an integer program that is an equivalent reparametrization of (Reparametrization of RA-CPO), as well as of RA-CPO itself. We now consider a relaxation of this integer program to a convex program. As we will see later, this relaxation is equivalent to the integer program, as every solution of the relaxed program corresponds to a solution of the integer program.
However, for now, let us focus on the following continuous relaxation, whose variable ϱ(x, t) can take values in the interval [0, 1] (in contrast to the original integer program, in which ϱ could take its values only in the set {0, 1}): maximize E_X[u_max + ∫₀¹ ϱ(X, t) ω′(X, t) dt], subject to E_X[∫₀¹ ϱ(X, t) dt] ≥ 1 − ε, with ϱ(x, ·) non-increasing and [0, 1]-valued (Relaxed Program). Here, the "optimization variable" ϱ(x, t) belongs to an infinite-dimensional space. Hence, in order to be fully rigorous, we will need to use the duality theory developed for general linear spaces that are not necessarily finite-dimensional. For a reader who is less familiar with infinite-dimensional spaces, what appears below is a direct extension of the duality theory (i.e. writing the Lagrangian) for the usual linear programs in finite-dimensional spaces.</p><p>Let F be the set of all measurable functions defined on X × [0, 1]. Note that F is a linear space. Let Ω be the set of all measurable functions on X × [0, 1] which are non-increasing in t and bounded between 0 and 1; i.e., Ω = {ϱ ∈ F s.t. ϱ : X × [0, 1] → [0, 1]; ∀x ∈ X : ϱ(x, t) is non-increasing in t} (20)</p><p>Note that Ω is a convex set. We can then rewrite the (Relaxed Program) as follows: Now, it is easy to see that if t* is such that ς(t*) = ς_max, then both steps (a) and (b) hold with equality (instead of inequality) for the following function</p><p>On the other hand, for step (a) to be tight we must have the following: for every point t such that ϱ′(t) < 0, we have ς(t) = ς_max. This shows that an optimal solution must be in the convex hull defined in the theorem, and hence the result of the theorem follows. The uniqueness also follows similarly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.5. Proof of Theorem 4.1</head><p>We have:</p><p>where (a) comes from the definition of the prediction set; (b) comes from the fact that (X_1, Y_1, π_{Y_1}), . . . , (X_n, Y_n, π_{Y_n}), (X_{n+1}, Y_{n+1}, π_{Y_{n+1}}) are exchangeable, which is due to (i) the exchangeability of the original n + 1 pairs {(X_i, Y_i)}_{i=1}^n ∪ {(X_{n+1}, Y_{n+1})}, and (ii) the symmetric way in which Algorithm 1 assigns π_y to each y ∈ Y. Finally, (c) follows from the definition of π_{Y_{n+1}}.</p></div>
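<div xmlns="http://www.tei-c.org/ns/1.0"><p>The exchangeability step is the standard conformal argument: if the n + 1 scores are exchangeable, the test score falls below the ⌈(n + 1)(1 − ε)⌉-th smallest calibration score with probability at least 1 − ε. A quick Monte Carlo sketch with made-up i.i.d. scores (which are in particular exchangeable):</p><p>

```python
import math, random

random.seed(0)
n, eps, trials = 1000, 0.1, 2000
k = math.ceil((n + 1) * (1 - eps))   # conformal rank, here 901

covered = 0
for _ in range(trials):
    cal = sorted(random.random() for _ in range(n))   # calibration scores
    test = random.random()                            # test score, same dist.
    threshold = cal[k - 1]                            # k-th smallest score
    covered += (test <= threshold)

coverage = covered / trials
# Marginal coverage should concentrate near ceil((n+1)(1-eps))/(n+1) ~ 0.90.
assert coverage >= 1 - eps - 0.03
```

</p></div>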
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Utility function for medical experiment</head><p>Our results and findings in the medical experiment of Section 5.1 can be reproduced with any other reasonable design of the utility function. The goal of that experiment is not to capture a precise characterization of the difficulties and consequences of medical decision making, but rather to pinpoint the advantages of a risk-averse calibration approach in sensitive tasks such as medical decision making. Of course, in real-world scenarios, a more comprehensive approach is needed to define a principled utility function that captures the interests of all involved parties. That being said, for the sake of proof of concept, we designed a utility matrix using the ChatGPT o1 model by OpenAI. The following is an AI-generated text justifying the proposed utility matrix.</p></div>
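<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the role of the utility matrix concrete, here is a minimal sketch with hypothetical numbers (not the matrix used in the paper): given a prediction set of candidate diagnoses, the risk-averse max-min policy picks the action maximizing the worst-case utility over the set.</p><p>

```python
# Hypothetical utility matrix (rows: actions, cols: labels); illustrative only.
actions = ["treat", "refer", "dismiss"]
u = {"treat":   {"disease": 0.8, "healthy": 0.3},
     "refer":   {"disease": 0.6, "healthy": 0.6},
     "dismiss": {"disease": 0.0, "healthy": 1.0}}

def max_min_action(prediction_set):
    """Risk-averse policy: argmax_a min_{y in C} u(a, y)."""
    return max(actions, key=lambda a: min(u[a][y] for y in prediction_set))

# An uncertain prediction set forces the cautious action...
assert max_min_action({"disease", "healthy"}) == "refer"
# ...while a confident singleton set allows the high-utility action.
assert max_min_action({"healthy"}) == "dismiss"
```

</p></div>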
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Clinical Justification of the Utility Matrix</head></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA. Correspondence to: Shayan Kiyani &lt;shayank@seas.upenn.edu&gt;. Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_2"><p>In this paper, we focus on deterministic action policies.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3"><p>Here, note that since ω(x) is a utility function, its value cannot be larger than the maximum achievable utility; i.e. ω(x) ≤ u_max := max_a max_y u(a, y) for all x ∈ X.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_4"><p>For simplicity, we assume in this section that the maximizer of &#977;(x, s) + &#949;s is unique with probability 1.</p></note>
		</body>
		</text>
</TEI>
