<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Learning Social Welfare Functions</title></titleStmt>
			<publicationStmt>
				<publisher>NeurIPS</publisher>
				<date>10/11/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10548396</idno>
					<idno type="doi"></idno>
					
					<author>Kanad Pardeshi</author><author>Itai Shapira</author><author>Ariel D Procaccia</author><author>Aarti Singh</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Is it possible to understand or imitate a policy maker's rationale by looking at past decisions they made? We formalize this question as the problem of learning social welfare functions belonging to the well-studied family of power mean functions. We focus on two learning tasks; in the first, the input is vectors of utilities of an action (decision or policy) for individuals in a group and their associated social welfare as judged by a policy maker, whereas in the second, the input is pairwise comparisons between the welfares associated with a given pair of utility vectors. We show that power mean functions are learnable with polynomial sample complexity in both cases, even if the comparisons or social welfare information are noisy. Finally, we design practical algorithms for these tasks and evaluate their performance.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Consider a standard decision making setting that includes a set of possible actions (decisions or policies), and a set of individuals who assign utilities to the actions. A social welfare function aggregates the utilities into a single number, providing a measure for the evaluation of actions with respect to the entire group. Utilitarian social welfare, for example, is the sum of utilities, whereas egalitarian social welfare is the minimum utility. Given two actions that induce the utility vectors (3, 0) and (1, 1) for two individuals, the former is preferred when measured by utilitarian social welfare, whereas the latter is preferred according to egalitarian social welfare.</p><p>When competent decision makers adopt policies that affect groups or even entire societies, they may have a social welfare function in mind, but it is typically implicit. Our goal is to learn a social welfare function that is consistent with the decision maker's rationale. This learned social welfare function has at least two compelling applications: first, understanding the decision maker's priorities and ideas of fairness, and second, potentially imitating a successful decision maker's policy choices in future dilemmas or in other domains.</p><p>As a motivating example, consider the thousands of decisions made by public health officials in the United States during the Covid pandemic: opening and closing schools, restaurants, and gyms, requirements for masking and social distancing, lockdown recommendations, and so on. Each decision induces utilities for individuals in the population; closing schools, for instance, provides higher utility to medically vulnerable individuals compared to opening them, but arguably has much lower utility for students and parents. 
Assuming that healthcare officials were acting in the public interest and (approximately) optimizing a social welfare function, which one did they have in mind? Our goal is to answer such questions by learning from example decisions.</p><p>Another example we consider in this paper is that of allocating food resources in a community by a US-based nonprofit to hundreds of recipient organizations. Working with a dataset of utilities of 18 different stakeholders such as donors, volunteers, dispatchers and recipient organizations <ref type="bibr">[11]</ref>, we consider the task of learning the social welfare implicit in the decisions that may be made by the nonprofit.</p><p>Table <ref type="table">1</ref>: A summary of our results regarding the sample complexity of various tasks. Here, ξ = u_max(u_max - u_min) and κ = log(u_max/u_min), with all d individual utilities assumed to be in the range [u_min, u_max]. ρ ∈ [0, 1/2) is the probability of mislabeling for the i.i.d noise model, and τ_max is the maximum temperature of the logistic noise model.</p><p>Social Welfare Information | Loss | Known Weights | Unknown Weights
Cardinal values | ℓ_2 | O(ξ^2) | O(ξ^2 d log d)
Pairwise comparisons | 0-1 | O(log d) | O(d log d)
Pairwise comparisons with i.i.d noise | 0-1 | O(log d / (1-2ρ)^2) | O(d log d / (1-2ρ)^2)
Pairwise comparisons with logistic noise estimation | Logistic | O(τ_max^2 κ^2) | O(τ_max^2 κ^2 d log d)</p><p>In order to formalize this problem, there are two issues we need to address. First, to facilitate sample-efficient learnability, we need to make some structural assumptions on the class of social welfare functions. We focus on the class of weighted power mean functions, which includes the most prominent social welfare functions: the aforementioned utilitarian and egalitarian welfare, as well as Nash welfare (the product of utilities). 
This class is a natural choice, as it is the only class of functions feasible under a set of reasonable social choice axioms such as monotonicity, symmetry, and scale invariance <ref type="bibr">[19,</ref><ref type="bibr">7]</ref>.</p><p>Second, we need to specify the input to our learning problem. There are two natural options, and we explore both: utility vectors coupled with their values under a target social welfare function, or pairwise comparisons between utility vectors. We demonstrate sample complexity bounds for both types of inputs, where the social welfare value or comparisons can be noiseless or corrupted by noise. We note that estimating the utility vector associated with any particular decision or policy is ostensibly challenging, but in fact this has been done in prior work and we have access to relevant data, as we discuss in Section 6.</p><p>Our contributions. Learning weighted power mean functions is a non-standard regression or classification problem due to the complex, highly nonlinear dependence on the power parameter p, which is the parameter of interest. While one can invoke standard hyperparameter selection approaches such as cross-validation to select p from a grid of values, the infinite domain of p does not allow demonstration of a polynomial sample complexity without deriving an appropriate cover. We derive statistical complexity measures such as pseudo-dimension, covering number, VC dimension and Rademacher complexity for this function class, under both cardinal and ordinal observations of the social welfare function. Our sample complexity bounds are summarized in Table <ref type="table">1</ref>. These results may be of interest for other problems where weighted power mean functions are used, such as fairness in federated learning <ref type="bibr">[12]</ref>.</p><p>We highlight some key contributions of this paper. 
We first establish the statistical learnability of popular social welfare functions belonging to the weighted power mean family. We derive a polynomial sample complexity of O(1) for learning using cardinal social welfare values under ℓ_2 loss, and O(log d) (where d denotes the number of individuals) for learning using comparisons under 0-1 loss in the unweighted/known weight setting. The upper bounds are a consequence of the monotonicity of the target functions with p in the cardinal case, and an analysis which reveals that the target functions have O(log d) roots in the ordinal case. As expected, the ℓ_2 loss is also sensitive to the range (u_max - u_min). We also prove matching lower bounds for the ordinal case.</p><p>We also establish a polynomial sample complexity of O(d log d) for both cardinal and ordinal tasks in the setting where the individual weights are unknown. This result is intuitive, as learning an additional d weight parameters incurs a proportional increase in the sample requirement.</p><p>We then analyze the sample complexity for the more practical ordinal task under different noise models (independent and identically distributed, aka i.i.d, and logistic noise) and characterize the dependence of the sample complexity on the amount of noise. For the i.i.d setting, the sample complexity increases with large noise (large ρ) and reduces to that of the noiseless setting when ρ = 0. Unlike the i.i.d setting where ρ is known, for logistic noise we also consider estimation of the noise level τ, and we evaluate the likelihood with respect to the noisy distribution. Since the noise is harder to estimate with increasing τ, the sample complexity increases with τ. 
Also, the likelihood is sensitive to the range of utilities κ.</p><p>Despite the non-convexity of the problem, we demonstrate a simple, practical algorithm for learning weighted power mean functions on the above tasks using simulated data and a dataset of utility vectors for a food resource allocation task from Lee et al. <ref type="bibr">[11]</ref>, and observe good performance over a range of parameters. Additionally, we verify the theoretical scaling of sample complexity through simulations.</p><p>Related work. Conceptually, our work is related to that of Procaccia et al. <ref type="bibr">[17]</ref>. They also study the learnability of decision rules that aggregate individual utilities, but in their case, the individual utilities are represented as rankings over a set of alternatives (rather than cardinal utilities, as in our problem), and the rule to be learned is a voting rule mapping the input rankings of utilities to a winning alternative. They provide sample complexity results with respect to two families of voting rules, namely positional scoring rules and voting trees.</p><p>Basu &amp; Echenique <ref type="bibr">[1]</ref> derive bounds on the VC dimension for additive, Choquet, and max-min expected utility for decision-making under uncertainty, bounding the number of pairwise comparisons needed to falsify a candidate decision rule and establishing learnability or non-learnability for these classes. Note that here the decision rule operates on probability distributions instead of utility vectors, and their results are very different from ours on a technical level; for instance, max-min is not learnable in their setting (infinite VC dimension), whereas it is easily learnable in ours.</p><p>Kalai <ref type="bibr">[10]</ref> studies the learnability of choice functions and establishes PAC guarantees. 
Choice functions are defined with respect to a fixed and finite set of alternatives X, with each sample being a subset of X and the choice over this subset. By contrast, our work involves learning the behavior of a function on an infinite number of actions for which the utilities are known.</p><p>Pellegrini et al. <ref type="bibr">[16]</ref> conduct experiments on learning aggregation functions which are assumed to be a composition of L_p means, observing that they perform favorably in various tasks such as scalar aggregation, set expansion and graph tasks. Our analysis is of a more theoretical nature, establishing the sample-efficient learnability of weighted power mean aggregation functions.</p><p>Melnikov &amp; Hüllermeier <ref type="bibr">[14]</ref> consider learning from actions with feature vectors and their global scores, with local scores for each individual unavailable for learning. They learn both local and global score functions, and consider the ordered weighted averaging operator for aggregating local scores. While we assume that each individual's local score is given, the aggregation function belongs to a richer function family motivated by social choice theory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Setup</head><p>We assume that the decision-making process concerns d individuals. The decision-making setting we consider has each action associated with a positive utility vector u ∈ [u_min, u_max]^d ⊂ R_+^d, which describes the utilities derived by the d individuals.</p><p>We encode the impact of each individual i ∈ [d] on the decision-making process through a weight value w_i ≥ 0 such that Σ_{i=1}^d w_i = 1. These weight values together form a weight vector w ∈ Δ^{d-1}. The weight vector might be a known or unknown quantity. A common instance in which the weight vector is known is when all agents are assumed to have an equal say, in which case w = 1_d/d. For all settings we consider, we provide PAC guarantees for both known weights and unknown weights.</p><p>We assume that the decision-making process provides a cardinal social welfare value to each action. However, this social welfare value can be latent and need not be available to us as data. For the first task, concerned with cardinal decision values, the social welfare values are available and can be used for learning. For the second task, both actions in the pair have a latent social welfare which is not available to us; however, the preferred action in the pair is known to us. We consider learning bounds with the empirical risk minimization (ERM) algorithm for all the losses in this work, with p being learned when the weights are known, and (ŵ, p̂) being learned when the weights are unknown.</p><p>Power Mean. The (weighted) power mean is defined for p ∈ R ∪ {±∞} and u ∈ R_+^d as M(u; w, p) = (Σ_{i=1}^d w_i u_i^p)^{1/p}, extended to p ∈ {0, ±∞} by taking limits. It is sometimes more convenient to use the (natural) log power mean than the power mean. Since Σ_{i=1}^d w_i = 1, in effect we have d variables, w_1, ..., w_{d-1} and p. We refer to the weighted power mean family with known weight w as M_{w,d} = {M(·; w, p) | p ∈ R}. 
If the weight is unknown, the weighted power mean family is denoted by M_d = {M(·; w, p) | (w, p) ∈ Δ^{d-1} × R}.</p><p>The power mean family is a natural representation for social welfare functions. Cousins <ref type="bibr">[7,</ref><ref type="bibr">6]</ref> puts forward a set of axioms under which the set of possible welfare functions is precisely the weighted power mean family. An unweighted version of these functions results in the family of constant elasticity of substitution (CES) welfare functions <ref type="bibr">[8]</ref>, which are widely studied in econometrics.</p><p>To show the generality of this family of functions, we list a few illustrative cases:</p><p>• M(u; w, p = -∞) = min_{i∈[d]} u_i, which corresponds to egalitarian social welfare.</p><p>• M(u; w, p = 0) = Π_{i=1}^d u_i^{w_i}, which corresponds to a weighted version of Nash social welfare.</p><p>• M(u; w, p = 1) = Σ_{i=1}^d w_i u_i, which corresponds to weighted utilitarian welfare.</p><p>• M(u; w, p = ∞) = max_{i∈[d]} u_i, which corresponds to egalitarian social malfare.</p><p>We note that for p = ±∞, the decision utility is independent of w. With w_i = 1/d for all i ∈ [d], we get the conventional interpretations of the welfare notions mentioned above.</p><p>The power mean family has some useful properties. An obvious one is that M(u; w, p) ∈ [u_(1), u_(d)], where u_(1) and u_(d) denote the first and d-th order statistics of u = (u_1, ..., u_d). u_(1) is attained at p = -∞, and u_(d) is attained at p = ∞. A more general observation is the following lemma; we highlight part (c): M(u; w, p) and log M(u; w, p) - log M(v; w, p) are quasilinear w.r.t. w if p is fixed.</p><p>A proof for the above lemma is provided in Appendix A.1. This monotonicity of the power mean in w and p was also noted by Qi et al. <ref type="bibr">[18]</ref>.</p></div>
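To make the definition above concrete, here is a minimal Python sketch (ours, not the authors' code) of the weighted power mean; the function name power_mean and the explicit handling of the limiting cases p = 0 and p = ±∞ are our own choices:

```python
import math

def power_mean(u, w, p):
    """Weighted power mean M(u; w, p) = (sum_i w_i * u_i^p)^(1/p) for positive
    utilities u and weights w on the simplex, with limiting cases handled."""
    assert all(ui > 0 for ui in u) and abs(sum(w) - 1.0) < 1e-9
    if p == float("-inf"):
        return min(u)  # egalitarian social welfare
    if p == float("inf"):
        return max(u)  # egalitarian social malfare
    if p == 0:
        # limit p -> 0: weighted geometric mean (weighted Nash social welfare)
        return math.exp(sum(wi * math.log(ui) for wi, ui in zip(w, u)))
    return sum(wi * ui ** p for wi, ui in zip(w, u)) ** (1.0 / p)
```

With w = (1/2, 1/2), power_mean([3, 1], w, 1) recovers the utilitarian value 2, while p = -∞ gives the egalitarian value 1; the monotonicity in p noted above can be checked numerically.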
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Cardinal Social Welfare</head><p>We first consider the case where we know the cardinal value of the social choice associated with each action. Learning in this setting thus corresponds to regression. Formally, we assume an underlying distribution D over [u_min, u_max]^d × [u_min, u_max] over the utilities and social welfare values. We receive i.i.d samples {(u_i, y_i)}_{i=1}^n ∼ D^n, u_i being the utility vector and y_i ∈ [u_{i(1)}, u_{i(d)}] being the social welfare value associated with action i.</p><p>We consider ℓ_2 loss over M(u_i; w, p) and y_i. The true risk in this case is R(w, p) = E_{(u,y)∼D}[(M(u; w, p) - y)^2].</p><p>To analyze the PAC learnability of this setting, we first provide bounds on the pseudo-dimensions of M_{w,d} and M_d. We begin by noting that M(u; w, p) can also be represented as u_(d) · M(r; w, p), where r ∈ (0, 1]^d, r_i = u_i/u_(d). Since M(r; w, p) ∈ [0, 1], we can find pseudo-dimensions of this function class. We now define the function classes S_{w,d} = {f(u) = M(r; w, p) | p ∈ R} and S_d = {f(u) = M(r; w, p) | (w, p) ∈ Δ^{d-1} × R}. We then have the following bounds on pseudo-dimensions: Lemma 3.1. (a) If w is known, then Pdim(S_{w,d}) = 1. (b) If w is not known, then Pdim(S_d) &lt; 8d(log_2 d + 1). A detailed proof is provided in Appendix A.3. We highlight the fact that p and w are the parameters of the log power mean function family, which calls for the novel bounds provided in this work. These bounds on the pseudo-dimensions can now be used to obtain PAC bounds. Theorem 3.2. 
Given a set of samples {(u_i, y_i)}_{i=1}^n drawn from a distribution D^n, for any δ &gt; 0, the following holds with probability at least 1 - δ with respect to the ℓ_2 loss function: (a) If w is known, then R(w, p̂) - inf_{p∈R} R(w, p) ≤ 16ξ^2 √((log 2 + 2 log n)/n) + c/√n + 6√(log(4/δ)/(2n)). (b) If w is unknown, then R(ŵ, p̂) - inf_{(w,p)∈Δ^{d-1}×R} R(w, p) ≤ 16ξ^2 √((log 2 + 16(d log_2 d + 1) log n)/n) + c/√n + 6√(log(4/δ)/(2n)), where ξ = u_max(u_max - u_min). Proof Sketch. We first use the pseudo-dimensions found above to bound the Rademacher complexity of M_d and M_{w,d} in Lemma A.7. Since M(u_i; w, p) ∈ [u_min, u_max] and y_i ∈ [u_min, u_max], the ℓ_2 loss function in this case has domain [u_min - u_max, u_max - u_min]. It is Lipschitz continuous in this domain with Lipschitz constant 2ξ. Using Lemma A.7 and Talagrand's contraction lemma, we obtain Rademacher complexity bounds for the loss classes.</p><p>These Rademacher complexity bounds are then used to obtain the uniform convergence bounds above.</p><p>These bounds are distribution-free, with the only assumption being that all utilities and social welfare values are in the range [u_min, u_max]. They also imply an O(1) and O(d log d) dependence of sample complexity on d for known and unknown weights respectively. Moreover, we observe the dependence of the upper bound on u_max - u_min for the ℓ_2 loss. We note that when u_max = u_min = u_0, all utilities and social welfare function values are also u_0. In this case, the Rademacher complexity bound is also zero, which is expected.</p><p>Computationally, M(u; w, p) is non-convex in w and p, which means that the ℓ_2 loss is also non-convex. However, we observe from part (c) of the lemma in Section 2 that M(u; w, p) is quasilinear w.r.t. w with fixed p, which makes the ℓ_2 loss function quasi-convex for all (u, y). We use this fact to construct a practical algorithm. 
A detailed explanation of the quasi-convexity of &#8467; 2 loss is provided in Appendix A.1.1.</p><p>A shortcoming of this setting is that decision-makers are required to provide a social welfare value for each action. A more natural setting might be when decision-makers only provide their preferences between actions -potentially just their revealed preferences, i.e., the choices they have made in the past -and we address this case next.</p></div>
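For intuition, ERM for the known-weights cardinal task reduces to a one-dimensional search over p. The snippet below is a simplified illustration under our own assumptions (a finite grid for p and a locally defined power-mean helper restricted to finite p); the theory above covers the continuum of p rather than a grid:

```python
import math

def power_mean(u, w, p):
    # Simplified weighted power mean for finite p (p = 0 handled as the limit).
    if p == 0:
        return math.exp(sum(wi * math.log(ui) for wi, ui in zip(w, u)))
    return sum(wi * ui ** p for wi, ui in zip(w, u)) ** (1.0 / p)

def erm_cardinal(samples, w, p_grid):
    """ERM for the cardinal task with known weights: pick the p minimizing the
    empirical l2 risk (1/n) * sum_i (M(u_i; w, p) - y_i)^2 over a grid."""
    def emp_risk(p):
        return sum((power_mean(u, w, p) - y) ** 2 for u, y in samples) / len(samples)
    return min(p_grid, key=emp_risk)
```

On noiseless data generated from utilitarian welfare (p = 1) with equal weights, this grid search recovers p = 1, since the empirical risk vanishes there and is strictly positive at the other grid points.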
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Pairwise Preference Between Actions</head><p>For this setting, we assume an underlying distribution D over [u_min, u_max]^d × [u_min, u_max]^d × {±1}, where (u_i, v_i) are the utilities for the i-th pair of actions, and y_i is a comparison between their (latent) social choice values. We encode the comparison function as C((u, v); w, p) = sign(log M(u; w, p) - log M(v; w, p)).</p><p>We denote the family of the above functions by C_{w,d} = {C((u, v); w, p) : p ∈ R} when the weights are known, and C_d = {C((u, v); w, p) : (w, p) ∈ Δ^{d-1} × R} when the weights are unknown. We consider learning with 0-1 loss over C((u_i, v_i); w, p) and y_i. The true risk in this case is R(w, p) = P_{((u,v),y)∼D}[C((u, v); w, p) ≠ y].</p><p>To provide convergence guarantees for the above setting, we bound the VC dimension of the comparison-based function classes mentioned above in Lemma 4.1: the VC dimension of C_{w,d} is Θ(log d), and that of C_d is O(d log d).</p><p>The detailed proof of the above lemma is relegated to Appendix A.6. We find the asymptotically tight lower bound for the known weights case rather surprising, as it is a priori unclear that the correct bound should be superconstant and scale with d.</p><p>The finiteness of the VC dimension guarantees PAC learnability, and we get uniform convergence bounds using the VC theorem. Theorem 4.2. Given samples {((u_i, v_i), y_i)}_{i=1}^n ∼ D^n and for 0-1 loss and any δ &gt; 0, with probability at least 1 - δ, (a) If w is known, then R(w, p̂) - inf_{p∈R} R(w, p) = O(√((log d + log(1/δ))/n)). (b) If w is unknown, then R(ŵ, p̂) - inf_{(w,p)} R(w, p) = O(√((d log d + log(1/δ))/n)).</p><p>We note that unlike the bounds on ℓ_2 loss of Theorem 3.2, these bounds on 0-1 loss are independent of the range of utility values and only depend on d. They provide sample complexity bounds which depend on d as O(log d) and O(d log d) for known and unknown weights respectively. Despite these PAC guarantees, empirical risk minimization can be particularly difficult in this case, since the loss function as well as the function class log M(u; w, p) - log M(v; w, p) can be non-convex.</p><p>To illustrate this non-convexity, we plot the value of the above function for two pairs of utility vectors with respect to p in Figure <ref type="figure">6</ref>, with d = 6 and w = 1_d/d. 
However, the quasilinearity of log M (u; w, p) -log M (v; w, p) with fixed p can be used to design efficient algorithms.</p></div>
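The comparison function can be illustrated with a short sketch (our own, with a hypothetical log_power_mean helper restricted to finite p):

```python
import math

def log_power_mean(u, w, p):
    # log M(u; w, p) for finite p; p = 0 is the weighted geometric mean limit.
    if p == 0:
        return sum(wi * math.log(ui) for wi, ui in zip(w, u))
    return math.log(sum(wi * ui ** p for wi, ui in zip(w, u))) / p

def compare(u, v, w, p):
    """C((u, v); w, p): +1 if u is weakly preferred to v under M(.; w, p), else -1."""
    return 1 if log_power_mean(u, w, p) >= log_power_mean(v, w, p) else -1
```

For example, with equal weights the pair u = (3, 1), v = (1.9, 1.9) flips preference between p = 1 and p = -1, mirroring the utilitarian-versus-egalitarian contrast from the introduction.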
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Convergence Bounds Under I.I.D Noise</head><p>Decision making can be especially challenging if two actions are difficult to compare, and the preference data we obtain can potentially be noisy. We first consider each comparison to be mislabeled in an i.i.d. manner with known probability ρ ∈ [0, 1/2). We make use of the framework developed by Natarajan et al. <ref type="bibr">[15]</ref>, and we consider convergence guarantees under 0-1 loss.</p><p>Specifically, the unbiased estimator of ℓ_0-1 is ℓ̃_0-1(t, y) = ((1 - ρ) ℓ_0-1(t, y) - ρ ℓ_0-1(t, -y)) / (1 - 2ρ).</p><p>We conduct ERM with respect to ℓ̃_0-1 to obtain (ŵ, p̂) ∈ Δ^{d-1} × R (only learning p if the weights are known). We observe that ℓ_0-1(t, y) = (1 - ty)/2 is 1/2-Lipschitz in t, for all t, y ∈ {±1}. Using Theorem 3 of Natarajan et al. <ref type="bibr">[15]</ref>, we get the convergence bounds stated in Theorem 4.3.</p><p>A detailed proof of the above theorem is provided in Appendix A.7. We note that although ERM is conducted with respect to ℓ̃_0-1 on the noisy distribution, the risks are defined on the underlying noiseless distribution. This gives sample complexity bounds of O(log d / (1 - 2ρ)^2) and O(d log d / (1 - 2ρ)^2) for the known and unknown weights cases respectively. We note that when ρ = 0, the above bounds reduce to the noiseless bounds in Theorem 4.2. Since the noise level ρ is usually not known to us, it can be estimated using cross-validation as suggested by Natarajan et al. <ref type="bibr">[15]</ref>.</p><p>However, conducting ERM on ℓ̃_0-1 might be prohibitively difficult due to the non-convex nature of the function. An i.i.d noise model might also be inappropriate in certain settings; we next consider a more natural noise model.</p></div>
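The unbiased estimator from Natarajan et al. [15] for symmetric label noise can be written out as follows; the helper names are ours, and the unbiasedness is over the draw of the noisy label:

```python
def loss01(t, y):
    """0-1 loss for prediction t and label y in {-1, +1}: 0 on agreement, 1 otherwise."""
    return (1 - t * y) / 2

def unbiased_loss01(t, y, rho):
    """Unbiased estimator of the 0-1 loss under symmetric i.i.d. label noise with
    known flip probability rho < 1/2: averaging over the noisy label, i.e.
    (1 - rho) * unbiased_loss01(t, y, rho) + rho * unbiased_loss01(t, -y, rho),
    recovers the clean loss loss01(t, y)."""
    return ((1 - rho) * loss01(t, y) - rho * loss01(t, -y)) / (1 - 2 * rho)
```

The unbiasedness follows from the identity ((1-ρ)^2 - ρ^2) / (1 - 2ρ) = 1, with the cross terms cancelling.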
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Pairwise Preference With Logistic Noise</head><p>Intuitively, we expect that two actions would be harder to compare if their social welfare values are closer to each other. We formalize this intuition in the form of a noise model inspired by the BTL noise model <ref type="bibr">[4,</ref><ref type="bibr">13]</ref>. Let w* and p* be the true power mean parameters, and let τ* ∈ [0, τ_max] be a temperature parameter. For an action pair (u, v), we assume that the probability of u being preferred to v is P[u ≻ v] = 1 / (1 + exp(-τ* (log M(u; w*, p*) - log M(v; w*, p*)))). (1)</p><p>We see that a larger difference between the log power means of u and v translates to a higher probability of u being preferred. If u and v lie on the same level set of log M(·; w*, p*), the probability becomes 0.5, which matches the intuition of both actions being equally preferred. We also note the dependence of the probability on τ*: a higher τ* corresponds to more confidence in the preferences, with τ* = 0 meaning indifference for all pairs of actions. The mislabeling probability is also invariant to scaling of u and v.</p><p>Our learning task now becomes estimating w, p and τ given data. We denote the function family in this case by T_{w,d} = {τ (log M(·; w, p) - log M(·; w, p)) | (τ, p) ∈ [0, τ_max] × R} when the weights are known, and T_d = {τ (log M(·; w, p) - log M(·; w, p)) | (τ, w, p) ∈ [0, τ_max] × Δ^{d-1} × R} when the weights are unknown. A natural loss function to consider in this case is negative log likelihood, and we consider PAC learnability with this loss. Using the framework developed in Section 3, we obtain the following PAC bounds:</p><p>Theorem 5.1. Given samples {((u_i, v_i), y_i)}_{i=1}^n ∼ D^n and for negative log likelihood loss, for all δ &gt; 0, with probability at least 1 - δ, (a) If w is known, then the excess risk is O(τ_max κ √((1 + log(1/δ))/n)). (b) If w is unknown, then the excess risk is O(τ_max κ √((d log d + log(1/δ))/n)), where κ = log(u_max/u_min).</p><p>We derive this result in detail in Appendix A.8. 
This gives us sample complexity bounds of O(1) and O(d log d) with respect to d for the known and unknown weights cases respectively, thus establishing PAC learnability. An important distinction between Theorem 4.3 and the above theorem is that Theorem 4.3 bounds risk with respect to 0-1 loss, while the above theorem bounds risk with respect to logistic loss, which is continuous and hence easier to control. Moreover, we estimate the noise level τ in the logistic case along with w and p, whereas Theorem 4.3 is concerned with estimating w and p.</p><p>As with the previous cases, non-convexity in this setting also makes global optimization with respect to w and p (and hence ERM) difficult. We observe that logistic loss is quasilinear in w with fixed p, and this observation can be used to construct an effective algorithm. A detailed explanation of this fact is provided in Appendix A.1.2. </p></div>
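A minimal sketch of the logistic noise model and the resulting negative log likelihood (our own illustration; finite p only, helper names assumed):

```python
import math

def log_power_mean(u, w, p):
    # log M(u; w, p), finite p only; p = 0 is the weighted geometric mean limit.
    if p == 0:
        return sum(wi * math.log(ui) for wi, ui in zip(w, u))
    return math.log(sum(wi * ui ** p for wi, ui in zip(w, u))) / p

def pref_prob(u, v, w, p, tau):
    """P[u preferred to v] = sigma(tau * (log M(u; w, p) - log M(v; w, p)))."""
    z = tau * (log_power_mean(u, w, p) - log_power_mean(v, w, p))
    return 1.0 / (1.0 + math.exp(-z))

def avg_nll(pairs, w, p, tau):
    """Average negative log likelihood of comparisons y in {-1, +1} over action pairs."""
    total = 0.0
    for (u, v), y in pairs:
        q = pref_prob(u, v, w, p, tau)
        total -= math.log(q if y == 1 else 1.0 - q)
    return total / len(pairs)
```

The sketch makes the stated properties easy to check: the probability is 0.5 on a shared level set, collapses to indifference at τ = 0, and is invariant to rescaling u and v (since log M(cu; w, p) = log c + log M(u; w, p)).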
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Empirical Results</head><p>We conduct several simulations on semi-synthetic data to gain additional insight into sample complexity and demonstrate an empirically effective algorithm. The implementation also serves to demonstrate the practicability of our approach, including the availability of individual utility functions.</p><p>Data. The dataset we rely on (which is not publicly available) comes from the work of Lee et al. <ref type="bibr">[11]</ref> with a US-based nonprofit that operates an on-demand donation transportation service supported by volunteers. WeBuildAI is a participatory framework that enables stakeholders, including donors, volunteers, recipient organizations, and nonprofit staff, to collaboratively design algorithmic policies for allocating donations. Donors provide food donations, volunteers transport the donations, recipient organizations receive and distribute the food, and dispatchers (nonprofit staff) manage the allocation and logistics. The "actions" are hundreds of recipient organizations that may receive an incoming donation. As part of this framework, Lee et al. <ref type="bibr">[11]</ref> learned a (verifiably realistic) utility function over the actions for each of 18 stakeholders from the different groups based on 8 features: travel time between donors and recipients, recipient organization size, USDA-defined food access levels in recipient neighborhoods, median household income, poverty rates, the number of weeks since the last donation, the total number of donations received in the last three months, and the type of donation (common or uncommon).</p><p>In our simulations, we use the values of these stakeholder utility functions learned by Lee et al. <ref type="bibr">[11]</ref> as the utility vectors. We fix a p* and weight vector w* to generate the social welfare values M(u; w*, p*). 
We use noisy versions of these social welfare values in the cardinal case, whereas noisy pairwise comparisons between random pairs of utility vectors are used in the ordinal case.</p><p>Algorithm. As noted in previous sections, ℓ_2 and logistic losses are quasiconvex with respect to w for single samples when p is fixed. Although the sum of quasiconvex functions is not guaranteed to be quasiconvex, we empirically observe that gradient descent on the loss function applied to the data can still converge to a minimum whose empirical risk is comparable to that of the true parameters. As our simulations show, this minimum increasingly resembles w* (the real weight) with decreasing noise. Thus, our algorithm consists of performing a grid search on p and conducting gradient descent on w for each p. We provide more details about the algorithm in Appendix B.</p><p>Cardinal case. We consider p* = 2.72 and a random weight w*. We then add Gaussian noise with standard deviation (u_{i(d)} - u_{i(1)}) · ν to each sample, where ν corresponds to the noise level. The Gaussian noise is clipped to stay within [u_{i(1)}, u_{i(d)}]. Finally, we learn p and w using our algorithm, and we present the results in Figure <ref type="figure">1</ref>.</p><p>In Figure <ref type="figure">1a</ref>, we observe that the test loss for learned parameters decreases with decreasing noise and increasing number of samples. We also observe that the test loss for learned parameters closely matches that for the real parameters in Figure <ref type="figure">1a</ref>. In Figure <ref type="figure">1b</ref>, we observe that the KL divergence between the true and learned weights decreases uniformly with decreasing noise and increasing number of samples. This supports the fact that our algorithm is indeed able to find the correct minimum. 
We also plot the trend of the mean learned p in Figure <ref type="figure">1c</ref>, and we observe that the learned p increasingly resembles the real p* with lower noise and a greater number of samples. Plots for train loss and loss on noiseless test data are provided in Appendix C.</p><p>Ordinal case. We consider p* = -1.62 and a random weight w*. We compare each sample in the considered training data with 10 other randomly chosen samples, with the comparisons being noised according to the logistic noise model in Equation <ref type="bibr">(1)</ref>. We then learn w and τ for each p in the chosen grid and then choose the best p. Our results are shown in Figure <ref type="figure">2</ref>.</p><p>In Figure <ref type="figure">2a</ref> we observe that the test loss for learned parameters matches that for the real parameters for small τ* and a large number of training samples. The relative deviation between test losses progressively increases for smaller numbers of samples and smaller τ*. We note that a small τ* corresponds to more noise in the comparisons, which results in higher losses. However, the deviation between the learned loss and the true loss is smaller because we are also estimating the noise parameter, which is easier to estimate for small τ*, since the logistic function has a larger gradient. We observe a uniform decrease in the KL divergence between w and w* for a larger number of samples and smaller τ*, which again points to the effectiveness of the algorithm. We also observe that test accuracy on noiseless data increases with more samples and higher τ*. Interestingly, for τ* = 0.1 and τ* = 1, the test accuracy on noiseless data (Figure <ref type="figure">2c</ref>) is significantly higher than that on (noisy) test data, which is another indicator of effective ERM being conducted by the algorithm. 
In Figure <ref type="figure">4b</ref> in Appendix C, we observe that there is greater variation in the learned p than in the cardinal case. A possible reason is that changes in p result in smaller changes in the loss for negative p than for positive p. This hypothesis is supported by simulations for the ordinal case conducted for p = 1.62, whose results are presented in Figure <ref type="figure">5</ref>. In Figure <ref type="figure">5e</ref>, we observe that the learned p is much more consistent with the real p as τ* decreases.</p><p>We also conduct simulations on fully synthetic data to study the effect of d, and we present the results in Appendix E. We verify the theoretical O(d log d) scaling of error with unknown weights for the ordinal case in Figure <ref type="figure">8</ref>.</p></div>
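This sensitivity hypothesis can be probed numerically. The sketch below (our own, not from the paper) computes the empirical logistic loss as a function of p with w and τ held fixed, so one can inspect how sharply the loss varies on either side of zero:

```python
import numpy as np

def log_power_mean(u, w, p):
    u, w = np.asarray(u, float), np.asarray(w, float)
    return float(np.dot(w, np.log(u))) if p == 0 else float(np.log(np.dot(w, u ** p)) / p)

def logistic_loss_vs_p(pairs, labels, w, tau, p_grid):
    """Empirical logistic loss -y log s(z) - (1 - y) log(1 - s(z)), with z = tau*(log M(u) - log M(v)),
    evaluated on a grid of p values while w and tau are held fixed."""
    y = np.asarray(labels, float)
    losses = []
    for p in p_grid:
        z = np.array([tau * (log_power_mean(u, w, p) - log_power_mean(v, w, p)) for u, v in pairs])
        # numerically stable form of the logistic loss: log(1 + exp(-z)) + (1 - y) * z
        losses.append(float(np.mean(np.logaddexp(0.0, -z) + (1.0 - y) * z)))
    return np.array(losses)
```

Plotting the returned curve for a dataset generated with negative versus positive true p is one way to compare the flatness of the two loss landscapes.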
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion</head><p>Our work has (at least) several limitations, which can inspire future work. First, as seen in Section 6, we are able to gain access to realistic utility vectors, in this case ones based on models that were learned from pairwise comparisons. Utilities are also routinely estimated for other economically motivated algorithms, say, Stackelberg security games <ref type="bibr">[20]</ref>. However, these estimates are of course not completely accurate. It is an interesting direction for future work to extend our results to the setting where the utility vectors need to be estimated, either by an outside expert or using input from the individuals themselves.</p><p>Although our experiments demonstrate convergence of the algorithm to the correct minimum, a rigorous theoretical analysis of the nature of the minima of the ℓ2 and logistic loss functions is still needed and could lead to algorithmic improvements. One issue is that scaling the algorithm to the national scale (d = 10^8, say) can be prohibitively expensive.</p><p>Finally, our work only applies to weighted power mean functions. While we have argued that this family is both expressive and natural, it would be exciting to obtain results for even broader, potentially non-parametric families of social welfare functions.</p><p>The ability to learn social welfare functions can enable us to understand a decision maker's priorities and ideas of fairness, based on past decisions they have made. This has direct societal impact, as these notions can be used both to understand biases and to inform the design of improved fairness metrics. A second potential application is to imitate a successful decision maker's policy choices in future dilemmas or in other domains.
This may pose some ethical questions if the learning model is misspecified; however, the restriction of the function class to weighted power means, which is inspired by natural social choice theory axioms, mitigates this risk.</p><p>A Deferred Proofs</p><p>A.1 Proof of Lemma 2.1</p><p>Proof of Lemma 2.1 Part (a). Let p &gt; 0. Define t_i = u_i/u_max for all i ∈ [n]. Since t_i ≤ 1 for all i ∈ [n] and given that ∑_{i=1}^n w_i = 1, it follows that ∑_{i=1}^n w_i t_i^p ≤ ∑_{i=1}^n w_i = 1. Therefore, log(w_i t_i^p) ≤ 0 for each i. Given p &gt; q &gt; 0, we obtain</p><p>This derivation similarly holds for the case 0 &gt; p &gt; q, demonstrating the monotonicity of log M(u; w, p) with respect to p. The continuity of log M(u; w, p) with respect to p at p = 0 ensures the monotonicity for all p ∈ R. Since log is a strictly increasing function, this guarantees the monotonicity of M(u; w, p).</p></div>
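Part (a) can also be checked numerically. The sketch below (our code, pure Python) verifies on random instances that log M(u; w, p) is non-decreasing in p:

```python
import math, random

def log_power_mean(u, w, p):
    if p == 0:  # weighted geometric mean, the p -> 0 limit
        return sum(wi * math.log(ui) for wi, ui in zip(w, u))
    return math.log(sum(wi * ui ** p for wi, ui in zip(w, u))) / p

def is_monotone_in_p(u, w, p_grid, tol=1e-9):
    """Lemma 2.1(a), numerically: log M(u; w, p) should be non-decreasing along p_grid."""
    vals = [log_power_mean(u, w, p) for p in p_grid]
    return all(b >= a - tol for a, b in zip(vals, vals[1:]))
```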
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proof of Lemma 2.1 Part (b)</head><p>The denominator of the derivative of log M(u; w, p) with respect to w_i is positive, since it is a positive weighted sum of utilities. We aim to show that p^{-1}(u_i^p − u_d^p) &gt; 0 whenever u_i &gt; u_d. If p &gt; 0, then u_i^p − u_d^p &gt; 0, and the product is positive.</p><p>Conversely, if p &lt; 0, then u_i^p − u_d^p &lt; 0, but since p^{-1} is also negative, the product remains positive. Thus, if u_i &gt; u_d, the log-norm increases with w_i. A similar argument shows that the log-norm decreases if u_i &lt; u_d.</p><p>For p = 0, we have</p><p>This indicates that for u_i &gt; u_d, the derivative is positive, implying an increase, and negative for u_i &lt; u_d, implying a decrease. Thus, the function is monotonic for all p ∈ R.</p><p>Since log is a strictly increasing function, this guarantees the monotonicity of M(u; w, p).</p><p>Proof of Lemma 2.1 Part (c). We prove that log M(u; w, p) − log M(v; w, p) is quasilinear. The proof for M(u; w, p) follows by setting v = 1_d.</p><p>As noted in <ref type="bibr">[3]</ref>, the ratio of two linear functions is quasilinear when the denominator is greater than zero. As ⟨w, v^p⟩ &gt; 0 for all w ∈ Δ^{d−1}, we have that</p><p>f(w) = ⟨w, u^p⟩ / ⟨w, v^p⟩</p><p>is a quasilinear function. Moreover, we note that f(w) &gt; 0 for all w ∈ Δ^{d−1}. We note that for x &gt; 0, x^{1/p} is a monotone function for p ∈ R \ {0}. Since a monotone function preserves quasilinearity, M(u; w, p)/M(v; w, p) = f(w)^{1/p} is a quasilinear function. Since g(x) = log(x) is also a monotone function, quasilinearity is preserved, which makes log M(u; w, p) − log M(v; w, p) a quasilinear function as well.</p><p>A.1.1 Quasiconvexity of ℓ2 loss</p><p>Since M(u; w, p) is quasilinear according to Lemma 2.1 (c), f(w) = M(u; w, p) − y is also quasilinear. For w_1, w_2 ∈ Δ^{d−1} and λ ∈ [0, 1], we thus have min(f(w_1), f(w_2)) ≤ f(λw_1 + (1 − λ)w_2) ≤ max(f(w_1), f(w_2)), and hence f(λw_1 + (1 − λ)w_2)² ≤ max(f(w_1)², f(w_2)²). Thus, f(w)² = (M(u; w, p) − y)² is quasiconvex.</p></div>
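The quasiconvexity claim in A.1.1 admits a simple numerical sanity check: a quasiconvex function never exceeds the larger of its endpoint values along any segment. A small sketch (ours, with NumPy):

```python
import numpy as np

def sq_loss(u, w, p, y):
    """l2 loss (M(u; w, p) - y)^2 for p != 0."""
    m = np.dot(w, np.asarray(u, float) ** p) ** (1.0 / p)
    return (m - y) ** 2

def quasiconvex_on_segment(u, p, y, w1, w2, n_pts=101, tol=1e-9):
    """Check f(lam*w1 + (1-lam)*w2) <= max(f(w1), f(w2)) along the whole segment."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    cap = max(sq_loss(u, w1, p, y), sq_loss(u, w2, p, y))
    return all(sq_loss(u, lam * w1 + (1 - lam) * w2, p, y) <= cap + tol
               for lam in np.linspace(0.0, 1.0, n_pts))
```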
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1.2 Quasilinearity of logistic loss</head><p>From Proposition 2.1 (c), we know that log M (u; w, p) -log M (v; w, p) is quasilinear. We consider two cases:</p><p>&#8226; y = 1: Since -log &#963;(x) = log(1 + exp(-x)) is a monotonic function, it preserves quasilinearity.</p><p>&#8226; y = 0: Since -log(1-&#963;(x)) = log(1+exp(-x))+x is a monotonic function, it preserves quasilinearity.</p><p>Using the above two properties, we conclude that -y log &#963;(x) -(1 -y) log(1 -&#963;(x)) is a quasilinear function.</p></div>
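The two monotonicity facts used in the case analysis above are easy to verify numerically (our sketch, pure Python): −log σ(x) is monotonically decreasing and −log(1 − σ(x)) is monotonically increasing.

```python
import math

def neg_log_sigmoid(x):
    """-log sigma(x) = log(1 + exp(-x)), computed stably; monotonically decreasing."""
    return math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))

def neg_log_one_minus_sigmoid(x):
    """-log(1 - sigma(x)) = log(1 + exp(-x)) + x; monotonically increasing."""
    return neg_log_sigmoid(x) + x

def is_monotone(f, lo, hi, n=1000, decreasing=False, tol=1e-12):
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    vals = [f(x) for x in xs]
    return all((b <= a + tol) if decreasing else (b >= a - tol)
               for a, b in zip(vals, vals[1:]))
```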
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 Lemma A.1</head><p>We state an important property: the log power mean can be expressed using another log power mean with individual utilities in a fixed range.</p><p>Let r ∈ [0, e]^d be such that r_i = exp(log(u_i/u_(1)) / log(u_(d)/u_(1))) for each i, and let q = p log(u_(d)/u_(1)). We then have log M(u; w, p) = log u_(1) + log(u_(d)/u_(1)) · log M(r; w, q).</p><p>We differentiate between two representations of Rademacher complexity:</p><p>It is clear that R(F) ≤ R_abs(F). We now list a few results related to Pollard's pseudo-dimension. Definition A.2. (Pseudo-shattering) Let H be a set of real-valued functions from input space X. We say C = (x_1, . . . , x_m) is pseudo-shattered if there exists a vector r = (r_1, . . . , r_m) such that for all b ∈ {±1}^m, there exists h ∈ H with sign(h(x_i) − r_i) = b_i for all i ∈ [m].</p><p>Definition A.3. The pseudo-dimension Pdim(H) is the cardinality of the largest set pseudo-shattered by H.</p><p>The following lemma connects pseudo-dimensions to VC dimensions: Lemma A.4.</p><p>The following lemma bounds the Rademacher complexity using the pseudo-dimension and covering numbers.</p><p>Lemma A.5.</p><p>We also have the following covering number bound for the Rademacher complexity:</p><p>We now turn to bounding the complexity for the unknown and known weights cases:</p><p>Proof of Lemma 3.1 Part (a). The function class is:</p><p>Moreover, from Lemma 2.1, it follows that for a fixed u ∈ R^d, M(u; w, p) is a non-decreasing function with respect to p. Consequently, for any y ∈ (u_min, u_max), there exists a p* ∈ R ∪ {±∞} such that M(u; w, p) &lt; y for all p &lt; p*, and M(u; w, p) ≥ y for all p ≥ p*. This implies that B_M(u, y) = sign(M(u; w, p) − y) changes its sign exactly once as p increases.</p><p>We note that for B_M(u, y), one point can be shattered (by choosing p &lt; p* and p &gt; p*). However, for two points u and v, the number of times a sign change occurs with increasing p for either u or v is at most twice, meaning that only 3 labelings can be achieved.
Thus, 2 points cannot be shattered.</p><p>Proof of Lemma 3.1 Part (b). The function class is:</p><p>which, we observe, is exactly the expression in the noiseless comparison-based setup for the unknown weights case. We show in Lemma 4.1 (b) that the VC dimension for this expression is upper bounded by 8(⌈d log₂ d⌉ + 1). Thus, our result is proved.</p><p>A.4 Lemma A.7: Rademacher complexity bound for the cardinal case</p><p>Lemma A.7.</p><p>(a) If w is known, then</p><p>where c &gt; 0 is a constant.</p><p>Proof. We prove the result for unknown weights; the result for known weights follows by replacing the pseudo-dimension bound of S_d by that of S_{w,d} from Lemma 3.1. Let d_p denote the pseudo-dimension for the unknown weights case.</p><p>From Lemmas A.5 and A.6, and since log M(r; w, q) ∈ [0, 1], we have</p><p>We thus have</p><p>which gives us the required bound.</p><p>We observe that the above lemma provides O(√(log(n)/n)) and O(√(d log(d) log(n)/n)) bounds on the Rademacher complexity for known and unknown weights, respectively. An important aspect of the above bounds is their dependence on u_max. Intuitively, this means that the richness of the function class increases as the maximum possible utility value increases.</p><p>A.5 Proof of Theorem 3.2</p><p>Proof. We prove the result for the unknown weights case; the result for known weights follows a similar process. For the ℓ2 loss, our function class is</p><p>and we have y − M(u; w, p) ∈ [u_min − u_max, u_max − u_min]. Over a bounded range [−γ, γ], ℓ2(t, y) = (t − y)² is 2γ-Lipschitz continuous w.r.t. t. Thus, using Talagrand's contraction lemma and Lemma A.7, we have</p><p>We then use the uniform convergence bounds for Rademacher complexity to get</p><p>Replacing R(M_d) from Lemma A.7 provides us with the required bounds.</p><p>A.6 Proof of Lemma 4.1</p><p>First, we state a lemma from Jameson <ref type="bibr">[9]</ref>:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Consider the function f</head><p>Applying Lemma A.8, if w_i = w for all i, the sequence {A_j}, consisting of sums of w or −w, can have at most d − 1 sign changes. A sign change at index k implies A_{k−1} = 0, and the next sign change cannot occur before index k + 2. Therefore, f(p) has at most d zeros in this case. In the general case, where w_i ≠ w_j for some i ≠ j, a sign change in {A_j} can occur at any index except the first and the last. Thus, f(p) can have at most 2d − 1 roots, as sign changes are possible at all intermediate indices. We conclude that M^p(u; w, p) − M^p(v; w, p), defined over R ∪ {±∞}, can change sign as a function of p at most d − 1 times if w_i = 1/d, and up to 2d − 1 times in the general case.</p><p>Lemma A.9. Let r, q : R → R be two polynomials such that c := r(x) − q(x) is a constant for all x ∈ R, and the sets of roots {x_1, . . . , x_d} and {y_1, . . . , y_d} of r and q respectively are disjoint, positive, and of size d. Then, for k = 0, 1, . . . , d − 1, the k-th power sums of the roots x_1, . . . , x_d and y_1, . . . , y_d are equal, i.e.:</p><p>Proof. Given that r − q = c, both polynomials have the same non-constant coefficients. According to Vieta's formulas, the k-th elementary symmetric polynomial of x, e_k(x), is the sum of all products of k distinct x_i's, and similarly e_k(y) for y. Since r − q = c, the elementary symmetric polynomials derived from the roots of r and q are equal: e_k(x) = e_k(y).</p><p>Newton's identities relate the elementary symmetric polynomials and the power sums as follows:</p><p>where p_i(x) is the i-th power sum of the roots. Given that e_k(x) = e_k(y), we can equate the right-hand sides of the above identities to obtain the power sums p_i(x) and p_i(y).
This yields p_k(x) = p_k(y) for each k, due to the recursive nature of Newton's identities and the fact that the elementary symmetric polynomials of x and y are equal for all k ≤ d.</p><p>Hence, p_k(x) = p_k(y) for all k = 0, . . . , d − 1, which concludes the proof.</p><p>Given f(p) as defined above, Lemma A.9 implies that there exist disjoint u, v ∈ R^d such that M^p(u; w, p) = M^p(v; w, p) for d − 1 unique values of p and for w_i = 1/d. Moreover, suppose that for a set</p><p>Then, for any λ &gt; 0, there exist</p><p>Lemma A.10 (Jameson <ref type="bibr">[9]</ref>, Theorem 3.4). For any k &lt; d and p</p><p>We now proceed to the proof of Lemma 4.1, which bounds the VC dimensions of the function classes C_{w,d} and C_d.</p><p>Proof of Lemma 4.1 Part (a). As there are at most 2d − 1 roots of f(p), there can be at most 2d − 1 sign changes as p varies from −∞ to ∞. Consequently, the hypothesis class defined by all p (denoted as M_{w,d}) is a subset of the hypothesis class that consists of at most 2d − 1 sign changes on the real line. This larger hypothesis class is denoted by H_d, and we have VC(M_{w,d}) ≤ VC(H_d).</p><p>Let us consider m samples {u_i, v_i}_{i=1}^m. For each sample, sign changes occur at most 2d − 1 times, and hence the total number of changes in labeling over the entire real line is bounded by (2d − 1)m (as p changes, each change in labeling corresponds to a change in sign for at least one of the samples). This implies that the total number of possible labelings is at most (2d − 1)m + 1.</p><p>If the set of m samples is shattered, the upper bound derived above should be at least as large as the total number of labelings possible. We thus have:</p><p>We can show that m = 2(⌈log₂ d⌉ + 1) points cannot be shattered.</p><p>We now bound the VC dimension for the unknown weights case. Consider p ≠ 0. In this case, a hypothesis C((u, v); w, p) can be expressed as sign(⟨w, sign(p)(u^p − v^p)⟩),</p><p>where u^p = (u_1^p, . . . , u_d^p)^T and v^p is defined analogously.
Thus, for a fixed p, the set of viable w's spans a halfspace. We note that each component of sign(p)(u^p − v^p) is continuous, which means that ⟨w, sign(p)(u^p − v^p)⟩ is a continuous function in w and p.</p><p>For n &gt; d samples {((u_i, v_i), y_i)}_{i=1}^n, we define h_i(p) = sign(p)(u_i^p − v_i^p), i ∈ [n], p ≠ 0. For a fixed p, we note that the set of possible labelings for w ∈ Δ^{d−1} is a subset of the set of possible labelings for w ∈ R^d, which in turn is the set of labelings generated by n hyperplanes. Since this problem has VC dimension d, the number of possible labelings for a fixed p is upper bounded by (n + 1)^d. Let B(p) denote the set of possible labelings for hyperplanes defined by {h_i(p)}_{i=1}^n for a particular p.</p><p>Lemma A.11. Let p_1 and p_2 have the same sign, with a labeling ℓ ∈ {±1}^n such that ℓ ∉ B(p_1) but ℓ ∈ B(p_2). Then, there is a p ∈ [p_1, p_2] such that there is a set of d linearly dependent vectors h_(1)(p), . . . , h_(d)(p).</p><p>Proof. Let ℓ be the labeling which is in B(p_2) but not in B(p_1). Since this labeling is not in B(p_1), for each w there is some hyperplane h_i(p_1) such that ℓ_i⟨w, h_i(p_1)⟩ &lt; 0.</p><p>Let B ⊂ R^d denote the unit hypersphere around the origin. Since the labelings are invariant to the scale of w, the set of possible labelings for w ∈ R^d is exactly the set of possible labelings for w ∈ B. Consider the quantity m(p) = max_{w∈B} min_i ℓ_i⟨w, h_i(p)⟩. We observe that if m(p) &lt; 0, then for each w there is some i ∈ [n] such that ℓ_i⟨w, h_i(p)⟩ &lt; 0, i.e., the labeling is not attained at any point. On the other hand, if m(p) ≥ 0, there is some w such that the labeling is attained at w.
Since ℓ_i⟨w, h_i(p)⟩ is a continuous function in w and p for all i ∈ [n], min_i ℓ_i⟨w, h_i(p)⟩ is also a continuous function in w and p. Thus, m(p) is also a continuous function in p.</p><p>Using this fact and the intermediate value theorem, there should be some p ∈ [p_1, p_2] such that m(p) = 0. Let w* ∈ B be a vector at which m(p) = 0 is attained. We now show that at this p, at least d of the n vectors {h_i(p)}_{i=1}^n are linearly dependent. Suppose this were not the case, i.e., any set of d vectors in the set is linearly independent. This means that at most d − 1 of the vectors lie on the hyperplane {x : ⟨w*, x⟩ = 0}. Let h_(1)(p), . . . , h_(k)(p) denote these vectors, with k ≤ d − 1. Since m(p) = 0, there should be at least one such vector. Let H ∈ R^{k×d} be the matrix with these vectors as the rows.</p><p>If these k vectors are linearly dependent, we can add any of the remaining n − k vectors to get a set of d linearly dependent vectors. Let us consider the case where they are not linearly dependent. Consider {x : Hx = 1_k}. This is an underdetermined set of linear equations, and the set should be non-empty (because of the linear independence of the k vectors). Let x_0 be one of the vectors in this set.</p><p>Let t &gt; max_{i∈[n]} (−⟨w*, h_i(p)⟩/⟨x_0, h_i(p)⟩). Then w* + tx_0 is a point such that ⟨w* + tx_0, h_i(p)⟩ &gt; 0 for all i ∈ [n]. This means that m(p) &gt; 0, which is a contradiction. Intuitively, this means that if any d vectors in {h_i(p)}_{i=1}^n are linearly independent, then m(p) &gt; 0. Thus, there should be a set of d linearly dependent vectors h_(1)(p), . . . , h_(d)(p). From the above lemma, we observe that any change in the set of labelings is accompanied by a p which gives d linearly dependent vectors.</p><p>Proof of Lemma 4.1 Part (b).
Using the lemma above, to bound the number of possible labelings, we first bound the number of p's such that there are d linearly dependent vectors. Consider a set of d vectors h_1(p), . . . , h_d(p). As the vectors are linearly dependent, the determinant of the matrix constructed using these vectors should be zero. We should thus have det[ sign(p)(u_{ij}^p − v_{ij}^p) ]_{i,j=1}^{d} = 0. Upon expanding the determinant, we get an equation of the form ∑_{i=1}^m a_i u_i^p, which has 2^d · d! terms. From an earlier lemma, we know that this equation should have at most 2^d · d! − 1 roots. Upon adding the original configuration, we get 2^d · d! possible configurations. The choice of the d vectors can be made in (n choose d) ways, and hence we have a bound on the possible changes as 2^d d! (n choose d).</p><p>In the worst case, we assume that all the labelings are changed, and we thus get an upper bound on the changes as</p><p>In the beginning of the proof, we had carefully set aside p = 0. We now observe that p = 0 is a root of the above system of equations. Thus, we are implicitly considering any possible changes at p = 0 as well.</p><p>We can show that n = 8(⌈d log₂ d⌉ + 1) points cannot be shattered.</p><p>We now show that for every d ∈ N, the inequality 2^{d−1} n^{2d} ≤ 2^n holds, i.e., n − d − 2d log₂ n + 1 &gt; 0 for n = 8(⌈d log₂ d⌉ + 1). For d ∈ {1, 2}, this statement can be verified directly. Therefore, it suffices to show that f(x) := 8(x log₂ x + 1) − 2x log₂[8(x log₂ x + 1)] − x + 1 &gt; 0 for any x ≥ 3.</p><p>Let g(x) := 4x log₂ x − 9x + 9, so that f(x) &gt; g(x) for x ≥ 3. We note that g(3) = 12 log₂ 3 − 18 &gt; 1 &gt; 0. Moreover, g′(x) = 4 log₂ x + 4/log 2 − 9, and we note that g′(2) = 4/log 2 − 5 &gt; 0.77 &gt; 0, with g′(x) being an increasing function. Thus, f(x) &gt; g(x) &gt; 0 for all x ≥ 3.
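The inequality at the heart of this counting argument can be double-checked directly (our sketch; it verifies 2^(d−1) n^(2d) ≤ 2^n at n = 8(⌈d log₂ d⌉ + 1), both in exponent form and with exact integers for small d):

```python
import math

def n_bound(d):
    """n = 8(ceil(d log2 d) + 1); for d = 1 the ceiling term is 0, giving n = 8."""
    return 8 * (math.ceil(d * math.log2(d)) + 1)

def shattering_bound_holds(d):
    """Exponent form of 2^(d-1) * n^(2d) <= 2^n: check n - d - 2d log2(n) + 1 >= 0."""
    n = n_bound(d)
    return n - d - 2 * d * math.log2(n) + 1 >= 0
```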
This means that f(x) &gt; 0, which proves our bound. This implies that the VC dimension is bounded above by 8(⌈d log₂ d⌉ + 1).</p><p>For the lower bound, consider an ordering σ_1, . . . , σ_{2^m} of {±1}^m, m ≤ d, in which consecutive vectors differ in exactly one bit (for the existence of such an ordering, see <ref type="bibr">[2]</ref>). For i &lt; 2^m, denote by s_i ∈ [m] the bit position in which σ_i and σ_{i+1} differ. By using Lemma A.10 d times, there exists a sample {u_i, v_i}_{i=1}^m and p_1 &lt; . . . &lt; p_{2^m−1}, where p_i satisfies ‖u_j‖_{p_i} = ‖v_j‖_{p_i} if s_i = j. Furthermore, define p_0 = −∞, and note that each interval (p_i, p_{i+1}) for 0 ≤ i &lt; 2^m corresponds to a unique combination of labels over {u_i, v_i}_{i=1}^m. When the weights are unknown, we observe that p = 1 is similar to the case of linear classification with the constraint of positive weights and no bias term. This is because C((u, v); w, 1) = sign(log M(u; w, 1) − log M(v; w, 1)) = sign(M(u; w, 1) − M(v; w, 1)) (log is increasing).</p><p>Since linear classification with no bias term has a VC dimension of d − 1, this is a lower bound for the VC dimension of C_{w,d}.</p><p>A.7 Proof of Theorem 4.3</p><p>Proof. We prove the result for unknown weights, with the known weights result following similar steps. We consider the function class C_d as in Section 4, with the ℓ_{0-1} loss being ℓ_{0-1}(t, y) = (1 − ty)/2 for t, y ∈ {±1}. We observe that ℓ_{0-1} is 1/2-Lipschitz w.r.t. t. Thus, by applying Theorem 3 of Natarajan et al. <ref type="bibr">[15]</ref>, we observe that, w.r.t. ℓ_{0-1} on the noiseless data distribution,</p><p>where</p><p>Here, ρ_{+1} and ρ_{−1} are defined as the probabilities of mislabeling true positive and true negative examples, which in our case are the same value, ρ.</p><p>Proof.
We prove the result for unknown weights, with the case for known weights following similar steps.</p><p>We first establish a bound on the Rademacher complexity of T_{w,d} = {τ(log M(u; w, p) − log M(v; w, p)) | τ, p}.</p><p>Let log M(u_i; w, p) = log u_{i(1)} + log(u_{i(d)}/u_{i(1)}) log M(r_i; w, q), and let log M(v_i; w, p) be expressed analogously.</p><p>We now use the bound on R_abs(S_d) from Appendix A.4 to get the following bound on R(T_{w,d}):</p><p>We then use the uniform convergence bounds obtained using Rademacher complexity to obtain the following PAC bound:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Algorithm</head><p>Our algorithm can be broken into two nested steps. The first step consists of choosing p, and the second step involves conducting gradient descent on w (and possibly τ) to obtain their empirically optimal values, ŵ and p̂. In our experiments we choose p using grid search. However, optimization over p can also be done using other methods like simulated annealing. We minimize the ℓ2 loss in the cardinal case with the weighted power mean and the logistic loss in the ordinal case with the log weighted power mean. The algorithm's pseudocode is presented in Algorithm 1.</p></div>
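A compact version of the two nested steps can be sketched as follows (our code, with NumPy, for the cardinal ℓ2 case; to keep the sketch short we parameterize w through a softmax instead of projecting onto the simplex as Algorithm 1 does, so this illustrates the nested structure rather than reproducing the paper's exact procedure):

```python
import numpy as np

def log_power_mean(U, w, p):
    """Row-wise log M(u; w, p) for a matrix U (n x d) of utility vectors."""
    return (np.log(U) @ w) if p == 0 else np.log((U ** p) @ w) / p

def fit_power_mean(U, y, p_grid, n_steps=3000, lr=0.2):
    """Outer grid search over p; inner gradient descent on w = softmax(theta) for the l2 loss."""
    n, d = U.shape
    best_loss, best_p, best_w = np.inf, None, None
    for p in p_grid:
        theta = np.zeros(d)
        for _ in range(n_steps):
            w = np.exp(theta) / np.exp(theta).sum()
            M = np.exp(log_power_mean(U, w, p))
            if p == 0:
                dM_dw = M[:, None] * np.log(U)
            else:
                # dM/dw_j = M * u_j^p / (p * sum_k w_k u_k^p)
                S = (U ** p) @ w
                dM_dw = M[:, None] * (U ** p) / (p * S)[:, None]
            grad_w = (2.0 / n) * ((M - y)[:, None] * dM_dw).sum(axis=0)
            # chain rule through the softmax: J = diag(w) - w w^T
            grad_theta = w * (grad_w - np.dot(w, grad_w))
            theta -= lr * grad_theta
        w = np.exp(theta) / np.exp(theta).sum()
        loss = float(np.mean((np.exp(log_power_mean(U, w, p)) - y) ** 2))
        if loss < best_loss:
            best_loss, best_p, best_w = loss, p, w
    return best_loss, best_p, best_w
```

On noiseless data generated with p = 1 and w = (0.7, 0.3), the grid search recovers p = 1 and weights close to the truth.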
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 ERM algorithm for weighted power mean-based optimization</head><p>Require:</p><p>Note that for the ordinal case, we would optimize over τ along with w. For our experiments, we set p_lower = −3.5 and p_upper = 3.5. We use a grid resolution of ϵ = 0.1. Since the function is not convex, we use several tricks to ensure quick convergence:</p><p>• While we use Algorithm 1 from <ref type="bibr">[5]</ref> for projection onto the simplex, it can potentially be time-consuming. Thus, we project the gradient ∇_w ℓ itself onto the unit simplex and use it for gradient descent, with the simplex projection algorithm being used only when some weights become too small or negative. • To prevent the algorithm from taking excessively large steps, we use the learning rate to clip the norm of the gradient. More specifically, if g_t is the gradient and λ is the learning rate, we use the update</p><p>• If the optimal value has not improved in a certain number of iterations, the algorithm may be oscillating above the minimum. We thus halve the learning rate to encourage better convergence. • If the learning rate becomes too small, the steps taken would be too small to change the loss significantly. Thus, we terminate the algorithm. We also terminate the algorithm if the range of the past few losses is too small. • We also conduct gradient descent in parallel starting from d + 1 points. The d + 1 points correspond to points close to the vertices of the simplex (corresponding to almost one-hot vectors) and the centroid of the simplex. This was done since convergence was observed to be slow for certain weights. At each step, v_best is updated according to the point giving the minimum loss.</p><p>We ran the experiments on an NVIDIA RTX A5000 GPU.
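The simplex projection mentioned in the first bullet can be implemented with a standard O(d log d) sort-based routine; the sketch below is ours (the cited Algorithm 1 of [5] may differ in details) and also shows a gradient step with the norm clipped to the learning rate, one plausible reading of the clipping trick described above:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex {w : w >= 0, sum(w) = 1}."""
    v = np.asarray(v, float)
    u = np.sort(v)[::-1]                       # sort in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, v.size + 1)
    rho = np.nonzero(u + (1.0 - css) / ks > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def clipped_step(w, g, lr):
    """Gradient step with ||g|| clipped to the learning rate, then projected back to the simplex."""
    norm = float(np.linalg.norm(g))
    if norm > lr:
        g = g * (lr / norm)
    return project_simplex(w - lr * g)
```

For example, projecting (0.2, −1.0, 3.0) yields the vertex (0, 0, 1), and points already on the simplex are returned unchanged.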
The algorithm with the above settings takes about 30 minutes to check for all 71 values of p.</p><p>C Semi-synthetic Experiments: Further Information</p><p>C.2 Ordinal Case: More Results</p><p>[Figure 4: train loss, learned p, test accuracy, and noiseless test loss against the number of samples (500 to 8000), for noise levels 0.1, 1.0, 10.0, 40.0, and 100.0.]</p><p>C.3 Ordinal Case: Positive p</p><p>[Figure 5: test loss, KL(w*‖w), noiseless test accuracy, train loss, learned p, test accuracy, and noiseless test loss against the number of samples (500 to 8000), for τ* ∈ {0.1, 1.0, 10.0, 40.0, 100.0}.]</p><p>D Additional Plots</p><p>E Simulations</p><p>Figure <ref type="figure">6</ref>: An example showing the non-convexity of log M(u, w, p) − log M(v, w, p). We see that the function has five roots for (u, v), but is translated downwards for (u, v′) and has only three roots in this case. If the correct label is 1 for both pairs, then p should be greater than 6; however, gradient-based optimization can stop between 3 and 4, which is a local optimum and does not give correct labels to both points.</p><p>The utility of each action for each individual i is assumed to have a scaled and translated beta distribution over [u_min, u_max], with the parameters (α_i, β_i) of the beta distribution being different for each i.
The utilities for each action are drawn independently for each individual to construct a utility vector. The underlying weight vector is sampled uniformly from Δ^{d−1}.</p><p>To learn p (and w if needed), we first assume p to be in a fixed range, which in this case is [−10, 10].</p><p>We first conduct a random sampling stage, in which N_random instances of p (and w) are sampled uniformly at random. At the end of this stage, we pick the set of parameters giving the lowest training loss, and then conduct gradient descent for N_grad steps. We observe that this simple two-stage method is able to provide good results for the range of values of d we consider. Each setting is run three times to obtain error bounds on the empirical results.</p><p>For the unknown weights case, we observe that we are sampling in d dimensions. As d becomes larger, we increasingly suffer from the curse of dimensionality: N_random would have to grow exponentially with d to ensure that we are sampling at the same density across different d. This makes sampling at the same density prohibitively expensive for larger dimensions. As a compromise, we increase N_random linearly with d.</p></div>
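The data-generation step described above can be sketched as follows (our code, with NumPy; the parameter names are illustrative):

```python
import numpy as np

def sample_instance(n_samples, d, alphas, betas, u_min, u_max, rng):
    """Utilities: independent scaled and translated Beta(alpha_i, beta_i) draws per individual i;
    the true weight vector is uniform on the simplex (a symmetric Dirichlet(1) draw)."""
    U = u_min + (u_max - u_min) * rng.beta(alphas, betas, size=(n_samples, d))
    w = rng.dirichlet(np.ones(d))
    return U, w
```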
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.1 Cardinal Values</head><p>For cardinal values, we further add Gaussian noise to each y i with standard deviation (u (d) -u (1) )/10 and clamp the values between [u (1) , u (d) ]. We conduct experiments for both known and unknown weights by setting p = -2. Figure <ref type="figure">7a</ref> (known weights) and Figure <ref type="figure">7b</ref> (unknown weights) show the estimated test loss on noiseless test data generated using the true parameters.</p><p>We observe that there is relatively little change in the difference of test losses for the case of known weights as n increases. On the other hand, there is greater decrease with increasing n for higher d when the weights are also being learned. The estimated test loss also increases with d, with the trend being stronger for the case of unknown weights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2 Logistic Noise</head><p>For logistic noise, we generate pairs of utility vectors with p = 0.9 and a w obtained through random sampling, and then mislabel each instance according to Equation (1) with τ* = 10. Since we also have to learn τ, we set τ_max = 50, a sufficiently high value, and sample it uniformly at random along with p (and w). For known weights, we observe that accuracy increases with n, and mean accuracy stays high (&gt; 93%) across d. There is limited distinction between the curves corresponding to different values of d.</p></div>
		</body>
		</text>
</TEI>
