<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Emergent specialization from participation dynamics and multi-learnerretraining</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 27th International Conference on Artificial Intelligence and Statistics</publisher>
				<date>05/02/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10547198</idno>
					<idno type="doi"></idno>
					
					<author>Sarah Dean</author><author>Mihaela Curmei</author><author>Lillian J Ratliff</author><author>Jamie Morgenstern</author><author>Maryam Fazel</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Numerous online services are data-driven: the behavior of users affects the system’s parameters,and the system’s parameters affect the users’ experience of the service, which in turn affects the wayusers may interact with the system. For example,people may choose to use a service only for tasksthat already works well, or they may choose toswitch to a different service. These adaptationsinfluence the ability of a system to learn abouta population of users and tasks in order to improve its performance broadly. In this work, weanalyze a class of such dynamics—where usersallocate their participation amongst services toreduce the individual risk they experience, andservices update their model parameters to reducethe service’s risk on their current user population. We refer to these dynamics as risk-reducing,which cover a broad class of common model updates including gradient descent and multiplicative weights. For this general class of dynamics,we show that asymptotically stable equilibria arealways segmented, with sub-populations allocatedto a single learner. Under mild assumptions, theutilitarian social optimum is a stable equilibrium.In contrast to previous work, which shows thatrepeated risk minimization can result in representation disparity and high overall loss with a single learner (Hashimoto et al., 2018; Miller et al.,2021), we find that repeated myopic updates withmultiple learners lead to better outcomes. We illustrate the phenomena via a simulated exampleinitialized from real data.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Many online platforms, including social media networks, personalized recommendation engines, and advertising auction systems, collect user data and make incremental adjustments to the models they use to personalize content. These continuous updates are motivated by many factors, though large amongst them is the fact that the systems operate in non-stationary environments, where the preferences of their users change as the system operates. Changes in user preferences might occur exogeneously of service settings (e.g., global events might spur interest in new topics) or endogeneously (e.g., increasing the ranking of certain content on a platform might lead the content to "go viral"). The fact that user behavior might depend on service settings can take on many forms: people may learn to ignore or avoid clicking on advertisements; they may choose to use the service only for tasks at which it already works well; or they may choose to switch to a different service if they have a better experience with the second service. The latter two examples of adaptation, where users might opt for services that already suit their needs, affect the system's capacity to learn about its user base and improve its overall performance.</p><p>In this work, we study a particular form of endogenously shifting distributions over multiple rounds, in contexts where individuals prefer to use services whose predictions are more accurate for them. Much of the existing work on endogeneous distribution shift focuses on users who modify their features to achieve desired outcomes, as in strategic classification <ref type="bibr">(Hardt et al., 2016)</ref> and related problems <ref type="bibr">(Perdomo et al., 2020;</ref><ref type="bibr">Miller et al., 2021)</ref>. While important, this model of data manipulation does not capture the most straightforward way that individuals express their preferences in a market: by choosing amongst alternative providers. In fact, recent work has shown that in the presence of a choice of participation between competing providers, individuals do not have an incentive to perform costly data manipulations <ref type="bibr">(Hardt et al., 2022</ref>).</p><p>Consider as an example a social media platform. If the platform recommends content that does not appeal to the tastes of younger generations, these users will spend a smaller frac-tion of their time on that platform. This results in positive (i.e., self-reinforcing) feedback loop, where a services's poor performance on young customers dissuades them from using the service, leading to less data and diminishing weight placed on making better predictions for young customers in the future. Within a single service, these effects may lead to representation disparity <ref type="bibr">(Hashimoto et al., 2018)</ref>.</p><p>However, in a broader ecosystem, individuals can choose amongst services. If a new social media platform can predict the tastes of younger users more accurately, the younger users may spend more of their time on the new service, and correspondingly less on an existing platform. The new platform will then receive more data and improve its performance on young customers, while the old platform's predictions may deteriorate, reinforcing their exit. Similarly, in the context of Large Language Models (LLM), if one LLM performs particularly well on creative tasks and another on answering homework questions, the distribution of prompts each receives may shift towards their existing expertise. Such feedback loops can also arise in settings such as music recommendation or healthcare, where demographic and socio-economic factors explain some of the emerging specialization (see examples in Appendix A).</p><p>In this paper, we study the dynamics of populations apportioning themselves amongst services, and services that choose predictors based on their observed user population. Our first contribution is to introduce and formalize this general setting. In Section 3 we introduce risk reducing populations and services who choose their actions myopically, incrementally improving their utility based on current conditions. Our second contribution is to present a complete characterization of stable fixed points for this general class of dynamics in Section 4. By drawing a connection between the dynamics and the total risk, our third contribution is to characterize the implications of this dynamic in terms of a utilitarian notion of social welfare, and argue that increasing the number of available services leads to better outcomes in terms of accurate predictions and user experience. In Section 5 we illustrate our theory with simulated experiments and conclude with a discussion of future work in Section 6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>The study of equilibria in the presence of utility optimizing agents has classical roots in game theory, and optimization over decision-dependent probabilities is classically studied by stochastic optimization and control (e.g., the review article by <ref type="bibr">Hellemo et al. (2018)</ref>); we narrow our focus to the most relevant literature on this as it arises in machine learning systems.</p><p>Endogenous Distribution Shifts. In the study of machine learning systems, a large body of literature studies exogenous distribution shifts such as covariate, label, or concept drift <ref type="bibr">(Qui&#241;onero-Candela et al., 2008)</ref>. A more recent trend is to study shifts in the underlying data distribution due to endogenous reactions, for example due to strategic behavior exhibited by a user population. The work of <ref type="bibr">Perdomo et al. (2020)</ref> introduces performative prediction as a model capturing user reaction via endogenous distribution shifts. This work models a single decision-maker facing a risk minimization problem subject to an underlying decision-dependent data distribution. Following its introduction, several relevant solution concepts have been explored and algorithms for achieving them proposed <ref type="bibr">(Izzo et al., 2021;</ref><ref type="bibr">Drusvyatskiy and Xiao, 2020;</ref><ref type="bibr">Mendler-D&#252;nner et al., 2020;</ref><ref type="bibr">Miller et al., 2021)</ref>. A variant of the single decision-maker performative prediction problem studies time-dependent dynamics of the data distribution, with both exogenous <ref type="bibr">(Wood et al., 2021;</ref><ref type="bibr">Cutler et al., 2021)</ref> and endogenous <ref type="bibr">(Ray et al., 2022;</ref><ref type="bibr">Brown et al., 2022)</ref> sources. These works primarily consider strategic covariate shifts in a single distribution. In contrast, we consider a mixture of distributions: sub-populations of users whose participation choices result in attrition and retention dynamics which are not studied in the aforementioned distribution shift literature.</p><p>Multiple Decision-Makers. Endogenous distribution shift has also been studied in settings with multiple decisionmakers as a continuous game. For instance, the multi-player performative prediction problem extends the original problem by allowing for multiple competing decision-makers <ref type="bibr">(Narang et al., 2022;</ref><ref type="bibr">Piliouras and Yu, 2022;</ref><ref type="bibr">Wood and Dall'Anese, 2022)</ref>. This line of work differs from ours in that the population is modeled as homogeneous and stateless. These works focus on characterizing the existence and uniqueness of different types of competitive equilibria for the game, and analyze learning dynamics that lead to different equilibrium concepts. In contrast, in our paper the focus is on asymptotically stable points (equilibrium) for the combined dynamical system resulting from myopic optimization by non-anticipating decision-makers and stateful user participation updates.</p><p>Retention. User retention in machine learning systems is closely related to the population participation dynamics we consider <ref type="bibr">(Hashimoto et al., 2018;</ref><ref type="bibr">Zhang et al., 2019)</ref>. In settings with multiple sub-populations of users of different types, the question of retention has been explored in parallel with the issue of fairness. <ref type="bibr">Hashimoto et al. (2018)</ref> coined the term representation disparity for the phenomenon in which the traditional approach of minimizing average performance leads to high overall accuracy coupled with low accuracy on minority groups, causing an exodus of said groups. For single learners, systems which instead perform robust risk minimization avoid such disparity.</p><p>Our work generalizes the single-learner retention setting and analyzes the fixed points of dynamics between multiple systems and populations without modifying risk functions to be robust. <ref type="bibr">Ginart et al. (2021)</ref> also consider user choice between multiple learning systems, with an empirical inves-tigation and theoretical results in restricted settings focused on finite sample effects. In contrast, we propose a general class of risk reducing dynamics and develop a comprehensive theoretical understanding.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">FRAMEWORK AND SETTING</head><p>We consider a setting where the population of individuals is composed of n subpopulations spreading their participation amongst m learners (service providers or decision-makers). Figure <ref type="figure">1</ref>   A subpopulation can be as broad as a demographic or affinity group and as granular as a single individual. The allocation of a subpopulation can represent several things: the fraction of a subpopulation which uses a given service, or the fraction of time users from that subpopulation choose to spend using learners' systems. Accordingly, the relative size &#946; i of the population can represent the proportion of individuals or the total time individuals spend. This framework also allows for a subpopulation to represent types of tasks or activities a user wishes to accomplish, allocating these tasks to learners based upon which systems perform best on which tasks. The only assumption we make about any subpopulation is that individual samples comprising it are i.i.d.</p><p>Throughout, we assume that there are fewer learners than subpopulations, m &#8804; n. Each learner j observes data from the subpopulations who participate in it. Formally it observes features and labels drawn from the mixture distribution determined by the participation and subpopulation sizes:</p><p>Learners make predictions or decisions according to a parameter &#952; j &#8712; R d . Beyond the information encoded in the features and labels, the learners are unaware of which subpopulation individual data points are.</p><p>The quality of predictions made by parameter &#952; j &#8712; R d for an individual instance z j is quantified by the loss &#8467;(&#952; j ; z j ). The quality of &#952; j for a subpopulation is quantified by the average loss, i.e. the risk R i (&#952; j ) = E z&#8764;Di [&#8467;(&#952; j ; z)]. Throughout, we will make the additional assumption that the risk function for each subpopulation R i (&#952;) is convex and differentiable. Figure <ref type="figure">2</ref> illustrates an example of the risk functions arising in linear regression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Decision dynamics of learners and subpopulations</head><p>Subpopulations and learners react to each other; Updates in subpopulation allocations lead to updates in learners parameters &#920; t = (&#952; t 1 , . . . , &#952; t m ), and vice versa. We introduce a broad class of update dynamics by way of a canonical example. Suppose that each subpopulation i updates its allocation by increasing the participation proportional to the quality of various models; for example, by spending more time on recommendation platforms that suggest more engaging content. Recalling that the risk (i.e. average loss) quantifies quality, this manifests as a multiplicative weights update:</p><p>and some parameter &#947; &gt; 0. This is similar to the retention function studied by <ref type="bibr">Hashimoto et al. (2018)</ref> and has connections to replicator dynamics, a foundational evolutionary dynamic that can be interpreted as a process of information diffusion and imitation <ref type="bibr">(Sandholm, 2020)</ref>.</p><p>Recall that each learner j observes data from the mixture distribution ( &#8721;&#65025; n i=1 &#945; ij &#946; i ) -1 &#8721;&#65025; n i=1 &#945; ij &#946; i D i for which we use the shorthand D(&#945; :,j ), where &#945; :,j &#8712; R n denotes the vector of allocations from all subpopulations to learner j. Suppose the learners update their parameters using gradient descent to reduce the average loss over this data (e.g. to improve the prediction of user engagement). Setting aside finite sample issues, for a step size &#947; t the gradient update takes the form</p><p>. This is an incremental version of the repeated retraining dynamics which have been studied in the single learner setting by <ref type="bibr">Hashimoto et al. (2018)</ref>; <ref type="bibr">Perdomo et al. (2020)</ref>.</p><p>Despite the apparent simplicity of independent update rules, the evolution of subpopulations and learners is highly coupled. The sequential interaction between subpopulations and learners leads to complex nonlinear dynamics: i.e. multiplicative weights over non-stationary risks (due to learner updates) and gradient descent over non-stationary data distributions (due to subpopulation updates). To study this complex behavior, we now formalize key properties.</p><p>The first observation is that updates are stateful, with subpopulation allocations and learner parameters depend-</p><p>1 0 1 2 Feature (x) 2 0 2 4 Label (y) Distribution of z=(x,y) Subpop 1 Subpop 2 Subpop 3 6 4 2 0 2 4 6 Parameter 0 20 40 60 Risk Subpopulation Risks Subpop 1 Subpop 2 Subpop 3 1 6 4 2 0 2 4 6 2 6 4 2 0 2 4 6 10 1 10 2 Partial Minimization of Total Risk </p><p>ing on previous values. This motivates a description of the dynamics arising from interactions between n subpopulations and m learners in terms of the overall state</p><p>We thus define for each subpopulation i a general Markovian allocation function &#957; t i : &#8710; m &#215; R m&#215;d &#8594; &#8710; m which describes the participation update &#945; t+1 i,:</p><p>i,: , &#920; t )&#65025; at time t, where &#945; i,: &#8712; &#8710; m denotes the vector of allocations from the subpopulation i to all learners. Similarly, define &#181; t j : R d &#215; R n &#8594; R d so learner j updates their parameter according to &#952; t+1 j = &#181; t j (&#952; t j , &#945; t :,j ). The second observation is that the basis for the updates is the average loss, i.e. risk. This motivates the following definition: given participation &#945; and parameters &#920;, the average risk experienced by each subpopulation i and each learner j is:</p><p>[&#8467;(&#952; j ; z)] .</p><p>In the recommendation example, R &#175;subpop captures the dissatisfaction with content for a subpopulation and R &#175;learner corresponds to the average prediction error of the platform. Intuitively, multiplicative weights reduces the average subpopulation risk while gradient descent reduces the average learner risk. Definition 3.1 (Reducing and Minimizing Dynamics). A u update rule is P -reducing w.r.t. v if P (u t+1 , v t ) &#8804; P (u t , v t ) for all t and any sequence of v t . It is further P -minimizing in the limit if the inequality is strict when u t is not a minimizer and lim t&#8594;&#8734; P (u t , v) = min u P (u, v).</p><p>We call a subpopulation i risk reducing (resp. minimizing) when the allocation update on &#945; i,: is R &#175;subpop i -reducing (resp. minimizing in the limit) with respect to &#920;. Similarly, we call a learner j risk reducing (resp. minimizing) when the parameter update on &#952; j is R &#175;learner j -reducing (resp. minimizing in the limit) with respect to &#945; :,j .</p><p>We remark that the notion of risk minimizing in the limit is reasonable for subpopulations because their average risk is linear in &#945; i,: . It is also reasonable for learners because their average risk is convex in &#952; j (due to the assumption that risks R i (&#952; j ) are convex). However, risk-reducing/minimizing is only a property defined with respect to the participation &#945; or parameter &#920; observed at a previous time step. Thus it does not necessarily hold that R &#175;learner j or R &#175;subpop i decrease when the state evolves (&#945; t , &#920; t ) &#8594; (&#945; t+1 , &#920; t+1 ) by sequential updates of &#957; t and &#181; t . Our experiments (Figure <ref type="figure">4a</ref>) illustrate the non-monotonicity of the coupled updates. Example 3.2 (Semi-static participation). Suppose a population has a constant allocation of 20% to one learner, while the remaining 80% is allocated to the remaining learners inversely proportional to the learner's risk on that population. This is risk reducing but not risk minimizing in the limit. Example 3.3 (Full risk minimization). Suppose that a learner updates its parameter to minimize the average risk function R &#175;learner j (&#945; t :,j , &#8226;) at each timestep. This has been studied as repeated retraining dynamics in the single learner case by <ref type="bibr">Hashimoto et al. (2018)</ref>; <ref type="bibr">Perdomo et al. (2020)</ref>.</p><p>Proposition 3.4. A subpopulation i updating their participation with multiplicative weights is risk minimizing in the limit if &#947; &gt; 0 and &#945; 0 ij &gt; 0 &#8704;j. A learner updating its parameter with gradient descent is risk minimizing in the limit when the risk functions R i (&#952;) are L smooth and the step size</p><p>We provide a proof and detail several other examples of risk reducing dynamics in Appendix D.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Equilibria and stability</head><p>We focus on the equilibrium states resulting from riskreducing subpopulations and learners. Definition 3.5 (Equilibrium). The state (&#945; eq , &#920; eq ) is an equilibrium state if it is stationary under the dynamics update {&#957; t i }, {&#181; t j }; i.e. that for all i &#8712; [n] and j &#8712; [m]: &#945; eq i,: = &#957; t i (&#945; eq i,: , &#920; eq ) and &#952; eq j = &#181; t j (&#945; eq :,j , &#952; eq j ) .</p><p>If learners and subpopulations are in an equilibrium state, they will remain that way indefinitely. However, some equilibrium configurations may be unstable to perturbations. Definition 3.6 (Stable Equilibrium). The state (&#945; eq , &#920; eq ) is a stable equilibrium state if it is an equilibrium and for each &#1013; &#945; , &#1013; &#952; &gt; 0, there exist &#948; &#945; , &#948; &#952; &gt; 0 such that</p><p>It is further asymptotically stable if lim t&#8594;&#8734; &#8741;&#945; t -&#945; eq &#8741; = 0 and lim t&#8594;&#8734; &#8741;&#920; t -&#920; eq &#8741; = 0.</p><p>Stability analysis identifies qualitatively different equilibrium states. For the class of risk reducing dynamics that we study, equilibria may be unstable, stable, or asymptotically stable; Appendix D.2 presents examples. While a quantitative understanding of convergence may also be of interest, it would require stronger assumptions on the behavior of subpopulations and learners; here we favor generality and leave this to future work. Furthermore, characterizing stable equilibria sets the foundation for understanding high probability behavior of systems under noisy updates which are risk reducing only in expectation <ref type="bibr">(Kushner, 1967)</ref>. This sets the stage for finite sample risk minimization or multi-agent user models, a challenge which we leave to future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">MAIN RESULTS</head><p>We study a large class of feedback dynamics between risk reducing learners and subpopulations described by the sequential updates: &#945; t+1 = &#957; t (&#945; t , &#920; t ) and &#920; t+1 = &#181; t (&#945; t+1 , &#920; t ).</p><p>Our analysis allows for learners and subpopulations who exhibit a diverse range of behaviors. We do not require that every learner or every subpopulation update their parameter or allocation in the same manner or even at every timestep, allowing for any number of round-robin schemes. Our only assumption on learner and subpopulation updates is that they are risk reducing or minimizing.</p><p>Figure <ref type="figure">3</ref> presents a summary of the equilibria characterization that we present in this section. All omitted proofs can be found in Appendix C.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Total Risk Reduction</head><p>Definition 4.1 (Total Risk). The total risk of all subpopulations over all learners is the weighted sum</p><p>The total risk maps &#8710; n m &#215; R m&#215;d &#8594; R. While our assumption that the loss is convex implies that the total risk is convex in &#920;, it is not jointly convex in (&#945;, &#920;), illustrated in the right panel of Figure <ref type="figure">2</ref>.</p><p>&#920;0 &#8712; arg min&#920; R total (&#945;0, &#920;)? &#920;0 unique minimizer? &#945;0 segmented? not an equilibrium no general classification (Appendix D.2) Strict preference? (Eq. (1) in Thm. 4.6) asymptotically stable unstable Support is over risk equivalent learners? (Eq. (2) in Thm. 4.8)</p><p>Support is over risk optimal learners? (Eq. ( <ref type="formula">2</ref>) in Thm. 4.8) not an equilibrium unstable may be stable (balanced equilibrium) Figure <ref type="figure">3</ref>: A summary of our main results on equilibria classification for a given participation &#945; 0 and model parameters &#920; 0 . These results hold for dynamics which are risk minimizing in the limit and loss functions that are convex.</p><p>Our first result shows that the total risk R total (&#945; t , &#920; t ) is non-increasing over time.</p><p>Proposition 4.2. For any risk-reducing subpopulation and learner dynamics, the total risk is non-increasing: R total (&#945; t+1 , &#920; t+1 ) &#8804; R total (&#945; t , &#920; t ), &#8704;t. If subpopulations and learners are risk minimizing in the limit, then the total risk is strictly decreasing unless</p><p>Proof Sketch. First note that the total risk can be decomposed into either a weighted sum of average subpopulation risk or average learner risk. Thus the fact that learner and subpopulation dynamics are risk reducing ensures that the total risk is decreasing after the sequential updates.</p><p>Thus, the total risk acts like a potential function for the feedback dynamics of learners and subpopulations. When the subpopulation and learner dynamics are risk minimizing in the limit, there is a strong connection between properties of the total risk function and equilibria of the dynamics.</p><p>Theorem 4.3. For any learners and subpopulations who are risk minimizing in the limit, an equilibrium (&#945; eq , &#920; eq ) is asymptotically stable if it is an isolated local minimizer of the total risk R total . If it is not a local minimizer of the total risk, then it is not stable. Proof Sketch. By Proposition 4.2 the function V (&#945;, &#920;) := R total (&#945;, &#920;) -R total (&#945; eq , &#920; eq ) is potential function for the autonomous dynamical system (&#945; t , &#920; t ) &#8594; (&#945; t+1 , &#920; t+1 ). The stability result follow from Lyapunov arguments.</p><p>The connection between stability and the total risk function is significant in at least two ways: first, it means that under general classes of myopic and self-interested behaviors on the part of subpopulations and learners, the total risk is driven to at least a local minimum. Second, it is a technically useful connection that will enable us to characterize and classify the stable equilibria for dynamics which are risk minimizing in the limit. We remark that Theorem 4.3 leaves open the question of stability for equilibria which are nonisolated minima of the total risk function. In Appendix D.2, we provide examples which show that such points may be asymptotically stable, stable, or unstable depending on the particular instantiation of dynamics. The following existence result further motivates our focus dynamics which are risk minimizing, rather than just reducing.</p><p>Corollary 4.4. Equilibria exist when learners and subpopulations are risk minimizing in the limit and the total risk function has isolated local minima. They may not exist otherwise.</p><p>Example of dynamics without equilibria. Consider subpopulations with risk functions minimized at the same value &#952; * . If learners use full risk minimization, the setting lacks isolated minima because the total risk is uniform across all allocations &#945;. Assuming that risk-minimizing subpopulations randomly choose among equivalent learners, no equilibrium exists as allocations randomly switch between learners once the learners converge to the optimum &#952; * .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Segmented and Balanced Equilibria</head><p>Definition 4.5 (Segmented allocation). An allocation is segmented if &#945; ij &#8712; {0, 1} for all i, j.</p><p>In a segmented allocation, each subpopulation is associated with a single learner, and thus the population is partitioned across learners. For allocation dynamics like multiplicative weights, such configurations are clearly equilibria for any parameter choice &#920; on the part of the learners. We thus consider the set of possible segmented equilibria and characterize which are asymptotically stable.</p><p>Theorem 4.6. Suppose learners and subpopulations are risk minimizing in the limit, &#945; eq is segmented, and R total (&#945; eq , &#920;) has a unique minimizer &#920; eq . Define a mapping &#947; : [n] &#8594; [m] such that &#947;(i) = j is the learner with nonzero mass in &#945; eq i,:</p><p>. If every subpopulation strictly prefers their current learner:</p><p>for all i and learners j &#824; = &#947;(i), then (&#945; eq , &#920; eq ) is an asymptotically stable equilibrium. If there is a subpopulation who would strictly prefer to switch learners, then (&#945; eq , &#920; eq ) is not stable.</p><p>When risks are strongly convex, there is always such a unique minimizer &#920; eq . In particular, in a segmented allocation, each &#952; eq j minimizes the average loss over the group of subpopulations assigned to them.</p><p>Corollary 4.7. Suppose that risk functions satisfy R i (&#952;) &lt; R i (&#952; &#8242; ) &#8656;&#8658; &#8741;&#952; -&#981; i &#8741; &lt; &#8741;&#952; &#8242; -&#981; i &#8741; for &#981; i the subpopulation optimal parameter. Then in an asymptotically stable segmented equilibrium, the convex hulls of the grouped subpopulations optimal parameters {&#981; i } are non-intersecting.</p><p>Proof Sketch. Consider a partition where the convex hulls intersect for some pair of learners. Then there exists at least one subpopulation who would be better off switching to the other learner, and thus the risk condition in Theorem 4.6 cannot hold.</p><p>Applying the Corollary to the example in Figure <ref type="figure">2</ref>, we see that a segmented equilibrium with subpopulation 1 and 3 participating in the same learner cannot be stable. Theorem 4.6 leaves open the question of stability in the case that the risks in Equation (1) are equal. Under such risk equivalence, is it natural to consider equilibria where a subpopulation has support over multiple learners.</p><p>Theorem 4.8. Consider dynamics which are risk minimizing in the limit and an &#945; eq with any subpopulation i having nonzero support on set of two or more learners j &#8712; J . Assume risks are strongly convex and define &#920; eq = arg min R total (&#945; eq , &#920;). Then (&#945; eq , &#920; eq ) cannot be stable unless it is "balanced" in the sense that learners in J are risk equivalent and optimal for i, i.e. for all j, j &#8242; &#8712; J ,</p><p>If it is balanced, so are all allocations for subpopulation i with support over J . Finally, all stable equilibria must be either balanced or segmented.</p><p>This result characterizes a set of possibly stable equilibria. It demonstrates that risk optimality, in addition to equivalence, is necessary. Guaranteeing the stability of such balanced equilibria requires further information about the dynamics, and it is not possible to make a general statement. Examples in Appendix D.2 demonstrate that such balanced equilibria may be asymptotically stable, stable, or unstable. Furthermore, the balance condition is fragile in the sense that it would not hold under small perturbations to the underlying risk functions. While the number of possible balanced equilibria is combinatorial in the number of learners and subpopulations, risk functions are continuous, so it is possible to find arbitrarily small perturbations to any the risk functions that would destabilize all balanced equilibria.</p><p>Proof Sketch of Theorems 4.6 and 4.8. By Theorem 4.3, characterizing the stable equilibria is equivalent to characterizing isolated and non-isolated local minima of the total risk. We show that it suffices to characterize local minima of the partial minimization</p><p>Since F is concave, all minima occur on the boundary, i.e. a face or a vertex. Since F is still concave when restricted to a face of the simplex, the same argument shows the minima are on the boundary, hence vertices, except for the degenerate case where F takes a constant value over the face.</p><p>Thus, the isolated local minima occur at vertices of the simplex product, which correspond to segmented allocation. Further analysis of F yields the conditions presented in Theorem 4.6. The local minima in the degenerate case are characterized by the balanced equilibria conditions in Theorem 4.8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Social Welfare for Segmented Populations</head><p>Definition 4.9. The social welfare of a state (&#945;, &#920;) is strictly decreasing in the total risk R total (&#945;, &#920;).</p><p>This definition of social welfare is utilitarian in the sense that it depends on the cumulative quality of individuals' experiences. Maximizing the social welfare corresponds to minimizing the total risk, which can be posed as the following optimization problem</p><p>(3)</p><p>Here, (&#945; &#8902; , &#920; &#8902; ) is the social welfare maximizer.</p><p>Our discussion of stable equilibria has so far focused on only local minimizers of the total risk. In fact, global minimization of this objective (and therefore maximization of social welfare) is a hard problem. The total risk objective can be viewed as an instance of the k-means clustering problem with k = m. In the language of this literature (e.g., <ref type="bibr">Selim and Ismail (1984)</ref>), each subpopulation is a data point and the parameter selected by each learner is a cluster center. The allocations described by &#945; correspond to (fuzzy) cluster assignment and each risk function R i (&#952; j ) corresponds to a measure of "dissimilarity" between data points (subpopulations) and cluster centers (learners).</p><p>The connection to k-means clustering elucidates the difficulty of minimizing the total risk. The "minimum sum-ofsquares clustering" problem (i.e., squared Euclidean norm dissimilarity) is NP hard with general dimension even when k = 2 <ref type="bibr">(Aloise et al., 2009)</ref>. When the number of clusters and dimension are fixed, <ref type="bibr">Inaba et al. (1994)</ref> present an algorithm for solving the minimum sum-of-squares clustering problem which is polynomial in the number of datapoints. Translated to our setting, its complexity is O(n md ). It is therefore unrealistic to hope that a myopic dynamic might generally lead to social welfare maximization. However, due to the connections with total risk, risk reducing dynamics are at least well-behaved with regards to social welfare.</p><p>Proposition 4.10. For risk reducing subpopulations and learners, social welfare is non-decreasing over time. If the dynamics are furthermore risk minimizing in the limit, social welfare is strictly increasing and stable equilibria correspond to local social welfare maxima.</p><p>Local maximization is not a panacea: Example 4.11 shows a local maximum of the social welfare can be much worse than the global one.</p><p>Example 4.11 (Arbitrarily high total risk at local optimum). Consider three subpopulations with</p><p>for some &#981; &gt; 2. Suppose that subpopulation sizes are &#946; 1 = &#946; 2 = &#946; and &#946; 3 = 1 -2&#946; for some 0 &lt; &#946; &lt; 1/2. Further suppose that there are two learners. Up to permutation, the social welfare optimum is &#952; 1 = 1/2 and &#952; 2 = &#981;, with total risk &#946;/2. However, as long as &#981; &lt; 1-&#946; 1-2&#946; , there is another stable equilibrium. Let &#981; = 1-&#946; 1-2&#946; -&#1013;. Then the following is a stable equilibrium: &#952; 1 = 0 and &#952; 2 = 1 -&#1013;. The total risk is &#946; + (&#946;-&#1013;) 2 1-2&#946; . For &#946; close to 1/2, this risk can be arbitrarily larger than the social optimum.</p><p>In this example, a large gap between a stable local optimum and the global optimum arises in part due to a large difference in subpopulations' sizes. We further remark that minority groups can be under-served particularly when considering worst-case risk over subpopulations <ref type="bibr">(Hashimoto et al., 2018)</ref>. Even at a social welfare maximizer (&#945; &#8902; , &#920; &#8902; ), the worst-case subpopulation risk can be arbitrarily bad. It is straightforward to construct such examples even in the single learner case: consider a minority group with vanishingly small population proportion and arbitrarily high risk at the optimal parameter for the majority group (Example D.10).</p><p>Despite these inherent difficulties, we find that the situation improves as the number of learners increases. It is straightforward to see that the maximal social welfare will increase: any point which is optimal for m learners can be trivially transformed into a feasible point for m + 1 learners which achieves the same social welfare, by allocating no subpopulations to the new learner. There is more nuance involved when considering any possible stable equilibria. Instead, we make a statement about a particular learner growth process which corresponds to existing learner m "splitting in half". Proposition 4.12. Suppose that risks are strongly convex, there are m learners, (&#945; eq , &#920; eq ) is an equilibrium, and at least one subpopulation i allocated to learner m does not have optimal subpopulation risk, so &#8711;R i (&#952; eq m ) &#824; = 0. The state is amended to add an additional learner: &#920; &#732;eq = [&#920; eq , &#952; eq m ] and</p><p>Under dynamics which are risk minimizing in the limit, the equilibrium (&#945; &#732;eq , &#920; &#732;eq ) is not stable, so a small perturbation will send the system to a state with strictly lower total risk (higher social welfare).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">SIMULATIONS</head><p>We illustrate the salient properties of the decision dynamics in simulation<ref type="foot">foot_0</ref> . We consider both a synthetic setting as well as one instantiated from a prediction task on census data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synthetic</head><p>In Figure <ref type="figure">4a</ref> we consider a simple scenario with n = 3 subpopulations of equal sizes &#946; i = 1/3, quadratic risk functions R i = &#8741;&#981; i -&#952;&#8741; 2 +1 with distinct risk minimizing decisions &#981; i and m = 2 learners. The learners minimize their risk according to full risk minimization (Example 3.3) and the subpopulations update their participation via multiplicative weights update (Section 3.1). When &#945; 0 i,j = 1/2 for all i, j the risk equality condition from Theorem 4.8 is satisfied with &#952; eq j = (&#981; 1 + &#981; 2 + &#981; 3 )/3, however the optimality condition is not. We therefore observe that this equilibrium is not stable, and slightly perturbing the initial conditions leads to split-market equilibria. Figure <ref type="figure">4a</ref> illustrates trajectories from three different perturbations. It demonstrates that the total risk is non-increasing whereas the average risks for both learners and subpopulations are not monotonic. Each of the perturbations has different risk trajectories and equilibrates at a different split-market equilibrium. We repeat these experiments with noisy dynamics, we consider both exogenous noise that independently perturbs the decisions of the learners and/or populations as well as intrinsic noise due to making updates with finite sample estimates rather than at population level. We find that the key properties of the dynamics hold when the updates are noisy, detailed experiments are presented in Appendix E.</p><p>Another set of experiments in Figure <ref type="figure">4b</ref> illustrates how a larger number of learners lead to better outcomes in terms of total risk. We consider a set of m = 2 learners and n = 50 subpopulations. We simulate the dynamics until the market has reached equilibrium, at which point a randomly chosen learner breaks up into two identical learners with half the user base. From this unstable equilibrium (Proposition 4.12) we slightly perturb the parameters of the two learners and allow the system to reach a new equilibrium state. The procedure repeats until the number of learners reaches number of subpopulations. These simulations illustrate that more competition improves social welfare, however the improvements are not uniform for all subpopulations with some groups seeing their risk at equilibrium increase with the addition of new learners.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Census data</head><p>We consider a semi-synthetic setting where subpopulations and their risk functions are instantiated by a prediction task on real data. Using folktables <ref type="bibr">(Ding et al., 2021)</ref> we consider a modified version of ACSTrav-elTime prediction problem derived from the 2018 California Census data. We consider 6 subpopulations corresponding to racial groups with relative size ranging from 1.2% to 61%. We define the least-squares risk functions as R i (&#952;) = 1</p><p>Ni &#8741;X i &#952; -y i &#8741; 2 where X i &#8712; R Ni&#215;d are the features (containing demographics, educational attainment, income levels, and modes of transportation) and y i &#8712; R Ni are the labels (log transform of the daily commute time in minutes) for individuals within subpopulation i. We sim-</p><p>0 5000 10000 Time t 0 2 4 6 8 10 Subpop. risk Subpop. 1 Subpop. 2 Subpop. 3 Subpop. 4 Subpop. 5 Subpop. 6 0 5000 10000 Time t 0.2 0.3 0.4 0.5 0.6 Learner risk Learner 1 Learner 2 Learner 3 (a) Risk dynamics Learner 1 Learner 2 Learner 3 Subpop. 1 Subpop. 2 Subpop. 3 Subpop. 4 Subpop. 5 Subpop. 6 (b) Allocation trajectories ulate risk reducing dynamics from a perturbed balanced equilibrium over 3 learners. As in the synthetic example, the risks of learners and subpopulations are not all monotone (Figure <ref type="figure">5a</ref>), but the total risk function is. Finally Figure <ref type="figure">5b</ref> illustrates the convergence of allocation dynamics to a segmented equilibrium.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DISCUSSION</head><p>In this paper, we study the feedback dynamics of user retention for loss minimizing learners, where subpopulations choose between providers. We introduce a formal notion of risk reducing and minimzing to capture this feedback, and show that there is a close connection between such dynamics and the total risk summed over subpopulations and learners. We provide a comprehensive characterization of stable equilibria and investigate the implications in terms of a utilitarian social welfare. This work relates to questions of fairness and minority representation in several ways. First, our results imply that risk-minimizing dynamics in multi-learner settings can result in higher welfare for small subpopulations compared with single-learner settings, as studied by <ref type="bibr">(Hashimoto et al., 2018;</ref><ref type="bibr">Zhang et al., 2019)</ref>.</p><p>This resonates with recent work showing that monopolies have higher performative power and lead to lower individual utility <ref type="bibr">(Hardt et al., 2022)</ref>.</p><p>The dynamics that we study often lead to segmentation of subpopulations across learners as an emergent phenomenon<ref type="foot">foot_1</ref> . This segmentation can lead to pointwise lower risks for subpopulations, especially when subpopulations have considerably different risk profiles. In some contexts, the benefits of the reduced risk among subpopulations may outweigh possible harms from segregation. In others, where proportional representation of groups across learners, models, or clusters <ref type="bibr">(Kleindessner et al., 2019a,b)</ref> is important, our work implies that independent risk minimization can lead to undesirable outcomes. In short, this work analyzes natural dynamics with consequences for the distribution of subpopulations amongst independent learners; whether or not the consequences are desirable depend on the specific application considered.</p><p>We highlight several directions for future work. Our results lay the groundwork for an investigation of the stochastic dynamics that occur for finite sample approximations to the risk or participation driven by decisions of individuals. Such behaviors are risk reducing in expectation, so we expect the noisy trajectories to converge with high probability to sets around the asymptotically stable equilbria we characterize.</p><p>There are many interesting and relevant questions in the finite sample setting: What is the effect of sample size on the ability of new learners to enter a market and minority subpopulations to be adequately represented? Can we model heterogeneous learners who differ in which features they measure and with how much noise? Are there trade-offs between the expressivity of models and the practical difficulty of minimizing risk from finite samples in high dimensions?</p><p>It would also be interesting to consider extensions or alternative dynamics models for the learner and subpopulation decisions. One could investigate competitive learners who explicitly strategize to capture subpopulations <ref type="bibr">(Ben-Porat and Tennenholtz, 2019;</ref><ref type="bibr">Aridor et al., 2020)</ref>; this setting is related to facility location and Hotelling games <ref type="bibr">(Owen and Daskin, 1998;</ref><ref type="bibr">Hotelling, 1929)</ref>. One might imagine that subpopulations do not act uniformly and may not even be entirely independent of each other-the participation update may depend on some underlying social network. The connections between total risk reduction and k-means clustering algorithms suggest interventions such as subpopulationaware initialization <ref type="bibr">(Bose et al., 2023)</ref> that could improve social welfare. Results on "ground truth recovery" may yield insight into particular population structures that lead to simpler dynamics or restricted sets of equilibria.</p><p>Checklist 1. For all models and algorithms presented, check if you include: (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes] Assumptions are stated in Section 3 and in theoretical statements. (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Not Applicable] We do no propose an algorithm. (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes] A link is provided in section 5. 2. For any theoretical claim, check if you include: (a) Statements of the full set of assumptions of all theoretical results. [Yes] Assumptions are stated in Section 3 and in theoretical statements. (b) Complete proofs of all theoretical results. [Yes] Proofs are presented in the appendix. (c) Clear explanations of any assumptions. [Yes] Assumptions are discussed in Section 3. 3. For all figures and tables that present empirical results, check if you include: (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes] A link is provided in section 5. (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes] The details are available in the code. (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes] (d) A description of the computing infrastructure used. (e.g., type of GPUs, internal cluster, or cloud provider). [Yes] 4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include: (a) Citations of the creator If your work uses existing assets. [Yes] (b) The license information of the assets, if applicable. [Not Applicable] The census data we use is in the public domain. (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable] (d) Information about consent from data providers/curators. [Not Applicable] The census data we use is in the public domain. (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable] 5. If you used crowdsourcing or conducted research with human subjects, check if you include: (a) The full text of instructions given to participants and screenshots. [Not Applicable] (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable] (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Preliminaries B.1 Notation</head><p>We introduce a compact notation. The simplex product is defined as</p><p>so that the rows sum to 1. Then the state space of subpopulation allocations and learner parameters is X = &#8710; n m &#215; R m&#215;d . For a square matrix A, we use the notation diag(A) to represent the vector containing the diagonal entries of A. For a vector a, Diag(a) is a diagonal matrix with a along the diagonal. Furthermore we will say a &#8804; b for vectors a, b if the inequality holds elementwise.</p><p>Define a matrix valued risk function R : R m&#215;d &#8594; R n&#215;m so that R ij (&#920;) = R i (&#952; j ). Recall that in Section 3.1, the subpopulation and learner risks played a key role. We therefore define vector valued functions R &#175;subpop : X &#8594; R n and R &#175;learner : X &#8594; R m as follows:</p><p>Then the definition of risk reducing dynamics for subpopulations and learners can be written as</p><p>Risk minimizing in the limit is defined similarly, where the inequality is strict for at least one entry of the vectors unless the state is at a local minimum.</p><p>The total risk can be written as R total (&#945;, &#920;) := tr(diag(&#946;)&#945;R(&#920;) &#8868; ) .</p><p>Lemma B.1. Under the assumption that all loss functions are continuous, the risk function R is continuous w.r.t. to &#920;, and thus R total is continuous w.r.t. &#945; and &#920;.</p><p>The sequential dynamics updates described in Section 3.1 can be written as</p><p>Lemma B.2. As long as the subpopulation and learner updates described in Section 3.1 are locally Lipschitz, so is the dynamics function f defined in (4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Background</head><p>For completeness, we include important results and definitions that our proofs will make use of. First, we state two theorems about Lyapunov theory for stability.</p><p>Theorem B.3 (Theorem 1.2 in <ref type="bibr">Bof et al. (2018)</ref>). Let x eq &#8712; D be an equilibrium point for the autonomous systems x t+1 = f (x t ) where f : D &#8594; X is locally Lipschitz in D &#8838; X . Suppose there exists a function V : D &#8594; R which is continuous and such that</p><p>Then x eq is stable (resp. asymptotically stable).</p><p>Theorem B.4 (Theorem 1.5 in <ref type="bibr">Bof et al. (2018)</ref>). Let x eq &#8712; D be an equilibrium point for the autonomous systems x t+1 = f (x t ) where f : D &#8594; X is locally Lipschitz in D &#8838; X . Let V : D &#8594; R be a continuous function with V (x eq ) = 0 and V (x 0 ) &gt; 0 for some x 0 arbitrarily close to x eq . Let r &gt; 0 be such that B r (x eq ) &#8838; D and U = {x &#8712; B r (x eq ) | V (x) &gt; 0}, and suppose that V (f (x)) -V (x) &gt; 0 for all x &#8712; U. Then x eq is not stable.</p><p>where inequality holds for all D &#8712; R n&#215;m such that &#945; 0 + D &#8712; X &#945; by the fact that &#945; 0 is a minimum. Recognizing the gradient from Lemma B.9 and using the uniqueness of &#920; 0 , the expression is equivalently &#10216;&#8711; &#945; F (&#945; 0 ), D&#10217; &#8805; 0. In other words, the directional derivative in any feasible direction D is non-negative. Hence, &#945; 0 is a local minimum of F . The implication for the isolated local minimum case follows by the same arguments with strict inequalities on the total risk.</p><p>Lemma B.9. For F : R n&#215;m &#8594; R defined as in Lemma B.8, suppose the minimizier &#920; * (&#945;) = arg min &#920; R total (&#945;, &#920;) is unique. The gradient is</p><p>Further suppose that the risks are strongly convex. Then second partial derivatives are given by</p><p>Proof. Computing the gradient:</p><p>The first equality follows by product rule. The second equality follows because 1) the total risk is linear in &#945; and 2) the second term is zero due to the optimality of &#920; * (&#945;).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#10217;&#65027;</head><p>To compute the derivatives of &#952; &#8902; j (&#945;) we use the implicit function theorem and the assumption that the risks are strongly convex. We apply the implicit function theorem to the first order optimality condition</p><p>The Hessian &#8711; 2 &#952; R &#175;learner j (&#945;, &#920;)) is non-degenerate due to strong convexity of the subpopulation risks. There exists a neighborhood U 0 of &#945; and a unique (sufficiently smooth) map &#952; * j (&#8226;) such that for all &#945; &#8712; U 0 , we have that &#8711; &#952; R &#175;learner j (&#945;, &#952; * (&#945;)) = 0. Then by implicit function theorem we obtain</p><p>by taking the derivative of the first order condition differentiating through &#952; &#8902; j (&#8226;) and setting it to zero. We have that</p><p>The result follows by combining the expressions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Full Proofs of Main Results</head><p>In this section, we present proofs of the main results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Connections between dynamics and total risk</head><p>Proposition 4.2. For any risk-reducing subpopulation and learner dynamics, the total risk is non-increasing: R total (&#945; t+1 , &#920; t+1 ) &#8804; R total (&#945; t , &#920; t ), &#8704;t. If subpopulations and learners are risk minimizing in the limit, then the total risk is strictly decreasing unless</p><p>Proof of Proposition 4.2. The key to seeing that the total risk acts like a potential for the market dynamics is to note two equivalent decompositions of the total risk:</p><p>Being risk-reducing learners' updates satisfy:</p><p>Similarly risk reducing subpopulations satisfy:</p><p>Finally, combining the two updates yields the desired inequality.</p><p>In the case that learners and subpopulations are risk minimizing in the limit, the same argument holds with strict inequality, unless (&#945; t+1 , &#920; t+1 ) is a local minimum.</p><p>Theorem 4.3. For any learners and subpopulations who are risk minimizing in the limit, an equilibrium (&#945; eq , &#920; eq ) is asymptotically stable if it is an isolated local minimizer of the total risk R total . If it is not a local minimizer of the total risk, then it is not stable.</p><p>Proof of Theorem 4.3. We break this proof into two implications.</p><p>1. Isolated local min =&#8658; Asymptotic stability Define V (&#945;, &#920;) = R total (&#945;, &#920;) -R total (&#945; eq , &#920; eq ). The dynamics f are Lipschitz by Lemma B.2 and this V satisfies the conditions of Theorem B.3 with strict inequality, thus we conclude that (&#945; eq , &#920; eq ) is an asymptotically stable equilibrium.</p><p>2. Not local min =&#8658; Not stable Define V (&#945;, &#920;) = R total (&#945; eq , &#920; eq ) -R total (&#945;, &#920;) which will increase along trajectories. Since we are not at a local min, there must be some arbitrarily close &#945; 0 , &#920; 0 such that V (&#945;, &#920;) &gt; 0. Then we apply Theorem B.4 which guarantees that the equilibrium is not stable.</p><p>Corollary 4.4. Equilibria exist when learners and subpopulations are risk minimizing in the limit and the total risk function has isolated local minima. They may not exist otherwise.</p><p>Proof of Corollary 4.3. We first argue that if the dynamics are risk minimizing, then an isolated local minimum of the total risk must be an equilibria. Let (&#945; 0 , &#920; 0 ) denote the isolated local minima of the total risk. It must be that &#945; 0 is an isolated, and thus unique, minimizer of R total (&#945;, &#920; 0 ) since the function is linear in &#945;. We can thus conclude that &#957;(&#945; 0 , &#920; 0 ) = &#945; 0 . It also must be that &#920; 0 is a unique minimizer of R total (&#945; 0 , &#920;) since the function is convex in &#920;. We can thus conclude that &#181;(&#945; 0 , &#920; 0 ) = &#920; 0 . Therefore (&#945; 0 , &#920; 0 ) is equilibrium of the dynamics.</p><p>We next show that equilibria may not exist when the dynamics are not risk minimizing in the limit. To show that they may not exist otherwise, consider the following example. Let all learners be static and identical so &#920; t+1 = &#920; t and &#920; = (&#952;, &#952;, . . . , &#952;).</p><p>Let the subpopulation update break ties among equivalent learners randomly. Then the subpopulations will randomly switch between learners. Though these dynamics satisfy the definition of risk reducing (at equality), they will not converge to an equilibrium.</p><p>We lastly show that equilibria may not exist when the total risk function does not have an isolated local minima. Suppose that learners update with full risk minimization and all subpopulations have risk uniquely minimized at the same value &#952;.</p><p>Finally suppose that subpopulations will break ties among equivalent learners randomly (and are otherwise risk minimizing). As in the previous example, the subpopulations will randomly switch between learners and no equilibrium exists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Stable equilibria</head><p>Theorem 4.6. Suppose learners and subpopulations are risk minimizing in the limit, &#945; eq is segmented, and R total (&#945; eq , &#920;) has a unique minimizer &#920; eq . Define a mapping &#947; : [n] &#8594; [m] such that &#947;(i) = j is the learner with nonzero mass in &#945; eq i,:</p><p>. If every subpopulation strictly prefers their current learner:</p><p>We now characterize this degenerate case. F takes a constant value over the face if and only if 1) the gradient of F is perpendicular to the face at &#945; and 2) remains perpendicular along the face. The face is described by a set of indices J &#8838; [m].</p><p>Mathematically, we write the two conditions as: for all pairs j, j &#8242; &#8712; J , &#8467; &#8712; [n], and k &#8712; [m],</p><p>Using Lemma B.8, the first expression simplifies to the risk equivalent condition that R i (&#952; eq j ) = R i (&#952; eq j &#8242; ). Turning to the second expression in (5), we first use Lemma B.9 to compute</p><p>Thus, the condition trivially holds for k / &#8712; {j, j &#8242; }. Otherwise, when &#8467; = i, the condition in (5) requires that</p><p>Due to the strong convexity of the risks, the Hessian matrix is positive definite. Thus it must be that &#8711;R i (&#952; eq j ) = 0 for all j &#8712; J , i.e. the risk optimal condition. Risk optimality implies that the condition holds also when &#8467; &#824; = i and thus the characterization is complete.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 Social Welfare</head><p>Proof of Proposition 4.10. Social welfare is non-decreasing (or increasing) if and only if total risk is non-increasing (or decreasing), as guaranteed by Proposition 4.2. Maxima of the social welfare are equivalent to minima of the total risk and therefore the connections to stable equilibria follow by Theorem 4.3.</p><p>Proof of Proposition 4.12. By construction (&#945; &#732;eq , &#920; &#732;eq ) is not segmented, and neither is it a stable balanced equilibrium (by the non-optimality assumption). Therefore, it is not stable (Theorem 4.8), and thus not a local minimum of the total risk (Theorem 4.3). A perturbation will thus send the system along a risk-reducing trajectory.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Detailed Examples</head><p>D.1 Risk Reducing and Minimizing Dynamics Proposition 3.4. A subpopulation i updating their participation with multiplicative weights is risk minimizing in the limit if &#947; &gt; 0 and &#945; 0 ij &gt; 0 &#8704;j. A learner updating its parameter with gradient descent is risk minimizing in the limit when the risk functions R i (&#952;) are L smooth and the step size satisfies &#947;</p><p>Proof. To see that the subpopulation is risk minimizing, first see that</p><p>where the strict inequality holds as long as &#945; t ij is not on the boundary of the simplex. Second, observe that for a fixed &#920;, &#945; t ij &#8594; 1 if and only if R i (&#952; j ) is minimal over all learners for which &#945; 0 ij &gt; 0. To see that the learner is risk minimizing, notice that the gradient update is equivalently</p><p>Gradient descent on an L-smooth and convex function leads to strictly decreasing objective values when &#952; t j is not at a minimum and the step size satisfies &#947; t &lt; 2 L . It further converges to a minimum in the limit as long as the step size satisfies the Robbins-Munroe condition (see, e.g. <ref type="bibr">Liu and Yuan (2022)</ref>; Orabona (2020)).</p><p>Example D.1 (Non-continuity of allocation updates). Suppose a population prefers one learner over others, and only shifts participation away from the preferred learner if there is another with risk smaller by at least R 0 &gt; 0. This is risk reducing but not minimizing in the limit.</p><p>Example D.2 (Shifting to lower-risk models). If a subpopulation's allocation updates always shift allocation from learners with high subpopulation risk to learners with lower subpopulation risk, then the allocation is risk reducing. It may or may not be risk minimizing in the limit.</p><p>Example D.3 (Allocations determined by gradient descent). Consider an allocation determined by (projected) gradient descent with respect to a subpopulation's average risk. This is risk-reducing, and may be risk minimizing in the limit depending on the step-size.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Stability</head><p>To illustrate the subtleties of determining stability when the total risk function has non-isolated local minima, we consider a setting with n = m = 2 subpopulations and learners where R 1 (&#952;) = R 2 (&#952;) = &#952; 2 . Then the total risk function is minimized for any &#945; &#8712; &#8710; n m and &#920; = (0, 0). This continuum of minima can contain equilibria of risk minimizing dynamics, and those equilibria may be stable, asymptotically stable, or unstable, which we illustrate with the following examples.</p><p>Example D.4 (Continuum of stable balanced markets). Suppose that subpopulations update their allocation via any Lipschitz continuous risk minimizing update rule which is stationary whenever learners are risk equivalent (i.e. R i (&#952; 1 ) = R i (&#952; 2 )). Suppose that learners update via full risk minimization. Then equilibria will have the form (&#945; eq , (0, 0)) for any &#945; eq &#8712; &#8710; 2</p><p>2 . Then starting from any (&#945; 0 , &#920; 0 ) with a &#948; &#945; , &#948; &#952; ball of any equilibrium (&#945; eq , &#920; eq ),</p><p>at which point the system is in a new equilibrium, since any allocation &#945; is a fixed point when &#920; = (0, 0) so &#945; t = &#945; 1 and &#920; t = &#920; 1 for all t. We have that &#8741;&#920; eq -&#920; 0 &#8741; = 0 and &#8741;&#945; eq -&#945; 1 &#8741; = &#8741;&#957;(&#945; eq , &#920; eq ) -&#957;(&#945; 0 , &#920; 0 )&#8741; By the assumption of Lipschitzness, this distance will scale linearly in &#948; &#945; , &#948; &#952; so the definition of stability is satisfied for &#948; chosen proportionally to &#1013; depending on the Lipschitz constant of &#957;.</p><p>In this example, any perturbation converges to a new fixed point within one time step. The continuity of the update functions ensures that the new fixed point is within a bounded distance of the original, satisfying the definition of stability. This example is not asymptotically stable: the allocation does not convergence back to the original point.</p><p>Example D.5 (Asymptotically stable segmentation). Consider the subpopulation and learner update rules as in the prior example, with one amendment. When R i (&#952; 1 ) = R i (&#952; 2 ), subpopulation 1 re-allocates half of its mass from learner 2 to learner 1, and while subpopulation 2 re-allocates half its mass from learner 1 to learner 2. Thus the subpopulation update can be written as</p><p>The only equilibrium has &#945; eq segmented with subpopulation i associated to learner i for i = 1, 2 and &#920; eq = (0, 0). It is straightforward to see that this is an asymptotically stable equilibrium, since for any a &#8712; &#8710; 2 ,</p><p>Example D.6 (Asymptotically stable balanced market). Consider a setting similar to the previous example except that when R i (&#952; 1 ) = R i (&#952; 2 ), subpopulation i moves half the mass from group 1 to group 2 and half the mass from group 2 to group 1 for all i. Then the subpopulation update can be written as</p><p>The only equilibrium has &#945; eq i,: = [1/2, 1/2] for i = 1, 2 and &#920; eq = (0, 0). It is straightforward to see that this is an asymptotically stable equilibrium, since for any a &#8712; &#8710; 2 ,</p><p>Example D.7 (Unstable balanced market). Suppose that subpopulation allocations follow a projected gradient descent update for all i:</p><p>and &#945; i2 = 1 -&#945; i1 . Further suppose learners update with gradient descent:</p><p>Both rules are risk minimizing in the limit (note that &#952; t j = 1 &#8730; t &#952; 0 j ) and have a continuum of equilibria at any &#945; eq &#8712; &#8710; n m and &#920; eq = (0, 0). However, we show that the equilibria are not stable. Consider the initial condition (&#945; eq , (&#948; &#952; , 0)). We have that</p><p>No matter how small the perturbation &#948; &#952; is, the summation increases with t and participation will converge all weight to learner 2. A similar argument shows that perturbations exist that will send all participation to learner 1.</p><p>In this example, the learners update slowly. Despite eventual convergence to the minimizing parameter, the accumulating error causes the participation allocation to shift completely to the unperturbed learner, precluding stability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Social Welfare</head><p>We begin with a somewhat generic example with m = 2 and n = 3 that illustrates the difference between stable equilibria and social welfare optima.</p><p>Example D.8 (Stability vs. optimality). Consider three subpopulations i &#8712; {1, 2, 3} with risks &#8741;&#952; -&#981; i &#8741; 2 2 , sizes &#946; i , and two learners j &#8712; {1, 2}. Suppose that the &#945; eq is such that the subpopulations are partitioned into {1} and {2, 3}. Then we have that</p><p>By Theorem 4.6, this is stable if and only if</p><p>However, it is only social optimal if and only if &#981; 2 and &#981; 3 are relatively close to each other than to &#981; 1 , i.e.</p><p>The set of subpopulation optima {&#981; 1 , &#981; 2 , &#981; 3 } satisfying the optimality condition are a subset of those satisfying the stability condition. As the difference between &#946; 2 and &#946; 3 becomes more extreme, the number of settings satisfying the stability but not optimality condition increases.</p><p>We use this generic example to illustrate a scenario in which the total risk can be arbitrarily high at a stable equilibria.</p><p>Emergent specialization from participation dynamics and multi-learner retraining Example D.9. Suppose there are two learners and three subpopulations with sizes &#946; 1 = &#946; 2 = &#946; and &#946; 3 = 1 -2&#946; for some 0 &lt; &#946; &lt; 1/2. Consider the following: R 1 (&#952;) = &#952; 2 , R 2 (&#952;) = (&#952; -1) 2 , R 2 (&#952;) = (&#952; -1-&#946; 1-2&#946; + &#1013;) 2 . The social welfare optimizing decision &#920; &#8902; = (1/2, 1-&#946; 1-2&#946; -&#1013;) corresponds to total risk &#946;/2. However, there is a stable equilibrium at &#920; eq = (0, 1 + &#1013;) with total risk &#946; + (&#946;-&#1013;) 2 1-2&#946; . For &#946; &#8594; 1/2, the gap becomes arbitrarily large.</p><p>Finally, we present an example which illustrates that even in the single learner setting, the risk of a subpopulation can be arbitrarily worse than the total risk.</p><p>Example D.10 (Arbitrarily high risk for minority subpopulation). Consider two subpopulations with R 1 (&#952;) = &#952; 2 and R 2 (&#952;) = (&#952; -&#981;) 2 with &#946; 1 = &#946; and &#946; 2 = 1 -&#946; and a single learner. The single equilibrium and total risk minimizer is &#952; 1 = (1 -&#946;)&#981; with total risk &#946;(1 -&#946;)&#981; 2 and R 2 (&#952; &#8902; ) = &#946; 2 &#981; 2 . The difference between the two quantities can be arbitrarily high as &#946; gets close to 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Additional Experiments with Noisy Dynamics</head><p>Figure <ref type="figure">8a</ref> replicates Fig. <ref type="figure">4a</ref> from the main text. The magenta-highlighted trajectory starts precisely at the unstable equilibrium, while the other three, initiated near this point, converge to the three possible split market equilibria, ordered by hue intensity: {(1,2), (3)}, {(2,3), (1)}, and {(1,3), (2)}. In Figure <ref type="figure">8b</ref>, while sub-population dynamics remain as in (a), learner updates experience uncorrelated external perturbations, causing trajectories to be different from (a). Nevertheless, the long term dynamics gravitate near stable split equilibria. Figure <ref type="figure">8c</ref> depicts learners updating decisions based on sampled empirical losses, with sub-populations adjusting participation based on aggregate empirical performance. The fact that each learner uses different samples from each sub-population adds sufficient un-correlated noise to create trajectories similar to when exogenous noise is added.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Experimental Details</head><p>Full experimental details along with instructions for reproducing them can be found at <ref type="url">https://github.com/mcurmei627/  MultiLearnerRiskReduction</ref>. The experiments used Python 3.10 on a MacBook Pro 2019. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Implementation details and reproduction instructions at: https://github.com/mcurmei627/MultiLearnerRiskReduction</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>This connects to economic literature on "rational" discrimination, where competitors have no inherent preference to discriminate and yet equilibria are segregated, e.g.<ref type="bibr">Foster and Vohra (1992)</ref> </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://www.statista.com/statistics/1337563/us-distribution-leading-social-media-platforms-by-gender/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>https://www.statista.com/statistics/1337525/us-distribution-leading-social-media-platforms-by-age-group/</p></note>
		</body>
		</text>
</TEI>
