Title: Emergent specialization from participation dynamics and multi-learner retraining
Numerous online services are data-driven: the behavior of users affects the system’s parameters, and the system’s parameters affect the users’ experience of the service, which in turn affects the way users may interact with the system. For example, people may choose to use a service only for tasks that already work well, or they may choose to switch to a different service. These adaptations influence the ability of a system to learn about a population of users and tasks in order to improve its performance broadly. In this work, we analyze a class of such dynamics, in which users allocate their participation amongst services to reduce the individual risk they experience, and services update their model parameters to reduce the service’s risk on their current user population. We refer to these dynamics as risk-reducing; they cover a broad class of common model updates, including gradient descent and multiplicative weights. For this general class of dynamics, we show that asymptotically stable equilibria are always segmented, with sub-populations allocated to a single learner. Under mild assumptions, the utilitarian social optimum is a stable equilibrium. In contrast to previous work, which shows that repeated risk minimization can result in representation disparity and high overall loss with a single learner (Hashimoto et al., 2018; Miller et al., 2021), we find that repeated myopic updates with multiple learners lead to better outcomes. We illustrate the phenomena via a simulated example initialized from real data.
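The dynamics described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' code or experimental setup; it assumes a deliberately simple setting chosen for illustration: each sub-population is summarized by a scalar mean, each learner holds a scalar parameter, risk is squared error, users reallocate participation with a multiplicative-weights step, and learners take a gradient step on the risk of their current users. The function name `simulate_risk_reducing` and all parameters are hypothetical.

```python
import numpy as np

def simulate_risk_reducing(means, sizes, theta0, steps=500,
                           lr_model=0.1, lr_alloc=0.5):
    """Toy risk-reducing dynamics (illustrative assumption, not the paper's model).

    means  : per-sub-population target value, shape (n,)
    sizes  : relative size of each sub-population, shape (n,)
    theta0 : initial scalar parameter of each learner, shape (m,)
    Returns (alloc, theta): alloc[i, j] is the fraction of sub-population i
    participating in learner j.
    """
    means = np.asarray(means, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    theta = np.array(theta0, dtype=float)
    n, m = len(means), len(theta)
    alloc = np.full((n, m), 1.0 / m)  # start with uniform participation

    for _ in range(steps):
        # Squared-error risk each sub-population experiences at each learner.
        diff = means[:, None] - theta[None, :]
        risk = diff ** 2

        # Users: multiplicative-weights step toward lower-risk learners.
        alloc = alloc * np.exp(-lr_alloc * risk)
        alloc /= alloc.sum(axis=1, keepdims=True)

        # Learners: gradient step on the average risk over current users.
        mass = sizes[:, None] * alloc
        total = np.maximum(mass.sum(axis=0), 1e-12)  # guard against empty learners
        grad = -2.0 * (mass * diff).sum(axis=0) / total
        theta = theta - lr_model * grad

    return alloc, theta

if __name__ == "__main__":
    alloc, theta = simulate_risk_reducing(
        means=[-1.0, 1.0], sizes=[0.5, 0.5], theta0=[-0.1, 0.1])
    print(np.round(alloc, 3))
    print(np.round(theta, 3))
```

In this two-group, two-learner toy run, each sub-population ends up allocated to a single learner and each learner's parameter settles at its sub-population's mean, which is one instance of the segmented outcome the abstract describes.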
Award ID(s):
2312775 2023166
PAR ID:
10547198
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Proceedings of the 27th International Conference on Artificial Intelligence and Statistics
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Numerous online services are data-driven: the behavior of users affects the system’s parameters, and the system’s parameters affect the users’ experience of the service, which in turn affects the way users may interact with the system. For example, people may choose to use a service only for tasks that already work well, or they may choose to switch to a different service. These adaptations influence the ability of a system to learn about a population of users and tasks in order to improve its performance broadly. In this work, we analyze a class of such dynamics, in which users allocate their participation amongst services to reduce the individual risk they experience, and services update their model parameters to reduce the service’s risk on their current user population. We refer to these dynamics as risk-reducing; they cover a broad class of common model updates, including gradient descent and multiplicative weights. For this general class of dynamics, we show that asymptotically stable equilibria are always segmented, with sub-populations allocated to a single learner. Under mild assumptions, the utilitarian social optimum is a stable equilibrium. In contrast to previous work, which shows that repeated risk minimization can result in representation disparity and high overall loss with a single learner (Hashimoto et al., 2018; Miller et al., 2021), we find that repeated myopic updates with multiple learners lead to better outcomes. We illustrate the phenomena via a simulated example initialized from real data.
  2. Globerson, A; Mackey, L; Belgrave, D; Fan, A; Paquet, U; Tomczak, J; Zhang, C (Ed.)
    This paper investigates ML systems serving a group of users, with multiple models/services, each aimed at specializing to a sub-group of users. We consider settings where, upon deploying a set of services, users choose the one minimizing their personal losses and the learner iteratively learns by interacting with diverse users. Prior research shows that the outcomes of learning dynamics, which comprise both the services' adjustments and users' service selections, hinge significantly on the initial conditions. However, finding good initial conditions faces two main challenges: (i) Bandit feedback: typically, data on user preferences are not available before deploying services and observing user behavior; (ii) Suboptimal local solutions: the total loss landscape (i.e., the sum of loss functions across all users and services) is not convex, and gradient-based algorithms can get stuck in poor local minima. We address these challenges with a randomized algorithm to adaptively select a minimal set of users for data collection in order to initialize a set of services. Under mild assumptions on the loss functions, we prove that our initialization leads to a total loss within a factor of the globally optimal total loss with complete user preference data, and this factor scales logarithmically in the number of services. This result is a generalization of the well-known k-means++ guarantee to a broad problem class, which is also of independent interest. The theory is complemented by experiments on real as well as semi-synthetic datasets.
  3. Age of information has been proposed recently to measure information freshness, especially for a class of real-time video applications. These applications often demand timely updates with edge cloud computing to guarantee the user experience. However, the edge cloud is usually equipped with limited computation and network resources and therefore, resource contention among different video streams can contribute to making the updates stale. Aiming to minimize a penalty function of the weighted sum of the average age over multiple end users, this paper presents a greedy traffic scheduling policy for the processor to choose the next processing request with the maximum immediate penalty reduction. In this work, we formulate the service process when requests from multiple users arrive at edge cloud servers asynchronously and show that the proposed greedy scheduling algorithm is the optimal work-conserving policy for a class of age penalty functions.
  4. Age of information has been proposed recently to measure information freshness, especially for a class of real-time video applications. These applications often demand timely updates with edge cloud computing to guarantee the user experience. However, the edge cloud is usually equipped with limited computation and network resources and therefore, resource contention among different video streams can contribute to making the updates stale. Aiming to minimize a penalty function of the weighted sum of the average age over multiple end users, this paper presents a greedy traffic scheduling policy for the processor to choose the next processing request with the maximum immediate penalty reduction. In this work, we formulate the service process when requests from multiple users arrive at edge cloud servers asynchronously and show that the proposed greedy scheduling algorithm is the optimal work-conserving policy for a class of age penalty functions.
  5. It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK)—which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization—describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods. 