Title: Explaining Deep Adaptive Programs via Reward Decomposition
Adaptation-Based Programming (ABP) allows programmers to employ "choice points" at program locations where they are uncertain about how best to code the program logic. Reinforcement learning (RL) is then used to automatically learn to make choice-point decisions that optimize the reward achieved by the program. In this paper, we consider a new approach to explaining the learned decisions of adaptive programs. The key idea is to include simple program annotations that define multiple semantically meaningful reward types, which compose to define the overall reward signal used for learning. Using these reward types we define the notion of reward difference explanations (RDXs), which aim to explain why, at a choice point, an alternative A was selected over another alternative B. An RDX gives the difference in the predicted future reward of each type when selecting A versus B and then continuing to run the adaptive program. Significant differences can provide insight into why A was or was not preferred to B. We describe a SARSA-style learning algorithm for learning to optimize the choices at each choice point, while also learning side information for producing RDXs. We demonstrate this explanation approach through a case study in a synthetic domain, which shows the general promise of the approach and highlights future research questions.
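The reward-decomposition and RDX ideas above can be made concrete with a small sketch. The snippet below is a minimal illustration, not the paper's implementation: it assumes tabular choice-point states, a hypothetical class name `DecomposedSarsa`, and a SARSA-style update applied independently to each reward type, with an `rdx` helper returning the per-type difference in predicted future reward between two alternatives.

```python
# Minimal sketch of decomposed SARSA + reward difference explanations (RDXs).
# Names, hyperparameters, and the tabular representation are assumptions.
from collections import defaultdict

class DecomposedSarsa:
    def __init__(self, reward_types, actions, alpha=0.1, gamma=0.95):
        self.reward_types = reward_types      # e.g. ["progress", "penalty"]
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        # One Q-table per reward type: q[c][(state, action)] = predicted future reward of type c.
        self.q = {c: defaultdict(float) for c in reward_types}

    def q_total(self, s, a):
        # The overall value of an action is the sum of its per-type components.
        return sum(self.q[c][(s, a)] for c in self.reward_types)

    def update(self, s, a, rewards, s_next, a_next, done=False):
        # SARSA-style update applied independently to each reward type;
        # `rewards` maps each type to the reward of that type observed this step.
        for c in self.reward_types:
            target = rewards.get(c, 0.0)
            if not done:
                target += self.gamma * self.q[c][(s_next, a_next)]
            self.q[c][(s, a)] += self.alpha * (target - self.q[c][(s, a)])

    def rdx(self, s, a, b):
        # RDX: per-type difference in predicted future reward when
        # selecting `a` rather than `b` at choice point `s`.
        return {c: self.q[c][(s, a)] - self.q[c][(s, b)] for c in self.reward_types}

# Example: explain why alternative "A" is (or is not) preferred to "B" at choice point "cp1".
agent = DecomposedSarsa(["progress", "penalty"], ["A", "B"])
agent.update("cp1", "A", {"progress": 1.0, "penalty": -0.2}, "cp2", "A")
print(agent.rdx("cp1", "A", "B"))   # e.g. {'progress': 0.1, 'penalty': -0.02}
```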
Award ID(s):
1717300
PAR ID:
10096985
Author(s) / Creator(s):
Date Published:
Journal Name:
IJCAI/ECAI Workshop on Explainable Artificial Intelligence
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Primate vision is characterized by constant, sequential processing and selection of visual targets to fixate. Although expected reward is known to influence both processing and selection of visual targets, similarities and differences between these effects remain unclear mainly because they have been measured in separate tasks. Using a novel paradigm, we simultaneously measured the effects of reward outcomes and expected reward on target selection and sensitivity to visual motion in monkeys. Monkeys freely chose between two visual targets and received a juice reward with varying probability for eye movements made to either of them. Targets were stationary apertures of drifting gratings, causing the end points of eye movements to these targets to be systematically biased in the direction of motion. We used this motion-induced bias as a measure of sensitivity to visual motion on each trial. We then performed different analyses to explore effects of objective and subjective reward values on choice and sensitivity to visual motion to find similarities and differences between reward effects on these two processes. Specifically, we used different reinforcement learning models to fit choice behavior and estimate subjective reward values based on the integration of reward outcomes over multiple trials. Moreover, to compare the effects of subjective reward value on choice and sensitivity to motion directly, we considered correlations between each of these variables and integrated reward outcomes on a wide range of timescales. We found that, in addition to choice, sensitivity to visual motion was also influenced by subjective reward value, although the motion was irrelevant for receiving reward. Unlike choice, however, sensitivity to visual motion was not affected by objective measures of reward value. Moreover, choice was determined by the difference in subjective reward values of the two options, whereas sensitivity to motion was influenced by the sum of values. Finally, models that best predicted visual processing and choice used sets of estimated reward values based on different types of reward integration and timescales. Together, our results demonstrate separable influences of reward on visual processing and choice, and point to the presence of multiple brain circuits for the integration of reward outcomes. 
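A minimal sketch of the computational idea summarized in item 1, with hypothetical function names and parameter values (not the study's fitted model): subjective values are integrated from reward outcomes with a delta rule, choice depends on the difference of the two values, and sensitivity to reward-irrelevant motion is modulated by their sum.

```python
# Illustrative only; functional forms and constants are assumptions.
import math

def update_value(v, reward, lr=0.2):
    # Delta-rule integration of reward outcomes over trials.
    return v + lr * (reward - v)

def choice_prob(v_left, v_right, beta=5.0):
    # Choice driven by the *difference* in subjective values (logistic/softmax).
    return 1.0 / (1.0 + math.exp(-beta * (v_left - v_right)))

def motion_sensitivity(v_left, v_right, base=1.0, k=0.5):
    # Sensitivity to visual motion modulated by the *sum* of subjective values.
    return base + k * (v_left + v_right)
```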
  2. Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM’s threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run. 
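A minimal sketch of the constrained-RL idea summarized in item 2, under assumed functional forms (not the paper's method): each component reward model receives a dynamic weight expressed as a Lagrange multiplier that grows when that component's score drifts past the threshold at which it stops being a useful proxy, discouraging further overoptimization of that component.

```python
# Illustrative only; the penalty form and dual-ascent rule are assumptions.
def combined_reward(component_scores, thresholds, lambdas):
    # Reward the agent optimizes: component scores penalized in proportion
    # to how far each exceeds its usefulness threshold.
    return sum(s - lam * max(0.0, s - th)
               for s, th, lam in zip(component_scores, thresholds, lambdas))

def update_lambdas(component_scores, thresholds, lambdas, lr=0.01):
    # Dual ascent on the multipliers: increase a weight when its constraint
    # is violated, decay it toward zero (never negative) otherwise.
    return [max(0.0, lam + lr * (s - th))
            for s, th, lam in zip(component_scores, thresholds, lambdas)]
```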
  3. Cai, Ming Bo (Ed.)
A major advance in understanding learning behavior stems from experiments showing that reward learning requires dopamine inputs to striatal neurons and arises from synaptic plasticity of cortico-striatal synapses. Numerous reinforcement learning models mimic this dopamine-dependent synaptic plasticity by using the reward prediction error, which resembles dopamine neuron firing, to learn the best action in response to a set of cues. Though these models can explain many facets of behavior, reproducing some types of goal-directed behavior, such as renewal and reversal, requires additional model components. Here we present a reinforcement learning model, TD2Q, which better corresponds to the basal ganglia with two Q matrices, one representing direct pathway neurons (G) and another representing indirect pathway neurons (N). Unlike previous two-Q architectures, a novel and critical aspect of TD2Q is to update the G and N matrices utilizing the temporal difference reward prediction error. A best action is selected for N and G using a softmax with a reward-dependent adaptive exploration parameter, and then differences are resolved using a second selection step applied to the two action probabilities. The model is tested on a range of multi-step tasks including extinction, renewal, discrimination; switching reward probability learning; and sequence learning. Simulations show that TD2Q produces behaviors similar to rodents in choice and sequence learning tasks, and that use of the temporal difference reward prediction error is required to learn multi-step tasks. Blocking the update rule on the N matrix blocks discrimination learning, as observed experimentally. Performance in the sequence learning task is dramatically improved with two matrices. These results suggest that including additional aspects of basal ganglia physiology can improve the performance of reinforcement learning models, better reproduce animal behaviors, and provide insight as to the role of direct- and indirect-pathway striatal neurons.
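A minimal sketch of the two-matrix idea summarized in item 3, with assumed details that may differ from the published TD2Q model: both the direct-pathway (G) and indirect-pathway (N) matrices are updated from a temporal-difference reward prediction error, each proposes an action via softmax, and a second selection step resolves disagreements using the two action probabilities.

```python
# Illustrative only; sign conventions, TD target, and the resolution rule are assumptions.
import math, random

def softmax_choice(q_row, beta):
    # Softmax over one row of Q-values; returns chosen index and the probabilities.
    exps = [math.exp(beta * q) for q in q_row]
    z = sum(exps)
    probs = [e / z for e in exps]
    return random.choices(range(len(q_row)), weights=probs)[0], probs

def td2q_step(G, N, state, next_state, action, reward, alpha=0.1, gamma=0.9):
    # A single TD reward prediction error drives updates of both matrices
    # (opposite signs for G and N are an assumption of this sketch).
    delta = reward + gamma * max(G[next_state]) - G[state][action]
    G[state][action] += alpha * delta
    N[state][action] -= alpha * delta
    return delta

def select_action(G, N, state, beta_g=2.0, beta_n=2.0):
    a_g, p_g = softmax_choice(G[state], beta_g)
    a_n, p_n = softmax_choice([-q for q in N[state]], beta_n)  # N opposes actions it weights highly
    if a_g == a_n:
        return a_g
    # Second selection step: resolve disagreement via the two action probabilities.
    return a_g if p_g[a_g] >= p_n[a_n] else a_n
```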
Real-world choice options have many features or attributes, whereas the reward outcome from those options depends on only a few of them. It has been shown that humans combine feature-based learning with more complex conjunction-based learning to tackle the challenges of learning in naturalistic reward environments. However, it remains unclear how different learning strategies interact to determine which features or conjunctions should be attended to and control choice behavior, and how subsequent attentional modulations influence future learning and choice. To address these questions, we examined the behavior of male and female human participants during a three-dimensional learning task in which reward outcomes for different stimuli could be predicted based on a combination of an informative feature and conjunction. Using multiple approaches, we found that both choice behavior and reward probabilities estimated by participants were most accurately described by attention-modulated models that learned the predictive values of both the informative feature and the informative conjunction. Specifically, in the reinforcement learning model that best fit choice data, attention was controlled by the difference in the integrated feature and conjunction values. The resulting attention weights modulated learning by increasing the learning rate on attended features and conjunctions. Critically, modulating decision-making by attention weights did not improve the fit to the data, providing little evidence for direct attentional effects on choice. These results suggest that in multidimensional environments, humans direct their attention not only to selectively process reward-predictive attributes but also to find parsimonious representations of the reward contingencies for more efficient learning.
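A minimal sketch of the attention-modulated learning summarized in item 4, with hypothetical names and parameters (not the study's fitted model): attention is divided between the informative feature and the informative conjunction according to the difference in their integrated values, and the resulting weights scale each component's learning rate.

```python
# Illustrative only; the softmax attention rule and constants are assumptions.
import math

def attention_weights(v_feature, v_conjunction, omega=3.0):
    # Softmax over the two integrated values: the larger value draws more attention.
    w_f = math.exp(omega * v_feature)
    w_c = math.exp(omega * v_conjunction)
    z = w_f + w_c
    return w_f / z, w_c / z

def attended_update(v_feature, v_conjunction, reward, lr=0.2):
    # Attention weights multiply the learning rate on each value component.
    w_f, w_c = attention_weights(v_feature, v_conjunction)
    v_feature += lr * w_f * (reward - v_feature)
    v_conjunction += lr * w_c * (reward - v_conjunction)
    return v_feature, v_conjunction
```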
  5. Viale, R. (Ed.)
Alternative-based approaches to decision making generate overall values for each option in a choice set by processing information within options before comparing options to arrive at a decision. By contrast, attribute-based approaches compare attributes (such as monetary cost and time delay to receipt of a reward) across options and use these attribute comparisons to make a decision. Because they compare attributes, they may not use all available information to make a choice, which categorizes many of them as heuristics. Attribute-based models can predict choice better than alternative-based models in some situations (e.g., when there are many options in the choice set, or when calculating an overall value for an option is too cognitively taxing). Process data comparing alternative-based and attribute-based processing, obtained from eye-tracking and mouse-tracking technology, support these findings. Data on attribute-based models thus align with the notion of bounded rationality: people make use of heuristics to make good decisions under time pressure and under informational and computational constraints. Further study of attribute-based models and processing would enhance our understanding of how individuals process information and make decisions.
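A minimal, purely illustrative sketch of the distinction drawn in item 5 (function names and the specific heuristic are assumptions): an alternative-based rule integrates all attributes within each option before comparing options, while an attribute-based heuristic compares options one attribute at a time and may ignore the rest.

```python
# Illustrative only; both rules operate on options given as attribute dicts,
# e.g. {"amount": 10.0, "delay": -5.0}.
def alternative_based(options, weights):
    # Score each option by a weighted sum over *all* of its attributes,
    # then pick the option with the highest overall value.
    scores = [sum(weights[a] * v for a, v in opt.items()) for opt in options]
    return max(range(len(options)), key=lambda i: scores[i])

def attribute_based(options, attribute_order, threshold=0.0):
    # Lexicographic-style heuristic: use the first attribute (in priority order)
    # that discriminates between options by more than `threshold`.
    for attr in attribute_order:
        values = [opt[attr] for opt in options]
        if max(values) - min(values) > threshold:
            return max(range(len(options)), key=lambda i: values[i])
    return 0  # fall back to the first option if nothing discriminates
```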