Title: Distinct value computations support rapid sequential decisions
Abstract: The value of the environment determines animals’ motivational states and sets expectations for error-based learning [1–3]. How are values computed? Reinforcement learning systems can store or cache values of states or actions that are learned from experience, or they can compute values using a model of the environment to simulate possible futures [3]. These value computations have distinct trade-offs, and a central question is how neural systems decide which computations to use or whether/how to combine them [4–8]. Here we show that rats use distinct value computations for sequential decisions within single trials. We used high-throughput training to collect statistically powerful datasets from 291 rats performing a temporal wagering task with hidden reward states. Rats adjusted how quickly they initiated trials and how long they waited for rewards across states, balancing effort and time costs against expected rewards. Statistical modeling revealed that animals computed the value of the environment differently when initiating trials versus when deciding how long to wait for rewards, even though these decisions were only seconds apart. Moreover, value estimates interacted via a dynamic learning rate. Our results reveal how distinct value computations interact on rapid timescales, and demonstrate the power of using high-throughput training to understand rich, cognitive behaviors.
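For readers less familiar with the two families of value computation contrasted in the abstract, the sketch below pairs a cached (model-free) estimate, nudged toward each experienced reward, with a model-based estimate computed from a belief about the hidden reward state, and couples them through a toy prediction-error-driven dynamic learning rate. The two-state environment, reward statistics, and update rules are illustrative assumptions, not the task or models used in the paper.

```python
import numpy as np

# Hypothetical two-state environment: a "high" and a "low" reward block,
# each delivering rewards drawn from a different distribution.
REWARD_MEANS = {"high": 40.0, "low": 10.0}

def cached_value_update(v_cached, reward, learning_rate):
    """Model-free ("cached") estimate: nudge a running average toward
    each experienced reward, with no knowledge of the block structure."""
    return v_cached + learning_rate * (reward - v_cached)

def model_based_value(p_high):
    """Model-based estimate: combine a belief about the hidden state
    with known per-state reward statistics."""
    return p_high * REWARD_MEANS["high"] + (1 - p_high) * REWARD_MEANS["low"]

def dynamic_learning_rate(surprise, base=0.1, gain=0.02):
    """Toy dynamic learning rate: learn faster when the latest prediction
    error (surprise) is large; one way two value systems could interact."""
    return min(1.0, base + gain * abs(surprise))

rng = np.random.default_rng(0)
v_cached, p_high = 20.0, 0.5                       # cached value and belief in the high state
for t in range(20):
    state = "high" if t < 10 else "low"            # unsignalled block switch
    reward = rng.normal(REWARD_MEANS[state], 5.0)
    surprise = reward - v_cached
    alpha = dynamic_learning_rate(surprise)
    v_cached = cached_value_update(v_cached, reward, alpha)
    p_high = 0.9 * p_high + 0.1 * (reward > 25.0)  # crude belief update, illustrative only
    print(f"t={t:2d}  cached={v_cached:5.1f}  model-based={model_based_value(p_high):5.1f}  alpha={alpha:.2f}")
```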
Award ID(s):
2042796
PAR ID:
10475075
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Nature Communications
Volume:
14
Issue:
1
ISSN:
2041-1723
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We show how hippocampal replay could explain latent learning, a phenomenon observed in animals where unrewarded pre-exposure to an environment, i.e. habituation, improves task learning rates once rewarded trials begin. We first describe a computational model for spatial navigation inspired by rat studies. The model exploits offline replay of trajectories previously learned by applying reinforcement learning. Then, to assess our hypothesis, the model is evaluated in a “multiple T-maze” environment where rats need to learn a path from the start of the maze to the goal. Simulation results support our hypothesis that pre-exposed or habituated rats learn the task significantly faster than non-pre-exposed rats. Results also show that this effect increases with the number of pre-exposure trials.
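A minimal sketch of the mechanism described in item 1, assuming a linearized track in place of the multiple T-maze: tabular Q-learning fills a replay buffer during unrewarded pre-exposure, and offline replay after the first rewarded trials propagates value back through the remembered trajectories. The maze layout, parameters, and update rules are illustrative assumptions, not the authors' model.

```python
import random
from collections import defaultdict

N_STATES, GOAL, ACTIONS = 8, 7, (-1, +1)       # corridor states; move left / right

def step(state, action):
    nxt = max(0, min(N_STATES - 1, state + action))
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def greedy(q, state):
    best = max(q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if q[(state, a)] == best])

def run_episode(q, replay, rewarded, epsilon=0.2, alpha=0.5, gamma=0.95):
    state, done = 0, False
    while not done:
        action = random.choice(ACTIONS) if random.random() < epsilon else greedy(q, state)
        nxt, reward, done = step(state, action)
        reward = reward if rewarded else 0.0   # pre-exposure trials are unrewarded
        replay.append((state, action, reward, nxt))
        target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = nxt

def offline_replay(q, replay, sweeps=5, alpha=0.5, gamma=0.95):
    """Re-apply stored transitions offline, standing in for hippocampal replay;
    once rewarded transitions enter the buffer, replay propagates their value
    back through the remembered maze."""
    for _ in range(sweeps):
        for state, action, reward, nxt in replay:
            target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])

random.seed(1)
q, replay = defaultdict(float), []
for _ in range(10):                            # habituation: maze explored, no reward
    run_episode(q, replay, rewarded=False)
for _ in range(3):                             # rewarded trials begin
    run_episode(q, replay, rewarded=True)
    offline_replay(q, replay)                  # replay spreads reward over pre-exposed paths
print({s: round(q[(s, +1)], 2) for s in range(N_STATES)})
```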
  2. Humans and other animals make decisions under uncertainty. Choosing an option that provides information can improve decision making. However, subjects often choose information that does not increase the chances of obtaining reward. In a procedure that promotes such paradoxical choice, animals choose between two alternatives: the richer option is followed by a cue that is rewarded 50% of the time (No-info), and the leaner option is followed by one of two cues, one always rewarded (100%) and the other never rewarded (0%) (Info). Since decisions involve comparing the subjective value of options after integrating all their features, perhaps including information value, preference for information may rely on cortico-amygdalar circuitry. To test this, male and female Long-Evans rats were prepared with bilateral inhibitory DREADDs in the anterior cingulate cortex (ACC), orbitofrontal cortex (OFC), or basolateral amygdala (BLA), or with null virus infusions as a control. Using a counterbalanced design, we inhibited these regions after stable preference was acquired and during learning of new Info and No-info cues. We found that inhibition of ACC, but not OFC or BLA, selectively destabilized choice preference in female rats without affecting latency to choose or the response rate to cues. A logistic regression fit revealed that the previous choice strongly predicted preference in control animals, but not in female rats following ACC inhibition. BLA inhibition tended to decrease the learning of new cues that signaled the Info option, but had no effect on preference. The results reveal a causal, sex-dependent role for ACC in decisions involving information.
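The trial-history analysis mentioned in item 2 can be sketched as a logistic regression of the current Info/No-info choice on the previous trial's choice, with the fitted weight measuring how strongly preference carries over between trials. The simulated "sticky" choice sequences and parameter values below are assumptions for illustration only, not the study's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def simulate_choices(n_trials, stickiness):
    """Generate a choice sequence (1 = Info, 0 = No-info) in which each choice
    repeats the previous one with probability `stickiness`."""
    choices = [int(rng.integers(2))]
    for _ in range(n_trials - 1):
        repeat = rng.random() < stickiness
        choices.append(choices[-1] if repeat else 1 - choices[-1])
    return np.array(choices)

for label, stickiness in [("stable (control-like)", 0.85), ("destabilized (ACC-inhibited-like)", 0.55)]:
    y = simulate_choices(500, stickiness)
    prev = y[:-1].reshape(-1, 1)          # predictor: previous trial's choice
    curr = y[1:]                          # outcome: current trial's choice
    beta = LogisticRegression().fit(prev, curr).coef_[0][0]
    print(f"{label}: previous-choice weight = {beta:.2f}")
```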
  3. We study the problem of reinforcement learning for a task encoded by a reward machine. The task is defined over a set of properties in the environment, called atomic propositions, and represented by Boolean variables. One unrealistic assumption commonly used in the literature is that the truth values of these propositions are accurately known. In real situations, however, these truth values are uncertain since they come from sensors that suffer from imperfections. At the same time, reward machines can be difficult to model explicitly, especially when they encode complicated tasks. We develop a reinforcement-learning algorithm that infers a reward machine that encodes the underlying task while learning how to execute it, despite the uncertainties of the propositions’ truth values. In order to address such uncertainties, the algorithm maintains a probabilistic estimate about the truth value of the atomic propositions; it updates this estimate according to new sensory measurements that arrive from exploration of the environment. Additionally, the algorithm maintains a hypothesis reward machine, which acts as an estimate of the reward machine that encodes the task to be learned. As the agent explores the environment, the algorithm updates the hypothesis reward machine according to the obtained rewards and the estimate of the atomic propositions’ truth value. Finally, the algorithm uses a Q-learning procedure for the states of the hypothesis reward machine to determine an optimal policy that accomplishes the task. We prove that the algorithm successfully infers the reward machine and asymptotically learns a policy that accomplishes the respective task. 
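A rough sketch of two ingredients from item 3: tabular Q-learning over pairs of environment state and reward-machine state, with a Bayesian update turning noisy sensor readings into a probabilistic estimate of a single atomic proposition. The toy corridor, sensor model, and hand-coded two-state machine are illustrative assumptions; notably, the algorithm in item 3 also infers the reward machine rather than being handed one.

```python
import random
from collections import defaultdict

# Toy setup: a 6-state corridor where the proposition "at_goal" holds only in
# the last state and is read through a noisy sensor.  A fixed two-state reward
# machine gives reward 1 when the proposition is first believed true.
N_STATES, GOAL, ACTIONS = 6, 5, (-1, +1)
SENSOR_ACC = 0.95

def sensed_belief(state, n_readings=3, prior=0.5):
    """Probabilistic estimate of `at_goal` from several noisy readings."""
    p = prior
    for _ in range(n_readings):
        truth = (state == GOAL)
        reading = truth if random.random() < SENSOR_ACC else not truth
        like_t = SENSOR_ACC if reading else 1 - SENSOR_ACC
        like_f = 1 - SENSOR_ACC if reading else SENSOR_ACC
        p = like_t * p / (like_t * p + like_f * (1 - p))   # Bayes update
    return p

def policy(q, env, rm, epsilon=0.3):
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    best = max(q[(env, rm, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if q[(env, rm, a)] == best])

q, alpha, gamma = defaultdict(float), 0.5, 0.95
random.seed(0)
for _ in range(300):
    env, rm = 0, 0                                        # rm state 0 = task not yet satisfied
    for _ in range(40):
        a = policy(q, env, rm)
        nxt = max(0, min(N_STATES - 1, env + a))
        rm_next = 1 if sensed_belief(nxt) > 0.9 else rm   # machine transition on belief
        reward = 1.0 if (rm == 0 and rm_next == 1) else 0.0
        target = reward + gamma * max(q[(nxt, rm_next, x)] for x in ACTIONS)
        q[(env, rm, a)] += alpha * (target - q[(env, rm, a)])
        env, rm = nxt, rm_next
        if rm == 1:
            break
print({s: round(q[(s, 0, +1)], 2) for s in range(N_STATES)})
```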
  4. Humans and other animals are capable of reasoning. However, there are overwhelming examples of errors or anomalies in reasoning. In two experiments, we studied whether rats, like humans, estimate the conjunction of two events as more likely than each event independently, a phenomenon that has been called the conjunction fallacy. In both experiments, rats learned through food reinforcement to press a lever under some cue conditions but not others. Sound B was rewarded whereas Sound A was not. However, when B was presented with the visual cue Y, the compound was not rewarded, whereas the compound AX was rewarded (i.e., A-, AX+, B+, BY-). Both visual cues were presented in the same bulb. After training, rats received test sessions in which A and B were presented with the bulb either explicitly off or occluded by a metal piece. Thus, in the occluded condition it was ambiguous whether the trials were of the elements alone (A or B) or of the compounds (AX or BY). Rats responded in the occluded condition as if the compound cues were most likely present. The second experiment investigated whether this error in probability estimation in Experiment 1 could be due to a conjunction fallacy, and whether it could be attenuated by increasing the ratio of element to compound trials from the original 50-50 to 70-30 and 90-10. Only the 90-10 condition (where 90% of the training trials were of just A or just B) did not show a conjunction fallacy, though the fallacy emerged in all groups with additional training. These findings open new avenues for exploring the mechanisms behind the conjunction fallacy effect.
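A small numeric illustration of the probability judgement at issue in item 4: on an occluded trial where sound A is playing, the chance that the visual cue X is also present can never exceed the chance of A alone, and under the 90-10 training mix it is in fact low. Treating the training trial frequencies as the relevant probabilities is an assumption made here for illustration.

```python
# For each training mix, the probability that the occluded trial is a compound
# (A and X together) is just the compound share of A trials, which is always
# at most 1 and is small when element trials dominate.
for element_share in (0.5, 0.7, 0.9):
    p_compound_given_a = 1.0 - element_share   # P(X also present | sound A playing)
    print(f"{int(round(element_share * 100))}-{int(round((1 - element_share) * 100))} training: "
          f"P(X also present | A) = {p_compound_given_a:.1f}")
```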
  5. Identifying critical decisions is one of the most challenging decision-making problems in real-world applications. In this work, we propose a novel Reinforcement Learning (RL) based Long-Short Term Rewards (LSTR) framework for critical decision identification. RL is a machine learning area concerned with inducing effective decision-making policies, following which yields the maximum cumulative "reward." Many RL algorithms find the optimal policy by estimating the optimal Q-values, which specify the maximum cumulative reward the agent can receive. In our LSTR framework, the "long term" rewards are defined as "Q-values" and the "short term" rewards are determined by the "reward function." Experiments on a synthetic GridWorld game and real-world Intelligent Tutoring System datasets show that the proposed LSTR framework indeed identifies the critical decisions in the sequences. Furthermore, our results show that carrying out the critical decisions alone is as effective as a fully-executed policy.
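One simple way to operationalize the "critical decision" idea in item 5 is to treat learned Q-values as the long-term rewards and flag states where the gap between the best and worst action's Q-value is large, i.e. where the choice matters most. The tabular Q-values, the gap criterion, and the threshold below are illustrative assumptions, not the LSTR framework's exact definitions.

```python
import numpy as np

def critical_states(q_table, threshold):
    """Return (state, gap) pairs whose best-vs-worst Q-value gap exceeds the threshold."""
    gaps = q_table.max(axis=1) - q_table.min(axis=1)
    return [(s, round(float(g), 2)) for s, g in enumerate(gaps) if g > threshold]

rng = np.random.default_rng(0)
q_table = rng.uniform(0.0, 1.0, size=(8, 2))   # stand-in for learned Q-values (8 states, 2 actions)
q_table[3] = [0.95, 0.05]                       # one decision where the choice is pivotal
print(critical_states(q_table, threshold=0.6))
```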