Similar Documents
20 similar documents were found (search time: 593 ms).
1.
Previous reports have described that neural activities in midbrain dopamine areas are sensitive to unexpected reward delivery and omission. These activities are correlated with reward prediction error in reinforcement learning models, that is, the difference between predicted reward values and the obtained reward outcome. These findings suggest that the reward prediction error signal in the brain updates reward prediction through stimulus-reward experiences. It remains unknown, however, how sensory processing of reward-predicting stimuli contributes to the computation of reward prediction error. To elucidate this issue, we examined the relation between the discriminability of reward-predicting stimuli and the reward prediction error signal in the brain using functional magnetic resonance imaging (fMRI). Before the main experiments, subjects learned an association between the orientation of a perceptually salient (high-contrast) Gabor patch and a juice reward. The subjects were then presented with lower-contrast Gabor patch stimuli to predict a reward. We calculated the correlation between fMRI signals and reward prediction error in two reinforcement learning models: a model including the modulation of reward prediction by stimulus discriminability and a model excluding this modulation. Results showed that fMRI signals in the midbrain are more highly correlated with reward prediction error in the model that includes stimulus discriminability than in the model that excludes it. No regions showed a higher correlation with the model that excludes stimulus discriminability. Moreover, the difference in correlation between the two models was significant from the first session of the experiment, suggesting that reward computation in the midbrain was modulated by stimulus discriminability even before a new contingency between perceptually ambiguous stimuli and a reward had been learned. These results suggest that the human reward system can flexibly incorporate the level of stimulus discriminability into reward computations by modulating previously acquired reward values for a typical stimulus.
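A minimal sketch, not the authors' actual model, of the two prediction-error computations being compared. It assumes the discriminability of a low-contrast Gabor patch can be summarized as a single probability of identifying its orientation (`p_correct`, an illustrative parameter):

```python
def rpe_with_discriminability(v_learned, p_correct, reward):
    """Reward prediction scaled by how well the stimulus can be discriminated."""
    prediction = p_correct * v_learned      # degraded stimulus -> weaker prediction
    return reward - prediction

def rpe_without_discriminability(v_learned, reward):
    """Prediction uses the previously learned value unmodified."""
    return reward - v_learned

# Value learned with the high-contrast Gabor patch is 1.0 (juice), the test patch
# is hard to see (p_correct = 0.6), and the reward is delivered.
v, p, r = 1.0, 0.6, 1.0
print(rpe_with_discriminability(v, p, r))   # 0.4: reward remains partly unexpected
print(rpe_without_discriminability(v, r))   # 0.0: reward counted as fully predicted
```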

2.
Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a "reward prediction error" (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function.

3.
The Drosophila mushroom body exhibits dopamine-dependent synaptic plasticity that underlies the acquisition of associative memories. Recordings of dopamine neurons in this system have identified signals related to external reinforcement such as reward and punishment. However, other factors including locomotion, novelty, reward expectation, and internal state have also recently been shown to modulate dopamine neurons. This heterogeneity is at odds with typical modeling approaches in which these neurons are assumed to encode a global, scalar error signal. How is dopamine-dependent plasticity coordinated in the presence of such heterogeneity? We develop a modeling approach that infers a pattern of dopamine activity sufficient to solve defined behavioral tasks, given architectural constraints informed by knowledge of mushroom body circuitry. Model dopamine neurons exhibit diverse tuning to task parameters while nonetheless producing coherent learned behaviors. Notably, reward prediction error emerges as a mode of population activity distributed across these neurons. Our results provide a mechanistic framework that accounts for the heterogeneity of dopamine activity during learning and behavior.

4.
Braver TS, Brown JW. Neuron, 2003, 38(2): 150-152.
Accumulating evidence from nonhuman primates suggests that midbrain dopamine cells code reward prediction errors and that this signal subserves reward learning in dopamine-receiving brain structures. In this issue of Neuron, McClure et al. and O'Doherty et al. use event-related fMRI to provide some of the strongest evidence to date that the reward prediction error model of dopamine system activity applies equally well to human reward learning.

5.
6.
Reinforcement learning (RL) has become a dominant paradigm for understanding animal behaviors and neural correlates of decision-making, in part because of its ability to explain Pavlovian conditioned behaviors and the role of midbrain dopamine activity as reward prediction error (RPE). However, recent experimental findings indicate that dopamine activity, contrary to the RL hypothesis, may not signal RPE and differs based on the type of Pavlovian response (e.g. sign- and goal-tracking responses). In this study, we address this discrepancy by introducing a new neural correlate for learning reward predictions, called "cue-evoked reward": a recall of reward evoked by the cue that is learned through simple cue-reward associations. We introduce a temporal difference learning model in which neural correlates of the cue itself and of the cue-evoked reward underlie learning of reward predictions. The animal's reward prediction supported by these two correlates is divided into sign and goal components, respectively. We relate the sign and goal components to approach responses towards the cue (i.e. sign-tracking) and the food tray (i.e. goal-tracking), respectively. We found a number of correspondences between the simulated models and the experimental findings (i.e. behavior and neural responses). First, the development of modeled responses is consistent with those observed in the experimental task. Second, the model's RPEs were similar to dopamine activity in the respective response groups. Finally, goal-tracking, but not sign-tracking, responses rapidly emerged when RPE was restored in the simulated models, similar to experimental recovery from a dopamine antagonist. These results suggest that two complementary neural correlates, corresponding to the cue and the reward it evokes, form the basis for learning reward predictions in sign- and goal-tracking rats.
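A rough sketch of the two-component idea under simplifying assumptions; `w_sign`, `w_goal`, `k_sign`, and `k_goal` are illustrative names and parameters, not those of the published model. The cue itself and the cue-evoked reward recall each carry part of the total prediction, and a single RPE updates both:

```python
def simulate(k_sign, k_goal, n_trials=200, alpha=0.1, reward=1.0):
    """k_sign / k_goal: how strongly this animal relies on the cue itself versus
    the cue-evoked reward recall (illustrative parameters, not from the paper)."""
    w_sign = w_goal = 0.0
    for _ in range(n_trials):
        prediction = k_sign * w_sign + k_goal * w_goal   # total reward prediction
        rpe = reward - prediction                        # shared prediction error
        w_sign += alpha * k_sign * rpe   # drives approach to the cue (sign-tracking)
        w_goal += alpha * k_goal * rpe   # drives approach to the food tray (goal-tracking)
    return w_sign, w_goal

# A "sign-tracker" weights the cue correlate heavily; a "goal-tracker" the reward recall.
print(simulate(k_sign=1.0, k_goal=0.2))
print(simulate(k_sign=0.2, k_goal=1.0))
```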

7.
Sensitivity to time, including the time of reward, guides the behaviour of all organisms. Recent research suggests that all major reward structures of the brain process the time of reward occurrence, including midbrain dopamine neurons, striatum, frontal cortex and amygdala. Neuronal reward responses in dopamine neurons, striatum and frontal cortex show temporal discounting of reward value. The prediction error signal of dopamine neurons includes the predicted time of rewards. Neurons in the striatum, frontal cortex and amygdala show responses to reward delivery and activities anticipating rewards that are sensitive to the predicted time of reward and the instantaneous reward probability. Together these data suggest that internal timing processes have several well characterized effects on neuronal reward processing.
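The temporal-discounting point can be written compactly. The exponential form is the one used in TD models and the hyperbolic form is common in behavioural work; this is a generic illustration, not taken from any one study in the review:

```python
def discounted_value_exponential(reward, delay, gamma=0.9):
    """TD-style discounting: value falls geometrically with delay (in time steps)."""
    return reward * gamma ** delay

def discounted_value_hyperbolic(reward, delay, k=0.5):
    """Hyperbolic discounting: value falls as 1 / (1 + k * delay)."""
    return reward / (1.0 + k * delay)

for d in (0, 2, 4, 8):
    print(d, round(discounted_value_exponential(1.0, d), 3),
          round(discounted_value_hyperbolic(1.0, d), 3))
```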

8.
In contemporary reinforcement learning models, reward prediction error (RPE), the difference between the expected and actual reward, is thought to guide action-value learning through the firing activity of dopaminergic neurons. Given the importance of dopamine in reward learning and the involvement of Akt1 in dopamine-dependent behaviors, the aim of this study was to investigate whether Akt1 deficiency modulates reward learning and the magnitude of RPE, using Akt1 mutant mice as a model. In comparison to wild-type littermate controls, the expression of Akt1 proteins in mouse brains occurred in a gene-dosage-dependent manner, and Akt1 heterozygous (HET) mice exhibited impaired striatal Akt1 activity under methamphetamine challenge. No genotypic difference was found in the basal levels of dopamine and its metabolites. In a series of reward-related learning tasks, HET mice updated reward information from the environment relatively efficiently during the acquisition phase of the two natural reward tasks and in the reversal section of the dynamic foraging T-maze, but not in methamphetamine-induced or aversion-related reward learning. A standard reinforcement learning model with Bayesian hierarchical parameter estimation shows that HET mice have higher RPE magnitudes and that their action values are updated more rapidly across all three test sections of the T-maze. These results indicate that Akt1 deficiency modulates natural reward learning and RPE. This study demonstrates a promising avenue for investigating RPE in mutant mice and provides evidence for a potential link from genetic deficiency, to neurobiological abnormalities, to impairment in higher-order cognitive functioning.
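A minimal sketch of the kind of standard action-value model referred to here, a Q-learning rule with softmax choice on a two-armed task; parameter names and values are generic, not those fitted in the study. The RPE is the quantity whose magnitude is compared between genotypes, and the learning rate controls how rapidly action values are updated:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_two_armed_task(alpha, beta, p_reward=(0.8, 0.2), n_trials=200):
    """alpha: learning rate; beta: softmax inverse temperature (illustrative values)."""
    q = np.zeros(2)                      # action values for the two arms
    abs_rpes = []
    for _ in range(n_trials):
        p_choice = np.exp(beta * q) / np.exp(beta * q).sum()
        a = rng.choice(2, p=p_choice)
        r = float(rng.random() < p_reward[a])
        rpe = r - q[a]                   # reward prediction error for the chosen arm
        q[a] += alpha * rpe              # value updated in proportion to the RPE
        abs_rpes.append(abs(rpe))
    return q.round(2), round(float(np.mean(abs_rpes)), 3)

# Compare how the learning rate shapes value updating and RPE magnitude.
print(run_two_armed_task(alpha=0.3, beta=3.0))
print(run_two_armed_task(alpha=0.1, beta=3.0))
```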

9.
An influential concept in contemporary computational neuroscience is the reward prediction error hypothesis of phasic dopaminergic function. It maintains that midbrain dopaminergic neurons signal the occurrence of unpredicted reward, which is used in appetitive learning to reinforce existing actions that most often lead to reward. However, the limited afferent sensory processing available before the dopaminergic signal, together with its precise timing, suggests that it might instead have a central role in identifying which aspects of context and behavioural output are crucial in causing unpredicted events.

10.
A fundamental goal of neuroscience is to understand how cognitive processes, such as operant conditioning, are performed by the brain. Typical and well-studied examples of operant conditioning, in which the firing rates of individual cortical neurons in monkeys are increased using rewards, provide an opportunity for insight into this. Studies of reward-modulated spike-timing-dependent plasticity (RSTDP), and of other models such as R-max, have reproduced this learning behavior, but they have assumed that no unsupervised learning is present (i.e., no learning occurs without, or independent of, rewards). We show that these models cannot elicit firing rate reinforcement while exhibiting both reward learning and ongoing, stable unsupervised learning. To address this issue, we propose a new RSTDP model of synaptic plasticity based upon the observed effects that dopamine has on long-term potentiation and depression (LTP and LTD). We show, both analytically and through simulations, that our new model can exhibit unsupervised learning and lead to firing rate reinforcement. This requires that the strengthening of LTP by the reward signal is greater than the strengthening of LTD, and that the reinforced neuron exhibits irregular firing. We show the robustness of our findings to spike-timing correlations, to the assumed dependence of plasticity on synaptic weight, and to changes in the mean reward. We also consider our model in the differential reinforcement of two nearby neurons. Our model aligns more closely with experimental studies than previous models and makes testable predictions for future experiments.
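A toy weight-update rule capturing the paper's key requirement, that the reward signal strengthens LTP more than LTD; the constants are illustrative, not the fitted values:

```python
import numpy as np

def rstdp_weight_change(dt, reward, a_plus=1.0, a_minus=1.0,
                        tau=20.0, k_plus=2.0, k_minus=1.0):
    """dt = t_post - t_pre in ms; reward >= 0 is the (possibly delayed) reward signal.
    k_plus > k_minus implements 'reward strengthens LTP more than it strengthens LTD'."""
    if dt > 0:   # pre before post: potentiation window
        return (1.0 + k_plus * reward) * a_plus * np.exp(-dt / tau)
    else:        # post before pre: depression window
        return -(1.0 + k_minus * reward) * a_minus * np.exp(dt / tau)

# Without reward this is an ordinary (unsupervised) STDP curve; with reward,
# potentiation outweighs depression and firing rates can be reinforced.
print(rstdp_weight_change(+10.0, reward=0.0), rstdp_weight_change(-10.0, reward=0.0))
print(rstdp_weight_change(+10.0, reward=1.0), rstdp_weight_change(-10.0, reward=1.0))
```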

11.
Reward prediction error (RPE) signals are central to current models of reward learning. Temporal difference (TD) learning models posit that these signals should be modulated by predictions not only of the magnitude but also of the timing of reward. Here we show that BOLD activity in the ventral tegmental area (VTA) conforms to such TD predictions: responses to unexpected rewards are modulated by a temporal hazard function, and activity between a predictive stimulus and reward is depressed in proportion to the predicted reward. By contrast, BOLD activity in ventral striatum (VS) does not reflect a TD RPE, but instead encodes a signal on the variable relevant for behavior, here the timing but not the magnitude of reward. The results have important implications for dopaminergic models of cortico-striatal learning and suggest a modification of the conventional view that VS BOLD necessarily reflects inputs from dopaminergic VTA neurons signaling an RPE.
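The temporal hazard function mentioned here is the probability that the reward arrives now given that it has not arrived yet. The discrete form below is an assumption about how one might compute it, not the paper's exact analysis:

```python
import numpy as np

def hazard(p_arrival):
    """p_arrival[t]: probability that reward arrives at time bin t (sums to <= 1).
    Returns h[t] = P(arrival at t | no arrival before t)."""
    p = np.asarray(p_arrival, dtype=float)
    survival = 1.0 - np.concatenate(([0.0], np.cumsum(p)[:-1]))   # P(not yet arrived)
    return np.divide(p, survival, out=np.zeros_like(p), where=survival > 0)

# Uniform arrival over 5 bins: the hazard rises (0.2, 0.25, 0.33, 0.5, 1.0), so a
# reward delivered late is less surprising than one delivered early.
print(hazard([0.2, 0.2, 0.2, 0.2, 0.2]))
```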

12.
Philpot K, Smith Y. Peptides, 2006, 27(8): 1987-1992.
Over the past decade, CART peptide has been commonly associated with the rewarding and reinforcing properties of drugs of abuse and of natural rewards such as food. The mesolimbic dopamine system is the predominant pathway involved in mediating reward and reinforcement. Many behavioral and neuroanatomical studies have been conducted in order to further elucidate the importance of CART-containing neurons within the mesolimbic dopamine system. This chapter reviews the current knowledge of the localization, synaptic connectivity and neurochemical content of CART peptide-containing neurons in nuclei of the mesolimbic reward pathway. These nuclei include the nucleus accumbens (NA), ventral midbrain, and the lateral hypothalamus (LH). In conclusion, these neuroanatomical studies point to an interconnected CART-containing loop between the NA, ventral midbrain and LH that may have functional implications for CART peptide's involvement in reward and reinforcement.

13.
Instrumental responses are hypothesized to be of two kinds: habitual and goal-directed, mediated by the sensorimotor and the associative cortico-basal ganglia circuits, respectively. The existence of the two heterogeneous associative learning mechanisms can be hypothesized to arise from the comparative advantages that they have at different stages of learning. In this paper, we assume that the goal-directed system is behaviourally flexible, but slow in choice selection. The habitual system, in contrast, is fast in responding, but inflexible in adapting its behavioural strategy to new conditions. Based on these assumptions and using the computational theory of reinforcement learning, we propose a normative model for arbitration between the two processes that makes an approximately optimal balance between search-time and accuracy in decision making. Behaviourally, the model can explain experimental evidence on behavioural sensitivity to outcome at the early stages of learning, but insensitivity at the later stages. It also explains that when two choices with equal incentive values are available concurrently, the behaviour remains outcome-sensitive, even after extensive training. Moreover, the model can explain choice reaction time variations during the course of learning, as well as the experimental observation that as the number of choices increases, the reaction time also increases. Neurobiologically, by assuming that phasic and tonic activities of midbrain dopamine neurons carry the reward prediction error and the average reward signals used by the model, respectively, the model predicts that whereas phasic dopamine indirectly affects behaviour through reinforcing stimulus-response associations, tonic dopamine can directly affect behaviour through manipulating the competition between the habitual and the goal-directed systems and thus affect reaction time.
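A deliberately simplified sketch of the arbitration idea, not the paper's full model: the goal-directed system is charged for its extra deliberation time at the average reward rate (the quantity the authors relate to tonic dopamine), and the faster habit wins once its value catches up. All names and numbers are illustrative:

```python
def arbitrate(v_goal, v_habit, deliberation_time, avg_reward_rate):
    """Pick the controller with the better payoff after charging the goal-directed
    system for its extra decision time (illustrative, not the paper's full model)."""
    cost_of_time = deliberation_time * avg_reward_rate
    return "goal-directed" if v_goal - cost_of_time > v_habit else "habitual"

# Early in learning the habit's value is poor, so deliberation pays off; after
# extensive training the habit matches the goal value and speed wins.
print(arbitrate(v_goal=1.0, v_habit=0.2, deliberation_time=0.5, avg_reward_rate=0.4))
print(arbitrate(v_goal=1.0, v_habit=1.0, deliberation_time=0.5, avg_reward_rate=0.4))
```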

14.
Success in a constantly changing environment requires that decision-making strategies be updated as reward contingencies change. How this is accomplished by the nervous system has, until recently, remained a profound mystery. New studies coupling economic theory with neurophysiological techniques have revealed the explicit representation of behavioral value. Specifically, when fluid reinforcement is paired with visually-guided eye movements, neurons in parietal cortex, prefrontal cortex, the basal ganglia, and superior colliculus—all nodes in a network linking visual stimulation with the generation of oculomotor behavior—encode the expected value of targets lying within their response fields. Other brain areas have been implicated in the processing of reward-related information in the abstract: midbrain dopaminergic neurons, for instance, signal an error in reward prediction. Still other brain areas link information about reward to the selection and performance of specific actions in order for behavior to adapt to changing environmental exigencies. Neurons in posterior cingulate cortex have been shown to carry signals related to both reward outcomes and oculomotor behavior, suggesting that they participate in updating estimates of orienting value.

15.
Sensorimotor control has traditionally been considered from a control theory perspective, without relation to neurobiology. In contrast, here we utilized a spiking-neuron model of motor cortex and trained it to perform a simple movement task, which consisted of rotating a single-joint "forearm" to a target. Learning was based on a reinforcement mechanism analogous to that of the dopamine system. This provided a global reward or punishment signal in response to decreasing or increasing distance from hand to target, respectively. Output was partially driven by Poisson motor babbling, creating stochastic movements that could then be shaped by learning. The virtual forearm consisted of a single segment rotated around an elbow joint, controlled by flexor and extensor muscles. The model consisted of 144 excitatory and 64 inhibitory event-based neurons, each with AMPA, NMDA, and GABA synapses. Proprioceptive cell input to this model encoded the two muscle lengths. Plasticity was only enabled in feedforward connections between input and output excitatory units, using spike-timing-dependent eligibility traces for synaptic credit or blame assignment. Learning resulted from a global 3-valued signal: reward (+1), no learning (0), or punishment (−1), corresponding to phasic increases, lack of change, or phasic decreases of dopaminergic cell firing, respectively. Successful learning only occurred when both reward and punishment were enabled. In this case, 5 target angles were learned successfully within 180 s of simulation time, with a median error of 8 degrees. Motor babbling allowed exploratory learning, but decreased the stability of the learned behavior, since the hand continued moving after reaching the target. Our model demonstrated that a global reinforcement signal, coupled with eligibility traces for synaptic plasticity, can train a spiking sensorimotor network to perform goal-directed motor behavior.
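A compact, non-spiking caricature of the learning rule described: a global reinforcement signal of +1, 0, or -1 gates plasticity through per-synapse eligibility traces. All names and constants are illustrative, not the model's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 8, 4
w = rng.uniform(0.0, 0.1, size=(n_out, n_in))   # plastic feedforward weights
elig = np.zeros_like(w)                          # per-synapse eligibility traces
tau_e, lr = 5.0, 0.05

def step(pre, post, reward_signal):
    """pre/post: 0/1 activity vectors this step; reward_signal in {-1, 0, +1}."""
    global w, elig
    elig += np.outer(post, pre)        # credit: co-active pre/post pairs become eligible
    w += lr * reward_signal * elig     # reward converts eligibility into weight change
    w = np.clip(w, 0.0, 1.0)
    elig *= np.exp(-1.0 / tau_e)       # traces decay, limiting how far back credit reaches

for t in range(100):
    pre = (rng.random(n_in) < 0.3).astype(float)     # stand-in for proprioceptive input
    post = (rng.random(n_out) < 0.2).astype(float)   # stand-in for babbling motor output
    step(pre, post, reward_signal=rng.choice([-1, 0, 1]))
```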

16.
Operant learning requires that reinforcement signals interact with action representations at a suitable neural interface. Much evidence suggests that this occurs when phasic dopamine, acting as a reinforcement prediction error, gates plasticity at cortico-striatal synapses, and thereby changes the future likelihood of selecting the action(s) coded by striatal neurons. But this hypothesis faces serious challenges. First, cortico-striatal plasticity is inexplicably complex, depending on spike timing, dopamine level, and dopamine receptor type. Second, there is a credit assignment problem—action selection signals occur long before the consequent dopamine reinforcement signal. Third, the two types of striatal output neuron have apparently opposite effects on action selection. Whether these factors rule out the interface hypothesis and how they interact to produce reinforcement learning is unknown. We present a computational framework that addresses these challenges. We first predict the expected activity changes over an operant task for both types of action-coding striatal neuron, and show they co-operate to promote action selection in learning and compete to promote action suppression in extinction. Separately, we derive a complete model of dopamine and spike-timing dependent cortico-striatal plasticity from in vitro data. We then show this model produces the predicted activity changes necessary for learning and extinction in an operant task, a remarkable convergence of a bottom-up data-driven plasticity model with the top-down behavioural requirements of learning theory. Moreover, we show the complex dependencies of cortico-striatal plasticity are not only sufficient but necessary for learning and extinction. Validating the model, we show it can account for behavioural data describing extinction, renewal, and reacquisition, and replicate in vitro experimental data on cortico-striatal plasticity. By bridging the levels between the single synapse and behaviour, our model shows how striatum acts as the action-reinforcement interface.
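A highly simplified cartoon of the interface idea, not the paper's detailed plasticity model: phasic dopamine, carrying a prediction error, converts eligibility traces into weight changes of opposite sign on the two striatal projection populations, so they cooperate during acquisition and compete during extinction. Names and constants are illustrative:

```python
import numpy as np

w_d1, w_d2 = 0.1, 0.1      # cortical weights onto "Go" (D1) and "NoGo" (D2) neurons
elig_d1 = elig_d2 = 0.0
lr, tau = 0.1, 3.0

def trial(chose_action, dopamine):
    """chose_action: the action coded by these neurons was selected this trial.
    dopamine: phasic prediction error (+ for unexpected reward, - for omission)."""
    global w_d1, w_d2, elig_d1, elig_d2
    if chose_action:                    # selection leaves eligibility for later credit
        elig_d1 += 1.0
        elig_d2 += 1.0
    w_d1 += lr * dopamine * elig_d1     # positive RPE strengthens Go
    w_d2 -= lr * dopamine * elig_d2     # ...and weakens NoGo (and vice versa)
    w_d1, w_d2 = np.clip([w_d1, w_d2], 0.0, 1.0)
    elig_d1 *= np.exp(-1.0 / tau)       # traces bridge the selection-to-reward delay
    elig_d2 *= np.exp(-1.0 / tau)

for _ in range(20):
    trial(chose_action=True, dopamine=+0.5)   # acquisition: Go grows, NoGo shrinks
for _ in range(20):
    trial(chose_action=True, dopamine=-0.5)   # extinction: the balance reverses
```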

17.
Midbrain dopamine (DA) neurons are thought to encode reward prediction error. Reward prediction can be improved if any relevant context is taken into account. We found that monkey DA neurons can encode a context-dependent prediction error. In the first, noncontextual task, a light stimulus was randomly followed by reward with a fixed, equal probability. The response of DA neurons was positively correlated with the number of preceding unrewarded trials and could be simulated by a conventional temporal difference (TD) model. In the second, contextual task, a reward-indicating light stimulus was presented with a probability that, while fixed overall, was incremented as a function of the number of preceding unrewarded trials. The DA neuronal response was then negatively correlated with this number. This history effect corresponded to the prediction error based on the conditional probability of reward and could be simulated only by implementing the relevant context into the TD model.
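The contextual manipulation can be made concrete with the conditional reward probability: if reward becomes more likely after each unrewarded trial while the overall rate stays fixed, the appropriate prediction, and hence the prediction error, depends on the count of preceding unrewarded trials. The schedule below is an assumed example, not the actual task parameters:

```python
def conditional_reward_prob(n_unrewarded, schedule=(0.25, 0.33, 0.5, 1.0)):
    """Assumed example: reward probability rises with the number of preceding
    unrewarded trials so that the overall reward rate stays fixed."""
    return schedule[min(n_unrewarded, len(schedule) - 1)]

def prediction_error(rewarded, n_unrewarded, contextual=True, p_fixed=0.25):
    expected = conditional_reward_prob(n_unrewarded) if contextual else p_fixed
    return (1.0 if rewarded else 0.0) - expected

# After 3 unrewarded trials, reward is certain in the contextual task, so the error
# to its delivery shrinks; the noncontextual prediction would still be 0.25.
print(prediction_error(True, 3, contextual=True))    # 0.0
print(prediction_error(True, 3, contextual=False))   # 0.75
```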

18.
Fiorillo CD. PLoS ONE, 2008, 3(10): e3298.
Although there has been tremendous progress in understanding the mechanics of the nervous system, there has not been a general theory of its computational function. Here I present a theory that relates the established biophysical properties of single generic neurons to principles of Bayesian probability theory, reinforcement learning and efficient coding. I suggest that this theory addresses the general computational problem facing the nervous system. Each neuron is proposed to mirror the function of the whole system in learning to predict aspects of the world related to future reward. According to the model, a typical neuron receives current information about the state of the world from a subset of its excitatory synaptic inputs, and prior information from its other inputs. Prior information would be contributed by synaptic inputs representing distinct regions of space, and by different types of non-synaptic, voltage-regulated channels representing distinct periods of the past. The neuron's membrane voltage is proposed to signal the difference between current and prior information ("prediction error" or "surprise"). A neuron would apply a Hebbian plasticity rule to select those excitatory inputs that are the most closely correlated with reward but are the least predictable, since unpredictable inputs provide the neuron with the most "new" information about future reward. To minimize the error in its predictions and to respond only when excitation is "new and surprising," the neuron selects amongst its prior information sources through an anti-Hebbian rule. The unique inputs of a mature neuron would therefore result from learning about spatial and temporal patterns in its local environment, and by extension, the external world. Thus the theory describes how the structure of the mature nervous system could reflect the structure of the external world, and how the complexity and intelligence of the system might develop from a population of undifferentiated neurons, each implementing similar learning algorithms.

19.
To survive, animals have to quickly modify their behaviour when the reward changes. The internal representations responsible for this are updated through synaptic weight changes, mediated by certain neuromodulators conveying feedback from the environment. In previous experiments, we discovered a form of hippocampal spike-timing-dependent plasticity (STDP) that is sequentially modulated by acetylcholine and dopamine. Acetylcholine facilitates synaptic depression, while dopamine retroactively converts the depression into potentiation. When these experimental findings were implemented as a learning rule in a computational model, our simulations showed that cholinergic-facilitated depression is important for reversal learning. In the present study, we tested the model's prediction by optogenetically inactivating cholinergic neurons in mice during a hippocampus-dependent spatial learning task with changing rewards. We found that reversal learning, but not initial place learning, was impaired, verifying our computational prediction that acetylcholine-modulated plasticity promotes the unlearning of old reward locations. Further, differences in neuromodulator concentrations in the model captured mouse-by-mouse performance variability in the optogenetic experiments. Our line of work sheds light on how neuromodulators enable the learning of new contingencies.
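A toy version of the sequential neuromodulation rule described: acetylcholine present at the time of spike pairing facilitates depression, and dopamine arriving afterwards retroactively converts that depression into potentiation. The constants and the simplified timing are illustrative only:

```python
def seq_modulated_stdp(depression, ach_level, dopamine_later, conversion_gain=2.0):
    """depression: magnitude of the pairing-induced LTD event (a positive number).
    ach_level scales how much depression is induced; dopamine_later, if nonzero,
    retroactively flips the stored change into potentiation."""
    dw = -ach_level * depression                            # cholinergic-facilitated depression
    if dopamine_later > 0:
        dw = conversion_gain * dopamine_later * abs(dw)     # retroactive conversion to LTP
    return dw

print(seq_modulated_stdp(0.1, ach_level=1.0, dopamine_later=0.0))   # -0.1 (unlearning)
print(seq_modulated_stdp(0.1, ach_level=1.0, dopamine_later=1.0))   # +0.2 (reward arrived)
```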

20.
The phasic firing rate of midbrain dopamine neurons has been shown to respond both to the receipt of rewarding stimuli and to the degree to which such stimuli are anticipated by the recipient. This has led to the hypothesis that these neurons encode reward prediction error (RPE): the difference between how rewarding an event is and how rewarding it was expected to be. However, the RPE model is one of a number of competing explanations for dopamine activity that have proved hard to disentangle, mainly because they are couched in terms of latent, or unobservable, variables. This article describes techniques for dealing with latent variables common in economics and decision theory, and reviews work that uses these techniques to provide simple, non-parametric tests of the RPE hypothesis, allowing clear differentiation between competing explanations.
