Similar Documents
20 similar documents found.
1.
Behavioral evidence suggests that instrumental conditioning is governed by two forms of action control: a goal-directed and a habit learning process. Model-based reinforcement learning (RL) has been argued to underlie the goal-directed process; however, the way in which it interacts with habits and the structure of the habitual process has remained unclear. According to a flat architecture, the habitual process corresponds to model-free RL, and its interaction with the goal-directed process is coordinated by an external arbitration mechanism. Alternatively, the interaction between these systems has recently been argued to be hierarchical, such that the formation of action sequences underlies habit learning and a goal-directed process selects between goal-directed actions and habitual sequences of actions to reach the goal. Here we used a two-stage decision-making task to test predictions from these accounts. The hierarchical account predicts that, because they are tied to each other as an action sequence, selecting a habitual action in the first stage will be followed by a habitual action in the second stage, whereas the flat account predicts that the statuses of the first and second stage actions are independent of each other. We found, based on subjects' choices and reaction times, that human subjects combined single actions to build action sequences and that the formation of such action sequences was sufficient to explain habitual actions. Furthermore, based on Bayesian model comparison, a family of hierarchical RL models, assuming a hierarchical interaction between habit and goal-directed processes, provided a better fit of the subjects' behavior than a family of flat models. Although these findings do not rule out all possible model-free accounts of instrumental conditioning, they do show such accounts are not necessary to explain habitual actions and provide a new basis for understanding how goal-directed and habitual action control interact.
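As a toy rendering of the hierarchical account (not the authors' model), one can give a single value-learner an option set that mixes primitive first-stage actions with one cached two-step sequence; choosing the sequence fixes the second-stage action, which is exactly the first-stage/second-stage dependency the hierarchical account predicts. The task, reward scheme, and all parameter values below are illustrative assumptions.

```python
import math
import random

# Minimal sketch, assuming a toy two-stage task: two single first-stage
# actions plus one cached ("habitual") action sequence compete as options.
ALPHA, BETA = 0.1, 3.0                 # hypothetical learning rate / softmax temperature
OPTIONS = ["a1", "a2", ("a1", "b1")]   # single actions and one chunked sequence
q = {opt: 0.0 for opt in OPTIONS}

def choose():
    weights = [math.exp(BETA * q[o]) for o in OPTIONS]
    r = random.random() * sum(weights)
    for o, w in zip(OPTIONS, weights):
        r -= w
        if r <= 0:
            return o
    return OPTIONS[-1]

for trial in range(2000):
    choice = choose()
    if isinstance(choice, tuple):      # habitual sequence: stage 2 comes for free
        s1, s2 = choice
    else:                              # goal-directed stage 1: stage 2 chosen separately
        s1, s2 = choice, random.choice(["b1", "b2"])
    reward = 1.0 if (s1, s2) == ("a1", "b1") else 0.0
    q[choice] += ALPHA * (reward - q[choice])   # prediction-error update
```

Once the cached sequence accrues value it is selected as a single unit, so a "habitual" first-stage choice is automatically followed by its paired second-stage action.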

2.
Depression is characterized by deficits in the reinforcement learning (RL) process. Although many computational and neural studies have extended our knowledge of the impact of depression on RL, most focus on habitual control (model-free RL), yielding a relatively poor understanding of goal-directed control (model-based RL) and arbitration control to find a balance between the two. We investigated the effects of subclinical depression on model-based and model-free learning in the prefrontal–striatal circuitry. First, we found that subclinical depression is associated with attenuated representation of state and reward prediction errors in the insula and caudate. Critically, we found that it accompanies the disrupted arbitration control between model-based and model-free learning in the predominantly inferior lateral prefrontal cortex and frontopolar cortex. We also found that depression undermines the ability to exploit viable options, called exploitation sensitivity. These findings characterize how subclinical depression influences different levels of the decision-making hierarchy, advancing previous conflicting views that depression simply influences either habitual or goal-directed control. Our study creates possibilities for various clinical applications, such as early diagnosis and behavioral therapy design.

3.
Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ. Neuron. 2011;69(6):1204-1215
The mesostriatal dopamine system is prominently implicated in model-free reinforcement learning, with fMRI BOLD signals in ventral striatum notably covarying with model-free prediction errors. However, latent learning and devaluation studies show that behavior also shows hallmarks of model-based planning, and the interaction between model-based and model-free values, prediction errors, and preferences is underexplored. We designed a multistep decision task in which model-based and model-free influences on human choice behavior could be distinguished. By showing that choices reflected both influences we could then test the purity of the ventral striatal BOLD signal as a model-free report. Contrary to expectations, the signal reflected both model-free and model-based predictions in proportions matching those that best explained choice behavior. These results challenge the notion of a separate model-free learner and suggest a more integrated computational architecture for high-level human decision-making.
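Analyses of this kind of two-step task are usually framed as a hybrid valuation, Q = w·Q_MB + (1−w)·Q_MF, with the weight w indexing the model-based share. The sketch below illustrates that mixture on a simplified version of the task; the transition matrix, the fixed (rather than drifting) reward probabilities, and all parameter values are assumptions for illustration, not the paper's fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)
ALPHA, W, BETA = 0.2, 0.5, 4.0   # hypothetical learning rate, mixture weight, temperature
T = np.array([[0.7, 0.3],        # P(stage-2 state | stage-1 action): common vs rare
              [0.3, 0.7]])
q_mf1 = np.zeros(2)              # model-free values of the two stage-1 actions
q2 = np.zeros(2)                 # values of the two stage-2 states

def softmax(v):
    p = np.exp(BETA * (v - v.max()))
    return p / p.sum()

for trial in range(1000):
    q_mb1 = T @ q2                          # model-based: plan through the transition model
    q_net = W * q_mb1 + (1 - W) * q_mf1     # hybrid value drives choice (and, per the
    a = rng.choice(2, p=softmax(q_net))     # paper, the ventral striatal BOLD signal)
    s2 = rng.choice(2, p=T[a])
    r = rng.random() < (0.6 if s2 == 0 else 0.4)   # drifting reward probs simplified to fixed
    q2[s2] += ALPHA * (r - q2[s2])                 # stage-2 prediction-error update
    q_mf1[a] += ALPHA * (q2[s2] - q_mf1[a])        # model-free stage-1 update (TD)
```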

4.
Model-based and model-free reinforcement learning (RL) have been suggested as algorithmic realizations of goal-directed and habitual action strategies. Model-based RL is more flexible than model-free but requires sophisticated calculations using a learnt model of the world. This has led model-based RL to be identified with slow, deliberative processing, and model-free RL with fast, automatic processing. In support of this distinction, it has recently been shown that model-based reasoning is impaired by placing subjects under cognitive load—a hallmark of non-automaticity. Here, using the same task, we show that cognitive load does not impair model-based reasoning if subjects receive prior training on the task. This finding is replicated across two studies and a variety of analysis methods. Thus, task familiarity permits use of model-based reasoning in parallel with other cognitive demands. The ability to deploy model-based reasoning in an automatic, parallelizable fashion has widespread theoretical implications, particularly for the learning and execution of complex behaviors. It also suggests a range of important failure modes in psychiatric disorders.

5.
Reinforcement learning (RL) has become a dominant paradigm for understanding animal behaviors and neural correlates of decision-making, in part because of its ability to explain Pavlovian conditioned behaviors and the role of midbrain dopamine activity as reward prediction error (RPE). However, recent experimental findings indicate that dopamine activity, contrary to the RL hypothesis, may not signal RPE and differs based on the type of Pavlovian response (e.g. sign- and goal-tracking responses). In this study, we address this discrepancy by introducing a new neural correlate for learning reward predictions; the correlate is called "cue-evoked reward". It refers to a recall of reward evoked by the cue that is learned through simple cue-reward associations. We introduce a temporal difference learning model, in which neural correlates of the cue itself and cue-evoked reward underlie learning of reward predictions. The animal's reward prediction supported by these two correlates is divided into sign and goal components respectively. We relate the sign and goal components to approach responses towards the cue (i.e. sign-tracking) and the food-tray (i.e. goal-tracking) respectively. We found a number of correspondences between simulated models and the experimental findings (i.e. behavior and neural responses). First, the development of modeled responses is consistent with those observed in the experimental task. Second, the model's RPEs were similar to dopamine activity in respective response groups. Finally, goal-tracking, but not sign-tracking, responses rapidly emerged when RPE was restored in the simulated models, similar to experimental findings on recovery from a dopamine antagonist. These results suggest that two complementary neural correlates, corresponding to the cue and its evoked reward, form the basis for learning reward predictions in the sign- and goal-tracking rats.
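A loose sketch of the core idea (not the authors' full temporal-difference model): the reward prediction at cue onset is the sum of a sign component tied to the cue itself and a goal component tied to the cue-evoked recall of reward, both trained by a shared RPE. The separate learning rates below are hypothetical.

```python
# Minimal sketch, assuming a Pavlovian trial in which the cue is always
# followed by food. The relative sizes of the two learned components stand
# in for sign-tracking vs goal-tracking tendencies.
ALPHA_SIGN, ALPHA_GOAL = 0.05, 0.15   # hypothetical component learning rates
v_sign, v_goal = 0.0, 0.0

for trial in range(200):
    prediction = v_sign + v_goal      # total reward prediction at cue onset
    reward = 1.0                      # cue-reward pairing on every trial
    rpe = reward - prediction         # dopamine-like prediction error
    v_sign += ALPHA_SIGN * rpe        # supports approach to the cue (sign-tracking)
    v_goal += ALPHA_GOAL * rpe        # supports approach to the food tray (goal-tracking)
```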

6.
Activation of dopamine receptors in forebrain regions, for minutes or longer, is known to be sufficient for positive reinforcement of stimuli and actions. However, the firing rate of dopamine neurons is increased for only about 200 milliseconds following natural reward events that are better than expected, a response which has been described as a "reward prediction error" (RPE). Although RPE drives reinforcement learning (RL) in computational models, it has not been possible to directly test whether the transient dopamine signal actually drives RL. Here we have performed optical stimulation of genetically targeted ventral tegmental area (VTA) dopamine neurons expressing Channelrhodopsin-2 (ChR2) in mice. We mimicked the transient activation of dopamine neurons that occurs in response to natural reward by applying a light pulse of 200 ms in VTA. When a single light pulse followed each self-initiated nose poke, it was sufficient in itself to cause operant reinforcement. Furthermore, when optical stimulation was delivered in separate sessions according to a predetermined pattern, it increased locomotion and contralateral rotations, behaviors that are known to result from activation of dopamine neurons. All three of the optically induced operant and locomotor behaviors were tightly correlated with the number of VTA dopamine neurons that expressed ChR2, providing additional evidence that the behavioral responses were caused by activation of dopamine neurons. These results provide strong evidence that the transient activation of dopamine neurons provides a functional reward signal that drives learning, in support of RL theories of dopamine function.

7.
The negative symptoms of schizophrenia (SZ) are associated with a pattern of reinforcement learning (RL) deficits likely related to degraded representations of reward values. However, the RL tasks used to date have required active responses to both reward and punishing stimuli. Pavlovian biases have been shown to affect performance on these tasks through invigoration of action to reward and inhibition of action to punishment, and may be partially responsible for the effects found in patients. Forty-five patients with schizophrenia and 30 demographically-matched controls completed a four-stimulus reinforcement learning task that crossed action ("Go" or "NoGo") and the valence of the optimal outcome (reward or punishment-avoidance), such that all combinations of action and outcome valence were tested. Behaviour was modelled using a six-parameter RL model and EEG was simultaneously recorded. Patients demonstrated a reduction in Pavlovian performance bias that was evident in a reduced Go bias across the full group. In a subset of patients administered clozapine, the reduction in Pavlovian bias was enhanced. The reduction in Pavlovian bias in SZ patients was accompanied by feedback processing differences at the time of the P3a component. The reduced Pavlovian bias in patients is suggested to be due to reduced fidelity in the communication between striatal regions and frontal cortex. It may also partially account for previous findings of poorer "Go-learning" in schizophrenia where "Go" responses or Pavlovian consistent responses are required for optimal performance. An attenuated P3a component dynamic in patients is consistent with a view that deficits in operant learning are due to impairments in adaptively using feedback to update representations of stimulus value.
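The abstract does not spell out the six-parameter model, but Go/NoGo models of this family (in the style of Guitart-Masip and colleagues) typically take the shape sketched below, in which a Pavlovian term invigorates "Go" for reward-predictive stimuli. The exact parameterisation here is an assumption.

```python
import math
import random

# Hypothetical parameters: learning rate, inverse temperature, Go bias,
# and the Pavlovian weight whose reduction the paper reports in patients.
ALPHA, BETA, GO_BIAS, PI = 0.1, 3.0, 0.3, 0.4
q = {"go": 0.0, "nogo": 0.0}   # instrumental action values for one stimulus
v = 0.0                        # Pavlovian stimulus value (valence)

def p_go():
    w_go = q["go"] + GO_BIAS + PI * v    # Pavlovian value invigorates "Go"
    return 1.0 / (1.0 + math.exp(-BETA * (w_go - q["nogo"])))

def update(action, reward):
    global v
    q[action] += ALPHA * (reward - q[action])
    v += ALPHA * (reward - v)            # stimulus value tracks outcomes regardless of action

# Toy "Go to win" condition: only "Go" pays off.
for trial in range(200):
    action = "go" if random.random() < p_go() else "nogo"
    update(action, 1.0 if action == "go" else 0.0)
```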

8.
According to a prominent view of sensorimotor processing in primates, selection and specification of possible actions are not sequential operations. Rather, a decision for an action emerges from competition between different movement plans, which are specified and selected in parallel. For action choices which are based on ambiguous sensory input, the frontoparietal sensorimotor areas are considered part of the common underlying neural substrate for selection and specification of action. These areas have been shown capable of encoding alternative spatial motor goals in parallel during movement planning, and show signatures of competitive value-based selection among these goals. Since the same network is also involved in learning sensorimotor associations, competitive action selection (decision making) should not only be driven by the sensory evidence and expected reward in favor of either action, but also by the subject's learning history of different sensorimotor associations. Previous computational models of competitive neural decision making used predefined associations between sensory input and corresponding motor output. Such hard-wiring does not allow modeling of how decisions are influenced by sensorimotor learning or by changing reward contingencies. We present a dynamic neural field model which learns arbitrary sensorimotor associations with a reward-driven Hebbian learning algorithm. We show that the model accurately simulates the dynamics of action selection with different reward contingencies, as observed in monkey cortical recordings, and that it correctly predicted the pattern of choice errors in a control experiment. With our adaptive model we demonstrate how network plasticity, which is required for association learning and adaptation to new reward contingencies, can influence choice behavior. The field model provides an integrated and dynamic account for the operations of sensorimotor integration, working memory and action selection required for decision making in ambiguous choice situations.
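The learning rule itself can be sketched as a three-factor (reward-modulated) Hebbian update on the cue-to-motor-goal weights. The dimensions, winner-take-all readout, and rewarded mapping below are illustrative assumptions, not the field model's full dynamics.

```python
import numpy as np

rng = np.random.default_rng(1)
ETA = 0.05                       # hypothetical learning rate
W = np.zeros((2, 2))             # associative weights: 2 cues -> 2 motor goals

for trial in range(500):
    cue = np.zeros(2); cue[rng.integers(2)] = 1.0        # one-hot sensory input
    drive = W @ cue + 0.1 * rng.standard_normal(2)       # noisy motor-goal activation
    act = np.zeros(2); act[drive.argmax()] = 1.0         # competition: winner takes all
    r = 1.0 if drive.argmax() == cue.argmax() else 0.0   # arbitrary rewarded mapping
    W += ETA * r * np.outer(act, cue)                    # Hebbian update gated by reward
```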

9.
Recent discoveries indicate an important role for ghrelin in drug and alcohol reward and an ability of ghrelin to regulate mesolimbic dopamine activity. The role of dopamine in novelty seeking, and the association between this trait and drug and alcohol abuse, led us to hypothesize that ghrelin may influence novelty seeking behavior. To test this possibility we applied several complementary rodent models of novelty seeking behavior, i.e. inescapable novelty-induced locomotor activity (NILA), novelty-induced place preference and novel object exploration, in rats subjected to acute ghrelin receptor (growth hormone secretagogue receptor; GHSR) stimulation or blockade. Furthermore we assessed the possible association between polymorphisms in the genes encoding ghrelin and GHSR and novelty seeking behavior in humans. The rodent studies indicate an important role for ghrelin in a wide range of novelty seeking behaviors. Ghrelin-injected rats exhibited a higher preference for a novel environment and increased novel object exploration. Conversely, those with GHSR blockade drastically reduced their preference for a novel environment and displayed decreased NILA. Importantly, the mesolimbic ventral tegmental area selective GHSR blockade was sufficient to reduce the NILA response indicating that the mesolimbic GHSRs might play an important role in the observed novelty responses. Moreover, in untreated animals, a striking positive correlation between NILA and sucrose reward behavior was detected. Two GHSR single nucleotide polymorphisms (SNPs), rs2948694 and rs495225, were significantly associated with the personality trait novelty seeking, as assessed using the Temperament and Character Inventory (TCI), in human subjects. This study provides the first evidence for a role of ghrelin in novelty seeking behavior in animals and humans, and also points to an association between food reward and novelty seeking in rodents.

10.
11.
Animals explore novel environments in a cautious manner, exhibiting alternation between curiosity-driven behavior and retreats. We present a detailed formal framework for exploration behavior, which generates behavior that maintains a constant level of novelty. Similar to other types of complex behaviors, the resulting exploratory behavior is composed of exploration motor primitives. These primitives can be learned during a developmental period, wherein the agent experiences repeated interactions with environments that share common traits, thus allowing transference of motor learning to novel environments. The emergence of exploration motor primitives is the result of reinforcement learning in which information gain serves as intrinsic reward. Furthermore, actors and critics are local and ego-centric, thus enabling transference to other environments. Novelty control, i.e. the principle which governs the maintenance of constant novelty, is implemented by a central action-selection mechanism, which switches between the emergent exploration primitives and a retreat policy, based on the currently-experienced novelty. The framework has only a few parameters, wherein time-scales, learning rates and thresholds are adaptive, and can thus be easily applied to many scenarios. We implement it by modeling the rodent’s whisking system and show that it can explain characteristic observed behaviors. A detailed discussion of the framework’s merits and flaws, as compared to other related models, concludes the paper.
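The framework is richer than this, but the central "novelty control" loop can be caricatured as an intrinsic reward for information gain plus a central switch that retreats whenever experienced novelty exceeds a set-point. The count-based novelty proxy and the threshold below are assumptions.

```python
import math
from collections import defaultdict

visits = defaultdict(int)
NOVELTY_SETPOINT = 0.8           # hypothetical set-point for tolerated novelty

def step(state):
    visits[state] += 1
    novelty = 1.0 / math.sqrt(visits[state])   # count-based stand-in for information gain
    if novelty > NOVELTY_SETPOINT:
        return "retreat", novelty              # too novel: fall back toward home
    return "explore", novelty                  # familiar enough: advance

for s in ["A", "A", "A", "B"]:                 # first visits trigger retreat, repeats do not
    print(s, *step(s))
```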

12.
13.
Extensive evidence implicates the ventral striatum in multiple distinct facets of action selection. Early work established a role in modulating ongoing behavior, as engaged by the energizing and directing influences of motivationally relevant cues and the willingness to expend effort in order to obtain reward. More recently, reinforcement learning models have suggested the notion of ventral striatum primarily as an evaluation step during learning, which serves as a critic to update a separate actor. Recent computational and experimental work may provide a resolution to the differences between these two theories through a careful parsing of behavior and the intrinsic heterogeneity that characterizes this complex structure.
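The actor-critic division the abstract invokes has a standard textbook form (Sutton and Barto): the critic learns state values and broadcasts a TD error that trains a separate actor. The sketch below is that generic form, not a model of the ventral striatum itself; all parameters are assumptions.

```python
import numpy as np

ALPHA_V, ALPHA_PI, GAMMA = 0.1, 0.1, 0.95   # hypothetical learning rates, discount
n_states, n_actions = 5, 2
V = np.zeros(n_states)                      # critic: state values (the "evaluation step")
theta = np.zeros((n_states, n_actions))     # actor: action preferences

def policy(s):
    p = np.exp(theta[s] - theta[s].max())   # softmax over preferences
    return p / p.sum()

def update(s, a, r, s_next, done):
    target = r + (0.0 if done else GAMMA * V[s_next])
    delta = target - V[s]                   # TD error: the critic's evaluation signal
    V[s] += ALPHA_V * delta                 # critic update
    grad = -policy(s); grad[a] += 1.0       # grad of log softmax policy
    theta[s] += ALPHA_PI * delta * grad     # actor update, driven by the critic
```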

14.
Thinking about personal future events is a fundamental cognitive process that helps us make choices in daily life. We investigated how the imagination of episodic future events is influenced by implicit motivational factors known to guide decision making. In a two-day functional magnetic resonance imaging (fMRI) study, we controlled learned reward association and stimulus novelty by pre-familiarizing participants with two sets of words in a reward learning task. Words were repeatedly presented and consistently followed by monetary reward or no monetary outcome. One day later, participants imagined personal future events based on previously rewarded, unrewarded and novel words. Reward association enhanced the perceived vividness of the imagined scenes. Reward and novelty-based construction of future events were associated with higher activation of the motivational system (striatum and substantia nigra/ventral tegmental area) and hippocampus, and functional connectivity between these areas increased during imagination of events based on reward-associated and novel words. These data indicate that implicit past motivational experience contributes to our expectation of what the future holds in store.

15.
Adolescence is a unique, transitional period of human development. One hallmark of this period is progressive improvements (relative to children) in cognitive control, core mental abilities enabling the ‘top-down’, endogenous control over behavior. However, as adolescents transition to more mature (adult) levels of functioning, limitations still exist in the ability to consistently and flexibly exert cognitive control across various contexts into the early twenties. Adolescence is also marked by peaks in sensation-, novelty-, and reward-seeking behaviors thought to stem from normative increases in responsiveness in limbic and paralimbic brain structures, beginning around the onset of puberty. Asynchronous maturation in these systems during the adolescent period likely contributes to immature decision-making, strongly influenced by ‘bottom-up’ reward processes, and may help explain noted increases in risk taking behavior during adolescence. In this paper, structural and functional maturation in brain systems supporting reward and cognitive control processing are reviewed as a means to better understand risk taking. Particular emphasis is placed on adolescents' experimentation with drugs as a specific example of a risky behavior.

16.
Avoidance learning poses a challenge for reinforcement-based theories of instrumental conditioning, because once an aversive outcome is successfully avoided an individual may no longer experience extrinsic reinforcement for their behavior. One possible account for this is to propose that avoiding an aversive outcome is in itself a reward, and thus avoidance behavior is positively reinforced on each trial when the aversive outcome is successfully avoided. In the present study we aimed to test this possibility by determining whether avoidance of an aversive outcome recruits the same neural circuitry as that elicited by a reward itself. We scanned 16 human participants with functional MRI while they performed an instrumental choice task, in which on each trial they chose from one of two actions in order to either win money or else avoid losing money. Neural activity in a region previously implicated in encoding stimulus reward value, the medial orbitofrontal cortex, was found to increase, not only following receipt of reward, but also following successful avoidance of an aversive outcome. This neural signal may itself act as an intrinsic reward, thereby serving to reinforce actions during instrumental avoidance.
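The "avoidance is itself a reward" account can be written directly into a standard value update: successful avoidance is recoded as a positive intrinsic reward, so the behaviour keeps being reinforced even when nothing extrinsic is delivered. The magnitudes and the epsilon-greedy choice rule below are assumptions.

```python
import random

ALPHA, INTRINSIC, EPSILON = 0.2, 0.5, 0.1
q = [0.0, 0.0]                     # action values; action 1 successfully avoids the loss

for trial in range(500):
    if random.random() < EPSILON:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda i: q[i])
    loss_occurred = (a == 0)
    extrinsic = -1.0 if loss_occurred else 0.0
    intrinsic = INTRINSIC if not loss_occurred else 0.0   # avoidance recoded as reward
    q[a] += ALPHA * ((extrinsic + intrinsic) - q[a])      # ordinary prediction-error update
```

Without the intrinsic term, the avoiding action's value decays back toward zero once losses stop occurring; with it, avoidance remains positively reinforced on every successful trial.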

17.
Risk is a ubiquitous feature of the environment for most organisms, who must often choose between a small and certain reward and a larger but less certain reward. To study choice behavior under risk in a genetically well characterized species, we trained mice (C57BL/6) on a discrete trial, concurrent-choice task in which they must choose between two levers. Pressing one lever (safe choice) is always followed by a small reward. Pressing the other lever (risky choice) is followed by a larger reward, but only on some of the trials. The overall payoff is the same on both levers. When mice were not food deprived, they were indifferent to risk, choosing both levers with equal probability regardless of the level of risk. In contrast, following food or water deprivation, mice earning 10% sucrose solution were risk-averse, though the addition of alcohol to the sucrose solution dose-dependently reduced risk aversion, even before the mice became intoxicated. Our results falsify the budget rule in optimal foraging theory often used to explain behavior under risk. Instead, they suggest that the overall demand or desired amount for a particular reward determines risk preference. Changes in motivational state or reward identity affect risk preference by changing demand. Any manipulation that increases the demand for a reward also increases risk aversion, by selectively increasing the frequency of safe choices without affecting frequency of risky choices.
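One way to make the demand account concrete (an assumption on our part, not the authors' formal model): let choices maximise expected utility, and let deprivation raise demand, which raises the curvature of the utility function and hence risk aversion.

```python
def utility(amount, demand):
    # Higher demand -> more concave utility -> stronger risk aversion.
    return amount ** (1.0 / (1.0 + demand))

def prefers_safe(safe, risky, p_risky, demand):
    return utility(safe, demand) > p_risky * utility(risky, demand)

# Equal payoffs: 1 unit for sure vs 2 units with p = 0.5.
print(prefers_safe(1.0, 2.0, 0.5, demand=0.0))  # False: sated animal is indifferent
print(prefers_safe(1.0, 2.0, 0.5, demand=1.0))  # True: deprived animal is risk-averse
```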

18.
Instrumental responses are hypothesized to be of two kinds: habitual and goal-directed, mediated by the sensorimotor and the associative cortico-basal ganglia circuits, respectively. The existence of the two heterogeneous associative learning mechanisms can be hypothesized to arise from the comparative advantages that they have at different stages of learning. In this paper, we assume that the goal-directed system is behaviourally flexible, but slow in choice selection. The habitual system, in contrast, is fast in responding, but inflexible in adapting its behavioural strategy to new conditions. Based on these assumptions and using the computational theory of reinforcement learning, we propose a normative model for arbitration between the two processes that strikes an approximately optimal balance between search-time and accuracy in decision making. Behaviourally, the model can explain experimental evidence on behavioural sensitivity to outcome at the early stages of learning, but insensitivity at the later stages. It also explains that when two choices with equal incentive values are available concurrently, the behaviour remains outcome-sensitive, even after extensive training. Moreover, the model can explain choice reaction time variations during the course of learning, as well as the experimental observation that as the number of choices increases, the reaction time also increases. Neurobiologically, by assuming that phasic and tonic activities of midbrain dopamine neurons carry the reward prediction error and the average reward signals used by the model, respectively, the model predicts that whereas phasic dopamine indirectly affects behaviour through reinforcing stimulus-response associations, tonic dopamine can directly affect behaviour through manipulating the competition between the habitual and the goal-directed systems and thus, affect reaction time.
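The arbitration idea can be caricatured in a few lines: pick whichever controller has the higher expected payoff once the time it consumes is charged at the current average reward rate (the quantity the model links to tonic dopamine). The deliberation times, the habit's accuracy, and the value-gap proxy below are all assumptions.

```python
def arbitrate(value_gap, avg_reward_rate,
              t_habit=0.3, t_goal=1.0, p_habit_correct=0.8):
    """Choose a controller by netting accuracy against the opportunity cost of time.

    value_gap: payoff difference between the best and second-best option.
    p_habit_correct: the habit's accuracy; with training it rises, favouring habit.
    """
    v_habit = p_habit_correct * value_gap - avg_reward_rate * t_habit
    v_goal = 1.0 * value_gap - avg_reward_rate * t_goal   # deliberation is slow but exact
    return "habit" if v_habit >= v_goal else "goal-directed"

print(arbitrate(1.0, avg_reward_rate=0.1))  # goal-directed: time is cheap
print(arbitrate(1.0, avg_reward_rate=1.0))  # habit: a high tonic rate makes deliberation costly
```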

19.
Neurons in a small number of brain structures detect rewards and reward-predicting stimuli and are active during the expectation of predictable food and liquid rewards. These neurons code the reward information according to basic terms of various behavioural theories that seek to explain reward-directed learning, approach behaviour and decision-making. The involved brain structures include groups of dopamine neurons, the striatum including the nucleus accumbens, the orbitofrontal cortex and the amygdala. The reward information is fed to brain structures involved in decision-making and organisation of behaviour, such as the dorsolateral prefrontal cortex and possibly the parietal cortex. The neural coding of basic reward terms derived from formal theories puts the neurophysiological investigation of reward mechanisms on firm conceptual grounds and provides neural correlates for the function of rewards in learning, approach behaviour and decision-making.

20.
The prefrontal cortex subserves executive control and decision-making, that is, the coordination and selection of thoughts and actions in the service of adaptive behaviour. We present here a computational theory describing the evolution of the prefrontal cortex from rodents to humans as gradually adding new inferential Bayesian capabilities for dealing with a computationally intractable decision problem: exploring and learning new behavioural strategies versus exploiting and adjusting previously learned ones through reinforcement learning (RL). We provide a principled account identifying three inferential steps optimizing this arbitration through the emergence of (i) factual reactive inferences in paralimbic prefrontal regions in rodents; (ii) factual proactive inferences in lateral prefrontal regions in primates and (iii) counterfactual reactive and proactive inferences in human frontopolar regions. The theory clarifies the integration of model-free and model-based RL through the notion of strategy creation. The theory also shows that counterfactual inferences in humans lead to the notion of hypothesis testing, a critical reasoning ability for approximating optimal adaptive processes and presumably endowing humans with a qualitative evolutionary advantage in adaptive behaviour.
