Similar Documents
20 similar documents found.
1.
Reinforcement learning methods can be used in robotics applications, especially for specific target-oriented problems such as the reward-based recalibration of goal-directed actions. To this end, relatively large and continuous state-action spaces still need to be handled efficiently. The goal of this paper is therefore to develop a novel, rather simple method that uses reinforcement learning with function approximation in conjunction with different reward strategies for solving such problems. To test our method, we use a four-degree-of-freedom reaching problem in 3D space, simulated by a robot arm system consisting of two joints with two DOF each. Function approximation is based on overlapping 4D kernels (receptive fields), and the state-action space contains about 10,000 of these. Different types of reward structure are compared, for example reward-on-touching-only against reward-on-approach. Furthermore, forbidden joint configurations are punished, and a continuous action space is used. In spite of the rather large number of states and the continuous action space, these reward/punishment strategies allow the system to find a good solution, usually within about 20 trials. The efficiency of our method in this test scenario suggests that it could be used on a real robot for problems where mixed rewards can be defined and where other types of learning might be difficult. This work was supported by EU grant PACO-PLUS.
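The abstract does not give the update equations; a minimal sketch of the general idea (temporal-difference learning over overlapping Gaussian kernel features, with a shaped reward-on-approach signal and punishment of forbidden configurations) might look like the following. The kernel count matches the figure quoted above, but the kernel width, learning rate, discount factor, and reward values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Illustrative settings -- not the parameters used in the paper.
N_KERNELS = 10_000          # overlapping receptive fields tiling the 4D joint space
STATE_DIM = 4               # four joint angles (2 joints x 2 DOF)
SIGMA = 0.15                # kernel width (assumption)
ALPHA, GAMMA = 0.1, 0.95    # learning rate, discount factor (assumptions)

rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(N_KERNELS, STATE_DIM))
weights = np.zeros(N_KERNELS)            # value weights, one per kernel

def features(state):
    """Gaussian receptive-field activations for a 4D joint configuration."""
    d2 = np.sum((centers - state) ** 2, axis=1)
    phi = np.exp(-d2 / (2.0 * SIGMA ** 2))
    return phi / phi.sum()

def shaped_reward(dist_to_target, forbidden):
    """Reward-on-approach: graded reward for getting closer, punishment for
    forbidden joint configurations, and a bonus on touching the target."""
    if forbidden:
        return -1.0
    if dist_to_target < 0.05:
        return +1.0           # "touch" bonus
    return -0.1 * dist_to_target

def td_update(state, reward, next_state, done):
    """One TD(0) update of the kernel-weighted value estimate."""
    phi, phi_next = features(state), features(next_state)
    v, v_next = weights @ phi, weights @ phi_next
    target = reward + (0.0 if done else GAMMA * v_next)
    weights[:] += ALPHA * (target - v) * phi
```

A continuous action could then be chosen, for example, by sampling small joint-angle increments and picking the one with the highest estimated value.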

2.
Classic reinforcement learning (RL) theories cannot explain human behavior in the absence of external reward or when the environment changes. Here, we employ a deep sequential decision-making paradigm with sparse reward and abrupt environmental changes. To explain the behavior of human participants in these environments, we show that RL theories need to include surprise and novelty, each with a distinct role. While novelty drives exploration before the first encounter of a reward, surprise increases the rate of learning of a world-model as well as of model-free action-values. Even though the world-model is available for model-based RL, we find that human decisions are dominated by model-free action choices. The world-model is only marginally used for planning, but it is important to detect surprising events. Our theory predicts human action choices with high probability and allows us to dissociate surprise, novelty, and reward in EEG signals.
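The abstract does not spell out the update rules, so the following is only a rough, generic sketch of the two roles described (novelty adding an exploration bonus before reward is first found, surprise scaling the learning rate of model-free values); all functional forms and constants are assumptions, not the fitted model.

```python
import numpy as np

# Generic sketch -- the functional forms and constants below are assumptions
# for illustration, not the model fitted in the paper.
ALPHA_BASE = 0.1       # baseline learning rate
NOVELTY_BONUS = 0.5    # weight of the novelty-driven exploration bonus
GAMMA = 0.9            # discount factor

def novelty(visit_counts, state):
    """Novelty is high for rarely visited states and fades with experience."""
    return 1.0 / np.sqrt(1.0 + visit_counts[state])

def surprise(p_observed):
    """Surprise is large when the current world-model assigned low probability
    to the transition that was actually observed (Shannon surprise)."""
    return -np.log(max(p_observed, 1e-12))

def update_q(q, s, a, reward, s_next, visit_counts, p_observed):
    """Model-free value update: surprise scales the learning rate upward,
    and a novelty bonus encourages exploration before reward is found."""
    alpha = min(1.0, ALPHA_BASE * (1.0 + surprise(p_observed)))
    r_eff = reward + NOVELTY_BONUS * novelty(visit_counts, s_next)
    td_error = r_eff + GAMMA * q[s_next].max() - q[s, a]
    q[s, a] += alpha * td_error
    return q
```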

3.
A long-standing goal in artificial intelligence is creating agents that can learn a variety of different skills for different problems. In the artificial intelligence subfield of neural networks, a barrier to that goal is that when agents learn a new skill they typically do so by losing previously acquired skills, a problem called catastrophic forgetting. That occurs because, to learn the new task, neural learning algorithms change connections that encode previously acquired skills. How networks are organized critically affects their learning dynamics. In this paper, we test whether catastrophic forgetting can be reduced by evolving modular neural networks. Intuitively, modularity should reduce learning interference between tasks by separating functionality into physically distinct modules in which learning can be selectively turned on or off. Modularity can further improve learning by having a reinforcement learning module separate from sensory processing modules, allowing learning to happen only in response to a positive or negative reward. In this paper, learning takes place via neuromodulation, which allows agents to selectively change the rate of learning for each neural connection based on environmental stimuli (e.g. to alter learning in specific locations based on the task at hand). To produce modularity, we evolve neural networks with a cost for neural connections. We show that this connection cost technique causes modularity, confirming a previous result, and that such sparsely connected, modular networks have higher overall performance because they learn new skills faster while retaining old skills better, and because they have a separate reinforcement learning module. Our results suggest (1) that encouraging modularity in neural networks may help us overcome the long-standing barrier of networks that cannot learn new skills without forgetting old ones, and (2) that one benefit of the modularity ubiquitous in the brains of natural animals might be to alleviate the problem of catastrophic forgetting.

4.
In 202 male Wistar rats, the effect of chronic and acute deprivation of brain catecholamine (CA) system activity, produced by administration of 6-OHDA, on exploratory behaviour and learning was studied. Chronic deprivation of CA-system activity by neonatal administration of 6-OHDA (100 mg/kg subcutaneously) and acute deprivation by intracerebral administration of 6-OHDA to adult rats (150 µg into each lateral ventricle) were accompanied by similarly profound changes in behaviour. Both forms of deprivation reduced the exploratory activity of the animals in the open field. In both cases, these doses of 6-OHDA sharply impaired learning with emotionally negative reinforcement, while having no significant influence on learning with emotionally positive reinforcement. Both forms of deprivation of CA-system activity also weakened the frustration reaction elicited by a sharp reduction of food reinforcement.

5.
Recently, evidence has emerged that humans approach learning using Bayesian updating rather than (model-free) reinforcement algorithms in a six-arm restless bandit problem. Here, we investigate what this implies for human appreciation of uncertainty. In our task, a Bayesian learner distinguishes three equally salient levels of uncertainty. First, the Bayesian learner perceives irreducible uncertainty or risk: even knowing the payoff probabilities of a given arm, the outcome remains uncertain. Second, there is (parameter) estimation uncertainty or ambiguity: payoff probabilities are unknown and need to be estimated. Third, the outcome probabilities of the arms change: the sudden jumps are referred to as unexpected uncertainty. We document how the three levels of uncertainty evolved during the course of our experiment and how they affected the learning rate. We then zoom in on estimation uncertainty, which has been suggested to be a driving force in exploration, in spite of evidence of widespread aversion to ambiguity. Our data corroborate the latter. We discuss neural evidence that foreshadowed the ability of humans to distinguish between the three levels of uncertainty. Finally, we investigate the boundaries of human capacity to implement Bayesian learning. We repeat the experiment with different instructions, reflecting varying levels of structural uncertainty. Under this fourth notion of uncertainty, choices were no better explained by Bayesian updating than by (model-free) reinforcement learning. Exit questionnaires revealed that participants remained unaware of the presence of unexpected uncertainty and failed to acquire the right model with which to implement Bayesian updating.
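As a rough illustration (not the authors' exact model), the two learner types being contrasted in a restless Bernoulli bandit can be sketched as follows; the Beta prior and the hazard rate used to handle sudden jumps are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch of the two learner types compared in such tasks.
# The hazard rate and prior are assumptions, not the paper's settings.
HAZARD = 0.05   # assumed probability of a sudden jump in an arm's payoff probability

class BayesianArm:
    """Beta-Bernoulli estimate with a crude allowance for unexpected jumps:
    after each outcome the posterior is partially reset toward the prior."""
    def __init__(self):
        self.a, self.b = 1.0, 1.0             # Beta prior
    def update(self, outcome):                # outcome in {0, 1}
        self.a += outcome
        self.b += 1 - outcome
        # Mix with the prior in proportion to the hazard rate (change points).
        self.a = (1 - HAZARD) * self.a + HAZARD * 1.0
        self.b = (1 - HAZARD) * self.b + HAZARD * 1.0
    @property
    def mean(self):                           # expected payoff probability
        return self.a / (self.a + self.b)
    @property
    def estimation_uncertainty(self):         # posterior variance ("ambiguity")
        n = self.a + self.b
        return self.a * self.b / (n ** 2 * (n + 1))

class DeltaRuleArm:
    """Model-free reinforcement learner with a fixed learning rate."""
    def __init__(self, alpha=0.1):
        self.q, self.alpha = 0.5, alpha
    def update(self, outcome):
        self.q += self.alpha * (outcome - self.q)
```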

6.
In contemporary reinforcement learning models, reward prediction error (RPE), the difference between the expected and actual reward, is thought to guide action-value learning through the firing activity of dopaminergic neurons. Given the importance of dopamine in reward learning and the involvement of Akt1 in dopamine-dependent behaviors, the aim of this study was to investigate whether Akt1 deficiency modulates reward learning and the magnitude of RPE, using Akt1 mutant mice as a model. In comparison to wild-type littermate controls, the expression of Akt1 protein in mouse brain occurred in a gene-dosage-dependent manner, and Akt1 heterozygous (HET) mice exhibited impaired striatal Akt1 activity under methamphetamine challenge. No genotypic difference was found in the basal levels of dopamine and its metabolites. In a series of reward-related learning tasks, HET mice updated reward information from the environment relatively efficiently during the acquisition phase of the two natural reward tasks and in the reversal section of the dynamic foraging T-maze, but not in methamphetamine-induced or aversive-related reward learning. Implementation of a standard reinforcement learning model with Bayesian hierarchical parameter estimation shows that HET mice have higher RPE magnitudes and that their action values are updated more rapidly across all three test sections of the T-maze. These results indicate that Akt1 deficiency modulates natural reward learning and RPE. This study suggests a promising avenue for investigating RPE in mutant mice and provides evidence for a potential link from genetic deficiency, to neurobiological abnormalities, to impairment in higher-order cognitive functioning.
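The RPE referred to here is the standard quantity from such models; a minimal worked form, with an illustrative learning rate, is:

```python
# Standard reward prediction error (RPE) and action-value update used in
# simple reinforcement learning models; alpha = 0.2 is an illustrative value.
def rpe_update(q_value, reward, alpha=0.2):
    rpe = reward - q_value           # RPE: actual minus expected reward
    q_value = q_value + alpha * rpe  # larger alpha -> faster value updating
    return q_value, rpe
```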

7.
This paper investigates the effectiveness of spiking agents when trained with reinforcement learning (RL) in a challenging multiagent task. In particular, it explores learning through reward-modulated spike-timing-dependent plasticity (STDP) and compares it to reinforcement of stochastic synaptic transmission in the general-sum game of the Iterated Prisoner's Dilemma (IPD). More specifically, a computational model is developed in which we implement two spiking neural networks as two "selfish" agents learning simultaneously but independently, competing in the IPD game. The purpose of our system (or collective) is to maximise its accumulated reward in the presence of reward-driven competing agents within the collective. This can only be achieved when the agents engage in mutual cooperation during the IPD. Previously, we successfully applied reinforcement of stochastic synaptic transmission to the IPD game. The current study utilises reward-modulated STDP with an eligibility trace, and the results show that the system managed to exhibit the desired behaviour by establishing mutual cooperation between the agents. Notably, the cooperative outcome was attained after a relatively short learning period, which enhanced the accumulation of reward by the system. As in our previous implementation, the successful application of the learning algorithm to the IPD became possible only after we extended it with additional global reinforcement signals in order to enhance competition at the neuronal level. Moreover, it is shown that learning is enhanced (as indicated by an increased IPD cooperative outcome) through: (i) strong memory for each agent (regulated by a high eligibility-trace time constant) and (ii) firing irregularity produced by equipping the agents' leaky integrate-and-fire (LIF) neurons with a partial somatic reset mechanism.
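A generic sketch of reward-modulated STDP with an eligibility trace, the mechanism named in the abstract, is shown below; the time constants, amplitudes, and learning rate are illustrative assumptions and this is not the paper's full spiking model.

```python
import numpy as np

# Generic reward-modulated STDP with an eligibility trace.
# Time constants, amplitudes, and learning rate are illustrative assumptions.
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # STDP windows (ms)
TAU_ELIG = 500.0                   # eligibility-trace time constant (ms); larger = "stronger memory"
A_PLUS, A_MINUS = 1.0, 1.0
ETA = 0.01                         # learning rate

def stdp(dt):
    """Pair-based STDP kernel: dt = t_post - t_pre (ms)."""
    if dt >= 0:
        return A_PLUS * np.exp(-dt / TAU_PLUS)    # pre before post -> potentiation
    return -A_MINUS * np.exp(dt / TAU_MINUS)      # post before pre -> depression

def step(weight, elig, dt_pairs, reward, dt_ms=1.0):
    """One simulation step: decay the trace, add STDP contributions from new
    spike pairs, and convert the trace into a weight change only when a
    reward (or global reinforcement signal) arrives."""
    elig *= np.exp(-dt_ms / TAU_ELIG)
    for dt in dt_pairs:                 # timing differences of new pre/post spike pairs
        elig += stdp(dt)
    weight += ETA * reward * elig       # reward gates the actual plasticity
    return weight, elig
```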

8.
Operant learning requires that reinforcement signals interact with action representations at a suitable neural interface. Much evidence suggests that this occurs when phasic dopamine, acting as a reinforcement prediction error, gates plasticity at cortico-striatal synapses, and thereby changes the future likelihood of selecting the action(s) coded by striatal neurons. But this hypothesis faces serious challenges. First, cortico-striatal plasticity is inexplicably complex, depending on spike timing, dopamine level, and dopamine receptor type. Second, there is a credit assignment problem: action selection signals occur long before the consequent dopamine reinforcement signal. Third, the two types of striatal output neuron have apparently opposite effects on action selection. Whether these factors rule out the interface hypothesis and how they interact to produce reinforcement learning is unknown. We present a computational framework that addresses these challenges. We first predict the expected activity changes over an operant task for both types of action-coding striatal neuron, and show they co-operate to promote action selection in learning and compete to promote action suppression in extinction. Separately, we derive a complete model of dopamine and spike-timing dependent cortico-striatal plasticity from in vitro data. We then show this model produces the predicted activity changes necessary for learning and extinction in an operant task, a remarkable convergence of a bottom-up data-driven plasticity model with the top-down behavioural requirements of learning theory. Moreover, we show the complex dependencies of cortico-striatal plasticity are not only sufficient but necessary for learning and extinction. Validating the model, we show it can account for behavioural data describing extinction, renewal, and reacquisition, and replicate in vitro experimental data on cortico-striatal plasticity. By bridging the levels between the single synapse and behaviour, our model shows how striatum acts as the action-reinforcement interface.
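The observation that the two striatal output populations cooperate during learning and compete during extinction is often captured, in much simpler models than the one derived in this paper, by opponent "Go"/"No-Go" value updates gated by a dopamine-like prediction error. The sketch below shows that common abstraction only; the learning rates are illustrative and none of this reproduces the paper's plasticity model.

```python
# Common opponent-pathway abstraction (not the paper's detailed spiking model):
# "Go" (D1-like) and "No-Go" (D2-like) striatal populations are updated with
# opposite sensitivity to the dopamine-like reward prediction error.
# Learning rates are illustrative assumptions.
def opponent_update(go, nogo, reward, alpha_go=0.1, alpha_nogo=0.1):
    expected = go - nogo                  # net action value read out at selection
    rpe = reward - expected               # dopamine-like prediction error
    go += alpha_go * rpe                  # positive RPE strengthens "Go"...
    nogo -= alpha_nogo * rpe              # ...and weakens "No-Go"; negative RPE reverses this
    return max(go, 0.0), max(nogo, 0.0)   # keep firing-rate-like quantities non-negative
```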

9.
In an elevated maze consisting of three reconvergent radial arms, golden hamsters were tested with the same experimental rule: to choose each path without repeating any choice. However, variations of procedure concerning (a) the location of the reward in the maze and (b) the reinforcement contingencies were introduced in order to define several problems involving variable levels of difficulty. The relationship between response strategies and the difficulty of the task was then studied. The common learning criterion was the achievement of three consecutive correct daily sessions, each session corresponding to a particular sequence (pattern) of choices of paths. Response strategies were studied by analyzing the patterns obtained over the three final sessions in which an animal reached the learning criterion. Such a set of patterns (triplet) could be heterogeneous (patterns all different), mixed (two identical patterns, one different) or stereotyped (identical patterns). No relationship was found between the mean level of difficulty presented by each learning problem and the occurrence of a particular type of triplet. However, in each situation, mixed triplets were the most frequently recorded and corresponded to medium individual speeds of learning, whereas heterogeneous triplets corresponded to rapid successes and stereotyped triplets to delayed successes. These findings indicate that, whatever the problem designed to be tested in a three-arm maze, the various forms of solution reflect different individual adaptive mechanisms.

10.
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses about neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations that are noisy or that occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problems is formally known as partially observable reinforcement learning (PORL), a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous, high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing, which can only be discovered through such a top-down approach.
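The free energy of a restricted Boltzmann machine with binary hidden units, the quantity the spiking network is used to approximate, has a standard closed form; the helper below computes it, and the comment about using its negative as an action value reflects the common convention in such approaches rather than details taken from this paper.

```python
import numpy as np

def rbm_free_energy(v, W, a, b):
    """Free energy of a binary RBM with visible vector v, weights W,
    visible biases a and hidden biases b:
        F(v) = -a.v - sum_j log(1 + exp(b_j + (W^T v)_j)).
    In RL uses of this idea, v typically encodes a state-action pair and
    -F(v) serves as the action-value estimate (general convention; the
    paper's exact encoding is not reproduced here)."""
    hidden_input = b + W.T @ v
    return -(a @ v) - np.sum(np.log1p(np.exp(hidden_input)))

# Example with random parameters (illustrative only).
rng = np.random.default_rng(0)
n_visible, n_hidden = 12, 6
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))
a, b = np.zeros(n_visible), np.zeros(n_hidden)
v = rng.integers(0, 2, size=n_visible).astype(float)
q_estimate = -rbm_free_energy(v, W, a, b)
```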

11.
In the present study, based on our own experimental investigation of cellular mechanisms of memory in the nervous system of gastropod molluscs and on literature data, we try to single out several principles of nervous system function essential for describing the mechanisms of learning and memory. The main learning-related changes in function occur in the efficacy of synaptic inputs and in the intrinsic properties of postsynaptic neurons. With learning, some synaptic inputs selectively change their efficacy through pre- and postsynaptic modifications, but the induction of plasticity always starts postsynaptically, and maintenance of long-term memory has also been shown to be postsynaptic. Reinforcement is not mediated by activity of the receptor-sensory neuron-interneuron-motoneuron-effector chain; rather, it is mediated by the activity of modulatory neurons and in some cases can be exerted by a single neuron. The activity of modulatory neurons is necessary for the development of plastic modifications of behavior (including associative ones) but is not needed for recall of conditioned responses; at the same time, the modulatory neurons (which in effect constitute a neural reinforcement system) are necessary for recall of context-dependent associative memory. Learning-related changes occur in at least two independent loci in the nervous system. Finally, the possibility of memory erasure involving nitric oxide is supported experimentally and theoretically.

12.
13.
Ninety-eight Sprague-Dawley rats, implanted with electrodes in the mesencephalic tegmentum (reticular activating system, RAS), served as subjects in two experiments. In the first experiment (n = 42) we investigated the effects of RAS stimulation (5 μA, 300 Hz, 90 s in duration) on the acquisition of a positively reinforced light–dark discrimination in a T-maze. In the second experiment (n = 56) the reinforcement and the treatment were dissociated by comparing the effects of RAS stimulation administered after correct or incorrect choices during the same discrimination task. In the two experiments, despite large differences in learning conditions, the results show considerable learning facilitation when the RAS stimulation was administered immediately after each trial. This facilitation does not seem to be due to an interaction between reinforcement and stimulation, since the results of experiment 2 show the maximum facilitation in animals stimulated after each (non-reinforced) error, compared to subjects stimulated after each (reinforced) correct choice. These results are discussed both in terms of consolidation processes and in terms of a comparison of the cue values of S+ and S- in a discriminative learning situation.

14.
Kahnt T, Grueschow M, Speck O, Haynes JD. Neuron. 2011;70(3):549-559.
The dominant view that perceptual learning is accompanied by changes in early sensory representations has recently been challenged. Here we tested the idea that perceptual learning can be accounted for by reinforcement learning involving changes in higher decision-making areas. We trained subjects on an orientation discrimination task involving feedback over 4 days, acquiring fMRI data on the first and last day. Behavioral improvements were well explained by a reinforcement learning model in which learning leads to enhanced readout of sensory information, thereby establishing noise-robust representations of decision variables. We found stimulus orientation encoded in early visual and higher cortical regions such as lateral parietal cortex and anterior cingulate cortex (ACC). However, only activity patterns in the ACC tracked changes in decision variables during learning. These results provide strong evidence for perceptual learning-related changes in higher-order areas and suggest that perceptual and reward learning are based on a common neurobiological mechanism.

15.
Recent evidence from cerebellum-dependent motor learning and amygdala-dependent fear conditioning indicates that, despite being mediated by different brain systems, these forms of learning might use a similar sequence of events to form new memories. In each case, learning seems to induce changes in two different groups of neurons. Changes in the first class of cells are induced very rapidly during the initial stages of learning, whereas changes in the second class of cells develop more slowly and are resistant to extinction. So, anatomically distinct cell populations might contribute differentially to the initial encoding and the long-term storage of memory in these two systems.

16.
Changes in serotonin (5-HT) and noradrenaline (NA) content in the brains of rats trained with emotionally positive (food) and emotionally negative (pain) reinforcement were compared. Learning was accompanied by an increase in biogenic amine levels, but training with emotionally positive reinforcement produced a greater increase in 5-HT and NA levels than training with emotionally negative reinforcement. Training with food reinforcement was accompanied by an elevation of 5-HT chiefly in the cerebral cortex, apparently reflecting active functioning of the serotonergic system, whereas training in the defensive situation was accompanied by an intensification of the NA system. It was concluded that the character of the changes in brain biogenic amine levels during learning depends on the emotions experienced by the animal, that is, on the emotional reinforcement used.

17.
Model-based analysis of fMRI data is an important tool for investigating the computational role of different brain regions. With this method, theoretical models of behavior can be leveraged to find the brain structures underlying variables from specific algorithms, such as prediction errors in reinforcement learning. One potential weakness with this approach is that models often have free parameters, and thus the results of the analysis may depend on how these free parameters are set. In this work we asked whether this hypothetical weakness is a problem in practice. We first developed general closed-form expressions for the relationship between the results of fMRI analyses using different regressors, e.g., one corresponding to the true process underlying the measured data and one a model-derived approximation of the true generative regressor. Then, as a specific test case, we examined the sensitivity of model-based fMRI to the learning rate parameter in reinforcement learning, both in theory and in two previously published datasets. We found that even gross errors in the learning rate lead to only minute changes in the neural results. Our findings thus suggest that precise model fitting is not always necessary for model-based fMRI. They also highlight the difficulty of using fMRI data to arbitrate between different models or model parameters. While these specific results pertain only to the effect of learning rate in simple reinforcement learning models, we provide a template for testing for effects of different parameters in other models.
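A small simulation, illustrative only and unrelated to the paper's datasets, makes the core point concrete: prediction-error regressors generated under quite different learning rates tend to be highly correlated, so mis-setting the rate changes a model-based regressor only modestly.

```python
import numpy as np

# Illustrative simulation (not the paper's data): prediction-error regressors
# from a simple delta-rule model under two different learning rates are
# typically highly correlated.
rng = np.random.default_rng(1)
n_trials = 200
rewards = rng.binomial(1, 0.7, size=n_trials).astype(float)   # assumed task statistics

def prediction_errors(rewards, alpha):
    """Trial-by-trial prediction errors of a delta-rule value learner."""
    v, pes = 0.5, []
    for r in rewards:
        pe = r - v
        pes.append(pe)
        v += alpha * pe
    return np.array(pes)

pe_true = prediction_errors(rewards, alpha=0.3)    # "true" learning rate (assumption)
pe_wrong = prediction_errors(rewards, alpha=0.05)  # grossly mis-set learning rate
corr = np.corrcoef(pe_true, pe_wrong)[0, 1]
print(f"correlation between regressors: {corr:.2f}")  # usually well above 0.9 in settings like this
```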

18.
To produce skilled movements, the brain flexibly adapts to different task requirements and movement contexts. Two core abilities underlie this flexibility. First, depending on the task, the motor system must rapidly switch the way it produces motor commands and how it corrects movements online, i.e. it switches between different (feedback) control policies. Second, it must also adapt to environmental changes for different tasks separately. Here we show that these two abilities are related. In a bimanual movement task, we show that participants can switch on a movement-by-movement basis between two feedback control policies, depending only on a static visual cue. When this cue indicates that the hands control separate objects, reactions to force-field perturbations of each arm are purely unilateral. In contrast, when the visual cue indicates a commonly controlled object, reactions are shared across hands. Participants are also able to learn different force fields associated with a visual cue. This is, however, only the case when the visual cue is associated with different feedback control policies. These results indicate that when the motor system can flexibly switch between different control policies, it is also able to adapt separately to the dynamics of different environmental contexts. In contrast, visual cues that are not associated with different control policies are not effective for learning different task dynamics.

19.
While vocal learning has been studied extensively in birds and mammals, little effort has been made to define what exactly constitutes vocal learning and to classify the forms that it may take. We present such a theoretical framework for the study of social learning in vocal communication. We define different forms of social learning that affect communication and discuss the required methodology to show each one. We distinguish between contextual and production learning in animal communication. Contextual learning affects the behavioural context or serial position of a signal. It can affect both usage and comprehension. Production learning refers to instances where the signals themselves are modified in form as a result of experience with those of other individuals. Vocal learning is defined as production learning in the vocal domain. It can affect one or more of three systems: the respiratory, phonatory and filter systems. Each involves a different level of control over the sound production apparatus. We hypothesize that contextual learning and respiratory production learning preceded the evolution of phonatory and filter production learning. Each form of learning potentially increases the complexity of a communication system. We also found that unexpected genetic or environmental factors can have considerable effects on vocal behaviour in birds and mammals and are often more likely to cause changes or differences in vocalizations than investigators may assume. Finally, we discuss how production learning is used in innovation and invention, and present important future research questions. Copyright 2000 The Association for the Study of Animal Behaviour.

20.
The present paper discusses an optimal learning control method using reinforcement learning for biological systems with a redundant actuator. It is difficult to apply reinforcement learning to biological control systems because of the redundancy in muscle activation space. We solve this problem with the following method. First, we divide the control input space into two subspaces according to a priority order of learning and restrict the search noise for reinforcement learning to the first-priority subspace. Then the constraint is relaxed as learning progresses, with the search space extending to the second-priority subspace. The higher-priority subspace is designed so that the impedance of the arm can be high. A smooth reaching motion is obtained through reinforcement learning without any previous knowledge of the arm's dynamics.
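A rough sketch of the prioritized-search idea described here, with exploration noise first restricted to a preferred subspace of muscle-activation space and then gradually extended, is given below; the dimensions, the choice of subspace, and the annealing schedule are illustrative assumptions.

```python
import numpy as np

# Illustrative sketch of prioritized exploration for a redundant actuator.
# Dimensions, subspace choice, and annealing schedule are assumptions.
rng = np.random.default_rng(0)
n_muscles = 6
# Assume the first-priority subspace (e.g., activation directions that keep
# arm impedance high) is spanned by the columns of B1.
B1 = np.linalg.qr(rng.normal(size=(n_muscles, 2)))[0]   # 2-D priority subspace
P1 = B1 @ B1.T                                          # projector onto that subspace

def exploration_noise(progress, sigma=0.1):
    """Search noise added to the policy's muscle activations.
    progress in [0, 1]: early on, the noise lies only in the priority subspace;
    as learning progresses, the full activation space is gradually opened up."""
    full_noise = sigma * rng.normal(size=n_muscles)
    restricted = P1 @ full_noise
    return (1.0 - progress) * restricted + progress * full_noise

# Schematic use inside a reinforcement learning loop:
# action = policy_mean(state) + exploration_noise(progress=trial / n_trials)
```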
