Reinforcement learning
1 Reinforcement learning
How to learn to make decisions in sequential problems (like chess, or a maze). Why is this difficult? Temporal credit assignment. Prediction can help.
Further reading. For modeling: Chapter 9, Dayan & Abbott, Theoretical Neuroscience (but very mathematical). For dopamine: Schultz W (2002) Getting formal with dopamine and reward. Neuron 36:241-263. For algorithms: Sutton RS & Barto AG, Reinforcement Learning: An Introduction.
Classical conditioning: pair a stimulus (bell, light) with a significant event (food, shock); measure anticipatory behavior (salivation, freezing).
2 Classical conditioning
CS: conditioned stimulus (S) - the paired stimulus (bell, light).
US: unconditioned stimulus (reinforcement r) - the significant event (food, shock).
CR: conditioned response - the measured anticipatory behavior (salivation, freezing).
Rescorla-Wagner rule (1972). Experimental terms:
Acquisition:     Phase 1: S → r.                      Test: S? → response.
Extinction:      Phase 1: S → r.   Phase 2: S → -.    Test: S? → no response.
Partial reinf.:  Phase 1: S → r or -.                 Test: S? → weak response.
Simple delta-rule model, with S=1 if the stimulus is present (S=0 if not):
w ← w + εSδ;   δ = r - wS,   i.e. error-driven learning.
Equivalently, if S=1: w ← (1-ε)w + εr.
So w comes to estimate the reinforcement r associated with S; it minimizes the error <(r - wS)²> whenever S is present.
[Diagram: stimulus S predicts reinforcement r through weight w.]
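As a worked illustration of the delta rule on this slide, here is a minimal Python sketch (the learning rate ε = 0.1 and the trial counts are illustrative assumptions, not values from the lecture):

```python
import random

def rescorla_wagner(rewards, eps=0.1):
    """Delta rule w <- w + eps*(r - w), with the stimulus present (S=1)
    on every trial; returns the trajectory of the weight w."""
    w, history = 0.0, []
    for r in rewards:
        delta = r - w          # prediction error
        w += eps * delta       # error-driven update
        history.append(w)
    return history

# Acquisition (r=1) then extinction (r=0): w rises toward 1, then decays to 0.
acq_ext = rescorla_wagner([1.0] * 50 + [0.0] * 50)

# Partial reinforcement (<r>=0.5): w settles near 0.5, hence a weak response.
partial = rescorla_wagner([random.choice([0.0, 1.0]) for _ in range(300)])

print(f"after acquisition: {acq_ext[49]:.2f}; after extinction: {acq_ext[-1]:.2f}")
print(f"partial reinforcement asymptote: {partial[-1]:.2f}")
```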
3 Rescorla-Wagner rule (continued)
[Plot: learning curves for w under acquisition (r=1), extinction (r=0), and partial reinforcement (<r>=0.5).]
Multiple stimuli. What about when multiple stimuli are present, e.g. S1, S2 → r? How would animals respond to S1 or S2 alone? How should the model be modified?
w_i ← w_i + εS_iδ_i, with either:
(a) δ_i = r - w_iS_i, i.e. a separate error term for each S_i; or
(b) δ_i = δ = r - V, where V = Σ_i w_iS_i is the expected reinforcement given all the stimuli, i.e. a single error term for all stimuli: δ is the difference between the actual r and the expected r (V).
4 Experiments with multiple stimuli
Overshadowing: Phase 1: S1, S2 → r.                      Test: S1? → weak response.
Blocking:      Phase 1: S1 → r.   Phase 2: S1, S2 → r.   Test: S2? → no response.
Which model is favoured? Blocking (Kamin, 1969) and overshadowing (Kamin, 1969; Pavlov, 1927) imply (b): δ_i = δ = r - V, with V = Σ_i w_iS_i, i.e. a single error term for all stimuli (the Rescorla-Wagner rule): δ is the difference between the reinforcement and the expected reinforcement given all the stimuli. A simulation sketch follows below.
Second-order conditioning: Phase 1: S1 → r. Phase 2: S2 → S1. Test: S2? What happens? S2 comes to elicit the response, as if S2 → r. The R-W rule does not work here: there is no r in Phase 2. One problem is time (it is important that S2 precede S1 for the effect) => the temporal credit assignment problem.
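To see how the single summed error term (model b) produces blocking, here is a small sketch (the stimulus coding, ε, and trial counts are illustrative assumptions):

```python
def rw_summed_error(trials, n=2, eps=0.2):
    """Rescorla-Wagner with one shared error term (model b):
    delta = r - V with V = sum_i w_i * S_i. Each trial is (S, r)."""
    w = [0.0] * n
    for S, r in trials:
        V = sum(wi * si for wi, si in zip(w, S))
        delta = r - V                    # single error for all stimuli
        for i in range(n):
            w[i] += eps * S[i] * delta
    return w

# Blocking: Phase 1 trains S1 alone; Phase 2 pairs S1+S2 with the same reward.
phase1 = [((1, 0), 1.0)] * 50
phase2 = [((1, 1), 1.0)] * 50
w1, w2 = rw_summed_error(phase1 + phase2)
print(f"w1 = {w1:.2f}, w2 = {w2:.2f}")   # w1 ~ 1, w2 ~ 0: S2 is blocked
```

Under model (a), with separate errors δ_i = r - w_iS_i, w2 would also rise toward 1 in Phase 2, so no blocking would be predicted.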
5 Temporal-difference learning (Sutton)
We need to predict the sum of future rewards, not just the immediate reward, so that we can learn S2 → S1: i.e. we want V(t) = <Σ_{t'≥t} r(t')>.
If V(t) = Σ_i w_i(t)S_i, how can we learn the right w_i?
If V(t) = <Σ_{t'≥t} r(t')>, then V(t) = r(t) + V(t+1): the current reward plus the estimate of subsequent reward.
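Spelled out, the recursion on the last line is just the sum split into its first term plus the remainder (treating r(t) as its expected value, as the slide's notation does):

```latex
V(t) = \Big\langle \sum_{t' \ge t} r(t') \Big\rangle
     = r(t) + \Big\langle \sum_{t' \ge t+1} r(t') \Big\rangle
     = r(t) + V(t+1)
```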
6 Temporal-difference learning (continued)
So use the delta rule to ensure that this happens, i.e. modify the connection weights to make V(t) closer to r(t) + V(t+1), using:
w_i(t+1) = w_i(t) + εS_iδ(t);   δ(t) = [r(t) + V(t+1)] - V(t),
so that δ is the difference between V(t) and the estimate of all future reward.
Compare with the R-W rule, where δ(t) = r(t) - V(t): there δ is the difference between V and the current reward only.
Time. We need a representation of time as an input: code time as the sequential activity of a set of input neurons, so that the weights w_1...w_N from these neurons can represent the expectations V(t_1), V(t_2), V(t_3), ...
[Diagrams: a chain of time-coding input units t_1, t_2, t_3 with weights w_1...w_N onto the value unit, shown at three successive times.]
7 Temporal difference learning rule: w_i ← w_i + εS_iδ(t);   δ(t) = r(t) + V(t+1) - V(t).
Suppose an unexpected reward r=1 arrives at time t_r (and take ε = 1, as the numbers below imply).
Trial 1: δ(t_r) = 1 + 0 - 0 = 1, so the weight from the time cell active at t_r grows and V(t_r) → 1.
Trial 2: δ(t_r - 1) = 0 + 1 - 0 = 1;   δ(t_r) = 1 + 0 - 1 = 0.
8 Trial 3: δ(t_r - 2) = 0 + 1 - 0 = 1;   δ(t_r - 1) = 0 + 1 - 1 = 0;   δ(t_r) = 1 + 0 - 1 = 0.
Many trials later, the expected future reward V increases as soon as the stimulus occurs: δ(t_s) = 0 + 1 - 0 = 1 at the stimulus and δ(t_s + 1) = 0 + 1 - 1 = 0 thereafter; the prediction error has migrated back from the reward to its earliest predictor.
Does dopamine signal δ(t)? Dopamine is implicated in reward, self-stimulation, addiction, motor control, and synaptic plasticity.
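The trial-by-trial numbers on slides 7-8 can be reproduced with a short simulation over a tapped delay line of time cells (assuming, as those numbers imply, a learning rate ε = 1 and weight updates applied at the end of each trial):

```python
def td_trial(w, r, eps=1.0):
    """One trial of TD learning on a tapped delay line: w[t] is the value
    V(t) carried by the time cell active at step t. Returns the deltas."""
    T = len(w)
    deltas = []
    for t in range(T):
        V_next = w[t + 1] if t + 1 < T else 0.0
        deltas.append(r[t] + V_next - w[t])   # delta(t) = r(t) + V(t+1) - V(t)
    for t in range(T):                        # batch update after the trial
        w[t] += eps * deltas[t]
    return deltas

T, t_reward = 10, 9
w = [0.0] * T                # weights from the time cells
r = [0.0] * T
r[t_reward] = 1.0            # reward at the final time step

for trial in range(1, 4):
    deltas = td_trial(w, r)
    peak = max(range(T), key=lambda t: deltas[t])
    print(f"trial {trial}: delta peaks at t = {peak}")
# trial 1: t = 9 (the reward); trial 2: t = 8; trial 3: t = 7 - the
# prediction error marches back one step per trial, as on the slides.
```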
9 Dopamine responses interpreted as δ(t) (Schultz, Montague & Dayan, Science, 1997)
Burst to an unexpected reward: r(t) arrives with no prediction, so δ(t) = r(t) + V(t+1) - V(t) > 0.
After many trials the response transfers to the predictors: V(t+1) rises at the stimulus, so δ is positive there and near zero at the now-predicted reward.
Pause at the time of an omitted reward: r(t) = 0 where V(t) predicted it, so δ(t) = 0 + V(t+1) - V(t) < 0.
[Figure: dopamine-neuron rasters for an unexpected reward, a predicted reward, and an omitted reward.]
10 Increasing the probability of reward (given the stimulus) produces more dopamine response to the stimulus. Partial reinforcement task (Fiorillo, Tobler & Schultz): accords with TD models, since with e.g. p=0.5 the stimulus raises the expected future reward V(t+1) only part of the way, and δ(t) = r(t) + V(t+1) - V(t) is split between stimulus and reward.
Action choice. In operant (aka instrumental) conditioning, rewards are contingent on actions (e.g., a lever press); cf. classical conditioning, where rewards depend only on stimuli and we just model the expectation of reward.
Consider a simple bee foraging problem: choose between yellow and blue flowers, each paying off probabilistically with different amounts of nectar, r_y versus r_b. Bees rapidly learn to choose the richer colour, and re-learn when r_y and r_b are switched; see the sketch below.
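A minimal sketch of such a forager, assuming softmax choice between the two colours (inverse temperature β) and, for simplicity, a fixed nectar volume per visit rather than probabilistic payoffs:

```python
import math, random

def bee_foraging(r_y=1.0, r_b=2.0, trials=400, eps=0.1, beta=2.0):
    """Keep a nectar estimate m for each colour, choose by softmax,
    and update the chosen estimate with the delta rule."""
    m = {"yellow": 0.0, "blue": 0.0}
    for t in range(trials):
        if t == trials // 2:                 # experimenter swaps the payoffs
            r_y, r_b = r_b, r_y
        p_yellow = 1 / (1 + math.exp(-beta * (m["yellow"] - m["blue"])))
        colour = "yellow" if random.random() < p_yellow else "blue"
        nectar = r_y if colour == "yellow" else r_b
        m[colour] += eps * (nectar - m[colour])   # Rescorla-Wagner update
    return m

print(bee_foraging())   # after the swap, the bee re-learns to favour yellow
```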
11 Modeling action choice
Actor-critic architecture: use the value function V to decide on actions so as to maximize expected reward. If V is correct, it allows future reward to be taken into account in reinforcing actions, solving the temporal credit assignment problem (we can also evaluate whether actions are good or bad using δ(t) even when r = 0).
[Diagram: the state feeds a critic (value v) and an actor (choice of action b or y); dopamine? broadcasts δ(t) = r(t) + v(t+1) - v(t) to both.]
Sequential choice. Suppose we know how to learn what to do at B and C (e.g. choose the action associated with maximum r). How do we solve the temporal credit assignment problem at A? It requires second-order conditioning: A-B-r and A-C-r. Use temporal difference learning to learn the values (expected future reward) at B and C; use these to learn the best action at A ("direct actor"): the actor/critic architecture.
[Diagram: state or stimulus input projects through critic weights to the value v(t) and through actor weights to the output o(t), or action; δ(t) = r(t) + v(t+1) - v(t), dopamine?]
12 State evaluation
The critic learns to estimate the value of the states S_i(t): V = <Σ_{t'>t} r(t')>. Given an initially random choice of actions L/R (the "policy"), the values V are learned by changing w_i with the temporal difference rule:
w_i ← w_i + εS_iδ(t);   δ(t) = r(t) + V(t+1) - V(t).
[Diagram: a binary maze with start A branching to B and C, which branch to leaves D, E, F and G with rewards; critic weights w_A ... w_E; plot of the probability of turning left at A.]
Policy improvement
The actor learns to act, using V to calculate δ and δ to assess actions: δ(t) = r(t) + V(t+1) - V(t).
If left at A:  δ(t) = V(B) - V(A) = 0.75.
If right at A: δ(t) = V(C) - V(A) = -0.75.
Thus the actor should come to choose left more frequently from A, via w_i ← w_i + εS_iδ(t) on the actor weights. A sketch of the full loop follows below.
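Here is a sketch of the full actor/critic loop on this maze. The leaf rewards (0 and 5 under B; 0 and 2 under C) follow the version of this example in Dayan & Abbott ch. 9 and, like the softmax actor, are assumptions for illustration:

```python
import math, random

LEAF_REWARD = {("B", "L"): 0.0, ("B", "R"): 5.0,
               ("C", "L"): 0.0, ("C", "R"): 2.0}

V = {"A": 0.0, "B": 0.0, "C": 0.0}               # critic values
pref = {s: {"L": 0.0, "R": 0.0} for s in V}      # actor action preferences
eps = 0.1

def choose(s, beta=1.0):
    """Softmax over the two action preferences at state s."""
    p_left = 1 / (1 + math.exp(-beta * (pref[s]["L"] - pref[s]["R"])))
    return "L" if random.random() < p_left else "R"

for trial in range(5000):
    a0 = choose("A")
    s1 = "B" if a0 == "L" else "C"
    delta = 0.0 + V[s1] - V["A"]     # no reward on the first step
    V["A"] += eps * delta            # critic: state evaluation
    pref["A"][a0] += eps * delta     # actor: policy improvement
    a1 = choose(s1)
    r = LEAF_REWARD[(s1, a1)]
    delta = r + 0.0 - V[s1]          # terminal step: V(t+1) = 0
    V[s1] += eps * delta
    pref[s1][a1] += eps * delta

print({s: round(v, 2) for s, v in V.items()})
print({a: round(p, 2) for a, p in pref["A"].items()})   # L comes to dominate
```

Note that the same δ trains both sets of weights, exactly as the diagram suggests: one prediction-error signal, two targets.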
13 Back to dopamine
Dual dopamine systems project the same signal to motivational (ventral striatum) and motor (dorsal striatum) areas - for state evaluation and policy improvement respectively?
[Figure: macaque brain showing the caudate nucleus and putamen (actor?) and the nucleus accumbens (critic?).]
Striatal targets of DA show a prediction-error-like signal for reinforcement learning (e.g. Seymour et al., 2004).
14 Summary of the temporal difference (sequential reinforcement) solution to the temporal credit assignment problem
In A_RP (associative reward-penalty), the strength of the smell of the tree gives the reinforcement with which to evaluate any action (i.e. output o(t)): does it lead closer to the goal or not?
If reinforcement is intermittent (e.g. delivered only when the goal is reached), a critic learns an evaluation function: the value v of each state (or input) S is the expected future reward from that state (given how actions are usually made). The change in value, v(t+1) - v(t), plus any reward, is used to evaluate any action o(t), so the output weights w can be modified as in A_RP, using δ(t) = r(t) + v(t+1) - v(t) (if δ(t) > 0, action o(t) was good):
w ← w + εS(t)δ(t).
But how to learn v? A simple learning rule creates connection weights w so that v(t) = w·S(t). This rule is:
w ← w + εS(t)δ(t),
i.e. you can use the same δ to learn the weights of the critic!
[Diagram: actor/critic - state or input S(t) (e.g. place cells) projects through critic weights to the value v(t) and through actor weights to the output o(t), an action (North, South, East or West); δ(t) = r(t) + v(t+1) - v(t), dopamine?]
Introduction to temporal difference learning (spatial example). Use δ(t) = r(t) + v(t+1) - v(t) to evaluate actions; use v(t) = w·S(t), with w ← w + εS(t)δ(t), to learn the value function.
The agent first encounters the goal at location x: r=1, so the weight to the critic from the place cell at x increases. Now v(x)=1.
The next time it reaches x from a neighbouring location x': v(t+1) - v(t) = 1, so the weight from the place cell at x' increases. Now v(x')=1.
The next time it reaches x' from x'', the weight from the place cell at x'' increases, and so on: the value spreads back from the goal.
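A sketch of this spatial example, with one "place cell" per grid location so v is simply a lookup table. The discount factor γ is an assumption that anticipates the next slide; without it, the undiscounted values would eventually saturate at 1 everywhere rather than remaining graded with distance from the goal:

```python
import random

N, GOAL, eps, gamma = 5, (4, 4), 0.5, 0.9
v = {(i, j): 0.0 for i in range(N) for j in range(N)}
moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]       # North, South, East, West

for trial in range(500):
    x = (0, 0)
    while x != GOAL:
        dx = random.choice(moves)                # random exploratory policy
        x2 = (min(max(x[0] + dx[0], 0), N - 1),  # clip at the walls
              min(max(x[1] + dx[1], 0), N - 1))
        r = 1.0 if x2 == GOAL else 0.0
        v_next = 0.0 if x2 == GOAL else v[x2]
        v[x] += eps * (r + gamma * v_next - v[x])   # TD update of the critic
        x = x2

print(f"next to the goal: v = {v[(4, 3)]:.2f}; at the start: v = {v[(0, 0)]:.2f}")
```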
15 Temporal difference learning is a slow, trial-and-error-based method. Development of its own value function overcomes the problem of temporal credit assignment.
[Figure: agent in a grid with a goal and a barrier; the learned value function after initial exploration (e.g. 100 runs to the goal) and after 2000 runs.]
Initial actions will be random (it takes a long time to get to the goal); actions improve as a result of learning, as in the A_RP action network (but a random element is required to enable ongoing improvement, i.e. finding new, shorter routes). Dayan (1991) Neural Information Processing Systems 3, Morgan Kaufmann.
Temporal discounting. Often, we value more (temporally) distant rewards less than immediate rewards of the same size. We can do this with a small change to the reinforcement learning rule: we want to predict the sum of future rewards, discounted by how long we have to wait for them, i.e. V(t) = <Σ_{t'≥t} γ^(t'-t) r(t')> with γ < 1.
If V(t) = <Σ_{t'≥t} γ^(t'-t) r(t')>, then V(t) = r(t) + γV(t+1).
So use the delta rule to make V(t) closer to r(t) + γV(t+1), i.e. use:
w_i(t+1) = w_i(t) + εS_iδ(t);   δ(t) = [r(t) + γV(t+1)] - V(t),
so that δ is the difference between V(t) and the estimate of all (temporally discounted) future reward.
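The discounted value obeys the same one-step recursion, by the same splitting argument as before:

```latex
V(t) = \Big\langle \sum_{t' \ge t} \gamma^{\,t'-t}\, r(t') \Big\rangle
     = r(t) + \gamma \Big\langle \sum_{t' \ge t+1} \gamma^{\,t'-(t+1)}\, r(t') \Big\rangle
     = r(t) + \gamma V(t+1)
```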