Animal learning theory
Based on [Sutton and Barto, 1990, Dayan and Abbott, 2001]
Bert Kappen

[Sutton and Barto, 1990]
Classical conditioning:
- A conditioned stimulus (CS) and an unconditioned stimulus (US) are paired in close succession.
- The US produces a response UR.
- After sufficient pairing, the CS produces a response CR similar to the UR.
Example: rabbit eye-blink conditioning
- The CS is the sound of a buzzer, the US an air puff to the rabbit's eye, the UR an eye blink.
- After CS-US pairing, the buzzer alone yields the CR eye blink.

[Sutton and Barto, 1990]
Classical theory proposes that a causal relation between CS and US is learned:

ΔV = (level of US processing) × (level of CS processing) = Reinforcement × Eligibility

where V is the association strength.
- US processing is about reinforcement (positive or negative).
- CS processing is about attention (which stimulus is credited).
We focus on reinforcement processing with very simple eligibility models.

Trial-level vs. real-time theories [Sutton and Barto, 1990]
Trial-level theories
- ignore the temporal aspects of an individual trial; updates apply at the end of the trial
- Rescorla-Wagner model
- mostly used in experiments
Real-time theories
- consider detailed timing aspects of the stimuli and rewards
- TD learning

Rescorla-Wagner (RW) model [Sutton and Barto, 1990]
The central idea is that learning occurs whenever events violate expectations, in particular when the expected US level V differs from the actual US level λ:

V_i := V_i + ΔV_i,   ΔV_i = β (λ − V) α_i X_i,   V = Σ_i V_i X_i

with β (λ − V) the reinforcement term and α_i X_i the saliency term.
- X_i = 0, 1 indicates absence/presence of CS_i.
- α_i is the (fixed) saliency (attention) of CS_i; the term α_i X_i is often dropped for a single stimulus.
- V_i is the association strength of stimulus CS_i to the US.
- λ denotes the level of the US, combining, in an unspecified way, the US's intensity, duration, and temporal relationship with the CS (for instance, the strength and duration of the air puff and its timing relative to the CS).
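A minimal sketch of the RW update for compound stimuli, in Python with NumPy. The trial sequence, learning rate, and the final demonstration (the blocking effect listed two slides below) are illustrative choices, not part of the slides.

import numpy as np

def rescorla_wagner(X, lam, beta=0.1, alpha=None):
    # X[t, i] = 1 if CS_i is present on trial t; lam[t] is the US level on trial t.
    T, n = X.shape
    alpha = np.ones(n) if alpha is None else alpha   # saliencies alpha_i (fixed)
    V = np.zeros(n)                                  # association strengths V_i
    for t in range(T):
        pred = X[t] @ V                              # compound prediction V = sum_i V_i X_i
        V += beta * (lam[t] - pred) * alpha * X[t]   # dV_i = beta (lambda - V) alpha_i X_i
    return V

# Blocking demo: 50 trials of CS1 -> US, then 50 trials of the compound CS1+CS2 -> US.
X = np.vstack([np.tile([1, 0], (50, 1)), np.tile([1, 1], (50, 1))])
lam = np.ones(100)
print(rescorla_wagner(X, lam))   # V_1 near 1, V_2 stays near 0: CS1 blocks CS2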

RW is linear regression [Sutton and Barto, 1990]
Consider the sequence {(X_{i,t}, λ_t), t = 1, ..., T}. Define the error

C = (1/2) Σ_{t=1}^T ( λ_t − Σ_i X_{i,t} V_i )²

Minimizing C with respect to V_i gives the best predictive model. Gradient descent on C,

ΔV_i = −β ∂C/∂V_i = β Σ_t ( λ_t − Σ_j X_{j,t} V_j ) X_{i,t},

applied one trial at a time (stochastic gradient descent), is identical to the RW rule with α_i = 1.
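A small numerical check of this equivalence, with randomly generated trials (data, step size, and number of passes are illustrative): repeated RW sweeps approach the least-squares minimizer of C.

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 3)).astype(float)                   # CS present/absent per trial
lam = X @ np.array([1.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(200)  # noisy US level

V = np.zeros(3)
for _ in range(500):                                   # many passes, small beta
    for t in range(200):
        V += 0.01 * (lam[t] - X[t] @ V) * X[t]         # RW rule with alpha_i = 1

V_ls, *_ = np.linalg.lstsq(X, lam, rcond=None)         # direct minimizer of C
print(V, V_ls)                                         # the two estimates are close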

Rescorla-Wagner model [Sutton and Barto, 1990]
The Rescorla-Wagner rule for multiple inputs can predict various phenomena:
- Blocking: a learned association s1 → r prevents learning of the association s2 → r.
- Inhibition: s2 reduces the prediction when combined with any predicting stimulus.

Rescorla-Wagner model [Sutton and Barto, 1990]
Fails to explain second-order conditioning: B → A → US.
- Empirical finding: if A predicts the US and B predicts A, then B predicts the US.
- The A → US relation is correctly reinforced by RW.
- The B → A relation is not reinforced by RW, because the US level λ = 0 in the B → A trials.
The solution is to assume that A also generates a λ.

Temporal aspects play an important role [Sutton and Barto, 1990]

Real-Time Theories of Eligibility [Sutton and Barto, 1990]
Conditioning depends on the time interval between CS and US. Mechanisms for the coincidence of CS and US:
- Stimulus trace (Hull 1939): an internal representation of the stimulus. The disadvantage of using the stimulus trace for learning is that long inter-stimulus intervals (ISIs) require a broad stimulus trace, which disagrees with fast, precisely timed responses.
- Eligibility trace X (Klopf 1972): similar to a stimulus trace, but used only for learning.

From RW to a real-time theory [Sutton and Barto, 1990]
Left: trial-based reinforcement, λ is the area under the curve. Right: real-time reinforcement is the future area under the curve.
Problem:
- The prediction is constant for all times prior to the US, which disagrees with empirical observation.
The solution is imminence weighting, which reduces long-time associations.

Temporal difference learning [Sutton and Barto, 1990]
The subject predicts at time t the imminence-weighted area (c) rather than the unweighted area (a), which may be infinite.

Temporal difference learning Dayan and Abbott 9.2
Consider the time interval 0 ≤ t < ∞ with stimulus u(t) and reward λ(t). The expected (discounted = imminence-weighted) future reward is

R(t) = E[ Σ_{τ=t}^∞ γ^(τ−t) λ(τ) ]

The aim of reinforcement learning is to learn a signal v(t) that predicts the expected future reward R(t) from the past stimulus,

v(t) = Σ_{τ=0}^T w(τ) u(t − τ)

with adaptive parameters w(τ). Minimizing the quadratic error C = (1/2) Σ_{t=0}^∞ ( v(t) − R(t) )² with respect to w(τ) yields

Δw(τ) = −ɛ ∂C/∂w(τ) = ɛ Σ_{t=0}^∞ ( R(t) − v(t) ) u(t − τ),   τ = 0, ..., T

Temporal difference learning Dayan and Abbott 9.2
The problem is that we do not know R(t). But we can use the recursive relation

R(t) = E[ Σ_{τ=t}^∞ γ^(τ−t) λ(τ) ] = λ(t) + γ E[ Σ_{τ=t+1}^∞ γ^(τ−t−1) λ(τ) ] = λ(t) + γ R(t+1) ≈ λ(t) + γ v(t+1)

where we approximate R(t+1) by v(t+1). The update at time t becomes

Δw(τ) = ɛ ( R(t) − v(t) ) u(t − τ) ≈ ɛ ( λ(t) + γ v(t+1) − v(t) ) u(t − τ)

Temporal difference learning Dayan and Abbott 9.2
When v(t) does not depend on the stimulus, but is simply a parameter that is estimated for each t:

C = (1/2) Σ_{t=0}^∞ ( R(t) − v(t) )²

Δv(t) = −ɛ ∂C/∂v(t) = ɛ ( R(t) − v(t) ) ≈ ɛ ( λ(t) + γ v(t+1) − v(t) )
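A sketch of this tabular case, with v(t) as a free parameter per time step updated by the TD error. The reward sequence (a single delta reward at t = 20), horizon, and learning rate are illustrative choices.

import numpy as np

T, eps, gamma = 40, 0.2, 1.0
lam = np.zeros(T); lam[20] = 1.0
v = np.zeros(T + 1)                      # v(T) = 0 stays fixed (no reward after T)
for sweep in range(200):
    for t in range(T):
        delta = lam[t] + gamma * v[t + 1] - v[t]
        v[t] += eps * delta              # dv(t) = eps (lambda(t) + gamma v(t+1) - v(t))
print(np.round(v[:25], 2))               # v(t) approaches 1 for t <= 20 and 0 afterwards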

Inter-Stimulus-Interval Dependency Dayan and Abbott 9.2
The rabbit's CR (eye blink) to the CS (sound), as a function of the CS-to-US (air puff) time interval, shows good agreement with the TD model.

Fig. 9.2 Dayan and Abbott 9.2
The stimulus is u(t) = δ_{t,100}, thus Eq. 9.6 gives

v(t) = Σ_τ w(τ) u(t − τ) = w(t − 100) for t ≥ 100,   and v(t) = 0 for 0 ≤ t < 100

Denote τ = t − 100 with t ≥ 100; then v(t) = w(τ). The delta rule, Eq. 9.10, is

Δw(τ) = ɛ δ(t) u(t − τ) = ɛ δ(t) δ_{t−τ,100},   δ(t) = r(t) + v(t+1) − v(t)

so that Δv(t) = Δw(τ) = ɛ δ(t), and Δw(τ) = 0 if τ ≠ t − 100. The reward is r(t) = δ_{t,200}.

Dayan and Abbott 9.2
First iteration:  v(t) = 0,   δ(t) = δ_{t,200}
Second iteration: v(t) = ɛ δ_{t,200},   δ(t) = (1 − ɛ) δ_{t,200} + ɛ δ_{t,199}
Third iteration:  v(t) = (2ɛ − ɛ²) δ_{t,200} + ɛ² δ_{t,199},   δ(t) = ...
with δ(t) = r(t) + v(t+1) − v(t) throughout: over iterations the peak in δ(t) shrinks at the time of the reward and moves back towards the time of the stimulus.
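The iterations above can be checked with a short simulation: a delta stimulus at t = 100, a delta reward at t = 200, and batch updates of w(τ) within each trial. The horizon T = 300 and ɛ = 0.2 are illustrative choices.

import numpy as np

T, eps, gamma = 300, 0.2, 1.0
u = np.zeros(T); u[100] = 1.0                 # stimulus u(t) = delta_{t,100}
r = np.zeros(T); r[200] = 1.0                 # reward   r(t) = delta_{t,200}
w = np.zeros(T)                               # one weight per delay tau

def value(w):
    return np.convolve(u, w)[:T]              # v(t) = sum_tau w(tau) u(t - tau) = w(t - 100)

for trial in range(3):
    v = value(w)
    v_next = np.append(v[1:], 0.0)
    delta = r + gamma * v_next - v            # delta(t) computed with the current v
    for tau in range(T):                      # dw(tau) = eps * sum_t delta(t) u(t - tau)
        w[tau] += eps * np.dot(delta[tau:], u[:T - tau])
    # delta is the error used in this trial; value(w) is the v at the start of the next one
    print(trial + 1, np.round(delta[198:201], 3), np.round(value(w)[198:201], 3))

The δ printed after trial 1 and the new v match the first- and second-iteration expressions above; trial 2 matches the second-iteration δ and the third-iteration v.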

Dopamine Dayan and Abbott 9.2
The activity of dopamine neurons in the ventral tegmental area (VTA) of the monkey encodes δ as in fig. 9.2. The monkey presses a button after a stimulus (sound) to receive a reward (fruit juice).
- A: Left panels show trials locked to the stimulus, right panels trials locked to the reward. The top row shows early trials, the bottom row late trials. VTA cells respond to the reward in early trials and to the stimulus in late trials.
- B top: stimulus-locked trials after learning, as in the bottom row of A.
- B bottom: withholding the (expected) reward yields inhibition.

Instrumental conditioning / static action choice Dayan Abbott 9.3
Conditioning:
- Classical: reward (punishment) is independent of the action taken (Pavlov's dog).
- Instrumental: reward depends on the action taken.
Reward timing:
- Immediate: reward follows directly after the action (static action choice).
- Delayed: reward follows after a sequence of actions.

Foraging of bees Dayan Abbott 9.3
Foraging of bees as an example of learning behavior with immediate reward:
- bees visit flowers of two colors (blue, yellow), preferring the color with the higher reward (nectar)
- when the rewards are swapped, the bees adjust their preference
The quantity of nectar r_{b,y} is stochastic, drawn from a (fixed) distribution q_{b,y}(r). This is like a two-armed bandit problem. The policy of the bees is given by probabilities p_{b,y} with p_b + p_y = 1. We can parametrize this as

p_b = e^{β m_b} / ( e^{β m_b} + e^{β m_y} ),   p_y = e^{β m_y} / ( e^{β m_b} + e^{β m_y} )

The parameters m_{b,y} are called action values; they do not require normalization. The parameter β controls exploration vs. exploitation: β = 0 gives pure exploration, β = ∞ pure exploitation.
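A sketch of this softmax parametrization; the action values and β values below are illustrative.

import numpy as np

def softmax_policy(m_b, m_y, beta):
    # p_b, p_y from the action values m_b, m_y
    z = np.exp(beta * np.array([m_b, m_y]))
    return z / z.sum()

print(softmax_policy(1.0, 2.0, beta=0.0))    # [0.5 0.5]: pure exploration
print(softmax_policy(1.0, 2.0, beta=50.0))   # ~[0 1]:    near-pure exploitation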

Foraging of bees Dayan Abbott 9.3
Two learning strategies:
- learn action values as past expected reward (indirect actor)
- learn action values to maximize future expected reward (direct actor)

Indirect actor Dayan Abbott 9.3
We need a learning rule to estimate

m_b = ⟨r_b⟩ = Σ_{r_b} q_b(r_b) r_b,   m_y = ⟨r_y⟩ = Σ_{r_y} q_y(r_y) r_y

Consider the cost function E(m_b) = (1/2) Σ_r q_b(r) (m_b − r)². The function is minimized when

dE/dm_b = Σ_{r_b} q_b(r_b) (m_b − r_b) = m_b − ⟨r_b⟩ = 0

We use stochastic gradient descent, replacing ⟨r_b⟩ by a single sample r_b:

m_b := m_b − ɛ dE/dm_b = m_b + ɛ (r_b − m_b)

and similarly for m_y: m_y := m_y + ɛ (r_y − m_y). This is the delta rule.
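A sketch of the indirect actor with softmax action selection, in the setting described on the following slide (mean rewards 1 and 2, swapped after 100 visits). The values of ɛ and β, and the random seed, are illustrative choices.

import numpy as np

rng = np.random.default_rng(1)
eps, beta = 0.1, 1.0
m = {"b": 0.0, "y": 0.0}                       # action values = running estimates of <r>

def sample_reward(flower, swapped):
    # Before the swap yellow pays {0, 4} (mean 2) and blue pays {0, 2} (mean 1); after, reversed.
    high = (flower == "b") if swapped else (flower == "y")
    return rng.choice([0.0, 4.0] if high else [0.0, 2.0])

for visit in range(200):
    p_b = np.exp(beta * m["b"]) / (np.exp(beta * m["b"]) + np.exp(beta * m["y"]))
    a = "b" if rng.random() < p_b else "y"
    r = sample_reward(a, swapped=(visit >= 100))
    m[a] += eps * (r - m[a])                   # delta rule, applied to the chosen flower only
print(m)                                       # m tracks the current mean rewards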

Indirect actor Dayan Abbott 9.3
⟨r_b⟩ = 1, ⟨r_y⟩ = 2 for the first 100 visits, with q_b(r) = ½ δ_{r,0} + ½ δ_{r,2} and q_y(r) = ½ δ_{r,0} + ½ δ_{r,4}. The rewards are reversed for the second 100 visits.
A) Visits selected according to the action values with β = 1, and learning of m_{b,y} with the delta rule.
B) Cumulative reward for β = 1, and (C, D) for β = 50: large β gives better exploitation but (sometimes) worse exploration.
This behavior is seen in real bees.

Indirect actor Dayan Abbott 9.3
Real bees are risk averse. Blue flowers provide 2 µl of nectar; yellow flowers provide 6 µl on 1/3 of the trials and zero otherwise. After 15 trials the contingencies are reversed.
A) Mean preference of 5 bees for blue flowers over 30 trials, each consisting of 40 visits.
B) If the action value is a concave function of nectar volume, the model prefers the low-risk option.
C) Preference of a single model bee (ɛ = 0.3, β = 23/8).

Direct actor Dayan Abbott 9.3
The direct actor method estimates m_{b,y} so as to maximize the expected future reward

⟨r⟩ = p_b ⟨r_b⟩ + p_y ⟨r_y⟩

We use stochastic gradient ascent on ⟨r⟩:

m_b := m_b + ɛ β p_b p_y ( ⟨r_b⟩ − ⟨r_y⟩ ) = m_b + ɛ β p_b (1 − p_b) ⟨r_b⟩ − ɛ β p_y p_b ⟨r_y⟩
m_y := m_y + ɛ β p_b p_y ( ⟨r_y⟩ − ⟨r_b⟩ ) = m_y + ɛ β p_y (1 − p_y) ⟨r_y⟩ − ɛ β p_b p_y ⟨r_b⟩

where we used

dp_b/dm_b = β p_b p_y,   dp_y/dm_b = −β p_b p_y

Direct actor Dayan Abbott 9.3
The stochastic version is

δm_b = ɛ (1 − p_b) r_b,   δm_y = −ɛ p_y r_b      if b is selected
δm_b = −ɛ p_b r_y,   δm_y = ɛ (1 − p_y) r_y      if y is selected

For multiple actions the rule generalizes to: when action a' is taken,

δm_a = ɛ ( δ_{a,a'} − p_a ) r_{a'}   for all a

(For instance, taking a' = b and a = b, y gives the first line above.) We will use this in the actor-critic algorithm (DA 9.4).
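A sketch of the direct actor in the same bee setting; the update line implements δm_a = ɛ (δ_{a,a'} − p_a) r_{a'}. Parameters and seed are again illustrative.

import numpy as np

rng = np.random.default_rng(2)
eps, beta = 0.1, 1.0
m = np.zeros(2)                                # action values for [blue, yellow]
means = np.array([1.0, 2.0])                   # mean nectar rewards before the swap

for visit in range(200):
    if visit == 100:
        means = means[::-1]                    # rewards reversed halfway
    p = np.exp(beta * m); p /= p.sum()
    a = rng.choice(2, p=p)
    r = rng.choice([0.0, 2.0 * means[a]])      # pays zero or twice the mean, each with prob 1/2
    m += eps * (np.eye(2)[a] - p) * r          # dm_a = eps (delta_{a,a'} - p_a) r_{a'}
print(np.round(m, 2), np.round(p, 2))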

Direct actor Dayan Abbott 9.3
Setting as before; two different runs. A, B) Successful learning. C, D) Failure to learn to switch, due to the larger difference m_y − m_b.
This illustrates that learning methods that directly optimize the expected future reward may be too greedy and have poor exploration.

Delayed reward / sequential decision making Dayan Abbott 9.4
The reward is obtained after a sequence of actions. The rat moves through the maze without backtracking; after obtaining the reward it is removed from the maze and the trial restarts.
Delayed reward problem: the choice at A has no direct reward.

Delayed reward / sequential decision making Dayan Abbott 9.4
Policy iteration (see [Kaelbling et al., 1996], section 3.2.2). Loop:
- Policy evaluation: compute the value V^π for policy π by running Bellman backups until convergence.
- Policy improvement: improve π.

Delayed reward / sequential decision making Dayan Abbott 9.4
Actor-critic (see [Kaelbling et al., 1996], section 4.1). Loop:
- Critic: use TD to evaluate V(state) under the current policy.
- Actor: improve the policy p(state).

Policy evaluation with TD Dayan Abbott 9.4
The policy is a random left/right choice at each turn.

v(B) = ½ (0 + 5) = 2.5
v(C) = ½ (0 + 2) = 1
v(A) = ½ ( v(B) + v(C) ) = 1.75

Learn through TD learning (v(s) = w(s), s = A, B, C):

v(s) := v(s) + ɛ δ,   δ = r(s) + v(s') − v(s)
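A TD(0) sketch of this policy evaluation (γ = 1). The layout follows the slides (right from A leads to B, left to C); which terminal arm of B or C carries the reward, the learning rate, and the number of trials are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
eps = 0.1
v = {"A": 0.0, "B": 0.0, "C": 0.0}
rewards = {"B": {"left": 0.0, "right": 5.0},   # assumed arm assignment
           "C": {"left": 0.0, "right": 2.0}}

for trial in range(5000):
    a1 = rng.choice(["left", "right"])                 # random policy at A
    s = "B" if a1 == "right" else "C"
    v["A"] += eps * (0.0 + v[s] - v["A"])              # no immediate reward at A
    a2 = rng.choice(["left", "right"])                 # random policy at B or C
    v[s] += eps * (rewards[s][a2] + 0.0 - v[s])        # terminal arm: v(s') = 0
print({k: round(x, 2) for k, x in v.items()})          # fluctuates around A: 1.75, B: 2.5, C: 1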

Policy improvement Dayan Abbott 9.4
We use the direct actor method, which updates all m_a upon taking action a' as

m_a := m_a + ɛ ( δ_{a,a'} − p_a ) r_{a'}   for all a

In the maze, the actions (and action values) depend on the state, and the reward is delayed; r_{a'} is replaced by r_{a'} + v(s') − v(s):

m_a(s) := m_a(s) + ɛ ( δ_{a,a'} − p_a(s) ) δ,   δ = r_{a'} + v(s') − v(s),   p_a(s) = e^{β m_a(s)} / Σ_{a''} e^{β m_{a''}(s)}

Example: the values of the random policy are v(A) = 1.75, v(B) = 2.5, v(C) = 1.

When in state A: Dayan Abbott 9.4

δ = 0 + v(B) − v(A) = 0.75    for a right turn
δ = 0 + v(C) − v(A) = −0.75   for a left turn

The learning rule increases m_right(A) and decreases m_left(A). β controls the exploration.
Actor-critic learning: the figure shows the probability of a left turn versus trials. Slow convergence at C is due to fewer visits of that arm of the maze.
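An actor-critic sketch for the maze, combining the TD critic with the direct-actor update above. The transition layout and the arm carrying each reward are the same illustrative assumptions as in the evaluation sketch; ɛ and β are illustrative as well.

import numpy as np

rng = np.random.default_rng(4)
eps, beta = 0.1, 1.0
v = {"A": 0.0, "B": 0.0, "C": 0.0}
m = {s: np.zeros(2) for s in v}                # action values per state, [left, right]
rewards = {"B": [0.0, 5.0], "C": [2.0, 0.0]}   # assumed reward arms

def step(s, a):
    if s == "A":
        return 0.0, ("C" if a == 0 else "B")   # left -> C, right -> B, as in the slides
    return rewards[s][a], None                 # B and C lead to terminal arms

for trial in range(2000):
    s = "A"
    while s is not None:
        p = np.exp(beta * m[s]); p /= p.sum()  # p_a(s)
        a = rng.choice(2, p=p)
        r, s_next = step(s, a)
        delta = r + (v[s_next] if s_next else 0.0) - v[s]
        v[s] += eps * delta                           # critic: TD update of v(s)
        m[s] += eps * (np.eye(2)[a] - p) * delta      # actor: direct-actor update of m_a(s)
        s = s_next
print({k: round(x, 2) for k, x in v.items()})
print({k: np.round(x, 2) for k, x in m.items()})

As the policy improves, the probability of a right turn at A increases, in line with the sign of δ above, and the values drift towards those of the improved policy.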

Generalizations Dayan Abbott 9.4
Generalizations of the basic actor-critic model are:
1. The state s may not be directly observable; instead a vector u_i(s) of sensory information is available. The value function v(s) can then be a parametrized function of u_i(s). For a linear parametrization, the TD learning rule is

v(s) = Σ_i w_i u_i(s),   w_i := w_i + ɛ δ u_i(s),   δ = r(s) + v(s') − v(s)

The action value m_a(s) also becomes a function of u_i instead of s directly:

m_a(s) = Σ_i M_{ai} u_i(s),   M_{ai} := M_{ai} + ɛ ( δ_{a,a'} − p_a(s) ) δ u_i(s)

This is a three-term learning rule: the connection between neurons i and a is changed depending on a third term δ (a sketch of this linear parametrization follows after this list).
2. Discounted reward: earlier reward/punishment is more important than later reward/punishment,

δ = r(s) + γ v(s') − v(s)

3. Updating the values of past states at each learning step: TD(λ). (Dayan Abbott 9.4)
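A sketch of generalization 1, the linear parametrization over sensory features u_i(s), as referenced above. The one-dimensional track, the Gaussian "place-cell" features, the reward at the right end, and all parameter values are illustrative assumptions, not the water-maze setup of the next slide.

import numpy as np

rng = np.random.default_rng(5)
n_states, n_features, n_actions = 20, 8, 2      # actions: 0 = left, 1 = right
centers = np.linspace(0, n_states - 1, n_features)

def features(s):
    return np.exp(-0.5 * ((s - centers) / 2.0) ** 2)   # u_i(s)

w = np.zeros(n_features)                         # critic weights w_i
M = np.zeros((n_actions, n_features))            # actor weights M_ai
eps, beta, gamma = 0.05, 2.0, 0.9

for trial in range(500):
    s = n_states // 2
    while True:
        u = features(s)
        p = np.exp(beta * (M @ u)); p /= p.sum()         # p_a(s)
        a = rng.choice(n_actions, p=p)
        s_next = s - 1 if a == 0 else s + 1
        done = s_next < 0 or s_next >= n_states
        r = 1.0 if s_next >= n_states else 0.0           # reward only at the right end
        v, v_next = w @ u, (0.0 if done else w @ features(s_next))
        delta = r + gamma * v_next - v                   # TD error
        w += eps * delta * u                             # w_i += eps delta u_i(s)
        M[a] += eps * (1 - p[a]) * delta * u             # M_ai += eps (delta_{aa'} - p_a) delta u_i
        M[1 - a] += eps * (0 - p[1 - a]) * delta * u
        if done:
            break
        s = s_next
print(np.round(w @ np.array([features(s) for s in range(n_states)]).T, 2))  # v rises towards the reward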

Water Maze Dayan Abbott 9.4
The rat is placed in a large pool of milky water and has to swim around until it finds a small submerged platform. After several trials, the rat learns to locate the platform.
(A) u_i(s) is the activity of the place-cell array; s is the physical location.
(B) Critic: v(s) = Σ_i w_i u_i(s) is the value function.
(C) Actor: m_a(s) = Σ_i M_{ai} u_i(s) is the action value (upper and lower panels).

References
[Dayan and Abbott, 2001] Dayan, P. and Abbott, L. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.
[Kaelbling et al., 1996] Kaelbling, L., Littman, M., and Moore, A. (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237–285.
[Sutton and Barto, 1990] Sutton, R. S. and Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement, pages 497–537. Cambridge University Press.