Summary of part I: prediction and RL


Summary of part I: prediction and RL
Prediction is important for action selection.
The problem: prediction of future reward.
The algorithm: temporal difference learning.
Neural implementation: dopamine-dependent learning in BG.
A precise computational model of learning allows one to look in the brain for hidden variables postulated by the model.
Precise (normative!) theory for the generation of dopamine firing patterns.
Explains anticipatory dopaminergic responding and second-order conditioning.
Compelling account for the role of dopamine in classical conditioning: the prediction error acts as the signal driving learning in prediction areas.
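As a concrete reference for the TD algorithm mentioned here, a minimal tabular TD(0) prediction sketch on an invented chain of states (the state layout, rewards, and learning rate are illustrative, not from the slides):

```python
# Minimal tabular TD(0) prediction sketch on an invented chain 0 -> 1 -> 2 -> 3 (terminal).
#   V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
rewards = {0: 0.0, 1: 0.0, 2: 1.0}   # reward received on leaving each non-terminal state
alpha, gamma = 0.1, 1.0
V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}

for episode in range(500):
    s = 0
    while s != 3:
        s_next = s + 1
        delta = rewards[s] + gamma * V[s_next] - V[s]   # TD prediction error
        V[s] += alpha * delta
        s = s_next

print(V)   # V(0), V(1), V(2) all approach the single future reward of 1.0
```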

Prediction error hypothesis of dopamine
Measured firing rate at the end of the trial: $\delta_t = r_t - V_t$ (just like Rescorla-Wagner), with $V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i} r_i$.
Figure: measured firing rate vs. model prediction error (Bayer & Glimcher, 2005).
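The value $V_t$ above is just an exponentially weighted (leaky) average of past rewards, so it can be computed recursively; a tiny sketch, with $\eta$ and the reward sequence chosen for illustration:

```python
# delta_t = r_t - V_t, with V_t an exponentially weighted average of earlier rewards
# (the slide's V_t = eta * sum (1-eta)^(t-i) r_i). eta and the rewards are illustrative.
eta = 0.3
rewards = [0.0, 1.0, 1.0, 0.0, 1.0]

V = 0.0
for t, r in enumerate(rewards, start=1):
    delta = r - V                         # prediction error at the end of trial t
    print(f"trial {t}: V = {V:.3f}, delta = {delta:+.3f}")
    V += eta * delta                      # recursive form of the exponentially weighted sum
```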

Global plan
Reinforcement learning I: prediction; classical conditioning; dopamine.
Reinforcement learning II: dynamic programming; action selection; Pavlovian misbehaviour; vigour.
Chapter 9 of Theoretical Neuroscience.

Action Selection
Evolutionary specification (Bandler; Blanchard).
Immediate reinforcement: leg flexion; Thorndike puzzle box; pigeon, rat, human matching.
Delayed reinforcement: these tasks; mazes; chess.

Immediate Reinforcement
Stochastic policy, based on the action values $m_L$ and $m_R$.

Indirect Actor
Use the Rescorla-Wagner rule to learn the action values.
Reward contingencies: $p_L r_L = 0.05$ and $p_R r_R = 0.25$; switch every 100 trials.
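A minimal sketch of the indirect actor, assuming Bernoulli rewards with the expected payoffs above, a Rescorla-Wagner update of the action values, and a sigmoid choice rule (all constants are illustrative):

```python
import math, random

# Indirect actor sketch: learn action values m_L, m_R with the Rescorla-Wagner rule
# and choose via a sigmoid of their difference. Rewards are Bernoulli with expected
# payoffs 0.05 (L) and 0.25 (R); learning rate and inverse temperature are illustrative.
random.seed(0)
m = {"L": 0.0, "R": 0.0}
expected = {"L": 0.05, "R": 0.25}
eps, beta = 0.1, 5.0

for trial in range(400):
    if trial % 100 == 0 and trial > 0:
        expected["L"], expected["R"] = expected["R"], expected["L"]  # switch every 100 trials
    p_L = 1.0 / (1.0 + math.exp(-beta * (m["L"] - m["R"])))
    a = "L" if random.random() < p_L else "R"
    r = 1.0 if random.random() < expected[a] else 0.0
    m[a] += eps * (r - m[a])              # RW / delta rule on the chosen action's value
```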

Direct Actor
Expected reward: $E(\mathbf{m}) = P[L]\, r_L + P[R]\, r_R$
$\dfrac{\partial P[L]}{\partial m_L} = \beta P[L] P[R] = \dfrac{\partial P[R]}{\partial m_R}$
$\dfrac{\partial E(\mathbf{m})}{\partial m_L} = \beta P[L] P[R]\,(r_L - r_R) = \beta P[L]\,\big(r_L - E(\mathbf{m})\big)$
Sampled version: $\dfrac{\partial E(\mathbf{m})}{\partial m_L} \approx \beta\,\big(r_L - E(\mathbf{m})\big)$ if L is chosen.
Update: $m_L - m_R \leftarrow (1-\varepsilon)(m_L - m_R) \pm \varepsilon\,\big(r_a - E(\mathbf{m})\big)$, with the sign set by the chosen action $a \in \{L, R\}$.
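A sketch of the direct actor read as REINFORCE-style stochastic gradient ascent on $E(\mathbf{m})$ with a running average-reward baseline; the exact update on the slide may differ slightly, and all constants and the baseline choice are assumptions:

```python
import math, random

# Direct actor sketch, read as REINFORCE-style gradient ascent on E(m):
# P[L] = sigma(beta * (m_L - m_R)); after each choice, move every parameter by
# eps * (indicator - P[action]) * (r - rbar), with rbar a running estimate of E(m).
# Constants, reward probabilities, and the baseline are illustrative assumptions.
random.seed(0)
m = {"L": 0.0, "R": 0.0}
expected = {"L": 0.25, "R": 0.05}         # Bernoulli reward probabilities (illustrative)
beta, eps = 2.0, 0.1
rbar = 0.0

for trial in range(1000):
    p_L = 1.0 / (1.0 + math.exp(-beta * (m["L"] - m["R"])))
    probs = {"L": p_L, "R": 1.0 - p_L}
    a = "L" if random.random() < p_L else "R"
    r = 1.0 if random.random() < expected[a] else 0.0
    for b in ("L", "R"):
        indicator = 1.0 if b == a else 0.0
        m[b] += eps * (indicator - probs[b]) * (r - rbar)   # sampled gradient step
    rbar += 0.05 * (r - rbar)                               # track the average reward
```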

Direct Actor (figure slide).

Could we Tell?
Correlate past rewards and actions with present choice.
Indirect actor (separate clocks); direct actor (single clock).

Matching: Concurrent VI-VI (Lau, Glimcher, Corrado, Sugrue, Newsome).

Matching
Income, not return; approximately exponential in r; alternation; choice kernel.

Action at a (Temporal) Distance
Figure: a chain of states x=1, x=2, x=3.
Learning an appropriate action at x=1 depends on the actions at x=2 and x=3, and gains no immediate feedback.
Idea: use prediction as surrogate feedback.

Action Selection
Start with a policy: $P[L; x] = \sigma\big(m_L(x) - m_R(x)\big)$
Evaluate it: $V(1), V(2), V(3)$
Improve it: $\Delta m \propto \alpha\,\delta$; thus choose R more frequently than L or C.
Figure: values and action propensities at x=1, x=2, x=3.
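The evaluate-then-improve loop sketched here is essentially an actor-critic; a minimal illustration on an invented three-state chain (dynamics, rewards, and constants are assumptions, not from the slides):

```python
import math, random

# Actor-critic sketch on a three-state chain: the critic learns V(x) by TD,
# and the actor nudges action preferences m[x][a] in proportion to the TD error.
random.seed(0)
V = {1: 0.0, 2: 0.0, 3: 0.0}
m = {x: {"L": 0.0, "R": 0.0} for x in (1, 2, 3)}
alpha, eps = 0.1, 0.1

def step(x, a):
    """Assumed dynamics: states 1 and 2 advance along the chain; at state 3,
    R pays 1 and L pays 0, and the episode ends."""
    if x < 3:
        return x + 1, 0.0, False
    return None, (1.0 if a == "R" else 0.0), True

for episode in range(2000):
    x, done = 1, False
    while not done:
        p_L = 1.0 / (1.0 + math.exp(m[x]["R"] - m[x]["L"]))   # sigma(m_L - m_R)
        a = "L" if random.random() < p_L else "R"
        x_next, r, done = step(x, a)
        delta = r + (0.0 if done else V[x_next]) - V[x]        # TD error
        V[x] += alpha * delta                                  # critic: evaluate the policy
        m[x][a] += eps * delta                                 # actor: improve it, Delta m ~ delta
        if not done:
            x = x_next
```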

Policy
$\delta > 0$ if the value is too pessimistic, i.e. the action is better than average.
Figure: values v and policy P across x=1, x=2, x=3.

Actor/critic
Action parameters $m_1, m_2, m_3, \ldots, m_n$.
Dopamine signals to both motivational and motor striatum appear, surprisingly, the same.
Suggestion: training both values and policies.

Formally: Dynamic Programming

Variants: SARSA
$Q^*(1, C) = E\big[r_{t+1} + V^*(x_{t+1}) \mid x_t = 1, u_t = C\big]$
Update: $Q(1, C) \leftarrow Q(1, C) + \varepsilon\big(r + Q(2, u_{\text{actual}}) - Q(1, C)\big)$
Morris et al., 2006
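A sketch of the SARSA update in the slide's form; the helper name, the example values, and the undiscounted target are choices made here for illustration:

```python
# SARSA update sketch, in the slide's form:
#   Q(1, C) <- Q(1, C) + eps * (r + Q(2, u_actual) - Q(1, C))
# i.e. the target bootstraps on the action actually taken in the next state.
def sarsa_update(Q, x, u, r, x_next, u_next, eps=0.1):
    """On-policy TD update of Q[(x, u)] (undiscounted, as on the slide)."""
    q_old = Q.get((x, u), 0.0)
    target = r + Q.get((x_next, u_next), 0.0)
    Q[(x, u)] = q_old + eps * (target - q_old)
    return Q

Q = {(2, "go"): 0.5}                        # illustrative value for the next state-action
sarsa_update(Q, x=1, u="C", r=0.0, x_next=2, u_next="go")
print(Q[(1, "C")])                          # moved toward r + Q(2, go) = 0.5
```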

Variants: Q-learning
$Q^*(1, C) = E\big[r_{t+1} + V^*(x_{t+1}) \mid x_t = 1, u_t = C\big]$
Update: $Q(1, C) \leftarrow Q(1, C) + \varepsilon\big(r + \max_u Q(2, u) - Q(1, C)\big)$
Roesch et al., 2007
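The corresponding off-policy Q-learning update replaces the sampled next action with a max over available actions; a sketch under the same illustrative assumptions as the SARSA sketch above:

```python
# Q-learning update sketch, in the slide's form:
#   Q(1, C) <- Q(1, C) + eps * (r + max_u Q(2, u) - Q(1, C))
def q_learning_update(Q, x, u, r, x_next, next_actions, eps=0.1):
    """Off-policy TD update: bootstrap on the best available next action."""
    q_old = Q.get((x, u), 0.0)
    best_next = max((Q.get((x_next, u2), 0.0) for u2 in next_actions), default=0.0)
    Q[(x, u)] = q_old + eps * (r + best_next - q_old)
    return Q

Q = {(2, "go"): 0.5, (2, "nogo"): 0.2}      # illustrative values
q_learning_update(Q, x=1, u="C", r=0.0, x_next=2, next_actions=["go", "nogo"])
print(Q[(1, "C")])                          # moved toward r + max_u Q(2, u) = 0.5
```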

Summary
Prediction learning: Bellman evaluation, $V^*(1) = E\big[r_{t+1} + V^*(x_{t+1}) \mid x_t = 1\big]$
Actor-critic: asynchronous policy iteration.
Indirect method (Q-learning): asynchronous value iteration, $Q^*(1, C) = E\big[r_{t+1} + V^*(x_{t+1}) \mid x_t = 1, u_t = C\big]$

Impulsivity & Hyperbolic Discounting
Humans (and animals) show impulsivity in diets, addiction, spending, etc.
The intertemporal conflict between short- and long-term choices is often explained via hyperbolic discount functions.
An alternative is a Pavlovian imperative towards an immediate reinforcer: framing, trolley dilemmas, etc.
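To make the intertemporal conflict concrete, a small sketch (values, delays, and discount parameters invented for illustration) of how a hyperbolic discount $1/(1+kd)$ produces a preference reversal that an exponential discount $\gamma^d$ does not:

```python
# Preference reversal under hyperbolic discounting (all numbers illustrative).
# Small-soon reward: 1 at delay d; large-late reward: 2 at delay d + 5.
def hyperbolic(v, d, k=1.0):
    return v / (1.0 + k * d)

def exponential(v, d, gamma=0.7):
    return v * gamma ** d

for d in (0, 10):
    print(f"delay to the small reward: {d}")
    print("  hyperbolic :", round(hyperbolic(1, d), 3), "vs", round(hyperbolic(2, d + 5), 3))
    print("  exponential:", round(exponential(1, d), 3), "vs", round(exponential(2, d + 5), 3))
# Hyperbolic: the small-soon reward wins at d = 0 but loses at d = 10 (preference reversal);
# exponential discounting preserves the same ordering at both delays.
```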

Direct/Indirect Pathways (Frank)
Direct (D1): go; learn from DA increase.
Indirect (D2): no-go; learn from DA decrease.
Hyperdirect (STN): delay actions given strongly attractive choices.

Frank: DARPP-32 (D1 effect); DRD2 (D2 effect).

Three Decision Makers
Tree search; position evaluation; situation memory.

Multiple Systems in RL
Model-based RL: build a forward model of the task and its outcomes; search in the forward model (online DP); optimal use of information, but computationally ruinous.
Cache-based (model-free) RL: learn Q values, which summarize future worth; computationally trivial, but bootstrap-based and so statistically inefficient.
Learn both; select between them according to uncertainty.

Animal Canary
OFC; dlPFC; dorsomedial striatum; BLA? versus dorsolateral striatum; amygdala.

Two Systems:

Behavioural Effects

Effects of Learning
Distributional value iteration (Bayesian Q-learning); fixed additional uncertainty per step.

One Outcome
A shallow tree implies goal-directed control wins.

Human Canary
Three options a, b, c: if a leads to c, and c is rewarded, do more of a or of b?
MB: b. MF: a (or even no effect).

Behaviour
Action values depend on both systems: $Q_{\text{tot}}(x, u) = Q_{\text{MF}}(x, u) + \beta\, Q_{\text{MB}}(x, u)$
Expect that $\beta$ will vary by subject (but be fixed).
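A minimal sketch of how the two systems' values might be combined and turned into choice probabilities; only the additive combination is from the slide, while the softmax rule, temperature, and example numbers are assumptions:

```python
import math

# Q_tot(x, u) = Q_MF(x, u) + beta * Q_MB(x, u), then a softmax choice.
# beta is a per-subject weighting; the softmax temperature is an extra assumption.
def q_tot(q_mf, q_mb, beta):
    return {u: q_mf[u] + beta * q_mb[u] for u in q_mf}

def softmax(q, temperature=1.0):
    z = {u: math.exp(v / temperature) for u, v in q.items()}
    total = sum(z.values())
    return {u: zu / total for u, zu in z.items()}

q_mf = {"a": 0.6, "b": 0.2}   # model-free values (illustrative)
q_mb = {"a": 0.1, "b": 0.7}   # model-based values (illustrative)
print(softmax(q_tot(q_mf, q_mb, beta=1.5)))   # this MB-weighted subject favours b
```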

Neural Prediction Errors (1-2)
Right ventral striatum (anatomical definition).
Note that MB RL does not use this prediction error. Training signal?

Neural Prediction Errors (1)
Right nucleus accumbens. Behaviour: 1-2, not 1.

Vigour
Two components to choice:
(1) what: lever pressing, direction to run, meal to choose;
(2) when / how fast / how vigorously.
Free operant tasks; real-valued DP.

The model
How fast to act? Each choice is an (action, τ) pair, e.g. (LP, τ1) or (LP, τ2), where the actions include lever press (LP), nose poke (NP), and Other.
Costs: a vigour cost $C_v/\tau$ and a unit cost $C_u$; rewards: utility $U_R$ with probability $P_R$.
Figure: states $S_0$, $S_1$, $S_2$ unfolding over time, with the costs and rewards of each chosen (action, τ) on the way to the goal.

The model
Goal: choose actions and latencies to maximize the average rate of return (rewards minus costs per unit time); this is average-reward RL (ARL).

Average Reward RL
Compute differential values of actions. With ρ = average rewards minus costs per unit time, the differential value of taking action L with latency τ in state x is
$Q_{L,\tau}(x) = \text{Rewards} - \text{Costs} + \text{Future Returns} = \text{Rewards} - \Big(C_u + \frac{C_v}{\tau}\Big) + V(x') - \tau\rho$
Steady-state behaviour (not learning dynamics). (Extension of Schwartz 1993.)
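A sketch evaluating this differential value over a grid of latencies; all constants are illustrative, not taken from the slides:

```python
# Differential value of acting with latency tau, in average-reward RL:
#   Q(x, L, tau) = Rewards - (C_u + C_v / tau) + V(x') - tau * rho
def q_latency(reward, C_u, C_v, tau, V_next, rho):
    return reward - (C_u + C_v / tau) + V_next - tau * rho

reward, C_u, C_v, V_next, rho = 1.0, 0.2, 1.0, 0.0, 0.5   # illustrative constants
for tau in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(tau, round(q_latency(reward, C_u, C_v, tau, V_next, rho), 3))
# Acting fast incurs a large vigour cost C_v/tau; acting slowly incurs a large
# opportunity cost tau*rho; the best latency trades the two off (here near 1.4).
```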

Average Reward Cost/Benefit Tradeoffs
1. Which action to take? Choose the action with the largest expected reward minus cost.
2. How fast to perform it? Slower is less costly (vigour cost), but slower delays (all) rewards; the net rate of rewards is the cost of delay (the opportunity cost of time). Choose the rate that balances vigour and opportunity costs.
Explains faster (irrelevant) actions under hunger, etc.; masochism.

Optimal response rates (Niv, Dayan, Joel, unpublished)
Figure: experimental data vs. model simulation; probability of the first nose poke as a function of time (seconds), and response rates per minute for LP and NP as a function of seconds since reinforcement.

Optimal response rates
Figure: experimental data (Herrnstein 1961; % responses on key A vs. % reinforcements on key A for pigeons A and B, showing perfect matching) vs. model simulation (% responses on lever A vs. % reinforcements on lever A, also near perfect matching).
More: interval length, amount of reward, ratio vs. interval schedules, breaking point, temporal structure, etc.

Effects of motivation (in the model): RR25
$Q(x, u, \tau) = p_r U_R - C_u - \frac{C_v}{\tau} + V(x') - \tau\rho$
$\frac{\partial Q(x, u, \tau)}{\partial \tau} = \frac{C_v}{\tau^2} - \rho = 0 \;\Rightarrow\; \tau_{\text{opt}} = \sqrt{C_v / \rho}$
Figure: mean latency of LP and Other for low vs. high utility; the energizing effect.
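A quick numerical check of $\tau_{\text{opt}} = \sqrt{C_v/\rho}$ against a brute-force grid search, under the same illustrative constants as above:

```python
import math

# Check tau_opt = sqrt(C_v / rho) against a grid search (illustrative constants).
reward, C_u, C_v, V_next, rho, p_r = 1.0, 0.2, 1.0, 0.0, 0.5, 1.0

def q(tau):
    return p_r * reward - C_u - C_v / tau + V_next - tau * rho

tau_grid = max((0.01 * i for i in range(1, 1000)), key=q)
tau_closed = math.sqrt(C_v / rho)
print(round(tau_grid, 2), round(tau_closed, 2))   # both about 1.41
```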

Effects of motivation (in the model): RR25
Figure: (1) directing effect: response rate per minute as a function of seconds from reinforcement, with $U_R$ at 50%; (2) energizing effect: mean latency of LP and Other for low vs. high utility.

Relation to Dopamine
Phasic dopamine firing = reward prediction error.
What about tonic dopamine?

Tonic dopamine = average reward rate
1. Explains pharmacological manipulations.
2. Dopamine control of vigour through BG pathways.
Figure: number of lever presses in 30 minutes, control vs. DA-depleted, across ratio requirements (Aberman and Salamone 1999) alongside the model simulation.
Caveats: eating-time confound; context/state dependence (motivation & drugs?); less switching = perseveration.
NB: the phasic signal remains the RPE for choice/value learning.

Tonic dopamine hypothesis
Also explains effects of phasic dopamine on response times: firing rate vs. reaction time (Satoh and Kimura 2003; Ljungberg, Apicella and Schultz 1992).

Sensory Decisions as Optimal Stopping
Consider listening to an example; the decision at each step: choose, or sample further.

Optimal Stopping
The equivalent of state u=1 is $u_1 = n_1$, and of states u=2,3 is $u_2 = \tfrac{1}{2}(n_1 + n_2)$, with $\sigma = 2.5$ and $C_r = 0.1$.

Transition Probabilities

Computational Neuromodulation
Dopamine: phasic, prediction error for reward; tonic, average reward (vigour).
Serotonin: phasic, prediction error for punishment?
Acetylcholine: expected uncertainty?
Norepinephrine: unexpected uncertainty; neural interrupt?

Conditioning
Prediction of important events, and control in the light of those predictions.
Ethology: optimality, appropriateness.
Psychology: classical/operant conditioning.
Computation: dynamic programming, Kalman filtering.
Algorithm: TD/delta rules, simple weights.
Neurobiology: neuromodulators, amygdala, OFC, nucleus accumbens, dorsal striatum.

Markov Decision Process
A class of stylized tasks with states, actions & rewards: at each timestep t the world takes on state $s_t$ and delivers reward $r_t$, and the agent chooses an action $a_t$.

Markov Decision Process
World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.

Markov Decision Process
A stochastic process defined by:
reward function: $r_t \sim P(r_t \mid s_t)$
transition function: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

Markov Decision Process
A stochastic process defined by:
reward function: $r_t \sim P(r_t \mid s_t)$
transition function: $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$
Markov property: the future is conditionally independent of the past, given $s_t$.
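A minimal sketch of this generative view of an MDP; the states, rewards, actions, and transition probabilities are invented (loosely echoing the dialogue slide above):

```python
import random

# A tiny MDP written as explicit reward and transition distributions:
#   r_t ~ P(r | s_t),   s_{t+1} ~ P(s' | s_t, a_t)
random.seed(0)
reward = {34: 3, 77: -7}                    # deterministic reward per state (illustrative)
P = {                                       # P(s' | s, a), illustrative
    (34, 1): {34: 1.0},
    (34, 2): {77: 0.9, 34: 0.1},
    (77, 1): {34: 1.0},
    (77, 2): {77: 1.0},
}

def sample(dist):
    u, acc = random.random(), 0.0
    for s_next, p in dist.items():
        acc += p
        if u <= acc:
            return s_next
    return s_next                           # guard against rounding

s = 34
for a in (2, 1):                            # the robot's two choices from the dialogue
    r = reward[s]
    s_next = sample(P[(s, a)])
    print(f"state {s}, reward {r}, action {a} -> state {s_next}")
    s = s_next
```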

The optimal policy
Definition: a policy such that at every state, its expected value is better than (or equal to) that of all other policies.
Theorem: for every MDP there exists (at least) one deterministic optimal policy.
By the way, why is the optimal policy just a mapping from states to actions? Couldn't you earn more reward by choosing a different action depending on the last 2 states?
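The theorem can be made concrete with value iteration on a known MDP: the greedy policy read off the converged values is a deterministic state-to-action map. A small sketch on an invented two-state MDP (states, rewards, and the discount are assumptions):

```python
# Value iteration on a tiny, invented MDP; the greedy policy it returns is a
# deterministic state -> action map, as the theorem promises.
gamma = 0.9
states = ["A", "B"]
actions = ["stay", "go"]
P = {                                   # P[(s, a)] = list of (probability, next_state, reward)
    ("A", "stay"): [(1.0, "A", 0.0)],
    ("A", "go"):   [(1.0, "B", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "go"):   [(1.0, "A", 0.0)],
}

V = {s: 0.0 for s in states}
for _ in range(200):                    # Bellman optimality backups
    V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]) for a in actions)
         for s in states}

policy = {s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)]))
          for s in states}
print(V, policy)                        # e.g. A -> 'go', B -> 'stay'
```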

Pavlovian & Instrumental Conditioning
Pavlovian: learning values and predictions using the TD error.
Instrumental: learning actions, by reinforcement (leg flexion) or by a (TD) critic (actually different forms: goal-directed & habitual).

Pavlovian-Instrumental Interactions
Synergistic: conditioned reinforcement; Pavlovian-instrumental transfer, where the Pavlovian cue predicts the instrumental outcome; behavioural inhibition to avoid aversive outcomes.
Neutral: Pavlovian-instrumental transfer where the Pavlovian cue predicts an outcome with the same motivational valence.
Opponent: Pavlovian-instrumental transfer where the Pavlovian cue predicts the opposite motivational valence; negative automaintenance.

-ve Automaintenance in Autoshaping
Simple choice task: N (nogo) gives reward r=1; G (go) gives reward r=0.
Learn three quantities: the average value, the Q value for N, and the Q value for G; the instrumental propensity is based on these.

-ve Automaintenance in Autoshaping
Pavlovian action: assert that the Pavlovian impetus towards G is v(t).
Weight the Pavlovian and instrumental advantages by ω, the competitive reliability of Pavlov.
This gives new propensities and a new action choice.
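A hedged sketch of the competition described here: instrumental propensities are mixed with a Pavlovian impetus towards G proportional to v(t), weighted by ω, before a softmax choice. The functional forms and all constants are assumptions for illustration only, not the slide's exact model.

```python
import math, random

# Negative automaintenance sketch: N (nogo) earns r=1, G (go) earns r=0.
# Instrumental propensities alone would come to favour N, but a Pavlovian impetus
# towards G, proportional to the predicted value v(t), is mixed in with weight omega.
random.seed(0)
mu, eps, omega = 5.0, 0.05, 0.4
m = {"N": 0.0, "G": 0.0}      # instrumental propensities
v = 0.0                       # Pavlovian value prediction
p_G = 0.5

for trial in range(2000):
    adv = {"N": (1 - omega) * m["N"],
           "G": (1 - omega) * m["G"] + omega * v}   # weight Pavlov vs. instrumental by omega
    z = {a: math.exp(mu * adv[a]) for a in adv}
    p_G = z["G"] / (z["G"] + z["N"])
    a = "G" if random.random() < p_G else "N"
    r = 1.0 if a == "N" else 0.0
    m[a] += eps * (r - m[a])  # instrumental learning (RW on the chosen propensity)
    v += eps * (r - v)        # Pavlovian value learning

print(round(p_G, 3))          # equilibrium P(go) stays away from zero despite costing reward
```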

-ve Automaintenance in Autoshaping
Basic -ve automaintenance effect (µ=5): lines are theoretical asymptotes; equilibrium probabilities of action.