Reinforcement learning crash course
1 Reinforcement learning crash course. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich
2 Overview: a rough overview, mainly following Sutton & Barto 1998; dopamine, prediction errors and more; fitting behaviour with RL models; hierarchical approaches
3 Setup: the agent emits actions a_t, the environment returns states s_t and rewards r_t. The agent's goal: {a_t} = argmax_{a_t} \sum_{t=1}^\infty r_t. After Sutton and Barto 1998
4 State space: gold +1, electric shocks -1
5 A Markov Decision Problem: states s_t \in S; actions a_t \in A; transition probabilities T^a_{ss'} = p(s_{t+1} | s_t, a_t); rewards r_t \sim R(s_{t+1}, a_t, s_t); policy \pi(a|s) = p(a|s)
7 Actions: action left with transition matrix T^{left}, action right with T^{right}. Dynamics are noisy: plants, environments, agent. Absorbing state -> max eigenvalue < 1
8 Markovian dynamics: p(s_{t+1} | a_t, s_t, a_{t-1}, s_{t-1}, a_{t-2}, s_{t-2}, ...) = p(s_{t+1} | a_t, s_t). If velocity matters, augment the state: s = [position] becomes s = [position; velocity]
11 Tall orders. Aim: maximise total future reward \sum_{t=1}^\infty r_t, i.e. we have to sum over paths through the future and weigh each by its probability. The best policy achieves the best long-term reward
12 Exhaustive tree search: width w, depth d
13 Decision tree: \sum_{t=1}^\infty r_t
14 Policy for this talk: pose the problem mathematically; policy evaluation; policy iteration; Monte Carlo techniques: experience samples; TD learning. (Loop: evaluate the policy, update the policy)
15 Evaluating a policy. Aim: maximise total future reward \sum_{t=1}^\infty r_t. To know which policy is best, evaluate it first. The policy determines the expected reward from each state: V^\pi(s_1) = E[ \sum_{t=1}^\infty r_t | s_1 = s, a_t \sim \pi ]
16 Discounting. Given a policy, each state has an expected value V^\pi(s_1) = E[ \sum_{t=1}^\infty r_t | s_1 = s, a_t \sim \pi ]. But the infinite sum may diverge. Episodic tasks: \sum_{t=0}^T r_t < \infty. Discounted infinite horizons: \sum_{t=0}^\infty \gamma^t r_t < \infty; discounting is equivalent to finite, exponentially distributed horizons, p(T) \propto e^{-T/\tau}
17 Markov Decision Problems. V^\pi(s_t) = E[ \sum_{t'=1}^\infty r_{t'} | s_t = s, \pi ] = E[ r_1 | s_t = s, \pi ] + E[ \sum_{t=2}^\infty r_t | s_t = s, \pi ] = E[ r_1 | s_t = s, \pi ] + E[ V^\pi(s_{t+1}) | s_t = s, \pi ]. This dynamic consistency is key to many solution approaches. It states that the value of a state s is related to the values of its successor states s'.
18 Markov Decision Problems. V^\pi(s_t) = E[ r_1 | s_t = s, \pi ] + E[ V^\pi(s_{t+1}) | \pi ], with r_1 \sim R(s_2, a_1, s_1). The first term expands as E[ r_1 | s_t = s, \pi ] = E_\pi[ \sum_{s_{t+1}} p(s_{t+1} | s_t, a_t) R(s_{t+1}, a_t, s_t) ] = \sum_{a_t} p(a_t | s_t) \sum_{s_{t+1}} p(s_{t+1} | s_t, a_t) R(s_{t+1}, a_t, s_t) = \sum_{a_t} \pi(a_t | s_t) \sum_{s_{t+1}} T^{a_t}_{s_t s_{t+1}} R(s_{t+1}, a_t, s_t)
19 Bellman equation. V^\pi(s_t) = E[ r_1 | s_t = s, \pi ] + E[ V^\pi(s_{t+1}) | \pi ], where E[ r_1 | s_t, \pi ] = \sum_a \pi(a | s_t) \sum_{s_{t+1}} T^a_{s_t s_{t+1}} R(s_{t+1}, a, s_t) and E[ V^\pi(s_{t+1}) | \pi, s_t ] = \sum_a \pi(a | s_t) \sum_{s_{t+1}} T^a_{s_t s_{t+1}} V^\pi(s_{t+1}). Together: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ]
20 Bellman equation. All future reward from state s = E[ immediate reward + all future reward from the next state s' ]: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ]
21 Q values = state-action values. In V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ], the inner sum defines the state-action value: Q^\pi(s, a) = \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ] = E[ \sum_{t=1}^\infty r_t | s, a, \pi ], and state values are policy-averaged state-action values: V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)
22 Bellman equation. V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ]. To evaluate a policy, we need to solve this equation, i.e. find the self-consistent state values. Options for policy evaluation: exhaustive tree search (outwards, inwards, depth-first); value iteration: iterative updates; linear solution in one step; experience sampling
23 Solving the Bellman equation. Option 1: turn it into an update equation: V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V_k(s') ]. Option 2: linear solution (with absorbing states): V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ], i.e. v = R + T v, so v = (I - T)^{-1} R, at cost O(|S|^3)
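Both solution options can be sketched on a tiny policy-evaluated chain. This is a minimal illustration with hypothetical numbers: P is the policy-averaged transition matrix and r the expected immediate reward per state; neither comes from the slides.

```python
import numpy as np

# Hypothetical 2-state chain under a fixed policy: P[s, s'] is the
# policy-averaged transition matrix, r[s] the expected immediate reward.
P = np.array([[0.9, 0.1],
              [0.0, 0.0]])   # state 1 leads to an absorbing exit
r = np.array([0.0, 1.0])

# Option 1: turn the Bellman equation into an update and iterate to a fixed point.
v_iter = np.zeros(2)
for _ in range(1000):
    v_iter = r + P @ v_iter

# Option 2: linear solution v = (I - P)^{-1} r (valid with absorbing states,
# where the spectral radius of P is below 1).
v_lin = np.linalg.solve(np.eye(2) - P, r)

print(v_iter, v_lin)  # the two solutions agree
```

The iterative route costs one matrix-vector product per sweep; the linear solve pays the O(|S|^3) factorisation once.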
24 Policy update. Given the value function for a policy, say via the linear solution: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ], with Q^\pi(s, a) the inner sum. Given the values V^\pi, we can improve the policy by always choosing the best action: \pi'(a|s) = 1 if a = argmax_a Q^\pi(s, a), 0 else. This is guaranteed to improve: for a deterministic policy, Q^\pi(s, \pi'(s)) = max_a Q^\pi(s, a) >= Q^\pi(s, \pi(s)) = V^\pi(s)
25 Policy iteration: alternate policy evaluation, v^\pi = (I - T^\pi)^{-1} R^\pi, with greedy policy improvement, \pi(a|s) = 1 if a = argmax_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V^\pi(s') ], 0 else. Value iteration: V(s) = max_a \sum_{s'} T^a_{ss'} [ R^a_{ss'} + V(s') ]
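The evaluate-then-greedify loop above can be sketched end to end. This is a hypothetical 2-state, 2-action MDP (numbers invented for illustration, with a discount gamma that the slides' undiscounted notation omits):

```python
import numpy as np

# Hypothetical MDP: T[a, s, s'] transition probabilities, R[a, s] expected reward.
T = np.array([[[1.0, 0.0], [1.0, 0.0]],    # action 0: always go to state 0
              [[0.0, 1.0], [0.0, 1.0]]])   # action 1: always go to state 1
R = np.array([[0.0, 0.0],                  # action 0 pays nothing
              [1.0, 1.0]])                 # action 1 pays 1
gamma = 0.9

policy = np.zeros(2, dtype=int)            # start with "always action 0"
for _ in range(10):
    # Policy evaluation: v = (I - gamma * T_pi)^{-1} R_pi
    T_pi = T[policy, np.arange(2)]
    R_pi = R[policy, np.arange(2)]
    v = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
    # Greedy policy improvement on the resulting Q values
    Q = R + gamma * np.einsum('ast,t->as', T, v)
    new_policy = Q.argmax(axis=0)
    if np.array_equal(new_policy, policy):
        break                              # policy is stable: done
    policy = new_policy

print(policy)  # converges to "always take action 1"
```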
26 Model-free solutions. So far we have assumed knowledge of R and T. R and T are the model of the world, so we assume full knowledge of the dynamics and rewards in the environment. What if we don't know them? We can still learn from state-action-reward samples: we can learn R and T from them and use our estimates to solve as above; alternatively, we can directly estimate V or Q
27 Solving the Bellman equation. Option 3: sampling. V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V^\pi(s') ] is an expectation over the policy and the transitions, so we can just draw samples from both and average over them (more about this later): a = \sum_k f(x_k) p(x_k) is approximated by \hat{a} = (1/N) \sum_i f(x^{(i)}), with x^{(i)} \sim p(x)
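The sample-average idea is easy to see in isolation. A minimal sketch with an invented discrete distribution p and function f:

```python
import numpy as np

# Estimating a = sum_k f(x_k) p(x_k) from samples x^(i) ~ p(x).
rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])   # hypothetical distribution over 3 outcomes
f = np.array([1.0, 2.0, 4.0])

exact = (f * p).sum()                         # the true expectation
samples = rng.choice(3, size=100_000, p=p)    # draw x^(i) ~ p
estimate = f[samples].mean()                  # (1/N) * sum_i f(x^(i))
print(exact, estimate)                        # estimate approaches exact as N grows
```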
28 Learning from samples. A new problem: exploration versus exploitation
29 Monte Carlo. First-visit MC: randomly start in all states, generate paths, and average the returns for the starting state only: V(s) = (1/N) \sum_i \sum_{t'=1}^T r^i_{t'}, with s_0 = s. More efficient use of samples: every-visit MC; bootstrapping: TD; Dyna; better samples; on-policy versus off-policy; stochastic search, UCT, ...
30 Update equation: towards TD. Bellman equation: V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. If V has not yet converged, the equation does not hold, leaving a residual: dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. We then use this to update: V_{i+1}(s) = V_i(s) + dV(s)
31 TD learning. dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} [ R(s', a, s) + V(s') ]. Sample instead: a_t \sim \pi(a|s_t), s_{t+1} \sim T^{a_t}_{s_t s_{t+1}}, r_t = R(s_{t+1}, a_t, s_t); the sampled residual is the TD error \delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1}), and instead of V_{i+1}(s) = V_i(s) + dV(s) we update V_t(s_t) = V_{t-1}(s_t) + \alpha \delta_t
32 TD learning. a_t \sim \pi(a|s_t), s_{t+1} \sim T^{a_t}_{s_t s_{t+1}}, r_t = R(s_{t+1}, a_t, s_t); \delta_t = -V_t(s_t) + r_t + V_t(s_{t+1}); V_{t+1}(s_t) = V_t(s_t) + \alpha \delta_t
33 SARSA. Do TD for state-action values instead: Q(s_t, a_t) <- Q(s_t, a_t) + \alpha [ r_t + Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ], using the sample s_t, a_t, r_t, s_{t+1}, a_{t+1}. Convergence guarantees: will estimate Q^\pi(s, a)
34 Q-learning: off-policy. Learn off-policy: draw actions from some behaviour policy, only requiring extensive sampling. Q(s_t, a_t) <- Q(s_t, a_t) + \alpha [ r_t + max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]: update towards the optimum; will estimate Q*(s, a)
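The off-policy property is visible in a sketch: even with a purely random behaviour policy, the max bootstrap drives Q to the optimal values. The environment below is hypothetical (a two-state chain with a rewarded goal), chosen only to make the fixed point easy to check by hand:

```python
import numpy as np

# Q-learning: random behaviour policy, greedy (max) bootstrap target.
# Hypothetical chain, gamma = 0.9: from s0, action 1 moves to s1, action 0 stays;
# from s1, action 1 reaches the goal (reward 1, terminal), action 0 returns to s0.
rng = np.random.default_rng(0)
Q = np.zeros((3, 2))            # row 2 = terminal state, stays at 0
alpha, gamma = 0.2, 0.9

def step(s, a):
    if s == 0:
        return (1, 0.0) if a == 1 else (0, 0.0)
    return (2, 1.0) if a == 1 else (0, 0.0)

for _ in range(3000):
    s = rng.integers(2)         # exploring starts
    a = rng.integers(2)         # off-policy: behaviour is uniformly random
    s2, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])

print(Q[:2].round(2))
```

By hand, the optimal values here are Q*(s1, 1) = 1, Q*(s0, 1) = 0.9, and Q*(s0, 0) = Q*(s1, 0) = 0.81, which the sampled estimates approach.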
35 The effect of bootstrapping. Observed episodes: B1, B1, B1, B1, B1, B1, B0, and A0 followed by B0. Markov (every-visit) MC: V(B) = 3/4, V(A) = 0. TD: V(B) = 3/4, V(A) approx 3/4. Averaging over various degrees of bootstrapping: TD(\lambda). After Sutton and Barto 1998
36 Conclusion. Long-term rewards have an internal consistency. This can be exploited for solution. Exploration and exploitation trade off when sampling. Clever use of samples can produce fast learning. The brain most likely does something like this
37 Fitting models to behaviour. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich
38 Example task. A Go/Nogo task with four conditions: Go to win (Go rewarded), Go to avoid (Nogo punished), Nogo to win (Nogo rewarded), Nogo to avoid (Go punished). [Figure: probability correct and probability(Go) over trials for each condition.] Think of it as four separate two-armed bandit tasks. Guitart-Masip, Huys et al. 2012
39 Analysing behaviour. Standard approach: decide which feature of the data you care about; run descriptive statistical tests, e.g. ANOVA. [Figure: probability correct for Go to win, Go to avoid, Nogo to win, Nogo to avoid.] Many strengths. Weaknesses: piecemeal, not holistic/global; descriptive, not generative; no internal variables
40 Models. Holistic: aim to model the process by which the data came about in its entirety. Generative: they can be run on the task to generate data as if a subject had done the task. Inference process: capture the inference process subjects have to make to perform the task, in sufficient detail to replicate the data. Parameters replace test statistics; their meaning is explicit in the model
41 From Q values to actions. The learning process: Q_t(a_t, s_t) = Q_{t-1}(a_t, s_t) + \epsilon (r_t - Q_{t-1}(a_t, s_t)). The link function turns values into choice probabilities: p(a_t | s_t, h_t, \theta) = p(a_t | Q(a_t, s_t), \theta), with p(a_t | s_t) \propto e^{\beta Q(a_t, s_t)}, i.e. p(a_t | s_t) = e^{\beta Q(a_t, s_t)} / \sum_{a'} e^{\beta Q(a', s_t)}, so 0 <= p(a) <= 1. The link function connects the learning process and the observations: choices, RTs, or any other data
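Run as a generative model, the Rescorla-Wagner update plus softmax link produces choice data. A minimal sketch, assuming a single state, two actions, and invented reward probabilities and parameter values:

```python
import numpy as np

# Generative Rescorla-Wagner + softmax: eps = learning rate, beta = inverse
# temperature. One state, two actions; reward probabilities are hypothetical.
rng = np.random.default_rng(0)
eps, beta = 0.2, 3.0
p_reward = np.array([0.2, 0.8])
Q = np.zeros(2)
choices = []

for t in range(200):
    p = np.exp(beta * Q) / np.exp(beta * Q).sum()  # p(a) proportional to e^{beta Q(a)}
    a = rng.choice(2, p=p)                         # sample a choice
    r = float(rng.random() < p_reward[a])          # sample a reward
    Q[a] += eps * (r - Q[a])                       # RW update with prediction error
    choices.append(a)

print(np.mean(choices[-50:]))  # mostly the better action by the end of learning
```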
42 Fitting models I. Maximum likelihood (ML) parameters: \hat{\theta} = argmax_\theta L(\theta), where the likelihood of all choices is L(\theta) = log p({a_t}_{t=1}^T | {s_t}_{t=1}^T, {r_t}_{t=1}^T, \theta) = log p({a_t}_{t=1}^T | {Q(s_t, a_t; \theta)}_{t=1}^T, \theta) = log \prod_{t=1}^T p(a_t | Q(s_t, a_t; \theta), \theta) = \sum_{t=1}^T log p(a_t | Q(s_t, a_t; \theta), \theta)
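The likelihood is computed by replaying the observed choices through the model, trial by trial. A sketch with a tiny invented data set (the choice/reward sequences and parameter values are hypothetical):

```python
import numpy as np

# Sum of log p(a_t | Q_t, theta) for an RW + softmax model, theta = (eps, beta).
def log_likelihood(theta, choices, rewards, n_actions=2):
    eps, beta = theta
    Q = np.zeros(n_actions)
    L = 0.0
    for a, r in zip(choices, rewards):
        logp = beta * Q - np.log(np.exp(beta * Q).sum())  # softmax log-probs
        L += logp[a]                                      # credit the observed choice
        Q[a] += eps * (r - Q[a])                          # RW update after the choice
    return L

choices = [1, 1, 0, 1, 1]          # hypothetical data
rewards = [1.0, 1.0, 0.0, 1.0, 1.0]
print(log_likelihood((0.3, 2.0), choices, rewards))
```

An optimiser then searches theta for the maximum of this function.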
43 Fitting models II. There is no closed form; use your favourite method: gradients, fminunc/fmincon, ... Gradients for the RW model: dL(\theta)/d\theta = \sum_t (d/d\theta) log p(a_t | Q_t(a_t, s_t; \theta), \theta), with (d/d\theta) log p(a_t | \cdot) = \beta [ (d/d\theta) Q_t(a_t, s_t; \theta) - \sum_{a'} p(a' | Q_t(a', s_t; \theta), \theta) (d/d\theta) Q_t(a', s_t; \theta) ] and, for the learning rate, dQ_t(a_t, s_t; \theta)/d\epsilon = (1 - \epsilon) dQ_{t-1}(a_t, s_t; \theta)/d\epsilon + (r_t - Q_{t-1}(a_t, s_t; \theta))
44 Little tricks. Transform your variables: \beta = e^\theta, \theta = log \beta; \epsilon = 1/(1 + e^{-\theta}), \theta = log(\epsilon/(1 - \epsilon)); then optimise d log L(\theta)/d\theta unconstrained. Avoid over/underflow: with y(a) = \beta Q(a) and y_m = max_a y(a), p(a) = e^{y(a)} / \sum_b e^{y(b)} = e^{y(a) - y_m} / \sum_b e^{y(b) - y_m}
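Both tricks are two lines each in practice. A sketch (the particular theta values and the extreme Q values are invented to show the transforms and the overflow case):

```python
import numpy as np

# Trick 2: max-subtracted softmax avoids overflow in exp().
def softmax(y):
    y = y - y.max()                      # subtract y_m = max_a y(a)
    return np.exp(y) / np.exp(y).sum()

# Trick 1: fit unconstrained theta; the transforms enforce the constraints.
theta_beta, theta_eps = np.log(2.0), 0.0
beta = np.exp(theta_beta)                # positive by construction (here 2.0)
eps = 1.0 / (1.0 + np.exp(-theta_eps))   # inside (0, 1) by construction (here 0.5)

p = softmax(beta * np.array([1000.0, 1001.0]))  # naive exp(2000) would overflow
print(beta, eps, p)
```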
45 ML characteristics. ML is asymptotically consistent, but the variance is high. [Figure: ML estimates across runs for a 10-armed bandit, inferring \beta and \epsilon; 1 stimulus, 10 actions, learning rate = .05, \beta = 2; likelihoods for short versus long runs.]
46 Priors. [Figure: not-so-smooth versus smooth prior probability distributions over a parameter; memory and switch data from Dombrovski et al. 2010.]
47 Maximum a posteriori estimate. P(\theta) = p(\theta | a_{1...T}) = p(a_{1...T} | \theta) p(\theta) / \int d\theta' p(a_{1...T} | \theta') p(\theta'); log P(\theta) = \sum_{t=1}^T log p(a_t | \theta) + log p(\theta) + const.; d log P(\theta)/d\theta = d log L(\theta)/d\theta + d log p(\theta)/d\theta. If the likelihood is strong, the prior has little effect; it mainly influences poorly constrained parameters. If a parameter is strongly constrained to be outside the typical range of the prior, then it will win over the prior
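The weak-data versus strong-data behaviour of the MAP estimate can be seen in a toy model. This sketch uses a hypothetical single-parameter Bernoulli choice model with a Gaussian prior on the transformed parameter; the counts and prior settings are invented:

```python
import numpy as np

# log posterior = log likelihood + log prior (up to a constant).
def log_posterior(theta, n_choices_a, n_trials, mu=0.0, sigma=1.0):
    p_a = 1.0 / (1.0 + np.exp(-theta))              # sigmoid-transformed parameter
    log_lik = n_choices_a * np.log(p_a) + (n_trials - n_choices_a) * np.log(1 - p_a)
    log_prior = -0.5 * ((theta - mu) / sigma) ** 2  # Gaussian prior on theta
    return log_lik + log_prior

thetas = np.linspace(-5, 5, 2001)
# 80% "a" choices: weak data (10 trials) versus strong data (1000 trials)
weak = thetas[np.argmax([log_posterior(t, 8, 10) for t in thetas])]
strong = thetas[np.argmax([log_posterior(t, 800, 1000) for t in thetas])]
print(weak, strong)  # the strong-data MAP sits near the ML value logit(0.8)
```

With few trials the prior pulls the estimate toward its mean of 0; with many trials the likelihood dominates.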
48 Maximum a posteriori estimate. [Figure: MAP estimates across runs; 200 trials, 1 stimulus, 10 actions, learning rate = .05, \beta = 2; prior settings mbeta = 0, meps = -3, n = 1.]
49 But: what prior parameters should I use?
50 Hierarchical estimation: random effects. Fixed effect: conflates within- and between-subject variability. Average behaviour: disregards between-subject variability; need to adapt the model. Summary statistic: treat parameters as random variables, one per subject; overestimates the group variance because ML estimates are noisy. Random effects: prior mean = group mean; p(A_i | \mu, \sigma) = \int d\theta_i p(A_i | \theta_i) p(\theta_i | \mu, \sigma)
51 Random effects: see subjects as drawn from a group. Fixed effect with respect to the model: all subjects share the same, parametrically nested model, e.g. Q(a, s) = \omega_1 Q_1(a, s) + \omega_2 Q_2(a, s); this assumes a within-subject mixture rather than a group mixture of perfect types. Random effects in models: each subject's model and parameters are drawn from group-level distributions
52 Estimating the hyperparameters. Effectively we now want to do gradient ascent on p(A | \mu, \sigma). But this contains an integral over the individual parameters: p(A | \mu, \sigma) = \int d\theta p(A | \theta) p(\theta | \mu, \sigma). So we need (\hat{\mu}, \hat{\sigma}) = argmax_{\mu, \sigma} p(A | \mu, \sigma) = argmax_{\mu, \sigma} \int d\theta p(A | \theta) p(\theta | \mu, \sigma)
53 Inference. (\hat{\mu}, \hat{\sigma}) = argmax_{\mu, \sigma} \int d\theta p(A | \theta) p(\theta | \mu, \sigma). Options: analytical (rare); brute force (for simple problems); Expectation Maximisation (approximate, easy); variational Bayes; sampling/MCMC
54 Expectation Maximisation. log p(A | \mu, \sigma) = log \int d\theta p(A, \theta | \mu, \sigma) = log \int d\theta q(\theta) [ p(A, \theta | \mu, \sigma) / q(\theta) ] >= \int d\theta q(\theta) log [ p(A, \theta | \mu, \sigma) / q(\theta) ] by Jensen's inequality. kth E step: q^{(k+1)}(\theta) <- p(\theta | A, \mu^{(k)}, \sigma^{(k)}). kth M step: (\mu, \sigma)^{(k+1)} <- argmax \int d\theta q(\theta) log p(A, \theta | \mu, \sigma). Iterate between estimating MAP parameters given the prior parameters and estimating prior parameters from the MAP parameters
55 Bayesian Information Criterion. Laplace's approximation (saddle-point method): approximate the integrand by a Gaussian around its peak x_0, \int dx f(x) approx f(x_0) \sqrt{2\pi\sigma^2}, with \sigma^2 = -1 / (d^2/dx^2) log f(x) at x_0
56 EM with Laplace approximation. E step: we only need sufficient statistics to perform the M step, so approximate p(\theta | A, \mu^{(k)}, \sigma^{(k)}) approx N(m_k, S_k), i.e. q^{(k)}(\theta) = N(m_k, S_k), with m_k = argmax_\theta p(A | \theta) p(\theta | \mu^{(i)}, \sigma^{(i)}) and S_k^{-1} = -(\partial^2/\partial\theta^2) log [ p(A | \theta) p(\theta | \mu^{(i)}, \sigma^{(i)}) ] at \theta = m_k. In matlab: [m,l,,,S] = fminunc(...). This is just what we had before: MAP inference given some prior parameters
57 EM with Laplace approximation: updates. M step: prior mean = mean of the MAP estimates, \mu^{(i+1)} = (1/K) \sum_k m_k; the prior variance depends on the variance of the MAP estimates and on the inverse Hessians S_k: (\sigma^2)^{(i+1)} = (1/K) \sum_k [ (m_k)^2 + S_k ] - (\mu^{(i+1)})^2. This takes the uncertainty of the estimates into account. And now iterate until convergence
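The M-step updates are one line each. A sketch with invented numbers for four subjects and a single parameter (m_k and S_k would come from the per-subject MAP fits of the E step):

```python
import numpy as np

# Hypothetical E-step output: per-subject MAP means m_k and posterior variances S_k.
m = np.array([0.5, 0.8, 0.2, 0.9])      # MAP estimates
S = np.array([0.04, 0.09, 0.25, 0.01])  # posterior variances (estimate uncertainty)

mu = m.mean()                    # prior mean = mean of MAP estimates
var = np.mean(m**2 + S) - mu**2  # prior variance includes the S_k terms
print(mu, var)
```

Dropping the S_k term would recover the plain variance of the MAP estimates, which understates the group variance when individual fits are uncertain.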
58 Parameter recovery. [Figure: true versus inferred parameter correlations for the Rescorla-Wagner model (\beta, \epsilon) and a two-step model (\beta_1, \omega), under simulations with truly uncorrelated or truly correlated parameters, with and without a GLM; Nsj = 10, 20, 50, 100; estimators: EM-MAP, MAP0, ML.]
59 Correlations. [Figure: parameter correlations (\beta' and \alpha' for RW; \beta_1' and \omega' for the two-step model) under ML, under MAP, and via the inferred prior covariance.]
60 Are the parameters ok for correlations? Individual subject parameter estimates are NO LONGER INDEPENDENT! Changing the group changes the parameter estimates. Comparing parameters fitted under different priors is problematic; correlations and t-tests within the same prior are ok. [Figure: p values for RW and two-step models under ML, EM-MAP and MAP; Nsj = 10, 30, 100.]
61 GLM. So far: infer individual parameters, then apply standard tests. Alternative: view the effect as variation across the group and build the regressor into the prior, \mu_i = \mu_Group + \beta x_i, then infer \beta. Specific, hence potentially more powerful
62 GLM. [Figure: GLM estimates of a group-level regressor; Nsj = 10, 20, 50, 100 and Nsj = 10, 30, 100.]
63 Fitting: how to. Write your likelihood function (matlab examples attached with emfit.m); don't do 20 separate ML fits! Pass it into emfit.m (or the julia version). Validate: generate data with the fitted parameters, compare, have a look: does it look right? Re-fit: is it stable? Do model comparison. Then: look at the parameters, do correlations etc. Future: GLM, full random effects over models and parameters jointly? (Daniel Schad)
64 Hierarchical / random effects models. Advantages: accurate group-level mean and variance; outliers due to a weak likelihood are regularised, strong outliers are not; useful for model selection. Disadvantages: individual estimates depend on the other subjects' data A_{j != i}, so be careful when interpreting them as summary statistics; more involved, less transparent. Psychiatry: groups are often not well defined, covariates may be better. fMRI: shrinking the variance of ML estimates; are fixed effects better still?
65 How does it do? [Figure: log-likelihood (LL) across a family of nested models: separate-parameter variants (Sep), GoOnly variants, and \beta\alpha Q variants.]
66 Overfitting. [Figure: an overly flexible curve chasing the noise in Y versus X.]
67 Model comparison. A fit by itself is not meaningful. Generative test: qualitative. Comparisons: versus random, versus other models, to test specific hypotheses and isolate particular effects in a generative setting
68 Model comparison. Averaged over its parameter settings, how well does the model fit the data? p(A | M) = \int d\theta p(A | \theta) p(\theta | M). Model comparison via Bayes factors: BF = p(A | M_1) / p(A | M_2). Problem: the integral is rarely solvable; approximations: Laplace, sampling, variational, ...
69 Why integrals? The God Almighty test: a powerful model spreads its probability over many possible data sets, a weak model concentrates it on few. The integral can be approximated by sampling parameters from the prior: p(A | M) approx (1/N) [ p(A | \theta^{(1)}) + p(A | \theta^{(2)}) + ... ]. These two factors fight it out: model complexity versus model fit
70 Group-level BIC. log p(A | M) = log \int d\theta p(A | \theta) p(\theta | M) approx -(1/2) BIC_int = log \hat{p}(A | \hat{\theta}_ML) - (1/2) |M| log |A|. Very simple: 1) EM to estimate the group prior mean and variance (simply done using fminunc, which provides Hessians); 2) sample from the estimated priors; 3) average
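The penalised-likelihood arithmetic is simple enough to sketch. The numbers below are hypothetical (two models, a modest likelihood gain for the larger one), and the function just restates the BIC formula above in the conventional -2 log-likelihood scale:

```python
import numpy as np

# BIC = -2 log p(A | theta_hat) + |M| log |A|, where |M| counts the (prior)
# parameters and |A| the number of data points; lower is better.
def bic_int(log_lik_hat, n_prior_params, n_datapoints):
    return -2.0 * log_lik_hat + n_prior_params * np.log(n_datapoints)

# Hypothetical comparison: model 2 fits slightly better but has more parameters.
b1 = bic_int(-1000.0, 4, 2000)   # e.g. 2 prior means + 2 prior variances
b2 = bic_int(-998.0, 8, 2000)
print(b1, b2)  # here the extra parameters do not pay for the small fit gain
```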
71 How does it do? [Figure: model comparison over the same model family, summed individual BIC_i versus the integrated group-level likelihood (LL Int).] Fitted by EM... too nice?
72 Group model selection: integrate out your parameters
73 Model comparison: overfitting? [Figure: BIC_int for RW; RW + noise; RW + noise + Q0; RW + noise + bias; RW (rew/pun) + noise + bias; RW + noise + bias + Pav; note: same number of parameters. Probability(Go) learning curves for Go to win, Go to avoid, Nogo to win, Nogo to avoid.]
74 Behavioural data modelling. Models are no panacea: they are statistics about specific aspects of the decision machinery and only account for part of the variance. The model needs to match the experiment: ensure subjects actually do the task the way you wrote it in the model; use model comparison. Model = quantitative hypothesis: a strong test; compare models, not parameters; a model includes all consequences of a hypothesis for choice
75 Thanks. Peter Dayan, Daniel Schad, Nathaniel Daw. SNSF, DFG
More information, and rewards and transition matrices as shown below:
CSE 50a. Assignment 7 Out: Tue Nov Due: Thu Dec Reading: Sutton & Barto, Chapters -. 7. Policy improvement Consider the Markov decision process (MDP) with two states s {0, }, two actions a {0, }, discount
More informationReinforcement Learning
CS7/CS7 Fall 005 Supervised Learning: Training examples: (x,y) Direct feedback y for each input x Sequence of decisions with eventual feedback No teacher that critiques individual actions Learn to act
More informationAn Adaptive Clustering Method for Model-free Reinforcement Learning
An Adaptive Clustering Method for Model-free Reinforcement Learning Andreas Matt and Georg Regensburger Institute of Mathematics University of Innsbruck, Austria {andreas.matt, georg.regensburger}@uibk.ac.at
More informationThe connection of dropout and Bayesian statistics
The connection of dropout and Bayesian statistics Interpretation of dropout as approximate Bayesian modelling of NN http://mlg.eng.cam.ac.uk/yarin/thesis/thesis.pdf Dropout Geoffrey Hinton Google, University
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a
More informationBias-Variance Tradeoff
What s learning, revisited Overfitting Generative versus Discriminative Logistic Regression Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University September 19 th, 2007 Bias-Variance Tradeoff
More informationReinforcement Learning. Summer 2017 Defining MDPs, Planning
Reinforcement Learning Summer 2017 Defining MDPs, Planning understandability 0 Slide 10 time You are here Markov Process Where you will go depends only on where you are Markov Process: Information state
More informationMarkov Decision Processes
Markov Decision Processes Noel Welsh 11 November 2010 Noel Welsh () Markov Decision Processes 11 November 2010 1 / 30 Annoucements Applicant visitor day seeks robot demonstrators for exciting half hour
More informationComputer Vision Group Prof. Daniel Cremers. 11. Sampling Methods
Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More informationChapter 6: Temporal Difference Learning
Chapter 6: emporal Difference Learning Objectives of this chapter: Introduce emporal Difference (D) learning Focus first on policy evaluation, or prediction, methods hen extend to control methods R. S.
More informationComputer Vision Group Prof. Daniel Cremers. 14. Sampling Methods
Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric
More information15-889e Policy Search: Gradient Methods Emma Brunskill. All slides from David Silver (with EB adding minor modificafons), unless otherwise noted
15-889e Policy Search: Gradient Methods Emma Brunskill All slides from David Silver (with EB adding minor modificafons), unless otherwise noted Outline 1 Introduction 2 Finite Difference Policy Gradient
More informationReinforcement Learning: the basics
Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46 Introduction Action selection/planning Learning by trial-and-error
More informationFood delivered. Food obtained S 3
Press lever Enter magazine * S 0 Initial state S 1 Food delivered * S 2 No reward S 2 No reward S 3 Food obtained Supplementary Figure 1 Value propagation in tree search, after 50 steps of learning the
More informationVariational Autoencoders
Variational Autoencoders Recap: Story so far A classification MLP actually comprises two components A feature extraction network that converts the inputs into linearly separable features Or nearly linearly
More informationARTIFICIAL INTELLIGENCE. Reinforcement learning
INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html
More informationReinforcement Learning
1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationThe Book: Where we are and where we re going. CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Simple Monte Carlo
CSE 190: Reinforcement Learning: An Introduction Chapter 7: Eligibility races Acknowledgment: A good number of these slides are cribbed from Rich Sutton he Book: Where we are and where we re going Part
More informationPlanning in Markov Decision Processes
Carnegie Mellon School of Computer Science Deep Reinforcement Learning and Control Planning in Markov Decision Processes Lecture 3, CMU 10703 Katerina Fragkiadaki Markov Decision Process (MDP) A Markov
More informationCS-E3210 Machine Learning: Basic Principles
CS-E3210 Machine Learning: Basic Principles Lecture 4: Regression II slides by Markus Heinonen Department of Computer Science Aalto University, School of Science Autumn (Period I) 2017 1 / 61 Today s introduction
More informationUsing Gaussian Processes for Variance Reduction in Policy Gradient Algorithms *
Proceedings of the 8 th International Conference on Applied Informatics Eger, Hungary, January 27 30, 2010. Vol. 1. pp. 87 94. Using Gaussian Processes for Variance Reduction in Policy Gradient Algorithms
More informationMachine Learning I Continuous Reinforcement Learning
Machine Learning I Continuous Reinforcement Learning Thomas Rückstieß Technische Universität München January 7/8, 2010 RL Problem Statement (reminder) state s t+1 ENVIRONMENT reward r t+1 new step r t
More informationIntroduction to Reinforcement Learning. Part 6: Core Theory II: Bellman Equations and Dynamic Programming
Introduction to Reinforcement Learning Part 6: Core Theory II: Bellman Equations and Dynamic Programming Bellman Equations Recursive relationships among values that can be used to compute values The tree
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationComplexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning
Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
More informationIntroduction to Bayesian Learning. Machine Learning Fall 2018
Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationMODULE -4 BAYEIAN LEARNING
MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities
More informationInternet Monetization
Internet Monetization March May, 2013 Discrete time Finite A decision process (MDP) is reward process with decisions. It models an environment in which all states are and time is divided into stages. Definition
More informationBayesian Learning (II)
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP
More informationREINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning
REINFORCE Framework for Stochastic Policy Optimization and its use in Deep Learning Ronen Tamari The Hebrew University of Jerusalem Advanced Seminar in Deep Learning (#67679) February 28, 2016 Ronen Tamari
More informationProbabilistic Planning. George Konidaris
Probabilistic Planning George Konidaris gdk@cs.brown.edu Fall 2017 The Planning Problem Finding a sequence of actions to achieve some goal. Plans It s great when a plan just works but the world doesn t
More informationCOMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning. Hanna Kurniawati
COMP3702/7702 Artificial Intelligence Lecture 11: Introduction to Machine Learning and Reinforcement Learning Hanna Kurniawati Today } What is machine learning? } Where is it used? } Types of machine learning
More informationReinforcement Learning. Value Function Updates
Reinforcement Learning Value Function Updates Manfred Huber 2014 1 Value Function Updates Different methods for updating the value function Dynamic programming Simple Monte Carlo Temporal differencing
More informationCSC321 Lecture 22: Q-Learning
CSC321 Lecture 22: Q-Learning Roger Grosse Roger Grosse CSC321 Lecture 22: Q-Learning 1 / 21 Overview Second of 3 lectures on reinforcement learning Last time: policy gradient (e.g. REINFORCE) Optimize
More informationLecture 25: Learning 4. Victor R. Lesser. CMPSCI 683 Fall 2010
Lecture 25: Learning 4 Victor R. Lesser CMPSCI 683 Fall 2010 Final Exam Information Final EXAM on Th 12/16 at 4:00pm in Lederle Grad Res Ctr Rm A301 2 Hours but obviously you can leave early! Open Book
More information