Crash course
Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich
Computational Psychiatry Course, Zurich, 1.9.2016
Overview (rough, mainly following Sutton & Barto 1998):
- Dopamine, prediction errors and more
- Fitting behaviour with RL models
- Hierarchical approaches
Setup
The agent emits actions a_t; the environment returns states s_t and rewards r_t. The agent seeks the action sequence {a_t} = argmax_{a_t} Σ_t r_t.
After Sutton and Barto 1998
State space: gold +1, electric shocks −1
A Markov Decision Problem
- states s_t ∈ S
- actions a_t ∈ A
- transition probabilities T^a_{ss'} = p(s_{t+1} | s_t, a_t)
- rewards r_t ∼ R(s_{t+1}, a_t, s_t)
- policy π(a|s) = p(a|s)
Actions
States 1–7 plus an absorbing state; two actions, "left" and "right". T_left is noisy: it moves one state to the left with probability 0.8 and fails with probability 0.2; T_right moves deterministically to the right, ending in the absorbing state.
Noisy: plants, environments, agent. Absorbing state → maximum eigenvalue of T < 1.
Markovian dynamics
p(s_{t+1} | a_t, s_t, a_{t−1}, s_{t−1}, a_{t−2}, s_{t−2}, …) = p(s_{t+1} | a_t, s_t)
The future depends on the past only through the current state and action. If it does not (e.g. a moving object described only by its position), augment the state: s = [position] → s = [position, velocity].
A Markov Decision Problem
Example rewards −1, 0, +1 attached to transitions; otherwise as before: s_t ∈ S, a_t ∈ A, T^a_{ss'} = p(s_{t+1}|s_t,a_t), r_t ∼ R(s_{t+1},a_t,s_t), π(a|s) = p(a|s).
Tall orders
Aim: maximise total future reward Σ_{t=1}^∞ r_t
i.e. we have to sum over paths through the future and weigh each by its probability. The best policy achieves the best long-term reward.
Exhaustive tree search: with branching width w and depth d, the tree has on the order of w^d leaves.
Decision tree: evaluating Σ_{t=1}^∞ r_t by enumeration — with 8 options per step the number of paths grows as 8, 64, 512, …
Policy for this talk
- Pose the problem mathematically
- Policy evaluation
- Policy iteration
- Monte Carlo techniques: experience samples
- TD learning
(Loop: evaluate policy → update policy)
Evaluating a policy
Aim: maximise total future reward Σ_{t=1}^∞ r_t. To know which policy is best, evaluate it first. The policy determines the expected reward from each state:
V^π(s_1) = E[ Σ_{t=1}^∞ r_t | s_1, a_t ∼ π ]
Discounting
Given a policy, each state has an expected value V^π(s_1) = E[ Σ_{t=1}^∞ r_t | s_1, a_t ∼ π ]. But the infinite sum may diverge. Options:
- Episodic: Σ_{t=0}^T r_t < ∞
- Discounted infinite horizons: Σ_{t=0}^∞ γ^t r_t < ∞
- Discounting is equivalent to finite, exponentially distributed horizons: γ^t ↔ e^{−t/τ}
Markov Decision Problems
V^π(s_t) = E[ Σ_{t'=1}^∞ r_{t'} | s_t = s, π ]
         = E[ r_1 | s_t = s, π ] + E[ Σ_{t'=2}^∞ r_{t'} | s_t = s, π ]
         = E[ r_1 | s_t = s, π ] + E[ V^π(s_{t+1}) | s_t = s, π ]
This dynamic consistency is key to many solution approaches. It states that the value of a state s is related to the values of its successor states s'.
Markov Decision Problems
V^π(s_t) = E[ r_1 | s_t = s, π ] + E[ V^π(s_{t+1}) | s_t = s, π ],  with r_1 ∼ R(s_2, a_1, s_1)
E[ r_1 | s_t = s, π ] = Σ_{a_t} p(a_t|s_t) Σ_{s_{t+1}} p(s_{t+1}|s_t,a_t) R(s_{t+1},a_t,s_t)
                      = Σ_{a_t} π(a_t|s_t) Σ_{s_{t+1}} T^{a_t}_{s_t s_{t+1}} R(s_{t+1},a_t,s_t)
Bellman equation
V^π(s_t) = E[ r_1 | s_t = s, π ] + E[ V^π(s_{t+1}) | s_t = s, π ]
E[ r_1 | s_t, π ] = Σ_a π(a|s_t) Σ_{s_{t+1}} T^a_{s_t s_{t+1}} R(s_{t+1}, a, s_t)
E[ V^π(s_{t+1}) | s_t, π ] = Σ_a π(a|s_t) Σ_{s_{t+1}} T^a_{s_t s_{t+1}} V^π(s_{t+1})
⇒ V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]
Bellman Equation
All future reward from state s = E[ immediate reward + all future reward from next state s' ]
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]
Q values = state-action values
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ], where the inner sum is Q^π(s,a).
So we can define state-action values as
Q^π(s,a) = Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ] = E[ Σ_{t=1}^∞ r_t | s, a ]
and state values are average state-action values:
V^π(s) = Σ_a π(a|s) Q^π(s,a)
Bellman Equation
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]
To evaluate a policy, we need to solve this equation, i.e. find the self-consistent state values. Options for policy evaluation:
- exhaustive tree search: outwards, inwards, depth-first
- value iteration: iterative updates
- linear solution in one step
- experience sampling
Solving the Bellman Equation
Option 1: turn it into an update equation
V^{k+1}(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^k(s') ]
Option 2: linear solution (with absorbing states)
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]
⇒ v = R + T v  ⇒  v = (I − T)^{−1} R,  at cost O(|S|³)
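The one-step linear solution can be sketched in a few lines. A minimal sketch with invented numbers: a three-state chain where state 2 is absorbing, so (I − T) is invertible without discounting.

```python
import numpy as np

# Linear policy evaluation on a toy 3-state chain with an absorbing state.
# T_pi[s, s'] is the policy-averaged transition matrix, R_pi[s] the expected
# immediate reward from state s; all numbers are illustrative.
T_pi = np.array([[0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0]])   # state 2 is absorbing (row of zeros)
R_pi = np.array([0.0, 1.0, 0.0])     # reward on leaving state 1

# v = R + T v  =>  v = (I - T)^{-1} R; invertible because the chain is absorbing
v = np.linalg.solve(np.eye(3) - T_pi, R_pi)
print(v)   # state 0 inherits the reward one step ahead: [1. 1. 0.]
```

Solving the linear system directly is what makes this O(|S|³) — fine for small state spaces, prohibitive for large ones.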
Policy update
Given the value function for a policy, say via the linear solution
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]  (inner sum = Q^π(s,a)),
we can improve the policy by always choosing the best action:
π'(a|s) = 1 if a = argmax_a Q^π(s,a), 0 otherwise.
This is guaranteed to improve the policy:
Q^π(s, π'(s)) = max_a Q^π(s,a) ≥ Q^π(s, π(s)) = V^π(s)  for a deterministic policy.
Policy iteration
- Policy evaluation: v = (I − T^π)^{−1} R^π
- Greedy policy improvement: π(a|s) = 1 if a = argmax_a Σ_{s'} T^a_{ss'} [ R^a_{ss'} + V^π(s') ], 0 otherwise
Value iteration combines both in one update: V(s) ← max_a Σ_{s'} T^a_{ss'} [ R^a_{ss'} + V(s') ]
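Alternating evaluation and greedy improvement can be sketched as follows, a minimal policy-iteration loop on a made-up 2-state, 2-action MDP (the transition and reward numbers are invented for illustration).

```python
import numpy as np

# Policy iteration on a tiny discounted MDP.
# T[a, s, s'] are transitions, R[a, s] expected immediate rewards.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],     # action 0: mostly stay put
              [[0.1, 0.9], [0.9, 0.1]]])    # action 1: mostly switch state
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])                   # action 1 in state 0 pays off
gamma = 0.9

pi = np.zeros(2, dtype=int)                  # start with an arbitrary policy
for _ in range(50):
    # policy evaluation: v = (I - gamma * T_pi)^{-1} R_pi
    T_pi = T[pi, np.arange(2)]
    R_pi = R[pi, np.arange(2)]
    v = np.linalg.solve(np.eye(2) - gamma * T_pi, R_pi)
    # greedy policy improvement
    Q = R + gamma * T @ v                    # Q[a, s]
    new_pi = Q.argmax(axis=0)
    if np.array_equal(new_pi, pi):           # policy stable: done
        break
    pi = new_pi
print(pi)    # both states prefer action 1 here
```

Policy iteration typically converges in very few sweeps because each greedy improvement is guaranteed not to make the policy worse.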
Model-free solutions
So far we have assumed knowledge of R and T. R and T are the model of the world: we assume full knowledge of the dynamics and rewards in the environment. What if we don't know them? We can still learn from state-action-reward samples:
- learn R and T from the samples, and use the estimates to solve as above
- alternatively, directly estimate V or Q
Solving the Bellman Equation
Option 3: sampling
V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]
This is an expectation over policy and transition samples, so we can just draw samples from the policy and the transitions and average over them (more about this later):
ā = Σ_k f(x_k) p(x_k) ≈ (1/N) Σ_i f(x^{(i)}),  x^{(i)} ∼ p(x)
Learning from samples
(Gridworld figure.) A new problem: exploration versus exploitation.
Monte Carlo
First-visit MC: randomly start in all states, generate paths, average returns for the starting state only:
V(s) = (1/N) Σ_i Σ_{t'=1}^T r^i_{t'},  with s_0 = s
More efficient use of samples:
- every-visit MC
- bootstrap: TD
- Dyna
Better samples:
- on-policy versus off-policy
- stochastic search, UCT, …
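First-visit MC can be sketched directly from the formula above. A toy setup with invented numbers: a two-state chain 0 → 1 → terminal, with a stochastic reward on the final transition.

```python
import random

# First-visit Monte Carlo evaluation sketch (illustrative numbers).
# Episodes visit state 0 then state 1; leaving state 1 pays 1 w.p. 0.5.
random.seed(1)

def episode():
    r1 = 1 if random.random() < 0.5 else 0
    return [(0, 0), (1, r1)]            # list of (state, reward) pairs

counts = {0: 0, 1: 0}
totals = {0: 0.0, 1: 0.0}
for _ in range(20000):
    ep = episode()
    rewards = [r for _, r in ep]
    seen = set()
    for t, (s, _) in enumerate(ep):
        if s not in seen:               # first visit to s in this episode
            seen.add(s)
            totals[s] += sum(rewards[t:])   # undiscounted return from t
            counts[s] += 1

V = {s: totals[s] / counts[s] for s in counts}
print(V)    # both values approach the expected return of 0.5
```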
Update equation: towards TD
Bellman equation: V^π(s) = Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V^π(s') ]
Before convergence it doesn't hold; the residual is
dV(s) = −V(s) + Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V(s') ]
and we use this to update: V^{i+1}(s) = V^i(s) + dV(s)
TD learning
dV(s) = −V(s) + Σ_a π(a|s) Σ_{s'} T^a_{ss'} [ R(s',a,s) + V(s') ]
Replace the expectation by a sample: a_t ∼ π(a|s_t), s_{t+1} ∼ T^{a_t}_{s_t ·}, r_t = R(s_{t+1}, a_t, s_t). This gives the prediction error
δ_t = −V_t(s_t) + r_t + V_t(s_{t+1})
and the update V_{t+1}(s_t) = V_t(s_t) + ε δ_t
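The sampled update above can be sketched in a few lines. A minimal TD(0) sketch on an invented two-state chain (0 → 1 → terminal) with a stochastic reward on the last transition; the step size and trial numbers are arbitrary choices.

```python
import random

# TD(0) sketch: bootstrap each state's value from the next state's value.
random.seed(0)
alpha = 0.02                        # learning rate (epsilon on the slide)
V = [0.0, 0.0, 0.0]                 # V[2] is the terminal state, fixed at 0

for _ in range(20000):
    # transition 0 -> 1, reward 0
    delta = 0 + V[1] - V[0]         # delta_t = r_t + V(s_{t+1}) - V(s_t)
    V[0] += alpha * delta
    # transition 1 -> terminal, reward 1 with probability 0.5
    r = 1 if random.random() < 0.5 else 0
    delta = r + V[2] - V[1]
    V[1] += alpha * delta

print([round(v, 2) for v in V])     # V[0] and V[1] both settle near 0.5
```

Note that V[0] is learned from V[1], not from observed returns — that bootstrapping is what distinguishes TD from Monte Carlo.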
SARSA
Do TD for state-action values instead. From samples s_t, a_t, r_t, s_{t+1}, a_{t+1}:
Q(s_t,a_t) ← Q(s_t,a_t) + ε [ r_t + Q(s_{t+1},a_{t+1}) − Q(s_t,a_t) ]
Convergence guarantees: it will estimate Q^π(s,a) for the policy being followed.
Q learning: off-policy
Learn off-policy: draw actions from some policy; only extensive sampling is required.
Q(s_t,a_t) ← Q(s_t,a_t) + ε [ r_t + max_a Q(s_{t+1},a) − Q(s_t,a_t) ]
The update is towards the optimum, so it will estimate Q*(s,a).
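A minimal Q-learning sketch, on an invented task: two states in a row and a terminal state, where action 1 ("right") advances and action 0 ("left") stays put; reward 1 only on entering the terminal state. Behaviour is ε-greedy, but because the target uses max_a Q(s',a), the estimates converge towards the optimal values regardless.

```python
import random

# Off-policy Q-learning on a toy 2-state corridor (illustrative numbers).
random.seed(2)
alpha, gamma, eps = 0.1, 0.9, 0.2
Q = [[0.0, 0.0], [0.0, 0.0]]            # Q[s][a]; terminal state has value 0

for _ in range(3000):
    s = 0
    while s < 2:                         # state 2 is terminal
        greedy = 0 if Q[s][0] >= Q[s][1] else 1
        a = random.randrange(2) if random.random() < eps else greedy
        s_next = s + 1 if a == 1 else s
        r = 1 if s_next == 2 else 0
        # target bootstraps from the best next action, not the taken one
        target = r + (gamma * max(Q[s_next]) if s_next < 2 else 0)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

print([[round(q, 2) for q in row] for row in Q])
```

The greedy action ends up being "right" in both states, with Q values close to 1, γ·1 and γ²·1 as the Bellman optimality equation predicts.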
The effect of bootstrapping
Observed episodes: B1, B1, B1, B1, B1, B1, B0, and A0 B0.
Markov (every-visit) MC: V(B) = 3/4, V(A) = 0. TD: V(B) = 3/4, V(A) ≈ 3/4.
Averaging over various degrees of bootstrapping gives TD(λ).
After Sutton and Barto 1998
Conclusion
- Long-term rewards have internal consistency
- This can be exploited for solution
- Exploration and exploitation trade off when sampling
- Clever use of samples can produce fast learning
- The brain most likely does something like this
Fitting models to behaviour
Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich
Computational Psychiatry Course, Zurich, 1.9.2016
Example task
Four trial types: Go to Win (Go rewarded), Go to Avoid (Go punished withheld), Nogo to Win (Nogo rewarded), Nogo to Avoid (Nogo avoids loss). (Figure: probability correct per condition, and probability of a Go response over ~60 trials for each condition.) Think of it as four separate two-armed bandit tasks.
Guitart-Masip, Huys et al. 2012
Analysing behaviour
Standard approach: decide which feature of the data you care about, then run descriptive statistical tests, e.g. ANOVA. (Figure: probability correct per condition.)
Many strengths. Weaknesses:
- piecemeal, not holistic / global
- descriptive, not generative
- no internal variables
Models
- Holistic: aim to model the process by which the data came about in its entirety
- Generative: they can be run on the task to generate data as if a subject had done the task
- Inference process: capture the inference process subjects have to make to perform the task, in sufficient detail to replicate the data
- Parameters replace test statistics: their meaning is explicit in the model
Actions
Q values describe the process; a link function turns them into choice probabilities:
Q_t(a_t,s_t) = Q_{t−1}(a_t,s_t) + ε ( r_t − Q_{t−1}(a_t,s_t) )
p(a_t | s_t, h_t, θ) = p(a_t | Q(a_t,s_t), θ), with
p(a_t|s_t) ∝ e^{β Q(a_t,s_t)},  i.e.  p(a_t|s_t) = e^{β Q(a_t,s_t)} / Σ_{a'} e^{β Q(a',s_t)},  0 ≤ p(a) ≤ 1
Features: links the learning process and the observations — choices, RTs, or any other data.
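The Rescorla-Wagner update plus softmax link above can be sketched directly. The trial sequence and parameter values below are invented purely for illustration.

```python
import math

# Rescorla-Wagner learning with a softmax link.
# epsilon = learning rate, beta = inverse temperature.
def rw_softmax(actions, rewards, epsilon, beta, n_actions=2):
    """Return the per-trial probability of each chosen action."""
    Q = [0.0] * n_actions
    probs = []
    for a, r in zip(actions, rewards):
        z = [beta * q for q in Q]
        m = max(z)
        w = [math.exp(zi - m) for zi in z]      # max-subtracted, stable softmax
        probs.append(w[a] / sum(w))
        Q[a] += epsilon * (r - Q[a])            # delta-rule update
    return probs

probs = rw_softmax(actions=[0, 0, 1, 0], rewards=[1, 1, 0, 1],
                   epsilon=0.5, beta=2.0)
print([round(p, 2) for p in probs])
```

Run forward on a real trial sequence, this is the generative model; evaluated at the observed choices, it is the likelihood used for fitting on the next slides.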
Fitting models I
Maximum likelihood (ML) parameters: θ̂ = argmax_θ L(θ), where the likelihood of all choices is
L(θ) = log p( {a_t}_{t=1}^T | {s_t}_{t=1}^T, {r_t}_{t=1}^T, θ )
     = log p( {a_t}_{t=1}^T | {Q(s_t,a_t;θ)}_{t=1}^T, θ )
     = log Π_{t=1}^T p( a_t | Q(s_t,a_t;θ), θ )
     = Σ_{t=1}^T log p( a_t | Q(s_t,a_t;θ), θ )
Fitting models II
No closed form — use your favourite method: gradients, fminunc / fmincon, …
Gradients for the RW model:
dL(θ)/dθ = Σ_t (d/dθ) log p( a_t | Q_t(a_t,s_t;θ), θ )
         = Σ_t β [ (d/dθ) Q_t(a_t,s_t;θ) − Σ_{a'} p( a' | Q_t(a',s_t;θ), θ ) (d/dθ) Q_t(a',s_t;θ) ]
with, for the learning rate,
(d/dε) Q_t(a_t,s_t;ε) = (1−ε) (d/dε) Q_{t−1}(a_t,s_t;ε) + ( r_t − Q_{t−1}(a_t,s_t;ε) )
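The ML fit can be sketched end-to-end: simulate choices from a known RW/softmax agent, then recover the parameters by maximising the likelihood. To stay self-contained, a crude grid search stands in for fminunc or gradient ascent; all task numbers (two arms paying w.p. 0.8 and 0.2) are invented.

```python
import math, random

# Maximum-likelihood parameter recovery sketch for an RW/softmax model.
random.seed(0)

def neg_log_lik(actions, rewards, epsilon, beta):
    Q, L = [0.0, 0.0], 0.0
    for a, r in zip(actions, rewards):
        z = [beta * q for q in Q]
        m = max(z)
        L -= z[a] - m - math.log(sum(math.exp(zi - m) for zi in z))
        Q[a] += epsilon * (r - Q[a])
    return L

# simulate 1000 two-armed bandit trials from known parameters
true_eps, true_beta = 0.2, 3.0
Q, actions, rewards = [0.0, 0.0], [], []
for _ in range(1000):
    z = [true_beta * q for q in Q]
    m = max(z)
    p0 = math.exp(z[0] - m) / sum(math.exp(zi - m) for zi in z)
    a = 0 if random.random() < p0 else 1
    r = 1 if random.random() < (0.8 if a == 0 else 0.2) else 0
    Q[a] += true_eps * (r - Q[a])
    actions.append(a); rewards.append(r)

# crude grid search in place of a proper optimiser
grid = [(e / 20, b / 2) for e in range(1, 20) for b in range(1, 21)]
eps_hat, beta_hat = min(grid, key=lambda p: neg_log_lik(actions, rewards, *p))
print(eps_hat, beta_hat)
```

In practice one would hand neg_log_lik (and its gradient) to a quasi-Newton optimiser rather than a grid, but the likelihood surface being searched is the same.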
Little tricks
Transform your variables:
β = e^θ ↔ θ = log(β);  ε = 1 / (1 + e^{−θ}) ↔ θ = log( ε / (1−ε) )
and maximise log L(θ) over the unconstrained θ.
Avoid over/underflow in the softmax: with y(a) = β Q(a) and y_m = max_a y(a),
p = e^{y(a)} / Σ_b e^{y(b)} = e^{y(a)−y_m} / Σ_b e^{y(b)−y_m}
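The reparameterisations can be written as tiny helper functions: optimise an unconstrained θ, map it to the constrained parameter, and map back. The function names are my own, chosen for illustration.

```python
import math

# Transforms keeping beta > 0 and 0 < epsilon < 1 while the optimiser
# works on unconstrained theta.
def to_beta(theta):
    return math.exp(theta)                       # beta = e^theta > 0

def from_beta(beta):
    return math.log(beta)                        # theta = log(beta)

def to_eps(theta):
    return 1.0 / (1.0 + math.exp(-theta))        # 0 < eps < 1

def from_eps(eps):
    return math.log(eps / (1.0 - eps))           # theta = log(eps/(1-eps))

# round-trips recover the original constrained values
print(to_beta(from_beta(2.0)), to_eps(from_eps(0.05)))
```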
ML characteristics
ML is asymptotically consistent, but the variance is high. (Figure: β and ε estimates across runs on a 10-armed bandit; 200 trials, 1 stimulus, 10 actions, learning rate = .05, β = 2; likelihoods compared at β = 10 vs β = 100.)
Priors
(Figure: a "not so smooth" versus a "smooth" prior probability density over a parameter; memory as a function of switches.)
Dombrovski et al. 2010
Maximum a posteriori estimate
P(θ) = p(θ | a_{1…T}) = p(a_{1…T} | θ) p(θ) / ∫ dθ' p(a_{1…T} | θ') p(θ')
log P(θ) = Σ_{t=1}^T log p(a_t | θ) + log p(θ) + const.
d log P(θ)/dθ = d log L(θ)/dθ + d log p(θ)/dθ
If the likelihood is strong, the prior has little effect; it mainly influences poorly constrained parameters. If a parameter is strongly constrained to be outside the typical range of the prior, it will win over the prior.
Maximum a posteriori estimate
(Figure: MAP estimates across runs; 200 trials, 1 stimulus, 10 actions, learning rate = .05, β = 2; prior mean for β = 0, for ε = −3, n = 1.)
But: what prior parameters should I use?
Hierarchical estimation — random effects
- Fixed effect: conflates within- and between-subject variability
- Average behaviour: disregards between-subject variability; need to adapt the model
- Summary statistic: treat parameters as random variables, one for each subject; overestimates the group variance because ML estimates are noisy
- Random effects: the prior mean is the group mean, and
p(A_i | μ, Σ) = ∫ dθ_i p(A_i | θ_i) p(θ_i | μ, Σ)
Random effects
See subjects as drawn from a group. Fixed models, all the same (fixed effect with respect to the model), can be parametrically nested:
Q(a,s) = ω₁ Q₁(a,s) + ω₂ Q₂(a,s)
This assumes a within-subject mixture, rather than a group mixture of perfect types. The alternative is random effects over models.
Estimating the hyperparameters
Effectively we now want to do gradient ascent on (d/dθ_h) p(A | θ_h). But this contains an integral over the individual parameters:
p(A | θ_h) = ∫ dθ p(A | θ) p(θ | θ_h)
So we need
θ̂_h = argmax_{θ_h} p(A | θ_h) = argmax_{θ_h} ∫ dθ p(A | θ) p(θ | θ_h)
Inference
θ̂_h = argmax_{θ_h} ∫ dθ p(A | θ) p(θ | θ_h). Options:
- analytical: rare
- brute force: for simple problems
- Expectation Maximisation: approximate, easy
- Variational Bayes
- Sampling / MCMC
Expectation Maximisation
log p(A | θ_h) = log ∫ dθ p(A, θ | θ_h) = log ∫ dθ q(θ) [ p(A, θ | θ_h) / q(θ) ]
             ≥ ∫ dθ q(θ) log [ p(A, θ | θ_h) / q(θ) ]   (Jensen's inequality)
k-th E step: q^{(k+1)}(θ) ← p(θ | A, θ_h^{(k)})
k-th M step: θ_h^{(k+1)} ← argmax_{θ_h} ∫ dθ q(θ) log p(A, θ | θ_h)
Iterate between estimating MAP parameters given the prior parameters, and estimating the prior parameters from the MAP parameters.
Bayesian Information Criterion
Laplace's approximation (saddle-point method): approximate the integrand by a Gaussian around its peak x₀,
∫ dx f(x) ≈ f(x₀) √(2π σ²)
EM with Laplace approximation
E step: approximate q^{(k+1)}(θ) ← p(θ | A, θ_h^{(k)}) by a Gaussian, q^k(θ) = N(m_k, S_k), so only sufficient statistics are needed for the M step:
m_k ← argmax_θ p(A | θ) p(θ | θ_h^{(i)})
S_k^{−1} ← −(∂²/∂θ²) log [ p(A | θ) p(θ | θ_h^{(i)}) ] evaluated at θ = m_k
matlab: [m,l,,,s]=fminunc( )
Just what we had before: MAP inference given some prior parameters.
EM with Laplace approximation
M step updates:
- Prior mean = mean of the MAP estimates: μ^{(i+1)} ← (1/K) Σ_k m_k
- Prior variance: σ²^{(i+1)} ← (1/K) Σ_k [ (m_k)² + S_k ] − (μ^{(i+1)})²
The prior variance depends on the inverse Hessians S_k and on the variance of the MAP estimates, taking the uncertainty of the estimates into account. Iterate until convergence.
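The M-step updates above are a one-liner each. A minimal sketch assuming the E step has already produced per-subject MAP estimates m_k and Laplace variances S_k; the three subjects' numbers are invented.

```python
# M-step of EM with Laplace approximation, for one scalar parameter.
m = [0.2, 0.4, 0.6]        # per-subject MAP estimates (E-step output)
S = [0.01, 0.02, 0.01]     # per-subject posterior variances (inverse Hessians)

K = len(m)
mu = sum(m) / K                                              # prior mean
var = sum(mk**2 + Sk for mk, Sk in zip(m, S)) / K - mu**2    # prior variance
print(mu, var)
```

Note the S_k term: subjects whose parameters are poorly constrained inflate the group variance, which is exactly the "take uncertainty into account" point on this slide.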
Parameter recovery
(Figure: correlations between true and inferred parameters — β, α', ω, β₁ — as a function of trial number T and group size Nsj (10, 20, 50, 100), for the Rescorla-Wagner model and the 2-step model; simulations with truly uncorrelated and truly correlated parameters, no GLM; estimators EM-MAP, MAP0 and ML.)
Correlations
(Figure: correlations between parameter estimates — β' and α' for RW, β₁' and ω' for the 2-step model — under ML and MAP, and the MAP prior covariance, as a function of trial number T.)
Are parameters ok for correlations?
Individual subject parameter estimates are NO LONGER INDEPENDENT! Change the group → change the parameter estimates.
- comparing parameters estimated under different priors is problematic
- correlations and t-tests within the same prior are ok
(Figure: p-value distributions for RW and 2-step models under ML, EM-MAP and MAP0; Nsj = 10, 30, 100.)
GLM
So far: infer individual parameters, then apply standard tests. Alternative: view the effect as variation across the group — specific, and potentially more powerful. Schematically, put the covariate into the prior mean,
μ_i = μ_Group + β x_i
and infer the group-level regression weight β directly in the hierarchical model.
GLM
(Figure: GLM estimates of a group-level regressor as a function of trial number T and group size Nsj = 10, 20, 50, 100.)
Fitting - how to
- Write your likelihood function (matlab examples attached with emfit.m); don't do 20 ML fits!
- Pass it into emfit.m, or the julia version: www.quentinhuys.com/pub/emfit_151110.zip
- Validate: generate data with the fitted parameters; compare, have a look — does it look right? Re-fit: is it stable?
- Model comparison
- Only now: look at parameters, do correlations etc.
Future: GLM — full random effects over models and parameters jointly? (Daniel Schad)
(Figure: p-value distributions for reward sensitivity ρ and learning rate ε.)
Hierarchical / random effects models
Advantages:
- accurate group-level mean and variance
- outliers due to a weak likelihood are regularised; strong outliers are not
- useful for model selection
Disadvantages:
- individual estimates depend on the other subjects' data A_{j≠i}, so be careful interpreting them as summary statistics
- more involved; less transparent
Psychiatry: groups are often not well defined; covariates are better. fMRI: shrink the variance of ML estimates — fixed effects better still?
How does it do?
(Figure: log-likelihood comparison across models: βαQ, βα2Q, βα4Q, 2βα2, GoOnly variants, Sep:βαQ, Sep:βα Q.)
Overfitting
(Figure: an overly flexible curve chasing noise in X-Y data.)
Model comparison
A fit by itself is not meaningful.
- Generative test: qualitative
- Comparisons: vs random, vs other models → test specific hypotheses and isolate particular effects in a generative setting
Model comparison
Averaged over its parameter settings, how well does the model fit the data?
p(A | M) = ∫ dθ p(A | θ) p(θ | M)
Model comparison via Bayes factors: BF = p(A | M₁) / p(A | M₂)
Problem: the integral is rarely solvable → approximations: Laplace, sampling, variational, …
Why integrals? The God Almighty test
(Figure: a powerful model spreads its probability mass thinly over many possible datasets; a weak model concentrates it on few.)
(1/N) ( p(X₁) + p(X₂) + … )
These two factors fight it out: model complexity vs model fit.
Group-level BIC
log p(A | M) = log ∫ dθ_h p(A | θ_h) p(θ_h | M) ≈ −(1/2) BIC_int = log p̂(A | θ̂_h^{ML}) − (1/2) |M| log(|A|)
Very simple:
1) EM to estimate the group prior mean & variance (simply done using fminunc, which provides Hessians)
2) Sample from the estimated priors
3) Average
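Steps 2 and 3 above can be sketched for one subject. A toy stand-in likelihood (a biased coin) replaces the RL model purely for illustration; the fitted prior (μ, σ) is assumed given by EM, and |M| counts the prior parameters.

```python
import math, random

# BIC_int sketch: Monte Carlo estimate of log ∫ p(data|theta) p(theta) dtheta
# by sampling theta from the fitted group prior and averaging the likelihood.
random.seed(0)

def log_lik(theta, data):
    p = min(max(theta, 1e-6), 1 - 1e-6)      # clip to keep logs finite
    return sum(math.log(p) if x else math.log(1 - p) for x in data)

data = [1, 1, 0, 1, 1, 1, 0, 1]              # one subject's observations
mu, sigma = 0.7, 0.1                          # fitted prior (from EM, assumed)

samples = [random.gauss(mu, sigma) for _ in range(5000)]
lls = [log_lik(th, data) for th in samples]
m = max(lls)                                  # log-sum-exp for stability
log_ml = m + math.log(sum(math.exp(l - m) for l in lls) / len(lls))

n_prior_params = 2                            # mu and sigma: |M| = 2
bic_int = -2 * log_ml + n_prior_params * math.log(len(data))
print(round(bic_int, 1))
```

The same loop runs per subject with the real choice likelihood, summing log marginal likelihoods across subjects before adding the |M| log(|A|) penalty.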
How does it do?
(Figure: model comparison by Σ_i BIC_i, LL and BIC_int across the same set of models — βαQ through Sep:βα Q, with parameter counts 16, 12, 6, 5, 4, 7, 4, 3, … Fitted by EM — too nice?)
Group model selection: integrate out your parameters.
Model comparison: overfitting?
(Figure: BIC_int for a sequence of models on the Go/Nogo task — RW; RW + noise; RW + noise + Q0; RW + noise + bias; RW (rew/pun) + noise + bias; RW + noise + bias + Pav — together with per-condition Go-probability curves. Note: same number of parameters.)
Behavioural data modelling
Models are no panacea:
- statistics about specific aspects of the decision machinery
- only account for part of the variance
The model needs to match the experiment: ensure subjects actually do the task the way you wrote it in the model; use model comparison.
Model = quantitative hypothesis:
- strong test
- need to compare models, not parameters
- includes all consequences of a hypothesis for choice
Thanks
Peter Dayan, Daniel Schad, Nathaniel Daw
SNSF, DFG