Reinforcement learning crash course

Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich, 1.9.2016

Overview: a rough overview of RL, mainly following Sutton & Barto 1998; dopamine prediction errors and more; fitting behaviour with RL models; hierarchical approaches.

Setup: the agent emits actions $a_t$ to the environment and receives states $s_t$ and rewards $r_t$; it seeks $\{a_t\} = \arg\max_{\{a_t\}} \sum_{t=1}^{\infty} r_t$. After Sutton and Barto 1998.

State space: an example grid world in which gold is worth +1 and electric shocks are worth -1.

A Markov Decision Problem: states $s_t \in \mathcal{S}$, actions $a_t \in \mathcal{A}$, transition probabilities $T^a_{ss'} = p(s_{t+1}|s_t,a_t)$, rewards $r_t \sim R(s_{t+1},a_t,s_t)$, and a policy $\pi(a|s) = p(a|s)$.


Actions: on a chain of states 1-7 plus an absorbing state ("abs"), the two actions have transition matrices $T^{\text{left}}$ and $T^{\text{right}}$. [Slide shows the full matrices: $T^{\text{left}}$ is noisy, with entries 0.8 and 0.2, while $T^{\text{right}}$ is deterministic.] Noisy: plants, environments, agents. Absorbing state -> max eigenvalue of $T$ is < 1.

Markovian dynamics: $p(s_{t+1} \mid a_t,s_t,a_{t-1},s_{t-1},a_{t-2},s_{t-2},\dots) = p(s_{t+1} \mid a_t,s_t)$. If position alone is not Markovian, augment the state: $s = [\text{position}]$ becomes $s = [\text{position}, \text{velocity}]$.


A Markov Decision Problem (continued): the same definitions, now with rewards of $-1$, $0$ and $+1$ marked on the example state space.

Tall orders. Aim: maximise total future reward $\sum_{t=1}^{\infty} r_t$, i.e. we have to sum over paths through the future and weigh each by its probability. The best policy achieves the best long-term reward.

Exhaustive tree search: a tree of branching factor $w$ and depth $d$.

Decision tree for $\sum_{t=1}^{\infty} r_t$: the number of paths grows exponentially, 8, 64, 512, ...

Policy for this talk: pose the problem mathematically; policy evaluation; policy iteration; Monte Carlo techniques: experience samples; TD learning. [Diagram: a loop of policy -> evaluate -> update.]

Evaluating a policy. Aim: maximise total future reward $\sum_{t=1}^{\infty} r_t$. To know which policy is best, evaluate it first: the policy determines the expected reward from each state, $V^\pi(s_1) = E\left[\sum_{t=1}^{\infty} r_t \,\middle|\, s_1 = 1, a_t \sim \pi\right]$.

Discounting. Given a policy, each state has an expected value $V^\pi(s_1) = E\left[\sum_{t=1}^{\infty} r_t \,\middle|\, s_1 = 1, a_t \sim \pi\right]$. But the infinite sum may diverge. Episodic: $\sum_{t=0}^{T} r_t < \infty$. Discounted infinite horizons: $\sum_{t=0}^{\infty} \gamma^t r_t < \infty$, equivalent to finite, exponentially distributed horizons: $\sum_{t=0}^{T} r_t$ with $T \sim \frac{1}{\tau} e^{-t/\tau}$.
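A one-line check (not on the slide) of why discounting corresponds to an exponentially distributed horizon: if the episode continues past each step with probability $\gamma$, so that $P(T \ge t) = \gamma^t$, then

$$E_T\!\left[\sum_{t=0}^{T} r_t\right] \;=\; \sum_{t=0}^{\infty} r_t\, P(T \ge t) \;=\; \sum_{t=0}^{\infty} \gamma^t r_t .$$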

Markov Decision Problems. $V^\pi(s_t) = E\left[\sum_{t'=1}^{\infty} r_{t'} \,\middle|\, s_t = s, \pi\right] = E[r_1 \mid s_t = s, \pi] + E\left[\sum_{t=2}^{\infty} r_t \,\middle|\, s_t = s, \pi\right] = E[r_1 \mid s_t = s, \pi] + E[V^\pi(s_{t+1}) \mid s_t = s, \pi]$. This dynamic consistency is key to many solution approaches: it states that the value of a state $s$ is related to the values of its successor states $s'$.

Markov Decision Problems. $V^\pi(s_t) = E[r_1 \mid s_t = s, \pi] + E[V^\pi(s_{t+1}) \mid s_t = s, \pi]$, with $r_1 \sim R(s_2,a_1,s_1)$. The expected immediate reward is $E[r_1 \mid s_t = s, \pi] = E_\pi\left[\sum_{s_{t+1}} p(s_{t+1}|s_t,a_t)\, R(s_{t+1},a_t,s_t)\right] = \sum_{a_t} p(a_t|s_t) \sum_{s_{t+1}} p(s_{t+1}|s_t,a_t)\, R(s_{t+1},a_t,s_t) = \sum_{a_t} \pi(a_t|s_t) \sum_{s_{t+1}} T^{a_t}_{s_t s_{t+1}} R(s_{t+1},a_t,s_t)$.

Bellman equation. $V^\pi(s_t) = E[r_1 \mid s_t = s, \pi] + E[V^\pi(s_{t+1}) \mid s_t = s, \pi]$, with $E[r_1 \mid s_t, \pi] = \sum_a \pi(a|s_t) \sum_{s_{t+1}} T^a_{s_t s_{t+1}} R(s_{t+1},a,s_t)$ and $E[V^\pi(s_{t+1}) \mid \pi, s_t] = \sum_a \pi(a|s_t) \sum_{s_{t+1}} T^a_{s_t s_{t+1}} V^\pi(s_{t+1})$, so $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]$.

Bellman Equation. All future reward from state $s$ = E[ immediate reward + all future reward from the next state $s'$ ]: $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]$.

Q values = state-action values. $V^\pi(s) = \sum_a \pi(a|s) \underbrace{\sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]}_{Q^\pi(s,a)}$, so we can define state-action values as $Q^\pi(s,a) = \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right] = E\left[\sum_{t=1}^{\infty} r_t \,\middle|\, s, a\right]$, and state values are average state-action values: $V^\pi(s) = \sum_a \pi(a|s)\, Q^\pi(s,a)$.

Bellman Equation. $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]$. To evaluate a policy we need to solve this equation, i.e. find the self-consistent state values. Options for policy evaluation: exhaustive tree search (outwards, inwards, depth-first); value iteration: iterative updates; linear solution in one step; experience sampling.

Solving the Bellman Equation. Option 1: turn it into an update equation, $V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V_k(s')\right]$. Option 2: linear solution (with absorbing states): $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]$, i.e. $\mathbf{v} = \mathbf{R}^\pi + T^\pi \mathbf{v}$, so $\mathbf{v} = (I - T^\pi)^{-1}\mathbf{R}^\pi$, at cost $O(|\mathcal{S}|^3)$.
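A minimal numpy sketch of both options (not from the slides; the array layout is my assumption): the MDP is stored as T[a, s, s2] = p(s2|s, a) and R[a, s, s2], and the policy as pi[s, a].

```python
import numpy as np

def evaluate_policy_linear(T, R, pi):
    """Option 2: v = (I - T_pi)^{-1} R_pi (needs absorbing states, max eigenvalue of T_pi below 1)."""
    n_actions, n_states, _ = T.shape
    T_pi = np.einsum('sa,ast->st', pi, T)          # policy-averaged transition matrix
    R_pi = np.einsum('sa,ast,ast->s', pi, T, R)    # expected immediate reward per state
    return np.linalg.solve(np.eye(n_states) - T_pi, R_pi)   # O(|S|^3)

def evaluate_policy_iterative(T, R, pi, n_sweeps=200):
    """Option 1: repeatedly apply the Bellman equation as an update."""
    n_actions, n_states, _ = T.shape
    V = np.zeros(n_states)
    for _ in range(n_sweeps):
        Q = np.einsum('ast,ast->sa', T, R) + np.einsum('ast,t->sa', T, V)
        V = np.sum(pi * Q, axis=1)                 # average the state-action values over the policy
    return V
```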

Policy update. Given the value function for a policy, say via the linear solution, $V^\pi(s) = \sum_a \pi(a|s) \underbrace{\sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]}_{Q^\pi(s,a)}$, we can improve the policy by always choosing the best action: $\pi'(a|s) = 1$ if $a = \arg\max_{a'} Q^\pi(s,a')$, and $0$ else. This is guaranteed to improve, since for a deterministic policy $Q^\pi(s,\pi'(s)) = \max_a Q^\pi(s,a) \ge Q^\pi(s,\pi(s)) = V^\pi(s)$.

Policy iteration: alternate policy evaluation, $\mathbf{v}^\pi = (I - T^\pi)^{-1}\mathbf{R}^\pi$, with greedy policy improvement, $\pi'(a|s) = 1$ if $a = \arg\max_{a'} \sum_{s'} T^{a'}_{ss'}\left[R^{a'}_{ss'} + V^\pi(s')\right]$ and $0$ else. Value iteration: $V(s) \leftarrow \max_a \sum_{s'} T^a_{ss'}\left[R^a_{ss'} + V(s')\right]$.
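Continuing the same sketch, policy iteration alternates the linear evaluation above with greedy improvement (value iteration would instead update V directly with the max):

```python
import numpy as np

def policy_iteration(T, R, n_iter=100):
    """Greedy policy iteration for the tabular MDP defined above (T[a,s,s2], R[a,s,s2])."""
    n_actions, n_states, _ = T.shape
    policy = np.zeros(n_states, dtype=int)            # arbitrary initial deterministic policy
    V = np.zeros(n_states)
    for _ in range(n_iter):
        pi = np.eye(n_actions)[policy]                # one-hot encoding, shape (n_states, n_actions)
        V = evaluate_policy_linear(T, R, pi)          # policy evaluation
        Q = np.einsum('ast,ast->sa', T, R) + np.einsum('ast,t->sa', T, V)
        new_policy = Q.argmax(axis=1)                 # greedy policy improvement
        if np.array_equal(new_policy, policy):        # stable policy: done
            break
        policy = new_policy
    return policy, V
```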

Model-free solutions. So far we have assumed knowledge of $R$ and $T$. $R$ and $T$ are the model of the world, so we have assumed full knowledge of the dynamics and rewards in the environment. What if we don't know them? We can still learn from state-action-reward samples: we can learn $R$ and $T$ from them and use the estimates to solve as above, or alternatively we can directly estimate $V$ or $Q$.

Solving the Bellman Equation. Option 3: sampling. $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]$ is an expectation over policy and transition samples, so we can just draw some samples from the policy and the transitions and average over them (more about this later): $a = \sum_k f(x_k)\, p(x_k)$, and with $x^{(i)} \sim p(x)$, $\hat a = \frac{1}{N}\sum_i f(x^{(i)})$.

Learning from samples. [Figure: a 10x10 grid world.] A new problem: exploration versus exploitation.

Monte Carlo. First-visit MC: randomly start in all states, generate paths, and average for the starting state only: $V(s) = \frac{1}{N}\sum_i \left\{\sum_{t'=1}^{T} r^i_{t'} \,\middle|\, s_0 = s\right\}$. More efficient use of samples: every-visit MC; bootstrap: TD; Dyna. Better samples: on-policy versus off-policy; stochastic search, UCT... [Figure: 10x10 grid world.]
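A sketch of first-visit Monte Carlo evaluation from recorded trajectories; the data format (lists of (state, reward) pairs) is my assumption, not from the slides.

```python
from collections import defaultdict

def first_visit_mc(episodes):
    """First-visit MC: V(s) = average undiscounted return following the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        # walk backwards so G accumulates the return from each step onwards;
        # repeatedly overwriting leaves the return from the *first* visit
        for state, reward in reversed(episode):
            G += reward
            first_visit_return[state] = G
        for state, G_s in first_visit_return.items():
            returns[state].append(G_s)
    return {s: sum(v) / len(v) for s, v in returns.items()}

# eight toy episodes over two states A and B:
episodes = [[('A', 0), ('B', 0)]] + [[('B', 1)]] * 6 + [[('B', 0)]]
print(first_visit_mc(episodes))   # {'B': 0.75, 'A': 0.0}
```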

Update equation: towards TD. Bellman equation: $V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V^\pi(s')\right]$. If $V$ has not yet converged, the equation does not hold; the mismatch is $dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V(s')\right]$, and we can use it to update: $V_{i+1}(s) = V_i(s) + dV(s)$.

TD learning. $dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'}\left[R(s',a,s) + V(s')\right]$. Sample instead of computing the expectation: $a_t \sim \pi(a|s_t)$, $s_{t+1} \sim T^{a_t}_{s_t,s_{t+1}}$, $r_t = R(s_{t+1},a_t,s_t)$, giving the prediction error $\delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1})$. Instead of $V_{i+1}(s) = V_i(s) + dV(s)$, update $V_t(s_t) = V_{t-1}(s_t) + \alpha\,\delta_t$.

TD learning. Sample $a_t \sim \pi(a|s_t)$, $s_{t+1} \sim T^{a_t}_{s_t,s_{t+1}}$, $r_t = R(s_{t+1},a_t,s_t)$; compute $\delta_t = r_t + V_t(s_{t+1}) - V_t(s_t)$; update $V_{t+1}(s_t) = V_t(s_t) + \alpha\,\delta_t$.
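The sampled TD(0) update in a few lines of Python (a sketch; the dict-based value table and the learning-rate name alpha are my choices):

```python
def td0_update(V, s, r, s_next, alpha=0.1):
    """delta_t = r_t + V(s_{t+1}) - V(s_t);  V(s_t) <- V(s_t) + alpha * delta_t."""
    delta = r + V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta

# usage on one sampled transition (s_t, r_t, s_{t+1}):
V = {}
td0_update(V, 'A', 0.0, 'B')
```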

SARSA. Do TD for state-action values instead: $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)\right]$, using the sample $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$. Convergence guarantees; it will estimate $Q^\pi(s,a)$.

Q learning: off-policy. Learn off-policy: actions can be drawn from some other policy; we only require extensive sampling. $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_t + \max_a Q(s_{t+1},a) - Q(s_t,a_t)\right]$: the update is towards the optimum, so it will estimate $Q^*(s,a)$.
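The two tabular updates side by side, as a sketch (Q is a dict keyed by (state, action); alpha is the learning rate):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1):
    """On-policy: bootstrap from the action a_{t+1} actually taken -> estimates Q^pi."""
    target = r + Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1):
    """Off-policy: bootstrap from the best next action -> estimates Q*."""
    target = r + max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```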

The effect of bootstrapping. Eight observed episodes: one episode A,0,B,0; six episodes B,1; one episode B,0. Markov / Monte Carlo (every visit): V(B) = 3/4, V(A) = 0. TD: V(B) = 3/4, V(A) ≈ 3/4. Averaging over various degrees of bootstrapping gives TD($\lambda$). After Sutton and Barto 1998.

Conclusion: long-term rewards have internal consistency; this can be exploited for solution; exploration and exploitation trade off when sampling; clever use of samples can produce fast learning; the brain most likely does something like this.

Fitting models to behaviour. Quentin Huys, Translational Neuromodeling Unit, University of Zurich and ETH Zurich; University Hospital of Psychiatry Zurich. Computational Psychiatry Course, Zurich, 1.9.2016

Example task: a Go/NoGo learning task with four conditions (Go to win, Go to avoid, Nogo to win, Nogo to avoid). [Plots: probability correct per condition, and Probability(Go) over trials for each condition.] Think of it as four separate two-armed bandit tasks. Guitart-Masip, Huys et al. 2012.

Analysing behaviour. Standard approach: decide which feature of the data you care about; run descriptive statistical tests, e.g. ANOVA. [Plot: probability correct for Go to win, Go to avoid, Nogo to win, Nogo to avoid.] Many strengths. Weaknesses: piecemeal, not holistic/global; descriptive, not generative; no internal variables.

Models. Holistic: aim to model the process by which the data came about in its entirety. Generative: they can be run on the task to generate data as if a subject had done the task. Inference process: capture the inference process subjects have to make to perform the task, in sufficient detail to replicate the data. Parameters replace test statistics: their meaning is explicit in the model.

Actions. Q values describe the learning process; probabilities come from a link function. Learning process (Rescorla-Wagner): $Q_t(a_t,s_t) = Q_{t-1}(a_t,s_t) + \epsilon\,(r_t - Q_{t-1}(a_t,s_t))$. Link function (softmax): $p(a_t \mid s_t, h_t, \theta) = p(a_t \mid Q_t(a_t,s_t), \theta) \propto e^{\beta Q(a_t,s_t)}$, i.e. $p(a_t \mid s_t) = \frac{e^{\beta Q(a_t,s_t)}}{\sum_{a'} e^{\beta Q(a',s_t)}}$, with $0 \le p(a) \le 1$. The link function connects the learning process and the observations: choices, RTs, or any other data.
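A sketch of this learning-plus-link-function model for a single stimulus (one bandit); function and variable names are mine, not from the slides.

```python
import numpy as np

def rw_softmax_loglik(params, actions, rewards, n_actions=2):
    """Log-likelihood of a choice sequence under Rescorla-Wagner learning with a softmax link.

    params = (beta, eps): inverse temperature and learning rate.
    actions[t] is the chosen action index, rewards[t] the reward on trial t.
    """
    beta, eps = params
    Q = np.zeros(n_actions)
    loglik = 0.0
    for a_t, r_t in zip(actions, rewards):
        y = beta * Q
        p = np.exp(y - y.max())            # subtract the max: see the 'little tricks' slide below
        p /= p.sum()                       # softmax choice probabilities
        loglik += np.log(p[a_t])
        Q[a_t] += eps * (r_t - Q[a_t])     # Q_t = Q_{t-1} + eps * (r_t - Q_{t-1})
    return loglik
```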

Fitting models I. Maximum likelihood (ML) parameters: $\hat\theta = \arg\max_\theta \mathcal{L}(\theta)$, where the likelihood of all choices is $\mathcal{L}(\theta) = \log p(\{a_t\}_{t=1}^T \mid \{s_t\}_{t=1}^T, \{r_t\}_{t=1}^T, \theta) = \log p(\{a_t\}_{t=1}^T \mid \{Q_t(s_t,a_t;\theta)\}_{t=1}^T, \theta) = \log \prod_{t=1}^T p(a_t \mid Q_t(s_t,a_t;\theta), \theta) = \sum_{t=1}^T \log p(a_t \mid Q_t(s_t,a_t;\theta), \theta)$.
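A sketch of ML fitting using scipy.optimize.minimize (standing in for matlab's fminunc mentioned on the next slide); it optimises in the transformed space introduced on the 'little tricks' slide below.

```python
import numpy as np
from scipy.optimize import minimize

def to_natural(theta):
    """Map unconstrained parameters to (beta, eps): beta = exp(.), eps = sigmoid(.)."""
    return np.exp(theta[0]), 1.0 / (1.0 + np.exp(-theta[1]))

def fit_ml(actions, rewards):
    """Maximum-likelihood estimate of (beta, eps) for one subject."""
    def neg_loglik(theta):
        return -rw_softmax_loglik(to_natural(theta), actions, rewards)
    res = minimize(neg_loglik, x0=np.zeros(2), method='BFGS')
    return to_natural(res.x), res
```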

Fitting models II. There is no closed form, so use your favourite method: gradients, fminunc / fmincon, ... Gradients for the RW model: $\frac{d\mathcal{L}(\theta)}{d\theta} = \sum_t \frac{d}{d\theta}\log p(a_t \mid Q_t(a_t,s_t;\theta), \theta) = \sum_t \left[\frac{d}{d\theta} Q_t(a_t,s_t;\theta) - \sum_{a'} p(a' \mid Q_t(a',s_t;\theta), \theta)\,\frac{d}{d\theta} Q_t(a',s_t;\theta)\right]$ (for the softmax link, up to the factor $\beta$), with the recursion $\frac{dQ_t(a_t,s_t;\epsilon)}{d\epsilon} = (1-\epsilon)\frac{dQ_{t-1}(a_t,s_t;\epsilon)}{d\epsilon} + \left(r_t - Q_{t-1}(a_t,s_t;\epsilon)\right)$.

Little tricks. Transform your variables so gradient ascent is unconstrained: $\beta = e^{\theta_\beta} \Leftrightarrow \theta_\beta = \log(\beta)$ and $\epsilon = \frac{1}{1+e^{-\theta_\epsilon}} \Leftrightarrow \theta_\epsilon = \log\frac{\epsilon}{1-\epsilon}$, then ascend $\frac{d\log L(\theta)}{d\theta}$. Avoid over/underflow: with $y(a) = \beta Q(a)$ and $y_m = \max_a y(a)$, $p(a) = \frac{e^{y(a)}}{\sum_b e^{y(b)}} = \frac{e^{y(a)-y_m}}{\sum_b e^{y(b)-y_m}}$.

ML characteristics. ML is asymptotically consistent, but the variance of the estimates is high. [Plot: estimates of $\beta$ and $\epsilon$ across 2000 runs of a 10-armed bandit; 200 trials, 1 stimulus, 10 actions, learning rate $\epsilon = .05$, $\beta = 2$; extreme runs compare $L(\beta=10)$ with $L(\beta=100)$.]

Priors. [Plots: a "not so smooth" versus a "smooth" prior probability distribution over a parameter; memory as a function of switches, Dombrovski et al. 2010.]

Maximum a posteriori estimate. $P(\theta) = p(\theta \mid a_{1...T}) = \frac{p(a_{1...T} \mid \theta)\, p(\theta)}{\int d\theta'\, p(a_{1...T} \mid \theta')\, p(\theta')}$, so $\log P(\theta) = \sum_{t=1}^{T} \log p(a_t \mid \theta) + \log p(\theta) + \text{const.}$ and $\frac{d\log P(\theta)}{d\theta} = \frac{d\log L(\theta)}{d\theta} + \frac{d\log p(\theta)}{d\theta}$. If the likelihood is strong, the prior will have little effect: it mainly has influence on poorly constrained parameters, and if a parameter is strongly constrained to be outside the typical range of the prior, it will win over the prior.
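The corresponding MAP fit, again as a sketch: a Gaussian prior with mean mu and variance sigma2 on the transformed parameters is added to the log-likelihood. The prior parameters are passed in; choosing them is the topic of the next slides.

```python
import numpy as np
from scipy.optimize import minimize

def fit_map(actions, rewards, mu, sigma2):
    """MAP estimate: maximise log-likelihood + log N(theta; mu, sigma2) in transformed space."""
    mu, sigma2 = np.asarray(mu, float), np.asarray(sigma2, float)
    def neg_logpost(theta):
        nll = -rw_softmax_loglik(to_natural(theta), actions, rewards)
        return nll + 0.5 * np.sum((theta - mu) ** 2 / sigma2)   # -log prior, up to a constant
    return minimize(neg_logpost, x0=mu.copy(), method='BFGS')   # res.x = MAP, res.hess_inv ~ posterior covariance
```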

Maximum a posteriori estimate. [Plot: MAP estimates across runs; 200 trials, 1 stimulus, 10 actions, learning rate $\epsilon = .05$, $\beta = 2$; prior parameters mbeta = 0, meps = -3, n = 1.]

But: what prior parameters should I use?

Hierarchical estimation - random effects. Fixed effect: conflates within- and between-subject variability. Average behaviour: disregards between-subject variability; need to adapt the model. Summary statistic: treat parameters as random variables, one for each subject; overestimates the group variance because ML estimates are noisy. Random effects: prior mean = group mean, $p(A_i \mid \mu, \sigma) = \int d\theta_i\, p(A_i \mid \theta_i)\, p(\theta_i \mid \mu, \sigma)$. [Graphical models for each of these options.]

Random effects: see subjects as drawn from a group. Fixed: all subjects share the same model (a fixed effect with respect to the model); parametrically nested, $Q(a,s) = \omega_1 Q_1(a,s) + \omega_2 Q_2(a,s)$, which assumes a within-subject mixture rather than a group mixture of perfect types. Random effects in models: [graphical model with group-level parameters $\mu, \sigma$ over subject-level parameters $\theta_i$ and model indicators].

Estimating the hyperparameters. Effectively we now want to do gradient ascent on $\frac{d}{d\phi} p(A \mid \phi)$, where $\phi = (\mu, \sigma)$ are the group-level hyperparameters, but this contains an integral over the individual parameters: $p(A \mid \phi) = \int d\theta\, p(A \mid \theta)\, p(\theta \mid \phi)$. So we need $\hat\phi = \arg\max_\phi p(A \mid \phi) = \arg\max_\phi \int d\theta\, p(A \mid \theta)\, p(\theta \mid \phi)$.

Inference. $\hat\phi = \arg\max_\phi p(A \mid \phi) = \arg\max_\phi \int d\theta\, p(A \mid \theta)\, p(\theta \mid \phi)$. Options: analytical (rare); brute force (for simple problems); Expectation Maximisation (approximate, easy); Variational Bayes; sampling / MCMC.

Expectation Maximisation. $\log p(A \mid \phi) = \log \int d\theta\, p(A, \theta \mid \phi) = \log \int d\theta\, q(\theta)\,\frac{p(A,\theta \mid \phi)}{q(\theta)} \ge \int d\theta\, q(\theta) \log \frac{p(A,\theta \mid \phi)}{q(\theta)}$ by Jensen's inequality. $k$-th E step: $q^{(k+1)}(\theta) \leftarrow p(\theta \mid A, \phi^{(k)})$. $k$-th M step: $\phi^{(k+1)} \leftarrow \arg\max_\phi \int d\theta\, q(\theta)\, \log p(A, \theta \mid \phi)$. Iterate between estimating MAP parameters given the prior parameters, and estimating prior parameters from the MAP parameters.

Bayesian Information Criterion. Laplace's approximation (saddle-point method): treat the integrand as just a Gaussian around its mode $x_0$, so $\int dx\, f(x) \approx f(x_0)\sqrt{2\pi\sigma^2}$, with $\sigma^2$ set by the curvature of $\log f$ at $x_0$. [Plots: $y$ and $\log(y)$ against $x$.]

EM with Laplace approximation. E step: we only need sufficient statistics to perform the M step, so approximate each subject's posterior by a Gaussian, $p(\theta_k \mid A_k, \phi^{(i)}) \approx q_k(\theta) = \mathcal{N}(m_k, S_k)$, with $m_k \leftarrow \arg\max_\theta\, p(A_k \mid \theta)\, p(\theta \mid \phi^{(i)})$ and $S_k^{-1} \leftarrow -\frac{\partial^2}{\partial\theta^2} \log\left[p(A_k \mid \theta)\, p(\theta \mid \phi^{(i)})\right]\Big|_{\theta = m_k}$. matlab: [m,l,,,s] = fminunc( ). This is just what we had before: MAP inference given some prior parameters.

EM with Laplace approximation: updates. M step: $\mu^{(i+1)} = \frac{1}{K}\sum_k m_k$ and $(\sigma^2)^{(i+1)} = \frac{1}{K}\sum_k \left[m_k^2 + S_k\right] - \left(\mu^{(i+1)}\right)^2$. Prior mean = mean of the MAP estimates; the prior variance depends on the inverse Hessians $S_k$ and on the variance of the MAP estimates, so the uncertainty of the estimates is taken into account. And now iterate until convergence.
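A sketch of the whole EM loop under the assumptions above (independent Gaussian priors on the transformed parameters; fit_map and rw_softmax_loglik from the earlier sketches; the real implementation is emfit.m and its julia version).

```python
import numpy as np

def em_map_laplace(all_actions, all_rewards, n_params=2, n_iter=20):
    """Hierarchical (random-effects) fit: iterate subject-level MAP fits (E step,
    Laplace-approximated) and group-level prior updates (M step)."""
    K = len(all_actions)
    mu, sigma2 = np.zeros(n_params), 10.0 * np.ones(n_params)   # broad initial prior
    for _ in range(n_iter):
        m = np.zeros((K, n_params))
        S = np.zeros((K, n_params))
        for k in range(K):                              # E step, one subject at a time
            res = fit_map(all_actions[k], all_rewards[k], mu, sigma2)
            m[k] = res.x                                # MAP estimate m_k
            S[k] = np.diag(res.hess_inv)                # Laplace: S_k from the approximate inverse Hessian
        mu = m.mean(axis=0)                             # prior mean = mean of MAP estimates
        sigma2 = (m ** 2 + S).mean(axis=0) - mu ** 2    # prior variance also uses S_k
    return mu, sigma2, m, S
```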

Parameter recovery. [Panels A-H: correlation between true and inferred parameters ($\beta$, $\alpha$, $\beta_1$, $\omega$) as a function of the number of trials T and the number of subjects Nsj, for the Rescorla-Wagner model and the 2-step model; Simulation 1: truly uncorrelated, no GLM; Simulation 4: truly correlated, no GLM; estimators: EM-MAP, MAP0, ML.]

Correlations. [Panels A-F: correlations between ML estimates ($\beta'$ and $\alpha'$; $\beta_1'$ and $\omega'$), correlations between MAP estimates, and the prior covariance of the MAP estimates, for the RW and 2-step models, as a function of the number of trials T.]

Are the parameters OK for correlations? Individual subject parameter estimates are NO LONGER INDEPENDENT: change the group and you change the parameter estimates. Be careful when comparing parameters fitted under different priors; correlations and t-tests within the same prior are OK. [Panels A-D: distributions of p values for the RW and 2-step models under ML, EM-MAP and MAP0; Nsj = 10, 30, 100.]

GLM. So far: infer individual parameters, then apply standard tests. Alternative: view the effect of interest as variation across the group, which is more specific and possibly more powerful, e.g. $\mu_i = \mu_{\text{Group}} + \beta\, x_i$ with a group-level regressor $x_i$, and infer $\beta$ along with the group parameters. [Graphical model.]

GLM: group-level regressor. [Panels A-F: GLM estimates as a function of the number of trials T and the number of subjects Nsj.]

Fitting - how to. Write your likelihood function (matlab examples are attached with emfit.m); don't do 20 separate ML fits! Pass it into emfit.m or the julia version, www.quentinhuys.com/pub/emfit_151110.zip. Validate: generate data with the fitted parameters, compare, have a look: does it look right? Re-fit: is it stable? Then do model comparison; and only now look at parameters, do correlations etc. Future: GLM, full random effects over models and parameters jointly? (Daniel Schad) [Histogram: counts of p values for reward sensitivity $\rho$ and learning rate $\epsilon$.]

Hierarchical / random effects models. Advantages: accurate group-level mean and variance; outliers due to a weak likelihood are regularised, while strong outliers are not; useful for model selection. Disadvantages: individual estimates depend on the other data, i.e. on $A_{j \neq i}$, and one therefore needs to be careful in interpreting them as summary statistics; more involved, less transparent. Psychiatry: groups are often not well defined, covariates are better. fMRI: shrink the variance of ML estimates - are fixed effects better still?

How does it do? [Plot: log-likelihood (LL) for each candidate model: βαQ, βα Q, βα4 Q, βα2 Q, GoOnly variants, 2βα2 GoOnly Q, Sep:βαQ, Sep:βα Q.]

Overfitting. [Plot: Y against X.]

Model comparison. A fit by itself is not meaningful. Generative test: qualitative. Comparisons: versus random, versus other models -> test specific hypotheses and isolate particular effects in a generative setting.

Model comparison. Averaged over its parameter settings, how well does the model fit the data? $p(A \mid M) = \int d\theta\, p(A \mid \theta)\, p(\theta \mid M)$. Model comparison via Bayes factors: $BF = \frac{p(A \mid M_1)}{p(A \mid M_2)}$. Problem: the integral is rarely solvable; approximations: Laplace, sampling, variational...

Why integrals? The God Almighty test. [Plots: a powerful model versus a weak model spreading probability over possible data sets.] The marginal likelihood can be approximated by the average $\frac{1}{N}\left(p(X \mid \theta_1) + p(X \mid \theta_2) + \dots\right)$ over parameter settings drawn from the prior, so two factors fight it out: model complexity versus model fit.

Group-level BIC. $\log p(A \mid M) = \log \int d\phi\, p(A \mid \phi)\, p(\phi \mid M) \approx -\tfrac{1}{2}\,\mathrm{BIC}_{int} = \log \hat p(A \mid \hat\phi^{ML}) - \tfrac{1}{2}\,|M| \log(|A|)$. Very simple: 1) EM to estimate the group prior mean and variance (simply done using fminunc, which provides Hessians); 2) sample from the estimated priors; 3) average.
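A sketch of the three-step recipe, under the same assumptions (group prior from em_map_laplace above; counting a mean and a variance per model parameter as the prior parameters |M| is my reading, not stated on the slide).

```python
import numpy as np

def bic_int(all_actions, all_rewards, mu, sigma2, n_samples=2000, seed=0):
    """Integrated group-level BIC: -2 log p(A | fitted prior) + |M| log |A|,
    with each subject's likelihood estimated by sampling from the fitted prior."""
    rng = np.random.default_rng(seed)
    log_lik, n_data = 0.0, 0
    for actions, rewards in zip(all_actions, all_rewards):
        logliks = np.empty(n_samples)
        for i in range(n_samples):
            theta = mu + np.sqrt(sigma2) * rng.standard_normal(len(mu))   # theta ~ fitted prior
            logliks[i] = rw_softmax_loglik(to_natural(theta), actions, rewards)
        # log of the average likelihood, computed in log space to avoid underflow
        log_lik += np.logaddexp.reduce(logliks) - np.log(n_samples)
        n_data += len(actions)
    n_prior_params = 2 * len(mu)            # a prior mean and a prior variance per model parameter
    return -2.0 * log_lik + n_prior_params * np.log(n_data)
```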

How does it do? [Plots: candidate models compared by summed log-likelihood, $\sum_i BIC_i$, and $BIC_{int}$; fitted by EM... too nice?]

Group model selection: integrate out your parameters.

Model comparison: overfitting? [Plot: $BIC_{int}$ (roughly 3500-3900) for the models RW; RW + noise; RW + noise + Q0; RW + noise + bias; RW (rew/pun) + noise + bias; RW + noise + bias + Pav. Note: same number of parameters. Panels: Probability(Go) over trials for Go to win, Go to avoid, Nogo to win, Nogo to avoid.]

Behavioural data modelling. Models are no panacea: they are statistics about specific aspects of the decision machinery and only account for part of the variance. The model needs to match the experiment: ensure subjects actually do the task the way you wrote it in the model; use model comparison. A model is a quantitative hypothesis, and hence a strong test: you need to compare models, not parameters, and a model includes all the consequences of a hypothesis for choice.

Thanks: Peter Dayan, Daniel Schad, Nathaniel Daw. SNSF, DFG.