SRNDNA Model Fitting in RL Workshop

SRNDNA Model Fitting in RL Workshop (yael@princeton.edu)

Topics:
1. Trial-by-trial model fitting (morning; Yael)
2. Model comparison (morning; Yael)
3. Advanced topics (hierarchical fitting, etc.) (afternoon; Nathaniel)
4. Applications to fMRI (afternoon; Nathaniel)

Goals: understand how models are hypotheses that can be tested; acquire a general tool that applies to any time-varying data of interest; practicalities: start here, and know where you can go from there.

Bandit tasks

A simple task often used in the laboratory: repeated choice between n options (an n-armed bandit) whose properties (reward amounts, probabilities) are learned through trial and error. The overall approach: 1. learn values for the options; 2. choose the best option. How would you model this learning process?

Suppose we ran this experiment on a person. What are the data? What does the model predict? What can we conclude or infer from the data? Our models are basically detailed hypotheses about behavior and about the brain, and we can test these hypotheses!
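The learn-values-then-choose loop above can be sketched in code. This is a minimal sketch, assuming a Rescorla-Wagner value update and a softmax choice rule (the standard pairing in this literature); the reward probabilities 0.8/0.2 and the function name `simulate_bandit` are illustrative choices, not specifics from the workshop.

```python
import numpy as np

def simulate_bandit(alpha, beta, p_reward, n_trials, rng):
    """Simulate a Rescorla-Wagner learner with softmax choice on an n-armed bandit.

    alpha: learning rate; beta: softmax inverse temperature;
    p_reward: per-arm reward probabilities (hypothetical task settings).
    """
    n_arms = len(p_reward)
    Q = np.zeros(n_arms)                 # learned action values
    choices, rewards = [], []
    for _ in range(n_trials):
        # softmax choice rule: P(a) proportional to exp(beta * Q[a])
        p = np.exp(beta * Q - np.max(beta * Q))
        p /= p.sum()
        c = rng.choice(n_arms, p=p)
        r = float(rng.random() < p_reward[c])   # Bernoulli reward
        Q[c] += alpha * (r - Q[c])              # prediction-error update
        choices.append(c)
        rewards.append(r)
    return np.array(choices), np.array(rewards)

rng = np.random.default_rng(0)
choices, rewards = simulate_bandit(alpha=0.25, beta=1.0,
                                   p_reward=[0.8, 0.2], n_trials=1000, rng=rng)
```

Running this, the simulated agent gradually prefers the richer arm, which is exactly the kind of trial-by-trial behavior the models below are fit to.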

Writing down a full model of learning

What do we know? What can we measure? What do we not know?

Estimating model parameters

Why estimate parameters? 1. The parameters may measure quantities of interest (learning rates in different populations, how variance in the task affects the learning rate, etc.). 2. We need them to generate hidden variables of interest (e.g., prediction errors) in order to look for these in the brain.

How do we estimate parameters? We want P(α,β|D,M). What would Bayes do? P(α,β|D,M) ∝ P(D|α,β,M), and this we know how to compute: P(D|α,β,M) = ∏_t P(c_t|α,β,M).
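The product of per-trial choice probabilities above is computed in log space in practice, to avoid numerical underflow. A minimal sketch, assuming the Rescorla-Wagner/softmax model from the bandit slide; the function name `neg_log_likelihood` is illustrative:

```python
import numpy as np

def neg_log_likelihood(params, choices, rewards, n_arms=2):
    """-log P(D | alpha, beta, M): the log of the product over trials of
    P(c_t | alpha, beta, M), negated so that minimizing it maximizes likelihood."""
    alpha, beta = params
    Q = np.zeros(n_arms)
    ll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * Q
        # log-softmax: log P(c) = logits[c] - logsumexp(logits)
        log_p = logits - logits.max() - np.log(np.sum(np.exp(logits - logits.max())))
        ll += log_p[c]
        Q[c] += alpha * (r - Q[c])       # update value of the chosen arm
    return -ll
```

Note that values are updated after the choice probability is scored, so each trial's likelihood uses only information available before that trial.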

Estimating parameters

P(α,β|D,M) ∝ P(D|α,β,M)... this is a probability distribution. We often settle for a point estimate: the maximum likelihood point, argmax_{α,β} P(D|α,β,M) (equivalently, we can maximize the log likelihood).

Pragmatics: minimize -LL with fmincon, fminunc, or fminsearch. Beware local minima: use multiple starting points. Sanity check: simulating 1000 trials with α=0.25, β=1 and refitting recovered α=0.26, β=0.86.
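The fmincon recipe above (minimize -LL from multiple starting points) can be sketched outside MATLAB as well. A sketch using scipy, assuming the Rescorla-Wagner/softmax model; the true parameters α=0.25, β=1 mirror the slide's sanity check, while the bounds and number of restarts are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_arms=2):
    """-sum_t log P(c_t | alpha, beta, M) for a Rescorla-Wagner/softmax model."""
    alpha, beta = params
    Q = np.zeros(n_arms)
    nll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * Q
        log_norm = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        nll -= logits[c] - log_norm
        Q[c] += alpha * (r - Q[c])
    return nll

# Simulate 1000 trials from the model itself (true alpha=0.25, beta=1;
# the 0.8/0.2 reward probabilities are hypothetical task settings).
rng = np.random.default_rng(0)
alpha_true, beta_true, p_reward = 0.25, 1.0, np.array([0.8, 0.2])
Q = np.zeros(2)
choices, rewards = [], []
for _ in range(1000):
    p = np.exp(beta_true * Q - np.max(beta_true * Q))
    p /= p.sum()
    c = rng.choice(2, p=p)
    r = float(rng.random() < p_reward[c])
    Q[c] += alpha_true * (r - Q[c])
    choices.append(c)
    rewards.append(r)

# Minimize -LL from several random starting points to guard against local minima.
best = None
for _ in range(10):
    x0 = [rng.uniform(0.05, 0.95), rng.uniform(0.1, 5.0)]
    res = minimize(neg_log_likelihood, x0, args=(choices, rewards),
                   method="L-BFGS-B", bounds=[(1e-3, 1.0), (1e-3, 20.0)])
    if best is None or res.fun < best.fun:
        best = res
alpha_hat, beta_hat = best.x
```

Keeping only the best of the restarts is the standard cheap defense against local minima; for a two-parameter model a coarse grid of starting points works just as well.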

Error bars on estimates

Intuition: how fast the likelihood changes as you change the parameters. Pragmatics: the inverse Hessian (second-derivative matrix; fmincon returns it) of the negative log likelihood estimates the parameter covariance matrix: error bars are the square roots of the diagonal entries, and covariation between parameters appears off the diagonal. One can also look at variation in fits across subjects (more on this in the afternoon).

Example: do learning rates adapt to volatility? Behrens et al. (2007): dots = Bayesian model; bars + SE = humans (learning rates estimated from choice data).
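The inverse-Hessian recipe can be sketched numerically. Here it is demonstrated on a stand-in quadratic negative log likelihood whose curvature is known exactly, so the recovered error bars can be checked; the quadratic, its coefficients, and the helper name `numerical_hessian` are all illustrative, not from the workshop:

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-5):
    """Central finite-difference Hessian of a scalar function f at point x."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            x_pp = x.copy(); x_pp[i] += eps; x_pp[j] += eps
            x_pm = x.copy(); x_pm[i] += eps; x_pm[j] -= eps
            x_mp = x.copy(); x_mp[i] -= eps; x_mp[j] += eps
            x_mm = x.copy(); x_mm[i] -= eps; x_mm[j] -= eps
            H[i, j] = (f(x_pp) - f(x_pm) - f(x_mp) + f(x_mm)) / (4 * eps**2)
    return H

# Stand-in -LL with known curvature: a quadratic bowl whose Hessian is
# diag(4, 25), so the standard errors should come out as 1/2 and 1/5.
nll = lambda x: 2.0 * x[0]**2 + 12.5 * x[1]**2
H = numerical_hessian(nll, np.array([0.0, 0.0]))  # evaluated at the minimum
cov = np.linalg.inv(H)            # inverse Hessian ~ parameter covariance
se = np.sqrt(np.diag(cov))        # error bars: sqrt of the diagonal
```

In practice you would evaluate the Hessian of your model's -LL at the fitted parameters (or take the inverse Hessian that the optimizer already returns) rather than at a known minimum.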

Summary so far

Learning models are detailed hypotheses about trial-by-trial overt and covert variables. These can be tested against the brain, and help us understand what different brain regions/networks are doing/computing. There is a whole host of interesting results so far, but many questions are still unanswered (this is a relatively new method!). The models help us learn about the brain; can we also use the brain (or behavior) to learn about the models?

Which model is best? Model comparison

P(Model|Data) = ? Comparing two models:

P(M1|D) / P(M2|D) = [P(D|M1) P(M1)] / [P(D|M2) P(M2)]

The ratio of the evidences, P(D|M1)/P(D|M2), is the Bayes factor.

Which model is best? Model comparison

P(M1|D) / P(M2|D) = [P(D|M1) P(M1)] / [P(D|M2) P(M2)]

There is an automatic Occam's razor here: simple models tend to make precise predictions. You can put a preference for simple models into the prior P(M), but you don't need to. "Pluralitas non est ponenda sine necessitate" (plurality should not be posited without necessity), William of Ockham (1349): we should go for the simplest model that explains the data.

Which model is best? Model comparison

P(M1|D) / P(M2|D) = [P(D|M1) P(M1)] / [P(D|M2) P(M2)]

Assuming a uniform prior over models, all we care about is P(D|M):

P(D|M) = ∫ dθ P(D|θ,M) P(θ|M)

This is the Bayesian evidence for model M (the marginal likelihood). Bayesian model comparison is Occam's razor at work.

Computing P(D|M)

P(D|M) = ∫ dθ P(D|θ,M) P(θ|M)

Integrating over all settings of the parameters is too hard in general. Approximate solutions: sample the posterior at many points to approximate the integral and compute the Bayes factor directly; the Laplace approximation: make a Gaussian approximation around the MAP parameter estimate; BIC: an asymptotic approximation to the above in the limit of infinite data. Non-Bayesian methods: the likelihood ratio test, AIC, or a hold-out set to verify model fits.

Laplace approximation

For large amounts of data (compared to the number of parameters d), the posterior is approximately Gaussian around the maximum a posteriori (MAP) estimate θ̂:

P(θ|D,M) ≈ (2π)^(-d/2) |A|^(1/2) exp(-(1/2)(θ-θ̂)ᵀ A (θ-θ̂))

We also know that P(D|M) = P(θ,D|M) / P(θ|D,M) = P(θ|M) P(D|θ,M) / P(θ|D,M), so we can compute around the MAP estimate:

ln P(D|M) ≈ ln P(θ̂|M) + ln P(D|θ̂,M) + (d/2) ln(2π) - (1/2) ln|A|

where -A is the Hessian matrix of ln P(θ|D,M), i.e. A_kl = -∂² ln P(θ|D,M) / ∂θ_k ∂θ_l, evaluated at θ̂.
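The Laplace formula can be checked on a model where the evidence is available in closed form. A sketch using Bernoulli data with a uniform Beta(1,1) prior (so ln P(θ̂|M) = 0 and the exact evidence is a Beta function); the counts n=100, k=60 are illustrative:

```python
from math import lgamma, log, pi

# Bernoulli data: k successes in n trials, uniform prior on theta in [0,1].
n, k = 100, 60
theta_hat = k / n                     # MAP (= ML) estimate under the flat prior
log_lik = k * log(theta_hat) + (n - k) * log(1 - theta_hat)

# A = -d^2/dtheta^2 [k ln(theta) + (n-k) ln(1-theta)] at theta_hat
A = k / theta_hat**2 + (n - k) / (1 - theta_hat)**2
d = 1                                 # one free parameter

# Laplace: ln P(D|M) ~ ln P(theta_hat|M) + ln P(D|theta_hat,M)
#                      + (d/2) ln(2*pi) - (1/2) ln|A|
log_ev_laplace = 0.0 + log_lik + (d / 2) * log(2 * pi) - 0.5 * log(A)

# Exact log evidence for comparison: ln Beta(k+1, n-k+1)
log_ev_exact = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)
```

With this much data the two values agree to about two decimal places, which is the point of the approximation: the Gaussian bulk around the MAP estimate carries almost all of the integral.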

BIC approximation

ln P(D|M) ≈ ln P(θ̂|M) + ln P(D|θ̂,M) + (d/2) ln(2π) - (1/2) ln|A|

The prior term and the data log likelihood are easy. In the limit of LOTS of data (N → ∞), A grows as N·A₀ (for fixed A₀), so ln|A| = ln|N·A₀| = ln(N^d |A₀|) = d ln N + ln|A₀|. Retaining only the terms that grow with N, we can approximate further:

ln P(D|M) ≈ ln P(D|θ̂,M) - (d/2) ln N

So for each model we compute the log likelihood at the ML parameters, add a penalty that depends on d (the number of parameters), and compare the results between models. Advantages: easy to compute; can use the ML rather than the MAP estimate. Disadvantages: it is hard to determine d (only identifiable parameters count) and N (only the samples used to fit the parameters; what if this differs across parameters?).

Non-Bayesian alternatives

Likelihood ratio test: for nested models (one is a special case of the other; compares hypothesis H1 to a hypothesis H0 in which some parameters are fixed). A statistical test on the likelihood difference: compare 2 × the difference in log likelihood (at ML) to a χ² statistic with df = the number of additional parameters.

AIC (Akaike's information criterion, 1974): measures the goodness of a model based on the bias and variance of the estimated model and measures of entropy. Not a statistical test; it only ranks models. Penalize the log likelihood (at ML) by the number of parameters.

Hold-out validation: fit models on a training set and validate the fit on a hold-out set. Problem: it is often hard to find two i.i.d. sets in a learning setting.
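The BIC and AIC penalties above reduce to two one-line formulas. A minimal sketch; the log likelihoods, parameter counts, and trial count below are hypothetical numbers chosen to show how the two criteria can disagree:

```python
import numpy as np

def bic(log_lik_ml, d, n):
    """BIC approximation to the log evidence: ln P(D|M) ~ LL(ML) - (d/2) ln N."""
    return log_lik_ml - 0.5 * d * np.log(n)

def aic(log_lik_ml, d):
    """AIC score: penalize the ML log likelihood by the number of parameters."""
    return log_lik_ml - d

# Hypothetical fits: model 1 (2 params) vs. model 2 (4 params) on N=300 trials.
ll1, d1 = -180.0, 2
ll2, d2 = -176.5, 4
n = 300

bic1, bic2 = bic(ll1, d1, n), bic(ll2, d2, n)
aic1, aic2 = aic(ll1, d1), aic(ll2, d2)
```

With these (made-up) numbers, BIC prefers the simpler model while AIC prefers the richer one: BIC's (d/2) ln N penalty grows with the number of trials, whereas AIC's penalty is a flat cost per parameter, so the two can rank the same pair of models differently.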

Summary so far...

Learning models are detailed hypotheses about trial-by-trial overt and covert variables. Trial-by-trial model fitting lets us test these hypotheses and compare alternatives. There is a premium on detailed model fitting when considering learning: it is non-stationary, so we can't use traditional averaging techniques. This gives us a lot of leverage to pinpoint the neural correlates of learning and decision making in the brain.

Additional reading: Daw (2011) - Trial-by-trial data analysis using computational models; Hare et al. (2008) - Dissociating the role of the orbitofrontal cortex and the striatum in the computation of goal values and prediction errors; Niv (2009) - Reinforcement learning in the brain.