Bayesian Inference. Chris Mathys, Wellcome Trust Centre for Neuroimaging, UCL. London SPM Course


Bayesian Inference. Chris Mathys, Wellcome Trust Centre for Neuroimaging, UCL. London SPM Course. Thanks to Jean Daunizeau and Jérémie Mattout for previous versions of this talk.

A spectacular piece of information

A spectacular piece of information: Messerli, F. H. (2012). Chocolate Consumption, Cognitive Function, and Nobel Laureates. New England Journal of Medicine, 367(16), 1562-1564.

So will I win the Nobel prize if I eat lots of chocolate? This is a question about uncertain quantities. Like almost all scientific questions, it cannot be answered by deductive logic. Nonetheless, quantitative answers can be given, but only in terms of probabilities. Our question can be rephrased as a conditional probability: $p(\text{Nobel} \mid \text{lots of chocolate}) = \,?$ To answer it, we have to learn to calculate such quantities. The tool for this is Bayesian inference.

«Bayesian» = logical and logical = probabilistic. «The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.» James Clerk Maxwell, 1850

«Bayesian» = logical and logical = probabilistic. But in what sense is probabilistic reasoning (i.e., reasoning about uncertain quantities according to the rules of probability theory) «logical»? R. T. Cox showed in 1946 that the rules of probability theory can be derived from three basic desiderata: 1. Representation of degrees of plausibility by real numbers 2. Qualitative correspondence with common sense (in a well-defined sense) 3. Consistency

The rules of probability. By mathematical proof (i.e., by deductive reasoning), the three desiderata as set out by Cox imply the rules of probability (i.e., the rules of inductive reasoning). This means that anyone who accepts the desiderata must accept the following rules: 1. $\sum_a p(a) = 1$ (Normalization) 2. $p(b) = \sum_a p(a, b)$ (Marginalization, also called the sum rule) 3. $p(a, b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)$ (Conditioning, also called the product rule). «Probability theory is nothing but common sense reduced to calculation.» Pierre-Simon Laplace, 1819

Conditional probabilities. The probability of $a$ given $b$ is denoted by $p(a \mid b)$. In general, this is different from the probability of $a$ alone (the marginal probability of $a$), as we can see by applying the sum and product rules: $p(a) = \sum_b p(a, b) = \sum_b p(a \mid b)\, p(b)$. Because of the product rule, we also have the following rule (Bayes' theorem) for going from $p(a \mid b)$ to $p(b \mid a)$: $p(b \mid a) = \frac{p(a \mid b)\, p(b)}{p(a)} = \frac{p(a \mid b)\, p(b)}{\sum_{b'} p(a \mid b')\, p(b')}$
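To make the sum rule, the product rule, and Bayes' theorem concrete, here is a minimal numerical sketch (the joint distribution is a hypothetical example, not from the slides):

```python
# A minimal numeric sketch (hypothetical numbers) of the sum rule, the product
# rule, and Bayes' theorem on a discrete joint distribution p(a, b).
import numpy as np

p_ab = np.array([[0.3, 0.1],     # p(a, b), rows indexed by a, columns by b
                 [0.2, 0.4]])
assert np.isclose(p_ab.sum(), 1.0)              # normalization

p_a = p_ab.sum(axis=1)                          # sum rule: p(a) = sum_b p(a, b)
p_b = p_ab.sum(axis=0)                          # p(b) = sum_a p(a, b)
p_a_given_b = p_ab / p_b                        # product rule: p(a|b) = p(a, b) / p(b)
p_b_given_a = p_ab / p_a[:, None]               # p(b|a) = p(a, b) / p(a)

# Bayes' theorem: p(b|a) = p(a|b) p(b) / p(a)
assert np.allclose(p_b_given_a, p_a_given_b * p_b / p_a[:, None])
print(p_b_given_a)
```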

The chocolate example. In our example, it is immediately clear that $p(\text{Nobel} \mid \text{chocolate})$ is very different from $p(\text{chocolate} \mid \text{Nobel})$. While the first is hopeless to determine directly, the second is much easier to find out: ask Nobel laureates how much chocolate they eat. Once we know that, we can use Bayes' theorem: $p(\text{Nobel} \mid \text{chocolate}) = \frac{p(\text{chocolate} \mid \text{Nobel})\, p(\text{Nobel})}{p(\text{chocolate})}$ (posterior = likelihood × prior / evidence). Inference on the quantities of interest in neuroimaging studies has exactly the same general structure.

Inference in SPM. Forward problem: likelihood $p(y \mid \theta, m)$. Inverse problem: posterior distribution $p(\theta \mid y, m)$.

Inference in SPM. Likelihood: $p(y \mid \theta, m)$. Prior: $p(\theta \mid m)$. Together, likelihood and prior constitute the generative model $m$. Bayes' theorem: $p(\theta \mid y, m) = \frac{p(y \mid \theta, m)\, p(\theta \mid m)}{p(y \mid m)}$

A simple example of Bayesian inference (adapted from Jaynes (1976)). Two manufacturers, A and B, deliver the same kind of components that turn out to have the following lifetimes (in hours): A: … B: … (lifetime data shown on the slide). Assuming prices are comparable, from which manufacturer would you buy?

A simple example of Bayesian inference. How do we compare such samples?

A simple example of Bayesian inference. Is this satisfactory? What next?

A simple example of Bayesian inference. The procedure in brief: determine your question of interest («What is the probability that...?»); specify your model (likelihood and prior); calculate the full posterior using Bayes' theorem; [pass to the uninformative limit in the parameters of your prior]; integrate out any nuisance parameters; ask your question of interest of the posterior. All you need is the rules of probability theory. (OK, sometimes you'll encounter a nasty integral, but that's a technical difficulty, not a conceptual one.)

A simple example of Bayesian inference. The question: What is the probability that the components from manufacturer B have a longer lifetime than those from manufacturer A? More specifically: given how much more expensive they are, how much longer do I require the components from B to live? Example of a decision rule: if the components from B live at least 3 hours longer than those from A with a probability of at least 80%, I will choose those from B.

A simple example of Bayesian inference. The model (bear with me, this will turn out to be simple): Likelihood (Gaussian): $p(x_{1:n} \mid \mu, \lambda) = \prod_{i=1}^{n} \left(\frac{\lambda}{2\pi}\right)^{1/2} \exp\!\left(-\frac{\lambda}{2}(x_i - \mu)^2\right)$ Prior (Gaussian-gamma): $p(\mu, \lambda \mid \mu_0, \kappa_0, a_0, b_0) = \mathcal{N}\!\left(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_0, b_0)$

A simple example of Bayesian inference. The posterior (Gaussian-gamma): $p(\mu, \lambda \mid x_{1:n}) = \mathcal{N}\!\left(\mu \mid \mu_n, (\kappa_n \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_n, b_n)$ Parameter updates: $\mu_n = \mu_0 + \frac{n}{\kappa_0 + n}(\bar{x} - \mu_0)$, $\kappa_n = \kappa_0 + n$, $a_n = a_0 + \frac{n}{2}$, $b_n = b_0 + \frac{n}{2}\left(s^2 + \frac{\kappa_0}{\kappa_0 + n}(\bar{x} - \mu_0)^2\right)$ with $\bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i$ and $s^2 \equiv \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$.
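A minimal sketch of these update equations in code (the prior values and the data array are hypothetical placeholders; the slide's lifetime data were not preserved in the transcription):

```python
# Gaussian-gamma conjugate updates, following the equations on the slide above.
import numpy as np

def gaussian_gamma_update(x, mu0, kappa0, a0, b0):
    """Return posterior parameters (mu_n, kappa_n, a_n, b_n) of
    N(mu | mu_n, (kappa_n*lambda)^-1) Gam(lambda | a_n, b_n)."""
    n = len(x)
    xbar = np.mean(x)
    s2 = np.mean((x - xbar) ** 2)               # variance with 1/n, as on the slide
    kappa_n = kappa0 + n
    mu_n = mu0 + n / kappa_n * (xbar - mu0)
    a_n = a0 + n / 2
    b_n = b0 + n / 2 * (s2 + kappa0 / kappa_n * (xbar - mu0) ** 2)
    return mu_n, kappa_n, a_n, b_n

x_A = np.array([42.0, 44.0, 46.0, 47.0, 49.0])  # hypothetical lifetimes (hours)
print(gaussian_gamma_update(x_A, mu0=45.0, kappa0=1.0, a0=2.0, b0=2.0))
```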

A simple example of Bayesian inference. The limit for which the prior becomes uninformative: for $\kappa_0 = 0$, $a_0 = 0$, $b_0 = 0$, the updates reduce to $\mu_n = \bar{x}$, $\kappa_n = n$, $a_n = \frac{n}{2}$, $b_n = \frac{n}{2} s^2$. As promised, this is really simple: all you need is $n$, the number of data points; $\bar{x}$, their mean; and $s^2$, their variance. This means that only the data influence the posterior, and all influence from the parameters of the prior has been eliminated. This is normally not what you want. The prior contains important information that regularizes your inferences. Often, inference only works with informative priors. In any case, the uninformative limit should only ever be taken after the calculation of the posterior using a proper prior.
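As a quick sanity check (again with hypothetical data), plugging $\kappa_0 = a_0 = b_0 = 0$ into the update equations reproduces these sample summaries directly:

```python
# Sanity check (hypothetical data): with kappa0 = a0 = b0 = 0 the update equations
# collapse to the sample summaries n, xbar and s^2, as stated on the slide.
import numpy as np

x = np.array([42.0, 44.0, 46.0, 47.0, 49.0])    # hypothetical lifetimes (hours)
n, xbar = len(x), x.mean()
s2 = ((x - xbar) ** 2).mean()

mu_n    = 0 + n / (0 + n) * (xbar - 0)          # = xbar
kappa_n = 0 + n                                 # = n
a_n     = 0 + n / 2                             # = n/2
b_n     = 0 + n / 2 * (s2 + 0)                  # = (n/2) * s^2
print((mu_n, kappa_n, a_n, b_n), (xbar, n, n / 2, n / 2 * s2))
```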

A simple example of Bayesian inference. Integrating out the nuisance parameter $\lambda$ gives rise to a t-distribution for $\mu$.
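The standard form of this marginal for the Gaussian-gamma posterior above (a textbook result, stated here for reference) is $p(\mu \mid x_{1:n}) = \int_0^\infty p(\mu, \lambda \mid x_{1:n})\, \mathrm{d}\lambda = \mathrm{St}\!\left(\mu \mid \mu_n, \tfrac{a_n \kappa_n}{b_n}, 2 a_n\right)$, i.e. a Student-t density with location $\mu_n$, precision $a_n \kappa_n / b_n$, and $2 a_n$ degrees of freedom.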

A simple example of Bayesian inference. The joint posterior $p(\mu_A, \mu_B \mid x^{(A)}, x^{(B)})$ is simply the product of our two independent posteriors $p(\mu_A \mid x^{(A)})$ and $p(\mu_B \mid x^{(B)})$. It will now give us the answer to our question: $p(\mu_B - \mu_A > 3) = \int_{-\infty}^{\infty} \mathrm{d}\mu_A\, p(\mu_A \mid x^{(A)}) \int_{\mu_A + 3}^{\infty} \mathrm{d}\mu_B\, p(\mu_B \mid x^{(B)}) = 0.9501$ Note that the t-test told us that there was «no significant difference», even though there is a >95% probability that the parts from B will last at least 3 hours longer than those from A.
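The integral above can also be approximated by simple Monte Carlo: draw $(\mu, \lambda)$ from each posterior and count how often $\mu_B - \mu_A > 3$. The sketch below does this with hypothetical data (the slide's lifetimes were not preserved), so it illustrates the mechanics rather than reproducing the 0.9501 quoted above:

```python
# Monte Carlo estimate of P(mu_B - mu_A > 3) from two Gaussian-gamma posteriors,
# here in the uninformative limit (kappa0 = a0 = b0 = 0) used on the slides.
import numpy as np

rng = np.random.default_rng(0)

def posterior_params(x, mu0=0.0, kappa0=0.0, a0=0.0, b0=0.0):
    n, xbar = len(x), np.mean(x)
    s2 = np.mean((x - xbar) ** 2)
    kappa_n = kappa0 + n
    mu_n = mu0 + n / kappa_n * (xbar - mu0)
    a_n = a0 + n / 2
    b_n = b0 + n / 2 * (s2 + kappa0 / kappa_n * (xbar - mu0) ** 2)
    return mu_n, kappa_n, a_n, b_n

def sample_mu(x, size=200_000):
    mu_n, kappa_n, a_n, b_n = posterior_params(x)
    lam = rng.gamma(shape=a_n, scale=1.0 / b_n, size=size)              # lambda ~ Gam(a_n, b_n)
    return rng.normal(loc=mu_n, scale=1.0 / np.sqrt(kappa_n * lam))     # mu | lambda

x_A = np.array([42.0, 44.0, 46.0, 47.0, 49.0])   # hypothetical lifetimes (hours)
x_B = np.array([48.0, 50.0, 51.0, 53.0, 54.0])
p = np.mean(sample_mu(x_B) - sample_mu(x_A) > 3)
print(f"P(mu_B - mu_A > 3) approx {p:.3f}")
```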

Bayesian inference. The procedure in brief: determine your question of interest («What is the probability that...?»); specify your model (likelihood and prior); calculate the full posterior using Bayes' theorem; [pass to the uninformative limit in the parameters of your prior]; integrate out any nuisance parameters; ask your question of interest of the posterior. All you need is the rules of probability theory.

Frequentist (or: orthodox, classical) versus Bayesian inference: hypothesis testing. Classical: define the null, e.g. $H_0: \theta = 0$; estimate parameters (obtain a test statistic $t^*$); apply the decision rule: if $p(t > t^* \mid H_0) \le \alpha$, then reject $H_0$. Bayesian: invert the model (obtain the posterior pdf $p(\theta \mid y)$); define the null, e.g. $H_0: \theta > \theta_0$; apply the decision rule: if $p(H_0 \mid y) \ge \alpha$, then accept $H_0$.
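A small sketch (hypothetical data, and a deliberately simple special case) of the two decision rules side by side: a classical one-sided t-test of $H_0: \theta = 0$, and the posterior probability of $\theta \le 0$ under a flat (improper) prior on the mean with unknown variance. In this particular case the two numbers coincide, although they answer different questions:

```python
# Classical one-sided p-value versus Bayesian posterior probability of the null,
# for a normal mean with unknown variance under a flat (improper) prior.
import numpy as np
from scipy import stats

y = np.array([0.8, -0.3, 1.2, 0.5, 0.9, 1.4, -0.1, 0.7])   # hypothetical data
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

# Classical: test statistic and one-sided p-value P(t > t* | H0)
t_star = ybar / (s / np.sqrt(n))
p_value = 1 - stats.t.cdf(t_star, df=n - 1)

# Bayesian: the marginal posterior of theta is Student-t centred on ybar with
# scale s/sqrt(n), so P(theta <= 0 | y) is a t cdf at the standardised zero point.
p_H0 = stats.t.cdf((0 - ybar) / (s / np.sqrt(n)), df=n - 1)

print(p_value, p_H0)   # numerically identical in this special case
```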

Model comparison: general principles. Principle of parsimony: «plurality should not be assumed without necessity». Automatically enforced by Bayesian model comparison. Model evidence: $p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, \mathrm{d}\theta \approx \exp(\text{accuracy} - \text{complexity})$ Occam's razor (figure on slide): the model evidence $p(y \mid m)$ for models $y = f(x)$ of varying complexity, plotted over the space of all data sets.
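A minimal sketch of the evidence integral and the automatic Occam's razor, using a hypothetical coin-flip example rather than anything from the slides: a model with a free bias parameter integrates its likelihood over the prior and is penalized relative to a fixed fair-coin model unless the data demand the extra flexibility:

```python
# Model evidence p(y|m) = ∫ p(y|θ,m) p(θ|m) dθ for two toy models of coin flips.
import numpy as np
from scipy.integrate import quad
from scipy.stats import binom, beta

k, n = 6, 10                                    # hypothetical data: 6 heads in 10 flips

# m0: fair coin, no free parameters -> evidence is the likelihood at θ = 0.5
evidence_m0 = binom.pmf(k, n, 0.5)

# m1: unknown bias θ with a flat Beta(1,1) prior -> integrate θ out
evidence_m1, _ = quad(lambda th: binom.pmf(k, n, th) * beta.pdf(th, 1, 1), 0, 1)

print(evidence_m0, evidence_m1, evidence_m0 / evidence_m1)   # Bayes factor m0 vs m1
```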

Model comparison: negative variational free energy. Log model evidence: $\log p(y \mid m) = \log \int p(y, \theta \mid m)\, \mathrm{d}\theta$ (sum rule) $= \log \int q(\theta)\, \frac{p(y, \theta \mid m)}{q(\theta)}\, \mathrm{d}\theta$ (multiply by 1) $\ge \int q(\theta) \log \frac{p(y, \theta \mid m)}{q(\theta)}\, \mathrm{d}\theta =: F$ (Jensen's inequality), so the negative variational free energy $F$ is a lower bound on the log model evidence. Using the product rule, $F = \int q(\theta) \log \frac{p(y \mid \theta, m)\, p(\theta \mid m)}{q(\theta)}\, \mathrm{d}\theta = \underbrace{\int q(\theta) \log p(y \mid \theta, m)\, \mathrm{d}\theta}_{\text{Accuracy (expected log-likelihood)}} - \underbrace{\mathrm{KL}\big[q(\theta),\, p(\theta \mid m)\big]}_{\text{Complexity}}$
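An equivalent decomposition (a standard identity, not spelled out on the slide) makes explicit why maximizing $F$ tightens the bound: $F = \log p(y \mid m) - \mathrm{KL}\big[q(\theta),\, p(\theta \mid y, m)\big]$. Since the KL divergence is non-negative and vanishes only when $q(\theta) = p(\theta \mid y, m)$, $F$ reaches the log model evidence exactly when $q$ equals the true posterior.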

Model comparison: F in relation to Bayes factors, AIC, BIC. Bayes factor: $\frac{p(y \mid m_1)}{p(y \mid m_0)} \approx \exp(F_1 - F_0) = \exp\big(\log p(y \mid m_1) - \log p(y \mid m_0)\big)$ [Meaning of the Bayes factor: $\frac{p(m_1 \mid y)}{p(m_0 \mid y)} = \frac{p(y \mid m_1)}{p(y \mid m_0)} \cdot \frac{p(m_1)}{p(m_0)}$, i.e. posterior odds = Bayes factor × prior odds.] $F = \int q(\theta) \log p(y \mid \theta, m)\, \mathrm{d}\theta - \mathrm{KL}\big[q(\theta),\, p(\theta \mid m)\big] = \text{Accuracy} - \text{Complexity}$ $\mathrm{AIC} = \text{Accuracy} - p$, $\mathrm{BIC} = \text{Accuracy} - \frac{p}{2} \log N$, where $p$ is the number of parameters and $N$ the number of data points.
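A minimal numerical sketch (all numbers hypothetical) of how these quantities are used in practice: log evidence approximations turn into a Bayes factor and a posterior model probability, and AIC/BIC penalize the accuracy term as on the slide (here, accuracy is taken as the maximized log-likelihood for illustration):

```python
# Bayes factor and posterior model probability from two free-energy approximations,
# plus AIC/BIC in the form used on the slide (accuracy minus a complexity penalty).
import numpy as np

F1, F0 = -120.3, -123.8            # hypothetical free energies (≈ log evidences)
log_bf = F1 - F0                   # log Bayes factor for m1 vs m0
bf = np.exp(log_bf)
post_m1 = bf / (1 + bf)            # posterior P(m1|y) under equal prior odds
print(bf, post_m1)

loglik, p, N = -115.0, 6, 200      # hypothetical: max log-likelihood, #params, #data
aic = loglik - p                   # accuracy penalised by the number of parameters
bic = loglik - p / 2 * np.log(N)   # accuracy penalised by (p/2) log N
print(aic, bic)
```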

A note on informative priors. Any model consists of two parts: likelihood and prior. The choice of likelihood requires as much justification as the choice of prior because it is just as «subjective» as that of the prior. The data never speak for themselves. They only acquire meaning when seen through the lens of a model. However, this does not mean that all is subjective because models differ in their validity. In this light, the widespread concern that informative priors might bias results (while the form of the likelihood is taken as a matter of course requiring no justification) is misplaced. Informative priors are an important tool and their use can be justified by establishing the validity (face, construct, and predictive) of the resulting model as well as by model comparison.

A note on uninformative priors. Using a flat or «uninformative» prior doesn't make you more «data-driven» than anybody else. It's a choice that requires just as much justification as any other. For example, if you're studying a small effect in a noisy setting, using a flat prior means assigning the same prior probability mass to the interval covering effect sizes -1 to +1 as to that covering effect sizes +999 to +1001. Far from being unbiased, this amounts to a bias in favor of implausibly large effect sizes. Using flat priors is asking for a replicability crisis. One way to address this is to collect enough data to swamp the inappropriate priors. A cheaper way is to use more appropriate priors. Disclaimer: if you look at my papers, you will find flat priors.

Applications of Bayesian inference

Applications in the SPM analysis pipeline (figure on slide): realignment, smoothing, normalisation (to a template), general linear model, and statistical inference using Gaussian field theory (p < 0.05); Bayesian methods enter in segmentation and normalisation, posterior probability maps (PPMs), dynamic causal modelling, and multivariate decoding.

Segmentation (mixture-of-Gaussians model), figure on slide: the i-th voxel value $y_i$ is generated from its voxel label $c_i$, with class means $\mu_1, \dots, \mu_k$, class variances $\sigma_1, \dots, \sigma_k$, and class frequencies; the resulting tissue classes are grey matter, white matter, and CSF.

fMRI time series analysis (figure on slide): a GLM of the fMRI time series with priors on the GLM coefficients, the prior variance of the data noise, and the AR coefficients (correlated noise); PPMs show regions best explained by a short-term memory model versus a long-term memory model (different design matrices X).

Dynamic causal modelling (DCM), figure on slide: four candidate models $m_1$–$m_4$ of attention effects in a network of V1, V5, and PPC driven by visual stimulation are compared via their log model evidences $\ln p(y \mid m)$; the estimated effective synaptic strengths are shown for the best model ($m_4$).

Model comparison for group studies: differences in log model evidences $\ln p(y \mid m_1) - \ln p(y \mid m_2)$ across subjects. Fixed effect: assume all subjects correspond to the same model. Random effect: assume different subjects might correspond to different models.
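For the fixed-effect case, the group-level comparison is particularly simple: subject-wise log evidences (or their free-energy approximations) just add up. A minimal sketch with hypothetical numbers (the random-effects case requires the hierarchical model mentioned above and is not sketched here):

```python
# Fixed-effects group comparison: sum the subject-wise log Bayes factors.
import numpy as np

log_ev = np.array([[-120.0, -123.0],   # rows: subjects, columns: models m1, m2
                   [-118.5, -119.0],
                   [-130.2, -134.0]])
group_log_bf = (log_ev[:, 0] - log_ev[:, 1]).sum()   # group log Bayes factor, m1 vs m2
print(group_log_bf)
```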

Thanks