Bayesian Inference. Sudhir Shankar Raman, Translational Neuromodeling Unit, UZH & ETH.
1 The Reverend Thomas Bayes ( ). Bayesian Inference. Sudhir Shankar Raman, Translational Neuromodeling Unit, UZH & ETH. With many thanks for some slides to: Klaas Enno Stephan & Kay H. Brodersen
2 Why do I need to learn about Bayesian stats? Because SPM is getting more and more Bayesian: segmentation & spatial normalisation, posterior probability maps (PPMs), Dynamic Causal Modelling (DCM), Bayesian Model Selection (BMS), EEG source reconstruction.
3 Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
4 Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
5 Classical and Bayesian statistics. Classical: the p-value is the probability of getting the observed data y in the effect's absence, i.e. p(y|H0) with H0: θ = 0. If small, reject the null hypothesis that there is no effect. One can never accept the null hypothesis; given enough data, one can always demonstrate a significant effect; correction for multiple comparisons is necessary. Bayesian inference: flexibility in modelling; incorporating prior information p(θ); posterior probability of the effect p(θ|y); options for model comparison. "Statistical analysis and the illusion of objectivity." James O. Berger, Donald A. Berry
6 Bayes' Theorem. Prior beliefs + observed data → posterior beliefs. "Bayes' theorem describes how an ideally rational person processes information." Reverend Thomas Bayes
7 ) ( ) ( ) ( ) ( p p p P θ θ θ = Eliminating p(,θ) gives Baes rule: Likelihood Prior Evidence Posterior Baes Theorem Given data and parameters θ, the conditional probabilities are: ) ( ), ( ) ( θ θ θ p p p = ) ( ), ( ) ( p p p θ θ =
8 Bayesian statistics. Posterior ∝ likelihood × prior: p(θ|y) ∝ p(y|θ) p(θ), combining new data (the likelihood p(y|θ)) with prior knowledge p(θ). Bayes' theorem allows one to formally incorporate prior knowledge into computing statistical probabilities. Priors can be of different sorts: empirical, principled, shrinkage, or uninformative. The posterior probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision.
9 Bayes in motion - an animation. Likelihood, Prior, Posterior.
10 Principles of Bayesian inference. Formulation of a generative model: likelihood function p(y|θ) and prior distribution p(θ). Observation of data y. Model inversion - update of beliefs based upon observations, given a prior state of knowledge: p(θ|y) ∝ p(y|θ) p(θ). Point estimates: maximum a posteriori (MAP), maximum likelihood (ML).
11 Conjugate Priors. Prior and posterior have the same form: in p(θ|y) = p(y|θ) p(θ) / p(y), the posterior belongs to the same family as the prior - same form!! This yields an analytical expression for the posterior. Conjugate priors exist for all exponential family members. Example: Gaussian likelihood with a Gaussian prior over the mean.
12 Likelihood & prior for the Gaussian model y = θ + ε:
Likelihood: p(y|θ) = N(y; θ, λ_e⁻¹)
Prior: p(θ) = N(θ; μ_p, λ_p⁻¹)
Posterior: p(θ|y) = N(θ; μ, λ⁻¹) with
λ = λ_e + λ_p
μ = (λ_e/λ) y + (λ_p/λ) μ_p
Relative precision weighting.
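The precision-weighted update above can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not from the slides):

```python
# Hedged sketch: conjugate Gaussian update for y = theta + eps.
def gaussian_posterior(y, mu_p, lambda_p, lambda_e):
    """Posterior over theta, with prior N(mu_p, 1/lambda_p) and
    observation noise precision lambda_e."""
    lam = lambda_e + lambda_p                    # precisions add
    mu = (lambda_e * y + lambda_p * mu_p) / lam  # precision-weighted mean
    return mu, lam

mu, lam = gaussian_posterior(y=2.0, mu_p=0.0, lambda_p=1.0, lambda_e=3.0)
# The posterior mean lies between the prior mean (0) and the observation (2),
# pulled toward the more precise source: mu = 1.5, lam = 4.0
```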
13 Bayesian regression: univariate case. Univariate linear model y = xθ + ε.
Prior: p(θ) = N(θ; η_p, σ_p²)
Likelihood: p(y|θ) = N(y; xθ, σ_e²)
Posterior: p(θ|y) = N(θ; η_θ, σ_θ²) with
1/σ_θ² = x²/σ_e² + 1/σ_p²
η_θ = σ_θ² (x y / σ_e² + η_p / σ_p²)
Relative precision weighting of normal densities.
14 Bayesian GLM: multivariate case. General Linear Model y = Xθ + e, with normal densities:
Prior: p(θ) = N(θ; η_p, C_p)
Likelihood: p(y|θ) = N(y; Xθ, C_e)
Posterior: p(θ|y) = N(θ; η_θ, C_θ) with
C_θ = (Xᵀ C_e⁻¹ X + C_p⁻¹)⁻¹
η_θ = C_θ (Xᵀ C_e⁻¹ y + C_p⁻¹ η_p)
One step if C_e is known. Otherwise define a conjugate prior or perform iterative estimation with EM.
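The multivariate posterior above can be sketched with NumPy (toy data of our own making; this is not SPM code):

```python
import numpy as np

# Hedged sketch of the Bayesian GLM posterior for known error covariance C_e.
def bayesian_glm(y, X, C_e, eta_p, C_p):
    """Posterior N(eta, C) for y = X theta + e."""
    Ce_inv = np.linalg.inv(C_e)
    Cp_inv = np.linalg.inv(C_p)
    C = np.linalg.inv(X.T @ Ce_inv @ X + Cp_inv)   # posterior covariance
    eta = C @ (X.T @ Ce_inv @ y + Cp_inv @ eta_p)  # posterior mean
    return eta, C

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), np.linspace(0.0, 1.0, 50)])
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + 0.1 * rng.standard_normal(50)
eta, C = bayesian_glm(y, X, C_e=0.01 * np.eye(50),
                      eta_p=np.zeros(2), C_p=10.0 * np.eye(2))
# With a weak prior, eta is close to the generating parameters.
```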
15 Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
16 Bayesian model selection (BMS). Given competing hypotheses on structure & functional mechanisms of a system, which model is the best? Which model represents the best balance between model fit and model complexity? For which model m does p(y|m) become maximal? Pitt & Myung (2002), TICS
17 Bayesian model selection (BMS). Bayes' rule: p(θ|y, m) = p(y|θ, m) p(θ|m) / p(y|m). Model evidence: p(y|m) = ∫ p(y|θ, m) p(θ|m) dθ - accounts for both accuracy and complexity of the model, and allows for inference about structure (generalizability) of the model. Model comparison via the Bayes factor: BF = p(y|m1) / p(y|m2), with posterior odds p(m1|y)/p(m2|y) = [p(y|m1) p(m1)] / [p(y|m2) p(m2)]. Model averaging: p(θ|y) = Σ_m p(θ|y, m) p(m|y). Kass and Raftery (1995), Penny et al. (2004) NeuroImage.
BF_10 - evidence against H0: 1 to 3 - not worth more than a bare mention; 3 to 20 - positive; 20 to 150 - strong; > 150 - decisive.
18 Model Evidence. MacKay 1992, Neural Computation; Bishop, PRML 2006
19 Bayesian model selection (BMS). Various approximations to the log model evidence ln p(y|m):
Akaike Information Criterion (AIC), Akaike 1974: ln p(y|m) ≈ ln p(y|θ_ML) − M
Bayesian Information Criterion (BIC), Schwarz 1978: ln p(y|m) ≈ ln p(y|θ_ML) − (M/2) ln N
(M = number of parameters, N = number of data points)
Negative free energy (F) - a by-product of Variational Bayes
Path sampling (thermodynamic integration) - MCMC
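As a concrete reading of the two criteria (a toy sketch; the function names and numbers are ours):

```python
import numpy as np

# Hedged sketch of the AIC/BIC approximations to ln p(y|m).
def aic(loglik_ml, M):
    # ln p(y|m) ≈ ln p(y|theta_ML) − M
    return loglik_ml - M

def bic(loglik_ml, M, N):
    # ln p(y|m) ≈ ln p(y|theta_ML) − (M/2) ln N
    return loglik_ml - 0.5 * M * np.log(N)

# With the same maximized log-likelihood, BIC penalizes extra parameters
# more heavily than AIC once N > e^2 ≈ 7.4 data points.
print(aic(-100.0, M=3))          # -103.0
print(bic(-100.0, M=3, N=1000))  # ≈ -110.36
```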
20 Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
21 Approximate Bayesian inference. Bayesian inference formalizes model inversion, the process of passing from a prior to a posterior in light of data:
posterior p(θ|y) = likelihood p(y|θ) × prior p(θ) / marginal likelihood p(y), where p(y) = ∫ p(y, θ) dθ is the model evidence.
In practice, evaluating the posterior is usually difficult because we cannot easily evaluate p(y), especially when: the problem has high dimensionality or complex form; analytical solutions are not available; numerical integration is too expensive. Source: Kay H. Brodersen, 2013
22 Approximate Bayesian inference. There are two approaches to approximate inference; they have complementary strengths and weaknesses.
Deterministic approximate inference, in particular variational Bayes: find an analytical proxy q(θ) that is maximally similar to p(θ|y); inspect distribution statistics of q(θ) (e.g., mean, quantiles, intervals); often insightful and fast; often hard work to derive; converges to local minima.
Stochastic approximate inference, in particular sampling: design an algorithm that draws samples θ(1), ..., θ(m) from p(θ|y); inspect sample statistics (e.g., histogram, sample quantiles); asymptotically exact; computationally expensive; tricky engineering concerns. Source: Kay H. Brodersen, 2013
23 Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
24 The Laplace approximation. The Laplace approximation provides a way of approximating a density whose normalization constant we cannot evaluate, by fitting a Normal distribution to its mode:
p(z) = (1/Z) f(z), where Z is the normalization constant (unknown) and f(z) is the main part of the density (easy to evaluate).
This is exactly the situation we face in Bayesian inference: p(θ|y) = p(y, θ) / p(y), with the model evidence p(y) unknown and the joint density p(y, θ) easy to evaluate.
Pierre-Simon Laplace ( ), French mathematician and astronomer. Source: Kay H. Brodersen, 2013
25 Applying the Laplace approximation. Given a model with parameters θ = (θ1, ..., θp), the Laplace approximation reduces to a simple three-step procedure:
1. Find the mode of the log-joint: θ* = argmax_θ ln p(y, θ)
2. Evaluate the (negative) curvature of the log-joint at the mode: Λ = −∇∇ ln p(y, θ) |_{θ*}
3. We obtain a Gaussian approximation N(θ; μ, Λ⁻¹) with μ = θ*.
Source: Kay H. Brodersen, 2013
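The three steps can be sketched numerically for a one-dimensional example (a toy unnormalized density with a Gamma shape; grid search and finite differences stand in for a proper optimizer, and all names are ours):

```python
import numpy as np

# Toy log-joint, known only up to a constant: Gamma(a, b) shape in theta.
a, b = 3.0, 2.0
def log_joint(th):
    return (a - 1) * np.log(th) - b * th

# Step 1: find the mode (analytically (a-1)/b = 1.0; here by grid search).
thetas = np.linspace(0.01, 5.0, 5000)
mode = thetas[np.argmax(log_joint(thetas))]

# Step 2: negative curvature of the log-joint at the mode (finite differences).
h = 1e-4
curv = -(log_joint(mode + h) - 2 * log_joint(mode) + log_joint(mode - h)) / h**2

# Step 3: Gaussian approximation N(mode, 1/curv).
sigma2 = 1.0 / curv   # analytically mode^2 / (a-1) = 0.5
```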
26 Limitations of the Laplace approximation. The Laplace approximation is often too strong a simplification: it ignores global properties of the posterior (matching the log-joint only locally at the mode); it becomes brittle when the posterior is multimodal; it is only directly applicable to real-valued parameters. Source: Kay H. Brodersen, 2013
27 Variational Bayes. Goal: choose q from a hypothesis class such that KL(q‖p) is as close to 0 as possible. The log model evidence splits as ln p(y|m) = KL(q‖p) + F, where F is the free energy; candidate distributions q1, q2, ... in the hypothesis class are compared by how closely they approximate p(θ|y).
28 Variational calculus. Variational Bayesian inference is based on variational calculus.
Standard calculus (Newton, Leibniz, and others): functions f: x ↦ f(x); derivatives df/dx. Example: maximize the likelihood p(y|θ) w.r.t. θ.
Variational calculus (Euler, Lagrange, and others): functionals F: f ↦ F[f]; derivatives dF/df. Example: maximize the entropy H[p] w.r.t. a probability distribution p(x).
Leonhard Euler ( ), Swiss mathematician, "Elementa Calculi Variationum". Source: Kay H. Brodersen, 2013
29 Variational calculus and the free energy. Variational calculus lends itself nicely to approximate Bayesian inference:
ln p(y) = ln [ p(y, θ) / p(θ|y) ]
 = ∫ q(θ) ln [ p(y, θ) / p(θ|y) ] dθ   (since ∫ q(θ) dθ = 1)
 = ∫ q(θ) ln [ q(θ) / p(θ|y) ] dθ + ∫ q(θ) ln [ p(y, θ) / q(θ) ] dθ
 = KL[q‖p] + F(q, y)
where KL[q‖p] is the divergence between q(θ) and p(θ|y), and F(q, y) is the free energy. Source: Kay H. Brodersen, 2013
30 Variational calculus and the free energy. In summary, the log model evidence can be expressed as:
ln p(y) = KL[q‖p] + F(q, y)
with KL[q‖p] ≥ 0 (unknown) and F(q, y) the free energy (easy to evaluate for a given q). Since ln p(y) is fixed, maximizing F(q, y) is equivalent to: minimizing KL[q‖p]; tightening F(q, y) as a lower bound to the log model evidence. From initialization to convergence, F increases and the KL divergence shrinks. Source: Kay H. Brodersen, 2013
31 Computing the free energy. We can decompose the free energy F(q, y) as follows:
F(q, y) = ∫ q(θ) ln [ p(y, θ) / q(θ) ] dθ
 = ∫ q(θ) ln p(y, θ) dθ − ∫ q(θ) ln q(θ) dθ
 = ⟨ln p(y, θ)⟩_q + H[q]
i.e. the expected log-joint plus the Shannon entropy of q. Source: Kay H. Brodersen, 2013
32 The mean-field assumption. When inverting models with several parameters, a common way of restricting the class of approximate posteriors q(θ) is to consider those posteriors that factorize into independent partitions:
q(θ) = Π_i q_i(θ_i)
where q_i(θ_i) is the approximate posterior for the i-th subset of parameters, e.g. q(θ) = q(θ1) q(θ2). Jean Daunizeau, ~jdaunize/presentations/baes2.pdf; Source: Kay H. Brodersen, 2013
33 Variational inference under the mean-field assumption.
F(q, y) = ∫ q(θ) ln [ p(y, θ) / q(θ) ] dθ
Under the mean-field assumption q(θ) = Π_i q_i(θ_i), isolating one factor q_j:
F(q, y) = ∫ q_j ⟨ln p(y, θ)⟩_{q_{\j}} dθ_j − ∫ q_j ln q_j dθ_j − Σ_{i≠j} ∫ q_i ln q_i dθ_i
 = −KL[ q_j ‖ (1/Z) exp(⟨ln p(y, θ)⟩_{q_{\j}}) ] + c
where ⟨·⟩_{q_{\j}} denotes the expectation under all factors except q_j, and c collects terms that do not depend on q_j.
34 Variational algorithm under the mean-field assumption. In summary:
F(q, y) = −KL[ q_j ‖ (1/Z) exp(⟨ln p(y, θ)⟩_{q_{\j}}) ] + c
Suppose the densities q_{\j} ≡ q(θ_{\j}) are kept fixed. Then the approximate posterior q(θ_j) that maximizes F(q, y) is given by:
q_j* = argmax_{q_j} F(q, y) = (1/Z) exp(⟨ln p(y, θ)⟩_{q_{\j}})
Therefore: ln q_j* = ⟨ln p(y, θ)⟩_{q_{\j}} − ln Z =: I(θ_j) − ln Z.
This implies a straightforward algorithm for variational inference: initialize all approximate posteriors q(θ_i), e.g., by setting them to their priors; cycle over the parameters, revising each given the current estimates of the others; loop until convergence. Source: Kay H. Brodersen, 2013
35 Typical strategies in variational inference. Two orthogonal choices: whether to make the mean-field assumption q(θ) = Π_i q(θ_i), and whether to make parametric assumptions q(θ) = F(θ; δ).
No mean-field, no parametric assumptions: variational inference = exact inference.
Mean-field, no parametric assumptions: iterative free-form variational optimization.
No mean-field, parametric assumptions: fixed-form optimization of moments.
Mean-field, parametric assumptions: iterative fixed-form variational optimization.
Source: Kay H. Brodersen, 2013
36 Example: variational density estimation. We are given a univariate dataset y1, ..., yn which we model by a simple univariate Gaussian distribution. We wish to infer on its mean μ and precision τ: p(μ, τ|y). Generative model:
p(μ|τ) = N(μ; μ0, (λ0 τ)⁻¹)
p(τ) = Ga(τ; a0, b0)
p(y_i|μ, τ) = N(y_i; μ, τ⁻¹), i = 1, ..., n
Although in this case a closed-form solution exists, we shall pretend it does not. Instead, we consider approximations that satisfy the mean-field assumption: q(μ, τ) = q_μ(μ) q_τ(τ). Source: Kay H. Brodersen, 2013; Bishop (2006) PRML
37 Recurring expressions in Bayesian inference.
Univariate normal distribution: ln N(x; μ, λ⁻¹) = ½ ln λ − ½ ln 2π − (λ/2)(x − μ)² = −(λ/2) x² + λμx + c
Multivariate normal distribution: ln N_d(x; μ, Λ⁻¹) = ½ ln |Λ| − (d/2) ln 2π − ½ (x − μ)ᵀ Λ (x − μ) = −½ xᵀΛx + xᵀΛμ + c
Gamma distribution: ln Ga(x; a, b) = a ln b − ln Γ(a) + (a − 1) ln x − bx = (a − 1) ln x − bx + c
Source: Kay H. Brodersen, 2013
38 Variational density estimation: mean μ.
ln q_μ(μ) = ⟨ln p(y, μ, τ)⟩_{q_τ} + c
 = Σ_i ⟨ln N(y_i; μ, τ⁻¹)⟩_{q_τ} + ⟨ln N(μ; μ0, (λ0τ)⁻¹)⟩_{q_τ} + ⟨ln Ga(τ; a0, b0)⟩_{q_τ} + c
 = −(⟨τ⟩_{q_τ}/2) Σ_i (y_i − μ)² − (λ0 ⟨τ⟩_{q_τ}/2)(μ − μ0)² + c
 = −½ (n + λ0) ⟨τ⟩_{q_τ} μ² + (n ȳ + λ0 μ0) ⟨τ⟩_{q_τ} μ + c
Reinstatement by inspection (completing the square):
q_μ(μ) = N(μ; μ_n, λ_n⁻¹) with
λ_n = (λ0 + n) ⟨τ⟩_{q_τ}
μ_n = (λ0 μ0 + n ȳ) / (λ0 + n)
Source: Kay H. Brodersen, 2013
39 Variational density estimation: precision τ.
ln q_τ(τ) = ⟨ln p(y, μ, τ)⟩_{q_μ} + c
 = Σ_i ⟨ln N(y_i; μ, τ⁻¹)⟩_{q_μ} + ⟨ln N(μ; μ0, (λ0τ)⁻¹)⟩_{q_μ} + ⟨ln Ga(τ; a0, b0)⟩_{q_μ} + c
 = (n/2) ln τ − (τ/2) Σ_i ⟨(y_i − μ)²⟩_{q_μ} + ½ ln τ − (λ0 τ/2) ⟨(μ − μ0)²⟩_{q_μ} + (a0 − 1) ln τ − b0 τ + c
 = (a0 + (n + 1)/2 − 1) ln τ − [ b0 + ½ Σ_i ⟨(y_i − μ)²⟩_{q_μ} + (λ0/2) ⟨(μ − μ0)²⟩_{q_μ} ] τ + c
q_τ(τ) = Ga(τ; a_n, b_n) with
a_n = a0 + (n + 1)/2
b_n = b0 + ½ Σ_i ⟨(y_i − μ)²⟩_{q_μ} + (λ0/2) ⟨(μ − μ0)²⟩_{q_μ}
Source: Kay H. Brodersen, 2013
40 Variational density estimation: illustration. [Figure: contours of the true posterior p(μ, τ|y) and of the factorized approximation q_μ(μ) q_τ(τ) over successive updates.] Source: Kay H. Brodersen, 2013; Bishop (2006) PRML, p. 472
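The two coordinate updates of this example can be cycled in a few lines (a sketch following Bishop's treatment; the data and hyperparameter values are toy choices of ours, and the expectations ⟨(y_i − μ)²⟩ and ⟨(μ − μ0)²⟩ are expanded using E[μ] and E[μ²]):

```python
import numpy as np

# Hedged sketch: mean-field VB for a univariate Gaussian (mean mu, precision tau).
rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=500)
n, ybar = len(y), y.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0   # toy hyperparameters

E_tau = a0 / b0                           # initialize <tau>
for _ in range(20):                       # cycle updates until convergence
    # q(mu) = N(mu_n, 1/lam_n)
    mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
    lam_n = (lam0 + n) * E_tau
    E_mu, E_mu2 = mu_n, mu_n**2 + 1.0 / lam_n
    # q(tau) = Ga(a_n, b_n)
    a_n = a0 + (n + 1) / 2.0
    b_n = b0 + 0.5 * (np.sum(y**2 - 2 * y * E_mu + E_mu2)
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_n / b_n

# E[mu] approaches the sample mean; E[tau] approaches the inverse sample variance.
```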
41 TAPAS - Variational Bayes Linear Regression
42
43 Demo VB Linear Regression
44 Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
45 Markov Chain Monte Carlo (MCMC) sampling. A general framework for sampling from a large class of distributions; scales well with the dimensionality of the sample space; asymptotically convergent. A proposal distribution q(θ) generates a chain θ0 → θ1 → θ2 → ... → θN whose stationary distribution is the posterior p(θ|y). Andrei Markov ( ), Russian mathematician.
46 Markov chain properties.
Transition probabilities: p(θ^{t+1} | θ^1, ..., θ^t) = p(θ^{t+1} | θ^t) = T_t(θ^t, θ^{t+1}); the chain is homogeneous if T_t is the same for all t.
Invariance: p*(θ) = Σ_{θ'} T(θ', θ) p*(θ').
Detailed balance: T(θ, θ') p*(θ) = T(θ', θ) p*(θ'); a chain satisfying detailed balance is reversible, and p* is invariant.
Ergodicity: p*(θ) = lim_{n→∞} p(θ^n) irrespective of the initial distribution p(θ^0).
MCMC constructs a homogeneous, ergodic chain whose invariant distribution is the posterior.
47 Metropolis-Hastings Algorithm. Initialize θ^0 - for example, sample from the prior. At step t, sample a candidate from the proposal distribution: θ* ~ q(θ | θ^t). Accept with probability:
A(θ*, θ^t) = min( 1, [ p(θ*|y) q(θ^t|θ*) ] / [ p(θ^t|y) q(θ*|θ^t) ] )
Metropolis (symmetric proposal distribution): A(θ*, θ^t) = min( 1, p(θ*|y) / p(θ^t|y) ).
Bishop (2006) PRML, p. 539
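A minimal random-walk Metropolis sampler (toy target: a standard normal "posterior"; the step size 0.5 and the iteration counts are arbitrary choices of ours):

```python
import numpy as np

# Hedged sketch of random-walk Metropolis with a symmetric Gaussian proposal.
rng = np.random.default_rng(0)
def log_p(th):
    return -0.5 * th**2          # log target, up to a constant

theta = 0.0
samples = []
for t in range(20000):
    prop = theta + 0.5 * rng.standard_normal()   # theta* ~ q(.|theta^t)
    # Symmetric proposal -> A = min(1, p(theta*|y) / p(theta^t|y)).
    if np.log(rng.uniform()) < log_p(prop) - log_p(theta):
        theta = prop
    samples.append(theta)

samples = np.array(samples[5000:])               # discard burn-in
# Sample mean should be near 0 and sample std near 1.
```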
48 Gibbs Sampling Algorithm. A special case of Metropolis-Hastings. At step t, sample each component from its full conditional distribution:
θ1^{t+1} ~ p(θ1 | θ2^t, ..., θn^t)
θ2^{t+1} ~ p(θ2 | θ1^{t+1}, θ3^t, ..., θn^t)
 ...
θn^{t+1} ~ p(θn | θ1^{t+1}, ..., θ_{n−1}^{t+1})
The acceptance probability equals 1. Blocked sampling: update groups of variables jointly.
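As a concrete instance, Gibbs sampling for a bivariate normal with correlation ρ, where both full conditionals are univariate normals (a standard textbook toy example; the values are ours):

```python
import numpy as np

# Hedged sketch: Gibbs sampling for a standard bivariate normal, corr = rho.
rng = np.random.default_rng(0)
rho = 0.8
s = np.sqrt(1 - rho**2)          # conditional std: x_i | x_j ~ N(rho * x_j, 1 - rho^2)

x1, x2 = 0.0, 0.0
draws = []
for t in range(20000):
    x1 = rng.normal(rho * x2, s) # sample x1 | x2 (acceptance probability = 1)
    x2 = rng.normal(rho * x1, s) # sample x2 | x1
    draws.append((x1, x2))

draws = np.array(draws[2000:])   # discard burn-in
# The empirical correlation of the draws approaches rho.
```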
49 Posterior analysis from MCMC. Obtain approximately independent samples from the chain θ0 → θ1 → ... → θN: generate samples based on MCMC sampling; discard initial burn-in period samples to remove dependence on the initialization; thinning - select every m-th sample to reduce correlation. Then inspect sample statistics (e.g., histogram, sample quantiles).
50 MAP estimate via Simulated Annealing. Add a temperature parameter T and a schedule to update it. Algorithm: set T = 1; until convergence: for every K iterations, sample from p^{1/T}(θ|y), then reduce T (e.g. towards T = 0.1).
51 Convergence Analysis. Single-chain methods: Geweke (1992); Raftery-Lewis (1992). Multi-chain methods: Gelman-Rubin (1992), potential scale reduction factor.
52 Model evidence using MCMC. Importance sampling estimators: prior arithmetic mean - p(y) ≈ (1/N) Σ_i p(y|θ^(i)) with θ^(i) ~ p(θ); posterior harmonic mean - p(y) ≈ [ (1/N) Σ_i 1/p(y|θ^(i)) ]⁻¹ with θ^(i) ~ p(θ|y).
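The prior arithmetic-mean estimator can be checked on a toy model where the exact evidence is known: y ~ N(θ, 1) with θ ~ N(0, 1) gives p(y) = N(y; 0, 2). This construction is ours, not from the slides:

```python
import numpy as np

# Hedged sketch: Monte Carlo evidence via the prior arithmetic mean.
rng = np.random.default_rng(0)
y = 1.0

theta = rng.standard_normal(200000)                       # theta_i ~ p(theta)
lik = np.exp(-0.5 * (y - theta)**2) / np.sqrt(2 * np.pi)  # p(y|theta_i)
evidence_mc = lik.mean()                                  # (1/N) sum_i p(y|theta_i)

evidence_exact = np.exp(-y**2 / 4.0) / np.sqrt(4 * np.pi) # N(y; 0, 2)
# evidence_mc should match evidence_exact up to Monte Carlo error.
```

The posterior harmonic-mean estimator from the slide has the same form but averages reciprocal likelihoods over posterior samples; it is known to have high variance in practice.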
53 Thermodynamic Integration. Path sampling (thermodynamic integration): run chains at a sequence of temperatures T between 0 and 1 (e.g. T = 0.0, 0.5, 0.8, 1.0), each targeting the power posterior p_T(θ|y) ∝ p(y|θ)^T p(θ), and record the expected log-likelihood E_T(loglik) at each temperature. The log model evidence is the integral of E_T(loglik) over T from 0 to 1. Ogata, Y., Num. Math. 55; Gelman, A., Stat. Sci. 13.
54 Derivation - Extra
55 Other MCMC variants Slice Sampling Adaptive MH Hamiltonian Monte Carlo Population MCMC
56 TAPAS mpdcm - GPU-based MCMC
57 Demo MCMC Linear Regression
58 Summary. Bayesian Inference - Model Selection - Approximate Inference - Variational Bayes - MCMC Sampling
59 References.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning.
Barber, D. Bayesian Reasoning and Machine Learning.
MacKay, D. Information Theory, Inference and Learning Algorithms.
Murphy, K.P. Conjugate Bayesian analysis of the Gaussian distribution.
Videolectures.net: Bayesian inference, MCMC, Variational Bayes.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6.
Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American Statistical Association, 90.
Gamerman, D. and Lopes, H.F. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition (Chapman & Hall/CRC Texts in Statistical Science).
Statistical Parametric Mapping: The Analysis of Functional Brain Images.
Lartillot, N. and Philippe, H. (2006). Computing Bayes factors using thermodynamic integration. Systematic Biology, vol. 55, no. 2.
Gelman, A. and Rubin, D.B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4).
Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior moments. Bayesian Statistics 4.
Raftery, A.E. and Lewis, S. (1992). How many iterations in the Gibbs sampler? Bayesian Statistics 4, Oxford University Press.
60 Thank You
More informationINF Introduction to classifiction Anne Solberg
INF 4300 8.09.17 Introduction to classifiction Anne Solberg anne@ifi.uio.no Introduction to classification Based on handout from Pattern Recognition b Theodoridis, available after the lecture INF 4300
More informationBayesian Regression Linear and Logistic Regression
When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we
More informationA Review of Pseudo-Marginal Markov Chain Monte Carlo
A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the
More informationTutorial on Probabilistic Programming with PyMC3
185.A83 Machine Learning for Health Informatics 2017S, VU, 2.0 h, 3.0 ECTS Tutorial 02-04.04.2017 Tutorial on Probabilistic Programming with PyMC3 florian.endel@tuwien.ac.at http://hci-kdd.org/machine-learning-for-health-informatics-course
More informationHmms with variable dimension structures and extensions
Hmm days/enst/january 21, 2002 1 Hmms with variable dimension structures and extensions Christian P. Robert Université Paris Dauphine www.ceremade.dauphine.fr/ xian Hmm days/enst/january 21, 2002 2 1 Estimating
More informationDoing Bayesian Integrals
ASTR509-13 Doing Bayesian Integrals The Reverend Thomas Bayes (c.1702 1761) Philosopher, theologian, mathematician Presbyterian (non-conformist) minister Tunbridge Wells, UK Elected FRS, perhaps due to
More informationLabor-Supply Shifts and Economic Fluctuations. Technical Appendix
Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January
More informationST 740: Markov Chain Monte Carlo
ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:
More informationExpectation Propagation for Approximate Bayesian Inference
Expectation Propagation for Approximate Bayesian Inference José Miguel Hernández Lobato Universidad Autónoma de Madrid, Computer Science Department February 5, 2007 1/ 24 Bayesian Inference Inference Given
More informationIntroduction to Machine Learning CMU-10701
Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov
More informationVariational Bayes. A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M
A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M PD M = PD θ, MPθ Mdθ Lecture 14 : Variational Bayes where θ are the parameters of the model and Pθ M is
More informationBayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework
HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for
More informationLikelihood-free MCMC
Bayesian inference for stable distributions with applications in finance Department of Mathematics University of Leicester September 2, 2011 MSc project final presentation Outline 1 2 3 4 Classical Monte
More informationVariational Inference. Sargur Srihari
Variational Inference Sargur srihari@cedar.buffalo.edu 1 Plan of discussion We first describe inference with PGMs and the intractability of exact inference Then give a taxonomy of inference algorithms
More informationLecture 6: Model Checking and Selection
Lecture 6: Model Checking and Selection Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 27, 2014 Model selection We often have multiple modeling choices that are equally sensible: M 1,, M T. Which
More informationCSC 2541: Bayesian Methods for Machine Learning
CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationCSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection
CSC2535: Computation in Neural Networks Lecture 7: Variational Bayesian Learning & Model Selection (non-examinable material) Matthew J. Beal February 27, 2004 www.variational-bayes.org Bayesian Model Selection
More informationMarkov Chain Monte Carlo (MCMC) and Model Evaluation. August 15, 2017
Markov Chain Monte Carlo (MCMC) and Model Evaluation August 15, 2017 Frequentist Linking Frequentist and Bayesian Statistics How can we estimate model parameters and what does it imply? Want to find the
More informationBAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci
More informationKernel adaptive Sequential Monte Carlo
Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline
More informationA Bayesian Approach to Phylogenetics
A Bayesian Approach to Phylogenetics Niklas Wahlberg Based largely on slides by Paul Lewis (www.eeb.uconn.edu) An Introduction to Bayesian Phylogenetics Bayesian inference in general Markov chain Monte
More informationParameter Estimation. William H. Jefferys University of Texas at Austin Parameter Estimation 7/26/05 1
Parameter Estimation William H. Jefferys University of Texas at Austin bill@bayesrules.net Parameter Estimation 7/26/05 1 Elements of Inference Inference problems contain two indispensable elements: Data
More informationBayesian Models in Machine Learning
Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of
More informationIntroduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak
Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Recall: To compute the expectation E ( h(y ) ) we use the approximation E(h(Y )) 1 n n h(y ) t=1 with Y (1),..., Y (n) h(y). Thus our aim is to sample Y (1),..., Y (n) from f(y).
More informationChapter 11. Stochastic Methods Rooted in Statistical Mechanics
Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Neural Networks and Learning Machines (Haykin) Lecture Notes on Self-learning Neural Algorithms Byoung-Tak Zhang School of Computer Science
More informationCS281A/Stat241A Lecture 22
CS281A/Stat241A Lecture 22 p. 1/4 CS281A/Stat241A Lecture 22 Monte Carlo Methods Peter Bartlett CS281A/Stat241A Lecture 22 p. 2/4 Key ideas of this lecture Sampling in Bayesian methods: Predictive distribution
More information13: Variational inference II
10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational
More informationWill Penny. SPM short course for M/EEG, London 2013
SPM short course for M/EEG, London 2013 Ten Simple Rules Stephan et al. Neuroimage, 2010 Model Structure Bayes rule for models A prior distribution over model space p(m) (or hypothesis space ) can be updated
More informationMonte Carlo methods for sampling-based Stochastic Optimization
Monte Carlo methods for sampling-based Stochastic Optimization Gersende FORT LTCI CNRS & Telecom ParisTech Paris, France Joint works with B. Jourdain, T. Lelièvre, G. Stoltz from ENPC and E. Kuhn from
More informationLecture : Probabilistic Machine Learning
Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationVariational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures
17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter
More informationLecture 10. Announcement. Mixture Models II. Topics of This Lecture. This Lecture: Advanced Machine Learning. Recap: GMMs as Latent Variable Models
Advanced Machine Learning Lecture 10 Mixture Models II 30.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ Announcement Exercise sheet 2 online Sampling Rejection Sampling Importance
More informationLecture 6: Graphical Models: Learning
Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)
More information