3.1 Conditionals and marginals

For Bayesian analysis it is very useful to understand how to write joint, marginal, and conditional distributions for the multivariate normal. Given a vector $x \in \mathbb{R}^p$ the multivariate normal density is

$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right).$$

Now split the vector into two parts,

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$

where $x_1$ and $\mu_1$ are of size $q$, $x_2$ and $\mu_2$ are of size $p-q$, and the blocks $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$, $\Sigma_{22}$ are of sizes $q \times q$, $q \times (p-q)$, $(p-q) \times q$, and $(p-q) \times (p-q)$ respectively. We now state the joint and marginal distributions and the conditional density:

$$x \sim N(\mu, \Sigma), \quad x_1 \sim N(\mu_1, \Sigma_{11}), \quad x_2 \sim N(\mu_2, \Sigma_{22}),$$
$$x_1 \mid x_2 \sim N\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$

The same idea holds for other sizes of partitions.
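As a quick numerical companion to the conditioning formula, here is a minimal sketch (our illustration, not part of the notes; the dimensions, parameter values, and observed $x_2$ are arbitrary) that computes the mean and covariance of $x_1 \mid x_2$ from the blocks of $\mu$ and $\Sigma$:

```python
import numpy as np

# Partitioned multivariate normal: p = 3, q = 2, so x1 is 2-dim, x2 is 1-dim.
mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
q = 2

# Blocks of the partition.
mu1, mu2 = mu[:q], mu[q:]
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

x2 = np.array([2.5])  # observed value of the second block

# Conditional distribution x1 | x2 from the formulas above.
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean, cond_cov)
```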
3.2 Conjugate priors

3.2.1 Univariate normals

3.2.1.1 Fixed variance, random mean. We consider the parameter $\sigma^2$ fixed, so we are interested in the conjugate prior for $\mu$:

$$\pi(\mu \mid \mu_0, \sigma_0^2) \propto \frac{1}{\sigma_0} \exp\left(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\right),$$

where $\mu_0$ and $\sigma_0^2$ are hyper-parameters for the prior distribution (when we don't have informative prior knowledge we typically consider $\mu_0 = 0$ and $\sigma_0^2$ large).

The posterior distribution for $x_1, \dots, x_n$ with a univariate normal likelihood and the above prior will be

$$\mathrm{Post}(\mu \mid x_1, \dots, x_n) = N\left(\frac{\sigma_0^2}{\frac{\sigma^2}{n} + \sigma_0^2}\,\bar{x} + \frac{\frac{\sigma^2}{n}}{\frac{\sigma^2}{n} + \sigma_0^2}\,\mu_0,\ \left(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\right)^{-1}\right).$$

3.2.1.2 Fixed mean, random variance. We will formulate this setting with two parameterizations of the scale parameter: (1) the variance $\sigma^2$, (2) the precision $\tau = \sigma^{-2}$. The two conjugate distributions are the Gamma and the inverse Gamma (really they are the same distribution, just reparameterized):

$$\mathrm{IG}(\alpha, \beta): \quad f(\sigma^2) = \frac{\beta^\alpha}{\Gamma(\alpha)} (\sigma^2)^{-\alpha-1} \exp\left(-\frac{\beta}{\sigma^2}\right),$$
$$\mathrm{Ga}(\alpha, \beta): \quad f(\tau) = \frac{\beta^\alpha}{\Gamma(\alpha)} \tau^{\alpha-1} \exp(-\beta\tau).$$

The posterior distribution of $\sigma^2$ is

$$\sigma^2 \mid x_1, \dots, x_n \sim \mathrm{IG}\left(\alpha + \frac{n}{2},\ \beta + \frac{\sum_i (x_i - \mu)^2}{2}\right).$$

The posterior distribution of $\tau$ is, not surprisingly,

$$\tau \mid x_1, \dots, x_n \sim \mathrm{Ga}\left(\alpha + \frac{n}{2},\ \beta + \frac{\sum_i (x_i - \mu)^2}{2}\right).$$

3.2.1.3 Random mean, random variance. We now put the previous priors together in what is called a Bayesian hierarchical model:

$$x_i \mid \mu, \tau \overset{iid}{\sim} N(\mu, \tau^{-1}),$$
$$\mu \mid \tau \sim N(\mu_0, (\kappa_0 \tau)^{-1}),$$
$$\tau \sim \mathrm{Ga}(\alpha, \beta).$$

For the above likelihood and priors the posterior distribution for the mean and precision is

$$\mu \mid \tau, x_1, \dots, x_n \sim N\left(\frac{\mu_0 \kappa_0 + n\bar{x}}{n + \kappa_0},\ \big(\tau(n + \kappa_0)\big)^{-1}\right),$$
$$\tau \mid x_1, \dots, x_n \sim \mathrm{Ga}\left(\alpha + \frac{n}{2},\ \beta + \frac{\sum_i (x_i - \bar{x})^2}{2} + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(n + \kappa_0)}\right).$$
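The conjugate updates above are simple enough to implement directly. The following sketch (a minimal illustration under our own function name and toy data, not code from the notes) computes the posterior parameters of the normal-Gamma model:

```python
import numpy as np

def normal_gamma_posterior(x, mu0, kappa0, alpha, beta):
    """Conjugate update for the normal-Gamma hierarchical model above."""
    n = len(x)
    xbar = np.mean(x)
    # Posterior parameters for mu | tau (normal) and tau (Gamma).
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_n = kappa0 + n
    alpha_n = alpha + n / 2
    beta_n = (beta + 0.5 * np.sum((x - xbar) ** 2)
              + kappa0 * n * (xbar - mu0) ** 2 / (2 * (kappa0 + n)))
    return mu_n, kappa_n, alpha_n, beta_n

# Toy data: posterior after observing 20 draws from N(1, 1).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=20)
print(normal_gamma_posterior(x, mu0=0.0, kappa0=1.0, alpha=2.0, beta=2.0))
```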
3.2.2 Multivariate normal. Given a vector $x \in \mathbb{R}^p$ the multivariate normal density is

$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)\right).$$

We will work with the precision matrix instead of the covariance, and we will consider the following Bayesian hierarchical model:

$$x_i \mid \mu, \Lambda \overset{iid}{\sim} N(\mu, \Lambda^{-1}),$$
$$\mu \mid \Lambda \sim N(\mu_0, (\kappa_0 \Lambda)^{-1}),$$
$$\Lambda \sim \mathrm{Wi}(\Lambda_0, n_0),$$

where the precision matrix is modeled using the Wishart distribution

$$f(\Lambda; V, n) = \frac{|\Lambda|^{(n-d-1)/2} \exp\left(-\frac{1}{2}\mathrm{tr}(\Lambda V^{-1})\right)}{2^{nd/2}\, |V|^{n/2}\, \Gamma_d(n/2)}.$$

For the above likelihood and priors the posterior distribution for the mean and precision is

$$\mu \mid \Lambda, x_1, \dots, x_n \sim N\left(\frac{\mu_0 \kappa_0 + n\bar{x}}{n + \kappa_0},\ \big(\Lambda(n + \kappa_0)\big)^{-1}\right),$$
$$\Lambda \mid x_1, \dots, x_n \sim \mathrm{Wi}\left(n_0 + \frac{n}{2},\ \Lambda_0 + \frac{1}{2}\left[S + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T\right]\right),$$

where $S = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ is the scatter matrix of the data.
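A matching sketch for the multivariate case (again our own function and toy data, not code from the notes; it follows the Gamma-style parameterization used above, in which the second Wishart argument plays the role of a rate, i.e. inverse-scale, matrix):

```python
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, Lambda0, n0):
    """Posterior parameters for the normal-Wishart model above."""
    n, d = X.shape
    xbar = X.mean(axis=0)
    # Scatter matrix S = sum_i (x_i - xbar)(x_i - xbar)^T.
    centered = X - xbar
    S = centered.T @ centered
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_n = kappa0 + n
    df_n = n0 + n / 2
    dev = (xbar - mu0).reshape(-1, 1)
    rate_n = Lambda0 + 0.5 * (S + kappa0 * n / (kappa0 + n) * (dev @ dev.T))
    return mu_n, kappa_n, df_n, rate_n

# Toy example in d = 2 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
print(normal_wishart_posterior(X, mu0=np.zeros(2), kappa0=1.0,
                               Lambda0=np.eye(2), n0=4.0))
```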
LECTURE 4

A Bayesian approach to linear regression

The main motivations behind a Bayesian formalism for inference are a coherent approach to modeling uncertainty as well as an axiomatic framework for inference. We will reformulate multivariate linear regression from a Bayesian formulation in this section. Bayesian inference involves thinking in terms of probability distributions and conditional distributions. One important idea is that of a conjugate prior. Another tool we will use extensively in this class is the multivariate normal distribution and its properties.

4.1 Conjugate priors

Given a likelihood function $p(x \mid \theta)$ and a prior $\pi(\theta)$ one can write the posterior as

$$p(\theta \mid x) = \frac{p(x \mid \theta)\,\pi(\theta)}{\int p(x \mid \theta')\,\pi(\theta')\,d\theta'} = \frac{p(x, \theta)}{p(x)},$$

where $p(x)$ is the marginal density for the data and $p(x, \theta)$ is the joint density of the data and the parameter $\theta$. The idea of a prior and likelihood being conjugate is that the prior and the posterior densities belong to the same family. We now state some examples to illustrate this idea.

Beta, Binomial: Consider the Binomial likelihood with $n$ (the number of trials) fixed,

$$f(x \mid p, n) = \binom{n}{x} p^x (1-p)^{n-x},$$

where the parameter of interest (the probability of a success) is $p \in [0, 1]$. A natural prior distribution for $p$ is the Beta distribution, which has density

$$\pi(p; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1} (1-p)^{\beta-1}, \quad p \in (0, 1) \text{ and } \alpha, \beta > 0,$$

where $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}$ is a normalization constant. Given the prior and the likelihood densities, the posterior density modulo normalizing constants will take the form

$$f(p \mid x) \propto \left[\binom{n}{x} p^x (1-p)^{n-x}\right] \left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha-1}(1-p)^{\beta-1}\right] \propto p^{x+\alpha-1}(1-p)^{n-x+\beta-1},$$

which means that the posterior distribution of $p$ is also a Beta with

$$p \mid x \sim \mathrm{Beta}(\alpha + x,\ \beta + n - x).$$
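A two-line computation realizes this update; the sketch below (our illustration, with arbitrary hyper-parameters and counts) uses SciPy's Beta distribution to summarize the posterior:

```python
from scipy import stats

# Beta-Binomial conjugacy: prior Beta(alpha, beta), x successes in n trials.
alpha, beta = 2.0, 2.0
n, x = 20, 14

# Posterior is Beta(alpha + x, beta + n - x), as derived above.
posterior = stats.beta(alpha + x, beta + n - x)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```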
Normal, Normal: Given a normal distribution with unknown mean, the density for the likelihood is

$$f(x \mid \theta, \sigma^2) \propto \exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2\right),$$

and one can specify a normal prior

$$\pi(\theta; \theta_0, \tau_0^2) \propto \exp\left(-\frac{1}{2\tau_0^2}(\theta - \theta_0)^2\right),$$

with hyper-parameters $\theta_0$ and $\tau_0$. The resulting posterior distribution will have the following density function

$$f(\theta \mid x) \propto \exp\left(-\frac{1}{2\sigma^2}(x - \theta)^2\right) \exp\left(-\frac{1}{2\tau_0^2}(\theta - \theta_0)^2\right),$$

which after completing squares and reordering can be written as

$$\theta \mid x \sim N(\bar{\theta}, \bar{\tau}^2), \quad \bar{\theta} = \frac{\frac{\theta_0}{\tau_0^2} + \frac{x}{\sigma^2}}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}}, \quad \bar{\tau}^2 = \left(\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}\right)^{-1}.$$

4.2 Bayesian linear regression

We start with the likelihood as

$$f(y \mid X, \beta, \sigma^2) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right)$$

and the prior as

$$\pi(\beta) = \frac{1}{(2\pi)^{p/2}|\Gamma|^{1/2}} \exp\left(-\frac{1}{2\tau_0^2}\beta^T\beta\right), \quad \Gamma = \tau_0^2 I_p,$$

that is, $\beta \sim N(0, \tau_0^2 I_p)$. The density of the posterior is

$$\mathrm{Post}(\beta \mid D) \propto \left[\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - \beta^T x_i)^2}{2\sigma^2}\right)\right] \exp\left(-\frac{1}{2\tau_0^2}\beta^T\beta\right).$$

With a good bit of manipulation the above can be rewritten as a multivariate normal distribution $\beta \mid Y, X, \sigma^2 \sim N_p(\mu, \Sigma)$ with

$$\Sigma = \left(\tau_0^{-2} I_p + \sigma^{-2} X^T X\right)^{-1}, \quad \mu = \sigma^{-2}\,\Sigma\, X^T Y.$$

Note the similarities of the above distribution to the MAP estimator; as an exercise, relate the mean of the above posterior to the MAP estimator.
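Here is a minimal sketch of the posterior computation (our function name and simulated data, not from the notes):

```python
import numpy as np

def blr_posterior(X, Y, sigma2, tau02):
    """Posterior N_p(mu, Sigma) for beta in Bayesian linear regression."""
    p = X.shape[1]
    # Sigma = (tau0^{-2} I_p + sigma^{-2} X^T X)^{-1}
    Sigma = np.linalg.inv(np.eye(p) / tau02 + (X.T @ X) / sigma2)
    # mu = sigma^{-2} Sigma X^T Y
    mu = Sigma @ X.T @ Y / sigma2
    return mu, Sigma

# Toy regression problem.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.5, size=100)
mu, Sigma = blr_posterior(X, Y, sigma2=0.25, tau02=10.0)
print(mu)  # close to beta_true, shrunk slightly toward zero by the prior
```

For the exercise above, note that $\mu = (X^T X + \frac{\sigma^2}{\tau_0^2} I_p)^{-1} X^T Y$, i.e. the posterior mean is exactly the ridge-regression (MAP) estimate with regularization parameter $\lambda = \sigma^2/\tau_0^2$.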
Predictive distribution: Given data $D = \{(x_i, y_i)\}_{i=1}^n$ and a new value $x$, one would like to estimate $y$. This can be done using the posterior and is called the posterior predictive distribution

$$f(y \mid D, x, \sigma^2, \tau_0^2) = \int_{\mathbb{R}^p} f(y \mid x, \beta, \sigma^2)\, f(\beta \mid Y, X, \sigma^2, \tau_0^2)\, d\beta,$$

where with some manipulation

$$y \mid D, x, \sigma^2, \tau_0^2 \sim N(\mu_*, \sigma_*^2),$$

where

$$\mu_* = \sigma^{-2}\, x^T \Sigma X^T Y, \quad \sigma_*^2 = \sigma^2 + x^T \Sigma x.$$
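The predictive mean and variance can be computed in a few lines; this sketch (our naming, reusing the simulated data from the previous example) evaluates both at a new input:

```python
import numpy as np

def predictive(x_new, X, Y, sigma2, tau02):
    """Posterior predictive mean and variance at a new input x_new."""
    p = X.shape[1]
    Sigma = np.linalg.inv(np.eye(p) / tau02 + (X.T @ X) / sigma2)
    # Predictive mean: sigma^{-2} x^T Sigma X^T Y (equal to x^T mu).
    mean = x_new @ Sigma @ X.T @ Y / sigma2
    # Predictive variance: noise variance plus parameter uncertainty.
    var = sigma2 + x_new @ Sigma @ x_new
    return mean, var

# Reusing the toy data from the previous sketch.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)
print(predictive(np.array([0.5, 0.5, 0.5]), X, Y, sigma2=0.25, tau02=10.0))
```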