Lecture 2: Priors and Conjugacy


Lecture 2: Priors and Conjugacy
Melih Kandemir (melih.kandemir@iwr.uni-heidelberg.de)
May 6, 2014

Some nice courses:
- Fred A. Hamprecht (Heidelberg U.): https://www.youtube.com/watch?v=j66rrnzzkow
- Michael I. Jordan (U. Berkeley): http://www.cs.berkeley.edu/~jordan/courses/260-spring10/
- David Blei (U. Princeton): http://www.cs.princeton.edu/courses/archive/fall07/cos597c/syllabus.html
- Andrew Ng (Stanford U.): https://www.youtube.com/watch?v=uzxylbk2c7e
- Taylan Cemgil (Bogazici U.): http://dl.dropboxusercontent.com/u/9787379/cmpe58k/index.html

Maximum likelihood estimation
Given observed data X and the assumption that X ~ p(X | θ), the maximum likelihood estimate (MLE) is
θ̂ = argmax_θ p(X | θ)
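As a concrete illustration (not from the slides), a minimal Python sketch of the MLE for a Bernoulli parameter; the data values are made up:

    import numpy as np

    # Hypothetical coin-toss data: 1 = heads, 0 = tails.
    X = np.array([1, 0, 1, 1, 0, 1, 1, 1])

    # For Bernoulli(theta), the log-likelihood is
    #   log p(X | theta) = K*log(theta) + (N - K)*log(1 - theta),
    # which is maximized at theta_hat = K / N.
    theta_hat = X.mean()
    print("MLE of heads probability:", theta_hat)  # 0.75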

What are priors for?
- To encode prior beliefs.
- To avoid overfitting.
- To control model complexity: i) inducing sparsity (regularization), ii) via the marginal likelihood.
- To allow marginalization of model parameters (the parameter is represented as a distribution, not a point estimate).

Overfitting (figure; from Bishop, Pattern Recognition and Machine Learning)

Types of priors (from Z. Ghahramani's lecture):
- Objective priors: noninformative priors that attempt to capture ignorance and have good frequentist properties.
- Subjective priors: priors should capture our beliefs as well as possible. They are subjective but not arbitrary. The key ingredient of Bayesian methods is not the prior, it is the idea of averaging over different possibilities.
- Hierarchical priors: multiple levels of priors, p(θ) = ∫ p(θ | α) p(α) dα, where p(α) is called a hyperprior.
- Empirical priors: learn some of the parameters of the prior from the data (i.e. Empirical Bayes!)

Empirical priors (from Z. Ghahramani's lecture)
Given: p(D | α) = ∫ p(D | θ) p(θ | α) dθ, where α is the vector of hyperparameters.
Estimation: α̂ = argmax_α p(D | α). This method is called Type II Maximum Likelihood.
Prediction: p(x | D, α̂) = ∫ p(x | θ) p(θ | D, α̂) dθ
Plus: tuning the prior belief to the data.
Minus: double counting of the data, hence overfitting.
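A minimal sketch (not from the slides) of Type II Maximum Likelihood for a Beta prior shared across several hypothetical coins, using NumPy/SciPy; all data values and starting points are assumptions:

    import numpy as np
    from scipy.special import betaln, gammaln
    from scipy.optimize import minimize

    # Hypothetical data: several coins, each tossed N_i times with K_i heads.
    # Each coin has its own theta_i, assumed drawn from a shared Beta(alpha, beta) prior.
    K = np.array([3, 7, 5, 9, 2])
    N = np.array([10, 10, 10, 10, 10])

    def neg_log_marginal(log_ab):
        # -log p(D | alpha, beta) = -sum_i log BetaBinomial(K_i | N_i, alpha, beta),
        # parameterized by log(alpha), log(beta) to keep both positive.
        a, b = np.exp(log_ab)
        log_binom = gammaln(N + 1) - gammaln(K + 1) - gammaln(N - K + 1)
        return -np.sum(log_binom + betaln(a + K, b + N - K) - betaln(a, b))

    res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0]))
    alpha_hat, beta_hat = np.exp(res.x)
    print("Type II ML hyperparameters:", alpha_hat, beta_hat)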

Bayesian Inference
Posterior: p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ
Prediction: p(x | D) = ∫ p(x | θ) p(θ | D) dθ
How can we calculate the integrals above?
- Approximate: Variational Bayes, MCMC, Laplace.
- Closed-form solution: conjugate priors.
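To make the two integrals concrete, here is a minimal grid-approximation sketch in Python (not from the slides; the coin data and the uniform prior are assumptions for illustration):

    import numpy as np

    # Hypothetical coin data: K heads in N tosses; Bernoulli likelihood.
    N, K = 20, 14

    # Brute-force grid approximation of the integrals over theta.
    theta = np.linspace(1e-4, 1 - 1e-4, 2000)
    dtheta = theta[1] - theta[0]
    prior = np.ones_like(theta)                    # uniform prior on [0, 1]
    likelihood = theta**K * (1 - theta)**(N - K)   # p(D | theta)

    # Posterior: p(theta | D) = p(D | theta) p(theta) / integral of the numerator.
    unnorm = likelihood * prior
    posterior = unnorm / (unnorm.sum() * dtheta)

    # Predictive probability of heads on the next toss:
    # p(x = 1 | D) = integral of theta * p(theta | D) dtheta.
    p_next_heads = (theta * posterior).sum() * dtheta
    print("Predictive P(heads | D):", p_next_heads)  # ~ (K + 1)/(N + 2) = 0.682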

Conjugate prior
If p(θ | D) is in the same family as p(θ), then p(θ) is called a conjugate prior for the likelihood p(D | θ). (H. Raiffa and R. Schlaifer, Applied Statistical Decision Theory, 1961.)

Beta distribution
PDF: Beta(x | α, β) = x^(α-1) (1 - x)^(β-1) / ∫_0^1 t^(α-1) (1 - t)^(β-1) dt
Mean: E[x] = α / (α + β)
E[ln x] = ψ(α) - ψ(α + β), where ψ(·) is the digamma function.
Variance: var[x] = αβ / ((α + β)^2 (α + β + 1))
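A quick numerical check of these moment formulas with SciPy (a sketch, not from the slides; the values of α and β are arbitrary):

    import numpy as np
    from scipy.stats import beta
    from scipy.special import digamma

    a, b = 2.0, 5.0
    print(beta.mean(a, b), a / (a + b))                         # both 0.2857...
    print(beta.var(a, b), a * b / ((a + b)**2 * (a + b + 1)))   # both 0.0255...

    # E[ln x] = psi(a) - psi(a + b), checked by Monte Carlo.
    samples = beta.rvs(a, b, size=200_000, random_state=0)
    print(np.log(samples).mean(), digamma(a) - digamma(a + b))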

PDF of the Beta distribution (figure; from http://www.ntrand.com/beta-distribution/)

Binomial distribution
For x ∈ {0, …, n},
Bin(x | n, p) = (n choose x) p^x (1 - p)^(n - x)
- n: number of repetitions of a binary experiment.
- p: success probability.
- x: number of successes in n trials.
Bernoulli distribution: Bin(x | 1, p).

Beta-Binomial conjugacy
Data: N binary observations X = [x_1, x_2, …, x_N] (e.g. N i.i.d. coin tosses with K heads).
Problem: analyze how fair the coin is (i.e. what is the true heads probability?).
Model:
- Likelihood: p(D | θ) = Bin(x = K | N, θ)
- Prior: p(θ) = Beta(θ | α, β)
- Posterior: p(θ | D) = Beta(θ | α + K, β + N - K)
Posterior mean: E[θ | D] = (α + K) / (α + β + N)
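A minimal sketch of this update in Python (the coin counts and the Beta(2, 2) prior are assumptions for illustration):

    from scipy.stats import beta

    # Hypothetical coin data and a Beta(2, 2) prior.
    N, K = 20, 14
    alpha0, beta0 = 2.0, 2.0

    # Conjugate update: Beta prior + Binomial likelihood -> Beta posterior.
    alpha_post, beta_post = alpha0 + K, beta0 + (N - K)
    posterior = beta(alpha_post, beta_post)

    print("Posterior mean:", posterior.mean())            # (alpha0 + K)/(alpha0 + beta0 + N) = 0.667
    print("95% credible interval:", posterior.interval(0.95))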

Dirichlet distribution
PDF: Dir(x_1, …, x_K | α) = (1 / B(α)) ∏_{i=1}^K x_i^(α_i - 1)
where α = [α_1, …, α_K] and B(α) = ∏_{i=1}^K Γ(α_i) / Γ(∑_{i=1}^K α_i).
Mean: E[x_i] = α_i / ∑_{k=1}^K α_k
E[ln x_i] = ψ(α_i) - ψ(∑_{i=1}^K α_i)

Dirichlet distribution (2) (figure; from http://projects.csail.mit.edu/church/wiki/models_with_Unbounded_Complexity)

Multinomial distribution
For x_i ∈ {0, …, N}, ∑_{i=1}^K x_i = N, and ∑_{i=1}^K p_i = 1,
Mult(x_1, …, x_K | N, p_1, …, p_K) = (N! / (x_1! ⋯ x_K!)) p_1^(x_1) p_2^(x_2) ⋯ p_K^(x_K)

Dirichlet-multinomial conjugacy
Let X = [x_1, …, x_N] be N i.i.d. draws from a multinomial distribution. The likelihood is
P(X | θ) = θ_1^(∑_{j=1}^N I(x_j = 1)) ⋯ θ_K^(∑_{j=1}^N I(x_j = K))
When the Dirichlet prior
P(θ | α) ∝ θ_1^(α_1 - 1) ⋯ θ_K^(α_K - 1)
is applied, the posterior becomes
P(θ | X, α) ∝ θ_1^(∑_{j=1}^N I(x_j = 1) + α_1 - 1) ⋯ θ_K^(∑_{j=1}^N I(x_j = K) + α_K - 1)

Dirichlet-multinomial conjugacy (2) (from M. I. Jordan's lecture 4)
The posterior mean is
E[θ_i | x, α] = (α_i + ∑_{j=1}^N I(x_j = i)) / (N + ∑_{l=1}^K α_l)
             = κ · α_i / ∑_{l=1}^K α_l + (1 - κ) · x̄_i,
where x̄_i = (1/N) ∑_{j=1}^N I(x_j = i) is the maximum likelihood estimator, α_i / ∑_l α_l is the prior mean, and κ = ∑_l α_l / (N + ∑_l α_l) ∈ (0, 1).
Implications:
- The posterior mean is a convex combination of the MLE and the prior mean.
- κ → 0 as N → ∞.
- E[θ_i | x, α] is a shrinkage estimator!
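A small numerical sketch of the shrinkage identity above (the counts and the Dirichlet prior are assumed for illustration):

    import numpy as np

    # Hypothetical categorical data (K = 3 categories) and a symmetric Dirichlet prior.
    counts = np.array([12, 5, 3])          # sum_j I(x_j = i) for each category i
    alpha = np.array([2.0, 2.0, 2.0])
    N = counts.sum()

    # Posterior is Dirichlet(alpha + counts); its mean shrinks the MLE toward the prior mean.
    posterior_mean = (alpha + counts) / (N + alpha.sum())

    mle = counts / N
    prior_mean = alpha / alpha.sum()
    kappa = alpha.sum() / (N + alpha.sum())
    print(posterior_mean)
    print(kappa * prior_mean + (1 - kappa) * mle)   # identical to posterior_mean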

Logarithm of the normal distribution
N(x | μ, Σ) = (2π)^(-D/2) |Σ|^(-1/2) exp(-(1/2) (x - μ)^T Σ^(-1) (x - μ))
log N(x | μ, Σ) = -(D/2) log 2π - (1/2) log |Σ| - (1/2) (x - μ)^T Σ^(-1) (x - μ)
= (1/2) log |Σ^(-1)| - (1/2) x^T Σ^(-1) x - (1/2) μ^T Σ^(-1) μ + x^T Σ^(-1) μ + const
= (1/2) log |Σ^(-1)| - (1/2) tr(Σ^(-1) x x^T) - (1/2) tr(Σ^(-1) μ μ^T) + x^T Σ^(-1) μ + const
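A quick NumPy check (a sketch, not part of the slides) of the trace identity and the expanded log-density, with randomly generated μ, Σ, and x:

    import numpy as np
    from scipy.stats import multivariate_normal

    # Numerically verify the identity x^T A x = tr(A x x^T) used above.
    rng = np.random.default_rng(0)
    D = 3
    x = rng.normal(size=D)
    A = rng.normal(size=(D, D)); A = A @ A.T        # symmetric positive definite
    print(x @ A @ x, np.trace(A @ np.outer(x, x)))  # the two numbers agree

    # Check the expanded log-density against scipy's implementation.
    mu = rng.normal(size=D)
    Sigma = np.eye(D) + 0.1 * A
    Lam = np.linalg.inv(Sigma)
    logpdf = (-0.5 * D * np.log(2 * np.pi) + 0.5 * np.linalg.slogdet(Lam)[1]
              - 0.5 * np.trace(Lam @ np.outer(x, x)) - 0.5 * np.trace(Lam @ np.outer(mu, mu))
              + x @ Lam @ mu)
    print(logpdf, multivariate_normal(mu, Sigma).logpdf(x))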

Normal distribution with known variance (one sample)
The model:
x | μ, σ² ~ N(x | μ, σ²)
μ ~ N(μ | μ₀, σ₀²)
σ² known
The posterior is then
p(μ | x, σ²) = N(μ | (x/σ² + μ₀/σ₀²) / (1/σ² + 1/σ₀²), (1/σ² + 1/σ₀²)^(-1))
Derivation of this result is to be done on the whiteboard.

Normal distribution with known variance (multiple samples)
The model:
x_1, …, x_N | μ, σ² ~ N(x | μ, σ²)
μ ~ N(μ | μ₀, σ₀²)
σ² known
The posterior is then
p(μ | x, σ²) = N(μ | (∑_{i=1}^N x_i / σ² + μ₀/σ₀²) / (N/σ² + 1/σ₀²), (N/σ² + 1/σ₀²)^(-1))
Derivation of this result is to be done on the whiteboard.
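A minimal sketch of this update (the data are generated with an assumed true mean of 3.0; the prior values are also assumptions):

    import numpy as np

    # Hypothetical data from a normal with known variance sigma2.
    rng = np.random.default_rng(1)
    sigma2 = 4.0
    x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=50)
    N = len(x)

    # Prior N(mu | mu0, s0sq).
    mu0, s0sq = 0.0, 10.0

    # Conjugate update for the mean: precisions add, means are precision-weighted.
    post_var = 1.0 / (N / sigma2 + 1.0 / s0sq)
    post_mean = post_var * (x.sum() / sigma2 + mu0 / s0sq)
    print("Posterior mean and variance:", post_mean, post_var)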

Gamma distribution
PDF: G(x | a, b) = x^(a-1) e^(-bx) / Z(a, b)
Mean: E[x] = a / b

Gamma distribution (2)

Normal distribution with known mean, unknown variance
The model:
x_1, …, x_N | μ, σ² ~ N(x | μ, σ²) = N(x | μ, (τ²)^(-1))
μ known
τ² ~ G(τ² | a, b), where τ² = 1/σ² is the precision.
The posterior is then
p(τ² | x, μ) = G(τ² | a + N/2, b + (1/2) ∑_{i=1}^N (x_i - μ)²).
Derivation of this result is to be done on the whiteboard.
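A minimal sketch of this precision update (the data, prior values, and the "true" variance are assumptions; note that SciPy's gamma uses scale = 1/b):

    import numpy as np
    from scipy.stats import gamma

    # Hypothetical data with known mean mu and unknown variance.
    rng = np.random.default_rng(2)
    mu, true_sigma = 1.0, 2.0
    x = rng.normal(loc=mu, scale=true_sigma, size=100)
    N = len(x)

    # Gamma(a, b) prior on the precision tau2 = 1/sigma^2.
    a, b = 1.0, 1.0
    a_post = a + N / 2.0
    b_post = b + 0.5 * np.sum((x - mu) ** 2)

    post = gamma(a=a_post, scale=1.0 / b_post)
    print("Posterior mean of the precision:", post.mean())  # close to 1/true_sigma^2 = 0.25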

Wishart distribution
PDF: W(X | V, ν) = (1 / Z(V, ν)) |X|^((ν - D - 1)/2) e^(-(1/2) tr(V^(-1) X))
Mean: νV
Relation to the normal distribution:
s_1, …, s_ν ~ N(s | 0, V), S = [s_1; s_2; …; s_ν], X = S^T S

Multivariate normal distribution with unknown covariance
The model:
x_1, …, x_N | μ, Σ ~ N(x | μ, Σ) = N(x | μ, Λ^(-1))
μ known
Λ ~ W(Λ | W_0, ν)
The posterior is then
p(Λ | x_1, …, x_N, μ) = W(Λ | (W_0^(-1) + ∑_{i=1}^N (x_i - μ)(x_i - μ)^T)^(-1), ν + N).
Derivation of this result is to be done on the whiteboard.
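A sketch of this Wishart update with SciPy (the 2-D data, the true covariance, and the prior W_0, ν are assumptions for illustration):

    import numpy as np
    from scipy.stats import wishart

    # Hypothetical 2-D data with known mean and unknown precision matrix Lambda.
    rng = np.random.default_rng(3)
    mu = np.zeros(2)
    true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
    x = rng.multivariate_normal(mu, true_cov, size=200)

    # Wishart(W0, nu) prior on Lambda.
    D = 2
    W0, nu = np.eye(D), D + 1
    scatter = (x - mu).T @ (x - mu)                  # sum_i (x_i - mu)(x_i - mu)^T

    W_post = np.linalg.inv(np.linalg.inv(W0) + scatter)
    nu_post = nu + len(x)

    # Posterior mean of Lambda is nu_post * W_post; invert to compare with true_cov.
    post_mean_precision = nu_post * W_post
    print(np.linalg.inv(post_mean_precision))        # roughly recovers true_cov
    # Draw a posterior sample (df = nu_post, scale = W_post).
    print(wishart(df=nu_post, scale=W_post).rvs(random_state=0))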

Recap (prior × likelihood = posterior family):
- Beta × Binomial = Beta
- Dirichlet × Multinomial = Dirichlet
- Normal × Normal = Normal
- Gamma × Normal = Gamma
- Wishart × Normal = Wishart