Lecture 6: Model Checking and Selection
Melih Kandemir
melih.kandemir@iwr.uni-heidelberg.de
May 27, 2014
Model selection

We often have multiple modeling choices that are equally sensible: $M_1, \ldots, M_T$. Which of these choices is the best given our observation set? We require a criterion or a search strategy to answer this question. This is a still-unsolved problem.
Akaike Information Criterion (AIC)

Goal: minimize the Kullback-Leibler information quantity [Kullback, 1959] between the true $\theta^*$ and the estimated $\theta$:

$$\mathrm{KL}(\theta^* \,\|\, \theta) = \int \left\{ \log p(y \mid \theta^*) - \log p(y \mid \theta) \right\} p(y \mid \theta^*) \, dy$$

This approach inherits the two nice properties of the KL divergence: $\mathrm{KL}(\theta^* \,\|\, \theta) > 0$ if $p(y \mid \theta^*) \neq p(y \mid \theta)$, and $\mathrm{KL}(\theta^* \,\|\, \theta) = 0$ iff $p(y \mid \theta^*) = p(y \mid \theta)$.
Akaike Information Criterion (AIC)

$$\mathrm{KL}(\theta^* \,\|\, \theta) = \underbrace{\int \log p(y \mid \theta^*)\, p(y \mid \theta^*)\, dy}_{\text{constant}} - \int \log p(y \mid \theta)\, p(y \mid \theta^*)\, dy$$

Hence, it suffices to maximize

$$H(\theta) = \int \log p(y \mid \theta)\, p(y \mid \theta^*)\, dy$$
Akaike Information Criterion (AIC)

According to Akaike, the best model gives the maximum $E_{\theta^*}[H(\hat\theta)]$. We can take the integral in $H(\theta)$ in a Monte Carlo integration fashion using our observation set $y = \{y_1, \ldots, y_N\}$, which we assume to be drawn from the true distribution:

$$H(\hat\theta) \approx \sum_{i=1}^N \log p(y_i \mid \hat\theta) = l(\hat\theta)$$

This is basically the log-likelihood $l(\cdot)$ for a given estimate $\hat\theta$.
Akaike Information Criterion (AIC)

Akaike shows that $l(\hat\theta)$ overestimates $E_{\theta^*}[H(\hat\theta)]$ by an amount proportional to the model complexity:

$$E_{\theta^*}[\,l(\hat\theta) - H(\hat\theta)\,] \approx p$$

where $p$ is the number of model parameters. Hence he proposes the following criterion:

$$\mathrm{AIC}(\hat\theta) = -2\, l(\hat\theta) + 2p.$$

Note the correspondence of this criterion to the Lasso regression idea!
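As a minimal sketch of how the criterion is used in practice, the snippet below compares a constant-mean model against a linear-trend model on hypothetical Gaussian data (all data and names are made up for illustration). Both models are fit by least squares, which is the ML solution here, and the parameter count includes the noise variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear trend plus Gaussian noise.
N = 50
x = np.linspace(0, 1, N)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, N)

def gaussian_aic(y, y_hat, n_coef):
    """AIC = -2 l(theta_hat) + 2 p for a Gaussian model, where p counts
    the regression coefficients plus the MLE noise variance."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)            # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + 2 * (n_coef + 1)

# M1: constant mean; M2: linear trend (least squares = ML here).
aic_const = gaussian_aic(y, np.full(N, y.mean()), n_coef=1)
X = np.column_stack([np.ones(N), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
aic_lin = gaussian_aic(y, X @ beta, n_coef=2)

print(aic_const, aic_lin)  # the linear model should attain the lower AIC
```

Since the data really do contain a trend, the extra parameter of the linear model buys far more likelihood than its AIC penalty of 2 costs.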
AIC Asymptotics

$E_{\theta^*}[\mathrm{AIC}(\hat\theta)] \to E_{\theta^*}[\mathrm{KL}(\theta^* \,\|\, \hat\theta)]$ as $N \to \infty$; hence AIC is more sensible for large sample sizes, i.e., when $N \gg p$.
AIC is good at [D. Schmidt, E. Makalic]

- Linear regression
- Generalized linear models
- Autoregressive models
- Histogram estimation
- Some forms of hypothesis testing
AIC is bad at [D. Schmidt, E. Makalic]

- Deep neural networks: many different $\theta$ map to the same distribution
- Mixture modeling: the maximum likelihood estimates are not consistent
- The uniform distribution: the likelihood is not twice differentiable
Bayesian Model Selection

The proper Bayesian model selection would be

$$p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)} = \frac{p(M) \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta}{p(y)}$$
Bayes Information Criterion (BIC)

$$-2 \log p(M \mid y) = 2 \log p(y) - 2 \log p(M) - 2 \log \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta$$
Bayes Information Criterion (BIC)

Apply a second-order Taylor approximation to $\log p(y \mid \theta)$ around the maximum likelihood estimate $\hat\theta$:

$$\log p(y \mid \theta) \approx \log p(y \mid \hat\theta) + (\theta - \hat\theta)^T \frac{\partial \log p(y \mid \hat\theta)}{\partial \theta} + \frac{1}{2} (\theta - \hat\theta)^T \left[ \frac{\partial^2 \log p(y \mid \hat\theta)}{\partial \theta\, \partial \theta^T} \right] (\theta - \hat\theta)$$

Note that

$$I(\hat\theta, y) = -\frac{1}{N} \frac{\partial^2 \log p(y \mid \hat\theta)}{\partial \theta\, \partial \theta^T}$$

is the sample Fisher information matrix, and that the first-order term vanishes because $\hat\theta$ maximizes the likelihood. Hence,

$$\log p(y \mid \theta) \approx \log p(y \mid \hat\theta) - \frac{1}{2} (\theta - \hat\theta)^T \left[ N I(\hat\theta, y) \right] (\theta - \hat\theta).$$
Bayes Information Criterion (BIC)

Taking the exponential and plugging the approximate likelihood back into the target equation, we have

$$\int p(y \mid \theta)\, p(\theta \mid M)\, d\theta \approx p(y \mid \hat\theta) \int \exp\left( -\frac{1}{2} (\theta - \hat\theta)^T \left[ N I(\hat\theta, y) \right] (\theta - \hat\theta) \right) p(\theta \mid M)\, d\theta$$

Assuming a noninformative prior over the model parameters $p(\theta \mid M)$, we can compute the Gaussian integral analytically and get

$$\int p(y \mid \theta)\, p(\theta \mid M)\, d\theta \approx p(y \mid \hat\theta)\, (2\pi)^{P/2}\, N^{-P/2}\, |I(\hat\theta, y)|^{-1/2} = p(y \mid \hat\theta) \left( \frac{2\pi}{N} \right)^{P/2} |I(\hat\theta, y)|^{-1/2}$$
Bayes Information Criterion (BIC)

Let us plug the approximate outcome of the integral into our main formula:

$$-2 \log p(y \mid M) \approx -2 \log p(y \mid \hat\theta) + P \log\left( \frac{N}{2\pi} \right) + \log |I(\hat\theta, y)|$$

Ignoring the terms that do not grow with the data size, we have the Bayes Information Criterion (BIC):

$$\mathrm{BIC}(M) = -2 \log p(y \mid \hat\theta) + P \log N$$
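As a small illustration of the final formula, the sketch below scores polynomial regression models of increasing degree with $\mathrm{BIC}(M) = -2 \log p(y \mid \hat\theta) + P \log N$ under a Gaussian likelihood (the data and degree range are hypothetical choices for this example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data drawn from a quadratic curve plus noise; we score
# polynomial degrees 0..4 with BIC = -2 log-likelihood + P log N.
N = 200
x = np.linspace(-1, 1, N)
y = 0.5 - x + 2.0 * x**2 + rng.normal(0, 0.2, N)

def gaussian_bic(y, y_hat, n_params):
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)            # MLE noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return -2 * loglik + n_params * np.log(n)

bics = []
for degree in range(5):
    coefs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coefs, x)
    bics.append(gaussian_bic(y, y_hat, n_params=degree + 2))  # coefs + sigma^2

best_degree = int(np.argmin(bics))
print(best_degree)  # BIC should recover the quadratic
```

Underfitting models (degrees 0 and 1) lose far more likelihood than the $\log N$ penalty saves, so the quadratic should win against them; higher degrees pay the penalty without a matching gain in fit.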
The BIC scale of significance

According to [Kass, Raftery, 1995]:

ΔBIC    Evidence against the model with higher BIC
0-2     Not worth mentioning
2-6     Positive
6-10    Strong
>10     Very strong
AIC versus BIC

From J. Cavanaugh's slides:

- BIC is more parsimonious than AIC, because the frequentist analysis behind AIC incorporates only estimation uncertainty, while the Bayesian analysis behind BIC incorporates estimation uncertainty AND parameter uncertainty.
- AIC measures the predictiveness of a model, while BIC measures its descriptiveness.
- AIC is asymptotically efficient yet not consistent; BIC is consistent yet not asymptotically efficient.
Deviance Information Criterion (DIC)

Deviance: $D(y, \theta) = -2 \log p(y \mid \theta)$
Point-estimate deviance: $D_{\hat\theta}(y) = D(y, \hat\theta)$
Expected deviance: $D_{\mathrm{avg}}(y) = E[D(y, \theta) \mid y]$
Estimated expected deviance: $\hat D_{\mathrm{avg}}(y) = \frac{1}{L} \sum_{l=1}^L D(y, \theta^l)$ over posterior samples $\theta^1, \ldots, \theta^L$
Effective number of parameters: $p_D = \hat D_{\mathrm{avg}}(y) - D_{\hat\theta}(y)$
DIC: $\mathrm{DIC} = \hat D_{\mathrm{avg}}(y) + p_D$
Bayes factor

Bayesian model selection was mentioned as:

$$p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)}$$

The ratio of the model posteriors of two models with noninformative priors is [Kass, Raftery, 1995]:

$$K = \frac{\int p(y \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1}{\int p(y \mid \theta_2, M_2)\, p(\theta_2 \mid M_2)\, d\theta_2}$$
Example 1: Is the coin fair?

Suppose we tossed a coin 200 times and got heads 115 times. The likelihood is then

$$\binom{200}{115} p^{115} (1 - p)^{85}$$

We compare two models: $M_1$: $p = 0.5$, and $M_2$: $p$ unknown, with a uniform prior over $p$. Then,

$$P(X = 115 \mid M_1) = \binom{200}{115} \left( \frac{1}{2} \right)^{200} \approx 0.00595,$$

$$P(X = 115 \mid M_2) = \int_0^1 \binom{200}{115} q^{115} (1 - q)^{85}\, dq = \frac{1}{201} \approx 0.00497.$$

Hence, $K = 0.00595 / 0.00497 \approx 1.197$, which says "barely worth mentioning". The data give only negligible evidence in favor of the fair coin, and in particular no real evidence that the coin is unfair.
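The numbers above are easy to verify numerically. The sketch below reproduces both marginal likelihoods using only the standard library; the uniform-prior integral is a Beta function, $\binom{200}{115} B(116, 86) = 1/201$:

```python
from math import comb, exp, lgamma

# Marginal likelihood under M1: the binomial probability at p = 1/2.
m1 = comb(200, 115) * 0.5 ** 200

# Marginal likelihood under M2: the uniform prior integrates to a Beta
# function, C(200,115) * B(116, 86), which equals 1/201 exactly.
log_beta = lgamma(116) + lgamma(86) - lgamma(202)
m2 = comb(200, 115) * exp(log_beta)

K = m1 / m2
print(round(m1, 5), round(m2, 5), round(K, 3))  # ~0.00595, ~0.00497, ~1.197
```

Working with `lgamma` rather than raw factorials keeps the Beta-function evaluation numerically stable even for much larger trial counts.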
Example 2: Finding the cluster count of a Gaussian mixture model

(Figure from http://scikit-learn.org/stable/modules/mixture.html)
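In the spirit of that scikit-learn example, the sketch below selects the component count of a Gaussian mixture by minimizing BIC over candidate counts; the two-cluster data set is a made-up illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Hypothetical 1-D data from two well-separated Gaussian clusters; we pick
# the component count by minimizing BIC over k = 1..4.
X = np.concatenate([rng.normal(-4, 1, 150),
                    rng.normal(4, 1, 150)]).reshape(-1, 1)

bics = []
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

best_k = 1 + int(np.argmin(bics))
print(best_k)  # with clusters this well separated, BIC should pick 2
```

The `P log N` penalty grows with every additional component (means, covariances, and mixing weights), so BIC resists the mixture's tendency to overfit with spurious clusters.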
Bayes factor scale of significance

According to [Kass, Raftery, 1995]:

K        Evidence in favour of M1
1-3      Not worth more than a bare mention
3-20     Positive
20-150   Strong
>150     Very strong
Bayes factor and BIC

Let $K$ be the Bayes factor of two models $M_1$ and $M_2$, and $\mathrm{BIC}(M_1)$ and $\mathrm{BIC}(M_2)$ the corresponding BICs. Since $\mathrm{BIC}(M) \approx -2 \log p(y \mid M)$,

$$\frac{-2 \log K - [\mathrm{BIC}(M_1) - \mathrm{BIC}(M_2)]}{-2 \log K} \to 0 \text{ as } N \to \infty.$$

Thus $\Delta\mathrm{BIC} = \mathrm{BIC}(M_1) - \mathrm{BIC}(M_2)$ approximates $-2 \log K$.
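This approximation can be checked numerically on the coin example above, where both the exact Bayes factor and the BICs are available in closed form. Since the result is only asymptotic, at $N = 200$ we should expect a crude match (same sign and order of magnitude), not an exact one:

```python
from math import comb, exp, lgamma, log

n, h = 200, 115

# Exact marginal likelihoods (uniform prior under M2), hence exact K.
m1 = comb(n, h) * 0.5 ** n
m2 = comb(n, h) * exp(lgamma(h + 1) + lgamma(n - h + 1) - lgamma(n + 2))
minus_2_log_K = -2 * log(m1 / m2)

# BICs: M1 has no free parameters (P = 0), M2 has one (q_hat = h / n).
q = h / n
loglik_m1 = log(comb(n, h)) + n * log(0.5)
loglik_m2 = log(comb(n, h)) + h * log(q) + (n - h) * log(1 - q)
delta_bic = -2 * loglik_m1 - (-2 * loglik_m2 + 1 * log(n))

print(minus_2_log_K, delta_bic)  # both slightly negative (mild favor for M1)
```

Both quantities come out negative, agreeing that the fair-coin model is (very weakly) preferred, though their magnitudes differ noticeably at this sample size.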
Model selection for variational inference

Given data $y$, two models $M_1$ and $M_2$ are proposed. The exact and approximate posteriors of the models are as follows:

$M_1$: $p(\theta_1 \mid y) \approx q(\theta_1; v_1)$
$M_2$: $p(\theta_2 \mid y) \approx q(\theta_2; v_2)$

with parameter sets $v_1$ and $v_2$ learned from $y$. The Bayes factor can be approximated as

$$K = \frac{p(y \mid M_1)}{p(y \mid M_2)} \approx \frac{\exp\{\mathcal{L}[q(v_1); y]\}}{\exp\{\mathcal{L}[q(v_2); y]\}}$$

where $\mathcal{L}[q(v_i); y]$ is the variational lower bound computed with the parameter set $v_i$.
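A minimal sketch of this idea, using a toy conjugate model where the bound can be checked: for $y_i \sim N(\mu, 1)$ with prior $\mu \sim N(m_0, 1)$, the exact posterior is Gaussian, so choosing $q$ equal to it makes $\mathcal{L}[q; y] = \log p(y \mid M)$ and the ELBO difference recovers $\log K$ exactly (the data and the two prior means are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: y_i ~ N(mu, 1); two models differ only in the prior mean m0.
y = rng.normal(0.5, 1.0, 30)
N = len(y)

def elbo(y, m0, n_samples=2000):
    """Monte Carlo ELBO E_q[log p(y, mu) - log q(mu)] with q set to the
    exact Gaussian posterior of this conjugate model."""
    post_var = 1.0 / (N + 1.0)
    post_mean = post_var * (y.sum() + m0)
    mu = rng.normal(post_mean, np.sqrt(post_var), n_samples)
    log_joint = (-0.5 * ((y[:, None] - mu) ** 2).sum(axis=0)
                 - 0.5 * N * np.log(2 * np.pi)
                 - 0.5 * (mu - m0) ** 2 - 0.5 * np.log(2 * np.pi))
    log_q = (-0.5 * (mu - post_mean) ** 2 / post_var
             - 0.5 * np.log(2 * np.pi * post_var))
    return float(np.mean(log_joint - log_q))

# M1: prior centered at 0 (near the truth); M2: prior centered at 5.
log_K = elbo(y, 0.0) - elbo(y, 5.0)
print(log_K)  # positive: the data favor M1
```

When $q$ is not the exact posterior, each $\mathcal{L}$ is only a lower bound on the log evidence, so the approximated $K$ inherits the (generally unknown) gaps between the bounds and the true marginal likelihoods.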
Bayes factor for MCMC