Lecture 6: Model Checking and Selection


1 Lecture 6: Model Checking and Selection
Melih Kandemir
May 27, 2014

2 Model selection
We often have multiple modeling choices that are equally sensible: $M_1, \dots, M_T$. Which of these choices is best given our observation set? We need a criterion or a search strategy to answer this question. This remains an open problem.

3 Akaike Information Criterion (AIC)
Goal: Minimize the Kullback-Leibler information quantity [Kullback, 1959] between the true $\theta^*$ and the estimate $\theta$:

$$\mathrm{KL}(\theta^* \,\|\, \theta) = \int \left\{ \log p(y \mid \theta^*) - \log p(y \mid \theta) \right\} p(y \mid \theta^*)\, dy$$

This approach inherits the two nice properties of the KL divergence: $\mathrm{KL}(\theta^* \,\|\, \theta) > 0$ if $p(y \mid \theta^*) \neq p(y \mid \theta)$, and $\mathrm{KL}(\theta^* \,\|\, \theta) = 0$ iff $p(y \mid \theta^*) = p(y \mid \theta)$.

4 Akaike Information Criterion (AIC)

$$\mathrm{KL}(\theta^* \,\|\, \theta) = \underbrace{\int \log p(y \mid \theta^*)\, p(y \mid \theta^*)\, dy}_{\text{constant}} - \int \log p(y \mid \theta)\, p(y \mid \theta^*)\, dy$$

Hence, it suffices to maximize

$$H(\theta) = \int \log p(y \mid \theta)\, p(y \mid \theta^*)\, dy$$

5 Akaike Information Criterion (AIC)
According to Akaike, the best model is the one that maximizes $E_{\theta^*}[H(\hat{\theta})]$. We can evaluate the integral in $H(\theta)$ by Monte Carlo integration using our observation set $y$, which we assume to be drawn from the true distribution:

$$H(\hat{\theta}) \approx \sum_{i=1}^{N} \log p(y_i \mid \hat{\theta}) = l(\hat{\theta})$$

(dropping the common $1/N$ Monte Carlo factor, which is the same for all models being compared). This is simply the log-likelihood $l(\cdot)$ for a given estimate $\hat{\theta}$.
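
As a quick illustration of this Monte Carlo view, here is a minimal sketch under an assumed toy Gaussian setting (the true distribution and the ML fit below are illustrative, not from the slides):

```python
# Sketch: Monte Carlo estimate of H(theta_hat) in a toy Gaussian setting.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=10_000)   # y_i ~ p(y | theta*)

mu_hat, sigma_hat = y.mean(), y.std()             # ML estimate theta_hat
# (1/N) sum_i log p(y_i | theta_hat) approximates the integral H(theta_hat)
H_hat = norm.logpdf(y, mu_hat, sigma_hat).mean()
print(H_hat)
```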

6 Akaike Information Criterion (AIC)
Akaike shows that $l(\hat{\theta})$ overestimates $E_{\theta^*}[H(\hat{\theta})]$ by an amount proportional to the model complexity:

$$E_{\theta^*}[l(\hat{\theta}) - H(\hat{\theta})] \approx p$$

where $p$ is the number of model parameters. Hence he proposes the following criterion:

$$\mathrm{AIC}(\hat{\theta}) = -2\, l(\hat{\theta}) + 2p.$$

Note the correspondence of this criterion to the Lasso regression idea: both trade goodness of fit against a penalty on model complexity.
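
A minimal sketch of how the criterion is used in practice (the two candidate Gaussian models below are illustrative assumptions, not from the slides): fit each model by maximum likelihood, then prefer the one with the lower AIC.

```python
# Sketch: comparing two Gaussian models of the same data by AIC.
import numpy as np
from scipy.stats import norm

def aic(loglik, p):
    return -2.0 * loglik + 2.0 * p   # AIC = -2 l(theta_hat) + 2p

rng = np.random.default_rng(1)
y = rng.normal(1.0, 2.0, size=500)

# M1: zero mean, free scale (p = 1); the ML scale is sqrt(mean(y^2))
ll1 = norm.logpdf(y, 0.0, np.sqrt((y ** 2).mean())).sum()
# M2: free mean and scale (p = 2)
ll2 = norm.logpdf(y, y.mean(), y.std()).sum()

print(aic(ll1, 1), aic(ll2, 2))      # the lower AIC wins
```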

7 AIC Asymptotics
$E_{\theta^*}[\mathrm{AIC}(\hat{\theta})] \approx E_{\theta^*}[\mathrm{KL}(\theta^* \,\|\, \hat{\theta})]$ as $N \to \infty$; hence AIC is more sensible for large sample sizes, i.e. when $N \gg p$.

8 AIC is good at [D. Schmidt, E. Makalic]
- Linear regression
- Generalized linear models
- Autoregressive models
- Histogram estimation
- Some forms of hypothesis testing

9 AIC is bad at [D. Schmidt, E. Makalic]
- Deep neural networks: many different $\theta$ map to the same distribution
- Mixture modeling: the maximum-likelihood estimates are not consistent
- The uniform distribution: the likelihood is not twice differentiable

10 Bayesian Model Selection
The proper Bayesian model selection should be

$$p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)} = \frac{p(M) \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta}{p(y)}$$

11 Bayes Information Criterion (BIC)

$$-2 \log p(M \mid y) = 2 \log p(y) - 2 \log p(M) - 2 \log \int p(y \mid \theta)\, p(\theta \mid M)\, d\theta$$

12 Bayes Information Criterion (BIC)
Apply a second-order Taylor approximation to $\log p(y \mid \theta)$ around the maximum-likelihood estimate $\hat{\theta}$:

$$\log p(y \mid \theta) \approx \log p(y \mid \hat{\theta}) + (\theta - \hat{\theta})^T \frac{\partial \log p(y \mid \hat{\theta})}{\partial \theta} + \frac{1}{2} (\theta - \hat{\theta})^T \left[ \frac{\partial^2 \log p(y \mid \hat{\theta})}{\partial \theta\, \partial \theta^T} \right] (\theta - \hat{\theta})$$

Note that

$$I(\hat{\theta}, y) = -\frac{1}{N} \frac{\partial^2 \log p(y \mid \hat{\theta})}{\partial \theta\, \partial \theta^T}$$

is the sample Fisher information matrix, and that the first-order term vanishes because $\hat{\theta}$ maximizes the likelihood. Hence,

$$\log p(y \mid \theta) \approx \log p(y \mid \hat{\theta}) - \frac{1}{2} (\theta - \hat{\theta})^T \left[ N\, I(\hat{\theta}, y) \right] (\theta - \hat{\theta}).$$

13 Bayes Information Criterion (BIC)
Taking the exponent and plugging the approximate likelihood back into the target equation, we have

$$\int p(y \mid \theta)\, p(\theta \mid M)\, d\theta \approx p(y \mid \hat{\theta}) \int \exp\left( -\frac{1}{2} (\theta - \hat{\theta})^T \left[ N\, I(\hat{\theta}, y) \right] (\theta - \hat{\theta}) \right) p(\theta \mid M)\, d\theta$$

Assuming a noninformative prior $p(\theta \mid M)$ over the model parameters, we can compute the (Gaussian) integral analytically and get

$$\int p(y \mid \theta)\, p(\theta \mid M)\, d\theta \approx p(y \mid \hat{\theta})\, (2\pi)^{P/2}\, N^{-P/2}\, |I(\hat{\theta}, y)|^{-1/2} = p(y \mid \hat{\theta}) \left( \frac{2\pi}{N} \right)^{P/2} |I(\hat{\theta}, y)|^{-1/2}$$
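
This Gaussian (Laplace) integral can be checked numerically; here is a sketch under assumed toy conditions that are not in the slides (a one-dimensional Gaussian likelihood with known unit scale and a flat prior, so $P = 1$ and $I(\hat{\theta}, y) = 1$, and the approximation is in fact exact):

```python
# Sketch: numerical check of the Laplace approximation to the marginal likelihood.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(2)
y = rng.normal(0.5, 1.0, size=50)
N = len(y)
mu_hat = y.mean()

def lik(mu):
    return np.exp(norm.logpdf(y, mu, 1.0).sum())   # p(y | mu)

exact, _ = quad(lik, mu_hat - 10.0, mu_hat + 10.0) # int p(y|theta) dtheta, flat prior
fisher = 1.0                                       # sample Fisher info: 1/sigma^2 = 1
laplace = lik(mu_hat) * (2 * np.pi / N) ** 0.5 * fisher ** -0.5
print(exact, laplace)                              # agree: the integrand is Gaussian
```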

14 Bayes Information Criterion (BIC)
Let us plug the approximate outcome of the integral into our main formula:

$$-2 \log p(M \mid y) \approx -2 \log p(y \mid \hat{\theta}) + P \log\left( \frac{N}{2\pi} \right) + \log |I(\hat{\theta}, y)| + \mathrm{const}$$

where the constant collects the terms $2 \log p(y) - 2 \log p(M)$. Ignoring the terms that do not grow with the data size, we have the Bayes Information Criterion (BIC):

$$\mathrm{BIC}(M) = -2 \log p(y \mid \hat{\theta}) + P \log N$$
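
In code the criterion is a one-liner; a sketch, reusing the illustrative log-likelihoods from the AIC example above:

```python
import numpy as np

def bic(loglik, p, n):
    return -2.0 * loglik + p * np.log(n)   # BIC = -2 log p(y|theta_hat) + P log N

# e.g. with ll1, ll2 and N = 500 from the AIC sketch:
# print(bic(ll1, 1, 500), bic(ll2, 2, 500))
```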

15 The BIC scale of significance
According to [Kass, Raftery, 1995]:

    Delta BIC    Evidence against the higher-BIC model
    0-2          Not worth mentioning
    2-6          Positive
    6-10         Strong
    >10          Very strong

16 AIC versus BIC
From J. Cavanaugh's slides:

- BIC is more parsimonious than AIC, because frequentist analysis (AIC) incorporates estimation uncertainty only, while Bayesian analysis (BIC) incorporates estimation uncertainty AND parameter uncertainty.
- AIC measures the predictiveness of a model, while BIC measures its descriptiveness.
- AIC is asymptotically efficient yet not consistent; BIC is consistent yet not asymptotically efficient.

17 Deviance Information Criterion (DIC)
- Deviance: $D(y, \theta) = -2 \log p(y \mid \theta)$
- Point-estimate deviance: $D_{\hat{\theta}}(y) = D(y, \hat{\theta})$
- Expected deviance: $D_{\mathrm{avg}}(y) = E_{\theta \mid y}[D(y, \theta)]$
- Estimated expected deviance: $\hat{D}_{\mathrm{avg}}(y) = \frac{1}{L} \sum_{l=1}^{L} D(y, \theta^l)$, using posterior samples $\theta^l$
- Effective number of parameters: $p_D = \hat{D}_{\mathrm{avg}}(y) - D_{\hat{\theta}}(y)$
- DIC: $\mathrm{DIC} = \hat{D}_{\mathrm{avg}}(y) + p_D$
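
A minimal sketch of the whole recipe, under an assumed toy posterior that is not in the slides (Gaussian mean with known unit scale and a flat prior, so the posterior is available in closed form and $p_D$ should come out near 1):

```python
# Sketch: DIC from posterior draws theta^l in a toy Gaussian-mean model.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=200)

# Flat prior => posterior over the mean is N(ybar, 1/N); draw theta^1..theta^L
theta = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=5_000)

def deviance(mu):
    return -2.0 * norm.logpdf(y, mu, 1.0).sum()    # D(y, theta)

D_avg = np.mean([deviance(t) for t in theta])      # (1/L) sum_l D(y, theta^l)
D_hat = deviance(theta.mean())                     # deviance at a point estimate
p_D = D_avg - D_hat                                # effective number of parameters
print(p_D, D_avg + p_D)                            # p_D ~ 1 here; second value is DIC
```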

18 Bayes factor
Bayesian model selection was defined as:

$$p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)}$$

The ratio of the model posteriors of two models with noninformative (equal) priors is the Bayes factor [Kass, Raftery, 1995]:

$$K = \frac{\int p(y \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1}{\int p(y \mid \theta_2, M_2)\, p(\theta_2 \mid M_2)\, d\theta_2}$$

19 Example 1: Is the coin fair?
Suppose we tossed a coin 200 times and got heads 115 times. The likelihood is then

$$\binom{200}{115}\, p^{115} (1 - p)^{85}$$

We compare two models: $M_1$: $p = 0.5$, and $M_2$: $p$ unknown (uniform prior on $[0, 1]$). Then,

$$P(X = 115 \mid M_1) = \binom{200}{115} \left( \frac{1}{2} \right)^{200} \approx 0.00596,$$

$$P(X = 115 \mid M_2) = \int_0^1 \binom{200}{115}\, q^{115} (1 - q)^{85}\, dq = \frac{1}{201} \approx 0.00498.$$

Hence, $K = 0.00596 / 0.00498 = 1.197$, which is "barely worth mentioning": we have only weak evidence regarding the coin's fairness, so we cannot conclude that the coin is unfair.
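
The two marginal likelihoods are easy to reproduce numerically; a sketch, using the exact Beta-integral identity for $M_2$:

```python
# Sketch: Bayes factor for the coin example.
import numpy as np
from scipy.stats import binom
from scipy.special import gammaln

n, k = 200, 115
p_M1 = binom.pmf(k, n, 0.5)                        # C(200,115) (1/2)^200

# M2: C(n,k) * Beta(k+1, n-k+1) = 1/(n+1), computed here via log-gammas
log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
log_beta = gammaln(k + 1) + gammaln(n - k + 1) - gammaln(n + 2)
p_M2 = np.exp(log_choose + log_beta)               # = 1/201

print(p_M1, p_M2, p_M1 / p_M2)                     # K ~ 1.197
```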

20 Example 2: Finding the cluster count of a Gaussian mixture model
[Figure from the original slide, not reproduced in the transcription.]

21 Bayes factor scale of significance
According to [Kass, Raftery, 1995]:

    K         Evidence in favour of M1
    1-3       Not worth more than a bare mention
    3-20      Positive
    20-150    Strong
    >150      Very strong

22 Bayes factor and BIC
Let $K$ be the Bayes factor of two models $M_1$ and $M_2$, and let $\mathrm{BIC}(M_1)$ and $\mathrm{BIC}(M_2)$ be the corresponding BIC scores:

$$\frac{2 \log K + [\mathrm{BIC}(M_1) - \mathrm{BIC}(M_2)]}{2 \log K} \to 0 \quad \text{as } N \to \infty.$$

Thus, $\Delta \mathrm{BIC} = \mathrm{BIC}(M_1) - \mathrm{BIC}(M_2)$ approximates $-2 \log K$.
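
This gives a cheap plug-in estimate of the Bayes factor from two BIC scores; a one-line sketch (the scores in the example call are placeholders):

```python
import numpy as np

def approx_bayes_factor(bic1, bic2):
    # Delta BIC = BIC(M1) - BIC(M2) approximates -2 log K, so K ~ exp(-Delta/2)
    return np.exp(-(bic1 - bic2) / 2.0)

print(approx_bayes_factor(100.0, 104.0))   # K ~ e^2 ~ 7.4 in favour of M1
```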

23 Model selection for variational inference
Given data $y$ for which two models $M_1$ and $M_2$ are proposed, the exact and approximate posteriors of the models are as follows:

$M_1$: $p(\theta_1 \mid y) \approx q(\theta_1; v_1)$
$M_2$: $p(\theta_2 \mid y) \approx q(\theta_2; v_2)$

with parameter sets $v_1$ and $v_2$ learned from $y$. The Bayes factor can be approximated as

$$K = \frac{p(y \mid M_1)}{p(y \mid M_2)} \approx \frac{\exp\{\mathcal{L}[q(v_1); y]\}}{\exp\{\mathcal{L}[q(v_2); y]\}}$$

where $\mathcal{L}[q(v_i); y]$ is the variational lower bound computed with the parameter set $v_i$.
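
In practice this means running VI once per model and comparing the final lower bounds; a sketch with hypothetical ELBO values (the numbers below are placeholders, not from the slides):

```python
# Sketch: approximate log Bayes factor from two fitted variational lower bounds.
import numpy as np

elbo_1 = -1234.5   # hypothetical L[q(v1); y] for M1
elbo_2 = -1240.2   # hypothetical L[q(v2); y] for M2

log_K = elbo_1 - elbo_2        # log K ~ L[q(v1); y] - L[q(v2); y]
print(log_K, np.exp(log_K))    # positive log K favours M1
```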

24 Bayes factor for MCMC
