Lecture 6: Model Checking and Selection


Melih Kandemir, melih.kandemir@iwr.uni-heidelberg.de, May 27, 2014

Model selection
We often have multiple modeling choices that are equally sensible: $M_1, \ldots, M_T$. Which of these choices is the best given our observation set? We need a criterion or a search strategy to answer this question; in general, the problem is still unsolved.

Akaike Information Criterion (AIC)
Goal: minimize the Kullback-Leibler information quantity [Kullback, 1959] between the true parameter $\theta^*$ and the estimate $\theta$:

$$\mathrm{KL}(\theta^* \,\|\, \theta) = \int \{\log p(y \mid \theta^*) - \log p(y \mid \theta)\}\, p(y \mid \theta^*)\, dy$$

This approach inherits the two nice properties of the KL divergence: $\mathrm{KL}(\theta^* \,\|\, \theta) > 0$ if $p(y \mid \theta^*) \neq p(y \mid \theta)$, and $\mathrm{KL}(\theta^* \,\|\, \theta) = 0$ iff $p(y \mid \theta^*) = p(y \mid \theta)$.

Akaike Information Criterion (AIC)
$$\mathrm{KL}(\theta^* \,\|\, \theta) = \underbrace{\int \log p(y \mid \theta^*)\, p(y \mid \theta^*)\, dy}_{\text{constant}} - \int \log p(y \mid \theta)\, p(y \mid \theta^*)\, dy$$

Hence, it suffices to maximize

$$H(\theta) = \int \log p(y \mid \theta)\, p(y \mid \theta^*)\, dy.$$

Akaike Information Criterion (AIC)
According to Akaike, the best model gives the maximum $E_{\theta^*}[H(\hat{\theta})]$. We can approximate the integral in $H(\theta)$ by Monte Carlo integration using our observation set $y$, which we assume to be drawn from the true distribution:

$$H(\hat{\theta}) \approx \sum_{i=1}^{N} \log p(y_i \mid \hat{\theta}) = \ell(\hat{\theta})$$

This is simply the log-likelihood $\ell(\cdot)$ evaluated at the estimate $\hat{\theta}$.
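A minimal sketch (not part of the original slides) of this Monte Carlo approximation for a unit-variance Gaussian; `theta_star` and `theta_hat` are hypothetical values, and the per-observation mean is compared against the closed-form value of the integral:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_star, theta_hat = 0.0, 0.3             # hypothetical true and estimated means
y = rng.normal(theta_star, 1.0, size=10_000)  # draws from the true distribution

# Monte Carlo estimate of the per-observation H(theta_hat)
H_mc = norm.logpdf(y, loc=theta_hat).mean()
# Closed form for Gaussians: -0.5*log(2*pi) - 0.5*(1 + (theta_hat - theta_star)^2)
H_exact = -0.5 * np.log(2 * np.pi) - 0.5 * (1.0 + (theta_hat - theta_star) ** 2)
print(H_mc, H_exact)  # the two values agree up to Monte Carlo error
```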

Akaike Information Criterion (AIC)
Akaike showed that $\ell(\hat{\theta})$ overestimates $E_{\theta^*}[H(\hat{\theta})]$ by an amount proportional to the model complexity:

$$E_{\theta^*}[\ell(\hat{\theta}) - H(\hat{\theta})] \approx p,$$

where $p$ is the number of model parameters. Hence he proposed the following criterion:

$$\mathrm{AIC}(\hat{\theta}) = -2\,\ell(\hat{\theta}) + 2p.$$

Note how this criterion parallels the Lasso regression idea: a goodness-of-fit term plus a complexity penalty.
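As an illustration (again not from the slides), the following sketch compares two nested Gaussian models of the same synthetic data by AIC; the data-generating parameters are arbitrary demo choices:

```python
import numpy as np
from scipy.stats import norm

def aic(loglik: float, p: int) -> float:
    """AIC(theta_hat) = -2 * l(theta_hat) + 2 * p."""
    return -2.0 * loglik + 2.0 * p

rng = np.random.default_rng(1)
y = rng.normal(1.0, 2.0, size=200)

# M1: mean fixed at 0, free scale (p = 1); the ML scale is sqrt(mean(y^2))
ll1 = norm.logpdf(y, loc=0.0, scale=np.sqrt(np.mean(y ** 2))).sum()
# M2: free mean and scale (p = 2); ML estimates are the sample mean and std
ll2 = norm.logpdf(y, loc=y.mean(), scale=y.std()).sum()

print(aic(ll1, 1), aic(ll2, 2))  # the lower AIC indicates the preferred model
```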

AIC asymptotics
$E_{\theta^*}[\mathrm{AIC}(\hat{\theta})]$ approaches $E_{\theta^*}[\mathrm{KL}(\theta^* \,\|\, \hat{\theta})]$ (up to constants) as $N \to \infty$. Hence AIC is most sensible for large sample sizes, in particular when $N \gg P$.

AIC is good at [D. Schmidt, E. Makalic]:
- Linear regression
- Generalized linear models
- Autoregressive models
- Histogram estimation
- Some forms of hypothesis testing

AIC is bad at [D. Schmidt, E. Makalic]:
- Deep neural networks: many different $\theta$ map to the same distribution
- Mixture modeling: the maximum likelihood estimates are not consistent
- The uniform distribution: the likelihood is not twice differentiable

Bayesian Model Selection
The proper Bayesian model selection would be

$$p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)} = \frac{p(M) \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta}{p(y)}$$

Bayes Information Criterion (BIC)
$$-2 \log p(M \mid y) = 2 \log p(y) - 2 \log p(M) - 2 \log \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta$$

Bayes Information Criterion (BIC)
Apply a second-order Taylor approximation to $\log p(y \mid \theta)$ around $\hat{\theta}$:

$$\log p(y \mid \theta) \approx \log p(y \mid \hat{\theta}) + (\theta - \hat{\theta})^T \frac{\partial \log p(y \mid \hat{\theta})}{\partial \theta} + \frac{1}{2} (\theta - \hat{\theta})^T \frac{\partial^2 \log p(y \mid \hat{\theta})}{\partial \theta\, \partial \theta^T} (\theta - \hat{\theta})$$

Note that

$$I(\hat{\theta}, y) = -\frac{1}{N} \frac{\partial^2 \log p(y \mid \hat{\theta})}{\partial \theta\, \partial \theta^T}$$

is the sample Fisher information matrix. Since the gradient vanishes at the maximum likelihood estimate $\hat{\theta}$, we obtain

$$\log p(y \mid \theta) \approx \log p(y \mid \hat{\theta}) - \frac{1}{2} (\theta - \hat{\theta})^T \left[ N\, I(\hat{\theta}, y) \right] (\theta - \hat{\theta}).$$

Bayes Information Criterion (BIC)
Exponentiating and plugging the approximate likelihood back into the target equation, we have

$$\int p(y \mid \theta)\, p(\theta \mid M)\, d\theta \approx p(y \mid \hat{\theta}) \int \exp\!\left( -\frac{1}{2} (\theta - \hat{\theta})^T \left[ N\, I(\hat{\theta}, y) \right] (\theta - \hat{\theta}) \right) p(\theta \mid M)\, d\theta$$

Assuming a noninformative prior $p(\theta \mid M)$ over the model parameters, the (Gaussian) integral can be computed analytically:

$$\int p(y \mid \theta)\, p(\theta \mid M)\, d\theta \approx p(y \mid \hat{\theta})\, (2\pi)^{P/2}\, N^{-P/2}\, |I(\hat{\theta}, y)|^{-1/2} = p(y \mid \hat{\theta}) \left( \frac{2\pi}{N} \right)^{P/2} |I(\hat{\theta}, y)|^{-1/2}$$

Bayes Information Criterion (BIC)
Let us plug the approximate outcome of the integral into our main formula:

$$-2 \log p(M \mid y) \approx -2 \log p(y \mid \hat{\theta}) + P \log\!\left( \frac{N}{2\pi} \right) + \log |I(\hat{\theta}, y)| + \text{const}$$

Ignoring the terms that do not grow with the data size $N$, we have the Bayes Information Criterion (BIC):

$$\mathrm{BIC}(M) = -2 \log p(y \mid \hat{\theta}) + P \log N$$
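A minimal sketch of this formula as code (the inputs are illustrative numbers, not from the lecture): `log_lik` is the maximized log-likelihood, `P` the number of free parameters, `N` the sample size.

```python
import numpy as np

def bic(log_lik: float, P: int, N: int) -> float:
    """BIC(M) = -2 log p(y | theta_hat) + P log N."""
    return -2.0 * log_lik + P * np.log(N)

# Illustrative numbers only: a 2-parameter model fitted to 200 observations
print(bic(log_lik=-420.5, P=2, N=200))
```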

The BIC scale of significance
According to [Kass & Raftery, 1995]:

  Delta BIC | Evidence against the higher-BIC model
  0-2       | Not worth mentioning
  2-6       | Positive
  6-10      | Strong
  >10       | Very strong

AIC versus BIC
From J. Cavanaugh's slides:
- BIC is more parsimonious than AIC, because the frequentist analysis (AIC) incorporates estimation uncertainty only, while the Bayesian analysis (BIC) incorporates estimation uncertainty AND parameter uncertainty.
- AIC measures the predictiveness of a model, while BIC measures its descriptiveness.
- AIC is asymptotically efficient yet not consistent; BIC is consistent yet not asymptotically efficient.

Deviance Information Criterion (DIC)
- Deviance: $D(y, \theta) = -2 \log p(y \mid \theta)$
- Point-estimate deviance: $D_{\hat{\theta}}(y) = D(y, \hat{\theta})$
- Expected deviance: $D_{\mathrm{avg}}(y) = E[D(y, \theta) \mid y]$, the expectation over the posterior
- Estimated expected deviance, from $L$ posterior samples $\theta^l$: $\hat{D}_{\mathrm{avg}}(y) = \frac{1}{L} \sum_{l=1}^{L} D(y, \theta^l)$
- Effective number of parameters: $p_D = \hat{D}_{\mathrm{avg}}(y) - D_{\hat{\theta}}(y)$
- DIC: $\hat{D}_{\mathrm{avg}}(y) + p_D$
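A sketch of these definitions in code, under stated assumptions: the model is a unit-variance Gaussian with unknown mean, and `samples` stands in for MCMC draws (for this model with a flat prior, they are in fact exact posterior draws). All names and numbers are hypothetical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(0.5, 1.0, size=100)

def log_lik(y, theta):
    # log p(y | theta) for a unit-variance Gaussian with unknown mean theta
    return norm.logpdf(y, loc=theta).sum()

# Stand-in for MCMC draws: the exact posterior N(mean(y), 1/N) under a flat prior
samples = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=2000)

D = np.array([-2.0 * log_lik(y, t) for t in samples])  # deviance per draw
D_avg = D.mean()                           # estimated expected deviance
D_hat = -2.0 * log_lik(y, samples.mean())  # deviance at the point estimate
p_D = D_avg - D_hat                        # effective number of parameters (~1 here)
print("DIC =", D_avg + p_D)
```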

Bayes factor
Bayesian model selection was given above as

$$p(M \mid y) = \frac{p(y \mid M)\, p(M)}{p(y)}$$

With equal (noninformative) model priors, the ratio of the posteriors of two models reduces to the Bayes factor [Kass & Raftery, 1995]:

$$K = \frac{\int p(y \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1}{\int p(y \mid \theta_2, M_2)\, p(\theta_2 \mid M_2)\, d\theta_2}$$

Example 1: Is the coin fair?
Suppose we tossed a coin 200 times and got heads 115 times. The likelihood is then

$$\binom{200}{115} p^{115} (1-p)^{85}$$

We compare two models: $M_1$: $p = 0.5$, and $M_2$: $p$ unknown, with a uniform prior. Then

$$P(X = 115 \mid M_1) = \binom{200}{115} \left( \frac{1}{2} \right)^{200} = 0.00595,$$
$$P(X = 115 \mid M_2) = \int_0^1 \binom{200}{115} q^{115} (1-q)^{85}\, dq = \frac{1}{201} = 0.00497.$$

Hence $K = 0.00595 / 0.00497 = 1.197$, which is "barely worth mentioning": the data give only very weak evidence in favour of the fair coin, and certainly no strong evidence that the coin is unfair.
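The slide's numbers are easy to verify with scipy (the $M_2$ integral collapses to $1/(n+1)$ for a uniform prior):

```python
from scipy.stats import binom

p_m1 = binom.pmf(115, 200, 0.5)  # P(X = 115 | M1), the fair coin
p_m2 = 1.0 / 201.0               # P(X = 115 | M2): Beta integral, 1/(n+1)
print(p_m1, p_m2, p_m1 / p_m2)   # ~0.00595, ~0.00497, K ~ 1.197
```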

Example 2: Finding the cluster count of a Gaussian mixture model
[Figure from http://scikit-learn.org/stable/modules/mixture.html]
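A sketch in the spirit of the linked scikit-learn example, selecting the GMM component count by BIC on synthetic data (the data and parameter values are demo choices, not from the slide):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3.0, 1.0, size=(150, 2)),
               rng.normal(3.0, 1.0, size=(150, 2))])  # two well-separated clusters

# Fit GMMs with 1..5 components and score each by BIC
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
print(bics, "-> best k:", min(bics, key=bics.get))  # lowest BIC should pick k = 2
```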

Bayes factor scale of significance
According to [Kass & Raftery, 1995]:

  K      | Evidence in favour of M1
  1-3    | Not worth more than a bare mention
  3-20   | Positive
  20-150 | Strong
  >150   | Very strong

Bayes factor and BIC
Let $K$ be the Bayes factor of two models $M_1$ and $M_2$, and $\mathrm{BIC}(M_1)$ and $\mathrm{BIC}(M_2)$ the corresponding BICs. Then

$$\frac{2 \log K - [\mathrm{BIC}(M_2) - \mathrm{BIC}(M_1)]}{2 \log K} \to 0 \quad \text{as } N \to \infty.$$

Thus $\Delta\mathrm{BIC} = \mathrm{BIC}(M_2) - \mathrm{BIC}(M_1)$ approximates $2 \log K$; note that the model with the lower BIC is the favoured one.
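A one-function sketch of this approximation (the BIC values below are illustrative only):

```python
import numpy as np

def approx_bayes_factor(bic_m1: float, bic_m2: float) -> float:
    """K ~ exp((BIC(M2) - BIC(M1)) / 2), from 2 log K ~ BIC(M2) - BIC(M1)."""
    return np.exp((bic_m2 - bic_m1) / 2.0)

# Illustrative values: a BIC difference of 6 maps to K ~ exp(3) ~ 20 ("strong")
print(approx_bayes_factor(840.0, 846.0))
```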

Model selection for variational inference
Given data $y$ for which two models $M_1$ and $M_2$ are proposed, the exact and approximate posteriors of the models are

$$M_1: \; p(\theta_1 \mid y) \approx q(\theta_1; v_1), \qquad M_2: \; p(\theta_2 \mid y) \approx q(\theta_2; v_2),$$

with parameter sets $v_1$ and $v_2$ learned from $y$. The Bayes factor can then be approximated as

$$K = \frac{p(y \mid M_1)}{p(y \mid M_2)} \approx \frac{\exp\{\mathcal{L}[q(v_1); y]\}}{\exp\{\mathcal{L}[q(v_2); y]\}},$$

where $\mathcal{L}[q(v_i); y]$ is the variational lower bound (which lower-bounds $\log p(y \mid M_i)$) computed with the parameter set $v_i$.

Bayes factor for MCMC