Helsinki University of Technology


Cross-Validation, Information Criteria, Expected Utilities and the Effective Number of Parameters

Aki Vehtari and Jouko Lampinen
Laboratory of Computational Engineering

Introduction

Expected utility
- estimates the predictive performance of the model
- makes it possible to use application-specific utilities
- useful in both model assessment and model comparison

Estimation of the expected utility
- cross-validation
- information criteria

Expected utility

Given
- the training data D = {(x^(i), y^(i)); i = 1, 2, ..., n}
- a model M
- a future input x^(n+1)
- the posterior predictive distribution p(y^(n+1) | x^(n+1), D, M)

a utility function u compares the predictive distribution to a future observation.

Examples of generic utilities
- predictive likelihood: u = p(y^(n+1) | x^(n+1), D, M)
- absolute error: u = |ŷ^(n+1) − y^(n+1)|

The expected utility is obtained by taking the expectation over the future data:
ū = E_(x^(n+1), y^(n+1))[ u(y^(n+1), x^(n+1), D, M) ]
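
As a concrete illustration of the two generic utilities, the sketch below (not from the slides; the Gaussian observation model, the fixed noise scale, and all function names are assumptions) estimates the predictive likelihood and the absolute error for a single future observation when the posterior predictive distribution is represented by Monte Carlo draws.

```python
# A minimal sketch, assuming the predictive distribution is given by MC draws
# of the model's mean and a Gaussian observation model with known scale.
import numpy as np
from scipy.stats import norm

def predictive_likelihood(pred_draws, y_future, scale=1.0):
    """MC estimate of u = p(y_(n+1) | x_(n+1), D, M) (illustrative model)."""
    return np.mean(norm.pdf(y_future, loc=pred_draws, scale=scale))

def absolute_error(pred_draws, y_future):
    """u = |yhat_(n+1) - y_(n+1)| with yhat taken as the predictive mean."""
    return np.abs(np.mean(pred_draws) - y_future)

# Illustration only: stand-in draws from p(y | x, D, M).
rng = np.random.default_rng(0)
draws = rng.normal(loc=2.0, scale=0.5, size=4000)
print(predictive_likelihood(draws, y_future=2.3, scale=0.5))
print(absolute_error(draws, y_future=2.3))
```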

Estimating the expected utility

The expected utility
ū = E_(x^(n+1), y^(n+1))[ u(y^(n+1), x^(n+1), D, M) ]

The distribution of (x^(n+1), y^(n+1)) is unknown, so the expected utility has to be approximated
- sample re-use: cross-validation
- asymptotic approximations: information criteria

Cross-validation

The expected utility
ū = E_(x^(n+1), y^(n+1))[ u(y^(n+1), x^(n+1), D, M) ]

The distribution of (x^(n+1), y^(n+1)) is estimated using the observed (x^(i), y^(i)), and the predictive distribution is replaced with a collection of CV predictive distributions
{ p(y^(i) | x^(i), D^(\i), M); i = 1, 2, ..., n }
where D^(\i) denotes all the elements of D except (x^(i), y^(i)).

The CV predictive distributions are compared to the actual y^(i) using the utility u, and the expectation is taken over i:
ū_CV = E_i[ u(y^(i), x^(i), D^(\i), M) ]
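
The following is a minimal sketch of ū_CV with the log predictive density as the utility. It is not the authors' code: it assumes a conjugate Bayesian linear regression with known noise variance so that the CV predictive densities p(y^(i) | x^(i), D^(\i), M) are available in closed form, and all function names are illustrative.

```python
# Leave-one-out CV estimate of the expected utility (log predictive density),
# for a conjugate Bayesian linear regression (illustrative assumption).
import numpy as np
from scipy.stats import norm

def posterior(X, y, sigma2=1.0, tau2=10.0):
    """Gaussian posterior over weights for y ~ N(Xw, sigma2 I), w ~ N(0, tau2 I)."""
    S = np.linalg.inv(X.T @ X / sigma2 + np.eye(X.shape[1]) / tau2)
    m = S @ X.T @ y / sigma2
    return m, S

def log_pred_density(x_star, y_star, m, S, sigma2=1.0):
    """log p(y* | x*, D, M): Gaussian predictive for the conjugate model."""
    mean = x_star @ m
    var = sigma2 + x_star @ S @ x_star
    return norm.logpdf(y_star, loc=mean, scale=np.sqrt(var))

def loo_expected_utility(X, y, sigma2=1.0, tau2=10.0):
    """u_CV = E_i[ log p(y_i | x_i, D_(-i), M) ]."""
    n = len(y)
    lpd = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                    # D_(-i): drop case i
        m, S = posterior(X[mask], y[mask], sigma2, tau2)
        lpd[i] = log_pred_density(X[i], y[i], m, S, sigma2)
    return lpd.mean()

# Illustration with synthetic data.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=1.0, size=30)
print(loo_expected_utility(X, y))
```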

Information criteria

The expected utility
ū = E_(x^(n+1), y^(n+1))[ u(y^(n+1), x^(n+1), D, M) ]

The predictive distribution is replaced with a plug-in predictive distribution p(y^(n+1) | x^(n+1), θ̂, D, M), where θ̂ is a point estimate.

Using a second-order Taylor approximation we obtain
ū_NIC = E_i[ u(y^(i), x^(i), θ̂, D, M) ] + tr(K J^(-1))
where K = Var[ū'(θ̂)] and J = E[ū''(θ̂)], and ū'(θ̂) and ū''(θ̂) are the first and second derivatives of ū with respect to θ.

Using a Monte Carlo approximation of the correction term,
tr(K J^(-1)) ≈ 2 (E_θ[ū(θ)] − ū(E_θ[θ])),
so that
ū ≈ ū(E_θ[θ]) + 2 (E_θ[ū(θ)] − ū(E_θ[θ]))
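
The Monte Carlo form of the correction can be computed directly from posterior draws. The sketch below is an illustration under stated assumptions (a one-parameter Gaussian model with known scale, invented function names), not the implementation behind the slides.

```python
# Plug-in utility plus the Monte Carlo correction from this slide:
# u ~= u(E[theta]) + 2 (E[u(theta)] - u(E[theta])).
import numpy as np
from scipy.stats import norm

def mean_log_lik(theta, y, scale=1.0):
    """u(theta): mean log-likelihood of the data at parameter value theta."""
    return norm.logpdf(y, loc=theta, scale=scale).mean()

def corrected_utility(theta_draws, y, scale=1.0):
    u_at_mean = mean_log_lik(theta_draws.mean(), y, scale)               # u(E[theta])
    mean_u = np.mean([mean_log_lik(t, y, scale) for t in theta_draws])   # E[u(theta)]
    correction = 2.0 * (mean_u - u_at_mean)       # MC estimate of tr(K J^(-1))
    return u_at_mean + correction

# Illustration with synthetic data and stand-in posterior draws.
rng = np.random.default_rng(2)
y = rng.normal(loc=0.3, size=50)
theta_draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(len(y)), size=2000)
print(corrected_utility(theta_draws, y))
```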

The effective number of parameters

Using the log-likelihood utility multiplied by n,
L(θ̂) = Σ_i log p(y^(i) | x^(i), θ̂, D, M),
the correction term gives the effective number of parameters:
−tr(K J^(-1)) = p_eff, with 0 < p_eff ≤ p.

p_eff is influenced by
- the amount of prior influence
- dependence between the parameters
- the number of training samples (p_eff ≤ n)
- the distribution of the noise in the samples
- the complexity of the underlying phenomenon being modeled
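
With the log-likelihood utility multiplied by n, the Monte Carlo approximation of −tr(K J^(-1)) from the previous slide becomes 2 (L(E_θ[θ]) − E_θ[L(θ)]). A short hedged sketch, again for an illustrative one-parameter Gaussian model (not the slides' examples):

```python
# p_eff = -tr(K J^(-1)) estimated via the Monte Carlo approximation,
# i.e. 2 (L(E[theta]) - E[L(theta)]) with L the total log-likelihood.
import numpy as np
from scipy.stats import norm

def total_log_lik(theta, y, scale=1.0):
    """L(theta) = sum_i log p(y_i | theta) for the illustrative model."""
    return norm.logpdf(y, loc=theta, scale=scale).sum()

def p_eff(theta_draws, y, scale=1.0):
    L_at_mean = total_log_lik(theta_draws.mean(), y, scale)
    mean_L = np.mean([total_log_lik(t, y, scale) for t in theta_draws])
    return 2.0 * (L_at_mean - mean_L)

rng = np.random.default_rng(5)
y = rng.normal(loc=0.3, size=50)
theta_draws = rng.normal(loc=y.mean(), scale=1 / np.sqrt(len(y)), size=2000)
print(p_eff(theta_draws, y))   # close to 1 for this one-parameter model
```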

The effective number of parameters

There is no need to estimate p_eff in the cross-validation approach, but it can still be computed for comparison. Using the log-likelihood utility multiplied by n,
p_eff,CV = Σ_i log p(y^(i) | x^(i), D, M) − Σ_i log p(y^(i) | x^(i), D^(\i), M) = L_MPO − L_CV
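
A hedged sketch of p_eff,CV = L_MPO − L_CV for the same illustrative conjugate linear regression used above (the model, priors, and function names are assumptions, not the slides' examples):

```python
# p_eff,CV: sum of log posterior predictive densities at the training points
# minus the sum of leave-one-out log predictive densities.
import numpy as np
from scipy.stats import norm

def log_pred(X_fit, y_fit, x_star, y_star, sigma2=1.0, tau2=10.0):
    """log p(y* | x*, fit data) for Bayesian linear regression, known noise."""
    S = np.linalg.inv(X_fit.T @ X_fit / sigma2 + np.eye(X_fit.shape[1]) / tau2)
    m = S @ X_fit.T @ y_fit / sigma2
    var = sigma2 + x_star @ S @ x_star
    return norm.logpdf(y_star, loc=x_star @ m, scale=np.sqrt(var))

def p_eff_cv(X, y):
    n = len(y)
    L_MPO = sum(log_pred(X, y, X[i], y[i]) for i in range(n))
    L_CV = sum(log_pred(np.delete(X, i, 0), np.delete(y, i), X[i], y[i])
               for i in range(n))
    return L_MPO - L_CV

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
y = X @ np.array([0.5, -1.0]) + rng.normal(size=40)
print(p_eff_cv(X, y))   # the p_eff,CV estimate for this model and data
```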

Example: robust regression on the stack loss data
- 3 predictor variables, 21 cases
- linear regression with 5 different error distribution models: 1) normal, 2) double exponential, 3) logistic, 4) t_4, 5) t_4 as a scale mixture

[Figure: CV estimates of the expected predictive deviance for the five error models]

Example: robust regression on the stack loss data (continued)
- the information criterion slightly underestimates the effective number of parameters
- the information criterion slightly underestimates the expected predictive deviance

[Figure: expected predictive deviance (left) and effective number of parameters (right) for the five error models; information criterion and CV estimates]

Example: robust regression on the stack loss data (continued)
- the information criterion gives just point estimates
- in the CV approach it is easy to estimate the uncertainty

[Figure: expected predictive deviance for the five error models (left) and pairwise comparison of the t_4 scale mixture model to the others (right)]

Example: concrete quality prediction
- 27 predictor variables, 215 cases
- Gaussian process model with 4 different error distribution models: 1) N, 2) t_ν, 3) input-dependent N, 4) input-dependent t_ν

[Figure: expected mean predictive likelihood for the four models (left) and pairwise comparison of the input-dependent t_ν model to the others (right)]

Dependent data

Different dependencies
- group dependencies
- time series
- spatial

The information criterion assumes independence; CV can handle some finite-range dependencies.
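
To make the group-dependence point concrete, the sketch below (illustrative only; the grouping, fold count, and function names are assumptions) contrasts random CV folds with group CV folds, in which all cases from a group are left out together, as in the forest-scene example on the next slide.

```python
# Random CV vs. group CV fold construction (illustration only).
import numpy as np

def random_cv_folds(n, n_folds, rng):
    """Assign cases to folds at random, ignoring any grouping."""
    idx = rng.permutation(n)
    return np.array_split(idx, n_folds)

def group_cv_folds(groups, n_folds, rng):
    """Assign whole groups to folds, so dependent cases leave together."""
    unique = rng.permutation(np.unique(groups))
    group_folds = np.array_split(unique, n_folds)
    return [np.flatnonzero(np.isin(groups, g)) for g in group_folds]

rng = np.random.default_rng(4)
groups = np.repeat(np.arange(12), 8)          # 12 groups, 8 dependent cases each
rand_folds = random_cv_folds(len(groups), 4, rng)
grp_folds = group_cv_folds(groups, 4, rng)
print(sorted(set(groups[rand_folds[0]])))     # a random fold touches many groups
print(sorted(set(groups[grp_folds[0]])))      # a group fold contains whole groups only
```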

Example of group dependencies: forest scene classification
- 18 predictor variables, 48 × 100 cases
- 20-hidden-unit MLP with the logistic likelihood model

[Figure: expected classification accuracy (%) with group CV and random CV]

Example: forest scene classification (continued)
- the information criterion assumes independent data points
- CV can handle group dependencies

[Figure: expected classification accuracy (%) (left) and effective number of parameters (right) with group CV and random CV; the number of parameters in the model is marked for reference]

Example: longitudinal data, the Six Cities study
- 2 predictor variables and 537 children
- linear model with an interaction term, three different link functions, and a Bernoulli likelihood

            Wheezing status at age     Mother
            7    8    9    10          smoking
Child 1     0    0    0    0           0
Child 2     0    0    1    0           0
Child 3     0    1    0    1           1
...
Child n     1    1    1    1           1

[Figure: expected predictive deviance for the 1) logit, 2) probit, and 3) cloglog models; CV, canonical, and mean estimates]

Cross-validation vs. information criteria

Cross-validation
- uses full predictive distributions
- deals directly with predictive distributions
- easy to estimate the uncertainty
- can handle certain finite-range dependencies
- up to 10× more computation

Information criteria
- use plug-in predictive distributions
- parametrization problems
- estimation of the uncertainty is under investigation
- assume independence
- no additional computation after sampling from the posterior