Model Choice Lecture 2

Brian Ripley
http://www.stats.ox.ac.uk/~ripley/modelchoice/

Bayesian approaches

Note the plural: I think Bayesians are rarely Bayesian in their model selection, and Geisser's quote showed that model choice is perhaps not a strict Bayesian concept.

Assume M (finite) models, exactly one of which is true, and let T denote the data in the training set. In the Bayesian formulation, models are compared via P{M | T}, the posterior probability assigned to model M. Since

P{M | T} ∝ p(T | M) p_M,   where   p(T | M) = ∫ p(T | M, θ) p(θ) dθ,

the ratio in comparing models M_1 and M_2 is proportional to p(T | M_2)/p(T | M_1), known as the Bayes factor. We assume (often implicitly) that the models have equal prior probabilities.
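
As a rough illustration of the marginal likelihood p(T | M) and the Bayes factor, here is a small R sketch; the data (34 successes in 100 trials) and the two candidate models are hypothetical, not from the lecture.

## Hypothetical data: 34 successes in 100 Bernoulli trials.
## M1: theta fixed at 0.5; M2: theta unknown with a Uniform(0, 1) prior.
y <- 34; n <- 100
pT.M1 <- dbinom(y, n, 0.5)                       # no free parameter, so no integral
pT.M2 <- integrate(function(th) dbinom(y, n, th) * dunif(th),
                   lower = 0, upper = 1)$value   # p(T | M2) = int p(T | th) p(th) dth
BF21 <- pT.M2 / pT.M1                            # Bayes factor for M2 over M1
c(BF21 = BF21, postM2 = BF21 / (1 + BF21))       # posterior P(M2 | T) with equal priors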

However, a formal Bayesian approach then averages predictions from models, weighting by P{M | T}, unless a very peculiar loss function is in use. And this has been used for a long time, despite recent attempts to claim the credit for Bayesian Model Averaging.

Suppose we just use the Bayes factor as a guide. The difficulty is in evaluating p(T | M). Asymptotics are not very useful for Bayesian methods, as the prior on θ is often very important in providing smoothing, yet asymptotically negligible. We can expand the log posterior density via a Laplace approximation and drop various terms, eventually reaching

log p(T | M) ≈ const + L(θ̂; T) - (1/2) log |H|

where H is the Hessian of the log-likelihood and we needed to assume that the prior is diffuse.
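
The Laplace route can be mimicked numerically. In the R sketch below (the exponential model and simulated data are stand-ins of my own), optim maximises the log-likelihood and returns the Hessian, from which L(θ̂; T) - (1/2) log |H| is evaluated, dropping the prior and constant terms as above.

set.seed(1)
x <- rexp(50, rate = 2)                           # hypothetical iid sample
## Work on the log scale so the rate stays positive.
negll <- function(lr) -sum(dexp(x, exp(lr), log = TRUE))
fit <- optim(0, negll, method = "BFGS", hessian = TRUE)
L.hat   <- -fit$value                             # L(theta-hat; T)
logdetH <- as.numeric(determinant(fit$hessian, logarithm = TRUE)$modulus)
L.hat - 0.5 * logdetH                             # approximate log p(T | M), up to const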

BIC aka SBC

For an iid random sample of size n from the assumed model, the penalty might be roughly proportional to ((1/2) log n) p, provided the parameters are identifiable. This is Schwarz's BIC, up to a factor of minus two. As with AIC, the model with minimal BIC is chosen. Thus

BIC = -2 log p(T | M) + const ≈ -2 L(θ̂; T) + (log n) p.

There are a number of variants, depending e.g. on which terms are dropped in the approximations and what estimator is used for θ̂ (MLE, MAP, ...).
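
As a quick check of the formula, the R sketch below (on a simulated placeholder regression) computes BIC by hand and compares it with the built-in BIC():

set.seed(2)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 1 + 0.5 * d$x1 + rnorm(80)
fit <- lm(y ~ x1 + x2, data = d)
n <- nrow(d)
p <- length(coef(fit)) + 1                  # parameters counted incl. the error variance
c(by.hand = -2 * as.numeric(logLik(fit)) + log(n) * p,
  builtin = BIC(fit))                       # the two values should agree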

Crucial assumptions

1. The data were derived as an iid sample. (What about e.g. random-effects models? Originally for linear models only.)
2. Choosing a single model is relevant in the Bayesian approach.
3. There is a true model.
4. The prior can be neglected. We may not obtain much information about parameters which are rarely effective, even in very large samples.
5. The simple asymptotics are adequate, and the rate of data collection on each parameter would be the same. We should be interested in comparing different models for the same n, and in many problems p will be comparable with n.

Note that as this is trying to choose an explanation, we would expect it neither to systematically overfit nor to underfit.

BIC vs the exact formula, for variable selection in a 4-variable logistic regression on 189 subjects.

Hannan-Quinn and asymptotics

The criterion of Hannan & Quinn (1979) is

HQ = -2L(θ̂; T) + (2c log log n) p

where c > 1 is a constant. It was originally derived for determining the (true) order p_0 of an AR(p) process. Like BIC it is (under regularity conditions) strongly consistent: that is, if there is a single true p, the strategy of using BIC or HQ will asymptotically choose p with probability tending to one. This is not true of AIC. If there is not a true p we can ask for the model which is least false, that is, the P which is closest to the true P in the sense of Kullback-Leibler divergence. The problem is that for nested models such as AR(p), if one model is true then so is any larger model. In that case, BIC and HQ are still strongly consistent, but AIC will tend to choose increasingly complex models and will select an order p ≥ p_0 with probability tending to one.
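
A rough R sketch of order selection for a simulated AR(2) series (the series, the range of orders and the choice c = 1.01 are my own): fit AR(p) by maximum likelihood and tabulate AIC, BIC and HQ.

set.seed(3)
x <- arima.sim(list(ar = c(0.6, -0.3)), n = 500)     # true order p0 = 2
n <- length(x)
crit <- sapply(0:6, function(p) {
  fit <- arima(x, order = c(p, 0, 0), method = "ML")
  k <- p + 2                                         # AR coefficients + mean + variance
  c(p   = p,
    AIC = -2 * fit$loglik + 2 * k,
    BIC = -2 * fit$loglik + log(n) * k,
    HQ  = -2 * fit$loglik + 2 * 1.01 * log(log(n)) * k)
})
t(crit)                                   # choose the order minimising each criterion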

For prediction, the relevant asymptotics concern the efficiency of the predictions, for example their mean square error or error rate. For variable selection in a regression, AIC, PRESS and C_p are asymptotically efficient, and BIC and HQ are not. Similarly for AR(p), AIC and AICc are asymptotically efficient and BIC and HQ are not.

Best of both worlds?

AIC has desirable asymptotic properties for prediction and BIC for explanation. Is it possible to have both? People have tried, but Yang (Biometrika, 2005) showed that the answer is essentially no (and still no if model averaging is allowed).
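
Returning to the regression criteria above, PRESS and Mallows' C_p are easy to compute for least-squares fits. The R sketch below uses simulated data of my own, the leave-one-out identity for PRESS, and the full model's error variance for C_p.

set.seed(2)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 1 + 0.5 * d$x1 + rnorm(80)
press <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
cp <- function(fit, full) {                       # Cp = RSS/s2 - n + 2p
  s2 <- summary(full)$sigma^2
  sum(residuals(fit)^2) / s2 - nrow(d) + 2 * length(coef(fit))
}
full  <- lm(y ~ x1 + x2, data = d)
small <- lm(y ~ x1, data = d)
c(PRESS.small = press(small), PRESS.full = press(full),
  Cp.small = cp(small, full), Cp.full = cp(full, full))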

Example: Blood groups

Landsteiner's classification of human blood groups as [O A B AB] clearly suggests two antigens are involved (O = 'ohne'). There were two theories as to what Mendelian genetics are involved: one had three alleles at a single locus, the other two alleles at each of two loci; each gives a two-parameter model for the frequencies of occurrence of the blood types. Bernstein (1924) collected data on 502 Japanese living in Korea, and found N_A = 212, N_B = 103, N_AB = 39, N_O = 148. This gives -2L(θ̂; T) (at the MLEs) of 1254.2 and 1293.9. The models are not nested, and they have the same number of free parameters. So although we cannot use a testing framework (except perhaps that of Cox (1961)), both AIC and BIC can be used, and they strongly support the single-locus model.
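
As a check, both models can be fitted by maximum likelihood in a few lines of R. The parameterisations below are my own reconstruction of the two hypotheses (allele frequencies p and q for the single-locus model; the probabilities of carrying at least one dominant allele at each locus for the two-locus model), and the multinomial coefficient is dropped, so the -2L values should come out close to those quoted above.

counts <- c(A = 212, B = 103, AB = 39, O = 148)
## One locus, three alleles with frequencies p, q and r = 1 - p - q:
probs1 <- function(th) { p <- th[1]; q <- th[2]; r <- 1 - p - q
                         c(p^2 + 2*p*r, q^2 + 2*q*r, 2*p*q, r^2) }
## Two loci; alpha, beta = P(at least one dominant allele at each locus):
probs2 <- function(th) { a <- th[1]; b <- th[2]
                         c(a*(1-b), (1-a)*b, a*b, (1-a)*(1-b)) }
negll <- function(th, probs) -sum(counts * log(probs(th)))
fit1 <- optim(c(0.3, 0.15), negll, probs = probs1, method = "L-BFGS-B",
              lower = 0.01, upper = 0.49)
fit2 <- optim(c(0.5, 0.3),  negll, probs = probs2, method = "L-BFGS-B",
              lower = 0.01, upper = 0.99)
c(one.locus = 2 * fit1$value, two.loci = 2 * fit2$value)   # -2 L at the MLEs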

Deviance Information Criterion (DIC)

From Spiegelhalter et al. (2002). For a Bayesian setting where prior information is not negligible, and the model is assumed to be a good approximation but not necessarily true. Consider the deviance to be

D(θ) = deviance(θ) = const(T) - 2L(θ; T).

In GLMs we use D(θ̂) as the (scaled) (residual) deviance. Define

p_D = D̄ - D(θ̄)

where D̄ denotes D(θ) averaged over p(θ | T), and θ̄ is our estimate of the least-false parameter value, usually the posterior mean of θ (but perhaps the median or mode of the posterior distribution). Then define

DIC = D(θ̄) + 2 p_D

Clearly DIC is AIC-like, but:

- Like NIC it allows for non-ML fitting; in particular it can incorporate the regularization effect of the prior, which should reduce the effective number of parameters.
- It is not necessary (but is usual) that p_D ≥ 0.
- DIC is explicitly meant to apply to non-nested, non-iid problems.
- DIC is intended to be approximated via MCMC samples from the posterior density of θ given T.
- On the other hand, DIC needs an explicit formula for the likelihood (up to a model-independent normalizing constant).

In practice p_D is often close to p, and indeed asymptotically it is equivalent to the modified p used by Takeuchi and in NIC.
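
A minimal R sketch of the computation, with a Poisson model and conjugate draws standing in for MCMC output (both my own choices): estimate D̄ and D(θ̄) from the posterior samples and combine them into p_D and DIC.

set.seed(4)
y <- rpois(40, lambda = 3)                          # hypothetical data
## Conjugate Gamma posterior for the Poisson rate, standing in for MCMC draws:
theta <- rgamma(5000, shape = 1 + sum(y), rate = 1 + length(y))
D <- function(th) -2 * sum(dpois(y, th, log = TRUE))   # deviance up to const(T)
Dbar <- mean(sapply(theta, D))                      # posterior mean of the deviance
Dhat <- D(mean(theta))                              # deviance at the posterior mean
pD   <- Dbar - Dhat                                 # effective number of parameters
c(pD = pD, DIC = Dhat + 2 * pD)                     # equivalently Dbar + pD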

Focussed Information Criterion (FIC)

AIC is aimed at prediction in general, but suppose we are interested in a specific prediction. Claeskens & Hjort (2003) considered how this should affect the strategy for model selection. The criterion they derive, called FIC, is rather complicated. Suppose we have a (multiple) linear regression and we are interested in prediction at a specific point in the covariate space, or we are interested in a single regression coefficient. Then it is reasonable that the model we choose for prediction might depend on which point or which coefficient. Note though that this applies to the whole model-selection strategy, not just to a formal criterion: the panoply of models will very likely depend on the question.

Example of FIC

A classic example is the dataset menarche in MASS, on the age of menarche of 3918 girls in Warsaw. This can be fitted by a logistic regression, but there is some evidence that non-linear terms in age are needed, especially for small ages. Thus FIC selects a linear model for predictions near the center of the age distribution, a quadratic model for the first quartile and a quartic model for prediction of the 1% quantile (at 10.76 years). However, there are some strange features of this dataset: 27% of the data are the observations that all 1049 of the 17-year-old girls had reached menarche. It is unclear that using all the data (or an unweighted MLE) is appropriate for the questions asked.
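
The candidate fits are easy to produce in R; the sketch below does not compute FIC itself, only the polynomial logistic fits whose predictions a focussed criterion would compare, at a few focus ages of my own choosing.

library(MASS)                                   # for the menarche data frame
fits <- lapply(1:4, function(deg)
  glm(cbind(Menarche, Total - Menarche) ~ poly(Age, deg),
      family = binomial, data = menarche))
ages <- data.frame(Age = c(10.76, 12.5, 13.1))  # near the 1% quantile vs the center
round(sapply(fits, predict, newdata = ages, type = "response"), 3)
sapply(fits, AIC)                               # a global criterion, for contrast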

Shrinkage

The reason we might want to use variable selection for prediction is not that we think some of the variables are irrelevant (if we did, why were they measured in the first place?), but that they are not relevant enough. The issue is the age-old one of bias vs variance: adding a weakly relevant variable reduces the bias of the predictions, but it also increases their uncertainty.

Consider a multiple linear regression. For small values of the true regression coefficient β_i, the mean square error of the (interesting) predictions will decrease if β_i is forced to be zero, whereas for larger values it will increase. Criteria such as FIC are estimating whether the effect of dropping the variable is beneficial.

Can we avoid this? Yes, if we do not force β_i to zero but shrink it towards zero. This results in biased but less variable predictions.

Ridge regression

For this slide only, suppose we have an orthogonal (or even orthonormal) regression, and suppose we shrink all our regression coefficients (or all but the intercept) by a factor λ, 0 < λ ≤ 1. You can show that a small amount of shrinkage (λ slightly below 1) always reduces the MSE, so some (unknown) amount of shrinkage is always worthwhile. James & Stein showed that for p > 2 there was a formula for shrinkage that would always reduce the MSE, but unfortunately it might correspond to λ < 0. Sclove's variant is

β̃_i = (1 - c/F)_+ β̂_i

where c > 0 and F is the F-statistic for all the coefficients being zero. (http://www.stat.ucla.edu/~cocteau/stat120b/lectures/lecture4.pdf)

In ridge regression we minimize the sum of squares plus λ times the sum of squares of the coefficients (usually omitting the intercept, and suitably scaling the variables). There are lots of related ideas, including actually setting small β_i to 0 and shrinking the larger ones (e.g. the LASSO).
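
Here is a short R sketch of ridge regression on deliberately collinear simulated data (the data and λ grid are my own): MASS::lm.ridge fits the whole path and select() reports a GCV-based choice of λ.

library(MASS)
set.seed(5)
x1 <- rnorm(60); x2 <- x1 + rnorm(60, sd = 0.3)   # strongly correlated regressors
y  <- 1 + x1 + 0.5 * x2 + rnorm(60)
fit <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))
select(fit)                                       # HKB, L-W and GCV choices of lambda
matplot(fit$lambda, coef(fit)[, -1], type = "l",
        xlab = "lambda", ylab = "coefficient")    # coefficients shrink as lambda grows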

Model averaging For prediction purposes (and that applies to almost all Bayesians) we should average the predictions over models. We do not choose a single model. What do we average? Basic decision theory tells us: The probability predictions made by the models. For linear regression this amounts to averaging the coefficients over the models (being zero where a regressor is excluded), and this becomes a form of shrinkage. [Other forms of shrinkage like ridge regression may be as good at very much lower computational cost.] Note that we may not want to average over all models. We may want to choose a subset for computational reasons, or for plausibility.

How do we choose the weights?

- In the Bayesian theory this is clear, via the Bayes factors. In practice this is discredited: even if we can compute them accurately (and via MCMC we may have a chance), we assume that one and exactly one model is true. [Box quote!] In practice Bayes factors can depend heavily on aspects of model inadequacy which are of no interest.
- Via cross-validation (goes back to Stone, 1974).
- Via bootstrapping (LeBlanc & Tibshirani, 1993).
- As an extended estimation problem, with the weights depending on the sample via a model (e.g. a multiple logistic); so-called stacked generalization and mixtures of experts.
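
The cross-validation route can be sketched in a few lines of R ('stacking'): choose non-negative weights summing to one that minimise the cross-validated squared error of the averaged prediction. The data, the three candidate models and the softmax parameterisation of the weights are all my own choices.

set.seed(6)
n <- 100
d <- data.frame(x = runif(n, 0, 3))
d$y <- sin(d$x) + rnorm(n, sd = 0.3)
forms <- list(y ~ x, y ~ poly(x, 2), y ~ poly(x, 5))   # candidate models
fold <- sample(rep(1:10, length.out = n))              # 10-fold CV assignment
P <- matrix(NA, n, length(forms))                      # out-of-fold predictions
for (k in 1:10) for (j in seq_along(forms)) {
  fit <- lm(forms[[j]], data = d[fold != k, ])
  P[fold == k, j] <- predict(fit, newdata = d[fold == k, ])
}
cvloss <- function(a) {                  # weights via a softmax: >= 0 and sum to 1
  w <- exp(c(0, a)); w <- w / sum(w)
  mean((d$y - P %*% w)^2)
}
opt <- optim(c(0, 0), cvloss)
w <- exp(c(0, opt$par)); round(w / sum(w), 3)          # the stacking weights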

Bagging, boosting, random forests

Model averaging ideas have been much explored in the field of classification trees. In bagging, models are fitted from bootstrap resamples of the data and weighted equally. In boosting, each additional model is chosen to (attempt to) repair the inadequacies of the current averaged model, by resampling biased towards the mistakes. In random forests, the tree-construction algorithm randomly restricts itself at the choice of each split.
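
Bagging, at least, is simple to do by hand. The R sketch below uses rpart and its kyphosis data (the number of resamples and the 0.5 threshold are my own choices): trees are fitted to bootstrap resamples and their class probabilities averaged with equal weights; the final table is only a resubstitution check.

library(rpart)
set.seed(7)
B <- 50
trees <- lapply(1:B, function(b) {
  idx <- sample(nrow(kyphosis), replace = TRUE)        # bootstrap resample of rows
  rpart(Kyphosis ~ Age + Number + Start, data = kyphosis[idx, ], method = "class")
})
probs <- sapply(trees, function(t) predict(t, kyphosis, type = "prob")[, "present"])
bagged <- rowMeans(probs)                              # equally weighted average
table(predicted = bagged > 0.5, observed = kyphosis$Kyphosis)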

(Practical) model selection in 2010

The concept of a model ought to be much, much larger than in 1977. Even in the 1990s people attempted to fit neural networks with half a million free parameters. Many models are not fitted by maximum likelihood, to very large datasets. Model classes can often overlap in quite extensive ways.

Calibrating GAG in urine

Susan Prosser measured the concentration of the chemical GAG in the urine of 314 children aged 0 to 18 years. Her aim was to establish normal levels at different ages.

[Figure: scatterplot of GAG concentration against Age.]

Clearly we want to fit a smooth curve. What? Polynomial? Exponential? Choosing the degree of a polynomial by F-tests gives degree 6.

[Figure: the GAG data with the fitted degree-6 polynomial.]

Is this good enough?

Smoothing splines would be the numerical analyst's way to fit a smooth curve to such a scatterplot. The issue is how smooth, and in this example that has been chosen automatically by GCV. (The data frame is GAGurine in package MASS; attaching it gives the columns Age and GAG used below.)

> library(MASS); attach(GAGurine)
> plot(GAGurine, pch = 20)
> lines(smooth.spline(Age, GAG), lwd = 3, col = "blue")

Neural networks are another global non-linear model, usually fitted by penalized least squares. An alternative would be local polynomials, using a kernel to define 'local' and choosing the bandwidth automatically (dpill and locpoly are in package KernSmooth).

> library(KernSmooth)
> plot(GAGurine, pch = 20)
> (h <- dpill(Age, GAG))
> lines(locpoly(Age, GAG, degree = 0, bandwidth = h))
> lines(locpoly(Age, GAG, degree = 1, bandwidth = h), lty = 3)
> lines(locpoly(Age, GAG, degree = 2, bandwidth = h), lty = 4)

[Figures: GAG plotted against Age with the fitted smooth curves; panel labels: const, linear, quadratic.]

(Practical) model selection in 2010...

There are lots of formal figures of adequacy for a model. Some have proved quite useful, but:

- Their variability as estimators can be worryingly large.
- Computation, e.g. of the effective number of degrees of freedom, can be difficult.
- Their implicit measure of performance can be overly sensitive to certain aspects of the model which are not relevant to our problem.
- The assumptions of the theories need to be checked, as the criteria are used way outside their known spheres of validity (and in some cases where they are clearly not valid).

Nowadays people do tens of thousands of significance tests, or more.

Plotting multiple p-values

[Figure: p-value image of a single fMRI brain slice, thresholded to show p-values below 10^-4 and overlaid onto an image of the slice; the colour scale runs from 10^-10 to 10^-4. Colours indicate differential responses within each cluster. An area of activation is shown in the visual cortex.]

Formal training/validation/test sets, or the cross-validatory equivalents, are a very general and safe approach. Regression diagnostics are often based on approximations to overfitting or case deletion. Now we can (and some of us do) fit extended models with smooth terms or use fitting algorithms that downweight groups of points. (I rarely use least squares these days.) It is still all too easy to select a complex model just to account for a tiny proportion of aberrant observations. Alternative explanations with roughly equal support are commonplace. Model averaging seems a good solution. Selecting several models, studying their predictions and taking a consensus is also a good idea, when time permits and when non-quantitative information is available.