MS&E 226: Small Data
Lecture 6: Model complexity scores (v3)
Ramesh Johari (ramesh.johari@stanford.edu)
Fall 2015 1 / 34

Estimating prediction error 2 / 34

Estimating prediction error. We saw how we can estimate prediction error using validation or test sets. But what can we do if we don't have enough data to estimate test error? In this set of notes we discuss how we can use in-sample estimates to measure model complexity. 3 / 34

Training error. The first idea for estimating prediction error of a fitted model might be to look at the sum of squared error in-sample:
$$\mathrm{Err}_{\mathrm{tr}} = \frac{1}{n} \sum_{i=1}^n \big(Y_i - \hat f(X_i)\big)^2 = \frac{1}{n} \sum_{i=1}^n r_i^2.$$
This is called the training error; it is the same as 1/n times the sum of squared residuals we studied earlier. 4 / 34
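As a quick illustration, here is a minimal R sketch (not from the lecture; the simulated data frame and its columns are hypothetical) of computing Err_tr for an OLS fit:

set.seed(1)
df <- data.frame(X1 = rnorm(50), X2 = rnorm(50))
df$Y <- 1 + df$X1 + 2 * df$X2 + rnorm(50)

fit <- lm(Y ~ 1 + X1 + X2, data = df)
err.tr <- mean(residuals(fit)^2)   # Err_tr = (1/n) * sum of squared residuals
err.tr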

Training error vs. prediction error. Of course, based on our discussion of bias and variance, we should expect that training error is too optimistic relative to the error on a new test set. To formalize this, we can compare Err_tr to Err_in, the in-sample prediction error:
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{n} \sum_{i=1}^n E\big[(Y - \hat f(\tilde X))^2 \mid \mathbf{X}, \mathbf{Y}, \tilde X = X_i\big].$$
This is the prediction error if we received new samples of Y corresponding to each covariate vector in our existing data. 5 / 34

Training error vs. test error. Let's first check how these behave relative to each other. Use the same type of simulation as before:
- Generate 100 observations with X_{i1}, X_{i2} ~ N(0, 1), i.i.d.
- Let Y_i = 1 + X_{i1} + 2 X_{i2} + ε_i, where ε_i ~ N(0, 5), i.i.d.
- Fit a model f̂ using OLS, and the formula Y ~ 1 + X1 + X2.
- Compute the training error of the model.
- Generate another 100 test samples of Y corresponding to each row of X, using the same population model.
- Compute the in-sample prediction error of the fitted model on the test set.
- Repeat in 500 parallel universes, and create a plot of the results. (An R sketch of this simulation follows.) 6 / 34
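A sketch of this simulation in R (my reconstruction, not the course code; I read N(0, 5) as variance 5, so rnorm is called with sd = sqrt(5)):

# One "parallel universe": compare training error to in-sample prediction error.
one.universe <- function(n = 100) {
  X1 <- rnorm(n); X2 <- rnorm(n)
  Y  <- 1 + X1 + 2 * X2 + rnorm(n, sd = sqrt(5))
  fit <- lm(Y ~ 1 + X1 + X2)

  err.tr <- mean(residuals(fit)^2)

  # New outcomes at the SAME covariate values: in-sample prediction error.
  Y.new  <- 1 + X1 + 2 * X2 + rnorm(n, sd = sqrt(5))
  err.in <- mean((Y.new - fitted(fit))^2)

  c(err.in = err.in, err.tr = err.tr)
}

set.seed(226)
sims <- t(replicate(500, one.universe()))
mean(sims[, "err.in"] - sims[, "err.tr"])   # positive: Err_tr is optimistic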

Training error vs. test error. Results: [Figure: density plot of Err_in and Err_tr over the 500 simulations, omitted.] Mean of Err_in - Err_tr = 1.42. 7 / 34

Training error vs. test error. If we could somehow correct Err_tr to behave more like Err_in, we would have a way to estimate prediction error on new data (at least, for covariates X_i we have already seen). Here is a key result towards that correction.¹
Theorem.
$$E[\mathrm{Err}_{\mathrm{in}} \mid \mathbf{X}] = E[\mathrm{Err}_{\mathrm{tr}} \mid \mathbf{X}] + \frac{2}{n} \sum_{i=1}^n \mathrm{Cov}\big(\hat f(X_i), Y_i \mid \mathbf{X}\big).$$
In particular, if Cov(f̂(X_i), Y_i | X) > 0, then training error underestimates test error.
¹ This result holds more generally for other measures of prediction error, e.g., 0-1 loss in binary classification. 8 / 34

Training error vs. test error: Proof [*]. Proof: If we expand the definitions of Err_tr and Err_in, we get:
$$\mathrm{Err}_{\mathrm{in}} - \mathrm{Err}_{\mathrm{tr}} = \frac{1}{n} \sum_{i=1}^n \Big( E[Y^2 \mid \tilde X = X_i] - Y_i^2 - 2\big(E[Y \mid \tilde X = X_i] - Y_i\big)\hat f(X_i) \Big).$$
Now take expectations over Y. Note that E[Y² | X, X̃ = X_i] = E[Y_i² | X], since both are the expectation of the square of a random outcome with associated covariate X_i. So we have:
$$E[\mathrm{Err}_{\mathrm{in}} - \mathrm{Err}_{\mathrm{tr}} \mid \mathbf{X}] = -\frac{2}{n} \sum_{i=1}^n E\Big[\big(E[Y \mid \tilde X = X_i] - Y_i\big)\hat f(X_i) \,\Big|\, \mathbf{X}\Big].$$ 9 / 34

Training error vs. test error: Proof [*]. Proof (continued): Also note that E[Y | X̃ = X_i] = E[Y_i | X], for the same reason. Finally, since E[Y_i - E[Y_i | X] | X] = 0, we get:
$$E[\mathrm{Err}_{\mathrm{in}} - \mathrm{Err}_{\mathrm{tr}} \mid \mathbf{X}] = \frac{2}{n} \sum_{i=1}^n \Big( E\big[(Y_i - E[Y \mid \tilde X = X_i])\hat f(X_i) \,\big|\, \mathbf{X}\big] - E[Y_i - E[Y_i \mid \mathbf{X}] \mid \mathbf{X}]\, E[\hat f(X_i) \mid \mathbf{X}] \Big),$$
which reduces to (2/n) Σ_{i=1}^n Cov(f̂(X_i), Y_i | X), as desired. 10 / 34

The theorem's condition. What does Cov(f̂(X_i), Y_i | X) > 0 mean? In practice, for any reasonable modeling procedure, we should expect our predictions to be positively correlated with the outcome. 11 / 34

Example: Linear regression. Assume a linear population model Y = Xβ + ε, where E[ε | X] = 0, Var(ε) = σ², and the errors are uncorrelated. Suppose we use a subset S of the covariates and fit a linear regression model by OLS. Recall that the bias-variance tradeoff is:
$$E[\mathrm{Err}_{\mathrm{in}} \mid \mathbf{X}] = \sigma^2 + \frac{1}{n} \sum_{i=1}^n \big(X_i\beta - E[\hat f(X_i) \mid \mathbf{X}]\big)^2 + \frac{|S|}{n}\sigma^2.$$
But this is hard to use to precisely estimate prediction error in practice, because we can't compute the bias. 12 / 34

Example: Linear regression. The theorem gives us another way forward. For linear regression with a linear population model we have:
$$\sum_{i=1}^n \mathrm{Cov}\big(\hat f(X_i), Y_i \mid \mathbf{X}\big) = |S|\,\sigma^2.$$
In other words, in this setting we have:
$$E[\mathrm{Err}_{\mathrm{in}} \mid \mathbf{X}] = E[\mathrm{Err}_{\mathrm{tr}} \mid \mathbf{X}] + \frac{2|S|}{n}\sigma^2.$$ 13 / 34
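One way to see this (a sketch, assuming the design matrix X_S restricted to the covariates in S has full column rank): the OLS predictions are f̂(X_i) = (HY)_i with hat matrix H = X_S(X_S^T X_S)^{-1} X_S^T, and the errors are uncorrelated with common variance σ², so
$$\sum_{i=1}^n \mathrm{Cov}\big(\hat f(X_i), Y_i \mid \mathbf{X}\big) = \sum_{i=1}^n \sum_{j=1}^n H_{ij}\,\mathrm{Cov}(Y_j, Y_i \mid \mathbf{X}) = \sigma^2 \sum_{i=1}^n H_{ii} = \sigma^2\,\mathrm{tr}(H) = |S|\,\sigma^2.$$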

Model complexity for linear regression 14 / 34

A model complexity score. The last result suggests how we might measure model complexity:
- Estimate σ² using the sample variance of the residuals of the full fitted model, i.e., with S = {1,..., p}; call this estimate σ̂².²
- For a given model using a set of covariates S, compute:
$$C_p = \mathrm{Err}_{\mathrm{tr}} + \frac{2|S|}{n}\hat\sigma^2.$$
This is called Mallows' C_p statistic. It is an estimate of the prediction error.
² For a model with low bias, this will be a good estimate of σ². 15 / 34

A model complexity score. How to interpret this?
$$C_p = \mathrm{Err}_{\mathrm{tr}} + \frac{2|S|}{n}\hat\sigma^2.$$
The first term measures fit to the existing data. The second term is a penalty for model complexity. So the C_p statistic balances underfitting and overfitting the data (bias and variance). 16 / 34
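A small R sketch of the computation (not from the lecture; the simulated data, the convention that |S| counts all fitted coefficients including the intercept, and the use of the usual degrees-of-freedom denominator for σ̂² are my assumptions):

set.seed(2)
n  <- 100
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
df$Y <- 1 + df$X1 + 2 * df$X2 + rnorm(n, sd = 2)       # X3 is pure noise

# Estimate sigma^2 from the residuals of the full fitted model.
full   <- lm(Y ~ X1 + X2 + X3, data = df)
sigma2 <- sum(residuals(full)^2) / full$df.residual

Cp <- function(fit) {
  S <- length(coef(fit))                # |S|: number of fitted coefficients
  mean(residuals(fit)^2) + 2 * S / n * sigma2
}

Cp(lm(Y ~ X1, data = df))               # underfit
Cp(lm(Y ~ X1 + X2, data = df))          # correct model: typically smallest C_p
Cp(lm(Y ~ X1 + X2 + X3, data = df))     # extra noise covariate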

AIC, BIC. Other model complexity scores:
- Akaike information criterion (AIC). In the linear population model with normal ε, this is equivalent to:
$$\frac{n}{\hat\sigma^2}\left(\mathrm{Err}_{\mathrm{tr}} + \frac{2|S|}{n}\hat\sigma^2\right).$$
- Bayesian information criterion (BIC). In the linear population model with normal ε, this is equivalent to:
$$\frac{n}{\hat\sigma^2}\left(\mathrm{Err}_{\mathrm{tr}} + \frac{|S|\ln n}{n}\hat\sigma^2\right).$$
Both are more general, and derived from a likelihood approach. (More on that later.) 17 / 34

AIC, BIC. Note that:
- AIC is the same (up to scaling) as C_p in the linear population model with normal ε.
- BIC penalizes model complexity more heavily than AIC (its penalty uses ln n in place of 2, and ln n > 2 once n ≥ 8). 18 / 34

AIC, BIC in software [*]. In practice, the actual values of C_p, AIC, and BIC can differ significantly across software packages, but these differences don't affect model selection. The estimate of the sample variance σ̂² for C_p will usually be computed using the full fitted model (i.e., with all p covariates), while the estimate of the sample variance for AIC and BIC will usually be computed using just the fitted model being evaluated (i.e., with just the covariates in S). This typically has no substantive effect on model selection. In addition, AIC and BIC are sometimes reported as the negation of the expressions on the previous slide, so that larger values are better, or without the scaling coefficient in front. Again, none of these changes affects model selection; the R sketch below illustrates this. 19 / 34
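For instance, in R (a sketch, not from the lecture), stats::AIC and stats::extractAIC report AIC on different scales, yet for a fixed dataset they differ by a constant, so they rank models identically:

set.seed(3)
n  <- 100
df <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
df$Y <- 1 + df$X1 + rnorm(n)

m1 <- lm(Y ~ X1, data = df)
m2 <- lm(Y ~ X1 + X2, data = df)

# AIC() uses the full Gaussian log-likelihood (and counts sigma^2 as a parameter);
# extractAIC() uses n*log(RSS/n) + 2*edf, the form used by step().
AIC(m1); AIC(m2)
extractAIC(m1); extractAIC(m2)

# The two scales differ by the same constant for both models:
AIC(m1) - extractAIC(m1)[2]
AIC(m2) - extractAIC(m2)[2]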

Cross validation 20 / 34

Cross validation. The accuracy of C_p, AIC, and BIC depends on some knowledge of the population model (in general, though they are often used even without this). What general tools are available that don't depend on such knowledge? Cross validation is a simple, widely used technique for estimating the prediction error of a model when data is (relatively) limited. 21 / 34

K-fold cross validation. K-fold cross validation (CV) works as follows:
- Divide the data (randomly) into K equal groups, called folds. Let A_k denote the set of data points (Y_i, X_i) placed into the k-th fold.³
- For k = 1,..., K, train the model on all data except the k-th fold. Let f̂^(-k) denote the resulting fitted model.
- Estimate prediction error as:
$$\mathrm{Err}_{\mathrm{CV}} = \frac{1}{K} \sum_{k=1}^K \frac{1}{n/K} \sum_{i \in A_k} \big(Y_i - \hat f^{(-k)}(X_i)\big)^2.$$
In words: for the k-th model, the k-th fold acts as a validation set. The estimated prediction error from CV, Err_CV, is the average of the test-set prediction errors of the K models. (A from-scratch R sketch follows.)
³ For simplicity, assume n/K is an integer. 22 / 34
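Here is a from-scratch R sketch of the procedure above (my code, not the course's; the formula Y ~ X1 + X2 and the simulated data are hypothetical, and folds are assigned by random permutation rather than requiring n/K to be an integer):

kfold.cv <- function(df, K = 10) {
  n     <- nrow(df)
  folds <- sample(rep(1:K, length.out = n))              # random fold assignment
  errs  <- numeric(K)
  for (k in 1:K) {
    fit.k   <- lm(Y ~ X1 + X2, data = df[folds != k, ])  # train without fold k
    test    <- df[folds == k, ]                          # fold k = validation set
    errs[k] <- mean((test$Y - predict(fit.k, newdata = test))^2)
  }
  mean(errs)    # Err_CV: average of the per-fold prediction errors
}

set.seed(4)
df <- data.frame(X1 = rnorm(100), X2 = rnorm(100))
df$Y <- 1 + df$X1 + 2 * df$X2 + rnorm(100, sd = sqrt(5))
kfold.cv(df, K = 10)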

K-fold cross validation. A picture: [Figure: schematic of the data split into K folds, with each fold in turn held out as the validation set, omitted.] 23 / 34

Using CV. After running K-fold CV, what do we do? We then build a model from all the training data; call this f̂. The idea is that Err_CV should be a good estimate of Err, the generalization error of f̂.⁴ So with that in mind, how to choose K?
- If K = N, the resulting method is called leave-one-out (LOO) cross validation.
- If K = 1, then there is no cross validation at all.
- In practice, in part due to computational considerations, we often use K = 5 to 10.
⁴ Recall generalization error is the expected prediction error of f̂ on new samples. 24 / 34

How to choose K? Bias and variance play a role in choosing K. Let's start with bias: how well does Err_CV approximate Err?
- When K = N, the training set for each f̂^(-k) is nearly the entire training data. Therefore Err_CV will be nearly unbiased as an estimate of Err.
- When K ≪ N, since the models use much less data than the entire training set, each model f̂^(-k) has higher generalization error; therefore Err_CV will tend to overestimate Err. 25 / 34

How to choose K? Bias and variance play a role in choosing K. Now let's look at variance: how much does Err_CV vary if the training data is changed?
- When K = N, because the training sets are very similar across all the models f̂^(-k), they will tend to have strong positive correlation in their predictions. This contributes to higher variance.
- When K ≪ N, the models f̂^(-k) are less correlated with each other, reducing variance. On the other hand, each model is trained on significantly less data, increasing variance (e.g., for linear regression, the variance term increases if n decreases).
The overall effect depends on the population model and the model class being used. 26 / 34

Leave-one-out CV and linear regression. Leave-one-out CV is particularly straightforward for linear models fitted by OLS: there is no need to refit the model at all. This is a useful computational trick for linear models.
Theorem. Given training data X and Y, let H = X(X^T X)^{-1} X^T be the hat matrix, and let Ŷ = HY be the fitted values under OLS with the full training data. Then for leave-one-out cross validation:⁵
$$\mathrm{Err}_{\mathrm{LOOCV}} = \frac{1}{n} \sum_{i=1}^n \left(\frac{Y_i - \hat Y_i}{1 - H_{ii}}\right)^2.$$
Interpretation: observations with H_ii close to 1 are very influential in the fit, and therefore have a big effect on generalization error.
⁵ It can be shown that H_ii < 1 for all i. 27 / 34

LOO CV and OLS: Proof sketch [*]. Let f̂^(-i) be the fitted model from OLS when observation i is left out. Define Z_j = Y_j if j ≠ i, and Z_i = f̂^(-i)(X_i).
- Show that OLS with training data X and Z has f̂^(-i) as its solution. Therefore f̂^(-i)(X_i) = (HZ)_i.
- Now use the fact that:
$$(H\mathbf{Z})_i = \sum_j H_{ij} Z_j = (H\mathbf{Y})_i - H_{ii} Y_i + H_{ii}\,\hat f^{(-i)}(X_i).$$ 28 / 34
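The shortcut is easy to check numerically; here is a hedged R sketch (my code, with a hypothetical simulated dataset) comparing the hat-matrix formula to n explicit refits:

set.seed(5)
n  <- 100
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- 1 + X1 + 2 * X2 + rnorm(n, sd = sqrt(5))
fit <- lm(Y ~ X1 + X2)

# Shortcut: leverages H_ii and residuals from the single full fit.
h            <- hatvalues(fit)
err.shortcut <- mean((residuals(fit) / (1 - h))^2)

# Brute force: refit n times, each time leaving observation i out.
err.brute <- mean(sapply(1:n, function(i) {
  fit.i <- lm(Y ~ X1 + X2, subset = -i)
  (Y[i] - predict(fit.i, newdata = data.frame(X1 = X1[i], X2 = X2[i])))^2
}))

all.equal(err.shortcut, err.brute)   # TRUE, up to numerical error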

A hypothetical example. You are given a large dataset with many covariates. You carry out a variety of visualizations and explorations and conclude that you only want to use p of the covariates. You then use cross validation to pick the best model using these covariates. Question: is Err_CV a good estimate of Err in this case? 29 / 34

A hypothetical example (continued). No! You already used the data to choose your p covariates. The covariates were chosen because they looked favorable on the training data; this makes it more likely that they will lead to low cross validation error. Thus, with this approach, Err_CV will typically be lower than the true generalization error Err.⁶
MORAL: To get unbiased results, any model selection must be carried out without the holdout data included!
⁶ Analogous to our discussion of validation and test sets in the train-validate-test approach. 30 / 34

Cross validation in R. In R, cross validation can be carried out using the cvTools package:
> library(cvTools)
> cv.folds = cvFolds(n, K)
> cv.out = cvFit(lm, formula = ..., folds = cv.folds, cost = mspe)
When done, cv.out$cv contains Err_CV. This can be used more generally with other model-fitting methods. 31 / 34

Simulation: Comparing C_p, AIC, BIC, CV. Repeat the following steps 10 times:
- For 1 ≤ i ≤ 100, generate X_i ~ Uniform[-3, 3].
- For 1 ≤ i ≤ 100, generate Y_i as:
$$Y_i = \alpha_1 X_i + \alpha_2 X_i^2 - \alpha_3 X_i^3 + \alpha_4 X_i^4 - \alpha_5 X_i^5 + \alpha_6 X_i^6 + \varepsilon_i,$$
where ε_i ~ Uniform[-3, 3].
- For p = 1,..., 20, we evaluate the model Y ~ 0 + X + I(X^2) +... + I(X^p) using C_p, BIC, and 10-fold cross validation.⁷
How do these methods compare? (A code sketch of one repetition follows.)
⁷ We leave out AIC since it is exactly a scaled version of C_p. 32 / 34
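A compact R sketch of one repetition (my reconstruction: the coefficient values α_j, the sign pattern, and the use of raw polynomials are assumptions, since the slides do not give them; the CV score could be added by reusing the kfold.cv sketch above):

set.seed(6)
n     <- 100
alpha <- rep(1, 6)                               # hypothetical coefficients
X     <- runif(n, -3, 3)
Y     <- alpha[1]*X + alpha[2]*X^2 - alpha[3]*X^3 +
         alpha[4]*X^4 - alpha[5]*X^5 + alpha[6]*X^6 + runif(n, -3, 3)
df    <- data.frame(X = X, Y = Y)

# sigma^2 from the largest model considered (degree 20, no intercept).
full   <- lm(Y ~ 0 + poly(X, 20, raw = TRUE), data = df)
sigma2 <- sum(residuals(full)^2) / full$df.residual

scores <- sapply(1:20, function(p) {
  fit    <- lm(Y ~ 0 + poly(X, p, raw = TRUE), data = df)
  S      <- sum(!is.na(coef(fit)))               # number of fitted coefficients
  err.tr <- mean(residuals(fit)^2)
  c(Cp  = err.tr + 2 * S / n * sigma2,
    BIC = (n / sigma2) * (err.tr + S * log(n) / n * sigma2))
})
which.min(scores["Cp", ]); which.min(scores["BIC", ])   # selected degrees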

Simulation: Visualizing the data. [Figure: scatterplot of Y against X for the simulated data, omitted.] 33 / 34

Simulation: Comparing C_p, AIC, BIC, CV. [Figure: score (log_10 scale) versus |S| (number of variables, 1 to 15), comparing CV with K = 10, CV with K = 100, C_p, scaled AIC, scaled BIC, and the error on a test set, omitted.] 34 / 34