Introduction to Machine Learning and Cross-Validation


Introduction to Machine Learning and Cross-Validation
Jonathan Hersh (Chapman), February 27, 2019

Plan
1 Introduction
2 Preliminary Terminology
3 Bias-Variance Trade-off
4 Cross-Validation
5 Conclusion

Machine learning versus econometrics

Machine Learning: developed to solve problems in computer science; prediction/classification; wants goodness of fit; huge datasets (many terabytes), large numbers of variables (1000s); "whatever works."

Econometrics: developed to solve problems in economics; explicitly testing a theory; statistical significance more important than model fit; small datasets, few variables; "it works in practice, but what about in theory?"

Can we use some of this machinery to solve problems in development economics?

Applications of Machine Learning in economics (due to Sendhil Mullainathan)
1. New data
2. Predictions for policy
3. Better econometrics
See "Machine Learning: An Applied Econometric Approach," JEP; Athey, "Beyond Prediction: Using Big Data for Policy Problems" (2017), Science; and Hersh, Jonathan, and Matthew Harding, "Big Data in Economics," IZA World of Labor (2018).

This course
A primer in statistical learning theory, which grew out of statistics.
How does this differ from ML? Machine learning places more emphasis on large-scale applications and prediction accuracy; statistical learning covers
There is much overlap and cross-fertilization.
Very little coding, but example code provided: jonathan-hersh.com/machinelearningdev

Topics covered
1. Cross-validation [Chapter 2]
2. Shrinkage methods (Ridge and LASSO) [Chapter 6]
3. Classification [Chapter 4; APM Chapters 11-12]
4. Tree-based methods (decision trees, bagging, random forests, boosting) [Chapter 8]
5. Unsupervised learning (PCA, k-means clustering, hierarchical clustering) [Chapter 10]
6. Caret automated machine learning [APM]

Plan
1 Introduction
2 Preliminary Terminology
3 Bias-Variance Trade-off
4 Cross-Validation
5 Conclusion

Supervised vs. unsupervised learning
Def: Supervised learning: for every x_i we also observe a response y_i.
Ex: estimating housing values by OLS or random forest.
Def: Unsupervised learning: for each observation we only observe x_i, but do not observe y_i.
Ex: clustering customers into segments; using principal component analysis for dimension reduction.
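As a minimal sketch of the unsupervised case (not from the slides; the simulated data and dimensions are invented for illustration), principal component analysis can be done in a few lines of numpy via the SVD of the centered data matrix. Note that no response y_i appears anywhere:

```python
import numpy as np

rng = np.random.default_rng(5)

# Unsupervised: only x_i, no y_i. Three noisy measurements of one latent factor.
n = 200
z = rng.normal(size=n)
X = np.column_stack([z, 2 * z, -z]) + rng.normal(scale=0.1, size=(n, 3))

Xc = X - X.mean(axis=0)                    # center each column
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)        # variance share of each component
print(explained)  # nearly all variance lies along the first component
```

Because the three columns are noisy copies of one latent variable, the first principal component captures almost all the variance, which is exactly why PCA is useful for dimension reduction.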

Test, Training, and Validation Sets
Training set: (observation-wise) subset of the data used to develop models.
Validation set: subset of the data used during intermediate stages to tune model parameters.
Test set: subset of the data (used sparingly) to approximate out-of-sample fit.

Assessing model accuracy
Mean squared error measures how well model predictions match observed data:

$$
\mathrm{MSE}(\hat f) = \frac{1}{N}\sum_{i=1}^{N} \Big(\underbrace{y_i}_{\text{data}} - \underbrace{\hat f(x_i)}_{\text{model}}\Big)^2
$$

Training MSE vs. test MSE: good in-sample fit (low training MSE) can often obscure poor out-of-sample fit (high test MSE).
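The formula translates directly into code; this is a minimal sketch (the function name `mse` is my own, not from the course materials):

```python
import numpy as np

def mse(y, f_hat_x):
    """MSE(f_hat) = (1/N) * sum_{i=1}^{N} (y_i - f_hat(x_i))^2"""
    y = np.asarray(y, dtype=float)
    f_hat_x = np.asarray(f_hat_x, dtype=float)
    return float(np.mean((y - f_hat_x) ** 2))

# Predictions that are off by 1 everywhere give an MSE of exactly 1
print(mse([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))  # → 1.0
```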

Plan
1 Introduction
2 Preliminary Terminology
3 Bias-Variance Trade-off
4 Cross-Validation
5 Conclusion

Quick example on out-of-sample fit
Let's compare three estimators to see how estimator complexity affects out-of-sample fit:
1. f_1 = linear regression (in orange)
2. f_2 = third-order polynomial (in black)
3. f_3 = very flexible smoothing spline (in green)

Case 1: True f(x) slightly complicated

Case 2: True f(x) not complicated at all

Case 3: True f(x) very complicated
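The three-estimator comparison can be mimicked in a short simulation. This is a sketch under assumptions: a degree-20 polynomial stands in for the very flexible smoothing spline, and the "slightly complicated" true f(x) and noise level are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented "slightly complicated" truth, as in Case 1
n = 80
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.3 * x + rng.normal(scale=0.4, size=n)

x_new = np.sort(rng.uniform(-3, 3, n))     # a fresh draw plays the role of test data
y_new = np.sin(x_new) + 0.3 * x_new + rng.normal(scale=0.4, size=n)

results = {}
for label, deg in [("f1 linear", 1), ("f2 cubic", 3), ("f3 flexible", 20)]:
    coefs = np.polyfit(x, y, deg=deg)
    in_mse = np.mean((y - np.polyval(coefs, x)) ** 2)           # training MSE
    out_mse = np.mean((y_new - np.polyval(coefs, x_new)) ** 2)  # test MSE
    results[deg] = (in_mse, out_mse)
    print(f"{label}: train MSE {in_mse:.3f}, test MSE {out_mse:.3f}")
```

The flexible fit always achieves the lowest training MSE (its basis nests the simpler ones), while its test MSE reveals how much of that fit was noise.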

How to select the right model complexity?

Generalizing this problem: the Bias-Variance tradeoff

Bias-Variance Tradeoff in Math

Prediction error: $E[(y - \hat f)^2]$, where $y$ is the data and $\hat f = \hat f(x)$ is the model.

$$
\begin{aligned}
E\big[(y - \hat f)^2\big]
 &= E\big[y^2 + \hat f^2 - 2 y \hat f\big] &&\text{(expanding terms)}\\
 &= \underbrace{E[y^2]}_{\operatorname{Var}(y) + E[y]^2}
   + \underbrace{E[\hat f^2]}_{\operatorname{Var}(\hat f) + E[\hat f]^2}
   - E[2 y \hat f] &&\text{(def'n of variance)}\\
 &= \operatorname{Var}(y) + \operatorname{Var}(\hat f)
   + E[y]^2 + E[\hat f]^2 - 2 f\, E[\hat f]
   &&\text{(}y = f(x) + \varepsilon,\ E[\varepsilon]=0\text{)}\\
 &= \operatorname{Var}(y) + \operatorname{Var}(\hat f)
   + \big(f - E[\hat f]\big)^2 &&\text{(}E[y] = f\text{)}
\end{aligned}
$$

Since $\operatorname{Var}(y) = E[(y - E[y])^2] = E[(y - f)^2] = E[\varepsilon^2] = \sigma_\varepsilon^2$:

$$
E\big[(y - \hat f)^2\big]
 = \underbrace{\sigma_\varepsilon^2}_{\text{irreducible error}}
 + \underbrace{\operatorname{Var}(\hat f)}_{\text{variance}}
 + \underbrace{\big(f - E[\hat f]\big)^2}_{\text{squared bias}}
$$

Bias is minimized when $f = E[\hat f]$, but total error (variance + squared bias) may be minimized by some other $\hat f$: an $\hat f(x)$ with smaller variance, e.g. fewer variables or smaller-magnitude coefficients.
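The decomposition can be checked numerically. The sketch below is my own illustration, with an assumed true f(x) = sin(x) and σ_ε = 0.5: it refits polynomials of several degrees to many fresh training sets and estimates Var(f̂) and (f − E[f̂])² at a single point x₀:

```python
import numpy as np

rng = np.random.default_rng(42)

f = np.sin                 # assumed true f(x) for this sketch
sigma_eps = 0.5
x0 = 1.0                   # point at which we evaluate the decomposition

def fit_and_predict(deg):
    """Draw a fresh training set, fit a degree-`deg` polynomial, predict at x0."""
    x = rng.uniform(-3, 3, 50)
    y = f(x) + rng.normal(scale=sigma_eps, size=50)
    return np.polyval(np.polyfit(x, y, deg=deg), x0)

results = {}
for deg in (1, 3, 10):
    preds = np.array([fit_and_predict(deg) for _ in range(500)])
    bias_sq = (f(x0) - preds.mean()) ** 2      # (f - E[f_hat])^2
    variance = preds.var()                     # Var(f_hat)
    results[deg] = (bias_sq, variance)
    print(f"deg {deg}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

As the derivation predicts, more flexible fits trade squared bias for variance: the linear fit is badly biased at x₀ but stable, while the degree-10 fit is nearly unbiased but varies much more across training sets.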

Plan
1 Introduction
2 Preliminary Terminology
3 Bias-Variance Trade-off
4 Cross-Validation
5 Conclusion

Cross-Validation
Cross-validation is a tool to approximate out-of-sample fit.
In machine learning, many models have parameters that must be tuned; we adjust these parameters using cross-validation.
It is also useful for selecting between large classes of models, e.g. random forest vs. lasso.

K-fold Cross-Validation (CV)
K-fold CV algorithm:
1. Randomly divide the data into K equal-sized parts, or folds.
2. Leave out part k; fit the model to the other K − 1 parts.
3. Use the fitted model to obtain predictions for the left-out k-th part.
4. Repeat for k = 1, ..., K and combine the results.
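The four steps above can be sketched directly, assuming a generic fit/predict interface (the helper names and the polynomial-degree example are mine, not the course's):

```python
import numpy as np

def k_fold_cv_mse(x, y, fit, predict, K=5, seed=0):
    """Steps 1-4 of the K-fold CV algorithm; returns the weighted CV error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                 # 1. random assignment to K folds
    folds = np.array_split(idx, K)
    cv = 0.0
    for k in range(K):
        test = folds[k]                           # 2. leave out fold k ...
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(x[train], y[train])           #    ... fit on the other K-1 folds
        pred = predict(model, x[test])            # 3. predict the held-out fold
        mse_k = np.mean((y[test] - pred) ** 2)
        cv += (len(test) / len(y)) * mse_k        # 4. combine, weighting by n_k / N
    return cv

# Illustrative use: pick a polynomial degree by 5-fold CV
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 150)
y = np.sin(x) + rng.normal(scale=0.3, size=150)

scores = {
    deg: k_fold_cv_mse(
        x, y,
        fit=lambda xs, ys, d=deg: np.polyfit(xs, ys, deg=d),
        predict=np.polyval,
        K=5,
    )
    for deg in (1, 3, 15)
}
print(scores)  # the cubic fit should have the lowest CV error here
```

The fold-weighted sum in step 4 is exactly the CV error formula on the next slide; with equal-sized folds the weights n_k/N are all 1/K.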

K-Fold CV Example [figure]

Details of K-Fold CV
Let $n_k$ be the number of test observations in fold $k$, where $n_k \approx N/K$. The cross-validation error combines the fold-level errors:

$$
\mathrm{CV}_{(K)} = \sum_{k=1}^{K} \frac{n_k}{N}\,\mathrm{MSE}_k,
\qquad
\mathrm{MSE}_k = \frac{1}{n_k}\sum_{i \in C_k} (y_i - \hat y_i)^2
$$

where $\mathrm{MSE}_k$ is the mean squared error of fold $k$.

Setting K = N is referred to as leave-one-out cross-validation (LOOCV).

Classical frequentist model selection
Akaike information criterion (AIC):

$$
\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}
$$

where $d$ is the number of parameters in our model.

Bayesian information criterion (BIC):

$$
\mathrm{BIC} = -2\,\mathrm{loglik} + (\log N)\, d
$$

Both penalize models with more parameters in a somewhat arbitrary fashion. This usually helps with model selection, but still does not answer the important question of model assessment.
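For a Gaussian linear model the maximized log-likelihood has a closed form once the MLE σ̂² = RSS/N is plugged in, so both criteria are easy to compute. A sketch (the simulated data, the degree choices, and counting the variance as a parameter are illustrative assumptions):

```python
import numpy as np

def gaussian_loglik(y, y_hat):
    """Maximized Gaussian log-likelihood, plugging in the MLE sigma^2 = RSS/N."""
    n = len(y)
    sigma2 = np.mean((y - y_hat) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(y, y_hat, d):
    return -2 / len(y) * gaussian_loglik(y, y_hat) + 2 * d / len(y)

def bic(y, y_hat, d):
    return -2 * gaussian_loglik(y, y_hat) + np.log(len(y)) * d

# Illustrative data: the truth really is linear
rng = np.random.default_rng(7)
x = rng.uniform(-2, 2, 100)
y = 1 + 2 * x + rng.normal(scale=0.5, size=100)

results = {}
for deg in (1, 5):
    y_hat = np.polyval(np.polyfit(x, y, deg=deg), x)
    d = deg + 1 + 1                # d = polynomial coefficients plus the variance
    results[deg] = (aic(y, y_hat, d), bic(y, y_hat, d))
    print(f"deg {deg}: AIC {results[deg][0]:.3f}, BIC {results[deg][1]:.3f}")
```

Because BIC's log N penalty grows with the sample size, it tends to pick the smaller (here, correct) model more aggressively than AIC.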

What is the Optimal Number of Folds, K?
Do you want big or small training folds?
Because each training set is only (K − 1)/K as big as the full dataset, the estimates of the prediction error will be biased upward. This bias is minimized when K = N (LOOCV), but LOOCV has higher variance!
There are no clear statistical rules for how to set K. The convention of K = 5 or 10 is, in practice, a good trade-off between bias and variance for most problems.

Plan
1 Introduction
2 Preliminary Terminology
3 Bias-Variance Trade-off
4 Cross-Validation
5 Conclusion

Conclusion
Econometric models are at risk of overfitting. But what we want are theories that extend beyond the dataset we have on our computer. Cross-validation is a key tool that allows us to adjust models so that they more closely match reality.