LECTURE 10: LINEAR MODEL SELECTION PT. 1 October 16, 2017 SDS 293: Machine Learning

Outline
Model selection: alternatives to least-squares
Subset selection
- Best subset
- Stepwise selection (forward and backward)
- Estimating error
Shrinkage methods
- Ridge regression and the Lasso
- Dimension reduction
Labs for each part

Back to the safety of linear models: $Y \approx \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$

Flashback: minimizing RSS [figure omitted]

Discussion Why do we minimize RSS? (have you ever questioned it?)

What do we know about least-squares? Assumption 1: we're fitting a linear model. Assumption 2: the true relationship between the predictors and the response is linear. What can we say about the bias of our least-squares estimates?

What do we know about least-squares? Assumption 1: we're fitting a linear model. Assumption 2: the true relationship between the predictors and the response is linear. Case 1: the number of observations is much larger than the number of predictors (n >> p). What can we say about the variance of our least-squares estimates?

What do we know about least-squares? Assumption 1: we're fitting a linear model. Assumption 2: the true relationship between the predictors and the response is linear. Case 2: the number of observations is not much larger than the number of predictors (n ≈ p). What can we say about the variance of our least-squares estimates?

What do we know about least-squares? Assumption 1: we're fitting a linear model. Assumption 2: the true relationship between the predictors and the response is linear. Case 3: the number of observations is smaller than the number of predictors (n < p). What can we say about the variance of our least-squares estimates?

Bias vs. variance

Discussion How could we reduce the variance?

Subset selection Big idea: if having too many predictors is the problem, maybe we can get rid of some. Problem: how do we choose?

Flashback: superhero example: height = β1 + β2 + β3 (Image credit: Ming Malaykham)

Best subset selection: try them all!

Finding the best subset
Start with the null model M0 (containing no predictors).
1. For k = 1, 2, ..., p:
   a. Fit all (p choose k) models that contain exactly k predictors.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R²). Call it Mk.
2. Select a single best model from among M0, ..., Mp using cross-validated prediction error or something similar.
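Neither the R leaps code nor the Python lab code, just a minimal numpy + itertools sketch of the loop above; the helper names rss and best_subset are made up for illustration.

```python
import numpy as np
from itertools import combinations

def rss(X, y, cols):
    """Least-squares fit on the chosen columns (plus an intercept); return the RSS."""
    Xk = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    resid = y - Xk @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Step 1: for each size k = 1..p, keep the subset of columns with the smallest RSS."""
    p = X.shape[1]
    best = {}                                   # k -> (RSS, column indices of M_k)
    for k in range(1, p + 1):
        best[k] = min((rss(X, y, cols), cols) for cols in combinations(range(p), k))
    return best
```

Step 2, choosing among M0, ..., Mp, still needs an estimate of test error; that is what the adjusted measures and validation approaches later in the lecture are for.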

Discussion Question 1: why not just use the one with the lowest RSS? Answer: because you'll always wind up choosing the model with the highest number of predictors (why?)

Discussion Question 2: why not just calculate the cross-validated prediction error on all of them? Answer: so many... models...

A sense of scale We do a lot of work in groups in this class. How many different possible groupings are there? Let's break it down: 47 individual people, 1,081 different groups of two, 16,215 different groups of three...

Model overload
Number of possible models on a set of p predictors: $\sum_{k=0}^{p} \binom{p}{k} = 2^p$.
On 10 predictors: 1,024 models. On 20 predictors: 1,048,576 models.

A bigger problem Question: what happens to our estimated coefficients as we fit more and more models? Answer: the larger the search space, the larger the variance. We're overfitting!

What if we could eliminate some?

A slightly larger example (p = 5) { } A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE

Best subset selection
Start with the null model M0 (containing no predictors).
1. For k = 1, 2, ..., p:
   a. Fit all (p choose k) models that contain exactly k predictors.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R²). Call it Mk.
2. Select a single best model from among M0, ..., Mp using cross-validated prediction error or something similar.

Forward selection
Start with the null model M0 (containing no predictors).
1. For k = 1, 2, ..., p:
   a. Fit all p − k + 1 models that augment Mk−1 with exactly one additional predictor.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R²). Call it Mk.
2. Select a single best model from among M0, ..., Mp using cross-validated prediction error or something similar.
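For comparison, a hedged sketch of the greedy forward pass, reusing the hypothetical rss() helper from the best-subset sketch above (forward_stepwise is an illustrative name, not the lab's code).

```python
import numpy as np  # rss() is the helper defined in the best-subset sketch above

def forward_stepwise(X, y):
    """Grow the model one predictor at a time; each step adds whichever remaining
    predictor gives the largest drop in training RSS."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    path = {0: (float(np.sum((y - y.mean()) ** 2)), [])}   # M_0: intercept-only RSS
    for k in range(1, p + 1):
        score, best_j = min((rss(X, y, selected + [j]), j) for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
        path[k] = (score, list(selected))                  # M_k
    return path
```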

Stepwise selection: way fewer models
Number of models we have to consider drops from $\sum_{k=0}^{p} \binom{p}{k} = 2^p$ (best subset) to $1 + \sum_{k=0}^{p-1} (p - k) = 1 + \frac{p(p+1)}{2}$ (stepwise).
On 10 predictors: 1,024 models → 56 models.
On 20 predictors: over 1 million models → 211 models.

Forward selection Question: what potential problems do you see? Answer: there's a risk we might pass over an important predictor early on and never pick it up, since each step keeps only one greedy choice. While this method usually does well in practice, it is not guaranteed to give the optimal solution.

Backward selection
Start with the full model Mp (containing all p predictors).
1. For k = p, p − 1, ..., 1:
   a. Fit all k models that remove exactly one predictor from Mk.
   b. Keep only the one that has the smallest RSS (or equivalently the largest R²). Call it Mk−1.
2. Select a single best model from among M0, ..., Mp using cross-validated prediction error or something similar.
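And the mirror image, again only a sketch that reuses the hypothetical rss() helper from the best-subset sketch: start from the full model and drop the least useful predictor at each step.

```python
def backward_stepwise(X, y):
    """Shrink the model one predictor at a time; each step drops whichever
    predictor's removal leaves the smallest training RSS."""
    p = X.shape[1]
    selected = list(range(p))
    path = {p: (rss(X, y, selected), list(selected))}      # M_p: the full model
    for k in range(p, 0, -1):
        score, worst_j = min((rss(X, y, [c for c in selected if c != j]), j)
                             for j in selected)
        selected.remove(worst_j)
        path[k - 1] = (score, list(selected))              # M_{k-1}
    return path
```

Note that the very first step has to fit the full model Mp, which is only possible when n > p.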

Backward selection Question: what potential problems do you see? Answer: if we have more predictors than we have observations (n < p), this method won't work (why?)

Choosing the optimal model
Flashback: measures of training error (RSS and R²) aren't good predictors of test error (what we care about).
Two options:
1. We can directly estimate the test error, using either a validation set approach or cross-validation.
2. We can indirectly estimate test error by making an adjustment to the training error to account for the bias.
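A rough illustration of option 1, and only a sketch: hold out a validation set, fit the best-subset path on the training part (reusing the hypothetical best_subset() helper from earlier), and keep the subset size with the lowest validation MSE. Names like pick_size_by_validation and train_frac are made up.

```python
import numpy as np  # best_subset() is the helper defined in the best-subset sketch above

def pick_size_by_validation(X, y, train_frac=0.7, seed=0):
    """Choose the subset size k whose best training-set model has the lowest validation MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(train_frac * len(y))
    tr, va = idx[:cut], idx[cut:]
    path = best_subset(X[tr], y[tr])                       # k -> (training RSS, columns)

    def val_mse(cols):
        cols = list(cols)
        Xtr = np.column_stack([np.ones(len(tr)), X[tr][:, cols]])
        beta, *_ = np.linalg.lstsq(Xtr, y[tr], rcond=None)
        Xva = np.column_stack([np.ones(len(va)), X[va][:, cols]])
        return float(np.mean((y[va] - Xva @ beta) ** 2))

    return min(path, key=lambda k: val_mse(path[k][1]))
```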

Adjusted R²
Intuition: once all of the useful variables have been included in the model, adding additional junk variables will lead to only a small decrease in RSS.
$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$
$R^2_{\mathrm{Adj}} = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)}$
Adjusted R² pays a penalty for unnecessary variables in the model by dividing RSS by (n − d − 1) in the numerator of the ratio, where d is the number of predictors.

AIC, BIC, and Cp
Some other ways of penalizing RSS:
$C_p = \frac{1}{n}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)$
$\mathrm{AIC} = \frac{1}{n \hat{\sigma}^2}\left(\mathrm{RSS} + 2 d \hat{\sigma}^2\right)$
$\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2\right)$
Here $\hat{\sigma}^2$ is an estimate of the variance of the error terms. AIC is proportional to Cp for least-squares models, and BIC's log(n) factor imposes a more severe penalty on large models (since log n > 2 once n > 7).
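A minimal sketch of the penalized criteria above, assuming the RSS and TSS of a fitted model and an estimate sigma2_hat of the error variance (e.g., from the full model's residuals) are already in hand; the function name is made up.

```python
import numpy as np

def penalized_criteria(rss_k, tss, n, d, sigma2_hat):
    """Adjusted R^2, Cp, AIC, and BIC for a least-squares model with d predictors."""
    adj_r2 = 1 - (rss_k / (n - d - 1)) / (tss / (n - 1))
    cp     = (rss_k + 2 * d * sigma2_hat) / n
    aic    = (rss_k + 2 * d * sigma2_hat) / (n * sigma2_hat)
    bic    = (rss_k + np.log(n) * d * sigma2_hat) / n
    return {"adj_r2": adj_r2, "cp": cp, "aic": aic, "bic": bic}
```

Larger adjusted R² is better; smaller Cp, AIC, and BIC are better.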

Adjust or validate? Question: what are the benefits and drawbacks of each?
Adjusted measures
- Pros: relatively inexpensive to compute
- Cons: makes more assumptions about the model, so more opportunities to be wrong
Validation
- Pros: more direct estimate of test error (makes fewer assumptions)
- Cons: more expensive; requires either cross-validation or a held-out test set

Lab: subset selection
To do today's lab in R: leaps
To do today's lab in python: itertools, time
Instructions and code:
[course website]/labs/lab8-r.html
[course website]/labs/lab8-py.html
Full version can be found beginning on p. 244 of ISLR

Coming up
Estimating error with cross-validation
A3 due tonight by 11:59pm
A4 out, due Oct. 20th