ITEC 621 Predictive Analytics 6. Variable Selection


Multi-Collinearity
The X's are not independent (they are correlated). In Y = X * B:
Approximately: X has no inverse because its columns are dependent.
Really: X'X has no (pseudo-)inverse because its columns are (too) dependent.
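To see the singularity problem concretely, here is a minimal R sketch with simulated data (the variable names are invented for illustration, not from the slide): two nearly identical predictors drive the condition number of X'X through the roof.

set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd=0.01) # x2 is almost an exact copy of x1
X <- cbind(1, x1, x2) # Design matrix with an intercept column
kappa(t(X) %*% X) # Enormous condition number: X'X is nearly singular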

Testing for Multi-Collinearity
First, you need to analyze the correlation matrix and inspect for desirable correlations: high between the dependent variable and any independent variable, and low among the independent variables. Then run your regression model and report multi-collinearity statistics in the results. Two are most widely used (a sketch of both checks follows below):
- Condition Index (CI): a composite score of the linear association of all independent variables for the model as a whole. Rule of thumb: CI < 30 no problem; 30 < CI < 50 some concern; CI > 50 severe, no good.
- Variance Inflation Factors (VIF): a statistic measuring the contribution of each predictor Xi to the model's multicollinearity, which helps figure out which variables are problematic. VIF(Xi) = 1 / (1 - Ri^2), where Ri^2 is the R-squared from regressing Xi against all other predictors. Rule of thumb: VIF < 10 no problem; VIF >= 10 too high.
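A hedged illustration of both checks in R (not from the original slide; it assumes the car package is installed and uses the built-in mtcars data as a stand-in):

library(car) # Provides vif()
cor(mtcars[, c("disp", "hp", "wt")]) # Inspect correlations among the predictors
fit <- lm(mpg ~ disp + hp + wt, data=mtcars)
vif(fit) # One VIF per predictor; values >= 10 flag a problem variable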

Variable Selection Methods

Subset Comparison: Intuition
You can test any 2 nested models, Large vs. Reduced (or Restricted):
Reduced Model: Y = β0 + β1(X1) + β2(X2) + ... + ε
Large Model: Y = β0 + β1(X1) + β2(X2) + ... + β3(X3) + β4(X4) + ε
We need to test whether the Large model's SSE is significantly lower than the Reduced model's SSE, taking into account the loss of degrees of freedom caused by adding more variables to the model. We can do this with an ANOVA F-Test (or any other fit statistic comparison). Generally, if any of the coefficients added to the Large model are significant, the ANOVA F-Test will also be significant, but this is not always the case. The F-Test rules.
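For reference, the F statistic that anova() reports can be computed by hand from the two SSEs and residual degrees of freedom; a minimal sketch, assuming lm.reduced and lm.large are the fitted models from the Hitters example two slides below:

sse.r <- sum(resid(lm.reduced)^2); df.r <- df.residual(lm.reduced)
sse.l <- sum(resid(lm.large)^2); df.l <- df.residual(lm.large)
F.stat <- ((sse.r - sse.l) / (df.r - df.l)) / (sse.l / df.l)
pf(F.stat, df.r - df.l, df.l, lower.tail=FALSE) # p-value; matches anova(lm.reduced, lm.large)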

Best Subset Selection: Intuition
Suppose you have P possible predictors; consider the 2 extreme models:
Null Model (NO predictors): Y = β0 + ε
Full Model (ALL predictors): Y = β0 + β1(X1) + β2(X2) + ... + βP(XP) + ε

Example: Subset Comparison
library(ISLR) # Contains the Hitters data set
lm.reduced <- lm(Salary ~ AtBat + Hits + Walks, data=Hitters)
lm.large <- lm(Salary ~ AtBat + Hits + Walks + Division + PutOuts, data=Hitters)
lm.full <- lm(Salary ~ AtBat + Hits + Walks + Division + PutOuts + Errors, data=Hitters)
summary(lm.reduced); summary(lm.large); summary(lm.full)
anova(lm.reduced, lm.large, lm.full) # Compare all 3 models (from smaller to larger)

Best Subset Selection: Intuition
Suppose you have P possible predictors; consider the 2 extreme models:
Null Model (NO predictors): Y = β0 + ε
Full Model (ALL predictors): Y = β0 + β1(X1) + β2(X2) + ... + βP(XP) + ε
Start with the Null model, then try all single-predictor models, then all possible 2-predictor models, etc., ending with the Full model. Then compare all resulting models using cross-validation. This method works well when P is small because you end up testing all possible models. But if P is large, the pool of possible models grows exponentially (2^P) and it may not be computationally practical to test all of them:
- 10 variables → 2^10 = 1,024 models
- 20 variables → 2^20 = 1,048,576 models
There are R packages for best subset selection, with algorithms to test the most plausible models.

Example: Best Subset Selection
library(ISLR) # Needed for the Hitters data set
library(leaps) # Contains the regsubsets() function for subset selection
regfit.full <- regsubsets(Salary ~ ., data=Hitters) # Best subset search (default nvmax = 8 predictors)
summary(regfit.full)
reg.summary <- summary(regfit.full)
plot(reg.summary$rss, xlab="Number of Variables", ylab="RSS", type="l")
plot(reg.summary$adjr2, xlab="Number of Variables", ylab="Adjusted RSq", type="l")
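A natural follow-up (not on the original slide) is to pick the subset size that maximizes adjusted R-squared and inspect the coefficients of the winning model:

best.size <- which.max(reg.summary$adjr2) # Best subset size by adjusted R-squared
coef(regfit.full, best.size) # Coefficients of the model of that size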

Breakout 1

Regularization in Sports Analytics ITEC 621, Week 6

All-Star

New York Times

Plus-Minus Totals, 1/29/13
Adjusted Plus-Minus: Each possession.
ORL scoring: y = [0, 0, 0, -2, 0, 0, 0, -3, 0, 0, -2, 0, -1, 0, 1, 0]
(IND scoring: y = [0, 0, 0, +2, 0, 0, 0, +3, 0, 0, +2, 0, +1, 0, -1])

Plus-Minus (PM): How many (net) points does the team score while a player plays?
Adjusted Plus-Minus (APM): Predictive model for PM based on lineups (i.e., improves PM by controlling for teammate & opponent quality).
Regularized APM (RAPM): APM, with regularization to overcome multicollinearity & small samples (i.e., tries to identify the players with the most impact).

Adjusted Plus-Minus (APM): Predict points scored for each possession based on lineup.
[Matrix illustration: y, the net points on each possession (e.g., 2, -2, 1, -3, 2), equals X, a lineup indicator matrix of 0/1 entries marking who is on the floor, times B, the unknown APM coefficients.]
Y: Team points = X: Lineups * B: APM

Adjusted Plus-Minus (APM): Predict points scored for each possession based on lineup.
[Matrix illustration: the same y = X * B system, now with B estimated (e.g., 0.1, 0.3, 0.5, -0.9, -0.2, 0.2, 0.1, -0.1); one of the coefficients is labeled Shane Battier.]
Y: Team points = X: Lineups * B: APM
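A toy R sketch of the APM regression (entirely fabricated data, no real NBA lineups): each row is a possession, each column a player indicator (+1 on offense, -1 on defense, 0 off the floor), and plain OLS estimates B.

set.seed(1)
X <- matrix(sample(c(-1, 0, 1), 60, replace=TRUE), nrow=20,
            dimnames=list(NULL, c("playerA", "playerB", "playerC"))) # Fabricated lineups
y <- rnorm(20) # Simulated net points per possession
apm <- lm(y ~ X) # Plain OLS = APM; unstable when players always share the floor
coef(apm)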

Regularization: Adding a penalty term to the error function before minimizing it.
Ridge Regression: minimize SSE + λ Σ βj^2 (squared-coefficient penalty)
LASSO Regression: minimize SSE + λ Σ |βj| (absolute-value penalty)
Tuning parameter λ: How much to regularize?
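Both penalties are available through the glmnet package; a minimal sketch (assuming glmnet is installed, and reusing the toy X and y from the APM sketch above):

library(glmnet)
ridge <- glmnet(X, y, alpha=0) # alpha = 0: ridge penalty
lasso <- glmnet(X, y, alpha=1) # alpha = 1: LASSO penalty
coef(ridge, s=0.5); coef(lasso, s=0.5) # Coefficients at lambda = 0.5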

Tuning parameter λ: How much to regularize? (In practice, λ is typically chosen by cross-validation, as sketched below.)
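Rather than hand-picking λ, cv.glmnet() cross-validates over a grid of λ values and reports the best one; again a sketch on the toy data above:

set.seed(1)
cv.out <- cv.glmnet(X, y, alpha=0) # 10-fold CV over a grid of lambda values
cv.out$lambda.min # The lambda with the lowest cross-validated error
plot(cv.out) # CV error curve vs. log(lambda)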

Regularized Adjusted Plus-Minus (RAPM): APM with regularization for multicollinearity & small samples. Ridge regularization.
[Matrix illustration: the same y = X * B system; ridge shrinks the APM estimates toward zero (e.g., 0.1, 0.3, 0.53, -0.95, -0.22, 0.2, 0.1) without zeroing any of them out.]
Y: Team points = X: Lineups * B: RAPM

Regularized Adjusted Plus-Minus (RAPM): APM with regularization for multicollinearity & small samples. LASSO regularization.
[Matrix illustration: the same y = X * B system; LASSO sets most APM estimates exactly to zero, keeping only a few nonzero coefficients (e.g., 0.4, 0.7, -1.1).]
Y: Team points = X: Lineups * B: RAPM
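To verify the sparsity claim on the toy LASSO fit from the earlier sketch (a hypothetical check, not from the slide):

b <- as.vector(coef(lasso, s=0.5)) # LASSO coefficients at lambda = 0.5
sum(b == 0) # How many players LASSO zeroes out entirely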