
Review: Second Half of Course
Stat 704: Data Analysis I, Fall 2014
Tim Hanson, Ph.D.
University of South Carolina

Chapter 8: Polynomials & Interactions (higher order models)

* (8.1 & 8.2) Polynomials & interactions in regression. Examples:
    $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$,
    $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2} + \epsilon_i$,
    $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_{11} x_{i1}^2 + \beta_2 x_{i2} + \beta_{22} x_{i2}^2 + \beta_{12} x_{i1} x_{i2} + \epsilon_i$.
* Interpretation: first- and second-order Taylor approximations to a general surface $Y_i = g(x_{i1}, x_{i2}) + \epsilon_i$.
* Extrapolation. Centering.
* Use of higher order models to test simpler (e.g. first-order) models.
* Know how to interpret interaction models! (pp. 306-308)
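
As a quick illustration (a sketch, not from the slides; the data frame dat and columns Y, x1, x2 are hypothetical), the first-order and full second-order models can be fit in R and the higher-order terms tested against the simpler model:

    # hypothetical data frame `dat` with columns Y, x1, x2
    fit1 <- lm(Y ~ x1 + x2, data = dat)                      # first-order model
    fit2 <- lm(Y ~ x1 * x2 + I(x1^2) + I(x2^2), data = dat)  # full second-order model
    anova(fit1, fit2)  # partial F-test: do the higher-order terms improve the fit?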

Chapter 8: Polynomials & Interactions

* In the model $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_{11} x_{i1}^2 + \beta_2 x_{i2} + \epsilon_i$, how would you find where $E(Y_i)$ is largest or smallest with respect to $x_{i1}$?
* Is this still possible when there is an interaction? For example, $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_{11} x_{i1}^2 + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2} + \epsilon_i$.
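
A sketch of the calculus the question is after (not worked on the slide): set the partial derivative of $E(Y)$ with respect to $x_1$ to zero. Without the interaction,
$$\frac{\partial E(Y)}{\partial x_1} = \beta_1 + 2\beta_{11} x_1 = 0 \quad\Rightarrow\quad x_1 = -\frac{\beta_1}{2\beta_{11}},$$
a minimum if $\beta_{11} > 0$ and a maximum if $\beta_{11} < 0$. With the interaction the stationary point still exists but now depends on $x_2$:
$$\beta_1 + 2\beta_{11} x_1 + \beta_{12} x_2 = 0 \quad\Rightarrow\quad x_1 = -\frac{\beta_1 + \beta_{12} x_2}{2\beta_{11}}.$$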

Chapter 8: Polynomials & Interactions

* (8.3) Categorical predictors with two or more levels. Dummy variables, baseline group. We primarily used zero-one dummy variables.
* (8.4) Interactions between continuous and categorical predictors. Example where $x_{i2} = 0$ or $1$:
    $Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 I\{x_{i2} = 1\} + \beta_{12} x_{i1} I\{x_{i2} = 1\} + \epsilon_i$.
  Interpretation! What is $\beta_{12}$ here?
* A nice example is soap production, pp. 330-334.
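
A minimal sketch in R (dat, Y, x1, and the two-level grouping variable g are hypothetical): R builds the zero-one dummy and the interaction term automatically from a factor.

    dat$g <- factor(dat$g)             # baseline group is the first level
    fit <- lm(Y ~ x1 * g, data = dat)  # expands to x1 + dummy for g + x1:g
    summary(fit)  # the x1:g coefficient is beta_12: the difference in slopes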

Chapter 9: Model & Variable Selection

* (9.1) Model building, pp. 343-349.
* Preliminaries: functional form of response & predictors and possible interactions. Use prior knowledge.
* Reduction of predictors. Whether or not this is done depends on the type of study: controlled experiments (no), controlled experiments with adjusters (yes), confirmatory observational studies (no), exploratory studies, i.e. data dredging (YES!).
* Model refinement and selection: 9.3 & 9.4.
* Model validation: 9.6 (and chapter 10).

Chapter 9: Model & Variable Selection

* (9.3) Model selection criteria:
* Maximize $R_a^2 = 1 - \frac{MSE_p}{SSTO/(n-1)}$. Same as minimizing $MSE_p$. Penalizes for adding predictors.
* Minimize $C_p = \frac{SSE_p}{MSE_P} - (n - 2p)$. Here $P$ is the number of predictors (including the intercept) in the large model that provides a good estimate of $\sigma^2$, and $p$ is the number of predictors (including the intercept) in the smaller model. If a model has little bias, $E(C_p) \approx p$, so look for models with $C_p \approx p$ (or $< p$). Also penalizes for adding too many predictors.
* $AIC_p$ & $BIC_p$ (BIC also called SBC). $AIC = -2 \log L(\hat\beta) + 2p$.
* $PRESS = \sum_{i=1}^n (Y_i - \hat{Y}_{i(i)})^2$: a leave-one-out cross-validated measure of a model's predictive ability.
* (9.4) Backwards elimination, forward selection, stepwise variable selection.
* (9.6) Validation: we did not cover this much. Use of a validation sample. PRESS gets at the same thing: the model with the smallest PRESS has the best n-fold out-of-sample prediction.
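
A short R sketch (the full model and data frame dat are hypothetical): step() does stepwise selection by AIC, and PRESS follows from the leave-one-out identity $e_{i(i)} = e_i/(1-h_{ii})$.

    full <- lm(Y ~ x1 + x2 + x3 + x4, data = dat)
    step(full, direction = "both")   # stepwise by AIC; also "backward", "forward"
    step(full, k = log(nrow(dat)))   # k = log(n) penalizes like BIC/SBC instead

    # PRESS without refitting n models, via e_i / (1 - h_ii)
    press <- sum((residuals(full) / (1 - hatvalues(full)))^2)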

Chapter 10: Diagnostics & Validation

* Can we trust the model?
* (10.1) Added variable (partial regression) plots: regress $Y$ on all predictors except $x_j$, regress $x_j$ on the remaining predictors, and plot residuals versus residuals. Flat pattern: $x_j$ not needed. Linear pattern: $x_j$ needed as a linear term. Curved pattern: $x_j$ needed, but transformed.
* Added variable plots assume the remaining predictors are correctly specified. Can lead to lots of model tweaking.
* (10.2) $H = X(X'X)^{-1}X'$ is the $n \times n$ hat matrix; its diagonal elements $h_{ii}$ are the leverages.
* $r_i = \frac{Y_i - \hat{Y}_i}{\sqrt{MSE(1-h_{ii})}}$ is the (internally) studentized residual.
* $t_i = \frac{Y_i - \hat{Y}_i}{\sqrt{MSE_{(i)}(1-h_{ii})}}$ is the (externally) studentized deleted residual. It has a $t_{n-p-1}$ distribution if the model is true: Bonferroni check for observations outlying with respect to the model.
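
These quantities are built into R's stats package; a minimal sketch (fit and dat are hypothetical):

    fit <- lm(Y ~ x1 + x2, data = dat)
    r <- rstandard(fit)   # internally studentized residuals r_i
    t <- rstudent(fit)    # externally studentized deleted residuals t_i
    n <- nrow(dat); p <- length(coef(fit))
    # Bonferroni outlier check: flag |t_i| above the t critical value at alpha/(2n)
    which(abs(t) > qt(1 - 0.05 / (2 * n), df = n - p - 1))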

Chapter 10: Diagnostics & Validation

* Plots of $r_i$ or $t_i$ versus the predictors $x_1, \ldots, x_k$ and/or versus the fitted values $\hat{Y}_i$ confirm or invalidate modeling assumptions and also get at adequacy: check for linearity & constant variance. Residuals should be approximately normal if normality holds and the model is correct. Non-normality affects hypothesis tests in small samples.
* (10.3) The $h_{ii}$ measure leverage: potentially influential values of $x_i$. Rule of thumb: $h_{ii} > 2p/n$ is outlying with respect to the rest of the predictor vectors. $h = x_h'(X'X)^{-1}x_h$ is used to check for hidden extrapolation with high-dimensional predictors.
* (10.4) $DFFITS_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)} h_{ii}}}$ is the number of standard deviations the fitted value $\hat{Y}_i$ changes when leaving out $(x_i, Y_i)$. $|DFFITS_i| > 1$ or $> 2\sqrt{p/n}$ indicates an influential observation.
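
The leverage and DFFITS rules of thumb translate directly into R (continuing the hypothetical fit from above):

    h <- hatvalues(fit)                        # leverages h_ii
    p <- length(coef(fit)); n <- nrow(dat)
    which(h > 2 * p / n)                       # high-leverage cases
    which(abs(dffits(fit)) > 2 * sqrt(p / n))  # influential by the DFFITS rule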

Chapter 10: Diagnostics & Validation

* Cook's distance $D_i = \frac{\sum_{j=1}^n (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, MSE} = \frac{r_i^2 h_{ii}}{(1-h_{ii})p}$. When is $D_i$ large? An aggregate measure of how the fitted surface changes when leaving out observation $i$. Look for values substantially larger than the other values.
* What do large values of $r_i$ indicate? Large values of $D_i$? Large values of both?
* (10.5) $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ is the $R^2$ from regressing $x_j$ onto the rest of the predictors. Check for $VIF_j > 10$. VIFs are for predictors, not observations.
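
Both diagnostics are one-liners in R (fit, dat, and predictors x1, x2, x3 hypothetical; car::vif automates the VIF computation if that package is available):

    cd <- cooks.distance(fit)
    plot(cd, type = "h")   # look for D_i substantially larger than the rest
    # VIF for x1 by hand, straight from the definition
    r2 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
    1 / (1 - r2)           # flag values above 10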

Chapter 11: Remedial Measures (other types of regression)

* (11.1) Weighted least squares: a fix for non-constant variance that does not involve transforming the response.
* Recipe on page 425 to estimate the weights $w_i$.
* $b_w = (X'WX)^{-1}X'WY$.
* (11.2) Ridge regression: a fix for multicollinearity.
* Adds bias, reduces variance by adding a biasing constant $c$ to the OLS estimate from the standardized regression model: $b^r = (X'X + cI)^{-1}X'Y$. Pick $c$ from a ridge trace of the standardized regression coefficients.
* Increasing $c$ reduces the variance of the estimator but biases the coefficients toward zero.
* LASSO is the same idea, but regression parameter estimates can be exactly zero: useful for multicollinearity and variable selection.
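
A sketch of both in R (dat hypothetical; the weight recipe shown, regressing absolute residuals on fitted values, is one version of the textbook's approach and is an assumption here):

    # weighted least squares: estimate s_i, then use weights w_i = 1/s_i^2
    ols <- lm(Y ~ x1 + x2, data = dat)
    s <- fitted(lm(abs(residuals(ols)) ~ fitted(ols)))
    wls <- lm(Y ~ x1 + x2, data = dat, weights = 1 / s^2)

    # ridge trace over a grid of biasing constants (lm.ridge is in MASS)
    library(MASS)
    plot(lm.ridge(Y ~ x1 + x2, data = dat, lambda = seq(0, 1, by = 0.01)))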

Chapter 11: Remedial Measures

* (11.3) Least absolute deviations (LAD) or $L_1$ regression minimizes (p. 438)
    $Q(\beta) = \sum_{i=1}^n |Y_i - x_i'\beta|$.
* Dampens the effect of outliers (with respect to the model). Also useful for non-normal errors.
* Fit in PROC QUANTREG.
* IRLS (iteratively reweighted least squares) robust regression. A special case is Huber regression, which is between OLS and LAD. Fit in PROC ROBUSTREG.
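
The R counterparts (dat hypothetical): LAD is median regression, i.e. quantile regression at $\tau = 0.5$, and rlm() fits Huber-type regression by IRLS.

    library(quantreg)
    lad <- rq(Y ~ x1 + x2, tau = 0.5, data = dat)  # L1 / median regression

    library(MASS)
    hub <- rlm(Y ~ x1 + x2, data = dat, psi = psi.huber)  # IRLS, Huber loss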

Chapter 11: Remedial Measures: the additive model

* Fit in proc gam or proc transreg (also the gam package in R). We will come back to this in STAT 705. Most general version:
    $h(Y_i) = \beta_0 + \beta_1 x_{i1} + g_1(x_{i1}) + \cdots + \beta_k x_{ik} + g_k(x_{ik}) + \epsilon_i$.
* The functions $h(\cdot), g_1(\cdot), \ldots, g_k(\cdot)$ are fit via splines.
* Obviates the use of added variable plots. All $k$ transformations are estimated simultaneously.
* proc gam provides tests of $H_0: g_j(x_j) = 0$ for all $x_j$, but does not estimate $h(\cdot)$.
* proc transreg provides estimates of $\tilde{g}_j(x_j) = \beta_j x_j + g_j(x_j)$ as well as $h(\cdot)$, but does not test $H_0: g_j(x_j) = 0$ for all $x_j$ (that I know of).
* Can be used to quickly find suitable transformations of predictors in normal-errors regression.
* Possible to include interactions & interaction surfaces in proc gam.
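
A minimal R sketch using mgcv, a close relative of the gam package named above (dat hypothetical):

    library(mgcv)
    afit <- gam(Y ~ s(x1) + s(x2), data = dat)  # smooth g_j's fit by splines
    summary(afit)  # approximate tests of whether each smooth term is needed
    plot(afit)     # estimated g_j's suggest predictor transformations for lm()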

Chapter 11: Remedial Measures: Exam II

* Will take place in LC 205, a few doors down, Thursday 12/4 at the usual time. This is a STAT 205 lab.
* Exam II will be one data analysis project and a few short-answer questions. Hopefully shorter than the midterm.
* Closed book, closed notes; you may use the SAS & R documentation, but not the internet.
* You will need to build and validate a good, predictive regression model for a given data set. Be as complete as you can. Look at the applied parts of old qualifying exams to get an idea.
* Either use the PCs in the lab (you need a login), or bring your own laptop. You will need to print out your exam in the lab (preferred), or else email it to me and I'll print it out (not preferred).
* Questions?