Regression Model Building


Regression Model Building

Setting: a possibly large set of predictor variables (including interactions).
Goal: fit a parsimonious model that explains variation in Y with a small set of predictors.

Automated procedures and all possible regressions:
- Backward Elimination (top-down approach)
- Forward Selection (bottom-up approach)
- Stepwise Regression (combines forward and backward)
- Cp statistic: summarizes each possible model, so the best model can be selected based on the statistic

Backward Elimination

Select a significance level to stay in the model (e.g. SLS = 0.20; 0.05 is generally too low, causing too many variables to be removed).
1. Fit the full model with all possible predictors.
2. Consider the predictor with the lowest t-statistic (highest P-value).
3. If P > SLS, remove that predictor and fit the model without it (the model must be re-fit here because the partial regression coefficients change); return to step 2.
4. If P ≤ SLS, stop and keep the current model.
Continue until all remaining predictors have P-values at or below SLS.
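
The elimination loop above can be sketched in a few lines. This is an illustrative implementation on simulated data, not from the original slides; the predictor names and coefficients are made up, and p-values come from ordinary t-tests on the OLS coefficients.

```python
# Hypothetical sketch of backward elimination with SLS = 0.20.
# Simulated data; variable names are illustrative only.
import numpy as np
from scipy import stats

def fit_ols(X, y):
    """Return OLS coefficients, standard errors, and two-sided p-values."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    cov = mse * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df=n - p)
    return beta, se, pvals

def backward_eliminate(X, y, names, sls=0.20):
    """Drop the predictor with the largest p-value until all P <= SLS.
    Column 0 (intercept) is never removed; the model is re-fit after each drop."""
    keep = list(range(X.shape[1]))
    while True:
        _, _, p = fit_ols(X[:, keep], y)
        worst = 1 + int(np.argmax(p[1:]))      # skip the intercept
        if p[worst] <= sls:
            return [names[j] for j in keep]
        del keep[worst]                        # remove, then re-fit next pass

rng = np.random.default_rng(1)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 2 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)   # x3 is pure noise
X = np.column_stack([np.ones(n), x1, x2, x3])
kept = backward_eliminate(X, y, ["const", "x1", "x2", "x3"])
print(kept)
```

The re-fit after each removal matters: deleting one predictor changes every remaining partial coefficient, so the p-values must be recomputed each pass.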

Forward Selection

Choose a significance level to enter the model (e.g. SLE = 0.20; 0.05 is generally too low, causing too few variables to be entered).
1. Fit all simple regression models and consider the predictor with the highest t-statistic (lowest P-value).
2. If P ≤ SLE, keep this variable and fit all two-variable models that include it.
3. If P > SLE, stop and keep the previous model.
Continue until no new predictors have P ≤ SLE.
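
A minimal sketch of the selection loop, again on simulated data with made-up variable names. At each step it computes the p-value each remaining candidate would have if added to the current model, and admits the best one while P ≤ SLE.

```python
# Illustrative forward selection with SLE = 0.20 (assumptions: simulated
# data, t-test p-value for the newly added coefficient at each step).
import numpy as np
from scipy import stats

def pvalue_of_new_term(X_current, x_new, y):
    """p-value of x_new's coefficient when appended to the current model."""
    X = np.column_stack([X_current, x_new])
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    mse = resid @ resid / (n - p)
    se = np.sqrt(mse * np.linalg.inv(X.T @ X)[-1, -1])
    return 2 * stats.t.sf(abs(beta[-1] / se), df=n - p)

def forward_select(cols, y, sle=0.20):
    """Add the candidate with the smallest p-value while P <= SLE."""
    n = len(y)
    model = np.ones((n, 1))                 # start with intercept only
    chosen, remaining = [], dict(cols)
    while remaining:
        pvals = {k: pvalue_of_new_term(model, v, y) for k, v in remaining.items()}
        best = min(pvals, key=pvals.get)
        if pvals[best] > sle:
            break                           # stop and keep previous model
        model = np.column_stack([model, remaining.pop(best)])
        chosen.append(best)
    return chosen

rng = np.random.default_rng(0)
n = 120
x1, x2, x3 = rng.normal(size=(3, n))
y = 1 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)    # x3 is irrelevant
chosen = forward_select({"x1": x1, "x2": x2, "x3": x3}, y)
print(chosen)
```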

Stepwise Regression

Select SLS and SLE (with SLE < SLS).
- Starts like Forward Selection (a bottom-up process).
- New variables must have P ≤ SLE to enter.
- All variables already entered are re-tested at each step and must have P ≤ SLS to stay in the model.
- Continues until no new variables can be entered and no old variables need to be removed.

All Possible Regressions - Cp

Fits every possible model. With K potential predictor variables, there are 2^K - 1 models.
- Label the mean square error of the model containing all K predictors MSE_K.
- For each model, compute SSE and Cp, where p is the number of parameters (including the intercept) in that model:

  Cp = SSE / MSE_K - (n - 2p)

- Select the model with the fewest predictors that has Cp ≤ p.
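
The search can be written as a brute-force loop over all nonempty predictor subsets. This is a sketch on simulated data (three made-up predictors, so 2^3 - 1 = 7 candidate models); a tiny tolerance is added to the Cp ≤ p test to absorb floating-point error.

```python
# All-possible-regressions search using Mallows' Cp = SSE/MSE_K - (n - 2p).
# Simulated data; predictor names and coefficients are illustrative.
import itertools
import numpy as np

def sse(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(2)
n = 80
preds = {"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)}
y = 3 + 1.2 * preds["x1"] + 0.9 * preds["x2"] + rng.normal(size=n)  # x3 irrelevant

ones = np.ones(n)
K = len(preds)
mse_K = sse(np.column_stack([ones] + list(preds.values())), y) / (n - (K + 1))

results = {}
for r in range(1, K + 1):                      # all 2^K - 1 nonempty subsets
    for subset in itertools.combinations(preds, r):
        X = np.column_stack([ones] + [preds[v] for v in subset])
        p = X.shape[1]                         # parameters incl. intercept
        results[subset] = sse(X, y) / mse_K - (n - 2 * p)

# Candidates satisfying Cp <= p (small bias); keep the fewest-predictor one.
good = [s for s in results if results[s] <= len(s) + 1 + 1e-9]
best = min(good, key=len)
print(best)
```

Note that the full model always has Cp = K + 1 = p exactly, so `good` is never empty; the criterion rewards smaller models that get close to that benchmark.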

Regression Diagnostics

Model assumptions:
- The regression function is correctly specified (e.g. linear).
- The conditional distribution of Y is normal.
- The conditional distribution of Y has constant standard deviation.
- The observations on Y are statistically independent.

Residual plots can be used to check the assumptions:
- A histogram (or stem-and-leaf plot) of the residuals should be mound-shaped (normal).
- A plot of residuals versus each predictor should be a random cloud. A U-shaped (or inverted-U) pattern indicates a nonlinear relation; a funnel shape indicates non-constant variance.
- A plot of residuals versus time order (for time series data) should be a random cloud; if a pattern appears, the observations are not independent.
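
A small numerical illustration of the U-shape check: fit a straight line to data that are actually quadratic and confirm that the residuals are not a random cloud. The data here are simulated for the example.

```python
# Fit the (wrong) straight-line model to quadratic data and show the
# residuals carry the leftover curvature, i.e. a U-shaped residual plot.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)
y = 1 + 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.5, size=x.size)

X = np.column_stack([np.ones_like(x), x])       # straight-line design
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# OLS forces the residuals to be uncorrelated with x itself, but a U-shaped
# pattern shows up as a strong correlation between the residuals and x^2.
r_linear = np.corrcoef(resid, x)[0, 1]          # ~0 by construction
r_quadratic = np.corrcoef(resid, x**2)[0, 1]    # large: nonlinearity missed
print(round(r_linear, 3), round(r_quadratic, 3))
```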

Detecting Influential Observations

Studentized residuals: residuals divided by their estimated standard errors (like t-statistics). Observations with values larger than 3 in absolute value are considered outliers.

Leverage values (hat diagonals): measure how far an observation is from the others in terms of the levels of the independent variables (not the dependent variable). Observations with values larger than 2(k+1)/n are considered potentially highly influential, where k is the number of predictors and n is the sample size.

DFFITS: measures how much an observation has affected its own fitted value from the regression model. Values larger than 2·sqrt((k+1)/n) in absolute value are considered highly influential. Use standardized DFFITS in SPSS.
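
These three quantities can all be computed from the hat matrix H = X(X'X)⁻¹X'. The sketch below uses simulated data with one planted outlier and applies the cutoffs stated above; it is illustrative, not the slides' own example.

```python
# Leverages, studentized residuals, and DFFITS from scratch with numpy.
# Simulated data with a planted outlier at observation 0.
import numpy as np

rng = np.random.default_rng(4)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[0] += 5.0                                    # plant an outlier

H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
h = np.diag(H)                                 # leverage values
resid = y - H @ y
mse = resid @ resid / (n - k - 1)
student = resid / np.sqrt(mse * (1 - h))       # studentized residuals

# DFFITS uses the externally studentized (deleted) residual t_del:
s2_del = ((n - k - 1) * mse - resid**2 / (1 - h)) / (n - k - 2)
t_del = resid / np.sqrt(s2_del * (1 - h))
dffits = t_del * np.sqrt(h / (1 - h))

high_leverage = h > 2 * (k + 1) / n            # leverage cutoff
outliers = np.abs(student) > 3                 # outlier cutoff
influential = np.abs(dffits) > 2 * np.sqrt((k + 1) / n)
print(int(outliers.sum()), int(influential.sum()))
```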

Detecting Influential Observations (continued)

DFBETAS: measures how much an observation has affected the estimate of a regression coefficient (there is one DFBETA for each regression coefficient, including the intercept). Values larger than 2/sqrt(n) in absolute value are considered highly influential.

Cook's D: measures the aggregate impact of each observation on the group of regression coefficients, as well as on the group of fitted values. Values larger than 4/n are considered highly influential.

COVRATIO: measures the impact of each observation on the variances (and standard errors) of the regression coefficients and their covariances. Values outside the interval 1 +/- 3(k+1)/n are considered highly influential.
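
Cook's D and DFBETAS have closed-form leave-one-out expressions, so no re-fitting of n models is needed. A sketch on simulated data with one planted influential point, using the 4/n and 2/sqrt(n) cutoffs from above:

```python
# Cook's D and DFBETAS via closed-form leave-one-out formulas.
# Simulated data; the planted shift at observation 0 is illustrative.
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([0.5, 1.0, 1.0]) + rng.normal(scale=0.4, size=n)
y[0] += 6.0                                    # plant an influential point

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
h = np.diag(H)
e = y - H @ y
p = k + 1
mse = e @ e / (n - p)

# Cook's D: aggregate shift in all fitted values when obs i is deleted.
cooks_d = (e**2 / (p * mse)) * (h / (1 - h)**2)

# DFBETAS: per-coefficient shift, scaled by the deleted-residual variance.
s2_del = ((n - p) * mse - e**2 / (1 - h)) / (n - p - 1)
dbeta = (XtX_inv @ X.T) * (e / (1 - h))        # column i = beta - beta(i)
c_jj = np.diag(XtX_inv)
dfbetas = dbeta / np.sqrt(np.outer(c_jj, s2_del))

print(round(float(cooks_d[0]), 3))
```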

Obtaining Influence Statistics and Studentized Residuals in SPSS

1. Choose ANALYZE, REGRESSION, LINEAR, and input the dependent variable and the set of independent variables from your model of interest (possibly chosen via an automated model selection method).
2. Under STATISTICS, select Collinearity Diagnostics, Casewise Diagnostics, and All Cases, then CONTINUE.
3. Under PLOTS, select Y: *SRESID and X: *ZPRED, and also choose HISTOGRAM. These give a plot of studentized residuals versus standardized predicted values, and a histogram of standardized residuals (residual/sqrt(MSE)). Select CONTINUE.
4. Under SAVE, select Studentized Residuals, Cook's, Leverage Values, Covariance Ratio, Standardized DFBETAS, and Standardized DFFITS. Select CONTINUE. The results will be added to your original data worksheet.

Variance Inflation Factors

The Variance Inflation Factor (VIF) measures how highly correlated each independent variable is with the other predictors in the model, and is used to identify multicollinearity. A value larger than 10 for a predictor implies large inflation of the standard errors of the regression coefficients due to that variable being in the model. Inflated standard errors lead to small t-statistics for the partial regression coefficients and wider confidence intervals.
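
The VIF for predictor j is 1/(1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A sketch computing this directly, with two deliberately near-collinear simulated columns:

```python
# VIF from its definition: VIF_j = 1 / (1 - R_j^2). Simulated data;
# the near-duplicate column x2 is planted to trigger the VIF > 10 rule.
import numpy as np

def r_squared(X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean())**2)

def vif(X):
    """One VIF per column of the predictor matrix X (no intercept column)."""
    return [1 / (1 - r_squared(np.delete(X, j, axis=1), X[:, j]))
            for j in range(X.shape[1])]

rng = np.random.default_rng(6)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)     # nearly a copy of x1
x3 = rng.normal(size=n)                      # unrelated
v = vif(np.column_stack([x1, x2, x3]))
print([round(val, 1) for val in v])
```

As expected, the two collinear columns get huge VIFs while the independent column stays near 1.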

Nonlinearity: Polynomial Regression

When the relation between Y and X is not linear, polynomial models can be fit that approximate the relationship within a particular range of X.

General form of the model:

  E(Y) = α + β1·X + ... + βk·X^k

Second-order model (the most widely used case; it allows one "bend"):

  E(Y) = α + β1·X + β2·X²

Be very careful not to extrapolate beyond the observed X levels.
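
Fitting the second-order model is just ordinary least squares on a design matrix with X and X² columns. A sketch on simulated data (the true coefficients below are made up for the example):

```python
# Quadratic fit E(Y) = a + b1*X + b2*X^2 by least squares on simulated data.
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=150)
y = 2 + 1.5 * x - 0.2 * x**2 + rng.normal(scale=0.3, size=x.size)

# Design matrix with intercept, linear, and quadratic terms.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = beta
print(round(a, 2), round(b1, 2), round(b2, 2))

# Predictions are only trustworthy inside the observed range of X:
x_new = 5.0                      # safe: within the observed [0, 10]
y_hat = a + b1 * x_new + b2 * x_new**2
```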

Generalized Linear Models (GLM)

A general class of linear models made up of three components: a random component, a systematic component, and a link function.
- Random component: identifies the dependent variable (Y) and its probability distribution.
- Systematic component: identifies the set of explanatory variables (X1, ..., Xk).
- Link function: identifies a function of the mean that is a linear function of the explanatory variables:

  g(μ) = α + β1·X1 + ... + βk·Xk

Random Component

- A conditionally normally distributed response with constant standard deviation: the regression models we have fit so far.
- Binary outcomes (success or failure): the random component has a binomial distribution and the model is called logistic regression.
- Count data (the number of events in a fixed area and/or length of time): the random component has a Poisson distribution and the model is called Poisson regression.
- Continuous data with a skewed distribution and variation that increases with the mean can be modeled with a gamma distribution.

Common Link Functions

- Identity link (the form used in normal and gamma regression models):  g(μ) = μ
- Log link (used when μ cannot be negative, as when the data are Poisson counts):  g(μ) = log(μ)
- Logit link (used when μ is bounded between 0 and 1, as when the data are binary):  g(μ) = log(μ / (1 - μ))
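
The three links above, written as plain functions to make the bounded-mean motivation concrete (a minimal illustration; the function names are my own):

```python
# The common GLM link functions and their inverses.
import math

def identity_link(mu):
    return mu                          # normal / gamma regression

def log_link(mu):
    return math.log(mu)                # Poisson counts: requires mu > 0

def logit_link(mu):
    return math.log(mu / (1 - mu))     # binary data: requires 0 < mu < 1

# The inverse links map any linear predictor back to a valid mean:
def inv_log(eta):
    return math.exp(eta)               # always positive
def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))    # always strictly between 0 and 1

print(round(inv_logit(logit_link(0.25)), 6))   # round-trips to 0.25
```

This is the point of the link: the linear predictor α + β1X1 + ... + βkXk ranges over all real numbers, and the inverse link guarantees the implied mean stays in its legal range.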

Exponential Regression Models

Often when modeling the growth of a population, the relationship between population and time is exponential:

  E(Y) = μ = α·β^X

Taking the logarithm of each side leads to the linear relation:

  log(μ) = log(α) + X·log(β) = α' + β'X

Procedure: fit a simple regression relating log(Y) to X, obtaining log(Ŷ) = a + bX. Then transform back:

  α̂ = e^a    β̂ = e^b    Ŷ = α̂·β̂^X
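
The procedure is two lines of least squares plus an exponentiation. A sketch on simulated growth data (the true α = 50 and β = 1.3, i.e. 30% growth per period, are invented for the example):

```python
# Exponential regression via the log transform: fit log(Y) = a + b*X,
# then back-transform alpha = e^a, beta = e^b. Simulated data.
import numpy as np

rng = np.random.default_rng(8)
alpha_true, beta_true = 50.0, 1.3
x = np.arange(0, 20, dtype=float)
y = alpha_true * beta_true**x * np.exp(rng.normal(scale=0.05, size=x.size))

# Simple regression of log(Y) on X.
X = np.column_stack([np.ones_like(x), x])
(a, b), *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

alpha_hat, beta_hat = np.exp(a), np.exp(b)     # back-transform
print(round(float(alpha_hat), 1), round(float(beta_hat), 2))

# Fitted curve on the original scale:
y_hat = alpha_hat * beta_hat**x
```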