Multiple Regression Analysis


Whereas simple linear regression has two variables (one dependent, one independent):

ŷ = a + bx

multiple linear regression has more than two variables (one dependent, many independent):

ŷ = a + b1x1 + b2x2 + ... + bnxn

The problems and solutions are the same as in bivariate regression, except there are more parameters to estimate.
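
As a concrete illustration (not from the lecture, which uses SPSS), here is a minimal Python sketch of estimating the intercept and slopes by ordinary least squares, using made-up data and the statsmodels library:

    # Minimal OLS sketch: fit y-hat = a + b1*x1 + b2*x2 to made-up data.
    import numpy as np
    import statsmodels.api as sm

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
    y  = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

    X = sm.add_constant(np.column_stack([x1, x2]))  # adds the intercept a
    fit = sm.OLS(y, X).fit()
    print(fit.params)  # [a, b1, b2], the least-squares estimates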

In bivariate regression we fit a line through points plotted in 2-dimensional space:

In multiple regression with three variables we fit a plane through points plotted in 3-dimensional space. Each additional variable adds another dimension to the variable space.

In addition to the assumptions of bivariate regression, multiple regression has the assumption of no multicollinearity among the independent variables. Multicollinearity: when two or more of the independent variables are highly correlated, making it difficult to separate their effects on the dependent variable.
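
A quick way to screen for this (a sketch with made-up data; the lecture itself does this in SPSS) is to inspect the pairwise correlations among the independent variables:

    # Screen for multicollinearity: pairwise correlation of predictors.
    import numpy as np

    x1 = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 25.0])
    x2 = np.array([40.0, 38.0, 33.0, 30.0, 26.0, 22.0])  # tracks x1 closely

    r = np.corrcoef(x1, x2)[0, 1]
    print(f"r = {r:.3f}")  # values near +/-1 warn of multicollinearity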

Example: Determine the strength of the relationship between Native American male standing height, average yearly minimum temperature, and annual temperature range.

Variables:
MHT - Male Standing Height (cm) - Dependent
AnnMinTemp - Annual Minimum Temp (°F) - Independent
AnnRange - Annual Temp Range (°F) - Independent

Model Summary (Dependent Variable: MHT; Predictors: (Constant), AnnRange, AnnMinTemp):
R = .654, R Square = .428, Adjusted R Square = .416, Std. Error of the Estimate = 30.04066, Durbin-Watson = 1.683
41.6% of the variation in height is explained by minimum temperature and temperature range.

ANOVA:
Regression: Sum of Squares = 63546.875, df = 2, Mean Square = 31773.438, F = 35.208, Sig. = .000
Residual: Sum of Squares = 84829.502, df = 94, Mean Square = 902.442
Total: Sum of Squares = 148376.496, df = 96
The model is significant.

Coefficients (Dependent Variable: MHT):
(Constant): B = 1665.620, Std. Error = 15.964, t = 104.334, Sig. = .000
AnnMinTemp: B = 4.492, Std. Error = .603, Beta = .855, t = 7.446, Sig. = .000, Tolerance = .462, VIF = 2.166
AnnRange: B = 1.565, Std. Error = .552, Beta = .325, t = 2.834, Sig. = .006, Tolerance = .462, VIF = 2.166
The slopes are not zero. There is some collinearity.

We interpret the regression equation as follows:

Male Standing Height = 1665.6 + 4.49(°F min temp) + 1.57(°F temp range)

Every 1 °F increase in minimum temperature adds 4.49 centimeters to male standing height, holding the temperature range constant. Likewise, every 1 °F increase in the annual temperature range adds 1.57 centimeters to male standing height, holding the minimum temperature constant.
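
A small sketch of the same plug-in logic in Python (the coefficients are the rounded values quoted on the slide, not re-estimated):

    # The slide's fitted equation as a function.
    def predicted_height(min_temp_f, temp_range_f):
        return 1665.6 + 4.49 * min_temp_f + 1.57 * temp_range_f

    # Raising minimum temperature by 1 degree F, holding range constant,
    # changes the prediction by exactly the slope, 4.49:
    print(round(predicted_height(10.0, 40.0) - predicted_height(9.0, 40.0), 2))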

Normality of the residuals is one of the most important assumptions of linear regression. In this case the residuals are normally distributed.

A plot of observed against predicted values displays no systematic bias; such bias would indicate that the independent variables vary systematically with each other.

Coefficients (Dependent Variable: MHT):
(Constant): B = 1665.620, Std. Error = 15.964, t = 104.334, Sig. = .000
AnnMinTemp: B = 4.492, Std. Error = .603, Beta = .855, t = 7.446, Sig. = .000, Tolerance = .462, VIF = 2.166
AnnRange: B = 1.565, Std. Error = .552, Beta = .325, t = 2.834, Sig. = .006, Tolerance = .462, VIF = 2.166

Tolerance is the proportion of the variance in a given independent variable that cannot be explained by the other independent variables. Here 46.2% of the variance in one predictor cannot be explained by the other, meaning that 53.8% of the variance IS shared, or collinear.

This is why the standard error of the estimate (30.04066 in the Model Summary) is so large. The standard error of the estimate is the average error expressed in the original units (e.g., centimeters). 30 cm is a foot of error... in a person's height.
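
Tolerance can be computed by hand with an auxiliary regression: regress one predictor on the others and take 1 - R². A sketch with made-up data, using statsmodels:

    # Tolerance = 1 - R^2 of a predictor regressed on the other predictors.
    import numpy as np
    import statsmodels.api as sm

    x1 = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 25.0])
    x2 = np.array([40.0, 39.0, 33.0, 31.0, 25.0, 23.0])

    aux = sm.OLS(x1, sm.add_constant(x2)).fit()  # regress x1 on x2
    tolerance = 1.0 - aux.rsquared
    print(tolerance)  # low tolerance means most of x1's variance is shared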

In the same coefficients table, the VIFs (variance inflation factors) are just over 2.1; VIFs higher than 2 are considered problematic (according to SPSS).
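
Since VIF is simply 1/tolerance, statsmodels can report it directly; a sketch with the same made-up data as above:

    # VIF via statsmodels' built-in helper.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    x1 = np.array([10.0, 12.0, 15.0, 18.0, 21.0, 25.0])
    x2 = np.array([40.0, 39.0, 33.0, 31.0, 25.0, 23.0])
    X = sm.add_constant(np.column_stack([x1, x2]))  # column 0 is the constant

    for i, name in ((1, "x1"), (2, "x2")):
        print(name, variance_inflation_factor(X, i))  # VIF = 1 / tolerance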

The standardized beta values indicate the relative strength of the relationship between each independent variable and the dependent variable. By this measure, minimum temperature (Beta = .855) is a much stronger predictor of height than annual range (Beta = .325).

The question becomes: do these collinearity statistics rise to the level of indicating multicollinearity among the independent variables? In this example they do.

Correlations (N = 97):
AnnMinTemp vs. AnnRange: Pearson Correlation = -.734**, Sig. (2-tailed) = .000
**. Correlation is significant at the 0.01 level (2-tailed).

Misspecification: an error in the regression equation due to the exclusion of an independent variable that influences the dependent variable OR the inclusion of an independent variable that does not influence the dependent variable. Misspecification errors are common, since it is difficult to know a priori what factors influence the dependent variable. Misspecification is a hypothesis issue, not a statistical one.

Data Transformation

Often the association between two variables is not linear. Data transformation (log, etc.) is perfectly acceptable, but the type of transformation must be stated in your summary statement.

In this case, log transforming the population data created a linear relationship.

Converting to natural log is easy. For example, the mining town of Argentine has a population of 100; its natural log is:

pop(ln) = ln(100) = 4.60517

Converting back to the original units is also easy:

pop = e^4.60517 = 100
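
The same conversions in Python, using the Argentine example from above:

    # Natural log and back again.
    import math

    pop_ln = math.log(100)   # 4.60517...
    pop = math.exp(pop_ln)   # back to 100.0
    print(pop_ln, pop)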

Calculator transformations:
Converting to a log: use the ln key.
Converting from a log: use the e^x key.

SPSS transformations:
Converting to a log: Transform > Compute Variable > Arithmetic > Ln
Converting from a log: Transform > Compute Variable > Arithmetic > Exp

Population and Elevation in Colorado Mining Towns. The model is significant. What is the standard error of the estimate telling us? What are the units? Population = 46852.9 + (-4.238)(Elevation)

Population and Elevation in Colorado Mining Towns: Log Transformation. The model is significant. What is the standard error of the estimate telling us? What are the units? ln(population) = 33.108 - 0.003(elevation)

Town | Population | Elevation (ft) | ln(population) | ln(predicted) | ln(residual)
Argentine | 100 | 11161 | 4.61 | 4.90195 | -.29678
Boreas | 200 | 11535 | 5.30 | 3.95677 | 1.34155
Breckenridge | 8000 | 9597 | 8.99 | 8.85453 | .13267
Buckskin Joe | 500 | 10860 | 6.21 | 5.66264 | .55196
Chihuahua | 200 | 10571 | 5.30 | 6.39301 | -1.09469
Dudley | 200 | 10400 | 5.30 | 6.82517 | -1.52685
Fairplay | 8000 | 9931 | 8.99 | 8.01043 | .97676
Hamilton | 3000 | 9997 | 8.01 | 7.84364 | .16273
Horseshoe | 800 | 10544 | 6.68 | 6.46125 | .22337
Lamartine | 500 | 10485 | 6.21 | 6.61035 | -.39574
Lincoln | 1500 | 10384 | 7.31 | 6.86560 | .44762
Montezuma | 800 | 10358 | 6.68 | 6.93131 | -.24670
Mosquito | 250 | 10720 | 5.52 | 6.01645 | -.49499
Park City | 300 | 10587 | 5.70 | 6.35258 | -.64879
Parkville | 10000 | 9944 | 9.21 | 7.97758 | 1.23276
Quartzville | 200 | 11424 | 5.30 | 4.23729 | 1.06103
Rexford | 50 | 11201 | 3.91 | 4.80086 | -.88884
Sacramento | 100 | 11398 | 4.61 | 4.30300 | .30217
Saints John | 200 | 10798 | 5.30 | 5.81933 | -.52101
Silverheels | 150 | 10771 | 5.01 | 5.88757 | -.87693
Swandyke | 200 | 11093 | 5.30 | 5.07380 | .22452
Silver Plume | 5500 | 9825 | 8.61 | 8.27832 | .33418

Converting to Original Units from a Log Transformation

Town = Horseshoe; Population = 800; Elevation = 10,544 ft
Predicted ln(population) = 6.46125
Calculated ln(residual) = 0.22337

Converting to original units (people): population = e^6.46125 = 640
Converting the residual: residual = e^0.22337 = 1.25028
This is the ratio of the actual value to the predicted value.
Original population = (640)(1.25028) = 800.2

Observed - Predicted = residual
Residual in original units (people): difference = 800 - 640 = 160
i.e., the equation under-predicted Horseshoe's population by 160 people.
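
The same back-conversion in Python, using the Horseshoe numbers from the slide:

    # Convert a log-model prediction and residual back to people.
    import math

    ln_predicted = 6.46125
    ln_residual  = 0.22337

    predicted = math.exp(ln_predicted)   # ~640 people
    ratio = math.exp(ln_residual)        # ~1.25: ratio of actual to predicted
    actual = predicted * ratio           # ~800 people
    print(round(predicted), round(ratio, 5), round(actual))
    print(round(actual - predicted))     # residual in people: ~160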

Town | Population | Elevation (ft) | ln(population) | ln(predicted) | ln(residual)
Argentine | 100 | 11161 | 4.61 | 4.90195 | -.29678

Observed Population = 100
ln(predicted) Population = 4.90195
ln(residual) = -0.29678

Predicted population (e^predicted) = ?
Residual = observed - predicted = ?

What are the predicted population and residual values, in the original units?

Iterative Regression

If you are exploring a database for associations, one method is to use iterative regression. Iterative regression: an iterative procedure that adds or removes variables from a regression model based on their significance.

IMPORTANT: The SPSS stepwise procedure gives results that are inconsistent with the other methods. Due to this inconsistency, it is recommended that the stepwise procedure not be used. A better way to perform iterative regression is to enter all variables with the Enter procedure and then remove insignificant variables individually, OR to use the Backward or Forward procedures.

Types of Iterative Regression:
Enter - all variables are entered in a single step.
Stepwise - independent variables are entered based on the smallest F probability. Variables already in the equation are removed if their probability of F becomes too large.
Backward - all variables are entered into the equation and then sequentially removed based on the smallest partial correlation (see the sketch below).
Forward - a stepwise variable-selection procedure in which variables are sequentially entered into the model.
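
A sketch of the backward idea in Python with statsmodels and made-up data (the lecture does this in SPSS; the 0.05 cutoff and p-value-based removal rule are assumptions for illustration):

    # Backward elimination sketch: fit with all predictors, then drop the
    # least significant one until every remaining slope clears the cutoff.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
    df["y"] = 2.0 * df["x1"] - 1.5 * df["x2"] + rng.normal(size=50)

    predictors = ["x1", "x2", "x3"]
    while predictors:
        fit = sm.OLS(df["y"], sm.add_constant(df[predictors])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()        # least significant remaining predictor
        if pvals[worst] <= 0.05:
            break                     # every remaining slope is significant
        predictors.remove(worst)      # drop it and refit
    print(predictors)                 # x3 (pure noise) should be dropped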

Harrisburg Housing Value (iterative, using the Enter procedure): in the initial SPSS output, two of the variables are not significant.

With the insignificant variables removed, nothing else changes and all slopes are significant:

Predicted value ($) = -233435.212 + 19.515(Square Feet) + 143.475(Year Built) - 3848.55(Bedrooms) + 10101.928(Half Baths) + 4.545(Parcel Size) - 12.126(Distance to Front St)

Standardized Coefficients. Standardized (beta) coefficients are slope values that have been standardized so that their variances are 1. They can be used to determine which independent variables have the greatest effect on the dependent variable when the variables are measured in different units. In this case, Square Feet and Distance to Front Street have the greatest effect.

705 ½ South Front Street
Value = $133,900; Square Feet = 2380; Parcel Size = 2975; Distance to Front Street = 84; Year Built = 1900; Bedrooms = 3; Half Baths = 1

Predicted value ($) = -233435.212 + 19.515(2380) + 143.475(1900) - 3848.55(3) + 10101.928(1) + 4.545(2975) - 12.126(84)
Predicted value ($) = -233435.212 + 46445.7 + 272602.5 - 11545.65 + 10101.928 + 13521.375 - 1018.884
Predicted value ($) = 109236.3

Residual ($) = 109236.3 - 133900 = -24663.7

This is not surprising, considering that the r² was 0.591. Over 40% of the variation in housing value is not explained by this model.

Mapping Regression Residuals

Temperature Recording Sites, Kyrgyzstan Region

Average yearly temperature is influenced by:
Elevation: 6.4 °C per 1000 m of elevation change.
Latitude: 4.0 °C per 1000 km of latitude change.
To what degree can we predict temperature based on both elevation and latitude?

[Maps of the study region: one shaded by Elevation, one by Latitude]

Model: Elevation

Model Summary (Dependent Variable: Average Temperature; Predictor: (Constant), Elevation):
R = .824, R Square = .679, Adjusted R Square = .677, Std. Error of the Estimate = 3.48693

ANOVA:
Regression: Sum of Squares = 4936.797, df = 1, Mean Square = 4936.797, F = 406.031, Sig. = .000
Residual: Sum of Squares = 2334.466, df = 192, Mean Square = 12.159
Total: Sum of Squares = 7271.264, df = 193

Coefficients:
(Constant): B = 14.683, Std. Error = .390, t = 37.691, Sig. = .000
Elevation: B = -.005, Std. Error = .000, Beta = -.824, t = -20.150, Sig. = .000

Predicted Temperature = 14.683 - (0.005)(Elevation)

The standard error of the estimate is about 3.5 °C, more than half of the 6.4 °C change per 1000 m of elevation. This model is not very accurate.

The pattern that remains in the residuals points to an unknown, missing explanatory variable.

Model: Latitude

Model Summary (Dependent Variable: Average Temperature; Predictor: (Constant), Latitude):
R = .254, R Square = .065, Adjusted R Square = .060, Std. Error of the Estimate = 5.95185
This R Square is very low.

ANOVA:
Regression: Sum of Squares = 469.747, df = 1, Mean Square = 469.747, F = 13.260, Sig. = .000
Residual: Sum of Squares = 6801.516, df = 192, Mean Square = 35.425
Total: Sum of Squares = 7271.264, df = 193

Coefficients:
(Constant): B = 30.470, Std. Error = 6.002, t = 5.077, Sig. = .000
Latitude: B = -.531, Std. Error = .146, Beta = -.254, t = -3.641, Sig. = .000

The standard error of the estimate is about 6 °C, nearly the entire 6.4 °C change per 1000 m of elevation. This model is also not very accurate: by itself, latitude is not a good predictor of temperature.

This similarity in pattern suggests that elevation and latitude combined may produce a strong predictive model.

Model: Elevation + Latitude

Model Summary (Dependent Variable: Average Temperature; Predictors: (Constant), Elevation, Latitude):
R = .952, R Square = .907, Adjusted R Square = .906, Std. Error of the Estimate = 1.88058

ANOVA:
Regression: Sum of Squares = 6595.775, df = 2, Mean Square = 3297.887, F = 932.505, Sig. = .000
Residual: Sum of Squares = 675.489, df = 191, Mean Square = 3.537
Total: Sum of Squares = 7271.264, df = 193

Coefficients:
(Constant): B = 57.936, Std. Error = 2.008, t = 28.852, Sig. = .000
Elevation: B = -.005, Std. Error = .000, Beta = -.949, t = -41.620, Sig. = .000, Zero-order = -.824, Partial = -.949, Part = -.918, Tolerance = .936, VIF = 1.068
Latitude: B = -1.032, Std. Error = .048, Beta = -.494, t = -21.658, Sig. = .000, Zero-order = -.254, Partial = -.843, Part = -.478, Tolerance = .936, VIF = 1.068

The standard error of the estimate is less than 2 °C, by far the best of the three models. This model is very accurate (R Square = .907).
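
The standard error of the estimate can be recovered from the ANOVA table as the square root of (residual sum of squares / residual df); checking the combined model's 1.88058 in Python:

    # Standard error of the estimate from the ANOVA table above.
    import math

    ss_residual = 675.489
    df_residual = 191
    print(math.sqrt(ss_residual / df_residual))  # ~1.88058 deg C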

[Residual plot: one location stands out. An outlier?]

Significantly over- and under-predicted locations.

There does not appear to be any spatial pattern to the distribution of residuals; they appear to be spatially random. The numbers of large over- and under-predictions are about equal. It might be a good idea to examine the largest over- and under-predicted locations in greater detail.
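
A sketch of one way to map residuals (made-up coordinates and residuals; the lecture uses GIS maps): plot station locations and color each point by its residual with matplotlib:

    # Map residuals: scatter stations by lon/lat, colored by residual.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    lon = rng.uniform(68.0, 80.0, 40)   # hypothetical station longitudes
    lat = rng.uniform(36.0, 43.0, 40)   # hypothetical station latitudes
    resid = rng.normal(0.0, 2.0, 40)    # hypothetical model residuals

    sc = plt.scatter(lon, lat, c=resid, cmap="coolwarm")
    plt.colorbar(sc, label="residual (deg C)")
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.title("Regression residuals by location")
    plt.show()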

Over-prediction:
Name | Lon | Lat | Elev | Temp | Resid
Humrogi | 71.33 | 38.28 | 1737 | 12.17 | 3.10
Dzhergetal | 73.1 | 41.57 | 1800 | 10.43 | 5.10
Gasan-kuli | 39.22 | 52.22 | 23 | 16.06 | 12.13

Under-prediction:
Name | Lon | Lat | Elev | Temp | Resid
Kushka | 62.35 | 35.28 | 57 | 15.23 | -5.99
Susamyr | 74 | 42.2 | 2087 | -1.95 | -5.06
Aksai | 76.49 | 42.07 | 3135 | -7.27 | -4.86

An initial inspection does not show any locational influences, with the exception of Gasan-kuli, which is located far from the other sites.

[Map: station locations, highlighting Gasan-kuli, Dzhergetal, Susamyr, Aksai, Kushka, and Humrogi]

Key Points:
1. Let theory drive your selection of independent variables. Individual variable analyses (regressions) were misleading.
2. Use the tools available: both statistics and graphs.
3. Map residuals and look for patterns. Patterns may be of interest; the absence of patterns is NOT a failure.