Petra Petrovics, Renáta Géczi-Papp: Assumptions of the error term, assumptions of the independent variables (6th seminar)
Multiple linear regression model
Linear relationship between x1, x2, …, xp and y. Y depends on:
- x1, x2, …, xp: the p independent variables
- the error term (ε)
β0, β1, …, βp are the regression coefficients.

Y = β0 + β1x1 + β2x2 + … + βpxp + ε
Assumptions of the error term
- The expected value of the error term equals 0: E(ε | X1, X2, …, Xp) = 0
- Constant variance (homoscedasticity): Var(ε) = σ²
- The error term is uncorrelated across observations.
- Normally distributed error term.
Assumptions of the independent variables
- Linear independence.
- Fixed values that do not change from sample to sample.
- No measurement (scale) error.
- The independent variables are uncorrelated with the error term.
Standard linear regression model: the model obtained when the above assumptions are met. If the sample data do not meet the assumptions, more complex models and estimation procedures are required.
SPSS example data (y - turnover, x1 - property, x2 - number of employees):

Obs   y (turnover)   x1 (property)   x2 (employees)
1     35             54              98
2     27             52              120
3     42             50              95
4     47             58              145
5     53             82              184
6     45             72              106
7     61             120             240
8     58             108             175
9     65             92              165
10    77             122             202
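The slides work in SPSS; as a language-neutral companion, here is a minimal pure-Python sketch that fits the two-predictor model to the data above via the normal equations (XᵀX)β = Xᵀy. A small Gaussian elimination stands in for a linear-algebra library; this is an illustration, not the seminar's own code.

```python
# OLS for y = b0 + b1*x1 + b2*x2 on the seminar data, via (X'X) b = X'y.

y  = [35, 27, 42, 47, 53, 45, 61, 58, 65, 77]          # turnover
x1 = [54, 52, 50, 58, 82, 72, 120, 108, 92, 122]       # property
x2 = [98, 120, 95, 145, 184, 106, 240, 175, 165, 202]  # employees

X = [[1.0, a, b] for a, b in zip(x1, x2)]              # design matrix rows

def solve(A, v):
    """Solve A m = v by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    m = [0.0] * n
    for r in range(n - 1, -1, -1):
        m[r] = (M[r][n] - sum(M[r][k] * m[k] for k in range(r + 1, n))) / M[r][r]
    return m

XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
Xty = [sum(X[i][r] * y[i] for i in range(len(X))) for r in range(3)]
beta = solve(XtX, Xty)

fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
resid  = [yi - fi for yi, fi in zip(y, fitted)]
print("coefficients:", beta)
print("sum of residuals:", sum(resid))  # ~0 by construction (assumption 1)
```

Because the model contains an intercept, the least-squares residuals sum to zero, which previews assumption 1 below.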
1. E(ε) = 0
The positive and negative errors offset each other. If the mean differs from 0, the reason may be that we omitted a significant explanatory variable. It is difficult to verify in practice; with least squares estimation the residuals sum to zero by construction, so this condition is met for the fitted model.
2. Homoscedasticity (Var(ε) = σ²)
The variance of the error term is the same for all observations. Testing:
- plots of the residuals versus the independent variables (or the predicted value ŷ, or time)
- statistical tests, e.g. the Goldfeld-Quandt test (especially when the heteroscedasticity is related to one of the independent variables)
Graphical tests for homoscedasticity: plot the residuals (e) against each x_i or against ŷ. [Figure: homoscedastic residuals form a band of constant width; heteroscedastic residuals show a widening, funnel-shaped pattern.]
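The Goldfeld-Quandt test mentioned above can be sketched in a few lines of plain Python. The data here are simulated (not the seminar data), the one-predictor model and the split points are illustrative choices: order the observations by the suspect variable, drop the middle, fit each tail separately, and compare the residual sums of squares with an F ratio.

```python
import random

def rss_of_line(xs, ys):
    """Fit simple OLS y = a + b*x and return the residual sum of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

random.seed(1)
x = list(range(1, 31))                                  # suspect variable, sorted
y = [2.0 + 0.5 * xi + random.gauss(0, 0.3 * xi) for xi in x]  # error sd grows with x

# Goldfeld-Quandt: drop the middle observations, fit the two tails separately
lo_rss = rss_of_line(x[:12], y[:12])
hi_rss = rss_of_line(x[-12:], y[-12:])
F = hi_rss / lo_rss          # compare with the F critical value for the tail dfs
print("F =", F)              # F far above 1 suggests heteroscedasticity
```

Under homoscedasticity the two residual variances should be similar (F near 1); here the simulated error standard deviation grows with x, so F comes out large.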
SPSS: Analyze / Regression / Linear - Plots. Available variables: dependent variable, standardized predicted value (ZPRED), standardized residual (ZRESID), deleted residual, adjusted predicted value, studentized residual, studentized deleted residual. Plot the standardized residual (ZRESID) against the standardized predicted value (ZPRED): homoscedasticity?
Output: the variance of the residuals is approximately constant, so the homoscedasticity assumption holds.
If the residuals are heteroscedastic: take logarithms! (Transform / Compute Variable)
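In SPSS this is done with Transform / Compute Variable (e.g. LN(turnover)); the same step in Python, using the turnover values from the example (the variable name is illustrative):

```python
import math

turnover = [35, 27, 42, 47, 53, 45, 61, 58, 65, 77]

# Natural logarithm, the SPSS LN() function. It compresses large values,
# which often stabilizes a variance that grows with the level of y.
ln_turnover = [math.log(v) for v in turnover]
print(ln_turnover[0])  # ~3.555 (= ln 35)
```

After refitting the regression on the transformed variable, the residual plot should be checked again.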
3. The error term is uncorrelated across observations
In the case of cross-sectional data the observations meet the assumption of simple random sampling, so we do not have to test this hypothesis. Before making estimations from time-series data, however, we need to test the residuals for autocorrelation.
Causes of autocorrelation:
- not every important explanatory variable was included in the model (we cannot recognise the effect, there is no data, or the time series is short)
- the model specification is wrong, e.g. the relationship is not linear but we use linear regression
- non-random measurement (scaling) errors
Plots to detect autocorrelation: e_t against e_{t-1}, and e_t against time (t). [Figure: a pattern in the e_t vs. e_{t-1} plot suggests an independent variable is missing from the equation; a systematic pattern in e_t over time suggests we should use another type of function.]
The Durbin-Watson test
H0: ρ = 0 (no autocorrelation)
H1: ρ ≠ 0 (autocorrelation)

d = Σ_{t=2..n} (e_t − e_{t−1})² / Σ_{t=1..n} e_t²

Limits: 0 ≤ d ≤ 4
- positive autocorrelation: 0 ≤ d < 2
- negative autocorrelation: 2 < d ≤ 4

Decision regions along 0 … d_l … d_u … 2 … 4−d_u … 4−d_l … 4: values below d_l indicate positive autocorrelation, values above 4−d_l indicate negative autocorrelation, the middle region (between d_u and 4−d_u) indicates no problem, and the two remaining bands give no decision (a weaker problem). If there is no decision: use more variables or a larger database.
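The statistic itself is a one-liner; a minimal sketch with two hand-checkable residual series:

```python
def durbin_watson(e):
    """d = sum((e_t - e_{t-1})^2) / sum(e_t^2); 0 <= d <= 4, d ~ 2 means no autocorrelation."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(v * v for v in e)
    return num / den

print(durbin_watson([1, -1, 1, -1]))  # 3.0 -> alternating signs: negative autocorrelation
print(durbin_watson([1, 1, 1, 1]))    # 0.0 -> persistent residuals: positive autocorrelation
```

SPSS computes the same statistic when Durbin-Watson is ticked in the Linear Regression Statistics dialog.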
The Durbin-Watson test decision table:

Decision                                  Condition
Accept H0: ρ = 0                          d_u < d < 4 − d_u
Reject, ρ > 0 (positive autocorrelation)  d < d_l
Reject, ρ < 0 (negative autocorrelation)  d > 4 − d_l
No decision                               d_l < d < d_u, or 4 − d_u < d < 4 − d_l

Source: Kerékgyártó-Mundruczó [1999]
Durbin-Watson critical values (5% significance level; m = number of explanatory variables). Source: Statisztikai képletgyűjtemény

       m = 1       m = 2       m = 3       m = 4       m = 5
n      dL   dU     dL   dU     dL   dU     dL   dU     dL   dU
15    1,08 1,36   0,95 1,54   0,82 1,75   0,69 1,97   0,56 2,21
16    1,10 1,37   0,98 1,54   0,86 1,73   0,74 1,93   0,62 2,15
17    1,13 1,38   1,02 1,54   0,90 1,71   0,78 1,90   0,67 2,10
18    1,16 1,39   1,05 1,53   0,93 1,69   0,82 1,87   0,71 2,06
19    1,18 1,40   1,08 1,53   0,97 1,68   0,86 1,85   0,75 2,02
20    1,20 1,41   1,10 1,54   1,00 1,68   0,90 1,83   0,79 1,99
21    1,22 1,42   1,13 1,54   1,03 1,67   0,93 1,81   0,83 1,96
22    1,24 1,43   1,15 1,54   1,05 1,66   0,96 1,80   0,86 1,94
23    1,26 1,44   1,17 1,54   1,08 1,66   0,99 1,79   0,90 1,92
24    1,27 1,45   1,19 1,55   1,10 1,66   1,01 1,78   0,93 1,90
25    1,29 1,45   1,21 1,55   1,12 1,66   1,04 1,77   0,95 1,89
26    1,30 1,46   1,22 1,55   1,14 1,65   1,06 1,76   0,98 1,88
27    1,32 1,47   1,24 1,56   1,16 1,65   1,08 1,76   1,01 1,86
28    1,33 1,48   1,26 1,56   1,18 1,65   1,10 1,75   1,03 1,85
29    1,34 1,48   1,27 1,56   1,20 1,65   1,12 1,74   1,05 1,84
30    1,35 1,49   1,28 1,57   1,21 1,65   1,14 1,74   1,07 1,83
31    1,36 1,50   1,30 1,57   1,23 1,65   1,16 1,74   1,09 1,83
32    1,37 1,50   1,31 1,57   1,24 1,65   1,18 1,73   1,11 1,82
33    1,38 1,51   1,32 1,58   1,26 1,65   1,19 1,73   1,13 1,81
34    1,39 1,51   1,33 1,58   1,27 1,65   1,21 1,73   1,15 1,81
35    1,40 1,52   1,34 1,58   1,28 1,65   1,22 1,73   1,16 1,80
36    1,41 1,52   1,35 1,59   1,29 1,65   1,24 1,73   1,18 1,80
37    1,42 1,53   1,36 1,59   1,31 1,66   1,25 1,72   1,19 1,80
38    1,43 1,54   1,37 1,59   1,32 1,66   1,26 1,72   1,21 1,79
39    1,43 1,54   1,38 1,60   1,33 1,66   1,27 1,72   1,22 1,79
40    1,44 1,54   1,39 1,60   1,34 1,66   1,29 1,72   1,23 1,79
50    1,50 1,59   1,46 1,63   1,42 1,67   1,38 1,72   1,34 1,77
60    1,55 1,62   1,51 1,65   1,48 1,69   1,44 1,73   1,41 1,77
70    1,58 1,64   1,55 1,67   1,52 1,70   1,49 1,74   1,46 1,77
80    1,61 1,66   1,59 1,69   1,56 1,72   1,53 1,74   1,51 1,77
90    1,63 1,68   1,61 1,70   1,59 1,73   1,57 1,75   1,54 1,78
100   1,65 1,69   1,63 1,72   1,61 1,74   1,59 1,76   1,57 1,78
SPSS: Analyze / Regression / Linear - Statistics (tick Durbin-Watson)
Example: with d_l = 0.95 and d_u = 1.54, the region limits are 0 | 0.95 | 1.54 | 2 | 2.46 | 3.05 | 4. The computed statistic d = 1.381 falls into d_l < d < d_u: no decision. We need to include more variables, or to increase the number of observations!
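The decision regions can be expressed as a tiny helper (d_l and d_u must come from the table for the given n and m; the values below match the example):

```python
def dw_decision(d, d_l, d_u):
    """Map a Durbin-Watson statistic d onto the decision regions
    0 | d_l | d_u | 4-d_u | 4-d_l | 4, given the table bounds."""
    if d < d_l:
        return "positive autocorrelation"
    if d < d_u:
        return "no decision"
    if d <= 4 - d_u:
        return "no autocorrelation"
    if d <= 4 - d_l:
        return "no decision"
    return "negative autocorrelation"

print(dw_decision(1.381, 0.95, 1.54))  # no decision
```

The example statistic lands in the inconclusive band, matching the slide's conclusion.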
4. Normally distributed error term
Testing whether the errors are normally distributed:
- plots
- quantitative goodness-of-fit tests: chi-square (χ²) test, Kolmogorov-Smirnov test
Graphical testing: plot the residuals (e) against the corresponding normal scores (z), i.e. a normal probability plot. The assumption is not violated when the figure is nearly linear.
Goodness-of-fit test
H0: P(ε falls in class j) = P_j for every class j (the distribution is normal)
H1: there is a class j with P(ε falls in class j) ≠ P_j

χ² = Σ_{i=1..r} (f_i − n·p_i)² / (n·p_i)

Reject H0 if χ² > χ²_{(1−α), (r−1−b)}, where r is the number of classes and b the number of estimated parameters.
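The statistic is easy to compute directly; in this sketch the class counts and probabilities are illustrative, not taken from the seminar data:

```python
def chi_square_stat(observed, n, probs):
    """Chi-square GOF statistic: sum over classes of (f_i - n*p_i)^2 / (n*p_i)."""
    return sum((f - n * p) ** 2 / (n * p) for f, p in zip(observed, probs))

# Illustrative example: 100 residuals sorted into 4 equal-probability classes.
stat = chi_square_stat([22, 28, 30, 20], 100, [0.25, 0.25, 0.25, 0.25])
print(stat)  # 2.72
```

The result is then compared with the χ² critical value for r − 1 − b degrees of freedom.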
SPSS: Analyze / Regression / Linear - Plots (dependent variable, standardized predicted value, standardized residual, deleted residual, adjusted predicted value, studentized residual, studentized deleted residual); tick Histogram.
Output: the histogram of the standardized residuals is compared with the bell-shaped normal curve with mean 0 and standard deviation 1. Here the distribution looks approximately normal (but not definitely).
2nd solution: Analyze / Regression / Linear - Save (save the residuals as a new variable for further testing)
Nonparametric test: Analyze / Nonparametric Tests / 1-Sample K-S…
H0: the distribution is normal
H1: the distribution is not normal
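SPSS reports the K-S Z and its p-value; the underlying statistic D, the largest gap between the empirical CDF and the theoretical normal CDF, can be sketched as follows (the normal CDF here is built from the error function; this illustrates the statistic only, not the p-value):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, mu=0.0, sigma=1.0):
    """One-sample K-S statistic D = sup |F_n(x) - F(x)| against N(mu, sigma)."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mu, sigma)
        # the empirical CDF jumps from i/n to (i+1)/n at x
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d

print(ks_statistic([0.0]))  # 0.5: a single observation at the mean
```

A small D (large p-value) means the residuals are consistent with the normal distribution.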
Output: if the significance (p) is smaller than 5% (0.05), we reject the null hypothesis. Here it is higher than 0.05, so we can accept that the residuals are normally distributed.
3rd solution: Graphs / Histogram - Display normal curve
Testing for multicollinearity: regress each independent variable on the others, X_j = f(X_1, X_2, …, X_{j-1}, X_{j+1}, …, X_p). From these auxiliary regression models:
- multiple coefficient of determination (R_j²)
- F-test (F > F_crit)
- VIF indicator
VIF measure (Variance Inflation Factor)

VIF_j = 1 / (1 − R_j²)

- VIF = 1 if R_j² = 0 (the jth independent variable does not correlate with the others)
- VIF → ∞ if R_j² = 1 (the jth independent variable is an exact linear combination of the other independent variables)

Interpretation:
- 1 ≤ VIF < 2: weak multicollinearity
- 2 ≤ VIF ≤ 5: strong, disturbing multicollinearity
- VIF > 5: very strong, harmful multicollinearity
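The formula is a one-liner once R_j² is known from the auxiliary regression; a minimal sketch with the threshold values:

```python
def vif(r_squared):
    """Variance Inflation Factor: VIF_j = 1 / (1 - R_j^2), where R_j^2
    comes from regressing X_j on the remaining explanatory variables."""
    return 1.0 / (1.0 - r_squared)

print(vif(0.0))  # 1.0 -> no correlation with the other predictors
print(vif(0.5))  # 2.0 -> boundary of weak / strong multicollinearity
print(vif(0.8))  # 5.0 -> boundary of strong / harmful multicollinearity
```

SPSS prints the same VIF values when Collinearity diagnostics is ticked in the Statistics dialog.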
Correction for multicollinearity:
- Find the offending independent variables and exclude them.
- Combine the independent variables that are strongly correlated (e.g. by creating principal components); the new variables differ from the original independents, but retain the information content of the original ones.
SPSS: Analyze / Regression / Linear - Statistics (tick Collinearity diagnostics)
Thank You For Your Attention stgpren@uni-miskolc.hu