Linear Regression Models

Linear Regression Models November 13, 2018 1 / 89

1 Basic framework Model specification and assumptions Parameter estimation: least squares method Coefficient of determination R 2 Properties of the least squares estimators Hypothesis testing on regression coefficients 2 Regression diagnostics Residuals Influential observations and leverage points Multicollinearity 3 Categorical predictors Dummy variable coding 2 / 89

Basic framework 1 Basic framework Model specification and assumptions Parameter estimation: least squares method Coefficient of determination R 2 Properties of the least squares estimators Hypothesis testing on regression coefficients 2 Regression diagnostics Residuals Influential observations and leverage points Multicollinearity 3 Categorical predictors Dummy variable coding 3 / 89

Basic framework When to use Regression Analysis Regression analysis is used for explaining or modeling the relationship between a single variable Y, called the response, output or dependent variable, and one or more predictor (input, independent or explanatory) variables, X_1, ..., X_p. When p = 1, it is called simple regression; when p > 1 it is called multiple regression or, sometimes, multivariate regression; when there is more than one Y, it is called multivariate multiple regression. The response must be a continuous variable, but the explanatory variables can be continuous, discrete or categorical. 4 / 89

Basic framework Linear Model One very general form for the model would be: Y = f(X_1, X_2, X_3) + ɛ, where f is some unknown function and ɛ is the error, which is additive in this instance. Since we usually don't have enough data to estimate f directly, we usually have to assume that it has some more restricted form, perhaps linear, as in Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3 + ɛ, where β_j, j = 0, 1, 2, 3 are unknown parameters. β_0 is called the intercept term. Thus the problem is reduced to the estimation of four values rather than the complicated infinite-dimensional f. In a linear model the parameters enter linearly; the predictors themselves do not have to be linear. 5 / 89

Notation Basic framework Model specification and assumptions The following notation is typical of that used to represent the elements of a linear model:

y_i = β_0 + β_1 x_{i1} + ... + β_{k-1} x_{i,k-1} + ɛ_i   (1)

The subscript i indexes the observations (usually from 1 to n, where n is the total sample size). y_i and x_{ij} represent respectively the i-th observation of the Y and X_j variables (j = 1, ..., k-1). ɛ_i represents the deviation of the i-th observed Y from the value of Y expected by the model component. The parameters β_0 and β_j represent the population intercept and the population slopes (effect of X_j on Y per unit change of X_j). 6 / 89

Basic framework Matrix notation Model specification and assumptions The regression equation in matrix notation is written as:

y = Xβ + ɛ   (2)

where y is the n×1 vector (y_1, y_2, ..., y_n)', X is the n×k design matrix whose i-th row is (1, x_{i1}, ..., x_{i,k-1}), β is the k×1 vector (β_0, β_1, ..., β_{k-1})', and ɛ is the n×1 vector (ɛ_1, ɛ_2, ..., ɛ_n)'. The column of ones incorporates the intercept term. 7 / 89

Basic framework Geometrical representation Model specification and assumptions The model: Album Sales_i = β_0 + β_1 Advertising budget_i + β_2 Airplay_i + ɛ_i 8 / 89

Basic framework Model assumptions Model specification and assumptions 1 Linearity of the relation between y and X: y = Xβ + ɛ. 2 The random terms have zero mean: E[ɛ] = (E[ɛ_1], ..., E[ɛ_n])' = 0, so that E[y] = E[Xβ + ɛ] = E[Xβ] + E[ɛ] = Xβ. The average values of the random variables that generate the observed values of the dependent variable y lie along the regression hyperplane. 9 / 89

Basic framework Model assumptions Model specification and assumptions 3 Homoskedasticity and non-correlation of the random terms: var[ɛ] = E[ɛɛ'] = σ²I_n. Indeed, var[ɛ] = E[ɛɛ'] − E[ɛ]E[ɛ'] = E[ɛɛ'], the matrix with generic entry E[ɛ_iɛ_j], which equals σ² on the diagonal and 0 off the diagonal. 10 / 89

Basic framework Model assumptions Model specification and assumptions 4 No information from X can help to understand the nature of ɛ: E[ɛ|X] = 0. 5 X is a (non-stochastic) n×k matrix of rank k. The independent variables X_j are assumed to be linearly independent, i.e. none of them can be expressed as a linear combination of the others. If linear independence of the predictors is not assured, the model is not identifiable, i.e. we cannot obtain a unique solution for the model parameter estimates. 11 / 89

Basic framework Model assumptions Model specification and assumptions 6 Errors are independent and identically normally distributed with mean 0 and variance σ²: ɛ ∼ N(0, σ²I_n). It is necessary to assume a distributional form for the errors ɛ to construct confidence intervals or perform hypothesis tests. Now, since y = Xβ + ɛ, and given assumptions 5 and 6, it follows that var(y) = var(Xβ + ɛ) = var(ɛ) = σ²I_n and y ∼ N(Xβ, σ²I_n). 12 / 89

Basic framework Model parameters estimation Parameter estimation: least squares method y_i = β_0 + β_1 x_{i1} + ... + β_{k-1} x_{i,k-1} + ɛ_i, with ɛ_i ∼ N(0, σ²). The parameters to be estimated are: the regression coefficients β_1, ..., β_{k-1}; the intercept β_0; the error variance σ². Each β_j should be interpreted as the expected change in y when a unit change is observed in X_j while all other predictors are kept constant. The intercept is the expected mean value of y when all X_j = 0. 13 / 89

Basic framework Least squares estimation Parameter estimation: least squares method The least squares estimate of β minimizes the sum of the squared errors:

∑_{i=1}^n ɛ̃_i² = ∑_{i=1}^n (y_i − x_i'β̃)²

where β̃ is an arbitrary vector of coefficients (real numbers). We look for the β̃ minimizing

f(β̃) = ɛ̃'ɛ̃ = (y − Xβ̃)'(y − Xβ̃)

and by expanding the product we obtain:

f(β̃) = ɛ̃'ɛ̃ = y'y − β̃'X'y − y'Xβ̃ + β̃'X'Xβ̃ = y'y − 2β̃'X'y + β̃'X'Xβ̃ 14 / 89

Basic framework Least squares estimation Parameter estimation: least squares method Differentiating with respect to β̃ and setting the derivative to zero:

∂f(β̃)/∂β̃ = −2X'y + 2X'Xβ̃ = 0

The solution of this equation is a vector β̂ such that X'Xβ̂ = X'y. These are called the normal equations. Now, provided X'X (a square k×k matrix) is invertible,

β̂ = (X'X)^{-1} X'y 15 / 89
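As a worked sketch of the normal equations (not on the original slides), the NumPy snippet below solves X'Xβ̂ = X'y directly; the helper name ols_fit and the simulated data are illustrative.

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimate via the normal equations X'X beta = X'y.

    X: n x k design matrix (first column of ones for the intercept).
    y: n-vector of responses.
    """
    # Solving the linear system is numerically preferable to forming (X'X)^{-1}.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Illustrative data: simulate a model with a known coefficient vector.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)
print(ols_fit(X, y))  # should be close to beta_true
```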

Basic framework Parameter estimation: least squares method Estimated regression equation and prediction The least squares method leads to estimating the coefficients as β̂ = (X'X)^{-1}X'y. Using the parameter estimates it is possible to write the estimated regression equation ŷ_i = β̂_0 + β̂_1 x_{i1} + ... + β̂_{k-1} x_{i,k-1} and to obtain a prediction for a new observation x_0: ŷ_0 = β̂_0 + β̂_1 x_{01} + ... + β̂_{k-1} x_{0,k-1}, where x_0 is a vector that has not been used to obtain the parameter estimates. 16 / 89

Basic framework Parameter estimation: least squares method Algebraic aspects related to the least squares solution The results obtained with least squares allow us to separate the vector of observations y into two parts, the fitted values ŷ = Xβ̂ and the residuals e: y = Xβ̂ + e = ŷ + e. Since β̂ = (X'X)^{-1}X'y, it follows that ŷ = Xβ̂ = X(X'X)^{-1}X'y = Hy. The matrix H is symmetric (H = H'), idempotent (HH = H² = H) and is called the prediction (hat) matrix, since it provides the vector of the fitted values in the regression of y on X. 17 / 89

Basic framework Geometrical perspective Parameter estimation: least squares method The data vector y is projected orthogonally onto the model space spanned by X. The fit is represented by the projection ŷ = Xβ̂, with the difference between the fit and the data represented by the residual vector e. β̂ is in some sense the best estimate of β within the model space. The response predicted by the model is ŷ = Xβ̂ = X(X'X)^{-1}X'y = Hy, where H is an orthogonal projection matrix. 18 / 89

Basic framework Estimating σ² Parameter estimation: least squares method After some calculation, one can show that E(∑_{i=1}^n e_i²) = E(e'e) = σ²(n − k), which shows that

s² = e'e/(n − k) = ∑_{i=1}^n e_i²/(n − k)

is an unbiased estimator of σ². (n − k) is the degrees of freedom of the model. The square root of s² is called the standard error of the regression and can be interpreted as the typical deviation of the observed values y of the dependent variable around the regression hyperplane. 19 / 89
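A minimal continuation of the previous sketch (again not on the slides; the helper name is illustrative): fitted values, residuals and the unbiased variance estimate s² = e'e/(n − k).

```python
import numpy as np

def residual_variance(X, y, beta_hat):
    """Fitted values, residuals and s^2 = e'e / (n - k)."""
    n, k = X.shape
    y_hat = X @ beta_hat      # fitted values
    e = y - y_hat             # residuals
    s2 = (e @ e) / (n - k)    # unbiased estimate of the error variance
    return y_hat, e, s2
```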

Basic framework Example: investment data Parameter estimation: least squares method

Year  Investment (y)  GDP (X_1)  Trend (X_2)
1982  209.952         1060.859    1
1983  207.825         1073.783    2
1984  214.923         1101.366    3
1985  215.985         1132.313    4
1986  220.371         1164.465    5
1987  230.058         1200.523    6
1988  245.872         1246.966    7
1989  256.720         1282.905    8
1990  266.044         1310.659    9
1991  268.273         1325.582   10
1992  263.361         1333.072   11
1993  229.628         1317.668   12
1994  230.785         1346.267   13
1995  246.659         1385.830   14
1996  249.619         1395.408   15

Source: ISTAT (values are in thousands of billions) 20 / 89

Basic framework Example: investment data Parameter estimation: least squares method The model: INVEST = β_0 + β_1 GDP + β_2 TREND + ɛ. The parameters: β_0 is the value of the investments when the explanatory variables are equal to zero; β_1 is the increase in investments when there is a unit GDP growth, keeping the value of the trend fixed; β_2 is the change in investments when the trend increases by one unit (i.e., it passes from one year to the next), keeping the value of GDP fixed. 21 / 89

Basic framework Example: investment data Parameter estimation: least squares method The quantities β_0, β_1, and β_2 are the parameters to be estimated through the relation (X'X)β̂ = X'y. By solving the system of normal equations we get:

Variables        β̂_j
Intercept (X_0)  -441.272
GDP (X_1)           0.625
TREND (X_2)       -12.522

If GDP increases by 1 billion, investments will increase by 625 million, keeping the trend constant. If the trend increases by one unit (i.e., from one year to the next), investments will decrease by 12,522 billion if GDP is kept constant. 22 / 89
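The estimates on this slide can be checked numerically; the sketch below (not on the slides, with the data typed in from the table on slide 20) solves the normal equations for the investment model and should roughly reproduce the reported coefficients.

```python
import numpy as np

# Investment data transcribed from the slide (ISTAT, thousands of billions).
invest = np.array([209.952, 207.825, 214.923, 215.985, 220.371, 230.058,
                   245.872, 256.720, 266.044, 268.273, 263.361, 229.628,
                   230.785, 246.659, 249.619])
gdp = np.array([1060.859, 1073.783, 1101.366, 1132.313, 1164.465, 1200.523,
                1246.966, 1282.905, 1310.659, 1325.582, 1333.072, 1317.668,
                1346.267, 1385.830, 1395.408])
trend = np.arange(1, 16)

X = np.column_stack([np.ones(len(invest)), gdp, trend])
beta_hat = np.linalg.solve(X.T @ X, X.T @ invest)
print(beta_hat)  # roughly [-441.272, 0.625, -12.522], as reported on the slide
```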

Basic framework Coefficient of determination R² Goodness of Fit In a linear regression model (with an intercept), the least squares condition implies that the sum of squares of the dependent variable can be expressed as the sum of two components:

∑_{i=1}^n (y_i − ȳ)² = ∑_{i=1}^n (ŷ_i − ȳ)² + ∑_{i=1}^n e_i²

TSS = ESS + RSS

Coefficient of determination, or percentage of variance explained:

R² = 1 − RSS/TSS = ESS/TSS 23 / 89

Basic framework Coefficient of determination R² Coefficient of determination R² = 1 − RSS/TSS = ESS/TSS. The range is 0 ≤ R² ≤ 1, with values closer to 1 indicating better fits: R² = 1 indicates a perfect fit (y_i = ŷ_i for every i); R² = 0 indicates that the fitted values are identical to the average value of the dependent variable (ŷ_i = ȳ for every i). It expresses the percentage of the variance of y explained by the model. It can be expressed as the square of the correlation between observed and fitted values:

R² = r²_{y,ŷ} = [cov(y, ŷ)]² / [var(y) var(ŷ)] 24 / 89

Basic framework Coefficient of determination R² The Adjusted R² The coefficient R² is a nondecreasing function of the number of explanatory variables. In comparing two regression models with the same dependent variable but a differing number of X variables, the number of X variables present in each model must be taken into account. It is good practice to use the adjusted R² rather than R², because the latter tends to give an overly optimistic picture of the fit of the regression, particularly when the number of explanatory variables is not very small compared with the number of observations.

R²_a = 1 − [(n − 1) / (n − (k − 1) − 1)] (1 − R²) 25 / 89
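A small sketch of R² and adjusted R² following the decomposition and the formula above (not on the slides; the helper name is illustrative):

```python
import numpy as np

def r_squared(y, y_hat, k):
    """R^2 and adjusted R^2 for a model with k coefficients (intercept included)."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)   # total sum of squares
    r2 = 1.0 - rss / tss
    r2_adj = 1.0 - (n - 1) / (n - k) * (1.0 - r2)
    return r2, r2_adj
```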

Basic framework Coefficient of determination R 2 Example: investment data INVEST = 441.272 + 0.625 GDP 12.522 TREND R 2 = 0.947 R 2 a = 0.938 The regression model explains 94% of the variance of the dependent variable, i.e. 94% of the variability of the investments is attributable to the linear relationship with GDP and TREND. 26 / 89

Basic framework Mean and Variance of β̂ Properties of the least squares estimators β̂ = (X'X)^{-1}X'y, E(β̂) = β (unbiased), var(β̂) = σ²(X'X)^{-1}. Note that since β̂ is a vector, σ²(X'X)^{-1} is a variance-covariance matrix. Sometimes you want the standard error for a particular component, which can be picked out as se(β̂_j) = √[(X'X)^{-1}_{jj}] σ. 27 / 89

Basic framework Gauss-Markov theorem Properties of the least squares estimators In the linear regression model, the beta estimator obtained with the least squares method is the most efficient estimator (minimum variance) in the class of linear and unbiased estimators. The Gauss-Markov theorem shows that the least squares estimate ˆβ is a good choice, but if the errors are correlated or have unequal variance, there will be better estimators: When the errors are correlated or have unequal variance, generalized least squares should be used. When the error distribution is long-tailed, then robust estimates might be used. Robust estimates are typically not linear in y. When the predictors are highly correlated (collinear), then biased estimators such as ridge regression might be preferable. 28 / 89

Basic framework Distribution of β̂ Properties of the least squares estimators Assuming that the errors are independent and identically normally distributed with mean 0 and variance σ², i.e. ɛ ∼ N(0, σ²I), and since y = Xβ + ɛ, we have y ∼ N(Xβ, σ²I). From this we find (using the fact that linear combinations of normally distributed values are also normal) that β̂ ∼ N(β, σ²(X'X)^{-1}) and β̂_j ∼ N(β_j, σ²S^{jj}), where the symbol S^{jj} indicates the element at the intersection of the j-th row and the j-th column of the matrix (X'X)^{-1}. 29 / 89

Basic framework Testing just one predictor Hypothesis testing on regression coefficients Let us consider the null hypothesis (H_0) against the two-sided alternative hypothesis (H_1): H_0: β_j = 0 vs H_1: β_j ≠ 0. The corresponding test statistic is

t_j = (β̂_j − β_j)/s_{β̂_j} ∼ T(n − k)

Decision rule: classical approach or p-value approach. 30 / 89

Basic framework Decision rule: classical approach Hypothesis testing on regression coefficients We reject H_0: β_j = 0 with error probability α if |t_j| > t_{1−α/2; (n−k)}. 31 / 89

Basic framework Decision rule: p-value approach Hypothesis testing on regression coefficients We reject H_0: β_j = 0 with error probability α if p-value < α, where p-value = Pr(|t_{n−k}| > |t_j|). 32 / 89
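Both decision rules can be applied numerically; a sketch using SciPy (not on the slides; the helper t_tests is illustrative) computes t statistics and two-sided p-values for every coefficient.

```python
import numpy as np
from scipy import stats

def t_tests(X, y, beta_hat, s2):
    """t statistics and two-sided p-values for H0: beta_j = 0."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    se = np.sqrt(s2 * np.diag(XtX_inv))      # standard errors of beta_hat
    t = beta_hat / se
    p = 2 * stats.t.sf(np.abs(t), df=n - k)  # two-sided p-values
    return t, p
```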

Basic framework Example: investment data Hypothesis testing on regression coefficients Let us consider the null hypothesis (H_0) against the two-sided alternative hypothesis (H_1): H_0: β_1 = 0 vs H_1: β_1 ≠ 0. The corresponding test statistic is

t_1 = (β̂_1 − β_1)/s_{β̂_1} = 0.625/0.058 = 10.76, to be compared with t_{1−α/2; (n−k)} = 2.1788. 33 / 89

Basic framework Example: investment data Hypothesis testing on regression coefficients Since |t_1| = 10.76 > t_{1−α/2; (n−k)} = 2.1788, we reject H_0: β_1 = 0 with error probability α = 0.05. It makes sense to use the GDP variable to describe the behavior of investments. 34 / 89

Basic framework Overall F test Hypothesis testing on regression coefficients We want to test the null hypothesis H_0: there is no linear relation between Y and the Xs, versus the alternative hypothesis H_1: there is a linear relation between at least one of the predictors and the response. For a multiple regression model the overall F-test is H_0: y = β_0 1_n + ɛ vs H_1: y = Xβ + ɛ. We test the null model M_0 against our model M_1. 35 / 89

Basic framework Overall F test Hypothesis testing on regression coefficients Hypotheses: H_0: y = β_0 1_n + ɛ vs H_1: y = Xβ + ɛ. Test statistic:

F = [ESS/(k − 1)] / [RSS/(n − k)] ∼ F(k − 1, n − k)

Decision rule: H_0 is rejected at significance level α if F > F_{(α; k−1, n−k)}, equivalently if Pr(F_{k−1, n−k} > F) < α. 36 / 89
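A sketch of the overall F test under the same conventions (ESS, RSS, k coefficients including the intercept); not on the slides, helper name illustrative.

```python
import numpy as np
from scipy import stats

def overall_f_test(y, y_hat, k):
    """Overall F statistic and p-value for H0: all slope coefficients are zero."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    ess = np.sum((y_hat - np.mean(y)) ** 2)
    F = (ess / (k - 1)) / (rss / (n - k))
    p_value = stats.f.sf(F, k - 1, n - k)  # Pr(F_{k-1, n-k} > F)
    return F, p_value
```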

Basic framework ANOVA table as a regression output Hypothesis testing on regression coefficients

Source       DF    Sum of Squares   Mean Squares   F
Regression   k-1   ESS              ESS/(k-1)      [ESS/(k-1)] / [RSS/(n-k)]
Error        n-k   RSS              RSS/(n-k)
Total        n-1   TSS

Traditionally, the information in the overall F test is presented in an analysis of variance table. Most computer packages produce a variant of this. As the originator of the table, Fisher, said in 1931, it is nothing but a convenient way of arranging the arithmetic. 37 / 89

Basic framework Example: investment data Hypothesis testing on regression coefficients

Source       DF   Sum of Squares   Mean Squares   F         p-value
Regression    2   5841.1069        2920.535       107.861   2.14E-08
Error        12    324.923           27.077
Total        14   6165.993

We conclude that at least one of the regression coefficients is significantly different from zero, i.e. that the regression model as a whole is statistically significant. 38 / 89

Basic framework Confidence Intervals for each β_j individually Hypothesis testing on regression coefficients Confidence intervals provide an alternative way of expressing the uncertainty in our estimates. Even so, they are closely linked to the tests that we have already constructed. Confidence interval for the β_j coefficient:

Pr(β̂_j − t_{α/2} s_{β̂_j} ≤ β_j ≤ β̂_j + t_{α/2} s_{β̂_j}) = 1 − α

If the above interval does not include zero, the β_j coefficient can be considered statistically significant at level α. 39 / 89
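A sketch of the corresponding (1 − α) intervals, taking the standard errors computed earlier (not on the slides; helper name illustrative):

```python
import numpy as np
from scipy import stats

def confidence_intervals(beta_hat, se, n, k, alpha=0.05):
    """(1 - alpha) confidence intervals for each coefficient."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k)
    return np.column_stack([beta_hat - t_crit * se,
                            beta_hat + t_crit * se])
```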

Basic framework Example: investment data Hypothesis testing on regression coefficients Confidence interval for the β_1 coefficient:

Pr(β̂_1 − t_{α/2} s_{β̂_1} ≤ β_1 ≤ β̂_1 + t_{α/2} s_{β̂_1}) = 1 − α

Pr(0.499 ≤ β_1 ≤ 0.752) = 0.95

With 95% confidence, a unit increase in GDP (X_1), keeping the value of the trend constant, increases investments by an amount ranging from 499 to 752 billion. 40 / 89

Regression diagnostics 1 Basic framework Model specification and assumptions Parameter estimation: least squares method Coefficient of determination R 2 Properties of the least squares estimators Hypothesis testing on regression coefficients 2 Regression diagnostics Residuals Influential observations and leverage points Multicollinearity 3 Categorical predictors Dummy variable coding 41 / 89

Regression diagnostics Diagnostics Regression diagnostics are used to detect problems with the model and suggest improvements. Residual analysis serves: to determine whether the hypotheses formulated on the error term of the regression model are valid with respect to the analyzed phenomenon; to identify the presence of outliers (anomalous observations with respect to the dependent variable Y), leverage points (anomalous observations with respect to the Xs), and influential observations (observations whose inclusion greatly modifies the least squares estimates). 42 / 89

Residuals Regression diagnostics Residuals e_i = y_i − ŷ_i: the residuals are the basis for measuring the variability of y not explained by the regression model. e_i = ɛ̂_i: any departure from the assumed hypotheses affects the values of the residuals. 43 / 89

Residuals Regression diagnostics Residuals Recall that ŷ = X(X'X)^{-1}X'y = Hy, where H is the hat matrix:

e = y − ŷ = (I − H)y = (I − H)Xβ + (I − H)ɛ = (I − H)ɛ

So var(e) = var[(I − H)ɛ] = (I − H)σ², assuming that var(ɛ) = σ²I. We see that although the errors may have equal variance and be uncorrelated, the residuals do not. 44 / 89

Regression diagnostics Studentized residuals Residuals

r_i = e_i / (s √(1 − h_{ii})),   i = 1, ..., n

If the model assumptions are correct, var(r_i) = 1 and corr(r_i, r_j) tends to be small. Studentized residuals are sometimes preferred in residual plots, as they have been standardized to have equal variance: only when there is unusually large leverage will the differences be noticeable. Studentization can only correct for the natural non-constant variance of the residuals when the errors have constant variance: if there is some underlying heteroscedasticity in the errors, studentization cannot correct for it. Any anomalous observation will inevitably affect s and thus also the studentized residuals. 45 / 89

Regression diagnostics Jackknife residuals Residuals Let s_{(i)} denote the estimate of σ obtained after eliminating the i-th observation:

r*_i = e_i / (s_{(i)} √(1 − h_{ii})),   i = 1, ..., n

Fortunately there is an easy way to compute r*_i which avoids running n regressions:

r*_i = r_i √[(n − k − 1) / (n − k − r_i²)],   i = 1, ..., n

Since r*_i ∼ t_{n−k−1}, jackknife residuals above 2.5 in absolute value represent potential outliers. 46 / 89
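The leverages, studentized residuals and jackknife residuals can be computed directly from the formulas above; a minimal sketch (not on the slides; the helper is illustrative and forms the full hat matrix, so it is meant for small n):

```python
import numpy as np

def residual_diagnostics(X, y, beta_hat, s2):
    """Leverages h_ii, studentized residuals r_i and jackknife residuals r*_i."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T                  # hat matrix (small n only)
    h = np.diag(H)                                        # leverages
    e = y - X @ beta_hat                                  # ordinary residuals
    r = e / (np.sqrt(s2) * np.sqrt(1.0 - h))              # studentized residuals
    r_jack = r * np.sqrt((n - k - 1) / (n - k - r ** 2))  # jackknife residuals
    return h, r, r_jack
```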

Regression diagnostics Notes on outliers Residuals 1 Two or more outliers next to each other can hide each other. 2 An outlier in one model may not be an outlier in another when the variables have been changed or transformed. 3 The error distribution may not be normal and so larger residuals may be expected. 4 For large datasets, we need only worry about clusters of outliers. Such clusters are less likely to occur by chance and more likely to represent actual structure. Finding these clusters is not always easy. 47 / 89

Regression diagnostics What should be done about outliers? Residuals 1 Check for a data entry error first. These are relatively common. 2 Examine the physical context - why did it happen? Sometimes, the discovery of an outlier may be of singular interest. 3 Exclude the point from the analysis: to avoid any suggestion of dishonesty, always report the existence of outliers even if you do not include them in your final model. 48 / 89

Regression diagnostics Leverage point Influential observations and leverage points h_i = H_{ii} are called leverages and are useful diagnostics that depend only on X. We see that var(e_i) = σ²(1 − h_i), so that a large leverage h_i will make var(e_i) small; in other words, the fit will be forced to be close to y_i. Rule of thumb: since the average value of h_i is k/n, leverages of more than 2k/n should be looked at more closely. 49 / 89

Regression diagnostics Leverage points and outliers Influential observations and leverage points The two additional points, marked by a triangle and a circle, both have high leverage because they are far from the rest of the data: the triangle is not an outlier; the circle does not have a large residual if it is included in the fit. The solid line is the fit including the triangle point but not the circle point. The dotted line is the fit without either additional point, and the dashed line is the fit with the circle point but not the triangle point. 50 / 89

Regression diagnostics Influential observations Influential observations and leverage points An influential point is one whose removal from the dataset would cause a large change in the fit. An influential point may or may not be an outlier and may or may not have large leverage but it will tend to have at least one of those two properties. The triangle point is not an influential point but the circle point is. 51 / 89

Regression diagnostics Measures of influence Influential observations and leverage points Cook's distance:

D_i = (β̂_{(i)} − β̂)' X'X (β̂_{(i)} − β̂) / (k s²) = (r_i² / k) · h_{ii} / (1 − h_{ii})

The subscript (i) indicates the fit without case i. The combination of the residual and leverage effects leads to influence. High D_i values indicate observations that are influential on the vector of β parameters. 52 / 89
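Using the second expression above, Cook's distance follows directly from the studentized residuals and leverages returned by the previous sketch (again an illustrative helper, not from the slides):

```python
def cooks_distance(r, h, k):
    """Cook's distance from studentized residuals r, leverages h, k coefficients."""
    # r and h are NumPy arrays, e.g. as produced by residual_diagnostics above.
    return (r ** 2 / k) * (h / (1.0 - h))
```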

Regression diagnostics Residual plots Influential observations and leverage points 1 Residuals vs fitted values: things to look for are heteroscedasticity (non-constant variance) and nonlinearity; if all is well, you should see constant variance in the vertical (e) direction and the scatter should be symmetric vertically about 0. 2 Residuals vs each predictor: look for the same things, except that in plots against predictors not in the model, look for any relationship which might indicate that this predictor should be included. 3 Q-Q plot, histogram, boxplot to check the normality assumption. 53 / 89

Regression diagnostics Residual plots Influential observations and leverage points 54 / 89

Regression diagnostics Residual plots: normality Influential observations and leverage points 55 / 89

Regression diagnostics Residual plots: normality Influential observations and leverage points The consequences of non-normality are: the least squares estimates may not be optimal - they will still be BLUE, but other robust estimators may be more effective; the tests and confidence intervals are invalid. However, it has been shown that only really long-tailed distributions cause a problem. Mild non-normality can safely be ignored, and the larger the sample size the less troublesome the non-normality. What to do? 1 A transformation of the response may solve the problem. 2 Accept non-normality and base the inference on the assumption of another distribution, or use resampling methods. 3 For short-tailed distributions, the consequences of non-normality are not serious and can reasonably be ignored. 56 / 89

Regression diagnostics Residual plots: correlated errors Influential observations and leverage points We assume that the errors are uncorrelated, but for temporally or spatially related data this may well be untrue. For such data, it is wise to check the assumption of uncorrelated errors: plot e against time; use formal tests like the Durbin-Watson test or the runs test. What to do? If you do have correlated errors, you can use Generalized Least Squares. 57 / 89

Regression diagnostics Residual plots: correlated errors Influential observations and leverage points 58 / 89

Regression diagnostics Example: investment data Influential observations and leverage points The residuals tend to disperse randomly above and below their average. 59 / 89

Regression diagnostics Example: investment data Influential observations and leverage points There are no observations with high residuals, but a non-random pattern appears over time. 60 / 89

Regression diagnostics Example: investment data Influential observations and leverage points There are no residuals that exceed 2.5 in absolute value, so no outliers are highlighted. 61 / 89

Regression diagnostics Example: investment data Influential observations and leverage points There are no observations that exceed the cut-off 2k/n = 0.4. 62 / 89

Regression diagnostics Example: investment data Influential observations and leverage points The most influential observation for the β̂ estimate is the one associated with 1996. 63 / 89

Regression diagnostics The multicollinearity Multicollinearity For a regression involving the b explanatory variables x_1, ..., x_b (where x_1 = 1 for all observations, to allow for the intercept term), an exact linear relationship is said to exist if the following condition is satisfied:

a_1 x_1 + a_2 x_2 + ... + a_b x_b = 0   (3)

where a_1, ..., a_b are constants that are not all zero simultaneously. In the case of quasi-multicollinearity, the x variables are intercorrelated but not perfectly:

a_1 x_1 + a_2 x_2 + ... + a_b x_b + ν_i = 0   (4)

where ν_i is a stochastic error term. 64 / 89

Regression diagnostics The multicollinearity problem Multicollinearity If multicollinearity is perfect in the sense of (3), the inverse of the matrix X'X does not exist and the regression coefficients cannot be estimated. If multicollinearity is less than perfect, as in (4), the regression coefficients, although determinate, possess large standard errors, so they cannot be estimated with great precision. 65 / 89

Regression diagnostics Consequences of multicollinearity Multicollinearity Although BLUE, the OLS estimators possess large standard errors, which means the coefficients cannot be estimated with great precision. Because of the large standard errors, the confidence intervals tend to be wider. Because of the large standard errors, the t ratios of one or more regression coefficients tend to be statistically insignificant. Although the t ratios of one or more regression coefficients are statistically insignificant, the R² overall measure of goodness of fit can be very high. The OLS estimators and their standard errors can be sensitive to small changes in the data. 66 / 89

Regression diagnostics Detection of multicollinearity: the VIF index Multicollinearity One may regress each of the x_j variables on the remaining x variables and find the corresponding coefficients of determination R_j². A high R_j² would suggest that x_j is highly correlated with the rest of the x's; thus one may drop x_j from the model. The coefficients of determination R_j² are also used to build the Variance Inflation Factors (VIF_j), which measure the inflation in the variances of the regression coefficients relative to the ideal case of orthogonal regressors:

VIF_j = 1 / (1 − R_j²)   (5)

The larger the value of VIF_j, the more collinear the variable x_j. As a rule of thumb, if VIF_j > 10, that variable is said to be highly collinear. 67 / 89
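A sketch of the auxiliary-regression computation of the VIFs (not on the slides; illustrative helper; X here contains the predictors only, without the column of ones):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing each predictor on the remaining ones."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        xj = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(Z, xj, rcond=None)[0]   # auxiliary regression
        resid = xj - Z @ beta
        r2_j = 1.0 - (resid @ resid) / np.sum((xj - xj.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2_j)
    return out
```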

Regression diagnostics Detection of multicollinearity: the CN index Multicollinearity The condition number is defined as:

CN = λ_max / λ_min   (6)

where λ_max and λ_min are the maximum and minimum eigenvalues of (Z'Z), i.e. of the correlation matrix of the independent variables. The condition index is defined as:

CI = √(λ_max / λ_min) = √CN   (7)

Rule of thumb: if CN is between 100 and 1000 there is moderate to strong multicollinearity, and if it exceeds 1000 there is severe multicollinearity. If the CI is between 10 and 30 there is moderate to strong multicollinearity, and if it exceeds 30 there is severe multicollinearity. 68 / 89
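A sketch of CN and CI following the definition on this slide (eigenvalues of the correlation matrix of the predictors; not on the slides, helper name illustrative):

```python
import numpy as np

def condition_number(X):
    """Condition number CN = lambda_max / lambda_min and condition index CI = sqrt(CN)."""
    Z = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)  # correlation matrix
    eig = np.linalg.eigvalsh(Z)                                # its eigenvalues
    cn = eig.max() / eig.min()
    return cn, np.sqrt(cn)
```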

Regression diagnostics Multicollinearity Example: Longley's Economic Regression Data A macroeconomic data set which provides a well-known example of a highly collinear regression: 7 economic variables, observed yearly from 1947 to 1962 (n = 16).

Label          Variable
GNP.deflator   GNP implicit price deflator (1954 = 100)
GNP            Gross National Product
Unemployed     number of unemployed
Armed.Forces   number of people in the armed forces
Population     noninstitutionalized population >= 14 years of age
Year           the year (time)
Employed       number of people employed

69 / 89

Regression diagnostics Multicollinearity Example: Longley's Economic Regression Data The model: Employed = β_0 + β_1 GNP.deflator + β_2 GNP + β_3 Unemployed + β_4 Armed.Forces + β_5 Population + β_6 Year + ɛ 70 / 89

Regression diagnostics Multicollinearity Example: Longley's Economic Regression Data Eigenvalues of (X'X): 6.665299e+07; 2.090730e+05; 1.053550e+05; 1.803976e+04; 2.455730e+01; 2.015117e+00 72 / 89

Regression diagnostics Multicollinearity Example: Longley's Economic Regression Data Condition numbers: 1.00000; 17.85504; 25.15256; 60.78472; 1647.47771; 5751.21560. (The other condition numbers are also worth considering because they indicate whether more than just one independent linear combination is to blame.) 73 / 89

Regression diagnostics Multicollinearity Example: Longley's Economic Regression Data VIF: 135.53244; 1788.51348; 33.61889; 3.58893; 399.15102; 758.98060. (The VIF for orthogonal predictors is 1.) 74 / 89

Categorical predictors 1 Basic framework Model specification and assumptions Parameter estimation: least squares method Coefficient of determination R 2 Properties of the least squares estimators Hypothesis testing on regression coefficients 2 Regression diagnostics Residuals Influential observations and leverage points Multicollinearity 3 Categorical predictors Dummy variable coding 75 / 89

Categorical predictors Including categorical predictors Predictors that are qualitative in nature, like for example Gender, are sometimes called categorical variables or factors. The strategy is to incorporate the qualitative predictors within the y = Xβ + ɛ framework. We can then use the estimation, inferential and diagnostic techniques that we have already learnt. Dummy coding Dummy coding is a way of representing groups of people using only zeros and ones. To do it, we have to create several variables; in fact, the number of variables we need is one less than the number of groups we're recoding. 76 / 89

Categorical predictors Dummy variable coding Dummy variable coding 1 Count the number of groups you want to recode and subtract 1. 2 Create as many new variables as the value you calculated in step 1. These are your dummy variables. 3 Choose one of your groups as a baseline (i.e., a group against which all other groups should be compared). This should usually be a control group, or, the group that represents the majority of people. 4 Having chosen a baseline group, assign that group values of 0 for all of your dummy variables. 5 For your first dummy variable, assign the value 1 to the first group that you want to compare against the baseline group. Assign all other groups 0 for this variable. 6 For the second dummy variable assign the value 1 to the second group that you want to compare against the baseline group. Assign all other groups 0 for this variable. 7 Repeat this until you run out of dummy variables. 8 Place all of your dummy variables into the regression analysis! 77 / 89
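A minimal sketch of steps 1-7 for a generic factor (not on the slides; the helper name and the toy factor are illustrative):

```python
import numpy as np

def dummy_code(groups, baseline):
    """One 0/1 dummy column per non-baseline level; the baseline gets all zeros."""
    groups = np.asarray(groups)
    levels = [str(g) for g in np.unique(groups) if g != baseline]
    D = np.column_stack([(groups == lev).astype(float) for lev in levels])
    return D, levels

# Example: a 4-level factor with level "A" chosen as the baseline group.
g = np.array(["A", "B", "C", "D", "B", "A"])
D, levels = dummy_code(g, baseline="A")
print(levels)  # one dummy column for each of 'B', 'C', 'D'
print(D)
```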

Categorical predictors Dummy variable coding: example Dummy variable coding Consider a 4-level factor that will be coded using 3 dummy variables. This table describes the coding that treats level one as the baseline level to which all other levels are compared. There are other choices of coding: the choice of coding does not affect R², σ² and the overall F-statistic. It does affect β̂, and you do need to know what the coding is before drawing conclusions about β̂. 78 / 89

Categorical predictors Different models Dummy variable coding Consider a regression model that describes the relationship between the dependent variable y and two predictors, a continuous X and a categorical C with two levels (two groups): 1 The same regression line for both groups: y = β 0 + β 1 X + ɛ 2 Separate regression lines for each group but same slope: y = β 0 + β 1 X + β 2 C + ɛ 3 Separate regression lines for each group but same intercept: y = β 0 + β 1 X + β 2 X C + ɛ 4 Separate regression lines for each group: y = β 0 + β 1 X + β 2 C + β 3 X C + ɛ 79 / 89

Categorical predictors Different models Dummy variable coding Consider a regression model that describes the relationship between the dependent variable y and two predictors, a continuous X and a categorical C with two levels (two groups): 1 The same regression line for both groups: y = β 0 + β 1 X + ɛ The categorical predictor C is not statistically significant (it has no effect on y), so X is included as the only predictor. 80 / 89

Categorical predictors Different models Dummy variable coding Consider a regression model that describes the relationship between the dependent variable y and two predictors, a continuous X and a categorical C with two levels (two groups): 2 Separate regression lines for each group but same slope: y = β_0 + β_1 X + β_2 C + ɛ. The categorical predictor C has an effect on the y response variable, so X and C are included as predictors in the model. 81 / 89

Categorical predictors Different models Dummy variable coding Consider a regression model that describes the relationship between the dependent variable y and two predictors, a continuous X and a categorical C with two levels (two groups): 3 Separate regression lines for each group but same intercept: y = β_0 + β_1 X + β_2 X·C + ɛ. The categorical predictor C has an effect on the y response variable only through its interaction with the continuous predictor X (the interaction is indicated by X·C), so X and X·C are included as predictors in the model. 82 / 89

Categorical predictors Different models Dummy variable coding Consider a regression model that describes the relationship between the dependent variable y and two predictors, a continuous X and a categorical C with two levels (two groups): 4 Separate regression lines for each group: y = β_0 + β_1 X + β_2 C + β_3 X·C + ɛ. The categorical predictor C has an effect on the y response variable both individually and through its interaction with the continuous predictor X (the interaction is indicated by X·C), so X, C and X·C are included as predictors in the model. 83 / 89

Categorical predictors A two-level example Dummy variable coding Annual Consumption and Income in the United States during the 1940-1950 period:

Year  Income  Consumption
1940  244.0   229.9
1941  277.9   243.6
1942  317.5   241.1
1943  332.1   248.2
1944  343.6   255.2
1945  338.1   270.9
1946  332.7   301.0
1947  318.8   305.8
1948  335.8   312.2
1949  336.8   319.3
1950  362.8   337.3

84 / 89

Categorical predictors A two-level example Dummy variable coding The graph shows that there are 4 points (associated with 1942-1945) where consumption lies well below the hypothetical regression line fitted to the remaining data. 85 / 89

Categorical predictors A two-level example Dummy variable coding

Year  Income  Consumption  War
1940  244.0   229.9        0
1941  277.9   243.6        0
1942  317.5   241.1        1
1943  332.1   248.2        1
1944  343.6   255.2        1
1945  338.1   270.9        1
1946  332.7   301.0        0
1947  318.8   305.8        0
1948  335.8   312.2        0
1949  336.8   319.3        0
1950  362.8   337.3        0

Consumption = β_0 + β_1 Income + β_2 War + ɛ 86 / 89

Categorical predictors A two-level example Dummy variable coding

            Estimate    Std. Error  t value    Pr(>|t|)
Intercept   -10.0649    28.44336    -0.35386   0.732591
Income        0.959595   0.089481   10.72398   5.03E-06
War         -55.4624     5.902399   -9.39659   1.35E-05

ŷ_{W=0} = -10.0649 + 0.959595 X
ŷ_{W=1} = (-10.0649 - 55.4624) + 0.959595 X = -65.5273 + 0.959595 X

The effect of the categorical predictor (War) is to decrease the theoretical estimate of consumption for the years 1942-1945 by an amount of 55.4624. 87 / 89
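The War-dummy fit can be reproduced numerically (not on the slides; data typed in from the table on slide 86) and should roughly match the output above.

```python
import numpy as np

# Consumption data transcribed from the slide (1940-1950), War = 1 for 1942-1945.
income = np.array([244.0, 277.9, 317.5, 332.1, 343.6, 338.1,
                   332.7, 318.8, 335.8, 336.8, 362.8])
consumption = np.array([229.9, 243.6, 241.1, 248.2, 255.2, 270.9,
                        301.0, 305.8, 312.2, 319.3, 337.3])
war = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

X = np.column_stack([np.ones(len(income)), income, war])
beta_hat = np.linalg.solve(X.T @ X, X.T @ consumption)
print(beta_hat)  # roughly [-10.06, 0.96, -55.46], as in the output table above
```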

Categorical predictors A two-level example Dummy variable coding ŷ_{W=0} = -10.0649 + 0.959595 X, ŷ_{W=1} = -65.5273 + 0.959595 X 88 / 89

References Categorical predictors Dummy variable coding 1 Searle, S.R. (1971). Linear Models, Wiley. 2 Wooldridge, J. (2013). Introductory Econometrics, Wiley. 3 Faraway, J.J. (2005). Linear Models with R, Chapman & Hall/CRC. 4 Weisberg, S. (2005). Applied Linear Regression, Wiley. 5 Riani, M. and Laurini, F. (2008). Modelli statistici per l economia con applicazioni aziendali, Pitagora Editrice Bologna. 6 Jobson, J.D. (1991). Applied Multivariate Data Analysis Vol I: Regression and experimental design, Springer 89 / 89