Explanatory Variables Must Be Linearly Independent...
- Claude Chandler
- 5 years ago
Handout 1C - 1
Explanatory Variables Must Be Linearly Independent...

Recall that the multiple linear regression model

  Yj = β0 + β1 X1j + β2 X2j + ... + βp Xpj + εj,  j = 1, ..., n,

is shorthand for n linear relationships

  Y1 = β0 + β1 X11 + β2 X21 + ... + βp Xp1 + ε1
  Y2 = β0 + β1 X12 + β2 X22 + ... + βp Xp2 + ε2
  ...
  Yn = β0 + β1 X1n + β2 X2n + ... + βp Xpn + εn

The least squares estimate (β0, β1, ..., βp) exists under 2 conditions:
- n must be large enough relative to p (cannot include too many covariates)
- The p covariates and also the intercept must be linearly independent

What does linearly independent mean?
Handout 1C - 2
Definition of Linear Dependence and Independence

A set of vectors v1, v2, ..., vn is called linearly dependent if there exist scalars a1, a2, ..., an, not all zero, such that

  a1 v1 + a2 v2 + ... + an vn = 0.

Otherwise, the vectors v1, v2, ..., vn are linearly independent.

For example, the four vectors v1, v2, v3, v4 displayed on the slide are linearly dependent because

  v1 - v2 - v3 - v4 = 0,

but v1, v2, v3 are linearly independent because the only scalars a1, a2, a3 that make a1 v1 + a2 v2 + a3 v3 = 0 are a1 = a2 = a3 = 0.
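Numerically, linear independence of a set of vectors can be checked by comparing the rank of the matrix having them as columns with the number of vectors. A small base-R sketch, using made-up vectors (not the ones on the slide):

```r
v1 <- c(1, 0, 0)
v2 <- c(0, 1, 0)
v3 <- v1 + v2                  # v3 is a combination of v1 and v2

A <- cbind(v1, v2, v3)
qr(A)$rank                     # 2 < 3 columns: v1, v2, v3 are linearly dependent
qr(cbind(v1, v2))$rank         # 2 = 2 columns: v1, v2 are linearly independent
```

When the rank equals the number of columns, the columns are linearly independent; a deficit means some column is a combination of the others.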
Handout 1C - 3
Example

Suppose in some study, the covariates include
  WT2 = weight at age 2, in kg
  WT9 = weight at age 9, in kg
  DW = weight gain from age 2 to 9, in kg
The covariates WT2, WT9, DW are linearly dependent, because DW = WT9 - WT2.
Handout 1C - 4
What Happens When Explanatory Variables Are Linearly Dependent?

We cannot fit the model
  Y = β0 + β1 WT2 + β2 WT9 + β3 DW + ε,
because the coefficients cannot be uniquely determined. Observe

  Y = β0 + (β1 + c) WT2 + (β2 - c) WT9 + (β3 + c) DW + ε
    = β0 + β1 WT2 + β2 WT9 + β3 DW + c (WT2 - WT9 + DW) + ε,

where WT2 - WT9 + DW = 0. Regardless of the value of c, the mean of the response Y is the same. The set of coefficients (β1, β2, β3) will fit the data as well as (β1 + c, β2 - c, β3 + c) does, for any constant c.
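The non-uniqueness is easy to demonstrate numerically. The sketch below uses simulated weights (hypothetical values, not data from the study) to show that shifting the coefficients by c along (1, -1, 1) leaves the mean response unchanged, and that lm() reports NA for the redundant covariate:

```r
set.seed(1)
WT2 <- rnorm(20, mean = 12, sd = 1)        # simulated weight at age 2 (kg)
WT9 <- WT2 + rnorm(20, mean = 15, sd = 2)  # simulated weight at age 9 (kg)
DW  <- WT9 - WT2                           # exactly WT9 - WT2: linearly dependent

Xmat <- cbind(WT2, WT9, DW)
b  <- c(1, 2, 3)
b2 <- b + 5 * c(1, -1, 1)                  # (beta1 + c, beta2 - c, beta3 + c), c = 5
all.equal(as.vector(Xmat %*% b),
          as.vector(Xmat %*% b2))          # TRUE: identical mean responses

y <- 3 + 0.5 * WT2 + 0.8 * WT9 + rnorm(20)
coef(lm(y ~ WT2 + WT9 + DW))               # DW coefficient is NA: R drops the redundant column
```

R's default behavior when the design matrix is rank-deficient is to drop redundant columns and report NA, rather than refuse to fit.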
Handout 1C - 5
What to Do When Explanatory Variables Are Linearly Dependent?

- Remove some of the explanatory variables that are linearly dependent with others, until the remaining explanatory variables are linearly independent
  - e.g., removing any one of WT2, WT9, and DW will make the remaining two linearly independent
- Put constraint(s) on the β's so that they can be uniquely determined
  - commonly adopted approaches for models in experimental designs
Handout 1C - 6
Dummy Variables (1)

Sometimes the explanatory variables are categorical, like blood type (O, A, B, AB). However, it makes NO sense to write a model
  Y = β0 + β1 (blood type) + ε,
because blood type is not a number. In experimental design, the treatment factors are often categorical, e.g., the type of fertilizer.

How to represent categorical variables numerically in a model? Create a dummy variable (a.k.a. indicator variable) for each category of the categorical variable.
Handout 1C - 7
Dummy Variables (2)

For example, for the variable blood type, four dummy variables are created for the 4 categories O, A, B, and AB:
  D_O = 1 if one's blood type is O, and 0 otherwise
  D_A = 1 if one's blood type is A, and 0 otherwise
  D_B = 1 if one's blood type is B, and 0 otherwise
  D_AB = 1 if one's blood type is AB, and 0 otherwise

Though the model Y = β0 + β1 (blood type) + ε makes no sense, the following model does, because D_O, D_A, D_B and D_AB are all numbers (either 0 or 1):
  Y = β0 + β1 D_O + β2 D_A + β3 D_B + β4 D_AB + ε
The mean responses E[Y] for the 4 blood types are then
  blood type   E(Y)
  O            β0 + β1
  A            β0 + β2
  B            β0 + β3
  AB           β0 + β4
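In R, such dummy variables can be built directly from a factor. A minimal sketch with made-up blood types:

```r
blood <- factor(c("O", "A", "B", "AB", "O", "A"))  # hypothetical sample

# One 0/1 indicator per category
D_O  <- as.integer(blood == "O")
D_A  <- as.integer(blood == "A")
D_B  <- as.integer(blood == "B")
D_AB <- as.integer(blood == "AB")

D <- cbind(D_O, D_A, D_B, D_AB)
D
rowSums(D)   # every row sums to 1: each person falls in exactly one category
```

The row sums being identically 1 is exactly the linear dependence with the intercept discussed on the next slide.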
Handout 1C - 8
But the Dummy Variables Are Linearly Dependent...

Since every individual must fall in exactly one of the 4 categories, it is always true that
  D_O + D_A + D_B + D_AB - 1 = 0.
This means:
- One of the 4 dummy variables is redundant, because knowing any 3 tells us the remaining one
- D_O, D_A, D_B, D_AB and the intercept are linearly dependent, and consequently the coefficients (β0, β1, β2, β3, β4) cannot be uniquely determined

For this reason, we say the model
  Y = β0 + β1 D_O + β2 D_A + β3 D_B + β4 D_AB + ε
is overparameterized, because it specifies more parameters than we actually need.
Handout 1C - 9
How to Deal with Overparameterization?

There are various ways to deal with overparameterization in the model
  Y = β0 + β1 D_O + β2 D_A + β3 D_B + β4 D_AB + ε.
Some common ways include
- dropping the intercept (i.e., letting β0 = 0)
- dropping one dummy variable, e.g., D_O (i.e., letting β1 = 0)
  - The category whose dummy variable is dropped is called the baseline. If D_O is dropped, the baseline is blood type O
- letting β1 + β2 + β3 + β4 = 0
Handout 1C - 10
When the Intercept Is Dropped...

Dropping the intercept β0, the model becomes
  Y = β1 D_O + β2 D_A + β3 D_B + β4 D_AB + ε,
and the coefficients of the dummy variables become the mean responses E[Y] for the corresponding blood types:
  blood type   E(Y)
  O            β1
  A            β2
  B            β3
  AB           β4
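In R, the intercept is dropped by writing ~ 0 + (or ~ -1 +) in the formula, and each coefficient then estimates a group mean. A sketch with simulated data (hypothetical, not from any study):

```r
set.seed(2)
blood <- factor(rep(c("O", "A", "B", "AB"), each = 10))
y <- as.integer(blood) + rnorm(40)   # simulated response

fit <- lm(y ~ 0 + blood)             # no intercept: one coefficient per blood type
coef(fit)                            # equals the sample mean of each group
tapply(y, blood, mean)               # same numbers, computed directly
```

With no intercept, least squares within each group reduces to the group average, matching the E(Y) table above.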
Handout 1C - 11
When One of the Dummy Variables Is Dropped...

Dropping one of the dummy variables, say D_O, the model becomes
  Y = β0 + β2 D_A + β3 D_B + β4 D_AB + ε,
and the mean responses E[Y] for the 4 blood types are
  blood type   E(Y)
  O            β0
  A            β0 + β2
  B            β0 + β3
  AB           β0 + β4
- The mean of Y under the baseline (blood type O) is β0
- The mean of Y for blood type A is β0 + β2
- One can compare the means of Y for blood types A and O by testing β2 = 0
Useful for comparing categories with the baseline category.
Handout 1C - 12
Choice of the Baseline Category Can Be Arbitrary

If blood type O is the baseline:
  Y = β0 + β2 D_A + β3 D_B + β4 D_AB + ε
  blood type   E(Y)
  O            β0
  A            β0 + β2
  B            β0 + β3
  AB           β0 + β4

If blood type A is the baseline:
  Y = β0' + β1' D_O + β3' D_B + β4' D_AB + ε
  blood type   E(Y)
  O            β0' + β1'
  A            β0'
  B            β0' + β3'
  AB           β0' + β4'

The 2 models are equivalent in the sense that they give identical group means:
  β0 = β0' + β1'
  β0 + β2 = β0'
  β0 + β3 = β0' + β3'
  β0 + β4 = β0' + β4'
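This equivalence is easy to verify in R: refitting with a different baseline changes the coefficients but not the fitted group means. A minimal sketch with simulated blood-type data (hypothetical values):

```r
set.seed(3)
blood <- factor(rep(c("O", "A", "B", "AB"), each = 5))
y <- as.integer(blood) + rnorm(20)           # simulated response

fit_O <- lm(y ~ relevel(blood, ref = "O"))   # blood type O as baseline
fit_A <- lm(y ~ relevel(blood, ref = "A"))   # blood type A as baseline

coef(fit_O)                                  # different coefficients...
coef(fit_A)
all.equal(fitted(fit_O), fitted(fit_A))      # ...but identical fitted group means
```

The coefficients are just different parametrizations of the same four group means, so all fitted values, residuals, and overall F-tests agree.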
Handout 1C - 13
Example: Salary Survey

Variables in the data (S, X, E, M):
  S = Salary
  X = Experience, in years
  E = Education (1 if H.S. only, 2 if Bachelor's only, 3 if Advanced degree)
  M = Management status (1 if manager, 0 if non-manager)
Handout 1C - 14
Example: Salary Survey Coding Variables (1)

Let's first consider the effect of experience (X) and education (E) on employees' salary (S), ignoring the effect of management status.
- Experience (X): numerical
- Education (E): qualitative, 3 categories, need 3 dummy variables
    E_i1 = 1 if the i-th person has a high school diploma only, 0 otherwise
    E_i2 = 1 if the i-th person has a B.S. only, 0 otherwise
    E_i3 = 1 if the i-th person has an advanced degree, 0 otherwise

Model 1: S = β0 + βX + δ1 E1 + δ2 E2 + δ3 E3 + ε
Handout 1C - 15
Example: Salary Survey Coding Variables

Model 1: S = β0 + δ1 E1 + δ2 E2 + δ3 E3 + βX + ε
This model is overparameterized; we need a constraint. If we drop the intercept (letting β0 = 0), then
  S = δ1 + βX + ε   if H.S. only
      δ2 + βX + ε   if B.A. or B.S. only
      δ3 + βX + ε   if advanced
In this parametrization, δ1, δ2, δ3 represent the 3 different intercepts of the regression lines of S on X at the 3 different education levels. Often we are interested in comparisons between categories, e.g., whether Bachelors earn more than H.S. graduates on average (i.e., whether δ2 > δ1) or not.
Handout 1C - 16
Example: Salary Survey Coding Variables

If we drop the dummy variable E2 for Bachelors, i.e., use the Bachelor's degree as the baseline, then
  S = β0 + δ1 + βX + ε   if H.S. only
      β0 + βX + ε        if B.A. or B.S. only
      β0 + δ3 + βX + ε   if advanced
This parametrization is convenient for comparisons between categories. One can test whether Bachelors earn more than H.S. graduates by testing δ1 < 0, and test whether an advanced degree increases salary by testing δ3 > 0.
Handout 1C - 17
Example: Salary Survey Regression Fit (1)

> salary = read.table("salarysurvey.txt", head=TRUE)
> lm1a = lm(S ~ E+X, data = salary)
> summary(lm1a)
(... part of the R output is omitted)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-05 ***
E                                               **
X                                          e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3604 on 43 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 2 and 43 DF, p-value: 3.538e-06

Something wrong?
Handout 1C - 18
Example: Salary Survey Regression Fit (2)

Let's check the model matrix.
> model.matrix(lm1a)
   (Intercept) E X
   (omitted)
attr(,"assign")
[1] 0 1 2

R treats E (education) as a numerical variable taking values 1, 2, and 3, not a categorical one.
Handout 1C - 19
Example: Salary Survey Numerical or Categorical?

If one treats E (education) as a numerical variable taking values 1, 2, and 3, the model then becomes
  Model 2: S = β0 + βX + δE + ε.
But Model 2 has a different implication from Model 1: it says that, on average,
- a Bachelor's degree increases salary by δ;
- a Bachelor's degree + an advanced degree increases salary by 2δ.
That is, the salary bonus for completing college is as much as the bonus for completing an advanced degree, which is unrealistic and too restrictive. Treating E as a categorical variable allows the salary bonus for a Bachelor's degree and for an advanced degree to be different.

Remark: Model 2 is nested in Model 1 (Why?).
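The nesting can be checked numerically: the numeric-E model restricts the two education effects of the factor-E model to lie on a line (δ for level 2, 2δ for level 3), so its residual sum of squares can never be smaller. A sketch with simulated data (the actual salary file is not used here):

```r
set.seed(4)
E <- sample(1:3, 30, replace = TRUE)            # education coded 1, 2, 3
X <- runif(30, 1, 20)                           # years of experience
S <- 1000 * E + 500 * X + rnorm(30, sd = 300)   # simulated salaries

m2 <- lm(S ~ X + E)            # Model 2: E numeric, one slope delta
m1 <- lm(S ~ X + factor(E))    # Model 1: E categorical, separate effects

deviance(m1) <= deviance(m2)   # TRUE: the richer model fits at least as well
anova(m2, m1)                  # F-test of the linearity restriction on E
```

Because Model 2 is a restricted version of Model 1, anova() gives a valid F-test of whether the restriction is acceptable.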
Handout 1C - 20
Example: Salary Survey Regression Fit (3)

> salary$E = as.factor(salary$E)
> lm1 = lm(S ~ E+X, data = salary)
> summary(lm1)
(... part of the R output is omitted)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-10 ***
E2                                              *
E3                                              **
X                                          e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3622 on 42 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 42 DF, p-value: 1.291e-05

- The command as.factor(E) tells R that E is categorical
- By default, R uses the lowest level (E = 1) as the baseline
Handout 1C - 21
Example: Salary Survey Regression Fit (4)

Let's check the model matrix of Model 1.
> model.matrix(lm1)
   (Intercept) E2 E3 X
   (... omitted ...)
attr(,"assign")
[1] 0 1 1 2
attr(,"contrasts")
attr(,"contrasts")$E
[1] "contr.treatment"

Now R knows E is categorical: it creates 2 dummy variables, E2 and E3, and treats H.S. diploma (E = 1) as the baseline.
Handout 1C - 22
Example: Salary Survey Interpreting Coefficients

From the output of Model 1, the predicted salary is
  Ŝ = β̂0 + β̂ X + δ̂2 E2 + δ̂3 E3,
with the estimates taken from the output above. This model implies that, on average:
- each extra year of experience is worth $548.6;
- completing college increases salary by $3221.1;
- completing college + an advanced degree increases salary by the E3 estimate.
All 3 coefficients above are significantly different from 0 (P-value < 5%).
What if we want to compare Bachelors with advanced degree holders?
Handout 1C - 23
Example: Salary Survey Changing Baseline (1)

If not happy with the baseline category R chooses, say we want E = 2 (Bachelor's degree) to be the baseline, one can either manually create the dummy variables E1 and E3
> salary$E1 = as.integer(salary$E==1)
> salary$E3 = as.integer(salary$E==3)
> lm1b = lm(S ~ X + E1 + E3, data = salary)
or use the command relevel()
> salary$E = relevel(salary$E, ref = "2")
> lm1c = lm(S ~ X + E, data = salary)
Both will fit Model 1 using E = 2 as the baseline. See the R outputs on the next page.

Conclusion: Looking at the coefficient for E3 on the next page, we can conclude advanced degree holders do NOT earn significantly more than Bachelors (P-value ≈ 0.25).
Handout 1C - 24

> summary(lm1b)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-14 ***
X                                          e-06 ***
E1                                              *
E3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3622 on 42 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 42 DF, p-value: 1.291e-05

> summary(lm1c)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-14 ***
X                                          e-06 ***
E1                                              *
E3
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3622 on 42 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 3 and 42 DF, p-value: 1.291e-05
Handout 1C - 25
What If We Want to Drop the Intercept?

> lm1e = lm(S ~ -1 + X + E, data = salary)
> summary(lm1e)
            Estimate Std. Error t value Pr(>|t|)
X                                          e-06 ***
E1                                         e-14 ***
E2                                         e-10 ***
E3                                       < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3622 on 42 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 4 and 42 DF, p-value: < 2.2e-16

This fits the model
  S = δ1 + βX + ε   if H.S. only
      δ2 + βX + ε   if Bachelor's only
      δ3 + βX + ε   if advanced
with δ1, δ2, δ3 and β equal to the estimates shown in the output above.
Handout 1C - 26
What About the Sum-to-Zero Constraint δ1 + δ2 + δ3 = 0?

For the salary example,
  S = β0 + δ1 E1 + δ2 E2 + δ3 E3 + βX + ε
    = β0 + δ1 + βX + ε   if H.S. only
      β0 + δ2 + βX + ε   if Bachelor's only
      β0 + δ3 + βX + ε   if advanced
the sum-to-zero constraint δ1 + δ2 + δ3 = 0 is NOT intuitive; under it, the coefficients δ1, δ2, δ3 and β0 have NO natural interpretations. Nonetheless, the sum-to-zero constraint will exhibit its power in factorial designs, in which two or more treatment factors are administered in an experiment. We will come back to this in Chapter 8.
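For reference, R can already fit under the sum-to-zero constraint via the contr.sum contrast. A minimal sketch with simulated data (hypothetical, not the salary file):

```r
set.seed(5)
E <- factor(rep(1:3, each = 10))   # 3 education levels, balanced groups
y <- as.integer(E) + rnorm(30)     # simulated response

# contr.sum imposes the sum-to-zero constraint on the level effects
fit <- lm(y ~ E, contrasts = list(E = "contr.sum"))
coef(fit)   # (Intercept), E1, E2; the third effect is -(E1 + E2)
```

Under this coding, the reported effects for the first two levels and the implied third effect add to zero, so the intercept plays the role of an overall mean rather than a baseline group mean.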
Handout 1C - 27
Interaction Between Categorical and Numerical Variables

Regardless of which constraint is used, the model
  S = β0 + δ1 E1 + δ2 E2 + δ3 E3 + βX + ε
    = β0 + δ1 + βX + ε   if H.S.
      β0 + δ2 + βX + ε   if B.A. or B.S.
      β0 + δ3 + βX + ε   if advanced
assumes a constant effect of experience X on salary S (the slope β) across all education levels, which can be unrealistic.
- If the effect of a variable on the response changes with the level of another variable, we say the effects of the two variables interact. If not, we say their effects are additive.
- e.g., the model above assumes the effects of education (E) and experience (X) on salary are additive.
How to write an MLR model with the slope of X changing with education levels?
Handout 1C - 28
Interaction Between Categorical and Numerical Variables

Consider the model
  S = β0 + δ1 E1 + δ2 E2 + δ3 E3 + βX + γ1 (E1 × X) + γ2 (E2 × X) + γ3 (E3 × X) + ε
Here (E1 × X) means the product of the variables E1 and X. Then
  S = β0 + δ1 + (β + γ1) X + ε   if H.S.
      β0 + δ2 + (β + γ2) X + ε   if B.A. or B.S.
      β0 + δ3 + (β + γ3) X + ε   if advanced
Again, the model is overparameterized. We need one additional constraint on β and the γ's. Some common constraints are
- β = 0
- γ1 = 0 (or γ2 = 0, or γ3 = 0)
- γ1 + γ2 + γ3 = 0
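One can see how R encodes these product terms by inspecting the model matrix for a formula with an interaction. A small sketch with made-up values of E and X:

```r
E <- factor(c(1, 2, 3, 1, 2, 3))   # made-up education levels
X <- c(5, 3, 8, 2, 7, 4)           # made-up years of experience

mm <- model.matrix(~ X + E + X:E)
colnames(mm)   # "(Intercept)" "X" "E2" "E3" "X:E2" "X:E3"
```

The columns X:E2 and X:E3 are literally the products of X with the dummies E2 and E3, so their coefficients are the slope adjustments γ2 and γ3 relative to the baseline level.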
Handout 1C - 29

If one uses the H.S. diploma as the baseline, i.e., letting δ1 = 0 and γ1 = 0,
  S = β0 + δ2 E2 + δ3 E3 + βX + γ2 (E2 × X) + γ3 (E3 × X) + ε
    = β0 + βX + ε                  if H.S.
      β0 + δ2 + (β + γ2) X + ε   if B.A. or B.S.
      β0 + δ3 + (β + γ3) X + ε   if advanced
Then γ2 is the extra salary per year of experience for completing college, and γ3 is that for getting an advanced degree.
Handout 1C - 30
Fitting Models with Interaction in R

In R, the term X:E in a model formula represents the interaction terms of X and E (and X*E is shorthand for X + E + X:E). By default, R uses the lowest level E = 1 (H.S. diploma) as the baseline.
> salary$E = relevel(salary$E, ref = "1")
> lm2 = lm(S ~ X+E+X:E, data = salary)
> summary(lm2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                e-11 ***
X                                               **
E2
E3
X:E2
X:E3

Neither γ2 nor γ3 is significantly different from 0 (P-values ≈ 0.37 and 0.17).
Handout 1C - 31
Test for Interaction

Testing whether the effect of experience on salary changes with education level is equivalent to testing
  H0: γ1 = γ2 = γ3.
That is, it compares the full model and the reduced model below:
  S = β0 + δ1 E1 + δ2 E2 + δ3 E3 + βX + γ1 (E1 × X) + γ2 (E2 × X) + γ3 (E3 × X) + ε   (full)
  S = β0 + δ1 E1 + δ2 E2 + δ3 E3 + βX + ε   (reduced)
> lm1 = lm(S ~ X+E, data = salary)
> lm2 = lm(S ~ X+E+X:E, data = salary)
> anova(lm1,lm2)
Analysis of Variance Table
Model 1: S ~ X + E
Model 2: S ~ X + E + X:E
  Res.Df RSS Df Sum of Sq F Pr(>F)
1
2
The interaction is not significant.
Handout 1C - 32
Interaction Between Two Categorical Variables

Now let's take another categorical variable, management status (M), into account:
  M = 1 if manager, 0 if non-manager
Since M is a categorical variable, just like E, we should create dummy variables M0 and M1 for the two categories, and consider the model
  S = β0 + α0 M0 + α1 M1 + δ1 E1 + δ2 E2 + δ3 E3 + βX + ε.
However, we don't need both M0 and M1, since M0 + M1 = 1 and the model is again overparameterized. We can drop one of M0 and M1, and one of E1, E2 and E3. So we drop M0 and E1, and consider the model
  S = β0 + α1 M1 + δ2 E2 + δ3 E3 + βX + ε.
Handout 1C - 33
Interaction Between Two Categorical Variables

  S = β0 + α1 M1 + δ2 E2 + δ3 E3 + βX + ε.
This model implies that, on average,
- managers earn α1 more than non-managers;
- completing college increases salary by δ2;
- completing college + an advanced degree increases salary by δ3.
However, the model above assumes the effect of management status on salary does not change with education level. Thus we may consider the following model with management status by education level interactions:
  S = β0 + α1 M1 + δ2 E2 + δ3 E3 + θ2 (M × E2) + θ3 (M × E3) + βX + ε.
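As with the numerical-by-categorical case, the product terms M × E2 and M × E3 appear as extra columns in the model matrix. A small sketch with made-up values:

```r
E <- factor(rep(1:3, each = 2))   # made-up education levels
M <- rep(0:1, times = 3)          # made-up 0/1 management indicator

mm <- model.matrix(~ M + E + M:E)
colnames(mm)   # "(Intercept)" "M" "E2" "E3" "M:E2" "M:E3"
```

Each interaction column is 1 only for observations that are both managers and in the corresponding education category, which is exactly what lets the manager effect differ by education level.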
Handout 1C - 34
Interaction Between Two Categorical Variables in R

No interaction:
> lm3 = lm(S ~ X+E+M, data = salary)
> summary(lm3)
(... omitted ...)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              < 2e-16 ***
X                                        < 2e-16 ***
E2                                         e-11 ***
E3                                         e-09 ***
M                                        < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1027 on 41 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 4 and 41 DF, p-value: < 2.2e-16
Handout 1C - 35
Interaction Between Two Categorical Variables in R

With interaction:
> lm4 = lm(S ~ X+E+M+E:M, data = salary)
> summary(lm4)
(... omitted ...)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              <2e-16 ***
X                                        <2e-16 ***
E2                                       <2e-16 ***
E3                                       <2e-16 ***
M                                        <2e-16 ***
E2:M                                     <2e-16 ***
E3:M                                     <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 39 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 5517 on 6 and 39 DF, p-value: < 2.2e-16
Handout 1C - 36
Interaction Between Two Categorical Variables in R

Test of interaction:
> anova(lm3,lm4)
Analysis of Variance Table
Model 1: S ~ X + E + M
Model 2: S ~ X + E + M + E:M
  Res.Df RSS Df Sum of Sq F Pr(>F)
1
2                               < 2.2e-16 ***
More informationRandomized Block Designs with Replicates
LMM 021 Randomized Block ANOVA with Replicates 1 ORIGIN := 0 Randomized Block Designs with Replicates prepared by Wm Stein Randomized Block Designs with Replicates extends the use of one or more random
More informationLecture 18: Simple Linear Regression
Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength
More informationUNIVERSITY OF TORONTO Faculty of Arts and Science
UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator
More information6. Dummy variable regression
6. Dummy variable regression Why include a qualitative independent variable?........................................ 2 Simplest model 3 Simplest case.............................................................
More informationStat 401B Exam 2 Fall 2016
Stat 40B Eam Fall 06 I have neither given nor received unauthorized assistance on this eam. Name Signed Date Name Printed ATTENTION! Incorrect numerical answers unaccompanied by supporting reasoning will
More informationSTAT 571A Advanced Statistical Regression Analysis. Chapter 8 NOTES Quantitative and Qualitative Predictors for MLR
STAT 571A Advanced Statistical Regression Analysis Chapter 8 NOTES Quantitative and Qualitative Predictors for MLR 2015 University of Arizona Statistics GIDP. All rights reserved, except where previous
More informationLinear Regression is a very popular method in science and engineering. It lets you establish relationships between two or more numerical variables.
Lab 13. Linear Regression www.nmt.edu/~olegm/382labs/lab13r.pdf Note: the things you will read or type on the computer are in the Typewriter Font. All the files mentioned can be found at www.nmt.edu/~olegm/382labs/
More informationLecture 6: Linear Regression (continued)
Lecture 6: Linear Regression (continued) Reading: Sections 3.1-3.3 STATS 202: Data mining and analysis October 6, 2017 1 / 23 Multiple linear regression Y = β 0 + β 1 X 1 + + β p X p + ε Y ε N (0, σ) i.i.d.
More informationMultiple Linear Regression for the Salary Data
Multiple Linear Regression for the Salary Data 5 10 15 20 10000 15000 20000 25000 Experience Salary HS BS BS+ 5 10 15 20 10000 15000 20000 25000 Experience Salary No Yes Problem & Data Overview Primary
More informationBiostatistics 380 Multiple Regression 1. Multiple Regression
Biostatistics 0 Multiple Regression ORIGIN 0 Multiple Regression Multiple Regression is an extension of the technique of linear regression to describe the relationship between a single dependent (response)
More informationy response variable x 1, x 2,, x k -- a set of explanatory variables
11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate
More informationTopic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model
Topic 17 - Single Factor Analysis of Variance - Fall 2013 One way ANOVA Cell means model Factor effects model Outline Topic 17 2 One-way ANOVA Response variable Y is continuous Explanatory variable is
More informationRegression Analysis Chapter 2 Simple Linear Regression
Regression Analysis Chapter 2 Simple Linear Regression Dr. Bisher Mamoun Iqelan biqelan@iugaza.edu.ps Department of Mathematics The Islamic University of Gaza 2010-2011, Semester 2 Dr. Bisher M. Iqelan
More information1.) Fit the full model, i.e., allow for separate regression lines (different slopes and intercepts) for each species
Lecture notes 2/22/2000 Dummy variables and extra SS F-test Page 1 Crab claw size and closing force. Problem 7.25, 10.9, and 10.10 Regression for all species at once, i.e., include dummy variables for
More informationChapter 14 Student Lecture Notes 14-1
Chapter 14 Student Lecture Notes 14-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter 14 Multiple Regression Analysis and Model Building Chap 14-1 Chapter Goals After completing this
More informationApplied Regression Analysis
Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of
More informationSTA 101 Final Review
STA 101 Final Review Statistics 101 Thomas Leininger June 24, 2013 Announcements All work (besides projects) should be returned to you and should be entered on Sakai. Office Hour: 2 3pm today (Old Chem
More informationUsing R formulae to test for main effects in the presence of higher-order interactions
Using R formulae to test for main effects in the presence of higher-order interactions Roger Levy arxiv:1405.2094v2 [stat.me] 15 Jan 2018 January 16, 2018 Abstract Traditional analysis of variance (ANOVA)
More informationExtensions of One-Way ANOVA.
Extensions of One-Way ANOVA http://www.pelagicos.net/classes_biometry_fa18.htm What do I want You to Know What are two main limitations of ANOVA? What two approaches can follow a significant ANOVA? How
More informationStatistics 191 Introduction to Regression Analysis and Applied Statistics Practice Exam
Statistics 191 Introduction to Regression Analysis and Applied Statistics Practice Exam Prof. J. Taylor You may use your 4 single-sided pages of notes This exam is 14 pages long. There are 4 questions,
More informationNotes on Maxwell & Delaney
Notes on Maxwell & Delaney PSY710 9 Designs with Covariates 9.1 Blocking Consider the following hypothetical experiment. We want to measure the effect of a drug on locomotor activity in hyperactive children.
More information14 Multiple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in
More informationMultiple Regression: Chapter 13. July 24, 2015
Multiple Regression: Chapter 13 July 24, 2015 Multiple Regression (MR) Response Variable: Y - only one response variable (quantitative) Several Predictor Variables: X 1, X 2, X 3,..., X p (p = # predictors)
More information9. Linear Regression and Correlation
9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,
More informationLinear Modelling in Stata Session 6: Further Topics in Linear Modelling
Linear Modelling in Stata Session 6: Further Topics in Linear Modelling Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 14/11/2017 This Week Categorical Variables Categorical
More informationPart II { Oneway Anova, Simple Linear Regression and ANCOVA with R
Part II { Oneway Anova, Simple Linear Regression and ANCOVA with R Gilles Lamothe February 21, 2017 Contents 1 Anova with one factor 2 1.1 The data.......................................... 2 1.2 A visual
More informationRegression 1: Linear Regression
Regression 1: Linear Regression Marco Baroni Practical Statistics in R Outline Classic linear regression Linear regression in R Outline Classic linear regression Introduction Constructing the model Estimation
More information22s:152 Applied Linear Regression. Chapter 5: Ordinary Least Squares Regression. Part 2: Multiple Linear Regression Introduction
22s:152 Applied Linear Regression Chapter 5: Ordinary Least Squares Regression Part 2: Multiple Linear Regression Introduction Basic idea: we have more than one covariate or predictor for modeling a dependent
More informationST430 Exam 2 Solutions
ST430 Exam 2 Solutions Date: November 9, 2015 Name: Guideline: You may use one-page (front and back of a standard A4 paper) of notes. No laptop or textbook are permitted but you may use a calculator. Giving
More informationExtensions of One-Way ANOVA.
Extensions of One-Way ANOVA http://www.pelagicos.net/classes_biometry_fa17.htm What do I want You to Know What are two main limitations of ANOVA? What two approaches can follow a significant ANOVA? How
More informationUnit 6 - Introduction to linear regression
Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,
More informationChapter 6. Logistic Regression. 6.1 A linear model for the log odds
Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,
More informationSTA441: Spring Multiple Regression. More than one explanatory variable at the same time
STA441: Spring 2016 Multiple Regression More than one explanatory variable at the same time This slide show is a free open source document. See the last slide for copyright information. One Explanatory
More informationWeek 7 Multiple factors. Ch , Some miscellaneous parts
Week 7 Multiple factors Ch. 18-19, Some miscellaneous parts Multiple Factors Most experiments will involve multiple factors, some of which will be nuisance variables Dealing with these factors requires
More informationMarcel Dettling. Applied Statistical Regression AS 2012 Week 05. ETH Zürich, October 22, Institute for Data Analysis and Process Design
Marcel Dettling Institute for Data Analysis and Process Design Zurich University of Applied Sciences marcel.dettling@zhaw.ch http://stat.ethz.ch/~dettling ETH Zürich, October 22, 2012 1 What is Regression?
More informationApplied Regression Analysis. Section 2: Multiple Linear Regression
Applied Regression Analysis Section 2: Multiple Linear Regression 1 The Multiple Regression Model Many problems involve more than one independent variable or factor which affects the dependent or response
More informationUnit 7: Multiple linear regression 1. Introduction to multiple linear regression
Announcements Unit 7: Multiple linear regression 1. Introduction to multiple linear regression Sta 101 - Fall 2017 Duke University, Department of Statistical Science Work on your project! Due date- Sunday
More informationLab 10 - Binary Variables
Lab 10 - Binary Variables Spring 2017 Contents 1 Introduction 1 2 SLR on a Dummy 2 3 MLR with binary independent variables 3 3.1 MLR with a Dummy: different intercepts, same slope................. 4 3.2
More informationTrendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues
Trendlines Simple Linear Regression Multiple Linear Regression Systematic Model Building Practical Issues Overfitting Categorical Variables Interaction Terms Non-linear Terms Linear Logarithmic y = a +
More informationRegression. Bret Hanlon and Bret Larget. December 8 15, Department of Statistics University of Wisconsin Madison.
Regression Bret Hanlon and Bret Larget Department of Statistics University of Wisconsin Madison December 8 15, 2011 Regression 1 / 55 Example Case Study The proportion of blackness in a male lion s nose
More informationAnalytics 512: Homework # 2 Tim Ahn February 9, 2016
Analytics 512: Homework # 2 Tim Ahn February 9, 2016 Chapter 3 Problem 1 (# 3) Suppose we have a data set with five predictors, X 1 = GP A, X 2 = IQ, X 3 = Gender (1 for Female and 0 for Male), X 4 = Interaction
More informationACOVA and Interactions
Chapter 15 ACOVA and Interactions Analysis of covariance (ACOVA) incorporates one or more regression variables into an analysis of variance. As such, we can think of it as analogous to the two-way ANOVA
More information
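The non-uniqueness argument above can be checked numerically: because DW = WT 9 − WT 2, shifting the coefficients to (β 1 + c, β 2 − c, β 3 + c) leaves every fitted mean unchanged. The sketch below uses hypothetical weight data (the values, the variable names, and the chosen coefficients are illustrative, not from any real study).

```python
# Hypothetical weights (kg) for four children; DW is exactly WT9 - WT2,
# so the three covariates are linearly dependent.
WT2 = [12.0, 13.5, 11.8, 14.2]               # weight at age 2
WT9 = [28.0, 30.1, 27.5, 31.0]               # weight at age 9
DW = [w9 - w2 for w9, w2 in zip(WT9, WT2)]   # weight gain from age 2 to 9

def mean_response(b0, b1, b2, b3):
    """Mean of Y under coefficients (b0, b1, b2, b3):
    b0 + b1*WT2 + b2*WT9 + b3*DW for each subject."""
    return [b0 + b1 * x1 + b2 * x2 + b3 * x3
            for x1, x2, x3 in zip(WT2, WT9, DW)]

# Shift the coefficients by c along the dependence WT2 - WT9 + DW = 0.
c = 5.0
original = mean_response(1.0, 0.4, 0.9, 0.2)
shifted = mean_response(1.0, 0.4 + c, 0.9 - c, 0.2 + c)

# The extra term is c * (WT2 - WT9 + DW) = 0, so the fitted means agree.
print(all(abs(a - b) < 1e-9 for a, b in zip(original, shifted)))
```

Since every choice of c produces the same fitted means, the data cannot distinguish (β 1, β 2, β 3) from (β 1 + c, β 2 − c, β 3 + c), which is exactly why the least squares estimate fails to exist under linear dependence.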