CHAPTER 2: Assumptions and Properties of Ordinary Least Squares, and Inference in the Linear Regression Model

Prof. Alan Wan

Table of contents

1. Assumptions in the Linear Regression Model

Assumptions

The validity and properties of least squares estimation depend very much on the validity of the classical assumptions underlying the regression model. As we shall see, many of these assumptions are rarely appropriate when dealing with business data. However, they represent a useful starting point for dealing with the inferential aspects of the regression and for the development of more advanced techniques. The assumptions are as follows:

1. The regression model is linear in the unknown parameters.

2. The elements in X are non-stochastic, meaning that the values of X are fixed in repeated samples (i.e., when repeating the experiment, we choose exactly the same set of X values on each occasion so that they remain unchanged). Notice, however, that this does not imply that the values of Y also remain unchanged from sample to sample. The Y values depend also on the uncontrollable values of ɛ, which vary from one sample to another. Y as well as ɛ are therefore stochastic, meaning that their values are determined by some chance mechanism and hence subject to a probability distribution. Essentially this means our regression analysis is conditional on the given values of the regressors. It is possible to weaken the assumption to one of stochastic X distributed independently of the disturbance term.

3. Zero mean value of the disturbance ɛ_i, i.e., E(ɛ_i) = 0 for all i, or in matrix terms,

E(\epsilon) = E\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} = 0,

leading to E(Y) = Xβ. The zero mean of the disturbances implies that no relevant regressors have been omitted from the model.

4. The variance-covariance matrix of ɛ is a scalar matrix. That is,

E(\epsilon\epsilon') = E\begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix}\begin{pmatrix} \epsilon_1 & \epsilon_2 & \cdots & \epsilon_n \end{pmatrix}
= \begin{pmatrix} E(\epsilon_1^2) & E(\epsilon_1\epsilon_2) & \cdots & E(\epsilon_1\epsilon_n) \\ E(\epsilon_2\epsilon_1) & E(\epsilon_2^2) & \cdots & E(\epsilon_2\epsilon_n) \\ \vdots & \vdots & \ddots & \vdots \\ E(\epsilon_n\epsilon_1) & E(\epsilon_n\epsilon_2) & \cdots & E(\epsilon_n^2) \end{pmatrix}
= \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I.

This variance-covariance matrix embodies two assumptions:

- var(ɛ_i) = σ² for all i. This assumption is termed homoscedasticity (the converse is heteroscedasticity).

- cov(ɛ_i, ɛ_j) = 0 for all i ≠ j. This assumption is termed pairwise uncorrelatedness (the converse is serial correlation or autocorrelation).

5. ρ(X) = rank(X) = k < n. In other words, the explanatory variables do not form a linearly dependent set, where X is n × k. We say that X has full column rank. If this condition fails, then X'X cannot be inverted and O.L.S. estimation becomes infeasible. This problem is known as perfect multicollinearity.

6. As n → ∞, Σ_{i=1}^n (x_ij − x̄_j)²/n → Q_j, where Q_j is finite, j = 1, ..., k.

Properties of O.L.S.

When some or all of the above assumptions are satisfied, the O.L.S. estimator b of β possesses the following properties. Note that not every property requires all of the above assumptions to be fulfilled. Properties of the O.L.S. estimator:

- b is a linear estimator in the sense that it is a linear combination of the observations of Y:

b = (X'X)^{-1}X'Y = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{k1} & c_{k2} & \cdots & c_{kn} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
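To make the linear-combination form concrete, here is a minimal PROC IML sketch that computes b = (X'X)^{-1}X'Y directly. The data are hypothetical (five observations, an intercept and one regressor), not taken from any example in these notes.

proc iml;
/* hypothetical data: n = 5, k = 2 (intercept plus one regressor) */
y = {3, 5, 4, 7, 8};
x = {1 1,
     1 2,
     1 3,
     1 4,
     1 5};                    /* first column of ones for the intercept */
b = inv(x`*x) * x` * y;       /* O.L.S. estimator b = (X'X)^{-1} X'Y */
c = inv(x`*x) * x`;           /* the k x n matrix of weights c_{ji} above */
print b c;
quit;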

- Unbiasedness:

E(b) = E((X'X)^{-1}X'Y) = E(β + (X'X)^{-1}X'ɛ) = β + (X'X)^{-1}X'E(ɛ) = β

Thus, b is an unbiased estimator of β. That is, in repeated samples, b has an average value identical to β, the parameter b tries to estimate.

- Variance-covariance matrix:

COV(b) = E((b − E(b))(b − E(b))') = E((b − β)(b − β)') = E((X'X)^{-1}X'ɛɛ'X(X'X)^{-1}) = (X'X)^{-1}X'E(ɛɛ')X(X'X)^{-1} = σ²(X'X)^{-1}

The main diagonal elements are the variances of the b_j's, j = 1, ..., k; the off-diagonal elements are covariances. For the special case of a simple linear regression,

Cov\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = \begin{pmatrix} \sigma^2\left(\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\right) & \dfrac{-\sigma^2 \bar{x}}{\sum_{i=1}^n (x_i-\bar{x})^2} \\ \dfrac{-\sigma^2 \bar{x}}{\sum_{i=1}^n (x_i-\bar{x})^2} & \dfrac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2} \end{pmatrix}

- b is the best linear unbiased (B.L.U.) estimator of β. Refer to the Gauss-Markov theorem. The B.L.U. property implies that each b_j, j = 1, ..., k, has the smallest variance among the class of all linear unbiased estimators of β_j. More discussion in class.

- b is the minimum variance unbiased (M.V.U.) estimator of β, meaning that b_j has a variance no larger than that of any unbiased estimator of β_j, linear or non-linear. More discussion in class.

- b is a consistent estimator of β, meaning that, for any ε > 0, P(|b_j − β_j| < ε) converges to 1 as n becomes sufficiently large, j = 1, ..., k. We say that b converges in probability to the true value of β. More discussion in class.

Matters of Inference

If one assumes additionally that ɛ ~ MVN(0, σ²I), then

Y ~ MVN(Xβ, σ²I)
b ~ MVN(β, σ²(X'X)^{-1})

Using properties of the sampling distribution of b, inference about the population parameters in β can be drawn.

However, we need an estimator of σ², the variance around the regression line. This estimator is given by

s² = e'e/(n − k) = Σ_{i=1}^n (y_i − ŷ_i)²/(n − k),

where n − k is the model's degrees of freedom (d.o.f.), the number of logically independent pieces of information in the data.

It can be shown that e'e/σ² ~ χ²(n − k), or equivalently (n − k)s²/σ² ~ χ²(n − k). Using the properties of the Chi-square distribution, it can be shown that E(s²) = σ², i.e., s² is an unbiased estimator of σ².
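Continuing the PROC IML sketch from above (same hypothetical data), s² and the estimated variance-covariance matrix s²(X'X)^{-1} follow in a few lines:

proc iml;
y = {3, 5, 4, 7, 8};
x = {1 1, 1 2, 1 3, 1 4, 1 5};
b = inv(x`*x) * x` * y;
e = y - x*b;                      /* residual vector */
n = nrow(x);  k = ncol(x);
s2 = e`*e / (n - k);              /* s^2 = e'e/(n - k) */
covb = s2 # inv(x`*x);            /* estimated Cov(b) = s^2 (X'X)^{-1} */
print s2 covb;
quit;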

The quantities b_j, j = 1, ..., k, are simply point estimates (single numbers). Often it is more desirable to state a range of values in which the parameter is thought to lie rather than a single number. These ranges are called confidence interval (C.I.) estimates. Although the interval estimate is less precise, the confidence that the true population parameter falls between the interval limits is increased. The interval should be precise enough to be practically useful.

Consider a coefficient β_j in β. The interval L ≤ β_j ≤ U is a 100(1 − α)% confidence interval for β_j in the sense that, prior to sampling,

P(L ≤ β_j ≤ U) = 1 − α.

This definition states that the C.I. with confidence coefficient 1 − α is an interval estimate such that the probability is 1 − α that the calculated limits include β_j for any random trial. That is, in many random samples of size n, 100(1 − α) percent of the interval estimates will include β_j.

Recall that b ~ N(β, σ²(X'X)^{-1}). Hence

(b_j − β_j)/(σ c_jj) ~ N(0, 1),

where c_jj² is the jj-th element of (X'X)^{-1}. Hence

P(z_(α/2) ≤ (b_j − β_j)/(σ c_jj) ≤ z_(1−α/2)) = 1 − α.

Recognising that z_(α/2) = −z_(1−α/2), and after some manipulation, we can write

P(b_j − z_(1−α/2) σ c_jj ≤ β_j ≤ b_j + z_(1−α/2) σ c_jj) = 1 − α.

However, σ² is typically unknown. Replacing σ² by s² results in

(b_j − β_j)/(s c_jj) ~ t(n − k).

If Z ~ N(0, 1), W ~ χ²(n − k), and Z and W are independently distributed, then Z/√(W/(n − k)) ~ t(n − k). More discussion on the t distribution in class.

Hence the confidence interval becomes

P(b_j − t_(1−α/2, n−k) s c_jj ≤ β_j ≤ b_j + t_(1−α/2, n−k) s c_jj) = 1 − α.
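As a sketch of how such an interval is computed in practice, the data step below uses SAS's TINV function; the estimate and standard error are placeholder values, not taken from any example here.

data _null_;
  b2 = 0.33;  se = 0.17;  df = 22;     /* placeholder estimate, s.e., d.o.f. */
  tcrit = tinv(0.975, df);             /* t_(1-alpha/2, n-k) for alpha = 0.05 */
  lower = b2 - tcrit*se;
  upper = b2 + tcrit*se;
  put tcrit= lower= upper=;
run;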

Hypothesis tests about β_j can also be performed. The most common test about a coefficient in a regression is:

H_0: β_j = β_j* vs. H_1: β_j ≠ β_j*

at a significance level α, the probability of rejecting H_0 when H_0 is correct, the so-called Type I error. To test this hypothesis, a t statistic is used:

t = (b_j − β_j*)/(s c_jj)

If H_0 is true then t has a t distribution with n − k degrees of freedom.

If H_0 is true, then t is expected to lie not too far from the centre of the distribution. The decision rule is: reject H_0 if t > t_(1−α/2, n−k) or t < −t_(1−α/2, n−k); do not reject otherwise.

Testing this hypothesis is equivalent to asking whether β_j* lies in the 100(1 − α) percent C.I. for β_j. It is common practice to test the hypothesis H_0: β_j = 0. Failure to reject this hypothesis would imply that β_j is not significantly different from zero, or equivalently, that X_j has no significant impact on the behaviour of Y, at level of significance α.
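A minimal data-step sketch of the test and its two-sided p-value, again with placeholder numbers:

data _null_;
  b2 = 0.33;  se = 0.17;  df = 22;  beta0 = 0;   /* placeholders; H0: beta_j = 0 */
  t = (b2 - beta0)/se;
  tcrit = tinv(0.975, df);
  p = 2*(1 - probt(abs(t), df));    /* two-sided p-value */
  reject = (abs(t) > tcrit);        /* 1 = reject H0 at alpha = 0.05 */
  put t= p= reject=;
run;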

Return to Example 1.3, and consider the estimation of β_2. Back in Chapter 1, we already computed b_2. The output reports e'e = Σ_{i=1}^{25} e_i². Note that d.o.f. = 25 − 3 = 22; hence s² = e'e/22. From (X'X)^{-1} we obtain c_22², and hence s.e.(b_2) = s c_22 = 0.1721.

Set α = 0.05. From the t distribution table, t_(1−0.05/2, 22) = 2.074. Hence the 95 percent C.I. for β_2 is

P(b_2 − (2.074)(0.1721) ≤ β_2 ≤ b_2 + (2.074)(0.1721)) = 0.95,

that is, the interval b_2 ± 0.357.

This C.I. contains 0, meaning that if we test H_0: β_2 = 0 vs. H_1: β_2 ≠ 0, we would not be able to reject H_0 at α = 0.05. This is indeed the case. Note that for testing H_0, t = b_2/s.e.(b_2) = 1.928, which lies to the left of t_(1−0.05/2, 22) = 2.074. Hence H_0 cannot be rejected at α = 0.05.

Alternatively, one can base the decision on the p-value, which is the probability of obtaining a value of t at least as extreme as the actual computed value if H_0 is true. In our example, the p-value is P(t > 1.928 or t < −1.928) under the t(22) distribution.

The p-value can be viewed as the minimum level of significance that results in a rejection of H_0. Thus, a decision rule using the p-value may be stated as: reject H_0 if p-value < α; do not reject H_0 if p-value ≥ α.

In the last example, we cannot reject H_0 at α = 0.05; on the other hand, H_0 can be rejected at any significance level at or above the p-value. Similarly, we can test H_0: β_3 = 0 and conclude that H_0 is rejected at α = 0.05.

Altogether, allowing for a 5% Type I risk, disposable income is not significant for explaining consumption but the total value of assets is significant. Note that if we conclude that β_j = 0, it does not necessarily follow that X_j is unrelated to Y. It simply means that, when the other explanatory variables are included in the model, the marginal contribution of X_j to further improving the model's fit is negligible.

Sometimes it also makes sense to conduct a hypothesis test for the intercept coefficient. This should be done only when there are data that span X = 0, or at least lie near X = 0, and the difference between Y equaling zero and not equaling zero when X = 0 is scientifically plausible and interesting.

Type 1 and Type 2 errors

Rejecting H_0 when it is true is called a Type 1 error. Recall that if H_0 is true, the probability that it will be (incorrectly) rejected is

P(t > t_(1−α/2, n−k)) + P(t < −t_(1−α/2, n−k)) = α.

This is the significance level; by choosing α, we effectively determine the probability that the test will incorrectly reject a true hypothesis.

If H_0 is false and it is not rejected, then a Type 2 error has been committed. While we can fix P(Type 1 error), the same control of the Type 2 error is not possible. See the following diagram for an illustration for testing H_0: β_2 = 50 with σ² known and a given var(b_2). Suppose that β_2 equals either 50 or a specific alternative value. Note that the Type 2 error probability depends on the true value of β_2, which is unknown in practice.

[Diagram: sampling distributions of b_2 under H_0 and under the alternative, illustrating the Type 1 and Type 2 error probabilities.]

Most elementary texts define the power of the test as the probability of rejecting a false H_0, i.e., the probability of doing the right thing in the face of an incorrect H_0. By this definition, the power is equal to 1 minus the Type 2 error probability. Sometimes the power is simply defined as the probability of rejecting H_0. By this definition, α, the significance level, is a point on the power curve.

Let H_0: β_2 = β_2*. Then

P(Type 2 error | β_2 ≠ β_2*) = P(not rejecting H_0 | β_2 ≠ β_2*),
Power(β_2) = P(rejecting H_0 | β_2).

A test is unbiased if Power(β_2 | β_2 ≠ β_2*) ≥ P(Type 1 error). For a test where H_0 corresponds to a point in the parameter space (e.g., a two-sided t test), the significance level is a point on the power curve. For a test where H_0 corresponds to a region in the parameter space (e.g., a one-sided t test), the significance level is the maximum probability of committing a Type 1 error within the region defined by H_0, and P(Type 1 error) has a range of values with α being the maximum of that range.
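For the known-σ case in the diagram, the power at a given alternative follows from the normal c.d.f. The sketch below assumes illustrative values for the alternative and for the standard error of b_2 (neither is taken from the slides):

data _null_;
  beta0 = 50;          /* hypothesized value under H0 */
  beta1 = 48;          /* assumed true value under the alternative */
  se    = 1;           /* assumed (known) standard error of b2 */
  zcrit = probit(0.975);                     /* two-sided test, alpha = 0.05 */
  delta = (beta1 - beta0)/se;
  power = probnorm(-zcrit - delta) + (1 - probnorm(zcrit - delta));
  put power=;                                /* Type 2 error prob. = 1 - power */
run;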

Partitioning of Total Sum of Squares

Analysis of variance (ANOVA) is a useful and flexible way of analysing the fit of the regression. To motivate, consider

y_i = ŷ_i + e_i
y_i − ȳ = (ŷ_i − ȳ) + e_i
Σ_{i=1}^n (y_i − ȳ)² = Σ_{i=1}^n (ŷ_i − ȳ)² + Σ_{i=1}^n e_i² + 2 Σ_{i=1}^n (ŷ_i − ȳ)e_i

Note that

Σ_{i=1}^n (y_i − ȳ)² = Total Sum of Squares (TSS)
Σ_{i=1}^n (ŷ_i − ȳ)² = Regression Sum of Squares (RSS)
Σ_{i=1}^n e_i² = Error Sum of Squares (ESS)
Σ_{i=1}^n (ŷ_i − ȳ)e_i = 0 (provided that there is an intercept)

Coefficient of Determination

Thus, TSS = RSS + ESS, or

R² = RSS/TSS = 1 − ESS/TSS,

which is the coefficient of determination. It measures the model's goodness of fit: the proportion of variability of the sample Y values that has been explained by the regression. Obviously, 0 ≤ R² ≤ 1.

Partitioning of Degrees of Freedom

TSS has n − 1 d.o.f. because there are n deviations y_i − ȳ entering TSS, but one constraint on the deviations, namely Σ_{i=1}^n (y_i − ȳ) = 0. So there are n − 1 d.o.f. in the deviations.

ESS has n − k d.o.f. because there are n residuals, but k d.o.f. are lost due to the k constraints on the e_i's associated with estimating the β's.

RSS has k − 1 d.o.f. because the regression function contains k parameters, but the deviations ŷ_i − ȳ are subject to the constraint Σ_{i=1}^n (ŷ_i − ȳ) = 0.

The d.o.f. add up: (n − 1) = (n − k) + (k − 1).

Mean Squares

A sum of squares divided by its associated d.o.f. is called a mean square.

Mean Square Regression (MSR) = RSS/(k − 1)
Mean Square Error (MSE) = ESS/(n − k)

In Example 1.3, n = 25 and k = 3; with the values of RSS, ESS and TSS from the output, R² = RSS/TSS, MSR = RSS/2 and MSE = ESS/22.

Overall Significance of the Model

Frequently, one may wish to test whether or not there is a relationship between Y and the regression model constructed. It is a test of

H_0: β_2 = β_3 = ... = β_k = 0 vs. H_1: at least one of the β_j's, j = 2, ..., k, is non-zero.

The test statistic is

F = MSR/MSE = [RSS/(k − 1)]/[ESS/(n − k)],

distributed as F(k − 1, n − k) if H_0 is true.

The decision rule is: reject H_0 if F > F_(1−α, k−1, n−k) or p-value < α; do not reject H_0 otherwise.

F ~ F(k − 1, n − k) because under H_0, RSS/σ² ~ χ²(k − 1), ESS/σ² ~ χ²(n − k), and RSS and ESS are distributed independently.

Referring to Example 1.3, the computed F far exceeds F_(0.95, 2, 22); hence we reject H_0 convincingly at significance level 0.05. We could fail to reject H_0 only if α were set to the order of 10^{-8} or lower, as indicated by the test statistic's p-value.

Why do we perform an F test in addition to t tests? What can we learn from the F test?

In the intercept-only model, all of the fitted values equal the mean of the response variable. Therefore, if the overall F test is significant, the regression model predicts the response better than the mean of the response.

While R² provides an estimate of the strength of the relationship, it does not provide a formal hypothesis test for this relationship. If the overall F test is significant, one can conclude that R² is significantly different from zero. In fact, the F statistic can be written as

F = [R²/(k − 1)]/[(1 − R²)/(n − k)]
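A quick sketch of computing the overall F statistic and its p-value from R² alone, with placeholder values for R², n and k:

data _null_;
  r2 = 0.80;  n = 25;  k = 3;        /* placeholder values */
  f = (r2/(k-1)) / ((1-r2)/(n-k));
  fcrit = finv(0.95, k-1, n-k);      /* critical value at alpha = 0.05 */
  p = 1 - probf(f, k-1, n-k);        /* p-value of the overall F test */
  put f= fcrit= p=;
run;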

If the overall F test is significant but few or none of the t tests are significant, then it is an indication that multicollinearity might be a problem for the data. More on multicollinearity in Chapter 3.

Notice that for a simple linear regression model, the null hypothesis for the overall F test is simply β_2 = 0, which is precisely the same null as for the t test of β_2 = 0. In fact, when there is a single slope coefficient, F(1, n − k) = t²(n − k). In Example 1.1, for testing H_0: β_2 = 0, F = (9.717)² = t², and the p-values are exactly the same. In Example 1.2, for testing H_0: β_2 = 0, F = (13.937)² = t², and again the p-values are exactly the same.
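The identity also holds at the level of critical values, which is easy to verify numerically:

data _null_;
  df = 22;
  t2 = tinv(0.975, df)**2;    /* squared two-sided t critical value */
  f  = finv(0.95, 1, df);     /* F critical value with (1, df) d.o.f. */
  put t2= f=;                 /* the two quantities coincide */
run;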

F test for linear restrictions

In fact, the usefulness of the F test is not limited to testing overall significance. The F test can be used for testing any linear equality restrictions on β. The general formula for the F statistic is

F = [(e'e_r − e'e_ur)/m]/[e'e_ur/(n − k)] = [(R²_ur − R²_r)/m]/[(1 − R²_ur)/(n − k)] ~ F(m, n − k) under H_0,   (1)

where the subscripts "ur" and "r" correspond to the unrestricted and restricted models respectively, and m is the number of restrictions under H_0.

e'e_r is the ESS associated with the restricted model (i.e., the model that imposes the restrictions implied by H_0); e'e_ur is the ESS associated with the unrestricted model (i.e., the model that ignores the restrictions). R²_r and R²_ur are defined analogously.

The F statistic for testing H_0: β_2 = β_3 = ... = β_k = 0 is a special case of (1), because under this H_0, m = k − 1, e'e_r = TSS (the restricted model has no explanatory power) and, correspondingly, R²_r = 0.

Example 2.1. One model of production that is widely used in economics is the Cobb-Douglas production function:

y_i = β_1 x_{2i}^{β_2} x_{3i}^{β_3} exp(ɛ_i),

where y_i = output, x_{2i} = labour input and x_{3i} = capital input. Or, in log-transformed terms,

ln y_i = ln β_1 + β_2 ln x_{2i} + β_3 ln x_{3i} + ɛ_i
       = β_1* + β_2 ln x_{2i} + β_3 ln x_{3i} + ɛ_i,

where β_1* = ln β_1.

To illustrate, we use annual data (15 years) for the agricultural sector of Taiwan. Results obtained using SAS:

The REG Procedure
Model: MODEL1
Dependent Variable: lny
Number of Observations Read/Used: 15

[Analysis of variance table: Model d.f. 2, Error d.f. 12, Corrected Total d.f. 14; overall F with Pr > F < .0001. Parameter estimates, standard errors, t values and p-values for Intercept, lnx2 and lnx3; Root MSE, R-Square, Adj R-Sq, Dependent Mean and Coeff Var also reported. Numerical values lost in transcription.]

β_2 is the elasticity of output with respect to the labour input; it measures the percentage change in output due to a one percent change in labour input. β_3 is interpreted analogously.

The sum β_2 + β_3 gives information on returns to scale, that is, the response of output to a proportional change in the inputs. In particular, if this sum is 1, then there are constant returns to scale: doubling the inputs will double the output. Hence one may be interested in testing H_0: β_2 + β_3 = 1.

The restricted model is one that imposes the restriction β_2 + β_3 = 1 on the coefficients in the minimisation of the SSE. The least squares estimator (referred to as the restricted least squares (R.L.S.) estimator) is obtained by minimising the objective function

φ = (y − Xb*)'(y − Xb*) − 2λ'(Rb* − r),

where R is an m × k matrix of constants, r is an m × 1 vector of constants and b* is the R.L.S. estimator. For this example, R = [0 1 1] and r = 1.
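Solving the first-order conditions gives the standard closed form b* = b − (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(Rb − r), where b is the unrestricted O.L.S. estimator (this form is not derived in these notes). A PROC IML sketch with hypothetical data:

proc iml;
/* hypothetical data: columns are intercept, lnx2, lnx3 */
y = {2.1, 2.9, 3.8, 5.2, 6.1, 7.0};
x = {1 0.5 1.0,
     1 1.0 1.2,
     1 1.5 1.7,
     1 2.0 2.4,
     1 2.5 2.6,
     1 3.0 3.1};
Rmat = {0 1 1};  rvec = {1};                          /* restriction: beta2 + beta3 = 1 */
xxi = inv(x`*x);
b   = xxi * x` * y;                                   /* unrestricted O.L.S. */
bs  = b - xxi*Rmat`*inv(Rmat*xxi*Rmat`)*(Rmat*b - rvec);  /* restricted least squares b* */
print b bs (Rmat*bs)[label="Rb*"];                    /* Rb* equals 1 by construction */
quit;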

The SAS results are as follows:

The REG Procedure
Model: MODEL1
Dependent Variable: lny
NOTE: Restrictions have been applied to parameter estimates.
Number of Observations Read/Used: 15

[Analysis of variance and parameter-estimate tables for the restricted fit, in the same layout as before. Numerical values lost in transcription.]

Using the F test procedure, to test H_0: β_2 + β_3 = 1 vs. H_1: otherwise,

F = [(R²_ur − R²_r)/m]/[(1 − R²_ur)/(n − k)] = [(R²_ur − R²_r)/1]/[(1 − R²_ur)/12] = 4.34.

At α = 0.05, F_(0.95, 1, 12) = 4.75 > 4.34. Hence we cannot reject H_0 at the 0.05 level of significance, and we conclude that returns to scale are constant.

SAS can perform the test automatically. The result from the following output concurs with our calculations; the reported p-value indicates the smallest α at which H_0 could be rejected.

The REG Procedure
Model: MODEL1
Test 1 Results for Dependent Variable lny

[Numerator (DF = 1) and Denominator (DF = 12) mean squares, F Value and Pr > F. Numerical values lost in transcription.]

Coefficient of variation

The coefficient of variation is obtained by dividing the standard error of the regression by the mean of the y_i values and multiplying by 100. It expresses the standard error of the regression in unit-free terms. Thus the coefficients of variation for two different regressions can be compared more readily than the standard errors, because the influence of the units of the data has been removed. The SAS program for Example 2.1 is as follows.

SAS Program for Example 2.1

ods html close;
ods listing;
data example21;
  input y x2 x3;
  lny = log(y);
  lnx2 = log(x2);
  lnx3 = log(x3);
/* the 15 data lines following CARDS were lost in transcription */
cards;
;
proc reg data=example21;
  model lny = lnx2 lnx3;
  test lnx2 + lnx3 = 1;     /* F test of the restriction */
run;
proc reg data=example21;
  model lny = lnx2 lnx3;
  restrict lnx2 + lnx3 = 1; /* re-estimate imposing the restriction */
run;
ods html close;
ods html;
run;

Adjusted Coefficient of Determination

Some statisticians have suggested modifying R² to recognise the number of independent variables in the model. The reason is that R² can generally be made larger simply by adding explanatory variables to the model. A measure that recognises the number of explanatory variables in the model is the adjusted coefficient of determination:

R²_a = 1 − [ESS/(n − k)]/[TSS/(n − 1)]

For the unrestricted model of Example 2.1, R²_a = 1 − (ESS/12)/(TSS/14). Here the adjustment has only a small effect, as R²_a is almost the same as R².
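Equivalently, R²_a = 1 − (1 − R²)(n − 1)/(n − k), which is convenient when only R² is reported. A one-line sketch with placeholder values:

data _null_;
  r2 = 0.89;  n = 15;  k = 3;               /* placeholder values */
  r2a = 1 - (1 - r2)*(n - 1)/(n - k);       /* adjusted R-square */
  put r2a=;
run;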

Inference on Prediction

A point prediction is obtained by inserting the given X values into the regression equation, giving

ŷ_f = b_1 + b_2 x_{f2} + b_3 x_{f3} + ... + b_k x_{fk}.

Let g = (1, x_{f2}, x_{f3}, ..., x_{fk})'. Then ŷ_f = g'b. Note that var(g'b) = g' Cov(b) g. If we assume normality for the disturbance term, it follows that

(g'b − g'β)/√(var(g'b)) ~ N(0, 1).

When the unknown σ² in var(g'b) is replaced by s², the usual shift to the t distribution occurs, giving

(ŷ_f − E(y_f))/(s√(g'(X'X)^{-1}g)) ~ t(n − k),

from which a 100(1 − α) percent confidence (or prediction) interval for E(y_f) is

ŷ_f ± t_(1−α/2, n−k) s√(g'(X'X)^{-1}g).   (2)
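A PROC IML sketch of this interval, reusing the hypothetical data from the earlier sketches; g holds the regressor values at which the mean response is predicted:

proc iml;
y = {3, 5, 4, 7, 8};
x = {1 1, 1 2, 1 3, 1 4, 1 5};
b = inv(x`*x) * x` * y;
n = nrow(x);  k = ncol(x);
s = sqrt( (y - x*b)`*(y - x*b) / (n - k) );
g = {1, 6};                                  /* predict at x_f = 6 */
yhat  = g`*b;
se    = s * sqrt(g`*inv(x`*x)*g);            /* std. error of g'b */
tcrit = tinv(0.975, n - k);
print yhat (yhat - tcrit*se)[label="lower"] (yhat + tcrit*se)[label="upper"];
quit;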

Returning to Example 1.3, the estimated regression equation is

ŷ_i = b_1 + b_2 x_{i2} + b_3 x_{i3}

(with the numerical coefficients computed in Chapter 1). A family with an annual disposable income of $50,000 and liquid assets worth $100,000 is predicted to spend ŷ_f = b_1 + b_2(50) + b_3(100) thousand dollars on non-durable goods and services in a year.

For this example, with g = (1, 50, 100)' and the (X'X)^{-1} matrix reported in the output, the quadratic form g'(X'X)^{-1}g is computed directly.

With s from the output and t_(0.975, 22) = 2.074, the 95% prediction interval for E(y_f) is ŷ_f ± 2.074(38.436) = ŷ_f ± 79.72.

Sometimes one may wish to obtain a prediction interval for y_f rather than E(y_f). The two differ only by the disturbance term ɛ_f, which is unpredictable with a mean of 0, so the point prediction remains the same. However, the uncertainty of the prediction increases due to the presence of ɛ_f. Now, y_f = g'β + ɛ_f. Therefore,

e_f = y_f − ŷ_f = ɛ_f − g'(b − β).

Squaring both sides and taking expectations gives

var(e_f) = σ² + g' Cov(b) g = σ²(1 + g'(X'X)^{-1}g),

from which we can derive the following t statistic:

(ŷ_f − y_f)/(s√(1 + g'(X'X)^{-1}g)) ~ t(n − k),

which leads to the 100(1 − α) percent confidence interval for y_f:

ŷ_f ± t_(1−α/2, n−k) s√(1 + g'(X'X)^{-1}g).

Comparison with (2) shows that the only difference is the extra 1 inside the square root term. Thus, for the data in Example 1.3, the prediction interval for y_f is ŷ_f ± 2.074 s√(1 + g'(X'X)^{-1}g), which is wider than the interval for E(y_f). One can obtain these outputs directly using SAS by adding the following options to PROC REG: /p CLM CLI;
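For instance, a sketch of the call (the dataset and variable names are assumed, mirroring Example 1.3):

proc reg data=example13;          /* hypothetical dataset name */
  model y = x2 x3 / p clm cli;    /* P: predicted values; CLM: CI for E(y_f); CLI: CI for y_f */
run;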

The REG Procedure
Model: MODEL1
Dependent Variable: y

[Output Statistics: for each observation, the dependent variable, predicted value, standard error of the mean prediction, 95% CL Mean, 95% CL Predict and residual. Also reported: Sum of Residuals = 0, Sum of Squared Residuals, and the Predicted Residual SS (PRESS). Numerical values lost in transcription.]


More information

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables.

Regression Analysis. BUS 735: Business Decision Making and Research. Learn how to detect relationships between ordinal and categorical variables. Regression Analysis BUS 735: Business Decision Making and Research 1 Goals of this section Specific goals Learn how to detect relationships between ordinal and categorical variables. Learn how to estimate

More information

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model

Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Wooldridge, Introductory Econometrics, 4th ed. Chapter 2: The simple regression model Most of this course will be concerned with use of a regression model: a structure in which one or more explanatory

More information

Table 1: Fish Biomass data set on 26 streams

Table 1: Fish Biomass data set on 26 streams Math 221: Multiple Regression S. K. Hyde Chapter 27 (Moore, 5th Ed.) The following data set contains observations on the fish biomass of 26 streams. The potential regressors from which we wish to explain

More information

Lecture 4: Multivariate Regression, Part 2

Lecture 4: Multivariate Regression, Part 2 Lecture 4: Multivariate Regression, Part 2 Gauss-Markov Assumptions 1) Linear in Parameters: Y X X X i 0 1 1 2 2 k k 2) Random Sampling: we have a random sample from the population that follows the above

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Instrumental Variables, Simultaneous and Systems of Equations

Instrumental Variables, Simultaneous and Systems of Equations Chapter 6 Instrumental Variables, Simultaneous and Systems of Equations 61 Instrumental variables In the linear regression model y i = x iβ + ε i (61) we have been assuming that bf x i and ε i are uncorrelated

More information

Multiple Regression Analysis: Estimation. Simple linear regression model: an intercept and one explanatory variable (regressor)

Multiple Regression Analysis: Estimation. Simple linear regression model: an intercept and one explanatory variable (regressor) 1 Multiple Regression Analysis: Estimation Simple linear regression model: an intercept and one explanatory variable (regressor) Y i = β 0 + β 1 X i + u i, i = 1,2,, n Multiple linear regression model:

More information

Auto correlation 2. Note: In general we can have AR(p) errors which implies p lagged terms in the error structure, i.e.,

Auto correlation 2. Note: In general we can have AR(p) errors which implies p lagged terms in the error structure, i.e., 1 Motivation Auto correlation 2 Autocorrelation occurs when what happens today has an impact on what happens tomorrow, and perhaps further into the future This is a phenomena mainly found in time-series

More information

Correlation and the Analysis of Variance Approach to Simple Linear Regression

Correlation and the Analysis of Variance Approach to Simple Linear Regression Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation

More information

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK

REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK REED TUTORIALS (Pty) LTD ECS3706 EXAM PACK 1 ECONOMETRICS STUDY PACK MAY/JUNE 2016 Question 1 (a) (i) Describing economic reality (ii) Testing hypothesis about economic theory (iii) Forecasting future

More information

The Simple Linear Regression Model

The Simple Linear Regression Model The Simple Linear Regression Model Lesson 3 Ryan Safner 1 1 Department of Economics Hood College ECON 480 - Econometrics Fall 2017 Ryan Safner (Hood College) ECON 480 - Lesson 3 Fall 2017 1 / 77 Bivariate

More information

1 Independent Practice: Hypothesis tests for one parameter:

1 Independent Practice: Hypothesis tests for one parameter: 1 Independent Practice: Hypothesis tests for one parameter: Data from the Indian DHS survey from 2006 includes a measure of autonomy of the women surveyed (a scale from 0-10, 10 being the most autonomous)

More information

Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA

Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA Ma 3/103: Lecture 25 Linear Regression II: Hypothesis Testing and ANOVA March 6, 2017 KC Border Linear Regression II March 6, 2017 1 / 44 1 OLS estimator 2 Restricted regression 3 Errors in variables 4

More information

MORE ON SIMPLE REGRESSION: OVERVIEW

MORE ON SIMPLE REGRESSION: OVERVIEW FI=NOT0106 NOTICE. Unless otherwise indicated, all materials on this page and linked pages at the blue.temple.edu address and at the astro.temple.edu address are the sole property of Ralph B. Taylor and

More information

Quantitative Methods for Economics, Finance and Management (A86050 F86050)

Quantitative Methods for Economics, Finance and Management (A86050 F86050) Quantitative Methods for Economics, Finance and Management (A86050 F86050) Matteo Manera matteo.manera@unimib.it Marzio Galeotti marzio.galeotti@unimi.it 1 This material is taken and adapted from Guy Judge

More information

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58 Inference ME104: Linear Regression Analysis Kenneth Benoit August 15, 2012 August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58 Stata output resvisited. reg votes1st spend_total incumb minister

More information

Regression #8: Loose Ends

Regression #8: Loose Ends Regression #8: Loose Ends Econ 671 Purdue University Justin L. Tobias (Purdue) Regression #8 1 / 30 In this lecture we investigate a variety of topics that you are probably familiar with, but need to touch

More information

Lecture 4: Multivariate Regression, Part 2

Lecture 4: Multivariate Regression, Part 2 Lecture 4: Multivariate Regression, Part 2 Gauss-Markov Assumptions 1) Linear in Parameters: Y X X X i 0 1 1 2 2 k k 2) Random Sampling: we have a random sample from the population that follows the above

More information

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity

Outline. Possible Reasons. Nature of Heteroscedasticity. Basic Econometrics in Transportation. Heteroscedasticity 1/25 Outline Basic Econometrics in Transportation Heteroscedasticity What is the nature of heteroscedasticity? What are its consequences? How does one detect it? What are the remedial measures? Amir Samimi

More information

Two-Variable Regression Model: The Problem of Estimation

Two-Variable Regression Model: The Problem of Estimation Two-Variable Regression Model: The Problem of Estimation Introducing the Ordinary Least Squares Estimator Jamie Monogan University of Georgia Intermediate Political Methodology Jamie Monogan (UGA) Two-Variable

More information

Lecture 1 Linear Regression with One Predictor Variable.p2

Lecture 1 Linear Regression with One Predictor Variable.p2 Lecture Linear Regression with One Predictor Variablep - Basics - Meaning of regression parameters p - β - the slope of the regression line -it indicates the change in mean of the probability distn of

More information

Handout 11: Measurement Error

Handout 11: Measurement Error Handout 11: Measurement Error In which you learn to recognise the consequences for OLS estimation whenever some of the variables you use are not measured as accurately as you might expect. A (potential)

More information

Autocorrelation. Think of autocorrelation as signifying a systematic relationship between the residuals measured at different points in time

Autocorrelation. Think of autocorrelation as signifying a systematic relationship between the residuals measured at different points in time Autocorrelation Given the model Y t = b 0 + b 1 X t + u t Think of autocorrelation as signifying a systematic relationship between the residuals measured at different points in time This could be caused

More information

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima

Hypothesis testing Goodness of fit Multicollinearity Prediction. Applied Statistics. Lecturer: Serena Arima Applied Statistics Lecturer: Serena Arima Hypothesis testing for the linear model Under the Gauss-Markov assumptions and the normality of the error terms, we saw that β N(β, σ 2 (X X ) 1 ) and hence s

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

Brief Suggested Solutions

Brief Suggested Solutions DEPARTMENT OF ECONOMICS UNIVERSITY OF VICTORIA ECONOMICS 366: ECONOMETRICS II SPRING TERM 5: ASSIGNMENT TWO Brief Suggested Solutions Question One: Consider the classical T-observation, K-regressor linear

More information

Measurement Error. Often a data set will contain imperfect measures of the data we would ideally like.

Measurement Error. Often a data set will contain imperfect measures of the data we would ideally like. Measurement Error Often a data set will contain imperfect measures of the data we would ideally like. Aggregate Data: (GDP, Consumption, Investment are only best guesses of theoretical counterparts and

More information

Topic 3: Inference and Prediction

Topic 3: Inference and Prediction Topic 3: Inference and Prediction We ll be concerned here with testing more general hypotheses than those seen to date. Also concerned with constructing interval predictions from our regression model.

More information