Regression Analysis


Katrina Pamela Lawson

The multiple linear regression model with $k$ explanatory variables assumes that the $t$th observation of the dependent or endogenous variable $y_t$ is described by the linear relationship

$$y_t = \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + \epsilon_t, \quad t = 1, \ldots, T,$$

where the $x_{ti}$, $t = 1, \ldots, T$, $i = 1, \ldots, k$, are the independent, exogenous, or explanatory variables. If there is a constant in the regression, then, for example, $x_{t1} = 1$ for all $t$.

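To make the notation more compact, define

$$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
\epsilon = \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_T \end{pmatrix}, \quad
X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{Tk} \end{pmatrix},$$

so that $y = X\beta + \epsilon$. We assume that $T > k$ and that $X$ has full column rank, i.e., $\operatorname{rank}(X) = k$. Then we also have $\operatorname{rank}(X'X) = \operatorname{rank}(X) = k$. If there is a constant in the regression, then the first column of $X$ is just a column of ones.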

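For the disturbance or error process $\{\epsilon_t\}_{t=1}^{T}$, we assume that

(i) $E(\epsilon \mid X) = 0$;

(ii) the covariance matrix of the errors is

$$\operatorname{Cov}(\epsilon \mid X) = \sigma^2 I_T = \begin{pmatrix} \sigma^2 & & \\ & \ddots & \\ & & \sigma^2 \end{pmatrix},$$

i.e., the errors are homoskedastic (all have the same variance $\sigma^2$) and uncorrelated, i.e.,

$$E(\epsilon_t \epsilon_\tau) = \begin{cases} \sigma^2 & t = \tau \\ 0 & t \neq \tau. \end{cases}$$

We also frequently assume

$$\epsilon_t \overset{iid}{\sim} N(0, \sigma^2), \quad \text{i.e.,} \quad \epsilon \sim N(0, \sigma^2 I_T), \tag{1}$$

where $I_T$ is the identity matrix of dimension $T$.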

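The normality assumption may often be acceptable for monthly returns, but it is less adequate for higher frequencies such as weekly or daily. A simple test for normality is provided by the Jarque-Bera test (JB), which is

$$JB = T\left(\frac{\hat{S}^2}{6} + \frac{(\hat{K} - 3)^2}{24}\right) \overset{asy}{\sim} \chi^2(2), \tag{2}$$

where

$$\hat{S} = \frac{T^{-1}\sum_t (R_{it} - \bar{R}_i)^3}{\hat\sigma_i^3}, \quad \hat{K} = \frac{T^{-1}\sum_t (R_{it} - \bar{R}_i)^4}{\hat\sigma_i^4} \tag{3}$$

are the sample skewness and sample kurtosis, respectively. Note that both measures are location and scale free, i.e., not affected by a linear transformation $\tilde{R}_i = a + bR_i$ with $b > 0$ (for $b < 0$ the skewness changes sign, while the kurtosis and JB are unchanged).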
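Formulas (2) and (3) can be sketched in a few lines of plain Python. This is only an illustration: the function name `jarque_bera` and the five-point sample are hypothetical, and the variance estimate is assumed to use the divisor $T$ (the slides do not say whether $\hat\sigma_i$ uses $T$ or $T - 1$).

```python
def jarque_bera(returns):
    """Return (skewness, kurtosis, JB) following (2) and (3).

    Assumption: all central moments use the divisor T, not T - 1.
    """
    T = len(returns)
    mean = sum(returns) / T
    m2 = sum((r - mean) ** 2 for r in returns) / T  # variance estimate
    m3 = sum((r - mean) ** 3 for r in returns) / T
    m4 = sum((r - mean) ** 4 for r in returns) / T
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = T * (skew ** 2 / 6 + (kurt - 3) ** 2 / 24)
    return skew, kurt, jb


# Hypothetical sample, for illustration only (far too short for a real test).
sample = [-2.0, -1.0, 0.0, 1.0, 2.0]
skew, kurt, jb = jarque_bera(sample)

# Location and scale invariance: a + b * R with b > 0 leaves all three unchanged.
shifted = jarque_bera([3.0 + 2.0 * r for r in sample])
```

Since the $\chi^2(2)$ distribution is only an asymptotic approximation, the statistic is meaningful for long return series, not for toy samples like the one above.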

The skewness checks for asymmetries (recall that the normal distribution is symmetric and has skewness zero). The kurtosis, due to the fourth moment involved, gives very large values a higher weight and thus checks whether the tails of the empirical distribution are thicker than those of the normal, i.e., whether very large observations (of either sign) are more likely than under normality (which has kurtosis equal to three). A kurtosis significantly larger than three indicates excess kurtosis, i.e., a larger probability of large losses and gains. Such a density typically also has more weight in the center and less weight in the shoulders of the distribution. Considerable excess kurtosis is typical for returns at daily or weekly frequencies, which is clearly of great importance for risk management.

[Figure: normal density compared with a density with excess kurtosis; horizontal axis: return, vertical axis: density.]

Normality is rejected at the 5% level if JB exceeds the right tail critical value of the $\chi^2(2)$ distribution, which is approximately 5.99.
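The 5% critical value can be checked by hand, because a $\chi^2(2)$ variable is exponentially distributed with mean 2. A short worked step (the symbols $c$ and $\alpha$ are introduced here for illustration):

```latex
P\bigl(\chi^2(2) > c\bigr) = e^{-c/2} = \alpha
\quad\Longrightarrow\quad
c = -2\ln\alpha,
\qquad
c_{0.05} = -2\ln(0.05) \approx 5.99 .
```

The same formula gives about 4.61 at the 10% level and about 9.21 at the 1% level.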

19 DAX stocks, monthly data, January 1995 to December 2004; the index is the DAX.

i    Company      skewness   kurtosis   JB
1    Allianz
2    BASF
3    BayHypo
4    BMW
5    Bayer
6    CommBank
7    Conti
8    DtBank
9    Eon
10   Henkel
11   Linde
12   Lufthansa
13   MAN
14   RWE
15   Schering
16   Siemens
17   Thyssen
18   TUI
19   VW

*, **, and *** indicate significance at the 10%, 5%, and 1% level, respectively.

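Ordinary Least Squares (OLS) Estimation

The OLS estimator is defined to be the vector $\hat\beta$ which minimizes the residual sum of squares,

$$S(\hat\beta) = (y - X\hat\beta)'(y - X\hat\beta) = y'y - 2y'X\hat\beta + \hat\beta'X'X\hat\beta = \hat\epsilon'\hat\epsilon,$$

where $\hat\epsilon = y - X\hat\beta$. To minimize the sum of squares, we calculate

$$\frac{\partial S(\hat\beta)}{\partial \hat\beta} = -2X'y + 2X'X\hat\beta = 0 \quad\Longrightarrow\quad \hat\beta = (X'X)^{-1}X'y. \tag{4}$$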
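As a minimal numerical sketch of (4), the normal equations can be solved with NumPy and cross-checked against `numpy.linalg.lstsq`. The simulated data, the seed, and the names `beta_true` and `beta_lstsq` are assumptions made for this illustration only.

```python
import numpy as np

# Hypothetical simulated data: a constant plus two regressors.
rng = np.random.default_rng(0)
T, k = 200, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=T)

# Equation (4): solve the normal equations X'X b = X'y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# Cross-check with NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the linear system avoids forming $(X'X)^{-1}$ explicitly, which is usually the numerically safer choice; the first-order condition implies that the residuals are orthogonal to every column of $X$.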

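For example, consider the single index model (SIM),

$$R_{it} = \alpha_i + \beta_i R_{Mt} + \epsilon_{it}, \quad t = 1, \ldots, T. \tag{5}$$

In this case,

$$X = \begin{pmatrix} 1 & R_{M1} \\ 1 & R_{M2} \\ \vdots & \vdots \\ 1 & R_{MT} \end{pmatrix}, \quad y = \begin{pmatrix} R_{i1} \\ R_{i2} \\ \vdots \\ R_{iT} \end{pmatrix}, \tag{6}$$

$$X'X = \begin{pmatrix} T & \sum_{t=1}^{T} R_{Mt} \\ \sum_{t=1}^{T} R_{Mt} & \sum_{t=1}^{T} R_{Mt}^2 \end{pmatrix}, \tag{7}$$

$$X'y = \begin{pmatrix} \sum_{t=1}^{T} R_{it} \\ \sum_{t=1}^{T} R_{Mt} R_{it} \end{pmatrix}. \tag{8}$$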

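$$(X'X)^{-1} = \begin{pmatrix} T & \sum R_{Mt} \\ \sum R_{Mt} & \sum R_{Mt}^2 \end{pmatrix}^{-1}
= \frac{1}{T\sum R_{Mt}^2 - \left(\sum R_{Mt}\right)^2} \begin{pmatrix} \sum R_{Mt}^2 & -\sum R_{Mt} \\ -\sum R_{Mt} & T \end{pmatrix}
= \frac{1}{T s_M^2} \begin{pmatrix} \overline{R_M^2} & -\bar{R}_M \\ -\bar{R}_M & 1 \end{pmatrix},$$

where $\bar{R}_M = T^{-1}\sum R_{Mt}$, $\overline{R_M^2} = T^{-1}\sum R_{Mt}^2$, and $s_M^2 = \overline{R_M^2} - \bar{R}_M^2 = T^{-1}\sum (R_{Mt} - \bar{R}_M)^2$. Hence

$$\begin{pmatrix} \hat\alpha_i \\ \hat\beta_i \end{pmatrix}
= \frac{1}{T s_M^2} \begin{pmatrix} \overline{R_M^2} & -\bar{R}_M \\ -\bar{R}_M & 1 \end{pmatrix} \begin{pmatrix} \sum R_{it} \\ \sum R_{Mt} R_{it} \end{pmatrix}
= \frac{1}{s_M^2} \begin{pmatrix} \overline{R_M^2} & -\bar{R}_M \\ -\bar{R}_M & 1 \end{pmatrix} \begin{pmatrix} \bar{R}_i \\ \overline{R_M R_i} \end{pmatrix}.$$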

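That is,

$$\hat\beta_i = \frac{\overline{R_M R_i} - \bar{R}_i \bar{R}_M}{s_M^2} = \frac{T^{-1}\sum_t (R_{it} - \bar{R}_i)(R_{Mt} - \bar{R}_M)}{s_M^2} = \frac{s_{iM}}{s_M^2},$$

$$\hat\alpha_i = \frac{\overline{R_M^2}\,\bar{R}_i - \bar{R}_M\,\overline{R_M R_i}}{s_M^2}
= \frac{\bar{R}_i\left(\overline{R_M^2} - \bar{R}_M^2\right) - \bar{R}_M\left(\overline{R_M R_i} - \bar{R}_M \bar{R}_i\right)}{s_M^2}
= \bar{R}_i - \frac{\overline{R_M R_i} - \bar{R}_M \bar{R}_i}{s_M^2}\,\bar{R}_M
= \bar{R}_i - \hat\beta_i \bar{R}_M.$$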
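The closed-form expressions for the single index model can be checked against the general matrix formula. The sketch below uses simulated returns; the series `r_m` and `r_i`, the seed, and the parameter values are hypothetical.

```python
import numpy as np

# Hypothetical monthly returns for the market and for asset i.
rng = np.random.default_rng(1)
T = 120
r_m = rng.normal(0.005, 0.05, size=T)
r_i = 0.001 + 1.2 * r_m + rng.normal(0.0, 0.03, size=T)

# Closed form: beta = s_iM / s_M^2, alpha = mean(R_i) - beta * mean(R_M),
# with all moments using the divisor T.
s_im = np.mean((r_i - r_i.mean()) * (r_m - r_m.mean()))
s2_m = np.mean((r_m - r_m.mean()) ** 2)
beta_i = s_im / s2_m
alpha_i = r_i.mean() - beta_i * r_m.mean()

# General formula (X'X)^{-1} X'y with X = [1, R_M].
X = np.column_stack([np.ones(T), r_m])
alpha_ols, beta_ols = np.linalg.solve(X.T @ X, X.T @ r_i)
```

Both routes give the same estimates up to floating-point error, because the divisor ($T$ or $T - 1$) cancels in the ratio $s_{iM}/s_M^2$.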

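Goodness of Fit

The coefficient of determination is often used to informally judge how well the model explains the observed variations in the endogenous variable. This is based on

$$\hat\epsilon'\hat\epsilon = (y - X\hat\beta)'(y - X\hat\beta) = y'y - 2\hat\beta'X'y + \hat\beta'X'X\hat\beta = y'y - \hat{y}'\hat{y}, \tag{9}$$

where $\hat{y} = X\hat\beta$. The last equality uses $\hat\beta'X'y = \hat\beta'X'X(X'X)^{-1}X'y = \hat\beta'X'X\hat\beta$.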

If there is a constant term in the regression, the residuals sum to zero, and we have $\bar{y} = \bar{\hat{y}}$, so that

$$s_y^2 = s_{\hat{y}}^2 + s_{\hat\epsilon}^2, \tag{10}$$

and the $R^2$ is

$$R^2 = \frac{s_{\hat{y}}^2}{s_y^2} = 1 - \frac{s_{\hat\epsilon}^2}{s_y^2}, \quad 0 \le R^2 \le 1.$$

In the context of the single index model (SIM), the coefficient of determination can be interpreted as an estimate of the fraction of variability of asset $i$ that is due to movements of the market (proportion of nondiversifiable risk). The remaining part of the variability is then due to nonmarket factors and can be diversified away. If there is no constant term in the regression, we can still define an $R^2$, but it does not have the same interpretation, as (10) is no longer valid.

That is, (9) still holds, but it cannot be interpreted in terms of a variance decomposition.

19 DAX stocks, monthly data, January 1995 to December 2004; the index is the DAX.

i    Company      $\hat\beta_i$   $R^2$
1    Allianz
2    BASF
3    BayHypo
4    BMW
5    Bayer
6    CommBank
7    Conti
8    DtBank
9    Eon
10   Henkel
11   Linde
12   Lufthansa
13   MAN
14   RWE
15   Schering
16   Siemens
17   Thyssen
18   TUI
19   VW

A problem with $R^2$ is that it does not take into account the number of regressors. The adjusted $R^2$ is

$$R_{adj}^2 = 1 - \frac{s_{\hat\epsilon}^2}{s_y^2}\,\frac{T - 1}{T - k} = 1 - (1 - R^2)\,\frac{T - 1}{T - k},$$

which may decrease with the inclusion of a new regressor, and may even become negative (so the term "R squared" may appear misleading). However, we may want to compare different factor models for returns by applying criteria which are directly related to the prospective use of the model, i.e., out-of-sample forecasting of the correlation structure. For example, a simple approach is to calculate global minimum variance portfolios over the out-of-sample period and subsequently compare the realized portfolio variances implied by different models.
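A small sketch of how $R^2$ and the adjusted $R^2$ react to an irrelevant regressor. The helper `r2_and_adjusted` and the simulated data are hypothetical, and the regressor matrix is assumed to contain a constant column so that (10) applies.

```python
import numpy as np


def r2_and_adjusted(y, X):
    """Return (R^2, adjusted R^2) for an OLS fit; X must include a constant."""
    T, k = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_hat
    s2_resid = np.mean(resid ** 2)       # residuals have mean zero here
    s2_y = np.mean((y - y.mean()) ** 2)
    r2 = 1.0 - s2_resid / s2_y
    r2_adj = 1.0 - (1.0 - r2) * (T - 1) / (T - k)
    return r2, r2_adj


# Hypothetical data: y depends on x only; noise_regressor is irrelevant.
rng = np.random.default_rng(2)
T = 60
x = rng.normal(size=T)
noise_regressor = rng.normal(size=T)
y = 0.5 + 1.0 * x + rng.normal(size=T)

X_small = np.column_stack([np.ones(T), x])
X_big = np.column_stack([np.ones(T), x, noise_regressor])
r2_small, adj_small = r2_and_adjusted(y, X_small)
r2_big, adj_big = r2_and_adjusted(y, X_big)
```

Adding the irrelevant regressor can never lower $R^2$, whereas the adjusted version rises only if the gain in fit outweighs the penalty factor $(T - 1)/(T - k)$.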

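Distribution of $\hat\beta$

We have

$$\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \epsilon) = \beta + (X'X)^{-1}X'\epsilon,$$

where $\epsilon \sim N(0, \sigma^2 I_T)$. It follows that $\hat\beta$ is normally distributed with mean $\beta$ (unbiasedness) and covariance matrix

$$(X'X)^{-1}X'\,\underbrace{\operatorname{Cov}(\epsilon)}_{=\sigma^2 I_T}\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1},$$

i.e., $\hat\beta \sim N(\beta, \sigma^2 (X'X)^{-1})$.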

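Estimation of $\sigma^2$

An unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{\hat\epsilon'\hat\epsilon}{T - k}. \tag{11}$$

Further results (under normality of the errors) are:

$$\frac{\hat\epsilon'\hat\epsilon}{\sigma^2} \sim \chi^2(T - k), \tag{12}$$

a $\chi^2$ distribution with $T - k$ degrees of freedom.

$\hat\beta$ and $\hat\sigma^2$ are independent.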

Testing Hypotheses about $\beta$

Consider a hypothesis about a single element of $\beta$, i.e., $H_0: \beta_i = \beta_i^0$. The variance of $\hat\beta_i$ is $\sigma_{\hat\beta_i}^2 = \sigma^2 [(X'X)^{-1}]_{ii}$, and, under $H_0$,

$$\frac{\hat\beta_i - \beta_i^0}{\sqrt{\sigma_{\hat\beta_i}^2}} \sim N(0, 1). \tag{13}$$

As we don't know $\sigma^2$, we can use $\hat\sigma^2$ instead, but this changes the distribution of (13), due to the sampling variance of $\hat\sigma^2$. We have seen that $(T - k)\hat\sigma^2/\sigma^2 \sim \chi^2(T - k)$.

Thus, dividing the $N(0, 1)$ variable in (13) by $\sqrt{\hat\sigma^2/\sigma^2}$ produces a Student's $t$ distribution with $T - k$ degrees of freedom, i.e., the $t$ statistic

$$t = \frac{\hat\beta_i - \beta_i^0}{\sqrt{\hat\sigma_{\hat\beta_i}^2}} \sim t(T - k),$$

where $\hat\sigma_{\hat\beta_i}^2 = \hat\sigma^2 [(X'X)^{-1}]_{ii}$. A symmetric $(1 - \alpha)$ confidence interval can be calculated via

$$\hat\beta_i \pm t_{1-\alpha/2,\,T-k}\,\sqrt{\hat\sigma_{\hat\beta_i}^2}, \tag{14}$$

where $t_{1-\alpha/2,\,T-k}$ is the $1 - \alpha/2$ quantile of the $t$ distribution with $T - k$ degrees of freedom. The decision rule for a two-sided $t$ test is then to reject the null at level $\alpha$ if the absolute value of the $t$ statistic exceeds the $(1 - \alpha/2)$ quantile of the Student's $t$ distribution with $T - k$ degrees of freedom.
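The standard errors and $t$ statistics can be sketched as follows. The helper `ols_t_stats` and the simulated data are hypothetical; the sketch stops at the $t$ statistics and leaves the quantile $t_{1-\alpha/2,\,T-k}$ to a table or a statistics library.

```python
import numpy as np


def ols_t_stats(y, X, beta_null):
    """Return (beta_hat, standard errors, t statistics) for H0: beta = beta_null."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (T - k)          # unbiased estimator (11)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # sqrt of sigma^2 [(X'X)^{-1}]_ii
    t_stats = (beta_hat - beta_null) / se
    return beta_hat, se, t_stats


# Hypothetical data with true intercept 0 and true slope 1.
rng = np.random.default_rng(3)
T = 100
x = rng.normal(size=T)
y = 0.0 + 1.0 * x + rng.normal(size=T)
X = np.column_stack([np.ones(T), x])

beta_hat, se, t_stats = ols_t_stats(y, X, beta_null=np.zeros(2))
```

Each entry of `t_stats` tests one coefficient separately; with $T - k = 98$ degrees of freedom, the two-sided 5% critical value is close to the normal value of about 1.98.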

In the two-sided $t$ test above, both positive and negative deviations from the null can lead to a rejection of $H_0$. That is, we test

$$H_0: \beta_i = \beta_i^0 \quad \text{against the alternative} \quad H_1: \beta_i \neq \beta_i^0.$$

If we are only interested in deviations in one direction, for example $H_1: \beta_i > \beta_i^0$, we refer to a one-sided test. The corresponding null hypothesis is $H_0: \beta_i \le \beta_i^0$, and we reject $H_0$ if

$$t > t_{1-\alpha,\,T-k}, \tag{15}$$

where $t_{1-\alpha,\,T-k}$ is the $(1 - \alpha)$ quantile of the $t$ distribution with $T - k$ degrees of freedom.

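Next, consider a hypothesis of the form

$$H_0: R\beta = r,$$

where $R$ is a $q \times k$ matrix of rank $q$. For example, if, in the single index model, we want to test $\alpha_i = 0$ and $\beta_i = 1$ for asset $i$,

$$R = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad r = \begin{pmatrix} 0 \\ 1 \end{pmatrix}.$$

If $H_0$ is true, then $R\hat\beta - r \sim N(0, \sigma^2 R(X'X)^{-1}R')$. Therefore

$$W = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{\sigma^2} \sim \chi^2(q).$$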

As before, we cannot use this directly, as $\sigma^2$ is unknown, but we can use the following result. Let $Y_1 \sim \chi^2(\nu_1)$ and $Y_2 \sim \chi^2(\nu_2)$, and let $Y_1$ and $Y_2$ be independent. Then the ratio

$$F = \frac{Y_1/\nu_1}{Y_2/\nu_2} = \frac{Y_1}{Y_2}\,\frac{\nu_2}{\nu_1} \sim F(\nu_1, \nu_2),$$

i.e., $F$ has an $F$ distribution with $\nu_1$ degrees of freedom in the numerator and $\nu_2$ degrees of freedom in the denominator. Use the fact that $(T - k)\hat\sigma^2/\sigma^2 \sim \chi^2(T - k)$, and the independence of $\hat\beta$ and $\hat\sigma^2$. The test statistic is

$$F = \frac{(R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r)}{q\,\hat\sigma^2} \sim F(q, T - k).$$

Many alternative representations exist for the $F$ statistic. For example, we can consider the restricted least squares problem, i.e., the minimization of the RSS under the constraint that $R\hat\beta_R = r$. Working through the first order conditions associated with the Lagrangian

$$L = (y - X\beta)'(y - X\beta) + \lambda'(R\beta - r), \tag{16}$$

the restricted estimator is

$$\hat\beta_R = \hat\beta - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r).$$

Calculations show that

$$\hat\epsilon_R'\hat\epsilon_R = \hat\epsilon'\hat\epsilon + (R\hat\beta - r)'[R(X'X)^{-1}R']^{-1}(R\hat\beta - r).$$

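Therefore, three alternative forms of $F$ with appealing interpretation are

$$F = \frac{\hat\epsilon_R'\hat\epsilon_R - \hat\epsilon'\hat\epsilon}{q\,\hat\sigma^2}
= \frac{(\hat\epsilon_R'\hat\epsilon_R - \hat\epsilon'\hat\epsilon)/q}{\hat\epsilon'\hat\epsilon/(T - k)}
= \frac{(R^2 - R_R^2)/q}{(1 - R^2)/(T - k)}
= \frac{(\hat\beta_R - \hat\beta)'(X'X)(\hat\beta_R - \hat\beta)}{q\,\hat\sigma^2},$$

where $R_R^2$ denotes the $R^2$ of the restricted regression.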
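The equivalence of the quadratic-form, residual-sum-of-squares, and coefficient-difference versions of $F$ can be verified numerically. The sketch below tests $\alpha_i = 0$ and $\beta_i = 1$ in a single index model on simulated returns; the data, seed, and variable names are hypothetical.

```python
import numpy as np

# Hypothetical single index model data.
rng = np.random.default_rng(4)
T, k = 120, 2
r_m = rng.normal(0.0, 0.05, size=T)
y = 0.002 + 0.9 * r_m + rng.normal(0.0, 0.03, size=T)
X = np.column_stack([np.ones(T), r_m])

# H0: alpha = 0 and beta = 1, i.e. R = I_2, r = (0, 1)'.
R = np.eye(2)
r = np.array([0.0, 1.0])
q = R.shape[0]

# Unrestricted OLS fit.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (T - k)

# Quadratic form in the estimated deviation from the null.
d = R @ beta_hat - r
M = np.linalg.inv(R @ XtX_inv @ R.T)
F_wald = d @ M @ d / (q * sigma2_hat)

# Restricted estimator and the residual-sum-of-squares form.
beta_R = beta_hat - XtX_inv @ R.T @ M @ d
resid_R = y - X @ beta_R
F_rss = ((resid_R @ resid_R - resid @ resid) / q) / (resid @ resid / (T - k))

# Form based on the distance between restricted and unrestricted estimates.
diff = beta_R - beta_hat
F_beta = diff @ (X.T @ X) @ diff / (q * sigma2_hat)
```

Because `R` is the identity here, the restricted estimator simply equals `r`; with fewer restrictions than coefficients the same formulas apply unchanged.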

Optimality Properties of $\hat\beta$

1) Gauss-Markov Theorem

The OLS estimator is the Best Linear Unbiased Estimator (BLUE), i.e., for any other unbiased linear estimator $\tilde\beta$ of $\beta$, we have that $\operatorname{Cov}(\tilde\beta) - \operatorname{Cov}(\hat\beta)$ is a positive semidefinite matrix. To see this, let

$$\tilde\beta = Ay$$

be unbiased for $\beta$. Define

$$B = A - (X'X)^{-1}X',$$

so that

$$\tilde\beta = [B + (X'X)^{-1}X']y = [B + (X'X)^{-1}X'](X\beta + \epsilon) = (BX + I_k)\beta + (B + (X'X)^{-1}X')\epsilon.$$

Unbiasedness requires $BX = 0$, so that $\tilde\beta = \beta + (B + (X'X)^{-1}X')\epsilon$, and the covariance matrix of $\tilde\beta$ is

$$\operatorname{Cov}(\tilde\beta) = \sigma^2 (B + (X'X)^{-1}X')(B + (X'X)^{-1}X')'
= \sigma^2 BB' + \sigma^2 [BX(X'X)^{-1} + (X'X)^{-1}X'B'] + \sigma^2 (X'X)^{-1}
= \sigma^2 [BB' + (X'X)^{-1}],$$

where the middle term vanishes because $BX = 0$. Therefore, the covariance matrix of the competing estimator exceeds that of the OLS estimator by $\sigma^2 BB'$, which is a positive semidefinite matrix.

Now consider any linear combination $w'\beta$ of the elements of the parameter vector. The sampling variance of $w'\tilde\beta$ is

$$\operatorname{Var}(w'\tilde\beta) = w'[\sigma^2 BB' + \sigma^2 (X'X)^{-1}]w \ge \sigma^2 w'(X'X)^{-1}w = \operatorname{Var}(w'\hat\beta).$$

This shows that the sampling variance of $w'\hat\beta$ is less than or equal to that of the corresponding combination of any other linear unbiased estimator of $\beta$. By choosing $w$ to be the $i$th unit vector in $\mathbb{R}^k$, an implication of this is

$$\operatorname{Var}(\hat\beta_i) \le \operatorname{Var}(\tilde\beta_i), \quad i = 1, \ldots, k.$$
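The argument can be illustrated numerically by constructing one competing linear unbiased estimator. In the sketch below, the matrix `D` is an arbitrary random matrix and `B = D(I - P)`, with `P` the projection onto the columns of `X`, is one hypothetical way to obtain a matrix satisfying $BX = 0$; the design matrix and $\sigma^2 = 1$ are also assumptions.

```python
import numpy as np

# Hypothetical design matrix with a constant and two regressors.
rng = np.random.default_rng(5)
T, k = 30, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
XtX_inv = np.linalg.inv(X.T @ X)

# Any B of the form D (I - P) satisfies B X = 0, where P projects onto col(X).
P = X @ XtX_inv @ X.T
D = rng.normal(size=(k, T))
B = D @ (np.eye(T) - P)
A = B + XtX_inv @ X.T          # competing estimator: beta_tilde = A y

sigma2 = 1.0                    # assumed error variance
cov_ols = sigma2 * XtX_inv
cov_alt = sigma2 * A @ A.T

# The difference should equal sigma^2 B B', which is positive semidefinite.
excess = cov_alt - cov_ols
eigenvalues = np.linalg.eigvalsh(excess)
```

All eigenvalues of `excess` are nonnegative up to rounding, and in particular each diagonal entry of `cov_alt` is at least as large as the corresponding OLS variance.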

2) Given our distributional assumption about the error term (iid normality), we can say even more, by relying on the Cramér-Rao lower bound (CRLB). In fact, in this case, the OLS estimator has the smallest variance among all unbiased estimators (not only linear estimators), meaning that it is a UMVUE (Uniformly Minimum Variance Unbiased Estimator).

Prediction

Suppose we want to predict

$$y_{T+1} = x_{T+1}'\beta + \epsilon_{T+1}$$

using our OLS fitted model. Our predictor is

$$\hat{y}_{T+1} = x_{T+1}'\hat\beta = x_{T+1}'(X'X)^{-1}X'y,$$

which has prediction error

$$x_{T+1}'\hat\beta - y_{T+1} = x_{T+1}'(\hat\beta - \beta) - \epsilon_{T+1}.$$

This is unbiased in the sense that the prediction error has zero expectation, and the prediction error variance is

$$E[(\hat{y}_{T+1} - y_{T+1})^2] = \sigma^2 [x_{T+1}'(X'X)^{-1}x_{T+1} + 1].$$

This is the best linear unbiased predictor, because a linear unbiased predictor of the form $\tilde{y}_{T+1} = a'y$ can be written as $[x_{T+1}'(X'X)^{-1}X' + b']y$, with $b' = a' - x_{T+1}'(X'X)^{-1}X'$. This has prediction error

$$\tilde{y}_{T+1} - y_{T+1} = [x_{T+1}'(X'X)^{-1}X' + b'](X\beta + \epsilon) - x_{T+1}'\beta - \epsilon_{T+1}
= b'X\beta + [x_{T+1}'(X'X)^{-1}X' + b']\epsilon - \epsilon_{T+1},$$

so that $b'X = 0$ is required for unbiasedness. The forecast error variance is then

$$E[(\tilde{y}_{T+1} - y_{T+1})^2] = \sigma^2 (x_{T+1}'(X'X)^{-1}x_{T+1} + b'b + 1) \ge \sigma^2 (x_{T+1}'(X'X)^{-1}x_{T+1} + 1) = E[(\hat{y}_{T+1} - y_{T+1})^2].$$
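A minimal sketch of the OLS predictor and its estimated prediction error variance, with $\sigma^2$ replaced by the unbiased estimate $\hat\sigma^2$. The helper `predict_with_variance`, the simulated data, and the new regressor vector `x_new` are hypothetical.

```python
import numpy as np


def predict_with_variance(y, X, x_new):
    """Return (prediction, estimated prediction error variance, sigma2_hat)."""
    T, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / (T - k)
    y_pred = x_new @ beta_hat
    # sigma^2 [x'(X'X)^{-1} x + 1]: parameter uncertainty plus the new error.
    pred_var = sigma2_hat * (x_new @ XtX_inv @ x_new + 1.0)
    return y_pred, pred_var, sigma2_hat


# Hypothetical data: y = 1 + 2 x + noise with standard deviation 0.5.
rng = np.random.default_rng(6)
T = 80
x = rng.normal(size=T)
y = 1.0 + 2.0 * x + 0.5 * rng.normal(size=T)
X = np.column_stack([np.ones(T), x])

x_new = np.array([1.0, 0.3])   # constant and the new regressor value
y_pred, pred_var, sigma2_hat = predict_with_variance(y, X, x_new)
```

The prediction error variance always exceeds the error variance itself, since the uncertainty about $\hat\beta$ is added to the variance of $\epsilon_{T+1}$.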

Generalized Least Squares

If $\epsilon \sim N(0, \Sigma)$, where $\Sigma$ is a covariance matrix, the OLS estimator need no longer be efficient according to the Gauss-Markov theorem. However, we can find a nonsingular $C$ such that $CC' = \Sigma$, for example, by employing the Choleski decomposition. Then, with $\tilde\epsilon = C^{-1}\epsilon$, we have

$$E(\tilde\epsilon\tilde\epsilon') = C^{-1}\Sigma (C')^{-1} = C^{-1}CC'(C')^{-1} = I.$$

Thus, we may consider the transformed equation

$$C^{-1}y = C^{-1}X\beta + C^{-1}\epsilon,$$

with GLS (generalized least squares) estimator

$$\hat\beta_{GLS} = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}y.$$
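The equivalence between OLS on the transformed equation and the GLS formula can be checked numerically. In this sketch the covariance matrix is assumed to have the hypothetical form $\Sigma_{ij} = 0.6^{|i-j|}$, and the data are simulated; `numpy.linalg.cholesky` returns the lower-triangular factor $C$ with $CC' = \Sigma$.

```python
import numpy as np

# Hypothetical setup: constant plus one regressor, correlated errors.
rng = np.random.default_rng(7)
T = 50
X = np.column_stack([np.ones(T), rng.normal(size=T)])
beta_true = np.array([1.0, 2.0])

# Assumed covariance matrix: Sigma_ij = 0.6 ** |i - j| (positive definite).
idx = np.arange(T)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])

# Lower-triangular Cholesky factor with C C' = Sigma.
C = np.linalg.cholesky(Sigma)
eps = C @ rng.normal(size=T)           # errors with covariance Sigma
y = X @ beta_true + eps

# OLS on the transformed equation C^{-1} y = C^{-1} X beta + C^{-1} eps.
y_star = np.linalg.solve(C, y)
X_star = np.linalg.solve(C, X)
beta_whitened = np.linalg.solve(X_star.T @ X_star, X_star.T @ y_star)

# Direct GLS formula (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y.
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```

In practice the transformed regression is often preferred, because it avoids inverting the $T \times T$ matrix $\Sigma$ explicitly.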

We will use this when we discuss the Black-Litterman approach.
