Interpreting Regression Results
Carlo Favero
Interpreting Regression Results

Interpreting regression results is not a simple exercise. We propose to split the procedure into three steps.

First, understand the relevance of our regression independently of inference on the parameters. There is an easy way to do this: suppose all parameters in the model are known and identical to the estimated values, and learn how to read them.

Second, introduce a measure of sampling variability and evaluate again what you know, taking into account that parameters are estimated and there is uncertainty surrounding your point estimates.

Third, remember that each regression is run after a reduction process has been implemented, explicitly or implicitly. The relevant question is: what happens if something went wrong in the reduction process? What are the consequences of omitting relevant information, or of including irrelevant information, in your specification?
The relevance of a regression is different from the statistical significance of the estimated parameters. In fact, confusing the statistical significance of the estimated parameter describing the effect of a regressor on the dependent variable with the practical relevance of that effect is a rather common mistake in the use of the linear model. Statistical inference is a tool for estimating parameters in a probability model and assessing the amount of sampling variability: statistics tells us what we can say about the values of the parameters in the model on the basis of our sample. The relevance of a regression is instead determined by the share of the unconditional variance of y that is explained by the variance of E(y|X). Measuring how large this share is constitutes the fundamental role of the R².
The R-squared as a measure of relevance of a regression

To illustrate the point, let us consider two specific applications of the CAPM:

$$\left(r^i_t - r^{rf}_t\right) = 0.8\,\sigma_m u_{m,t} + \sigma_i u_{i,t}, \qquad i = 1, 2$$
$$\left(r^m_t - r^{rf}_t\right) = \mu_m + \sigma_m u_{m,t}$$
$$\begin{pmatrix} u_{i,t} \\ u_{m,t} \end{pmatrix} \sim n.i.d.\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right]$$
$$\mu_m = 0.0065, \quad \sigma_m = 0.054, \quad \sigma_1 = 0.09, \quad \sigma_2 = 0.005$$

We simulate an artificial sample of 1056 observations (the same length as the sample July 1926–June 2014) for each process. μ_m and σ_m are calibrated to match the first two moments of the market portfolio excess returns over the sample 1926:7–2014:7, while the standard errors of the two excess returns are calibrated to deliver an R² in the CAPM regression of about 0.22 and 0.98, respectively.
The R-squared as a measure of relevance of a regression

Running the two CAPM regressions on the artificial sample:

TABLE 3.1: The estimation of the CAPM on artificial data

Dependent Variable: (r¹_t − r^rf_t)

  Regressor            Coefficient    t-ratio    Prob.
  (r^m_t − r^rf_t)     0.875          17.48      0.000

  R² = 0.22          S.E. of regression = 0.0076

Dependent Variable: (r²_t − r^rf_t)

  Regressor            Coefficient    t-ratio    Prob.
  (r^m_t − r^rf_t)     0.793          201.86     0.000

  R² = 0.972         S.E. of regression = 0.0000

In both cases the estimated betas are statistically significant and very close to their true value of 0.8.
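The experiment can be reproduced in a few lines. Below is a minimal pure-Python sketch (the random seed and the no-intercept OLS details are my choices, not from the text; the calibrated parameters are those stated above): both slopes come out close to 0.8, but the two regressions have very different R².

```python
# Simulating the two CAPM regressions: identical beta (0.8), very
# different R^2, driven only by the size of the idiosyncratic noise.
# Sketch of the experiment above; seed and estimator details are choices.
import random

random.seed(7)
T = 1056
mu_m, sg_m, sg_1, sg_2 = 0.0065, 0.054, 0.09, 0.005

u_m = [random.gauss(0.0, 1.0) for _ in range(T)]
mkt = [mu_m + sg_m * u for u in u_m]                        # market excess return
p1 = [0.8 * sg_m * u + sg_1 * random.gauss(0.0, 1.0) for u in u_m]
p2 = [0.8 * sg_m * u + sg_2 * random.gauss(0.0, 1.0) for u in u_m]

def ols_no_const(y, x):
    """Slope and uncentered R^2 of a no-intercept regression of y on x."""
    b = sum(a * c for a, c in zip(x, y)) / sum(a * a for a in x)
    ssr = sum((c - b * a) ** 2 for a, c in zip(x, y))
    r2 = 1.0 - ssr / sum(c * c for c in y)
    return b, r2

b1, r2_1 = ols_no_const(p1, mkt)
b2, r2_2 = ols_no_const(p2, mkt)
print(b1, r2_1)   # beta near 0.8, low R^2
print(b2, r2_2)   # beta near 0.8, R^2 near 1
```

The betas are equally "significant" in both regressions; only the R² reveals that the market factor is almost irrelevant for portfolio 1 and nearly the whole story for portfolio 2.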
The R-squared as a measure of relevance of a regression

Simulate the processes again, but introduce at some point a temporary shift of two per cent in the excess returns of the market portfolio.

[Figure: Simulated Market Portfolio, Portfolio 1 and Portfolio 2 excess returns, 1950–1960, baseline versus alternative simulation]

In both experiments the conditional expectation changes by the same amount, but the share of the unconditional variance of y explained by the regression is very different: the shift is clearly visible only in the returns of the high-R² portfolio.
Inference in the Linear Regression Model

Users of econometric models in finance attribute high priority to the concept of "statistical significance" of their estimates. In standard statistical jargon, an estimate of a parameter is statistically significant if its estimated value, compared with its sampling standard deviation, makes it unlikely that in other samples the estimate would change sign. In the linear regression model the statistical index most used is the t-ratio, and the significance of an estimated parameter is usually measured in terms of its P-value: the probability, under the null hypothesis that the coefficient is zero, of observing a t-ratio at least as extreme as the one obtained. In the previous section we discussed the common confusion between statistical significance and relevance. In this section we illustrate the basic principles that allow us to evaluate statistical significance and to perform tests of relevant hypotheses on the estimated coefficients in a linear model.
Elements of Distribution theory

We consider the distribution of a generic n-dimensional vector z, together with the derived distribution of the vector x = g(z), which admits the inverse z = h(x), with h = g⁻¹. If

$$\operatorname{prob}(z_1 < z < z_2) = \int_{z_1}^{z_2} f(z)\,dz, \qquad \operatorname{prob}(x_1 < x < x_2) = \int_{x_1}^{x_2} f(x)\,dx,$$

then

$$f(x) = f(h(x))\,|J|, \qquad J = \begin{vmatrix} \frac{\partial h_1}{\partial x_1} & \cdots & \frac{\partial h_n}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial h_1}{\partial x_n} & \cdots & \frac{\partial h_n}{\partial x_n} \end{vmatrix} = \left|\frac{\partial h}{\partial x}\right|.$$
The normal distribution

The standardized univariate normal has the following distribution:

$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}z^2\right), \qquad E(z) = 0, \quad \operatorname{var}(z) = 1.$$

By considering the transformation x = σz + μ, we derive the distribution of the univariate normal as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \qquad E(x) = \mu, \quad \operatorname{var}(x) = \sigma^2.$$
The normal multivariate distribution

Consider now the vector z = (z₁, z₂, ..., zₙ)′, such that

$$f(z) = \prod_{i=1}^{n} f(z_i) = (2\pi)^{-\frac{n}{2}} \exp\left(-\frac{1}{2}z'z\right).$$

z is, by construction, a vector of independent normal variables with zero mean and identity variance–covariance matrix. The conventional notation is z ∼ N(0, Iₙ).
The normal multivariate distribution

Consider now the linear transformation x = Az + μ, where A is an (n × n) invertible matrix. The inverse transformation is z = A⁻¹(x − μ), with Jacobian |J| = |A⁻¹| = 1/|A|. By applying the formula for the transformation of variables, we have:

$$f(x) = (2\pi)^{-\frac{n}{2}} |A|^{-1} \exp\left(-\frac{1}{2}(x-\mu)' A^{-1\prime} A^{-1} (x-\mu)\right),$$

which, by defining the positive definite matrix Σ = AA′, equals

$$f(x) = (2\pi)^{-\frac{n}{2}} |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}(x-\mu)' \Sigma^{-1} (x-\mu)\right).$$

The conventional notation for the multivariate normal is x ∼ N(μ, Σ).
The transformation of the normal multivariate

The formula for the transformation of variables allows us to better understand the theorem introduced in a previous section of this chapter.

Theorem. For any x ∼ N(μ, Σ), any (m × n) matrix B and any (m × 1) vector d, if y = Bx + d, then y ∼ N(Bμ + d, BΣB′).

Consider a partitioning of an n-variate normal vector into two sub-vectors of dimensions n₁ and n − n₁:

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim N\left[\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right].$$

By applying the formula for the transformation of variables, we obtain two results:

1. x₁ ∼ N(μ₁, Σ₁₁), which follows from applying the general formula in the case d = 0, B = (I_{n₁} 0);
2. (x₁ | x₂) ∼ N(μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁), which is the conditional distribution of x₁ given x₂.
Distributions derived from the normal

Consider z ∼ N(0, Iₙ), an n-variate standard normal. The distribution of ω = z′z is defined as a χ²(n) distribution with n degrees of freedom. Consider two vectors z₁ and z₂ of dimensions n₁ and n₂ respectively, with the following distribution:

$$\begin{pmatrix} z_1 \\ z_2 \end{pmatrix} \sim N\left[\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} I_{n_1} & 0 \\ 0 & I_{n_2} \end{pmatrix}\right].$$

We have ω₁ = z₁′z₁ ∼ χ²(n₁), ω₂ = z₂′z₂ ∼ χ²(n₂), and ω₁ + ω₂ = z₁′z₁ + z₂′z₂ ∼ χ²(n₁ + n₂). In general, the sum of two independent χ² distributions is itself distributed as a χ² with a number of degrees of freedom equal to the sum of the degrees of freedom of the two χ².
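A quick Monte Carlo check of the facts above can be run in pure Python (the seed and sample sizes are my choices, not from the text): z′z for an n-variate standard normal has mean n, and the sum of independent χ²(n₁) and χ²(n₂) draws behaves like a χ²(n₁ + n₂) draw.

```python
# Monte Carlo sketch: mean of z'z is n, and independent chi-square
# variables add their degrees of freedom. Seed/sizes are assumptions.
import random

random.seed(0)
n1, n2, reps = 3, 5, 20000

def chi2_draw(n):
    """One draw of z'z with z an n-vector of independent N(0,1)."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(n))

w1 = [chi2_draw(n1) for _ in range(reps)]
w2 = [chi2_draw(n2) for _ in range(reps)]
w12 = [a + b for a, b in zip(w1, w2)]           # chi-square(n1 + n2)

mean = lambda xs: sum(xs) / len(xs)
print(mean(w1), mean(w2), mean(w12))            # close to 3, 5, 8
```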
Distributions derived from the normal

Our discussion of the multivariate normal implies that if x ∼ N(μ, Σ), then (x − μ)′Σ⁻¹(x − μ) ∼ χ²(n). A related result establishes that if z ∼ N(0, Iₙ) and M is a symmetric idempotent (n × n) matrix of rank r, then z′Mz ∼ χ²(r).

Another distribution related to the normal is the F-distribution, obtained as the ratio of two independent χ² variables, each divided by its degrees of freedom. Given ω₁ ∼ χ²(n₁) and ω₂ ∼ χ²(n₂), independent, we have:

$$\frac{\omega_1/n_1}{\omega_2/n_2} \sim F(n_1, n_2).$$
Distributions derived from the normal

The Student's t distribution is then related to the F-distribution by t²(n) = F(1, n): the square of a t with n degrees of freedom is an F with (1, n) degrees of freedom.

Another useful result establishes that two quadratic forms in the standard multivariate normal, z′Mz and z′Qz, are independent if MQ = 0. We can finally state the following theorem, which is fundamental to statistical inference in the linear model:

Theorem. If z ∼ N(0, Iₙ), M and Q are symmetric idempotent matrices of ranks r and s respectively, and MQ = 0, then

$$\frac{z'Qz/s}{z'Mz/r} \sim F(s, r).$$
The conditional distribution y|X

To perform inference in the linear regression model, we need a further hypothesis specifying the distribution of y conditional upon X:

$$(y \mid X) \sim N\left(X\beta, \sigma^2 I\right), \quad (1)$$

or, equivalently,

$$(\epsilon \mid X) \sim N\left(0, \sigma^2 I\right). \quad (2)$$

Given (1) we can immediately derive the distribution of (β̂ | X), which, being a linear combination of a normal distribution, is also normal:

$$(\hat\beta \mid X) \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right). \quad (3)$$
The conditional distribution y|X

Equation (3) constitutes the basis for constructing confidence intervals and performing hypothesis tests in the linear regression model. Consider the following expression:

$$\frac{(\hat\beta - \beta)' X'X (\hat\beta - \beta)}{\sigma^2} = \frac{\epsilon' X(X'X)^{-1} X'X (X'X)^{-1} X'\epsilon}{\sigma^2} = \frac{\epsilon' Q \epsilon}{\sigma^2}, \qquad Q = X(X'X)^{-1}X',$$

and, applying the results derived in the previous section, we know that

$$\left.\frac{\epsilon' Q \epsilon}{\sigma^2}\right| X \sim \chi^2(k). \quad (4)$$
The conditional distribution y|X

Equation (4) is not useful in practice, as we do not know σ². However, we know that

$$\left.\frac{S(\hat\beta)}{\sigma^2}\right| X = \left.\frac{\epsilon' M \epsilon}{\sigma^2}\right| X \sim \chi^2(T-k), \quad (5)$$

$$M = I - Q. \quad (6)$$

Since MQ = 0, we know the distribution of the ratio of (4) and (5); moreover, by taking the ratio we get rid of the unknown term σ²:

$$\frac{(\hat\beta - \beta)' X'X (\hat\beta - \beta)/\sigma^2}{s^2/\sigma^2} = \frac{\epsilon' Q \epsilon}{\epsilon' M \epsilon}\,(T-k) \sim k\,F(k, T-k). \quad (7)$$
Clicker 6

Insert Clicker 6 here
Confidence Intervals for β

We use result (7) as follows: from the tables of the F-distribution we obtain, for different values of α, the critical value F_α(k, T−k) such that

$$\operatorname{prob}\left[F(k, T-k) > F_\alpha(k, T-k)\right] = \alpha, \qquad 0 < \alpha < 1,$$

so that we are in a position to evaluate exactly an inequality of the following form:

$$\operatorname{prob}\left\{(\hat\beta - \beta)' X'X (\hat\beta - \beta) \le k\,s^2\,F_\alpha(k, T-k)\right\} = 1 - \alpha,$$

which defines confidence regions for β centred upon β̂.
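As a sanity check, the k = 1 case can be simulated in pure Python (an illustrative sketch, not from the text: the data-generating parameters and the use of the normal critical value 1.96 as a large-sample stand-in for the t(T−k) one are my assumptions). The interval β̂ ± 1.96·se should cover the true β roughly 95% of the time.

```python
# Monte Carlo coverage check for the k = 1 confidence interval:
# y_t = beta*x_t + eps_t; interval betahat +/- z_{0.975} * s / sqrt(sum x^2).
# With T = 100 the exact t critical value is slightly above 1.96, so the
# empirical coverage should sit close to (just under) 0.95.
import math
import random
from statistics import NormalDist

random.seed(1)
beta, sigma, T, reps = 0.8, 1.0, 100, 2000
z = NormalDist().inv_cdf(0.975)          # ~1.96, normal approx to t(T-1)

covered = 0
for _ in range(reps):
    x = [random.gauss(0.0, 1.0) for _ in range(T)]
    y = [beta * xi + random.gauss(0.0, sigma) for xi in x]
    sxx = sum(xi * xi for xi in x)
    bhat = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    ssr = sum((yi - bhat * xi) ** 2 for xi, yi in zip(x, y))
    s2 = ssr / (T - 1)                   # s^2 with T - k d.o.f., k = 1
    se = math.sqrt(s2 / sxx)
    if abs(bhat - beta) <= z * se:
        covered += 1

print(covered / reps)                    # close to 0.95
```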
Hypothesis Testing

Hypothesis testing is strictly linked to the derivation of confidence intervals. When testing a hypothesis, we aim at rejecting, on the basis of the sample evidence, the validity of restrictions imposed on the model. Within this framework, (1)–(3) are the maintained hypotheses, and the restricted version of the model is identified with the null hypothesis H₀. Following the Neyman–Pearson approach to hypothesis testing, one derives a statistic with known distribution under the null, and the probability of a type-I error (rejecting H₀ when it is true) is fixed at α. For example, a test at level α of the null hypothesis β = β₀ based on the F-statistic does not reject H₀ if β₀ lies within the confidence region associated with the probability 1 − α. In practice, however, this is not a useful way of proceeding, as the economic hypotheses of interest rarely involve a number of restrictions equal to the number of estimated parameters.
Hypothesis Testing

The general case of interest is therefore the one in which we have r restrictions on the vector of parameters, with r < k. If we limit our interest to the class of linear restrictions, we can express them as H₀: Rβ = r, where R is an (r × k) matrix of parameters with rank r, and r is an (r × 1) vector of parameters. To illustrate how R and r are constructed, consider the baseline case of the CAPM; we want to impose the restriction β_{0,i} = 0 on the following specification:

$$\left(r^i_t - r^{rf}_t\right) = \beta_{0,i} + \beta_{1,i}\left(r^m_t - r^{rf}_t\right) + u_{i,t}, \quad (8)$$

$$R\beta = r: \qquad \begin{pmatrix} 1 & 0 \end{pmatrix} \begin{pmatrix} \beta_{0,i} \\ \beta_{1,i} \end{pmatrix} = (0).$$

The distribution of a known statistic under the null is derived by applying known results.
Hypothesis Testing

If $(\hat\beta \mid X) \sim N\left(\beta, \sigma^2 (X'X)^{-1}\right)$, then:

$$\left(R\hat\beta - r \mid X\right) \sim N\left(R\beta - r, \sigma^2 R(X'X)^{-1}R'\right). \quad (9)$$

The test is constructed by deriving the distribution of (9) under the null Rβ − r = 0. Given that

$$\left(R\hat\beta - r \mid X\right) = R\beta - r + R(X'X)^{-1}X'\epsilon,$$

under H₀ we have:

$$(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r) = \epsilon' X(X'X)^{-1}R'\left[R(X'X)^{-1}R'\right]^{-1}R(X'X)^{-1}X'\epsilon = \epsilon' P \epsilon,$$

where P is a symmetric idempotent matrix of rank r, orthogonal to M.
Hypothesis Testing

Then, under H₀,

$$\frac{(R\hat\beta - r)'\left[R(X'X)^{-1}R'\right]^{-1}(R\hat\beta - r)}{r\,s^2} \sim F(r, T-k),$$

which can be used to test the relevant hypothesis.
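A minimal pure-Python sketch of this test (the data-generating parameters are assumptions, not from the text): estimate the CAPM regression with an intercept on simulated data and test H₀: β₀ = 0 with R = (1 0), r = 0. For a single restriction the F statistic equals the squared t-ratio of the intercept.

```python
# Testing one linear restriction (H0: intercept beta0 = 0) in
# y_t = beta0 + beta1*x_t + u_t, simulated under the null.
# Illustrative sketch: parameter values are assumptions.
import math
import random

random.seed(2)
T, beta1, sigma = 500, 0.8, 0.05
x = [random.gauss(0.005, 0.05) for _ in range(T)]         # market excess returns
y = [beta1 * xi + random.gauss(0.0, sigma) for xi in x]   # beta0 = 0 holds

# Normal equations for X = [1, x]: solve (X'X) b = X'y with a 2x2 inverse.
sx, sxx = sum(x), sum(xi * xi for xi in x)
sy, sxy = sum(y), sum(xi * yi for xi, yi in zip(x, y))
det = T * sxx - sx * sx
inv = [[sxx / det, -sx / det], [-sx / det, T / det]]      # (X'X)^{-1}
b0 = inv[0][0] * sy + inv[0][1] * sxy
b1 = inv[1][0] * sy + inv[1][1] * sxy

ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s2 = ssr / (T - 2)                                        # s^2, k = 2

# R = (1 0), r = 0: R (X'X)^{-1} R' is element (0,0) of the inverse.
F = b0 * b0 / (s2 * inv[0][0])        # one restriction: F(1, T-2)
t = b0 / math.sqrt(s2 * inv[0][0])    # t-ratio of the intercept
print(F, t * t)                        # identical when r = 1
```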
Clicker 7

Insert Clicker 7 here
The Partitioned Regression Model

Given the linear model y = Xβ + ε, partition X into two blocks of dimensions (T × r) and (T × (k − r)), and partition β correspondingly into (β₁′, β₂′)′. The partitioned regression model can then be written as:

$$y = X_1\beta_1 + X_2\beta_2 + \epsilon.$$
The Partitioned Regression Model

It is useful to derive the formula for the OLS estimator in the partitioned regression model. To obtain this result we partition the normal equations X′Xβ̂ = X′y as:

$$\begin{pmatrix} X_1' \\ X_2' \end{pmatrix} \begin{pmatrix} X_1 & X_2 \end{pmatrix} \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} X_1' \\ X_2' \end{pmatrix} y,$$

or, equivalently,

$$\begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix} \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}. \quad (10)$$
The Partitioned Regression Model

System (10) can be solved in two stages: first derive an expression for β̂₂,

$$\hat\beta_2 = (X_2'X_2)^{-1}\left(X_2'y - X_2'X_1\hat\beta_1\right),$$

and then substitute it into the first equation of (10) to obtain

$$X_1'X_1\hat\beta_1 + X_1'X_2(X_2'X_2)^{-1}\left(X_2'y - X_2'X_1\hat\beta_1\right) = X_1'y,$$

from which:

$$\hat\beta_1 = (X_1'M_2X_1)^{-1}X_1'M_2\,y, \qquad M_2 = I - X_2(X_2'X_2)^{-1}X_2'.$$
The Partitioned Regression Model

Note that, as M₂ is idempotent, we can also write:

$$\hat\beta_1 = (X_1'M_2'M_2X_1)^{-1}X_1'M_2'M_2\,y,$$

so that β̂₁ can be interpreted as the vector of OLS coefficients of the regression of y on the matrix of residuals of the regression of X₁ on X₂. Thus, an OLS regression on two sets of regressors is equivalent to a sequence of OLS regressions on a single set (the Frisch–Waugh theorem).
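The equivalence can be checked numerically. A minimal sketch with simulated data (the parameter values and the no-intercept, k = 2 setup are my assumptions): the coefficient on x₁ from the joint regression of y on (x₁, x₂) equals the coefficient from regressing y on the residuals of x₁ on x₂.

```python
# Frisch-Waugh check: beta1 from the joint OLS of y on (x1, x2) equals
# the slope of y on M2*x1, the residual of x1 after regressing it on x2.
# Illustrative sketch with simulated data (no intercept, k = 2).
import random

random.seed(3)
T = 400
x2 = [random.gauss(0.0, 1.0) for _ in range(T)]
x1 = [0.5 * v + random.gauss(0.0, 1.0) for v in x2]       # correlated with x2
y = [1.0 * a + 2.0 * b + random.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

# Joint OLS via the 2x2 normal equations.
s11 = sum(a * a for a in x1); s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
s1y = sum(a * c for a, c in zip(x1, y)); s2y = sum(b * c for b, c in zip(x2, y))
det = s11 * s22 - s12 * s12
b1_joint = (s22 * s1y - s12 * s2y) / det

# Partialled-out regression: residual of x1 on x2, then y on that residual.
g = s12 / s22                                  # slope of x1 on x2
m2x1 = [a - g * b for a, b in zip(x1, x2)]     # M2 * x1
b1_fwl = sum(r * c for r, c in zip(m2x1, y)) / sum(r * r for r in m2x1)

print(b1_joint, b1_fwl)   # equal up to floating-point error
```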
The Partitioned Regression Model

Finally, consider the residuals of the partitioned model:

$$\hat\epsilon = y - X_1\hat\beta_1 - X_2\hat\beta_2 = y - X_1\hat\beta_1 - X_2(X_2'X_2)^{-1}\left(X_2'y - X_2'X_1\hat\beta_1\right)$$
$$= M_2 y - M_2 X_1\hat\beta_1 = M_2 y - M_2 X_1 (X_1'M_2X_1)^{-1}X_1'M_2\,y$$
$$= \left(M_2 - M_2X_1(X_1'M_2X_1)^{-1}X_1'M_2\right) y.$$

However, we already know that ε̂ = My; therefore,

$$M = M_2 - M_2X_1(X_1'M_2X_1)^{-1}X_1'M_2. \quad (11)$$
Testing restrictions on a subset of coefficients

In the general framework for testing linear restrictions, set r = 0, R = (I_r 0), and partition β correspondingly into (β₁′, β₂′)′. In this case the restriction Rβ − r = 0 is equivalent to β₁ = 0 in the partitioned regression model. Under H₀, X₁ has no additional explanatory power for y with respect to X₂; therefore:

$$H_0: \quad y = X_2\beta_2 + \epsilon, \qquad (\epsilon \mid X_1, X_2) \sim N(0, \sigma^2 I).$$

Note that the statement

$$y = X_2\gamma_2 + \epsilon, \qquad (\epsilon \mid X_2) \sim N(0, \sigma^2 I),$$

is always true under our maintained hypotheses; however, in general γ₂ ≠ β₂.
Testing restrictions on a subset of coefficients

To derive a statistic to test H₀, remember that the matrix R(X′X)⁻¹R′ is the upper-left block of (X′X)⁻¹, which we can now write as (X₁′M₂X₁)⁻¹. The statistic then takes the form

$$\frac{\hat\beta_1'(X_1'M_2X_1)\hat\beta_1}{r\,s^2} = \frac{y'M_2X_1(X_1'M_2X_1)^{-1}X_1'M_2\,y}{y'My}\,\frac{(T-k)}{r} \sim F(r, T-k).$$

Given (11), this statistic can be rewritten as:

$$\frac{y'M_2\,y - y'My}{y'My}\,\frac{(T-k)}{r} \sim F(r, T-k), \quad (12)$$

where the denominator is the sum of squared residuals in the unconstrained model, while the numerator is the difference between the sum of squared residuals in the constrained model and the sum of squared residuals in the unconstrained model.
Testing restrictions on a subset of coefficients

Consider the limit case r = 1, in which β₁ is a scalar. The F-statistic takes the form

$$\frac{\hat\beta_1^2\,(X_1'M_2X_1)}{s^2} \sim F(1, T-k) \quad \text{under } H_0,$$

where (X₁′M₂X₁)⁻¹ is element (1, 1) of the matrix (X′X)⁻¹. Using the result on the relation between the F and the Student's t distribution:

$$\frac{\hat\beta_1}{s\,(X_1'M_2X_1)^{-1/2}} \sim t(T-k) \quad \text{under } H_0.$$

Therefore, an immediate test of significance of each coefficient can be performed by taking the ratio of the estimated coefficient to its associated standard error.
The partial regression theorem

The Frisch–Waugh theorem described above is worth more consideration. The theorem tells us that any given regression coefficient in the model E(y|X) = Xβ can be computed in two different but exactly equivalent ways: (1) by regressing y on all the columns of X; (2) by first regressing the j-th column of X on all the other columns of X, computing the residuals of this regression, and then regressing y on these residuals. This result is relevant in that it clarifies that the estimated parameters in a linear model do not describe the connection between the regressand and each regressor, but the connection between the regressand and the part of each regressor that is not explained by the other ones.
What if analysis

The relevant question in this case becomes: how much will y change if I change Xᵢ? The estimation of a single-equation linear model does not allow us to answer that question, for a number of reasons. First, estimated parameters in a linear model can only answer the question: how much will E(y|X) change if I change X? We have seen that the two questions are very different if the R² of the regression is low: in this case a change in E(y|X) may not produce any visible and relevant effect on y. Second, a regression model is a conditional expected value GIVEN X. In this sense there is no room for changing the value of any element of X.
What if analysis

Any statement involving such a change requires some assumption on how the conditional expectation of y changes if X changes, and a correct analysis requires an assumption on the joint distribution of y and X. Simulation might therefore require the use of the joint multivariate model even when valid estimation can be performed by concentrating only on the conditional model: strong exogeneity, a more demanding condition than the weak exogeneity that suffices for estimating the parameters of interest, is then required.
What if analysis

Think of a linear model with known parameters:

$$y = \beta_1 x_1 + \beta_2 x_2.$$

What is, in this model, the effect on y of changing x₁ by one unit while keeping x₂ constant? Easy: β₁. Now think of the estimated linear model:

$$y = \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat u.$$

Now y is different from E(y|X), and the question "what is, in this model, the effect on E(y|X) of changing x₁ by one unit while keeping x₂ constant?" does not in general make sense.
Clicker 8

Insert Clicker 8 here
What if analysis

Changing x₁ while keeping x₂ unaltered implies that there is zero correlation between these variables. But the estimates β̂₁ and β̂₂ are obtained using data in which, in general, there is some correlation between x₁ and x₂. Data in which fluctuations in x₁ do not have any effect on x₂ would most likely have generated estimates different from those obtained in the estimation sample. The only valid question that can be answered using the coefficients of a linear regression is: "what is the effect on E(y|X) of changing the part of each regressor that is orthogonal to the other ones?" "What if" analysis requires simulation and, in most cases, a lower level of reduction than that used for regression analysis.
The semi-partial R-squared

When the columns of X are orthogonal to each other, the total R² can be exactly decomposed into the sum of the partial R² due to each regressor xᵢ (the partial R² of a regressor i is defined as the R² of the regression of y on xᵢ). This is in general not the case in applications with non-experimental data: the columns of X are correlated, and an (often large) part of the overall R² depends on the joint behaviour of the columns of X. However, it is always possible to compute the marginal contribution to the overall R² due to each regressor xᵢ, defined as the difference between the overall R² and the R² of the regression that includes all columns of X except xᵢ. This is called the semi-partial R².
The semi-partial R-squared

Interestingly, the semi-partial R² is a simple transformation of the t-ratio:

$$spR^2_i = t^2_{\hat\beta_i}\,\frac{(1 - R^2)}{(T-k)}.$$

This result has two interesting implications. First, a quantity that we considered as just a measure of statistical reliability can lead to a measure of relevance when combined with the overall R² of the regression. Second, we can reiterate the difference between statistical significance and relevance. Suppose you have a sample size of 10000, 10 columns in X, and a t-ratio on a coefficient β̂ᵢ of about 4, with an associated P-value well below 0.01: very statistically significant! The derivation of the semi-partial R² tells us that the contribution of this variable to the overall R² is at most approximately 16/(10000 − 10), that is, less than two thousandths.
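The identity can be verified numerically. Below is a minimal sketch (simulated data with two regressors, no intercept, uncentered R²; this setup is my assumption, the formula is the one above): spR²₁ computed as the drop in R² when x₁ is excluded coincides with t²(1 − R²)/(T − k).

```python
# Checking spR^2_i = t^2 * (1 - R^2) / (T - k) on simulated data
# (two regressors, no intercept, uncentered R^2; values are assumptions).
import random

random.seed(4)
T, k = 200, 2
x2 = [random.gauss(0.0, 1.0) for _ in range(T)]
x1 = [0.6 * v + random.gauss(0.0, 1.0) for v in x2]
y = [0.3 * a + 1.0 * b + random.gauss(0.0, 1.0) for a, b in zip(x1, x2)]

s11 = sum(a * a for a in x1); s22 = sum(b * b for b in x2)
s12 = sum(a * b for a, b in zip(x1, x2))
s1y = sum(a * c for a, c in zip(x1, y)); s2y = sum(b * c for b, c in zip(x2, y))
syy = sum(c * c for c in y)

# Full regression y on (x1, x2): coefficients via the 2x2 normal equations.
det = s11 * s22 - s12 * s12
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det
ssr_full = sum((c - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))
r2_full = 1.0 - ssr_full / syy

# Reduced regression y on x2 only.
ssr_red = sum((c - (s2y / s22) * b) ** 2 for b, c in zip(x2, y))
r2_red = 1.0 - ssr_red / syy

sp_r2 = r2_full - r2_red                  # semi-partial R^2 of x1

s2 = ssr_full / (T - k)
x1m2x1 = s11 - s12 * s12 / s22            # X1' M2 X1
t2 = b1 * b1 * x1m2x1 / s2                # squared t-ratio of b1
formula = t2 * (1.0 - r2_full) / (T - k)

print(sp_r2, formula)   # identical up to floating-point error
```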
Clicker 9

Insert Clicker 9 here