Advanced Quantitative Methods: Ordinary Least Squares
University College Dublin
31 January 2012
Terminology

- $y$ is the dependent variable, also referred to (by Greene) as the regressand.
- $X$ are the independent variables, also known as explanatory variables, regressors, or predictors (or factors, carriers).
- $X$ is sometimes called the design matrix (or factor space).
- $y$ is regressed on $X$.
- The error term $\varepsilon$ is sometimes called a disturbance: $\varepsilon = y - X\beta$.
- The difference between the observed and predicted dependent variable is called the residual: $e = y - \hat{y} = y - X\hat{\beta}$.
$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \varepsilon_i$$
$$y = X\beta + \varepsilon$$
$$\varepsilon \sim N(0, \sigma^2) \qquad y \sim N(X\beta, \sigma^2)$$
Components

Two components of the model:

$$y \sim N(\mu, \sigma^2) \quad \text{Stochastic}$$
$$\mu = X\beta \quad \text{Systematic}$$

Generalised version (not necessarily linear):

$$y \sim f(\mu, \alpha) \quad \text{Stochastic}$$
$$\mu = g(X, \beta) \quad \text{Systematic}$$

(King 1998, 8)
Components

$$y \sim f(\mu, \alpha) \quad \text{Stochastic}$$
$$\mu = g(X, \beta) \quad \text{Systematic}$$

- Stochastic component: varies over repeated (hypothetical) observations on the same unit.
- Systematic component: varies across units, but is constant given $X$.

(King 1998, 8)
Uncertainty

$$y \sim f(\mu, \alpha) \quad \text{Stochastic}$$
$$\mu = g(X, \beta) \quad \text{Systematic}$$

Two types of uncertainty:

- Estimation uncertainty: lack of knowledge about $\alpha$ and $\beta$; can be reduced by increasing $n$.
- Fundamental uncertainty: represented by the stochastic component; exists independently of the researcher.
Ordinary Least Squares (OLS)

For the linear model, the most popular method of estimation is ordinary least squares (OLS). $\hat{\beta}_{OLS}$ are those estimates of $\beta$ that minimize the sum of squared residuals, $e'e$. Under the assumptions below, OLS is the best linear unbiased estimator (BLUE).
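Before the formal derivation below, the minimization itself can be checked numerically: a general-purpose optimizer applied to the sum of squared residuals recovers the same estimates as the closed-form solution derived in the following slides. A minimal sketch in R, with simulated data (all names and data-generating values here are illustrative assumptions):

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                          # design matrix with intercept column
ssr <- function(b) sum((y - X %*% b)^2)   # e'e as a function of a candidate beta
optim(c(0, 0), ssr)$par                   # numerical minimizer of e'e
solve(t(X) %*% X) %*% t(X) %*% y          # closed-form OLS (derived below)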
Assumptions: specification

- Linear in parameters (i.e. $f(X, \beta) = X\beta$, so $E(y) = X\beta$)
- No extraneous variables in $X$
- No omitted independent variables
- Parameters to be estimated are constant
- Number of parameters is less than the number of cases, $k < n$

Note that this does not imply that you cannot include non-linearly transformed variables; e.g. $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ is still linear in the parameters and can be estimated with OLS.
Assumptions: errors

- Errors have an expected value of zero: $E(\varepsilon \mid X) = 0$
- Errors are normally distributed: $\varepsilon \sim N(0, \sigma^2)$
- Errors have a constant variance: $\text{var}(\varepsilon \mid X) = \sigma^2 < \infty$
- Errors are not autocorrelated: $\text{cov}(\varepsilon_i, \varepsilon_j \mid X) = 0 \;\forall\; i \neq j$
- Errors and $X$ are uncorrelated: $\text{cov}(X, \varepsilon) = 0$
Assumptions: regressors

- $X$ varies
- $X$ is of full column rank (note: this requires $k < n$)
- No measurement error in $X$
- No endogenous variables in $X$
$$y = X\beta + \varepsilon$$

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{n \times 1} = \underbrace{\begin{bmatrix} 1 & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n2} & x_{n3} & \cdots & x_{nk} \end{bmatrix}}_{n \times k} \underbrace{\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}}_{k \times 1} + \underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}}_{n \times 1}$$

$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{n \times 1} = \underbrace{\begin{bmatrix} \beta_1 + \beta_2 x_{12} + \beta_3 x_{13} + \cdots + \beta_k x_{1k} \\ \beta_1 + \beta_2 x_{22} + \beta_3 x_{23} + \cdots + \beta_k x_{2k} \\ \vdots \\ \beta_1 + \beta_2 x_{n2} + \beta_3 x_{n3} + \cdots + \beta_k x_{nk} \end{bmatrix}}_{n \times 1} + \underbrace{\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}}_{n \times 1}$$
Deriving $\hat{\beta}_{OLS}$

$$y = X\beta + \varepsilon$$
$$y = X\hat{\beta} + e$$
$$\hat{\beta}_{OLS} = \underset{\hat{\beta}}{\text{argmin}}\; e'e = \underset{\hat{\beta}}{\text{argmin}}\; (y - X\hat{\beta})'(y - X\hat{\beta})$$
Deriving $\hat{\beta}_{OLS}$

$$\begin{aligned}
e'e &= (y - X\hat{\beta})'(y - X\hat{\beta}) \\
&= (y' - (X\hat{\beta})')(y - X\hat{\beta}) \\
&= y'y - (X\hat{\beta})'y - y'X\hat{\beta} + (X\hat{\beta})'X\hat{\beta} \\
&= y'y - \hat{\beta}'X'y - y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta} \\
&= y'y - 2y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta}
\end{aligned}$$

(The last step uses that $\hat{\beta}'X'y = (y'X\hat{\beta})'$ is a scalar, so the two middle terms are equal.)
Deriving $\hat{\beta}_{OLS}$

$$\hat{\beta}_{OLS} = \underset{\hat{\beta}}{\text{argmin}}\;(e'e) \quad\Rightarrow\quad \left.\frac{\partial\, e'e}{\partial \hat{\beta}}\right|_{\hat{\beta}_{OLS}} = 0$$

$$\left.\frac{\partial\,(y'y - 2y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta})}{\partial \hat{\beta}}\right|_{\hat{\beta}_{OLS}} = 0$$
Deriving $\hat{\beta}_{OLS}$

$$\begin{aligned}
\left.\frac{\partial\,(y'y - 2y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta})}{\partial \hat{\beta}}\right|_{\hat{\beta}_{OLS}} &= 0 \\
2X'X\hat{\beta}_{OLS} - 2X'y &= 0 \\
2X'X\hat{\beta}_{OLS} &= 2X'y \\
X'X\hat{\beta}_{OLS} &= X'y \\
(X'X)^{-1}X'X\hat{\beta}_{OLS} &= (X'X)^{-1}X'y \\
\hat{\beta}_{OLS} &= (X'X)^{-1}X'y
\end{aligned}$$
$$X'X\hat{\beta}_{OLS} = X'y$$

Written out, with $\hat{\beta}$ referring to the OLS estimates:

$$\underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{12} & x_{22} & \cdots & x_{n2} \\ x_{13} & x_{23} & \cdots & x_{n3} \\ \vdots & \vdots & & \vdots \\ x_{1k} & x_{2k} & \cdots & x_{nk} \end{bmatrix}}_{k \times n} \underbrace{\begin{bmatrix} 1 & x_{12} & x_{13} & \cdots & x_{1k} \\ 1 & x_{22} & x_{23} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n2} & x_{n3} & \cdots & x_{nk} \end{bmatrix}}_{n \times k} \underbrace{\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_k \end{bmatrix}}_{k \times 1} = \underbrace{\begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1k} & x_{2k} & \cdots & x_{nk} \end{bmatrix}}_{k \times n} \underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{n \times 1}$$

Multiplying out (where $\sum$ refers to $\sum_i^n$):

$$\underbrace{\begin{bmatrix} n & \sum x_{i2} & \sum x_{i3} & \cdots & \sum x_{ik} \\ \sum x_{i2} & \sum (x_{i2})^2 & \sum x_{i2}x_{i3} & \cdots & \sum x_{i2}x_{ik} \\ \sum x_{i3} & \sum x_{i3}x_{i2} & \sum (x_{i3})^2 & \cdots & \sum x_{i3}x_{ik} \\ \vdots & \vdots & \vdots & & \vdots \\ \sum x_{ik} & \sum x_{ik}x_{i2} & \sum x_{ik}x_{i3} & \cdots & \sum (x_{ik})^2 \end{bmatrix}}_{k \times k} \underbrace{\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3 \\ \vdots \\ \hat{\beta}_k \end{bmatrix}}_{k \times 1} = \underbrace{\begin{bmatrix} \sum y_i \\ \sum x_{i2} y_i \\ \sum x_{i3} y_i \\ \vdots \\ \sum x_{ik} y_i \end{bmatrix}}_{k \times 1}$$
$$X'X\hat{\beta}_{OLS} = X'y$$

So this can be seen as a set of linear equations to solve:

$$\begin{aligned}
\hat{\beta}_1 n + \hat{\beta}_2 \sum x_{i2} + \cdots + \hat{\beta}_k \sum x_{ik} &= \sum y_i \\
\hat{\beta}_1 \sum x_{i2} + \hat{\beta}_2 \sum (x_{i2})^2 + \cdots + \hat{\beta}_k \sum x_{i2}x_{ik} &= \sum x_{i2} y_i \\
\hat{\beta}_1 \sum x_{i3} + \hat{\beta}_2 \sum x_{i3}x_{i2} + \cdots + \hat{\beta}_k \sum x_{i3}x_{ik} &= \sum x_{i3} y_i \\
&\;\;\vdots \\
\hat{\beta}_1 \sum x_{ik} + \hat{\beta}_2 \sum x_{ik}x_{i2} + \cdots + \hat{\beta}_k \sum (x_{ik})^2 &= \sum x_{ik} y_i
\end{aligned}$$
$$X'X\hat{\beta}_{OLS} = X'y$$

When there is only one independent variable, this reduces to:

$$\begin{aligned}
\hat{\beta}_1 n + \hat{\beta}_2 \sum x_i &= \sum y_i \\
\hat{\beta}_1 \sum x_i + \hat{\beta}_2 \sum x_i^2 &= \sum x_i y_i
\end{aligned}$$

Solving the first equation for $\hat{\beta}_1$:

$$\hat{\beta}_1 = \frac{\sum y_i - \hat{\beta}_2 \sum x_i}{n} = \frac{n\bar{y} - \hat{\beta}_2 n\bar{x}}{n} = \bar{y} - \hat{\beta}_2 \bar{x}$$

Substituting into the second equation:

$$(\bar{y} - \hat{\beta}_2 \bar{x})\sum x_i + \hat{\beta}_2 \sum x_i^2 = \sum x_i y_i$$
$$n\bar{y}\bar{x} - \hat{\beta}_2 n\bar{x}^2 + \hat{\beta}_2 \sum x_i^2 = \sum x_i y_i$$
$$\hat{\beta}_2 = \frac{\sum x_i y_i - n\bar{y}\bar{x}}{\sum x_i^2 - n\bar{x}^2}$$
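These scalar formulas are easy to verify in R (a minimal sketch with simulated data; the data-generating values are illustrative assumptions):

set.seed(2)
x <- rnorm(30)
y <- 3 + 1.5 * x + rnorm(30)
b2 <- (sum(x * y) - length(x) * mean(x) * mean(y)) /
  (sum(x^2) - length(x) * mean(x)^2)      # slope from the normal equations
b1 <- mean(y) - b2 * mean(x)              # intercept
c(b1, b2)                                 # compare: coef(lm(y ~ x))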
Simple regression

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

If $\beta_1 = 0$ and the only regressor is the intercept, then $\hat{\beta}_0 = \bar{y}$.

If $\beta_0 = 0$, so that there is no intercept and one explanatory variable $x$, then

$$\hat{\beta}_1 = \frac{\sum_i^n x_i y_i}{\sum_i^n x_i^2}$$

If there is an intercept and one explanatory variable, then

$$\hat{\beta}_1 = \frac{\sum_i^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_i^n (x_i - \bar{x})^2}$$
Demeaning

If the observations are expressed as deviations from their means, i.e.

$$y_i^* = y_i - \bar{y} \qquad x_i^* = x_i - \bar{x},$$

then

$$\hat{\beta}_1 = \frac{\sum_i^n x_i^* y_i^*}{\sum_i^n x_i^{*2}}$$

The intercept can be estimated as $\bar{y} - \hat{\beta}_1 \bar{x}$.
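A quick check that demeaning leaves the slope unchanged (a sketch, reusing x, y, and b2 from the snippet above):

x.star <- x - mean(x)
y.star <- y - mean(y)
sum(x.star * y.star) / sum(x.star^2)      # equals b2
mean(y) - b2 * mean(x)                    # recovers the intercept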
Standardized variables

The coefficient on a variable is interpreted as the effect on $y$ of a one-unit increase in $x$, so the interpretation depends on the scale of measurement. One option is to standardize the variables:

$$y^* = \frac{y - \bar{y}}{\sigma_y} \qquad x^* = \frac{x - \bar{x}}{\sigma_x}$$
$$y^* = X^*\beta^* + \varepsilon$$

and interpret the coefficients as the effect on $y$, expressed in standard deviations, as $x$ increases by one standard deviation.
Standardized variables

library(arm)
summary(standardize(lm(y ~ x), binary.inputs="leave.alone"))

See also Andrew Gelman (2007), "Scaling regression inputs by dividing by two standard deviations."
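Note that arm's standardize() follows the Gelman (2007) convention of dividing by two standard deviations; the one-standard-deviation version in the formula above can be done by hand with scale(). A sketch, reusing the simulated x and y from the snippets above (or any numeric data):

coef(lm(scale(y) ~ scale(x)))[2]   # effect in sds of y per sd increase in x
cor(x, y)                          # identical in the single-regressor case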
Exercise: teenage gambling

library(faraway)
data(teengamb)
summary(teengamb)

Using matrix formulas (one possible setup is sketched below),

1. regress gamble on a constant
2. regress gamble on sex
3. regress gamble on sex, status, income, verbal
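A starter sketch for part 2, assuming the closed-form formula derived above is applied directly (parts 1 and 3 differ only in which columns are bound into X):

library(faraway)
data(teengamb)
y <- teengamb$gamble
X <- cbind(1, teengamb$sex)        # constant plus sex
solve(t(X) %*% X) %*% t(X) %*% y   # compare: coef(lm(gamble ~ sex, data = teengamb))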
Exercise: teenage gambling

(Some building blocks are sketched below.)

1. Regress gamble on sex, status, income, verbal
2. Which observation has the largest residual?
3. Compute the mean and median of the residuals
4. Compute the correlation between the residuals and the fitted values
5. Compute the correlation between the residuals and income
6. All other predictors held constant, what would be the difference in predicted expenditure on gambling between males and females?
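A few building blocks for these questions (a sketch, not a full solution, assuming teengamb is loaded as above and using lm() for brevity):

fit <- lm(gamble ~ sex + status + income + verbal, data = teengamb)
e <- residuals(fit)
which.max(abs(e))          # observation with the largest (absolute) residual
mean(e); median(e)
cor(e, fitted(fit))        # essentially zero by construction
cor(e, teengamb$income)    # also essentially zero: income is a regressor
coef(fit)["sex"]           # difference by sex, all other predictors held constant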
Deriving $\text{var}(\hat{\beta}_{OLS})$

$$\begin{aligned}
\text{var}(\hat{\beta}_{OLS}) &= E[(\hat{\beta}_{OLS} - E(\hat{\beta}_{OLS}))(\hat{\beta}_{OLS} - E(\hat{\beta}_{OLS}))'] \\
&= E[(\hat{\beta}_{OLS} - \beta)(\hat{\beta}_{OLS} - \beta)']
\end{aligned}$$
Deriving $\text{var}(\hat{\beta}_{OLS})$

$$\begin{aligned}
\hat{\beta}_{OLS} - \beta &= (X'X)^{-1}X'y - \beta \\
&= (X'X)^{-1}X'(X\beta + \varepsilon) - \beta \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= \beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= (X'X)^{-1}X'\varepsilon
\end{aligned}$$
Deriving $\text{var}(\hat{\beta}_{OLS})$

$$\begin{aligned}
\text{var}(\hat{\beta}_{OLS}) &= E[(\hat{\beta}_{OLS} - \beta)(\hat{\beta}_{OLS} - \beta)'] \\
&= E[((X'X)^{-1}X'\varepsilon)((X'X)^{-1}X'\varepsilon)'] \\
&= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}] \\
&= (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1} \\
&= (X'X)^{-1}X'\sigma^2 I X(X'X)^{-1} \\
&= \sigma^2 (X'X)^{-1}X'X(X'X)^{-1} \\
&= \sigma^2 (X'X)^{-1}
\end{aligned}$$
Standard errors

$$\text{var}(\hat{\beta}_{OLS}) = \sigma^2 (X'X)^{-1}$$

The standard errors of $\hat{\beta}_{OLS}$ are the square roots of the diagonal of this matrix. In the simple case where $y = \beta_0 + \beta_1 x$, this gives

$$\text{var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_i^n (x_i - \bar{x})^2}$$

Note how an increase in the variance of $x$ leads to a decrease in the standard error of $\hat{\beta}_1$.
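These standard errors can be computed directly from the formula, using the estimate of $\sigma^2$ derived below (a minimal sketch with simulated data; all names and values are illustrative assumptions):

set.seed(42)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)
b.hat <- solve(t(X) %*% X) %*% t(X) %*% y
e <- y - X %*% b.hat
s2.hat <- as.numeric(t(e) %*% e) / (n - 2)  # k = 2 parameters here
sqrt(diag(s2.hat * solve(t(X) %*% X)))      # compare: summary(lm(y ~ x))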
$$\begin{aligned}
e &= y - X\hat{\beta}_{OLS} \\
&= y - X(X'X)^{-1}X'y \\
&= (I - X(X'X)^{-1}X')y \\
&= My \\
&= M(X\beta + \varepsilon) \\
&= (I - X(X'X)^{-1}X')X\beta + M\varepsilon \\
&= X\beta - X(X'X)^{-1}X'X\beta + M\varepsilon \\
&= X\beta - X\beta + M\varepsilon \\
&= M\varepsilon
\end{aligned}$$
$$e'e = (M\varepsilon)'(M\varepsilon) = \varepsilon'M'M\varepsilon = \varepsilon'M\varepsilon$$

(using that $M$ is symmetric and idempotent, so $M'M = M$), hence

$$E[e'e \mid X] = E[\varepsilon'M\varepsilon \mid X]$$

$\varepsilon'M\varepsilon$ is a scalar, so if you see it as a $1 \times 1$ matrix, it is equal to its trace; therefore

$$\begin{aligned}
E[e'e \mid X] &= E[\text{tr}(\varepsilon'M\varepsilon) \mid X] \\
&= E[\text{tr}(M\varepsilon\varepsilon') \mid X] \\
&= \text{tr}(M\, E[\varepsilon\varepsilon' \mid X]) \\
&= \text{tr}(M\sigma^2 I) \\
&= \sigma^2\, \text{tr}(M)
\end{aligned}$$
tr(m) = tr(i X(X X) 1 X ) = tr(i) tr(x(x X) 1 X ) = n tr(x X(X X) 1 ) = n k E(e e X) = (n k)σ 2 ˆσ 2 = e e n k
OLS in R

n <- dim(X)[1]                              # number of observations
k <- dim(X)[2]                              # number of parameters (columns of X)
b.hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y
e <- y - X %*% b.hat                        # residuals
s2.hat <- as.numeric(t(e) %*% e) / (n - k)  # sigma^2-hat = e'e/(n-k), as a scalar
v.hat <- s2.hat * solve(t(X) %*% X)         # estimated var(b.hat)

Or: summary(lm(y ~ X))
Exercise: US wages

library(faraway)
data(uswages)
summary(uswages)

1. Using matrix formulas, regress wage on educ, exper and race.
2. Interpret the results
3. Plot residuals against fitted values and against educ
4. Repeat with log(wage) as dependent variable
Unbiasedness of $\hat{\beta}_{OLS}$

From deriving $\text{var}(\hat{\beta}_{OLS})$ we have:

$$\begin{aligned}
\hat{\beta}_{OLS} - \beta &= (X'X)^{-1}X'y - \beta \\
&= (X'X)^{-1}X'(X\beta + \varepsilon) - \beta \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= \beta + (X'X)^{-1}X'\varepsilon - \beta \\
&= (X'X)^{-1}X'\varepsilon
\end{aligned}$$

Since we assume $E(\varepsilon) = 0$:

$$E(\hat{\beta}_{OLS} - \beta) = (X'X)^{-1}X'E(\varepsilon) = 0$$

Therefore, $\hat{\beta}_{OLS}$ is an unbiased estimator of $\beta$.
Efficiency of $\hat{\beta}_{OLS}$

The Gauss-Markov theorem states that there is no linear unbiased estimator of $\beta$ that has a smaller sampling variance than $\hat{\beta}_{OLS}$, i.e. $\hat{\beta}_{OLS}$ is BLUE. An estimator is linear iff it can be expressed as a linear function of the data on the dependent variable:

$$\hat{\beta}_j^{linear} = \sum_i^n f(x_{ij})\, y_i,$$

which is the case for OLS:

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'y = Cy$$

(Wooldridge, pp. 101-102, 111-112)
Gauss-Markov Theorem

$$\hat{\beta}_{OLS} = (X'X)^{-1}X'y = Cy$$

Imagine we have another linear estimator: $\hat{\beta}^* = Wy$, where $W = C + D$. Then

$$\begin{aligned}
\hat{\beta}^* &= (C + D)y = Cy + Dy \\
&= \hat{\beta}_{OLS} + D(X\beta + \varepsilon) \\
&= \hat{\beta}_{OLS} + DX\beta + D\varepsilon
\end{aligned}$$
Gauss-Markov Theorem

$$\begin{aligned}
E(\hat{\beta}^*) &= E(\hat{\beta}_{OLS} + DX\beta + D\varepsilon) \\
&= E(\hat{\beta}_{OLS}) + DX\beta + E(D\varepsilon) \\
&= \beta + DX\beta + 0
\end{aligned}$$

$DX\beta = 0 \Rightarrow DX = 0$, because both $\hat{\beta}_{OLS}$ and $\hat{\beta}^*$ are unbiased, so $E(\hat{\beta}_{OLS}) = E(\hat{\beta}^*) = \beta$.

(Hayashi, pp. 29-30)
Gauss-Markov Theorem

$$\hat{\beta}^* = \hat{\beta}_{OLS} + D\varepsilon$$
$$\hat{\beta}^* - \beta = \hat{\beta}_{OLS} + D\varepsilon - \beta = (C + D)\varepsilon$$

$$\begin{aligned}
\text{var}(\hat{\beta}^*) &= E[(\hat{\beta}^* - \beta)(\hat{\beta}^* - \beta)'] \\
&= E[((C + D)\varepsilon)((C + D)\varepsilon)'] \\
&= (C + D)E(\varepsilon\varepsilon')(C + D)' \\
&= \sigma^2 (C + D)(C + D)' \\
&= \sigma^2 (CC' + DC' + CD' + DD') \\
&= \sigma^2 ((X'X)^{-1} + DD') \\
&\geq \sigma^2 (X'X)^{-1}
\end{aligned}$$

The cross terms vanish because $DX = 0$: $DC' = DX(X'X)^{-1} = 0$ and likewise $CD' = 0$. Since $DD'$ is positive semidefinite, any competing linear unbiased estimator has variance exceeding that of OLS by $\sigma^2 DD'$.
Consistency of $\hat{\beta}_{OLS}$

$$\begin{aligned}
\hat{\beta}_{OLS} &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + \varepsilon) \\
&= \beta + (X'X)^{-1}X'\varepsilon \\
&= \beta + \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\varepsilon\right)
\end{aligned}$$

$$\lim_{n\to\infty} \frac{1}{n}X'X = Q \;\Rightarrow\; \lim_{n\to\infty}\left(\frac{1}{n}X'X\right)^{-1} = Q^{-1}$$

$$E\left(\frac{1}{n}X'\varepsilon\right) = E\left(\frac{1}{n}\sum_i^n x_i\varepsilon_i\right) = 0$$
Consistency of $\hat{\beta}_{OLS}$

$$\begin{aligned}
\text{var}\left(\frac{1}{n}X'\varepsilon\right) &= E\left[\frac{1}{n}X'\varepsilon\left(\frac{1}{n}X'\varepsilon\right)'\right] \\
&= \frac{1}{n}X'E(\varepsilon\varepsilon')X\frac{1}{n} \\
&= \left(\frac{\sigma^2}{n}\right)\left(\frac{X'X}{n}\right)
\end{aligned}$$

$$\lim_{n\to\infty} \text{var}\left(\frac{1}{n}X'\varepsilon\right) = 0 \cdot Q = 0$$
Consistency of $\hat{\beta}_{OLS}$

$$\hat{\beta}_{OLS} = \beta + \left(\frac{1}{n}X'X\right)^{-1}\left(\frac{1}{n}X'\varepsilon\right)$$

Together,

$$\lim_{n\to\infty}\left(\frac{1}{n}X'X\right)^{-1} = Q^{-1}, \qquad E\left(\frac{1}{n}X'\varepsilon\right) = 0, \qquad \lim_{n\to\infty}\text{var}\left(\frac{1}{n}X'\varepsilon\right) = 0$$

imply that the sampling distribution of $\hat{\beta}_{OLS}$ collapses to $\beta$ as $n$ becomes very large, i.e.:

$$\underset{n\to\infty}{\text{plim}}\; \hat{\beta}_{OLS} = \beta$$
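The collapse of the sampling distribution can be illustrated by simulation (a sketch; the data-generating process and sample sizes are illustrative assumptions):

set.seed(7)
beta <- 2
for (n in c(25, 250, 2500)) {
  b <- replicate(1000, {
    x <- rnorm(n)
    y <- 1 + beta * x + rnorm(n)
    coef(lm(y ~ x))[2]      # slope estimate from one simulated sample
  })
  cat("n =", n, "mean =", round(mean(b), 3), "sd =", round(sd(b), 4), "\n")
}

The mean of the estimates stays near beta (unbiasedness) while their spread shrinks towards zero as n grows (consistency).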
Sums of squares

- SST, total sum of squares: $\sum (y_i - \bar{y})^2$
- SSE, explained sum of squares: $\sum (\hat{y}_i - \bar{y})^2$
- SSR, residual sum of squares: $\sum e_i^2 = \sum (\hat{y}_i - y_i)^2 = e'e$

The key to remember is that SST = SSE + SSR. Sometimes, instead of explained and residual, regression and error are used, respectively, so that the abbreviations are swapped (!).
R²

Defined in terms of sums of squares:

$$R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

Interpretation: the proportion of the variation in $y$ that is explained linearly by the independent variables. A much over-used statistic: it may not be what we are interested in at all.
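These identities are easy to verify in R (a sketch with simulated data; all names and values are illustrative assumptions):

set.seed(3)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSE <- sum((fitted(fit) - mean(y))^2)
SSR <- sum(residuals(fit)^2)
c(SSE/SST, 1 - SSR/SST, summary(fit)$r.squared)  # all equal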
$R^2$ in matrix algebra

$$\begin{aligned}
y'y &= (\hat{y} + e)'(\hat{y} + e) \\
&= \hat{y}'\hat{y} + \hat{y}'e + e'\hat{y} + e'e \\
&= \hat{y}'\hat{y} + 2\hat{\beta}'X'e + e'e \\
&= \hat{y}'\hat{y} + e'e
\end{aligned}$$

since $X'e = 0$ by the normal equations, so

$$R^2 = 1 - \frac{e'e}{y'y} = \frac{\hat{y}'\hat{y}}{y'y}$$

(Hayashi, p. 20)
- When a model has no intercept, it is possible for $R^2$ to lie outside the interval $(0, 1)$.
- $R^2$ rises with the addition of more explanatory variables. For this reason we often report the adjusted $R^2$:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k}$$

- $R^2$ values from different $y$ samples cannot be compared.
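The adjustment is easy to verify (a sketch, reusing fit and y from the R² check above; k is the number of estimated coefficients, including the intercept):

n <- length(y)
k <- length(coef(fit))
r2 <- summary(fit)$r.squared
1 - (1 - r2) * (n - 1)/(n - k)   # compare: summary(fit)$adj.r.squared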
Exercise: US wages

library(faraway)
data(uswages)
summary(uswages)

1. Using matrix formulas, regress wage on educ, exper and race.
2. What proportion of the variance in wage is explained by these three variables?