Econ 2120: Section 2, Part I - Linear Predictor Loose Ends
Ashesh Rambachan
Fall 2018
Outline
Big Picture
Matrix Version of the Linear Predictor and Least Squares Fit
  Linear Predictor
  Least Squares
  Omitted Variable Bias Formula
Residual Regression
Useful Trick
Orthogonality Conditions
Goal of the first half of the class: get you to think in terms of projections.
Best linear predictor: projection onto the space of linear functions of $X$.
Conditional expectation: projection onto the space of all functions of $X$.
Key: there is a one-to-one map between the projection problem and its orthogonality conditions.
Orthogonality Conditions
We can always (trivially) write $Y = g(X) + U$ with $U = Y - g(X)$. To understand what $g(X)$ describes, you just need to know the orthogonality condition that $U$ satisfies. Two cases:
(1) $g(X) = \beta_0 + \beta_1 X$ and $E[U] = E[UX] = 0 \implies g(X)$ is the best linear predictor.
(2) $E[U\, h(X)] = 0$ for all $h(\cdot) \implies g(X) = E[Y \mid X]$.
Orthogonality Conditions
Why is this useful? If you can transform your minimization problem into a projection problem, e.g.
$$\min_{\beta_0, \beta_1} E[(Y - \beta_0 - \beta_1 X)^2] = \min_{\beta_0, \beta_1} \| Y - \beta_0 - \beta_1 X \|^2,$$
you have a whole new set of (very) useful tools. By the projection theorem, we can characterize the solution by a set of orthogonality conditions AND we have uniqueness.
Best Linear Predictor
Why are we emphasizing the best linear predictor $E^*[Y \mid X]$ so much?
In some sense, it can always be computed. It requires very few assumptions beyond the existence of second moments of the joint distribution. You can always go to Stata, type reg y x, and get something that has the interpretation of a best linear predictor. Regression output = estimated coefficients of the best linear predictor.
BUT: it is a completely separate question whether this is an interesting interpretation. To say more, you need to assume more.
Linear Predictor
$X = (X_1, \ldots, X_K)'$ is a $K \times 1$ vector and $\beta = (\beta_1, \ldots, \beta_K)'$ is the $K \times 1$ vector of coefficients of the best linear predictor. The coefficients are characterized by $K$ orthogonality conditions
$$E[(Y - X'\beta) X_j] = E[X_j (Y - X'\beta)] = 0 \quad \text{for } j = 1, \ldots, K.$$
Rewrite in vector form to get
$$E[\underbrace{X}_{K \times 1} \underbrace{(Y - X'\beta)}_{1 \times 1}] = 0,$$
a system of $K$ linear equations. Provided $E[XX']$ is non-singular,
$$E[XY] - E[XX']\,\beta = 0 \;\implies\; \beta = E[XX']^{-1} E[XY].$$
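A minimal numerical sketch of this formula (numpy assumed; the data-generating process and coefficient values below are made up for illustration): approximate $E[XX']$ and $E[XY]$ with a very large simulated sample and solve the linear system.

```python
import numpy as np

# Approximate the population moments E[XX'] and E[XY] with a large simulated sample,
# then apply beta = E[XX']^{-1} E[XY]. The DGP below is purely illustrative.
rng = np.random.default_rng(0)
n = 1_000_000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])  # K = 3, includes a constant
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=n)  # mean-zero noise, so the BLP equals the linear model

Exx = X.T @ X / n   # approximates E[XX']
Exy = X.T @ Y / n   # approximates E[XY]
beta = np.linalg.solve(Exx, Exy)
print(beta)         # close to beta_true
```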
Least Squares
Data are $(y_i, x_{i1}, \ldots, x_{iK})$ for $i = 1, \ldots, n$. Let $x_i = (x_{i1}, \ldots, x_{iK})$ be the $1 \times K$ vector of covariates for the $i$-th observation and let $x^j = (x_{1j}, \ldots, x_{nj})'$ be the $n \times 1$ vector of observations of the $j$-th covariate. Define
$$\underbrace{y}_{n \times 1} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad \underbrace{b}_{K \times 1} = \begin{pmatrix} b_1 \\ \vdots \\ b_K \end{pmatrix}, \quad \underbrace{x}_{n \times K} = \begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} x^1 & \cdots & x^K \end{pmatrix}.$$
The $n \times K$ matrix $x$ is sometimes referred to as the design matrix.
Least Squares
The least-squares coefficients $b_j$ are defined by $K$ orthogonality conditions
$$\underbrace{(x^j)'}_{1 \times n} \underbrace{(y - xb)}_{n \times 1} = 0 \quad \text{for } j = 1, \ldots, K.$$
Stack these orthogonality conditions to get
$$\begin{pmatrix} (x^1)' \\ \vdots \\ (x^K)' \end{pmatrix} \underbrace{(y - xb)}_{n \times 1} = \underbrace{x'}_{K \times n} (y - xb) = 0,$$
a system of $K$ linear equations. Provided $\underbrace{x'x}_{K \times K}$ is non-singular,
$$\underbrace{b}_{K \times 1} = (x'x)^{-1} x'y.$$
Least Squares
Can show (just algebra) that
$$x'x = \sum_{i=1}^n x_i' x_i, \qquad x'y = \sum_{i=1}^n x_i' y_i.$$
So we can also write the least-squares coefficients as
$$b = \left( \sum_{i=1}^n x_i' x_i \right)^{-1} \left( \sum_{i=1}^n x_i' y_i \right) = \left( \frac{1}{n} \sum_{i=1}^n x_i' x_i \right)^{-1} \left( \frac{1}{n} \sum_{i=1}^n x_i' y_i \right).$$
This formula is extremely useful when we discuss asymptotics.
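A minimal sketch of the sample formula (made-up data, numpy assumed): the stacked formula $(x'x)^{-1} x'y$, the averaged-sums version, and a library least-squares routine all return the same coefficients.

```python
import numpy as np

# The matrix formula, the averaged-sums version, and numpy's least-squares routine agree.
rng = np.random.default_rng(1)
n, K = 200, 3
x = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # design matrix with a constant
y = x @ np.array([0.5, 1.0, -2.0]) + rng.normal(size=n)

b_matrix = np.linalg.solve(x.T @ x, x.T @ y)            # (x'x)^{-1} x'y
b_sums = np.linalg.solve((x.T @ x) / n, (x.T @ y) / n)  # averaged sums, same answer
b_lstsq, *_ = np.linalg.lstsq(x, y, rcond=None)
print(np.allclose(b_matrix, b_sums), np.allclose(b_matrix, b_lstsq))  # True True
```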
OVB formula (again)
Short linear predictor of $Y$:
$$E^*[Y \mid X_1, \ldots, X_{K-1}] = \alpha_1 X_1 + \ldots + \alpha_{K-1} X_{K-1}.$$
Long linear predictor of $Y$:
$$E^*[Y \mid X_1, \ldots, X_K] = \beta_1 X_1 + \ldots + \beta_K X_K.$$
Auxiliary linear predictor of $X_K$:
$$E^*[X_K \mid X_1, \ldots, X_{K-1}] = \gamma_1 X_1 + \ldots + \gamma_{K-1} X_{K-1}.$$
OVB formula (again)
Let $\tilde{X}_K = X_K - \gamma_1 X_1 - \ldots - \gamma_{K-1} X_{K-1}$. Then
$$X_K = \gamma_1 X_1 + \ldots + \gamma_{K-1} X_{K-1} + \tilde{X}_K,$$
$$Y = \beta_1^* X_1 + \ldots + \beta_{K-1}^* X_{K-1} + \beta_K \tilde{X}_K + U,$$
where $U = Y - E^*[Y \mid X_1, \ldots, X_K]$ and $\beta_j^* = \beta_j + \gamma_j \beta_K$ for $j = 1, \ldots, K-1$.
OVB formula (again)
Note that
$$E[(Y - \beta_1^* X_1 - \ldots - \beta_{K-1}^* X_{K-1}) X_j] = E[(\beta_K \tilde{X}_K + U) X_j] = 0 \quad \text{for } j = 1, \ldots, K-1.$$
These are exactly the orthogonality conditions that define the short linear predictor. So
$$\alpha_j = \beta_j^* = \beta_j + \gamma_j \beta_K \quad \text{for } j = 1, \ldots, K-1.$$
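A small numerical check of the OVB formula (made-up data, numpy assumed): fit the long, short, and auxiliary regressions by least squares and verify $\alpha_j = \beta_j + \gamma_j \beta_K$, which holds exactly in the sample analog.

```python
import numpy as np

# The OVB formula alpha_j = beta_j + gamma_j * beta_K holds exactly for least-squares fits.
rng = np.random.default_rng(2)
n = 500
x1 = np.ones(n)
x2 = rng.normal(size=n)
x3 = 0.7 * x2 + rng.normal(size=n)                   # x3 correlated with x2 on purpose
y = 1.0 + 2.0 * x2 - 1.5 * x3 + rng.normal(size=n)

X_long = np.column_stack([x1, x2, x3])
X_short = np.column_stack([x1, x2])

beta, *_ = np.linalg.lstsq(X_long, y, rcond=None)    # long: coefficients on (1, x2, x3)
alpha, *_ = np.linalg.lstsq(X_short, y, rcond=None)  # short: coefficients on (1, x2)
gamma, *_ = np.linalg.lstsq(X_short, x3, rcond=None) # auxiliary: x3 on (1, x2)

print(np.allclose(alpha, beta[:2] + gamma * beta[2]))  # True
```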
Residual Regression
Very useful trick that is used all the time. Consider the best linear predictor with $K$ covariates,
$$E^*[Y \mid X_1, \ldots, X_K] = \beta_1 X_1 + \ldots + \beta_K X_K,$$
and focus on $\beta_K$. Is there a simple closed-form expression for $\beta_K$?
Yes! Use residual regression.
Residual Regression
Form the auxiliary linear predictor of $X_K$ given $X_1, \ldots, X_{K-1}$. Denote this as
$$E^*[X_K \mid X_1, \ldots, X_{K-1}] = \gamma_1 X_1 + \ldots + \gamma_{K-1} X_{K-1}.$$
The associated residual is $\tilde{X}_K = X_K - \gamma_1 X_1 - \ldots - \gamma_{K-1} X_{K-1}$.
Theorem (Residual Regression): $\beta_K$ can be written as
$$\beta_K = \frac{E[Y \tilde{X}_K]}{E[\tilde{X}_K^2]}.$$
Residual Regression: proof
By definition, $X_K = \gamma_1 X_1 + \ldots + \gamma_{K-1} X_{K-1} + \tilde{X}_K$. Substitute into $E^*[Y \mid X_1, \ldots, X_K]$ to get
$$E^*[Y \mid X_1, \ldots, X_K] = \beta_1^* X_1 + \ldots + \beta_{K-1}^* X_{K-1} + \beta_K \tilde{X}_K,$$
where $\beta_j^* = \beta_j + \gamma_j \beta_K$ for $j = 1, \ldots, K-1$.
Residual Regression: proof
$\tilde{X}_K$ is a linear combination of $X_1, \ldots, X_K$, so it is orthogonal to the residual $Y - E^*[Y \mid X_1, \ldots, X_K]$:
$$E[(Y - \beta_1^* X_1 - \ldots - \beta_{K-1}^* X_{K-1} - \beta_K \tilde{X}_K) \tilde{X}_K] = 0.$$
$\tilde{X}_K$ is also orthogonal to $X_1, \ldots, X_{K-1}$, so the above simplifies to
$$E[(Y - \beta_K \tilde{X}_K) \tilde{X}_K] = 0 \;\implies\; \beta_K = \frac{E[Y \tilde{X}_K]}{E[\tilde{X}_K^2]}.$$
Residual Regression: intuition
The coefficient $\beta_K$ on $X_K$ is the coefficient of the best linear predictor of $Y$ given the residual $\tilde{X}_K$. If the conditional expectation is linear, $\beta_K$ is the "partial effect" of $X_K$ on $Y$ holding all else constant.
Exercise 1
Consider the long and short predictors in the population:
$$E^*[Y \mid 1, X_1, X_2, X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3,$$
$$E^*[Y \mid 1, X_1, X_2] = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2.$$
(1) Provide a formula relating $\alpha_2$ and $\beta_2$.
(2) Suppose that $X_3$ is uncorrelated with $X_1$ and $X_2$. Does $\alpha_2 = \beta_2$?
(3) Suppose that $\mathrm{Cov}(X_2, X_3) = 0$, $\mathrm{Cov}(X_1, X_3) \neq 0$, and $\mathrm{Cov}(X_1, X_2) \neq 0$. Does $\alpha_2 = \beta_2$?
Exercise 2
The partial covariance between the random variables $Y$ and $X_K$ given $X_1, \ldots, X_{K-1}$ is
$$\mathrm{Cov}^*(Y, X_K \mid X_1, \ldots, X_{K-1}) = E[\tilde{Y} \tilde{X}_K],$$
where
$$\tilde{Y} = Y - E^*[Y \mid X_1, \ldots, X_{K-1}], \qquad \tilde{X}_K = X_K - E^*[X_K \mid X_1, \ldots, X_{K-1}].$$
The partial variance of $Y$ is
$$V^*(Y \mid X_1, \ldots, X_{K-1}) = E[\tilde{Y}^2].$$
The partial correlation between $Y$ and $X_K$ is
$$\mathrm{Cor}^*(Y, X_K \mid X_1, \ldots, X_{K-1}) = \frac{E[\tilde{Y} \tilde{X}_K]}{\left( E[\tilde{Y}^2]\, E[\tilde{X}_K^2] \right)^{1/2}}.$$
Exercise 2 (continued)
(1) Consider the linear predictor
$$E^*[Y \mid 1, X_1, \ldots, X_K] = \beta_0 + \beta_1 X_1 + \ldots + \beta_K X_K.$$
Use our residual regression results to show that
$$\beta_K = \frac{\mathrm{Cov}^*(Y, X_K \mid 1, X_1, \ldots, X_{K-1})}{V^*(X_K \mid 1, X_1, \ldots, X_{K-1})}.$$
Residual Regression: sample analog
Consider the least-squares fit using $K$ covariates,
$$\hat{y}_i = b_1 x_{i1} + \ldots + b_K x_{iK}.$$
Following a similar argument, one can show that $b_K$ is the coefficient of the least-squares fit of $y$ on $\tilde{x}^K$, the vector of residuals from the least-squares fit of $x^K$ on $x^1, \ldots, x^{K-1}$, with $i$-th element $\tilde{x}_{iK}$. That is, $b_K$ can be written as
$$b_K = \frac{\frac{1}{n} \sum_{i=1}^n y_i \tilde{x}_{iK}}{\frac{1}{n} \sum_{i=1}^n \tilde{x}_{iK}^2}.$$
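A small numerical check of the sample analog (made-up data, numpy assumed): residualize the last covariate on the others, regress $y$ on the residuals, and compare with the coefficient from the full least-squares fit.

```python
import numpy as np

# The coefficient on the last covariate from the full least-squares fit equals the
# coefficient from regressing y on the residualized covariate.
rng = np.random.default_rng(3)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
xK = 0.5 * X[:, 1] + rng.normal(size=n)           # last covariate, correlated with the others
y = X @ np.array([1.0, -1.0, 0.3]) + 2.0 * xK + rng.normal(size=n)

b_full, *_ = np.linalg.lstsq(np.column_stack([X, xK]), y, rcond=None)

gamma, *_ = np.linalg.lstsq(X, xK, rcond=None)    # auxiliary fit of x^K on the other covariates
xK_tilde = xK - X @ gamma                         # residualized covariate
b_K = (y @ xK_tilde) / (xK_tilde @ xK_tilde)      # residual-regression formula

print(np.allclose(b_full[-1], b_K))               # True
```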
Frisch-Waugh-Lovell Theorem
Residual regression is typically known as the Frisch-Waugh-Lovell Theorem. We are interested in the regression
$$Y = X_1 \beta_1 + X_2 \beta_2 + u,$$
where $Y$ is an $n \times 1$ vector, $X_1$ is an $n \times K_1$ matrix, $X_2$ is an $n \times K_2$ matrix, and $u$ is an $n \times 1$ vector of residuals. How can we write the least-squares estimate of $\beta_2$?
Frisch-Waugh-Lovell Theorem
It is the same as the estimate in the modified regression
$$M_{X_1} Y = M_{X_1} X_2 \beta_2 + M_{X_1} u,$$
where
$$M_{X_1} = I - X_1 (X_1' X_1)^{-1} X_1'$$
is the orthogonal complement of the projection matrix $X_1 (X_1' X_1)^{-1} X_1'$: it projects onto the orthogonal complement of the space spanned by the columns of $X_1$.
So $\beta_2$ can be written as the coefficient in the regression of the residuals of $Y$ on the residuals of $X_2$.
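A small numerical check of the FWL theorem (made-up data, numpy assumed): build $M_{X_1}$, regress $M_{X_1} Y$ on $M_{X_1} X_2$, and compare with the coefficients on $X_2$ from the full regression.

```python
import numpy as np

# Regressing M_{X1} Y on M_{X1} X2 reproduces the coefficients on X2 from the full regression.
rng = np.random.default_rng(4)
n = 400
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])          # n x K1
X2 = np.column_stack([rng.normal(size=n), rng.normal(size=n)])  # n x K2
Y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([-2.0, 3.0]) + rng.normal(size=n)

b_full, *_ = np.linalg.lstsq(np.column_stack([X1, X2]), Y, rcond=None)

M_X1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)        # annihilates the column space of X1
b2, *_ = np.linalg.lstsq(M_X1 @ X2, M_X1 @ Y, rcond=None)

print(np.allclose(b_full[2:], b2))                              # True
```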
Useful Trick - Orthogonal Decomposition
Notice we used the same trick in the OVB and residual regression results: decompose a variable into two pieces, one that lives in a simple space and another that is orthogonal to that space.
Useful Trick - Orthogonal Decomposition
Example: take a random variable $Y$ and decompose it into
$$Y = E[Y \mid X] + \underbrace{(Y - E[Y \mid X])}_{U}.$$
$E[Y \mid X]$ lives in the space of functions of $X$ only and $U$ is orthogonal to any function of $X$.
Example: take a random variable $Y$ and decompose it into
$$Y = E^*[Y \mid X_1, \ldots, X_K] + \underbrace{(Y - E^*[Y \mid X_1, \ldots, X_K])}_{V}.$$
$E^*[Y \mid X_1, \ldots, X_K]$ lives in the space of linear functions of $X_1, \ldots, X_K$ only and $V$ is orthogonal to any linear function of $X_1, \ldots, X_K$.
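A small sample-level illustration of the decomposition (made-up data, numpy assumed): the least-squares fitted value lives in the column space of the design matrix and the residual is orthogonal to every column, hence to every linear function of the covariates.

```python
import numpy as np

# Sample version of the orthogonal decomposition: y = fitted + residual,
# with x'(y - xb) = 0, so the residual is orthogonal to every linear function of x.
rng = np.random.default_rng(5)
n = 250
x = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = x @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(x, y, rcond=None)
fitted = x @ b
resid = y - fitted

print(np.allclose(x.T @ resid, 0.0))   # orthogonality conditions hold in sample
```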