Projection

Ping Yu
School of Economics and Finance, The University of Hong Kong

Outline

1 Hilbert Space and Projection Theorem
2 Projection in the $L^2$ Space
3 Projection in $\mathbb{R}^n$
   Projection Matrices
4 Partitioned Fit and Residual Regression
   Projection along a Subspace

Overview

Whenever we discuss projection, there must be an underlying Hilbert space, since we must define "orthogonality". We explain projection in two Hilbert spaces ($L^2$ and $\mathbb{R}^n$) and integrate many estimators in one framework.

Projection in the $L^2$ space: linear projection and regression (linear regression is a special case).
Projection in $\mathbb{R}^n$: ordinary least squares (OLS) and generalized least squares (GLS).

One main topic of this course is the (ordinary) least squares estimator (LSE). Although the LSE has many interpretations, e.g., as an MLE or a MoM estimator, the most intuitive interpretation is that it is a projection estimator.

Hilbert Space and Projection Theorem

Hilbert Space

Definition (Hilbert Space). A complete inner product space is called a Hilbert space.

An inner product is a bilinear operator $\langle\cdot,\cdot\rangle : H \times H \to \mathbb{R}$, where $H$ is a real vector space, satisfying for any $x, y, z \in H$ and $\alpha \in \mathbb{R}$:
(i) $\langle x+y, z\rangle = \langle x,z\rangle + \langle y,z\rangle$;
(ii) $\langle \alpha x, z\rangle = \alpha\langle x,z\rangle$;
(iii) $\langle x,z\rangle = \langle z,x\rangle$;
(iv) $\langle x,x\rangle \geq 0$, with equality if and only if $x = 0$.
We denote this Hilbert space as $(H, \langle\cdot,\cdot\rangle)$.

(A metric space $(H,d)$ is complete if every Cauchy sequence in $H$ converges in $H$, where $d$ is a metric on $H$. A sequence $\{x_n\}$ in a metric space is called a Cauchy sequence if for any $\varepsilon > 0$, there is a positive integer $N$ such that for all natural numbers $m, n > N$, $d(x_m, x_n) < \varepsilon$.)

Angle and Orthogonality

An important inequality in an inner product space is the Cauchy-Schwarz inequality: $|\langle x,y\rangle| \leq \|x\|\,\|y\|$, where $\|\cdot\| \equiv \sqrt{\langle\cdot,\cdot\rangle}$ is the norm induced by $\langle\cdot,\cdot\rangle$.

Due to this inequality, we can define
$\mathrm{angle}(x,y) = \arccos\dfrac{\langle x,y\rangle}{\|x\|\,\|y\|}$,
where the value of the angle is chosen to lie in the interval $[0,\pi]$.

[Figure Here]

If $\langle x,y\rangle = 0$, then $\mathrm{angle}(x,y) = \pi/2$; we say $x$ is orthogonal to $y$ and denote this as $x \perp y$.
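The following is a minimal numerical sketch (not from the slides) of the induced norm, the Cauchy-Schwarz inequality, and $\mathrm{angle}(x,y)$ under the Euclidean inner product on $\mathbb{R}^n$; the vectors are arbitrary illustrations.

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, -1.0, 3.0])

inner = x @ y                                     # <x, y>
norm_x, norm_y = np.sqrt(x @ x), np.sqrt(y @ y)   # ||x|| = sqrt(<x, x>)

print(abs(inner) <= norm_x * norm_y)              # Cauchy-Schwarz: always True
print(np.arccos(inner / (norm_x * norm_y)))       # angle(x, y) in [0, pi]

z = np.array([2.0, -1.0, 0.0])                    # <x, z> = 0, so x is orthogonal to z
print(np.arccos((x @ z) / (norm_x * np.sqrt(z @ z))))  # pi/2
```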

[Figure: Angle in Two-Dimensional Euclidean Space]

Projection and Projector

The ingredients of a projection are $\{y, M, (H, \langle\cdot,\cdot\rangle)\}$, where $M$ is a subspace of $H$. Note that the same $H$ endowed with different inner products gives different Hilbert spaces, so the Hilbert space is denoted as $(H, \langle\cdot,\cdot\rangle)$ rather than $H$.

Our objective is to find some $\Pi(y) \in M$ such that
$\Pi(y) = \arg\min_{h \in M} \|y - h\|^2$.  (1)
$\Pi(\cdot): H \to M$ is called a projector, and $\Pi(y)$ is called a projection of $y$.

Direct Sum, Orthogonal Space and Orthogonal Projector

Definition. Let $M_1$ and $M_2$ be two disjoint subspaces of $H$, so that $M_1 \cap M_2 = \{0\}$. The space $V = \{h \in H \mid h = h_1 + h_2,\ h_1 \in M_1,\ h_2 \in M_2\}$ is called the direct sum of $M_1$ and $M_2$ and is denoted by $V = M_1 \oplus M_2$.

Definition. Let $M$ be a subspace of $H$. The space $M^{\perp} \equiv \{h \in H \mid \langle h, M\rangle = 0\}$ is called the orthogonal space or orthogonal complement of $M$, where $\langle h, M\rangle = 0$ means $h$ is orthogonal to every element in $M$.

Definition. Suppose $H = M_1 \oplus M_2$. Let $h \in H$, so that $h = h_1 + h_2$ for unique $h_i \in M_i$, $i = 1,2$. Then $P$ is a projector onto $M_1$ along $M_2$ if $Ph = h_1$ for all $h$. In other words, $PM_1 = M_1$ and $PM_2 = 0$. When $M_2 = M_1^{\perp}$, we call $P$ an orthogonal projector.

[Figure: Projector and Orthogonal Projector. What is $M_2$?]

Hilbert Projection Theorem

Theorem (Hilbert Projection Theorem). If $M$ is a closed subspace of a Hilbert space $H$, then for each $y \in H$ there exists a unique point $x \in M$ for which $\|y - x\|$ is minimized over $M$. Moreover, $x$ is the closest element in $M$ to $y$ if and only if $\langle y - x, M\rangle = 0$.

The first part of the theorem states the existence and uniqueness of the projection. The second part states something related to the first order conditions (FOCs) of (1), or, simply, the orthogonality conditions.

From the theorem, given any closed subspace $M$ of $H$, $H = M \oplus M^{\perp}$. Also, the closest element in $M$ to $y$ is determined by $M$ itself, not by the vectors generating $M$, since there may be some redundancy in these vectors.

[Figure: Projection]

Sequential Projection

Theorem (Law of Iterated Projections, or LIP). If $M_1$ and $M_2$ are closed subspaces of a Hilbert space $H$ and $M_1 \subset M_2$, then $\Pi_1(y) = \Pi_1(\Pi_2(y))$, where $\Pi_j(\cdot)$, $j = 1,2$, is the orthogonal projector onto $M_j$.

Proof. Write $y = \Pi_2(y) + \Pi_2^{\perp}(y)$. Then
$\Pi_1(y) = \Pi_1(\Pi_2(y) + \Pi_2^{\perp}(y)) = \Pi_1(\Pi_2(y)) + \Pi_1(\Pi_2^{\perp}(y)) = \Pi_1(\Pi_2(y))$,
where the last equality holds because $\langle \Pi_2^{\perp}(y), x\rangle = 0$ for any $x \in M_2$ and $M_1 \subset M_2$.

We first project $y$ onto the larger space $M_2$, and then project that projection onto the smaller space $M_1$. The theorem shows that this sequential procedure is equivalent to projecting $y$ onto $M_1$ directly. We will see some applications of this theorem below.
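As a quick numerical check (not part of the original slides), the sketch below verifies the LIP in $\mathbb{R}^n$: the column spaces of two arbitrary simulated matrices X1 and X2 play the roles of $M_1 \subset M_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X1 = rng.standard_normal((n, 2))                    # columns span M1
X2 = np.hstack([X1, rng.standard_normal((n, 2))])   # columns span M2, with M1 inside M2
y = rng.standard_normal(n)

def proj(X):
    """Orthogonal projection matrix onto the column space of X."""
    return X @ np.linalg.solve(X.T @ X, X.T)

P1, P2 = proj(X1), proj(X2)

direct = P1 @ y              # project y onto M1 directly
sequential = P1 @ (P2 @ y)   # project onto M2 first, then onto M1

print(np.allclose(direct, sequential))  # True: Pi_1(y) = Pi_1(Pi_2(y))
```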

Projection in the $L^2$ Space

Linear Projection

A random variable $x \in L^2(P)$ if $E[x^2] < \infty$. $L^2(P)$ endowed with some inner product is a Hilbert space.

Take $y \in L^2(P)$; $x_1, \ldots, x_k \in L^2(P)$; $M = \mathrm{span}(x_1, \ldots, x_k) \equiv \mathrm{span}(x)$; and $H = L^2(P)$ with $\langle\cdot,\cdot\rangle$ defined as $\langle x, y\rangle = E[xy]$. Then
$\Pi(y) = \arg\min_{h \in M} E[(y - h)^2] = x' \arg\min_{\beta \in \mathbb{R}^k} E[(y - x'\beta)^2]$  (2)
is called the best linear predictor (BLP) of $y$ given $x$, or the linear projection of $y$ onto $x$.

(Here $\mathrm{span}(x) = \{z \in L^2(P) \mid z = x'\alpha,\ \alpha \in \mathbb{R}^k\}$.)

continue...

Since the objective is convex in $\beta$, the FOCs are sufficient:
$-2E[x(y - x'\beta_0)] = 0 \Rightarrow E[xu] = 0$,  (3)
where $u = y - \Pi(y)$ is the error, and $\beta_0 = \arg\min_{\beta \in \mathbb{R}^k} E[(y - x'\beta)^2]$.

$\Pi(y)$ always exists and is unique, but $\beta_0$ need not be unique unless $x_1, \ldots, x_k$ are linearly independent, that is, unless there is no nonzero vector $a \in \mathbb{R}^k$ such that $a'x = 0$ almost surely (a.s.).

Why? If for every $a \neq 0$, $a'x \neq 0$ (a.s.), then $E[(a'x)^2] > 0$ and $a'E[xx']a > 0$, thus $E[xx'] > 0$. So from (3),
$\beta_0 = (E[xx'])^{-1}E[xy]$ (why?)  (4)
and $\Pi(y) = x'(E[xx'])^{-1}E[xy]$.

In the literature, $\beta$ with a subscript 0 usually represents the true value of $\beta$.

(Matrix calculus: $\frac{\partial}{\partial x}(a'x) = \frac{\partial}{\partial x}(x'a) = a$.)
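A small simulation sketch (my own illustration, with an assumed data-generating process) of $\beta_0 = (E[xx'])^{-1}E[xy]$: the population moments are approximated by sample moments on a large draw, and the orthogonality condition $E[xu] = 0$ is checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x1 = rng.standard_normal(n)
x = np.column_stack([np.ones(n), x1])          # x = (1, x1)'
y = np.exp(x1 / 2) + rng.standard_normal(n)    # E[y|x] is nonlinear in x1 (an assumption)

Exx = x.T @ x / n                   # approximates E[xx']
Exy = x.T @ y / n                   # approximates E[xy]
beta0 = np.linalg.solve(Exx, Exy)   # BLP coefficient

u = y - x @ beta0                   # projection error
print(beta0)
print(x.T @ u / n)                  # approximately zero: E[xu] = 0
```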

Regression

The setup is the same as in linear projection except that $M = L^2(P, \sigma(x))$, where $L^2(P, \sigma(x))$ is the space spanned by any function of $x$ (not only linear functions of $x$) as long as it is in $L^2(P)$:
$\Pi(y) = \arg\min_{h \in M} E[(y - h)^2]$.  (5)

Note that
$E[(y - h)^2] = E[(y - E[y|x] + E[y|x] - h)^2]$
$= E[(y - E[y|x])^2] + 2E[(y - E[y|x])(E[y|x] - h)] + E[(E[y|x] - h)^2]$
$= E[(y - E[y|x])^2] + E[(E[y|x] - h)^2]$ (why does the cross term vanish?)
$\geq E[(y - E[y|x])^2] \equiv E[u^2]$,
so $\Pi(y) = E[y|x]$, which is called the population regression function (PRF), where the error $u$ satisfies $E[u|x] = 0$ (why?).

We can use a variation argument to characterize the FOCs:
$0 = \arg\min_{\varepsilon \in \mathbb{R}} E[(y - (\Pi(y) + \varepsilon h(x)))^2]$
$\Rightarrow -2E[h(x)(y - (\Pi(y) + \varepsilon h(x)))]\big|_{\varepsilon = 0} = 0 \Rightarrow E[h(x)u] = 0,\ \forall\, h(x) \in L^2(P, \sigma(x))$.  (6)

Relationship Between the Two Projections

$\Pi_1(y)$ is the BLP of $\Pi_2(y)$ given $x$; i.e., the BLPs of $y$ and $\Pi_2(y)$ given $x$ are the same. This is a straightforward application of the law of iterated projections.

Explicitly, define
$\beta_o = \arg\min_{\beta \in \mathbb{R}^k} E\big[(E[y|x] - x'\beta)^2\big] = \arg\min_{\beta \in \mathbb{R}^k} \int (E[y|x] - x'\beta)^2\, dF(x)$.
The FOCs for this minimization problem are
$E[-2x(E[y|x] - x'\beta_o)] = 0 \Rightarrow E[xx']\beta_o = E[xE[y|x]] = E[xy] \Rightarrow \beta_o = (E[xx'])^{-1}E[xy] = \beta_0$.

In other words, $\beta_0$ is a (weighted) least squares approximation to the true model. If $E[y|x]$ is not linear in $x$, $\beta_o$ depends crucially on the weighting function $F(x)$, i.e., the distribution of $x$. The weighting ensures that frequently drawn $x_i$ yield small approximation errors, at the cost of larger approximation errors for less frequently drawn $x_i$.
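The sketch below (my own illustration, with an assumed nonlinear conditional mean) checks numerically that the BLP coefficients of $y$ and of $E[y|x]$ given $x$ coincide, i.e., $\beta_o = \beta_0$, using sample moments on a large simulated draw.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
x1 = rng.uniform(-2, 2, n)
X = np.column_stack([np.ones(n), x1])
m = np.sin(x1) + x1**2 / 4            # E[y|x], nonlinear in x1 (an assumption)
y = m + rng.standard_normal(n)        # y = E[y|x] + u, with E[u|x] = 0

beta_from_y = np.linalg.solve(X.T @ X, X.T @ y)  # BLP coefficients of y on x
beta_from_m = np.linalg.solve(X.T @ X, X.T @ m)  # BLP coefficients of E[y|x] on x

print(beta_from_y)
print(beta_from_m)   # nearly identical, up to simulation noise
```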

[Figure: Linear Approximation of Conditional Expectation (I)]

[Figure: Linear Approximation of Conditional Expectation (II)]

Linear Regression

Linear regression is a special case of regression with $E[y|x] = x'\beta$.

Regression and linear projection are implied by the definition of projection, but linear regression is a "model" in which some structure (or restriction) is imposed. In the following figure, when we project $y$ onto the larger space $M_2 = L^2(P, \sigma(x))$, $\Pi(y)$ falls into the smaller space $M_1 = \mathrm{span}(x)$ by coincidence, so there must be a restriction on the joint distribution of $(y, x)$ (what kind of restriction?).

In summary, the linear regression model is
$y = x'\beta + u$, $E[u|x] = 0$.
$E[u|x] = 0$ is necessary for a causal interpretation of $\beta$.

[Figure: Linear Regression]

Projection in $\mathbb{R}^n$

The LSE

The projection in the $L^2$ space is treated as the population version; the projection in $\mathbb{R}^n$ is treated as its sample counterpart.

The LSE is defined as
$\hat\beta = \arg\min_{\beta \in \mathbb{R}^k} SSR(\beta) = \arg\min_{\beta \in \mathbb{R}^k} \sum_{i=1}^n (y_i - x_i'\beta)^2 = \arg\min_{\beta \in \mathbb{R}^k} E_n\big[(y - x'\beta)^2\big]$,
where $E_n[\cdot]$ is the expectation under the empirical distribution of the data, and
$SSR(\beta) \equiv \sum_{i=1}^n (y_i - x_i'\beta)^2 = \sum_{i=1}^n y_i^2 - 2\beta'\sum_{i=1}^n x_i y_i + \beta'\Big(\sum_{i=1}^n x_i x_i'\Big)\beta$
is the sum of squared residuals as a function of $\beta$.

[Figure: Objective Functions of OLS Estimation: k = 1, 2]

Normal Equations

$SSR(\beta)$ is a quadratic function of $\beta$, so the FOCs are also sufficient to determine the LSE. Matrix calculus gives the FOCs for $\hat\beta$:
$0 = \frac{\partial}{\partial\beta} SSR(\hat\beta) = -2\sum_{i=1}^n x_i y_i + 2\sum_{i=1}^n x_i x_i'\hat\beta = -2X'y + 2X'X\hat\beta$,
which is equivalent to the normal equations
$X'X\hat\beta = X'y$.
So
$\hat\beta = (X'X)^{-1}X'y$.

(Matrix calculus: $\frac{\partial}{\partial x}(a'x) = \frac{\partial}{\partial x}(x'a) = a$, and $\frac{\partial}{\partial x}(x'Ax) = (A + A')x$.)
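A minimal sketch (simulated data; the design and true coefficients are assumptions) of computing the LSE from the normal equations $X'X\hat\beta = X'y$, compared against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # solves X'X b = X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # numerically preferred QR/SVD route

print(np.allclose(beta_hat, beta_lstsq))  # True
print(X.T @ (y - X @ beta_hat))           # normal equations: X'(y - Xb) = 0, up to rounding
```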

Notations

Matrices are represented using uppercase bold. In matrix notation, the sample (data, or dataset) is $(y, X)$, where $y$ is an $n \times 1$ vector with $i$th entry $y_i$, and $X$ is an $n \times k$ matrix with $i$th row $x_i'$, i.e.,
$y_{(n\times 1)} = (y_1, \ldots, y_n)'$ and $X_{(n\times k)} = (x_1, \ldots, x_n)'$.

The first column of $X$ is assumed to be a column of ones unless specified otherwise, i.e., the first column of $X$ is $\mathbf{1} = (1, \ldots, 1)'$. The bold zero, $\mathbf{0}$, denotes a vector or matrix of zeros.

Re-express $X$ as $X = (X_1, \ldots, X_k)$, where, differently from $x_i$, $X_j$, $j = 1, \ldots, k$, represents the $j$th column of $X$, i.e., all the observations on the $j$th variable.

The linear regression model, upon stacking all $n$ observations, is then
$y = X\beta + u$,
where $u$ is an $n \times 1$ column vector with $i$th entry $u_i$.

LSE as a Projection

The above derivation of $\hat\beta$ expresses the LSE using the rows of the data matrices $y$ and $X$. The following expresses the LSE using the columns of $y$ and $X$.

Take $y \in \mathbb{R}^n$; $X_1, \ldots, X_k \in \mathbb{R}^n$ linearly independent; $M = \mathrm{span}(X_1, \ldots, X_k) \equiv \mathrm{span}(X)$; and $H = \mathbb{R}^n$ with the Euclidean inner product. Then
$\Pi(y) = \arg\min_{h \in M} \|y - h\|^2 = X \arg\min_{\beta \in \mathbb{R}^k} \|y - X\beta\|^2 = X \arg\min_{\beta \in \mathbb{R}^k} \sum_{i=1}^n (y_i - x_i'\beta)^2$,  (7)
where $\sum_{i=1}^n (y_i - x_i'\beta)^2$ is exactly the objective function of OLS.

($\mathrm{span}(X) = \{z \in \mathbb{R}^n \mid z = X\alpha,\ \alpha \in \mathbb{R}^k\}$ is called the column space or range space of $X$. Recall that for $x = (x_1, \ldots, x_n)'$ and $z = (z_1, \ldots, z_n)'$, the Euclidean inner product of $x$ and $z$ is $\langle x, z\rangle = \sum_{i=1}^n x_i z_i$, so $\|x\|^2 = \langle x, x\rangle = \sum_{i=1}^n x_i^2$.)

continue...

As $\Pi(y) = X\hat\beta$, we can solve for $\hat\beta$ by premultiplying both sides by $X'$:
$X'\Pi(y) = X'X\hat\beta \Rightarrow \hat\beta = (X'X)^{-1}X'\Pi(y)$,
where $(X'X)^{-1}$ exists because $X$ has full rank.

On the other hand, the orthogonality conditions for this optimization problem are
$X'\hat u = 0$,
where $\hat u = y - \Pi(y)$. Since these orthogonality conditions are equivalent to the normal equations (or the FOCs), $\hat\beta = (X'X)^{-1}X'y$.

These two $\hat\beta$'s are the same since
$(X'X)^{-1}X'y - (X'X)^{-1}X'\Pi(y) = (X'X)^{-1}X'\hat u = 0$.

Finally, $\Pi(y) = X(X'X)^{-1}X'y \equiv P_X y$, where $P_X$ is called the projection matrix.
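The sketch below (arbitrary simulated data) checks the two equivalent routes to $\hat\beta$, one from the fitted values $\Pi(y) = P_X y$ and one directly from $y$, along with the orthogonality condition $X'\hat u = 0$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

PX = X @ np.linalg.solve(X.T @ X, X.T)   # projection matrix onto span(X)
Pi_y = PX @ y                            # Pi(y) = P_X y
u_hat = y - Pi_y                         # residuals

beta_from_Pi = np.linalg.solve(X.T @ X, X.T @ Pi_y)  # (X'X)^{-1} X' Pi(y)
beta_from_y = np.linalg.solve(X.T @ X, X.T @ y)      # (X'X)^{-1} X' y

print(np.allclose(beta_from_Pi, beta_from_y))  # True
print(np.max(np.abs(X.T @ u_hat)))             # approximately 0: X'u_hat = 0
```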

Multicollinearity

In the above calculation, we first project $y$ onto $\mathrm{span}(X)$ and then find $\hat\beta$ by solving $\Pi(y) = X\hat\beta$. The two steps involve very different operations: optimization versus solving linear equations.

Furthermore, although $\Pi(y)$ is unique, $\hat\beta$ may not be. When $\mathrm{rank}(X) < k$, i.e., $X$ is rank deficient, there is more than one (in fact, infinitely many) $\hat\beta$ such that $X\hat\beta = \Pi(y)$. This is called multicollinearity and will be discussed in more detail in the next chapter.

In the following discussion, we always assume $\mathrm{rank}(X) = k$, i.e., $X$ has full column rank; otherwise, some columns of $X$ can be deleted to make it so.

Generalized Least Squares

Everything is the same as in the last example except that $\langle x, z\rangle_W = x'Wz$, where the weight matrix $W$ is positive definite, denoted $W > 0$:
$\Pi(y) = X \arg\min_{\beta \in \mathbb{R}^k} \|y - X\beta\|_W^2$.  (8)
The projection FOCs (orthogonality conditions) are
$\langle X, \tilde u\rangle_W = 0$,
where $\tilde u = y - X\tilde\beta$. That is,
$\langle X, X\rangle_W \tilde\beta = \langle X, y\rangle_W \Rightarrow \tilde\beta = (X'WX)^{-1}X'Wy$.
Thus
$\Pi(y) = X(X'WX)^{-1}X'Wy \equiv P_{X\perp WX}\, y$,
where the notation $P_{X\perp WX}$ will be explained later.
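A sketch of the GLS estimator $\tilde\beta = (X'WX)^{-1}X'Wy$ and the associated projector $P_{X\perp WX} = X(X'WX)^{-1}X'W$; the weight matrix $W$ below is an arbitrary positive definite matrix chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 2
X = rng.standard_normal((n, k))
y = rng.standard_normal(n)

A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)          # a positive definite weight matrix (assumption)

beta_gls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
P = X @ np.linalg.solve(X.T @ W @ X, X.T @ W)   # projector onto span(X) along span_perp(WX)

print(np.allclose(P @ P, P))                          # idempotent
print(np.allclose(P, P.T))                            # False in general: not symmetric
print(np.max(np.abs(X.T @ W @ (y - X @ beta_gls))))   # approximately 0: <X, u~>_W = 0
```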

Projection Matrices

Since $\Pi(y) = P_X y$ is the orthogonal projection onto $\mathrm{span}(X)$, $P_X$ is the orthogonal projector onto $\mathrm{span}(X)$. Similarly, $\hat u = y - \Pi(y) = (I_n - P_X)y \equiv M_X y$ is the orthogonal projection onto $\mathrm{span}^{\perp}(X)$, so $M_X$ is the orthogonal projector onto $\mathrm{span}^{\perp}(X)$, where $I_n$ is the $n \times n$ identity matrix.

Since
$P_X X = X(X'X)^{-1}X'X = X$ and $M_X X = (I_n - P_X)X = 0$,
we say that $P_X$ preserves $\mathrm{span}(X)$ and $M_X$ annihilates $\mathrm{span}(X)$; $M_X$ is called the annihilator.

This implies another way to express $\hat u$:
$\hat u = M_X y = M_X(X\beta + u) = M_X u$.

Also, it is easy to check that $M_X P_X = 0$, so $M_X$ and $P_X$ are orthogonal.

continue...

$P_X$ is symmetric: $P_X' = \big(X(X'X)^{-1}X'\big)' = X(X'X)^{-1}X' = P_X$.

$P_X$ is idempotent (intuition?): $P_X^2 = X(X'X)^{-1}X'X(X'X)^{-1}X' = P_X$. (A square matrix $A$ is idempotent if $A^2 = AA = A$.)

$P_X$ is positive semidefinite: for any $\alpha \in \mathbb{R}^n$, $\alpha'P_X\alpha = (X'\alpha)'(X'X)^{-1}(X'\alpha) \geq 0$. "Positive semidefinite" cannot be strengthened to "positive definite". Why?

For an idempotent matrix, the rank equals the trace (the trace of a square matrix is the sum of its diagonal elements; $\mathrm{tr}(A+B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ and $\mathrm{tr}(AB) = \mathrm{tr}(BA)$), and
$\mathrm{tr}(P_X) = \mathrm{tr}\big(X(X'X)^{-1}X'\big) = \mathrm{tr}\big((X'X)^{-1}X'X\big) = \mathrm{tr}(I_k) = k < n$,
$\mathrm{tr}(M_X) = \mathrm{tr}(I_n - P_X) = \mathrm{tr}(I_n) - \mathrm{tr}(P_X) = n - k < n$.

For a general "nonorthogonal" projector $P$, it is still unique and idempotent, but it need not be symmetric (let alone positive semidefinite). For example, $P_{X\perp WX}$ in the GLS estimation is not symmetric.
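The sketch below verifies these properties of $P_X$ and $M_X$ numerically on a random full-column-rank $X$ (an assumed design for illustration only).

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 4
X = rng.standard_normal((n, k))

PX = X @ np.linalg.solve(X.T @ X, X.T)
MX = np.eye(n) - PX

print(np.allclose(PX, PX.T))                      # symmetric
print(np.allclose(PX @ PX, PX))                   # idempotent
print(np.allclose(MX @ PX, 0))                    # M_X P_X = 0
print(np.allclose(MX @ X, 0))                     # M_X annihilates span(X)
print(np.trace(PX), np.linalg.matrix_rank(PX))    # both equal k
print(np.trace(MX), np.linalg.matrix_rank(MX))    # both equal n - k
print(np.linalg.eigvalsh(PX).min() >= -1e-10)     # positive semidefinite (smallest eigenvalue is 0)
```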

Partitioned Fit and Residual Regression

Partitioned Fit

It is of interest to understand the meaning of part of $\hat\beta$, say $\hat\beta_1$, in the partition $\hat\beta = (\hat\beta_1', \hat\beta_2')'$, where we partition
$X\beta = (X_1\ \ X_2)\begin{pmatrix}\beta_1 \\ \beta_2\end{pmatrix}$
with $\mathrm{rank}(X) = k$.

We will show that $\hat\beta_1$ is the "net" effect of $X_1$ on $y$ when the effect of $X_2$ is removed from the system. This result is called the Frisch-Waugh-Lovell (FWL) theorem, due to Frisch and Waugh (1933) and Lovell (1963). The FWL theorem is an excellent implication of the projection property of least squares.

To simplify notation, let $P_j \equiv P_{X_j}$, $M_j \equiv M_{X_j}$, and $\Pi_j(y) = X_j\hat\beta_j$, $j = 1,2$.

The FWL Theorem

Theorem. $\hat\beta_1$ can be obtained by regressing the residuals from a regression of $y$ on $X_2$ alone on the set of residuals obtained when each column of $X_1$ is regressed on $X_2$. In mathematical notation,
$\hat\beta_1 = \big(X_{1\perp 2}'X_{1\perp 2}\big)^{-1}X_{1\perp 2}'y_{\perp 2} = \big(X_1'M_2X_1\big)^{-1}X_1'M_2y$,
where $X_{1\perp 2} = (I - P_2)X_1 = M_2X_1$ and $y_{\perp 2} = (I - P_2)y = M_2y$.

This theorem states that $\hat\beta_1$ can be calculated by the OLS regression of $\tilde y = M_2 y$ on $\tilde X_1 = M_2 X_1$. This technique is called residual regression.

Corollary. $\Pi_1(y) \equiv X_1\hat\beta_1 = X_1\big(X_{1\perp 2}'X_1\big)^{-1}X_{1\perp 2}'y \equiv P_{12}y = P_{12}(\Pi(y))$.
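A minimal numerical sketch of the FWL theorem on simulated data (the split of $X$ into $(X_1, X_2)$ and the DGP are assumptions): the subvector of the full-regression LSE equals the coefficient from regressing $M_2 y$ on $M_2 X_1$.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k1, k2 = 200, 2, 3
X1 = rng.standard_normal((n, k1))
X2 = np.column_stack([np.ones(n), rng.standard_normal((n, k2 - 1))])
X = np.column_stack([X1, X2])
y = X @ rng.standard_normal(k1 + k2) + rng.standard_normal(n)

# Full regression: the first k1 coefficients are beta_1_hat.
beta_full = np.linalg.solve(X.T @ X, X.T @ y)[:k1]

# Residual regression: partial X2 out of both y and X1, then regress residuals on residuals.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
X1_perp2, y_perp2 = M2 @ X1, M2 @ y
beta_fwl = np.linalg.solve(X1_perp2.T @ X1_perp2, X1_perp2.T @ y_perp2)

print(np.allclose(beta_full, beta_fwl))  # True
```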

[Figure: The FWL Theorem]

$P_{12}$

$P_{12} = \underbrace{X_1}_{\text{trailing term}}\big(X_1'(I - P_2)X_1\big)^{-1}\underbrace{X_1'(I - P_2)}_{\text{leading term}}$.

$I - P_2$ in the leading term annihilates $\mathrm{span}(X_2)$, so that $P_{12}(\Pi_2(y)) = 0$. The leading term sends $\Pi(y)$ toward $\mathrm{span}^{\perp}(X_2)$. But the trailing $X_1$ ensures that the final result will lie in $\mathrm{span}(X_1)$. The rest of the expression for $P_{12}$ ensures that $X_1$ is preserved under the transformation: $P_{12}X_1 = X_1$.

Why is $P_{12}y = P_{12}(\Pi(y))$? We can treat the projector $P_{12}$ as a sequential projector: first project $y$ onto $\mathrm{span}(X)$ to get $\Pi(y)$, and then project $\Pi(y)$ onto $\mathrm{span}(X_1)$ along $\mathrm{span}(X_2)$ to get $\Pi_1(y)$.

$\hat\beta_1$ is calculated from $\Pi_1(y)$ by $\hat\beta_1 = (X_1'X_1)^{-1}X_1'\Pi_1(y)$.

[Figure: Projection by $P_{12}$]

Proof I of the FWL Theorem (Brute Force)

Calculate the residual-regression estimator of $\beta_1$ explicitly and check whether it equals the LSE of $\beta_1$. Residual regression consists of the following three steps.

Step 1: Projecting $y$ on $X_2$, we have the residuals $\hat u_y = y - X_2(X_2'X_2)^{-1}X_2'y = M_2 y$.

Step 2: Projecting $X_1$ on $X_2$, we have the residuals $\hat U_{x_1} = X_1 - X_2(X_2'X_2)^{-1}X_2'X_1 = M_2 X_1$.

Step 3: Projecting $\hat u_y$ on $\hat U_{x_1}$, we get the residual regression estimator of $\beta_1$:
$\tilde\beta_1 = \big(\hat U_{x_1}'\hat U_{x_1}\big)^{-1}\hat U_{x_1}'\hat u_y = \big(X_1'M_2X_1\big)^{-1}X_1'M_2y$
$= \big[X_1'X_1 - X_1'X_2(X_2'X_2)^{-1}X_2'X_1\big]^{-1}\big[X_1'y - X_1'X_2(X_2'X_2)^{-1}X_2'y\big]$
$\equiv W^{-1}\big[X_1'y - X_1'X_2(X_2'X_2)^{-1}X_2'y\big]$.

continue...

On the other hand,
$\hat\beta = \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}\begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}$,
and by the partitioned inverse formula (9) below,
$\hat\beta_1 = W^{-1}X_1'y - W^{-1}X_1'X_2(X_2'X_2)^{-1}X_2'y = W^{-1}\big[X_1'y - X_1'X_2(X_2'X_2)^{-1}X_2'y\big] = \tilde\beta_1$.

The partitioned inverse formula:
$\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}^{-1} = \begin{pmatrix} \tilde A_{11}^{-1} & -\tilde A_{11}^{-1}A_{12}A_{22}^{-1} \\ -A_{22}^{-1}A_{21}\tilde A_{11}^{-1} & A_{22}^{-1} + A_{22}^{-1}A_{21}\tilde A_{11}^{-1}A_{12}A_{22}^{-1} \end{pmatrix}$,  (9)
where $\tilde A_{11} = A_{11} - A_{12}A_{22}^{-1}A_{21}$.

Proof II of the FWL Theorem

To show $\hat\beta_1 = \big(X_{1\perp 2}'X_{1\perp 2}\big)^{-1}X_{1\perp 2}'y_{\perp 2}$, we need only show that $X_1'M_2y = X_1'M_2X_1\hat\beta_1$.

Multiplying $y = X_1\hat\beta_1 + X_2\hat\beta_2 + \hat u$ by $X_1'M_2$ on both sides, we have
$X_1'M_2y = X_1'M_2X_1\hat\beta_1 + X_1'M_2X_2\hat\beta_2 + X_1'M_2\hat u = X_1'M_2X_1\hat\beta_1$,
where the last equality follows from $M_2X_2 = 0$ and $X_1'M_2\hat u = X_1'\hat u = 0$. (Why does the first of these equalities hold? Because $\hat u = M_X u$ and $M_2M_X = M_X$.)

Projection along a Subspace

$P_{12}$ as a Projector along a Subspace

Lemma. Define $P_{X\perp Z}$ as the projector onto $\mathrm{span}(X)$ along $\mathrm{span}^{\perp}(Z)$, where $X$ and $Z$ are $n \times k$ matrices and $Z'X$ is nonsingular. Then $P_{X\perp Z}$ is idempotent, and $P_{X\perp Z} = X(Z'X)^{-1}Z'$.

For orthogonal projectors, $P_X = P_{X\perp X}$.

To see the difference between $P_X$ and $P_{X\perp Z}$, we check Figure 2 (Projector and Orthogonal Projector) again. In the left panel, $X = (1,0)'$ and $Z = (1,1)'$; in the right panel, $X = (1,0)'$ (why?). It is easy to check that
$P_X = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ and $P_{X\perp Z} = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}$.
So an orthogonal projector must be symmetric, while a general projector need not be.

$P_{12} = P_{X_1\perp X_{1\perp 2}}$.
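A sketch (arbitrary simulated $X_1$, $X_2$) of the oblique projector $P_{X\perp Z} = X(Z'X)^{-1}Z'$, checking that it reproduces $P_{12}$ when $Z = X_{1\perp 2} = M_2X_1$, that it is idempotent and preserves $\mathrm{span}(X_1)$, and that it is not symmetric in general.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k1, k2 = 40, 2, 3
X1 = rng.standard_normal((n, k1))
X2 = rng.standard_normal((n, k2))

def oblique(X, Z):
    """Projector onto span(X) along span_perp(Z), assuming Z'X is nonsingular."""
    return X @ np.linalg.solve(Z.T @ X, Z.T)

M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
X1_perp2 = M2 @ X1

P12_direct = X1 @ np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2)  # X1 (X1'M2 X1)^{-1} X1'M2
P12_oblique = oblique(X1, X1_perp2)                           # P_{X1 perp X1_perp2}

print(np.allclose(P12_direct, P12_oblique))                   # the two expressions coincide
print(np.allclose(P12_oblique @ P12_oblique, P12_oblique))    # idempotent
print(np.allclose(P12_oblique @ X1, X1))                      # preserves span(X1)
print(np.allclose(P12_oblique, P12_oblique.T))                # False: not symmetric in general
```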
