Lecture 05 Geometry of Least Squares


16 September 2015
Taylor B. Arnold
Yale Statistics, STAT 312/612

Goals for today
1. Geometry of least squares
2. Projection matrix P and annihilator matrix M
3. Multivariate Galton heights

Geometry of Least Squares

Last time, we established that the least squares solution to the model

$$y = X\beta + \epsilon$$

yields the solution

$$\hat{\beta} = (X^t X)^{-1} X^t y$$

as long as the matrix $X^t X$ is invertible.
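As a quick numerical check, the normal-equations solution agrees with a library least squares solver. This is a minimal sketch assuming numpy; the data is simulated and all names and dimensions are illustrative, not part of the lecture.

```python
import numpy as np

# Simulated data: n observations, p predictors (illustrative only).
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(size=n)

# Normal-equations solution: beta_hat = (X^t X)^{-1} X^t y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Should agree with numpy's built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```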

Define the column space of the matrix $X$ as:

$$\mathcal{R}(X) = \{\theta : \theta = Xb,\; b \in \mathbb{R}^p\} \subseteq \mathbb{R}^n$$

This is the space spanned by the $p$ columns of $X$, sitting in $n$-dimensional space.

Notice that the least squares problem can be re-written as:

$$\hat{\theta} = \arg\min_{\theta} \left\{ \|y - \theta\|_2^2 \;\;\text{s.t.}\;\; \theta \in \mathcal{R}(X) \right\}$$

where then $\hat{\theta} = X\hat{\beta}$.

Theorem 3.2 (p. 37, Rao & Toutenburg): The minimum is attained at $\hat{\theta}$ when $(y - \hat{\theta}) \perp \mathcal{R}(X)$. In other words, $(y - \hat{\theta})$ is perpendicular to all vectors in $\mathcal{R}(X)$.

Proof: Pick a $\hat{\theta} \in \mathcal{R}(X)$ such that $(y - \hat{\theta}) \perp \mathcal{R}(X)$. This implies that $X^t(y - \hat{\theta}) = 0$. Then for all $\theta \in \mathcal{R}(X)$:

$$\begin{aligned}
\|y - \theta\|_2^2 &= (y - \hat{\theta} + \hat{\theta} - \theta)^t (y - \hat{\theta} + \hat{\theta} - \theta) \\
&= (y - \hat{\theta})^t (y - \hat{\theta}) + (\hat{\theta} - \theta)^t (\hat{\theta} - \theta) + 2(y - \hat{\theta})^t (\hat{\theta} - \theta) \\
&= (y - \hat{\theta})^t (y - \hat{\theta}) + (\hat{\theta} - \theta)^t (\hat{\theta} - \theta) \\
&= \|y - \hat{\theta}\|_2^2 + \|\hat{\theta} - \theta\|_2^2 \\
&\geq \|y - \hat{\theta}\|_2^2
\end{aligned}$$

The cross term drops out because $\hat{\theta} - \theta \in \mathcal{R}(X)$ while $(y - \hat{\theta})$ is perpendicular to $\mathcal{R}(X)$. So, if such a $\hat{\theta}$ exists, it attains the minimum.

To see that such a $\hat{\theta}$ does exist, write $\hat{\theta} = X\hat{\beta}$. Then:

$$\begin{aligned}
X^t(y - \hat{\theta}) &= X^t(y - X\hat{\beta}) \\
&= X^t y - X^t X \hat{\beta} \\
&= X^t y - X^t X (X^t X)^{-1} X^t y \\
&= X^t y - X^t y = 0
\end{aligned}$$

And therefore our proposed $\hat{\theta} = X\hat{\beta}$ lies in $\mathcal{R}(X)$ and satisfies $(y - \hat{\theta}) \perp \mathcal{R}(X)$.
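The orthogonality condition the proof turns on is easy to confirm numerically: the residual vector $y - X\hat{\beta}$ is perpendicular to every column of $X$, up to floating point error. A sketch under the same simulated setup as above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
theta_hat = X @ beta_hat  # the fitted point in the column space R(X)

# The residual vector is orthogonal to every column of X.
assert np.allclose(X.T @ (y - theta_hat), 0)
```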

From this geometric interpretation of the least squares estimator, we introduce an important matrix $P_X$, called the projection matrix:

$$P_X = X(X^t X)^{-1} X^t$$

I'll often drop the subscript, as it should be understood that the projection is onto the column space of the data matrix $X$.

Notice that $PX = X$:

$$PX = X(X^t X)^{-1} X^t X = X$$

And $Py$ gives the fitted values $\hat{y}$:

$$Py = X(X^t X)^{-1} X^t y = X\hat{\beta} = \hat{\theta} = \hat{y}$$

Do you see why the projection matrix is called the projection matrix?
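Both identities can be checked directly; a minimal numpy sketch on simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Projection matrix P = X (X^t X)^{-1} X^t.
P = X @ np.linalg.inv(X.T @ X) @ X.T
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(P @ X, X)             # P X = X
assert np.allclose(P @ y, X @ beta_hat)  # P y = fitted values y_hat
```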


The projection matrix is sometimes called the hat matrix. Any thoughts as to why?

A closely related matrix to $P$ is the annihilator matrix $M$:

$$M = I_n - P$$

It gets its name because $MX = 0$.
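A quick numerical confirmation that $M$ annihilates the columns of $X$ (again a simulated sketch, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(100) - P

# M annihilates the columns of X.
assert np.allclose(M @ X, 0)
```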

The matrix $P = X(X^t X)^{-1} X^t$ is clearly symmetric. It is also idempotent:

$$\begin{aligned}
P^2 &= X(X^t X)^{-1} X^t X (X^t X)^{-1} X^t \\
&= X(X^t X)^{-1} (X^t X) (X^t X)^{-1} X^t \\
&= X(X^t X)^{-1} X^t \\
&= P
\end{aligned}$$
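Both properties show up numerically as well (sketch on simulated data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
P = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(P, P.T)    # symmetric
assert np.allclose(P @ P, P)  # idempotent: projecting twice changes nothing
```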

$M$ is also symmetric:

$$M^t = (I_n - P)^t = I_n - P^t = M$$

And idempotent:

$$\begin{aligned}
M^2 &= (I_n - P)^2 = (I_n - P)(I_n - P) \\
&= I_n - 2P + P^2 \\
&= I_n - 2P + P \\
&= I_n - P = M
\end{aligned}$$

These properties both make sense given the geometric interpretation of $P$ and $M$ as projections onto the column space of $X$ and onto the orthogonal complement of the column space of $X$.
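The same checks go through for $M$, together with the complementarity of the two projections ($PM = 0$, an extra check not on the slide); a simulated sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(100) - P

assert np.allclose(M, M.T)    # symmetric
assert np.allclose(M @ M, M)  # idempotent
assert np.allclose(P @ M, 0)  # the two projections are orthogonal
```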

These properties are quite useful. Notice how easily we can rewrite the residual vector $r = y - X\hat{\beta}$:

$$\begin{aligned}
r &= y - X\hat{\beta} \\
&= y - Py \\
&= (I_n - P)y \\
&= My \\
&= M(X\beta + \epsilon) \\
&= M\epsilon
\end{aligned}$$

The matrices $P$ and $M$ not only make the derivation easier; they also give geometric insight into what we are doing.

One particularly useful formula writes the squared norm of the residuals as:

$$\|r\|_2^2 = \|M\epsilon\|_2^2 = \epsilon^t M^t M \epsilon = \epsilon^t M \epsilon$$

So the matrix $M$ links the sum of squared residuals to the unobserved errors $\epsilon$, which the residuals estimate.
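In a simulation we know $\beta$ and $\epsilon$, so every identity on this slide can be verified directly. A minimal sketch (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(size=n)  # the errors are known here because we simulate
y = X @ beta + eps

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

r = y - X @ np.linalg.solve(X.T @ X, X.T @ y)  # residual vector
assert np.allclose(r, M @ y)               # r = M y
assert np.allclose(r, M @ eps)             # r = M eps
assert np.allclose(r @ r, eps @ M @ eps)   # ||r||^2 = eps^t M eps
```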

Applications
