Ridge Regression. Flachs, Munkholt and Skotte. May 4, 2009


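As in usual regression we consider a pair of random variables $(X, Y)$ with values in $\mathbb{R}^p \times \mathbb{R}$ and assume that for some $(\beta_0, \beta) \in \mathbb{R}^{1+p}$ it holds that

$$E(Y \mid X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j = \beta_0 + X^\top \beta.$$

Let $\mathbf{X}$ be the $N \times p$ matrix of $N$ $p$-dimensional covariates and let $\mathbf{Y}$ denote the $N$-dimensional vector of observations. With $\bar{X}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}$ and $\bar{Y} = \frac{1}{N} \sum_{i=1}^{N} y_i$ we shall use the notation

$$\mathbf{1} = (1, \dots, 1)^\top, \quad \bar{\mathbf{Y}} = \mathbf{1}\bar{Y}, \quad \bar{X} = (\bar{X}_1, \dots, \bar{X}_p) \quad \text{and} \quad \bar{\mathbf{X}} = \mathbf{1}\bar{X},$$

where $\mathbf{1}$ is $N$-dimensional.

Initially we define the ridge regression estimate $(\hat\beta_0, \hat\beta)$ as the pair $(\beta_0, \beta)$ that minimizes the penalized residual sum of squares

$$f_\lambda(\beta_0, \beta) = (\mathbf{Y} - \beta_0\mathbf{1} - \mathbf{X}\beta)^\top (\mathbf{Y} - \beta_0\mathbf{1} - \mathbf{X}\beta) + \lambda \beta^\top \beta.$$

Note that the intercept $\beta_0$ is not penalized.

Question 1. Show that the ridge regression problem is equivalent to the problem of finding a minimizer $(\hat\beta_0^c, \hat\beta^c)$ for

$$g_\lambda(\beta_0^c, \beta^c) = (\mathbf{Y} - \beta_0^c\mathbf{1} - (\mathbf{X} - \bar{\mathbf{X}})\beta^c)^\top (\mathbf{Y} - \beta_0^c\mathbf{1} - (\mathbf{X} - \bar{\mathbf{X}})\beta^c) + \lambda (\beta^c)^\top \beta^c.$$

Proof. Consider the bijective re-parametrization $\varphi : \mathbb{R}^{1+p} \to \mathbb{R}^{1+p}$ defined by

$$\varphi(\beta_0, \beta) = (\beta_0 + \bar{X}\beta, \beta).$$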

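Then

$$f_\lambda(\beta_0, \beta) = (\mathbf{Y} - \beta_0\mathbf{1} - \mathbf{X}\beta)^\top (\mathbf{Y} - \beta_0\mathbf{1} - \mathbf{X}\beta) + \lambda \beta^\top \beta$$
$$= (\mathbf{Y} - (\beta_0 + \bar{X}\beta)\mathbf{1} - (\mathbf{X} - \bar{\mathbf{X}})\beta)^\top (\mathbf{Y} - (\beta_0 + \bar{X}\beta)\mathbf{1} - (\mathbf{X} - \bar{\mathbf{X}})\beta) + \lambda \beta^\top \beta$$
$$= g_\lambda(\varphi(\beta_0, \beta)).$$

Thus the two problems are equivalent and the solutions satisfy $\hat\beta^c = \hat\beta$ and $\hat\beta_0^c = \hat\beta_0 + \bar{X}\hat\beta$.

In the centered version we have that

$$\frac{d}{d\beta_0^c} g_\lambda(\beta_0^c, \beta^c) = -2 (\mathbf{Y} - \beta_0^c\mathbf{1} - (\mathbf{X} - \bar{\mathbf{X}})\beta^c)^\top \mathbf{1} = -2 \sum_{i=1}^{N} \Big( y_i - \beta_0^c - \sum_{j=1}^{p} (x_{ij} - \bar{X}_j)\beta_j^c \Big) = -2N(\bar{Y} - \beta_0^c),$$

where the last equality uses that each centered column sums to zero. Necessarily $\hat\beta_0^c = \bar{Y}$. Now we have

$$D_{\beta^c} g_\lambda(\hat\beta_0^c, \beta^c) = -2 (\mathbf{Y} - \bar{\mathbf{Y}} - (\mathbf{X} - \bar{\mathbf{X}})\beta^c)^\top (\mathbf{X} - \bar{\mathbf{X}}) + 2\lambda (\beta^c)^\top$$
$$= -2 (\mathbf{Y} - \bar{\mathbf{Y}})^\top (\mathbf{X} - \bar{\mathbf{X}}) + 2 (\beta^c)^\top (\mathbf{X} - \bar{\mathbf{X}})^\top (\mathbf{X} - \bar{\mathbf{X}}) + 2\lambda (\beta^c)^\top$$
$$= -2 (\mathbf{Y} - \bar{\mathbf{Y}})^\top (\mathbf{X} - \bar{\mathbf{X}}) + 2 (\beta^c)^\top \big( (\mathbf{X} - \bar{\mathbf{X}})^\top (\mathbf{X} - \bar{\mathbf{X}}) + \lambda I \big).$$

By transposing we see that a minimizer $\hat\beta^c$ (or $\hat\beta$) must satisfy

$$\big( (\mathbf{X} - \bar{\mathbf{X}})^\top (\mathbf{X} - \bar{\mathbf{X}}) + \lambda I \big) \hat\beta^c = (\mathbf{X} - \bar{\mathbf{X}})^\top (\mathbf{Y} - \bar{\mathbf{Y}}).$$

It is therefore possible to assume that the matrix of covariates and the observation vector have been centered, such that the average of each column is zero, if we first let $\hat\beta_0^c = \bar{Y}$. In the following we let $X = \mathbf{X} - \bar{\mathbf{X}}$ and $y = \mathbf{Y} - \bar{\mathbf{Y}}$. Define for $\lambda > 0$ the ridge regression estimate $\hat\beta(\lambda)$ as the $\beta$ that minimizes

$$RSS_\lambda(\beta) = (y - X\beta)^\top (y - X\beta) + \lambda \beta^\top \beta.$$

Question 2. With $X = UDV^\top$ the singular value decomposition of $X$, show for $\lambda > 0$ that if $\hat\beta(\lambda)$ is a minimizer of $RSS_\lambda(\beta)$ then

$$\hat\beta(\lambda)^\top \hat\beta(\lambda) = \sum_{i=1}^{p} \frac{d_i^2}{(d_i^2 + \lambda)^2} \, y^\top u_i u_i^\top y,$$

where $d_i$, $i = 1, \dots, p$, are the singular values and $u_i$, $i = 1, \dots, p$, are the columns of $U$ in the SVD.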
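The centering result and the identity stated in Question 2 can be checked numerically. The following is a minimal sketch, assuming NumPy and simulated data; the sizes, the seed, the penalty value and all variable names are illustrative choices and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is reproducible
N, p, lam = 50, 3, 2.0                   # illustrative sizes and penalty
X_raw = rng.normal(size=(N, p)) + 5.0    # covariates with non-zero column means
Y_raw = X_raw @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)

# Uncentered problem: explicit intercept column, penalty on the slopes only.
A = np.hstack([np.ones((N, 1)), X_raw])
penalty = lam * np.eye(p + 1)
penalty[0, 0] = 0.0                      # the intercept is not penalized
theta = np.linalg.solve(A.T @ A + penalty, A.T @ Y_raw)
beta0_hat, beta_hat = theta[0], theta[1:]

# Centered problem: the intercept is the mean of Y, slopes use centered data.
X = X_raw - X_raw.mean(axis=0)
y = Y_raw - Y_raw.mean()
beta_c = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta0_c = Y_raw.mean()

# Squared norm of the ridge estimate via the SVD X = U D V^T.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
norm_sq_svd = np.sum(d**2 / (d**2 + lam) ** 2 * (U.T @ y) ** 2)

print("slopes agree:", np.allclose(beta_hat, beta_c))
print("intercepts agree:",
      np.isclose(beta0_c, beta0_hat + X_raw.mean(axis=0) @ beta_hat))
print("norm identity holds:", np.isclose(beta_c @ beta_c, norm_sq_svd))
```

All three comparisons are expected to hold up to floating-point tolerance, since they are direct instances of the relations $\hat\beta^c = \hat\beta$, $\hat\beta_0^c = \hat\beta_0 + \bar{X}\hat\beta$ and the SVD expression for $\hat\beta(\lambda)^\top \hat\beta(\lambda)$.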

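Proof. It follows from the calculations above that $\hat\beta(\lambda)$ satisfies

$$(X^\top X + \lambda I) \hat\beta(\lambda) = X^\top y.$$

When $\lambda > 0$ the matrix $X^\top X + \lambda I$ is invertible regardless of the rank of $X$. Using the singular value decomposition $X = UDV^\top$, we see that

$$\hat\beta(\lambda) = (X^\top X + \lambda I)^{-1} X^\top y = (V D^2 V^\top + \lambda V V^\top)^{-1} V D U^\top y = (V (D^2 + \lambda I) V^\top)^{-1} V D U^\top y = V (D^2 + \lambda I)^{-1} D U^\top y.$$

This gives us that

$$\hat\beta(\lambda)^\top \hat\beta(\lambda) = y^\top U D (D^2 + \lambda I)^{-2} D U^\top y = \sum_{i=1}^{p} \frac{d_i^2}{(d_i^2 + \lambda)^2} \, y^\top u_i u_i^\top y,$$

using that for a $p$-dimensional vector $b = (b_1, \dots, b_p)^\top$ and a $p \times p$ diagonal matrix $A$ with diagonal entries $a_1, \dots, a_p$ it holds that

$$b^\top A b = \sum_{j=1}^{p} a_j b_j^2.$$

The ordinary least squares solution is obtained by minimizing $(y - X\beta)^\top (y - X\beta)$. The solution is only unique when $X$ has full rank, but any solution satisfies $X^\top X \beta = X^\top y$. Define

$$t = \min_{\beta : X^\top X \beta = X^\top y} \beta^\top \beta.$$

Question 3. Show that $\hat\beta(\lambda)^\top \hat\beta(\lambda) < t$ for $\lambda > 0$, and that the function $\lambda \mapsto s(\lambda) := \hat\beta(\lambda)^\top \hat\beta(\lambda)$ is a continuous, strictly decreasing function, with $s(\lambda) \to 0$ for $\lambda \to \infty$.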
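The claims in Question 3 can be illustrated on simulated data. The sketch below assumes NumPy; the data, the grid of penalties and the helper `s` are hypothetical and only serve as an illustration. The bound $t$ is computed from the minimum-norm least squares solution returned by `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed so the run is reproducible
N, p = 40, 3                             # illustrative sizes
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # centered covariates
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
y -= y.mean()                            # centered response


def s(lam):
    """Squared norm of the ridge estimate for a penalty lam > 0."""
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return beta @ beta


# t: squared norm of the minimum-norm least squares solution.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
t = beta_ols @ beta_ols

lams = np.logspace(-3, 6, 40)            # increasing grid of penalties
s_vals = np.array([s(lam) for lam in lams])

print("t =", t)
print("s at smallest and largest penalty:", s_vals[0], s_vals[-1])
print("strictly decreasing on the grid:", bool(np.all(np.diff(s_vals) < 0)))
print("all values below t:", bool(np.all(s_vals < t)))
```

A finite grid cannot prove monotonicity or the limit, but it shows the qualitative behaviour that the proof establishes.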

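Proof. Consider $s : (0, \infty) \to (0, \infty)$ defined by

$$s(\lambda) = \hat\beta(\lambda)^\top \hat\beta(\lambda) = \sum_{i=1}^{p} \frac{d_i^2}{(d_i^2 + \lambda)^2} \, y^\top u_i u_i^\top y.$$

Then $s$ is continuous with

$$\frac{d}{d\lambda} s(\lambda) = -2 \sum_{i=1}^{p} y^\top u_i u_i^\top y \, \frac{d_i^2}{(d_i^2 + \lambda)^3} < 0,$$

and since

$$\frac{d_i^2}{(d_i^2 + \lambda)^2} \to 0 \quad \text{when } \lambda \to \infty,$$

we have that $s(\lambda)$ is strictly decreasing and goes to zero for $\lambda \to \infty$.

Let $\hat\beta$ be a minimizer of $(y - X\beta)^\top (y - X\beta)$; if the minimizer is not unique, consider the one with the smallest norm. For any $\lambda > 0$ we have that $RSS_\lambda(\hat\beta(\lambda)) \le RSS_\lambda(\hat\beta)$, and $\hat\beta$ minimizes the unpenalized residual sum of squares. Combining this we see that

$$0 \le RSS_\lambda(\hat\beta) - RSS_\lambda(\hat\beta(\lambda)) = (y - X\hat\beta)^\top (y - X\hat\beta) - (y - X\hat\beta(\lambda))^\top (y - X\hat\beta(\lambda)) + \lambda \hat\beta^\top \hat\beta - \lambda \hat\beta(\lambda)^\top \hat\beta(\lambda)$$
$$\le \lambda \hat\beta^\top \hat\beta - \lambda \hat\beta(\lambda)^\top \hat\beta(\lambda).$$

Thus for $\lambda > 0$ the ordinary least squares estimate $\hat\beta$ (also the one with the smallest norm if there is more than one) satisfies

$$\hat\beta^\top \hat\beta \ge \hat\beta(\lambda)^\top \hat\beta(\lambda).$$

Taking the limit $\lambda \to 0$ on both sides gives

$$t \ge \sum_{i : d_i \ne 0} \frac{1}{d_i^2} \, y^\top u_i u_i^\top y.$$

Thus for any $\lambda > 0$ we see that

$$\hat\beta(\lambda)^\top \hat\beta(\lambda) = \sum_{i=1}^{p} \frac{d_i^2}{(d_i^2 + \lambda)^2} \, y^\top u_i u_i^\top y < \sum_{i : d_i \ne 0} \frac{1}{d_i^2} \, y^\top u_i u_i^\top y \le t.$$

Question 4. Show for $\lambda > 0$ that a minimizer of $RSS_\lambda(\beta)$ is also a minimizer of $(y - X\beta)^\top (y - X\beta)$, subject to the constraint $\beta^\top \beta \le s(\lambda)$.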
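The statement of Question 4 can be probed numerically by sampling points in the constraint set and comparing their residual sums of squares with that of the ridge estimate. This is a rough sketch, assuming NumPy and simulated data; the sampling scheme, the sample size and all names are illustrative, and random sampling only gives supporting evidence, not a proof.

```python
import numpy as np

rng = np.random.default_rng(2)           # fixed seed so the run is reproducible
N, p, lam = 40, 3, 5.0                   # illustrative sizes and penalty
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # centered covariates
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=N)
y -= y.mean()                            # centered response


def rss(beta):
    """Unpenalized residual sum of squares (y - X beta)^T (y - X beta)."""
    r = y - X @ beta
    return r @ r


beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
s_lam = beta_ridge @ beta_ridge          # s(lambda), the constraint level

# Random candidates in the ball {beta : beta^T beta <= s(lambda)}.
n_draws = 10_000
directions = rng.normal(size=(n_draws, p))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = np.sqrt(s_lam) * rng.uniform(size=(n_draws, 1)) ** (1.0 / p)
candidates = radii * directions

residuals = y[:, None] - X @ candidates.T          # shape (N, n_draws)
rss_candidates = np.sum(residuals**2, axis=0)

print("RSS at the ridge estimate:", rss(beta_ridge))
print("smallest RSS among the candidates:", rss_candidates.min())
```

No sampled candidate should achieve a smaller residual sum of squares than the ridge estimate, in line with the constrained formulation.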

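Let $\beta_1$ be a minimizer of $RSS_\lambda(\beta)$. Observe that $\beta_1^\top \beta_1 = s(\lambda)$. Let $\beta_2$ satisfy $\beta_2^\top \beta_2 \le s(\lambda)$. Then

$$0 \le RSS_\lambda(\beta_2) - RSS_\lambda(\beta_1) = (y - X\beta_2)^\top (y - X\beta_2) - (y - X\beta_1)^\top (y - X\beta_1) + \lambda \beta_2^\top \beta_2 - \lambda s(\lambda)$$
$$\le (y - X\beta_2)^\top (y - X\beta_2) - (y - X\beta_1)^\top (y - X\beta_1).$$

Thus $\beta_1$ is also a minimizer for the constrained least squares problem.

Question 5. Show for $\lambda > 0$ that a minimizer of $(y - X\beta)^\top (y - X\beta)$, subject to the constraint $\beta^\top \beta \le s(\lambda)$, is also a minimizer of $RSS_\lambda(\beta)$. Argue that the constrained minimization problem above yields the ordinary least squares estimate whenever $s \ge t$.

Let $\beta_1$ be a minimizer of $(y - X\beta)^\top (y - X\beta)$ subject to the constraint $\beta^\top \beta \le s(\lambda)$. Then

$$(y - X\beta_1)^\top (y - X\beta_1) \le (y - X\hat\beta(\lambda))^\top (y - X\hat\beta(\lambda)),$$

since $\hat\beta(\lambda)$ satisfies the restriction. Together with $\beta_1^\top \beta_1 \le s(\lambda) = \hat\beta(\lambda)^\top \hat\beta(\lambda)$ this gives

$$RSS_\lambda(\beta_1) \le (y - X\hat\beta(\lambda))^\top (y - X\hat\beta(\lambda)) + \lambda \beta_1^\top \beta_1 \le RSS_\lambda(\hat\beta(\lambda)).$$

If $s \ge t$, (at least one of) the ordinary least squares estimate(s) is contained in the restriction set, thus the minimizer of the constrained problem will be the least squares estimate.

The ridge regression estimate can therefore be seen as an ordinary least squares estimate on a parameter set restricted by $\beta^\top \beta \le s$. The translation between the two models is data dependent, since $s$ is given by $\lambda$ in a data dependent manner.

The predicted values in the ordinary least squares regression are

$$\hat{y} = X\hat\beta(0) = X (X^\top X)^{-1} X^\top y,$$

in the case where $X$ has full rank $p$; otherwise replace the inverse with a generalized inverse.

Question 6. Show that for the projection $P = X (X^\top X)^{-1} X^\top$ onto the column space of $X$, we have $\operatorname{tr}(P) = p$ and $P^2 = P$.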
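The two properties in Question 6 are easy to confirm on an example. The sketch below assumes NumPy and a simulated design matrix of full column rank; the sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)           # fixed seed so the run is reproducible
N, p = 30, 4                             # illustrative sizes
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # centered; full column rank here

# Projection onto the column space of X: P = X (X^T X)^{-1} X^T.
P = X @ np.linalg.solve(X.T @ X, X.T)

print("trace of P:", np.trace(P))
print("P is idempotent:", np.allclose(P @ P, P))
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; it does not change what is being computed.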

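By rules of the trace we have

$$\operatorname{tr}(P) = \operatorname{tr}(X (X^\top X)^{-1} X^\top) = \operatorname{tr}(X^\top X (X^\top X)^{-1}) = \operatorname{tr}(I_p) = p.$$

By direct calculation

$$P^2 = X (X^\top X)^{-1} X^\top X (X^\top X)^{-1} X^\top = X (X^\top X)^{-1} X^\top = P.$$

The predicted values in the ridge regression are

$$\hat{y} = X (X^\top X + \lambda I)^{-1} X^\top y.$$

Question 7. Show for $\lambda > 0$ that for the so-called smoother $S_\lambda = X (X^\top X + \lambda I)^{-1} X^\top$ we have

$$\operatorname{tr}(S_\lambda) = \sum_{i=1}^{p} \frac{d_i^2}{d_i^2 + \lambda} < p \quad \text{and} \quad S_\lambda^2 \le S_\lambda.$$

By properties of the trace

$$\operatorname{tr}(S_\lambda) = \operatorname{tr}(X (X^\top X + \lambda I)^{-1} X^\top) = \operatorname{tr}(X^\top X (X^\top X + \lambda I)^{-1})$$
$$= \operatorname{tr}(V D^2 V^\top (V D^2 V^\top + \lambda I)^{-1}) = \operatorname{tr}(V D^2 V^\top (V (D^2 + \lambda I) V^\top)^{-1})$$
$$= \operatorname{tr}(V D^2 V^\top (V^\top)^{-1} (D^2 + \lambda I)^{-1} V^{-1}) = \operatorname{tr}(D^2 (D^2 + \lambda I)^{-1}),$$

and since $\frac{d_i^2}{d_i^2 + \lambda} < 1$ this gives

$$\operatorname{tr}(S_\lambda) = \sum_{i=1}^{p} \frac{d_i^2}{d_i^2 + \lambda} < p.$$

To show that $S_\lambda^2 \le S_\lambda$ we must show that $S_\lambda - S_\lambda^2$ is positive semi-definite. Observe that

$$S_\lambda - S_\lambda^2 = X (X^\top X + \lambda I)^{-1} X^\top - X (X^\top X + \lambda I)^{-1} X^\top X (X^\top X + \lambda I)^{-1} X^\top$$
$$= X (X^\top X + \lambda I)^{-1} \big( (X^\top X + \lambda I) - X^\top X \big) (X^\top X + \lambda I)^{-1} X^\top$$
$$= X (X^\top X + \lambda I)^{-1} (\lambda I) (X^\top X + \lambda I)^{-1} X^\top = \lambda X (X^\top X + \lambda I)^{-2} X^\top.$$

The matrix $(X^\top X + \lambda I)^{-2}$ is positive definite, and since any $b \in \mathbb{R}^N \setminus \{0\}$ gives $X^\top b \in \mathbb{R}^p$, we see that $b^\top (S_\lambda - S_\lambda^2) b \ge 0$.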
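The trace formula and the ordering $S_\lambda^2 \le S_\lambda$ can also be checked numerically. The following sketch assumes NumPy and simulated data; the sizes, the penalty and the variable names are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)           # fixed seed so the run is reproducible
N, p, lam = 30, 4, 3.0                   # illustrative sizes and penalty
X = rng.normal(size=(N, p))
X -= X.mean(axis=0)                      # centered covariates

# Ridge smoother S = X (X^T X + lam I)^{-1} X^T.
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Trace via the singular values d_i of X.
d = np.linalg.svd(X, compute_uv=False)
trace_svd = np.sum(d**2 / (d**2 + lam))

# S - S^2 should be positive semi-definite; symmetrize before eigvalsh
# to remove rounding asymmetry.
gap = S - S @ S
eigs = np.linalg.eigvalsh((gap + gap.T) / 2)

print("trace of S:", np.trace(S), "SVD formula:", trace_svd, "p:", p)
print("smallest eigenvalue of S - S^2:", eigs.min())
```

The smallest eigenvalue may come out as a tiny negative number because of rounding, so it is best compared against a small tolerance rather than exactly zero.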
