Exercise List: Proving convergence of the Gradient Descent Method on the Ridge Regression Problem.


Robert M. Gower
September 2018

Introduction

Ridge regression is perhaps the simplest example of a training problem in machine learning. Consider the task of learning a rule that maps a feature vector $x \in \mathbb{R}^d$ to an output $y \in \mathbb{R}$. You are given a set of labelled observations $(x_i, y_i)$ for $i = 1, \ldots, n$. We restrict ourselves to linear mappings (see footnote 1). That is, we need to find $w \in \mathbb{R}^d$ such that

\[ x_i^\top w \approx y_i, \quad \text{for } i = 1, \ldots, n. \tag{1} \]

In other words, the hypothesis function is parametrized by $w$ and is given by $h_w : x \mapsto w^\top x$. To choose a $w$ such that each $x_i^\top w$ is close to $y_i$, we use the squared loss $\ell(y) = y^2/2$ and the squared regularizer. That is, we minimize

\[ w^* = \arg\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \left( x_i^\top w - y_i \right)^2 + \frac{\lambda}{2} \|w\|_2^2, \tag{2} \]

where $\lambda > 0$ is the regularization parameter. We now have a complete training problem (2) (see footnote 2). With this simple ridge regression problem we can illustrate many different techniques used in machine learning, such as using cross-validation to select $\lambda$, dimension reduction tools, data scaling and stochastic optimization. In this exercise we will solve (2) using gradient descent, and we will establish how fast gradient descent converges.

Footnote 1: We need only consider a linear mapping, as opposed to the more general affine mapping $x_i \mapsto w^\top x_i + \beta$, because the zero order term $\beta \in \mathbb{R}$ can be incorporated by defining new feature vectors $\hat{x}_i = [x_i, 1]$ and a new variable $\hat{w} = [w, \beta]$, so that $\hat{x}_i^\top \hat{w} = x_i^\top w + \beta$.

Footnote 2: Excluding the issue of selecting $\lambda$ using something like cross-validation (wiki/cross-validation_statistics).

Using the matrix notation

\[ X \overset{\text{def}}{=} [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}, \quad \text{and} \quad y = [y_1, \ldots, y_n] \in \mathbb{R}^n, \tag{3} \]

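we can rewrite the objective function in (2) as

\[ f(w) \overset{\text{def}}{=} \frac{1}{2n} \|X^\top w - y\|_2^2 + \frac{\lambda}{2} \|w\|_2^2. \tag{4} \]

First we introduce some necessary notation.

Notation: For every $x, w \in \mathbb{R}^d$ let $\langle x, w \rangle \overset{\text{def}}{=} x^\top w$ and let $\|x\|_2 = \sqrt{\langle x, x \rangle}$. Let $A \in \mathbb{R}^{d \times d}$ be a matrix and let $\sigma_{\min}(A)$ and $\sigma_{\max}(A)$ be the smallest and largest singular values of $A$, defined by

\[ \sigma_{\min}(A) \overset{\text{def}}{=} \min_{x \in \mathbb{R}^d,\, x \neq 0} \frac{\|Ax\|_2}{\|x\|_2} \quad \text{and} \quad \sigma_{\max}(A) \overset{\text{def}}{=} \max_{x \in \mathbb{R}^d,\, x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}. \tag{5} \]

Finally, a result you will need: if $A$ is a symmetric positive semi-definite matrix, the largest singular value of $A$ can be defined instead as

\[ \sigma_{\max}(A) = \max_{x \in \mathbb{R}^d,\, x \neq 0} \frac{\langle Ax, x \rangle}{\|x\|_2^2} = \max_{x \in \mathbb{R}^d,\, x \neq 0} \frac{\|Ax\|_2}{\|x\|_2}. \tag{6} \]

Therefore

\[ \frac{\langle Ax, x \rangle}{\|x\|_2^2} \leq \sigma_{\max}(A), \quad \forall x \in \mathbb{R}^d,\ x \neq 0, \tag{7} \]

and

\[ \|Ax\|_2 \leq \sigma_{\max}(A) \|x\|_2, \quad \forall x \in \mathbb{R}^d. \tag{8} \]

Gradient descent

We will now solve the following ridge regression problem

\[ w^* = \arg\min_{w \in \mathbb{R}^d} \left( \frac{1}{2n} \|X^\top w - y\|_2^2 + \frac{\lambda}{2} \|w\|_2^2 \overset{\text{def}}{=} f(w) \right), \tag{9} \]

using gradient descent.

Ex. 1

Consider the gradient descent method

\[ w^{t+1} = w^t - \alpha \nabla f(w^t), \tag{10} \]

where

\[ \alpha = \frac{1}{\sigma_{\max}(A)}, \tag{11} \]

is a fixed stepsize and

\[ A \overset{\text{def}}{=} \frac{1}{n} X X^\top + \lambda I. \tag{12} \]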
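The definitions (2), (4), (11) and (12) can be checked numerically on a small example. The following is a minimal sketch, assuming NumPy and synthetic Gaussian data; the sizes `d`, `n`, the value of `lam`, the seed and the function names `f_sum` and `f_matrix` are illustrative choices, not part of the exercise.

```python
import numpy as np

# Synthetic data (an assumption made only for illustration).
rng = np.random.default_rng(0)
d, n, lam = 5, 50, 0.1
X = rng.standard_normal((d, n))   # columns are the feature vectors x_i, as in (3)
y = rng.standard_normal(n)

def f_sum(w):
    # Objective (2): average squared loss plus squared regularizer.
    losses = [0.5 * (X[:, i] @ w - y[i]) ** 2 for i in range(n)]
    return sum(losses) / n + 0.5 * lam * (w @ w)

def f_matrix(w):
    # Objective (4): the same function written in matrix notation.
    r = X.T @ w - y
    return (r @ r) / (2 * n) + 0.5 * lam * (w @ w)

A = X @ X.T / n + lam * np.eye(d)   # matrix (12)
sigma_max = np.linalg.norm(A, 2)    # largest singular value, as in (5)
alpha = 1.0 / sigma_max             # stepsize (11)

w = rng.standard_normal(d)
print(abs(f_sum(w) - f_matrix(w)))            # close to zero: (2) and (4) agree
print(sigma_max - np.linalg.eigvalsh(A)[-1])  # close to zero, illustrating (6)
```

Because $A$ is symmetric positive semi-definite, its largest eigenvalue and its largest singular value coincide, which is what the second printed difference illustrates.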

Part I

Show that the gradient $\nabla f(w)$ of (9) is given by

\[ \nabla f(w) = Aw - b = A(w - w^*), \]

where $w^*$ is the solution to (9) and

\[ b \overset{\text{def}}{=} \frac{1}{n} X y. \]

Now that we have calculated the gradient, rewrite the iterates (10) using this gradient.

Part II

Show or convince yourself that $A$ as defined in (12) is positive semi-definite, that is

\[ \langle Aw, w \rangle \geq 0, \quad \forall w \in \mathbb{R}^d, \tag{13} \]

and that

\[ \sigma_{\max}(I - \alpha A) = 1 - \alpha\, \sigma_{\min}(A) = 1 - \frac{\sigma_{\min}(A)}{\sigma_{\max}(A)}. \tag{14} \]

Part III

Show that the iterates (10) converge to $w^*$ according to

\[ \|w^{t+1} - w^*\|_2 \leq \left( 1 - \frac{\sigma_{\min}(A)}{\sigma_{\max}(A)} \right) \|w^t - w^*\|_2, \quad \text{for all } t. \]

The number $1 - \sigma_{\min}(A)/\sigma_{\max}(A)$ is known as the rate of convergence.

Hint 1: Subtract $w^*$ from both sides of (10) and use the results from the previous two parts.

Hint 2: Try to show that $b = Aw^*$!

Part IV

Let

\[ \kappa(A) \overset{\text{def}}{=} \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}, \]

which is known as the condition number of $A$. What happens to $\kappa(A)$ as $\lambda \to \infty$ and as $\lambda \to 0$, respectively? What does this imply about the speed at which gradient descent converges to the solution?
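Before attempting the proofs, the claims in Parts I and II can be sanity-checked numerically. The sketch below assumes NumPy and synthetic data; the helper `numerical_gradient`, the sizes and the seed are hypothetical choices for illustration. It compares the formula $Aw - b$ against central finite differences of $f$, and checks that solving $Aw^* = b$ makes the two gradient expressions agree.

```python
import numpy as np

# Synthetic data (an assumption made only for illustration).
rng = np.random.default_rng(1)
d, n, lam = 4, 30, 0.5
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)

A = X @ X.T / n + lam * np.eye(d)   # matrix (12)
b = X @ y / n                       # vector b from Part I

def f(w):
    # Objective (9) in matrix notation.
    r = X.T @ w - y
    return (r @ r) / (2 * n) + 0.5 * lam * (w @ w)

def numerical_gradient(w, h=1e-6):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for j in range(w.size):
        e = np.zeros_like(w)
        e[j] = h
        g[j] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = rng.standard_normal(d)
w_star = np.linalg.solve(A, b)      # A is invertible here because lam > 0

grad_formula = A @ w - b
print(np.max(np.abs(grad_formula - numerical_gradient(w))))  # small
print(np.max(np.abs(grad_formula - A @ (w - w_star))))       # close to zero
print(np.linalg.eigvalsh(A).min())                           # non-negative, as in (13)
```

A numerical check of this kind does not replace the proof, but it catches sign and scaling mistakes (for example a missing factor $1/n$) early.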

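Part V

Let us consider the extreme case where $\lambda = 0$. Consider the coordinate change $\hat{w} = P w$, where $P \in \mathbb{R}^{d \times d}$ is invertible. With this coordinate change we can solve the problem in $\hat{w}$ given by

\[ \hat{w}^* = \arg\min_{\hat{w} \in \mathbb{R}^d} \left( \frac{1}{2n} \|X^\top P^{-1} \hat{w} - y\|_2^2 + \frac{\lambda}{2} \|P^{-1} \hat{w}\|_2^2 \right), \tag{15} \]

then switch back to the original coordinate system to get the solution in $w$ given by

\[ w^* = P^{-1} \hat{w}^*. \tag{16} \]

If we use gradient descent to solve (15), at what rate does it converge? To get the fastest rate possible, what should $P$ be? Does the choice

\[ P = \operatorname{diag}(X X^\top), \tag{17} \]

make sense?

Extra question: Look up and read about batch normalization. Is it somehow related to preconditioning? Discuss with your colleagues.

Remark: The matrix $P$ is known as the preconditioner, and the particular choice given by (17) is a standard choice known as feature scaling. It is often used in machine learning.

Answer (Ex. 1, Part I)

Differentiating, we have

\[ \nabla f(w) = \left( \frac{1}{n} X X^\top + \lambda I \right) w - \frac{1}{n} X y = Aw - b = A(w - w^*), \]

where the last equality follows since $Aw^* = b$. Indeed, $w^*$ minimizes the convex function $f$, so $\nabla f(w^*) = Aw^* - b = 0$. Consequently the gradient descent method (10) can be written as

\[ w^{t+1} = w^t - \alpha A (w^t - w^*). \tag{18} \]

Answer (Ex. 1, Part II)

First note that, using the stepsize (11) and the bound (7), which applies because $A$ is symmetric positive semi-definite by (13),

\[ \langle (I - \alpha A) x, x \rangle = \|x\|_2^2 - \alpha \langle Ax, x \rangle = \|x\|_2^2 - \frac{\langle Ax, x \rangle}{\sigma_{\max}(A)} \geq \|x\|_2^2 - \frac{\sigma_{\max}(A) \|x\|_2^2}{\sigma_{\max}(A)} = 0, \]

thus the matrix $I - \alpha A$ is positive semi-definite and only has non-negative eigenvalues. Furthermore

\[ \frac{\langle (I - \alpha A) x, x \rangle}{\|x\|_2^2} = 1 - \alpha \frac{\langle Ax, x \rangle}{\|x\|_2^2}, \quad \forall x \neq 0. \tag{19} \]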

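Since $I - \alpha A$ is symmetric positive semi-definite, we can use (6) to calculate its largest singular value. Combining (6) and (19), we have

\[ \sigma_{\max}(I - \alpha A) = \max_{x \in \mathbb{R}^d,\, x \neq 0} \left( 1 - \alpha \frac{\langle Ax, x \rangle}{\|x\|_2^2} \right) = 1 - \alpha \min_{x \in \mathbb{R}^d,\, x \neq 0} \frac{\langle Ax, x \rangle}{\|x\|_2^2} = 1 - \alpha\, \sigma_{\min}(A), \]

where the last equality holds because, for a symmetric positive semi-definite matrix, the minimum of $\langle Ax, x \rangle / \|x\|_2^2$ is the smallest eigenvalue, which equals $\sigma_{\min}(A)$.

Answer (Ex. 1, Part III)

Subtracting $w^*$ from both sides of (18) gives

\[ w^{t+1} - w^* = w^t - w^* - \alpha A (w^t - w^*) = (I - \alpha A)(w^t - w^*). \]

Taking norms in the above and applying (8) and then (14) gives

\[ \|w^{t+1} - w^*\|_2 \leq \sigma_{\max}(I - \alpha A) \|w^t - w^*\|_2 = \left( 1 - \alpha\, \sigma_{\min}(A) \right) \|w^t - w^*\|_2. \]

In particular, for $\alpha = 1/\sigma_{\max}(A)$ the above shows that

\[ \|w^{t+1} - w^*\|_2 \leq \left( 1 - \frac{\sigma_{\min}(A)}{\sigma_{\max}(A)} \right) \|w^t - w^*\|_2. \]

Answer (Ex. 1, Part IV)

Let $\sigma_{\max}(X)$ and $\sigma_{\min}(X)$ denote the largest and smallest values of $\|X^\top w\|_2 / \|w\|_2$ over $w \neq 0$. We can rewrite the largest singular value of $A$ as

\[ \sigma_{\max}(A) = \max_{w \neq 0} \frac{\langle Aw, w \rangle}{\|w\|_2^2} = \max_{w \neq 0} \frac{\left\langle \left( \frac{1}{n} X X^\top + \lambda I \right) w, w \right\rangle}{\|w\|_2^2} = \max_{w \neq 0} \frac{1}{n} \frac{\|X^\top w\|_2^2}{\|w\|_2^2} + \lambda = \frac{\sigma_{\max}(X)^2}{n} + \lambda. \]

And similarly

\[ \sigma_{\min}(A) = \frac{\sigma_{\min}(X)^2}{n} + \lambda. \]

Consequently

\[ \kappa(A) = \frac{\sigma_{\max}(X)^2 / n + \lambda}{\sigma_{\min}(X)^2 / n + \lambda}. \tag{20} \]

Ergo

\[ \lim_{\lambda \to \infty} \kappa(A) = 1, \quad \text{and} \quad \lim_{\lambda \to 0} \kappa(A) = \kappa(X)^2, \]

where the second limit assumes $\sigma_{\min}(X) > 0$ and $\kappa(X) = \sigma_{\max}(X) / \sigma_{\min}(X)$. A large $\lambda$ therefore gives a rate of convergence close to $0$ (fast convergence), while a small $\lambda$ gives a rate close to $1 - 1/\kappa(X)^2$, which is slow when $X$ is badly conditioned.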
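The contraction in Part III and the limits in Part IV can also be observed in a short experiment. This is a minimal sketch, assuming NumPy and synthetic data; the helper names `ridge_system` and `condition_number`, the number of iterations and the values of `lam` are hypothetical choices made for illustration.

```python
import numpy as np

# Synthetic data (an assumption made only for illustration).
rng = np.random.default_rng(2)
d, n = 5, 40
X = rng.standard_normal((d, n))
y = rng.standard_normal(n)

def ridge_system(lam):
    # Returns A from (12) and b from Part I for a given lam.
    A = X @ X.T / n + lam * np.eye(d)
    b = X @ y / n
    return A, b

def condition_number(A):
    # A is symmetric PSD, so its eigenvalues are its singular values.
    eigs = np.linalg.eigvalsh(A)    # ascending order
    return eigs[-1] / eigs[0]

A, b = ridge_system(lam=0.1)
eigs = np.linalg.eigvalsh(A)
alpha = 1.0 / eigs[-1]              # stepsize (11)
rate = 1.0 - eigs[0] / eigs[-1]     # rate of convergence from Part III
w_star = np.linalg.solve(A, b)

w = np.zeros(d)
worst_ratio = 0.0
for t in range(50):
    err_before = np.linalg.norm(w - w_star)
    w = w - alpha * (A @ w - b)     # iterate (10) with gradient Aw - b
    err_after = np.linalg.norm(w - w_star)
    worst_ratio = max(worst_ratio, err_after / err_before)

print(rate, worst_ratio)            # worst_ratio should not exceed rate
for lam in (1e-6, 1.0, 1e6):
    print(lam, condition_number(ridge_system(lam)[0]))
```

The per-step error ratio stays at or below the predicted rate, and the printed condition numbers decrease towards $1$ as `lam` grows, in line with (20).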
