18.657: Mathematics of Machine Learning

Lecturer: Philippe Rigollet
Lecture 11
Scribe: Kevin Li
Oct. 14, 2015

2. CONVEX OPTIMIZATION FOR MACHINE LEARNING

In this lecture, we cover the basics of convex optimization as it applies to machine learning. There is much more to this topic than will be covered in this class, so you may be interested in the following books.

- Convex Optimization by Boyd and Vandenberghe
- Lecture notes on Convex Optimization by Nesterov
- Convex Optimization: Algorithms and Complexity by Bubeck
- Online Convex Optimization by Hazan

The last two are drafts and can be obtained online.

2.1 Convex Problems

A convex problem is an optimization problem of the form $\min_{x \in C} f(x)$ where $f$ and $C$ are convex. First, we debunk the idea that convex problems are easy by showing that virtually all optimization problems can be written as a convex problem. We can rewrite an optimization problem as follows:

$$\min_{x \in \mathcal{X}} f(x) \iff \min_{t \ge f(x),\ x \in \mathcal{X}} t \iff \min_{(x,t) \in \mathrm{epi}(f)} t,$$

where the epigraph of a function is defined by

$$\mathrm{epi}(f) = \{(x,t) \in \mathcal{X} \times \mathbb{R} : t \ge f(x)\}.$$

Figure 1: An example of an epigraph. Source:

Now we observe that for linear functions,

$$\min_{x \in D} c^\top x = \min_{x \in \mathrm{conv}(D)} c^\top x,$$

where the convex hull is defined by

$$\mathrm{conv}(D) = \Big\{ y : \exists N \in \mathbb{Z}_+,\ x_1, \dots, x_N \in D,\ \alpha_i \ge 0,\ \sum_{i=1}^N \alpha_i = 1,\ y = \sum_{i=1}^N \alpha_i x_i \Big\}.$$

To prove this, we know that the left side is at least as big as the right side since $D \subset \mathrm{conv}(D)$. For the other direction, we have

$$\min_{x \in \mathrm{conv}(D)} c^\top x = \min_{N} \min_{x_1,\dots,x_N \in D} \min_{\alpha_1,\dots,\alpha_N} c^\top \sum_{i=1}^N \alpha_i x_i$$
$$= \min_{N} \min_{x_1,\dots,x_N \in D} \min_{\alpha_1,\dots,\alpha_N} \sum_{i=1}^N \alpha_i\, c^\top x_i$$
$$\ge \min_{N} \min_{x_1,\dots,x_N \in D} \min_{\alpha_1,\dots,\alpha_N} \sum_{i=1}^N \alpha_i \min_{x \in D} c^\top x$$
$$= \min_{x \in D} c^\top x.$$

Therefore we have

$$\min_{x \in \mathcal{X}} f(x) \iff \min_{(x,t) \in \mathrm{conv}(\mathrm{epi}(f))} t,$$

which is a convex problem.

Why do we want convexity? As we will show, convexity allows us to infer global information from local information. First, we must define the notion of subgradient.

Definition (Subgradient): Let $C \subset \mathbb{R}^d$, $f : C \to \mathbb{R}$. A vector $g \in \mathbb{R}^d$ is called a subgradient of $f$ at $x \in C$ if

$$f(x) - f(y) \le g^\top (x - y) \quad \forall y \in C.$$

The set of such vectors $g$ is denoted by $\partial f(x)$.

Subgradients essentially correspond to gradients but, unlike gradients, they always exist for convex functions, even when the functions are not differentiable, as illustrated by the next theorem.

Theorem: If $f : C \to \mathbb{R}$ is convex, then for all $x$, $\partial f(x) \neq \emptyset$. In addition, if $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$.

Proof. Omitted. Requires separating hyperplanes for convex sets.
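The subgradient definition can be checked numerically on a simple example. The sketch below is only an illustration, assuming the one-dimensional function $f(y) = |y|$ and a finite grid of test points; the helper name `is_subgradient` is hypothetical and not part of the lecture.

```python
def is_subgradient(f, g, x, ys, tol=1e-12):
    """Check f(x) - f(y) <= g * (x - y) for every test point y (1-D case)."""
    return all(f(x) - f(y) <= g * (x - y) + tol for y in ys)

# Finite grid of test points on [-5, 5]; passing the check on a grid is
# evidence for, not a proof of, the inequality for all y.
ys = [i / 10 for i in range(-50, 51)]

# At the kink x = 0 of f(y) = |y|, every slope g in [-1, 1] is a subgradient.
kink_ok = all(is_subgradient(abs, g, 0.0, ys) for g in (-1.0, -0.3, 0.0, 0.7, 1.0))

# A slope outside [-1, 1] violates the inequality for some y.
outside_fails = not is_subgradient(abs, 1.5, 0.0, ys)

# At x = 2 the function is differentiable: the derivative 1 works, 0.5 does not.
smooth_ok = is_subgradient(abs, 1.0, 2.0, ys) and not is_subgradient(abs, 0.5, 2.0, ys)

print(kink_ok, outside_fails, smooth_ok)
```

At the kink the subdifferential is the whole interval $[-1, 1]$, while at a point of differentiability it reduces to the single value of the derivative, in line with the theorem.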

Theorem: Let $f$, $C$ be convex. If $x$ is a local minimum of $f$ on $C$, then it is also a global minimum. Furthermore, this happens if and only if $0 \in \partial f(x)$.

Proof. $0 \in \partial f(x)$ if and only if $f(x) - f(y) \le 0$ for all $y \in C$. This is clearly equivalent to $x$ being a global minimizer. Next, assume $x$ is a local minimum. Then for all $y \in C$ there exists $\varepsilon$ small enough such that

$$f(x) \le f\big((1-\varepsilon)x + \varepsilon y\big) \le (1-\varepsilon) f(x) + \varepsilon f(y) \implies f(x) \le f(y)$$

for all $y \in C$. The first inequality uses local minimality (the point $(1-\varepsilon)x + \varepsilon y$ lies in $C$ and is close to $x$), and the second uses convexity of $f$.

Not only do we know that local minima are global minima, looking at the subgradient also tells us where the minimum can be. If $g^\top (x - y) < 0$ then $f(x) < f(y)$. This means $y$ cannot possibly be a minimizer, so we can narrow our search to the points $y$ such that $g^\top (x - y) \ge 0$. In one dimension, this corresponds to the half line $\{y \in \mathbb{R} : y \le x\}$ if $g > 0$ and the half line $\{y \in \mathbb{R} : y \ge x\}$ if $g < 0$. This concept leads to the idea of gradient descent.

2.2 Gradient Descent

For $y \approx x$ and $f$ differentiable, the first order Taylor expansion of $f$ at $x$ yields $f(y) \approx f(x) + g^\top (y - x)$. This means that

$$\min_{\|\hat\mu\|_2 = 1} f(x + \varepsilon \hat\mu) \approx \min_{\|\hat\mu\|_2 = 1} f(x) + g^\top (\varepsilon \hat\mu),$$

which is minimized at $\hat\mu = -g / \|g\|_2$. Therefore, to minimize the linear approximation of $f$ at $x$, one should move in the direction opposite to the gradient. Gradient descent is an algorithm that produces a sequence of points $\{x_j\}_{j \ge 1}$ such that (hopefully) $f(x_{j+1}) < f(x_j)$.

Figure 2: Example where the subgradient of $x_1$ is a singleton and the subgradient of $x_2$ contains multiple elements. Source: Subgradient_optimization
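The claim that the normalized negative gradient minimizes the linear term $g^\top \hat\mu$ over unit vectors can be illustrated with a quick sampling experiment. This is a minimal sketch under assumed values: the vector `g`, the sample count, and the helper names are hypothetical and chosen only for illustration.

```python
import math
import random

random.seed(0)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def random_unit_vector(d):
    """Draw a direction uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    n = math.sqrt(dot(v, v))
    return [vi / n for vi in v]

g = [2.0, -1.0, 0.5]                      # assumed gradient at the current point
g_norm = math.sqrt(dot(g, g))
steepest = [-gi / g_norm for gi in g]     # candidate minimizer -g / ||g||_2

best_value = dot(g, steepest)             # equals -||g||_2
sampled_min = min(dot(g, random_unit_vector(len(g))) for _ in range(10_000))

# No sampled unit direction gives a smaller linear term than -g / ||g||_2.
print(best_value, sampled_min)
```

By the Cauchy-Schwarz inequality, $g^\top \hat\mu \ge -\|g\|_2$ for every unit vector $\hat\mu$, so the sampled minimum can approach but never go below the value attained at $-g/\|g\|_2$.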

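Algorithm 1 Gradient Descent algorithm
Input: $x_1 \in C$, positive sequence $\{\eta_s\}_{s \ge 1}$
for $s = 1$ to $k - 1$ do
    $x_{s+1} = x_s - \eta_s g_s$, $g_s \in \partial f(x_s)$
end for
return Either $\bar{x} = \frac{1}{k} \sum_{s=1}^k x_s$ or $x^\circ \in \operatorname{argmin}_{x \in \{x_1, \dots, x_k\}} f(x)$

Theorem: Let $f$ be a convex $L$-Lipschitz function on $\mathbb{R}^d$ such that $x^* \in \operatorname{argmin}_{\mathbb{R}^d} f(x)$ exists. Assume that $\|x_1 - x^*\|_2 \le R$. Then if $\eta_s = \eta = \frac{R}{L\sqrt{k}}$ for all $s \ge 1$,

$$f\Big(\frac{1}{k} \sum_{s=1}^k x_s\Big) - f(x^*) \le \frac{LR}{\sqrt{k}}$$

and

$$\min_{1 \le s \le k} f(x_s) - f(x^*) \le \frac{LR}{\sqrt{k}}.$$

Proof. Using the fact that $g_s = \frac{1}{\eta}(x_s - x_{s+1})$ and the equality $2 a^\top b = \|a\|^2 + \|b\|^2 - \|a - b\|^2$,

$$f(x_s) - f(x^*) \le g_s^\top (x_s - x^*) = \frac{1}{\eta} (x_s - x_{s+1})^\top (x_s - x^*)$$
$$= \frac{1}{2\eta} \Big[ \|x_s - x_{s+1}\|^2 + \|x_s - x^*\|^2 - \|x_{s+1} - x^*\|^2 \Big]$$
$$= \frac{\eta}{2} \|g_s\|^2 + \frac{1}{2\eta} \big(\delta_s^2 - \delta_{s+1}^2\big),$$

where we have defined $\delta_s = \|x_s - x^*\|$. Using the Lipschitz condition, which gives $\|g_s\| \le L$,

$$f(x_s) - f(x^*) \le \frac{\eta}{2} L^2 + \frac{1}{2\eta} \big(\delta_s^2 - \delta_{s+1}^2\big).$$

Taking the average from $1$ to $k$, the sum telescopes and we get

$$\frac{1}{k} \sum_{s=1}^k f(x_s) - f(x^*) \le \frac{\eta}{2} L^2 + \frac{1}{2k\eta} \big(\delta_1^2 - \delta_{k+1}^2\big) \le \frac{\eta}{2} L^2 + \frac{\delta_1^2}{2k\eta} \le \frac{\eta}{2} L^2 + \frac{R^2}{2k\eta}.$$

Taking $\eta = \frac{R}{L\sqrt{k}}$ to minimize the expression, we obtain

$$\frac{1}{k} \sum_{s=1}^k f(x_s) - f(x^*) \le \frac{LR}{\sqrt{k}}.$$

Noticing that the left-hand side of the inequality is larger than both $f(\bar{x}) - f(x^*)$, by Jensen's inequality, and $\min_{1 \le s \le k} f(x_s) - f(x^*)$ completes the proof.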
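The algorithm and the bound of the theorem can be tried out on a small problem. The following is a sketch under stated assumptions, not the lecture's own code: the test function $f(x) = \|x\|_1$ on $\mathbb{R}^2$ (minimized at $x^* = 0$, with Lipschitz constant $L = \sqrt{d}$ with respect to the Euclidean norm), the starting point, and all function names are hypothetical choices made for illustration.

```python
import math

def l1_norm(x):
    return sum(abs(xi) for xi in x)

def l1_subgradient(x):
    """One valid subgradient of the l1 norm: sign(x_i) per coordinate (0 at a kink)."""
    return [(xi > 0) - (xi < 0) for xi in x]

def gradient_descent(subgrad, x1, eta, k):
    """Run x_{s+1} = x_s - eta * g_s for s = 1, ..., k-1 and return all k iterates."""
    xs = [list(x1)]
    for _ in range(k - 1):
        g = subgrad(xs[-1])
        xs.append([xi - eta * gi for xi, gi in zip(xs[-1], g)])
    return xs

x1 = [3.0, -4.0]                              # assumed starting point
R = math.sqrt(sum(xi * xi for xi in x1))      # ||x_1 - x*||_2 with x* = 0
L = math.sqrt(len(x1))                        # Lipschitz constant of the l1 norm
k = 100
eta = R / (L * math.sqrt(k))                  # constant step size from the theorem

xs = gradient_descent(l1_subgradient, x1, eta, k)
x_bar = [sum(coords) / k for coords in zip(*xs)]  # averaged iterate

gap_avg = l1_norm(x_bar)                      # f(x_bar) - f(x*), since f(x*) = 0
gap_min = min(l1_norm(x) for x in xs)         # min_s f(x_s) - f(x*)
bound = L * R / math.sqrt(k)

print(gap_avg, gap_min, bound)
```

Both returned candidates, the averaged iterate and the best iterate, should satisfy the $LR/\sqrt{k}$ guarantee on this example, even though the individual iterates end up oscillating around the minimizer because the step size is constant.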

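One flaw with this theorem is that the step size depends on $k$. We would rather have step sizes $\eta_s$ that do not depend on $k$, so that the inequalities hold for all $k$. With the new step sizes,

$$\sum_{s=1}^k \eta_s \big[ f(x_s) - f(x^*) \big] \le \sum_{s=1}^k \Big( \frac{\eta_s^2}{2} L^2 + \frac{1}{2} \big(\delta_s^2 - \delta_{s+1}^2\big) \Big) \le \Big( \sum_{s=1}^k \frac{\eta_s^2}{2} \Big) L^2 + \frac{R^2}{2}.$$

After dividing by $\sum_{s=1}^k \eta_s$, we would like the right-hand side to approach $0$. For this to happen we need $\frac{\sum \eta_s^2}{\sum \eta_s} \to 0$ and $\sum \eta_s \to \infty$. One candidate for the step size is $\eta_s = \frac{G}{\sqrt{s}}$, since then $\sum_{s=1}^k \eta_s^2 \le c_1 G^2 \log(k)$ and $\sum_{s=1}^k \eta_s \ge c_2 G \sqrt{k}$. So we get

$$\Big( \sum_{s=1}^k \eta_s \Big)^{-1} \sum_{s=1}^k \eta_s \big[ f(x_s) - f(x^*) \big] \le \frac{c_1 G L^2 \log k}{2 c_2 \sqrt{k}} + \frac{R^2}{2 c_2 G \sqrt{k}}.$$

Choosing $G$ appropriately, the right-hand side approaches $0$ at the rate of $LR \frac{\log k}{\sqrt{k}}$. Notice that we get an extra factor of $\log k$. However, if we look at the sum from $k/2$ to $k$ instead of $1$ to $k$, then $\sum_{s=k/2}^k \eta_s^2 \le c_1' G^2$ and $\sum_{s=k/2}^k \eta_s \ge c_2' G \sqrt{k}$. Now we have

$$\min_{1 \le s \le k} f(x_s) - f(x^*) \le \min_{k/2 \le s \le k} f(x_s) - f(x^*) \le \Big( \sum_{s=k/2}^k \eta_s \Big)^{-1} \sum_{s=k/2}^k \eta_s \big[ f(x_s) - f(x^*) \big] \le \frac{c L R}{\sqrt{k}},$$

which is the same rate as in the theorem, and the step sizes are independent of $k$.

Important Remark: Note that this rate only holds if we can ensure that $\|x_{k/2} - x^*\|_2 \le R$, since we have replaced $x_1$ by $x_{k/2}$ in the telescoping sum. In general, this is not true for gradient descent, but it will be true for projected gradient descent in the next lecture.

One final remark is that the dimension $d$ does not appear anywhere in the proof. However, the dimension does have an effect, because the conditions that $f$ is $L$-Lipschitz and $\|x_1 - x^*\|_2 \le R$ become stronger in higher dimensions.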
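The difference between summing from $1$ to $k$ and from $k/2$ to $k$ can be seen by evaluating the right-hand side of the bound directly. The sketch below is only an arithmetic illustration under assumptions: it takes $\eta_s = G/\sqrt{s}$ with the assumed choice $G = R/L$ and $L = R = 1$, uses the same $R^2/2$ term for both sums (which for the half sum presupposes $\|x_{k/2} - x^*\|_2 \le R$, as in the remark), and the function name `weighted_rate` is hypothetical.

```python
import math

def weighted_rate(k, G, L, R, start):
    """Evaluate ((L^2 / 2) * sum eta_s^2 + R^2 / 2) / sum eta_s for s = start..k,
    with step sizes eta_s = G / sqrt(s)."""
    etas = [G / math.sqrt(s) for s in range(start, k + 1)]
    return (0.5 * L**2 * sum(e * e for e in etas) + 0.5 * R**2) / sum(etas)

L, R = 1.0, 1.0      # assumed constants
G = R / L            # assumed choice of G, independent of k

full_ratios, half_ratios = [], []
for k in (10**2, 10**3, 10**4, 10**5):
    scale = L * R / math.sqrt(k)                      # target rate LR / sqrt(k)
    full_ratios.append(weighted_rate(k, G, L, R, 1) / scale)
    half_ratios.append(weighted_rate(k, G, L, R, k // 2) / scale)

# The full-sum ratio keeps growing (the log k factor); the half-sum ratio stays bounded.
print(full_ratios)
print(half_ratios)
```

Relative to $LR/\sqrt{k}$, the bound obtained from the full sum grows logarithmically in $k$, while the one obtained from the second half of the iterates stays of constant order.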

MIT OpenCourseWare
18.657 Mathematics of Machine Learning, Fall 2015
For information about citing these materials or our Terms of Use, visit:
