15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

Size: px

Start display at page:

Download "15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015"

Marshall Floyd
5 years ago
Views:

1 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and analyze it in the context of convex optimization. Preliminaries First, recall the following definitions: Definition 6.. A convex set K R n is said to be convex iff ( λx + ( λy K x, y K, λ [0, ] Definition 6.2. A function f : R n R is said to be convex iff f(λx + ( λy λf(x + ( λf(y x, y R n, λ [0, ] In the context of this lecture, we will always assume that the function f is differentiable. Fact 6.3. A function f is convex iff x, y f(y f(x + f(x, y x Geometrically Fact 6.3 states that the function always lies above its tangent (see Fig 6.. If Figure 6.: he blue line denotes the function and the red line is the tangent at x. [V] the function f is also twice differentiable, then we denote its second derivative (a.k.a. Jacobian by H(f i,j := f x i x j. Fact 6.4. A twice differentiable function f is convex iff H(f 0. 2 Convex Minimization and Gradient Descent here are two kinds of problems that we will concern our-self with:. Unconstrained Convex Minimization (UCM: Given a convex function f min f(x x R n 2. Constrained Convex Minimization (CCM: Given a convex function f and convex set K, min f(x x K he notation A B signifies A B is positive semidefinite.

2 Algorithm : Gradient Descent for t 0 to do x t+ x t η t f(x t end Result: ˆx = t=0 x i 2. Unconstrained Convex Minimization One useful property of convex functions is that f is convex implies that all local minima are also global minima. Hence, solving f(x = 0 would enable us to compute the global minima exactly. Quite often, it is not possible to solve f = 0. However, we can hope to iteratively approximate the optimal solution x. We do so by taking steps in the direction opposite to gradient, to get closer to a local minimum. he algorithm commonly known as Gradient Descent is described in Algorithm and satisfies the following guarantee. heorem 6.5. For any convex function f : R n R with x R n f(x 2 G and suppose x x 0 D then f(ˆx f(x ɛ if we set = ( GD 2 ɛ and ηt = D G. We will use the following elementary fact in the proof a, b = 2 [ a 2 + b 2 a b 2 ] (F Proof. f(x t f(x f(x t, x t x = η x t x t+, x t x x t x t+ 2 + x t x t 2 x t+ x 2 η f(x t t 2 t+ By convexity of f By update rule Using Fact F (6. 2

Analyzing the objective, ( f(ˆx f(x t=0 = f x t f(x [f(x t f(x ] t=0 t=0 = ( ( η f(x t 2 + 2 t 2 t+ ( η f(x t 2 + 2 0 2 η f(x t 2 + 2 0 By convexity of f Substituting in 6.

3 Analyzing the objective, ( f(ˆx f(x t=0 = f x t f(x [f(x t f(x ] t=0 t=0 = ( ( η f(x t t 2 t+ ( η f(x t η f(x t By convexity of f Substituting in 6. [η2 G 2 + D 2 ] DG Plugging in η t = D ɛ and = ( GD ε 2 G Remark 6.6. he algorithm s path resembles that of a random walk towards the optimum goal (see Fig 6.2. In practice, one may choose to vary the step length as time progresses. he above Figure 6.2: he yellow lines denote the level sets of the function f and the red walk denotes the steps of gradient descent. [Com06] analysis is tight assuming lipschitzness. 2.2 Constrained Convex Minimization Unlike the case of UCM, the derivative may not be 0 at the optimum. Nonetheless, the main idea of gradient descent still yields a good algorithm. We would like to take steps opposite to the gradient but the update rule needs to ensure that the new point x t+ lies within K. o ensure this, we simply project each step back onto K. Let K : Rn K be defined as (y = argmin x K x y K 3

4 he modified algorithm is given in Algorithm 2 with changes highlighted in blue. We will show Algorithm 2: Gradient Descent For CCM for t 0 to do y t+ x t η t f(x t ; x t+ ( K yt+ ; end Result: ˆx = t=0 x i below that a theorem (and analysis similar to that of heorem 6.5 holds. heorem 6.7. Given a convex set K with diameter D and any convex function f : R n R with x R n f(x 2 G and suppose x 0 K then f(ˆx f(x ɛ if we set = ( GD 2 ɛ and η t = Proof. D G. f(x t f(x f(x t, x t x = η x t y t+, x t x x t y t+ 2 + x t x t 2 y t+ x 2 ( x t y t+ 2 + x t x t 2 x t+ x 2 η f(x t t 2 t+ he rest of the argument is identical to the proof of heorem Relation with Multiplicative Weights By convexity of f By update rule Using Fact F By definition of K Consider the following problem: At each time step, you propose an x t and an adversary exhibits a function f t with f t G. he cost of each time step is f t (x t and your objective is to minimize regret. o solve this problem, we can use the same algorithm (with one slight modification. he update rule is now taken with respect to gradient of the current function f t. x t+ x t η t f t (x t he same analysis shows that if ( DG 2 ε ( ft (x t f t (x DG ε t=0 One advantage of this algorithm (and analysis is that it holds for all convex bodies, as opposed to MW algorithm which holds just for the simplex. However, it now depends D and G (instead of log(# of experts, while the ( ε 2 dependence is the same. In the special case of linear functions (i.e. f i (x = l i, x and K is the simplex, the previous theorem will give us 2N as diam(k = 2 and l ε 2 i N. his is substantially worse than the guarantees that multiplicative weights provides, but we will see how this can be improved. 4

5 4 Further Assumptions and Modifications 4. Sub-Gradients Definition 6.8. A function z is called a sub-gradient if y R n f(y f(x + z(x, y x If a function is not differentiable, we can use sub-gradient instead and the entire proof goes through again. Even an approximate sub-gradient suffices. 4.2 β-smooth function Definition 6.9. A function f is a β-smooth convex function if f(y f(x + f(x, y x + β 2 x y 2 for all x, y K Fact 6.0. f is l-smooth convex function H(f βi. In this case, the gradient descent algorithm with step size η t = O ( lt yields an ˆx which satisfies f(ˆx f(x ε when = O( G2 log D ε. 4.3 l-strongly convex functions Definition 6.. A function is l-strongly convex if f(y f(x + f(x, y x + l 2 x y 2 for all x, y K Fact 6.2. f is t-smooth convex function H(f li. In this case, the gradient descent algorithm converges to a solution with error ε in = O( ε 4.4 Well-conditioned Functions Functions that are β-smooth and l-strongly convex are known as well-conditioned functions with condition number β l. In this case, we get the much stronger convergence in time = O(log( ε. heorem 6.3. Given a function f which is β-smooth and l-strongly convex and let x be the solution to the unconstrained convex minimization problem argmin x R nf(x then x t x 2 exp( δt x t x 2 where δ = δ(β, l f(x t f(x β ( 4t 2 exp β/l + he proof of the above theorem and many others is included in the upcoming book by Hazan (see Chapter 3 in [Haz]. 5

6 References [Com06] Wikimedia Commons. Gradient ascent(countour, Available at wiki/gradient_descent#/media/file:gradient_ascent_(contour.png. 6.2 [Haz] [V] Elad Hazan. Introduction to Online Convex Optimization. Graduate text in machine learning and optimization. Available at Nisheeth Vishnoi and Jakub arnawski. Fundamentals of convex optimization. lecture basics, gradient descent and its variants. Available at files/lec-fall4-ver.pdf. 6. 6

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016

1 Overview. 2 Learning from Experts. 2.1 Defining a meaningful benchmark. AM 221: Advanced Optimization Spring 2016 AM 1: Advanced Optimization Spring 016 Prof. Yaron Singer Lecture 11 March 3rd 1 Overview In this lecture we will introduce the notion of online convex optimization. This is an extremely useful framework