Gradient Descent
Ryan Tibshirani, Convex Optimization 10-725/36-725
Descent methods

Solve $\min_x f(x)$ by producing iterates $x_k \to x_{k+1} \to \cdots \to x^\star$, where the limit point satisfies $\nabla f(x^\star) = 0$.
Gradient methods

Unconstrained optimization: $\min_{x \in \mathbb{R}^n} f(x)$.

Suppose we have a vector $x \in \mathbb{R}^n$ for which $\nabla f(x) \neq 0$. Consider the ray $x(\alpha) = x - \alpha \nabla f(x)$ for $\alpha \geq 0$, and make a first-order Taylor expansion around $x$:

$f(x(\alpha)) = f(x) + \langle \nabla f(x), x(\alpha) - x \rangle + o(\|x(\alpha) - x\|_2)$
$\qquad = f(x) - \alpha \|\nabla f(x)\|_2^2 + o(\alpha \|\nabla f(x)\|_2)$
$\qquad = f(x) - \alpha \|\nabla f(x)\|_2^2 + o(\alpha)$

For $\alpha$ near 0, $\alpha \|\nabla f(x)\|_2^2$ dominates $o(\alpha)$, so for positive, sufficiently small $\alpha$, $f(x(\alpha))$ is smaller than $f(x)$.
Descent methods

Carrying the idea further, consider $x(\alpha) = x + \alpha d$, where the direction $d \in \mathbb{R}^n$ is obtuse to $\nabla f(x)$, i.e., $\langle \nabla f(x), d \rangle < 0$. Again we have the Taylor expansion

$f(x(\alpha)) = f(x) + \alpha \langle \nabla f(x), d \rangle + o(\alpha)$,

where $\alpha \langle \nabla f(x), d \rangle$ dominates $o(\alpha)$ for sufficiently small $\alpha$. Since $d$ is obtuse to $\nabla f(x)$, this implies $f(x(\alpha)) < f(x)$.
Gradient descent

Consider unconstrained, smooth convex optimization $\min_x f(x)$, i.e., $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$. Denote the optimal criterion value by $f^\star = \min_x f(x)$, and a solution by $x^\star$.

Gradient descent: choose an initial $x^{(0)} \in \mathbb{R}^n$ and repeat

$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)}), \quad k = 1, 2, 3, \ldots$

Stop at some point.
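As a concrete sketch (my own minimal illustration, not from the slides), fixed-step gradient descent on the quadratic $f(x) = (10x_1^2 + x_2^2)/2$, which is also the example used later in these notes:

```python
import numpy as np

def grad_descent(grad, x0, t, k):
    """Run k fixed-step gradient descent iterations x <- x - t * grad(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(k):
        x = x - t * grad(x)
    return x

# Example quadratic: f(x) = (10 x1^2 + x2^2) / 2, minimized at x* = 0;
# here L = 10, so t = 0.1 = 1/L is a safe fixed step size.
f = lambda x: (10 * x[0] ** 2 + x[1] ** 2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

x_final = grad_descent(grad_f, [1.0, 1.0], t=0.1, k=100)
```

With this step size the iterates contract toward the minimizer at the origin.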
Gradient descent interpretation

At each iteration, consider the expansion

$f(y) \approx f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t}\|y - x\|_2^2$

This is a quadratic approximation, replacing the usual Hessian $\nabla^2 f(x)$ by $\frac{1}{t} I$: the term $f(x) + \nabla f(x)^T (y - x)$ is the linear approximation to $f$, and $\frac{1}{2t}\|y - x\|_2^2$ is a proximity term to $x$, with weight $1/(2t)$. Choosing the next point $y = x^+$ to minimize the quadratic approximation gives

$x^+ = x - t \nabla f(x)$
[Figure: $x^+ = \mathrm{argmin}_y \; f(x) + \nabla f(x)^T (y - x) + \frac{1}{2t}\|y - x\|_2^2$; the blue point is $x$, the red point is $x^+$.]
Outline: how to choose step sizes; convergence analysis; forward stagewise regression, boosting.
Fixed step size

Simply take $t_k = t$ for all $k = 1, 2, 3, \ldots$; this can diverge if $t$ is too big. Consider $f(x) = (10x_1^2 + x_2^2)/2$. [Figure: gradient descent after 8 steps, diverging with too large a step size.]
Can be slow if $t$ is too small. [Figure: same example, gradient descent after 100 steps with too small a step size.]
[Figure: same example, gradient descent after 40 appropriately sized steps.] Clearly there's a tradeoff; the convergence analysis later will give us a better idea.
Backtracking line search

One way to adaptively choose the step size is backtracking line search:
First fix parameters $0 < \beta < 1$ and $0 < \alpha \leq 1/2$.
At each iteration, start with $t = 1$, and while
$f(x - t\nabla f(x)) > f(x) - \alpha t \|\nabla f(x)\|_2^2$
shrink $t = \beta t$. Else perform the gradient descent update $x^+ = x - t\nabla f(x)$.
Simple and tends to work well in practice (further simplification: just take $\alpha = 1/2$).
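A minimal sketch of this rule (the quadratic test function is my own choice, not from the slides):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha=0.5, beta=0.5):
    """One gradient descent update with backtracking line search:
    start at t = 1 and shrink t <- beta * t while
    f(x - t g) > f(x) - alpha * t * ||g||_2^2."""
    g = grad_f(x)
    t = 1.0
    while f(x - t * g) > f(x) - alpha * t * (g @ g):
        t = beta * t
    return x - t * g, t

# Test function: f(x) = (10 x1^2 + x2^2) / 2, minimized at x* = 0
f = lambda x: (10 * x[0] ** 2 + x[1] ** 2) / 2
grad_f = lambda x: np.array([10.0 * x[0], x[1]])

x = np.array([1.0, 1.0])
for _ in range(100):
    x, t = backtracking_step(f, grad_f, x)
```

Each accepted step satisfies the sufficient-decrease condition, so $f$ is guaranteed to decrease monotonically.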
Backtracking picks up roughly the right step size (12 outer steps, 40 steps total). [Figure: iterates with backtracking line search; here $\alpha = \beta = 0.5$.]
Exact line search

We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:
$t = \mathrm{argmin}_{s \geq 0} f(x - s\nabla f(x))$
It is usually not possible to do this minimization exactly, and approximations to exact line search are often not much more efficient than backtracking, so it's usually not worth it.
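One case where exact line search is possible: for a quadratic $f(x) = \frac{1}{2}x^T Q x$, minimizing $s \mapsto f(x - s g)$ with $g = \nabla f(x) = Qx$ has the closed form $s^\star = (g^T g)/(g^T Q g)$. A sketch (my own example, not from the slides):

```python
import numpy as np

# For a quadratic f(x) = 0.5 x^T Q x, the line minimization has a closed
# form: with g = grad f(x) = Q x, argmin_s f(x - s g) = (g^T g) / (g^T Q g).
Q = np.diag([10.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x

x = np.array([1.0, 1.0])
for _ in range(40):
    g = Q @ x                    # gradient at the current iterate
    s = (g @ g) / (g @ Q @ g)    # exact line search step along -g
    x = x - s * g
```

Even with the exact step, the iterates zigzag; the per-step contraction is governed by the condition number of $Q$.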
Convergence analysis

Assume that $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and additionally
$\|\nabla f(x) - \nabla f(y)\|_2 \leq L\|x - y\|_2$ for any $x, y$,
i.e., $\nabla f$ is Lipschitz continuous with constant $L > 0$.

Theorem: gradient descent with fixed step size $t \leq 1/L$ satisfies
$f(x^{(k)}) - f^\star \leq \frac{\|x^{(0)} - x^\star\|_2^2}{2tk}$

We say gradient descent has convergence rate $O(1/k)$; i.e., to get $f(x^{(k)}) - f^\star \leq \epsilon$, we need $O(1/\epsilon)$ iterations.
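A numeric sanity check of this bound on a small quadratic (my own example; there $x^\star = 0$ and $f^\star = 0$):

```python
import numpy as np

# Check f(x_k) - f* <= ||x0 - x*||_2^2 / (2 t k) for t = 1/L at every k.
Q = np.diag([10.0, 1.0])         # f(x) = 0.5 x^T Q x, so L = 10
f = lambda x: 0.5 * x @ Q @ x
L = 10.0
t = 1.0 / L

x0 = np.array([1.0, 1.0])
x = x0.copy()
bound_ok = True
for k in range(1, 51):
    x = x - t * (Q @ x)          # gradient step
    bound_ok = bound_ok and f(x) <= x0 @ x0 / (2 * t * k)
```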
Convergence analysis for backtracking

Same assumptions: $f$ is convex and differentiable with $\mathrm{dom}(f) = \mathbb{R}^n$, and $\nabla f$ is Lipschitz continuous with constant $L > 0$. We get the same rate for a step size chosen by backtracking search.

Theorem: gradient descent with backtracking line search satisfies
$f(x^{(k)}) - f^\star \leq \frac{\|x^{(0)} - x^\star\|_2^2}{2 t_{\min} k}$, where $t_{\min} = \min\{1, \beta/L\}$.

If $\beta$ is not too small, then we don't lose much compared to the fixed step size ($\beta/L$ vs. $1/L$).
Convergence analysis under strong convexity

Reminder: strong convexity of $f$ means that $f(x) - \frac{m}{2}\|x\|_2^2$ is convex for some $m > 0$. If $f$ is twice differentiable, this implies $\nabla^2 f(x) \succeq mI$ for any $x$, and gives a sharper lower bound than usual convexity:
$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|_2^2$ for all $x, y$.

Under the Lipschitz assumption as before, plus strong convexity:

Theorem: gradient descent with fixed step size $t \leq 2/(m + L)$, or with backtracking line search, satisfies
$f(x^{(k)}) - f^\star \leq c^k \frac{L}{2}\|x^{(0)} - x^\star\|_2^2$, where $0 < c < 1$.
That is, the rate under strong convexity is $O(c^k)$: exponentially fast! To get $f(x^{(k)}) - f^\star \leq \epsilon$, we need $O(\log(1/\epsilon))$ iterations. This is called linear convergence, because it looks linear on a semi-log plot. [Figure: $f(x^{(k)}) - f^\star$ versus $k$ on a semi-log scale, for exact line search and backtracking line search.] The constant $c$ depends adversely on the condition number $L/m$ (a higher condition number means a slower rate).
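The quadratic example from before is strongly convex with $m = 1$ and $L = 10$, so we can check the geometric bound directly (again my own example; with $t = 2/(m+L)$ the contraction factor is $c = ((\kappa-1)/(\kappa+1))^2$, $\kappa = L/m$):

```python
import numpy as np

# Verify f(x_K) - f* <= c^K (L/2) ||x0 - x*||_2^2 under strong convexity.
Q = np.diag([10.0, 1.0])
f = lambda x: 0.5 * x @ Q @ x
m, L = 1.0, 10.0
t = 2.0 / (m + L)
kappa = L / m
c = ((kappa - 1) / (kappa + 1)) ** 2   # per-step contraction of ||x - x*||^2

x0 = np.array([1.0, 1.0])
x = x0.copy()
K = 30
for _ in range(K):
    x = x - t * (Q @ x)

theory = c ** K * (L / 2) * (x0 @ x0)  # theoretical bound on f(x_K) - f*
```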
Convex function

$f$ is convex if $\mathrm{dom}\,f$ is a convex set and Jensen's inequality holds:
$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y)$ for all $x, y \in \mathrm{dom}\,f$, $\theta \in [0,1]$

First-order condition: for (continuously) differentiable $f$, Jensen's inequality can be replaced with
$f(y) \geq f(x) + \nabla f(x)^T(y - x)$ for all $x, y \in \mathrm{dom}\,f$

Second-order condition: for twice differentiable $f$, Jensen's inequality can be replaced with
$\nabla^2 f(x) \succeq 0$ for all $x \in \mathrm{dom}\,f$

Gradient method 1-6
Strictly convex function

$f$ is strictly convex if $\mathrm{dom}\,f$ is a convex set and
$f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y)$ for all $x, y \in \mathrm{dom}\,f$, $x \neq y$, $\theta \in (0,1)$

First-order condition (for differentiable $f$): $\mathrm{dom}\,f$ is convex and
$f(y) > f(x) + \nabla f(x)^T(y - x)$ for all $x, y \in \mathrm{dom}\,f$, $x \neq y$;
hence the minimizer of $f$ is unique (if a minimizer exists).

Second-order condition: note that $\nabla^2 f(x) \succ 0$ is not necessary for strict convexity (cf. $f(x) = x^4$).

Gradient method 1-7
Strongly convex function

$f$ is strongly convex with parameter $m > 0$ if $f(x) - \frac{m}{2}\|x\|_2^2$ is convex.

Jensen's inequality:
$f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) - \frac{m}{2}\theta(1-\theta)\|x - y\|_2^2$

First-order condition:
$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|_2^2$ for all $x, y \in \mathrm{dom}\,f$

Second-order condition: $\nabla^2 f(x) \succeq mI$ for all $x \in \mathrm{dom}\,f$

Gradient method 1-8
Quadratic lower bound

From the first-order condition: if $f$ is strongly convex with parameter $m$, then
$f(y) \geq f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|_2^2$ for all $x, y \in \mathrm{dom}\,f$
[Figure: $f(y)$ lies above a quadratic touching it at $(x, f(x))$.]

If $\mathrm{dom}\,f = \mathbb{R}^n$, this implies that $f$ has a unique minimizer $x^\star$ and
$\frac{m}{2}\|x - x^\star\|_2^2 \leq f(x) - f(x^\star) \leq \frac{1}{2m}\|\nabla f(x)\|_2^2$

Gradient method 1-9
Monotonicity of gradient

A differentiable $f$ is convex if and only if $\mathrm{dom}\,f$ is convex and
$(\nabla f(x) - \nabla f(y))^T(x - y) \geq 0$ for all $x, y \in \mathrm{dom}\,f$,
i.e., $\nabla f : \mathbb{R}^n \to \mathbb{R}^n$ is a monotone mapping.

The gradient of a strictly convex function is strictly monotone:
$(\nabla f(x) - \nabla f(y))^T(x - y) > 0$ for all $x, y \in \mathrm{dom}\,f$, $x \neq y$.

The gradient of a strongly convex function is strongly monotone, or coercive:
$(\nabla f(x) - \nabla f(y))^T(x - y) \geq m\|x - y\|_2^2$ for all $x, y \in \mathrm{dom}\,f$.

Gradient method 1-10
Proof of monotonicity property

If $f$ is differentiable and convex, then
$f(y) \geq f(x) + \nabla f(x)^T(y - x), \quad f(x) \geq f(y) + \nabla f(y)^T(x - y)$;
combining the inequalities gives $(\nabla f(x) - \nabla f(y))^T(x - y) \geq 0$.

Conversely, if $\nabla f$ is monotone, then $g'(t) \geq g'(0)$ for $t \geq 0$, where
$g(t) = f(x + t(y - x)), \quad g'(t) = \nabla f(x + t(y - x))^T(y - x)$.
Hence
$f(y) = g(1) = g(0) + \int_0^1 g'(t)\,dt \geq g(0) + g'(0) = f(x) + \nabla f(x)^T(y - x)$.

Gradient method 1-11
Functions with Lipschitz continuous gradients

The gradient of $f$ is Lipschitz continuous with parameter $L > 0$ if
$\|\nabla f(x) - \nabla f(y)\|_2 \leq L\|x - y\|_2$ for all $x, y \in \mathrm{dom}\,f$.

From the Cauchy-Schwarz inequality, this implies
$(\nabla f(x) - \nabla f(y))^T(x - y) \leq L\|x - y\|_2^2$,
i.e., $Lx - \nabla f(x)$ is monotone; in other words, $\frac{L}{2}\|x\|_2^2 - f(x)$ is convex.

For twice differentiable $f$, this is equivalent to $\nabla^2 f(x) \preceq LI$ for all $x \in \mathrm{dom}\,f$.

Gradient method 1-12
Quadratic upper bound

If $\nabla f$ is Lipschitz continuous with parameter $L$, then
$f(y) \leq f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|y - x\|_2^2$ for all $x, y \in \mathrm{dom}\,f$.
[Figure: $f(y)$ lies below a quadratic touching it at $(x, f(x))$.]

If $\mathrm{dom}\,f = \mathbb{R}^n$ and $f$ has a minimizer $x^\star$, this implies
$\frac{1}{2L}\|\nabla f(x)\|_2^2 \leq f(x) - f(x^\star) \leq \frac{L}{2}\|x - x^\star\|_2^2$.

Gradient method 1-13
Co-coercivity of gradient

If $f$ is convex and $\nabla f$ is Lipschitz continuous with parameter $L$, then
$(\nabla f(x) - \nabla f(y))^T(x - y) \geq \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|_2^2$ for all $x, y \in \mathrm{dom}\,f$.
This property is known as co-coercivity of $\nabla f$ (with parameter $1/L$).

Proof: the convex functions
$g(z) = f(z) - \nabla f(x)^T z, \quad h(z) = f(z) - \nabla f(y)^T z$
have Lipschitz continuous gradients, and minimizers $z = x$ and $z = y$, respectively. Therefore, from page 1-13,
$g(y) - g(x) \geq \frac{1}{2L}\|\nabla g(y)\|_2^2, \quad h(x) - h(y) \geq \frac{1}{2L}\|\nabla h(x)\|_2^2$;
combining these two inequalities shows co-coercivity.

Gradient method 1-14
Extension: if in addition $f$ is strongly convex with parameter $m$, then
$g(x) = f(x) - \frac{m}{2}\|x\|_2^2$
is convex and $\nabla g$ is Lipschitz continuous with parameter $L - m$. The co-coercivity property for $g$ gives
$(\nabla f(x) - \nabla f(y))^T(x - y) \geq \frac{mL}{m+L}\|x - y\|_2^2 + \frac{1}{m+L}\|\nabla f(x) - \nabla f(y)\|_2^2$
for all $x, y \in \mathrm{dom}\,f$.

Gradient method 1-15
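The co-coercivity inequality is easy to spot-check numerically on a random convex quadratic (a sketch of mine, not from the slides):

```python
import numpy as np

# Spot-check co-coercivity for f(x) = 0.5 x^T Q x with Q positive semidefinite:
# (grad f(x) - grad f(y))^T (x - y) >= (1/L) ||grad f(x) - grad f(y)||_2^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Q = A.T @ A                          # positive semidefinite Hessian
L = np.linalg.eigvalsh(Q)[-1]        # Lipschitz constant of the gradient
grad = lambda x: Q @ x

x = rng.standard_normal(3)
y = rng.standard_normal(3)
d, gd = x - y, grad(x) - grad(y)
gap = gd @ d - (gd @ gd) / L         # should be >= 0 (up to roundoff)
```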
Analysis of gradient method

$x^{(k)} = x^{(k-1)} - t_k \nabla f(x^{(k-1)})$, $k = 1, 2, \ldots$, with fixed step size or backtracking line search.

Assumptions:
1. $f$ is convex and differentiable with $\mathrm{dom}\,f = \mathbb{R}^n$
2. $\nabla f$ is Lipschitz continuous with parameter $L > 0$
3. the optimal value $f^\star = \inf_x f(x)$ is finite and attained at $x^\star$

Gradient method 1-16
Analysis for constant step size

From the quadratic upper bound (page 1-13) with $y = x - t\nabla f(x)$:
$f(x - t\nabla f(x)) \leq f(x) - t\left(1 - \frac{Lt}{2}\right)\|\nabla f(x)\|_2^2$

Therefore, if $x^+ = x - t\nabla f(x)$ and $0 < t \leq 1/L$,
$f(x^+) \leq f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2$
$\leq f^\star + \nabla f(x)^T(x - x^\star) - \frac{t}{2}\|\nabla f(x)\|_2^2$
$= f^\star + \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x - x^\star - t\nabla f(x)\|_2^2\right)$
$= f^\star + \frac{1}{2t}\left(\|x - x^\star\|_2^2 - \|x^+ - x^\star\|_2^2\right)$

Gradient method 1-17
Take $x = x^{(i-1)}$, $x^+ = x^{(i)}$, $t_i = t$, and add the bounds for $i = 1, \ldots, k$:
$\sum_{i=1}^k \left(f(x^{(i)}) - f^\star\right) \leq \frac{1}{2t}\sum_{i=1}^k \left(\|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2\right)$
$= \frac{1}{2t}\left(\|x^{(0)} - x^\star\|_2^2 - \|x^{(k)} - x^\star\|_2^2\right) \leq \frac{1}{2t}\|x^{(0)} - x^\star\|_2^2$

Since $f(x^{(i)})$ is non-increasing,
$f(x^{(k)}) - f^\star \leq \frac{1}{k}\sum_{i=1}^k \left(f(x^{(i)}) - f^\star\right) \leq \frac{1}{2kt}\|x^{(0)} - x^\star\|_2^2$

Conclusion: the number of iterations to reach $f(x^{(k)}) - f^\star \leq \epsilon$ is $O(1/\epsilon)$.

Gradient method 1-18
Backtracking line search

Initialize $t_k$ at $\hat{t} > 0$ (for example, $\hat{t} = 1$); take $t_k := \beta t_k$ until
$f(x - t_k\nabla f(x)) < f(x) - \alpha t_k\|\nabla f(x)\|_2^2$
[Figure: $f(x - t\nabla f(x))$ versus $t$, against the line $f(x) - \alpha t\|\nabla f(x)\|_2^2$.]
Here $0 < \beta < 1$; we will take $\alpha = 1/2$ (mostly to simplify proofs).

Gradient method 1-19
Analysis for backtracking line search

Line search with $\alpha = 1/2$: if $f$ has a Lipschitz continuous gradient, then for $t \leq 1/L$
$f(x - t\nabla f(x)) \leq f(x) - t\left(1 - \frac{tL}{2}\right)\|\nabla f(x)\|_2^2 \leq f(x) - \frac{t}{2}\|\nabla f(x)\|_2^2$,
so the line search condition holds at $t = 1/L$; the selected step size satisfies
$t_k \geq t_{\min} = \min\{\hat{t}, \beta/L\}$

Gradient method 1-20
Convergence analysis

From page 1-17,
$f(x^{(i)}) \leq f^\star + \frac{1}{2t_i}\left(\|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2\right)$
$\leq f^\star + \frac{1}{2t_{\min}}\left(\|x^{(i-1)} - x^\star\|_2^2 - \|x^{(i)} - x^\star\|_2^2\right)$

Add the upper bounds to get
$f(x^{(k)}) - f^\star \leq \frac{1}{k}\sum_{i=1}^k \left(f(x^{(i)}) - f^\star\right) \leq \frac{1}{2k t_{\min}}\|x^{(0)} - x^\star\|_2^2$

Conclusion: the same $1/k$ bound as with constant step size.

Gradient method 1-21
Analysis for strongly convex functions

Better results exist if we add strong convexity to the assumptions on page 1-16.

Analysis for constant step size: if $x^+ = x - t\nabla f(x)$ and $0 < t \leq 2/(m+L)$:
$\|x^+ - x^\star\|_2^2 = \|x - t\nabla f(x) - x^\star\|_2^2$
$= \|x - x^\star\|_2^2 - 2t\nabla f(x)^T(x - x^\star) + t^2\|\nabla f(x)\|_2^2$
$\leq \left(1 - t\frac{2mL}{m+L}\right)\|x - x^\star\|_2^2 + t\left(t - \frac{2}{m+L}\right)\|\nabla f(x)\|_2^2$
$\leq \left(1 - t\frac{2mL}{m+L}\right)\|x - x^\star\|_2^2$
(the third step follows from the result on page 1-15)

Gradient method 1-22
Distance to optimum:
$\|x^{(k)} - x^\star\|_2^2 \leq c^k\|x^{(0)} - x^\star\|_2^2, \quad c = 1 - t\frac{2mL}{m+L}$,
which implies (linear) convergence; for $t = 2/(m+L)$, we get $c = \left(\frac{\kappa-1}{\kappa+1}\right)^2$ with $\kappa = L/m$.

Bound on function value (from page 1-13):
$f(x^{(k)}) - f^\star \leq \frac{L}{2}\|x^{(k)} - x^\star\|_2^2 \leq c^k\frac{L}{2}\|x^{(0)} - x^\star\|_2^2$

Conclusion: the number of iterations to reach $f(x^{(k)}) - f^\star \leq \epsilon$ is $O(\log(1/\epsilon))$.

Gradient method 1-23
A look at the conditions

For a simple problem, $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$:

Lipschitz continuity of $\nabla f$: this means $\nabla^2 f(x) \preceq LI$. As $\nabla^2 f(\beta) = X^T X$, we have $L = \sigma_{\max}^2(X)$.

Strong convexity of $f$: this means $\nabla^2 f(x) \succeq mI$. As $\nabla^2 f(\beta) = X^T X$, we have $m = \sigma_{\min}^2(X)$.

If $X$ is wide, i.e., $X$ is $n \times p$ with $p > n$, then $\sigma_{\min}(X) = 0$, and $f$ can't be strongly convex. Even if $\sigma_{\min}(X) > 0$, we can have a very large condition number $L/m = \sigma_{\max}^2(X)/\sigma_{\min}^2(X)$.
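These constants can be read off the singular values of $X$; a quick NumPy check (random data of my own choosing, not from the slides):

```python
import numpy as np

# For f(beta) = 0.5 ||y - X beta||_2^2 the Hessian is X^T X, so
#   L = sigma_max(X)^2  and  m = sigma_min(X)^2.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 5))             # tall X, so m > 0 here
sigma = np.linalg.svd(X, compute_uv=False)   # singular values, descending
L, m = sigma[0] ** 2, sigma[-1] ** 2

eigs = np.linalg.eigvalsh(X.T @ X)           # Hessian eigenvalues, ascending
```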
Condition number

$\mathrm{Cond}(A) = \|A\|\,\|A^{-1}\|$; for the $\ell_2$-norm,
$\mathrm{Cond}_2(A) = \|A\|_2\,\|A^{-1}\|_2$.
If $A$ is symmetric, then $\mathrm{Cond}_2(A) = \frac{|\lambda_{\max}(A)|}{|\lambda_{\min}(A)|}$.

Remark 1: $\mathrm{Cond}(A) \geq 1$.
Remark 2: if $A$ is nearly singular ($\det(A) \approx 0$), then $\mathrm{Cond}(A) \gg 1$.
Explanation of condition number

Original problem: $Ax = b$. Perturbed problem: $A(x + \delta x) = b + \delta b$. Subtracting the two equations, we have $\delta x = A^{-1}\delta b$. Taking norms,
$\|\delta x\| \leq \|A^{-1}\|\,\|\delta b\| = \|A^{-1}\|\,\|Ax\|\,\frac{\|\delta b\|}{\|b\|}$.
With the bound $\|Ax\| \leq \|A\|\,\|x\|$, we have
$\frac{\|\delta x\|}{\|x\|} \leq \|A\|\,\|A^{-1}\|\,\frac{\|\delta b\|}{\|b\|} = \mathrm{Cond}(A)\,\frac{\|\delta b\|}{\|b\|}$.
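A small demonstration of this amplification (the nearly singular matrix is my own toy example): a tiny relative perturbation of $b$ moves the solution by up to $\mathrm{Cond}(A)$ times that amount.

```python
import numpy as np

# A nearly singular system: perturbing b slightly changes x a lot,
# with relative amplification bounded by Cond(A).
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])          # exact solution x = (1, 1)
x = np.linalg.solve(A, b)

db = np.array([0.0, 1e-4])           # small right-hand-side perturbation
x_pert = np.linalg.solve(A, b + db)

cond = np.linalg.cond(A)             # ||A||_2 ||A^{-1}||_2
rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
```

Here the relative output change is several orders of magnitude larger than the relative input change, while still obeying the bound rel_out <= cond * rel_in.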
The condition number characterizes stability: if $\mathrm{Cond}(A) \gg 1$, stability is very bad; if $\mathrm{Cond}(A) \approx 1$, stability is good. The lesson we should learn: avoid ill-conditioned problems in applications!
Stability

Loosely speaking, stability indicates how sensitive the solution of a problem may be to small relative changes in the input data. It is often quantified by the condition number of a problem or an algorithm.

Stability of the original problem ("well-posedness"): this refers to the well-posedness of the original problem. A linear system with a high condition number is a typical ill-posed example, called an unstable problem.

Stability of the numerical algorithm: this refers to the conditioning of the algorithm itself. Example: Gaussian elimination without pivoting is unstable for some linear systems.
A function $f$ having a Lipschitz gradient and being strongly convex satisfies
$mI \preceq \nabla^2 f(x) \preceq LI$ for all $x \in \mathbb{R}^n$, for constants $L > m > 0$.
Think of $f$ as being sandwiched between two quadratics.

This may seem like a strong condition to hold globally (for all $x \in \mathbb{R}^n$). But a careful look at the proofs shows that we only need Lipschitz gradients/strong convexity over the sublevel set
$S = \{x : f(x) \leq f(x^{(0)})\}$,
which is less restrictive.
Practicalities

Stopping rule: stop when $\|\nabla f(x)\|_2$ is small. Recall that $\nabla f(x^\star) = 0$ at a solution $x^\star$. If $f$ is strongly convex with parameter $m$, then
$\|\nabla f(x)\|_2 \leq \sqrt{2m\epsilon} \implies f(x) - f^\star \leq \epsilon$.

Pros and cons of gradient descent:
Pro: simple idea, and each iteration is cheap.
Pro: very fast for well-conditioned, strongly convex problems.
Con: often slow, because interesting problems aren't strongly convex or well-conditioned.
Con: can't handle nondifferentiable functions.
Forward stagewise regression

Let's stick with $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$, the linear regression setting; $X$ is $n \times p$, and its columns $X_1, \ldots, X_p$ are predictor variables.

Forward stagewise regression: start with $\beta^{(0)} = 0$ and repeat:
Find the variable $i$ such that $|X_i^T r|$ is largest, where $r = y - X\beta^{(k-1)}$ (largest absolute correlation with the residual).
Update $\beta_i^{(k)} = \beta_i^{(k-1)} + \gamma\,\mathrm{sign}(X_i^T r)$.

Here $\gamma > 0$ is small and fixed, called the learning rate. This looks kind of like gradient descent.
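A sketch of the procedure as stated (the data and parameter choices are mine, not the slides'):

```python
import numpy as np

def forward_stagewise(X, y, gamma=0.01, n_steps=2000):
    """Forward stagewise regression: repeatedly nudge the coefficient
    whose predictor is most correlated with the current residual."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        r = y - X @ beta                   # current residual
        corr = X.T @ r
        i = np.argmax(np.abs(corr))        # largest absolute correlation
        beta[i] += gamma * np.sign(corr[i])
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 3))
beta_true = np.array([2.0, -1.0, 0.0])
y = X @ beta_true                          # noiseless responses
beta_hat = forward_stagewise(X, y)
```

With a small learning rate and enough steps, the estimate settles into a small neighborhood of the least squares solution.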
Steepest descent

A close cousin to gradient descent; just change the choice of norm. Let $p, q$ be complementary (dual): $1/p + 1/q = 1$. Steepest descent updates are $x^+ = x + t\Delta x$, where
$\Delta x = \|\nabla f(x)\|_q \cdot u, \quad u = \mathrm{argmin}_{\|v\|_p \leq 1} \nabla f(x)^T v$.

If $p = 2$, then $\Delta x = -\nabla f(x)$: gradient descent.
If $p = 1$, then $\Delta x = -\frac{\partial f(x)}{\partial x_i} e_i$, where
$\left|\frac{\partial f(x)}{\partial x_i}\right| = \max_{j=1,\ldots,n}\left|\frac{\partial f(x)}{\partial x_j}\right| = \|\nabla f(x)\|_\infty$.

Normalized steepest descent just takes $\Delta x = u$ (a unit-norm step).
An interesting equivalence

Normalized steepest descent with respect to the $\ell_1$ norm: the updates are
$x_i^+ = x_i - t\,\mathrm{sign}\!\left(\frac{\partial f(x)}{\partial x_i}\right)$,
where $i$ is the largest component of $\nabla f(x)$ in absolute value.

Compare forward stagewise: the updates are $\beta_i^+ = \beta_i + \gamma\,\mathrm{sign}(X_i^T r)$, $r = y - X\beta$.

But here $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$, so $\nabla f(\beta) = -X^T(y - X\beta)$ and $\frac{\partial f(\beta)}{\partial \beta_i} = -X_i^T(y - X\beta)$. Hence forward stagewise regression is normalized steepest descent under the $\ell_1$ norm (with fixed step size $t = \gamma$).
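This equivalence can be checked numerically: at any $\beta$, the stagewise update and the normalized $\ell_1$ steepest descent update pick the same coordinate and the same signed step (example data mine):

```python
import numpy as np

# Compare a forward stagewise step with a normalized l1 steepest descent
# step on f(beta) = 0.5 ||y - X beta||_2^2, using t = gamma.
rng = np.random.default_rng(3)
X = rng.standard_normal((30, 4))
y = rng.standard_normal(30)
beta = rng.standard_normal(4)
gamma = 0.05

# forward stagewise step
r = y - X @ beta
i_fs = np.argmax(np.abs(X.T @ r))
step_fs = gamma * np.sign(X[:, i_fs] @ r)

# normalized l1 steepest descent step
grad = -X.T @ (y - X @ beta)              # gradient of f at beta
i_sd = np.argmax(np.abs(grad))
step_sd = -gamma * np.sign(grad[i_sd])    # move against the largest entry
```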
Early stopping and sparse approximation

If we run forward stagewise to completion, we will minimize $f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$, i.e., we will produce a least squares solution. What happens if we stop early? It may seem strange from an optimization perspective (we are "under-optimizing")... but it is interesting from a statistical perspective, because stopping early gives us a sparse approximation to the least squares solution.

Compare the well-known sparse regression estimator, the lasso:
$\min_{\beta \in \mathbb{R}^p} \frac{1}{2}\|y - X\beta\|_2^2$ subject to $\|\beta\|_1 \leq s$.

How do lasso solutions and forward stagewise estimates compare?
[Figure: stagewise path (coordinates of $\beta^{(k)}$) alongside the lasso path (coordinates of $\hat{\beta}(s)$).] For some problems (some $y, X$), they are exactly the same as the learning rate $\gamma \to 0$!
Gradient boosting

Given observations $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ and predictor measurements $x_i \in \mathbb{R}^p$, $i = 1, \ldots, n$, we want to construct a flexible (nonlinear) model for the outcome based on the predictors. Weighted sum of trees:
$\theta_i = \sum_{j=1}^m \beta_j T_j(x_i), \quad i = 1, \ldots, n$.
Each tree $T_j$ inputs the predictor measurements $x_i$ and outputs a prediction. The trees are typically grown pretty short...
Pick a loss function $L$ that reflects the setting; e.g., for continuous $y$, we could take $L(y_i, \theta_i) = (y_i - \theta_i)^2$. We want to solve
$\min_{\beta \in \mathbb{R}^M} \sum_{i=1}^n L\!\left(y_i, \sum_{j=1}^M \beta_j T_j(x_i)\right)$,
where $j$ indexes all trees of a fixed size (e.g., depth = 5), so $M$ is huge. The space is simply too big to optimize over directly.

Gradient boosting: basically a version of gradient descent that is forced to work with trees. First think of the optimization as $\min_\theta f(\theta)$, over predicted values $\theta$ (subject to $\theta$ coming from trees).
Start with an initial model, e.g., fit a single tree $\theta^{(0)} = T_0$. Repeat:
Evaluate the gradient $g$ at the latest prediction $\theta^{(k-1)}$:
$g_i = \left[\frac{\partial L(y_i, \theta_i)}{\partial \theta_i}\right]_{\theta_i = \theta_i^{(k-1)}}, \quad i = 1, \ldots, n$.
Find a tree $T_k$ that is close to $-g$, i.e., $T_k$ solves
$\min_{\text{trees } T} \sum_{i=1}^n (-g_i - T(x_i))^2$;
it is not hard to (approximately) solve this for a single tree.
Update the prediction: $\theta^{(k)} = \theta^{(k-1)} + \alpha_k T_k$.
Note: the predictions are weighted sums of trees, as desired.
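A toy sketch of this loop for squared loss, using depth-1 trees (stumps) on a single predictor; everything here (the stump fitter, data, fixed step size) is my own illustration, not the slides' prescription:

```python
import numpy as np

def fit_stump(x, target):
    """Least-squares depth-1 tree on one feature: best split point and
    two leaf means, minimizing squared error against `target`."""
    best = None
    for s in np.unique(x)[:-1]:            # candidate split points
        left, right = target[x <= s], target[x > s]
        cl, cr = left.mean(), right.mean()
        err = ((left - cl) ** 2).sum() + ((right - cr) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, cl, cr)
    _, s, cl, cr = best
    return lambda z: np.where(z <= s, cl, cr)

# Boosting loop for the loss 0.5 (y - theta)^2: its gradient is -(y - theta),
# so each round fits a stump to the negative gradient, i.e. the residual.
rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 200)
y = np.sin(x)

theta = np.zeros_like(y)
alpha = 0.5                                # fixed step size (shrinkage)
for _ in range(100):
    g = -(y - theta)                       # gradient of the loss at theta
    stump = fit_stump(x, -g)               # fit a stump to -g
    theta = theta + alpha * stump(x)

mse = np.mean((y - theta) ** 2)
```

The prediction is a weighted sum of stumps, and the training error shrinks as rounds accumulate.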
Can we do better?

Recall the $O(1/\epsilon)$ rate for gradient descent over the problem class of convex, differentiable functions with Lipschitz continuous gradients.

First-order method: an iterative method that updates $x^{(k)}$ in
$x^{(0)} + \mathrm{span}\{\nabla f(x^{(0)}), \nabla f(x^{(1)}), \ldots, \nabla f(x^{(k-1)})\}$.

Theorem (Nesterov): for any $k \leq (n-1)/2$ and any starting point $x^{(0)}$, there is a function $f$ in the problem class such that any first-order method satisfies
$f(x^{(k)}) - f^\star \geq \frac{3L\|x^{(0)} - x^\star\|_2^2}{32(k+1)^2}$.

Can we attain the rate $O(1/k^2)$, i.e., $O(1/\sqrt{\epsilon})$ iterations? Answer: yes (and more)!
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More informationA function(al) f is convex if dom f is a convex set, and. f(θx + (1 θ)y) < θf(x) + (1 θ)f(y) f(x) = x 3
Convex functions The domain dom f of a functional f : R N R is the subset of R N where f is well-defined. A function(al) f is convex if dom f is a convex set, and f(θx + (1 θ)y) θf(x) + (1 θ)f(y) for all
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationNOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained
NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume
More informationLecture: Smoothing.
Lecture: Smoothing http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Smoothing 2/26 introduction smoothing via conjugate
More informationHomework 2. Convex Optimization /36-725
Homework 2 Convex Optimization 0-725/36-725 Due Monday October 3 at 5:30pm submitted to Christoph Dann in Gates 803 Remember to a submit separate writeup for each problem, with your name at the top) Total:
More informationConvex Optimization. Prof. Nati Srebro. Lecture 12: Infeasible-Start Newton s Method Interior Point Methods
Convex Optimization Prof. Nati Srebro Lecture 12: Infeasible-Start Newton s Method Interior Point Methods Equality Constrained Optimization f 0 (x) s. t. A R p n, b R p Using access to: 2 nd order oracle
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationLecture 4 - The Gradient Method Objective: find an optimal solution of the problem
Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationConvex Functions. Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK)
Convex Functions Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK) Course on Convex Optimization for Wireless Comm. and Signal Proc. Jointly taught by Daniel P. Palomar and Wing-Kin (Ken) Ma
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More informationMajorization Minimization - the Technique of Surrogate
Majorization Minimization - the Technique of Surrogate Andersen Ang Mathématique et de Recherche opérationnelle Faculté polytechnique de Mons UMONS Mons, Belgium email: manshun.ang@umons.ac.be homepage:
More informationConvex Optimization. Lecture 12 - Equality Constrained Optimization. Instructor: Yuanzhang Xiao. Fall University of Hawaii at Manoa
Convex Optimization Lecture 12 - Equality Constrained Optimization Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 19 Today s Lecture 1 Basic Concepts 2 for Equality Constrained
More informationSubgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives
Subgradients subgradients and quasigradients subgradient calculus optimality conditions via subgradients directional derivatives Prof. S. Boyd, EE392o, Stanford University Basic inequality recall basic
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationThe Steepest Descent Algorithm for Unconstrained Optimization
The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem
More informationLecture 23: November 21
10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationLecture 4 - The Gradient Method Objective: find an optimal solution of the problem
Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...
More informationLecture 7: September 17
10-725: Optimization Fall 2013 Lecture 7: September 17 Lecturer: Ryan Tibshirani Scribes: Serim Park,Yiming Gu 7.1 Recap. The drawbacks of Gradient Methods are: (1) requires f is differentiable; (2) relatively
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationExistence of minimizers
Existence of imizers We have just talked a lot about how to find the imizer of an unconstrained convex optimization problem. We have not talked too much, at least not in concrete mathematical terms, about
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationHomework 3 Conjugate Gradient Descent, Accelerated Gradient Descent Newton, Quasi Newton and Projected Gradient Descent
Homework 3 Conjugate Gradient Descent, Accelerated Gradient Descent Newton, Quasi Newton and Projected Gradient Descent CMU 10-725/36-725: Convex Optimization (Fall 2017) OUT: Sep 29 DUE: Oct 13, 5:00
More informationMATH 829: Introduction to Data Mining and Analysis Computing the lasso solution
1/16 MATH 829: Introduction to Data Mining and Analysis Computing the lasso solution Dominique Guillot Departments of Mathematical Sciences University of Delaware February 26, 2016 Computing the lasso
More informationUnconstrained optimization
Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationCHAPTER 11. A Revision. 1. The Computers and Numbers therein
CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More information8 Numerical methods for unconstrained problems
8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields
More informationLecture 23: Conditional Gradient Method
10-725/36-725: Conve Optimization Spring 2015 Lecture 23: Conditional Gradient Method Lecturer: Ryan Tibshirani Scribes: Shichao Yang,Diyi Yang,Zhanpeng Fang Note: LaTeX template courtesy of UC Berkeley
More informationLecture 9 Sequential unconstrained minimization
S. Boyd EE364 Lecture 9 Sequential unconstrained minimization brief history of SUMT & IP methods logarithmic barrier function central path UMT & SUMT complexity analysis feasibility phase generalized inequalities
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationLECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE
LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationIFT Lecture 2 Basics of convex analysis and gradient descent
IF 6085 - Lecture Basics of convex analysis and gradient descent his version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribes: Assya rofimov,
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More information