Descent methods: min_x f(x)


1 Gradient Descent

2-3 Descent methods

min_x f(x)

(Figure: iterates x_k, x_{k+1}, ... approach a point x* at which ∇f(x*) = 0)

4-10 Gradient methods

Unconstrained optimization: min_{x ∈ R^n} f(x)

Suppose we have a vector x ∈ R^n for which ∇f(x) ≠ 0

Consider the ray: x(α) = x − α∇f(x), for α ≥ 0

Make a first-order Taylor expansion around x:

f(x(α)) = f(x) + ⟨∇f(x), x(α) − x⟩ + o(‖x(α) − x‖₂)
        = f(x) − α‖∇f(x)‖₂² + o(α‖∇f(x)‖₂)
        = f(x) − α‖∇f(x)‖₂² + o(α)

For α near 0, α‖∇f(x)‖₂² dominates o(α), so for positive, sufficiently small α, f(x(α)) is smaller than f(x)

11-13 Descent methods

Carrying the idea further, consider x(α) = x + αd, where the direction d ∈ R^n is obtuse to ∇f(x), i.e., ⟨∇f(x), d⟩ < 0

Again, we have the Taylor expansion

f(x(α)) = f(x) + α⟨∇f(x), d⟩ + o(α),

where α⟨∇f(x), d⟩ dominates o(α) for sufficiently small α

Since d is obtuse to ∇f(x), this implies f(x(α)) < f(x)

15 Gradient descent

Consider unconstrained, smooth convex optimization

min_x f(x),

i.e., f is convex and differentiable with dom(f) = R^n. Denote the optimal criterion value by f* = min_x f(x), and a solution by x*

Gradient descent: choose an initial point x^(0) ∈ R^n and repeat:

x^(k) = x^(k-1) − t_k ∇f(x^(k-1)),   k = 1, 2, 3, ...

Stop at some point
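The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the slides; the function name `grad_descent` and the quadratic test function are hypothetical, and the stopping rule (small gradient norm) anticipates a later slide.

```python
import numpy as np

def grad_descent(grad, x0, t, tol=1e-8, max_iter=10000):
    """Fixed-step gradient descent: x_k = x_{k-1} - t * grad(x_{k-1})."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # "stop at some point": gradient is small
            break
        x = x - t * g
    return x

# Example: f(x) = (10 x1^2 + x2^2) / 2, whose minimizer is the origin
grad = lambda x: np.array([10.0 * x[0], x[1]])
x_star = grad_descent(grad, [1.0, 1.0], t=0.1)   # t = 1/L with L = 10
```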

17 Gradient descent interpretation

At each iteration, consider the expansion

f(y) ≈ f(x) + ∇f(x)^T (y − x) + (1/2t)‖y − x‖₂²

This is a quadratic approximation, replacing the usual Hessian ∇²f(x) by (1/t)·I:

f(x) + ∇f(x)^T (y − x) is the linear approximation to f, and (1/2t)‖y − x‖₂² is a proximity term to x, with weight 1/(2t)

Choose the next point y = x⁺ to minimize the quadratic approximation:

x⁺ = x − t∇f(x)
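Setting the gradient of the quadratic approximation to zero recovers the update, a one-line check:

```latex
q(y) = f(x) + \nabla f(x)^T (y - x) + \tfrac{1}{2t}\|y - x\|_2^2,
\qquad
\nabla q(y) = \nabla f(x) + \tfrac{1}{t}(y - x) = 0
\;\Longrightarrow\;
y = x - t\,\nabla f(x).
```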

18 (Figure: blue point is x, red point is

x⁺ = argmin_y  f(x) + ∇f(x)^T (y − x) + (1/2t)‖y − x‖₂²)

19 How to choose step sizes; convergence analysis; forward stagewise regression, boosting

20 Fixed step size

Simply take t_k = t for all k = 1, 2, 3, ...; this can diverge if t is too big. Consider f(x) = (10x₁² + x₂²)/2; gradient descent after 8 steps:

(Figure: divergent iterates)

21 Can be slow if t is too small. Same example, gradient descent after 100 steps:

(Figure: slowly converging iterates)

22 Same example, gradient descent after 40 appropriately sized steps:

(Figure: converging iterates)

Clearly there is a tradeoff; the convergence analysis later will give us a better idea

23 Backtracking line search

One way to adaptively choose the step size is backtracking line search:

First fix parameters 0 < β < 1 and 0 < α ≤ 1/2

At each iteration, start with t = 1, and while

f(x − t∇f(x)) > f(x) − αt‖∇f(x)‖₂²,

shrink t = βt. Else perform the gradient descent update x⁺ = x − t∇f(x)

Simple and tends to work well in practice (further simplification: just take α = 1/2)
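The backtracking loop can be sketched in Python. A minimal illustration, not from the slides; the name `backtracking_gd` is hypothetical, and the test function is the quadratic example from the earlier slides.

```python
import numpy as np

def backtracking_gd(f, grad, x0, alpha=0.5, beta=0.8, max_iter=500):
    """Gradient descent with backtracking: shrink t by beta until the
    sufficient-decrease condition holds, then take the step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        t = 1.0
        while f(x - t * g) > f(x) - alpha * t * (g @ g):
            t *= beta                 # shrink t = beta * t
        x = x - t * g                 # gradient descent update with accepted t
    return x

f = lambda x: (10 * x[0]**2 + x[1]**2) / 2
grad = lambda x: np.array([10.0 * x[0], x[1]])
x_hat = backtracking_gd(f, grad, [1.0, 1.0])
```

The inner `while` always terminates for a function with Lipschitz gradient, since any t ≤ 1/L satisfies the condition.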

24 Backtracking picks up roughly the right step size (12 outer steps, 40 steps total); here α = β = 0.5

(Figure: iterates under backtracking)

25 Exact line search

We could also choose the step to do the best we can along the direction of the negative gradient, called exact line search:

t = argmin_{s ≥ 0} f(x − s∇f(x))

It is usually not possible to do this minimization exactly. Approximations to exact line search are often not much more efficient than backtracking, and it is usually not worth it

26 Convergence analysis

Assume that f is convex and differentiable with dom(f) = R^n, and additionally

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ for any x, y,

i.e., ∇f is Lipschitz continuous with constant L > 0

Theorem: Gradient descent with fixed step size t ≤ 1/L satisfies

f(x^(k)) − f* ≤ ‖x^(0) − x*‖₂² / (2tk)

We say gradient descent has convergence rate O(1/k), i.e., to get f(x^(k)) − f* ≤ ε, we need O(1/ε) iterations
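The theorem's bound can be checked numerically on the quadratic example from the earlier slides (a sketch, not from the slides; here x* = 0 so ‖x^(0) − x*‖₂² = ‖x^(0)‖₂²):

```python
import numpy as np

# f(x) = (10 x1^2 + x2^2)/2 has Lipschitz gradient with L = 10, minimizer x* = 0
L = 10.0
t = 1.0 / L
x0 = np.array([1.0, 1.0])

f = lambda x: (10 * x[0]**2 + x[1]**2) / 2
grad = lambda x: np.array([10.0 * x[0], x[1]])

x = x0.copy()
for k in range(1, 51):
    x = x - t * grad(x)
    bound = (x0 @ x0) / (2 * t * k)       # ||x0 - x*||_2^2 / (2 t k)
    assert f(x) <= bound                  # the O(1/k) guarantee holds at every k
```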

27 Convergence analysis for backtracking

Same assumptions: f is convex and differentiable with dom(f) = R^n, and ∇f is Lipschitz continuous with constant L > 0

The same rate holds for a step size chosen by backtracking search

Theorem: Gradient descent with backtracking line search satisfies

f(x^(k)) − f* ≤ ‖x^(0) − x*‖₂² / (2 t_min k),

where t_min = min{1, β/L}

If β is not too small, then we don't lose much compared to fixed step size (β/L vs 1/L)

28 Convergence analysis under strong convexity

Reminder: strong convexity of f means that f(x) − (m/2)‖x‖₂² is convex for some m > 0. If f is twice differentiable, this implies ∇²f(x) ⪰ mI for any x

This gives a sharper lower bound than that from usual convexity:

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖₂² for all x, y

Under the Lipschitz assumption as before, and also strong convexity:

Theorem: Gradient descent with fixed step size t ≤ 2/(m + L), or with backtracking line search, satisfies

f(x^(k)) − f* ≤ c^k (L/2)‖x^(0) − x*‖₂²,

where 0 < c < 1

29 I.e., the rate with strong convexity is O(c^k): exponentially fast! To get f(x^(k)) − f* ≤ ε, we need O(log(1/ε)) iterations

Called linear convergence, because it looks linear on a semi-log plot

(Figure: f(x^(k)) − f* vs. k on a log scale, for exact and backtracking line search)

The constant c depends adversely on the condition number L/m (higher condition number means slower rate)
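The geometric contraction can be observed directly on the quadratic example, using the constant c = ((κ−1)/(κ+1))² derived on a later slide for the step size t = 2/(m+L). A sketch under those assumptions, not from the slides:

```python
import numpy as np

# f(x) = (10 x1^2 + x2^2)/2: m = 1, L = 10, minimizer x* = 0
m, L = 1.0, 10.0
t = 2.0 / (m + L)
kappa = L / m
c = ((kappa - 1) / (kappa + 1)) ** 2      # contraction factor per iteration

grad = lambda x: np.array([10.0 * x[0], x[1]])
x = np.array([1.0, 1.0])
err0 = x @ x                              # ||x0 - x*||_2^2

for k in range(1, 31):
    x = x - t * grad(x)
    # distance to optimum contracts geometrically:
    # ||x_k - x*||_2^2 <= c^k ||x0 - x*||_2^2
    assert x @ x <= c ** k * err0 * (1 + 1e-9)
```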

30 Convex function

f is convex if dom f is a convex set and Jensen's inequality holds:

f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y) for all x, y ∈ dom f, θ ∈ [0,1]

First-order condition: for (continuously) differentiable f, Jensen's inequality can be replaced with

f(y) ≥ f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom f

Second-order condition: for twice differentiable f, Jensen's inequality can be replaced with

∇²f(x) ⪰ 0 for all x ∈ dom f

31 Strictly convex function

f is strictly convex if dom f is a convex set and

f(θx + (1−θ)y) < θf(x) + (1−θ)f(y) for all x, y ∈ dom f, x ≠ y, θ ∈ (0,1)

First-order condition (for differentiable f): dom f is convex and

f(y) > f(x) + ∇f(x)^T (y − x) for all x, y ∈ dom f, x ≠ y;

hence the minimizer of f is unique (if a minimizer exists)

Second-order condition: note that ∇²f(x) ≻ 0 is not necessary for strict convexity (cf. f(x) = x⁴)

32 Strongly convex function

f is strongly convex with parameter m > 0 if f(x) − (m/2)‖x‖₂² is convex

Jensen's inequality: f(θx + (1−θ)y) ≤ θf(x) + (1−θ)f(y) − (m/2)θ(1−θ)‖x − y‖₂²

First-order condition: f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖₂² for all x, y ∈ dom f

Second-order condition: ∇²f(x) ⪰ mI for all x ∈ dom f

33 Quadratic lower bound (from the first-order condition)

If f is strongly convex with parameter m, then

f(y) ≥ f(x) + ∇f(x)^T (y − x) + (m/2)‖y − x‖₂² for all x, y ∈ dom f

(Figure: f(y) lies above its quadratic lower bound at (x, f(x)))

If dom f = R^n, this implies f has a unique minimizer x* and

(m/2)‖x − x*‖₂² ≤ f(x) − f(x*) ≤ (1/2m)‖∇f(x)‖₂²

34 Monotonicity of gradient

A differentiable f is convex if and only if dom f is convex and

(∇f(x) − ∇f(y))^T (x − y) ≥ 0 for all x, y ∈ dom f,

i.e., ∇f : R^n → R^n is a monotone mapping

The gradient of a strictly convex function is strictly monotone:

(∇f(x) − ∇f(y))^T (x − y) > 0 for all x, y ∈ dom f, x ≠ y

The gradient of a strongly convex function is strongly monotone, or coercive:

(∇f(x) − ∇f(y))^T (x − y) ≥ m‖x − y‖₂² for all x, y ∈ dom f

35 Proof of the monotonicity property

If f is differentiable and convex, then

f(y) ≥ f(x) + ∇f(x)^T (y − x),   f(x) ≥ f(y) + ∇f(y)^T (x − y);

combining the inequalities gives (∇f(x) − ∇f(y))^T (x − y) ≥ 0

If ∇f is monotone, then g′(t) ≥ g′(0) for t ≥ 0, where

g(t) = f(x + t(y − x)),   g′(t) = ∇f(x + t(y − x))^T (y − x)

Hence

f(y) = g(1) = g(0) + ∫₀¹ g′(t) dt ≥ g(0) + g′(0) = f(x) + ∇f(x)^T (y − x)

36 Functions with Lipschitz continuous gradients

The gradient of f is Lipschitz continuous with parameter L > 0 if

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂ for all x, y ∈ dom f

From the Cauchy-Schwarz inequality, this implies

(∇f(x) − ∇f(y))^T (x − y) ≤ L‖x − y‖₂²,

i.e., Lx − ∇f(x) is monotone; in other words, (L/2)‖x‖₂² − f(x) is convex

For twice differentiable f, this is equivalent to ∇²f(x) ⪯ LI for all x ∈ dom f

37 Quadratic upper bound

If ∇f is Lipschitz continuous with parameter L, then

f(y) ≤ f(x) + ∇f(x)^T (y − x) + (L/2)‖y − x‖₂² for all x, y ∈ dom f

(Figure: f(y) lies below its quadratic upper bound at (x, f(x)))

If dom f = R^n and f has a minimizer x*, this implies

(1/2L)‖∇f(x)‖₂² ≤ f(x) − f(x*) ≤ (L/2)‖x − x*‖₂²

38 Co-coercivity of gradient

If f is convex and ∇f is Lipschitz continuous with parameter L, then

(∇f(x) − ∇f(y))^T (x − y) ≥ (1/L)‖∇f(x) − ∇f(y)‖₂² for all x, y ∈ dom f

This property is known as co-coercivity of ∇f (with parameter 1/L)

Proof: the convex functions

g(z) = f(z) − ∇f(x)^T z,   h(z) = f(z) − ∇f(y)^T z

have Lipschitz-continuous gradients, and minimizers z = x and z = y, respectively. Therefore, from the quadratic upper bound on slide 37,

g(x) ≤ g(y) − (1/2L)‖∇g(y)‖₂²,   h(y) ≤ h(x) − (1/2L)‖∇h(x)‖₂²

Combining these two inequalities shows co-coercivity

39 Extension: if in addition f is strongly convex with parameter m, then

g(x) = f(x) − (m/2)‖x‖₂²

is convex, and ∇g is Lipschitz continuous with parameter L − m

The co-coercivity property for g gives

(∇f(x) − ∇f(y))^T (x − y) ≥ (mL/(m+L))‖x − y‖₂² + (1/(m+L))‖∇f(x) − ∇f(y)‖₂²

for all x, y ∈ dom f

40 Analysis of the gradient method

x^(k) = x^(k-1) − t_k ∇f(x^(k-1)),   k = 1, 2, ...,

with fixed step size or backtracking line search

Assumptions:
1. f is convex and differentiable with dom f = R^n
2. ∇f(x) is Lipschitz continuous with parameter L > 0
3. the optimal value f* = inf_x f(x) is finite and attained at x*

41 Analysis for constant step size

From the quadratic upper bound (slide 37) with y = x − t∇f(x):

f(x − t∇f(x)) ≤ f(x) − t(1 − Lt/2)‖∇f(x)‖₂²

Therefore, if x⁺ = x − t∇f(x) and 0 < t ≤ 1/L,

f(x⁺) ≤ f(x) − (t/2)‖∇f(x)‖₂²
      ≤ f* + ∇f(x)^T (x − x*) − (t/2)‖∇f(x)‖₂²
      = f* + (1/2t)(‖x − x*‖₂² − ‖x − x* − t∇f(x)‖₂²)
      = f* + (1/2t)(‖x − x*‖₂² − ‖x⁺ − x*‖₂²)

42 Take x = x^(i-1), x⁺ = x^(i), t_i = t, and add the bounds for i = 1, ..., k:

Σ_{i=1}^k (f(x^(i)) − f*) ≤ (1/2t) Σ_{i=1}^k (‖x^(i-1) − x*‖₂² − ‖x^(i) − x*‖₂²)
                          = (1/2t)(‖x^(0) − x*‖₂² − ‖x^(k) − x*‖₂²)
                          ≤ (1/2t)‖x^(0) − x*‖₂²

Since f(x^(i)) is non-increasing,

f(x^(k)) − f* ≤ (1/k) Σ_{i=1}^k (f(x^(i)) − f*) ≤ (1/2kt)‖x^(0) − x*‖₂²

Conclusion: the number of iterations to reach f(x^(k)) − f* ≤ ε is O(1/ε)

43 Backtracking line search

Initialize t_k at t̂ > 0 (for example, t̂ = 1); take t_k := βt_k until

f(x − t_k ∇f(x)) < f(x) − αt_k ‖∇f(x)‖₂²

Here 0 < β < 1; we will take α = 1/2 (mostly to simplify proofs)

(Figure: f(x − t∇f(x)) vs. t, with the sufficient-decrease line f(x) − αt‖∇f(x)‖₂²)

44 Analysis for backtracking line search

Line search with α = 1/2: if ∇f is Lipschitz continuous, then

f(x − t∇f(x)) ≤ f(x) − t(1 − tL/2)‖∇f(x)‖₂² ≤ f(x) − (t/2)‖∇f(x)‖₂² for t ≤ 1/L

(Figure: the curves f(x) − t(1 − tL/2)‖∇f(x)‖₂² and f(x) − (t/2)‖∇f(x)‖₂², crossing at t = 1/L)

The selected step size satisfies t_k ≥ t_min = min{t̂, β/L}

45 Convergence analysis

From the constant-step-size analysis (slide 41):

f(x^(i)) ≤ f* + (1/2t_i)(‖x^(i-1) − x*‖₂² − ‖x^(i) − x*‖₂²)
         ≤ f* + (1/2t_min)(‖x^(i-1) − x*‖₂² − ‖x^(i) − x*‖₂²)

Add the upper bounds to get

f(x^(k)) − f* ≤ (1/k) Σ_{i=1}^k (f(x^(i)) − f*) ≤ (1/2k t_min)‖x^(0) − x*‖₂²

Conclusion: the same 1/k bound as with constant step size

46 Analysis for strongly convex functions

Better results exist if we add strong convexity to the assumptions on slide 40

Analysis for constant step size: if x⁺ = x − t∇f(x) and 0 < t ≤ 2/(m+L), then

‖x⁺ − x*‖₂² = ‖x − t∇f(x) − x*‖₂²
            = ‖x − x*‖₂² − 2t∇f(x)^T (x − x*) + t²‖∇f(x)‖₂²
            ≤ (1 − t·2mL/(m+L))‖x − x*‖₂² + t(t − 2/(m+L))‖∇f(x)‖₂²
            ≤ (1 − t·2mL/(m+L))‖x − x*‖₂²

(the third step follows from the co-coercivity result on slide 39)

47 Distance to optimum:

‖x^(k) − x*‖₂² ≤ c^k ‖x^(0) − x*‖₂²,   c = 1 − t·2mL/(m+L)

This implies (linear) convergence; for t = 2/(m+L), we get c = ((κ−1)/(κ+1))² with κ = L/m

Bound on function value (from the quadratic upper bound on slide 37):

f(x^(k)) − f* ≤ (L/2)‖x^(k) − x*‖₂² ≤ c^k (L/2)‖x^(0) − x*‖₂²

Conclusion: the number of iterations to reach f(x^(k)) − f* ≤ ε is O(log(1/ε))

48 A look at the conditions

A look at the conditions for a simple problem: f(β) = (1/2)‖y − Xβ‖₂²

Lipschitz continuity of ∇f: this means ∇²f(β) ⪯ LI; as ∇²f(β) = X^T X, we have L = σ²_max(X)

Strong convexity of f: this means ∇²f(β) ⪰ mI; as ∇²f(β) = X^T X, we have m = σ²_min(X)

If X is wide (i.e., X is n × p with p > n), then σ_min(X) = 0, and f can't be strongly convex

Even if σ_min(X) > 0, we can have a very large condition number L/m = σ²_max(X)/σ²_min(X)
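These constants are easy to compute numerically for a concrete X. A sketch (not from the slides; the random tall matrix is a hypothetical example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))          # tall n x p matrix (n > p): strong convexity holds
s = np.linalg.svd(X, compute_uv=False)
L, m = s.max() ** 2, s.min() ** 2         # L = sigma_max(X)^2, m = sigma_min(X)^2

# These are exactly the extreme eigenvalues of the Hessian X^T X
eig = np.linalg.eigvalsh(X.T @ X)
assert np.isclose(eig.max(), L) and np.isclose(eig.min(), m)
```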

49 Condition number

Cond(A) = ‖A‖·‖A⁻¹‖; for the ℓ₂-norm, Cond₂(A) = ‖A‖₂·‖A⁻¹‖₂

If A is symmetric, Cond₂(A) = |λ_max(A)| / |λ_min(A)|

Remark 1: Cond(A) ≥ 1

Remark 2: if A is singular (det(A) = 0), then A⁻¹ does not exist and Cond(A) is taken to be ∞

50 Explanation of the condition number

Original problem: Ax = b. Perturbed problem: A(x + δx) = b + δb

Subtracting the two equations, we have δx = A⁻¹δb

Taking norms, ‖δx‖ ≤ ‖A⁻¹‖·‖δb‖

With the condition ‖b‖ = ‖Ax‖ ≤ ‖A‖·‖x‖, we have

‖δx‖/‖x‖ ≤ ‖A‖·‖A⁻¹‖·‖δb‖/‖b‖ = Cond(A)·‖δb‖/‖b‖
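The perturbation bound can be demonstrated on a small ill-conditioned system (a sketch, not from the slides; the diagonal matrix is a hypothetical worst-ish case):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 1e-6]])     # ill-conditioned: Cond_2(A) = 1e6
cond = np.linalg.cond(A)

b = np.array([1.0, 1e-6])
x = np.linalg.solve(A, b)                   # x = (1, 1)
db = np.array([0.0, 1e-8])                  # tiny perturbation of b
dx = np.linalg.solve(A, b + db) - x

rel_x = np.linalg.norm(dx) / np.linalg.norm(x)
rel_b = np.linalg.norm(db) / np.linalg.norm(b)
# relative error in x is bounded by Cond(A) times relative error in b,
# and here the amplification is in fact enormous
assert rel_x <= cond * rel_b * (1 + 1e-9)
assert rel_x > 100 * rel_b
```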

51 Condition number

The condition number characterizes stability: if Cond(A) ≫ 1, stability is very bad; if Cond(A) ≈ 1, stability is good

Lesson: we should avoid handling badly conditioned problems directly in applications!

52 Stability

Loosely speaking, stability indicates how sensitive the solution of a problem may be to small relative changes in the input data. It is often quantified by the condition number of a problem or an algorithm

Stability of the original problem ("well-posedness"): a linear system with a high condition number is a typical ill-posed example, called an unstable problem

Stability of a numerical algorithm: this refers to the conditioning of the algorithm itself. Example: Gaussian elimination without pivoting is unstable for some linear systems

53 A function f having a Lipschitz gradient and being strongly convex satisfies

mI ⪯ ∇²f(x) ⪯ LI for all x ∈ R^n, for constants L > m > 0

Think of f as being sandwiched between two quadratics

This may seem like a strong condition to hold globally (for all x ∈ R^n). But a careful look at the proofs shows that we only need Lipschitz gradients/strong convexity over the sublevel set

S = {x : f(x) ≤ f(x^(0))}

This is less restrictive

54 Practicalities

Stopping rule: stop when ‖∇f(x)‖₂ is small (recall that ∇f(x*) = 0 at a solution x*)

If f is strongly convex with parameter m, then

‖∇f(x)‖₂ ≤ √(2mε)  ⟹  f(x) − f* ≤ ε

Pros and cons of gradient descent:
Pro: simple idea, and each iteration is cheap
Pro: very fast for well-conditioned, strongly convex problems
Con: often slow, because interesting problems aren't strongly convex or well-conditioned
Con: can't handle nondifferentiable functions
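The stopping rule follows directly from the gradient bound under strong convexity (slide 33):

```latex
f(x) - f^\star \;\le\; \frac{1}{2m}\,\|\nabla f(x)\|_2^2
\;\le\; \frac{1}{2m}\,(2m\epsilon) \;=\; \epsilon
\qquad \text{whenever } \|\nabla f(x)\|_2 \le \sqrt{2m\epsilon}.
```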

55 Forward stagewise regression

Let's stick with f(β) = (1/2)‖y − Xβ‖₂², the linear regression setting; X is n × p, and its columns X₁, ..., X_p are the predictor variables

Forward stagewise regression: start with β^(0) = 0 and repeat:
Find the variable i such that |X_i^T r| is largest, for r = y − Xβ^(k-1) (largest absolute correlation with the residual)
Update β_i^(k) = β_i^(k-1) + γ·sign(X_i^T r)

Here γ > 0 is small and fixed, called the learning rate. This looks kind of like gradient descent
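The two-step loop above can be sketched in Python. A minimal illustration, not from the slides; the name `forward_stagewise` and the noiseless synthetic data are hypothetical.

```python
import numpy as np

def forward_stagewise(X, y, gamma=0.01, n_steps=2000):
    """Forward stagewise: repeatedly nudge the coefficient most
    correlated with the residual by a small fixed amount gamma."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        r = y - X @ beta               # current residual
        c = X.T @ r                    # correlations with the residual
        i = np.argmax(np.abs(c))       # variable with largest |X_i^T r|
        beta[i] += gamma * np.sign(c[i])
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
beta_true = np.array([2.0, -1.0, 0.0])
y = X @ beta_true                      # noiseless response
beta_hat = forward_stagewise(X, y)
```

Run to completion on a well-posed problem, the iterates hover around the least squares solution, oscillating at a scale set by γ.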

56 Steepest descent

A close cousin to gradient descent; we just change the choice of norm. Let p, q be complementary (dual): 1/p + 1/q = 1

Steepest descent updates are x⁺ = x + t·Δx, where

Δx = ‖∇f(x)‖_q · u,   u = argmin_{‖v‖_p ≤ 1} ∇f(x)^T v

If p = 2, then Δx = −∇f(x): gradient descent

If p = 1, then Δx = −(∂f(x)/∂x_i)·e_i, where

|∂f(x)/∂x_i| = max_{j=1,...,n} |∂f(x)/∂x_j| = ‖∇f(x)‖_∞

Normalized steepest descent just takes Δx = u (a unit-norm direction)

57 An interesting equivalence

Normalized steepest descent with respect to the ℓ₁ norm: updates are

x⁺_i = x_i − t·sign(∂f(x)/∂x_i),

where i is the largest component of ∇f(x) in absolute value

Compare forward stagewise: updates are β⁺_i = β_i + γ·sign(X_i^T r), with r = y − Xβ

But here f(β) = (1/2)‖y − Xβ‖₂², so ∇f(β) = −X^T(y − Xβ) and ∂f(β)/∂β_i = −X_i^T(y − Xβ)

Hence forward stagewise regression is normalized steepest descent under the ℓ₁ norm (with fixed step size t = γ)

58 Early stopping and sparse approximation

If we run forward stagewise to completion, we will minimize f(β) = (1/2)‖y − Xβ‖₂², i.e., we will produce a least squares solution

What happens if we stop early? This may seem strange from an optimization perspective (we are "under-optimizing")... but it is interesting from a statistical perspective, because stopping early gives us a sparse approximation to the least squares solution

Compare the well-known sparse regression estimator, the lasso:

min_{β ∈ R^p} (1/2)‖y − Xβ‖₂² subject to ‖β‖₁ ≤ s

How do lasso solutions and forward stagewise estimates compare?

59 (Figure: stagewise path, coordinates of β^(k), vs. lasso path, coordinates of β̂(s))

For some problems (some y, X), they are exactly the same as the learning rate γ → 0!

60 Gradient boosting

Given observations y = (y₁, ..., y_n) ∈ R^n and predictor measurements x_i ∈ R^p, i = 1, ..., n

We want to construct a flexible (nonlinear) model for the outcome based on the predictors. Weighted sum of trees:

θ_i = Σ_{j=1}^m β_j T_j(x_i),   i = 1, ..., n

Each tree T_j inputs the predictor measurements x_i and outputs a prediction. Trees are typically grown pretty short...

61 Pick a loss function L that reflects the setting; e.g., for continuous y, we could take L(y_i, θ_i) = (y_i − θ_i)²

We want to solve

min_{β ∈ R^M} Σ_{i=1}^n L(y_i, Σ_{j=1}^M β_j T_j(x_i))

Here j indexes all trees of a fixed size (e.g., depth = 5), so M is huge; the space is simply too big to optimize over

Gradient boosting: basically a version of gradient descent that is forced to work with trees

First think of the optimization as min_θ f(θ), over predicted values θ (subject to θ coming from trees)

62 Start with an initial model, e.g., fit a single tree θ^(0) = T₀. Repeat:

Evaluate the gradient g at the latest prediction θ^(k-1):

g_i = [∂L(y_i, θ_i)/∂θ_i] evaluated at θ_i = θ_i^(k-1),   i = 1, ..., n

Find a tree T_k that is close to −g, i.e., T_k solves

min over trees T of Σ_{i=1}^n (−g_i − T(x_i))²

(not hard to approximately solve for a single tree)

Update the prediction: θ^(k) = θ^(k-1) + α_k T_k

Note: the predictions are weighted sums of trees, as desired
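The loop above can be sketched with depth-1 trees (stumps) on a 1-d predictor. A toy illustration, not from the slides: the names `fit_stump` and `gradient_boost` are hypothetical, squared loss is used so the negative gradient is just the residual, and the step size α_k is a fixed learning rate.

```python
import numpy as np

def fit_stump(x, g):
    """Fit a depth-1 regression tree (stump) to targets g on a 1-d
    predictor x, by exhaustive search over split points."""
    best_err, best = np.inf, None
    for s in x:
        left, right = g[x <= s], g[x > s]
        if len(right) == 0:
            continue
        pred = np.where(x <= s, left.mean(), right.mean())
        err = np.sum((g - pred) ** 2)
        if err < best_err:
            best_err, best = err, (s, left.mean(), right.mean())
    s, lv, rv = best
    return lambda z: np.where(z <= s, lv, rv)

def gradient_boost(x, y, n_trees=100, lr=0.1):
    """Gradient boosting for L(y, theta) = (y - theta)^2 / 2:
    each stump is fit to the negative gradient, i.e. the residual."""
    theta = np.full(len(y), y.mean())     # initial model: constant fit
    for _ in range(n_trees):
        neg_grad = y - theta              # -g at the current prediction
        tree = fit_stump(x, neg_grad)
        theta = theta + lr * tree(x)      # theta^(k) = theta^(k-1) + alpha_k T_k
    return theta

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, 200))
y = np.sin(x)
theta = gradient_boost(x, y)
mse = np.mean((theta - y) ** 2)
```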

63 Can we do better?

Recall the O(1/ε) rate for gradient descent over the problem class of convex, differentiable functions with Lipschitz continuous gradients

First-order method: an iterative method that updates x^(k) in

x^(0) + span{∇f(x^(0)), ∇f(x^(1)), ..., ∇f(x^(k-1))}

Theorem (Nesterov): For any k ≤ (n−1)/2 and any starting point x^(0), there is a function f in the problem class such that any first-order method satisfies

f(x^(k)) − f* ≥ 3L‖x^(0) − x*‖₂² / (32(k+1)²)

Can we attain the rate O(1/k²), i.e., O(1/√ε)? Answer: yes (and more)!

Gradient Descent. Ryan Tibshirani. Convex Optimization 10-725/36-725

Gradient method. L. Vandenberghe. EE236C

More information

Adaptive Restarting for First Order Optimization Methods

Adaptive Restarting for First Order Optimization Methods Adaptive Restarting for First Order Optimization Methods Nesterov method for smooth convex optimization adpative restarting schemes step-size insensitivity extension to non-smooth optimization continuation

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Optimization Tutorial 1. Basic Gradient Descent

Optimization Tutorial 1. Basic Gradient Descent E0 270 Machine Learning Jan 16, 2015 Optimization Tutorial 1 Basic Gradient Descent Lecture by Harikrishna Narasimhan Note: This tutorial shall assume background in elementary calculus and linear algebra.

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725 Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple

More information

3. Convex functions. basic properties and examples. operations that preserve convexity. the conjugate function. quasiconvex functions

3. Convex functions. basic properties and examples. operations that preserve convexity. the conjugate function. quasiconvex functions 3. Convex functions Convex Optimization Boyd & Vandenberghe basic properties and examples operations that preserve convexity the conjugate function quasiconvex functions log-concave and log-convex functions

More information

Lecture 14: Newton s Method

Lecture 14: Newton s Method 10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Nonlinear Optimization for Optimal Control

Nonlinear Optimization for Optimal Control Nonlinear Optimization for Optimal Control Pieter Abbeel UC Berkeley EECS Many slides and figures adapted from Stephen Boyd [optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9 11 [optional]

More information

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique

Master 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some

More information

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /

Coordinate descent. Geoff Gordon & Ryan Tibshirani Optimization / Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725

Subgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725 Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:

More information

MATH 680 Fall November 27, Homework 3

MATH 680 Fall November 27, Homework 3 MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and

More information

A function(al) f is convex if dom f is a convex set, and. f(θx + (1 θ)y) < θf(x) + (1 θ)f(y) f(x) = x 3

A function(al) f is convex if dom f is a convex set, and. f(θx + (1 θ)y) < θf(x) + (1 θ)f(y) f(x) = x 3 Convex functions The domain dom f of a functional f : R N R is the subset of R N where f is well-defined. A function(al) f is convex if dom f is a convex set, and f(θx + (1 θ)y) θf(x) + (1 θ)f(y) for all

More information

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples

Agenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Lecture: Smoothing.

Lecture: Smoothing. Lecture: Smoothing http://bicmr.pku.edu.cn/~wenzw/opt-2018-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghe s lecture notes Smoothing 2/26 introduction smoothing via conjugate

More information

Homework 2. Convex Optimization /36-725

Homework 2. Convex Optimization /36-725 Homework 2 Convex Optimization 0-725/36-725 Due Monday October 3 at 5:30pm submitted to Christoph Dann in Gates 803 Remember to a submit separate writeup for each problem, with your name at the top) Total:

More information

Convex Optimization. Prof. Nati Srebro. Lecture 12: Infeasible-Start Newton s Method Interior Point Methods

Convex Optimization. Prof. Nati Srebro. Lecture 12: Infeasible-Start Newton s Method Interior Point Methods Convex Optimization Prof. Nati Srebro Lecture 12: Infeasible-Start Newton s Method Interior Point Methods Equality Constrained Optimization f 0 (x) s. t. A R p n, b R p Using access to: 2 nd order oracle

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

Convex Functions. Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK)

Convex Functions. Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK) Convex Functions Wing-Kin (Ken) Ma The Chinese University of Hong Kong (CUHK) Course on Convex Optimization for Wireless Comm. and Signal Proc. Jointly taught by Daniel P. Palomar and Wing-Kin (Ken) Ma

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information

Majorization Minimization - the Technique of Surrogate

Majorization Minimization - the Technique of Surrogate Majorization Minimization - the Technique of Surrogate Andersen Ang Mathématique et de Recherche opérationnelle Faculté polytechnique de Mons UMONS Mons, Belgium email: manshun.ang@umons.ac.be homepage:

More information

Convex Optimization. Lecture 12 - Equality Constrained Optimization. Instructor: Yuanzhang Xiao. Fall University of Hawaii at Manoa

Convex Optimization. Lecture 12 - Equality Constrained Optimization. Instructor: Yuanzhang Xiao. Fall University of Hawaii at Manoa Convex Optimization Lecture 12 - Equality Constrained Optimization Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 19 Today s Lecture 1 Basic Concepts 2 for Equality Constrained

More information

Subgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives

Subgradients. subgradients and quasigradients. subgradient calculus. optimality conditions via subgradients. directional derivatives Subgradients subgradients and quasigradients subgradient calculus optimality conditions via subgradients directional derivatives Prof. S. Boyd, EE392o, Stanford University Basic inequality recall basic

More information

Accelerated Proximal Gradient Methods for Convex Optimization

Accelerated Proximal Gradient Methods for Convex Optimization Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS

More information

The Steepest Descent Algorithm for Unconstrained Optimization

The Steepest Descent Algorithm for Unconstrained Optimization The Steepest Descent Algorithm for Unconstrained Optimization Robert M. Freund February, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 1 Steepest Descent Algorithm The problem

More information

Lecture 23: November 21

Lecture 23: November 21 10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem

Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem Lecture 4 - The Gradient Method Objective: find an optimal solution of the problem min{f (x) : x R n }. The iterative algorithms that we will consider are of the form x k+1 = x k + t k d k, k = 0, 1,...

More information

Lecture 7: September 17

Lecture 7: September 17 10-725: Optimization Fall 2013 Lecture 7: September 17 Lecturer: Ryan Tibshirani Scribes: Serim Park,Yiming Gu 7.1 Recap. The drawbacks of Gradient Methods are: (1) requires f is differentiable; (2) relatively

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

Existence of minimizers

Existence of minimizers Existence of imizers We have just talked a lot about how to find the imizer of an unconstrained convex optimization problem. We have not talked too much, at least not in concrete mathematical terms, about

More information

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization /

Uses of duality. Geoff Gordon & Ryan Tibshirani Optimization / Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear

More information

Homework 3 Conjugate Gradient Descent, Accelerated Gradient Descent Newton, Quasi Newton and Projected Gradient Descent

Homework 3 Conjugate Gradient Descent, Accelerated Gradient Descent Newton, Quasi Newton and Projected Gradient Descent Homework 3 Conjugate Gradient Descent, Accelerated Gradient Descent Newton, Quasi Newton and Projected Gradient Descent CMU 10-725/36-725: Convex Optimization (Fall 2017) OUT: Sep 29 DUE: Oct 13, 5:00

More information

MATH 829: Introduction to Data Mining and Analysis Computing the lasso solution

MATH 829: Introduction to Data Mining and Analysis Computing the lasso solution 1/16 MATH 829: Introduction to Data Mining and Analysis Computing the lasso solution Dominique Guillot Departments of Mathematical Sciences University of Delaware February 26, 2016 Computing the lasso

More information

Unconstrained optimization

Unconstrained optimization Chapter 4 Unconstrained optimization An unconstrained optimization problem takes the form min x Rnf(x) (4.1) for a target functional (also called objective function) f : R n R. In this chapter and throughout

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Lasso: Algorithms and Extensions

Lasso: Algorithms and Extensions ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

CHAPTER 11. A Revision. 1. The Computers and Numbers therein

CHAPTER 11. A Revision. 1. The Computers and Numbers therein CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of

More information

Math 273a: Optimization Subgradient Methods

Math 273a: Optimization Subgradient Methods Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R

More information

8 Numerical methods for unconstrained problems

8 Numerical methods for unconstrained problems 8 Numerical methods for unconstrained problems Optimization is one of the important fields in numerical computation, beside solving differential equations and linear systems. We can see that these fields

More information

Lecture 23: Conditional Gradient Method

Lecture 23: Conditional Gradient Method 10-725/36-725: Conve Optimization Spring 2015 Lecture 23: Conditional Gradient Method Lecturer: Ryan Tibshirani Scribes: Shichao Yang,Diyi Yang,Zhanpeng Fang Note: LaTeX template courtesy of UC Berkeley

More information

Lecture 9 Sequential unconstrained minimization

Lecture 9 Sequential unconstrained minimization S. Boyd EE364 Lecture 9 Sequential unconstrained minimization brief history of SUMT & IP methods logarithmic barrier function central path UMT & SUMT complexity analysis feasibility phase generalized inequalities

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE

LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

IFT Lecture 2 Basics of convex analysis and gradient descent

IFT Lecture 2 Basics of convex analysis and gradient descent IF 6085 - Lecture Basics of convex analysis and gradient descent his version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribes: Assya rofimov,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers

More information

Sparse Approximation and Variable Selection

Sparse Approximation and Variable Selection Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation

More information

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization

Numerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725 Consider Last time: proximal Newton method min x g(x) + h(x) where g, h convex, g twice differentiable, and h simple. Proximal

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information