Lecture 2 February 25th

Statistical machine learning and convex optimization, 2016

Lecture 2 — February 25th

Lecturer: Francis Bach. Scribes: Guillaume Maillard, Nicolas Brosse.

This lecture deals with classical methods for convex optimization. For a convex function, a local minimum is a global minimum, and uniqueness is assured in the case of strict convexity. In the sequel, $g$ is a convex function on $\mathbb{R}^d$. The aim is to find one (or the) minimum $\theta^* \in \mathbb{R}^d$ and the value of the function at this minimum, $g(\theta^*)$. Some key additional properties will be assumed for $g$:

- Lipschitz continuity: $\forall \theta \in \mathbb{R}^d$, $\|\theta\| \leq D \Rightarrow \|g'(\theta)\| \leq B$.
- Smoothness: $\forall (\theta_1, \theta_2) \in (\mathbb{R}^d)^2$, $\|g'(\theta_1) - g'(\theta_2)\| \leq L \|\theta_1 - \theta_2\|$.
- Strong convexity: $\forall (\theta_1, \theta_2) \in (\mathbb{R}^d)^2$, $g(\theta_1) \geq g(\theta_2) + g'(\theta_2)^\top(\theta_1 - \theta_2) + \frac{\mu}{2}\|\theta_1 - \theta_2\|^2$.

We refer to Lecture 1 of this course for additional information on these properties. We point out two key references: [1], [2].

2.1 Smooth optimization

If $g : \mathbb{R}^d \to \mathbb{R}$ is a convex $L$-smooth function, we recall that for all $\theta, \eta \in \mathbb{R}^d$: $\|g'(\theta) - g'(\eta)\| \leq L\|\theta - \eta\|$. Besides, if $g$ is twice differentiable, this is equivalent to $0 \preceq g''(\theta) \preceq L\, I$.

Proposition 2.1 (Properties of smooth convex functions) Let $g : \mathbb{R}^d \to \mathbb{R}$ be a convex $L$-smooth function. Then we have the following inequalities:

1. Quadratic upper bound: $0 \leq g(\theta) - g(\eta) - g'(\eta)^\top(\theta - \eta) \leq \frac{L}{2}\|\theta - \eta\|^2$
2. Co-coercivity: $\frac{1}{L}\|g'(\theta) - g'(\eta)\|^2 \leq \big[g'(\theta) - g'(\eta)\big]^\top(\theta - \eta)$
3. Lower bound: $g(\theta) \geq g(\eta) + g'(\eta)^\top(\theta - \eta) + \frac{1}{2L}\|g'(\theta) - g'(\eta)\|^2$
4. Distance to optimum: $g(\theta) - g(\theta^*) \leq g'(\theta)^\top(\theta - \theta^*)$
5. If $g$ is also $\mu$-strongly convex, then $g(\theta) \leq g(\eta) + g'(\eta)^\top(\theta - \eta) + \frac{1}{2\mu}\|g'(\theta) - g'(\eta)\|^2$

6. If $g$ is also $\mu$-strongly convex, another distance to optimum: $g(\theta) - g(\theta^*) \leq \frac{1}{2\mu}\|g'(\theta)\|^2$

Proof
(1) is a simple application of the Taylor expansion with integral remainder.
(3) We define $h(\theta) = g(\theta) - \theta^\top g'(\eta)$, which is convex with a global minimum at $\eta$. Then:
$$h(\eta) \leq h\big(\theta - \tfrac{1}{L}h'(\theta)\big) \leq h(\theta) + h'(\theta)^\top\big(-\tfrac{1}{L}h'(\theta)\big) + \tfrac{L}{2}\big\|\tfrac{1}{L}h'(\theta)\big\|^2 = h(\theta) - \tfrac{1}{2L}\|h'(\theta)\|^2.$$
Thus $g(\eta) - \eta^\top g'(\eta) \leq g(\theta) - \theta^\top g'(\eta) - \frac{1}{2L}\|g'(\theta) - g'(\eta)\|^2$.
(2) Apply (3) twice, for $(\eta, \theta)$ and $(\theta, \eta)$, and sum to get $0 \geq \big[g'(\eta) - g'(\theta)\big]^\top(\theta - \eta) + \frac{1}{L}\|g'(\theta) - g'(\eta)\|^2$.
(4) is immediate from the definition of a convex function.
(5) We define $h(\theta) = g(\theta) - \theta^\top g'(\eta)$, which is convex with a global minimum at $\eta$:
$$h(\eta) = \min_\zeta h(\zeta) \geq \min_\zeta \Big\{ h(\theta) + h'(\theta)^\top(\zeta - \theta) + \tfrac{\mu}{2}\|\zeta - \theta\|^2 \Big\}.$$
The min is attained for $\zeta - \theta = -\frac{1}{\mu}h'(\theta)$. This leads to $h(\eta) \geq h(\theta) - \frac{1}{2\mu}\|h'(\theta)\|^2$. Hence $g(\eta) - \eta^\top g'(\eta) \geq g(\theta) - \theta^\top g'(\eta) - \frac{1}{2\mu}\|g'(\eta) - g'(\theta)\|^2$.
(6) Put $\eta = \theta^*$, where $\theta^*$ is a global minimizer of $g$, in (5).

Note 2.1.1 The proofs are simple with second-order derivatives. For (5) and (6), no smoothness assumption is needed.

2.1.1 Smooth gradient descent

Proposition 2.2 (Smooth gradient descent) Let $g$ be an $L$-smooth convex function, and $\theta^*$ be one (or the) minimum of $g$. For the following algorithm,
$$\theta_t = \theta_{t-1} - \tfrac{1}{L}\, g'(\theta_{t-1}),$$
we have the bounds:
1. if $g$ is $\mu$-strongly convex, $g(\theta_t) - g(\theta^*) \leq (1 - \mu/L)^t \big[g(\theta_0) - g(\theta^*)\big]$;

Figure 2.1. Level lines of convex functions. On the left (large $\mu$), the best case for gradient descent. On the right (small $\mu$), the algorithm tends to take small steps and is thus slower.

2. if $g$ is only $L$-smooth, $g(\theta_t) - g(\theta^*) \leq \dfrac{2L\|\theta_0 - \theta^*\|^2}{t+4}$.

Note 2.1.2
(step-size) In that case, the step-size ($1/L$) is constant. A line search is also possible (see, e.g., en.wikipedia.org/wiki/Line_search), which consists of optimizing the step at each iteration in order to find the minimum of $g$ along the direction $-g'(\theta_t)$. However, it can be time-consuming and should be used carefully (in particular, there is no need to be very precise in each line search).
(robustness) For the same algorithm, we obtain different bounds depending on the assumptions that we make on $g$. It is important to notice that the algorithm is identical in all cases and adapts to the difficulty of the situation.
(lower bounds) These are not the best possible convergence rates after $O(d)$ iterations.

Simple direct proof for quadratic convex functions. In order to illustrate these results, we study the case of quadratic convex functions: $g(\theta) = \frac{1}{2}\theta^\top H\theta - c^\top\theta$, where $H$ is a symmetric positive semi-definite matrix. We denote by $\mu$ and $L$ respectively the smallest and the largest eigenvalues of $H$. The global optimum is easily derived: $\theta^* = H^{-1}c$ (or $H^\dagger c$, where $H^\dagger$ is the pseudo-inverse of $H$). Let us compute explicitly the gradient descent iteration:
$$\theta_t = \theta_{t-1} - \tfrac{1}{L}(H\theta_{t-1} - c) = \theta_{t-1} - \tfrac{1}{L}(H\theta_{t-1} - H\theta^*),$$
$$\theta_t - \theta^* = \big(I - \tfrac{1}{L}H\big)(\theta_{t-1} - \theta^*) = \big(I - \tfrac{1}{L}H\big)^t(\theta_0 - \theta^*).$$
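To make the iteration concrete, here is a minimal numerical sketch of the constant-step gradient descent above, illustrated on a quadratic function. The test matrix, vector, seed, and iteration count are arbitrary choices, not from the lecture.

```python
import numpy as np

def gradient_descent(grad, theta0, L, n_iters):
    # theta_t = theta_{t-1} - (1/L) g'(theta_{t-1})
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = theta - grad(theta) / L
    return theta

# Illustration on g(theta) = 1/2 theta' H theta - c' theta (synthetic H, c)
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 5))
H = A.T @ A                        # symmetric positive definite (with high probability)
c = rng.standard_normal(5)
L = np.linalg.eigvalsh(H).max()    # smoothness constant = largest eigenvalue
theta_star = np.linalg.solve(H, c)
theta = gradient_descent(lambda th: H @ th - c, np.zeros(5), L, 200)
print(np.linalg.norm(theta - theta_star))   # decays as (1 - mu/L)^t
```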

Case 1: We assume strong convexity of $g$. Then $\mu$ is positive ($\mu > 0$) and the eigenvalues of $(I - \frac{1}{L}H)^t$ are in $[0, (1 - \mu/L)^t]$. We obtain an exponential decrease towards 0 and deduce convergence of the iterates:
$$\|\theta_t - \theta^*\|^2 \leq (1 - \mu/L)^{2t}\,\|\theta_0 - \theta^*\|^2.$$
To deal with the function values $g(\theta_t) - g(\theta^*)$, we notice that $g(\theta_t) - g(\theta^*) = \frac{1}{2}(\theta_t - \theta^*)^\top H(\theta_t - \theta^*)$, using a Taylor expansion at order 2, or the fact that $\theta^* = H^{-1}c$. We thus get:
$$g(\theta_t) - g(\theta^*) \leq (1 - \mu/L)^{2t}\big[g(\theta_0) - g(\theta^*)\big].$$

Case 2: $g$ is only $L$-smooth convex, that is, $\mu$ equals 0, and the eigenvalues of $(I - \frac{1}{L}H)^t$ are in $[0, 1]$. We do not have convergence of the iterates: $\|\theta_t - \theta^*\| \leq \|\theta_0 - \theta^*\|$, but we still have $g(\theta_t) - g(\theta^*) = \frac{1}{2}(\theta_t - \theta^*)^\top H(\theta_t - \theta^*)$, from which we deduce, by a decomposition along the eigenspaces:
$$g(\theta_t) - g(\theta^*) \leq \tfrac{1}{2}\max_{v \in [0, L]} v(1 - v/L)^{2t}\,\|\theta_0 - \theta^*\|^2.$$
For $v \in [0, L]$ and $t \in \mathbb{N}$, since $1 - x \leq e^{-x}$ and $\sup_{x \geq 0} x e^{-x} < \infty$, we have:
$$v(1 - v/L)^{2t} = \tfrac{L}{2t}\Big[\tfrac{2tv}{L}\Big](1 - v/L)^{2t} \leq \tfrac{L}{2t}\cdot\tfrac{2tv}{L}\exp\big(-\tfrac{2tv}{L}\big) \leq C\,\tfrac{L}{t},$$
and we get (up to a constant): $g(\theta_t) - g(\theta^*) \leq C\,\frac{L}{t}\,\|\theta_0 - \theta^*\|^2$.

Proof (of Proposition 2.2)
(1) One iteration of the algorithm consists of $\theta_t = \theta_{t-1} - \gamma g'(\theta_{t-1})$ with $\gamma = 1/L$. Let us compute the function value $g(\theta_t)$:
$$g(\theta_t) = g\big[\theta_{t-1} - \gamma g'(\theta_{t-1})\big] \leq g(\theta_{t-1}) + g'(\theta_{t-1})^\top\big[-\gamma g'(\theta_{t-1})\big] + \tfrac{L}{2}\|\gamma g'(\theta_{t-1})\|^2$$
$$= g(\theta_{t-1}) - \gamma(1 - \gamma L/2)\|g'(\theta_{t-1})\|^2 = g(\theta_{t-1}) - \tfrac{1}{2L}\|g'(\theta_{t-1})\|^2 \quad \text{if } \gamma = 1/L,$$
$$\leq g(\theta_{t-1}) - \tfrac{\mu}{L}\big[g(\theta_{t-1}) - g(\theta^*)\big] \quad \text{using the strongly convex distance to optimum.}$$
Thus, $g(\theta_t) - g(\theta^*) \leq (1 - \mu/L)^t\big[g(\theta_0) - g(\theta^*)\big]$.
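The $O(L/t)$ behavior in Case 2 can be checked numerically with a rank-deficient $H$ (so that $\mu = 0$). This is a sketch on arbitrary synthetic data, comparing function values to the bound $2L\|\theta_0 - \theta^*\|^2/(t+4)$ of Proposition 2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 8))     # rank 3 < d = 8, so the smallest eigenvalue is 0
H = A.T @ A
c = H @ rng.standard_normal(8)      # keep c in the range of H so a minimizer exists
L = np.linalg.eigvalsh(H).max()
theta_star = np.linalg.pinv(H) @ c  # pseudo-inverse solution
g = lambda th: 0.5 * th @ H @ th - c @ th
theta = np.zeros(8)
for t in range(1, 1001):
    theta -= (H @ theta - c) / L
    if t in (10, 100, 1000):
        bound = 2 * L * np.linalg.norm(theta_star) ** 2 / (t + 4)  # theta_0 = 0
        print(t, g(theta) - g(theta_star), bound)
```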

We may also get [1, p. 70, Thm. 2.1.15]:
$$\|\theta_t - \theta^*\|^2 \leq \Big(1 - \tfrac{2\gamma\mu L}{\mu + L}\Big)^t\|\theta_0 - \theta^*\|^2 \quad \text{as soon as } \gamma \leq \tfrac{2}{\mu + L}.$$

(2) One iteration of the algorithm consists of $\theta_t = \theta_{t-1} - \gamma g'(\theta_{t-1})$ with $\gamma = 1/L$. First, we bound the iterates:
$$\|\theta_t - \theta^*\|^2 = \|\theta_{t-1} - \theta^* - \gamma g'(\theta_{t-1})\|^2 = \|\theta_{t-1} - \theta^*\|^2 + \gamma^2\|g'(\theta_{t-1})\|^2 - 2\gamma(\theta_{t-1} - \theta^*)^\top g'(\theta_{t-1})$$
$$\leq \|\theta_{t-1} - \theta^*\|^2 + \gamma^2\|g'(\theta_{t-1})\|^2 - \tfrac{2\gamma}{L}\|g'(\theta_{t-1})\|^2 \quad \text{(using co-coercivity)}$$
$$= \|\theta_{t-1} - \theta^*\|^2 - \gamma(2/L - \gamma)\|g'(\theta_{t-1})\|^2 \leq \|\theta_{t-1} - \theta^*\|^2 \quad \text{if } \gamma \leq 2/L$$
$$\leq \|\theta_0 - \theta^*\|^2.$$
Using one equality in the proof of (1), we get $g(\theta_t) \leq g(\theta_{t-1}) - \frac{1}{2L}\|g'(\theta_{t-1})\|^2$. By the definition of convexity and Cauchy-Schwarz, we have:
$$g(\theta_{t-1}) - g(\theta^*) \leq g'(\theta_{t-1})^\top(\theta_{t-1} - \theta^*) \leq \|g'(\theta_{t-1})\|\,\|\theta_{t-1} - \theta^*\|.$$
And finally, putting things together:
$$g(\theta_t) - g(\theta^*) \leq g(\theta_{t-1}) - g(\theta^*) - \tfrac{1}{2L\|\theta_0 - \theta^*\|^2}\big[g(\theta_{t-1}) - g(\theta^*)\big]^2.$$
We then have our Lyapunov function, a quantity that decreases as the number of iterates increases. Denote by $\alpha$ the quantity $\alpha = \frac{1}{2L\|\theta_0 - \theta^*\|^2}$, and set $\Delta_k = g(\theta_k) - g(\theta^*) \geq 0$. We then have $\Delta_k \leq \Delta_{k-1} - \alpha\Delta_{k-1}^2$, and therefore:
$$\tfrac{1}{\Delta_{k-1}} \leq \tfrac{1}{\Delta_k} - \alpha\tfrac{\Delta_{k-1}}{\Delta_k} \leq \tfrac{1}{\Delta_k} - \alpha \quad \text{by dividing by } \Delta_k\Delta_{k-1}, \text{ because } (\Delta_k) \text{ is non-increasing;}$$
$$\tfrac{1}{\Delta_t} \geq \tfrac{1}{\Delta_0} + \alpha t \quad \text{by summing from } k = 1 \text{ to } t;$$
$$\Delta_t \leq \tfrac{\Delta_0}{1 + \alpha t\Delta_0} \leq \tfrac{2L\|\theta_0 - \theta^*\|^2}{t + 4} \quad \text{by inverting, since } \Delta_0 \leq \tfrac{L}{2}\|\theta_0 - \theta^*\|^2 \text{ and } \alpha = \tfrac{1}{2L\|\theta_0 - \theta^*\|^2}.$$

2.1.2 Accelerated gradient methods

A natural question arising from the results of this proposition is: can we do better? Let us state a lower bound for first-order methods.

Definition 2.3 (First-order method) A first-order method is any iterative algorithm that selects $\theta_t$ in $\theta_0 + \mathrm{span}\big(g'(\theta_0), \ldots, g'(\theta_{t-1})\big)$.

Theorem 2.4 (Lower bound for gradient descent) For every integer $t \leq (d-1)/2$ and every $\theta_0$, there exist convex $L$-smooth functions with a global minimizer $\theta^*$ such that for any first-order method,
$$g(\theta_t) - g(\theta^*) \geq \frac{3L\|\theta_0 - \theta^*\|^2}{32(t+1)^2}.$$

Note 2.1.3 $t \leq (d-1)/2$ is a strong assumption. It means that the number of iterations should not exceed half the dimension. The $O(1/t)$ rate of the gradient method may therefore not be optimal!

Proof [sketch] We refer to [1, p. 58, Section 2.1.2] for the complete proof. We let the reader check the following facts as an exercise. Define the quadratic function
$$g_t(\theta) = \frac{L}{8}\Big[(\theta^{(1)})^2 + \sum_{i=1}^{t-1}\big(\theta^{(i)} - \theta^{(i+1)}\big)^2 + (\theta^{(t)})^2 - 2\theta^{(1)}\Big].$$
Fact 1: $g_t$ is $L$-smooth.
Fact 2: its minimizer is supported by the first $t$ coordinates (closed form).
Fact 3: any first-order method starting from zero will be supported in the first $k$ coordinates after iteration $k$.
Fact 4: the minimum over the support $\{1, \ldots, k\}$ may be computed in closed form.
Given iteration $k$, take $g = g_{2k+1}$ and compute a lower bound on $\big[g(\theta_k) - g(\theta^*)\big]/\|\theta_0 - \theta^*\|^2$.

We now turn to methods that attain the lower bound stated in Theorem 2.4. They are called accelerated gradient methods [3].

Theorem 2.5 (Accelerated gradient descent) Let $g$ be a convex function with $L$-Lipschitz-continuous gradient, whose minimum is attained at $\theta^*$. For the following algorithm,
$$\theta_t = \eta_{t-1} - \tfrac{1}{L}\, g'(\eta_{t-1}), \qquad \eta_t = \theta_t + \tfrac{t-1}{t+2}\,(\theta_t - \theta_{t-1}),$$
we have the bound:
$$g(\theta_t) - g(\theta^*) \leq \frac{2L\|\theta_0 - \theta^*\|^2}{(t+1)^2}.$$

Note 2.1.4 For a ten-line proof, see [4]. This rate is not improvable (i.e., the scaling in $t$ cannot be better than $O(1/t^2)$ over all similar problems).

Theorem 2.6 (Accelerated gradient descent, strongly convex case) Let $g$ be a $\mu$-strongly convex function with $L$-Lipschitz-continuous gradient, whose minimum is attained at $\theta^*$. For the following algorithm,
$$\theta_t = \eta_{t-1} - \tfrac{1}{L}\, g'(\eta_{t-1}), \qquad \eta_t = \theta_t + \frac{1 - \sqrt{\mu/L}}{1 + \sqrt{\mu/L}}\,(\theta_t - \theta_{t-1}),$$
we have the bound:
$$g(\theta_t) - g(\theta^*) \leq L\|\theta_0 - \theta^*\|^2\big(1 - \sqrt{\mu/L}\big)^t.$$

Note 2.1.5 Ten-line proof [4]. Not improvable. There is a relationship with the conjugate gradient method for quadratic functions: a similar bound holds. We need to know $\mu$ and $L$ to implement the algorithm, which is often difficult.

2.1.3 Optimization for sparsity-inducing norms

For more information on the subject, see [5]. We use gradient descent as a proximal method, a viewpoint that extends to objectives combining a differentiable term with a non-smooth one. Let us introduce the following minimization problem:
$$\theta_{t+1} = \arg\min_{\theta \in \mathbb{R}^d}\Big\{ f(\theta_t) + (\theta - \theta_t)^\top f'(\theta_t) + \tfrac{L}{2}\|\theta - \theta_t\|^2 \Big\}.$$
Since the objective is a quadratic function of $\theta$, the problem is equivalent to: $\theta_{t+1} = \theta_t - \frac{1}{L}f'(\theta_t)$.
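The two accelerated schemes differ only in the momentum coefficient. Here is a minimal sketch of the first one (Theorem 2.5); the gradient oracle and smoothness constant are assumed to be supplied by the user.

```python
import numpy as np

def accelerated_gd(grad, theta0, L, n_iters):
    # theta_t = eta_{t-1} - (1/L) g'(eta_{t-1})
    # eta_t   = theta_t + (t-1)/(t+2) (theta_t - theta_{t-1})
    theta_prev = theta0.copy()
    eta = theta0.copy()
    for t in range(1, n_iters + 1):
        theta = eta - grad(eta) / L
        eta = theta + (t - 1.0) / (t + 2.0) * (theta - theta_prev)
        theta_prev = theta
    return theta
```

For the strongly convex variant of Theorem 2.6, the coefficient $(t-1)/(t+2)$ would be replaced by the constant $(1 - \sqrt{\mu/L})/(1 + \sqrt{\mu/L})$.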

More generally, we want to tackle problems of the form
$$\min_{\theta \in \mathbb{R}^d} f(\theta) + \mu\,\Omega(\theta),$$
where $\Omega(\theta)$ is here a sparsity-inducing norm: we seek to limit the number of non-zero components of $\theta$. The iteration is then formulated as:
$$\theta_{t+1} = \arg\min_{\theta \in \mathbb{R}^d}\Big\{ f(\theta_t) + (\theta - \theta_t)^\top f'(\theta_t) + \mu\,\Omega(\theta) + \tfrac{L}{2}\|\theta - \theta_t\|^2 \Big\}.$$
If $\Omega(\theta) = \|\theta\|_1$, this is thresholded gradient descent. In that case, convergence rates are similar to those of smooth optimization. For accelerated methods, see [6, 7]. In the following, we illustrate the method by focusing on soft-thresholding for the $\ell_1$-norm, with an example in dimension 1.

Example: quadratic problem in 1D, i.e.,
$$\min_{x \in \mathbb{R}} \tfrac{1}{2}x^2 - xy + \lambda|x|.$$
It is a piecewise quadratic function with a kink at zero, with one-sided derivatives at $0^+$: $g_+ = \lambda - y$, and at $0^-$: $g_- = -\lambda - y$.

Figure 2.2. Representation of the 1D quadratic function ($x = 0$ and $x \neq 0$).

- $x = 0$ is the solution iff $g_+ \geq 0$ and $g_- \leq 0$ (i.e., $|y| \leq \lambda$);
- $x \geq 0$ is the solution iff $g_+ \leq 0$ (i.e., $y \geq \lambda$), giving $x^* = y - \lambda$;
- $x \leq 0$ is the solution iff $g_- \geq 0$ (i.e., $y \leq -\lambda$), giving $x^* = y + \lambda$.
In conclusion, the solution is $x^* = \mathrm{sign}(y)(|y| - \lambda)_+$: soft-thresholding.
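The soft-thresholding operator gives a one-line proximal step, and plugging it into the quadratic model above yields the thresholded gradient descent of this section (often called ISTA, see [7]). This is a minimal sketch; here `lam` plays the role of the regularization parameter $\mu$ of the composite objective.

```python
import numpy as np

def soft_threshold(y, lam):
    # closed-form solution of min_x 1/2 x^2 - x y + lam |x|, applied coordinate-wise
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def ista(grad_f, theta0, L, lam, n_iters):
    # thresholded (proximal) gradient descent for f(theta) + lam ||theta||_1
    theta = theta0.copy()
    for _ in range(n_iters):
        theta = soft_threshold(theta - grad_f(theta) / L, lam / L)
    return theta
```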

Figure 2.3. Representation of soft-thresholding in 1D: $x^*(y)$ as a function of $y$, flat on $[-\lambda, \lambda]$.

2.1.4 Newton method

Given $\theta_{t-1}$, we seek to minimize the second-order Taylor expansion:
$$\tilde{g}(\theta) = g(\theta_{t-1}) + g'(\theta_{t-1})^\top(\theta - \theta_{t-1}) + \tfrac{1}{2}(\theta - \theta_{t-1})^\top g''(\theta_{t-1})(\theta - \theta_{t-1}).$$
It is a quadratic function of $\theta$ and we easily derive the minimum:
$$\theta_t = \theta_{t-1} - g''(\theta_{t-1})^{-1}g'(\theta_{t-1}).$$
The iteration is expensive, with a running-time complexity of $O(d^3)$ in general. We have a property of quadratic convergence: if $\|\theta_{t-1} - \theta^*\|$ is small enough, for some constant $C$ we have
$$\big(C\|\theta_t - \theta^*\|\big) \leq \big(C\|\theta_{t-1} - \theta^*\|\big)^2.$$
See for example [8].

2.2 Non-smooth optimization

2.2.1 Counter-example: steepest descent for nonsmooth objectives

There is a real difference between the class of smooth problems (with bounded smoothness constant) and the class of all convex problems: methods that converged with a controlled rate in the first setting may now even fail to be consistent. For example, consider the following function:
$$g(\theta_1, \theta_2) = \begin{cases} 5\big(9\theta_1^2 + 16\theta_2^2\big)^{1/2} & \text{if } \theta_1 > |\theta_2|, \\ 9\theta_1 + 16|\theta_2| & \text{if } \theta_1 \leq |\theta_2|, \end{cases}$$
which is shown in Figure 2.4. For any value of the stepsize, the steepest-descent (sub)gradient scheme gets stuck at zero, which is clearly not a minimum! New algorithms are needed.
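Here is a small numerical sketch of this phenomenon (assuming the reconstruction of $g$ above), with steepest descent implemented via a crude grid line search; the starting point and grid are arbitrary choices. The iterates approach $(0, 0)$ even though $\inf g = -\infty$ (let $\theta_1 \to -\infty$ in the second region).

```python
import numpy as np

def g(theta):
    t1, t2 = theta
    if t1 > abs(t2):
        return 5 * np.sqrt(9 * t1 ** 2 + 16 * t2 ** 2)
    return 9 * t1 + 16 * abs(t2)

def subgrad(theta):
    # gradient in the smooth region t1 > |t2|, a subgradient elsewhere
    t1, t2 = theta
    if t1 > abs(t2):
        r = np.sqrt(9 * t1 ** 2 + 16 * t2 ** 2)
        return np.array([45 * t1 / r, 80 * t2 / r])
    return np.array([9.0, 16.0 * np.sign(t2)])

theta = np.array([2.0, 1.0])                  # start in the smooth region
for _ in range(50):
    d = -subgrad(theta)
    gammas = np.linspace(1e-6, 1.0, 2000)     # crude exact line search
    theta = theta + gammas[np.argmin([g(theta + s * d) for s in gammas])] * d
print(theta, g(theta))   # near (0, 0), although inf g = -infinity
```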

Figure 2.4. Level lines of $g$; color indicates function value.

2.2.2 Subgradient method

New setting:
- $g$ is a convex continuous function;
- the domain is bounded: $C = \{\theta : \|\theta\| \leq D\}$ (although, as the proof shows, any compact convex domain of diameter $2D$ will yield the same result);
- $g$ reaches its minimum on $C$ at $\theta^*$;
- $g$ is Lipschitz-continuous on $C$ with constant $B$.

Note 2.2.1 In fact, any globally defined, real-valued convex function is locally Lipschitz, so we are dealing with the general convex case. Moreover, an upper bound on the Lipschitz constant (for a given value of $D$) may be calculated with a finite number of function evaluations.

The basic idea is still that of gradient descent, with three important changes:
- Obviously, we have to replace the gradient with (any possible) subgradient.
- The stepsize has to decrease with the number of iterates.
- Since even then the iterates may fail to converge, the algorithm does not return the last iterate, but computes an average instead.

Subgradient method: algorithm. Let $\Pi_C$ be the orthogonal projection on $C$. The subgradient algorithm, run for $T$ iterations, is defined in Algorithm 1. For an easier proof that shows where the value of the stepsize comes from, we also consider a slightly different algorithm, in which the stepsize is constant over a single run of the algorithm (Algorithm 2). This is known as the finite-horizon setting.

Algorithm 1 minimize $g$ over $C = \{\theta : \|\theta\| \leq D\}$
Require: $\varepsilon > 0$, $B \geq 0$ and $D \geq 0$, $\theta_0 \in C$
Ensure: $g(\hat\theta) - g(\theta^*) \leq \varepsilon$
  $T \leftarrow 8D^2B^2/\varepsilon^2$
  for $t = 1$ to $T$ do
    $\theta[t] \leftarrow \Pi_C\Big(\theta[t-1] - \frac{\sqrt{2}\,D}{B\sqrt{t}}\, g'(\theta[t-1])\Big)$
  end for
  $\hat\theta \leftarrow \frac{1}{T}\sum_{t=1}^T \theta[t]$
  return $\hat\theta$

Algorithm 2 minimize $g$ over $C = \{\theta : \|\theta\| \leq D\}$
Require: $\varepsilon > 0$, $B \geq 0$ and $D \geq 0$, $\theta_0 \in C$
Ensure: $g(\hat\theta) - g(\theta^*) \leq \varepsilon$
  $T \leftarrow 4D^2B^2/\varepsilon^2$
  for $t = 1$ to $T$ do
    $\theta[t] \leftarrow \Pi_C\Big(\theta[t-1] - \frac{2D}{B\sqrt{T}}\, g'(\theta[t-1])\Big)$
  end for
  $\hat\theta \leftarrow \frac{1}{T}\sum_{t=1}^T \theta[t]$
  return $\hat\theta$
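Here is a minimal sketch of both variants in one routine; the step-size constants follow the optimization carried out in the proof of Theorem 2.7 below, and `subgrad` is assumed to return an arbitrary subgradient.

```python
import numpy as np

def subgradient_method(subgrad, theta0, D, B, T, online=True):
    # Projected subgradient method over C = {theta : ||theta|| <= D}.
    # online=True:  decreasing step sqrt(2) D / (B sqrt(t))    (Algorithm 1)
    # online=False: fixed finite-horizon step 2 D / (B sqrt(T)) (Algorithm 2)
    def project(theta):                        # orthogonal projection onto the ball
        norm = np.linalg.norm(theta)
        return theta if norm <= D else theta * (D / norm)
    theta = theta0.copy()
    avg = np.zeros_like(theta0)
    for t in range(1, T + 1):
        gamma = (np.sqrt(2) * D / (B * np.sqrt(t)) if online
                 else 2 * D / (B * np.sqrt(T)))
        theta = project(theta - gamma * subgrad(theta))
        avg += theta / T                       # running average of the iterates
    return avg                                 # averaged iterate theta_hat
```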

The advantage of the first algorithm over the simpler version is that it can be run in an online fashion (it is often referred to as an anytime algorithm), i.e., without knowing in advance the number of iterations or the required precision.

Theorem 2.7 For Algorithm 2, $g(\hat\theta) - g(\theta^*) \leq \frac{2DB}{\sqrt{T}}$; for Algorithm 1, $g(\hat\theta) - g(\theta^*) \leq \frac{2\sqrt{2}\,DB}{\sqrt{T}}$.

Proof $\Pi_C$ is 1-Lipschitz since it is a projection. Therefore, for any $\theta$,
$$\|\Pi_C(\theta) - \theta^*\| = \|\Pi_C(\theta) - \Pi_C(\theta^*)\| \leq \|\theta - \theta^*\| \quad (\theta^* \in C).$$
Let $\gamma_t$ denote the stepsize. Then
$$\|\theta_t - \theta^*\|^2 \leq \|\theta_{t-1} - \gamma_t g'(\theta_{t-1}) - \theta^*\|^2 \quad \text{(by the preceding inequality)} \quad (2.1)$$
$$= \|\theta_{t-1} - \theta^*\|^2 - 2\gamma_t(\theta_{t-1} - \theta^*)^\top g'(\theta_{t-1}) + \gamma_t^2\|g'(\theta_{t-1})\|^2 \quad (2.2)$$
$$\leq \|\theta_{t-1} - \theta^*\|^2 + B^2\gamma_t^2 - 2\gamma_t(\theta_{t-1} - \theta^*)^\top g'(\theta_{t-1}) \quad (\|g'(\theta)\| \leq B) \quad (2.3)$$
$$\leq \|\theta_{t-1} - \theta^*\|^2 + B^2\gamma_t^2 - 2\gamma_t\big[g(\theta_{t-1}) - g(\theta^*)\big] \quad \text{(property of subgradients)} \quad (2.4)$$
Thus we get:
$$g(\theta_{t-1}) - g(\theta^*) \leq \frac{B^2\gamma_t}{2} + \frac{1}{2\gamma_t}\big[\|\theta_{t-1} - \theta^*\|^2 - \|\theta_t - \theta^*\|^2\big].$$
We then average over $1 \leq t \leq T$. Here there is a difference between the two versions of the algorithm:
- If $\gamma_t = \gamma$ is constant, then we get a telescoping sum:
$$\frac{1}{T}\sum_{t=1}^{T}\big[g(\theta_{t-1}) - g(\theta^*)\big] \leq \frac{B^2\gamma}{2} + \frac{1}{2\gamma T}\big[\|\theta_0 - \theta^*\|^2 - \|\theta_T - \theta^*\|^2\big] \leq \frac{B^2\gamma}{2} + \frac{\|\theta_0 - \theta^*\|^2}{2\gamma T} \leq \frac{B^2\gamma}{2} + \frac{2D^2}{\gamma T}.$$
We can then optimize the upper bound in the parameter $\gamma$: the choice $\gamma = \frac{2D}{B\sqrt{T}}$ gives $\frac{2DB}{\sqrt{T}}$, which suggests the right dependency for the online version.

- In the variable-$\gamma$ case, we use summation by parts:
$$\sum_{t=1}^{T}\big[g(\theta_{t-1}) - g(\theta^*)\big] \leq \frac{B^2}{2}\sum_{t=1}^{T}\gamma_t + \sum_{t=1}^{T}\frac{1}{2\gamma_t}\big[\|\theta_{t-1} - \theta^*\|^2 - \|\theta_t - \theta^*\|^2\big]$$
$$= \frac{B^2}{2}\sum_{t=1}^{T}\gamma_t + \frac{1}{2}\sum_{t=1}^{T-1}\|\theta_t - \theta^*\|^2\Big[\frac{1}{\gamma_{t+1}} - \frac{1}{\gamma_t}\Big] + \frac{\|\theta_0 - \theta^*\|^2}{2\gamma_1} - \frac{\|\theta_T - \theta^*\|^2}{2\gamma_T}$$
$$\leq \frac{B^2}{2}\sum_{t=1}^{T}\gamma_t + \frac{4D^2}{2}\sum_{t=1}^{T-1}\Big[\frac{1}{\gamma_{t+1}} - \frac{1}{\gamma_t}\Big] + \frac{4D^2}{2\gamma_1} = \frac{B^2}{2}\sum_{t=1}^{T}\gamma_t + \frac{2D^2}{\gamma_T}.$$
Now we set $\gamma_t = \frac{xD}{B\sqrt{t}}$ to get:
$$\sum_{t=1}^{T}\big[g(\theta_{t-1}) - g(\theta^*)\big] \leq \frac{xBD}{2}\sum_{t=1}^{T}\frac{1}{\sqrt{t}} + \frac{2BD\sqrt{T}}{x} \leq \frac{xBD}{2}\int_0^T \frac{dt}{\sqrt{t}} + \frac{2BD\sqrt{T}}{x} \quad \Big(\text{because } \tfrac{d}{dt}\,2\sqrt{t} = \tfrac{1}{\sqrt{t}}\Big)$$
$$= xBD\sqrt{T} + \frac{2BD\sqrt{T}}{x} = \Big(x + \frac{2}{x}\Big)BD\sqrt{T},$$
which is optimized when the two terms are equal, that is, $x = \sqrt{2}$, which gives:
$$\frac{1}{T}\sum_{t=1}^{T}\big[g(\theta_{t-1}) - g(\theta^*)\big] \leq \frac{2\sqrt{2}\,BD}{\sqrt{T}}.$$
Finally, the convexity inequality
$$g\Big(\frac{1}{T}\sum_{t=0}^{T-1}\theta_t\Big) \leq \frac{1}{T}\sum_{t=0}^{T-1} g(\theta_t)$$
gives the desired result in both cases.

Subgradient descent with strong convexity. When the non-smooth convex function is strongly convex with known parameter $\mu$, it is possible to improve the rate of convergence to $O\big(\frac{1}{\mu t}\big)$.

Algorithm 3 minimize $g$ over $\{\theta : \|\theta\| \leq D\}$
Require: $\mu > 0$, $D \geq 0$, $\theta_0$ with $\|\theta_0\| \leq D$
Ensure: $\hat\theta_t = \frac{2}{(t+1)(t+2)}\sum_{u=0}^{t}(u+1)\theta_u$, where a subscript $u$ indicates the value of the variable at iteration $u$.
  $\theta \leftarrow \theta_0$
  $\hat\theta \leftarrow \theta_0$
  for $t \geq 1$ do
    $\theta \leftarrow \Pi_D\Big(\theta - \frac{2}{\mu(t+1)}\, g'(\theta)\Big)$
    $\hat\theta \leftarrow \frac{t}{t+2}\hat\theta + \frac{2}{t+2}\theta$
  end for
  return $\hat\theta$

The algorithm and the proof of its convergence properties are similar in spirit to the previous ones. Since here the algorithm may be used in cases where $B$ is unknown, we present it as an iterative scheme (Algorithm 3).

Theorem 2.8 Let $g$ be $B$-Lipschitz and $\mu$-strongly convex over the set $\{\theta : \|\theta\| \leq D\}$. Then, for $\hat\theta$ defined by the above algorithm,
$$g(\hat\theta_t) - g(\theta^*) \leq \frac{2B^2}{\mu(t+1)}.$$

Note 2.2.2 Somewhat surprisingly, $D$ does not appear explicitly in the upper bound. This is because the conditions that $g$ be $B$-Lipschitz and $\mu$-strongly convex already impose a limit on the size of $g$'s domain of definition.

Proof The proof uses the same ideas as in the general case, and the beginning is the same. At (2.4), the assumption of strong convexity permits a better bound:
$$\|\theta_t - \theta^*\|^2 \leq \|\theta_{t-1} - \theta^*\|^2 + B^2\gamma_t^2 - 2\gamma_t\big[g(\theta_{t-1}) - g(\theta^*)\big] - \gamma_t\mu\|\theta_{t-1} - \theta^*\|^2,$$
which leads to:
$$g(\theta_{t-1}) - g(\theta^*) \leq \frac{1 - \gamma_t\mu}{2\gamma_t}\|\theta_{t-1} - \theta^*\|^2 - \frac{1}{2\gamma_t}\|\theta_t - \theta^*\|^2 + \frac{B^2\gamma_t}{2}$$
$$\leq \frac{B^2}{\mu(t+1)} + \frac{\mu}{4}\Big[(t-1)\|\theta_{t-1} - \theta^*\|^2 - (t+1)\|\theta_t - \theta^*\|^2\Big] \qquad \Big(\text{here } \gamma_t = \tfrac{2}{\mu(t+1)}\Big).$$
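A minimal sketch of Algorithm 3 follows; the weighted average is maintained online through the identity $\hat\theta_t = \frac{t}{t+2}\hat\theta_{t-1} + \frac{2}{t+2}\theta_t$, which one can check against the Ensure line above.

```python
import numpy as np

def strongly_convex_subgradient(subgrad, theta0, D, mu, n_iters):
    # Projected subgradient method with step 2/(mu (t+1)) and weighted averaging
    def project(theta):                        # orthogonal projection onto the ball
        norm = np.linalg.norm(theta)
        return theta if norm <= D else theta * (D / norm)
    theta = theta0.copy()
    theta_hat = theta0.copy()
    for t in range(1, n_iters + 1):
        theta = project(theta - 2.0 / (mu * (t + 1)) * subgrad(theta))
        theta_hat = (t * theta_hat + 2.0 * theta) / (t + 2)  # running weighted average
    return theta_hat
```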

Here, as before, the idea is to average the iterates, taking advantage of convexity on the left-hand side and of telescoping terms on the right:
$$g(\hat\theta_t) - g(\theta^*) \leq \frac{2}{(t+1)(t+2)}\sum_{u=1}^{t+1} u\big[g(\theta_{u-1}) - g(\theta^*)\big]$$
$$\leq \frac{2}{(t+1)(t+2)}\Big\{\sum_{u=1}^{t+1}\frac{u}{u+1}\cdot\frac{B^2}{\mu} + \frac{\mu}{4}\sum_{u=1}^{t+1}\big[u(u-1)\|\theta_{u-1} - \theta^*\|^2 - (u+1)u\|\theta_u - \theta^*\|^2\big]\Big\}$$
$$\leq \frac{2}{(t+1)(t+2)}\Big\{\frac{B^2(t+1)}{\mu} + \frac{\mu}{4}\big[0 - (t+2)(t+1)\|\theta_{t+1} - \theta^*\|^2\big]\Big\} \leq \frac{2B^2}{\mu(t+2)} \leq \frac{2B^2}{\mu(t+1)}.$$

2.3 Ellipsoid method

Idea. The basic idea of the ellipsoid method is that of dichotomy: by recursively splitting a set along hyperplanes given by subgradients (so that all minimizers lie on one side of the hyperplane), it is hoped that the diameter of the set decreases to zero exponentially fast, as in the one-dimensional dichotomy method. To get a good rate of convergence, a natural idea is to split the set at its geometric center. However, calculating the geometric center of a set is in general a difficult problem. This is where knowing a few geometric properties of ellipsoids is useful.

Definition 2.9 (Ellipsoid) Given a point $\bar{x} \in \mathbb{R}^d$ and a symmetric positive definite matrix $P$, the ellipsoid $E(\bar{x}, P)$ is the set
$$\{x \in \mathbb{R}^d : (x - \bar{x})^\top P^{-1}(x - \bar{x}) \leq 1\}.$$
The point $\bar{x}$ is the geometric center of $E(\bar{x}, P)$, and the matrix $P$ describes the half-axis lengths (eigenvalues) and directions (eigenvectors).

Proposition 2.10 Given an ellipsoid $E(\bar{x}, P)$ and a hyperplane $H = \{x : g^\top(x - \bar{x}) = 0\}$ containing $\bar{x}$, there is a minimum-volume ellipsoid $E_+ = E(\bar{x}_+, P_+)$ containing $\{x \in E(\bar{x}, P) : g^\top(x - \bar{x}) \leq 0\}$, with
$$\bar{x}_+ = \bar{x} - \frac{1}{d+1}\frac{Pg}{\|g\|_P} \qquad (2.5)$$
$$P_+ = \frac{d^2}{d^2 - 1}\Big(P - \frac{2}{d+1}\frac{Pgg^\top P}{\|g\|_P^2}\Big) \qquad (2.6)$$

where $\|g\|_P$ is the intrinsic $P$-norm of $g$: $\|g\|_P = \sqrt{g^\top Pg}$. Its volume decreases by a constant factor relative to the original ellipsoid:
$$\mathrm{vol}_d\big(E(\bar{x}_+, P_+)\big) \leq e^{-\frac{1}{2d}}\,\mathrm{vol}_d\big(E(\bar{x}, P)\big).$$

Figure 2.5. Cuts and minimal ellipsoids: successive ellipsoids $E_0 \supseteq E_1 \supseteq E_2 \supseteq \cdots$

This suggests the iterative scheme described in Algorithm 4.

Algorithm 4 minimize $g$ over $\{x : \|x\| \leq D\}$
Require: $D \geq 0$, $\bar{x}_0$ with $\|\bar{x}_0\| \leq D$
  $\bar{x}[0] \leftarrow \bar{x}_0$
  $P \leftarrow D^2 I_d$
  $h \leftarrow 0$
  for $t \geq 1$ do
    $h \leftarrow g'(\bar{x}[t-1])$
    $\bar{x}[t] \leftarrow \bar{x}[t-1] - \frac{1}{d+1}\frac{Ph}{\|h\|_P}$
    $P \leftarrow \frac{d^2}{d^2-1}\Big(P - \frac{2}{d+1}\frac{Phh^\top P}{\|h\|_P^2}\Big)$
  end for
  $t^* \leftarrow \arg\min_{0 \leq t \leq \text{stoptime}} g(\bar{x}[t])$
  return $\bar{x}_{t^*}$

Proposition 2.10 is not quite enough to conclude to a convergence rate: even if the volume decreases exponentially, it seems hard to ensure that the diameter does the same, as the ellipsoid may be very eccentric. However, exponential decrease remains true for approximate minimization. The following proof is taken from Nesterov's book [1].

Definition 2.11 The following quantity will be useful:
$$v_g(x) = \frac{1}{\|g'(x)\|}\,(g'(x))^\top(x - x^*).$$
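A minimal sketch of Algorithm 4 using the update formulas (2.5)-(2.6); it assumes $d \geq 2$ and nonzero subgradients, and keeps track of the best center seen so far (the final argmin of the algorithm).

```python
import numpy as np

def ellipsoid_method(subgrad, g, D, d, n_iters):
    # Ellipsoid method on B(0, D): center/shape updates of Proposition 2.10.
    # Requires d >= 2 (the factor d^2/(d^2 - 1) is undefined for d = 1).
    x = np.zeros(d)
    P = D ** 2 * np.eye(d)
    best = x.copy()
    for _ in range(n_iters):
        h = subgrad(x)
        h_P = np.sqrt(h @ P @ h)                 # intrinsic P-norm ||h||_P
        Ph = P @ h
        x = x - Ph / ((d + 1) * h_P)
        P = d ** 2 / (d ** 2 - 1.0) * (P - 2.0 / (d + 1) * np.outer(Ph, Ph) / h_P ** 2)
        if g(x) < g(best):
            best = x.copy()                      # mimic the final argmin over t
    return best
```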

Lemma 2.12 If $g$ is $B$-Lipschitz, then $g(x) - g(x^*) \leq B\,v_g(x)$. More explicitly,
$$g(x) - g(x^*) \leq \frac{B}{\|g'(x)\|}\,(g'(x))^\top(x - x^*).$$

Proof Let
$$y = x^* + v_g(x)\,\frac{g'(x)}{\|g'(x)\|}.$$
Clearly, $\|y - x^*\| \leq v_g(x)$. By convexity, $g(y) \geq g(x) + (g'(x))^\top(y - x)$. But by definition,
$$(g'(x))^\top(x - y) = (g'(x))^\top(x - x^*) - v_g(x)\,\|g'(x)\| = 0.$$
In the end:
$$g(x) - g(x^*) \leq g(y) - g(x^*) \leq B\|y - x^*\| \leq B\,v_g(x).$$

The main point of introducing this new quantity is that it can be bounded using volumes:

Proposition 2.13 Let $Q$ be a convex set in $\mathbb{R}^d$ with finite diameter $D$ and nonempty interior. Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be points and $(E_i)_{i=1,\ldots,n}$ be sets such that
$$\big\{x \in Q : \forall i \in \{1,\ldots,n\},\ (g'(x_i))^\top(x_i - x) \geq 0\big\} \subseteq E_i.$$
Assume that $x^* \in E_0 = Q$. Then:
$$\min_{1 \leq i \leq n} v_g(x_i) \leq D\,\Big[\frac{\mathrm{vol}_d(E_n)}{\mathrm{vol}_d(Q)}\Big]^{1/d}.$$

Proof We introduce some notation: let $v_n = \min_{1 \leq i \leq n} v_g(x_i)$ and $\alpha = v_n/D$. Since $Q$ has diameter $D$ and $x^* \in Q$,
$$(1-\alpha)x^* + \alpha Q \subseteq (1-\alpha)x^* + \alpha B(x^*, D) = B(x^*, v_n).$$
By convexity, we also have that
$$(1-\alpha)x^* + \alpha Q = \big[(1-\alpha)x^* + \alpha Q\big] \cap Q \subseteq B(x^*, v_n) \cap Q.$$
By definition of $v_n$, for any $x$ with $\|x - x^*\| \leq v_n$ and any $i \in \{1,\ldots,n\}$:
$$\frac{1}{\|g'(x_i)\|}(g'(x_i))^\top(x_i - x) = \frac{1}{\|g'(x_i)\|}(g'(x_i))^\top(x_i - x^*) + \frac{1}{\|g'(x_i)\|}(g'(x_i))^\top(x^* - x) \geq v_n - \|x - x^*\| \geq 0,$$
so $x \in E_i$.

Therefore:
$$\alpha^d\,\mathrm{vol}_d(Q) = \mathrm{vol}_d\big((1-\alpha)x^* + \alpha Q\big) \leq \mathrm{vol}_d(E_i).$$
Combining all three inequalities gives $\alpha \leq \big[\mathrm{vol}_d(E_n)/\mathrm{vol}_d(Q)\big]^{1/d}$, which is the claim.

We thus obtain the convergence rate of the ellipsoid method:

Theorem 2.14 Let $g$ be convex and $B$-Lipschitz on $B(0, D)$. Let $\bar{x}_n$ be the value returned by Algorithm 4 when stopped at iteration $n$. Then
$$g(\bar{x}_n) - g(x^*) \leq 2BD\,e^{-\frac{n}{2d^2}}.$$

2.4 Application to Machine Learning

After having described many different methods, we now review their applicability to machine learning.

Setting. As seen in Lecture 1, in machine learning, and more specifically in empirical risk minimization with linear predictors, the objective function is of the form
$$\hat{f}(\theta) = \frac{1}{n}\sum_{i=1}^n \ell\big(y_i, \Phi(x_i)^\top\theta\big).$$
Assuming bounded features, $\|\Phi(x)\| \leq R$, and that the loss is $G$-Lipschitz on $[-RD, RD]$, we have shown that $\hat{f}$ is $GR$-Lipschitz on $\{\theta : \|\theta\| \leq D\}$.

Which methods can be used? Reasonable hypotheses in this setting:
- Smoothness: $\hat{f}$ may or may not be smooth, depending on $\ell$ (notably, the hinge loss is non-smooth, but the square loss used in ridge regression is smooth). If smooth, the constant is essentially the same as for $\ell$, times $R^2$, by linearity of differentiation.
- Strong convexity: In typical high-dimensional settings, $\hat{f}$ will rarely be strongly convex with a useful value of $\mu$ (in particular, it cannot be if $n < d$).
- Dimensional complexity: High dimensionality makes the ellipsoid method useless, despite its geometric convergence. This is because the rate is $O(e^{-n/d^2})$, so $O(d^2)$ iterations are needed. Each iteration requires multiplying a vector by a matrix, so an $O(d^2)$ cost is incurred at each iteration, for a total, prohibitive dependency of $O(d^4)$ on the dimension. By contrast, gradient and subgradient methods converge at a rate independent of the dimension; the only dimension-dependent cost is due to arithmetic operations, and it is linear in $d$.

Thus, a gradient or subgradient method should be used, if possible one that does not need $\mu$.

The case of subgradient descent: choosing the number of steps. In the worst, non-smooth case, the convergence rate is $O(1/\sqrt{t})$. This is not a problem, since we must remember that there is also a statistical error: it is $f(\theta) = \mathbb{E}[\hat{f}(\theta)]$, and not $\hat{f}(\theta)$, that we wish to minimize, and we know from the previous lesson that with probability greater than $1 - \delta$:
$$\Big|\min_{\theta \in \Theta}\hat{f}(\theta) - \min_{\theta \in \Theta} f(\theta)\Big| \leq \max_{\theta \in \Theta}\big|\hat{f}(\theta) - f(\theta)\big| \leq \frac{GRD\big[2 + \sqrt{2\log(2/\delta)}\big]}{\sqrt{n}}.$$
Therefore, with $t = n$ iterations of subgradient descent, we get an optimization error of the same order as the statistical error:
$$\hat{f}(\hat\theta) - \min_{\eta \in \Theta}\hat{f}(\eta) \leq \frac{2GRD}{\sqrt{n}}.$$
There is not much to gain by iterating longer (only a factor of 2 at best). The overall complexity is $O(n^2 d)$: it depends only linearly on the dimension.
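As an end-to-end illustration of this discussion, here is a hypothetical use of the `subgradient_method` sketch above on hinge-loss empirical risk minimization, run for $t = n$ iterations as suggested. The data is synthetic, and since the hinge loss is 1-Lipschitz ($G = 1$), the constant is $B = GR = R$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 5
X = rng.standard_normal((n, d))
y = np.sign(X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))

def erm_subgrad(theta):
    # subgradient of (1/n) sum_i max(0, 1 - y_i x_i' theta)
    margins = y * (X @ theta)
    active = margins < 1.0
    return -(X[active] * y[active, None]).sum(axis=0) / n

D = 10.0                               # arbitrary radius for the constraint set
R = np.linalg.norm(X, axis=1).max()    # feature bound; hinge loss gives B = G R = R
theta_hat = subgradient_method(erm_subgrad, np.zeros(d), D, R, n)
```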

Bibliography

[1] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
[2] S. Bubeck. Convex optimization: algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231-357, 2015.
[3] Y. Nesterov. A method for solving a convex programming problem with rate of convergence O(1/k^2). Soviet Math. Doklady, 269(3):543-547, 1983.
[4] M. Schmidt, N. Le Roux, and F. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. Advances in Neural Information Processing Systems (NIPS), 2011.
[5] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1-106, 2012.
[6] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
[7] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
