Bregman Divergence and Mirror Descent

1 Bregman Divergence

Motivation: generalize the squared Euclidean distance to a class of distances that all share similar properties. Lots of applications in machine learning: clustering, exponential families, etc.

Definition 1 (Bregman divergence). Let ψ : Ω → R be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set Ω. Then the Bregman divergence is defined as

    Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩,  ∀ x, y ∈ Ω.    (1)

That is, the difference between the value of ψ at x and the first-order Taylor expansion of ψ around y, evaluated at the point x.

Examples:

- Euclidean distance. Let ψ(x) = ½‖x‖². Then Δ_ψ(x, y) = ½‖x − y‖².
- KL divergence. Let Ω = {x ∈ R^n_+ : Σ_i x_i = 1} and ψ(x) = Σ_i x_i log x_i. Then Δ_ψ(x, y) = Σ_i x_i log(x_i / y_i) for x, y ∈ Ω. This is the relative entropy, or Kullback–Leibler divergence, between the probability distributions x and y.
- ℓ_q norm. Let p ≥ 1 with 1/p + 1/q = 1, and ψ(x) = ½‖x‖_q². Then Δ_ψ(x, y) = ½‖x‖_q² + ½‖y‖_q² − ⟨x, ∇ψ(y)⟩ (using ⟨y, ∇ψ(y)⟩ = ‖y‖_q², since ψ is homogeneous of degree 2). Note ½‖·‖_q² is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.

Properties of the Bregman divergence:

- Strict convexity in the first argument x: trivial by the strict convexity of ψ.
- Nonnegativity: Δ_ψ(x, y) ≥ 0 for all x, y, and Δ_ψ(x, y) = 0 if and only if x = y. Trivial by strict convexity.
- Asymmetry: in general, Δ_ψ(x, y) ≠ Δ_ψ(y, x); e.g., the KL divergence. Symmetrization is not always useful.
- Non-convexity in the second argument. Let Ω = [1, ∞) and ψ(x) = −log x. Then Δ_ψ(x, y) = −log x + log y + (x − y)/y. One can check that its second-order derivative in y is (2x − y)/y³, which is negative when 2x < y.
- Linearity in ψ: for any a > 0, Δ_{ψ+aφ}(x, y) = Δ_ψ(x, y) + a Δ_φ(x, y).
- Gradient in x: ∇_x Δ_ψ(x, y) = ∇ψ(x) − ∇ψ(y). The gradient in y is trickier, and not commonly used.
- Generalized triangle inequality:

    Δ_ψ(x, y) + Δ_ψ(y, z) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ + ψ(y) − ψ(z) − ⟨∇ψ(z), y − z⟩    (2)
                          = Δ_ψ(x, z) + ⟨x − y, ∇ψ(z) − ∇ψ(y)⟩.    (3)
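Definition (1) and the first two examples translate directly into code. The following is a minimal sketch (the helper name `bregman` and the test points are illustrative, not from the notes); it checks that the generic formula recovers both the half squared Euclidean distance and the KL divergence:

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Bregman divergence of eq. (1): psi(x) - psi(y) - <grad psi(y), x - y>."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# psi(x) = 1/2 ||x||^2 recovers half the squared Euclidean distance.
sq = lambda x: 0.5 * (x @ x)
sq_grad = lambda x: x

# psi(x) = sum_i x_i log x_i (negative entropy) recovers KL divergence on the simplex.
negent = lambda x: float(np.sum(x * np.log(x)))
negent_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
d_euc = bregman(sq, sq_grad, x, y)         # equals 1/2 ||x - y||^2
d_kl = bregman(negent, negent_grad, x, y)  # equals sum_i x_i log(x_i / y_i)
```

Evaluating `bregman` with the arguments swapped also exhibits the asymmetry property for the KL case.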

Special case: ψ is called strongly convex with respect to some norm ‖·‖ with modulus σ if

    ψ(x) ≥ ψ(y) + ⟨∇ψ(y), x − y⟩ + (σ/2)‖x − y‖².    (4)

Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is equivalent to ψ(x) − (σ/2)‖x‖² being convex. For example, the ψ(x) = Σ_i x_i log x_i used in the KL divergence is 1-strongly convex over the simplex Ω = {x ∈ R^n_+ : Σ_i x_i = 1} with respect to the ℓ₁ norm. When ψ is σ-strongly convex, we have

    Δ_ψ(x, y) ≥ (σ/2)‖x − y‖².    (5)

Proof: by definition, Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ ≥ (σ/2)‖x − y‖².

Duality. Suppose ψ is strongly convex. Then

    (∇ψ*)(∇ψ(x)) = x,    Δ_ψ(x, y) = Δ_ψ*(∇ψ(y), ∇ψ(x)).    (6)

Proof (for the first equality only): recall

    ψ*(y) = sup_{z∈Ω} {⟨z, y⟩ − ψ(z)}.    (7)

The sup must be attained because ψ is strongly convex and Ω is closed. x is a maximizer if and only if y = ∇ψ(x). So

    ψ*(y) + ψ(x) = ⟨x, y⟩  ⟺  y = ∇ψ(x).    (8)

Since ψ** = ψ, we have ψ*(y) + ψ**(x) = ⟨x, y⟩, which means y is a maximizer in

    ψ**(x) = sup_z {⟨x, z⟩ − ψ*(z)}.    (9)

This means x = ∇ψ*(y). To summarize, (∇ψ*)(∇ψ(x)) = x.

Mean of a distribution. Suppose U is a random variable over an open set S with distribution μ. Then

    min_{x∈S} E_{U∼μ}[Δ_ψ(U, x)]    (10)

is optimized at x̄ := E_μ[U] = ∫_S u dμ(u). Proof: for any x ∈ S, we have

    E_{U∼μ}[Δ_ψ(U, x)] − E_{U∼μ}[Δ_ψ(U, x̄)]    (11)
    = E_μ[ψ(U) − ψ(x) − ⟨U − x, ∇ψ(x)⟩ − ψ(U) + ψ(x̄) + ⟨U − x̄, ∇ψ(x̄)⟩]    (12)
    = ψ(x̄) − ψ(x) + ⟨x, ∇ψ(x)⟩ − ⟨x̄, ∇ψ(x̄)⟩ + E_μ[−⟨U, ∇ψ(x)⟩ + ⟨U, ∇ψ(x̄)⟩]    (13)
    = ψ(x̄) − ψ(x) − ⟨x̄ − x, ∇ψ(x)⟩    (14)
    = Δ_ψ(x̄, x),    (15)

where (14) uses E_μ[U] = x̄. This must be nonnegative, and it is 0 if and only if x = x̄.

Pythagorean theorem. If x* is the projection of x₀ onto a convex set C ⊆ Ω,

    x* = argmin_{x∈C} Δ_ψ(x, x₀),    (16)

then for all y ∈ C,

    Δ_ψ(y, x₀) ≥ Δ_ψ(y, x*) + Δ_ψ(x*, x₀).    (17)

In the Euclidean case, this means the angle ∠ y x* x₀ is obtuse. More generally:

Lemma 2. Suppose L is a proper convex function whose domain is an open set containing C; L is not necessarily differentiable. Let

    x* = argmin_{x∈C} {L(x) + Δ_ψ(x, x₀)}.    (18)

Then for any y ∈ C we have

    L(y) + Δ_ψ(y, x₀) ≥ L(x*) + Δ_ψ(x*, x₀) + Δ_ψ(y, x*).    (19)
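The "mean of a distribution" property above can be checked numerically: applied to the empirical distribution of a sample, it says the sample mean minimizes the average divergence exactly. A minimal sketch with the KL divergence (the Dirichlet sampling and helper names are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
# Sample points U on the simplex (the Dirichlet parameters are arbitrary).
U = rng.dirichlet([2.0, 3.0, 4.0], size=2000)
u_bar = U.mean(axis=0)                     # candidate minimizer: the sample mean

def kl(u, x):
    # Delta_psi(u, x) for psi = negative entropy, i.e. the KL divergence.
    return float(np.sum(u * np.log(u / x)))

def expected_div(x):
    # E_U[Delta_psi(U, x)] under the empirical distribution of the sample.
    return float(np.mean([kl(u, x) for u in U]))

best = expected_div(u_bar)
others = [expected_div(rng.dirichlet([1.0, 1.0, 1.0])) for _ in range(20)]
# best is no larger than expected_div at any other candidate point
```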

The projection in (16) is just the special case L = 0. This property is the key to the analysis of many optimization algorithms using the Bregman divergence.

Proof: denote J(x) = L(x) + Δ_ψ(x, x₀). Since x* minimizes J over C, there must exist a subgradient d ∈ ∂J(x*) such that

    ⟨d, x − x*⟩ ≥ 0,  ∀ x ∈ C.    (20)

Since ∂J(x*) = {g + ∇_x Δ_ψ(x, x₀)|_{x=x*} : g ∈ ∂L(x*)} = {g + ∇ψ(x*) − ∇ψ(x₀) : g ∈ ∂L(x*)}, there must be a subgradient g ∈ ∂L(x*) such that

    ⟨g + ∇ψ(x*) − ∇ψ(x₀), x − x*⟩ ≥ 0,  ∀ x ∈ C.    (21)

Therefore, using the defining property of subgradients, we have for all y ∈ C that

    L(y) ≥ L(x*) + ⟨g, y − x*⟩    (22)
        ≥ L(x*) + ⟨∇ψ(x₀) − ∇ψ(x*), y − x*⟩    (23)
        = L(x*) + [ψ(x*) − ψ(x₀) − ⟨∇ψ(x₀), x* − x₀⟩]    (24)
            − [ψ(y) − ψ(x₀) − ⟨∇ψ(x₀), y − x₀⟩]    (25)
            + [ψ(y) − ψ(x*) − ⟨∇ψ(x*), y − x*⟩]    (26)
        = L(x*) + Δ_ψ(x*, x₀) − Δ_ψ(y, x₀) + Δ_ψ(y, x*).    (27)

Rearranging completes the proof.

2 Mirror Descent

Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the problem. Suppose we want to minimize a function f over a set C. Recall the projected subgradient descent rule:

    x̃_{k+1} = x_k − α_k g_k,  where g_k ∈ ∂f(x_k),    (28)
    x_{k+1} = argmin_{x∈C} ‖x − x̃_{k+1}‖ = argmin_{x∈C} ‖x − (x_k − α_k g_k)‖.    (29)

This can be interpreted as follows. First approximate f around x_k by a first-order Taylor expansion:

    f(x) ≈ f(x_k) + ⟨g_k, x − x_k⟩.    (30)

Then penalize the displacement by (1/(2α_k))‖x − x_k‖². So the update rule is to find a regularized minimizer of the model:

    x_{k+1} = argmin_{x∈C} {f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖²}.    (31)

It is trivial to see that this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as the measure of displacement:

    x_{k+1} = argmin_{x∈C} {f(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k)}    (32)
            = argmin_{x∈C} {α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + Δ_ψ(x, x_k)}.    (33)

Mirror descent interpretation. Suppose the constraint set C is the whole space (i.e. there is no constraint). Then we can take the gradient with respect to x and find the optimality condition:

    α_k g_k + ∇ψ(x_{k+1}) − ∇ψ(x_k) = 0    (34)
    ⟺ ∇ψ(x_{k+1}) = ∇ψ(x_k) − α_k g_k    (35)
    ⟺ x_{k+1} = (∇ψ)⁻¹(∇ψ(x_k) − α_k g_k) = (∇ψ*)(∇ψ(x_k) − α_k g_k).    (36)

For example, with the KL divergence over the simplex, the update rule becomes

    x_{k+1}(i) ∝ x_k(i) exp(−α_k g_k(i)).    (37)

Rate of convergence.
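The multiplicative update (37), often called exponentiated gradient or entropic mirror descent, can be sketched in a few lines. This is a minimal illustration (the linear objective, step size, and iteration count are arbitrary choices, not from the notes):

```python
import numpy as np

def entropic_mirror_descent(grad, x0, alpha, T):
    """Mirror descent with negative-entropy psi: the multiplicative update (37)."""
    x = x0.copy()
    for _ in range(T):
        g = grad(x)
        x = x * np.exp(-alpha * g)  # x_{k+1}(i) proportional to x_k(i) exp(-alpha g_k(i))
        x = x / x.sum()             # normalize back onto the simplex
    return x

# Illustration: minimize the linear function f(x) = <c, x> over the simplex;
# the minimum sits at the vertex with the smallest c_i.
c = np.array([3.0, 1.0, 2.0])
x_hat = entropic_mirror_descent(lambda x: c, np.ones(3) / 3, alpha=0.5, T=200)
# x_hat concentrates on coordinate 1, the argmin of c
```

Note the iterate stays strictly inside the simplex at every step; no explicit projection is needed.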
Recall that in unconstrained subgradient descent we followed four steps.

1. Bound on a single update:

    ‖x_{k+1} − x*‖² ≤ ‖x_k − α_k g_k − x*‖²    (38)
    = ‖x_k − x*‖² − 2α_k ⟨g_k, x_k − x*⟩ + α_k² ‖g_k‖²    (39)
    ≤ ‖x_k − x*‖² − 2α_k (f(x_k) − f(x*)) + α_k² ‖g_k‖².    (40)

2. Telescope:

    ‖x_{T+1} − x*‖² ≤ ‖x₁ − x*‖² − 2 Σ_{k=1}^T α_k (f(x_k) − f(x*)) + Σ_{k=1}^T α_k² ‖g_k‖².    (41)

3. Bounding ‖x₁ − x*‖² by R² and ‖g_k‖² by G²:

    2 Σ_{k=1}^T α_k (f(x_k) − f(x*)) ≤ R² + G² Σ_{k=1}^T α_k².    (42)

4. Denote ε_k = f(x_k) − f(x*) and rearrange:

    min_{k∈{1,…,T}} ε_k ≤ (R² + G² Σ_k α_k²) / (2 Σ_k α_k).    (43)

By setting the step sizes α_k judiciously, we can achieve

    min_{k∈{1,…,T}} ε_k ≤ RG/√T.    (44)

Suppose C is the simplex. Then R² ≤ 2. But even if each coordinate of each gradient g_k is upper bounded by M, G can still be as large as M√n, i.e. it depends on the dimension.

Clearly steps 2 to 4 can easily be extended by replacing ‖x* − x_{k+1}‖² with Δ_ψ(x*, x_{k+1}). So the only challenge left is to extend step 1. This is possible via Lemma 2. We further assume ψ is σ-strongly convex. In (33), consider α_k (f(x_k) + ⟨g_k, x − x_k⟩) as the L in Lemma 2. Then

    α_k (f(x_k) + ⟨g_k, x* − x_k⟩) + Δ_ψ(x*, x_k)
    ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x*, x_{k+1}).    (45)

Canceling some terms and rearranging, we obtain

    Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) + α_k ⟨g_k, x* − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)    (46)
    = Δ_ψ(x*, x_k) + α_k ⟨g_k, x* − x_k⟩ + α_k ⟨g_k, x_k − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)    (47)
    ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + α_k ⟨g_k, x_k − x_{k+1}⟩ − (σ/2)‖x_k − x_{k+1}‖²    (48)
    ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + α_k ‖g_k‖_* ‖x_k − x_{k+1}‖ − (σ/2)‖x_k − x_{k+1}‖²    (49)
    ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + (α_k²/(2σ)) ‖g_k‖_*².    (50)

Comparing with (40), we have successfully replaced ‖x* − x_k‖² with Δ_ψ(x*, x_k). Again upper bound Δ_ψ(x*, x₁) by R² and ‖g_k‖_* by G; note the norm on g_k is now the dual norm.

To see the advantage of mirror descent, suppose C is the n-dimensional simplex and we use the KL divergence, for which ψ is 1-strongly convex with respect to the ℓ₁ norm. The dual norm of the ℓ₁ norm is the ℓ∞ norm. Then we can bound Δ_ψ(x*, x₁) using the KL divergence, which is at most log n, and G can be upper bounded by M. So in terms of the value RG, mirror descent improves on subgradient descent by a factor of order O(√(n / log n)).

Acceleration 1: f is strongly convex. We say f is strongly convex with respect to another convex function ψ with modulus λ if

    f(x) ≥ f(y) + ⟨g, x − y⟩ + λ Δ_ψ(x, y),  ∀ g ∈ ∂f(y).    (51)

Note we do not assume f is differentiable. Now in the step from (47) to (48), we can plug in the definition of strong convexity:

    Δ_ψ(x*, x_{k+1}) = … + α_k ⟨g_k, x* − x_k⟩ + …  (copy of (47))    (52)
    ≤ … − α_k (f(x_k) − f(x*) + λ Δ_ψ(x*, x_k)) + …    (53)
    …    (54)
    ≤ (1 − λα_k) Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + (α_k²/(2σ)) ‖g_k‖_*².    (55)

Denote δ_k = Δ_ψ(x*, x_k) and set α_k = 1/(λk). Then

    δ_{k+1} ≤ ((k−1)/k) δ_k − (1/(λk)) ε_k + G²/(2σλ²k²)
    ⟺ k δ_{k+1} ≤ (k−1) δ_k − ε_k/λ + G²/(2σλ²k).    (56)

Now telescope (sum both sides from k = 1 to T):

    T δ_{T+1} + (1/λ) Σ_{k=1}^T ε_k ≤ (G²/(2σλ²)) Σ_{k=1}^T 1/k
    ⟹ min_{k∈{1,…,T}} ε_k ≤ (G²/(2σλT)) Σ_{k=1}^T 1/k ≤ (G²/(σλ)) O((log T)/T).    (57)

Acceleration 2: f has Lipschitz continuous gradient. If the gradient of f is Lipschitz continuous, there exists L > 0 such that

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,  ∀ x, y.    (58)

Sometimes we just directly say f is smooth. It is also known that this is equivalent to

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2) ‖x − y‖².    (59)

We bound the ⟨g_k, x* − x_{k+1}⟩ term in (46) as follows:

    ⟨g_k, x* − x_{k+1}⟩ = ⟨g_k, x* − x_k⟩ + ⟨g_k, x_k − x_{k+1}⟩    (60)
    ≤ f(x*) − f(x_k) + f(x_k) − f(x_{k+1}) + (L/2) ‖x_k − x_{k+1}‖²    (61)
    = f(x*) − f(x_{k+1}) + (L/2) ‖x_k − x_{k+1}‖².    (62)

Plugging into (46), we get

    Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) + α_k (f(x*) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖²) − (σ/2)‖x_k − x_{k+1}‖².    (63)

Setting α_k = σ/L, we get

    Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) − (σ/L)(f(x_{k+1}) − f(x*)).    (64)

Telescoping, we get

    min_{k∈{2,…,T+1}} f(x_k) − f(x*) ≤ (L/(σT)) Δ_ψ(x*, x₁) ≤ LR²/(σT).    (65)

This gives an O(1/T) convergence rate. But if we are smarter, like Nesterov, the rate can be improved to O(1/T²). We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.

2.1 Composite objective

Suppose the objective function is h(x) = f(x) + r(x), where f is smooth and r(x) is simple, like ‖x‖₁. If we directly apply the above rates for optimizing h, we get an O(1/√T) rate of convergence because h is not smooth. It would be nice if we could enjoy the O(1/T) rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of r(x), and we only need to extend the proximal operator (33) as follows:

    x_{k+1} = argmin_{x∈C} {f(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k)}    (66)
            = argmin_{x∈C} {α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + α_k r(x) + Δ_ψ(x, x_k)}.    (67)
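With the Euclidean ψ and r(x) = λ‖x‖₁, the update (66) has a closed form: a gradient step on f followed by soft-thresholding (the method usually known as ISTA). A minimal sketch on a synthetic least-squares problem (the problem sizes, λ, and helper names are illustrative, not from the notes):

```python
import numpy as np

def soft_threshold(z, t):
    """Closed-form minimizer of (1/2)||x - z||^2 + t ||x||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient(A, b, lam, alpha, T):
    """Update (66) with Euclidean psi for h(x) = (1/2)||Ax - b||^2 + lam ||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(T):
        g = A.T @ (A @ x - b)                           # gradient of the smooth part f
        x = soft_threshold(x - alpha * g, alpha * lam)  # linearize f, keep r exact
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((40, 10))
x_true = np.array([1.0, -2.0] + [0.0] * 8)
b = A @ x_true + 0.01 * rng.standard_normal(40)
alpha = 1.0 / np.linalg.norm(A, 2) ** 2                 # step 1/L with L = ||A||_2^2
x_hat = proximal_gradient(A, b, lam=1.0, alpha=alpha, T=500)

def h(x):
    return 0.5 * np.sum((A @ x - b) ** 2) + 1.0 * np.sum(np.abs(x))
```

Only the smooth part is linearized; the ℓ₁ term enters the subproblem exactly, which is what preserves the O(1/T) rate discussed above.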

Algorithm 1: Protocol of online learning
  1: The player initializes a model x₁.
  2: for k = 1, 2, … do
  3:     The player proposes a model x_k.
  4:     The rival picks a function f_k.
  5:     The player suffers a loss f_k(x_k).
  6:     The player gets access to f_k and uses it to update its model to x_{k+1}.

Here we use a first-order Taylor approximation of f around x_k, but keep r(x) exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We only show the case of general f (not necessarily smooth or strongly convex), and leave the other two cases as an exercise. In fact we can again achieve the O(1/T) rate when f has Lipschitz continuous gradient.

Consider α_k (f(x_k) + ⟨g_k, x − x_k⟩ + r(x)) as the L in Lemma 2. Then

    α_k (f(x_k) + ⟨g_k, x* − x_k⟩ + r(x*)) + Δ_ψ(x*, x_k)    (68)
    ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩ + r(x_{k+1})) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x*, x_{k+1}).    (69)

Following exactly the derivations from (46) to (50), we obtain

    Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) + α_k ⟨g_k, x* − x_{k+1}⟩ + α_k (r(x*) − r(x_{k+1})) − Δ_ψ(x_{k+1}, x_k)    (70)
    …    (71)
    ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) + r(x_{k+1}) − f(x*) − r(x*)) + (α_k²/(2σ)) ‖g_k‖_*².    (72)

This is almost the same as (50), except that we have r(x_{k+1}) here, not r(x_k). Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote δ_k = Δ_ψ(x*, x_k); then

    f(x_k) + r(x_{k+1}) − f(x*) − r(x*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/(2σ)) ‖g_k‖_*².    (73)

Summing up from k = 1 to T, we obtain

    r(x_{T+1}) − r(x₁) + Σ_{k=1}^T (h(x_k) − h(x*))
    ≤ δ₁/α₁ + Σ_{k=2}^T δ_k (1/α_k − 1/α_{k−1}) − δ_{T+1}/α_T + (G²/(2σ)) Σ_{k=1}^T α_k    (74)
    ≤ R² (1/α₁ + Σ_{k=2}^T (1/α_k − 1/α_{k−1})) + (G²/(2σ)) Σ_{k=1}^T α_k    (75)
    = R²/α_T + (G²/(2σ)) Σ_{k=1}^T α_k.    (76)

Suppose we choose x₁ = argmin_x r(x), which ensures r(x_{T+1}) − r(x₁) ≥ 0. Setting α_k = (R/G)√(σ/k), we get

    Σ_{k=1}^T (h(x_k) − h(x*)) ≤ (RG/√σ)(√T + Σ_{k=1}^T 1/(2√k)) = (RG/√σ) O(√T).    (77)

Therefore min_{k∈{1,…,T}} {h(x_k) − h(x*)} decays at the rate O(RG/√(σT)).

2.2 Online learning

The protocol of online learning is shown in Algorithm 1. The player's goal in online learning is to minimize the regret: the cumulative loss minus the minimal possible loss min_x Σ_k f_k(x) over all fixed x:

    Regret = Σ_{k=1}^T f_k(x_k) − min_{x∈C} Σ_{k=1}^T f_k(x).    (78)

Note there is no assumption made on how the rival picks f_k; it can be adversarial.
After obtaining f_k at iteration k, let us update the model to x_{k+1} by using the mirror descent rule on the function f_k only:

    x_{k+1} = argmin_{x∈C} {f_k(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k)},  where g_k ∈ ∂f_k(x_k).    (79)
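The update (79) can be simulated directly. A minimal sketch with linear losses on the simplex, using the multiplicative form (37) of the mirror descent step (the loss distribution, step-size constant, and horizon are illustrative choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 5, 2000
x = np.ones(n) / n                      # x_1: uniform over the simplex
loss, grads = 0.0, np.zeros(n)
for k in range(1, T + 1):
    c = rng.uniform(0.0, 1.0, size=n)   # rival's linear loss f_k(x) = <c, x>, so g_k = c
    loss += c @ x                       # player suffers f_k(x_k)
    grads += c
    alpha = np.sqrt(np.log(n) / k)      # decaying step size, sqrt(log n / k) scaling
    x = x * np.exp(-alpha * c)          # mirror descent step (79) in the form (37)
    x /= x.sum()

# Regret against the best fixed point: for linear losses over the simplex,
# min_x sum_k <c_k, x> is attained at the vertex with the smallest cumulative loss.
regret = loss - grads.min()
```

The measured regret stays well within the O(√(T log n)) ballpark predicted by the analysis that follows.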

Then it is easy to derive the regret bound. Using f_k in step (50), we have

    f_k(x_k) − f_k(x*) ≤ (1/α_k)(Δ_ψ(x*, x_k) − Δ_ψ(x*, x_{k+1})) + (α_k/(2σ)) ‖g_k‖_*².    (80)

Summing up from k = 1 to T and using the same process as in (74) to (77), we get

    Σ_{k=1}^T (f_k(x_k) − f_k(x*)) ≤ (RG/√σ) O(√T).    (81)

So the regret grows at the order O(√T).

f_k strongly convex: use (55) with f_k in place of f, and we can derive the O(log T) regret bound immediately.

f_k has Lipschitz continuous gradient: the result in (64) can NOT be extended to the online setting, because if we replace f by f_k we get f_k(x_{k+1}) − f_k(x*) on the right-hand side, and telescoping will not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from O(√T) (as for nonsmooth objectives) to O(log T).

Composite objective. In the online setting, both the player and the rival know r(x), and the rival changes f_k(x) at each iteration. The loss incurred at each iteration is h_k(x_k) = f_k(x_k) + r(x_k). The update rule is

    x_{k+1} = argmin_{x∈C} {f_k(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k)},  where g_k ∈ ∂f_k(x_k).    (82)

Note that in this setting, (73) becomes

    f_k(x_k) + r(x_{k+1}) − f_k(x*) − r(x*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/(2σ)) ‖g_k‖_*².    (83)

Although we have r(x_{k+1}) here rather than r(x_k), this is fine because r does not change through the iterations. Choosing x₁ = argmin_x r(x) and telescoping in the same way as from (74) to (77), we immediately obtain

    Σ_{k=1}^T (h_k(x_k) − h_k(x*)) ≤ (RG/√σ) O(√T).    (84)

So the regret grows at O(√T). When the f_k are strongly convex, we can get O(log T) regret in the composite case as well. But, as expected, Lipschitz continuity of ∇f_k alone cannot reduce the regret from O(√T) to O(log T).

2.3 Stochastic optimization

Let us consider optimizing a function which takes the form of an expectation:

    min_x F(x) := E_{ω∼p}[f(x; ω)],    (85)

where p is a distribution over ω. This subsumes a lot of machine learning models. For example, the SVM objective is

    F(x) = (1/m) Σ_{i=1}^m max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2) ‖x‖².    (86)
It can be interpreted as (85) where ω is uniformly distributed on {1, 2, …, m} (i.e. p(ω = i) = 1/m), and

    f(x; i) = max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2) ‖x‖².    (87)

When m is large, it can be costly to compute F and its subgradient. So a simple idea is to base the updates on a single randomly chosen data point. This can be considered a special case of online learning in Algorithm 1, where the rival in step 4 now randomly picks f_k as f(·; ω_k), with ω_k drawn independently from p. Ideally we hope that by using the mirror descent updates, x_k will gradually approach the minimizer
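This single-data-point scheme for the objective (85)/(87) can be sketched as follows; it is essentially a stochastic subgradient method with the 1/(λk) step size suggested by strong convexity (the synthetic separable data, λ, and horizon are illustrative assumptions, not from the notes):

```python
import numpy as np

def svm_sgd(A, c, lam, T, rng):
    """Stochastic subgradient for (85)/(87): each step uses one random data point."""
    m, d = A.shape
    x = np.zeros(d)
    for k in range(1, T + 1):
        i = rng.integers(m)              # the "rival" draws omega_k uniformly from {1,...,m}
        g = lam * x                      # gradient of the (lam/2)||x||^2 term
        if c[i] * (A[i] @ x) < 1.0:
            g = g - c[i] * A[i]          # subgradient of the active hinge term
        x = x - g / (lam * k)            # step 1/(lam k); f(.; i) is lam-strongly convex
    return x

rng = np.random.default_rng(0)
m, d = 200, 5
A = rng.standard_normal((m, d))
c = np.sign(A @ rng.standard_normal(d))  # separable labels from a random hyperplane
x_hat = svm_sgd(A, c, lam=0.1, T=5000, rng=rng)
accuracy = float(np.mean(np.sign(A @ x_hat) == c))
```

Each iteration touches one row of A, so the per-step cost is independent of m.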

Algorithm 2: Protocol of online learning with a stochastic rival
  1: The player initializes a model x₁.
  2: for k = 1, 2, … do
  3:     The player proposes a model x_k.
  4:     The rival randomly draws ω_k from p, which defines a function f_k(x) := f(x; ω_k).
  5:     The player suffers a loss f_k(x_k).
  6:     The player gets access to f_k and uses it to update its model to x_{k+1} by, e.g., mirror descent (79).

of F(x). Intuitively this is quite reasonable: by using f_k we can compute an unbiased estimate of F(x_k) and of a subgradient of F at x_k (because the ω_k are sampled i.i.d. from p). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.

In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays ω_k at iteration k. Then an online learning algorithm A is simply a deterministic mapping from an ordered set {ω₁, …, ω_k} to x_{k+1}. Denote by A(∅) the initial model x₁. Then the following theorem is the key to online-to-batch conversion.

Theorem 3. Suppose an online learning algorithm A has regret bound R_k after running Algorithm 2 for k iterations. Suppose ω₁, …, ω_{T+1} are drawn i.i.d. from p. Define x̂ = A(ω_{j+1}, …, ω_T), where j is drawn uniformly at random from {0, …, T}. Then

    E[F(x̂)] ≤ min_x F(x) + R_{T+1}/(T+1),    (88)

where the expectation is with respect to the randomness of ω₁, …, ω_T and j. Similarly we can have high-probability bounds, which can be stated in a form like (not exactly true)

    F(x̂) ≤ min_x F(x) + R_{T+1}/(T+1) + √((log(1/δ))/T)  with probability at least 1 − δ,    (89)

where the probability is with respect to the randomness of ω₁, …, ω_T and j.
Proof of Theorem 3:

    E[F(x̂)] = E_{j, ω₁,…,ω_{T+1}}[f(x̂; ω_{T+1})] = E_{j, ω₁,…,ω_{T+1}}[f(A(ω_{j+1}, …, ω_T); ω_{T+1})]    (90)
    = (1/(T+1)) Σ_{j=0}^{T} E_{ω₁,…,ω_{T+1}}[f(A(ω_{j+1}, …, ω_T); ω_{T+1})]  (as j is drawn uniformly at random)    (91)
    = (1/(T+1)) Σ_{j=0}^{T} E_{ω₁,…,ω_{T+1}}[f(A(ω₁, …, ω_{T−j}); ω_{T−j+1})]  (shift the iteration index, by i.i.d.-ness of the ω_i)    (92)
    = (1/(T+1)) E_{ω₁,…,ω_{T+1}}[Σ_{s=1}^{T+1} f(A(ω₁, …, ω_{s−1}); ω_s)]  (change of variable s = T − j + 1)    (93)
    ≤ (1/(T+1)) E_{ω₁,…,ω_{T+1}}[min_x Σ_{s=1}^{T+1} f(x; ω_s) + R_{T+1}]  (apply the regret bound)    (94)
    ≤ min_x E_ω[f(x; ω)] + R_{T+1}/(T+1)  (expectation of a min is smaller than the min of expectations)    (95)
    = min_x F(x) + R_{T+1}/(T+1).    (96)
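The construction in Theorem 3 can be sketched on a toy problem. Since the ω's are i.i.d., the model A(ω_{j+1}, …, ω_T) has the same distribution as the iterate produced after T − j steps, so drawing j uniformly amounts to returning a random iterate from the run (the quadratic toy objective and horizon below are illustrative assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stochastic problem: F(x) = E[(x - omega)^2], omega ~ N(1, 1); minimizer x* = 1.
T = 5000
x = 0.0
iterates = [x]                   # x_1 is the initial model A(empty sequence)
for k in range(1, T + 1):
    w = rng.normal(1.0, 1.0)     # the rival draws omega_k from p
    g = 2.0 * (x - w)            # gradient of f(x; omega_k) = (x - omega_k)^2
    x = x - g / (2.0 * k)        # step 1/(lam k), since f(.; omega) is 2-strongly convex
    iterates.append(x)

# Online-to-batch: x_hat = A(omega_{j+1}, ..., omega_T) is distributed like the
# iterate after T - j steps, i.e. a uniformly random iterate from the run.
j = rng.integers(0, T + 1)
x_hat = iterates[T - j]
```

With this step size the recursion reduces to the running sample mean, so the late iterates concentrate around the minimizer x* = 1.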


More information

IFT Lecture 2 Basics of convex analysis and gradient descent

IFT Lecture 2 Basics of convex analysis and gradient descent IF 6085 - Lecture Basics of convex analysis and gradient descent his version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribes: Assya rofimov,

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming

A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming A Randomized Nonmonotone Block Proximal Gradient Method for a Class of Structured Nonlinear Programming Zhaosong Lu Lin Xiao March 9, 2015 (Revised: May 13, 2016; December 30, 2016) Abstract We propose

More information

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles

Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Convex Optimization on Large-Scale Domains Given by Linear Minimization Oracles Arkadi Nemirovski H. Milton Stewart School of Industrial and Systems Engineering Georgia Institute of Technology Joint research

More information

Machine Learning Summer 2018 Exercise Sheet 4

Machine Learning Summer 2018 Exercise Sheet 4 Ludwig-Maimilians-Universitaet Muenchen 17.05.2018 Institute for Informatics Prof. Dr. Volker Tresp Julian Busch Christian Frey Machine Learning Summer 2018 Eercise Sheet 4 Eercise 4-1 The Simpsons Characters

More information

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725

Dual Methods. Lecturer: Ryan Tibshirani Convex Optimization /36-725 Dual Methods Lecturer: Ryan Tibshirani Conve Optimization 10-725/36-725 1 Last time: proimal Newton method Consider the problem min g() + h() where g, h are conve, g is twice differentiable, and h is simple.

More information

1 Kernel methods & optimization

1 Kernel methods & optimization Machine Learning Class Notes 9-26-13 Prof. David Sontag 1 Kernel methods & optimization One eample of a kernel that is frequently used in practice and which allows for highly non-linear discriminant functions

More information

A Finite Newton Method for Classification Problems

A Finite Newton Method for Classification Problems A Finite Newton Method for Classification Problems O. L. Mangasarian Computer Sciences Department University of Wisconsin 1210 West Dayton Street Madison, WI 53706 olvi@cs.wisc.edu Abstract A fundamental

More information

Composite Objective Mirror Descent

Composite Objective Mirror Descent Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Dual Ascent. Ryan Tibshirani Convex Optimization

Dual Ascent. Ryan Tibshirani Convex Optimization Dual Ascent Ryan Tibshirani Conve Optimization 10-725 Last time: coordinate descent Consider the problem min f() where f() = g() + n i=1 h i( i ), with g conve and differentiable and each h i conve. Coordinate

More information

U Logo Use Guidelines

U Logo Use Guidelines Information Theory Lecture 3: Applications to Machine Learning U Logo Use Guidelines Mark Reid logo is a contemporary n of our heritage. presents our name, d and our motto: arn the nature of things. authenticity

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course otes for EE7C (Spring 018): Conve Optimization and Approimation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Ma Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Lecture 14: Newton s Method

Lecture 14: Newton s Method 10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Math 273a: Optimization Convex Conjugacy

Math 273a: Optimization Convex Conjugacy Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information

Information geometry of mirror descent

Information geometry of mirror descent Information geometry of mirror descent Geometric Science of Information Anthea Monod Department of Statistical Science Duke University Information Initiative at Duke G. Raskutti (UW Madison) and S. Mukherjee

More information

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods

Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n

More information

arxiv: v6 [math.oc] 24 Sep 2018

arxiv: v6 [math.oc] 24 Sep 2018 : The First Direct Acceleration of Stochastic Gradient Methods version 6 arxiv:63.5953v6 [math.oc] 4 Sep 8 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Iteration-complexity of first-order penalty methods for convex programming

Iteration-complexity of first-order penalty methods for convex programming Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)

More information

CS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018

CS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 1 Overview This lecture mainly covers Recall the statistical theory of GANs

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Gradient methods for minimizing composite functions

Gradient methods for minimizing composite functions Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December

More information

Lecture 6 Lecturer: Shaddin Dughmi Scribes: Omkar Thakoor & Umang Gupta

Lecture 6 Lecturer: Shaddin Dughmi Scribes: Omkar Thakoor & Umang Gupta CSCI699: opics in Learning & Game heory Lecture 6 Lecturer: Shaddin Dughmi Scribes: Omkar hakoor & Umang Gupta No regret learning in zero-sum games Definition. A 2-player game of complete information is

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

HW #1 SOLUTIONS. g(x) sin 1 x

HW #1 SOLUTIONS. g(x) sin 1 x HW #1 SOLUTIONS 1. Let f : [a, b] R and let v : I R, where I is an interval containing f([a, b]). Suppose that f is continuous at [a, b]. Suppose that lim s f() v(s) = v(f()). Using the ɛ δ definition

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Smoothing Proximal Gradient Method. General Structured Sparse Regression

Smoothing Proximal Gradient Method. General Structured Sparse Regression for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:

More information

Lecture 6: September 12

Lecture 6: September 12 10-725: Optimization Fall 2013 Lecture 6: September 12 Lecturer: Ryan Tibshirani Scribes: Micol Marchetti-Bowick Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not

More information

too, of course, but is perhaps overkill here.

too, of course, but is perhaps overkill here. LUNDS TEKNISKA HÖGSKOLA MATEMATIK LÖSNINGAR OPTIMERING 018-01-11 kl 08-13 1. a) CQ points are shown as A and B below. Graphically: they are the only points that share the same tangent line for both active

More information

Algorithms for Nonsmooth Optimization

Algorithms for Nonsmooth Optimization Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

A Primer on Multidimensional Optimization

A Primer on Multidimensional Optimization A Primer on Multidimensional Optimization Prof. Dr. Florian Rupp German University of Technology in Oman (GUtech) Introduction to Numerical Methods for ENG & CS (Mathematics IV) Spring Term 2016 Eercise

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,

More information

Lecture 2: Subgradient Methods

Lecture 2: Subgradient Methods Lecture 2: Subgradient Methods John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this lecture, we discuss first order methods for the minimization of convex functions. We

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Bregman Divergences for Data Mining Meta-Algorithms

Bregman Divergences for Data Mining Meta-Algorithms p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,

More information

Optimization and Optimal Control in Banach Spaces

Optimization and Optimal Control in Banach Spaces Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,

More information