Bregman Divergence and Mirror Descent
- Mervin Cooper
1 Bregman Divergence

Motivation: generalize the squared Euclidean distance to a class of distances that all share similar properties. Lots of applications in machine learning, clustering, exponential families.

Definition 1 (Bregman divergence) Let ψ : Ω → ℝ be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set Ω. Then the Bregman divergence is defined as

  Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩,  ∀ x, y ∈ Ω.  (1)

That is, the difference between the value of ψ at x and the first-order Taylor expansion of ψ around y, evaluated at the point x.

Examples

- Euclidean distance. Let ψ(x) = ‖x‖². Then Δ_ψ(x, y) = ‖x − y‖².
- Relative entropy. Let Ω = {x ∈ ℝⁿ₊ : Σᵢ xᵢ = 1} and ψ(x) = Σᵢ xᵢ log xᵢ. Then Δ_ψ(x, y) = Σᵢ xᵢ log(xᵢ/yᵢ) for x, y ∈ Ω. This is called relative entropy, or Kullback–Leibler divergence, between the probability distributions x and y.
- ℓ_p norm. Let p ≥ 1 and 1/p + 1/q = 1. Let ψ(x) = ½‖x‖²_q. Then Δ_ψ(x, y) = ½‖x‖²_q + ½‖y‖²_q − ⟨x, ∇(½‖·‖²_q)(y)⟩. Note ½‖y‖²_q is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.

Properties of Bregman divergence

- Strict convexity in the first argument: trivial by the strict convexity of ψ.
- Nonnegativity: Δ_ψ(x, y) ≥ 0 for all x, y, and Δ_ψ(x, y) = 0 if and only if x = y. Trivial by strict convexity.
- Asymmetry: in general, Δ_ψ(x, y) ≠ Δ_ψ(y, x); e.g., the KL divergence. Symmetrization is not always useful.
- Non-convexity in the second argument. Let Ω = [1, ∞) and ψ(x) = −log x. Then Δ_ψ(x, y) = −log x + log y + (x − y)/y. One can check that its second-order derivative in y is (2x − y)/y³, which is negative when 2x < y.
- Linearity in ψ: for any a > 0, Δ_{ψ+aφ}(x, y) = Δ_ψ(x, y) + a Δ_φ(x, y).
- Gradient in x: ∇ₓ Δ_ψ(x, y) = ∇ψ(x) − ∇ψ(y). The gradient in y is trickier, and not commonly used.
- Generalized triangle inequality:

  Δ_ψ(x, y) + Δ_ψ(y, z) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ + ψ(y) − ψ(z) − ⟨∇ψ(z), y − z⟩  (2)
  = Δ_ψ(x, z) + ⟨x − y, ∇ψ(z) − ∇ψ(y)⟩.  (3)
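The definition and the first two examples can be checked numerically. A minimal sketch (the test vectors are arbitrary):

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Generic Bregman divergence D_psi(x, y) = psi(x) - psi(y) - <grad_psi(y), x - y>."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# Euclidean case: psi(x) = ||x||^2 gives D(x, y) = ||x - y||^2.
sq = lambda v: v @ v
sq_grad = lambda v: 2 * v
x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
assert np.isclose(bregman(sq, sq_grad, x, y), np.sum((x - y) ** 2))

# Negative entropy on the simplex gives the KL divergence.
negent = lambda v: np.sum(v * np.log(v))
negent_grad = lambda v: np.log(v) + 1.0
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.6, 0.3, 0.1])
kl = bregman(negent, negent_grad, p, q)
assert np.isclose(kl, np.sum(p * np.log(p / q)))               # relative entropy
assert not np.isclose(kl, bregman(negent, negent_grad, q, p))  # asymmetry
```

The generic formula recovers both closed forms, and the last assertion illustrates the asymmetry property above.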
Special case: ψ is called strongly convex with respect to some norm ‖·‖ with modulus σ if

  ψ(x) ≥ ψ(y) + ⟨∇ψ(y), x − y⟩ + (σ/2)‖x − y‖².  (4)

Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is equivalent to ψ(x) − (σ/2)‖x‖² being convex. For example, the ψ(x) = Σᵢ xᵢ log xᵢ used in the KL divergence is 1-strongly convex over the simplex Ω = {x ∈ ℝⁿ₊ : Σᵢ xᵢ = 1} with respect to the ℓ₁ norm.

When ψ is σ strongly convex, we have

  Δ_ψ(x, y) ≥ (σ/2)‖x − y‖².  (5)

Proof: By definition, Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ ≥ (σ/2)‖x − y‖².

Duality. Suppose ψ is strongly convex. Then

  (∇ψ*)(∇ψ(x)) = x,    Δ_ψ(x, y) = Δ_{ψ*}(∇ψ(y), ∇ψ(x)).  (6)

Proof (for the first equality only): Recall

  ψ*(y) = sup_{z∈Ω} {⟨z, y⟩ − ψ(z)}.  (7)

The sup must be attainable because ψ is strongly convex and Ω is closed. x is a maximizer if and only if y = ∇ψ(x). So

  ψ*(y) + ψ(x) = ⟨x, y⟩  ⟺  y = ∇ψ(x).  (8)

Since ψ** = ψ, we also have ψ*(y) + ψ**(x) = ⟨x, y⟩, which means y is the maximizer in

  ψ**(x) = sup_z {⟨x, z⟩ − ψ*(z)}.  (9)

This means x = ∇ψ*(y). To summarize, (∇ψ*)(∇ψ(x)) = x.

Mean of distribution. Suppose U is a random variable over an open set S with distribution μ. Then

  min_{x∈S} E_{U∼μ}[Δ_ψ(U, x)]  (10)

is optimized at x̄ := E_μ[U] = ∫_S u dμ(u). Proof: For any x ∈ S, we have

  E_{U∼μ}[Δ_ψ(U, x)] − E_{U∼μ}[Δ_ψ(U, x̄)]  (11)
  = E_μ[ψ(U) − ψ(x) − ⟨∇ψ(x), U − x⟩ − ψ(U) + ψ(x̄) + ⟨∇ψ(x̄), U − x̄⟩]  (12)
  = ψ(x̄) − ψ(x) + ⟨∇ψ(x), x⟩ − ⟨∇ψ(x̄), x̄⟩ + E_μ[−⟨∇ψ(x), U⟩ + ⟨∇ψ(x̄), U⟩]  (13)
  = ψ(x̄) − ψ(x) − ⟨∇ψ(x), x̄ − x⟩  (14)
  = Δ_ψ(x̄, x).  (15)

This must be nonnegative, and is 0 if and only if x = x̄.

Pythagorean theorem. If x* is the projection of x₀ onto a convex set C ⊆ Ω:

  x* = argmin_{x∈C} Δ_ψ(x, x₀),  (16)

then

  Δ_ψ(y, x₀) ≥ Δ_ψ(y, x*) + Δ_ψ(x*, x₀).  (17)

In the Euclidean case, it means the angle ∠(y, x*, x₀) is obtuse. More generally:

Lemma 2 Suppose L is a proper convex function whose domain is an open set containing C. L is not necessarily differentiable. Let

  x* = argmin_{x∈C} {L(x) + Δ_ψ(x, x₀)}.  (18)

Then for any y ∈ C we have

  L(y) + Δ_ψ(y, x₀) ≥ L(x*) + Δ_ψ(x*, x₀) + Δ_ψ(y, x*).  (19)
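The mean-of-distribution property is easy to verify numerically: for sample points on the simplex and the KL divergence, the plain average beats any other candidate. A small sketch (the sample sizes are illustrative):

```python
import numpy as np

# For sample points u_1..u_N on the simplex, the x minimizing the average
# divergence (1/N) sum_j D_psi(u_j, x) is the plain mean u-bar, here checked
# with psi = negative entropy, for which D_psi is the KL divergence.
rng = np.random.default_rng(0)
U = rng.dirichlet(np.ones(4), size=300)      # 300 points on the 4-simplex
ubar = U.mean(axis=0)

def avg_kl(x):
    return np.mean(np.sum(U * np.log(U / x), axis=1))

# The mean should beat any other candidate on the simplex.
for _ in range(50):
    cand = rng.dirichlet(np.ones(4))
    assert avg_kl(ubar) <= avg_kl(cand) + 1e-12
```

The same check passes for any valid ψ, which is the content of the property: the minimizer does not depend on which Bregman divergence is used.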
The projection in (16) is just the special case L = 0. This property is the key to the analysis of many optimization algorithms using Bregman divergences.

Proof: Denote J(x) = L(x) + Δ_ψ(x, x₀). Since x* minimizes J over C, there must exist a subgradient d ∈ ∂J(x*) such that

  ⟨d, x − x*⟩ ≥ 0,  ∀ x ∈ C.  (20)

Now ∂J(x*) = {g + ∇ₓΔ_ψ(x, x₀)|_{x=x*} : g ∈ ∂L(x*)} = {g + ∇ψ(x*) − ∇ψ(x₀) : g ∈ ∂L(x*)}. So there must be a subgradient g ∈ ∂L(x*) such that

  ⟨g + ∇ψ(x*) − ∇ψ(x₀), x − x*⟩ ≥ 0,  ∀ x ∈ C.  (21)

Therefore, using the property of subgradients, we have for all y ∈ C that

  L(y) ≥ L(x*) + ⟨g, y − x*⟩  (22)
  ≥ L(x*) + ⟨∇ψ(x₀) − ∇ψ(x*), y − x*⟩  (23)
  = L(x*) − ⟨∇ψ(x₀), x* − x₀⟩ + ψ(x*) − ψ(x₀)  (24)
    + ⟨∇ψ(x₀), y − x₀⟩ − ψ(y) + ψ(x₀)  (25)
    − ⟨∇ψ(x*), y − x*⟩ + ψ(y) − ψ(x*)  (26)
  = L(x*) + Δ_ψ(x*, x₀) − Δ_ψ(y, x₀) + Δ_ψ(y, x*).  (27)

Rearranging completes the proof.

2 Mirror Descent

Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the problem. Suppose we want to minimize a function f over a set C. Recall the subgradient descent rule

  x_{k+½} = x_k − α_k g_k, where g_k ∈ ∂f(x_k),  (28)
  x_{k+1} = argmin_{x∈C} ‖x − x_{k+½}‖² = argmin_{x∈C} ‖x − (x_k − α_k g_k)‖².  (29)

This can be interpreted as follows. First approximate f around x_k by a first-order Taylor expansion

  f(x) ≈ f(x_k) + ⟨g_k, x − x_k⟩.  (30)

Then penalize the displacement by (1/2α_k)‖x − x_k‖². So the update rule is to find a regularized minimizer of the model:

  x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/2α_k)‖x − x_k‖² }.  (31)

It is trivial to see this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as the measure of displacement:

  x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k) }  (32)
  = argmin_{x∈C} { α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + Δ_ψ(x, x_k) }.  (33)

Mirror descent interpretation. Suppose the constraint set C is the whole space (i.e., no constraint). Then we can take the gradient with respect to x and find the optimality condition

  g_k + (1/α_k)(∇ψ(x_{k+1}) − ∇ψ(x_k)) = 0  (34)
  ⟺ ∇ψ(x_{k+1}) = ∇ψ(x_k) − α_k g_k  (35)
  ⟺ x_{k+1} = (∇ψ)⁻¹(∇ψ(x_k) − α_k g_k) = (∇ψ*)(∇ψ(x_k) − α_k g_k).  (36)

For example, for the KL divergence over the simplex, the update rule becomes

  x_{k+1}(i) ∝ x_k(i) exp(−α_k g_k(i)).  (37)
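The entropic update (37) is straightforward to implement. A minimal sketch, assuming the simplex constraint is enforced by renormalizing after the multiplicative step (the cost vector below is an arbitrary illustration):

```python
import numpy as np

def eg_step(x, g, alpha):
    """Mirror descent step with negative entropy on the simplex, i.e. update (37):
    x_{k+1}(i) is proportional to x_k(i) * exp(-alpha * g(i)), then renormalized."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

# Minimize f(x) = <c, x> over the simplex; the optimum puts all mass on argmin(c).
c = np.array([0.9, 0.1, 0.5])
x = np.full(3, 1.0 / 3.0)                       # uniform starting point
for k in range(1, 201):
    x = eg_step(x, c, alpha=1.0 / np.sqrt(k))   # decaying step size
assert abs(x.sum() - 1.0) < 1e-9
assert x.argmax() == 1                          # mass concentrates on the cheapest coordinate
```

This is the exponentiated gradient method; note the update never leaves the (relative interior of the) simplex, so no explicit projection is needed.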
Rate of convergence. Recall that in unconstrained subgradient descent we followed 4 steps.
1. Bound on a single update:

  ‖x_{k+1} − x*‖² ≤ ‖x_k − α_k g_k − x*‖²  (38)
  = ‖x_k − x*‖² − 2α_k ⟨g_k, x_k − x*⟩ + α_k² ‖g_k‖²  (39)
  ≤ ‖x_k − x*‖² − 2α_k (f(x_k) − f(x*)) + α_k² ‖g_k‖².  (40)

2. Telescope:

  ‖x_{T+1} − x*‖² ≤ ‖x_1 − x*‖² − 2 Σ_{k=1}^T α_k (f(x_k) − f(x*)) + Σ_{k=1}^T α_k² ‖g_k‖².  (41)

3. Bounding ‖x_1 − x*‖² by R² and ‖g_k‖² by G²:

  2 Σ_{k=1}^T α_k (f(x_k) − f(x*)) ≤ R² + G² Σ_{k=1}^T α_k².  (42)

4. Denote ε_k = f(x_k) − f(x*) and rearrange:

  min_{k∈{1,…,T}} ε_k ≤ (R² + G² Σ_k α_k²) / (2 Σ_k α_k).  (43)

By setting the step size α_k judiciously, we can achieve

  min_{k∈{1,…,T}} ε_k ≤ RG/√T.  (44)

Suppose C is the simplex. Then R² ≤ 2. If each coordinate of each gradient g_k is upper bounded by M, then G can be as large as M√n, i.e., it depends on the dimension.

Clearly steps 2 to 4 can easily be extended by replacing ‖x* − x_{k+1}‖² with Δ_ψ(x*, x_{k+1}). So the only challenge left is to extend step 1. This is actually possible via Lemma 2. We further assume ψ is σ strongly convex. In (33), consider α_k (f(x_k) + ⟨g_k, x − x_k⟩) as the L in Lemma 2. Then

  α_k (f(x_k) + ⟨g_k, x* − x_k⟩) + Δ_ψ(x*, x_k) ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x*, x_{k+1}).  (45)

Canceling some terms and rearranging, we obtain

  Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) + α_k ⟨g_k, x* − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)  (46)
  = Δ_ψ(x*, x_k) + α_k ⟨g_k, x* − x_k⟩ + α_k ⟨g_k, x_k − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)  (47)
  ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + α_k ⟨g_k, x_k − x_{k+1}⟩ − (σ/2)‖x_k − x_{k+1}‖²  (48)
  ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + α_k ‖g_k‖_* ‖x_k − x_{k+1}‖ − (σ/2)‖x_k − x_{k+1}‖²  (49)
  ≤ Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + (α_k²/2σ) ‖g_k‖_*².  (50)

Comparing with (40), we have successfully replaced ‖x* − x_{k+1}‖² with Δ_ψ(x*, x_{k+1}). Again upper bound Δ_ψ(x*, x_1) by R² and ‖g_k‖_* by G; note the norm on g_k is now the dual norm.

To see the advantage of mirror descent, suppose C is the n-dimensional simplex, and we use the KL divergence, for which ψ is 1-strongly convex with respect to the ℓ₁ norm. The dual norm of the ℓ₁ norm is the ℓ∞ norm. Then we can bound Δ_ψ(x*, x_1) using the KL divergence, and it is at most log n. G can be upper bounded by M. So in terms of the value of RG, mirror descent is smaller than subgradient descent by a factor of order √(n/log n).

Acceleration 1: f is strongly convex. We say f is strongly convex with respect to another convex function ψ with modulus λ if

  f(x) ≥ f(y) + ⟨g, x − y⟩ + λ Δ_ψ(x, y),  ∀ g ∈ ∂f(y).  (51)
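The comparison of the constants R and G above can be checked numerically: a gradient with coordinates bounded by M has ℓ₂ norm up to M√n but ℓ∞ norm at most M, and the Bregman radius of the simplex under KL from the uniform start is log n. A small sketch (n and M are illustrative values):

```python
import numpy as np

n, M = 10_000, 1.0
g = np.full(n, M)              # worst case: every coordinate at the bound M
l2 = np.linalg.norm(g)         # enters the subgradient descent bound: G ~ M * sqrt(n)
linf = np.max(np.abs(g))       # dual of l1, enters the mirror descent bound: G ~ M
assert np.isclose(l2, M * np.sqrt(n))
assert np.isclose(linf, M)

# Bregman radius of the simplex under KL from the uniform starting point:
# KL(e_i, uniform) = log n for any vertex e_i.
x1 = np.full(n, 1.0 / n)
e1 = np.zeros(n)
e1[0] = 1.0
mask = e1 > 0                  # restrict to the support to avoid log(0)
kl = np.sum(e1[mask] * np.log(e1[mask] / x1[mask]))
assert np.isclose(kl, np.log(n))
```

So at n = 10,000 the subgradient descent constant is 100× larger, while the KL radius only grows to log n ≈ 9.2.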
Note we do not assume f is differentiable. Now in the step from (47) to (48), we can plug in the definition of strong convexity:

  Δ_ψ(x*, x_{k+1}) = … + α_k ⟨g_k, x* − x_k⟩ + …  (copy of (47))  (52)
  ≤ … − α_k ( f(x_k) − f(x*) + λ Δ_ψ(x*, x_k) ) + …  (53)
  ≤ …  (54)
  ≤ (1 − λα_k) Δ_ψ(x*, x_k) − α_k (f(x_k) − f(x*)) + (α_k²/2σ) ‖g_k‖_*².  (55)

Denote δ_k = Δ_ψ(x*, x_k) and set α_k = 1/(λk). Then

  δ_{k+1} ≤ ((k−1)/k) δ_k − ε_k/(λk) + G²/(2σλ²k²)  ⟹  k δ_{k+1} ≤ (k−1) δ_k − ε_k/λ + G²/(2σλ²k).  (56)

Now telescope (sum up both sides from k = 1 to T):

  T δ_{T+1} + (1/λ) Σ_{k=1}^T ε_k ≤ (G²/2σλ²) Σ_{k=1}^T 1/k  ⟹  min_{k∈{1,…,T}} ε_k ≤ (G²/2σλT) Σ_{k=1}^T 1/k = (G²/σλ) O((log T)/T).  (57)

Acceleration 2: f has Lipschitz continuous gradient. If the gradient of f is Lipschitz continuous, there exists L > 0 such that

  ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,  ∀ x, y.  (58)

Sometimes we just directly say f is smooth. It is also known that this is equivalent to

  f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2) ‖x − y‖².  (59)

We bound the ⟨g_k, x* − x_{k+1}⟩ term in (46) as follows:

  ⟨g_k, x* − x_{k+1}⟩ = ⟨g_k, x* − x_k⟩ + ⟨g_k, x_k − x_{k+1}⟩  (60)
  ≤ f(x*) − f(x_k) + f(x_k) − f(x_{k+1}) + (L/2) ‖x_k − x_{k+1}‖²  (61)
  = f(x*) − f(x_{k+1}) + (L/2) ‖x_k − x_{k+1}‖².  (62)

Plugging into (46), we get

  Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) + α_k ( f(x*) − f(x_{k+1}) + (L/2) ‖x_k − x_{k+1}‖² ) − (σ/2) ‖x_k − x_{k+1}‖².  (63)

Setting α_k = σ/L, we get

  Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) − (σ/L)( f(x_{k+1}) − f(x*) ).  (64)

Telescoping, we get

  min_{k∈{2,…,T+1}} f(x_k) − f(x*) ≤ L Δ_ψ(x*, x_1)/(σT) ≤ LR²/(σT).  (65)

This gives an O(1/T) convergence rate. But if we are smarter, like Nesterov, the rate can be improved to O(1/T²). We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.

2.1 Composite objective

Suppose the objective function is h(x) = f(x) + r(x), where f is smooth and r(x) is simple, like ‖x‖₁. If we directly apply the above rates for optimizing h, we get an O(1/√T) rate of convergence because h is not smooth. It would be nice if we could enjoy the O(1/T) rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of r(x), and we only need to extend the proximal operator (33) as follows:

  x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k) }  (66)
  = argmin_{x∈C} { α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + α_k r(x) + Δ_ψ(x, x_k) }.  (67)
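For the common special case r(x) = λ‖x‖₁ with the Euclidean ψ(x) = ½‖x‖² and no constraint, the update (66) has a closed form: soft-thresholding of the plain gradient step. A minimal sketch (the step size and λ below are illustrative):

```python
import numpy as np

def prox_l1_step(x_k, g_k, alpha, lam):
    """One composite update (66) with Euclidean psi, r(x) = lam*||x||_1, no constraint:
    argmin_x  <g_k, x> + lam*||x||_1 + (1/(2*alpha))*||x - x_k||^2.
    The closed-form solution is soft-thresholding of the gradient step."""
    z = x_k - alpha * g_k
    return np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)

x_k = np.array([0.3, -2.0, 1.5])
g_k = np.array([1.0, 0.0, -1.0])
x_next = prox_l1_step(x_k, g_k, alpha=0.5, lam=0.4)
# gradient step z = [-0.2, -2.0, 2.0]; threshold alpha*lam = 0.2
assert np.allclose(x_next, [0.0, -1.8, 1.8])
```

Because r is kept exact rather than linearized, the update can set coordinates exactly to zero, which is why the O(1/T) smooth rate survives the nonsmooth ‖x‖₁ term.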
Algorithm 1: Protocol of online learning
  1  The player initializes a model x₁.
  2  for k = 1, 2, … do
  3    The player proposes a model x_k.
  4    The rival picks a function f_k.
  5    The player suffers a loss f_k(x_k).
  6    The player gets access to f_k and uses it to update its model to x_{k+1}.

Here we use a first-order Taylor approximation of f around x_k, but keep r(x) exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We here only show the case of general f (not necessarily smooth or strongly convex), and leave the other two cases as an exercise. In fact we can again achieve an O(1/T) rate when f has Lipschitz continuous gradient.

Consider α_k (f(x_k) + ⟨g_k, x − x_k⟩ + r(x)) as the L in Lemma 2. Then

  α_k ( f(x_k) + ⟨g_k, x* − x_k⟩ + r(x*) ) + Δ_ψ(x*, x_k)  (68)
  ≥ α_k ( f(x_k) + ⟨g_k, x_{k+1} − x_k⟩ + r(x_{k+1}) ) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x*, x_{k+1}).  (69)

Following exactly the derivations from (46) to (50), we obtain

  Δ_ψ(x*, x_{k+1}) ≤ Δ_ψ(x*, x_k) + α_k ⟨g_k, x* − x_{k+1}⟩ + α_k ( r(x*) − r(x_{k+1}) ) − Δ_ψ(x_{k+1}, x_k)  (70)
  ≤ …  (71)
  ≤ Δ_ψ(x*, x_k) − α_k ( f(x_k) + r(x_{k+1}) − f(x*) − r(x*) ) + (α_k²/2σ) ‖g_k‖_*².  (72)

This is almost the same as (50), except that we have r(x_{k+1}) here, not r(x_k). Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote δ_k = Δ_ψ(x*, x_k); then

  f(x_k) + r(x_{k+1}) − f(x*) − r(x*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/2σ) ‖g_k‖_*².  (73)

Summing up from k = 1 to T, we obtain

  r(x_{T+1}) − r(x₁) + Σ_{k=1}^T ( h(x_k) − h(x*) ) ≤ δ₁/α₁ + Σ_{k=2}^T δ_k (1/α_k − 1/α_{k−1}) − δ_{T+1}/α_T + (G²/2σ) Σ_{k=1}^T α_k  (74)
  ≤ R² ( 1/α₁ + Σ_{k=2}^T (1/α_k − 1/α_{k−1}) ) + (G²/2σ) Σ_{k=1}^T α_k  (75)
  = R²/α_T + (G²/2σ) Σ_{k=1}^T α_k.  (76)

Suppose we choose x₁ = argmin_x r(x), which ensures r(x_{T+1}) − r(x₁) ≥ 0. Setting α_k = (R/G)√(σ/k), we get

  Σ_{k=1}^T ( h(x_k) − h(x*) ) ≤ (RG/√σ) ( √T + Σ_{k=1}^T 1/(2√k) ) = (RG/√σ) O(√T).  (77)

Therefore min_{k∈{1,…,T}} {h(x_k) − h(x*)} decays at the rate O(RG/√(σT)).

2.2 Online learning

The protocol of online learning is shown in Algorithm 1. The player's goal in online learning is to minimize the regret, the excess loss over the minimal possible loss Σ_k f_k(x) over all fixed x:

  Regret = Σ_{k=1}^T f_k(x_k) − min_{x∈C} Σ_{k=1}^T f_k(x).  (78)

Note there is no assumption made on how the rival picks f_k, and it can be adversarial.
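The protocol above, with a mirror descent player, can be simulated directly. A minimal sketch with linear losses f_k(x) = ⟨c_k, x⟩ on the simplex and an exponentiated-gradient player; the rival here draws random loss vectors rather than adversarial ones, and the constant in the checked bound is deliberately generous:

```python
import numpy as np

def eg_update(x, g, alpha):
    """Entropic mirror descent step on the simplex (update (37))."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

rng = np.random.default_rng(0)
n, T = 5, 2000
cs = rng.uniform(0, 1, size=(T, n))   # the rival's loss vectors, revealed one per round
x = np.full(n, 1.0 / n)
player_loss = 0.0
for k in range(1, T + 1):
    c = cs[k - 1]
    player_loss += c @ x                               # step 5: suffer f_k(x_k)
    x = eg_update(x, c, alpha=np.sqrt(np.log(n) / k))  # step 6: mirror descent update
best_fixed = cs.sum(axis=0).min()     # min_x sum_k f_k(x) is attained at a vertex
regret = player_loss - best_fixed
assert regret <= 3.0 * np.sqrt(T * np.log(n))          # O(sqrt(T)) regret, generous constant
```

Because the comparator in (78) is a single fixed x chosen in hindsight, the best fixed point for linear losses is simply the vertex with the smallest cumulative cost.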
After obtaining f_k at iteration k, let us update the model to x_{k+1} by using the mirror descent rule on the function f_k only:

  x_{k+1} = argmin_{x∈C} { f_k(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k) },  where g_k ∈ ∂f_k(x_k).  (79)
Then it is easy to derive the regret bound. Using f_k in step (50), we have

  f_k(x_k) − f_k(x*) ≤ (1/α_k)( Δ_ψ(x*, x_k) − Δ_ψ(x*, x_{k+1}) ) + (α_k/2σ) ‖g_k‖_*².  (80)

Summing up from k = 1 to T and using the same process as in (74) to (77), we get

  Σ_{k=1}^T ( f_k(x_k) − f_k(x*) ) ≤ (RG/√σ) O(√T).  (81)

So the regret grows in the order of O(√T).

f_k strongly convex. Exactly use (55) with f_k in place of f, and we can derive the O(log T) regret bound immediately.

f_k has Lipschitz continuous gradient. The result in (64) can NOT be extended to the online setting, because if we replace f by f_k we will get f_k(x_{k+1}) − f_k(x*) on the right-hand side. Telescoping will then not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from O(√T) (as for nonsmooth objectives) to O(log T).

Composite objective. In the online setting, both the player and the rival know r(x), and the rival changes f_k(x) at each iteration. The loss incurred at each iteration is h_k(x_k) = f_k(x_k) + r(x_k). The update rule is

  x_{k+1} = argmin_{x∈C} { f_k(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k) },  where g_k ∈ ∂f_k(x_k).  (82)

Note that in this setting, (73) becomes

  f_k(x_k) + r(x_{k+1}) − f_k(x*) − r(x*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/2σ) ‖g_k‖_*².  (83)

Although we have r(x_{k+1}) here rather than r(x_k), this is fine because r does not change through the iterations. Choosing x₁ = argmin_x r(x) and telescoping in the same way as from (74) to (77), we immediately obtain

  Σ_{k=1}^T ( h_k(x_k) − h_k(x*) ) ≤ (RG/√σ) O(√T).  (84)

So the regret grows at O(√T). When the f_k are strongly convex, we can get O(log T) regret for the composite case. But as expected, Lipschitz continuity of the gradient of f_k alone cannot reduce the regret from O(√T) to O(log T).

2.3 Stochastic optimization

Let us consider optimizing a function which takes the form of an expectation:

  min_x F(x) := E_{ω∼p}[f(x; ω)],  (85)

where p is a distribution over ω. This subsumes a lot of machine learning models. For example, the SVM objective is

  F(x) = (1/m) Σ_{i=1}^m max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2) ‖x‖².
  (86)

It can be interpreted as (85) where ω is uniformly distributed in {1, 2, …, m} (i.e., p(ω = i) = 1/m), and

  f(x; i) = max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2) ‖x‖².  (87)

When m is large, it can be costly to calculate F and its subgradient. So a simple idea is to base the updates on a single randomly chosen data point. This can be considered as a special case of online learning in Algorithm 1, where the rival in step 4 now randomly picks f_k as f(·; ω_k), with ω_k drawn independently from p. Ideally we hope that by using the mirror descent updates, x_k will gradually approach the minimizer
Algorithm 2: Protocol of online learning with a stochastic rival
  1  The player initializes a model x₁.
  2  for k = 1, 2, … do
  3    The player proposes a model x_k.
  4    The rival randomly draws an ω_k from p, which defines a function f_k(x) := f(x; ω_k).
  5    The player suffers a loss f_k(x_k).
  6    The player gets access to f_k and uses it to update its model to x_{k+1} by, e.g., mirror descent (79).

of F(x). Intuitively this is quite reasonable, since by using f_k we can compute an unbiased estimate of F(x_k) and of a subgradient of F at x_k (because the ω_k are sampled iid from p). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.

In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays ω_k at iteration k. Then an online learning algorithm A is simply a deterministic mapping from an ordered set {ω₁, …, ω_k} to x_{k+1}. Denote by A(∅) the initial model x₁. Then the following theorem is the key to online-to-batch conversion.

Theorem 3 Suppose an online learning algorithm A has regret bound R_k after running Algorithm 2 for k iterations. Suppose ω₁, …, ω_{T+1} are drawn iid from p. Define x̂ = A(ω_{j+1}, …, ω_T), where j is drawn uniformly at random from {0, …, T}. Then

  E[F(x̂)] − min_x F(x) ≤ R_{T+1}/(T+1),  (88)

where the expectation is with respect to the randomness of ω₁, …, ω_T and j. Similarly we can have high-probability bounds, which can be stated in a form like (not exactly true)

  F(x̂) − min_x F(x) ≤ ( R_{T+1} + O(√((T+1) log(1/δ))) ) / (T+1)  with probability at least 1 − δ,  (89)

where the probability is with respect to the randomness of ω₁, …, ω_T and j.

Proof of Theorem 3.
  E[F(x̂)] = E_{j, ω₁,…,ω_{T+1}}[ f(x̂; ω_{T+1}) ] = E_{j, ω₁,…,ω_{T+1}}[ f(A(ω_{j+1}, …, ω_T); ω_{T+1}) ]  (90)
  = E_{ω₁,…,ω_{T+1}}[ (1/(T+1)) Σ_{j=0}^T f(A(ω_{j+1}, …, ω_T); ω_{T+1}) ]  (as j is drawn uniformly at random)  (91)
  = E_{ω₁,…,ω_{T+1}}[ (1/(T+1)) Σ_{j=0}^T f(A(ω₁, …, ω_{T−j}); ω_{T−j+1}) ]  (shift iteration index, by iid-ness of the ω_i)  (92)
  = E_{ω₁,…,ω_{T+1}}[ (1/(T+1)) Σ_{s=1}^{T+1} f(A(ω₁, …, ω_{s−1}); ω_s) ]  (change of variable s = T − j + 1)  (93)
  ≤ E_{ω₁,…,ω_{T+1}}[ (1/(T+1)) ( min_x Σ_{s=1}^{T+1} f(x; ω_s) + R_{T+1} ) ]  (apply the regret bound)  (94)
  ≤ (1/(T+1)) min_x E_{ω₁,…,ω_{T+1}}[ Σ_{s=1}^{T+1} f(x; ω_s) ] + R_{T+1}/(T+1)  (expectation of min is smaller than min of expectation)  (95)
  = min_x F(x) + R_{T+1}/(T+1).  (96)
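As a concrete instance of Algorithm 2, here is a sketch of stochastic subgradient descent on the SVM objective (85)–(87) with the Euclidean mirror map; iterate averaging is used as a common online-to-batch variant (Theorem 3 itself outputs a random suffix instead). The data and constants are synthetic and illustrative:

```python
import numpy as np

# Stochastic subgradient descent on the SVM objective (86)-(87), Euclidean mirror map.
rng = np.random.default_rng(1)
m, d, lam = 200, 5, 0.1
true_x = rng.normal(size=d)
A = rng.normal(size=(m, d))
c = np.sign(A @ true_x)                  # separable labels in {-1, +1}

def F(x):
    """Full objective F(x) = (1/m) sum_i max(0, 1 - c_i <a_i, x>) + (lam/2) ||x||^2."""
    return np.mean(np.maximum(0.0, 1.0 - c * (A @ x))) + 0.5 * lam * (x @ x)

def stoch_subgrad(x, i):
    """A subgradient of f(x; i) = max(0, 1 - c_i <a_i, x>) + (lam/2) ||x||^2."""
    g = lam * x
    if 1.0 - c[i] * (A[i] @ x) > 0.0:    # hinge is active
        g = g - c[i] * A[i]
    return g

x = np.zeros(d)
avg = np.zeros(d)
T = 5000
for k in range(1, T + 1):
    i = rng.integers(m)                        # the rival draws omega_k uniformly from {1..m}
    x = x - stoch_subgrad(x, i) / (lam * k)    # step 1/(lam*k): f is lam-strongly convex
    avg += (x - avg) / k                       # running average of the iterates
assert F(avg) < F(np.zeros(d))                 # improved over the zero model
```

Each f_k gives an unbiased subgradient of F, so the online regret bound converts into an expected suboptimality bound for the averaged (or randomly chosen) iterate, exactly as in Theorem 3.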
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course otes for EE7C (Spring 018): Conve Optimization and Approimation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Ma Simchowitz Email: msimchow+ee7c@berkeley.edu October
More informationOne Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties
One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationLecture 14: Newton s Method
10-725/36-725: Conve Optimization Fall 2016 Lecturer: Javier Pena Lecture 14: Newton s ethod Scribes: Varun Joshi, Xuan Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes
More informationMath 273a: Optimization Convex Conjugacy
Math 273a: Optimization Convex Conjugacy Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Convex conjugate (the Legendre transform) Let f be a closed proper
More informationConvergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity
Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to
More informationInformation geometry of mirror descent
Information geometry of mirror descent Geometric Science of Information Anthea Monod Department of Statistical Science Duke University Information Initiative at Duke G. Raskutti (UW Madison) and S. Mukherjee
More informationCoordinate Update Algorithm Short Course Subgradients and Subgradient Methods
Coordinate Update Algorithm Short Course Subgradients and Subgradient Methods Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 30 Notation f : H R { } is a closed proper convex function domf := {x R n
More informationarxiv: v6 [math.oc] 24 Sep 2018
: The First Direct Acceleration of Stochastic Gradient Methods version 6 arxiv:63.5953v6 [math.oc] 4 Sep 8 Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University / Institute for Advanced Study March
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationBeyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationRegret bounded by gradual variation for online convex optimization
Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date
More informationIteration-complexity of first-order penalty methods for convex programming
Iteration-complexity of first-order penalty methods for convex programming Guanghui Lan Renato D.C. Monteiro July 24, 2008 Abstract This paper considers a special but broad class of convex programing CP)
More informationCS229T/STATS231: Statistical Learning Theory. Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018
CS229T/STATS231: Statistical Learning Theory Lecturer: Tengyu Ma Lecture 11 Scribe: Jongho Kim, Jamie Kang October 29th, 2018 1 Overview This lecture mainly covers Recall the statistical theory of GANs
More informationOnline Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016
Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural
More informationGradient methods for minimizing composite functions
Math. Program., Ser. B 2013) 140:125 161 DOI 10.1007/s10107-012-0629-5 FULL LENGTH PAPER Gradient methods for minimizing composite functions Yu. Nesterov Received: 10 June 2010 / Accepted: 29 December
More informationLecture 6 Lecturer: Shaddin Dughmi Scribes: Omkar Thakoor & Umang Gupta
CSCI699: opics in Learning & Game heory Lecture 6 Lecturer: Shaddin Dughmi Scribes: Omkar hakoor & Umang Gupta No regret learning in zero-sum games Definition. A 2-player game of complete information is
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationHW #1 SOLUTIONS. g(x) sin 1 x
HW #1 SOLUTIONS 1. Let f : [a, b] R and let v : I R, where I is an interval containing f([a, b]). Suppose that f is continuous at [a, b]. Suppose that lim s f() v(s) = v(f()). Using the ɛ δ definition
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationSmoothing Proximal Gradient Method. General Structured Sparse Regression
for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:
More informationLecture 6: September 12
10-725: Optimization Fall 2013 Lecture 6: September 12 Lecturer: Ryan Tibshirani Scribes: Micol Marchetti-Bowick Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have not
More informationtoo, of course, but is perhaps overkill here.
LUNDS TEKNISKA HÖGSKOLA MATEMATIK LÖSNINGAR OPTIMERING 018-01-11 kl 08-13 1. a) CQ points are shown as A and B below. Graphically: they are the only points that share the same tangent line for both active
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationA Primer on Multidimensional Optimization
A Primer on Multidimensional Optimization Prof. Dr. Florian Rupp German University of Technology in Oman (GUtech) Introduction to Numerical Methods for ENG & CS (Mathematics IV) Spring Term 2016 Eercise
More informationAdvanced Machine Learning
Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.
More informationDelay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms
Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,
More informationLecture 2: Subgradient Methods
Lecture 2: Subgradient Methods John C. Duchi Stanford University Park City Mathematics Institute 206 Abstract In this lecture, we discuss first order methods for the minimization of convex functions. We
More informationOnline Passive-Aggressive Algorithms
Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il
More informationBregman Divergences for Data Mining Meta-Algorithms
p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,
More informationOptimization and Optimal Control in Banach Spaces
Optimization and Optimal Control in Banach Spaces Bernhard Schmitzer October 19, 2017 1 Convex non-smooth optimization with proximal operators Remark 1.1 (Motivation). Convex optimization: easier to solve,
More information