Bregman Divergence and Mirror Descent


1 Bregman Divergence

Motivation. Generalize the squared Euclidean distance to a class of distances that all share similar properties. Such divergences have many applications in machine learning, e.g. clustering and exponential families.

Definition 1 (Bregman divergence) Let ψ : Ω → R be a function that is: a) strictly convex, b) continuously differentiable, c) defined on a closed convex set Ω. Then the Bregman divergence is defined as

    Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩,   ∀ x, y ∈ Ω.    (1)

That is, the difference between the value of ψ at x and the first-order Taylor expansion of ψ around y evaluated at the point x.

Examples

Euclidean distance. Let ψ(x) = ‖x‖². Then Δ_ψ(x, y) = ‖x − y‖².

KL-divergence. Let Ω = {x ∈ R^n_+ : Σ_i x_i = 1} and ψ(x) = Σ_i x_i log x_i. Then Δ_ψ(x, y) = Σ_i x_i log(x_i / y_i) for x, y ∈ Ω. This is called the relative entropy, or Kullback-Leibler divergence, between the probability distributions x and y.

ℓ_p norm. Let p ≥ 1 and 1/p + 1/q = 1, and let ψ(x) = ½‖x‖²_q. Then Δ_ψ(x, y) = ½‖x‖²_q + ½‖y‖²_q − ⟨x, ∇(½‖y‖²_q)⟩. Note ½‖y‖²_q is not necessarily continuously differentiable, which makes this case not precisely consistent with our definition.

Properties of Bregman divergence

Strict convexity in the first argument. Trivial by the strict convexity of ψ.

Nonnegativity: Δ_ψ(x, y) ≥ 0 for all x, y, and Δ_ψ(x, y) = 0 if and only if x = y. Trivial by strict convexity.

Asymmetry: in general, Δ_ψ(x, y) ≠ Δ_ψ(y, x) (e.g. the KL-divergence). Symmetrization is not always useful.

Non-convexity in the second argument. Let Ω = [1, ∞) and ψ(x) = −log x. Then Δ_ψ(x, y) = −log x + log y + x/y − 1. One can check that its second-order derivative in y is (2x − y)/y³, which is negative when 2x < y.

Linearity in ψ. For any a > 0, Δ_{ψ+aϕ}(x, y) = Δ_ψ(x, y) + a Δ_ϕ(x, y).

Gradient in x: ∇_x Δ_ψ(x, y) = ∇ψ(x) − ∇ψ(y). The gradient in y is trickier, and not commonly used.

Generalized triangle inequality:

    Δ_ψ(x, y) + Δ_ψ(y, z) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ + ψ(y) − ψ(z) − ⟨∇ψ(z), y − z⟩    (2)
                          = Δ_ψ(x, z) + ⟨x − y, ∇ψ(z) − ∇ψ(y)⟩.    (3)
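To make the definition concrete, here is a small numerical sketch in Python/NumPy (the helper names bregman, sq_euclidean, and neg_entropy are illustrative, not from the notes) that evaluates Δ_ψ for the squared Euclidean and negative-entropy generators and illustrates nonnegativity and asymmetry.

```python
import numpy as np

def bregman(psi, grad_psi, x, y):
    """Delta_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>,  as in (1)."""
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

# Generator 1: squared Euclidean norm, psi(x) = ||x||^2.
sq_euclidean = (lambda x: x @ x, lambda x: 2.0 * x)

# Generator 2: negative entropy on the simplex, psi(x) = sum_i x_i log x_i.
neg_entropy = (lambda x: np.sum(x * np.log(x)), lambda x: np.log(x) + 1.0)

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

d_euc = bregman(*sq_euclidean, x, y)       # equals ||x - y||^2
d_kl  = bregman(*neg_entropy, x, y)        # equals KL(x || y) on the simplex

print(d_euc, np.sum((x - y) ** 2))         # the two values should match
print(d_kl, np.sum(x * np.log(x / y)))     # the two values should match
print(bregman(*neg_entropy, y, x))         # generally different from d_kl (asymmetry)
```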

Special case: ψ is called strongly convex with respect to some norm ‖·‖ with modulus σ if

    ψ(x) ≥ ψ(y) + ⟨∇ψ(y), x − y⟩ + (σ/2)‖x − y‖².    (4)

Note the norm here is not necessarily the Euclidean norm. When the norm is Euclidean, this condition is equivalent to ψ(x) − (σ/2)‖x‖² being convex. For example, the ψ(x) = Σ_i x_i log x_i used in the KL-divergence is 1-strongly convex over the simplex Ω = {x ∈ R^n_+ : Σ_i x_i = 1} with respect to the ℓ_1 norm.

When ψ is σ strongly convex, we have

    Δ_ψ(x, y) ≥ (σ/2)‖x − y‖².    (5)

Proof: By definition, Δ_ψ(x, y) = ψ(x) − ψ(y) − ⟨∇ψ(y), x − y⟩ ≥ (σ/2)‖x − y‖².

Duality. Suppose ψ is strongly convex. Then

    (∇ψ*)(∇ψ(x)) = x,    Δ_ψ(x, y) = Δ_{ψ*}(∇ψ(y), ∇ψ(x)).    (6)

Proof (for the first equality only): Recall

    ψ*(y) = sup_{z∈Ω} {⟨z, y⟩ − ψ(z)}.    (7)

The sup must be attainable because ψ is strongly convex and Ω is closed. x is a maximizer if and only if y = ∇ψ(x). So

    ψ*(y) + ψ(x) = ⟨x, y⟩  ⟺  y = ∇ψ(x).    (8)

Since ψ** = ψ, we also have ψ*(y) + ψ**(x) = ⟨x, y⟩, which means y is the maximizer in

    ψ**(x) = sup_z {⟨x, z⟩ − ψ*(z)}.    (9)

This means x = ∇ψ*(y). To summarize, (∇ψ*)(∇ψ(x)) = x.

Mean of a distribution. Suppose U is a random variable over an open set S with distribution µ. Then

    min_{x∈S} E_{U∼µ}[Δ_ψ(U, x)]    (10)

is optimized at ū := E_µ[U] = ∫_S u µ(u) du.

Proof: For any x ∈ S, we have

    E_{U∼µ}[Δ_ψ(U, x)] − E_{U∼µ}[Δ_ψ(U, ū)]    (11)
    = E_µ[ψ(U) − ψ(x) − ⟨U − x, ∇ψ(x)⟩ − ψ(U) + ψ(ū) + ⟨U − ū, ∇ψ(ū)⟩]    (12)
    = ψ(ū) − ψ(x) + ⟨x, ∇ψ(x)⟩ − ⟨ū, ∇ψ(ū)⟩ + E_µ[−⟨U, ∇ψ(x)⟩ + ⟨U, ∇ψ(ū)⟩]    (13)
    = ψ(ū) − ψ(x) − ⟨ū − x, ∇ψ(x)⟩    (14)
    = Δ_ψ(ū, x).    (15)

This must be nonnegative, and it is 0 if and only if x = ū.

Pythagorean theorem. If x̄ is the projection of x_0 onto a convex set C ⊆ Ω,

    x̄ = argmin_{x∈C} Δ_ψ(x, x_0),    (16)

then for all y ∈ C,

    Δ_ψ(y, x_0) ≥ Δ_ψ(y, x̄) + Δ_ψ(x̄, x_0).    (17)

In the Euclidean case, it means the angle ∠ y x̄ x_0 is obtuse.
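As a quick numerical sanity check on the duality property (6) and the mean property (10), here is a hedged Python sketch using the negative entropy on the positive orthant, for which ∇ψ(x) = log x + 1 and ∇ψ*(y) = exp(y − 1) have closed forms (standard facts, not derived in the notes); the sample distribution is an illustrative choice.

```python
import numpy as np

# Negative entropy on the positive orthant: psi(x) = sum_i x_i log x_i.
grad_psi      = lambda x: np.log(x) + 1.0      # gradient of psi
grad_psi_star = lambda y: np.exp(y - 1.0)      # gradient of the conjugate psi*

x = np.array([0.7, 1.3, 1.8])

# Property (6): the conjugate gradient map inverts the gradient map.
print(grad_psi_star(grad_psi(x)))              # recovers x

# Property (10): E[Delta_psi(U, x)] is minimized at the mean E[U].
rng = np.random.default_rng(0)
U = rng.uniform(0.5, 2.0, size=(10000, 3))     # samples of a random vector U

def expected_div(x):
    # Delta_psi(u, x) averaged over the samples of U
    d = (U * np.log(U)).sum(1) - np.sum(x * np.log(x)) - (U - x) @ grad_psi(x)
    return d.mean()

print(expected_div(U.mean(0)), expected_div(x))  # the mean gives the smaller value
```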

More generally, the projection property extends to the following lemma.

Lemma 2 Suppose L is a proper convex function whose domain is an open set containing C. L is not necessarily differentiable. Let

    x̄ = argmin_{x∈C} {L(x) + Δ_ψ(x, x_0)}.    (18)

Then for any y ∈ C we have

    L(y) + Δ_ψ(y, x_0) ≥ L(x̄) + Δ_ψ(x̄, x_0) + Δ_ψ(y, x̄).    (19)

The projection in (16) is just the special case L = 0. This property is the key to the analysis of many optimization algorithms using Bregman divergences.

Proof: Denote J(x) = L(x) + Δ_ψ(x, x_0). Since x̄ minimizes J over C, there must exist a subgradient d ∈ ∂J(x̄) such that

    ⟨d, x − x̄⟩ ≥ 0,   ∀ x ∈ C.    (20)

Since ∂J(x̄) = {g + ∇_x Δ_ψ(x, x_0)|_{x=x̄} : g ∈ ∂L(x̄)} = {g + ∇ψ(x̄) − ∇ψ(x_0) : g ∈ ∂L(x̄)}, there must be a subgradient g ∈ ∂L(x̄) such that

    ⟨g + ∇ψ(x̄) − ∇ψ(x_0), x − x̄⟩ ≥ 0,   ∀ x ∈ C.    (21)

Therefore, using the property of subgradients, we have for all y ∈ C that

    L(y) ≥ L(x̄) + ⟨g, y − x̄⟩    (22)
         ≥ L(x̄) + ⟨∇ψ(x_0) − ∇ψ(x̄), y − x̄⟩    (23)
         = L(x̄) − ⟨∇ψ(x_0), x̄ − x_0⟩ + ψ(x̄) − ψ(x_0)    (24)
              + ⟨∇ψ(x_0), y − x_0⟩ − ψ(y) + ψ(x_0)    (25)
              − ⟨∇ψ(x̄), y − x̄⟩ + ψ(y) − ψ(x̄)    (26)
         = L(x̄) + Δ_ψ(x̄, x_0) − Δ_ψ(y, x_0) + Δ_ψ(y, x̄).    (27)

Rearranging completes the proof.

2 Mirror Descent

Why bother? Because the rate of convergence of subgradient descent often depends on the dimension of the problem. Suppose we want to minimize a function f over a set C. Recall the subgradient descent rule

    x̃_{k+1} = x_k − α_k g_k,   where g_k ∈ ∂f(x_k),    (28)
    x_{k+1} = argmin_{x∈C} ‖x − x̃_{k+1}‖ = argmin_{x∈C} ‖x − (x_k − α_k g_k)‖.    (29)

This can be interpreted as follows. First approximate f around x_k by a first-order Taylor expansion,

    f(x) ≈ f(x_k) + ⟨g_k, x − x_k⟩.    (30)

Then penalize the displacement by (1/(2α_k))‖x − x_k‖². So the update rule is to find a regularized minimizer of the model,

    x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/(2α_k))‖x − x_k‖² }.    (31)

It is trivial to see this is exactly equivalent to (29). To generalize the method beyond the Euclidean distance, it is straightforward to use the Bregman divergence as the measure of displacement:

    x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k) }    (32)
            = argmin_{x∈C} { α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + Δ_ψ(x, x_k) }.    (33)

Mirror descent interpretation. Suppose the constraint set C is the whole space (i.e. there is no constraint). Then we can take the gradient with respect to x and obtain the optimality condition

    g_k + (1/α_k)(∇ψ(x_{k+1}) − ∇ψ(x_k)) = 0    (34)
    ⟺ ∇ψ(x_{k+1}) = ∇ψ(x_k) − α_k g_k    (35)
    ⟺ x_{k+1} = (∇ψ)^{−1}(∇ψ(x_k) − α_k g_k) = (∇ψ*)(∇ψ(x_k) − α_k g_k).    (36)

For example, with the KL-divergence over the simplex, the update rule becomes (up to normalization when the simplex constraint is enforced)

    x_{k+1}(i) = x_k(i) exp(−α_k g_k(i)).    (37)
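The update (36)-(37) is simple to implement. Below is a minimal Python sketch of mirror descent on the probability simplex with the entropy mirror map (exponentiated gradient); the renormalization implements the KL projection onto the simplex (a standard fact, not derived in the notes), and the linear objective and step size are illustrative choices.

```python
import numpy as np

def md_step_simplex(x, g, alpha):
    """One mirror descent step (37) with the entropy mirror map:
    multiplicative update followed by renormalization onto the simplex."""
    w = x * np.exp(-alpha * g)
    return w / w.sum()

# Illustrative objective on the simplex: f(x) = <c, x>, so the (sub)gradient is c.
c = np.array([0.9, 0.1, 0.5, 0.3])
f_grad = lambda x: c

x = np.full(4, 0.25)                       # start at the uniform distribution
for k in range(1, 201):
    x = md_step_simplex(x, f_grad(x), alpha=0.5 / np.sqrt(k))

print(x)    # mass concentrates on the coordinate with the smallest cost c_i
```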

Rate of convergence. Recall that in unconstrained subgradient descent we followed 4 steps.

1. Bound on a single update:
    ‖x_{k+1} − x^*‖² ≤ ‖x_k − α_k g_k − x^*‖²    (38)
    = ‖x_k − x^*‖² − 2α_k ⟨g_k, x_k − x^*⟩ + α_k² ‖g_k‖²    (39)
    ≤ ‖x_k − x^*‖² − 2α_k (f(x_k) − f(x^*)) + α_k² ‖g_k‖².    (40)

2. Telescope:
    ‖x_{T+1} − x^*‖² ≤ ‖x_1 − x^*‖² − 2 Σ_{k=1}^T α_k (f(x_k) − f(x^*)) + Σ_{k=1}^T α_k² ‖g_k‖².    (41)

3. Bounding ‖x_1 − x^*‖ by R and ‖g_k‖ by G:
    2 Σ_{k=1}^T α_k (f(x_k) − f(x^*)) ≤ R² + G² Σ_{k=1}^T α_k².    (42)

4. Denote ε_k = f(x_k) − f(x^*) and rearrange:
    min_{k∈{1,…,T}} ε_k ≤ (R² + G² Σ_{k=1}^T α_k²) / (2 Σ_{k=1}^T α_k).    (43)
By setting the step size α_k judiciously, we can achieve
    min_{k∈{1,…,T}} ε_k ≤ RG / √T.    (44)

Suppose C is the simplex. Then R ≤ √2. If each coordinate of each gradient g_k is upper bounded by M, then G can be as large as M√n, i.e. it depends on the dimension.

Clearly steps 2 to 4 can be easily extended by replacing ‖x^* − x_{k+1}‖² with Δ_ψ(x^*, x_{k+1}). So the only challenge left is to extend step 1. This is actually possible via Lemma 2. We further assume ψ is σ strongly convex. In (33), consider α_k (f(x_k) + ⟨g_k, x − x_k⟩) as the L in Lemma 2. Then

    α_k (f(x_k) + ⟨g_k, x^* − x_k⟩) + Δ_ψ(x^*, x_k) ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x^*, x_{k+1}).    (45)

Canceling some terms and rearranging, we obtain

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) + α_k ⟨g_k, x^* − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)    (46)
    = Δ_ψ(x^*, x_k) + α_k ⟨g_k, x^* − x_k⟩ + α_k ⟨g_k, x_k − x_{k+1}⟩ − Δ_ψ(x_{k+1}, x_k)    (47)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + α_k ⟨g_k, x_k − x_{k+1}⟩ − (σ/2)‖x_k − x_{k+1}‖²    (48)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + α_k ‖g_k‖ ‖x_k − x_{k+1}‖ − (σ/2)‖x_k − x_{k+1}‖²    (49)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + (α_k²/(2σ)) ‖g_k‖².    (50)

Comparing with (40), we have successfully replaced ‖x^* − x_{k+1}‖² with Δ_ψ(x^*, x_{k+1}). Again upper bound Δ_ψ(x^*, x_1) by R² and ‖g_k‖ by G. Note the norm on g_k is now the dual norm.

To see the advantage of mirror descent, suppose C is the n dimensional simplex and we use the KL-divergence, for which ψ is 1-strongly convex with respect to the ℓ_1 norm. The dual norm of the ℓ_1 norm is the ℓ_∞ norm. Then we can bound Δ_ψ(x^*, x_1) by using the KL-divergence, and it is at most log n. G can be upper bounded by M. So, in terms of the value of RG, mirror descent is smaller than subgradient descent by an order of O(√(n / log n)).
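To get a feel for the dimension dependence just discussed, here is a small back-of-the-envelope computation in Python (a sketch assuming M = 1, the uniform starting point x_1, and the bounds quoted above) comparing the constant RG that enters the bound (44) for the two geometries.

```python
import numpy as np

M = 1.0                       # bound on each coordinate of the gradients
for n in [10**2, 10**4, 10**6]:
    # Euclidean / subgradient descent on the simplex:
    #   R <= sqrt(2), G <= M * sqrt(n)  (l2 norm of the gradient)
    rg_subgrad = np.sqrt(2.0) * M * np.sqrt(n)
    # Entropic mirror descent on the simplex:
    #   R^2 = KL(x* || uniform) <= log n, G <= M  (l_inf dual norm)
    rg_mirror = np.sqrt(np.log(n)) * M
    print(n, rg_subgrad, rg_mirror, rg_subgrad / rg_mirror)
```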

Acceleration 1: f is strongly convex. We say f is strongly convex with respect to another convex function ψ with modulus λ if

    f(x) ≥ f(y) + ⟨g, x − y⟩ + λ Δ_ψ(x, y),   ∀ g ∈ ∂f(y).    (51)

Note we do not assume f is differentiable. Now, in the step from (47) to (48), we can plug in the definition of strong convexity:

    Δ_ψ(x^*, x_{k+1}) ≤ … + α_k ⟨g_k, x^* − x_k⟩ + …   (copy of (47))    (52)
    ≤ … − α_k (f(x_k) − f(x^*) + λ Δ_ψ(x^*, x_k)) + …    (53)
    ≤ …    (54)
    ≤ (1 − λα_k) Δ_ψ(x^*, x_k) − α_k (f(x_k) − f(x^*)) + (α_k²/(2σ)) ‖g_k‖².    (55)

Denote δ_k = Δ_ψ(x^*, x_k) and set α_k = 1/(λk). Then

    δ_{k+1} ≤ ((k−1)/k) δ_k − (1/(λk)) ε_k + G²/(2σλ²k²)  ⟺  k δ_{k+1} ≤ (k−1) δ_k − (1/λ) ε_k + G²/(2σλ²k).    (56)

Now telescope (sum up both sides from k = 1 to T):

    T δ_{T+1} ≤ −(1/λ) Σ_{k=1}^T ε_k + (G²/(2σλ²)) Σ_{k=1}^T 1/k
    ⟹ min_{k∈{1,…,T}} ε_k ≤ (1/T) Σ_{k=1}^T ε_k ≤ (G²/(2σλT)) Σ_{k=1}^T 1/k ≤ (G²/(σλ)) O((log T)/T).    (57)

Acceleration 2: f has Lipschitz continuous gradient. If the gradient of f is Lipschitz continuous, there exists L > 0 such that

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,   ∀ x, y.    (58)

Sometimes we just directly say f is smooth. It is also known that this is equivalent to

    f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (L/2)‖x − y‖².    (59)

We bound the ⟨g_k, x^* − x_{k+1}⟩ term in (46) as follows (here g_k = ∇f(x_k)):

    ⟨g_k, x^* − x_{k+1}⟩ = ⟨g_k, x^* − x_k⟩ + ⟨g_k, x_k − x_{k+1}⟩    (60)
    ≤ f(x^*) − f(x_k) + f(x_k) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖²    (61)
    = f(x^*) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖².    (62)

Plugging into (46), we get

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) + α_k (f(x^*) − f(x_{k+1}) + (L/2)‖x_k − x_{k+1}‖²) − (σ/2)‖x_k − x_{k+1}‖².    (63)

Setting α_k = σ/L, we get

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) − (σ/L)(f(x_{k+1}) − f(x^*)).    (64)

Telescoping, we get

    min_{k∈{2,…,T+1}} f(x_k) − f(x^*) ≤ L Δ_ψ(x^*, x_1)/(σT) ≤ LR²/(σT).    (65)

This gives an O(1/T) convergence rate. But if we are smarter, like Nesterov, the rate can be improved to O(1/T²). We will not go into the details, but the algorithm and proof are again based on Lemma 2. This is often called the accelerated proximal gradient method.

2.1 Composite Objective

Suppose the objective function is h(x) = f(x) + r(x), where f is smooth and r(x) is simple, like ‖x‖_1. If we directly apply the above rates to optimizing h, we get an O(1/√T) rate of convergence because h is not smooth. It would be nice if we could enjoy the O(1/T) rate as in smooth optimization. Fortunately this is possible thanks to the simplicity of r(x), and we only need to extend the proximal operator (33) as follows:

    x_{k+1} = argmin_{x∈C} { f(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k) }    (66)
            = argmin_{x∈C} { α_k f(x_k) + α_k ⟨g_k, x − x_k⟩ + α_k r(x) + Δ_ψ(x, x_k) }.    (67)
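Here is a minimal Python sketch of the composite update (66), assuming the Euclidean case ψ(x) = ½‖x‖², an unconstrained C, and r(x) = λ‖x‖_1, for which the update has the well-known closed-form soft-thresholding solution; the data and the regularization weight are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form minimizer of  t*||x||_1 + 0.5*||x - v||^2  (coordinate-wise)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Illustrative composite problem: h(x) = 0.5*||Ax - b||^2 + lam*||x||_1.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
lam = 0.5

L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of grad f
alpha = 1.0 / L                      # step size sigma/L with sigma = 1
x = np.zeros(20)
for _ in range(500):
    g = A.T @ (A @ x - b)            # gradient of the smooth part f
    x = soft_threshold(x - alpha * g, alpha * lam)   # composite update (66)

print(np.count_nonzero(x))           # the l1 term produces a sparse solution
```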

Here we use a first-order Taylor approximation of f around x_k, but keep r(x) exact. Assuming this proximal operator can be computed efficiently, we can show that all the above rates carry over. We show here only the case of a general f (not necessarily smooth or strongly convex), and leave the other two cases as an exercise. In fact we can again achieve the O(1/T) rate when f has Lipschitz continuous gradient. Consider α_k (f(x_k) + ⟨g_k, x − x_k⟩ + r(x)) as the L in Lemma 2. Then

    α_k (f(x_k) + ⟨g_k, x^* − x_k⟩ + r(x^*)) + Δ_ψ(x^*, x_k)    (68)
    ≥ α_k (f(x_k) + ⟨g_k, x_{k+1} − x_k⟩ + r(x_{k+1})) + Δ_ψ(x_{k+1}, x_k) + Δ_ψ(x^*, x_{k+1}).    (69)

Following exactly the derivations from (46) to (50), we obtain

    Δ_ψ(x^*, x_{k+1}) ≤ Δ_ψ(x^*, x_k) + α_k ⟨g_k, x^* − x_{k+1}⟩ + α_k (r(x^*) − r(x_{k+1})) − Δ_ψ(x_{k+1}, x_k)    (70)
    ≤ …    (71)
    ≤ Δ_ψ(x^*, x_k) − α_k (f(x_k) + r(x_{k+1}) − f(x^*) − r(x^*)) + (α_k²/(2σ)) ‖g_k‖².    (72)

This is almost the same as (50), except that we have r(x_{k+1}) here, not r(x_k). Fortunately this is not a problem as long as we use a slightly different way of telescoping. Denote δ_k = Δ_ψ(x^*, x_k); then

    f(x_k) + r(x_{k+1}) − f(x^*) − r(x^*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/(2σ)) ‖g_k‖².    (73)

Summing up from k = 1 to T, we obtain

    r(x_{T+1}) − r(x_1) + Σ_{k=1}^T (h(x_k) − h(x^*)) ≤ δ_1/α_1 + Σ_{k=2}^T (1/α_k − 1/α_{k−1}) δ_k − δ_{T+1}/α_T + (G²/(2σ)) Σ_{k=1}^T α_k    (74)
    ≤ R² (1/α_1 + Σ_{k=2}^T (1/α_k − 1/α_{k−1})) + (G²/(2σ)) Σ_{k=1}^T α_k    (75)
    = R²/α_T + (G²/(2σ)) Σ_{k=1}^T α_k.    (76)

Suppose we choose x_1 = argmin_x r(x), which ensures r(x_{T+1}) − r(x_1) ≥ 0. Setting α_k = (R/G)√(σ/k), we get

    Σ_{k=1}^T (h(x_k) − h(x^*)) ≤ (RG/√σ)(√T + Σ_{k=1}^T 1/(2√k)) = (RG/√σ) O(√T).    (77)

Therefore min_{k∈{1,…,T}} {h(x_k) − h(x^*)} decays at the rate O(RG/√(σT)).

Algorithm 1: Protocol of online learning
1  The player initializes a model x_1.
2  for k = 1, 2, … do
3    The player proposes a model x_k.
4    The rival picks a function f_k.
5    The player suffers a loss f_k(x_k).
6    The player gets access to f_k and uses it to update its model to x_{k+1}.

2.2 Online learning

The protocol of online learning is shown in Algorithm 1. The player's goal in online learning is to minimize the regret, i.e. the gap between its cumulative loss and the minimal possible loss Σ_k f_k(x) over all fixed x:

    Regret = Σ_{k=1}^T f_k(x_k) − min_{x∈C} Σ_{k=1}^T f_k(x).    (78)

Note there is no assumption made on how the rival picks f_k, and it can be adversarial. After obtaining f_k at iteration k, let us update the model to x_{k+1} by using the mirror descent rule on the function f_k only:

    x_{k+1} = argmin_{x∈C} { f_k(x_k) + ⟨g_k, x − x_k⟩ + (1/α_k) Δ_ψ(x, x_k) },   where g_k ∈ ∂f_k(x_k).    (79)
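The online update (79) looks exactly like the batch one; only the loss changes each round. Below is a hedged Python sketch of (79) with the entropy mirror map on the simplex against randomly drawn linear losses f_k(x) = ⟨c_k, x⟩, printing the realized regret (78); the loss sequence and step size are illustrative choices, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 10, 5000
C = rng.uniform(0.0, 1.0, size=(T, n))         # rival's linear losses f_k(x) = <c_k, x>

x = np.full(n, 1.0 / n)                        # player starts at the uniform distribution
loss = 0.0
for k in range(T):
    g = C[k]                                   # gradient of f_k at x_k
    loss += g @ x                              # player suffers f_k(x_k)
    alpha = np.sqrt(np.log(n) / (k + 1))       # decreasing step size
    x = x * np.exp(-alpha * g)                 # mirror descent step (79) with KL divergence
    x /= x.sum()

best_fixed = C.sum(axis=0).min()               # min_x sum_k f_k(x), attained at a vertex
print(loss - best_fixed, np.sqrt(T * np.log(n)))   # regret vs. the O(sqrt(T log n)) scale
```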

Then it is easy to derive the regret bound. Using f_k in step (50), we have

    f_k(x_k) − f_k(x^*) ≤ (1/α_k)(Δ_ψ(x^*, x_k) − Δ_ψ(x^*, x_{k+1})) + (α_k/(2σ)) ‖g_k‖².    (80)

Summing up from k = 1 to T and using the same process as in (74) to (77), we get

    Σ_{k=1}^T (f_k(x_k) − f_k(x^*)) ≤ (RG/√σ) O(√T).    (81)

So the regret grows in the order of O(√T).

f_k is strongly convex. Use exactly (55) with f_k in place of f, and we can derive an O(log T) regret bound immediately.

f_k has Lipschitz continuous gradient. The result in (64) can NOT be extended to the online setting, because if we replace f by f_k we will get f_k(x_{k+1}) − f_k(x^*) on the right-hand side. Telescoping will not give a regret bound. In fact, it is known that in the online setting, having a Lipschitz continuous gradient by itself cannot reduce the regret bound from O(√T) (as for a nonsmooth objective) to O(log T).

Composite objective. In the online setting, both the player and the rival know r(x), and the rival changes f_k(x) at each iteration. The loss incurred at each iteration is h_k(x_k) = f_k(x_k) + r(x_k). The update rule is

    x_{k+1} = argmin_{x∈C} { f_k(x_k) + ⟨g_k, x − x_k⟩ + r(x) + (1/α_k) Δ_ψ(x, x_k) },   where g_k ∈ ∂f_k(x_k).    (82)

Note that in this setting, (73) becomes

    f_k(x_k) + r(x_{k+1}) − f_k(x^*) − r(x^*) ≤ (1/α_k)(δ_k − δ_{k+1}) + (α_k/(2σ)) ‖g_k‖².    (83)

Although we have r(x_{k+1}) here rather than r(x_k), it is fine because r does not change through the iterations. Choosing x_1 = argmin_x r(x) and telescoping in the same way as from (74) to (77), we immediately obtain

    Σ_{k=1}^T (h_k(x_k) − h_k(x^*)) ≤ (RG/√σ) O(√T).    (84)

So the regret grows at O(√T). When the f_k are strongly convex, we can get O(log T) regret in the composite case. But as expected, having Lipschitz continuity of ∇f_k alone cannot reduce the regret from O(√T) to O(log T).

2.3 Stochastic optimization

Let us consider optimizing a function which takes the form of an expectation,

    min_x F(x) := E_{ω∼p}[f(x; ω)],    (85)

where p is a distribution over ω. This subsumes a lot of machine learning models. For example, the SVM objective is

    F(x) = (1/m) Σ_{i=1}^m max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2)‖x‖².    (86)

It can be interpreted as (85) where ω is uniformly distributed in {1, 2, …, m} (i.e. p(ω = i) = 1/m), and

    f(x; i) = max{0, 1 − c_i ⟨a_i, x⟩} + (λ/2)‖x‖².    (87)

When m is large, it can be costly to calculate F and its subgradient. So a simple idea is to base the updates on a single randomly chosen data point. This can be considered as a special case of online learning in Algorithm 1, where the rival in step 4 now randomly picks f_k as f(·; ω_k), with ω_k drawn independently from p. Ideally we hope that, by using the mirror descent updates, x_k will gradually approach the minimizer of F(x).
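As a concrete instance of this recipe, here is a hedged Python sketch of stochastic (Euclidean) mirror descent for the SVM objective (86): at each step the "rival" draws one index i uniformly and the player takes a subgradient step on f(x; i) from (87). The data, the step-size schedule, and λ are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lam = 1000, 20, 0.01
A = rng.standard_normal((m, d))                     # rows a_i
w_true = rng.standard_normal(d)
c = np.sign(A @ w_true)                             # labels c_i in {-1, +1}

def subgrad(x, i):
    """A subgradient of f(x; i) = max{0, 1 - c_i <a_i, x>} + (lam/2)||x||^2."""
    g = lam * x
    if c[i] * (A[i] @ x) < 1.0:
        g = g - c[i] * A[i]
    return g

x = np.zeros(d)
for k in range(1, 50001):
    i = rng.integers(m)                             # rival draws omega_k uniformly
    x = x - (1.0 / (lam * k)) * subgrad(x, i)       # step size 1/(lam*k): f is lam-strongly convex

F = np.mean(np.maximum(0.0, 1.0 - c * (A @ x))) + 0.5 * lam * x @ x
print(F)                                            # full objective (86) at the final iterate
```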

Intuitively this is quite reasonable, and by using f_k we can compute an unbiased estimate of F(x_k) and a subgradient of F at x_k (because the ω_k are sampled iid from p). This is a particular case of stochastic optimization, and we recap it in Algorithm 2.

Algorithm 2: Protocol of online learning (stochastic setting)
1  The player initializes a model x_1.
2  for k = 1, 2, … do
3    The player proposes a model x_k.
4    The rival randomly draws ω_k from p, which defines a function f_k(x) := f(x; ω_k).
5    The player suffers a loss f_k(x_k).
6    The player gets access to f_k and uses it to update its model to x_{k+1} by, e.g., mirror descent (79).

In fact, the method is valid in a more general setting. For simplicity, let us just say the rival plays ω_k at iteration k. Then an online learning algorithm A is simply a deterministic mapping from an ordered set {ω_1, …, ω_k} to x_{k+1}. Denote by A(∅) the initial model (the image of the empty sequence). Then the following theorem is the key to online-to-batch conversion.

Theorem 3 Suppose an online learning algorithm A has regret bound R_T after running Algorithm 2 for T iterations. Suppose ω_1, …, ω_{T+1} are drawn iid from p. Define x̂ = A(ω_{j+1}, …, ω_T), where j is drawn uniformly at random from {0, …, T}. Then

    E[F(x̂)] − min_x F(x) ≤ R_{T+1}/(T + 1),    (88)

where the expectation is with respect to the randomness of ω_1, …, ω_T and j. Similarly we can have high-probability bounds, which can be stated in a form like (not exactly true)

    F(x̂) − min_x F(x) ≤ R_{T+1}/(T + 1) + O(√((log(1/δ))/T))   with probability at least 1 − δ,    (89)

where the probability is with respect to the randomness of ω_1, …, ω_T and j.

Proof of Theorem 3:

    E[F(x̂)] = E_{j, ω_1,…,ω_{T+1}}[f(x̂; ω_{T+1})] = E_{j, ω_1,…,ω_{T+1}}[f(A(ω_{j+1}, …, ω_T); ω_{T+1})]    (90)
    = E_{ω_1,…,ω_{T+1}}[ (1/(T+1)) Σ_{j=0}^T f(A(ω_{j+1}, …, ω_T); ω_{T+1}) ]   (as j is drawn uniformly at random)    (91)
    = E_{ω_1,…,ω_{T+1}}[ (1/(T+1)) Σ_{j=0}^T f(A(ω_1, …, ω_j); ω_{j+1}) ]   (shift the iteration index, by iid-ness of the ω_i)    (92)
    = (1/(T+1)) E_{ω_1,…,ω_{T+1}}[ Σ_{s=1}^{T+1} f(A(ω_1, …, ω_{s−1}); ω_s) ]   (change of variable s = j + 1)    (93)
    ≤ (1/(T+1)) E_{ω_1,…,ω_{T+1}}[ min_x Σ_{s=1}^{T+1} f(x; ω_s) + R_{T+1} ]   (apply the regret bound)    (94)
    ≤ min_x E_ω[f(x; ω)] + R_{T+1}/(T + 1)   (expectation of a min is smaller than the min of expectations)    (95)
    = min_x F(x) + R_{T+1}/(T + 1).    (96)
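To illustrate the online-to-batch conversion of Theorem 3, here is a hedged Python sketch on a toy problem f(x; ω) = ½(x − ω)², for which F is minimized at E[ω]; the online algorithm A is plain online gradient descent, and the distribution, horizon, and step sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
omega = rng.normal(loc=3.0, scale=1.0, size=T + 1)   # omega_1, ..., omega_{T+1} iid

def A(samples):
    """Online gradient descent on f(x; w) = 0.5*(x - w)^2 over the given sample sequence.
    Returns the model after processing all samples; A of the empty sequence is 0."""
    x = 0.0
    for k, w in enumerate(samples, start=1):
        x -= (1.0 / k) * (x - w)          # gradient of f(.; w) at x is (x - w)
    return x

j = rng.integers(0, T + 1)                # j uniform in {0, ..., T}
x_hat = A(omega[j:T])                     # x_hat = A(omega_{j+1}, ..., omega_T)

# F(x) = 0.5*E[(x - omega)^2] is minimized at E[omega] = 3.
print(x_hat)                              # close to 3 with high probability
```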