Lecture 16: FTRL and Online Mirror Descent

Akshay Krishnamurthy
akshay@cs.umass.edu
November 2017

1 Recap

Last time we saw two online learning algorithms. First we saw the Weighted Majority algorithm, which is also called Hedge, Exponential Weights, or, as we will see today, Exponentiated Gradient. We proved that in the experts setting this algorithm achieves O(√(T log d)) regret when there are T rounds and d experts. We also started discussing the FTRL algorithm for the online convex optimization setting, which generalizes the experts setting. Here the algorithm is

    w_t = argmin_{w ∈ S} R(w) + Σ_{i=1}^{t-1} f_i(w),

where the f_t are the convex loss functions, R is some regularizer, and the w_t are the actions that the learner plays. Today we will study this algorithm in detail.

2 FTRL

Before turning to the general case, let us study the quadratic regularizer R(w) = (1/(2η)) ‖w‖² with linear loss functions:

    w_t = argmin_{w ∈ S} (1/(2η)) ‖w‖² + Σ_{i=1}^{t-1} ⟨w, l_i⟩ = argmin_{w ∈ S} (1/(2η)) ‖w‖² + ⟨w, θ_t⟩,

where θ_t = Σ_{i=1}^{t-1} l_i is the sum of the loss vectors so far. When S = R^d, FTRL has the closed form w_t = -η θ_t, which we can equivalently write as w_t = w_{t-1} - η l_{t-1}; this is the Online Gradient Descent algorithm. If S ⊊ R^d then we are doing a lazy projection step, w_t = Π_S(-η θ_t). This is called a lazy projection since we never project the dual iterates θ_t themselves; we only project when we need to make a prediction. This is also called Nesterov's Dual Averaging algorithm.
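To make the lazy-projection view concrete, here is a minimal Python sketch of FTRL with the quadratic regularizer on an ℓ2 ball; the ball-shaped feasible set and the helper names are assumptions chosen for illustration, and any closed convex S with its own projection would work the same way.

```python
import numpy as np

def project_l2_ball(v, radius=1.0):
    """Euclidean projection onto the ball {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else (radius / norm) * v

def ftrl_quadratic(loss_vectors, eta, radius=1.0):
    """FTRL with R(w) = (1/(2*eta))||w||^2 and linear losses <w, l_t>.

    Keeps the running sum theta_t = sum_{i<t} l_i and only projects when a
    prediction is needed (lazy projection / dual averaging).
    """
    d = len(loss_vectors[0])
    theta = np.zeros(d)              # sum of loss vectors seen so far
    plays = []
    for l_t in loss_vectors:
        w_t = project_l2_ball(-eta * theta, radius)  # w_t = Pi_S(-eta * theta_t)
        plays.append(w_t)
        theta += l_t                 # accumulate the new loss vector
    return plays

# Usage with random linear losses; the step size is the tuning from
# Theorem 1 below with B = 1 and L = sqrt(3).
rng = np.random.default_rng(0)
losses = [rng.uniform(-1, 1, size=3) for _ in range(100)]
T, B, L = len(losses), 1.0, np.sqrt(3)
ws = ftrl_quadratic(losses, eta=B / (L * np.sqrt(2 * T)), radius=B)
```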

Theorem 1. FTRL with the quadratic regularizer (1/(2η)) ‖w‖² and linear losses f_t(w) = ⟨w, l_t⟩ satisfies

    Regret(T, u) ≤ (1/(2η)) ‖u‖² + η Σ_{t=1}^T ‖l_t‖².

If ‖u‖ ≤ B and ‖l_t‖ ≤ L, then setting η = B / (L √(2T)) gives O(BL√T) regret.

Proof. We first apply the Be-the-Leader lemma, where we imagine that the 0-th loss is just R. This gives

    R(w_1) + Σ_{t=1}^T ⟨w_{t+1}, l_t⟩ ≤ R(u) + Σ_{t=1}^T ⟨u, l_t⟩.

Re-arranging and using non-negativity of R, we obtain

    Regret(T, u) = Σ_{t=1}^T ⟨w_t - u, l_t⟩ ≤ R(u) - R(w_1) + Σ_{t=1}^T ⟨w_t - w_{t+1}, l_t⟩ ≤ R(u) + Σ_{t=1}^T ⟨w_t - w_{t+1}, l_t⟩.

Now for a single term, using Cauchy-Schwarz and the non-expansiveness of the projection,

    ⟨w_t - w_{t+1}, l_t⟩ ≤ ‖w_t - w_{t+1}‖ ‖l_t‖ = ‖Π_S(-η θ_t) - Π_S(-η θ_{t+1})‖ ‖l_t‖ ≤ ‖η θ_t - η θ_{t+1}‖ ‖l_t‖ = η ‖l_t‖².

The above bound works for any linear losses, but what about convex losses? The trick is to linearize the convex loss function, and the main question is which linear function we should use. Given f_t, we need a linear function that we can pass to our linear FTRL algorithm. The trick is to use a subgradient l_t ∈ ∂f_t(w_t).

Proposition 2 (Linearization). For a convex loss function f_t, with l_t ∈ ∂f_t(w_t), we have

    f_t(w_t) - f_t(u) ≤ ⟨w_t - u, l_t⟩.

Proof. This is just the definition of convexity: we know that f_t(u) ≥ f_t(w_t) + ⟨l_t, u - w_t⟩, which is a re-arrangement of the claim.

The intuition here is that, at least statistically, linear losses are the hardest to optimize, since they have no curvature. Looking at the quadratic example from last time, we saw that curvature seemed to help us, so this is not too surprising.

Corollary 3 (FTRL regret for the experts setting). FTRL with the quadratic regularizer in the experts setting has regret Regret(T) ≤ O(√(dT)).

Proof. Since S = Δ(d), the probability simplex, we know that ‖u‖ ≤ 1, while ‖l_t‖ ≤ √d since l_t ∈ [0, 1]^d. Optimizing over η proves the result.

This is actually terrible, since we saw that Weighted Majority achieves O(√(T log d)) regret in the experts setting. How can we achieve this with FTRL?

3 Online Mirror Descent

The key is to use a different regularizer. To study the more general case we will need some more technical machinery. We will study the Online Mirror Descent (OMD) algorithm, which is equivalent to FTRL, but we will see a different perspective as well:

    w_t = argmin_{w ∈ S} R(w) - ⟨w, θ_t⟩,    θ_t = -η Σ_{i=1}^{t-1} l_i,    l_i ∈ ∂f_i(w_i).

Note that, compared to Section 2, I have absorbed the learning rate and a minus sign into θ_t, but clearly this changes nothing. Two observations:

1. OMD on losses f_t is just FTRL on the linearizations ⟨w, l_t⟩ (see the sketch below).
2. OGD is just OMD with the quadratic regularizer.
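To make observation 1 concrete, here is a minimal sketch (with illustrative, hypothetical names) in which the unconstrained linear-loss FTRL of Section 2 is run on general convex losses by handing it only subgradients l_t ∈ ∂f_t(w_t); the quadratic losses in the usage example are an assumption for illustration, not part of the notes.

```python
import numpy as np

def ftrl_linearized(subgradient, num_rounds, dim, eta):
    """Unconstrained quadratic-regularizer FTRL fed only linearizations.

    subgradient(t, w) returns some l_t in the subdifferential of f_t at w;
    FTRL itself never sees f_t, only these linear surrogates (Prop. 2).
    """
    theta = np.zeros(dim)    # running sum of the linearized losses
    plays = []
    for t in range(num_rounds):
        w_t = -eta * theta   # closed form of argmin (1/(2*eta))||w||^2 + <w, theta>
        plays.append(w_t)
        theta += subgradient(t, w_t)
    return plays

# Usage with convex losses f_t(w) = 0.5 * ||w - x_t||^2, whose gradient at
# w_t is w_t - x_t (the targets x_t are arbitrary illustrative data).
rng = np.random.default_rng(1)
targets = rng.uniform(-1.0, 1.0, size=(50, 3))
plays = ftrl_linearized(lambda t, w: w - targets[t], num_rounds=50, dim=3, eta=0.05)
```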

Fenchel conjugates. To understand OMD we need to introduce three concepts. The first is Fenchel conjugates. For a function ψ, which need not be convex,

    ψ*(θ) = sup_{w ∈ R^d} ⟨w, θ⟩ - ψ(w).

The idea is that ψ* describes the supporting hyperplanes to the function ψ: in the one-dimensional case, -ψ*(θ) is the intercept of the supporting line to ψ with slope θ. The main properties of Fenchel conjugates are:

1. ψ* is always convex: it is a pointwise supremum of linear functions of θ.
2. ψ*(θ) + ψ(w) ≥ ⟨w, θ⟩, which is called the Fenchel-Young inequality.
3. If R(w) = α ψ(w) then R*(θ) = α ψ*(θ/α) for α > 0.
4. If ψ* is differentiable then ∇ψ*(θ) = argmax_w ⟨w, θ⟩ - ψ(w). This follows since the gradient of a maximum is the gradient of the function achieving the maximum, which in this case is just w.

Thus in the unconstrained case, we can write OMD in another way:

    w_t = ∇R*(θ_t),    θ_t = θ_{t-1} - η l_{t-1},

and we also know that w_t and θ_t are linked via θ_t = ∇R(w_t). This is where the word "mirror" comes from: the θ_t are updated in gradient-descent style in the dual space, and the updates are mirrored back to the primal space using Fenchel duality.

Bregman divergences. For any continuously differentiable convex function ψ, define

    D_ψ(u ‖ v) = ψ(u) - ψ(v) - ⟨∇ψ(v), u - v⟩,

which is the difference between ψ(u) and its first-order approximation around v. Convexity ensures that D_ψ ≥ 0, but note that it is not in general symmetric.

The idea for a better analysis of FTRL/OMD is to use a different regularizer and, instead of the ℓ2 distance ‖w_t - w_{t+1}‖ that we saw in the proof above, to use a Bregman divergence. Before turning to the real analysis, here is a characterization of the constrained case that is closer to the mirror formulation above:

1. Choose w̃_{t+1} so that ∇R(w̃_{t+1}) = ∇R(w̃_t) - η l_t (or, inductively, so that ∇R(w̃_{t+1}) = θ_{t+1}).
2. Choose w_{t+1} = argmin_{w ∈ S} D_R(w ‖ w̃_{t+1}).

In the unconstrained case this is equivalent to the mirror formulation, since the unconstrained argmin is just w̃_{t+1}. It does not actually matter whether you use w̃_t or w_t in the first line: with w̃_t this is called the lazy version, and with w_t it is called the agile version (at least according to Hazan).

Lemma 4. Lazy OMD and FTRL produce identical predictions when the loss functions are linear.

Proof. We need to prove that

    argmin_{w ∈ S} D_R(w ‖ w̃_t) = argmin_{w ∈ S} R(w) - ⟨w, θ_t⟩.

First, observe that inductively we have ∇R(w̃_t) = θ_t, since that is where we are accumulating the gradients. This means that

    argmin_{w ∈ S} D_R(w ‖ w̃_t) = argmin_{w ∈ S} R(w) - R(w̃_t) - ⟨∇R(w̃_t), w - w̃_t⟩ = argmin_{w ∈ S} R(w) - ⟨w, θ_t⟩,

which is exactly what the FTRL algorithm is doing.
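To connect the two views, here is a minimal sketch of unconstrained lazy OMD driven by a mirror map ∇R*, passed in as grad_R_star (a name chosen for illustration): the dual iterate takes gradient steps, and the mirror map sends it back to the primal space. With the quadratic regularizer, ∇R*(θ) = θ, and the iterates are exactly unconstrained online gradient descent.

```python
import numpy as np

def lazy_omd(loss_vectors, eta, grad_R_star):
    """Unconstrained lazy Online Mirror Descent on linear losses.

    Dual update:  theta_{t+1} = theta_t - eta * l_t
    Mirror step:  w_t = grad_R_star(theta_t), i.e. w_t = grad R*(theta_t)
    """
    d = len(loss_vectors[0])
    theta = np.zeros(d)
    plays = []
    for l_t in loss_vectors:
        w_t = grad_R_star(theta)       # mirror the dual iterate back to primal space
        plays.append(w_t)
        theta = theta - eta * l_t      # gradient-descent-style update in the dual
    return plays

# With R(w) = (1/2)||w||^2 we have grad R*(theta) = theta, so lazy OMD
# reproduces unconstrained OGD: w_{t+1} = w_t - eta * l_t.
rng = np.random.default_rng(2)
losses = [rng.normal(size=3) for _ in range(10)]
ws = lazy_omd(losses, eta=0.1, grad_R_star=lambda theta: theta)
```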

Theorem 5. OMD with regularizer R obtains the regret bound

    η Regret(T, u) ≤ R(u) - R(w_1) + Σ_{t=1}^T D_{R*}(θ_{t+1} ‖ θ_t).

Proof. We work only with linear loss functions, since the convex case can then be handled by first applying Proposition 2. We know that

    R*(θ_{T+1}) ≥ ⟨u, θ_{T+1}⟩ - R(u) = -η Σ_{t=1}^T ⟨u, l_t⟩ - R(u)

by the Fenchel-Young inequality. Now for the upper bound,

    R*(θ_{T+1}) = R*(θ_1) + Σ_{t=1}^T [R*(θ_{t+1}) - R*(θ_t)]
                = R*(θ_1) + Σ_{t=1}^T [⟨∇R*(θ_t), θ_{t+1} - θ_t⟩ + D_{R*}(θ_{t+1} ‖ θ_t)]
                = R*(θ_1) - η Σ_{t=1}^T ⟨w_t, l_t⟩ + Σ_{t=1}^T D_{R*}(θ_{t+1} ‖ θ_t).

For the first term, since R*(θ_1) = max_w ⟨w, θ_1⟩ - R(w) = max_w -R(w) (recall θ_1 = 0), and w_1 = ∇R*(θ_1) is the argmax, we get that R*(θ_1) = -R(w_1). Now re-arrange the two bounds on R*(θ_{T+1}) to obtain the result.

Example (Quadratic regularizer). The quadratic regularizer R(w) = (1/2) ‖w‖² has conjugate R*(θ) = (1/2) ‖θ‖² with ∇R*(θ) = θ. The Bregman divergence term is

    D_{R*}(u ‖ v) = (1/2) ‖u‖² - (1/2) ‖v‖² - ⟨v, u - v⟩ = (1/2) ‖u - v‖²,

which happens to be symmetric. This reproduces the OGD analysis from last lecture.

Strong convexity. To get a better bound for the experts setting we need to use a new regularizer, and we need to understand its properties. Since we have S = Δ(d), it makes sense to use entropic regularization:

    R(w) = Σ_{j=1}^d w_j log w_j,    R*(θ) = log(Σ_{j=1}^d exp(θ_j)),    ∇R*(θ)_j = exp(θ_j) / Σ_{k=1}^d exp(θ_k).

This leads to the algorithm w_t ∝ exp(θ_t) = exp(-η Σ_{i=1}^{t-1} l_i), which is just the Weighted Majority algorithm. Here we will refer to it as Exponentiated Gradient. The last thing to verify is that D_R(u ‖ v) = KL(u ‖ v) = Σ_j u_j log(u_j / v_j), which we will use later.
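As a concrete instance, here is a minimal sketch of lazy OMD on the simplex with the entropic regularizer, i.e. Exponentiated Gradient / Weighted Majority; the softmax implements ∇R* (shifted by the maximum for numerical stability), and the learning rate in the usage example is the √(2 log(d)/T) tuning suggested by the regret bound below.

```python
import numpy as np

def exponentiated_gradient(loss_vectors, eta):
    """OMD with the entropic regularizer on the simplex (Exponentiated Gradient).

    Dual iterate theta_t = -eta * sum_{i<t} l_i; the mirror map grad R* is the
    softmax, so w_t is proportional to exp(theta_t).
    """
    d = len(loss_vectors[0])
    theta = np.zeros(d)
    plays = []
    for l_t in loss_vectors:
        z = theta - theta.max()            # stabilize before exponentiating
        w_t = np.exp(z) / np.exp(z).sum()  # w_t = grad R*(theta_t), a point in the simplex
        plays.append(w_t)
        theta -= eta * l_t
    return plays

# Experts setting: losses in [0, 1]^d, learning rate eta = sqrt(2 log(d) / T).
rng = np.random.default_rng(3)
T, d = 1000, 10
losses = [rng.uniform(0, 1, size=d) for _ in range(T)]
ws = exponentiated_gradient(losses, eta=np.sqrt(2 * np.log(d) / T))
# Expected regret against the best single expert.
regret = sum(w @ l for w, l in zip(ws, losses)) - min(np.sum(losses, axis=0))
```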

To complete the analysis we need to relate the dual Bregman divergence to a norm. Working backwards, if we show that R* is α-strongly smooth then we can upper bound the Bregman term. It turns out this is equivalent to asking that R is 1/α-strongly convex with respect to the dual norm. More formally:

Definition 6. ψ is α-strongly convex with respect to a norm ‖·‖ if for all u, v

    D_ψ(u ‖ v) ≥ (α/2) ‖u - v‖².

Analogously, ψ is α-strongly smooth with respect to a norm ‖·‖ if for all u, v

    D_ψ(u ‖ v) ≤ (α/2) ‖u - v‖².

For a norm ‖·‖, the dual norm is ‖x‖* = sup_{y : ‖y‖ ≤ 1} ⟨x, y⟩.

Lemma 7. ψ(w) is 1/α-strongly convex with respect to some norm if and only if ψ*(θ) is α-strongly smooth with respect to the dual norm.

Proposition 8. The entropic regularizer R(w) = Σ_j w_j log w_j is 1-strongly convex with respect to the ℓ1 norm. Thus R* is 1-strongly smooth with respect to the ℓ∞ norm, the second term in the regret is at most (η²/2) Σ_t ‖l_t‖∞², and we obtain an O(√(T log d)) regret bound.

Proof. The main thing we need to prove is that the entropic regularizer is strongly convex. Let us just verify the strong-convexity lower bound on the KL divergence in the simpler case d = 2; thus we must show

    KL(p, q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)) ≥ (1/2) (|p - q| + |(1 - p) - (1 - q)|)² = 2 (p - q)².

Let us look at the difference between the two sides as a function of q:

    g(q) = p log(p/q) + (1 - p) log((1 - p)/(1 - q)) - 2 (p - q)²,
    g'(q) = -p/q + (1 - p)/(1 - q) + 4 (p - q) = (q - p) [ 1/(q(1 - q)) - 4 ].

Since q ∈ [0, 1] we know that q(1 - q) ≤ 1/4, so the bracket is non-negative; hence the derivative is negative for q ≤ p and positive for q ≥ p. This means that q = p minimizes the function, and it is easy to see that g(p) = 0, which proves the result. The easy way to generalize to d > 2 is to observe that ‖p - q‖₁ = 2 max_{S ⊆ [d]} (P(S) - Q(S)) (twice the total variation distance), where P(S) = Σ_{j ∈ S} p_j and Q(S) is defined analogously, and that the KL divergence only contracts when we collapse p and q to the binary distributions (P(S), 1 - P(S)) and (Q(S), 1 - Q(S)); applying the d = 2 case to these binary distributions gives the claim.
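The inequality underlying Proposition 8 can be sanity-checked numerically; the following small script, an illustration rather than part of the notes, verifies KL(p ‖ q) ≥ (1/2) ‖p - q‖₁² on random pairs of distributions.

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions p and q (assumes q > 0)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Check KL(p || q) >= 0.5 * ||p - q||_1^2, i.e. 1-strong convexity of the
# entropic regularizer w.r.t. the l1 norm, on random points of the simplex.
rng = np.random.default_rng(4)
for _ in range(10000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    assert kl(p, q) >= 0.5 * np.sum(np.abs(p - q)) ** 2 - 1e-12
```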