The FTRL Algorithm with Strongly Convex Regularizers

CSE599s, Spring 2012, Online Learning                              Lecture 8 - 04/19/2012

The FTRL Algorithm with Strongly Convex Regularizers

Lecturer: Brendan McMahan                                          Scribe: Tamara Bonaci

1 Introduction

In the last lecture, we talked about regularization models that induce sparsity, and we explained why such models might be preferred:

- In statistics, a sparse vector might correspond to feature selection. For example, when there are more features than training examples, we might choose to set some features to 0 using L1-regularization.
- From a systems perspective, having a large number of features requires storing a large number of coefficients. By setting some features to 0, we may reduce memory requirements.

We also talked about the difference between L1 and L2 regularization, and presented a version of the Online Gradient Descent (OGD) algorithm that includes a prediction error and an L1 regularization term. For most of today, our loss function will be defined as

    f_t(ω) = l_t(ω) + λ‖ω‖_1,                                          (1)

where l_t(ω) is the prediction error, with l_t(ω) := l(ω · x_t, y_t) as short-hand notation, λ‖ω‖_1 is a regularization term used to induce sparsity, and λ ∈ R is a weighting factor. In some sense, this loss function allows us to choose predictors so as to minimize our regret compared to the best model making a prediction, while ensuring that our predicted vector is sparse. (For example, if there exists a perfect predictor ω* that makes a good prediction every time, but has a large L1 norm, we might not care about our regret compared to that predictor as much.)

Remark 1: The weighting factor λ is highly dependent on the problem we are solving and, for now, we will not worry about it. As a general rule, however, we note that a hypothetical sparsity/loss curve parameterized by λ has the form depicted in Figure 1: λ = 0 gives the smallest loss but no sparsity, and increasing λ buys sparsity at the price of a larger loss.

[Figure 1: A sparsity/loss curve parameterized by the weighting factor λ; the axes are sparsity and loss, with λ = 0 at the low-sparsity end of the curve and "better" pointing toward lower loss and higher sparsity.]
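To make the composite loss (1) concrete, here is a minimal Python sketch of f_t and one of its subgradients for a linear predictor. The squared-error choice of l_t and all names below are illustrative assumptions, not something fixed by the lecture.

import numpy as np

def composite_loss(w, x_t, y_t, lam):
    # f_t(w) = l_t(w) + lam * ||w||_1, here with l_t(w) = 0.5 * (w . x_t - y_t)^2
    prediction_error = 0.5 * (np.dot(w, x_t) - y_t) ** 2
    return prediction_error + lam * np.sum(np.abs(w))

def composite_subgradient(w, x_t, y_t, lam):
    # One element of the subdifferential of f_t at w (taking sign(0) = 0 for the L1 part).
    g_loss = (np.dot(w, x_t) - y_t) * x_t
    return g_loss + lam * np.sign(w)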

2 Composite-Objective OGD Algorithm

Let us consider the OGD algorithm again, with the update rule given as

    ω_{t+1} = ω_t − η ĝ_t,                                             (2)

where ĝ_t denotes a subgradient of the loss function f_t, ĝ_t ∈ ∂f_t(ω_t). We analyzed this algorithm last time and observed that it can result in oscillatory predictions, as there is nothing to really force the predictor to go exactly to 0. In fact, if there is any noise in the loss function gradient, the predictor will never go to 0.

We can, however, rewrite the OGD update rule (2) as the following optimization problem:

    ω_{t+1} = argmin_ω  g_t · ω + λ‖ω‖_1 + (1/(2η)) ‖ω − ω_t‖²,        (3)

where we approximate the prediction-error part of f_t(ω) by its subgradient at ω_t, g_t ∈ ∂l_t(ω_t), but keep the sparsity-regularization part intact, so the objective contains g_t · ω + λ‖ω‖_1. This modified algorithm is known as Composite-Objective Online Gradient Descent (CO-OGD).

The CO-OGD update (3) helps prevent prediction oscillations. It turns out, however, that we can achieve an even sparser predictor vector using the Follow-the-Regularized-Leader (FTRL) algorithm, with the update rule defined as [3]

    ω_{t+1} = argmin_ω ( g_{1:t} · ω + tλ‖ω‖_1 + r_{1:t}(ω) ),          (4)

where g_{1:t} · ω is the subgradient approximation of the prediction errors, tλ‖ω‖_1 contains t copies of the L1 penalty, and r_{1:t}(ω) is the stabilization penalty.

In order to compare CO-OGD with FTRL and show that the FTRL algorithm (4) enjoys better sparsity, we next rewrite the CO-OGD algorithm (3) in the following alternative form:

    ω_{t+1} = argmin_ω ( g_{1:t} · ω + φ_{1:t−1} · ω + λ‖ω‖_1 + r_{1:t}(ω) ),   (5)

with r_t(ω) defined as r_t(ω) = (σ_t/2)‖ω − ω_t‖², and φ_t ∈ R^n a subgradient approximation of the L1 penalty, φ_t ∈ ∂(λ‖ω‖_1). Thus, φ_{1:t−1} · ω + λ‖ω‖_1 approximates the term tλ‖ω‖_1 from the FTRL algorithm (4).

While the fact that the CO-OGD update rule (3) and its alternative form (5) are the same is not immediately obvious (and we do not prove it here), the alternative representation lets us see immediately that in CO-OGD there is a single L1 penalty for the current learning round, and all previous rounds are approximated using a subgradient (linear) approximation. In the FTRL algorithm, on the other hand, there are t copies of the L1 penalty, one for each iteration of the game. Thus, the FTRL algorithm results in a sparser solution, since it never replaces the L1 regularization by a linear approximation on any iteration of the game.

Remark 2: Note that it is possible to implement this modified FTRL algorithm by storing only a single vector in R^n, or two vectors in R^n if using adaptive per-coordinate learning rates. Thus, both FTRL and CO-OGD have the same storage requirements.

Remark 3 (Geometrical interpretation of the stabilization penalty): After t rounds, the cumulative stabilization penalty is given as

    r_{1:t}(ω) = Σ_{s=1}^{t} (σ_s/2) ‖ω − ω_s‖².

This stabilization penalty corresponds to a slowly decreasing learning rate, which can be represented geometrically as depicted in Figure 2, where ω_t denotes the predictor value in the current round.
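To make the comparison between (3) and (4) concrete, here is a hedged Python sketch of the two per-round updates. Both have closed-form per-coordinate solutions: the CO-OGD step is a soft-thresholding of the usual gradient step, while the FTRL-Proximal step thresholds the accumulated gradients against the accumulated penalty tλ. Variable names and bookkeeping conventions are illustrative assumptions.

import numpy as np

def soft_threshold(z, tau):
    # Per-coordinate soft-thresholding operator.
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def co_ogd_step(w_t, g_t, eta, lam):
    # Eq. (3): argmin_w  g_t.w + lam*||w||_1 + ||w - w_t||^2 / (2*eta)
    return soft_threshold(w_t - eta * g_t, eta * lam)

def ftrl_proximal_step(g_sum, sigma_w_sum, sigma_sum, t, lam):
    # Eq. (4) with r_{1:t}(w) = sum_s (sigma_s/2) ||w - w_s||^2:
    #   argmin_w  g_{1:t}.w + t*lam*||w||_1 + r_{1:t}(w)
    # Inputs: g_sum = g_{1:t}, sigma_w_sum = sum_s sigma_s * w_s, sigma_sum = sigma_{1:t}.
    z = sigma_w_sum - g_sum
    return soft_threshold(z, t * lam) / sigma_sum

Because the FTRL-Proximal step compares the accumulated gradient against the accumulated penalty tλ, coordinates snap to exactly 0 more readily, which is the sparsity advantage discussed above.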

[Figure 2: Geometrical interpretation of the stabilization penalty.]

3 The FTRL Algorithm with Strongly Convex Proximal Regularizers

We next analyze the FTRL algorithm with a strongly convex proximal regularizer and derive its regret bounds. As part of our regret analysis, we also show a simple way of handling general convex feasible sets.

We start by defining strongly convex functions.

3.1 Strongly Convex Functions

Definition 1. A convex function f is σ-strongly convex with respect to some norm ‖·‖ over a set W if for all u, w ∈ W and every g ∈ ∂f(w) it holds that

    f(u) ≥ f(w) + g · (u − w) + (σ/2) ‖u − w‖².                        (6)

A graphical interpretation of strong convexity is depicted in Figure 3.

[Figure 3: A graphical interpretation of strong convexity: f(u) stays above the tangent f(w) + (u − w)ᵀ∇f(w) by at least the quadratic (σ/2)‖u − w‖².]

Remark 4: Strongly convex functions can be thought of as functions that, in addition to a linear lower bound, also have a quadratic lower bound.
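A quick numerical sanity check of Definition 1 can be helpful. The sketch below verifies inequality (6) in the Euclidean norm for an arbitrarily chosen quadratic; the specific function, dimension, and random points are illustrative assumptions only.

import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0
b = rng.normal(size=5)

def f(w):
    # (sigma/2)*||w||_2^2 + b.w is sigma-strongly convex w.r.t. the L2 norm.
    return 0.5 * sigma * np.dot(w, w) + np.dot(b, w)

def grad_f(w):
    return sigma * w + b

for _ in range(1000):
    u, w = rng.normal(size=5), rng.normal(size=5)
    lower = f(w) + np.dot(grad_f(w), u - w) + 0.5 * sigma * np.dot(u - w, u - w)
    assert f(u) >= lower - 1e-9   # inequality (6); for this quadratic it holds with equality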

Remark 5: We will almost always work with strongly convex regularization functions, but we will not assume the same about the loss functions.

Example: The function f(w) = (1/2)‖w‖²₂ is 1-strongly convex with respect to the Euclidean norm.

Proof. The function f is differentiable, so its subdifferential consists of only one element, ∇f(w) = w. Now, for any points u and w, inequality (6) with σ = 1 reads

    (1/2)‖u‖² ≥ (1/2)‖w‖² + w · (u − w) + (1/2)‖u − w‖²,

and expanding the right-hand side shows that it equals (1/2)‖u‖², so the inequality holds (with equality).

Lemma 2. Let W be a nonempty convex set, and let f : W → R be a σ-strongly convex function over W with respect to a norm ‖·‖. Further, let w* be the minimizer of f over W, w* = argmin_{v ∈ W} f(v). Then for every u ∈ W it holds that

    f(u) − f(w*) ≥ (σ/2) ‖u − w*‖².                                    (7)

Proof. (Adapted from [2].) Let us first assume that f is differentiable and w* is in the interior of W. Then ∇f(w*) = 0, and from the definition of strong convexity it follows that for every u ∈ W,

    f(u) − f(w*) ≥ (σ/2) ‖u − w*‖²,

as required. Let us now consider the case when w* is on the boundary of W. We still have that for all u ∈ W, ∇f(w*)ᵀ(u − w*) ≥ 0; otherwise w* would not be optimal, since we could make a small step in the direction u − w* and decrease the value of f. So the desired inequality still holds. Finally, let us consider the case when f is not differentiable. Let g : R^d → R ∪ {∞} be such that g(w) = f(w) if w ∈ W and g(w) = ∞ otherwise. We can therefore write w* = argmin_v g(v). Since g is a proper convex function (it never takes the value −∞ and is not identically +∞), we know that 0 ∈ ∂g(w*). Thus, inequality (7) follows from the strong convexity of g.

3.2 Regret Analysis of the FTRL Algorithm with Strongly Convex Regularizers

We next proceed with the regret analysis of the FTRL algorithm with the update rule

    ω_{t+1} = argmin_{ω ∈ R^n}  f_{1:t}(ω) + r_{1:t}(ω).

We further assume the following about the loss and regularization functions:

- The loss f_t is L_t-Lipschitz and convex;
- The regularization function r_t is σ_t-strongly convex;
- 0 ∈ ∂r_t(ω_t), i.e., the regularizer added in round t is minimized at the point just played (it is proximal).

Lemma 3 (FTRL lemma). The regret of the FTRL algorithm with a strongly convex regularizer, compared to any predictor ω*, satisfies

    Regret(FTRL) ≤ Σ_{t=1}^{T} ( f_t(ω_t) − f_t(ω_{t+1}) ) + r_{1:T}(ω*).        (8)

Proof. The proof is analogous to the simpler version of the lemma with a single fixed regularizer r.
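The update rule of Section 3.2 becomes very concrete when the losses are linearized and the regularizers are the quadratic proximal choice r_t(ω) = (σ_t/2)‖ω − ω_t‖²: the argmin then has a closed form. The following Python sketch of that special case is an illustration under those assumptions, not the general algorithm; all names are hypothetical.

import numpy as np

def ftrl_proximal_quadratic(loss_subgradient, sigmas, w1, T):
    """FTRL with linearized losses f_t(w) ~ g_t.w and r_t(w) = (sigma_t/2)||w - w_t||^2.

    The update w_{t+1} = argmin_w g_{1:t}.w + sum_s (sigma_s/2)||w - w_s||^2
    has the closed form w_{t+1} = (sum_s sigma_s * w_s - g_{1:t}) / sigma_{1:t}.
    """
    w = np.array(w1, dtype=float)
    g_sum = np.zeros_like(w)          # g_{1:t}
    sigma_w_sum = np.zeros_like(w)    # sum_s sigma_s * w_s
    sigma_sum = 0.0                   # sigma_{1:t}
    iterates = [w.copy()]
    for t in range(T):
        g_t = loss_subgradient(t, w)  # a subgradient of f_t at the current iterate
        g_sum += g_t
        sigma_w_sum += sigmas[t] * w
        sigma_sum += sigmas[t]
        w = (sigma_w_sum - g_sum) / sigma_sum
        iterates.append(w.copy())
    return iterates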

Theorem 4. Let f_1, ..., f_T be a sequence of convex functions such that f_t is L_t-Lipschitz with respect to some norm ‖·‖. Assume that the FTRL algorithm is run on this sequence with regularization functions r_t that are σ_t-strongly convex with respect to the same norm and satisfy 0 ∈ ∂r_t(ω_t). Then, for all u ∈ W,

    Regret_T(u) ≤ Σ_{t=1}^{T} L_t² / σ_{1:t} + r_{1:T}(u).             (9)

Proof. Let us first recall the regret bound for the FTRL algorithm that we derived in Lemma 3:

    Σ_{t=1}^{T} ( f_t(ω_t) − f_t(u) ) ≤ Σ_{t=1}^{T} ( f_t(ω_t) − f_t(ω_{t+1}) ) + r_{1:T}(u).   (10)

If the loss function f_t is L_t-Lipschitz with respect to the norm ‖·‖, then

    f_t(ω_t) − f_t(ω_{t+1}) ≤ L_t ‖ω_t − ω_{t+1}‖.                     (11)

Thus, in order to achieve small regret, we need to ensure that ‖ω_t − ω_{t+1}‖ is small. If the regularization function is strongly convex with respect to the same norm, that is indeed the case: ω_t is close to ω_{t+1}.

To show that ω_t is close to ω_{t+1}, let us fix t and define a helper function h_t as

    h_t(ω) = f_{1:t−1}(ω) + r_{1:t−1}(ω) + r_t(ω),                     (12)

i.e., the first t − 1 losses together with the first t regularizers. By assumption, 0 ∈ ∂r_t(ω_t). This implies that h_t is σ_{1:t}-strongly convex and that the FTRL update rule for ω_t can be rewritten using the helper function:

    ω_t = argmin_ω h_t(ω).

Similarly, the update rule for ω_{t+1} can be rewritten as

    ω_{t+1} = argmin_ω  h_t(ω) + f_t(ω).

Since the sum of a convex function and a strongly convex function remains strongly convex [2], both h_t and h_t + f_t are σ_{1:t}-strongly convex. We can thus apply Lemma 2 to h_t (with minimizer ω_t):

    h_t(ω_{t+1}) − h_t(ω_t) ≥ (σ_{1:t}/2) ‖ω_t − ω_{t+1}‖².            (13)

Repeating the same argument for h_t + f_t (with minimizer ω_{t+1}), we can write

    ( h_t(ω_t) + f_t(ω_t) ) − ( h_t(ω_{t+1}) + f_t(ω_{t+1}) ) ≥ (σ_{1:t}/2) ‖ω_t − ω_{t+1}‖².   (14)

Summing inequalities (13) and (14), we obtain

    f_t(ω_t) − f_t(ω_{t+1}) ≥ σ_{1:t} ‖ω_t − ω_{t+1}‖².                (15)

Now, using the Lipschitzness of the loss function f_t (inequality (11)), we obtain

    L_t ‖ω_t − ω_{t+1}‖ ≥ f_t(ω_t) − f_t(ω_{t+1}).                     (16)

Combining inequalities (15) and (16), we can further write

    ‖ω_t − ω_{t+1}‖ ≤ L_t / σ_{1:t}.                                    (17)

Now, combining inequalities (16) and (17), we get

    f_t(ω_t) − f_t(ω_{t+1}) ≤ L_t² / σ_{1:t}.                           (18)

Combining inequality (18) with the regret bound (10), we get

    Regret_T(u) ≤ Σ_{t=1}^{T} L_t² / σ_{1:t} + r_{1:T}(u).              (19)

Inequality (19) completes the proof.

Let us now choose the regularization function as r_t(ω) = (σ_t/2)‖ω − ω_t‖². The cumulative sum of regularizers becomes

    r_{1:T}(ω) = Σ_{t=1}^{T} (σ_t/2) ‖ω − ω_t‖².

Let us further assume that for every ω ∈ W, ‖ω‖_2 ≤ R. It then follows that, for any comparator ω* ∈ W, the cumulative sum is upper-bounded by

    r_{1:T}(ω*) = Σ_{t=1}^{T} (σ_t/2) ‖ω* − ω_t‖² ≤ Σ_{t=1}^{T} (σ_t/2) (2R)² = 2 σ_{1:T} R².

The regret bound (19) can now be rewritten as

    Regret ≤ Σ_{t=1}^{T} L_t² / σ_{1:t} + 2 σ_{1:T} R².                 (20)

Taking a learning-rate schedule η_t for gradient descent, we can choose σ_t such that σ_{1:t} = 1/η_t, where η_t denotes the learning rate. The regret bound (20) then becomes

    Regret ≤ Σ_{t=1}^{T} L_t² η_t + 2R² / η_T.                          (21)

Note that if f_t(ω) = g_t · ω, then ‖g_t‖ = L_t. Thus, the regret bound of the OGD algorithm with adaptive learning rates and the regret bound of the FTRL algorithm with strongly convex proximal regularizers differ by at most a factor of 2.

Remark 6: The scaling parameter σ_t now becomes a part of the regret bound. A larger σ_t (smaller learning rate) produces a more stable algorithm, but it may take longer to move across the feasible set, resulting in a larger penalty when ω* is far away.
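The choice "pick σ_t so that σ_{1:t} = 1/η_t" used for (21) is easy to operationalize. The sketch below sets the per-round σ_t from a learning-rate schedule and evaluates the right-hand side of bound (21); the particular schedule η_t = R/(L√t) is one common choice assumed here for illustration, not prescribed by these notes.

import numpy as np

def sigmas_from_learning_rates(etas):
    # Choose sigma_t so that sigma_{1:t} = 1/eta_t, i.e.
    # sigma_1 = 1/eta_1 and sigma_t = 1/eta_t - 1/eta_{t-1} for t >= 2.
    inv = 1.0 / np.asarray(etas, dtype=float)
    return np.diff(inv, prepend=0.0)

def regret_bound(L, R, etas):
    # Right-hand side of (21): sum_t L_t^2 * eta_t + 2 R^2 / eta_T.
    L = np.asarray(L, dtype=float)
    etas = np.asarray(etas, dtype=float)
    return np.sum(L ** 2 * etas) + 2.0 * R ** 2 / etas[-1]

T, L_const, R = 1000, 1.0, 1.0
etas = R / (L_const * np.sqrt(np.arange(1, T + 1)))   # eta_t = R / (L sqrt(t))
sigmas = sigmas_from_learning_rates(etas)             # nonnegative, since eta_t is decreasing
print(regret_bound(np.full(T, L_const), R, etas))     # grows like L * R * sqrt(T)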

Now, combining inequalities (6) and (7), we get Combining inequality (8) with the regret bound (0), we get Inequality (9) completes the proof. f t (ω t ) f t (ω t+ ) L2 t. (8) Regret T (u) L 2 t +r :T (ω ). (9) Let s now choose regularization function as r t (ω) = σt 2 ω ω t 2. The cumulative sum of regularizers becomes r :T (ω σ t ) = 2 ω ω t 2. Let sfurtherassumethatforeveryω W ω 2 R. Itfollowsthatthecumulativesumisupper-bounded by r :T (ω σ t ) = 2 ω ω t 2 σ t 2 (2R)2 = 2σ :T R 2. The regret bound (9) can now be rewritten as Regret L 2 t +2 R 2. (20) Taking a learning-rate schedule η t for gradient descent, we can choose σ t such that = η t, where η t denotes the learning rate. The regret bound (20) now becomes Regret L 2 tη 2 t + 2R2 η T. (2) Note that if f t (ω) = g t ω, then g t = L t. Thus, the lower bounds on the regret of the OGD algorithm with adaptive learning rates and the regret of the FTRL algorithm with strongly convex proximal regulizers differ by at most factor of 2. Remark 6: The scaling parameter σ t now becomes a part of the regret function. Larger σ t (smaller learning rate) produces a more stable algorithm, but it may take longer to move across the feasible set, resulting in a larger penalty when ω is far away. 6

3.3 Feasible Sets We can define proximal regularizers r t as -strongly convex functions such that 0 r t (0) (e.g., r t (w) = 2 w 2 ). Let s rewrite r t as r t (w) = σ t 2 r(w w t)+i w (w), (22) where I W (x) denotes an indicator function that ensures a player always plays a point form the feasible set W { 0, w W, I W (w) =, otherwise. Such an indicator function works for an arbitrary convex sets and does not hurt the regret bound. Let s now apply this same idea on the FTRL algorithm; given a loss function f t (ω) = l t (ω)+λ ω, let s redefine it as ˆf t (ω) = l t (ω t )+g t (w w t )+λ ω, where g t ( ) denotes a subgradient approximation of l t, g t := l t (ω t ). We can run the FTRL algorithm using ˆf t (w) and get the same regret bound, since: ˆf t (ω t ) = f t (ω t ) ˆf t (ω ) f t (ω ) Since constants do not contribute to the optimization, without loss of generality, we can write ˆf t (ω) = g t ω +λ ω. (23) So, we get ˆf :t (ω) = g :t (ω)+tλ ω, (24) and for such a loss function, we get the same update as with the FTRL algorithm. References [] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, Cambridge University Press, 2006. [2] S. Shalev-Shwartz, Online Learning and Online Convex Optimization, Foundations and Trends in Machine Learning, 202. [3] H. B. McMahan, Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L Regularization, Proceedings of the 4th International Conference on Artificiall Intelligence and Statistics (AISTATS), 20. 7