Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm


Jacob Steinhardt and Percy Liang
Stanford University
{jsteinhardt,pliang}@cs.stanford.edu
Jun 11, 2013

Setup

The setting is learning from experts: $n$ experts, $T$ rounds. For $t = 1, \ldots, T$:
- The learner chooses a distribution $w_t \in \Delta_n$ over the experts.
- Nature reveals losses $z_t \in [-1,1]^n$ of the experts.
- The learner suffers loss $w_t^\top z_t$.

Goal: minimize $\mathrm{Regret} = \sum_{t=1}^T w_t^\top z_t - \sum_{t=1}^T z_{t,i^*}$, where $i^*$ is the best fixed expert.

Typical algorithm: multiplicative weights (aka exponentiated gradient): $w_{t+1,i} \propto w_{t,i} \exp(-\eta z_{t,i})$.
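For concreteness, here is a minimal NumPy sketch of this update (the log-domain bookkeeping and the toy interface are our choices, not from the slides):

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Run the MW update on a (T, n) array of losses in [-1, 1]."""
    T, n = losses.shape
    log_w = np.zeros(n)                   # log of unnormalized weights
    total_loss = 0.0
    for t in range(T):
        w = np.exp(log_w - log_w.max())   # stabilized exponentiation
        w /= w.sum()                      # normalize to a distribution over experts
        total_loss += w @ losses[t]       # learner suffers w_t . z_t
        log_w -= eta * losses[t]          # w_{t+1,i} proportional to w_{t,i} exp(-eta z_{t,i})
    return total_loss
```

With the standard step size $\eta = \sqrt{\log(n)/T}$, this attains the usual $O(\sqrt{T \log n})$ regret.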

Outline

- Compare two variants of the multiplicative weights (exponentiated gradient) algorithm.
- Understand the difference through the lens of adaptive mirror descent (Orabona et al., 2013).
- Combine with the machinery of optimistic updates (Rakhlin & Sridharan, 2012) to beat the best existing bounds.

Two Types of Updates

The literature contains two similar but distinct updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):

$w_{t+1,i} \propto w_{t,i} \exp(-\eta z_{t,i})$   (MW1)
$w_{t+1,i} \propto w_{t,i} (1 - \eta z_{t,i})$   (MW2)

The regret is bounded as

$\mathrm{Regret} \le \frac{\log n}{\eta} + \eta \sum_{t=1}^T \|z_t\|_\infty^2$   (Regret:MW1)
$\mathrm{Regret} \le \frac{\log n}{\eta} + \eta \sum_{t=1}^T z_{t,i^*}^2$   (Regret:MW2)

If the best expert $i^*$ has loss close to zero, the second bound is better than the first; the gap can be $\Theta(\sqrt{T})$ (in actual performance, not just in the upper bounds).

Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?

- (MW1) is mirror descent with regularizer $\frac{1}{\eta} \sum_{i=1}^n w_i \log w_i$.
- (MW2) is NOT mirror descent for any fixed regularizer.

Unsettling: should we abandon mirror descent as a gold standard? No: we can cast (MW2) as adaptive mirror descent (Orabona et al., 2013).
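The two updates differ in a single line; a side-by-side sketch (interface and names are ours):

```python
import numpy as np

def mw_step(w, z, eta, variant="MW1"):
    """One round of either update; w is the current distribution, z in [-1,1]^n."""
    if variant == "MW1":
        w = w * np.exp(-eta * z)    # w_{t+1,i} proportional to w_{t,i} exp(-eta z_{t,i})
    else:                           # MW2: needs eta < 1 so every factor 1 - eta*z_i stays positive
        w = w * (1.0 - eta * z)     # w_{t+1,i} proportional to w_{t,i} (1 - eta z_{t,i})
    return w / w.sum()
```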


Adaptive Mirror Descent to the Rescue

Recall that mirror descent is the (meta-)algorithm

$w_t = \operatorname{argmin}_w \; \psi(w) + \sum_{s=1}^{t-1} w^\top z_s.$

For $\psi(w) = \frac{1}{\eta} \sum_{i=1}^n w_i \log w_i$, we recover (MW1).

Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm

$w_t = \operatorname{argmin}_w \; \psi_t(w) + \sum_{s=1}^{t-1} w^\top z_s.$

For $\psi_t(w) = \frac{1}{\eta} \sum_{i=1}^n w_i \log w_i + \eta \sum_{i=1}^n \sum_{s=1}^{t-1} w_i z_{s,i}^2$, we approximately recover (MW2), via the update

$w_{t+1,i} \propto w_{t,i} \exp(-\eta z_{t,i} - \eta^2 z_{t,i}^2) \approx w_{t,i} (1 - \eta z_{t,i}).$

This is enough to achieve the better regret bound; (MW2) can be recovered exactly with a more complicated $\psi_t$.
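A quick numerical check of that approximation (the grid of losses and the choice $\eta = 0.1$ are ours):

```python
import numpy as np

eta = 0.1
z = np.linspace(-1.0, 1.0, 9)                 # a grid of possible losses
adaptive = np.exp(-eta * z - (eta * z) ** 2)  # multiplier from the adaptive regularizer
mw2 = 1.0 - eta * z                           # the (MW2) multiplier
print(np.abs(adaptive - mw2).max())           # ~0.005: the gap is O((eta z)^2)
```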

Advantages of Our Perspective

So far, we have cast (MW2) as adaptive mirror descent, with regularizer $\psi_t(w) = \sum_{i=1}^n w_i \left[ \frac{1}{\eta} \log w_i + \eta \sum_{s=1}^{t-1} z_{s,i}^2 \right]$.

This explains the better regret bound while staying within the mirror descent framework, which is nice. Our new perspective also allows us to apply lots of modern machinery:
- optimistic updates (Rakhlin & Sridharan, 2012)
- matrix multiplicative weights (Tsuda et al., 2005; Arora & Kale, 2007)

By turning the crank, we get results that beat the state of the art!


Beating State of the Art

Each entry below is the variation quantity appearing in a known regret bound; optimism increases from right to left, adaptivity from top to bottom:

    D (Chiang et al., 2012)      V                                  S (Kivinen & Warmuth, 1997)
    max_i D_i                    max_i V_i (Hazan & Kale, 2008)     max_i S_i
    D_{i^*} (this work)          V_{i^*}                            S_{i^*} (Cesa-Bianchi et al., 2007)

In the above we let

$S = \sum_{t=1}^T \|z_t\|_\infty^2$, $V = \sum_{t=1}^T \|z_t - \bar{z}\|_\infty^2$, $D = \sum_{t=1}^T \|z_t - z_{t-1}\|_\infty^2$,
$S_i = \sum_{t=1}^T z_{t,i}^2$, $V_i = \sum_{t=1}^T (z_{t,i} - \bar{z}_i)^2$, $D_i = \sum_{t=1}^T (z_{t,i} - z_{t-1,i})^2$.
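To make the table concrete, a sketch (ours; it takes $z_0 = 0$ for the first difference term) computing all six quantities from a loss matrix:

```python
import numpy as np

def variation_stats(Z):
    """Z: (T, n) loss matrix. Returns S, V, D and the per-expert S_i, V_i, D_i."""
    zbar = Z.mean(axis=0)                           # time-averaged loss vector z-bar
    diffs = np.diff(Z, axis=0, prepend=0)           # z_t - z_{t-1}, with z_0 = 0
    S_i = (Z ** 2).sum(axis=0)                      # S_i = sum_t z_{t,i}^2
    V_i = ((Z - zbar) ** 2).sum(axis=0)             # V_i = sum_t (z_{t,i} - zbar_i)^2
    D_i = (diffs ** 2).sum(axis=0)                  # D_i = sum_t (z_{t,i} - z_{t-1,i})^2
    S = (np.abs(Z).max(axis=1) ** 2).sum()          # S = sum_t ||z_t||_inf^2
    V = (np.abs(Z - zbar).max(axis=1) ** 2).sum()   # V = sum_t ||z_t - zbar||_inf^2
    D = (np.abs(diffs).max(axis=1) ** 2).sum()      # D = sum_t ||z_t - z_{t-1}||_inf^2
    return S, V, D, S_i, V_i, D_i
```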


Optimistic Updates: A Brief Review

(Normal) mirror descent:

$w_t = \operatorname{argmin}_w \; \psi(w) + w^\top \sum_{s=1}^{t-1} z_s.$

Optimistic mirror descent (Rakhlin & Sridharan, 2012) adds a hint $m_t$:

$w_t = \operatorname{argmin}_w \; \psi(w) + w^\top \left( m_t + \sum_{s=1}^{t-1} z_s \right).$

The hint $m_t$ guesses the next term $z_t$ of the cost function, so we pay regret in terms of $z_t - m_t$ rather than $z_t$. E.g.: $m_t = z_{t-1}$, or $m_t = \frac{1}{t} \sum_{s=1}^{t-1} z_s$.
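With the entropic regularizer from the earlier slides, the optimistic update has a closed form; a minimal sketch, assuming the hint is supplied by the caller:

```python
import numpy as np

def optimistic_md_entropy(past_losses, m, eta):
    """w_t for optimistic mirror descent with psi(w) = (1/eta) sum_i w_i log w_i.
    Closed form: w_{t,i} proportional to exp(-eta (sum_s z_{s,i} + m_i))."""
    score = -eta * (past_losses.sum(axis=0) + m)  # past_losses: (t-1, n) array
    w = np.exp(score - score.max())               # stabilized softmax over experts
    return w / w.sum()
```

Passing `m = past_losses[-1]` implements the $m_t = z_{t-1}$ hint above.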


Multiplicative Weights with Optimism

Name | Auxiliary update | Prediction ($w_t$)
MW1 | $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$ | $w_{t,i} \propto \exp(\beta_{t,i})$
MW2 | $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$ | $w_{t,i} \propto \exp(\beta_{t,i})$
MW3 | $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 (z_{t,i} - z_{t-1,i})^2$ | $w_{t,i} \propto \exp(\beta_{t,i} - \eta z_{t-1,i})$

Regret of MW3:

$\mathrm{Regret} \le \frac{\log n}{\eta} + \eta \sum_{t=1}^T (z_{t,i^*} - z_{t-1,i^*})^2.$

This dominates all existing bounds in this setting!
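The table translates directly into code; a sketch of MW3 (the log-domain normalization and the convention $z_0 = 0$ are ours):

```python
import numpy as np

def mw3(losses, eta):
    """Run MW3 on a (T, n) loss matrix with entries in [-1, 1], taking z_0 = 0."""
    T, n = losses.shape
    beta = np.zeros(n)       # auxiliary weights beta_{t,i}
    z_prev = np.zeros(n)     # z_{t-1}: the optimistic hint and correction reference
    total_loss = 0.0
    for t in range(T):
        score = beta - eta * z_prev                       # w_{t,i} ∝ exp(beta_{t,i} - eta z_{t-1,i})
        w = np.exp(score - score.max())
        w /= w.sum()
        z = losses[t]
        total_loss += w @ z
        beta -= eta * z + eta ** 2 * (z - z_prev) ** 2    # MW3 auxiliary update
        z_prev = z
    return total_loss
```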

Summary

- Cast the multiplicative weights algorithm as adaptive mirror descent.
- Applied the machinery of optimistic updates to beat the best existing bounds.

Also in the paper:
- extension to general convex losses
- extension to matrices
- generalization of the FTRL lemma to convex cones