Bandit Convex Optimization



March 7, 2017

Table of Contents
1. Bandit Convex Optimization (BCO)
2. Projection Methods
3. Barrier Methods
4. Variance reduction
5. Other methods
6. Conclusion

Learning scenario
Compact convex action set $K \subseteq \mathbb{R}^d$. For $t = 1$ to $T$: predict $x_t \in K$, receive a convex loss function $f_t : K \to \mathbb{R}$, and incur loss $f_t(x_t)$.
Bandit setting: only the loss $f_t(x_t)$ is revealed, no other information.
Regret of algorithm $\mathcal{A}$: $\operatorname{Reg}_T(\mathcal{A}) = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x)$.
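As a concrete illustration of the protocol and the regret definition, here is a small self-contained Python toy; the quadratic losses, the constant learner, and the grid search are my own choices and are not part of the original slides.

```python
import numpy as np

# Toy BCO instance on K = [-1, 1] (d = 1) with quadratic losses f_t(x) = (x - c_t)^2.
rng = np.random.default_rng(0)
T = 1000
centers = rng.uniform(-0.5, 0.5, size=T)
losses = [lambda x, c=c: (x - c) ** 2 for c in centers]

def play_round(t, x_t):
    """One round of the protocol: commit to x_t in K, observe only f_t(x_t)."""
    return losses[t](x_t)

# A placeholder "learner" that always plays the same point x_t = 0.3.
total_loss = sum(play_round(t, 0.3) for t in range(T))

# Regret against the best fixed point in K, found here by grid search.
grid = np.linspace(-1.0, 1.0, 201)
best_fixed = min(sum(f(x) for f in losses) for x in grid)
print("regret of the constant learner:", total_loss - best_fixed)
```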

Related settings
Online convex optimization: $\nabla f_t$ (and maybe $\nabla^2 f_t, \nabla^3 f_t, \ldots$) known at each round.
Multi-armed bandit: $K = \{1, 2, \ldots, K\}$ discrete.
Zeroth-order optimization: $f_t = f$ fixed.
Stochastic bandit convex optimization: $f_t(x) = f(x) + \epsilon_t$, $\epsilon_t \sim \mathcal{D}$, a noisy estimate of a fixed $f$.
Multi-point bandit convex optimization: query $f_t$ at points $(x_{t,i})_{i=1}^m$, $m \ge 2$.

Projection Methods

Gradient descent without a gradient
Online gradient descent algorithm: $x_{t+1} \leftarrow x_t - \eta \nabla f_t(x_t)$.
BCO setting: $\nabla f_t(x_t)$ is not known!
BCO idea: find $\hat g_t$ such that $\hat g_t \approx \nabla f_t(x_t)$, and update $x_{t+1} \leftarrow x_t - \eta \hat g_t$.
Question: how do we pick $\hat g_t$?

Single-point gradient estimates (one dimension)
By the fundamental theorem of calculus,
$$\frac{d}{dx}\,\frac{1}{2\delta}\int_{-\delta}^{\delta} f(x+y)\,dy \;=\; \frac{1}{2\delta}\bigl[f(x+\delta) - f(x-\delta)\bigr] \;=\; \mathbb{E}_{z\sim D}\!\left[\frac{1}{\delta}\,f(x+z)\,\frac{z}{|z|}\right],$$
where $D$ puts $z = \delta$ w.p. $\tfrac12$ and $z = -\delta$ w.p. $\tfrac12$.
With enough regularity (e.g., $f$ Lipschitz),
$$\frac{d}{dx}\,\frac{1}{2\delta}\int_{-\delta}^{\delta} f(x+y)\,dy \;=\; \frac{1}{2\delta}\int_{-\delta}^{\delta} f'(x+y)\,dy.$$
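For completeness, the expectation identity can be checked directly from the two-point distribution $D$ defined above (a one-line verification, not spelled out on the original slide):
$$\mathbb{E}_{z\sim D}\!\left[\frac{1}{\delta}\,f(x+z)\,\frac{z}{|z|}\right]
= \frac12\cdot\frac{f(x+\delta)}{\delta}\cdot 1 \;+\; \frac12\cdot\frac{f(x-\delta)}{\delta}\cdot(-1)
= \frac{1}{2\delta}\bigl[f(x+\delta)-f(x-\delta)\bigr].$$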

Single-point gradient estimates (higher dimensions)
Notation: $B_1 = \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$, $S_1 = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$, and $\int_A dy = |A|$.
By Stokes' theorem,
$$\nabla\,\frac{1}{|\delta B_1|}\int_{\delta B_1} f(x+y)\,dy
= \frac{1}{|\delta B_1|}\int_{\delta S_1} f(x+z)\,\frac{z}{\|z\|}\,dz
= \frac{\delta^{d-1}|S_1|}{|\delta B_1|}\,\mathbb{E}_{z\sim U(S_1)}\bigl[f(x+\delta z)\,z\bigr]
= \frac{d}{\delta}\,\mathbb{E}_{z\sim U(S_1)}\bigl[f(x+\delta z)\,z\bigr].$$
With enough regularity on $f$,
$$\nabla\,\frac{1}{|\delta B_1|}\int_{\delta B_1} f(x+y)\,dy = \frac{1}{|\delta B_1|}\int_{\delta B_1} \nabla f(x+y)\,dy.$$
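As a numerical sanity check of this identity (my own illustration, with a quadratic test loss and hypothetical helper names), averaging many single-point estimates $\frac{d}{\delta} f(x+\delta z)z$ should recover the gradient of the smoothed loss, which for a quadratic equals the true gradient:

```python
import numpy as np

def one_point_gradient_estimate(f, x, delta, rng):
    """Single-point estimate (d/delta) * f(x + delta*z) * z with z ~ U(S_1)."""
    d = x.shape[0]
    z = rng.normal(size=d)
    z /= np.linalg.norm(z)                 # uniform direction on the unit sphere
    return (d / delta) * f(x + delta * z) * z

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.3], [0.3, 1.0]])
f = lambda x: 0.5 * x @ A @ x              # smooth convex test loss
x0, delta = np.array([1.0, -0.5]), 0.05

est = np.mean([one_point_gradient_estimate(f, x0, delta, rng)
               for _ in range(200_000)], axis=0)
print("estimate:", est, " true gradient:", A @ x0)
```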

Projection method [Flaxman et al., 2005]
Let $\hat f(x) = \frac{1}{|\delta B_1|}\int_{\delta B_1} f(x+y)\,dy$.
Estimate $\nabla \hat f(x)$ by sampling on the sphere $x + \delta S_1$.
Project the gradient descent update to keep the samples inside $K$: $K_\delta = (1-\delta)K$.

BANDITPGD($T, \eta, \delta$):
  $x_1 \leftarrow 0$.
  For $t = 1, 2, \ldots, T$:
    $u_t \leftarrow$ SAMPLE($S_1$)
    $y_t \leftarrow x_t + \delta u_t$
    PLAY($y_t$)
    $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
    $\hat g_t \leftarrow \frac{d}{\delta}\, f_t(y_t)\, u_t$
    $x_{t+1} \leftarrow \Pi_{K_\delta}(x_t - \eta \hat g_t)$.
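Below is a minimal Python sketch of BANDITPGD, specialized (as an assumption of mine) to $K$ being the unit Euclidean ball, so that the projection onto $K_\delta = (1-\delta)K$ is just a rescaling; `loss_oracle` is a hypothetical callback providing the bandit feedback.

```python
import numpy as np

def bandit_pgd(loss_oracle, d, T, eta, delta, rng=None):
    """Sketch of BANDITPGD (Flaxman et al., 2005) for K = unit Euclidean ball."""
    rng = rng or np.random.default_rng(0)
    x = np.zeros(d)                          # x_1 = 0
    total_loss = 0.0
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)               # u_t ~ U(S_1)
        y = x + delta * u                    # perturbed point actually played
        loss = loss_oracle(t, y)             # only f_t(y_t) is revealed
        total_loss += loss
        g_hat = (d / delta) * loss * u       # one-point gradient estimate
        x = x - eta * g_hat                  # gradient step
        norm = np.linalg.norm(x)             # project onto K_delta = (1-delta)K
        if norm > 1.0 - delta:
            x *= (1.0 - delta) / norm
    return total_loss

# Example with a fixed quadratic loss; eta and delta are untuned toy values.
target = np.array([0.3, -0.2])
quad = lambda t, y: float((y - target) @ (y - target))
print(bandit_pgd(quad, d=2, T=5000, eta=0.005, delta=0.1))
```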

Analysis of BANDITPGD
Theorem (Flaxman et al., 2005). Assume $\operatorname{diam}(K) \le D$, $\|f_t\|_\infty \le C$, and $f_t$ is $L$-Lipschitz. Then after $T$ rounds, the expected regret of the BANDITPGD algorithm is bounded by
$$\frac{D^2}{2\eta} + \frac{\eta C^2 d^2 T}{2\delta^2} + \delta(D+2)LT.$$
In particular, by setting $\eta = \frac{\delta D}{Cd\sqrt{T}}$ and $\delta = \Bigl(\frac{DCd}{(D+2)L\sqrt{T}}\Bigr)^{1/2}$, the regret is upper bounded by $O(d^{1/2}T^{3/4})$.
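To see where these parameter choices come from, one can balance the three terms in the bound (a short calculation, not spelled out on the original slide):
$$\min_{\eta>0}\ \frac{D^2}{2\eta}+\frac{\eta C^2 d^2 T}{2\delta^2} = \frac{DCd\sqrt{T}}{\delta}\ \text{ at }\ \eta=\frac{\delta D}{Cd\sqrt{T}},
\qquad
\min_{\delta>0}\ \frac{DCd\sqrt{T}}{\delta}+\delta(D+2)LT = 2\sqrt{DCd(D+2)L}\;T^{3/4},$$
attained at $\delta = \bigl(\frac{DCd}{(D+2)L\sqrt{T}}\bigr)^{1/2}$, which gives the stated $O(d^{1/2}T^{3/4})$ rate.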

Proof of BANDITPGD regret
For any $x^* \in K$, let $x^*_\delta = \Pi_{K_\delta}(x^*)$, and let $\hat f_t(z) = \frac{1}{|\delta B_1|}\int_{\delta B_1} f_t(z+y)\,dy$ be the smoothed loss, so that $\hat f_t(z) \approx f_t(z)$. Then
$$\sum_{t=1}^T \mathbb{E}\bigl[f_t(y_t) - f_t(x^*)\bigr]
= \sum_{t=1}^T \mathbb{E}\Bigl[f_t(y_t) - f_t(x_t) + f_t(x_t) - \hat f_t(x_t) + \hat f_t(x_t) - \hat f_t(x^*_\delta) + \hat f_t(x^*_\delta) - f_t(x^*_\delta) + f_t(x^*_\delta) - f_t(x^*)\Bigr]$$
$$\le \sum_{t=1}^T \mathbb{E}\bigl[\hat f_t(x_t) - \hat f_t(x^*_\delta)\bigr] + \bigl[2\delta L T + \delta D L T\bigr]
\le \sum_{t=1}^T \mathbb{E}\bigl[\hat f_t(x_t) - \hat f_t(x^*_\delta)\bigr] + \delta(D+2)LT.$$

Proof of BANDITPGD regret (continued)
Note that $\mathbb{E}\bigl[\|\hat g_t\|^2\bigr] \le \frac{C^2 d^2}{\delta^2}$. Thus,
$$\sum_{t=1}^T \mathbb{E}\bigl[\hat f_t(x_t) - \hat f_t(x^*_\delta)\bigr]
\le \sum_{t=1}^T \mathbb{E}\bigl[\nabla \hat f_t(x_t)^\top (x_t - x^*_\delta)\bigr]
= \sum_{t=1}^T \mathbb{E}\bigl[\hat g_t^\top (x_t - x^*_\delta)\bigr]$$
$$= \sum_{t=1}^T \frac{1}{2\eta}\,\mathbb{E}\bigl[\eta^2\|\hat g_t\|^2 + \|x_t - x^*_\delta\|^2 - \|x_t - \eta\hat g_t - x^*_\delta\|^2\bigr]
\le \sum_{t=1}^T \frac{1}{2\eta}\,\mathbb{E}\Bigl[\eta^2\,\frac{C^2 d^2}{\delta^2} + \|x_t - x^*_\delta\|^2 - \|x_{t+1} - x^*_\delta\|^2\Bigr]$$
$$\le \frac{1}{2\eta}\,\mathbb{E}\Bigl[\|x_1 - x^*_\delta\|^2 + \eta^2\,\frac{C^2 d^2 T}{\delta^2}\Bigr]
\le \frac{1}{2\eta}\Bigl[D^2 + \eta^2\,\frac{C^2 d^2 T}{\delta^2}\Bigr].$$

Barrier Methods

Revisiting projection
Goal of the projection: keep $x_t \in K_\delta$ so that $y_t \in K$.
Total cost of the projection: $\delta D L T$.
Deficiency: the projection step is completely separate from the gradient descent update.
Question: is there a better way to ensure that $y_t \in K$?

From Gradient Descent to Follow-the-Regularized-Leader
Let $\hat g_{1:t} = \sum_{s=1}^t \hat g_s$.
Proximal form of gradient descent:
$$x_{t+1} \leftarrow x_t - \eta \hat g_t
\quad\Longleftrightarrow\quad
x_{t+1} \leftarrow \operatorname*{argmin}_{x\in\mathbb{R}^d}\; \eta\, \hat g_{1:t}^\top x + \|x\|^2.$$
Follow-the-Regularized-Leader: for a regularizer $R : \mathbb{R}^d \to \mathbb{R}$,
$$x_{t+1} \leftarrow \operatorname*{argmin}_{x\in\mathbb{R}^d}\; \hat g_{1:t}^\top x + R(x).$$
BCO wishlist for $R$:
Want to ensure that $x_{t+1}$ stays inside $K$.
Want enough room so that $y_{t+1} \in K$ as well.
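To see the equivalence with the proximal form above, write the regularizer as $\frac12\|x\|^2$ (the factor $\frac12$ only rescales $\eta$) and solve the unconstrained quadratic problem directly, a short derivation not on the original slide:
$$x_{t+1} = \operatorname*{argmin}_{x\in\mathbb{R}^d}\; \eta\,\hat g_{1:t}^\top x + \tfrac12\|x\|^2
= -\eta\,\hat g_{1:t}
= \bigl(-\eta\,\hat g_{1:t-1}\bigr) - \eta\,\hat g_t
= x_t - \eta\,\hat g_t,$$
so unconstrained online gradient descent is exactly FTRL with a quadratic regularizer.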

Self-concordant barriers
Definition (self-concordant barrier, SCB). Let $\nu \ge 0$. A $C^3$ function $R : \operatorname{int}(K) \to \mathbb{R}$ is a $\nu$-self-concordant barrier for $K$ if for any sequence $(z_s)_{s=1}^\infty \subset \operatorname{int}(K)$ with $z_s \to \partial K$ we have $R(z_s) \to \infty$, and for all $x \in \operatorname{int}(K)$ and $y \in \mathbb{R}^n$ the following inequalities hold:
$$\bigl|\nabla^3 R(x)[y, y, y]\bigr| \le 2\,\|y\|_x^3,
\qquad
\bigl|\nabla R(x)^\top y\bigr| \le \nu^{1/2}\,\|y\|_x,$$
where $\|z\|_x^2 = \|z\|_{\nabla^2 R(x)}^2 = z^\top \nabla^2 R(x)\, z$.

Examples of barriers
$K = B_1$: $R(x) = -\log(1 - \|x\|^2)$ is a 1-self-concordant barrier.
$K = \{x : a_i^\top x \le b_i,\ i = 1, \ldots, m\}$: $R(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x)$ is an $m$-self-concordant barrier.
Existence of a universal barrier [Nesterov & Nemirovski, 1994]: every closed convex domain $K$ admits an $O(d)$-self-concordant barrier.
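As a small illustration of the polytope barrier (my own helper, with a box as the running example), the barrier value, gradient, and Hessian are all cheap to evaluate:

```python
import numpy as np

def log_barrier(A, b, x):
    """m-self-concordant barrier R(x) = -sum_i log(b_i - a_i^T x)
    for the polytope {x : A x <= b}, together with its gradient and Hessian."""
    s = b - A @ x                          # slacks; must be positive (x in int(K))
    if np.any(s <= 0):
        raise ValueError("x is not in the interior of the polytope")
    value = -np.sum(np.log(s))
    grad = A.T @ (1.0 / s)
    hess = A.T @ np.diag(1.0 / s**2) @ A
    return value, grad, hess

# Example: the box [-1, 1]^2 written as A x <= b (m = 4 constraints).
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
value, grad, hess = log_barrier(A, b, np.array([0.2, -0.5]))
```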

Properties of self-concordant barriers
Translation invariance: for any constant $c \in \mathbb{R}$, $R + c$ is also an SCB (so w.l.o.g. we assume $\min_{z \in K} R(z) = 0$).
Dikin ellipsoid contained in the interior: let $E(x) = \{y \in \mathbb{R}^n : \|y - x\|_x \le 1\}$. Then for any $x \in \operatorname{int}(K)$, $E(x) \subseteq \operatorname{int}(K)$.
Logarithmic growth away from the boundary: for any $\epsilon \in (0, 1]$, let $y = \operatorname{argmin}_{z\in K} R(z)$ and $K_{y,\epsilon} = \{y + (1-\epsilon)(x - y) : x \in K\}$. Then for all $x \in K_{y,\epsilon}$, $R(x) \le \nu \log(1/\epsilon)$.
Proximity to the minimizer: if $\|\nabla R(x)\|_{x,*} \le \tfrac12$, then $\|x - \operatorname{argmin}_z R(z)\|_x \le 2\,\|\nabla R(x)\|_{x,*}$.

Adjusting to the local geometry
Let $A \succ 0$ be a symmetric positive definite matrix.
Sampling around the ellipsoid given by $A$ instead of the Euclidean ball: $u \leftarrow$ SAMPLE($S_1$), $x \leftarrow y + \delta A u$.
Smoothing over the ellipsoid given by $A$ instead of the Euclidean ball: $\hat f(x) = \mathbb{E}_{v\sim U(B_1)}\bigl[f(x + \delta A v)\bigr]$.
One-point gradient estimate based on $A$: $\hat g = \frac{d}{\delta}\, f(x + \delta A u)\, A^{-1} u$, with $\mathbb{E}_{u\sim U(S_1)}[\hat g] = \nabla \hat f(x)$.
Local norm bound: $\|\hat g\|_{A^2}^2 = \hat g^\top A^2 \hat g \le \frac{d^2}{\delta^2}\,C^2$.

BANDITFTRL [Abernethy et al., 2008; Saha and Tewari, 2011]
BANDITFTRL($R, \delta, \eta, T, x_1$):
  For $t = 1$ to $T$:
    $u_t \leftarrow$ SAMPLE($U(S_1)$)
    $y_t \leftarrow x_t + \delta\,(\nabla^2 R(x_t))^{-1/2}\, u_t$
    PLAY($y_t$)
    $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
    $\hat g_t \leftarrow \frac{d}{\delta}\, f_t(y_t)\,(\nabla^2 R(x_t))^{1/2}\, u_t$
    $x_{t+1} \leftarrow \operatorname*{argmin}_{x\in\mathbb{R}^d}\; \eta\,\hat g_{1:t}^\top x + R(x)$
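The following Python sketch puts the pieces together for the special case $K$ = unit ball with the log barrier from the examples above; the damped-Newton FTRL solver and all helper names are my own simplifications, not the authors' implementation.

```python
import numpy as np

def ball_barrier_grad_hess(x):
    """Gradient and Hessian of R(x) = -log(1 - ||x||^2), the 1-SCB for B_1."""
    s = 1.0 - x @ x
    grad = 2.0 * x / s
    hess = (2.0 / s) * np.eye(x.size) + (4.0 / s**2) * np.outer(x, x)
    return grad, hess

def ftrl_step(g_sum, eta, x0, n_newton=25):
    """Approximate argmin_x eta * <g_sum, x> + R(x) by damped Newton steps;
    the 1/(1 + lambda) damping keeps iterates inside the barrier's domain."""
    x = x0.copy()
    for _ in range(n_newton):
        grad_R, hess_R = ball_barrier_grad_hess(x)
        grad_F = eta * g_sum + grad_R
        step = np.linalg.solve(hess_R, -grad_F)
        lam = np.sqrt(step @ hess_R @ step)          # Newton decrement
        x = x + step / (1.0 + lam)
    return x

def bandit_ftrl(loss_oracle, d, T, eta, delta, rng=None):
    """Sketch of BANDITFTRL on K = unit ball (delta <= 1 keeps y_t inside K)."""
    rng = rng or np.random.default_rng(0)
    x, g_sum = np.zeros(d), np.zeros(d)
    for t in range(T):
        _, hess_R = ball_barrier_grad_hess(x)
        w, V = np.linalg.eigh(hess_R)                # symmetric square roots of the Hessian
        H_sqrt = V @ np.diag(np.sqrt(w)) @ V.T
        H_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                       # u_t ~ U(S_1)
        y = x + delta * (H_inv_sqrt @ u)             # sample on the Dikin ellipsoid
        loss = loss_oracle(t, y)                     # bandit feedback only
        g_hat = (d / delta) * loss * (H_sqrt @ u)    # ellipsoidal one-point estimate
        g_sum += g_hat
        x = ftrl_step(g_sum, eta, x)                 # FTRL update via damped Newton
    return x
```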

Analysis of BANDITFTRL
Theorem (Abernethy et al., 2008; Saha and Tewari, 2011). Assume $\operatorname{diam}(K) \le D$. Let $R$ be a self-concordant barrier for $K$, $\|f_t\|_\infty \le C$, and each $f_t$ $L$-Lipschitz. Then the regret of BANDITFTRL is upper bounded as follows:
If $(f_t)_{t=1}^T$ are linear functions, then $\operatorname{Reg}_T(\text{BANDITFTRL}) = \tilde O(T^{1/2})$.
If $(f_t)_{t=1}^T$ have Lipschitz gradients, then $\operatorname{Reg}_T(\text{BANDITFTRL}) = \tilde O(T^{2/3})$.

Proof of BANDITFTRL regret: linear case
Approximation error of the smoothed losses: let
$$x^* = \operatorname*{argmin}_{x\in K} \sum_{t=1}^T f_t(x),
\qquad
x^*_\epsilon = \operatorname*{argmin}_{y\in K,\ \operatorname{dist}(y,\,\partial K) > \epsilon} \|y - x^*\|.$$
Because the $f_t$ are linear,
$$\operatorname{Reg}_T(\text{BANDITFTRL}) = \mathbb{E}\Bigl[\sum_{t=1}^T f_t(y_t) - f_t(x^*)\Bigr]
= \mathbb{E}\Bigl[\sum_{t=1}^T \bigl(f_t(y_t) - \hat f_t(y_t)\bigr) + \bigl(\hat f_t(y_t) - \hat f_t(x_t)\bigr) + \bigl(\hat f_t(x_t) - \hat f_t(x^*_\epsilon)\bigr) + \bigl(\hat f_t(x^*_\epsilon) - f_t(x^*_\epsilon)\bigr) + \bigl(f_t(x^*_\epsilon) - f_t(x^*)\bigr)\Bigr]$$
$$\le \mathbb{E}\Bigl[\sum_{t=1}^T \hat f_t(x_t) - \hat f_t(x^*_\epsilon)\Bigr] + \epsilon L T
= \mathbb{E}\Bigl[\sum_{t=1}^T \hat g_t^\top (x_t - x^*_\epsilon)\Bigr] + \epsilon L T.$$

Proof of BANDITFTRL regret: linear case (continued)
Claim: for any $z \in K$, $\sum_{t=1}^T \hat g_t^\top (x_{t+1} - z) \le \frac{1}{\eta} R(z)$.
The case $T = 1$ is true by the definition of $x_2$. Assuming the statement is true for $T - 1$,
$$\sum_{t=1}^{T} \hat g_t^\top x_{t+1}
= \sum_{t=1}^{T-1} \hat g_t^\top x_{t+1} + \hat g_T^\top x_{T+1}
\le \frac{1}{\eta} R(x_{T+1}) + \sum_{t=1}^{T-1} \hat g_t^\top x_{T+1} + \hat g_T^\top x_{T+1}
= \frac{1}{\eta} R(x_{T+1}) + \sum_{t=1}^{T} \hat g_t^\top x_{T+1}
\le \frac{1}{\eta} R(z) + \sum_{t=1}^{T} \hat g_t^\top z,$$
where the first inequality applies the inductive hypothesis with $z = x_{T+1}$, and the last one uses the fact that $x_{T+1}$ minimizes $x \mapsto \eta\,\hat g_{1:T}^\top x + R(x)$.

Proof of BANDITFTRL regret: linear case (continued)
To bound $\mathbb{E}\bigl[\sum_t \hat f_t(x_t) - \hat f_t(x^*_\epsilon)\bigr]$, use the proximity-to-minimizer property of SCBs. Recall that $x_{t+1} = \operatorname{argmin}_{x} \eta\,\hat g_{1:t}^\top x + R(x)$ and that $F_t(x) = \eta\,\hat g_{1:t}^\top x + R(x)$ is itself a self-concordant barrier. Then
$$\mathbb{E}\Bigl[\sum_t \bigl(\hat f_t(x_t) - \hat f_t(x_{t+1})\bigr) + \bigl(\hat f_t(x_{t+1}) - \hat f_t(x^*_\epsilon)\bigr)\Bigr]
\le \mathbb{E}\Bigl[\sum_t \|\hat g_t\|_{x_t,*}\,\|x_t - x_{t+1}\|_{x_t}\Bigr] + \frac{1}{\eta} R(x^*_\epsilon).$$
Proximity bound: $\|x_t - x_{t+1}\|_{x_t} \le \|\nabla F_t(x_t)\|_{x_t,*} = \eta\,\|\hat g_t\|_{x_t,*}$.

Proof of BANDITFTRL regret: linear case (continued)
Combining the pieces,
$$\operatorname{Reg}_T(\text{BANDITFTRL}) \le \epsilon L T + \mathbb{E}\Bigl[\sum_{t=1}^T \eta\,\|\hat g_t\|_{x_t,*}^2\Bigr] + \frac{1}{\eta} R(x^*_\epsilon).$$
By the local norm bound, $\mathbb{E}\bigl[\eta\,\|\hat g_t\|_{x_t,*}^2\bigr] \le \eta\,\frac{C^2 d^2}{\delta^2}$. By the logarithmic growth of the SCB, $\frac{1}{\eta} R(x^*_\epsilon) \le \frac{\nu}{\eta}\log\frac{1}{\epsilon}$. Hence
$$\operatorname{Reg}_T(\text{BANDITFTRL}) \le \epsilon L T + T\eta\,\frac{C^2 d^2}{\delta^2} + \frac{\nu}{\eta}\log\frac{1}{\epsilon}.$$

Variance Reduction

Issues with BANDITFTRL in the non-linear case
Approximation error of $\hat f$ vs. $f$: $\delta T$ for $C^{0,1}$ functions, $\delta^2 T$ for $C^{1,1}$ functions.
Variance of the gradient estimates: $\mathbb{E}\bigl[\|\hat g_t\|_{x_t,*}^2\bigr] \le \frac{C^2 d^2}{\delta^2}$.
Regret for non-linear loss functions: $O(T^{3/4})$ for $C^{0,1}$ functions, $O(T^{2/3})$ for $C^{1,1}$ functions.
Question: can we reduce the variance of the gradient estimates to improve the regret?

Variance reduction
Observation [Dekel et al., 2015]: if $\bar g_t = \frac{1}{k+1}\sum_{i=0}^k \hat g_{t-i}$, then
$$\|\bar g_t\|_{x_t,*}^2 = O\Bigl(\frac{C^2 d^2}{\delta^2\,(k+1)}\Bigr).$$
Note: the averaged gradient $\bar g_t$ is no longer an unbiased estimate of $\nabla \hat f_t(x_t)$.
Idea: if $f_t$ is sufficiently regular, then the bias will still be manageable.

Improving variance reduction via optimism
Optimistic FTRL [Rakhlin and Sridharan, 2013]:
$$x_{t+1} \leftarrow \operatorname*{argmin}_{x\in K}\;(g_{1:t} + \tilde g_{t+1})^\top x + R(x),
\qquad
\sum_t f_t(x_t) - f_t(x^*) \le \sum_t \eta\,\|g_t - \tilde g_t\|_{x_t,*}^2 + \frac{1}{\eta} R(x^*).$$
By re-centering the averaged gradient at each step, we can further reduce the variance:
$$\tilde g_t = \frac{1}{k+1}\sum_{i=1}^k \hat g_{t-i}.$$
Variance of the re-centered averaged gradients:
$$\|\bar g_t - \tilde g_t\|_{x_t,*}^2 = \frac{1}{(k+1)^2}\,\|\hat g_t\|_{x_t,*}^2 = O\Bigl(\frac{C^2 d^2}{\delta^2\,(k+1)^2}\Bigr).$$

BANDITFTRL-VR
BANDITFTRL-VR($R, \delta, \eta, k, T, x_1$):
  For $t = 1$ to $T$:
    $u_t \leftarrow$ SAMPLE($U(S_1)$)
    $y_t \leftarrow x_t + \delta\,(\nabla^2 R(x_t))^{-1/2}\, u_t$
    PLAY($y_t$)
    $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
    $\hat g_t \leftarrow \frac{d}{\delta}\, f_t(y_t)\,(\nabla^2 R(x_t))^{1/2}\, u_t$
    $\bar g_t \leftarrow \frac{1}{k+1}\sum_{i=0}^k \hat g_{t-i}$
    $\tilde g_{t+1} \leftarrow \frac{1}{k+1}\sum_{i=1}^k \hat g_{t+1-i}$
    $x_{t+1} \leftarrow \operatorname*{argmin}_{x\in\mathbb{R}^d}\; \eta\,(\bar g_{1:t} + \tilde g_{t+1})^\top x + R(x)$
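Relative to BANDITFTRL, only the gradient bookkeeping changes; below is a minimal sketch of that step, assuming an `ftrl_step` solver like the hypothetical damped-Newton helper in the BANDITFTRL sketch above.

```python
import numpy as np

def vr_ftrl_update(g_hat_history, g_bar_sum, k, eta, x, ftrl_step):
    """One BANDITFTRL-VR bookkeeping step, after the raw estimate ĝ_t is appended.

    g_hat_history -- list of one-point estimates ĝ_1, ..., ĝ_t (newest last)
    g_bar_sum     -- running sum ḡ_{1:t-1} of the averaged gradients
    ftrl_step     -- solver for argmin_x eta * <g, x> + R(x)
    """
    window = g_hat_history[-(k + 1):]                     # ĝ_{t-k}, ..., ĝ_t
    g_bar = np.sum(window, axis=0) / (k + 1)              # averaged gradient ḡ_t
    hint = (np.sum(window[1:], axis=0) / (k + 1)          # optimistic hint g̃_{t+1},
            if len(window) > 1 else np.zeros_like(g_bar)) # drops the oldest estimate
    g_bar_sum = g_bar_sum + g_bar                         # ḡ_{1:t}
    x_next = ftrl_step(g_bar_sum + hint, eta, x)          # optimistic FTRL step
    return x_next, g_bar_sum
```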

Analysis of BANDITFTRL-VR
Theorem (Mohri & Yang, 2016). Assume $\operatorname{diam}(K) \le D$. Let $R$ be a self-concordant barrier for $K$, $\|f_t\|_\infty \le C$, and each $f_t$ $L$-Lipschitz. Then the regret of BANDITFTRL-VR is upper bounded as follows:
If $(f_t)_{t=1}^T$ are Lipschitz, then $\operatorname{Reg}_T(\text{BANDITFTRL-VR}) = \tilde O(T^{11/16})$.
If $(f_t)_{t=1}^T$ have Lipschitz gradients, then $\operatorname{Reg}_T(\text{BANDITFTRL-VR}) = \tilde O(T^{8/13})$.

Proof of BANDITFTRL-VR regret: Lipschitz case
Approximation: real to smoothed losses. Relate the global optimum $x^*$ to the projected optimum $x^*_\epsilon$, and use the Lipschitz property of the losses to relate $y_t$ to $x_t$ and $\hat f_t$ to $f_t$:
$$\operatorname{Reg}_T(\text{BANDITFTRL-VR}) = \mathbb{E}\Bigl[\sum_{t=1}^T f_t(y_t) - f_t(x^*)\Bigr]
\le \epsilon L T + 2L\delta D T + \mathbb{E}\Bigl[\sum_{t=1}^T \hat f_t(x_t) - \hat f_t(x^*_\epsilon)\Bigr].$$

Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Approximation: smoothed to averaged losses.
$$\mathbb{E}\Bigl[\sum_{t=1}^T \hat f_t(x_t) - \hat f_t(x^*_\epsilon)\Bigr]
= \mathbb{E}\Bigl[\sum_{t=1}^T \Bigl(\hat f_t(x_t) - \frac{1}{k+1}\sum_{i=0}^k \hat f_{t-i}(x_{t-i})\Bigr)
+ \frac{1}{k+1}\sum_{i=0}^k \Bigl(\hat f_{t-i}(x_{t-i}) - \hat f_t(x^*_\epsilon)\Bigr)\Bigr]$$
$$\le \frac{Ck}{2} + L\,T \sup_{t\in[1,T],\, i\in[0,\,k\wedge t]} \mathbb{E}\bigl[\|x_{t-i} - x_t\|_2\bigr]
+ \mathbb{E}\Bigl[\sum_{t=1}^T \bar g_t^\top (x_t - x^*_\epsilon)\Bigr].$$

Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
FTRL analysis on the averaged gradients with re-centering:
$$\mathbb{E}\Bigl[\sum_{t=1}^T \bar g_t^\top (x_t - x^*_\epsilon)\Bigr] \le \frac{2C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} R(x^*_\epsilon).$$
Cumulative analysis:
$$\operatorname{Reg}_T(\text{BANDITFTRL-VR}) \le \epsilon L T + 2L\delta D T + \frac{Ck}{2} + \frac{2C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} R(x^*_\epsilon) + L T \sup_{t\in[1,T],\, i\in[0,\,k\wedge t]} \mathbb{E}\bigl[\|x_{t-i} - x_t\|_2\bigr].$$

Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Stability estimate for the actions: we want to bound $\sup_{t\in[1,T],\, i\in[0,\,k\wedge t]} \mathbb{E}\bigl[\|x_{t-i} - x_t\|_2\bigr]$.
Fact: let $D$ be the diameter of $K$. For any $x \in K$ and $z \in \mathbb{R}^d$, $\frac{1}{D}\|z\|_{x,*} \le \|z\|_2 \le D\,\|z\|_x$.
By the triangle inequality and this equivalence of norms,
$$\mathbb{E}\bigl[\|x_{t-i} - x_t\|_2\bigr] \le \sum_{s=t-i}^{t-1}\mathbb{E}\bigl[\|x_s - x_{s+1}\|_2\bigr]
\le D\sum_{s=t-i}^{t-1}\mathbb{E}\bigl[\|x_s - x_{s+1}\|_{x_s}\bigr]
\le D\sum_{s=t-i}^{t-1} 2\eta\,\mathbb{E}\bigl[\|\bar g_s + \tilde g_{s+1} - \tilde g_s\|_{x_s,*}\bigr].$$

Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
By the definitions of $\bar g_s$ and $\tilde g_s$,
$$\bar g_s + \tilde g_{s+1} - \tilde g_s = \frac{1}{k+1}\sum_{i=0}^{k-1}\hat g_{s-i} + \frac{1}{k+1}\,\hat g_s.$$
Thus, splitting each $\hat g_{s-i}$ into $\mathbb{E}_{s-i}[\hat g_{s-i}]$ plus a martingale-difference part,
$$\mathbb{E}\bigl[\|\bar g_s + \tilde g_{s+1} - \tilde g_s\|_{x_s,*}^2\bigr]
\le \frac{3}{k^2}\,\mathbb{E}\bigl[\|\hat g_s\|_{x_s,*}^2\bigr]
+ \frac{3}{k^2}\,\Bigl\|\sum_{i=0}^{k-1}\mathbb{E}_{s-i}[\hat g_{s-i}]\Bigr\|_{x_s,*}^2
+ \frac{3}{k^2}\,\mathbb{E}\Bigl[\Bigl\|\sum_{i=0}^{k-1}\bigl(\hat g_{s-i} - \mathbb{E}_{s-i}[\hat g_{s-i}]\bigr)\Bigr\|_{x_s,*}^2\Bigr]$$
$$\le \frac{3}{k^2}\,\frac{C^2 d^2}{\delta^2} + 2D^2L^2 + \frac{3}{k^2}\,\mathbb{E}\Bigl[\Bigl\|\sum_{i=0}^{k-1}\bigl(\hat g_{s-i} - \mathbb{E}_{s-i}[\hat g_{s-i}]\bigr)\Bigr\|_{x_s,*}^2\Bigr].$$

Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Fact: for all $0 \le i \le k$ such that $t - i \ge 1$, $\frac12\|z\|_{x_{t-i},*} \le \|z\|_{x_t,*} \le 2\,\|z\|_{x_{t-i},*}$.
Because the terms in the sum form a martingale difference sequence,
$$\mathbb{E}\Bigl[\Bigl\|\sum_{i=0}^{k-1}\bigl(\hat g_{s-i} - \mathbb{E}_{s-i}[\hat g_{s-i}]\bigr)\Bigr\|_{x_s,*}^2\Bigr]
\le 4\,\mathbb{E}\Bigl[\Bigl\|\sum_{i=0}^{k-1}\bigl(\hat g_{s-i} - \mathbb{E}_{s-i}[\hat g_{s-i}]\bigr)\Bigr\|_{x_{s-k},*}^2\Bigr]
= 4\sum_{i=0}^{k-1}\mathbb{E}\bigl[\|\hat g_{s-i} - \mathbb{E}_{s-i}[\hat g_{s-i}]\|_{x_{s-k},*}^2\bigr]$$
$$\le 16\sum_{i=0}^{k-1}\mathbb{E}\bigl[\|\hat g_{s-i} - \mathbb{E}_{s-i}[\hat g_{s-i}]\|_{x_{s-i},*}^2\bigr]
\le 16\sum_{i=0}^{k-1}\mathbb{E}\bigl[\|\hat g_{s-i}\|_{x_{s-i},*}^2\bigr]
\le 16\sum_{i=0}^{k-1}\frac{C^2 d^2}{\delta^2} = 16k\,\frac{C^2 d^2}{\delta^2}.$$

Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
By combining the components of the stability estimate with the previous calculations,
$$\mathbb{E}\bigl[\|x_{t-i} - x_t\|_2\bigr] \le 2\eta D \sum_{s=t-i}^{t-1}\sqrt{\frac{3C^2d^2}{k^2\delta^2} + 2D^2L^2 + \frac{48\,C^2 d^2}{k\,\delta^2}}
\;\le\; 2\eta k D\,\sqrt{\frac{3C^2d^2}{k^2\delta^2} + 2D^2L^2 + \frac{48\,C^2 d^2}{k\,\delta^2}}.$$
Hence
$$\operatorname{Reg}_T(\text{BANDITFTRL-VR}) \le \epsilon L T + 2L\delta D T + \frac{Ck}{2} + \frac{2C^2 d^2 \eta T}{\delta^2(k+1)^2} + \frac{1}{\eta}\log\frac{1}{\epsilon}
+ 2\eta k D L T\,\sqrt{\frac{3C^2d^2}{k^2\delta^2} + 2D^2L^2 + \frac{48\,C^2 d^2}{k\,\delta^2}}.$$
Now set $\eta = T^{-11/16} d^{-3/8}$, $\delta = T^{-5/16} d^{3/8}$, $k = T^{1/8} d^{-1/4}$.

Discussion of BANDITFTRL-VR regret: Lipschitz-gradient case
The approximation of the real losses by the smoothed losses now incurs a $\delta^2 D^2 T$ penalty instead of $\delta D T$ (with $H$ the Lipschitz constant of the gradients); the rest of the analysis goes through with some changes in constants. General regret bound:
$$\operatorname{Reg}_T(\text{BANDITFTRL-VR}) \le \epsilon L T + H\delta^2 D^2 T + Ck + \frac{2C^2 d^2\eta T}{\delta^2(k+1)^2} + \frac{1}{\eta}\log\frac{1}{\epsilon}
+ (TL + DHT)\,2\eta k D\,\sqrt{\frac{3C^2d^2}{k^2\delta^2} + 2D^2L^2 + \frac{48\,C^2 d^2}{k\,\delta^2}}.$$
Now set $\eta = T^{-8/13} d^{-5/6}$, $\delta = T^{-5/26} d^{1/3}$, $k = T^{1/13} d^{-5/3}$.

Other methods

Other BCO methods
Strongly convex loss functions (augment $R$ in BANDITFTRL with additional regularization):
$C^{0,1}$ [Agarwal et al., 2010]: $O(T^{2/3})$ regret.
$C^{1,1}$ [Hazan & Levy, 2014]: $O(T^{1/2})$ regret.
Other types of algorithms:
Ellipsoid-method-based algorithm [Hazan and Li, 2016]: $O\bigl(2^{d^4}\log(T)^{2d}\,T^{1/2}\bigr)$.
Kernel-based algorithm [Bubeck et al., 2017]: $O\bigl(d^{9.5}\,T^{1/2}\bigr)$.


Conclusion
BCO is a flexible framework for modeling learning problems with sequential data and very limited feedback.
BCO generalizes many existing models of online learning and optimization.
State-of-the-art algorithms leverage techniques from online convex optimization and interior-point methods.
Efficient algorithms attaining the optimal guarantees in the $C^{0,1}$ and $C^{1,1}$ cases remain an open problem.