Bandit Convex Optimization
March 7, 2017
Table of Contents
1. Bandit Convex Optimization (BCO)
2. Projection Methods
3. Barrier Methods
4. Variance reduction
5. Other methods
6. Conclusion
Learning scenario
Compact convex action set $K \subseteq \mathbb{R}^d$. For $t = 1$ to $T$:
- Predict $x_t \in K$.
- Receive convex loss function $f_t \colon K \to \mathbb{R}$.
- Incur loss $f_t(x_t)$.
Bandit setting: only the loss is revealed, no other information.
Regret of algorithm $A$: $\mathrm{Reg}_T(A) = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x)$.
Related settings
- Online convex optimization: $\nabla f_t$ (and maybe $\nabla^2 f_t$, $\nabla^3 f_t$, ...) known at each round.
- Multi-armed bandit: $K = \{1, 2, \ldots, K\}$ discrete.
- Zeroth-order optimization: $f_t = f$.
- Stochastic bandit convex optimization: $f_t(x) = f(x) + \epsilon_t$, $\epsilon_t \sim D$, a noisy estimate.
- Multi-point bandit convex optimization: query $f_t$ at points $(x_{t,i})_{i=1}^m$, $m \geq 2$.
Gradient descent without a gradient
Online gradient descent algorithm: $x_{t+1} \leftarrow x_t - \eta \nabla f_t(x_t)$.
BCO setting: $\nabla f_t(x_t)$ is not known!
BCO idea: find $\hat{g}_t$ such that $\hat{g}_t \approx \nabla f_t(x_t)$, and update $x_{t+1} \leftarrow x_t - \eta \hat{g}_t$.
Question: how do we pick $\hat{g}_t$?
Single-point gradient estimates (one dimension)
By the fundamental theorem of calculus,
$\frac{d}{dx} \frac{1}{2\delta} \int_{-\delta}^{\delta} f(x+y)\,dy = \frac{1}{2\delta}\big[f(x+\delta) - f(x-\delta)\big] = \mathbb{E}_{z \sim D}\Big[\frac{1}{\delta} f(x+z) \frac{z}{|z|}\Big]$,
where $D$ puts $z = \delta$ w.p. $\frac{1}{2}$ and $z = -\delta$ w.p. $\frac{1}{2}$.
With enough regularity (e.g. $f$ Lipschitz), $\frac{d}{dx} \frac{1}{2\delta} \int_{-\delta}^{\delta} f(x+y)\,dy = \frac{1}{2\delta} \int_{-\delta}^{\delta} f'(x+y)\,dy$.
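A small numerical sketch of this identity, using a hypothetical toy loss $f(x) = x^2$ (the expectation over the two-point distribution $D$ is computed exactly rather than sampled):

```python
# One-dimensional single-point estimate: E_{z~D}[f(x+z) z / (delta |z|)]
# equals the derivative of the delta-smoothed loss. f is a toy example.
def f(x):
    return x ** 2

x, delta = 1.0, 0.1

# Exact expectation over D: z = +delta or z = -delta, each with prob. 1/2.
est = 0.5 * (f(x + delta) * (1 / delta) + f(x - delta) * (-1 / delta))

# Derivative of the smoothed loss via the fundamental theorem of calculus.
smoothed_deriv = (f(x + delta) - f(x - delta)) / (2 * delta)

print(est, smoothed_deriv)  # both equal 2.0 for this f and x
```

Note that for this quadratic the estimate also matches $f'(1) = 2$ exactly, since smoothing a quadratic only shifts it by a constant.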
Single-point gradient estimates (higher dimensions)
$B_1 = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$, $S_1 = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$, and $\mathrm{vol}(A) = \int_A dy$.
By Stokes' theorem,
$\nabla \frac{1}{\mathrm{vol}(\delta B_1)} \int_{\delta B_1} f(x+y)\,dy = \frac{1}{\mathrm{vol}(\delta B_1)} \int_{\delta S_1} f(x+z) \frac{z}{\|z\|}\,dz = \frac{\mathrm{vol}(\delta S_1)}{\mathrm{vol}(\delta B_1)}\, \mathbb{E}_{z \sim U(S_1)}[f(x+\delta z)\, z] = \frac{d}{\delta}\, \mathbb{E}_{z \sim U(S_1)}[f(x+\delta z)\, z]$.
With enough regularity on $f$, $\nabla \frac{1}{\mathrm{vol}(\delta B_1)} \int_{\delta B_1} f(x+y)\,dy = \frac{1}{\mathrm{vol}(\delta B_1)} \int_{\delta B_1} \nabla f(x+y)\,dy$.
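A Monte Carlo check of the sphere-sampling identity, again with a hypothetical quadratic loss $f(v) = \|v\|^2$ (so the gradient of the smoothed loss equals $2x$, since smoothing a quadratic only adds a constant):

```python
import numpy as np

# Monte Carlo check of (d/delta) E_{z~U(S_1)}[f(x + delta z) z] ~ grad f,
# for the toy quadratic f(v) = ||v||^2 with grad f(x) = 2x.
rng = np.random.default_rng(0)
d, delta = 3, 0.1
x = np.array([0.5, -0.3, 0.2])

n = 200_000
z = rng.standard_normal((n, d))
z /= np.linalg.norm(z, axis=1, keepdims=True)   # uniform on the sphere S_1

vals = np.sum((x + delta * z) ** 2, axis=1)     # f(x + delta z)
g_hat = (d / delta) * np.mean(vals[:, None] * z, axis=0)

print(g_hat)  # close to 2x = [1.0, -0.6, 0.4]
```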
Projection method [Flaxman et al., 2005]
Let $\hat{f}(x) = \frac{1}{\mathrm{vol}(\delta B_1)} \int_{\delta B_1} f(x+y)\,dy$.
- Estimate $\nabla \hat{f}(x)$ by sampling on $\delta S_1(x)$.
- Project the gradient descent update to keep samples inside $K$: $K_\delta = (1 - \delta) K$.
BANDITPGD$(T, \eta, \delta)$:
  $x_1 \leftarrow 0$
  For $t = 1, 2, \ldots, T$:
    $u_t \leftarrow \mathrm{SAMPLE}(S_1)$
    $y_t \leftarrow x_t + \delta u_t$
    $\mathrm{PLAY}(y_t)$
    $f_t(y_t) \leftarrow \mathrm{RECEIVELOSS}(y_t)$
    $\hat{g}_t \leftarrow \frac{d}{\delta} f_t(y_t) u_t$
    $x_{t+1} \leftarrow \Pi_{K_\delta}(x_t - \eta \hat{g}_t)$
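A runnable sketch of BANDITPGD, assuming $K$ is the unit Euclidean ball (so the projection onto $K_\delta = (1-\delta)K$ is just a norm clip) and a fixed hypothetical quadratic loss; both choices are illustrative, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 2, 2000
eta, delta = 1e-3, 0.05
x_star = np.array([0.3, -0.2])          # hidden optimum of the toy losses

def loss(v):                            # f_t(v) = ||v - x*||^2 for every t
    return float(np.sum((v - x_star) ** 2))

def project(v):                         # projection onto K_delta = (1-delta)K
    r = np.linalg.norm(v)
    c = 1.0 - delta
    return v if r <= c else v * (c / r)

x = np.zeros(d)                         # x_1 = 0
for t in range(T):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)              # u_t ~ uniform on the sphere S_1
    y = x + delta * u                   # played point; lies in K by design
    g_hat = (d / delta) * loss(y) * u   # single-point gradient estimate
    x = project(x - eta * g_hat)

print(np.linalg.norm(x - x_star))       # small: iterates approach x*
```

Since $x_t \in (1-\delta)K$ and $\|\delta u_t\| = \delta$, every played point $y_t$ stays inside $K$, which is exactly what the projection is for.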
Analysis of BANDITPGD
Theorem (Flaxman et al., 2005). Assume $\mathrm{diam}(K) \leq D$, $|f_t| \leq C$, and $\|\nabla f_t\| \leq L$. Then after $T$ rounds, the (expected) regret of the BANDITPGD algorithm is bounded by
$\frac{D^2}{2\eta} + \frac{\eta C^2 d^2 T}{2\delta^2} + \delta(d+2)LT$.
In particular, by setting $\eta = \frac{\delta D}{C d \sqrt{T}}$ and $\delta = \sqrt{\frac{DCd}{(d+2)L\,T^{1/2}}}$, the regret is upper bounded by $O(d^{1/2} T^{3/4})$.
Proof of BANDITPGD regret
For any $x \in K$, let $x_\delta = \Pi_{K_\delta}(x)$, and let $\hat{f}_t$ denote the smoothed loss ($\hat{f}_t(z) \approx f_t(z)$). Then
$\sum_t \mathbb{E}[f_t(y_t) - f_t(x^*)]$
$= \sum_t \mathbb{E}\big[f_t(y_t) - f_t(x_t) + f_t(x_t) - \hat{f}_t(x_t) + \hat{f}_t(x_t) - \hat{f}_t(x^*_\delta) + \hat{f}_t(x^*_\delta) - f_t(x^*_\delta) + f_t(x^*_\delta) - f_t(x^*)\big]$
$\leq \sum_t \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(x^*_\delta)\big] + \big[2\delta L T + \delta d L T\big]$
$= \sum_t \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(x^*_\delta)\big] + \delta(d+2)LT$.
Proof of BANDITPGD regret
$\mathbb{E}[\|\hat{g}_t\|^2] \leq \frac{C^2 d^2}{\delta^2}$. Thus,
$\sum_t \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(x^*_\delta)\big] \leq \sum_t \mathbb{E}\big[\nabla \hat{f}_t(x_t) \cdot (x_t - x^*_\delta)\big] = \sum_t \mathbb{E}\big[\hat{g}_t \cdot (x_t - x^*_\delta)\big]$
$\leq \frac{1}{2\eta} \sum_t \mathbb{E}\big[\eta^2 \|\hat{g}_t\|^2 + \|x_t - x^*_\delta\|^2 - \|x_t - \eta \hat{g}_t - x^*_\delta\|^2\big]$
$\leq \frac{1}{2\eta} \sum_t \mathbb{E}\Big[\eta^2 \frac{C^2 d^2}{\delta^2} + \|x_t - x^*_\delta\|^2 - \|x_{t+1} - x^*_\delta\|^2\Big]$
$\leq \frac{1}{2\eta} \mathbb{E}\Big[\|x_1 - x^*_\delta\|^2 + \eta^2 T \frac{C^2 d^2}{\delta^2}\Big] \leq \frac{1}{2\eta}\Big[D^2 + \eta^2 T \frac{C^2 d^2}{\delta^2}\Big]$.
Revisiting projection
- Goal of projection: keep $x_t \in K_\delta$ so that $y_t \in K$.
- Total cost of projection: $\delta d L T$.
- Deficiency: completely separate from the gradient descent update.
Question: is there a better way to ensure that $y_t \in K$?
From Gradient Descent to Follow-the-Regularized-Leader
Let $\hat{g}_{1:t} = \sum_{s=1}^t \hat{g}_s$.
Proximal form of gradient descent:
$x_{t+1} \leftarrow x_t - \eta \hat{g}_t \iff x_{t+1} \leftarrow \arg\min_{x \in \mathbb{R}^d} \eta \hat{g}_{1:t} \cdot x + \tfrac{1}{2}\|x\|^2$.
Follow-the-Regularized-Leader: for $R \colon \mathbb{R}^d \to \mathbb{R}$,
$x_{t+1} \leftarrow \arg\min_{x \in \mathbb{R}^d} \eta \hat{g}_{1:t} \cdot x + R(x)$.
BCO wishlist for $R$:
- Ensure that $x_{t+1}$ stays inside $K$.
- Leave enough room so that $y_{t+1} \in K$ as well.
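The proximal equivalence can be checked numerically; a sketch with $R(x) = \frac{1}{2}\|x\|^2$, for which the FTRL argmin has the closed form $-\eta \hat{g}_{1:t}$, matching gradient descent started at $0$ (the gradient sequence here is an arbitrary stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
eta, d, T = 0.1, 3, 50
gs = rng.standard_normal((T, d))      # stand-in gradient estimates g_t

# Gradient descent: x_{t+1} = x_t - eta * g_t, started at x_1 = 0.
x_gd = np.zeros(d)
for g in gs:
    x_gd = x_gd - eta * g

# FTRL with R(x) = ||x||^2 / 2: argmin_x eta * g_{1:T} . x + R(x)
# is solved in closed form by x = -eta * g_{1:T}.
x_ftrl = -eta * np.sum(gs, axis=0)

print(np.allclose(x_gd, x_ftrl))      # True: the two updates coincide
```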
Self-concordant barriers
Definition (Self-concordant barrier (SCB)). Let $\nu \geq 0$. A $C^3$ function $R \colon \mathrm{int}(K) \to \mathbb{R}$ is a $\nu$-self-concordant barrier for $K$ if for any sequence $(z_s)_{s=1}^\infty \subset \mathrm{int}(K)$ with $z_s \to \partial K$, we have $R(z_s) \to \infty$, and for all $x \in \mathrm{int}(K)$ and $y \in \mathbb{R}^d$ the following inequalities hold:
$|\nabla^3 R(x)[y, y, y]| \leq 2 \|y\|_x^3, \qquad \nabla R(x) \cdot y \leq \nu^{1/2} \|y\|_x$,
where $\|z\|_x^2 = \|z\|^2_{\nabla^2 R(x)} = z^\top \nabla^2 R(x)\, z$.
Examples of barriers
- $K = B_1$: $R(x) = -\log(1 - \|x\|^2)$ is 1-self-concordant.
- $K = \{x : a_i^\top x \leq b_i,\ i = 1, \ldots, m\}$: $R(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x)$ is $m$-self-concordant.
- Existence of a universal barrier [Nesterov & Nemirovski, 1994]: every closed convex domain $K$ admits an $O(d)$-self-concordant barrier.
Properties of self-concordant barriers
- Translation invariance: for any constant $c \in \mathbb{R}$, $R + c$ is also a SCB (so wlog we assume $\min_{z \in K} R(z) = 0$).
- Dikin ellipsoid contained in the interior: let $E(x) = \{y \in \mathbb{R}^d : \|y - x\|_x \leq 1\}$. Then for any $x \in \mathrm{int}(K)$, $E(x) \subseteq \mathrm{int}(K)$.
- Logarithmic growth away from the boundary: for any $\epsilon \in (0, 1]$, let $y = \arg\min_{z \in K} R(z)$ and $K_{y,\epsilon} = \{y + (1-\epsilon)(x - y) : x \in K\}$. Then for all $x \in K_{y,\epsilon}$, $R(x) \leq \nu \log(1/\epsilon)$.
- Proximity to minimizer: if $\|\nabla R(x)\|_{x,*} \leq \frac{1}{2}$, then $\|x - \arg\min R\|_x \leq 2 \|\nabla R(x)\|_{x,*}$.
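Two of these properties can be seen concretely for the log barrier on the interval $K = [-1, 1]$ (a one-dimensional instance of the polytope barrier above; the probe points are arbitrary):

```python
import math

# Log barrier for the interval K = [-1, 1]: R(x) = -log(1-x) - log(1+x).
def R(x):
    return -math.log(1 - x) - math.log(1 + x)

def R2(x):  # R''(x), the (1x1) Hessian defining the local norm ||.||_x
    return 1 / (1 - x) ** 2 + 1 / (1 + x) ** 2

# The barrier blows up toward the boundary of K ...
values = [R(0.9), R(0.99), R(0.999)]
print(values)  # strictly increasing

# ... and the Dikin ellipsoid {y : |y - x| <= R''(x)^(-1/2)} stays inside K.
radii_ok = all(
    -1 < x - R2(x) ** -0.5 and x + R2(x) ** -0.5 < 1
    for x in [0.0, 0.5, 0.9, 0.99]
)
print(radii_ok)  # True
```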
Adjusting to the local geometry
Let $A \succ 0$ be an SPD matrix.
- Sampling over the ellipsoid given by $A$ instead of the Euclidean ball: $u \leftarrow \mathrm{SAMPLE}(S_1)$, $x \leftarrow y + \delta A u$.
- Smoothing over $A$ instead of the Euclidean ball: $\hat{f}(x) = \mathbb{E}_{u \sim U(S_1)}[f(x + \delta A u)]$.
- One-point gradient estimate based on $A$: $\hat{g} = \frac{d}{\delta} f(x + \delta A u) A^{-1} u$, with $\mathbb{E}_{u \sim U(S_1)}[\hat{g}] = \nabla \hat{f}(x)$.
- Local norm bound: $\|\hat{g}\|^2_{A^2} \leq \frac{d^2}{\delta^2} C^2$.
BANDITFTRL [Abernethy et al., 2008; Saha and Tewari, 2011]
BANDITFTRL$(R, \delta, \eta, T, x_1)$:
  For $t \leftarrow 1$ to $T$:
    $u_t \leftarrow \mathrm{SAMPLE}(U(S_1))$
    $y_t \leftarrow x_t + \delta (\nabla^2 R(x_t))^{-1/2} u_t$
    $\mathrm{PLAY}(y_t)$
    $f_t(y_t) \leftarrow \mathrm{RECEIVELOSS}(y_t)$
    $\hat{g}_t \leftarrow \frac{d}{\delta} f_t(y_t) (\nabla^2 R(x_t))^{1/2} u_t$
    $x_{t+1} \leftarrow \arg\min_{x \in \mathbb{R}^d} \eta \hat{g}_{1:t} \cdot x + R(x)$
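A one-dimensional sketch of this scheme, assuming $K = [-1, 1]$ with the log barrier $R(x) = -\log(1-x) - \log(1+x)$ and a fixed hypothetical linear loss $f(x) = 0.8x$; for this particular barrier the FTRL argmin has a closed form (all concrete choices here are illustrative, not from the slides):

```python
import math
import random

random.seed(0)
eta, delta, T = 0.01, 0.5, 500

def R2(x):  # Hessian of R(x) = -log(1-x) - log(1+x) on K = [-1, 1]
    return 1 / (1 - x) ** 2 + 1 / (1 + x) ** 2

def ftrl_argmin(g_sum):
    # argmin_{x in (-1,1)} eta*g_sum*x - log(1-x) - log(1+x): setting the
    # derivative eta*g_sum + 2x/(1-x^2) to zero gives a quadratic in x.
    a = eta * g_sum
    if abs(a) < 1e-12:
        return 0.0
    return (1 - math.sqrt(1 + a * a)) / a

def f(x):               # fixed linear loss; its minimizer is the endpoint -1
    return 0.8 * x

x, g_sum, total = 0.0, 0.0, 0.0
for t in range(T):
    u = random.choice([-1.0, 1.0])          # the "sphere" S_1 in one dimension
    y = x + delta * u / math.sqrt(R2(x))    # Dikin-scaled exploration: y in K
    total += f(y)
    g_hat = (1 / delta) * f(y) * math.sqrt(R2(x)) * u
    g_sum += g_hat
    x = ftrl_argmin(g_sum)

print(x, total / T)   # x drifts toward -1; average incurred loss is negative
```

Because the exploration radius is scaled by the Dikin ellipsoid, the played points stay strictly inside $K$ without any projection step, which is the point of the barrier approach.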
Analysis of BANDITFTRL
Theorem (Abernethy et al., 2008; Saha and Tewari, 2011). Assume $\mathrm{diam}(K) \leq D$. Let $R$ be a self-concordant barrier for $K$, $|f_t| \leq C$, and $\|\nabla f_t\| \leq L$. Then the regret of BANDITFTRL is upper bounded as follows:
- If $(f_t)_{t=1}^T$ are linear functions, then $\mathrm{Reg}_T(\mathrm{BANDITFTRL}) = \tilde{O}(T^{1/2})$.
- If $(f_t)_{t=1}^T$ have Lipschitz gradients, then $\mathrm{Reg}_T(\mathrm{BANDITFTRL}) = \tilde{O}(T^{2/3})$.
Proof of BANDITFTRL regret: linear case
Approximation error of smoothed losses: let $x^* = \arg\min_{x \in K} \sum_t f_t(x)$ and $x^*_\epsilon = \arg\min_{y \in K,\ \mathrm{dist}(y, \partial K) > \epsilon} \|y - x^*\|$.
Because the $f_t$ are linear,
$\mathrm{Reg}_T(\mathrm{BANDITFTRL}) = \mathbb{E}\Big[\sum_t f_t(y_t) - f_t(x^*)\Big]$
$= \mathbb{E}\Big[\sum_t f_t(y_t) - \hat{f}_t(y_t) + \hat{f}_t(y_t) - \hat{f}_t(x_t) + \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon) + \hat{f}_t(x^*_\epsilon) - f_t(x^*_\epsilon) + f_t(x^*_\epsilon) - f_t(x^*)\Big]$
$\leq \mathbb{E}\Big[\sum_t \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\Big] + \epsilon L T = \mathbb{E}\Big[\sum_t \hat{g}_t \cdot (x_t - x^*_\epsilon)\Big] + \epsilon L T$.
Proof of BANDITFTRL regret: linear case
Claim: for any $z \in K$, $\sum_{t=1}^T \hat{g}_t \cdot (x_{t+1} - z) \leq \frac{1}{\eta} R(z)$.
- The $T = 1$ case is true by the definition of $x_2$.
- Assuming the statement is true for $T - 1$:
$\sum_{t=1}^{T} \hat{g}_t \cdot x_{t+1} = \sum_{t=1}^{T-1} \hat{g}_t \cdot x_{t+1} + \hat{g}_T \cdot x_{T+1} \leq \frac{1}{\eta} R(x_T) + \sum_{t=1}^{T-1} \hat{g}_t \cdot x_T + \hat{g}_T \cdot x_{T+1}$
$\leq \frac{1}{\eta} R(x_{T+1}) + \sum_{t=1}^{T-1} \hat{g}_t \cdot x_{T+1} + \hat{g}_T \cdot x_{T+1} \leq \frac{1}{\eta} R(z) + \sum_{t=1}^{T} \hat{g}_t \cdot z$.
Proof of BANDITFTRL regret: linear case
Bounding $\mathbb{E}\big[\sum_t \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\big]$ via proximity to the minimizer for SCBs.
Recall that $x_{t+1} = \arg\min_{x \in K} \eta \hat{g}_{1:t} \cdot x + R(x)$, and that $F_t(x) = \eta \hat{g}_{1:t} \cdot x + R(x)$ is a SCB.
$\mathbb{E}\Big[\sum_t \hat{f}_t(x_t) - \hat{f}_t(x_{t+1}) + \hat{f}_t(x_{t+1}) - \hat{f}_t(x^*_\epsilon)\Big] \leq \mathbb{E}\Big[\sum_t \|\hat{g}_t\|_{x_t,*} \|x_t - x_{t+1}\|_{x_t}\Big] + \frac{1}{\eta} R(x^*_\epsilon)$.
Proximity bound: $\|x_t - x_{t+1}\|_{x_t} \leq \|\nabla F_t(x_t)\|_{x_t,*} = \eta \|\hat{g}_t\|_{x_t,*}$.
Proof of BANDITFTRL regret: linear case
$\mathrm{Reg}_T(\mathrm{BANDITFTRL}) \leq \epsilon L T + \mathbb{E}\Big[\sum_t \eta \|\hat{g}_t\|^2_{x_t,*}\Big] + \frac{1}{\eta} R(x^*_\epsilon)$.
By the local norm bound: $\eta \|\hat{g}_t\|^2_{x_t,*} \leq \eta \frac{C^2 d^2}{\delta^2}$.
By the logarithmic growth of the SCB: $\frac{1}{\eta} R(x^*_\epsilon) \leq \frac{\nu}{\eta} \log\big(\frac{1}{\epsilon}\big)$.
Hence $\mathrm{Reg}_T(\mathrm{BANDITFTRL}) \leq \epsilon L T + \eta T \frac{C^2 d^2}{\delta^2} + \frac{\nu}{\eta} \log\big(\frac{1}{\epsilon}\big)$.
Issues with BANDITFTRL in the non-linear case
- Approximation error of $f \approx \hat{f}$: $\delta T$ for $C^{0,1}$ functions, $\delta^2 T$ for $C^{1,1}$ functions.
- Variance of the gradient estimates: $\mathbb{E}\big[\eta \|\hat{g}_t\|^2_{x_t,*}\big] \leq \eta \frac{C^2 d^2}{\delta^2}$.
- Regret for non-linear loss functions: $O(T^{3/4})$ for $C^{0,1}$ functions, $O(T^{2/3})$ for $C^{1,1}$ functions.
Question: can we reduce the variance of the gradient estimates to improve the regret?
Variance reduction
Observation [Dekel et al., 2015]: if $\bar{g}_t = \frac{1}{k+1} \sum_{i=0}^{k} \hat{g}_{t-i}$, then
$\|\bar{g}_t\|^2_{x_t,*} = O\Big(\frac{C^2 d^2}{\delta^2 (k+1)}\Big)$.
Note: the averaged gradient $\bar{g}_t$ is no longer an unbiased estimate of $\nabla \hat{f}_t$.
Idea: if $f_t$ is sufficiently regular, then the bias will still be manageable.
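The averaging effect behind this observation can be illustrated with a toy model of noisy estimates of a fixed gradient (the slides average a sliding window; disjoint windows are used below so the averaged estimates stay i.i.d.):

```python
import numpy as np

# Averaging k+1 independent noisy gradient estimates shrinks the variance
# by roughly a factor of k+1. The target gradient and noise level are toys.
rng = np.random.default_rng(0)
n, k = 20_000, 9                              # k + 1 = 10 estimates per average
g_hats = 1.5 + 4.0 * rng.standard_normal(n)   # noisy estimates of g = 1.5

g_bars = g_hats.reshape(-1, k + 1).mean(axis=1)

print(g_hats.var(), g_bars.var())             # variance drops by roughly 10x
```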
Improving variance reduction via optimism
Optimistic FTRL [Rakhlin and Sridharan, 2013]:
$x_{t+1} \leftarrow \arg\min_{x \in K} \eta (\bar{g}_{1:t} + \tilde{g}_{t+1}) \cdot x + R(x)$
$\sum_t f_t(x_t) - f_t(x^*) \leq \sum_t \eta \|\bar{g}_t - \tilde{g}_t\|^2_{x_t,*} + \frac{1}{\eta} R(x^*)$
By re-centering the averaged gradient at each step, we can further reduce the variance:
$\tilde{g}_t = \frac{1}{k+1} \sum_{i=1}^{k} \hat{g}_{t-i}$.
Variance of the re-centered averaged gradients:
$\|\bar{g}_t - \tilde{g}_t\|^2_{x_t,*} = \frac{1}{(k+1)^2} \|\hat{g}_t\|^2_{x_t,*} = O\Big(\frac{C^2 d^2}{\delta^2 (k+1)^2}\Big)$.
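The re-centering identity $\bar{g}_t - \tilde{g}_t = \hat{g}_t/(k+1)$ underlying this variance bound can be verified directly with arbitrary stand-in estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, t = 30, 4, 20
g_hat = rng.standard_normal(T)                 # stand-in estimates for each round

g_bar = g_hat[t - k : t + 1].sum() / (k + 1)   # (1/(k+1)) sum_{i=0}^{k} of round t-i
g_tilde = g_hat[t - k : t].sum() / (k + 1)     # (1/(k+1)) sum_{i=1}^{k} of round t-i

print(g_bar - g_tilde, g_hat[t] / (k + 1))     # equal: only round t's term remains
```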
BANDITFTRL-VR
BANDITFTRL-VR$(R, \delta, \eta, k, T, x_1)$:
  For $t \leftarrow 1$ to $T$:
    $u_t \leftarrow \mathrm{SAMPLE}(U(S_1))$
    $y_t \leftarrow x_t + \delta (\nabla^2 R(x_t))^{-1/2} u_t$
    $\mathrm{PLAY}(y_t)$
    $f_t(y_t) \leftarrow \mathrm{RECEIVELOSS}(y_t)$
    $\hat{g}_t \leftarrow \frac{d}{\delta} f_t(y_t) (\nabla^2 R(x_t))^{1/2} u_t$
    $\bar{g}_t \leftarrow \frac{1}{k+1} \sum_{i=0}^{k} \hat{g}_{t-i}$
    $\tilde{g}_{t+1} \leftarrow \frac{1}{k+1} \sum_{i=1}^{k} \hat{g}_{t+1-i}$
    $x_{t+1} \leftarrow \arg\min_{x \in \mathbb{R}^d} \eta (\bar{g}_{1:t} + \tilde{g}_{t+1}) \cdot x + R(x)$
Analysis of BANDITFTRL-VR
Theorem (Mohri & Yang, 2016). Assume $\mathrm{diam}(K) \leq D$. Let $R$ be a self-concordant barrier for $K$, $|f_t| \leq C$, and $\|\nabla f_t\| \leq L$. Then the regret of BANDITFTRL-VR is upper bounded as follows:
- If $(f_t)_{t=1}^T$ are Lipschitz, then $\mathrm{Reg}_T(\text{BANDITFTRL-VR}) = \tilde{O}(T^{11/16})$.
- If $(f_t)_{t=1}^T$ have Lipschitz gradients, then $\mathrm{Reg}_T(\text{BANDITFTRL-VR}) = \tilde{O}(T^{8/13})$.
Proof of BANDITFTRL-VR regret: Lipschitz case
Approximation: real to smoothed losses.
- Relate the global optimum $x^*$ to the projected optimum $x^*_\epsilon$.
- Use the Lipschitz property of the losses to relate $y_t$ to $x_t$ and $f_t$ to $\hat{f}_t$.
$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) = \mathbb{E}\Big[\sum_t f_t(y_t) - f_t(x^*)\Big] \leq \epsilon L T + 2L\delta D T + \mathbb{E}\Big[\sum_t \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\Big]$.
Proof of BANDITFTRL-VR regret: Lipschitz case
Approximation: smoothed to averaged losses.
$\mathbb{E}\Big[\sum_t \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\Big] = \mathbb{E}\Big[\sum_t \Big(\hat{f}_t(x_t) - \frac{1}{k+1}\sum_{i=0}^{k} \hat{f}_{t-i}(x_{t-i})\Big) + \sum_t \frac{1}{k+1}\sum_{i=0}^{k} \big(\hat{f}_{t-i}(x_{t-i}) - \hat{f}_t(x^*_\epsilon)\big)\Big]$
$\leq \frac{Ck}{2} + L T \sup_{t \in [1,T],\ i \in [0, k \wedge t]} \mathbb{E}\big[\|x_{t-i} - x_t\|_2\big] + \mathbb{E}\Big[\sum_t \bar{g}_t \cdot (x_t - x^*_\epsilon)\Big]$.
Proof of BANDITFTRL-VR regret: Lipschitz case
FTRL analysis on averaged gradients with re-centering:
$\mathbb{E}\Big[\sum_t \bar{g}_t \cdot (x_t - x^*_\epsilon)\Big] \leq \frac{2C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} R(x^*_\epsilon)$.
Cumulative analysis:
$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) \leq \epsilon L T + 2L\delta D T + \frac{Ck}{2} + \frac{2C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} R(x^*_\epsilon) + L T \sup_{t \in [1,T],\ i \in [0, k \wedge t]} \mathbb{E}\big[\|x_{t-i} - x_t\|_2\big]$.
Proof of BANDITFTRL-VR regret: Lipschitz case
Stability estimate for the actions. Want to bound $\sup_{t \in [1,T],\ i \in [0, k \wedge t]} \mathbb{E}\big[\|x_{t-i} - x_t\|_2\big]$.
Fact: let $D$ be the diameter of $K$. For any $x \in K$ and $z \in \mathbb{R}^d$,
$D^{-1} \|z\|_{x,*} \leq \|z\|_2 \leq D \|z\|_x$.
By the triangle inequality and the equivalence of norms,
$\mathbb{E}\big[\|x_{t-i} - x_t\|_2\big] \leq \sum_{s=t-i}^{t-1} \mathbb{E}\big[\|x_s - x_{s+1}\|_2\big] \leq D \sum_{s=t-i}^{t-1} \mathbb{E}\big[\|x_s - x_{s+1}\|_{x_s}\big] \leq D \sum_{s=t-i}^{t-1} 2\eta\, \mathbb{E}\big[\|\bar{g}_s + \tilde{g}_{s+1} - \tilde{g}_s\|_{x_s,*}\big]$.
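The norm-equivalence fact can be checked numerically for the log barrier on $K = [-1, 1]$ (diameter $D = 2$), where $\|z\|_x = |z|\sqrt{R''(x)}$ and $\|z\|_{x,*} = |z|/\sqrt{R''(x)}$; the probe points below are arbitrary:

```python
import math

D = 2.0  # diameter of K = [-1, 1]

def R2(x):  # Hessian of the log barrier R(x) = -log(1-x) - log(1+x)
    return 1 / (1 - x) ** 2 + 1 / (1 + x) ** 2

# Check D^{-1} ||z||_{x,*} <= ||z||_2 <= D ||z||_x at several interior points.
ok = True
for x in [-0.9, -0.5, 0.0, 0.5, 0.9]:
    for z in [0.3, -1.7]:
        local = abs(z) * math.sqrt(R2(x))        # ||z||_x
        local_dual = abs(z) / math.sqrt(R2(x))   # ||z||_{x,*}
        ok = ok and (local_dual / D <= abs(z) <= D * local)
print(ok)  # True
```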
Proof of BANDITFTRL-VR regret: Lipschitz case
$\bar{g}_s + \tilde{g}_{s+1} - \tilde{g}_s = \frac{1}{k+1} \sum_{i=0}^{k} \hat{g}_{s-i} + \frac{1}{k+1}\big(\hat{g}_s - \hat{g}_{s-k}\big)$.
Thus, splitting each $\hat{g}_{s-i}$ into its conditional mean and a martingale difference,
$\mathbb{E}\big[\|\bar{g}_s + \tilde{g}_{s+1} - \tilde{g}_s\|^2_{x_s,*}\big] \leq \frac{3}{k^2}\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \mathbb{E}_{s-i}[\hat{g}_{s-i}]\Big\|^2_{x_s,*}\Big] + \frac{3}{k^2}\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|^2_{x_s,*}\Big]$
$\leq 3 D^2 L^2 + \frac{3}{k^2}\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|^2_{x_s,*}\Big]$,
since $\|\mathbb{E}_{s-i}[\hat{g}_{s-i}]\|_{x_s,*} \leq D \|\nabla \hat{f}_{s-i}(x_{s-i})\|_2 \leq D L$.
Proof of BANDITFTRL-VR regret: Lipschitz case
Fact: for any $0 \leq i \leq k$ such that $t - i \geq 1$,
$\frac{1}{2} \|z\|_{x_{t-i},*} \leq \|z\|_{x_t,*} \leq 2 \|z\|_{x_{t-i},*}$.
Because the terms in the sum form a martingale difference sequence,
$\mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|^2_{x_s,*}\Big] \leq 4\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|^2_{x_{s-k},*}\Big]$
$= 4 \sum_{i=0}^{k-1} \mathbb{E}\Big[\big\|\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big\|^2_{x_{s-k},*}\Big] \leq 16 \sum_{i=0}^{k-1} \mathbb{E}\Big[\big\|\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big\|^2_{x_{s-i},*}\Big]$
$\leq 16 \sum_{i=0}^{k-1} \mathbb{E}\big[\|\hat{g}_{s-i}\|^2_{x_{s-i},*}\big] \leq 16 \sum_{i=0}^{k-1} \frac{C^2 d^2}{\delta^2} = 16 k\, \frac{C^2 d^2}{\delta^2}$.
Proof of BANDITFTRL-VR regret: Lipschitz case
Combining the components of the stability estimate with the previous calculations,
$\mathbb{E}\big[\|x_{t-i} - x_t\|_2\big] \leq 2\eta D \sum_{s=t-i}^{t-1} \Big(3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2} \leq 2\eta k D \Big(3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2}$.
Hence
$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) \leq \epsilon L T + 2L\delta D T + \frac{Ck}{2} + \frac{2C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{\nu}{\eta} \log(1/\epsilon) + 2\eta k L T D \Big(3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2}$.
Now set $\eta = T^{-11/16} d^{-3/8}$, $\delta = T^{-5/16} d^{3/8}$, $k = T^{1/8} d^{1/4}$.
Discussion of BANDITFTRL-VR regret: Lipschitz-gradient case
- The approximation of real to smoothed losses incurs an $H \delta^2 D^2 T$ penalty instead of $\delta D T$ (for losses with $H$-Lipschitz gradients).
- The rest of the analysis also leads to some changes in the constants.
General regret bound:
$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) \leq \epsilon L T + H \delta^2 D^2 T + Ck + \frac{2C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{\nu}{\eta} \log(1/\epsilon) + 2\eta k D (T L + D H T) \Big(3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2}$.
Now set $\eta = T^{-8/13} d^{-5/6}$, $\delta = T^{-5/26} d^{1/3}$, $k = T^{1/13} d^{5/3}$.
Other BCO methods
Strongly convex loss functions (augment $R$ in BANDITFTRL with additional regularization):
- $C^{0,1}$ [Agarwal et al., 2010]: $O(T^{2/3})$ regret.
- $C^{1,1}$ [Hazan & Levy, 2014]: $O(T^{1/2})$ regret.
Other types of algorithms:
- Ellipsoid-method-based algorithm [Hazan and Li, 2016]: $O(2^{d^4} \log(T)^{2d}\, T^{1/2})$.
- Kernel-based algorithm [Bubeck et al., 2017]: $O(d^{9.5}\, T^{1/2})$.
Conclusion
- BCO is a flexible framework for modeling learning problems with sequential data and very limited feedback.
- BCO generalizes many existing models of online learning and optimization.
- State-of-the-art algorithms leverage techniques from online convex optimization and interior-point methods.
- Efficient algorithms attaining optimal guarantees in the $C^{0,1}$ and $C^{1,1}$ cases are still open.
Reference
Mehryar Mohri and Scott Yang. Optimistic Bandit Convex Optimization. Courant Institute and Google Research, New York, NY. mohri@cims.nyu.edu, yangs@cims.nyu.edu.
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationOnline Submodular Minimization
Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com
More informationLecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem
Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:
More informationExtracting Certainty from Uncertainty: Regret Bounded by Variation in Costs
Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301
More informationUnconstrained minimization: assumptions
Unconstrained minimization I terminology and assumptions I gradient descent method I steepest descent method I Newton s method I self-concordant functions I implementation IOE 611: Nonlinear Programming,
More informationUnconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations
JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,
More informationSublinear Time Algorithms for Approximate Semidefinite Programming
Noname manuscript No. (will be inserted by the editor) Sublinear Time Algorithms for Approximate Semidefinite Programming Dan Garber Elad Hazan Received: date / Accepted: date Abstract We consider semidefinite
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationAn Online Convex Optimization Approach to Blackwell s Approachability
Journal of Machine Learning Research 17 (2016) 1-23 Submitted 7/15; Revised 6/16; Published 8/16 An Online Convex Optimization Approach to Blackwell s Approachability Nahum Shimkin Faculty of Electrical
More informationORIE 6326: Convex Optimization. Quasi-Newton Methods
ORIE 6326: Convex Optimization Quasi-Newton Methods Professor Udell Operations Research and Information Engineering Cornell April 10, 2017 Slides on steepest descent and analysis of Newton s method adapted
More informationStochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited
More informationBandit Online Learning with Unknown Delays
Bandit Online Learning with Unknown Delays Bingcong Li, Tianyi Chen, and Georgios B. Giannakis University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA {lixx5599,chen387,georgios}@umn.edu November
More informationAdvanced Machine Learning
Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his
More informationMath 361: Homework 1 Solutions
January 3, 4 Math 36: Homework Solutions. We say that two norms and on a vector space V are equivalent or comparable if the topology they define on V are the same, i.e., for any sequence of vectors {x
More informationLearning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley
Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect
More informationLecture 15 Newton Method and Self-Concordance. October 23, 2008
Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications
More informationOnline Optimization with Gradual Variations
JMLR: Workshop and Conference Proceedings vol (0) 0 Online Optimization with Gradual Variations Chao-Kai Chiang, Tianbao Yang 3 Chia-Jung Lee Mehrdad Mahdavi 3 Chi-Jen Lu Rong Jin 3 Shenghuo Zhu 4 Institute
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationMinimax Policies for Combinatorial Prediction Games
Minimax Policies for Combinatorial Prediction Games Jean-Yves Audibert Imagine, Univ. Paris Est, and Sierra, CNRS/ENS/INRIA, Paris, France audibert@imagine.enpc.fr Sébastien Bubeck Centre de Recerca Matemàtica
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October
More informationDescent methods. min x. f(x)
Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained
More informationDual Proximal Gradient Method
Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More information