March 7, 2017
Table of Contents
1. Bandit Convex Optimization (BCO)
2. Projection Methods
3. Barrier Methods
4. Variance Reduction
5. Other Methods
6. Conclusion
Learning scenario
Compact convex action set $K \subset \mathbb{R}^d$. For $t = 1$ to $T$:
- Predict $x_t \in K$.
- Receive convex loss function $f_t : K \to \mathbb{R}$.
- Incur loss $f_t(x_t)$.
Bandit setting: only the loss value $f_t(x_t)$ is revealed, no other information.
Regret of algorithm $\mathcal{A}$:
$$\mathrm{Reg}_T(\mathcal{A}) = \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x).$$
Related settings
- Online convex optimization: $f_t$ (and possibly $\nabla f_t$, $\nabla^2 f_t$, ...) known at each round.
- Multi-armed bandit: $K = \{1, 2, \ldots, K\}$ discrete.
- Zeroth-order optimization: $f_t = f$ fixed across rounds.
- Stochastic bandit convex optimization: $f_t(x) = f(x) + \epsilon_t$, $\epsilon_t \sim D$, a noisy estimate.
- Multi-point bandit convex optimization: query $f_t$ at points $(x_{t,i})_{i=1}^m$, $m \geq 2$.
Section 2: Projection Methods
Gradient descent without a gradient
Online gradient descent algorithm: $x_{t+1} \leftarrow x_t - \eta \nabla f_t(x_t)$.
BCO setting: $\nabla f_t(x_t)$ is not known!
BCO idea: find $\hat{g}_t$ such that $\mathbb{E}[\hat{g}_t] \approx \nabla f_t(x_t)$, and update $x_{t+1} \leftarrow x_t - \eta \hat{g}_t$.
Question: how do we pick $\hat{g}_t$?
Single-point gradient estimates (one dimension)
By the fundamental theorem of calculus,
$$\frac{d}{dx}\, \frac{1}{2\delta} \int_{-\delta}^{\delta} f(x+y)\, dy = \frac{1}{2\delta}\big[f(x+\delta) - f(x-\delta)\big] = \mathbb{E}_{z \sim D}\Big[\frac{1}{\delta}\, f(x+z)\, \frac{z}{\delta}\Big],$$
where $D$ puts $z = \delta$ w.p. $\tfrac12$ and $z = -\delta$ w.p. $\tfrac12$.
With enough regularity (e.g. $f$ Lipschitz),
$$\frac{d}{dx}\, \frac{1}{2\delta} \int_{-\delta}^{\delta} f(x+y)\, dy = \frac{1}{2\delta} \int_{-\delta}^{\delta} f'(x+y)\, dy.$$
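As a quick numerical check (a sketch with a hypothetical quadratic loss; the helper `one_point_estimate` is not from the slides), averaging the estimator over its two equally likely outcomes recovers exactly the central difference:

```python
def one_point_estimate(f, x, z, delta):
    """Single-point estimate (1/delta) * f(x + z) * (z/delta), with z = +/-delta."""
    return f(x + z) * z / delta**2

# Hypothetical loss and query point (not from the slides).
f = lambda x: x**2
x, delta = 1.0, 0.01

# Averaging over the two outcomes z = +delta and z = -delta recovers the
# central difference [f(x+delta) - f(x-delta)] / (2*delta).
expected = 0.5 * (one_point_estimate(f, x, delta, delta)
                  + one_point_estimate(f, x, -delta, delta))
central_diff = (f(x + delta) - f(x - delta)) / (2 * delta)
print(expected, central_diff)  # both approximately f'(1) = 2
```

Only a single evaluation of $f$ is ever used per round; the expectation over $z$ does the work of the two-sided difference.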
Single-point gradient estimates (higher dimensions)
Let $B_1 = \{x \in \mathbb{R}^d : \|x\|_2 \leq 1\}$, $S_1 = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$, and write $|A| = \int_A dy$.
By Stokes' theorem,
$$\nabla\, \frac{1}{|\delta B_1|} \int_{\delta B_1} f(x+y)\, dy = \frac{1}{|\delta B_1|} \int_{\delta S_1} f(x+z)\, \frac{z}{\|z\|}\, dz = \frac{|\delta S_1|}{|\delta B_1|}\, \mathbb{E}_{z \sim U(S_1)}\big[f(x + \delta z)\, z\big] = \frac{d}{\delta}\, \mathbb{E}_{z \sim U(S_1)}\big[f(x + \delta z)\, z\big].$$
With enough regularity on $f$,
$$\nabla\, \frac{1}{|\delta B_1|} \int_{\delta B_1} f(x+y)\, dy = \frac{1}{|\delta B_1|} \int_{\delta B_1} \nabla f(x+y)\, dy.$$
Projection method [Flaxman et al., 2005]
Let $\hat{f}(x) = \frac{1}{|\delta B_1|} \int_{\delta B_1} f(x+y)\, dy$.
- Estimate $\nabla \hat{f}(x)$ by sampling on the sphere $x + \delta S_1$.
- Project the gradient descent update to keep samples inside $K$: $K_\delta = (1 - \delta) K$.

BANDITPGD($T, \eta, \delta$):
  $x_1 \leftarrow 0$
  For $t = 1, 2, \ldots, T$:
    $u_t \leftarrow$ SAMPLE($U(S_1)$)
    $y_t \leftarrow x_t + \delta u_t$
    PLAY($y_t$)
    $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
    $\hat{g}_t \leftarrow \frac{d}{\delta} f_t(y_t)\, u_t$
    $x_{t+1} \leftarrow \Pi_{K_\delta}(x_t - \eta \hat{g}_t)$
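The pseudocode above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: it assumes $K$ is the unit Euclidean ball (so projection onto $K_\delta = (1-\delta)K$ is a rescaling), and replaces the PLAY/RECEIVELOSS oracle by a hypothetical Python callable `loss`:

```python
import numpy as np

def bandit_pgd(loss, T, eta, delta, d, rng):
    """Sketch of BANDITPGD, specialized to K = unit Euclidean ball.
    loss(y) returns the scalar f_t(y); it is the only feedback used."""
    x = np.zeros(d)                         # x_1 = 0
    plays = []
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # u_t ~ U(S_1)
        y = x + delta * u                   # y_t = x_t + delta * u_t
        plays.append(y)
        g_hat = (d / delta) * loss(y) * u   # one-point gradient estimate
        x = x - eta * g_hat                 # gradient step
        r = np.linalg.norm(x)               # projection onto K_delta
        if r > 1 - delta:
            x *= (1 - delta) / r
    return plays

# Hypothetical fixed linear loss f_t(x) = c . x.
rng = np.random.default_rng(0)
c = np.array([1.0, 0.0])
plays = bandit_pgd(lambda y: float(c @ y), T=1000, eta=0.01, delta=0.1, d=2, rng=rng)
# Since ||x_t|| <= 1 - delta after projection, every play y_t stays in K.
assert all(np.linalg.norm(y) <= 1 + 1e-9 for y in plays)
```

The shrinkage to $K_\delta$ is exactly what guarantees the perturbed plays $y_t = x_t + \delta u_t$ remain feasible.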
Analysis of BANDITPGD
Theorem (Flaxman et al., 2005). Assume $\mathrm{diam}(K) \leq D$, $|f_t| \leq C$, and $f_t$ $L$-Lipschitz. Then after $T$ rounds, the expected regret of the BANDITPGD algorithm is bounded by
$$\frac{D^2}{2\eta} + \frac{\eta C^2 d^2 T}{2\delta^2} + \delta (D + 2) L T.$$
In particular, by setting $\eta = \frac{D\delta}{Cd\sqrt{T}}$ and $\delta = \Big(\frac{DCd}{(D+2)L\sqrt{T}}\Big)^{1/2}$, the regret is upper bounded by $O(d^{1/2} T^{3/4})$.
Proof of BANDITPGD regret
Let $x^* \in \operatorname{argmin}_{x \in K} \sum_{t=1}^T f_t(x)$; for any $x \in K$, let $x_\delta = \Pi_{K_\delta}(x)$, and let $\hat{f}_t$ denote the smoothed loss. Then
$$\sum_{t=1}^T \mathbb{E}\big[f_t(y_t) - f_t(x^*)\big] = \sum_{t=1}^T \mathbb{E}\big[f_t(y_t) - f_t(x_t) + f_t(x_t) - \hat{f}_t(x_t) + \hat{f}_t(x_t) - \hat{f}_t(x^*_\delta) + \hat{f}_t(x^*_\delta) - f_t(x^*_\delta) + f_t(x^*_\delta) - f_t(x^*)\big]$$
$$\leq \sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(x^*_\delta)\big] + \big[2\delta L T + \delta D L T\big] = \sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(x^*_\delta)\big] + \delta (D + 2) L T.$$
Proof of BANDITPGD regret (continued)
$\mathbb{E}\big[\|\hat{g}_t\|^2\big] \leq \frac{C^2 d^2}{\delta^2}$. Thus,
$$\sum_{t=1}^T \mathbb{E}\big[\hat{f}_t(x_t) - \hat{f}_t(x^*_\delta)\big] \leq \sum_{t=1}^T \mathbb{E}\big[\nabla \hat{f}_t(x_t) \cdot (x_t - x^*_\delta)\big] = \sum_{t=1}^T \mathbb{E}\big[\hat{g}_t \cdot (x_t - x^*_\delta)\big]$$
$$= \frac{1}{2\eta} \sum_{t=1}^T \mathbb{E}\big[\eta^2 \|\hat{g}_t\|^2 + \|x_t - x^*_\delta\|^2 - \|x_t - \eta \hat{g}_t - x^*_\delta\|^2\big] \leq \frac{1}{2\eta} \sum_{t=1}^T \mathbb{E}\Big[\eta^2 \frac{C^2 d^2}{\delta^2} + \|x_t - x^*_\delta\|^2 - \|x_{t+1} - x^*_\delta\|^2\Big]$$
$$\leq \frac{1}{2\eta} \mathbb{E}\big[\|x_1 - x^*_\delta\|^2\big] + \frac{\eta C^2 d^2 T}{2\delta^2} \leq \frac{D^2}{2\eta} + \frac{\eta C^2 d^2 T}{2\delta^2}.$$
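The middle equality and the step replacing $x_t - \eta\hat g_t$ by $x_{t+1}$ rest on two standard facts, spelled out here:

```latex
% Expanding the square:
\|x_t - \eta \hat g_t - x^*_\delta\|^2
  = \|x_t - x^*_\delta\|^2
    - 2\eta\, \hat g_t \cdot (x_t - x^*_\delta)
    + \eta^2 \|\hat g_t\|^2,
% which, rearranged, is exactly
\hat g_t \cdot (x_t - x^*_\delta)
  = \frac{1}{2\eta}\Big(\eta^2 \|\hat g_t\|^2
    + \|x_t - x^*_\delta\|^2
    - \|x_t - \eta \hat g_t - x^*_\delta\|^2\Big).
% The second step uses that projection onto the convex set K_\delta is
% non-expansive: with x_{t+1} = \Pi_{K_\delta}(x_t - \eta \hat g_t)
% and x^*_\delta \in K_\delta,
\|x_{t+1} - x^*_\delta\| \le \|x_t - \eta \hat g_t - x^*_\delta\|.
```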
Section 3: Barrier Methods
Revisiting projection
Goal of projection: keep $x_t \in K_\delta$ so that $y_t \in K$.
Total cost of projection: $\delta D L T$.
Deficiency: the projection is completely separate from the gradient descent update.
Question: is there a better way to ensure that $y_t \in K$?
Gradient Descent to Follow-the-Regularized-Leader
Let $\hat{g}_{1:t} = \sum_{s=1}^t \hat{g}_s$. Proximal form of gradient descent (with $x_1 = 0$):
$$x_{t+1} \leftarrow x_t - \eta \hat{g}_t \quad \Longleftrightarrow \quad x_{t+1} \leftarrow \operatorname{argmin}_{x \in \mathbb{R}^d}\; \eta\, \hat{g}_{1:t} \cdot x + \tfrac12 \|x\|^2.$$
Follow-the-Regularized-Leader: for a regularizer $R : \mathbb{R}^d \to \mathbb{R}$,
$$x_{t+1} \leftarrow \operatorname{argmin}_{x \in \mathbb{R}^d}\; \eta\, \hat{g}_{1:t} \cdot x + R(x).$$
BCO wishlist for $R$:
- Ensure that $x_{t+1}$ stays inside $K$.
- Leave enough room so that $y_{t+1} \in K$ as well.
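The equivalence above can be verified directly: with the quadratic regularizer $\tfrac12\|x\|^2$ the FTRL argmin has the closed form $x_{t+1} = -\eta\, \hat g_{1:t}$, which is exactly gradient descent started at $x_1 = 0$. A minimal check (the gradients here are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, eta = 3, 50, 0.1
g = rng.normal(size=(T, d))     # stand-ins for the estimates g_hat_t

x = np.zeros(d)                 # gradient descent from x_1 = 0
for t in range(T):
    x = x - eta * g[t]

# argmin_x eta * g_{1:T} . x + (1/2)||x||^2  =>  x = -eta * sum_s g_s
x_ftrl = -eta * g.sum(axis=0)
assert np.allclose(x, x_ftrl)   # the two updates coincide
```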
Self-concordant barriers
Definition (Self-concordant barrier (SCB)). Let $\nu \geq 0$. A $C^3$ function $R : \mathrm{int}(K) \to \mathbb{R}$ is a $\nu$-self-concordant barrier for $K$ if for any sequence $(z_s)_{s \geq 1} \subset \mathrm{int}(K)$ with $z_s \to \partial K$ we have $R(z_s) \to \infty$, and for all $x \in \mathrm{int}(K)$ and $y \in \mathbb{R}^d$ the following inequalities hold:
$$\big|\nabla^3 R(x)[y, y, y]\big| \leq 2 \|y\|_x^3, \qquad \big|\nabla R(x) \cdot y\big| \leq \nu^{1/2} \|y\|_x,$$
where $\|z\|_x^2 = \|z\|_{\nabla^2 R(x)}^2 = z^\top \nabla^2 R(x)\, z$.
Examples of barriers
- $K = B_1$: $R(x) = -\log(1 - \|x\|^2)$ is 1-self-concordant.
- $K = \{x : a_i^\top x \leq b_i,\; i = 1, \ldots, m\}$: $R(x) = -\sum_{i=1}^m \log(b_i - a_i^\top x)$ is $m$-self-concordant.
- Existence of a universal barrier [Nesterov & Nemirovski, 1994]: every closed convex domain $K$ admits an $O(d)$-self-concordant barrier.
Properties of self-concordant barriers
- Translation invariance: for any constant $c \in \mathbb{R}$, $R + c$ is also a SCB (so wlog we assume $\min_{z \in K} R(z) = 0$).
- Dikin ellipsoid contained in the interior: let $E(x) = \{y \in \mathbb{R}^d : \|y - x\|_x \leq 1\}$. Then for any $x \in \mathrm{int}(K)$, $E(x) \subset \mathrm{int}(K)$.
- Logarithmic growth away from the boundary: for any $\epsilon \in (0, 1]$, let $y = \operatorname{argmin}_{z \in K} R(z)$ and $K_{y,\epsilon} = \{y + (1 - \epsilon)(x - y) : x \in K\}$. Then for all $x \in K_{y,\epsilon}$, $R(x) \leq \nu \log(1/\epsilon)$.
- Proximity to minimizer: if $\|\nabla R(x)\|_{x,*} \leq \tfrac12$, then $\|x - \operatorname{argmin} R\|_x \leq 2 \|\nabla R(x)\|_{x,*}$.
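The Dikin-ellipsoid property can be observed numerically for the ball barrier $R(x) = -\log(1 - \|x\|^2)$, whose Hessian is $\nabla^2 R(x) = \frac{2I}{1-\|x\|^2} + \frac{4xx^\top}{(1-\|x\|^2)^2}$. A sketch (the test point and sample count are arbitrary choices): boundary points of $E(x)$, obtained as $x + (\nabla^2 R(x))^{-1/2} v$ for unit $v$, all stay strictly inside the ball:

```python
import numpy as np

def hessian_log_barrier(x):
    """Hessian of R(x) = -log(1 - ||x||^2), the 1-SCB for the unit ball."""
    s = x @ x
    return 2 * np.eye(len(x)) / (1 - s) + 4 * np.outer(x, x) / (1 - s) ** 2

rng = np.random.default_rng(2)
x = np.array([0.9, 0.0])                         # a point close to the boundary
w, V = np.linalg.eigh(hessian_log_barrier(x))
H_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T        # symmetric (grad^2 R)^{-1/2}

v = rng.normal(size=(1000, 2))
v /= np.linalg.norm(v, axis=1, keepdims=True)    # unit vectors, ||v||_2 = 1
ys = x + v @ H_inv_sqrt                          # boundary points of E(x)
assert np.all(np.linalg.norm(ys, axis=1) < 1)    # E(x) inside int(K)
```

Note how the ellipsoid automatically flattens in the direction of the nearby boundary: the Hessian blows up there, so $(\nabla^2 R)^{-1/2}$ shrinks that axis.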
Adjusting to the local geometry
Let $A$ be a symmetric positive definite matrix.
- Sampling over the ellipsoid $A S_1$ instead of the Euclidean sphere: $u \leftarrow$ SAMPLE($U(S_1)$), $y = x + \delta A u$.
- Smoothing over the ellipsoid $A B_1$ instead of the Euclidean ball: $\hat{f}(x) = \mathbb{E}_{v \sim U(B_1)}[f(x + \delta A v)]$.
- One-point gradient estimate based on $A$: $\hat{g} = \frac{d}{\delta} f(x + \delta A u)\, A^{-1} u$, with $\mathbb{E}_{u \sim U(S_1)}[\hat{g}] = \nabla \hat{f}(x)$.
- Local norm bound: $\|\hat{g}\|_{A^2}^2 \leq \frac{d^2}{\delta^2} C^2$.
BANDITFTRL [Abernethy et al., 2008; Saha & Tewari, 2011]
BANDITFTRL($R, \delta, \eta, T, x_1$):
  For $t = 1$ to $T$:
    $u_t \leftarrow$ SAMPLE($U(S_1)$)
    $y_t \leftarrow x_t + \delta (\nabla^2 R(x_t))^{-1/2} u_t$
    PLAY($y_t$)
    $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
    $\hat{g}_t \leftarrow \frac{d}{\delta} f_t(y_t)\, (\nabla^2 R(x_t))^{1/2} u_t$
    $x_{t+1} \leftarrow \operatorname{argmin}_{x \in \mathbb{R}^d}\; \eta\, \hat{g}_{1:t} \cdot x + R(x)$
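A minimal runnable sketch of the algorithm above, under assumptions not made by the slides: $K$ is the unit ball with its log barrier, so the FTRL argmin has a closed form (the minimizer of $\eta g \cdot x - \log(1 - \|x\|^2)$ lies along $-g$ at radius $\alpha$ solving $\eta\|g\|(1-\alpha^2) = 2\alpha$), and the PLAY/RECEIVELOSS oracle is a hypothetical Python callable:

```python
import numpy as np

def hessian_log_barrier(x):
    """Hessian of R(x) = -log(1 - ||x||^2) on the open unit ball."""
    s = x @ x
    return 2 * np.eye(len(x)) / (1 - s) + 4 * np.outer(x, x) / (1 - s) ** 2

def ftrl_step_ball(g_sum, eta):
    """Closed-form argmin_x of eta * g_sum . x - log(1 - ||x||^2)."""
    norm = np.linalg.norm(g_sum)
    if norm == 0.0:
        return np.zeros_like(g_sum)
    c = eta * norm
    alpha = (np.sqrt(1.0 + c * c) - 1.0) / c   # root of c*(1 - a^2) = 2a, a < 1
    return -alpha * g_sum / norm

def bandit_ftrl(loss, T, eta, delta, d, rng):
    """Sketch of BANDITFTRL specialized to K = unit ball with its log barrier."""
    x, g_sum, plays = np.zeros(d), np.zeros(d), []
    for t in range(T):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                    # u_t ~ U(S_1)
        w, V = np.linalg.eigh(hessian_log_barrier(x))
        root = V @ np.diag(w ** 0.5) @ V.T        # (grad^2 R)^{1/2}
        inv_root = V @ np.diag(w ** -0.5) @ V.T   # (grad^2 R)^{-1/2}
        y = x + delta * inv_root @ u              # Dikin-ellipsoid perturbation
        plays.append(y)
        g_sum += (d / delta) * loss(y) * (root @ u)   # one-point estimate
        x = ftrl_step_ball(g_sum, eta)
    return plays

# Hypothetical fixed linear loss; with delta <= 1 every play stays in K,
# because y_t lies in the Dikin ellipsoid E(x_t), which is inside int(K).
rng = np.random.default_rng(4)
c = np.array([1.0, -0.5])
plays = bandit_ftrl(lambda y: float(c @ y), T=500, eta=0.05, delta=0.5, d=2, rng=rng)
assert all(np.linalg.norm(y) <= 1 + 1e-9 for y in plays)
```

No projection step appears anywhere: feasibility of both $x_{t+1}$ and $y_{t+1}$ comes entirely from the barrier, which is the point of the method.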
Analysis of BANDITFTRL
Theorem (Abernethy et al., 2008; Saha & Tewari, 2011). Assume $\mathrm{diam}(K) \leq D$. Let $R$ be a self-concordant barrier for $K$, $|f_t| \leq C$, and $f_t$ $L$-Lipschitz. Then the regret of BANDITFTRL is upper bounded as follows:
- If $(f_t)_{t=1}^T$ are linear functions, then $\mathrm{Reg}_T(\text{BANDITFTRL}) = \tilde{O}(T^{1/2})$.
- If $(f_t)_{t=1}^T$ have Lipschitz gradients, then $\mathrm{Reg}_T(\text{BANDITFTRL}) = \tilde{O}(T^{2/3})$.
Proof of BANDITFTRL regret: linear case
Approximation error of smoothed losses: let
$$x^* = \operatorname{argmin}_{x \in K} \sum_{t=1}^T f_t(x), \qquad x^*_\epsilon = \operatorname{argmin}_{y \in K,\; \mathrm{dist}(y, \partial K) > \epsilon} \|y - x^*\|.$$
Because the $f_t$ are linear (so $\hat{f}_t = f_t$ and $\mathbb{E}[\hat{f}_t(y_t) \mid x_t] = \hat{f}_t(x_t)$),
$$\mathrm{Reg}_T(\text{BANDITFTRL}) = \mathbb{E}\Big[\sum_{t=1}^T f_t(y_t) - f_t(x^*)\Big]$$
$$= \mathbb{E}\Big[\sum_{t=1}^T f_t(y_t) - \hat{f}_t(y_t) + \hat{f}_t(y_t) - \hat{f}_t(x_t) + \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon) + \hat{f}_t(x^*_\epsilon) - f_t(x^*_\epsilon) + f_t(x^*_\epsilon) - f_t(x^*)\Big]$$
$$\leq \mathbb{E}\Big[\sum_{t=1}^T \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\Big] + \epsilon L T = \mathbb{E}\Big[\sum_{t=1}^T \hat{g}_t \cdot (x_t - x^*_\epsilon)\Big] + \epsilon L T.$$
Proof of BANDITFTRL regret: linear case (continued)
Claim: for any $z \in K$, $\sum_{t=1}^T \hat{g}_t \cdot (x_{t+1} - z) \leq \frac{1}{\eta} R(z)$.
The case $T = 1$ is true by the definition of $x_2$. Assuming the statement holds for $T - 1$:
$$\sum_{t=1}^T \hat{g}_t \cdot x_{t+1} = \sum_{t=1}^{T-1} \hat{g}_t \cdot x_{t+1} + \hat{g}_T \cdot x_{T+1} \leq \frac{1}{\eta} R(x_{T+1}) + \sum_{t=1}^{T-1} \hat{g}_t \cdot x_{T+1} + \hat{g}_T \cdot x_{T+1} \leq \frac{1}{\eta} R(z) + \sum_{t=1}^T \hat{g}_t \cdot z,$$
where the first inequality applies the inductive hypothesis with $z = x_{T+1}$ and the second uses the definition of $x_{T+1}$ as the FTRL argmin.
Proof of BANDITFTRL regret: linear case (continued)
It remains to bound $\mathbb{E}\big[\sum_{t=1}^T \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\big]$.
Recall $x_{t+1} = \operatorname{argmin}_{x}\; \eta\, \hat{g}_{1:t} \cdot x + R(x)$, and note that $F_t(x) = \eta\, \hat{g}_{1:t} \cdot x + R(x)$ is itself a SCB. Then
$$\mathbb{E}\Big[\sum_{t=1}^T \hat{f}_t(x_t) - \hat{f}_t(x_{t+1}) + \hat{f}_t(x_{t+1}) - \hat{f}_t(x^*_\epsilon)\Big] \leq \mathbb{E}\Big[\sum_{t=1}^T \|\hat{g}_t\|_{x_t,*}\, \|x_t - x_{t+1}\|_{x_t}\Big] + \frac{1}{\eta} R(x^*_\epsilon),$$
where the second term is bounded by the claim. Proximity bound: since $\nabla F_{t-1}(x_t) = 0$,
$$\|x_t - x_{t+1}\|_{x_t} \lesssim \|\nabla F_t(x_t)\|_{x_t,*} = \eta\, \|\hat{g}_t\|_{x_t,*}.$$
Proof of BANDITFTRL regret: linear case (conclusion)
$$\mathrm{Reg}_T(\text{BANDITFTRL}) \leq \epsilon L T + \mathbb{E}\Big[\sum_{t=1}^T \eta\, \|\hat{g}_t\|_{x_t,*}^2\Big] + \frac{1}{\eta} R(x^*_\epsilon).$$
By the local norm bound: $\mathbb{E}\big[\eta\, \|\hat{g}_t\|_{x_t,*}^2\big] \leq \eta\, \frac{C^2 d^2}{\delta^2}$.
By the logarithmic growth of the SCB: $\frac{1}{\eta} R(x^*_\epsilon) \leq \frac{\nu}{\eta} \log\big(\frac{1}{\epsilon}\big)$.
Hence
$$\mathrm{Reg}_T(\text{BANDITFTRL}) \leq \epsilon L T + \eta T\, \frac{C^2 d^2}{\delta^2} + \frac{\nu}{\eta} \log\Big(\frac{1}{\epsilon}\Big).$$
Section 4: Variance Reduction
Issues with BANDITFTRL in the non-linear case
Approximation error of $\hat{f}$ vs. $f$:
- $\delta T$ for $C^{0,1}$ functions
- $\delta^2 T$ for $C^{1,1}$ functions
Variance of the gradient estimates: $\mathbb{E}\big[\eta\, \|\hat{g}_t\|_{x_t,*}^2\big] \leq \eta\, \frac{C^2 d^2}{\delta^2}$.
Resulting regret for non-linear loss functions:
- $O(T^{3/4})$ for $C^{0,1}$ functions
- $O(T^{2/3})$ for $C^{1,1}$ functions
Question: can we reduce the variance of the gradient estimates to improve the regret?
Variance reduction
Observation [Dekel et al., 2015]: if $\bar{g}_t = \frac{1}{k+1} \sum_{i=0}^k \hat{g}_{t-i}$, then
$$\mathbb{E}\big[\|\bar{g}_t\|_{x_t,*}^2\big] = O\Big(\frac{C^2 d^2}{\delta^2 (k+1)}\Big).$$
Note: the averaged gradient $\bar{g}_t$ is no longer an unbiased estimate of $\nabla \hat{f}_t(x_t)$.
Idea: if $f_t$ is sufficiently regular, then the bias will still be manageable.
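The $1/(k+1)$ factor can be seen in a simple Monte Carlo sketch (with a hypothetical constant loss $f \equiv C$, so each one-point estimate has mean zero and squared norm exactly $(Cd/\delta)^2$; averaging $k+1$ independent estimates then shrinks the expected squared norm by $1/(k+1)$):

```python
import numpy as np

rng = np.random.default_rng(3)
d, delta, C, k, trials = 10, 0.1, 1.0, 15, 4000

def estimates(n):
    """n independent one-point estimates (d/delta) * f * u, u ~ U(S_1)."""
    u = rng.normal(size=(n, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    return (d / delta) * C * u

single = np.mean([np.sum(estimates(1) ** 2) for _ in range(trials)])
averaged = np.mean([np.sum(estimates(k + 1).mean(axis=0) ** 2) for _ in range(trials)])
print(single / averaged)  # close to k + 1 = 16
```

In the algorithm the $\hat g_{t-i}$ are taken at different points $x_{t-i}$, which is exactly where the bias noted above enters; the i.i.d. simulation only isolates the variance effect.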
Improving variance reduction via optimism
Optimistic FTRL [Rakhlin & Sridharan, 2013]: with a prediction $\tilde{g}_{t+1}$ of the next gradient,
$$x_{t+1} \leftarrow \operatorname{argmin}_{x \in K}\; \eta\, (g_{1:t} + \tilde{g}_{t+1}) \cdot x + R(x), \qquad \sum_{t=1}^T f_t(x_t) - f_t(x^*) \leq \sum_{t=1}^T \eta\, \|g_t - \tilde{g}_t\|_{x_t,*}^2 + \frac{1}{\eta} R(x^*).$$
By re-centering the averaged gradient at each step, we can further reduce the variance:
$$\tilde{g}_t = \frac{1}{k+1} \sum_{i=1}^k \hat{g}_{t-i}.$$
Variance of the re-centered averaged gradients:
$$\|\bar{g}_t - \tilde{g}_t\|_{x_t,*}^2 = \frac{1}{(k+1)^2}\, \|\hat{g}_t\|_{x_t,*}^2 = O\Big(\frac{C^2 d^2}{\delta^2 (k+1)^2}\Big).$$
BANDITFTRL-VR
BANDITFTRL-VR($R, \delta, \eta, k, T, x_1$):
  For $t = 1$ to $T$:
    $u_t \leftarrow$ SAMPLE($U(S_1)$)
    $y_t \leftarrow x_t + \delta (\nabla^2 R(x_t))^{-1/2} u_t$
    PLAY($y_t$)
    $f_t(y_t) \leftarrow$ RECEIVELOSS($y_t$)
    $\hat{g}_t \leftarrow \frac{d}{\delta} f_t(y_t)\, (\nabla^2 R(x_t))^{1/2} u_t$
    $\bar{g}_t \leftarrow \frac{1}{k+1} \sum_{i=0}^k \hat{g}_{t-i}$
    $\tilde{g}_{t+1} \leftarrow \frac{1}{k+1} \sum_{i=1}^k \hat{g}_{t+1-i}$
    $x_{t+1} \leftarrow \operatorname{argmin}_{x \in \mathbb{R}^d}\; \eta\, (\bar{g}_{1:t} + \tilde{g}_{t+1}) \cdot x + R(x)$
Analysis of BANDITFTRL-VR
Theorem (Mohri & Y., 2016). Assume $\mathrm{diam}(K) \leq D$. Let $R$ be a self-concordant barrier for $K$, $|f_t| \leq C$, and $f_t$ $L$-Lipschitz. Then the regret of BANDITFTRL-VR is upper bounded as follows:
- If $(f_t)_{t=1}^T$ are Lipschitz, then $\mathrm{Reg}_T(\text{BANDITFTRL-VR}) = \tilde{O}(T^{11/16})$.
- If $(f_t)_{t=1}^T$ have Lipschitz gradients, then $\mathrm{Reg}_T(\text{BANDITFTRL-VR}) = \tilde{O}(T^{8/13})$.
Proof of BANDITFTRL-VR regret: Lipschitz case
Approximation: real to smoothed losses.
- Relate the global optimum $x^*$ to the projected optimum $x^*_\epsilon$.
- Use the Lipschitz property of the losses to relate $y_t$ to $x_t$ and $f_t$ to $\hat{f}_t$.
$$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) = \mathbb{E}\Big[\sum_{t=1}^T f_t(y_t) - f_t(x^*)\Big] \leq \epsilon L T + 2 L \delta D T + \mathbb{E}\Big[\sum_{t=1}^T \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\Big].$$
Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Approximation: smoothed to averaged losses.
$$\mathbb{E}\Big[\sum_{t=1}^T \hat{f}_t(x_t) - \hat{f}_t(x^*_\epsilon)\Big] = \mathbb{E}\Big[\sum_{t=1}^T \Big(\hat{f}_t(x_t) - \frac{1}{k+1} \sum_{i=0}^k \hat{f}_{t-i}(x_{t-i})\Big)\Big] + \mathbb{E}\Big[\sum_{t=1}^T \frac{1}{k+1} \sum_{i=0}^k \big(\hat{f}_{t-i}(x_{t-i}) - \hat{f}_t(x^*_\epsilon)\big)\Big]$$
$$\leq 2 C k + L T \sup_{t \in [1,T],\; i \in [0, k \wedge t]} \mathbb{E}\big[\|x_{t-i} - x_t\|_2\big] + \sum_{t=1}^T \mathbb{E}\big[\bar{g}_t \cdot (x_t - x^*_\epsilon)\big].$$
Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
FTRL analysis on the averaged gradients with re-centering:
$$\sum_{t=1}^T \mathbb{E}\big[\bar{g}_t \cdot (x_t - x^*_\epsilon)\big] \leq \frac{2 C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} R(x^*_\epsilon).$$
Cumulative analysis:
$$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) \leq \epsilon L T + 2 L \delta D T + 2 C k + \frac{2 C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} R(x^*_\epsilon) + L T \sup_{t \in [1,T],\; i \in [0, k \wedge t]} \mathbb{E}\big[\|x_{t-i} - x_t\|_2\big].$$
Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Stability estimate for the actions. We want to bound $\sup_{t \in [1,T],\, i \in [0, k \wedge t]} \mathbb{E}\big[\|x_{t-i} - x_t\|_2\big]$.
Fact: let $D$ be the diameter of $K$. For any $x \in K$ and $z \in \mathbb{R}^d$,
$$D^{-1} \|z\|_{x,*} \leq \|z\|_2 \leq D \|z\|_x.$$
By the triangle inequality and this equivalence of norms,
$$\mathbb{E}\big[\|x_{t-i} - x_t\|_2\big] \leq \sum_{s=t-i}^{t-1} \mathbb{E}\big[\|x_s - x_{s+1}\|_2\big] \leq D \sum_{s=t-i}^{t-1} \mathbb{E}\big[\|x_s - x_{s+1}\|_{x_s}\big] \leq D \sum_{s=t-i}^{t-1} 2\eta\, \mathbb{E}\big[\|\bar{g}_s + \tilde{g}_{s+1} - \tilde{g}_s\|_{x_s,*}\big].$$
Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Since $\bar{g}_s + \tilde{g}_{s+1} - \tilde{g}_s = \frac{1}{k+1} \sum_{i=0}^{k-1} \hat{g}_{s-i} + \frac{1}{k+1} \hat{g}_s$, we have
$$\mathbb{E}\big[\|\bar{g}_s + \tilde{g}_{s+1} - \tilde{g}_s\|_{x_s,*}^2\big] \leq \frac{3}{k^2}\, \mathbb{E}\big[\|\hat{g}_s\|_{x_s,*}^2\big] + \frac{3}{k^2}\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \mathbb{E}_{s-i}[\hat{g}_{s-i}]\Big\|_{x_s,*}^2\Big] + \frac{3}{k^2}\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|_{x_s,*}^2\Big]$$
$$\leq \frac{3 C^2 d^2}{k^2 \delta^2} + 3 D^2 L^2 + \frac{3}{k^2}\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|_{x_s,*}^2\Big],$$
using $\|\mathbb{E}_{s-i}[\hat{g}_{s-i}]\|_{x_s,*} = \|\nabla \hat{f}_{s-i}(x_{s-i})\|_{x_s,*} \leq D L$.
Proof of BANDITFTRL-VR regret: Lipschitz case (continued)
Fact: for $0 \leq i \leq k$ such that $t - i \geq 1$,
$$\tfrac12 \|z\|_{x_{t-i},*} \leq \|z\|_{x_t,*} \leq 2 \|z\|_{x_{t-i},*}.$$
Because the terms in the sum form a martingale difference sequence,
$$\mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|_{x_s,*}^2\Big] \leq 4\, \mathbb{E}\Big[\Big\|\sum_{i=0}^{k-1} \big(\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big)\Big\|_{x_{s-k},*}^2\Big] = 4 \sum_{i=0}^{k-1} \mathbb{E}\Big[\big\|\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big\|_{x_{s-k},*}^2\Big]$$
$$\leq 16 \sum_{i=0}^{k-1} \mathbb{E}\Big[\big\|\hat{g}_{s-i} - \mathbb{E}_{s-i}[\hat{g}_{s-i}]\big\|_{x_{s-i},*}^2\Big] \leq 16 \sum_{i=0}^{k-1} \mathbb{E}\big[\|\hat{g}_{s-i}\|_{x_{s-i},*}^2\big] \leq 16 \sum_{i=0}^{k-1} \frac{C^2 d^2}{\delta^2} = 16 k\, \frac{C^2 d^2}{\delta^2}.$$
Proof of BANDITFTRL-VR regret: Lipschitz case (conclusion)
Combining the components of the stability estimate with the previous calculations,
$$\mathbb{E}\big[\|x_{t-i} - x_t\|_2\big] \leq 2 \eta D \sum_{s=t-i}^{t-1} \Big(\frac{3 C^2 d^2}{k^2 \delta^2} + 3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2} \leq 2 \eta k D \Big(\frac{3 C^2 d^2}{k^2 \delta^2} + 3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2}.$$
Hence
$$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) \leq \epsilon L T + 2 L \delta D T + 2 C k + \frac{2 C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} \log(1/\epsilon) + 2 \eta k L T D \Big(\frac{3 C^2 d^2}{k^2 \delta^2} + 3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2}.$$
Now set $\eta = T^{-11/16} d^{-3/8}$, $\delta = T^{-5/16} d^{3/8}$, $k = T^{1/8} d^{1/4}$.
Discussion of BANDITFTRL-VR regret: Lipschitz gradient case
The approximation of real to smoothed losses now incurs an $H \delta^2 D^2 T$ penalty instead of $\delta D T$ (where $H$ bounds the Lipschitz constant of the gradients); the rest of the analysis goes through with some changes in the constants. General regret bound:
$$\mathrm{Reg}_T(\text{BANDITFTRL-VR}) \leq \epsilon L T + H \delta^2 D^2 T + C k + \frac{2 C^2 d^2 \eta T}{\delta^2 (k+1)^2} + \frac{1}{\eta} \log(1/\epsilon) + 2 \eta k (T L + D H T) D \Big(\frac{3 C^2 d^2}{k^2 \delta^2} + 3 D^2 L^2 + \frac{48 C^2 d^2}{k \delta^2}\Big)^{1/2}.$$
Now set $\eta = T^{-8/13} d^{-5/6}$, $\delta = T^{-5/26} d^{1/3}$, $k = T^{1/13} d^{5/3}$.
Section 5: Other Methods
Other BCO methods
Strongly convex loss functions: augment $R$ in BANDITFTRL with additional regularization.
- $C^{0,1}$ [Agarwal et al., 2010]: $O(T^{2/3})$ regret
- $C^{1,1}$ [Hazan & Levy, 2014]: $O(T^{1/2})$ regret
Other types of algorithms:
- Ellipsoid method-based algorithm [Hazan & Li, 2016]: $O(2^{d^4} \log(T)^{2d}\, T^{1/2})$
- Kernel-based algorithm [Bubeck et al., 2017]: $O(d^{9.5}\, T^{1/2})$
Section 6: Conclusion
Conclusion
- BCO is a flexible framework for modeling learning problems with sequential data and very limited feedback.
- BCO generalizes many existing models of online learning and optimization.
- State-of-the-art algorithms leverage techniques from online convex optimization and interior-point methods.
- Efficient algorithms attaining optimal guarantees in the $C^{0,1}$ and $C^{1,1}$ cases are still open.