Bandit Online Convex Optimization
March 31, 2015
Outline
1. OCO vs Bandit OCO
2. Gradient Estimates
3. Oblivious Adversary
4. Reshaping for Improved Rates
5. Adaptive Adversary
6. Concluding Remarks
Review of (Online) Convex Optimization

Set-up:
1. A sequence of convex functions $c_t : S \to \mathbb{R}$ over a convex set $S$.
2. The learner chooses a point $x_t \in S$ and receives $c_t(x_t)$.
3. Goal: minimize the regret $\sum_{t=1}^n c_t(x_t) - \min_{x \in S} \sum_{t=1}^n c_t(x)$.
4. Update rule: $x_{t+1} = x_t - \eta\, \nabla c_t(x_t)$.

Scenarios:
1. Offline: $c_t \equiv c$ is a fixed function.
2. Online stochastic: $c_t(x) = c(x) + \epsilon_t(x)$ is a noisy estimate.
3. Online adversarial: $c_t$ is chosen adversarially.

Key information: knowledge of gradients!
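To make the update rule concrete, here is a minimal sketch of the projected variant used later in Zinkevich's theorem, assuming $S$ is a Euclidean ball of radius $R$; all names are illustrative:

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection onto the ball of radius R (a simple choice of S)."""
    norm = np.linalg.norm(x)
    return x if norm <= R else x * (R / norm)

def online_gradient_descent(grad_oracles, x0, eta, R):
    """Full-information update x_{t+1} = P_S(x_t - eta * grad c_t(x_t)).

    grad_oracles: one callable per round, mapping the played point x_t to
    the gradient of that round's cost c_t at x_t.
    """
    x = x0
    iterates = [x0]
    for grad in grad_oracles:
        x = project_ball(x - eta * grad(x), R)
        iterates.append(x)
    return iterates
```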
Bandit Setting

Bandit scenario set-up:
1. A sequence of convex functions $c_t : S \to \mathbb{R}$ over a convex set $S$.
2. The learner chooses a point $x_t \in S$ and receives only the value $c_t(x_t)$.
3. Goal: minimize the regret $\sum_{t=1}^n c_t(x_t) - \min_{x \in S} \sum_{t=1}^n c_t(x)$.

Question: can we perform gradient descent without gradients?
Estimating Gradients: Multi-point

With multi-point feedback, use finite differences!
1. One dimension: $f'(x) \approx \frac{1}{h}\big(f(x+h) - f(x)\big)$.
2. $d$ dimensions: $\nabla f(x) \approx \frac{1}{h}\big(f(x+he_1) - f(x),\, \ldots,\, f(x+he_d) - f(x)\big)$.

Theorem (Agarwal et al., 2010). Querying $d+1$ points per round is enough to recover the standard OCO bounds, and even querying 2 points gets you to within $\log(T)$ factors.
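A minimal sketch of the $(d+1)$-point estimator above; the forward-difference scheme and the choice of $h$ are illustrative:

```python
import numpy as np

def finite_difference_gradient(f, x, h=1e-5):
    """Multi-point gradient estimate from d+1 value queries:
    one at x, and one at x + h*e_i for each coordinate direction."""
    d = len(x)
    fx = f(x)
    grad = np.zeros(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = 1.0
        grad[i] = (f(x + h * e) - fx) / h
    return grad
```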
Estimating Gradients: One-point

With one-point feedback, smooth the function by averaging over a ball of radius $\delta$:
$$\hat f(x) := \frac{1}{\mathrm{vol}(\delta B_1)}\int_{\delta B_1} f(x+v)\,dv = \mathbb{E}_{v \sim B_1}\big[f(x+\delta v)\big].$$

One dimension: $\hat f(x) = \frac{1}{2\delta}\int_{-\delta}^{\delta} f(x+v)\,dv$, so
$$\hat f'(x) = \frac{d}{dx}\,\frac{1}{2\delta}\int_{-\delta}^{\delta} f(x+v)\,dv = \frac{1}{2\delta}\big(f(x+\delta) - f(x-\delta)\big).$$

$d$ dimensions:
$$\nabla \hat f(x) = \frac{\nabla \int_{\delta B_1} f(x+v)\,dv}{\mathrm{vol}(\delta B_1)} = \frac{\int_{\partial(\delta B_1)} f(x+u)\,\frac{u}{\|u\|}\,du}{\mathrm{vol}(\delta B_1)} = \mathbb{E}_{u \sim \partial B_1}\big[f(x+\delta u)\,u\big]\,\frac{\mathrm{vol}(\partial(\delta B_1))}{\mathrm{vol}(\delta B_1)} = \frac{d}{\delta}\,\mathbb{E}_{u \sim \partial B_1}\big[f(x+\delta u)\,u\big].$$
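A quick numeric sanity check of the identity above (a sketch, not from the paper): averaging $\frac{d}{\delta} f(x+\delta u)\,u$ over uniform unit vectors $u$ estimates $\nabla \hat f(x)$, which is close to $\nabla f(x)$ for small $\delta$. The test function and sample count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_point_gradient_estimate(f, x, delta, n_samples=200_000):
    """Monte Carlo average of (d/delta) * f(x + delta*u) * u with u uniform
    on the unit sphere: an unbiased estimate of the smoothed gradient."""
    d = len(x)
    u = rng.normal(size=(n_samples, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)  # uniform on the sphere
    values = np.array([f(x + delta * ui) for ui in u])
    return (d / delta) * (values[:, None] * u).mean(axis=0)

# For f(x) = ||x||^2 the smoothed gradient equals the true gradient 2x.
f = lambda x: float(x @ x)
x = np.array([0.5, -0.3, 0.2])
print(one_point_gradient_estimate(f, x, delta=0.05))  # approx. [1.0, -0.6, 0.4]
```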
Algorithm

$$\nabla \hat f(x) = \frac{d}{\delta}\,\mathbb{E}_{u \sim U(\partial B_1)}\big[f(x+\delta u)\,u\big].$$

Algorithm (Bandit Gradient Descent, Flaxman et al., 2004):
1. Draw $u_t \sim U(\partial B_1)$.
2. Play $x_t = y_t + \delta u_t$ and receive the value $c_t(x_t)$.
3. Update: $y_{t+1} = P_S(y_t - \nu\, c_t(x_t)\, u_t)$.

Subtle issue: $y_t + \delta u_t$ might fall outside $S$, so project onto $(1-\alpha)S$ for some $\alpha \in (0,1)$ instead.
New update: $y_{t+1} = P_{(1-\alpha)S}(y_t - \nu\, c_t(x_t)\, u_t)$.
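A minimal sketch of the BGD loop above. It assumes $S$ contains the origin, so that projecting onto $(1-\alpha)S$ reduces to a scaled projection onto $S$; `costs` and `project_S` are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def bandit_gradient_descent(costs, d, nu, delta, alpha, project_S):
    """BGD sketch: costs[t](x) returns only the scalar value c_t(x);
    project_S is the Euclidean projection onto S. Since S is scaled about
    the origin, P_{(1-a)S}(y) = (1-a) * P_S(y / (1-a))."""
    y = np.zeros(d)
    total_cost = 0.0
    for cost in costs:
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)          # u_t uniform on the unit sphere
        x = y + delta * u               # play x_t = y_t + delta * u_t
        value = cost(x)                 # observe only the value c_t(x_t)
        total_cost += value
        y = y - nu * value * u          # one-point gradient step
        y = (1 - alpha) * project_S(y / (1 - alpha))
    return total_cost
```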
Review of OCO Results

Theorem (Zinkevich, 2003). Let $S \subseteq B_R$, let $\{c_t\} : S \to \mathbb{R}$ be a sequence of convex functions, let $G = \sup_t \|\nabla c_t(x_t)\|$, and let $x_{t+1} = P_S(x_t - \eta\, \nabla c_t(x_t))$. Then
$$\sum_{t=1}^n c_t(x_t) - \min_{x \in S}\sum_{t=1}^n c_t(x) \le \frac{R^2}{2\eta} + \frac{n\eta G^2}{2}.$$

Lemma (Randomized Zinkevich). Let $S \subseteq B_R$, let $\{c_t\} : S \to \mathbb{R}$ be a sequence of convex functions, let $\{g_t\}$ be random vectors with $\mathbb{E}[g_t \mid x_t] = \nabla c_t(x_t)$, let $G = \sup_t \|g_t\|$, and let $x_{t+1} = P_S(x_t - \eta g_t)$. Then
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le \frac{R^2}{2\eta} + \frac{n\eta G^2}{2}.$$
Bandit OCO

Recall the BGD algorithm above: draw $u_t \sim U(\partial B_1)$, play $x_t = y_t + \delta u_t$, receive $c_t(x_t)$, and update $y_{t+1} = P_{(1-\alpha)S}(y_t - \nu\, c_t(x_t)\, u_t)$.

Theorem (Flaxman et al., 2004). Assume that $B_r \subseteq S \subseteq B_R$ and that $|c_t| \le C$ uniformly. Then for sufficiently large $n$ and suitable choices of $\nu$, $\delta$, and $\alpha$,
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le 3Cn^{5/6}\Big(\frac{dR}{r}\Big)^{1/3}.$$
Extending Randomized Zinkevich

Need to show:
1. BGD's updates are valid for randomized Zinkevich:
   (a) $x_t \in S$ for the chosen $\alpha$ and $\delta$;
   (b) $g_t = \frac{d}{\delta}\, c_t(x_t)\, u_t$ is a valid (unbiased) gradient estimate.
2. An upper bound on $G$ for randomized Zinkevich.
3. The bound for $\hat c_t$ extends to $c_t$.
4. The bound over $(1-\alpha)S$ extends to $S$.
BGD's Updates Are Valid for Randomized Zinkevich

1. $x_t \in S$ for the chosen $\alpha$ and $\delta$: if $\delta \le \alpha r$, then
$$x_t = y_t + \delta u_t \in S,$$
since $y_t \in (1-\alpha)S$ and $B_r \subseteq S$.
2. $g_t = \frac{d}{\delta}\, c_t(x_t)\, u_t$ is valid:
$$\mathbb{E}[g_t \mid y_t] = \frac{d}{\delta}\,\mathbb{E}_{u \sim \partial B_1}\big[c_t(y_t + \delta u)\,u\big] = \nabla\, \mathbb{E}_{v \sim B_1}\big[c_t(y_t + \delta v)\big] = \nabla \hat c_t(y_t).$$
Upper Bounds on $G$ for Zinkevich's Theorem

Fact: $\|g_t\| = \frac{d}{\delta}\,|c_t(x_t)|\,\|u_t\| \le \frac{dC}{\delta}$.

Thus, for $\alpha \in (0,1)$, $\delta \le \alpha r$, $\eta = \nu\,\frac{\delta}{d}$, and $G = \frac{dC}{\delta}$, randomized Zinkevich implies that
$$\mathbb{E}\Big[\sum_{t=1}^n \hat c_t(y_t)\Big] - \min_{x \in (1-\alpha)S}\sum_{t=1}^n \hat c_t(x) \le \frac{R^2}{2\eta} + \frac{n\eta G^2}{2} = RG\sqrt n = \frac{RdC\sqrt n}{\delta} \quad \text{for } \eta = \frac{R}{G\sqrt n}.$$
Extending from $\hat c_t$ to $c_t$

Lemma. For $y \in (1-\alpha)S$ and $x \in S$,
$$|c_t(y) - c_t(x)| \le \frac{2C}{\alpha r}\,\|x - y\|.$$
Consequently,
$$|\hat c_t(y_t) - c_t(y_t)| \le \delta\,\frac{2C}{\alpha r}, \qquad |\hat c_t(y_t) - c_t(x_t)| \le 2\delta\,\frac{2C}{\alpha r}.$$

The previous regret bound then implies
$$\mathbb{E}\Big[\sum_{t=1}^n \Big(c_t(x_t) - 2\delta\,\frac{2C}{\alpha r}\Big)\Big] - \min_{x \in (1-\alpha)S}\sum_{t=1}^n \Big(c_t(x) + \delta\,\frac{2C}{\alpha r}\Big) \le \frac{RdC\sqrt n}{\delta},$$
or
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in (1-\alpha)S}\sum_{t=1}^n c_t(x) \le \frac{RdC\sqrt n}{\delta} + 3n\delta\,\frac{2C}{\alpha r}.$$
Extending from $(1-\alpha)S$ to $S$

Lemma.
$$\min_{x \in (1-\alpha)S}\sum_{t=1}^n c_t(x) \le \min_{x \in S}\sum_{t=1}^n c_t(x) + 2\alpha C n.$$
Thus,
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le \frac{RdC\sqrt n}{\delta} + 3n\delta\,\frac{2C}{\alpha r} + 2\alpha C n.$$
Recap

1. $\mathbb{E}\big[\sum_t \hat c_t(y_t)\big] - \min_{x \in (1-\alpha)S}\sum_t \hat c_t(x) \le \frac{RdC\sqrt n}{\delta}$
   (optimal $\eta$, and $\delta \le \alpha r$, $\alpha < 1$, $B_r \subseteq S$, $|c_t| \le C$).
2. $\mathbb{E}\big[\sum_t c_t(x_t)\big] - \min_{x \in (1-\alpha)S}\sum_t c_t(x) \le \frac{RdC\sqrt n}{\delta} + 3n\delta\,\frac{2C}{\alpha r}$
   (Lipschitz comparison across $(1-\alpha)S$ and $S$, $B_r \subseteq S$, $|c_t| \le C$).
3. $\mathbb{E}\big[\sum_t c_t(x_t)\big] - \min_{x \in S}\sum_t c_t(x) \le \frac{RdC\sqrt n}{\delta} + 3n\delta\,\frac{2C}{\alpha r} + 2\alpha C n$
   (the minima over the two sets are close, $B_r \subseteq S$, $|c_t| \le C$).
Remarks

1. Optimizing over $\delta$ and $\alpha$ yields the theorem:
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le 3Cn^{5/6}\Big(\frac{dR}{r}\Big)^{1/3}.$$
2. If the Lipschitz constant $L$ of the $c_t$ is known, the bound improves to $O(n^{3/4})$.
3. Preconditioning can improve the ratio $R/r$.
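As a back-of-the-envelope sketch (suppressing constants and the $dR/r$ factors), the $n^{5/6}$ rate comes from balancing the three terms of the recap:

```latex
\[
\underbrace{\frac{\sqrt{n}}{\delta}}_{\text{estimator variance}}
\;+\;
\underbrace{\frac{n\,\delta}{\alpha}}_{\text{smoothing error}}
\;+\;
\underbrace{n\,\alpha}_{\text{shrunken domain}}
\;\asymp\; n^{5/6}
\qquad\text{for}\quad
\delta \propto n^{-1/3},\;\; \alpha \propto n^{-1/6}.
\]
```

Note that this choice also respects the constraint $\delta \le \alpha r$ for large $n$.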
Theorem for $L$-Lipschitz Cost Functions

Theorem (Flaxman et al., 2004, for $L$-Lipschitz cost functions). If each $c_t$ is $L$-Lipschitz, then for $n$ sufficiently large and
$$\nu = \frac{R}{C\sqrt n}, \qquad \delta = n^{-1/4}\sqrt{\frac{RdCr}{3(Lr+C)}}, \qquad \alpha = \frac{\delta}{r},$$
we have
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le 2n^{3/4}\sqrt{3RdC\Big(L + \frac{C}{r}\Big)}.$$
Reshaping

Reshaping increases the accuracy of gradient descent! The regret bound above depends on $R/r$, which can be very large.
Idea: reshape the body to make it more round by putting it in isotropic position. This amounts to finding an affine transformation $T$.
Isotropic Position and Algorithm

Isotropic position:
1. Estimate the covariance of random samples from $S$ (estimating $r$ and $R$ in $B_r \subseteq S \subseteq B_R$).
2. Find an affine transformation $T$ so that the new covariance matrix is the identity.
3. Apply $T$ to $S \subseteq \mathbb{R}^d$ so that $B_1 \subseteq T(S) \subseteq B_d$.
4. Then $R = d$ and $r = 1$.

Algorithm (Lovász and Vempala, 2003): runs in time $O(d^4)\,\mathrm{polylog}(d, R/r)$ and puts the body in a nearly isotropic position: $R = 1.01d$ and $r = 1$.
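A minimal sketch of steps 1 and 2, using a plain empirical whitening transform rather than the Lovász-Vempala algorithm; the uniform sampler below is an illustrative stand-in (uniformly sampling a convex body is itself a nontrivial problem):

```python
import numpy as np

def whitening_transform(samples):
    """Estimate an affine map T(x) = A(x - mu) that puts the sampled body
    into approximately isotropic position (zero mean, identity covariance)."""
    mu = samples.mean(axis=0)
    cov = np.cov(samples, rowvar=False)
    # A = cov^{-1/2} via the eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    A = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return lambda x: A @ (x - mu)

# Usage sketch: a flat box whose axes differ by a factor of 25 gets rounded.
samples = np.random.default_rng(0).uniform(-1.0, 1.0, (10_000, 3)) * np.array([5.0, 1.0, 0.2])
T = whitening_transform(samples)
```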
Reshaping Properties

Lemma (new Lipschitz constant $L' = LR$). Let $c'_t(u) = c_t(T^{-1}(u))$. Then $c'_t$ is $LR$-Lipschitz.

Proof outline: Let $x_1, x_2 \in S$ and $u_1 = T(x_1)$, $u_2 = T(x_2)$. Then
$$|c'_t(u_1) - c'_t(u_2)| = |c_t(x_1) - c_t(x_2)| \le L\,\|x_1 - x_2\|.$$
Using that $T$ is affine, hence bounded, one shows by contradiction that $\|x_1 - x_2\| \le R\,\|u_1 - u_2\|$, which gives the $LR$-Lipschitz condition on $c'_t$.
Reshaping and the BGD Algorithm

Corollary (reshaping). For a set $S$ of diameter $D$ and $L$-Lipschitz costs $c_t$, after putting $S$ into near-isotropic position, the BGD algorithm has expected regret
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le 6n^{3/4}\,d\big(\sqrt{CLR} + C\big).$$
Without the $L$-Lipschitz condition:
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(x_t)\Big] - \min_{x \in S}\sum_{t=1}^n c_t(x) \le 6n^{5/6}\,dC.$$
Proof: apply the earlier theorems with $r = 1$, $R' = 1.01d$, $L' = LR$, and $C' = C$.
Oblivious Adversary

So far, we have analyzed the algorithm in the case of an oblivious adversary, who:
1. fixes the sequence of functions $c_1, c_2, \ldots$ in advance;
2. knows the decision maker's algorithm;
3. does not have knowledge of the algorithm's random decisions.
Adaptive Adversary

Consider an adaptive adversary who plays a game with the decision maker:
1. the decision maker knows $x_1, c_1(x_1), x_2, c_2(x_2), \ldots, x_{t-1}, c_{t-1}(x_{t-1})$;
2. the decision maker chooses $x_t$;
3. the adaptive adversary knows $x_1, c_1, x_2, c_2, \ldots, x_{t-1}, c_{t-1}$;
4. the adaptive adversary chooses $c_t$.

Main takeaway: the theorems against an oblivious adversary all hold against an adaptive adversary, up to a change of the multiplicative constant by a factor of at most 3.
Adaptive Adversary: Regret

Fact: the bounds relating the costs $c_t(x_t)$, $c_t(y_t)$, and $\hat c_t(y_t)$ were all worst-case bounds, i.e. they hold for arbitrary $c_t$, regardless of whether the $c_t$ are adaptively chosen or not. Thus it suffices to bound the regret
$$\mathbb{E}\Big[\sum_{t=1}^n \hat c_t(y_t) - \min_{y \in S}\sum_{t=1}^n \hat c_t(y)\Big].$$
Idea: show that the adversary's extra knowledge of $\{x_k\}_{k=1}^{t}$ cannot help it maximize this regret.
Extending Randomized Zinkevich for an Adaptive Adversary

Need to show:
1. BGD's updates are valid for randomized Zinkevich (next slides):
   (a) $x_t \in S$ for the chosen $\alpha$ and $\delta$;
   (b) $g_t = \frac{d}{\delta}\, c_t(x_t)\, u_t$ is valid.
2. An upper bound on $G$ for randomized Zinkevich (from before).
3. The bound for $\hat c_t$ extends to $c_t$ (from before).
4. The bound over $(1-\alpha)S$ extends to $S$ (from before).
Lemma: BGD Updates for an Adaptive Adversary

Lemma (BGD updates for adaptive costs). Let $S \subseteq B_R$ and let $\{c_t\} : S \to \mathbb{R}$ be a sequence of convex differentiable functions, with $c_{t+1}$ possibly depending on $z_1, z_2, \ldots, z_t$, where $z_1, z_2, \ldots, z_t \in S$ are defined by $z_{t+1} = P_S(z_t - \eta g_t)$. Here the $g_t$ are vector-valued random variables such that $\mathbb{E}[g_t \mid z_1, c_1, z_2, c_2, \ldots, z_t, c_t] = \nabla c_t(z_t)$, and $G = \sup_t \|g_t\|$. Then, for $\eta = \frac{R}{G\sqrt n}$,
$$\mathbb{E}\Big[\sum_{t=1}^n c_t(z_t) - \min_{x \in S}\sum_{t=1}^n c_t(x)\Big] \le 3\Big(\frac{R^2}{2\eta} + \frac{n\eta G^2}{2}\Big) = 3RG\sqrt n.$$
Proof of the Adaptive Adversary Lemma

Proof: Let $h_t(x) := c_t(x) + x \cdot \xi_t$, where $\xi_t = g_t - \nabla c_t(z_t)$. Observe that $\nabla h_t(z_t) = \nabla c_t(z_t) + \xi_t = g_t$ and $\|\xi_t\| \le \|g_t\| + \|\nabla c_t(z_t)\| \le 2G$. By Zinkevich's theorem applied to the $h_t$, which are deterministic at this point of the game,
$$\sum_{t=1}^n h_t(z_t) \le \min_{x \in S}\sum_{t=1}^n h_t(x) + RG\sqrt n.$$
Since $z_t$ is determined by the history and $\mathbb{E}[\xi_t \mid \text{history}] = 0$,
$$\mathbb{E}[h_t(z_t)] = \mathbb{E}[c_t(z_t)] + \mathbb{E}[\xi_t \cdot z_t] = \mathbb{E}[c_t(z_t)],$$
so it suffices to show that
$$\mathbb{E}\Big[\min_{x \in S}\sum_{t=1}^n h_t(x)\Big] \le \mathbb{E}\Big[\min_{x \in S}\sum_{t=1}^n c_t(x)\Big] + 2RG\sqrt n.$$
Proof of the Adaptive Adversary Lemma, Part 2

Proof continued. Left to show:
$$\mathbb{E}\Big[\min_{x \in S}\sum_{t=1}^n h_t(x)\Big] \le \mathbb{E}\Big[\min_{x \in S}\sum_{t=1}^n c_t(x)\Big] + 2RG\sqrt n.$$
By Cauchy-Schwarz,
$$\Big|\sum_{t=1}^n \big(h_t(x) - c_t(x)\big)\Big| = \Big|x \cdot \sum_{t=1}^n \xi_t\Big| \le \|x\|\,\Big\|\sum_{t=1}^n \xi_t\Big\| \le R\,\Big\|\sum_{t=1}^n \xi_t\Big\|,$$
and this holds in particular for the minimizing $x$. Taking expectations, we bound the sum using the fact that the $\xi_t$ form a martingale difference sequence, so the cross terms vanish: $\mathbb{E}[\xi_s \cdot \xi_t] = \mathbb{E}\big[\xi_s \cdot \mathbb{E}[\xi_t \mid \text{history}]\big] = 0$ for $s < t$ (recall $\|\xi_t\| \le 2G$). Hence
$$\Big(\mathbb{E}\Big[\Big\|\sum_{t=1}^n \xi_t\Big\|\Big]\Big)^2 \le \mathbb{E}\Big[\Big\|\sum_{t=1}^n \xi_t\Big\|^2\Big] = \sum_{t=1}^n \mathbb{E}\big[\|\xi_t\|^2\big] + 2\sum_{1 \le s < t \le n}\mathbb{E}[\xi_s \cdot \xi_t] \le 4nG^2,$$
giving $\mathbb{E}\big\|\sum_t \xi_t\big\| \le 2G\sqrt n$ and hence the claimed $2RG\sqrt n$.
Summary

The paper extends Zinkevich's gradient descent idea to a setting in which one doesn't have access to the gradient. Instead, the gradient is approximated from a single function evaluation per round.
Interpretation: the approximation at each step is the gradient of a smoothed-out version of the cost function at that step.
The analysis applies to both oblivious and adaptive adversaries (the bounds change by a factor of 3).
Preconditioning ("reshaping") improves the bounds significantly.
Possible Extensions

- Extension of BGD to Zinkevich's model in the case of adaptive step sizes.
- Extension of BGD to Zinkevich's model in the case of a non-stationary adversary, i.e. when the regret takes the form
$$\sum_{t=1}^n c_t(x_t) - \min_{w_1, w_2, \ldots, w_n \in S}\sum_{t=1}^n c_t(w_t).$$
- Potential extension to minimizing any (convex?) function over a convex set via the updates $y_{t+1} := y_t - \nu\,\big(c_t(x_t) - c_{t-1}(x_{t-1})\big)\,u_t$, even if we don't know the uniform bound on $c_t$.
References

Main paper: A. D. Flaxman, A. T. Kalai, H. B. McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2005.
Zinkevich's paper: M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.
Multi-point paper: A. Agarwal, O. Dekel, L. Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, Omnipress, 2010.
Thank you!