Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett


1. Adversarial bandits
   Definition: sequential game.
   Lower bounds on regret from the stochastic case.
   Exp3: exponential weights strategy.

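Adversarial bandits

Repeated game: at each round $t = 1, \dots, n$, the strategy chooses $I_t$, then the adversary chooses $(x_{1,t}, \dots, x_{k,t})$.

Aim to minimize the regret,
\[ R_n = \max_j \sum_{t=1}^n x_{j,t} - \sum_{t=1}^n x_{I_t,t}, \]
or the pseudo-regret,
\[ \bar R_n = \max_j \mathbb{E} \sum_{t=1}^n x_{j,t} - \mathbb{E} \sum_{t=1}^n x_{I_t,t}. \]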
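As a small illustration of the regret definition, the sketch below computes $R_n$ for a fixed reward table and a fixed sequence of plays. The function name `regret` and the toy data are assumptions made for this example only.

```python
def regret(rewards, actions):
    """Regret R_n: best fixed arm's total reward minus the reward collected.

    rewards[t][j] plays the role of x_{j,t}; actions[t] plays the role of I_t.
    """
    k = len(rewards[0])
    best_fixed = max(sum(row[j] for row in rewards) for j in range(k))
    collected = sum(row[a] for row, a in zip(rewards, actions))
    return best_fixed - collected


# Toy instance (hypothetical data): k = 2 arms, n = 3 rounds.
rewards = [[1, 0], [1, 0], [0, 1]]
actions = [1, 0, 0]
example_regret = regret(rewards, actions)
```

Here the best fixed arm is arm 0 with total reward 2, while the plays collect total reward 1.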

Adversarial bandits

Some scenarios:

$I_t$ deterministic. Hopeless. Must have $I_t$ randomized: the strategy plays a distribution.

Regret is random. Could consider expected regret, or high-probability regret bounds.

Oblivious adversary: the adversary chooses $x_{j,t}$ independently of the strategy's previous random outcomes $I_s$, $s < t$.

Adaptive (or nonoblivious) adversary: the adversary chooses $x_{j,t}$ with knowledge of the strategy's previous random outcomes $I_s$, $s < t$. But then what does regret mean? (The rewards might have been different had the strategy played a fixed arm throughout.)

Adversarial bandits: Lower bounds

It's clear that $\bar R_n \le \mathbb{E} R_n$ (a maximum of expectations is at most the expectation of the maximum). So a lower bound on pseudo-regret gives lower bounds on expected regret.

Lower bounds in the stochastic setting suffice here (the adversary can certainly choose a sequence randomly).

Theorem: For any strategy and any $n$, there is an oblivious adversary playing $x_{j,t} \in \{0,1\}$ for which
\[ \bar R_n \ge \frac{1}{18} \min\{\sqrt{nk},\, n\}. \]
In particular, it suffices for the adversary to play i.i.d. Bernoulli rewards to achieve a pseudo-regret that is this large.

Adversarial bandit strategies

How should a strategy choose the distribution over $I_t$ in the adversarial setting? Exploration vs exploitation remains an important issue. Exploitation appears to be even more dangerous.

Let's digress and consider the corresponding full information game. It's standard (and, we'll see, convenient) to pose it in terms of losses, rather than rewards. For $x_{i,t} \in [0,1]$, think of $\ell_{i,t} = 1 - x_{i,t}$, so that $\ell_{i,t} \in [0,1]$.

An aside: A full information prediction game

Consider the following repeated game: at time $t$,
1. the strategy chooses a distribution $p_t$ over $k$ experts ($I_t \sim p_t$),
2. the adversary chooses a loss vector $\ell_t = (\ell_{1,t}, \dots, \ell_{k,t}) \in [0,1]^k$,
3. the strategy sees $\ell_t$.

The aim is to ensure that, for all choices of the adversary,
\[ \bar R_n = \mathbb{E} \sum_{t=1}^n \ell_{I_t,t} - \min_j \sum_{t=1}^n \ell_{j,t} \]
is not too large.

An aside: A full information prediction game

Exponential Weights Strategy (with parameter $\eta$)

set $p_{1,1} = \cdots = p_{k,1} = 1/k$.
for $t = 1, 2, \dots, n$,
    choose $I_t \sim p_t$, observe $\ell_t$.
    \[ L_{i,t} = \sum_{s=1}^t \ell_{i,s}, \qquad p_{i,t+1} = \frac{\exp(-\eta L_{i,t})}{\sum_{j=1}^k \exp(-\eta L_{j,t})}. \]
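The strategy above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `exp_weights_probs` and `run_exponential_weights`, the random loss table, and the seeds are all assumptions for the example. The expectation over $I_t$ is computed exactly as $\sum_i p_{i,t} \ell_{i,t}$, which is valid when the loss sequence is fixed in advance.

```python
import math
import random


def exp_weights_probs(cum_losses, eta):
    """p_i proportional to exp(-eta * L_i)."""
    m = min(cum_losses)  # shift for numerical stability; it cancels in the ratio
    w = [math.exp(-eta * (L - m)) for L in cum_losses]
    total = sum(w)
    return [x / total for x in w]


def run_exponential_weights(losses, eta, seed=0):
    """Run the strategy on a fixed table losses[t][i] in [0, 1].

    Returns (expected regret against the best expert, sampled actions).
    """
    rng = random.Random(seed)
    k = len(losses[0])
    cum = [0.0] * k  # L_{i,t}
    expected_loss = 0.0
    actions = []
    for loss_t in losses:
        p = exp_weights_probs(cum, eta)
        actions.append(rng.choices(range(k), weights=p)[0])  # I_t ~ p_t
        expected_loss += sum(pi * li for pi, li in zip(p, loss_t))
        cum = [c + l for c, l in zip(cum, loss_t)]  # full information update
    return expected_loss - min(cum), actions


# Hypothetical example: n = 200 rounds, k = 4 experts, arbitrary fixed losses.
data_rng = random.Random(1)
n, k = 200, 4
losses = [[data_rng.random() for _ in range(k)] for _ in range(n)]
eta = math.sqrt(8 * math.log(k) / n)
expected_regret, actions = run_exponential_weights(losses, eta)
bound = math.sqrt(n * math.log(k) / 2)
```

With this tuning of `eta`, `expected_regret` should not exceed `bound`, whatever loss table is supplied.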

An aside: A full information prediction game

Theorem: The exponential weights strategy with parameter $\eta$ incurs regret
\[ \bar R_n \le \frac{n\eta}{8} + \frac{\log k}{\eta}. \]
Choosing $\eta = \sqrt{8 \log k / n}$ gives $\bar R_n \le \sqrt{n \log k / 2}$.
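The tuned value of $\eta$ comes from minimizing the right-hand side of the bound over $\eta > 0$. A short worked version, using only the quantities in the theorem ($f$ is a name introduced here for the bound):

```latex
f(\eta) = \frac{n\eta}{8} + \frac{\log k}{\eta}, \qquad
f'(\eta) = \frac{n}{8} - \frac{\log k}{\eta^2} = 0
\;\Longrightarrow\;
\eta^* = \sqrt{\frac{8 \log k}{n}},
\qquad
f(\eta^*) = \sqrt{\frac{n \log k}{8}} + \sqrt{\frac{n \log k}{8}}
          = \sqrt{\frac{n \log k}{2}}.
```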

An aside: A full information prediction game

Proof idea: For this choice of $p_t$,
\[ \Phi_t = -\frac{1}{\eta} \log\Big( \sum_{i=1}^k \exp(-\eta L_{i,t}) \Big) \]
is a measure of progress. When $\mathbb{E}\,\ell_{I_t,t}$ is big, there is a big increase from $\Phi_{t-1}$ to $\Phi_t$. But $\Phi_n \le \min_j L_{j,n}$.

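An aside: A full information prediction game

For $\ell_{i,t} \in [0,1]$, Hoeffding's inequality shows that
\[ \log \mathbb{E} \exp\big(-\eta(\ell_{I_t,t} - \mathbb{E}\,\ell_{I_t,t})\big) \le \frac{\eta^2}{8}, \]
that is,
\[ \mathbb{E}\,\ell_{I_t,t} \le \frac{\eta}{8} - \frac{1}{\eta} \log \mathbb{E} \exp(-\eta \ell_{I_t,t}). \]
But the choice of $p_t$ means that the sum of these c.g.f.s telescopes:
\[ \sum_{t=1}^n \log \mathbb{E} \exp(-\eta \ell_{I_t,t})
 = \sum_{t=1}^n \log \frac{\sum_{i=1}^k \exp(-\eta L_{i,t})}{\sum_{i=1}^k \exp(-\eta L_{i,t-1})}
 = \log\Big( \sum_{i=1}^k \exp(-\eta L_{i,n}) \Big) - \log k. \]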
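The telescoping identity can be checked numerically. The sketch below accumulates the per-round log moment generating functions under the exponential weights distribution and compares the total with the closed form. The function name `telescoping_sides` and the toy loss table are assumptions for this illustration.

```python
import math


def telescoping_sides(losses, eta):
    """Return (sum_t log E exp(-eta * l_{I_t,t}), log sum_i exp(-eta * L_{i,n}) - log k)."""
    k = len(losses[0])
    cum = [0.0] * k  # L_{i,t-1}, starting from L_{i,0} = 0
    lhs = 0.0
    for loss_t in losses:
        w = [math.exp(-eta * L) for L in cum]
        total = sum(w)
        p = [x / total for x in w]  # exponential weights distribution p_t
        lhs += math.log(sum(pi * math.exp(-eta * li) for pi, li in zip(p, loss_t)))
        cum = [c + l for c, l in zip(cum, loss_t)]
    rhs = math.log(sum(math.exp(-eta * L) for L in cum)) - math.log(k)
    return lhs, rhs


# Hypothetical loss table: n = 4 rounds, k = 3 experts.
toy_losses = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.4], [0.0, 1.0, 0.3], [0.6, 0.6, 0.1]]
lhs, rhs = telescoping_sides(toy_losses, eta=0.5)
```

The two returned values should agree up to floating-point error.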

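An aside: A full information prediction game

Thus,
\[ \mathbb{E} \sum_{t=1}^n \ell_{I_t,t}
 \le \frac{\eta n}{8} + \frac{\log k}{\eta} - \frac{1}{\eta} \log\Big( \sum_{i=1}^k \exp(-\eta L_{i,n}) \Big)
 \le \frac{\eta n}{8} + \frac{\log k}{\eta} + \min_j L_{j,n}, \]
where the second inequality uses $\sum_{i=1}^k \exp(-\eta L_{i,n}) \ge \exp(-\eta L_{j,n})$ for every $j$. Hence
\[ \bar R_n = \mathbb{E} \sum_{t=1}^n \ell_{I_t,t} - \min_j \sum_{t=1}^n \ell_{j,t} \le \frac{\eta n}{8} + \frac{\log k}{\eta}. \]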

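Back to bandits

What can the strategy do when it only sees $\ell_{I_t,t}$?

Importance sampling: in the strategy, replace $\ell_{i,t}$ by
\[ \tilde\ell_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1[I_t = i]. \]
This is an unbiased estimate of the unseen losses: $\ell_{j,t} = \mathbb{E}\,\tilde\ell_{j,t}$, where the expectation is over $I_t \sim p_t$.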
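Unbiasedness can be verified exactly on a small instance by averaging the estimate over every possible value of $I_t$, weighted by $p_t$. The helper name `importance_weighted_estimate` and the numbers are assumptions for this illustration.

```python
def importance_weighted_estimate(loss_t, p, played):
    """Estimated loss vector: l_i / p_i for the played arm, 0 for the others."""
    return [loss_t[i] / p[i] if i == played else 0.0 for i in range(len(loss_t))]


# Hypothetical single round with k = 3 arms.
loss_t = [0.2, 0.9, 0.5]
p = [0.5, 0.3, 0.2]

# Exact expectation over I_t ~ p: sum over the possible plays.
expected_estimate = [0.0] * len(loss_t)
for played in range(len(loss_t)):
    estimate = importance_weighted_estimate(loss_t, p, played)
    for i, value in enumerate(estimate):
        expected_estimate[i] += p[played] * value
```

Each entry of `expected_estimate` should match the corresponding true loss, even though only one coordinate is ever observed per play.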

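Exp3: Exponential weights

Strategy Exp3

set $p_1$ uniform on $\{1, \dots, k\}$.
for $t = 1, 2, \dots, n$,
    choose $I_t \sim p_t$, observe $\ell_{I_t,t}$.
    \[ \tilde\ell_{i,t} = \frac{\ell_{i,t}}{p_{i,t}} 1[I_t = i], \qquad
       \tilde L_{i,t} = \sum_{s=1}^t \tilde\ell_{i,s}, \qquad
       p_{i,t+1} = \frac{\exp(-\eta \tilde L_{i,t})}{\sum_{j=1}^k \exp(-\eta \tilde L_{j,t})}. \]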
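A minimal Python sketch of the Exp3 loop follows, without any mixed-in uniform exploration. The function name `exp3`, the callback `loss_fn(t, i)` (which reveals only the played arm's loss), and the toy loss function are assumptions for this illustration.

```python
import math
import random


def exp3(loss_fn, n, k, eta, seed=0):
    """Run Exp3 for n rounds on k arms; returns (total loss, last distribution played)."""
    rng = random.Random(seed)
    est_cum = [0.0] * k  # estimated cumulative losses
    p = [1.0 / k] * k
    total_loss = 0.0
    for t in range(n):
        m = min(est_cum)  # shift for numerical stability; it cancels in the ratio
        w = [math.exp(-eta * (L - m)) for L in est_cum]
        s = sum(w)
        p = [x / s for x in w]
        i = rng.choices(range(k), weights=p)[0]  # I_t ~ p_t
        loss = loss_fn(t, i)  # bandit feedback: only the played arm's loss
        total_loss += loss
        est_cum[i] += loss / p[i]  # importance-weighted estimate; other arms get 0
    return total_loss, p


# Hypothetical oblivious adversary: arm 2 always has loss 0.1, the others 0.9.
def toy_loss(t, i):
    return 0.1 if i == 2 else 0.9


n, k = 2000, 3
eta = math.sqrt(2 * math.log(k) / (n * k))
total_loss, last_p = exp3(toy_loss, n, k, eta)
```

An arm whose weight underflows to zero is never sampled, so the division by `p[i]` is only ever applied to a positive probability.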

Back to bandits

Then the regret involves $\sum_t (\ell_{I_t,t} - \ell_{j,t})$, and
\[ \ell_{I_t,t} = \sum_{i=1}^k p_{i,t}\,\tilde\ell_{i,t}, \qquad \ell_{j,t} = \mathbb{E}\,\tilde\ell_{j,t}. \]
But we can no longer appeal to Hoeffding's inequality, because $\tilde\ell_{i,t}$ is unbounded.

Happily, we only need an upper bound on the c.g.f. $\Gamma(\lambda)$ for negative values of $\lambda = -\eta$. This corresponds to a lower tail concentration inequality for the non-negative random variable $\tilde\ell_{I_t,t}$. But non-negative random variables with finite variance have sub-gaussian lower tails:

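Sub-Gaussian lower tails for non-negative r.v.s

Lemma: For $X \ge 0$ and $\lambda > 0$,
\[ \log \mathbb{E} e^{-\lambda X} \le \frac{\lambda^2}{2} \mathbb{E} X^2 - \lambda \mathbb{E} X, \]
hence
\[ \mathbb{E} X \le -\frac{1}{\lambda} \log \mathbb{E} e^{-\lambda X} + \frac{\lambda}{2} \mathbb{E} X^2. \]

Proof.
\[ \log \mathbb{E} \exp(-\lambda(X - \mathbb{E}X))
 = \log \mathbb{E} \exp(-\lambda X) + \lambda \mathbb{E} X
 \le \mathbb{E} \exp(-\lambda X) - 1 + \lambda \mathbb{E} X
 \le \mathbb{E} \frac{\lambda^2 X^2}{2}, \]
because $\log y \le y - 1$ for all $y > 0$ and $e^{-x} \le 1 - x + x^2/2$ for $x \ge 0$.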
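The lemma can be sanity-checked on a discrete non-negative random variable with a heavy right tail, where a Hoeffding-style bound on the range would be loose. The function name `lower_tail_lemma_sides` and the distribution are assumptions for this illustration.

```python
import math


def lower_tail_lemma_sides(values, probs, lam):
    """Return (log E exp(-lam X), lam^2/2 * E X^2 - lam * E X) for a discrete X >= 0."""
    mean = sum(p * x for x, p in zip(values, probs))
    second_moment = sum(p * x * x for x, p in zip(values, probs))
    log_mgf = math.log(sum(p * math.exp(-lam * x) for x, p in zip(values, probs)))
    return log_mgf, 0.5 * lam ** 2 * second_moment - lam * mean


# Hypothetical distribution: mostly small values, occasionally 10.
values = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
checks = [lower_tail_lemma_sides(values, probs, lam) for lam in (0.01, 0.3, 1.0, 5.0)]
```

For each `lam`, the first entry of the pair should be no larger than the second.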

Exp3: Exponential weights

Theorem: Exp3 with parameter $\eta$ incurs regret
\[ \bar R_n \le \frac{n\eta k}{2} + \frac{\log k}{\eta}. \]
Choosing $\eta = \sqrt{2 \log k / (nk)}$ gives $\bar R_n \le \sqrt{2 n k \log k}$.

(Recall the lower bound $\bar R_n = \Omega(\sqrt{nk})$. This strategy matches it to within a $\sqrt{\log k}$ factor.)

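Exp3: Exponential weights

Proof. Applying the lemma to the estimated losses under $p_t$, for $\lambda > 0$,
\[ \ell_{I_t,t} = \sum_{i=1}^k p_{i,t}\,\tilde\ell_{i,t}
 \le \frac{\lambda}{2} \sum_{i=1}^k p_{i,t}\,\tilde\ell_{i,t}^2
 - \frac{1}{\lambda} \log\Big( \sum_{i=1}^k p_{i,t} \exp(-\lambda \tilde\ell_{i,t}) \Big). \]
For the first term, we can bound the variance by $k$:
\[ \mathbb{E} \sum_{i=1}^k p_{i,t}\,\tilde\ell_{i,t}^2
 = \mathbb{E} \frac{\ell_{I_t,t}^2}{p_{I_t,t}}
 \le \mathbb{E} \frac{1}{p_{I_t,t}} = k. \]
For the second, as in the full information case, for $\lambda = \eta$ and any $j$, the sum over $t$ telescopes and
\[ -\frac{1}{\eta}\, \mathbb{E} \sum_{t=1}^n \log\Big( \sum_{i=1}^k p_{i,t} \exp(-\eta \tilde\ell_{i,t}) \Big)
 \le \frac{\log k}{\eta} + \mathbb{E}\,\tilde L_{j,n}. \]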

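Exp3: Exponential weights

Hence,
\[ \bar R_n = \mathbb{E} \sum_{t=1}^n \ell_{I_t,t} - \min_j \mathbb{E} \sum_{t=1}^n \ell_{j,t}
 = \mathbb{E} \sum_{t=1}^n \ell_{I_t,t} - \min_j \mathbb{E}\,\tilde L_{j,n}
 \le \frac{\eta n k}{2} + \frac{\log k}{\eta}. \]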

Exp3: Exponential weights

Auer, Cesa-Bianchi, Freund and Schapire introduced Exp3 (but with a uniform distribution mixed in, to keep $1/p_{I_t,t}$ bounded). Stoltz showed that the uniform distribution is not necessary, through the sub-gaussian lower tails idea. (See the Readings page on the website.)

Notice that the $\eta$ parameter is set using the time horizon $n$. There are two approaches to avoiding this:

1. Run Exp3 in multiple epochs, doubling $n$ in each epoch. The regret is then no more than the sum of the per-epoch regrets, which grows like the square root of the total time horizon.
2. Set $\eta_t$ on a decreasing schedule: $\eta_t = \sqrt{\log k / (tk)}$. Since $\eta_t$ is decreasing, the telescoping sum becomes a sum that is no more than 0, and the $kn\eta/2$ becomes $k \sum_t \eta_t / 2 = O(\sqrt{nk \log k})$.
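The epoch-doubling approach can be sketched as a schedule plus a sum of per-epoch bounds. This is an illustration under stated assumptions: epochs have nominal lengths 1, 2, 4, ..., Exp3 is restarted in each epoch with $\eta$ tuned for the nominal length $m$ (so the epoch contributes at most $\sqrt{2 m k \log k}$, even if the last epoch is cut short), and the function names are hypothetical.

```python
import math


def doubling_epochs(n):
    """Yield (start, nominal_length, actual_length) for epochs covering rounds 0..n-1."""
    start, length = 0, 1
    while start < n:
        yield start, length, min(length, n - start)
        start += length
        length *= 2


def doubling_regret_bound(n, k):
    """Sum of the per-epoch Exp3 bounds sqrt(2 * m * k * log k) over nominal lengths m."""
    return sum(math.sqrt(2 * m * k * math.log(k)) for _, m, _ in doubling_epochs(n))


n, k = 1000, 5
epochs = list(doubling_epochs(n))
summed_bound = doubling_regret_bound(n, k)
single_run_bound = math.sqrt(2 * n * k * math.log(k))
constant = math.sqrt(2) / (math.sqrt(2) - 1)  # from summing the geometric series
```

Summing the geometric series in $\sqrt{m}$ shows the total is at most `constant * single_run_bound`, so not knowing $n$ costs only a constant factor in this sketch.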

Exp3: Exponential weights

High probability? Variances of the estimated losses are big ($\mathbb{E}\,\tilde\ell_{j,t}^2 = \ell_{j,t}^2 / p_{j,t}$). Mixing in the uniform distribution does not help (for a small regret, the uniform component must be as small as $n^{-1/2}$). Need a new strategy. Instead, use a biased estimate in place of $\tilde\ell_{j,t}$: subtract a small offset so that $\tilde L_{j,t}$ becomes a lower confidence bound on the cumulative expected loss $L_{j,t}$ (or, in the case of gains, an upper confidence bound on the cumulative expected reward).

Log factor? Can eliminate the $\sqrt{\log k}$ factor. One approach is to replace exponential weights with a generalization, mirror descent, with an appropriate potential function.

Other comparisons? Can compete with sequences of actions, with a bounded number of switches.
