Subsampling, Concentration and Multi-armed bandits

1 Subsampling, Concentration and Multi-armed bandits Odalric-Ambrym Maillard, R. Bardenet, S. Mannor, A. Baransi, N. Galichet, J. Pineau, A. Durand Toulouse, November 09, 2015 O-A. Maillard Subsampling and Bandits 1 / 36

2 Roadmap
1 Sub-sampling concentration: 1.1 Hoeffding-Serfling, 1.2 Bernstein-Serfling and 1.3 empirical Bernstein-Serfling bounds.
2 Sub-sampling for stochastic multi-armed bandits: 2.1 "Best empirical sub-sampled arm" strategy, 2.2 Illustrative experiments, 2.3 Cumulative regret bound and extensions.

3 Sub-sampling concentration Introduction "Concentration inequalities for sampling without replacement", Bardenet and Maillard, Bernoulli, 2015.

4 Sub-sampling
$\mathcal{X} = (x_1, \dots, x_N)$, a finite population of $N$ real points.
Sub-sample of size $n \le N$ from $\mathcal{X}$: $X_1, \dots, X_n$ picked uniformly at random without replacement from $\mathcal{X}$.
Simple problem: approximating the population mean $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$.
Concentration for partial sums of $X_1, \dots, X_n$. Careful: dependency.
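The setting above can be illustrated with a minimal sketch. The Gaussian population, its size, and the helper name `subsample_mean` are illustrative assumptions, not part of the slides.

```python
import random

def subsample_mean(population, n, rng):
    """Mean of n points drawn uniformly without replacement from population."""
    sample = rng.sample(population, n)  # random.sample draws without replacement
    return sum(sample) / n

rng = random.Random(0)
# Hypothetical population of N = 1000 real points.
population = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mu = sum(population) / len(population)

# Absolute estimation error for a few sub-sample sizes.
# When n = N the sub-sample is the whole population, so the error vanishes.
errors = {n: abs(subsample_mean(population, n, rng) - mu) for n in (10, 100, 1000)}
```

The draws are dependent (a point cannot be picked twice), which is exactly why the error collapses to zero at n = N, something a with-replacement bound cannot capture.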

5 Hoeffding's reduction lemma
Lemma (Hoeffding, 1963). Let $\mathcal{X} = (x_1, \dots, x_N)$ be a finite population of $N$ real points, let $X_1, \dots, X_n$ denote a random sample without replacement from $\mathcal{X}$, and let $Y_1, \dots, Y_n$ denote a random sample with replacement from $\mathcal{X}$. If $f : \mathbb{R} \to \mathbb{R}$ is continuous and convex, then
$$\mathbb{E} f\Big(\sum_{i=1}^{n} X_i\Big) \le \mathbb{E} f\Big(\sum_{i=1}^{n} Y_i\Big).$$
From sampling with to without replacement: we can thus transfer some results for sampling with replacement to the case of sampling without replacement (via Chernoff).
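The lemma can be checked exactly on a tiny population by enumerating every equally likely ordered draw. This is only a sanity-check sketch: the four-point population and the choice f(s) = s² are illustrative assumptions.

```python
from itertools import permutations, product

def expected_f_of_sum(population, n, f, replace):
    """Exact E f(sum of n draws), enumerating all equally likely ordered draws."""
    N = len(population)
    if replace:
        draws = product(range(N), repeat=n)   # sampling with replacement
    else:
        draws = permutations(range(N), n)     # sampling without replacement
    values = [f(sum(population[i] for i in idx)) for idx in draws]
    return sum(values) / len(values)

population = [0.0, 1.0, 3.0, 10.0]  # hypothetical population
f = lambda s: s * s                 # continuous and convex

without = expected_f_of_sum(population, 2, f, replace=False)
with_repl = expected_f_of_sum(population, 2, f, replace=True)
# Hoeffding's reduction lemma predicts without <= with_repl.
```

Applying the same inequality to f(s) = exp(λs) is what lets Chernoff-type bounds carry over to sampling without replacement.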

6 Comparing bounds on $P\big(n^{-1}\sum_{i=1}^{n} X_i - \mu \ge 10^{-2}\big)$, for a fixed population size $N$
[Figure: probability bound as a function of n, with curves Estimate, Hoeffding, Bernstein, Hoeffding-Serfling and Bernstein-Serfling. Panels: (a) Gaussian N(0, 1), (b) Log-normal ln N(1, 1), (c) Bernoulli B(0.1), (d) Bernoulli B(0.5).]

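7 Serfling's key observation
For $1 \le k \le N$ (considering fictitious $X_{n+1}, \dots, X_N$, and $k \le N-1$ for $Z_k^\star$), define
$$Z_k = \frac{1}{k}\sum_{t=1}^{k}(X_t - \mu) \quad\text{and}\quad Z_k^\star = \frac{1}{N-k}\sum_{t=1}^{k}(X_t - \mu). \qquad (1)$$
Lemma (Serfling, 1974). The following forward martingale structure holds for $\{Z_k^\star\}_{k \le N}$:
$$\mathbb{E}\big[Z_k^\star \mid Z_{k-1}^\star, \dots, Z_1^\star\big] = Z_{k-1}^\star.$$
The following reverse martingale structure holds for $\{Z_k\}_{k \le N}$:
$$\mathbb{E}\big[Z_k \mid Z_{k+1}, \dots, Z_{N-1}\big] = Z_{k+1}.$$
$\Longrightarrow$ Structured dependency.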

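8 A useful result
Theorem (Serfling, 1974). Let $a = \min_{1 \le i \le N} x_i$ and $b = \max_{1 \le i \le N} x_i$. Then for all $\lambda \in \mathbb{R}_+$, it holds that
$$\log \mathbb{E} \exp(\lambda n Z_n) \le \frac{(b-a)^2}{8}\, \lambda^2\, n \Big(1 - \frac{n-1}{N}\Big).$$
Moreover,
$$\log \mathbb{E} \exp\Big(\lambda \max_{1 \le k \le n} Z_k^\star\Big) \le \frac{(b-a)^2}{8}\, \frac{\lambda^2}{(N-n)^2}\, n \Big(1 - \frac{n-1}{N}\Big).$$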

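9 A useful result
Theorem (Bardenet M., 2015). Let $a = \min_{1 \le i \le N} x_i$ and $b = \max_{1 \le i \le N} x_i$. Then for all $\lambda \in \mathbb{R}_+$, it also holds that
$$\log \mathbb{E} \exp(\lambda n Z_n) \le \frac{(b-a)^2}{8}\, \lambda^2\, (n+1) \Big(1 - \frac{n}{N}\Big).$$
Moreover,
$$\log \mathbb{E} \exp\Big(\lambda \max_{1 \le k \le n} Z_k^\star\Big) \le \frac{(b-a)^2}{8}\, \frac{\lambda^2}{(N-n)^2}\, n \Big(1 - \frac{n-1}{N}\Big),$$
$$\log \mathbb{E} \exp\Big(\lambda \max_{n \le k \le N-1} Z_k\Big) \le \frac{(b-a)^2}{8}\, \frac{\lambda^2}{n^2}\, (n+1) \Big(1 - \frac{n}{N}\Big).$$
(Slight) improvement when $n > N/2$.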

10 A slightly improved Hoeffding-Serfling inequality
Trivial corollary:
Corollary (Bardenet M., 2015). For all $n \le N$ and $\delta \in [0, 1]$, with probability higher than $1 - \delta$, it holds that
$$\frac{\sum_{t=1}^{n}(X_t - \mu)}{n} \le (b-a)\sqrt{\frac{\rho_n \log(1/\delta)}{2n}},$$
where we define
$$\rho_n = \begin{cases} 1 - \frac{n-1}{N} & \text{if } n \le N/2, \\ \big(1 - \frac{n}{N}\big)\big(1 + \frac{1}{n}\big) & \text{if } n > N/2. \end{cases} \qquad (2)$$
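The corollary translates directly into a confidence width. The sketch below evaluates ρ_n from (2) and compares the resulting width with the classical Hoeffding width; the function names are illustrative assumptions.

```python
import math

def rho(n, N):
    """rho_n as defined in equation (2)."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

def hoeffding_serfling_width(n, N, a, b, delta):
    """Deviation bound (b - a) * sqrt(rho_n * log(1/delta) / (2n))."""
    return (b - a) * math.sqrt(rho(n, N) * math.log(1.0 / delta) / (2.0 * n))

def hoeffding_width(n, a, b, delta):
    """Classical Hoeffding width, which ignores the finite population size."""
    return (b - a) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

# Width for a sub-sample of 50 points out of 1000, values in [0, 1].
width = hoeffding_serfling_width(50, 1000, 0.0, 1.0, 0.05)
```

Since ρ_n is at most 1 and equals 0 at n = N, the Serfling-type width is never worse than Hoeffding's and vanishes once the whole population has been seen.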

11 Sub-sampling concentration Bernstein-Serfling

12 Towards Bernstein-Serfling's inequality
Let $\sigma^2 = N^{-1}\sum_{i=1}^{N}(x_i - \mu)^2$, and define
$$Q_{k-1}^\star = \frac{1}{N-k+1}\sum_{i=1}^{k-1}\big((X_i - \mu)^2 - \sigma^2\big), \qquad Q_{k+1} = \frac{1}{k+1}\sum_{i=1}^{k+1}\big((X_i - \mu)^2 - \sigma^2\big).$$
Lemma (Bardenet M., 2015).
$$\mathbb{E}\big[(X_k - \mu)^2 \mid Z_1, \dots, Z_{k-1}\big] = \sigma^2 - Q_{k-1}^\star,$$
where the $Z_i$'s are defined in (1). Likewise,
$$\mathbb{E}\big[(X_{k+1} - \mu)^2 \mid Z_{k+1}, \dots, Z_{N-1}\big] = \sigma^2 + Q_{k+1}.$$

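13 A Bernstein-Serfling inequality
Corollary (Bardenet M., 2015). Let $n \le N$ and $\delta \in [0, 1]$. With probability larger than $1 - 2\delta$, it holds that
$$\frac{\sum_{t=1}^{n}(X_t - \mu)}{n} \le \sigma\sqrt{\frac{2\rho_n \log(1/\delta)}{n}} + \frac{\kappa_n (b-a)\log(1/\delta)}{n},$$
where
$$\rho_n = \begin{cases} 1 - f_{n-1} & \text{if } n \le N/2, \\ (1 - f_n)(1 + 1/n) & \text{if } n > N/2, \end{cases} \qquad \kappa_n = \begin{cases} \frac{4}{3} + \sqrt{\frac{f_n}{g_{n-1}}} & \text{if } n \le N/2, \\ \frac{4}{3} + \sqrt{g_{n+1}(1 - f_n)} & \text{if } n > N/2, \end{cases} \qquad (3)$$
with $f_n = n/N$ and $g_n = N/n - 1$.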
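As a rough sketch, the Bernstein-Serfling width can be evaluated numerically. This assumes that κ_n in (3) reads 4/3 + sqrt(f_n / g_{n-1}) for n ≤ N/2 and 4/3 + sqrt(g_{n+1}(1 - f_n)) otherwise; the square roots are a reading of the garbled slide, and the function names are hypothetical.

```python
import math

def rho(n, N):
    """rho_n: 1 - f_{n-1} if n <= N/2, else (1 - f_n)(1 + 1/n), with f_n = n/N."""
    if n <= N / 2:
        return 1.0 - (n - 1) / N
    return (1.0 - n / N) * (1.0 + 1.0 / n)

def kappa(n, N):
    """kappa_n under the assumed reading of (3), with g_n = N/n - 1."""
    if n <= N / 2:
        # f_n / g_{n-1} simplifies to n(n-1) / (N(N-n+1)); this form avoids
        # dividing by zero at n = 1, where g_0 is infinite and the ratio is 0.
        ratio = n * (n - 1) / (N * (N - n + 1))
        return 4.0 / 3.0 + math.sqrt(ratio)
    g_next = N / (n + 1) - 1.0  # g_{n+1}
    return 4.0 / 3.0 + math.sqrt(max(0.0, g_next * (1.0 - n / N)))

def bernstein_serfling_width(n, N, a, b, sigma, delta):
    """sigma * sqrt(2 rho_n log(1/delta) / n) + kappa_n (b - a) log(1/delta) / n."""
    log_term = math.log(1.0 / delta)
    return (sigma * math.sqrt(2.0 * rho(n, N) * log_term / n)
            + kappa(n, N) * (b - a) * log_term / n)

# Low-variance population (sigma = 0.1) on [0, 1], 50 points out of 1000.
width = bernstein_serfling_width(50, 1000, 0.0, 1.0, 0.1, 0.05)
```

When σ is small relative to b - a, the first term shrinks accordingly and the width is driven by the faster-decaying 1/n term.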

14 A Bernstein-Serfling inequality
Improvement over Bernstein: the factor $\rho_n$ can give dramatic improvement.
Proof elements:
Self-bounded property of the variance: study of $Z = \frac{1}{(b-a)^2}\sum_{i=1}^{n}(X_i - \mu)^2$ (cf. Maurer and Pontil, 2006; via tensorization inequality for the entropy).
Hoeffding's reduction lemma.

15 Towards an empirical Bernstein-Serfling inequality
$$\hat\sigma_n^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \hat\mu_n)^2 = \frac{1}{2n^2}\sum_{i,j=1}^{n}(X_i - X_j)^2, \quad\text{where } \hat\mu_n = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
Lemma (Bardenet M., 2015). When sampling without replacement from a finite population $\mathcal{X} = (x_1, \dots, x_N)$ of size $N$, with range $[a, b]$ and variance $\sigma^2$, the empirical variance $\hat\sigma_n^2$ using $n < N$ samples satisfies
$$P\Big(\sigma \ge \hat\sigma_n + (b-a)\big(1 + \sqrt{1 + \rho_n}\big)\sqrt{\frac{\log(3/\delta)}{2n}}\Big) \le \delta.$$
Possible improvement
Conjecture: replace $\big(1 + \sqrt{1 + \rho_n}\big)$ with $\sqrt{4\rho_n}$.
Difficulty: concentration for self-bounded random variables when sampling without replacement.
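The two expressions for the empirical variance, the centered form and the pairwise-difference form, can be compared numerically. This is a small sketch with hypothetical function names and an arbitrary sample.

```python
def empirical_variance(xs):
    """(1/n) * sum_i (x_i - mean)^2."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def pairwise_variance(xs):
    """(1/(2 n^2)) * sum_{i,j} (x_i - x_j)^2, the U-statistic-like form."""
    n = len(xs)
    return sum((x - y) ** 2 for x in xs for y in xs) / (2 * n * n)

sample = [1.0, 2.0, 4.0, 7.0]  # arbitrary illustrative sample
centered = empirical_variance(sample)
pairwise = pairwise_variance(sample)
```

The pairwise form makes it explicit that the empirical variance does not depend on the unknown mean, which is what the self-bounding argument exploits.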

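16 An empirical Bernstein-Serfling inequality
Corollary (Bardenet M., 2015). For all $\delta \in [0, 1]$, with probability larger than $1 - 5\delta$, it holds that
$$\frac{\sum_{t=1}^{n}(X_t - \mu)}{n} \le \hat\sigma_n\sqrt{\frac{2\rho_n \log(1/\delta)}{n}} + \frac{\kappa (b-a)\log(1/\delta)}{n},$$
where we recall the definition of $\rho_n$,
$$\rho_n = \begin{cases} 1 - \frac{n-1}{N} & \text{if } n \le N/2, \\ \big(1 - \frac{n}{N}\big)\big(1 + \frac{1}{n}\big) & \text{if } n > N/2, \end{cases}$$
and $\kappa$ is a numerical constant.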

17 Serfling bounds
[Figure: inverted bound as a function of n, with curves Hoeffding-Serfling, Bernstein-Serfling and Empirical Bernstein-Serfling. Panels: (e) Gaussian N(0, 1), (f) Log-normal ln N(1, 1), (g) Bernoulli B(0.1), (h) Bernoulli B(0.5).]

18 Sub-sampling recap
What we did: improved Hoeffding-Serfling bound, new Bernstein-Serfling and empirical Bernstein-Serfling bounds. Improvement over Hoeffding's reduction due to $\rho_n$.
Improvement/Open question: tensorization inequality for the entropy in the case of sampling without replacement? This would lead to $\big(1 + \sqrt{1 + \rho_n}\big)$ being replaced with $\sqrt{4\rho_n}$.

19 Sub-sampling Bandits Introduction "Sub-sampling for multi-armed bandits", Baransi, Maillard, Mannor ECML, 2014.

20 Stochastic multi-armed bandit setting
Setting
Set of choices $\mathcal{A}$. Each $a \in \mathcal{A}$ is associated with an unknown probability distribution $\nu_a \in \mathcal{D}$ with mean $\mu_a$.
At each round $t = 1, \dots, T$ the player first picks an arm $A_t \in \mathcal{A}$ based on past observations, then receives (and sees) a stochastic payoff $X_t \sim \nu_{A_t}$.
Goal and performance
Minimize the regret at round $T$:
$$R_T \overset{\text{def}}{=} \mathbb{E}\Big[T\mu_\star - \sum_{t=1}^{T} X_t\Big] = \sum_{a \in \mathcal{A}} (\mu_\star - \mu_a)\, \mathbb{E}\big[N^\pi_{T,a}\big],$$
where $\mu_\star = \max\{\mu_a ; a \in \mathcal{A}\}$, $a_\star \in \arg\max\{\mu_a ; a \in \mathcal{A}\}$ and $N^\pi_{T,a} = \sum_{t=1}^{T} \mathbb{I}\{A_t = a\}$.
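The decomposition of the regret over arms can be sketched as a one-line computation from pull counts. The function name and the two-arm numbers below are illustrative assumptions; for a single run it gives the pseudo-regret, whose expectation over runs is R_T.

```python
def pseudo_regret(means, pull_counts):
    """Sum over arms of (mu_star - mu_a) * N_{T,a} for one run."""
    best = max(means)
    return sum((best - m) * c for m, c in zip(means, pull_counts))

# Hypothetical run: T = 100 rounds, the sub-optimal arm was pulled 10 times.
regret = pseudo_regret([0.5, 0.4], [90, 10])
```

Only pulls of sub-optimal arms contribute, so controlling the regret amounts to controlling how often each sub-optimal arm is played.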

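21 Lower performance bound
Theorem (Burnetas and Katehakis, 1996). For any strategy $\pi$ that is consistent (for any bandit, sub-optimal arm $a$ and $\beta > 0$, it holds that $\mathbb{E}[N^\pi_{T,a}] = o(T^\beta)$), and $\mathcal{D} \subset \mathcal{P}([0, 1])$,
$$\liminf_{T \to \infty} \frac{R_T}{\log T} \ge \sum_{a : \Delta_a > 0} \frac{\mu_\star - \mu_a}{\mathcal{K}_{\inf}(\nu_a, \mu_\star)},$$
where $\Delta_a = \mu_\star - \mu_a$ and $\mathcal{K}_{\inf}(\nu_a, \mu_\star) \overset{\text{def}}{=} \inf\{\mathrm{KL}(\nu_a \,\|\, \nu) : \nu \in \mathcal{D} \text{ has mean} > \mu_\star\}$.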
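For intuition, the constant in front of log T can be computed when D is the class of Bernoulli distributions, where K_inf(ν_a, µ*) reduces to the Bernoulli KL divergence kl(µ_a, µ*). The sketch below assumes that Bernoulli setting; the arm means mirror the ten-arm Bernoulli configuration used in the experiments, and the function names are hypothetical.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    def term(x, y):
        # Convention 0 * log(0 / y) = 0.
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return term(p, q) + term(1.0 - p, 1.0 - q)

def lower_bound_constant(means):
    """Sum over sub-optimal arms of (mu_star - mu_a) / kl(mu_a, mu_star)."""
    best = max(means)
    return sum((best - m) / bernoulli_kl(m, best) for m in means if m < best)

# One arm at 0.1, then three arms each at 0.05, 0.02 and 0.01.
means = [0.1] + [0.05] * 3 + [0.02] * 3 + [0.01] * 3
constant = lower_bound_constant(means)
```

An algorithm is called optimal in this sense when its regret divided by log T converges to this constant.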

22 Optimality
Class of optimal algorithms
Confidence bound: e.g. KL-UCB (Lai-Robbins, 1985).
Bayesian: e.g. Thompson Sampling (Thompson, 1933).
Sub-sampling?
Provably optimal finite-time regret for some $\mathcal{D}$: discrete, or exponential families of dimension 1.
They need to know $\mathcal{D}$ in order to be optimal: a different algorithm for each $\mathcal{D}$ (TS or KL-UCB for Bernoulli, for Poisson, for Exponential, etc.).

23 Puzzling experiments (T = 20,000; 50,000 replicates)
10 Bernoulli arms with means (0.1, 3×0.05, 3×0.02, 3×0.01)
Algorithms: BESA, kl-ucb, kl-ucb+, TS, Others
Regret: [values missing]
Beat BESA: -, 1.6%, 35.4%, 3.1%
Run time: 13.9X, 2.8X, 3.1X, X
[Figure: four panels (BESA, KLUCB, KLUCB+, Thompson), regret vs. time (×10^4).]
Others: UCB, Moss, UCB-Tuned, DMED, UCB-V. (Credit: Akram Baransi)

24 Puzzling experiments (T = 20,000; 50,000 replicates)
Exponential arms with parameters (1/5, 1/4, 1/3, 1/2, 1)
Algorithms: BESA, KL-UCB-exp, UCB-tuned, FTL, Others
Regret: [values missing, except a trailing "…,120+"]
Beat BESA: -, 5.7%, 4.3%, -
Run time: 6X, 2.8X, X
[Figure: four panels (BESA, BESAT, KLUCBexp, UCBtuned), regret vs. time (×10^4).]
Others: UCB, Moss, kl-ucb, UCB-V. (Credit: Akram Baransi)

25 Puzzling experiments (T = 20,000; 50,000 replicates)
Poisson arms with parameters $\{i/3\}_{i=1,\dots,6}$
Algorithms: BESA, KL-UCB-Poisson, kl-ucb, FTL
Regret: [values missing]
Beat BESA: -, 4.1%, 0.7%, -
Run time: 3.5X, 1.2X, X
[Figure: four panels (BESA, BESAT, KLUCBpoisson, KLUCB), regret vs. time (×10^4).]
(Credit: Akram Baransi)

26 Puzzling experiments (T = 20,000; 50,000 replicates)
Bernoulli arms, all half but one
Algorithms: BESA, KL-UCB, KL-UCB+, TS
Regret: [values missing]
Beat BESA: [missing]%, 41.6%, 40.8%
Run time: 19.6X, 2.8X, 3X, X
[Figure: four panels (BESA, KLUCB, KLUCB+, Thompson), regret vs. time (×10^4).]
(Credit: Akram Baransi)

27 A puzzling strategy
BESA
Competitive regret against state-of-the-art for various $\mathcal{D}$.
Same algorithm for all $\mathcal{D}$.
Not relying on upper confidence bounds, not Bayesian...
...and extremely simple to implement.
Questions
How is this possible? Can we prove optimality? For which distributions is it optimal?

28 Sub-sampling Bandits Best Empirical Sub-sampling Average

29 Go back to "Follow the leader"
FTL
1: Play each arm once.
2: At time $t$, define $\tilde\mu_{t,a} = \hat\mu(X^a_{1:N_{t,a}})$ for all $a \in \mathcal{A}$, where $\hat\mu(\mathcal{X})$ is the empirical average of population $\mathcal{X}$ and $X^a_{1:N_{t,a}} = \{X_s : A_s = a, s \le t\}$.
3: Choose (break ties in favor of the smallest $N_t$) $A_t = \arg\max_{a' \in \{a, b\}} \tilde\mu_{t,a'}$.
Properties
Generally bad: linear regret.
A variant (ε-greedy) performs ok if well-tuned (Auer et al, 2002).
Optimal for very specific distributions (e.g. deterministic).
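The FTL steps above can be sketched as follows for Bernoulli arms, together with a small experiment showing how it can lock onto the wrong arm. The arm means, horizon and number of runs are illustrative assumptions.

```python
import random

def follow_the_leader(means, horizon, rng):
    """Run FTL on Bernoulli arms; return the number of pulls of each arm."""
    K = len(means)
    sums = [0.0] * K
    counts = [0] * K
    for t in range(horizon):
        if t < K:
            arm = t  # step 1: play each arm once
        else:
            # steps 2-3: highest empirical mean, ties go to the least-pulled arm
            arm = max(range(K), key=lambda a: (sums[a] / counts[a], -counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        sums[arm] += reward
        counts[arm] += 1
    return counts

# If the best arm (mean 0.9) starts with a 0 and the other arm (mean 0.5)
# starts with a 1, the best arm keeps empirical mean 0 and is never played again.
rng = random.Random(0)
runs = [follow_the_leader([0.9, 0.5], 200, rng) for _ in range(500)]
stuck_fraction = sum(1 for c in runs if c[0] == 1) / len(runs)
```

A constant fraction of runs getting stuck is what produces linear regret in expectation.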

30 Follow the FAIR leader (aka BESA)
Compare two arms based on "equal opportunity", i.e. the same number of observations.
BESA at time $t$ for two arms $a, b$:
1: Sample $I_t^a \sim \mathrm{Wr}(N_{t,a}; N_{t,b})$ and $I_t^b \sim \mathrm{Wr}(N_{t,b}; N_{t,a})$. Here $\mathrm{Wr}(N; n)$ samples $n$ points from $\{1, \dots, N\}$ without replacement (and returns the whole set if $n \ge N$).
2: Define $\tilde\mu_{t,a} = \hat\mu(X^a_{1:N_{t,a}}(I_t^a))$ and $\tilde\mu_{t,b} = \hat\mu(X^b_{1:N_{t,b}}(I_t^b))$.
3: Choose (break ties in favor of the smallest $N_t$) $A_t = \arg\max_{a' \in \{a, b\}} \tilde\mu_{t,a'}$.
Questions
Why does it work? When can we prove $\log(T)$ regret? Optimality? When does it fail?

31 Follow the FAIR leader (aka BESA): example
Ex: $N_{t,a} = 3$, $N_{t,b} = 10$. Then $I_t^a = \{1, 2, 3\}$ and $I_t^b$ is a set of size $|I_t^b| = 3$, sampled without replacement from $\{1, \dots, 10\}$.
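The two-arm BESA rule can be sketched in a few lines. This is a minimal illustration on Bernoulli arms, not the authors' implementation; the function names, arm means and horizon are assumptions.

```python
import random

def besa_choose(obs_a, obs_b, rng):
    """Return 0 (arm a) or 1 (arm b) following the two-arm BESA rule."""
    # Both arms are compared on the same number of observations: the history
    # of the least-pulled arm is used in full, the other one is sub-sampled
    # without replacement down to that size.
    n = min(len(obs_a), len(obs_b))
    mean_a = sum(rng.sample(obs_a, n)) / n
    mean_b = sum(rng.sample(obs_b, n)) / n
    if mean_a != mean_b:
        return 0 if mean_a > mean_b else 1
    # Ties are broken in favor of the least-pulled arm.
    return 0 if len(obs_a) <= len(obs_b) else 1

def run_besa(means, horizon, rng):
    """Play two Bernoulli arms with BESA (horizon >= 2); return pull counts."""
    history = ([], [])
    for t in range(horizon):
        # Play each arm once, then apply the BESA rule.
        arm = t if t < 2 else besa_choose(history[0], history[1], rng)
        reward = 1.0 if rng.random() < means[arm] else 0.0
        history[arm].append(reward)
    return [len(history[0]), len(history[1])]

counts = run_besa([0.4, 0.6], 2000, random.Random(0))
```

Unlike FTL, the arm with the longer history is re-sub-sampled at every round, so an unlucky start does not freeze the comparison.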

32 Intuition
Assume $\mu_b > \mu_a$, $N_{t,a} = n_a$, $N_{t,b} = n_b$ with $n_a > n_b$.
The probability of making one mistake is approximately
$$P\Big[\hat\mu\big(X^a_{1:n_a}(I(n_a; n_b))\big) > \hat\mu\big(X^b_{1:n_b}\big)\Big], \qquad (4)$$
where $I(n_a; n_b) \sim \mathrm{Wr}(n_a; n_b)$.
The probability of making $M$ consecutive mistakes is essentially
$$P\Big[\forall m \in [M],\ \hat\mu\big(X^a_{1:n_a^{(m)}}(I_m(n_a^{(m)}; n_b))\big) > \hat\mu\big(X^b_{1:n_b}\big)\Big], \qquad (5)$$
where for all $m \le M$, $I_m(n_a; n_b) \sim \mathrm{Wr}(n_a; n_b)$ and $n_a^{(m)} = n_a + m - 1$.
For deterministic $n_a, n_b$: (4) decreases with $e^{-2 n_b (\mu_b - \mu_a)^2}$, and (5) with $e^{-2 n_b M (\mu_b - \mu_a)^2}$, where in this exponent $M$ is the number of non-overlapping sub-samples (independent chunks).
Exponential decay of the probability of consecutive mistakes.

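33 Regret bound (slightly simplified statement)
Let $\mathcal{A} = \{\star, a\}$ and define
$$\alpha(M, n) = \mathbb{E}_{Z_\star \sim \nu_{\star,n}}\Big[\Big(P_{Z \sim \nu_{a,n}}(Z > Z_\star) + \tfrac{1}{2} P_{Z \sim \nu_{a,n}}(Z = Z_\star)\Big)^{M}\Big].$$
Theorem (Regret of the BESA strategy). If there exist $\alpha \in (0, 1)$ and $c > 0$ such that $\alpha(M, 1) \le c\,\alpha^M$, then
$$R_T \le \frac{11 \log(T)}{\mu_\star - \mu_a} + C_{\nu_a, \nu_\star} + O(1),$$
where $C_{\nu_a, \nu_\star}$ depends on the problem, but not on $T$.
Example. Bernoulli $\mu_a, \mu_\star$: $\alpha(M, 1) = O\Big(\big(\mu_a + \tfrac{1 - \mu_a}{2}\big)^M\Big)$.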
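For Bernoulli arms and n = 1, α(M, 1) has a closed form that can be evaluated directly. The sketch below assumes that the definition of α weights ties P(Z = Z*) by 1/2, which is a reading of the garbled slide; the function name is hypothetical.

```python
def alpha_bernoulli(M, mu_a, mu_star):
    """alpha(M, 1) for Bernoulli arms, assuming ties are weighted by 1/2."""
    # Case Z* = 1 (probability mu_star): P(Z > 1) = 0 and P(Z = 1) = mu_a.
    term_one = mu_star * (0.5 * mu_a) ** M
    # Case Z* = 0 (probability 1 - mu_star): P(Z > 0) = mu_a, P(Z = 0) = 1 - mu_a.
    term_zero = (1.0 - mu_star) * (mu_a + 0.5 * (1.0 - mu_a)) ** M
    return term_one + term_zero

# Geometric decay in M, at rate mu_a + (1 - mu_a) / 2 = 0.7 for mu_a = 0.4.
values = [alpha_bernoulli(M, 0.4, 0.6) for M in (1, 5, 10, 20)]
```

Under this reading, the condition of the theorem holds for Bernoulli arms with α = (1 + µ_a)/2 and c = 1.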

34 Failure of the BESA strategy
Uniform: $X_a \sim \mathcal{U}([0.2, 0.4])$, $X_\star \sim \mathcal{U}([0, 1])$: $\alpha(M, n) \to 0.2^n$ as $M \to \infty$.
Consider BESA with initial number of pulls $n_0 = 0, \dots$
BESA variants: $n_0$ = 0, 3, 7, 8, 9, 10. Regret: [values missing]
Competitors: UCB, kl-ucb, TS, FTL ($n_0 = 10$). Regret: [values missing]
Beat BESA $n_0$ = [missing]: [missing]%, 24.3%, 24.7%, -
Beat BESA $n_0$ = 3: 7.3%, 7.3%, 7.8%, -
Beat BESA $n_0$ = 7: 1.6%, 1.6%, 1.8%, -
Beat BESA $n_0$ = [missing]: [missing]%, 0.6%, 0.7%, -
(Credit: Akram Baransi)

35 Regret performance of BESA
Theorem (Regret of the BESA strategy). If there exist $\alpha \in (0, 1)$ and $c > 0$ such that $\alpha(M, 1) \le c\,\alpha^M$, then
$$R_T \le \frac{11 \log(T)}{\mu_\star - \mu_a} + C_{\nu_a, \nu_\star} + O(1),$$
where $C_{\nu_a, \nu_\star}$ depends on the problem, but not on $T$.
If there exist $\beta \in (0, 1)$ and $c > 0$ such that $\alpha(1, n) \le c\,\beta^n$, then BESA initialized with a number $n_{0,T}$ of pulls of each arm, of order $\ln(T)/\ln(1/\beta)$, gets
$$R_T \le \frac{11 \log(T)}{\mu_\star - \mu_a} + n_{0,T} + C_{\nu_a, \nu_\star} + O(1).$$
Key points
First condition holds for a large class: extends FTL.
Initial number of pulls is less elegant. Alternatively: mixing with uniform, like ε-greedy.

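36 Sketch of regret analysis
1 Basic concentration gives a deviation of order $\log(t)^{1/2} / \min\{N_{t,\star}, N_{t,a}\}^{1/2}$. Writing $\Delta = \mu_\star - \mu_a$:
w.h.p., if $N_{t,a} < N_{t,\star}$, then $N_{t,a} \lesssim \log(t)/\Delta^2$;
w.h.p., if $N_{t,a} > N_{t,\star}$, then $N_{t,\star} \lesssim \log(t)/\Delta^2 \overset{\text{def}}{=} u_t$.
Thus: on the event $N_{t,\star} > u_t$, we must have $N_{t,a} \le u_t$ w.h.p.
2 Show that $N_{t,\star} > u_t$ w.h.p. (like for Thompson Sampling).
Let $\tau_j^\star$ be the delay between the $j$-th time $t_j$ and the $(j+1)$-th time we play $\star$. Then
$$P\big[N_{t,\star} \le u_t\big] \le \sum_{j=1}^{u_t} P\big[\tau_j^\star \ge \underbrace{t/u_t - 1}_{l_t}\big] \le \sum_{j=1}^{u_t} P\big[\forall s \in \{0, \dots, \lfloor l_t/2 \rfloor, \lceil l_t/2 \rceil, \dots, l_t\}:\ a_{t_j+s} = a\big]$$
$$\le \sum_{j=1}^{u_t} P\big[\forall s \in \{\lceil l_t/2 \rceil, \dots, l_t\}:\ a_{t_j+s} = a \ \cap\ N_{t_j+s,a} > \underbrace{l_t/2}_{\ge u_t \ge j \text{ for } t \ge c}\big].$$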

37 Sketch of regret analysis (continued)
For $t \ge c$, the last sum is in turn bounded by
$$\sum_{j=1}^{u_t} P\big[\forall s \in \{\lceil l_t/2 \rceil, \dots, l_t\}:\ a_{t_j+s} = a \ \cap\ N_{t_j+s,a} > \underbrace{N_{t_j+s,\star}}_{= j}\big].$$

38 BESA - Recap
Optimality and near-optimality regions in $\mathcal{P}([0, 1])$?
Properties
Flexible: doesn't need the class of distribution, nor the support.
We can prove $\log(T)$ regret for certain classes.
Optimality (constants) unknown yet, but we are close.
Exhibit cases when it fails: why, and how to repair it.

40 Thank you
If you want to prove "adaptive" optimality of this strategy, or extend it to contextual bandits, adversarial bandits, MDPs: come work with me!
