Informational Confidence Bounds for Self-Normalized Averages and Applications

Size: px

Start display at page:

Download "Informational Confidence Bounds for Self-Normalized Averages and Applications"

Rosaline Dorsey
6 years ago
Views:

1 Informational Confidence Bounds for Self-Normalized Averages and Applications Aurélien Garivier Institut de Mathématiques de Toulouse - Université Paul Sabatier Thursday, September 12th 2013

2 Context Tree Estimation Outline 1 Two Different Problems Context Tree Estimation Stochastic Bandit Problems 2 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Beyond the Sub-gaussian Case

3 Context Tree Estimation Variable Length Memory Linguistics, lossless compression: t r y i n g _ v a n i l l a _ q u i e t Music, biology, genomics... = memory structure as fingerprint of the source. Example: Brazialian and European Portuguese.

4 Context Tree Estimation Context Tree Sources A Context Tree Source (CTS) or Variable Length Markov Chain (VLMC) is a Markov Chain whose order may vary with the value of the past. T = {1, 10, 100, 000} P(X 4 1 = X 0 1 = 10) = P(X 1 = 0 X 0 1 = 10) P(X 2 = 0 X 1 1 = 100) P(X 3 = 1 X 2 1 = 1000) P(X 4 = 1 X 3 1 = 10001) P(X 5 = 0 X 4 1 = ) (1/5, 4/5) (1/3, 2/3) 0 1 (3/4, 1/4) (2/3, 1/3)

5 Context Tree Estimation Context Tree Sources A Context Tree Source (CTS) or Variable Length Markov Chain (VLMC) is a Markov Chain whose order may vary with the value of the past. T = {1, 10, 100, 000} P(X 4 1 = X 0 1 = 10) 0 1 = P(X 1 = 0 X 1 0 = 10) 3/4 P(X 2 = 0 X 1 1 = 100) 0 1 P(X 3 = 1 X 1 2 = 1000) P(X 4 = 1 X 1 3 = 10001) (3/4, 1/4) 0 1 P(X 5 = 0 X 1 4 = ) (1/5, 4/5) (1/3, 2/3) (2/3, 1/3)

6 Context Tree Estimation Context Tree Sources A Context Tree Source (CTS) or Variable Length Markov Chain (VLMC) is a Markov Chain whose order may vary with the value of the past. T = {1, 10, 100, 000} P(X 4 1 = X 0 1 = 10) = P(X 1 = 0 X 1 0 = 10) 3/4 P(X 2 = 0 X 1 1 = 100) 1/3 P(X 3 = 1 X 1 2 = 1000) P(X 4 = 1 X 1 3 = 10001) 0 1 (1/5, 4/5) 1/3 P(X 5 = 0 X 1 4 = ) 0 1 (, 2/3) 0 1 (3/4, 1/4) (2/3, 1/3)

7 Context Tree Estimation Context Tree Sources A Context Tree Source (CTS) or Variable Length Markov Chain (VLMC) is a Markov Chain whose order may vary with the value of the past. T = {1, 10, 100, 000} P(X 4 1 = X 0 1 = 10) = P(X 1 = 0 X 1 0 = 10) 3/4 P(X 2 = 0 X 1 1 = 100) 1/3 P(X 3 = 1 X 1 2 = 1000) 4/5 P(X 4 = 1 X 1 3 = 10001) (3/4, 1/4) (2/3, 1/3) P(X 5 = 0 X 4 1 = ) 4/5 (1/5, ) (1/3, 2/3)

8 Context Tree Estimation Context Tree Sources A Context Tree Source (CTS) or Variable Length Markov Chain (VLMC) is a Markov Chain whose order may vary with the value of the past. T = {1, 10, 100, 000} P(X 4 1 = X 0 1 = 10) = P(X 1 = 0 X 1 0 = 10) 3/4 P(X 2 = 0 X 1 1 = 100) 1/3 P(X 3 = 1 X 1 2 = 1000) 4/5 P(X 4 = 1 X 1 3 = 10001) 1/3 P(X 5 = 0 X 1 4 = ) (1/5, 4/5) (1/3, 2/3) 0 1 (3/4, 1/4) (2/3, 1/3)

9 Context Tree Estimation Context Tree Sources A Context Tree Source (CTS) or Variable Length Markov Chain (VLMC) is a Markov Chain whose order may vary with the value of the past. T = {1, 10, 100, 000} P(X 4 1 = X 0 1 = 10) = P(X 1 = 0 X 1 0 = 10) 3/4 P(X 2 = 0 X 1 1 = 100) 1/3 P(X 3 = 1 X 1 2 = 1000) 4/5 P(X 4 = 1 X 1 3 = 10001) 1/3 P(X 5 = 0 X 1 4 = ) 2/ (1/5, 4/5) (1/3, 2/3) 0 1 (3/4, 1/4) ( 2/3, 1/3)

10 Context Tree Estimation The Context Algorithm (Rissanen 81) Same principle as CART algorithm. For every possible node s A, compute a distortion such as: δ(s) = max ˆP ( s), ˆP ( as). a A Keep the nodes s A such that δ(s) ɛ(s, n) and their ancestors as internal nodes of the estimated tree ˆT C.

11 Context Tree Estimation Penalized Maximum Likelihood Estimator T P ML = arg max T log ˆP T (x n 1 x 0 ) + pen(n, T ), where pen(n, T ) = penalty function, grows with n and T. MDL - BIC Penalty pen(n, T ) = T ( A 1) 2 log n. Effective computation: recursive algorithm Context c (n1) (n2) (n3) =min(l(n1,n2,n3), c1+c2+c3+pen) Tree Maximization c1 c2 c3

12 Context Tree Estimation Under- and Over-estimation Two possible types of estimation errors: 1 under-estimation: s T 0 : s / ˆT = easily avoided (large deviation regime) at exponential rate 2 over-estimation: s ˆT : s / T 0 = more delicate, no exponential rates Asymptotic consistency results: [Bühlmann& Wyner 99, Csiszar&Talata 06, G. 06]

13 Context Tree Estimation Non asymptotic study [G. & Leonardi 11] For both algorithms, need to control the distance between ˆP t ( s) and P ( s). = the number N s (t) of summands is random: deviations of Z t = 1 t (1 N s (t) {Xu=a P (a s)})1 {X u 1 u s =s} u=

14 Context Tree Estimation Non asymptotic study [G. & Leonardi 11] For both algorithms, need to control the distance between ˆP t ( s) and P ( s). = the number N s (t) of summands is random: deviations of Z t = 1 t (1 N s (t) {Xu=a P (a s)})1 {X u 1 u s =s} u=1 The right distance to consider is: ( ) KL ˆPt ( s), P ( s). = the quantity of interest is: ( ) W t = N s (t) KL ˆPt ( s); P ( s).

15 Stochastic Bandit Problems Outline 1 Two Different Problems Context Tree Estimation Stochastic Bandit Problems 2 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Beyond the Sub-gaussian Case

16 Stochastic Bandit Problems Example: sequential clinical trials patients visit the medical center one after another for a given disease they are prescribed one of the (say) 5 treatments available the treatments are not equally efficient but nobody knows which one is the best: they only observe the effect of the prescribed treatment on each patient What is the best allocation strategy? Prescriptions may be chosen using only the previous observations The goal is not to estimate each treatment s efficiency precisely, but to heal as many patients as possible

17 Stochastic Bandit Problems The (stochastic) Multi-Armed Bandit Model Environment K arms with parameters θ = (θ 1,..., θ K ) such that for any possible choice of arm a t {1,..., K} at time t, one receives the reward X t = X at,t where, for any 1 a K and s 1, X a,s ν a, and the (X a,s ) a,s are independent. Reward distributions ν a F a parametric or not. Example Bernoulli rewards: θ [0, 1] K, ν a = B(θ a ) Strategy The agent s actions follow a dynamical strategy π = (π 1, π 2,... ) such that A t = π t (X 1,..., X t 1 )

Stochastic Bandit Problems Real challenges Randomized clinical trials original motivation since the 1930 s dynamic strategies can save resources Recommender systems: advertisement website

18 Stochastic Bandit Problems Real challenges Randomized clinical trials original motivation since the 1930 s dynamic strategies can save resources Recommender systems: advertisement website optimization news, blog posts,... Computer experiments large systems can be simulated in order to optimize some criterion over a set of parameters but the simulation cost may be high, so that only few choices are possible for the parameters Games and planning (tree-structured options)

19 Stochastic Bandit Problems Performance Evaluation, Regret Cumulated Reward S T = T t=1 X t Our goal Choose π so as to maximize E [S T ] = = T t=1 a=1 K E [ E [X t 1{A t = a} X 1,..., X t 1 ] ] K µ a E [Na π (T )] a=1 where N π a (T ) = t T 1{A t = a} is the number of draws of arm a up to time T, and µ a = E(ν a ). Regret Minimization equivalent to minimizing R T = T µ E [S T ] = (µ µ a )E [Na π (T )] a:µ a<µ where µ max{µ a : 1 a K}

20 Stochastic Bandit Problems Upper Confidence Bound Strategies UCB [Lai&Robins 85; Agrawal 95; Auer&al 02] Construct an upper confidence bound for the expected reward of each arm: S a (t) c log(t) u a (t) = + N a (t) 2N }{{} a (t) }{{} estimated reward exploration bonus Choose the arm with the highest UCB It is an index strategy [Gittins 79], easily interpretable and intuitively appealing.

21 Stochastic Bandit Problems Performance of the UCB algorithm Non-asymptotic regret bound [Auer, Cesa-Bianchi, Freund and Schapire 02] N a (T ) 16 log(t ) 2(µ µ a ) The lower-bound for Bernoulli arms is N a (T ) log(t ) ( ) 1 o(1) kl(µ a, µ ) where kl(p, q) = p log p 1 p q + (1 p) log 1 q. Thinking of Pinsker s inequality kl(µ a, µ ) 2(µ µ a ) 2 : = remove factor 16; = replace 2(µ µ a ) 2 by kl(µ a, µ ).

22 Stochastic Bandit Problems How to construct the UCB? The analysis shows that: u a (t) must upper-bound µ a with probability 1 1/t for all a and t T. Better UCB = better regret (both in theory and in practice) UCB algorithm: u a (t) = S a(t) N a (t) + c log(t) 2N a (t) Reminiscent of Hoeffding s inequality (c = 1) but the number of terms N a (t) is random.

23 Confidence Bounds for Self-Normalized Averages In both examples: We have a large total number of observations but they are splitted into sub-samples of (possibly small) random size we need to confidence regions for the parameters on each sub-sample the diameter of the confidence regions may be data-driven = Goal: do as if the number of obersations where not random

24 Confidence Bounds for Self-Normalized Averages In both examples: We have a large total number of observations but they are splitted into sub-samples of (possibly small) random size we need to confidence regions for the parameters on each sub-sample the diameter of the confidence regions may be data-driven WARNING: Law of Iterated Logarithm! = Goal: do as if the number of obersations where not random

25 Confidence Bounds for Self-Normalized Averages In both examples: We have a large total number of observations but they are splitted into sub-samples of (possibly small) random size we need to confidence regions for the parameters on each sub-sample the diameter of the confidence regions may be data-driven WARNING: Law of Iterated Logarithm! = Goal: do almost as if the number of obersations where not random

26 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Outline 1 Two Different Problems Context Tree Estimation Stochastic Bandit Problems 2 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Beyond the Sub-gaussian Case

27 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Simple framework Observations: (X t ) t iid with expectation µ. Optionnal skipping: ɛ t is {0, 1}-valued and σ ( X t,..., X t 1 ) -measurable. Total number of observations: N t = t s=1 ɛ t. Estimator of µ: ˆµ(t) = Nt 1 n k=1 ɛ tx t. Cumulated deviation: S t = n k=1 ɛ t( Xt µ ) satisfies λ, [ E e λst Ntφ(λ)] 1 where φ(λ) = log E [ e λx 1]. Sub-gaussian case: φ(λ) = σ 2 λ 2 /2. Can be generalized: X t martingale increments,....

28 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Idea 0: Plain union bound Union bound (sub-gaussian case) With probability at least 1 δ, ˆµ(t) µ σ Proof: Union bound + Chernoff P ( S t > σ 2N t log 2t δ 2 log 2t δ N t. ) ( t = P {e λns t > e σλn n=1 t e σλn 2n log 2t δ + σ2 λ 2 n n 2 n=1 t n=1 δ 2t 2N t log 2t δ for the choice λ n = 1 σ } ) {N t = n} 2 log 2t δ n.

29 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Idea 1: Method of Mixture [see De la Peña et al. 04, 07] Idea: as λ, E [ e λst Ntφ(λ)] 1: + ] [ E [e λst Nt σ2 λ 2 y 2 e λ2 y 2 y 2 dλ = E 2π [ and one obtains (for example): = E 2π + y Nσ2 + y e 2 Method of mixture (sub-gaussian case only!) With probability at least 1 δ, 2 log 1 ( δ µ(t) µ σ ) N t +1 2 log (N t + 1), where µ(t) = ( n t=1 ɛ tx t )/(N t +1). ] e λst Nt σ2 λ 2 2 λ2 y 2 2 dλ S 2 t 2(Nσ 2 +y 2 ) Succesfully used in [Abbasi-Yadkori, Pál and Szepesvári 11] for linear bandits. ] 1

30 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Idea 2: Peeling [G. & Moulines 08, 11] Decompose the value of N t by slices as follows: if N t > 1 then N t log(t) log(1+η) k=1 ](1 + η) k 1, (1 + η) k] Treat each slice independently with a unique λ k (instead of the λ n ) Control the loss in accuracy Peeling (sub-gaussian case) With probability at least 1 δ, ˆµ(t) µ σ ( 4 log(t) 2 log δ + log 2 log ( )) 4 log(t) N t δ.

31 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Outline 1 Two Different Problems Context Tree Estimation Stochastic Bandit Problems 2 Confidence Bounds for Self-Normalized Averages Sub-gaussian Case Beyond the Sub-gaussian Case

32 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Rewriting Chernoff s bound I(., µ) E [exp (λs t tφ(λ))] 1 if Xt = µ + S t /t, and x t µ, yields for λ = λ(x t ) : exp( t I(x t,µ)) P ( X t x t ) exp( ti(x t ; µ)) In other words: P ( I( X t ; µ) I(x t ; µ), X t µ ) exp( ti(x t ; µ)) or, denoting δ = ti(x t ; µ), P ( ti( X t ; µ) δ, X t µ ) exp( δ) = confidence interval of risk at most α : I-neighboorhood of X t { [at, bt] = µ : ti( X t ; µ) log 2 } α µ x t

33 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Rewriting Chernoff s bound I(., µ) I(x t,.) E [exp (λs t tφ(λ))] 1 if Xt = µ + S t /t, and x t µ, yields for λ = λ(x t ) : log(2/α) P ( X t x t ) exp( ti(x t ; µ)) In other words: P ( I( X t ; µ) I(x t ; µ), X t µ ) exp( ti(x t ; µ)) or, denoting δ = ti(x t ; µ), P ( ti( X t ; µ) δ, X t µ ) exp( δ) = confidence interval of risk at most α : I-neighboorhood of X t { [at, bt] = µ : ti( X t ; µ) log 2 } α a t X t b t

34 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Bounds for random N t [G. & Leonardi 11, G. & Cappé 11] General bound: For all δ > 0, P ( I (ˆµ(t); µ) δ ) 2e δ log(t) e δ N(t) Log-concave case If I( ; µ) is log-concave P ( t {1,..., n} : ti (ˆµ(t); µ) δ) 2 e δ 2 log(t) e δ Remark: the LIL suggests that there is little room for improvements.

35 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Extension: non-stationary observations [G. & Moulines 11] Let (X t ) t be independent rv bounded by B, with expectation µ t varying slowly (or rarely). Discounted estimator: for γ ]0, 1[, X γ (n) = S γ (n)/n γ (n) where S γ (n) = n t=1 γn t ε t X t and N γ (n) = n t=1 γn t ε t Bias-variance decomposition: if M γ (n) = n t=1 γn t ε t µ t, X γ (n) µ n = X γ (n) M γ(n) + M γ(n) N γ (n) N γ (n) µ n }{{} Fluctuations of the variance term: for all η > 0, ( ) S γ (n) M γ (n) log νγ (n) P δ exp Nγ 2(n) log(1 + η) où ν γ (n) = n t=1 γn t < min{(1 γ) 1, n}. ( 2δ2 B 2 )) (1 η2 16

36 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Multinomial laws [G. & Leonardi 11] Extension using the simple inequality: for all P, Q M 1 (A), KL(P ; Q) x A kl (P (x); Q(x)) Multinomial KL neighborhoods: If X 1,..., X n P 0 M 1 (A) are iid, and ˆP t (k) = t s=1 1{X s = k}/t P ( t {1,..., n} : ( ) KL ˆPt ; P 0 δ ) t 2e (δ log(n) + A ) exp ( δ ) A

37 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case KL-balls [Filippi, G. & Cappe 10] Sequence (Rt )t n of informational confidence regions for P0 simultaneously valid with probability at least 1 α: δ Rt = Q M1 (A) : KL(P t ; Q), t with δ such that 2e (δ log(n) + A ) exp ( δ/ A ) = α.

38 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Results: context tree estimation Context: ˆTC keeps node s if δ(s) = N n (bs)d (ˆp n ( bs); ˆp n ( s)) ɛ(n). b Penalized Maximum Likelihood: { ˆT P ML = arg max log ˆP } T (x n 1 x 0 ) + pen(n, T ). T Theorem Assume that pen(n, T ) = T ɛ(n). For every n 1 and ˆT (X n 1 ) { ˆTP ML (X n 1 ), ˆT C (X n 1 )} it holds that ( ) P ˆT (X n 1 ) T 0 1 e ( ɛ(n) log(n) + A 2) ( n 2 exp ɛ(n) ) A 2. No unnecessary assumptions like s, a A, P (a s) = 0 ou P (s; a) > ɛ.

39 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Results: an optimal UCB procedure UCB algorithm with { u a (t) = sup µ [0, 1] : kl (ˆµ a (t), µ ) } log(t) + 3 log log(t)). N a (t) Theorem ( ) E [ N a (T ) ] log(t ) 2π log µ (1 µ a) kl(µ a, µ ) + µ a(1 µ ) ( log(t kl(µa, µ ) ) ) + 3 log ( log(t ) ) 3/2 ( ( ) ) 2 ( ) 3 + 4e + kl(µ a, µ log ( log(t ) ) 2 log µ (1 µ a) µ a(1 µ ) + ) (kl(µ a, µ )) = improved logarithmic finite-time regret bound = asymptotically optimal in the Bernoulli case

40 Confidence Bounds for Self-Normalized Averages Beyond the Sub-gaussian Case Some References: 1 [Abbasi-Yadkori&al 11] Yasin Abbasi-Yadkori, Dávid Pál, Csaba Szepesvári: Online Least Squares Estimation with Self-Normalized Processes: An Application to Bandit Problems CoRR abs/ : (2011) 2 [Agrawal 95] R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4) : , [Audibert&al 09] J-Y. Audibert, R. Munos, and Cs. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Computer Science, 410(19), [Auer&al 02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2) : , [Bubeck, Ernst&G. 11] S. Bubeck, D. Ernst, A. Garivier, Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality Journal of Machine Learning Research vol. 14 Feb p [De La Pena&al 04] V.H. De La Pena, M.J. Klass, and T.L. Lai. Self-normalized processes : exponential inequalities, moment bounds and iterated logarithm laws. Annals of Probability, 32(3) : , [Filippi, Cappé&Garivier 10] S. Filippi, O. Cappé, and A. Garivier. Optimism in reinforcement learning and Kullback-Leibler divergence. In Allerton Conf. on Communication, Control, and Computing, Monticello, US, [Filippi, Cappé, G.& Szepesvari 10] S. Filippi, O. Cappé, A. Garivier, and C. Szepesvari. Parametric bandits : The generalized linear case. In Neural Information Processing Systems (NIPS), [G.&Cappé 11] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In 23rd Conf. Learning Theory (COLT), Budapest, Hungary, [Cappé,G.&al 13] O. Cappé, A. Garivier, O-A. Maillard, R. Munos, G. Stoltz, Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation, Annals of Statistics, vol. 41 (3) Jun pp [G.&Leonardi 11] A. Garivier and F. Leonardi. Context tree selection : A unifying view. Stochastic Processes and their Applications, 121(11) : , Nov [G.&Moulines 11] A. Garivier and E. Moulines. On upper-confidence bound policies for non-stationary bandit problems. In Algorithmic Learning Theory (ALT), volume 6925 of Lecture Notes in Computer Science, [Lai&Robins 85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models Aurélien Garivier Institut de Mathématiques de Toulouse Information Theory, Learning and Big Data Simons Institute, Berkeley, March