
1 LEARNING THEORY OF OPTIMAL DECISION MAKING, PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS. Csaba Szepesvári, Department of Computing Science, University of Alberta. Machine Learning Summer School, Île de Ré, France, 2008. With thanks to: the RLAI group, the SZTAKI group, Jean-Yves Audibert, Rémi Munos.

2 OUTLINE
1. HIGH LEVEL OVERVIEW OF THE TALKS
2. MOTIVATION: What is it? Why should we care? The challenge
3. BANDITS: Forcing, ε-greedy, Softmax, Optimism in the Face of Uncertainty
4. BANDITS WITH SIDE INFORMATION
5. ONLINE LEARNING IN MDPS
6. CONCLUSIONS

3 MOTTO A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it. (Einstein)

4 HIGH LEVEL OVERVIEW OF THE TALKS
Day 1: Online learning in stochastic environments
Day 2: Batch learning in Markovian Decision Processes
Day 3: Online learning in adversarial environments

5 REFRESHER
Union bound: $P(A \cup B) \le P(A) + P(B)$
Inclusion: $A \subset B \Rightarrow P(A) \le P(B)$
Inversion: if $P(X > t) \le F(t)$ holds for all $t$, then for any $0 < \delta < 1$, w.p. $1-\delta$, $X \le F^{-1}(\delta)$
Expectation: if $X \ge 0$, $E[X] = \int_0^\infty P(X \ge t)\,dt$

6 WHAT IS IT? PROTOCOL OF LEARNING
Concepts: Agent, Environment, sensations, actions, rewards. Time: $t = 1, 2, \dots$
Perception-action loop:
1. Agent senses $Y_t$ coming from the Environment
2. Agent sends action $A_t$ to the Environment
3. Agent receives reward $R_t$ from the Environment
4. $t := t + 1$, go to Step 1
Goal: maximize $\sum_{t=1}^T R_t$
RELATED PROBLEMS
Cost-sensitive learning: actions are predictions
Passive (batch) learning: no influence on training data
Active learning: learn fast (instead of cheaply)

7 WHY SHOULD WE CARE? Online: opportunity to keep improving; can learn with fewer data points; learning should be cheap (learning you know!). In stochastic environments: convenience, ...

8 THE CHALLENGE: EXPLORE OR EXPLOIT? CLINICAL TRIALS
Drugs: $i \in \mathcal{I} \stackrel{\mathrm{def}}{=} \{1, 2, \dots, k\}$
Protocol:
1. Choose drug $I_t \in \mathcal{I}$ for the next patient
2. Observe response $R_t(I_t) \in \{0, 1\}$ of the patient
3. $t := t + 1$, go to Step 1
Estimated response of drug $i$ after $n$ steps: $Q_n(i) = \dfrac{\sum_{t=1}^n \mathbb{I}\{I_t = i\}\, R_t(i)}{\sum_{t=1}^n \mathbb{I}\{I_t = i\}}$
WHICH DRUG TO CHOOSE? Best response so far? (greedy choice; exploit; optimize return) Least explored? (explore; collect information) ???
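
The estimate $Q_n(i)$ above is just a per-arm empirical mean, which can be maintained incrementally instead of being recomputed from the whole history. A minimal Python sketch (not from the slides; the two Bernoulli "drugs" and their response rates are illustrative assumptions):

```python
import random

class EmpiricalMeans:
    """Tracks Q_n(i), the empirical mean response of each option i."""

    def __init__(self, k):
        self.counts = [0] * k     # how many times option i was chosen
        self.means = [0.0] * k    # current empirical mean Q_n(i)

    def update(self, i, reward):
        """Incremental mean update after observing `reward` for option i."""
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]

if __name__ == "__main__":
    random.seed(0)
    q = EmpiricalMeans(k=2)
    true_p = [0.4, 0.6]                      # hypothetical response rates
    for t in range(10_000):
        i = t % 2                            # alternate options, just to feed data
        r = 1 if random.random() < true_p[i] else 0
        q.update(i, r)
    print(q.means)                           # close to [0.4, 0.6]
```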

9 BANDIT PROBLEMS
BANDIT PROBLEM [ROBBINS, 1952]
Choices: $i \in \mathcal{I} \stackrel{\mathrm{def}}{=} \{1, 2, \dots, k\}$
Protocol:
1. Choose an option $I_t \in \mathcal{I}$
2. Observe response $R_t(I_t) \in \{0, 1\}$
3. $t := t + 1$, go to Step 1
Terminology: option = arm = action
ASSUMPTION $R_t(i) \sim P_i(\cdot)$, independence within and between the arms

10 EXAMPLES Clinical trials, web advertising, job shop scheduling, ...

11 SOME DEFINITIONS
Expected payoffs: $Q(i) = E[R_1(i)]$
Optimal arm: $Q(i^*) = Q^* \stackrel{\mathrm{def}}{=} \max_i Q(i)$
Set of optimal arms: $\mathcal{I}^* = \{i : Q(i) = Q^*\}$
Payoff loss ("gap"): $\Delta_i = Q^* - Q(i)$
Total reward up to time $n$: $R_{1:n} = \sum_{t=1}^n R_t(I_t)$
A bandit problem instance: $B$; class of bandit problems: $\mathcal{B}$

12 DEFINING THE GOALS
DEFINITION (CONSISTENCY) A bandit algorithm $A$ is strongly consistent on $\mathcal{B}$ if $\lim_{t\to\infty} P(I_t \in \mathcal{I}^*) = 1$ holds when $A$ is run on any instance from $\mathcal{B}$.
DEFINITION (HANNAN-CONSISTENCY, "NO-REGRET") A bandit algorithm is Hannan-consistent on $\mathcal{B}$ if its expected regret $L_n$ is sublinear over time: $L_n \stackrel{\mathrm{def}}{=} n Q^* - E[R_{1:n}] = o(n)$ when $A$ is run on any instance from $\mathcal{B}$.

13 STRONG VS. HANNAN-CONSISTENCY
PROPOSITION If an algorithm is strongly consistent then it is also Hannan-consistent.
PROOF. Let $a_t = E[Q^* - R_t(I_t)]$. Then $a_t = \sum_i E[Q^* - R_t(i) \mid I_t = i]\, P(I_t = i) = \sum_{i:\Delta_i > 0} \Delta_i\, P(I_t = i)$. By strong consistency, $a_t \to 0$. By Cesàro's lemma, $\frac{1}{n}\sum_{t=1}^n a_t \to 0$, and since $L_n = \sum_{t=1}^n a_t$, this gives $L_n/n \to 0$.
PROPOSITION The reverse does not hold.

14 EXPLORATION IS NECESSARY
DEFINITION A bandit problem is non-trivial if for some optimal arm $i^*$ and some suboptimal arm $i$, $R_t(i^*) < R_t(i)$ holds with positive probability.
DEFINITION An algorithm stops exploring on problem $B$ if there exists a time $n$ such that after time step $n$ the algorithm only chooses exploiting actions: $Q_t(I_t) = \max_i Q_t(i)$ for all $t > n$. Time $n$ may depend on $B$.
PROPOSITION Let $\mathcal{B}$ contain a non-trivial bandit problem. If an algorithm stops exploring, it cannot be consistent on $\mathcal{B}$.

15 METHODS FOR ACHIEVING CONSISTENCY Forcing with a fixed schedule, ε-greedy, Softmax, Optimism in the Face of Uncertainty

16 FORCING
IDEA Exploration is necessary, so make sure that every arm is selected infinitely often. Let $T_n(i) = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$ be the number of times arm $i$ is chosen up to time $n$.
FORCING ALGORITHM($(f_t)_{t\in\mathbb{N}}$) At time $t$ do:
1. $i_0 := \operatorname{argmin}_i T_t(i)$
2. if $T_t(i_0) < f_t$ then $I_t := i_0$ else $I_t := \operatorname{argmax}_i Q_t(i)$.
COMMENT Other possibility: periodic forcing (forcing with a fixed timing).
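
A minimal Python sketch of the Forcing Algorithm above (the Bernoulli arms and the schedule $f_t = c\log t$ are illustrative assumptions, anticipating the choice analysed on the next slides):

```python
import math
import random

def forcing_bandit(arms_p, n, f=lambda t: 2.0 * math.log(t + 1), seed=0):
    """Forcing Algorithm: force the least-tried arm while T_t(i0) < f_t,
    otherwise play greedily with respect to the empirical means Q_t."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k                  # T_t(i): number of pulls of arm i so far
    Q = [0.0] * k                # Q_t(i): empirical mean of arm i
    total = 0.0
    for t in range(1, n + 1):
        i0 = min(range(k), key=lambda i: T[i])        # least explored arm
        if T[i0] < f(t):
            a = i0                                    # forced exploration
        else:
            a = max(range(k), key=lambda i: Q[i])     # greedy exploitation
        r = 1.0 if rng.random() < arms_p[a] else 0.0  # Bernoulli reward (illustrative)
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r
    return n * max(arms_p) - total                    # realised regret

if __name__ == "__main__":
    print(forcing_bandit([0.5, 0.6], n=10_000))
```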

17 REGRET OF THE FORCING ALGORITHM
PROPOSITION Let $f_t/t \to 0$ and $\lim_{t\to\infty} f_t = \infty$. Then the Forcing Algorithm is Hannan-consistent, i.e., $L_n/n \to 0$. (The Forcing Algorithm is not strongly consistent.)
... but what is the growth rate of $L_n$?

18 CENTRAL LIMIT THEOREM
$P\Big(\sqrt{\tfrac{n}{\sigma^2}}\,(\bar X_n - E[X_1]) \ge y\Big) \approx 1 - \Phi(y) \le \tfrac{1}{\sqrt{2\pi}\, y}\, e^{-y^2/2}$,
hence
$P\big(\bar X_n - E[X_1] \ge \epsilon\big) = P\Big(\sqrt{\tfrac{n}{\sigma^2}}\,(\bar X_n - E[X_1]) \ge \sqrt{\tfrac{n}{\sigma^2}}\,\epsilon\Big) \lesssim \sqrt{\tfrac{\sigma^2}{n\epsilon^2}}\; e^{-n\epsilon^2/(2\sigma^2)}$.

19 THEOREM (HOEFFDING'S INEQUALITY) Let $X_i \in [0,1]$ be i.i.d., $\mu = E[X_1]$, $\bar X_n = \frac{1}{n}\sum_{t=1}^n X_t$. Then
$P(\bar X_n \ge \mu + \epsilon) \le e^{-2n\epsilon^2}$ and $P(\bar X_n \le \mu - \epsilon) \le e^{-2n\epsilon^2}$.
COROLLARY (HOEFFDING'S BOUND IN DEVIATION FORM) Let $\delta > 0$. With probability $1-\delta$, $\bar X_n \le \mu + \sqrt{\tfrac{\log(1/\delta)}{2n}}$. Similarly, with probability $1-\delta$, $\bar X_n \ge \mu - \sqrt{\tfrac{\log(1/\delta)}{2n}}$.
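
A quick Monte Carlo sanity check of the deviation form of the bound (the sample size, confidence level, and Bernoulli(1/2) samples are illustrative assumptions):

```python
import math
import random

def hoeffding_check(n=200, delta=0.05, trials=20_000, seed=1):
    """Estimate P( X_bar - mu >= sqrt(log(1/delta)/(2n)) ) for Bernoulli(1/2)
    samples and compare it with the Hoeffding guarantee delta."""
    rng = random.Random(seed)
    radius = math.sqrt(math.log(1.0 / delta) / (2 * n))
    mu = 0.5
    exceed = 0
    for _ in range(trials):
        mean = sum(1 if rng.random() < mu else 0 for _ in range(n)) / n
        if mean - mu >= radius:
            exceed += 1
    return exceed / trials, delta

if __name__ == "__main__":
    freq, delta = hoeffding_check()
    print(f"observed exceedance {freq:.4f} <= guaranteed {delta}")
```

The observed exceedance frequency should come out well below δ, as expected from a worst-case bound.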

20 HEURISTIC ANALYSIS
Assume that $R_t(i) \in [0,1]$. Then (handwaving) w.p. $1-\delta_t$, for all $i \in \{1,\dots,k\}$, $|Q_t(i) - Q(i)| \le c_t \stackrel{\mathrm{def}}{=} \sqrt{\tfrac{\log(2k/\delta_t)}{2 f_t}}$, since we explored arm $i$ at least $f_t$ times.
Counting mistakes:
1. When forcing: total contribution to $L_n$ is at most $k f_n$
2. When the sample is "atypical": $\sum_{t=1}^n \delta_t$
3. When the sample is "typical": no mistake when $c_t < \Delta/2$, where $\Delta \stackrel{\mathrm{def}}{=} \min_{j:\Delta_j > 0} \Delta_j$, i.e. (*) $f_t \ge 2\log(2k/\delta_t)/\Delta^2$
Hence $L_n \lesssim k f_n + \sum_{t=1}^n \delta_t + O(1) = O(\log n)$ if $\delta_t = 1/t$ and $f_t = c \log t$ with $c = \Omega(\Delta^{-2})$.

21 REGRET OF THE FORCING ALGORITHM
THEOREM Assume that the payoffs are bounded. If $f_t = c \log t$ with $c = \Omega(\Delta^{-2})$ then the regret of the Forcing Algorithm grows logarithmically: $L_n = O(\log n)$.

22 THE ε-greedy ALGORITHM
ε-GREEDY($(\epsilon_t)_{t\in\mathbb{N}}$) At time $t$ do:
1. Draw $U_t$ uniformly at random in $[0,1]$
2. if $U_t < \epsilon_t$ then pick $I_t$ uniformly at random from $\{1, 2, \dots, k\}$
3. else $I_t := \operatorname{argmax}_i Q_t(i)$.
THEOREM (REGRET OF ε-GREEDY [AUER ET AL., 2002]) Assume that the payoffs are bounded. If $\epsilon_t = c/t$ with $c = \Omega(k/\Delta^2)$ then $P(I_t \notin \mathcal{I}^*) = O(c/t)$ and $L_n = O(c \log n)$.
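
A minimal Python sketch of ε-greedy with the decaying schedule $\epsilon_t = \min(1, c/t)$ (the constant $c$ and the Bernoulli arms are illustrative assumptions):

```python
import random

def epsilon_greedy(arms_p, n, c=10.0, seed=0):
    """epsilon-greedy with epsilon_t = min(1, c/t) and empirical means Q_t."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    total = 0.0
    for t in range(1, n + 1):
        eps = min(1.0, c / t)
        if rng.random() < eps:
            a = rng.randrange(k)                      # explore: uniform over arms
        else:
            a = max(range(k), key=lambda i: Q[i])     # exploit: greedy arm
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r
    return n * max(arms_p) - total                    # realised regret

if __name__ == "__main__":
    print(epsilon_greedy([0.5, 0.6], n=10_000))
```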

23 SOFTMAX ALGORITHMS
Problem with the previous algorithms: when exploring, they are indifferent about $Q_t(i)$.
BOLTZMANN($(\tau_t)_{t\in\mathbb{N}}$) At time $t$ do:
1. Let $w_t(i) = \exp(Q_t(i)/\tau_t)$, $i \in \{1, 2, \dots, k\}$
2. Let $p_t(i) = \dfrac{w_t(i)}{\sum_j w_t(j)}$
3. Draw $I_t$ from $p_t(\cdot)$
COMMENTS $\tau_t \to 0$ is the "computational temperature"; aka exponential-weights algorithm, Gibbs selection. Plays a big role in adversarial environments.
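
A minimal Python sketch of Boltzmann exploration (the cooling schedule $\tau_t = 1/\log(t+2)$ and the Bernoulli arms are illustrative assumptions):

```python
import math
import random

def boltzmann(arms_p, n, tau=lambda t: 1.0 / math.log(t + 2), seed=0):
    """Boltzmann / softmax exploration: draw I_t with probability
    p_t(i) proportional to exp(Q_t(i) / tau_t)."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    total = 0.0
    for t in range(1, n + 1):
        temp = tau(t)
        w = [math.exp(Q[i] / temp) for i in range(k)]   # unnormalised weights
        a = rng.choices(range(k), weights=w)[0]         # sample I_t ~ p_t(.)
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r
    return n * max(arms_p) - total

if __name__ == "__main__":
    print(boltzmann([0.5, 0.6], n=10_000))
```

Note that, unlike ε-greedy, the exploration probability of a clearly bad arm shrinks as its estimated value falls behind.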

24 OFU: OPTIMISM IN THE FACE OF UNCERTAINTY
IDEA [LAI AND ROBBINS, 1985] Choose the arm with the best potential.
Assumption: $R_t(i) \sim p_{\theta_i}(\cdot)$, $\theta_i \in \mathbb{R}$ unknown, $p$ known.
Mean payoff: $Q(\theta) = \int r\, p_\theta(r)\, dr$
Uncertainty set: $U_{i,t} = \{\theta : \log P(R_1(i), \dots, R_{T_i(t)}(i) \mid \theta)\ \text{is "large"}\}$ ... "large" depends on $t$, $T_i(t)$.
Upper confidence index for arm $i$: $\mathrm{UCI}_t(i) = \max_{\theta \in U_{i,t}} Q(\theta)$
Algorithm UCI: $I_t := \operatorname{argmax}_i \mathrm{UCI}_t(i)$.
Two reasons we select an arm: (i) the associated uncertainty is big, (ii) the arm looks good. Is this a good algorithm??

25 REGRET BOUND FOR UCI [LAI AND ROBBINS, 1985]
THEOREM For any suboptimal arm $i$, $E[T_n(i)] \le \Big(\frac{1}{D(p_i\|p^*)} + o(1)\Big)\log n$, where $p_i = p_{\theta_i}$, $D(p_i\|p^*) = \int p_i(x)\log\frac{p_i(x)}{p^*(x)}\,dx$, and $p^*$ is the distribution of an optimal arm.
COROLLARY $L_n \le \sum_{i:\Delta_i>0} \Delta_i \Big\{\frac{1}{D(p_i\|p^*)} + o(1)\Big\}\log n$.

26 A LOWER BOUND [LAI AND ROBBINS, 1985]
$B = (p_1, \dots, p_K)$; $L_n^A(B)$: regret of $A$ when run on problem $B$.
DEFINITION (UNIFORMLY GOOD ALGORITHMS) Algorithm $A$ is uniformly good if $L_n^A(B) = o(n^a)$ holds for all $a > 0$ and all reward distributions $B = (p_1, \dots, p_K) \in \mathcal{B}$. This is a minimum requirement!
THEOREM (LOWER BOUND) If $A$ is uniformly good then for any $B = (p_1, \dots, p_K) \in \mathcal{B}$ and any suboptimal arm $i$, $E[T_n(i)] \ge \Big(\frac{1}{D(p_i\|p^*)} - o(1)\Big)\log n$.
COROLLARY UCI algorithms are asymptotically efficient.

27 UCB: A NONPARAMETRIC IMPLEMENTATION OF OFU
Goal: avoid parametric distributions! How to implement the OFU principle? What is the potential of an arm? We need upper estimates of its mean payoff!
[Agrawal, 1995]: large-deviation theory, asymptotic results. [Auer et al., 2002]: avoid asymptotics, use Hoeffding's inequality!
Hoeffding: $Q(i) \le Q_t(i) + \sqrt{\tfrac{\log(k/\delta_t)}{2 T_t(i)}}$ (w.h.p.), so set
$\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\tfrac{p \log(k/\delta_t)}{2 T_t(i)}}$, where $p > 1$ and $\delta_t \to 0$ are tuning parameters.

28 ALGORITHM UCB1
UCB1(p) At time $t$ do:
1. $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\tfrac{p \log t}{2 T_t(i)}}$
2. $I_t := \operatorname{argmax}_i \mathrm{UCI}_t(i)$
THEOREM (UCB1 REGRET) Let $0 \le R_t(i) \le 1$, i.i.d., independent between the arms. Then the regret of UCB1 satisfies
$E[L_n] \le 2p \sum_{i:\Delta_i>0} \frac{\log n}{\Delta_i} + \frac{p}{p-2}\sum_{i=1}^K \Delta_i$.
(Slightly) improved bound compared to [Auer et al., 2002]. One can show that $1/D(p_j\|p^*) \le 1/(2\Delta_j^2)$, so this is still far from the best possible constant (1/2).
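
A minimal Python sketch of UCB1(p) with the index above ($p = 2$ and the Bernoulli arms are illustrative assumptions):

```python
import math
import random

def ucb1(arms_p, n, p=2.0, seed=0):
    """UCB1: pull each arm once, then pull the argmax of
    Q_t(i) + sqrt(p * log(t) / (2 * T_t(i)))."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    total = 0.0

    def pull(a):
        nonlocal total
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r

    for a in range(k):                     # initialisation: one pull per arm
        pull(a)
    for t in range(k + 1, n + 1):
        uci = [Q[i] + math.sqrt(p * math.log(t) / (2 * T[i])) for i in range(k)]
        pull(max(range(k), key=lambda i: uci[i]))
    return n * max(arms_p) - total         # realised regret

if __name__ == "__main__":
    print(ucb1([0.5, 0.6], n=10_000))
```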

29 UCBTUNED: MOTIVATION
Assume that rewards are in $[a, a+b]$. UCB1 needs to be adjusted: $\mathrm{UCI}_t(i) := Q_t(i) + b\sqrt{\tfrac{p \log t}{2 T_t(i)}}$.
Regret bound: $R_n \le 2p \sum_{i:\Delta_i>0} \frac{b^2}{\Delta_i}\log n + \frac{p}{p-2}\, b \sum_{i=1}^K \Delta_i$.
Problem: outrageously big when $\mathrm{Var}[R_t(i)] \ll b^2$!

30 UCBTUNED: ALGORITHM
Idea: estimate the variance and use it to define the upper index!
THEOREM (EMPIRICAL BERNSTEIN BOUND [AUDIBERT ET AL., 2007]) Let $a \le X_t \le a+b$ be i.i.d., $t > 2$. Let $\bar X_t$ be the empirical mean of $X_1, \dots, X_t$ and $V_t = \frac{1}{t}\sum_{s=1}^t (X_s - \bar X_t)^2$ the empirical variance estimate. Then for any $0 < \delta < 1$, w.p. at least $1-\delta$,
$\bar X_t - E[X_1] \le \sqrt{\tfrac{2 V_t x_\delta}{t}} + \tfrac{3 b x_\delta}{t}$, where $x_\delta = \log(3/\delta)$.
UCBTUNED(p) At time $t$ do:
1. $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\tfrac{2 V_t(i)\, p \log t}{T_t(i)}} + \tfrac{3 b\, p \log t}{T_t(i)}$
2. $I_t := \operatorname{argmax}_i \mathrm{UCI}_t(i)$
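
A minimal Python sketch of the variance-aware index above, using Welford's recurrence for the empirical variance $V_t(i)$ (the reward range $b = 1$, $p = 1.2$, and the Bernoulli arms are illustrative assumptions):

```python
import math
import random

def ucb_tuned(arms_p, n, p=1.2, b=1.0, seed=0):
    """UCBTuned(p): index Q_t(i) + sqrt(2*V_t(i)*p*log(t)/T_t(i)) + 3*b*p*log(t)/T_t(i)."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    M2 = [0.0] * k     # sum of squared deviations, so V_t(i) = M2[i] / T[i]
    total = 0.0

    def pull(a):
        nonlocal total
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        d = r - Q[a]
        Q[a] += d / T[a]
        M2[a] += d * (r - Q[a])            # Welford update of the variance sum
        total += r

    for a in range(k):                     # one pull per arm to initialise
        pull(a)
    for t in range(k + 1, n + 1):
        plog = p * math.log(t)
        uci = [Q[i] + math.sqrt(2 * (M2[i] / T[i]) * plog / T[i]) + 3 * b * plog / T[i]
               for i in range(k)]
        pull(max(range(k), key=lambda i: uci[i]))
    return n * max(arms_p) - total

if __name__ == "__main__":
    print(ucb_tuned([0.5, 0.6], n=10_000))
```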

31 UCBTUNED: REGRET BOUND
THEOREM (UCBTUNED REGRET [AUDIBERT ET AL., 2007]) Let $0 \le R_t(i) \le b$, i.i.d., independent between the arms, $p > 1$, $\sigma_i^2 = \mathrm{Var}[R_t(i)]$. Then the regret of UCBTuned(p) satisfies
$E[L_n] \le c_p \sum_{i:\Delta_i>0} \Big(\frac{\sigma_i^2}{\Delta_i} + 2b\Big)\log n$.
In particular, when $p = 1.2$, $c_p = 10$.

32 WHAT REALLY HAPPENS... [Figure] Distribution of $T_t(2)/t$ plotted against time. Bandit: $R_t(1) \sim \mathrm{Ber}(0.5)$, $R_t(2) =$

33 EXTENSIONS
Switching costs [Agrawal et al., 1988]
Dependent rewards [Lai and Yakowitz, 1995]
Continuous action spaces: discretization [Kleinberg, 2004, Auer et al., 2007b]; tree-based methods [Kocsis and Szepesvári, 2006, Bubeck et al., 2008]
Linear mean-payoff $Q(\cdot)$ [Dani et al., 2008]
Drifts [Garivier and Moulines, 2008]
Side information (see below)
Feedback: MDPs (see below)

34 SUMMARY
Forcing, ε-greedy: the exploration schedule provides a lower bound on the regret; requires tuning of the loss tolerance.
OFU-based algorithms: no fixed schedule for exploration, hence adaptivity; advantage: minimal tuning.
You can mix these ideas!

35 BANDITS WITH SIDE INFORMATION ("ASSOCIATIVE BANDITS")
PROTOCOL OF LEARNING Perception-action loop:
1. Agent senses $Y_t$ coming from the Environment
2. Agent sends action $A_t$ to the Environment
3. Agent receives reward $R_t$ from the Environment
4. $t := t + 1$, go to Step 1
ASSUMPTION Sensations are not influenced by the agent's decisions.

36 VARIANTS
$Y_t$ is from an $m$-element set $\mathcal{Y}$: $m$ independent bandit problems
Dependent payoffs, e.g., $Q : \mathcal{Y} \times \mathcal{A} \to \mathbb{R}$ with $Q(\cdot, a)$ smooth [Yang and Zhu, 2002] or linear [Auer, 2003]
Continuous action sets
Delayed feedback [György et al., 2007]

37 ONLINE LEARNING IN MDPS
Assumption: the Environment is a finite MDP $(\mathcal{X}, \mathcal{A}, p, r)$.
PROTOCOL OF LEARNING Perception-action loop:
1. Agent senses state $X_t \sim p(\cdot \mid X_{t-1}, A_{t-1})$ of the Environment
2. Agent sends action $A_t$ to the Environment
3. Agent receives reward $R_t$ from the Environment ($E[R_t \mid X_t = x, A_t = a] = r(x,a)$)
4. $t := t + 1$, go to Step 1
Goal: minimize the regret $L_n = n\lambda^* - \sum_{t=1}^n R_t$, where $\lambda^*$ is the best possible reward per time step.

38 AVERAGE REWARD MDPS
ASSUMPTION Irreducibility of the Markov chains under stationary policies.
AVERAGE REWARD BELLMAN OPTIMALITY EQUATIONS (BOE) Find $\lambda^* \in \mathbb{R}$, $h^* : \mathcal{X} \to \mathbb{R}$ such that
$\lambda^* + h^*(x) = \max_{a \in A(x)} \big[ r(x,a) + \langle p_x(a), h^* \rangle \big] = \max_{a \in A(x)} Q^*(x,a), \quad x \in \mathcal{X}$.
RESTRICTED PROBLEM Restricting the actions to $D(x) \subset A(x)$ yields the solution $h_D$, $Q_D$.

39 OPTIMISTIC LINEAR PROGRAMMING [TEWARI AND BARTLETT, 2007]
ALGORITHM OLP
1. Update $\hat p$, the estimate of the transition probabilities
2. $D(x) := \{a \in A(x) : T_t(x,a) \ge \log^2 T_t(x)\}$, $x \in \mathcal{X}$ // keep well-sampled actions only
3. $(\lambda, h, Q) := \mathrm{BOE}(D, \hat p)$ // solve the Bellman equations
4. $\mathrm{UCI}(a) := \sup\big\{ r(X_t, a) + \langle q, h \rangle : \|q\|_1 = 1,\ \|\hat p_{X_t}(a) - q\|_1 \le \sqrt{\tfrac{2 \log t}{T_t(X_t, a)}} \big\}$
5. $\Gamma := \{a \in A(X_t) : Q(X_t, a) = \lambda + h(X_t),\ T_t(X_t, a) < \log^2(T_t(X_t) + 1)\}$
6. if $A(X_t) = \Gamma$ /* all actions are in danger */ then let $A_t$ be some element of $\Gamma$, else $A_t := \operatorname{argmax}_{a \in A(X_t)} \mathrm{UCI}(a)$.

40 REGRET BOUNDS FOR OLP
$\Delta^*(x,a) = h^*(x) + \lambda^* - Q^*(x,a)$
$Z_\epsilon(x,a) = \{q : \|q\|_1 = 1,\ r(x,a) + \langle q, h^* \rangle \ge \lambda^* + h^*(x) - \epsilon\}$
$C = \{(x,a) : Q^*(x,a) < \lambda^* + h^*(x),\ \exists\, \epsilon > 0 : Z_\epsilon(x,a) \ne \emptyset\}$
$J_{x,a}(p; \epsilon) = \inf\{\|p - q\|_1 : q \in Z_\epsilon(x,a)\}$
$K(x,a) = \lim_{\epsilon \to 0} J_{x,a}(p_x(a); \epsilon)$
$H = 2 \sum_{(x,a) \in C} \frac{\Delta^*(x,a)}{K(x,a)^2}$
THEOREM ([TEWARI AND BARTLETT, 2007]) Let the MDP $M$ be irreducible. If $L_n$ is the regret of OLP after $n$ steps then $\limsup_{n\to\infty} \frac{L_n}{\log n} \le H$.

41 INTERPRETATION
PROPOSITION Assume $A(x) = A$ for all $x \in \mathcal{X}$. Let $\Delta = \min_{(x,a) \in C} \Delta^*(x,a)$. Then $H \le 2\,|\mathcal{X}|\,|A|\,\|h^*\|^2\,\Delta^{-1}$.
COMMENT The algorithm closely follows that of [Burnetas and Katehakis, 1997], who proved asymptotic efficiency for their policy.

42 THE UCRL2 ALGORITHM [AUER ET AL., 2007A]
MOTIVATION Explore OFU in MDPs; high-probability bounds for finite-horizon problems.
ALGORITHM UCRL2(δ, n)
Phase initialization:
1. Estimate the mean model $\hat p_t$ using maximum likelihood
2. $U := \big\{ p : \|p(\cdot \mid x,a) - \hat p_t(\cdot \mid x,a)\|_1 \le c\sqrt{\tfrac{|\mathcal{X}| \log(|\mathcal{A}| n/\delta)}{T_t(x,a)}} \big\}$
3. $p^* := \operatorname{argmax}_{p \in U} \lambda^*(p)$, $\pi := \pi^*(p^*)$
4. $C(x,a) := T_t(x,a)$
Execution of a phase:
1. Execute $\pi$ until some $(x,a)$ gets visited more than $C(x,a)$ times.

43 UCRL2 RESULTS
DEFINITION (DIAMETER) Let $M$ be an MDP. The diameter of $M$ is $D(M) = \max_{x,y} \min_\pi E[T(x \to y; \pi)]$.
Results:
Lower bound: $E[L_n] = \Omega\big(\sqrt{D |\mathcal{X}| |\mathcal{A}| n}\big)$
Upper bounds: w.p. $1 - \delta/n$, $L_n \le O\big(D |\mathcal{X}| \sqrt{|\mathcal{A}| n \log(|\mathcal{A}| n/\delta)}\big)$; w.p. $1 - \delta$, $L_n \le O\big(\tfrac{D^2 |\mathcal{X}|^2 |\mathcal{A}| \log(|\mathcal{A}| n/\delta)}{\Delta}\big)$, where $\Delta = \min_{\pi : \lambda^\pi < \lambda^*} (\lambda^* - \lambda^\pi)$.

44 CONCLUSIONS
Exploration is necessary ... but should be controlled in a wise manner.
Tools: forced exploration, Softmax, optimism in the face of uncertainty.
After 50 years still lots of work to be done! Scaling up, non-stationary environments, dependencies.

45 REFERENCES I
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27.
Agrawal, R., Hegde, M., and Teneketzis, D. (1988). Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10).
Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory - 18th International Conference, ALT 2007. Springer.
Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3).
Auer, P., Jaksch, T., and Ortner, R. (2007a). Near-optimal regret bounds for reinforcement learning. Technical report, University of Leoben.
Auer, P., Ortner, R., and Szepesvári, C. (2007b). Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-07).

46 REFERENCES II
Bubeck, S., Munos, R., and Szepesvári, C. (2008). Online optimization in X-armed bandits. In NIPS-21.
Burnetas, A. and Katehakis, M. (1997). Optimal adaptive policies for Markov Decision Processes. Mathematics of Operations Research, 22(1).
Dani, V., Hayes, T., and Kakade, S. (2008). Stochastic linear optimization under bandit feedback. COLT.
Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. Technical report, LTCI.
György, A., Kocsis, L., Szabó, I., and Szepesvári, C. (2007). Continuous time associative bandit problems. In IJCAI-07.
Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit problem. In NIPS.
Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML-2006).
Lai, T. and Yakowitz, S. (1995). Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control, 40.

47 REFERENCES III
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58.
Tewari, A. and Bartlett, P. (2007). Optimistic linear programming gives logarithmic regret for irreducible MDPs. In NIPS-20.
Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Annals of Statistics, 30(1).
