Reinforcement Learning

1 Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology

2 Objectives of this lecture Introduce bandit optimisation: the most basic RL problem, without dynamics. Optimising the exploration vs. exploitation trade-off. Regret lower bounds. Algorithms based on the optimism in the face of uncertainty principle. Thompson Sampling algorithm. Structured bandits. 2

4 Lecture 5: Outline 1. Classifying bandit problems 2. Regret lower bounds 3. Algorithms based on the optimism in the face of uncertainty principle 4. Thompson Sampling algorithm 5. Structured bandits 4

5 Bandit Optimisation
- Interact with an i.i.d. or adversarial environment
- Set of available actions A with unknown sequences of rewards r_t(a), t = 1, ...
- The reward of the selected action is the only feedback ("bandit feedback")
- Stochastic vs. adversarial bandits: in an i.i.d. environment, r_t(a) is a random variable with mean θ_a; in an adversarial environment, r_t(a) is arbitrary!
- Objective: develop an action selection rule π maximising the expected cumulative reward up to step T
- Remark: π must select each action based on the entire history of past observations!

6 Regret Difference between the cumulative reward of an Oracle policy and that of the agent π. Regret quantifies the price to pay for learning. Exploration vs. exploitation trade-off: we need to probe all actions in order to later play the best one. 6

7 Applications: Clinical trials (Thompson 1933) Two available treatments with unknown rewards ("Live" or "Die") - Bandit feedback: after administering the treatment to a patient, we observe whether she survives or dies - Goal: design a treatment selection scheme π maximising the number of patients cured after treatment 7

8 Applications: Rate adaptation in wireless systems - The AP sequentially sends packets to the receiver and has K available encoding rates r_1 < r_2 < ... < r_K - The unknown probability that a packet sent at rate r_k is received is θ_k - Goal: design a rate selection scheme that learns the θ_k's and quickly converges to the rate r_k maximising µ_k = r_k θ_k over k 8

9 Applications: Search engines - The engine should list relevant webpages depending on the request, e.g., "jaguar" - The CTRs (Click-Through Rates) are unknown - Goal: design a list selection scheme that learns the list maximising the global CTR 9

10 Bandit Taxonomy Stochastic bandits: the sequence of rewards (r_t(a), a ∈ A)_{t ≥ 1} is generated according to an i.i.d. process; the average rewards are unknown. Adversarial bandits: arbitrary sequence of rewards. Most bandit problems in engineering are stochastic... 10

11 Stochastic Bandit Taxonomy Unstructured problems: the average rewards are not related, θ = (θ_1, ..., θ_K) ∈ Θ = Π_k [a_k, b_k]. Structured problems: the decision maker knows that the average rewards are related; she knows Θ. The rewards observed for a given arm provide side information about the other arms. θ = (θ_1, ..., θ_K) ∈ Θ, where Θ is not a hyper-rectangle. 11

12 Stochastic Bandit Taxonomy 12

13 Lecture 5: Outline 1. Classifying bandit problems 2. Regret lower bounds 3. Algorithms based on the optimism in the face of uncertainty principle 4. Thompson Sampling algorithm 5. Structured bandits 13

14 Unstructured Stochastic Bandits (Robbins 1952)
- Finite set of actions A = {1, ..., K}
- (Unknown) rewards of action a ∈ A: (r_t(a), t ≥ 0), i.i.d. Bernoulli with E[r_t(a)] = θ_a, θ ∈ Θ = [0, 1]^K
- Optimal action: a* ∈ argmax_a θ_a
- Online policy π: select action a^π_t at time t depending on a^π_1, r_1(a^π_1), ..., a^π_{t-1}, r_{t-1}(a^π_{t-1})
- Regret up to time T: R^π(T) = T θ_{a*} − Σ_{t=1}^T θ_{a^π_t}
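
To make the setting concrete, here is a minimal simulation sketch of a Bernoulli bandit environment and of the regret R^π(T) = T θ_{a*} − Σ_t θ_{a_t} defined above; the class and function names (BernoulliBandit, regret) are illustrative, not from the lecture.

```python
import numpy as np

class BernoulliBandit:
    """K-armed stochastic bandit with Bernoulli rewards (illustrative helper)."""
    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)  # unknown means theta_a
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # r_t(a) ~ Bernoulli(theta_a): the only feedback the agent observes
        return float(self.rng.random() < self.theta[a])

def regret(bandit, arms_played):
    """R^pi(T) = T * theta_{a*} - sum_t theta_{a_t} for the sequence of chosen arms."""
    arms_played = np.asarray(arms_played)
    return len(arms_played) * bandit.theta.max() - bandit.theta[arms_played].sum()

if __name__ == "__main__":
    bandit = BernoulliBandit([0.3, 0.5, 0.7])
    # a policy that ignores feedback and plays uniformly at random: linear regret
    rng = np.random.default_rng(1)
    arms = rng.integers(0, 3, size=10_000)
    print("uniform-play regret over T=10000:", regret(bandit, arms))
```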

15 Problem-specific regret lower bound
Uniformly good algorithms: an algorithm π is uniformly good if for all θ ∈ Θ and any sub-optimal arm a, the number of times N_a(t) arm a is selected up to round t satisfies E[N_a(t)] = o(t^α) for all α > 0.
Theorem (Lai-Robbins 1985). For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ Σ_{a ≠ a*} (θ_{a*} − θ_a) / KL(θ_a, θ_{a*})
where KL(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)) (the KL divergence between Bernoulli distributions of means a and b).

16 Minimax regret lower bound
Theorem (Auer et al. 2002). For any T, we can find a problem (depending on T) such that for any algorithm π, R^π(T) ≥ √(KT) (1 − 1/K).

17 Unified proofs
- Change of measure: θ → ν. Log-likelihood ratio L: E_θ[L] = Σ_j E_θ[N_j(T)] KL(θ_j, ν_j)
- Data processing inequality. For any event A ∈ F_T, P_ν(A) = E_θ[exp(−L) 1_A]. Jensen's inequality yields:
P_ν(A) ≥ exp(−E_θ[L | A]) P_θ(A) and P_ν(A^c) ≥ exp(−E_θ[L | A^c]) P_θ(A^c)
Hence E_θ[L] ≥ KL(P_θ(A), P_ν(A)).
- Data processing inequality v2. For any F_T-measurable Z with values in [0, 1], E_θ[L] ≥ KL(E_θ(Z), E_ν(Z)).

18 Proof of the problem-specific lower bound
- Change of measure: θ → ν with ν_j = θ_j for all j ≠ a, and ν_a = θ_{a*} + ε.
E_θ[L] = E_θ[N_a(T)] KL(θ_a, θ_{a*} + ε) ≥ KL(P_θ(A), P_ν(A))
- Select A = {N_a(T) ≥ T − √T}. Markov's inequality yields (for uniformly good algorithms):
lim_{T→∞} P_θ[A] = 0 = lim_{T→∞} P_ν[A^c]
- Hence lim inf_{T→∞} E_θ[N_a(T)] / log(T) ≥ 1 / KL(θ_a, θ_{a*} + ε)

19 Proof of the minimax lower bound
- Change of measure: θ_a = 1/2 for all a. Then there exists an arm a such that E_θ[N_a(T)] ≤ T/K.
- ν_i = θ_i for all i ≠ a, and ν_a = 1/2 + ε.
E_θ[L] = E_θ[N_a(T)] KL(1/2, 1/2 + ε) ≥ KL(E_θ(Z), E_ν(Z)), where KL(1/2, 1/2 + ε) = (1/2) log(1/(1 − 4ε²)).
- Select Z = N_a(T)/T. Pinsker's inequality yields:
E_θ[N_a(T)] · (1/2) log(1/(1 − 4ε²)) ≥ 2 (E_ν[N_a(T)]/T − E_θ[N_a(T)]/T)²
- Hence, with E_θ[N_a(T)] ≤ T/K:
E_ν[N_a(T)]/T ≤ 1/K + (1/2) √( (T/K) log(1/(1 − 4ε²)) )

20 Proof of the minimax lower bound
Now R_ν(T) = T ε (1 − E_ν[N_a(T)]/T), and we conclude by choosing ε = √(K/T) that:
R_ν(T) ≥ √(KT) (1 − 1/K)

21 Lecture 5: Outline 1. Classifying bandit problems 2. Regret lower bounds 3. Algorithms based on the optimism in the face of uncertainty principle 4. Thompson Sampling algorithm 5. Structured bandits 21

22 Concentration
The main tools in the design and analysis of algorithms for stochastic bandits are concentration-of-measure results.
Let X_1, X_2, ... be i.i.d. real-valued random variables with mean µ and log-moment generating function G(λ) = log(E[e^{λ(X_n − µ)}]). Let S_n = Σ_{i=1}^n X_i.
Strong law of large numbers: P[lim_{n→∞} S_n/n = µ] = 1
Concentration inequality: for δ, λ > 0,
P[S_n − nµ ≥ δ] = P[e^{λ(S_n − nµ)} ≥ e^{λδ}] ≤ e^{−λδ} E[e^{λ(S_n − nµ)}] = e^{−λδ} Π_{i=1}^n E[e^{λ(X_i − µ)}] = e^{nG(λ) − λδ}
Optimising over λ: P[S_n − nµ ≥ δ] ≤ e^{−sup_{λ>0} (λδ − nG(λ))}

23 Concentration
P[S_n − nµ ≥ δ] ≤ e^{−sup_{λ>0} (λδ − nG(λ))}
- Bounded r.v. X_n ∈ [a, b]: G(λ) ≤ λ²(b − a)²/8. Hoeffding's inequality: P[S_n − nµ ≥ δ] ≤ e^{−2δ²/(n(b − a)²)}
- Sub-Gaussian r.v.: G(λ) ≤ σ²λ²/2
- Bernoulli r.v.: G(λ) = log(µ e^{λ(1 − µ)} + (1 − µ) e^{−λµ}). Chernoff's inequality: P[S_n − nµ ≥ δ] ≤ e^{−n KL(µ + δ/n, µ)}, where KL(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)) (KL divergence)
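
As a quick sanity check of the bounds above, the sketch below compares the empirical tail probability P[S_n − nµ ≥ δ] for Bernoulli rewards with the Hoeffding and Chernoff bounds; the function names are illustrative.

```python
import numpy as np

def kl_bernoulli(a, b, eps=1e-12):
    """KL(a, b) = a log(a/b) + (1-a) log((1-a)/(1-b)) between Bernoulli distributions."""
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

def tail_bounds(mu=0.3, n=100, delta=10.0, n_trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    samples = rng.random((n_trials, n)) < mu          # Bernoulli(mu) rewards
    s_n = samples.sum(axis=1)
    empirical = np.mean(s_n - n * mu >= delta)
    hoeffding = np.exp(-2 * delta**2 / n)             # (b - a) = 1 for Bernoulli
    chernoff = np.exp(-n * kl_bernoulli(mu + delta / n, mu))
    return empirical, hoeffding, chernoff

if __name__ == "__main__":
    emp, hoef, cher = tail_bounds()
    print(f"empirical {emp:.5f} <= chernoff {cher:.5f} <= hoeffding {hoef:.5f}")
```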

24 Algorithms
Estimating the average reward of arm a: ˆθ_a(t) = (1/N_a(t)) Σ_{n=1}^t r_n(a) 1_{a(n)=a}
ε-greedy. In each round t:
- with probability 1 − ε, select the best empirical arm a*(t) ∈ argmax_a ˆθ_a(t)
- with probability ε, select an arm uniformly at random
The algorithm has linear regret (it is not uniformly good).

25 Algorithms
ε_t-greedy. In each round t:
- with probability 1 − ε_t, select the best empirical arm a*(t) ∈ argmax_a ˆθ_a(t)
- with probability ε_t, select an arm uniformly at random
The algorithm has logarithmic regret for Bernoulli rewards and ε_t = min(1, K/(t δ²)), where δ = min_{a ≠ a*} (θ_{a*} − θ_a).
Sketch of proof. For a ≠ a* to be selected in round t (by the greedy step), we need (most often) ˆθ_a(t) ≥ θ_a + δ. The probability that this occurs is less than exp(−2δ² N_a(t)). But N_a(t) is close to log(t)/δ². Summing over t yields the result.
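
A minimal sketch of the ε-greedy and ε_t-greedy strategies above, assuming Bernoulli rewards; the function name `epsilon_greedy` and its arguments are illustrative, and the schedule ε_t = min(1, K/(t δ²)) requires knowing (or guessing) the minimal gap δ.

```python
import numpy as np

def epsilon_greedy(bandit_pull, K, T, epsilon_schedule, seed=0):
    """epsilon(-t)-greedy sketch: play the empirical best arm w.p. 1 - eps_t,
    a uniformly random arm w.p. eps_t. `bandit_pull(a)` returns a reward in [0, 1]."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(K)          # N_a(t)
    means = np.zeros(K)           # hat{theta}_a(t)
    arms = []
    for t in range(1, T + 1):
        eps = epsilon_schedule(t)
        if rng.random() < eps or counts.min() == 0:
            a = int(rng.integers(K))            # explore (also until each arm is seen once)
        else:
            a = int(np.argmax(means))           # exploit the empirical best arm
        r = bandit_pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running average of observed rewards
        arms.append(a)
    return arms

if __name__ == "__main__":
    theta = np.array([0.3, 0.5, 0.7])
    rng = np.random.default_rng(1)
    pull = lambda a: float(rng.random() < theta[a])
    delta = 0.2                                  # smallest gap theta_{a*} - theta_a
    # fixed epsilon (linear regret) vs decaying eps_t = min(1, K / (t * delta^2))
    fixed = epsilon_greedy(pull, 3, 20_000, lambda t: 0.1)
    decay = epsilon_greedy(pull, 3, 20_000, lambda t: min(1.0, 3 / (t * delta**2)))
    print("suboptimal plays, fixed eps:", sum(a != 2 for a in fixed))
    print("suboptimal plays, eps_t    :", sum(a != 2 for a in decay))
```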

26 Algorithms: Optimism in the Face of Uncertainty
Upper Confidence Bound (UCB) algorithm: b_a(t) = ˆθ_a(t) + √(2 log(t) / N_a(t))
ˆθ_a(t): empirical reward of a up to t; N_a(t): number of times a has been played up to t.
In each round t, select the arm with the highest index b_a(t).
Under UCB, the number of times a ≠ a* is selected satisfies:
E[N_a(T)] ≤ 8 log(T) / (θ_{a*} − θ_a)² + π²/3
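
A minimal sketch of the UCB index policy described above (illustrative names, Bernoulli rewards assumed):

```python
import numpy as np

def ucb(bandit_pull, K, T):
    """UCB sketch: play each arm once, then the arm maximising
    b_a(t) = hat{theta}_a(t) + sqrt(2 log(t) / N_a(t))."""
    counts = np.zeros(K)   # N_a(t)
    means = np.zeros(K)    # hat{theta}_a(t)
    arms = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                                        # initialisation: each arm once
        else:
            a = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
        r = bandit_pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]               # running empirical mean
        arms.append(a)
    return arms

if __name__ == "__main__":
    theta = np.array([0.3, 0.5, 0.7])
    rng = np.random.default_rng(1)
    pull = lambda a: float(rng.random() < theta[a])
    arms = ucb(pull, 3, 20_000)
    print("suboptimal plays under UCB:", sum(a != 2 for a in arms))
```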

27 Regret analysis of UCB (Auer et al. 2002)
Let ˆθ_{a,s} denote the empirical mean reward of arm a after its first s plays, and fix an integer l ≥ 1. Then:
N_a(T) = 1 + Σ_{t=K+1}^T 1{a(t) = a}
≤ l + Σ_{t=K+1}^T 1{a(t) = a, N_a(t) ≥ l}
≤ l + Σ_{t=K+1}^T 1{ b_{a*}(t−1) ≤ b_a(t−1), N_a(t) ≥ l }
≤ l + Σ_{t=K+1}^T 1{ min_{0<s<t} ( ˆθ_{a*,s} + √(2 log(t−1)/s) ) ≤ max_{l≤s'<t} ( ˆθ_{a,s'} + √(2 log(t−1)/s') ) }
≤ l + Σ_{t=K+1}^T Σ_{s<t, l≤s'<t} 1{ ˆθ_{a*,s} + √(2 log(t−1)/s) ≤ ˆθ_{a,s'} + √(2 log(t−1)/s') }

28 Regret analysis of UCB (Auer et al. 2002)
ˆθ_{a*,s} + √(2 log(t−1)/s) ≤ ˆθ_{a,s'} + √(2 log(t−1)/s') implies at least one of:
A: ˆθ_{a*,s} ≤ θ_{a*} − √(2 log(t−1)/s), or
B: ˆθ_{a,s'} ≥ θ_a + √(2 log(t−1)/s'), or
C: θ_{a*} < θ_a + 2 √(2 log(t−1)/s')
Hoeffding's inequality yields P[A] ≤ t^{−4} and P[B] ≤ t^{−4}. For l = ⌈8 log(T)/(θ_{a*} − θ_a)²⌉ and s' ≥ l, C does not happen. We conclude:
E[N_a(T)] ≤ 8 log(T)/(θ_{a*} − θ_a)² + Σ_{t≥1} Σ_{s,s'=1}^t 2 t^{−4} ≤ 8 log(T)/(θ_{a*} − θ_a)² + π²/3

29 Algorithms
KL-UCB algorithm: b_a(t) = max{ q ≤ 1 : N_a(t) KL(ˆθ_a(t), q) ≤ f(t) }, where f(t) = log(t) + 3 log log(t) sets the confidence level.
In each round t, select the arm with the highest index b_a(t).
Under KL-UCB, the number of times a ≠ a* is selected satisfies: for all δ < θ_{a*} − θ_a,
E[N_a(T)] ≤ log(T) / KL(θ_a + δ, θ_{a*}) + C log log(T) + δ^{−2}
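
The KL-UCB index has no closed form, but since q ↦ KL(ˆθ_a(t), q) is increasing for q ≥ ˆθ_a(t), it can be computed by bisection. A sketch with illustrative function names, assuming Bernoulli rewards:

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """KL(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)) between Bernoulli distributions."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mean, n_pulls, t, iters=30):
    """b_a(t) = max{ q <= 1 : N_a(t) KL(hat{theta}_a(t), q) <= f(t) }, f(t) = log t + 3 log log t,
    computed by bisection on [mean, 1] where KL(mean, .) is increasing."""
    f_t = np.log(t) + 3 * np.log(max(np.log(t), 1e-12))
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n_pulls * kl_bernoulli(mean, mid) <= f_t:
            lo = mid
        else:
            hi = mid
    return lo

def kl_ucb(bandit_pull, K, T):
    """Same loop as UCB, with the KL-UCB index; bandit_pull(a) returns a reward in [0, 1]."""
    counts, means, arms = np.zeros(K), np.zeros(K), []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1
        else:
            a = int(np.argmax([kl_ucb_index(means[k], counts[k], t) for k in range(K)]))
        r = bandit_pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        arms.append(a)
    return arms
```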

30 Lecture 5: Outline 1. Classifying bandit problems 2. Regret lower bounds 3. Algorithms based on the optimism in the face of uncertainty principle 4. Thompson Sampling algorithm 5. Structured bandits 30

31 Algorithms
Bayesian framework: put a prior distribution on the parameters θ.
Example: Bernoulli rewards with a uniform prior on [0, 1]; we observed p successes ("1") and q failures ("0"). Then θ ~ Beta(p + 1, q + 1), i.e., the posterior density is proportional to θ^p (1 − θ)^q.
Thompson Sampling algorithm: assume that at round t, arm a has had p_a(t) successes and q_a(t) failures. Sample b_a(t) ~ Beta(p_a(t) + 1, q_a(t) + 1). The algorithm selects the arm a with the highest b_a(t).
Under Thompson Sampling, for any suboptimal arm a, we have:
lim sup_{T→∞} E[N_a(T)] / log(T) = 1 / KL(θ_a, θ_{a*})
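
A minimal sketch of Thompson Sampling for Bernoulli rewards with uniform (Beta(1,1)) priors, as described above; names are illustrative.

```python
import numpy as np

def thompson_sampling(bandit_pull, K, T, seed=0):
    """Thompson Sampling sketch: sample b_a(t) ~ Beta(p_a(t)+1, q_a(t)+1)
    and play the arm with the highest sample."""
    rng = np.random.default_rng(seed)
    successes = np.zeros(K)   # p_a(t)
    failures = np.zeros(K)    # q_a(t)
    arms = []
    for _ in range(T):
        samples = rng.beta(successes + 1, failures + 1)
        a = int(np.argmax(samples))
        r = bandit_pull(a)
        successes[a] += r
        failures[a] += 1 - r
        arms.append(a)
    return arms

if __name__ == "__main__":
    theta = np.array([0.3, 0.5, 0.7])
    rng = np.random.default_rng(1)
    pull = lambda a: float(rng.random() < theta[a])
    arms = thompson_sampling(pull, 3, 20_000)
    print("suboptimal plays under Thompson Sampling:", sum(a != 2 for a in arms))
```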

32 Illustration: UCB vs. KL-UCB 32

33 Illustration: Thompson Sampling 33

34 Performance 34

35 Lecture 5: Outline 1. Classifying bandit problems 2. Regret lower bounds 3. Algorithms based on the optimism in the face of uncertainty principle 4. Thompson Sampling algorithm 5. Structured bandits 35

36 Structured bandits
Unstructured bandits: best possible regret scales as K log(T). How can we exploit a known structure to speed up the learning process?
Structure: the decision maker knows that θ ∈ Θ, where Θ is not a hyper-rectangle.
Examples: the average reward is a (convex, unimodal, Lipschitz, ...) function of the arm, e.g., Θ = {θ : a ↦ θ_a unimodal}

37 Regret lower bound
Theorem (Application of Graves-Lai 1997). For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ c(θ)
where c(θ) is the minimal value of:
inf_{n_a ≥ 0, a ≠ a*} Σ_{a ≠ a*} n_a (θ_{a*} − θ_a)
s.t. inf_{λ ∈ B(θ)} Σ_{a ≠ a*} n_a KL(θ_a, λ_a) ≥ 1,
and B(θ) = {λ ∈ Θ : a* is not optimal under λ, λ_{a*} = θ_{a*}}.

38 Graphically unimodal bandits Arms = vertices of a known graph G 38

39 Graphically unimodal bandits Arms = vertices of a known graph G. The average rewards are G-unimodal: from any vertex, there is a path in the graph along which the average rewards are increasing. Notation: θ ∈ Θ_G 39

40 G-unimodal bandits: Regret lower bound
N(k): set of neighbours of k in G.
Theorem. For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ c_G(θ) = Σ_{k ∈ N(a*)} (θ_{a*} − θ_k) / KL(θ_k, θ_{a*})

41 G-unimodal bandits: Optimal algorithm
Defined through the maximum degree γ of G, the empirical means ˆθ_a(t), the leader L(t) (the arm with the highest empirical mean), the number of times l_a(t) each arm has been the leader, and the KL-UCB indexes:
b_a(t) = sup{ q : N_a(t) KL(ˆθ_a(t), q) ≤ log(l_{L(t)}(t)) }
Algorithm: OAS (Optimal Action Sampling)
1. For t = 1, ..., K, select action a(t) = t (each arm once)
2. For t ≥ K + 1, select action a(t) = L(t) if (l_{L(t)}(t) − 1)/(γ + 1) ∈ ℕ, and a(t) ∈ argmax_{k ∈ N(L(t))} b_k(t) otherwise
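
One possible reading of OAS as code, under the assumptions that the leader L(t) is the arm with the largest empirical mean and that γ is the maximum degree of G; this is a sketch of the slide with illustrative names, not the authors' reference implementation.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_index(mean, n_pulls, budget, iters=30):
    """sup{ q : n_pulls * KL(mean, q) <= budget }, computed by bisection."""
    if n_pulls == 0:
        return 1.0
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n_pulls * kl_bernoulli(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def oas(bandit_pull, neighbors, T):
    """OAS sketch on a graph given as neighbors[k] = list of neighbours of k.
    Assumes L(t) = empirically best arm and gamma = maximum degree of G."""
    K = len(neighbors)
    gamma = max(len(nb) for nb in neighbors)
    counts, means = np.zeros(K), np.zeros(K)
    leader_counts = np.zeros(K)        # l_a(t): rounds in which a has been the leader
    arms = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1                  # play each arm once
        else:
            leader = int(np.argmax(means))
            leader_counts[leader] += 1
            if (leader_counts[leader] - 1) % (gamma + 1) == 0:
                a = leader             # exploit the leader on a fixed fraction of its leadership rounds
            else:                      # otherwise run KL-UCB restricted to the leader's neighbours
                budget = np.log(leader_counts[leader])
                cand = neighbors[leader]
                a = cand[int(np.argmax([kl_index(means[k], counts[k], budget) for k in cand]))]
        r = bandit_pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        arms.append(a)
    return arms
```

On a line graph (neighbors = [[1], [0, 2], [1], ...]) with unimodal means, the sketch only explores among the neighbours of the current leader, which is why the asymptotic regret involves only the arms in N(a*), as in the lower bound of the previous slide.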

42 G-unimodal bandits: OAS optimality Theorem. For any θ ∈ Θ_G: lim sup_{T→∞} R^{OAS}(T) / log(T) ≤ c_G(θ) 42

43 Rate adaptation in 802.11 a/b/g
Goal: adapt the modulation scheme to the channel quality. Select the rate only.
rates: r_1 < r_2 < ... < r_K; success probabilities: µ_1 ≥ µ_2 ≥ ... ≥ µ_K; throughputs: θ_1, ..., θ_K with θ_k = r_k µ_k
Structure: unimodality of k ↦ θ_k and monotonicity µ_1 ≥ µ_2 ≥ ... ≥ µ_K

44 Rate adaptation in 802.11 n/ac Select the rate and a MIMO mode. Structure: example with two modes, unimodality w.r.t. a graph G 44

45 802.11g stationary channels Smooth throughput decay w.r.t. rate 45

46 802.11g stationary channels Steep throughput decay w.r.t. rate 46

47 802.11g nonstationary channels Traces 47

48 802.11g nonstationary channels Performance of OAS with sliding window 48

49 References: Discrete unstructured bandits
- Thompson, On the likelihood that one unknown probability exceeds another in view of the evidence of two samples, 1933
- Robbins, Some aspects of the sequential design of experiments, 1952
- Lai and Robbins, Asymptotically efficient adaptive allocation rules, 1985
- Lai, Adaptive treatment allocation and the multi-armed bandit problem, 1987
- Gittins, Bandit Processes and Dynamic Allocation Indices, 1989
- Auer, Cesa-Bianchi and Fischer, Finite-time analysis of the multiarmed bandit problem, 2002
- Garivier and Moulines, On upper-confidence bound policies for non-stationary bandit problems, 2008
- Slivkins and Upfal, Adapting to a changing environment: the Brownian restless bandits

50 References
- Garivier and Cappé, The KL-UCB algorithm for bounded stochastic bandits and beyond, 2011
- Honda and Takemura, An Asymptotically Optimal Bandit Algorithm for Bounded Support Models, 2010
Discrete structured bandits
- Anantharam, Varaiya, and Walrand, Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays, 1987
- Graves and Lai, Asymptotically efficient adaptive choice of control laws in controlled Markov chains, 1997
- György, Linder, Lugosi and Ottucsák, The on-line shortest path problem under partial monitoring, 2007
- Yu and Mannor, Unimodal bandits, 2011
- Cesa-Bianchi and Lugosi, Combinatorial bandits

51 References
- Chen, Wang and Yuan, Combinatorial multi-armed bandit: General framework and applications, 2013
- Combes and Proutiere, Unimodal bandits: Regret lower bounds and optimal algorithms, 2014
- Magureanu, Combes, and Proutiere, Lipschitz bandits: Regret lower bounds and optimal algorithms, 2014
Thompson sampling
- Chapelle and Li, An Empirical Evaluation of Thompson Sampling, 2011
- Kaufmann, Korda and Munos, Thompson Sampling: an asymptotically optimal finite-time analysis, 2012
- Korda, Kaufmann and Munos, Thompson Sampling for one-dimensional exponential family bandits, 2013
- Agrawal and Goyal, Further optimal regret bounds for Thompson Sampling

52 Lecture 5: Appendix Adversarial bandits 52

53 Adversarial Optimisation
- Finite set of actions A = {1, ..., K}
- Unknown and arbitrary rewards of action a ∈ A: (r_t(a), t ≥ 0), decided by an adversary at time 0
- Best empirical action: a* ∈ argmax_a Σ_{t=1}^T r_t(a)
- Online policy π: select action a^π_t at time t depending on:
  - Expert setting: (r_1(a), ..., r_{t−1}(a))_{a=1,...,K}
  - Bandit setting: a^π_1, r_1(a^π_1), ..., a^π_{t−1}, r_{t−1}(a^π_{t−1})
- Regret up to time T: R^π(T) = Σ_{t=1}^T r_t(a*) − E[Σ_{t=1}^T r_t(a^π_t)]

54 Expert Setting
Let S_t(a) = Σ_{n=1}^t r_n(a) be the cumulative reward of arm a.
Multiplicative update algorithm (Littlestone-Warmuth 1994): select arm a with probability
p_t(a) = e^{η S_{t−1}(a)} / Σ_b e^{η S_{t−1}(b)}
Regret: the multiplicative update algorithm is a no-regret algorithm: for all T,
R^π(T) ≤ Tη/8 + log(K)/η
For η = √(8 log(K)/T), R^π(T) ≤ √(log(K) T / 2)
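
A sketch of the exponential-weights (multiplicative update) algorithm in the expert setting, where the whole reward vector is observed each round; names are illustrative and rewards are assumed to lie in [0, 1].

```python
import numpy as np

def multiplicative_update_full_info(rewards, eta):
    """Exponential-weights sketch for the expert setting: `rewards` is a T x K array of
    r_t(a) in [0, 1], and the whole reward vector is observed after each round."""
    T, K = rewards.shape
    cum = np.zeros(K)                          # S_{t-1}(a)
    expected_reward = 0.0
    for t in range(T):
        w = np.exp(eta * (cum - cum.max()))    # subtract the max for numerical stability
        p = w / w.sum()                        # p_t(a) proportional to exp(eta * S_{t-1}(a))
        expected_reward += p @ rewards[t]
        cum += rewards[t]                      # full information: every S_t(a) is updated
    return rewards.sum(axis=0).max() - expected_reward   # regret against the best fixed arm

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K = 10_000, 5
    rewards = rng.random((T, K)) * np.linspace(0.4, 1.0, K)   # some arbitrary reward sequence
    eta = np.sqrt(8 * np.log(K) / T)
    print("regret:", multiplicative_update_full_info(rewards, eta),
          " bound:", np.sqrt(np.log(K) * T / 2))
```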

55 Bandit Setting
Let S_t(a) = Σ_{n=1}^t r_n(a) be the cumulative reward of arm a.
Building an unbiased estimator of S_t(a): Ŝ_t(a) = Σ_{n=1}^t X_n(a), where X_t(a) = (r_t(a)/p_t(a)) 1_{a^π(t)=a}
Multiplicative update algorithm (Littlestone-Warmuth 1994): select arm a with probability
p_t(a) = e^{η Ŝ_{t−1}(a)} / Σ_b e^{η Ŝ_{t−1}(b)}
Regret: the multiplicative update algorithm is a no-regret algorithm: for all T,
R^π(T) ≤ TKη/2 + log(K)/η
For η = √(2 log(K)/(KT)), R^π(T) ≤ √(2 K log(K) T)
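
The same multiplicative update with bandit feedback, using the importance-weighted estimates Ŝ_t(a) defined above; again a sketch with illustrative names, assuming rewards in [0, 1].

```python
import numpy as np

def multiplicative_update_bandit(reward_fn, K, T, eta, seed=0):
    """Bandit-feedback sketch: exponential weights on the importance-weighted estimates
    X_t(a) = (r_t(a) / p_t(a)) 1{a_t = a}, so that hat{S}_t(a) is unbiased for S_t(a)."""
    rng = np.random.default_rng(seed)
    s_hat = np.zeros(K)                        # hat{S}_{t-1}(a)
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (s_hat - s_hat.max()))
        p = w / w.sum()
        a = int(rng.choice(K, p=p))
        r = reward_fn(t, a)                    # only the reward of the played arm is observed
        s_hat[a] += r / p[a]                   # importance-weighted update
        total += r
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    theta = np.array([0.3, 0.5, 0.7])
    reward_fn = lambda t, a: float(rng.random() < theta[a])   # an i.i.d. "adversary"
    T, K = 50_000, 3
    eta = np.sqrt(2 * np.log(K) / (K * T))
    print("average reward:", multiplicative_update_bandit(reward_fn, K, T, eta) / T)
```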

56 Adversarial Convex Bandits At the beginning of each year, Volvo has to select a vector x (in a convex set) representing the relative efforts in producing various models (S60, V70, V90,...). The reward is an arbitrarily varying and unknown concave function of x. How to maximise reward over say 50 years? 56

62 Adversarial Convex Bandits
- Continuous set of actions A = [0, 1]
- (Unknown) arbitrary but concave rewards of action x ∈ A: r_t(x)
- Online policy π: select action x^π_t at time t depending on x^π_1, r_1(x^π_1), ..., x^π_{t−1}, r_{t−1}(x^π_{t−1})
- Regret up to time T (defined w.r.t. the best empirical action up to time T): R^π(T) = max_{x ∈ [0,1]} Σ_{t=1}^T r_t(x) − Σ_{t=1}^T r_t(x^π_t)
- Can we do something smart at all? Achieve a sublinear regret?

63 Adversarial Convex Bandits
If r_t(·) = r(·) for all t, and if r(·) were known, we could apply a gradient ascent algorithm.
One-point gradient estimate: let f̂(x) = E_{v ∈ B}[f(x + δv)] with B = {x : ||x||_2 ≤ 1}. Then
E_{u ∈ S}[f(x + δu) u] = δ ∇f̂(x), where S = {x : ||x||_2 = 1}
Simulated Gradient Ascent algorithm: at each step t, do
- choose u_t uniformly at random in S
- play x_t = y_t + δ u_t and observe r_t(x_t)
- update y_{t+1} = y_t + α r_t(x_t) u_t
Regret: R(T) = O(T^{5/6})
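
A sketch of the simulated gradient ascent idea in dimension 1 (A = [0, 1]); the step sizes used below are illustrative rather than the tuned choices behind the O(T^{5/6}) regret bound, and all names are hypothetical.

```python
import numpy as np

def simulated_gradient_ascent(reward_fn, T, delta, alpha, seed=0):
    """One-point gradient sketch in dimension 1 (A = [0, 1]): perturb the current point y_t,
    play x_t = y_t + delta * u_t with u_t uniform in {-1, +1}, and move along
    alpha * r_t(x_t) * u_t, which in expectation follows the gradient of the smoothed reward."""
    rng = np.random.default_rng(seed)
    y = 0.5
    plays = []
    for t in range(T):
        u = float(rng.choice([-1.0, 1.0]))           # uniform on the unit sphere S (1-D case)
        x = float(np.clip(y + delta * u, 0.0, 1.0))
        r = reward_fn(t, x)                          # bandit feedback: only r_t(x_t) is observed
        y = float(np.clip(y + alpha * r * u, delta, 1.0 - delta))
        plays.append(x)
    return plays

if __name__ == "__main__":
    # a fixed concave reward (unknown to the learner), maximised at x = 0.7
    reward_fn = lambda t, x: 1.0 - (x - 0.7) ** 2
    T = 20_000
    plays = simulated_gradient_ascent(reward_fn, T, delta=T ** (-1 / 6), alpha=T ** (-2 / 3))
    print("average action over the last rounds:", np.mean(plays[-1000:]))
```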
