Multi-armed Bandits and the Gittins Index


1 Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge A talk to accompany Lecture 6

2 Two-armed Bandit
Arm 1: 3, 10, 4, 9, 12, 1, ...
Arm 2: 5, 6, 2, 15, 2, 7, ...

3 Two-armed Bandit
Pull arm 2 first and collect the reward 5; the remaining unobserved sequences are 3, 10, 4, 9, 12, 1, ... and 6, 2, 15, 2, 7, ...

4 Two-armed Bandit
Pull arm 2 again and collect the reward 6; the rewards collected so far are 5, 6.

5 Two-armed Bandit
Continuing in this way (arm 2 twice, arm 1 five times, arm 2 twice more), the rewards collected are 5, 6, 3, 10, 4, 9, 12, 2, 15, leaving 1, ... on arm 1 and 2, 7, ... on arm 2, so that
$$\text{Reward} = 5 + 6\beta + 3\beta^2 + 10\beta^3 + \cdots, \qquad 0 < \beta < 1.$$
Of course, in practice we must choose which arms to pull without knowing the future sequences of rewards.
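To make the objective concrete, here is a tiny Python sketch (the variable names and the value of β are my own illustrative choices) that evaluates the discounted reward of this particular pull sequence:

```python
beta = 0.9  # discount factor, 0 < beta < 1 (illustrative value)

# Rewards collected above: arm 2 twice, then arm 1 five times, then arm 2 twice.
rewards = [5, 6, 3, 10, 4, 9, 12, 2, 15]

# Reward = 5 + 6*beta + 3*beta**2 + 10*beta**3 + ...
total = sum(x * beta**t for t, x in enumerate(rewards))
print(round(total, 3))
```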

6 Dynamic Effort Allocation Research projects: how should I allocate my research time amongst my favorite open problems so as to maximize the value of my completed research? Job Scheduling: in what order should I work on the tasks in my in-tray? Searching for information: shall I spend more time browsing the web, or go to the library, or ask a friend? Dating strategy: should I contact a new prospect, or try another date with someone I have dated before?

7 Information vs. Immediate Payoff In all these problems one wishes to learn about the effectiveness of alternative strategies, while simultaneously wishing to use the best strategy in the short term. "Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff." (Peter Whittle, 1989) Exploration versus exploitation.

8 Multi-armed Bandit Allocation Indices, 2nd edition, 11 March 2011, Gittins, Glazebrook and Weber.

9 Clinical Trials

10 Bernoulli Bandits One of $N$ drugs is to be administered at each of times $t = 0, 1, \ldots$. The $s$th time drug $i$ is used it is successful, $X_i(s) = 1$, or unsuccessful, $X_i(s) = 0$, with $P(X_i(s) = 1) = \theta_i$. $X_i(1), X_i(2), \ldots$ are i.i.d. samples. $\theta_i$ is unknown, but has a prior distribution, perhaps uniform on $[0,1]$: $f(\theta_i) = 1$, $0 \le \theta_i \le 1$.

11 Bernoulli Bandits Having seen $s_i$ successes and $f_i$ failures, the posterior is
$$f(\theta_i \mid s_i, f_i) = \frac{(s_i+f_i+1)!}{s_i!\, f_i!}\, \theta_i^{s_i} (1-\theta_i)^{f_i}, \qquad 0 \le \theta_i \le 1,$$
with mean $(s_i+1)/(s_i+f_i+2)$. We wish to maximize the expected total discounted number of successes.
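As a quick check of these formulas, a minimal Python sketch (the function names are my own):

```python
from math import factorial

def posterior_mean(s, f):
    """Posterior mean of theta after s successes and f failures (uniform prior)."""
    return (s + 1) / (s + f + 2)

def posterior_density(theta, s, f):
    """Posterior density f(theta | s, f), i.e. Beta(s + 1, f + 1), on [0, 1]."""
    norm = factorial(s + f + 1) / (factorial(s) * factorial(f))
    return norm * theta**s * (1 - theta)**f

print(posterior_mean(2, 3))          # 3/7, about 0.4286
print(posterior_density(0.5, 2, 3))
```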

12 Multi-armed Bandit $N$ independent arms, with known states $x_1(t), \ldots, x_N(t)$. At each time $t \in \{0, 1, 2, \ldots\}$, one arm is to be activated (pulled/continued). If arm $i$ is activated then it changes state, $x \to y$ with probability $P_i(x, y)$, and produces reward $r_i(x_i(t))$. The other arms are to be passive (not pulled/frozen). Objective: maximize the expected total $\beta$-discounted reward
$$E\left[\sum_{t=0}^{\infty} r_{i_t}(x_{i_t}(t))\, \beta^t\right],$$
where $i_t$ is the arm pulled at time $t$ ($0 < \beta < 1$).

13 Dynamic Programming Solution The dynamic programming equation is
$$F(x_1, \ldots, x_N) = \max_i \left\{ r_i(x_i) + \beta \sum_y P_i(x_i, y)\, F(x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_N) \right\}.$$
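This equation can be solved directly by value iteration on the joint state space, although that space grows as the product of the arms' state spaces. A minimal sketch for a toy two-arm instance (the transition matrices, rewards and iteration count are my own illustrative choices):

```python
import itertools
import numpy as np

# Toy instance: two arms, each with two states.
# P[i] is arm i's transition matrix when pulled; r[i] its reward vector.
P = [np.array([[0.7, 0.3], [0.4, 0.6]]),
     np.array([[0.5, 0.5], [0.2, 0.8]])]
r = [np.array([1.0, 0.0]), np.array([0.6, 0.3])]
beta = 0.9

states = list(itertools.product(range(2), range(2)))  # joint states (x1, x2)
F = {x: 0.0 for x in states}

# Iterate the dynamic programming equation to (approximate) convergence.
for _ in range(500):
    F_new = {}
    for x in states:
        best = -np.inf
        for i in range(2):             # candidate arm to pull
            xi = x[i]
            cont = 0.0
            for y in range(2):         # arm i moves to y; the other arm is frozen
                nxt = list(x)
                nxt[i] = y
                cont += P[i][xi, y] * F[tuple(nxt)]
            best = max(best, r[i][xi] + beta * cont)
        F_new[x] = best
    F = F_new

print({x: round(v, 3) for x, v in F.items()})
```

With many arms this direct approach quickly becomes impractical, which is what makes the index result below so useful.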

14 Single Machine Scheduling $N$ jobs are to be processed successively on one machine. Job $i$ has a known processing time $t_i$, a positive integer. On completion of job $i$ a reward $r_i$ is obtained. If job 1 is processed immediately before job 2 the sum of discounted rewards from the two jobs is $r_1 \beta^{t_1} + r_2 \beta^{t_1+t_2}$. Now
$$r_1 \beta^{t_1} + r_2 \beta^{t_1+t_2} > r_2 \beta^{t_2} + r_1 \beta^{t_2+t_1} \iff G_1 = \frac{(1-\beta)\, r_1 \beta^{t_1}}{1-\beta^{t_1}} > \frac{(1-\beta)\, r_2 \beta^{t_2}}{1-\beta^{t_2}} = G_2.$$
So total discounted reward is maximized by the index policy which processes jobs in decreasing order of the indices $G_i$.
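A small numerical check of this index policy on a toy instance (the rewards, processing times and function names are my own choices), comparing it against brute force over all orderings:

```python
from itertools import permutations

def index_G(r, t, beta):
    """Index of a job with reward r and processing time t."""
    return (1 - beta) * r * beta**t / (1 - beta**t)

def discounted_reward(order, r, t, beta):
    """Total discounted reward when the jobs are processed in the given order."""
    total, elapsed = 0.0, 0
    for i in order:
        elapsed += t[i]
        total += r[i] * beta**elapsed
    return total

r = [10, 7, 3, 8]   # rewards (illustrative)
t = [4, 2, 1, 5]    # integer processing times (illustrative)
beta = 0.9

index_order = sorted(range(len(r)), key=lambda i: -index_G(r[i], t[i], beta))
best = max(permutations(range(len(r))), key=lambda o: discounted_reward(o, r, t, beta))

print(index_order, round(discounted_reward(index_order, r, t, beta), 4))
print(list(best), round(discounted_reward(best, r, t, beta), 4))  # same value
```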

15 Gittins Index Solution Theorem [Gittins, 74, 79, 89] Reward is maximized by always continuing the bandit having greatest value of the dynamic allocation index
$$G_i(x_i) = \sup_{\tau \ge 1} \frac{E\left[\sum_{t=0}^{\tau-1} r_i(x_i(t))\, \beta^t \,\middle|\, x_i(0) = x_i\right]}{E\left[\sum_{t=0}^{\tau-1} \beta^t \,\middle|\, x_i(0) = x_i\right]},$$
where $\tau$ is a (past-measurable) stopping time. $G_i(x_i)$ is called the Gittins index.

16 Gittins Index
$$G_i(x_i) = \sup_{\tau \ge 1} \frac{E\left[\sum_{t=0}^{\tau-1} r_i(x_i(t))\, \beta^t \,\middle|\, x_i(0) = x_i\right]}{E\left[\sum_{t=0}^{\tau-1} \beta^t \,\middle|\, x_i(0) = x_i\right]} = \frac{\text{discounted reward up to } \tau}{\text{discounted time up to } \tau}.$$
In the scheduling problem $\tau = t_i$ and
$$G_i = \frac{r_i \beta^{t_i}}{1 + \beta + \cdots + \beta^{t_i - 1}} = \frac{(1-\beta)\, r_i \beta^{t_i}}{1-\beta^{t_i}}.$$

17 Calibration Alternatively,
$$G_i(x_i) = \sup\left\{ \lambda : \sum_{t=0}^{\infty} \beta^t \lambda \le \sup_{\tau \ge 1} E\left[\sum_{t=0}^{\tau-1} \beta^t r_i(x_i(t)) + \sum_{t=\tau}^{\infty} \beta^t \lambda \,\middle|\, x_i(0) = x_i\right]\right\}.$$
That is, we consider a problem with two bandit processes: bandit process $B_i$ and a calibrating bandit process, say $\Lambda$, which pays out a known reward $\lambda$ at each step it is continued. The Gittins index of $B_i$ is the value of $\lambda$ for which we are indifferent as to which of $B_i$ and $\Lambda$ to continue initially. Notice that once we decide, at time $\tau$, to switch from continuing $B_i$ to continuing $\Lambda$, then information about $B_i$ does not change, and so it must be optimal to stick with continuing $\Lambda$ ever after.

18 Gittins Indices for Bernoulli Bandits, $\beta = 0.9$ [The slide shows a table of index values indexed by $s$ and $f$.]
$(s_1, f_1) = (2, 3)$: posterior mean $= 3/7 \approx 0.4286$, index $=$ …
$(s_2, f_2) = (6, 7)$: posterior mean $= 7/15 \approx 0.4667$, index $=$ …
So we prefer to use drug 1 next, even though it has the smaller probability of success.
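The index values omitted above can be approximated numerically. Here is a minimal Python sketch, assuming the calibration formulation of slide 17 with a truncated horizon (the function name, horizon and tolerance are my own choices), which estimates the Gittins index of a Bernoulli bandit in posterior state $(s, f)$:

```python
import numpy as np

def gittins_index_bernoulli(s, f, beta=0.9, horizon=200, tol=1e-6):
    """Approximate Gittins index of a Bernoulli arm in posterior state (s, f),
    with a uniform prior, by bisection on the calibrating reward lambda."""

    def stopping_value(lam):
        # Optimal stopping value of playing the arm against a constant charge lam:
        #   Q(s,f) = max(0, m - lam + beta*[m*Q(s+1,f) + (1-m)*Q(s,f+1)]),
        # computed by backward induction, truncated `horizon` observations ahead.
        Q = np.zeros(horizon + 1)
        for d in range(horizon - 1, -1, -1):
            newQ = np.zeros(d + 1)
            for i in range(d + 1):
                si, fi = s + i, f + (d - i)
                m = (si + 1) / (si + fi + 2)   # posterior mean of theta
                cont = m - lam + beta * (m * Q[i + 1] + (1 - m) * Q[i])
                newQ[i] = max(0.0, cont)
            Q = newQ
        return Q[0]

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stopping_value(mid) > 0:   # still profitable to play on: index exceeds mid
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for state in [(2, 3), (6, 7)]:
        print(state, round(gittins_index_bernoulli(*state), 4))
```

Because of the truncation the values are slight underestimates, but they should confirm that the $(2, 3)$ arm has the larger index despite its smaller posterior mean.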

19 Gittins Index Theorem is Surprising Peter Whittle tells the story: A colleague of high repute asked an equally well-known colleague: "What would you say if you were told that the multi-armed bandit problem had been solved?" "Sir, the multi-armed bandit problem is not of such a nature that it can be solved."

20 Proofs of the Index Theorem Since Gittins (1974, 1979), many researchers have reproved, remodelled and resituated the index theorem: Beale (1979); Karatzas (1984); Varaiya, Walrand, Buyukkoc (1985); Chen, Katehakis (1986); Kallenberg (1986); Katehakis, Veinott (1986); Eplett (1986); Kertz (1986); Tsitsiklis (1986); Mandelbaum (1986, 1987); Lai, Ying (1988); Whittle (1988); Weber (1992); El Karoui, Karatzas (1993); Ishikida and Varaiya (1994); Tsitsiklis (1994); Bertsimas, Niño-Mora (1996); Glazebrook, Garbe (1996); Kaspi, Mandelbaum (1998); Bäuerle, Stidham (2001); Dumitriu, Tetali, Winkler (2003).

21 Proof of the Index Theorem Start with a problem in which only bandit process $B_i$ is available. Define the fair charge, $\gamma_i(x_i)$, as the maximum amount that a gambler would be willing to pay per step to be permitted to continue $B_i$ for at least one more step, with subsequent freedom to stop continuing it whenever he likes thereafter. This is
$$\gamma_i(x_i) = \sup\left\{ \lambda : 0 \le \sup_{\tau \ge 1} E\left[\sum_{t=0}^{\tau-1} \beta^t \big(r_i(x_i(t)) - \lambda\big) \,\middle|\, x_i(0) = x_i\right]\right\}.$$
It is easy to check that $\gamma_i(x_i) = G_i(x_i)$. The stopping time $\tau$ will be the first time that $G_i(x_i(\tau)) < G_i(x_i(0))$, i.e. the first time that the fair charge becomes too expensive. The gambler would rather stop than pay this charge.

22 Prevailing charges Suppose that when $G_i(x_i(\tau)) < G_i(x_i(0))$, we reduce the charge to a level such that it remains just profitable for the gambler to keep on playing. This is called the prevailing charge, say $g_i(t) = \min_{s \le t} G_i(x_i(s))$. $g_i(t)$ is a nonincreasing function of $t$ and its value depends only on the states through which bandit $i$ evolves. Observation 1. Suppose that in the problem with $n$ alternative bandit processes, $B_1, \ldots, B_n$, the gambler not only collects $r_{i_t}(x_{i_t}(t))$, but also pays the prevailing charge $g_{i_t}(x_{i_t}(t))$ of the bandit $B_{i_t}$ that he chooses to continue at time $t$. Then he cannot do better than just break even (i.e. expected profit $\le 0$). This is because he could only make a strictly positive profit (in expected value) if this were to happen for at least one bandit. Yet the prevailing charge has been defined so that if he pays the prevailing charges he can only just break even.

23 Observation 2. He maximizes the expected discounted sum of the prevailing charges that he pays by always continuing the bandit with the greatest prevailing charge. This is because he thereby interleaves the $n$ nonincreasing sequences of prevailing charges $g_i$ into one nonincreasing sequence of prevailing charges. This way of interleaving them maximizes their discounted sum. For example, prevailing charges of
$g_1$: 10, 10, 9, 5, 5, 3, ...
$g_2$: 20, 15, 7, 4, 2, 2, ...
are best interleaved (so as to maximize the discounted charge paid) as
20, 15, 10, 10, 9, 7, 5, 5, 4, 3, 2, 2, ...
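A quick numerical check of this observation, using the two charge sequences above (a small sketch; the helper names and β are my own choices):

```python
from itertools import combinations

beta = 0.9
g1 = [10, 10, 9, 5, 5, 3]
g2 = [20, 15, 7, 4, 2, 2]

def discounted(seq, beta):
    """Discounted sum of a sequence of charges paid at times 0, 1, 2, ..."""
    return sum(c * beta**t for t, c in enumerate(seq))

def interleave(slots_for_g1):
    """Interleave g1 and g2, playing bandit 1 at the time slots in slots_for_g1."""
    i = j = 0
    out = []
    for t in range(len(g1) + len(g2)):
        if t in slots_for_g1:
            out.append(g1[i]); i += 1
        else:
            out.append(g2[j]); j += 1
    return out

# Greedy rule: always take the larger remaining prevailing charge.  Since both
# sequences are nonincreasing, this is the merged sequence in decreasing order.
greedy = sorted(g1 + g2, reverse=True)
print(greedy, round(discounted(greedy, beta), 3))

# Compare with every other way of interleaving the two sequences.
n = len(g1) + len(g2)
best = max(discounted(interleave(set(s)), beta) for s in combinations(range(n), len(g1)))
print(round(best, 3))  # equals the greedy value: no interleaving does better
```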

24 Observation 3. Using this strategy he also just breaks even (because once he starts continuing $B_i$ he does so until its prevailing charge decreases). So this strategy (of always continuing the bandit with the greatest $G_i(x_i)$) must maximize the expected discounted sum of the rewards that he can obtain from the bandits. I.e., for any policy, the definition of prevailing charges implies
$$0 \ge E\left[\sum_{t=0}^{\infty} \beta^t \big( r_{i_t}(x_{i_t}(t)) - g_{i_t}(x_{i_t}(t)) \big) \,\middle|\, x(0)\right]$$
and so
$$E\left[\sum_{t=0}^{\infty} \beta^t r_{i_t}(x_{i_t}(t)) \,\middle|\, x(0)\right] \le E\left[\sum_{t=0}^{\infty} \beta^t g_{i_t}(x_{i_t}(t)) \,\middle|\, x(0)\right].$$
But the right hand side is maximized by always continuing the bandit with greatest $g_i(x_i(t))$. This policy also achieves equality in the above, and so it must maximize the expected discounted sum of rewards (the left hand side).

25 Restless Bandits Spinning Plates

26 Restless Bandits [Whittle 88] Two actions are available: active ($a = 1$) or passive ($a = 0$). Rewards, $r(x, a)$, and transitions, $P(y \mid x, a)$, depend on the state and the action taken. Objective: maximize the time-average reward from $n$ restless bandits under a constraint that only $m$ ($m < n$) of them receive the active action simultaneously. Examples of active vs. passive actions: work (increasing fatigue) vs. rest (recovery); high speed vs. low speed; inspection vs. no inspection.

27 Questions
