Multi-armed Bandits and the Gittins Index

1 Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge Seminar at Microsoft, Cambridge, June 20, 2011

2 The Multi-armed Bandit Problem

4 Multi-armed Bandit Allocation Indices, 2nd edition, 11 March 2011, Gittins, Glazebrook and Weber

5 Two-armed Bandit. Arm 1: 3, 10, 4, 9, 12, 1, ... Arm 2: 5, 6, 2, 15, 2, 7, ...

16 Two-armed Bandit,,,,, 1,...,,,, 2, 7,... 5, 6, 3, 10, 4, 9, 12, 2, 15 Reward = β β 2 β 3 0 < β < 1. Of course, in practice we must choose which arms to pull without knowing the future sequences of rewards.

20 Dynamic Effort Allocation Research projects: how should I allocate my research time amongst my favorite open problems so as to maximize the value of my completed research? Job Scheduling: in what order should I work on the tasks in my in-tray? Searching for information: shall I spend more time browsing the web, or go to the library, or ask a friend? Dating strategy: should I contact a new prospect, or try another date with someone I have dated before?

23 Information vs. Immediate Payoff. In all these problems one wishes to learn about the effectiveness of alternative strategies, while simultaneously wishing to use the best strategy in the short term. "Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff." (Peter Whittle, 1989). Exploration versus exploitation.

24 Clinical Trials

29 Bernoulli Bandits. One of N drugs is to be administered at each of times t = 0, 1, .... The sth time drug i is used it is successful, X_i(s) = 1, or unsuccessful, X_i(s) = 0, with P(X_i(s) = 1) = θ_i. X_i(1), X_i(2), ... are i.i.d. samples. θ_i is unknown, but has a prior distribution, perhaps uniform on [0,1]: f(θ_i) = 1, 0 ≤ θ_i ≤ 1.

31 Bernoulli Bandits. Having seen s_i successes and f_i failures, the posterior is f(θ_i | s_i, f_i) = [(s_i + f_i + 1)! / (s_i! f_i!)] θ_i^{s_i} (1 − θ_i)^{f_i}, 0 ≤ θ_i ≤ 1, with mean (s_i + 1)/(s_i + f_i + 2). We wish to maximize the expected total discounted number of successes.

35 Multi-armed Bandit. N independent arms, with known states x_1(t), ..., x_N(t). At each time t ∈ {0, 1, 2, ...} one arm is to be activated (pulled/continued). If arm i is activated then it changes state, x → y with probability P_i(x, y), and produces reward r_i(x_i(t)). The other arms are passive (not pulled/frozen). Objective: maximize the expected total β-discounted reward E[ Σ_{t=0}^∞ r_{i_t}(x_{i_t}(t)) β^t ], where i_t is the arm pulled at time t and 0 < β < 1.

37 Dynamic Programming Solution. The dynamic programming equation is V(x_1, ..., x_N) = max_i { r_i(x_i) + β Σ_y P_i(x_i, y) V(x_1, ..., x_{i−1}, y, x_{i+1}, ..., x_N) }. If bandit i moves on a state space of size k_i, then (x_1, ..., x_N) moves on a state space of size Π_i k_i (exponential in N).

41 Gittins Index Solution. Theorem [Gittins, '74, '79, '89]. Reward is maximized by always continuing the bandit having greatest value of the dynamic allocation index G_i(x_i) = sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ], where τ is a (past-measurable) stopping time. One problem (on a state space of size Π_i k_i) becomes N problems (on state spaces of sizes k_1, ..., k_N). G_i(x_i) is called the Gittins index. It can be computed in time O(k_i^3).

44 Gittins Index. G_i(x_i) = sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ]. The numerator is the expected discounted reward up to τ; the denominator is the expected discounted time up to τ.
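
As an illustration (not part of the talk), here is a minimal numerical sketch of how a Gittins index could be computed directly from this definition for a single finite-state arm: for a candidate value g, solve the optimal stopping problem with per-step reward r(x) − g by value iteration, then locate G(x) by bisection as the value of g at which it is just worth playing at least once. This is not the O(k_i^3) algorithm mentioned above, and the arm at the bottom is made up purely for illustration.

```python
import numpy as np

def gittins_index(r, P, beta, x, tol=1e-9):
    """Gittins index of state x for an arm with reward vector r, transition
    matrix P and discount factor beta, via calibration: G(x) is the g at which
    sup_{tau >= 1} E[ sum_{t < tau} (r(x_t) - g) beta^t | x_0 = x ] = 0."""
    k = len(r)

    def stopping_value(g):
        # Optimal stopping value V_g(y) = max(0, r(y) - g + beta * (P V_g)(y)).
        V = np.zeros(k)
        while True:
            V_new = np.maximum(0.0, r - g + beta * P @ V)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Value when forced to play at least once from x, then stop optimally.
        return r[x] - g + beta * P[x] @ V

    lo, hi = r.min(), r.max()   # the index always lies between min r and max r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if stopping_value(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Made-up 3-state arm.
r = np.array([1.0, 0.3, 0.0])
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5],
              [0.0, 0.2, 0.8]])
print([round(gittins_index(r, P, beta=0.9, x=s), 4) for s in range(3)])
```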

47 Gittins Indices for Bernoulli Bandits, β = 0.9. [Table of index values, indexed by s and f, omitted.] (s_1, f_1) = (2, 3): posterior mean = 3/7 ≈ 0.4286. (s_2, f_2) = (6, 7): posterior mean = 7/15 ≈ 0.4667. The table gives the larger index to (s_1, f_1) = (2, 3), so we prefer to use drug 1 next, even though it has the smaller probability of success.
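
The same calibration-and-bisection idea gives a rough way to reproduce index values for the Bernoulli arm (a sketch of my own, not the method behind the slide's table): the state is (s, f), a pull earns the posterior mean (s+1)/(s+f+2), and the (s, f) lattice is truncated at s + f = N, beyond which the discounted contribution β^N is negligible.

```python
def bernoulli_gittins(s, f, beta=0.9, N=300, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm in posterior state (s, f)
    (uniform prior), by bisection on g over a lattice truncated at s + f = N."""

    def continue_value(g):
        # V[(s2, f2)] = value of the stopping problem with per-pull reward p - g,
        # computed backwards from level s2 + f2 = N down to s + f.
        V = {}
        for level in range(N, s + f - 1, -1):
            for s2 in range(s, level - f + 1):
                f2 = level - s2
                p = (s2 + 1) / (s2 + f2 + 2)            # posterior mean in (s2, f2)
                cont = p - g + beta * (p * V.get((s2 + 1, f2), 0.0)
                                       + (1 - p) * V.get((s2, f2 + 1), 0.0))
                V[(s2, f2)] = max(0.0, cont)
        p = (s + 1) / (s + f + 2)
        # Value when forced to pull at least once from (s, f), then stop optimally.
        return p - g + beta * (p * V.get((s + 1, f), 0.0)
                               + (1 - p) * V.get((s, f + 1), 0.0))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if continue_value(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(bernoulli_gittins(2, 3), bernoulli_gittins(6, 7))
```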

54 What has Happened Since 1989? The index theorem has become better known. Alternative proofs have been explored: playing golf with N balls; the achievable performance region approach. Many applications (economics, engineering, ...). Notions of indexation have been generalized: restless bandits.

56 Gittins Index Theorem has become Better Known. Peter Whittle tells the story: a colleague of high repute asked an equally well-known colleague: "What would you say if you were told that the multi-armed bandit problem had been solved?" "Sir, the multi-armed bandit problem is not of such a nature that it can be solved."

58 Proofs of the Index Theorem Since Gittins (1974, 1979), many researchers have reproved, remodelled and resituated the index theorem. Beale (1979) Karatzas (1984) Varaiya, Walrand, Buyukkoc (1985) Chen, Katehakis (1986) Kallenberg (1986) Katehakis, Veinott (1986) Eplett (1986) Kertz (1986) Tsitsiklis (1986) Mandelbaum (1986, 1987) Lai, Ying (1988) Whittle (1988) Weber (1992) El Karoui, Karatzas (1993) Ishikida and Varaiya (1994) Tsitsiklis (1994) Bertsimas, Niño-Mora (1996) Glazebrook, Garbe (1996) Kaspi, Mandelbaum (1998) Bäuerle, Stidham (2001) Dimitriu, Tetali, Winkler (2003)

61 Golf with N Balls [Dimitriu, Tetali, Winkler 03, W. 92]. N balls are strewn about a golf course at locations x_1, ..., x_N. Hitting ball i, which is in location x_i, costs c(x_i); the ball moves x_i → y with probability P(x_i, y), and goes in the hole with probability P(x_i, 0). Objective: minimize the expected total cost incurred up to sinking a first ball.

67 Golf with 1 Ball. Given the golfer's ball is in location x, let us offer him a prize of value g(x) if he eventually sinks the ball. Let us set this prize just great enough so that he can break even by playing one more stroke and then quitting thereafter whenever he likes: g(x) = fair prize. If the ball arrives at a location y from which g(x) is no longer great enough to motivate the golfer to continue playing, then, just as he is about to quit, we increase the prize to g(y), which becomes the new prevailing prize. Continue doing this until the ball is sunk. This presents the golfer with a fair game, and it is optimal for him to keep playing until the ball is sunk: E(cost incurred) = E(prize won).

68-71 Golf with 1 Ball. [Figures omitted: the ball's successive locations on the course.] Example: g(x) = 3.0 at the starting location; after one stroke the ball is at x′ with g(x′) = 2.5; after another it is at x″ with g(x″) = 4.0. The prevailing prize sequence is 3.0, 3.0, 4.0, ...

72-76 Golf with 2 Balls. [Figures omitted.] Example: ball 1 visits locations with g(x) = 3.0, g(x′) = 2.5, g(x″) = 4.0, while ball 2 visits locations with g(y) = 3.2, g(y′) = 3.5, g(y″) = 4.2.

82 Optimal Play with N Balls. Each of the N balls has an initial prevailing prize, ḡ_i(0), attached to it: ḡ_i(0) = g(x_i(0)). The prevailing prize ḡ_i(t) is increased whenever it is insufficient to motivate the golfer to play that ball; so it is nondecreasing. ḡ_i(t) depends only on the ball's history, not on what has happened to other balls: ḡ_i(t) = max_{s ≤ t} g(x_i(s)). The game is fair. It is impossible for the golfer to make a strictly positive profit (since he would have to do so for some ball): E(cost incurred) ≥ E(prize won). Equality is achieved provided the golfer does not switch away from a ball unless its prevailing prize increases. The right-hand side is minimized by always playing the ball with the least prevailing prize.
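
To make the policy concrete, here is a small simulation sketch (my own, not from the talk). It assumes the fair prizes g(x) have already been computed and that location 0 is the hole; it always hits the ball with the least prevailing prize and raises that prize only when the ball lands somewhere less promising. Averaging the returned (cost, prize) pairs over many runs illustrates the equality E(cost incurred) = E(prize won) for this policy.

```python
import numpy as np

def play_golf(g, c, P, x0, seed=0):
    """Simulate the prevailing-prize policy for golf with N balls.
    g[x]: fair prize at location x (assumed precomputed); c[x]: cost of a stroke
    from x; P: transition matrix over locations, with location 0 the hole.
    Returns (total cost paid, prize attached to the ball that is sunk)."""
    rng = np.random.default_rng(seed)
    x = list(x0)                       # current location of each ball
    prize = [g[xi] for xi in x]        # prevailing prize of each ball
    cost = 0.0
    while True:
        i = int(np.argmin(prize))      # play the ball with least prevailing prize
        cost += c[x[i]]
        x[i] = rng.choice(len(g), p=P[x[i]])
        if x[i] == 0:                  # ball i is sunk; golfer collects its prize
            return cost, prize[i]
        prize[i] = max(prize[i], g[x[i]])   # raise the prevailing prize if needed

# Made-up example: location 0 is the hole.
g = np.array([0.0, 2.0, 3.0])
c = np.array([0.0, 1.0, 1.5])
P = np.array([[1.0, 0.0, 0.0],
              [0.4, 0.3, 0.3],
              [0.2, 0.5, 0.3]])
print(play_golf(g, c, P, x0=[1, 2]))
```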

86 Golf and the Multi-armed Bandit. Having solved the golf problem, the solution to the multi-armed bandit problem follows. Just let P(x, 0) = 1 − β for all x. The expected cost incurred until a first ball is sunk equals the expected total β-discounted cost over the infinite horizon. The fair prize, g(x), is 1/(1 − β) times the Gittins index, G(x): g(x) = inf{ g : sup_{τ ≥ 1} E[ −Σ_{t=0}^{τ−1} c(x(t)) β^t + (1 − β)(1 + β + ⋯ + β^{τ−1}) g | x(0) = x ] ≥ 0 } = (1/(1 − β)) inf_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} c(x(t)) β^t | x(0) = x ] / E[ Σ_{t=0}^{τ−1} β^t | x(0) = x ].

88 Golf with N Balls and a Set of Clubs. Suppose that a ball in location x can be played with a choice of shots, from a set A(x). Choosing shot a ∈ A(x), the ball moves x → y with probability P_a(x, y). Now the golfer must choose not only which ball to play, but with which shot to play it. Under a condition, an index policy is again optimal: he should play the ball with least prevailing prize, choosing the shot from A(x) that would be optimal if that ball were the only ball present.

91 Achievable Performance Region Approach. Suppose all arms move on the state space E = {1, ..., N}. Let I_i(t) be an indicator for the event that at time t an arm is pulled that is in state i. We wish to maximize (conditional on the starting states of the arms) E_π[ Σ_{i∈E} Σ_{t=0}^∞ r_i I_i(t) β^t ]. Suppose that under policy π, z_i^π = E_π[ Σ_{t=0}^∞ I_i(t) β^t ]. We wish to maximize Σ_{i∈E} r_i z_i^π.

99 Some Conservation Laws. Consider a MABP with r_i = 1 for all i. This shows that, for all π, Σ_{i∈E} A_i^E z_i^π = 1 + β + β^2 + ⋯ = 1/(1 − β) = b(E). Now pick a subset of states S ⊆ E = {1, ..., N}. Let T_i^S = number of pulls needed for the state to return to S from i, and A_i^S = E[1 + β + ⋯ + β^{T_i^S − 1}]. Put r_i^S = 0 for i ∉ S, and r_i^S = A_i^S for i ∈ S. This is a near-trivial MABP: it is easy to show that Σ_i r_i^S z_i^π is minimized by any policy that gives priority to arms whose states are not in S. So Σ_{i∈S} A_i^S z_i^π ≥ min_π { Σ_{i∈S} A_i^S z_i^π } = b(S).
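
A small computational sketch (my own; the transition matrix below is hypothetical) of the quantities A_i^S for a single arm: they satisfy the linear system A_i^S = 1 + β Σ_{j∉S} P(i, j) A_j^S, which can be solved first for the states outside S and then evaluated for the states inside S.

```python
import numpy as np

def A_values(P, beta, S):
    """A_i^S = E[1 + beta + ... + beta^(T_i^S - 1)] for every state i, where
    T_i^S is the number of pulls until the state returns to S from i.
    They satisfy A_i^S = 1 + beta * sum_{j not in S} P[i, j] * A_j^S."""
    n = P.shape[0]
    out = np.array([i not in S for i in range(n)])    # states outside S
    A = np.ones(n)
    if out.any():
        Q = P[np.ix_(out, out)]                       # transitions among states outside S
        A_out = np.linalg.solve(np.eye(out.sum()) - beta * Q, np.ones(out.sum()))
        A[out] = A_out
        A[~out] = 1 + beta * P[np.ix_(~out, out)] @ A_out
    return A

# Hypothetical 3-state arm.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
print(A_values(P, beta=0.9, S={0}))   # A_i^S for S = {0}
```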

100 Constraints on the Achievable Region. Lemma. There exist positive A_i^S, as defined above, such that for any scheduling policy π, Σ_{i∈S} A_i^S z_i^π ≥ b(S) for all S ⊆ E (1), and Σ_{i∈E} A_i^E z_i^π = b(E) (2), and such that equality holds in (1) if π gives priority to arms whose states are not in S over any arms whose states are in S.

106 A Linear Programming Relaxation.
Primal: maximize Σ_{i∈E} r_i z_i over {z_i}, subject to Σ_{i∈S} A_i^S z_i ≥ b(S) for all S ⊂ E; Σ_{i∈E} A_i^E z_i = b(E); z_i ≥ 0 for all i.
Dual: minimize Σ_S y_S b(S) over {y_S}, subject to Σ_{S : i∈S} y_S A_i^S ≥ r_i for all i; y_S ≤ 0 for all S ⊂ E; y_E unrestricted in sign.
The optimal value of this LP is an upper bound on the optimal value for our bandit problem. A greedy algorithm computes dual variables ȳ_S that are dual feasible and complementary slack, and a primal solution z_i = z_i^π which is the performance vector of a priority policy.

114 Greedy Algorithm. The dual has 2^N − 1 variables y_S, but only N of them are non-zero. They can be computed one by one: ȳ_E, ȳ_{S_2}, ȳ_{S_3}, ..., ȳ_{S_N}.
Input: r_i, p_ij, i, j ∈ E.
Initialization: let ȳ_E := max{ r_i / A_i^E : i ∈ E }; choose i_1 ∈ E attaining the above maximum; set ȳ_S := 0 for all S such that i_1 ∈ S ⊊ E, so the dual constraint Σ_{S : i_1 ∈ S} y_S A_{i_1}^S ≥ r_{i_1} holds with equality; let g_{i_1} := ȳ_E; let S_1 := E; let k := 2.
Loop: while k ≤ N do
  let S_k := S_{k−1} \ {i_{k−1}};
  let ȳ_{S_k} := max{ ( r_i − Σ_{j=1}^{k−1} A_i^{S_j} ȳ_{S_j} ) / A_i^{S_k} : i ∈ S_k };
  choose i_k ∈ S_k attaining the above maximum;
  set ȳ_S := 0 for all S such that i_k ∈ S ⊊ S_k, so the dual constraint Σ_{S : i_k ∈ S} y_S A_{i_k}^S ≥ r_{i_k} holds with equality;
  let g_{i_k} := g_{i_{k−1}} + ȳ_{S_k};
  let k := k + 1;
end {while}
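
A sketch of this greedy algorithm in code (my own rendering of the slide's pseudocode; the 3-state arm at the bottom is made up). It repeats the A_i^S computation from the earlier sketch so that the snippet stands alone. The values g_i it returns are the priority indices of the states (the Gittins indices), produced in decreasing order i_1, i_2, ..., i_N, which is the priority order of the optimal policy.

```python
import numpy as np

def A_values(P, beta, S):
    """A_i^S = 1 + beta * sum_{j not in S} P[i, j] * A_j^S, for every state i."""
    n = P.shape[0]
    out = np.array([i not in S for i in range(n)])
    A = np.ones(n)
    if out.any():
        A_out = np.linalg.solve(np.eye(out.sum()) - beta * P[np.ix_(out, out)],
                                np.ones(out.sum()))
        A[out] = A_out
        A[~out] = 1 + beta * P[np.ix_(~out, out)] @ A_out
    return A

def greedy_indices(r, P, beta):
    """Adaptive greedy computation of the state indices g_i, following the slide:
    peel off the best state, then repeatedly the best remaining state."""
    n = len(r)
    S = set(range(n))                    # S_1 = E
    A_hist, y_hist = [], []              # the vectors A^{S_j} and values y_{S_j}
    g = np.empty(n)
    g_prev = 0.0
    while S:
        A = A_values(P, beta, S)
        # reduced rewards r_i - sum_j A_i^{S_j} * y_{S_j}
        reduced = r - sum(Aj * yj for Aj, yj in zip(A_hist, y_hist))
        ratios = {i: reduced[i] / A[i] for i in S}
        i_k = max(ratios, key=ratios.get)
        y = ratios[i_k]
        g_prev += y                      # g_{i_k} = g_{i_{k-1}} + y_{S_k}
        g[i_k] = g_prev
        A_hist.append(A); y_hist.append(y)
        S = S - {i_k}
    return g

# Hypothetical 3-state arm.
r = np.array([0.5, 1.0, 0.2])
P = np.array([[0.0, 1.0, 0.0],
              [0.3, 0.3, 0.4],
              [0.1, 0.1, 0.8]])
print(greedy_indices(r, P, beta=0.9))
```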

116 Restless Bandits Spinning Plates

121-122 Restless Bandits [Whittle 88]. Two actions are available: active (a = 1) or passive (a = 0). Rewards, r(x, a), and transitions, P(y | x, a), depend on the state and the action taken. Objective: maximize the time-average reward from n restless bandits under a constraint that only m (m < n) of them receive the active action simultaneously. Examples of active/passive pairs: work (increasing fatigue) / rest (recovery); high speed / low speed, e.g. P(y | x, 0) = εP(y | x, 1) for y ≠ x; inspection / no inspection.

124 Opportunistic Spectrum Access. Communication channels may be busy or free. [Figure omitted: occupancy of each channel over time, with the free slots marked as opportunities.] The aim is to inspect m out of n channels, maximizing the number of these that are found to be free.

127 Opportunistic Spectrum Access. The condition of a channel (busy or free) evolves as a Markov chain. Let x(t) = P(channel is free at time t). a = 0: x(t+1) = x(t) p_11 + (1 − x(t)) p_01. a = 1: x(t+1) = p_11 with probability x(t), or p_01 with probability 1 − x(t).
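
A minimal sketch of this belief recursion (parameter values made up): under a = 0 the belief just mixes through the chain, and under a = 1 the channel is inspected and found free with probability x(t).

```python
import numpy as np

def belief_update(x, a, p01, p11, rng):
    """One-step update of x = P(channel free), per the slide.
    a = 0: no inspection, the belief mixes through the chain.
    a = 1: inspect; the channel is seen free with probability x."""
    if a == 0:
        return x * p11 + (1 - x) * p01
    return p11 if rng.random() < x else p01

rng = np.random.default_rng(1)
x = 0.5
for t in range(5):
    x = belief_update(x, a=t % 2, p01=0.2, p11=0.8, rng=rng)
    print(round(x, 3))
```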

128 Dynamic Programming Equation. The action set is Ω = {(a_1, ..., a_n) : a_i ∈ {0,1}, Σ_i a_i = m}. For a state x = (x_1, ..., x_n), V(x) = max_{a∈Ω} { Σ_i r(x_i, a_i) + β Σ_{y_1,...,y_n} V(y_1, ..., y_n) Π_i P(y_i | x_i, a_i) }.

129 Relaxed Problem for a Single Restless Bandit Let us consider a relaxed problem, posed for 1 bandit only. The aim is to maximize average reward obtained from this bandit under a constraint that a = 1 for only a fraction m/n of the time.

134 LP for the Relaxed Problem. Let z_x^a be the proportion of time that the bandit is in state x and action a is taken (under a stationary Markov policy). An upper bound for our problem can be found from an LP in the variables {z_x^a : x ∈ E, a ∈ {0,1}}: maximize Σ_{x,a} r(x, a) z_x^a subject to z_x^a ≥ 0 for all x, a; Σ_{x,a} z_x^a = 1; Σ_a z_x^a = Σ_{y,a} z_y^a P(x | y, a) for all x; Σ_x z_x^0 = 1 − m/n.

136 The Subsidy Problem. The optimal value of the dual LP problem is g, which can be found from the average-cost dynamic programming equation φ(x) + g = max_{a∈{0,1}} { r(x, a) + λ(1 − a) + Σ_y φ(y) P(y | x, a) }. λ and φ(x) are the Lagrange multipliers for the constraints; λ may be interpreted as a subsidy for taking a = 0. The solution partitions the state space into sets: E_0 (a = 0), E_1 (a = 1) and E_01 (randomization between a = 0 and a = 1).

140 Indexability. It is reasonable that as the subsidy λ (for a = 0) increases from −∞ to +∞, the set of states E_0 (where a = 0 is optimal) should increase monotonically. If it does then we say the bandit is indexable. The Whittle index, W(x), is the least subsidy for which it can be optimal to take a = 0 in state x. This motivates a heuristic policy: use the active action on the m bandits with the greatest Whittle indices. Like Gittins indices for classical bandits, Whittle indices can be computed separately for each bandit; the Whittle index is the same as the Gittins index when a = 0 is a freezing action.
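
For very small examples, the Whittle index can be obtained by brute force, which at least makes the definition concrete. The sketch below (my own; it assumes indexability, assumes every stationary policy is unichain, and enumerates all 2^k deterministic policies) computes, for each subsidy λ, the best average reward with the action in state x fixed to 0 or to 1, and bisects for the least λ at which the passive action is optimal in x. The two-state data at the bottom are made up.

```python
import itertools
import numpy as np

def average_reward(policy, r, P, lam):
    """Long-run average of r(x,a) + lam*(1-a) under a stationary policy
    (assumed unichain), via its stationary distribution."""
    n = len(policy)
    Ppol = np.array([P[a][x] for x, a in enumerate(policy)])
    rpol = np.array([r[x][a] + lam * (1 - a) for x, a in enumerate(policy)])
    # stationary distribution: mu (P - I) = 0, sum(mu) = 1
    A = np.vstack([Ppol.T - np.eye(n), np.ones(n)])
    b = np.r_[np.zeros(n), 1.0]
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return mu @ rpol

def best_gain(r, P, lam, fix=None):
    """Best average reward over deterministic policies; optionally fix the
    action in one state, fix = (state, action)."""
    n = len(r)
    best = -np.inf
    for policy in itertools.product((0, 1), repeat=n):
        if fix is not None and policy[fix[0]] != fix[1]:
            continue
        best = max(best, average_reward(policy, r, P, lam))
    return best

def whittle_index(x, r, P, lo=-100.0, hi=100.0, tol=1e-6):
    """Least subsidy lambda at which a = 0 is optimal in state x."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if best_gain(r, P, lam, fix=(x, 0)) >= best_gain(r, P, lam, fix=(x, 1)):
            hi = lam     # passive already optimal in x: the index is <= lam
        else:
            lo = lam
    return 0.5 * (lo + hi)

# Made-up 2-state restless bandit: r[x][a], P[a][x][y].
r = [[0.0, 1.0], [0.5, 0.8]]
P = [np.array([[0.9, 0.1], [0.3, 0.7]]),    # passive transitions
     np.array([[0.5, 0.5], [0.6, 0.4]])]    # active transitions
print([round(whittle_index(x, r, P), 3) for x in (0, 1)])
```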

147 Two Questions. Under what assumptions is a restless bandit indexable? This is somewhat mysterious. Special classes of restless bandits are indexable, such as dual-speed bandits: Glazebrook, Niño-Mora, Ansell (2002), W. (2007). Indexability can be proved in some problems (such as the opportunistic spectrum access problem, Liu and Zhao (2009)). How good is the heuristic policy using Whittle indices? It may be optimal (opportunistic spectrum access with identical channels, Ahmad, Liu, Javidi, and Zhao (2009)). Lots of papers with numerical work. It is often asymptotically optimal, W. and Weiss (1990).

154 Asymptotic Optimality. Suppose a priority policy orders the states 1, 2, .... At time t there are (n_1, ..., n_k) bandits in states 1, ..., k. Let m = ρn. Let z_i = n_i/n be the proportion in state i, n_i^a the number in state i that receive action a, and u_i^a(z) = n_i^a/n_i. [Figure omitted: with the states in priority order, the m = ρn active bandits are those in states 1 and 2 and part of those in state 3, so u_1^1(z) = u_2^1(z) = 1 and 0 < u_3^1(z) < 1.] Let q_{ij}^a ≥ 0 be the rate at which a bandit in state i jumps to state j under action a, and q_{ij}(z) = u_i^0(z) q_{ij}^0 + u_i^1(z) q_{ij}^1.

156 Fluid Approximation. The fluid approximation for large n is given by piecewise linear differential equations of the form dz_i/dt = Σ_j q_{ji}(z) z_j − Σ_j q_{ij}(z) z_i. E.g., for k = 2: dz_1/dt = −(q_{12}^0 + q_{21}^0) z_1 + (q_{12}^0 − q_{12}^1) ρ + q_{21}^0 for z_1 ≥ ρ, and dz_1/dt = −(q_{12}^1 + q_{21}^1) z_1 − (q_{21}^0 − q_{21}^1) ρ + q_{21}^0 for z_1 ≤ ρ. In general dz/dt = A(z) z + b(z), where A(z) and b(z) are constant within k polyhedral regions.
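
A tiny sketch (my own; the rates and ρ are made up) that integrates the k = 2 fluid equation above by Euler's method, so one can watch z_1(t) settle toward, or fail to settle toward, an equilibrium.

```python
import numpy as np

def fluid_z1(q0, q1, rho, z1=1.0, dt=0.01, T=30.0):
    """Euler integration of the k = 2 fluid equation for z1 (priority to state 1).
    q0, q1 are 2x2 rate matrices for the passive and active actions."""
    zs = [z1]
    for _ in range(int(T / dt)):
        if z1 >= rho:
            dz = -(q0[0][1] + q0[1][0]) * z1 + (q0[0][1] - q1[0][1]) * rho + q0[1][0]
        else:
            dz = -(q1[0][1] + q1[1][0]) * z1 - (q0[1][0] - q1[1][0]) * rho + q0[1][0]
        z1 += dt * dz
        zs.append(z1)
    return np.array(zs)

# Made-up rates: q[a][i][j] = rate of i -> j under action a.
q0 = [[0.0, 0.2], [0.1, 0.0]]
q1 = [[0.0, 1.0], [0.8, 0.0]]
print(fluid_z1(q0, q1, rho=0.4)[-1])   # approximate equilibrium value of z1
```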

157 Asymptotic Optimality Theorem [W. and Weiss 90] If bandits are indexable, and the fluid model has an asymptotically stable equilibrium point, then the Whittle index heuristic is asymptotically optimal, in the sense that the reward per bandit tends to the reward that is obtained under the relaxed policy. (proof via a theorem about law of large numbers for sample paths.)

159 Heuristic May Not be Asymptotically Optimal. [The 4×4 transition-rate matrices q_{ij}^0 and q_{ij}^1 are omitted.] r^0 = (0, 1, 10, 10), r^1 = (10, 10, 10, 0), ρ = 0.835. The bandit is indexable. The equilibrium point is (z̄_1, z̄_2, z̄_3, z̄_4) = (0.409, 0.327, 0.100, 0.164), with z̄_1 + z̄_2 + z̄_3 = 0.836. The relaxed policy obtains 10 per bandit per unit time.

160 Heuristic is Not Asymptotically Optimal. But the equilibrium point z̄ is not asymptotically stable. [Figure omitted: fluid trajectories of z(t) cycling around z̄, with the regions in which a = 1, a = 0 and a = 0/1 marked.] The relaxed policy obtains 10 per bandit; the heuristic obtains strictly less per bandit.

161 Questions


More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Restless Bandits with Switching Costs: Linear Programming Relaxations, Performance Bounds and Limited Lookahead Policies

Restless Bandits with Switching Costs: Linear Programming Relaxations, Performance Bounds and Limited Lookahead Policies Restless Bandits with Switching Costs: Linear Programming Relaations, Performance Bounds and Limited Lookahead Policies Jerome Le y Laboratory for Information and Decision Systems Massachusetts Institute

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality. CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and

More information

CO759: Algorithmic Game Theory Spring 2015

CO759: Algorithmic Game Theory Spring 2015 CO759: Algorithmic Game Theory Spring 2015 Instructor: Chaitanya Swamy Assignment 1 Due: By Jun 25, 2015 You may use anything proved in class directly. I will maintain a FAQ about the assignment on the

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Jefferson Huang March 21, 2018 Reinforcement Learning for Processing Networks Seminar Cornell University Performance

More information

Abstract We present a polyhedral framework for establishing general structural properties on optimal solutions of stochastic scheduling problems, wher

Abstract We present a polyhedral framework for establishing general structural properties on optimal solutions of stochastic scheduling problems, wher On certain greedoid polyhedra, partially indexable scheduling problems, and extended restless bandit allocation indices Submitted to Mathematical Programming José Ni~no-Mora Λ April 5, 2 Λ Dept. of Economics

More information

Performance of Round Robin Policies for Dynamic Multichannel Access

Performance of Round Robin Policies for Dynamic Multichannel Access Performance of Round Robin Policies for Dynamic Multichannel Access Changmian Wang, Bhaskar Krishnamachari, Qing Zhao and Geir E. Øien Norwegian University of Science and Technology, Norway, {changmia,

More information

Understanding the Simplex algorithm. Standard Optimization Problems.

Understanding the Simplex algorithm. Standard Optimization Problems. Understanding the Simplex algorithm. Ma 162 Spring 2011 Ma 162 Spring 2011 February 28, 2011 Standard Optimization Problems. A standard maximization problem can be conveniently described in matrix form

More information

The knowledge gradient method for multi-armed bandit problems

The knowledge gradient method for multi-armed bandit problems The knowledge gradient method for multi-armed bandit problems Moving beyond inde policies Ilya O. Ryzhov Warren Powell Peter Frazier Department of Operations Research and Financial Engineering Princeton

More information

minimize x subject to (x 2)(x 4) u,

minimize x subject to (x 2)(x 4) u, Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Simplex Algorithm for Countable-state Discounted Markov Decision Processes

Simplex Algorithm for Countable-state Discounted Markov Decision Processes Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision

More information

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret Cem Tekin, Member, IEEE, Mingyan Liu, Senior Member, IEEE 1 Abstract In this paper we consider the problem of learning the optimal

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes Infinite Horizon Problems Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)

More information

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony

More information

A Primal-Dual Algorithm for Computing a Cost Allocation in the. Core of Economic Lot-Sizing Games

A Primal-Dual Algorithm for Computing a Cost Allocation in the. Core of Economic Lot-Sizing Games 1 2 A Primal-Dual Algorithm for Computing a Cost Allocation in the Core of Economic Lot-Sizing Games 3 Mohan Gopaladesikan Nelson A. Uhan Jikai Zou 4 October 2011 5 6 7 8 9 10 11 12 Abstract We consider

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

21 Markov Decision Processes

21 Markov Decision Processes 2 Markov Decision Processes Chapter 6 introduced Markov chains and their analysis. Most of the chapter was devoted to discrete time Markov chains, i.e., Markov chains that are observed only at discrete

More information

6. Linear Programming

6. Linear Programming Linear Programming 6-1 6. Linear Programming Linear Programming LP reduction Duality Max-flow min-cut, Zero-sum game Integer Programming and LP relaxation Maximum Bipartite Matching, Minimum weight vertex

More information

Math 456: Mathematical Modeling. Tuesday, March 6th, 2018

Math 456: Mathematical Modeling. Tuesday, March 6th, 2018 Math 456: Mathematical Modeling Tuesday, March 6th, 2018 Markov Chains: Exit distributions and the Strong Markov Property Tuesday, March 6th, 2018 Last time 1. Weighted graphs. 2. Existence of stationary

More information

1. Algebraic and geometric treatments Consider an LP problem in the standard form. x 0. Solutions to the system of linear equations

1. Algebraic and geometric treatments Consider an LP problem in the standard form. x 0. Solutions to the system of linear equations The Simplex Method Most textbooks in mathematical optimization, especially linear programming, deal with the simplex method. In this note we study the simplex method. It requires basically elementary linear

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past. 1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if

More information

A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions

A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions arxiv:188.6314v2 [math.pr 22 Aug 218 A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions Wenqing Bao 1, Xiaoqiang Cai 2, Xianyi Wu 3 1 Zhejiang Normal University, Jinhua Zhejiang,

More information

Optimal learning with non-gaussian rewards

Optimal learning with non-gaussian rewards Optimal learning with non-gaussian rewards Zi Ding Ilya O. Ryzhov November 2, 214 Abstract We propose a novel theoretical characterization of the optimal Gittins index policy in multi-armed bandit problems

More information

Deterministic Dynamic Programming

Deterministic Dynamic Programming Deterministic Dynamic Programming 1 Value Function Consider the following optimal control problem in Mayer s form: V (t 0, x 0 ) = inf u U J(t 1, x(t 1 )) (1) subject to ẋ(t) = f(t, x(t), u(t)), x(t 0

More information

A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling

A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling Nicolas Gast Bruno Gaujal Grenoble University Knoxville, May 13th-15th 2009 N. Gast (LIG) Mean

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

II. Analysis of Linear Programming Solutions

II. Analysis of Linear Programming Solutions Optimization Methods Draft of August 26, 2005 II. Analysis of Linear Programming Solutions Robert Fourer Department of Industrial Engineering and Management Sciences Northwestern University Evanston, Illinois

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture

More information

Week 3 Linear programming duality

Week 3 Linear programming duality Week 3 Linear programming duality This week we cover the fascinating topic of linear programming duality. We will learn that every minimization program has associated a maximization program that has the

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

http://www.math.uah.edu/stat/markov/.xhtml 1 of 9 7/16/2009 7:20 AM Virtual Laboratories > 16. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 1. A Markov process is a random process in which the future is

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Multi-armed Bandits with Limited Exploration

Multi-armed Bandits with Limited Exploration Multi-armed Bandits with Limited Exploration Sudipto Guha Kamesh Munagala Abstract A central problem to decision making under uncertainty is the trade-off between exploration and exploitation: between

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

Part 1. The Review of Linear Programming

Part 1. The Review of Linear Programming In the name of God Part 1. The Review of Linear Programming 1.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Formulation of the Dual Problem Primal-Dual Relationship Economic Interpretation

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Stratégies bayésiennes et fréquentistes dans un modèle de bandit Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016

More information

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Jefferson Huang Dept. Applied Mathematics and Statistics Stony Brook

More information

Lecture: Duality of LP, SOCP and SDP

Lecture: Duality of LP, SOCP and SDP 1/33 Lecture: Duality of LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:

More information

Optimization 4. GAME THEORY

Optimization 4. GAME THEORY Optimization GAME THEORY DPK Easter Term Saddle points of two-person zero-sum games We consider a game with two players Player I can choose one of m strategies, indexed by i =,, m and Player II can choose

More information

A linear programming approach to constrained nonstationary infinite-horizon Markov decision processes

A linear programming approach to constrained nonstationary infinite-horizon Markov decision processes A linear programming approach to constrained nonstationary infinite-horizon Markov decision processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith Technical Report 13-01 March 6, 2013 University

More information

Optimal and Heuristic Resource Allocation Policies in Serial Production Systems

Optimal and Heuristic Resource Allocation Policies in Serial Production Systems Clemson University TigerPrints All Theses Theses 12-2008 Optimal and Heuristic Resource Allocation Policies in Serial Production Systems Ramesh Arumugam Clemson University, rarumug@clemson.edu Follow this

More information

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs LP-Duality ( Approximation Algorithms by V. Vazirani, Chapter 12) - Well-characterized problems, min-max relations, approximate certificates - LP problems in the standard form, primal and dual linear programs

More information

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains Markov Chains A random process X is a family {X t : t T } of random variables indexed by some set T. When T = {0, 1, 2,... } one speaks about a discrete-time process, for T = R or T = [0, ) one has a continuous-time

More information