Multi-armed Bandits and the Gittins Index

1 Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge Seminar at Microsoft, Cambridge, June 20, 2011

2 The Multi-armed Bandit Problem

4 Multi-armed Bandit Allocation Indices, 2nd edition, 11 March 2011, Gittins, Glazebrook and Weber

5 Two-armed Bandit. Arm 1: 3, 10, 4, 9, 12, 1, ... Arm 2: 5, 6, 2, 15, 2, 7, ...

16 Two-armed Bandit,,,,, 1,...,,,, 2, 7,... 5, 6, 3, 10, 4, 9, 12, 2, 15 Reward = β β 2 β 3 0 < β < 1. Of course, in practice we must choose which arms to pull without knowing the future sequences of rewards.

20 Dynamic Effort Allocation Research projects: how should I allocate my research time amongst my favorite open problems so as to maximize the value of my completed research? Job Scheduling: in what order should I work on the tasks in my in-tray? Searching for information: shall I spend more time browsing the web, or go to the library, or ask a friend? Dating strategy: should I contact a new prospect, or try another date with someone I have dated before?

23 Information vs. Immediate Payoff. In all these problems one wishes to learn about the effectiveness of alternative strategies, while simultaneously wishing to use the best strategy in the short term. "Bandit problems embody in essential form a conflict evident in all human action: information versus immediate payoff." (Peter Whittle, 1989). Exploration versus exploitation.

24 Clinical Trials

29 Bernoulli Bandits. One of N drugs is to be administered at each of times t = 0, 1, .... The sth time drug i is used it is successful, X_i(s) = 1, or unsuccessful, X_i(s) = 0, with P(X_i(s) = 1) = θ_i. X_i(1), X_i(2), ... are i.i.d. samples. θ_i is unknown, but has a prior distribution, perhaps uniform on [0,1]: f(θ_i) = 1, 0 ≤ θ_i ≤ 1.

31 Bernoulli Bandits. Having seen s_i successes and f_i failures, the posterior is f(θ_i | s_i, f_i) = [(s_i + f_i + 1)! / (s_i! f_i!)] θ_i^{s_i} (1 − θ_i)^{f_i}, 0 ≤ θ_i ≤ 1, with mean (s_i + 1)/(s_i + f_i + 2). We wish to maximize the expected total discounted number of successes.

35 Multi-armed Bandit. N independent arms, with known states x_1(t), ..., x_N(t). At each time t ∈ {0, 1, 2, ...} one arm is to be activated (pulled/continued). If arm i is activated then it changes state, x → y with probability P_i(x, y), and produces reward r_i(x_i(t)). The other arms are passive (not pulled/frozen). Objective: maximize the expected total β-discounted reward E[ Σ_{t=0}^∞ r_{i_t}(x_{i_t}(t)) β^t ], where i_t is the arm pulled at time t and 0 < β < 1.

37 Dynamic Programming Solution. The dynamic programming equation is V(x_1, ..., x_N) = max_i { r_i(x_i) + β Σ_y P_i(x_i, y) V(x_1, ..., x_{i−1}, y, x_{i+1}, ..., x_N) }. If bandit i moves on a state space of size k_i, then (x_1, ..., x_N) moves on a state space of size Π_i k_i (exponential in N).

41 Gittins Index Solution. Theorem [Gittins, '74, '79, '89]. Reward is maximized by always continuing the bandit having greatest value of the dynamic allocation index G_i(x_i) = sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ], where τ is a (past-measurable) stopping time. One problem (on a state space of size Π_i k_i) becomes N problems (on state spaces of sizes k_1, ..., k_N). G_i(x_i) is called the Gittins index. It can be computed in time O(k_i^3).

44 Gittins Index. G_i(x_i) = sup_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} r_i(x_i(t)) β^t | x_i(0) = x_i ] / E[ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i ]. The numerator is the expected discounted reward up to τ; the denominator is the expected discounted time up to τ.
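
As an illustration (not part of the talk), here is a minimal numerical sketch of how a Gittins index could be computed directly from this definition for a single finite-state arm: for a candidate value g, solve the optimal stopping problem with per-step reward r(x) − g by value iteration, then locate G(x) by bisection as the value of g at which it is just worth playing at least once. This is not the O(k_i^3) algorithm mentioned above, and the arm at the bottom is made up purely for illustration.

```python
import numpy as np

def gittins_index(r, P, beta, x, tol=1e-9):
    """Gittins index of state x for an arm with reward vector r, transition
    matrix P and discount factor beta, via calibration: G(x) is the g at which
    sup_{tau >= 1} E[ sum_{t < tau} (r(x_t) - g) beta^t | x_0 = x ] = 0."""
    k = len(r)

    def stopping_value(g):
        # Optimal stopping value V_g(y) = max(0, r(y) - g + beta * (P V_g)(y)).
        V = np.zeros(k)
        while True:
            V_new = np.maximum(0.0, r - g + beta * P @ V)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        # Value when forced to play at least once from x, then stop optimally.
        return r[x] - g + beta * P[x] @ V

    lo, hi = r.min(), r.max()   # the index always lies between min r and max r
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if stopping_value(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Made-up 3-state arm.
r = np.array([1.0, 0.3, 0.0])
P = np.array([[0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5],
              [0.0, 0.2, 0.8]])
print([round(gittins_index(r, P, beta=0.9, x=s), 4) for s in range(3)])
```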

47 Gittins Indices for Bernoulli Bandits, β = 0.9. [Table of index values, indexed by s and f, omitted.] (s_1, f_1) = (2, 3): posterior mean = 3/7 ≈ 0.4286. (s_2, f_2) = (6, 7): posterior mean = 7/15 ≈ 0.4667. The table gives the larger index to (s_1, f_1) = (2, 3), so we prefer to use drug 1 next, even though it has the smaller probability of success.
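
The same calibration-and-bisection idea gives a rough way to reproduce index values for the Bernoulli arm (a sketch of my own, not the method behind the slide's table): the state is (s, f), a pull earns the posterior mean (s+1)/(s+f+2), and the (s, f) lattice is truncated at s + f = N, beyond which the discounted contribution β^N is negligible.

```python
def bernoulli_gittins(s, f, beta=0.9, N=300, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm in posterior state (s, f)
    (uniform prior), by bisection on g over a lattice truncated at s + f = N."""

    def continue_value(g):
        # V[(s2, f2)] = value of the stopping problem with per-pull reward p - g,
        # computed backwards from level s2 + f2 = N down to s + f.
        V = {}
        for level in range(N, s + f - 1, -1):
            for s2 in range(s, level - f + 1):
                f2 = level - s2
                p = (s2 + 1) / (s2 + f2 + 2)            # posterior mean in (s2, f2)
                cont = p - g + beta * (p * V.get((s2 + 1, f2), 0.0)
                                       + (1 - p) * V.get((s2, f2 + 1), 0.0))
                V[(s2, f2)] = max(0.0, cont)
        p = (s + 1) / (s + f + 2)
        # Value when forced to pull at least once from (s, f), then stop optimally.
        return p - g + beta * (p * V.get((s + 1, f), 0.0)
                               + (1 - p) * V.get((s, f + 1), 0.0))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if continue_value(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

print(bernoulli_gittins(2, 3), bernoulli_gittins(6, 7))
```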

54 What has Happened Since 1989? The index theorem has become better known. Alternative proofs have been explored: playing golf with N balls; the achievable performance region approach. Many applications (economics, engineering, ...). Notions of indexation have been generalized: restless bandits.

56 Gittins Index Theorem has become Better Known. Peter Whittle tells the story: a colleague of high repute asked an equally well-known colleague: "What would you say if you were told that the multi-armed bandit problem had been solved?" "Sir, the multi-armed bandit problem is not of such a nature that it can be solved."

58 Proofs of the Index Theorem Since Gittins (1974, 1979), many researchers have reproved, remodelled and resituated the index theorem. Beale (1979) Karatzas (1984) Varaiya, Walrand, Buyukkoc (1985) Chen, Katehakis (1986) Kallenberg (1986) Katehakis, Veinott (1986) Eplett (1986) Kertz (1986) Tsitsiklis (1986) Mandelbaum (1986, 1987) Lai, Ying (1988) Whittle (1988) Weber (1992) El Karoui, Karatzas (1993) Ishikida and Varaiya (1994) Tsitsiklis (1994) Bertsimas, Niño-Mora (1996) Glazebrook, Garbe (1996) Kaspi, Mandelbaum (1998) Bäuerle, Stidham (2001) Dimitriu, Tetali, Winkler (2003)

61 Golf with N Balls [Dimitriu, Tetali, Winkler 03, W. 92]. N balls are strewn about a golf course at locations x_1, ..., x_N. Hitting ball i, which is in location x_i, costs c(x_i); the ball moves x_i → y with probability P(x_i, y), and goes in the hole with probability P(x_i, 0). Objective: minimize the expected total cost incurred up to sinking a first ball.

67 Golf with 1 Ball. Given the golfer's ball is in location x, let us offer him a prize of value g(x) if he eventually sinks the ball. Let us set this prize just great enough so that he can break even by playing one more stroke and then quitting thereafter whenever he likes: g(x) = fair prize. If the ball arrives at a location y from which g(x) is no longer great enough to motivate the golfer to continue playing, then, just as he is about to quit, we increase the prize to g(y), which becomes the new prevailing prize. Continue doing this until the ball is sunk. This presents the golfer with a fair game, and it is optimal for him to keep playing until the ball is sunk: E(cost incurred) = E(prize won).

68-71 Golf with 1 Ball. [Figures omitted: the ball's successive locations on the course.] Example: g(x) = 3.0 at the starting location; after one stroke the ball is at x′ with g(x′) = 2.5; after another it is at x″ with g(x″) = 4.0. The prevailing prize sequence is 3.0, 3.0, 4.0, ...

72-76 Golf with 2 Balls. [Figures omitted.] Example: ball 1 visits locations with g(x) = 3.0, g(x′) = 2.5, g(x″) = 4.0, while ball 2 visits locations with g(y) = 3.2, g(y′) = 3.5, g(y″) = 4.2.

82 Optimal Play with N Balls. Each of the N balls has an initial prevailing prize, ḡ_i(0), attached to it: ḡ_i(0) = g(x_i(0)). The prevailing prize ḡ_i(t) is increased whenever it is insufficient to motivate the golfer to play that ball; so it is nondecreasing. ḡ_i(t) depends only on the ball's history, not on what has happened to other balls: ḡ_i(t) = max_{s ≤ t} g(x_i(s)). The game is fair. It is impossible for the golfer to make a strictly positive profit (since he would have to do so for some ball): E(cost incurred) ≥ E(prize won). Equality is achieved provided the golfer does not switch away from a ball unless its prevailing prize increases. The right-hand side is minimized by always playing the ball with the least prevailing prize.
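
To make the policy concrete, here is a small simulation sketch (my own, not from the talk). It assumes the fair prizes g(x) have already been computed and that location 0 is the hole; it always hits the ball with the least prevailing prize and raises that prize only when the ball lands somewhere less promising. Averaging the returned (cost, prize) pairs over many runs illustrates the equality E(cost incurred) = E(prize won) for this policy.

```python
import numpy as np

def play_golf(g, c, P, x0, seed=0):
    """Simulate the prevailing-prize policy for golf with N balls.
    g[x]: fair prize at location x (assumed precomputed); c[x]: cost of a stroke
    from x; P: transition matrix over locations, with location 0 the hole.
    Returns (total cost paid, prize attached to the ball that is sunk)."""
    rng = np.random.default_rng(seed)
    x = list(x0)                       # current location of each ball
    prize = [g[xi] for xi in x]        # prevailing prize of each ball
    cost = 0.0
    while True:
        i = int(np.argmin(prize))      # play the ball with least prevailing prize
        cost += c[x[i]]
        x[i] = rng.choice(len(g), p=P[x[i]])
        if x[i] == 0:                  # ball i is sunk; golfer collects its prize
            return cost, prize[i]
        prize[i] = max(prize[i], g[x[i]])   # raise the prevailing prize if needed

# Made-up example: location 0 is the hole.
g = np.array([0.0, 2.0, 3.0])
c = np.array([0.0, 1.0, 1.5])
P = np.array([[1.0, 0.0, 0.0],
              [0.4, 0.3, 0.3],
              [0.2, 0.5, 0.3]])
print(play_golf(g, c, P, x0=[1, 2]))
```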

86 Golf and the Multi-armed Bandit. Having solved the golf problem, the solution to the multi-armed bandit problem follows. Just let P(x, 0) = 1 − β for all x. The expected cost incurred until a first ball is sunk equals the expected total β-discounted cost over the infinite horizon. The fair prize, g(x), is 1/(1 − β) times the Gittins index, G(x): g(x) = inf{ g : sup_{τ ≥ 1} E[ −Σ_{t=0}^{τ−1} c(x(t)) β^t + (1 − β)(1 + β + ⋯ + β^{τ−1}) g | x(0) = x ] ≥ 0 } = (1/(1 − β)) inf_{τ ≥ 1} E[ Σ_{t=0}^{τ−1} c(x(t)) β^t | x(0) = x ] / E[ Σ_{t=0}^{τ−1} β^t | x(0) = x ].

88 Golf with N Balls and a Set of Clubs. Suppose that a ball in location x can be played with a choice of shots, from a set A(x). Choosing shot a ∈ A(x), the ball moves x → y with probability P_a(x, y). Now the golfer must choose not only which ball to play, but with which shot to play it. Under a condition, an index policy is again optimal: he should play the ball with least prevailing prize, choosing the shot from A(x) that would be optimal if that ball were the only ball present.

91 Achievable Performance Region Approach. Suppose all arms move on the state space E = {1, ..., N}. Let I_i(t) be an indicator for the event that at time t an arm is pulled that is in state i. We wish to maximize (conditional on the starting states of the arms) E_π[ Σ_{i∈E} Σ_{t=0}^∞ r_i I_i(t) β^t ]. Suppose that under policy π, z_i^π = E_π[ Σ_{t=0}^∞ I_i(t) β^t ]. We wish to maximize Σ_{i∈E} r_i z_i^π.

99 Some Conservation Laws. Consider a MABP with r_i = 1 for all i. This shows that, for all π, Σ_{i∈E} A_i^E z_i^π = 1 + β + β^2 + ⋯ = 1/(1 − β) = b(E). Now pick a subset of states S ⊆ E = {1, ..., N}. Let T_i^S = number of pulls needed for the state to return to S from i, and A_i^S = E[1 + β + ⋯ + β^{T_i^S − 1}]. Put r_i^S = 0 for i ∉ S, and r_i^S = A_i^S for i ∈ S. This is a near-trivial MABP: it is easy to show that Σ_i r_i^S z_i^π is minimized by any policy that gives priority to arms whose states are not in S. So Σ_{i∈S} A_i^S z_i^π ≥ min_π { Σ_{i∈S} A_i^S z_i^π } = b(S).
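
A small computational sketch (my own; the transition matrix below is hypothetical) of the quantities A_i^S for a single arm: they satisfy the linear system A_i^S = 1 + β Σ_{j∉S} P(i, j) A_j^S, which can be solved first for the states outside S and then evaluated for the states inside S.

```python
import numpy as np

def A_values(P, beta, S):
    """A_i^S = E[1 + beta + ... + beta^(T_i^S - 1)] for every state i, where
    T_i^S is the number of pulls until the state returns to S from i.
    They satisfy A_i^S = 1 + beta * sum_{j not in S} P[i, j] * A_j^S."""
    n = P.shape[0]
    out = np.array([i not in S for i in range(n)])    # states outside S
    A = np.ones(n)
    if out.any():
        Q = P[np.ix_(out, out)]                       # transitions among states outside S
        A_out = np.linalg.solve(np.eye(out.sum()) - beta * Q, np.ones(out.sum()))
        A[out] = A_out
        A[~out] = 1 + beta * P[np.ix_(~out, out)] @ A_out
    return A

# Hypothetical 3-state arm.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
print(A_values(P, beta=0.9, S={0}))   # A_i^S for S = {0}
```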

100 Constraints on the Achievable Region. Lemma. There exist positive A_i^S, as defined above, such that for any scheduling policy π, Σ_{i∈S} A_i^S z_i^π ≥ b(S) for all S ⊆ E (1), and Σ_{i∈E} A_i^E z_i^π = b(E) (2), and such that equality holds in (1) if π gives priority to arms whose states are not in S over any arms whose states are in S.

106 A Linear Programming Relaxation.
Primal: maximize Σ_{i∈E} r_i z_i over {z_i}, subject to Σ_{i∈S} A_i^S z_i ≥ b(S) for all S ⊂ E; Σ_{i∈E} A_i^E z_i = b(E); z_i ≥ 0 for all i.
Dual: minimize Σ_S y_S b(S) over {y_S}, subject to Σ_{S : i∈S} y_S A_i^S ≥ r_i for all i; y_S ≤ 0 for all S ⊂ E; y_E unrestricted in sign.
The optimal value of this LP is an upper bound on the optimal value for our bandit problem. A greedy algorithm computes dual variables ȳ_S that are dual feasible and complementary slack, and a primal solution z_i = z_i^π which is the performance vector of a priority policy.

114 Greedy Algorithm. The dual has 2^N − 1 variables y_S, but only N of them are non-zero. They can be computed one by one: ȳ_E, ȳ_{S_2}, ȳ_{S_3}, ..., ȳ_{S_N}.
Input: r_i, p_ij, i, j ∈ E.
Initialization: let ȳ_E := max{ r_i / A_i^E : i ∈ E }; choose i_1 ∈ E attaining the above maximum; set ȳ_S := 0 for all S such that i_1 ∈ S ⊊ E, so the dual constraint Σ_{S : i_1 ∈ S} y_S A_{i_1}^S ≥ r_{i_1} holds with equality; let g_{i_1} := ȳ_E; let S_1 := E; let k := 2.
Loop: while k ≤ N do
  let S_k := S_{k−1} \ {i_{k−1}};
  let ȳ_{S_k} := max{ ( r_i − Σ_{j=1}^{k−1} A_i^{S_j} ȳ_{S_j} ) / A_i^{S_k} : i ∈ S_k };
  choose i_k ∈ S_k attaining the above maximum;
  set ȳ_S := 0 for all S such that i_k ∈ S ⊊ S_k, so the dual constraint Σ_{S : i_k ∈ S} y_S A_{i_k}^S ≥ r_{i_k} holds with equality;
  let g_{i_k} := g_{i_{k−1}} + ȳ_{S_k};
  let k := k + 1;
end {while}
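
A sketch of this greedy algorithm in code (my own rendering of the slide's pseudocode; the 3-state arm at the bottom is made up). It repeats the A_i^S computation from the earlier sketch so that the snippet stands alone. The values g_i it returns are the priority indices of the states (the Gittins indices), produced in decreasing order i_1, i_2, ..., i_N, which is the priority order of the optimal policy.

```python
import numpy as np

def A_values(P, beta, S):
    """A_i^S = 1 + beta * sum_{j not in S} P[i, j] * A_j^S, for every state i."""
    n = P.shape[0]
    out = np.array([i not in S for i in range(n)])
    A = np.ones(n)
    if out.any():
        A_out = np.linalg.solve(np.eye(out.sum()) - beta * P[np.ix_(out, out)],
                                np.ones(out.sum()))
        A[out] = A_out
        A[~out] = 1 + beta * P[np.ix_(~out, out)] @ A_out
    return A

def greedy_indices(r, P, beta):
    """Adaptive greedy computation of the state indices g_i, following the slide:
    peel off the best state, then repeatedly the best remaining state."""
    n = len(r)
    S = set(range(n))                    # S_1 = E
    A_hist, y_hist = [], []              # the vectors A^{S_j} and values y_{S_j}
    g = np.empty(n)
    g_prev = 0.0
    while S:
        A = A_values(P, beta, S)
        # reduced rewards r_i - sum_j A_i^{S_j} * y_{S_j}
        reduced = r - sum(Aj * yj for Aj, yj in zip(A_hist, y_hist))
        ratios = {i: reduced[i] / A[i] for i in S}
        i_k = max(ratios, key=ratios.get)
        y = ratios[i_k]
        g_prev += y                      # g_{i_k} = g_{i_{k-1}} + y_{S_k}
        g[i_k] = g_prev
        A_hist.append(A); y_hist.append(y)
        S = S - {i_k}
    return g

# Hypothetical 3-state arm.
r = np.array([0.5, 1.0, 0.2])
P = np.array([[0.0, 1.0, 0.0],
              [0.3, 0.3, 0.4],
              [0.1, 0.1, 0.8]])
print(greedy_indices(r, P, beta=0.9))
```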

116 Restless Bandits Spinning Plates

121-122 Restless Bandits [Whittle 88]. Two actions are available: active (a = 1) or passive (a = 0). Rewards, r(x, a), and transitions, P(y | x, a), depend on the state and the action taken. Objective: maximize the time-average reward from n restless bandits under a constraint that only m (m < n) of them receive the active action simultaneously. Examples of active/passive pairs: work (increasing fatigue) / rest (recovery); high speed / low speed, e.g. P(y | x, 0) = εP(y | x, 1) for y ≠ x; inspection / no inspection.

124 Opportunistic Spectrum Access. Communication channels may be busy or free. [Figure omitted: occupancy of each channel over time, with the free slots marked as opportunities.] The aim is to inspect m out of n channels, maximizing the number of these that are found to be free.

127 Opportunistic Spectrum Access. The condition of a channel (busy or free) evolves as a Markov chain. Let x(t) = P(channel is free at time t). a = 0: x(t+1) = x(t) p_11 + (1 − x(t)) p_01. a = 1: x(t+1) = p_11 with probability x(t), or p_01 with probability 1 − x(t).
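
A minimal sketch of this belief recursion (parameter values made up): under a = 0 the belief just mixes through the chain, and under a = 1 the channel is inspected and found free with probability x(t).

```python
import numpy as np

def belief_update(x, a, p01, p11, rng):
    """One-step update of x = P(channel free), per the slide.
    a = 0: no inspection, the belief mixes through the chain.
    a = 1: inspect; the channel is seen free with probability x."""
    if a == 0:
        return x * p11 + (1 - x) * p01
    return p11 if rng.random() < x else p01

rng = np.random.default_rng(1)
x = 0.5
for t in range(5):
    x = belief_update(x, a=t % 2, p01=0.2, p11=0.8, rng=rng)
    print(round(x, 3))
```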

128 Dynamic Programming Equation. The action set is Ω = {(a_1, ..., a_n) : a_i ∈ {0,1}, Σ_i a_i = m}. For a state x = (x_1, ..., x_n), V(x) = max_{a∈Ω} { Σ_i r(x_i, a_i) + β Σ_{y_1,...,y_n} V(y_1, ..., y_n) Π_i P(y_i | x_i, a_i) }.

129 Relaxed Problem for a Single Restless Bandit Let us consider a relaxed problem, posed for 1 bandit only. The aim is to maximize average reward obtained from this bandit under a constraint that a = 1 for only a fraction m/n of the time.

134 LP for the Relaxed Problem. Let z_x^a be the proportion of time that the bandit is in state x and action a is taken (under a stationary Markov policy). An upper bound for our problem can be found from an LP in the variables {z_x^a : x ∈ E, a ∈ {0,1}}: maximize Σ_{x,a} r(x, a) z_x^a subject to z_x^a ≥ 0 for all x, a; Σ_{x,a} z_x^a = 1; Σ_a z_x^a = Σ_{y,a} z_y^a P(x | y, a) for all x; Σ_x z_x^0 = 1 − m/n.

136 The Subsidy Problem. The optimal value of the dual LP problem is g, which can be found from the average-cost dynamic programming equation φ(x) + g = max_{a∈{0,1}} { r(x, a) + λ(1 − a) + Σ_y φ(y) P(y | x, a) }. λ and φ(x) are the Lagrange multipliers for the constraints; λ may be interpreted as a subsidy for taking a = 0. The solution partitions the state space into sets: E_0 (a = 0), E_1 (a = 1) and E_01 (randomization between a = 0 and a = 1).

140 Indexability. It is reasonable that as the subsidy λ (for a = 0) increases from −∞ to +∞, the set of states E_0 (where a = 0 is optimal) should increase monotonically. If it does then we say the bandit is indexable. The Whittle index, W(x), is the least subsidy for which it can be optimal to take a = 0 in state x. This motivates a heuristic policy: use the active action on the m bandits with the greatest Whittle indices. Like Gittins indices for classical bandits, Whittle indices can be computed separately for each bandit; the Whittle index is the same as the Gittins index when a = 0 is a freezing action.
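
For very small examples, the Whittle index can be obtained by brute force, which at least makes the definition concrete. The sketch below (my own; it assumes indexability, assumes every stationary policy is unichain, and enumerates all 2^k deterministic policies) computes, for each subsidy λ, the best average reward with the action in state x fixed to 0 or to 1, and bisects for the least λ at which the passive action is optimal in x. The two-state data at the bottom are made up.

```python
import itertools
import numpy as np

def average_reward(policy, r, P, lam):
    """Long-run average of r(x,a) + lam*(1-a) under a stationary policy
    (assumed unichain), via its stationary distribution."""
    n = len(policy)
    Ppol = np.array([P[a][x] for x, a in enumerate(policy)])
    rpol = np.array([r[x][a] + lam * (1 - a) for x, a in enumerate(policy)])
    # stationary distribution: mu (P - I) = 0, sum(mu) = 1
    A = np.vstack([Ppol.T - np.eye(n), np.ones(n)])
    b = np.r_[np.zeros(n), 1.0]
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    return mu @ rpol

def best_gain(r, P, lam, fix=None):
    """Best average reward over deterministic policies; optionally fix the
    action in one state, fix = (state, action)."""
    n = len(r)
    best = -np.inf
    for policy in itertools.product((0, 1), repeat=n):
        if fix is not None and policy[fix[0]] != fix[1]:
            continue
        best = max(best, average_reward(policy, r, P, lam))
    return best

def whittle_index(x, r, P, lo=-100.0, hi=100.0, tol=1e-6):
    """Least subsidy lambda at which a = 0 is optimal in state x."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        if best_gain(r, P, lam, fix=(x, 0)) >= best_gain(r, P, lam, fix=(x, 1)):
            hi = lam     # passive already optimal in x: the index is <= lam
        else:
            lo = lam
    return 0.5 * (lo + hi)

# Made-up 2-state restless bandit: r[x][a], P[a][x][y].
r = [[0.0, 1.0], [0.5, 0.8]]
P = [np.array([[0.9, 0.1], [0.3, 0.7]]),    # passive transitions
     np.array([[0.5, 0.5], [0.6, 0.4]])]    # active transitions
print([round(whittle_index(x, r, P), 3) for x in (0, 1)])
```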

147 Two Questions. Under what assumptions is a restless bandit indexable? This is somewhat mysterious. Special classes of restless bandits are indexable, such as dual-speed bandits: Glazebrook, Niño-Mora, Ansell (2002), W. (2007). Indexability can be proved in some problems (such as the opportunistic spectrum access problem, Liu and Zhao (2009)). How good is the heuristic policy using Whittle indices? It may be optimal (opportunistic spectrum access with identical channels, Ahmad, Liu, Javidi, and Zhao (2009)). Lots of papers with numerical work. It is often asymptotically optimal, W. and Weiss (1990).

154 Asymptotic Optimality. Suppose a priority policy orders the states 1, 2, .... At time t there are (n_1, ..., n_k) bandits in states 1, ..., k. Let m = ρn. Let z_i = n_i/n be the proportion in state i, n_i^a the number in state i that receive action a, and u_i^a(z) = n_i^a/n_i. [Figure omitted: with the states in priority order, the m = ρn active bandits are those in states 1 and 2 and part of those in state 3, so u_1^1(z) = u_2^1(z) = 1 and 0 < u_3^1(z) < 1.] Let q_{ij}^a ≥ 0 be the rate at which a bandit in state i jumps to state j under action a, and q_{ij}(z) = u_i^0(z) q_{ij}^0 + u_i^1(z) q_{ij}^1.

156 Fluid Approximation. The fluid approximation for large n is given by piecewise linear differential equations of the form dz_i/dt = Σ_j q_{ji}(z) z_j − Σ_j q_{ij}(z) z_i. E.g., for k = 2: dz_1/dt = −(q_{12}^0 + q_{21}^0) z_1 + (q_{12}^0 − q_{12}^1) ρ + q_{21}^0 for z_1 ≥ ρ, and dz_1/dt = −(q_{12}^1 + q_{21}^1) z_1 − (q_{21}^0 − q_{21}^1) ρ + q_{21}^0 for z_1 ≤ ρ. In general dz/dt = A(z) z + b(z), where A(z) and b(z) are constant within k polyhedral regions.
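
A tiny sketch (my own; the rates and ρ are made up) that integrates the k = 2 fluid equation above by Euler's method, so one can watch z_1(t) settle toward, or fail to settle toward, an equilibrium.

```python
import numpy as np

def fluid_z1(q0, q1, rho, z1=1.0, dt=0.01, T=30.0):
    """Euler integration of the k = 2 fluid equation for z1 (priority to state 1).
    q0, q1 are 2x2 rate matrices for the passive and active actions."""
    zs = [z1]
    for _ in range(int(T / dt)):
        if z1 >= rho:
            dz = -(q0[0][1] + q0[1][0]) * z1 + (q0[0][1] - q1[0][1]) * rho + q0[1][0]
        else:
            dz = -(q1[0][1] + q1[1][0]) * z1 - (q0[1][0] - q1[1][0]) * rho + q0[1][0]
        z1 += dt * dz
        zs.append(z1)
    return np.array(zs)

# Made-up rates: q[a][i][j] = rate of i -> j under action a.
q0 = [[0.0, 0.2], [0.1, 0.0]]
q1 = [[0.0, 1.0], [0.8, 0.0]]
print(fluid_z1(q0, q1, rho=0.4)[-1])   # approximate equilibrium value of z1
```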

157 Asymptotic Optimality Theorem [W. and Weiss 90] If bandits are indexable, and the fluid model has an asymptotically stable equilibrium point, then the Whittle index heuristic is asymptotically optimal, in the sense that the reward per bandit tends to the reward that is obtained under the relaxed policy. (proof via a theorem about law of large numbers for sample paths.)

159 Heuristic May Not be Asymptotically Optimal. [The 4×4 transition-rate matrices q_{ij}^0 and q_{ij}^1 are omitted.] r^0 = (0, 1, 10, 10), r^1 = (10, 10, 10, 0), ρ = 0.835. The bandit is indexable. The equilibrium point is (z̄_1, z̄_2, z̄_3, z̄_4) = (0.409, 0.327, 0.100, 0.164), with z̄_1 + z̄_2 + z̄_3 = 0.836. The relaxed policy obtains 10 per bandit per unit time.

160 Heuristic is Not Asymptotically Optimal. But the equilibrium point z̄ is not asymptotically stable. [Figure omitted: fluid trajectories of z(t) cycling around z̄, with the regions in which a = 1, a = 0 and a = 0/1 marked.] The relaxed policy obtains 10 per bandit; the heuristic obtains strictly less per bandit.

161 Questions


More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Restless Bandits with Switching Costs: Linear Programming Relaxations, Performance Bounds and Limited Lookahead Policies

Restless Bandits with Switching Costs: Linear Programming Relaxations, Performance Bounds and Limited Lookahead Policies Restless Bandits with Switching Costs: Linear Programming Relaations, Performance Bounds and Limited Lookahead Policies Jerome Le y Laboratory for Information and Decision Systems Massachusetts Institute

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality.

Topic: Primal-Dual Algorithms Date: We finished our discussion of randomized rounding and began talking about LP Duality. CS787: Advanced Algorithms Scribe: Amanda Burton, Leah Kluegel Lecturer: Shuchi Chawla Topic: Primal-Dual Algorithms Date: 10-17-07 14.1 Last Time We finished our discussion of randomized rounding and

More information

CO759: Algorithmic Game Theory Spring 2015

CO759: Algorithmic Game Theory Spring 2015 CO759: Algorithmic Game Theory Spring 2015 Instructor: Chaitanya Swamy Assignment 1 Due: By Jun 25, 2015 You may use anything proved in class directly. I will maintain a FAQ about the assignment on the

More information

Linear Programming Methods

Linear Programming Methods Chapter 11 Linear Programming Methods 1 In this chapter we consider the linear programming approach to dynamic programming. First, Bellman s equation can be reformulated as a linear program whose solution

More information

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement

Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Near-Optimal Control of Queueing Systems via Approximate One-Step Policy Improvement Jefferson Huang March 21, 2018 Reinforcement Learning for Processing Networks Seminar Cornell University Performance

More information

Abstract We present a polyhedral framework for establishing general structural properties on optimal solutions of stochastic scheduling problems, wher

Abstract We present a polyhedral framework for establishing general structural properties on optimal solutions of stochastic scheduling problems, wher On certain greedoid polyhedra, partially indexable scheduling problems, and extended restless bandit allocation indices Submitted to Mathematical Programming José Ni~no-Mora Λ April 5, 2 Λ Dept. of Economics

More information

Performance of Round Robin Policies for Dynamic Multichannel Access

Performance of Round Robin Policies for Dynamic Multichannel Access Performance of Round Robin Policies for Dynamic Multichannel Access Changmian Wang, Bhaskar Krishnamachari, Qing Zhao and Geir E. Øien Norwegian University of Science and Technology, Norway, {changmia,

More information

Understanding the Simplex algorithm. Standard Optimization Problems.

Understanding the Simplex algorithm. Standard Optimization Problems. Understanding the Simplex algorithm. Ma 162 Spring 2011 Ma 162 Spring 2011 February 28, 2011 Standard Optimization Problems. A standard maximization problem can be conveniently described in matrix form

More information

The knowledge gradient method for multi-armed bandit problems

The knowledge gradient method for multi-armed bandit problems The knowledge gradient method for multi-armed bandit problems Moving beyond inde policies Ilya O. Ryzhov Warren Powell Peter Frazier Department of Operations Research and Financial Engineering Princeton

More information

minimize x subject to (x 2)(x 4) u,

minimize x subject to (x 2)(x 4) u, Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for

More information

Optimality Conditions for Constrained Optimization

Optimality Conditions for Constrained Optimization 72 CHAPTER 7 Optimality Conditions for Constrained Optimization 1. First Order Conditions In this section we consider first order optimality conditions for the constrained problem P : minimize f 0 (x)

More information

Simplex Algorithm for Countable-state Discounted Markov Decision Processes

Simplex Algorithm for Countable-state Discounted Markov Decision Processes Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision

More information

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret

Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret Learning of Uncontrolled Restless Bandits with Logarithmic Strong Regret Cem Tekin, Member, IEEE, Mingyan Liu, Senior Member, IEEE 1 Abstract In this paper we consider the problem of learning the optimal

More information

Linear Programming Redux

Linear Programming Redux Linear Programming Redux Jim Bremer May 12, 2008 The purpose of these notes is to review the basics of linear programming and the simplex method in a clear, concise, and comprehensive way. The book contains

More information

Markov Decision Processes Infinite Horizon Problems

Markov Decision Processes Infinite Horizon Problems Markov Decision Processes Infinite Horizon Problems Alan Fern * * Based in part on slides by Craig Boutilier and Daniel Weld 1 What is a solution to an MDP? MDP Planning Problem: Input: an MDP (S,A,R,T)

More information

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS

OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS OPTIMALITY OF RANDOMIZED TRUNK RESERVATION FOR A PROBLEM WITH MULTIPLE CONSTRAINTS Xiaofei Fan-Orzechowski Department of Applied Mathematics and Statistics State University of New York at Stony Brook Stony

More information

A Primal-Dual Algorithm for Computing a Cost Allocation in the. Core of Economic Lot-Sizing Games

A Primal-Dual Algorithm for Computing a Cost Allocation in the. Core of Economic Lot-Sizing Games 1 2 A Primal-Dual Algorithm for Computing a Cost Allocation in the Core of Economic Lot-Sizing Games 3 Mohan Gopaladesikan Nelson A. Uhan Jikai Zou 4 October 2011 5 6 7 8 9 10 11 12 Abstract We consider

More information

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes

Today s s Lecture. Applicability of Neural Networks. Back-propagation. Review of Neural Networks. Lecture 20: Learning -4. Markov-Decision Processes Today s s Lecture Lecture 20: Learning -4 Review of Neural Networks Markov-Decision Processes Victor Lesser CMPSCI 683 Fall 2004 Reinforcement learning 2 Back-propagation Applicability of Neural Networks

More information

21 Markov Decision Processes

21 Markov Decision Processes 2 Markov Decision Processes Chapter 6 introduced Markov chains and their analysis. Most of the chapter was devoted to discrete time Markov chains, i.e., Markov chains that are observed only at discrete

More information

6. Linear Programming

6. Linear Programming Linear Programming 6-1 6. Linear Programming Linear Programming LP reduction Duality Max-flow min-cut, Zero-sum game Integer Programming and LP relaxation Maximum Bipartite Matching, Minimum weight vertex

More information

Math 456: Mathematical Modeling. Tuesday, March 6th, 2018

Math 456: Mathematical Modeling. Tuesday, March 6th, 2018 Math 456: Mathematical Modeling Tuesday, March 6th, 2018 Markov Chains: Exit distributions and the Strong Markov Property Tuesday, March 6th, 2018 Last time 1. Weighted graphs. 2. Existence of stationary

More information

1. Algebraic and geometric treatments Consider an LP problem in the standard form. x 0. Solutions to the system of linear equations

1. Algebraic and geometric treatments Consider an LP problem in the standard form. x 0. Solutions to the system of linear equations The Simplex Method Most textbooks in mathematical optimization, especially linear programming, deal with the simplex method. In this note we study the simplex method. It requires basically elementary linear

More information

Reinforcement Learning

Reinforcement Learning 1 Reinforcement Learning Chris Watkins Department of Computer Science Royal Holloway, University of London July 27, 2015 2 Plan 1 Why reinforcement learning? Where does this theory come from? Markov decision

More information

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past. 1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if

More information

A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions

A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions arxiv:188.6314v2 [math.pr 22 Aug 218 A General Framework of Multi-Armed Bandit Processes by Arm Switch Restrictions Wenqing Bao 1, Xiaoqiang Cai 2, Xianyi Wu 3 1 Zhejiang Normal University, Jinhua Zhejiang,

More information

Optimal learning with non-gaussian rewards

Optimal learning with non-gaussian rewards Optimal learning with non-gaussian rewards Zi Ding Ilya O. Ryzhov November 2, 214 Abstract We propose a novel theoretical characterization of the optimal Gittins index policy in multi-armed bandit problems

More information

Deterministic Dynamic Programming

Deterministic Dynamic Programming Deterministic Dynamic Programming 1 Value Function Consider the following optimal control problem in Mayer s form: V (t 0, x 0 ) = inf u U J(t 1, x(t 1 )) (1) subject to ẋ(t) = f(t, x(t), u(t)), x(t 0

More information

A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling

A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling A Generic Mean Field Model for Optimization in Large-scale Stochastic Systems and Applications in Scheduling Nicolas Gast Bruno Gaujal Grenoble University Knoxville, May 13th-15th 2009 N. Gast (LIG) Mean

More information

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R.

Ergodic Theorems. Samy Tindel. Purdue University. Probability Theory 2 - MA 539. Taken from Probability: Theory and examples by R. Ergodic Theorems Samy Tindel Purdue University Probability Theory 2 - MA 539 Taken from Probability: Theory and examples by R. Durrett Samy T. Ergodic theorems Probability Theory 1 / 92 Outline 1 Definitions

More information

II. Analysis of Linear Programming Solutions

II. Analysis of Linear Programming Solutions Optimization Methods Draft of August 26, 2005 II. Analysis of Linear Programming Solutions Robert Fourer Department of Industrial Engineering and Management Sciences Northwestern University Evanston, Illinois

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture

More information

Week 3 Linear programming duality

Week 3 Linear programming duality Week 3 Linear programming duality This week we cover the fascinating topic of linear programming duality. We will learn that every minimization program has associated a maximization program that has the

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

1 Random Variable: Topics

1 Random Variable: Topics Note: Handouts DO NOT replace the book. In most cases, they only provide a guideline on topics and an intuitive feel. 1 Random Variable: Topics Chap 2, 2.1-2.4 and Chap 3, 3.1-3.3 What is a random variable?

More information

http://www.math.uah.edu/stat/markov/.xhtml 1 of 9 7/16/2009 7:20 AM Virtual Laboratories > 16. Markov Chains > 1 2 3 4 5 6 7 8 9 10 11 12 1. A Markov process is a random process in which the future is

More information

ARTIFICIAL INTELLIGENCE. Reinforcement learning

ARTIFICIAL INTELLIGENCE. Reinforcement learning INFOB2KI 2018-2019 Utrecht University The Netherlands ARTIFICIAL INTELLIGENCE Reinforcement learning Lecturer: Silja Renooij These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

1 Problem Formulation

1 Problem Formulation Book Review Self-Learning Control of Finite Markov Chains by A. S. Poznyak, K. Najim, and E. Gómez-Ramírez Review by Benjamin Van Roy This book presents a collection of work on algorithms for learning

More information

Multi-armed Bandits with Limited Exploration

Multi-armed Bandits with Limited Exploration Multi-armed Bandits with Limited Exploration Sudipto Guha Kamesh Munagala Abstract A central problem to decision making under uncertainty is the trade-off between exploration and exploitation: between

More information

Infinite-Horizon Discounted Markov Decision Processes

Infinite-Horizon Discounted Markov Decision Processes Infinite-Horizon Discounted Markov Decision Processes Dan Zhang Leeds School of Business University of Colorado at Boulder Dan Zhang, Spring 2012 Infinite Horizon Discounted MDP 1 Outline The expected

More information

Part 1. The Review of Linear Programming

Part 1. The Review of Linear Programming In the name of God Part 1. The Review of Linear Programming 1.5. Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Formulation of the Dual Problem Primal-Dual Relationship Economic Interpretation

More information

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes

Optimization. Charles J. Geyer School of Statistics University of Minnesota. Stat 8054 Lecture Notes Optimization Charles J. Geyer School of Statistics University of Minnesota Stat 8054 Lecture Notes 1 One-Dimensional Optimization Look at a graph. Grid search. 2 One-Dimensional Zero Finding Zero finding

More information

Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Stratégies bayésiennes et fréquentistes dans un modèle de bandit Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016

More information

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes

Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Computational complexity estimates for value and policy iteration algorithms for total-cost and average-cost Markov decision processes Jefferson Huang Dept. Applied Mathematics and Statistics Stony Brook

More information

Lecture: Duality of LP, SOCP and SDP

Lecture: Duality of LP, SOCP and SDP 1/33 Lecture: Duality of LP, SOCP and SDP Zaiwen Wen Beijing International Center For Mathematical Research Peking University http://bicmr.pku.edu.cn/~wenzw/bigdata2017.html wenzw@pku.edu.cn Acknowledgement:

More information

Optimization 4. GAME THEORY

Optimization 4. GAME THEORY Optimization GAME THEORY DPK Easter Term Saddle points of two-person zero-sum games We consider a game with two players Player I can choose one of m strategies, indexed by i =,, m and Player II can choose

More information

A linear programming approach to constrained nonstationary infinite-horizon Markov decision processes

A linear programming approach to constrained nonstationary infinite-horizon Markov decision processes A linear programming approach to constrained nonstationary infinite-horizon Markov decision processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith Technical Report 13-01 March 6, 2013 University

More information

Optimal and Heuristic Resource Allocation Policies in Serial Production Systems

Optimal and Heuristic Resource Allocation Policies in Serial Production Systems Clemson University TigerPrints All Theses Theses 12-2008 Optimal and Heuristic Resource Allocation Policies in Serial Production Systems Ramesh Arumugam Clemson University, rarumug@clemson.edu Follow this

More information

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs

- Well-characterized problems, min-max relations, approximate certificates. - LP problems in the standard form, primal and dual linear programs LP-Duality ( Approximation Algorithms by V. Vazirani, Chapter 12) - Well-characterized problems, min-max relations, approximate certificates - LP problems in the standard form, primal and dual linear programs

More information

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains

Markov Chains CK eqns Classes Hitting times Rec./trans. Strong Markov Stat. distr. Reversibility * Markov Chains Markov Chains A random process X is a family {X t : t T } of random variables indexed by some set T. When T = {0, 1, 2,... } one speaks about a discrete-time process, for T = R or T = [0, ) one has a continuous-time

More information