Ordinal Optimization and Multi-Armed Bandit Techniques


Sandeep Juneja, with Peter Glynn. September 10, 2014

The ordinal optimization problem

Determining the best of d alternative designs for a system, on the basis of Monte Carlo simulation of each of the designs.

More precisely, the d different designs are compared on the basis of an associated (random) performance measure $X(i)$, $i \le d$, and the goal is to identify $i^* = \arg\min_{1 \le j \le d} \mu(j)$, where $\mu(j) \triangleq EX(j)$, $1 \le j \le d$. The goal is only to identify the best design, not to estimate its performance.

We have the ability to generate iid realizations of each of the d random variables.

We focus primarily on d = 2: given independent samples of the difference X between the two performance measures, we want to determine whether its mean is positive or negative.

Talk Overview

Estimating the difference of mean values relies on the central limit theorem, with an associated slow convergence rate.

Ho and others observed (1990) that identifying the best system typically has a faster convergence rate.

Dai (1996) showed, in a fairly general framework using large deviations methods, that the probability of false selection decays at an exponential rate under mild light-tailed assumptions.

Glynn and Juneja (2004) optimized the large deviations rate function associated with this probability to determine the optimal computational budget allocation to each design, minimising the false selection probability. A significant literature since then relies on large deviations analysis.

The expectation was that one could obtain algorithms guaranteeing that the probability of error is bounded above by δ using O(log(1/δ)) computational effort.

However, these large deviations based methods need to estimate the underlying large deviations rate functions from the samples generated.

We argue, through two reasonable settings, that these rate functions are difficult to estimate accurately (NOT due to heavy tails of the estimated MGFs); the probability of mis-estimation will generally dominate the underlying large deviations probability, making it difficult to build algorithms with a log(1/δ) convergence rate.

Further, we show that given any (ε, δ) algorithm, that is, one that correctly separates designs with mean difference at least ε with probability at least 1 − δ, and given any constant K, one can always find designs (in a large class) that require more than K log(1/δ) effort.

Under explicitly available moment upper bounds, we develop truncation-based (ε, δ) algorithms with O(log(1/δ)) computation time.

We also adapt recently proposed sequential algorithms from the multi-armed bandit regret setting to this pure exploration setting.

A two-phase implementation

Consider a single rv X with unknown mean EX. We need to decide whether EX > 0 or EX ≤ 0, with error probability δ (as δ → 0).

Suppose we knew the large deviations rate function
$$I(x) = \sup_{\theta \in \mathbb{R}} \big(\theta x - \Lambda(\theta)\big), \quad \text{where } \Lambda(\theta) = \log E e^{\theta X}.$$

Then, if EX < 0, we may take $\exp(-n \inf_{x \ge 0} I(x))$ as a proxy for the probability of false selection after n samples.

[Figure: the rate function I(x); it is zero at x = EX and increases as x moves away from EX, so $\inf_{x \ge 0} I(x) = I(0)$ when EX < 0.]

Hence, $\exp(-n I(0))$ serves as a proxy for the false selection probability. This proxy holds even if EX > 0.

Thus, $n = \log(1/\delta)/I(0)$ samples ensure that P(FS) ≤ δ.
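As a quick sanity check (my example, not from the talk): for a Gaussian X with mean μ < 0 and variance σ², the rate function is $I(x) = (x-\mu)^2/(2\sigma^2)$, so $I(0) = \mu^2/(2\sigma^2)$ and the proxy sample count is explicit.

```python
import math

# Hypothetical Gaussian example: X ~ N(mu, sigma^2) with mu < 0.
# Rate function I(x) = (x - mu)^2 / (2 sigma^2), so I(0) = mu^2 / (2 sigma^2).
mu, sigma = -0.5, 1.0
delta = 1e-6

I0 = mu**2 / (2 * sigma**2)           # = 0.125 here
n = math.log(1 / delta) / I0          # samples needed so exp(-n I(0)) <= delta
print(f"I(0) = {I0:.4f}, n = {math.ceil(n)} samples")  # about 111 samples
```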

Hence, one reasonable estimation procedure is:

1. Generate m = log(1/δ) samples in the first phase to estimate I(0) by $\hat I_m(0)$.
2. Generate $\log(1/\delta)/\hat I_m(0) = m/\hat I_m(0)$ samples of X in the second phase, and decide the sign of EX based on whether the second-phase sample average $\bar X > 0$ or $\bar X \le 0$.

We now discuss some pitfalls of this methodology.

Graphic view of I(0)

The log moment generating function of X, $\Lambda(\theta) = \log E \exp(\theta X)$, is convex with $\Lambda(0) = 0$ and $\Lambda'(0) = EX$. Then $I(0) = -\inf_\theta \Lambda(\theta)$.

[Figure: graph of the convex function Λ(θ), with slope EX at the origin and minimum value −I(0).]
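A small illustration of the lost figure (my sketch, not from the talk): plotting Λ(θ) for a Gaussian with negative mean shows the convex curve dipping below zero, with depth I(0) at its minimum.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical Gaussian example: Lambda(theta) = theta*mu + theta^2 sigma^2 / 2.
mu, sigma = -0.5, 1.0
theta = np.linspace(-1.5, 0.5, 200)
Lam = theta * mu + 0.5 * theta**2 * sigma**2

plt.plot(theta, Lam, label=r"$\Lambda(\theta)$")
plt.axhline(0, color="gray", lw=0.5)
# Minimum at theta* = -mu/sigma^2; depth below zero equals I(0) = mu^2/(2 sigma^2).
theta_star = -mu / sigma**2
plt.scatter([theta_star], [theta_star * mu + 0.5 * theta_star**2 * sigma**2])
plt.xlabel(r"$\theta$"); plt.legend(); plt.show()
```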

Estimating I(0)

We generate samples $X_1, \ldots, X_m$ and first estimate the function
$$\hat\Lambda_m(\theta) = \log\left(\frac{1}{m}\sum_{i=1}^m \exp(\theta X_i)\right),$$
and set $\hat I_m(0) = -\inf_\theta \hat\Lambda_m(\theta)$.

Then we generate $\log(1/\delta)/\hat I_m(0) = m/\hat I_m(0)$ samples of X in the second phase.

Note that large values of $\exp(\theta X_i)$ raise the curve $\hat\Lambda_m$; they do not lower it. The undersampling in the second phase happens due to conspiratorial large deviations behaviour of all the terms.
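A minimal sketch of the two-phase procedure described above (my illustration; the talk gives no code), using the empirical log-MGF and a one-dimensional minimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def empirical_log_mgf(theta, xs):
    """Estimate Lambda(theta) = log E exp(theta X) from samples xs."""
    return np.log(np.mean(np.exp(theta * xs)))

def two_phase_sign_test(sample, delta, rng):
    """Decide the sign of EX; `sample(k, rng)` returns k iid draws of X."""
    m = int(np.ceil(np.log(1 / delta)))
    xs = sample(m, rng)
    # Phase 1: estimate I(0) = -inf_theta Lambda_hat(theta).
    res = minimize_scalar(lambda t: empirical_log_mgf(t, xs),
                          bounds=(-10.0, 10.0), method="bounded")
    I0_hat = max(-res.fun, 1e-12)        # guard against a degenerate estimate
    # Phase 2: draw m / I0_hat fresh samples and look at their average.
    n = int(np.ceil(m / I0_hat))
    return "EX > 0" if sample(n, rng).mean() > 0 else "EX <= 0"

rng = np.random.default_rng(0)
print(two_phase_sign_test(lambda k, r: r.normal(-0.5, 1.0, k), 1e-4, rng))
```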

Graphic view of the estimated log moment generating function

[Figure: the empirical curve $\hat\Lambda_m(\theta)$ and its minimum depth $\hat I_m(0)$.]

Lower Bounding P(FS)

For expository convenience, take
$$P(FS) = E \exp\left(-m\, \frac{I(0)}{\hat I_m(0)}\right), \quad \text{where } m = \log(1/\delta).$$

Then, for a > 0 and small ε > 0, since $\hat I_m(0) \ge -\hat\Lambda_m(\theta)$ for every θ,
$$\frac{1}{m}\log P(FS) \ge \sup_\theta \frac{1}{m}\log E \exp\left(-m\,\frac{I(0)}{-\hat\Lambda_m(\theta)}\right) \ge \sup_\theta \frac{1}{m}\log\left[\exp\left(-m\,\frac{I(0)}{a-\epsilon}\right) P\big(\hat\Lambda_m(\theta) \in (-a-\epsilon, -a+\epsilon)\big)\right].$$

Letting ε ↓ 0,
$$\liminf_m \frac{1}{m}\log P(FS) \ge \sup_{a>0}\sup_\theta\left(-\frac{I(0)}{a} - I_\theta(e^{-a})\right),$$
where $I_\theta(\nu) = \sup_\alpha\big(\alpha\nu - \log E \exp(\alpha e^{\theta X})\big)$ is the large deviations rate function of the empirical mean of the $\exp(\theta X_i)$.

Further, $I_{\theta^*}(e^{-I(0)}) = 0$ for $\theta^*$ such that $\inf_\theta \Lambda(\theta) = \Lambda(\theta^*)$. Taking $a = I(0)$ and $\theta = \theta^*$ gives
$$\liminf_m \frac{1}{m}\log P(FS) \ge -1,$$
so no matter how large the true rate I(0) is, mis-estimation in the first phase keeps P(FS) at least of order $e^{-m} = \delta$.

Another common estimation method

Generate m = c log(1/δ) samples in the first phase to estimate I(0) by $\hat I_m(0)$.

If $\exp(-m \hat I_m(0)) \le \delta$, stop. Else, provide another c log(1/δ) of computational budget, and so on.

We now identify distributions for which this would not be accurate.

We need to find X with EX < 0 such that $\bar X_m \ge 0$ and $\exp(-m \hat I_m(0)) \le \delta$ with probability higher than δ. (Recall m = c log(1/δ).)

Choose X so that $\exp(-c \log(1/\delta)\, I(0)) \gg \delta$, that is,
$$I(0) < 1/c, \quad \text{or equivalently} \quad 0 > \inf_\theta \Lambda(\theta) > -1/c.$$

Furthermore, we need
$$P\big(\bar X_m \ge 0 \text{ and } \exp(-m \hat I_m(0)) \le \delta\big) \ge \delta.$$

It suffices to find θ < 0 such that $P\big(\hat\Lambda_m(\theta) \le -1/c\big) \ge \delta$. Roughly, then,
$$\exp\big(-m\, I_\theta(e^{-1/c})\big) > \delta, \quad \text{or} \quad I_\theta(e^{-1/c}) < 1/c.$$

Theorem - Stay Tuned

Graphic view

[Figure: the curve Λ(θ) against θ, with the level −1/c marked below zero.]

Stronger negative result (adapted from Lai and Robbins 1985, Mannor and Tsitsiklis 2004)

Let D contain pdfs such that:

If f, g ∈ D then $I(g, f) \triangleq \int \log\left(\frac{g(x)}{f(x)}\right) g(x)\,dx < \infty$.

Each g ∈ D has a finite moment generating function in a neighbourhood of zero.

Suppose there exists an (ε, δ) policy, i.e., given two arms separated by a mean of at least ε > 0, it finds the arm with the larger mean with probability at least 1 − δ. Let $T_g(\epsilon, \delta)$ be the time it spends on arm g. Then,
$$\liminf_{\delta \to 0} \frac{E\, T_g(\epsilon, \delta)}{\log(1/\delta)} \ge \frac{\text{Const.}}{I(g, f) + O(\epsilon)} \quad \text{for } g, f \in D,\ \mu_g < \mu_f - \epsilon.$$

Same output, different measures

Let $\theta_\epsilon$ be such that $\Lambda_f'(\theta_\epsilon) = \mu_f + \epsilon$, and let
$$f_{\theta_\epsilon}(x) = \exp\big(\theta_\epsilon x - \Lambda_f(\theta_\epsilon)\big) f(x)$$
be the corresponding exponentially tilted density.

[Figure: under measure $P_A$ the two arms have densities (g, f), generating samples $Y_1, Y_2, \ldots, Y_{T_g}$ and $X_1, X_2, \ldots, X_{T_f}$; under measure $P_B$ arm g is replaced by the tilted density $f_{\theta_\epsilon}$, whose mean exceeds $\mu_f$, so the identity of the best arm flips.]

Note that $P_A(\text{algorithm announces } f) \ge 1 - \delta$, while $P_B(\text{algorithm announces } f) \le \delta$ (under $P_B$ the tilted arm has the larger mean, so a correct policy rarely announces f).

By a change of measure, writing I(f) for the indicator that the algorithm announces f,
$$P_B(f) = E_{P_A}\left[\prod_{i=1}^{T_g} \frac{f_{\theta_\epsilon}(Y_i)}{g(Y_i)}\; I(f)\right] = E_{P_A}\left[\exp\left(-\sum_{i=1}^{T_g}\log\frac{g(Y_i)}{f(Y_i)} + \theta_\epsilon \sum_{i=1}^{T_g} Y_i - T_g\,\Lambda_f(\theta_\epsilon)\right) I(f)\right]$$
$$= E_{P_A}\left[\exp\big(-E T_g\, I(g,f) + E T_g\,(\theta_\epsilon \mu_g - \Lambda_f(\theta_\epsilon)) + \text{small}\big)\; I(\text{set of high probability})\right],$$
and the result is easily deduced.

Way forward

Additional information is needed to attain log(1/δ) convergence rates.

A great deal of structure is typically known about models used in simulation; often upper bounds on moments may be available. It is easy to develop such bounds once suitable Lyapunov functions are identified (not discussed here).

Assuming such bounds are available, one may use them to develop (ε, δ) strategies by truncating the random variables while controlling the truncation error to be less than ε, and then applying Hoeffding-type bounds for bounded random variables.

Multi-armed bandit methods have recently been developed that do this in a sequential and adaptive manner.

A useful observation

Suppose $\mathcal{X}$ is a class of non-negative random variables and f is a strictly increasing convex function. Consider the optimization problem
$$\max_{X \in \mathcal{X}} E[X\, I(X \ge x)] \quad \text{such that} \quad E f(X) \le a.$$

This has a two-point solution, relying on the observation that if
$$Y = E[X \mid X < x]\, I(X < x) + E[X \mid X \ge x]\, I(X \ge x),$$
then EY = EX, $E[Y\, I(Y \ge x)] = E[X\, I(X \ge x)]$, and $E f(Y) \le E f(X)$ (by Jensen's inequality applied conditionally).
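A quick numerical check of this observation (my sketch, not from the talk), using an exponential random variable and $f(x) = x^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(1.0, size=1_000_000)
x = 2.0                                   # truncation threshold
above = X >= x

# Conditional averaging: Y collapses X to two points (the conditional means).
Y = np.where(above, X[above].mean(), X[~above].mean())

print("EX  =", X.mean(), " EY  =", Y.mean())                      # equal
print("E[X 1{X>=x}] =", (X * above).mean(),
      " E[Y 1{Y>=x}] =", (Y * above).mean())                      # equal
print("E[f(X)] =", (X**2).mean(), ">= E[f(Y)] =", (Y**2).mean())  # Jensen
```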

Obtaining exponential convergence guarantees

We consider $\mathcal{X}_\epsilon = \{X : EX > \epsilon\}$, where each $X = A - B$ with A, B non-negative.

We assume that, for each $\tilde\epsilon$, we can find truncation levels $R_a(\tilde\epsilon)$ and $R_b(\tilde\epsilon)$ such that truncation removes at most $\tilde\epsilon$ of the mean: $E[A\, I(A \ge R_a(\tilde\epsilon))] \le \tilde\epsilon$ and $E[B\, I(B \ge R_b(\tilde\epsilon))] \le \tilde\epsilon$.

If $X = A - B \in \mathcal{X}_\epsilon$, then for $\beta \in (0,1)$,
$$E\big[A\, I(A < R_a(\beta\epsilon)) - B\, I(B < R_b(\beta\epsilon))\big] \ge EX - \beta\epsilon > (1-\beta)\epsilon.$$
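For instance (my illustration, assuming a known second-moment bound $E A^2 \le s_a$, which the talk does not specify): since $A \ge R$ implies $A \le A^2/R$,
$$E[A\, I(A \ge R)] \le \frac{E A^2}{R} \le \frac{s_a}{R},$$
so taking $R_a(\tilde\epsilon) = s_a/\tilde\epsilon$ suffices.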

Our algorithm then is:

1. Generate n independent samples of $A\, I(A < R_a(\beta\epsilon)) - B\, I(B < R_b(\beta\epsilon))$.
2. Refer to these as $Y_1, Y_2, \ldots, Y_n$ and compute $\bar Y_n = \frac{1}{n}\sum_{i=1}^n Y_i$.
3. If $\bar Y_n \ge 0$, declare that EX > 0.
4. If $\bar Y_n < 0$, declare that EX < 0.
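A minimal sketch of this truncation-based test (my illustration; names like `R_a`, `R_b`, `sample_A`, `sample_B` are assumptions, not from the talk):

```python
import numpy as np

def truncated_sign_test(sample_A, sample_B, R_a, R_b, eps, beta, n, rng):
    """Declare the sign of EX for X = A - B, using truncation levels
    R_a(beta*eps), R_b(beta*eps) that each remove at most beta*eps of mean."""
    a = sample_A(n, rng)
    b = sample_B(n, rng)
    # Truncate: keep samples below the levels, zero out the rest.
    y = a * (a < R_a(beta * eps)) - b * (b < R_b(beta * eps))
    return "EX > 0" if y.mean() >= 0 else "EX < 0"

# Toy usage with exponential arms and second-moment-based levels R(e) = s/e.
rng = np.random.default_rng(2)
s_a, s_b = 2 * 1.5**2, 2 * 1.0**2          # E A^2 = 2*mu^2 for exponentials
R = lambda s: (lambda e: s / e)
print(truncated_sign_test(lambda n, r: r.exponential(1.5, n),
                          lambda n, r: r.exponential(1.0, n),
                          R(s_a), R(s_b), eps=0.4, beta=0.5, n=4000, rng=rng))
```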

Using the Hoeffding inequality to bound P(FS)

Suppose that $EX < -\epsilon$. Then $EY_i < -(1-\beta)\epsilon$. Also, $-R_b(\beta\epsilon) \le Y_i \le R_a(\beta\epsilon)$.

One can select
$$n_\delta = \frac{\big(R_a(\beta\epsilon) + R_b(\beta\epsilon)\big)^2}{2(1-\beta)^2\epsilon^2}\,\log(1/\delta).$$

Furthermore, β may be selected to minimize $\dfrac{\big(R_a(\beta\epsilon) + R_b(\beta\epsilon)\big)^2}{(1-\beta)^2}$.
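A small sketch of the sample-size computation and the choice of β (my illustration, again using the hypothetical second-moment levels $R(\tilde\epsilon) = s/\tilde\epsilon$):

```python
import math

def n_delta(R_a, R_b, eps, beta, delta):
    """Hoeffding sample size guaranteeing P(false selection) <= delta."""
    width = R_a(beta * eps) + R_b(beta * eps)
    return math.ceil(width**2 / (2 * (1 - beta)**2 * eps**2) * math.log(1 / delta))

s_a, s_b, eps, delta = 4.5, 2.0, 0.4, 1e-6
R_a, R_b = (lambda e: s_a / e), (lambda e: s_b / e)

# Grid search over beta to minimize (R_a + R_b)^2 / (1 - beta)^2.
best = min((n_delta(R_a, R_b, eps, b, delta), b)
           for b in [i / 100 for i in range(1, 100)])
print(f"optimal beta ~ {best[1]:.2f}, n_delta = {best[0]}")
```

For levels of the form $R(\tilde\epsilon) = s/\tilde\epsilon$, the objective is proportional to $1/(\beta^2(1-\beta)^2)$, so the grid search lands at β = 1/2.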

Pure exploration bandit algorithms

There are n arms in total. Each arm a, when sampled, gives a Bernoulli reward with mean $\mu_a > 0$.

Let $a^* = \arg\max_{a \in A} \mu_a$ be the arm with the largest mean, and let $\Delta_a = \mu_{a^*} - \mu_a$, assumed positive for all $a \ne a^*$.

Even-Dar, Mannor and Mansour (2006) devise a sequential sampling strategy amongst these arms that finds $a^*$ with probability at least 1 − δ (for a pre-specified small δ), with total number of samples
$$O\left(\sum_{a \ne a^*} \frac{\ln(n/\delta)}{\Delta_a^2}\right).$$

Foundational observation in much of the related bandit literature

Suppose that for an arm a with mean $\mu_a$, the sample mean based on t observations is denoted by $\hat\mu_a^t$. Let
$$\alpha_t = \sqrt{\log(5nt^2/\delta)/t}.$$

Let $E_{a,\delta} = \{|\hat\mu_a^t - \mu_a| < \alpha_t \text{ for all } t\}$.

Then, from Hoeffding, we have for any t,
$$P\big(|\hat\mu_a^t - \mu_a| \ge \alpha_t\big) \le \frac{2\delta}{5nt^2}.$$

Hence, it follows that $P(E_{a,\delta}) \ge 1 - \delta/n$ (summing the bound over t, since $\sum_t 1/t^2 = \pi^2/6 < 5/2$), so that if $E_\delta = \cap_a E_{a,\delta}$, then $P(E_\delta) \ge 1 - \delta$.

Their algorithm relies on the fact that on $E_\delta$ it always picks the correct winner, and on this set it quickly fathoms away the losers.

Successive Rejection Algorithm

Sample every surviving arm a once per round, and let $\hat\mu_a^t$ be the average reward of arm a by time t.

Let $\hat\mu_{\max}^t = \max_a \hat\mu_a^t$, and recall that $\alpha_t = \sqrt{\log(5nt^2/\delta)/t}$.

Each arm a such that $\hat\mu_{\max}^t - \hat\mu_a^t \ge 2\alpha_t$ is removed from consideration.

Set t = t + 1; repeat till one arm is left. A sketch follows below.
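A minimal sketch of this successive rejection scheme for Bernoulli arms (my illustration; the talk gives only the steps above):

```python
import numpy as np

def successive_rejection(means, delta, rng, max_rounds=100_000):
    """Identify the best of n Bernoulli arms with probability >= 1 - delta."""
    n = len(means)
    alive = list(range(n))
    sums = np.zeros(n)
    for t in range(1, max_rounds + 1):
        # One sample from every surviving arm this round.
        for a in alive:
            sums[a] += rng.random() < means[a]
        mu_hat = sums[alive] / t
        alpha_t = np.sqrt(np.log(5 * n * t**2 / delta) / t)
        # Drop arms whose empirical mean trails the leader by 2*alpha_t.
        alive = [a for a, m in zip(alive, mu_hat)
                 if mu_hat.max() - m < 2 * alpha_t]
        if len(alive) == 1:
            return alive[0]
    return max(alive, key=lambda a: sums[a])   # budget exhausted: best guess

rng = np.random.default_rng(3)
print(successive_rejection([0.5, 0.45, 0.7], delta=0.01, rng=rng))  # expect 2
```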

Graphical (inaccurate) representation

[Figure: on a [0, 1] scale, confidence bands around $\mu_1$ and $\mu_2$ shrinking over time t until the inferior arm separates and is rejected.]

Generalizing to heavy tails

Bubeck, Cesa-Bianchi and Lugosi (2013) develop log(1/δ) algorithms in regret settings when bounds on the 1 + ε moments of each arm's output are available.

The analysis again relies on forming a confidence cone, which they do through truncation and clever use of the Bernstein inequality.

We perform some minor optimizations on their algorithm.
