Tractable Sampling Strategies for Ordinal Optimization


Dongwook Shin, Mark Broadie, Assaf Zeevi
Graduate School of Business, Columbia University, New York, NY 10025
{dshin17@gsb.columbia.edu, mnb@columbia.edu, assaf@gsb.columbia.edu}

We consider the problem of selecting the best of several competing alternatives ("systems"), where probability distributions of system performance are not known, but can be learned via sampling. The objective is to dynamically allocate a finite sampling budget to ultimately select the best system. Motivated by Welch's t-test, we introduce a tractable performance criterion and a sampling policy that seeks to optimize it. First, we characterize the optimal policy in an ideal full-information setting where the means and variances of the underlying distributions are fully known. Then we construct a dynamic sampling policy which asymptotically attains the same performance as in the full-information setting; this policy eventually allocates samples as if the underlying probability distributions were fully known. We characterize the efficiency of the proposed policy with respect to the probability of false selection, and show via numerical testing that it performs well compared to other benchmark policies.

Key words: ranking and selection; optimal learning; dynamic sampling

1. Introduction

Given a finite number of statistical populations, henceforth referred to as systems, we are concerned with the problem of dynamically learning the average outcomes of the systems to ultimately select the best one. This is an instance of ordinal optimization when the systems cannot be evaluated analytically but it is possible to sequentially sample from each system subject to a given sampling budget. A commonly used performance measure for sampling policies is the probability of false selection, that is, the likelihood of a policy mistakenly selecting a suboptimal system. Unfortunately, as is well documented in the literature, this objective is not analytically tractable.

The main objective of this paper is to formulate a tractable performance measure for the aforementioned problem, and leverage it to design well-performing policies. We propose a performance measure based on Welch's t-test statistic (Welch 1947). As it turns out, this performance measure is closely related to the probability of false selection. Further, this measure does not require specific distributional assumptions, which is often the premise in most of the antecedent literature.

To arrive at a tractable dynamic sampling policy, our departure point will be a full-information relaxation. This lends itself to some key structural insights, which we later use as a guideline to propose the sampling strategy, referred to herein as the Welch Divergence (WD) policy. In particular, the policy attempts to match the allocation patterns in the full-information setting, and when the sampling budget is sufficiently large, the WD policy allocates samples as if it had all information a priori.

The WD policy is structured around the first two moments of the underlying probability distributions. This suggests that it works well with respect to the probability of false selection when the behavior of this criterion can be approximated by the first two moments. One of the contributions of this paper is to provide a rigorous statement of this observation and numerical evidence for the efficacy of the approach.

The remainder of this paper is organized as follows. In the next section, we review the literature on ordinal optimization. In Section 3 we formulate a dynamic optimization problem; the asymptotic behavior of the optimal solution is explored in Section 4. In Section 5 we propose the WD policy and its asymptotic performance is addressed in Section 6. In Section 7 we test the policy numerically and compare it with several benchmark policies. Proofs are collected in the electronic companion to this paper.

2. Literature Review

One of the early contributions traces back to the work of Bechhofer (1954), who established the indifference-zone (IZ) formulation. The aim of the IZ procedure is to take as few samples as possible to satisfy a desired guarantee on the probability of correct selection.

A large body of research on the IZ procedure followed (see, e.g., Paulson 1964, Rinott 1978, Nelson et al. 2001, Kim and Nelson 2001, Goldsman et al. 2002, Hong 2006). It is commonly assumed that distributions are normal, and Bonferroni-type inequalities are used to bound the probability of correct selection. Recent IZ procedures developed by Kim and Dieker (2011), Dieker and Kim (2012) and Frazier (2014) focus on finding tight bounds on the probability of correct selection in the normal distribution case.

While the number of samples is determined by the experimenter in the IZ formulation, the goal of ordinal optimization is to select the best system when a finite sampling budget is given. Glynn and Juneja (2004) use large deviations theory with light-tailed distributions to identify the rate function associated with the probability of false selection. The optimal static full-information allocation is obtained by optimizing the rate function of the probability of false selection. Broadie et al. (2007) and Blanchet et al. (2008) study the heavy-tailed case. The performance criterion introduced in this paper has a close connection to the rate function, as will be shown in Section 6; however, our work differs significantly from Glynn and Juneja (2004) in that dynamic sampling policies need to learn the means and variances of the systems and simultaneously allocate samples to optimize the objective.

In addition to the frequentist approaches described thus far, a substantial amount of progress has been made using the Bayesian paradigm. A series of papers beginning with Chen (1995) and continuing with Chen et al. (1996, 1997, 2000, 2003) suggest a family of policies known as optimal computing budget allocation (OCBA). These are heuristic policies based on a Bayesian approximation of the probability of correct selection. Specifically, they assume that each system is characterized by a normal distribution with known variance and unknown mean that follows a conjugate (normal) prior distribution. Then the Bonferroni inequality is used to approximate the probability of correct selection, and an optimal allocation is derived by solving a static optimization problem in which the approximate probability of correct selection is the objective function. Any OCBA policy can be implemented as a dynamic procedure when samples are generated from each system and then used to estimate optimal allocations. Using these allocations, more samples are generated and the optimal allocations are re-estimated repeatedly.

The WD policy resembles the dynamic OCBA policy in that both myopically optimize a certain objective function, but the objectives differ in each stage. Further discussion of the key differences between the two policies is deferred to later in the paper.

A set of Bayesian sampling policies distinct from OCBA was introduced in Chick and Inoue (2001). They assume unknown variances, and hence the posterior mean is Student-t distributed. They derive a bound for the expected loss associated with potential incorrect selections, then asymptotically minimize that bound. Another representative example of a Bayesian sampling policy is the knowledge-gradient (KG) algorithm proposed by Frazier et al. (2008). Like the OCBA policy, it assumes that each distribution is normal with common known variance, and that the unknown means follow a conjugate normal distribution. However, while the probability of correct selection in the OCBA policy corresponds to a 0-1 loss, the KG policy aims to minimize an expected linear loss. The 0-1 loss is more appropriate in situations where identification of the best system is critical, while the linear loss is more appropriate when the ultimate value corresponds to the selected system's mean. An unknown-variance version of the KG policy is developed in Chick et al. (2007) under the name LL1.

3. Preliminaries

3.1. Model Primitives

Consider $k$ stochastic systems, whose performance is governed by a random variable $X_j$ with distribution $F_j(\cdot)$, $j = 1, \ldots, k$. We assume that $\mathbb{E}[X_j^2] < \infty$ for $j = 1, \ldots, k$, and let $\mu_j = \int x \, dF_j(x)$ and $\sigma_j = (\int x^2 \, dF_j(x) - \mu_j^2)^{1/2}$ be the mean and standard deviation of the $j$th system. Denote by $\mu$ (respectively, $\sigma$) the $k$-dimensional vector of means (respectively, standard deviations). We assume that each $\sigma_j$ is strictly positive to avoid trivial cases, and without loss of generality, let system 1 be the best system, i.e., $\mu_1 > \max_{j \neq 1} \mu_j$. A decision maker is given a sampling budget $T$, which means $T$ independent samples can be drawn from the $k$ systems, and the goal is to correctly identify the best system.

A dynamic sampling policy $\pi$ is defined as a sequence of random variables, $\pi_1, \pi_2, \ldots$, taking values in the set $\{1, 2, \ldots, k\}$; the event $\{\pi_t = j\}$ means a sample from system $j$ is taken at stage $t$.

Define $X_{jt}$, $t = 1, \ldots, T$, as a random sample from system $j$ in stage $t$, and let $\mathcal{F}_t$ be the $\sigma$-field generated by the samples and sampling decisions taken up to stage $t$, i.e., $\{(\pi_\tau, X_{\pi_\tau, \tau})\}_{\tau=1}^{t}$, with the convention that $\mathcal{F}_0$ is the nominal sigma algebra associated with the underlying probability space. The set of dynamic non-anticipating policies is denoted by $\Pi$, in which the sampling decision in stage $t$ is determined based on all the sampling decisions and samples observed in previous stages, i.e., $\{\pi_t = j\} \in \mathcal{F}_{t-1}$ for $j = 1, \ldots, k$ and $t = 1, \ldots, T$.

Let $N_{jt}(\pi)$ be the cumulative number of samples from system $j$ up to stage $t$ induced by policy $\pi$ and define $\alpha_{jt}(\pi) := N_{jt}(\pi)/t$ as the sampling rate for system $j$ at stage $t$. Let $\bar{X}_{jt}(\pi)$ and $S_{jt}^2(\pi)$ be the sample mean and variance of system $j$ in stage $t$, respectively. Formally,
$$N_{jt}(\pi) = \sum_{\tau=1}^{t} \mathbb{I}\{\pi_\tau = j\}, \qquad (1)$$
$$\bar{X}_{jt}(\pi) = \frac{\sum_{\tau=1}^{t} X_{j\tau} \mathbb{I}\{\pi_\tau = j\}}{N_{jt}(\pi)}, \qquad (2)$$
$$S_{jt}^2(\pi) = \frac{\sum_{\tau=1}^{t} (X_{j\tau} - \bar{X}_{jt}(\pi))^2 \mathbb{I}\{\pi_\tau = j\}}{N_{jt}(\pi)}, \qquad (3)$$
where $\mathbb{I}\{A\} = 1$ if $A$ is true, and 0 otherwise. With a single subscript $t$, $\alpha_t(\pi) = (\alpha_{1t}(\pi), \ldots, \alpha_{kt}(\pi))$ and $N_t(\pi) = (N_{1t}(\pi), \ldots, N_{kt}(\pi))$ denote vectors of sampling rates and cumulative numbers of samples in stage $t$, respectively. Likewise, we let $\bar{X}_t(\pi)$ and $S_t^2(\pi)$ be the vectors of sample means and variances in stage $t$, respectively. For brevity, the argument $\pi$ may be dropped when it is clear from the context. To ensure that $\bar{X}_t$ and $S_{jt}^2$ are well defined, each system is sampled once initially.

3.2. Problem Formulation

We define the following measure of divergence between systems 1 and $j$ for $j = 2, \ldots, k$:
$$d_{jt}(\pi) = \frac{\bar{X}_{1t} - \bar{X}_{jt}}{\sqrt{\bar{S}_{1t}^2/N_{1t} + \bar{S}_{jt}^2/N_{jt}}}, \quad t = 1, \ldots, T, \qquad (4)$$
where $\bar{S}_{jt}^2 = \max(S_{jt}^2, \underline{\sigma}^2)$ is the truncated sample variance and $\underline{\sigma} > 0$ is a constant that is sufficiently smaller than the true variances, i.e., $\underline{\sigma}^2 \leq \min_j\{\sigma_j^2\}$. The value $\underline{\sigma}$ is introduced to ensure that (4) is well defined. Note that the divergence measure $d_{jt}(\pi)$ is related to Welch's t-test statistic (Welch 1947).
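As a concrete illustration of (3)-(4), the short Python sketch below computes the divergence measure from two arrays of samples. The helper name and the sigma_floor argument (standing in for the truncation constant) are ours; this is an illustration under the paper's definitions, not the authors' implementation.

    import numpy as np

    def welch_divergence(samples_1, samples_j, sigma_floor=1e-3):
        # Divergence d_jt of equation (4): difference of sample means scaled by
        # the pooled standard error, with variances truncated below at sigma_floor^2.
        n1, nj = len(samples_1), len(samples_j)
        mean1, meanj = np.mean(samples_1), np.mean(samples_j)
        var1 = max(np.var(samples_1), sigma_floor ** 2)   # biased variance, as in (3)
        varj = max(np.var(samples_j), sigma_floor ** 2)
        return (mean1 - meanj) / np.sqrt(var1 / n1 + varj / nj)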

The larger the test statistic, or equivalently the divergence measure $d_{jt}(\pi)$, the stronger the evidence for rejecting the null hypothesis in the t-test. In other words, $d_{jt}(\pi)$ essentially measures the amount of information collected to reject the null hypothesis $H_0^{(j)}$ that system $j$ is no worse than system 1, in favor of $\mu_1 > \mu_j$. While the formulation in Welch (1947) assumes normal populations to study the distribution of the test statistic, no such assumption is made here.

For an admissible policy $\pi \in \Pi$ let
$$J_t(\pi) = \min_{j \neq 1} d_{jt}(\pi), \qquad (5)$$
be the minimum level of the (absolute) divergence between system 1 and the other systems. Recall that $\Pi$ is the set of all non-anticipating policies. In the optimization problem we consider, we further restrict attention to $\bar{\Pi} \subseteq \Pi$, in which each system would be sampled infinitely often with probability one if the sampling budget were infinite. We call $\bar{\Pi}$ the set of consistent policies because the sample means and variances are consistent estimators of their population counterparts under such policies. This observation is summarized in the following proposition.

Proposition 1 (Consistency of estimators). For any consistent policy $\pi \in \bar{\Pi}$,
$$\bar{X}_{jt} \to \mu_j, \qquad (6)$$
$$S_{jt}^2 \to \sigma_j^2, \qquad (7)$$
almost surely as $t \to \infty$.

The dynamic optimization problem can now be formulated as follows:
$$J_T^* = \sup_{\pi \in \bar{\Pi}} \mathbb{E}[J_T(\pi)]. \qquad (8)$$
Note that this problem can only be solved with full knowledge of the underlying probability distributions, given that the expectation above requires this information. Further, even when the decision maker has full knowledge of the distributions, the dynamic optimization problem (8) is essentially computationally intractable. Nevertheless, structural properties of the solution to (8) can be extracted when $T$ is large.

4. A Full Information Relaxation and an Asymptotic Benchmark

4.1. A Full Information Relaxation

Let $\bar{d}_{jt}(\pi)$ be the analog of $d_{jt}(\pi)$ defined in (4), i.e.,
$$\bar{d}_{jt}(\pi) := \frac{\mu_1 - \mu_j}{\sqrt{\sigma_1^2/N_{1t} + \sigma_j^2/N_{jt}}}, \quad t = 1, \ldots, T, \qquad (9)$$
with sample means and variances replaced with the population means and variances, respectively. Also, define
$$\bar{J}_t(\pi) := \min_{j \neq 1} \bar{d}_{jt}(\pi) \qquad (10)$$
to be the counterpart of $J_t(\pi)$ in (5). Note that $\bar{J}_T(\pi)$ does not depend on the sampling order, but only on the cumulative number of samples from each system. Hence, it suffices to consider a set of static policies where $\pi_t$, $t = 1, \ldots, T$, is defined on the $(k-1)$-dimensional simplex,
$$\Delta_{k-1} = \Big\{ (\alpha_1, \ldots, \alpha_k) \in \mathbb{R}^k : \sum_{j=1}^{k} \alpha_j = 1 \text{ and } \alpha_j \geq 0 \text{ for all } j \Big\}, \qquad (11)$$
with $N_{jt} = \alpha_j T$ for each $j = 1, \ldots, k$, ignoring integrality constraints on $N_{jt}$. The full-information benchmark problem is formulated as follows:
$$\alpha^* = \arg\max_{\alpha \in \Delta_{k-1}} \min_{j \neq 1} \left( \frac{(\mu_1 - \mu_j)^2}{\sigma_1^2/\alpha_1 + \sigma_j^2/\alpha_j} \right). \qquad (12)$$
Note that the objective function of (12) is equivalent to $(\bar{J}_T(\pi))^2/T$ with $N_{jT}$ replaced with $\alpha_j T$ for each $j = 1, \ldots, k$. Also, define
$$\bar{J}_T^* := \min_{j \neq 1} \frac{\mu_1 - \mu_j}{\sqrt{\sigma_1^2/(\alpha_1^* T) + \sigma_j^2/(\alpha_j^* T)}}, \qquad (13)$$
to be the value of $\bar{J}_T(\pi)$ for $\pi$ that satisfies $N_{jT}(\pi) = \alpha_j^* T$ for each $j$. Since (12) is a convex programming problem, the following result can be used to determine the (unique) optimal solution to the full-information benchmark based on first-order conditions.

Proposition 2 (Characterization of optimal allocation under full information). The optimal solution to the full-information benchmark problem (12), $\alpha^* \in \Delta_{k-1}$, is strictly positive and satisfies the following system of equations:
$$\frac{(\mu_1 - \mu_i)^2}{\sigma_1^2/\alpha_1^* + \sigma_i^2/\alpha_i^*} = \frac{(\mu_1 - \mu_j)^2}{\sigma_1^2/\alpha_1^* + \sigma_j^2/\alpha_j^*} \quad \text{for all } i, j = 2, \ldots, k, \qquad (14)$$
$$\left( \frac{\alpha_1^*}{\sigma_1} \right)^2 = \sum_{j=2}^{k} \left( \frac{\alpha_j^*}{\sigma_j} \right)^2. \qquad (15)$$

Example 1. Setting $\mu = (1, 0, 0, 0)$ and $\sigma = (4, 4, 4, 4)$, we obtain $\alpha^* = (0.514, 0.162, 0.162, 0.162)$. When $\mu = (1, 2/3, 1/3, 0)$ and $\sigma$ is the same as in the base case, we obtain $\alpha^* = (0.457, 0.451, 0.065, 0.027)$; that is, we take more samples from a system whose mean is closer to that of the best system. Next, consider a case with $\mu = (1, 0, 0, 0)$ as in the base case but $\sigma = (1, 3, 5, 7)$. In this case, the optimal $\alpha^* = (0.099, 0.098, 0.271, 0.532)$; that is, the greater the variance, the more samples are taken.

4.2. An Asymptotic Benchmark

In this section we characterize some salient features of optimal sampling policies when the sampling budget is sufficiently large, using the optimal solution to the full-information benchmark problem (12). First we introduce a rather straightforward characterization of asymptotic optimality.

Definition 1 (Asymptotic Optimality). A policy $\pi \in \bar{\Pi}$ is asymptotically optimal if
$$\frac{\mathbb{E}[J_T(\pi)]}{J_T^*} \to 1 \quad \text{as } T \to \infty. \qquad (16)$$

The key observation is that, under a consistent policy $\pi \in \bar{\Pi}$, the sample mean and variance of each system approach the population mean and variance, respectively, almost surely as the sampling budget grows large (Proposition 1). In other words, $\bar{J}_T$ is close to $J_T$ for sufficiently large $T$ under any consistent policy $\pi \in \bar{\Pi}$, which allows us to substitute $\bar{J}_T^*$ for $J_T^*$.
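To make the full-information benchmark concrete, the sketch below solves (12) numerically in epigraph form using SciPy. The helper name full_info_allocation and the example values are ours; it is an illustrative sketch of the optimization in (12), not the code behind Example 1.

    import numpy as np
    from scipy.optimize import minimize

    def full_info_allocation(mu, sigma):
        # Solve (12): maximize the smallest pairwise quantity
        # (mu_1 - mu_j)^2 / (sigma_1^2/alpha_1 + sigma_j^2/alpha_j) over the simplex,
        # rewritten as: maximize z subject to z <= g_j(alpha) for all j != 1.
        mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
        k = len(mu)
        best = int(np.argmax(mu))                      # "system 1" in the paper's indexing
        others = [j for j in range(k) if j != best]

        def g(alpha, j):
            return (mu[best] - mu[j]) ** 2 / (sigma[best] ** 2 / alpha[best]
                                              + sigma[j] ** 2 / alpha[j])

        x0 = np.append(np.full(k, 1.0 / k), 0.0)       # x = (alpha_1, ..., alpha_k, z)
        cons = [{"type": "eq", "fun": lambda x: np.sum(x[:k]) - 1.0}]
        cons += [{"type": "ineq", "fun": lambda x, j=j: g(x[:k], j) - x[k]} for j in others]
        bounds = [(1e-6, 1.0)] * k + [(None, None)]
        res = minimize(lambda x: -x[k], x0, bounds=bounds, constraints=cons, method="SLSQP")
        return res.x[:k]

    # Second configuration of Example 1: mu = (1, 2/3, 1/3, 0), sigma = (4, 4, 4, 4).
    print(full_info_allocation([1, 2/3, 1/3, 0], [4, 4, 4, 4]))

The returned allocation can be sanity-checked against the first-order conditions (14)-(15).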

Proposition 3 (Equivalent characterization of asymptotic optimality). A policy $\pi \in \bar{\Pi}$ that satisfies
$$\frac{\mathbb{E}[J_T(\pi)]}{\bar{J}_T^*} \to 1 \quad \text{as } T \to \infty, \qquad (17)$$
where $\bar{J}_T^*$ is defined in (13), is asymptotically optimal.

This provides a significantly simpler criterion for asymptotic optimality; while it is difficult to verify that (16) holds because it is difficult to compute $J_T^*$ either analytically or numerically, it is straightforward to obtain $\bar{J}_T^*$. Based on Proposition 3, the next result provides a sufficient condition for asymptotic optimality.

Theorem 1 (Sufficient condition for asymptotic optimality). A policy $\pi \in \bar{\Pi}$ is asymptotically optimal if
$$\alpha_T(\pi) \to \alpha^* \quad \text{as } T \to \infty, \qquad (18)$$
almost surely, where $\alpha^*$ is the optimal solution to (12).

5. The Welch Divergence Policy

In this section we consider a setting where no information on the underlying probability distributions is given a priori. We propose a dynamic sampling policy, called the Welch Divergence (WD) policy, that achieves asymptotic optimality. Theorem 1 suggests a simple condition for asymptotic optimality, (18), from which we can construct a policy that meets the criterion. Recall that $\alpha^*$ in (18), the optimal solution to the full-information benchmark problem, is not known a priori in this setting and one needs to infer it from sample estimates. Denote by $\hat{\alpha}_t = (\hat{\alpha}_{1t}, \ldots, \hat{\alpha}_{kt})$ its estimator in stage $t$, which is defined as follows:
$$\hat{\alpha}_t = \arg\max_{\alpha \in \Delta_{k-1}} \min_{j \neq b} \left( \frac{(\bar{X}_{bt} - \bar{X}_{jt})^2}{\bar{S}_{bt}^2/\alpha_b + \bar{S}_{jt}^2/\alpha_j} \right), \qquad (19)$$
with $b = \arg\max_j \{\bar{X}_{jt}\}$. The optimization problem in (19) is called a perturbed problem since $\mu_j$ and $\sigma_j^2$ in (12) are replaced with their noisy estimates, $\bar{X}_{jt}$ and $\bar{S}_{jt}^2$, respectively. Note that the objective function in (19) is closely related to Welch's t-test statistic (Welch 1947), hence the name of the policy.

To achieve (18), it suffices to have (i) $\alpha_t$ close to $\hat{\alpha}_t$; and (ii) $\hat{\alpha}_t$ approach $\alpha^*$ as $t$ increases to infinity. It can be easily seen that (ii) is achieved if $\bar{X}_{jt} \to \mu_j$ and $S_{jt}^2 \to \sigma_j^2$ for each $j = 1, \ldots, k$. (See Lemma A.2 for details.) The WD policy is constructed to match $\alpha_t$ with $\hat{\alpha}_t$ in each stage, satisfying (i) automatically. The WD policy is summarized below in algorithmic form, with $\kappa$ being a parameter of the algorithm.

Algorithm 1 WD($\kappa$)
  For each $j$, take samples until $S_{jt}^2 > 0$
  while $t \leq T$ do
    Solve the perturbed problem (19) to obtain $(\hat{\alpha}_{1t}, \ldots, \hat{\alpha}_{kt})$
    Let $\pi_{t+1}$ be such that
      $\pi_{t+1} = j$, if $\alpha_{jt} < \kappa$ for some $j$
      $\pi_{t+1} = \arg\max_{j=1,2,\ldots,k} (\hat{\alpha}_{jt} - \alpha_{jt})$, otherwise
    Sample from system $\pi_{t+1}$ and let $t = t + 1$
  end while

Remark 1 (Greedy and constrained WD policy). The parameter $\kappa$ of the WD policy represents the minimum sampling rate for each system; that is, the procedure guarantees that $N_{jt} \geq \kappa t$ for each $j$. We call the WD policy with $\kappa = 0$ the greedy-WD and the one with $\kappa > 0$ the constrained-WD. Essentially, $\kappa > 0$ ensures that the number of samples from each system increases at least at a linear rate.
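For concreteness, the following is a minimal Python sketch of the WD($\kappa$) loop. It reuses the hypothetical full_info_allocation helper from the earlier sketch to solve the perturbed problem (19) on current sample estimates, and it replaces the initialization step of Algorithm 1 with a simple variance floor. It is an illustration only; the paper's experiments are implemented in C.

    import numpy as np

    def wd_policy(sample_fns, T, kappa=0.0, sigma_floor=1e-3, rng=None):
        # sample_fns[j](rng) draws one observation from system j.
        rng = rng if rng is not None else np.random.default_rng()
        k = len(sample_fns)
        samples = [[fn(rng)] for fn in sample_fns]           # one initial sample per system
        for t in range(k, T):
            means = np.array([np.mean(s) for s in samples])
            stds = np.array([max(np.std(s), sigma_floor) for s in samples])
            counts = np.array([len(s) for s in samples], dtype=float)
            alpha = counts / counts.sum()                    # empirical sampling rates
            alpha_hat = full_info_allocation(means, stds)    # perturbed problem (19)
            lagging = np.where(alpha < kappa)[0]             # constrained-WD safeguard
            j = int(lagging[0]) if lagging.size else int(np.argmax(alpha_hat - alpha))
            samples[j].append(sample_fns[j](rng))
        return int(np.argmax([np.mean(s) for s in samples])) # selected system

For example, wd_policy([lambda r, m=m: r.normal(m, 4) for m in (1, 0.5, 0.5, 0.5)], T=2000, kappa=0.1) runs this sketch on the slippage configuration used later in Section 7.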

Remark 2 (WD policy in the discrete distribution case). When the underlying distributions are discrete, an event $\bar{X}_{it} = \bar{X}_{jt}$ for $i \neq j$ may occur with positive probability. In this case, the objective function in (19) is 0 for all $\alpha \in \Delta_{k-1}$ and $\hat{\alpha}_t$ may not be well defined. To avoid technical difficulties due to ties, we let $\epsilon_j > 0$ be a sufficiently small positive number for $j = 1, \ldots, k$, with $\epsilon_i \neq \epsilon_j$ for each $i \neq j$. Then we replace each $\bar{X}_{jt}$ in (19) with $\bar{X}_{jt} + \epsilon_j/N_{jt}$. Note that the magnitude of the perturbation, $\epsilon_j/N_{jt}$, diminishes as $N_{jt} \to \infty$. The modified WD policy, as well as the proofs of the relevant theorems, is provided in Appendix C.

6. Main Results

6.1. Consistency and Asymptotic Optimality

We first address the primary asymptotic performance measure for the WD policy. The next theorem establishes the consistency of the WD policy.

Theorem 2 (Consistency). For any $0 \leq \kappa \leq 1/k$, the WD policy is consistent.

Although one can easily show that any static policy with positive sampling rates (i.e., $\alpha_t = \alpha > 0$ for all $t$) is consistent (hence the constrained-WD policy is also consistent), it is less obvious that the greedy-WD exhibits this property. The following is a simple example of a myopic policy that is not consistent.

Example 2. Let $\pi_t = \arg\max_j \{\bar{X}_{jt}\}$ in every stage $t$; that is, a sample is taken from the system with the greatest sample mean. Assume all systems are sampled once initially. For simplicity, suppose $k = 2$ and assume the populations are normal with $(\mu_1, \mu_2) = (2, 1)$ and unit variances. At stage 2, with $\pi_1 = 1$ and $\pi_2 = 2$, it can be easily seen that the event $A = \{\bar{X}_{12} \in (-1, 0) \text{ and } \bar{X}_{22} \in (1, 2)\}$ occurs with positive probability. Conditional on this event, system 1 would not be sampled at all if $\bar{X}_{2t} \geq 0$ for all $t \geq 3$. The latter event occurs with positive probability since $\{\sum_{s=2}^{t} X_{2s} : t \geq 3\}$ is a random walk with positive drift and the probability that it ever falls below zero is strictly less than one. Combined with the fact that $P(A) > 0$, this policy is not consistent.

The next theorem establishes the asymptotic optimality of the WD policy. Put $\underline{\alpha}^* = \min_j \{\alpha_j^*\}$.

Theorem 3 (Asymptotic optimality). For any $0 \leq \kappa \leq \underline{\alpha}^*$, the WD policy is asymptotically optimal.

Theorem 3 implies that the WD policy recovers full-information optimality even in its myopic form. We make two remarks on this theorem. First, for the WD policy with some $\kappa \in (\underline{\alpha}^*, 1/k]$, it is not difficult to show that $\alpha_t \to \alpha \neq \alpha^*$, in which case the WD policy is not asymptotically optimal by Theorem 1. Second, even though the WD policy is asymptotically optimal for any $\kappa \in [0, \underline{\alpha}^*]$, its finite-time performance still depends on $\kappa$. To be specific, when $\kappa$ is set too large, i.e., close to $\underline{\alpha}^*$, the finite-time performance may become poor due to excessive sampling from suboptimal systems. When $\kappa$ is set too low, i.e., close to 0, the finite-time performance may also be poor due to inaccurate sample estimates in early stages.

6.2. Connection to the Probability of False Selection

The probability of false selection, denoted $P(\mathrm{FS}_t)$ with $\mathrm{FS}_t := \{\bar{X}_{1t} < \max_{j \neq 1} \bar{X}_{jt}\}$, is a widely used criterion for the efficiency of a sampling policy (see, e.g., the survey paper by Kim and Nelson 2006). In this section we show that asymptotic optimality in terms of the proposed performance measure, (16), can be translated into asymptotic efficiency in terms of $P(\mathrm{FS}_t)$; the term will be made precise in what follows.

Some large deviations preliminaries. To analyze the probability of false selection we need to establish tail bounds that rely on the existence of the moment generating function. Define $M_j(\theta) := \mathbb{E}[e^{\theta X_j}]$ and $\Lambda_j(\theta) := \log M_j(\theta)$, and let $I_j(\cdot)$ denote the Fenchel-Legendre transform of $\Lambda_j(\cdot)$, i.e.,
$$I_j(x) = \sup_{\theta} \{\theta x - \Lambda_j(\theta)\}. \qquad (20)$$
Let $\mathcal{D}_j = \{\theta \in \mathbb{R} : \Lambda_j(\theta) < \infty\}$ and $\mathcal{H}_j = \{\Lambda_j'(\theta) : \theta \in \mathcal{D}_j^0\}$, where $A^0$ denotes the interior of a set $A$ and $\Lambda_j'(\theta)$ denotes the derivative of $\Lambda_j(\cdot)$ at $\theta$. Let $\theta_j(x)$ be the unique solution to the following equation,
$$\Lambda_j'(\theta) = x. \qquad (21)$$
The following assumption ensures that the sample mean from each system can take any value in the interval $[\min_{j \neq 1} \mu_j, \mu_1]$. This assumption is satisfied for commonly encountered families of distributions such as the normal, Bernoulli, Poisson, and gamma.

Assumption 1. The interval $[\min_{j \neq 1} \mu_j, \mu_1] \subseteq \cap_{j=1}^{k} \mathcal{H}_j^0$.

When this assumption is satisfied, Glynn and Juneja (2004) show that
$$-\frac{1}{t} \log P(\mathrm{FS}_t) \to \rho(\alpha), \qquad (22)$$
with $\rho(\alpha) = \min_{j \neq 1} G_j(\alpha)$, where
$$G_j(\alpha) = \inf_{x} \{\alpha_1 I_1(x) + \alpha_j I_j(x)\}. \qquad (23)$$
One can interpret (22) as a statement that $P(\mathrm{FS}_t)$ behaves roughly like $\exp(-\rho(\alpha) t)$ for large values of $t$. It follows that the best possible asymptotic performance is obtained for $\alpha \in \Delta_{k-1}$ that maximizes $\rho(\alpha)$.

We denote by $\rho_G(\alpha)$ the rate function when the underlying distributions are normal with the same mean and standard deviation vectors $\mu = (\mu_1, \ldots, \mu_k)$ and $\sigma = (\sigma_1, \ldots, \sigma_k)$. The subscript $G$ is mnemonic for Gaussian. It is easily seen that
$$\rho_G(\alpha) = \min_{j \neq 1} \frac{(\mu_1 - \mu_j)^2}{2(\sigma_1^2/\alpha_1 + \sigma_j^2/\alpha_j)}, \qquad (24)$$
where $\alpha \in \Delta_{k-1}$.

Remark 3 (Relation to the Welch divergence metric). With a slight abuse of notation, writing $\bar{J}_T(\alpha)$ for $\bar{J}_T(\pi)$ under a static allocation $\alpha \in \Delta_{k-1}$, we have
$$\rho_G(\alpha) = \frac{\bar{J}_T^2(\alpha)}{2T}. \qquad (25)$$
In view of Theorem 3, the WD policy asymptotically maximizes $\bar{J}_T(\alpha)$. Hence, the WD policy asymptotically maximizes $\rho_G(\cdot)$, the exponent of $P(\mathrm{FS}_t)$, when the underlying distributions are normal. In this respect, the WD policy resembles the sequential sampling procedure in Glynn and Juneja (2004), which targets the allocation that asymptotically optimizes $\rho(\cdot)$. The latter procedure, however, requires the family of underlying distributions to be known.

Accuracy of the Gaussian approximation. We now show that one may approximate $\rho(\alpha)$ by $\rho_G(\alpha)$ if the underlying distributions are not normal but $\mu_1$ and $\max_{j \neq 1} \{\mu_j\}$ are suitably close to each other. The following proposition establishes the relation between $\rho(\cdot)$ and $\rho_G(\cdot)$.

Proposition 4 (Local relative efficiency of the Gaussian approximation). Suppose Assumption 1 holds. Then for any $\alpha \in \Delta_{k-1}$,
$$\frac{\rho(\alpha)}{\rho_G(\alpha)} \to 1 \quad \text{as } \max_{j \neq 1}\{\mu_j\} \to \mu_1. \qquad (26)$$

Proposition 4, combined with the fact that $\alpha^* = \arg\max_\alpha \{\rho_G(\alpha)\}$, suggests that the WD policy can achieve near-optimal performance in terms of $P(\mathrm{FS}_t)$ when the means of the best and second-best systems are sufficiently close to each other, regardless of the underlying probability distributions. We present a simple example to verify (26).

Example 3. Consider two Bernoulli systems with parameters $\mu_1, \mu_2 \in (0, 1)$ with $\mu_1 > \mu_2$, so that $P(X_j = 1) = \mu_j = 1 - P(X_j = 0)$ for $j = 1, 2$. The rate functions are
$$I_j(x) = x \log\frac{x}{\mu_j} + (1 - x)\log\frac{1-x}{1-\mu_j}, \quad j = 1, 2. \qquad (27)$$
Using a second-order Taylor expansion of $I_j(x)$ at $x = \mu_j$ and the fact that $\sigma_j^2 = \mu_j(1 - \mu_j)$, it can be seen that
$$I_j(x) = \frac{(x - \mu_j)^2}{2\sigma_j^2} + o((\mu_1 - \mu_2)^2), \qquad (28)$$
where $o(x)$ is such that $o(x)/x \to 0$ as $x \to 0$. Since $\rho(\alpha_1, \alpha_2) = \inf_x \{\alpha_1 I_1(x) + \alpha_2 I_2(x)\}$, it is not difficult to show that
$$\rho(\alpha_1, \alpha_2) = \frac{(\mu_1 - \mu_2)^2}{2(\sigma_1^2/\alpha_1 + \sigma_2^2/\alpha_2)} + o((\mu_1 - \mu_2)^2), \qquad (29)$$
which reduces to (24) as $\mu_1 - \mu_2 \to 0$.
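Example 3 can be checked numerically. The sketch below (our own helper names; parameter values are illustrative) evaluates the infimum in (23) for two Bernoulli systems under an equal allocation and compares it with the two-moment approximation (24); by Proposition 4 the printed ratio should approach 1 as the gap in means shrinks.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def bern_I(x, p):
        # Fenchel-Legendre transform (20) for a Bernoulli(p) variable, as in (27).
        return x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))

    def rho_bernoulli(a1, a2, p1, p2):
        # G_2(alpha) from (23): infimum of the weighted rate functions over x in (p2, p1).
        res = minimize_scalar(lambda x: a1 * bern_I(x, p1) + a2 * bern_I(x, p2),
                              bounds=(p2 + 1e-9, p1 - 1e-9), method="bounded")
        return res.fun

    def rho_gaussian(a1, a2, p1, p2):
        # Two-moment (Gaussian) approximation (24) with sigma_j^2 = p_j (1 - p_j).
        v1, v2 = p1 * (1 - p1), p2 * (1 - p2)
        return (p1 - p2) ** 2 / (2 * (v1 / a1 + v2 / a2))

    for p2 in [0.5, 0.58, 0.598]:          # the gap mu_1 - mu_2 shrinks toward 0
        r = rho_bernoulli(0.5, 0.5, 0.6, p2)
        rg = rho_gaussian(0.5, 0.5, 0.6, p2)
        print(p2, r / rg)                  # ratio should approach 1 (Proposition 4)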

To paraphrase Proposition 4, after taking $t \to \infty$, the probability of false selection on a logarithmic scale depends on the underlying distributions only through the first two moments as the gap in the means shrinks to 0. To state this insensitivity property more precisely, consider a stylized sequence of system configurations defined as follows. Each system configuration is indexed by a superscript $t$. For $t = 1$ we let $\mu_j^1 = \mu_j$ for each $j = 1, \ldots, k$. For $t \geq 2$, we fix $\mu_1^t = \mu_1$ and set $\mu_j^t$, $j = 2, \ldots, k$, so that $\mu_1 - \mu_j^t = \delta_t(\mu_1 - \mu_j)$ for some $0 < \delta_t \leq 1$. That is, the probability distributions for systems $j = 2, \ldots, k$ are shifted so that their means are $\mu_j^t$. For brevity, we set $\delta_1 = 1$ and let the system configuration at each $t$ be characterized by $\delta_t$. Note that we assume that
$$\mathbb{E}[e^{i\theta(X_j - \mu_j^t)}; \delta_t] = \mathbb{E}[e^{i\theta(X_j - \mu_j^1)}; \delta_1] \qquad (30)$$
for each $j = 2, \ldots, k$, where $\mathbb{E}[\,\cdot\,; \delta_t]$ denotes the expectation under which the system configuration is parametrized by $\delta_t$. This implies the probability distribution of $X_j$ is shifted so that the central moments of $X_j$ are unchanged. Define $P(\mathrm{FS}_t(\alpha); \delta_t)$ as the probability of false selection under the $t$-th configuration, parametrized by $\delta_t$ and $\alpha \in \Delta_{k-1}$.

Proposition 5 (Validity of the Gaussian approximation). Under Assumption 1, if $t\delta_t^2 \to \infty$ and $\delta_t \to 0$ as $t \to \infty$, then for any $\alpha \in \Delta_{k-1}$,
$$-\frac{1}{t\delta_t^2} \log P(\mathrm{FS}_t(\alpha); \delta_t) \to \rho_G(\alpha) \quad \text{as } t \to \infty. \qquad (31)$$

Heuristically, Proposition 5 implies that the behavior of $P(\mathrm{FS}_t(\alpha); \delta)$ for large values of $t$ is related to the magnitude of $\delta$. Specifically, $P(\mathrm{FS}_t(\alpha); \delta) \approx \exp\{-\rho_G(\alpha) t \delta^2\}$.

6.3. Efficiency of the WD Policy

Recall from Remark 3 that the WD policy optimizes $\rho_G(\alpha)$. If we write $\rho_{\max} = \max_\alpha \{\rho(\alpha)\}$, the loss in asymptotic efficiency can be written as $l(t) = 1 - \rho(\alpha_t)/\rho_{\max}$, which can be decomposed into two parts: (i) an approximation error due to the loss from maximizing $\rho_G(\cdot)$ instead of $\rho(\cdot)$; and (ii) an estimation error due to the loss incurred by noisy estimates of means and variances. Formally, for any $t \geq k$,
$$l(t) = 1 - \frac{\rho(\alpha_t)}{\rho_{\max}} \qquad (32)$$
$$= \left(1 - \frac{\rho(\alpha^*)}{\rho_{\max}}\right) + \left(\frac{\rho(\alpha^*) - \rho(\alpha_t)}{\rho_{\max}}\right) \qquad (33)$$
$$= l_1 + l_2(t), \qquad (34)$$

where $l_1$ corresponds to the loss due to the approximation error and $l_2(t)$ corresponds to the loss due to noisy sample estimates. Note that $l_1$ is independent of $t$ and that $l_2(t)$ decreases to 0 almost surely as $t \to \infty$, which can be easily seen from Theorem 1. For the regime with small $t$, the latter error dominates the former, i.e., $l_1 \ll l_2(t)$, and hence learning the mean of each system is more important than a precise approximation of $\rho(\cdot)$. In other words, in this regime, $\rho_G(\cdot)$ is a reasonable proxy for $\rho(\cdot)$. On the other hand, when $t$ is large, $l_1$ is the dominating factor in the efficiency loss. In the non-normal distribution case, $l_1 > 0$ and the policy cannot recover the full asymptotic efficiency, $\rho_{\max}$. However, it is important to note that the magnitude of $l_1$ is not significant as long as the gap in means is small enough (Proposition 4).

For the purpose of visualization, we present Figure 1, which illustrates the magnitudes of $l_1$ and $l_2(t)$ in two simple configurations, each with two Bernoulli systems: one with $\delta = 0.05$ and another with $\delta = 0.005$, where $\delta = \mu_1 - \mu_2$. For both configurations, there exists some $t_c$ such that $l_1 = l_2(t_c)$. For $t \leq t_c$, maximizing $\rho_G(\cdot)$ does not lead to a significant loss in efficiency because $l_1 \ll l_2(t)$. Also, one can observe that $t_c$ is roughly proportional to $1/\delta^2$; when $\delta$ becomes 10 times smaller, it requires approximately 100 times more sampling budget to reduce $l_2(t)$ to the level of $l_1$, which is consistent with the result of Proposition 5 that the behavior of $P(\mathrm{FS}_t; \delta)$ is closely related to $\delta^2$ for large $t$.

6.4. Performance of the WD Policy

For static policies, the exact convergence rate of $P(\mathrm{FS}_t)$ can be characterized (and optimized by choice of $\alpha$) for light-tailed systems (Glynn and Juneja 2004); see also Broadie et al. (2007) and Blanchet et al. (2008) for the case of heavy tails. However, these results do not apply to dynamic policies. The next result provides a guarantee on the convergence rate of $P(\mathrm{FS}_t)$ under the WD policy.

Theorem 4 (Bounds on probability of false selection). Under the WD policy with any $0 \leq \kappa \leq 1/k$, $P(\mathrm{FS}_t) \to 0$ as $t \to \infty$. Moreover, if $\kappa > 0$ and $\mathbb{E}|X_j|^q < \infty$ for some $q \geq 2$, then
$$P(\mathrm{FS}_t) \leq (k-1)\eta t^{-q/2}, \quad \text{for all } t \geq k. \qquad (35)$$

If $\mathbb{E}[e^{\theta X_j}] < \infty$ for all $j = 1, \ldots, k$ and all $|\theta| \leq \theta_0$ for some $\theta_0 > 0$, then
$$P(\mathrm{FS}_t) \leq (k-1)e^{-\gamma\kappa t}, \quad \text{for all } t \geq k, \qquad (36)$$
where the constants $\eta, \gamma > 0$ are explicitly identified in the proof.

Figure 1 (panels (a) $\delta = 0.05$ and (b) $\delta = 0.005$; y-axis: loss in asymptotic efficiency; x-axis: $t$). Decomposition of the efficiency loss under the WD policy with $\kappa = 0.1$. Each configuration consists of two Bernoulli systems. Graph (a) is for the configuration with $(\mu_1, \mu_2) = (0.9, 0.85)$ and graph (b) is for the configuration with $(\mu_1, \mu_2) = (0.9, 0.895)$. Each $l_2(t)$ is estimated by averaging $10^6$ generated values of $(\rho(\alpha^*) - \rho(\alpha_t))/\rho_{\max}$. The standard error of each estimate of $l_2(t)$ is less than $10^{-7}$ in graph (a) and less than $10^{-9}$ in graph (b). The two graphs suggest that the crossing point of $l_1$ and $l_2(t)$ is proportional to $1/\delta^2$.

Note that this result can be applied to set the total sampling budget required to guarantee $P(\mathrm{FS}_T) \leq \epsilon$. Specifically, (36) implies that $T = \log((k-1)/\epsilon)/(\gamma\kappa)$ guarantees this desired level of accuracy. The exact value of $T$ cannot be computed a priori because the constant $\gamma$ (or $\eta$) depends on the (unknown) probability distributions. In the cases addressed below, one may set $T$ explicitly utilizing a bound on either $\gamma$ or $\eta$.

Remark 4 (Bounded random variables). Suppose that $X_{jt} \in [a, b]$ with probability one for each $j = 1, \ldots, k$, and that $\mu_1 \geq \mu_j + \delta$ for $j = 2, \ldots, k$. Then, following similar steps to those in the proof of Theorem 4, we have that
$$P(\mathrm{FS}_t) \leq (k-1)\exp\left(-\frac{\kappa t \delta^2}{2(b-a)^2}\right), \quad \text{for all } t \geq k, \qquad (37)$$
where the right-hand side is obtained by iteratively applying Hoeffding's inequality. Hence, setting $T = 2(b-a)^2\log((k-1)/\epsilon)/(\kappa\delta^2)$ guarantees $P(\mathrm{FS}_T) \leq \epsilon$.
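For bounded outcomes, the budget prescribed at the end of Remark 4 can be computed directly. The helper below is a sketch: the name and the example parameters are ours, and the constant follows the Hoeffding-based bound (37) as stated above.

    import math

    def budget_for_bounded(k, delta, a, b, kappa, eps):
        # T = 2(b-a)^2 log((k-1)/eps) / (kappa * delta^2) guarantees P(FS_T) <= eps
        # for outcomes bounded in [a, b] with a mean gap of at least delta (Remark 4).
        return math.ceil(2 * (b - a) ** 2 * math.log((k - 1) / eps) / (kappa * delta ** 2))

    # Illustrative values: k = 4 systems on [0, 1], gap delta = 0.1, kappa = 0.1, eps = 0.01.
    print(budget_for_bounded(4, 0.1, 0.0, 1.0, 0.1, 0.01))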

Remark 5 (Sub-Gaussian random variables). Suppose that $X_{jt}$ is sub-Gaussian for each $j = 1, \ldots, k$; that is, there exists $c_j > 0$ such that $\mathbb{E}[e^{\theta(X_{jt} - \mu_j)}] \leq e^{c_j\theta^2/2}$ for all $\theta$. Also, suppose that $\mu_1 \geq \mu_j + \delta$ for $j = 2, \ldots, k$. In this case, following similar steps to those in the proof of Theorem 4, one can establish that
$$P(\mathrm{FS}_t) \leq (k-1)\exp\left(-\frac{\kappa t \delta^2}{8c}\right), \quad \text{for all } t \geq k, \qquad (38)$$
where $c > 0$ is a constant with $c \geq c_j$ for all $j$. Hence, setting $T = 8c\log((k-1)/\epsilon)/(\kappa\delta^2)$ guarantees $P(\mathrm{FS}_T) \leq \epsilon$.

7. Numerical Testing and Comparison with Other Policies

7.1. Different Sampling Policies

We compare the WD policy against three other policies: the dynamic algorithm for Optimal Computing Budget Allocation (OCBA) (Chen et al. 2000), the Knowledge Gradient (KG) algorithm (Frazier et al. 2008), and the Equal Allocation (EA) policy; see details below.

Optimal Computing Budget Allocation (OCBA). Chen et al. (2000) assumed that the underlying distributions are normal. They formulate the problem as follows:
$$\min_{\alpha \in \Delta_{k-1}} \sum_{j \neq b} \bar{\Phi}\left(\frac{\bar{X}_{bt} - \bar{X}_{jt}}{\sqrt{S_{bt}^2/(\alpha_b t) + S_{jt}^2/(\alpha_j t)}}\right), \qquad (39)$$
where $b = \arg\max_j \{\bar{X}_{jt}\}$, $\Phi(\cdot)$ is the cumulative standard normal distribution function, and $\bar{\Phi}(\cdot) = 1 - \Phi(\cdot)$. They suggested a dynamic policy based on first-order conditions for (39). Initial $\tau_0$ samples are taken from each system, which are then used to allocate incremental samples. Specifically, compute $(\hat{\alpha}_1, \ldots, \hat{\alpha}_k)$ by solving
$$\frac{\hat{\alpha}_j}{\hat{\alpha}_i} = \frac{S_{jt}^2/(\bar{X}_{bt} - \bar{X}_{jt})^2}{S_{it}^2/(\bar{X}_{bt} - \bar{X}_{it})^2} \quad \text{for } i, j \in \{1, 2, \ldots, k\} \text{ and } i \neq j \neq b, \text{ and} \qquad (40)$$
$$\hat{\alpha}_b = S_{bt}\sqrt{\sum_{j \neq b} \hat{\alpha}_j^2/S_{jt}^2}, \qquad (41)$$
which are derived from the first-order conditions of (39). Then, allocate $\Delta\hat{\alpha}_j$ samples to system $j$. This procedure is repeated until the sampling budget is exhausted. We let $\Delta = 1$ in this paper.
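The OCBA allocation step described above reduces to a few lines. The sketch below (our own helper, not code from Chen et al.) computes an allocation proportional to the solution of (40)-(41) from current sample means and variances, assuming the sample means are distinct.

    import numpy as np

    def ocba_allocation(means, varis):
        # Solve (40)-(41) up to normalization: alpha_j proportional to
        # S_j^2 / (Xbar_b - Xbar_j)^2 for j != b, and
        # alpha_b = S_b * sqrt(sum_{j != b} alpha_j^2 / S_j^2).
        means, varis = np.asarray(means, float), np.asarray(varis, float)
        b = int(np.argmax(means))
        w = np.zeros_like(means)
        for j in range(len(means)):
            if j != b:
                w[j] = varis[j] / (means[b] - means[j]) ** 2   # eq. (40), common factor dropped
        others = np.arange(len(means)) != b
        w[b] = np.sqrt(varis[b] * np.sum(w[others] ** 2 / varis[others]))  # eq. (41)
        return w / w.sum()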

Knowledge Gradient (KG). We briefly describe the KG policy; further description can be found in Frazier et al. (2008). Let $\mu_j^t$ and $(\sigma_j^t)^2$ denote the mean and variance of the posterior distribution for the unknown performance of system $j$ in stage $t$. Then the KG policy maximizes the single-period expected increase in value, $\mathbb{E}_t[\max_j \mu_j^{t+1} - \max_j \mu_j^t]$, where $\mathbb{E}_t[\cdot]$ indicates the conditional expectation with respect to the filtration up to stage $t$. Although Frazier et al. (2008) mainly discusses the KG policy for the known-variance case, we use its version with unknown variance in order to be consistent with the other benchmark policies herein. This policy requires users to set parameters for the prior distributions, and we use $\tau_0$ initial sample estimates to set those.

Equal Allocation (EA). This is a simple static policy, where all systems are equally sampled, i.e., $\pi_t = \arg\min_j N_{jt}$, breaking ties by selecting the system with the smallest index.

Note that the dynamic OCBA procedure described above resembles the WD policy in that it myopically optimizes a certain objective function. However, the objective function in (39) differs from that of the WD policy in each stage. Further, Chen et al. (2000) assumed that $N_{bt} \gg N_{jt}$ to derive the equations (40) and (41), which lead to an asymptotic allocation different from that of the WD policy.

In Chen et al. (2000), the OCBA policy takes an initial number of samples, $\tau_0$, as a parameter. Also, in the KG policy described above, $\tau_0$ initial samples are taken to estimate parameters for the prior distributions. However, in this paper, for the purpose of comparison with the WD policy, the KG and OCBA policies are modified so that they take $\kappa$ as a single parameter. To be specific, in each stage $t$, we check whether each system has been sampled more than $\kappa t$ times: if $N_{jt} < \kappa t$ for some $j$, we take a sample from system $j$; if $N_{jt} \geq \kappa t$ for all $j$, we follow the sampling decision from the KG or the OCBA. This affects the performance of the two policies as illustrated in Figure 2, in which the two policies are tested for different values of $\kappa$; the configuration is four normally distributed systems with means $\mu = (1, 0.5, 0.5, 0.5)$ and standard deviations $\sigma = (4, 4, 4, 4)$. In Figure 2(a), we set $\kappa = 0$ but initial $\tau_0 = 5$ samples are taken from each system. There is no general rule for choosing an optimal value of $\kappa$. In the following numerical experiments, we set $\kappa = 0.4/k$ for all policies. This is by no means an optimal choice, but we remark that the qualitative conclusions hold for other choices of $\kappa$ as well.
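The $\kappa$-modification applied to the KG and OCBA benchmarks is a thin wrapper around the base policy's decision; a minimal sketch (names are ours) is shown below.

    def constrained_choice(base_choice, counts, t, kappa):
        # If some system has fewer than kappa*t samples, sample it; otherwise
        # defer to the base policy's (KG or OCBA) sampling decision.
        lagging = [j for j, n in enumerate(counts) if n < kappa * t]
        return lagging[0] if lagging else base_choice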

Figure 2 (panels (a) $\kappa = 0$ and (b) $\kappa = 0.1$; y-axis: $P(\mathrm{FS}_t)$ on a log scale; x-axis: $t$). Effect of constrained sampling on the probability of false selection as a function of the sampling budget $t$, plotted in log-linear scale. The three sampling policies are: optimal computing budget allocation (OCBA); knowledge gradient (KG); and equal allocation (EA). The configuration is four normally distributed systems with $\mu = (1, 0.5, 0.5, 0.5)$ and $\sigma = (4, 4, 4, 4)$. In (a), $\kappa = 0$ but initial $\tau_0 = 5$ samples are taken from each system. The performance of the two policies degrades dramatically as $t$ increases because of insufficient samples in the initial stages. However, in (b), the performance does not degrade when $\kappa = 0.1$.

7.2. Numerical Experiments

The numerical experiments include a series of tests using normal, Student-t, and Bernoulli distributions. $P(\mathrm{FS}_t)$ is estimated by counting the number of false selections out of $m$ simulation trials, which is chosen so that
$$\frac{P_t(1 - P_t)}{m} \leq \left(\frac{P_t}{10}\right)^2, \qquad (42)$$
where $P_t$ is the order of magnitude of $P(\mathrm{FS}_t)$ for a given budget $t$. This implies the standard error for each estimate of $P(\mathrm{FS}_t)$ is at least ten times smaller than the value of $P(\mathrm{FS}_t)$. For example, in Figure 3, the minimum value of $P(\mathrm{FS}_t)$ is $10^{-4}$, so we choose $m$ to be at least $10^{6}$ by this argument. Consequently, we have sufficiently high confidence that the results are not driven by simulation error.

The experiments are implemented in C using the CFSQP optimization library (Lawrence et al. 1997) on a Windows 7 operating system with 32GB RAM and an Intel Core i7 2.7GHz CPU. For $10^4$ simulation trials with $T = 6400$, the CPU times for the OCBA, KG, EA, and WD policies are 31.0, 8.0, 3.6, and 16.3 minutes, respectively.

The OCBA and WD policies take more CPU time compared to the KG and EA algorithms, since the latter two policies do not involve optimization in each stage. Although the implementation of each algorithm was not optimized for CPU time, the relative CPU times illustrate the dramatic differences between the algorithms.

Figure 3 (panels (a) SC and (b) MDM; y-axis: $P(\mathrm{FS}_t)$ on a log scale; x-axis: $t$). The probability of false selection in the normal distribution case as a function of the sampling budget $t$, plotted in log-linear scale. The four sampling policies are: optimal computing budget allocation (OCBA); knowledge gradient (KG); equal allocation (EA); and Welch Divergence (WD). The left panel is for (a) the Slippage Configuration (SC) with $\mu = (1, 0.5, 0.5, 0.5)$ and the right panel is for (b) Monotone Decreasing Means (MDM) with $\mu = (1, 0.5, 0.25, 0.125)$.

7.2.1. Normal distribution. We consider two configurations with four normally distributed systems. The first configuration will be referred to as the Slippage Configuration (SC), in which the means of the non-best systems are equally separated from the best system; $\mu = (1, 0.5, 0.5, 0.5)$. The second configuration will be referred to as Monotone Decreasing Means (MDM), in which $\mu = (1, 0.5, 0.25, 0.125)$. For both configurations we set $\sigma = (4, 4, 4, 4)$, and estimate probabilities of false selection for different sampling budgets. Figure 3 shows the test results for the WD and the other three benchmark procedures. In particular, we plot $P(\mathrm{FS}_t)$ on a log-linear scale as a function of the sampling budget $t$. In Figure 3, one can observe that the WD and KG policies perform equally well and outperform the other policies with respect to $P(\mathrm{FS}_t)$. Further, $P(\mathrm{FS}_t)$ under these policies converges to 0 exponentially fast.

Note that $P(\mathrm{FS}_t) \approx C\exp(-\beta t)$ implies that $\log P(\mathrm{FS}_t)$ should be approximately linear in $t$ with slope $-\beta$. As $t$ increases, the performance of the OCBA policy becomes inferior to that of the WD.

We also perform the experiment with four normally distributed systems, but with different variances. First, we consider variants of the SC and MDM configurations with increasing variances, i.e., $\sigma = (4, 4.5, 5, 5.5)$. We call these the SC-INC and MDM-INC configurations, respectively. Similarly, we consider the SC-DEC and MDM-DEC configurations with decreasing variances, i.e., $\sigma = (4, 3.5, 3, 2.5)$. The simulation results are illustrated in Figure 4.

7.2.2. Student t-distribution. We consider four systems with Student t-distributions, each of which is parameterized by the location parameter $\mu_j$, the scale parameter $\sigma_j$, and the degrees of freedom $\nu_j$. We again take the two configurations, T-SC and T-MDM. In the T-SC configuration, $\mu = (1, 0.5, 0.5, 0.5)$, $\sigma = (1, 1, 1, 1)$, and $\nu = (2.1, 2.1, 2.1, 2.1)$. In the T-MDM configuration, $\mu = (1, 0.5, 0.25, 0.125)$ with the same $\sigma$ and $\nu$ as in the T-SC configuration. Note that the OCBA and KG policies are developed on the premise that the underlying distributions are normal (light-tailed). Also notice that both the x- and y-axes are on a logarithmic scale in Figure 5. $P(\mathrm{FS}_t)$ for the WD (and the other benchmark policies) converges to 0 at a polynomial rate, consistent with Theorem 4. The KG policy performs as well as the WD does; however, the OCBA becomes less efficient as $t$ increases.

7.2.3. Bernoulli distribution. The four systems have Bernoulli distributions. The distribution of system $j$ is parameterized by a single parameter, $\mu_j \in (0, 1)$, so that its mean and variance are $\mu_j$ and $\mu_j(1 - \mu_j)$, respectively. Note that this distribution becomes more skewed as $\mu_j$ approaches 0 or 1. We consider four configurations in this experiment: BER-LARGE-DIFF, BER-SMALL-DIFF, BER-SKEWED-LARGE-DIFF, and BER-SKEWED-SMALL-DIFF. In the first two configurations, the distributions are only mildly skewed, with $\mu = (0.6, 0.3, 0.3, 0.3)$ and $\mu = (0.6, 0.57, 0.57, 0.57)$, respectively. In the last two configurations, the distributions are significantly skewed, with $\mu = (0.9, 0.6, 0.6, 0.6)$ and $\mu = (0.9, 0.87, 0.87, 0.87)$, respectively.

Figure 4 (panels (a) SC-INC, (b) MDM-INC, (c) SC-DEC, (d) MDM-DEC; y-axis: $P(\mathrm{FS}_t)$ on a log scale; x-axis: $t$). Probability of false selection in normal distribution cases with unequal variances, as a function of the sampling budget $t$, plotted in log-linear scale. The four sampling policies are: optimal computing budget allocation (OCBA); knowledge gradient (KG); equal allocation (EA); and Welch Divergence (WD). Graphs (a) and (b) are for the SC-INC and MDM-INC configurations with increasing variance, $\sigma = (4, 4.5, 5, 5.5)$, and graphs (c) and (d) are for the SC-DEC and MDM-DEC configurations with decreasing variance, $\sigma = (4, 3.5, 3, 2.5)$.

In this experiment, we also consider a variant of the WD policy. Recall that the WD policy solves (19) in each stage to optimize $\rho_G(\cdot)$, the rate function of $P(\mathrm{FS}_t)$ when the underlying distributions are normal. Let us define a policy called WD-BER in a similar manner. When the underlying distributions are Bernoulli, it can be seen that the rate function of $P(\mathrm{FS}_t)$, under a static policy with allocation $\alpha$, is given by (43) below.

Specifically,
$$\rho(\alpha) = \min_{j \neq 1}\left[ -(\alpha_1 + \alpha_j)\log\left( (1-\mu_1)^{\frac{\alpha_1}{\alpha_1+\alpha_j}}(1-\mu_j)^{\frac{\alpha_j}{\alpha_1+\alpha_j}} + \mu_1^{\frac{\alpha_1}{\alpha_1+\alpha_j}}\mu_j^{\frac{\alpha_j}{\alpha_1+\alpha_j}} \right)\right]. \qquad (43)$$
Then, define the WD-BER policy so that it solves, in each stage,
$$\max_{\alpha \in \Delta_{k-1}}\left[ \min_{j \neq b} -(\alpha_b + \alpha_j)\log\left( (1-\bar{X}_{bt})^{\frac{\alpha_b}{\alpha_b+\alpha_j}}(1-\bar{X}_{jt})^{\frac{\alpha_j}{\alpha_b+\alpha_j}} + \bar{X}_{bt}^{\frac{\alpha_b}{\alpha_b+\alpha_j}}\bar{X}_{jt}^{\frac{\alpha_j}{\alpha_b+\alpha_j}} \right)\right]. \qquad (P_t)$$
The WD-BER serves as a hypothetical benchmark since the knowledge that the underlying distributions are Bernoulli is typically not available to the policy designer. Recall that the WD policy is predicated on a two-moment approximation. Hence, one may expect the WD policy to perform on par with the WD-BER policy when the distributions are less skewed. This can indeed be observed in Figure 6(a) and (b). As the distributions become more skewed, one may expect the WD policy to perform poorly compared to the WD-BER policy, and this can be seen in Figure 6(c). Despite the effect of skewness, the WD policy is seen to perform well when the means of the best and second-best systems are close, as seen in Figure 6(d); this is consistent with Proposition 4.

Figure 5 (panels (a) T-SC and (b) T-MDM; y-axis: $P(\mathrm{FS}_t)$; x-axis: $t$; both axes in log scale). Probability of false selection in the Student t-distribution case as a function of the sampling budget $t$. Note that both the x- and y-axes are in log scale. The four sampling policies are: optimal computing budget allocation (OCBA); knowledge gradient (KG); equal allocation (EA); and Welch Divergence (WD). The two graphs are for (a) the T-SC configuration with $\mu = (1, 0.5, 0.5, 0.5)$ and (b) the T-MDM configuration with $\mu = (1, 0.5, 0.25, 0.125)$.

Figure 6 (panels (a) BER-LARGE-DIFF, (b) BER-SMALL-DIFF, (c) BER-SKEWED-LARGE-DIFF, (d) BER-SKEWED-SMALL-DIFF; y-axis: $P(\mathrm{FS}_t)$ on a log scale; x-axis: $t$). Probability of false selection in the Bernoulli distribution case as a function of the sampling budget $t$, plotted in log-linear scale. The five sampling policies are: optimal computing budget allocation (OCBA); knowledge gradient (KG); equal allocation (EA); Welch Divergence (WD); and a Bernoulli variant of the WD policy (WD-BER). The four configurations are (a) BER-LARGE-DIFF with $\mu = (0.6, 0.3, 0.3, 0.3)$, (b) BER-SMALL-DIFF with $\mu = (0.6, 0.57, 0.57, 0.57)$, (c) BER-SKEWED-LARGE-DIFF with $\mu = (0.9, 0.6, 0.6, 0.6)$, and (d) BER-SKEWED-SMALL-DIFF with $\mu = (0.9, 0.87, 0.87, 0.87)$.

7.2.4. Large number of systems. We consider larger problem spaces with 16 and 32 normally distributed systems, referred to as MDM-16sys and MDM-32sys, respectively. In the MDM-16sys configuration, $\mu_j = 0.5^{j-1}$ for $j = 1, 2, \ldots, 16$. In the MDM-32sys configuration, $\mu_j = 0.5^{j-1}$ for $j = 1, 2, \ldots, 32$. For both configurations we set $\sigma = (4, \ldots, 4)$. The simulation results are depicted in Figure 7, in which the WD policy outperforms the other policies.

Figure 7 (panels (a) MDM-16sys and (b) MDM-32sys; y-axis: $P(\mathrm{FS}_t)$ on a log scale; x-axis: $t$). Probability of false selection in the normal distribution case with a large number of systems, plotted as a function of the sampling budget $t$ in log-linear scale. The four sampling policies are: optimal computing budget allocation (OCBA); knowledge gradient (KG); equal allocation (EA); and Welch Divergence (WD).

8. Concluding Remarks

In this paper we propose a tractable dynamic sampling policy for the problem of ordinal optimization. It requires a parameter, $\kappa$, that constrains the minimum sampling rate. How to adaptively choose $\kappa$ to further improve the efficiency of the policy is an interesting problem for future work. Also, while we primarily focused on the probability of false selection as a performance criterion for sampling policies, the methodology developed in this paper can be extended to at least two other criteria. First, it can be applied to the problem of selecting the $m$ ($> 1$) best systems out of $k$. If $P(\mathrm{FS}_t)$ is defined as the probability of falsely selecting one of the $(k - m)$ inferior systems, one can track the behavior of $P(\mathrm{FS}_t)$ for large $t$ using the relationship (22), for which the explicit form of the rate function, $\rho(\cdot)$, can be easily established using large deviations theory. A different criterion, $\mathbb{E}[\mu_1 - \mu_{b_t}]$, penalizes an incorrect choice based on the performance discrepancy. Although it is not analytically tractable for finite $t$, we note that it is possible to approximate the asymptotic behavior of this criterion using the relationship (22).