Tractable Sampling Strategies for Ordinal Optimization


Tractable Sampling Strategies for Ordinal Optimization

Dongwook Shin, Mark Broadie, Assaf Zeevi
Graduate School of Business, Columbia University, New York, NY
{dshin17@gsb.columbia.edu, mnb2@columbia.edu, assaf@gsb.columbia.edu}

We consider the problem of selecting the best of several competing alternatives ("systems"), where the probability distributions of system performance are not known but can be learned via sampling. The objective is to dynamically allocate a finite sampling budget so as to ultimately select the best system. Motivated by Welch's t-test, we introduce a tractable performance criterion and a sampling policy that seeks to optimize it. First, we characterize the optimal policy in an idealized full-information setting where the means and variances of the underlying distributions are fully known. Then we construct a dynamic sampling policy that asymptotically attains the same performance as in the full-information setting; this policy eventually allocates samples as if the underlying probability distributions were fully known. We characterize the efficiency of the proposed policy with respect to the probability of false selection, and show via numerical testing that it performs well compared to other benchmark policies.

Key words: ranking and selection; optimal learning; dynamic sampling

1. Introduction

Given a finite number of statistical populations, henceforth referred to as systems, we are concerned with the problem of dynamically learning the average outcomes of the systems in order to ultimately select the best one. This is an instance of ordinal optimization: the systems cannot be evaluated analytically, but it is possible to sequentially sample from each system subject to a given sampling budget. A commonly used performance measure for sampling policies is the probability of false selection, the likelihood that a policy mistakenly selects a suboptimal system.

Unfortunately, as is well documented in the literature, this objective is not analytically tractable. The main objective of this paper is to formulate a tractable performance measure for the aforementioned problem, and to leverage it in designing well-performing policies. We propose a performance measure based on Welch's t-test statistic (Welch 1947). As it turns out, this performance measure is closely related to the probability of false selection. Further, this measure does not require specific distributional assumptions, which is often the premise in most of the antecedent literature. To arrive at a tractable dynamic sampling policy, our departure point is a full-information relaxation. This lends itself to some key structural insights, which we later use as a guideline to propose the sampling strategy, referred to herein as the Welch Divergence (WD) policy. In particular, the policy attempts to match the allocation patterns of the full-information setting, and when the sampling budget is sufficiently large, the WD policy allocates samples as if it had all information a priori.

The WD policy is structured around the first two moments of the underlying probability distributions. This suggests that it works well with respect to the probability of false selection whenever the behavior of this criterion can be approximated by the first two moments. One of the contributions of this paper is to provide a rigorous statement of this observation, together with numerical evidence for the efficacy of the approach.

The remainder of this paper is organized as follows. In the next section, we review the literature on ordinal optimization. In §3 we formulate a dynamic optimization problem; the asymptotic behavior of the optimal solution is explored in §4. In §5 we propose the WD policy, and its asymptotic performance is addressed in §6. In §7 we test the policy numerically and compare it with several benchmark policies. Proofs are collected in the electronic companion to this paper.

2. Literature Review

One of the early contributions traces back to the work of Bechhofer (1954), who established the indifference-zone (IZ) formulation. The aim of the IZ procedure is to take as few samples as possible to satisfy a desired guarantee on the probability of correct selection.

A large body of research on the IZ procedure followed (see, e.g., Paulson 1964, Rinott 1978, Nelson et al. 2001, Kim and Nelson 2001, Goldsman et al. 2002, Hong 2006). It is commonly assumed that distributions are normal, and Bonferroni-type inequalities are used to bound the probability of correct selection. Recent IZ procedures developed by Kim and Dieker (2011), Dieker and Kim (2012), and Frazier (2014) focus on finding tight bounds on the probability of correct selection in the normal distribution case.

While the number of samples is determined by the experimenter in the IZ formulation, the goal of ordinal optimization is to select the best system when a finite sampling budget is given. Glynn and Juneja (2004) use large deviations theory with light-tailed distributions to identify the rate function associated with the probability of false selection. The optimal static full-information allocation is obtained by minimizing the rate function of the probability of false selection. Broadie et al. (2007) and Blanchet et al. (2008) study the heavy-tailed case. The performance criterion introduced in this paper has a close connection to the rate function, as will be shown in §6; however, our work differs significantly from Glynn and Juneja (2004) in that dynamic sampling policies need to learn the means and variances of the systems and simultaneously allocate samples to optimize the objective.

In addition to the frequentist approaches described thus far, a substantial amount of progress has been made using the Bayesian paradigm. A series of papers beginning with Chen (1995) and continuing with Chen et al. (1996, 1997, 2000, 2003) suggest a family of policies known as optimal computing budget allocation (OCBA). These are heuristic policies based on a Bayesian approximation of the probability of correct selection. Specifically, they assume that each system is characterized by a normal distribution with known variance and unknown mean that follows a conjugate (normal) prior distribution. The Bonferroni inequality is then used to approximate the probability of correct selection, and an optimal allocation is derived by solving a static optimization problem in which the approximate probability of correct selection is the objective function. Any OCBA policy can be implemented as a dynamic procedure: samples are generated from each system and then used to estimate the optimal allocations; using these allocations, more samples are generated and the optimal allocations are re-estimated, repeatedly.

The WD policy resembles the dynamic OCBA policy in that both myopically optimize a certain objective function in each stage, but the objectives are different. Further discussion of the key differences between the two policies is deferred to later in the paper.

A set of Bayesian sampling policies distinct from OCBA was introduced by Chick and Inoue (2001). They assume unknown variances, and hence the posterior mean is Student-t distributed. They derive a bound for the expected loss associated with potential incorrect selections, and then asymptotically minimize that bound. Another representative example of a Bayesian sampling policy is the knowledge-gradient (KG) algorithm proposed by Frazier et al. (2008). Like the OCBA policy, it assumes that each distribution is normal with common known variance, and that the unknown means follow a conjugate normal distribution. However, while the probability of correct selection in the OCBA policy corresponds to a 0-1 loss, the KG policy aims to minimize expected linear loss. The 0-1 loss is more appropriate in situations where identification of the best system is critical, while the linear loss is more appropriate when the ultimate value corresponds to the selected system's mean. An unknown-variance version of the KG policy is developed in Chick et al. (2007) under the name LL1.

3. Preliminaries

3.1. Model Primitives

Consider k stochastic systems, whose performance is governed by a random variable X_j with distribution F_j(·), j = 1, ..., k. We assume that E[X_j²] < ∞ for j = 1, ..., k, and let µ_j = ∫ x dF_j(x) and σ_j = (∫ x² dF_j(x) − µ_j²)^{1/2} be the mean and standard deviation of the jth system. Denote by µ (respectively, σ) the k-dimensional vector of means (respectively, standard deviations). We assume that each σ_j is strictly positive to avoid trivial cases and, without loss of generality, let system 1 be the best system, i.e., µ_1 > max_{j≠1} µ_j. A decision maker is given a sampling budget T, which means T independent samples can be drawn from the k systems, and the goal is to correctly identify the best system.

A dynamic sampling policy π is defined as a sequence of random variables, π_1, π_2, ..., taking values in the set {1, 2, ..., k}; the event {π_t = j} means that a sample from system j is taken at stage t.

Define X_{jt}, t = 1, ..., T, as a random sample from system j in stage t and let F_t be the σ-field generated by the samples and sampling decisions taken up to stage t, i.e., {(π_τ, X_{π_τ, τ})}_{τ=1}^t, with the convention that F_0 is the nominal σ-algebra associated with the underlying probability space. The set of dynamic non-anticipating policies is denoted by Π; under such policies the sampling decision in stage t is determined by the sampling decisions and samples observed in previous stages, i.e., {π_t = j} ∈ F_{t−1} for j = 1, ..., k and t = 1, ..., T.

Let N_{jt}(π) be the cumulative number of samples from system j up to stage t induced by policy π, and define α_{jt}(π) := N_{jt}(π)/t as the sampling rate for system j at stage t. Let X̄_{jt}(π) and S²_{jt}(π) be the sample mean and variance of system j in stage t, respectively. Formally,

N_{jt}(π) = Σ_{τ=1}^t I{π_τ = j},   (1)

X̄_{jt}(π) = Σ_{τ=1}^t X_{jτ} I{π_τ = j} / N_{jt}(π),   (2)

S²_{jt}(π) = Σ_{τ=1}^t (X_{jτ} − X̄_{jt}(π))² I{π_τ = j} / N_{jt}(π),   (3)

where I{A} = 1 if A is true, and 0 otherwise. With a single subscript t, α_t(π) = (α_{1t}(π), ..., α_{kt}(π)) and N_t(π) = (N_{1t}(π), ..., N_{kt}(π)) denote the vectors of sampling rates and cumulative numbers of samples in stage t, respectively. Likewise, we let X̄_t(π) and S²_t(π) be the vectors of sample means and variances in stage t, respectively. For brevity, the argument π may be dropped when it is clear from the context. To ensure that X̄_t and S²_{jt} are well defined, each system is sampled once initially.

3.2. Problem Formulation

We define the following measure of divergence between system 1 and system j, for j = 2, ..., k:

d_{jt}(π) = (X̄_{1t} − X̄_{jt})² / (S̃²_{1t}/N_{1t} + S̃²_{jt}/N_{jt}),   t = 1, ..., T,   (4)

where S̃²_{jt} = max(S²_{jt}, σ²) is the truncated sample variance and σ² > 0 is a constant that is sufficiently smaller than the true variances, i.e., σ² ≤ min_j{σ_j²}. The value σ² is introduced to ensure that (4) is well defined. Note that the divergence measure d_{jt}(π) is related to Welch's t-test statistic (Welch 1947).

The null hypothesis in the t-test can be rejected at a higher significance level when the test statistic, or equivalently the divergence measure d_{jt}(π), is large. In other words, d_{jt}(π) essentially measures the amount of information collected to reject the null hypothesis H_0^{(j)}: µ_1 > µ_j. While the formulation in Welch (1947) assumes normal populations in order to study the distribution of the test statistic, no such assumption is made here.

For an admissible policy π ∈ Π let

J_t(π) = min_{j≠1} d_{jt}(π)   (5)

be the minimum level of the (absolute) divergence between system 1 and the other systems. Recall that Π is the set of all non-anticipating policies. In the optimization problem we consider, we further restrict attention to Π̄ ⊆ Π, the set of policies under which each system would be sampled infinitely often with probability one if the sampling budget were infinite. We call Π̄ the set of consistent policies because the sample means and variances are consistent estimators of their population counterparts under such policies. This observation is summarized in the following proposition.

Proposition 1 (Consistency of estimators). For any consistent policy π ∈ Π̄,

X̄_{jt} → µ_j,   (6)

S²_{jt} → σ_j²,   (7)

almost surely as t → ∞.

The dynamic optimization problem can now be formulated as follows:

J*_T = sup_{π ∈ Π̄} E[J_T(π)].   (8)

Note that this problem can only be solved with full knowledge of the underlying probability distributions, given that the expectation above requires this information. Further, even when the decision maker has full knowledge of the distributions, the dynamic optimization problem (8) is essentially computationally intractable. Nevertheless, structural properties of the solution to (8) can be extracted when T is large.
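For illustration, the following short Python sketch (not part of the original paper) computes the divergence d_{jt} of (4) and the objective J_t of (5) from raw samples. The per-system sample arrays, the default truncation constant sigma0_sq, and the toy configuration at the end are assumptions of the example.

import numpy as np

def welch_divergence(samples, sigma0_sq=1e-6):
    """samples: list of 1-D arrays, samples[j] holds the N_jt draws from system j.
    Returns (d, J_t), where d[j] is the divergence (4) between system 1 (index 0)
    and system j, and J_t = min_{j != 1} d[j] as in (5)."""
    means = np.array([s.mean() for s in samples])
    # Truncated sample variances, as in the definition of the truncated S^2_{jt}.
    variances = np.array([max(s.var(), sigma0_sq) for s in samples])
    counts = np.array([len(s) for s in samples])
    d = np.full(len(samples), np.inf)
    for j in range(1, len(samples)):
        d[j] = (means[0] - means[j])**2 / (variances[0]/counts[0] + variances[j]/counts[j])
    return d, d[1:].min()

# Toy usage: four systems, 10 samples each.
rng = np.random.default_rng(0)
samples = [rng.normal(mu, 4.0, size=10) for mu in (1.0, 0.0, 0.0, 0.0)]
d, J = welch_divergence(samples)
print(d, J)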

4. A Full Information Relaxation and an Asymptotic Benchmark

4.1. A Full Information Relaxation

Let d̄_{jt}(π) be the analogue of d_{jt}(π) defined in (4), i.e.,

d̄_{jt}(π) := (µ_1 − µ_j)² / (σ_1²/N_{1t} + σ_j²/N_{jt}),   t = 1, ..., T,   (9)

with sample means and variances replaced by the population means and variances, respectively. Also, define

J̄_t(π) := min_{j≠1} d̄_{jt}(π)   (10)

to be the counterpart of J_t(π) in (5). Note that J̄_T(π) does not depend on the sampling order, but only on the cumulative number of samples from each system. Hence, it suffices to consider static policies whose allocation lies in the (k−1)-dimensional simplex,

Δ_{k−1} = { (α_1, ..., α_k) ∈ R^k : Σ_{j=1}^k α_j = 1 and α_j ≥ 0 for all j },   (11)

with N_{jT} = α_j T for each j = 1, ..., k, ignoring integrality constraints on N_{jT}. The full-information benchmark problem is formulated as follows:

α* = arg max_{α ∈ Δ_{k−1}} min_{j≠1} ( (µ_1 − µ_j)² / (σ_1²/α_1 + σ_j²/α_j) ).   (12)

Note that the objective function of (12) is equivalent to J̄_T(π)/T with N_{jT} replaced by α_j T for each j = 1, ..., k. Also, define

J̄*_T := min_{j≠1} (µ_1 − µ_j)² / (σ_1²/(α*_1 T) + σ_j²/(α*_j T)),   (13)

the value of J̄_T(π) for a policy π that satisfies N_{jT}(π) = α*_j T for each j. Since (12) is a convex programming problem, the following result can be used to determine the (unique) optimal solution to the full-information benchmark based on first-order conditions.

Proposition 2 (Characterization of the optimal allocation under full information). The optimal solution α* ∈ Δ_{k−1} to the full-information benchmark problem (12) is strictly positive and satisfies the following system of equations:

(µ_1 − µ_i)² / (σ_1²/α*_1 + σ_i²/α*_i) = (µ_1 − µ_j)² / (σ_1²/α*_1 + σ_j²/α*_j),   for all i, j = 2, ..., k,   (14)

(α*_1/σ_1)² = Σ_{j=2}^k (α*_j/σ_j)².   (15)

Example 1. Setting µ = (1, 0, 0, 0) and σ = (4, 4, 4, 4), we obtain α* = (0.514, 0.162, 0.162, 0.162). When µ = (1, 2/3, 1/3, 0) with the same σ as in the base case, we obtain α* = (0.457, 0.451, 0.065, 0.027); that is, we take more samples from a system whose mean is closer to that of the best system. Next, consider the case with µ = (1, 0, 0, 0) as in the base case but σ = (1, 3, 5, 7). In this case the optimal allocation is α* = (0.099, 0.098, 0.271, 0.532); that is, the greater the variance, the more samples are taken.

4.2. An Asymptotic Benchmark

In this section we characterize some salient features of optimal sampling policies when the sampling budget is sufficiently large, using the optimal solution to the full-information benchmark problem (12). First we introduce a rather straightforward characterization of asymptotic optimality.

Definition 1 (Asymptotic optimality). A policy π ∈ Π̄ is asymptotically optimal if

E[J_T(π)] / J*_T → 1 as T → ∞.   (16)

The key observation is that, under a consistent policy π ∈ Π̄, the sample mean and variance of each system approach the population mean and variance, respectively, almost surely as the sampling budget grows large (Proposition 1). In other words, J_T(π) is close to J̄_T(π) for sufficiently large T under any consistent policy π ∈ Π̄, which allows us to substitute J̄*_T for J*_T.
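As an illustration of how the benchmark (12) and the conditions (14)-(15) can be checked numerically, the following Python sketch solves the full-information problem with a generic solver and compares the result with the third case of Example 1. It is an assumption-laden illustration, not the authors' implementation (their experiments use C with CFSQP); the epigraph reformulation, tolerances, and bounds are choices of the sketch.

import numpy as np
from scipy.optimize import minimize

def full_info_allocation(mu, sigma):
    """Numerically solve (12): maximize over the simplex the minimum of
    (mu_1 - mu_j)^2 / (sigma_1^2/a_1 + sigma_j^2/a_j), j != 1.
    Decision vector x = (a_1, ..., a_k, z), where z is the epigraph variable."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    k = len(mu)
    def rate(x, j):
        a = x[:k]
        return (mu[0] - mu[j])**2 / (sigma[0]**2/a[0] + sigma[j]**2/a[j])
    cons = [{'type': 'eq', 'fun': lambda x: x[:k].sum() - 1.0}]
    for j in range(1, k):
        cons.append({'type': 'ineq', 'fun': lambda x, j=j: rate(x, j) - x[k]})
    x0 = np.concatenate([np.full(k, 1.0/k), [0.0]])
    bounds = [(1e-6, 1.0)] * k + [(0.0, None)]
    res = minimize(lambda x: -x[k], x0, bounds=bounds, constraints=cons, method='SLSQP')
    return res.x[:k]

alpha = full_info_allocation([1, 0, 0, 0], [1, 3, 5, 7])
print(np.round(alpha, 3))  # roughly (0.099, 0.098, 0.271, 0.532), cf. Example 1
# Check the balance condition (15): (a_1/sigma_1)^2 versus sum_{j>=2} (a_j/sigma_j)^2.
print((alpha[0]/1)**2, sum((alpha[j]/s)**2 for j, s in zip(range(1, 4), [3, 5, 7])))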

Proposition 3 (Equivalent characterization of asymptotic optimality). A policy π ∈ Π̄ that satisfies

E[J_T(π)] / J̄*_T → 1 as T → ∞,   (17)

where J̄*_T is defined in (13), is asymptotically optimal.

This provides a significantly simpler criterion for asymptotic optimality: while it is difficult to verify that (16) holds, because J*_T is difficult to compute either analytically or numerically, it is straightforward to obtain J̄*_T. Based on Proposition 3, the next result provides a sufficient condition for asymptotic optimality.

Theorem 1 (Sufficient condition for asymptotic optimality). A policy π ∈ Π̄ is asymptotically optimal if

α_T(π) → α* as T → ∞,   (18)

almost surely, where α* is the optimal solution to (12).

5. The Welch Divergence Policy

In this section we consider a setting where no information on the underlying probability distributions is given a priori. We propose a dynamic sampling policy, called the Welch Divergence (WD) policy, that achieves asymptotic optimality. Theorem 1 suggests a simple condition for asymptotic optimality, (18), from which we can construct a policy that meets the criterion. Recall that α* in (18), the optimal solution to the full-information benchmark problem, is not known a priori in this setting and one needs to infer it from sample estimates. Denote by α̂_t = (α̂_{1t}, ..., α̂_{kt}) its estimator in stage t, which is defined as follows:

α̂_t = arg max_{α ∈ Δ_{k−1}} min_{j≠b} ( (X̄_{bt} − X̄_{jt})² / (S̃²_{bt}/α_b + S̃²_{jt}/α_j) ),   (19)

with b = arg max_j {X̄_{jt}}. The optimization problem in (19) is called a perturbed problem since µ_j and σ_j² in (12) are replaced with their noisy estimates, X̄_{jt} and S̃²_{jt}, respectively.

Note that the objective function in (19) is closely related to Welch's t-test statistic (Welch 1947), hence the name of the policy. To achieve (18), it suffices to have (i) α_t close to α̂_t; and (ii) α̂_t approaching α* as t increases to infinity. It can easily be seen that (ii) is achieved if X̄_{jt} → µ_j and S²_{jt} → σ_j² for each j = 1, ..., k. (See Lemma A.2 for details.) The WD policy is constructed to match α_t with α̂_t in each stage, satisfying (ii) automatically. The WD policy is summarized below in algorithmic form, with κ being a parameter of the algorithm.

Algorithm 1 WD(κ)
  For each j, take samples until S²_{jt} > 0
  while t ≤ T do
    Solve the perturbed problem (19) to obtain (α̂_{1t}, ..., α̂_{kt})
    Let π_{t+1} be such that
      π_{t+1} = j, if α_{jt} < κ for some j;
      π_{t+1} = arg max_{j=1,2,...,k} (α̂_{jt} − α_{jt}), otherwise
    Sample from system π_{t+1} and let t = t + 1
  end while

Remark 1 (Greedy and constrained WD policy). The parameter κ of the WD policy represents the minimum sampling rate for each system; that is, the procedure guarantees that N_{jt} ≥ κt for each j. We call the WD policy with κ = 0 the greedy-WD and the one with κ > 0 the constrained-WD. Essentially, κ > 0 ensures that the number of samples from each system increases at least at a linear rate.
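A schematic rendering of Algorithm 1 in Python is given below. It is an illustrative sketch only: the sampling oracle draw, the perturbed-problem solver solve_perturbed, and the variance floor 1e-6 are assumptions, and solve_perturbed would be implemented along the lines of the full-information solver sketched earlier, with sample estimates in place of µ and σ².

import numpy as np

def wd_policy(draw, k, T, kappa, solve_perturbed):
    """Schematic WD(kappa) loop (Algorithm 1). draw(j) returns one sample from system j;
    solve_perturbed(means, variances, counts) returns alpha_hat from (19)."""
    samples = [[] for _ in range(k)]
    # Initialization: sample each system until its sample variance is positive.
    for j in range(k):
        while len(samples[j]) < 2 or np.var(samples[j]) == 0.0:
            samples[j].append(draw(j))
    t = sum(len(s) for s in samples)
    while t < T:
        counts = np.array([len(s) for s in samples], float)
        means = np.array([np.mean(s) for s in samples])
        variances = np.array([max(np.var(s), 1e-6) for s in samples])  # truncated
        alpha_t = counts / t
        below = np.where(alpha_t < kappa)[0]
        if below.size > 0:                 # constrained step: enforce N_jt >= kappa * t
            nxt = int(below[0])
        else:                              # greedy step: largest gap to alpha_hat
            alpha_hat = np.asarray(solve_perturbed(means, variances, counts))
            nxt = int(np.argmax(alpha_hat - alpha_t))
        samples[nxt].append(draw(nxt))
        t += 1
    return max(range(k), key=lambda j: np.mean(samples[j]))  # selected system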

Remark 2 (WD policy in the discrete-distribution case). When the underlying distributions are discrete, an event X̄_{it} = X̄_{jt} for i ≠ j may occur with positive probability. In this case, the objective function in (19) is 0 for all α ∈ Δ_{k−1} and α̂_t may not be well defined. To avoid technical difficulties due to ties, we let ε_j > 0 be a sufficiently small positive number for j = 1, ..., k, with ε_i ≠ ε_j for each i ≠ j, and replace each X̄_{jt} in (19) with X̄_{jt} + ε_j/N_{jt}. Note that the magnitude of the perturbation, ε_j/N_{jt}, diminishes as N_{jt} → ∞. The modified WD policy, as well as the proofs of the relevant theorems, is provided in Appendix C.

6. Main Results

6.1. Consistency and Asymptotic Optimality

We first address the primary asymptotic performance measure for the WD policy. The next theorem establishes the consistency of the WD policy.

Theorem 2 (Consistency). For any 0 ≤ κ ≤ 1/k, the WD policy is consistent.

Although one can easily show that any static policy with positive sampling rates (i.e., α_t = α > 0 for all t) is consistent (hence the constrained-WD policy is also consistent), it is less obvious that the greedy-WD exhibits this property. The following is a simple example of a myopic policy that is not consistent.

Example 2. Let π_t = arg max_j {X̄_{jt}} in every stage t; that is, a sample is taken from the system with the greatest sample mean. Assume all systems are sampled once initially. For simplicity, suppose k = 2 and assume the populations are normal with (µ_1, µ_2) = (2, 1) and unit variances. At stage 2, with π_1 = 1 and π_2 = 2, the event A = {X̄_{12} ∈ (−1, 0) and X̄_{22} ∈ (1, 2)} occurs with positive probability. Conditional on this event, system 1 would not be sampled at all if X̄_{2t} ≥ 0 for all t ≥ 3. The latter event occurs with positive probability since {Σ_{s=2}^t X_{2s} : t ≥ 3} is a random walk with positive drift, and the probability that it ever falls below zero is strictly less than one. Combined with the fact that P(A) > 0, this policy is not consistent.

The next theorem establishes the asymptotic optimality of the WD policy. Put α_min = min_j {α*_j}.

Theorem 3 (Asymptotic optimality). For any 0 ≤ κ ≤ α_min, the WD policy is asymptotically optimal.

Theorem 3 implies that the WD policy recovers full-information optimality even in its myopic form. We make two remarks on this theorem. First, for the WD policy with some κ ∈ (α_min, 1/k], it is not difficult to show that α_t → α ≠ α*, in which case the WD policy is not asymptotically optimal by Theorem 1. Second, even though the WD policy is asymptotically optimal for any κ ∈ [0, α_min], its finite-time performance still depends on κ. Specifically, when κ is set too large, i.e., close to α_min, the finite-time performance may become poor due to excessive sampling from suboptimal systems; when κ is set too low, i.e., close to 0, the finite-time performance may also be poor due to inaccurate sample estimates in the early stages.

6.2. Connection to the Probability of False Selection

The probability of false selection, denoted P(FS_t) with FS_t := {X̄_{1t} < max_{j≠1} X̄_{jt}}, is a widely used criterion for the efficiency of a sampling policy (see, e.g., the survey by Kim and Nelson 2006). In this section we show that asymptotic optimality in terms of the proposed performance measure, (16), can be translated into asymptotic efficiency in terms of P(FS_t); the term will be made precise in what follows.

Some large deviations preliminaries. To analyze the probability of false selection we need to establish tail bounds that rely on the existence of the moment generating function. Define M_j(θ) := E[e^{θX_j}] and Λ_j(θ) := log M_j(θ), and let I_j(·) denote the Fenchel-Legendre transform of Λ_j(·), i.e.,

I_j(x) = sup_θ {θx − Λ_j(θ)}.   (20)

Let D_j = {θ ∈ R : Λ_j(θ) < ∞} and H_j = {Λ'_j(θ) : θ ∈ D_j^0}, where A^0 denotes the interior of a set A and Λ'_j(θ) denotes the derivative of Λ_j(·) at θ. Let θ_j(x) be the unique solution to the equation

Λ'_j(θ) = x.   (21)
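To illustrate the quantities in (20)-(21), the short Python sketch below evaluates the Fenchel-Legendre transform I_j(x) numerically from a given log-moment-generating function. The Bernoulli example and the bound on θ are assumptions of the sketch.

import numpy as np
from scipy.optimize import minimize_scalar

def legendre_transform(log_mgf, x, theta_bound=50.0):
    """Numerically evaluate I(x) = sup_theta { theta*x - Lambda(theta) } as in (20),
    where log_mgf(theta) = Lambda(theta) = log E[exp(theta X)]."""
    res = minimize_scalar(lambda th: -(th * x - log_mgf(th)),
                          bounds=(-theta_bound, theta_bound), method='bounded')
    return -res.fun

# Example: Bernoulli(mu) has Lambda(theta) = log(1 - mu + mu*exp(theta)).
mu = 0.6
I = lambda x: legendre_transform(lambda th: np.log(1 - mu + mu*np.exp(th)), x)
print(I(0.5))  # should match x*log(x/mu) + (1-x)*log((1-x)/(1-mu)) ~ 0.0204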

The following assumption ensures that the sample mean of each system can take any value in the interval [min_{j≠1} µ_j, µ_1]; it is satisfied for commonly encountered families of distributions such as the normal, Bernoulli, Poisson, and gamma.

Assumption 1. The interval [min_{j≠1} µ_j, µ_1] ⊆ ∩_{j=1}^k H_j^0.

When this assumption is satisfied, Glynn and Juneja (2004) show that

(1/t) log P(FS_t) → −ρ(α),   (22)

with ρ(α) = min_{j≠1} G_j(α), where

G_j(α) = inf_x {α_1 I_1(x) + α_j I_j(x)}.   (23)

One can interpret (22) as a statement that P(FS_t) behaves roughly like exp(−ρ(α)t) for large values of t. It follows that the best possible asymptotic performance is obtained by the α ∈ Δ_{k−1} that maximizes ρ(α).

We denote by ρ_G(α) the rate function when the underlying distributions are normal with the same mean and standard deviation vectors µ = (µ_1, ..., µ_k) and σ = (σ_1, ..., σ_k). The subscript G is mnemonic for Gaussian. It is easily seen that

ρ_G(α) = min_{j≠1} (µ_1 − µ_j)² / (2(σ_1²/α_1 + σ_j²/α_j)),   (24)

where α ∈ Δ_{k−1}.

Remark 3 (Relation to the Welch divergence metric). With a slight abuse of notation, setting J̄_T(α) := J̄_T(π) for a static allocation α ∈ Δ_{k−1}, we have

ρ_G(α) = J̄_T(α) / (2T).   (25)

In view of Theorem 3, the WD policy asymptotically maximizes J̄_T(α). Hence, the WD policy asymptotically maximizes ρ_G(·), the exponent of P(FS_t), when the underlying distributions are normal. In this respect, the WD policy resembles the sequential sampling procedure in Glynn and Juneja (2004), which targets the allocation that asymptotically optimizes ρ(·). The latter procedure, however, requires the family of underlying distributions to be known.

Accuracy of the Gaussian approximation. We now show that one may approximate ρ(α) by ρ_G(α) if the underlying distributions are not normal but µ_1 and max_{j≠1}{µ_j} are suitably close to each other. The following proposition establishes the relation between ρ(·) and ρ_G(·).

Proposition 4 (Local relative efficiency of the Gaussian approximation). Suppose Assumption 1 holds. Then for any α ∈ Δ_{k−1},

ρ(α)/ρ_G(α) → 1 as max_{j≠1}{µ_j} → µ_1.   (26)

Proposition 4, combined with the fact that α* = arg max_α {ρ_G(α)}, suggests that the WD policy can achieve near-optimal performance in terms of P(FS_t) when the means of the best and second-best systems are sufficiently close to each other, regardless of the underlying probability distributions. We present a simple example to verify (26).

Example 3. Consider two Bernoulli systems with parameters µ_1, µ_2 ∈ (0, 1), µ_1 > µ_2, so that P(X_j = 1) = µ_j = 1 − P(X_j = 0) for j = 1, 2. The rate functions are

I_j(x) = x log(x/µ_j) + (1 − x) log((1 − x)/(1 − µ_j)),   j = 1, 2.   (27)

Using a second-order Taylor expansion of I_j(x) at x = µ_j and the fact that σ_j² = µ_j(1 − µ_j), it can be seen that

I_j(x) = (x − µ_j)²/(2σ_j²) + o((µ_1 − µ_2)²),   (28)

where o(x) is such that o(x)/x → 0 as x → 0. Since ρ(α_1, α_2) = inf_x {α_1 I_1(x) + α_2 I_2(x)}, it is not difficult to show that

ρ(α_1, α_2) = (µ_1 − µ_2)²/(2(σ_1²/α_1 + σ_2²/α_2)) + o((µ_1 − µ_2)²),   (29)

which reduces to (24) as µ_1 − µ_2 → 0.
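The following Python sketch (illustrative only) makes Proposition 4 concrete for the Bernoulli pair of Example 3: it computes the exact exponent via (23) and its two-moment approximation (24) and prints their ratio as the gap µ_1 − µ_2 shrinks. The particular parameter values and the equal allocation are assumptions of the sketch.

import numpy as np
from scipy.optimize import minimize_scalar

def bern_I(x, mu):
    # Bernoulli rate function from (27).
    return x*np.log(x/mu) + (1 - x)*np.log((1 - x)/(1 - mu))

def rho(a1, a2, mu1, mu2):
    # Exact exponent G_2(alpha) = inf_x { a1*I_1(x) + a2*I_2(x) }, cf. (23).
    res = minimize_scalar(lambda x: a1*bern_I(x, mu1) + a2*bern_I(x, mu2),
                          bounds=(mu2 + 1e-9, mu1 - 1e-9), method='bounded')
    return res.fun

def rho_gauss(a1, a2, mu1, mu2):
    # Two-moment (Gaussian) approximation (24), with sigma_j^2 = mu_j(1 - mu_j).
    v1, v2 = mu1*(1 - mu1), mu2*(1 - mu2)
    return (mu1 - mu2)**2 / (2*(v1/a1 + v2/a2))

for gap in (0.3, 0.1, 0.03, 0.01):
    mu1, mu2 = 0.6, 0.6 - gap
    print(gap, rho(0.5, 0.5, mu1, mu2) / rho_gauss(0.5, 0.5, mu1, mu2))
# The printed ratio approaches 1 as the gap shrinks, consistent with (26).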

To paraphrase Proposition 4: after taking t → ∞, the probability of false selection on a logarithmic scale depends on the underlying distributions only through their first two moments as the gap in the means shrinks to 0. To state this insensitivity property more precisely, consider a stylized sequence of system configurations defined as follows. Each system configuration is indexed by a superscript t. For t = 1 we let µ_j^1 = µ_j for each j = 1, ..., k. For t ≥ 2, we fix µ_1^t = µ_1 and set µ_j^t, j = 2, ..., k, so that µ_1 − µ_j^t = δ_t(µ_1 − µ_j) for some 0 < δ_t ≤ 1. That is, the probability distributions for systems j = 2, ..., k are shifted so that their means are µ_j^t. For brevity, we set δ_1 = 1 and let the system configuration at each t be characterized by δ_t. Note that we assume

E[e^{iθ(X_j − µ_j^t)}; δ_t] = E[e^{iθ(X_j − µ_j^1)}; δ_1]   (30)

for each j = 2, ..., k, where E[·; δ_t] denotes expectation under the configuration parametrized by δ_t. This means the probability distribution of X_j is shifted so that the central moments of X_j are unchanged. Define P(FS_t(α); δ_t) as the probability of false selection under the t-th configuration, parametrized by δ_t and α ∈ Δ_{k−1}.

Proposition 5 (Validity of the Gaussian approximation). Under Assumption 1, if tδ_t² → ∞ and δ_t → 0 as t → ∞, then for any α ∈ Δ_{k−1},

(1/(tδ_t²)) log P(FS_t(α); δ_t) → −ρ_G(α) as t → ∞.   (31)

Heuristically, Proposition 5 implies that the behavior of P(FS_t(α); δ) for large values of t is governed by the magnitude of δ; specifically, P(FS_t(α); δ) ≈ exp{−ρ_G(α)tδ²}.

6.3. Efficiency of the WD Policy

Recall from Remark 3 that the WD policy optimizes ρ_G(α). If we write ρ_max = max_α {ρ(α)}, the loss in asymptotic efficiency can be written as l(t) = 1 − ρ(α_t)/ρ_max, which can be decomposed into two parts: (i) an approximation error, due to maximizing ρ_G(·) instead of ρ(·); and (ii) an estimation error, due to the loss incurred by noisy estimates of the means and variances. Formally, for any t ≥ 2k,

l(t) = 1 − ρ(α_t)/ρ_max   (32)

     = (1 − ρ(α*)/ρ_max) + (ρ(α*) − ρ(α_t))/ρ_max   (33)

     = l_1 + l_2(t),   (34)

where l_1 corresponds to the loss due to the approximation error and l_2(t) corresponds to the loss due to noisy sample estimates. Note that l_1 is independent of t and that l_2(t) decreases to 0 almost surely as t → ∞, which can easily be seen from Theorem 1. In the regime of small t, the latter error dominates the former, i.e., l_1 ≪ l_2(t), and hence learning the mean of each system is more important than a precise approximation of ρ(·); in other words, in this regime ρ_G(·) is a reasonable proxy for ρ(·). On the other hand, when t is large, l_1 is the dominating factor in the efficiency loss. In the non-normal distribution case, l_1 > 0 and the policy cannot recover the full asymptotic efficiency ρ_max. However, it is important to note that the magnitude of l_1 is not significant as long as the gap in means is small enough (Proposition 4).

For the purpose of visualization, Figure 1 illustrates the magnitudes of l_1 and l_2(t) in two simple configurations, each with two Bernoulli systems: one with δ = 0.05 and the other with δ = 0.005, where δ = µ_1 − µ_2. For both configurations, there exists some t_c such that l_1 = l_2(t_c). For t ≤ t_c, maximizing ρ_G(·) does not lead to a significant loss in efficiency because l_1 ≤ l_2(t). Also, one can observe that t_c is roughly proportional to 1/δ²; when δ becomes 10 times smaller, it requires approximately 100 times more sampling budget to reduce l_2(t) to the level of l_1, which is consistent with the result of Proposition 5 that the behavior of P(FS_t; δ) is closely tied to δ² for large t.

[Figure 1 shows two panels: (a) δ = 0.05, with crossing point t_c ≈ 600; (b) δ = 0.005, with crossing point t_c ≈ 60,000; each plots the efficiency-loss components l_1 and l_2(t) against t.]

Figure 1. Decomposition of the efficiency loss under the WD policy with κ = 0.1. Each configuration consists of two Bernoulli systems. Graph (a) is for the configuration with (µ_1, µ_2) = (0.9, 0.85) and graph (b) is for the configuration with (µ_1, µ_2) = (0.9, 0.895). Each l_2(t) is estimated by averaging 10^6 generated values of (ρ(α*) − ρ(α_t))/ρ_max. The standard error of each estimate of l_2(t) is less than 10^{−7} in graph (a) and less than 10^{−9} in graph (b). The two graphs suggest that the crossing point of l_1 and l_2(t) is proportional to δ^{−2}.

6.4. Performance of the WD Policy

For static policies, the exact convergence rate of P(FS_t) can be characterized and optimized (by choice of α) for light-tailed systems (Glynn and Juneja 2004); see also Broadie et al. (2007) and Blanchet et al. (2008) for the case of heavy tails. However, these results do not apply to dynamic policies. The next result provides a guarantee on the convergence rate of P(FS_t) under the WD policy.

Theorem 4 (Bounds on the probability of false selection). Under the WD policy with any 0 ≤ κ ≤ 1/k, P(FS_t) → 0 as t → ∞. Moreover, if κ > 0 and E|X_j|^q < ∞ for some q ≥ 2, then

P(FS_t) ≤ (k − 1)η t^{−q/2},   for all t ≥ 2k.   (35)

If E[e^{θX_j}] < ∞ for all j = 1, ..., k and all |θ| ≤ θ_0 for some θ_0 > 0, then

P(FS_t) ≤ (k − 1)e^{−γκt},   for all t ≥ 2k,   (36)

where the constants η, γ > 0 are explicitly identified in the proof.

Note that this result can be applied to set the total sampling budget required to guarantee P(FS_T) ≤ ε. Specifically, (36) implies that T = log((k − 1)/ε)/(γκ) guarantees this desired level of accuracy. The exact value of T cannot be computed a priori because the constant γ (or η) depends on the (unknown) probability distributions. In the cases addressed below, one may set T explicitly by utilizing a bound on either γ or η.

Remark 4 (Bounded random variables). Suppose that X_{jt} ∈ [a, b] with probability one for each j = 1, ..., k, and that µ_1 ≥ µ_j + δ for j = 2, ..., k. Then, following steps similar to those in the proof of Theorem 4, we have that

P(FS_t) ≤ (k − 1) exp(−κtδ²/(2(b − a)²)),   for all t ≥ 2k,   (37)

where the right-hand side is obtained by iteratively applying Hoeffding's inequality. Hence, setting T = 2(b − a)² log((k − 1)/ε)/(κδ²) guarantees P(FS_T) ≤ ε.

Remark 5 (Sub-Gaussian random variables). Suppose that X_{jt} is sub-Gaussian for each j = 1, ..., k; that is, there exists c_j > 0 such that E[e^{θX_{jt}}] ≤ e^{c_j²θ²/2} for all |θ| ≤ θ_0. Also, suppose that µ_1 ≥ µ_j + δ for j = 2, ..., k. In this case, following steps similar to those in the proof of Theorem 4, one can establish that

P(FS_t) ≤ (k − 1) exp(−κtδ²/(8c²)),   for all t ≥ 2k,   (38)

where c ≥ max_j{c_j}. Hence, setting T = 8c² log((k − 1)/ε)/(κδ²) guarantees P(FS_T) ≤ ε.

7. Numerical Testing and Comparison with Other Policies

7.1. Different Sampling Policies

We compare the WD policy against three other policies: the dynamic algorithm for optimal computing budget allocation (OCBA) (Chen et al. 2000), the knowledge gradient (KG) algorithm (Frazier et al. 2008), and the equal allocation (EA) policy; see details below.

Optimal Computing Budget Allocation (OCBA). Chen et al. (2000) assume that the underlying distributions are normal. They formulate the problem as

min_{α ∈ Δ_{k−1}} Σ_{j≠b} Φ̄( (X̄_{bt} − X̄_{jt}) / (S²_{bt}/(α_b t) + S²_{jt}/(α_j t))^{1/2} ),   (39)

where b = arg max_j {X̄_{jt}}, Φ(·) is the standard normal cumulative distribution function, and Φ̄(·) = 1 − Φ(·). They suggest a dynamic policy based on first-order conditions for (39). Initially, τ_0 samples are taken from each system, which are then used to allocate incremental samples. Specifically, compute (α̂_1, ..., α̂_k) by solving

α̂_j / α̂_i = ( S_{jt}/(X̄_{bt} − X̄_{jt}) )² / ( S_{it}/(X̄_{bt} − X̄_{it}) )²,   for i, j ∈ {1, 2, ..., k} and i ≠ j ≠ b,   (40)

α̂_b = S_{bt} ( Σ_{j≠b} α̂_j²/S²_{jt} )^{1/2},   (41)

which are derived from the first-order conditions of (39). Then allocate Δ·α̂_j samples to system j. This procedure is repeated until the sampling budget is exhausted. We let Δ = 1 in this paper.
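For concreteness, the following Python sketch computes one OCBA allocation from sample means and variances using the first-order conditions as reconstructed in (40)-(41). It is an illustration under those reconstructions, not the reference implementation of Chen et al. (2000); the example input values are arbitrary.

import numpy as np

def ocba_allocation(means, variances):
    """One OCBA allocation step from the reconstructed conditions (40)-(41):
    alpha_j/alpha_i = (S_j/(Xbar_b - Xbar_j))^2 / (S_i/(Xbar_b - Xbar_i))^2 for non-best i, j,
    and alpha_b = S_b * sqrt(sum_{j != b} alpha_j^2 / S_j^2), normalized to the simplex."""
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    b = int(np.argmax(means))
    others = [j for j in range(len(means)) if j != b]
    # Unnormalized weights for non-best systems, proportional to (S_j / delta_j)^2.
    w = np.zeros(len(means))
    for j in others:
        w[j] = variances[j] / (means[b] - means[j])**2
    # Condition (41) for the current sample-best system.
    w[b] = np.sqrt(variances[b]) * np.sqrt(sum(w[j]**2 / variances[j] for j in others))
    return w / w.sum()

print(np.round(ocba_allocation([1.0, 0.5, 0.5, 0.5], [16, 16, 16, 16]), 3))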

Knowledge Gradient (KG). We briefly describe the KG policy; a fuller description can be found in Frazier et al. (2008). Let µ_j^t and (σ_j^t)² denote the mean and variance of the posterior distribution for the unknown performance of system j in stage t. The KG policy maximizes the single-period expected increase in value, E_t[max_j µ_j^{t+1} − max_j µ_j^t], where E_t[·] denotes the conditional expectation with respect to the filtration up to stage t. Although Frazier et al. (2008) mainly discuss the KG policy for the known-variance case, we use its unknown-variance version in order to be consistent with the other benchmark policies herein. This policy requires users to set parameters for the prior distributions, and we use τ_0 initial sample estimates to set them.

Equal Allocation (EA). This is a simple static policy in which all systems are sampled equally, i.e., π_t = arg min_j N_{jt}, breaking ties by selecting the system with the smallest index.

Note that the dynamic OCBA procedure described above resembles the WD policy in that it myopically optimizes a certain objective function. However, the objective function in (39) differs from that of the WD policy in each stage. Further, Chen et al. (2000) assumed that N_{bt} ≫ N_{jt} in deriving equations (40) and (41), which leads to an asymptotic allocation different from that of the WD policy.

In Chen et al. (2000), the OCBA policy takes an initial number of samples, τ_0, as a parameter. Likewise, in the KG policy described above, τ_0 initial samples are taken to estimate the parameters of the prior distributions. In this paper, however, for the purpose of comparison with the WD policy, the KG and OCBA policies are modified so that they take κ as a single parameter. To be specific, in each stage t we check whether each system has been sampled at least κt times: if N_{jt} < κt for some j, we take a sample from system j; if N_{jt} ≥ κt for all j, we follow the sampling decision of the KG or OCBA policy. This affects the performance of the two policies, as illustrated in Figure 2, in which the two policies are tested for different values of κ; the configuration is four normally distributed systems with means µ = (1, 0.5, 0.5, 0.5) and standard deviations σ = (4, 4, 4, 4). In Figure 2(a), we set κ = 0 but take τ_0 = 5 initial samples from each system. There is no general rule for choosing an optimal value of κ. In the following numerical experiments, we set κ = 0.4/k for all policies. This is by no means an optimal choice, but we remark that the qualitative conclusions hold for other choices of κ as well.

[Figure 2: two panels, (a) κ = 0 and (b) κ = 0.1; each plots P(FS_t) against the sampling budget t on a log-linear scale.]

Figure 2. Effect of constrained sampling on the probability of false selection as a function of the sampling budget t, plotted on a log-linear scale. The three sampling policies are: optimal computing budget allocation (OCBA), knowledge gradient (KG), and equal allocation (EA). The configuration is four normally distributed systems with µ = (1, 0.5, 0.5, 0.5) and σ = (4, 4, 4, 4). In (a), κ = 0 but τ_0 = 5 initial samples are taken from each system; the performance of the two policies degrades dramatically as t increases because of insufficient samples in the initial stages. In (b), with κ = 0.1, the performance does not degrade.

7.2. Numerical Experiments

The numerical experiments include a series of tests using normal, Student-t, and Bernoulli distributions. P(FS_t) is estimated by counting the number of false selections out of m simulation trials, where m is chosen so that

( P_t(1 − P_t)/m )^{1/2} ≤ P_t/10,   (42)

where P_t is the order of magnitude of P(FS_t) for a given budget t. This implies that the standard error of each estimate of P(FS_t) is at least ten times smaller than the value of P(FS_t). For example, in Figure 3 the minimum value of P(FS_t) is 10^{−4}, so we choose m to be at least 10^6 by this argument. Consequently, we have sufficiently high confidence that the results are not driven by simulation error. The experiments are implemented in C using the CFSQP optimization library (Lawrence et al. 1997) on a Windows 7 operating system with 32GB RAM and an Intel Core i7 2.7GHz CPU.
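As a small worked example of the rule (42), the snippet below computes the smallest trial count m for a target order of magnitude P_t; the factor of 10 follows the text, and everything else is an assumption of the sketch.

import math

def min_trials(p_t, factor=10.0):
    """Smallest m with sqrt(p(1-p)/m) <= p/factor, i.e. the rule (42) with factor 10."""
    return math.ceil(factor**2 * (1.0 - p_t) / p_t)

print(min_trials(1e-4))  # about 10**6 trials when P(FS_t) is of order 10^-4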

[Figure 3: two panels, (a) SC and (b) MDM; each plots P(FS_t) against the sampling budget t on a log-linear scale for the OCBA, KG, EA, and WD policies.]

Figure 3. The probability of false selection in the normal distribution case as a function of the sampling budget t, plotted on a log-linear scale. The four sampling policies are: optimal computing budget allocation (OCBA), knowledge gradient (KG), equal allocation (EA), and Welch Divergence (WD). The left panels are for (a) the slippage configuration (SC) with µ = (1, 0.5, 0.5, 0.5) and the right panels are for (b) the monotone decreasing means (MDM) configuration with µ = (1, 0.5, 0.25, 0.125).

For the simulation trials with T = 6400, the CPU times for the OCBA, KG, EA, and WD policies are 31.0, 8.0, 3.6, and 16.3 minutes, respectively. The OCBA and WD policies take more CPU time than the KG and EA algorithms, since the latter two policies do not involve solving an optimization problem in each stage. Although the implementation of each algorithm was not optimized for CPU time, the relative CPU times illustrate the dramatic differences between the algorithms.

7.2.1. Normal distribution. We consider two configurations with four normally distributed systems. The first configuration will be referred to as the slippage configuration (SC), in which the means of the non-best systems are equally separated from the best system: µ = (1, 0.5, 0.5, 0.5). The second configuration will be referred to as monotone decreasing means (MDM), in which µ = (1, 0.5, 0.25, 0.125). For both configurations we set σ = (4, 4, 4, 4) and estimate the probability of false selection for different sampling budgets. Figure 3 shows the test results for the WD policy and the other three benchmark procedures. In particular, we plot P(FS_t) on a log-linear scale as a function of the sampling budget t. In Figure 3, one can observe that the WD and KG policies perform equally well and outperform the other policies with respect to P(FS_t). Further, P(FS_t) under these policies converges to 0 exponentially fast.

Note that P(FS_t) ≈ C exp(−βt) implies that log P(FS_t) should be approximately linear in t with slope −β. As t increases, the performance of the OCBA policy becomes inferior to that of the WD policy.

We also perform the experiment with four normally distributed systems, but with unequal variances. First, we consider variants of the SC and MDM configurations with increasing variances, i.e., σ = (4, 4.5, 5, 5.5), which we call the SC-INC and MDM-INC configurations, respectively. Similarly, we consider the SC-DEC and MDM-DEC configurations with decreasing variances, i.e., σ = (4, 3.5, 3, 2.5). The simulation results are illustrated in Figure 4.

7.2.2. Student t-distribution. We consider four systems with Student t-distributions, each parameterized by a location parameter µ_j, a scale parameter σ_j, and degrees of freedom ν_j. We again take two configurations, T-SC and T-MDM. In the T-SC configuration, µ = (1, 0.5, 0.5, 0.5), σ = (1, 1, 1, 1), and ν = (2.1, 2.1, 2.1, 2.1). In the T-MDM configuration, µ = (1, 0.5, 0.25, 0.125) with the same σ and ν as in the T-SC configuration. Note that the OCBA and KG policies are developed on the premise that the underlying distributions are normal (light-tailed). Also notice that both the x- and y-axes are on a logarithmic scale in Figure 5. P(FS_t) (for the WD and the other benchmark policies) converges to 0 at a polynomial rate, consistent with Theorem 4. The KG policy performs as well as the WD policy does; however, the OCBA policy becomes less efficient as t increases.

7.2.3. Bernoulli distribution. The four systems have Bernoulli distributions. The distribution of system j is parameterized by a single parameter µ_j ∈ (0, 1), so that its mean and variance are µ_j and µ_j(1 − µ_j), respectively. Note that this distribution becomes more skewed as µ_j approaches 0 or 1. We consider four configurations in this experiment: BER-LARGE-DIFF, BER-SMALL-DIFF, BER-SKEWED-LARGE-DIFF, and BER-SKEWED-SMALL-DIFF. In the first two configurations, the distributions are only mildly skewed, with µ = (0.6, 0.3, 0.3, 0.3) and µ = (0.6, 0.57, 0.57, 0.57), respectively. In the last two configurations, the distributions are significantly skewed, with µ = (0.9, 0.6, 0.6, 0.6) and µ = (0.9, 0.87, 0.87, 0.87), respectively.

[Figure 4: four panels, (a) SC-INC, (b) MDM-INC, (c) SC-DEC, and (d) MDM-DEC; each plots P(FS_t) against the sampling budget t on a log-linear scale for the OCBA, KG, EA, and WD policies.]

Figure 4. Probability of false selection in normal distribution cases with unequal variances, as a function of the sampling budget t, plotted on a log-linear scale. The four sampling policies are: optimal computing budget allocation (OCBA), knowledge gradient (KG), equal allocation (EA), and Welch Divergence (WD). Graphs (a) and (b) are for the SC-INC and MDM-INC configurations with increasing variances, σ = (4, 4.5, 5, 5.5), and graphs (c) and (d) are for the SC-DEC and MDM-DEC configurations with decreasing variances, σ = (4, 3.5, 3, 2.5).

In this experiment, we also consider a variant of the WD policy. Recall that the WD policy solves (19) in each stage to optimize ρ_G(·), the rate function of P(FS_t) when the underlying distributions are normal. Let us define a policy called WD-BER in a similar manner.

[Figure 5: two panels, (a) T-SC and (b) T-MDM; each plots P(FS_t) against the sampling budget t with both axes on a logarithmic scale for the OCBA, KG, EA, and WD policies.]

Figure 5. Probability of false selection in the Student t-distribution case as a function of the sampling budget t. Note that both the x- and y-axes are in log scale. The four sampling policies are: optimal computing budget allocation (OCBA), knowledge gradient (KG), equal allocation (EA), and Welch Divergence (WD). The two graphs are for (a) the T-SC configuration with µ = (1, 0.5, 0.5, 0.5) and (b) the T-MDM configuration with µ = (1, 0.5, 0.25, 0.125).

When the underlying distributions are Bernoulli, it can be seen that the rate function of P(FS_t) under a static policy with allocation α is given by

ρ(α) = min_{j≠1} [ −(α_1 + α_j) log( (1 − µ_1)^{α_1/(α_1+α_j)} (1 − µ_j)^{α_j/(α_1+α_j)} + µ_1^{α_1/(α_1+α_j)} µ_j^{α_j/(α_1+α_j)} ) ].   (43)

The WD-BER policy is then defined so that, in each stage, it solves

max_{α ∈ Δ_{k−1}} min_{j≠b} [ −(α_b + α_j) log( (1 − X̄_{bt})^{α_b/(α_b+α_j)} (1 − X̄_{jt})^{α_j/(α_b+α_j)} + X̄_{bt}^{α_b/(α_b+α_j)} X̄_{jt}^{α_j/(α_b+α_j)} ) ].

The WD-BER policy serves as a hypothetical benchmark, since the knowledge that the underlying distributions are Bernoulli is typically not available to the policy designer. Recall that the WD policy is predicated on a two-moment approximation. Hence, one may expect the WD policy to perform on par with the WD-BER policy when the distributions are less skewed; this can indeed be observed in Figure 6(a) and (b). As the distributions become more skewed, one may expect the WD policy to perform poorly compared to the WD-BER policy, and this can be seen in Figure 6(c). Despite the effect of skewness, the WD policy is seen to perform well when the means of the best and second-best systems are close, as seen in Figure 6(d); this is consistent with Proposition 4.

[Figure 6: four panels, (a) BER-LARGE-DIFF, (b) BER-SMALL-DIFF, (c) BER-SKEWED-LARGE-DIFF, and (d) BER-SKEWED-SMALL-DIFF; each plots P(FS_t) against the sampling budget t on a log-linear scale for the OCBA, KG, EA, WD, and WD-BER policies.]

Figure 6. Probability of false selection in the Bernoulli distribution case as a function of the sampling budget t, plotted on a log-linear scale. The five sampling policies are: optimal computing budget allocation (OCBA), knowledge gradient (KG), equal allocation (EA), Welch Divergence (WD), and a Bernoulli variant of the WD policy (WD-BER). The four configurations are (a) BER-LARGE-DIFF with µ = (0.6, 0.3, 0.3, 0.3), (b) BER-SMALL-DIFF with µ = (0.6, 0.57, 0.57, 0.57), (c) BER-SKEWED-LARGE-DIFF with µ = (0.9, 0.6, 0.6, 0.6), and (d) BER-SKEWED-SMALL-DIFF with µ = (0.9, 0.87, 0.87, 0.87).

7.2.4. Large number of systems. We consider larger problem spaces with 16 and 32 normally distributed systems, referred to as MDM-16sys and MDM-32sys, respectively. In the MDM-16sys configuration, µ_j = 0.5^{j−1} for j = 1, 2, ..., 16. In the MDM-32sys configuration, µ_j = 0.5^{j−1} for j = 1, 2, ..., 32. For both configurations we set σ = (4, ..., 4). The simulation results are depicted in Figure 7, in which the WD policy outperforms the other policies.

[Figure 7: two panels, (a) MDM-16sys and (b) MDM-32sys; each plots P(FS_t) against the sampling budget t on a log-linear scale for the OCBA, KG, EA, and WD policies.]

Figure 7. Probability of false selection in the normal distribution case with a large number of systems, plotted as a function of the sampling budget t on a log-linear scale. The four sampling policies are: optimal computing budget allocation (OCBA), knowledge gradient (KG), equal allocation (EA), and Welch Divergence (WD).

8. Concluding Remark

In this paper we propose a tractable dynamic sampling policy for the problem of ordinal optimization. It requires a parameter, κ, that constrains the minimum sampling rate; how to adaptively choose κ to further improve the efficiency of the policy is an interesting problem for future work. Also, while we primarily focused on the probability of false selection as a performance criterion for sampling policies, the methodology developed in this paper can be extended to at least two other criteria. First, it can be applied to the problem of selecting the m (> 1) best systems out of k. If P(FS_t) is defined as the probability of falsely selecting one of the (k − m) inferior systems, one can track the behavior of P(FS_t) for large t using the relationship (22), for which the explicit form of the rate function ρ(·) can easily be established using large deviations theory. A different criterion, E[µ_1 − µ_{b_t}], penalizes an incorrect choice based on the performance discrepancy. Although it is not analytically tractable for finite t, we note that it is possible to approximate the asymptotic behavior of this criterion using the relationship (22).


More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Week 12. Testing and Kullback-Leibler Divergence 1. Likelihood Ratios Let 1, 2, 2,...

More information

Lecture 4 January 23

Lecture 4 January 23 STAT 263/363: Experimental Design Winter 2016/17 Lecture 4 January 23 Lecturer: Art B. Owen Scribe: Zachary del Rosario 4.1 Bandits Bandits are a form of online (adaptive) experiments; i.e. samples are

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Machine Learning Linear Regression. Prof. Matteo Matteucci

Machine Learning Linear Regression. Prof. Matteo Matteucci Machine Learning Linear Regression Prof. Matteo Matteucci Outline 2 o Simple Linear Regression Model Least Squares Fit Measures of Fit Inference in Regression o Multi Variate Regession Model Least Squares

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Guesswork Subject to a Total Entropy Budget

Guesswork Subject to a Total Entropy Budget Guesswork Subject to a Total Entropy Budget Arman Rezaee, Ahmad Beirami, Ali Makhdoumi, Muriel Médard, and Ken Duffy arxiv:72.09082v [cs.it] 25 Dec 207 Abstract We consider an abstraction of computational

More information

CCP Estimation. Robert A. Miller. March Dynamic Discrete Choice. Miller (Dynamic Discrete Choice) cemmap 6 March / 27

CCP Estimation. Robert A. Miller. March Dynamic Discrete Choice. Miller (Dynamic Discrete Choice) cemmap 6 March / 27 CCP Estimation Robert A. Miller Dynamic Discrete Choice March 2018 Miller Dynamic Discrete Choice) cemmap 6 March 2018 1 / 27 Criteria for Evaluating Estimators General principles to apply when assessing

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

NEW RESULTS ON PROCEDURES THAT SELECT THE BEST SYSTEM USING CRN. Stephen E. Chick Koichiro Inoue

NEW RESULTS ON PROCEDURES THAT SELECT THE BEST SYSTEM USING CRN. Stephen E. Chick Koichiro Inoue Proceedings of the 2000 Winter Simulation Conference J. A. Joines, R. R. Barton, K. Kang, and P. A. Fishwick, eds. NEW RESULTS ON PROCEDURES THAT SELECT THE BEST SYSTEM USING CRN Stephen E. Chick Koichiro

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

The information complexity of sequential resource allocation

The information complexity of sequential resource allocation The information complexity of sequential resource allocation Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishan SMILE Seminar, ENS, June 8th, 205 Sequential allocation

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Gradient-Based Learning. Sargur N. Srihari

Gradient-Based Learning. Sargur N. Srihari Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation

More information

A New Dynamic Programming Decomposition Method for the Network Revenue Management Problem with Customer Choice Behavior

A New Dynamic Programming Decomposition Method for the Network Revenue Management Problem with Customer Choice Behavior A New Dynamic Programming Decomposition Method for the Network Revenue Management Problem with Customer Choice Behavior Sumit Kunnumkal Indian School of Business, Gachibowli, Hyderabad, 500032, India sumit

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Quantifying Stochastic Model Errors via Robust Optimization

Quantifying Stochastic Model Errors via Robust Optimization Quantifying Stochastic Model Errors via Robust Optimization IPAM Workshop on Uncertainty Quantification for Multiscale Stochastic Systems and Applications Jan 19, 2016 Henry Lam Industrial & Operations

More information

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

CROSS-VALIDATION OF CONTROLLED DYNAMIC MODELS: BAYESIAN APPROACH

CROSS-VALIDATION OF CONTROLLED DYNAMIC MODELS: BAYESIAN APPROACH CROSS-VALIDATION OF CONTROLLED DYNAMIC MODELS: BAYESIAN APPROACH Miroslav Kárný, Petr Nedoma,Václav Šmídl Institute of Information Theory and Automation AV ČR, P.O.Box 18, 180 00 Praha 8, Czech Republic

More information

Unbounded Regions of Infinitely Logconcave Sequences

Unbounded Regions of Infinitely Logconcave Sequences The University of San Francisco USF Scholarship: a digital repository @ Gleeson Library Geschke Center Mathematics College of Arts and Sciences 007 Unbounded Regions of Infinitely Logconcave Sequences

More information

FINDING THE BEST IN THE PRESENCE OF A STOCHASTIC CONSTRAINT. Sigrún Andradóttir David Goldsman Seong-Hee Kim

FINDING THE BEST IN THE PRESENCE OF A STOCHASTIC CONSTRAINT. Sigrún Andradóttir David Goldsman Seong-Hee Kim Proceedings of the 2005 Winter Simulation Conference M. E. Kuhl, N. M. Steiger, F. B. Armstrong, and J. A. Joines, eds. FINDING THE BEST IN THE PRESENCE OF A STOCHASTIC CONSTRAINT Sigrún Andradóttir David

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Bounded Infinite Sequences/Functions : Orders of Infinity

Bounded Infinite Sequences/Functions : Orders of Infinity Bounded Infinite Sequences/Functions : Orders of Infinity by Garimella Ramamurthy Report No: IIIT/TR/2009/247 Centre for Security, Theory and Algorithms International Institute of Information Technology

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Expectation Propagation Algorithm

Expectation Propagation Algorithm Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,

More information

2.1.3 The Testing Problem and Neave s Step Method

2.1.3 The Testing Problem and Neave s Step Method we can guarantee (1) that the (unknown) true parameter vector θ t Θ is an interior point of Θ, and (2) that ρ θt (R) > 0 for any R 2 Q. These are two of Birch s regularity conditions that were critical

More information

Rate of Convergence of Learning in Social Networks

Rate of Convergence of Learning in Social Networks Rate of Convergence of Learning in Social Networks Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar Massachusetts Institute of Technology Cambridge, MA Abstract We study the rate of convergence

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Performance Measures for Ranking and Selection Procedures

Performance Measures for Ranking and Selection Procedures Rolf Waeber Performance Measures for Ranking and Selection Procedures 1/23 Performance Measures for Ranking and Selection Procedures Rolf Waeber Peter I. Frazier Shane G. Henderson Operations Research

More information

Bayesian Modeling and Classification of Neural Signals

Bayesian Modeling and Classification of Neural Signals Bayesian Modeling and Classification of Neural Signals Michael S. Lewicki Computation and Neural Systems Program California Institute of Technology 216-76 Pasadena, CA 91125 lewickiocns.caltech.edu Abstract

More information

On Bayesian bandit algorithms

On Bayesian bandit algorithms On Bayesian bandit algorithms Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier, Nathaniel Korda and Rémi Munos July 1st, 2012 Emilie Kaufmann (Telecom ParisTech) On Bayesian bandit algorithms

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

Asymptotics for Polling Models with Limited Service Policies

Asymptotics for Polling Models with Limited Service Policies Asymptotics for Polling Models with Limited Service Policies Woojin Chang School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, GA 30332-0205 USA Douglas G. Down Department

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Inference with few assumptions: Wasserman s example

Inference with few assumptions: Wasserman s example Inference with few assumptions: Wasserman s example Christopher A. Sims Princeton University sims@princeton.edu October 27, 2007 Types of assumption-free inference A simple procedure or set of statistics

More information

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang. Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding

More information

Decoupling Coupled Constraints Through Utility Design

Decoupling Coupled Constraints Through Utility Design 1 Decoupling Coupled Constraints Through Utility Design Na Li and Jason R. Marden Abstract The central goal in multiagent systems is to design local control laws for the individual agents to ensure that

More information

Asymptotic distribution of the sample average value-at-risk

Asymptotic distribution of the sample average value-at-risk Asymptotic distribution of the sample average value-at-risk Stoyan V. Stoyanov Svetlozar T. Rachev September 3, 7 Abstract In this paper, we prove a result for the asymptotic distribution of the sample

More information

Classical regularity conditions

Classical regularity conditions Chapter 3 Classical regularity conditions Preliminary draft. Please do not distribute. The results from classical asymptotic theory typically require assumptions of pointwise differentiability of a criterion

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Common Knowledge and Sequential Team Problems

Common Knowledge and Sequential Team Problems Common Knowledge and Sequential Team Problems Authors: Ashutosh Nayyar and Demosthenis Teneketzis Computer Engineering Technical Report Number CENG-2018-02 Ming Hsieh Department of Electrical Engineering

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information

Fast Angular Synchronization for Phase Retrieval via Incomplete Information Fast Angular Synchronization for Phase Retrieval via Incomplete Information Aditya Viswanathan a and Mark Iwen b a Department of Mathematics, Michigan State University; b Department of Mathematics & Department

More information

The Expected Opportunity Cost and Selecting the Optimal Subset

The Expected Opportunity Cost and Selecting the Optimal Subset Applied Mathematical Sciences, Vol. 9, 2015, no. 131, 6507-6519 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ams.2015.58561 The Expected Opportunity Cost and Selecting the Optimal Subset Mohammad

More information

Worst case analysis for a general class of on-line lot-sizing heuristics

Worst case analysis for a general class of on-line lot-sizing heuristics Worst case analysis for a general class of on-line lot-sizing heuristics Wilco van den Heuvel a, Albert P.M. Wagelmans a a Econometric Institute and Erasmus Research Institute of Management, Erasmus University

More information

Volatility. Gerald P. Dwyer. February Clemson University

Volatility. Gerald P. Dwyer. February Clemson University Volatility Gerald P. Dwyer Clemson University February 2016 Outline 1 Volatility Characteristics of Time Series Heteroskedasticity Simpler Estimation Strategies Exponentially Weighted Moving Average Use

More information

Online Companion: Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures

Online Companion: Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures Online Companion: Risk-Averse Approximate Dynamic Programming with Quantile-Based Risk Measures Daniel R. Jiang and Warren B. Powell Abstract In this online companion, we provide some additional preliminary

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Collapsed Variational Inference for Sum-Product Networks

Collapsed Variational Inference for Sum-Product Networks for Sum-Product Networks Han Zhao 1, Tameem Adel 2, Geoff Gordon 1, Brandon Amos 1 Presented by: Han Zhao Carnegie Mellon University 1, University of Amsterdam 2 June. 20th, 2016 1 / 26 Outline Background

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery

IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER On the Performance of Sparse Recovery IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 11, NOVEMBER 2011 7255 On the Performance of Sparse Recovery Via `p-minimization (0 p 1) Meng Wang, Student Member, IEEE, Weiyu Xu, and Ao Tang, Senior

More information

Prioritized Sweeping Converges to the Optimal Value Function

Prioritized Sweeping Converges to the Optimal Value Function Technical Report DCS-TR-631 Prioritized Sweeping Converges to the Optimal Value Function Lihong Li and Michael L. Littman {lihong,mlittman}@cs.rutgers.edu RL 3 Laboratory Department of Computer Science

More information

Bios 6649: Clinical Trials - Statistical Design and Monitoring

Bios 6649: Clinical Trials - Statistical Design and Monitoring Bios 6649: Clinical Trials - Statistical Design and Monitoring Spring Semester 2015 John M. Kittelson Department of Biostatistics & Informatics Colorado School of Public Health University of Colorado Denver

More information

Tests of the Co-integration Rank in VAR Models in the Presence of a Possible Break in Trend at an Unknown Point

Tests of the Co-integration Rank in VAR Models in the Presence of a Possible Break in Trend at an Unknown Point Tests of the Co-integration Rank in VAR Models in the Presence of a Possible Break in Trend at an Unknown Point David Harris, Steve Leybourne, Robert Taylor Monash U., U. of Nottingam, U. of Essex Economics

More information

Foundations of Nonparametric Bayesian Methods

Foundations of Nonparametric Bayesian Methods 1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information