arxiv: v1 [cs.lg] 14 Nov 2018 Abstract

Sample complexity of partition identification using multi-armed bandits

Sandeep Juneja, Subhashini Krishnasamy
TIFR, Mumbai
November 15, 2018

Abstract

Given a vector of probability distributions, or arms, each of which can be sampled independently, we consider the problem of identifying the partition to which this vector belongs from a finitely partitioned universe of such vectors of distributions. We study this as a pure exploration problem in multi-armed bandit settings and develop sample complexity bounds on the total mean number of samples required for identifying the correct partition with high probability. This framework subsumes well studied problems in the literature such as finding the best arm or the best few arms. We consider distributions belonging to the single parameter exponential family and primarily consider partitions where the vector of means of arms lies either in a given set or its complement. The sets considered correspond to distributions where there exists a mean above a specified threshold, where the set is a half space, and where either the set or its complement is convex. When the set is convex we restrict our analysis to its complement being a union of half spaces. In all these settings, we characterize the lower bounds on the mean number of samples for each arm. Further, inspired by the lower bounds, and building upon Garivier and Kaufmann (2016), we propose algorithms that can match these bounds asymptotically with decreasing probability of error. Applications of this framework may be diverse. We briefly discuss a few associated with finance.

1 Introduction

Suppose that $\Omega$ denotes a collection of vectors $\nu = (\nu_1, \ldots, \nu_K)$ where each $\nu_i$ is a probability distribution on $\mathbb{R}$. Further, $\Omega = \cup_{i=1}^m A_i$ where the component sets $A_i$ are disjoint, and thus partition $\Omega$. In this set-up, given $\mu = (\mu_1, \ldots, \mu_K) \in \Omega$, we consider the problem of identifying the correct component $A_i$ that contains $\mu$.
The distributions $(\mu_i : i \le K)$ are not known to us; however, it is possible to generate independent samples from each $\mu_i$. We call this the partition identification or PI problem. In the multi-armed bandit literature, for any $\nu \in \Omega$, generating a sample from distribution $\nu_i$ is referred to as sampling from, or pulling, arm $i$. We also follow this convention. We consider algorithms that sequentially and adaptively generate samples from each arm in $\mu$ and then, after generating finitely many samples, stop to announce a component of $\Omega$ that is inferred to contain $\mu$. We study the so-called $\delta$-PAC algorithms in the PI framework. As is well known, PAC stands for probably approximately correct.

Definition 1. For any $\delta \in (0, 1)$, an algorithm is said to be $\delta$-PAC for the PI problem $\Omega = \cup_{i=1}^m A_i$ if, for every $\mu \in \Omega$, it restricts the probability of announcing an incorrect component to at most $\delta$.

More generally, in similar sequential decision making problems, algorithms are said to provide $\delta$-PAC guarantees if the probability of an incorrect decision is bounded from above by $\delta$ for each $\delta \in (0, 1)$. The PI framework is quite general and captures popular pure exploration problems studied in the multi-armed bandit literature. For instance, finding the best arm, that is, the arm with the highest mean, with the above $\delta$-PAC type guarantees, is well studied in the literature and fits the PI framework (see, e.g., in learning theory: Garivier and Kaufmann (2016), Kaufmann et al. (2016), Russo (2016), Jamieson et al. (2014), Bubeck et al. (2011), Audibert and Bubeck (2010), Even-Dar et al. (2006), Mannor and Tsitsiklis (2004); in the earlier statistics literature: Jennison et al. (1982), Bechhofer et al. (1968), Paulson et al. (1964), Chernoff (1959); in the simulation theory literature: Glynn and Juneja (2004), Kim and Nelson (2001), Chen et al. (2000), Dai (1996), Ho et al. (1992)). More generally, identifying the $m$ arms (for some $m < K$) with the largest $m$ means amongst $K$ distributions is also a PI problem (see, e.g., Kaufmann and Kalyanakrishnan (2013), Kalyanakrishnan et al. (2012)). The advantage of the PI framework is that it provides a unified approach to tackle a large class of problems, both in developing lower bounds on the sample complexity (expected total number of arm pulls) needed to achieve $\delta$-PAC guarantees (we call this the lower bound problem), and in arriving at algorithms that match the developed lower bounds, under certain distributional restrictions. Analysis of the lower bound problem relies on a fundamental inequality developed by Garivier and Kaufmann (2016). Their work in turn builds upon earlier analysis that goes back at least to Lai and Robbins (1985) (also see Mannor and Tsitsiklis (2004), Burnetas and Katehakis (1996)).
This inequality allows us to formulate the lower bound problem as an optimization problem: a linear program with infinitely many constraints, as well as an equivalent min-max formulation. To further analyze this optimization problem, some distributional restrictions are needed (see Glynn and Juneja (2018) for the necessity of distributional restrictions). As is customary in the learning theory literature (see, e.g., Cappé et al. (2013), Garivier and Kaufmann (2016), Kaufmann et al. (2016)), we assume that each arm distribution belongs to a single parameter exponential family. Examples include the Binomial, Poisson, and Gaussian with known variance distributions. See Cappé et al. (2013) for an elaborate discussion on such distributions. Any member of a single parameter exponential family can be represented by its corresponding parameter (say its mean), which allows us to consider the partition problem in the parameter space (i.e., $\Omega \subset \mathbb{R}^K$) instead of the distribution space. In our analysis, we solve the lower bound problem in the following settings that fit the general framework:

Given a $\mu \in \Omega$ and a threshold $u$, determining if $\max_{i \le K} \mu_i > u$. We refer to this as the threshold crossing problem. We briefly discuss how this problem arises naturally in financial applications involving nested simulations.

Ascertaining if $\mu \in \Omega$ lies in the half space $\{\nu \in \mathbb{R}^K : \sum_{i=1}^K a_i \nu_i > b\}$ for $(a_1, \ldots, a_K, b) \in \mathbb{R}^{K+1}$. This is a generic problem with potentially many applications. One application is in the capital budgeting problem faced by a financial manager, where the manager needs to ascertain if the expected profitability of a given set of projects exceeds an acceptable threshold. While the expected cash flow from a project may not be known in closed form, independent, identically distributed samples of cash flow from each project can be generated via simulation.

More generally, ascertaining if $\mu \in \Omega$ lies in a convex set or in the complement of a convex set. When considering the complement of a convex set, we restrict our analysis to a simpler setting where this set is a union of half spaces. Here, under further simplifying assumptions, we highlight some of the key features of the solution.

Garivier and Kaufmann (2016) solve an equivalent optimization problem in the best arm setting. They further use the solution to arrive at an adaptive $\delta$-PAC algorithm whose sample complexity asymptotically matches the lower bound (as $\delta \to 0$). We note that their algorithm can also be adapted to the problems we consider to again arrive at an adaptive $\delta$-PAC algorithm whose sample complexity asymptotically matches the corresponding lower bound.

The rest of the paper is organized as follows: In Section 2, we state the lower bound inequality from Kaufmann et al. (2016) and state the resultant lower bound problem in our framework as an optimization problem. We also spell out preliminaries such as the single parameter exponential family distributions and related assumptions in this section. In Section 3, we characterize the solution to the lower bound problem for various special cases of partition of $\Omega$ into sets $A$ and $A^c$ (the complement of set $A$). For the threshold crossing problem (Section 3.1), we give a closed form expression for the solution to the lower bound problem and discuss its applications to a specific problem in finance. For the half-space problem (Section 3.2), we give a simple characterization of the solution that is useful in designing the sampling rule in our $\delta$-PAC algorithm. Similarly, for the problem where $\Omega$ is partitioned into a convex set and its complement, we derive some useful properties of the solution to the lower bound problem (Sections 3.3 and 3.4). In Section 4, we propose a $\delta$-PAC algorithm that in substantial generality achieves the derived lower bounds asymptotically as $\delta$ decreases to zero.
2 Preliminaries and basic optimization problem

Recall that $\Omega$ denotes a collection of vectors $\nu = (\nu_1, \ldots, \nu_K)$ where each $\nu_i$ is a probability distribution on $\mathbb{R}$. Further, $\Omega = \cup_{i=1}^m A_i$ where the $A_i$ are disjoint, and thus partition $\Omega$. Let
\[ \mathrm{KL}(\mu_i \| \nu_i) = \int \log\left(\frac{d\mu_i}{d\nu_i}(x)\right) d\mu_i(x) \]
denote the Kullback-Leibler divergence between distributions $\mu_i$ and $\nu_i$. We further assume that for each $\nu, \tilde{\nu} \in \Omega$, the components $\nu_i$ and $\tilde{\nu}_i$ for each $i$ are mutually absolutely continuous and the expectation $\mathrm{KL}(\nu_i \| \tilde{\nu}_i)$ exists (it may be infinite). For $p, q \in (0, 1)$, let
\[ d(p, q) := p \log\left(\frac{p}{q}\right) + (1 - p) \log\left(\frac{1 - p}{1 - q}\right), \]
that is, $d(p, q)$ denotes the KL-divergence between Bernoulli distributions with means $p$ and $q$, respectively. For any set $B$, let $B^c$ denote its complement, $B^o$ its interior, $\bar{B}$ its closure and $\partial B$ its boundary.

Under a $\delta$-PAC algorithm, and for $\mu \in A_j$, it follows from Kaufmann et al. (2016) that
\[ \sum_{i=1}^K E_\mu N_i \, \mathrm{KL}(\mu_i \| \nu_i) \ge d(\delta, 1 - \delta) \ge \log\left(\frac{1}{2.4\delta}\right) \tag{1} \]
for any $\nu \in A_j^c$, where $N_i$ denotes the number of times arm $i$ is pulled by the algorithm. The assumption that $\mathrm{KL}(\nu_i \| \tilde{\nu}_i)$ exists allows the use of Wald's Lemma in the proof of (1) in Kaufmann et al. (2016). It is easy to see that $d(\delta, 1 - \delta) \sim \log \delta^{-1}$ as $\delta \to 0$. Taking $t_i = E_\mu N_i / \log(\frac{1}{2.4\delta})$, our lower bound problem can be modelled as the following convex programming problem, when $\mu \in A_j$:
\[ \min_{t = (t_1, \ldots, t_K)} \sum_{i=1}^K t_i \tag{2} \]
\[ \text{s.t.} \quad \inf_{\nu \in A_j^c} \sum_{i=1}^K t_i \, \mathrm{KL}(\mu_i \| \nu_i) \ge 1, \qquad t_i \ge 0 \ \ \forall i. \tag{3} \]
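The two constants on the right-hand side of (1) are easy to evaluate numerically. The following is a minimal Python sketch, not part of the original analysis; the function names `bernoulli_kl` and `lower_bound_constants` are illustrative choices. It computes $d(\delta, 1-\delta)$ and $\log(1/(2.4\delta))$ for a few values of $\delta$ and prints the ratio $d(\delta, 1-\delta)/\log \delta^{-1}$, which should approach one as $\delta$ shrinks.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence d(p, q) between Bernoulli(p) and Bernoulli(q), for p, q in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lower_bound_constants(delta: float) -> tuple[float, float]:
    """Return the pair (d(delta, 1 - delta), log(1 / (2.4 * delta))) appearing in (1)."""
    return bernoulli_kl(delta, 1 - delta), math.log(1 / (2.4 * delta))


if __name__ == "__main__":
    for delta in (0.2, 0.1, 0.01, 1e-4, 1e-8):
        d, simple = lower_bound_constants(delta)
        ratio = d / math.log(1 / delta)
        print(f"delta={delta:g}  d={d:.4f}  log(1/(2.4 delta))={simple:.4f}  ratio={ratio:.4f}")
```

The simpler constant $\log(1/(2.4\delta))$ is the one used to normalize $t_i$ in (2) and (3).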

Observe that, by making the change of variables
\[ \frac{t_i}{\sum_j t_j} = w_i \quad \text{and} \quad \frac{1}{\sum_j t_j} = \Lambda, \]
and letting $P_K \triangleq \{w \in \mathbb{R}^K : w_i \ge 0 \ \forall i, \ \sum_{i=1}^K w_i = 1\}$ denote the $K$-dimensional probability simplex, our optimization problem may be equivalently stated as
\[ \max_{w \in P_K, \, \Lambda} \Lambda \quad \text{s.t.} \quad \inf_{\nu \in A_j^c} \sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i) \ge \Lambda. \]
The above problem is in turn equivalent to
\[ \max_{w \in P_K} \inf_{\nu \in A_j^c} \sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i). \tag{Problem LB} \]
Let $C^*(\mu)$ be the optimal value of the above problem. The lower bound on the total expected number of samples is then given by $\log(\frac{1}{2.4\delta}) \, T^*(\mu)$ where $T^*(\mu) = 1/C^*(\mu)$.

Remark 1. As mentioned earlier, the optimization problem in (2) and (3) is equivalent to Problem LB. One advantage of the former is that it can be viewed as a linear program with infinitely many constraints, or a semi-infinite linear program (see, e.g., López and Still (2007)). Linear programming duality then provides a great deal of insight into the solution structure for the problems that we consider. However, we instead present our analysis on Problem LB, since Sion's minimax theorem can be applied to it to arrive directly at the solution.

Remark 2. For any $w \in P_K$, the sub-problem in LB, $\inf_{\nu \in A_j^c} \sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i)$, has an elegant geometrical interpretation. For $c > 0$, consider the sublevel set
\[ S(\mu, w, c) \triangleq \left\{ \nu : \sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i) \le c \right\}. \]
Then, for element-wise strictly positive $w$, $S(\mu, w, 0) = \{\mu\}$. For some $c > 0$, the set $S(\mu, w, c)$ intersects $A_j^c$. Further, the set shrinks as $c$ reduces. We are looking for the smallest $c$ for which $S(\mu, w, c)$ has a non-empty intersection with $\overline{A_j^c}$. Equivalently, we are looking for the first $c > 0$ for which the set grows beyond the interior of $A_j$ and intersects $\overline{A_j^c}$. Thus,
\[ \inf_{\nu \in \overline{A_j^c}} \sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i) = \inf\{c : S(\mu, w, c) \cap \overline{A_j^c} \ne \emptyset\}. \]
Garivier and Kaufmann (2016) considered Problem LB in the best arm setting. In Section 3, we discuss how the lower bound problem simplifies in other specific settings. In Section 4, we analyze an asymptotically optimal $\delta$-PAC algorithm in the general PI setting.
2.1 Single Parameter Exponential Families

In the rest of the paper, we consider a single parameter exponential family (SPEF) of distributions for each arm. For each $1 \le i \le K$, let $\rho_i$ denote a reference measure on the real line, and let
\[ \Lambda_i(\eta) \triangleq \log\left(\int_{x \in \mathbb{R}} \exp(\eta x) \, d\rho_i(x)\right). \]

$\Lambda_i$ is referred to as the cumulant or the log-partition function. Further, set $D_i \triangleq \{\eta : \Lambda_i(\eta) < \infty\}$. An SPEF distribution for arm $i$ and $\eta \in D_i$, $p_{i,\eta}$, has the form
\[ dp_{i,\eta}(x) = \exp(\eta x - \Lambda_i(\eta)) \, d\rho_i(x). \]
Note that $\Lambda_i$ is $C^\infty$ in $D_i^o$ (see, e.g., Dembo and Zeitouni (2011)). Further, $\Lambda_i(\eta)$ is a convex function of $\eta \in D_i^o$, and if the underlying distribution is non-degenerate, then it is strictly convex. Let $\Lambda_i^*$ denote the Legendre-Fenchel transform of $\Lambda_i$, that is,
\[ \Lambda_i^*(\theta) = \sup_{\eta \in D_i} (\eta\theta - \Lambda_i(\eta)). \]
Further, let $\mu_i$ denote the mean under $p_{i,\eta_i}$. Then, $\mu_i = \Lambda_i'(\eta_i)$ for $\eta_i \in D_i^o$. In particular, $\mu_i$ is a strictly increasing function of $\eta_i$, and there is a one-to-one mapping between the two. Below we suppress the notational dependence of $\mu_i$ on $\eta_i$ and vice-versa. Let $U_i \triangleq \{\Lambda_i'(\eta_i) : \eta_i \in D_i^o\}$. Since $\Lambda_i'(\eta_i)$ is strictly increasing for $\eta_i \in D_i^o$, $U_i$ is an open interval and, sans the boundary cases, denotes the values of means attainable for arm $i$. For $\eta_i \in D_i^o$, the following are well known and easily checked:
\[ \eta_i = \Lambda_i^{*\prime}(\mu_i), \qquad \Lambda_i^*(\mu_i) + \Lambda_i(\eta_i) = \mu_i \eta_i. \tag{4} \]
For $\eta_i, \beta_i \in D_i^o$, it is easily seen that
\[ \mathrm{KL}(p_{i,\eta_i} \| p_{i,\beta_i}) = \Lambda_i(\beta_i) - \Lambda_i(\eta_i) - \mu_i(\beta_i - \eta_i), \]
where again $\mu_i = \Lambda_i'(\eta_i)$. We denote the above by $K_i(\mu_i \mid \nu_i)$ with $\nu_i = \Lambda_i'(\beta_i)$, emphasizing that when the two distributions are from the same SPEF, the Kullback-Leibler divergence depends only on the mean values of the distributions. Using (4), we have
\[ K_i(\mu_i \mid \nu_i) = \Lambda_i^*(\mu_i) - \Lambda_i^*(\nu_i) - \beta_i(\mu_i - \nu_i), \tag{5} \]
where $\beta_i = \Lambda_i^{*\prime}(\nu_i)$. Again, it can be shown that $\Lambda_i^*$ is $C^\infty$ in $U_i$ (see Dembo and Zeitouni (2011)), and it is strictly convex if $\Lambda_i$ is strictly convex. Thus, $K_i$ is $C^\infty$ in $U_i$ with respect to each of its arguments. In the rest of the paper, Problem LB refers to
\[ \max_{w \in P_K} \inf_{\nu \in A_j^c} \sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i), \tag{6} \]
with $\mu, \nu$ taking values in $\mathbb{R}^K$ and $A_j^c$ a subset of $\mathbb{R}^K$.
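Formula (5) can be checked on familiar families. The sketch below is an illustration only: the helper names and the choice of reference measures are assumptions made here, not taken from the text. It evaluates (5) from $\Lambda_i^*$ and its derivative for the Poisson family (reference measure $\rho(\{x\}) = 1/x!$, so $\Lambda(\eta) = e^\eta$ and $\Lambda^*(\theta) = \theta\log\theta - \theta$) and for the Gaussian family with known variance $\sigma^2$ (reference measure $N(0, \sigma^2)$, so $\Lambda(\eta) = \sigma^2\eta^2/2$ and $\Lambda^*(\theta) = \theta^2/(2\sigma^2)$), and compares the result with the textbook KL expressions.

```python
import math
from typing import Callable


def kl_via_legendre(mu: float, nu: float,
                    lam_star: Callable[[float], float],
                    lam_star_prime: Callable[[float], float]) -> float:
    """Evaluate (5): Lambda*(mu) - Lambda*(nu) - Lambda*'(nu) * (mu - nu)."""
    return lam_star(mu) - lam_star(nu) - lam_star_prime(nu) * (mu - nu)


def poisson_kl(mu: float, nu: float) -> float:
    """K(mu | nu) for Poisson means mu, nu > 0, using Lambda*(t) = t log t - t."""
    return kl_via_legendre(mu, nu, lambda t: t * math.log(t) - t, math.log)


def gaussian_kl(mu: float, nu: float, sigma2: float = 1.0) -> float:
    """K(mu | nu) for Gaussian means with known variance sigma2, using Lambda*(t) = t^2 / (2 sigma2)."""
    return kl_via_legendre(mu, nu, lambda t: t * t / (2 * sigma2), lambda t: t / sigma2)


if __name__ == "__main__":
    mu, nu = 3.0, 1.5
    # Textbook closed forms, for comparison with the values obtained from (5).
    print(poisson_kl(mu, nu), mu * math.log(mu / nu) - mu + nu)
    print(gaussian_kl(mu, nu, 2.0), (mu - nu) ** 2 / (2 * 2.0))
```

In both cases the divergence depends on the two distributions only through their means, as noted above.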

2.1.1 Conditions on KL-Divergence

Since $\Lambda_i^*$ is a convex function, we have that $K_i$ is convex in its first argument. Since $K_i(\mu_i \mid \nu_i)$ decreases with $\nu_i$ for $\nu_i \le \mu_i$, and increases with $\nu_i$ for $\nu_i \ge \mu_i$, it is a quasi-convex function of $\nu_i$.

Remark 3. For many known SPEFs, for instance Bernoulli, Poisson and Gaussian with known variance, the KL-divergence is also strictly convex in the second argument. But there are also SPEFs for which it is not convex in the second argument, for example Rayleigh, centered Laplacian and negative Binomial (with the number of failures fixed).

Problem LB is easier to analyze when the sublevel sets of $\sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \nu_i)$ are convex, i.e., when $\sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \cdot)$ is a quasi-convex function. But it is known that a sum of quasi-convex functions need not be quasi-convex. Further, it is also known (see Yaari (1977)) that if $u$ is a quasi-convex real function defined on $\mathbb{R}^n$ and is of the form
\[ u(x_1, x_2, \ldots, x_n) = u_1(x_1) + u_2(x_2) + \cdots + u_n(x_n), \]
where $u_i$, $1 \le i \le n$, are real continuous functions whose domains are intervals on the real line, then at least $n - 1$ of the functions $u_i$, $1 \le i \le n$, are convex. We therefore restrict ourselves only to those SPEFs whose KL-divergence is strictly convex in its second argument.

Assumption 1. For each $i$, $D_i^o$ is non-empty and $\Lambda_i(\eta_i)$ is strictly convex for $\eta_i \in D_i^o$. Further, for any $\mu_i \in U_i$, $K_i(\mu_i \mid \nu_i)$ is a strictly convex function of $\nu_i \in U_i$.

Under this assumption, the function $\sum_{i=1}^K w_i \, \mathrm{KL}(\mu_i \| \cdot)$ satisfies strict convexity. In addition, the following assumption considerably simplifies our analysis.

Assumption 2. For any $\mu_i \in U_i$, $K_i(\mu_i \mid \nu_i) \to \infty$ as $\nu_i \to \partial U_i$ with $\nu_i$ taking values in $U_i$.

3 Lower bounds for some PI problems

3.1 Threshold crossing problem

Let $U = \times_{i=1}^K U_i$. Consider $\Omega = A_1 \cup A_2$ where $A_1 = \{\mu \in U : \max_{i \le K} \mu_i > u\}$ and $A_2 = \{\mu \in U : \max_{i \le K} \mu_i < u\}$. In this section, we consider the associated lower bound problem for $\mu \in \Omega$.
We first discuss how the threshold crossing problem arises naturally in nested simulation used in financial portfolio risk measurement.

Example 1. Consider the problem of measuring tail risk in a portfolio comprising financial derivatives. The key property of a financial derivative is that, as a function of underlying stock prices or other financial instruments, its value is a conditional expectation (see, e.g., Duffie (2010), Shreve (2004)). Thus, the value of a portfolio of financial securities that contains financial derivatives can also be expressed as a conditional expectation given the value of the underlying financial instruments. Suppose that $(X_1, \ldots, X_K)$, where each $X_t$ is a vector in a Euclidean space, denote the macroeconomic variables and financial instruments at time $t$, such as prevailing interest rates, stock index value and stock prices, on which the value of a portfolio depends. For notational convenience we have assumed that times take integer values.

The portfolio loss amount at any time $t$ is a function of $\mathbf{X}_t \triangleq (X_1, \ldots, X_t)$ and is given by $E(Y_t \mid \mathbf{X}_t)$ for some random variable $Y_t$ (see, e.g., Gordy and Juneja (2010), Broadie et al. (2011) for further discussion on portfolio loss as a conditional expectation, and the need for nested simulation). The quantity $E(Y_t \mid \mathbf{X}_t)$ is not known; however, conditional on $\mathbf{X}_t$, independent samples of $Y_t$ can be generated via simulation. Our interest is in estimating the probability that the portfolio loss by time $K$ exceeds a large threshold $u$, or
\[ \gamma \triangleq P\left(\max_{1 \le t \le K} Z_t \ge u\right), \tag{7} \]
where $Z_t = E(Y_t \mid \mathbf{X}_t)$. These probabilities typically do not have a closed form expression and are estimated using Monte Carlo simulation. An algorithm to estimate this probability may be nested and is given as follows:

1. Repeat the outer loop iterations for $1 \le j \le n$.
2. At outer loop iteration $j$, generate through Monte Carlo a sample of underlying factors $(X_{1,j}, \ldots, X_{K,j})$.
3. Given this sample, we need to ascertain whether $\max_{1 \le t \le K} Z_{t,j} \ge u$, where $Z_{t,j} = E(Y_t \mid \mathbf{X}_{t,j})$; let $W_j$ denote the indicator of this event. This fits our framework of the threshold crossing problem, where we may sequentially generate conditionally independent samples of $Y_t$ for each $t$ conditional on $(X_{1,j}, \ldots, X_{t,j})$ and arrive at an indicator $\hat{W}_j$ that equals $W_j$ with probability at least $1 - \delta$.

Then,
\[ \hat{\gamma}_n \triangleq \frac{1}{n} \sum_{j=1}^n \hat{W}_j \]
denotes our estimator for $\gamma$. There are interesting technical issues related to optimally distributing the computational budget in deciding the number of samples in the outer loop, in the inner loop and the value of $\delta$ to be selected. These issues, however, are not addressed in the paper and may be a topic for future research.

Theorem 1 below points to an interesting asymmetry that arises in the lower bound problem associated with threshold crossing as a function of $\mu \in \Omega$.

Theorem 1. Suppose that $(u, \ldots, u) \in U$. Consider $\mu \in A_1$ such that, w.l.o.g., for some $i \ge 1$, $\mu_j > u$ for $j = 1, \ldots, i$, $\mu_j < u$ for $i + 1 \le j \le K$, and $K_1(\mu_1 \mid u) > K_j(\mu_j \mid u)$ for $j = 2, \ldots, i$.
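Then, Problem LB has a unique solution given by
\[ w_1^* = 1, \quad \text{and} \quad w_j^* = 0 \ \text{for} \ j = 2, \ldots, K. \tag{8} \]
The lower bound on the expected total number of samples generated equals
\[ \frac{1}{K_1(\mu_1 \mid u)} \log\left(\frac{1}{2.4\delta}\right). \]
When $\mu \in A_2$, Problem LB has a unique solution given by
\[ w_j^* \propto 1/K_j(\mu_j \mid u), \quad 1 \le j \le K, \tag{9} \]
and the lower bound on the expected total number of samples generated equals
\[ \sum_{j=1}^K \frac{1}{K_j(\mu_j \mid u)} \log\left(\frac{1}{2.4\delta}\right). \]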
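The asymmetry in Theorem 1 is easy to see numerically. The following sketch evaluates (8), (9) and the two lower bounds; it assumes, purely for illustration, Gaussian arms with a common known variance so that $K_j(\mu_j \mid u) = (\mu_j - u)^2/(2\sigma^2)$. The function names are hypothetical, and any other divergence satisfying the assumptions of the theorem can be passed in through the `kl` argument.

```python
import math
from typing import Callable, List, Sequence, Tuple


def gaussian_kl(mu: float, nu: float, sigma2: float = 1.0) -> float:
    """K(mu | nu) for Gaussian arms with known variance sigma2 (an illustrative choice)."""
    return (mu - nu) ** 2 / (2.0 * sigma2)


def threshold_crossing_bound(mu: Sequence[float], u: float, delta: float,
                             kl: Callable[[float, float], float] = gaussian_kl
                             ) -> Tuple[List[float], float]:
    """Return (w*, lower bound on expected total samples) as stated in Theorem 1."""
    if any(m == u for m in mu):
        raise ValueError("a mean equal to the threshold lies outside A_1 and A_2")
    k = [kl(m, u) for m in mu]
    log_term = math.log(1.0 / (2.4 * delta))
    above = [j for j, m in enumerate(mu) if m > u]
    if above:
        # mu in A_1: all weight goes to the arm above u with the largest K_j(mu_j | u).
        # The theorem assumes this arm is unique; ties are broken by index here.
        best = max(above, key=lambda j: k[j])
        w = [1.0 if j == best else 0.0 for j in range(len(mu))]
        return w, log_term / k[best]
    # mu in A_2: weights proportional to 1 / K_j(mu_j | u), as in (9).
    inv = [1.0 / kj for kj in k]
    total = sum(inv)
    return [x / total for x in inv], total * log_term


if __name__ == "__main__":
    print(threshold_crossing_bound([1.0, 0.5, -1.0], u=0.0, delta=0.05))
    print(threshold_crossing_bound([-1.0, -2.0], u=0.0, delta=0.05))
```

When some mean exceeds the threshold, only one arm needs to be sampled in the lower bound; when all means lie below it, every arm must be sampled, with effort inversely proportional to how far it sits from $u$ in KL terms.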

Proof: To see (8), first observe that due to continuity of each $K_j(\mu_j \mid \nu_j)$ as a function of $\nu_j \in U_j$, we have
\[ \inf_{\nu \in A_2} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j) = \inf_{\nu \in \bar{A}_2} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j), \]
where recall that for any set $A$, $\bar{A}$ denotes its closure. The RHS above is solved by
\[ \nu^* = (u, \ldots, u, \mu_{i+1}, \ldots, \mu_K) \]
in the sense that for any other $\nu \in \bar{A}_2$,
\[ \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j) \ge \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j^*) = \sum_{j=1}^i w_j K_j(\mu_j \mid u). \]
Our lower bound problem reduces to
\[ \max_{w \in P_K} \sum_{j=1}^i w_j K_j(\mu_j \mid u). \]
This can easily be seen to be solved uniquely by $w_1^* = 1$, $w_j^* = 0$ for $j = 2, \ldots, K$, and the optimal value $C^*$ is $K_1(\mu_1 \mid u)$. The lower bound on the overall expected number of samples generated is then given by $\log(\frac{1}{2.4\delta})/C^*$.

To see (9), observe that to simplify $\inf_{\nu \in \bar{A}_1} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j)$, it suffices to consider $\nu(s) \in \bar{A}_1$ for each $s$ ($1 \le s \le K$), where
\[ \nu(s) \triangleq (\mu_1, \ldots, \mu_{s-1}, u, \mu_{s+1}, \ldots, \mu_K), \]
in the sense that for any $\nu \in \bar{A}_1$,
\[ \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j) \ge \min_{s = 1, \ldots, K} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j(s)) = \min_{s = 1, \ldots, K} w_s K_s(\mu_s \mid u). \]
The lower bound problem then reduces to
\[ \max_{w \in P_K} \min_j w_j K_j(\mu_j \mid u). \]
The solution to this problem is given by $w_j^* \propto 1/K_j(\mu_j \mid u)$ for all $j$, and the optimal value $C^*$ is $\left(\sum_{j=1}^K \frac{1}{K_j(\mu_j \mid u)}\right)^{-1}$. The lower bound on the overall expected number of samples generated is equal to $\log(\frac{1}{2.4\delta})/C^*$.

3.2 Half-space problem

In this section, we consider the problem of identifying the half-space to which the mean vector belongs, i.e., where the partition is defined by the hyperplane $\sum_{k=1}^K a_k \nu_k = b$. Set
\[ A_1 = \left\{\nu \in \mathbb{R}^K \cap U : \sum_{k=1}^K a_k \nu_k < b\right\} \]

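and
\[ A_2 = \left\{\nu \in \mathbb{R}^K \cap U : \sum_{k=1}^K a_k \nu_k > b\right\}. \]
W.l.o.g. each $a_i$ can be taken to be non-zero and $b > 0$. Problem LB may be formulated as: for $\mu \in A_1$ and non-empty $A_2$,
\[ \max_{w \in P_K} \inf_{\nu \in \bar{A}_2} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j). \tag{10} \]

Theorem 2. Under Assumptions 1 and 2, and given that $A_2$ is non-empty, there is a unique optimal solution $(w^*, \nu^*)$ to Problem LB. Further,
\[ K_i(\mu_i \mid \nu_i^*) = K_1(\mu_1 \mid \nu_1^*) \quad \forall i, \tag{11} \]
\[ \sum_{k=1}^K a_k \nu_k^* = b, \tag{12} \]
\[ \nu_i^* > \mu_i \ \text{if} \ a_i > 0, \quad \text{and} \quad \nu_i^* < \mu_i \ \text{if} \ a_i < 0. \tag{13} \]
Relations (11), (12) and (13) uniquely specify $\nu^* \in U$. Moreover,
\[ \frac{w_i^*}{a_i} K_i'(\mu_i \mid \nu_i^*) = \frac{w_1^*}{a_1} K_1'(\mu_1 \mid \nu_1^*) \quad \forall i, \tag{14} \]
where $K_i'$ denotes the derivative of $K_i$ with respect to its second argument.

Let $\bar{u}_i = \sup\{u \in U_i\}$ and $\underline{u}_i = \inf\{u \in U_i\}$. Further, set $\hat{u}_i = \bar{u}_i$ if $a_i > 0$, and $\hat{u}_i = \underline{u}_i$ if $a_i < 0$. The following lemma is useful in proving Theorem 2.

Lemma 1. Under Assumption 2, the following are equivalent:

1. $A_2$ is non-empty.
2. $\sum_{i=1}^K a_i \hat{u}_i > b$.
3. There exists a unique $\nu^* \in U$ such that (11), (12) and (13) hold.

Proof of Lemma 1: Claim 1 implies the existence of $\nu$ such that $\sum_{i=1}^K a_i \nu_i > b$ and $K_i(\mu_i \mid \nu_i) < \infty$ for all $i$. Claim 2 follows as $a_i \nu_i < a_i \hat{u}_i$. To see that Claim 2 implies Claim 3, recall that $K_i(\mu_i \mid \nu_i)$ equals zero at $\nu_i = \mu_i$. It strictly increases with $\nu_i$ for $\nu_i \ge \mu_i$ and it strictly reduces with $\nu_i$ for $\nu_i \le \mu_i$. Assume w.l.o.g. that $a_1 > 0$, and for $\nu_1 \ge \mu_1$, consider the function
\[ \nu_i(\nu_1) = K_i^{-1}(K_1(\mu_1 \mid \nu_1)), \]
where $\nu_i(\nu_1) \ge \mu_i$ if $a_i > 0$, and $\nu_i(\nu_1) \le \mu_i$ if $a_i < 0$. Now, the function $h(\nu_1) \triangleq \sum_{i=1}^K a_i \nu_i(\nu_1)$ satisfies $h(\nu_1) < b$ for $\nu_1 = \mu_1$, and it strictly increases with $\nu_1$.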

Further, observe that as $\nu_1 \to \bar{u}_1$, $\nu_i(\nu_1) \to \bar{u}_i$ if $a_i > 0$, and $\nu_i(\nu_1) \to \underline{u}_i$ if $a_i < 0$. Thus, $h(\nu_1) \to \sum_{i=1}^K a_i \hat{u}_i$, and thus there exists a unique $\nu^* \in U$ so that $h(\nu_1^*) = b$, and (11) and (13) hold. To see that Claim 3 implies Claim 1, observe that Claim 3 guarantees that $(\nu_1^*, \nu_2(\nu_1^*), \ldots, \nu_K(\nu_1^*)) \in U$. By selecting $\nu_1 > \nu_1^*$ with $\nu_1 - \nu_1^*$ sufficiently small, Claim 1 follows.

Proof of Theorem 2: Lemma 1 guarantees the existence of $\nu^*, w^*$ that solve Eqs. (11) to (14). Here, observe that (14) defines $w^*$. Note that $\nu^*$ is the solution to the optimization problem
\[ \inf_{\nu \in \bar{A}_2} \sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j). \]
This can be verified by observing that the first order KKT conditions for this convex programming problem are given by Eqs. (12) to (14) (recall that $\bar{A}_2 = \{\nu : \sum_{i=1}^K a_i \nu_i \ge b\}$). Further, from Eq. (11), it follows that
\[ \inf_{\nu \in \bar{A}_2} \sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j) = \sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j^*) = K_1(\mu_1 \mid \nu_1^*). \]
For any other feasible solution $w$, we have
\[ \inf_{\nu \in \bar{A}_2} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) \le \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i^*) \le K_1(\mu_1 \mid \nu_1^*), \]
which shows that $w^*$ is an optimal solution to the problem.

Uniqueness: It remains to show that the above is a unique solution to Problem LB. We skip the details as, in Section 3.4, we prove uniqueness of the solution for a general case where $\bar{A}_2$ is a union of half-spaces. See Lemma 3 for uniqueness in this more general setting.

3.3 $A^c$ is convex

Suppose that $\Omega = A \cup A^c$ where $\mu \in A$ and $A^c$ is a closed convex set. To avoid trivialities, assume that $\Omega \subset U$. Further, there exists a $\nu^{(0)} \in A^c$, so that $A^c$ is non-empty. Let the associated lower bound problem be denoted by Problem CVX:
\[ \max_{w \in P_K} \inf_{\nu \in A^c} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j). \tag{Problem CVX} \]
The solution to Problem CVX and to each of its sub-problems $\inf_{\nu \in A^c} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j)$ is finite. This follows as, for each feasible $w \in P_K$,
\[ \inf_{\nu \in A^c} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j) \le \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j^{(0)}) \le \max_j K_j(\mu_j \mid \nu_j^{(0)}) < \infty. \]

Let $C^*$ denote the optimal value for Problem CVX. Under Assumption 1, $\sum_{j=1}^K w_j K_j(\mu_j \mid \cdot)$ is strictly convex and there is a unique $\nu \in A^c$ that achieves the minimum in the sub-problem
\[ \inf_{\nu \in A^c} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j). \]
Let $\nu(w)$ denote this unique solution for any $w \in P_K$. Lemma 2 below shows that for every optimal solution to Problem CVX, the same $\nu$ achieves the minimum in the above sub-problem.

Lemma 2. Under Assumption 1, for any $w^*, s^*$ that are optimal for Problem CVX, $\nu(w^*) = \nu(s^*)$.

Proof. To see this, first note that $\inf_{\nu \in A^c} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j)$ is a concave function of $w$. This shows that, if $w^*$ and $s^*$ are two optimal solutions, then $\alpha w^* + (1 - \alpha) s^*$ for $\alpha \in (0, 1)$ is another optimal solution. Since it is optimal, we have
\[ \sum_{j=1}^K (\alpha w_j^* + (1 - \alpha) s_j^*) \, K_j(\mu_j \mid \nu_j(\alpha w^* + (1 - \alpha) s^*)) = C^*. \]
Now, due to Assumption 1, $\sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j(\alpha w^* + (1 - \alpha) s^*)) > C^*$ if $\nu(\alpha w^* + (1 - \alpha) s^*) \ne \nu(w^*)$, and $\sum_{j=1}^K s_j^* K_j(\mu_j \mid \nu_j(\alpha w^* + (1 - \alpha) s^*)) > C^*$ if $\nu(\alpha w^* + (1 - \alpha) s^*) \ne \nu(s^*)$. It follows that $\nu(w^*) = \nu(\alpha w^* + (1 - \alpha) s^*) = \nu(s^*)$.

Let $\nu^*$ be the unique value of $\nu$ which achieves the minimum in the sub-problem for every optimal solution. In Theorem 3, we provide an alternate characterization of $\nu^*$, as well as a characterization of the solution of Problem CVX. Some notation is needed to state Theorem 3. For any index set $J \subseteq [K]$ and vector $\nu \in \mathbb{R}^K$, let $\nu_J$ denote the projection of the vector $\nu$ onto the lower dimensional subspace with coordinate set given by $J$. Similarly, for any set $B \subseteq \mathbb{R}^K$, let $B_J$ denote its projection onto the subspace restricted to the coordinate set $J$, i.e., $B_J = \{\nu_J : \nu \in B\}$. Note that if $B$ is convex, then $B_J$ is also convex. If $B$ is the $c$-sublevel set of a convex function $f$, then
\[ B_J = \{\nu_J : f(\nu_J, \nu_{J^c}) \le c \ \text{for some} \ \nu_{J^c} \in \mathbb{R}^{|J^c|}\} = \left\{\nu_J : \inf_{\nu_{J^c} \in \mathbb{R}^{|J^c|}} f(\nu_J, \nu_{J^c}) \le c\right\}. \]
In other words, $B_J$ is the $c$-sublevel set of the function $h_J := \inf_{\nu_{J^c} \in \mathbb{R}^{|J^c|}} f(\nu_J, \nu_{J^c})$.

Theorem 3. Suppose that $\mu \in A$, $A^c$ is non-empty, and Assumptions 1 and 2 hold.
Then, for any optimal solution $(w^*, \nu^*)$ to Problem CVX, $\nu^*$ uniquely solves the min-max problem
\[ \inf_{\nu \in A^c} \max_i K_i(\mu_i \mid \nu_i). \tag{15} \]
Further, the following are necessary and sufficient conditions for such a $(w^*, \nu^*)$. Let $I = \arg\max_i K_i(\mu_i \mid \nu_i^*)$. Then,

(a) $w_i^* = 0 \ \forall i \in I^c$,

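(b) $\nu_I^* \in \partial (A^c)_I$, and

(c) there exists a supporting hyperplane of $(A^c)_I$ at $\nu_I^*$ given by $\sum_{i \in I} a_i \nu_i = b$ such that
\[ \nu_i^* > \mu_i \ \text{if} \ a_i > 0, \quad \text{and} \quad \nu_i^* < \mu_i \ \text{if} \ a_i < 0 \quad \forall i \in I, \tag{16} \]
\[ \frac{w_i^*}{a_i} K_i'(\mu_i \mid \nu_i^*) = \frac{w_j^*}{a_j} K_j'(\mu_j \mid \nu_j^*) \quad \forall i, j \in I, \tag{17} \]
where $K_i'$ denotes the derivative of $K_i$ with respect to its second argument.

Remark 4. Condition (c) shows that the problem has a unique solution, i.e., the optimal $w^*$ is a singleton, if there is a unique supporting hyperplane of $(A^c)_I$ at $\nu_I^*$ satisfying (16). Consider the case where $A^c = \{\nu : f(\nu) \le c\}$ is the $c$-sublevel set of a convex function $f$. Then, $(A^c)_I$ is the $c$-sublevel set of the function $h : \mathbb{R}^{|I|} \to \mathbb{R}$, $h(\nu_I) := \inf_{\nu_{I^c} \in \mathbb{R}^{|I^c|}} f(\nu_I, \nu_{I^c})$. Further suppose that $h(\cdot)$ is a smooth function. Then, the unique tangential hyperplane at $\nu_I^*$ is given by $\nabla h(\nu_I^*) \cdot (\nu_I - \nu_I^*) = 0$. In particular, in this case, for $i \in I$,
\[ w_i^* \propto \frac{\partial h}{\partial \nu_i}(\nu_I^*) \Big/ K_i'(\mu_i \mid \nu_i^*). \]

Proof of Theorem 3. Let $B_n$ denote a closed ball centered at $\mu$ with radius $n$. Consider $n$ sufficiently large so that $\nu^*$, defined as the solution to (15), lies in $B_n$ (since the objective function $\max_i K_i(\mu_i \mid \nu_i)$ is strictly convex in $\nu$, such a $\nu^*$ is unique). Since $A^c \cap B_n$ is a compact set, and $\sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i)$ is continuous in $w$ and $\nu$, concave in $w \in P_K$ and convex in $\nu \in A^c \cap B_n$, by Sion's Minimax Theorem
\[ \max_{w \in P_K} \inf_{\nu \in A^c \cap B_n} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) = \inf_{\nu \in A^c \cap B_n} \max_{w \in P_K} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) = \inf_{\nu \in A^c \cap B_n} \max_i K_i(\mu_i \mid \nu_i) = \inf_{\nu \in A^c} \max_i K_i(\mu_i \mid \nu_i). \tag{18} \]
Observe that
\[ r_n(w) \triangleq \inf_{\nu \in A^c \cap B_n} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) \]
is continuous in $w$ (see Theorem 2.1 in Fiacco and Ishizuka (1990)) and decreases with $n$ to $r(w) \triangleq \inf_{\nu \in A^c} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i)$. Thus, we have uniform convergence (see Theorem 7.13 in Rudin (1976)):
\[ \sup_{w \in P_K} |r_n(w) - r(w)| \to 0. \]
This in turn implies that $\max_{w \in P_K} r_n(w) \to \max_{w \in P_K} r(w)$. From (18) it follows that the LHS above is independent of $n$. Therefore, the min-max relation
\[ \max_{w \in P_K} \inf_{\nu \in A^c} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) = \inf_{\nu \in A^c} \max_i K_i(\mu_i \mid \nu_i) \tag{19} \]
holds. Now if $(w^*, \tilde{\nu})$ is a saddle point of the min-max problem, then, since $\nu^*$ is unique, $\tilde{\nu}$ equals $\nu^*$.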

Necessity of conditions on optimal $(w^*, \nu^*)$: Let $I = \arg\max_i K_i(\mu_i \mid \nu_i^*)$. The minimax equality in (19) shows that $(w^*, \nu^*)$ is a saddle point, and therefore, $w^*$ solves the optimization problem
\[ \max_{(w_1, \ldots, w_K) \in P_K} \sum_{j=1}^K w_j K_j(\mu_j \mid \nu_j^*). \tag{20} \]
From this, it is easy to see that $w_i^* = 0 \ \forall i \in I^c$. To see (b), note that $\nu^*$ uniquely solves the optimization problem
\[ \min_{(\nu_1, \ldots, \nu_K) \in A^c} \sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j). \tag{21} \]
If $\nu_I^*$ is in the interior of $(A^c)_I$, it is easy to come up with $\nu \ne \nu^*$ in $A^c$ with a smaller value of $\sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j)$. Now, consider the convex set
\[ C := \left\{\nu_I \in \mathbb{R}^{|I|} : \sum_{i \in I} w_i^* K_i(\mu_i \mid \nu_i) < \sum_{i \in I} w_i^* K_i(\mu_i \mid \nu_i^*)\right\} \]
(convexity of $C$ follows from Assumption 1). By the separating hyperplane theorem, there exists a hyperplane $\sum_{i \in I} a_i \nu_i = b$ that separates $C$ and $(A^c)_I$. Since $\nu_I^* \in \bar{C} \cap (A^c)_I$, this hyperplane passes through $\nu_I^*$, and is a supporting hyperplane to both convex sets $C$ and $(A^c)_I$. From the fact that it is a supporting hyperplane to $C$ at $\nu_I^*$, we have
\[ \frac{w_i^*}{a_i} K_i'(\mu_i \mid \nu_i^*) = \frac{w_j^*}{a_j} K_j'(\mu_j \mid \nu_j^*) \quad \forall i, j \in I. \]
This proves (c).

Sufficiency: Let $\nu^*$ and $w^*$ be such that (a), (b), (c) hold. Note that $\sum_{i \in I} a_i \mu_i < b$ and $(A^c)_I \subseteq \{\nu_I : \sum_{i \in I} a_i \nu_i \ge b\}$. Then, from Theorem 2, $w_I^*$ and $\nu_I^*$ solve the following half-space problem in the lower dimensional subspace restricted to the coordinate set $I$:
\[ \max_{w_I \in P_I} \inf_{\nu_I : \sum_{i \in I} a_i \nu_i \ge b} \sum_{i \in I} w_i K_i(\mu_i \mid \nu_i). \]
In particular,
\[ \inf_{\nu_I : \sum_{i \in I} a_i \nu_i \ge b} \sum_{i \in I} w_i^* K_i(\mu_i \mid \nu_i) = \sum_{i \in I} w_i^* K_i(\mu_i \mid \nu_i^*). \]
Further, for any $w_I$, note that
\[ \inf_{\nu_I \in (A^c)_I} \sum_{i \in I} w_i K_i(\mu_i \mid \nu_i) \ge \inf_{\nu_I : \sum_{i \in I} a_i \nu_i \ge b} \sum_{i \in I} w_i K_i(\mu_i \mid \nu_i). \]
This shows that
\[ \inf_{\nu \in A^c} \sum_{j=1}^K w_j^* K_j(\mu_j \mid \nu_j) = \inf_{\nu_I \in (A^c)_I} \sum_{i \in I} w_i^* K_i(\mu_i \mid \nu_i) = \sum_{i \in I} w_i^* K_i(\mu_i \mid \nu_i^*) = \max_i K_i(\mu_i \mid \nu_i^*). \]
Now, consider any $w$ which is a feasible solution of Problem CVX. Then,
\[ \inf_{\nu \in A^c} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) \le \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i^*) \le \max_i K_i(\mu_i \mid \nu_i^*). \]
This proves our claim that $w^*$, $\nu(w^*) = \nu^*$ form an optimal solution.
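The sufficiency argument leans on the half-space characterization of Theorem 2, which becomes fully explicit in a simple case. The sketch below assumes Gaussian arms with a common known variance $\sigma^2$, so that $K_i(\mu_i \mid \nu_i) = (\mu_i - \nu_i)^2/(2\sigma^2)$; this specialization and the function names are illustrative and not taken from the text. Under that assumption, (11) forces $|\nu_i^* - \mu_i|$ to be the same for all $i$, (12) and (13) pin this common distance down to $(b - \sum_i a_i \mu_i)/\sum_i |a_i|$, and (14) gives $w_i^* \propto |a_i|$. The inner infimum for a fixed strictly positive $w$ is computed from a standard Lagrangian calculation, which is used here only as a numerical cross-check.

```python
from typing import List, Sequence, Tuple


def gaussian_half_space_solution(mu: Sequence[float], a: Sequence[float], b: float,
                                 sigma2: float = 1.0
                                 ) -> Tuple[List[float], List[float], float]:
    """Return (w*, nu*, C*) from (11)-(14), assuming Gaussian arms with common variance sigma2."""
    slack = b - sum(ai * mi for ai, mi in zip(a, mu))
    if slack <= 0 or any(ai == 0 for ai in a):
        raise ValueError("need sum_i a_i mu_i < b and every a_i non-zero")
    abs_sum = sum(abs(ai) for ai in a)
    r = slack / abs_sum  # common distance |nu_i* - mu_i| implied by (11)-(13)
    nu_star = [mi + (r if ai > 0 else -r) for ai, mi in zip(a, mu)]
    w_star = [abs(ai) / abs_sum for ai in a]  # from (14): w_i* proportional to |a_i|
    return w_star, nu_star, r * r / (2.0 * sigma2)


def inner_value(w: Sequence[float], mu: Sequence[float], a: Sequence[float], b: float,
                sigma2: float = 1.0) -> float:
    """inf over {nu : a.nu >= b} of sum_i w_i (mu_i - nu_i)^2 / (2 sigma2), for strictly positive w."""
    slack = b - sum(ai * mi for ai, mi in zip(a, mu))
    return slack * slack / (2.0 * sigma2 * sum(ai * ai / wi for ai, wi in zip(a, w)))


if __name__ == "__main__":
    mu, a, b = [0.0, 0.0], [1.0, 2.0], 3.0
    w_star, nu_star, c_star = gaussian_half_space_solution(mu, a, b)
    print(w_star, nu_star, c_star)
    # No other weight vector on a grid of the simplex does better than w*.
    print(max(inner_value([x / 100, 1 - x / 100], mu, a, b) for x in range(1, 100)) <= c_star + 1e-12)
```

In this Gaussian case the optimal allocation depends only on the hyperplane coefficients, not on how far $\mu$ sits from the hyperplane; the distance enters only through the optimal value.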

3.4 $A^c$ is non-convex union of half spaces

In Sections 3.2 and 3.3, $A^c$ is convex, which allows us to explicitly characterize the solution to the lower bound problem. In this section, we consider a problem where $A^c$ is not convex. Specifically, we examine the case where $A^c$ is a union of half-spaces. Just as the single half-space problem was useful in studying the case where $A^c$ is convex, analyzing $A^c$ when it is a union of half-spaces may provide insights into a more general problem where $A^c$ is a union of convex sets. In Section 3.4.1, we restrict ourselves to two arms, both having a Gaussian distribution with known and common variance. This simple setting lends itself to elegant analysis and graphical interpretation. Extensions to general distributions, $K > 2$ arms and settings where $A^c$ is a union of convex sets are part of our ongoing research.

Let
\[ B_j \triangleq \left\{\nu \in \mathbb{R}^K : \sum_{k=1}^K a_{j,k} \nu_k \ge b_j\right\}, \tag{22} \]
with each $b_j \ge 0$, and let $A^c = \cup_{j=1}^m B_j$ be the union of these half-spaces. Again, suppose that $A^c \subset U$. The lower bound problem can be expressed as
\[ \max_{w \in P_K} \inf_{\nu \in \cup_{j=1}^m B_j} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i) = \max_{w \in P_K} \min_j \inf_{\nu \in B_j} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i). \tag{23} \]

Remark 5. It is easy to see that the best arm identification problem is a special case of this problem. To see this, suppose arm 1 has the highest mean among the $K$ arms, i.e., $\mu_1 \ge \mu_j \ \forall j \ne 1$. We then have $A^c = \cup_{j=2}^K B_j$, where for any $j$, $B_j = \{\nu \in \mathbb{R}^K : \nu_j \ge \nu_1\}$.

Lemma 3 shows that the optimization problem in (23) has a unique solution.

Lemma 3. There is a unique $w^* \in P_K$ that achieves the maximum in (23).

Proof. Denote the optimal value of (23) by $C^*$. We first show that if $q, s \in P_K$ are two distinct optimal solutions and $\nu(q), \nu(s) \in A^c$, respectively, achieve the minimum in the sub-problem, then $\nu(q) \ne \nu(s)$. To see this, suppose $\nu(q) = \nu(s) = \tilde{\nu} \in B_j$ for some $1 \le j \le m$. Then $\tilde{\nu}$ achieves the minimum in the sub-problem $\inf_{\nu \in B_j} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i)$ for both $w = q$ and $w = s$.
Hence, both q, s solve the following equations: w i K i (µ i ν i ) = C, (24) w i a j,i K i(µ i ν i ) = w 1 a j,1 K 1(µ 1 ν 1 ) i. (25) This is a contradiction as the above set of equations has a unique solution. Now, suppose q, s P K are two distinct optimal solutions of the convex program (23). Then any convex combination z = αq + (1 α)s is also an optimal solution. Let ν(z) achieve the minimum in the sub-problem for z. Then C = z i K i (µ i ν i (z)) = α q i K i (µ i ν i (z)) + (1 α) s i K i (µ i ν i (z)). In addition, for any ν, we have K w ik i (µ i ν i ) C for both w = q and w = s. Then, the above equality is possible only if K q ik i (µ i ν i (z)) = K s ik i (µ i ν i (z)) = C. This in turn implies that ν(z) achieves the minimum in the sub-problem for both q, s, which is a contradiction to our earlier result. Hence proved. 14
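Remark 5 can be made concrete with a small numerical sketch. It assumes Gaussian arms with variance 1/2, so that $K_i(\mu_i \mid \nu_i) = (\mu_i - \nu_i)^2$, and it relies on the standard closed form for a weighted projection onto a half-space, $\inf_{a \cdot \nu \ge b} \sum_i w_i (\mu_i - \nu_i)^2 = (b - a \cdot \mu)_+^2 / \sum_i a_i^2 / w_i$. The function names `bai_half_spaces`, `inner_inf` and `objective` are illustrative and do not come from the paper.

```python
def bai_half_spaces(K):
    """Half-spaces B_j = {nu : nu_j >= nu_1}, j = 2..K, encoded as (a_j, b_j)
    with the convention a_j . nu >= b_j used in (22). Arm 1 is index 0."""
    spaces = []
    for j in range(1, K):
        a = [0.0] * K
        a[0] = -1.0  # coefficient of nu_1
        a[j] = 1.0   # coefficient of nu_j
        spaces.append((a, 0.0))
    return spaces


def inner_inf(w, mu, a, b):
    """inf over {nu : a . nu >= b} of sum_i w_i (mu_i - nu_i)^2.
    Assumes Gaussian arms with variance 1/2 (an assumption of this sketch)."""
    gap = b - sum(ai * mi for ai, mi in zip(a, mu))
    if gap <= 0:
        return 0.0  # mu already lies in the half-space
    if any(ai != 0 and wi == 0 for ai, wi in zip(a, w)):
        return 0.0  # an unsampled arm can be moved at no cost
    denom = sum(ai * ai / wi for ai, wi in zip(a, w) if ai != 0)
    return gap * gap / denom


def objective(w, mu, spaces):
    """Inner value of (23): minimum over the half-spaces of the inner infimum."""
    return min(inner_inf(w, mu, a, b) for a, b in spaces)
```

For instance, with `mu = [1.0, 0.5, 0.0]` and `w = [0.5, 0.3, 0.2]`, `objective(w, mu, bai_half_spaces(3))` reduces to the familiar pairwise expression $\min_{j \ne 1} \frac{w_1 w_j}{w_1 + w_j} (\mu_1 - \mu_j)^2$ for Gaussian best arm identification.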

3.4.1 Two arms Gaussian setting

To illustrate the issues that arise with $A^c$ being non-convex, consider a simple setting of two arms. Both are assumed to have a Gaussian distribution and the variance of each arm is assumed to be 1/2. W.l.o.g. the mean of each arm is set to zero. Then, for $j = 1, 2$,

$$B_j = \{\nu \in \mathbb{R}^2 : a_{j,1} \nu_1 + a_{j,2} \nu_2 \ge b_j\}, \tag{26}$$

and $A^c = B_1 \cup B_2$ is the union of the two half-spaces. To avoid degeneracies we assume that each $a_{j,k} \ne 0$. Further suppose that $\frac{a_{1,1}}{a_{1,2}} \ne \frac{a_{2,1}}{a_{2,2}}$ so that $A^c$ is non-convex. The lower bound problem is then given by

$$\max_{(w_1, w_2) \in \mathcal{P}_2} \; \inf_{\nu \in A^c} \; \sum_{i=1}^2 w_i \nu_i^2. \tag{27}$$

The following geometrical result provides useful insights towards the solution of (27).

Proposition 1. For $w_1, w_2, C > 0$, a necessary and sufficient condition for an ellipse of the form

$$w_1 \nu_1^2 + w_2 \nu_2^2 = C \tag{28}$$

to be uniquely tangential to lines

$$a_{1,1} \nu_1 + a_{1,2} \nu_2 = b_1 \tag{29}$$

and

$$a_{2,1} \nu_1 + a_{2,2} \nu_2 = b_2 \tag{30}$$

is that

$$\min_{k=1,2} \frac{|a_{2,k}|}{|a_{1,k}|} < \frac{b_2}{b_1} < \max_{k=1,2} \frac{|a_{2,k}|}{|a_{1,k}|}. \tag{31}$$

Then, the tangential ellipse is specified by

$$\frac{w_1}{C} = \frac{(a_{1,2} a_{2,1})^2 - (a_{1,1} a_{2,2})^2}{(b_2 a_{1,2})^2 - (b_1 a_{2,2})^2} \tag{32}$$

and

$$\frac{w_2}{C} = \frac{(a_{1,2} a_{2,1})^2 - (a_{1,1} a_{2,2})^2}{(b_1 a_{2,1})^2 - (b_2 a_{1,1})^2}. \tag{33}$$

The ellipse (28) meets the line (29) at point $\left( \frac{C a_{1,1}}{w_1 b_1}, \frac{C a_{1,2}}{w_2 b_1} \right)$ and it meets line (30) at point $\left( \frac{C a_{2,1}}{w_1 b_2}, \frac{C a_{2,2}}{w_2 b_2} \right)$.

Proof. A necessary and sufficient condition for ellipse (28) to be tangential to line (29) at point $(\nu_1^*, \nu_2^*)$ is for $(\nu_1^*, \nu_2^*)$ to satisfy the equations of the ellipse and the line, and the slope matching condition

$$\frac{w_1 \nu_1^*}{a_{1,1}} = \frac{w_2 \nu_2^*}{a_{1,2}}. \tag{34}$$

The fact that $(\nu_1^*, \nu_2^*)$ satisfies (28) and (29) implies that the common value in (34) equals $C / b_1$. Plugging $(\nu_1^*, \nu_2^*)$ from (34) into (28), we observe

$$\frac{a_{1,1}^2}{w_1} + \frac{a_{1,2}^2}{w_2} = \frac{b_1^2}{C}.$$

Similarly, considering the other half-space, we get

$$\frac{a_{2,1}^2}{w_1} + \frac{a_{2,2}^2}{w_2} = \frac{b_2^2}{C}.$$

The result follows by solving the two equations.

Theorem 4. The solution to (27) depends in the following way on the underlying parameters.

Case 1:

$$\left( \frac{b_2}{b_1} \right)^2 \ge \left( \frac{a_{2,1}^2}{|a_{1,1}|} + \frac{a_{2,2}^2}{|a_{1,2}|} \right) \left( |a_{1,1}| + |a_{1,2}| \right)^{-1}. \tag{35}$$

In this case, (27) reduces to the half-space problem where $A^c = B_1$ so that the optimal solution to (27) is given by

$$w_i^* = \frac{|a_{1,i}|}{|a_{1,1}| + |a_{1,2}|}, \quad i = 1, 2, \tag{36}$$

and the optimal value is

$$C^* = \frac{b_1^2}{(|a_{1,1}| + |a_{1,2}|)^2}.$$

Case 2:

$$\left( \frac{b_2}{b_1} \right)^2 \le \left( \frac{a_{1,1}^2}{|a_{2,1}|} + \frac{a_{1,2}^2}{|a_{2,2}|} \right)^{-1} \left( |a_{2,1}| + |a_{2,2}| \right).$$

This simply corresponds to Case 1, with $(a_{1,1}, a_{1,2}, b_1)$ interchanged with $(a_{2,1}, a_{2,2}, b_2)$.

Case 3:

$$\left( \frac{a_{1,1}^2}{|a_{2,1}|} + \frac{a_{1,2}^2}{|a_{2,2}|} \right)^{-1} \left( |a_{2,1}| + |a_{2,2}| \right) < \left( \frac{b_2}{b_1} \right)^2 < \left( \frac{a_{2,1}^2}{|a_{1,1}|} + \frac{a_{2,2}^2}{|a_{1,2}|} \right) \left( |a_{1,1}| + |a_{1,2}| \right)^{-1}. \tag{37}$$

Here (31) holds, and the optimal $w_1^*$ and $w_2^*$ are given by (32) and (33), respectively.

Proof. Case 1: First consider the half-space problem where $A^c = B_1$. Our analysis in Section 3.2 shows that there is a unique $(w_1^*, w_2^*)$ and $(\nu_1^*, \nu_2^*)$ that solves the resulting problem, and $a_{1,1} \nu_1^* + a_{1,2} \nu_2^* = b_1$ so that

$$\mathrm{sign}(a_{1,1}) \nu_1^* = |\nu_1^*| = \mathrm{sign}(a_{1,2}) \nu_2^* = |\nu_2^*|, \qquad |\nu_1^*| = |\nu_2^*| = \frac{b_1}{|a_{1,1}| + |a_{1,2}|}.$$

Further, from

$$\frac{w_1^* \nu_1^*}{a_{1,1}} = \frac{w_2^* \nu_2^*}{a_{1,2}}, \tag{38}$$

it follows that for the half-space problem, $w_i^* \propto |a_{1,i}|$ is the optimal solution and the optimal value is $C^* = \frac{b_1^2}{(|a_{1,1}| + |a_{1,2}|)^2}$.

Returning to (27), we show that when (35) is true and $w_i^* \propto |a_{1,i}|$,

$$\inf_{\nu \in B_2} \sum_{i=1}^2 w_i^* K_i(\mu_i \mid \nu_i) = \inf_{\nu : a_{2,1} \nu_1 + a_{2,2} \nu_2 \ge b_2} \left( w_1^* \nu_1^2 + w_2^* \nu_2^2 \right) \ge C^*,$$

and hence $w_i^* \propto |a_{1,i}|$ continues to be optimal for (27). We first find the point $(\kappa_1, \kappa_2) \in B_2$ that achieves the minimum in the above optimization problem. We know that $(\kappa_1, \kappa_2)$ satisfies

$$a_{2,1} \kappa_1 + a_{2,2} \kappa_2 = b_2,$$

and the slope matching condition

$$\frac{w_1^* \kappa_1}{a_{2,1}} = \frac{w_2^* \kappa_2}{a_{2,2}}.$$

It follows from easy calculations that

$$\inf_{\nu : a_{2,1} \nu_1 + a_{2,2} \nu_2 \ge b_2} \left( w_1^* \nu_1^2 + w_2^* \nu_2^2 \right) = \frac{b_2^2}{\frac{a_{2,1}^2}{w_1^*} + \frac{a_{2,2}^2}{w_2^*}} = \frac{b_2^2}{\left( \frac{a_{2,1}^2}{|a_{1,1}|} + \frac{a_{2,2}^2}{|a_{1,2}|} \right) \left( |a_{1,1}| + |a_{1,2}| \right)}.$$

The above expression is greater than or equal to $\frac{b_1^2}{(|a_{1,1}| + |a_{1,2}|)^2}$ when (35) is true, which gives us the required result.

Case 2: This follows similarly to Case 1.

Case 3: It is easy to see that (37) implies (31). Let $(w_1^*, w_2^*)$ denote the optimal solution to (27). It is clear that the corresponding ellipse must be tangential to both the half lines $a_{1,1} \nu_1 + a_{1,2} \nu_2 = b_1$ and $a_{2,1} \nu_1 + a_{2,2} \nu_2 = b_2$, since if it does not touch one of these half lines, then the associated constraint can be ignored in solving (27). However, that violates (37). Therefore, the solution is provided by Proposition 1.

3.4.2 $A^c$ is a union of many half spaces

We now provide an algorithm to solve the lower bound problem (27) for general $m$ when the number of arms $K = 2$. Again, both arms are assumed to have a Gaussian distribution, the variance of each arm is assumed to be 1/2, and the mean of each arm is set to zero. The algorithm is outlined somewhat informally, emphasising its graphical interpretation.

Observe that in the discussion in Section 3.4.1, the ellipse $\sum_{i=1}^2 w_i \nu_i^2 = C$ touches the line $a_{1,1} \nu_1 + a_{1,2} \nu_2 = b_1$ (i.e., they have a non-empty intersection) if and only if it touches the lines $-a_{1,1} \nu_1 + a_{1,2} \nu_2 = b_1$, $a_{1,1} \nu_1 - a_{1,2} \nu_2 = b_1$ and $-a_{1,1} \nu_1 - a_{1,2} \nu_2 = b_1$. Further, it is tangential to them at points symmetric around the axes. Thus, without loss of generality we could have restricted the analysis to $a_{1,1}, a_{1,2}, a_{2,1}, a_{2,2} > 0$.

With this in mind, consider $m$ half-spaces $(B_j : j = 1, \dots, m)$, where

$$B_j = \{\nu \in \mathbb{R}^2 : a_{j,1} \nu_1 + a_{j,2} \nu_2 \ge b_j\}, \tag{39}$$

where we take each $a_{j,k}$ and $b_j$ to be strictly positive. Let

$$L_j = \{\nu \in \mathbb{R}^2 : a_{j,1} \nu_1 + a_{j,2} \nu_2 = b_j\} \tag{40}$$

for each $j$ denote the associated line.

Further, without loss of generality, suppose that

$$\frac{a_{1,1}}{a_{1,2}} > \frac{a_{2,1}}{a_{2,2}} > \dots > \frac{a_{m,1}}{a_{m,2}}.$$

To ensure that any two of the lines intersect in the positive quadrant (else one of the two lines is above the other line in the positive quadrant and can be ignored), we further assume that

$$\frac{a_{j,1}}{a_{j+1,1}} > \frac{b_j}{b_{j+1}} > \frac{a_{j,2}}{a_{j+1,2}}$$

for $j = 1, \dots, m - 1$. Recall that our optimization problem is

$$\max_{(w_1, w_2) \in \mathcal{P}_2} \; \inf_{\nu \in \cup_{j=1}^m B_j} \; \sum_{i=1}^2 w_i \nu_i^2. \tag{41}$$
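For $m = 2$, problem (41) is exactly problem (27), so the case analysis of Theorem 4 can be checked against a direct numerical evaluation. The following is a minimal sketch under the assumptions of Section 3.4.1 (two Gaussian arms, variance 1/2, zero means, $b_1, b_2 > 0$, non-parallel lines). It uses the closed form $b^2 / (a_1^2 / w_1 + a_2^2 / w_2)$ for the infimum over a single half-space that appears in the proof of Theorem 4. The function names `theorem4_weights` and `value` are illustrative only.

```python
def theorem4_weights(a1, a2, b1, b2):
    """Return (w1, w2, C) for problem (27) following the three cases of Theorem 4.
    a1 = (a_{1,1}, a_{1,2}), a2 = (a_{2,1}, a_{2,2}), all nonzero; b1, b2 > 0."""
    a11, a12 = abs(a1[0]), abs(a1[1])
    a21, a22 = abs(a2[0]), abs(a2[1])
    ratio_sq = (b2 / b1) ** 2
    upper = (a21 ** 2 / a11 + a22 ** 2 / a12) / (a11 + a12)  # right side of (35)
    lower = (a21 + a22) / (a11 ** 2 / a21 + a12 ** 2 / a22)  # threshold of Case 2
    if ratio_sq >= upper:  # Case 1: only B_1 is active
        s = a11 + a12
        return a11 / s, a12 / s, b1 ** 2 / s ** 2
    if ratio_sq <= lower:  # Case 2: only B_2 is active
        s = a21 + a22
        return a21 / s, a22 / s, b2 ** 2 / s ** 2
    # Case 3: ellipse tangential to both lines, (32) and (33) with w1 + w2 = 1
    num = (a12 * a21) ** 2 - (a11 * a22) ** 2
    w1_over_c = num / ((b2 * a12) ** 2 - (b1 * a22) ** 2)
    w2_over_c = num / ((b1 * a21) ** 2 - (b2 * a11) ** 2)
    c = 1.0 / (w1_over_c + w2_over_c)
    return w1_over_c * c, w2_over_c * c, c


def value(w, lines):
    """Objective of (41) at weights (w, 1 - w), 0 < w < 1.
    lines is a list of ((a_j1, a_j2), b_j)."""
    return min(b * b / (a[0] ** 2 / w + a[1] ** 2 / (1 - w)) for a, b in lines)
```

As an example, for the lines $2\nu_1 + \nu_2 = 1$ and $\nu_1 + 2\nu_2 = 1$, condition (37) holds and the sketch returns $w_1 = w_2 = 1/2$ with $C = 1/10$, which agrees with evaluating `value` on a fine grid of weights.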

For any $w \in [0, 1]$, let

$$C(w) = \inf_{\nu \in \cup_{j=1}^m B_j} \left( w \nu_1^2 + (1 - w) \nu_2^2 \right).$$

Our aim is to maximize $C(w)$ for $w \in [0, 1]$. Again, this is a concave programming problem so a local optimum is a global optimum. The algorithm starts with $w = 0$ and proceeds with increasing $w$ so that $C(w)$ increases. It stops when it reaches a point where a further increase in $w$ leads to a reduction in $C(w)$. At each value of $w$ considered in the algorithm, $C(w)$ is known. The ellipse $w \nu_1^2 + (1 - w) \nu_2^2 = C(w)$ has the property that it is tangential to one or more lines $(L_j : j \le m)$ and always lies in $\bar{A}$. This ensures that $w \nu_1^2 + (1 - w) \nu_2^2 \ge C(w)$ for all $\nu \in \cup_{j=1}^m B_j$.

Algorithm: The algorithm proceeds in stages. Set $i_1 = 1$. Stage $j = 1$ starts with the ellipse tangential to $L_1$. Specifically, set $w = 0$, $C(w) = 0$, and the ellipse $w \nu_1^2 + (1 - w) \nu_2^2 = C(w)$ is tangential to $L_1$ at point $(\nu_1(0), \nu_2(0)) = (b_1 / a_{1,1}, 0)$. Let $C(w)$, $\nu_1(w)$, and $\nu_2(w)$ be the solution to the equations

$$w \nu_1^2 + (1 - w) \nu_2^2 = C, \tag{42}$$

$$a_{1,1} \nu_1 + a_{1,2} \nu_2 = b_1, \tag{43}$$

$$\frac{w \nu_1}{a_{1,1}} = \frac{(1 - w) \nu_2}{a_{1,2}}. \tag{44}$$

From Lemma 4 below, it follows that as $w$ increases from 0, $C(w)$ and $\nu_2(w)$ increase while $\nu_1(w)$ reduces. With increasing $w$, either the ellipse touches another line, call it $L_{i_2}$, before $w$ equals $\frac{a_{1,1}}{a_{1,1} + a_{1,2}}$, or $w$ equals this value before the ellipse touches any other line. In the latter case, the points $(w, C(w), \nu_1(w), \nu_2(w))$ are a solution to the problem where $A^c = B_1$ and hence also for $A^c = \cup_{j=1}^m B_j$, since the ellipse does not intersect with the half-spaces $B_j$ for $2 \le j \le m$. Else, let $(\nu_1(w), \nu_2(w))$ denote the point where the ellipse meets $L_{i_2}$. The algorithm moves to stage $j = 2$.

The algorithm at the start of any stage $j \ge 2$ proceeds by increasing $w$ so that the ellipse remains tangential to $L_{i_j}$. Lemma 5 below ensures that at any stage $j \ge 2$, with increasing $w$, the ellipse $w \nu_1(w)^2 + (1 - w) \nu_2(w)^2 = C(w)$ does not intersect with lines $L_k$, $k \le i_{j-1}$. Again, let $(\nu_1(w), \nu_2(w))$ denote the point of intersection of the ellipse with $L_{i_j}$. Then, $C(w)$ and $\nu_2(w)$ increase with $w$, and $\nu_1(w)$ reduces with an increase in $w$. Stage $j$ ends when either the ellipse satisfies the optimality condition for the single half space problem where $A^c = B_{i_j}$, that is,

$$w = \frac{a_{i_j,1}}{a_{i_j,1} + a_{i_j,2}},$$

or, before it reaches that value of $w$, the ellipse becomes tangential to another line, $L_{i_{j+1}}$, for some $i_{j+1} > i_j$ and $i_{j+1} \le m$. At that point, either the current ellipse is optimal for the problem $A^c = B_{i_j} \cup B_{i_{j+1}}$, so that

$$\frac{a_{i_{j+1},1}}{a_{i_{j+1},1} + a_{i_{j+1},2}} \le w < \frac{a_{i_j,1}}{a_{i_j,1} + a_{i_j,2}}$$

and no further improvement by increasing $w$ is possible, or if $w < \frac{a_{i_{j+1},1}}{a_{i_{j+1},1} + a_{i_{j+1},2}}$, then $j$ is incremented by 1 and the next stage is initiated. The algorithm clearly terminates for $j \le m$.

Some notation is needed to state Lemma 4. For $a_1, a_2, b > 0$, let $C(w)$, $x(w)$, $y(w)$ denote the solutions to the system of equations below for each $w \in (0, 1)$:

$$w x^2 + (1 - w) y^2 = C, \tag{45}$$

$$a_1 x + a_2 y = b, \tag{46}$$

$$\frac{w x}{a_1} = \frac{(1 - w) y}{a_2}. \tag{47}$$

Let $C'(w)$, $x'(w)$ and $y'(w)$ denote the respective derivatives of $C(w)$, $x(w)$ and $y(w)$ with respect to $w$.

Lemma 4. For $w \in (0, 1)$,

$$C'(w) = \frac{b^2}{\left( \frac{a_1^2}{w} + \frac{a_2^2}{1 - w} \right)^2} \left( \frac{a_1^2}{w^2} - \frac{a_2^2}{(1 - w)^2} \right), \tag{48}$$

and it is positive for $w < \frac{a_1}{a_1 + a_2}$, equals zero at $w = \frac{a_1}{a_1 + a_2}$ and is negative for $w > \frac{a_1}{a_1 + a_2}$. Further,

$$x(w) = \frac{a_1 b}{a_1^2 + \frac{w}{1 - w} a_2^2}, \tag{49}$$

so that $x'(w) < 0$ and $y'(w) = -\frac{a_1}{a_2} x'(w) > 0$ for $w \in (0, 1)$.

Proof of Lemma 4: Observe first that by substituting for $w x$ and $(1 - w) y$ from (47) in (45), we see that the common value in (47) equals $C(w) / b$. Then, using the expressions for $x$ and $y$ from (47) and plugging them in (46), we get

$$C(w) = \frac{b^2}{\frac{a_1^2}{w} + \frac{a_2^2}{1 - w}}.$$

(48) follows from differentiation. (49) follows by substituting for $y$ in (47) using (46). The remaining statements are obvious.

Some notation is needed for Lemma 5. For $c_1, c_2, d > 0$, for each $w \in (0, 1)$, consider a solution $C(w)$, $x(w)$, $y(w)$ to

$$w x^2 + (1 - w) y^2 = C, \tag{50}$$

$$c_1 x + c_2 y = d, \tag{51}$$

$$\frac{w x}{c_1} = \frac{(1 - w) y}{c_2}. \tag{52}$$

Thus, $(x(w), y(w))$ corresponds to the point determined by the intersection of the ellipse (50) (with $C(w)$ in place of $C$) with line (51) so that the ellipse is tangential to the line at the point of intersection.
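The closed forms in Lemma 4 lend themselves to a quick numerical check, and they also give a simple way to evaluate $C(w)$ for $m$ lines as the minimum of the single-line values. The sketch below is not the staged ellipse algorithm described above: as a stand-in it maximises the concave function $C(w)$ by ternary search, which should reach the same optimum. All function names are illustrative, and strictly positive coefficients $a_{j,1}, a_{j,2}, b_j$ are assumed, as in (39).

```python
def tangent_value(w, a1, a2, b):
    """C(w) from the proof of Lemma 4: level of the ellipse
    w x^2 + (1 - w) y^2 = C that is tangential to a1 x + a2 y = b."""
    return b * b / (a1 * a1 / w + a2 * a2 / (1 - w))


def tangent_value_derivative(w, a1, a2, b):
    """C'(w) as given in (48)."""
    d = a1 * a1 / w + a2 * a2 / (1 - w)
    return b * b / (d * d) * (a1 * a1 / (w * w) - a2 * a2 / ((1 - w) ** 2))


def tangent_point(w, a1, a2, b):
    """(x(w), y(w)): x from (49), y from the line equation (46)."""
    x = a1 * b / (a1 * a1 + a2 * a2 * w / (1 - w))
    return x, (b - a1 * x) / a2


def maximise_over_lines(lines, tol=1e-12):
    """Maximise C(w) = min_j tangent_value(w, a_j1, a_j2, b_j) over w in (0, 1).
    Uses ternary search (a substitute for the staged algorithm in the text).
    lines is a list of (a_j1, a_j2, b_j) with strictly positive entries."""
    def c_of(w):
        return min(tangent_value(w, a1, a2, b) for a1, a2, b in lines)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if c_of(m1) < c_of(m2):
            lo = m1  # for a concave function a maximiser lies to the right of m1
        else:
            hi = m2  # otherwise a maximiser lies to the left of m2
    w = (lo + hi) / 2
    return w, c_of(w)
```

With a single line, `maximise_over_lines([(a1, a2, b)])` should return a weight close to $a_1 / (a_1 + a_2)$, the zero of (48).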

Further, for $a_1, a_2 > 0$, assume that $\frac{a_1}{a_2} > \frac{c_1}{c_2}$. For each $w \in (0, 1)$ and $C(w)$ as above, consider the solution $\tilde{x}(w), \tilde{y}(w), s(w) > 0$ to

$$w \tilde{x}^2 + (1 - w) \tilde{y}^2 = C(w), \tag{53}$$

$$a_1 \tilde{x} + a_2 \tilde{y} = s, \tag{54}$$

$$\frac{w \tilde{x}}{a_1} = \frac{(1 - w) \tilde{y}}{a_2}. \tag{55}$$

Thus the ellipse $w x^2 + (1 - w) y^2 = C(w)$ intersects with the line $a_1 x + a_2 y = s(w)$ at point $(\tilde{x}(w), \tilde{y}(w))$, and $s(w)$ is chosen so that the ellipse is tangential to the line.

Lemma 5. For $w \in (0, 1)$, $s(w)$ is a decreasing function of $w$.

Lemma 5 implies that with an increase in $w$ the distance between the ellipse (50) and any line of the form $a_1 x + a_2 y = b$ for $b > s(w)$ increases.

Proof of Lemma 5: Observe as in the proof of Lemma 4 that

$$C(w) = \frac{d^2}{\frac{c_1^2}{w} + \frac{c_2^2}{1 - w}} = \frac{s(w)^2}{\frac{a_1^2}{w} + \frac{a_2^2}{1 - w}}.$$

Thus,

$$s(w)^2 = \left( \frac{d^2 a_2^2}{c_2^2} \right) \frac{\frac{a_1^2}{a_2^2} + \frac{w}{1 - w}}{\frac{c_1^2}{c_2^2} + \frac{w}{1 - w}}.$$

Since $\frac{a_1^2}{a_2^2} > \frac{c_1^2}{c_2^2}$, it follows that $s(w)$ is a decreasing function of $w$.

4 An asymptotically optimal algorithm

In this section, we present a $\delta$-PAC algorithm (Algorithm 1) for the PI problem which, under some conditions, achieves asymptotically optimal mean termination time as $\delta \to 0$. Both the algorithm and its analysis closely follow the best arm identification algorithm in Garivier and Kaufmann (2016). The sampling and stopping rules used in the algorithm (described below) are inspired by the lower bound Problem LB.

Some notation: In Problem LB (with SPEF distributions replaced by the corresponding means), let $W^*(\mu)$ and $C^*(\mu)$ respectively denote the optimal solution set and optimal value. Also, let $V^*(\mu, w)$ and $g(\mu, w)$ respectively denote the optimal solution set and optimal value of the inner sub-problem.

Sampling Rule

The basic idea is to draw samples according to estimated optimal sampling ratios obtained by solving the lower bound problem with empirical means of the parameters. In other words, if $\hat{\mu}(t)$ is the vector of empirical means of the arms at time $t$, an arm is chosen to bring the ratio of the total number of samples for all the arms closer to an optimal ratio $\hat{w}(t) \in W^*(\hat{\mu}(t))$. But this simple strategy may erroneously give too few samples to an arm because of bad initial estimates, preventing convergence to the correct value in subsequent time-slots. This difficulty can be dealt with through forced exploration of each arm to ensure sufficiently fast convergence. This idea was used in Garivier and Kaufmann (2016) for the best arm problem, where they propose two rules, C-Tracking and D-Tracking, which ensure convergence to the correct sampling ratio. We use the D-Tracking rule they propose as the sampling rule in our algorithm.

The rule can be described as follows. Let $N_i(t)$ denote the number of samples of arm $i$ at time $t$ for all $i$ and let $\hat{w}(t) \in W^*(\hat{\mu}(t))$. If there exists an arm $i$ such that $N_i(t) < \sqrt{t} - K/2$, choose that arm. Otherwise, choose an arm that has the maximum difference between the estimated optimal ratio and the actual fraction of samples, i.e., an arm is chosen from $\arg\max_i \left( \hat{w}_i(t) - N_i(t)/t \right)$. This sampling rule has the following properties: (i) each arm gets $\Omega(\sqrt{t})$ samples, (ii) if the estimated sampling ratios converge to an optimal ratio, then the actual fraction of samples also converges to the same optimal ratio. We state below the result from Garivier and Kaufmann (2016) which shows these properties.

Lemma 6. The D-Tracking rule ensures that $\min_i N_i(t) \ge (\sqrt{t} - K/2)_+ - 1$ and that for all $\epsilon > 0$, for all $t_0$, there exists $t_\epsilon \ge t_0$ such that if $\sup_{t \ge t_0} \max_i |\hat{w}_i(t) - w_i| \le \epsilon$ for some $w \in \mathcal{P}_K$, then

$$\sup_{t \ge t_\epsilon} \max_i \left| \frac{N_i(t)}{t} - w_i \right| \le 3(K - 1)\epsilon.$$

Stopping Rule

The stopping rule uses a threshold rule that imitates the lower bound (1). It first finds the partition in which the empirical mean vector $\hat{\mu}(t)$ lies. Denote this partition by $A(t)$. If

$$\inf_{\nu \in A^c(t)} \sum_i N_i(t) K_i(\hat{\mu}_i(t) \mid \nu_i) \ge \beta(t, \delta),$$

then it stops and declares $A(t)$ as the partition in which $\mu$ lies. Otherwise it continues to sample arms according to the D-Tracking rule. We will set the threshold $\beta(t, \delta) = \log\left( \frac{ct}{\delta} \right)$, where $c$ is some constant.

Algorithm 1 Algorithm for one parameter exponential families
  At time $t$:
    Compute weights $w^*(\hat{\mu}(t - 1))$ and sample according to the D-Tracking rule.  (Sampling Rule)
    Let $\hat{\mu}(t) \in A(t)$.
    if $\inf_{\nu \in A^c(t)} \sum_i N_i(t) K_i(\hat{\mu}_i(t) \mid \nu_i) \ge \beta(t, \delta)$ then  (Termination Rule)
      Declare $\mu \in A(t)$.
    end if

Sample Complexity Analysis

Let $T_U(\delta)$ be the time at which Algorithm 1 terminates. Then we have the following guarantee.

Theorem 5. Suppose that $\Omega \subset \mathcal{U}$ and Assumptions 1 and 2 hold. If Problem LB has a unique optimal solution, i.e., if $|W^*(\mu)| = 1$, then Algorithm 1 is a $\delta$-PAC algorithm with

$$\limsup_{\delta \to 0} \frac{E[T_U(\delta)]}{\log\left( \frac{1}{\delta} \right)} \le T^*(\mu).$$

As seen in Section 3, for threshold crossing, and when $A^c$ is a half-space or a union of half-spaces, Problem LB has a unique optimal solution. We also observed that when $A^c$ is a closed convex set and the associated $\nu^* \in A^c$ is a smooth point (with a unique supporting hyperplane), then Problem LB again has a unique optimal solution.

Before we prove Theorem 5, we first prove the following continuity result.

Lemma 7. Under the conditions of Theorem 5, the function $g$ is continuous at $(\mu, w)$ for any $w \in \mathcal{P}_K$. Further, if Problem LB has a unique optimal solution, then this solution is continuous at $\mu$.

Proof. First suppose that $\overline{A^c}$ is compact. The fact that $g$ is continuous at $(\mu, w)$ follows from the continuity results for non-linear programs. Specifically, since the objective function is continuous in $\nu$ and $\overline{A^c}$ is compact, Theorem 2.1 in Fiacco and Ishizuka (1990) implies that $g$ is continuous at $(\mu, w)$. Now consider non-compact $\overline{A^c}$ and define

$$g_n(\mu, w) = \inf_{\nu \in \overline{A^c} \cap B_n} \sum_{i=1}^K w_i K_i(\mu_i \mid \nu_i)$$

for each $n$, where $B_n$ is a closed Euclidean ball of radius $n$ centred at $\mu$, and $n$ is taken to be sufficiently large so that $\overline{A^c} \cap B_n$ is non-empty. Then, $g_n(\mu, w)$ is continuous in $(\mu, w)$ and decreases with $n$ to $g(\mu, w)$. Since this convergence is uniform, it follows that $g(\mu, w)$ is continuous in $(\mu, w)$.

To see that the optimal solution to Problem LB is continuous at $\mu$ if $W^*(\mu)$ is a singleton, note that the problem is equivalent to $\max_{w \in \mathcal{P}_K} g(\mu, w)$. Since $g(\mu, \cdot)$ is continuous on $\mathcal{P}_K$ and $W^*(\mu)$ is a singleton, from Theorem 2.2 in Fiacco and Ishizuka (1990), we conclude that the optimal solution is continuous at $\mu$.

Proof of Theorem 5. Without loss of generality, let $\mu \in A$. We first prove that the probability of error is at most $\delta$.

$$P_\mu[\text{error}] \le P_\mu\Big[ \exists\, t \ge 1 : \inf_{\nu \in A} \sum_i N_i(t) K_i(\hat{\mu}_i(t) \mid \nu_i) \ge \beta(t, \delta) \Big]$$

$$\le \sum_{t=1}^\infty P_\mu\Big[ \sum_i N_i(t) K_i(\hat{\mu}_i(t) \mid \mu_i) \ge \beta(t, \delta) \Big]$$

$$\le \sum_{t=1}^\infty e^{K+1} \left( \frac{\beta(t, \delta)^2 \log t}{K} \right)^K e^{-\beta(t, \delta)}$$

$$\le \delta$$

if $c$ is chosen large enough such that

$$\sum_{t=1}^\infty \frac{e^{K+1}}{ct} \left( \log^2(ct) \log t \right)^K \le 1.$$

The third inequality above follows from Magureanu et al. (2014), extended from the Bernoulli family to SPEF.

Next, we prove the upper bound on the mean termination time. Fix an $\epsilon > 0$. From the continuity of $w^*$ at $\mu$, there exists $\xi > 0$ such that for any $\mu' \in B_\xi(\mu)$ we have $w^*(\mu') \in B_\epsilon(w^*(\mu))$. For any $T \in \mathbb{N}$, define the event

$$E_T := \bigcap_{t = h(T)}^{T} \left\{ \hat{\mu}(t) \in B_\xi(\mu) \right\}.$$

It is easy to show (see Lemma 19 of Garivier and Kaufmann (2016)) that there exist constants $B, C$ depending on $\epsilon$ and $\mu$ such that

$$P_\mu[E_T^c] \le B \exp\left( -C T^{1/8} \right).$$

Note that $\xi$, $E_T$, $B$, $C$ are all functions of $\epsilon$ and $\mu$. Now, for every $\epsilon > 0$, define

$$C_\epsilon^*(\mu) = \inf_{\mu' \in B_{\xi(\epsilon)}(\mu),\; w' \in B_{3(K-1)\epsilon}(w^*(\mu))} g(\mu', w').$$

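By the continuity of $w^*$ and $g$, we have $\lim_{\epsilon \to 0} C_\epsilon^*(\mu) = C^*(\mu) = (T^*(\mu))^{-1}$. From Lemma 6, for any $\epsilon > 0$, we have for every $T \ge T_\epsilon$ that on $E_T(\epsilon)$,

$$\left\| \frac{N(t)}{t} - w^*(\mu) \right\|_\infty \le 3(K - 1)\epsilon \quad \forall\, t > \sqrt{T},$$

which in turn implies that

$$g\left( \hat{\mu}(t), \frac{N(t)}{t} \right) \ge C_\epsilon^*(\mu) \quad \forall\, t > \sqrt{T}.$$

Since the termination rule in the algorithm is given by

$$g\left( \hat{\mu}(t), \frac{N(t)}{t} \right) \ge \frac{\beta(t, \delta)}{t},$$

for $T \ge T_\epsilon$, on $E_T(\epsilon)$, we have

$$\min(T_U(\delta), T) \le \sqrt{T} + \sum_{t = \sqrt{T}}^{T} \mathbb{1}\left\{ C_\epsilon^*(\mu) < \frac{\beta(t, \delta)}{t} \right\} \le \sqrt{T} + \frac{\beta(T, \delta)}{C_\epsilon^*(\mu)}.$$

Now, let

$$T_0(\delta) := \inf\left\{ T \in \mathbb{N} : \sqrt{T} + \frac{\beta(T, \delta)}{C_\epsilon^*(\mu)} < T \right\}.$$

Therefore, for any $T \ge \max\{T_\epsilon, T_0(\delta)\}$, on $E_T(\epsilon)$, we have $T_U(\delta) < T$, which gives us

$$E[T_U(\delta)] \le \sum_{T=1}^\infty P[T_U(\delta) > T] \le \max\{T_\epsilon, T_0(\delta)\} + \sum_{T=1}^\infty P[E_T(\epsilon)^c].$$

As shown in Garivier and Kaufmann (2016), we have

$$T_0(\delta) = \frac{1}{C_\epsilon^*(\mu)} \left( O(\log(1/\delta)) + o(\log\log(1/\delta)) \right).$$

This gives us

$$\limsup_{\delta \to 0} \frac{E[T_U(\delta)]}{\log\left( \frac{1}{\delta} \right)} \le \frac{1}{C_\epsilon^*(\mu)}.$$

Now, letting $\epsilon$ go to zero, we get

$$\limsup_{\delta \to 0} \frac{E[T_U(\delta)]}{\log\left( \frac{1}{\delta} \right)} \le \lim_{\epsilon \to 0} \frac{1}{C_\epsilon^*(\mu)} = T^*(\mu).$$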
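The sampling and stopping rules analysed in this proof can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the partition-specific statistic $\inf_{\nu \in A^c(t)} \sum_i N_i(t) K_i(\hat{\mu}_i(t) \mid \nu_i)$ is taken as an input called `glr_statistic`, the constant `c` in $\beta(t, \delta)$ is left as a parameter, and ties among under-sampled arms are broken towards the least-sampled arm (a choice made here). The function names are hypothetical.

```python
import math


def d_tracking_arm(counts, w_hat):
    """Choose the next arm under the D-Tracking rule.
    counts[i] is N_i(t), w_hat[i] is the estimated optimal proportion of arm i."""
    t = sum(counts)
    k = len(counts)
    # Forced exploration: arms with N_i(t) < sqrt(t) - K/2.
    starved = [i for i in range(k) if counts[i] < math.sqrt(t) - k / 2]
    if starved:
        return min(starved, key=lambda i: counts[i])  # tie-break is an assumption
    # argmax of w_hat_i - N_i(t)/t, written as t * w_hat_i - N_i(t) to avoid
    # dividing by t (same maximiser for t > 0).
    return max(range(k), key=lambda i: t * w_hat[i] - counts[i])


def threshold(t, delta, c):
    """beta(t, delta) = log(c t / delta)."""
    return math.log(c * t / delta)


def should_stop(glr_statistic, t, delta, c):
    """Stopping rule of Algorithm 1: stop once the statistic reaches beta(t, delta)."""
    return glr_statistic >= threshold(t, delta, c)
```

In a full loop one would, at each round, recompute `w_hat` from the empirical means, call `d_tracking_arm`, update the counts and means with the new sample, and then test `should_stop` with the statistic for the current empirical partition.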
