Supplementary: Battle of Bandits
A Proof of Lemma

Proof. We start by noting that, for any $i \in \{U, V\}$,
$$P(X = i) = \sum_{j \ne i} P\big(X = i \mid \{U, V\} = \{i, j\}\big)\, P\big(\{U, V\} = \{i, j\}\big) = \sum_{j \ne i} Q_{a_i, a_j}\, P\big(\{U, V\} = \{i, j\}\big),$$
where the last equality is by appealing to sampling without replacement and the definition of the win probability for $i$ in the pairwise preference model.

B Regret analysis of Battling-Doubler

We will be using the following regret guarantee of the UCB algorithm for the classical MAB problem in order to prove the regret bounds of Battling-Doubler:

Theorem (UCB Regret [5]). Let $[n]$ denote the set of arms and $\mu_i$ the expected reward associated with arm $i \in [n]$, such that $\mu_1 > \mu_2 \ge \dots \ge \mu_n$, and let $H = \sum_{i \ge 2} \frac{1}{\Delta_i}$, where $\Delta_i = \mu_1 - \mu_i$. Then the expected regret of the UCB algorithm with confidence parameter $\zeta > 0$ is $O(\zeta H \ln T)$.

Proof of Theorem 3

Proof. For any time horizon $T \in \mathbb{Z}_+$, let $B(T)$ denote the supremum of the expected regret of the SBM $S$ after $T$ steps, over all possible reward distributions on the arm set $[n]$. Clearly, the length of epoch $l$ in the algorithm is exactly $2^l$; we denote this by $T_l = 2^l$. Note that for all time steps $t$ inside this epoch, the first $k-1$ arms $a_t(1), \dots, a_t(k-1)$ are chosen uniformly from $L$, and thus the observed reward of the $k$-th arm also comes from a fixed distribution, as argued below: for any $t$ within epoch $l$, clearly $E[\theta(a_t(1))] = E[\theta(a_t(2))] = \dots = E[\theta(a_t(k-1))] = \bar\theta$, since each of the first $k-1$ arms was drawn uniformly from $L$. Now, the SBM $S$ is playing a standard MAB game over the set of arms $[n]$ with binary rewards. Let $b_t$ denote the binary reward of the $k$-th arm $a_t(k)$ in the $t$-th step.
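The theorem above treats the SBM as a black box instantiated with UCB. For concreteness, a minimal UCB1-style sketch of such a single-bandit machine follows; the class name, the advance/feedback interface, and the use of $\zeta$ as the exploration scaling are illustrative assumptions, not the paper's implementation.

```python
import math
import random

class UCBBandit:
    """Minimal UCB1-style single bandit machine (SBM) sketch.

    `zeta` plays the role of the confidence parameter in the theorem; the
    advance()/feedback() interface with binary rewards is assumed here for
    illustration only.
    """

    def __init__(self, n_arms, zeta=2.0):
        self.n_arms = n_arms
        self.zeta = zeta
        self.counts = [0] * n_arms   # times each arm was played
        self.sums = [0.0] * n_arms   # cumulative binary reward per arm
        self.t = 0                   # total rounds so far

    def advance(self):
        """Return the next arm: untried arms first, then the max-UCB-index arm."""
        self.t += 1
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a

        def ucb(a):
            mean = self.sums[a] / self.counts[a]
            return mean + math.sqrt(self.zeta * math.log(self.t) / self.counts[a])

        return max(range(self.n_arms), key=ucb)

    def feedback(self, arm, reward):
        """Record the observed binary reward for the played arm."""
        self.counts[arm] += 1
        self.sums[arm] += reward

# Toy run: arm 0 has the highest Bernoulli mean, so UCB should favour it.
random.seed(0)
means = [0.8, 0.5, 0.4]
bandit = UCBBandit(len(means))
for _ in range(5000):
    arm = bandit.advance()
    bandit.feedback(arm, 1.0 if random.random() < means[arm] else 0.0)
print(bandit.counts)  # arm 0 dominates for this seed
```

With the gap $\Delta = 0.3$ here, the suboptimal arms are pulled only $O(\ln T / \Delta^2)$ times each, which is the behaviour the $O(\zeta H \ln T)$ bound quantifies.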
Clearly, for all $t$ within epoch $l$,
$$E\big[b_t \mid a_t(1), \dots, a_t(k-1)\big] = \frac{1}{2} + \frac{1}{k-1}\sum_{j=1}^{k-1} \frac{\theta(a_t(k)) - \theta(a_t(j))}{2},$$
by the definition of the pairwise-subset choice model. Thus at round $t$, if $a_t(k) = x$ for any $x \in [n]$,
$$E\big[b_t \mid a_t(k) = x\big] = \frac{1}{2} + \frac{\theta(x) - \bar\theta}{2},$$
where $\bar\theta = E_{a \sim \mathrm{Unif}(L)}[\theta(a)]$ denotes the expected utility score of each of the first $k-1$ arms. Now for the SBM $S$, the best arm (the one with the highest expected reward) is still arm 1, where
$$E\big[b_t \mid a_t(k) = 1\big] = \frac{1}{2} + \frac{\theta_1 - \bar\theta}{2}. \qquad (5)$$
By the definition of the bound function $B(T)$, the total expected regret (in the traditional MAB sense) of the SBM $S$ in epoch $l$ is at most $B(T_l) = B(2^l)$, which implies
$$E\Big[\sum_{t \in \text{epoch } l} \frac{\theta_1 - \theta(a_t(k))}{2}\Big] \le B(2^l).$$
The interesting thing to note is that, at any round $t$ of epoch $l$, the expected regret incurred by any of the first $k-1$ arms $a_t(j)$, $j \le k-1$, is exactly the same as the average expected regret contributed by the set of arms drawn as the $k$-th arm in epoch $l-1$: Line 7 selects $a_t(1), \dots, a_t(k-1)$ uniformly from $L$ at every step of epoch $l$ (of length $2^l$), and thus the per-step expected regret of each of the first $k-1$ positions is at most $B(2^{l-1})/2^{l-1}$. This in other words says that the expected contribution of the $k$-th arm to the regret in epoch $l$ is at most $B(2^l)$. It only remains to bound the expected contribution of the remaining arms to the regret, which are drawn from a distribution that assigns to each arm $a \in [n]$ a probability proportional to the frequency with which $a$ was played as the $k$-th arm in the previous epoch $l-1$.
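The epoch schedule above (epoch $l$ has length $2^l$) means that any horizon $T$ splits into a run of full doubling epochs plus a remainder $Z$. A standalone numeric sanity check of that decomposition, with a hypothetical helper name:

```python
def epoch_decomposition(T):
    """Split horizon T into full doubling epochs 2^1, 2^2, ..., 2^s plus a
    remainder Z, so that sum(epoch_lengths) + Z == T.

    Illustrative helper only; the name and interface are not from the paper.
    """
    lengths = []
    l = 1
    while sum(lengths) + 2 ** l <= T:
        lengths.append(2 ** l)
        l += 1
    Z = T - sum(lengths)
    return lengths, Z

lengths, Z = epoch_decomposition(100)
print(lengths, Z)  # [2, 4, 8, 16, 32] 38
```

The per-epoch bounds $B(2^l)$ are then summed over exactly these epoch lengths (plus $B(Z)$ for the partial final epoch) to obtain the total regret bound.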
Hence the total expected regret incurred by any one of the first $k-1$ arms $a_t(j)$, $j \le k-1$, in epoch $l$ is bounded by $2^l \cdot B(2^{l-1})/2^{l-1} = 2B(2^{l-1})$. Since any finite time horizon $T$ can be uniquely decomposed as $T = 2^{s+1} + Z$, for some integer $s$ and $0 \le Z < 2^{s+1}$, the total expected regret of Battling-Doubler is given by the following function of $T$:
$$(k+3)\big(B(2) + B(2^2) + \cdots + B(2^s) + B(Z)\big). \qquad (6)$$
Now, by our assumption $B(t) = O(\beta \ln t)$ for any $t \in \mathbb{Z}_+$, the theorem claim follows by simplifying and bounding the above expression.

Proof of Corollary 4

Note that if the SBM $S$ in Battling-Doubler is UCB operating on the set of $n$ arms, the expected reward of arm $i \in [n]$ is $\frac{1}{2} + \frac{\theta_i - \bar\theta}{2}$ from equation (5). Thus for the SBM $S$, the total inverse-gap complexity of the suboptimal arms becomes $H = \sum_{i=2}^{n} \frac{1}{\Delta_i}$. Now from the UCB regret theorem above, we have $B(t) = O(\zeta H \ln t)$ for any $t \in \mathbb{Z}_+$. Thus we get
$$E[R(T)] = E\Big[\sum_{t=1}^{T}\sum_{j=1}^{k} \frac{\theta_1 - \theta(a_t(j))}{2}\Big] \le (k+3)\big(B(2) + B(2^2) + \cdots + B(2^s) + B(Z)\big) = O(\zeta k H \ln^2 T).$$

Proof of Corollary 5

The claim simply follows from the above theorem and the fact that the worst possible gap for the UCB algorithm can be at most $O\big(\sqrt{n \ln T / T}\big)$, as argued in [8] or [4] (see Corollary 1). The claim now follows by substituting $H = O\big(\sqrt{nT/\ln T}\big)$ in Corollary 4.

C Regret analysis of Battling-MultiSBM

Proof of Theorem 6

Proof. We start by noting that in Battling-MultiSBM, only one SBM advances at each round. Denote by $\rho_x(t)$ the total number of times $S_x$ was advanced after $t$ iterations of the algorithm, for any arm $x \in [n]$. At any round $t$ where $x$ was played as the $k$-th arm of the previous round, the first $k-1$ arms of round $t$ are all set to $x$ (from Line 5 of Battling-MultiSBM); hence $S_x$ internally sees a world in which the rewards are binary and the expected reward of any arm $a \in [n]$ is exactly
$$\frac{1}{2} + \frac{\theta_a - \theta_x}{2}.$$
Note that for all SBMs $S_y$, $y \in [n]$, the best arm (the one with the highest expected reward) to play is arm 1; thus at round $t$ the reward corresponding to the best arm is $\frac{1}{2} + \frac{\theta_1 - \theta_x}{2}$.
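The control flow described above (one SBM per arm; at each round only the SBM indexed by the previous round's $k$-th arm advances, and its choice fills the $k$-th slot) can be sketched as follows. This is a sketch under stated assumptions: the stand-in SBM is epsilon-greedy rather than an $\alpha$-robust policy, and all names and interfaces are hypothetical.

```python
import random

class EpsGreedySBM:
    """Stand-in single-bandit machine (SBM) with an advance()/feedback()
    interface. Illustrative only: the paper's analysis assumes alpha-robust
    SBMs (e.g. UCB-type algorithms), not epsilon-greedy."""
    def __init__(self, n_arms, eps=0.1):
        self.n_arms, self.eps = n_arms, eps
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
    def advance(self):
        if random.random() < self.eps:
            return random.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.means[a])
    def feedback(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def battling_multisbm(theta, T, seed=0):
    """Sketch of the Battling-MultiSBM control flow: one SBM per arm; at round
    t only the SBM indexed by the previous round's k-th arm advances, and its
    choice becomes the next k-th arm. Returns how often each arm filled the
    k-th slot."""
    random.seed(seed)
    n = len(theta)
    sbms = [EpsGreedySBM(n) for _ in range(n)]
    prev = random.randrange(n)          # previous round's k-th arm
    plays = [0] * n
    for _ in range(T):
        cur = sbms[prev].advance()      # only S_prev advances this round
        plays[cur] += 1
        # Binary reward of the k-th slot has mean 1/2 + (theta[cur] - theta[prev]) / 2,
        # matching the internal view of S_prev described in the proof.
        p = 0.5 + (theta[cur] - theta[prev]) / 2
        b = 1.0 if random.random() < p else 0.0
        sbms[prev].feedback(cur, b)
        prev = cur
    return plays

plays = battling_multisbm([0.9, 0.2, 0.2], T=10000)
print(plays)
```

With a clear score gap as in this toy instance, `plays[0]` typically dominates, mirroring the claim that no suboptimal SBM is advanced too often and no SBM plays a suboptimal arm too often.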
Now it is easy to see that for all SBMs $S_y$, $y \in [n]$, the suboptimality gaps of the arms are the same: the suboptimality associated with arm $a$ is $\Delta_a = \frac{\theta_1 - \theta_a}{2}$, for all arms $a \in [n]$.

The following is the key observation in this proof: the total regret incurred by Battling-MultiSBM can be written as
$$R(T) = \sum_{t=1}^{T}\sum_{j=1}^{k}\frac{\theta_1 - \theta(a_t(j))}{2} = \sum_{t=1}^{T}\Big[(k-1)\,\frac{\theta_1 - \theta(a_{t-1}(k))}{2} + \frac{\theta_1 - \theta(a_t(k))}{2}\Big] \le (k-1)\Delta_{\max} + k\sum_{t=1}^{T}\frac{\theta_1 - \theta(a_t(k))}{2}, \qquad (7)$$
where $\Delta_{\max} = \max_{a \in [n]} \frac{\theta_1 - \theta_a}{2}$, since the first $k-1$ arms of round $t$ are copies of the $k$-th arm of round $t-1$. This essentially conveys that bounding the above regret is equivalent to bounding the regret of each SBM $S_y$, $y \in [n]$. In other words, it equivalently says that any suboptimal SBM $S_y$, $y \in [n] \setminus \{1\}$, cannot be advanced too many times, and any SBM cannot play a suboptimal arm $a \in [n] \setminus \{1\}$ too many times. The rest of the proof justifies this claim.

Let us denote by $\rho_x(T) = \sum_{t=1}^{T} \mathbf{1}\{a_{t-1}(k) = x\}$ the total number of times SBM $S_x$ is called, and by $\rho_{xy}(T) = \sum_{t=1}^{T} \mathbf{1}\{a_{t-1}(k) = x,\ a_t(k) = y\}$ the total number of times SBM $S_x$ played arm $y$. We also denote by $R_x(T) = |\{t : a_{t-1}(k) = x\}| \cdot \Delta_x$ the regret incurred due to advancing SBM $S_x$ up to time $T$; in other words, this is the contribution of the $k$-th bandit's choices to the regret at all times $t$ for which the $k$-th arm was previously chosen to be $x$ and $S_x$'s internal counter has not surpassed $T$. Similarly, we denote by $R_{xy}(T) = |\{t : a_{t-1}(k) = x,\ a_t(k) = y\}| \cdot \Delta_y$ the regret incurred due to SBM $S_x$ playing the arm $y \in [n]$ up to time $T$. Clearly $R_{xy}(T) = \rho_{xy}(T)\,\Delta_y$.

From equation (7) it is now easy to see that, in order to bound the expected regret $E[R(T)]$, it suffices to bound the expressions $E[R_{xy}(\rho_x(T))]$ for $x, y \in [n]$. Note that here we assume that each SBM policy used in Battling-MultiSBM is $\alpha$-robust, which will be used crucially in the proof. The main insight is to exploit the fact that, due to the $\alpha$-robustness of the SBMs used, $\rho_x(T)$ is of order $\ln T$ for any suboptimal $x$.

We begin with the observation that, for any fixed $x, y \in [n]$ with $x$ suboptimal, if we choose $\alpha > 1$ and $s \ge 4\alpha$, the $\alpha$-robustness property of the SBM gives
$$P\Big(R_{xy}(T) \ge \frac{s \ln T}{\Delta_y}\Big) = P\Big(\rho_{xy}(T) \ge \frac{s \ln T}{\Delta_y^2}\Big) \le \Delta_y^{\alpha}\,(s \ln T)^{-\alpha} \le (s \ln T)^{-\alpha}, \qquad (8)$$
where the last inequality follows from the fact that $\Delta_y \le 1$ for all $y \in [n]$ and $\alpha \ge 1$. Then, under the same assumptions on $s$ and $x, y$, a union bound over the time points $e, e^e, \dots, e^{e^{\ln\ln T}}$ gives
$$P\Big(\exists\, p \in \{0, 1, \dots, \ln\ln T\} : R_{xy}\big(e^{e^p}\big) \ge \frac{s\,e^{p}}{\Delta_y}\Big) \le 2\,s^{-\alpha}.$$

We now bound the quantity $\rho_x(T)$ for any non-optimal fixed $x$, using the fact that all $z \in [n]$ satisfy $\rho_z(T) \le T$, and that any SBM $S_x$ is advanced in an iteration only if $x$ was the $k$-th arm in the previous round. Thus we have that for all suboptimal $x \in [n] \setminus \{1\}$,
$$P\Big(\rho_x(T) \ge \frac{s\,n \ln T}{\Delta_x}\Big) \le \sum_{z \in [n]} P\Big(\rho_{zx}(T) \ge \frac{s \ln T}{\Delta_x}\Big) = \sum_{z \in [n]} P\big(R_{zx}(T) \ge s \ln T\big) \le n\,(s \ln T)^{-\alpha}, \qquad (9)$$
where the rightmost inequality is by a union bound and (8).

Now fix some $x, y \in [n]$ such that $x$ is suboptimal. The last two inequalities give rise to a random variable $Z$, defined as the minimal scalar for which we have, for all $T \in \{e, e^e, e^{e^e}, \dots, e^{e^{\ln\ln T}}\}$,
$$\rho_x(T) \le \frac{Z\,n \ln T}{\Delta_x} \quad \text{and} \quad R_{xy}(T) \le \frac{Z \ln T}{\Delta_y}.$$
By (8) and (9), we have that for all $s \ge 4\alpha$, $P(Z \ge s) \le s^{-\alpha} + n\,(s \ln T)^{-\alpha}$. Also, conditioned on the event $\{Z \le s\}$, we have
$$R_{xy}(\rho_x(T)) \le \bar R_{xy}(s) := \frac{s\,e}{\Delta_y}\,\ln\Big(\frac{s\,n \ln T}{\Delta_x}\Big).$$
Combining the above we get
$$E\big[R_{xy}(\rho_x(T))\big] \le O\big(\bar R_{xy}(8\alpha)\big) + \sum_{i \ge 1} \bar R_{xy}(8\alpha + i)\Big((4\alpha + i)^{-\alpha} + n\,(4\alpha + i)^{-\alpha}(\ln T)^{-\alpha}\Big).$$
Now, since $\alpha \ge \max\big\{3,\ 1 + \frac{\ln n}{\ln\ln T}\big\}$, it can be verified that the last expression converges to $O\big(\bar R_{xy}(8\alpha)\big)$, and hence
$$E\big[R_{xy}(\rho_x(T))\big] = O\Big(\frac{\alpha}{\Delta_y}\Big(\ln\ln T + \ln n + \ln\frac{1}{\Delta_x}\Big)\Big). \qquad (10)$$

Combining the above and using (7), the total expected regret $E[R(T)]$ is at most
$$(k-1)\Delta_{\max} + E\big[R_1(\rho_1(T))\big] + \sum_{x \in [n]\setminus\{1\},\ y \in [n]} E\big[R_{xy}(\rho_x(T))\big].$$
The desired regret bound now follows from (10).

Proof of Corollary 7

Similar to the case of Battling-Doubler (Corollary 5), the current claim simply follows from the regret guarantee of Battling-MultiSBM and the fact that the worst-case gap can be at most $O\big(\sqrt{n \ln T / T}\big)$, as argued in [8] or [4] (see Corollary 1). The claim now follows by substituting $H = O\big(\sqrt{nT/\ln T}\big)$ in Theorem 6.

D Regret analysis of Battling-Duel

Proof of Theorem 8

Proof. Without loss of generality we will assume that the Condorcet arm is arm 1 throughout the proof. Our goal is to analyse the regret of Battling-Duel in terms of its underlying dueling bandit algorithm. Consider $Q'$ to be the pairwise preference matrix perceived by the dueling bandit algorithm $D$: upon playing any pair of items $(x_t, y_t) \in [n] \times [n]$, it receives feedback according to the pairwise preference $P(x_t \text{ beats } y_t) = Q'_{x_t, y_t}$. We know that the regret seen by the underlying dueling bandit algorithm is given by
$$R_T^{DB} = \sum_{t=1}^{T} \frac{Q'_{1,x_t} + Q'_{1,y_t} - 1}{2}.$$
We now start by analysing the relation between $Q'$ and $Q$.

Case 1: $k$ is even. This is the easy case, since at any round $t$ both $x_t$ and $y_t$ are replicated exactly $k/2$ times. Thus at any round $t$,
$$Q'(x_t, y_t) = P\big(\text{the winning index } i \in S_t \text{ is a copy of } x_t\big) = \frac{1}{2}\big(Q_{x_t,x_t} + Q_{x_t,y_t}\big) = \frac{Q_{x_t,y_t}}{2} + \frac{1}{4},$$
where the second equality follows from the definition of the pairwise-subset choice model (see the Lemma for details) and the last equality follows from the fact that $Q_{i,i} = 1/2$ for all $i \in [n]$. Similarly we get
$$Q'(y_t, x_t) = \frac{1}{2}\big(Q_{y_t,y_t} + Q_{y_t,x_t}\big) = \frac{Q_{y_t,x_t}}{2} + \frac{1}{4}.$$
Note that the above expressions hold true for any pair of items $(x_t, y_t) \in [n] \times [n]$. It is also worth noting that these indeed give $Q'_{i,j} + Q'_{j,i} = 1$ and $Q'_{i,i} = 1/2$ for all $i, j \in [n]$. Thus for any pair of items $(i, j)$, we have $Q'_{i,j} = Q_{i,j}/2 + 1/4$.

Now let us analyse the instantaneous regret of Battling-Duel at round $t$. Using our definition of regret as defined in (3), this gives
$$r_t^{BB}(BD) = \frac{Q_{1,x_t} + Q_{1,y_t}}{2} - \frac{1}{2} = \big(Q'_{1,x_t} + Q'_{1,y_t} - 1\big) = 2\, r_t^{DB}(D),$$
where the second equality follows from the relation $Q'_{i,j} = Q_{i,j}/2 + 1/4$ derived above. Thus, summing over $t = 1, \dots, T$, the cumulative regret of Battling-Duel (BD) over $T$ rounds becomes
$$R_T^{BB}(BD) = \sum_{t=1}^{T} r_t^{BB}(BD) = 2\sum_{t=1}^{T} r_t^{DB}(D) = 2\, R_T^{DB}(D),$$
and the claim follows. Now let us consider the case when $k$ is odd.

Case 2: $k$ is odd. Note that, similar to the case before, at any round $t$ the item $x_t$ is now replicated $(k+1)/2$ times and $y_t$ is replicated $(k-1)/2$ times. The same computation as in Case 1, carried out with these unequal replication counts, again yields $Q'(x_t, y_t)$ and $Q'(y_t, x_t)$ as affine functions of $Q_{x_t,y_t}$ and $Q_{y_t,x_t}$ respectively, and it is easy to verify that, as desired, $Q'_{x_t,y_t} + Q'_{y_t,x_t} = 1$. Analysing the instantaneous regret of Battling-Duel at round $t$ exactly as in Case 1 then gives $r_t^{BB}(BD) = O\big(r_t^{DB}(D)\big)$, and summing over $t = 1, \dots, T$, the cumulative regret of Battling-Duel over $T$ rounds becomes $R_T^{BB}(BD) = O\big(R_T^{DB}(D)\big)$, and the claim for Case 2 follows.

Proof of Corollary 9

Proof. The result is immediate from Theorem 8 along with the expected regret guarantee of the RUCB algorithm as derived in Theorem 5 of [4].

Proof of Corollary

Proof. The proof immediately follows from Theorem 10 along with the regret lower bound for any DB problem as derived in [3]. This is because any smaller regret for $A^{BB}$ would violate the best achievable regret for DB, which is a logical contradiction.

E Datasets for experiments

E.1 Parameters for linear-subset choice model

For synthetic experiments with the linear-subset choice model, we use the following four different utility score vectors $\theta \in [0,1]^n$: 1. arith, 2. geom, 3. g2 and 4. g3. Both the arith and geom utility score vectors have $n = 8$ items, with item 1 as the best (Condorcet) item with the highest score, i.e., $\theta_1 > \max_{i \ge 2} \theta_i$, and the rest of the $\theta_i$'s are in an arithmetic or geometric progression respectively, as their names suggest. The two score vectors are described in Table 1.

Table 1: Parameters for linear-subset choice model (arith and geom)

The next two utility score vectors have $n = 15$ items each. Similarly as before, item 1 is the Condorcet winner here as well, with $\theta_1 > \max_{i \ge 2} \theta_i$. More specifically, for g2 the individual score vector is of the form
$$\theta_i = \begin{cases} 0.8, & \text{if } i = 1,\\ 0.1, & i \in [15] \setminus \{1\}. \end{cases}$$
For g3 the individual score vector is of the form
$$\theta_i = \begin{cases} 0.8, & \text{if } i = 1,\\ 0.7, & i \in [8] \setminus \{1\},\\ 0.6, & \text{otherwise.} \end{cases}$$
We also used a bigger version of the g2 dataset with $n = 50$ items to run experiments with varying subset size, as shown in Figure 5 (Section 5.3). Its individual score vector is of the form
$$\theta_i = \begin{cases} 0.8, & \text{if } i = 1,\\ 0.1, & i \in [50] \setminus \{1\}. \end{cases}$$

E.2 Pairwise preference matrices used in synthetic experiments with pairwise-subset choice model

We run experiments on two synthetic preference matrices, arith-pref and arxiv-pref, for the pairwise-subset choice model. The datasets are shown in Tables 3 and 4 respectively.
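The utility scores of Section E.1 induce choice and pairwise preference probabilities that can be computed directly. A minimal sketch, assuming the standard linear (multinomial-logit-style) choice form $P(i \mid S) = \theta_i / \sum_{j \in S} \theta_j$; the functional form and helper name are assumptions for illustration, not necessarily the paper's exact model:

```python
def linear_choice_probs(theta, subset):
    """Winning probability of each item in `subset` under a linear
    (MNL-style) subset choice model: P(i | S) = theta_i / sum_{j in S} theta_j.
    Assumed functional form, for illustration only."""
    total = sum(theta[i] for i in subset)
    return {i: theta[i] / total for i in subset}

# g2-style scores: item 0 plays the role of the Condorcet winner.
theta = [0.8] + [0.1] * 14
probs = linear_choice_probs(theta, subset=[0, 1, 2])
print(probs)

# Pairwise preference of item 0 over any other item j reduces to
# theta_0 / (theta_0 + theta_j), which exceeds 1/2 whenever theta_0 > theta_j.
p01 = linear_choice_probs(theta, subset=[0, 1])[0]
print(round(p01, 4))  # 0.8889
```

Under this form, item 1 having the strictly largest score immediately makes it the Condorcet winner, which is the property all four score vectors are designed to satisfy.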
Arith preference dataset

Table 3: arith-pref: Arith preference matrix

Arxiv preference dataset

Table 4: arxiv-pref: Arxiv preference matrix

E.3 Pairwise preference matrices used in real world experiments with pairwise-subset choice model

Hudry dataset
The Hudry tournament data is a well-studied tournament on 13 items, which has the special property of being the smallest tournament for which the Banks and Copeland sets are different, as shown in [6]. The original dataset does not contain a Condorcet item; hence we delete three of the 13 items so that the resulting preference matrix contains a Condorcet winner. The preference matrix is shown in Table 5.

Car dataset
This dataset contains pairwise preferences over 10 cars given by 60 users, where each car has 4 features. From the user preferences, we compute the underlying pairwise preference matrix Q, where Q_ij is the empirical fraction of times car i is preferred over car j by the users. The preference matrix obtained in this way for Car is given in Table 6.

Sushi dataset
This dataset contains over 100 sushis rated according to user preferences, where each sushi is represented by 7 features. Similar to Car, we construct the underlying pairwise preference matrix Q such that Q_ij is the empirical fraction of times sushi type i is preferred over sushi type j. We further sample 16 sushis out of these 100 such that the underlying preference matrix contains a Condorcet winner. The dataset obtained is given in Table 7.

Tennis dataset
The tennis preference matrix compiles the all-time win-loss results of tennis matches among 8 international tennis players, as recorded by the Association of Tennis Professionals (ATP). For each pair of players (arms), say i and j, Q_ij is set to be the fraction of matches between i and j that were won by i. The dataset is adopted from the Tennis data used by [6], which has item 1 (player 1) as its Condorcet winner. The resulting pairwise preference matrix is given in Table 8.

Table 5: Hudry Dataset
Table 6: Car Dataset
Table 7: Sushi Dataset
Table 8: Tennis Dataset
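All the preference matrices above are constructed (or pruned) so as to contain a Condorcet winner, i.e., an item i with Q_ij > 1/2 for all j ≠ i. A small standalone check of this property, with a hypothetical helper name:

```python
def condorcet_winner(Q):
    """Return the index of the Condorcet winner of pairwise preference matrix
    Q (with Q[i][j] = P(i beats j)), or None if no such item exists."""
    n = len(Q)
    for i in range(n):
        if all(Q[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None

# Toy 3-item matrix: item 0 beats everyone, so it is the Condorcet winner.
Q = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
print(condorcet_winner(Q))        # 0

# A cyclic matrix (0 beats 1, 1 beats 2, 2 beats 0) has no Condorcet winner.
Q_cycle = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
print(condorcet_winner(Q_cycle))  # None
```

The cyclic example illustrates why, e.g., the Hudry tournament required deleting items before use: tournaments in general need not admit a Condorcet winner.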
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationMath 421, Homework #9 Solutions
Math 41, Homework #9 Solutions (1) (a) A set E R n is said to be path connected if for any pair of points x E and y E there exists a continuous function γ : [0, 1] R n satisfying γ(0) = x, γ(1) = y, and
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationThe Epoch-Greedy Algorithm for Contextual Multi-armed Bandits John Langford and Tong Zhang
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits John Langford and Tong Zhang Presentation by Terry Lam 02/2011 Outline The Contextual Bandit Problem Prior Works The Epoch Greedy Algorithm
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationLearning from paired comparisons: three is enough, two is not
Learning from paired comparisons: three is enough, two is not University of Cambridge, Statistics Laboratory, UK; CNRS, Paris School of Economics, France Paris, December 2014 2 Pairwise comparisons A set
More informationThe Goldbach Conjecture An Emergence Effect of the Primes
The Goldbach Conjecture An Emergence Effect of the Primes Ralf Wüsthofen ¹ October 31, 2018 ² ABSTRACT The present paper shows that a principle known as emergence lies beneath the strong Goldbach conjecture.
More informationLecture 2: Minimax theorem, Impagliazzo Hard Core Lemma
Lecture 2: Minimax theorem, Impagliazzo Hard Core Lemma Topics in Pseudorandomness and Complexity Theory (Spring 207) Rutgers University Swastik Kopparty Scribe: Cole Franks Zero-sum games are two player
More informationStratégies bayésiennes et fréquentistes dans un modèle de bandit
Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016
More informationBURGESS INEQUALITY IN F p 2. Mei-Chu Chang
BURGESS INEQUALITY IN F p 2 Mei-Chu Chang Abstract. Let be a nontrivial multiplicative character of F p 2. We obtain the following results.. Given ε > 0, there is δ > 0 such that if ω F p 2\F p and I,
More informationGambling in a rigged casino: The adversarial multi-armed bandit problem
Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at
More informationNotes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed
CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed
More informationWE INTRODUCE and study a collaborative online learning
IEEE/ACM TRANSACTIONS ON NETWORKING Collaborative Learning of Stochastic Bandits Over a Social Network Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan Abstract We consider a collaborative online
More informationThe Multi-Armed Bandit Problem
Università degli Studi di Milano The bandit problem [Robbins, 1952]... K slot machines Rewards X i,1, X i,2,... of machine i are i.i.d. [0, 1]-valued random variables An allocation policy prescribes which
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationSaturday, September 7, 2013 TEST BOOKLET. Test Version A. Your test version (A, B, C, or D) is above on this page.
AdvAntAge testing FoundAtion MAth The Fifth Prize Annual For girls Math Prize for Girls Saturday, September 7, 2013 TEST BOOKLET Test Version A DIRECTIONS 1. Do not open this test until your proctor instructs
More informationTHE STRUCTURE OF RAINBOW-FREE COLORINGS FOR LINEAR EQUATIONS ON THREE VARIABLES IN Z p. Mario Huicochea CINNMA, Querétaro, México
#A8 INTEGERS 15A (2015) THE STRUCTURE OF RAINBOW-FREE COLORINGS FOR LINEAR EQUATIONS ON THREE VARIABLES IN Z p Mario Huicochea CINNMA, Querétaro, México dym@cimat.mx Amanda Montejano UNAM Facultad de Ciencias
More informationCOMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017
COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SEQUENTIAL DATA So far, when thinking
More informationIntroduction to Information Entropy Adapted from Papoulis (1991)
Introduction to Information Entropy Adapted from Papoulis (1991) Federico Lombardo Papoulis, A., Probability, Random Variables and Stochastic Processes, 3rd edition, McGraw ill, 1991. 1 1. INTRODUCTION
More information1 Closest Pair of Points on the Plane
CS 31: Algorithms (Spring 2019): Lecture 5 Date: 4th April, 2019 Topic: Divide and Conquer 3: Closest Pair of Points on a Plane Disclaimer: These notes have not gone through scrutiny and in all probability
More informationProblem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)
Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationSequences, their sums and Induction
Sequences, their sums and Induction Example (1) Calculate the 15th term of arithmetic progression, whose initial term is 2 and common differnce is 5. And its n-th term? Find the sum of this sequence from
More informationNOTES ON HYPERBOLICITY CONES
NOTES ON HYPERBOLICITY CONES Petter Brändén (Stockholm) pbranden@math.su.se Berkeley, October 2010 1. Hyperbolic programming A hyperbolic program is an optimization problem of the form minimize c T x such
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationTwo optimization problems in a stochastic bandit model
Two optimization problems in a stochastic bandit model Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishnan Journées MAS 204, Toulouse Outline From stochastic optimization
More informationThe Goldbach Conjecture An Emergence Effect of the Primes
The Goldbach Conjecture An Emergence Effect of the Primes Ralf Wüsthofen ¹ April 20, 2018 ² ABSTRACT The present paper shows that a principle known as emergence lies beneath the strong Goldbach conjecture.
More informationDiscover Relevant Sources : A Multi-Armed Bandit Approach
Discover Relevant Sources : A Multi-Armed Bandit Approach 1 Onur Atan, Mihaela van der Schaar Department of Electrical Engineering, University of California Los Angeles Email: oatan@ucla.edu, mihaela@ee.ucla.edu
More informationOnline Learning with Abstention. Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 2016a).
+ + + + + + Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 206a). A. Further Related Work Learning with abstention is a useful paradigm in applications where the cost
More informationIrredundant Families of Subcubes
Irredundant Families of Subcubes David Ellis January 2010 Abstract We consider the problem of finding the maximum possible size of a family of -dimensional subcubes of the n-cube {0, 1} n, none of which
More information15.S24 Sample Exam Solutions
5.S4 Sample Exam Solutions. In each case, determine whether V is a vector space. If it is not a vector space, explain why not. If it is, find basis vectors for V. (a) V is the subset of R 3 defined by
More informationarxiv: v1 [cs.lg] 21 Sep 2018
SIC - MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits arxiv:809.085v [cs.lg] 2 Sep 208 Etienne Boursier and Vianney Perchet,2 CMLA, ENS Cachan, CNRS, Université Paris-Saclay,
More informationLecture 4: Two-point Sampling, Coupon Collector s problem
Randomized Algorithms Lecture 4: Two-point Sampling, Coupon Collector s problem Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized Algorithms
More information