A Notation

Symbol: Definition

$\alpha(d)$: Attraction probability of item $d$
$\alpha_{\max}$: Highest attraction probability, $\alpha(1)$
$A$: Binary attraction vector, where $A(d)$ is the attraction indicator of item $d$
$P_\alpha$: Distribution over binary attraction vectors
$\mathcal{A}$: Set of active batches
$b_{\max}$: Index of the last created batch
$\mathcal{B}_{b,\ell}$: Items in stage $\ell$ of batch $b$
$c_t(k)$: Indicator of the click on position $k$ at time $t$
$c_{b,\ell}(d)$: Number of observed clicks on item $d$ in stage $\ell$ of batch $b$
$\hat{c}_{b,\ell}(d)$: Estimated probability of clicking on item $d$ in stage $\ell$ of batch $b$
$\bar{c}_{b,\ell}(d)$: Probability of clicking on item $d$ in stage $\ell$ of batch $b$, $\mathbb{E}[\hat{c}_{b,\ell}(d)]$
$\mathcal{D}$: Ground set of items $[L]$ such that $\alpha(1) \geq \dots \geq \alpha(L)$
$\delta_T$: $\log T + 3 \log\log T$
$\tilde{\Delta}_\ell$: $2^{-\ell}$
$I_b$: Interval of positions in batch $b$
$K$: Number of positions to display items
$\mathrm{len}(b)$: Number of positions to display items in batch $b$
$L$: Number of items
$L_{b,\ell}(d)$: Lower confidence bound of item $d$ in stage $\ell$ of batch $b$
$n_\ell$: Number of times that each item is observed in stage $\ell$
$n_{b,\ell}(d)$: Number of observations of item $d$ in stage $\ell$ of batch $b$
$\Pi_K(\mathcal{D})$: Set of all $K$-tuples with distinct elements from $\mathcal{D}$
$r(\mathcal{R}, A, X)$: Reward of list $\mathcal{R}$, for attraction and examination indicators $A$ and $X$
$r(\mathcal{R}, \alpha, \chi)$: Expected reward of list $\mathcal{R}$
$\mathcal{R} = (d_1, \dots, d_K)$: List of $K$ items, where $d_k$ is the $k$-th item in $\mathcal{R}$
$\mathcal{R}^* = (1, \dots, K)$: Optimal list of $K$ items
$R(\mathcal{R}, A, X)$: Regret of list $\mathcal{R}$, for attraction and examination indicators $A$ and $X$
$R(T)$: Expected cumulative regret in $T$ steps
$T$: Horizon of the experiment
$U_{b,\ell}(d)$: Upper confidence bound of item $d$ in stage $\ell$ of batch $b$
$\chi(\mathcal{R}, k)$: Examination probability of position $k$ in list $\mathcal{R}$
$\chi(k)$: Examination probability of position $k$ in the optimal list $\mathcal{R}^*$
$X$: Binary examination matrix, where $X(\mathcal{R}, k)$ is the examination indicator of position $k$ in list $\mathcal{R}$
$P_\chi$: Distribution over binary examination matrices
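To make the notation concrete, the sketch below evaluates the expected reward $r(\mathcal{R}, \alpha, \chi)$ and the per-step regret of a short list under a factored examination-attraction model, i.e. assuming the expected reward is $\sum_{k} \chi(\mathcal{R}, k)\,\alpha(d_k)$ with a fixed examination probability per position. The factored form and the toy probabilities are illustrative assumptions made here; the appendix itself only relies on the reward being linear in individual items (see the proof of Lemma 7).

```python
from typing import Sequence

def expected_reward(ranking: Sequence[int], alpha: Sequence[float], chi: Sequence[float]) -> float:
    """Expected reward of a list R = (d_1, ..., d_K): sum over positions of the examination
    probability chi(k) times the attraction probability alpha(d_k).
    (Factored form assumed for illustration; each position is examined independently of the item.)"""
    return sum(chi[k] * alpha[d] for k, d in enumerate(ranking))

# Toy example with L = 4 items and K = 2 positions; alpha is sorted, so the optimal list is (0, 1).
alpha = [0.9, 0.7, 0.4, 0.2]   # attraction probabilities alpha(d); alpha_max = alpha[0]
chi = [1.0, 0.6]               # examination probabilities chi(k) of the optimal list
optimal = [0, 1]
suboptimal = [0, 3]
regret = expected_reward(optimal, alpha, chi) - expected_reward(suboptimal, alpha, chi)
print(f"per-step regret of {suboptimal}: {regret:.3f}")   # 0.6 * (0.7 - 0.2) = 0.300
```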

B Proof of Theorem 1

Let $R_{b,\ell}$ be the stochastic regret associated with stage $\ell$ of batch $b$. Then the expected $T$-step regret of MergeRank can be decomposed as
$$R(T) \leq \mathbb{E}\left[\sum_{b=1}^{2K} \sum_{\ell=0}^{T-1} R_{b,\ell}\right]$$
because the maximum number of batches is $2K$. Let
$$\bar{\chi}_{b,\ell}(d) = \frac{\bar{c}_{b,\ell}(d)}{\alpha(d)} \qquad (12)$$
be the average examination probability of item $d$ in stage $\ell$ of batch $b$. Let $E_{b,\ell}$ be the good event in stage $\ell$ of batch $b$, defined as the intersection of the following three events:

Event 1: for all $d \in \mathcal{B}_{b,\ell}$: $\bar{c}_{b,\ell}(d) \in [L_{b,\ell}(d), U_{b,\ell}(d)]$;

Event 2: for all $I_b \in [K]^2$, $d \in \mathcal{B}_{b,\ell}$, and $d^* \in \mathcal{B}_{b,\ell} \cap [K]$ such that $\Delta = \alpha(d^*) - \alpha(d) > 0$:
$$n_\ell \geq \frac{16K}{\chi(I_b(1))(1-\alpha_{\max})\Delta^2}\log T \implies \hat{c}_{b,\ell}(d) \leq \bar{\chi}_{b,\ell}(d)\,[\alpha(d) + \Delta/4]\,;$$

Event 3: for all $I_b \in [K]^2$, $d \in \mathcal{B}_{b,\ell}$, and $d^* \in \mathcal{B}_{b,\ell} \cap [K]$ such that $\Delta = \alpha(d^*) - \alpha(d) > 0$:
$$n_\ell \geq \frac{16K}{\chi(I_b(1))(1-\alpha_{\max})\Delta^2}\log T \implies \hat{c}_{b,\ell}(d^*) \geq \bar{\chi}_{b,\ell}(d^*)\,[\alpha(d^*) - \Delta/4]\,;$$

where $\bar{c}_{b,\ell}(d)$ is the probability of clicking on item $d$ in stage $\ell$ of batch $b$, which is defined in (8); $\hat{c}_{b,\ell}(d)$ is its estimate, which is defined in (7); and both $\chi$ and $\alpha_{\max}$ are defined in Section 5.3. Let $\bar{E}_{b,\ell}$ be the complement of event $E_{b,\ell}$. Let $E$ be the good event that all events $E_{b,\ell}$ happen, and $\bar{E}$ be its complement, the bad event that at least one event $E_{b,\ell}$ does not happen. Then the expected $T$-step regret can be bounded from above as
$$R(T) \leq \mathbb{E}\left[\sum_{b=1}^{2K} \sum_{\ell=0}^{T-1} R_{b,\ell}\,\mathbf{1}\{E\}\right] + T\,P(\bar{E}) \leq \sum_{b=1}^{2K} \mathbb{E}\left[\sum_{\ell=0}^{T-1} R_{b,\ell}\,\mathbf{1}\{E\}\right] + 4KL(3e + K)\,,$$
where the second inequality is from Lemma 2. Now we apply Lemma 7 to each batch $b$ and get that
$$\sum_{b=1}^{2K} \mathbb{E}\left[\sum_{\ell=0}^{T-1} R_{b,\ell}\,\mathbf{1}\{E\}\right] \leq \frac{192 K^3 L}{(1-\alpha_{\max})\,\Delta_{\min}} \log T\,.$$
This concludes our proof.
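The step in the proof above that introduces $T\,P(\bar{E})$ compresses two routine moves: the regret is split on the good event $E$ and its complement, and the contribution of the bad event is bounded by the horizon times its probability, using that the cumulative regret is at most $T$ (this is exactly what produces the $T\,P(\bar{E})$ term in the display above). Spelled out:

```latex
\begin{align*}
R(T) &\le \mathbb{E}\Bigg[\sum_{b=1}^{2K}\sum_{\ell=0}^{T-1} R_{b,\ell}\,\mathbf{1}\{E\}\Bigg]
        + \mathbb{E}\Bigg[\sum_{b=1}^{2K}\sum_{\ell=0}^{T-1} R_{b,\ell}\,\mathbf{1}\{\bar{E}\}\Bigg]
      \le \mathbb{E}\Bigg[\sum_{b=1}^{2K}\sum_{\ell=0}^{T-1} R_{b,\ell}\,\mathbf{1}\{E\}\Bigg]
        + T\,P(\bar{E})\,.
\end{align*}
```

Lemma 2 then bounds $T\,P(\bar{E})$ by $4KL(3e + K)$.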

C Upper Bound on the Probability of Bad Event $\bar{E}$

Lemma 2. Let $\bar{E}$ be defined as in the proof of Theorem 1 and $T \geq 5$. Then
$$P(\bar{E}) \leq \frac{4KL(3e + K)}{T}\,.$$

Proof. By the union bound,
$$P(\bar{E}) \leq \sum_{b=1}^{2K} \sum_{\ell=0}^{T-1} P(\bar{E}_{b,\ell})\,.$$
Now we bound the probability of each event in $E_{b,\ell}$ and then sum them up.

Event 1: The probability that event 1 in $E_{b,\ell}$ does not happen is bounded as follows. Fix $I_b$ and $\mathcal{B}_{b,\ell}$. For any $d \in \mathcal{B}_{b,\ell}$,
$$P(\bar{c}_{b,\ell}(d) \notin [L_{b,\ell}(d), U_{b,\ell}(d)]) \leq P(\bar{c}_{b,\ell}(d) < L_{b,\ell}(d)) + P(\bar{c}_{b,\ell}(d) > U_{b,\ell}(d)) \leq \frac{2e\,\lceil\log(T\log^3 T)\log n_\ell\rceil}{T\log^3 T} \leq \frac{2e\,(\log^2 T + \log(\log^3 T)\log T + 1)}{T\log^3 T} \leq \frac{2e\,(2\log^2 T + 1)}{T\log^3 T} \leq \frac{6e}{T\log T}\,,$$
where the second inequality is by Theorem 10 of Garivier & Cappe (2011), the third inequality is from $T \geq n_\ell$, the fourth inequality is from $\log(\log^3 T) \leq \log T$ for $T \geq 5$, and the last inequality is from $2\log^2 T + 1 \leq 3\log^2 T$ for $T \geq 3$. By the union bound,
$$P(\exists d \in \mathcal{B}_{b,\ell} \ \text{s.t.} \ \bar{c}_{b,\ell}(d) \notin [L_{b,\ell}(d), U_{b,\ell}(d)]) \leq \frac{6eL}{T\log T}$$
for any $\mathcal{B}_{b,\ell}$. Finally, since the above inequality holds for any $\mathcal{B}_{b,\ell}$, the probability that event 1 in $E_{b,\ell}$ does not happen is bounded as above.

Event 2: The probability that event 2 in $E_{b,\ell}$ does not happen is bounded as follows. Fix $I_b$ and $\mathcal{B}_{b,\ell}$, and let $k = I_b(1)$. If the event does not happen for items $d$ and $d^*$, then it must be true that
$$n_\ell \geq \frac{16K}{\chi(k)(1-\alpha_{\max})\Delta^2}\log T\,, \qquad \hat{c}_{b,\ell}(d) > \bar{\chi}_{b,\ell}(d)\,[\alpha(d) + \Delta/4]\,.$$
From the definition of the average examination probability in (12) and a variant of Hoeffding's inequality in Lemma 8, we have that
$$P(\hat{c}_{b,\ell}(d) > \bar{\chi}_{b,\ell}(d)\,[\alpha(d) + \Delta/4]) \leq \exp[-n_\ell D_{KL}(\bar{\chi}_{b,\ell}(d)\,[\alpha(d) + \Delta/4] \,\|\, \bar{c}_{b,\ell}(d))]\,.$$
From Lemma 9, $\bar{\chi}_{b,\ell}(d) \geq \chi(k)/K$ (Lemma 3), and Pinsker's inequality, we have that
$$\exp[-n_\ell D_{KL}(\bar{\chi}_{b,\ell}(d)\,[\alpha(d) + \Delta/4] \,\|\, \bar{c}_{b,\ell}(d))] \leq \exp[-n_\ell \bar{\chi}_{b,\ell}(d)(1-\alpha_{\max})\, D_{KL}(\alpha(d) + \Delta/4 \,\|\, \alpha(d))] \leq \exp\!\left[-\frac{n_\ell\, \chi(k)(1-\alpha_{\max})\Delta^2}{8K}\right].$$
From our assumption on $n_\ell$, we conclude that
$$\exp\!\left[-\frac{n_\ell\, \chi(k)(1-\alpha_{\max})\Delta^2}{8K}\right] \leq \exp[-2\log T] = \frac{1}{T^2}\,.$$
Finally, we chain all above inequalities and get that event 2 in $E_{b,\ell}$ does not happen for any fixed $I_b$, $\mathcal{B}_{b,\ell}$, $d$, and $d^*$ with probability of at most $T^{-2}$. Since the maximum numbers of items $d$ and $d^*$ are $L$ and $K$, respectively, the event does not happen for any fixed $I_b$ and $\mathcal{B}_{b,\ell}$ with probability of at most $KL\,T^{-2}$. In turn, the probability that event 2 in $E_{b,\ell}$ does not happen is bounded by $KL\,T^{-2}$.

Event 3: This bound is analogous to that of event 2.

Total probability: The maximum number of stages in any batch in BatchRank is $\log T$ and the maximum number of batches is $2K$. Hence, by the union bound,
$$P(\bar{E}) \leq \left(\frac{6eL}{T\log T} + \frac{KL}{T^2} + \frac{KL}{T^2}\right)(2K\log T) \leq \frac{4KL(3e + K)}{T}\,.$$
This concludes our proof.
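Event 1 is a statement about KL-divergence-based confidence bounds of the kind analyzed by Garivier & Cappe (2011), with the threshold $\delta_T = \log T + 3\log\log T$ that also appears in the proof of Lemma 4. The sketch below computes bounds of this form by binary search on the Bernoulli KL divergence. The precise definitions of $U_{b,\ell}(d)$ and $L_{b,\ell}(d)$ live in the main text and are not restated in this appendix, so treat this as an assumed form that is merely consistent with the inequality $n_\ell\, D_{KL}(\hat{c}_{b,\ell}(d)\,\|\,U_{b,\ell}(d)) \leq \log T + 3\log\log T$ used in Lemma 4.

```python
import math

def bernoulli_kl(p, q):
    """D_KL(Ber(p) || Ber(q)) with the convention 0 * log 0 = 0."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    res = 0.0
    if p > 0:
        res += p * math.log(p / q)
    if p < 1:
        res += (1 - p) * math.log((1 - p) / (1 - q))
    return res

def kl_confidence_bounds(c_hat, n, T, iters=60):
    """KL-UCB style bounds (assumed form of U_{b,l}(d) and L_{b,l}(d)): the largest U >= c_hat
    and smallest L <= c_hat such that n * D_KL(c_hat || bound) <= log T + 3 log log T."""
    radius = (math.log(T) + 3 * math.log(math.log(T))) / n
    lo, hi = c_hat, 1.0
    for _ in range(iters):                      # binary search for the upper bound
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if bernoulli_kl(c_hat, mid) <= radius else (lo, mid)
    upper = lo
    lo, hi = 0.0, c_hat
    for _ in range(iters):                      # binary search for the lower bound
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if bernoulli_kl(c_hat, mid) <= radius else (mid, hi)
    lower = hi
    return lower, upper

# Example: 40 clicks out of n_l = 200 observations, i.e. c_hat = 0.2, at horizon T = 10**5.
print(kl_confidence_bounds(0.2, 200, 10**5))
```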

D Upper Bound on the Regret in Individual Batches

Lemma 3. For any batch $b$, positions $I_b$, stage $\ell$, set $\mathcal{B}_{b,\ell}$, and item $d \in \mathcal{B}_{b,\ell}$,
$$\bar{\chi}_{b,\ell}(d) \geq \frac{\chi(k)}{K}\,,$$
where $k = I_b(1)$ is the highest position in batch $b$.

Proof. The proof follows from two observations. First, by Assumption 6, $\chi(k)$ is the lowest examination probability of position $k$. Second, by the design of DisplayBatch, item $d$ is placed at position $k$ with probability of at least $1/K$.

Lemma 4. Let event $E$ happen and $T \geq 5$. For any batch $b$, positions $I_b$, set $\mathcal{B}_{b,0}$, item $d \in \mathcal{B}_{b,0}$, and item $d^* \in \mathcal{B}_{b,0} \cap [K]$ such that $\Delta = \alpha(d^*) - \alpha(d) > 0$, let $k = I_b(1)$ be the highest position in batch $b$ and $\ell$ be the first stage where
$$\tilde{\Delta}_\ell < \sqrt{\frac{\chi(k)(1-\alpha_{\max})}{K}}\,\Delta\,.$$
Then $U_{b,\ell}(d) < L_{b,\ell}(d^*)$.

Proof. From the definition of $n_\ell$ in BatchRank and our assumption on $\ell$,
$$n_\ell \geq \frac{16}{\tilde{\Delta}_\ell^2}\log T > \frac{16K}{\chi(k)(1-\alpha_{\max})\Delta^2}\log T\,. \qquad (13)$$
Let $\mu = \bar{\chi}_{b,\ell}(d)$ and suppose that $U_{b,\ell}(d) \geq \mu[\alpha(d) + \Delta/2]$ holds. Then from this assumption, the definition of $U_{b,\ell}(d)$, and event 2 in $E_{b,\ell}$,
$$D_{KL}(\hat{c}_{b,\ell}(d) \,\|\, U_{b,\ell}(d)) \geq D_{KL}(\hat{c}_{b,\ell}(d) \,\|\, \mu[\alpha(d) + \Delta/2])\,\mathbf{1}\{\hat{c}_{b,\ell}(d) \leq \mu[\alpha(d) + \Delta/2]\} \geq D_{KL}(\mu[\alpha(d) + \Delta/4] \,\|\, \mu[\alpha(d) + \Delta/2])\,.$$
From Lemma 9, $\mu \geq \chi(k)/K$ (Lemma 3), and Pinsker's inequality, we have that
$$D_{KL}(\mu[\alpha(d) + \Delta/4] \,\|\, \mu[\alpha(d) + \Delta/2]) \geq \mu(1-\alpha_{\max})\, D_{KL}(\alpha(d) + \Delta/4 \,\|\, \alpha(d) + \Delta/2) \geq \frac{\chi(k)(1-\alpha_{\max})\Delta^2}{8K}\,.$$
From the definition of $U_{b,\ell}(d)$, $T \geq n_\ell$, $T \geq 5$, and the above inequalities,
$$\frac{\chi(k)(1-\alpha_{\max})\Delta^2}{8K} \leq D_{KL}(\hat{c}_{b,\ell}(d) \,\|\, U_{b,\ell}(d)) \leq \frac{\log T + 3\log\log T}{n_\ell} \leq \frac{2\log T}{n_\ell}\,.$$
This contradicts (13), and therefore it must be true that $U_{b,\ell}(d) < \mu[\alpha(d) + \Delta/2]$.

On the other hand, let $\mu^* = \bar{\chi}_{b,\ell}(d^*)$ and suppose that $L_{b,\ell}(d^*) \leq \mu^*[\alpha(d^*) - \Delta/2]$ holds. Then from this assumption, the definition of $L_{b,\ell}(d^*)$, and event 3 in $E_{b,\ell}$,
$$D_{KL}(\hat{c}_{b,\ell}(d^*) \,\|\, L_{b,\ell}(d^*)) \geq D_{KL}(\hat{c}_{b,\ell}(d^*) \,\|\, \mu^*[\alpha(d^*) - \Delta/2])\,\mathbf{1}\{\hat{c}_{b,\ell}(d^*) \geq \mu^*[\alpha(d^*) - \Delta/2]\} \geq D_{KL}(\mu^*[\alpha(d^*) - \Delta/4] \,\|\, \mu^*[\alpha(d^*) - \Delta/2])\,.$$
From Lemma 9, $\mu^* \geq \chi(k)/K$ (Lemma 3), and Pinsker's inequality, we have that
$$D_{KL}(\mu^*[\alpha(d^*) - \Delta/4] \,\|\, \mu^*[\alpha(d^*) - \Delta/2]) \geq \mu^*(1-\alpha_{\max})\, D_{KL}(\alpha(d^*) - \Delta/4 \,\|\, \alpha(d^*) - \Delta/2) \geq \frac{\chi(k)(1-\alpha_{\max})\Delta^2}{8K}\,.$$
From the definition of $L_{b,\ell}(d^*)$, $T \geq n_\ell$, $T \geq 5$, and the above inequalities,
$$\frac{\chi(k)(1-\alpha_{\max})\Delta^2}{8K} \leq D_{KL}(\hat{c}_{b,\ell}(d^*) \,\|\, L_{b,\ell}(d^*)) \leq \frac{\log T + 3\log\log T}{n_\ell} \leq \frac{2\log T}{n_\ell}\,.$$
This contradicts (13), and therefore it must be true that $L_{b,\ell}(d^*) > \mu^*[\alpha(d^*) - \Delta/2]$.

Finally, based on inequality (11), $\mu^* = \bar{c}_{b,\ell}(d^*)/\alpha(d^*) \geq \bar{c}_{b,\ell}(d)/\alpha(d) = \mu$, and item $d$ is guaranteed to be eliminated by the end of stage $\ell$ because
$$U_{b,\ell}(d) < \mu[\alpha(d) + \Delta/2] \leq \mu\alpha(d) + \frac{\mu^*\alpha(d^*) - \mu\alpha(d)}{2} = \mu^*\alpha(d^*) - \frac{\mu^*\alpha(d^*) - \mu\alpha(d)}{2} \leq \mu^*[\alpha(d^*) - \Delta/2] < L_{b,\ell}(d^*)\,.$$
This concludes our proof.

Lemma 5. Let event $E$ happen and $T \geq 5$. For any batch $b$, positions $I_b$ where $I_b(2) = K$, set $\mathcal{B}_{b,0}$, and item $d \in \mathcal{B}_{b,0}$ such that $d > K$, let $k = I_b(1)$ be the highest position in batch $b$ and $\ell$ be the first stage where
$$\tilde{\Delta}_\ell < \sqrt{\frac{\chi(k)(1-\alpha_{\max})}{K}}\,\Delta$$
for $\Delta = \alpha(K) - \alpha(d)$. Then item $d$ is eliminated by the end of stage $\ell$.

Proof. Let $B^+ = \{k, \dots, K\}$. Now note that $\alpha(d^*) - \alpha(d) \geq \Delta$ for any $d^* \in B^+$. By Lemma 4, $L_{b,\ell}(d^*) > U_{b,\ell}(d)$ for any $d^* \in B^+$; and therefore item $d$ is eliminated by the end of stage $\ell$.

Lemma 6. Let $E$ happen and $T \geq 5$. For any batch $b$, positions $I_b$, and set $\mathcal{B}_{b,0}$, let $k = I_b(1)$ be the highest position in batch $b$ and $\ell$ be the first stage where
$$\tilde{\Delta}_\ell < \sqrt{\frac{\chi(k)(1-\alpha_{\max})}{K}}\,\Delta_{\max}$$
for $\Delta_{\max} = \alpha(s) - \alpha(s+1)$ and $s = \arg\max_{d \in \{I_b(1), \dots, I_b(2)-1\}} [\alpha(d) - \alpha(d+1)]$. Then batch $b$ is split by the end of stage $\ell$.

Proof. Let $B^+ = \{k, \dots, s\}$ and $B^- = \mathcal{B}_{b,0} \setminus B^+$. Now note that $\alpha(d^*) - \alpha(d) \geq \Delta_{\max}$ for any $(d^*, d) \in B^+ \times B^-$. By Lemma 4, $L_{b,\ell}(d^*) > U_{b,\ell}(d)$ for any $(d^*, d) \in B^+ \times B^-$; and therefore batch $b$ is split by the end of stage $\ell$.
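Lemmas 4, 5, and 6 all reduce to one comparison: a set of items is confidently separated from the rest of the batch once every lower confidence bound in the set exceeds every upper confidence bound outside it. The sketch below implements that comparison for a single batch. It is a schematic reading of these lemmas, not BatchRank's actual pseudocode; the function name is hypothetical, and the decision of whether a separation triggers an elimination or a split (which in the algorithm depends on how many positions the batch owns) is omitted.

```python
from typing import Dict, List, Optional, Tuple

def confident_separation(lcb: Dict[int, float], ucb: Dict[int, float]
                         ) -> Optional[Tuple[List[int], List[int]]]:
    """Schematic reading of Lemmas 4-6: find a prefix of items (ordered by lower confidence
    bound) whose every member confidently beats every remaining item, i.e.
    min_{d* in prefix} L(d*) > max_{d in rest} U(d). BatchRank would then split the batch at
    this boundary or eliminate the trailing items; that bookkeeping is not modeled here."""
    items = sorted(lcb, key=lcb.get, reverse=True)     # candidates ordered by LCB, best first
    for cut in range(1, len(items)):
        prefix, rest = items[:cut], items[cut:]
        if min(lcb[d] for d in prefix) > max(ucb[d] for d in rest):
            return prefix, rest
    return None                                        # no confident separation yet

# Toy example: item 3's UCB (0.35) is below the LCBs of items 1 and 2, so it is separated.
lcb = {1: 0.50, 2: 0.42, 3: 0.10}
ucb = {1: 0.70, 2: 0.60, 3: 0.35}
print(confident_separation(lcb, ucb))                  # -> ([1, 2], [3])
```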

Lemma 7. Let event $E$ happen and $T \geq 5$. Then the expected $T$-step regret in any batch $b$ is bounded as
$$\mathbb{E}\left[\sum_{\ell=0}^{T-1} R_{b,\ell}\right] \leq \frac{96 K^2 L}{(1-\alpha_{\max})\,\Delta_{\max}} \log T\,.$$

Proof. Let $k = I_b(1)$ be the highest position in batch $b$. Choose any item $d \in \mathcal{B}_{b,0}$ and let $\Delta = \alpha(k) - \alpha(d)$.

First, we show that the expected per-step regret of any item $d$ is bounded by $\chi(k)\Delta$ when event $E$ happens. Since event $E$ happens, all eliminations and splits up to any stage $\ell$ of batch $b$ are correct. Therefore, items $1, \dots, k-1$ are at positions $1, \dots, k-1$, and position $k$ is examined with probability $\chi(k)$. Note that this is the highest examination probability in batch $b$ (Assumption 4). Our upper bound follows from the fact that the reward is linear in individual items (Section 3.1).

We analyze two cases. First, suppose that $\Delta \leq 2K\Delta_{\max}$ for $\Delta_{\max}$ in Lemma 6. Then by Lemma 6, batch $b$ splits when the number of steps in a stage is at most
$$\frac{16K}{\chi(k)(1-\alpha_{\max})\Delta_{\max}^2}\log T\,.$$
By the design of DisplayBatch, any item in stage $\ell$ of batch $b$ is displayed at most $2 n_\ell$ times. Therefore, the maximum regret due to item $d$ in the last stage before the split is
$$2 n_\ell\, \chi(k)\Delta \leq \frac{32K\Delta}{(1-\alpha_{\max})\Delta_{\max}^2}\log T \leq \frac{64K^2\Delta_{\max}}{(1-\alpha_{\max})\Delta_{\max}^2}\log T = \frac{64K^2}{(1-\alpha_{\max})\Delta_{\max}}\log T\,.$$

Now suppose that $\Delta > 2K\Delta_{\max}$. This implies that item $d$ is easy to distinguish from item $K$. In particular,
$$\alpha(K) - \alpha(d) = \Delta - (\alpha(k) - \alpha(K)) \geq \Delta - K\Delta_{\max} \geq \frac{\Delta}{2}\,,$$
where the equality is from the identity $\Delta = \alpha(k) - \alpha(d) = (\alpha(k) - \alpha(K)) + (\alpha(K) - \alpha(d))$; the first inequality is from $\alpha(k) - \alpha(K) \leq K\Delta_{\max}$, which follows from the definition of $\Delta_{\max}$ and $k \in [K]$; and the last inequality is from our assumption that $K\Delta_{\max} < \Delta/2$. Now we apply the derived inequality and, by Lemma 5 and from the design of DisplayBatch, the maximum regret due to item $d$ in the stage where that item is eliminated is
$$2 n_\ell\, \chi(k)\Delta \leq \frac{32K\Delta}{(1-\alpha_{\max})(\alpha(K) - \alpha(d))^2}\log T \leq \frac{128K}{(1-\alpha_{\max})\Delta}\log T \leq \frac{64}{(1-\alpha_{\max})\Delta_{\max}}\log T\,.$$
The last inequality is from our assumption that $\Delta > 2K\Delta_{\max}$.

Because the lengths of the stages quadruple and BatchRank resets all click estimators at the beginning of each stage, the maximum expected regret due to any item $d$ in batch $b$ is at most $1.5$ times higher than that in the last stage, and hence
$$\mathbb{E}\left[\sum_{\ell=0}^{T-1} R_{b,\ell}\right] \leq \frac{96 K^2 |\mathcal{B}_{b,0}|}{(1-\alpha_{\max})\,\Delta_{\max}} \log T\,.$$
This concludes our proof.
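A short arithmetic check of the last step: if the per-item observation counts $n_\ell$ grow by a factor of four per stage (as implied by $\tilde{\Delta}_\ell = 2^{-\ell}$ and $n_\ell \propto \tilde{\Delta}_\ell^{-2}\log T$, an assumption consistent with (13)), then the regret accumulated over all stages of a batch is at most $\sum_{j \geq 0} 4^{-j} = 4/3 \leq 1.5$ times that of the last stage, which is where $96 = 1.5 \times 64$ comes from.

```python
# Per-item regret in a stage is at most 2 * n_l * chi(k) * Delta, i.e. proportional to n_l,
# and n_l quadruples across stages, so the total over all stages is at most
# sum_{j >= 0} 4^{-j} = 4/3 < 1.5 times the regret of the last stage.
ratio = sum(4.0 ** -j for j in range(60))
print(round(ratio, 4), ratio <= 1.5)   # 1.3333 True
```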

E Technical Lemmas

Lemma 8. Let $(X_i)_{i=1}^n$ be $n$ i.i.d. Bernoulli random variables, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n X_i$, and $\mu = \mathbb{E}[\hat{\mu}]$. Then
$$P(\hat{\mu} \geq \mu + \varepsilon) \leq \exp[-n\, D_{KL}(\mu + \varepsilon \,\|\, \mu)]$$
for any $\varepsilon \in [0, 1 - \mu]$, and
$$P(\hat{\mu} \leq \mu - \varepsilon) \leq \exp[-n\, D_{KL}(\mu - \varepsilon \,\|\, \mu)]$$
for any $\varepsilon \in [0, \mu]$.

Proof. We only prove the first claim. The other claim follows from symmetry. From inequality (2.1) of Hoeffding (1963), we have that
$$P(\hat{\mu} \geq \mu + \varepsilon) \leq \left[\left(\frac{\mu}{\mu + \varepsilon}\right)^{\mu + \varepsilon} \left(\frac{1 - \mu}{1 - (\mu + \varepsilon)}\right)^{1 - (\mu + \varepsilon)}\right]^n$$
for any $\varepsilon \in [0, 1 - \mu]$. Now note that
$$\left[\left(\frac{\mu}{\mu + \varepsilon}\right)^{\mu + \varepsilon} \left(\frac{1 - \mu}{1 - (\mu + \varepsilon)}\right)^{1 - (\mu + \varepsilon)}\right]^n = \exp\left[n\left((\mu + \varepsilon)\log\frac{\mu}{\mu + \varepsilon} + (1 - (\mu + \varepsilon))\log\frac{1 - \mu}{1 - (\mu + \varepsilon)}\right)\right]$$
$$= \exp\left[-n\left((\mu + \varepsilon)\log\frac{\mu + \varepsilon}{\mu} + (1 - (\mu + \varepsilon))\log\frac{1 - (\mu + \varepsilon)}{1 - \mu}\right)\right] = \exp[-n\, D_{KL}(\mu + \varepsilon \,\|\, \mu)]\,.$$
This concludes the proof.

Lemma 9. For any $c, p, q \in [0, 1]$,
$$c\,(1 - \max\{p, q\})\, D_{KL}(p \,\|\, q) \leq D_{KL}(cp \,\|\, cq) \leq c\, D_{KL}(p \,\|\, q)\,. \qquad (14)$$

Proof. The proof is based on differentiation. The first two derivatives of $D_{KL}(cp \,\|\, cq)$ with respect to $q$ are
$$\frac{\partial}{\partial q} D_{KL}(cp \,\|\, cq) = \frac{c(q - p)}{q(1 - cq)}\,, \qquad \frac{\partial^2}{\partial q^2} D_{KL}(cp \,\|\, cq) = \frac{c^2(q - p)^2 + cp(1 - cp)}{q^2(1 - cq)^2}\,;$$
and the first two derivatives of $c\, D_{KL}(p \,\|\, q)$ with respect to $q$ are
$$\frac{\partial}{\partial q}\,[c\, D_{KL}(p \,\|\, q)] = \frac{c(q - p)}{q(1 - q)}\,, \qquad \frac{\partial^2}{\partial q^2}\,[c\, D_{KL}(p \,\|\, q)] = \frac{c(q - p)^2 + cp(1 - p)}{q^2(1 - q)^2}\,.$$
The second derivatives show that $D_{KL}(cp \,\|\, cq)$ and $c\, D_{KL}(p \,\|\, q)$ are convex in $q$ for any $p$. Their minima are at $q = p$. Now we fix $p$ and $c$, and prove (14) for any $q$.

The upper bound is derived as follows. Since $D_{KL}(cp \,\|\, cx) = c\, D_{KL}(p \,\|\, x) = 0$ when $x = p$, the upper bound holds when $c\, D_{KL}(p \,\|\, x)$ increases faster than $D_{KL}(cp \,\|\, cx)$ for any $p < x \leq q$, and when $c\, D_{KL}(p \,\|\, x)$ decreases faster than $D_{KL}(cp \,\|\, cx)$ for any $q \leq x < p$. This follows from the definitions of $\frac{\partial}{\partial x} D_{KL}(cp \,\|\, cx)$ and $\frac{\partial}{\partial x}\,[c\, D_{KL}(p \,\|\, x)]$. In particular, both derivatives have the same sign and
$$\left|\frac{\partial}{\partial x} D_{KL}(cp \,\|\, cx)\right| \leq \left|\frac{\partial}{\partial x}\,[c\, D_{KL}(p \,\|\, x)]\right|$$
for any feasible $x \in [\min\{p, q\}, \max\{p, q\}]$.

The lower bound is derived as follows. The ratio of $\frac{\partial}{\partial x}\,[c\, D_{KL}(p \,\|\, x)]$ and $\frac{\partial}{\partial x} D_{KL}(cp \,\|\, cx)$ is bounded from above as
$$\frac{\frac{\partial}{\partial x}\,[c\, D_{KL}(p \,\|\, x)]}{\frac{\partial}{\partial x} D_{KL}(cp \,\|\, cx)} = \frac{1 - cx}{1 - x} \leq \frac{1}{1 - x} \leq \frac{1}{1 - \max\{p, q\}}$$
for any $x \in [\min\{p, q\}, \max\{p, q\}]$. Therefore, we get a lower bound on $D_{KL}(cp \,\|\, cx)$ when we multiply $c\, D_{KL}(p \,\|\, x)$ by $1 - \max\{p, q\}$.
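Lemma 9 is stated without a numerical illustration; the sketch below checks the sandwich inequality (14) on a grid of $(c, p, q)$ values. The grid and tolerances are arbitrary choices made for this check, and the clipping inside the KL helper is only there to avoid division by zero at the boundary.

```python
import math

def kl(p, q):
    """Bernoulli KL divergence D_KL(p || q), with the convention 0 * log 0 = 0."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

# Check  c*(1 - max(p,q))*KL(p||q)  <=  KL(cp||cq)  <=  c*KL(p||q)  over a grid of (c, p, q).
grid = [i / 20 for i in range(1, 20)]            # 0.05, 0.10, ..., 0.95
for c in grid:
    for p in grid:
        for q in grid:
            lower = c * (1 - max(p, q)) * kl(p, q)
            middle = kl(c * p, c * q)
            upper = c * kl(p, q)
            assert lower <= middle + 1e-9 and middle <= upper + 1e-9, (c, p, q)
print("Lemma 9 holds on the whole grid.")
```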