Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem


1 Relative Upper Confidence Bound for the K-Armed Dueling Bandit Problem. Masrour Zoghi (1), Shimon Whiteson (1), Rémi Munos (2) and Maarten de Rijke (1). (1) University of Amsterdam; (2) INRIA Lille / MSR-NE. June 24. 1 / 18

2 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems. 2 / 18

3 Motivation for Dueling Bandits. People are better at expressing preferences than at judging absolute quality. Preference feedback is abundant: Information Retrieval, Recommender Systems. Standard K-armed bandit algorithms can't be applied directly. 3 / 18

4 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems. 4 / 18

5 K-armed Dueling Bandits. K arms {a_1, ..., a_K}. Preference probabilities p_ij := Pr(a_i beats a_j) for i, j = 1, ..., K. [Example 4x4 preference matrix p_ij over arms a_1, ..., a_4; numerical entries lost in transcription.] 5 / 18

6 K-armed Dueling Bandits. K arms {a_1, ..., a_K}. Preference probabilities p_ij := Pr(a_i beats a_j) for i, j = 1, ..., K. Goal 1: Find the best arm, i.e. the arm a_b that beats all others: p_bj > 0.5 for all j ≠ b. Goal 2: Lower the number of suboptimal comparisons. [Example preference matrix p_ij over a_1, ..., a_4; entries lost in transcription.] 5 / 18
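As a small illustration of these definitions, the Python sketch below builds a hypothetical 4x4 preference matrix (the slide's actual entries were lost in this transcription) and checks which arm, if any, beats every other arm, i.e. satisfies Goal 1.

import numpy as np

# Hypothetical preference matrix: P[i, j] = Pr(a_i beats a_j).
# By convention the diagonal is 0.5 and P[j, i] = 1 - P[i, j].
P = np.array([
    [0.50, 0.55, 0.60, 0.70],
    [0.45, 0.50, 0.55, 0.60],
    [0.40, 0.45, 0.50, 0.55],
    [0.30, 0.40, 0.45, 0.50],
])

def best_arm(P):
    # Return the index b with p_bj > 0.5 for every j != b, or None if no such arm exists.
    K = P.shape[0]
    for b in range(K):
        if all(P[b, j] > 0.5 for j in range(K) if j != b):
            return b
    return None

print(best_arm(P))  # -> 0, i.e. a_1 is the best arm of this toy matrix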

7 Evaluation Measure. Given a comparison between a_i and a_j, define the regret as r = (Δ_i + Δ_j) / 2, with Δ_k = p_bk - 1/2, where a_b is the best ranker. Cumulative regret = sum of regret over time. No regret only when a_1 is compared against a_1. [Example preference matrix p_ij; entries lost in transcription.] 6 / 18
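Continuing the toy example above (arm a_1, index 0, is the best arm), the per-comparison regret defined on this slide can be computed as follows; comparing the best arm with itself is the only regret-free comparison.

def comparison_regret(P, b, i, j):
    # r = (Delta_i + Delta_j) / 2, with Delta_k = p_bk - 1/2 and a_b the best arm.
    return ((P[b, i] - 0.5) + (P[b, j] - 0.5)) / 2.0

print(comparison_regret(P, b=0, i=0, j=0))  # 0.0  (best arm vs. itself)
print(comparison_regret(P, b=0, i=1, j=3))  # (0.05 + 0.20) / 2 = 0.125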

8 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems. 7 / 18

9 Relative Upper Confidence Bound. Three phases in each iteration. 8 / 18

10 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. 8 / 18

11 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. Compute μ_ij(t), the frequentist estimates of p_ij. [Table: frequentist estimates μ_ij(t) for arms a_1, ..., a_4; entries lost in transcription.] 8 / 18

12 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. Compute μ_ij(t), the frequentist estimates of p_ij. Add optimism bonuses to get upper bounds u_ij(t) = μ_ij(t) + sqrt(α log t / N_ij(t)). [Table: optimistic estimates u_ij(t) for arms a_1, ..., a_4; entries lost in transcription.] 8 / 18
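A minimal sketch of how these optimistic estimates can be computed from a running score sheet, reusing the numpy import above. The name `wins[i, j]` (a count of how often a_i has beaten a_j so far) is mine, not from the slides; never-compared pairs are treated fully optimistically and the diagonal is fixed at 1/2.

def optimistic_estimates(wins, t, alpha=0.51):
    # N_ij = number of comparisons between a_i and a_j so far.
    N = wins + wins.T
    mu = np.where(N > 0, wins / np.maximum(N, 1), 0.5)     # frequentist estimates mu_ij(t)
    bonus = np.sqrt(alpha * np.log(t) / np.maximum(N, 1))  # optimism bonus
    u = np.where(N > 0, mu + bonus, 1.0)                   # never-compared pairs: u_ij = 1
    np.fill_diagonal(u, 0.5)                               # an arm neither beats nor loses to itself
    return u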

13 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. Compute μ_ij(t), the frequentist estimates of p_ij. Add optimism bonuses to get upper bounds u_ij(t) = μ_ij(t) + sqrt(α log t / N_ij(t)). Choose a contender, i.e. an arm that optimistically beats everyone. [Table: u_ij(t); a_1 and a_2 are potential contenders; entries lost in transcription.] 8 / 18

14 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. [Table: u_ij(t); a_2 chosen as the contender; entries lost in transcription.] 8 / 18

15 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. 2. Pick a challenger using UCB relative to the contender. [Table: u_ij(t), column showing UCB relative to a_2; entries lost in transcription.] 8 / 18

16 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. 2. Pick a challenger using UCB relative to the contender. Optimism for the challenger = pessimism for the contender. Choose the challenger most likely to show whether the contender really is the best arm. [Table: u_ij(t), UCB relative to a_2; entries lost in transcription.] 8 / 18

17 Relative Upper Confidence Bound. Three phases in each iteration: 1. Run an optimistic tournament to pick a contender. 2. Pick a challenger using UCB relative to the contender. 3. Compare the two arms and update the score sheet. Play a_4 against a_2. [Table: u_ij(t); entries lost in transcription.] 8 / 18
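Putting the three phases together, here is a hedged sketch of a single RUCB iteration built on the `optimistic_estimates` helper above. The function name `rucb_step` and the random tie-breaking are mine; phase 3 (the actual duel) is left to the environment, so this illustrates the slides' logic rather than the authors' exact pseudocode.

def rucb_step(wins, t, alpha=0.51, rng=None):
    # Phases 1 and 2: choose a (contender, challenger) pair from the current score sheet.
    rng = rng if rng is not None else np.random.default_rng()
    K = wins.shape[0]
    u = optimistic_estimates(wins, t, alpha)
    # Phase 1: optimistic tournament -- contenders optimistically beat every other arm.
    potential = [c for c in range(K) if all(u[c, j] >= 0.5 for j in range(K) if j != c)]
    contender = int(rng.choice(potential)) if potential else int(rng.integers(K))
    # Phase 2: challenger = the arm with the highest upper bound on beating the contender
    # (optimism for the challenger is pessimism for the contender).
    col = u[:, contender].copy()
    col[contender] = -np.inf            # the contender does not challenge itself
    challenger = int(np.argmax(col))
    return contender, challenger

# Phase 3 is the duel itself: compare the two arms, observe the winner,
# and update the score sheet, e.g.  wins[winner, loser] += 1  before the next call.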

18 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems. 9 / 18

19 Experiments. 64-armed problem obtained from the LETOR learning-to-rank dataset. Baselines: Beat the Mean (BTM) from (Yue & Joachims, 2011) and Condorcet SAVAGE from (Urvoy et al, 2013). [Plot: LETOR NP2004 dataset with 64 rankers; cumulative regret vs. time for BTM, Condorcet SAVAGE and RUCB (α value lost in transcription).] 10 / 18
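The curve on this slide comes from real ranker-vs-ranker preference probabilities estimated on LETOR NP2004, which are not reproduced in this transcription. As a stand-in, the toy loop below shows how such a cumulative-regret curve is generated: duels are simulated from a known preference matrix (the hypothetical P defined earlier) and the per-comparison regret from slide 7 is accumulated over time, reusing `rucb_step` and `comparison_regret` from the earlier sketches.

def cumulative_regret_curve(P, T=10_000, alpha=0.51, seed=0):
    rng = np.random.default_rng(seed)
    K = P.shape[0]
    wins = np.zeros((K, K))
    curve, total = [], 0.0
    for t in range(1, T + 1):
        c, d = rucb_step(wins, t, alpha, rng)
        # Simulate the duel: a_c beats a_d with probability P[c, d], then update the score sheet.
        if rng.random() < P[c, d]:
            wins[c, d] += 1
        else:
            wins[d, c] += 1
        total += comparison_regret(P, b=0, i=c, j=d)  # arm 0 is the best arm of the toy matrix
        curve.append(total)
    return curve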

20 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems. 11 / 18

21 Existing Regret Bounds
(Urvoy et al, ICML 2013): SAVAGE - very general assumptions, but an O(K^2 log T) upper bound.
(Yue & Joachims, ICML 2011): Beat the Mean (BTM) - an O(K log T) upper bound, but it assumes a total ordering of the arms, requires prior knowledge of the problem, and scales poorly as the problem degrades.
(Yue et al, COLT 2009): Interleaved Filter - an O(K log T) lower bound, but only on particular dueling bandit problems, and an O(K log T) upper bound, but with strong transitivity assumptions.
(Ailon et al, ICML 2014): Doubler & MultiSBM - O(K log T) upper and lower bounds, but only for dueling bandits arising from regular bandits (more restrictive assumptions).
Remark: Our bounds are in O(K log T) and do not require any transitivity assumptions or prior knowledge. 12 / 18

33 Regret Bounds. P := [p_ij] with i, j = 1, ..., K - the preference matrix. Only assume p_1j > 0.5 for j > 1; no transitivity assumptions. R_T - cumulative regret up to time T. α - the parameter in the UCB bonus. 13 / 18

34 Regret Bounds. P := [p_ij] - the preference matrix; only assume p_1j > 0.5 for j > 1 (no transitivity assumptions); R_T - cumulative regret up to time T; α - the UCB parameter. High-Probability Regret Bound: given α > 0.5 and δ > 0, applying RUCB with parameter α to the above problem, with probability 1 - δ we have R_T ≤ C(K, δ, α) + Σ_{j=2}^{K} D_j(α) ln T, where the additive constant C(K, δ, α) is O(K^2) and the sum is O(K ln T). 13 / 18

35 Regret Bounds (same setup as above). Expected Regret Bound: assuming α > 1, we have E[R_T] ≤ C'(K, α) + Σ_{j=2}^{K} D_j(α) ln T, where C'(K, α) is O(K^2) and the sum is O(K ln T). 13 / 18
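On the slides, the order of each term was indicated with an underbrace; a clean LaTeX rendering of the two bounds as transcribed above, with the slides' problem-dependent constants C, C' and D_j, is:

\[
R_T \;\le\; \underbrace{C(K,\delta,\alpha)}_{O(K^2)} \;+\; \underbrace{\sum_{j=2}^{K} D_j(\alpha)\,\ln T}_{O(K \ln T)}
\qquad \text{w.p. } 1-\delta,\ \alpha > \tfrac{1}{2},
\]
\[
\mathbb{E}[R_T] \;\le\; \underbrace{C'(K,\alpha)}_{O(K^2)} \;+\; \underbrace{\sum_{j=2}^{K} D_j(\alpha)\,\ln T}_{O(K \ln T)}
\qquad \alpha > 1.
\]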

36 Plan of Talk: Motivation, Definitions, Algorithm, Experiments, Regret Bounds, Open problems. 14 / 18

37 Open Theoretical Questions 1
Comprehensive lower bounds: e.g., is the O(K^2) additive constant in E[R_T] ≤ C'(K, α) + Σ_j D_j(α) ln T (first term O(K^2), sum O(K ln T)) necessary for problems with cycles?
Expected regret bounds for α ≤ 1.
Extensions to the case where there is no best arm.
Results for a GP or X-armed extension.
Theoretical results for a Thompson Sampling version.
Contextual, adversarial, etc. 15 / 18

38 Open Theoretical Questions 2. Preference matrix P_YJ used in (Yue & Joachims, ICML 2011) and (Ailon et al, ICML 2014). Relative Confidence Sampling (RCS) from (Zoghi et al, WSDM 2014) and Sparring from (Ailon et al, ICML 2014). [Plot: YJ preference matrix with 6 arms; cumulative regret vs. time for RUCB (α = 0.51), Sparring, and RCS (α value lost in transcription).] 16 / 18

39 Contributions. A new K-armed dueling bandit algorithm, outperforming existing algorithms with theoretical results. Expected and high-probability regret bounds (not requiring δ to be passed to the algorithm). Main contribution: asymptotically optimal regret bounds for a broad class of problems. 17 / 18

40 Thank you Masrour Zoghi 18 / 18
