Supplementary: Battle of Bandits
A Proof of Lemma

Proof. We start by noting that, for any $i \in \{U, V\}$,
$$P(X = i) = \sum_{j \ne i} P\big(X = i \mid \{U, V\} = \{i, j\}\big)\, P\big(\{U, V\} = \{i, j\}\big) = \sum_{j \ne i} Q_{a_i, a_j}\, P\big(\{U, V\} = \{i, j\}\big),$$
where the last equality is by appealing to sampling without replacement and the definition of the win probability for $i$ in the pairwise preference model.

B Regret analysis of Battling-Doubler

We will be using the following regret guarantee of the UCB algorithm for the classical MAB problem in order to prove the regret bounds of Battling-Doubler:

Theorem (UCB Regret [5]). Let $[n]$ denote the set of arms and $\mu_i$ the expected reward associated with arm $i \in [n]$, such that $\mu_1 > \mu_2 \ge \dots \ge \mu_n$, and let $H = \sum_{i \ge 2} \frac{1}{\Delta_i}$, where $\Delta_i = \mu_1 - \mu_i$. Then the expected regret of the UCB algorithm with confidence parameter $\zeta > 0$ is $O(\zeta H \ln T)$.

Proof of Theorem 3

Proof. For any time horizon $T \in \mathbb{Z}_+$, let $B(T)$ denote the supremum of the expected regret of the SBM $S$ after $T$ steps, over all possible reward distributions on the arm set $[n]$. Clearly, the length of epoch $l$ in the algorithm is exactly $2^l$; we denote this by $T_l = 2^l$. Note that for all time steps $t$ inside this epoch, the first $k-1$ arms $a_t(1), \dots, a_t(k-1)$ are chosen uniformly from $L$, and thus the observed reward of the $k$-th arm also comes from a fixed distribution, as argued below: for any $t$ within epoch $l$, clearly $E[\theta(a_t(1))] = E[\theta(a_t(2))] = \dots = E[\theta(a_t(k-1))] = \bar\theta$, since each of the first $k-1$ arms was drawn uniformly from $L$. Now, the SBM $S$ is playing a standard MAB game over the set of arms $[n]$ with binary rewards. Let $b_t$ denote the binary reward of the $k$-th arm $a_t(k)$ in the $t$-th step.
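The theorem above treats the SBM as a black box instantiated with UCB. For concreteness, a minimal UCB1-style sketch of such a single-bandit machine follows; the class name, the advance/feedback interface, and the use of $\zeta$ as the exploration scaling are illustrative assumptions, not the paper's implementation.

```python
import math
import random

class UCBBandit:
    """Minimal UCB1-style single bandit machine (SBM) sketch.

    `zeta` plays the role of the confidence parameter in the theorem; the
    advance()/feedback() interface with binary rewards is assumed here for
    illustration only.
    """

    def __init__(self, n_arms, zeta=2.0):
        self.n_arms = n_arms
        self.zeta = zeta
        self.counts = [0] * n_arms   # times each arm was played
        self.sums = [0.0] * n_arms   # cumulative binary reward per arm
        self.t = 0                   # total rounds so far

    def advance(self):
        """Return the next arm: untried arms first, then the max-UCB-index arm."""
        self.t += 1
        for a in range(self.n_arms):
            if self.counts[a] == 0:
                return a

        def ucb(a):
            mean = self.sums[a] / self.counts[a]
            return mean + math.sqrt(self.zeta * math.log(self.t) / self.counts[a])

        return max(range(self.n_arms), key=ucb)

    def feedback(self, arm, reward):
        """Record the observed binary reward for the played arm."""
        self.counts[arm] += 1
        self.sums[arm] += reward

# Toy run: arm 0 has the highest Bernoulli mean, so UCB should favour it.
random.seed(0)
means = [0.8, 0.5, 0.4]
bandit = UCBBandit(len(means))
for _ in range(5000):
    arm = bandit.advance()
    bandit.feedback(arm, 1.0 if random.random() < means[arm] else 0.0)
print(bandit.counts)  # arm 0 dominates for this seed
```

With the gap $\Delta = 0.3$ here, the suboptimal arms are pulled only $O(\ln T / \Delta^2)$ times each, which is the behaviour the $O(\zeta H \ln T)$ bound quantifies.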
Clearly, for all $t$ within epoch $l$,
$$E\big[b_t \mid a_t(1), \dots, a_t(k-1)\big] = \frac{1}{2} + \frac{1}{k-1}\sum_{j=1}^{k-1} \frac{\theta(a_t(k)) - \theta(a_t(j))}{2},$$
by the definition of the pairwise-subset choice model. Thus at round $t$, if $a_t(k) = x$ for any $x \in [n]$,
$$E\big[b_t \mid a_t(k) = x\big] = \frac{1}{2} + \frac{\theta(x) - \bar\theta}{2},$$
where $\bar\theta = E_{a \sim \mathrm{Unif}(L)}[\theta(a)]$ denotes the expected utility score of each of the first $k-1$ arms. Now for the SBM $S$, the best arm (the one with the highest expected reward) is still arm 1, where
$$E\big[b_t \mid a_t(k) = 1\big] = \frac{1}{2} + \frac{\theta_1 - \bar\theta}{2}. \qquad (5)$$
By the definition of the bound function $B(T)$, the total expected regret (in the traditional MAB sense) of the SBM $S$ in epoch $l$ is at most $B(T_l) = B(2^l)$, which implies
$$E\Big[\sum_{t \in \text{epoch } l} \frac{\theta_1 - \theta(a_t(k))}{2}\Big] \le B(2^l).$$
The interesting thing to note is that, at any round $t$ of epoch $l$, the expected regret incurred by any of the first $k-1$ arms $a_t(j)$, $j \le k-1$, is exactly the same as the average expected regret contributed by the set of arms drawn as the $k$-th arm in epoch $l-1$: Line 7 selects $a_t(1), \dots, a_t(k-1)$ uniformly from $L$ at every step of epoch $l$ (of length $2^l$), and thus the per-step expected regret of each of the first $k-1$ positions is at most $B(2^{l-1})/2^{l-1}$. This in other words says that the expected contribution of the $k$-th arm to the regret in epoch $l$ is at most $B(2^l)$. It only remains to bound the expected contribution of the remaining arms to the regret, which are drawn from a distribution that assigns to each arm $a \in [n]$ a probability proportional to the frequency with which $a$ was played as the $k$-th arm in the previous epoch $l-1$.
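The epoch schedule above (epoch $l$ has length $2^l$) means that any horizon $T$ splits into a run of full doubling epochs plus a remainder $Z$. A standalone numeric sanity check of that decomposition, with a hypothetical helper name:

```python
def epoch_decomposition(T):
    """Split horizon T into full doubling epochs 2^1, 2^2, ..., 2^s plus a
    remainder Z, so that sum(epoch_lengths) + Z == T.

    Illustrative helper only; the name and interface are not from the paper.
    """
    lengths = []
    l = 1
    while sum(lengths) + 2 ** l <= T:
        lengths.append(2 ** l)
        l += 1
    Z = T - sum(lengths)
    return lengths, Z

lengths, Z = epoch_decomposition(100)
print(lengths, Z)  # [2, 4, 8, 16, 32] 38
```

The per-epoch bounds $B(2^l)$ are then summed over exactly these epoch lengths (plus $B(Z)$ for the partial final epoch) to obtain the total regret bound.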
Hence the total expected regret incurred by any one of the first $k-1$ arms $a_t(j)$, $j \le k-1$, in epoch $l$ is bounded by $2^l \cdot B(2^{l-1})/2^{l-1} = 2B(2^{l-1})$. Since any finite time horizon $T$ can be uniquely decomposed as $T = 2^{s+1} + Z$, for some integer $s$ and $0 \le Z < 2^{s+1}$, the total expected regret of Battling-Doubler is given by the following function of $T$:
$$(k+3)\big(B(2) + B(2^2) + \cdots + B(2^s) + B(Z)\big). \qquad (6)$$
Now, by our assumption $B(t) = O(\beta \ln t)$ for any $t \in \mathbb{Z}_+$, the theorem claim follows by simplifying and bounding the above expression.

Proof of Corollary 4

Note that if the SBM $S$ in Battling-Doubler is UCB operating on the set of $n$ arms, the expected reward of arm $i \in [n]$ is $\frac{1}{2} + \frac{\theta_i - \bar\theta}{2}$ from equation (5). Thus for the SBM $S$, the total inverse-gap complexity of the suboptimal arms becomes $H = \sum_{i=2}^{n} \frac{1}{\Delta_i}$. Now from the UCB regret theorem above, we have $B(t) = O(\zeta H \ln t)$ for any $t \in \mathbb{Z}_+$. Thus we get
$$E[R(T)] = E\Big[\sum_{t=1}^{T}\sum_{j=1}^{k} \frac{\theta_1 - \theta(a_t(j))}{2}\Big] \le (k+3)\big(B(2) + B(2^2) + \cdots + B(2^s) + B(Z)\big) = O(\zeta k H \ln^2 T).$$

Proof of Corollary 5

The claim simply follows from the above theorem and the fact that the worst possible gap for the UCB algorithm can be at most $O\big(\sqrt{n \ln T / T}\big)$, as argued in [8] or [4] (see Corollary 1). The claim now follows by substituting $H = O\big(\sqrt{nT/\ln T}\big)$ in Corollary 4.

C Regret analysis of Battling-MultiSBM

Proof of Theorem 6

Proof. We start by noting that in Battling-MultiSBM, only one SBM advances at each round. Denote by $\rho_x(t)$ the total number of times $S_x$ was advanced after $t$ iterations of the algorithm, for any arm $x \in [n]$. At any round $t$ where $x$ was played as the $k$-th arm of the previous round, the first $k-1$ arms of round $t$ are all set to $x$ (from Line 5 of Battling-MultiSBM); hence $S_x$ internally sees a world in which the rewards are binary and the expected reward of any arm $a \in [n]$ is exactly
$$\frac{1}{2} + \frac{\theta_a - \theta_x}{2}.$$
Note that for all SBMs $S_y$, $y \in [n]$, the best arm (the one with the highest expected reward) to play is arm 1; thus at round $t$ the reward corresponding to the best arm is $\frac{1}{2} + \frac{\theta_1 - \theta_x}{2}$.
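The control flow described above (one SBM per arm; at each round only the SBM indexed by the previous round's $k$-th arm advances, and its choice fills the $k$-th slot) can be sketched as follows. This is a sketch under stated assumptions: the stand-in SBM is epsilon-greedy rather than an $\alpha$-robust policy, and all names and interfaces are hypothetical.

```python
import random

class EpsGreedySBM:
    """Stand-in single-bandit machine (SBM) with an advance()/feedback()
    interface. Illustrative only: the paper's analysis assumes alpha-robust
    SBMs (e.g. UCB-type algorithms), not epsilon-greedy."""
    def __init__(self, n_arms, eps=0.1):
        self.n_arms, self.eps = n_arms, eps
        self.counts = [0] * n_arms
        self.means = [0.0] * n_arms
    def advance(self):
        if random.random() < self.eps:
            return random.randrange(self.n_arms)
        return max(range(self.n_arms), key=lambda a: self.means[a])
    def feedback(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

def battling_multisbm(theta, T, seed=0):
    """Sketch of the Battling-MultiSBM control flow: one SBM per arm; at round
    t only the SBM indexed by the previous round's k-th arm advances, and its
    choice becomes the next k-th arm. Returns how often each arm filled the
    k-th slot."""
    random.seed(seed)
    n = len(theta)
    sbms = [EpsGreedySBM(n) for _ in range(n)]
    prev = random.randrange(n)          # previous round's k-th arm
    plays = [0] * n
    for _ in range(T):
        cur = sbms[prev].advance()      # only S_prev advances this round
        plays[cur] += 1
        # Binary reward of the k-th slot has mean 1/2 + (theta[cur] - theta[prev]) / 2,
        # matching the internal view of S_prev described in the proof.
        p = 0.5 + (theta[cur] - theta[prev]) / 2
        b = 1.0 if random.random() < p else 0.0
        sbms[prev].feedback(cur, b)
        prev = cur
    return plays

plays = battling_multisbm([0.9, 0.2, 0.2], T=10000)
print(plays)
```

With a clear score gap as in this toy instance, `plays[0]` typically dominates, mirroring the claim that no suboptimal SBM is advanced too often and no SBM plays a suboptimal arm too often.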
Now it is easy to see that for all SBMs $S_y$, $y \in [n]$, the suboptimality gaps of the arms are the same: the suboptimality associated with arm $a$ is $\Delta_a = \frac{\theta_1 - \theta_a}{2}$, for all arms $a \in [n]$.

The following is the key observation in this proof: the total regret incurred by Battling-MultiSBM can be written as
$$R(T) = \sum_{t=1}^{T}\sum_{j=1}^{k}\frac{\theta_1 - \theta(a_t(j))}{2} = \sum_{t=1}^{T}\Big[(k-1)\,\frac{\theta_1 - \theta(a_{t-1}(k))}{2} + \frac{\theta_1 - \theta(a_t(k))}{2}\Big] \le (k-1)\Delta_{\max} + k\sum_{t=1}^{T}\frac{\theta_1 - \theta(a_t(k))}{2}, \qquad (7)$$
where $\Delta_{\max} = \max_{a \in [n]} \frac{\theta_1 - \theta_a}{2}$, since the first $k-1$ arms of round $t$ are copies of the $k$-th arm of round $t-1$. This essentially conveys that bounding the above regret is equivalent to bounding the regret of each SBM $S_y$, $y \in [n]$. In other words, it equivalently says that any suboptimal SBM $S_y$, $y \in [n] \setminus \{1\}$, cannot be advanced too many times, and any SBM cannot play a suboptimal arm $a \in [n] \setminus \{1\}$ too many times. The rest of the proof justifies this claim.

Let us denote by $\rho_x(T) = \sum_{t=1}^{T} \mathbf{1}\{a_{t-1}(k) = x\}$ the total number of times SBM $S_x$ is called, and by $\rho_{xy}(T) = \sum_{t=1}^{T} \mathbf{1}\{a_{t-1}(k) = x,\ a_t(k) = y\}$ the total number of times SBM $S_x$ played arm $y$. We also denote by $R_x(T) = |\{t : a_{t-1}(k) = x\}| \cdot \Delta_x$ the regret incurred due to advancing SBM $S_x$ up to time $T$; in other words, this is the contribution of the $k$-th bandit's choices to the regret at all times $t$ for which the $k$-th arm was previously chosen to be $x$ and $S_x$'s internal counter has not surpassed $T$. Similarly, we denote by $R_{xy}(T) = |\{t : a_{t-1}(k) = x,\ a_t(k) = y\}| \cdot \Delta_y$ the regret incurred due to SBM $S_x$ playing the arm $y \in [n]$ up to time $T$. Clearly $R_{xy}(T) = \rho_{xy}(T)\,\Delta_y$.

From equation (7) it is now easy to see that, in order to bound the expected regret $E[R(T)]$, it suffices to bound the expressions $E[R_{xy}(\rho_x(T))]$ for $x, y \in [n]$. Note that here we assume that each SBM policy used in Battling-MultiSBM is $\alpha$-robust, which will be used crucially in the proof. The main insight is to exploit the fact that, due to the $\alpha$-robustness of the SBMs used, $\rho_x(T)$ is of order $\ln T$ for any suboptimal $x$.

We begin with the observation that, for any fixed $x, y \in [n]$ with $x$ suboptimal, if we choose $\alpha > 1$ and $s \ge 4\alpha$, the $\alpha$-robustness property of the SBM gives
$$P\Big(R_{xy}(T) \ge \frac{s \ln T}{\Delta_y}\Big) = P\Big(\rho_{xy}(T) \ge \frac{s \ln T}{\Delta_y^2}\Big) \le \Delta_y^{\alpha}\,(s \ln T)^{-\alpha} \le (s \ln T)^{-\alpha}, \qquad (8)$$
where the last inequality follows from the fact that $\Delta_y \le 1$ for all $y \in [n]$ and $\alpha \ge 1$. Then, under the same assumptions on $s$ and $x, y$, a union bound over the time points $e, e^e, \dots, e^{e^{\ln\ln T}}$ gives
$$P\Big(\exists\, p \in \{0, 1, \dots, \ln\ln T\} : R_{xy}\big(e^{e^p}\big) \ge \frac{s\,e^{p}}{\Delta_y}\Big) \le 2\,s^{-\alpha}.$$

We now bound the quantity $\rho_x(T)$ for any non-optimal fixed $x$, using the fact that all $z \in [n]$ satisfy $\rho_z(T) \le T$, and that any SBM $S_x$ is advanced in an iteration only if $x$ was the $k$-th arm in the previous round. Thus we have that for all suboptimal $x \in [n] \setminus \{1\}$,
$$P\Big(\rho_x(T) \ge \frac{s\,n \ln T}{\Delta_x}\Big) \le \sum_{z \in [n]} P\Big(\rho_{zx}(T) \ge \frac{s \ln T}{\Delta_x}\Big) = \sum_{z \in [n]} P\big(R_{zx}(T) \ge s \ln T\big) \le n\,(s \ln T)^{-\alpha}, \qquad (9)$$
where the rightmost inequality is by a union bound and (8).

Now fix some $x, y \in [n]$ such that $x$ is suboptimal. The last two inequalities give rise to a random variable $Z$, defined as the minimal scalar for which we have, for all $T \in \{e, e^e, e^{e^e}, \dots, e^{e^{\ln\ln T}}\}$,
$$\rho_x(T) \le \frac{Z\,n \ln T}{\Delta_x} \quad \text{and} \quad R_{xy}(T) \le \frac{Z \ln T}{\Delta_y}.$$
By (8) and (9), we have that for all $s \ge 4\alpha$, $P(Z \ge s) \le s^{-\alpha} + n\,(s \ln T)^{-\alpha}$. Also, conditioned on the event $\{Z \le s\}$, we have
$$R_{xy}(\rho_x(T)) \le \bar R_{xy}(s) := \frac{s\,e}{\Delta_y}\,\ln\Big(\frac{s\,n \ln T}{\Delta_x}\Big).$$
Combining the above we get
$$E\big[R_{xy}(\rho_x(T))\big] \le O\big(\bar R_{xy}(8\alpha)\big) + \sum_{i \ge 1} \bar R_{xy}(8\alpha + i)\Big((4\alpha + i)^{-\alpha} + n\,(4\alpha + i)^{-\alpha}(\ln T)^{-\alpha}\Big).$$
Now, since $\alpha \ge \max\big\{3,\ 1 + \frac{\ln n}{\ln\ln T}\big\}$, it can be verified that the last expression converges to $O\big(\bar R_{xy}(8\alpha)\big)$, and hence
$$E\big[R_{xy}(\rho_x(T))\big] = O\Big(\frac{\alpha}{\Delta_y}\Big(\ln\ln T + \ln n + \ln\frac{1}{\Delta_x}\Big)\Big). \qquad (10)$$

Combining the above and using (7), the total expected regret $E[R(T)]$ is at most
$$(k-1)\Delta_{\max} + E\big[R_1(\rho_1(T))\big] + \sum_{x \in [n]\setminus\{1\},\ y \in [n]} E\big[R_{xy}(\rho_x(T))\big].$$
The desired regret bound now follows from (10).

Proof of Corollary 7

Similar to the case of Battling-Doubler (Corollary 5), the current claim simply follows from the regret guarantee of Battling-MultiSBM and the fact that the worst-case gap can be at most $O\big(\sqrt{n \ln T / T}\big)$, as argued in [8] or [4] (see Corollary 1). The claim now follows by substituting $H = O\big(\sqrt{nT/\ln T}\big)$ in Theorem 6.

D Regret analysis of Battling-Duel

Proof of Theorem 8

Proof. Without loss of generality we will assume that the Condorcet arm is arm 1 throughout the proof. Our goal is to analyse the regret of Battling-Duel in terms of its underlying dueling bandit algorithm. Consider $Q'$ to be the pairwise preference matrix perceived by the dueling bandit algorithm $D$: upon playing any pair of items $(x_t, y_t) \in [n] \times [n]$, it receives feedback according to the pairwise preference $P(x_t \text{ beats } y_t) = Q'_{x_t, y_t}$. We know that the regret seen by the underlying dueling bandit algorithm is given by
$$R_T^{DB} = \sum_{t=1}^{T} \frac{Q'_{1,x_t} + Q'_{1,y_t} - 1}{2}.$$
We now start by analysing the relation between $Q'$ and $Q$.

Case 1: $k$ is even. This is the easy case, since at any round $t$ both $x_t$ and $y_t$ are replicated exactly $k/2$ times. Thus at any round $t$,
$$Q'(x_t, y_t) = P\big(\text{the winning index } i \in S_t \text{ is a copy of } x_t\big) = \frac{1}{2}\big(Q_{x_t,x_t} + Q_{x_t,y_t}\big) = \frac{Q_{x_t,y_t}}{2} + \frac{1}{4},$$
where the second equality follows from the definition of the pairwise-subset choice model (see the Lemma for details) and the last equality follows from the fact that $Q_{i,i} = 1/2$ for all $i \in [n]$. Similarly we get
$$Q'(y_t, x_t) = \frac{1}{2}\big(Q_{y_t,y_t} + Q_{y_t,x_t}\big) = \frac{Q_{y_t,x_t}}{2} + \frac{1}{4}.$$
Note that the above expressions hold true for any pair of items $(x_t, y_t) \in [n] \times [n]$. It is also worth noting that these indeed give $Q'_{i,j} + Q'_{j,i} = 1$ and $Q'_{i,i} = 1/2$ for all $i, j \in [n]$. Thus for any pair of items $(i, j)$, we have $Q'_{i,j} = Q_{i,j}/2 + 1/4$.

Now let us analyse the instantaneous regret of Battling-Duel at round $t$. Using our definition of regret as defined in (3), this gives
$$r_t^{BB}(BD) = \frac{Q_{1,x_t} + Q_{1,y_t}}{2} - \frac{1}{2} = \big(Q'_{1,x_t} + Q'_{1,y_t} - 1\big) = 2\, r_t^{DB}(D),$$
where the second equality follows from the relation $Q'_{i,j} = Q_{i,j}/2 + 1/4$ derived above. Thus, summing over $t = 1, \dots, T$, the cumulative regret of Battling-Duel (BD) over $T$ rounds becomes
$$R_T^{BB}(BD) = \sum_{t=1}^{T} r_t^{BB}(BD) = 2\sum_{t=1}^{T} r_t^{DB}(D) = 2\, R_T^{DB}(D),$$
and the claim follows. Now let us consider the case when $k$ is odd.

Case 2: $k$ is odd. Note that, similar to the case before, at any round $t$ the item $x_t$ is now replicated $(k+1)/2$ times and $y_t$ is replicated $(k-1)/2$ times. The same computation as in Case 1, carried out with these unequal replication counts, again yields $Q'(x_t, y_t)$ and $Q'(y_t, x_t)$ as affine functions of $Q_{x_t,y_t}$ and $Q_{y_t,x_t}$ respectively, and it is easy to verify that, as desired, $Q'_{x_t,y_t} + Q'_{y_t,x_t} = 1$. Analysing the instantaneous regret of Battling-Duel at round $t$ exactly as in Case 1 then gives $r_t^{BB}(BD) = O\big(r_t^{DB}(D)\big)$, and summing over $t = 1, \dots, T$, the cumulative regret of Battling-Duel over $T$ rounds becomes $R_T^{BB}(BD) = O\big(R_T^{DB}(D)\big)$, and the claim for Case 2 follows.

Proof of Corollary 9

Proof. The result is immediate from Theorem 8 along with the expected regret guarantee of the RUCB algorithm as derived in Theorem 5 of [4].

Proof of Corollary

Proof. The proof immediately follows from Theorem 10 along with the regret lower bound for any DB problem as derived in [3]. This is because any smaller regret for $A^{BB}$ would violate the best achievable regret for DB, which is a logical contradiction.

E Datasets for experiments

E.1 Parameters for linear-subset choice model

For synthetic experiments with the linear-subset choice model, we use the following four different utility score vectors $\theta \in [0,1]^n$: 1. arith, 2. geom, 3. g2 and 4. g3. Both the arith and geom utility score vectors have $n = 8$ items, with item 1 as the best (Condorcet) item with the highest score, i.e., $\theta_1 > \max_{i \ge 2} \theta_i$, and the rest of the $\theta_i$'s are in an arithmetic or geometric progression respectively, as their names suggest. The two score vectors are described in Table 1.

Table 1: Parameters for linear-subset choice model (arith and geom)

The next two utility score vectors have $n = 15$ items each. Similarly as before, item 1 is the Condorcet winner here as well, with $\theta_1 > \max_{i \ge 2} \theta_i$. More specifically, for g2 the individual score vector is of the form
$$\theta_i = \begin{cases} 0.8, & \text{if } i = 1,\\ 0.1, & i \in [15] \setminus \{1\}. \end{cases}$$
For g3 the individual score vector is of the form
$$\theta_i = \begin{cases} 0.8, & \text{if } i = 1,\\ 0.7, & i \in [8] \setminus \{1\},\\ 0.6, & \text{otherwise.} \end{cases}$$
We also used a bigger version of the g2 dataset with $n = 50$ items to run experiments with varying subset size, as shown in Figure 5 (Section 5.3). Its individual score vector is of the form
$$\theta_i = \begin{cases} 0.8, & \text{if } i = 1,\\ 0.1, & i \in [50] \setminus \{1\}. \end{cases}$$

E.2 Pairwise preference matrices used in synthetic experiments with pairwise-subset choice model

We run experiments on two synthetic preference matrices, arith-pref and arxiv-pref, for the pairwise-subset choice model. The datasets are shown in Tables 3 and 4 respectively.
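The utility scores of Section E.1 induce choice and pairwise preference probabilities that can be computed directly. A minimal sketch, assuming the standard linear (multinomial-logit-style) choice form $P(i \mid S) = \theta_i / \sum_{j \in S} \theta_j$; the functional form and helper name are assumptions for illustration, not necessarily the paper's exact model:

```python
def linear_choice_probs(theta, subset):
    """Winning probability of each item in `subset` under a linear
    (MNL-style) subset choice model: P(i | S) = theta_i / sum_{j in S} theta_j.
    Assumed functional form, for illustration only."""
    total = sum(theta[i] for i in subset)
    return {i: theta[i] / total for i in subset}

# g2-style scores: item 0 plays the role of the Condorcet winner.
theta = [0.8] + [0.1] * 14
probs = linear_choice_probs(theta, subset=[0, 1, 2])
print(probs)

# Pairwise preference of item 0 over any other item j reduces to
# theta_0 / (theta_0 + theta_j), which exceeds 1/2 whenever theta_0 > theta_j.
p01 = linear_choice_probs(theta, subset=[0, 1])[0]
print(round(p01, 4))  # 0.8889
```

Under this form, item 1 having the strictly largest score immediately makes it the Condorcet winner, which is the property all four score vectors are designed to satisfy.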
Arith preference dataset

Table 3: arith-pref: Arith preference matrix

Arxiv preference dataset

Table 4: arxiv-pref: Arxiv preference matrix

E.3 Pairwise preference matrices used in real world experiments with pairwise-subset choice model

Hudry dataset
The Hudry tournament data is a well-studied tournament on 13 items, which has the special property of being the smallest tournament for which the Banks and Copeland sets are different, as shown in [6]. The original dataset does not contain a Condorcet item; hence we delete three of the 13 items so that the resulting preference matrix contains a Condorcet winner. The preference matrix is shown in Table 5.

Car dataset
This dataset contains pairwise preferences over 10 cars given by 60 users, where each car has 4 features. From the user preferences, we compute the underlying pairwise preference matrix Q, where Q_ij is the empirical fraction of times car i is preferred over car j by the users. The preference matrix obtained in this way for Car is given in Table 6.

Sushi dataset
This dataset contains over 100 sushis rated according to user preferences, where each sushi is represented by 7 features. Similar to Car, we construct the underlying pairwise preference matrix Q such that Q_ij is the empirical fraction of times sushi type i is preferred over sushi type j. We further sample 16 sushis out of these 100 such that the underlying preference matrix contains a Condorcet winner. The dataset obtained is given in Table 7.

Tennis dataset
The tennis preference matrix compiles the all-time win-loss results of tennis matches among 8 international tennis players, as recorded by the Association of Tennis Professionals (ATP). For each pair of players (arms), say i and j, Q_ij is set to be the fraction of matches between i and j that were won by i. The dataset is adopted from the Tennis data used by [6], which has item 1 (player 1) as its Condorcet winner. The resulting pairwise preference matrix is given in Table 8.

Table 5: Hudry Dataset
Table 6: Car Dataset
Table 7: Sushi Dataset
Table 8: Tennis Dataset
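All the preference matrices above are constructed (or pruned) so as to contain a Condorcet winner, i.e., an item i with Q_ij > 1/2 for all j ≠ i. A small standalone check of this property, with a hypothetical helper name:

```python
def condorcet_winner(Q):
    """Return the index of the Condorcet winner of pairwise preference matrix
    Q (with Q[i][j] = P(i beats j)), or None if no such item exists."""
    n = len(Q)
    for i in range(n):
        if all(Q[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None

# Toy 3-item matrix: item 0 beats everyone, so it is the Condorcet winner.
Q = [
    [0.5, 0.7, 0.9],
    [0.3, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
print(condorcet_winner(Q))        # 0

# A cyclic matrix (0 beats 1, 1 beats 2, 2 beats 0) has no Condorcet winner.
Q_cycle = [
    [0.5, 0.6, 0.4],
    [0.4, 0.5, 0.6],
    [0.6, 0.4, 0.5],
]
print(condorcet_winner(Q_cycle))  # None
```

The cyclic example illustrates why, e.g., the Hudry tournament required deleting items before use: tournaments in general need not admit a Condorcet winner.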
More informationNoisy Streaming PCA. Noting g t = x t x t, rearranging and dividing both sides by 2η we get
Supplementary Material A. Auxillary Lemmas Lemma A. Lemma. Shalev-Shwartz & Ben-David,. Any update of the form P t+ = Π C P t ηg t, 3 for an arbitrary sequence of matrices g, g,..., g, projection Π C onto
More informationMath 421, Homework #9 Solutions
Math 41, Homework #9 Solutions (1) (a) A set E R n is said to be path connected if for any pair of points x E and y E there exists a continuous function γ : [0, 1] R n satisfying γ(0) = x, γ(1) = y, and
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationThe Epoch-Greedy Algorithm for Contextual Multi-armed Bandits John Langford and Tong Zhang
The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits John Langford and Tong Zhang Presentation by Terry Lam 02/2011 Outline The Contextual Bandit Problem Prior Works The Epoch Greedy Algorithm
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationLearning from paired comparisons: three is enough, two is not
Learning from paired comparisons: three is enough, two is not University of Cambridge, Statistics Laboratory, UK; CNRS, Paris School of Economics, France Paris, December 2014 2 Pairwise comparisons A set
More informationThe Goldbach Conjecture An Emergence Effect of the Primes
The Goldbach Conjecture An Emergence Effect of the Primes Ralf Wüsthofen ¹ October 31, 2018 ² ABSTRACT The present paper shows that a principle known as emergence lies beneath the strong Goldbach conjecture.
More informationLecture 2: Minimax theorem, Impagliazzo Hard Core Lemma
Lecture 2: Minimax theorem, Impagliazzo Hard Core Lemma Topics in Pseudorandomness and Complexity Theory (Spring 207) Rutgers University Swastik Kopparty Scribe: Cole Franks Zero-sum games are two player
More informationStratégies bayésiennes et fréquentistes dans un modèle de bandit
Stratégies bayésiennes et fréquentistes dans un modèle de bandit thèse effectuée à Telecom ParisTech, co-dirigée par Olivier Cappé, Aurélien Garivier et Rémi Munos Journées MAS, Grenoble, 30 août 2016
More informationBURGESS INEQUALITY IN F p 2. Mei-Chu Chang
BURGESS INEQUALITY IN F p 2 Mei-Chu Chang Abstract. Let be a nontrivial multiplicative character of F p 2. We obtain the following results.. Given ε > 0, there is δ > 0 such that if ω F p 2\F p and I,
More informationGambling in a rigged casino: The adversarial multi-armed bandit problem
Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at
More informationNotes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed
CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed
More informationWE INTRODUCE and study a collaborative online learning
IEEE/ACM TRANSACTIONS ON NETWORKING Collaborative Learning of Stochastic Bandits Over a Social Network Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan Abstract We consider a collaborative online
More informationThe Multi-Armed Bandit Problem
Università degli Studi di Milano The bandit problem [Robbins, 1952]... K slot machines Rewards X i,1, X i,2,... of machine i are i.i.d. [0, 1]-valued random variables An allocation policy prescribes which
More informationHybrid Machine Learning Algorithms
Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised
More informationSaturday, September 7, 2013 TEST BOOKLET. Test Version A. Your test version (A, B, C, or D) is above on this page.
AdvAntAge testing FoundAtion MAth The Fifth Prize Annual For girls Math Prize for Girls Saturday, September 7, 2013 TEST BOOKLET Test Version A DIRECTIONS 1. Do not open this test until your proctor instructs
More informationTHE STRUCTURE OF RAINBOW-FREE COLORINGS FOR LINEAR EQUATIONS ON THREE VARIABLES IN Z p. Mario Huicochea CINNMA, Querétaro, México
#A8 INTEGERS 15A (2015) THE STRUCTURE OF RAINBOW-FREE COLORINGS FOR LINEAR EQUATIONS ON THREE VARIABLES IN Z p Mario Huicochea CINNMA, Querétaro, México dym@cimat.mx Amanda Montejano UNAM Facultad de Ciencias
More informationCOMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017
COMS 4721: Machine Learning for Data Science Lecture 20, 4/11/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University SEQUENTIAL DATA So far, when thinking
More informationIntroduction to Information Entropy Adapted from Papoulis (1991)
Introduction to Information Entropy Adapted from Papoulis (1991) Federico Lombardo Papoulis, A., Probability, Random Variables and Stochastic Processes, 3rd edition, McGraw ill, 1991. 1 1. INTRODUCTION
More information1 Closest Pair of Points on the Plane
CS 31: Algorithms (Spring 2019): Lecture 5 Date: 4th April, 2019 Topic: Divide and Conquer 3: Closest Pair of Points on a Plane Disclaimer: These notes have not gone through scrutiny and in all probability
More informationProblem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)
Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationSequences, their sums and Induction
Sequences, their sums and Induction Example (1) Calculate the 15th term of arithmetic progression, whose initial term is 2 and common differnce is 5. And its n-th term? Find the sum of this sequence from
More informationNOTES ON HYPERBOLICITY CONES
NOTES ON HYPERBOLICITY CONES Petter Brändén (Stockholm) pbranden@math.su.se Berkeley, October 2010 1. Hyperbolic programming A hyperbolic program is an optimization problem of the form minimize c T x such
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationTwo optimization problems in a stochastic bandit model
Two optimization problems in a stochastic bandit model Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishnan Journées MAS 204, Toulouse Outline From stochastic optimization
More informationThe Goldbach Conjecture An Emergence Effect of the Primes
The Goldbach Conjecture An Emergence Effect of the Primes Ralf Wüsthofen ¹ April 20, 2018 ² ABSTRACT The present paper shows that a principle known as emergence lies beneath the strong Goldbach conjecture.
More informationDiscover Relevant Sources : A Multi-Armed Bandit Approach
Discover Relevant Sources : A Multi-Armed Bandit Approach 1 Onur Atan, Mihaela van der Schaar Department of Electrical Engineering, University of California Los Angeles Email: oatan@ucla.edu, mihaela@ee.ucla.edu
More informationOnline Learning with Abstention. Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 2016a).
+ + + + + + Figure 6: Simple example of the benefits of learning with abstention (Cortes et al., 206a). A. Further Related Work Learning with abstention is a useful paradigm in applications where the cost
More informationIrredundant Families of Subcubes
Irredundant Families of Subcubes David Ellis January 2010 Abstract We consider the problem of finding the maximum possible size of a family of -dimensional subcubes of the n-cube {0, 1} n, none of which
More information15.S24 Sample Exam Solutions
5.S4 Sample Exam Solutions. In each case, determine whether V is a vector space. If it is not a vector space, explain why not. If it is, find basis vectors for V. (a) V is the subset of R 3 defined by
More informationarxiv: v1 [cs.lg] 21 Sep 2018
SIC - MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits arxiv:809.085v [cs.lg] 2 Sep 208 Etienne Boursier and Vianney Perchet,2 CMLA, ENS Cachan, CNRS, Université Paris-Saclay,
More informationLecture 4: Two-point Sampling, Coupon Collector s problem
Randomized Algorithms Lecture 4: Two-point Sampling, Coupon Collector s problem Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized Algorithms
More information