Efficient Optimal Learning for Contextual Bandits

Miroslav Dudik (mdudik@yahoo-inc.com), Daniel Hsu (djhsu@rci.rutgers.edu), Satyen Kale (skale@yahoo-inc.com), Nikos Karampatziakis (nk@cs.cornell.edu), John Langford (jl@yahoo-inc.com), Lev Reyzin (lreyzin@cc.gatech.edu), Tong Zhang (tzhang@stat.rutgers.edu)

Abstract

We address the problem of learning in an online setting where the learner repeatedly observes features, selects among a set of actions, and receives reward for the action taken. We provide the first efficient algorithm with an optimal regret. Our algorithm uses a cost-sensitive classification learner as an oracle and has a running time $\mathrm{polylog}(N)$, where $N$ is the number of classification rules among which the oracle might choose. This is exponentially faster than all previous algorithms that achieve optimal regret in this setting. Our formulation also enables us to create an algorithm with regret that is additive rather than multiplicative in feedback delay as in all previous work.

1 INTRODUCTION

The contextual bandit setting consists of the following loop repeated indefinitely:

1. The world presents context information as features $x$.
2. The learning algorithm chooses an action $a$ from $K$ possible actions.
3. The world presents a reward $r$ for the action.

The key difference between the contextual bandit setting and standard supervised learning is that only the reward of the chosen action is revealed. For example, after always choosing the same action several times in a row, the feedback given provides almost no basis to prefer the chosen action over another action. In essence, the contextual bandit setting captures the difficulty of exploration while avoiding the difficulty of credit assignment as in more general reinforcement learning settings.

The contextual bandit setting is a half-way point between standard supervised learning and full-scale reinforcement learning where it appears possible to construct algorithms with convergence rate guarantees similar to supervised learning. Many natural settings satisfy this half-way point, motivating the investigation of contextual bandit learning. For example, the problem of choosing interesting news articles or ads for users by internet companies can be naturally modeled as a contextual bandit setting. In the medical domain where discrete treatments are tested before approval, the process of deciding which patients are
eligible for a treatment takes contexts into account. More generally, we can imagine that in a future with personalized medicine, new treatments are essentially equivalent to new actions in a contextual bandit setting.

In the i.i.d. setting, the world draws a pair $(x, r)$ consisting of a context and a reward vector from some unknown distribution $D$, revealing $x$ in Step 1, but only the reward $r(a)$ of the chosen action $a$ in Step 3. Given a set of policies $\Pi = \{\pi : X \to A\}$, the goal is to create an algorithm for Step 2 which competes with the set of policies. We measure our success by comparing the algorithm's cumulative reward to the expected cumulative reward of the best policy in the set. The difference of the two is called regret.

All existing algorithms for this setting either achieve a suboptimal regret (Langford and Zhang, 2007) or require computation linear in the number of policies (Auer et al., 2002b; Beygelzimer et al., 2011). In unstructured policy spaces, this computational complexity is the best one can hope for. On the other hand, in the case where the rewards of all actions are revealed, the problem is equivalent to cost-sensitive classification, and we know of algorithms to efficiently search the space of policies (classification rules) such as cost-sensitive logistic regression and support vector machines. In these cases, the space of classification rules is exponential in the number of features, but these problems can be efficiently solved using convex optimization.

Our goal here is to efficiently solve the contextual bandit problems for similarly large policy spaces. We do this by reducing the contextual bandit problem to cost-sensitive classification. Given a supervised cost-sensitive learning algorithm as an oracle (Beygelzimer et al., 2009), our algorithm runs in time only $\mathrm{polylog}(N)$ while achieving regret $O(\sqrt{TK \ln N})$, where $N$ is the number of possible policies (classification rules), $K$ is the number of actions (classes), and $T$ is the number of time steps. This efficiency is achieved in a modular way, so any future improvement in cost-sensitive learning immediately applies here.

1.1 PREVIOUS WORK AND MOTIVATION

All previous regret-optimal approaches are measure based: they work by updating a measure over policies, an operation which is linear in the number of policies. In contrast, regret guarantees scale only logarithmically in the number of policies. If not for the computational bottleneck, these regret guarantees imply that we could dramatically increase performance in contextual bandit settings using more expressive policies. We overcome the computational bottleneck using an algorithm which works by creating cost-sensitive classification instances and calling an oracle to choose optimal policies. Actions are chosen based on the policies returned by the oracle rather than according to a measure over all policies. This is reminiscent of AdaBoost (Freund and Schapire, 1997), which creates weighted binary classification instances and calls a weak learner oracle to obtain classification rules. These classification rules are then combined into a final classifier with boosted accuracy. Just as AdaBoost converts a weak learner into a strong learner, our approach converts a cost-sensitive classification learner into an algorithm that solves the contextual bandit problem.

In a more difficult version of contextual bandits, an adversary chooses $(x, r)$ given knowledge of the learning algorithm (but not any random numbers). All known regret-optimal solutions in the adversarial setting are variants of the EXP4 algorithm (Auer et al., 2002b). EXP4 achieves the same regret rate as our algorithm: $O(\sqrt{KT \ln N})$, where $T$ is the number of time steps, $K$ is the number
of actions available in each time step, and $N$ is the number of policies.

Why not use EXP4 in the i.i.d. setting? For example, it is known that the algorithm can be modified to succeed with high probability (Beygelzimer et al., 2011), and also for VC classes when the adversary is constrained to i.i.d. sampling. There are two central benefits that we hope to realize by directly assuming i.i.d. contexts and reward vectors.

1. Computational Tractability. Even when the reward vector is fully known, adversarial regrets scale as $O(\sqrt{\ln N})$ while computation scales as $O(N)$ in general. One attempt to get around this is the follow-the-perturbed-leader algorithm (Kalai and Vempala, 2005), which provides a computationally tractable solution in certain special-case structures. This algorithm has no mechanism for efficient application to arbitrary policy spaces, even given an efficient cost-sensitive classification oracle. An efficient cost-sensitive classification oracle has been shown effective in transductive settings (Kakade and Kalai, 2005). Aside from the drawback of requiring a transductive setting, the regret achieved there is substantially worse than for EXP4.

2. Improved Rates. When the world is not completely adversarial, it is possible to achieve substantially lower regrets than are possible with algorithms optimized for the adversarial setting. For example, in supervised learning, it is possible to obtain regrets scaling as $O(\log T)$ with a problem-dependent constant (Bartlett et al., 2007). When the feedback is delayed by $\tau$ rounds, lower bounds imply that the regret in the adversarial setting increases by a multiplicative $\sqrt{\tau}$ while in the i.i.d. setting, it is possible to achieve an additive regret of $\tau$ (Langford et al., 2009).

In a direct i.i.d. setting, the previous-best approach using a cost-sensitive classification oracle was given by the $\epsilon$-greedy and epoch-greedy algorithms (Langford and Zhang, 2007), which have a regret scaling as $O(T^{2/3})$ in the worst case.

There have also been many special-case analyses. For example, the theory of the context-free setting is well understood (Lai and Robbins, 1985; Auer et al., 2002a; Even-Dar et al., 2006). Similarly, good algorithms exist when rewards are linear functions of features (Auer, 2002) or actions lie in a continuous space with the reward function sampled according to a Gaussian process (Srinivas et al., 2010).

1.2 WHAT WE PROVE

In Section 3 we state
the PolicyElimination algorithm, and prove the following regret bound for it.

Theorem 4. For all distributions $D$ over $(x, r)$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1 - \delta$, the regret of PolicyElimination (Algorithm 1) over $T$ rounds is at most

$$16 \sqrt{2TK \ln \frac{4T^2 N}{\delta}}.$$

This result can be extended to deal with VC classes, as well as other special cases. It forms the simplest method we have of exhibiting the new analysis.

The new key element of this algorithm is identification of a distribution over actions which simultaneously achieves small expected regret and allows estimating the value of every policy with small variance. The existence of such a distribution is shown nonconstructively by a minimax argument.

PolicyElimination is computationally intractable and also requires exact knowledge of the context distribution (but not the reward distribution!). We show how to address these issues in Section 4 using an algorithm we call RandomizedUCB. Namely, we prove the following theorem.

Theorem 5. For all distributions $D$ over $(x, r)$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1 - \delta$, the regret of RandomizedUCB (Algorithm 2) over $T$ rounds is at most

$$O\left(\sqrt{TK \log(TN/\delta)} + K \log(NK/\delta)\right).$$

RandomizedUCB's analysis is substantially more complex, with a key subroutine being an application of the ellipsoid algorithm with a cost-sensitive classification oracle (described in Section 5). RandomizedUCB does not assume knowledge of the context distribution, and instead works with the history of contexts it has observed. Modifying the proof for this empirical distribution requires a covering argument over the distributions over policies which uses the probabilistic method. The net result is an algorithm with a top-level analysis similar to that of PolicyElimination, but with running time only poly-logarithmic in the number of policies given a cost-sensitive classification oracle.

Theorem 11. In each time step $t$, RandomizedUCB makes at most $O(\mathrm{poly}(t, K, \log(1/\delta), \log N))$ calls to the cost-sensitive classification oracle, and requires additional $O(\mathrm{poly}(t, K, \log N))$ processing time.

Apart from a tractable algorithm, our analysis can be used to derive tighter regrets than would be possible in the adversarial setting. For example, in Section 6, we consider a common setting where reward feedback is delayed by $\tau$ rounds. A straightforward modification of PolicyElimination yields a regret with an additive term proportional to $\tau$ compared with the delay-free setting. Namely, we prove the following.

Theorem 12. For all
distributions $D$ over $(x, r)$ with $K$ actions, for all sets of $N$ policies $\Pi$, and all delay intervals $\tau$, with probability at least $1 - \delta$, the regret of DelayedPE (Algorithm 3) is at most

$$16 \sqrt{2K \ln \frac{4T^2 N}{\delta}} \left(\tau + \sqrt{T}\right).$$

We start next with precise settings and definitions.

2 SETTING AND DEFINITIONS

2.1 THE SETTING

Let $A$ be the set of $K$ actions, let $X$ be the domain of contexts $x$, and let $D$ be an arbitrary joint distribution on $(x, r)$. We denote the marginal distribution of $D$ over $X$ by $D_X$.

We denote by $\Pi$ a finite set of policies $\{\pi : X \to A\}$, where each policy $\pi$, given a context $x_t$ in round $t$, chooses the action $\pi(x_t)$. The cardinality of $\Pi$ is denoted by $N$. Let $r_t \in [0, 1]^K$ be the vector of rewards, where $r_t(a)$ is the reward of action $a$ on round $t$.

In the i.i.d. setting, on each round $t = 1, \ldots, T$, the world chooses $(x_t, r_t)$ i.i.d. according to $D$ and reveals $x_t$ to the learner. The learner, having access to $\Pi$, chooses action $a_t \in \{1, \ldots, K\}$. Then the world reveals reward $r_t(a_t)$ (which we call $r_t$ for short) to the learner, and the interaction proceeds to the next round.

We consider two modes of accessing the set of policies $\Pi$. The first option is through the enumeration of all policies. This is impractical in general, but suffices for the illustrative purpose of our first algorithm. The second option is oracle access, through an argmax oracle, corresponding to a cost-sensitive learner:

Definition 1. For a set of policies $\Pi$, an argmax oracle (AMO for short) is an algorithm which, for any sequence $\{(x_{t'}, r_{t'})\}_{t'=1,\ldots,t}$ with $x_{t'} \in X$, $r_{t'} \in \mathbb{R}^K$, computes

$$\arg\max_{\pi \in \Pi} \sum_{t'=1}^{t} r_{t'}(\pi(x_{t'})).$$

The reason why the above can be viewed as a cost-sensitive classification oracle is that vectors of rewards $r_{t'}$ can be interpreted as negative costs and hence the policy returned by AMO is the optimal cost-sensitive classifier on the given data.

2.2 EXPECTED AND EMPIRICAL REWARDS

Let the expected instantaneous reward of a policy $\pi \in \Pi$ be denoted by

$$\eta_D(\pi) = \mathbb{E}_{(x, r) \sim D}\left[r(\pi(x))\right].$$
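To make Definition 1 concrete, here is a minimal sketch of an argmax oracle that simply enumerates the policy set. It is an illustration only: a practical AMO would be a cost-sensitive learner that avoids enumeration, and the names `argmax_oracle`, `always_zero`, and `always_one`, as well as the use of 0-based action indices, are assumptions made for this example rather than anything specified in the paper.

```python
from typing import Callable, Hashable, Sequence

# A policy maps a context to an action index in {0, ..., K-1} (0-based here).
Policy = Callable[[Hashable], int]


def argmax_oracle(
    policies: Sequence[Policy],
    contexts: Sequence[Hashable],
    rewards: Sequence[Sequence[float]],
) -> Policy:
    """Brute-force AMO: return the policy with the largest total reward.

    rewards[i][a] is the reward of action a on the i-th example, so a
    policy pi scores sum_i rewards[i][pi(contexts[i])].
    """
    if not policies:
        raise ValueError("the policy set must be non-empty")
    if len(contexts) != len(rewards):
        raise ValueError("contexts and rewards must have the same length")

    def total_reward(pi: Policy) -> float:
        return sum(r[pi(x)] for x, r in zip(contexts, rewards))

    # max() scans every policy, so this costs time linear in N.
    return max(policies, key=total_reward)


# Hypothetical two-policy class used only for illustration.
def always_zero(x: Hashable) -> int:
    return 0


def always_one(x: Hashable) -> int:
    return 1


policies = [always_zero, always_one]
best = argmax_oracle(policies, ["u", "v"], [[1.0, 0.0], [0.2, 0.5]])
```

In this toy call `always_zero` collects total reward 1.2 against 0.5 for `always_one`, so it is the policy returned. The linear scan is exactly the cost that the oracle-based algorithms in the paper are designed to avoid.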

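The best policy $\pi_{\max} \in \Pi$ is that which maximizes $\eta_D(\pi)$. More formally,

$$\pi_{\max} = \arg\max_{\pi \in \Pi} \eta_D(\pi).$$

We define $h_t$ to be the history at time $t$ that the learner has seen. Specifically,

$$h_t = \bigcup_{t'=1}^{t} (x_{t'}, a_{t'}, r_{t'}, p_{t'}),$$

where $p_{t'}$ is the probability of the algorithm choosing action $a_{t'}$ at time $t'$. Note that $a_{t'}$ and $p_{t'}$ are produced by the learner while $x_{t'}$, $r_{t'}$ are produced by nature. We write $x \sim h$ to denote choosing $x$ uniformly at random from the $x$'s in history $h$.

Using the history of past actions and probabilities with which they were taken, we can form an unbiased estimate of the policy value for any $\pi \in \Pi$:

$$\eta_t(\pi) = \frac{1}{t} \sum_{(x, a, r, p) \in h_t} \frac{r \, I(\pi(x) = a)}{p}.$$

The unbiasedness follows because

$$\mathbb{E}_{a \sim p}\left[\frac{r(a) \, I(\pi(x) = a)}{p(a)}\right] = \sum_{a} p(a) \, \frac{r(a) \, I(\pi(x) = a)}{p(a)} = r(\pi(x)).$$

The empirically best policy at time $t$ is denoted

$$\pi_t = \arg\max_{\pi \in \Pi} \eta_t(\pi).$$

2.3 REGRET

The goal of this work is to obtain a learner that has small regret relative to the expected performance of $\pi_{\max}$ over $T$ rounds, which is

$$\sum_{t=1}^{T} \left(\eta_D(\pi_{\max}) - r_t\right). \tag{2.1}$$

We say that the regret of the learner over $T$ rounds is bounded by $\epsilon$ with probability at least $1 - \delta$ if

$$\Pr\left[\sum_{t=1}^{T} \left(\eta_D(\pi_{\max}) - r_t\right) \le \epsilon\right] \ge 1 - \delta,$$

where the probability is taken with respect to the random pairs $(x_t, r_t) \sim D$ for $t = 1, \ldots, T$, as well as any internal randomness used by the learner.

We can also define notions of regret and empirical regret for policies $\pi$. For all $\pi \in \Pi$, let

$$\Delta_D(\pi) = \eta_D(\pi_{\max}) - \eta_D(\pi), \qquad \Delta_t(\pi) = \eta_t(\pi_t) - \eta_t(\pi).$$

Our algorithms work by choosing distributions over policies, which in turn induce distributions over actions. For any distribution $P$ over policies $\Pi$, let $W_P(x, a)$ denote the induced conditional distribution over actions $a$ given the context $x$:

$$W_P(x, a) = \sum_{\pi \in \Pi : \pi(x) = a} P(\pi). \tag{2.2}$$

In general, we shall use $W$, $W'$ and $Z$ as conditional probability distributions over the actions $A$ given contexts $X$, i.e., $W : X \times A \to [0, 1]$ such that $W(x, \cdot)$ is a probability distribution over $A$ (and similarly for $W'$ and $Z$). We shall think of $W'$ as a smoothed version of $W$ with a minimum action probability of $\mu$ (to be defined by the algorithm), such that

$$W'(x, a) = (1 - K\mu) W(x, a) + \mu.$$

Conditional distributions such as $W$ (and $W'$, $Z$, etc.) correspond to randomized policies. We define notions of true and empirical value and regret for them as follows:

$$\eta_D(W) = \mathbb{E}_{(x, r) \sim D}\left[r \cdot W(x)\right], \qquad \eta_t(W) = \frac{1}{t} \sum_{(x, a, r, p) \in h_t} \frac{r \, W(x, a)}{p},$$

$$\Delta_D(W) = \eta_D(\pi_{\max}) - \eta_D(W), \qquad \Delta_t(W) = \eta_t(\pi_t) - \eta_t(W).$$

3 POLICY ELIMINATION

The basic ideas behind our approach are demonstrated in our first algorithm: PolicyElimination (Algorithm 1).

The key step is Step 1, which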
finds a distribution over policies which induces low variance in the estimate of the value of all policies. Below we use a minimax theorem to show that such a distribution always exists. How to find this distribution is not specified here, but in Section 5 we develop a method based on the ellipsoid algorithm. Step 2 then projects this distribution onto a distribution over actions and applies smoothing. Finally, Step 5 eliminates the policies that have been determined to be suboptimal (with high probability).

ALGORITHM ANALYSIS

We analyze PolicyElimination in several steps. First, we prove the existence of $P_t$ in Step 1, provided that $\Pi_{t-1}$ is non-empty. We recast the feasibility problem in Step 1 as a game between two players: Prover, who is trying to produce $P_t$, and Falsifier, who is trying to find $\pi$ violating the constraints. We give more power to Falsifier and allow him to choose a distribution over $\pi$ (i.e., a randomized policy) which would violate the constraints.
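The projection and smoothing used in Step 2, together with the importance-weighted value estimate built from the history, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the representation of a distribution over policies as a list of weights, the 0-based action indices, and the history format of `(x, a, r, p)` tuples are choices made for this example, not the paper's implementation.

```python
from typing import Callable, Hashable, List, Sequence, Tuple

Policy = Callable[[Hashable], int]  # context -> action index in {0, ..., K-1}


def induced_action_distribution(
    policies: Sequence[Policy],
    weights: Sequence[float],
    x: Hashable,
    num_actions: int,
) -> List[float]:
    """W_P(x, a): total weight of the policies that pick action a on context x."""
    w = [0.0] * num_actions
    for pi, p in zip(policies, weights):
        w[pi(x)] += p
    return w


def smooth(w: Sequence[float], mu: float) -> List[float]:
    """W'(x, a) = (1 - K*mu) * W(x, a) + mu, so every action has probability >= mu."""
    k = len(w)
    return [(1.0 - k * mu) * wa + mu for wa in w]


def importance_weighted_value(
    policy: Policy,
    history: Sequence[Tuple[Hashable, int, float, float]],
) -> float:
    """Estimate of a policy's value from (x, a, r, p) records.

    Only rounds where the policy agrees with the logged action contribute,
    each reweighted by 1/p, where p is the probability with which the
    logged action was chosen.
    """
    total = sum(r / p for (x, a, r, p) in history if policy(x) == a)
    return total / len(history)


# Hypothetical example: two constant policies over K = 2 actions.
policies = [lambda x: 0, lambda x: 1]
w = induced_action_distribution(policies, [0.75, 0.25], "ctx", 2)
w_smoothed = smooth(w, mu=0.1)
value = importance_weighted_value(policies[0], [("u", 0, 1.0, 0.5), ("v", 1, 0.4, 0.5)])
```

Because the smoothed probabilities are bounded below by `mu`, each term `r / p` in the estimate is at most `1 / mu`, which is the range bound used later in the analysis.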

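Algorithm 1 PolicyElimination($\Pi$, $\delta$, $K$, $D_X$)

Let $\Pi_0 = \Pi$ and history $h_0 = \emptyset$
Define: $\delta_t = \delta / (4Nt^2)$
Define: $b_t = 2\sqrt{\dfrac{2K \ln(1/\delta_t)}{t}}$
Define: $\mu_t = \min\left\{\dfrac{1}{2K}, \sqrt{\dfrac{\ln(1/\delta_t)}{2Kt}}\right\}$
For each timestep $t = 1, \ldots, T$, observe $x_t$ and do:
1. Choose distribution $P_t$ over $\Pi_{t-1}$ s.t. $\forall \pi \in \Pi_{t-1}$: $\mathbb{E}_{x \sim D_X}\left[\dfrac{1}{(1 - K\mu_t) W_{P_t}(x, \pi(x)) + \mu_t}\right] \le 2K$
2. Let $W'_t(a) = (1 - K\mu_t) W_{P_t}(x_t, a) + \mu_t$ for all $a \in A$
3. Choose $a_t \sim W'_t$
4. Observe reward $r_t$
5. Let $\Pi_t = \left\{\pi \in \Pi_{t-1} : \eta_t(\pi) \ge \max_{\pi' \in \Pi_{t-1}} \eta_t(\pi') - 2b_t\right\}$
6. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$

Note that any policy $\pi$ corresponds to a point in the space of randomized policies (viewed as functions $X \times A \to [0, 1]$), with $\pi(x, a) = I(\pi(x) = a)$. For any distribution $P$ over policies in $\Pi_{t-1}$, the induced randomized policy $W_P$ then corresponds to a point in the convex hull of $\Pi_{t-1}$. Denoting the convex hull of $\Pi_{t-1}$ by $C$, Prover's choice by $W$ and Falsifier's choice by $Z$, the feasibility of Step 1 follows from the following lemma:

Lemma 1. Let $C$ be a compact and convex set of randomized policies. Let $\mu \in (0, 1/K]$ and for any $W \in C$, $W'(x, a) = (1 - K\mu) W(x, a) + \mu$. Then for all distributions $D$,

$$\min_{W \in C} \max_{Z \in C} \; \mathbb{E}_{x \sim D_X} \mathbb{E}_{a \sim Z(x, \cdot)}\left[\frac{1}{W'(x, a)}\right] \le \frac{K}{1 - K\mu}.$$

Proof. Let $f(W, Z) = \mathbb{E}_{x \sim D_X} \mathbb{E}_{a \sim Z(x, \cdot)}\left[1/W'(x, a)\right]$ denote the inner expression of the minimax problem. Note that $f(W, Z)$ is:

- everywhere defined: since $W'(x, a) \ge \mu$, we obtain that $1/W'(x, a) \in (0, 1/\mu]$, hence the expectations are defined for all $W$ and $Z$.
- linear in $Z$: linearity follows from rewriting $f(W, Z)$ as $f(W, Z) = \mathbb{E}_{x \sim D_X}\left[\sum_{a \in A} \dfrac{Z(x, a)}{W'(x, a)}\right]$.
- convex in $W$: note that $1/W'(x, a)$ is convex in $W(x, a)$ by convexity of $1/(c_1 w + c_2)$ in $w \ge 0$, for $c_1 \ge 0$, $c_2 > 0$. Convexity of $f(W, Z)$ in $W$ then follows by taking expectations over $x$ and $a$.

Hence, by Theorem 14 in Appendix B, min and max can be reversed without affecting the value:

$$\min_{W \in C} \max_{Z \in C} f(W, Z) = \max_{Z \in C} \min_{W \in C} f(W, Z).$$

The right-hand side can be further upper-bounded by $\max_{Z \in C} f(Z, Z)$, which is upper-bounded by

$$f(Z, Z) = \mathbb{E}_{x \sim D_X}\left[\sum_{a \in A} \frac{Z(x, a)}{Z'(x, a)}\right] \le \mathbb{E}_{x \sim D_X}\left[\sum_{a \in A : Z(x, a) > 0} \frac{Z(x, a)}{(1 - K\mu) Z(x, a)}\right] \le \frac{K}{1 - K\mu}.$$

Corollary 2. The set of distributions satisfying the constraints of Step 1 is non-empty.

Given the existence of $P_t$, we will see below that the constraints in Step 1 ensure low variance of the policy value estimator $\eta_t(\pi)$ for all $\pi \in \Pi_{t-1}$. The small variance is used to ensure accuracy of policy elimination in Step 5, as quantified in the following lemma:

Lemma 3. With probability at least $1 - \delta$, for all $t$:

1. $\pi_{\max} \in \Pi_t$ (i.e., $\Pi_t$ is non-empty)
2. $\eta_D(\pi_{\max}) - \eta_D(\pi) \le 4b_t$ for all $\pi \in \Pi_t$

Proof. We will show that for any policy $\pi \in \Pi$, the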
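probability that $\eta_t(\pi)$ deviates from $\eta_D(\pi)$ by more than $b_t$ is at most $2\delta_t$. Taking the union bound over all policies and all time steps we find that with probability at least $1 - \delta$,

$$|\eta_t(\pi) - \eta_D(\pi)| \le b_t \tag{3.1}$$

for all $t$ and all $\pi \in \Pi$. Then:

1. By the triangle inequality, in each time step, $\eta_t(\pi) \le \eta_t(\pi_{\max}) + 2b_t$ for all $\pi \in \Pi_{t-1}$, yielding the first part of the lemma.
2. Also by the triangle inequality, if $\eta_D(\pi) < \eta_D(\pi_{\max}) - 4b_t$ for $\pi \in \Pi_{t-1}$, then $\eta_t(\pi) < \eta_t(\pi_{\max}) - 2b_t$. Hence the policy $\pi$ is eliminated in Step 5, yielding the second part of the lemma.

It remains to show Eq. (3.1). We fix the policy $\pi \in \Pi$ and time $t$, and show that the deviation bound is violated with probability at most $2\delta_t$. Our argument rests on Freedman's inequality (see Theorem 13 in Appendix A). Let

$$y_t = \frac{r_t \, I(\pi(x_t) = a_t)}{W'_t(a_t)},$$

i.e., $\eta_t(\pi) = \left(\sum_{t'=1}^{t} y_{t'}\right)/t$. Let $\mathbb{E}_t$ denote the conditional expectation $\mathbb{E}[\,\cdot \mid h_{t-1}]$. To use Freedman's inequality, we need to bound the range of $y_t$ and its conditional second moment $\mathbb{E}_t[y_t^2]$.

Since $r_t \in [0, 1]$ and $W'_t(a_t) \ge \mu_t$, we have the bound

$$0 \le y_t \le 1/\mu_t =: R_t.$$

Next,

$$\mathbb{E}_t[y_t^2] = \mathbb{E}_{(x_t, r_t) \sim D} \, \mathbb{E}_{a_t \sim W'_t}\left[y_t^2\right] = \mathbb{E}_{(x_t, r_t) \sim D} \, \mathbb{E}_{a_t \sim W'_t}\left[\frac{r_t^2 \, I(\pi(x_t) = a_t)}{W'_t(a_t)^2}\right]$$

$$\le \mathbb{E}_{(x_t, r_t) \sim D}\left[\frac{W'_t(\pi(x_t))}{W'_t(\pi(x_t))^2}\right] \tag{3.2}$$

$$= \mathbb{E}_{x_t \sim D_X}\left[\frac{1}{W'_t(\pi(x_t))}\right] \le 2K, \tag{3.3}$$

where Eq. (3.2) follows by boundedness of $r_t$ and Eq. (3.3) follows from the constraints in Step 1. Hence,

$$\sum_{t'=1}^{t} \mathbb{E}_{t'}[y_{t'}^2] \le 2Kt =: V_t.$$

Since $(\ln t)/t$ is decreasing for $t \ge 3$, we obtain that $\mu_t$ is non-increasing (by separately analyzing $t = 1$, $t = 2$, $t \ge 3$). Let $t_0$ be the first $t$ such that $\mu_t < 1/(2K)$. Note that $b_t \ge 4K\mu_t$, so for $t < t_0$, we have $b_t \ge 2$ and $\Pi_t = \Pi$. Hence, the deviation bound holds for $t < t_0$.

Let $t \ge t_0$. For $t' \le t$, by the monotonicity of $\mu_t$,

$$R_{t'} = 1/\mu_{t'} \le 1/\mu_t = \sqrt{\frac{2Kt}{\ln(1/\delta_t)}} = \sqrt{\frac{V_t}{\ln(1/\delta_t)}}.$$

Hence, the assumptions of Theorem 13 are satisfied, and

$$\Pr\left[|\eta_t(\pi) - \eta_D(\pi)| \ge b_t\right] \le 2\delta_t.$$

The union bound over $\pi$ and $t$ yields Eq. (3.1).

This immediately implies that the cumulative regret is bounded by

$$\sum_{t=1}^{T} \left(\eta_D(\pi_{\max}) - r_t\right) \le 8\sqrt{2K \ln \frac{4NT^2}{\delta}} \sum_{t=1}^{T} \frac{1}{\sqrt{t}} \le 16\sqrt{2TK \ln \frac{4T^2 N}{\delta}} \tag{3.4}$$

and gives us the following theorem.

Theorem 4. For all distributions $D$ over $(x, r)$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1 - \delta$, the regret of PolicyElimination (Algorithm 1) over $T$ rounds is at most

$$16 \sqrt{2TK \ln \frac{4T^2 N}{\delta}}.$$

4 THE RANDOMIZED UCB ALGORITHM

Algorithm 2 RandomizedUCB($\Pi$, $\delta$, $K$)

Let $h_0 = \emptyset$ be the initial history.
Define the following quantities: $C_t = 2 \log\left(\dfrac{Nt}{\delta}\right)$ and $\mu_t = \min\left\{\dfrac{1}{2K}, \sqrt{\dfrac{C_t}{2Kt}}\right\}$
For each timestep $t = 1, \ldots, T$, observe $x_t$ and do:
1. Let $P_t$ be a distribution over $\Pi$ that approximately solves the optimization problem

$$\min_{P} \sum_{\pi \in \Pi} P(\pi) \Delta_{t-1}(\pi)$$

$$\text{s.t. for all distributions } Q \text{ over } \Pi: \quad \mathbb{E}_{\pi \sim Q}\left[\frac{1}{t-1} \sum_{i=1}^{t-1} \frac{1}{(1 - K\mu_t) W_P(x_i, \pi(x_i)) + \mu_t}\right] \le \max\left\{4K, \frac{(t-1) \Delta_{t-1}(W_Q)^2}{180 \, C_{t-1}}\right\} \tag{4.1}$$

so that the objective value at $P_t$ is within $\varepsilon_{\mathrm{opt},t} = O(\sqrt{KC_t/t})$ of the optimal value, and so that each constraint is satisfied with slack $\le K$.
2. Let $W'_t$ be the distribution over $A$ given by $W'_t(a) = (1 - K\mu_t) W_{P_t}(x_t, a) + \mu_t$ for all $a \in A$.
3. Choose $a_t \sim W'_t$.
4. Observe reward $r_t$.
5. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$.

PolicyElimination is the simplest exhibition of the minimax argument, but it has some drawbacks:

1. The algorithm keeps explicit track of the space of good policies (like a version space), which is difficult to implement efficiently in general.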

2. If the optimal policy is mistakenly eliminated by chance, the algorithm can never recover.
3. The algorithm requires perfect knowledge of the distribution $D_X$ over contexts.

These difficulties are addressed by RandomizedUCB (or RUCB for short), an algorithm which we present and analyze in this section. Our approach is reminiscent of the UCB algorithm (Auer et al., 2002a), developed for the context-free setting, which keeps an upper-confidence bound on the expected reward for each action. However, instead of choosing the highest upper confidence bound, we randomize over choices according to the value of their empirical performance. The algorithm has the following properties:

1. The optimization step required by the algorithm always considers the full set of policies (i.e., explicit tracking of the set of good policies is avoided), and thus it can be efficiently implemented using an argmax oracle. We discuss this further in Section 5.
2. Suboptimal policies are implicitly used with decreasing frequency by using a non-uniform variance constraint that depends on a policy's estimated regret. A consequence of this is a bound on the value of the optimization, stated in Lemma 7 below.
3. Instead of $D_X$, the algorithm uses the history of previously seen contexts. The effect of this approximation is quantified in Theorem 6 below.

The regret of RandomizedUCB is the following:

Theorem 5. For all distributions $D$ over $(x, r)$ with $K$ actions, for all sets of $N$ policies $\Pi$, with probability at least $1 - \delta$, the regret of RandomizedUCB (Algorithm 2) over $T$ rounds is at most

$$O\left(\sqrt{TK \log(TN/\delta)} + K \log(NK/\delta)\right).$$

The proof is given in Appendix D.4. Here, we present an overview of the analysis.

4.1 EMPIRICAL VARIANCE ESTIMATES

A key technical prerequisite for the regret analysis is the accuracy of the empirical variance estimates. For a distribution $P$ over policies $\Pi$ and a particular policy $\pi \in \Pi$, define

$$V_{P,\pi,t} = \mathbb{E}_{x \sim D_X}\left[\frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}\right], \qquad \hat{V}_{P,\pi,t} = \frac{1}{t-1} \sum_{i=1}^{t-1} \frac{1}{(1 - K\mu_t) W_P(x_i, \pi(x_i)) + \mu_t}.$$

The first quantity $V_{P,\pi,t}$ is (a bound on) the variance incurred by an importance-weighted estimate of reward in round $t$ using the action distribution induced by $P$, and the second quantity $\hat{V}_{P,\pi,t}$ is an empirical estimate of $V_{P,\pi,t}$ using the finite sample $\{x_1, \ldots, x_{t-1}\} \subseteq X$ drawn from $D_X$. We show that for all distributions $P$ and all $\pi \in \Pi$, $\hat{V}_{P,\pi,t}$ is
close to $V_{P,\pi,t}$ with high probability.

Theorem 6. For any $\epsilon \in (0, 1)$, with probability at least $1 - \delta$,

$$V_{P,\pi,t} \le (1 + \epsilon) \hat{V}_{P,\pi,t} + \frac{7500}{\epsilon^3} K$$

for all distributions $P$ over $\Pi$, all $\pi \in \Pi$, and all $t \ge 16K \log(8KN/\delta)$.

The proof appears in Appendix C.

4.2 REGRET ANALYSIS

Central to the analysis is the following lemma that bounds the value of the optimization in each round. It is a direct corollary of Lemma 24 in Appendix D.4.

Lemma 7. If $\mathrm{OPT}_t$ is the value of the optimization problem (4.1) in round $t$, then

$$\mathrm{OPT}_t \le O\left(\sqrt{\frac{KC_{t-1}}{t}}\right) = O\left(\sqrt{\frac{K \log(Nt/\delta)}{t}}\right).$$

This lemma implies that the algorithm is always able to select a distribution over the policies that focuses mostly on the policies with low estimated regret. Moreover, the variance constraints ensure that good policies never appear too bad, and that only bad policies are allowed to incur high variance in their reward estimates. Hence, minimizing the objective in (4.1) is an effective surrogate for minimizing regret.

The bulk of the analysis consists of analyzing the variance of the importance-weighted reward estimates $\eta_t(\pi)$, and showing how they relate to their actual expected rewards $\eta_D(\pi)$. The details are deferred to Appendix D.

5 USING AN ARGMAX ORACLE

In this section, we show how to solve the optimization problem (4.1) using the argmax oracle (AMO) for our set of policies. Namely, we describe an algorithm running in polynomial time independent of the number of policies [1], which makes queries to AMO to compute a distribution over policies suitable for the optimization step of Algorithm 2.

[1] Or rather dependent only on $\log N$, the representation size of a policy.

This algorithm relies on the ellipsoid method. The ellipsoid method is a general technique for solving convex programs equipped with a separation oracle. A separation oracle is defined as follows:

Definition 2. Let $S$ be a convex set in $\mathbb{R}^n$. A separation oracle for $S$ is an algorithm that, given a point $x \in \mathbb{R}^n$, either declares correctly that $x \in S$, or produces a hyperplane $H$ such that $x$ and $S$ are on opposite sides of $H$.

We do not describe the ellipsoid algorithm here (since it is standard), but only spell out its key properties in the following lemma. For a point $x \in \mathbb{R}^n$ and $r \ge 0$, we use the notation $B(x, r)$ to denote the $\ell_2$ ball of radius $r$ centered at $x$.

Lemma 8. Suppose we are required to decide whether a convex set $S \subseteq \mathbb{R}^n$ is empty or not. We are given a separation oracle for $S$ and two numbers $R$ and $r$, such that $S \subseteq B(0, R)$ and if $S$ is non-empty, then there is a point $x^\star$ such that $S \supseteq B(x^\star, r)$. The ellipsoid algorithm decides correctly if $S$ is empty or not, by executing at most $O(n^2 \log(R/r))$ iterations, each involving one call to the separation oracle and additional $O(n^2)$ processing time.

We now write a convex program whose solution is the required distribution, and show how to solve it using the ellipsoid method by giving a separation oracle for its feasible set using AMO.

Fix a time period $t$. Let $X_{t-1}$ be the set of all contexts seen so far, i.e. $X_{t-1} = \{x_1, x_2, \ldots, x_{t-1}\}$. We embed all policies $\pi \in \Pi$ in $\mathbb{R}^{(t-1)K}$, with coordinates identified with $(x, a) \in X_{t-1} \times A$. With abuse of notation, a policy $\pi$ is represented by the vector $\pi$ with coordinate $\pi(x, a) = 1$ if $\pi(x) = a$ and $0$ otherwise. Let $C$ be the convex hull of all policy vectors $\pi$. Recall that a distribution $P$ over policies corresponds to a point inside $C$, i.e., $W_P(x, a) = \sum_{\pi : \pi(x) = a} P(\pi)$, and that $W'(x, a) = (1 - \mu_t K) W(x, a) + \mu_t$, where $\mu_t$ is as defined in Algorithm 2. Also define $\beta_t = \dfrac{t-1}{180 \, C_{t-1}}$. In the following, we use the notation $x \sim h_{t-1}$ to denote a context drawn uniformly at random from $X_{t-1}$.

Consider the following convex program:

$$\min s \quad \text{s.t.}$$

$$\Delta_{t-1}(W) \le s \tag{5.1}$$

$$W \in C \tag{5.2}$$

$$\forall Z \in C : \; \mathbb{E}_{x \sim h_{t-1}}\left[\sum_{a} \frac{Z(x, a)}{W'(x, a)}\right] \le \max\left\{4K, \beta_t \Delta_{t-1}(Z)^2\right\} \tag{5.3}$$

We claim that this program is equivalent to the RUCB optimization problem (4.1), up to finding an explicit distribution over policies which corresponds to the optimal solution. This can be seen as follows. Since we require $W \in C$, it can be interpreted as being equal to $W_P$ for some distribution over policies $P$. The
constraints (5.3) are equivalent to (4.1) by the substitution $Z = W_Q$.

The above convex program can be solved by performing a binary search over $s$ and testing feasibility of the constraints. For a fixed value of $s$, the feasibility problem defined by (5.1)-(5.3) is denoted by $\mathcal{A}$.

We now give a sketch of how we construct a separation oracle for the feasible region of $\mathcal{A}$. The details of the algorithm are a bit complicated due to the fact that we need to ensure that the feasible region, when non-empty, has a non-negligible volume (recall the requirements of Lemma 8). This necessitates having a small error in satisfying the constraints of the program. We leave the details to Appendix E. Modulo these details, the construction of the separation oracle essentially implies that we can solve $\mathcal{A}$.

Before giving the construction of the separation oracle, we first show that AMO allows us to do linear optimization over $C$ efficiently:

Lemma 9. Given a vector $w \in \mathbb{R}^{(t-1)K}$, we can compute $\arg\max_{Z \in C} w \cdot Z$ using one invocation of AMO.

Proof. The sequence for AMO consists of $x_{t'} \in X_{t-1}$ and $r_{t'}(a) = w(x_{t'}, a)$. The lemma now follows since $w \cdot \pi = \sum_{x \in X_{t-1}} w(x, \pi(x))$.

We need another simple technical lemma which explains how to get a separating hyperplane for violations of convex constraints:

Lemma 10. For $x \in \mathbb{R}^n$, let $f(x)$ be a convex function of $x$, and consider the convex set $K$ defined by $K = \{x : f(x) \le 0\}$. Suppose we have a point $y$ such that $f(y) > 0$. Let $\nabla f(y)$ be a subgradient of $f$ at $y$. Then the hyperplane $f(y) + \nabla f(y) \cdot (x - y) = 0$ separates $y$ from $K$.

Proof. Let $g(x) = f(y) + \nabla f(y) \cdot (x - y)$. By the convexity of $f$, we have $f(x) \ge g(x)$ for all $x$. Thus, for any $x \in K$, we have $g(x) \le f(x) \le 0$. Since $g(y) = f(y) > 0$, we conclude that $g(x) = 0$ separates $y$ from $K$.

Now given a candidate point $W$, a separation oracle can be constructed as follows. We check whether $W$ satisfies the constraints of $\mathcal{A}$. If any constraint is violated, then we find a hyperplane separating $W$ from all points satisfying the constraint.
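The subgradient cut of Lemma 10 can be illustrated with a small sketch. The helper name `separating_hyperplane` and the unit-disk example are assumptions chosen for illustration; the paper applies the lemma to the constraints of its convex program, not to this toy set.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def separating_hyperplane(
    f: Callable[[Vector], float],
    subgradient: Callable[[Vector], Vector],
    y: Vector,
) -> Callable[[Vector], float]:
    """Return g(x) = f(y) + subgradient(y) . (x - y) for a violated point y.

    For convex f with f(y) > 0, g(y) > 0 while g(x) <= f(x) <= 0 for every
    x in the set {x : f(x) <= 0}, so the hyperplane g(x) = 0 separates y
    from that set.
    """
    fy = f(y)
    if fy <= 0:
        raise ValueError("y satisfies the constraint; there is nothing to separate")
    grad = list(subgradient(y))
    y_fixed = list(y)

    def g(x: Vector) -> float:
        return fy + sum(gi * (xi - yi) for gi, xi, yi in zip(grad, x, y_fixed))

    return g


# Toy example: the unit disk {x : x1^2 + x2^2 - 1 <= 0} and the point (2, 0).
def disk(x: Vector) -> float:
    return x[0] ** 2 + x[1] ** 2 - 1.0


def disk_gradient(x: Vector) -> Vector:
    return [2.0 * x[0], 2.0 * x[1]]


cut = separating_hyperplane(disk, disk_gradient, [2.0, 0.0])
```

Here the cut works out to `g(x) = 4*x1 - 5`, which is positive at the violated point `(2, 0)` and negative on the whole unit disk, since `x1 <= 1` there.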

1. First, for constraint (5.1), note that $\eta_{t-1}(W)$ is linear in $W$, and so we can compute $\max_\pi \eta_{t-1}(\pi)$ via AMO as in Lemma 9. We can then compute $\eta_{t-1}(W)$ and check if the constraint is satisfied. If not, then the constraint, being linear, automatically yields a separating hyperplane.

2. Next, we consider constraint (5.2). To check if $W \in C$, we use the perceptron algorithm. We shift the origin to $W$, and run the perceptron algorithm with all points $\pi \in \Pi$ being positive examples. The perceptron algorithm aims to find a hyperplane putting all policies $\pi \in \Pi$ on one side. In each iteration of the perceptron algorithm, we have a candidate hyperplane (specified by its normal vector), and then if there is a policy $\pi$ that is on the wrong side of the hyperplane, we can find it by running a linear optimization over $C$ in the negative normal vector direction as in Lemma 9. If $W \notin C$, then in a bounded number of iterations (depending on the distance of $W$ from $C$, and the maximum magnitude $\|\pi\|_2$) we obtain a separating hyperplane. In passing we also note that if $W \in C$, the same technique allows us to explicitly compute an approximate convex combination of policies in $\Pi$ that yields $W$. This is done by running the perceptron algorithm as before and stopping after the bound on the number of iterations has been reached. Then we collect all the policies we have found in the run of the perceptron algorithm, and we are guaranteed that $W$ is close in distance to their convex hull. We can then find the closest point in the convex hull of these policies by solving a simple quadratic program.

3. Finally, we consider constraint (5.3). We rewrite $\eta_{t-1}(W)$ as $\eta_{t-1}(W) = w \cdot W$, where $w(x_{t'}, a) = \frac{1}{t-1}\, r_{t'}\, I(a_{t'} = a) / W'_{t'}(a_{t'})$. Thus, $\Delta_{t-1}(Z) = v - w \cdot Z$, where $v = \max_{\pi'} \eta_{t-1}(\pi') = \max_{\pi'} w \cdot \pi'$, which can be computed by using AMO once. Next, using the candidate point $W$, compute the vector $u$ defined as $u(x, a) = \frac{n_x}{(t-1)\, W'(x, a)}$, where $n_x$ is the number of times $x$ appears in $h_{t-1}$, so that $\mathbb{E}_{x \sim h_{t-1}}\big[\sum_a \frac{Z(x, a)}{W'(x, a)}\big] = u \cdot Z$. Now, the problem reduces to finding a point $Z \in C$ which violates the constraint

$$u \cdot Z \le \max\{4K,\ \beta_t (w \cdot Z - v)^2\}.$$

Define $f(Z) = \max\{4K,\ \beta_t (w \cdot Z - v)^2\} - u \cdot Z$. Note that $f$ is a convex function of $Z$. Finding a point $Z$ that violates the above constraint is equivalent to solving the following convex program:

$$f(Z) \le 0 \qquad (5.4)$$
$$Z \in C \qquad (5.5)$$

To do this, we again apply the ellipsoid method. For this, we need a separation oracle for the program. A separation oracle
for the constraints (5.5) can be constructed as in Step 2 above. For the constraints (5.4), if the candidate solution $Z$ has $f(Z) > 0$, then we can construct a separating hyperplane as in Lemma 10. Suppose that after solving the program, we get a point $Z \in C$ such that $f(Z) \le 0$, i.e. $W$ violates the constraint (5.3) for $Z$. Then since constraint (5.3) is convex in $W$, we can construct a separating hyperplane as in Lemma 10. This completes the description of the separation oracle.

Working out the details carefully yields the following theorem, proved in Appendix E:

Theorem 11. There is an iterative algorithm with $O(t^5 K^4 \log^2(tK/\delta))$ iterations, each involving one call to AMO and $O(t^2 K^2)$ processing time, that either declares correctly that $A$ is infeasible or outputs a distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies

$$\forall Z \in C:\ \mathbb{E}_{x \sim h_{t-1}}\Big[\sum_a \frac{Z(x, a)}{W'_P(x, a)}\Big] \le \max\{4K,\ \beta_t \Delta_{t-1}(Z)^2\} + 5\epsilon,$$
$$\Delta_{t-1}(W_P) \le s + 2\gamma,$$

where $\epsilon = 8\delta/\mu_t^2$ and $\gamma = \delta/\mu_t$.

6 DELAYED FEEDBACK

In a delayed feedback setting, we observe rewards with a $\tau$ step delay according to:

1. The world presents features $x_t$.
2. The learning algorithm chooses an action $a_t \in \{1, \ldots, K\}$.
3. The world presents a reward $r_{t-\tau}$ for the action $a_{t-\tau}$ given the features $x_{t-\tau}$.

We deal with delay by suitably modifying Algorithm 1 to incorporate the delay $\tau$, giving Algorithm 3. Now we can prove the following theorem, which shows the delay has an additive effect on regret.

Theorem 12. For all distributions $D$ over $(x, r)$ with $K$ actions, for all sets of $N$ policies $\Pi$, and all delay intervals $\tau$, with probability at least $1 - \delta$, the regret of DelayedPE (Algorithm 3) is at most

$$16 \sqrt{2K \ln\frac{4T^2 N}{\delta}}\ \big(\tau + \sqrt{T}\big).$$
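To see the additive effect of the delay numerically, the following sketch evaluates the bound of Theorem 12, read here as $16\sqrt{2K\ln(4T^2N/\delta)}\,(\tau + \sqrt{T})$. The function name and the sample parameter values are assumptions chosen only for illustration.

```python
import math


def delayed_pe_regret_bound(T, K, N, tau, delta):
    """Evaluate 16 * sqrt(2 K ln(4 T^2 N / delta)) * (tau + sqrt(T)).

    T: horizon, K: number of actions, N: number of policies,
    tau: delay in steps, delta: failure probability.
    """
    factor = 16.0 * math.sqrt(2.0 * K * math.log(4.0 * T ** 2 * N / delta))
    return factor * (tau + math.sqrt(T))


# Hypothetical parameter values, for illustration only.
no_delay = delayed_pe_regret_bound(T=10_000, K=5, N=1_000, tau=0, delta=0.05)
with_delay = delayed_pe_regret_bound(T=10_000, K=5, N=1_000, tau=10, delta=0.05)
```

Under this reading, a delay of $\tau$ steps adds exactly $\tau$ times the leading factor to the bound, independently of $T$; with $\tau = 10$ and $\sqrt{T} = 100$ the bound grows by 10 percent.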

Algorithm 3 DelayedPE($\Pi, \delta, K, D_X, \tau$)

Let $\Pi_0 = \Pi$ and history $h_0 = \emptyset$.
Define: $\delta_t = \delta / (4Nt^2)$ and $b_t = 2\sqrt{2K \ln(1/\delta_t)/t}$.
Define: $\mu_t = \min\Big\{\frac{1}{2K},\ \sqrt{\frac{\ln(1/\delta_t)}{2Kt}}\Big\}$.
For each timestep $t = 1 \ldots T$, observe $x_t$ and do:
1. Let $t' = \max(t - \tau, 1)$.
2. Choose a distribution $P_t$ over $\Pi_{t-1}$ such that $\forall \pi \in \Pi_{t-1}$: $\mathbb{E}_{x \sim D_X}\Big[\frac{1}{(1 - K\mu_{t'}) W_{P_t}(x, \pi(x)) + \mu_{t'}}\Big] \le 2K$.
3. For all $a \in A$, let $W'_t(a) = (1 - K\mu_{t'}) W_{P_t}(x_t, a) + \mu_{t'}$.
4. Choose $a_t \sim W'_t$.
5. Observe reward $r_t$.
6. Let $\Pi_t = \big\{\pi \in \Pi_{t-1} : \eta_{h_t}(\pi) \ge \max_{\pi' \in \Pi_{t-1}} \eta_{h_t}(\pi') - 2 b_{t'}\big\}$.
7. Let $h_t = h_{t-1} \cup (x_t, a_t, r_t, W'_t(a_t))$.

Proof. Essentially as Theorem 4. The variance bound is unchanged because it depends only on the context distribution. Thus, it suffices to replace $\sum_{t=1}^{T} \frac{1}{\sqrt{t}}$ with $\tau + \sum_{t=\tau+1}^{T+\tau} \frac{1}{\sqrt{t - \tau}} = \tau + \sum_{t=1}^{T} \frac{1}{\sqrt{t}}$ in Eq. (3.4).

Acknowledgements

We thank Alina Beygelzimer, who helped in several formative discussions.

References

Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3:397-422, 2002.
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235-256, 2002a.
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48-77, 2002b.
P. L. Bartlett, E. Hazan, and A. Rakhlin. Adaptive online gradient descent. In NIPS, 2007.
Alina Beygelzimer, John Langford, and Pradeep Ravikumar. Error correcting tournaments. In ALT, 2009.
Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert E. Schapire. Contextual bandit algorithms with supervised learning guarantees. In AISTATS, 2011.
Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079-1105, 2006.
David A. Freedman. On tail probabilities for martingales. Annals of Probability, 3(1):100-118, 1975.
Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.
Sham M. Kakade and Adam Kalai. From batch to transductive online learning. In NIPS, 2005.
Adam Tauman Kalai and Santosh Vempala. Efficient algorithms for online decision problems. J. Comput. Syst. Sci., 71(3):291-307, 2005.
Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22, 1985.
J. Langford, A. Smola, and M.
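Zinkevich. Slow learners are fast. In NIPS, 2009.
John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In NIPS, 2007.
Maurice Sion. On general minimax theorems. Pacific J. Math., 8(1):171-176, 1958.
Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML, 2010.

A Concentration Inequality

The following is an immediate corollary of Theorem 1 of (Beygelzimer et al., 2011). It can be viewed as a version of Freedman's Inequality (Freedman, 1975). Let $y_1, \ldots, y_T$ be a sequence of real-valued random variables. Let $\mathbb{E}_t$ denote the conditional expectation $\mathbb{E}[\,\cdot \mid y_1, \ldots, y_{t-1}]$ and $\mathbb{V}_t$ the conditional variance.

Theorem 13 (Freedman-style Inequality). Let $V, R \in \mathbb{R}$ be such that $\sum_{t=1}^{T} \mathbb{V}_t[y_t] \le V$, and for all $t$, $y_t - \mathbb{E}_t[y_t] \le R$. Then for any $\delta > 0$ such that $R \le \sqrt{V / \ln(2/\delta)}$, with probability at least $1 - \delta$,

$$\Big|\sum_{t=1}^{T} y_t - \sum_{t=1}^{T} \mathbb{E}_t[y_t]\Big| \le 2\sqrt{V \ln(2/\delta)}.$$

B Minimax Theorem

The following is a continuous version of Sion's Minimax Theorem (Sion, 1958, Theorem 3.4).

Theorem 14. Let $\mathcal{W}$ and $\mathcal{Z}$ be compact and convex sets, and $f : \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$ a function which for all $Z \in \mathcal{Z}$ is convex and continuous in $W$ and for all $W \in \mathcal{W}$ is concave and continuous in $Z$. Then

$$\min_{W \in \mathcal{W}} \max_{Z \in \mathcal{Z}} f(W, Z) = \max_{Z \in \mathcal{Z}} \min_{W \in \mathcal{W}} f(W, Z).$$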
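As a quick sanity check of the minimax identity, the sketch below evaluates both sides on a grid for a bilinear payoff over two probability simplices (a bilinear function is both convex and concave in each argument, so the theorem applies). The matching-pennies matrix, the grid resolution, and the function names are assumptions made for this illustration; a grid search only approximates the true optimum in general.

```python
def payoff(p, q, A):
    """Expected payoff W^T A Z for mixed strategies W = (p, 1-p), Z = (q, 1-q)."""
    return (p * q * A[0][0] + p * (1 - q) * A[0][1]
            + (1 - p) * q * A[1][0] + (1 - p) * (1 - q) * A[1][1])


def minimax_values(A, steps=100):
    """Return (min_W max_Z f, max_Z min_W f) evaluated on a uniform grid."""
    grid = [i / steps for i in range(steps + 1)]
    min_max = min(max(payoff(p, q, A) for q in grid) for p in grid)
    max_min = max(min(payoff(p, q, A) for p in grid) for q in grid)
    return min_max, max_min


# Matching pennies: both values coincide at the uniform strategies.
min_max, max_min = minimax_values([[1.0, -1.0], [-1.0, 1.0]])
```

For this payoff matrix both quantities are attained at $p = q = 1/2$, which lies on the grid, so the two sides agree up to floating-point error.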

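C Empirical Variance Bounds

In this section we prove Theorem 6. We first show uniform convergence for a certain class of policy distributions (Lemma 15), and argue that each distribution $P$ is close to some distribution $\tilde{P}$ from this class, in the sense that $V_{P,\pi,t}$ is close to $V_{\tilde{P},\pi,t}$ and $\hat{V}_{P,\pi,t}$ is close to $\hat{V}_{\tilde{P},\pi,t}$ (Lemma 16). Together, they imply the main uniform convergence result in Theorem 6.

For each positive integer $m$, let $\mathrm{Sparse}[m]$ be the set of distributions $\tilde{P}$ over $\Pi$ that can be written as

$$\tilde{P}(\pi) = \frac{1}{m} \sum_{i=1}^{m} I(\pi = \pi_i)$$

(i.e., the average of $m$ delta functions) for some $\pi_1, \ldots, \pi_m \in \Pi$. In our analysis, we approximate an arbitrary distribution $P$ over $\Pi$ by a distribution $\tilde{P} \in \mathrm{Sparse}[m]$ chosen randomly by independently drawing $\pi_1, \ldots, \pi_m \sim P$; we denote this process by $\tilde{P} \sim P^m$.

Lemma 15. Fix positive integers $(m_1, m_2, \ldots)$. With probability at least $1 - \delta$ over the random samples $(x_1, x_2, \ldots)$ from $D_X$,

$$V_{\tilde{P},\pi,t} \le (1 + \lambda)\, \hat{V}_{\tilde{P},\pi,t} + \Big(5 + \frac{1}{2\lambda}\Big) \cdot \frac{(m_t + 1) \log N + \log(2t^2/\delta)}{\mu_t\, t}$$

for all $\lambda > 0$, all $t \ge 1$, all $\pi \in \Pi$, and all distributions $\tilde{P} \in \mathrm{Sparse}[m_t]$.

Proof. Let $Z_{P,\pi,t}(x) := \frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}$, so $V_{P,\pi,t} = \mathbb{E}_{x \sim D_X}[Z_{P,\pi,t}(x)]$ and $\hat{V}_{P,\pi,t} = \frac{1}{t}\sum_{i=1}^{t} Z_{P,\pi,t}(x_i)$. Also let

$$\varepsilon_t := \frac{\log(|\mathrm{Sparse}[m_t]|\, N\, 2t^2/\delta)}{\mu_t\, t} = \frac{(m_t + 1)\log N + \log(2t^2/\delta)}{\mu_t\, t}.$$

We apply Bernstein's inequality and union bounds over $\tilde{P} \in \mathrm{Sparse}[m_t]$, $\pi \in \Pi$, and $t$, so that with probability at least $1 - \delta$,

$$V_{\tilde{P},\pi,t} \le \hat{V}_{\tilde{P},\pi,t} + \sqrt{2 V_{\tilde{P},\pi,t}\, \varepsilon_t} + (2/3)\, \varepsilon_t$$

for all $t \ge 1$, all $\pi \in \Pi$, and all distributions $\tilde{P} \in \mathrm{Sparse}[m_t]$. The conclusion follows by solving the quadratic inequality for $V_{\tilde{P},\pi,t}$ to get

$$V_{\tilde{P},\pi,t} \le \hat{V}_{\tilde{P},\pi,t} + \sqrt{2 \hat{V}_{\tilde{P},\pi,t}\, \varepsilon_t} + 5\varepsilon_t$$

and then applying the AM/GM inequality.

Lemma 16. Fix any $\gamma \in [0, 1]$, and any $x \in X$. For any distribution $P$ over $\Pi$ and any $\pi \in \Pi$, if $m := \Big\lceil \frac{6}{\gamma^2 \mu_t} \Big\rceil$, then

$$\mathbb{E}_{\tilde{P} \sim P^m} \left| \frac{1}{(1 - K\mu_t) W_{\tilde{P}}(x, \pi(x)) + \mu_t} - \frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t} \right| \le \frac{\gamma}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}.$$

This implies that for all distributions $P$ over $\Pi$ and any $\pi \in \Pi$, there exists $\tilde{P} \in \mathrm{Sparse}[m]$ such that for any $\lambda > 0$,

$$\big(V_{P,\pi,t} - V_{\tilde{P},\pi,t}\big) + (1 + \lambda)\big(\hat{V}_{\tilde{P},\pi,t} - \hat{V}_{P,\pi,t}\big) \le \gamma \big(V_{P,\pi,t} + (1 + \lambda) \hat{V}_{P,\pi,t}\big).$$

Proof. We randomly draw $\tilde{P} \sim P^m$, with $\tilde{P}(\pi') := \frac{1}{m}\sum_{i=1}^{m} I(\pi' = \pi_i)$, and then define

$$z := \sum_{\pi' \in \Pi} P(\pi')\, I(\pi'(x) = \pi(x)) \quad \text{and} \quad \hat{z} := \sum_{\pi' \in \Pi} \tilde{P}(\pi')\, I(\pi'(x) = \pi(x)).$$

We have $z = \mathbb{E}_{\pi' \sim P}\, I(\pi'(x) = \pi(x))$ and $\hat{z} = \frac{1}{m}\sum_{i=1}^{m} I(\pi_i(x) = \pi(x))$. In other words, $\hat{z}$ is the average of $m$ independent Bernoulli random variables, each with mean $z$. Thus, $\mathbb{E}_{\tilde{P} \sim P^m}(\hat{z} - z)^2 = z(1 - z)/m$ and $\Pr_{\tilde{P} \sim P^m}[\hat{z} \le z/2] \le \exp(-mz/8)$ by a Chernoff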

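bound. We have

$$\begin{aligned}
&\mathbb{E}_{\tilde{P} \sim P^m} \left| \frac{1}{(1 - K\mu_t)\hat{z} + \mu_t} - \frac{1}{(1 - K\mu_t) z + \mu_t} \right| \\
&\quad = \mathbb{E}_{\tilde{P} \sim P^m}\, \frac{(1 - K\mu_t)\,|\hat{z} - z|}{((1 - K\mu_t)\hat{z} + \mu_t)((1 - K\mu_t) z + \mu_t)} \\
&\quad \le \mathbb{E}_{\tilde{P} \sim P^m}\, \frac{(1 - K\mu_t)\,|\hat{z} - z|\, I(\hat{z} \ge 0.5 z)}{0.5\,((1 - K\mu_t) z + \mu_t)^2} + \mathbb{E}_{\tilde{P} \sim P^m}\, \frac{(1 - K\mu_t)\,|\hat{z} - z|\, I(\hat{z} \le 0.5 z)}{\mu_t\,((1 - K\mu_t) z + \mu_t)} \\
&\quad \le \frac{(1 - K\mu_t)\sqrt{\mathbb{E}_{\tilde{P} \sim P^m}(\hat{z} - z)^2}}{0.5\,((1 - K\mu_t) z + \mu_t)^2} + \frac{(1 - K\mu_t)\, z\, \Pr_{\tilde{P} \sim P^m}(\hat{z} \le 0.5 z)}{\mu_t\,((1 - K\mu_t) z + \mu_t)} \\
&\quad \le \frac{(1 - K\mu_t)\sqrt{z/m}}{0.5 \cdot 2\sqrt{(1 - K\mu_t)\, z \mu_t}\,((1 - K\mu_t) z + \mu_t)} + \frac{(1 - K\mu_t)\, z \exp(-mz/8)}{\mu_t\,((1 - K\mu_t) z + \mu_t)} \\
&\quad \le \frac{\gamma\sqrt{(1 - K\mu_t)\, z/m}}{\sqrt{z \cdot 6/m}\,((1 - K\mu_t) z + \mu_t)} + \frac{(1 - K\mu_t)\,\gamma^2\, mz \exp(-mz/8)}{6\,((1 - K\mu_t) z + \mu_t)},
\end{aligned}$$

where the third step follows from Jensen's inequality, and the fourth step uses the AM/GM inequality in the denominator of the first term and the previous observations in the numerators. The final expression simplifies to the first desired displayed inequality by observing that $mz \exp(-mz/8) \le 3$ for all $mz \ge 0$ (the maximum is achieved at $mz = 8$). The second displayed inequality follows from the following facts:

$$\mathbb{E}_{\tilde{P} \sim P^m}\big[V_{P,\pi,t} - V_{\tilde{P},\pi,t}\big] \le \gamma V_{P,\pi,t}, \qquad \mathbb{E}_{\tilde{P} \sim P^m}\big[(1 + \lambda)(\hat{V}_{\tilde{P},\pi,t} - \hat{V}_{P,\pi,t})\big] \le \gamma (1 + \lambda)\hat{V}_{P,\pi,t}.$$

Both inequalities follow from the first displayed bound of the lemma, by taking expectation with respect to the true (and empirical) distributions over $x$. The desired bound follows by adding the above two inequalities, which implies that the bound holds in expectation, and hence the existence of $\tilde{P}$ for which the bound holds.

Now, we can prove Theorem 6.

Proof of Theorem 6. Let $m_t := \Big\lceil \frac{6}{\lambda^2 \mu_t} \Big\rceil$ for some $\lambda \in (0, 1/5)$ to be determined, and condition on the $\ge 1 - \delta$ probability event from Lemma 15 that

$$V_{\tilde{P},\pi,t} - (1 + \lambda)\hat{V}_{\tilde{P},\pi,t} \le K\Big(5 + \frac{1}{2\lambda}\Big) \frac{(m_t + 1)\log N + \log(2t^2/\delta)}{K\mu_t\, t} \le K\Big(5 + \frac{1}{\lambda}\Big) \frac{(m_t + 1)\log N + \log(2t^2/\delta)}{K\mu_t\, t}$$

for all $t \ge 2$, all $\tilde{P} \in \mathrm{Sparse}[m_t]$, and all $\pi \in \Pi$. Using the definitions of $m_t$ and $\mu_t$, the second term is at most $(40/\lambda^2)(1 + 1/\lambda) K$ for all $t \ge 16K \log(8KN/\delta)$: the key here is that for $t \ge 16K \log(8KN/\delta)$, we have $\mu_t = \sqrt{\log(Nt/\delta)/(Kt)} \le 1/(2K)$ and therefore

$$\frac{m_t \log N}{K\mu_t\, t} \le \frac{6}{\lambda^2} \quad \text{and} \quad \frac{\log N + \log(2t^2/\delta)}{K\mu_t\, t} \le 2.$$

Now fix $t \ge 16K \log(8KN/\delta)$, $\pi \in \Pi$, and a distribution $P$ over $\Pi$. Let $\tilde{P} \in \mathrm{Sparse}[m_t]$ be the distribution guaranteed by Lemma 16 with $\gamma = \lambda$, satisfying

$$V_{P,\pi,t} \le \frac{V_{\tilde{P},\pi,t} - (1 + \lambda)\hat{V}_{\tilde{P},\pi,t} + (1 + \lambda)^2 \hat{V}_{P,\pi,t}}{1 - \lambda}.$$

Substituting the previous bound for $V_{\tilde{P},\pi,t} - (1 + \lambda)\hat{V}_{\tilde{P},\pi,t}$ gives

$$V_{P,\pi,t} \le \frac{1}{1 - \lambda}\Big(\frac{40}{\lambda^2}\big(1 + 1/\lambda\big) K + (1 + \lambda)^2\, \hat{V}_{P,\pi,t}\Big).$$

This can be bounded as $(1 + \epsilon)\hat{V}_{P,\pi,t} + (7500/\epsilon^3) K$ by setting $\lambda = \epsilon/5$.

D Analysis of RandomizedUCB

D.1 Preliminaries

First, we define the following constants.

- $\epsilon \in (0, 1)$ is a fixed constant, and
- $\rho := \frac{7500}{\epsilon^3}$ is the factor that appears in the bound from Theorem 6.
- $\theta := \frac{\rho + 1}{1 - (1 + \epsilon)/2} = \frac{2}{1 - \epsilon}\Big(1 + \frac{7500}{\epsilon^3}\Big) \ge 5$ is a constant central to Lemma 21,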
which bounds the variance of the optimal policy's estimated rewards.

Recall the algorithm-specific quantities

$$C_t := 2 \log\frac{Nt}{\delta}, \qquad \mu_t := \min\Big\{\frac{1}{2K},\ \sqrt{\frac{C_t}{2Kt}}\Big\}.$$
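A small sketch of these quantities may help. It assumes the reading $C_t = 2\log(Nt/\delta)$ and $\mu_t = \min\{1/(2K), \sqrt{C_t/(2Kt)}\}$, together with the smoothing $W'(a) = (1 - K\mu_t) W(a) + \mu_t$ used throughout; the helper names and the sample parameters are hypothetical.

```python
import math


def confidence_term(t, N, delta):
    """C_t = 2 log(N t / delta)."""
    return 2.0 * math.log(N * t / delta)


def smoothing_rate(t, K, N, delta):
    """mu_t = min{1/(2K), sqrt(C_t / (2 K t))}."""
    return min(1.0 / (2 * K), math.sqrt(confidence_term(t, N, delta) / (2.0 * K * t)))


def smooth(W, mu_t):
    """Mix an action distribution W with the uniform one:
    W'(a) = (1 - K mu_t) W(a) + mu_t, so every action has mass >= mu_t."""
    K = len(W)
    return [(1.0 - K * mu_t) * w + mu_t for w in W]


# Hypothetical setting: K = 3 actions, N = 10 policies, delta = 0.1.
rates = [smoothing_rate(t, K=3, N=10, delta=0.1) for t in range(1, 1001)]
smoothed = smooth([1.0, 0.0, 0.0], rates[-1])
```

The smoothed vector is still a probability distribution, and each action receives probability at least $\mu_t$, which is what keeps the importance weights $1/W'(a)$ bounded by $1/\mu_t$.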

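It can be checked that $\mu_t$ is non-increasing. We define the following time indices:

- $t_0$ is the first round $t$ in which $\mu_t = \sqrt{C_t/(2Kt)}$. Note that $8K \le t_0 \le 8K \log(NK/\delta)$.
- $t_1 := \lceil 16K \log(8KN/\delta) \rceil$ is the round given by Theorem 6 such that, with probability at least $1 - \delta$,

$$\mathbb{E}_{x \sim D_X}\Big[\frac{1}{W'_{P_t,\mu_t}(x, \pi(x))}\Big] \le (1 + \epsilon)\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{W'_{P_t,\mu_t}(x, \pi(x))}\Big] + \rho K \qquad (D.1)$$

for all $\pi \in \Pi$ and all $t \ge t_1$, where $W'_{P,\mu}(x, \cdot)$ is the distribution over $A$ given by $W'_{P,\mu}(x, a) = (1 - K\mu) W_P(x, a) + \mu$, and the notation $\mathbb{E}_{x \sim h_{t-1}}$ denotes expectation with respect to the empirical (uniform) distribution over $x_1, \ldots, x_{t-1}$.

The following lemma shows the effect of allowing slack in the optimization constraints.

Lemma 17. If $P$ satisfies the constraints of the optimization problem (4.1) with slack $K$ for each distribution $Q$ over $\Pi$, i.e.,

$$\mathbb{E}_{\pi \sim Q}\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}\Big] \le \max\Big\{4K,\ \frac{(t-1)\,\Delta_{t-1}(W_Q)^2}{180\, C_{t-1}}\Big\} + K$$

for all $Q$, then $P$ satisfies

$$\mathbb{E}_{\pi \sim Q}\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_P(x, \pi(x)) + \mu_t}\Big] \le \max\Big\{5K,\ \frac{(t-1)\,\Delta_{t-1}(W_Q)^2}{144\, C_{t-1}}\Big\}$$

for all $Q$.

Proof. Let $b := \max\Big\{4K,\ \frac{(t-1)\,\Delta_{t-1}(\pi)^2}{180\, C_{t-1}}\Big\}$. Note that $\frac{b}{4} \ge K$. Hence $b + K \le \frac{5b}{4}$, which gives the stated bound.

Note that the allowance of slack $K$ is somewhat arbitrary; any $O(K)$ slack is tolerable provided that other constants are adjusted appropriately.

D.2 Deviation Bound for $\eta_t(\pi)$

For any policy $\pi \in \Pi$, define, for $1 \le t \le t_0$,

$$V_t(\pi) := K,$$

and for $t > t_0$,

$$V_t(\pi) := K + \mathbb{E}_{x \sim D_X}\Big[\frac{1}{W'_t(\pi(x))}\Big].$$

The $V_t(\pi)$ bounds the variances of the terms in $\eta_t(\pi)$.

Lemma 18. Assume the bound in (D.1) holds for all $\pi \in \Pi$ and $t \ge t_1$. For all $\pi \in \Pi$:

1. If $t \le t_1$, then $K \le V_t(\pi) \le 4K$.
2. If $t > t_1$, then $V_t(\pi) \le (1 + \epsilon)\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_{P_t}(x, \pi(x)) + \mu_t}\Big] + (\rho + 1) K$.

Proof. For the first claim, note that if $t < t_0$, then $V_t(\pi) = K$, and if $t_0 \le t < t_1$, then

$$\mu_t = \sqrt{\frac{\log(Nt/\delta)}{Kt}} \ge \sqrt{\frac{\log(Nt_0/\delta)}{16K^2 \log(8KN/\delta)}} \ge \frac{1}{4K};$$

so $W'_t(a) \ge \mu_t \ge 1/(4K)$. For the second claim, pick any $t > t_1$, and note that by definition of $t_1$, for any $\pi \in \Pi$ we have

$$\mathbb{E}_{x \sim D_X}\Big[\frac{1}{W'_t(\pi(x))}\Big] \le (1 + \epsilon)\, \mathbb{E}_{x \sim h_{t-1}}\Big[\frac{1}{(1 - K\mu_t) W_{P_t}(x, \pi(x)) + \mu_t}\Big] + \rho K.$$

The stated bound on $V_t(\pi)$ now follows from its definition.

Let $V_{\max,t}(\pi) := \max\{V_\tau(\pi),\ \tau = 1, 2, \ldots, t\}$. The following lemma gives a deviation bound for $\eta_t(\pi)$ in terms of these quantities.

Lemma 19. Pick any $\delta \in (0, 1)$. With probability at least $1 - \delta$, for all pairs $\pi, \pi' \in \Pi$ and $t \ge t_0$, we have

$$\big|(\eta_t(\pi) - \eta_t(\pi')) - (\eta_D(\pi) - \eta_D(\pi'))\big| \le 2\sqrt{\frac{(V_{\max,t}(\pi) + V_{\max,t}(\pi'))\, C_t}{t}}. \qquad (D.2)$$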

Proof Fix ny 0 nd π, π Π Le := exp C Pick ny τ Le Z τ π = r τ τ Iπx τ = τ W τ τ so η π = τ= Z τ π I is esy o see h nd x τ, r τ D, τ W τ τ= Z τ π Z τ π = η D π η D π x τ, rτ D, τ W τ x τ= τ D X Zτ π Z τ π 2 W τ πx τ + W τ π x τ V mx, π + V mx, π Moreover, wih probbiliy, Z τ π Z τ π µ τ Now, noe h since 0, µ = C 2K, so h = C Furher, boh V 2Kµ 2 mx, π nd V mx, π re les K Using hese bounds we ge log/ V mx, π + V mx, π C C 2Kµ 2 2K = µ µ τ, for ll τ, since he µ τ s re non-incresing Therefore, by Freedmn s inequliy Theorem 3, we hve η Pr π η π η D π η D π Vmx, π + > 2 V mx, π log/ 2 The conclusion follows by king union bound over 0 < T nd ll pirs π, π Π D3 Vrince Anlysis We define he following condiion, which will be ssumed by mos of he subsequen lemms in his secion Condiion The deviion bound D holds for ll π Π nd, nd he deviion bound D2 holds for ll pirs π, π Π nd 0 The nex wo lemms rele he V π o he π Lemm 20 Assume Condiion For ny nd π Π, if V π > θk, hen π 72 V πc Proof By Lemm 8, he fc V π > θk implies h x h Kµ W P x, πx + µ > ρ + V π + ɛ θ 2 V π Since V π > θk 5K, Lemm 7 implies h in order for P o sisfy he opimizion consrin in 4 corresponding o π wih slck K, i mus be he cse h π 44C x h Kµ W P x, πx + µ Combining wih he bove, we obin π 72 V πc Lemm 2 Assume Condiion For ll, V mx, π mx θk nd V mx, π θk Proof By inducion on The clim for ll follows from Lemm 8 So ke >, nd ssume s he srong inducive hypohesis h V mx,τ π mx θk nd V mx,τ π τ θk for τ {,, } Suppose for ske of conrdicion h V π mx > θk By Lemm 20, π mx 72 V π mx C However, by he deviion bounds, we hve π mx + D π 2 V mx, π + V mx, π mx C 2 2 V π mx C 72 < V π mx C

The second inequliy follows from our ssumpion nd he inducion hypohesis: V π mx > θk V mx, π, V mx, π mx Since D π 0, we hve conrdicion, so i mus be h V π mx θk This proves h V mx, π mx θk I remins o show h V mx, π θk So suppose for ske of conrdicion h he inequliy fils, nd le < τ be ny round for which V τ π = V mx, π > θk By Lemm 20, τ π On he oher hnd, 72 V τ π C τ D3 τ τ π D π τ + τ π + π mx = D π τ + τ π mx + η τ π mx η τ π D π + D π + π mx The prenhesized erms cn be bounded using he deviion bounds, so we hve τ π 2 V mx,τ π τ + V mx,τ π mx C τ τ + 2 V mx,τ π + V mx,τ π mx C τ τ Vmx, π + + 2 V mx, π mx C 2 2 V τ π C τ 2 + 2 V τ π C τ τ τ 2 Vτ π C < + 2 72 V τ π C τ τ where he second inequliy follows from he following fcs: By inducion hypohesis, we hve V mx,τ π τ, V mx,τ π mx, V mx, π mx θk, nd V τ π > θk, 2 Vτ π V mx, π, nd 3 since τ is round h chieves V mx, π, we hve V τ π V τ π This conrdics he inequliy in D3, so i mus be h V mx, π θk Corollry 22 Under he ssumpions of Lemm 2, for ll 0 D π + π mx 2 2θKC Proof Immedie from Lemm 2 nd he deviion bounds from D2 The following lemm shows h if policy π hs lrge τ π in some round τ, hen π remins lrge in ler rounds > τ Lemm 23 Assume Condiion Pick ny π Π nd If V mx, π > θk, hen 2 Vmx, πc π > 2 Proof Le τ be ny round in which V τ π = V mx, π > θk We hve π π π mx D π τ = τ π + η π mx η π D π + η D π τ η D π τ π 72 V τ πc τ τ Vmx, π + 2 V mx, π mx C 2 V mx,τ π + V mx,τ π τ C τ τ 72 > V mx, πc τ 2 Vmx, πc 2 τ 2 2 V mx, πc τ τ 2 2 V mx, πc τ 2 Vmx, πc 2 τ where he second inequliy follows from Lemm 20 nd he deviion bounds, nd he hird inequliy follows from Lemm 2 nd he fcs h V τ π = V mx, π > θk V mx, π mx, V mx,τ π τ, nd V mx, π V mx,τ π

D4 Regre Anlysis We now bound he vlue of he opimizion problem 4, which hen leds o our regre bound The nex lemm shows he exisence of fesible soluion wih cerin srucure bsed on he non-uniform consrins Recll from Secion 5, h solving he opimizion problem A, ie consrins 5, 52, 53, for he smlles fesible vlue of s is equivlen o solving he RUCB opimizion problem 4 β = 80C Recll h Lemm 24 There is poin W R K such h K W 4 β W C Zx, Z C : x h W mx{4k, β Z 2 } x, In priculr, he vlue of he opimizion problem 4, OPT, is bounded by 8 K β 0 KC Proof Define he ses {C i : i =, 2, } such h C i := {Z C : 2 i+ κ Z 2 i+2 κ}, K where κ = β Noe h since Z is liner funcion of Z, ech C i is closed, convex, compc se Also, define C 0 = {Z C : Z 4κ} This is lso closed, convex, compc se Noe h C = i=0 C i Le I = {i : C i }For i I \ {0}, define w i = 4 i, nd le w 0 = i I\{0} w i Noe h w 0 2/3 By Lemm, for ech i I, here is poin W i C i such h for ll Z C i, we hve Zx, 2K x, x h W i Here we use he fc h Kµ /2 o upper K bound Kµ by 2K Now consider he poin W = i I w iw i Since C is convex, W C Now fix ny i I For ny x,, we hve W x, w i W i x,, so h for ll Z C i, we hve Zx, W 2K x, w i x h 4 i+ K so he consrin for Z is sisfied mx{4k, β Z 2 }, Finlly, since for ll i I, we hve w i 4 i nd W i 2 i+2 κ, we ge W = w i W i 4 i 2 i+2 κ 8κ i I i=0 The vlue of he opimizion problem 4 cn be reled o he expeced insnneous regre of policy drwn rndomly from he disribuion P Lemm 25 Assume Condiion Then P π D π 220 + 4 KC 2θ + 2ε op, π Π for ll > Proof Fix ny π Π nd > By he deviion bounds, we hve η D π η D π π + 2 V mx, π + V mx, π C Vmx, π + θk C π + 2, by Lemm 2 By Corollry 22 we hve 2θKC D π 2 Thus, we ge D π η D π η D π + D π Vmx, π + θk C π + 2 2θKC + 2 If V mx, π θk, hen we hve 2θKC D π π + 4 Oherwise, Lemm 23 implies h so V mx, π π 2 8C, π D π π + 2 2 + θkc 8 2θKC + 2 2θKC 2 π + 4

Therefore π Π P π D π 2 π Π P π π + 4 2 OPT +ε op, + 4 2θKC 2θKC where OPT is he vlue of he opimizion problem 4 The conclusion follows from Lemm 24 We cn now finlly prove he min regre bound for RUCB Proof of Theorem 5 The regre hrough he firs rounds is rivilly bounded by In he even h Condiion holds, we hve for ll, A W r A nd herefore x, r D W = x, r D Kµ W P x, r A W P x, r Kµ = π Π P πr πx Kµ, r W r A π Π P πη D π Kµ η D π mx O KC + ε op, where he ls inequliy follows from Lemm 25 Summing he bound from = +,, T gives T = x, r D W η D π mx r + O T K log NT/ By Azum s inequliy, he probbiliy h T = r devies from is men by more hn O T log/ is mos Finlly, he probbiliy h Condiion does no hold is mos 2 by Lemm 9, Theorem 6, nd union bound The conclusion follows by finl union bound Deils of Orcle-bsed Algorihm We show how o pproximely solve A using he ellipsoid lgorihm wih AMO Fix ime period To void cluer, only in his secion we drop he subscrip from η,, nd h so h hey becomes η,, nd h respecively In order o use he ellipsoid lgorihm, we need o relx he progrm lile bi in order o ensure h he fesible region hs non-negligible volume To do his, we need o obin some perurbion bounds for he consrins of A The following lemm gives such bounds For ny > 0, we define C o be he se of ll poins wihin disnce of from C Lemm 26 Le b/4 be prmeer Le U, W C 2 be poins such h U W Then we hve U W γ Z C : x h where ɛ = 8 µ 2 Zx, U x, x h nd γ = µ Proof Firs, we hve ηu ηw which implies µ = γ, x,,r,q h Nex, for ny Z C, we hve Zx, U x, Zx, W x, 8 µ 2 Zx, W x, ɛ 2 r Ux, W x, p Zx, U x, W x, U x, W x, = ɛ In he ls inequliy, we use he Cuchy-Schwrz inequliy, nd use he following fcs here, Zx, denoes he vecor Zx,, ec: Zx, 2 since Z C, 2 U x, W x, Ux, W x,, nd 3 U x, bk 2 + b b/2, for b/4, nd similrly W x, b/2 This implies 2

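We now consider the following relaxed form of $A$. Here, $\delta \in (0, b/4)$ is a parameter. We want to find a point $W \in \mathbb{R}^{(t-1)K}$ such that

$$\Delta(W) \le s + \gamma \qquad (E.3)$$
$$W \in C_\delta \qquad (E.4)$$
$$\forall Z \in C_{2\delta}:\ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \le \max\{4K,\ \beta_t \Delta(Z)^2\} + \epsilon, \qquad (E.5)$$

where $\epsilon$ and $\gamma$ are as defined in Lemma 26. Call this relaxed program $A'$.

We apply the ellipsoid method to $A'$ rather than $A$. Recall the requirements of Lemma 8: we need an enclosing ball of bounded radius for the feasible region, and the radius of an enclosed ball in the feasible region. The following lemma gives this.

Lemma 27. The feasible region for $A'$ is contained in $B(0, \sqrt{t} + \delta)$, and if $A$ is feasible, then it contains a ball of radius $\delta$.

Proof. Note that for any $W \in C_\delta$, we have $\|W\| \le \sqrt{t} + \delta$, so the feasible region lies in $B(0, \sqrt{t} + \delta)$. Next, if $A$ is feasible, let $W^\star \in C$ be any feasible solution to $A$. Consider the ball $B(W^\star, \delta)$. Let $U$ be any point in $B(W^\star, \delta)$. Clearly $U \in C_\delta$. By Lemma 26, assuming $\delta \le 1/2$, we have for all $Z \in C_{2\delta}$,

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{U'(x, a)}\Big] \le \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W^{\star\prime}(x, a)}\Big] + \epsilon \le \max\{4K,\ \beta_t \Delta(Z)^2\} + \epsilon.$$

Also

$$\Delta(U) \le \Delta(W^\star) + \gamma \le s + \gamma.$$

Thus, $U$ is feasible for $A'$, and hence the entire ball $B(W^\star, \delta)$ is feasible for $A'$.

We now give the construction of a separation oracle for the feasible region of $A'$ by checking for violations of the constraints. In the following, we use the word "iteration" to indicate one step of either the ellipsoid algorithm or the perceptron algorithm. Each such iteration involves one call to AMO, and additional $O(t^2 K^2)$ processing time.

Let $W \in \mathbb{R}^{(t-1)K}$ be a candidate point that we want to check for feasibility for $A'$. We can check for violation of the constraint (E.3) easily, and since it is a linear constraint in $W$, it automatically yields a separating hyperplane if it is violated.

The harder constraints are (E.4) and (E.5). Recall that Lemma 9 shows that AMO allows us to do linear optimization over $C$ efficiently. This immediately gives us the following useful corollary:

Corollary 28. Given a vector $w \in \mathbb{R}^{(t-1)K}$ and $\delta > 0$, we can compute $\arg\max_{Z \in C_\delta} w \cdot Z$ using one invocation of AMO.

Proof. This follows directly from the following fact:

$$\arg\max_{Z \in C_\delta} w \cdot Z = \frac{\delta}{\|w\|}\, w + \arg\max_{Z \in C} w \cdot Z.$$

Now we show how to use AMO to check for constraint (E.4):

Lemma 29. Suppose we are given a point $W$. Then in $O(t/\delta^2)$ iterations, if $W \notin C_{2\delta}$, we can construct a hyperplane separating $W$ from $C_\delta$. Otherwise, we declare correctly that $W \in C_{2\delta}$. In the latter case, we can find an explicit distribution $P$ over policies in $\Pi$ such that $W_P$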
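satisfies $\|W_P - W\| \le 2\delta$.

Proof. We run the perceptron algorithm with the origin at $W$ and all points in $C_\delta$ being positive examples. The goal of the perceptron algorithm then is to find a hyperplane going through $W$ that puts all of $C_\delta$ (strictly) on one side. In each iteration of the perceptron algorithm, we have a weight vector $w$ that is the normal to a candidate hyperplane, and we need to find a point $Z \in C_\delta$ such that $w \cdot (Z - W) \le 0$ (note that we have shifted the origin to $W$). To do this, we use AMO as in Lemma 9 to find $Z^\star = \arg\max_{Z \in C_\delta} -w \cdot Z$. If $w \cdot (Z^\star - W) \le 0$, we use $Z^\star$ to update $w$ using the perceptron update rule, $w \leftarrow w + (Z^\star - W)$. Otherwise, we have $w \cdot (Z - W) > 0$ for all $Z \in C_\delta$, and hence we have found our separating hyperplane.

Now suppose that $W \notin C_{2\delta}$, i.e. the distance of $W$ from $C_\delta$ is more than $\delta$. Since $\|Z - W\| \le 2\sqrt{t} + 3\delta = O(\sqrt{t})$ for all $Z \in C_\delta$ (assuming $\delta = O(\sqrt{t})$), the perceptron convergence guarantee implies that in $O(t/\delta^2)$ iterations we find a separating hyperplane.

If in $k = O(t/\delta^2)$ iterations we haven't found a separating hyperplane, then $W \in C_{2\delta}$. In fact the perceptron algorithm gives a stronger guarantee: if the $k$ policies found in the run of the perceptron algorithm are $\pi_1, \pi_2, \ldots, \pi_k \in \Pi$, then $W$ is within a distance of $2\delta$ from their convex hull, $C' = \mathrm{conv}(\pi_1, \pi_2, \ldots, \pi_k)$. This is because a run of the perceptron algorithm on $C'_\delta$ would be identical to that on $C_\delta$ for $k$ steps. We can then compute the explicit distribution over policies $P$ by computing the Euclidean projection of $W$ on $C'$ in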

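$\mathrm{poly}(k)$ time using a convex quadratic program:

$$\min\ \Big\|W - \sum_{i=1}^{k} P_i \pi_i\Big\|^2 \quad \text{s.t.} \quad \sum_i P_i = 1, \qquad \forall i:\ P_i \ge 0.$$

Solving this quadratic program, we get a distribution $P$ over the policies $\{\pi_1, \pi_2, \ldots, \pi_k\}$ such that $\|W_P - W\| \le 2\delta$.

Finally, we show how to check constraint (E.5):

Lemma 30. Suppose we are given a point $W$. In $O\big(\frac{t^3 K^2}{\delta^2} \log\frac{tK}{\delta}\big)$ iterations, we can either find a point $Z \in C_{2\delta}$ such that

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \ge \max\{4K,\ \beta_t \Delta(Z)^2\} + 2\epsilon,$$

or else we conclude correctly that for all $Z \in C$, we have

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \le \max\{4K,\ \beta_t \Delta(Z)^2\} + 3\epsilon.$$

Proof. We first rewrite $\eta(W)$ as $\eta(W) = w \cdot W$, where $w$ is a vector defined as

$$w(x, a) = \frac{1}{t-1} \sum_{(x', a', r, p) \in h:\ x' = x,\ a' = a} \frac{r}{p}.$$

Thus, $\Delta(Z) = v - w \cdot Z$, where $v = \max_\pi \eta(\pi) = \max_\pi w \cdot \pi$, which can be computed by using AMO once.

Next, using the candidate point $W$, compute the vector $u$ defined as $u(x, a) = \frac{n_x}{(t-1)\, W'(x, a)}$, where $n_x$ is the number of times $x$ appears in $h$, so that $\mathbb{E}_{x \sim h}\big[\sum_a \frac{Z(x, a)}{W'(x, a)}\big] = u \cdot Z$. Now, the problem reduces to finding a point $Z \in C$ which violates the constraint

$$u \cdot Z \le \max\{4K,\ \beta_t (w \cdot Z - v)^2\} + 3\epsilon.$$

Define

$$f(Z) = \max\{4K,\ \beta_t (w \cdot Z - v)^2\} + 3\epsilon - u \cdot Z.$$

Note that $f$ is a convex function of $Z$. Checking for violation of the above constraint is equivalent to solving the following convex program:

$$f(Z) \le 0 \qquad (E.6)$$
$$Z \in C \qquad (E.7)$$

To do this, we again apply the ellipsoid method, but on the relaxed program

$$f(Z) \le \epsilon \qquad (E.8)$$
$$Z \in C_\delta \qquad (E.9)$$

To run the ellipsoid algorithm, we need a separation oracle for the program. Given a candidate solution $Z$, we run the algorithm of Lemma 29, and if $Z \notin C_{2\delta}$, we construct a hyperplane separating $Z$ from $C_\delta$. Now suppose we conclude that $Z \in C_{2\delta}$. Then we construct a separation oracle for (E.8) as follows. If $f(Z) > \epsilon$, then since $f$ is a convex function of $Z$, we can construct a separating hyperplane as in Lemma 10.

Now we can run the ellipsoid algorithm with the starting ellipsoid being $B(0, \sqrt{t})$. If there is a point $Z^\star \in C$ such that $f(Z^\star) \le 0$, then consider the ball $B\big(Z^\star, \frac{4\delta}{5\sqrt{t} K \beta_t}\big)$. For any $Y \in B\big(Z^\star, \frac{4\delta}{5\sqrt{t} K \beta_t}\big)$, we have

$$|u \cdot Z^\star - u \cdot Y| \le \|u\|\, \|Z^\star - Y\| \le \frac{\epsilon}{2},$$

since $\|u\| \le \frac{K}{\mu_t}$. Also,

$$\beta_t \big|(w \cdot Z^\star - v)^2 - (w \cdot Y - v)^2\big| = \beta_t\, |w \cdot Z^\star - w \cdot Y|\, |w \cdot Z^\star + w \cdot Y - 2v| \le \beta_t\, \|w\|\, \|Z^\star - Y\| \big(\|w\|(\|Z^\star\| + \|Y\|) + 2|v|\big) \le \frac{\epsilon}{2},$$

since $\|w\| \le \frac{1}{\mu_t}$, $\|Z^\star\| \le \sqrt{t}$, $\|Y\| \le \sqrt{t} + 2\delta$, and $|v| \le \|w\| \sqrt{t} \le \frac{\sqrt{t}}{\mu_t}$. Thus, $f(Y) \le f(Z^\star) + \epsilon \le \epsilon$, so the entire ball $B\big(Z^\star, \frac{4\delta}{5\sqrt{t} K \beta_t}\big)$ is feasible for the relaxed program.

By Lemma 8, in $O(t^2 K^2 \log\frac{tK}{\delta})$ iterations of the ellipsoid algorithm, we obtain one of the following:

1. we either find a point $Z \in C_{2\delta}$ such that $f(Z) \le \epsilon$, i.e.

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \ge \max\{4K,\ \beta_t \Delta(Z)^2\} + 2\epsilon,$$

2. or else we conclude that the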
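original convex program (E.6, E.7) is infeasible, i.e. for all $Z \in C$, we have

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \le \max\{4K,\ \beta_t \Delta(Z)^2\} + 3\epsilon.$$

The total number of iterations is bounded by

$$O\Big(t^2 K^2 \log\frac{tK}{\delta}\Big) \cdot O\Big(\frac{t}{\delta^2}\Big) = O\Big(\frac{t^3 K^2}{\delta^2} \log\frac{tK}{\delta}\Big).$$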

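Lemma 31. Suppose we are given a point $Z \in C_{2\delta}$ such that

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \ge \max\{4K,\ \beta_t \Delta(Z)^2\} + 2\epsilon.$$

Then we can construct a hyperplane separating $W$ from all feasible points for $A'$.

Proof. For notational convenience, define the function

$$f_Z(W) := \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] - \max\{4K,\ \beta_t \Delta(Z)^2\} - 2\epsilon.$$

Note that it is a convex function of $W$. Note that for any point $U$ that is feasible for $A'$, we have $f_Z(U) \le -\epsilon$, whereas $f_Z(W) \ge 0$. Thus, by Lemma 10, we can construct the desired separating hyperplane.

We can finally prove Theorem 11:

Proof (Theorem 11). We run the ellipsoid algorithm starting with the ball $B(0, \sqrt{t} + \delta)$. At each point, we are given a candidate solution $W$ for program $A'$. We check for violation of constraint (E.3) first. If it is violated, the constraint, being linear, gives us a separating hyperplane. Else, we use Lemma 29 to check for violation of constraint (E.4). If $W \notin C_{2\delta}$, then we can construct a separating hyperplane. Else, we use Lemmas 30 and 31 to check for violation of constraint (E.5). If there is a $Z \in C$ such that

$$\mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \ge \max\{4K,\ \beta_t \Delta(Z)^2\} + 3\epsilon,$$

then we can find a separating hyperplane. Else, we conclude that the current point $W$ satisfies the following constraints:

$$\Delta(W) \le s + \gamma,$$
$$\forall Z \in C:\ \mathbb{E}_{x \sim h}\Big[\sum_a \frac{Z(x, a)}{W'(x, a)}\Big] \le \max\{4K,\ \beta_t \Delta(Z)^2\} + 3\epsilon,$$
$$W \in C_{2\delta}.$$

We can then use the perceptron-based algorithm of Lemma 29 to "round" $W$ to an explicit distribution $P$ over policies in $\Pi$ such that $W_P$ satisfies $\|W_P - W\| \le 2\delta$. Then Lemma 26 implies the stated bounds for $W_P$.

By Lemma 8, in $O(t^2 K^2 \log\frac{t}{\delta})$ iterations of the ellipsoid algorithm, we find the point $W$ satisfying the constraints given above, or declare correctly that $A$ is infeasible. In the worst case, we might have to run the algorithm of Lemma 30 in every iteration, leading to an upper bound of

$$O\Big(t^2 K^2 \log\frac{t}{\delta}\Big) \times O\Big(\frac{t^3 K^2}{\delta^2} \log\frac{tK}{\delta}\Big) = O\Big(t^5 K^4 \log^2\frac{tK}{\delta}\Big)$$

on the number of iterations.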
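The perceptron-based membership test used for the hull constraint can be sketched as follows. This is a simplified illustration, not the procedure analyzed in the appendix: it works directly on an explicit list of hull vertices instead of calling AMO, it ignores the $\delta$-enlargement of the hull, and the name `perceptron_separate` and the iteration cap are assumptions.

```python
def perceptron_separate(vertices, W, max_iter=1000):
    """Look for a normal vector w with w . (Z - W) > 0 for every hull vertex Z.

    Returns (w, used): w is a separating normal, or None if none was found
    within max_iter updates (suggesting W is in or near the hull); used lists
    the vertices that triggered updates, in order.
    """
    w = [0.0] * len(W)
    used = []
    for _ in range(max_iter):
        # Linear optimization over the hull: the vertex with the smallest
        # margin w . (Z - W). A linear function is minimized at a vertex.
        Z = min(vertices,
                key=lambda p: sum(wi * (pi - ci) for wi, pi, ci in zip(w, p, W)))
        margin = sum(wi * (zi - ci) for wi, zi, ci in zip(w, Z, W))
        if margin > 0:
            return w, used  # every vertex is strictly on the positive side
        used.append(Z)
        # Perceptron update with the origin shifted to W.
        w = [wi + (zi - ci) for wi, zi, ci in zip(w, Z, W)]
    return None, used


triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
w_out, used_out = perceptron_separate(triangle, (2.0, 2.0))
w_in, used_in = perceptron_separate(triangle, (0.25, 0.25), max_iter=200)
```

When no separator is found, the vertices collected in `used` play the role of the policies $\pi_1, \ldots, \pi_k$ over which the projection quadratic program would then be solved.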