Online Learning: Bandit Setting


Daniel Khashabi
Summer 2014. Last update: October 2016.

1 Introduction

[TODO]

2 Bandits

2.1 Stochastic setting

Suppose there exist unknown distributions $\nu_1, \ldots, \nu_K$ such that the loss of arm $i$ at iteration $t$ is drawn as $l_{i,t} \sim \nu_i$. The mean of each of these distributions is $\mu_k = \mathbb{E}[l_{k,t}]$. Denote the least expected loss by $\mu^* = \min_{k \in \{1,\ldots,K\}} \mu_k$. Let $I_s$ be the arm played at time $s$, and define $\tau_i(t)$ to be the number of times arm $i$ has been pulled up to time $t$. More formally:

\[ \tau_i(t) = \sum_{s=1}^{t} \mathbb{1}\{I_s = i\}. \]

Lemma 1. The (pseudo-)regret can be written as

\[ R(T) = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}[\tau_k(T)], \]

where $\Delta_k$ is the difference between the mean loss of action $k$ and the minimum mean loss: $\Delta_k = \mu_k - \mu^*$. The stationarity assumption is implicit here: the distributions do not change over the time horizon.

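Proof.

\[
\begin{aligned}
R(T) &= \mathbb{E}\sum_{t=1}^{T} l_{I_t,t} - \min_{k \in \{1,\ldots,K\}} \mathbb{E}\sum_{t=1}^{T} l_{k,t}
      = \sum_{t=1}^{T} \mathbb{E}\big[\mathbb{E}[l_{I_t,t} \mid I_t]\big] - T\mu^* \\
     &= \sum_{t=1}^{T} \mathbb{E}[\mu_{I_t} - \mu^*]
      = \sum_{t=1}^{T} \sum_{k=1}^{K} (\mu_k - \mu^*)\, P(I_t = k) \\
     &= \sum_{k=1}^{K} \Delta_k \sum_{t=1}^{T} P(I_t = k)
      = \sum_{k=1}^{K} \Delta_k \, \mathbb{E}[\tau_k(T)].
\end{aligned}
\]

Therefore the only thing we need to worry about is $\tau_k(T)$, the number of times each arm is pulled.

Exploration first: Consider a simple strategy: sample each arm $C$ times (in any order), then start making decisions. Denote the empirical mean estimate for each action by $\hat\mu_k$. Here is the suggested algorithm:

1. For a fixed probability $\delta \in (0,1)$, sample each arm $C$ times and compute the empirical means $\hat\mu_k$.
2. For the remaining $T - KC$ iterations, play the action with the minimum empirical loss: $\hat k = \arg\min_{k \in \{1,\ldots,K\}} \hat\mu_k$.

Using the Hoeffding bound (see http://web.engr.illinois.edu/~khashab2/learn/concentration.pdf), for losses bounded in $[0,1]$ we know that for each arm $k \in \{1,\ldots,K\}$, with probability at least $1-\delta$,

\[ |\hat\mu_k - \mu_k| < \sqrt{\frac{\log(2/\delta)}{2C}}, \]

which means that the larger the sample size $C$, the better our estimates of the means, as expected. Define $\Delta = \min\{\Delta_i : \Delta_i > 0\}$, the minimum gap between the best mean and any other mean. To guarantee that we always choose the correct action with mean $\mu^*$, we need our estimates to satisfy $|\hat\mu_k - \mu_k| < \Delta/2$:

\[ \sqrt{\frac{\log(2/\delta)}{2C}} < \frac{\Delta}{2} \quad\Longleftrightarrow\quad C > \frac{2\ln(2/\delta)}{\Delta^2}. \]

By Lemma 1, the regret incurred during the first $KC$ iterations, in which every arm is sampled $C$ times, is $C \sum_{k=1}^{K} \Delta_k$.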
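The exploration-first strategy above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: arms are modelled as hypothetical callables returning a loss in $[0,1]$, the names `exploration_length` and `explore_then_commit` are made up for this sketch, and the sample size uses the two-sided Hoeffding constant $C > 2\ln(2/\delta)/\Delta^2$.

```python
import math
import random


def exploration_length(delta, gap):
    """Smallest integer C with C > 2*ln(2/delta)/gap**2 (assumed Hoeffding constant)."""
    return math.floor(2.0 * math.log(2.0 / delta) / gap ** 2) + 1


def explore_then_commit(arms, C, T, rng):
    """Sample each arm C times, then play the empirically best arm.

    arms: list of callables rng -> loss in [0, 1] (hypothetical interface).
    Returns (committed arm, empirical means, list of all incurred losses).
    """
    K = len(arms)
    totals = [0.0] * K
    losses = []
    # Exploration phase: K*C rounds, every arm sampled exactly C times.
    for k in range(K):
        for _ in range(C):
            loss = arms[k](rng)
            totals[k] += loss
            losses.append(loss)
    means = [s / C for s in totals]
    # Commit to the arm with the smallest empirical mean loss.
    best = min(range(K), key=lambda k: means[k])
    # Exploitation phase: the remaining T - K*C rounds.
    for _ in range(T - K * C):
        losses.append(arms[best](rng))
    return best, means, losses


def bernoulli(p):
    """Bernoulli loss with mean p."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


arms = [bernoulli(0.2), bernoulli(0.5), bernoulli(0.8)]
C = exploration_length(delta=0.05, gap=0.3)
best, means, losses = explore_then_commit(arms, C, T=2000, rng=random.Random(0))
```

With these example arms the smallest gap is 0.3, so the exploration phase takes $3C$ of the 2000 rounds; shrinking the gap makes $C$ grow like $1/\Delta^2$, which is the weakness that the confidence-bound approach tries to address.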

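Optimism in the face of uncertainty ($\alpha$-UCB)

The rule below maximizes, so in this part the values $l_{k,t}$ are read as rewards in $[0,1]$: $\mu^* = \max_k \mu_k$, $i^*$ denotes an optimal arm, and $\Delta_k = \mu^* - \mu_k$.

Define our estimate of each mean up to time $t$ to be

\[ \hat\mu_{k,t} = \frac{1}{\tau_k(t)} \sum_{s=1}^{t} l_{k,s}\, \mathbb{1}\{I_s = k\}. \]

Define the upper-confidence bound on action (arm) $k$ at time $t$ to be

\[ U_{k,t} = \hat\mu_{k,t} + \sqrt{\frac{\alpha \log t}{2\,\tau_k(t)}}, \]

where $\alpha > 0$ is a parameter which controls the width of the upper-bound estimates and which we will set later. At each iteration choose the action $I_t = \arg\max_{k \in \{1,\ldots,K\}} U_{k,t}$. With this choice we know that

\[ \mathbb{E}[\Delta_{I_t} \mid I_t] = \mu^* - \mu_{I_t} \;\le\; U_{i^*,t} - \mu_{I_t} \;\le\; \max_{k \in \{1,\ldots,K\}} U_{k,t} - \mu_{I_t}, \]

where the first inequality holds with high probability.

Lemma 2. If $I_t = i \ne i^*$ (incorrect action) then $U_{i,t} \ge U_{i^*,t}$, and the mistake must be due to at least one of the following events:

1. $A_1(t) = \{U_{i^*,t} \le \mu^*\}$: the upper-bound estimate on the true best action is too small.
2. $A_2(t) = \big\{\hat\mu_{i,t} > \mu_i + \sqrt{\alpha \ln t / (2\tau_i(t))}\big\}$: the estimate on action $i$ is too big.
3. $A_3(t) = \big\{\tau_i(t) \le 2\alpha \ln T / \Delta_i^2\big\}$: the number of samples from action $i$ is too small.

Proof. We use proof by contradiction. Suppose all of the above are false; we show that then $I_t \ne i$:

\[
\begin{aligned}
U_{i^*,t} &> \mu^* && (A_1 \text{ is false}) \\
          &= \mu_i + \Delta_i \\
          &> \mu_i + 2\sqrt{\frac{\alpha \ln T}{2\,\tau_i(t)}} && (A_3 \text{ is false}) \\
          &\ge \mu_i + 2\sqrt{\frac{\alpha \ln t}{2\,\tau_i(t)}} && (t \le T) \\
          &\ge \hat\mu_{i,t} + \sqrt{\frac{\alpha \ln t}{2\,\tau_i(t)}} = U_{i,t} && (A_2 \text{ is false}).
\end{aligned}
\]

Hence $U_{i^*,t} > U_{i,t}$ and arm $i$ is not selected.

Lemma 3. $P(A_1(t)) \le t^{-\alpha+1}$ and $P(A_2(t)) \le t^{-\alpha+1}$.

Proof. Again we use the Hoeffding bound. Let $\hat\mu_k^{(s)}$ denote the empirical mean of the first $s$ samples of arm $k$. Then

\[ P\left(\hat\mu_k^{(s)} - \mu_k \le -\sqrt{\frac{\alpha \ln t}{2s}}\right) \le t^{-\alpha}, \]

where $s$ is the number of times the action is sampled. Since $\tau_{i^*}(t)$ is random and takes values in $\{1,\ldots,t\}$, a union bound gives

\[
P(A_1(t)) = P\left(\hat\mu_{i^*,t} - \mu^* \le -\sqrt{\frac{\alpha \ln t}{2\,\tau_{i^*}(t)}}\right)
\le \sum_{s=1}^{t} P\left(\hat\mu_{i^*}^{(s)} - \mu^* \le -\sqrt{\frac{\alpha \ln t}{2s}}\right)
\le \sum_{s=1}^{t} t^{-\alpha} = t^{-\alpha+1}.
\]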
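As a concrete companion to the index $U_{k,t}$, here is a minimal Python sketch of the selection rule. It assumes rewards in $[0,1]$, the bonus $\sqrt{\alpha \log t / (2\tau_k(t))}$ (one common statement of $\alpha$-UCB), and an initial pass that plays every arm once so that all counts are positive; the function name `alpha_ucb` and the callable-arm interface are hypothetical.

```python
import math
import random


def alpha_ucb(arms, T, alpha, rng):
    """Run alpha-UCB for T rounds and return the pull counts per arm.

    arms: list of callables rng -> reward in [0, 1] (hypothetical interface).
    """
    K = len(arms)
    counts = [0] * K   # tau_k(t): number of pulls of each arm so far
    sums = [0.0] * K   # running sum of observed rewards per arm

    def index(k, t):
        # Empirical mean plus the exploration bonus (assumed form of U_{k,t}).
        return sums[k] / counts[k] + math.sqrt(alpha * math.log(t) / (2 * counts[k]))

    for t in range(1, T + 1):
        if t <= K:
            k = t - 1  # initialization: play each arm once
        else:
            k = max(range(K), key=lambda i: index(i, t))
        reward = arms[k](rng)
        counts[k] += 1
        sums[k] += reward
    return counts


def bernoulli(p):
    """Bernoulli reward with mean p."""
    return lambda rng: 1.0 if rng.random() < p else 0.0


counts = alpha_ucb([bernoulli(0.7), bernoulli(0.4)], T=1000, alpha=2.5, rng=random.Random(0))
```

The bonus shrinks as an arm is pulled more often, so a suboptimal arm is revisited only while its count is small relative to $\log t$, which is what the event $A_3(t)$ captures.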

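The claim $P(A_2(t)) \le t^{-\alpha+1}$ can be proved in a similar way.

Lemma 4. For $\alpha > 2$ and every suboptimal arm $k$,

\[ \mathbb{E}[\tau_k(T)] \le \frac{2\alpha \log T}{\Delta_k^2} + \frac{2}{\alpha - 2}. \]

Proof. For any threshold $t_0 \ge 1$,

\[
\begin{aligned}
\mathbb{E}[\tau_k(T)] &= \mathbb{E}\sum_{t=1}^{T} \mathbb{1}\{I_t = k\} \\
 &= \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{1}\{I_t = k,\ \tau_k(t) \le t_0\} + \sum_{t=1}^{T} \mathbb{1}\{I_t = k,\ \tau_k(t) > t_0\}\right] \\
 &\le t_0 + \sum_{t=t_0+1}^{T} P\{I_t = k,\ \tau_k(t) > t_0\}.
\end{aligned}
\]

Now define $t_0 = 2\alpha \ln T / \Delta_k^2$. On the event $\{\tau_k(t) > t_0\}$ the event $A_3(t)$ of Lemma 2 is false, so pulling arm $k$ requires $A_1(t)$ or $A_2(t)$ to hold:

\[ P\{I_t = k,\ \tau_k(t) > t_0\} \le P\{A_1(t) \cup A_2(t)\} \le P\{A_1(t)\} + P\{A_2(t)\} \le 2\,t^{-\alpha+1}. \]

We can therefore further simplify the previous bound:

\[
\mathbb{E}[\tau_k(T)] \le t_0 + \sum_{t=t_0+1}^{T} 2\,t^{-\alpha+1}
\le t_0 + 2\int_{1}^{\infty} t^{-\alpha+1}\,dt
= t_0 + \frac{2}{\alpha-2}.
\]

Plugging the result of Lemma 4 into Lemma 1 we get:

\[
R(T) \le \sum_{k:\,\Delta_k > 0} \Delta_k \left(\frac{2\alpha \log T}{\Delta_k^2} + \frac{2}{\alpha-2}\right)
= \sum_{k:\,\Delta_k > 0} \left(\frac{2\alpha}{\Delta_k} \log T + \frac{2\,\Delta_k}{\alpha-2}\right).
\]

2.2 Adversarial setting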

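Input: decay parameters $\{\eta_t\}_{t=1}^{T}$.
Initialize: uniform distribution $p_1 = [p_{1,1}, \ldots, p_{K,1}]$ over the set $\{1,\ldots,K\}$.
For $t = 1, \ldots, T$:
  1. Draw an arm $I_t$ from the probability distribution $p_t$.
  2. Create the loss estimate $\hat l_{i,t}$, based on $l_{i,t}$ and $p_t$.
  3. Update the cumulative loss $\hat L_{i,t} = \sum_{s=1}^{t} \hat l_{i,s}$.
  4. Update the probability distribution over actions:
     \[ p_{i,t+1} = \frac{\exp(-\eta_{t+1} \hat L_{i,t})}{\sum_{k=1}^{K} \exp(-\eta_{t+1} \hat L_{k,t})}, \quad \text{for each } i. \]

Lemma 5. For any sequence of nonnegative loss estimates in the algorithm above, with a non-increasing positive sequence $\eta_1, \eta_2, \ldots$, we have:

\[
\sum_{t=1}^{T} \sum_{k=1}^{K} p_{k,t}\,\hat l_{k,t} - \min_{j \in \{1,\ldots,K\}} \sum_{t=1}^{T} \hat l_{j,t}
\;\le\; \sum_{t=1}^{T} \frac{\eta_t}{2} \sum_{k=1}^{K} p_{k,t}\,\hat l_{k,t}^{\,2} + \frac{\ln K}{\eta_T}.
\]

Proof. We prove the inequality for an arbitrary fixed action $j$ and drop the minimum. The left side is then $\sum_{t=1}^{T} \big( \sum_{k} p_{k,t}\hat l_{k,t} - \hat l_{j,t} \big)$. We rewrite the expected loss estimate through its log-moment generating function, by adding and subtracting the same term:

\[
\begin{aligned}
\mathbb{E}_{k \sim p_t} \hat l_{k,t}
 &= \mathbb{E}_{k \sim p_t} \hat l_{k,t} + \frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t}) - \frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t}) \\
 &= \frac{1}{\eta_t} \ln \exp\big(\eta_t\, \mathbb{E}_{k \sim p_t} \hat l_{k,t}\big) + \frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t}) - \frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t}) \\
 &= \frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp\Big(-\eta_t \big(\hat l_{k,t} - \mathbb{E}_{j \sim p_t} \hat l_{j,t}\big)\Big) - \frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t}). \qquad (1)
\end{aligned}
\]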

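Now we simplify the two terms on the right of Equation (1). For the first term we use the two inequalities $\ln x \le x - 1$ and $\exp(-x) \le 1 - x + x^2/2$ (valid for $x \ge 0$), each once:

\[
\begin{aligned}
\frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp\Big(-\eta_t \big(\hat l_{k,t} - \mathbb{E}_{j \sim p_t} \hat l_{j,t}\big)\Big)
 &= \frac{1}{\eta_t} \Big( \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t}) + \eta_t\, \mathbb{E}_{k \sim p_t} \hat l_{k,t} \Big) \\
 &\le \frac{1}{\eta_t}\, \mathbb{E}_{k \sim p_t} \Big( \exp(-\eta_t \hat l_{k,t}) - 1 + \eta_t \hat l_{k,t} \Big) \\
 &\le \frac{\eta_t}{2}\, \mathbb{E}_{k \sim p_t} \hat l_{k,t}^{\,2}.
\end{aligned}
\]

For the second term, define the shorthand notation

\[ \Phi_t(\eta) = \frac{1}{\eta} \ln \left( \frac{1}{K} \sum_{k=1}^{K} \exp(-\eta \hat L_{k,t}) \right). \]

Since $p_{k,t} \propto \exp(-\eta_t \hat L_{k,t-1})$ and $\hat L_{k,t} = \hat L_{k,t-1} + \hat l_{k,t}$,

\[
\begin{aligned}
-\frac{1}{\eta_t} \ln \mathbb{E}_{k \sim p_t} \exp(-\eta_t \hat l_{k,t})
 &= -\frac{1}{\eta_t} \ln \sum_{k=1}^{K} p_{k,t} \exp(-\eta_t \hat l_{k,t}) \\
 &= -\frac{1}{\eta_t} \ln \frac{\sum_{k} \exp(-\eta_t \hat L_{k,t-1}) \exp(-\eta_t \hat l_{k,t})}{\sum_{k} \exp(-\eta_t \hat L_{k,t-1})} \\
 &= -\frac{1}{\eta_t} \ln \frac{\sum_{k} \exp(-\eta_t \hat L_{k,t})}{\sum_{k} \exp(-\eta_t \hat L_{k,t-1})}
  = \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t).
\end{aligned}
\]

Summing over the time index we have:

\[
\sum_{t=1}^{T} \mathbb{E}_{k \sim p_t} \hat l_{k,t}
\le \sum_{t=1}^{T} \frac{\eta_t}{2}\, \mathbb{E}_{k \sim p_t} \hat l_{k,t}^{\,2} + \sum_{t=1}^{T} \big( \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) \big),
\]

and the last sum can be rearranged as

\[
\sum_{t=1}^{T} \big( \Phi_{t-1}(\eta_t) - \Phi_t(\eta_t) \big)
= \Phi_0(\eta_1) + \sum_{t=1}^{T-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big) - \Phi_T(\eta_T).
\]

Note that $\Phi_0(\eta) = 0$. Also, for the fixed action $j$,

\[
-\Phi_T(\eta_T) = \frac{\ln K}{\eta_T} - \frac{1}{\eta_T} \ln \sum_{k=1}^{K} \exp(-\eta_T \hat L_{k,T})
\le \frac{\ln K}{\eta_T} - \frac{1}{\eta_T} \ln \exp(-\eta_T \hat L_{j,T})
= \frac{\ln K}{\eta_T} + \sum_{t=1}^{T} \hat l_{j,t}.
\]

Then the summation becomes:

\[
\sum_{t=1}^{T} \mathbb{E}_{k \sim p_t} \hat l_{k,t} - \sum_{t=1}^{T} \hat l_{j,t}
\le \sum_{t=1}^{T} \frac{\eta_t}{2}\, \mathbb{E}_{k \sim p_t} \hat l_{k,t}^{\,2} + \frac{\ln K}{\eta_T} + \sum_{t=1}^{T-1} \big( \Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \big).
\]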

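We can show that $\Phi_t'(\eta) \ge 0$. Since we assumed that $\eta_{t+1} \le \eta_t$ for any $t$, it follows that $\Phi_t(\eta_{t+1}) - \Phi_t(\eta_t) \le 0$. Therefore

\[
\sum_{t=1}^{T} \mathbb{E}_{k \sim p_t} \hat l_{k,t} - \sum_{t=1}^{T} \hat l_{j,t}
\le \sum_{t=1}^{T} \frac{\eta_t}{2}\, \mathbb{E}_{k \sim p_t} \hat l_{k,t}^{\,2} + \frac{\ln K}{\eta_T},
\]

which is the desired result.

Corollary 1. For any sequence of nonnegative loss estimates in the exponential-weights algorithm of Section 2.2, with a non-increasing positive sequence $\eta_1, \eta_2, \ldots$, we have:

\[
\mathbb{E} \sum_{t=1}^{T} \sum_{k=1}^{K} p_{k,t}\,\hat l_{k,t} - \min_{j \in \{1,\ldots,K\}} \mathbb{E} \sum_{t=1}^{T} \hat l_{j,t}
\le \mathbb{E} \left[ \sum_{t=1}^{T} \frac{\eta_t}{2} \sum_{k=1}^{K} p_{k,t}\,\hat l_{k,t}^{\,2} \right] + \frac{\ln K}{\eta_T}.
\]

Proof. Take the expectation of both sides of Lemma 5 and use the fact that $\mathbb{E}[\min[\cdot]] \le \min[\mathbb{E}[\cdot]]$.

2.2.1 The EXP3 algorithm

If the second step of the algorithm is the importance-weighted estimate $\hat l_{i,t} = l_{i,t}\, \mathbb{1}\{I_t = i\} / p_{i,t}$, the resulting algorithm is EXP3. This estimate is unbiased, $\mathbb{E}_{I_t \sim p_t} \hat l_{i,t} = l_{i,t}$, so the left side of Corollary 1 is the expected regret.

Lemma 6. For the EXP3 algorithm with losses in $[0,1]$, the expected regret is bounded by

\[ \frac{K}{2} \sum_{t=1}^{T} \eta_t + \frac{\ln K}{\eta_T}. \]

For a proper choice of $\{\eta_t\}_{t=1}^{T}$ the overall bound is $\sqrt{2 T K \ln K}$.

Proof.

\[
\sum_{k=1}^{K} p_{k,t}\,\hat l_{k,t}^{\,2}
= \sum_{k=1}^{K} p_{k,t}\, \frac{l_{k,t}^2}{p_{k,t}^2}\, \mathbb{1}\{I_t = k\}
= p_{I_t,t}\, \frac{l_{I_t,t}^2}{p_{I_t,t}^2}
= \frac{l_{I_t,t}^2}{p_{I_t,t}}.
\]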

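Since the decisions $I_t$ are made in a stochastic fashion, we need to take the expectation with respect to $I_t$. Using $l_{I_t,t} \le 1$ and $\mathbb{E}_{I_t \sim p_t}[1/p_{I_t,t}] = \sum_{k} p_{k,t} \cdot (1/p_{k,t}) = K$,

\[
\mathbb{E} \left[ \sum_{t=1}^{T} \frac{\eta_t}{2} \sum_{k=1}^{K} p_{k,t}\,\hat l_{k,t}^{\,2} \right]
= \mathbb{E} \sum_{t=1}^{T} \frac{\eta_t}{2}\, \frac{l_{I_t,t}^2}{p_{I_t,t}}
\le \mathbb{E} \sum_{t=1}^{T} \frac{\eta_t}{2}\, \frac{1}{p_{I_t,t}}
= \frac{K}{2} \sum_{t=1}^{T} \eta_t,
\]

which gives the general form of the bound for EXP3. If we set $\eta_t = \sqrt{\dfrac{2 \ln K}{T K}}$ we get the stated result:

\[
\frac{K}{2} \sum_{t=1}^{T} \eta_t + \frac{\ln K}{\eta_T}
= \frac{K T}{2} \sqrt{\frac{2 \ln K}{T K}} + \ln K \sqrt{\frac{T K}{2 \ln K}}
= \sqrt{\frac{T K \ln K}{2}} + \sqrt{\frac{T K \ln K}{2}}
= \sqrt{2 T K \ln K}.
\]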
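The EXP3 update can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: it assumes losses in $[0,1]$ given as a full matrix (of which only the played entry is used), a constant learning rate $\eta = \sqrt{2\ln K/(TK)}$, and hypothetical function names `exp3_probabilities` and `exp3`.

```python
import math
import random


def exp3_probabilities(cum_losses, eta):
    """Exponential weights over cumulative estimated losses."""
    m = min(cum_losses)  # shifting by the minimum avoids overflow; ratios are unchanged
    weights = [math.exp(-eta * (c - m)) for c in cum_losses]
    total = sum(weights)
    return [w / total for w in weights]


def exp3(loss_matrix, eta, rng):
    """Run EXP3 on loss_matrix[t][k] in [0, 1].

    Only the loss of the played arm is read in each round (bandit feedback).
    Returns (list of played arms, distribution after the last update).
    """
    K = len(loss_matrix[0])
    cum = [0.0] * K  # cumulative importance-weighted loss estimates
    played = []
    for losses in loss_matrix:
        probs = exp3_probabilities(cum, eta)
        arm = rng.choices(range(K), weights=probs)[0]
        played.append(arm)
        # Importance-weighted estimate: l / p for the played arm, 0 for the others.
        cum[arm] += losses[arm] / probs[arm]
    return played, exp3_probabilities(cum, eta)


T, K = 1000, 3
eta = math.sqrt(2 * math.log(K) / (T * K))
loss_matrix = [[0.1, 0.9, 0.9] for _ in range(T)]
played, final_probs = exp3(loss_matrix, eta, random.Random(0))
```

Dividing by the probability of the played arm is what keeps the estimate unbiased, at the price of a variance of order $1/p_{I_t,t}$; this is the term that produces the factor $K$ in the bound.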

2.3 Lower bounds

2.3.1 Preliminaries

The KL divergence has the following chain-rule property:

\[
\mathrm{KL}\big(p(x,y) \,\|\, q(x,y)\big)
= \mathrm{KL}\big(p(x) \,\|\, q(x)\big) + \mathrm{KL}\big(p(y \mid x) \,\|\, q(y \mid x)\big)
= \mathrm{KL}\big(p(x) \,\|\, q(x)\big) + \sum_{x} p(x)\, \mathrm{KL}\big(p(y \mid x) \,\|\, q(y \mid x)\big).
\]

Pinsker's inequality creates a connection between the KL divergence and the total variation distance:

\[ \sup_{A} |p(A) - q(A)| \le \sqrt{\tfrac{1}{2}\, \mathrm{KL}(p \,\|\, q)}. \]

2.3.2 Lower bounding ...

Theorem 1. Suppose $Y_{i,1}, Y_{i,2}, \ldots$ is the i.i.d. sequence of costs of arm $i$. We want to find a lower bound on the regret. The lower bound needs to hold for the best forecaster one can design (thus the inf with respect to forecasters), against the worst case of the cost distributions (thus the sup with respect to the distributions):

\[
\inf \sup \left( \mathbb{E} \sum_{t=1}^{T} Y_{I_t,t} - \min_{i \in \{1,\ldots,K\}} \mathbb{E} \sum_{t=1}^{T} Y_{i,t} \right) \ge \frac{1}{20} \sqrt{T K}.
\]

Proof. The idea of the proof is to analyze the behavior of any forecaster against two distributions that differ slightly: (1) in one, all of the arm distributions have the same mean; (2) in the other, all of the arm distributions have that mean except one, which differs from it by $\epsilon$.

Lemma 7. For any $\epsilon \in (0,1)$,

\[
\inf \sup \left( \mathbb{E} \sum_{t=1}^{T} Y_{I_t,t} - \min_{i \in \{1,\ldots,K\}} \mathbb{E} \sum_{t=1}^{T} Y_{i,t} \right)
\ge T \epsilon \left( 1 - \frac{1}{K} - \sqrt{\epsilon \ln \frac{1+\epsilon}{1-\epsilon}}\, \sqrt{\frac{T}{2K}} \right).
\]

Proof. Let $l_{k,t}$ denote the loss value at time $t$ for action $k \in \{1,\ldots,K\}$. Define $K+1$ different games, which differ in the distribution of the losses. In the $i$-th game, $i \in \{1,\ldots,K\}$, all of the losses are i.i.d. Bernoulli random variables with bias $\frac{1+\epsilon}{2}$, except those of the $i$-th arm, which are Bernoulli with bias $\frac{1-\epsilon}{2}$. Also define an additional game, game 0, in which all of the losses are Bernoulli with bias $\frac{1+\epsilon}{2}$. Suppose $I_t$ is the arm played by the algorithm at time $t$. Denote the empirical distribution over actions up to time $T$ by $q_T = (q_{1,T}, \ldots, q_{K,T})$:

\[ q_{k,T} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{I_t = k\}. \]

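Let $J$ be a random variable distributed according to $q_T$. Define $P_i$ to be the law of $J$ when the forecaster plays the $i$-th game, so that

\[ P_i(J = k) = \mathbb{E}_i \left[ \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\{I_t = k\} \right], \]

where $\mathbb{E}_i[\cdot]$ means expectation with respect to the distribution of the $i$-th game. The regret for the $i$-th game is:

\[ R_i(T) = \mathbb{E}_i \sum_{t=1}^{T} \big( l_{I_t,t} - l_{i,t} \big). \]

In game $i$, every round in which $I_t \ne i$ contributes an expected regret of $\frac{1+\epsilon}{2} - \frac{1-\epsilon}{2} = \epsilon$, and every round in which $I_t = i$ contributes none. The regret can therefore be simplified to

\[ R_i(T) = \epsilon T \sum_{k \ne i} P_i(J = k) = \epsilon T \big( 1 - P_i(J = i) \big). \]

Averaging over all of the games we have:

\[ \frac{1}{K} \sum_{i=1}^{K} R_i(T) = \epsilon T \left( 1 - \frac{1}{K} \sum_{i=1}^{K} P_i(J = i) \right). \]

Note that we want a lower bound on the maximum (not the average). But since the average is less than the maximum, a good lower bound on the average also works for us. By Pinsker's inequality we have

\[ P_i(J = i) \le P_0(J = i) + \sqrt{\tfrac{1}{2}\, \mathrm{KL}(P_0 \,\|\, P_i)}, \]

and hence

\[ \frac{1}{K} \sum_{i=1}^{K} P_i(J = i) \le \frac{1}{K} + \sqrt{\frac{1}{2K} \sum_{i=1}^{K} \mathrm{KL}(P_0 \,\|\, P_i)}. \]

Note that in the last step we used the fact that the square root function is concave and that $\sum_{i} P_0(J = i) = 1$.

The next step is to bound the divergence between the distributions of the observed losses in the different games. For a deterministic forecaster, $I_t$ is a function of the past observations $y^{t-1} = (y_1, \ldots, y_{t-1})$. Writing $P_i^t$ for the law of $y^t$ in game $i$, the chain rule gives

\[
\begin{aligned}
\mathrm{KL}(P_0^T \,\|\, P_i^T)
 &= \mathrm{KL}(P_0^1 \,\|\, P_i^1) + \sum_{t=2}^{T} \sum_{y^{t-1}} P_0^{t-1}(y^{t-1})\, \mathrm{KL}\big( P_0^t(\cdot \mid y^{t-1}) \,\|\, P_i^t(\cdot \mid y^{t-1}) \big) \\
 &= \mathrm{KL}(P_0^1 \,\|\, P_i^1) + \sum_{t=2}^{T} \left[ \sum_{y^{t-1}:\, I_t \ne i} P_0^{t-1}(y^{t-1})\, \mathrm{KL}\Big( \tfrac{1+\epsilon}{2} \,\Big\|\, \tfrac{1+\epsilon}{2} \Big) + \sum_{y^{t-1}:\, I_t = i} P_0^{t-1}(y^{t-1})\, \mathrm{KL}\Big( \tfrac{1+\epsilon}{2} \,\Big\|\, \tfrac{1-\epsilon}{2} \Big) \right] \\
 &= \mathrm{KL}\Big( \tfrac{1+\epsilon}{2} \,\Big\|\, \tfrac{1-\epsilon}{2} \Big)\, \mathbb{E}_0 \left[ \sum_{t=1}^{T} \mathbb{1}\{I_t = i\} \right],
\end{aligned}
\]

since the two games differ only in rounds where arm $i$ is played.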

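Since $J$ is determined by the observed sequence, $\mathrm{KL}(P_0 \,\|\, P_i) \le \mathrm{KL}(P_0^T \,\|\, P_i^T)$, and summing over the games gives

\[
\sum_{i=1}^{K} \mathrm{KL}(P_0 \,\|\, P_i)
\le \mathrm{KL}\Big( \tfrac{1+\epsilon}{2} \,\Big\|\, \tfrac{1-\epsilon}{2} \Big) \sum_{i=1}^{K} \mathbb{E}_0 \left[ \sum_{t=1}^{T} \mathbb{1}\{I_t = i\} \right]
= T\, \mathrm{KL}\Big( \tfrac{1+\epsilon}{2} \,\Big\|\, \tfrac{1-\epsilon}{2} \Big).
\]

The last step holds because exactly one arm is played in each round, so $\sum_{i=1}^{K} \mathbb{1}\{I_t = i\} = 1$ for every $t$. We know that

\[ \mathrm{KL}\Big( \tfrac{1+\epsilon}{2} \,\Big\|\, \tfrac{1-\epsilon}{2} \Big) = \epsilon \ln \frac{1+\epsilon}{1-\epsilon}, \]

and therefore

\[ \sum_{i=1}^{K} \mathrm{KL}(P_0 \,\|\, P_i) \le T \epsilon \ln \frac{1+\epsilon}{1-\epsilon}. \]

So far the lower bound is the following:

\[ \sup_i R_i(T) \ge \epsilon T \left( 1 - \frac{1}{K} - \sqrt{\frac{T}{2K}\, \epsilon \ln \frac{1+\epsilon}{1-\epsilon}} \right). \]

The final step is the $\epsilon$-tuning of the bound. Since the lower bound holds for any $\epsilon \le 1/2$, we choose it so that the bound attains its biggest value. If we set $\epsilon = \alpha \sqrt{K/T}$, where $\alpha$ is a real number to be tuned, this gives us the desired result.

http://cseweb.ucsd.edu/~kamalika/teaching/cse9w/lecture5.pdf
http://courses.cs.washington.edu/courses/cse599s/sp/scribes.html
Lower bounds: http:// berkeley.edu/~bartlett/courses/04fall-cs94stat60/lectures/bandit-lower-bound-notes.pdf
http:// 4_Scribe_Notes.pdf
http://
So good! This contains a very nice comparison between UCB and EXP3: http:// presentations/cmu_bandits.pdf

3 Contextual Bandits

In contextual bandits, unlike the standard bandits, the importance of an action depends on the context in which it is taken. In other words, whether a single action is optimal or not depends on its context. A simple example is to consider two contexts, weekday and weekend: an action which might be optimal during the weekend is not necessarily the best action on a weekday.

Just like in the standard bandits, in the contextual bandit problem, on each of $T$ rounds a learner is presented with the choice of taking one of $K$ actions. Before choosing an action, the learner observes a feature vector (context) associated with each of its possible choices. In this setting the learner has access to a hypothesis class, in which the hypotheses receive action features (context) and predict which action will give the best reward. If the learner can guarantee to do nearly as well as the prediction of the best hypothesis in hindsight (to have low regret), the learner is said to successfully compete with that class.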

Algorithm      Regret                   High probability bound   Contextual   Efficient
Exp3.P         O(T^{1/2})               Y                        N            Y
UCB            O(T^{1/2})               Y                        N            Y
Exp4           O(\sqrt{T K \ln N})      N                        Y            N
Epoch-greedy   O(T^{2/3})               Y                        Y            Y
LinUCB [4]
Exp4.P         O(T^{1/2})               Y                        Y            N

Table 1: Properties of popular bandit algorithms; N experts, T number of rounds, K number of possible actions.

If we ignore the contextual information we can just use the existing vanilla bandit algorithms. Therefore, having the contextual information, one should be able to get better guarantees. One way of looking at contextual bandits is to think of them as a means of connecting to supervised learning, which requires input features supplied by users for making predictions. An important point here is that bandit problems are not supervised learning problems; for example, a click (or not) on one ad does not generally tell you whether a different ad would have been clicked on. Instead this problem is inherently an exploration-exploitation problem. That said, the solution to the contextual bandit problem should be intuitive and reasonable from the supervised learning point of view; in fact some of the well-established supervised learning techniques come in handy in the analysis of contextual bandits.

Here is an approach which tries to adapt the existing bandit algorithms as a black box. Suppose the size of the context space is bounded and small. Run a different K-armed bandit for every value of the context vector. The regret and the amount of information required to do well scale linearly in the number of contexts. This approach is a little counter-intuitive; good supervised learning algorithms often require information which is (essentially) independent of the number of contexts (instead they depend on the complexity of the concept class defined on top of the features/contexts).

One can get inspiration from supervised learning. Define a policy space $H$ from which policies are chosen, and treat every policy $h(x) \in H$ as a different arm. This removes the explicit dependence on the number of contexts, but it creates a linear dependence on the number of policies.
Via Occam's razor/VC dimension/margin bounds, we already know that supervised learning requires experience much smaller than the number of policies.

The name contextual bandit is borrowed from Langford and Zhang [3], but the problem has been known under other names as well; e.g. bandit problems with expert advice [1], associative reinforcement learning [2].

Bibliographical notes

References

[1] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. J. Mach. Learn. Research, 3:397-422, 2003.
[2] Andrew G Barto and P Anandan. Pattern-recognizing stochastic learning automata. Systems, Man and Cybernetics, IEEE Transactions on, (3), 1985.

[3] John Langford and Tong Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Adv. Neural Info. Proc. Sys. (NIPS), pages 817-824, 2008.
[4] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. ACM, 2010.
