Conservative Contextual Linear Bandits


Abbas Kazerouni (Stanford University, abbask@stanford.edu), Mohammad Ghavamzadeh (Adobe Research, ghavamza@adobe.com), Yasin Abbasi-Yadkori (Adobe Research, abbasiya@adobe.com), and Benjamin Van Roy (Stanford University, bvr@stanford.edu)

Abstract

Safety is a desirable property that can immensely increase the applicability of learning algorithms in real-world decision-making problems. It is much easier for a company to deploy an algorithm that is safe, i.e., guaranteed to perform at least as well as a baseline. In this paper, we study the issue of safety in contextual linear bandits, a setting with applications in many different fields including personalized ad recommendation in online marketing. We formulate a notion of safety for this class of algorithms. We develop a safe contextual linear bandit algorithm, called conservative linear UCB (CLUCB), that simultaneously minimizes its regret and satisfies the safety constraint, i.e., maintains its performance above a fixed percentage of the performance of a baseline strategy, uniformly over time. We prove an upper bound on the regret of CLUCB and show that it can be decomposed into two terms: 1) an upper bound on the regret of the standard linear UCB algorithm that grows with the time horizon, and 2) a constant term (one that does not grow with the time horizon) that accounts for the loss of being conservative in order to satisfy the safety constraint. We empirically show that our algorithm is safe and validate our theoretical analysis.

1 Introduction

Many problems in science and engineering can be formulated as decision-making problems under uncertainty. Although many learning algorithms have been developed to find a good policy/strategy for these problems, most of them do not provide any guarantee that their resulting policy will perform well when it is deployed. This is a major obstacle to using learning algorithms in many different fields, such as online marketing, health sciences, finance, and robotics. Therefore, developing learning algorithms with safety guarantees can

immensely increase the applicability of learning in solving decision problems. A policy generated by a learning algorithm is considered to be safe if it is guaranteed to perform at least as well as a baseline. The baseline can be either a baseline value or the performance of a baseline strategy. It is important to note that since the policy is learned from data, and data is often random, the generated policy is a random variable, and thus the safety guarantees hold with high probability.

Safety can be studied in both offline and online scenarios. In the offline case, the algorithm learns the policy from a batch of data, usually generated by the current or recent strategies of the company, and the question is whether the learned policy, when deployed, will perform as well as the current strategy or no worse than a baseline value. This scenario has recently been studied heavily in both model-based (e.g., [7]) and model-free (e.g., [3, 13, 14, 11, 10, 6]) settings. In the model-based approach, we first use the batch of data to build a simulator that mimics the behavior of the dynamical system under study (a hospital's ER, a financial market, a robot), and then use this simulator to generate data and learn the policy. The main challenge here is to have guarantees on the performance of the learned policy, given the error in the simulator. This line of research is closely related to the area of robust learning and control. In the model-free approach, we learn the policy directly from the batch of data, without building a simulator. This line of research is related to off-policy evaluation and control. While the model-free approach is more suitable for problems in which we have access to a large batch of data, such as in online marketing, the model-based approach works better in problems in which data is harder to collect, but instead we have good knowledge about the underlying dynamical system that allows us to build an accurate simulator.

In the online scenario, the algorithm learns a policy while interacting with the real system. Although (reasonable) online algorithms will eventually learn a good or an optimal policy, there is no guarantee on their performance along the way (the performance of their intermediate policies), especially at the very beginning, when they perform a large amount of exploration. Thus, in order to guarantee safety in online algorithms, it is important to control their exploration and make it more conservative. Consider a manager who lets our learning algorithm run together with her company's current strategy (the baseline policy) as long as it is safe, i.e., the loss incurred by letting a portion of the traffic be handled by our algorithm (instead of by the baseline policy) does not exceed a certain threshold. Although we are confident that our algorithm will eventually perform at least as well as the baseline strategy, it should be able to remain alive (not be terminated by the manager) long enough for this to happen. Therefore, we should make it more conservative (less exploratory) in a way that does not violate the manager's safety constraint. This setting has been studied in the multi-armed bandit (MAB) literature [15]: [15] treated the baseline policy as a fixed arm in a MAB, formulated safety using a constraint defined on the performance of the baseline policy (the mean of the baseline arm), and modified the UCB algorithm [2] to satisfy this constraint.

In this paper, we study the notion of safety in contextual linear bandits, a setting that has applications in many different fields including online personalized ad recommendation. (Other definitions of safety have also been studied in contextual linear bandits; see, e.g., Example 2 in [9].)

We first formulate safety in this setting, as a constraint that must hold uniformly in time, in Section 2. Our goal is to design learning algorithms that minimize regret under the constraint that at any given time, their expected sum of rewards remains above a fixed percentage of the expected sum of rewards of the baseline policy. This fixed percentage depends on the amount of risk that the manager is willing to take. Then in Section 3, we propose an algorithm, called conservative linear UCB (CLUCB), that satisfies the safety constraint. At each round, CLUCB plays the action suggested by the standard linear UCB (LUCB) algorithm (e.g., [5, 8, 1, 4, 9]) only if it satisfies the safety constraint for the worst choice of the parameter in the confidence set, and plays the action suggested by the baseline policy otherwise. We also prove an upper bound on the regret of CLUCB, which can be decomposed into two terms. The first term is an upper bound on the regret of LUCB that grows at the rate $\sqrt{T}\log(T)$. The second term is constant (it does not grow with the horizon T) and accounts for the loss of being conservative in order to satisfy the safety constraint. This improves over the regret bound derived in [15] for the MAB setting, where the regret of being conservative grows with time. In Section 4, we show how CLUCB can be extended to the case where the reward of the baseline policy is unknown, without a change in its rate of regret. Finally, in Section 5, we report experimental results that show CLUCB behaves as expected in practice and validate our theoretical analysis.

2 Problem Formulation

In this section, we first review the standard linear bandit setting and then introduce the conservative linear bandit formulation considered in this paper.

2.1 Linear Bandit

In the linear bandit setting, at any time t, the agent is given a set of (possibly infinitely many) actions/options $A_t$, where each action $a \in A_t$ is associated with a feature vector $\phi^t_a \in \mathbb{R}^d$. At each round t, the agent selects an action $a_t \in A_t$. Upon selecting $a_t$, the agent observes a random reward $y_t$ generated as
$$y_t = \langle \theta^*, \phi^t_{a_t} \rangle + \eta_t, \qquad (1)$$
where $\theta^* \in \mathbb{R}^d$ is the unknown reward parameter, $\langle \theta^*, \phi^t_{a_t} \rangle = r^t_{a_t}$ is the expected reward of action $a_t$ at time t, i.e., $r^t_{a_t} = \mathbb{E}[y_t]$, and $\eta_t$ is a random noise that satisfies

Assumption 1. Each element $\eta_t$ of the noise sequence $\{\eta_t\}_{t=1}^{\infty}$ is conditionally $\sigma$-sub-Gaussian, i.e.,
$$\forall \zeta \in \mathbb{R}, \qquad \mathbb{E}\big[e^{\zeta \eta_t} \mid a_{1:t}, \eta_{1:t-1}\big] \le \exp\Big(\frac{\zeta^2 \sigma^2}{2}\Big).$$

The sub-Gaussian assumption automatically implies that $\mathbb{E}[\eta_t \mid a_{1:t}, \eta_{1:t-1}] = 0$ and $\mathrm{Var}[\eta_t \mid a_{1:t}, \eta_{1:t-1}] \le \sigma^2$.
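As a concrete illustration of the observation model (1), the following minimal sketch (our own, not from the paper; Gaussian noise is used as one example of a $\sigma$-sub-Gaussian distribution, and all variable names are ours) simulates a reward pull:

```python
import numpy as np

# Minimal sketch of the observation model in Eq. (1): y_t = <theta*, phi_a> + eta_t.
rng = np.random.default_rng(0)
d, sigma = 4, 1.0
theta_star = rng.normal(size=d)  # the unknown reward parameter theta*

def pull(phi_a: np.ndarray) -> float:
    """Return a noisy reward for the action whose feature vector is phi_a."""
    return float(theta_star @ phi_a + sigma * rng.normal())
```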

Note that the above formulation allows time-varying action sets and time-dependent feature vectors for each action, and thus includes the linear contextual bandit setting. In a linear contextual bandit, if we denote by $x_t$ the state of the system at time t, the time-dependent feature vector $\phi^t_a$ for action a equals $\phi(x_t, a)$, the feature vector of the state-action pair $(x_t, a)$. We also make the following standard assumption on the unknown parameter $\theta^*$ and the feature vectors:

Assumption 2. There exist constants $B, D \ge 0$ such that $\|\theta^*\|_2 \le B$, $\|\phi^t_a\|_2 \le D$, and $\langle \theta^*, \phi^t_a \rangle \in [0,1]$, for all t and all $a \in A_t$.

We define $\mathcal{B} = \{\theta \in \mathbb{R}^d : \|\theta\|_2 \le B\}$ and $\mathcal{F} = \{\phi \in \mathbb{R}^d : \|\phi\|_2 \le D,\ \langle \theta^*, \phi \rangle \in [0,1]\}$ to be the parameter space and feature space, respectively.

Obviously, if the agent knew $\theta^*$, she would choose the optimal action $a^*_t = \arg\max_{a \in A_t} \langle \theta^*, \phi^t_a \rangle$ at each round t. Since $\theta^*$ is unknown, the agent's goal is to maximize her cumulative expected reward after T rounds, i.e., $\sum_{t=1}^{T} \langle \theta^*, \phi^t_{a_t} \rangle$, or equivalently, to minimize the (pseudo-)regret
$$R_T = \sum_{t=1}^{T} \langle \theta^*, \phi^t_{a^*_t} \rangle - \sum_{t=1}^{T} \langle \theta^*, \phi^t_{a_t} \rangle, \qquad (2)$$
which is the difference between the cumulative expected rewards of the optimal and the agent's strategies.

2.2 Conservative Linear Bandit

The conservative linear bandit setting is exactly the same as the linear bandit, except that there exists a baseline policy $\pi_b$ (the company's strategy) that at each round t selects action $b_t \in A_t$ and incurs the expected reward $r^t_{b_t} = \langle \theta^*, \phi^t_{b_t} \rangle$. We assume that the expected rewards of the actions taken by the baseline policy, $r^t_{b_t}$, are known. This is often a reasonable assumption, since we usually have access to a large amount of data generated by the baseline policy, as it is our company's strategy, and thus have a good estimate of its performance (see Remark 1 at the end of this section). We relax this assumption in Section 4 and extend our proposed algorithm to the case where the reward function of the baseline policy is not known in advance.

Another difference between the conservative and standard linear bandit settings is the performance constraint, which is defined as follows:

Definition 1 (Performance Constraint). At each round t, the difference between the performances of the baseline and the agent's policies should remain below a predefined fraction $\alpha \in (0,1)$ of the baseline performance. This constraint may be written formally as
$$\sum_{i=1}^{t} r^i_{b_i} - \sum_{i=1}^{t} r^i_{a_i} \le \alpha \sum_{i=1}^{t} r^i_{b_i}, \qquad \forall t \in \{1, \dots, T\},$$

or equivalently as
$$\sum_{i=1}^{t} r^i_{a_i} \ge (1 - \alpha) \sum_{i=1}^{t} r^i_{b_i}, \qquad \forall t \in \{1, \dots, T\}. \qquad (3)$$

The parameter $\alpha \in (0,1)$ controls how conservative the agent should be. Small values of $\alpha$ mean that only small losses are tolerated, and thus the agent should be overly conservative, whereas large values of $\alpha$ indicate that the manager is willing to take risk, and thus the agent can explore more and be less conservative. Given the value of $\alpha$, the goal of the agent is to select her actions in a way that both minimizes her regret (2) and satisfies the performance constraint (3). In the next section, we propose a linear bandit algorithm that achieves this goal.

Remark 1. As mentioned above, it is often reasonable to assume that we have access to a good estimate of the baseline reward function. If, in addition to this estimate, we have access to the data generated by the baseline policy that was used to compute it, we can use that data in our algorithm. The reason we do not use the data generated by the actions suggested by the baseline policy in constructing the confidence sets in our algorithm is mainly to keep the analysis simple. However, when we deal with the more general case of unknown baseline reward in Section 4, we construct the confidence sets using all available data, including the data generated by the baseline policy. It is also important to note that having a good estimate of the baseline reward function does not necessarily mean that we know the unknown parameter $\theta^*$, since the data used for this estimate has only been generated by the baseline policy, and thus may provide a good estimate of $\theta^*$ only in a limited subspace.

3 A Conservative Linear Bandit Algorithm

In this section, we propose a linear bandit algorithm, called conservative linear upper confidence bound (CLUCB), that is based on the optimism-in-the-face-of-uncertainty principle and, given the value of $\alpha$, both minimizes the regret (2) and satisfies the performance constraint (3). Algorithm 1 contains the pseudocode of CLUCB. At each round t, CLUCB uses the previous observations to build a confidence set $C_t$ that, with high probability, contains the unknown parameter $\theta^*$. It then selects the optimistic action $a'_t \in \arg\max_{a \in A_t} \max_{\theta \in C_t} \langle \theta, \phi^t_a \rangle$, which has the best performance among all the actions available in $A_t$, within the confidence set $C_t$. In order to make sure that constraint (3) is satisfied, the algorithm plays the optimistic action $a'_t$ only if it satisfies the constraint for the worst choice of the parameter $\theta \in C_t$. To make this more precise, let $S_{t-1}$ be the set of rounds $i < t$ at which CLUCB has played the optimistic action, i.e., $a_i = a'_i$. Accordingly, $S^c_{t-1} = \{1, 2, \dots, t-1\} \setminus S_{t-1}$ is the set of rounds $j < t$ at which CLUCB has followed the baseline policy, i.e., $a_j = b_j$.
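Before presenting the algorithm, note that (3) is a simple prefix condition on cumulative expected rewards; a direct check of it looks as follows (a minimal sketch of our own, with hypothetical argument names):

```python
def satisfies_constraint(agent_rewards, baseline_rewards, alpha):
    """Check Eq. (3): sum_i r_{a_i} >= (1 - alpha) * sum_i r_{b_i} at every round t."""
    agent_sum = baseline_sum = 0.0
    for r_a, r_b in zip(agent_rewards, baseline_rewards):
        agent_sum += r_a          # cumulative expected reward of the agent
        baseline_sum += r_b       # cumulative expected reward of the baseline
        if agent_sum < (1.0 - alpha) * baseline_sum:
            return False          # the safety constraint is violated at this round
    return True
```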

Algorithm 1 Pseudocode of CLUCB

Input: $\alpha$, $\mathcal{B}$, $\mathcal{F}$
Initialize: $S_0 = \emptyset$, $z_0 = \mathbf{0} \in \mathbb{R}^d$, and $C_1 = \mathcal{B}$
for t = 1, 2, 3, ... do
    Find $(a'_t, \theta'_t) \in \arg\max_{(a,\theta) \in A_t \times C_t} \langle \theta, \phi^t_a \rangle$
    Compute $L_t = \min_{\theta \in C_t} \langle \theta, z_{t-1} + \phi^t_{a'_t} \rangle$
    if $L_t + \sum_{i \in S^c_{t-1}} r^i_{b_i} \ge (1-\alpha) \sum_{i=1}^{t} r^i_{b_i}$ then
        Play $a_t = a'_t$ and observe the reward $y_t$ defined by (1)
        Set $z_t = z_{t-1} + \phi^t_{a_t}$, $S_t = S_{t-1} \cup \{t\}$, $S^c_t = S^c_{t-1}$
        Given $a_t$ and $y_t$, construct the confidence set $C_{t+1}$ according to (5)
    else
        Play $a_t = b_t$ and observe the reward $y_t$ defined by (1)
        Set $z_t = z_{t-1}$, $S_t = S_{t-1}$, $S^c_t = S^c_{t-1} \cup \{t\}$, $C_{t+1} = C_t$
    end if
end for

In order to guarantee that it does not violate constraint (3), at each round t, CLUCB plays the optimistic action, i.e., $a_t = a'_t$, only if
$$\min_{\theta \in C_t} \Bigg[ \sum_{i \in S^c_{t-1}} r^i_{b_i} + \Big\langle \theta,\ \underbrace{\sum_{i \in S_{t-1}} \phi^i_{a_i}}_{z_{t-1}} + \phi^t_{a'_t} \Big\rangle \Bigg] \ge (1-\alpha) \sum_{i=1}^{t} r^i_{b_i},$$
and plays the conservative action, i.e., $a_t = b_t$, otherwise. In the next section, we describe how CLUCB constructs and updates the confidence sets $C_t$.

3.1 Construction of Confidence Sets

CLUCB starts with the most general confidence set $C_1 = \mathcal{B}$ and updates its confidence set only when it plays an optimistic action. This is mainly for simplicity and is based on the idea that since the reward function of the baseline policy is known ahead of time, playing a baseline action does not provide any new information about the unknown parameter $\theta^*$. However, this can easily be changed so that the confidence set is updated after each action; this is in fact what we do in the algorithm proposed in Section 4.
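For ellipsoidal confidence sets of the form (5) below, the inner optimizations have closed forms: $\max_{\theta \in C_t} \langle \theta, x \rangle = \langle \hat{\theta}, x \rangle + \beta \|x\|_{V^{-1}}$ and $\min_{\theta \in C_t} \langle \theta, x \rangle = \langle \hat{\theta}, x \rangle - \beta \|x\|_{V^{-1}}$. The sketch below is our own illustration of one CLUCB round using these closed forms (it is not the paper's implementation); variable names mirror the pseudocode:

```python
import numpy as np

def clucb_step(features, theta_hat, V_inv, beta, z, paid_baseline, total_baseline, alpha):
    """One round of Algorithm 1. `features` has one row per action in A_t;
    z = sum of features of optimistic plays so far; paid_baseline = sum of r_b
    over conservative rounds; total_baseline = sum of r_b over all rounds 1..t."""
    # Optimistic action: largest upper confidence bound over the ellipsoid C_t.
    widths = np.sqrt(np.einsum('ij,jk,ik->i', features, V_inv, features))
    a_opt = int(np.argmax(features @ theta_hat + beta * widths))
    # L_t: pessimistic value of all optimistic plays, including the candidate one.
    x = z + features[a_opt]
    L_t = theta_hat @ x - beta * np.sqrt(x @ V_inv @ x)
    # Safety check of Algorithm 1; if it fails, the baseline action b_t is played.
    if L_t + paid_baseline >= (1.0 - alpha) * total_baseline:
        return a_opt, True   # play optimistically; round t joins S_t
    return None, False       # play conservatively (a_t = b_t)
```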

We follow the approach of [1] to build confidence sets for the unknown parameter $\theta^*$. Let $S_t = \{i_1, \dots, i_{m_t}\}$ be the set of rounds up to and including round t at which CLUCB has played the optimistic action, where we define $m_t = |S_t|$. For a fixed value of $\lambda > 0$, let
$$\hat{\theta}_t = \big( \Phi_t^\top \Phi_t + \lambda I \big)^{-1} \Phi_t^\top Y_t \qquad (4)$$
be the regularized least-squares estimate of $\theta^*$ at round t, where $\Phi_t = [\phi^{i_1}_{a_{i_1}}, \dots, \phi^{i_{m_t}}_{a_{i_{m_t}}}]^\top$ and $Y_t = [y_{i_1}, \dots, y_{i_{m_t}}]^\top$. For a fixed confidence parameter $\delta \in (0,1)$, we construct the confidence set for the next round t+1 as
$$C_{t+1} = \big\{ \theta \in \mathbb{R}^d : \|\theta - \hat{\theta}_t\|_{V_t} \le \beta_{t+1} \big\}, \qquad (5)$$
where $\beta_{t+1} = \sigma \sqrt{d \log\big( \frac{1 + (m_t+1) D^2/\lambda}{\delta} \big)} + \sqrt{\lambda} B$, $V_t = \lambda I + \Phi_t^\top \Phi_t$, and the weighted norm is defined as $\|x\|_V = \sqrt{x^\top V x}$ for any $x \in \mathbb{R}^d$ and any positive definite $V \in \mathbb{R}^{d \times d}$. Note that, similar to the linear UCB algorithm (LUCB) in [1], the sub-Gaussian parameter $\sigma$ and the regularization parameter $\lambda$ that appear in the definitions of $\beta_{t+1}$ and $V_t$ should also be given to the CLUCB algorithm as input.

The following proposition (Theorem 2 in [1]) shows that the confidence sets constructed as in (5) contain the true parameter $\theta^*$ with high probability.

Proposition 1. For any $\delta > 0$ and the confidence sets $C_t$ defined by (5), we have $\mathbb{P}[\theta^* \in C_t,\ \forall t \in \mathbb{N}] \ge 1 - \delta$.

Proposition 1 implies that the CLUCB algorithm satisfies the performance constraint (3) at every round t with probability at least $1 - \delta$. This is because at each round t, CLUCB ensures that (3) holds for all $\theta \in C_t$, and $\mathbb{P}[\theta^* \in C_t] \ge 1 - \delta$.
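In code, (4) and (5) reduce to a few lines of linear algebra; the following sketch (ours, under the notation above) computes the estimate, the Gram matrix, and the radius:

```python
import numpy as np

def confidence_set(Phi, Y, lam, sigma, B, D, delta):
    """Compute theta_hat_t of Eq. (4) and the radius beta_{t+1} of Eq. (5).
    Phi: (m_t x d) matrix of features of optimistic plays; Y: their rewards."""
    m_t, d = Phi.shape
    V = lam * np.eye(d) + Phi.T @ Phi                    # V_t = lambda*I + Phi'Phi
    theta_hat = np.linalg.solve(V, Phi.T @ Y)            # regularized least squares
    beta = sigma * np.sqrt(d * np.log((1 + (m_t + 1) * D**2 / lam) / delta)) \
           + np.sqrt(lam) * B
    return theta_hat, V, beta   # C_{t+1} = {theta : ||theta - theta_hat||_V <= beta}
```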

3.2 Regret Analysis of CLUCB

In this section, we prove a regret bound for the proposed CLUCB algorithm. Let $\Delta^t_{b_t} = r^t_{a^*_t} - r^t_{b_t}$ be the baseline gap at round t, i.e., the difference between the expected rewards of the optimal and baseline actions at round t. This quantity shows how suboptimal the action suggested by the baseline policy is at round t. We make the following assumption on the performance of the baseline policy $\pi_b$.

Assumption 3. There exist $0 \le \Delta_l \le \Delta_h$ and $0 < r_l < r_h$ such that, at each round t,
$$\Delta_l \le \Delta^t_{b_t} \le \Delta_h \qquad \text{and} \qquad r_l \le r^t_{b_t} \le r_h. \qquad (6)$$

An obvious candidate for both $\Delta_h$ and $r_h$ is 1, as all the mean rewards are confined in [0,1]. The reward lower bound $r_l$ ensures that the baseline policy maintains a minimum level of performance at each round. Finally, $\Delta_l = 0$ is a reasonable candidate for the lower bound of the baseline gap.

The following proposition shows that the regret of CLUCB can be decomposed into the regret of a linear UCB (LUCB) algorithm (e.g., [1]) and a regret caused by being conservative in order to satisfy the performance constraint (3).

Proposition 2. The regret of CLUCB can be decomposed into two terms as follows:
$$R_T(\text{CLUCB}) \le R_{S_T}(\text{LUCB}) + n_T \Delta_h, \qquad (7)$$
where $R_{S_T}(\text{LUCB})$ is the cumulative (pseudo-)regret of LUCB over the rounds $t \in S_T$, and $n_T = |S^c_T| = T - |S_T| = T - m_T$ is the number of rounds at which CLUCB has played the conservative action.

Proof. From the definition of the regret (2), we have
$$R_T(\text{CLUCB}) = \sum_{t=1}^{T} r^t_{a^*_t} - \sum_{t=1}^{T} r^t_{a_t} = \sum_{t \in S_T} \big( r^t_{a^*_t} - r^t_{a_t} \big) + \sum_{t \in S^c_T} \big( r^t_{a^*_t} - r^t_{b_t} \big) = \sum_{t \in S_T} \big( r^t_{a^*_t} - r^t_{a_t} \big) + \sum_{t \in S^c_T} \Delta^t_{b_t} \le \sum_{t \in S_T} \big( r^t_{a^*_t} - r^t_{a_t} \big) + n_T \Delta_h. \qquad (8)$$
The result follows from the fact that for $t \in S_T$, CLUCB plays exactly the same actions as LUCB, and thus the first term in (8) represents LUCB's regret over these rounds.

The regret bound of LUCB for the confidence sets defined by (5) can be derived from the results of [1]. Let $E$ be the event that $\theta^* \in C_t,\ \forall t \in \mathbb{N}$, which according to Proposition 1 holds with probability at least $1 - \delta$. The following proposition provides a bound on $R_{S_T}(\text{LUCB})$. Since this proposition is a direct application of Theorem 3 in [1], we omit its proof here.

Proposition 3. On event $E$, for any $T \in \mathbb{N}$, we have
$$R_{S_T}(\text{LUCB}) \le 4 \sqrt{m_T\, d \log\Big( \lambda + \frac{m_T D}{d} \Big)} \Bigg[ B \sqrt{\lambda} + \sigma \sqrt{2 \log\big( \tfrac{1}{\delta} \big) + d \log\Big( 1 + \frac{m_T D}{\lambda d} \Big)} \Bigg] = O\Big( d \log\Big( \frac{D T}{\lambda \delta} \Big) \sqrt{T} \Big). \qquad (9)$$

Now, in order to bound the regret of CLUCB, we only need an upper bound on $n_T$, i.e., the number of times CLUCB deviates from LUCB and selects the action suggested by the baseline policy. We start this part of the proof with the following lemma.

Lemma 4. For given $k \in \mathbb{N}$, $\lambda > 0$, and any sequence $X_1, X_2, \dots, X_k$ in $\mathbb{R}^d$ such that $\|X_i\|_2 \le D$ for all i, let $V_0 = \lambda I$ and $V_i = \lambda I + \sum_{j=1}^{i} X_j X_j^\top$ for $1 \le i \le k$. Then, we have
$$\sum_{i=1}^{k} \min\Big( 1, \|X_i\|^2_{V_{i-1}^{-1}} \Big) \le 2 d \log\Big( 1 + \frac{k D^2}{\lambda d} \Big). \qquad (10)$$

Lemma 4 is a direct application of Lemma 11 in [1], so we omit its proof here. The following theorem provides a bound on the number of rounds at which CLUCB acts conservatively and follows the baseline policy $\pi_b$.

Theorem 5. Assume that $\lambda \ge D^2$. On event $E$, for any horizon $T \in \mathbb{N}$, we have
$$n_T \le 1 + \frac{114\, d^2 (B\sqrt{\lambda} + \sigma)^2}{\alpha r_l\, (\Delta_l + \alpha r_l)} \Bigg[ \log\Bigg( \frac{64\, d (B\sqrt{\lambda} + \sigma) D}{\sqrt{\lambda \delta}\, (\Delta_l + \alpha r_l)} \Bigg) \Bigg]^2.$$

Proof. Let $\tau$ be the last round at which CLUCB plays conservatively (the action suggested by the baseline policy), i.e., $\tau = \max\{ 1 \le t \le T : a_t = b_t \}$. From Algorithm 1, at round $\tau$, we may write
$$\min_{\theta \in C_\tau} \Big\langle \theta, \sum_{t \in S_{\tau-1}} \phi^t_{a_t} + \phi^\tau_{a'_\tau} \Big\rangle + \sum_{t \in S^c_{\tau-1}} r^t_{b_t} < (1-\alpha) \sum_{t=1}^{\tau} r^t_{b_t},$$
or equivalently,
$$\alpha \sum_{t=1}^{\tau} r^t_{b_t} < \sum_{t \in S_{\tau-1}} r^t_{b_t} + r^\tau_{b_\tau} - \min_{\theta \in C_\tau} \Big\langle \theta, \sum_{t \in S_{\tau-1}} \phi^t_{a_t} + \phi^\tau_{a'_\tau} \Big\rangle. \qquad (11)$$
We may rewrite the right-hand side of (11) as
$$\sum_{t \in S_{\tau-1}} \Big[ r^t_{b_t} - \max_{\theta \in C_t} \langle \theta, \phi^t_{a_t} \rangle \Big] + \sum_{t \in S_{\tau-1}} \Big[ \max_{\theta \in C_t} \langle \theta, \phi^t_{a_t} \rangle - \langle \theta^*, \phi^t_{a_t} \rangle \Big] + \Big[ r^\tau_{b_\tau} - \max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle \Big] + \Big[ \max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle - \langle \theta^*, \phi^\tau_{a'_\tau} \rangle \Big] + \max_{\theta \in C_\tau} \Big\langle \theta^* - \theta,\ \sum_{t \in S_{\tau-1}} \phi^t_{a_t} + \phi^\tau_{a'_\tau} \Big\rangle. \qquad (12)$$
Note that for each $t \in S_{\tau-1}$, the optimistic choice of $a_t$ and the fact that $\theta^* \in C_t$ give
$$r^t_{b_t} - \max_{\theta \in C_t} \langle \theta, \phi^t_{a_t} \rangle \le r^t_{b_t} - r^t_{a^*_t} = -\Delta^t_{b_t}, \qquad (13)$$
and similarly, for the round $\tau$, we have
$$r^\tau_{b_\tau} - \max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle \le r^\tau_{b_\tau} - r^\tau_{a^*_\tau} = -\Delta^\tau_{b_\tau}. \qquad (14)$$
Using inequalities (12) to (14), we may rewrite (11) as

$$\alpha \sum_{t=1}^{\tau} r^t_{b_t} < -(m_{\tau-1}+1)\Delta_l + \sum_{t \in S_{\tau-1}} \max_{\theta \in C_t} \langle \theta - \theta^*, \phi^t_{a_t} \rangle + \max_{\theta \in C_\tau} \langle \theta - \theta^*, \phi^\tau_{a'_\tau} \rangle + \max_{\theta \in C_\tau} \Big\langle \theta^* - \theta,\ \sum_{t \in S_{\tau-1}} \phi^t_{a_t} + \phi^\tau_{a'_\tau} \Big\rangle, \qquad (15)$$
where we also used Assumption 3 to lower-bound the baseline gaps by $\Delta_l$. Bounding each remaining term via the Cauchy-Schwarz inequality and the definition of the confidence sets (each max term by $2\beta_t \|\phi^t_{a_t}\|_{V_{t-1}^{-1}}$, and the last term analogously), and using that the $\beta_t$ are non-decreasing, we obtain
$$\alpha \sum_{t=1}^{\tau} r^t_{b_t} \le -(m_{\tau-1}+1)\Delta_l + 4\beta_\tau \|\phi^\tau_{a'_\tau}\|_{V_{\tau-1}^{-1}} + 4\beta_\tau \sum_{t \in S_{\tau-1}} \|\phi^t_{a_t}\|_{V_{t-1}^{-1}}. \qquad (16)$$
On the other hand, it follows from (15) and the fact that all rewards are in [0,1] that
$$\alpha \sum_{t=1}^{\tau} r^t_{b_t} \le -(m_{\tau-1}+1)\Delta_l + 4(m_{\tau-1}+1). \qquad (17)$$
Combining (16) and (17), we may write
$$\alpha \sum_{t=1}^{\tau} r^t_{b_t} \le -(m_{\tau-1}+1)\Delta_l + 4\beta_\tau \Bigg[ \min\Big( \|\phi^\tau_{a'_\tau}\|_{V_{\tau-1}^{-1}},\ 1 \Big) + \sum_{t \in S_{\tau-1}} \min\Big( \|\phi^t_{a_t}\|_{V_{t-1}^{-1}},\ 1 \Big) \Bigg]. \qquad (18)$$
In order to write the next equation more compactly, let us define
$$\Gamma_\tau = \min\Big( \|\phi^\tau_{a'_\tau}\|^2_{V_{\tau-1}^{-1}},\ 1 \Big) + \sum_{t \in S_{\tau-1}} \min\Big( \|\phi^t_{a_t}\|^2_{V_{t-1}^{-1}},\ 1 \Big).$$
From the Cauchy-Schwarz inequality and Lemma 4, we have

$$\alpha \sum_{t=1}^{\tau} r^t_{b_t} \le -(m_{\tau-1}+1)\Delta_l + 4\beta_\tau \sqrt{(m_{\tau-1}+1)\, \Gamma_\tau} \le -(m_{\tau-1}+1)\Delta_l + 4\beta_\tau \sqrt{2 (m_{\tau-1}+1)\, d \log\Big( 1 + \frac{m_\tau D^2}{\lambda d} \Big)}$$
$$\le -(m_{\tau-1}+1)\Delta_l + 8 \sqrt{(m_{\tau-1}+1)\, d \log\Big( 1 + \frac{(m_{\tau-1}+1) D^2}{\lambda d} \Big)} \Bigg[ \sqrt{\lambda} B + \sigma \sqrt{d \log\Big( \frac{1 + (m_{\tau-1}+1) D^2/\lambda}{\delta} \Big)} \Bigg]$$
$$\le -(m_{\tau-1}+1)\Delta_l + 8\, d\, (B\sqrt{\lambda} + \sigma) \sqrt{m_{\tau-1}+1}\ \log\Big( \frac{2 (m_{\tau-1}+1) D^2}{\sqrt{\lambda \delta}} \Big),$$
where the last inequality follows from the fact that $\lambda \ge D^2$. On the other hand, since $r^t_{b_t} \ge r_l$ for all t and $\tau = n_{\tau-1} + m_{\tau-1} + 1$, we may write
$$\alpha r_l\, n_{\tau-1} \le -(m_{\tau-1}+1)(\Delta_l + \alpha r_l) + 8\, d\, (B\sqrt{\lambda} + \sigma) \sqrt{m_{\tau-1}+1}\ \log\Big( \frac{2 (m_{\tau-1}+1) D^2}{\sqrt{\lambda \delta}} \Big). \qquad (19)$$
The RHS of (19) is positive only for a finite range of $m_{\tau-1}$, and thus has a finite upper bound. For $\bar{m} = m_{\tau-1}+1$, $c_1 = 8 d (B\sqrt{\lambda}+\sigma)$, $c_2 = \frac{2 D^2}{\sqrt{\lambda \delta}}$, and $c_3 = \Delta_l + \alpha r_l$, Lemma 8 (reported in Appendix A) provides the following upper bound on the RHS (and thus on the LHS) of (19):
$$\alpha r_l\, n_{\tau-1} \le \frac{114\, d^2 (B\sqrt{\lambda}+\sigma)^2}{\Delta_l + \alpha r_l} \Bigg[ \log\Bigg( \frac{64\, d (B\sqrt{\lambda}+\sigma) D}{\sqrt{\lambda \delta}\, (\Delta_l + \alpha r_l)} \Bigg) \Bigg]^2.$$
The result follows from $n_T = n_\tau = n_{\tau-1} + 1$.

We now have all the necessary ingredients to derive a regret bound on the performance of the CLUCB algorithm. We report the regret bound of CLUCB in Theorem 6, whose proof is a direct consequence of the results of Propositions 2 and 3 and Theorem 5.

Theorem 6. With probability at least $1-\delta$, the CLUCB algorithm satisfies the performance constraint (3) for all $t \in \mathbb{N}$, and has the following regret bound:
$$R_T(\text{CLUCB}) = O\Bigg( d \log\Big( \frac{D T}{\lambda \delta} \Big) \sqrt{T} + \frac{K\, \Delta_h}{\alpha r_l} \Bigg), \qquad (20)$$
where K is a constant that depends only on the parameters of the problem:
$$K = \frac{d^2 (B\sqrt{\lambda}+\sigma)^2}{\Delta_l + \alpha r_l} \Bigg[ \log\Bigg( \frac{64\, d (B\sqrt{\lambda}+\sigma) D}{\sqrt{\lambda \delta}\, (\Delta_l + \alpha r_l)} \Bigg) \Bigg]^2.$$
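To get a feel for the size of the conservative term, the constant K can be evaluated numerically. The sketch below is ours and uses our reconstruction of the displayed formula; all input values are illustrative:

```python
import numpy as np

def theorem6_K(d, B, sigma, lam, delta, D, delta_l, alpha, r_l):
    """Evaluate the constant K of Theorem 6 for given problem parameters."""
    denom = delta_l + alpha * r_l
    log_term = np.log(64 * d * (B * np.sqrt(lam) + sigma) * D
                      / (np.sqrt(lam * delta) * denom))
    return d**2 * (B * np.sqrt(lam) + sigma)**2 / denom * log_term**2

# The conservative regret term K * Delta_h / (alpha * r_l) blows up as alpha -> 0,
# matching Remark 2 below; e.g., compare alpha = 0.01 and alpha = 0.1:
for alpha in (0.01, 0.1):
    print(alpha, theorem6_K(4, 1.0, 1.0, 1.0, 0.001, 1.0, 0.0, alpha, 0.5) / (alpha * 0.5))
```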

Remark 2. The first term in the regret bound (20) is the regret of LUCB, which grows at the rate $\sqrt{T}\log(T)$. The second term accounts for the loss incurred by being conservative in order to satisfy the performance constraint (3). Our results indicate that this loss does not grow with time (since CLUCB acts conservatively only in a finite number of rounds). This improves over the regret bound derived in [15] for the MAB setting, where the regret of being conservative grows with time. Furthermore, the regret bound of Theorem 6 clearly indicates that CLUCB's regret is larger for smaller values of $\alpha$. This perfectly matches the intuition that the agent must be more conservative, and thus suffers higher regret, for smaller values of $\alpha$. Theorem 6 also indicates that CLUCB's regret is smaller for smaller values of $\Delta_h$, because when the baseline policy $\pi_b$ is close to optimal, the algorithm does not lose much by being conservative.

4 Unknown Baseline Reward

In this section, we consider the case where the expected rewards of the actions taken by the baseline policy, $r^t_{b_t}$, are unknown at the beginning. We show how the CLUCB algorithm presented in Section 3 should be changed to handle this case, and present a new algorithm, called CLUCB2. We prove a regret bound for CLUCB2 of the same rate as that of CLUCB. This shows that the lack of knowledge about the reward function of the baseline policy does not hurt our algorithm in terms of the rate of the regret.

Algorithm 2 contains the pseudocode of CLUCB2. The main difference with CLUCB is in the condition that is checked at each round t to decide whether to play the optimistic action $a'_t$ or the conservative action $b_t$. This condition should be selected in a way that CLUCB2 satisfies the constraint (3). We may rewrite (3) as
$$\sum_{i \in S_{t-1}} r^i_{a_i} + r^t_{a_t} + \alpha \sum_{i \in S^c_{t-1}} r^i_{b_i} \ge (1-\alpha) \Big( \sum_{i \in S_{t-1}} r^i_{b_i} + r^t_{b_t} \Big). \qquad (21)$$
If we lower-bound the LHS and upper-bound the RHS of (21), we obtain
$$\min_{\theta \in C_t} \Big\langle \theta, \sum_{i \in S_{t-1}} \phi^i_{a_i} + \phi^t_{a_t} \Big\rangle + \alpha \min_{\theta \in C_t} \Big\langle \theta, \sum_{i \in S^c_{t-1}} \phi^i_{b_i} \Big\rangle \ge (1-\alpha) \max_{\theta \in C_t} \Big\langle \theta, \sum_{i \in S_{t-1}} \phi^i_{b_i} + \phi^t_{b_t} \Big\rangle. \qquad (22)$$
Since each confidence set $C_t$ is built in a way to contain the true parameter $\theta^*$ with high probability, it is easy to see that (21) is satisfied whenever (22) is true.

CLUCB2 uses both optimistic and conservative actions, and their corresponding rewards, in building its confidence sets. Specifically, for any t, we let $\Phi_t = [\phi^1_{a_1}, \phi^2_{a_2}, \dots, \phi^t_{a_t}]^\top$, $Y_t = [y_1, y_2, \dots, y_t]^\top$, $V_t = \lambda I + \Phi_t^\top \Phi_t$, and define the least-squares estimate after round t as
$$\hat{\theta}_t = (\Phi_t^\top \Phi_t + \lambda I)^{-1} \Phi_t^\top Y_t. \qquad (23)$$

Algorithm 2 CLUCB2

Input: $\alpha$, $r_l$, $\mathcal{B}$, $\mathcal{F}$
Initialize: $n \leftarrow 0$, $z \leftarrow \mathbf{0}$, $w \leftarrow \mathbf{0}$, $v \leftarrow \mathbf{0}$, and $C_1 \leftarrow \mathcal{B}$
for t = 1, 2, 3, ... do
    Let $b_t$ be the action suggested by $\pi_b$ at round t
    Find $(a'_t, \tilde{\theta}_t) \in \arg\max_{(a,\theta) \in A_t \times C_t} \langle \theta, \phi^t_a \rangle$
    Find $R_t = \max_{\theta \in C_t} \langle \theta, v + \phi^t_{b_t} \rangle$ and $L_t = \min_{\theta \in C_t} \langle \theta, z + \phi^t_{a'_t} \rangle + \alpha \max\big\{ \min_{\theta \in C_t} \langle \theta, w \rangle,\ n\, r_l \big\}$
    if $L_t \ge (1-\alpha) R_t$ then
        Play $a_t = a'_t$ and observe $y_t$ defined by (1)
        Set $z \leftarrow z + \phi^t_{a_t}$ and $v \leftarrow v + \phi^t_{b_t}$
    else
        Play $a_t = b_t$ and observe $y_t$ defined by (1)
        Set $w \leftarrow w + \phi^t_{b_t}$ and $n \leftarrow n + 1$
    end if
    Given $a_t$ and $y_t$, construct the confidence set $C_{t+1}$ according to (24)
end for

Given $V_t$ and $\hat{\theta}_t$, the confidence set for round t+1 is constructed as
$$C_{t+1} = \big\{ \theta \in C_t : \|\theta - \hat{\theta}_t\|_{V_t} \le \beta_{t+1} \big\}, \qquad (24)$$
where $C_1 = \mathcal{B}$ and $\beta_t = \sigma \sqrt{d \log\big( \frac{1 + t D^2/\lambda}{\delta} \big)} + B \sqrt{\lambda}$. Similar to Proposition 1, we can easily prove that the confidence sets built by (24) contain the true parameter $\theta^*$ with high probability, i.e., $\mathbb{P}[\theta^* \in C_t,\ \forall t \in \mathbb{N}] \ge 1 - \delta$.

Remark 3. Note that unlike the CLUCB algorithm, here we build nested confidence sets, i.e., $C_{t+1} \subseteq C_t \subseteq C_{t-1} \subseteq \cdots$, which is necessary for the proof of the algorithm. Potentially, this can increase the computational complexity of CLUCB2, but from a practical point of view, the confidence sets become nested automatically after sufficient data has been observed. Therefore, the nestedness constraint in building the confidence sets can be relaxed at sufficiently large rounds.

The following theorem guarantees that CLUCB2 satisfies the safety constraint (3) with high probability, while its regret has the same rate as that of CLUCB and is worse than that of LUCB only up to an additive constant.

Theorem 7. With probability at least $1-\delta$, the CLUCB2 algorithm satisfies the performance constraint (3) for all $t \in \mathbb{N}$, and has the following regret bound:
$$R_T(\text{CLUCB2}) = O\Bigg( d \log\Big( \frac{D T}{\lambda \delta} \Big) \sqrt{T} + \frac{K\, \Delta_h}{\alpha^2 r_l^2} \Bigg), \qquad (25)$$
where K is a constant that depends only on the parameters of the problem:
$$K = 256\, d^2 (B\sqrt{\lambda}+\sigma)^2 \Bigg[ \log\Bigg( \frac{10\, d (B\sqrt{\lambda}+\sigma) \sqrt{D}}{\alpha r_l\, (\lambda \delta)^{1/4}} \Bigg) \Bigg]^2 + 1.$$
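Returning to Algorithm 2's decision rule: since the confidence sets in (24) are intersections of ellipsoids, a practical sketch can treat $C_t$ as the current ellipsoid alone (Remark 3 notes that the nestedness can be relaxed), in which case $L_t$ and $R_t$ again have closed forms. The following is our own illustration, not the paper's implementation:

```python
import numpy as np

def clucb2_check(theta_hat, V_inv, beta, z, v, w, n, phi_a, phi_b, r_l, alpha):
    """Return True if CLUCB2 may play the optimistic action (L_t >= (1-alpha)*R_t).
    z, v: feature sums over optimistic rounds (actions played / baseline actions);
    w, n: feature sum and count of conservative rounds."""
    def lo(x):  # min over the ellipsoid C_t of <theta, x>
        return theta_hat @ x - beta * np.sqrt(x @ V_inv @ x)
    def hi(x):  # max over the ellipsoid C_t of <theta, x>
        return theta_hat @ x + beta * np.sqrt(x @ V_inv @ x)
    L_t = lo(z + phi_a) + alpha * max(lo(w), n * r_l)
    R_t = hi(v + phi_b)
    return L_t >= (1.0 - alpha) * R_t
```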

Figure 1: Average (over 1,000 runs) per-step regret of LUCB and CLUCB for different values of α.

We report the proof of Theorem 7 in Appendix B. The proof follows the same steps as that of Theorem 6, with additional non-trivial technicalities that are highlighted there.

5 Simulation Results

In this section, we provide simulation results to illustrate the performance of the proposed CLUCB algorithm. We consider a time-independent action set of 100 arms, each with a time-independent feature vector living in $\mathbb{R}^4$. These feature vectors and the parameter $\theta^*$ are randomly drawn from $N(0, I_4)$ such that the mean reward associated with each arm is positive. The observation noise at each time step is generated independently from $N(0,1)$, and the mean reward of the baseline policy at any time is taken to be the reward of the 10th best action. We set $\lambda = 1$, fix the confidence parameter $\delta$, and average the results over 1,000 realizations.
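For reference, the experimental setup above can be reproduced along the following lines (a sketch under our reading of the setup; in particular, redrawing each arm's feature vector until its mean reward is positive is one simple way to realize the stated condition):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_arms = 4, 100
theta_star = rng.normal(size=d)             # parameter drawn from N(0, I_4)
features = np.empty((n_arms, d))
for i in range(n_arms):                     # redraw each arm until its mean reward
    phi = rng.normal(size=d)                # <theta*, phi> is positive
    while phi @ theta_star <= 0:
        phi = rng.normal(size=d)
    features[i] = phi
means = features @ theta_star
baseline_reward = np.sort(means)[-10]       # baseline: reward of the 10th best arm
noise = lambda: rng.normal()                # N(0, 1) observation noise
```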

In Figure 1, we plot the per-step regret (i.e., $R_t/t$) of LUCB and CLUCB for different values of α over a horizon T = 40,000. Figure 1 shows that the per-step regret of CLUCB remains constant at the beginning (the conservative phase), because during this phase, CLUCB follows the baseline policy to make sure that the performance constraint (3) is satisfied. As expected, the length of the conservative phase decreases as α is increased, since the performance constraint is relaxed for larger values of α, and hence CLUCB starts playing optimistic actions more quickly. After this initial conservative phase, CLUCB has learned enough about the optimal action, and its performance starts converging to that of LUCB. On the other hand, Figure 1 shows that the per-step regret of CLUCB in the first few periods remains much lower than that of LUCB. This is because LUCB plays agnostically to the safety constraint, and thus may select very poor actions in its initial exploration phase.

Figure 2: Percentage of the rounds, in the first 1,000 rounds, at which the safety constraint is violated by LUCB and CLUCB for different values of α.

In this regard, Figure 2 plots the percentage of the rounds, in the first 1,000 rounds, at which the safety constraint (3) is violated by LUCB and CLUCB for different values of α. According to this figure, CLUCB always satisfies the performance constraint for all the values of α, while LUCB fails in a significant number of rounds, especially for small values of α (i.e., a tight constraint). To better see the effect of the safety constraint on the regret of the algorithms, Figure 3 plots the per-step regret achieved by CLUCB at round t = 40,000 for different values of α, as well as that of LUCB. As expected from our analysis and as shown in Figure 1, the performance of CLUCB converges to that of LUCB after an initial conservative phase. Figure 3 confirms that this convergence happens more quickly for larger values of α, where the safety constraint is relaxed.

Figure 3: Per-step regret of LUCB and CLUCB for different values of α, at round t = 40,000.

6 Conclusion

In this paper, we studied the concept of safety in contextual linear bandits to address the challenges that arise in implementing such algorithms in practical situations such as ad recommendation systems. Most of the existing linear bandit algorithms, such as LUCB [1], suffer from a large regret in their initial exploratory rounds. This unsafe behavior is not acceptable in many practical situations, where having a reasonable performance at all times is necessary for a learning algorithm to be considered reliable and to remain in production. To guarantee safe learning, we formulated a conservative linear bandit problem, where the performance of the learning algorithm (measured in terms of its cumulative rewards) at any time is constrained to be at least as good as a fraction of the performance of a baseline policy. We proposed a conservative version of the LUCB algorithm, called CLUCB, to solve this constrained problem, and showed that it satisfies the safety constraint with high probability, while achieving a regret bound equivalent to that of LUCB up to an additive time-independent constant. We designed two versions of CLUCB that can be used depending on whether the reward function of the baseline policy is known or unknown, and showed that in each case, CLUCB acts conservatively (i.e., plays the action suggested by the baseline policy) only in a finite number of rounds, which depends on how suboptimal the baseline policy is. We reported simulation results that support our analysis and show the performance of the proposed CLUCB algorithm.

References

[1] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, 2011.

[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 2002.

[3] L. Bottou, J. Peters, J. Quiñonero-Candela, D. Charles, D. Chickering, E. Portugaly, D. Ray, P. Simard, and E. Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14, 2013.

[4] W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.

[5] V. Dani, T. Hayes, and S. Kakade. Stochastic linear optimization under bandit feedback. In COLT, 2008.

[6] N. Jiang and L. Li. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of the Thirty-Third International Conference on Machine Learning, 2016.

[7] M. Petrik, M. Ghavamzadeh, and Y. Chow. Safe policy improvement by minimizing robust baseline regret. In Advances in Neural Information Processing Systems, 2016.

[8] P. Rusmevichientong and J. Tsitsiklis. Linearly parameterized bandits. Mathematics of Operations Research, 35(2), 2010.

[9] D. Russo and B. Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4), 2014.

[10] A. Swaminathan and T. Joachims. Batch learning from logged bandit feedback through counterfactual risk minimization. Journal of Machine Learning Research, 16, 2015.

[11] A. Swaminathan and T. Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In Proceedings of the 32nd International Conference on Machine Learning, 2015.

[12] G. Theocharous, P. Thomas, and M. Ghavamzadeh. Building personal ad recommendation systems for life-time value optimization with guarantees. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.

[13] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence off-policy evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[14] P. Thomas, G. Theocharous, and M. Ghavamzadeh. High confidence policy improvement. In Proceedings of the Thirty-Second International Conference on Machine Learning, 2015.

[15] Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

A Technical Details of the Proofs of Theorems 5 and 7

In the proofs of Theorems 5 and 7, we use the following lemma to bound the RHS of (19) and (43).

Lemma 8. For any $\bar{m} \ge 2$ and $c_1, c_2, c_3 > 0$, the following holds:
$$-c_3 \bar{m} + c_1 \sqrt{\bar{m}} \log(c_2 \bar{m}) \le \frac{16 c_1^2}{9 c_3} \Bigg[ \log\Bigg( \frac{2 c_1 e \sqrt{c_2}}{c_3} \Bigg) \Bigg]^2. \qquad (26)$$

Proof. Define the LHS of (26) as a function $g(\bar{m})$, $\bar{m} \ge 2$, i.e., $g(\bar{m}) = -c_3 \bar{m} + c_1 \sqrt{\bar{m}} \log(c_2 \bar{m})$. First, note that we have
$$g'(\bar{m}) = -c_3 + \frac{c_1 \big( 2 + \log(c_2 \bar{m}) \big)}{2 \sqrt{\bar{m}}} \qquad \text{and} \qquad g''(\bar{m}) = -\frac{c_1 \log(c_2 \bar{m})}{4 \bar{m} \sqrt{\bar{m}}}.$$
This implies that, since $c_2 > 1$, g is a differentiable concave function over its domain $[2, \infty)$, and thus we can find $m^*$, the global maximum of the function g. The first-order condition $g'(m^*) = 0$ gives us
$$2 + \log(c_2 m^*) = \frac{2 c_3 \sqrt{m^*}}{c_1}. \qquad (27)$$
Plugging this into the definition of g, we obtain
$$g^* = \max_{\bar{m} \ge 2} g(\bar{m}) = g(m^*) = c_3 m^* - 2 c_1 \sqrt{m^*}.$$
Now, we use the change of variable $x = \frac{c_3 \sqrt{m^*}}{2 c_1}$, which, combined with the last display, gives us
$$g^* = \frac{4 c_1^2}{c_3} (x^2 - x). \qquad (28)$$
On the other hand, (27) becomes
$$\log\Big( \frac{4 c_1^2 c_2 e^2}{c_3^2} \Big) + 2 \log(x) = 4x.$$
Taking exp of both sides gives us
$$\frac{e^{4x}}{x^2} = \frac{4 c_1^2 c_2 e^2}{c_3^2}.$$
Now, since $x^2 \le e^x$ for all $x \ge 0$, it follows from the last display that
$$\frac{4 c_1^2 c_2 e^2}{c_3^2} = \frac{e^{4x}}{x^2} \ge \frac{e^{4x}}{e^x} = e^{3x},$$

which indicates that
$$x \le \frac{1}{3} \log\Big( \frac{4 c_1^2 c_2 e^2}{c_3^2} \Big).$$
Plugging this into (28) gives us
$$g^* \le \frac{4 c_1^2}{c_3} x^2 \le \frac{4 c_1^2}{9 c_3} \Bigg[ \log\Big( \frac{4 c_1^2 c_2 e^2}{c_3^2} \Big) \Bigg]^2 = \frac{16 c_1^2}{9 c_3} \Bigg[ \log\Bigg( \frac{2 c_1 e \sqrt{c_2}}{c_3} \Bigg) \Bigg]^2.$$
The statement follows from the fact that $g(\bar{m}) \le g^*$ for any $\bar{m} \ge 2$.

B Proof of Theorem 7

Proof. Suppose the confidence sets do not fail, which is true with probability at least $1-\delta$. Then, all the constraints are satisfied by CLUCB2, since it ensures that they hold for the worst parameter in the confidence set at every round. Similar to Proposition 2, we can decompose the regret of CLUCB2 as
$$R_T(\text{CLUCB2}) = \sum_{t \in S_T} \big( r^t_{a^*_t} - r^t_{a_t} \big) + \sum_{t \in S^c_T} \big( r^t_{a^*_t} - r^t_{b_t} \big) \le \sum_{t \in S_T} \big( r^t_{a^*_t} - r^t_{a_t} \big) + n_T \Delta_h,$$
where $n_T = |S^c_T|$ is the number of times CLUCB2 follows the baseline policy in T time steps. Now note that for $t \in S_T$, CLUCB2 follows the action suggested by LUCB, and hence
$$\sum_{t \in S_T} \big( r^t_{a^*_t} - r^t_{a_t} \big) \le R_{S_T}(\text{LUCB}) \le R_T(\text{LUCB}), \qquad (29)$$
where $R_{S_T}(\text{LUCB})$ denotes the regret of LUCB played at the time steps $t \in S_T$, which is upper-bounded by the regret of LUCB played at all T time steps. On the other hand, by Proposition 3, we have the following regret bound for LUCB:
$$R_T(\text{LUCB}) = O\Big( d \log\Big( \frac{D T}{\lambda \delta} \Big) \sqrt{T} \Big).$$
Thus, it follows that
$$R_T(\text{CLUCB2}) = O\Big( d \log\Big( \frac{D T}{\lambda \delta} \Big) \sqrt{T} + n_T \Delta_h \Big). \qquad (30)$$
Note that, according to (24), the confidence set $C_t$ that CLUCB2 uses to find the optimistic action at round t is built not only on the observations made by previously played optimistic actions, but also on the observations made when the baseline policy was followed at rounds before t. Therefore, the confidence set $C_t$ used by CLUCB2 at round

t would be tighter than the one LUCB would have had if it were applied only to the rounds in $S_t$. Hence, the first inequality in (29) still holds.

Given (30), we only need to show that CLUCB2 follows the baseline policy only at a finite number of rounds. Let $\tau$ be the last round at which CLUCB2 follows the baseline policy (plays conservatively), i.e., $\tau = \max\{ 1 \le t \le T : a_t = b_t \}$. At round $\tau$, let
$$L_\tau = \min_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1}} \phi^i_{a_i} + \phi^\tau_{a'_\tau} \Big\rangle + \alpha \max\Big\{ \min_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S^c_{\tau-1}} \phi^i_{b_i} \Big\rangle,\ n_{\tau-1} r_l \Big\},$$
which satisfies
$$L_\tau \ge \min_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1}} \phi^i_{a_i} + \phi^\tau_{a'_\tau} \Big\rangle + \alpha\, n_{\tau-1} r_l,$$
and
$$R_\tau = \max_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1} \cup \{\tau\}} \phi^i_{b_i} \Big\rangle.$$
From Algorithm 2 at round $\tau$, we have $L_\tau < (1-\alpha) R_\tau$, which with some simple algebra translates to
$$\alpha\, n_{\tau-1} r_l \le \max_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1} \cup \{\tau\}} \phi^i_{b_i} \Big\rangle - \min_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1}} \phi^i_{a_i} + \phi^\tau_{a'_\tau} \Big\rangle - \alpha \max_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1} \cup \{\tau\}} \phi^i_{b_i} \Big\rangle. \qquad (31)$$
The rest of the proof is devoted to using (31) to prove a time-independent upper bound on $n_{\tau-1}$. Unlike in the proof of Theorem 5, we rely here on the nestedness of the confidence sets built in (24). If the confidence sets do not fail (i.e., $\theta^* \in C_i$ for all i), then since $C_\tau \subseteq C_i$ for all $i \le \tau$, we have
$$\max_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1} \cup \{\tau\}} \phi^i_{b_i} \Big\rangle \le \sum_{i \in S_{\tau-1} \cup \{\tau\}} \max_{\theta \in C_i} \langle \theta, \phi^i_{b_i} \rangle. \qquad (32)$$
First, since the maximum is at least its value at $\theta^* \in C_\tau$ and $\langle \theta^*, \phi^i_{b_i} \rangle = r^i_{b_i} \ge r_l$, it follows that
$$-\alpha \max_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1} \cup \{\tau\}} \phi^i_{b_i} \Big\rangle \le -\alpha (m_{\tau-1}+1) r_l. \qquad (33)$$
On the other hand, by the definition of the optimistic action at round i, it follows that $\max_{\theta \in C_i} \langle \theta, \phi^i_{b_i} \rangle \le \max_{a \in A_i} \max_{\theta \in C_i} \langle \theta, \phi^i_a \rangle = \max_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle$ for $i \in S_{\tau-1}$ (and likewise $\le \max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle$ for $i = \tau$). Then, from (32), it follows that
$$\max_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1} \cup \{\tau\}} \phi^i_{b_i} \Big\rangle \le \sum_{i \in S_{\tau-1}} \max_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle + \max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle. \qquad (34)$$
Furthermore, since $C_\tau \subseteq C_i$ for all $i \le \tau$, we have
$$-\min_{\theta \in C_\tau} \Big\langle \theta, \sum_{i \in S_{\tau-1}} \phi^i_{a_i} + \phi^\tau_{a'_\tau} \Big\rangle \le -\sum_{i \in S_{\tau-1}} \min_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle - \min_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle. \qquad (35)$$

Combining (33), (34), and (35) with (31) gives
$$\alpha\, n_{\tau-1} r_l \le -\alpha (m_{\tau-1}+1) r_l + \sum_{i \in S_{\tau-1}} \Big[ \max_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle - \min_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle \Big] + \Big[ \max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle - \min_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle \Big]. \qquad (36)$$
Now, note that for any i, the reward of playing any action is in [0,1], and hence each bracketed term is at most 1. On the other hand, since the confidence sets do not fail, we have
$$\max_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle - \min_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle = \max_{\theta \in C_i} \langle \theta - \hat{\theta}_{i-1}, \phi^i_{a_i} \rangle - \min_{\theta \in C_i} \langle \theta - \hat{\theta}_{i-1}, \phi^i_{a_i} \rangle \le 2 \max_{\theta \in C_i} \|\theta - \hat{\theta}_{i-1}\|_{V_{i-1}} \|\phi^i_{a_i}\|_{V_{i-1}^{-1}} \le 2 \beta_i \|\phi^i_{a_i}\|_{V_{i-1}^{-1}}.$$
Hence, since the $\beta_i$ are non-decreasing and all larger than 1, it follows that
$$\max_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle - \min_{\theta \in C_i} \langle \theta, \phi^i_{a_i} \rangle \le 2 \beta_\tau \min\Big( 1,\ \|\phi^i_{a_i}\|_{V_{i-1}^{-1}} \Big). \qquad (37)$$
Similarly, we can show that
$$\max_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle - \min_{\theta \in C_\tau} \langle \theta, \phi^\tau_{a'_\tau} \rangle \le 2 \beta_\tau \min\Big( 1,\ \|\phi^\tau_{a'_\tau}\|_{V_{\tau-1}^{-1}} \Big). \qquad (38)$$
Substituting (37) and (38) into (36) gives
$$\alpha\, n_{\tau-1} r_l \le -\alpha (m_{\tau-1}+1) r_l + 2 \beta_\tau \Bigg[ \sum_{i \in S_{\tau-1}} \min\Big( 1,\ \|\phi^i_{a_i}\|_{V_{i-1}^{-1}} \Big) + \min\Big( 1,\ \|\phi^\tau_{a'_\tau}\|_{V_{\tau-1}^{-1}} \Big) \Bigg]. \qquad (39)$$
Bounding the RHS of (39) gives rise to another key difference between this proof and that of Theorem 5. Note that $V_i = \lambda I + \sum_{j \in S_i} \phi^j_{a_j} (\phi^j_{a_j})^\top + \sum_{j \in S^c_i} \phi^j_{b_j} (\phi^j_{b_j})^\top$ is built not only on the actions played at non-conservative rounds but also on the actions played at conservative rounds ($j \in S^c_i$), and hence Lemma 4 cannot be directly used to bound the RHS of (39). Instead, we define, for any i,
$$\tilde{V}_i = \lambda I + \sum_{j \in S_i} \phi^j_{a_j} \big( \phi^j_{a_j} \big)^\top, \qquad \text{which satisfies} \qquad V_i = \tilde{V}_i + \sum_{j \in S^c_i} \phi^j_{b_j} \big( \phi^j_{b_j} \big)^\top.$$
It therefore follows that
$$\|\phi^\tau_{a'_\tau}\|_{V_{\tau-1}^{-1}} \le \|\phi^\tau_{a'_\tau}\|_{\tilde{V}_{\tau-1}^{-1}}, \qquad \|\phi^i_{a_i}\|_{V_{i-1}^{-1}} \le \|\phi^i_{a_i}\|_{\tilde{V}_{i-1}^{-1}},$$
and hence, from (39), it follows that
$$\alpha\, n_{\tau-1} r_l \le -\alpha (m_{\tau-1}+1) r_l + 2 \beta_\tau \Bigg[ \sum_{i \in S_{\tau-1}} \min\Big( 1,\ \|\phi^i_{a_i}\|_{\tilde{V}_{i-1}^{-1}} \Big) + \min\Big( 1,\ \|\phi^\tau_{a'_\tau}\|_{\tilde{V}_{\tau-1}^{-1}} \Big) \Bigg]. \qquad (40)$$

Now, similar to the proof of Theorem 5, we define
$$\Gamma_\tau = \sum_{i \in S_{\tau-1}} \min\Big( 1,\ \|\phi^i_{a_i}\|^2_{\tilde{V}_{i-1}^{-1}} \Big) + \min\Big( 1,\ \|\phi^\tau_{a'_\tau}\|^2_{\tilde{V}_{\tau-1}^{-1}} \Big),$$
which by Lemma 4 satisfies
$$\Gamma_\tau \le 2 d \log\Big( 1 + \frac{(m_{\tau-1}+1) D^2}{\lambda d} \Big). \qquad (41)$$
On the other hand, for $n_{\tau-1} \ge 3$ (if this condition does not hold, it results in the simple bound $n_T \le 3$), we have
$$\beta_\tau = \sigma \sqrt{d \log\Big( \frac{1 + \tau D^2/\lambda}{\delta} \Big)} + B \sqrt{\lambda} \le (\sigma + B\sqrt{\lambda}) \sqrt{d \log\Big( \frac{1 + (m_{\tau-1}+1)\, n_{\tau-1} D^2/\lambda}{\delta} \Big)}, \qquad (42)$$
where we used $\tau = m_{\tau-1} + n_{\tau-1} + 1$. Using (41) and (42), and an application of the Cauchy-Schwarz inequality to (40), gives
$$\alpha\, n_{\tau-1} r_l \le -\alpha (m_{\tau-1}+1) r_l + 2 \beta_\tau \sqrt{(m_{\tau-1}+1)\, \Gamma_\tau} \le -\alpha (m_{\tau-1}+1) r_l + 2 \beta_\tau \sqrt{2 d (m_{\tau-1}+1) \log\Big( 1 + \frac{(m_{\tau-1}+1) D^2}{\lambda d} \Big)}$$
$$\le -\alpha (m_{\tau-1}+1) r_l + 3 d (B\sqrt{\lambda} + \sigma) \sqrt{m_{\tau-1}+1}\ \log\Big( \frac{2\, n_{\tau-1} D^2 (m_{\tau-1}+1)}{\lambda \delta} \Big). \qquad (43)$$
Note that, in contrast to the proof of Theorem 5, where only $m_{\tau-1}$ appeared on the RHS of (19), here both $n_{\tau-1}$ and $m_{\tau-1}$ appear on the RHS of (43). To bound $n_{\tau-1}$, we first provide an upper bound on the RHS of (43) in terms of $n_{\tau-1}$. For $\bar{m} = m_{\tau-1}+1$, $c_1 = 3 d (B\sqrt{\lambda}+\sigma)$, $c_2 = \frac{2\, n_{\tau-1} D^2}{\lambda \delta}$, and $c_3 = \alpha r_l$, Lemma 8 provides the following upper bound on the RHS (and hence the LHS) of (43):
$$\alpha\, n_{\tau-1} r_l \le \frac{16\, d^2 (B\sqrt{\lambda}+\sigma)^2}{\alpha r_l} \Bigg[ \log\Bigg( \frac{24\, d (B\sqrt{\lambda}+\sigma) D \sqrt{n_{\tau-1}}}{\sqrt{\lambda \delta}\, \alpha r_l} \Bigg) \Bigg]^2,$$
which is equivalent to
$$\sqrt{n_{\tau-1}} \le \frac{4\, d (B\sqrt{\lambda}+\sigma)}{\alpha r_l} \log\Bigg( \frac{24\, d (B\sqrt{\lambda}+\sigma) D \sqrt{n_{\tau-1}}}{\sqrt{\lambda \delta}\, \alpha r_l} \Bigg). \qquad (44)$$
Now, note that the LHS of (44) grows polynomially in $n_{\tau-1}$ while the RHS grows only logarithmically. Thus, this inequality can hold only for a finite range of $n_{\tau-1}$. Lemma 9, applied

with $x = \sqrt{n_{\tau-1}}$, $c_1 = \frac{4 d (B\sqrt{\lambda}+\sigma)}{\alpha r_l}$, and $c_2 = \frac{24\, d (B\sqrt{\lambda}+\sigma) D}{\sqrt{\lambda \delta}\, \alpha r_l}$, gives the following upper bound on $n_{\tau-1}$:
$$n_{\tau-1} \le \frac{256\, d^2 (B\sqrt{\lambda}+\sigma)^2}{\alpha^2 r_l^2} \Bigg[ \log\Bigg( \frac{10\, d (B\sqrt{\lambda}+\sigma) \sqrt{D}}{\alpha r_l\, (\lambda \delta)^{1/4}} \Bigg) \Bigg]^2.$$
Therefore, CLUCB2 follows the baseline policy only at
$$n_T = n_\tau = n_{\tau-1} + 1 \le \frac{256\, d^2 (B\sqrt{\lambda}+\sigma)^2}{\alpha^2 r_l^2} \Bigg[ \log\Bigg( \frac{10\, d (B\sqrt{\lambda}+\sigma) \sqrt{D}}{\alpha r_l\, (\lambda \delta)^{1/4}} \Bigg) \Bigg]^2 + 1$$
rounds, and hence, according to (30), achieves the regret bound
$$R_T(\text{CLUCB2}) = O\Bigg( d \log\Big( \frac{D T}{\lambda \delta} \Big) \sqrt{T} + \frac{K\, \Delta_h}{\alpha^2 r_l^2} \Bigg), \qquad \text{where} \quad K = 256\, d^2 (B\sqrt{\lambda}+\sigma)^2 \Bigg[ \log\Bigg( \frac{10\, d (B\sqrt{\lambda}+\sigma) \sqrt{D}}{\alpha r_l\, (\lambda \delta)^{1/4}} \Bigg) \Bigg]^2 + 1.$$

C Technical Detail Used in the Proof of Theorem 7

We used the following lemma in the proof of Theorem 7.

Lemma 9. Let $c_1$ and $c_2$ be two positive constants such that $\log(c_1 c_2) \ge 1$. Then, any $x > 0$ satisfying $x \le c_1 \log(c_2 x)$ also satisfies $x \le 2 c_1 \log(c_1 c_2)$.

Proof. Assume that $x \le c_1 \log(c_2 x)$ holds. Define $a = \frac{1}{c_1 c_2}$ and change the variable to $z = c_2 x$. Then, we have $a z \le \log(z)$. Let $q(z) = a z$ and $l(z) = \log(z)$, and define $z^* = \frac{2}{a} \log\big( \frac{1}{a} \big)$. First, since $\log 2 + \log\log\frac{1}{a} \le \log\frac{1}{a}$ whenever $\log\frac{1}{a} \ge 1$, we have
$$l(z^*) = \log\Big( \frac{2}{a} \log\frac{1}{a} \Big) = \log 2 + \log\frac{1}{a} + \log\log\frac{1}{a} \le 2 \log\frac{1}{a} = q(z^*),$$
i.e.,
$$q(z^*) \ge l(z^*). \qquad (45)$$
Furthermore, since $\log(1/a) \ge 1$, for any $z \ge z^*$ we have
$$q'(z) = a \ge \frac{a}{2 \log(1/a)} = l'(z^*) \ge l'(z). \qquad (46)$$
Thus, it follows from (45) and (46) that $q(z) \ge l(z)$ for all $z \ge z^*$, and so $a z \le \log(z)$ is possible only for $z \le z^*$. Replacing the definitions of a, z, and $z^*$, we deduce that $x \le c_1 \log(c_2 x)$ is possible only if $x \le 2 c_1 \log(c_1 c_2)$.


Outline. Communication. Bellman Ford Algorithm. Bellman Ford Example. Bellman Ford Shortest Path [1] DYNAMIC SHORTEST PATH SEARCH AND SYNCHRONIZED TASK SWITCHING Jay Wagenpfel, Adran Trachte 2 Outlne Shortest Communcaton Path Searchng Bellmann Ford algorthm Algorthm for dynamc case Modfcatons to our algorthm

More information

Erratum: A Generalized Path Integral Control Approach to Reinforcement Learning

Erratum: A Generalized Path Integral Control Approach to Reinforcement Learning Journal of Machne Learnng Research 00-9 Submtted /0; Publshed 7/ Erratum: A Generalzed Path Integral Control Approach to Renforcement Learnng Evangelos ATheodorou Jonas Buchl Stefan Schaal Department of

More information

Finding Primitive Roots Pseudo-Deterministically

Finding Primitive Roots Pseudo-Deterministically Electronc Colloquum on Computatonal Complexty, Report No 207 (205) Fndng Prmtve Roots Pseudo-Determnstcally Ofer Grossman December 22, 205 Abstract Pseudo-determnstc algorthms are randomzed search algorthms

More information

Lecture 10 Support Vector Machines. Oct

Lecture 10 Support Vector Machines. Oct Lecture 10 Support Vector Machnes Oct - 20-2008 Lnear Separators Whch of the lnear separators s optmal? Concept of Margn Recall that n Perceptron, we learned that the convergence rate of the Perceptron

More information

A note on almost sure behavior of randomly weighted sums of φ-mixing random variables with φ-mixing weights

A note on almost sure behavior of randomly weighted sums of φ-mixing random variables with φ-mixing weights ACTA ET COMMENTATIONES UNIVERSITATIS TARTUENSIS DE MATHEMATICA Volume 7, Number 2, December 203 Avalable onlne at http://acutm.math.ut.ee A note on almost sure behavor of randomly weghted sums of φ-mxng

More information

On the correction of the h-index for career length

On the correction of the h-index for career length 1 On the correcton of the h-ndex for career length by L. Egghe Unverstet Hasselt (UHasselt), Campus Depenbeek, Agoralaan, B-3590 Depenbeek, Belgum 1 and Unverstet Antwerpen (UA), IBW, Stadscampus, Venusstraat

More information

Complete subgraphs in multipartite graphs

Complete subgraphs in multipartite graphs Complete subgraphs n multpartte graphs FLORIAN PFENDER Unverstät Rostock, Insttut für Mathematk D-18057 Rostock, Germany Floran.Pfender@un-rostock.de Abstract Turán s Theorem states that every graph G

More information

Computing Correlated Equilibria in Multi-Player Games

Computing Correlated Equilibria in Multi-Player Games Computng Correlated Equlbra n Mult-Player Games Chrstos H. Papadmtrou Presented by Zhanxang Huang December 7th, 2005 1 The Author Dr. Chrstos H. Papadmtrou CS professor at UC Berkley (taught at Harvard,

More information

Stochastic Multi-armed Bandits in Constant Space

Stochastic Multi-armed Bandits in Constant Space Davd Lau davdlau@utexas.edu Erc Prce Zhao Song ecprce@cs.utexas.edu zhaos@utexas.edu The Unversty of Texas at Austn Ger Yang geryang@utexas.edu Abstract We consder the stochastc bandt problem n the sublnear

More information

Power law and dimension of the maximum value for belief distribution with the max Deng entropy

Power law and dimension of the maximum value for belief distribution with the max Deng entropy Power law and dmenson of the maxmum value for belef dstrbuton wth the max Deng entropy Bngy Kang a, a College of Informaton Engneerng, Northwest A&F Unversty, Yanglng, Shaanx, 712100, Chna. Abstract Deng

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Some modelling aspects for the Matlab implementation of MMA

Some modelling aspects for the Matlab implementation of MMA Some modellng aspects for the Matlab mplementaton of MMA Krster Svanberg krlle@math.kth.se Optmzaton and Systems Theory Department of Mathematcs KTH, SE 10044 Stockholm September 2004 1. Consdered optmzaton

More information

Société de Calcul Mathématique SA

Société de Calcul Mathématique SA Socété de Calcul Mathématque SA Outls d'ade à la décson Tools for decson help Probablstc Studes: Normalzng the Hstograms Bernard Beauzamy December, 202 I. General constructon of the hstogram Any probablstc

More information

Online Appendix. t=1 (p t w)q t. Then the first order condition shows that

Online Appendix. t=1 (p t w)q t. Then the first order condition shows that Artcle forthcomng to ; manuscrpt no (Please, provde the manuscrpt number!) 1 Onlne Appendx Appendx E: Proofs Proof of Proposton 1 Frst we derve the equlbrum when the manufacturer does not vertcally ntegrate

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0

n α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0 MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector

More information

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals

Simultaneous Optimization of Berth Allocation, Quay Crane Assignment and Quay Crane Scheduling Problems in Container Terminals Smultaneous Optmzaton of Berth Allocaton, Quay Crane Assgnment and Quay Crane Schedulng Problems n Contaner Termnals Necat Aras, Yavuz Türkoğulları, Z. Caner Taşkın, Kuban Altınel Abstract In ths work,

More information

ECE559VV Project Report

ECE559VV Project Report ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate

More information

Lecture 10: May 6, 2013

Lecture 10: May 6, 2013 TTIC/CMSC 31150 Mathematcal Toolkt Sprng 013 Madhur Tulsan Lecture 10: May 6, 013 Scrbe: Wenje Luo In today s lecture, we manly talked about random walk on graphs and ntroduce the concept of graph expander,

More information

Game Theory. Lecture Notes By Y. Narahari. Department of Computer Science and Automation Indian Institute of Science Bangalore, India February 2008

Game Theory. Lecture Notes By Y. Narahari. Department of Computer Science and Automation Indian Institute of Science Bangalore, India February 2008 Game Theory Lecture Notes By Y. Narahar Department of Computer Scence and Automaton Indan Insttute of Scence Bangalore, Inda February 2008 Chapter 10: Two Person Zero Sum Games Note: Ths s a only a draft

More information

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,

More information

A new Approach for Solving Linear Ordinary Differential Equations

A new Approach for Solving Linear Ordinary Differential Equations , ISSN 974-57X (Onlne), ISSN 974-5718 (Prnt), Vol. ; Issue No. 1; Year 14, Copyrght 13-14 by CESER PUBLICATIONS A new Approach for Solvng Lnear Ordnary Dfferental Equatons Fawz Abdelwahd Department of

More information

BOUNDEDNESS OF THE RIESZ TRANSFORM WITH MATRIX A 2 WEIGHTS

BOUNDEDNESS OF THE RIESZ TRANSFORM WITH MATRIX A 2 WEIGHTS BOUNDEDNESS OF THE IESZ TANSFOM WITH MATIX A WEIGHTS Introducton Let L = L ( n, be the functon space wth norm (ˆ f L = f(x C dx d < For a d d matrx valued functon W : wth W (x postve sem-defnte for all

More information

Economics 101. Lecture 4 - Equilibrium and Efficiency

Economics 101. Lecture 4 - Equilibrium and Efficiency Economcs 0 Lecture 4 - Equlbrum and Effcency Intro As dscussed n the prevous lecture, we wll now move from an envronment where we looed at consumers mang decsons n solaton to analyzng economes full of

More information

Semi-supervised Classification with Active Query Selection

Semi-supervised Classification with Active Query Selection Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples

More information

EEE 241: Linear Systems

EEE 241: Linear Systems EEE : Lnear Systems Summary #: Backpropagaton BACKPROPAGATION The perceptron rule as well as the Wdrow Hoff learnng were desgned to tran sngle layer networks. They suffer from the same dsadvantage: they

More information

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011

Stanford University CS359G: Graph Partitioning and Expanders Handout 4 Luca Trevisan January 13, 2011 Stanford Unversty CS359G: Graph Parttonng and Expanders Handout 4 Luca Trevsan January 3, 0 Lecture 4 In whch we prove the dffcult drecton of Cheeger s nequalty. As n the past lectures, consder an undrected

More information

Lecture 17 : Stochastic Processes II

Lecture 17 : Stochastic Processes II : Stochastc Processes II 1 Contnuous-tme stochastc process So far we have studed dscrete-tme stochastc processes. We studed the concept of Makov chans and martngales, tme seres analyss, and regresson analyss

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

Fuzzy Boundaries of Sample Selection Model

Fuzzy Boundaries of Sample Selection Model Proceedngs of the 9th WSES Internatonal Conference on ppled Mathematcs, Istanbul, Turkey, May 7-9, 006 (pp309-34) Fuzzy Boundares of Sample Selecton Model L. MUHMD SFIIH, NTON BDULBSH KMIL, M. T. BU OSMN

More information

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin

LOW BIAS INTEGRATED PATH ESTIMATORS. James M. Calvin Proceedngs of the 007 Wnter Smulaton Conference S G Henderson, B Bller, M-H Hseh, J Shortle, J D Tew, and R R Barton, eds LOW BIAS INTEGRATED PATH ESTIMATORS James M Calvn Department of Computer Scence

More information

Welfare Properties of General Equilibrium. What can be said about optimality properties of resource allocation implied by general equilibrium?

Welfare Properties of General Equilibrium. What can be said about optimality properties of resource allocation implied by general equilibrium? APPLIED WELFARE ECONOMICS AND POLICY ANALYSIS Welfare Propertes of General Equlbrum What can be sad about optmalty propertes of resource allocaton mpled by general equlbrum? Any crteron used to compare

More information

CSC 411 / CSC D11 / CSC C11

CSC 411 / CSC D11 / CSC C11 18 Boostng s a general strategy for learnng classfers by combnng smpler ones. The dea of boostng s to take a weak classfer that s, any classfer that wll do at least slghtly better than chance and use t

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION

ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EQUATION Advanced Mathematcal Models & Applcatons Vol.3, No.3, 2018, pp.215-222 ON A DETERMINATION OF THE INITIAL FUNCTIONS FROM THE OBSERVED VALUES OF THE BOUNDARY FUNCTIONS FOR THE SECOND-ORDER HYPERBOLIC EUATION

More information

GeNGA: A Generalization of Natural Gradient Ascent with Positive and Negative Convergence Results

GeNGA: A Generalization of Natural Gradient Ascent with Positive and Negative Convergence Results : A Generalzaton of Natural Gradent Ascent wth Postve and Negatve Convergence Results Phlp S. Thomas School of Computer Scence, Unversty of Massachusetts, Amherst, MA 13 USA PTHOMAS@CS.UMASS.EDU Abstract

More information

Physics 5153 Classical Mechanics. D Alembert s Principle and The Lagrangian-1

Physics 5153 Classical Mechanics. D Alembert s Principle and The Lagrangian-1 P. Guterrez Physcs 5153 Classcal Mechancs D Alembert s Prncple and The Lagrangan 1 Introducton The prncple of vrtual work provdes a method of solvng problems of statc equlbrum wthout havng to consder the

More information

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:

More information

Linear, affine, and convex sets and hulls In the sequel, unless otherwise specified, X will denote a real vector space.

Linear, affine, and convex sets and hulls In the sequel, unless otherwise specified, X will denote a real vector space. Lnear, affne, and convex sets and hulls In the sequel, unless otherwse specfed, X wll denote a real vector space. Lnes and segments. Gven two ponts x, y X, we defne xy = {x + t(y x) : t R} = {(1 t)x +

More information

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty

Additional Codes using Finite Difference Method. 1 HJB Equation for Consumption-Saving Problem Without Uncertainty Addtonal Codes usng Fnte Dfference Method Benamn Moll 1 HJB Equaton for Consumpton-Savng Problem Wthout Uncertanty Before consderng the case wth stochastc ncome n http://www.prnceton.edu/~moll/ HACTproect/HACT_Numercal_Appendx.pdf,

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem.

Lecture 20: Lift and Project, SDP Duality. Today we will study the Lift and Project method. Then we will prove the SDP duality theorem. prnceton u. sp 02 cos 598B: algorthms and complexty Lecture 20: Lft and Project, SDP Dualty Lecturer: Sanjeev Arora Scrbe:Yury Makarychev Today we wll study the Lft and Project method. Then we wll prove

More information

The Experts/Multiplicative Weights Algorithm and Applications

The Experts/Multiplicative Weights Algorithm and Applications Chapter 2 he Experts/Multplcatve Weghts Algorthm and Applcatons We turn to the problem of onlne learnng, and analyze a very powerful and versatle algorthm called the multplcatve weghts update algorthm.

More information

Portfolios with Trading Constraints and Payout Restrictions

Portfolios with Trading Constraints and Payout Restrictions Portfolos wth Tradng Constrants and Payout Restrctons John R. Brge Northwestern Unversty (ont wor wth Chrs Donohue Xaodong Xu and Gongyun Zhao) 1 General Problem (Very) long-term nvestor (eample: unversty

More information