Online learning with feedback graphs and switching costs


A Proof of Theorem 1

Proof. Without loss of generality, let the independent sequence set $I(G_{1:T})$ be formed of the actions (or arms) $1$ to $\beta(G_{1:T})$. Given the sequence of feedback graphs $G_{1:T}$, let $T_i$ be the number of times the action $i \in I(G_{1:T})$ is selected by the player in $T$ rounds, and let $\tilde{T}$ be the total number of times actions are selected from the set $K \setminus I(G_{1:T})$. Let $E_i$ denote expectation conditioned on the event $\{X = i\}$, where $X$ is the index of the best arm, and let $P_i$ denote the corresponding conditional probability. Additionally, we define $P_0$ as the probability conditioned on the event $\mathcal{E}_0$ that no arm receives the advantage $\epsilon_1$. Therefore, under $P_0$ all the actions in the independent sequence set, i.e. $i \in I(G_{1:T})$, incur an expected loss of $1/2$, whereas the expected loss of the actions $i \in K \setminus I(G_{1:T})$ is $1/2 + \epsilon_2$. Let $E_0$ be the corresponding conditional expectation. For all $i \in K$ and $t \le T$, $l_t(i)$ and $l^c_t(i)$ denote the unclipped and clipped loss of the action $i$, respectively. Assuming the unclipped losses are observed by the player, $\mathcal{F}$ is the sigma field generated by the unclipped losses, and $S_t(i_t)$ is the set of actions whose losses are observed at time $t$ following the selection of $i_t$, according to the feedback graph $G_t$. The observed sequence of unclipped losses is referred to as $l^o_{1:T}$. Additionally, $\mathcal{F}^c$ is the sigma field generated by the clipped losses $l^c_t(i)$, for all $t \le T$ and $i \in S_t(i_t)$, and the observed sequence of clipped losses is referred to as $l^{c,o}_{1:T}$. By definition, $\mathcal{F}^c \subseteq \mathcal{F}$.

Let $i_1, \ldots, i_T$ be the sequence of actions selected by the player over the time horizon $T$. Then the regret $R^c$ of the player corresponding to the clipped losses is

$$R^c = \sum_{t=1}^{T} l^c_t(i_t) + c\,M_s - \min_{i \in K} \sum_{t=1}^{T} l^c_t(i) \qquad (1)$$

where $M_s$ is the number of switches in the action selection sequence $i_1, \ldots, i_T$ and $c$ is the cost of each switch in action. We also define the regret $R$ corresponding to the unclipped loss function in Algorithm 1 as

$$R = \sum_{t=1}^{T} l_t(i_t) + c\,M_s - \min_{i \in K} \sum_{t=1}^{T} l_t(i). \qquad (2)$$

Using (Dekel et al., 2014, Lemma 4) we have

$$P\big(\text{for all } t \le T:\ 1/2 + W_t \in [1/6,\, 5/6]\big) \ge 5/6. \qquad (3)$$

Moreover, for all sufficiently large $T$ (in particular $T > 6$) we have $\epsilon_1 + \epsilon_2 < 1/6$. If the event $B = \{\text{for all } t \le T:\ 1/2 + W_t \in [1/6, 5/6]\}$ occurs and $\epsilon_1 + \epsilon_2 < 1/6$, then for all $i \in K$, $l^c_t(i) = l_t(i)$, which implies $R^c = R$ (see (1) and (2)). If the event $B$ does not occur, then the clipped and unclipped losses can differ, and the regrets satisfy $R - R^c \le (\epsilon_1 + \epsilon_2)\,T$. Therefore, for all sufficiently large $T$,

$$E R - E R^c \le \big(1 - P(B)\big)\, E\big[R - R^c \mid B \text{ does not occur}\big] \le \frac{(\epsilon_1 + \epsilon_2)\, T}{6}. \qquad (4)$$
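For concreteness, the switching-cost regrets in (1) and (2) are straightforward to compute from a play sequence. The following is a small illustrative sketch (the function and variable names are ours, not the paper's):

```python
# Illustrative sketch only: computing R = sum_t l_t(i_t) + c*M_s - min_i sum_t l_t(i)
# as in (2), for a given loss matrix and play sequence.
import numpy as np

def switching_cost_regret(losses, plays, switch_cost):
    """losses: T x K array of losses l_t(i); plays: length-T list of selected arms."""
    T, K = losses.shape
    incurred = sum(losses[t, plays[t]] for t in range(T))
    # M_s: number of switches in the action sequence i_1, ..., i_T
    num_switches = sum(1 for t in range(1, T) if plays[t] != plays[t - 1])
    best_fixed = losses.sum(axis=0).min()   # min_i sum_t l_t(i)
    return incurred + switch_cost * num_switches - best_fixed

# Example: 2 arms, 5 rounds, unit switching cost
rng = np.random.default_rng(0)
losses = rng.random((5, 2))
print(switching_cost_regret(losses, [0, 1, 1, 0, 0], switch_cost=1.0))
```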

Thus, (4) lower bounds the actual regret $R^c$ in terms of the regret $R$. We now derive a lower bound on the regret $R$ corresponding to the unclipped losses. Using the definition of $R$ we have

$$E R \ge \max_{i \in I(G_{1:T})} E_i\Big[\sum_{t=1}^{T} l_t(i_t) + c\,M_s - \min_{j}\sum_{t=1}^{T} l_t(j)\Big] \ge \frac{1}{\beta(G_{1:T})}\sum_{i \in I(G_{1:T})} E_i\Big[\epsilon_1 \sum_{j \in I(G_{1:T})\setminus\{i\}} T_j + (\epsilon_1+\epsilon_2)\,\tilde{T} + c\,M_s\Big]$$
$$\stackrel{(a)}{=} \frac{1}{\beta(G_{1:T})}\sum_{i \in I(G_{1:T})}\Big(\epsilon_1\,(T - E_i T_i) + \epsilon_2\,E_i\tilde{T} + c\,E_i M_s\Big) \stackrel{(b)}{\ge} \epsilon_1 T - \frac{\epsilon_1}{\beta(G_{1:T})}\sum_{i \in I(G_{1:T})} E_i T_i + c\,E M_s \qquad (5)$$

where $E$ denotes expectation when the best arm is drawn uniformly at random from $I(G_{1:T})$ (so that $E M_s = \frac{1}{\beta(G_{1:T})}\sum_i E_i M_s$), $(a)$ follows from $\sum_j T_j + \tilde{T} = T$, and $(b)$ follows from $\epsilon_2 \tilde{T} \ge 0$. Now we upper bound the term $E_i T_i$ in (5) to obtain the lower bound on the expected regret $E R$. Since the player is deterministic, the event $\{i_t = i\}$ is $\mathcal{F}^c$-measurable. Therefore we have

$$P_i(i_t = i) - P_0(i_t = i) \le d^{\mathcal{F}^c}_{TV}(P_0, P_i) \stackrel{(a)}{\le} d^{\mathcal{F}}_{TV}(P_0, P_i),$$

where $d^{\mathcal{F}}_{TV}(P_0, P_i) = \sup_{A \in \mathcal{F}} |P_0(A) - P_i(A)|$ is the total variation distance between the two probability measures and $(a)$ follows from $\mathcal{F}^c \subseteq \mathcal{F}$. Summing the above inequality over $t \le T$ and $i \in I(G_{1:T})$ yields

$$\sum_{i \in I(G_{1:T})} \big(E_i T_i - E_0 T_i\big) \le T \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i).$$

Rearranging the above equation and using $\sum_i E_0 T_i = E_0 \sum_i T_i \le T$, we get

$$\sum_{i \in I(G_{1:T})} E_i T_i \le T \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i) + T.$$

Combining the above equation with (5) we get

$$E R \ge \epsilon_1 T - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i) - \frac{\epsilon_1 T}{\beta(G_{1:T})} + c\,E M_s \stackrel{(a)}{\ge} \frac{\epsilon_1 T}{2} - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i) + c\,E M_s \qquad (6)$$

where $(a)$ uses the fact that $\beta(G_{1:T}) \ge 2$. Next we upper bound the second term on the right hand side of (6). Using Pinsker's inequality we have

$$d^{\mathcal{F}}_{TV}(P_0, P_i) \le \sqrt{\tfrac{1}{2}\, D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big)} \qquad (7)$$

where $l^o_{1:T}$ are the losses observed by the player over the time horizon $T$. Using the chain rule of relative entropy to decompose $D_{KL}(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T}))$, we get

$$D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big) = \sum_{t=1}^{T} D_{KL}\big(P_0(l^o_t \mid l^o_{1:t-1}) \,\|\, P_i(l^o_t \mid l^o_{1:t-1})\big) = \sum_{t=1}^{T} D_{KL}\big(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)})\big) \qquad (8)$$

where $\rho^*(t)$ is the set of time instances $0 \le k \le t$ encountered when the operation $\rho(\cdot)$ in Algorithm 1 is applied recursively to $t$. Now we deal with each term $D_{KL}(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)}))$ in the summation individually. For $i \in I(G_{1:T})$, we separate this computation into four cases: (1) $i, t$ such that the loss of action $i$ is observed at both time instances $t$ and $\rho(t)$, i.e. $i \in S_t(i_t)$ and $i \in S_{\rho(t)}(i_{\rho(t)})$; (2) $i, t$ such that the loss of action $i$ is observed at time instance $t$ but not at time instance $\rho(t)$, i.e. $i \in S_t(i_t)$ and $i \notin S_{\rho(t)}(i_{\rho(t)})$; (3) $i, t$ such that the loss of action $i$ is not observed at time instance $t$ but is observed at time instance $\rho(t)$, i.e. $i \notin S_t(i_t)$ and $i \in S_{\rho(t)}(i_{\rho(t)})$; (4) $i, t$ such that the loss of action $i$ is observed at neither of the time instances $t$ and $\rho(t)$, i.e. $i \notin S_t(i_t)$ and $i \notin S_{\rho(t)}(i_{\rho(t)})$. Note that at a single time instance the loss of only one single action from the $I(G_{1:T})$ arms can be observed.

Case 1: Since the loss of action $i$ is observed from the independent sequence set $I(G_{1:T})$ at both time instances, the loss distribution for the action $i$ is $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i), \sigma^2)$ under both $P_0$ and $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \epsilon_1 + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 2: Since the loss of action $i$ is observed from the independent sequence set $I(G_{1:T})$ at time instance $t$ but not at $\rho(t)$, there exists an action $k' \in I(G_{1:T}) \setminus \{i\}$ from the independent sequence set which was observed at time instance $\rho(t)$. Then the loss distribution for the action $i$ is $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k'), \sigma^2)$ under $P_0$ and $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') - \epsilon_1, \sigma^2)$ under $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 3: Since the action $i$ is observed from the independent sequence set $I(G_{1:T})$ at time instance $\rho(t)$ but not at $t$, there exists an action $k' \in I(G_{1:T}) \setminus \{i\}$ from the independent sequence set which was observed at time instance $t$. Then the loss distribution for the arm $k'$ is $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i), \sigma^2)$ under $P_0$ and $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \epsilon_1, \sigma^2)$ under $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \epsilon_1 + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 4: Let $k'$ be the arm from the independent sequence set observed at time instance $\rho(t)$. Since the arm $i$ is not observed from the independent sequence set $I(G_{1:T})$ at the time instances $t$ and $\rho(t)$, the loss distribution for all arms $k'' \in I(G_{1:T}) \setminus \{i\}$ is $l^o_t(k'') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k'), \sigma^2)$ under both $P_0$ and $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Therefore we have

$$D_{KL}\big(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)})\big) \le P_0\big(i \in S_t(i_t),\, i \notin S_{\rho(t)}(i_{\rho(t)})\big)\, D_{KL}\big(\mathcal{N}(0,\sigma^2) \,\|\, \mathcal{N}(-\epsilon_1,\sigma^2)\big) + P_0\big(i \notin S_t(i_t),\, i \in S_{\rho(t)}(i_{\rho(t)})\big)\, D_{KL}\big(\mathcal{N}(0,\sigma^2) \,\|\, \mathcal{N}(\epsilon_1,\sigma^2)\big) = \frac{\epsilon_1^2}{2\sigma^2}\, P_0(B_t) \qquad (9)$$

where $B_t = \{i \in S_t(i_t),\, i \notin S_{\rho(t)}(i_{\rho(t)})\} \cup \{i \notin S_t(i_t),\, i \in S_{\rho(t)}(i_{\rho(t)})\}$. The event $B_t$ implies that the player has switched at least once between feedback systems $S_t(k_1)$ and $S_{\rho(t)}(k_2)$ such that $i \in S_t(k_1)$ but $i \notin S_{\rho(t)}(k_2)$, or vice versa.
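The equality in (9) uses the standard formula for the Kullback-Leibler divergence between two Gaussians with a common variance; for completeness (this is a textbook identity, not specific to the paper),

$$D_{KL}\big(\mathcal{N}(\mu_0,\sigma^2)\,\|\,\mathcal{N}(\mu_1,\sigma^2)\big) = E_{x \sim \mathcal{N}(\mu_0,\sigma^2)}\Big[\frac{(x-\mu_1)^2 - (x-\mu_0)^2}{2\sigma^2}\Big] = \frac{(\mu_0-\mu_1)^2}{2\sigma^2},$$

so with $|\mu_0 - \mu_1| = \epsilon_1$ both divergences appearing in (9) equal $\epsilon_1^2/(2\sigma^2)$.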
Let $N_i$ be the number of times the player switches from a feedback system which includes $i$ to a feedback system which does not include $i$, or vice versa. Then, using (8) and (9), we have

$$D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big) \le \frac{\epsilon_1^2\, \omega(\rho)}{2\sigma^2}\, E_0 N_i \qquad (10)$$

where $\omega(\rho)$ is the width of the process $\rho(\cdot)$ (see Definition 2 in Dekel et al. (2014)) and is bounded above by $2\log_2(T)$. Combining (7) and (10) we have

$$d^{\mathcal{F}}_{TV}(P_0, P_i) = \sup_{A \in \mathcal{F}} |P_0(A) - P_i(A)| \le \frac{\epsilon_1}{\sigma} \sqrt{\log_2(T)\, E_0 N_i}. \qquad (11)$$

If $E M_s \ge \epsilon_1 T$ then $E R > c\,\epsilon_1 T$, and the claimed lower bound follows. So let us assume $E M_s \le \epsilon_1 T$. For all $i \in I(G_{1:T})$ we have

$$E_0 M_s - E_i M_s = \sum_{m \ge 1} \big(P_0(M_s \ge m) - P_i(M_s \ge m)\big) \ge -\epsilon_1 T\, d^{\mathcal{F}}_{TV}(P_0, P_i). \qquad (12)$$

Using the above equation we have

$$E_0 M_s \ge E M_s - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i). \qquad (13)$$

Now, combining (4), (6), (11) and (13) we obtain

$$E R^c \ge \frac{\epsilon_1 T}{6} - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} \frac{\epsilon_1}{\sigma} \sqrt{\log_2(T)\, E_0 N_i} + c\, E_0 M_s \stackrel{(a)}{\ge} \frac{\epsilon_1 T}{6} - \frac{\epsilon_1^2 T}{\sigma} \sqrt{\frac{2 \log_2(T)\, E_0 M_s}{\beta(G_{1:T})}} + c\, E_0 M_s$$
$$\stackrel{(b)}{\ge} \frac{c^{1/3}\, \beta^{1/3}(G_{1:T})\, T^{2/3}}{54 \log_2(T)} - \frac{c^{1/3}\, \beta^{1/3}(G_{1:T})\, T^{2/3}}{162 \log_2(T)} = \frac{c^{1/3}\, \beta^{1/3}(G_{1:T})\, T^{2/3}}{81 \log_2(T)} \qquad (14)$$

where $(a)$ follows from the concavity of $\sqrt{x}$ together with $\sum_i N_i \le 2 M_s$, and $(b)$ follows from the fact that the right hand side is minimized at $E_0 M_s = \epsilon_1^2\, T \sqrt{\log_2(T)}/(2 c \sigma)$. The claim of the theorem now follows.

B Proof of Lemma 2

Proof. Recall that $\beta(G_{1:T})$ is the cardinality of $I(G_{1:T})$, and let the actions $1, 2, \ldots, \beta(G_{1:T})$ belong to the set $I(G_{1:T})$. The adversary selects an action uniformly at random from the set $I(G_{1:T})$, say $j$, and assigns the loss sequence to action $j$ using independent Bernoulli random variables with parameter $0.5 - \epsilon$, where $\epsilon = \sqrt{\beta(G_{1:T})/T}$. For all $i \in I(G_{1:T}) \setminus \{j\}$, losses are assigned using independent Bernoulli random variables with parameter $0.5$. For all $i \notin I(G_{1:T})$, the losses are assigned using independent Bernoulli random variables with parameter $1$. The proof of the lemma follows along the same lines as Theorem 5 in Alon et al. (2017).

C Proof of Theorem 3

Proof. The proof of this theorem uses the result of Theorem 1. Since the loss sequence is assigned independently to each sub-sequence $U_m$, $1 \le m \le M$, using Theorem 1 there exists a constant $b_m$ such that

$$E\Big[\sum_{t=1}^{T} l_t(i_t)\,\mathbb{1}(G_t \in U_m) + c\,W_m - \min_{i \in U_m} \sum_{t=1}^{T} l_t(i)\,\mathbb{1}(G_t \in U_m)\Big] \ge b_m\, c^{1/3}\, \beta(U_m)^{1/3}\, N(U_m)^{2/3} / \log(T) \qquad (15)$$

where $W_m$ is the number of switches performed within the sub-sequence $U_m$. Since $\sum_{m=1}^{M} W_m \le \sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})$, there exists a constant $b$ such that the expected regret of any algorithm $\mathcal{A}$ is at least $b \sum_{m=1}^{M} c^{1/3}\, \beta(U_m)^{1/3}\, N(U_m)^{2/3} / \log T$.

D Proof of Lemma 4

Proof. The proof is by contradiction and follows along the same lines as the proof of Theorem 4 in Dekel et al. (2014). Suppose $\mathcal{A}$ performs at most $\tilde{O}\big((\beta(G_{1:T})^{1/2}\, T)^{\alpha}\big)$ switches for any sequence of loss functions over $T$ rounds, with $\beta + \alpha/2 < 1$. Then there exists a real number $\gamma$ such that $\beta < \gamma < 1 - \alpha/2$. Assign $c = (\beta(G_{1:T})^{1/2}\, T)^{3\gamma - 2}$. Then the expected regret of the algorithm, including the switching cost, is $\tilde{O}\big((\beta(G_{1:T})^{1/2} T)^{\beta} + (\beta(G_{1:T})^{1/2} T)^{3\gamma - 2} (\beta(G_{1:T})^{1/2} T)^{\alpha}\big) = \tilde{o}\big((\beta(G_{1:T})^{1/2} T)^{\gamma}\big)$ over the sequence of losses assigned by the adversary, because $\beta < \gamma$ and $\alpha < 2 - 2\gamma$. However, according to Theorem 1 the expected regret is at least $\Omega\big(\beta(G_{1:T})^{1/3}\, ((\beta(G_{1:T})^{1/2} T)^{3\gamma - 2})^{1/3}\, T^{2/3}\big) = \Omega\big((\beta(G_{1:T})^{1/2} T)^{\gamma}\big)$, since $\beta(G_{1:T})^{1/3 + (3\gamma-2)/6}\, T^{(3\gamma-2)/3 + 2/3} = \beta(G_{1:T})^{\gamma/2}\, T^{\gamma}$. Hence, by contradiction, the claim of the lemma follows.

E Proof of Theorem 5

Proof. Let $t_1, t_2, \ldots$ be the sequence of time instances at which the event $E_t$ occurs during the duration $T$ of the game, and define the inter-event times $r_j = t_{j+1} - t_j$. Let $\mathrm{mas}(G_{(1)}) \ge \ldots \ge \mathrm{mas}(G_{(T)})$ denote the sizes of the maximal acyclic subgraphs arranged in decreasing order, i.e. $\mathrm{mas}(G_{(1)})$ (respectively $\mathrm{mas}(G_{(T)})$) is the maximum (respectively minimum) size of a maximal acyclic subgraph observed in the sequence $G_{1:T} = \{G_1, \ldots, G_T\}$. Using the definition of $E_t$, note that $r_j$ is a random variable bounded by $T^{1/3} c^{2/3} / \mathrm{mas}^{1/3}(G_{(T)})$. For all $j$, the ratio of the total weights of the actions at rounds $t_j$ and $t_{j+1}$ is

$$\frac{W_{t_{j+1}}}{W_{t_j}} = \sum_{i} \frac{w_{i,t_{j+1}}}{W_{t_j}} = \sum_{i} \frac{w_{i,t_j}}{W_{t_j}} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big) = \sum_{i} p_{i,t_j} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big)$$
$$\stackrel{(a)}{\le} \sum_{i} p_{i,t_j} \Big(1 - \eta\, \tilde{l}_{t_j+1:t_j+r_j}(i) + \frac{\eta^2}{2}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i)\Big) = 1 - \eta \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) + \frac{\eta^2}{2} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i) \qquad (16)$$

where $\tilde{l}_{t_j+1:t_j+r_j}(i)$ denotes the estimated loss of action $i$ accumulated over the rounds $t_j+1, \ldots, t_j+r_j$, and $(a)$ follows from the fact that $e^{-x} \le 1 - x + x^2/2$ for all $x \ge 0$. Now, taking logarithms on both sides of (16), summing over $t_1, t_2, \ldots$, and using $\log(1+x) \le x$ for all $x > -1$, we get

$$\log \frac{W_{T+1}}{W_1} \le -\eta \sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) + \frac{\eta^2}{2} \sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i). \qquad (17)$$
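The inequalities (16) and (17) are the usual exponential-weights potential argument, applied once per inter-event block with importance-weighted block loss estimates. The sketch below shows one such block update; it is illustrative only (the initialization, learning rate, and exact estimator form are assumptions on our part, not the paper's algorithm):

```python
# Hedged sketch of one exponential-weights block update with graph feedback,
# in the spirit of (16)-(17). All names and the estimator details are illustrative.
import numpy as np

def block_update(weights, eta, block_losses, block_graphs, played):
    """weights: length-K vector w_{., t_j};  eta: learning rate;
    block_losses: list of length-K arrays, the losses l_t over rounds t_j+1..t_j+r_j;
    block_graphs: list of feedback systems, block_graphs[s][k] = S_t(k), the arms
                  observed if k is played at that round;
    played: arm i_{t_j}, held fixed throughout the block."""
    K = len(weights)
    p = weights / weights.sum()                      # p_{., t_j}
    est = np.zeros(K)                                # block estimate \tilde{l}_{t_j+1:t_j+r_j}
    for l_t, S in zip(block_losses, block_graphs):
        for i in S[played]:                          # losses actually observed this round
            q_it = sum(p[k] for k in range(K) if i in S[k])   # q_{i,t}
            est[i] += l_t[i] / q_it                  # importance-weighted estimate
    return weights * np.exp(-eta * est)              # w_{., t_{j+1}}

# Tiny usage example with K = 3 arms and a 2-round block.
w = np.ones(3)
graphs = [[{0, 1}, {1}, {2}], [{0}, {1, 2}, {2}]]    # S_t(k) for k = 0, 1, 2
losses = [np.array([0.2, 0.5, 0.9]), np.array([0.1, 0.4, 0.8])]
print(block_update(w, eta=0.1, block_losses=losses, block_graphs=graphs, played=0))
```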

For all actions $k \in K$ we also have

$$\log \frac{W_{T+1}}{W_1} \ge \log \frac{w_{k,T+1}}{W_1} \ge -\eta \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k) - \log(K). \qquad (18)$$

Combining (17) and (18), for all $k \in K$ we obtain

$$\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) - \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k) \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i). \qquad (19)$$

Now, for all $i \in K$, the conditional expectation of $\tilde{l}_{t_j+1:t_j+r_j}(i)$ is

$$E\big[\tilde{l}_{t_j+1:t_j+r_j}(i) \mid p_{t_j}, r_j\big] = \sum_{t=t_j+1}^{t_j+r_j} \sum_{k:\, i \in S_t(k)} p_{k,t_j}\, \frac{l_t(i)}{q_{i,t}} = \sum_{t=t_j+1}^{t_j+r_j} \frac{l_t(i)}{q_{i,t}} \sum_{k:\, i \in S_t(k)} p_{k,t_j} = \sum_{t=t_j+1}^{t_j+r_j} l_t(i). \qquad (20)$$

Therefore, for all $i \in K$,

$$E\Big[\sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(i)\Big] = E\Big[E\Big[\sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(i) \,\Big|\, \{p_{t_j}, r_j\}_j\Big]\Big] = E\Big[\sum_{j} \sum_{t=t_j+1}^{t_j+r_j} l_t(i)\Big] \le \sum_{t=1}^{T} l_t(i). \qquad (21)$$

Now, the expectation of the second term on the right hand side of (19) is

$$E\Big[\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i)\Big] = E\Big[\sum_{j} E\Big[\sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i) \,\Big|\, \{p_{t_j}, r_j\}_j\Big]\Big] \stackrel{(a)}{\le} E\Big[\sum_{j} \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2\Big] \qquad (22)$$

where $\mathrm{mas}(G_{t_j:t_j+r_j}) = \max_{t_j \le n \le t_j+r_j} \mathrm{mas}(G_n)$ and $(a)$ follows from the fact that for all $i \in K$ and $t \le T$, $l_t(i) \le 1$ and $\sum_i p_{i,t}/q_{i,t} \le \mathrm{mas}(G_t)$ (Alon et al., 2017, Lemma 10). Now we bound $\sum_j \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2$ through the following optimization problem:

$$\max_{\{r_j\}} \ \sum_{j} \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2 \quad \text{subject to} \quad \sum_{j} r_j \le T, \qquad 0 \le r_j \le \frac{T^{1/3}\, c^{2/3}}{\mathrm{mas}^{1/3}(G_{(T)})}. \qquad (23)$$

Since the objective function is submodular and the constraints are linear, the greedy solution is at least a $(1 - 1/e)$ fraction of the optimal solution (Nemhauser and Wolsey, 1978). Therefore the optimal solution $o^*$ of the above optimization problem satisfies

$$o^* \le \sum_{t=1}^{\tilde{T}} \frac{T^{2/3}\, \mathrm{mas}(G_{(t)})\, c^{4/3}}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})} \qquad (24)$$

where $\tilde{T} = T^{2/3}\, \mathrm{mas}^{1/3}(G_{(T)}) / c^{2/3}$. Using (19), (20), (21), (22) and (24), for all $k \in K$ we have

$$E\Big[\sum_{j} \sum_{t=t_j+1}^{t_j+r_j} \sum_{i} p_{i,t_j}\, l_t(i)\Big] - \sum_{t=1}^{T} l_t(k) \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t=1}^{\tilde{T}} \frac{T^{2/3}\, c^{4/3}\, \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})}. \qquad (25)$$

Additionally, the player switches its action only if $E_t$ is true. Thus, using (25) and $c(i,j) = c$ for all $i \ne j \in K$, we have

$$R_{\mathcal{A}}(l_{1:T}, C) \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t=1}^{\tilde{T}} \frac{T^{2/3}\, c^{4/3}\, \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})} + c\, E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big]. \qquad (26)$$

Now we bound $E\big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\big]$. The event $E^1_t$ does not contribute to any switching cost. The event $E^2_t$ can lead to at most $T^{2/3}\, \mathrm{mas}^{1/3}(G_{(T)})/c^{2/3}$ switches. Now suppose $E^3_t$ causes $N_T$ switches. Then we have

$$E N_T = E\Big[\sum_{j} E\big[\mathbb{1}(i_{t_{j+1}} \ne i_{t_j},\ E^3_{t_j} \text{ is true}) \,\big|\, \{p_{t_j}, r_j\}_j\big]\Big] \le E\Big[\sum_{j} \sum_{k \in K \setminus \{i\}} P\big(i_{t_j} = i,\ E^3_{t_j} \text{ is true}\big)\, P\big(i_{t_{j+1}} = k \mid i_{t_j} = i,\ \{p_{t_j}, r_j\}_j\big)\Big]$$
$$\le E\Big[\sum_{j} \sum_{k \in K \setminus \{i\}} p_{i,t_j}\, p_{k,t_{j+1}}\Big] \stackrel{(a)}{\le} \sum_{j} \frac{\mathrm{mas}^{1/3}(G_{(T)})}{c^{2/3}\, t_{j+1}^{1/3}} \le \frac{\mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}}{c^{2/3}} \qquad (27)$$

where $(a)$ follows from Lemma 1 in this section. Thus the number of switches is at most $2\, c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}$ and the switching cost is at most $2\, c^{1/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}$. Part (iii) of the theorem follows by combining the results from (i) and (ii). Part (iv) follows from the fact that if $G_t$ is undirected, $\mathrm{mas}(G_t) = \alpha(G_t)$.

Lemma 1. Given that $i \in K$ is chosen at time instance $t_j$, for all $k \in K \setminus \{i\}$ we have $p_{i,t_j}\, p_{k,t_{j+1}} \le c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, (t_{j+1})^{-1/3}$.

Proof. Given that $i$ is chosen at time instance $t_j$, for all $k \in K \setminus \{i\}$ we have

$$p_{k,t_{j+1}} \le \frac{p_{k,1} \exp\big(-\eta\, \hat{l}_{t_{j+1}}(k)\big)}{p_{i,t_j} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big)} \stackrel{(a)}{=} \frac{p_{k,1} \exp\big(-\eta\, (\hat{l}_{t_j}(k) + \tilde{l}_{t_j+1:t_j+r_j}(k))\big)}{p_{i,t_j} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big)}$$
$$\stackrel{(b)}{=} \frac{\exp\big(-\eta\, (\hat{l}_{t_j}(k) + \tilde{l}_{t_j+1:t_j+r_j}(k) - \tilde{l}_{t_j+1:t_j+r_j}(i))\big)}{K\, p_{i,t_j}} \stackrel{(c)}{\le} \frac{\exp\big(-\eta\, (\epsilon_{t_{j+1}}/\eta)\big)}{K\, p_{i,t_j}} \le \frac{\exp(-\epsilon_{t_{j+1}})}{p_{i,t_j}} \qquad (28)$$

where $(a)$ follows from the fact that $\hat{l}_{t_{j+1}}(k) = \hat{l}_{t_j}(k) + \tilde{l}_{t_j+1:t_j+r_j}(k)$; $(b)$ follows from $p_{k,1} = 1/K$; and $(c)$ follows from the fact that, for all $k \in K \setminus \{i\}$, $\hat{l}_t(k) - \hat{l}_t(i) > \epsilon_t/\eta$, as the increment in $\hat{l}_t(i)$ is bounded by $1/q_{i,t}$. Now, replacing $\epsilon_t = \log\big(t\, c^2 / \mathrm{mas}(G_{(T)})\big)/3$ in (28), we have

$$p_{i,t_j}\, p_{k,t_{j+1}} \le c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, t_{j+1}^{-1/3}. \qquad (29)$$

F Proof of Theorem 6

Proof. We borrow the notation from the proof of Theorem 5. Using the fact that $\eta_t$ is decreasing in $t$, together with (19), we have

$$\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) - \min_{k \in K} \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k) \le \frac{\log(K)}{\eta_T} + \sum_{j} \frac{\eta_{t_j}}{2} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i). \qquad (30)$$

Now, taking expectations on both sides and using the fact that the expectation of the minimum is smaller than the minimum of the expectations, we have

$$E\Big[\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i)\Big] - E\Big[\min_{k \in K} \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k)\Big] \le E\Big[\frac{\log(K)}{\eta_T}\Big] + E\Big[\sum_{j} \frac{\eta_{t_j}}{2} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i)\Big]$$
$$\stackrel{(a)}{\le} E\Big[\frac{\log(K)}{\eta_T}\Big] + E\Big[\sum_{j} \frac{\eta_{t_j}}{2}\, \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2\Big] \stackrel{(b)}{\le} E\Big[\frac{\log(K)}{\eta_T}\Big] + \sum_{j} \frac{\eta_{t_j}}{2} \cdot \frac{2}{\epsilon_{t_j}^2}\, E\big[\mathrm{mas}(G_{t_j:t_j+r_j})\big]$$
$$\stackrel{(c)}{\le} E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] + \sum_{j} \frac{2\sqrt{\log(K)}\,\, E\big[\mathrm{mas}(G_{t_j:t_j+r_j})\big]}{\mathrm{mas}^{2/3}(G_{(T)})} \stackrel{(d)}{\le} E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] + \sum_{j} \frac{2\sqrt{\log(K)}\, \mathrm{mas}(G_{(j)})}{\mathrm{mas}^{2/3}(G_{(T)})} \qquad (31)$$

where $(a)$ follows from (22), given that $i_t$ is selected using $p_t$; $(b)$ follows from the fact that, since the probability of selecting a new action is at most $\epsilon_{t_j}$, the mean and the variance of the geometric random variable $r_j$ are bounded by $1/\epsilon_{t_j}$ and $(1-\epsilon_{t_j})/\epsilon_{t_j}^2$ respectively; $(c)$ follows from the values of $\eta_t$ and $\epsilon_t$; and $(d)$ follows from the fact that $\mathrm{mas}(G_{(j)})/\mathrm{mas}(G_{(T)})$ is a monotonically non-increasing sequence in $j$, so the summation is a concave function and the inequality follows from Jensen's inequality.

Now we bound the term $E\big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\big]$ in (31); this also gives a bound on the number of switches performed by the algorithm. We have

$$E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] \le \sum_{t=1}^{T} \epsilon_t \le 0.5\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}\, c^{-1/3}. \qquad (32)$$
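The bound (32) reflects that the algorithm is only allowed to change its action when a low-probability event fires, so the expected number of switches is at most $\sum_t \epsilon_t$. The sketch below simulates such a switching schedule; the specific form of $\epsilon_t$ here is a placeholder chosen only to match the $T^{2/3}$ scaling in (32), not the paper's exact choice:

```python
# Hedged sketch of a lazy switching schedule in the spirit of (32): the arm may be
# re-drawn only when a Bernoulli(eps_t) event fires, so E[#switches] <= sum_t eps_t.
import numpy as np

def lazy_switch_schedule(T, eps, rng=None):
    """Return the rounds at which the player is allowed to change its action.
    eps: function t -> switching probability epsilon_t (a modeling assumption here)."""
    rng = np.random.default_rng(rng)
    return [t for t in range(1, T + 1) if rng.random() < eps(t)]

# Placeholder epsilon_t of order (mas / (c * t))^{1/3}, scaled so that
# sum_t eps_t <= 0.5 * mas^{1/3} * T^{2/3} / c^{1/3}, mirroring (32).
c, mas = 1.0, 2.0
eps = lambda t: min(1.0, (mas / c) ** (1.0 / 3.0) * t ** (-1.0 / 3.0) / 3.0)
rounds = lazy_switch_schedule(T=1000, eps=eps, rng=0)
print(len(rounds), "allowed switches; sum_t eps_t =", round(sum(eps(t) for t in range(1, 1001)), 2))
```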

References

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. (2017). Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785-1826.

Dekel, O., Ding, J., Koren, T., and Peres, Y. (2014). Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (STOC 2014). ACM.

Nemhauser, G. L. and Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3):177-188.
