Applications of on-line prediction. in telecommunication problems

Size: px

Start display at page:

Download "Applications of on-line prediction. in telecommunication problems"

Dwayne King
5 years ago
Views:

1 Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1

2 Outline On-line prediction; Some algorithmic issues; An application to lossy data compression without delay; An application to a routing problem. 2

3 Randomized prediction A game between forecaster and environment. At each round t, the forecaster chooses an action I t {1,..., N}; the environment chooses an action y t Y; the forecaster suffers loss l(i t, y t ) [0, 1]. The goal is to minimize the cumulative excess loss ( n ) 1 n l(i t, y t ) min l(i, y t ) n i N The forecaster may randomize. At time t chooses a probability distribution p t = (p 1,t,..., p N,t ) and plays action i with probability p i,t. Actions are often called experts.. 3

4 Randomized prediction This and related models have been studied in game theory: playing repeated games; information theory: gambling and data compression; statistics: sequential decisions; statistical learning theory: on-line learning; The simplest model assumes that after each round, the losses l(i, y t ) (i = 1,..., N) are revealed (full information). In this model Hannan (1957) showed that the forecaster has a strategy such that ( n ) 1 n l(i t, y t ) min l(i, y t ) 0 a.s. n i N 4

5 Hannan consistency: basic ideas Obviously, the forecaster must randomize. l(p t, y t ) = N p i,t l(i, y t ) = E t l(i t, y t ) i=1 denotes the expected loss of the forecaster. By martingale convergence, ( 1 n l(i t, y t ) n so it suffices to study ( 1 n n ) n l(p t, y t ) = O P (n 1/2 ) l(p t, y t ) min i N ) n l(i, y t ) 5

6 Weighted average prediction Idea: assign a higher probability to better-performing actions. A popular choice is ( exp η ) t s=1 l(i, y s) p i,t = N k=1 ( η exp ) i = 1,..., N. t s=1 l(k, y s) where η > 0. Then, with η = 8 ln N/n, ( n ) 1 n l(p n t, y t ) min l(i, y t ) i N = ln N nη + η 8 ln N 2n If N is large, the algorithm is not feasible. There is hope for structured classes of experts. 6

7 The shortest path problem v PSfrag replacements u v u Given a dag (V, E) and vertices u and v, a path from u to v is identified by a binary vector i {0, 1} E. For each t = 1,..., n, the forecaster chooses a path I t. A loss l e,t [0, 1] is assigned to each e E. The loss of a path i is the sum of losses over the edges l(i, Y t ) = i l t. 7

8 Efficient implementation The number of paths (experts) is exponentially large. However, the weighted average forecaster can be computed efficiently. For each path i, we write p i,t = P t [I t = i] = K i k=1 It can be seen that, if (ŵ, w) E, P t [v It,k = v ik v It,k 1 = v i,k 1,..., v It,0 = u]. P t [v It,k = w v It,k 1 = ŵ,..., v It,0 = u] = e ηl ( bw,w),t 1 G t 1(w) G t 1 (ŵ) where, for any vertex w, G t (w) = i P w e η P e i L e,t. 8

9 The algorithm computes the weighted average forecaster over all paths such that it requires O( E ) operations at each time. The expected regret is bounded by n l(p t, Y t ) min i n n ln M l(i, Y t ) K 2 where M is the number of paths from u to v and K is the length of the longest path. 9

10 Lossy source coding problem Asymptotically optimal coding of a source sequence with zero delay (sequential encoding and decoding); no probabilistic assumptions on the source; asymptotically the same performance as the best scalar quantizer matched to the entire source sequence (for any sequence). Goals: solve the above problem with low complexity; determine the achievable rate of convergence (also for probabilistic sources) 10

11 Zero-Delay Sequential Source Coding x i encoder y i {1,..., M} decoder ˆx i Ui Source sequence: x i [0, 1], i = 1, 2,...; rate R = log M code: Encoder: f i : [0, 1] i {1, 2..., M}; Decoder: g i : {1, 2,..., M} i [0, 1]; y i = f i (x i ) = f i (x i, U i ) ˆx i = g i (y i ) Randomizing sequence: U i, i.i.d. uniform on [0, 1]; 11

12 Performance Expected normalized cumulative squared distortion: [ ] 1 n E (x i ˆx i ) 2 n i=1 Goal: Perform (essentially) as well as the best scalar quantizer matched to the entire sequence x n = (x 1,..., x n ). Expected distortion redundancy with respect to the class of scalar quantizers: ( [ ] ) 1 n ρ n = sup E (x i ˆx i ) 2 1 n min (x i Q(x i )) 2 x n n Q Q n i=1 where Q is the set of M-level scalar quantizers. i=1 12

13 A scheme based on on-line prediction n {}}{ }{{} (k + 1)st block l {}}{}{{}}{{} 1 R log ( M) K transmitting Q k (x i ) describing Q k the source sequence is divided into blocks of length l at the end of the kth block a quantizer Q k is chosen randomly from a given family of ( K M) quantizers in the first 1 R log ( K M) time instants of the (k + 1)st block the index of Q k is transmitted in the rest of the block, at time instants i, Q k (x i ) is transmitted 13

14 Choosing Q k Let Q K (K n 1/3 for minimum distortion) be the set of M-level nearest neighbor scalar quantizers with code points in the set C (K) = {1/(2K), 3/(2K),..., (2K 1)/(2K)}. Q k is chosen randomly from Q K according to the probabilities p k (Q) = P{Q k = Q} = e η P kl (x t Q(x t )) 2 b Q QK e η P kl (x t bq(x t )) 2 l log K log K Performance: ρ n C 1 + C 2 n l where C 1, C 2 > 0 are constants depending only on M. + 1 K Straightforward implementation: need to compute the cumulative distortion for all Q Q K, i.e., for about K M n M/3 quantizers. 14

15 Efficient implementation The shortest path algorithm can be used to construct a zero-delay coding scheme of rate R = log M such that for any sequence x n [0, 1] n ρ n C 1 l log K n + C 2 log K l + 3 K where C 1, C 2 > 0 are constants depending only on M, with O(MK 2 n/l) computational complexity; and O(K 2 ) memory requirement. Linear computational complexity O(M) per time instant and O(n 1/2 ) memory requirement with O(n 1/4 log n) distortion redundancy 15

16 Tracking the best expert (Herbster and Warmuth, 1998). Given any sequence i 1,..., i n of actions from {1,..., N}, define size(i 1,..., i n ) = n t=2 I i t 1 i t. The tracking regret is ( n ) n max l(p t, Y t ) l(i t, Y t ) over all sequences with size(i 1,..., i n ) m. ( n 1 k The number of meta experts is m k=0 weighted average would give a regret bounded by n 2 ( (m + 1) ln N + m ln ) N(N 1) k. Simple ) e(n 1) m 16

17 Fixed share forecaster Parameters: Real numbers η > 0 and 0 α = m/(n 1). Initialization: w 0 = (1/N,..., 1/N). For each round t = 1, 2,... (1) draw an action I t from {1,..., N} according to the distribution p i,t = w i,t 1 N j=1 w j,t 1 i = 1,..., N (2) obtain Y t and compute v i,t = w i,t 1 e η l(i,y t) for each i = 1,..., N (3) let w i,t = α W t N + (1 α)v i,t where W t = v 1,t + + v N,t. for each i = 1,..., N 17

18 Tracking the shortest path Goal: predict as well as the best path that can change m times. m ( n 1 ) k=0 k M(M 1) k meta experts. (M is the number of paths.) The fixed share forecaster requires computation proportional to M. We need efficient computation of the fixed share algorithm. 18

19 Fixed share forecaster revisited Initialization: For t = 1, choose I 1 uniformly from the set {1,..., N}. For each round t = 2,..., n (1) Draw τ t randomly according to the distribution (1 α) t 1 Z 1,t 1 NW t 1 for t = 1 P t [τ t = t ] = where we define Z t,t 1 = N. P t [I t = i τ t = t ] = α(1 α) t t W t 1 Z t,t 1 NW t 1 for t = 2,..., t (2) Given τ t = t, choose I t randomly according to the probabilities e η P t 1 s=t l(i,y s) Z for t = 1,..., t 1 t,t 1 1 N for t = t 19

20 Tracking the shortest path The alternative fixed share algorithm can be implemented such that, at time t, computing the prediction I t requires time O(tK E + t 2 ). The expected tracking regret of the algorithm satisfies K n 2 ( (m + 1) ln M + m ln e n ) m for all sequences of paths i 1,..., i n such that size(i 1,..., i n ) m where M is the number of all paths in the graph between vertices u and v and K is the length of the longest path. 20

21 Application to zero-delay coding An efficiently computable (O(M) computations per time instant) zero-delay coding scheme can be constructed whose average distortion is at most as large as the best code that can change scalar quantizers at most m times plus a redundancy of the order of mn 1/6. 21

22 Multi-armed bandit problem (Auer, Cesa-Bianchi, Freund, and Schapire, 2002.) The forecaster only observes his/her own loss. Losses of experts have to be estimated. Let g(i, Y t ) = 1 l(i, Y t ). Gains are estimated by g(i, Y t )/p i,t if I t = i g(i, Y t ) = i = 1,..., N. 0 otherwise Note that E[ g(i, Y t ) I 1,..., I t 1 ] = g(i, Y t ), and therefore g(i, Y t ) is an unbiased estimate of g(i, Y t ). 22

23 A strategy for the multi-armed bandit problem Initialization: w i,0 = 1 and p i,1 = 1/N for i = 1,..., N. (1) Select an action I t {1,..., N} according to the probability distribution p t. (2) Calculate the estimated gains g (i, Y t ) = g(i, Y t ) + β p i,t = ( g(i, Yt ) + β ) /p i,t if I t = i β/p i,t otherwise (3) Update the weights w i,t = w i,t 1 e ηg (i,y t ) (4) Calculate the updated probability distribution w i,t p i,t+1 = (1 γ) N j=1 w j,t + γ N, i = 1,..., N. 23

24 Performance bound The regret of the algorithm satisfies, with probability 1 δ, L n min i=1,...,n L i,n 11 2 nn ln(n/δ) + ln N 2 Bound depends badly on the number of actions. Cannot be improved, in general.. 24

25 Bandit version of shortest path: a routing problem A packet is sent through a network (graph), and at each edge the transmission suffers a delay that can vary arbitrarily. We want to choose the route such that the total delay is not much larger than that of the best constant route. We only receive information about the delays of the packets sent, but don t know anything about the other edges. 25

26 The general bandit algorithm needs to be modified to exploit the special structure of the class of experts of all paths. The probability of certain paths needs to be kept large. We obtain an algorithm, computable in time quadratic in E whose regret is, with probability 1 δ, n l(i t, Y t ) min i n l(i, Y t ) nk 3 E ln(m/δ). 26

The on-line shortest path problem under partial monitoring

The on-line shortest path problem under partial monitoring András György Machine Learning Research Group Computer and Automation Research Institute Hungarian Academy of Sciences Kende u. 11-13, Budapest,