Universal Lossless Compression: Context Tree Weighting (CTW)


Donna Roberts
5 years ago

Dept. of Electrical Engineering, Stanford University, Dec 9, 2014

Universal Coding with Model Classes

Traditional Shannon theory assumes that a (probabilistic) model of the data is known and aims to compress the data optimally with respect to that model, e.g. Huffman coding and arithmetic coding.

When the actual model is unknown, we can still compress the sequence with universal compression, e.g. Lempel-Ziv (LZ) and CTW.

Isn't LZ good enough?
Yes: it is universal with respect to the class of finite-state coders, and it converges to the entropy rate for stationary ergodic sources.
No: its convergence is slow.

We explore CTW.

Universal Compression and Universal Estimation

A good compressor implies a good estimate of the source distribution.

For a sequence $x^n$, let $P(x^n)$ be the true distribution and $P_L(x^n)$ the estimated distribution. A code with lengths $L(x^n)$ is matched to the distribution
$$P_L(x^n) = \frac{2^{-L(x^n)}}{k_n}, \qquad k_n = \sum_{x^n} 2^{-L(x^n)} \le 1 \text{ by the Kraft inequality,}$$
in the sense that it achieves the minimum expected code length when the data follow $P_L$.

Then the expected redundancy is
$$\mathrm{Red}_n(P_L, P) = E\left[L(x^n) - \log\frac{1}{P(x^n)}\right] \tag{1}$$
$$= -\log k_n + D(P \,\|\, P_L). \tag{2}$$

A good compressor has a small redundancy. Both terms in (2) are nonnegative, so a small redundancy forces $D(P \,\|\, P_L)$ to be small. Therefore, a good compressor implies a good estimate of the true distribution.
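The decomposition in (1)-(2) can be checked numerically. The following is a minimal sketch, assuming a made-up four-symbol source `P` and made-up code lengths `L`; neither comes from the slides, and the function name `redundancy_terms` is hypothetical.

```python
import math

def redundancy_terms(P, L):
    """Return (expected redundancy, -log2 k_n, D(P || P_L)) for code lengths L."""
    k = sum(2.0 ** -l for l in L)               # Kraft sum, at most 1 for a prefix code
    P_L = [2.0 ** -l / k for l in L]            # distribution the code is matched to
    # E[L(x) - log2(1/P(x))], written as sum of P(x) * (L(x) + log2 P(x))
    red = sum(p * (l + math.log2(p)) for p, l in zip(P, L) if p > 0)
    kl = sum(p * math.log2(p / q) for p, q in zip(P, P_L) if p > 0)
    return red, -math.log2(k), kl

P = [0.5, 0.25, 0.125, 0.125]   # hypothetical true distribution
L = [2, 2, 2, 3]                # hypothetical code lengths, Kraft sum 7/8
red, slack, kl = redundancy_terms(P, L)
print(red, slack + kl)          # the two numbers should agree
```

The Kraft slack `-log2 k_n` and the divergence are both nonnegative, so neither can hide a large value of the other.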

Two Entropy Codes for Known Parameters

Huffman code: an optimal prefix code.

Arithmetic code: with $L(x^n) = \left\lceil \log\frac{1}{P(x^n)} \right\rceil + 1$, the codeword is the binary expansion
$$0.c = \left\lceil Q(x^n)\, 2^{L(x^n)} \right\rceil 2^{-L(x^n)},$$
where $Q(x^n)$ is the cumulative probability of the sequences that precede $x^n$.

[Figure: the unit interval marked with cumulative probabilities $Q_0, Q_1, Q_2, Q_3$ and the codeword point $0.c_3$ of length $L_3$.]

What if the parameters are unknown?
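The arithmetic-code rule above can be sketched with exact rational arithmetic. This is an illustration under assumptions: the four probabilities are made up, the sequences are taken to be listed in lexicographic order, and `codeword` is a hypothetical helper name.

```python
import math
from fractions import Fraction

def codeword(probs, i):
    """Bits of 0.c for the i-th sequence, given probabilities in lexicographic order."""
    p = probs[i]
    Q = sum(probs[:i], Fraction(0))          # cumulative probability of earlier sequences
    L = 1                                    # L = ceil(log2(1/p)) + 1
    while Fraction(1, 2 ** (L - 1)) > p:
        L += 1
    c = math.ceil(Q * 2 ** L)                # round the interval start up to L bits
    return format(c, "b").zfill(L)

# Hypothetical distribution over four sequences.
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
codes = [codeword(probs, i) for i in range(len(probs))]
print(codes)
```

Each codeword spends at most two bits more than $\log\frac{1}{P(x^n)}$, which is the "+2" that appears in the redundancy bounds.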

Bernoulli Model: Two-Part Code

Let $\mathcal{C} = \{P_\theta : \theta \in [0, 1]\}$ and $X^n$ i.i.d. $\mathrm{Bern}(\theta)$, but we do not know $\theta$.

Can we still design the compressor? Yes, with acceptable pointwise redundancy for all $x^n$.

Note that
$$\min_{C \in \mathcal{C}} L_C(x^n) = \min_{\theta \in [0,1]} \log\frac{1}{P_\theta(x^n)} = n\,\hat{h}(x^n) = n\,h_2(\hat\theta),$$
where $\hat\theta = n_1/n$ is the fraction of ones in $x^n$ and $h_2$ is the binary entropy function.

Two-part code: use $\log(n+1)$ bits to encode $n_1$, then tune the code to $\theta = n_1/n$. Then
$$\mathrm{Red}_n(L, x^n) = \frac{\log(n+1) + 2}{n} \to 0.$$

This is natural, but it depends on $n$ (it is not sequential).
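A short numerical sketch of the two-part code follows. It assumes the count $n_1$ is sent with a whole number of bits (hence the ceiling, which the slide formula omits) and that the second part costs $n h_2(\hat\theta) + 2$ bits; the function names and the sample sequence are hypothetical.

```python
import math

def h2(theta):
    """Binary entropy in bits."""
    if theta in (0.0, 1.0):
        return 0.0
    return -theta * math.log2(theta) - (1 - theta) * math.log2(1 - theta)

def two_part_length(bits):
    """Bits to describe n_1, plus the cost of coding with theta = n_1 / n."""
    n, n1 = len(bits), sum(bits)
    return math.ceil(math.log2(n + 1)) + n * h2(n1 / n) + 2

def redundancy_per_symbol(n):
    """(ceil(log2(n + 1)) + 2) / n, the per-symbol overhead of the two-part code."""
    return (math.ceil(math.log2(n + 1)) + 2) / n

bits = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # hypothetical sequence, n = 10, n_1 = 3
print(two_part_length(bits))
print([redundancy_per_symbol(n) for n in (10, 100, 1000)])
```

The overhead shrinks as $n$ grows, but the encoder must know $n$ and $n_1$ before it emits the first bit, which is why the scheme is not sequential.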

Bernoulli Model: Mixture and Plug-in Code

Better code: enumerate all $\binom{n}{n_1}$ sequences with $n_1$ ones ($n_1$ away from $0$ and $n$), $L(x^n) = \log\binom{n}{n_1}$:
$$\mathrm{Red}_n(L, x^n) \le \frac{\log(n+1) + 2}{2n} + O\!\left(\frac{1}{n}\right) \to 0. \tag{3}$$

Uniform assignment over types is equivalent to a uniform mixture,
$$P_L(x^n) = \int_0^1 \theta^{n_1} (1-\theta)^{n_0}\, d\theta = \frac{1}{(n+1)\binom{n}{n_1}} = \prod_{t=0}^{n-1} p_L(x_{t+1} \mid x^t),$$
where $p_L(0 \mid x^t) = \frac{n_0(x^t) + 1}{t + 2}$ is the rule of succession. This uses a biased estimator of the conditional probability (plug-in approach).

Mixture code: mix over the Dirichlet density
$$dw(\theta) = \frac{d\theta}{\Gamma(1/2)^2 \sqrt{\theta(1-\theta)}},$$
which gives
$$\mathrm{Red}_n(L, x^n) \le \frac{\log(n+1)}{2n} + O\!\left(\frac{1}{n}\right) \to 0. \tag{4}$$

Note that
$$P_L(x^n) = \int_0^1 P_\theta(x^n)\, dw(\theta) = \prod_{t=0}^{n-1} p_L(x_{t+1} \mid x^t),$$
where $p_L(0 \mid x^t) = \frac{n_0(x^t) + \frac{1}{2}}{t + 1}$ is the KT (Krichevsky-Trofimov) estimator, which depends only on $n_0$ and $n_1$.
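Both plug-in rules can be run sequentially and compared with their closed forms. The sketch below is a minimal illustration, assuming a smoothing constant `alpha` that is 1 for the rule of succession and 1/2 for the KT estimator; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def sequential_probability(bits, alpha):
    """Product of plug-in conditionals (count + alpha) / (t + 2 * alpha)."""
    counts = [0, 0]
    p = Fraction(1)
    for t, x in enumerate(bits):
        p *= (counts[x] + alpha) / (t + 2 * alpha)
        counts[x] += 1
    return p

def uniform_mixture(n0, n1):
    """Closed form 1 / ((n + 1) * C(n, n1)) of the uniform mixture."""
    n = n0 + n1
    return Fraction(1, (n + 1) * comb(n, n1))

def kt_block(n0, n1):
    """Closed form of the KT probability: products of (i + 1/2) over (n0 + n1)!."""
    num = Fraction(1)
    for i in range(n0):
        num *= Fraction(2 * i + 1, 2)
    for j in range(n1):
        num *= Fraction(2 * j + 1, 2)
    return num / factorial(n0 + n1)

bits = [0, 1, 0]                                        # hypothetical sequence
laplace = sequential_probability(bits, Fraction(1))     # rule of succession
kt = sequential_probability(bits, Fraction(1, 2))       # KT estimator
print(laplace, uniform_mixture(2, 1))
print(kt, kt_block(2, 1))
```

Because the result depends only on the counts, any reordering of `bits` gives the same probability, which is what makes the mixture computable one symbol at a time.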

Tree Source with Known Model

If the next symbol depends on past symbols, a tree model is a good choice.

Let a binary tree have leaves $S$, so that $\theta \in \mathbb{R}^{|S|}$ (one parameter $\theta_s$ per leaf). For a fixed $x^n$, define $n_0(s)$ as the number of 0s that follow the context $s$; define $n_1(s)$ similarly.

For a sequence $x^n = [...]$ with $n = 10$: $n_0(10) = 2$, $n_1(10) = 1$.

For each $s$, calculate the KT estimator $P_{L,s}(n_0(s), n_1(s))$ and use
$$P_L(x^n) = \prod_{s \in S} P_{L,s}(n_0(s), n_1(s))$$
for arithmetic coding.

Then the redundancy is
$$\mathrm{Red}_n(L, x^n) = L(x^n) - \log\frac{1}{P(x^n)} \tag{5}$$
$$< \log\frac{1}{P_L(x^n)} + 2 - \log\frac{1}{P(x^n)} \tag{6}$$
$$= \log\frac{1}{\prod_{s \in S} P_{L,s}(n_0(s), n_1(s))} - \log\frac{1}{P(x^n)} + 2. \tag{7}$$

Tree Source with Known Model (continued)

$$\mathrm{Red}_n(L, x^n) < \log\frac{\prod_{s \in S} \theta_s^{n_1(s)} (1-\theta_s)^{n_0(s)}}{\prod_{s \in S} P_{L,s}(n_0(s), n_1(s))} + 2 \tag{8}$$
$$\le \sum_{s \in S} \left(\frac{1}{2}\log\big(n_0(s) + n_1(s)\big) + 1\right) + 2 \tag{9}$$
$$\le |S| \left(\frac{1}{2}\log\frac{\sum_{s \in S} \big(n_0(s) + n_1(s)\big)}{|S|} + 1\right) + 2 \tag{10}$$
$$= |S| \left(\frac{1}{2}\log\frac{n}{|S|} + 1\right) + 2. \tag{11}$$

Step (9) applies the KT redundancy bound to each leaf that is visited at least once, and step (10) uses the concavity of the logarithm.

The $O(\log n)$ term is the cost of the unknown parameters $\theta_s$, $s \in S$.

This upper bound, obtained with the KT estimator, matches the leading term of the lower bound on the minimax redundancy, i.e., $\frac{|S|}{2}\log n + O(1)$.
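The known-tree code can be sketched as follows. Several choices here are assumptions rather than part of the slides: contexts are written oldest symbol first (so `"01"` means the previous two symbols were 0 then 1), the first `depth` symbols serve as the initial past and are not coded, and the sequence and leaf set are made up. The redundancy is measured against the best-fitting parameters, without the 2 bits of arithmetic-coding overhead.

```python
from fractions import Fraction
from math import log2

def kt(a, b):
    """KT block probability of a zeros and b ones."""
    p, n0, n1 = Fraction(1), 0, 0
    for _ in range(a):
        p *= Fraction(2 * n0 + 1, 2 * (n0 + n1 + 1))
        n0 += 1
    for _ in range(b):
        p *= Fraction(2 * n1 + 1, 2 * (n0 + n1 + 1))
        n1 += 1
    return p

def leaf_counts(bits, leaves):
    """Return {s: [n_0(s), n_1(s)]} for every leaf context s."""
    depth = max(len(s) for s in leaves)
    counts = {s: [0, 0] for s in leaves}
    for t in range(depth, len(bits)):        # the first `depth` symbols are the initial past
        for s in leaves:
            if bits[:t].endswith(s):         # exactly one leaf of a complete tree matches
                counts[s][int(bits[t])] += 1
                break
    return counts

def ml_cost(counts):
    """-log2 of the best tree-source probability, with theta_s fitted per leaf."""
    total = 0.0
    for a, b in counts.values():
        for c in (a, b):
            if c:
                total -= c * log2(c / (a + b))
    return total

bits = "0110100110"                  # hypothetical sequence
leaves = ["0", "01", "11"]           # hypothetical leaf set of depth 2
counts = leaf_counts(bits, leaves)
p_known = Fraction(1)
for a, b in counts.values():
    p_known *= kt(a, b)              # product of per-leaf KT probabilities

n = sum(a + b for a, b in counts.values())
redundancy = -log2(float(p_known)) - ml_cost(counts)
bound = len(leaves) * (0.5 * log2(n / len(leaves)) + 1)
print(counts, p_known, redundancy, bound)
```

On this toy input the measured redundancy stays below the bound of (11) without its trailing "+2", as expected for ideal (non-integer) code lengths.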

Tree Source with Unknown Model

What if we do not know the tree model either? Assume only that the tree has depth at most $D$.

Two-part code: use $n_T$ bits to represent the tree $T$, then use the KT estimator for that tree, for a total length of $L_T(x^n) + n_T$. But this must be optimized over all trees, and it is not sequentially implementable.

CTW (Context Tree Weighting) defines a weighted coding distribution that takes into account all the possible tree sources that could have led to the observed sequence.

Question: the leaves $S$ of this context tree might differ from those of the actual tree model. Is the code still good?

Tree Source with Unknown Model (continued)

Calculate $n_0(s)$, $n_1(s)$ for all $s \in \{s : s \in \{0,1\}^*,\ l(s) \le D\}$, where $l(s)$ is the length of $s$.

[Figure: context tree of depth 2 with root $s = \lambda$, children $s = 0$ and $s = 1$, and leaves $s = 00$, $01$, $10$, $11$.]

Ex) For $x^n = [...]$: $n_0(0) = 3$, $n_1(0) = 3$ and $n_0(10) = 2$, $n_1(10) = 1$.

Calculate the corresponding KT estimator $P_e(n_0(s), n_1(s))$ for each $s$.

Assign the weighted probability
$$P_w^s = \begin{cases} P_e(n_0(s), n_1(s)) & \text{if } l(s) = D, \\ \dfrac{P_e(n_0(s), n_1(s)) + P_w^{0s} P_w^{1s}}{2} & \text{if } l(s) \ne D, \end{cases} \tag{12}$$
and take $P_L(x^n) = P_w^\lambda(x^n)$ for arithmetic coding.

Corollary: suppose $x^n$ is drawn from $P_1$ or $P_2$. Then one can achieve a redundancy of at most 1 bit by using the weighted distribution $P_w = \frac{P_1 + P_2}{2}$, because $P_w \ge P_i/2$ implies $\log\frac{1}{P_w} \le \log\frac{1}{P_i} + 1$.
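The recursion (12) can be sketched directly. This is a batch illustration, not a sequential encoder, and it rests on assumptions that the slides do not fix: contexts are written oldest symbol first, the child contexts `0s` and `1s` extend `s` one symbol further into the past, and the first `depth` symbols of the input are treated as a known past. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def kt(a, b):
    """KT block probability P_e(a, b) of a zeros and b ones."""
    p, n0, n1 = Fraction(1), 0, 0
    for _ in range(a):
        p *= Fraction(2 * n0 + 1, 2 * (n0 + n1 + 1))
        n0 += 1
    for _ in range(b):
        p *= Fraction(2 * n1 + 1, 2 * (n0 + n1 + 1))
        n1 += 1
    return p

def ctw_probability(bits, depth):
    """Weighted probability at the root for bits[depth:], given bits[:depth] as past."""
    counts = {}
    for t in range(depth, len(bits)):
        for l in range(depth + 1):
            s = bits[t - l:t]                 # context of length l, oldest symbol first
            counts.setdefault(s, [0, 0])[int(bits[t])] += 1

    def weighted(s):
        a, b = counts.get(s, (0, 0))          # unseen contexts have zero counts
        if len(s) == depth:
            return kt(a, b)                   # maximum-depth node: KT only
        return (kt(a, b) + weighted("0" + s) * weighted("1" + s)) / 2

    return weighted("")

# Summing over every length-3 continuation of a fixed past should give 1.
total = sum(ctw_probability("00" + "".join(x), 2) for x in product("01", repeat=3))
print(total)
```

The sum equals 1 because the root probability is a mixture of valid coding distributions, one per tree of depth at most $D$; a practical encoder would update the counts and weighted probabilities along a single context path per symbol instead of recomputing the tree.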

Tree Source with Unknown Model (continued)

Then
$$P_w^\lambda(x^n) = \sum_{S' \in \text{all } T_D} 2^{-\Gamma_D(S')} \prod_{s \in S'} P_e(n_0(s), n_1(s)) \tag{13}$$
$$\ge 2^{-\Gamma_D(S)} \prod_{s \in S} P_e(n_0(s), n_1(s)) \tag{14}$$
$$\ge 2^{-(2|S| - 1)}\, P_L^{\text{known}}(x^n), \tag{15}$$
where the sum runs over all trees of depth at most $D$, $S$ is the actual tree, and $\Gamma_D(S) \le 2|S| - 1$ is the number of bits needed to describe $S$.

The redundancy is
$$\mathrm{Red}_n(x^n) \le \mathrm{Red}_n^{\text{known}}(x^n) + 2|S| - 1 \tag{16}$$
$$\le \frac{|S|}{2}\log\frac{n}{|S|} + |S| + 2 + 2|S| - 1. \tag{17}$$
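The weights $2^{-\Gamma_D(S')}$ in (13) form a probability distribution over trees, which can be confirmed by brute force for small $D$. The sketch below assumes the tree cost defined by Willems, Shtarkov and Tjalkens, $\Gamma_D(S) = |S| - 1 + |\{s \in S : l(s) \ne D\}|$, which the slides use but do not spell out; the function names are hypothetical.

```python
from fractions import Fraction

def trees(depth, s=""):
    """Yield the leaf set of every complete binary context tree below node s."""
    yield [s]                                     # stop here: s itself is a leaf
    if len(s) < depth:
        for older0 in trees(depth, "0" + s):      # or split s into children 0s and 1s
            for older1 in trees(depth, "1" + s):
                yield older0 + older1

def gamma(leaves, depth):
    """Tree description cost: |S| - 1 plus the number of leaves above depth D."""
    return len(leaves) - 1 + sum(1 for s in leaves if len(s) < depth)

D = 2
all_trees = list(trees(D))
total = sum(Fraction(1, 2 ** gamma(S, D)) for S in all_trees)
print(len(all_trees), total)
```

Since every tree satisfies $\Gamma_D(S) \le 2|S| - 1$, not knowing the model costs at most $2|S| - 1$ extra bits, independent of $n$.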

Summary

Two-part code vs. mixture code (plug-in code).

The KT estimator is optimal in the minimax sense, and the mixture is sequentially implementable through the KT estimator.

Even when the tree model is unknown (only the depth is given), we can design a good compressor using CTW.

References

Willems, Shtarkov, Tjalkens (1995), "The Context-Tree Weighting Method: Basic Properties", IEEE Transactions on Information Theory, vol. 41.
Begleiter, El-Yaniv, Yona (2004), "On Prediction Using Variable Order Markov Models", Journal of Artificial Intelligence Research, vol. 22.
Some tutorials and lecture notes.
