Context tree models for source coding: Toward Non-parametric Information Theory
Outline
1. Lossless Source Coding = density estimation with log-loss
   - Source Coding and Universal Coding
   - Bayesian coding
2. Context-tree Models and Double Mixture: Rissanen's model
   - Coding in Context Tree models
   - The Context-Tree Weighting Method
3. Context Tree Weighting on Renewal Processes
   - CTW on Renewal Processes
   - Perspectives
Data Compression: Shannon's Model
Outline (current section): Source Coding and Universal Coding
The Problem
Finite alphabet $A$; e.g. $A = \{a, c, t, g\}$.
Set of all finite binary sequences: $\{0,1\}^* = \bigcup_{n \ge 0} \{0,1\}^n$; if $x \in \{0,1\}^n$, its length is $|x| = n$. $\log$ = base-2 logarithm.
Let $X$ be a stochastic process on the finite alphabet $A$, with stationary ergodic distribution $P$; the marginal distribution of $X_1^n$ is denoted by $P_n$.
For $n \in \mathbb{N}$, the coding function is $\phi_n : A^n \to \{0,1\}^*$.
Expected code length: $E_P[|\phi_n(X_1^n)|]$.
Example: DNA Codes
$A = \{a, c, t, g\}$.
Uniform concatenation code: $\phi_1(a) = 00$, $\phi_1(c) = 01$, $\phi_1(t) = 10$, $\phi_1(g) = 11$, and $\phi_n(X_1^n) = \phi_1(X_1) \dots \phi_1(X_n)$, so that $\forall x_1^n \in A^n$, $|\phi_n(X_1^n)| = 2n$.
Improvement? $\psi_1(a) = 0$, $\psi_1(c) = 1$, $\psi_1(t) = 01$, $\psi_1(g) = 10$. Problem: what is $\psi^{-1}(101)$: $ct$ or $gc$?
Uniquely decodable codes: $\phi_n(X_1^n) = \phi_m(Y_1^m) \implies m = n$ and $\forall i \in \{1, \dots, n\}$, $X_i = Y_i$.
Instant codes = prefix codes: no codeword is a prefix of another, i.e. $\forall (a,b) \in A^2$ with $a \ne b$, $\forall j$, $\phi_1(a)_{1:j} \ne \phi_1(b)$.
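A quick way to see the ambiguity concretely: a minimal Python sketch (the codeword tables are the ones reconstructed above; `decode_all` is an illustrative helper, not a standard library function):

```python
phi = {"a": "00", "c": "01", "t": "10", "g": "11"}   # prefix code
psi = {"a": "0", "c": "1", "t": "01", "g": "10"}     # NOT uniquely decodable

def encode(code, msg):
    """Concatenate the codewords of the symbols of msg."""
    return "".join(code[s] for s in msg)

def decode_all(code, bits, prefix=""):
    """Return every message whose encoding is exactly `bits`."""
    if not bits:
        return [prefix]
    out = []
    for sym, w in code.items():
        if bits.startswith(w):
            out += decode_all(code, bits[len(w):], prefix + sym)
    return out

print(encode(psi, "ct"), encode(psi, "gc"))  # both give '101'
print(decode_all(psi, "101"))                # ['cac', 'ct', 'gc']: ambiguous
print(decode_all(phi, encode(phi, "ctg")))   # ['ctg']: unique decoding
```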
Kraft's Inequality
Theorem (Kraft 49, McMillan 56): if $\phi_n$ is a uniquely decodable code, then $\sum_{x \in A^n} 2^{-|\phi_n(x)|} \le 1$.
Proof for prefix codes: let $D = \max_{x \in A^n} |\phi_n(x)|$ and, for $x \in A^n$, let $Z(x) = \{w \in \{0,1\}^D : w_{1:|\phi_n(x)|} = \phi_n(x)\}$.
Then $\# Z(x) = 2^{D - |\phi_n(x)|}$, and since $\phi_n$ is a prefix code, $x \ne y \implies Z(x) \cap Z(y) = \emptyset$.
Hence $\sum_{x \in A^n} \# Z(x) = \sum_{x \in A^n} 2^{D - |\phi_n(x)|} \le \# \{0,1\}^D = 2^D$.
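A numeric companion to the theorem, on the two codes of the previous slide: $\phi$ meets Kraft's bound with equality, while $\psi$ violates it, which already proves that $\psi$ cannot be uniquely decodable:

```python
phi = {"a": "00", "c": "01", "t": "10", "g": "11"}
psi = {"a": "0", "c": "1", "t": "01", "g": "10"}

def kraft_sum(code):
    """Sum of 2^{-|w|} over all codewords w."""
    return sum(2 ** -len(w) for w in code.values())

print(kraft_sum(phi))  # 1.0 : compatible with unique decodability
print(kraft_sum(psi))  # 1.5 : > 1, so psi is not uniquely decodable
```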
Shannon's First Theorem
With $l(x) = |\phi_n(x)|$, we look for $\min_l \sum_{x \in A^n} l(x) P_n(x)$ under the constraint $\sum_{x \in A^n} 2^{-l(x)} \le 1$.
Theorem (Shannon 48): $E_P[|\phi_n(X_1^n)|] \ge H_n(X) = E_P[-\log P_n(X_1^n)]$.
Misleading notation: $H_n(X)$ is actually a function of $P_n$ only, the marginal distribution of $X_1^n$.
Tight bound: there exists a code $\phi_n$ such that $E_P[|\phi_n(X_1^n)|] \le H_n(X) + 1$.
Example: binary entropy
$A = \{0,1\}$, $X_i$ i.i.d. $\mathcal{B}(p)$, $0 < p < 1$.
$\frac{1}{n} H_n(X) = H_1(X) = h(p) = -p \log p - (1-p) \log(1-p)$.
[Plot: $h(p)$ for $p \in (0,1)$; concave, vanishing at $p \in \{0,1\}$, maximum $1$ at $p = 1/2$.]
Example: a 2-block code
$A = \{1,2,3\}$, $n = 2$, $P = 0.3\,\delta_1 + 0.6\,\delta_2 + 0.1\,\delta_3$ i.i.d.
$H(X) = -0.3 \log(0.3) - 0.6 \log(0.6) - 0.1 \log(0.1) \approx 1.295$ bits.
Coding the symbols in pairs with a prefix code $\phi_2$ on $A^2$ (codeword lengths: $1$ for $22$; $3$ for $12$ and $21$; $4$ for $11$, $23$ and $32$; $5$ for $13$; $6$ for $31$ and $33$) gives
$\frac{1}{2} E[|\phi_2(X_1^2)|] \approx 1.335$ bits per symbol.
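A short check of both numbers (a sketch only: the pair-codeword lengths below form one Huffman code for the pair distribution, chosen to match the slide's 1.335 bits per symbol; the original codeword table is not reproduced here):

```python
from math import log2
from itertools import product

p = {1: 0.3, 2: 0.6, 3: 0.1}
H = -sum(q * log2(q) for q in p.values())
print(H)  # ~1.295 bits per symbol

# One prefix code on pairs with the lengths quoted on the slide
# (a Huffman code for the pair distribution; the 0/1 assignment is free):
length = {(2, 2): 1, (1, 2): 3, (2, 1): 3, (1, 1): 4, (2, 3): 4, (3, 2): 4,
          (1, 3): 5, (3, 1): 6, (3, 3): 6}
E = sum(p[a] * p[b] * length[(a, b)] for a, b in product(p, repeat=2))
print(E / 2)  # ~1.335 bits per symbol: already closer to the entropy
```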
Entropy Rate
Theorem (Shannon-McMillan-Breiman): if $P$ is stationary and ergodic, then $\frac{1}{n} H_n(X) \to H(X)$.
Interpretation: $H(X)$ = minimal number of bits necessary to code each symbol of $X$.
Kraft's inequality, reversed: for each code $\phi_n$ there exists a (sub-)probability distribution $Q_n$ such that $Q_n(x) = 2^{-|\phi_n(x)|}$, i.e. $-\log Q_n(x) = |\phi_n(x)|$.
Arithmetic Coding: a constructive reciprocal
$\phi_n(x_1^n)$ = the $\lceil -\log Q_n(x_1^n) \rceil + 1$ first bits of the interval midpoint $\frac{Q_n(X_1^n \le x_1^n) + Q_n(X_1^n < x_1^n)}{2}$.
Ex: $A = \{1,2,3\}$, $n = 2$, $Q_2 = [0.3, 0.6, 0.1]^{\otimes 2}$; the blocks $11, 12, 13, 21, 22, 23, 31, 32, 33$ partition $[0,1)$ at the cumulative probabilities $0.09, 0.27, 0.3, 0.48, 0.84, 0.9, 0.93, 0.99$.
$I(11) = [0, 0.09)$: $m(11) = 0.045 = 0.00001011\dots_2$. As $\lceil -\log 0.09 \rceil + 1 = 5$, $\phi_2(11) = 00001$.
$I(22) = [0.48, 0.84)$: $m(22) = 0.66 = 0.10101000\dots_2$. As $\lceil -\log 0.36 \rceil + 1 = 3$, $\phi_2(22) = 101$.
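A minimal sketch of this Shannon-Fano-Elias construction for i.i.d. blocks, under the convention above (midpoint of the interval, $\lceil -\log Q(x) \rceil + 1$ bits):

```python
from math import ceil, log2, prod
from itertools import product

def sfe_code(x, q):
    """Shannon-Fano-Elias codeword of the block x under the i.i.d. law q."""
    p = lambda block: prod(q[s] for s in block)
    below = sum(p(b) for b in product(sorted(q), repeat=len(x)) if b < x)
    m = below + p(x) / 2                  # midpoint of the interval of x
    length = ceil(-log2(p(x))) + 1        # codeword length
    bits = ""
    for _ in range(length):               # binary expansion of the midpoint
        m *= 2
        bits += str(int(m))
        m -= int(m)
    return bits

q = {1: 0.3, 2: 0.6, 3: 0.1}
print(sfe_code((1, 1), q))  # '00001' : interval [0, .09), midpoint .045
print(sfe_code((2, 2), q))  # '101'   : interval [.48, .84), midpoint .66
```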
Coding distributions
$-\log Q_n(x)$ = code length for $x$ with coding distribution $Q_n$.
Shannon's theorem: the best coding distribution is $Q_n = P_n$.
Using another distribution causes an expected overhead called redundancy:
$E_P[|\phi_n(X_1^n)|] - H(X_1^n) = E_P[-\log Q_n(X_1^n) + \log P_n(X_1^n)] = KL(P_n, Q_n)$,
the Kullback-Leibler divergence between $P_n$ and $Q_n$.
Universal coding
What if the source statistics are unknown? What if we need a versatile code?
$\Rightarrow$ need a single coding distribution $Q^n$ for a whole class of sources $\Lambda = \{P_\theta, \theta \in \Theta\}$. Ex: memoryless processes, Markov chains, HMM...
$\Rightarrow$ unavoidable redundancy, in the worst case: $\sup_{\theta \in \Theta} E_{P_\theta}[|\phi_n(X_1^n)|] - H(X_1^n) = \sup_{\theta \in \Theta} KL(P_\theta^n, Q^n)$
$\Rightarrow$ $Q^n$ must be close to all the $\{P_\theta, \theta \in \Theta\}$ in KL divergence.
Predictive Coding
What if we want to code messages with different or large lengths?
Consistent coding family: $\forall x_1^n \in A^n$, $Q_n(x_1^n) = \sum_{a \in A} Q_{n+1}(x_1^n a)$.
Define $Q(a \mid x_1^n) = \frac{Q(x_1^n a)}{\sum_{b \in A} Q(x_1^n b)}$; then $Q_n(x_1^n) = \prod_{j=1}^n Q(x_j \mid x_1^{j-1})$.
Predictive coding scheme: update $Q(\cdot \mid x_1^n)$ sequentially.
Example: Laplace code $Q(a \mid x_1^n) = \frac{n(a, x_1^n) + 1}{n + |A|}$, where $n(a, x_1^n) = \sum_{j=1}^n \mathbb{1}\{x_j = a\}$.
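A sketch of the Laplace predictive coder: the code length $-\log Q_n(x_1^n)$ is accumulated one symbol at a time, with no prior knowledge of the source:

```python
from math import log2

def laplace_codelength(x, alphabet):
    """Code length -log2 Q_n(x) of the Laplace predictive coder."""
    counts = {a: 0 for a in alphabet}
    bits = 0.0
    for i, s in enumerate(x):
        q = (counts[s] + 1) / (i + len(alphabet))   # Q(s | x_1 ... x_i)
        bits += -log2(q)
        counts[s] += 1
    return bits

x = "acgtacgtaaaa"
print(laplace_codelength(x, "acgt"))  # sequential: no parameter known in advance
```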
Density estimation with Logarithmic Loss
Feature space: the context $x_1^{k-1} \in A^{k-1}$.
Instant loss: number of bits used, $l(Q, x_1^k) = -\log Q(x_k \mid x_1^{k-1})$.
Best predictor in average: $Q^*(\cdot \mid x_1^{k-1}) = P_\theta(X_k = \cdot \mid X_1^{k-1} = x_1^{k-1})$.
Excess loss: $l(Q, x_1^k) - l(Q^*, x_1^k) = \log \frac{P_\theta(X_k = x_k \mid X_1^{k-1} = x_1^{k-1})}{Q(x_k \mid x_1^{k-1})}$, so that
$E_{P_\theta}[l(Q, x_1^k) - l(Q^*, x_1^k)] = KL\big(P_\theta(X_k = \cdot \mid X_1^{k-1} = x_1^{k-1}),\, Q(\cdot \mid x_1^{k-1})\big)$.
Total excess loss: $L_n(Q) = \sum_{k=1}^n l(Q, x_1^k) - l(Q^*, x_1^k) = \log \frac{P_\theta^n(x_1^n)}{Q_n(x_1^n)}$.
Total expected excess loss: $E_{P_\theta} L_n(Q) = KL(P_\theta^n, Q_n)$.
Outline (current section): Bayesian coding
Bayesian predictive coding
Let $\pi$ be any prior on $\Theta$, and take $Q_n(x_1^n) = \int P_\theta^n(x_1^n)\, \pi(d\theta)$.
Predictive coding distribution: $Q(a \mid x_1^n) = \int P_\theta(X_{n+1} = a \mid x_1^n)\, \pi(d\theta \mid x_1^n)$.
Conjugate priors $\Rightarrow$ easy update rules for $Q(\cdot \mid x_1^n)$.
Idea: $\forall \theta \in \Theta$, $Q_n(x_1^n) \gtrsim P_{\theta \pm \delta}^n(x_1^n)\, \pi(\theta \pm \delta)$: the mixture is never much worse than any nearby parameter, weighted by its prior mass.
Example: i.i.d. classes
A stationary memoryless source is parameterized by $\theta \in \mathcal{S}_A = \{(\theta_1, \dots, \theta_{|A|}) : \theta_j \ge 0,\ \sum_{j=1}^{|A|} \theta_j = 1\}$.
Conjugate Dirichlet prior $\pi = \mathcal{D}(\alpha_1, \dots, \alpha_{|A|})$. Posterior: $\Pi(d\theta \mid x_1^n) = \mathcal{D}(\alpha_1 + n_1, \dots, \alpha_{|A|} + n_{|A|})$
$\Rightarrow Q(a \mid x_1^n) = \frac{n_a + \alpha_a}{n + \sum_{j=1}^{|A|} \alpha_j}$.
Laplace code = special case $\alpha_1 = \dots = \alpha_{|A|} = 1$.
The Krichevsky-Trofimov Mixture
Dirichlet prior with $\alpha_1 = \dots = \alpha_{|A|} = 1/2$:
$Q_{KT}(x_1^n) = \int_{\mathcal{S}_A} P_\theta(x_1^n)\, \frac{\Gamma(|A|/2)}{\Gamma(1/2)^{|A|}} \prod_{a \in A} \theta_a^{-1/2}\, d\theta$.
Jeffreys prior = non-informative, peaked on extreme parameter values.
[Plot for $|A| = 2$: the prior density $\nu(\theta_1)$, diverging near $\theta_1 = 0$ and $\theta_1 = 1$.]
Example, $|A| = 2$: $Q_{KT}(110) = Q_{KT}(1)\, Q_{KT}(1 \mid 1)\, Q_{KT}(0 \mid 11) = \frac{1}{2} \cdot \frac{1/2 + 1}{2} \cdot \frac{1/2}{3} = \frac{1}{16}$.
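The same predictive mechanics with $\alpha = 1/2$ gives a tiny KT implementation; it reproduces the slide's $Q_{KT}(110) = 1/16$:

```python
from math import log2

def kt_prob(x, alphabet="01"):
    """Sequential Krichevsky-Trofimov probability Q_KT(x): Dirichlet(1/2) mixture."""
    counts = {a: 0 for a in alphabet}
    p = 1.0
    for i, s in enumerate(x):
        p *= (counts[s] + 0.5) / (i + len(alphabet) / 2)  # (n_a + 1/2)/(n + m/2)
        counts[s] += 1
    return p

print(kt_prob("110"))          # 0.0625 = 1/2 * 3/4 * 1/6 = 1/16
print(-log2(kt_prob("110")))   # 4.0 bits
```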
Optimality
Theorem (Krichevsky-Trofimov 81): if $Q_{KT}^n$ is the KT-mixture code and $m = |A|$, then
$\max_{\theta \in \Theta} KL(P_\theta^n, Q_{KT}^n) = \frac{m-1}{2} \log n + \log \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$.
Theorem (Rissanen 84): if $\dim \Theta = k$, and if there exists a $\sqrt{n}$-consistent estimator of $\theta$ given $X_1^n$, then
$\liminf_n \min_{Q^n} \max_{\theta \in \Theta} \frac{KL(P_\theta^n, Q^n)}{\frac{k}{2} \log n} \ge 1$.
Theorem (Haussler 96): there exists a Bayesian code $Q^n$ reaching the minimax redundancy.
Optimality: more precisely
Theorem (Krichevsky-Trofimov 81): if $Q_{KT}^n$ is the KT-mixture code and $m = |A|$, then
$\max_{\theta \in \Theta} KL(P_\theta^n, Q_{KT}^n) = \frac{m-1}{2} \log n + \log \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$.
Theorem (Xie-Barron 97): $\min_{Q^n} \max_{\theta \in \Theta} KL(P_\theta^n, Q^n) = \frac{m-1}{2} \log \frac{n}{2\pi e} + \log \frac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$.
Actually, a pointwise approximation holds: $-\log Q_{KT}^n(x_1^n) \le \inf_\theta -\log P_\theta^n(x_1^n) + \frac{|A|-1}{2} \log n + C$.
Sources with memory
i.i.d. sources on finite alphabets are very well handled...
...but they provide poor models for most applications!
Example: Shannon considered Markov models for natural language.
Problem: a Markov chain of order 3 on an alphabet of size 26 has $k = 26^3 \times 25 = 439400$ degrees of freedom!
$\min_{Q^n} \max_{\theta \in \Theta} KL(P_\theta^n, Q^n) \approx \frac{k}{2} \log n$
$\Rightarrow$ need for more parsimonious classes.
Outline (current section): Context-tree Models and Double Mixture: Rissanen's model
Context Tree Sources: formal definition
Context tree: $\mathcal{T} \subset A^*$ such that every semi-infinite sequence $x = x_{-\infty}^{-1}$ has a unique suffix $s \in \mathcal{T}$; this unique suffix is denoted by $\mathcal{T}(x)$.
A string $s$ is a context of a stationary ergodic process $P$ if $P(X_{-|s|}^{-1} = s) > 0$ and if, for all $x$ admitting $s$ as a suffix,
$P(X_0 = a \mid X_{-\infty}^{-1} = x) = P(X_0 = a \mid X_{-|s|}^{-1} = s)$.
$\mathcal{T}$ is the context tree of $P$ if $\mathcal{T}$ is a context tree, contains all contexts of $P$, and none of its elements can be replaced by a proper suffix without violating one of these properties.
Context Tree Sources: informal definition
A context tree source is a Markov chain whose order is allowed to depend on the past data.
Example: a binary source whose context tree has four leaves, e.g. $\mathcal{T} = \{1, 10, 100, 000\}$, with leaf parameters $(P(0 \mid s), P(1 \mid s))$ given by $(3/4, 1/4)$, $(1/5, 4/5)$, $(1/3, 2/3)$ and $(2/3, 1/3)$.
[Animation: each new symbol is drawn according to the parameter attached to the unique leaf that is a suffix of the current past; the successive conditional probabilities highlighted on the slides are $3/4$, $1/3$, $4/5$, $1/3$, $2/3$.]
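A sampler for such a source (an illustrative sketch: the tree and the assignment of the four parameter pairs to leaves follow the reconstruction above, which is an assumption about the original figure):

```python
import random

# Hypothetical depth-3 binary context tree source: values are P(X_t = 1 | s),
# matching the pairs (P(0|s), P(1|s)) quoted on the slide.
tree = {"1": 0.25, "10": 0.8, "100": 2 / 3, "000": 1 / 3}

def context(past, tree):
    """Return the unique suffix of `past` that is a leaf of the tree."""
    for k in range(1, len(past) + 1):
        if past[-k:] in tree:
            return past[-k:]
    raise ValueError("past not covered by the tree")

def sample(tree, n, past="000"):
    """Draw n symbols, each using the parameter of the current context."""
    x = past
    for _ in range(n):
        s = context(x, tree)
        x += "1" if random.random() < tree[s] else "0"
    return x[len(past):]

random.seed(0)
print(sample(tree, 20))
```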
Context Trees versus Markov Chains
Markov chain of order $r$ = context tree source whose tree is the complete tree of depth $r$.
[Figure: the transition matrix $M$ of an order-3 Markov chain, one row $p(\cdot \mid \cdot)$ per length-3 past.]
Finite context tree source of depth $d$ = Markov chain of order $d$ whose transition matrix has tied rows: all pasts sharing the same context $s$ share the row $p(\cdot \mid s)$.
[Figure: the same matrix with only a few distinct rows, one per context.]
$\Rightarrow$ much more flexibility: a large number of models per dimension.
Infinite trees, unbounded memory
There exist (useful and simple) non-Markovian context tree sources; see Galves, Ferrari, Comets, ...
Ex: renewal processes. $X = \dots 1 \underbrace{0 \cdots 0}_{N_i} 1 \underbrace{0 \cdots 0}_{N_{i+1}} 1 \dots$ with i.i.d. inter-arrival distances $N_i \sim \mu \in \mathcal{M}_1(\mathbb{N}^*)$:
$P\big(X_0 = 1 \mid X_{-j}^{-1} = 1\,0^{j-1}\big) = \frac{\mu(j)}{\mu([j, \infty))}$.
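A tiny illustration of this formula (the inter-arrival law `mu` below is hypothetical; the function just computes the hazard rate $\mu(j)/\mu([j,\infty))$):

```python
# Conditional law of a renewal process, seen as a context tree source:
# after a 1 followed by j-1 zeros, the probability of a 1 is the hazard rate.
mu = {1: 0.5, 2: 0.3, 3: 0.2}   # illustrative inter-arrival law on {1, 2, 3}

def p_one(j, mu):
    """P(next symbol is 1 | context 1 0^{j-1}) = mu(j) / mu([j, oo))."""
    tail = sum(p for k, p in mu.items() if k >= j)
    return mu.get(j, 0.0) / tail

print([p_one(j, mu) for j in (1, 2, 3)])  # [0.5, 0.6, 1.0]
```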
Outline (current section): Coding in Context Tree models
Bayes Codes in a Context Tree Model
For a given context tree $\mathcal{T}$, let $\Lambda_\mathcal{T} = \{P \text{ stationary ergodic on } A : \mathcal{T}(P) = \mathcal{T}\}$.
$P \in \Lambda_\mathcal{T}$ is fully characterized by $\theta = (\theta_s)_{s \in \mathcal{T}} \in \Theta_\mathcal{T} = \mathcal{S}_A^\mathcal{T} \subset \mathbb{R}^{|\mathcal{T}|(|A|-1)}$:
$P\big(X_0 = \cdot \mid X_{-\infty}^{-1} = x\big) = \theta_{\mathcal{T}(x)}$.
Bayesian coding: prior $\pi_\mathcal{T}$ on $\Theta_\mathcal{T}$, $Q_{\pi_\mathcal{T}}(x_1^n) = \int P_\theta(x_1^n)\, d\pi_\mathcal{T}(\theta)$.
Krichevsky-Trofimov Mixtures for Context Trees
Prior: the $\theta_s$ are independent, $\mathcal{D}(\frac{1}{2}, \dots, \frac{1}{2})$-distributed: $\pi_\mathcal{T}(d\theta) = \prod_{s \in \mathcal{T}} \pi_{KT}(d\theta_s)$.
For $s \in \mathcal{T}$, let $\mathcal{T}(x_1^n, s)$ = the substring of $x_1^n$ made of the symbols that occur in context $s$.
Example: for $\mathcal{T} = \{0, 01, 11\}$, $\mathcal{T}(x_1^n, 0)$ collects the symbols following a $0$, $\mathcal{T}(x_1^n, 01)$ those following $01$, and $\mathcal{T}(x_1^n, 11)$ those following $11$.
The KT-mixture on model $\mathcal{T}$ is: $Q_\mathcal{T}^n(x_1^n) = \prod_{s \in \mathcal{T}} Q_{KT}^{|\mathcal{T}(x_1^n, s)|}(\mathcal{T}(x_1^n, s))$.
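A sketch of this double product (assuming, as above, the example tree $\mathcal{T} = \{0, 01, 11\}$; the binary KT mixture is re-implemented here so the snippet stays self-contained):

```python
def kt_prob(x):
    """Binary Krichevsky-Trofimov probability, as on the KT slide."""
    n0 = n1 = 0
    p = 1.0
    for s in x:
        p *= ((n0 if s == "0" else n1) + 0.5) / (n0 + n1 + 1)
        n0, n1 = n0 + (s == "0"), n1 + (s == "1")
    return p

def q_tree(x, past, tree):
    """Q_T(x | past) = product over s in T of Q_KT(T(x, s))."""
    subseq = {s: "" for s in tree}
    h = past
    for sym in x:
        s = next(t for t in tree if h.endswith(t))   # unique suffix of h in T
        subseq[s] += sym
        h += sym
    p = 1.0
    for s in tree:
        p *= kt_prob(subseq[s])
    return p

T = ["0", "01", "11"]   # assumed reconstruction of the example tree
print(q_tree("01110100", past="0", tree=T))
```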
Optimality
Theorem (Willems, Shtarkov, Tjalkens 93): there is a constant $C$ such that
$-\log Q_\mathcal{T}^n(x_1^n) \le \inf_{\theta \in \mathcal{S}_A^\mathcal{T}} -\log P_\theta(x_1^n) + \frac{|\mathcal{T}|(|A|-1)}{2} \log \frac{n}{|\mathcal{T}|} + C\,|\mathcal{T}|$
$\Rightarrow$ first-order optimal.
What if we don't know which model $\mathcal{T}$ to use? $\Rightarrow$ model selection.
Minimum Description Length Principle
Occam's razor (Jorma Rissanen 78): choose the model that gives the shortest description of the data.
Information theory: objective codelength = codelength of a minimax coder.
Estimator associated with the minimax mixtures $(\pi_\mathcal{T})_\mathcal{T}$:
$\hat{\mathcal{T}}_{KT} = \arg\min_\mathcal{T} -\log \int_{\theta \in \Theta_\mathcal{T}} P_\theta^n(x_1^n)\, \pi_\mathcal{T}(d\theta)$.
Looking directly at the bounds:
$\hat{\mathcal{T}}_{BIC} = \arg\min_\mathcal{T} \inf_{\theta \in \Theta_\mathcal{T}} -\log P_\theta^n(x_1^n) + \frac{|\mathcal{T}|(|A|-1)}{2} \log n$
= penalized maximum-likelihood estimator with BIC penalty.
Consistency Results
Theorems (Csiszár & Shields 96, Csiszár & Talata 04, G. 05, Galves & Leonardi 07, Leonardi 09):
- the BIC estimator is strongly consistent;
- the Krichevsky-Trofimov estimator is not consistent.
Many rates of convergence, minimal penalties, ...
The Krichevsky-Trofimov estimator fails to identify Bernoulli(1/2) i.i.d. sources!
One can derive from this bounds on $\sup_\mathcal{T} \sup_{\theta \in \Theta_\mathcal{T}} KL\big(P_\theta^n, Q_{\hat{\mathcal{T}}}^n\big)$.
Outline (current section): The Context-Tree Weighting Method
Double mixture
$\nu$ = measure on the set $\mathcal{T}$ of all context trees, defined by $\nu(T) = 2^{-2|T|+1}$: this is a probability distribution.
Context-Tree Weighting coding distribution: $Q_{CTW}^n(x_1^n) = \sum_T \nu(T)\, Q_T^n(x_1^n)$.
Theorem (Willems, Shtarkov, Tjalkens 93), an oracle inequality: with $m = |A|$,
$-\log Q_{CTW}^n(x_1^n \mid x_{-\infty}^0) \le \inf_T \inf_{\theta \in \Theta_T} -\log p_\theta(x_1^n \mid x_{-\infty}^0) + \frac{(m-1)|T|}{2} \log \frac{n}{|T|} + |T|(2 + \log m) + \frac{m}{2}$.
CTW: an Efficient Algorithm
Idea: in each node, compute the arithmetic mean of the node's own KT probability ("self-prob") and the product of its children's weighted probabilities ("sub-prob"):
$P_w^s = \frac{1}{2} P_{KT}^s + \frac{1}{2} P_w^{0s} P_w^{1s}$ at internal nodes, and $P_w^s = P_{KT}^s$ at the leaves.
Example, with counts $(n_0, n_1)$ at each node of a depth-2 binary tree: root $(4,3)$, children $(2,2)$ and $(2,1)$, leaves $(2,0)$, $(0,2)$, $(1,1)$, $(1,0)$.
- Leaves: $P_{KT}(2,0) = P_{KT}(0,2) = \frac{3}{8}$, $P_{KT}(1,1) = \frac{1}{8}$, $P_{KT}(1,0) = \frac{1}{2}$.
- Node $(2,2)$: $P_w = \frac{1}{2}\left(\frac{3}{128} + \frac{3}{8} \cdot \frac{3}{8}\right) = \frac{21}{256}$.
- Node $(2,1)$: $P_w = \frac{1}{2}\left(\frac{1}{16} + \frac{1}{2} \cdot \frac{1}{8}\right) = \frac{1}{16}$.
- Root $(4,3)$: $P_w = \frac{1}{2}\left(\frac{5}{2048} + \frac{21}{256} \cdot \frac{1}{16}\right) = \frac{31}{8192}$.
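The whole computation fits in a few lines of Python; the sequence $x = 1100100$ with past $00$ is a reconstruction chosen to reproduce the slide's counts, and the program returns exactly $31/8192$:

```python
from fractions import Fraction

def kt_frac(n0, n1):
    """KT probability of a binary string with counts (n0, n1): (n_a+1/2)/(n+1)."""
    p, a, b = Fraction(1), 0, 0
    for _ in range(n0):
        p *= Fraction(2 * a + 1, 2 * (a + b) + 2); a += 1
    for _ in range(n1):
        p *= Fraction(2 * b + 1, 2 * (a + b) + 2); b += 1
    return p

def counts(x, past, s):
    """Counts (n0, n1) of the symbols of x occurring in context s."""
    h = past + x
    sub = [h[i] for i in range(len(past), len(h)) if h[:i].endswith(s)]
    return sub.count("0"), sub.count("1")

def ctw(x, past, s="", depth=2):
    """Weighted probability P_w(s) = 1/2 P_KT(s) + 1/2 P_w(0s) P_w(1s)."""
    pe = kt_frac(*counts(x, past, s))
    if len(s) == depth:
        return pe
    return Fraction(1, 2) * (pe + ctw(x, past, "0" + s, depth)
                                  * ctw(x, past, "1" + s, depth))

print(ctw("1100100", past="00"))  # 31/8192, matching the slide
```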
Outline (current section): Context Tree Weighting on Renewal Processes
Renewal Processes
Let $\mathcal{R}$ be the (non-parametric) class of renewal processes on the binary alphabet $\{0,1\}$:
$X = \dots 1 \underbrace{0 \cdots 0}_{N_i} 1 \underbrace{0 \cdots 0}_{N_{i+1}} 1 \dots$, with the $N_i$ i.i.d. $\sim \mu \in \mathcal{M}_1(\mathbb{N}^*)$.
Theorem (Csiszár and Shields 96): there exist two constants $c$ and $C$ such that
$c \sqrt{n} \le \inf_{Q^n} \sup_{P \in \mathcal{R}} KL(P^n, Q^n) \le C \sqrt{n}$
$\Rightarrow$ first example of a class of intermediate complexity. Problem: the proof is non-constructive.
Does a general-purpose coder behave well on renewal processes?
Redundancy of the CTW method on Renewal Processes
Theorem (G. 04): there exist constants $c$ and $C$ such that
$c \sqrt{n} \log n \le \sup_{P \in \mathcal{R}} KL(P^n, Q_{CTW}^n)$ and, for all $P \in \mathcal{R}$, $KL(P^n, Q_{CTW}^n) \le C \sqrt{n} \log n$.
- CTW is efficient on long-memory processes.
- Adaptivity: if the renewal distribution is bounded, CTW achieves regret $O(\log n)$ (contrary to ad-hoc coders).
- Requires deep contexts in the double mixture ($\Rightarrow$ the tree should not be cut off at depth $\log n$).
- A kind of non-parametric estimation: need for a balance between approximation and estimation.
Why it works
[Figure: a comb-shaped context tree of depth $k$, with leaves $1, 01, 001, \dots, 0^{k-1}1$ and $0^k$.]
$P^n$ is approximated by $Q_T^n$ with $T$ as above. Two losses:
- estimating the transition parameters: $\approx \frac{k \log n}{2}$;
- approximation due to the bounded depth: $\approx \frac{n \log k}{k}$.
$\Rightarrow$ for $k = c\sqrt{n}$, both are $O(\sqrt{n} \log n)$.
Extension: Markovian Renewal Processes
A Markovian renewal process is such that the successive distances between occurrences of the symbol $1$ form a Markov chain on $\mathbb{N}^*$:
$X = \dots 1 \underbrace{0 \cdots 0}_{N_i} 1 \underbrace{0 \cdots 0}_{N_{i+1}} 1 \dots$, with $(N_i)_i$ a Markov chain.
Theorem (Csiszár and Shields 96): there exist two constants $c$ and $C$ such that
$c\, n^{2/3} \le \inf_{Q^n} \sup_{P \in \mathcal{MR}} KL(P^n, Q^n) \le C\, n^{2/3}$.
Theorem (G. 04): CTW is almost adaptive on the class $\mathcal{MR}$ of Markovian renewal processes:
there exists $C$ such that $\forall P \in \mathcal{MR}$, $KL(P^n, Q_{CTW}^n) \le C\, n^{2/3} \log n$.
Markovian Renewal Processes are Context Tree Sources
And so are $r$-order Markov renewal processes.
[Figure: the associated (infinite) context tree, with branches labeled $s_1, s_2, \dots, s_k$.]
Outline (current section): Perspectives
Perspectives: oracle inequalities on more general classes?
CTW is probably more useful than expected: it obtains good results on infinite-memory processes.
We want to obtain oracle inequalities for other non-Markovian classes.
But the setting is more difficult than usual. And approximation theory is difficult!
Ex: let $P_H$ be an HMM with 3 hidden states; what is $\inf_{|T| \le k} \inf_{\theta \in \mathcal{S}_A^T} KL(P_H^n, P_\theta^n)$?
The end... Thank you for your attention!