Coding on Countably Infinite Alphabets

Coding on Countably Infinite Alphabets: Non-parametric Information Theory

Outline
1. Lossless coding on infinite alphabets: source coding; universal coding; infinite alphabets.
2. Envelope classes: theoretical properties; algorithms.

Data Compression: Shannon's Model


Source Coding

Source Entropy
Shannon ('48): for any uniquely decodable code $\phi_n$,
$\mathbb{E}_P\big[\,|\phi_n(X_1^n)|\,\big] \;\ge\; H_n(P) = \mathbb{E}_P\big[-\log P^n(X_1^n)\big]$,
the n-block entropy, and there is a code reaching the bound (within 1 bit).
Moreover, if $P$ is stationary and ergodic,
$\frac{1}{n} H_n(P) \;\to\; H(P)$,
the entropy rate of the source $P$, i.e. the minimal number of bits necessary per symbol.
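As a quick numerical illustration (not part of the slides), the Python sketch below computes the n-block entropy of a memoryless source and the expected length of a Shannon code, which stays within one bit per symbol of the bound; the distribution p is an arbitrary example.

    import math

    def block_entropy(p, n):
        """n-block entropy H_n(P) of a memoryless source, in bits (equals n * H_1(P))."""
        h1 = -sum(q * math.log2(q) for q in p if q > 0)
        return n * h1

    def shannon_code_length(p):
        """Expected per-symbol length of a Shannon code, with lengths ceil(-log2 p)."""
        return sum(q * math.ceil(-math.log2(q)) for q in p if q > 0)

    p = [0.5, 0.25, 0.125, 0.125]      # example first marginal (dyadic, so the code is exactly optimal)
    n = 10
    print(block_entropy(p, n))         # H_n(P) = 17.5 bits
    print(n * shannon_code_length(p))  # 17.5 bits as well: the entropy bound is reached here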

Coding Distributions
Kraft inequality: for any prefix code $\phi_n$,
$\sum_{x_1^n} 2^{-|\phi_n(x_1^n)|} \le 1$,
and conversely, any lengths satisfying it are achievable by a prefix code.
Code $\phi$ $\leftrightarrow$ coding distribution $Q^n_\phi(x) = 2^{-|\phi_n(x)|}$.
Shannon's '48 theorem expresses that, on average, the best coding distribution is $P$ itself. Otherwise, the coding distribution $Q^n$ suffers the regret
$-\log Q^n(X_1^n) - \big(-\log P^n(X_1^n)\big) = \log \dfrac{P^n(X_1^n)}{Q^n(X_1^n)}$.
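Both quantities above are easy to compute; the following Python sketch (an illustration, with arbitrary example lengths and distributions) checks the Kraft inequality for a list of codeword lengths and evaluates the pointwise regret $\log_2 P^n(x)/Q^n(x)$ on a sample string.

    import math

    def kraft_sum(lengths):
        """Kraft sum; a prefix code with these codeword lengths exists iff the sum is <= 1."""
        return sum(2.0 ** (-l) for l in lengths)

    def pointwise_regret(x, p, q):
        """log2 P^n(x)/Q^n(x) for memoryless P and Q with first marginals p and q."""
        return sum(math.log2(p[s] / q[s]) for s in x)

    print(kraft_sum([1, 2, 3, 3]))                  # 1.0 -> achievable, e.g. {0, 10, 110, 111}
    p = {0: 0.5, 1: 0.25, 2: 0.25}                  # true source
    q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}              # coding distribution actually used
    print(pointwise_regret([0, 0, 1, 2, 0], p, q))  # extra bits paid on this sample for using q instead of p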

Universal Coding

Universal Coders
What if the source statistics are unknown? What if we need a versatile code?
$\Rightarrow$ We need a single coding distribution $Q^n$ for a whole class of sources $\Lambda = \{P_\theta,\ \theta\in\Theta\}$.
Examples: memoryless processes, Markov chains, HMMs...
$\Rightarrow$ Unavoidable redundancy:
$\mathbb{E}_{P_\theta}\big[\,|\phi(X_1^n)|\,\big] - H(X_1^n) = \mathbb{E}_{P_\theta}\big[-\log Q^n(X_1^n) + \log P_\theta^n(X_1^n)\big] = \mathrm{KL}\big(P_\theta^n, Q^n\big)$,
the Kullback-Leibler divergence between $P_\theta^n$ and $Q^n$.
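For a memoryless $P_\theta$ and a memoryless coding distribution $Q$, the redundancy above factorizes as $n\,\mathrm{KL}(P_\theta^1, Q^1)$; the Python sketch below (an illustration with arbitrary example distributions) computes it in bits.

    import math

    def kl_bits(p, q):
        """Kullback-Leibler divergence KL(p, q) in bits, for finite-support distributions."""
        return sum(pi * math.log2(pi / q[s]) for s, pi in p.items() if pi > 0)

    p_theta = {0: 0.7, 1: 0.2, 2: 0.1}   # true first marginal
    q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # uniform coding distribution
    n = 100
    print(n * kl_bits(p_theta, q))       # expected extra bits over n symbols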

Coders: Measures of Universality
1. Maximal regret:
$R^*(Q^n,\Lambda) = \sup_{\theta\in\Theta}\ \sup_{x_1^n\in\mathcal{A}^n} \log\dfrac{P_\theta^n(x_1^n)}{Q^n(x_1^n)}$.
2. Worst-case redundancy:
$R^+(Q^n,\Lambda) = \sup_{\theta\in\Theta} \mathbb{E}_{P_\theta}\!\left[\log\dfrac{P_\theta^n(X_1^n)}{Q^n(X_1^n)}\right] = \sup_{\theta\in\Theta} \mathrm{KL}\big(P_\theta^n, Q^n\big)$.
3. Expected redundancy with respect to a prior $\pi$:
$R_\pi(Q^n,\Lambda) = \mathbb{E}_\pi\!\left[\mathbb{E}_{P_\theta}\!\left[\log\dfrac{P_\theta^n(X_1^n)}{Q^n(X_1^n)}\right]\right] = \mathbb{E}_\pi\big[\mathrm{KL}(P_\theta^n, Q^n)\big]$.
These satisfy $R_\pi(Q^n,\Lambda) \le R^+(Q^n,\Lambda) \le R^*(Q^n,\Lambda)$.

Classes: Measures of Complexity
1. Minimax regret:
$R^*_n(\Lambda) = \inf_{Q^n} R^*(Q^n,\Lambda) = \min_{Q^n}\ \max_{x_1^n,\,\theta} \log\dfrac{P_\theta^n(x_1^n)}{Q^n(x_1^n)}$.
2. Minimax redundancy:
$R^+_n(\Lambda) = \inf_{Q^n} R^+(Q^n,\Lambda) = \min_{Q^n}\ \max_{\theta} \mathrm{KL}\big(P_\theta^n, Q^n\big)$.
3. Maximin redundancy:
$R^-_n(\Lambda) = \sup_{\pi}\ \inf_{Q^n} R_\pi(Q^n,\Lambda) = \max_{\pi}\ \min_{Q^n} \mathbb{E}_\pi\big[\mathrm{KL}(P_\theta^n, Q^n)\big]$.
These satisfy $R^-_n(\Lambda) \le R^+_n(\Lambda) \le R^*_n(\Lambda)$.

General Standard Results
Minimax theorem (Haussler '97, Sion): $R^-_n(\Lambda) = R^+_n(\Lambda)$.
Shtarkov's NML: $R^*_n(\Lambda) = R^*(Q^n_{\mathrm{NML}},\Lambda)$ with
$Q^n_{\mathrm{NML}}(x) = \dfrac{\hat p(x)}{\sum_{y\in\mathcal{X}^n}\hat p(y)}$ and $\hat p(x) = \sup_{P\in\Lambda} P^n(x)$,
and hence $R^*_n(\Lambda) = \log \sum_{y\in\mathcal{X}^n}\hat p(y)$.
Channel capacity method: if $\mathrm{d}\mu(\theta,x) = \mathrm{d}\pi(\theta)\,\mathrm{d}P_\theta(x)$, then
$\inf_{Q^n} R_\pi(Q^n,\Lambda) = H_\mu(X_1^n) - H_\mu(X_1^n\mid\theta) = H_\mu(\theta) - H_\mu(\theta\mid X_1^n)$.
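Shtarkov's formula can be evaluated exactly for small n by enumerating all strings; the brute-force Python sketch below (an illustration, exponential in n) does this for a binary memoryless class, where $\hat p(x)$ is the maximized likelihood $(k/n)^k((n-k)/n)^{n-k}$ with $k$ the number of ones in $x$.

    import math
    from itertools import product

    def max_likelihood(x):
        """sup_P P^n(x) over binary memoryless sources: plug in the empirical frequencies."""
        n, k = len(x), sum(x)
        return (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0

    def shtarkov_regret(n):
        """Minimax regret R*_n = log2 sum_x sup_P P^n(x) for the binary memoryless class."""
        return math.log2(sum(max_likelihood(x) for x in product((0, 1), repeat=n)))

    for n in (1, 2, 4, 8, 16):
        # grows like (1/2) log2 n plus a constant, as in the finite-alphabet theorem below
        print(n, shtarkov_regret(n))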

Finite Alphabets: Standard Results
Theorem (Shtarkov, Barron, Szpankowski, ...). If $\mathcal{I}_m$ denotes the class of memoryless processes over the alphabet $\{1,\dots,m\}$, then
$R^+_n(\mathcal{I}_m) = \dfrac{m-1}{2}\log\dfrac{n}{2\pi e} + \log\dfrac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$,
$R^*_n(\mathcal{I}_m) = \dfrac{m-1}{2}\log\dfrac{n}{2\pi} + \log\dfrac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$;
both grow like $\frac{m-1}{2}\log n + O(1)$.
Theorem (Rissanen '84). If $\dim\Theta = k$, and if there exists a $\sqrt{n}$-consistent estimator of $\theta$ given $X_1^n$, then
$\liminf_{n\to\infty}\ \dfrac{R^+_n(\Lambda)}{\frac{k}{2}\log n} \ge 1$.

Infinite Alphabets

Infinite Alphabets: Issues
Motivations: integer coding, lossless image coding, text coding on words, mathematical curiosity.
1. Understand general structural properties of minimax redundancy and minimax regret.
2. Characterize those source classes that have finite minimax regret.
3. Establish quantitative relations between minimax redundancy or regret and integrability of the envelope function.
4. Develop effective coding techniques for source classes with a known non-trivial minimax redundancy rate.
5. Develop adaptive coding schemes for collections of source classes that are too large to have non-trivial redundancy rates.

Sanity-check Properties
1. $R^*(\Lambda^n) < +\infty$ if and only if the NML coding probability $Q^n_{\mathrm{NML}}$ is well-defined; it is then given by
$Q^n_{\mathrm{NML}}(x) = \dfrac{\hat p(x)}{\sum_{y\in\mathcal{X}^n}\hat p(y)}$ for $x\in\mathcal{X}^n$, where $\hat p(x) = \sup_{P\in\Lambda} P^n(x)$.
2. $R^+(\Lambda^n)$ and $R^*(\Lambda^n)$ are non-decreasing, sub-additive (possibly infinite) functions of $n$. Thus
$\lim_{n\to\infty} \dfrac{R^+(\Lambda^n)}{n} = \inf_{n\in\mathbb{N}^+} \dfrac{R^+(\Lambda^n)}{n} \le R^+(\Lambda^1)$
and
$\lim_{n\to\infty} \dfrac{R^*(\Lambda^n)}{n} = \inf_{n\in\mathbb{N}^+} \dfrac{R^*(\Lambda^n)}{n} \le R^*(\Lambda^1)$.

Memoryless Sources
1. Let $\Lambda$ be a class of stationary memoryless sources over a countably infinite alphabet, and let $\hat p$ be defined by $\hat p(x) = \sup_{P\in\Lambda} P\{x\}$. The minimax regret with respect to $\Lambda^n$ is finite if and only if the normalized maximum likelihood (Shtarkov) coding probability is well-defined, and
$R^*(\Lambda^n) < \infty \iff \sum_{x\in\mathbb{N}^+}\hat p(x) < \infty$.
2. We give an example of a memoryless class $\Lambda$ such that $R^+(\Lambda^n) < \infty$ but $R^*(\Lambda^n) = \infty$.

Envelope Classes: Theoretical Properties

Redundancy and Regret Equivalence Property
Definition. $\Lambda_f = \{P :\ \forall x\in\mathbb{N},\ P^1\{x\}\le f(x),\ \text{and } P \text{ is stationary memoryless}\}$.
Theorem. Let $f$ be a non-negative function from $\mathbb{N}^+$ to $[0,1]$, and let $\Lambda_f$ be the class of stationary memoryless sources defined by the envelope $f$. Then
$R^+(\Lambda_f^n) < \infty \iff R^*(\Lambda_f^n) < \infty \iff \sum_{k\in\mathbb{N}^+} f(k) < \infty$.
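As a quick sanity check (an illustration, not from the slides), the Python sketch below estimates partial sums of $\sum_k f(k)$ for two example envelopes, a summable power-law envelope and a non-summable one; by the theorem above, summability decides whether the envelope class has finite minimax regret.

    def envelope_sum(f, kmax=10**6):
        """Partial sum of the envelope; a crude numerical proxy for summability."""
        return sum(f(k) for k in range(1, kmax + 1))

    # Power-law envelope f(k) = min(1, C / k^alpha): summable iff alpha > 1.
    power_law = lambda k, C=4.0, alpha=2.0: min(1.0, C / k ** alpha)
    harmonic = lambda k: min(1.0, 1.0 / k)          # alpha = 1: not summable

    print(envelope_sum(power_law))   # converges: finite minimax regret for Lambda_f
    print(envelope_sum(harmonic))    # keeps growing like log(kmax): infinite regret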

Idea of the Proof
The only thing to prove is that $\sum_{k\in\mathbb{N}^+} f(k) = \infty \Rightarrow R^+(\Lambda_f^n) = \infty$.
Let the sequence of integers $(h_i)_{i\in\mathbb{N}}$ be defined recursively by $h_0 = 0$ and
$h_{i+1} = \min\big\{h : \sum_{k=h_i+1}^{h} f(k) > 1\big\}$.
We consider the class $\Lambda = \{P_\theta,\ \theta\in\Theta\}$ with $\Theta = \mathbb{N}$, where the memoryless source $P_i$ is defined by its first marginal
$P_i^1(m) = \dfrac{f(m)}{\sum_{k=h_i+1}^{h_{i+1}} f(k)}$ for $m\in\{h_i+1,\dots,h_{i+1}\}$.
Taking any infinite-entropy prior $\pi$ over $\Theta$, the class $\Lambda^1 = \{P_i^1 ;\ i\in\mathbb{N}^+\}$ shows that
$R^+(\Lambda^1) \ge R_\pi(\Lambda^1) = H(\theta) - H(\theta\mid X_1) = \infty$.

Upper-bounding the Regret
Theorem. If $\Lambda$ is a class of memoryless sources, let the tail function $\bar F_{\Lambda^1}$ be defined by $\bar F_{\Lambda^1}(u) = \sum_{k>u}\hat p(k)$. Then
$R^*(\Lambda^n) \le \inf_{u:\,u\le n}\Big[ n\,\bar F_{\Lambda^1}(u)\log e + \dfrac{u-1}{2}\log n\Big] + 2$.
Corollary. Let $\Lambda$ denote a class of memoryless sources; then the following holds:
$R^*(\Lambda^n) < \infty \;\Rightarrow\; R^*(\Lambda^n) = o(n)$ and $R^+(\Lambda^n) = o(n)$.
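The bound above is easy to evaluate numerically; the Python sketch below (an illustration, using the power-law envelope $f(k)=\min(1, C/k^\alpha)$ in place of $\hat p$, with example constants) optimizes the cutoff u by brute force.

    import math

    def regret_upper_bound(p_hat, n, tail_cap=10**5):
        """Evaluate inf_{u<=n} [ n*F(u)*log2(e) + (u-1)/2*log2(n) ] + 2 for a given envelope."""
        # Precompute truncated tails F(u) = sum_{k>u} p_hat(k) for u = 1..n.
        total = sum(p_hat(k) for k in range(1, tail_cap + 1))
        tails, running = [], total
        for u in range(1, n + 1):
            running -= p_hat(u)
            tails.append(running)            # tails[u-1] approximates F(u)
        best = min(n * tails[u - 1] * math.log2(math.e) + (u - 1) / 2 * math.log2(n)
                   for u in range(1, n + 1))
        return best + 2

    p_hat = lambda k, C=4.0, alpha=2.0: min(1.0, C / k ** alpha)
    for n in (100, 1000, 10000):
        print(n, regret_upper_bound(p_hat, n))   # grows sublinearly in n, consistent with o(n)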

Lower-bounding the Redundancy
Theorem. Let $f$ denote a non-increasing, summable envelope function. For any integer $p$, let $c(p) = \sum_{k=1}^{p} f(2k)$, and let $c(\infty) = \sum_{k\ge 1} f(2k)$. Assume furthermore that $c(\infty) > 1$. Let $p\in\mathbb{N}^+$ be such that $c(p) > 1$, and let $n\in\mathbb{N}^+$, $\epsilon > 0$ and $\lambda\in(0,1)$ be such that $n > \dfrac{10\,c(p)}{f(2p)\,\epsilon(1-\lambda)}$. Then
$R^+(\Lambda_f^n) \ge C(p,n,\lambda,\epsilon)\ \sum_{i=1}^{p}\Big(\dfrac{1}{2}\log\dfrac{(1-\lambda)\,\pi\, n f(2i)}{2\,c(p)\,e} - \epsilon\Big)$,
where
$C(p,n,\lambda,\epsilon) = \dfrac{1}{1+\frac{c(p)}{\lambda^2 n f(2p)}}\Big(1 - \dfrac{4}{\pi}\sqrt{\dfrac{5\,c(p)}{(1-\lambda)\,\epsilon\, n f(2p)}}\Big)$.

Power-law Classes
Theorem. Let $\alpha$ denote a real number larger than 1, and let $C$ be such that $C\zeta(\alpha) \ge 2^\alpha$. The source class $\Lambda_{C\cdot^{-\alpha}}$ is the envelope class associated with the decreasing function $f_{\alpha,C}: x \mapsto 1 \wedge \frac{C}{x^\alpha}$, for $C > 1$ and $\alpha > 1$. Then:
1. $n^{1/\alpha}\, A(\alpha)\,\log\big(C\zeta(\alpha)\big)^{1/\alpha} \;\le\; R^+\big(\Lambda^n_{C\cdot^{-\alpha}}\big)$, where
$A(\alpha) = \dfrac{1}{\alpha}\int_1^\infty \dfrac{1}{u^{1-1/\alpha}}\Big(1 - e^{-1/(\zeta(\alpha)u)}\Big)\,\mathrm{d}u$;
2. $R^*\big(\Lambda^n_{C\cdot^{-\alpha}}\big) \;\le\; \Big(\dfrac{2Cn}{\alpha-1}\Big)^{1/\alpha}(\log n)^{1-1/\alpha} + O(1)$.
Hence the minimax redundancy and regret of this class are both of order $n^{1/\alpha}$, up to logarithmic factors.

Exponential Classes
Theorem. Let $C$ and $\alpha$ denote positive real numbers satisfying $C > e^{2\alpha}$. The class $\Lambda_{Ce^{-\alpha\cdot}}$ is the envelope class associated with the function $f_\alpha : x \mapsto 1 \wedge C e^{-\alpha x}$. Then
$\dfrac{1}{8\alpha}\log^2 n\,\big(1 - o(1)\big) \;\le\; R^+\big(\Lambda^n_{Ce^{-\alpha\cdot}}\big) \;\le\; R^*\big(\Lambda^n_{Ce^{-\alpha\cdot}}\big) \;\le\; \dfrac{1}{2\alpha}\log^2 n + O(1)$.
Cf. Dominique Bontemps: $R^+\big(\Lambda^n_{Ce^{-\alpha\cdot}}\big) \sim \dfrac{1}{4\alpha}\log^2 n$.

Envelope Classes: Algorithms

The CensoringCode Algorithm

Algorithm CensoringCode:
    K ← 0
    counts ← [1/2, 1/2, ...]
    C1 ← empty string
    for i from 1 to n do
        cutoff ← (4Ci / (α - 1))^(1/α)
        if cutoff > K then
            for j from K + 1 to cutoff do
                counts[0] ← counts[0] - counts[j] + 1/2
            end for
            K ← cutoff
        end if
        if x[i] ≤ cutoff then
            ArithCode(x[i], counts[0 : cutoff])
        else
            ArithCode(0, counts[0 : cutoff])
            C1 ← C1 · EliasCode(x[i])
            counts[0] ← counts[0] + 1
        end if
        counts[x[i]] ← counts[x[i]] + 1
    end for
    C2 ← ArithCode()        (flush the arithmetic coder)
    output C1 · C2

Theorem. Let $C$ and $\alpha$ be positive reals, and let the sequence of cutoffs $(K_i)_{i\le n}$ be given by
$K_i = \Big(\dfrac{4Ci}{\alpha-1}\Big)^{1/\alpha}$.
Then
$R^+\big(\text{CensoringCode}, \Lambda^n_{C\cdot^{-\alpha}}\big) \le \Big(\dfrac{4Cn}{\alpha-1}\Big)^{1/\alpha}\log n\,\big(1 + o(1)\big)$.
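To make the structure concrete, here is a minimal Python sketch of the censoring idea (an illustration only: it replaces the arithmetic coder by ideal codelengths, i.e. minus log2 of the adaptive probabilities, and uses Elias gamma codelengths for the censored symbols; the constants C and alpha are example values, not prescribed by the slides).

    import math

    def elias_gamma_length(m):
        """Length in bits of the Elias gamma code for a positive integer m."""
        return 2 * math.floor(math.log2(m)) + 1

    def censoring_codelength(x, C=2.0, alpha=2.0):
        """Ideal codelength (in bits) of a CensoringCode-like scheme on the integer sequence x."""
        counts = {0: 0.5}            # index 0 is the escape symbol
        K = 0
        bits = 0.0
        for i, sym in enumerate(x, start=1):
            cutoff = int((4 * C * i / (alpha - 1)) ** (1 / alpha))
            for j in range(K + 1, cutoff + 1):
                # symbols j move from censored to directly coded:
                # give their past occurrences back from the escape count
                counts.setdefault(j, 0.5)
                counts[0] += -counts[j] + 0.5
            K = max(K, cutoff)
            total = counts[0] + sum(counts.get(j, 0.5) for j in range(1, K + 1))
            if sym <= K:
                bits += -math.log2(counts.get(sym, 0.5) / total)   # "arithmetic" part
            else:
                bits += -math.log2(counts[0] / total)              # escape symbol
                bits += elias_gamma_length(sym)                    # censored symbol sent in clear
                counts[0] += 1
            counts[sym] = counts.get(sym, 0.5) + 1
        return bits

    print(censoring_codelength([1, 2, 1, 5, 1, 3, 12, 1, 2, 2]))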

Approximately Adaptive Algorithms
A sequence $(Q^n)_n$ of coding probabilities is said to be approximately asymptotically adaptive with respect to a collection $(\Lambda_m)_{m\in\mathcal{M}}$ of source classes if, for each $P\in\bigcup_{m\in\mathcal{M}}\Lambda_m$ and each $\Lambda_m$ such that $P\in\Lambda_m$,
$D(P^n, Q^n)\,/\,R^+(\Lambda^n_m) = O(\log n)$.
A modification of CensoringCode that chooses the $(n+1)$-th cutoff $K_{n+1}$ according to the number of distinct symbols in $x$ is approximately adaptive on
$\mathcal{W}_\alpha = \big\{P : P\in\Lambda_\alpha,\ 0 < \liminf_{k} k^\alpha P^1(k) \le \limsup_{k} k^\alpha P^1(k) < \infty\big\}$.
Pattern coding is approximately adaptive if $1 < \alpha \le 5/2$.

Patterns
The information conveyed in a message $x$ can be separated into:
1. a dictionary, the sequence of distinct symbols occurring in $x$ in order of appearance;
2. a pattern $\psi = \psi(x)$, where $\psi_i$ is the rank of $x_i$ in the dictionary.
Example: message $x$ = a b r a c a d a b r a; pattern $\psi(x)$ = 1 2 3 1 4 1 5 1 2 3 1; dictionary = (a, b, r, c, d).
A random process $(X_n)_n$ with distribution $P$ induces a random pattern process $(\Psi_n)_n$ on $\mathbb{N}^+$ with distribution
$P^\Psi(\Psi_1^n = \psi_1^n) = \sum_{x_1^n :\ \psi(x_1^n) = \psi_1^n} P(X_1^n = x_1^n)$.
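The dictionary and pattern of a message are straightforward to compute; the Python sketch below (an illustration) reproduces the abracadabra example.

    def pattern_and_dictionary(x):
        """Return (pattern, dictionary): ranks by first appearance and distinct symbols in order."""
        dictionary, rank, pattern = [], {}, []
        for s in x:
            if s not in rank:
                dictionary.append(s)
                rank[s] = len(dictionary)     # ranks start at 1
            pattern.append(rank[s])
        return pattern, dictionary

    print(pattern_and_dictionary("abracadabra"))
    # ([1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1], ['a', 'b', 'r', 'c', 'd'])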

Pattern Coding
The pattern entropy rate exists and coincides with the process entropy rate:
$\frac{1}{n} H(\Psi_1^n) = \frac{1}{n}\mathbb{E}_{P^\Psi}\big[-\log P^\Psi(\Psi_1^n)\big] \;\to\; H(\Psi) = H(X)$.
Pattern redundancy satisfies
$1.84\Big(\dfrac{n}{\log n}\Big)^{1/3} \;\le\; R^+_{\Psi,n}(\mathcal{I}_\infty) \;\le\; R^*_{\Psi,n}(\mathcal{I}_\infty) \;\le\; \pi\sqrt{\tfrac{2n}{3}}\,\log e$.
The proof of the lower bound uses fine combinatorics on integer partitions with small summands (see Garivier '09). Upper bounds in $O(n^{2/5})$ have been given (see Shamir '07), but there is still a gap between lower and upper bounds.
For memoryless coding, the pattern redundancy is negligible with respect to the codelength of the dictionary whenever $\alpha < 5/2$.
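For a tiny memoryless source the induced pattern distribution $P^\Psi$ can be computed directly by enumeration, as in this Python sketch (an illustration with an arbitrary example distribution): it enumerates all strings of length n and aggregates their probabilities by pattern.

    from itertools import product

    def pattern(x):
        """Rank of each symbol by order of first appearance (ranks start at 1)."""
        rank = {}
        return tuple(rank.setdefault(s, len(rank) + 1) for s in x)

    def pattern_distribution(p, n):
        """Induced pattern distribution P^Psi for n symbols drawn iid from p (symbol -> prob)."""
        dist = {}
        for x in product(p.keys(), repeat=n):
            prob = 1.0
            for s in x:
                prob *= p[s]
            psi = pattern(x)
            dist[psi] = dist.get(psi, 0.0) + prob
        return dist

    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    for psi, prob in sorted(pattern_distribution(p, 3).items()):
        print(psi, round(prob, 4))     # e.g. (1, 1, 1) aggregates "aaa", "bbb" and "ccc"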