Coding on Countably Infinite Alphabets

Coding on Countably Infinite Alphabets: Non-parametric Information Theory

Outline
1. Lossless coding on infinite alphabets: source coding; universal coding; infinite alphabets.
2. Envelope classes: theoretical properties; algorithms.

Data Compression: Shannon's Model


Source Coding

Source Entropy
Shannon ('48): for any uniquely decodable code $\phi_n$,
$\mathbb{E}_P\big[\,|\phi_n(X_1^n)|\,\big] \;\ge\; H_n(P) = \mathbb{E}_P\big[-\log P^n(X_1^n)\big]$,
the n-block entropy, and there is a code reaching the bound (within 1 bit).
Moreover, if $P$ is stationary and ergodic,
$\frac{1}{n} H_n(P) \;\to\; H(P)$,
the entropy rate of the source $P$, i.e. the minimal number of bits necessary per symbol.
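As a quick numerical illustration (not part of the slides), the Python sketch below computes the n-block entropy of a memoryless source and the expected length of a Shannon code, which stays within one bit per symbol of the bound; the distribution p is an arbitrary example.

    import math

    def block_entropy(p, n):
        """n-block entropy H_n(P) of a memoryless source, in bits (equals n * H_1(P))."""
        h1 = -sum(q * math.log2(q) for q in p if q > 0)
        return n * h1

    def shannon_code_length(p):
        """Expected per-symbol length of a Shannon code, with lengths ceil(-log2 p)."""
        return sum(q * math.ceil(-math.log2(q)) for q in p if q > 0)

    p = [0.5, 0.25, 0.125, 0.125]      # example first marginal (dyadic, so the code is exactly optimal)
    n = 10
    print(block_entropy(p, n))         # H_n(P) = 17.5 bits
    print(n * shannon_code_length(p))  # 17.5 bits as well: the entropy bound is reached here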

Coding Distributions
Kraft inequality: for any prefix code $\phi_n$,
$\sum_{x_1^n} 2^{-|\phi_n(x_1^n)|} \le 1$,
and conversely, any lengths satisfying it are achievable by a prefix code.
Code $\phi$ $\leftrightarrow$ coding distribution $Q^n_\phi(x) = 2^{-|\phi_n(x)|}$.
Shannon's '48 theorem expresses that, on average, the best coding distribution is $P$ itself. Otherwise, the coding distribution $Q^n$ suffers the regret
$-\log Q^n(X_1^n) - \big(-\log P^n(X_1^n)\big) = \log \dfrac{P^n(X_1^n)}{Q^n(X_1^n)}$.
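Both quantities above are easy to compute; the following Python sketch (an illustration, with arbitrary example lengths and distributions) checks the Kraft inequality for a list of codeword lengths and evaluates the pointwise regret $\log_2 P^n(x)/Q^n(x)$ on a sample string.

    import math

    def kraft_sum(lengths):
        """Kraft sum; a prefix code with these codeword lengths exists iff the sum is <= 1."""
        return sum(2.0 ** (-l) for l in lengths)

    def pointwise_regret(x, p, q):
        """log2 P^n(x)/Q^n(x) for memoryless P and Q with first marginals p and q."""
        return sum(math.log2(p[s] / q[s]) for s in x)

    print(kraft_sum([1, 2, 3, 3]))                  # 1.0 -> achievable, e.g. {0, 10, 110, 111}
    p = {0: 0.5, 1: 0.25, 2: 0.25}                  # true source
    q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}              # coding distribution actually used
    print(pointwise_regret([0, 0, 1, 2, 0], p, q))  # extra bits paid on this sample for using q instead of p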

Universal Coding

Universal Coders
What if the source statistics are unknown? What if we need a versatile code?
$\Rightarrow$ We need a single coding distribution $Q^n$ for a whole class of sources $\Lambda = \{P_\theta,\ \theta\in\Theta\}$.
Examples: memoryless processes, Markov chains, HMMs...
$\Rightarrow$ Unavoidable redundancy:
$\mathbb{E}_{P_\theta}\big[\,|\phi(X_1^n)|\,\big] - H(X_1^n) = \mathbb{E}_{P_\theta}\big[-\log Q^n(X_1^n) + \log P_\theta^n(X_1^n)\big] = \mathrm{KL}\big(P_\theta^n, Q^n\big)$,
the Kullback-Leibler divergence between $P_\theta^n$ and $Q^n$.
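For a memoryless $P_\theta$ and a memoryless coding distribution $Q$, the redundancy above factorizes as $n\,\mathrm{KL}(P_\theta^1, Q^1)$; the Python sketch below (an illustration with arbitrary example distributions) computes it in bits.

    import math

    def kl_bits(p, q):
        """Kullback-Leibler divergence KL(p, q) in bits, for finite-support distributions."""
        return sum(pi * math.log2(pi / q[s]) for s, pi in p.items() if pi > 0)

    p_theta = {0: 0.7, 1: 0.2, 2: 0.1}   # true first marginal
    q = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}   # uniform coding distribution
    n = 100
    print(n * kl_bits(p_theta, q))       # expected extra bits over n symbols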

Coders: Measures of Universality
1. Maximal regret:
$R^*(Q^n,\Lambda) = \sup_{\theta\in\Theta}\ \sup_{x_1^n\in\mathcal{A}^n} \log\dfrac{P_\theta^n(x_1^n)}{Q^n(x_1^n)}$.
2. Worst-case redundancy:
$R^+(Q^n,\Lambda) = \sup_{\theta\in\Theta} \mathbb{E}_{P_\theta}\!\left[\log\dfrac{P_\theta^n(X_1^n)}{Q^n(X_1^n)}\right] = \sup_{\theta\in\Theta} \mathrm{KL}\big(P_\theta^n, Q^n\big)$.
3. Expected redundancy with respect to a prior $\pi$:
$R_\pi(Q^n,\Lambda) = \mathbb{E}_\pi\!\left[\mathbb{E}_{P_\theta}\!\left[\log\dfrac{P_\theta^n(X_1^n)}{Q^n(X_1^n)}\right]\right] = \mathbb{E}_\pi\big[\mathrm{KL}(P_\theta^n, Q^n)\big]$.
These satisfy $R_\pi(Q^n,\Lambda) \le R^+(Q^n,\Lambda) \le R^*(Q^n,\Lambda)$.

Classes: Measures of Complexity
1. Minimax regret:
$R^*_n(\Lambda) = \inf_{Q^n} R^*(Q^n,\Lambda) = \min_{Q^n}\ \max_{x_1^n,\,\theta} \log\dfrac{P_\theta^n(x_1^n)}{Q^n(x_1^n)}$.
2. Minimax redundancy:
$R^+_n(\Lambda) = \inf_{Q^n} R^+(Q^n,\Lambda) = \min_{Q^n}\ \max_{\theta} \mathrm{KL}\big(P_\theta^n, Q^n\big)$.
3. Maximin redundancy:
$R^-_n(\Lambda) = \sup_{\pi}\ \inf_{Q^n} R_\pi(Q^n,\Lambda) = \max_{\pi}\ \min_{Q^n} \mathbb{E}_\pi\big[\mathrm{KL}(P_\theta^n, Q^n)\big]$.
These satisfy $R^-_n(\Lambda) \le R^+_n(\Lambda) \le R^*_n(\Lambda)$.

General Standard Results
Minimax theorem (Haussler '97, Sion): $R^-_n(\Lambda) = R^+_n(\Lambda)$.
Shtarkov's NML: $R^*_n(\Lambda) = R^*(Q^n_{\mathrm{NML}},\Lambda)$ with
$Q^n_{\mathrm{NML}}(x) = \dfrac{\hat p(x)}{\sum_{y\in\mathcal{X}^n}\hat p(y)}$ and $\hat p(x) = \sup_{P\in\Lambda} P^n(x)$,
and hence $R^*_n(\Lambda) = \log \sum_{y\in\mathcal{X}^n}\hat p(y)$.
Channel capacity method: if $\mathrm{d}\mu(\theta,x) = \mathrm{d}\pi(\theta)\,\mathrm{d}P_\theta(x)$, then
$\inf_{Q^n} R_\pi(Q^n,\Lambda) = H_\mu(X_1^n) - H_\mu(X_1^n\mid\theta) = H_\mu(\theta) - H_\mu(\theta\mid X_1^n)$.
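Shtarkov's formula can be evaluated exactly for small n by enumerating all strings; the brute-force Python sketch below (an illustration, exponential in n) does this for a binary memoryless class, where $\hat p(x)$ is the maximized likelihood $(k/n)^k((n-k)/n)^{n-k}$ with $k$ the number of ones in $x$.

    import math
    from itertools import product

    def max_likelihood(x):
        """sup_P P^n(x) over binary memoryless sources: plug in the empirical frequencies."""
        n, k = len(x), sum(x)
        return (k / n) ** k * ((n - k) / n) ** (n - k) if 0 < k < n else 1.0

    def shtarkov_regret(n):
        """Minimax regret R*_n = log2 sum_x sup_P P^n(x) for the binary memoryless class."""
        return math.log2(sum(max_likelihood(x) for x in product((0, 1), repeat=n)))

    for n in (1, 2, 4, 8, 16):
        # grows like (1/2) log2 n plus a constant, as in the finite-alphabet theorem below
        print(n, shtarkov_regret(n))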

Finite Alphabets: Standard Results
Theorem (Shtarkov, Barron, Szpankowski, ...). If $\mathcal{I}_m$ denotes the class of memoryless processes over the alphabet $\{1,\dots,m\}$, then
$R^+_n(\mathcal{I}_m) = \dfrac{m-1}{2}\log\dfrac{n}{2\pi e} + \log\dfrac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$,
$R^*_n(\mathcal{I}_m) = \dfrac{m-1}{2}\log\dfrac{n}{2\pi} + \log\dfrac{\Gamma(1/2)^m}{\Gamma(m/2)} + o_m(1)$;
both grow like $\frac{m-1}{2}\log n + O(1)$.
Theorem (Rissanen '84). If $\dim\Theta = k$, and if there exists a $\sqrt{n}$-consistent estimator of $\theta$ given $X_1^n$, then
$\liminf_{n\to\infty}\ \dfrac{R^+_n(\Lambda)}{\frac{k}{2}\log n} \ge 1$.

Infinite Alphabets

Infinite Alphabets: Issues
Motivations: integer coding, lossless image coding, text coding on words, mathematical curiosity.
1. Understand general structural properties of minimax redundancy and minimax regret.
2. Characterize those source classes that have finite minimax regret.
3. Establish quantitative relations between minimax redundancy or regret and integrability of the envelope function.
4. Develop effective coding techniques for source classes with a known non-trivial minimax redundancy rate.
5. Develop adaptive coding schemes for collections of source classes that are too large to have non-trivial redundancy rates.

Sanity-check Properties
1. $R^*(\Lambda^n) < +\infty$ if and only if the NML coding probability $Q^n_{\mathrm{NML}}$ is well-defined; it is then given by
$Q^n_{\mathrm{NML}}(x) = \dfrac{\hat p(x)}{\sum_{y\in\mathcal{X}^n}\hat p(y)}$ for $x\in\mathcal{X}^n$, where $\hat p(x) = \sup_{P\in\Lambda} P^n(x)$.
2. $R^+(\Lambda^n)$ and $R^*(\Lambda^n)$ are non-decreasing, sub-additive (possibly infinite) functions of $n$. Thus
$\lim_{n\to\infty} \dfrac{R^+(\Lambda^n)}{n} = \inf_{n\in\mathbb{N}^+} \dfrac{R^+(\Lambda^n)}{n} \le R^+(\Lambda^1)$
and
$\lim_{n\to\infty} \dfrac{R^*(\Lambda^n)}{n} = \inf_{n\in\mathbb{N}^+} \dfrac{R^*(\Lambda^n)}{n} \le R^*(\Lambda^1)$.

Memoryless Sources
1. Let $\Lambda$ be a class of stationary memoryless sources over a countably infinite alphabet, and let $\hat p$ be defined by $\hat p(x) = \sup_{P\in\Lambda} P\{x\}$. The minimax regret with respect to $\Lambda^n$ is finite if and only if the normalized maximum likelihood (Shtarkov) coding probability is well-defined, and
$R^*(\Lambda^n) < \infty \iff \sum_{x\in\mathbb{N}^+}\hat p(x) < \infty$.
2. We give an example of a memoryless class $\Lambda$ such that $R^+(\Lambda^n) < \infty$ but $R^*(\Lambda^n) = \infty$.

Envelope Classes: Theoretical Properties

Redundancy and Regret Equivalence Property
Definition. $\Lambda_f = \{P :\ \forall x\in\mathbb{N},\ P^1\{x\}\le f(x),\ \text{and } P \text{ is stationary memoryless}\}$.
Theorem. Let $f$ be a non-negative function from $\mathbb{N}^+$ to $[0,1]$, and let $\Lambda_f$ be the class of stationary memoryless sources defined by the envelope $f$. Then
$R^+(\Lambda_f^n) < \infty \iff R^*(\Lambda_f^n) < \infty \iff \sum_{k\in\mathbb{N}^+} f(k) < \infty$.
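As a quick sanity check (an illustration, not from the slides), the Python sketch below estimates partial sums of $\sum_k f(k)$ for two example envelopes, a summable power-law envelope and a non-summable one; by the theorem above, summability decides whether the envelope class has finite minimax regret.

    def envelope_sum(f, kmax=10**6):
        """Partial sum of the envelope; a crude numerical proxy for summability."""
        return sum(f(k) for k in range(1, kmax + 1))

    # Power-law envelope f(k) = min(1, C / k^alpha): summable iff alpha > 1.
    power_law = lambda k, C=4.0, alpha=2.0: min(1.0, C / k ** alpha)
    harmonic = lambda k: min(1.0, 1.0 / k)          # alpha = 1: not summable

    print(envelope_sum(power_law))   # converges: finite minimax regret for Lambda_f
    print(envelope_sum(harmonic))    # keeps growing like log(kmax): infinite regret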

Idea of the Proof
The only thing to prove is that $\sum_{k\in\mathbb{N}^+} f(k) = \infty \Rightarrow R^+(\Lambda_f^n) = \infty$.
Let the sequence of integers $(h_i)_{i\in\mathbb{N}}$ be defined recursively by $h_0 = 0$ and
$h_{i+1} = \min\big\{h : \sum_{k=h_i+1}^{h} f(k) > 1\big\}$.
We consider the class $\Lambda = \{P_\theta,\ \theta\in\Theta\}$ with $\Theta = \mathbb{N}$, where the memoryless source $P_i$ is defined by its first marginal
$P_i^1(m) = \dfrac{f(m)}{\sum_{k=h_i+1}^{h_{i+1}} f(k)}$ for $m\in\{h_i+1,\dots,h_{i+1}\}$.
Taking any infinite-entropy prior $\pi$ over $\Theta$, the class $\Lambda^1 = \{P_i^1 ;\ i\in\mathbb{N}^+\}$ shows that
$R^+(\Lambda^1) \ge R_\pi(\Lambda^1) = H(\theta) - H(\theta\mid X_1) = \infty$.

Upper-bounding the Regret
Theorem. If $\Lambda$ is a class of memoryless sources, let the tail function $\bar F_{\Lambda^1}$ be defined by $\bar F_{\Lambda^1}(u) = \sum_{k>u}\hat p(k)$. Then
$R^*(\Lambda^n) \le \inf_{u:\,u\le n}\Big[ n\,\bar F_{\Lambda^1}(u)\log e + \dfrac{u-1}{2}\log n\Big] + 2$.
Corollary. Let $\Lambda$ denote a class of memoryless sources; then the following holds:
$R^*(\Lambda^n) < \infty \;\Rightarrow\; R^*(\Lambda^n) = o(n)$ and $R^+(\Lambda^n) = o(n)$.
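The bound above is easy to evaluate numerically; the Python sketch below (an illustration, using the power-law envelope $f(k)=\min(1, C/k^\alpha)$ in place of $\hat p$, with example constants) optimizes the cutoff u by brute force.

    import math

    def regret_upper_bound(p_hat, n, tail_cap=10**5):
        """Evaluate inf_{u<=n} [ n*F(u)*log2(e) + (u-1)/2*log2(n) ] + 2 for a given envelope."""
        # Precompute truncated tails F(u) = sum_{k>u} p_hat(k) for u = 1..n.
        total = sum(p_hat(k) for k in range(1, tail_cap + 1))
        tails, running = [], total
        for u in range(1, n + 1):
            running -= p_hat(u)
            tails.append(running)            # tails[u-1] approximates F(u)
        best = min(n * tails[u - 1] * math.log2(math.e) + (u - 1) / 2 * math.log2(n)
                   for u in range(1, n + 1))
        return best + 2

    p_hat = lambda k, C=4.0, alpha=2.0: min(1.0, C / k ** alpha)
    for n in (100, 1000, 10000):
        print(n, regret_upper_bound(p_hat, n))   # grows sublinearly in n, consistent with o(n)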

Lower-bounding the Redundancy
Theorem. Let $f$ denote a non-increasing, summable envelope function. For any integer $p$, let $c(p) = \sum_{k=1}^{p} f(2k)$, and let $c(\infty) = \sum_{k\ge 1} f(2k)$. Assume furthermore that $c(\infty) > 1$. Let $p\in\mathbb{N}^+$ be such that $c(p) > 1$, and let $n\in\mathbb{N}^+$, $\epsilon > 0$ and $\lambda\in(0,1)$ be such that $n > \dfrac{10\,c(p)}{f(2p)\,\epsilon(1-\lambda)}$. Then
$R^+(\Lambda_f^n) \ge C(p,n,\lambda,\epsilon)\ \sum_{i=1}^{p}\Big(\dfrac{1}{2}\log\dfrac{(1-\lambda)\,\pi\, n f(2i)}{2\,c(p)\,e} - \epsilon\Big)$,
where
$C(p,n,\lambda,\epsilon) = \dfrac{1}{1+\frac{c(p)}{\lambda^2 n f(2p)}}\Big(1 - \dfrac{4}{\pi}\sqrt{\dfrac{5\,c(p)}{(1-\lambda)\,\epsilon\, n f(2p)}}\Big)$.

Power-law Classes
Theorem. Let $\alpha$ denote a real number larger than 1, and let $C$ be such that $C\zeta(\alpha) \ge 2^\alpha$. The source class $\Lambda_{C\cdot^{-\alpha}}$ is the envelope class associated with the decreasing function $f_{\alpha,C}: x \mapsto 1 \wedge \frac{C}{x^\alpha}$, for $C > 1$ and $\alpha > 1$. Then:
1. $n^{1/\alpha}\, A(\alpha)\,\log\big(C\zeta(\alpha)\big)^{1/\alpha} \;\le\; R^+\big(\Lambda^n_{C\cdot^{-\alpha}}\big)$, where
$A(\alpha) = \dfrac{1}{\alpha}\int_1^\infty \dfrac{1}{u^{1-1/\alpha}}\Big(1 - e^{-1/(\zeta(\alpha)u)}\Big)\,\mathrm{d}u$;
2. $R^*\big(\Lambda^n_{C\cdot^{-\alpha}}\big) \;\le\; \Big(\dfrac{2Cn}{\alpha-1}\Big)^{1/\alpha}(\log n)^{1-1/\alpha} + O(1)$.
Hence the minimax redundancy and regret of this class are both of order $n^{1/\alpha}$, up to logarithmic factors.

Exponential Classes
Theorem. Let $C$ and $\alpha$ denote positive real numbers satisfying $C > e^{2\alpha}$. The class $\Lambda_{Ce^{-\alpha\cdot}}$ is the envelope class associated with the function $f_\alpha : x \mapsto 1 \wedge C e^{-\alpha x}$. Then
$\dfrac{1}{8\alpha}\log^2 n\,\big(1 - o(1)\big) \;\le\; R^+\big(\Lambda^n_{Ce^{-\alpha\cdot}}\big) \;\le\; R^*\big(\Lambda^n_{Ce^{-\alpha\cdot}}\big) \;\le\; \dfrac{1}{2\alpha}\log^2 n + O(1)$.
Cf. Dominique Bontemps: $R^+\big(\Lambda^n_{Ce^{-\alpha\cdot}}\big) \sim \dfrac{1}{4\alpha}\log^2 n$.

Envelope Classes: Algorithms

The CensoringCode Algorithm

Algorithm CensoringCode:
    K ← 0
    counts ← [1/2, 1/2, ...]
    C1 ← empty string
    for i from 1 to n do
        cutoff ← (4Ci / (α - 1))^(1/α)
        if cutoff > K then
            for j from K + 1 to cutoff do
                counts[0] ← counts[0] - counts[j] + 1/2
            end for
            K ← cutoff
        end if
        if x[i] ≤ cutoff then
            ArithCode(x[i], counts[0 : cutoff])
        else
            ArithCode(0, counts[0 : cutoff])
            C1 ← C1 · EliasCode(x[i])
            counts[0] ← counts[0] + 1
        end if
        counts[x[i]] ← counts[x[i]] + 1
    end for
    C2 ← ArithCode()        (flush the arithmetic coder)
    output C1 · C2

Theorem. Let $C$ and $\alpha$ be positive reals, and let the sequence of cutoffs $(K_i)_{i\le n}$ be given by
$K_i = \Big(\dfrac{4Ci}{\alpha-1}\Big)^{1/\alpha}$.
Then
$R^+\big(\text{CensoringCode}, \Lambda^n_{C\cdot^{-\alpha}}\big) \le \Big(\dfrac{4Cn}{\alpha-1}\Big)^{1/\alpha}\log n\,\big(1 + o(1)\big)$.
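To make the structure concrete, here is a minimal Python sketch of the censoring idea (an illustration only: it replaces the arithmetic coder by ideal codelengths, i.e. minus log2 of the adaptive probabilities, and uses Elias gamma codelengths for the censored symbols; the constants C and alpha are example values, not prescribed by the slides).

    import math

    def elias_gamma_length(m):
        """Length in bits of the Elias gamma code for a positive integer m."""
        return 2 * math.floor(math.log2(m)) + 1

    def censoring_codelength(x, C=2.0, alpha=2.0):
        """Ideal codelength (in bits) of a CensoringCode-like scheme on the integer sequence x."""
        counts = {0: 0.5}            # index 0 is the escape symbol
        K = 0
        bits = 0.0
        for i, sym in enumerate(x, start=1):
            cutoff = int((4 * C * i / (alpha - 1)) ** (1 / alpha))
            for j in range(K + 1, cutoff + 1):
                # symbols j move from censored to directly coded:
                # give their past occurrences back from the escape count
                counts.setdefault(j, 0.5)
                counts[0] += -counts[j] + 0.5
            K = max(K, cutoff)
            total = counts[0] + sum(counts.get(j, 0.5) for j in range(1, K + 1))
            if sym <= K:
                bits += -math.log2(counts.get(sym, 0.5) / total)   # "arithmetic" part
            else:
                bits += -math.log2(counts[0] / total)              # escape symbol
                bits += elias_gamma_length(sym)                    # censored symbol sent in clear
                counts[0] += 1
            counts[sym] = counts.get(sym, 0.5) + 1
        return bits

    print(censoring_codelength([1, 2, 1, 5, 1, 3, 12, 1, 2, 2]))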

Approximately Adaptive Algorithms
A sequence $(Q^n)_n$ of coding probabilities is said to be approximately asymptotically adaptive with respect to a collection $(\Lambda_m)_{m\in\mathcal{M}}$ of source classes if, for each $P\in\bigcup_{m\in\mathcal{M}}\Lambda_m$ and each $\Lambda_m$ such that $P\in\Lambda_m$,
$D(P^n, Q^n)\,/\,R^+(\Lambda^n_m) = O(\log n)$.
A modification of CensoringCode that chooses the $(n+1)$-th cutoff $K_{n+1}$ according to the number of distinct symbols in $x$ is approximately adaptive on
$\mathcal{W}_\alpha = \big\{P : P\in\Lambda_\alpha,\ 0 < \liminf_{k} k^\alpha P^1(k) \le \limsup_{k} k^\alpha P^1(k) < \infty\big\}$.
Pattern coding is approximately adaptive if $1 < \alpha \le 5/2$.

Patterns
The information conveyed in a message $x$ can be separated into:
1. a dictionary, the sequence of distinct symbols occurring in $x$ in order of appearance;
2. a pattern $\psi = \psi(x)$, where $\psi_i$ is the rank of $x_i$ in the dictionary.
Example: message $x$ = a b r a c a d a b r a; pattern $\psi(x)$ = 1 2 3 1 4 1 5 1 2 3 1; dictionary = (a, b, r, c, d).
A random process $(X_n)_n$ with distribution $P$ induces a random pattern process $(\Psi_n)_n$ on $\mathbb{N}^+$ with distribution
$P^\Psi(\Psi_1^n = \psi_1^n) = \sum_{x_1^n :\ \psi(x_1^n) = \psi_1^n} P(X_1^n = x_1^n)$.
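The dictionary and pattern of a message are straightforward to compute; the Python sketch below (an illustration) reproduces the abracadabra example.

    def pattern_and_dictionary(x):
        """Return (pattern, dictionary): ranks by first appearance and distinct symbols in order."""
        dictionary, rank, pattern = [], {}, []
        for s in x:
            if s not in rank:
                dictionary.append(s)
                rank[s] = len(dictionary)     # ranks start at 1
            pattern.append(rank[s])
        return pattern, dictionary

    print(pattern_and_dictionary("abracadabra"))
    # ([1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1], ['a', 'b', 'r', 'c', 'd'])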

Pattern Coding
The pattern entropy rate exists and coincides with the process entropy rate:
$\frac{1}{n} H(\Psi_1^n) = \frac{1}{n}\mathbb{E}_{P^\Psi}\big[-\log P^\Psi(\Psi_1^n)\big] \;\to\; H(\Psi) = H(X)$.
Pattern redundancy satisfies
$1.84\Big(\dfrac{n}{\log n}\Big)^{1/3} \;\le\; R^+_{\Psi,n}(\mathcal{I}_\infty) \;\le\; R^*_{\Psi,n}(\mathcal{I}_\infty) \;\le\; \pi\sqrt{\tfrac{2n}{3}}\,\log e$.
The proof of the lower bound uses fine combinatorics on integer partitions with small summands (see Garivier '09). Upper bounds in $O(n^{2/5})$ have been given (see Shamir '07), but there is still a gap between lower and upper bounds.
For memoryless coding, the pattern redundancy is negligible with respect to the codelength of the dictionary whenever $\alpha < 5/2$.
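For a tiny memoryless source the induced pattern distribution $P^\Psi$ can be computed directly by enumeration, as in this Python sketch (an illustration with an arbitrary example distribution): it enumerates all strings of length n and aggregates their probabilities by pattern.

    from itertools import product

    def pattern(x):
        """Rank of each symbol by order of first appearance (ranks start at 1)."""
        rank = {}
        return tuple(rank.setdefault(s, len(rank) + 1) for s in x)

    def pattern_distribution(p, n):
        """Induced pattern distribution P^Psi for n symbols drawn iid from p (symbol -> prob)."""
        dist = {}
        for x in product(p.keys(), repeat=n):
            prob = 1.0
            for s in x:
                prob *= p[s]
            psi = pattern(x)
            dist[psi] = dist.get(psi, 0.0) + prob
        return dist

    p = {"a": 0.5, "b": 0.3, "c": 0.2}
    for psi, prob in sorted(pattern_distribution(p, 3).items()):
        print(psi, round(prob, 4))     # e.g. (1, 1, 1) aggregates "aaa", "bbb" and "ccc"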