Context tree models for source coding


Context tree models for source coding: Toward Non-parametric Information Theory

Outline: 1. Lossless Source Coding = density estimation with log-loss (Source Coding and Universal Coding; Bayesian coding). 2. Context-tree Models and Double Mixture (Rissanen's model; Coding in Context Tree models; The Context-Tree Weighting Method). 3. Context Tree Weighting on Renewal Processes (CTW on Renewal Processes; Perspectives).

Data Compression: Shannon's modelling.


Outline (next section: Source Coding and Universal Coding).

The Problem. Finite alphabet A; e.g. A = {a, c, t, g}. Set of all finite binary sequences: {0,1}* = ∪_n {0,1}^n. If x ∈ A^n, its length is |x| = n. log = base-2 logarithm. Let X be a stochastic process on the finite alphabet A, with stationary ergodic distribution P; the marginal distribution of X_1:n is denoted by P^n. For n ∈ N, the coding function is φ_n : A^n → {0,1}*. Expected code length: E_P[|φ_n(X_1:n)|].

Example: DNA Codes. A = {a, c, t, g}. Uniform concatenation code: φ_1 maps each letter to a fixed 2-bit word and φ_n(X_1:n) = φ_1(X_1)...φ_1(X_n), so ∀x_1:n ∈ A^n, |φ_n(x_1:n)| = 2n. Improvement? Use a code ψ_1 with shorter words for some letters; problem: some bit strings then decode ambiguously, e.g. either as ct or as gc. Uniquely decodable codes: φ_n(X_1:n) = φ_m(Y_1:m) ⟹ m = n and ∀i ∈ {1,...,n}, X_i = Y_i. Instantaneous codes = prefix codes: ∀(a,b) ∈ A² with a ≠ b, ∀j ∈ N, the first j bits of φ_1(a) never equal φ_1(b), i.e. no codeword is a prefix of another.

Kraft's Inequality. Theorem (Kraft '49, McMillan '56): if φ_n is a uniquely decodable code, then Σ_{x∈A^n} 2^{−|φ_n(x)|} ≤ 1. Proof for prefix codes: let D = max_{x∈A^n} |φ_n(x)| and, for x ∈ A^n, let Z(x) = {w ∈ {0,1}^D : w_1:|φ_n(x)| = φ_n(x)}, so that #Z(x) = 2^{D−|φ_n(x)|}. Since φ_n is a prefix code, x ≠ y ⟹ Z(x) ∩ Z(y) = ∅. Hence Σ_{x∈A^n} 2^{D−|φ_n(x)|} = Σ_{x∈A^n} #Z(x) ≤ #{0,1}^D = 2^D, and dividing by 2^D gives the result.
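
To make the inequality concrete, here is a small Python sketch (my own illustration, not part of the talk): it computes the Kraft sum for a list of candidate codeword lengths and, when the sum is at most 1, builds a prefix code with exactly those lengths by the standard canonical construction. The lengths used below are arbitrary.

    def kraft_sum(lengths):
        """Return the Kraft sum: sum over codewords of 2^{-l}."""
        return sum(2.0 ** -l for l in lengths)

    def canonical_prefix_code(lengths):
        """Assign binary codewords of the given lengths; possible iff the Kraft sum is <= 1."""
        assert kraft_sum(lengths) <= 1.0 + 1e-12, "Kraft's inequality violated"
        code, next_val = {}, 0.0
        for sym, l in sorted(enumerate(lengths), key=lambda t: t[1]):
            # next_val is a dyadic number with at most l bits: read off its first l bits
            code[sym] = format(int(round(next_val * 2 ** l)), "b").zfill(l)
            next_val += 2.0 ** -l
        return code

    lengths = [1, 2, 3, 3]                 # 1/2 + 1/4 + 1/8 + 1/8 = 1
    print(kraft_sum(lengths))              # 1.0
    print(canonical_prefix_code(lengths))  # {0: '0', 1: '10', 2: '110', 3: '111'}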

Shannon's First Theorem. ⟹ If l(x) = |φ_n(x)|, we look for min_l Σ_{x∈A^n} l(x) P^n(x) under the constraint Σ_{x∈A^n} 2^{−l(x)} ≤ 1. Theorem (Shannon '48): E_P[|φ_n(X_1:n)|] ≥ H_n(X) = E_P[−log P^n(X_1:n)]. Misleading notation: H_n(X) is actually a function of P^n only, the marginal distribution of X_1:n. Tight bound: there exists a code φ_n such that E_P[|φ_n(X_1:n)|] ≤ H_n(X) + 1.
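
A minimal numerical illustration of these bounds (an added sketch with a made-up marginal distribution): the Shannon lengths l(x) = ⌈−log P^n(x)⌉ satisfy Kraft's inequality and give an expected length between H_n(X) and H_n(X) + 1.

    from math import log2, ceil, prod
    from itertools import product

    p = {"a": 0.5, "b": 0.3, "c": 0.2}     # hypothetical marginal distribution
    n = 2                                   # block length
    blocks = {"".join(x): prod(p[s] for s in x) for x in product(p, repeat=n)}

    H_n = -sum(q * log2(q) for q in blocks.values())
    lengths = {x: ceil(-log2(q)) for x, q in blocks.items()}
    E_len = sum(blocks[x] * lengths[x] for x in blocks)
    kraft = sum(2.0 ** -l for l in lengths.values())

    print(f"H_n = {H_n:.3f} bits, E[l] = {E_len:.3f} bits, Kraft sum = {kraft:.3f}")
    # Expect H_n <= E[l] < H_n + 1 and a Kraft sum <= 1.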

Example: binary entropy. A = {0,1}, X_i i.i.d. B(p), 0 < p < 1. Then (1/n) H_n(X) = H_1(X) = h(p) = −p log p − (1−p) log(1−p). [Figure: plot of the binary entropy function h(p) on (0,1), maximal at p = 1/2.]

Example: a 2-block code. A = {1, 2, 3}, n = 2, P = 0.3 δ_1 + 0.6 δ_2 + 0.1 δ_3, i.i.d. H(X) = −0.3 log(0.3) − 0.6 log(0.6) − 0.1 log(0.1) ≈ 1.295 bits. [Table: a prefix code φ_2 on the nine pairs 11, 12, 13, 21, 22, 23, 31, 32, 33, with shorter codewords for the more probable pairs such as 22.] It achieves E[|φ_2(X_1:2)|]/2 ≈ 1.335 bits per symbol.
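
The expected length above can be reproduced with any optimal prefix code on pairs; the sketch below (an added illustration, using only the distribution (0.3, 0.6, 0.1) from the slide) builds a Huffman code on the nine pairs and recovers about 2.67 bits per pair, i.e. roughly 1.335 bits per symbol.

    import heapq
    from itertools import product
    from math import log2

    p = {"1": 0.3, "2": 0.6, "3": 0.1}
    pairs = {a + b: p[a] * p[b] for a, b in product(p, repeat=2)}

    # Standard Huffman construction on the 9 pairs.
    heap = [(q, i, {x: ""}) for i, (x, q) in enumerate(pairs.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        q0, _, c0 = heapq.heappop(heap)
        q1, _, c1 = heapq.heappop(heap)
        merged = {x: "0" + w for x, w in c0.items()}
        merged.update({x: "1" + w for x, w in c1.items()})
        heapq.heappush(heap, (q0 + q1, i, merged))
        i += 1
    code = heap[0][2]

    E_len = sum(pairs[x] * len(code[x]) for x in pairs)
    H1 = -sum(q * log2(q) for q in p.values())
    print(f"E|phi_2| = {E_len:.2f} bits/pair = {E_len / 2:.3f} bits/symbol, H(X) = {H1:.3f}")
    # -> about 2.67 bits per pair, i.e. 1.335 bits per symbol, versus H(X) ~ 1.295.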

Entropy Rate. Theorem (Shannon-Breiman-McMillan): if P is stationary and ergodic, then (1/n) H_n(X) → H(X), the entropy rate of the process. Interpretation: H(X) = minimal number of bits necessary to code each symbol of X. Kraft's inequality: for each code φ_n there exists a (sub-)probability distribution Q_n such that Q_n(x) = 2^{−|φ_n(x)|}, i.e. −log Q_n(x) = |φ_n(x)|.

Arithmetic Coding: a constructive reciprocal. Take φ_n(x_1:n) = the first ⌊−log Q_n(x_1:n)⌋ + 2 bits of the midpoint m(x_1:n) = [Q_n(X_1:n ≤ x_1:n) + Q_n(X_1:n < x_1:n)] / 2 of the interval I(x_1:n) = [Q_n(X_1:n < x_1:n), Q_n(X_1:n ≤ x_1:n)). [Figure: the interval [0,1) partitioned into the subintervals I(11), I(12), ..., I(33), with endpoints 0.09, 0.27, 0.30, 0.48, 0.84, 0.90, 0.93, 0.99.] Example: A = {1,2,3}, n = 2, Q_1 = [0.3, 0.6, 0.1], Q_2 = Q_1 ⊗ Q_1. I(11) = [0, 0.09): m(11) = 0.045 = 0.00001011...₂; as ⌊−log 0.09⌋ + 2 = 5, φ_2(11) = 00001. I(22) = [0.48, 0.84): m(22) = 0.66 = 0.10101...₂; as ⌊−log 0.36⌋ + 2 = 3, φ_2(22) = 101.
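
The following added sketch implements this coding rule directly on the slide's example and recovers φ_2(11) = 00001 and φ_2(22) = 101.

    from math import floor, log2
    from itertools import product

    q1 = {"1": 0.3, "2": 0.6, "3": 0.1}
    q2 = {a + b: q1[a] * q1[b] for a, b in product("123", repeat=2)}
    pairs = sorted(q2)                      # lexicographic order 11, 12, ..., 33

    def encode(x):
        below = sum(q2[y] for y in pairs if y < x)   # Q_2(X < x)
        mid = below + q2[x] / 2.0                    # midpoint of I(x) = [Q(X<x), Q(X<=x))
        L = floor(-log2(q2[x])) + 2                  # codeword length
        bits = ""
        for _ in range(L):                           # first L bits of the binary expansion
            mid *= 2
            bit = int(mid)
            bits += str(bit)
            mid -= bit
        return bits

    print(encode("11"), encode("22"))   # 00001 101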

Coding distributions. ⟹ −log Q_n(x) = code length for x with coding distribution Q_n. Shannon's theorem: the best coding distribution is Q_n = P^n. Using another distribution causes an expected overhead called redundancy: E_P[|φ_n(X_1:n)|] − H_n(X) = E_P[−log Q_n(X_1:n) + log P^n(X_1:n)] = KL(P^n, Q_n), the Kullback-Leibler divergence of P^n from Q_n.

Universal coding. What if the source statistics are unknown? What if we need a versatile code? ⟹ Need a single coding distribution Q_n for a class of sources Λ = {P_θ, θ ∈ Θ}; e.g. memoryless processes, Markov chains, HMMs... ⟹ Unavoidable redundancy in the worst case: sup_{θ∈Θ} E_{P_θ}[|φ_n(X_1:n)|] − H_n(X) = sup_{θ∈Θ} KL(P_θ^n, Q_n). ⟹ Q_n must be close to all the {P_θ, θ ∈ Θ} in KL divergence.

Predictive Coding. What if we want to code messages with different or large lengths? Consistent coding family: ∀x_1:n ∈ A^n, Q_n(x_1:n) = Σ_{a∈A} Q_{n+1}(x_1:n a). Define Q(a | x_1:n) = Q(x_1:n a) / Σ_{b∈A} Q(x_1:n b); then Q_n(x_1:n) = Π_{j=1}^n Q(x_j | x_1:j−1). Predictive coding scheme: update Q(· | x_1:n) symbol by symbol. Example: the Laplace code Q(a | x_1:n) = (n(a, x_1:n) + 1) / (n + |A|), where n(a, x_1:n) = Σ_{j=1}^n 1{x_j = a}.
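
A short added sketch of predictive coding with the Laplace rule, accumulating the ideal code length −Σ_j log Q(x_j | x_1:j−1); the input string and alphabet below are arbitrary.

    from math import log2
    from collections import Counter

    def laplace_code_length(x, alphabet):
        counts = Counter()
        total_bits = 0.0
        for sym in x:
            q = (counts[sym] + 1) / (sum(counts.values()) + len(alphabet))
            total_bits += -log2(q)          # bits spent on this symbol
            counts[sym] += 1                # predictive update
        return total_bits

    x = "aacgtacgtaacaacgtt"
    print(f"{laplace_code_length(x, 'acgt'):.2f} bits for {len(x)} symbols")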

Density estimation with Logarithmic Loss. Feature space: the context x_1:k−1 ∈ A^{k−1}. Instant loss: number of bits used, l(Q, x_k) = −log Q(x_k | x_1:k−1). Best regressor on average: Q*(· | x_1:k−1) = P_θ(X_k = · | X_1:k−1 = x_1:k−1). Excess loss: l(Q, x_k) − l(Q*, x_k) = log [P_θ(X_k = x_k | X_1:k−1 = x_1:k−1) / Q(x_k | x_1:k−1)], and E_{P_θ}[l(Q, x_k) − l(Q*, x_k)] = KL(P_θ(X_k = · | X_1:k−1 = x_1:k−1), Q(· | x_1:k−1)). Total excess loss: L_n(Q) = Σ_{k=1}^n [l(Q, x_k) − l(Q*, x_k)] = log [P_θ^n(x_1:n) / Q_n(x_1:n)]. Total expected excess loss: E_{P_θ}[L_n(Q)] = KL(P_θ^n, Q_n).

Outline (next section: Bayesian coding).

Bayesian predictive coding. Let π be any prior on Θ and take Q_n(x_1:n) = ∫ P_θ^n(x_1:n) π(dθ). Predictive coding distribution: Q(a | x_1:n) = ∫ P_θ(X_{n+1} = a | X_1:n = x_1:n) π(dθ | x_1:n). Conjugate priors ⟹ easy update rules for Q(· | x_1:n). Idea: ∀θ ∈ Θ, Q_n(x_1:n) ≳ P_{θ±δ}^n(x_1:n) π(θ ± δ), i.e. the mixture is never much smaller than the contribution of a small neighbourhood of any θ.

Example: i.i.d. classes. A stationary memoryless source is parameterized by θ ∈ S_A = {(θ_1, ..., θ_|A|) : θ_j ≥ 0, Σ_{j=1}^{|A|} θ_j = 1}. Conjugate Dirichlet prior π = D(α_1, ..., α_|A|); posterior: Π(dθ | x_1:n) = D(α_1 + n_1, ..., α_|A| + n_|A|). ⟹ Q(a | x_1:n) = (n_a + α_a) / (n + Σ_{j=1}^{|A|} α_j). The Laplace code is the special case α_1 = ... = α_|A| = 1.

The Krichevsky-Trofimov Mixture. Dirichlet prior with α_1 = ... = α_|A| = 1/2: Q_KT(x_1:n) = ∫_{S_A} P_θ(x_1:n) [Γ(|A|/2) / Γ(1/2)^|A|] Π_{a∈A} θ_a^{−1/2} dθ. This is Jeffreys' prior: non-informative, peaked on extreme parameter values. [Figure: the density ν(θ_1) of the Beta(1/2,1/2) prior on (0,1), diverging at 0 and 1.] Example, |A| = 2: Q_KT(x_1 x_2 x_3) = Q_KT(x_1) Q_KT(x_2 | x_1) Q_KT(x_3 | x_1:2); e.g. for the string 011 this gives 1/2 · (1/2)/2 · (1 + 1/2)/3 = 1/16.
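
The KT mixture is easiest to compute through its predictive form Q_KT(a | x_1:n) = (n_a + 1/2)/(n + |A|/2); the added sketch below reproduces the 1/16 of the worked example (the particular string 011 is just one length-3 illustration).

    from collections import Counter

    def kt_probability(x, alphabet="01"):
        counts, prob = Counter(), 1.0
        for sym in x:
            prob *= (counts[sym] + 0.5) / (sum(counts.values()) + len(alphabet) / 2)
            counts[sym] += 1                # sequential KT update
        return prob

    print(kt_probability("011"))   # 0.0625 = 1/16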

Optimality. Theorem (Krichevsky-Trofimov '81): if Q_KT^n is the KT-mixture code and m = |A|, then max_{θ∈Θ} KL(P_θ^n, Q_KT^n) = ((m−1)/2) log(n/2) + log[Γ(1/2)^m / Γ(m/2)] + o_m(1). Theorem (Rissanen '84): if dim Θ = k and there exists a √n-consistent estimator of θ given X_1:n, then lim inf_n [min_{Q_n} max_{θ∈Θ} KL(P_θ^n, Q_n)] / ((k/2) log n) ≥ 1. Theorem (Haussler '96): there exists a Bayesian code Q_n reaching the minimax redundancy.

Optimality, more precisely. Theorem (Krichevsky-Trofimov '81): if Q_KT^n is the KT-mixture code and m = |A|, then max_{θ∈Θ} KL(P_θ^n, Q_KT^n) = ((m−1)/2) log(n/2) + log[Γ(1/2)^m / Γ(m/2)] + o_m(1). Theorem (Xie-Barron '97): min_{Q_n} max_{θ∈Θ} KL(P_θ^n, Q_n) = ((m−1)/2) log(n/(2e)) + log[Γ(1/2)^m / Γ(m/2)] + o_m(1). Actually, a pointwise approximation holds: −log Q_KT^n(x_1:n) ≤ inf_θ [−log P_θ^n(x_1:n)] + ((|A|−1)/2) log n + C.

Sources with memory. I.i.d. sources on finite alphabets are very well treated... but they provide poor models for most applications! Example: Shannon considered Markov models for natural languages. Problem: a Markov chain of order 3 on an alphabet of size 26 has k = 26³ × 25 = 439,400 degrees of freedom, while min_{Q_n} max_{θ∈Θ} KL(P_θ^n, Q_n) ≈ (k/2) log n. ⟹ Need for more parsimonious classes.

Outline (next section: Context-tree Models and Double Mixture: Rissanen's model).

Context Tree Sources, formal definition. Context tree: a set T of finite strings over A such that every semi-infinite past x_{−∞:0} admits a unique suffix s ∈ T; this unique suffix of x in T is denoted T(x). A string s is a context of a stationary ergodic process P if P(X_{−|s|+1:0} = s) > 0 and if, for every past x admitting s as a suffix, P(X_1 = a | X_{−∞:0} = x) = P(X_1 = a | X_{−|s|+1:0} = s). T is the context tree of P if T is a context tree, contains all contexts of P, and none of its elements can be replaced by a proper suffix without violating one of these properties.

Context Tree Sources, informal definition. A context tree source is a Markov chain whose order is allowed to depend on the past data. [Figure: a binary context tree T with four leaves, each leaf carrying a conditional distribution of the next symbol: (2/3, 1/3), (3/4, 1/4), (1/5, 4/5), (1/3, 2/3). The following slides step along a sample path and highlight, at each time, the leaf (context) that determines the conditional probability of the next symbol.]


Context Trees versus Markov Chains. A Markov chain of order r = a context tree source whose context tree is the complete tree of depth r. [Figure: for a Markov chain of order 3, the complete binary tree of depth 3 with one conditional distribution p(· | s) at each of its 8 leaves; for a general context tree source of depth 3, a pruned tree with fewer leaves and one p(· | s) per leaf.] Conversely, a finite context tree source of depth d is a Markov chain of order d. ⟹ Much more flexibility: a large number of models per dimension.

Infinite trees, unbounded memory. There exist (useful and simple) non-Markovian context tree sources; see Galves, Ferrari, Comets, ... Example: renewal processes. X is a binary sequence 1 0...0 1 0...0 1 ... in which the gaps N_i between successive 1s are i.i.d. with distribution µ ∈ M_1(N_+), and P(X_1 = 1 | the past ends with a 1 followed by j−1 zeros) = µ(j) / µ([j, ∞)).
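
To see the renewal process as a context tree source, the added sketch below simulates one (with an arbitrary, made-up gap distribution µ) and checks empirically that the probability of a 1 after the context "1 followed by one 0" matches µ(2)/µ([2, ∞)).

    import random
    random.seed(0)

    mu = {1: 0.4, 2: 0.3, 3: 0.15, 4: 0.1, 5: 0.05}   # hypothetical gap law on N_+

    def hazard(j):
        """mu(j) / mu([j, infinity)): chance that the current gap ends at length j."""
        tail = sum(q for i, q in mu.items() if i >= j)
        return mu.get(j, 0.0) / tail

    def sample_renewal(n):
        """Generate n symbols: each gap length is drawn from mu and ends with a 1."""
        out = []
        while len(out) < n:
            gap = random.choices(list(mu), weights=mu.values())[0]
            out.extend([0] * (gap - 1) + [1])
        return out[:n]

    x = sample_renewal(200000)
    # Empirical frequency of a 1 right after seeing a 1 followed by exactly one 0:
    num = den = 0
    for t in range(2, len(x)):
        if x[t - 2] == 1 and x[t - 1] == 0:
            den += 1
            num += x[t]
    print(num / den, hazard(2))   # the two numbers should be close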

Outline (next section: Coding in Context Tree models).

Bayes Codes in a Context Tree Model. For a given context tree T, let Λ_T = {P stationary ergodic on A : T(P) = T}. P ∈ Λ_T is fully characterized by θ = (θ_s)_{s∈T} ∈ Θ_T = S_A^T ⊂ R^{T×(|A|−1)}, via P(X_1 = · | X_{−∞:0} = x) = θ_{T(x)}. Bayesian coding: take a prior π_T on Θ_T and Q_{π_T}(x_1:n) = ∫ P_θ(x_1:n) dπ_T(θ).

Krichevsky-Trofimov Mixtures for Context Trees. Prior: the θ_s are independent, D(1/2, ..., 1/2)-distributed: π_T(dθ) = Π_{s∈T} π_KT(dθ_s). For s ∈ T, let T(x_1:n, s) = the subsequence of x_1:n formed by the symbols that occur in context s. Example: for a three-leaf binary context tree T and a string x_1:8, the eight symbols split into three such subsequences, one per context. The KT mixture on model T is: Q_T^n(x_1:n) = Π_{s∈T} Q_KT^{|T(x_1:n, s)|}(T(x_1:n, s)).
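
The double product above can be computed by splitting the data string according to contexts; here is an added sketch for a hypothetical three-leaf binary tree (the tree, the string and the all-zero initial past are illustrative choices, not the slide's example).

    from collections import Counter, defaultdict

    def kt_probability(subseq, m=2):
        counts, prob = Counter(), 1.0
        for sym in subseq:
            prob *= (counts[sym] + 0.5) / (sum(counts.values()) + m / 2)
            counts[sym] += 1
        return prob

    def q_tree(x, tree, past="0" * 10):
        """tree: set of contexts, read as suffixes of the past (most recent symbol last)."""
        sub = defaultdict(str)
        history = past
        for sym in x:
            context = next(s for s in tree if history.endswith(s))  # unique by definition
            sub[context] += sym
            history += sym
        prob = 1.0
        for s, subseq in sub.items():
            prob *= kt_probability(subseq)       # one KT factor per context
        return prob

    T = {"1", "10", "00"}          # hypothetical 3-leaf binary context tree
    x = "0110100110"
    print(q_tree(x, T))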

Optimality. Theorem (Willems, Shtarkov, Tjalkens '93): there is a constant C such that −log Q_T^n(x_1:n) ≤ inf_{θ∈S_A^T} [−log P_θ(x_1:n)] + (|T|(|A|−1)/2) log(1 + n/|T|) + C|T| ⟹ first-order optimal. What if we don't know which model T to use? ⟹ Model selection.

Minimum Description Length Principle. Occam's razor (Jorma Rissanen '78): choose the model that gives the shortest description of the data. Information theory: objective codelength = codelength of a minimax coder. Estimator associated with the minimax mixtures (π_T)_T: T̂_KT = arg min_T −log ∫_{Θ_T} P_θ^n(x_1:n) π_T(dθ). Looking directly at the bounds: T̂_BIC = arg min_T {inf_{θ∈Θ_T} −log P_θ^n(x_1:n) + (|T|(|A|−1)/2) log n}, the penalized maximum likelihood estimator with BIC penalty.
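
An added sketch of the BIC-penalized criterion, restricted for illustration to three hand-picked candidate trees and a made-up data string (a real estimator searches a much larger family of trees):

    from math import log2
    from collections import Counter, defaultdict

    def max_loglik(x, tree, past="00"):
        """sup over theta of log2 P_theta(x): plug in empirical conditionals per context."""
        counts = defaultdict(Counter)
        history = past
        for sym in x:
            context = next(s for s in tree if history.endswith(s))
            counts[context][sym] += 1
            history += sym
        ll = 0.0
        for context, c in counts.items():
            n_s = sum(c.values())
            ll += sum(k * log2(k / n_s) for k in c.values() if k > 0)
        return ll

    def bic_select(x, candidate_trees, m=2):
        def score(tree):
            penalty = len(tree) * (m - 1) / 2 * log2(len(x))
            return -max_loglik(x, tree) + penalty    # code length + BIC penalty
        return min(candidate_trees, key=score)

    candidates = [{""}, {"0", "1"}, {"1", "10", "00"}]   # i.i.d., order 1, 3-leaf tree
    x = "01101001101101101001" * 20
    print(bic_select(x, candidates))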

Consistency Results. Theorems (Csiszár & Shields '96, Csiszár & Talata '04, G. '05, Galves & Leonardi '07, Leonardi '09): the BIC estimator is strongly consistent; the Krichevsky-Trofimov estimator is not consistent (it fails to identify Bernoulli-1/2 i.i.d. sources!). Many rates of convergence, minimal penalties, ... One can derive from this bounds on sup_T sup_{θ∈Θ_T} KL(P_θ^n, Q_T̂^n).

Outline (next section: The Context-Tree Weighting Method).

Double mixture. Let ν be the measure on the set 𝒯 of all context trees defined by ν(T) = 2^{−(2|T|−1)}: it is a probability distribution. Context Tree Weighting coding distribution: Q_CTW^n(x_1:n) = Σ_{T∈𝒯} ν(T) Q_T^n(x_1:n). Theorem (Shtarkov et al. '93), oracle inequality: −log Q_CTW^n(x_1:n | x_{−∞:0}) ≤ inf_{T∈𝒯} inf_{θ∈Θ_T} {−log p_θ(x_1:n | x_{−∞:0}) + ((|A|−1)|T|/2) log(1 + n/|T|) + |T|(2 + log m)} + m/2.

CTW: an Efficient Algorithm. Idea: at each node s of the context tree, compute the arithmetic mean of a "self" probability and a "sub-tree" probability: the weighted probability P_w^s equals the KT probability of the counts (a_s, b_s) observed in context s if s is a deepest node, and (1/2) P_KT(a_s, b_s) + (1/2) Π_{children u of s} P_w^u otherwise. [Figure, over four slides: a small binary tree with counts at each node, e.g. (4,3) at the root; the weighted probabilities are filled in bottom-up, ending with the CTW probability at the root.]
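
A compact, non-optimized sketch of this weighting recursion for binary data with contexts of depth at most D (added for illustration; a practical coder updates the counts and probabilities sequentially and works in the log domain; the string, the depth and the all-zero initial past are arbitrary):

    from collections import Counter
    from math import lgamma, exp

    def log_kt(a, b):
        """log of the KT (Dirichlet-1/2) probability of a binary string with counts (a, b)."""
        return (lgamma(a + 0.5) + lgamma(b + 0.5) - 2 * lgamma(0.5)
                - lgamma(a + b + 1) + lgamma(1))

    def ctw_probability(x, D=3, past="0" * 8):
        history = past
        counts = Counter()                    # counts[(context, symbol)]
        for sym in x:
            for d in range(D + 1):            # update every suffix context of depth 0..D
                counts[(history[len(history) - d:], sym)] += 1
            history += sym

        def p_w(s):                           # weighted probability of node s
            a, b = counts[(s, "0")], counts[(s, "1")]
            p_kt = exp(log_kt(a, b))
            if len(s) == D:                   # deepest nodes: KT only
                return p_kt
            return 0.5 * p_kt + 0.5 * p_w("0" + s) * p_w("1" + s)

        return p_w("")                        # CTW probability at the root

    print(ctw_probability("01101001101101101001"))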

Outline (next section: Context Tree Weighting on Renewal Processes).

Renewal Processes. Let R be the (non-parametric) class of renewal processes on the binary alphabet {0,1}: X = 1 0...0 1 0...0 1 ..., with gap lengths N_i i.i.d. of distribution µ ∈ M_1(N_+). Theorem (Csiszár and Shields '96): there exist two constants c and C such that c √n ≤ inf_{Q_n} sup_{P∈R} KL(P^n, Q_n) ≤ C √n. ⟹ First example of a class of intermediate complexity. Problem: the proof is non-constructive. Does a general-purpose coder behave well on renewal processes?

Redundancy of the CTW method on Renewal Processes. Theorem (G. '04): there exist constants c and C such that c √n log n ≤ sup_{P∈R} KL(P^n, Q_CTW^n) and, for every P ∈ R, KL(P^n, Q_CTW^n) ≤ C √n log n. ⟹ CTW is efficient on long-memory processes. Adaptivity: if the renewal distribution is bounded, CTW achieves regret O(log n) (contrary to ad-hoc coders). This requires deep contexts in the double mixture (⟹ the tree should not be cut off at depth log n). A kind of non-parametric estimation: need for a balance between approximation and estimation.

Why it works. [Figure: the context tree used in the analysis, a "comb" of depth k, with one context per possible distance to the last 1, up to k.] P^n is approximated by Q_T^n with T as above. Two losses: estimating the transition parameters costs about k log(n)/2 bits; the approximation due to bounding the memory at depth k costs about n log(k)/k bits. ⟹ For k = c√n, both are O(√n log n).
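
A quick numerical check of this trade-off, using the two terms exactly as stated above (the value of n and the grid of k are arbitrary):

    from math import log2, sqrt

    n = 10**6
    for k in (100, int(sqrt(n)), 10000):
        estimation = k * log2(n) / 2          # parameter-estimation cost
        approximation = n * log2(k) / k       # memory-truncation cost
        print(k, round(estimation), round(approximation))
    # The two terms balance at k ~ sqrt(n), where both are of order sqrt(n) * log n.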

Extension: Markovian Renewal Processes. A Markovian renewal process is one in which the successive distances between occurrences of the symbol 1 form a Markov chain on N_+: X = 1 0...0 1 0...0 1 ..., with (N_i)_i a Markov chain. Theorem (Csiszár and Shields '96): there exist two constants c and C such that c n^{2/3} ≤ inf_{Q_n} sup_{P∈MR} KL(P^n, Q_n) ≤ C n^{2/3}. Theorem (G. '04): CTW is almost adaptive on the class MR of Markovian renewal processes: ∃C ∈ R, ∀P ∈ MR, KL(P^n, Q_CTW^n) ≤ C n^{2/3} log n.

Markovian Renewal Processes are Context Tree Sources, and so are r-order Markov renewal processes. [Figure: the corresponding context tree, whose branches s_1, s_2, ..., s_k, ... encode the successive gap lengths.]

Outline (next section: Perspectives).

Perspectives: oracle inequalities on more general classes? CTW is probably more useful than expected: it obtains good results on infinite-memory processes. We want to obtain oracle inequalities for other non-Markovian classes. But the setting is more difficult than usual, and the approximation theory is hard! Example: let P_H be an HMM with 3 hidden states; what is inf_{T : |T| ≤ k} inf_{θ∈S_A^T} KL(P_H^n, P_θ^n)?

The end... Thank you for your attention!