Lossy Compression Coding Theorems for Arbitrary Sources

Lossy Compression Coding Theorems for Arbitrary Sources
Ioannis Kontoyiannis, University of Cambridge
Joint work with M. Madiman, M. Harrison, J. Zhang
Beyond IID Workshop, Cambridge, UK, July 23, 2018

Outline

Motivation and Background: Lossless Data Compression; Lossy Data Compression: MDL Point of View

1. Ideal Compression: Kolmogorov Distortion-Complexity
2. Codes as Probability Distributions: Lossy Kraft Inequality
3. Coding Theorems: Asymptotics & Non-Asymptotics
4. Code Performance: Generalized AEP
5. Solidarity with Shannon Theory: Stationary Ergodic Sources
6. Aside: Lossy Compression & Inference: What IS the best code?
7. Choosing a Code: Lossy MLE & Lossy MDL
8. Application: Pre-processing in VQ Design

Emphasis

Concentrate on:
- General sources, general distortion measures
- Nonasymptotic, pointwise results
- Precise performance bounds
- Systematic development of the MDL point of view, parallel to the lossless case
- Connections with applications

Background questions: Why is lossy compression so much harder? What is so different (mathematically) between the lossless and lossy problems? How much do the right models matter for real data?

Lossy Compression: The Basic Problem

Consider a data string $x_1^n = (x_1, x_2, \ldots, x_n)$ to be compressed, each $x_i$ taking values in the source alphabet $A$, e.g., $A = \{0,1\}$, $A = \mathbb{R}$, $A = \mathbb{R}^k$, ...

Problem: Find an efficient approximate representation $y_1^n = (y_1, y_2, \ldots, y_n)$ for $x_1^n$, with each $y_i$ taking values in the reproduction alphabet $\hat{A}$.
- "Efficient" means simple or compressible.
- "Approximate" means that the distortion $d_n(x_1^n, y_1^n)$ is $\leq$ some level $D$, where $d_n : A^n \times \hat{A}^n \to [0,\infty)$ is an arbitrary distortion measure.

Step 1. Ideal Compression: Kolmogorov Distortion-Complexity

For computable data and distortion, define (Muramatsu-Kanaya 1994) the (Kolmogorov) distortion-complexity at distortion level $D$:
$$K_D(x_1^n) = \min\{\ell(p) : p \text{ s.t. } U(p) \in B(x_1^n, D)\} \text{ bits}$$
where $U(\cdot)$ is a universal Turing machine and $B(x_1^n, D)$ is the distortion-ball of radius $D$ around $x_1^n$:
$$B(x_1^n, D) = \{y_1^n \in \hat{A}^n : d_n(x_1^n, y_1^n) \leq D\}$$

Properties of $K_D(x_1^n)$:
(a) machine-independent
(b) $\approx nR(D)$ for stationary ergodic data $\Rightarrow$ THE fundamental limit of compression
But (c) not computable...

A Hint From Lossless Compression

Data: $X_1^n = (X_1, X_2, \ldots, X_n)$ with distribution $P_n$ on $A^n$

Lossless code $(e_n, L_n)$:
Encoder: $e_n : A^n \to \{0,1\}^*$
Code-lengths: $L_n(X_1^n) = $ length of $e_n(X_1^n)$ bits

Kraft Inequality: There exists a uniquely decodable code $(e_n, L_n)$ iff
$$\sum_{x_1^n \in A^n} 2^{-L_n(x_1^n)} \leq 1$$

$(\Rightarrow)$ To any code $(e_n, L_n)$ we can associate a probability distribution $Q_n(x_1^n) \propto 2^{-L_n(x_1^n)}$
$(\Leftarrow)$ From any probability distribution $Q_n$ we can get a code $e_n$ with $L_n(x_1^n) = -\log Q_n(x_1^n)$ bits
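
To make the lossless codes-measures correspondence concrete, here is a minimal Python sketch (added for illustration, not from the talk): it builds Shannon-style code lengths $\lceil -\log_2 Q(x) \rceil$ from an arbitrary pmf, checks the Kraft inequality, and goes back from lengths to a probability distribution.

```python
import math

def lengths_from_distribution(q):
    """Shannon code lengths ceil(-log2 q(x)) for a pmf q over a finite alphabet."""
    return {x: math.ceil(-math.log2(p)) for x, p in q.items() if p > 0}

def distribution_from_lengths(lengths):
    """Probability mass q(x) proportional to 2^{-L(x)}."""
    weights = {x: 2.0 ** (-L) for x, L in lengths.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def kraft_sum(lengths):
    return sum(2.0 ** (-L) for L in lengths.values())

if __name__ == "__main__":
    q = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    L = lengths_from_distribution(q)
    print("code lengths:", L)                # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
    print("Kraft sum:", kraft_sum(L))        # <= 1, so a prefix code with these lengths exists
    print("recovered pmf:", distribution_from_lengths(L))
```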

Lossy Compression in More Detail

Data: $X_1^n = (X_1, X_2, \ldots, X_n)$ with distribution $P_n$ on $A^n$
Quantizer: $q_n : A^n \to$ codebook $B_n \subset \hat{A}^n$
Encoder: $e_n : B_n \to \{0,1\}^*$ (uniquely decodable)
Code-length: $L_n(X_1^n) = L_n(q_n(X_1^n)) = $ length of $e_n(q_n(X_1^n))$ bits

[Block diagram: the data pass through the quantizer $q_n$, then the encoder $e_n$ produces a bit string (e.g., 0010111101101000...), which the decoder $e_n^{-1}$ inverts.]

Step 2. Codes as Probability Distributions

The code $(C_n, L_n) = (B_n, q_n, e_n, L_n)$ operates at distortion level $D$ if $d_n(x_1^n, q_n(x_1^n)) \leq D$ for every $x_1^n \in A^n$.

Kraft Inequality (lossless):
$(\Leftarrow)$ For every lossless code $(C_n, L_n)$ there is a probability measure $Q_n$ on $A^n$ s.t. $L_n(x_1^n) \geq -\log Q_n(x_1^n)$ bits, for all $x_1^n$.
$(\Rightarrow)$ For any probability measure $Q_n$ on $A^n$ there is a code $(C_n, L_n)$ s.t. $L_n(x_1^n) \leq -\log Q_n(x_1^n) + 1$ bits, for all $x_1^n$.

Theorem: Lossy Kraft Inequality
$(\Leftarrow)$ For every code $(C_n, L_n)$ operating at distortion level $D$ there is a probability measure $Q_n$ on $\hat{A}^n$ s.t. $L_n(x_1^n) \geq -\log Q_n(B(x_1^n, D))$ bits, for all $x_1^n$.
$(\Rightarrow)$ For any admissible sequence of measures $\{Q_n\}$ on $\hat{A}^n$ there are codes $\{C_n, L_n\}$ at distortion level $D$ s.t. $L_n(X_1^n) \leq -\log Q_n(B(X_1^n, D)) + \log n$ bits, eventually, w.p.1.

Remarks on the Codes-Measures Correspondence

- The converse part is a finite-$n$ result, as in the lossless case.
- The direct part is asymptotic (random coding), but with (near) optimal convergence rate.
- Both results are valid without any assumptions on the source or the distortion measure.
- Similar results hold in expectation, with a $\frac{1}{2}\log n$ rate.
- Admissibility: the $\{Q_n\}$ yield codes with finite rate,
$$\limsup_{n \to \infty} -\frac{1}{n} \log Q_n(B(X_1^n, D)) \leq \text{some finite } R \text{ bits/symbol, w.p.1}$$

This suggests a natural lossy analog of the Shannon code-lengths:
$$L_n(X_1^n) = -\log Q_n\big(B(X_1^n, D)\big) \text{ bits}$$
"All codes are random codes"

Proof Outline

$(\Leftarrow)$ Given a code $(C_n, L_n)$, let
$$Q_n(y_1^n) \propto \begin{cases} 2^{-L_n(y_1^n)} & \text{if } y_1^n \in B_n \\ 0 & \text{otherwise} \end{cases}$$
Then for all $x_1^n$:
$$L_n(x_1^n) = L_n(q_n(x_1^n)) \geq -\log Q_n(q_n(x_1^n)) \geq -\log Q_n(B(x_1^n, D)) \text{ bits}$$

$(\Rightarrow)$ Given $Q_n$, generate IID codewords $Y_1^n(1), Y_1^n(2), Y_1^n(3), \ldots \sim Q_n$, and encode $X_1^n$ as the position of the first $D$-close match:
$$W_n = \min\{i : d_n(X_1^n, Y_1^n(i)) \leq D\}$$
This takes
$$L_n(X_1^n) \approx \log W_n = \log[\text{waiting time for a match}] \approx \log[1/\text{prob of a match}] = -\log Q_n(B(X_1^n, D)) \text{ bits} \quad \square$$
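
A small simulation can make the waiting-time argument tangible. The sketch below is added here and is not from the talk; the block length, distortion level, source bias, and uniform codebook distribution are illustrative choices. It draws IID uniform codewords for a Bernoulli source under Hamming distortion, records the index $W_n$ of the first $D$-close match, and compares $\log_2 W_n$ with $-\log_2 Q_n(B(x_1^n, D))$, which for the uniform codebook is an exact binomial tail.

```python
import math
import random

def hamming(x, y):
    """Normalized Hamming distortion between two equal-length 0/1 lists."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def neg_log2_ball_prob_uniform(n, D):
    """Exact -log2 Q_n(B(x, D)) when Q_n is uniform on {0,1}^n:
    the ball probability (# of y within nD flips) / 2^n is the same for every center x."""
    ball_size = sum(math.comb(n, k) for k in range(math.floor(n * D) + 1))
    return n - math.log2(ball_size)

def waiting_time(x, D, rng):
    """Index of the first IID uniform codeword within normalized Hamming distortion D of x."""
    n, i = len(x), 0
    while True:
        i += 1
        y = [rng.randint(0, 1) for _ in range(n)]
        if hamming(x, y) <= D:
            return i

if __name__ == "__main__":
    rng = random.Random(0)
    n, D, p = 24, 0.25, 0.4   # illustrative block length, distortion level, source bias
    x = [1 if rng.random() < p else 0 for _ in range(n)]
    target = neg_log2_ball_prob_uniform(n, D)
    samples = [math.log2(waiting_time(x, D, rng)) for _ in range(200)]
    # log2 W_n tracks -log2 Q_n(B(x,D)) up to O(1) terms, as in the proof outline.
    print(f"-log2 Q_n(B(x,D))           = {target:.2f} bits")
    print(f"average log2 W_n (200 runs) = {sum(samples)/len(samples):.2f} bits")
```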

Step 3. Coding Theorems: Best Achievable Performance

Let $Q_n^*$ achieve:
$$K_n(D) \triangleq \inf_{Q_n} E[-\log Q_n(B(X_1^n, D))]$$

Theorem: One-Shot Converses
i. For any code $(C_n, L_n)$ operating at distortion level $D$:
$$E[L_n(X_1^n)] \geq K_n(D) \geq R_n(D) \text{ bits}$$
ii. For any (other) probability measure $Q_n$ on $\hat{A}^n$ and any $K$:
$$\Pr\left\{ -\log Q_n(B(X_1^n, D)) \leq -\log Q_n^*(B(X_1^n, D)) - K \text{ bits} \right\} \leq 2^{-K}$$

Proof. Selection in convex families: the Bell-Cover version of the Kuhn-Tucker conditions. $\square$
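
The slide does not spell out how part (ii) follows. Here is a minimal sketch (added here), under the assumption that the Bell-Cover / Kuhn-Tucker conditions for the optimizer $Q_n^*$ supply the key inequality $E_P[\,Q_n(B(X_1^n,D))/Q_n^*(B(X_1^n,D))\,] \leq 1$; the rest is Markov's inequality.

```latex
% Assumption: E_P[ Q_n(B(X_1^n,D)) / Q_n^*(B(X_1^n,D)) ] <= 1 (Bell-Cover conditions).
\begin{align*}
\Pr\Big\{ -\log Q_n(B(X_1^n,D)) \le -\log Q_n^*(B(X_1^n,D)) - K \Big\}
  &= \Pr\Big\{ \frac{Q_n(B(X_1^n,D))}{Q_n^*(B(X_1^n,D))} \ge 2^{K} \Big\} \\
  &\le 2^{-K}\, E\Big[ \frac{Q_n(B(X_1^n,D))}{Q_n^*(B(X_1^n,D))} \Big]
   \;\le\; 2^{-K},
\end{align*}
% where the first inequality is Markov's inequality applied to the likelihood ratio.
```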

Coding Theorems, Continued

Theorem: Asymptotic Bounds
i. For any sequence of codes $\{C_n, L_n\}$ operating at distortion level $D$:
$$L_n(X_1^n) \geq -\log Q_n^*(B(X_1^n, D)) - \log n \text{ bits, eventually, w.p.1}$$
with $Q_n^*$ as before.
ii. There is a sequence of codes $\{C_n^*, L_n^*\}$ operating at distortion level $D$ s.t.
$$L_n^*(X_1^n) \leq -\log Q_n^*(B(X_1^n, D)) + \log n \text{ bits, eventually, w.p.1}$$

Proof. Finite-$n$ bounds + Markov inequality + Borel-Cantelli + extra care. $\square$

Remarks

1. Gaussian approximation: This is a good starting point for developing the well-known Gaussian approximation results for the fundamental limits of lossy compression [K 2000, Kostina-Verdu, ...].

2. More fundamentally: We wish to approximate the performance of the optimal Shannon code. Find $\{Q_n\}$ that yield code-lengths
$$L_n(X_1^n) = -\log Q_n\big(B(X_1^n, D)\big) \text{ bits}$$
close to those of the optimal Shannon code:
$$L_n^*(X_1^n) = -\log Q_n^*\big(B(X_1^n, D)\big) \text{ bits}$$

3. Performance?

Step 4. Code Performance: Generalized AEP

Suppose:
- The source $\{X_n\}$ is stationary ergodic with distribution $P$
- $\{Q_n\}$ are the marginals of a stationary ergodic $Q$
- $d_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n d(x_i, y_i)$ is a single-letter distortion measure

Theorem: Generalized AEP [Luczak & Szpankowski], [Dembo & K.], [Chi], [...]
If $Q$ is mixing enough and $d(x, y)$ is not wild:
$$-\frac{1}{n} \log Q_n(B(X_1^n, D)) \to R(P, Q, D) \text{ bits/symbol, w.p.1}$$
where
$$R(P, Q, D) = \lim_{n \to \infty} \frac{1}{n} \inf_{P_{X_1^n, Y_1^n}:\ P_{X_1^n} = P_n,\ E[d_n(X_1^n, Y_1^n)] \leq D} H\big(P_{X_1^n, Y_1^n} \,\big\|\, P_n \times Q_n\big)$$

Proof. Based on very technical large deviations. $\square$
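
A quick numerical illustration of the generalized AEP (added here, not from the talk), in the one special case where everything is exact: for IID fair-coin codewords and Hamming distortion, $Q_n(B(x_1^n,D))$ is a binomial tail that does not depend on the center, and $-\frac{1}{n}\log_2 Q_n(B(x_1^n,D)) \to 1 - h(D)$ bits/symbol for any source, which is the value $R(P,Q,D)$ takes in this case.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def empirical_rate(n, D):
    """-(1/n) log2 Q_n(B(x, D)) for Q_n uniform on {0,1}^n and Hamming distortion.
    The Hamming ball probability does not depend on the center x."""
    ball_size = sum(math.comb(n, k) for k in range(math.floor(n * D) + 1))
    return (n - math.log2(ball_size)) / n

if __name__ == "__main__":
    D = 0.2
    limit = 1.0 - h2(D)   # value of the generalized-AEP limit in this special case
    for n in [50, 200, 1000, 5000]:
        print(f"n={n:5d}  -(1/n) log2 Q_n(B(x,D)) = {empirical_rate(n, D):.4f}  (limit {limit:.4f})")
```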

Step 5. Sanity Check: Stationary Ergodic Sources

Suppose:
- The source $\{X_n\}$ is stationary ergodic with distribution $P$
- $d_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n d(x_i, y_i)$ is a single-letter distortion measure
- As before, $Q_n^*$ achieves $K_n(D) = \inf_{Q_n} E[-\log Q_n(B(X_1^n, D))]$

Theorem ([Kieffer], [K & Zhang])
i. $K(D) = \lim_{n \to \infty} \frac{1}{n} K_n(D) = \lim_{n \to \infty} \frac{1}{n} R_n(D) = R(D)$
ii. $-\frac{1}{n} \log Q_n^*(B(X_1^n, D)) \to R(D)$ bits/symbol, w.p.1

Outline

Recall our program:
1. Ideal Compression: Kolmogorov Distortion-Complexity
2. Codes as Probability Distributions: Lossy Kraft Inequality
3. Coding Theorems: Asymptotics & Non-Asymptotics
4. Code Performance: Generalized AEP
5. Solidarity with Shannon Theory: Stationary Ergodic Sources
6. Aside: Lossy Compression & Inference: What IS the best code?
7. Choosing a Code: Lossy MLE & Lossy MDL
8. Toward Practicality: Pre-processing in VQ Design

So Far

Setting: We identified codes with probability measures,
$$L_n(X_1^n) = -\log Q_n\big(B(X_1^n, D)\big) \text{ bits}$$

Design: What is $Q_n^*$? $\to$ What are good codes like? How do we find them? How can we empirically design/choose a good code? Given a parametric family $\{Q_\theta;\ \theta \in \Theta\}$ of codes, how do we choose $\theta$?

Step 6. Lossy Compression & Inference: What is $Q_n^*$?

An HMM example: noisy data. Suppose
- $\{Y_n\}$ is a finite-state Markov chain
- $\{Z_n\}$ is IID $N(0, \sigma^2)$ noise, and the observed data are $X_i = Y_i + Z_i$
- $d(x, y)$ is MSE: $d_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2$

[Diagram: the hidden chain $Y_1, Y_2, \ldots, Y_n$ is corrupted by IID $N(0, \sigma^2)$ noise $Z_1, Z_2, \ldots, Z_n$ to give the observations $X_1, X_2, \ldots, X_n$.]

Then:
- For $D = \sigma^2$, we have $Q_n^* \approx$ the distribution of $Y_1^n$
- For $D < \sigma^2$, we have $Q_n^* \approx$ the distribution of $[Y_1^n + \text{IID } N(0, \sigma^2 - D)]$
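
To see where the $\sigma^2 - D$ correction comes from, here is a short worked special case (added here as an illustration, not from the slides), assuming for simplicity that the clean signal is itself IID Gaussian rather than a Markov chain.

```latex
% Assumption: Y_i IID N(0, s^2), so X_i = Y_i + Z_i is IID N(0, s^2 + \sigma^2).
% For an IID N(0, \tau^2) source with MSE distortion D \le \tau^2, the rate-distortion
% optimal reproduction distribution is N(0, \tau^2 - D). With \tau^2 = s^2 + \sigma^2:
\[
  Q^* \;=\; N\!\big(0,\; s^2 + \sigma^2 - D\big)
  \;=\;
  \begin{cases}
    \text{law of } Y_1, & D = \sigma^2,\\[2pt]
    \text{law of } Y_1 + N(0,\sigma^2 - D), & D < \sigma^2,
  \end{cases}
\]
% matching the slide's description of Q_n^* in this matched-noise situation.
```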

Lossy Compression & Inference: Denoising

Idea: Lossy compression of noisy data $\Rightarrow$ denoising.

Suppose:
- $X_1^n$ is noisy data, $X_1^n = Y_1^n + Z_1^n$
- the noise is maximally random with strength $\sigma^2$
- the distortion measure $d(x, y)$ is matched to the noise, with $0 \leq D \leq \sigma^2$ [i.e., if the Shannon lower bound is tight]

Then $Q_n^* \approx$ the distribution of $Y_1^n$ + noise with strength $(\sigma^2 - D)$.

Examples: Binary noise with Hamming distortion; Gaussian noise with MSE.

Step 7. Choosing a Code: The Lossy MLE & A Lossy MDL Proposal

Greedy empirical code selection. Given a parametric family $\{Q_\theta;\ \theta \in \Theta\}$ of codes, define:
$$\hat{\theta}_{\mathrm{MLE}} \triangleq \arg\inf_{\theta \in \Theta} \left[ -\log Q_\theta(B(X_1^n, D)) \right]$$

Problems: As with the classical MLE,
1. it encourages overfitting, and
2. it does not lead to real codes.

Solution: Follow coding intuition. For the MLE to be useful, $\theta$ needs to be described as well $\Rightarrow$ consider two-part codes:
$$L_n(X_1^n) = -\log Q_\theta(B(X_1^n, D)) + \underbrace{\ell_n(\theta)}_{\text{description of } \theta}$$
where either $(c_n, \ell_n)$ is a prefix-free code on $\Theta$, or $\ell_n(\theta)$ is an appropriate penalization term.

Two Lossy MDL Proposals

Two specific MDL-like proposals. Define:
$$\hat{\theta}_{\mathrm{MDL}} \triangleq \arg\inf_{\theta \in \Theta} \left[ -\log Q_\theta(B(X_1^n, D)) + \ell_n(\theta) \right]$$
with either:
(a) $\ell_n(\theta) = \frac{\dim(Q_\theta)}{2} \log n$
(b) $\ell_n(\theta) = \begin{cases} \ell(\theta) & \text{for } \theta \text{ in some countable subset of } \Theta \\ \infty & \text{otherwise} \end{cases}$

Consistency? Does $\hat{\theta}_{\mathrm{MLE}}/\hat{\theta}_{\mathrm{MDL}}$ asymptotically lead to optimal compression? What is the optimal $\theta^*$? What if $\theta^*$ is not unique? (the typical case) As in the classical case: proofs are often hard and problem-specific.
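
A minimal computational sketch of the lossy MLE/MDL idea (added here, not part of the talk): for an IID Bernoulli codebook family with Hamming distortion, the lossy likelihood $Q_\theta(B(x_1^n,D))$ can be evaluated exactly as a convolution of two binomial tails, so grid search gives the lossy MLE, and the $(\dim/2)\log n$ penalty of proposal (a) can be compared across a fixed 0-parameter model and a fitted 1-parameter model. The source bias, grid, and block length below are illustrative.

```python
import math
import numpy as np
from scipy.stats import binom

def log2_ball_prob_bernoulli(x, theta, D):
    """Exact log2 Q_theta^n(B(x, D)) for IID Bernoulli(theta) codewords,
    x a 0/1 array, and normalized Hamming distortion <= D.
    Mismatches = Bin(#zeros, theta) + Bin(#ones, 1 - theta), independent."""
    n = len(x)
    n1 = int(np.sum(x))
    n0 = n - n1
    t = math.floor(n * D)                      # max number of mismatches allowed
    j = np.arange(0, t + 1)
    prob = np.sum(binom.pmf(j, n0, theta) * binom.cdf(t - j, n1, 1.0 - theta))
    return math.log2(prob) if prob > 0 else -math.inf

def lossy_mle(x, D, grid):
    """Lossy MLE over a grid of theta values: minimize -log2 Q_theta(B(x, D))."""
    costs = [-log2_ball_prob_bernoulli(x, th, D) for th in grid]
    i = int(np.argmin(costs))
    return grid[i], costs[i]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p, D = 400, 0.3, 0.1
    x = (rng.random(n) < p).astype(int)
    grid = np.linspace(0.01, 0.99, 99)

    theta_hat, cost1 = lossy_mle(x, D, grid)         # 1-parameter model: Bernoulli(theta)
    cost0 = -log2_ball_prob_bernoulli(x, 0.5, D)     # 0-parameter model: Bernoulli(1/2)

    # Lossy MDL with penalty (a): add (dim/2) * log2(n) bits per model.
    mdl1 = cost1 + 0.5 * math.log2(n)
    mdl0 = cost0
    print(f"lossy MLE theta_hat = {theta_hat:.2f}  "
          f"(classical R-D theory suggests (p - D)/(1 - 2D) = {(p - D)/(1 - 2*D):.2f})")
    print(f"MDL criterion: Bernoulli(1/2) model = {mdl0:.1f} bits, "
          f"fitted Bernoulli(theta) model = {mdl1:.1f} bits")
```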

Justification for the Lossy MDL Estimate

1. It leads to realistic code selection.
2. An example:

Theorem. Let $\{X_n\}$ be real-valued, stationary, ergodic, with $E(X_n) = 0$, $\mathrm{Var}(X_n) = 1$. Take $d_n(x_1^n, y_1^n) = $ MSE, let $D \in (0,1)$ be fixed, and let
$$\Theta: \quad Q_\theta \sim \text{IID } \tfrac{1}{2} N(0, 1-D) + \tfrac{1}{2} N(0, \theta), \quad \theta \in [0,1]$$
With $\ell_n(\theta) = \frac{\dim(Q_\theta)}{2} \log n$ we have:
(a) $\hat{\theta}_{\mathrm{MLE}}$ and $\hat{\theta}_{\mathrm{MDL}}$ both $\to \theta^* = 1 - D$ w.p.1
(b) $\hat{\theta}_{\mathrm{MLE}} \neq (1-D)$ i.o., w.p.1
(c) $\hat{\theta}_{\mathrm{MDL}} = (1-D)$ eventually, w.p.1

Example Interpretation and Details

Although artificial, the above example illustrates a general phenomenon: $\hat{\theta}_{\mathrm{MLE}}$ overfits whereas $\hat{\theta}_{\mathrm{MDL}}$ doesn't.

Proof. To compare $\hat{\theta}_{\mathrm{MLE}}$ with $\hat{\theta}_{\mathrm{MDL}}$, we need estimates of the log-likelihood $-\log Q_\theta(B(X_1^n, D))$ with accuracy better than $O(\log n)$, uniformly in $\theta$. This involves very intricate large deviations:

STEPS 1 & 2:
$$-\log Q_\theta(B(X_1^n, D)) = -\log \Pr\Big\{ \tfrac{1}{n}\sum_{i=1}^n d(X_i, Y_i) \leq D \,\Big|\, X_1^n \Big\} = n R(\hat{P}_{X_1^n}, Q_\theta, D) + \tfrac{1}{2}\log n + O(1) \quad \text{w.p.1}$$
$$= n R(P, Q_\theta, D) + \sum_{i=1}^n g_\theta(X_i) + \tfrac{1}{2}\log n + O(\log\log n) \quad \text{w.p.1}$$
STEP 3: Implicitly identify $g_\theta(x)$ as the derivative of a convex dual
STEP 4: Expand $g_\theta(x)$ in a Taylor series around $\theta^*$
STEP 5: Use the LIL to compare $\hat{\theta}_{\mathrm{MLE}}$ with $\hat{\theta}_{\mathrm{MDL}}$
STEP 6: Justify a.s.-uniform approximation $\square$

Another Example of Consistency: Gaussian Mixtures

Let:
- The source $\{X_n\}$ be $\mathbb{R}^k$-valued, stationary, ergodic, with finite mean and covariance, arbitrary distribution $P$
- $\Theta$ be Gaussian mixtures: $Q_\theta \sim \text{IID } \sum_{i=1}^L p_i N(\mu_i, K_i)$, where $\theta = \big((p_1, \ldots, p_L), (\mu_1, \ldots, \mu_L), (K_1, \ldots, K_L)\big)$; $k, L$ fixed, $\mu_i \in [-M, M]^k$, each $K_i$ with eigenvalues in some fixed interval $[\underline{\lambda}, \overline{\lambda}]$
- $d_n(x_1^n, y_1^n) = $ MSE, $D > 0$ fixed

Motivation: Practical quantization/clustering schemes, e.g., Gray's Gauss mixture VQ and MDI selection.

Theorem: There is a unique optimal $\theta^*$, characterized by a $k$-dimensional variational problem
$$\inf_{\theta \in \Theta} R(P_k, Q_\theta, D) = R(P_k, Q_{\theta^*}, D),$$
and $\hat{\theta}_{\mathrm{MLE}}$, $\hat{\theta}_{\mathrm{MDL}}$ both $\to \theta^*$ w.p.1 and in $L^1$.

Weak Consistency: A General Theorem

Suppose:
(a) The generalized AEP holds for all $Q_\theta$ in $\Theta$
(b) The rate function $R(P, Q_\theta, D)$ of the AEP is lower semicontinuous in $\theta$
(c) The lower bound of the AEP holds uniformly on compacts in $\Theta$
(d) $\hat{\theta}$ stays in a compact set, eventually, w.p.1

Then
$$\mathrm{dist}(\hat{\theta}_{\mathrm{MLE}}, \{\theta^*\}) \to 0 \text{ w.p.1}, \qquad \mathrm{dist}(\hat{\theta}_{\mathrm{MDL}}, \{\theta^*\}) \to 0 \text{ w.p.1}$$

Proof. Based on epiconvergence (or $\Gamma$-convergence); quite technical. $\square$

Conditions: (a) we saw; (c) is often checked by Chernoff-bound-like arguments; (b) and (d) need to be checked case-by-case $\Rightarrow$ these sufficient conditions can be substantially weakened.

Strong Consistency: A Conjecture

Under conditions (a)-(d) above:

Conjecture. For smooth enough parametric families, under regularity conditions:
always: $\dim(\hat{\theta}_{\mathrm{MDL}}) = \dim(\theta^*)$ eventually, w.p.1
typically: $\dim(\hat{\theta}_{\mathrm{MLE}}) \neq \dim(\theta^*)$ i.o., w.p.1

Proof? We saw the many technicalities in the simple Gaussian case; the general result is still open.

Step 8. An Application: Preprocessing in VQ Design

Remarks: So far, everything is based on the measures $\leftrightarrow$ (random) codes correspondence. Practical implications?

Candidate application: Gaussian-mixture VQ
- stationary, ergodic, $\mathbb{R}^k$-valued source $\{X_n\}$ with finite mean and covariance, arbitrary distribution $P$
- $\Theta$ are Gaussian mixtures: $Q_\theta \sim \text{IID } \sum_{i=1}^L p_i N(\mu_i, K_i)$, with $\mu_i \in [-M, M]^k$ and each $K_i$ having eigenvalues in some fixed interval $[\underline{\lambda}, \overline{\lambda}]$
- $d_n(x_1^n, y_1^n) = $ MSE, $D > 0$ fixed

Main problem: Choose $L$.
MDL estimate: $\hat{L} = [\text{number of mixture components in } \hat{\theta}_{\mathrm{MDL}}]$
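
As a rough illustration of the "choose $L$ by penalized fit" idea (added here, not from the talk): the sketch below selects the number of mixture components using the ordinary Gaussian-mixture likelihood plus the $(\dim/2)\log n$ penalty, i.e., a BIC-style stand-in for the lossy likelihood $-\log Q_\theta(B(X_1^n,D))$, which is harder to evaluate; the synthetic data and `L_max` are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_param_count(L, k):
    """Free parameters of a full-covariance Gaussian mixture with L components in R^k."""
    return (L - 1) + L * k + L * k * (k + 1) // 2

def choose_num_components(X, L_max=8, seed=0):
    """Pick L by minimizing: total negative log-likelihood (nats) + (dim/2) * log n.
    NOTE: the ordinary GMM likelihood is used here as a stand-in for the lossy
    likelihood -log Q_theta(B(X_1^n, D)) discussed in the talk."""
    n, k = X.shape
    best_L, best_score = None, np.inf
    for L in range(1, L_max + 1):
        gm = GaussianMixture(n_components=L, covariance_type="full",
                             random_state=seed).fit(X)
        neg_loglik = -gm.score(X) * n                      # total negative log-likelihood
        penalty = 0.5 * gmm_param_count(L, k) * np.log(n)  # (dim/2) log n description cost
        score = neg_loglik + penalty
        if score < best_score:
            best_L, best_score = L, score
    return best_L

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic 2-D data from a 3-component mixture (illustrative).
    centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
    X = np.vstack([rng.normal(c, 1.0, size=(300, 2)) for c in centers])
    print("selected number of components:", choose_num_components(X))
```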

Final Remarks: The MDL Point of View

Guideline: Kolmogorov distortion-complexity: not computable.

Codes-measures correspondence: All codes are random codes,
$$L_n(X_1^n) = -\log Q_n\big(B(X_1^n, D)\big) \text{ bits}$$

Optimal code: $Q_n^* = \arg\inf_{Q_n} E[-\log Q_n(B(X_1^n, D))]$

Generalized AEP(s): $-\frac{1}{n} \log Q_n(B(X_1^n, D)) \to R(P, Q, D)$

Lossy MLE: consistent but overfits,
$$\hat{\theta}_{\mathrm{MLE}} \triangleq \arg\inf_{\theta \in \Theta} \left[ -\log Q_\theta(B(X_1^n, D)) \right]$$

Lossy MDL: consistent and does NOT overfit,
$$\hat{\theta}_{\mathrm{MDL}} \triangleq \arg\inf_{\theta \in \Theta} \left[ -\log Q_\theta(B(X_1^n, D)) + \ell_n(\theta) \right]$$

VQ design: Preprocessing with Lossy MDL reduces problem dimensionality.