Lossy Compression Coding Theorems for Arbitrary Sources

Ioannis Kontoyiannis, University of Cambridge
Joint work with M. Madiman, M. Harrison, J. Zhang
Beyond IID Workshop, Cambridge, UK, July 23, 2018
Outline

Motivation and Background: Lossless Data Compression; Lossy Data Compression: the MDL Point of View

1. Ideal Compression: Kolmogorov Distortion-Complexity
2. Codes as Probability Distributions: Lossy Kraft Inequality
3. Coding Theorems: Asymptotics & Non-Asymptotics
4. Code Performance: Generalized AEP
5. Solidarity with Shannon Theory: Stationary Ergodic Sources
6. Aside: Lossy Compression & Inference: What IS the best code?
7. Choosing a Code: Lossy MLE & Lossy MDL
8. Application: Pre-processing in VQ Design
Emphasis

Concentrate on:
- General sources, general distortion measures
- Non-asymptotic, pointwise results
- Precise performance bounds
- Systematic development of the MDL point of view, parallel to the lossless case
- Connections with applications

Background questions:
- Why is lossy compression so much harder than lossless?
- What is so different (mathematically) between them?
- How much do the right models matter for real data?
Lossy Compression: The Basic Problem

Consider a data string $x_1^n = (x_1, x_2, \ldots, x_n)$ to be compressed, each $x_i$ taking values in the source alphabet $A$, e.g., $A = \{0,1\}$, $A = \mathbb{R}$, $A = \mathbb{R}^k, \ldots$

Problem: Find an efficient approximate representation $y_1^n = (y_1, y_2, \ldots, y_n)$ for $x_1^n$, with each $y_i$ taking values in the reproduction alphabet $\hat{A}$.
- Efficient means simple or compressible
- Approximate means that the distortion $d_n(x_1^n, y_1^n)$ is $\le$ some level $D$, where $d_n : A^n \times \hat{A}^n \to [0,\infty)$ is an arbitrary distortion measure
Step 1. Ideal Compression: Kolmogorov Distortion-Complexity

For computable data and distortion, define (Muramatsu-Kanaya 1994) the (Kolmogorov) distortion-complexity at distortion level $D$:
$$K_D(x_1^n) = \min\{\ell(p) : p \text{ s.t. } U(p) \in B(x_1^n, D)\} \text{ bits}$$
where $U(\cdot)$ is a universal Turing machine and $B(x_1^n, D)$ is the distortion ball of radius $D$ around $x_1^n$:
$$B(x_1^n, D) = \{y_1^n \in \hat{A}^n : d_n(x_1^n, y_1^n) \le D\}$$

Properties of $K_D(x_1^n)$:
(a) machine-independent
(b) $\approx nR(D)$ for stationary ergodic data $\Rightarrow$ THE fundamental limit of compression
But (c) not computable...
A Hint From Lossless Compression

Data: $X_1^n = (X_1, X_2, \ldots, X_n)$ with distribution $P_n$ on $A^n$
Lossless code $(e_n, L_n)$: an encoder $e_n : A^n \to \{0,1\}^*$, with code lengths $L_n(X_1^n) = $ length of $e_n(X_1^n)$ in bits

Kraft Inequality: There exists a uniquely decodable code $(e_n, L_n)$ iff
$$\sum_{x_1^n \in A^n} 2^{-L_n(x_1^n)} \le 1$$
- To any code $(e_n, L_n)$ we can associate a probability distribution $Q_n(x_1^n) \propto 2^{-L_n(x_1^n)}$
- From any probability distribution $Q_n$ we can get a code $e_n$ with $L_n(x_1^n) = \lceil -\log Q_n(x_1^n) \rceil$ bits
Lossy Compression in More Detail

Data: $X_1^n = (X_1, X_2, \ldots, X_n)$ with distribution $P_n$ on $A^n$
Quantizer: $q_n : A^n \to$ codebook $B_n \subset \hat{A}^n$
Encoder: $e_n : B_n \to \{0,1\}^*$ (uniquely decodable)
Code length: $L_n(X_1^n) = L_n(q_n(X_1^n)) = $ length of $e_n(q_n(X_1^n))$ in bits

[Diagram: $x_1^n \xrightarrow{q_n} y_1^n \xrightarrow{e_n} 0010111101101000\ldots \xrightarrow{e_n^{-1}} y_1^n$]
Step 2. Codes as Probability Distributions

The code $(C_n, L_n) = (B_n, q_n, e_n, L_n)$ operates at distortion level $D$ if
$$d_n(x_1^n, q_n(x_1^n)) \le D \quad \text{for every } x_1^n \in A^n$$

Theorem: Lossy Kraft Inequality (side by side with the lossless case)

($\Leftarrow$) Lossless: for every code $(C_n, L_n)$ there is a probability measure $Q_n$ on $A^n$ such that $L_n(x_1^n) \ge -\log Q_n(x_1^n)$ bits, for all $x_1^n$.
($\Leftarrow$) Lossy: for every code $(C_n, L_n)$ operating at distortion level $D$ there is a probability measure $Q_n$ on $\hat{A}^n$ such that $L_n(x_1^n) \ge -\log Q_n(B(x_1^n, D))$ bits, for all $x_1^n$.

($\Rightarrow$) Lossless: for any probability measure $Q_n$ on $A^n$ there is a code $(C_n, L_n)$ such that $L_n(x_1^n) \le -\log Q_n(x_1^n) + 1$ bits, for all $x_1^n$.
($\Rightarrow$) Lossy: for any admissible sequence of measures $\{Q_n\}$ on $\hat{A}^n$ there are codes $\{(C_n, L_n)\}$ operating at distortion level $D$ such that $L_n(X_1^n) \le -\log Q_n(B(X_1^n, D)) + \log n$ bits, eventually, w.p.1.
Remarks on the Codes-Measures Correspondence

- The converse part is a finite-$n$ result, as in the lossless case
- The direct part is asymptotic (random coding), but with (near-)optimal convergence rate
- Both results are valid without any assumptions on the source or the distortion measure
- Similar results hold in expectation, with a $\frac{1}{2}\log n$ rate
- Admissibility: the $\{Q_n\}$ yield codes with finite rate,
$$\limsup_{n\to\infty} -\frac{1}{n}\log Q_n(B(X_1^n, D)) \le \text{some finite } R \text{ bits/symbol, w.p.1}$$

This suggests a natural lossy analog of the Shannon code lengths:
$$L_n(X_1^n) = -\log Q_n(B(X_1^n, D)) \text{ bits}$$

All codes are random codes.
Proof Outline

($\Leftarrow$) Given a code $(C_n, L_n)$, let
$$Q_n(y_1^n) \propto \begin{cases} 2^{-L_n(y_1^n)} & \text{if } y_1^n \in B_n \\ 0 & \text{otherwise} \end{cases}$$
Then for all $x_1^n$:
$$L_n(x_1^n) = L_n(q_n(x_1^n)) \ge -\log Q_n(q_n(x_1^n)) \ge -\log Q_n(B(x_1^n, D)) \text{ bits}$$

($\Rightarrow$) Given $Q_n$, generate i.i.d. codewords $Y_1^n(1), Y_1^n(2), Y_1^n(3), \ldots \sim Q_n$, and encode $X_1^n$ as the position of the first $D$-close match:
$$W_n = \min\{i : d_n(X_1^n, Y_1^n(i)) \le D\}$$
This takes
$$L_n(X_1^n) \approx \log W_n = \log[\text{waiting time for a match}] \approx \log[1/\text{prob. of a match}] = -\log Q_n(B(X_1^n, D)) \text{ bits} \qquad \Box$$
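The waiting-time argument in the direct part can be simulated directly. A small sketch under my own assumptions (binary strings, Hamming distortion, fair-coin codebook; not code from the talk):

```python
import math, random

# Random-coding sketch: encode a binary string by the index W_n of the first
# D-close match in an i.i.d. codebook; log2(W_n) then tracks -log2 Q_n(B(x, D)).
random.seed(0)
n, D = 30, 0.2
x = [random.randint(0, 1) for _ in range(n)]

def d_hamming(a, b):
    return sum(ai != bi for ai, bi in zip(a, b)) / len(a)

def waiting_time():
    """Draw i.i.d. Bernoulli(1/2) codewords until one is D-close to x."""
    i = 1
    while True:
        y = [random.randint(0, 1) for _ in range(n)]
        if d_hamming(x, y) <= D:
            return i
        i += 1

# Exact ball probability: for fair-coin codewords, Q_n(B(x, D)) is the
# probability of at most nD mismatches -- a binomial tail, the same for all x.
p_ball = sum(math.comb(n, k) for k in range(int(round(n * D)) + 1)) / 2 ** n
ideal = -math.log2(p_ball)                    # about 10.4 bits here

avg = sum(math.log2(waiting_time()) for _ in range(40)) / 40
print(ideal, avg)   # the description length log2(W_n) stays within a few bits of ideal
```

Averaging $\log_2 W_n$ over repeated codebooks lands slightly below $-\log_2 Q_n(B(x,D))$ (by roughly a constant), which is exactly the "waiting time $\approx$ 1/probability" heuristic in the proof.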
Step 3. Coding Theorems: Best Achievable Performance

Let $Q_n^*$ achieve
$$K_n(D) \triangleq \inf_{Q_n} E[-\log Q_n(B(X_1^n, D))]$$

Theorem: One-Shot Converses
i. For any code $(C_n, L_n)$ operating at distortion level $D$:
$$E[L_n(X_1^n)] \ge K_n(D) \ge R_n(D) \text{ bits}$$
ii. For any (other) probability measure $Q_n$ on $\hat{A}^n$ and any $K$:
$$\Pr\big\{-\log Q_n(B(X_1^n, D)) \le -\log Q_n^*(B(X_1^n, D)) - K \text{ bits}\big\} \le 2^{-K}$$

Proof: selection in convex families; the Bell-Cover version of the Kuhn-Tucker conditions. $\Box$
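For a toy case the optimization defining $K_n(D)$ can be carried out by brute force. A sketch under my own assumptions (not from the talk): $n = 1$, a Bernoulli source, Hamming distortion with $D < 1$, so the ball $B(x, D) = \{x\}$ and the lossy problem collapses to the lossless one, giving $K_1(D) = H(p)$:

```python
import math

# Brute-force computation of K_1(D) = inf_Q E[-log2 Q(B(X, D))] for a
# Bernoulli(p) source and Hamming distortion with D < 1, where B(x, D) = {x}.
# The optimization then reduces to the lossless case, minimized at Q = P.
p = 0.3
best_q, best_val = None, float("inf")
for i in range(1, 1000):
    q = i / 1000                                   # Q(1) = q, Q(0) = 1 - q
    val = -(p * math.log2(q) + (1 - p) * math.log2(1 - q))   # E[-log2 Q({X})]
    if val < best_val:
        best_q, best_val = q, val

H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
print(best_q, best_val, H)   # minimizer q = p = 0.3, optimal value = H(p)
```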
Coding Theorems Continued

Theorem: Asymptotic Bounds
i. For any sequence of codes $\{(C_n, L_n)\}$ operating at distortion level $D$:
$$L_n(X_1^n) \ge -\log Q_n^*(B(X_1^n, D)) - \log n \text{ bits, eventually, w.p.1}$$
with $Q_n^*$ as before.
ii. There is a sequence of codes $\{(C_n^*, L_n^*)\}$ operating at distortion level $D$ such that
$$L_n^*(X_1^n) \le -\log Q_n^*(B(X_1^n, D)) + \log n \text{ bits, eventually, w.p.1}$$

Proof: finite-$n$ bounds + Markov inequality + Borel-Cantelli + extra care. $\Box$
Remarks

1. Gaussian approximation: this is a good starting point for developing the well-known Gaussian approximation results for the fundamental limits of lossy compression [K 2000, Kostina-Verdú, ...]

2. More fundamentally, we wish to approximate the performance of the optimal Shannon code: find $\{Q_n\}$ that yield code lengths $L_n(X_1^n) = -\log Q_n(B(X_1^n, D))$ bits close to those of the optimal Shannon code, $L_n^*(X_1^n) = -\log Q_n^*(B(X_1^n, D))$ bits.

3. Performance?
Step 4. Code Performance: Generalized AEP

Suppose:
- the source $\{X_n\}$ is stationary ergodic with distribution $P$
- $\{Q_n\}$ are the marginals of a stationary ergodic $Q$
- $d_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n d(x_i, y_i)$ is a single-letter distortion measure

Theorem: Generalized AEP [Łuczak & Szpankowski], [Dembo & K.], [Chi], [...]
If $Q$ is mixing enough and $d(x, y)$ is not wild:
$$-\frac{1}{n}\log Q_n(B(X_1^n, D)) \to R(P, Q, D) \text{ bits/symbol, w.p.1}$$
where
$$R(P, Q, D) = \lim_{n\to\infty} \frac{1}{n} \inf_{P_{X_1^n, Y_1^n}:\, P_{X_1^n} = P_n,\ E[d_n(X_1^n, Y_1^n)] \le D} H\big(P_{X_1^n, Y_1^n} \,\big\|\, P_n \times Q_n\big)$$

Proof: based on very technical large deviations. $\Box$
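In the simplest setting the convergence in the generalized AEP can be checked exactly. A sketch under my own assumptions (i.i.d. Bernoulli(1/2) source and measures, Hamming distortion; not code from the talk): here $Q_n(B(x, D))$ is a binomial tail, and the AEP limit is $R(P, Q, D) = 1 - h(D)$ bits/symbol, the fair-coin rate-distortion function:

```python
import math

def h(D):                        # binary entropy, in bits
    return -D * math.log2(D) - (1 - D) * math.log2(1 - D)

# For fair-coin codewords, Q_n(B(x, D)) = P(Binomial(n, 1/2) <= nD) for every x,
# so -(1/n) log2 Q_n(B(x, D)) can be computed exactly and compared to 1 - h(D).
D = 0.1
rates = []
for n in [50, 200, 800]:
    ball = sum(math.comb(n, k) for k in range(int(round(n * D)) + 1)) / 2 ** n
    rates.append(-math.log2(ball) / n)

print(rates, 1 - h(D))   # rates decrease toward 1 - h(0.1), about 0.531 bits/symbol
```

The gap to the limit shrinks like $O(\log n / n)$, consistent with the polynomial prefactors hiding inside the ball probability.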
Step 5. Sanity Check: Stationary Ergodic Sources

Suppose the source $\{X_n\}$ is stationary ergodic with distribution $P$, and $d_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n d(x_i, y_i)$ is a single-letter distortion measure. As before, $Q_n^*$ achieves $K_n(D) = \inf_{Q_n} E[-\log Q_n(B(X_1^n, D))]$.

Theorem ([Kieffer], [K & Zhang])
i. $K(D) = \lim_{n\to\infty} \frac{1}{n} K_n(D) = \lim_{n\to\infty} \frac{1}{n} R_n(D) = R(D)$
ii. $-\frac{1}{n}\log Q_n^*(B(X_1^n, D)) \to R(D)$ bits/symbol, w.p.1
Outline

Recall our program:
1. Ideal Compression: Kolmogorov Distortion-Complexity
2. Codes as Probability Distributions: Lossy Kraft Inequality
3. Coding Theorems: Asymptotics & Non-Asymptotics
4. Code Performance: Generalized AEP
5. Solidarity with Shannon Theory: Stationary Ergodic Sources
6. Aside: Lossy Compression & Inference: What IS the best code?
7. Choosing a Code: Lossy MLE & Lossy MDL
8. Toward Practicality: Pre-processing in VQ Design
So Far

Setting: we identified codes with probability measures,
$$L_n(X_1^n) = -\log Q_n(B(X_1^n, D)) \text{ bits}$$

Design: What is $Q_n^*$? What are good codes like? How do we find them? How can we empirically design/choose a good code? Given a parametric family $\{Q_\theta : \theta \in \Theta\}$ of codes, how do we choose $\theta$?
Step 6. Lossy Compression & Inference: What is $Q_n^*$?

An HMM example, noisy data. Suppose:
- $X_n = Y_n + Z_n$, where $\{Y_n\}$ is a finite-state Markov chain and $\{Z_n\}$ is i.i.d. $N(0, \sigma^2)$ noise
- $d(x, y)$ is MSE: $d_n(x_1^n, y_1^n) = \frac{1}{n}\sum_{i=1}^n (x_i - y_i)^2$

Then:
- for $D = \sigma^2$, we have $Q_n^* \approx$ distribution of $Y_1^n$
- for $D < \sigma^2$, we have $Q_n^* \approx$ distribution of $[Y_1^n + \text{i.i.d. } N(0, \sigma^2 - D)]$
Lossy Compression & Inference: Denoising

Idea: lossy compression of noisy data $\Rightarrow$ denoising. Suppose:
- $X_1^n$ is noisy data, $X_1^n = Y_1^n + Z_1^n$
- the noise is maximally random, with strength $\sigma^2$
- the distortion measure $d(x, y)$ is matched to the noise, $0 \le D \le \sigma^2$ [i.e., if the Shannon lower bound is tight]

Then $Q_n^{*,D} \approx$ distribution of $Y_1^n +$ noise with strength $(\sigma^2 - D)$.

Examples: binary noise with Hamming distortion; Gaussian noise with MSE.
Step 7. Choosing a Code: The Lossy MLE & A Lossy MDL Proposal

Greedy empirical code selection: given a parametric family $\{Q_\theta : \theta \in \Theta\}$ of codes, define
$$\hat{\theta}_{\mathrm{MLE}} \triangleq \arg\inf_{\theta \in \Theta} \big[-\log Q_\theta(B(X_1^n, D))\big]$$

Problems, as with the classical MLE:
1. it encourages overfitting, and
2. it does not lead to real codes.

Solution: follow coding intuition. For the MLE to be useful, $\theta$ needs to be described as well, so consider two-part codes:
$$L_n(X_1^n) = -\log Q_\theta(B(X_1^n, D)) + \underbrace{\ell_n(\theta)}_{\text{description of } \theta}$$
where either $(c_n, \ell_n)$ is a prefix-free code on $\Theta$, or $\ell_n(\theta)$ is an appropriate penalization term.
Two Lossy MDL Proposals

Two specific MDL-like proposals. Define
$$\hat{\theta}_{\mathrm{MDL}} \triangleq \arg\inf_{\theta \in \Theta} \big[-\log Q_\theta(B(X_1^n, D)) + \ell_n(\theta)\big]$$
with either:
(a) $\ell_n(\theta) = \frac{\dim(Q_\theta)}{2}\log n$; or
(b) $\ell_n(\theta) = \ell(\theta)$ for $\theta$ in some countable $\Theta' \subset \Theta$, and $\ell_n(\theta) = \infty$ otherwise.

Consistency?
- Does $\hat{\theta}_{\mathrm{MLE}}/\hat{\theta}_{\mathrm{MDL}}$ asymptotically lead to optimal compression?
- What is the optimal $\theta^*$? What if $\theta^*$ is not unique? (the typical case)
- As in the classical case, the proof is often hard and problem-specific.
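The lossy MLE objective can be evaluated exactly in a small Gaussian case. A sketch under my own assumptions (not code from the talk): source realization normalized so $|x_1^n|^2 = n$, family $Q_\theta$ = i.i.d. $N(0, \theta)$, MSE distortion. By rotation invariance of the Gaussian codewords, the ball probability reduces to a noncentral $\chi^2$ CDF, and the minimizer should sit near $\theta^* = 1 - D$:

```python
import math

def chi2_cdf(k, x):
    """CDF of a central chi-square with k d.o.f., via the incomplete-gamma series."""
    a, z = k / 2.0, x / 2.0
    if z <= 0:
        return 0.0
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1))
    total, m = term, 1
    while term > 1e-17 * total:
        term *= z / (a + m)
        total += term
        m += 1
    return min(total, 1.0)

def noncentral_chi2_cdf(k, lam, x, jmax=2000):
    """Noncentral chi-square CDF as a Poisson mixture of central chi-squares."""
    logw = -lam / 2.0                        # log Poisson(lam/2) weight, j = 0
    total = 0.0
    for j in range(jmax):
        total += math.exp(logw) * chi2_cdf(k + 2 * j, x)
        logw += math.log(lam / 2.0) - math.log(j + 1)
    return total

n, D = 100, 0.3

def neg_log2_ball(theta):
    # For Q_theta = i.i.d. N(0, theta) and |x|^2 = n:
    # Q_theta(B(x, D)) = P(chi'^2_n(n/theta) <= nD/theta)
    p = noncentral_chi2_cdf(n, n / theta, n * D / theta)
    return -math.log2(p)

grid = [round(0.3 + 0.05 * i, 2) for i in range(25)]    # theta in [0.3, 1.5]
objs = [neg_log2_ball(t) for t in grid]
theta_hat = grid[objs.index(min(objs))]
print(theta_hat)   # near theta* = 1 - D = 0.7
```

In a one-parameter family the $\frac{\dim}{2}\log n$ penalty is the same for every $\theta$, so here $\hat\theta_{\mathrm{MDL}}$ of type (a) coincides with $\hat\theta_{\mathrm{MLE}}$; the penalty only starts to bite when models of different dimension compete.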
Justification For The Lossy MDL Estimate

1. It leads to realistic code selection.
2. An example:

Theorem: Let $\{X_n\}$ be real-valued, stationary, ergodic, with $E(X_n) = 0$, $\mathrm{Var}(X_n) = 1$. Take $d_n(x_1^n, y_1^n)$ = MSE, fix $D \in (0, 1)$, and let $Q_\theta$: i.i.d. $\frac{1}{2}N(0, 1-D) + \frac{1}{2}N(0, \theta)$, $\theta \in [0, 1]$. With $\ell_n(\theta) = \frac{\dim(Q_\theta)}{2}\log n$ we have:
(a) $\hat{\theta}_{\mathrm{MLE}}$ and $\hat{\theta}_{\mathrm{MDL}}$ both $\to \theta^* = 1 - D$, w.p.1
(b) $\hat{\theta}_{\mathrm{MLE}} \ne (1 - D)$ infinitely often, w.p.1
(c) $\hat{\theta}_{\mathrm{MDL}} = (1 - D)$ eventually, w.p.1
Example Interpretation and Details

Although artificial, the above example illustrates a general phenomenon: $\hat{\theta}_{\mathrm{MLE}}$ overfits, whereas $\hat{\theta}_{\mathrm{MDL}}$ does not.

Proof: To compare $\hat{\theta}_{\mathrm{MLE}}$ with $\hat{\theta}_{\mathrm{MDL}}$, we need estimates of the log-likelihood $-\log Q_\theta(B(X_1^n, D))$ with accuracy better than $O(\log n)$, uniformly in $\theta$. This involves very intricate large deviations:

STEPS 1 & 2: expand
$$-\log Q_\theta(B(X_1^n, D)) = -\log \Pr\Big\{\frac{1}{n}\sum_{i=1}^n d(X_i, Y_i) \le D \,\Big|\, X_1^n\Big\} = nR(\hat{P}_{X_1^n}, Q_\theta, D) + \sum_{i=1}^n g_\theta(X_i) + \frac{1}{2}\log n + O(\log\log n), \text{ w.p.1}$$
STEP 3: implicitly identify $g_\theta(x)$ as the derivative of a convex dual
STEP 4: expand $g_\theta(x)$ in a Taylor series around $\theta^*$
STEP 5: use the LIL to compare $\hat{\theta}_{\mathrm{MLE}}$ with $\hat{\theta}_{\mathrm{MDL}}$
STEP 6: justify the a.s.-uniform approximation $\Box$
Another Example of Consistency: Gaussian Mixtures

Let:
- the source $\{X_n\}$ be $\mathbb{R}^k$-valued, stationary, ergodic, with finite mean and covariance, arbitrary distribution $P$
- $\Theta$ be Gaussian mixtures: $Q_\theta$ = i.i.d. $\sum_{i=1}^L p_i N(\mu_i, K_i)$, where $\theta = \big((p_1,\ldots,p_L), (\mu_1,\ldots,\mu_L), (K_1,\ldots,K_L)\big)$, with $k, L$ fixed, $\mu_i \in [-M, M]^k$, and each $K_i$ having eigenvalues in some fixed interval
- $d_n(x_1^n, y_1^n)$ = MSE, $D > 0$ fixed

Motivation: practical quantization/clustering schemes, e.g., Gray's Gauss mixture VQ and MDI selection.

Theorem: There is a unique optimal $\theta^*$, characterized by a $k$-dimensional variational problem
$$\inf_{\theta \in \Theta} R(P_k, Q_{\theta,k}, D) = R(P_k, Q_{\theta^*,k}, D),$$
and $\hat{\theta}_{\mathrm{MLE}}, \hat{\theta}_{\mathrm{MDL}}$ both $\to \theta^*$ w.p.1 and in $L^1$.
Weak Consistency: A General Theorem

Suppose:
(a) the generalized AEP holds for all $Q_\theta$ in $\Theta$
(b) the rate function $R(P, Q_\theta, D)$ of the AEP is lower semicontinuous in $\theta$
(c) the lower bound of the AEP holds uniformly on compacts in $\Theta$
(d) $\hat{\theta}$ stays in a compact set, eventually, w.p.1

Then:
$$\mathrm{dist}(\hat{\theta}_{\mathrm{MLE}}, \{\theta^*\}) \to 0 \text{ w.p.1}, \qquad \mathrm{dist}(\hat{\theta}_{\mathrm{MDL}}, \{\theta^*\}) \to 0 \text{ w.p.1}$$

Proof: based on epiconvergence (or $\Gamma$-convergence); quite technical. $\Box$

Conditions: (a) we saw; (c) is often checked by Chernoff-bound-like arguments; (b) and (d) need to be checked case by case. These sufficient conditions can be substantially weakened.
Strong Consistency: A Conjecture

Under conditions (a)-(d) above:

Conjecture: For smooth enough parametric families, under regularity conditions:
- always: $\dim(\hat{\theta}_{\mathrm{MDL}}) = \dim(\theta^*)$ eventually, w.p.1
- typically: $\dim(\hat{\theta}_{\mathrm{MLE}}) \ne \dim(\theta^*)$ infinitely often, w.p.1

Proof? We saw the many technicalities already present in the simple Gaussian case; the general result is still open. $\Box$
Step 8. An Application: Pre-processing in VQ Design

Remarks: So far, everything is based on the measures $\leftrightarrow$ (random) codes correspondence. Practical implications?

Candidate application: Gaussian-mixture VQ
- stationary, ergodic, $\mathbb{R}^k$-valued source $\{X_n\}$ with finite mean and covariance, arbitrary distribution $P$
- $Q_\theta$ are Gaussian mixtures: i.i.d. $\sum_{i=1}^L p_i N(\mu_i, K_i)$, with $\mu_i \in [-M, M]^k$ and each $K_i$ having eigenvalues in some fixed interval
- $d_n(x_1^n, y_1^n)$ = MSE, $D > 0$ fixed

Main problem: choose $L$.
MDL estimate: $\hat{L} = [\#\text{ of components in } \hat{\theta}_{\mathrm{MDL}}]$
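The model-order-selection step can be sketched in miniature. A hedged simplification (my own illustration, not from the talk): the lossy MDL criterion uses $-\log Q_\theta(B(X_1^n, D)) + \ell_n(\theta)$, which is hard to compute, so here I substitute the ordinary mixture log-likelihood and keep only the $\frac{\dim}{2}\log n$ penalty, to show how the penalty picks the number of components $L$:

```python
import math, random

random.seed(2)
n = 400
# Synthetic data from a true 2-component, 1-D Gaussian mixture.
data = [random.gauss(-3, 1) if random.random() < 0.5 else random.gauss(3, 1)
        for _ in range(n)]

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_loglik(L, iters=60):
    """Fit a 1-D, L-component Gaussian mixture by EM; return its log-likelihood."""
    d = sorted(data)
    mu = [d[(2 * j + 1) * n // (2 * L)] for j in range(L)]   # quantile init
    var = [1.0] * L
    w = [1.0 / L] * L
    for _ in range(iters):
        resp = []
        for x in data:                                       # E-step
            ps = [w[j] * gauss_pdf(x, mu[j], var[j]) for j in range(L)]
            s = sum(ps) or 1.0
            resp.append([p / s for p in ps])
        for j in range(L):                                   # M-step
            nj = max(sum(r[j] for r in resp), 1e-12)
            w[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = max(sum(r[j] * (x - mu[j]) ** 2
                             for r, x in zip(resp, data)) / nj, 1e-3)
    return sum(math.log(max(sum(w[j] * gauss_pdf(x, mu[j], var[j])
                                for j in range(L)), 1e-300)) for x in data)

def mdl_score(L):
    dim = 3 * L - 1                  # (L-1) weights + L means + L variances
    return -gmm_loglik(L) + 0.5 * dim * math.log(n)

L_hat = min([1, 2, 3], key=mdl_score)
print(L_hat)   # the penalty should recover the true order, L = 2
```

Unpenalized likelihood alone never prefers fewer components; it is the $\frac{\dim}{2}\log n$ term that rules out the overfitted $L = 3$ model, mirroring the overfitting contrast between $\hat\theta_{\mathrm{MLE}}$ and $\hat\theta_{\mathrm{MDL}}$ above.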
Final Remarks: The MDL Point of View

Guideline:
- Kolmogorov distortion-complexity: not computable
- Codes-measures correspondence: all codes are random codes, $L_n(X_1^n) = -\log Q_n(B(X_1^n, D))$ bits
- Optimal code: $Q_n^* = \arg\inf_{Q_n} E[-\log Q_n(B(X_1^n, D))]$
- Generalized AEP(s): $-\frac{1}{n}\log Q_n(B(X_1^n, D)) \to R(P, Q, D)$
- Lossy MLE: consistent but overfits, $\hat{\theta}_{\mathrm{MLE}} \triangleq \arg\inf_{\theta \in \Theta} \big[-\log Q_\theta(B(X_1^n, D))\big]$
- Lossy MDL: consistent and does NOT overfit, $\hat{\theta}_{\mathrm{MDL}} \triangleq \arg\inf_{\theta \in \Theta} \big[-\log Q_\theta(B(X_1^n, D)) + \ell_n(\theta)\big]$
- VQ design: pre-processing with lossy MDL reduces problem dimensionality