PROOF OF ZADOR-GERSHO THEOREM
- Theresa Hampton
ZADOR-GERSHO THEOREM FOR VARIABLE-RATE VQ

For a stationary source and large R, the least distortion of k-dim'l VQ with nth-order entropy coding and rate R or less is

    δ(k,n,R) ≅ m_k* σ² η_kn 2^{-2R} = Z(k,n,R)

where

    Z(k,n,R) = Zador-Gersho function for k-dim'l VQ with nth-order entropy coding
    m_k* = best inertial profile = least NMI of any tessellating polytope (Gersho's conjecture)
    η_kn = 2^{2 h_kn} / σ²

Equivalently,

    S(k,n,R) ≅ 6.02 R − 10 log₁₀ (m_k* η_kn)    (Again, 6 dB per bit.)

Notes:

- k-dim'l VQ-EC with n = ∞ is at least as good as k-dim'l VQ-FR, because the latter is a special case of the former. Later we show directly that η_k ≤ β_k, which implies Z(k,∞,R) ≤ Z(k,R).

- The proof shows that an approximately optimal k-dimensional VQ with nth-order variable-rate coding can be constructed with a partition that is a tessellation of the best k-dimensional polytope, scaled to volume 2^{k(h_kn − R)}. This has

      constant inertial profile m(x) = m_k*,
      constant point density Λ(x) = Λ_k* = 2^{k(R − h_kn)},
      distortion D ≅ m_k* (Λ_k*)^{-2/k} = Z(k,n,R).

VQ-EC-13

PROOF OF ZADOR-GERSHO THEOREM

We begin with

    δ(k,n,R) ≅ min over m(x), Λ(x) of ∫ m(x) Λ^{-2/k}(x) f(x) dx

where x = (x₁,…,x_k) and the minimization is over all inertial profiles m(x) and all point densities Λ(x) such that

    h_kn + (1/k) ∫ f(x) log₂ Λ(x) dx ≤ R    (such a Λ yields rate ≅ R)

Best inertial profile: We assume Gersho's conjecture -- in the high-rate, small-distortion regime, most cells of an optimal quantizer are, approximately, congruent to the tessellating polytope with least NMI. We conclude: the best inertial profile is

    m_k(x)* = m_k* = least NMI of k-dimen'l tessellating polytopes

It follows that

    δ(k,n,R) ≅ m_k* min over Λ(x) of ∫ Λ^{-2/k}(x) f(x) dx

where the min is taken over functions Λ(x) such that Λ(x) ≥ 0 and

    h_kn + (1/k) ∫ f(x) log₂ Λ(x) dx ≤ R

VQ-EC-14
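The distortion and SNR forms above are easy to sanity-check numerically. The sketch below (plain Python; the specific values of m_k*, η_kn, and σ² are hypothetical placeholders, not derived from any particular source) confirms that the two forms agree and that each added bit of rate buys about 6.02 dB:

```python
import math

def zador_gersho(m_k_star, sigma2, eta_kn, R):
    """Z(k,n,R) = m_k* sigma^2 eta_kn 2^(-2R): approximate least distortion."""
    return m_k_star * sigma2 * eta_kn * 2 ** (-2 * R)

def snr_db(m_k_star, eta_kn, R):
    """S(k,n,R) = 6.02 R - 10 log10(m_k* eta_kn), with 6.02 = 20 log10(2)."""
    return 20 * math.log10(2) * R - 10 * math.log10(m_k_star * eta_kn)

# Hypothetical numbers: m_1* = 1/12 (NMI of an interval), unit variance.
m_k_star, sigma2, eta_kn, R = 1.0 / 12, 1.0, 1.4, 8.0
D = zador_gersho(m_k_star, sigma2, eta_kn, R)
S = snr_db(m_k_star, eta_kn, R)

# S must equal 10 log10(sigma^2 / D), and one more bit gives +6.02 dB.
consistency = abs(S - 10 * math.log10(sigma2 / D))
per_bit = snr_db(m_k_star, eta_kn, R + 1) - S
```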
Best point density: Suppose Λ(x) satisfies

    h_kn + (1/k) ∫ f(x) log₂ Λ(x) dx ≤ R    (*)

Then by concavity of the logarithm and Jensen's inequality,

    log₂ ∫ Λ^{-2/k}(x) f(x) dx ≥ ∫ f(x) log₂ Λ^{-2/k}(x) dx    (equality iff Λ(x) is constant with probability one)
                               = −(2/k) ∫ f(x) log₂ Λ(x) dx
                               ≥ 2 h_kn − 2R    (by (*))

Hence,

    ∫ Λ^{-2/k}(x) f(x) dx ≥ 2^{2 h_kn} 2^{-2R}    (**)

with equality iff Λ(x) is a constant with probability one.

(*) and (**) ⇒ δ(k,n,R) ⪆ m_k* 2^{2 h_kn} 2^{-2R} = m_k* σ² η_kn 2^{-2R}

Moreover, we have shown that the optimal point density is a constant. The constant must be such that (*) holds with equality. Therefore,

    Λ(x) = 2^{k(R − h_kn)} = Λ_k*

We see from the proof that an approximately optimal VQ can be constructed with a partition that is a tessellation of the best k-dim'l polytope, scaled to volume 2^{k(h_kn − R)}. The tessellation need only cover the region where f(x) is not small.

VQ-EC-15

SUMMARY OF FIXED- AND VARIABLE-RATE VQ

Let 0th-order entropy coding (n = 0) denote fixed-rate coding. Given a stationary source and large R, the least distortion of VQ with dimension k, nth-order entropy coding, and rate R or less is

    δ(k,n,R) ≅ m_k* σ² α_{k,n} 2^{-2R}    (= Z(k,n,R))

where

    α_{k,n} = β_k for n = 0,  and  α_{k,n} = η_kn = 2^{2 h_kn} / σ² for n ≥ 1

Notice the "," in α_{k,n} but not in η_kn or h_kn!

The best k-dimen'l VQ to use with fixed-rate coding (n = 0) has
- point density λ_k*(x) = c f^{k/(k+2)}(x)
- congruent cells with NMI = m_k* (constant inertial profile).

The best k-dimen'l VQ to use with variable-rate binary coding (n ≥ 1) has
- uniform point density
- congruent cells with NMI equal to m_k*.

That is, it is simply a tessellation.

VQ-EC-16
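The Jensen step above can be illustrated with a finite, discretized stand-in for the integrals (a sketch; the weights f and the candidate point densities below are arbitrary made-up values, and k = 2 is just an example): any non-constant density exceeds the lower bound (**), while a constant density meets it.

```python
import math
import random

random.seed(0)
k = 2
# Discrete stand-in for the density f(x): random probability weights.
w = [random.random() for _ in range(1000)]
tot = sum(w)
f = [wi / tot for wi in w]

def lhs(Lam):
    """E[Lambda(X)^(-2/k)], the quantity being minimized."""
    return sum(fi * L ** (-2.0 / k) for fi, L in zip(f, Lam))

def rhs(Lam):
    """2^(-(2/k) E[log2 Lambda(X)]), the Jensen lower bound (**)."""
    return 2 ** (-(2.0 / k) * sum(fi * math.log2(L) for fi, L in zip(f, Lam)))

Lam_rand = [random.uniform(0.5, 2.0) for _ in range(1000)]
Lam_const = [1.7] * 1000

slack = lhs(Lam_rand) - rhs(Lam_rand)             # > 0: non-constant is suboptimal
gap_const = abs(lhs(Lam_const) - rhs(Lam_const))  # ~ 0: constant meets the bound
```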
WHAT HAPPENS AS k AND n CHANGE?

As usual, consider a stationary source. Recall:

    δ(k,n,R) ≅ σ² m_k* α_{k,n} 2^{-2R}

- m_k* decreases subadditively to m_∞* = 1/(2πe)
- α_{k,n} = β_k for n = 0, and 2^{2 h_kn} / σ² for n ≥ 1
- β_k decreases subadditively to β_∞
- 2^{2 h_k} decreases monotonically to 2^{2 h_∞}

Therefore,

- α_{k,0} decreases monotonically with k to β_∞
- α_{k,n} decreases monotonically with k to 2^{2 h_∞} / σ² for n ≥ 1
- α_{k,n} decreases monotonically with n to 2^{2 h_∞} / σ²

Key Fact: 2^{2 h_∞} / σ² = β_∞ (proved later).

Therefore, α_{k,n} decreases monotonically with n or k to β_∞ = 2^{2 h_∞} / σ².

VQ-EC-17

CONCLUSIONS

(1) The least distortion of vector quantization with rate R or less, with any dimension, and with fixed-rate coding or with variable-rate encoding of any order is

    δ(R) ≅ σ² m_∞* β_∞ 2^{-2R}

Among other things, this says that the best possible performance with variable-rate coding is no better than the best possible performance with fixed-rate coding.

(2) Increasing n with k fixed: δ(k,n,R) decreases monotonically to the limit

    δ(k,∞,R) ≅ σ² m_k* β_∞ 2^{-2R} = (m_k* / m_∞*) δ(R)    (space-filling loss)

Therefore, for large n and arbitrary k,

    δ(k,n,R) ≅ (m_k* / m_∞*) δ(R)

Among other things, this shows that one needs large k in order to approach the best possible performance.

VQ-EC-18
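The space-filling loss m_k*/m_∞* in conclusion (2) can be evaluated for the dimensions where the best tessellating NMI is known in closed form (a sketch; the k = 2 value assumes the regular hexagon is optimal, per Gersho's conjecture):

```python
import math

m_inf = 1 / (2 * math.pi * math.e)        # m_infinity* = 1/(2 pi e) ~ 0.0586
known_nmi = {
    1: 1 / 12,                            # interval
    2: 5 / (36 * math.sqrt(3)),           # regular hexagon (Gersho's conjecture)
}
# Space-filling loss in dB for each dimension.
loss_db = {k: 10 * math.log10(m / m_inf) for k, m in known_nmi.items()}
# k = 1 gives ~1.53 dB; already at k = 2 the loss shrinks to ~1.37 dB.
```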
(3) Increasing k with n fixed: δ(k,n,R) "decreases", though not monotonically, to the limit

    δ(∞,n,R) ≅ σ² m_∞* β_∞ 2^{-2R} = δ(R)    (no loss)

Therefore, for large k and arbitrary n (even n = 0),

    δ(k,n,R) ≅ δ(R)

Among other things, this indicates that for large k, increasing n does not improve the best possible performance attainable for that k. That is, one can attain the best possible performance even with n = 0 or 1.

(4) To get the best possible performance we must have

(a) k large enough that m_k* / m_∞* ≅ 1, i.e. well-shaped cells,
(b) k and/or n large enough that α_{k,n} ≅ β_∞, i.e. good point density and good exploitation of memory.

Usually (b) is more important than (a).

VQ-EC-19

(5) What's the point of variable-rate coding if the best possible performance with variable-rate coding is no better than the best possible performance without it?

From the point of view of VQ, the reason to use variable-rate coding, instead of fixed-rate coding, is to permit a VQ with smaller dimension (less complexity) to work well.

The extreme case -- k = 1, i.e. scalar quantization:

with n = 0 (fixed-rate coding)

    δ(1,0,R) ≅ σ² m_1* β_1 2^{-2R} = (m_1* / m_∞*)(β_1 / β_∞) δ(R)

with large n (high-order variable-rate coding)

    δ(1,∞,R) ≅ σ² m_1* β_∞ 2^{-2R} = (m_1* / m_∞*) δ(R),  where m_1* / m_∞* = 1.42, or 1.53 dB

The variable-rate coding causes β_1 to be replaced by β_∞. Moreover, the best scalar quantizer for use with variable-rate coding is a uniform scalar quantizer. Thus this shows that with variable-rate coding, a uniform scalar quantizer can have performance within only 1.53 dB of the best VQ of any type!

VQ-EC-20
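For a concrete instance of case (5), take an i.i.d. Gaussian source (an assumption for illustration; the statement above is source-independent). Then β_1 = 2π·3^{3/2} and β_∞ = 2πe in closed form, and the two gaps to δ(R) come out to roughly 4.35 dB (fixed-rate scalar) and 1.53 dB (entropy-coded scalar):

```python
import math

pi, e = math.pi, math.e
m1, m_inf = 1 / 12, 1 / (2 * pi * e)   # NMI of interval; limiting NMI

# i.i.d. Gaussian, unit variance: Zador factors in closed form.
beta_1 = 2 * pi * 3 ** 1.5             # k = 1 (Panter-Dite constant)
beta_inf = 2 * pi * e                  # = 2^(2 h_infinity) / sigma^2

# Gaps of delta(1,0,R) and delta(1,inf,R) over delta(R), in dB.
gap_fixed_db = 10 * math.log10((m1 * beta_1) / (m_inf * beta_inf))
gap_entropy_db = 10 * math.log10(m1 / m_inf)
```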
(6) What's the point of vector quantization if uniform scalar quantization plus variable-rate coding can come within 1.53 dB of the best VQ of any type?

From the point of view of the binary code, the purpose of VQ is to permit a lower-order variable-rate coder to be used, i.e. it permits a simpler lossless coder. For example, if uniform scalar quantization were used, the variable-rate coder must exploit the memory in the source, i.e. it would have to be complex. VQ also reduces the space-filling loss, i.e. it improves cell shapes.

(7) It's worth re-emphasizing that one can attain

- the best possible performance (D vs. R) by choosing k large and n = 0, i.e. with a complex quantizer and a simple fixed-rate binary encoder;
- the best possible performance minus only 1.53 dB by choosing a uniform scalar quantizer and an entropy coder with large n, i.e. with a simple quantizer and a complex entropy coder.

Which is simpler? Hard to say. Both are very complex. Good systems are usually compromises, i.e. a nontrivial quantizer and a nontrivial entropy coder. In some applications variable-rate coding is not an option and so fixed-rate coding must be used.

VQ-EC-21

EXAMPLE: GAUSS-MARKOV SOURCE, ρ = 0.9

    S(k,n,R) ≅ 6.02 R − 10 log₁₀ (m_k* α_{k,n})

[Figure: plot of −10 log₁₀ α_{k,n} versus dimension k for several coding orders n.]

VQ-EC-22
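The plotted quantity −10 log₁₀ α_{k,n} has a closed form for this source (a sketch assuming unit variance; it uses the Gaussian formulas derived as properties (4)-(5) below, together with |K_k| = (1−ρ²)^{k−1} for an AR(1) source, and the fact that one step of conditioning already captures all the memory of a Markov source):

```python
import math

rho = 0.9
pi, e = math.pi, math.e

def alpha(k, n):
    """alpha_{k,n} for a unit-variance Gauss-Markov source."""
    if n == 0:
        # Fixed-rate: Gaussian beta_k with |K_k|^(1/k) = (1-rho^2)^((k-1)/k).
        return 2 * pi * ((k + 2) / k) ** ((k + 2) / 2) * (1 - rho ** 2) ** ((k - 1) / k)
    # n >= 1: the Markov property gives h_kn = (1/2) log2(2 pi e (1 - rho^2)).
    return 2 * pi * e * (1 - rho ** 2)

gain_db = {(k, n): -10 * math.log10(alpha(k, n)) for k in (1, 2, 4, 8) for n in (0, 1)}
```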
PROPERTIES OF DIFFERENTIAL ENTROPY AND THE ZADOR FACTOR η_k

Most are extensions of properties of the differential entropy of one random variable. Many proofs are similar to those of analogous properties for ordinary entropy.

Definitions:

    h(X₁,…,X_k) = −∫ f(x) log₂ f(x) dx
    h_k = (1/k) h(X₁,…,X_k) = kth-order differential entropy
    η_k = 2^{2 h_k} / σ² = Zador factor for variable-rate VQ

Note that h and h_k can be positive or negative!

VQ-EC-23

(1) h_k ≤ (1/2) log₂ σ² β_k and η_k ≤ β_k, where β_k is Zador's factor. Equality holds in each iff f(x) has the same value wherever it is not zero, e.g. if it is uniform.

Derivation:

    h_k = −(1/k) ∫ f(x) log₂ f(x) dx
        = ((k+2)/(2k)) ∫ f(x) log₂ f^{-2/(k+2)}(x) dx
        = ((k+2)/(2k)) E log₂ Y,  where Y = f^{-2/(k+2)}(X)
        ≤ ((k+2)/(2k)) log₂ EY    by Jensen's ineq. (log₂ is concave)
        = ((k+2)/(2k)) log₂ ∫ f(x) f^{-2/(k+2)}(x) dx
        = (1/2) log₂ ( ∫ f^{k/(k+2)}(x) dx )^{(k+2)/k}
        = (1/2) log₂ σ² β_k

    ⇒ η_k = 2^{2 h_k} / σ² ≤ β_k

Since log₂ is strictly concave, equality holds if and only if Y is constant w.p. 1; i.e. iff P(f^{-2/(k+2)}(X) = c) = 1 for some c, i.e. iff f(x) has the same value wherever it is not zero.

VQ-EC-24
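Property (1) is easy to confirm on densities whose moments have closed forms (hand-computed values for k = 1, used here as illustrative examples): for the uniform density the bound holds with equality, while for the exponential it is strict.

```python
import math

log2 = math.log2

# Uniform on [0,1]: h = 0, sigma^2 = 1/12, (integral of f^(1/3))^3 = 1, so beta_1 = 12.
h_unif = 0.0
bound_unif = 0.5 * log2((1 / 12) * 12)   # (1/2) log2(sigma^2 beta_1)

# Exponential(1): h = log2(e), sigma^2 = 1, (integral of f^(1/3))^3 = 27, so beta_1 = 27.
h_exp = log2(math.e)
bound_exp = 0.5 * log2(1 * 27)

# h_1 <= (1/2) log2(sigma^2 beta_1), equality only for the "flat" density.
```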
(2) If Y = aX + b, a ≠ 0, then h_{Y,k} = h_{X,k} + log₂ |a| and η_{Y,k} = η_{X,k}.

Derivation: Since f_Y(y) = (1/|a|^k) f_X((y−b)/a),

    h_{Y,k} = −(1/k) ∫ f_Y(y) log₂ f_Y(y) dy
            = −(1/k) ∫ (1/|a|^k) f_X((y−b)/a) log₂ [ (1/|a|^k) f_X((y−b)/a) ] dy
            = −(1/k) ∫ f_X(x) log₂ [ (1/|a|^k) f_X(x) ] dx,  letting x = (y−b)/a
            = −(1/k) ∫ f_X(x) log₂ f_X(x) dx − (1/k) ∫ f_X(x) log₂ (1/|a|^k) dx
            = h_{X,k} + log₂ |a|

Since σ_Y² = a² σ_X²,

    η_{Y,k} = 2^{2 h_{Y,k}} / σ_Y² = 2^{2 h_{X,k} + 2 log₂ |a|} / (a² σ_X²) = 2^{2 h_{X,k}} / σ_X² = η_{X,k}

VQ-EC-25

(3) (a) If Y = AX + b and A is a k × k nonsingular matrix, then

    h_{Y,k} = h_{X,k} + (1/k) log₂ |det A|  and  η_{Y,k} = η_{X,k} |det A|^{2/k} σ_X² / σ_Y²

(b) If A is orthogonal (i.e. A⁻¹ = Aᵗ), then h_{Y,k} = h_{X,k} and η_{Y,k} = η_{X,k}.

Derivation:

(a) Since f_Y(y) = |det A|⁻¹ f_X(A⁻¹(y−b)),

    h_{Y,k} = −(1/k) ∫ f_Y(y) log₂ f_Y(y) dy
            = −(1/k) ∫ |det A|⁻¹ f_X(A⁻¹(y−b)) log₂ [ |det A|⁻¹ f_X(A⁻¹(y−b)) ] dy
            = −(1/k) ∫ f_X(x) log₂ [ |det A|⁻¹ f_X(x) ] dx,  with x = A⁻¹(y−b)
            = −(1/k) ∫ f_X(x) log₂ f_X(x) dx + (1/k) log₂ |det A|
            = h_{X,k} + (1/k) log₂ |det A|

    η_{Y,k} = 2^{2 h_{Y,k}} / σ_Y² = 2^{2 h_{X,k} + (2/k) log₂ |det A|} / σ_Y² = η_{X,k} |det A|^{2/k} σ_X² / σ_Y²

(b) This follows from (a) and the facts that when A is orthogonal, |det A| = 1 and σ_Y² = σ_X².

VQ-EC-26
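Property (2) can be checked against the scalar Gaussian closed form h = (1/2) log₂ 2πeσ² (a sketch; any a ≠ 0 and any variance would do): the entropy shifts by log₂|a| while η is unchanged.

```python
import math

def h_gauss(var):
    """Differential entropy (bits) of a scalar Gaussian N(0, var)."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

def eta(h, var):
    """Zador factor 2^(2h) / sigma^2."""
    return 2 ** (2 * h) / var

var_x, a = 2.0, -3.0
var_y = a ** 2 * var_x                     # variance of Y = aX + b

shift = h_gauss(var_y) - h_gauss(var_x)    # should equal log2 |a|
eta_x = eta(h_gauss(var_x), var_x)
eta_y = eta(h_gauss(var_y), var_y)         # scale-invariant (both are 2*pi*e here)
```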
(4) Gaussian: If X = (X₁,…,X_k) is Gaussian with covariance matrix K, then

    h_k = (1/2) log₂ 2πe |K|^{1/k}  and  η_k = 2πe |K|^{1/k} / σ²

This formula for η_k is the same as the formula for β_k except that ((k+2)/k)^{(k+2)/2} is replaced by e. (Note: ((k+2)/k)^{(k+2)/2} → e as k → ∞.)

Derivation: We may assume X has zero mean, since previous properties show the mean has no effect on h_k or η_k. Let Y = AX, where A is the Karhunen-Loève transform. Then Y is Gaussian with covariance matrix Λ = AKAᵗ, which is diagonal, with diagonal elements equal to the eigenvalues λ₁,…,λ_k of K. Thus Y has independent components with variances λ₁,…,λ_k. Note that |K| = ∏_{i=1}^k λ_i. Since A is orthogonal,

    h_{X,k} = h_{Y,k} = −(1/k) ∫ f(y) log₂ f(y) dy
            = −(1/k) ∫ f(y) log₂ [ ∏_{i=1}^k (2πλ_i)^{-1/2} exp{−y_i²/(2λ_i)} ] dy
            = −(1/k) log₂ ∏_{i=1}^k (2πλ_i)^{-1/2} + (1/k) [ ∫ f(y) ∑_{i=1}^k y_i²/(2λ_i) dy ] log₂ e
            = −(1/k) log₂ (2π)^{-k/2} |K|^{-1/2} + (1/k) ∑_{i=1}^k [ E Y_i² / (2λ_i) ] log₂ e
            = (1/2) log₂ 2π |K|^{1/k} + (1/2) log₂ e
            = (1/2) log₂ 2πe |K|^{1/k}

VQ-EC-27

(5) If X has covariance matrix K, then

    h_k ≤ (1/2) log₂ 2πe |K|^{1/k}  and  η_k ≤ 2πe |K|^{1/k} / σ²

with equality iff X is Gaussian.

Derivation: We may assume X has zero mean, since the mean has no effect on h_k or η_k. Let f(x) be the density of X with covariance matrix K, and let g(x) be the Gaussian density with mean zero and covariance matrix K. We make the proof in two steps.

(a) h_k(f) ≤ −(1/k) ∫ f(x) log₂ g(x) dx:

    −(1/k) ∫ f(x) log₂ g(x) dx − h_k(f) = −(1/k) ∫ f(x) log₂ [ g(x)/f(x) ] dx
                                         ≥ −(1/(k ln 2)) ∫ f(x) [ g(x)/f(x) − 1 ] dx    since log z ≤ (z−1), i.e. log₂ z ≤ (z−1)/ln 2
                                         = −(1/(k ln 2)) [ ∫ g(x) dx − ∫ f(x) dx ]
                                         = −(1/(k ln 2)) (1 − 1) = 0

VQ-EC-28
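The KLT argument in (4) can be verified numerically on a 2×2 covariance, whose eigenvalues are available in closed form (a sketch; K = [[1, c], [c, 1]] with c = 0.9 is an arbitrary example): the determinant formula and the average of the independent KLT coefficients' entropies agree.

```python
import math

log2 = math.log2
two_pi_e = 2 * math.pi * math.e

c = 0.9                       # correlation in K = [[1, c], [c, 1]]
lam = (1 + c, 1 - c)          # eigenvalues of K
det_K = 1 - c ** 2            # |K| = product of eigenvalues

# h_k from the determinant formula of (4):
h_k_det = 0.5 * log2(two_pi_e * det_K ** 0.5)
# h_k by averaging the entropies of the independent KLT coefficients:
h_k_klt = sum(0.5 * log2(two_pi_e * l) for l in lam) / 2
```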
(b) −(1/k) ∫ f(x) log₂ g(x) dx = −(1/k) ∫ g(x) log₂ g(x) dx = (1/2) log₂ 2πe |K|^{1/k}:

    −(1/k) ∫ f(x) log₂ g(x) dx = −(1/k) E_f log₂ [ (2π)^{-k/2} |K|^{-1/2} exp{−(1/2) Xᵗ K⁻¹ X} ]
                               = (1/2) log₂ 2π |K|^{1/k} + (1/(2k)) E_f[Xᵗ K⁻¹ X] log₂ e

where E_f denotes expectation with respect to f. Since the expectation is a sum of terms of the form a_{ij} E[X_i X_j], it depends only on the covariance matrix K. Therefore, it will be the same if the expectation is taken with respect to g, because g has the same covariance matrix. Therefore,

    −(1/k) ∫ f(x) log₂ g(x) dx = (1/2) log₂ 2π |K|^{1/k} + (1/(2k)) E_g[Xᵗ K⁻¹ X] log₂ e
                               = −(1/k) ∫ g(x) log₂ g(x) dx
                               = (1/2) log₂ 2πe |K|^{1/k}    by (4)

VQ-EC-29

Definition: The conditional differential entropy of random variables X₁,…,X_k given random variables Y₁,…,Y_m is

    h(X₁,…,X_k | Y₁,…,Y_m) = −∫ f(x,y) log₂ f(x|y) dx dy

Most of the following properties are derived in the same way as the corresponding property for entropy.

(6) h(X|Y) ≤ h(X), with equality iff X and Y are independent.

Derivation: We'll show h(X) − h(X|Y) ≥ 0 with equality iff X is independent of Y.

    h(X) − h(X|Y) = −∫∫ f(x,y) log₂ f(x) dx dy + ∫∫ f(x,y) log₂ f(x|y) dx dy
                  = −(1/ln 2) ∫∫ f(x,y) ln [ f(x) f(y) / f(x,y) ] dx dy
                  ≥ −(1/ln 2) ∫∫ f(x,y) [ f(x) f(y) / f(x,y) − 1 ] dx dy    since ln z ≤ z − 1
                  = −(1/ln 2) [ ∫∫ f(x) f(y) dx dy − ∫∫ f(x,y) dx dy ] = 0

Equality holds if and only if f(x) f(y) = f(x,y) for all x, y; i.e. if and only if X and Y are independent.

VQ-EC-30
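Property (6) has a closed form for jointly Gaussian X and Y with correlation ρ: h(X|Y) = (1/2) log₂ 2πeσ²(1−ρ²). A quick check (an illustrative sketch, unit variance assumed) shows conditioning never increases differential entropy, with equality only at ρ = 0:

```python
import math

def h1(var):
    """Differential entropy (bits) of a scalar Gaussian with variance var."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)

sigma2 = 1.0
h_x = h1(sigma2)
# h(X|Y) for jointly Gaussian (X, Y) with correlation rho.
h_x_given_y = {rho: h1(sigma2 * (1 - rho ** 2)) for rho in (0.0, 0.5, 0.9)}
# Stronger dependence => smaller conditional entropy; rho = 0 gives equality.
```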
10 (7) h(y,...,y n X,,X m ) h(y,,y n X,,X m' ), 0 m' < m, with equality iff Y,,Y n is conditionally independent of X m'+,,x m given X,,X m'. Derivation: Similar to that of (6). (8) Chain rule: h(x,...,x k ) = h(x )+h(x 2 X )+H(X 3 X X 2 )+... +h(x k X X k- ) Derivation: Essentially the same proof as for the chain rule for ordinary entropy, but with H's replace by h's. (9) h(x,...,x k ) h(x ) h(x k ) with equality if and only if X i 's are independent Derivation: Essentially the same proof as for the analogous property for ordinary entropy. VQ-EC-3 Definitions: h k = k h(x,...,x k ) STATIONARY SOURCES h m = h(x n X n-m,x 2,,X n- ) ( h = h(x ) = h ) h k m = k h(xn+k- n Xn-m) n- (h k = h ) h = lim h k = differential entropy-rate of X k Properties: (0) h k+ h k Derivation: Follows from (7) and stationarity. () h k = k (h +h +h 2 + +h k- ) h k- h k- Derivation: Essentially the same as the analogous property for entropy. h k = k h(x,,x k ) = k (h(x )+h(x 2 X )+h(x 3 X,X 2 ) h(x k X,X 2,,X k- )) chain rule (8) = k (h +h +h 2 + +h k- ) by stationarity k (h k- + h k- + h k- + + h k- ) = h k- h k- by (0) VQ-EC-32
Chapter Concentration Inequalities I. Moment generating functions, the Chernoff method, and sub-gaussian and sub-exponential random variables a. Goal for this section: given a random variable X, how does
More informationTowards control over fading channels
Towards control over fading channels Paolo Minero, Massimo Franceschetti Advanced Network Science University of California San Diego, CA, USA mail: {minero,massimo}@ucsd.edu Invited Paper) Subhrakanti
More informationProbability Space. J. McNames Portland State University ECE 538/638 Stochastic Signals Ver
Stochastic Signals Overview Definitions Second order statistics Stationarity and ergodicity Random signal variability Power spectral density Linear systems with stationary inputs Random signal memory Correlation
More informationRandom Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R
In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample
More informationLecture 3: Functions of Symmetric Matrices
Lecture 3: Functions of Symmetric Matrices Yilin Mo July 2, 2015 1 Recap 1 Bayes Estimator: (a Initialization: (b Correction: f(x 0 Y 1 = f(x 0 f(x k Y k = αf(y k x k f(x k Y k 1, where ( 1 α = f(y k x
More informationLECTURE 3. Last time:
LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate
More information0, otherwise. Find each of the following limits, or explain that the limit does not exist.
Midterm Solutions 1, y x 4 1. Let f(x, y) = 1, y 0 0, otherwise. Find each of the following limits, or explain that the limit does not exist. (a) (b) (c) lim f(x, y) (x,y) (0,1) lim f(x, y) (x,y) (2,3)
More informationChapter 8: Differential entropy. University of Illinois at Chicago ECE 534, Natasha Devroye
Chapter 8: Differential entropy Chapter 8 outline Motivation Definitions Relation to discrete entropy Joint and conditional differential entropy Relative entropy and mutual information Properties AEP for
More informationPhysics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester
Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability
More informationChapter 3 sections. SKIP: 3.10 Markov Chains. SKIP: pages Chapter 3 - continued
Chapter 3 sections 3.1 Random Variables and Discrete Distributions 3.2 Continuous Distributions 3.3 The Cumulative Distribution Function 3.4 Bivariate Distributions 3.5 Marginal Distributions 3.6 Conditional
More informationDerivatives and Integrals
Derivatives and Integrals Definition 1: Derivative Formulas d dx (c) = 0 d dx (f ± g) = f ± g d dx (kx) = k d dx (xn ) = nx n 1 (f g) = f g + fg ( ) f = f g fg g g 2 (f(g(x))) = f (g(x)) g (x) d dx (ax
More informationCopyright c 2007 Jason Underdown Some rights reserved. quadratic formula. absolute value. properties of absolute values
Copyright & License Formula Copyright c 2007 Jason Underdown Some rights reserved. quadratic formula absolute value properties of absolute values equation of a line in various forms equation of a circle
More informationTEST CODE: MIII (Objective type) 2010 SYLLABUS
TEST CODE: MIII (Objective type) 200 SYLLABUS Algebra Permutations and combinations. Binomial theorem. Theory of equations. Inequalities. Complex numbers and De Moivre s theorem. Elementary set theory.
More informationStatistical signal processing
Statistical signal processing Short overview of the fundamentals Outline Random variables Random processes Stationarity Ergodicity Spectral analysis Random variable and processes Intuition: A random variable
More informationMath 209B Homework 2
Math 29B Homework 2 Edward Burkard Note: All vector spaces are over the field F = R or C 4.6. Two Compactness Theorems. 4. Point Set Topology Exercise 6 The product of countably many sequentally compact
More informationat Some sort of quantization is necessary to represent continuous signals in digital form
Quantization at Some sort of quantization is necessary to represent continuous signals in digital form x(n 1,n ) x(t 1,tt ) D Sampler Quantizer x q (n 1,nn ) Digitizer (A/D) Quantization is also used for
More informationHamiltonian Mechanics
Chapter 3 Hamiltonian Mechanics 3.1 Convex functions As background to discuss Hamiltonian mechanics we discuss convexity and convex functions. We will also give some applications to thermodynamics. We
More informationMULTIVARIATE PROBABILITY DISTRIBUTIONS
MULTIVARIATE PROBABILITY DISTRIBUTIONS. PRELIMINARIES.. Example. Consider an experiment that consists of tossing a die and a coin at the same time. We can consider a number of random variables defined
More informationHomework 2: Solution
0-704: Information Processing and Learning Sring 0 Lecturer: Aarti Singh Homework : Solution Acknowledgement: The TA graciously thanks Rafael Stern for roviding most of these solutions.. Problem Hence,
More information3. Identify and find the general solution of each of the following first order differential equations.
Final Exam MATH 33, Sample Questions. Fall 7. y = Cx 3 3 is the general solution of a differential equation. Find the equation. Answer: y = 3y + 9 xy. y = C x + C x is the general solution of a differential
More informationChapter 13. Convex and Concave. Josef Leydold Mathematical Methods WS 2018/19 13 Convex and Concave 1 / 44
Chapter 13 Convex and Concave Josef Leydold Mathematical Methods WS 2018/19 13 Convex and Concave 1 / 44 Monotone Function Function f is called monotonically increasing, if x 1 x 2 f (x 1 ) f (x 2 ) It
More informationMonte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan
Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis
More informationMultiple Random Variables
Multiple Random Variables This Version: July 30, 2015 Multiple Random Variables 2 Now we consider models with more than one r.v. These are called multivariate models For instance: height and weight An
More informationChapter 5,6 Multiple RandomVariables
Chapter 5,6 Multiple RandomVariables ENCS66 - Probabilityand Stochastic Processes Concordia University Vector RandomVariables A vector r.v. is a function where is the sample space of a random experiment.
More informationMath 131 Exam 2 Spring 2016
Math 3 Exam Spring 06 Name: ID: 7 multiple choice questions worth 4.7 points each. hand graded questions worth 0 points each. 0. free points (so the total will be 00). Exam covers sections.7 through 3.0
More information2.12: Derivatives of Exp/Log (cont d) and 2.15: Antiderivatives and Initial Value Problems
2.12: Derivatives of Exp/Log (cont d) and 2.15: Antiderivatives and Initial Value Problems Mathematics 3 Lecture 14 Dartmouth College February 03, 2010 Derivatives of the Exponential and Logarithmic Functions
More informationLecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable
Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed
More information1 Review of di erential calculus
Review of di erential calculus This chapter presents the main elements of di erential calculus needed in probability theory. Often, students taking a course on probability theory have problems with concepts
More information