EPSILON ENTROPY OF PROBABILITY DISTRIBUTIONS 1. Introduction EDWARD C. POSNER and EUGENE R. RODEMICH JET PROPULSION LABORATORY CALIFORNIA INSTITUTE OF TECHNOLOGY This paper summarizes recent work on the theory of epsilon entropy for probability distributions on complete separable metric spaces. The theory was conceived [3] in order to have a framework for discussing the quality of data storage and transmission systems. The concept of data source was defined in [4] as a probabilistic metric space: a complete separable metric space together with a probability distribution under which open sets are measurable, so that the Borel sets are measurable. An a partition of such a space is a partition by measurable a sets, which, depending on context, can be sets of diameter at most e or sets of radius at most ie, that is, sets contained in spheres of radius 2e. The entropy H(U) of a partition U is the Shannon entropy of the probability of the distribution consisting of the measures of the sets of the partition. The (one shot) epsilon entropy of X with distribution A, He;,,(X), is defined by (1.1) He;p(X) = inf {H (U); U an c partition} U and, except for roundoff in the entropy function, a term less than 1, H,;m(X) is the minimum expected number of bits necessary to describe X to within a when storage is not allowed. The inf in (1.1) was shown to be a min in [4]. For X a compact metric space, Kolmogorov's epsilon entropy H.(X) is defined as (1.2) He(X) = min {log card (U); U an c U partition} and, except for roundoff in the logarithm, is the minimum number of bits necessary to describe X to within e when words of fixed length are used. Suppose one does experiments from X independently and then attempts storage or transmission. That is, take a cartesian product X(') of X, with product measure A() and supremum metric. Thus, E sets in the product are the subsets This paper presents the results of one phase of research carried out under Contract NAS 7-100, sponsored by the National Aeronautics and Space Administration. 699
700 SIXTH BERKELEY SYMPOSIUM: POSNER AND RODEMICH of products of sets. This is the notion that insures that knowledge of outcomes to within in X(') forces knowledge to within c in each of the factors. Then the following limit can be proved to exist: (1.3) I.;,p(X) = lim - H (n) (X(")), n-.rx f and is called the absolute epsilon entropy of X. It represents the minimum expected number of bits per sample needed to describe a sequence of independent outcomes of X when arbitrary storage between experiments can be used. Similarly, define the absolute epsilon entropy of the compact metric space X as 1 n with the same definition for the metric on X('). (1.4) I,(X) = lim -H(X(n)), n-cx 2. Relations with channel coding It was shown in [3] that (2.1) Ie;,(X) = inf {I(p); pe R.(X)}, p where R.(X), in the case of the radius definition, is the class of probability distributions on X x X which are supported within 2e of the diagonal, with a more complicated definition for the diameter case. Here I(p) stands for mutual information. We do not know if the inf need be attained. However, (2.1), coupled with the continuity of Ie; m (X) from above in, proved in [3], allows us to prove a strong channel coding theorem and its converse [3]: "If K is a memoryless channel with capacity F less than Il,;}(X) (of finite capacity if IE;p,(X) is infinite), then it is not possible to transmit outcomes of X over K such that, with probability approaching 1, an arbitrarily large fraction of a long sequence of outcomes are known to within an error not much more than : But if r is greater than IE;m,(X) (assuming IE;,(X) finite), then is it is possible to transmit outcomes of X over the channel such that, with probability approaching 1, all the outcomes are known to within." 3. A useful inequality An important and useful inequality relating H.;,, and I,;, (3.1) H,;m(X) _ I4;p(X) + log+i+;is;(x) + C, where C is a universal constant. In other words, it was proved in [3]: doesn't help much to store independent experiments if the entropy is large. This result, coupled with (2.1), makes it easy to obtain asymptotic bounds on H.;,(X), which we shall do in subsequent sections.
4. Relation between I,;. and I1 EPSILON ENTROPY 701 Here is a surprising relation between I,;,p(X) and I.(X) for X compact [1]: (4.1) I8(X) = max I,,;,(X) U for all but countably many E, a condition which we now know cannot be removed. What this result means is that Nature can choose a p on X so bad that nothing can be saved by using variable length coding. The proof uses von Neumann's minimax theorem to prove, as an intermediate step, that (4.2) max I;.(X) = log-, where v(a) is the value of the zero sum two person game, which, in the radius case, has as its space of pure strategies the points of X, with payoff 0 or 1 to the first player; the payoff is 1 if and only if the points chosen by the two players have distance at most 'a. 5. Finite dimensional spaces The differential entropy of a density function p an Euclidean n space is defined, when it exists, as (5.1) H(p) = p log - dm(x), where m(x) is Lebesgue measure. The relation between differential entropy and * * S on epsilon entropy was considered in [2]. The metric can be any norm 11Isn space, where S, of Lebesgue measure v,, is a compact symmetric convex set in E', and is the unit sphere under 11 [.*s. Let p be a sufficiently nice density function on E', so that E' is a probabilistic metric space with probability p given by p(a) = SA P dim. Then as e -.0, 2 1 (5.2) H;,(E') = nlog- + log- + H(p) + C(S) + o(1) for a constant C(S) called the entropic packing constant. Furthermore, C(S) is between 0 and 1, and is 0 if and only if translates of S fill E'. Moreover, C(S) as a function of S is continuous in the Hausdorff metric, in which the distance between two compact sets is the maximum over the two sets of the distance of a point in one of the two sets from the other set. A somewhat analogous result holds for H, of the unit n cube, but the analogous C'(S), the deterministic packing constant, is not bounded by 1 but rather can be as large as (1-0(1)) log n as n -. oo.
702 SIXTH BERKELEY SYMPOSIUM: POSNER AND RODEMICH If a Borel probability p on E' with Euclidean distance has mean 0 and a second moment a2 = EI XII2, H.;,.(E`) is finite, even though (5.2) does not hold [4]; in fact, for small e, (5.3) H.;,(E") _ n log1 + 2 log n + n log a + 1 + log (2 + A7). The normal distribution comes under either of those two cases. When n = 1, the unique minimizing partition is known [6]. It is the partition by consecutive intervals of length such that the mean is in the center of one of the intervals. The proof is hard; a simpler one would be nice. For n > 1, minimizing partitions are not known, even for the independent normal of equal variances. 6. Epsilon entropy in L2 [0, 1] The bound of (5.3) depends on n, and, in fact, examples can be given that show this dependence can actually occur. It is not surprising then, that if L2 [0, 1] is made into a probabilistic metric space by the measure induced by a mean continuous separable stochastic process on the unit interval, then the epsilon entropy can be infinite, even though the expectation of 11x112 is always finite for such a process. In fact, [4] proved that a given convergent sequence {A,} of nonnegative numbers written in nonincreasing order is the set of eigenvalues of some mean continuous stochastic process on the unit interval of infinite epsilon entropy for some e > 0 if and only if (6.1) YnA =. Conversely, if 1 na,, = so, there is a process with infinite epsilon entropy for every e > 0. Thus, a slightly stronger condition than finite second moment is necessary to insure finite epsilon entropy in the infinite dimensional case. 7. Product entropy of Gaussian processes In this section, we shall consider the definition of product entropy if X = L2 [0, 1], but only for mean continuous Gaussian processes. We shall defer the case ofthe epsilon entropy of Gaussian processes until Section 9. Product entropy J1(X) is defined as the minimum entropy over all product epsilon partitions of L2 [0, 1]. A product epsilon partition is a (countable) epsilon partition of all of X except a set of probability zero by sets which are hyperrectangles (with respect to eigenfunction coordinates) of diagonal epsilon. Thus, (7.1) J.(X) > H.(X). Surprisingly, J.(X) is infinite for one E if and only if it is infinite for all a, and is finite if and only if the "entropy of the eigenvalues" E A. log 1/in is finite, where the eigenvalues are written in nonincreasing order [5]. We have no good explanation of why the entropy of the eigenvalues occurs as the condition for
EPSILON ENTROPY 703 finite epsilon entropy. Furthermore, Y A,, log 1/i,, is necessary and sufficient in order that there be a product epsilon partition, and is also the condition that there be a hyperrectangle of positive probability and finite diameter. Incidentally, J8(X) depends only on the eigenvalues of the Gaussian process, as do H8;m(X) and I,;m (X), where p is the measure on L2 [0, 1] induced by the mean continuous Gaussian process. In [6], product entropy is estimated rather precisely in terms of the eigenvalues and the optimum product partitions found by a variational argument. By remarks of Section 5, the optimal partitions are products of centered partitions on each coordinate axis. The interpretation of product entropy is as follows. One wishes to transmit sample functions of the process so that one knows outcomes to within in the L2 norm, but only wishes to consider methods which involve correlating the sample function with the eigenfunction and then sending a quantized version of these correlations. Since the diagonal of the product sets has diameter E, the method guarantees knowledge of the sample function to within. Unfortunately, as we shall see, this method is not very good compared to the optimal compression schemes which are not restricted to product partitions but can use arbitrary c partitions. In [5], conditions are given which guarantee either (7.2) JE(X) = O(H.;,p(X)) or - (7.3) ~~~~~~JE (X ) He.;,.,(X ) as a -e 0 for a mean continuous Gaussian process. The condition for (7.2) is that the sum of the eigenvalues beyond the nth (in nonincreasing order) be 0(n).). For (7.3), the condition is that the sum be o(na,,). In the first case, in fact, (7.4) Je(X) O(L4(X)) = and in the second (7.5) ~~~~J.(X) - L. (X), where L.(X) is a general lower bound for the epsilon entropy of a mean continuous Gaussian process to be discussed later. For a stationary band limited Gaussian process on the unit interval with continuous spectral density, - (7.6) A. n-(cn) C constant, as n - oo. Thus, the o(na,,) condition is satisfied and L,(X) evaluated. The final result is (7.7) J, (X) 1 2 log log - 6 can be
704 SIXTH BERKELEY SYMPOSIUM: POSNER AND RODEMICH as a e 0. Now (7.8) JE(X) nlog1 as e 0 if the process has only finitely many nonzero eigenvalues, n in number. So in the case at hand where there are infinitely many nonzero eigenvalues, the growth of J,(X) had to be faster than any constant times log (1/E). The rate of growth given by (7.8) is not much faster, however. This is an expression of the fact that the sample functions from such a process are entire functions, hence not very random, and they should be easy to approximate. 8. Entropy in C[0, 1] The case of C[O, 1] is much more difficult than L2[0, 1], partly because it is hard to determine whether a given process has continuous paths. However, in [7] it is proved that if the mean continuous separable stochastic process on [0, 1] satisfies, (8.1) E(x(0))2 < A, E(x(s) -_x(t))2 < Als - tia, S, t E [0, 1], for some A _ 0 1 < a < 2, then the paths are continuous with probability 1, a known result, and, if p is the measure induced by the process on C [0, 1], then (8.2) H.;p- C(a)Allag 21a for E < Ki, where C(a) depends only on a. The proof was achieved by constructing an a partition of C[O, 1] using uniformly spread points on [0, 1] and facts about the modulus of continuity of the process that are forced by the given conditions on the covariance function. Conditions for finite entropy in function spaces other than L2[0, 1] and C[O, 1] have not been looked for or found, even in the case of Gaussian processes. In that case, a bound like (8.2) is found for C[O, 1], which is valid under weaker conditions on the covariance function of the process. Multidimensional time processes have not been considered at all. 9. Bounds for L2 Gaussian processes Various lower bounds have been found for L2 Gaussian processes. The first is the bound L, of [6]. This bound is defined as follows. Let {in} in nonincreasing order be the eigenvalues of the process. Define b = b (e) for 8 > 0 by "_---_ 2 2 (9.1) 1 + b(c) = 8 < A, b(e) = O. 82 >
Then EPSILON ENTROPY 705 (9.2) Le(X) = 2 E log [1 + A,,b(e)] is a lower bound for HE;p(X). It was derived from the obvious inequality 9.3) H.;,,(X) > Elog 1 and so is quite weak. (The term Se(x) is the ball of center x and radius a.) A stronger bound M.(X), never any worse, is () 1 /i\1 (9.4) M.(X) = L/2(X) - -2b -). 8 2J This bound is derived by bounding the probability density drop in a Gaussian distribution under translation, and using the fact, proved in [6], that the sphere of radius j2 about the origin in L2 under a Gaussian distribution is at least as probable as any set of diameter s. The Me (X) bound is the best asymptotic lower bound we have for arbitrary L2 Gaussian processes, but, for special ones, the next section gives an improvement. The L.(X) lower bound was introduced chiefly because it is also proved in [5] that (9.5) H1 (X) S Lme(X) for any m < 2 This difficult proof uses products of partitions of finite dimensional eigensubspaces of the process where the dimension increases without bound. In each finite dimensional subspace, "shell" partitions are used, partitions which are composed of partitions between regions of concentric spheres of properly varying radii. The deterministic epsilon entropy of the n sphere in Euclidean space needed to be estimated in the proof. This proves the finiteness of H.(X) for X a mean continuous Gaussian process on [0, 1]. From results of Section 3, we know HE(X) S IE(X). The measure p on X x X, which is defined by choosing a point of the first factor according to p and then assigning probability in SE,2 (x) according to j, has, as in [1], mutual information (9.6) I(p) < E(log[i'0]). Thus, for any probabilistic metric space, (9.7) H.;.,(X) S~E log p[i(c)) This, coupled with (9.3), gives a pair of bounds valid in general, but not readily computable.
706 SIXTH BERKELEY SYMPOSIUM: POSNER AND RODEMICH 10. Examples of L2 entropy bounds The bounds of the previous section lead to some interesting asymptotic expressions which yield quick answers when the eigenvalues of the process are known asymptotically. For example, a Gaussian process arising as the solution of a linear differential equation with constant coefficients driven by white Gaussian noise can be handled. For such processes, [8] shows that (10.1) 2A, - An-P, p > 1,A > 0, and shows how to find A and p from the equation with simple calculations. The results of the previous section then yield (10.2) M.e(X) _ He(X) < L.12(x), (10.3) Me(X) (p - 1) (22/(p-1)-)(1)lI(P-l) (10.4) L ((X) si )P 11 For example, if X is the Wiener process (10.5) E[X(s)X(t)] = min (s, t), 8, t E [0, 1], then (10.6) 2(n1/)2' n> 1, and so A = 1/r2,p = 2, and 11 (10.7) 2 S H.e(X) 2 A better lower bound valid in general for L2 is given by (10.8) (10.8) H.4(X). M. (X) + ZE{ (x ~~~~~~~~2 {[ + Anq(X)]2} =N.(X), say, where q(x) = 0 tixil < E, and (10.9) [E 2 [e + A~%q(x)]2-lil>. 2 i Zx 1 4 > E.
EPSILON ENTROPY 707 Here x,, denotes the component of x along the nth eigenfunction. Equation (10.8) is a refined version of (9.4), and is useful only when the eigenvalues decrease slowly enough so that q(x) is almost deterministic. The condition turns out to be satisfied if (10.1) is, and allows Ne(X) to be found asymptotically as 2/(p- 1 (10.10) /(p- 1)] NE(X) (p - 1 ) (P)2 + 2pP(P2 Thus, for X the Wiener process, (10.7) can be improved to (10.11) 17 < HE A The result given by (10.11) is all we know about the entropy of the Wiener process. Our lower bounds just are not good enough to prove our conjecture (10.12) HE(X) - 2 for X the Wiener process. Notice, however, that for the Wiener process on C[0, 1], where one might expect the entropy to be much larger because of the more stringent covering requirement, Section 8 yields (10.13) H,(X in C[0, 1]) = O(Hr(X in L2 [0, 1])). In fact, (10.13) holds for any Gaussian process satisfying the eigenvalue condition (10.1). With this surprising result we close the paper. REFERENCES [1] R. J. McELIECE and E. C. POSNER, "Hide and seek, data storage, and entropy," Ann. Math. Statist., Vol. 2 (1971), pp. 1706-1716. [2] E. C. POSNER and E. R. RODEMICH, "Differential entropy and tiling," J. Statist. Phys., Vol. 1 (1969), pp. 57-69. [3] "Epsilon entropy and data compression," Ann. Math. Statist., Vol. 42 (1971), pp. 2079-2125. [4] E. C. POSNER, E. R. RODEMICH, and H. RuMSEY, JR., "Epsilon entropy of stochastic processes," Ann. Math. Statist., Vol. 38 (1967), pp. 1000-1020. [5] "Product entropy of Gaussian distributions," Ann. Math. Statist., Vol. 40 (1969), pp. 870-904. [6] "Epsilon entropy of Gaussian distributions," Ann. Math. Statist., Vol. 40 (1969), pp. 1272-1296. [7] E. R. RODEMICH and E. C. POSNER, "Epsilon entropy of stochastic processes with continuous paths," in preparation. [8] H. WIDOM, "Asymptotic behavior of eigenvalues of certain integral operators," Arch. Rational Mech. Anal., Vol. 17 (1967), pp. 215-229.