EPSILON ENTROPY OF PROBABILITY DISTRIBUTIONS

EDWARD C. POSNER and EUGENE R. RODEMICH
JET PROPULSION LABORATORY, CALIFORNIA INSTITUTE OF TECHNOLOGY

(This paper presents the results of one phase of research carried out under Contract NAS 7-100, sponsored by the National Aeronautics and Space Administration.)

1. Introduction

This paper summarizes recent work on the theory of epsilon entropy for probability distributions on complete separable metric spaces. The theory was conceived [3] in order to have a framework for discussing the quality of data storage and transmission systems. The concept of data source was defined in [4] as a probabilistic metric space: a complete separable metric space together with a probability distribution under which open sets are measurable, so that the Borel sets are measurable. An epsilon partition of such a space is a partition by measurable sets which, depending on context, can be sets of diameter at most ε or sets of radius at most ε/2, that is, sets contained in spheres of radius ε/2. The entropy H(U) of a partition U is the Shannon entropy of the probability distribution consisting of the measures of the sets of the partition. The (one shot) epsilon entropy of X with distribution μ, H_{ε;μ}(X), is defined by

(1.1) H_{ε;μ}(X) = inf {H(U) : U an ε partition of X},

and, except for roundoff in the entropy function (a term less than 1), H_{ε;μ}(X) is the minimum expected number of bits necessary to describe X to within ε when storage is not allowed. The inf in (1.1) was shown to be a min in [4]. For X a compact metric space, Kolmogorov's epsilon entropy H_ε(X) is defined as

(1.2) H_ε(X) = min {log card(U) : U an ε partition of X},

and, except for roundoff in the logarithm, is the minimum number of bits necessary to describe X to within ε when words of fixed length are used.
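To make the definition (1.1) concrete, the following sketch (not from the paper) evaluates H_{ε;μ} by brute force for a five point subset of the line, using the diameter form of an ε partition; the point set, the probabilities, and the base 2 logarithms are illustrative assumptions.

    # A minimal sketch (not from the paper): brute force evaluation of the one shot
    # epsilon entropy (1.1) for a small finite metric space, using the diameter form
    # of an epsilon partition.  The point set, metric, and probabilities are
    # illustrative assumptions; logarithms are base 2, so the answer is in bits.
    import math
    from itertools import combinations

    points = [0.0, 0.1, 0.45, 0.55, 1.0]      # five points on the real line
    prob = [0.3, 0.2, 0.2, 0.2, 0.1]          # their probabilities
    dist = lambda a, b: abs(a - b)            # the metric

    def partitions(items):
        """Generate every set partition of a list of indices."""
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for smaller in partitions(rest):
            for i, block in enumerate(smaller):
                yield smaller[:i] + [[first] + block] + smaller[i + 1:]
            yield [[first]] + smaller

    def diameter(block):
        return max((dist(points[i], points[j]) for i, j in combinations(block, 2)),
                   default=0.0)

    def shannon_entropy(masses):
        return -sum(m * math.log2(m) for m in masses if m > 0)

    def eps_entropy(eps):
        best = math.inf
        for part in partitions(list(range(len(points)))):
            if all(diameter(b) <= eps for b in part):
                masses = [sum(prob[i] for i in b) for b in part]
                best = min(best, shannon_entropy(masses))
        return best

    for eps in (0.05, 0.15, 0.5, 1.0):
        print(f"eps = {eps:4.2f}   H_eps = {eps_entropy(eps):.3f} bits")

As ε grows, more points can share a partition set and the computed entropy drops to zero once the whole space has diameter at most ε.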

Suppose one does experiments from X independently and then attempts storage or transmission. That is, take the cartesian product X^(n) of X, with product measure μ^(n) and supremum metric. Thus, ε sets in the product are the subsets of products of ε sets. This is the notion that insures that knowledge of outcomes to within ε in X^(n) forces knowledge to within ε in each of the factors. Then the following limit can be proved to exist:

(1.3) I_{ε;μ}(X) = lim_{n→∞} (1/n) H_{ε;μ^(n)}(X^(n)),

and is called the absolute epsilon entropy of X. It represents the minimum expected number of bits per sample needed to describe a sequence of independent outcomes of X when arbitrary storage between experiments can be used. Similarly, define the absolute epsilon entropy of the compact metric space X as

(1.4) I_ε(X) = lim_{n→∞} (1/n) H_ε(X^(n)),

with the same definition for the metric on X^(n).

2. Relations with channel coding

It was shown in [3] that

(2.1) I_{ε;μ}(X) = inf {I(p) : p ∈ R_ε(X)},

where R_ε(X), in the case of the radius definition, is the class of probability distributions on X × X with first marginal μ which are supported within ε/2 of the diagonal, with a more complicated definition for the diameter case. Here I(p) stands for mutual information. We do not know whether the inf need be attained. However, (2.1), coupled with the continuity of I_{ε;μ}(X) from above in ε, proved in [3], allows us to prove a strong channel coding theorem and its converse [3]: "If K is a memoryless channel with capacity Γ less than I_{ε;μ}(X) (or of finite capacity if I_{ε;μ}(X) is infinite), then it is not possible to transmit outcomes of X over K such that, with probability approaching 1, an arbitrarily large fraction of a long sequence of outcomes are known to within an error not much more than ε. But if Γ is greater than I_{ε;μ}(X) (assuming I_{ε;μ}(X) finite), then it is possible to transmit outcomes of X over the channel such that, with probability approaching 1, all the outcomes are known to within ε."

3. A useful inequality

An important and useful inequality relating H_{ε;μ} and I_{ε;μ} was proved in [3]:

(3.1) H_{ε;μ}(X) ≤ I_{ε;μ}(X) + log⁺ I_{ε;μ}(X) + C,

where C is a universal constant. In other words, it doesn't help much to store independent experiments if the entropy is large. This result, coupled with (2.1), makes it easy to obtain asymptotic bounds on H_{ε;μ}(X), which we shall do in subsequent sections.
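The characterization (2.1) can be approximated numerically on a finite space. The sketch below is an illustration, not the paper's construction: it uses Blahut-Arimoto style alternating minimization over couplings whose first marginal is μ and which place no mass on pairs farther apart than ε/2 (the radius form); the point set and μ are the same made up example used earlier.

    # A minimal sketch (not from the paper): approximating the right side of (2.1)
    # on a finite metric space by Blahut-Arimoto style alternating minimization.
    # The coupling p(x, y) has first marginal mu and puts no mass on pairs with
    # d(x, y) > eps/2 (the radius form of R_eps(X)); we minimize its mutual
    # information.  Points, metric, and mu are illustrative assumptions.
    import numpy as np

    points = np.array([0.0, 0.1, 0.45, 0.55, 1.0])
    mu = np.array([0.3, 0.2, 0.2, 0.2, 0.1])
    eps = 0.3

    # allowed[x, y] = 1 exactly when y may represent x, i.e. d(x, y) <= eps/2
    allowed = (np.abs(points[:, None] - points[None, :]) <= eps / 2).astype(float)

    q = np.full(len(points), 1.0 / len(points))   # output marginal, initial guess
    for _ in range(2000):
        w = allowed * q[None, :]                  # unnormalized p(y | x)
        cond = w / w.sum(axis=1, keepdims=True)   # p(y | x), each row sums to 1
        q = mu @ cond                             # output marginal of the coupling

    mask = cond > 0                               # mutual information, in bits
    info = np.sum((mu[:, None] * cond)[mask] * np.log2((cond / q[None, :])[mask]))
    print(f"approximate inf of I(p) over R_eps(X): {info:.3f} bits")

For this example the allowed pairs split the five points into three groups, and the iteration settles on the entropy of the group probabilities, as one would expect.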

4. Relation between I_{ε;μ} and I_ε

Here is a surprising relation between I_{ε;μ}(X) and I_ε(X) for X compact [1]:

(4.1) I_ε(X) = max_μ I_{ε;μ}(X)

for all but countably many ε, a condition which we now know cannot be removed. What this result means is that Nature can choose a μ on X so bad that nothing can be saved by using variable length coding. The proof uses von Neumann's minimax theorem to prove, as an intermediate step, that

(4.2) max_μ I_{ε;μ}(X) = log (1/v(ε)),

where v(ε) is the value of the zero sum two person game which, in the radius case, has as its space of pure strategies the points of X, with payoff 0 or 1 to the first player; the payoff is 1 if and only if the points chosen by the two players are at distance at most ε/2.

5. Finite dimensional spaces

The differential entropy of a density function p on Euclidean n space is defined, when it exists, as

(5.1) H(p) = ∫ p(x) log (1/p(x)) dm(x),

where m is Lebesgue measure. The relation between differential entropy and epsilon entropy was considered in [2]. The metric can be any norm ‖·‖_S on Euclidean n space, where S, of Lebesgue measure v_n, is a compact symmetric convex set in E^n and is the unit sphere under ‖·‖_S. Let p be a sufficiently nice density function on E^n, so that E^n is a probabilistic metric space with probability μ given by μ(A) = ∫_A p dm. Then as ε → 0,

(5.2) H_{ε;μ}(E^n) = n log (2/ε) + log (1/v_n) + H(p) + C(S) + o(1)

for a constant C(S) called the entropic packing constant. Furthermore, C(S) is between 0 and 1, and is 0 if and only if translates of S fill E^n. Moreover, C(S) as a function of S is continuous in the Hausdorff metric, in which the distance between two compact sets is the maximum over the two sets of the distance of a point in one of the sets from the other set. A somewhat analogous result holds for H_ε of the unit n cube, but the analogous C'(S), the deterministic packing constant, is not bounded by 1 but rather can be as large as (1 - o(1)) log n as n → ∞.
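As a check on (5.2) in the simplest case, here is a sketch (not from the paper): take n = 1 and S = [-1, 1], so the norm is the absolute value, v_1 = 2, and C(S) = 0 since translates of an interval tile the line; the density p is taken to be the standard normal, an illustrative assumption.

    # A minimal sketch (not from the paper): a one dimensional check of (5.2).
    # With n = 1 and S = [-1, 1] the norm is |.|, v_1 = 2, and C(S) = 0 because
    # translates of an interval tile the line, so (5.2) predicts, in natural units,
    #     H_{eps;mu} = log(2/eps) + log(1/2) + H(p) + o(1)
    #                = log(1/eps) + 0.5*log(2*pi*e) + o(1)
    # for the standard normal density p.  The partition into consecutive intervals
    # of length eps realizes this value up to o(1).
    import math

    def normal_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def interval_partition_entropy(eps, span=12.0):
        """Entropy (nats) of the partition of the line into intervals of length eps."""
        h = 0.0
        k = -int(span / eps)
        while k * eps < span:
            mass = normal_cdf((k + 1) * eps) - normal_cdf(k * eps)
            if mass > 0.0:
                h -= mass * math.log(mass)
            k += 1
        return h

    for eps in (0.5, 0.1, 0.02):
        predicted = math.log(1.0 / eps) + 0.5 * math.log(2.0 * math.pi * math.e)
        print(f"eps = {eps:5.2f}   partition entropy = "
              f"{interval_partition_entropy(eps):.4f}   prediction = {predicted:.4f}")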

If a Borel probability μ on E^n with Euclidean distance has mean 0 and finite second moment σ² = E‖X‖², then H_{ε;μ}(E^n) is finite, even though (5.2) need not hold [4]; in fact, for small ε,

(5.3) H_{ε;μ}(E^n) ≤ n log (1/ε) + 2 log n + n log σ + 1 + log (2 + √7).

The normal distribution comes under either of these two cases. When n = 1, the unique minimizing partition is known [6]. It is the partition by consecutive intervals of length ε such that the mean is at the center of one of the intervals. The proof is hard; a simpler one would be nice. For n > 1, minimizing partitions are not known, even for the independent normal with equal variances.

6. Epsilon entropy in L2[0, 1]

The bound of (5.3) depends on n, and, in fact, examples can be given which show that this dependence actually occurs. It is not surprising, then, that if L2[0, 1] is made into a probabilistic metric space by the measure induced by a mean continuous separable stochastic process on the unit interval, then the epsilon entropy can be infinite, even though the expectation of ‖x‖² is always finite for such a process. In fact, [4] proved that a given summable sequence {λ_n} of nonnegative numbers written in nonincreasing order is the set of eigenvalues of some mean continuous stochastic process on the unit interval of infinite epsilon entropy for some ε > 0 if and only if

(6.1) Σ_n n λ_n = ∞.

Conversely, if Σ_n n λ_n = ∞, there is a process with infinite epsilon entropy for every ε > 0. Thus, a slightly stronger condition than finite second moment is necessary to insure finite epsilon entropy in the infinite dimensional case.
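A small numerical illustration (not from the paper) of condition (6.1): both hypothetical eigenvalue sequences below are summable, so the corresponding second moments are finite, but only for the first does Σ n λ_n diverge.

    # A minimal sketch (not from the paper) of condition (6.1).  Both hypothetical
    # eigenvalue sequences below are summable, so E||x||^2 is finite in both cases,
    # but sum n*lambda_n diverges only for the first; by (6.1) only the first can be
    # the eigenvalue sequence of a mean continuous process of infinite epsilon
    # entropy.
    import math

    def lam_slow(n):          # lambda_n = 1/n^2: sum n*lambda_n = sum 1/n diverges
        return 1.0 / n ** 2

    def lam_fast(n):          # lambda_n = 1/(n^2 log^2(n+1)): sum n*lambda_n converges
        return 1.0 / (n ** 2 * math.log(n + 1) ** 2)

    for N in (10 ** 2, 10 ** 4, 10 ** 6):
        s1 = sum(n * lam_slow(n) for n in range(1, N + 1))
        s2 = sum(n * lam_fast(n) for n in range(1, N + 1))
        print(f"N = {N:>7}   partial sums of n*lambda_n: {s1:8.3f} (diverging)"
              f"   {s2:6.3f} (converging)")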

7. Product entropy of Gaussian processes

In this section, we consider the definition of product entropy when X = L2[0, 1], but only for mean continuous Gaussian processes. We defer the epsilon entropy of Gaussian processes until Section 9. Product entropy J_ε(X) is defined as the minimum entropy over all product epsilon partitions of L2[0, 1]. A product epsilon partition is a (countable) epsilon partition of all of X, except a set of probability zero, by sets which are hyperrectangles (with respect to eigenfunction coordinates) of diagonal epsilon. Thus,

(7.1) J_ε(X) ≥ H_{ε;μ}(X).

Surprisingly, J_ε(X) is infinite for one ε if and only if it is infinite for all ε, and is finite if and only if the "entropy of the eigenvalues" Σ_n λ_n log (1/λ_n) is finite, where the eigenvalues are written in nonincreasing order [5]. We have no good explanation of why the entropy of the eigenvalues occurs as the condition for finite product entropy. Furthermore, finiteness of Σ_n λ_n log (1/λ_n) is necessary and sufficient in order that there be a product epsilon partition at all, and is also the condition that there be a hyperrectangle of positive probability and finite diameter. Incidentally, J_ε(X) depends only on the eigenvalues of the Gaussian process, as do H_{ε;μ}(X) and I_{ε;μ}(X), where μ is the measure on L2[0, 1] induced by the mean continuous Gaussian process.

In [6], product entropy is estimated rather precisely in terms of the eigenvalues, and the optimum product partitions are found by a variational argument. By the remarks of Section 5, the optimal partitions are products of centered partitions on each coordinate axis. The interpretation of product entropy is as follows. One wishes to transmit sample functions of the process so that outcomes are known to within ε in the L2 norm, but one only wishes to consider methods which involve correlating the sample function with the eigenfunctions and then sending a quantized version of these correlations. Since the diagonal of the product sets has length ε, the method guarantees knowledge of the sample function to within ε. Unfortunately, as we shall see, this method is not very good compared to the optimal compression schemes, which are not restricted to product partitions but can use arbitrary ε partitions.

In [5], conditions are given which guarantee either

(7.2) J_ε(X) = O(H_{ε;μ}(X))

or

(7.3) J_ε(X) ~ H_{ε;μ}(X)

as ε → 0 for a mean continuous Gaussian process. The condition for (7.2) is that the sum of the eigenvalues beyond the nth (in nonincreasing order) be O(nλ_n). For (7.3), the condition is that the sum be o(nλ_n). In the first case, in fact,

(7.4) J_ε(X) = O(L_ε(X)),

and in the second

(7.5) J_ε(X) ~ L_ε(X),

where L_ε(X) is a general lower bound for the epsilon entropy of a mean continuous Gaussian process to be discussed later. For a stationary band limited Gaussian process on the unit interval with continuous spectral density,

(7.6) λ_n ~ n^{-cn}, c a constant, as n → ∞.

Thus, the o(nλ_n) condition is satisfied and L_ε(X) can be evaluated. The final result is

(7.7) J_ε(X) ~ (1/c) [log (1/ε)]² / log log (1/ε)

as ε → 0.
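Before continuing, here is a quick numerical look (not from the paper) at the two eigenvalue conditions used above, for the hypothetical sequence λ_n = n^{-p} with p = 2: its eigenvalue entropy Σ λ_n log (1/λ_n) is finite, and its tail sums are O(nλ_n) but not o(nλ_n), so it falls under case (7.2).

    # A minimal sketch (not from the paper): the two eigenvalue conditions of this
    # section for the hypothetical sequence lambda_n = n^(-p), p = 2.  The
    # "entropy of the eigenvalues" sum lambda_n*log(1/lambda_n) is finite, so
    # J_eps(X) is finite, and the tail sum beyond n is O(n*lambda_n) but not
    # o(n*lambda_n): the ratio tends to 1/(p - 1) = 1, which is case (7.2).
    import math

    p = 2.0
    lam = [n ** (-p) for n in range(1, 200001)]

    eig_entropy = sum(v * math.log(1.0 / v) for v in lam)
    print(f"partial eigenvalue entropy sum lam*log(1/lam) ~= {eig_entropy:.4f}")

    total = sum(lam)
    for n in (10, 100, 1000):
        tail = total - sum(lam[:n])
        print(f"n = {n:>5}   tail beyond n / (n*lambda_n) = {tail / (n * lam[n - 1]):.4f}")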

Now

(7.8) J_ε(X) ~ n log (1/ε)

as ε → 0 if the process has only finitely many nonzero eigenvalues, n in number. So in the case at hand, where there are infinitely many nonzero eigenvalues, the growth of J_ε(X) had to be faster than any constant times log (1/ε). The rate of growth given by (7.7) is not much faster, however. This is an expression of the fact that the sample functions of such a process are entire functions, hence not very random, and they should be easy to approximate.

8. Entropy in C[0, 1]

The case of C[0, 1] is much more difficult than L2[0, 1], partly because it is hard to determine whether a given process has continuous paths. However, in [7] it is proved that if a mean continuous separable stochastic process on [0, 1] satisfies

(8.1) E(x(0))² ≤ A,   E(x(s) - x(t))² ≤ A|s - t|^α,   s, t ∈ [0, 1],

for some A ≥ 0 and 1 < α < 2, then the paths are continuous with probability 1, a known result, and, if μ is the measure induced by the process on C[0, 1], then

(8.2) H_{ε;μ} ≤ C(α) A^{1/α} / ε^{2/α}   for ε < A^{1/2},

where C(α) depends only on α. The proof was achieved by constructing an ε partition of C[0, 1] using uniformly spread points on [0, 1] and facts about the modulus of continuity of the process that are forced by the given conditions on the covariance function. Conditions for finite entropy in function spaces other than L2[0, 1] and C[0, 1] have not been looked for or found, even in the case of Gaussian processes. In that case, however, a bound like (8.2) for C[0, 1] is valid under weaker conditions on the covariance function of the process. Multidimensional time processes have not been considered at all.

9. Bounds for L2 Gaussian processes

Various lower bounds have been found for L2 Gaussian processes. The first is the bound L_ε of [6]. This bound is defined as follows. Let {λ_n}, in nonincreasing order, be the eigenvalues of the process. Define b = b(ε) for ε > 0 by

(9.1) Σ_n λ_n / (1 + b(ε) λ_n) = ε²  if ε² < Σ_n λ_n;   b(ε) = 0  if ε² ≥ Σ_n λ_n.
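Equation (9.1) determines b(ε) implicitly; since Σ λ_n/(1 + bλ_n) decreases continuously from Σ λ_n to 0 as b grows, b(ε) can be found by bisection. The sketch below (not from the paper) does this for a hypothetical truncated eigenvalue sequence.

    # A minimal sketch (not from the paper): solving (9.1) for b(eps) numerically.
    # The map b -> sum lambda_n/(1 + b*lambda_n) is continuous and strictly
    # decreasing from sum lambda_n to 0, so bisection finds b(eps) whenever
    # eps^2 < sum lambda_n.  The truncated eigenvalue sequence lambda_n = n^(-2)
    # is an illustrative assumption.
    def constrained_sum(lam, b):
        return sum(v / (1.0 + b * v) for v in lam)

    def solve_b(lam, eps):
        target = eps * eps
        if target >= sum(lam):
            return 0.0
        lo, hi = 0.0, 1.0
        while constrained_sum(lam, hi) > target:     # bracket the root
            hi *= 2.0
        for _ in range(200):                         # bisection
            mid = 0.5 * (lo + hi)
            if constrained_sum(lam, mid) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    lam = [n ** -2.0 for n in range(1, 5001)]
    for eps in (1.0, 0.3, 0.1, 0.03):
        b = solve_b(lam, eps)
        print(f"eps = {eps:5.2f}   b(eps) = {b:14.3f}"
              f"   check: sum = {constrained_sum(lam, b):.6f}")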

With b(ε) so defined,

(9.2) L_ε(X) = (1/2) Σ_n log [1 + λ_n b(ε)]

is a lower bound for H_{ε;μ}(X). It was derived from the obvious inequality

(9.3) H_{ε;μ}(X) ≥ E{log (1/μ[S_ε(x)])},

and so is quite weak. (The term S_ε(x) is the ball of center x and radius ε.) A stronger bound M_ε(X), never any worse, is

(9.4) M_ε(X) = L_{ε/2}(X) - (ε²/8) b(ε/2).

This bound is derived by bounding the probability density drop in a Gaussian distribution under translation, and using the fact, proved in [6], that the sphere of radius ε/2 about the origin in L2 under a Gaussian distribution is at least as probable as any set of diameter ε. The M_ε(X) bound is the best asymptotic lower bound we have for arbitrary L2 Gaussian processes, but, for special ones, the next section gives an improvement. The L_ε(X) lower bound was introduced chiefly because it is also proved in [5] that

(9.5) H_{ε;μ}(X) ≤ L_{mε}(X)  for any m < 1/2.

This difficult proof uses products of partitions of finite dimensional eigensubspaces of the process, where the dimension increases without bound. In each finite dimensional subspace, "shell" partitions are used: partitions composed of partitions of the regions between concentric spheres of properly varying radii. The deterministic epsilon entropy of the n sphere in Euclidean space needed to be estimated in the proof. This proves the finiteness of H_{ε;μ}(X) for X a mean continuous Gaussian process on [0, 1].

From the results of Section 3, we know that H_{ε;μ}(X) is finite whenever I_{ε;μ}(X) is. The measure p on X × X which is defined by choosing a point x of the first factor according to μ and then assigning probability within S_{ε/2}(x) according to μ has, as in [1], mutual information

(9.6) I(p) ≤ E{log (1/μ[S_{ε/2}(x)])}.

Thus, for any probabilistic metric space,

(9.7) I_{ε;μ}(X) ≤ E{log (1/μ[S_{ε/2}(x)])}.

This, coupled with (9.3), gives a pair of bounds valid in general, but not readily computable.
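In one dimension the pair of bounds can be evaluated. The sketch below (an illustration, not from the paper) takes μ to be the standard normal on the line, where μ[S_r(x)] = Φ(x + r) - Φ(x - r), and estimates the expectations in (9.3) and (9.7) by Monte Carlo.

    # A minimal sketch (not from the paper): in one dimension the pair of bounds can
    # be evaluated.  For mu the standard normal on the line, mu[S_r(x)] is
    # Phi(x + r) - Phi(x - r), so the expectations in (9.3) and (9.7) become one
    # dimensional integrals, estimated here by Monte Carlo (natural logarithms).
    import math
    import random

    random.seed(0)

    def normal_cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def ball_mass(x, r):
        return normal_cdf(x + r) - normal_cdf(x - r)

    def expected_log_inv_ball(r, samples=100000):
        acc = 0.0
        for _ in range(samples):
            x = random.gauss(0.0, 1.0)
            acc += -math.log(ball_mass(x, r))
        return acc / samples

    for eps in (0.5, 0.1):
        lower = expected_log_inv_ball(eps)        # (9.3): a lower bound on H_{eps;mu}
        upper = expected_log_inv_ball(eps / 2.0)  # (9.7): an upper bound on I_{eps;mu}
        print(f"eps = {eps:4.2f}   E log 1/mu[S_eps(x)] = {lower:.3f}"
              f"   E log 1/mu[S_eps/2(x)] = {upper:.3f}")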

10. Examples of L2 entropy bounds

The bounds of the previous section lead to some interesting asymptotic expressions which yield quick answers when the eigenvalues of the process are known asymptotically. For example, a Gaussian process arising as the solution of a linear differential equation with constant coefficients driven by white Gaussian noise can be handled. For such processes, [8] shows that

(10.1) λ_n ~ A n^{-p},  p > 1,  A > 0,

and shows how to find A and p from the equation with simple calculations. The results of the previous section then yield

(10.2) M_ε(X) ≤ H_{ε;μ}(X) ≤ L_{ε/2}(X),

(10.3) M_ε(X) ~ ((p - 1)/2) 2^{2/(p-1)} [π/(p sin (π/p))]^{p/(p-1)} (A/ε²)^{1/(p-1)},

(10.4) L_{ε/2}(X) ~ (p/2) 2^{2/(p-1)} [π/(p sin (π/p))]^{p/(p-1)} (A/ε²)^{1/(p-1)},

as ε → 0. For example, if X is the Wiener process,

(10.5) E[X(s)X(t)] = min (s, t),  s, t ∈ [0, 1],

then

(10.6) λ_n = 1/[(n - 1/2)π]²,  n ≥ 1,

and so A = 1/π², p = 2, and

(10.7) 1/(2ε²) ≲ H_{ε;μ}(X) ≲ 1/ε²  as ε → 0.
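As a numerical check (not from the paper) of (10.7): computing M_ε and L_{ε/2} directly from the definitions (9.1), (9.2), and (9.4) with the truncated Wiener eigenvalues (10.6) gives ε²M_ε approaching 1/2 and ε²L_{ε/2} approaching 1 in natural units, matching the constants displayed above.

    # A minimal sketch (not from the paper): evaluating the bounds in (10.7) for the
    # Wiener process directly from the definitions (9.1), (9.2), and (9.4), with the
    # eigenvalues (10.6) truncated at N terms.  As eps decreases, eps^2 * M_eps
    # approaches 1/2 and eps^2 * L_{eps/2} approaches 1 (natural logarithms).
    import math

    N = 50000
    lam = [1.0 / ((n - 0.5) * math.pi) ** 2 for n in range(1, N + 1)]
    total = sum(lam)

    def solve_b(eps):
        """Bisection for b(eps) in (9.1)."""
        target = eps * eps
        if target >= total:
            return 0.0
        lo, hi = 0.0, 1.0
        while sum(v / (1.0 + hi * v) for v in lam) > target:
            hi *= 2.0
        for _ in range(80):
            mid = 0.5 * (lo + hi)
            if sum(v / (1.0 + mid * v) for v in lam) > target:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    for eps in (0.3, 0.1, 0.05):
        b_half = solve_b(eps / 2.0)                               # b(eps/2)
        L_half = 0.5 * sum(math.log1p(v * b_half) for v in lam)   # L_{eps/2}(X), (9.2)
        M_eps = L_half - (eps * eps / 8.0) * b_half               # M_eps(X), (9.4)
        print(f"eps = {eps:4.2f}   eps^2 * M_eps = {eps ** 2 * M_eps:.3f}"
              f"   eps^2 * L_(eps/2) = {eps ** 2 * L_half:.3f}")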

A better lower bound, valid in general for L2 Gaussian processes, is given by

(10.8) H_{ε;μ}(X) ≥ M_ε(X) + (1/2) E{Σ_n [λ_n q(x) x_n / (ε² + λ_n q(x))]²} = N_ε(X),

say, where x_n denotes the component of x along the nth eigenfunction, q(x) = 0 if ‖x‖ < ε, and q(x) is determined for ‖x‖ ≥ ε by

(10.9) Σ_n [λ_n x_n / (ε² + λ_n q(x))]² = 1/4.

Equation (10.8) is a refined version of (9.4), and is useful only when the eigenvalues decrease slowly enough that q(x) is almost deterministic. The condition turns out to be satisfied if (10.1) is, and allows N_ε(X) to be found asymptotically as

(10.10) N_ε(X) ~ c_p (A/ε²)^{1/(p-1)},

where the constant c_p exceeds the constant appearing in (10.3). Thus, for X the Wiener process, the lower bound of (10.7) can be improved to

(10.11) c/ε² ≤ H_{ε;μ}(X)  for a constant c with 1/2 < c < 1.

The result given by (10.11) is all we know about the entropy of the Wiener process. Our lower bounds are just not good enough to prove our conjecture

(10.12) H_{ε;μ}(X) ~ 1/ε²

for X the Wiener process. Notice, however, that for the Wiener process on C[0, 1], where one might expect the entropy to be much larger because of the more stringent covering requirement, Section 8 yields

(10.13) H_ε(X in C[0, 1]) = O(H_ε(X in L2[0, 1])).

In fact, (10.13) holds for any Gaussian process satisfying the eigenvalue condition (10.1). With this surprising result we close the paper.

REFERENCES

[1] R. J. McELIECE and E. C. POSNER, "Hide and seek, data storage, and entropy," Ann. Math. Statist., Vol. 42 (1971), pp. 1706-1716.
[2] E. C. POSNER and E. R. RODEMICH, "Differential entropy and tiling," J. Statist. Phys., Vol. 1 (1969), pp. 57-69.
[3] E. C. POSNER and E. R. RODEMICH, "Epsilon entropy and data compression," Ann. Math. Statist., Vol. 42 (1971), pp. 2079-2125.
[4] E. C. POSNER, E. R. RODEMICH, and H. RUMSEY, JR., "Epsilon entropy of stochastic processes," Ann. Math. Statist., Vol. 38 (1967), pp. 1000-1020.
[5] E. C. POSNER, E. R. RODEMICH, and H. RUMSEY, JR., "Product entropy of Gaussian distributions," Ann. Math. Statist., Vol. 40 (1969), pp. 870-904.
[6] E. C. POSNER, E. R. RODEMICH, and H. RUMSEY, JR., "Epsilon entropy of Gaussian distributions," Ann. Math. Statist., Vol. 40 (1969), pp. 1272-1296.
[7] E. R. RODEMICH and E. C. POSNER, "Epsilon entropy of stochastic processes with continuous paths," in preparation.
[8] H. WIDOM, "Asymptotic behavior of eigenvalues of certain integral operators," Arch. Rational Mech. Anal., Vol. 17 (1964), pp. 215-229.