Efficient Compression of Monotone and m-modal Distributions

Jayadev Acharya, Ashkan Jafarpour, Alon Orlitsky, and Ananda Theertha Suresh
ECE Department, UCSD
{jacharya, ashkan, alon,

Abstract: We consider universal compression of n samples drawn independently according to a monotone or m-modal distribution over k elements. We show that for all these distributions, the per-sample redundancy diminishes to 0 if k = exp(o(n/log n)) and is at least a constant if k = exp(ω(n)).

I. INTRODUCTION

Universal compression concerns encoding the output of an unknown source in a known class of distributions. We are interested in understanding compression of i.i.d. samples from distributions over an underlying discrete alphabet X, e.g., a subset of N. Let P^n = P × P × ... × P be the product distribution over X^n. Shannon's source coding theorem shows that

H(P) = Σ_{x∈X} P(x) log(1/P(x))

bits are necessary and sufficient to encode a known source with distribution P, where H(P) is the entropy of P. By Kraft's inequality, any distribution q over X^n implies a code that uses log(1/q(x_1^n)) bits to encode x_1^n ∈ X^n. The code implied by the underlying distribution P^n uses log(1/P^n(x_1^n)) bits, and the expected number of bits it uses is nH(P). The average redundancy of q with respect to P^n is

D(P^n ‖ q) = E_{P^n}[log(P^n(x_1^n)/q(x_1^n))],

the expected extra number of bits used by q beyond the entropy. D(P ‖ Q) is called the Kullback-Leibler divergence between P and Q. Let the underlying distribution belong to a known class P, and let P^n = {P^n : P ∈ P} be the class of all distributions of the form P^n. Then

R(P^n) = inf_q sup_{P^n ∈ P^n} D(P^n ‖ q)

is the average, or min-max, redundancy of P^n, which corresponds to the best coding for the worst distribution. When coding unknown distributions, some a priori knowledge about the underlying distribution is often available. For example, coding the last names of people from census data is close to coding a monotone distribution, since we have prior knowledge about the prevalences of last names.
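As a quick numerical illustration (not part of the paper), the entropy and divergence definitions above can be evaluated directly; the distributions P and Q below are arbitrary examples:

```python
import math

def entropy(p):
    """H(P) = sum_x P(x) log(1/P(x)); natural logs, so units are nats."""
    return sum(-px * math.log(px) for px in p if px > 0)

def kl(p, q):
    """Kullback-Leibler divergence D(P || Q) in nats."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Coding P with the code matched to Q costs H(P) + D(P || Q) per symbol:
# the divergence is exactly the extra cost beyond the entropy.
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]     # mismatched (uniform) coding distribution
print(entropy(P))                 # optimal rate for P
print(entropy(P) + kl(P, Q))      # equals log 4, the uniform code's rate
```

Here the mismatch penalty is additive, which is why redundancy is measured by the divergence alone.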
We expect that the last name Smith is more likely than Galifianakis. In a text document we have some knowledge about word frequencies and probabilities. In such language-modeling applications, Zipf distributions are common [1]. Geometric distributions over integers are useful in compressing residual signals in image compression [2]. A natural generalization is to consider distributions with at most m modes. For example, the life expectancy of a population, Poisson, and binomial distributions are unimodal. There has been considerable interest over the past decade in approximating mixtures of distributions. Many real-world phenomena can be modeled as outcomes of mixtures of simple distributions, and these have been studied since as early as Pearson's study of the Naples crab population [3]. In that setup, after observing measurements (say, diameter over height), the distribution that explained the data best was a mixture of two Gaussians, predicting the presence of more than one species. Such mixtures of m simple distributions are typically m-modal. There has been enormous work in the past decade on learning mixtures of distributions.

Before discussing monotone distributions, we state the results for the most well-studied finite discrete distributions, i.e., the i.i.d. distributions. Let I_k^n be the class of all i.i.d. distributions over k elements and block length n. The redundancy of I_k^n has been extensively studied [4-11]. It is now well established that for k = o(n),

R(I_k^n) = ((k−1)/2) log(n/k) (1 + o(1)),    (1)

and for n = o(k),

R(I_k^n) = n log(k/n) (1 + o(1)).    (2)

II. TERMINOLOGY

We consider distributions over the set N of positive integers. Such a distribution is non-increasing, or monotone, if P(i) ≥ P(i+1) for all i ≥ 1. Let M be the class of all monotone distributions over N, and let M_k be the subset of M consisting of monotone distributions over [k] = {1, ..., k}. Let M_k^n denote the class of distributions obtained by sampling one distribution in M_k independently n times, i.e., product distributions of the form P^n for P ∈ M_k.
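The leading term of Equation (1) can be observed empirically with the classical Krichevsky-Trofimov (add-1/2) sequential probability assignment; this is a standard textbook construction, used here only as an illustration and not the scheme analyzed in this paper:

```python
import math
import random

def kt_code_length(xs, k):
    """Code length in nats assigned to the sequence xs over alphabet
    {0, ..., k-1} by the sequential Krichevsky-Trofimov (add-1/2)
    probability assignment."""
    counts = [0] * k
    seen = 0
    nats = 0.0
    for x in xs:
        p = (counts[x] + 0.5) / (seen + k / 2)
        nats += -math.log(p)
        counts[x] += 1
        seen += 1
    return nats

# The redundancy against the true i.i.d. source behaves like
# ((k - 1) / 2) log n, the leading term of Equation (1).
# P below is an arbitrary example distribution.
k, n = 4, 100000
P = [0.4, 0.3, 0.2, 0.1]
random.seed(0)
xs = random.choices(range(k), weights=P, k=n)
ideal = sum(-math.log(P[x]) for x in xs)      # log 1/P^n(x_1^n)
redundancy = kt_code_length(xs, k) - ideal
print(redundancy, (k - 1) / 2 * math.log(n))  # comparable magnitudes
```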
Generalizing monotone distributions, a consecutive set [l, r] = {l, ..., r} of integers is a mode of a distribution P if P(i) = P(j) for all i, j ∈ [l, r] and (P(l−1) − P(l)) (P(r+1) − P(r)) > 0, so that [l, r] represents a local minimum or maximum of P. P is m-modal if it has at most m modes. Note that a monotone distribution is 0-modal. Let M_{k,m} be

the collection of all m-modal distributions over [k], and M_{k,m}^n the class of distributions over [k]^n obtained by sampling the same distribution in M_{k,m} independently n times.

III. RELATED WORK

Initial work on universal compression of monotone distributions was motivated by the representation of integers, and hence centered on a single instance, namely n = 1. [12] showed that every positive integer k can be represented using roughly log k + log log k bits. [13] derived a related lower bound, and [14] constructed a method for minimizing the expected number of bits normalized by the distribution's entropy. Remarkably, [15] found the expected redundancy exactly:

R(M_k) = log Σ_{i=1}^{k} (1/i) (1 − 1/i)^{i−1}.

Since (1 − 1/i)^{i−1} ≥ e^{−1} and Σ_{i=1}^{k} 1/i ≈ log k, this shows that R(M_k) = log log k + O(1), and in particular that R(M) = ∞.

More recently, [16] analyzed the min-ave redundancy, which replaces the redundancy of the worst distribution in the class by the average over all distributions. They show that monotone distributions have constant min-ave redundancy.

Recent work has concerned compression of independent identical monotone samples, namely distributions in M^n. The results above imply R(M^n) = ∞, yet [17] showed that the set of all distributions in M with finite entropy can be compressed with a diminishing per-sample relative redundancy. Specifically, there is a distribution q over N^n such that for all P ∈ M with H(P) < ∞,

D(P^n ‖ q) ≤ n H(P) · log log(nH(P)) / log(nH(P)).

Note that the theorem holds for all distributions with finite entropy, even those with infinite support. It therefore does not necessarily imply that the redundancy is sub-linear in n. For example, for the uniform distribution over [exp n], the bound is super-linear.

[18] considered the redundancy of monotone distributions as a function of the alphabet size k as well as the block length n. They show tight lower and upper bounds for k = O(n): R(M_k^n) = Θ̃(min{k, n^{1/3}}). In particular, this implies that when n grows linearly with k, the per-sample redundancy diminishes to 0.
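The roughly log k + log log k bits of [12] are achieved by Elias's delta code, which uses about log k + 2 log log k bits; the sketch below is the textbook construction, included only for illustration:

```python
def elias_delta(k):
    """Elias delta codeword for a positive integer k: about
    log k + 2 log log k bits, within a log log k term of the
    log k + log log k figure discussed above."""
    assert k >= 1
    binary = bin(k)[2:]                  # 1 + floor(log2 k) bits
    n_bits = len(binary)
    header = bin(n_bits)[2:]             # binary form of the length
    # unary prefix for the header's length, then the header, then k
    # without its leading 1 (which is implied)
    return '0' * (len(header) - 1) + header + binary[1:]

for k in (1, 2, 17, 1000):
    code = elias_delta(k)
    print(k, code, len(code))
```

Because the code is prefix-free, arbitrarily large integers can be written back to back and still be parsed unambiguously.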
The results of [18] can be extended using methods from [19] to prove that M_k^n has sub-linear redundancy for k = o(n²). (Throughout, g(n) = Θ̃(f(n)) means f(n) = g(n) up to polylogarithmic factors.) Universal compression is related to the problem of distribution estimation with Kullback-Leibler distortion. The problem of estimating monotone and few-modal distributions has been studied in statistics and theoretical computer science. [20] considered learning monotone and unimodal distributions with few samples. The sample complexity of learning m-modal distributions over an alphabet of size k was considered by [21]. Testing distributions for monotonicity has been considered in varied settings [22-26].

IV. RESULTS

To begin, in Section V we consider approximating a monotone distribution over k elements by step distributions with significantly fewer steps. While such approximation results are known for ℓ1 distance [20], we need approximation in KL divergence, and this requires more work as the KL divergence is not a metric and does not satisfy the triangle inequality. In Theorem 3 we show that for any integer b ≥ 10 log k there is a partition of [k] into b fixed intervals such that for any P ∈ M_k, the step distribution P̄ derived by averaging P over each of the intervals satisfies D(P ‖ P̄) ≤ (10 log k)/b. This result can be viewed as reducing the effective alphabet size from k to b, and in Section VI we use it to show that

R(M_k^n) ≤ (10 n log k)/b + ((b−1)/2) log n.

Specifically, in Theorem 4 we show that for large n and any k,

R(M_k^n) ≤ √(20 n log k · log n).

Hence in Corollary 5 we deduce that for k = 2^{o(n/log n)}, R(M_k^n) = o(n). In Section VII we extend these results to m-modal distributions and show in Theorem 7 that for large n and any k,

R(M_{k,m}^n) ≤ log C(k, m) + (m+1) R(M_k^n),

where C(k, m) denotes the binomial coefficient (k choose m). Consequently, in Corollary 8 we deduce that for any fixed m and k = 2^{o(n/log n)}, R(M_{k,m}^n) = o(n). Conversely, in Theorem 9 we show that for k = 2^n, R(M_k^n) = Ω(n). By monotonicity of the redundancy in k, the same bound holds for all k ≥ 2^n.
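To make the flattening result concrete, here is a small numerical sketch of a Theorem-3-style partition and step distribution. The choice γ = k^(2/b) − 1 (so that log(1+γ) = (2 log k)/b) is one way to make the intervals cover [k], and the Zipf-like P is an arbitrary monotone example; none of this is the paper's exact code:

```python
import math

def partition(k, b):
    """One reading of the Section V partition: b/2 singleton intervals,
    then intervals growing geometrically with ratio 1 + gamma, where
    gamma = k**(2/b) - 1 makes the intervals cover [k]."""
    gamma = k ** (2.0 / b) - 1.0
    intervals, lo = [], 1
    for j in range(1, b + 1):
        size = 1 if j <= b // 2 else math.ceil((1 + gamma) ** (j - b // 2))
        hi = min(lo + size - 1, k)
        intervals.append((lo, hi))
        lo = hi + 1
        if lo > k:
            break
    return intervals

def flatten(p, intervals):
    """Step distribution: replace p by its average over each interval."""
    q = list(p)
    for lo, hi in intervals:
        avg = sum(p[lo - 1:hi]) / (hi - lo + 1)
        for i in range(lo - 1, hi):
            q[i] = avg
    return q

def kl(p, q):
    """KL divergence in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Monotone (Zipf-like) example: the flattened version stays KL-close.
k, b = 1000, 200
z = sum(1.0 / j for j in range(1, k + 1))
p = [1.0 / (i * z) for i in range(1, k + 1)]
ivs = partition(k, b)
print(kl(p, flatten(p, ivs)), 10 * math.log(k) / b)  # far below the bound
```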
It follows that if k is sub-exponential in n/log n, the per-sample redundancy diminishes to 0, while if k is at least exponential in n, the per-sample redundancy is at least a constant.

V. APPROXIMATING MONOTONE DISTRIBUTIONS

We show that any monotone distribution over k elements can be approximated in KL divergence by a step distribution with significantly fewer steps. We will use the following simple result on the average empirical variance of non-negative numbers. Its proof follows from a straightforward expansion and is omitted for brevity.

Lemma 1: For 0 ≤ x_1 ≤ x_2 ≤ ... ≤ x_n with mean x̄,

Σ_{i=1}^{n} (x_i − x̄)² ≤ n x̄ (x_n − x_1).

Let I_1^b = (I_1, ..., I_b) be a partition of [k] into consecutive intervals. P is a step distribution over this partition if for every 1 ≤ j ≤ b, all i, i′ ∈ I_j satisfy P(i) = P(i′). Let |I| denote the number of integers in interval I. For a distribution P, let P̄ be the step distribution over I_1^b whose constant value over each interval is P's average over that interval, namely for x ∈ I_j,

P̄(x) = P(I_j) / |I_j|.    (3)

Let P_j^+ and P_j^− be the highest and lowest probabilities in interval I_j, and let k_j be the number of non-zero probabilities in interval I_j. The following lemma bounds the KL divergence between P and P̄.

Lemma 2: For any P ∈ M_k,

D(P ‖ P̄) ≤ Σ_{j=1}^{b} k_j (P_j^+ − P_j^−).

Proof: Let P̄_j denote the constant value of P̄ on I_j. Using log x ≤ x − 1 for x ≥ 0 and then Lemma 1,

Σ_{x∈I_j} P(x) log(P(x)/P̄(x)) ≤ Σ_{x∈I_j} P(x) (P(x)/P̄(x) − 1) = (1/P̄_j) Σ_{x∈I_j} (P(x) − P̄_j)² ≤ k_j (P_j^+ − P_j^−),

and the proof follows by summing over all intervals. ∎

We now describe a partition of [k] into b intervals over which every monotone distribution P is closely approximated by the step distribution. Let γ = k^{2/b} − 1, so that log(1+γ) = (2 log k)/b, and define

|I_j| = 1 for 1 ≤ j ≤ b/2, and |I_j| = ⌈(1+γ)^{j−b/2}⌉ for b/2 < j ≤ b.

Observe that

Σ_{j=1}^{b} |I_j| ≥ b/2 + Σ_{j=b/2+1}^{b} (1+γ)^{j−b/2} ≥ ((1+γ)^{b/2+1} − (1+γ)) / γ,

which is larger than k when log(1+γ) ≥ (2 log k)/b. Hence the intervals span [k]. For j ≤ b/2, |I_j| = 1, hence P_j^+ = P_j^−, and for j > b/2, k_j ≤ |I_j| ≤ 2(1+γ)^{j−b/2}. (We use natural logarithms throughout the paper; the results are within a constant factor of their base-2 counterparts.)

Theorem 3: Let b ≥ 10 log k and let I_1^b be the intervals defined above. Then for every P ∈ M_k,

D(P ‖ P̄) ≤ (10 log k)/b.

Proof: By Lemma 2 and since P_j^+ = P_j^− for j ≤ b/2,

D(P ‖ P̄) ≤ Σ_{j=b/2+1}^{b} k_j (P_j^+ − P_j^−).

Since P is monotone, P_{j+1}^+ ≤ P_j^−, so expanding the sum and telescoping over the geometrically growing interval sizes yields

D(P ‖ P̄) ≤ k_{b/2+1} P_{b/2+1}^+ + 2γ Σ_{j=b/2+1}^{b} k_j P_j^−.

We need to consider only the non-zero k_j's, and k_j P_j^− ≤ P(I_j), so the summation above is at most 1 and

D(P ‖ P̄) ≤ |I_{b/2+1}| P_{b/2+1}^+ + 2γ.

The theorem follows by substituting the values of γ and |I_{b/2+1}| and the inequality P_{b/2+1}^+ ≤ 2/b, which holds since each of the first b/2 singleton intervals has probability at least P_{b/2+1}^+. ∎

VI. UPPER BOUND ON MONOTONE REDUNDANCY

Theorem 4: For large n and any k,

R(M_k^n) ≤ √(20 n log k · log n).
Proof: Recall the definition of P̄ in (3), and let P̄^n be the i.i.d. distribution obtained by sampling P̄ n times. Then for any distribution q over [k]^n,

D(P^n ‖ q) = Σ_{x_1^n ∈ [k]^n} P^n(x_1^n) log(P^n(x_1^n)/q(x_1^n))
= Σ_{x_1^n ∈ [k]^n} P^n(x_1^n) [log(P^n(x_1^n)/P̄^n(x_1^n)) + log(P̄^n(x_1^n)/q(x_1^n))]
= n D(P ‖ P̄) + Σ_{x_1^n ∈ [k]^n} P^n(x_1^n) log(P̄^n(x_1^n)/q(x_1^n)),

where the last step follows since the KL divergence between product distributions is the sum of the KL divergences of the distributions on each coordinate. For intervals I_1, ..., I_b satisfying Theorem 3, by the definition of redundancy,

R(M_k^n) ≤ (10 n log k)/b + inf_q sup_{P ∈ M_k} Σ_{x_1^n ∈ [k]^n} P^n(x_1^n) log(P̄^n(x_1^n)/q(x_1^n)).

We use this inequality and Equation (1) to construct a distribution q over [k]^n that has small KL divergence with respect to every distribution in M_k^n. We do so by showing that the second expression is upper bounded by the redundancy of length-n i.i.d. sequences over an alphabet of size b. Equation (1) implies that for b = o(n) there is a distribution Q_{b,n} over [b]^n such that every distribution P over [b] satisfies

D(P^n ‖ Q_{b,n}) ≤ ((b−1)/2) log n.    (4)

Using this distribution, we design a distribution q over [k]^n as follows. For the intervals described earlier and any x ∈ [k], let f(x) be the index of the interval that x belongs to, so f is a map from [k] to [b]. Then

f(x_1^n) = f(x_1, ..., x_n) = (f(x_1), ..., f(x_n))

maps [k]^n to [b]^n. Let f(x_1^n) = j_1^n = (j_1, ..., j_n). The number of sequences x_1^n that map to j_1^n is |I_{j_1}| ··· |I_{j_n}|. Define the distribution q over [k]^n as

q(x_1^n) = Q_{b,n}(j_1^n) Π_{i=1}^{n} 1/|I_{j_i}|.    (5)

It is easy to check that q sums to 1, so the distribution is well defined. Similarly, by Equation (3),

P̄^n(x_1^n) = Π_{i=1}^{n} P(I_{j_i})/|I_{j_i}|.

Using these two, for any P ∈ M_k,

Σ_{x_1^n ∈ [k]^n} P^n(x_1^n) log(P̄^n(x_1^n)/q(x_1^n)) = Σ_{j_1^n ∈ [b]^n} Σ_{x_1^n : f(x_1^n) = j_1^n} P^n(x_1^n) log(Π_{i=1}^{n} P(I_{j_i}) / Q_{b,n}(j_1^n)).

Now, for j_1^n ∈ [b]^n,

Σ_{x_1^n : f(x_1^n) = j_1^n} P^n(x_1^n) = Σ_{x_1^n : x_i ∈ I_{j_i}} Π_{i=1}^{n} P(x_i) (a)= Π_{i=1}^{n} P(I_{j_i}),

where (a) exchanges the sum and the product. Therefore,

Σ_{x_1^n ∈ [k]^n} P^n(x_1^n) log(P̄^n(x_1^n)/q(x_1^n)) = Σ_{j_1^n ∈ [b]^n} Π_{i=1}^{n} P(I_{j_i}) log(Π_{i=1}^{n} P(I_{j_i}) / Q_{b,n}(j_1^n)).

The distribution P induces a distribution over the intervals I_1, ..., I_b, an alphabet of size b, and this expression is the KL divergence between the corresponding product distribution over the intervals and Q_{b,n}. By Equation (4) it is bounded by ((b−1)/2) log n, hence

R(M_k^n) ≤ (10 n log k)/b + ((b−1)/2) log n.

Choosing b = √(20 n log k / log n) proves the theorem. ∎

Corollary 5: For k = 2^{o(n/log n)}, R(M_k^n) = o(n).

VII. UPPER BOUND ON m-MODAL DISTRIBUTIONS

We upper bound the redundancy of m-modal distributions by decomposing them into m+1 monotone distributions. First note that the redundancy of a union of distribution classes is close to the highest redundancy of a class in the union.
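This union fact can be checked numerically before stating it formally. The distributions below are arbitrary examples; the inequality holds because the uniform mixture Q satisfies Q(x) ≥ Q_i(x)/T pointwise:

```python
import math

def kl(p, q):
    """KL divergence in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two coding distributions and their uniform mixture (T = 2).
Q1 = [0.7, 0.1, 0.1, 0.1]
Q2 = [0.1, 0.1, 0.1, 0.7]
Qs = [Q1, Q2]
T = len(Qs)
Q = [sum(q[i] for q in Qs) / T for i in range(len(Q1))]

# Since Q(x) >= Q_i(x)/T pointwise, D(P||Q) <= D(P||Q_i) + log T for all i.
P = [0.6, 0.2, 0.1, 0.1]
for Qi in Qs:
    assert kl(P, Q) <= kl(P, Qi) + math.log(T) + 1e-12
print(kl(P, Q), min(kl(P, Qi) for Qi in Qs) + math.log(T))
```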
Lemma 6 (Redundancy of unions): If P_1, ..., P_T are T distribution classes over the same domain, then

R(∪_{1≤i≤T} P_i) ≤ max_{1≤i≤T} R(P_i) + log T.

Proof: Let distribution Q_i achieve the redundancy of P_i, and define Q = (Q_1 + ... + Q_T)/T to be the average of these redundancy-achieving distributions. Since Q(x) ≥ Q_i(x)/T for every x, for any i and P_i ∈ P_i,

D(P_i ‖ Q) ≤ D(P_i ‖ Q_i) + log T ≤ R(P_i) + log T. ∎

We first decompose the class of m-modal distributions into C(k, m) classes, where C(k, m) denotes the binomial coefficient (k choose m).

Theorem 7: For large n and any k,

R(M_{k,m}^n) ≤ log C(k, m) + (m+1) R(M_k^n).

Proof: There are C(k, m) choices for the modes of the distributions. Divide the class of all m-modal distributions into C(k, m) classes such that the modes of all the distributions within one class are the same. The distributions are monotone between the modes, and there are at most m+1 such distinct regions. Each region can be coded with redundancy at most R(M_k^n), so the total extra number of bits is bounded by (m+1) R(M_k^n). Combining with Lemma 6 with T = C(k, m) proves Theorem 7. ∎

Corollary 8: For any fixed m and k = 2^{o(n/log n)}, R(M_{k,m}^n) = o(n).

VIII. LOWER BOUND

Theorem 9: For k ≥ 2^n, R(M_k^n) = Ω(n).

Proof: We show the lower bound using the redundancy-capacity theorem. Our objective is to construct a large class of distinguishable distributions, defined below.

Definition 10: A collection S ⊆ P^n of distributions over X^n is (1−ε)-distinguishable if there exists a function f : X^n → S such that for any P^n ∈ S, Pr(f(X^n) ≠ P^n) < ε when X^n is generated by P^n.

In other words, a collection of distributions is distinguishable if, given a sample generated by one of the distributions, we can identify that distribution with error probability at most ε. A collection of distinguishable distributions provides a lower bound on the redundancy via the following formulation of the redundancy-capacity theorem.

Lemma 11: If there is a collection S ⊆ M_k^n of (1−ε)-distinguishable distributions, then

R(M_k^n) ≥ (1−ε) log |S| − 1.

We now construct a class of 2^{cn} distinguishable distributions in M_k^n for k ≥ 2^n and a constant c > 0. This gives a lower bound of cn(1−ε) − 1 on R(M_k^n). Due to lack of space, we provide an overview of the construction of the distributions and a sketch of the proof. We assume n is even for ease of illustration. Divide [2^n] into n intervals as I_1 = {1, 2} and I_j = {2^{j−1}+1, ..., 2^j} for j = 2, ..., n. Our collection of distinguishable distributions will be a subset of the distributions of the form P^n, where P satisfies:

1) For 1 ≤ j ≤ n and i_1, i_2 ∈ I_j, P(i_1) = P(i_2).
2) For n/2 of the intervals P(I_j) = 2/(3n), and for the remaining n/2, P(I_j) = 4/(3n). Furthermore, P(I_1) is always 4/(3n).

It is then easy to verify that any P satisfying these properties is a distribution in M_k, and furthermore, the number of such distributions is exactly C(n−1, n/2−1).

Consider the following bijection from the set of distributions satisfying these conditions to the set of binary strings in {0,1}^n that have weight n/2 and whose first bit is 1: for j = 1, ..., n, if P(I_j) = 4/(3n) set the jth bit to 1, and to 0 otherwise. We now show the existence of a large subset of such distributions that map to strings with large pairwise Hamming distance, using the Gilbert-Varshamov bound [27], as follows.

Lemma 12: For α < 1/2, there exists a class of M = 2^{n(1−h(α)−o(1))} distributions satisfying the properties above such that the Hamming distance between the strings S(P_i) and S(P_j) that any two of them map to satisfies d(S(P_i), S(P_j)) ≥ n(1−α)/2. In other words, any pair of distributions differ on at least a fraction (1−α)/2 of the intervals.

Using this, we prove the following theorem.

Theorem 13: The class of distributions defined above is 0.9-distinguishable and contains 2^{n/100} distributions.

Applying Lemma 11 to this theorem proves the lower bound. ∎
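The family used in the lower bound can be sketched in code. The helper below is illustrative only, under the parameter reading above (k = 2^n, intervals doubling in size, interval masses 2/(3n) and 4/(3n)):

```python
def distribution_from_string(s):
    """Illustrative helper (not from the paper): build the monotone
    distribution over [2^n] encoded by a binary string s of length n
    with s[0] == 1 and weight n/2. Interval I_1 = {1, 2} and
    I_j = {2^(j-1)+1, ..., 2^j}; interval j receives total mass 4/(3n)
    if s[j-1] == 1 and 2/(3n) otherwise, spread uniformly inside it."""
    n = len(s)
    assert s[0] == 1 and sum(s) == n // 2
    p = []
    for j in range(1, n + 1):
        size = 2 if j == 1 else 2 ** (j - 1)
        mass = 4 / (3 * n) if s[j - 1] == 1 else 2 / (3 * n)
        p.extend([mass / size] * size)
    return p

n = 8
p = distribution_from_string([1, 1, 0, 1, 0, 0, 1, 0])
print(len(p) == 2 ** n)                                   # covers [2^n]
print(all(p[i] >= p[i + 1] for i in range(len(p) - 1)))   # monotone
```

Non-increasingness holds because halving an interval's mass exactly offsets the doubling of its size, so per-element probabilities never increase.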
Due to space constraints, we only sketch the function f that maps [k]^n to the collection of distributions in the previous theorem, and defer the complete proof to the full version of the paper. For a distribution P in the theorem, let L(P) ⊆ [n] be the set of intervals where P(I_j) = 4/(3n). Note that |L(P)| = n/2, and we expect 2/3 of the n symbols to appear in these intervals when P is sampled n times. Given a sample in [k]^n, compute the fraction of symbols lying in L(P) for each of the 2^{n/100} such P's, and output the P with the highest fraction of elements in its corresponding L(P). The proof of distinguishability uses Chernoff bounds and is along the lines of the proof of the lower bound in [19].

REFERENCES

[1] G. K. Zipf, The Psychobiology of Language. New York, NY, USA: Houghton-Mifflin, 1935.
[2] N. Merhav, G. Seroussi, and M. J. Weinberger, "Optimal prefix codes for sources with two-sided geometric distributions," IEEE Transactions on Information Theory, vol. 46, no. 1, 2000.
[3] K. Pearson, "Contributions to the mathematical theory of evolution," Philosophical Transactions of the Royal Society of London A, vol. 185, 1894.
[4] J. Kieffer, "A unified approach to weak universal source coding," IEEE Transactions on Information Theory, vol. 24, no. 6, Nov. 1978.
[5] L. D. Davisson, "Universal noiseless coding," IEEE Transactions on Information Theory, vol. 19, no. 6, Nov. 1973.
[6] L. D. Davisson, R. J. McEliece, M. B. Pursley, and M. S. Wallace, "Efficient universal noiseless source codes," IEEE Transactions on Information Theory, vol. 27, no. 3, 1981.
[7] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Transactions on Information Theory, vol. 41, no. 3, 1995.
[8] J. Rissanen, "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, vol. 42, no. 1, Jan. 1996.
[9] W. Szpankowski, "On asymptotics of certain recurrences arising in universal coding," Problems of Information Transmission, vol. 34, no. 2, 1998.
[10] Q. Xie and A. R.
Barron, "Asymptotic minimax regret for data compression, gambling and prediction," IEEE Transactions on Information Theory, vol. 46, no. 2, 2000.
[11] W. Szpankowski and M. J. Weinberger, "Minimax redundancy for large alphabets," in ISIT, 2010.
[12] P. Elias, "Universal codeword sets and representations of the integers," IEEE Transactions on Information Theory, vol. 21, no. 2, Mar. 1975.
[13] B. Y. Ryabko, "Coding of a source with unknown but ordered probabilities," Problemy Peredachi Informatsii, vol. 15, no. 2, 1979.
[14] J. Rissanen, "Minimax codes for finite alphabets," IEEE Transactions on Information Theory, vol. 24, no. 3, 1978.
[15] L. D. Davisson and A. Leon-Garcia, "A source matching approach to finding minimax codes," IEEE Transactions on Information Theory, vol. 26, no. 2, 1980.
[16] M. Khosravifard, H. Saidi, M. Esmaeili, and T. A. Gulliver, "The minimum average code for finite memoryless monotone sources," IEEE Transactions on Information Theory, vol. 53, no. 3, 2007.
[17] D. Foster, R. Stine, and A. Wyner, "Universal codes for finite sequences of integers drawn from a monotone distribution," IEEE Transactions on Information Theory, vol. 48, no. 6, June 2002.
[18] G. I. Shamir, "Universal source coding for monotonic and fast decaying monotonic distributions," IEEE Transactions on Information Theory, vol. 59, no. 11, 2013.
[19] J. Acharya, H. Das, and A. Orlitsky, "Tight bounds on profile redundancy and distinguishability," in Neural Information Processing Systems (NIPS), 2012.
[20] L. Birgé, "On the risk of histograms for estimating decreasing densities," Annals of Statistics, vol. 15, no. 3, 1987.
[21] C. Daskalakis, I. Diakonikolas, and R. A. Servedio, "Learning k-modal distributions via testing," in SODA, 2012.
[22] M. Woodroofe and J. Sun, "Testing uniformity versus a monotone density," Annals of Statistics, vol. 27, no. 1, 1999.
[23] O. Goldreich, S. Goldwasser, E. Lehman, and D. Ron, "Testing monotonicity," in Foundations of Computer Science (FOCS), 1998.
[24] T.
Batu, R. Kumar, and R. Rubinfeld, in STOC. ACM.
[25] R. Rubinfeld and R. A. Servedio, "Testing monotone high-dimensional distributions," Random Structures & Algorithms, vol. 34, no. 1, 2009.
[26] J. Acharya, A. Jafarpour, A. Orlitsky, and A. T. Suresh, "A competitive test for uniformity of monotone distributions," in AISTATS, 2013.
[27] R. M. Roth, Introduction to Coding Theory. Cambridge University Press, 2006.


More information

Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction

Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction 1 Codes in the Damerau Distance for Deletion and Adjacent Transposition Correction Ryan Garys, Eitan Yaakoi, and Olgica Milenkovic ECE Department, University of Illinois, Urana-Champaign Technion University

More information

SHARED INFORMATION. Prakash Narayan with. Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe

SHARED INFORMATION. Prakash Narayan with. Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe SHARED INFORMATION Prakash Narayan with Imre Csiszár, Sirin Nitinawarat, Himanshu Tyagi, Shun Watanabe 2/41 Outline Two-terminal model: Mutual information Operational meaning in: Channel coding: channel

More information

Estimating Symmetric Distribution Properties Maximum Likelihood Strikes Back!

Estimating Symmetric Distribution Properties Maximum Likelihood Strikes Back! Estimating Symmetric Distribution Properties Maximum Likelihood Strikes Back! Jayadev Acharya Cornell Hirakendu Das Yahoo Alon Orlitsky UCSD Ananda Suresh Google Shannon Channel March 5, 2018 Based on:

More information

Asymptotic redundancy and prolixity

Asymptotic redundancy and prolixity Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends

More information

Spiking problem in monotone regression : penalized residual sum of squares

Spiking problem in monotone regression : penalized residual sum of squares Spiking prolem in monotone regression : penalized residual sum of squares Jayanta Kumar Pal 12 SAMSI, NC 27606, U.S.A. Astract We consider the estimation of a monotone regression at its end-point, where

More information

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria

Source Coding. Master Universitario en Ingeniería de Telecomunicación. I. Santamaría Universidad de Cantabria Source Coding Master Universitario en Ingeniería de Telecomunicación I. Santamaría Universidad de Cantabria Contents Introduction Asymptotic Equipartition Property Optimal Codes (Huffman Coding) Universal

More information

The variance for partial match retrievals in k-dimensional bucket digital trees

The variance for partial match retrievals in k-dimensional bucket digital trees The variance for partial match retrievals in k-dimensional ucket digital trees Michael FUCHS Department of Applied Mathematics National Chiao Tung University January 12, 21 Astract The variance of partial

More information

Trace Reconstruction Revisited

Trace Reconstruction Revisited Trace Reconstruction Revisited Andrew McGregor 1, Eric Price 2, and Sofya Vorotnikova 1 1 University of Massachusetts Amherst {mcgregor,svorotni}@cs.umass.edu 2 IBM Almaden Research Center ecprice@mit.edu

More information

Quantifying Computational Security Subject to Source Constraints, Guesswork and Inscrutability

Quantifying Computational Security Subject to Source Constraints, Guesswork and Inscrutability Quantifying Computational Security Subject to Source Constraints, Guesswork and Inscrutability Ahmad Beirami, Robert Calderbank, Ken Duffy, Muriel Médard Department of Electrical and Computer Engineering,

More information

Structuring Unreliable Radio Networks

Structuring Unreliable Radio Networks Structuring Unreliale Radio Networks Keren Censor-Hillel Seth Gilert Faian Kuhn Nancy Lynch Calvin Newport March 29, 2011 Astract In this paper we study the prolem of uilding a connected dominating set

More information

APC486/ELE486: Transmission and Compression of Information. Bounds on the Expected Length of Code Words

APC486/ELE486: Transmission and Compression of Information. Bounds on the Expected Length of Code Words APC486/ELE486: Transmission and Compression of Information Bounds on the Expected Length of Code Words Scribe: Kiran Vodrahalli September 8, 204 Notations In these notes, denotes a finite set, called the

More information

Tight Bounds on Minimum Maximum Pointwise Redundancy

Tight Bounds on Minimum Maximum Pointwise Redundancy Tight Bounds on Minimum Maximum Pointwise Redundancy Michael B. Baer vlnks Mountain View, CA 94041-2803, USA Email:.calbear@ 1eee.org Abstract This paper presents new lower and upper bounds for the optimal

More information

Bounded Queries, Approximations and the Boolean Hierarchy

Bounded Queries, Approximations and the Boolean Hierarchy Bounded Queries, Approximations and the Boolean Hierarchy Richard Chang Department of Computer Science and Electrical Engineering University of Maryland Baltimore County March 23, 1999 Astract This paper

More information

Optimal Routing in Chord

Optimal Routing in Chord Optimal Routing in Chord Prasanna Ganesan Gurmeet Singh Manku Astract We propose optimal routing algorithms for Chord [1], a popular topology for routing in peer-to-peer networks. Chord is an undirected

More information

Long non-crossing configurations in the plane

Long non-crossing configurations in the plane Long non-crossing configurations in the plane Adrian Dumitrescu Csaa D. Tóth July 4, 00 Astract We revisit some maximization prolems for geometric networks design under the non-crossing constraint, first

More information

Variable-Rate Universal Slepian-Wolf Coding with Feedback

Variable-Rate Universal Slepian-Wolf Coding with Feedback Variable-Rate Universal Slepian-Wolf Coding with Feedback Shriram Sarvotham, Dror Baron, and Richard G. Baraniuk Dept. of Electrical and Computer Engineering Rice University, Houston, TX 77005 Abstract

More information

Content Delivery in Erasure Broadcast. Channels with Cache and Feedback

Content Delivery in Erasure Broadcast. Channels with Cache and Feedback Content Delivery in Erasure Broadcast Channels with Cache and Feedack Asma Ghorel, Mari Koayashi, and Sheng Yang LSS, CentraleSupélec Gif-sur-Yvette, France arxiv:602.04630v [cs.t] 5 Fe 206 {asma.ghorel,

More information

arxiv: v1 [math.nt] 27 May 2018

arxiv: v1 [math.nt] 27 May 2018 SETTLIG SOME SUM SUPPOSITIOS TAAY WAKHARE AD CHRISTOPHE VIGAT arxiv:180510569v1 matht] 7 May 018 Astract We solve multiple conjectures y Byszewsi and Ulas aout the sum of ase digits function In order to

More information

Minimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets

Minimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets Proceedings of Machine Learning Research 8: 8, 208 Algorithmic Learning Theory 208 Minimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets Elias Jääsaari Helsinki Institute for Information

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Luis Manuel Santana Gallego 100 Investigation and simulation of the clock skew in modern integrated circuits. Clock Skew Model

Luis Manuel Santana Gallego 100 Investigation and simulation of the clock skew in modern integrated circuits. Clock Skew Model Luis Manuel Santana Gallego 100 Appendix 3 Clock Skew Model Xiaohong Jiang and Susumu Horiguchi [JIA-01] 1. Introduction The evolution of VLSI chips toward larger die sizes and faster clock speeds makes

More information

Soft Covering with High Probability

Soft Covering with High Probability Soft Covering with High Probability Paul Cuff Princeton University arxiv:605.06396v [cs.it] 20 May 206 Abstract Wyner s soft-covering lemma is the central analysis step for achievability proofs of information

More information

THE multiway relay channel is a multicast network model

THE multiway relay channel is a multicast network model 1 A Multiway Relay Channel with Balanced Sources awrence Ong and Roy Timo arxiv:160509493v1 [csit] 31 May 2016 Astract We consider a joint source-channel coding prolem on a finite-field multiway relay

More information

Genetic Algorithms applied to Problems of Forbidden Configurations

Genetic Algorithms applied to Problems of Forbidden Configurations Genetic Algorithms applied to Prolems of Foridden Configurations R.P. Anstee Miguel Raggi Department of Mathematics University of British Columia Vancouver, B.C. Canada V6T Z2 anstee@math.uc.ca mraggi@gmail.com

More information

CSCI 2570 Introduction to Nanocomputing

CSCI 2570 Introduction to Nanocomputing CSCI 2570 Introduction to Nanocomputing Information Theory John E Savage What is Information Theory Introduced by Claude Shannon. See Wikipedia Two foci: a) data compression and b) reliable communication

More information

Generalized Sampling and Variance in Counterfactual Regret Minimization

Generalized Sampling and Variance in Counterfactual Regret Minimization Generalized Sampling and Variance in Counterfactual Regret Minimization Richard Gison and Marc Lanctot and Neil Burch and Duane Szafron and Michael Bowling Department of Computing Science, University of

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.

More information

Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery

Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery Necessary and Sufficient Conditions for High-Dimensional Salient Feature Subset Recovery Vincent Tan, Matt Johnson, Alan S. Willsky Stochastic Systems Group, Laboratory for Information and Decision Systems,

More information

SUFFIX TREE. SYNONYMS Compact suffix trie

SUFFIX TREE. SYNONYMS Compact suffix trie SUFFIX TREE Maxime Crochemore King s College London and Université Paris-Est, http://www.dcs.kcl.ac.uk/staff/mac/ Thierry Lecroq Université de Rouen, http://monge.univ-mlv.fr/~lecroq SYNONYMS Compact suffix

More information

On the Optimal Performance in Asymmetric Gaussian Wireless Sensor Networks With Fading

On the Optimal Performance in Asymmetric Gaussian Wireless Sensor Networks With Fading 436 IEEE TRANSACTIONS ON SIGNA ROCESSING, O. 58, NO. 4, ARI 00 [9] B. Chen and. K. Willett, On the optimality of the likelihood-ratio test for local sensor decision rules in the presence of nonideal channels,

More information

The Optimal Fix-Free Code for Anti-Uniform Sources

The Optimal Fix-Free Code for Anti-Uniform Sources Entropy 2015, 17, 1379-1386; doi:10.3390/e17031379 OPEN ACCESS entropy ISSN 1099-4300 www.mdpi.com/journal/entropy Article The Optimal Fix-Free Code for Anti-Uniform Sources Ali Zaghian 1, Adel Aghajan

More information

Taming Big Probability Distributions

Taming Big Probability Distributions Taming Big Probability Distributions Ronitt Rubinfeld June 18, 2012 CSAIL, Massachusetts Institute of Technology, Cambridge MA 02139 and the Blavatnik School of Computer Science, Tel Aviv University. Research

More information

Compactness vs Collusion Resistance in Functional Encryption

Compactness vs Collusion Resistance in Functional Encryption Compactness vs Collusion Resistance in Functional Encryption Baiyu Li Daniele Micciancio April 10, 2017 Astract We present two general constructions that can e used to comine any two functional encryption

More information

ERASMUS UNIVERSITY ROTTERDAM Information concerning the Entrance examination Mathematics level 2 for International Business Administration (IBA)

ERASMUS UNIVERSITY ROTTERDAM Information concerning the Entrance examination Mathematics level 2 for International Business Administration (IBA) ERASMUS UNIVERSITY ROTTERDAM Information concerning the Entrance examination Mathematics level 2 for International Business Administration (IBA) General information Availale time: 2.5 hours (150 minutes).

More information

Unified Maximum Likelihood Estimation of Symmetric Distribution Properties

Unified Maximum Likelihood Estimation of Symmetric Distribution Properties Unified Maximum Likelihood Estimation of Symmetric Distribution Properties Jayadev Acharya HirakenduDas Alon Orlitsky Ananda Suresh Cornell Yahoo UCSD Google Frontiers in Distribution Testing Workshop

More information

Prequential Plug-In Codes that Achieve Optimal Redundancy Rates even if the Model is Wrong

Prequential Plug-In Codes that Achieve Optimal Redundancy Rates even if the Model is Wrong Prequential Plug-In Codes that Achieve Optimal Redundancy Rates even if the Model is Wrong Peter Grünwald pdg@cwi.nl Wojciech Kotłowski kotlowsk@cwi.nl National Research Institute for Mathematics and Computer

More information

New Constructions of Sonar Sequences

New Constructions of Sonar Sequences INTERNATIONAL JOURNAL OF BASIC & APPLIED SCIENCES IJBAS-IJENS VOL.:14 NO.:01 12 New Constructions of Sonar Sequences Diego F. Ruiz 1, Carlos A. Trujillo 1, and Yadira Caicedo 2 1 Department of Mathematics,

More information

CHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension

CHAPTER 5. Linear Operators, Span, Linear Independence, Basis Sets, and Dimension A SERIES OF CLASS NOTES TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS LINEAR CLASS NOTES: A COLLECTION OF HANDOUTS FOR REVIEW AND PREVIEW OF LINEAR THEORY

More information

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables Minimizing a convex separale exponential function suect to linear equality constraint and ounded variales Stefan M. Stefanov Department of Mathematics Neofit Rilski South-Western University 2700 Blagoevgrad

More information

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function

Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Lattices for Distributed Source Coding: Jointly Gaussian Sources and Reconstruction of a Linear Function Dinesh Krithivasan and S. Sandeep Pradhan Department of Electrical Engineering and Computer Science,

More information

where L(y) = P Y X (y 0) C PPM = log 2 (M) ( For n b 0 the expression reduces to D. Polar Coding (2)

where L(y) = P Y X (y 0) C PPM = log 2 (M) ( For n b 0 the expression reduces to D. Polar Coding (2) Polar-Coded Pulse Position odulation for the Poisson Channel Delcho Donev Institute for Communications Engineering Technical University of unich Email: delchodonev@tumde Georg Böcherer athematical and

More information

IEOR 6711: Stochastic Models I Fall 2013, Professor Whitt Lecture Notes, Thursday, September 5 Modes of Convergence

IEOR 6711: Stochastic Models I Fall 2013, Professor Whitt Lecture Notes, Thursday, September 5 Modes of Convergence IEOR 6711: Stochastic Models I Fall 2013, Professor Whitt Lecture Notes, Thursday, September 5 Modes of Convergence 1 Overview We started by stating the two principal laws of large numbers: the strong

More information

Chapter 9 Fundamental Limits in Information Theory

Chapter 9 Fundamental Limits in Information Theory Chapter 9 Fundamental Limits in Information Theory Information Theory is the fundamental theory behind information manipulation, including data compression and data transmission. 9.1 Introduction o For

More information

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities

Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Refined Bounds on the Empirical Distribution of Good Channel Codes via Concentration Inequalities Maxim Raginsky and Igal Sason ISIT 2013, Istanbul, Turkey Capacity-Achieving Channel Codes The set-up DMC

More information

1590 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 6, JUNE Source Coding, Large Deviations, and Approximate Pattern Matching

1590 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 6, JUNE Source Coding, Large Deviations, and Approximate Pattern Matching 1590 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 6, JUNE 2002 Source Coding, Large Deviations, and Approximate Pattern Matching Amir Dembo and Ioannis Kontoyiannis, Member, IEEE Invited Paper

More information

Generalized Geometric Series, The Ratio Comparison Test and Raabe s Test

Generalized Geometric Series, The Ratio Comparison Test and Raabe s Test Generalized Geometric Series The Ratio Comparison Test and Raae s Test William M. Faucette Decemer 2003 The goal of this paper is to examine the convergence of a type of infinite series in which the summands

More information

Quadratic-backtracking Algorithm for String Reconstruction from Substring Compositions

Quadratic-backtracking Algorithm for String Reconstruction from Substring Compositions Quadratic-backtracking Algorithm for String Reconstruction from Substring Compositions Jayadev Acharya UCSD jacharya@ucsdedu Hirakendu Das Yahoo Labs hdas@yahoo-inccom Olgica Milenkovic UIUC milenkov@uiucedu

More information

IN this paper, we consider the estimation of the frequency

IN this paper, we consider the estimation of the frequency Iterative Frequency Estimation y Interpolation on Fourier Coefficients Elias Aoutanios, MIEEE, Bernard Mulgrew, MIEEE Astract The estimation of the frequency of a complex exponential is a prolem that is

More information

Essential Maths 1. Macquarie University MAFC_Essential_Maths Page 1 of These notes were prepared by Anne Cooper and Catriona March.

Essential Maths 1. Macquarie University MAFC_Essential_Maths Page 1 of These notes were prepared by Anne Cooper and Catriona March. Essential Maths 1 The information in this document is the minimum assumed knowledge for students undertaking the Macquarie University Masters of Applied Finance, Graduate Diploma of Applied Finance, and

More information

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute

(each row defines a probability distribution). Given n-strings x X n, y Y n we can use the absence of memory in the channel to compute ENEE 739C: Advanced Topics in Signal Processing: Coding Theory Instructor: Alexander Barg Lecture 6 (draft; 9/6/03. Error exponents for Discrete Memoryless Channels http://www.enee.umd.edu/ abarg/enee739c/course.html

More information

Jumping Sequences. Steve Butler 1. (joint work with Ron Graham and Nan Zang) University of California Los Angelese

Jumping Sequences. Steve Butler 1. (joint work with Ron Graham and Nan Zang) University of California Los Angelese Jumping Sequences Steve Butler 1 (joint work with Ron Graham and Nan Zang) 1 Department of Mathematics University of California Los Angelese www.math.ucla.edu/~butler UCSD Combinatorics Seminar 14 October

More information

Motivation: Can the equations of physics be derived from information-theoretic principles?

Motivation: Can the equations of physics be derived from information-theoretic principles? 10. Physics from Fisher Information. Motivation: Can the equations of physics e derived from information-theoretic principles? I. Fisher Information. Task: To otain a measure of the accuracy of estimated

More information

Joint Universal Lossy Coding and Identification of Stationary Mixing Sources With General Alphabets Maxim Raginsky, Member, IEEE

Joint Universal Lossy Coding and Identification of Stationary Mixing Sources With General Alphabets Maxim Raginsky, Member, IEEE IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 55, NO. 5, MAY 2009 1945 Joint Universal Lossy Coding and Identification of Stationary Mixing Sources With General Alphabets Maxim Raginsky, Member, IEEE Abstract

More information

1 + lim. n n+1. f(x) = x + 1, x 1. and we check that f is increasing, instead. Using the quotient rule, we easily find that. 1 (x + 1) 1 x (x + 1) 2 =

1 + lim. n n+1. f(x) = x + 1, x 1. and we check that f is increasing, instead. Using the quotient rule, we easily find that. 1 (x + 1) 1 x (x + 1) 2 = Chapter 5 Sequences and series 5. Sequences Definition 5. (Sequence). A sequence is a function which is defined on the set N of natural numbers. Since such a function is uniquely determined by its values

More information

Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Constantinos Daskalakis, MIT Gautam Kamath, MIT

Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Constantinos Daskalakis, MIT Gautam Kamath, MIT Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians Constantinos Daskalakis, MIT Gautam Kamath, MIT What s a Gaussian Mixture Model (GMM)? Interpretation 1: PDF is a convex

More information

Lecture 6 January 15, 2014

Lecture 6 January 15, 2014 Advanced Graph Algorithms Jan-Apr 2014 Lecture 6 January 15, 2014 Lecturer: Saket Sourah Scrie: Prafullkumar P Tale 1 Overview In the last lecture we defined simple tree decomposition and stated that for

More information

Entropy power inequality for a family of discrete random variables

Entropy power inequality for a family of discrete random variables 20 IEEE International Symposium on Information Theory Proceedings Entropy power inequality for a family of discrete random variables Naresh Sharma, Smarajit Das and Siddharth Muthurishnan School of Technology

More information

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H.

1 Ex. 1 Verify that the function H(p 1,..., p n ) = k p k log 2 p k satisfies all 8 axioms on H. Problem sheet Ex. Verify that the function H(p,..., p n ) = k p k log p k satisfies all 8 axioms on H. Ex. (Not to be handed in). looking at the notes). List as many of the 8 axioms as you can, (without

More information

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018

EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 Please submit the solutions on Gradescope. EE376A: Homework #3 Due by 11:59pm Saturday, February 10th, 2018 1. Optimal codeword lengths. Although the codeword lengths of an optimal variable length code

More information

16.1 Bounding Capacity with Covering Number

16.1 Bounding Capacity with Covering Number ECE598: Information-theoretic methods in high-dimensional statistics Spring 206 Lecture 6: Upper Bounds for Density Estimation Lecturer: Yihong Wu Scribe: Yang Zhang, Apr, 206 So far we have been mostly

More information

Network Coding for Computing

Network Coding for Computing Networ Coding for Computing Rathinaumar Appuswamy, Massimo Franceschetti, Nihil Karamchandani, and Kenneth Zeger Abstract The following networ computation problem is considered A set of source nodes in

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

ON THE COMPARISON OF BOUNDARY AND INTERIOR SUPPORT POINTS OF A RESPONSE SURFACE UNDER OPTIMALITY CRITERIA. Cross River State, Nigeria

ON THE COMPARISON OF BOUNDARY AND INTERIOR SUPPORT POINTS OF A RESPONSE SURFACE UNDER OPTIMALITY CRITERIA. Cross River State, Nigeria ON THE COMPARISON OF BOUNDARY AND INTERIOR SUPPORT POINTS OF A RESPONSE SURFACE UNDER OPTIMALITY CRITERIA Thomas Adidaume Uge and Stephen Seastian Akpan, Department Of Mathematics/Statistics And Computer

More information