Maximum Entropy Interval Aggregations

Ferdinando Cicalese, Università di Verona, Verona, Italy. Email: cclfdn@univr.it
Ugo Vaccaro, Università di Salerno, Salerno, Italy. Email: uvaccaro@unisa.it

arXiv:1805.05375v1 [cs.IT] 14 May 2018

Abstract — Given a probability distribution p = (p_1,...,p_n) and an integer 1 ≤ m < n, we say that q = (q_1,...,q_m) is a contiguous m-aggregation of p if there exist indices 0 = i_0 < i_1 < ... < i_{m−1} < i_m = n such that for each j = 1,...,m it holds that q_j = Σ_{k=i_{j−1}+1}^{i_j} p_k. In this paper, we consider the problem of efficiently finding the contiguous m-aggregation of maximum entropy. We design a dynamic programming algorithm that solves the problem exactly, and two more time-efficient greedy algorithms that provide slightly sub-optimal solutions. We also discuss a few scenarios where our problem matters.

I. INTRODUCTION

The problem of aggregating data in a compact and meaningful way, and such that the aggregated data retain the maximum possible information contained in the original data, arises in many scenarios [8]. In this paper we consider the following particular instance of the general problem. Let X = {x_1,...,x_n} be a finite alphabet, and X be any random variable (r.v.) taking values in X according to the probability distribution p = (p_1, p_2, ..., p_n), that is, such that P{X = x_i} = p_i > 0, for i = 1, 2, ..., n. Consider a partition Π = (Π_1,...,Π_m), m < n, of the alphabet X, where each class Π_i of the partition Π consists of consecutive elements of X. That is, there exist indices 1 ≤ i_1 < ... < i_{m−1} < i_m = n such that Π_1 = {x_1,...,x_{i_1}}, Π_2 = {x_{i_1+1},...,x_{i_2}}, ..., Π_m = {x_{i_{m−1}+1},...,x_{i_m}}. Any such partition Π = (Π_1,...,Π_m) naturally gives a r.v. Y = f_Π(X), where for each x ∈ X it holds that f_Π(x) = i if and only if x ∈ Π_i. Let q = (q_1,...,q_m) be the probability distribution of the r.v. Y. The values of the probabilities can obviously be computed as follows: for indices 0 = i_0 < i_1 < ... < i_{m−1} < i_m = n it holds that q_j = Σ_{k=i_{j−1}+1}^{i_j} p_k.
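As a concrete illustration (ours, not part of the original text), the following Python sketch computes a contiguous m-aggregation from a choice of cut indices and evaluates its Shannon entropy; the function names are our own.

```python
import math

def aggregate(p, cuts):
    """Contiguous aggregation: cuts = [i_0 = 0, i_1, ..., i_m = n];
    class j collects p_{i_{j-1}+1}, ..., p_{i_j} (0-based slices here)."""
    return [sum(p[a:b]) for a, b in zip(cuts, cuts[1:])]

def entropy(q):
    """Shannon entropy (base 2) of a probability vector."""
    return -sum(x * math.log2(x) for x in q if x > 0)

p = [0.1, 0.3, 0.2, 0.25, 0.15]
q = aggregate(p, [0, 2, 3, 5])   # classes {x1,x2}, {x3}, {x4,x5}
print(q, entropy(q))             # q sums to 1; entropy is in bits
```

Any choice of m−1 interior cut points yields one feasible q; the paper's problem is to pick the cuts maximizing entropy(q).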
The problem we consider in this paper is to determine the value

max_Π I(X; f_Π(X)),   (1)

where I denotes the mutual information and the maximum is computed over all m-class partitions Π = (Π_1,...,Π_m) of the set X, in which each class Π_i of the partition Π consists of consecutive elements of X. Since the function f_Π is deterministic, problem (1) can be equivalently stated as

max_Π H(f_Π(X)),   (2)

where H denotes the Shannon entropy and the maximization takes place over the same domain as in (1). The formulation (1) is common in the area of clustering (e.g., [6], [10]) to emphasize that the objective is to reduce the dimension of the data (i.e., the cardinality of X) under the constraint that the reduced data give the maximum possible information towards the original, non-aggregated data. We remark that, in general, there is no loss of generality in considering the problem (1) for deterministic functions only (e.g., see [9], [13]). The contributions of this paper consist in efficient algorithms to solve the optimization problems (1) and (2). More precisely, we design a dynamic programming algorithm that runs in time O(n²m) to find a partition Π that achieves the maximum in (2). Since the time complexity O(n²m) can be too large in some applications, we also provide much more efficient greedy algorithms that return a solution provably very close to the optimal one. We remark that the optimization problem (2) is strongly NP-hard in case f is an arbitrary function with |f(X)| = m, i.e., when the partition into m classes of X induced by f is not constrained to contain only classes made of contiguous elements of X (see [3]). The rest of the paper is organized as follows. In Section II we discuss the relevance of our results in the context of related works. In Section III we present our O(n²m) dynamic programming algorithm to solve problems (1) and (2). In the final Section IV we present two sub-optimal, but more time-efficient, greedy algorithms for the same problems. II.
RELATED WORK

The problem of aggregating data (or source symbols, if we think of information sources) in an informative way has been widely studied in many different scenarios. One of the motivations is that data aggregation is often a useful preliminary step to reduce the complexity of successive data manipulation. In this section we limit ourselves to pointing out the work that is strictly related to ours. In the paper [12] the authors considered the following problem. Given a discrete memoryless source, emitting symbols from the alphabet X = {x_1,...,x_n} according to the probability distribution p = (p_1, p_2, ..., p_n), the question is to find a partition Π = (Π_1,...,Π_m), m < n, of the source alphabet X where, as before, each Π_i consists of consecutive elements of X, and such that the sum

Σ_{i=1}^{m} Σ_{j=1}^{m} |q_i − q_j|   (3)

is minimized. Each q_j in (3) is the sum of the probabilities p_k's corresponding to the elements x_k ∈ X that belong to Π_j, that is, our q_j = Σ_{k=i_{j−1}+1}^{i_j} p_k. The motivation of the authors of [12] to study the above problem is that the minimization of
expression (3) constitutes the basic step in the well-known Fano algorithm [7] for the m-ary variable-length encoding of a finite-alphabet memoryless source. In fact, solving (3) allows one to find a partition of X such that the cumulative probabilities of each partition class are as similar as possible. Obviously, the basic step has to be iterated in each class Π_i, till the partition is made of singletons. Now, it is not hard to see that

Σ_{i=1}^{m} Σ_{j=1}^{m} |q_i − q_j| = 2(m + 1) − 4 Σ_{i=1}^{m} i·q_[i],   (4)

where (q_[1],...,q_[m]) is the vector that contains the same elements as q = (q_1,...,q_m), but now ordered in non-increasing fashion. From equality (4) one can see that the problem of minimizing expression (3), over all partitions as stated above, is equivalent to maximizing the quantity Σ_{i=1}^{m} i·q_[i] over the same domain. The quantity Σ_{i=1}^{m} i·q_[i] is the well-known guessing entropy by J. Massey [16]. Therefore, while in our problem (2) we seek a partition of X such that the cumulative probabilities of each partition class are as similar as possible, and the measure we use to appraise this quality is the Shannon entropy, the authors of [12] address the same problem using the guessing entropy, instead (this observation is not present in [12]). We should add that the criterion (3) used in [12] allows the authors to prove that the Fano algorithm produces an m-ary variable-length encoding of the given source such that the average length of the encoding is strictly smaller than H(p)/log m + 1 − p_min, for m = 2 and m = 3 (and they conjecture that this is true also for any m ≥ 4), where p is the source probability distribution and p_min is the probability of the least likely source symbol. On the other hand, it is not clear how to efficiently solve the optimization problem (3). In fact, it is not known whether or not it enjoys the optimal substructure property, a necessary condition for the problem to be optimally solvable with standard techniques like dynamic programming, greedy approaches, etc. [5]. As mentioned before, our problem (2) can be optimally solved via dynamic programming.
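The identity (4), as reconstructed above, is easy to sanity-check numerically. The following is our own small illustration (the function names are ours):

```python
def pairwise_diff_sum(q):
    # sum of |q_i - q_j| over all ordered pairs (i, j); this is the
    # quantity (3) summed over the full m x m grid
    return sum(abs(a - b) for a in q for b in q)

def weighted_rank_sum(q):
    # sum_i i * q_[i] with q sorted in non-increasing order,
    # i.e. Massey's guessing entropy of the aggregated distribution
    s = sorted(q, reverse=True)
    return sum((i + 1) * x for i, x in enumerate(s))

q = [0.4, 0.3, 0.2, 0.1]                 # any distribution summing to 1
m = len(q)
lhs = pairwise_diff_sum(q)
rhs = 2 * (m + 1) - 4 * weighted_rank_sum(q)
print(abs(lhs - rhs) < 1e-9)             # True: the two sides agree
```

Since the first term of the right-hand side is a constant, minimizing the left-hand side is indeed the same as maximizing the guessing entropy.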
Numerical simulations suggest that optimal solutions to our problem (2) can be used to construct Fano encodings with the same upper bound on the average length as the ones constructed in [12]. A similar question, in which the aggregation operations on the elements of X are again constrained by given rules, was considered in [4]. There, the authors consider the problem of constructing the summary tree of a given weighted tree, by means of contraction operations on trees. Two types of contractions are allowed: 1) subtrees may be contracted to single nodes that represent the corresponding subtrees, 2) subtrees whose roots are siblings may be contracted to single nodes. Nodes obtained by contracting subtrees have weight equal to the sum of the node weights in the original contracted subtrees. Given a bound on the number of nodes in the resulting summary tree, the problem is to compute the summary tree of maximum entropy, where the entropy of a tree is the Shannon entropy of the normalized node weights. In [18] the authors consider the problem of quantizing a finite alphabet X by collapsing properly chosen contiguous sequences of symbols of X (called convex codecells in [18]) to single elements. The objective is to minimize the expected distortion induced by the quantizer, for some classes of distortion measures. Our similar scenario would correspond to the minimization of H(X) − H(f_Π(X)), not considered in [18]. Our results could find applications also in data compression for sources with large alphabets (e.g., [17]). One could use our techniques as a pre-processing phase to reduce the source alphabet from a large one to a smaller one, in order to obtain a new source that retains most of the entropy of the original one, just because of (2). An encoding of the so-constructed reduced source can easily be transformed into an encoding of the original source by exploiting the fact that the partition of the original source alphabet has been performed with consecutive subsets of symbols.
Finally, other problems similar to ours were considered in the papers [11], [14]. It seems that our findings could also be useful in histogram compression, where the constraint that one can merge only adjacent class intervals is natural [19].

III. AN OPTIMAL DYNAMIC PROGRAMMING ALGORITHM

We find it convenient to formulate problems (1) and (2) in a slightly different language. We give the following definition.

Definition 1. Given an n-dimensional vector of strictly positive numbers p = (p_1,...,p_n) and a positive integer m < n, we say that a vector q = (q_1,...,q_m) is a contiguous m-aggregation of p if the following condition holds: there exist indices 0 = i_0 < i_1 < ... < i_{m−1} < i_m = n such that for each j = 1,...,m it holds that q_j = Σ_{k=i_{j−1}+1}^{i_j} p_k.

Thus, our problems can be formulated as follows:

Problem Definition. Given an n-dimensional probability distribution p = (p_1,...,p_n) (where all components are assumed to be strictly positive) and an integer 1 ≤ m < n, find a contiguous m-aggregation of p of maximum entropy.

Our dynamic programming algorithm proceeds as follows. For j = 1,...,n, let s_j = Σ_{k=1}^{j} p_k. Notice that we can compute all these values in O(n) time. For a sequence of numbers w = w_1,...,w_t such that for each i = 1,...,t, w_i ∈ (0,1] and Σ_{i=1}^{t} w_i ≤ 1, we define the entropy-like sum of w as H(w) = −Σ_{i=1}^{t} w_i log w_i. Clearly, when w is a probability distribution, the entropy-like sum of w coincides with the Shannon entropy of w. For each i = 1,...,m and j = 1,...,n let hq[i,j] be the maximum entropy-like sum of a contiguous i-aggregation of the sequence p_1,...,p_j. Therefore, hq[m,n] is the sought maximum entropy of a contiguous m-aggregation of p. Let q̂ = (q_1,...,q_i) be a contiguous i-aggregation of (p_1,...,p_j) of maximum entropy-like sum. Let r be the index such that q_i = Σ_{k=r}^{j} p_k. We have q_i = s_j − s_{r−1} and H(q̂) = −(s_j − s_{r−1}) log(s_j − s_{r−1}) + H(q′), where q′ = (q_1,...,q_{i−1}). Now we observe that q′ is a contiguous (i−1)-aggregation of (p_1,...,p_{r−1}). Moreover,
since H(q̂) is maximum among the entropy-like sums of all contiguous i-aggregations of (p_1,...,p_j), it must also hold that H(q′) is maximum among the entropy-like sums of all contiguous (i−1)-aggregations of (p_1,...,p_{r−1}). Based on this observation, we can compute the hq[·,·] values recursively as follows:

hq[i,j] = max_{k=i,...,j} { hq[i−1, k−1] − (s_j − s_{k−1}) log(s_j − s_{k−1}) }   for i > 1, j ≥ i,
hq[1,j] = −s_j log s_j.

There are nm values to be computed, and each one of them can be computed in O(n) time (due to the max in the first case). Therefore, the computation of hq[m,n] requires O(n²m) time. By a standard procedure, once one has the whole table hq[·,·], one can reconstruct the contiguous m-aggregation of p achieving entropy hq[m,n] by backtracking on the table.

IV. SUB-OPTIMAL GREEDY ALGORITHMS

We start by recalling a few notions of majorization theory [15] that are relevant to our context.

Definition 2. Given two probability distributions a = (a_1,...,a_n) and b = (b_1,...,b_n) with a_1 ≥ ... ≥ a_n ≥ 0 and b_1 ≥ ... ≥ b_n ≥ 0, Σ_{i=1}^{n} a_i = Σ_{i=1}^{n} b_i = 1, we say that a is majorized by b, and write a ≺ b, if and only if Σ_{k=1}^{i} a_k ≤ Σ_{k=1}^{i} b_k, for all i = 1,...,n.

We use the majorization relationship between vectors of unequal lengths by properly padding the shorter one with the appropriate number of 0's at the end. Majorization induces a lattice structure on P_n = {(p_1,...,p_n) : Σ_{i=1}^{n} p_i = 1, p_1 ≥ ... ≥ p_n ≥ 0}, see [1]. The Shannon entropy function enjoys the important Schur-concavity property [15]: for any x, y ∈ P_n, x ≺ y implies that H(x) ≥ H(y). We also need the concept of aggregation and a result from [2]. Given p = (p_1,...,p_n) ∈ P_n and an integer 1 ≤ m < n, we say that q = (q_1,...,q_m) ∈ P_m is an aggregation of p if there is a partition of {1,...,n} into disjoint sets I_1,...,I_m such that q_j = Σ_{i∈I_j} p_i, for j = 1,...,m.

Lemma 1 ([2]). Let q ∈ P_m be any aggregation of p ∈ P_n. Then it holds that p ≺ q.

We now present our first greedy approximation algorithm for the problem of finding the maximum entropy contiguous m-aggregation of a given probability distribution p = (p_1,...,p_n).
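Before turning to the greedy algorithms, the dynamic program of Section III can be transcribed directly into code. The following Python sketch is our own illustration (not the authors' implementation); it assumes base-2 logarithms and returns both the optimal entropy and the cut indices i_0 < i_1 < ... < i_m.

```python
import math

def max_entropy_aggregation(p, m):
    """O(n^2 * m) dynamic program for the maximum-entropy contiguous
    m-aggregation; a direct transcription of the recurrence."""
    n = len(p)
    s = [0.0] * (n + 1)                  # prefix sums: s[j] = p_1 + ... + p_j
    for j in range(1, n + 1):
        s[j] = s[j - 1] + p[j - 1]

    def h(x):                            # one term of the entropy-like sum
        return -x * math.log2(x) if x > 0 else 0.0

    NEG = float("-inf")
    hq = [[NEG] * (n + 1) for _ in range(m + 1)]
    arg = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):            # base case: a single class
        hq[1][j] = h(s[j])
    for i in range(2, m + 1):
        for j in range(i, n + 1):
            for k in range(i, j + 1):    # last class is p_k, ..., p_j
                val = hq[i - 1][k - 1] + h(s[j] - s[k - 1])
                if val > hq[i][j]:
                    hq[i][j], arg[i][j] = val, k
    cuts, j = [n], n                     # backtrack the optimal cut indices
    for i in range(m, 1, -1):
        j = arg[i][j] - 1
        cuts.append(j)
    cuts.append(0)
    return hq[m][n], cuts[::-1]
```

For instance, on p = (0.5, 0.25, 0.25) with m = 2 the optimum merges the last two symbols, giving q = (0.5, 0.5) with entropy 1 bit.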
The pseudocode of the algorithm is given below. The algorithm has two phases. In the first phase, lines 2 to 9, the algorithm iteratively builds a new component of q as follows. Assume that the first i components of q have been produced by aggregating the first j components of p. If p_{j+1} > 2/m, then q_{i+1} is the aggregation of the singleton interval containing only p_{j+1}. Otherwise, q_{i+1} is set to be the aggregation of the largest number of components p_{j+1}, p_{j+2}, ... such that their sum is not larger than 2/m. For each k = 1,...,i, the values start[k] and end[k] are meant to contain the indices of the first and the last component of p which are aggregated into q_k.

Algorithm 1 A linear time greedy approximation algorithm
GREEDY-APPROXIMATION(p_1, ..., p_n, m)
1: // Assume n > m and an auxiliary value p_{n+1} = 3/m
2: i ← 0, j ← 1
3: partialsum ← p_j
4: while j ≤ n do
5:   i ← i + 1, start[i] ← j
6:   while partialsum + p_{j+1} ≤ 2/m do
7:     partialsum ← partialsum + p_{j+1}, j ← j + 1
8:   q_i ← partialsum, end[i] ← j
9:   j ← j + 1, partialsum ← p_j
10: // At this point i counts the number of components in q
11: // If i < m we are going to split exactly m − i components
12: k ← m − i, j ← 1
13: while k > 0 do
14:   while start[j] = end[j] do
15:     j ← j + 1
16:   i ← i + 1, k ← k − 1
17:   start[i] ← start[j], end[i] ← start[j], start[j] ← start[j] + 1

By construction, start[k] < end[k] implies that q_k ≤ 2/m. The first crucial observation is that, at the end of the first phase, the number i of components in the distribution q under construction is not larger than m. To see this, it is enough to observe that by construction q_{2j−1} + q_{2j} > 2/m, for any j = 1, 2, ..., ⌊i/2⌋. Therefore, arguing by contradiction, if we had i ≥ m + 1 we would reach the following counterfactual inequality

1 = Σ_{s=1}^{i} q_s ≥ Σ_{j=1}^{⌊i/2⌋} (q_{2j−1} + q_{2j}) > ⌊i/2⌋ · (2/m) ≥ (i − 1)/m ≥ m/m = 1.

In the second phase, lines 12–17, the algorithm splits the first m − i components of q which are obtained by aggregating at least two components of p. Notice that, as observed above, such components of q are not larger than 2/m. Hence, also the resulting components into which they are split have size at most 2/m.
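To make the two phases concrete, here is a compact Python rendering of Algorithm 1. It reflects our own reading of the pseudocode, with 0-based indexing and an explicit bounds check in place of the sentinel p_{n+1}; the function name is ours.

```python
def greedy_aggregation(p, m):
    """Sketch of GREEDY-APPROXIMATION: phase 1 packs consecutive p's
    greedily up to 2/m, phase 2 splits composite components until
    exactly m parts exist."""
    n = len(p)
    assert n > m
    parts = []                        # [start, end] intervals, inclusive
    j = 0
    while j < n:
        start, total = j, p[j]
        while j + 1 < n and total + p[j + 1] <= 2.0 / m:
            j += 1
            total += p[j]
        parts.append([start, j])
        j += 1
    # phase 2: split m - len(parts) composite intervals, one element at a time
    k = m - len(parts)
    idx = 0
    while k > 0:
        while parts[idx][0] == parts[idx][1]:   # skip singleton intervals
            idx += 1
        a, b = parts[idx]
        parts[idx] = [a + 1, b]                 # peel the first element off
        parts.append([a, a])
        k -= 1
    return [sum(p[a:b + 1]) for a, b in sorted(parts)]
```

For example, with p the uniform distribution on 10 symbols and m = 4, phase 1 produces two components of mass 0.5 each and phase 2 peels off two singletons, so the returned q has exactly 4 components, each of mass at most 2/m = 0.5.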
It is important to notice that there must exist at least m − i such composite¹ components, because of the assumption n > m and the fact that each component of p is non-zero. As a result of the above considerations, the aggregation q returned by the GREEDY-APPROXIMATION algorithm can be represented, after reordering its components in non-increasing order, as q = (q_1, ..., q_{k′}, q_{k′+1}, ..., q_m), where q_1,...,q_{k′} are all larger than 2/m and coincide with the largest components of p, and the remaining components of q, namely q_{k′+1},...,q_m, are all not larger than 2/m. Let us now define the quantities

A = Σ_{j=k′+1}^{m} q_j = 1 − Σ_{j=1}^{k′} q_j,  and  B = Σ_{j=1}^{k′} q_j log(1/q_j).

It holds that

H(q) = Σ_{j=1}^{k′} q_j log(1/q_j) + Σ_{j=k′+1}^{m} q_j log(1/q_j)   (5)
     = B + Σ_{j=k′+1}^{m} q_j log(1/q_j)   (6)

¹We are calling a component composite if it is obtained as the sum of at least two components of p.
     ≥ B + Σ_{j=k′+1}^{m} q_j log(m/2)   (7)
     = B + A log m − A   (8)

where (6) follows by definition of B; (7) follows by the fact that q_j ≤ 2/m for any j > k′; (8) follows by definition of A and the basic properties of the logarithm.

Lemma 2. Let q̃ be the probability distribution defined as q̃ = (q_1, ..., q_{k′}, A/(m−k′), ..., A/(m−k′)). Then, it holds that H(q) ≥ H(q̃) − 2/(e ln 2).

Proof. We have

H(q̃) = Σ_{j=1}^{k′} q_j log(1/q_j) + (m−k′) · (A/(m−k′)) log((m−k′)/A) = B + A log(m−k′) − A log A.

Therefore, by using the above lower bound (5)-(8) on the entropy of q, it follows that

H(q̃) − H(q) ≤ A log((m−k′)/m) − A log A + A ≤ −A log A + A ≤ 2/(e ln 2),

where the second inequality follows since A log((m−k′)/m) ≤ 0 for any k′ ≥ 0, and the last inequality follows by the fact that A ∈ [0,1] and the maximum of the function −x log x + x in the interval [0,1] is 2/(e ln 2).

Let q* = (q*_1, ..., q*_m) be a contiguous m-aggregation of p of maximum entropy. We can use q* to compare the entropy of our greedily constructed contiguous m-aggregation q to the entropy of q*. We prepare the following

Lemma 3. It holds that q̃ ≺ q*, therefore H(q̃) ≥ H(q*).

Proof. Assume, w.l.o.g., that the components of q* are sorted in non-increasing order. Let p̃ = (p̃_1,...,p̃_n) be the probability distribution obtained by reordering the components of p in non-increasing order. It is not hard to see that, by construction, we have p̃_j = q_j for each j = 1,...,k′. Since q* is an aggregation of p, by Lemma 1, we have that p ≺ q*, which immediately implies

Σ_{s=1}^{j} q̃_s = Σ_{s=1}^{j} p̃_s ≤ Σ_{s=1}^{j} q*_s   for each j = 1,...,k′.   (9)

Moreover, by the last inequality with j = k′ it follows that Σ_{s=k′+1}^{m} q*_s ≤ 1 − Σ_{s=1}^{k′} q̃_s = A. This, together with the assumption that q*_1 ≥ ... ≥ q*_{k′} ≥ q*_{k′+1} ≥ ... ≥ q*_m, implies

Σ_{s=t+1}^{m} q*_s ≤ ((m−t)/(m−k′)) · A   for any t ≥ k′.   (10)

Then, for each j = k′,...,m we have

Σ_{s=1}^{j} q̃_s = 1 − ((m−j)/(m−k′)) · A ≤ 1 − Σ_{s=j+1}^{m} q*_s = Σ_{s=1}^{j} q*_s,

that together with (9) implies q̃ ≺ q*. This concludes the proof of the first statement of the Lemma. The second statement immediately follows by the Schur-concavity of the entropy function.

We are now ready to summarize our findings.

Theorem 1. Let q be the contiguous m-aggregation of p returned by GREEDY-APPROXIMATION. Let q* be a contiguous m-aggregation of p of maximum entropy.
Then, it holds that H(q) ≥ H(q*) − 2/(e ln 2) = H(q*) − 1.0614756...

Proof. Directly from Lemmas 3 and 2.

A. A slightly improved greedy approach

We can improve the approximation guarantee of Algorithm 1 by a refined greedy approach of complexity O(n + m log m). The new idea is to build the components of q in such a way that they are either not larger than 3/(2m) or they coincide with some large component of p. More precisely, when building a new component of q, say q_i, the algorithm puts together consecutive components of p as long as their sum, denoted partialsum, is not larger than 1/m. If, when trying to add the next component, say p_j, the total sum becomes larger than 1/m, the following three cases are considered:

Case 1. partialsum + p_j ∈ [1/m, 3/(2m)]. In this case q_i is set to include also p_j, hence becoming a component of q of size not larger than 3/(2m).

Case 2. partialsum + p_j > 2/m. In this case we produce up to two components of q. Precisely, if partialsum = 0, that is, p_j > 2/m, we set q_i = p_j and only one new component is created. Otherwise, q_i is set to partialsum (i.e., it is the sum of the interval up to p_{j−1}, and it is not larger than 1/m) and q_{i+1} is set to be equal to p_j. Notice that in this case q_{i+1} might be larger than 3/(2m), but it is a non-composite component.

Case 3. partialsum + p_j ∈ (3/(2m), 2/m]. In this case we produce one component of q, namely q_i is set to partialsum + p_j, and we mark it.

We first observe that the total number of components of q created by this procedure is not larger than m. More precisely, let k_1, k_2, k_3 be the number of components created by the application of Case 1, 2, and 3, respectively. Each component created by Case 1 has size at least 1/m. When we apply Case 2 we create either one component of size > 2/m or two components of total sum > 2/m. Altogether, the k_2 components created by Case 2 have total sum at least k_2/m. Then, since each component created by applying Case 3 has size at least 3/(2m), we have that

k_3 ≤ (1 − (k_1 + k_2)/m) / (3/(2m)) = (2/3)(m − k_1 − k_2),

hence k_1 + k_2 + k_3 ≤ (2/3)m + (1/3)(k_1 + k_2), from which we get 1) k_1 + k_2 + k_3 ≤ m, and 2) m − k_1 − k_2 − k_3 ≥ k_3/2.
Inequalities 1) and 2) mean that if k_3 > 0 then the number of components created is smaller than m by a quantity which equals at least half of k_3. In other words, we are allowed to split at least half of the k_3 components created by Case 3 and the resulting total number of components will still be not larger than m. In the
second phase of the algorithm, the largest components created by Case 3 are split. As a result of the above considerations, the final distribution q returned by the algorithm has: (i) components larger than 2/m, which are singletons, i.e., coincide with components of p; the remaining components can be divided into two sets, the components of size > 3/(2m) and the ones of size ≤ 3/(2m), with the second set having the larger total probability mass. In formulas, we can represent the probability vector q, after reordering its components in non-increasing order, as q = (q_1, ..., q_{k′}, q_{k′+1}, ..., q_{j′}, q_{j′+1}, ..., q_m), where: (i) q_1,...,q_{k′} are all larger than 2/m and coincide with the largest components of p; (ii) q_{k′+1},...,q_{j′} are all in the interval (3/(2m), 2/m]; (iii) q_{j′+1},...,q_m are all not larger than 3/(2m). Let us define the quantities

A_1 = Σ_{s=k′+1}^{j′} q_s,   A_2 = Σ_{s=j′+1}^{m} q_s,   B = Σ_{s=1}^{k′} q_s log(1/q_s).

Let A = A_1 + A_2. Since the algorithm splits the largest components of size > 3/(2m), it follows that A_2 ≥ A/2. Then, by proceeding like in the previous section, we have

H(q) = Σ_{s=1}^{k′} q_s log(1/q_s) + Σ_{s=k′+1}^{j′} q_s log(1/q_s) + Σ_{s=j′+1}^{m} q_s log(1/q_s)   (11)
     ≥ B + Σ_{s=k′+1}^{j′} q_s log(m/2) + Σ_{s=j′+1}^{m} q_s log(2m/3)   (12)
     = B + (A_1 + A_2) log m − A_1 − A_2 log(3/2)   (13)
     ≥ B + A log m − (A/2) log 3   (14)

where the last inequality holds since A_2 ≥ A/2. Proceeding like in Lemma 2 above, we have the following result.

Lemma 4. Let q̃ be the probability distribution defined as q̃ = (q_1, ..., q_{k′}, A/(m−k′), ..., A/(m−k′)). It holds that H(q) ≥ H(q̃) − √3/(e ln 2).

This result, together with Lemma 3, implies

Theorem 2. Let q be the contiguous m-aggregation of p returned by the algorithm GREEDY-2. Let q* be a contiguous m-aggregation of p of maximum entropy. Then, it holds that H(q) ≥ H(q*) − √3/(e ln 2) = H(q*) − 0.91926...

REFERENCES

[1] F. Cicalese and U. Vaccaro, "Supermodularity and subadditivity properties of the entropy on the majorization lattice," IEEE Transactions on Information Theory, Vol. 48, 933–938, 2002.
[2] F. Cicalese, L. Gargano, and U. Vaccaro, "H(X) vs. H(f(X))," in: Proc. of ISIT 2017, pp. 51–55, 2017.
[3] F. Cicalese, L. Gargano, and U.
Vaccaro, "Bounds on the Entropy of a Function of a Random Variable and their Applications," IEEE Transactions on Information Theory, Vol. 64, 2220–2230, 2018.
[4] R. Cole and H. Karloff, "Fast algorithms for constructing maximum entropy summary trees," in: Proc. of ICALP 2014, pp. 332–343, 2014.
[5] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2009.

Algorithm 2 Improved approximation in O(n + m log m) time
GREEDY-2(p_1, ..., p_n, m)
// assume n > m and an auxiliary value p_{n+1} = 2/m
1: i ← 0, j ← 1
2: while j ≤ n do
3:   i ← i + 1, start[i] ← j, partialsum ← 0
4:   while partialsum + p_j ≤ 1/m and j ≤ n do
5:     partialsum ← partialsum + p_j, j ← j + 1
6:   if j > n then
7:     q_i ← partialsum
8:     break while
9:   if partialsum + p_j ∈ (1/m, 3/(2m)] then
10:    q_i ← partialsum + p_j, end[i] ← j
11:  else
12:    if partialsum + p_j > 2/m then
13:      if partialsum > 0 then
14:        q_i ← partialsum, end[i] ← j − 1, i ← i + 1
15:      q_i ← p_j, start[i] ← j, end[i] ← j
16:    else
17:      // we are left with the case partialsum + p_j ∈ (3/(2m), 2/m]
18:      q_i ← partialsum + p_j, end[i] ← j
19:      if partialsum > 0 then
20:        Add index i to the list Marked-indices: mark[i] ← 1
21:    j ← j + 1
22: // At this point i counts the number of components in q
23: // If i < m we are going to split exactly m − i components, starting with the list of Marked-indices
24: k ← m − i, j ← 1
25: Sort the set Q of marked components in non-increasing order
26: Split the m − i largest components in Q. The split is done by creating one component with the largest/last piece and one component with the remaining parts. If in Q there are fewer than m − i components, complete with composite components.

[6] L. Faivishevsky and J. Goldberger, "Nonparametric information theoretic clustering algorithm," in: Proc. of the 27th International Conference on Machine Learning (ICML-10), pp. 351–358, 2010.
[7] R.M. Fano, "The transmission of information," Research Laboratory of Electronics, Mass. Inst. of Techn. (MIT), Tech. Report No. 65, 1949.
[8] G. Gan, C. Ma, and J.
Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007.
[9] B.C. Geiger and R.A. Amjad, "Hard Clusters Maximize Mutual Information," arXiv:1608.04872 [cs.IT], 2016.
[10] M. Kearns, Y. Mansour, and A.Y. Ng, "An information-theoretic analysis of hard and soft assignment methods for clustering," in: Learning in Graphical Models, Springer Netherlands, pp. 495–520, 1998.
[11] T. Kämpke and R. Kober, "Discrete signal quantizations," Pattern Recognition, vol. 32, pp. 619–634, 1999.
[12] S. Krajči, C.-F. Liu, L. Mikeš, and S.M. Moser, "Performance analysis of Fano coding," in: Proc. of ISIT 2015, pp. 1746–1750, 2015.
[13] B.M. Kurkoski and H. Yagi, "Quantization of Binary-Input Discrete Memoryless Channels," IEEE Transactions on Information Theory, Vol. 60, 4544–4552, 2014.
[14] R. Lamarche-Perrin, Y. Demazeau, and J.-M. Vincent, "The best-partitions problem: How to build meaningful aggregations," in: Proc. of the 2013 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 309–404, 2013.
[15] A.W. Marshall, I. Olkin, and B.C. Arnold, Inequalities: Theory of Majorization and Its Applications, Springer, New York, 2009.
[16] J.L. Massey, "Guessing and entropy," in: Proc. of the 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, p. 204, 1994.
[17] A. Moffat and A. Turpin, "Efficient construction of minimum-redundancy codes for large alphabets," IEEE Transactions on Information Theory, vol. 44, 1650–1657, 1998.
[18] D. Muresan and M. Effros, "Quantization as Histogram Segmentation: Optimal Scalar Quantizer Design in Network Systems," IEEE Transactions on Information Theory, Vol. 54, 344–366, 2008.
[19] G.M. Perillo and E. Marone, "Determination of optimal number of class intervals using maximum entropy," Mathematical Geology, 18, 401–407, 1986.