Maximum Entropy Interval Aggregations

Ferdinando Cicalese, Università di Verona, Verona, Italy. Email: cclfdn@univr.it
Ugo Vaccaro, Università di Salerno, Salerno, Italy. Email: uvaccaro@unisa.it

arXiv:1805.05375v1 [cs.IT] 14 May 2018

Abstract—Given a probability distribution p = (p_1, ..., p_n) and an integer 1 < m < n, we say that q = (q_1, ..., q_m) is a contiguous m-aggregation of p if there exist indices 0 = i_0 < i_1 < ... < i_{m-1} < i_m = n such that for each j = 1, ..., m it holds that

$$ q_j = \sum_{k=i_{j-1}+1}^{i_j} p_k. $$

In this paper we consider the problem of efficiently finding the contiguous m-aggregation of maximum entropy. We design a dynamic programming algorithm that solves the problem exactly, and two more time-efficient greedy algorithms that provide slightly sub-optimal solutions. We also discuss a few scenarios where our problem matters.

I. INTRODUCTION

The problem of aggregating data in a compact and meaningful way, such that the aggregated data retain the maximum possible information contained in the original data, arises in many scenarios [8]. In this paper we consider the following particular instance of the general problem. Let 𝒳 = {x_1, ..., x_n} be a finite alphabet, and let X be a random variable (r.v.) taking values in 𝒳 according to the probability distribution p = (p_1, p_2, ..., p_n), that is, such that P{X = x_i} = p_i > 0 for i = 1, 2, ..., n. Consider a partition Π = (Π_1, ..., Π_m), m < n, of the alphabet 𝒳, where each class Π_i of the partition Π consists of consecutive elements of 𝒳. That is, there exist indices 1 ≤ i_1 < ... < i_{m-1} < i_m = n such that

$$ \Pi_1 = \{x_1, \dots, x_{i_1}\}, \quad \Pi_2 = \{x_{i_1+1}, \dots, x_{i_2}\}, \quad \dots, \quad \Pi_m = \{x_{i_{m-1}+1}, \dots, x_{i_m}\}. $$

Any such partition Π = (Π_1, ..., Π_m) naturally gives a r.v. Y = f_Π(X), where for each x ∈ 𝒳 it holds that f_Π(x) = i if and only if x ∈ Π_i. Let q = (q_1, ..., q_m) be the probability distribution of the r.v. Y. The probabilities q_j can obviously be computed as follows: for the indices 0 = i_0 < i_1 < ... < i_{m-1} < i_m = n it holds that q_j = \sum_{k=i_{j-1}+1}^{i_j} p_k. The problem we consider in this paper is to determine the value

$$ \max_{\Pi} I(X; f_\Pi(X)), \qquad (1) $$

where I denotes the mutual information and the maximum is computed over all m-class partitions Π = (Π_1, ..., Π_m) of the set 𝒳 in which each class Π_i consists of consecutive elements of 𝒳. Since the function f_Π is deterministic, problem (1) can be equivalently stated as

$$ \max_{\Pi} H(f_\Pi(X)), \qquad (2) $$

where H denotes the Shannon entropy and the maximization takes place over the same domain as in (1). The formulation (1) is common in the area of clustering (e.g., [6], [10]) to emphasize that the objective is to reduce the dimension of the data (i.e., the cardinality of 𝒳) under the constraint that the reduced data give the maximum possible information towards the original, non-aggregated data. We remark that, in general, there is no loss of generality in considering problem (1) for deterministic functions only (e.g., see [9], [13]).

The contributions of this paper are efficient algorithms for the optimization problems (1) and (2). More precisely, we design a dynamic programming algorithm that runs in time O(n²m) and finds a partition Π achieving the maximum in (2). Since the time complexity O(n²m) can be too large in some applications, we also provide much more time-efficient greedy algorithms that return a solution provably very close to the optimal one. We remark that the optimization problem (2) is strongly NP-hard when f is an arbitrary function inducing a partition of 𝒳 into m classes, i.e., when the partition of 𝒳 induced by f is not constrained to contain only classes made of contiguous elements of 𝒳 (see [3]).
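To make the definitions concrete, consider p = (0.4, 0.3, 0.2, 0.1) and m = 2: the three contiguous 2-aggregations of p are (0.4, 0.6), (0.7, 0.3) and (0.9, 0.1), with entropies of roughly 0.971, 0.881 and 0.469 bits, so the first is optimal. The brute-force enumerator below is ours, not part of the paper (all names are our own); it is only meant as ground truth for the algorithms discussed in the following sections.

from itertools import combinations
from math import log2

def entropy(q):
    """Shannon entropy (in bits) of a probability vector."""
    return -sum(x * log2(x) for x in q if x > 0)

def brute_force_best_aggregation(p, m):
    """Enumerate all C(n-1, m-1) contiguous m-aggregations of p and
    return one of maximum entropy. Exponential: for tiny checks only."""
    n = len(p)
    best = None
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0,) + cuts + (n,)
        q = [sum(p[a:b]) for a, b in zip(bounds, bounds[1:])]
        if best is None or entropy(q) > entropy(best):
            best = q
    return best

print(brute_force_best_aggregation([0.4, 0.3, 0.2, 0.1], 2))
# -> [0.4, 0.6...]   (entropy ~ 0.971 bits)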
The rest of the paper is organized as follows. In Section II we discuss the relevance of our results in the context of related work. In Section III we present our O(n²m) dynamic programming algorithm for problems (1) and (2). In the final Section IV we present two sub-optimal, but more time-efficient, greedy algorithms for the same problems.

II. RELATED WORK

The problem of aggregating data (or source symbols, if we think of information sources) in an informative way has been widely studied in many different scenarios. One of the motivations is that data aggregation is often a useful preliminary step that reduces the complexity of subsequent data manipulation. In this section we limit ourselves to pointing out the work that is most strictly related to ours. In the paper [12] the authors considered the following problem. Given a discrete memoryless source emitting symbols from the alphabet 𝒳 = {x_1, ..., x_n} according to the probability distribution p = (p_1, p_2, ..., p_n), the question is to find a partition Π = (Π_1, ..., Π_m), m < n, of the source alphabet 𝒳 where, as before, each Π_i consists of consecutive elements of 𝒳, such that the sum

$$ \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} |q_i - q_j| \qquad (3) $$

is minimized. Each q_j in (3) is the sum of the probabilities p_k corresponding to the elements x_k ∈ 𝒳 that belong to Π_j, that is, our q_j = \sum_{k=i_{j-1}+1}^{i_j} p_k. The motivation of the authors of [12] for studying the above problem is that the minimization of

expression (3) constitutes the basic step of the well-known Fano algorithm [7] for the m-ary variable-length encoding of a finite-alphabet memoryless source. In fact, solving (3) allows one to find a partition of 𝒳 such that the cumulative probabilities of the partition classes are as similar as possible. Obviously, the basic step has to be iterated within each class Π_i, until the partition is made of singletons only. Now, it is not hard to see that

$$ \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} |q_i - q_j| \;=\; m + 1 - 2 \sum_{i=1}^{m} i \, q_{[i]}, \qquad (4) $$

where (q_{[1]}, ..., q_{[m]}) is the vector that contains the same elements as q = (q_1, ..., q_m), but ordered in non-increasing fashion. (To see (4), note that in the sorted order each q_{[i]} appears in the left-hand side with coefficient (m − i) − (i − 1) = m + 1 − 2i.) From equality (4) one sees that minimizing expression (3), over all partitions as stated above, is equivalent to maximizing the quantity \sum_{i=1}^{m} i q_{[i]} over the same domain. The quantity \sum_{i=1}^{m} i q_{[i]} is the well-known guessing entropy of J. Massey [16]. Therefore, while in our problem (2) we seek a partition of 𝒳 whose class probabilities are as similar as possible, with the Shannon entropy as the measure used to appraise this similarity, the authors of [12] address the same problem using the guessing entropy instead (this observation is not present in [12]). We should add that the criterion (3) used in [12] allows the authors to prove that the Fano algorithm produces an m-ary variable-length encoding of the given source whose average length is strictly smaller than H(p)/log m + 1 − p_min, for m = 2 and m = 3 (and they conjecture that this holds for any m ≥ 4 as well), where p is the source probability distribution and p_min is the probability of the least likely source symbol. On the other hand, it is not clear how to efficiently solve the optimization problem (3). In fact, it is not even known whether it enjoys the optimal substructure property, a necessary condition for the problem to be optimally solvable with standard techniques such as dynamic programming or greedy approaches [5]. As mentioned before, our problem (2) can be optimally solved via dynamic programming. Numerical simulations suggest that optimal solutions to our problem (2) can be used to construct Fano encodings with the same upper bound on the average length as the ones constructed in [12].
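Identity (4) is easy to confirm numerically; the following minimal check (ours, assuming nothing beyond the statement of (4)) draws random points of the simplex and compares both sides.

import random

def check_identity_4(m, trials=10_000):
    """Check: sum_{i<j} |q_i - q_j| == m + 1 - 2 * sum_i i * q_[i],
    where q_[1] >= ... >= q_[m] is q sorted in non-increasing order."""
    for _ in range(trials):
        q = [random.random() for _ in range(m)]
        total = sum(q)
        q = [x / total for x in q]          # random point of the simplex
        lhs = sum(abs(q[i] - q[j])
                  for i in range(m) for j in range(i + 1, m))
        rhs = m + 1 - 2 * sum((i + 1) * x
                              for i, x in enumerate(sorted(q, reverse=True)))
        assert abs(lhs - rhs) < 1e-9
    return True

print(check_identity_4(m=6))   # -> True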
A similar question, in which the aggregation operations on the elements of 𝒳 are again constrained by given rules, was considered in [4]. There, the authors study the problem of constructing the summary tree of a given weighted tree by means of contraction operations on trees. Two types of contractions are allowed: 1) subtrees may be contracted to single nodes that represent the corresponding subtrees; 2) subtrees whose roots are siblings may be contracted to single nodes. Nodes obtained by contracting subtrees have weight equal to the sum of the node weights in the original contracted subtrees. Given a bound on the number of nodes in the resulting summary tree, the problem is to compute the summary tree of maximum entropy, where the entropy of a tree is the Shannon entropy of the normalized node weights. In [18] the authors consider the problem of quantizing a finite alphabet 𝒳 by collapsing properly chosen contiguous sequences of symbols of 𝒳 (called convex codecells in [18]) to single elements. The objective is to minimize the expected distortion induced by the quantizer, for some classes of distortion measures. The scenario closest to ours would correspond to the minimization of H(X) − H(f_Π(X)), which is not considered in [18]. Our results could also find applications in data compression for sources with large alphabets (e.g., [17]). One could use our techniques as a pre-processing phase that reduces a large source alphabet to a smaller one, obtaining a new source that retains most of the entropy of the original one, just because of (2). An encoding of the so-constructed reduced source can easily be transformed into an encoding of the original source, by exploiting the fact that the partition of the original source alphabet is made of consecutive subsets of symbols. Finally, other problems similar to ours were considered in [11], [14]. It also seems that our findings could be useful in histogram compression, where the constraint that one can merge only adjacent class intervals is natural [19].

III. AN OPTIMAL DYNAMIC PROGRAMMING ALGORITHM

We find it convenient to formulate problems (1) and (2) in a slightly different language. We give the following definition.

Definition 1. Given an n-dimensional vector of strictly positive numbers p = (p_1, ..., p_n) and a positive integer m < n, we say that a vector q = (q_1, ..., q_m) is a contiguous m-aggregation of p if the following condition holds: there exist indices 0 = i_0 < i_1 < ... < i_{m-1} < i_m = n such that for each j = 1, ..., m it holds that q_j = \sum_{k=i_{j-1}+1}^{i_j} p_k.

Thus, our problems can be formulated as follows.

Problem Definition. Given an n-dimensional probability distribution p = (p_1, ..., p_n) (where all components are assumed to be strictly positive) and an integer 1 < m < n, find a contiguous m-aggregation of p of maximum entropy.

Our dynamic programming algorithm proceeds as follows. For j = 1, ..., n, let s_j = \sum_{k=1}^{j} p_k. Notice that we can compute all these values in O(n) time. For a sequence of numbers w = w_1, ..., w_t such that w_i ∈ (0, 1] for each i = 1, ..., t, and \sum_{i=1}^{t} w_i ≤ 1, we define the entropy-like sum of w as Ĥ(w) = −\sum_{i=1}^{t} w_i log w_i. Clearly, when w is a probability distribution, the entropy-like sum of w coincides with the Shannon entropy of w. For each i = 1, ..., m and j = 1, ..., n, let hq[i, j] be the maximum entropy-like sum of a contiguous i-aggregation of the sequence p_1, ..., p_j. Therefore, hq[m, n] is the sought maximum entropy of a contiguous m-aggregation of p. Let q̂ = (q_1, ..., q_i) be a contiguous i-aggregation of (p_1, ..., p_j) of maximum entropy-like sum, and let r be the index such that q_i = \sum_{k=r}^{j} p_k. We have q_i = s_j − s_{r−1} and Ĥ(q̂) = −(s_j − s_{r−1}) log(s_j − s_{r−1}) + Ĥ(q′), where q′ = (q_1, ..., q_{i−1}). Now we observe that q′ is a contiguous (i−1)-aggregation of (p_1, ..., p_{r−1}). Moreover, since Ĥ(q̂) is maximum among the entropy-like sums of all contiguous i-aggregations of (p_1, ..., p_j), it must also hold that Ĥ(q′) is maximum among the entropy-like sums of all contiguous (i−1)-aggregations of (p_1, ..., p_{r−1}). Based on this observation, we can compute the values hq[·, ·] recursively as follows:

$$ hq[i,j] = \begin{cases} \max_{k=i,\dots,j} \left\{ hq[i-1,k-1] - (s_j - s_{k-1}) \log (s_j - s_{k-1}) \right\} & i > 1,\ j \ge i, \\ -s_j \log s_j & i = 1. \end{cases} $$

There are nm values to be computed, and each of them can be computed in O(n) time (due to the max in the first case). Therefore, the computation of hq[m, n] requires O(n²m) time. By a standard procedure, once one has the whole table hq[·, ·], one can reconstruct a contiguous m-aggregation of p achieving entropy hq[m, n] by backtracking on the table.
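The recurrence translates directly into code. Below is a sketch (ours; function and variable names are not from the paper) of the O(n²m) dynamic program, including the backtracking step.

from math import log2

def max_entropy_contiguous_aggregation(p, m):
    """O(n^2 * m) dynamic program for the maximum-entropy contiguous
    m-aggregation of p, following the recurrence for hq[i][j]."""
    n = len(p)
    s = [0.0] * (n + 1)                   # prefix sums: s[j] = p_1 + ... + p_j
    for j in range(1, n + 1):
        s[j] = s[j - 1] + p[j - 1]

    def term(a, b):                       # -(s_b - s_a) log (s_b - s_a)
        w = s[b] - s[a]
        return -w * log2(w)

    NEG = float("-inf")
    hq = [[NEG] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]   # argmax, for backtracking
    for j in range(1, n + 1):
        hq[1][j] = term(0, j)
    for i in range(2, m + 1):
        for j in range(i, n + 1):
            for k in range(i, j + 1):     # last block is p_k ... p_j
                v = hq[i - 1][k - 1] + term(k - 1, j)
                if v > hq[i][j]:
                    hq[i][j], cut[i][j] = v, k
    # backtrack to recover the block masses
    q, j = [], n
    for i in range(m, 1, -1):
        k = cut[i][j]
        q.append(s[j] - s[k - 1])
        j = k - 1
    q.append(s[j])
    return hq[m][n], q[::-1]

print(max_entropy_contiguous_aggregation([0.4, 0.3, 0.2, 0.1], 2))
# -> (0.9709..., [0.4, 0.6]), matching the brute-force checker above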

IV. SUB-OPTIMAL GREEDY ALGORITHMS

We start by recalling a few notions of majorization theory [15] that are relevant to our context.

Definition 2. Given two probability distributions a = (a_1, ..., a_n) and b = (b_1, ..., b_n) with a_1 ≥ ... ≥ a_n ≥ 0, b_1 ≥ ... ≥ b_n ≥ 0, and \sum_{i=1}^{n} a_i = \sum_{i=1}^{n} b_i = 1, we say that a is majorized by b, and write a ⪯ b, if and only if \sum_{k=1}^{i} a_k ≤ \sum_{k=1}^{i} b_k for all i = 1, ..., n.

We use the majorization relationship between vectors of unequal lengths by padding the shorter one with the appropriate number of 0's at the end. Majorization induces a lattice structure on P_n = {(p_1, ..., p_n) : \sum_{i=1}^{n} p_i = 1, p_1 ≥ ... ≥ p_n ≥ 0}, see [1]. The Shannon entropy function enjoys the important Schur-concavity property [15]: for any x, y ∈ P_n, x ⪯ y implies H(x) ≥ H(y).

We also need the concept of aggregation and a result from [2]. Given p = (p_1, ..., p_n) ∈ P_n and an integer 1 < m < n, we say that q = (q_1, ..., q_m) ∈ P_m is an aggregation of p if there is a partition of {1, ..., n} into disjoint sets I_1, ..., I_m such that q_j = \sum_{i ∈ I_j} p_i, for j = 1, ..., m.

Lemma 1 ([2]). Let q ∈ P_m be any aggregation of p ∈ P_n. Then it holds that p ⪯ q.
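Lemma 1 is easy to probe empirically. The following sketch (ours) checks Definition 2, with zero-padding, on random aggregations.

import random

def majorized(a, b):
    """True iff a is majorized by b (Definition 2); the shorter vector is
    padded with zeros and both are compared in non-increasing order."""
    n = max(len(a), len(b))
    a = sorted(a, reverse=True) + [0.0] * (n - len(a))
    b = sorted(b, reverse=True) + [0.0] * (n - len(b))
    ca = cb = 0.0
    for x, y in zip(a, b):
        ca, cb = ca + x, cb + y
        if ca > cb + 1e-12:
            return False
    return True

# random aggregation q of a random p: assign each index to one of m classes
n, m = 10, 4
p = [random.random() for _ in range(n)]
total = sum(p)
p = [x / total for x in p]
labels = [random.randrange(m) for _ in range(n)]
q = [sum(x for x, c in zip(p, labels) if c == j) for j in range(m)]
assert majorized(p, q)        # Lemma 1: p is majorized by q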
We now present our first greedy approximation algorithm for the problem of finding the maximum entropy contiguous m-aggregation of a given probability distribution p = (p_1, ..., p_n). The pseudocode is given below as Algorithm 1. The algorithm has two phases. In the first phase, lines 2 to 9, the algorithm iteratively builds a new component of q as follows. Assume that the first i components of q have been produced by aggregating the first j components of p. If p_{j+1} > 2/m, then q_{i+1} is the aggregation of the singleton interval containing only p_{j+1}. Otherwise, q_{i+1} is set to be the aggregation of the largest number of components p_{j+1}, p_{j+2}, ... such that their sum is not larger than 2/m. For each k = 1, ..., i, the values start[k] and end[k] contain the indices of the first and the last component of p which are aggregated into q_k. By construction, start[k] ≠ end[k] indicates that q_k ≤ 2/m.

Algorithm 1: A linear time greedy approximation algorithm.

GREEDY-APPROXIMATION(p_1, ..., p_n, m)
 1: // Assume n > m and an auxiliary value p_{n+1} = 3/m
 2: i ← 0, j ← 1
 3: partialsum ← p_j
 4: while j ≤ n do
 5:   i ← i + 1, start[i] ← j
 6:   while partialsum + p_{j+1} ≤ 2/m do
 7:     partialsum ← partialsum + p_{j+1}, j ← j + 1
 8:   q_i ← partialsum, end[i] ← j
 9:   j ← j + 1, partialsum ← p_j
10: // At this point i counts the number of components in q
11: // If i < m we are going to split exactly m − i components
12: k ← m − i, j ← 1
13: while k > 0 do
14:   while start[j] = end[j] do
15:     j ← j + 1
16:   i ← i + 1, k ← k − 1
17:   start[i] ← start[j], end[i] ← start[j], start[j] ← start[j] + 1

The first crucial observation is that, at the end of the first phase, the number i of components in the distribution q under construction is not larger than m. To see this, it is enough to observe that by construction q_{2j−1} + q_{2j} > 2/m for any j = 1, 2, ..., ⌊i/2⌋. Therefore, arguing by contradiction, if we had i ≥ m + 1 we would reach the following counterfactual inequality:

$$ 1 = \sum_{j=1}^{i} q_j \;\ge\; \sum_{j=1}^{\lfloor i/2 \rfloor} (q_{2j-1} + q_{2j}) \;>\; \left\lfloor \frac{i}{2} \right\rfloor \cdot \frac{2}{m} \;\ge\; 1. $$

In the second phase, lines 12–17, the algorithm splits the first m − i components of q that are obtained by aggregating at least two components of p. Notice that, as observed above, such components of q are not larger than 2/m; hence, the resulting components into which they are split also have size at most 2/m. It is important to notice that there must exist at least m − i such composite¹ components, because of the assumption n > m and the fact that each component of p is non-zero.

As a result of the above considerations, the aggregation q returned by the GREEDY-APPROXIMATION algorithm can be represented, after reordering its components in non-increasing order, as q = (q_1, ..., q_{k*}, q_{k*+1}, ..., q_m), where q_1, ..., q_{k*} are all larger than 2/m and coincide with the largest components of p, while the remaining components of q, namely q_{k*+1}, ..., q_m, are all not larger than 2/m. Let us now define the quantities

$$ A = 1 - \sum_{s=1}^{k^*} q_s, \qquad B = \sum_{s=1}^{k^*} q_s \log \frac{1}{q_s}. $$

It holds that

$$ \begin{aligned} H(q) &= \sum_{s=1}^{k^*} q_s \log \frac{1}{q_s} + \sum_{s=k^*+1}^{m} q_s \log \frac{1}{q_s} && (5) \\ &= B + \sum_{s=k^*+1}^{m} q_s \log \frac{1}{q_s} && (6) \\ &\ge B + \sum_{s=k^*+1}^{m} q_s \log \frac{m}{2} && (7) \\ &= B + A \log m - A, && (8) \end{aligned} $$

where (6) follows by the definition of B; (7) follows from the fact that q_j ≤ 2/m for any j > k*; and (8) follows by the definition of A and basic properties of the logarithm.

¹ We call a component composite if it is obtained as the sum of at least two components of p.
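For concreteness, here is a compact Python rendering of the two phases of GREEDY-APPROXIMATION (ours; it keeps blocks as explicit lists instead of the start/end index bookkeeping of the pseudocode).

def greedy_approximation(p, m):
    """Linear-time greedy contiguous m-aggregation (our rendering of
    Algorithm 1). Phase 1 packs consecutive elements into blocks of mass
    <= 2/m (an element alone above 2/m becomes a singleton). Phase 2
    peels single elements off composite blocks, as in lines 12-17,
    until exactly m blocks remain."""
    assert len(p) > m
    blocks, cur, tot = [], [p[0]], p[0]
    for x in p[1:]:
        if tot + x <= 2.0 / m:        # threshold test of line 6
            cur.append(x); tot += x
        else:
            blocks.append(cur)
            cur, tot = [x], x
    blocks.append(cur)
    need = m - len(blocks)            # m - i extra blocks still to create
    out = []
    for blk in blocks:
        while need > 0 and len(blk) >= 2:   # composite: peel first element
            out.append([blk[0]])
            blk = blk[1:]
            need -= 1
        out.append(blk)
    assert need == 0 and len(out) == m
    return [sum(b) for b in out]

print(greedy_approximation([0.4, 0.3, 0.2, 0.1], 2))   # -> [0.4, 0.6...]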

Lemma 2. Let q̃ be the probability distribution defined as

$$ \tilde{q} = \left( q_1, \dots, q_{k^*}, \frac{A}{m-k^*}, \dots, \frac{A}{m-k^*} \right). $$

Then, it holds that H(q) ≥ H(q̃) − 2/(e ln 2).

Proof. We have

$$ H(\tilde{q}) = \sum_{s=1}^{k^*} q_s \log \frac{1}{q_s} + (m-k^*) \cdot \frac{A}{m-k^*} \log \frac{m-k^*}{A} = B + A \log(m-k^*) - A \log A. $$

Therefore, by using the above lower bound (5)–(8) on the entropy of q, it follows that

$$ H(\tilde{q}) - H(q) \le A \log \frac{m-k^*}{m} - A \log A + A \le -A \log A + A \le \frac{2}{e \ln 2}, $$

where the second inequality follows since A log((m − k*)/m) ≤ 0 for any k* ≥ 0, and the last inequality follows from the fact that A ∈ [0, 1] and the maximum of the function −x log x + x in the interval [0, 1] is 2/(e ln 2).

Let q* = (q*_1, ..., q*_m) be a contiguous m-aggregation of p of maximum entropy. We can use q* to compare the entropy of our greedily constructed contiguous m-aggregation q with the optimum. We prepare the following lemma.

Lemma 3. It holds that q̃ ⪯ q*; therefore H(q̃) ≥ H(q*).

Proof. Assume, w.l.o.g., that the components of q* are sorted in non-increasing order. Let p̃ = (p̃_1, ..., p̃_n) be the probability distribution obtained by reordering the components of p in non-increasing order. It is not hard to see that, by construction, we have p̃_j = q_j for each j = 1, ..., k*. Since q* is an aggregation of p, by Lemma 1 we have p ⪯ q*, which immediately implies

$$ \sum_{s=1}^{j} q_s = \sum_{s=1}^{j} \tilde{p}_s \le \sum_{s=1}^{j} q^*_s \qquad \text{for each } j = 1, \dots, k^*. \qquad (9) $$

Moreover, from the last inequality with j = k* it follows that \sum_{s=k^*+1}^{m} q^*_s ≤ 1 − \sum_{s=1}^{k^*} q_s = A. This, together with the assumption q*_1 ≥ ... ≥ q*_{k*} ≥ q*_{k*+1} ≥ ... ≥ q*_m, implies that

$$ \sum_{s=t+1}^{m} q^*_s \le \frac{m-t}{m-k^*} \, A \qquad \text{for any } t \ge k^*. \qquad (10) $$

Then, for each j = k* + 1, ..., m, we have

$$ \sum_{s=1}^{j} \tilde{q}_s = 1 - \sum_{s=j+1}^{m} \tilde{q}_s = 1 - \frac{m-j}{m-k^*} \, A \le 1 - \sum_{s=j+1}^{m} q^*_s = \sum_{s=1}^{j} q^*_s, $$

which together with (9) implies q̃ ⪯ q*. This concludes the proof of the first statement of the lemma. The second statement immediately follows by the Schur-concavity of the entropy function.

We are now ready to summarize our findings.

Theorem 1. Let q be the contiguous m-aggregation of p returned by GREEDY-APPROXIMATION, and let q* be a contiguous m-aggregation of p of maximum entropy. Then, it holds that

$$ H(q) \ge H(q^*) - \frac{2}{e \ln 2} = H(q^*) - 1.0614756\ldots $$

Proof. Directly from Lemmas 3 and 2.
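As a quick empirical complement to Theorem 1 (ours, reusing entropy, greedy_approximation and max_entropy_contiguous_aggregation from the sketches above), one can check that the additive gap between the greedy entropy and the optimum never exceeds 2/(e ln 2) ≈ 1.0615 on random instances; in practice it is typically far smaller.

import random
from math import e, log

BOUND = 2 / (e * log(2))        # = 1.0614756..., the constant of Theorem 1

worst = 0.0
for _ in range(1_000):
    n, m = 12, 4
    p = [random.random() for _ in range(n)]
    total = sum(p)
    p = [x / total for x in p]
    gap = (max_entropy_contiguous_aggregation(p, m)[0]
           - entropy(greedy_approximation(p, m)))
    worst = max(worst, gap)
    assert 0.0 <= gap <= BOUND + 1e-9   # Theorem 1 guarantee
print(worst)                     # typically well below the 1.0615 bound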
A. A slightly improved greedy approach

We can improve the approximation guarantee of Algorithm 1 by a refined greedy approach of complexity O(n + m log m). The new idea is to build the components of q in such a way that they are either not larger than 3/(2m) or coincide with some large component of p. More precisely, when building a new component of q, say q_i, the algorithm puts together consecutive components of p as long as their sum, denoted partialsum, is not larger than 1/m. If, when trying to add the next component, say p_j, the total sum becomes larger than 1/m, the following three cases are considered.

Case 1: partialsum + p_j ∈ (1/m, 3/(2m)]. In this case q_i is set to include p_j as well, hence becoming a component of q of size not larger than 3/(2m).

Case 2: partialsum + p_j > 2/m. In this case we produce up to two components of q. Precisely, if partialsum = 0, that is, p_j > 2/m, we set q_i = p_j and only one new component is created. Otherwise, q_i is set to partialsum (i.e., it is the sum of the interval up to p_{j−1}, and it is not larger than 1/m) and q_{i+1} is set equal to p_j. Notice that in this case q_{i+1} might be larger than 3/(2m), but it is a non-composite component.

Case 3: partialsum + p_j ∈ (3/(2m), 2/m]. In this case we produce one component of q, namely q_i is set to partialsum + p_j, and we mark it.

We first observe that the total number of components of q created by this procedure is not larger than m. More precisely, let k_1, k_2, k_3 be the number of components created by applications of Case 1, 2, and 3, respectively. Each component created by Case 1 has size larger than 1/m. When we apply Case 2, we create either one component of size > 2/m or two components of total sum > 2/m; altogether, the k_2 components created by Case 2 have total sum at least k_2/m. Then, since each component created by Case 3 has size larger than 3/(2m), we have

$$ k_3 \le \frac{1 - (k_1 + k_2)/m}{3/(2m)} = \frac{2}{3} (m - k_1 - k_2), $$

hence k_1 + k_2 + k_3 ≤ (2/3)m + (1/3)(k_1 + k_2), from which we get k_1 + k_2 + k_3 ≤ m − k_3/2. This means that if k_3 > 0 then the number of components created falls short of m by a quantity which is at least half of k_3. In other words, we are allowed to split at least half of the k_3 components created by Case 3, and the resulting total number of components will still be not larger than m.

In the second phase of the algorithm, the largest components created by Case 3 are split. The complete procedure is given below as Algorithm 2.

Algorithm 2: Improved approximation in O(n + m log m) time.

GREEDY-2(p_1, ..., p_n, m)
 0: // Assume n > m and an auxiliary value p_{n+1} = 1
 1: i ← 0, j ← 1
 2: while j ≤ n do
 3:   i ← i + 1, start[i] ← j, partialsum ← 0
 4:   while partialsum + p_j ≤ 1/m and j ≤ n do
 5:     partialsum ← partialsum + p_j, j ← j + 1
 6:   if j > n then
 7:     q_i ← partialsum
 8:     break while
 9:   if partialsum + p_j ∈ (1/m, 3/(2m)] then
10:     q_i ← partialsum + p_j, end[i] ← j
11:   else
12:     if partialsum + p_j > 2/m then
13:       if partialsum > 0 then
14:         q_i ← partialsum, end[i] ← j − 1, i ← i + 1
15:       q_i ← p_j, start[i] ← j, end[i] ← j
16:     else
17:       // we are left with the case partialsum + p_j ∈ (3/(2m), 2/m]
18:       q_i ← partialsum + p_j, end[i] ← j
19:       if partialsum > 0 then
20:         add index i to the list Marked-indices: mark[i] ← 1
21:   j ← j + 1
22: // At this point i counts the number of components in q
23: // If i < m we are going to split exactly m − i components, starting with the list of Marked-indices
24: k ← m − i, j ← 1
25: sort the set Q of marked components in non-increasing order
26: split the m − i largest components in Q; each split creates one component consisting of the last piece and one component made of the remaining parts. If Q contains fewer than m − i components, complete the splitting with the other composite components.

As a result of the above considerations, the final distribution q returned by the algorithm has components > 2/m which are singletons, i.e., coincide with components of p; the remaining components can be divided into two sets, the components of size > 3/(2m) and the ones of size ≤ 3/(2m), with the second set having the larger total probability mass. In formulas, we can represent the probability vector q, after reordering its components in non-increasing order, as q = (q_1, ..., q_{k*}, q_{k*+1}, ..., q_{j*}, q_{j*+1}, ..., q_m), where: (i) q_1, ..., q_{k*} are all larger than 2/m and coincide with the largest components of p; (ii) q_{k*+1}, ..., q_{j*} are all in the interval (3/(2m), 2/m]; (iii) q_{j*+1}, ..., q_m are all not larger than 3/(2m). Let us define the quantities

$$ A_1 = \sum_{s=k^*+1}^{j^*} q_s, \qquad A_2 = \sum_{s=j^*+1}^{m} q_s, \qquad B = \sum_{s=1}^{k^*} q_s \log \frac{1}{q_s}, $$

and let A = A_1 + A_2. Since the algorithm splits the largest components of size > 3/(2m), it follows that A_2 ≥ A/2. Then, proceeding as in the previous section, we have

$$ \begin{aligned} H(q) &= \sum_{s=1}^{k^*} q_s \log \frac{1}{q_s} + \sum_{s=k^*+1}^{j^*} q_s \log \frac{1}{q_s} + \sum_{s=j^*+1}^{m} q_s \log \frac{1}{q_s} && (11) \\ &\ge B + \sum_{s=k^*+1}^{j^*} q_s \log \frac{m}{2} + \sum_{s=j^*+1}^{m} q_s \log \frac{2m}{3} && (12) \\ &= B + (A_1 + A_2) \log m - A_1 - A_2 \log \frac{3}{2} && (13) \\ &\ge B + A \log m - \frac{A}{2} \log 3, && (14) \end{aligned} $$

where the last inequality holds since A_2 ≥ A/2. Proceeding as in Lemma 2 above, we obtain the following result.

Lemma 4. Let q̃ be the probability distribution defined as q̃ = (q_1, ..., q_{k*}, A/(m − k*), ..., A/(m − k*)). It holds that

$$ H(q) \ge H(\tilde{q}) - \frac{\sqrt{3}}{e \ln 2}. $$

This result, together with Lemma 3, implies the following theorem.

Theorem 2. Let q be the contiguous m-aggregation of p returned by algorithm GREEDY-2, and let q* be a contiguous m-aggregation of p of maximum entropy. Then, it holds that

$$ H(q) \ge H(q^*) - \frac{\sqrt{3}}{e \ln 2} = H(q^*) - 0.9192\ldots $$
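Finally, a Python rendering of GREEDY-2 (ours; boundary cases and the completion rule of line 26 are resolved in the most natural way consistent with the pseudocode).

def greedy_2(p, m):
    """O(n + m log m) refined greedy (our rendering of Algorithm 2).
    Blocks are packed up to 1/m; overshoots are handled by the three
    cases (Case 1: close at <= 3/(2m); Case 2: close block and/or emit
    a singleton; Case 3: close and mark as splittable). Then the largest
    marked (or, failing that, other composite) blocks are split."""
    assert len(p) > m
    blocks, marked = [], []        # blocks as lists; indices of marked blocks
    cur, tot, j, n = [], 0.0, 0, len(p)
    while j < n:
        x = p[j]
        if tot + x <= 1.0 / m:               # keep packing below 1/m
            cur.append(x); tot += x; j += 1
            continue
        if tot + x <= 1.5 / m:               # Case 1
            blocks.append(cur + [x])
        elif tot + x > 2.0 / m:              # Case 2
            if cur:
                blocks.append(cur)           # close current block (<= 1/m) ...
            blocks.append([x])               # ... and emit p_j on its own
        else:                                # Case 3: in (3/(2m), 2/m]
            if cur:
                marked.append(len(blocks))   # composite: remember as splittable
            blocks.append(cur + [x])
        j += 1
        cur, tot = [], 0.0
    if cur:
        blocks.append(cur)
    need = m - len(blocks)
    # split the largest marked blocks first, then any other composite block
    candidates = sorted(marked, key=lambda i: sum(blocks[i]), reverse=True)
    candidates += [i for i in range(len(blocks))
                   if i not in marked and len(blocks[i]) >= 2]
    to_split = set()
    for i in candidates:
        if need == 0:
            break
        to_split.add(i); need -= 1
    out = []
    for i, blk in enumerate(blocks):
        if i in to_split:
            out.append(blk[:-1])     # the remaining parts ...
            out.append([blk[-1]])    # ... and the last piece on its own
        else:
            out.append(blk)
    while len(out) < m:              # rare: peel further if still short
        i = next(k for k, b in enumerate(out) if len(b) >= 2)
        out[i:i + 1] = [out[i][:-1], [out[i][-1]]]
    return [sum(b) for b in out]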
REFERENCES

[1] F. Cicalese and U. Vaccaro, Supermodularity and subadditivity properties of the entropy on the majorization lattice, IEEE Transactions on Information Theory, vol. 48, 933–938, 2002.
[2] F. Cicalese, L. Gargano, and U. Vaccaro, H(X) vs. H(f(X)), in: Proc. ISIT 2017, pp. 51–55.
[3] F. Cicalese, L. Gargano, and U. Vaccaro, Bounds on the entropy of a function of a random variable and their applications, IEEE Transactions on Information Theory, vol. 64, 2220–2230, 2018.
[4] R. Cole and H. Karloff, Fast algorithms for constructing maximum entropy summary trees, in: Proc. of ICALP 2014, pp. 332–343, 2014.
[5] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, MIT Press, 2009.
[6] L. Faivishevsky and J. Goldberger, Nonparametric information theoretic clustering algorithm, in: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 351–358, 2010.
[7] R.M. Fano, The transmission of information, Research Laboratory of Electronics, Massachusetts Institute of Technology (MIT), Tech. Report No. 65, 1949.
[8] G. Gan, C. Ma, and J. Wu, Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability, SIAM, Philadelphia, ASA, Alexandria, VA, 2007.
[9] B.C. Geiger and R.A. Amjad, Hard clusters maximize mutual information, arXiv:1608.04872 [cs.IT], 2016.
[10] M. Kearns, Y. Mansour, and A.Y. Ng, An information-theoretic analysis of hard and soft assignment methods for clustering, in: Learning in Graphical Models, Springer Netherlands, pp. 495–520, 1998.
[11] T. Kämpke and R. Kober, Discrete signal quantizations, Pattern Recognition, vol. 32, pp. 619–634, 1999.
[12] S. Krajči, C.-F. Liu, L. Mikeš, and S.M. Moser, Performance analysis of Fano coding, in: Proc. of ISIT 2015, pp. 1746–1750.
[13] B.M. Kurkoski and H. Yagi, Quantization of binary-input discrete memoryless channels, IEEE Transactions on Information Theory, vol. 60, 4544–4552, 2014.
[14] R. Lamarche-Perrin, Y. Demazeau, and J.-M. Vincent, The best-partitions problem: How to build meaningful aggregations, in: 2013 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 309–404, 2013.
[15] A.W. Marshall, I. Olkin, and B.C. Arnold, Inequalities: Theory of Majorization and Its Applications, Springer, New York, 2009.
[16] J.L. Massey, Guessing and entropy, in: Proc. of 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, p. 204.
[17] A. Moffat and A. Turpin, Efficient construction of minimum-redundancy codes for large alphabets, IEEE Transactions on Information Theory, vol. 44, 1650–1657, 1998.
[18] D. Muresan and M. Effros, Quantization as histogram segmentation: Optimal scalar quantizer design in network systems, IEEE Transactions on Information Theory, vol. 54, 344–366, 2008.
[19] G.M. Perillo and E. Marone, Determination of optimal number of class intervals using maximum entropy, Mathematical Geology, vol. 18, 401–407, 1986.