LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

Jason Qin, Denys Kim, Yumei Tung
The AOLP Core Data Service, AOL, 22000 AOL Way, Dulles, VA 20163
E-mail: jasonqin@teamaol.com

Abstract

This paper defines LogLog-Beta (LogLog-β), a new algorithm for estimating cardinalities based on LogLog counting. The new algorithm uses only one formula and needs no additional bias corrections for the entire range of cardinalities; it is therefore more efficient and simpler to implement. Our simulations show that the accuracy provided by the new algorithm is as good as or better than the accuracy provided by either HyperLogLog or HyperLogLog++. In addition to LogLog-β we also provide another one-formula estimator for cardinalities based on order statistics, a modification of an algorithm described in Lumbroso's work.

Index Terms: Data analysis, Approximation algorithms, Data Mining

I. Introduction

Cardinality estimation of a multiset has numerous applications in data management, data analysis, and data services; see papers [2][3][7][11][14][15], etc. Over the years, many cardinality estimation algorithms have been proposed and studied, especially the ones related to probabilistic counting [1][4][5][6][8][9][10][11][13][16][17]. In this paper we present a new algorithm related to a particular probabilistic counting family called LogLog Counting.

LogLog Counting has long been studied: see the earlier ideas in Alon, Matias, and Szegedy [1] and Flajolet and Martin [8], the basic LogLog and SuperLogLog algorithms in Durand and Flajolet [6], the popular HyperLogLog in Flajolet et al. [9], and the recent improvement HyperLogLog++ in Heule et al. [11].

The HyperLogLog algorithm in Flajolet et al. [9] serves as the foundation for our study. Our new algorithm, LogLog-β, can be regarded as a modification of HyperLogLog. In this introduction we briefly present the HyperLogLog algorithm. The remainder of the paper is organized as follows: section II details the new algorithm LogLog-β, section III gives some accuracy tests on LogLog-β, and section IV provides another one-formula cardinality estimator and some further discussion of the two new algorithms.

The HyperLogLog algorithm in Flajolet et al. [9] has five major components: data randomization by a hash function, stochastic averaging and register vector generation, the raw estimation formula, the Linear Counting of Whang, Vander-Zanden, and Taylor [16], and bias corrections.

Let $S$ be a multiset of data, $M$ be a vector of size $m$ with $m = 2^p$ and $p = 4, 5, \ldots$, $h$ be a hash function whose hash value is a 32-bit array, $\rho$ be a function that maps a bit array to its number of leading zeros plus one, and $\alpha_m$ be a parameter dependent on $m$ (see [9]).

HyperLogLog Algorithm:
1) Set $M[i] = 0$ for $i = 0, 1, \ldots, m-1$.
2) For each $s \in S$, let $i$ be the integer formed by the left $p$ bits of $h(s)$ and $w$ be the right $32-p$ bits of $h(s)$; set $M[i] = \max(M[i], \rho(w))$.
3) Apply the raw formula:
$$E = \frac{\alpha_m m^2}{\sum_{i=0}^{m-1} 2^{-M[i]}} \qquad (1)$$
4) Apply Linear Counting if $E \le \frac{5}{2}m$:
$$E = m \log\left(\frac{m}{z}\right)$$
where $z$ is the number of elements of $M$ with value 0.
5) Apply correction if $E > \frac{1}{30} 2^{32}$:
$$E = -2^{32} \log\left(1 - \frac{E}{2^{32}}\right)$$

When a 64-bit hash function is used, step 5 of the above algorithm is no longer needed as long as the cardinality does not approach $2^{64}$. In this paper we use a 64-bit hash function in algorithms and simulations.

The accuracy of the cardinality estimation provided by formula (1) improves as the cardinality becomes larger (proved by Flajolet et al. [9]), but the accuracy of Linear Counting degenerates as the cardinality increases. Therefore, a loss of accuracy around the switching point $\frac{5}{2}m$, defined in [9], is inevitable.

A major improvement in HyperLogLog++ [11] (in addition to using a 64-bit hash and a sparse representation) is a bias correction table for a range of cardinalities, which boosts the accuracy around the switching point. The bias correction table and correction range are $m$-dependent and determined empirically based on the estimates offered by formula (1). In this paper HyperLogLog++ refers to the implementation without sparse representation.

The new algorithm, LogLog-β, uses only one formula and needs neither bias corrections nor Linear Counting, and therefore its implementation is simplified. Our simulations show that the accuracy provided by the new algorithm is as good as or better than the accuracy provided by either HyperLogLog or HyperLogLog++ (without sparse representation).
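The HyperLogLog steps described in this introduction can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: it uses a 64-bit hash (so the large-range correction of step 5 is omitted), BLAKE2b from the standard library stands in for MURMUR3, and the helper names `hash64`, `rho`, `build_registers`, `alpha`, and `hyperloglog` are hypothetical. The values of $\alpha_m$ are those given in [9].

```python
import hashlib
import math


def hash64(item: str) -> int:
    """64-bit hash of a string (assumption: BLAKE2b stands in for MURMUR3)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def rho(w: int, width: int) -> int:
    """Number of leading zeros of w, viewed as a width-bit array, plus one."""
    return width - w.bit_length() + 1


def build_registers(items, p: int) -> list:
    """Steps 1 and 2: fill the register vector M of size m = 2**p."""
    m = 1 << p
    low_bits = 64 - p
    M = [0] * m
    for s in items:
        x = hash64(s)
        i = x >> low_bits                  # integer formed by the left p bits
        w = x & ((1 << low_bits) - 1)      # the right 64 - p bits
        M[i] = max(M[i], rho(w, low_bits))
    return M


def alpha(m: int) -> float:
    """The parameter alpha_m as given in [9]."""
    small = {16: 0.673, 32: 0.697, 64: 0.709}
    return small.get(m, 0.7213 / (1 + 1.079 / m))


def hyperloglog(M) -> float:
    """Steps 3 and 4: raw formula (1), then Linear Counting below 5m/2."""
    m = len(M)
    E = alpha(m) * m * m / sum(2.0 ** -r for r in M)   # formula (1)
    z = M.count(0)
    if E <= 2.5 * m and z > 0:
        E = m * math.log(m / z)                        # Linear Counting
    return E
```

For example, `hyperloglog(build_registers(data, 14))` estimates the number of distinct strings in `data` with $m = 2^{14}$ registers. The guard `z > 0` is an addition of this sketch that avoids a division by zero when no register is empty.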

II. LogLog-β algorithm

In [9] Flajolet et al. proved asymptotically that the raw formula (1) of HyperLogLog has a very low standard error ($1.04/\sqrt{m}$). However, the formula fails to provide good cardinality estimation for small or pre-asymptotic cardinalities. The main idea of the new algorithm, LogLog-β, is to modify the raw formula and make it applicable to the entire range of cardinalities, from very small to very large.

A. LogLog-β algorithm

LogLog-β Algorithm:
1) Set $M[i] = 0$ for $i = 0, 1, \ldots, m-1$.
2) For each $s \in S$, let $i$ be the integer formed by the left $p$ bits of $h(s)$ and $w$ be the right $64-p$ bits of $h(s)$; set $M[i] = \max(M[i], \rho(w))$.
3) Apply the formula:
$$E = \frac{\alpha_m m (m - z)}{\beta(m, z) + \sum_{i=0}^{m-1} 2^{-M[i]}} \qquad (2)$$
where $z$ is the number of elements of $M$ with value 0 and $\beta(m, z)$ is a function of $m$ and $z$.

Clearly, formula (2) is a modification of formula (1) which replaces the numerator $\alpha_m m^2$ with $\alpha_m m (m - z)$ and adds a term $\beta(m, z)$ to the denominator. Furthermore, formulas (2) and (1) are identical when $z = 0$ and $\beta(m, 0) = 0$.

B. Function β(m, z)

The function $\beta(m, z)$ is a kind of bias minimizer and should go to 0 when $z$ goes to 0. Therefore, for very large cardinalities, formula (2) is almost the same as formula (1). The ultimate goal is to find $\beta(m, z)$ such that formula (2) yields highly accurate estimations for the entire range of cardinalities. There are many ways to choose $\beta(m, z)$, for example $\beta(m, z) = \beta_0(m) z$, or $\beta(m, z) = \beta_0(m) z + \beta_1(m) z^2 + \cdots$. For convenience of discussion, in this paper we limit $\beta(m, z)$ to the following form:
$$\beta(m, z) = \beta_0(m) z + \beta_1(m) z_l + \beta_2(m) z_l^2 + \cdots + \beta_k(m) z_l^k$$
where $z_l = \log(z + 1)$, $k \ge 0$, and $\beta_0(m), \beta_1(m), \ldots, \beta_k(m)$ are $m$-dependent constants. Clearly $\beta(m, z) = 0$ when $z = 0$. In an implementation based on Horner's rule, $\beta(m, z)$ can be evaluated with a total of $k + 1$ multiplications and $k$ additions when $z_l$ is provided.

C. Constants β_0(m), β_1(m), ..., β_k(m)

For given $m$, $k$, and a data set of cardinality $c$, under the best circumstances we expect $\beta(m, z) = \hat{\beta}(m, z)$, where
$$\hat{\beta}(m, z) = \frac{\alpha_m m (m - z)}{c} - \sum_{i=0}^{m-1} 2^{-M[i]}$$
which is formula (2) solved for $\beta(m, z)$ with $E = c$.

If we generate data sets with given cardinalities $c_1, c_2, \ldots, c_n$ (from very small to very large) and compute $z$ and $\hat{\beta}(m, z)$ for each $c_i$, then by solving the least squares problem
$$\min \left\| \beta(m, z) - \hat{\beta}(m, z) \right\|_2^2$$
we are able to determine $\beta_0(m), \beta_1(m), \ldots, \beta_k(m)$. In practice we pick cardinalities $c_i$ such that $c_1 < c_2 < \cdots < c_n$, equally spaced, with $n \gg k$ and $z = 0$ for some of the large cardinalities. Furthermore, for each given $c_i$, we compute the means of $z$ and $\hat{\beta}(m, z)$ over many generated data sets, then use the means to solve the least squares problem. In our study we used MURMUR3-64 as the hash function and java.util.Random for generating data sets with given cardinalities.

For example, for $p = 14$, $m = 2^{14}$, we computed a few $\beta(m, z)$:
$$\beta(m, z) = -0.309142 z + 13.733192 z_l - 8.636985 z_l^2 + 1.328973 z_l^3$$
$$\beta(m, z) = -0.350308 z + 3.949176 z_l - 6.403082 z_l^2 + 3.289908 z_l^3 - 0.643495 z_l^4 + 0.051568 z_l^5$$

The number of terms of $\beta(m, z)$ to use is determined by the accuracy requirement: the larger $k$ is, the better the accuracy the algorithm provides. However, we cannot reach arbitrary accuracy by simply increasing $k$: the optimal accuracy that can be achieved is dictated by $m$, the size of vector $M$. In our study we found 3 to 7 to be a reasonable range for $k$.

III. Accuracy tests of LogLog-β

As long as $\beta(m, z)$ is properly determined, the LogLog-β algorithm can provide estimations for different ranges of cardinalities with accuracy comparable to HyperLogLog++. Namely, formula (2) can perform as well as Linear Counting for small cardinalities and as well as HyperLogLog Raw with bias correction for large and very large cardinalities.

For completeness, we should determine $\beta(m, z)$ and perform accuracy tests for each precision $p$ ($m = 2^p$). However, since the processes are quite similar, this paper only shows the tests for precision $p = 14$ ($m = 2^{14} = 16384$). For very large cardinalities the estimations by HyperLogLog, LogLog-β, and HyperLogLog++ are virtually the same; therefore cardinalities are limited to 200,000 in the accuracy tests.

To determine $\beta(m, z)$ for $p = 14$, we computed the means of $z$ and $\hat{\beta}(m, z)$ over 100 (the more the better) generated data sets per cardinality, for cardinalities from 1,000 to 170,000 in increments of 1,000. With $k = 7$ (8 terms) the function $\beta(2^{14}, z)$ is
$$-0.370393914 z + 0.070471823 z_l + 0.17393686 z_l^2 + 0.16339839 z_l^3 - 0.09237745 z_l^4 + 0.03738027 z_l^5 - 0.005384159 z_l^6 + 0.00042419 z_l^7$$

and it is used for $p = 14$ in all the accuracy tests in this study.

Using the above $\beta(m, z)$ with any of the hash functions MURMUR3, MD5, and SHA, we found the test results of LogLog-β to be almost the same. Specifically, a function $\beta(m, z)$ calculated using one hash function can be used in LogLog-β with any other hash function, as long as the hash values are uniformly distributed and well randomized. We chose MURMUR3 for this study because of its performance.

Tests on both the mean of relative errors and the mean of absolute values of relative errors for 500 generated datasets per cardinality show that LogLog-β provides slightly better estimations than HyperLogLog and HyperLogLog++, especially for the cardinalities in the range of 10,000 to 85,000; see Fig. 1 and Fig. 2. Please note that LogLog-β has better performance in terms of accuracy and stability than Linear Counting for almost all the small to mid-range cardinalities. Figure 3 shows the relative error of one generated dataset per cardinality.

Fig. 1. The mean of relative errors of cardinality estimations for 500 generated datasets per cardinality (the x-axis). Tested cardinalities are from 500 to 200,000.

Fig. 2. The mean of abs(relative errors) of cardinality estimations for 500 generated datasets per cardinality (the x-axis). Tested cardinalities are from 500 to 200,000.

Fig. 3. The relative error of the cardinality estimation of one generated dataset per cardinality (the x-axis). Tested cardinalities are from 500 to 200,000.

Figures 4, 5, and 6 show the empirical histograms of cardinality estimations for 500 generated datasets per cardinality, with cardinality being 1,000; 50,000; and 100,000 respectively. HyperLogLog and HyperLogLog++ use the same formulas in Fig. 4 and Fig. 6: Linear Counting for cardinality 1,000 and formula (1) (with a minor bias correction for HyperLogLog++) for cardinality 100,000. Therefore, the histograms corresponding to HyperLogLog and HyperLogLog++ are almost identical in Fig. 4 and Fig. 6. In both figures LogLog-β shows comparable or slightly better behavior. In Figure 5 HyperLogLog, HyperLogLog++, and LogLog-β show different behaviors, since HyperLogLog uses formula (1) and HyperLogLog++ uses both formula (1) and bias correction.

Fig. 4. The histogram of the cardinality estimations of 500 generated datasets for cardinality 1,000.
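Formula (2) with the $k = 7$ function $\beta(2^{14}, z)$ reported in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the code used for the tests: BLAKE2b stands in for MURMUR3, the coefficient signs and decimal points are as read from the text, $\alpha_m$ is taken from [9], and the names `hash64`, `build_registers`, `beta`, and `loglog_beta` are hypothetical.

```python
import hashlib
import math

P = 14
M_SIZE = 1 << P                               # m = 2**14 = 16384
ALPHA = 0.7213 / (1 + 1.079 / M_SIZE)         # alpha_m from [9], valid for m >= 128

# beta_0 .. beta_7 for p = 14, as read from the k = 7 function in the text.
BETA_P14 = (-0.370393914, 0.070471823, 0.17393686, 0.16339839,
            -0.09237745, 0.03738027, -0.005384159, 0.00042419)


def hash64(item: str) -> int:
    """64-bit hash of a string (assumption: BLAKE2b stands in for MURMUR3)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_registers(items) -> list:
    """Steps 1 and 2 of the LogLog-beta algorithm for p = 14."""
    low_bits = 64 - P
    M = [0] * M_SIZE
    for s in items:
        x = hash64(s)
        i = x >> low_bits                     # left p bits
        w = x & ((1 << low_bits) - 1)         # right 64 - p bits
        M[i] = max(M[i], low_bits - w.bit_length() + 1)   # rho(w)
    return M


def beta(z: int, coeffs=BETA_P14) -> float:
    """beta(m, z) = b0*z + b1*zl + ... + bk*zl**k with zl = log(z + 1)."""
    zl = math.log(z + 1)
    acc = 0.0
    for b in reversed(coeffs[1:]):            # Horner's rule on the zl polynomial
        acc = (acc + b) * zl
    return coeffs[0] * z + acc


def loglog_beta(M) -> float:
    """Step 3: formula (2), one formula for the whole cardinality range."""
    if len(M) != M_SIZE:
        raise ValueError("these coefficients are only valid for p = 14")
    z = M.count(0)
    return ALPHA * M_SIZE * (M_SIZE - z) / (beta(z) + sum(2.0 ** -r for r in M))
```

Calling `loglog_beta(build_registers(data))` returns the estimate for the strings in `data`. No switch between Linear Counting and the raw formula appears anywhere in the sketch, which is the simplification the algorithm aims for; a different precision $p$ would need its own fitted coefficients.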

Fig. 5. The histogram of the cardinality estimations of 500 generated datasets for cardinality 50,000.

Fig. 6. The histogram of the cardinality estimations of 500 generated datasets for cardinality 100,000.

IV. More discussions

Formula (2) of the LogLog-β algorithm makes two major changes compared to HyperLogLog Raw (1), and these two changes make the formula capable of covering the entire range of cardinalities with comparable or better accuracy. Replacing $m^2$ with $m(m - z)$ makes no difference for very large cardinalities but has a significant impact on the estimates of small and mid-range cardinalities: it forces the estimates to roughly align with the real cardinalities. The other change, adding a term $\beta(m, z)$ to the denominator, further corrects the remaining error/bias. The combination of the two changes makes formula (2) work more accurately for the entire range of cardinalities. Both changes are essential, and the second change could have many variations and forms.

The idea of replacing $m^2$ by $m(m - z)$ can be applied to other cardinality estimation algorithms as well. For example, we apply this idea to an unbiased optimal cardinality estimation algorithm proposed and studied in Lumbroso [13]. That algorithm, like HyperLogLog, performs well for very large cardinalities but depends on Linear Counting and bias corrections for small and pre-asymptotic cardinalities. The core algorithm in Lumbroso [13], without Linear Counting and bias correction, is described as follows.

Let $S$, $M$, $m$, and $p$ be the same as described in the HyperLogLog algorithm, but let $h$ be a hash function with hash values in the interval $(0, 1)$.

Lumbroso's Core Algorithm:
1) Set $M[i] = 1$ for $i = 1, \ldots, m$.
2) For each $s \in S$, set $y = h(s)$, $i = \lfloor m y \rfloor + 1$, and $M[i] = \min(M[i], m y - \lfloor m y \rfloor)$.
3) Apply the core formula:
$$E = \frac{m(m - 1)}{\sum_{i=1}^{m} M[i]} \qquad (3)$$

It is proved by Lumbroso [13] that the above algorithm yields optimal estimation for very large cardinalities. To make it work for small or mid-range cardinalities we substitute $m - 1$ in formula (3) with $m - z$, which results in a new formula:
$$E = \frac{m(m - z)}{\sum_{i=1}^{m} M[i]} \qquad (4)$$
where $z$ is the number of $M[i]$ with value 1. There is no need to add a correction term to the denominator in this case. The estimations provided by formula (4) are strikingly accurate for all ranges of cardinalities, especially for the small and mid-range cardinalities. In Figure 7 we show some test results on this new algorithm, called MMV (Mean of Minimum Values).

Formula (2) of LogLog-β and formula (4) of MMV, derived from different approaches (counting vs. ordering), are deeply related (a detailed study is in a separate paper). The term $2^{-M[i]}$ of (2) could be regarded as an approximation of the term $M[i + 1]$ of (4). For a given $m$ both algorithms provide comparable accuracies, but LogLog-β requires less memory.

In terms of implementation and computational complexity, LogLog-β offers further advantages: one formula for the entire range of cardinalities, a much simpler process flow, and no bias correction or lookup tables. Furthermore, for any existing implementation of HyperLogLog or HyperLogLog++ (without sparse representation), LogLog-β requires no regeneration of the vectors $M$ and can be applied directly to the existing vectors. Therefore, switching to LogLog-β requires very little effort and produces comparable or better results.
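The MMV estimator of formula (4) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation behind Figure 7: a 64-bit BLAKE2b hash divided by $2^{64}$ plays the role of the hash into the unit interval, registers are indexed from 0 rather than 1, and the names `hash64` and `mmv_estimate` are hypothetical. The quantities $\lfloor m y \rfloor$ and $m y - \lfloor m y \rfloor$ are computed with integer arithmetic on the hash bits to avoid floating-point rounding at the interval boundaries.

```python
import hashlib


def hash64(item: str) -> int:
    """64-bit hash of a string (assumption: any well-mixed 64-bit hash works)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def mmv_estimate(items, p: int = 14) -> float:
    """Mean of Minimum Values: formula (4), m * (m - z) / sum(M)."""
    m = 1 << p
    low_bits = 64 - p
    scale = 1 << low_bits
    M = [1.0] * m                             # step 1: every register starts at 1
    for s in items:
        x = hash64(s)                         # y = x / 2**64 lies in [0, 1)
        i = x >> low_bits                     # floor(m * y), used as a 0-based index
        frac = (x & (scale - 1)) / scale      # m * y - floor(m * y), in [0, 1)
        M[i] = min(M[i], frac)                # step 2: keep the minimum per register
    z = M.count(1.0)                          # registers never updated
    return m * (m - z) / sum(M)               # formula (4)
```

For example, `mmv_estimate(data)` estimates the number of distinct strings in `data` with $m = 2^{14}$ registers. Each register here holds a floating-point minimum rather than a small integer, which reflects the remark above that LogLog-β requires less memory for a given $m$.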

Fig. 7. Cardinality estimations of 500 generated datasets for each cardinality from 500 to 200,000.

References

[1] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments, in Journal of Computer and System Sciences 58 (1999), 137-147.
[2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua approximate query answering system, in SIGMOD, pages 574-576, 1999.
[3] P. Brown, P. J. Haas, J. Myllymaki, H. Pirahesh, B. Reinwald, and Y. Sismanis. Toward automated large-scale information integration and discovery, in Data Management in a Connected World, pages 161-180, 2005.
[4] K. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations, in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 199-210. ACM, 2007.
[5] R. Cohen, L. Katzir, and A. Yehezkel. A unified scheme for generalizing cardinality estimators to sum aggregation, in Inf. Process. Lett. (2014).
[6] M. Durand and P. Flajolet. LogLog counting of large cardinalities, in Annual European Symposium on Algorithms (ESA03) (2003), G. Di Battista and U. Zwick, Eds., vol. 2832 of Lecture Notes in Computer Science, pp. 605-617.
[7] C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high-speed links, in IEEE/ACM Trans. Netw., 14(5):925-937, Oct. 2006.
[8] P. Flajolet and G. Martin. Probabilistic counting algorithms for data base applications, in Journal of Computer and System Sciences, 31(2):182-209, 1985.
[9] P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm, in Analysis of Algorithms (AOFA), pages 127-146, 2007.
[10] S. Ganguly. Counting distinct items over update streams, in Theoretical Computer Science, 378(3):211-222, 2007.
[11] S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm, Google, Inc.
[12] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem, in PODS, pages 41-52, 2010.
[13] J. Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis, in DMTCS Proceedings 01 (2010), 489-504.
[14] C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs, in SIGKDD, pages 81-90, 2002.
[15] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system, in SIGMOD, pages 23-34, 1979.
[16] K.-Y. Whang, B. Vander-Zanden, and H. Taylor. A linear-time probabilistic counting algorithm for database applications, in ACM Transactions on Database Systems 15, 2 (1990), 208-229.
[17] Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream, in Randomization and Approximation Techniques in Computer Science, pages 1-10. Springer, 2002.