LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

Similar documents
13.2 Fully Polynomial Randomized Approximation Scheme for Permanent of Random 0-1 Matrices

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

A Simple Regression Problem

Estimating Entropy and Entropy Norm on Data Streams

Interactive Markov Models of Evolutionary Algorithms

Non-Parametric Non-Line-of-Sight Identification 1

arxiv: v3 [cs.ds] 22 Mar 2016

Ensemble Based on Data Envelopment Analysis

Block designs and statistics

DEPARTMENT OF ECONOMETRICS AND BUSINESS STATISTICS

Upper bound on false alarm rate for landmine detection and classification using syntactic pattern recognition

Analysis of an algorithm catching elephants on the Internet

On Constant Power Water-filling

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

arxiv: v1 [cs.ds] 3 Feb 2014

Quantum algorithms (CO 781, Winter 2008) Prof. Andrew Childs, University of Waterloo LECTURE 15: Unstructured search and spatial search

COS 424: Interacting with Data. Written Exercises

Fast Montgomery-like Square Root Computation over GF(2 m ) for All Trinomials

Bipartite subgraphs and the smallest eigenvalue

Birthday Paradox Calculations and Approximation

The proofs of Theorem 1-3 are along the lines of Wied and Galeano (2013).

Randomized Recovery for Boolean Compressed Sensing

About the definition of parameters and regimes of active two-port networks with variable loads on the basis of projective geometry

Bloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004.

A MESHSIZE BOOSTING ALGORITHM IN KERNEL DENSITY ESTIMATION

Hybrid System Identification: An SDP Approach

ASSUME a source over an alphabet size m, from which a sequence of n independent samples are drawn. The classical

Compression and Predictive Distributions for Large Alphabet i.i.d and Markov models

Finite fields. and we ve used it in various examples and homework problems. In these notes I will introduce more finite fields

2 Q 10. Likewise, in case of multiple particles, the corresponding density in 2 must be averaged over all

Polygonal Designs: Existence and Construction

Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

Comparison of Stability of Selected Numerical Methods for Solving Stiff Semi- Linear Differential Equations

Analysis of Polynomial & Rational Functions ( summary )

Support Vector Machine Classification of Uncertain and Imbalanced data using Robust Optimization

MSEC MODELING OF DEGRADATION PROCESSES TO OBTAIN AN OPTIMAL SOLUTION FOR MAINTENANCE AND PERFORMANCE

ESTIMATING AND FORMING CONFIDENCE INTERVALS FOR EXTREMA OF RANDOM POLYNOMIALS. A Thesis. Presented to. The Faculty of the Department of Mathematics

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair

Combining Classifiers

Testing Properties of Collections of Distributions

The Weierstrass Approximation Theorem

Ştefan ŞTEFĂNESCU * is the minimum global value for the function h (x)

An improved self-adaptive harmony search algorithm for joint replenishment problems

A note on the multiplication of sparse matrices

An Adaptive UKF Algorithm for the State and Parameter Estimations of a Mobile Robot

An Approximate Model for the Theoretical Prediction of the Velocity Increase in the Intermediate Ballistics Period

Symbolic Analysis as Universal Tool for Deriving Properties of Non-linear Algorithms Case study of EM Algorithm

time time δ jobs jobs

Sharp Time Data Tradeoffs for Linear Inverse Problems

a a a a a a a m a b a b

Statistical Logic Cell Delay Analysis Using a Current-based Model

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Keywords: Estimator, Bias, Mean-squared error, normality, generalized Pareto distribution

Efficient Filter Banks And Interpolators

REDUCTION OF FINITE ELEMENT MODELS BY PARAMETER IDENTIFICATION

In this chapter, we consider several graph-theoretic and probabilistic models

Optical Properties of Plasmas of High-Z Elements

E0 370 Statistical Learning Theory Lecture 6 (Aug 30, 2011) Margin Analysis

A method to determine relative stroke detection efficiencies from multiplicity distributions

1 Estimating Frequency Moments in Streams

EMPIRICAL COMPLEXITY ANALYSIS OF A MILP-APPROACH FOR OPTIMIZATION OF HYBRID SYSTEMS

Ph 20.3 Numerical Solution of Ordinary Differential Equations

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Generalized Permanent Estimator and its Application in Computing Multi- Homogeneous Bézout Number

Probability Distributions

Computable Shell Decomposition Bounds

arxiv: v3 [cs.lg] 7 Jan 2016

Reducing Vibration and Providing Robustness with Multi-Input Shapers

arxiv: v1 [stat.ml] 31 Jan 2018

Convex Programming for Scheduling Unrelated Parallel Machines

The accelerated expansion of the universe is explained by quantum field theory.

Ch 12: Variations on Backpropagation

Handout 7. and Pr [M(x) = χ L (x) M(x) =? ] = 1.

Feature Extraction Techniques

List Scheduling and LPT Oliver Braun (09/05/2017)

Boosting with log-loss

IN modern society that various systems have become more

ANALYSIS OF HALL-EFFECT THRUSTERS AND ION ENGINES FOR EARTH-TO-MOON TRANSFER

Pattern Recognition and Machine Learning. Artificial Neural networks

Effective joint probabilistic data association using maximum a posteriori estimates of target states

3.8 Three Types of Convergence

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Lecture 3 Sept. 4, 2014

Genetic Quantum Algorithm and its Application to Combinatorial Optimization Problem

Support recovery in compressed sensing: An estimation theoretic approach

Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations

CHAPTER 19: Single-Loop IMC Control

A Smoothed Boosting Algorithm Using Probabilistic Output Codes

STOPPING SIMULATED PATHS EARLY

Generalized Alignment Chain: Improved Converse Results for Index Coding

1 Proof of learning bounds

The Transactional Nature of Quantum Information

Computable Shell Decomposition Bounds

Grafting: Fast, Incremental Feature Selection by Gradient Descent in Function Space

Optimal Resource Allocation in Multicast Device-to-Device Communications Underlaying LTE Networks

Research Article On the Isolated Vertices and Connectivity in Random Intersection Graphs

arxiv: v1 [math.nt] 14 Sep 2014

A nonstandard cubic equation

Impact of Imperfect Channel State Information on ARQ Schemes over Rayleigh Fading Channels

Recovering Data from Underdetermined Quadratic Measurements (CS 229a Project: Final Writeup)

Transcription:

LogLog-Beta and More: A New Algorith for Cardinality Estiation Based on LogLog Counting Jason Qin, Denys Ki, Yuei Tung The AOLP Core Data Service, AOL, 22000 AOL Way Dulles, VA 20163 E-ail: jasonqin@teaaolco Abstract The inforation presented in this paper defines LogLog-Beta (LogLog-β) LogLog-β is a new algorith for estiating cardinalities based on LogLog counting The new algorith uses only one forula and needs no additional bias corrections for the entire range of cardinalities, therefore, it is ore efficient and sipler to ipleent Our siulations show that the accuracy provided by the new algorith is as good as or better than the accuracy provided by either of HyperLogLog or HyperLogLog++ In addition to LogLog-β we also provide another one-forula estiator for cardinalities based on order statistics, a odification of an algorith described in Lubroso s work Index Ters Data analysis, Approxiation algoriths, Data Mining I Introduction Cardinality estiation of a ultiset has nuerous applications in data anageent, data analysis, and data services, see papers [2][3][7][11][14][15], etc Over the years, any cardinality estiation algoriths have been proposed and studied, especially the ones related to probabilistic counting [1][4][5][6][8][9][11][10][13][16][17] In this paper we present a new algorith related to a particular probabilistic counting faily called LogLog Counting LogLog Counting has been long studied: see the earlier ideas in Alon and Matias and Szegedy[1] and Flajolet[8], the basic LogLog and SuperLogLog algoriths in Duran and Flajolet[6], the popular HyperLogLog in Flajolet et al [9], and the recent iproveent HyperLogLog++ in Heule et al [11] The HyperLoglog algorith in Flajolet et al [9] serves as the foundation for our study Our new algorith, LogLog-β, can be regarded as a odification of the HyperLogLog In this introduction we will briefly present the HyperLogLog algorith The reainder of the paper is organized as follows: section II details the new algorith LogLog-β, section III gives soe accuracy tests on LogLog-β, and section IV provides another one-forula only cardinality estiator and soe further discussions on the two new algoriths HyperLogLog algorith in Flajolet et al [9] has five ajor coponents: data randoization by a hash function, stochastic averaging and register vector generation, the raw estiation forula, the Linear Counting in Whang and Vander-Zanden and Taylor [16], and bias corrections Let S be a ultiset of data, M be a vector of size with = 2 p with p = 4, 5,, h be a hash function where the hash value is a 32 -bit array, let ρ be a function that aps a bit array to its nuber of leading zeros plus one, and α be a paraeter dependent of (see [9]) HyperLogLog Algorith: M [i] = 0 for i = 0, 1,, 1 For each s S, let i be the integer fored by the left p bits of h(s) and w be the right 32 p bits of h (s), set M [i] = ax (M[i], ρ (w)) 3) Apply the raw forula: α2 1 2 M[i] (1) 4) Apply Linear Counting if : log z where z is the nuber of eleents of M with value 0 5) Apply correction if E > 1 32 302 : 2 32 log (1 E ) 2 32 E 2 5 When using a 64-bit hash function, step 5 in above algorith is no longer needed as long as the cardinality does not near 2 64 In this paper we use 64-bit hash function in algoriths and siulations The accuracy of cardinality estiation provided by forula (1) gets better when cardinality becoes larger (proved by Flajolet et al [9]), but the accuracy fro Linear Counting degenerates as cardinality increases Therefore, loss 5 of accuracy around switching point 2, defined in [9], is inevitable A ajor iproveent in HyperLogLog++[11] (in addition to using 64-bit hash and sparse representation) is provided a bias correction table for a range of cardinalities to boost the accuracy around the switching point The bias correction table and correction range are dependent and deterined epirically based on the estiates offered by forula (1) In this paper HyperLogLog++ refers to the ipleentation without sparse representation The new algorith, LogLog-β, uses only one forula and needs neither bias corrections nor Linear Counting, and therefore ipleentation is siplified Our siulations show the accuracy provided by the new algorith is as good as or better than the accuracy provided by either of HyperLogLog or HyperLogLog++ (without sparse representation) 1

II LogLog-β algorith In [9] Flajolet et al proved asyptotically that the raw forula (1) of HyperLogLog has a very low standard error ( 104 ) However the forula fails to provide good cardinality estiation for sall or pre-asyptotic cardinalities The ain idea of the new algorith, LogLog-β, is to odify the raw forula and to ake it applicable to the entire range of cardinalities, fro very sall to very large A LogLog-β algorith LogLog-β Algorith: Set M [i] = 0 for i = 0, 1,, 1 For each s S, let i be the integer fored by the left p bits of h(s) and w be the right 64-p bits of h (s), set M [i] = ax (M[i], ρ (w) ) 3) Apply the forula: α( z) (2) β(,z)+ 1 2 M[i] where z is the nuber of eleents of M with value 0 and β(, z) is a function of and z Clearly, forula (2) is a odification of forula (1) which replaces the nuerator α 2 with α ( z ) and adds a ter β(, z ) to the denoinator Furtherore, forula (2) and (1) are identical when z = 0 and β(, 0 ) = 0 B Function β(, z ) The function β(, z ) is a kind of bias iniizer and should go to 0 when z goes to 0 Therefore, for very large cardinalities, forula (2) is alost the sae as forula (1) The ultiate goal is to find β(, z ) such that forula (2) could yield highly accurate estiations for the entire range of cardinalities There are any ways to choose β (, z), for exaple β(, z ) = β 0 ()z, or β(, z ) = β 0 ()z + β ()z 2 1 + For the sake of convenience of discussion in this paper we liit β(, z ) to the following for: β(, z ) = β 0 ()z + β 1 ()z l + β 2 ()z l2 + + β k ()z k l where z l = l og (z + 1 ), k 0, and β 0 (), β 1 (),, β k () are dependent constants Clearly β(, z ) = 0 when z = 0 An ipleentation, based on Horner's rule, β(, z ) could be evaluated by a total of ( k + 1 ) ultiplications and k additions when is provided z l C Constants β 0 (), β 1 (),, β k () For given, k, and a data set of cardinality c, under the best circustances we expect β(, z ) = βˆ (, z ) where α ( z) βˆ (, z ) = c 1 2 M[i] If we generate data sets with given cardinalities c 1, c 2,, and c n (fro very sall to very large) and copute z and βˆ (, z ) for each c i, then by solving a least square proble in β(, z) βˆ (, z) 2 2 we shall be able to deterine β 0 (), β 1 (),, β k () In practice we pick cardinalities c i such that c 1 < c 2 < < c n, equally distanced, with n k and z = 0 for soe of the large cardinalities Furtherore for each given c i, we copute the eans of z and βˆ (, z) over any generated data sets, then use the eans to solve the least square proble In our study we used MURMUR3-64 as the hash function and javautilrando for generating data sets with given cardinalities For exaple for p = 14, = 2 14, we coputed a few β (, z) : β(, z ) = 0 309142z + 13733192z l 8636985z 328973z l 2 + 1 l 3 β(, z ) = 0 350308z + 3949176z l 6403082zl 2 + 3289908z 643495z 051568z l 3 0 l 4 + 0 l 5 The nuber of ters of β (, z) should be used is deterined by accuracy requireent: the larger k the better accuracy the algorith provides However, we cannot reach arbitrary accuracy by siply increasing k : the optial accuracy that can be achieved is dictated by, the size of vector M In our study we found 3 to 7 to be a reasonable range for k III Accuracy tests of LogLog-β As long as β (, z) is properly deterined, the LogLog-β algorith can provide estiation for different ranges of cardinalities with accuracy coparable to HyperLogLog++ Naely, forula (2) can perfor as well as Linear Counting for sall cardinalities and HyperLogLog Raw with bias correction for large and very large cardinalities For copleteness, we should deterine β (, z) and perfor accuracy tests for each precision p ( = 2 p ) However, since the processes are quite siilar, this paper only shows the tests for precision p = 14 ( = 2 14 = 16384 ) For very large cardinalities the estiations by HyperLogLog, LogLog-β, and HyperLogLog++ are literally the sae therefore cardinalities are liited to 200, 000 in accuracy tests To deterine β (, z) for p = 14, we coputed the eans of z and βˆ (, z ) over 100 (the ore the better) generated data sets per cardinality for cardinalities fro 1, 000 to 170, 000 in increents of 1, 000 With k = 7 ( 8 ters) the function β(2 14, z ) is 0 370393914z + 0070471823zl + 017393686z 16339839z l 2 + 0 l 3 009237745z 03738027z l 4 + 0 l 5 0005384159z 00042419z l 6 + 0 l 7 2

and it is used for p = 14 in all the accuracy tests in this study By using above β (, z) and any of the hash functions MURMUR3, MD5, and SHA, we found the test results of LogLog-β to be alost the sae Specifically, the function β (, z) was calculated using one hash function could be used in LogLog-β in place of any other hash function as long as their hash values are uniforly distributed and well randoized We chose MMURMUR3 for this study because of its perforance Both tests on the ean of relative errors and the ean of absolute values of relative errors for 500 generated datasets per cardinality show that LogLog-β provides slightly better estiations than HyperLogLog and HyperLogLog++, especially for the cardinalities in the range of 10, 000 to 85, 000, see Fig 1 and Fig 2 Please note that LogLog-β has better perforance in ters of accuracy and stability over Linear Counting for alost all the sall to id-range cardinalities Figure 3 shows the relative error of one generated dataset per cardinality Fig 2 The ean of abs(relative errors) of cardinality estiations for 500 generated datasets per cardinality (the x -axis) Tested cardinalities are fro 500 to 200, 000 Fig1 The ean of relative errors of cardinality estiations for 500 generated datasets per cardinality (the x -axis) Tested cardinalities are fro 500 to 200, 000 Fig 3 The relative error of the cardinality estiation of one generated dataset per cardinality (the x -axis) Tested cardinalities are fro 500 to 200, 000 Figure 4, 5, and 6 show the epirical histogras of cardinality estiations for 500 generated datasets per cardinality with cardinality being 1,000; 50,000; and 100,000 respectively Both HyperLogLog and HyperLogLog++ use the sae forulas in Fig 4 and 6: Linear Counting for cardinality = 1, 000 and forula (1) (with a inor bias correction for HyperLogLog++) for cardinality= 100, 000 Therefore, the histogras corresponding to HyperLogLog and HyperLogLog++ are alost identical in Fig 4 and 6 In both figures LogLog-β shows coparable or slightly better behaviors In Figure 5 HyperLogLog, HyperLogLog++, and LogLog-β show different behaviors since HyperLogLog uses forula (1) and HyperLogLog++ uses both forula (1) and bias correction Fig 4 The histogra of the cardinality estiations of generated datasets for cardinality 1, 000 500 3

Fig 5 The histogra of the cardinality estiations of 500 generated datasets for cardinality 50, 000 IV More discussions The forula (2) of algorith LogLog-β requires two ajor changes copared to HyperLogLog Raw (1), and these two changes ake the forula capable of covering the entire range of cardinalities with coparable or better accuracy Replacing 2 with ( z ), akes no difference for the very large cardinalities but has significant ipact on the estiates of sall and id-range cardinalities It forces the estiates to roughly align with the real cardinalities The other change, adding a ter β(, z) to the denoinator, further corrects the reaining error/bias The cobination of the two changes akes forula (2) work ore accurately for the entire range of cardinalities Both changes are essential, and the second change could have any variations and fors requires no regeneration of vectors M and it can be applied directly to the existing vectors Therefore, switching to LogLog-β requires a very sall effort and produces coparable or better results The idea of replacing 2 by ( z ) can be applied to other cardinality estiation algoriths as well For exaple, we apply this idea to an unbiased optial cardinality estiation algorith proposed and studied in Lubroso[13] The algorith, like HyperLogLog, perfors well for very large cardinalities but depends on Linear Counting and bias corrections for sall and pre-asyptotical cardinalities The core algorith in Lubroso[13], without Linear Counting and bias correction, is described as follows: Let S, M,, and p be the sae as described in HyperLogLog algorith, but h be a hash function with hash value in interval ( 0, 1 ), Lubroso's Core Algorith: Set M [i] = 1 for i = 1,, For each s S, set y = h(s), i = y + 1, and M [i] = in (M[i], y y ) 3) Apply the core forula: ( 1) i=1 M[i] It is proved by Lubroso[13] that the above algorith yields optial estiation for very large cardinalities To ake it work for sall or id-range cardinalities we substitute 1 in forula (3) with z which results in a new forula: (3) ( z) M[i] i=1 (4) Fig 6 The histogra of the cardinality estiations of generated datasets for cardinality 100, 000 500 where z is the nuber of M [i] with value 1 There is no need to add a correction ter to the denoinator in this case The estiations provided by forula (4) are strikingly accurate for all ranges of cardinalities, especially for the sall and id-range cardinalities In Figure 7 we show soe test results on this new algorith, called MMV (Mean of Miniu Values) Forula (2) of LogLog-β and forula (4) of MMV, derived fro different approaches (counting vs ordering), are deeply related (detailed study is in a separate paper) The ter 2 M[i] of (2) could be regarded as an approxiation of the ter M [i + 1 ] of (4) For a given both algoriths provide coparable accuracies, but LogLog-β requires less eory In ters of ipleentation and coputation coplexity, LogLog-β offers ore advantages: one forula for the entire range of cardinalities, a uch sipler process flow, and no bias correction and lookup tables Furtherore, for any existing ipleentations of HyperLogLog or HyperLogLog++ (without sparse representation), LogLog-β 4

Fig 7 Cardinality estiations of 500 generated datasets for each cardinality fro 500 to 200, 000 References [1] N Alon, Y Matias, and M Szegedy, The space coplexity of approxiating the frequency oents, in Journal of Coputer and Syste Sciences 58 (1999), 137-147 [2] S Acharya, P B Gibbons, V Poosala, and S Raasway The aqua approxiate query answering syste, in SIGMOD, pages 574-576, 1999 [3] P Brown, P J Haas, J Myllyaki, H Pirahesh, B Reinwald, and Y Sisanis Toward autoated large-scale inforation integration and discovery, in Data Manageent in a Connected World, pages 161-180, 2005 [4] K Beyer, P J Haas, B Reinwald, Y Sisanis, and R Geulla On synopses for distinct-value estiation under ultiset operations, in Proceedings of the 2007 ACM SIGMOD international conference on Manageent of data, pages 199-210 ACM, 2007 [5] R Cohen, L Katzir, A Yehezkel A Unified schee for generalizing cardinality estiators to su aggregation, in Inf Process Lett(2014) [6] M Durand and P Flajolet, LogLog counting of large cardinalities, in Annual European Syposiu on Algoriths (ESA03) (2003), G Di Battista and U Zwick, Eds, vol 2832 of Lecture Notes in Coputer Science, pp 605-617 [7] C Estan, G Varghese, and M Fisk Bitap algoriths for counting active flows on high-speed links, in IEEE/ACM Trans Netw, 14(5):925-937, Oct 2006 [8] P Flajolet and G Martin Probabilistic counting algoriths for data base applications, in Journal of Coputer and Syste Sciences, 31(2):182-209, 1985 [9] P Flajolet, É Fusy, O Gandouet, and F Meunier Hyperloglog: The analysis of a near-optial cardinality estiation algorith, in Analysis of Algoriths (AOFA), pages 127-146, 2007 [10] S Ganguly Counting distinct ites over update streas, in Theoretical Coputer Science, 378(3):211-222, 2007 [11] S Heule, M Nunkesser, and A Hall HyperLogLog in Practice: Algorithic Engineering of a State of The Art Cardinality Estiation Algorith, Google, Inc [12] D M Kane, J Nelson, and D P Woodruff An optial algorith for the distinct eleents proble, in PODS, pages 41-52, 2010 [13] J Lubroso An optial cardinality estiation algorith based on order statistics and its full analysis, in DMTCS Proceedings 01 (2010), 489-504 [14] C R Paler, P B Gibbons, and C Faloutsos Anf: a fast and scalable tool for data ining in assive graphs, in SIGKDD, pages 81-90, 2002 [15] P G Selinger, M M Astrahan, D D Chaberlin, R A Lorie, and T G Price Access path selection in a relational database anageent syste, in SIGMOD, pages 23-34, 1979 [16] K-Y Whang, B Vander-Zanden, and H Taylor A linear-tie probabilistic counting algorith for database applications, in ACM Transactions on Database Systes 15, 2 (1990), 208-229 [17] Z Bar-Yossef, T Jayra, R Kuar, D Sivakuar, and L Trevisan Counting distinct eleents in a data strea, in Randoization and Approxiation Techniques in Coputer Science, pages 1-10 Springer, 2002 5