Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics


1 Beyond Distinct Counting: LogLog Composable Sketches of frequency statistics Edith Cohen Google Research Tel Aviv University TOCA-SV 5/12/2017

2 Un-aggregated data. A data element $e \in D$ has a key and a value: $(e.\mathrm{key}, e.\mathrm{value})$. Weight (frequency) of a key x (the aggregated view): $w_x = \sum_{e : e.\mathrm{key} = x} e.\mathrm{value}$. W is the density of key frequencies. We will also use the max: $m_x = \max_{e : e.\mathrm{key} = x} e.\mathrm{value}$. f-statistics: $f(W) = \sum_x f(w_x)$.

3 f-statistics: $f(W) = \sum_x f(w_x)$.
Distinct: $f(w) = 1_{\{w > 0\}}$ (number of distinct keys). Sum: $f(w) = w$. Frequency moments: $f(w) = w^p$. Cap: $f(w) = \min(T, w)$. Complement Laplace transform: $f(w) = 1 - e^{-tw}$. Other: $f(x) = \log(1 + x)$, $f(x) = \min(x^{0.5}, T)$.
Applications: search queries, online interactions, online transactions, word/term (co-)occurrences in a corpus, network traffic.
Issue: the aggregated view W is costly (data movement, storage, computation). Use sketches!

4 Sketches. A sketch for f is a lossy summary of the data D from which we can approximate (estimate) $f(D) = f(W)$: Data D, then Sketch(D), then an estimator for $f(W)$.
Sketch structure design goals: optimize sketch size vs. estimate quality. Can we get size $O(\varepsilon^{-2} + \log\log n)$? Composable/mergeable: from Sketch(A) and Sketch(B) of data sets A and B we can obtain Sketch(A ∪ B).

5 Why is composability useful? Distributed data / parallel computation: sketch each part separately, then merge sketches pairwise (Sketch 1 with Sketch 2, Sketch 3 with Sketch 4, and so on) up to a sketch of the full data. Streamed data: maintain a single sketch and fold in each arriving element.

6 Approximate statistics via small sketches. A data element $e \in D$ has a key and a value $(e.\mathrm{key}, e.\mathrm{value})$. The weight of key x is the sum of its element values: $w_x = \sum_{e : e.\mathrm{key} = x} e.\mathrm{value}$. f-statistics: $f(D) = f(W) = \sum_x f(w_x)$. Quality: coefficient of variation $\sigma/\mu = \varepsilon$, concentration?
Distinct, $f(w) = 1_{\{w > 0\}}$: [Flajolet Martin 85, Flajolet et al 07], $O(\varepsilon^{-2} + \log\log n)$.
Sum, $f(w) = w$: [Morris 77], $O(\varepsilon^{-2} + \log\log n)$.
Frequency moments, $f(w) = w^p$: [Alon Matias Szegedy 99, Indyk 01], $O(\varepsilon^{-2} \log n)$.
Capping, $f(w) = \min(T, w)$: [C 15] (via sampling), $O(\varepsilon^{-2} \log n)$.
Others: universal sketches [Braverman Ostrovsky 10], $\mathrm{polynomial}(\varepsilon^{-1}, \log n)$.

7 Sum: sketching $\sum_x w_x$.
With $\log n$ bits: a single register keeping the sum; clearly composable.
With $O(\varepsilon^{-2} + \log\log n)$ bits: [Morris 1977] + [Flajolet 1985]. Maintain only an exponent t, initialized $t \leftarrow 0$. Estimate: return $\frac{(1+\varepsilon)^t - 1}{\varepsilon}$. Add Y: increase t by the maximum amount such that the estimate increases by some $Z \le Y$; let $\Delta = Y - Z$ and increment t once more with probability $\Delta / (1+\varepsilon)^t$ (the remaining amount divided by the estimate increase of one more increment). Merge $t_1, t_2$: same as adding $\frac{(1+\varepsilon)^{t_2} - 1}{\varepsilon}$ to counter $t_1$. Composable version: [C 15]. (A small code sketch follows.)
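Below is a minimal Python sketch of the weighted Morris-style counter described above, assuming base $1+\varepsilon$ and the estimate $((1+\varepsilon)^t - 1)/\varepsilon$; the class and method names are illustrative and not taken from the talk or [C 15].

```python
import random

class MorrisCounter:
    """Approximate sum counter in the spirit of Morris '77, following the
    slide's description (an illustrative sketch, not a reference implementation)."""

    def __init__(self, eps=0.1):
        self.eps = eps
        self.t = 0  # only the exponent t is stored

    def estimate(self):
        # Estimate of the total sum: ((1+eps)^t - 1) / eps.
        return ((1 + self.eps) ** self.t - 1) / self.eps

    def add(self, y):
        # Deterministic part: raise t while the estimate stays <= old estimate + y.
        target = self.estimate() + y
        while ((1 + self.eps) ** (self.t + 1) - 1) / self.eps <= target:
            self.t += 1
        # Randomized part: one more increment with probability
        # (remaining amount) / (estimate increase of one more increment),
        # so the expected increase of the estimate is exactly y.
        delta = target - self.estimate()
        step = (1 + self.eps) ** self.t
        if random.random() < delta / step:
            self.t += 1

    def merge(self, other):
        # Merging counters = adding the other counter's estimate (composability).
        self.add(other.estimate())
```

Merging is just an Add of the other counter's estimate, which is what makes the structure composable.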

8 Distinct count sketches. HyperLogLog [Flajolet et al 2007]: optimal size $O(\varepsilon^{-2} + \log\log n)$ for CV $\sigma/\mu = \varepsilon$ with n distinct keys. Idea: store $k = \varepsilon^{-2}$ exponents of hashes; the exponent values are concentrated, so store one exponent plus k small offsets.
HIP estimators [Cohen 14, Ting 15]: halve the variance, to about $\frac{1}{2k}$. Idea: track an estimated count c alongside the sketch structure; whenever the structure is modified, add the inverse modification probability to c: if structure S is modified, $c \mathrel{+}= 1/\Pr[\text{modified}]$.

9 Distinct count sketches (a simplified version [C 94]). Initialize: $k = \varepsilon^{-2}$ registers $c_1, \dots, c_k \leftarrow \infty$; hash functions $H_i(x) \sim \mathrm{Exp}[1]$. Process element e.key: for $i \le k$: $c_i \leftarrow \min(c_i, H_i(e.\mathrm{key}))$. Estimate: $(k-1)/\sum_i c_i$.
Analysis: each $c_i$ is the minimum over active keys of independent Exp[1] values, so $c_i \sim \mathrm{Exp}[\mathrm{Distinct}(D)]$; estimating the count is a parameter estimation problem. Composability: the minimum is composable.
Reduce size: keep only the exponents of the $c_i$: one shared exponent and $\varepsilon^{-2}$ constant-size offsets. (A code sketch follows.)
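A minimal Python rendering of this simplified sketch, together with a HIP-style estimator in the spirit of slide 8; the class names, the blake2b-derived Exp[1] hash values, and the modification probability $1 - e^{-\sum_i c_i}$ used below are illustrative assumptions rather than the talk's exact construction.

```python
import hashlib, math

class DistinctSketch:
    """Simplified min-of-exponentials distinct-count sketch (per slide 9),
    without the exponent/offset size reduction."""

    def __init__(self, k=256):
        self.k = k
        self.c = [math.inf] * k  # c_i = min over processed keys of H_i(key) ~ Exp[1]

    def _exp_hash(self, key, i):
        # Deterministic pseudo-random Exp[1] value for (key, i).
        h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
        u = (int.from_bytes(h, "big") + 1) / 2**64  # uniform in (0, 1]
        return -math.log(u)

    def add(self, key):
        for i in range(self.k):
            self.c[i] = min(self.c[i], self._exp_hash(key, i))

    def merge(self, other):
        self.c = [min(a, b) for a, b in zip(self.c, other.c)]

    def estimate(self):
        # c_i ~ Exp[n] for n distinct keys, so (k-1)/sum(c_i) is unbiased for n.
        return (self.k - 1) / sum(self.c)


class HIPDistinctSketch(DistinctSketch):
    """HIP-style estimator layered on top: keep a running count and, whenever
    the structure is modified, add the inverse modification probability."""

    def __init__(self, k=256):
        super().__init__(k)
        self.count = 0.0

    def add(self, key):
        before = list(self.c)
        super().add(key)
        if self.c != before:
            # Probability that a not-yet-seen key modifies the current state:
            # register i changes with prob 1 - e^{-c_i}, independently over i,
            # so p = 1 - exp(-sum_i c_i).  (exp(-inf) == 0.0 covers the empty sketch.)
            p = 1.0 - math.exp(-sum(before))
            self.count += 1.0 / p
```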

10 MaxDistinct sketches. Max value of an element with key x: $m_x = \max_{e : e.\mathrm{key} = x} e.\mathrm{value}$; $\mathrm{MaxDistinct}(D) = \sum_x m_x$.
Initialize: $k = \varepsilon^{-2}$ registers $c_1, \dots, c_k \leftarrow \infty$; hash functions $H_i(x) \sim \mathrm{Exp}[1]$. Process element (e.key, e.value): for $i \le k$: $c_i \leftarrow \min(c_i, H_i(e.\mathrm{key})/e.\mathrm{value})$. Estimate: $(k-1)/\sum_i c_i$.
Analysis: for each key x, the minimum of $H_i(e.\mathrm{key})/e.\mathrm{value} \sim \mathrm{Exp}[e.\mathrm{value}]$ over that key's elements is $H_i(x)/m_x \sim \mathrm{Exp}[m_x]$; hence $c_i$ is the minimum over keys x of independent $\mathrm{Exp}[m_x]$ values, so $c_i \sim \mathrm{Exp}[\mathrm{MaxDistinct}(D)]$. (A code sketch follows.)
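Extending the illustrative DistinctSketch above to MaxDistinct only changes how an element updates the registers, using the fact that $H_i(x)/v \sim \mathrm{Exp}[v]$ when $H_i(x) \sim \mathrm{Exp}[1]$; again a hedged sketch, not the talk's code.

```python
class MaxDistinctSketch(DistinctSketch):
    """MaxDistinct sketch: each element contributes H_i(key)/value, so the
    register minimum is distributed Exp[sum_x m_x]."""

    def add(self, key, value):
        for i in range(self.k):
            # H_i(key)/value ~ Exp[value]; the min over a key's elements is Exp[m_x].
            self.c[i] = min(self.c[i], self._exp_hash(key, i) / value)

    # merge() and estimate() are inherited unchanged:
    # (k-1)/sum(c_i) now estimates MaxDistinct(D) = sum_x m_x.
```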

11 Element processing framework. Each data element is fed through a randomized MAP that emits output elements (0 or more); the output elements go into a composable Distinct / MaxDistinct sketch.
Goal: $E\big[\mathrm{MaxDistinct}\big(\bigcup_{e \in D} \mathrm{MAP}(e)\big)\big] = f(W)$, plus concentration.
Q: For which f can we do this? How (i.e., how to specify MAP)?

12 (Soft) cap functions. Hard cap: $\mathrm{cap}_T(w) = \min(T, w)$; soft cap: $\widetilde{\mathrm{cap}}_T(w) = T(1 - e^{-w/T})$. On the aggregated data: $\mathrm{cap}_T(W) = \sum_x \min(T, w_x)$ and $\widetilde{\mathrm{cap}}_T(W) = T \sum_x (1 - e^{-w_x/T})$. Example (for the frequency density W pictured on the slide, with T = 3): $\mathrm{cap}_3(W) = 10$, $\widetilde{\mathrm{cap}}_3(W) \approx 8.38$.

13 Warm up: sketching soft-cap statistics. We work with $f_t(w) = 1 - e^{-tw}$. The statistic is the complement-Laplace ($\mathrm{Laplace}^c$) transform of W (the density function of frequencies): $L^c[W](t) = \sum_x \big(1 - e^{-t w_x}\big)$. The soft-cap statistic is T times the $\mathrm{Laplace}^c$ transform at the point $t = 1/T$, since $\widetilde{\mathrm{cap}}_T(w) = T f_{1/T}(w)$:
$\widetilde{\mathrm{cap}}_T(W) = \sum_x \widetilde{\mathrm{cap}}_T(w_x) = T \sum_x f_{1/T}(w_x) = T\, L^c[W](1/T)$.

14 Sketching the $\mathrm{Laplace}^c$ transform of W at a point t, $f_t(w) = 1 - e^{-tw}$.
MAP, for input $e = (e.\mathrm{key}, e.\mathrm{value})$: for $i = 1, \dots, r$: draw $y \sim \mathrm{Exp}[e.\mathrm{value}]$; if $y \le t$: output e.key#i.
Output: the (approximate) number of distinct output keys, i.e. a Distinct sketch! The output sketch size is barely affected by r; only element processing is.

15 Claim (correctness): $\frac{1}{r}\, E\big[\mathrm{Distinct}\big(\bigcup_{e \in D} \mathrm{MAP}(e)\big)\big] = \sum_x \big(1 - e^{-t w_x}\big) = L^c[W](t)$.
Proof idea: compute the probability that output key x#i is generated, i.e. that $y \le t$ for at least one element $e \in D$ with $e.\mathrm{key} = x$. The minimum of the independent $\mathrm{Exp}[e.\mathrm{value}]$ draws over these elements is distributed $\mathrm{Exp}[w_x]$, so this probability is $1 - e^{-t w_x}$. Each input key x and $i \le r$ has a unique potential output key x#i; sum over all x, i (Poisson rvs) to establish the claim. (A small simulation follows.)
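A small Python simulation of the element map from slides 14 and 15, with a Monte Carlo check of the claim; the function name and the toy data set are assumptions for illustration, and the distinct count of output keys is computed exactly rather than with a sketch.

```python
import math, random

def laplace_c_map(element, t, r):
    """MAP of slide 14: for each replica i, draw y ~ Exp[e.value] and emit the
    output key (key, i) iff y <= t."""
    key, value = element
    out = set()
    for i in range(r):
        y = random.expovariate(value)  # Exp[e.value]
        if y <= t:
            out.add((key, i))
    return out

# Check (1/r) E[Distinct(union of MAP(e))] = sum_x (1 - exp(-t * w_x)) on a toy input.
data = [("a", 2.0), ("a", 3.0), ("b", 1.0)]  # w_a = 5, w_b = 1
t, r, trials = 0.5, 200, 50
exact = sum(1 - math.exp(-t * w) for w in (5.0, 1.0))
avg = 0.0
for _ in range(trials):
    keys = set()
    for e in data:
        keys |= laplace_c_map(e, t, r)
    avg += len(keys) / r
print(f"estimate {avg / trials:.3f}  vs  exact {exact:.3f}")
```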

16 Subtlety: we need enough (about $\varepsilon^{-2}$) distinct output keys for low error. We set $r = O(\varepsilon^{-2})$, but still need to address small t: $\frac{1}{r} E\big[\mathrm{Distinct}\big(\bigcup_e \mathrm{MAP}(e)\big)\big] = \sum_x (1 - e^{-t w_x}) = L^c[W](t)$.
Example density W of frequencies: 10 keys with $w_x = 1$, 2 keys with $w_x = 5$, 1 key with $w_x = 10$; $\mathrm{Sum}(W) = 30$, $\mathrm{Distinct}(W) = 13$, and $L^c[W](t) = 13 - 10e^{-t} - 2e^{-5t} - e^{-10t}$.
In the regime with a low distinct count we can use $L^c[W](t) \approx t\,\mathrm{Sum}(W)$.

17 f(w) in the nonnegative span of $g_t(w) = 1 - e^{-tw}$: any function f(w) that can be expressed, with $a(t) \ge 0$, as
$f(w) = \int_0^\infty a(t)\,\big(1 - e^{-tw}\big)\,dt = L^c[a](w)$.
We get $a(t) = \frac{1}{t}\, \mathcal{L}^{-1}\big[\tfrac{df}{dw}\big](t)$.
The span includes all concave sublinear f without sharp corners: low frequency moments ($p \le 1$), logarithms, soft capping functions.
  f(w) = $T(1 - e^{-w/T})$:  a(t) = $T\,\delta(t - 1/T)$
  f(w) = $\sqrt{w}$:  a(t) = $\frac{1}{2\sqrt{\pi}}\, t^{-3/2}$
  f(w) = $\log(1 + w)$:  a(t) = $e^{-t}/t$
(A quick numerical check follows.)
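As a sanity check of the table, the pair $f(w) = \log(1+w)$ with $a(t) = e^{-t}/t$ can be verified numerically (a throwaway script assuming SciPy is available):

```python
import math
from scipy.integrate import quad

def integrand(t, w):
    # a(t) * (1 - e^{-wt}) with a(t) = e^{-t}/t; the t -> 0 limit equals w.
    if t == 0.0:
        return w
    return math.exp(-t) * (1 - math.exp(-w * t)) / t

for w in (0.5, 3.0, 20.0):
    val, _err = quad(integrand, 0, math.inf, args=(w,))
    print(f"w={w:5.1f}  integral={val:.6f}  log(1+w)={math.log(1 + w):.6f}")
```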

18 Sketching statistics for f in the nonnegative span. With $f(w) = \int_0^\infty a(t)(1 - e^{-tw})\,dt$, $a(t) \ge 0$:
$f(W) = \int_0^\infty f(w)\, W(w)\,dw = \int_0^\infty W(w) \int_0^\infty a(t)\big(1 - e^{-tw}\big)\,dt\,dw = \int_0^\infty a(t) \int_0^\infty W(w)\big(1 - e^{-tw}\big)\,dw\,dt = \int_0^\infty a(t)\, L^c[W](t)\,dt$.
We can sketch $L^c[W](t)$, so the statistic f(W) is expressed in terms of the $L^c$ transform at many points t. But we will see a better way.

19 Sketching $f(W) = \int_0^\infty a(t)\, L^c[W](t)\,dt$. Idea: modify the element map for a single point t to work with the weighted combination of t values (slightly simplified).
Point t (feeds a Distinct count sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; if $y \le t$: output e.key#i.
Combination a(t) (feeds a MaxDistinct sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; output $\big(e.\mathrm{key}\#i,\ \int_y^\infty a(t)\,dt\big)$. (A toy end-to-end example follows.)
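Putting the pieces together for $f(w) = \log(1+w)$, where $\int_y^\infty a(t)\,dt = \int_y^\infty e^{-t}/t\,dt = E_1(y)$: the toy below reuses the illustrative MaxDistinctSketch from earlier and SciPy's exp1; all names and parameters are assumptions for illustration, not the talk's pipeline.

```python
import math, random
from scipy.special import exp1  # E1(y) = integral_y^inf e^{-t}/t dt

def span_map(element, r):
    """Combined element map of slide 19 for a(t) = e^{-t}/t:
    each output element carries the value E1(y) = integral_y^inf a(t) dt."""
    key, value = element
    for i in range(r):
        y = random.expovariate(value)         # Exp[e.value]
        yield (f"{key}#{i}", float(exp1(y)))  # output element (key#i, E1(y))

# Estimate f(W) = sum_x log(1 + w_x) from un-aggregated elements.
data = [("a", 2.0), ("a", 3.0), ("b", 1.0)]   # w_a = 5, w_b = 1
r = 500
sk = MaxDistinctSketch(k=256)
for e in data:
    for out_key, out_val in span_map(e, r):
        sk.add(out_key, out_val)
print("estimate:", sk.estimate() / r, "  exact:", math.log(6) + math.log(2))
```

Per output key, the MaxDistinct sketch effectively keeps the maximum emitted value, whose expectation is $f(w_x)$, so dividing its estimate by r recovers $f(W)$.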

20 Extension: multi-objective sketch of the span. Idea: if we can sketch all t values together, we can use one sketch for every statistic in the span, at a logarithmic-factor increase in sketch size (analysis of all-distance sketches [C 94, 15]).
Point t (Distinct count sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; if $y \le t$: output e.key#i.
All t (All-threshold sketch): input $e = (e.\mathrm{key}, e.\mathrm{value})$; for $i = 1, \dots, r$: $y \sim \mathrm{Exp}[e.\mathrm{value}]$; output (e.key#i, y).

21 All-threshold Distinct sketches. Input elements (e.key, e.value); the sketch lets us approximate, for any t, $\mathrm{Distinct}\{e.\mathrm{key} : e.\mathrm{value} \le t\}$.
Distinct counting sketch (as before): initialize $k = \varepsilon^{-2}$ registers $c_1, \dots, c_k \leftarrow \infty$; hash functions $H_i(x) \sim \mathrm{Exp}[1]$; process element e.key: for $i \le k$: $c_i \leftarrow \min(c_i, H_i(e.\mathrm{key}))$; estimate: $(k-1)/\sum_i c_i$.
All-threshold Distinct sketch: remember, for each register $c_i$, all breakpoints (values of t) at which its minimum changes. There are logarithmically many breakpoints [C 94]. Any sample-based distinct sketch can be similarly extended [C 15]. (A simplified code sketch follows.)
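A simplified Python sketch of the all-threshold structure: each register stores the breakpoints of its minimum as a function of the threshold t (the minimum over keys with value <= t is non-increasing in t), so a query at any t reuses the $(k-1)/\sum_i c_i$ estimator. The breakpoint bookkeeping below is an illustrative assumption and does not reproduce the logarithmic-size analysis cited above.

```python
import bisect, hashlib, math

class AllThresholdDistinctSketch:
    """For every threshold t, approximates Distinct{e.key : e.value <= t}."""

    def __init__(self, k=256):
        self.k = k
        # Per register: sorted breakpoints (t, min of H_i over keys with value <= t).
        self.bp = [[] for _ in range(k)]

    def _exp_hash(self, key, i):
        h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
        u = (int.from_bytes(h, "big") + 1) / 2**64
        return -math.log(u)

    def _min_at(self, bps, t):
        j = bisect.bisect_right(bps, (t, math.inf))
        return bps[j - 1][1] if j else math.inf

    def add(self, key, value):
        for i in range(self.k):
            h = self._exp_hash(key, i)
            bps = self.bp[i]
            if h < self._min_at(bps, value):
                # The element lowers the minimum for all thresholds >= value:
                # keep earlier breakpoints and later ones already below h,
                # and record the new breakpoint (value, h).
                bps = [b for b in bps if b[0] < value or b[1] < h] + [(value, h)]
                bps.sort()
                self.bp[i] = bps

    def query(self, t):
        """Estimate Distinct{e.key : e.value <= t}."""
        total = sum(self._min_at(bps, t) for bps in self.bp)
        return (self.k - 1) / total
```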

22 Handling sharp corners: all concave sublinear f. Reduces to sketching $\mathrm{cap}_1(w) = \min(1, w)$. The soft cap gives a $1 - 1/e$ approximation. Better approximation: use a signed inverse transform to approximate $\mathrm{cap}_1(w)$ while controlling $\|a\|_1 = \int_0^\infty |a(t)|\,dt$; estimate the negative and positive components separately. A grid search on 3 points gives error within 12%. Open question: achieving error $\varepsilon$ for arbitrarily small $\varepsilon > 0$.

23 Conclusion. Summary: a simple, practical design of composable, double-logarithmic-size sketches for concave sublinear statistics; the results are theoretically novel even at $O(\varepsilon^{-2} \log n)$ size.
Open: handling statistics with sharp corners; loglog-size sketches in the super-linear regime (second moment?).

24 Thank you!
