MIT 6.854/18.415: Advanced Algorithms, Spring 2016
Prof. Ankur Moitra
Lecture 4, February 16, 2016
Scribe: Ben Eysenbach, Devin Neal

1 Last Time

Consistent Hashing - hash functions that evolve well
Random Trees - routing schemes that deal with inconsistent views

Today: Distinct Elements and Count-Min Sketch. What can we do if we can't store data, only stream it?

2 Distinct Elements

Problem: Count the number of distinct elements in a sequence X_1, X_2, ..., X_n. For example, how many unique words did Shakespeare use?

Naively this problem takes O(N) space, where N is the number of distinct elements in the sequence. For Shakespeare's total vocabulary, N ≈ 35,000. However, it turns out that you can do much better than the naive method if you are willing to accept some level of approximation. There's a famous quote from a 2003 paper by Marianne Durand and Philippe Flajolet:

"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare."

2.1 Using a Single Hash Function

Idea: Choose a random hash function h : U → [0, 1] and pass once through the data, hashing each item and storing only the minimum of h(X_1), h(X_2), ..., h(X_i), .... Let Y = min_i {h(X_i)} be the minimum, and let N be the true number of distinct elements.

Lemma 1. E[Y] = 1/(N+1)
Proof.

E[Y] = ∫_0^1 P[Y ≥ z] dz = ∫_0^1 (1 − z)^N dz = [ −(1 − z)^{N+1}/(N+1) ]_0^1 = 1/(N+1)

With some thought, you can confirm that E[Y] is the same as the probability of choosing N + 1 numbers in the interval [0, 1] and having the last number be the minimum. By symmetry, this probability is 1/(N+1). So, if we estimate N by 1/Y − 1, we'll at least get the right answer in expectation.

To show that we also get the right answer with good probability, we're going to use the Chebyshev tail bound to bound the probability that Y is far from its expectation. To do this, we must first compute the variance of Y.

Lemma 2. Var[Y] ≤ (1/(N+1))²

Proof.

Var[Y] = E[Y²] − E[Y]²
       = ∫_0^1 z² N(1 − z)^{N−1} dz − (1/(N+1))²
       = 2/((N+1)(N+2)) − 1/((N+1)(N+1))
       ≤ 2/((N+1)(N+1)) − 1/((N+1)(N+1))
       = (1/(N+1))²

Unfortunately, we cannot apply Chebyshev directly because the variance of Y is too large. In particular, Chebyshev only gives error resolution down to the size of Y's standard deviation (i.e., √Var[Y]). Zero is one standard deviation below E[Y], so Chebyshev would only tell us that P[Y = 0] is at most some constant. This is bad because we cannot solve N = 1/Y − 1 when Y = 0. We want P[Y = 0] to be very small.
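To sanity-check this estimator, here is a minimal Python simulation. It is purely illustrative: a truly random hash h : U → [0, 1] is simulated by memoizing one fresh uniform draw per distinct item, which itself takes O(N) space, so this checks the math rather than the space bound (the function name and the example stream are our own). As the next section discusses, the raw estimate has high variance; averaging independent copies tames it.

```python
import random

def estimate_distinct(stream, seed=0):
    """Estimate the number of distinct elements via the minimum hash value.

    A random hash h: U -> [0, 1] is simulated by memoizing a uniform draw
    per distinct item (illustration only; not a space-efficient streaming
    implementation).
    """
    rng = random.Random(seed)
    h = {}          # memoized "random hash" values
    y = 1.0         # running minimum of h(X_i)
    for x in stream:
        if x not in h:
            h[x] = rng.random()
        y = min(y, h[x])
    return 1 / y - 1    # since E[Y] = 1/(N+1)

# 100,000 stream items, N = 1000 distinct values
stream = [i % 1000 for i in range(100_000)]
print(estimate_distinct(stream))  # a high-variance estimate of 1000
```

Because a single minimum has standard deviation on the order of its mean, individual runs can be far from 1000; the point is only that the estimate is correct in expectation.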
2.2 k Hash Functions

Fortunately, we can apply the standard technique of reducing the variance of our estimate via repetition.

Idea [Flajolet-Martin]: Use k hash functions h_1, ..., h_k : U → [0, 1]. Now, evaluate each hash function on each item in the sequence, storing the minimum for each hash function separately. Let Ȳ = (1/k) Σ_{i=1}^k Y_i be the average minimum. The variance of the sum of independent random variables is the sum of their variances. Thus,

Var[Ȳ] = (1/k²) Σ_{i=1}^k Var[Y_i] ≤ 1/(k(N+1)²),

where we've used Lemma 2 for the inequality. Applying Chebyshev:

P[ |Ȳ − 1/(N+1)| ≥ ε/(N+1) ] ≤ Var[Ȳ] / (ε/(N+1))² ≤ 1/(kε²)

Thus, our estimate for the number of distinct elements, 1/Ȳ − 1, satisfies (N+1)/(1+ε) ≤ 1/Ȳ ≤ (N+1)/(1−ε) with probability at least 1 − 1/(kε²). For small ε, this guarantee is equivalent to:

(1 − O(ε))N ≤ 1/Ȳ − 1 ≤ (1 + O(ε))N

We need to set k = O(1/ε²) to get an ε-accurate estimate with probability 9/10, for example.

In practice, we don't need our hash functions to map to arbitrary real numbers. It is sufficient to use length-O(log n) binary strings. For more details, see [1].

3 Heavy Hitters

We can compute many statistics about a stream of data beyond the number of distinct elements. One particularly popular and practically useful goal is to find elements that appear frequently in the stream. These items are simply called frequent items or sometimes heavy hitters.

3.1 Misra-Gries, 1982 [3]

We begin with a straightforward version of the heavy hitters problem: Given a sequence of n elements X_1, X_2, ..., X_n, output a list with at most k values, ensuring that every element which occurs at least n/(k+1) + 1 times in the sequence is on the list. Note that there can only be k such items, although there could be fewer. We allow false positives in the list. Here's the algorithm:

initialize empty list
for each item
    if item on list
        increment its counter
    else if length(list) < k
        add item to list
        set item's counter to 1
    else
        throw away item
        decrement counter of every item in list
        delete items in list with counter = 0

Fact: Since the Misra-Gries algorithm stores k counters, each with value at most n, it uses O(k log n) space.

Lemma 3. Let f_x denote the frequency of item x. When Misra-Gries terminates, the counter for x is at least f_x − n/(k+1). Note that x could have a counter equal to 0, in which case it will not be on the list.

Proof. The final value of x's counter is equal to the number of times it appears in our sequence, f_x, minus the number of times it was decremented, either because x itself was thrown out or because some other item was thrown out while x was on the list. We argue that the number of throw-out events can't be higher than n/(k+1). When an item is thrown out, each element contained in the list is decremented. Additionally, the thrown-out item's virtual counter is effectively decremented from 1 to 0. Since the entire list must be full for a throw-out to occur, this corresponds to k + 1 tokens being destroyed every time an item is thrown out. There are a total of n tokens received (one per stream element), so this event can occur at most n/(k+1) times. We conclude that the counter for x is at least f_x − n/(k+1). So, if f_x > n/(k+1), it will appear on the list at the end of the algorithm, as desired.

3.2 Count-Min Sketch

We can also solve a more ambitious version of the heavy hitters problem: Given a sequence X_1, X_2, ..., X_n, compute f_x to within additive error εn. Note that an additive approximation is more meaningful for heavier items in the list and becomes meaningless when f_x ≤ εn. We will use the Count-Min Sketch (Cormode, Muthukrishnan 2005 [2]).

Choose l random hash functions h_1, ..., h_l : U → {1, 2, ..., b}. Initialize an l × b array CMS[l][b] with zeros. Then, as elements X_i in the sequence are streamed in, for each hash function h_j(·), increment the (j, h_j(X_i)) entry of the table. To estimate the frequency of item x, compute

Count(x) = min_j { CMS[j][h_j(x)] }

Claim 4. For any fixed item x and index j, CMS[j][h_j(x)] ≥ f_x.

Proof. We increment CMS[j][h_j(x)] each time we see x. However, this entry will also be incremented if h_j(y) = h_j(x) for some other item y ≠ x that also appears in the stream.
Hence the inequality.
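The update/query procedure can be sketched in a few lines of Python. This is a toy illustration, not a production implementation: the l random hash functions are stood in for by salting Python's built-in hash with per-row random seeds, and the class and variable names are our own.

```python
import random

class CountMinSketch:
    """A minimal Count-Min sketch: an l x b table of counters.

    Per-row salts applied to Python's built-in hash() stand in for
    l independent random hash functions (an assumption for illustration).
    """

    def __init__(self, b, l, seed=0):
        self.b = b
        self.table = [[0] * b for _ in range(l)]
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(l)]

    def _h(self, j, x):
        # "hash function" h_j: salted hash reduced mod b
        return hash((self.salts[j], x)) % self.b

    def update(self, x):
        # increment the (j, h_j(x)) entry for every row j
        for j in range(len(self.table)):
            self.table[j][self._h(j, x)] += 1

    def count(self, x):
        # Count(x) = min_j CMS[j][h_j(x)]
        return min(self.table[j][self._h(j, x)] for j in range(len(self.table)))

cms = CountMinSketch(b=200, l=5)
stream = ["a"] * 500 + ["b"] * 100 + [str(i) for i in range(400)]  # n = 1000
for x in stream:
    cms.update(x)
print(cms.count("a"))  # at least 500, by Claim 4
```

Note that Claim 4 holds deterministically here: every occurrence of "a" increments all five of its cells, so the minimum can only overestimate f_a, never undercount it.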
3.2.1 Analysis

Let z_j := CMS[j][h_j(x)], and note that z_j = f_x + Σ_{y ≠ x : h_j(y) = h_j(x)} f_y. We want to examine the expected value of z_j:

E[z_j] = f_x + Σ_{y ≠ x} f_y · P[h_j(x) = h_j(y)]
       = f_x + (1/b) Σ_{y ≠ x} f_y
       ≤ f_x + (1/b) Σ_y f_y
       = f_x + n/b

Note that z_j is a biased estimator: its expected value is greater than f_x, the quantity we hope to estimate. Now, we want to show that z_j is close to f_x. Set b = 2/ε. The excess z_j − f_x is nonnegative with E[z_j − f_x] ≤ n/b = εn/2, so the Markov bound gives

P[z_j − f_x ≥ εn] ≤ 1/2

When we have l hash functions, our estimate is the minimum z_j. Our estimate is bad iff every z_j is much larger than f_x. Since the hash functions are independent,

P[(min_j z_j) ≥ f_x + εn] = P[∀j, z_j ≥ f_x + εn] ≤ (1/2)^l

Setting l = O(log n), we get P[(min_j z_j) ≥ f_x + εn] ≤ 1/n. Accordingly, our estimate is accurate up to εn error with high probability.

3.2.2 Comparison with Misra-Gries

By setting ε = 1/k, the Count-Min sketch solves almost the same problem as Misra-Gries. The key difference is that the Count-Min sketch guarantees that for every x, if Count(x) is large then x is a heavy hitter. In Misra-Gries, an item on the returned list might not be a heavy hitter. We pay for this guarantee in space: Misra-Gries takes O(k log n) space, while the Count-Min sketch requires O(k log² n) space (setting ε = 1/k and noting that each counter stores a value of at most n). Also note that Count-Min only obtains its solution with high probability, whereas Misra-Gries is deterministic and so always succeeds.
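For concreteness, the Misra-Gries pseudocode from Section 3.1 can be fleshed out as a short Python function (a sketch with our own names, storing the "list" as a dict from items to counters):

```python
def misra_gries(stream, k):
    """Return candidate heavy hitters with their counters.

    Every item occurring more than n/(k+1) times in the stream is
    guaranteed to be a key of the returned dict (Lemma 3); items on
    the list with smaller frequencies may appear as false positives.
    """
    counters = {}  # at most k items, each with a positive counter
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # list is full: throw away x and decrement every counter
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

# n = 102, k = 2, threshold n/(k+1) = 34: "a" (40 occurrences) must survive
stream = ["a"] * 40 + ["b"] * 30 + list("cdefghij") * 4
print(misra_gries(stream, k=2))
```

On this example the only item above the n/(k+1) = 34 threshold is "a", and Lemma 3 guarantees its final counter is at least 40 − 34 = 6; "b" and the tail items may or may not appear.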
References

[1] Flajolet, Philippe and Martin, Nigel. 1985. Probabilistic counting algorithms for data base applications. In Journal of Computer and System Sciences, Volume 31, Number 2. pp. 182-209.

[2] Cormode, Graham and Muthukrishnan, S. 2005. An improved data stream summary: the count-min sketch and its applications. In Journal of Algorithms. pp. 58-75.

[3] Misra, Jayadev and Gries, David. 1982. Finding repeated elements. In Science of Computer Programming. pp. 143-152.