Lecture 4: Universal Hash Functions/Streaming Cont'd


CSE 521: Design and Analysis of Algorithms I                                                   Spring 2016

Lecture 4: Universal Hash Functions/Streaming Cont'd

Lecturer: Shayan Oveis Gharan        April 6th        Scribe: Jacob Schreiber

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

4.1 Hash Functions

Suppose we want to maintain a data structure for a set of elements $x_1, \dots, x_m$ from a universe $U$, e.g., images, that can perform insertion/deletion/search operations. A simple strategy would be to have one bucket for every possible image, i.e., for each element of $U$, and indicate in each bucket whether or not the corresponding image appeared. Unfortunately, $|U|$ can be much, much larger than the space available in our computers; for example, if $U$ represents the set of all possible images, $|U|$ is as big as $2^{1000000}$.

Instead, one may use a hash function. A hash function $h : U \to [B]$ maps elements of $U$ to integers in $[B]$. For every element $x_i$ of the sequence we mark the cell $h(x_i)$ with $x_i$. When a query $x$ arrives, we go to the cell $h(x)$; if no element is stored there, $x$ is not in our sequence. Otherwise, we go over all elements stored in cell $h(x)$ and see if any of them is equal to $x$. Observe that the search operation thus depends on the number of elements stored in cell $h(x)$. Ideally, we would like to have a hash function that stores at most one element in every cell of $[B]$.

Fix a function $h$. Observe that $h$ maps a $1/B$ fraction of all elements of $U$ to the same number $i \in [B]$. Therefore, the search operation in the worst case is very slow. We can mitigate this problem by choosing a hash function uniformly at random from the family of all functions that map $U$ to $[B]$: let $\mathcal{H} = \{h : U \to [B]\}$, and let $h \in \mathcal{H}$ be chosen uniformly at random. Now, if the length of the sequence satisfies $m \ll \sqrt{B}$, then, by the birthday paradox phenomenon, with high probability no two elements of the sequence map to the same cell. In other words, there are no collisions. However, observe that $\mathcal{H}$ has $B^{|U|}$ many functions, so even describing $h$ requires $\log B^{|U|} = |U| \log B$ bits of memory. Recall that we assumed $|U| \approx 2^{1000000}$, so we cannot efficiently represent $h$. Instead, we are going to work with much smaller families of functions, say $\mathcal{H}'$; such a family can only guarantee weaker notions of independence, but because $|\mathcal{H}'| \ll |\mathcal{H}|$, it is much easier to describe a randomly chosen function from $\mathcal{H}'$.
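As a concrete illustration of the bucketing scheme above, the following Python sketch implements a hash table with chaining; the hash function is passed in as a black box, and the class and variable names are ours, not part of the lecture.

```python
import random

class ChainedHashTable:
    """Hash table with chaining: bucket h(x) stores every inserted x that maps to it."""

    def __init__(self, B, h):
        self.B = B                      # number of buckets
        self.h = h                      # hash function h: U -> [B]
        self.buckets = [[] for _ in range(B)]

    def insert(self, x):
        cell = self.buckets[self.h(x)]
        if x not in cell:               # avoid storing duplicates
            cell.append(x)

    def delete(self, x):
        cell = self.buckets[self.h(x)]
        if x in cell:
            cell.remove(x)

    def search(self, x):
        # Cost is proportional to the number of elements stored in bucket h(x).
        return x in self.buckets[self.h(x)]

# A "truly random" hash function, tabulated lazily.  Storing this table explicitly
# is exactly the |U| log B memory cost discussed above.
B = 16
table = {}
def random_h(x):
    if x not in table:
        table[x] = random.randrange(B)
    return table[x]

T = ChainedHashTable(B, random_h)
for x in [3, 1, 4, 1, 5, 9, 2, 6]:
    T.insert(x)
print(T.search(4), T.search(7))  # True False
```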

4.2 2-Universal Functions

In this section, we describe a family of hash functions that only guarantees pairwise independence. Let $p$ be a prime number, and let
$$\mathcal{H} = \{h_{a,b} : [p] \to [p] \mid h_{a,b}(x) = ax + b \bmod p,\ a, b \in [p]\}.$$
Observe that any function $h_{a,b} \in \mathcal{H}$ can be represented in $O(\log p)$ bits of memory just by recording $a, b \in [p]$. Next, we show that a uniformly random function $h \in \mathcal{H}$ is pairwise independent.

Lemma 4.1. For any $x, y, c, d \in [p]$ with $x \neq y$,
$$\Pr[h(x) = c,\ h(y) = d] = \frac{1}{p^2}.$$

Proof. Suppose that for some $x \neq y$ we have $h(x) = c$ and $h(y) = d$. Equivalently, we can write
$$ax + b \equiv c \pmod{p}, \qquad ay + b \equiv d \pmod{p}.$$
Using the laws of modular equations, we can write
$$a(x - y) \equiv (c - b) - (d - b) \equiv c - d \pmod{p}.$$
Since $p$ is a prime, any nonzero number $z \in [p]$ has a multiplicative inverse, i.e., there is a number $z^{-1} \in [p]$ such that $z \cdot z^{-1} \equiv 1 \pmod{p}$. Since $x \neq y$, we have $x - y \neq 0$; therefore it has a multiplicative inverse, and we can write
$$a = (x - y)^{-1}(c - d) \bmod p, \quad \text{which gives} \quad b = d - ay \bmod p.$$
In words, fixing $x, y, c, d$ uniquely defines $a, b$. Since there are $p^2$ possibilities for $(a, b)$, we get $\Pr[h(x) = c, h(y) = d] = 1/p^2$.

For our application in estimating $F_0$, we first need to choose a prime number $p > n$. Then, we can use a hash function $h : [n] \to [B]$ where for any $0 \le x \le n - 1$,
$$h(x) = ((ax + b) \bmod p) \bmod B.$$
It is easy to see that such a function is almost pairwise independent, which is good enough for our application in estimating $F_0$.

We can extend the above construction to a family of $k$-wise independent hash functions. We say a hash function $h : [p] \to [p]$ is $k$-wise independent if for all distinct $x_0, \dots, x_{k-1}$ and all $c_0, \dots, c_{k-1}$,
$$\Pr[h(x_i) = c_i \ \forall i] = \frac{1}{p^k}.$$
Such a hash function $h$ can be constructed by choosing $a_0, a_1, \dots, a_{k-1}$ uniformly and independently from $[p]$ and letting
$$h(x) = a_{k-1}x^{k-1} + a_{k-2}x^{k-2} + \dots + a_1 x + a_0 \bmod p.$$
We are not proving that this gives a $k$-wise independent hash function; instead, we just give the high-level idea. Let $h$ be a 4-wise independent hash function, let $x_0, x_1, x_2, x_3 \in [p]$ be distinct, and let $c_0, c_1, c_2, c_3 \in [p]$. We need to show that there is a unique tuple $a_0, a_1, a_2, a_3$ for which $h(x_i) = c_i$ for all $i$. To find $a_0, a_1, a_2, a_3$ it is enough to solve the following system of linear equations:
$$\begin{bmatrix} x_0^3 & x_0^2 & x_0 & 1 \\ x_1^3 & x_1^2 & x_1 & 1 \\ x_2^3 & x_2^2 & x_2 & 1 \\ x_3^3 & x_3^2 & x_3 & 1 \end{bmatrix} \begin{bmatrix} a_3 \\ a_2 \\ a_1 \\ a_0 \end{bmatrix} = \begin{bmatrix} c_0 \\ c_1 \\ c_2 \\ c_3 \end{bmatrix}.$$
It turns out that the matrix on the LHS has a nonzero determinant if $x_0, x_1, x_2, x_3$ are distinct. In such a case, it is invertible, and we can use the inverse to uniquely define $a_0, a_1, a_2, a_3$.
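As an illustration (not part of the original notes), here is a minimal Python sketch of sampling $h_{a,b}$ and composing it with the extra "mod B" step used for the $F_0$ application; the particular prime $p = 2^{31} - 1$ is an assumption made here just to have a concrete prime larger than $n$.

```python
import random

def sample_2_universal(n, B):
    """Sample h(x) = ((a*x + b) mod p) mod B from the 2-universal family above.

    p = 2**31 - 1 is a Mersenne prime chosen for this sketch; any prime p > n works.
    """
    p = 2**31 - 1
    assert n < p
    a = random.randrange(p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % B

# Usage: hash the universe {0, ..., n-1} into B buckets.
h = sample_2_universal(n=10**6, B=1024)
print(h(42), h(43))   # two (almost) pairwise-independent values in {0, ..., 1023}
```

Note that storing $h$ amounts to storing the two integers $a$ and $b$, i.e., $O(\log p)$ bits, in contrast to the $|U| \log B$ bits needed for a truly random function.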

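Similarly, here is a sketch of the degree-$(k-1)$ polynomial construction; the prime and the Horner-style evaluation are implementation choices of ours, not something fixed by the notes.

```python
import random

def sample_k_wise(k, p=2**31 - 1):
    """Sample h(x) = a_{k-1} x^{k-1} + ... + a_1 x + a_0 (mod p),
    with a_0, ..., a_{k-1} drawn uniformly and independently from [p]."""
    coeffs = [random.randrange(p) for _ in range(k)]   # a_0, ..., a_{k-1}

    def h(x):
        # Horner's rule: (((a_{k-1} x + a_{k-2}) x + ...) x + a_0) mod p
        value = 0
        for a in reversed(coeffs):
            value = (value * x + a) % p
        return value

    return h

h4 = sample_k_wise(k=4)       # 4-wise independent; enough for the F_2 sketch below
print(h4(0), h4(1), h4(2), h4(3))
```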
4.3 $F_2$ Moment

Before designing a streaming algorithm that estimates $F_2$, let us revisit the random walk example that we saw a few lectures ago. Let $X = \sum_i X_i$ where, for each $i$,
$$X_i = \begin{cases} +1 & \text{w.p. } 1/2, \\ -1 & \text{w.p. } 1/2. \end{cases}$$
Using the Hoeffding bound, we previously showed that for any $c > 1$,
$$\Pr[X \ge c\sqrt{n}] \le e^{-c^2/2}.$$
Is this bound tight? Can we show that $|X| \ge \Omega(\sqrt{n})$ with constant probability? The answer is yes. More generally, it follows from the central limit theorem, but instead of using such a heavy tool there is a more elementary argument that we can use. To show that $|X| \ge \Omega(\sqrt{n})$ with constant probability, it is enough to show that $\mathbb{E}[X^2] = n$:
$$\mathbb{E}[X^2] = \mathbb{E}\Big[\Big(\sum_i X_i\Big)^2\Big] = \mathbb{E}\Big[\sum_{i,j} X_i X_j\Big] = \sum_{i,j}\mathbb{E}[X_i X_j] = \sum_i \mathbb{E}[X_i^2] = n,$$
where in the second-to-last equality we use that $X_i$ and $X_j$ are independent, so $\mathbb{E}[X_i X_j] \neq 0$ only when $i = j$, and in the last equality we use $\mathbb{E}[X_i^2] = 1$ for all $i$.

Now back to estimating $F_2$. We want to use a similar idea. Let $x_1, x_2, \dots, x_m \in [n]$ be the input sequence. For each $i \in [n]$ let $m_i := \#\{j : x_j = i\}$. Recall that
$$F_2 := \sum_{i=1}^n m_i^2.$$
Let $h : [n] \to \{+1, -1\}$ where, for any $i \in [n]$,
$$h(i) = \begin{cases} +1 & \text{w.p. } 1/2, \\ -1 & \text{w.p. } 1/2, \end{cases}$$
chosen independently. Consider the following algorithm: Start with $Y = 0$. After reading each $x_i$, let $Y = Y + h(x_i)$. Return $Y^2$.

Before analyzing the algorithm, let us study two extreme cases. First assume that $x_1 = x_2 = \dots = x_m$. Then $|Y| = m$ and $Y^2 = m^2$, as desired. Now assume that $x_1, x_2, \dots, x_m$ are mutually distinct; then the distribution of $Y$ is the same as a random walk of length $m$, so by the previous observation $|Y| \approx \sqrt{m}$ and $Y^2 \approx m$, as desired.

Lemma 4.2. $Y^2$ is an unbiased estimator of $F_2$, i.e., $\mathbb{E}[Y^2] = F_2$.

Proof. First, observe that $Y = \sum_i m_i h(i)$. Therefore,
$$\mathbb{E}[Y^2] = \mathbb{E}\Big[\sum_{i,j} m_i m_j h(i) h(j)\Big] = \sum_{i,j} m_i m_j\, \mathbb{E}[h(i)h(j)] = \sum_i m_i^2\, \mathbb{E}[h(i)^2] = \sum_i m_i^2 = F_2,$$
where the second-to-last equality uses that $h(i)$ is independent of $h(j)$ for all $i \neq j$.

Now, all we need to do is to estimate the expectation of $Y^2$ within a $1 \pm \epsilon$ factor. By Chebyshev's inequality, all we need to show is that $Y^2$ has a small variance.
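Before bounding the variance, here is what a single copy of the estimator looks like in Python (our own sketch, not from the notes). For clarity, $h$ is tabulated lazily as a truly random $\pm 1$ assignment, which uses more memory than the 4-wise independent construction the notes ultimately rely on.

```python
import random

def f2_single_estimate(stream):
    """One copy of the estimator: Y = sum_i m_i * h(i); return Y^2 (unbiased for F_2).

    h assigns an independent uniform +-1 to each i in [n], as assumed in Lemma 4.2;
    a 4-wise independent h (previous sketch) would suffice.
    """
    sign = {}                            # lazily tabulated h: [n] -> {+1, -1}
    def h(i):
        if i not in sign:
            sign[i] = random.choice((+1, -1))
        return sign[i]

    Y = 0
    for x in stream:                     # one pass over the stream
        Y += h(x)
    return Y * Y

stream = [1, 3, 1, 2, 1, 3]              # m_1 = 3, m_2 = 1, m_3 = 2, so F_2 = 14
print(f2_single_estimate(stream))        # a single, high-variance estimate of 14
```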

Lemma 4.3. $\mathrm{Var}(Y^2) \le 2\,\mathbb{E}[Y^2]^2$.

Proof. First, we calculate $\mathbb{E}[Y^4]$. The idea is similar to before; we just use the independence of the $h(i)$'s:
$$\mathbb{E}[Y^4] = \mathbb{E}\Big[\sum_{i,j,k,l} m_i m_j m_k m_l\, h(i)h(j)h(k)h(l)\Big] = \sum_{i,j,k,l} m_i m_j m_k m_l\, \mathbb{E}[h(i)h(j)h(k)h(l)] = \sum_i m_i^4\, \mathbb{E}[h(i)^4] + 6\sum_{i<j} m_i^2 m_j^2\, \mathbb{E}[h(i)^2 h(j)^2].$$
To see the last equality, observe that for any 4-tuple $i, j, k, l$, $\mathbb{E}[h(i)h(j)h(k)h(l)]$ is nonzero only if each index shows up an even number of times. In other words, there are only two cases where $\mathbb{E}[h(i)h(j)h(k)h(l)]$ is nonzero: (1) when $i = j = k = l$, and (2) when two of these four indices are equal and the other two are also equal. Since for each $i$, $\mathbb{E}[h(i)^2] = \mathbb{E}[h(i)^4] = 1$, we have
$$\mathbb{E}[Y^4] = \sum_{i=1}^n m_i^4 + 6\sum_{i<j} m_i^2 m_j^2.$$
Now, using Lemma 4.2, we can write
$$\mathrm{Var}(Y^2) = \mathbb{E}[Y^4] - \mathbb{E}[Y^2]^2 = 4\sum_{i<j} m_i^2 m_j^2 \le 2\,\mathbb{E}[Y^2]^2,$$
as desired.

Now, all we need to do is to use independent samples of $Y^2$ to reduce the variance. Suppose we take $k$ independent samples of $Y^2$ using $k$ independently chosen hash functions $h_1, \dots, h_k$, i.e., we run the following algorithm: Start with $Y_1 = Y_2 = \dots = Y_k = 0$. After reading $x_i$, let $Y_j = Y_j + h_j(x_i)$ for all $1 \le j \le k$. Then,
$$\mathrm{Var}\Big(\frac{1}{k}(Y_1^2 + \dots + Y_k^2)\Big) = \frac{1}{k}\,\mathrm{Var}(Y^2).$$
Therefore, by Chebyshev's inequality, we can write
$$\Pr\Big[\Big|\frac{1}{k}\sum_{j=1}^k Y_j^2 - \mathbb{E}[Y^2]\Big| \ge \epsilon\,\mathbb{E}[Y^2]\Big] \le \frac{\mathrm{Var}\big(\frac{1}{k}\sum_{j=1}^k Y_j^2\big)}{\epsilon^2\,\mathbb{E}[Y^2]^2} \le \frac{2\,\mathbb{E}[Y^2]^2}{k\,\epsilon^2\,\mathbb{E}[Y^2]^2} = \frac{2}{\epsilon^2 k}.$$
So, $k = 25/\epsilon^2$ many samples is enough to approximate $F_2$ within a $1 \pm \epsilon$ factor with probability at least $9/10$.

Note that in the above construction we assumed that $h(\cdot)$ assigns independent values to all integers in $[n]$. But it can be seen from the proof that we only used 4-wise independence: the only place that we used independence was to show that $\mathbb{E}[h(i)h(j)h(k)h(l)] = 0$ when $i, j, k, l$ are mutually distinct, and that is of course true even if $h(\cdot)$ is just a 4-wise independent function. Taking that into account, we can run the above algorithm with space $O(\log(n)/\epsilon^2)$. In addition, we can turn the above probabilistic guarantee into a $1 - \delta$ probability using $O\big(\frac{\log(1/\delta)}{\epsilon^2}\big)$ many samples. We refrain from giving the details; for a more detailed discussion we refer to [AMS96].
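Putting the pieces together, the following sketch (ours, not from the notes) runs $k = 25/\epsilon^2$ copies of the estimator with independently chosen hash functions and averages the $Y_j^2$. The $\pm 1$ values are obtained from a 4-wise independent polynomial hash via its parity, an implementation shortcut that is only approximately unbiased (since $p$ is odd); it is used here purely for illustration.

```python
import random

def sample_pm1_hash(k=4, p=2**31 - 1):
    """A +-1 valued hash built from a random degree-(k-1) polynomial mod p.

    Mapping the value to +-1 via its parity is a shortcut for this sketch; it is
    only approximately unbiased because p is odd.
    """
    coeffs = [random.randrange(p) for _ in range(k)]
    def h(x):
        value = 0
        for a in reversed(coeffs):        # Horner's rule
            value = (value * x + a) % p
        return +1 if value % 2 == 0 else -1
    return h

def f2_estimate(stream, k):
    """Average of k independent copies of Y^2; the variance drops by a factor of k."""
    hs = [sample_pm1_hash() for _ in range(k)]
    Y = [0] * k
    for x in stream:                      # single pass: update all k counters
        for j in range(k):
            Y[j] += hs[j](x)
    return sum(y * y for y in Y) / k

stream = [1, 3, 1, 2, 1, 3] * 100         # m_1 = 300, m_2 = 100, m_3 = 200, F_2 = 140000
eps = 0.1
k = int(25 / eps**2)                      # k = 25/eps^2 copies, as in the analysis above
print(f2_estimate(stream, k))             # with prob. >= 9/10, within (1 +- eps) of 140000
```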

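The standard way to obtain the $1 - \delta$ guarantee mentioned above is to take the median of $O(\log(1/\delta))$ independent averaged estimates (median of means). The sketch below assumes the function f2_estimate from the previous block is in scope, uses an illustrative constant in the choice of the number of copies, and re-reads the stream once per copy for simplicity; a true one-pass implementation would update all counters simultaneously.

```python
import math

def f2_estimate_boosted(stream, eps, delta):
    """Median of t = O(log(1/delta)) independent averaged estimates.

    Each call to f2_estimate fails with probability at most 1/10, so by a
    Chernoff bound the median of t copies fails with probability at most delta
    for t = c*log(1/delta) with a suitable constant (c = 8 is a safe choice
    for this sketch).
    """
    stream = list(stream)                     # re-read the stream once per copy
    k = int(25 / eps**2)
    t = max(1, math.ceil(8 * math.log(1 / delta)))
    estimates = sorted(f2_estimate(stream, k) for _ in range(t))
    return estimates[t // 2]                  # the median estimate

# Example: a (1 +- 0.2)-approximation of F_2 = 1400, correct with prob. >= 0.95.
print(f2_estimate_boosted([1, 3, 1, 2, 1, 3] * 10, eps=0.2, delta=0.05))
```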
References

[AMS96] N. Alon, Y. Matias, and M. Szegedy. "The space complexity of approximating the frequency moments." In: STOC. ACM, 1996, pp. 20-29.