18.1 Introduction and Recap


CS787: Advanced Algorithms
Scribe: Priyananda Shenoy and Shijin Kong
Lecturer: Shuchi Chawla
Topic: Streaming Algorithms (continued)
Date: 10/26/2007

We continue talking about streaming algorithms in this lecture, including algorithms for estimating the number of distinct elements in a stream and for computing the second moment of the element frequencies.

18.1 Introduction and Recap

Let us review the fundamental framework of streaming. Suppose we have a stream coming in from time 0 to time $T$. At each time $t \in [T]$, the incoming element is $a_t \in [n]$. The frequency of any element $i$ is defined as $m_i = |\{j : a_j = i\}|$.

We will see a few problems to solve under this framework. For different problems, we expect a complexity of only $O(\log n)$ and $O(\log T)$ in either storage or update time. We should also make only a few passes over the stream under these strong constraints.

The first problem we will solve in this lecture is counting the number of distinct elements. A simple way to do this is to perform random sampling on all elements, maintain statistics over the sample, and then extrapolate the real number of distinct elements from those statistics. For example, if we pick $k$ of the $n$ elements uniformly at random and count the number of distinct elements in the sample, we can roughly estimate the number of distinct elements overall by multiplying the result by $n/k$. However, in many cases where elements are unevenly distributed (e.g. some elements dominate), random sampling with a small $k$ gives an estimate much lower than the actual number of distinct elements. In order to be accurate we need $k = \Omega(n)$.

The second problem, computing the second moment of the element frequencies, depends on an accurate estimate of each $m_i$. Again we can apply random sampling in a similar way and estimate the frequency of each element. But again we run into the inaccuracy of random sampling: some elements may not be sampled at all.

To resolve these problems, we introduce two different algorithms that perform better than the algorithms based on random sampling. Before that, let's define what an unbiased estimator is:

Definition 18.1.1 An unbiased estimator for a quantity $Q$ is a random variable $X$ such that $E[X] = Q$.

18.2 Number of Distinct Elements

Let $c$ denote the number of distinct elements. We are going to prove that we can probabilistically distinguish between $c < k$ and $c \ge 2k$ using a single bit of memory. Here $k$ is the range size of a family of hash functions $H$ where, $\forall h \in H$, $h : [n] \to [k]$.

Algorithm 1:

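Suppose the bit we have is $b$, initially set to $b = 0$. If $h(a_t) = 0$ for some $t \in [T]$, then set $b = 1$.

Let us compute some event probabilities (the first lower bound needs $k \ge 2$):

$$\Pr[b = 0] = \left(1 - \frac{1}{k}\right)^{c}$$

$$\Pr[b = 0 \mid c < k] = \left(1 - \frac{1}{k}\right)^{c} \ge \left(1 - \frac{1}{k}\right)^{k} \ge \frac{1}{4}$$

$$\Pr[b = 0 \mid c \ge 2k] = \left(1 - \frac{1}{k}\right)^{c} \le \left(1 - \frac{1}{k}\right)^{2k} \le \frac{1}{e^2}$$

Now we have a separation between the cases $c < k$ and $c \ge 2k$. In the next step, we use multiple bits to boost this separation.

Algorithm 2:
Maintain $x$ bits $b_1, \ldots, b_x$ and run Algorithm 1 independently over each bit. Next we pick a value between $1/4$ and $1/e^2$, say $1/6$. If $|\{j : b_j = 0\}| > x/6$, output $c < k$, else output $c \ge 2k$.

Claim 18.2.1 The error probability of Algorithm 2 is at most $\delta$ if $x = O(\log \frac{1}{\delta})$.

Proof: Suppose $c < k$. Then the expected number of bits that are 0 among all $x$ bits is at least $x/4$. By Chernoff's bound:

$$\Pr\left[\text{number of bits that are 0} < \frac{x}{6}\right] = \Pr\left[\text{number of bits that are 0} < \left(1 - \frac{1}{3}\right)\frac{x}{4}\right] \le e^{-\frac{1}{2}\left(\frac{1}{3}\right)^2 \frac{x}{4}} = e^{-x/72}$$

Using $x = O(\log \frac{1}{\delta})$ gives us the right answer with probability $1 - \delta$. Similarly, if $c \ge 2k$, we can show a similar bound on $\Pr\left[\text{number of bits that are 0} \ge \frac{x}{6}\right]$.

To estimate $c$ itself we repeat the test $\log n$ times (for instance, once per power-of-two threshold $k$), and set $\delta' = \frac{\delta}{\log n}$ for each run. Then, by the union bound, the probability that any run fails is at most $\log n \cdot \frac{\delta}{\log n} = \delta$. To get a $(1 + \epsilon)$ approximation, we would need $O\left(\frac{\log n}{\epsilon^2}\left(\log\log n + \log\frac{1}{\delta}\right)\right)$ bits.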
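The threshold test of Algorithms 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the notes' own implementation: the helper names `make_random_hash` and `distinct_threshold_test` are hypothetical, $[k]$ is taken to be $\{0, \ldots, k-1\}$, and each hash function is simulated by a lazily filled lookup table, which uses far more memory than a real streaming algorithm would allow.

```python
import random

def make_random_hash(k, rng):
    """Simulate a random function [n] -> {0, ..., k-1}.

    Assumption: values are drawn lazily and memoized in a table. This is an
    illustrative stand-in for a compact hash family, not a space-efficient one.
    """
    table = {}

    def h(a):
        if a not in table:
            table[a] = rng.randrange(k)
        return table[a]

    return h

def distinct_threshold_test(stream, k, x, seed=0):
    """Algorithm 2: run Algorithm 1 on x independent bits and compare to x/6."""
    rng = random.Random(seed)
    hashes = [make_random_hash(k, rng) for _ in range(x)]
    bits = [0] * x
    for a in stream:                      # single pass over the stream
        for j, h in enumerate(hashes):
            if h(a) == 0:                 # Algorithm 1: set the bit on a zero hash
                bits[j] = 1
    zeros = bits.count(0)
    return "c < k" if zeros > x / 6 else "c >= 2k"

# One distinct element versus 1000 distinct elements, with threshold k = 64.
print(distinct_threshold_test([7] * 50, k=64, x=120))
print(distinct_threshold_test(range(1000), k=64, x=120))
```

With one distinct element almost every bit stays 0, and with 1000 distinct elements almost every bit is set, so the two calls fall on opposite sides of the $x/6$ cutoff.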

18.3 Computing Second Moment

Recall that the $k$th moment of the stream is defined as $\mu_k = \sum_{i \in [n]} m_i^k$. We will now discuss an algorithm to find $\mu_2$. This measure is used in many applications as an estimate of how much the frequency varies.

Algorithm 3:
Step 1: Pick a random variable $Y_i$ u.a.r. from $\{-1, 1\}$ for each $i \in [n]$
Step 2: Let the random variable $Z$ be defined as $Z = \sum_t Y_{a_t}$
Step 3: Define the random variable $X$ as $X = Z^2$
Step 4: output $X$

Claim 18.3.1 $X$ is an unbiased estimator for $\mu_2$; i.e. the expected value of $X$ equals $\mu_2$.

Proof: $Z$ can be rewritten as

$$Z = \sum_{i \in [n]} m_i Y_i$$

$$X = Z^2 = \sum_i m_i^2 Y_i^2 + 2 \sum_{i < j} m_i m_j Y_i Y_j$$

$$E[X] = \sum_i m_i^2 E\left[Y_i^2\right] + 2 \sum_{i < j} m_i m_j E[Y_i Y_j]$$

Since $Y_i \in \{-1, 1\}$, $Y_i^2 = 1$. Also, the second term evaluates to 0, since for $i \ne j$ the product $Y_i Y_j$ evaluates to $-1$ or $+1$ with equal probability. Therefore

$$E[X] = \sum_i m_i^2 = \mu_2$$

To get an accurate value, the above algorithm needs to be repeated. The following algorithm specifies how Algorithm 3 can be repeated to obtain accurate results.

Algorithm 4:
Step 1: FOR $r = 1$ to $k$
    Execute Algorithm 3. Let $X_r$ be the output
ENDFOR
Step 2: Calculate the mean $\bar{X} = \frac{X_1 + X_2 + \ldots + X_k}{k}$
Step 3: output $\bar{X}$
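Algorithms 3 and 4 can be sketched in Python as below. This is only an illustration under assumptions: the function names are hypothetical, the signs $Y_i$ are stored explicitly in a dictionary (so the space is not yet small), and the stream is replayed once per run for simplicity, whereas a real streaming implementation would update all $k$ counters in a single pass.

```python
import random
from collections import Counter

def second_moment_single(stream, rng):
    """Algorithm 3: one run, returning X = Z^2 with Z = sum_t Y_{a_t}."""
    signs = {}   # explicit Y_i, sampled lazily; O(n) space in the worst case
    z = 0
    for a in stream:
        if a not in signs:
            signs[a] = rng.choice((-1, 1))
        z += signs[a]
    return z * z

def second_moment_mean(stream, k, seed=0):
    """Algorithm 4: average the outputs of k independent runs of Algorithm 3."""
    rng = random.Random(seed)
    items = list(stream)   # replayed k times here; an assumption for brevity
    return sum(second_moment_single(items, rng) for _ in range(k)) / k

def exact_second_moment(stream):
    """Reference value mu_2 = sum_i m_i^2, computed with full frequency counts."""
    return sum(m * m for m in Counter(stream).values())

stream = [1, 2, 2, 3, 3, 3]
print(exact_second_moment(stream))          # 1 + 4 + 9 = 14
print(second_moment_mean(stream, k=2000))   # close to 14 on average
```

A stream containing a single repeated element is a useful corner case: $Z = \pm T$ whatever the sign, so every run returns exactly $T^2 = \mu_2$.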

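The mean $\bar{X}$ is taken as the value of $\mu_2$. We will see what the value of $k$ needs to be to get an accurate answer with high probability. To do this, we will apply Chebyshev's bound. Consider the expected value of $X^2$:

$$E\left[X^2\right] = E\Big[\sum_i m_i^4 Y_i^4 + 4 \sum_{i \ne j} m_i^3 m_j Y_i^3 Y_j + 6 \sum_{i < j} m_i^2 m_j^2 Y_i^2 Y_j^2 + 12 \sum_{\substack{p < q \\ i \ne p,\, i \ne q}} m_i^2 m_p m_q Y_i^2 Y_p Y_q + 24 \sum_{i < j < p < q} m_i m_j m_p m_q Y_i Y_j Y_p Y_q\Big]$$

Assuming that the variables $Y_i$ are 4-wise independent, every term in which some $Y$ appears to an odd power has expectation 0, and we can simplify this to

$$E\left[X^2\right] = \sum_i m_i^4 + 6 \sum_{i < j} m_i^2 m_j^2$$

The variance of $X$ (as defined in Algorithm 3) is given by

$$\mathrm{var}[X] = E\left[X^2\right] - (E[X])^2 = \Big(\sum_i m_i^4 + 6 \sum_{i < j} m_i^2 m_j^2\Big) - \Big(\sum_i m_i^4 + 2 \sum_{i < j} m_i^2 m_j^2\Big) = 4 \sum_{i < j} m_i^2 m_j^2 \le 2 \Big(\sum_i m_i^2\Big)^2 = 2\mu_2^2$$

$$\mathrm{var}[\bar{X}] = \frac{\mathrm{var}[X]}{k} \le \frac{2\mu_2^2}{k}$$

By Chebyshev's inequality:

$$\Pr\left[|\bar{X} - \mu_2| \ge \epsilon\mu_2\right] \le \frac{\mathrm{var}[\bar{X}]}{\epsilon^2\mu_2^2} \le \frac{2\mu_2^2}{k\,\epsilon^2\mu_2^2} = \frac{2}{k\epsilon^2}$$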
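As a sanity check on the calculation above, the expectation and variance of $X$ can be computed exactly for a tiny frequency vector by enumerating every sign assignment. The sketch below assumes fully independent uniform signs (which are in particular 4-wise independent); the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

def exact_mean_and_variance(freqs):
    """Enumerate all Y in {-1,+1}^n and return (E[X], var[X]) for X = (sum_i m_i Y_i)^2."""
    xs = [sum(m * y for m, y in zip(freqs, signs)) ** 2
          for signs in product((-1, 1), repeat=len(freqs))]
    mean = Fraction(sum(xs), len(xs))
    variance = Fraction(sum(x * x for x in xs), len(xs)) - mean ** 2
    return mean, variance

def chebyshev_failure_bound(k, eps):
    """Upper bound 2 / (k * eps^2) on Pr[|mean of k runs - mu_2| >= eps * mu_2]."""
    return 2 / (k * eps ** 2)

freqs = [1, 2, 3]
mu2 = sum(m * m for m in freqs)
mean, variance = exact_mean_and_variance(freqs)
cross = 4 * sum(a * a * b * b for a, b in combinations(freqs, 2))
print(mean, variance, cross, 2 * mu2 ** 2)   # 14 196 196 392
print(chebyshev_failure_bound(k=200, eps=0.5))
```

For frequencies $(1, 2, 3)$ the enumeration gives $E[X] = 14 = \mu_2$ and $\mathrm{var}[X] = 196 = 4\sum_{i<j} m_i^2 m_j^2$, which is within the bound $2\mu_2^2 = 392$.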

Hence, to compute $\mu_2$ within a factor of $(1 \pm \epsilon)$ with probability $(1 - \delta)$, we need to run the algorithm $\frac{2}{\delta\epsilon^2}$ times.

Space requirements

Let's analyze the space requirements of the given algorithm. In each run of the algorithm, we need $O(\log T)$ space to maintain $Z$. If we explicitly store $Y$, we would need $O(n)$ bits, which is too expensive. We can improve upon this by constructing a hash function that generates the values $Y_i$ on the fly. For the above analysis to hold, the hash function should ensure that any group of up to four $Y_i$'s is independent (i.e. the hash function belongs to a 4-way independent hash family). We skip the details of how to construct such a hash family, but this can be done using only $O(\log n)$ bits per hash function.

Improving the accuracy: Median of means method

In Algorithm 4, we use the mean of many trials to compute the required value. This has the disadvantage that a few inaccurate trials can adversely affect the solution, so we need a large number of samples, linear in $1/\delta$, to get reasonable accuracy. Instead of using the mean, the following procedure can be used to get better results. The idea is to take the median of the means of subsamples. This works better because the median is less sensitive to outliers in a sample than the mean is.

Group adjacent $X_i$'s into $k$ groups of size $\frac{8}{\epsilon^2}$ each. For each group calculate the mean. Then the estimate of $\mu_2$ is obtained by taking the median of the $k$ means. The total number of samples of $X$ we use is $k \cdot \frac{8}{\epsilon^2}$.

[Fig 1: Median of means method. The samples $X_i$ are split into groups, a mean is taken within each group, and the median of these means is output.]

To see how this improves the accuracy, consider a particular group as shown in the figure. Let $\bar{X}_i$ be the mean of the group.

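[Fig 2: Probability of the median falling inside the range. The group mean $\bar{X}_i$ lies between $\mu_2(1 - \epsilon)$ and $\mu_2(1 + \epsilon)$ with probability greater than 3/4.]

Using Chebyshev's inequality,

$$\Pr\left[|\bar{X}_i - \mu_2| > \epsilon\mu_2\right] \le \frac{\mathrm{var}[\bar{X}_i]}{\epsilon^2\mu_2^2} \le \frac{2\mu_2^2}{\frac{8}{\epsilon^2}\,\epsilon^2\mu_2^2} = \frac{1}{4}$$

This means that the probability of a group mean being outside the interval $[(1 - \epsilon)\mu_2, (1 + \epsilon)\mu_2]$ is at most $\frac{1}{4}$. Using Chernoff's bound,

$$\Pr[\text{median is outside the interval}] \le \Pr[\text{at least half of the group means are outside the interval}] \le e^{-\Omega(k)}$$

(For instance, since the expected number of group means outside the interval is at most $k/4$, the multiplicative Chernoff bound gives $e^{-k/12}$.) If the required failure probability is $\delta$, we need to pick $k$ so that this value is at most $\delta$, i.e. $k = O(\log \frac{1}{\delta})$. So the number of trials required is $k \cdot \frac{8}{\epsilon^2} = O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$, and the total number of bits used is $O\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\,(\log T + \log n)\right)$.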
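The median of means procedure, combined with signs generated on the fly, might look like the following Python sketch. Several pieces are assumptions not taken from the notes: the class `SignHash` uses a random degree-3 polynomial over a prime field, which is one standard way to obtain a 4-wise independent family (the notes skip the construction); the prime `P` is assumed to exceed the universe size $n$; and mapping the polynomial value to $\pm 1$ by parity introduces a bias of order $1/P$ that is ignored here.

```python
import math
import random
from statistics import median

P = (1 << 61) - 1   # Mersenne prime; assumed larger than every element id

class SignHash:
    """Hypothetical 4-wise independent sign generator Y_i in {-1, +1}."""

    def __init__(self, rng):
        # Four random coefficients define a degree-3 polynomial over Z_P.
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def __call__(self, i):
        acc = 0
        for c in self.coeffs:          # Horner's rule
            acc = (acc * i + c) % P
        return 1 if acc & 1 else -1    # parity -> sign (tiny bias, ignored)

def second_moment_median_of_means(stream, eps, k, seed=0):
    """Median of k group means, each over ceil(8 / eps^2) runs of Algorithm 3."""
    rng = random.Random(seed)
    group_size = math.ceil(8 / eps ** 2)
    hashes = [SignHash(rng) for _ in range(k * group_size)]
    z = [0] * len(hashes)
    for a in stream:                   # single pass; one counter Z per run
        for r, y in enumerate(hashes):
            z[r] += y(a)
    xs = [v * v for v in z]
    means = [sum(xs[g * group_size:(g + 1) * group_size]) / group_size
             for g in range(k)]
    return median(means)

print(second_moment_median_of_means([1, 2, 2, 3, 3, 3], eps=0.5, k=5))
```

Each run stores only four coefficients and one counter, which matches the $O(\log T + \log n)$ bits per run in the analysis above. On the example stream the true value is $\mu_2 = 14$, so the printed estimate is expected to fall in $[7, 21]$ with high probability.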
