U.C. Berkeley CS170: Algorithms, Handout LN-11-22
Christos Papadimitriou & Luca Trevisan, November 22, 2016

Streaming algorithms

In this lecture and the next one we study memory-efficient algorithms that process a stream (i.e. a sequence) of data items and are able, in real time, to compute useful features of the data. Streaming algorithms are useful in many problem domains, including graph algorithms, but we will restrict ourselves to the following important family of problems: we get a sequence of data items (for example sales records, or web hits), and each data item has a data field of interest, which we will call a label (for example, the product id of the item that was sold, or the IP address that we get a hit from), and we want to compute various statistics about the sequence of labels.

Abstracting away all the data except the labels, we can think of our input as being a stream x_1, x_2, ..., x_n where each x_i ∈ Σ is a label, and Σ is the set of all possible labels (e.g. all product identifiers, or all IP addresses). For a given stream x_1, ..., x_n, and for a label a ∈ Σ, we will call f_a the frequency of a in the stream, that is, the number of times a appears in the stream. We will be interested in the following three problems:

- Finding heavy hitters, that is, labels for which f_a is large (e.g. find best-selling products, or IP addresses that visit our site most often)

- Finding the number of distinct labels in the stream (e.g. finding how many different products we have sold, or how many unique visitors we have)

- Finding the sum-of-squares quantity \sum_{a ∈ Σ} f_a^2, which is usually denoted F_2, and also called the second frequency moment of the stream

It should be self-evident that the first two problems are important. To understand the quantity F_2, consider two extreme cases that we can have in a stream of n items. If all the n labels in the stream are different, then we have \sum_{a ∈ Σ} f_a^2 = n, because we have n frequencies that are 1 and all the others are zero. If all the n labels are identical, then \sum_{a ∈ Σ} f_a^2 = n^2.

In all other cases, the value of F_2 for a stream of n items will always be between n and n^2; smaller values correspond to streams with several distinct labels, each occurring not too often, and larger values correspond to streams with few distinct labels occurring often. So F_2 is a one-parameter measure of how diversified the stream is.

All three problems have simple solutions in which we store the entire stream. We can store a list of all the distinct labels we have seen, and for each of them also store the number of occurrences. That is, the data structure would contain a pair (a, f_a) for every label a that appears at least once in the stream. Such a data structure can be maintained using O(k · (log n + log |Σ|)) ≤ O(n · (log n + log |Σ|)) bits of memory, where k is the number of distinct labels in the stream. Alternatively, one could have an array of size |Σ| storing f_a for every label a, using O(|Σ| · log n) bits of memory. If the labels are IPv6 addresses, then |Σ| = 2^128, so the second solution is definitely not feasible; in a setup in which a stream might contain hundreds of millions of items or more, even a data structure of the first type is rather large.

We will give a solution to the heavy hitters problem that uses O((log n)^2) bits of memory (in realistic implementations, the data structure requires only an array of size about 300-600, containing 32-bit integers) and solutions to the other problems that use O(log n) bits of memory (in realistic settings, the data structures are arrays of size ranging from a few dozen to a few thousand 32-bit integers). These solutions will be randomized and will only provide approximate answers. Both features are necessary: it is possible to prove that randomized exact algorithms and deterministic approximate algorithms must use an amount of memory of the same order as the trivial solution of storing all the values f_a. We will not prove these impossibility results, but next week we will prove that any deterministic and exact algorithm for each of the above three problems must use Ω(min{|Σ|, n}) bits of memory.

Today we see algorithms for the heavy hitters and distinct elements problems. Next week we will see the sum-of-squares algorithm.

1  Probability Review

Since we are going to do a probabilistic analysis of randomized algorithms, let us review the handful of notions from discrete probability that we are going to use.

Suppose that A and B are two events, that is, things that may or may not happen. Then we have the union bound

    Pr[A ∪ B] ≤ Pr[A] + Pr[B]

If A and B are independent, then we also have

    Pr[A ∩ B] = Pr[A] · Pr[B]
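As a quick worked illustration of these two facts (an added example, not in the original handout): roll two independent dice, let A be the event that the first die shows a 6, and let B be the event that the second die shows a 6. Then

    Pr[A ∪ B] ≤ Pr[A] + Pr[B] = 1/6 + 1/6 = 1/3    (the exact value is 11/36)

    Pr[A ∩ B] = Pr[A] · Pr[B] = 1/6 · 1/6 = 1/36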

Informally, a random variable is an outcome of a probabilistic experiment, which has various possible values with various probabilities. (Formally, it is a function from the sample space to the reals, though the formal definition does not help intuition very much.) The expectation of a random variable X is

    E X = \sum_v v · Pr[X = v]

where v ranges over all possible values that the random variable can take. We are going to make use of the following three useful facts about random variables:

- Linearity of expectation: if X and Y are two random variables, then E[X + Y] = E X + E Y, and if v is a number, then E[vX] = v · E X.

- Product formula for independent random variables: if X and Y are independent, then E[XY] = (E X) · (E Y).

- Markov's inequality: if X is a random variable that is always ≥ 0, then, for every t > 0, we have Pr[X ≥ t] ≤ (E X) / t.

Linearity of expectation and the product rule greatly simplify the computation of the expectation of random variables that come up in applications (which are often sums or products of simpler random variables). Markov's inequality is useful because, once we compute the expectation of a random variable, we are able to make high-probability statements about it. For example, if we know that X is a non-negative random variable of expectation 100, we can say that there is at least a 90% probability that X ≤ 1,000.

Finally, the variance of a random variable X is

    Var X := E (X − E X)^2

and the standard deviation of X is √(Var X). The significance of this definition is that, if the variance is small, we can prove that X has, with high probability, values close to its expectation. That is, for every t > 0,

    Pr[|X − E X| ≥ t] = Pr[(X − E X)^2 ≥ t^2] ≤ (Var X) / t^2

and, after the change of variable t = c · √(Var X),

    Pr[|X − E X| ≥ c · √(Var X)] ≤ 1 / c^2

so, for example, there is at least a 99% probability that X is within 10 standard deviations of its expectation. The above inequality is called Chebyshev's inequality. (There are a dozen alternative spellings of Chebyshev's name; you may have encountered this inequality before, spelled differently.)
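To see how much knowing the variance helps (a worked example added here, not in the original handout), suppose X is a non-negative random variable with E X = 100 and Var X = 100, so that its standard deviation is 10. Markov's inequality only gives Pr[X ≥ 200] ≤ 100/200 = 1/2, while Chebyshev's inequality gives

    Pr[X ≥ 200] ≤ Pr[|X − E X| ≥ 100] ≤ (Var X) / 100^2 = 100 / 10,000 = 1%

so the variance sharpens the bound considerably.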

If one knows more about X, then it is possible to say a lot more about the probability that X deviates from its expectation. In several interesting cases, for example, the probability of deviating by c standard deviations is exponentially, rather than polynomially, small in c^2. The advantage of Chebyshev's inequality is that one does not need to know anything about X other than its expectation and its variance.

Two final things about variance: if X_1, ..., X_n are pairwise independent random variables, then

    Var[X_1 + ... + X_n] = Var X_1 + ... + Var X_n

and if X is a random variable whose only possible values are 0 and 1, then

    E X = Pr[X = 1]   and   Var X = E X^2 − (E X)^2 ≤ E X^2 = E X = Pr[X = 1]

2  The Heavy Hitters Problem

From Problem 5 in HW1, you may remember the problem of finding a label a such that f_a > n/2 in a stream of length n. The solution is an algorithm that is deterministic and uses only two variables (and log |Σ| + log n bits of memory), but its analysis relies on the fact that we are interested in finding a label occurring a majority of the times.

Suppose, instead, that we are interested in finding all labels, if any, that occur at least .3n times. Next week, we will show that this requires Ω(min{|Σ|, n}) bits of memory. The data structure that we present in this section, however, is able to do the following using O((log n)^2) bits of memory: come up with a list that includes all labels that occur at least .3n times and which includes, with high probability, none of the labels that occur less than .2n times. Of course .2 and .3 could be replaced by any other two constants. Even more impressively, every time the algorithm sees an item in the stream, it is able to approximate the number of times it has seen that element so far, up to an additive error of .1n, and, again, .1 could be replaced by any other positive constant. The algorithm is called Count-Min, or Count-Min Sketch, and it has been implemented in a number of libraries.

The algorithm relies on two parameters ℓ and B. We choose ℓ = 2 log n and B = 20. In general, if we want to be able to approximate frequencies up to an additive error of εn, we will choose B = 2/ε.

The following version of the algorithm reports, for each label of the stream that it processes, an approximation of the number of times it has seen that label so far:

    Initialize an ℓ × B array M to all zeroes
    Pick ℓ random functions h_1, ..., h_ℓ, where h_i : Σ → {1, ..., B}
    while not end-of-stream:
        read a label x from the stream
        for i = 1 to ℓ:
            M[i, h_i(x)]++
        print "estimated number of times", x, "occurred so far is", min_{i=1,...,ℓ} M[i, h_i(x)]

And the following version of the algorithm constructs a list that includes all the labels that occur ≥ .3n times in the stream and, with high probability, includes none of the labels that occur < .3n − 2n/B times in the stream. (If B = 20, then .3n − 2n/B = .2n.)

    Initialize an ℓ × B array M to all zeroes
    Initialize L to an empty list
    Pick ℓ random functions h_1, ..., h_ℓ, where h_i : Σ → {1, ..., B}
    while not end-of-stream:
        read a label x from the stream
        for i = 1 to ℓ:
            M[i, h_i(x)]++
        if min_{i=1,...,ℓ} M[i, h_i(x)] > .3n:
            add x to L, if not already present
    return L
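To make the data structure concrete, here is a minimal Python sketch of Count-Min (an added illustration, not from the original handout). The truly random functions h_i assumed above are approximated by hashing each label together with a per-row random salt; the class name, the constructor parameter n_estimate, and the use of SHA-256 are assumptions of this sketch, not part of the algorithm as stated.

    import hashlib
    import math
    import random

    class CountMin:
        def __init__(self, n_estimate, B=20):
            # ell = 2 log n rows of B counters, as in the notes
            self.B = B
            self.ell = 2 * max(1, math.ceil(math.log2(max(2, n_estimate))))
            # one random salt per row plays the role of the random function h_i
            self.salts = [random.getrandbits(64) for _ in range(self.ell)]
            self.M = [[0] * B for _ in range(self.ell)]

        def _h(self, i, x):
            # salted hash of x into {0, ..., B-1}, standing in for h_i(x)
            digest = hashlib.sha256(f"{self.salts[i]}|{x}".encode()).digest()
            return int.from_bytes(digest[:8], "big") % self.B

        def update(self, x):
            # process one stream item: increment one counter in every row
            for i in range(self.ell):
                self.M[i][self._h(i, x)] += 1

        def estimate(self, x):
            # never below f_x; with high probability at most f_x + 2n/B
            return min(self.M[i][self._h(i, x)] for i in range(self.ell))

Feeding a stream through update() and, on each arrival of x, checking whether estimate(x) > .3n reproduces the list-building version above.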

Let us analyze the above algorithm. The first observation is that, when we see a label a, we increase all the ℓ values M[1, h_1(a)], M[2, h_2(a)], ..., M[ℓ, h_ℓ(a)], and so, at the end of the stream, having done the above f_a times, it follows that all the above values are at least f_a, and so

    f_a ≤ min_{i=1,...,ℓ} M[i, h_i(a)]

In particular, this means that if a is a label such that f_a ≥ .3n then, by the last time we see an occurrence of a, it must be that min_{i=1,...,ℓ} M[i, h_i(a)] > .3n, and so, in the second algorithm, we see that a is certainly added to the list. The problem is that min_{i=1,...,ℓ} M[i, h_i(a)] could be much bigger than f_a, and so the first algorithm could deliver a poor approximation and the second algorithm could add some light hitters to the list. We need to prove that this failure mode has a low probability of happening.

First of all, we see that, for a fixed choice of the functions h_i,

    M[i, h_i(a)] = f_a + \sum_{b ≠ a : h_i(b) = h_i(a)} f_b

Now let us compute the expectation of M[i, h_i(a)] over the random choice of the function h_i. We see that it is

    E M[i, h_i(a)] = f_a + \sum_{b ≠ a} Pr[h_i(a) = h_i(b)] · f_b = f_a + (1/B) · \sum_{b ≠ a} f_b ≤ f_a + n/B

where we use the fact that, for a random function h_i : Σ → {1, ..., B}, the probability that h_i(a) = h_i(b) is exactly 1/B, and the fact that \sum_{b ≠ a} f_b = n − f_a ≤ n.

So far, we have proved that the content of M[i, h_i(a)] is always at least f_a and, in expectation, is at most f_a + n/B. Let us now translate this statement about expectations into a statement about probabilities. Applying Markov's inequality to the (non-negative!) random variable M[i, h_i(a)] − f_a, we have

    Pr[M[i, h_i(a)] > f_a + 2n/B] ≤ 1/2

and, using the independence of the h_i,

    Pr[min_{i=1,...,ℓ} M[i, h_i(a)] > f_a + 2n/B]
        = Pr[(M[1, h_1(a)] > f_a + 2n/B) ∧ ... ∧ (M[ℓ, h_ℓ(a)] > f_a + 2n/B)]
        = Pr[M[1, h_1(a)] > f_a + 2n/B] · ... · Pr[M[ℓ, h_ℓ(a)] > f_a + 2n/B]
        ≤ 1/2^ℓ

So, if we choose B = 20 and ℓ = 2 log n, we have that the estimate min_i M[i, h_i(a)] is between f_a and f_a + .1n, except with probability at most 1/n^2. In particular, there is a probability at least 1 − 1/n that we get a good estimate n times in a row.

3  Counting Distinct Elements

Here the algorithm is a lot simpler. We pick a random function h : Σ → [0, 1] (we will see later how to appropriately discretize this choice so that the function can be stored in memory), and, for every item x in the stream, we evaluate h(x), keeping track of the smallest hash value obtained so far. If min is the minimum hash value seen at the end of the stream, then 1/min is our estimate of the number of distinct elements in the stream.

The intuition for the algorithm is the following: suppose that the stream has k distinct labels. Then, when we evaluate h at every element of the stream, we are evaluating h at k distinct inputs, although some inputs may be repeated several times. If h is a random function, we are getting k random values in the interval [0, 1]. Intuitively, we would expect k random numbers selected in the interval [0, 1] to be evenly distributed, and so the minimum of these numbers should be about 1/k. Thus, the inverse of the minimum should be around k.
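Here is a minimal Python sketch of this estimator (again an added illustration); the random function h is emulated with a fixed salted SHA-256 hash, which is an assumption of the sketch rather than part of the algorithm.

    import hashlib

    def h(x, salt="demo-salt"):
        # stand-in for a random function h : Sigma -> [0, 1]
        digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def estimate_distinct(stream):
        # one pass, constant extra memory: track the smallest hash value seen so far
        smallest = min(h(x) for x in stream)
        return 1.0 / smallest

Since h is deterministic, repeated labels map to the same value, so they cannot pull the minimum any lower than their first occurrence already did.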

Here is a simple calculation showing that there is at least a 60% probability that the algorithm achieves a constant-factor approximation. Suppose that the stream has k distinct labels, and call r_1, ..., r_k the random numbers in [0, 1] corresponding to the evaluation of h at the k distinct labels of the stream. Then

    Pr[Algorithm's output ≤ k/2]
        = Pr[min{r_1, ..., r_k} ≥ 2/k]
        = Pr[r_1 ≥ 2/k ∧ ... ∧ r_k ≥ 2/k]
        = Pr[r_1 ≥ 2/k] · ... · Pr[r_k ≥ 2/k]
        = (1 − 2/k)^k ≤ e^{−2} ≈ .14

and

    Pr[Algorithm's output ≥ 4k]
        = Pr[min{r_1, ..., r_k} ≤ 1/(4k)]
        = Pr[r_1 ≤ 1/(4k) ∨ ... ∨ r_k ≤ 1/(4k)]
        ≤ \sum_{i=1}^{k} Pr[r_i ≤ 1/(4k)]
        = k · 1/(4k) = 1/4

so that

    Pr[k/2 ≤ Algorithm's output ≤ 4k] ≥ .61

There are various ways to improve the quality of the approximation and to increase the probability of success. One of the simplest ways is to keep track not of the smallest hash value encountered so far, but of the t distinct labels with the t smallest hash values. This is a set of labels and values that can be kept in a data structure of size O(t · log |Σ|), and processing an element from the stream takes time O(log t) if the labels are put in a priority queue. Then, if s_t is the t-th smallest hash value in the stream, our estimate for the number of distinct values is t/s_t.

The intuition is that, as before, if we have k distinct labels, their hashed values will be k random points in the interval [0, 1], which we would expect to be roughly uniformly spaced, so that the t-th smallest would have a value of about t/k. Thus its inverse, multiplied by t, should be an estimate of k.
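Here is a minimal Python sketch of this refinement (an added illustration, with the same salted-hash stand-in for h; the default t = 3,000 matches the worked numbers in the next subsection).

    import hashlib
    import heapq

    def h(x, salt="demo-salt"):
        # same stand-in for a random function h : Sigma -> [0, 1] as before
        digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def estimate_distinct_bottom_t(stream, t=3000):
        # keep the t distinct labels with the t smallest hash values
        heap, kept = [], {}                # heap holds (-hash, label); kept maps label -> hash
        for x in stream:
            if x in kept:
                continue                   # each distinct label contributes one hash value
            v = h(x)
            if len(kept) < t:
                kept[x] = v
                heapq.heappush(heap, (-v, x))
            elif v < -heap[0][0]:          # smaller than the largest kept hash: swap it in
                _, evicted = heapq.heappushpop(heap, (-v, x))
                del kept[evicted]
                kept[x] = v
        if len(kept) < t:
            return float(len(kept))        # fewer than t distinct labels: we saw them all
        return t / (-heap[0][0])           # t divided by the t-th smallest hash value

The dictionary kept never holds more than t labels, matching the O(t · log |Σ|) space bound mentioned above.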

Why would it help to work with the t-th smallest hash instead of the smallest? The intuition is that it takes only one outlier to skew the minimum, but one needs to have t outliers to skew the t-th smallest, and the latter is a more unlikely event.

3.1  * Rigorous Analysis of the t-th Smallest Algorithm

Let us sketch a more rigorous analysis. These calculations are a bit more complicated than the rest of the content of this lecture, so it is ok to skip them.

We will see that we can get an estimate that is, with high probability, between k − εk and k + εk, by choosing t to be of the order of 1/ε^2. For example, choosing t = 30/ε^2 there is at least a 93% probability of getting an ε-approximation. For concreteness, we will see that, with t = 3,000, we get, with probability at least 93%, an error that is at most 10%.

As before, we let r_1, ..., r_k be the hashes of the k distinct labels in the stream. We let s_t be the t-th smallest of r_1, ..., r_k. Then

    Pr[Algorithm's output ≥ 1.1k] = Pr[t/s_t ≥ 1.1k] = Pr[#{i : r_i ≤ t/(1.1k)} ≥ t]

Now let us study the number of indices i such that r_i ≤ t/(1.1k), and let us give it a name:

    N := #{i : r_i ≤ t/(1.1k)}

This is a random variable whose expectation is easy to compute. If we define S_i to be 1 if r_i ≤ t/(1.1k) and 0 otherwise, then N = \sum_i S_i and so

    E N = \sum_i E S_i = \sum_i Pr[r_i ≤ t/(1.1k)] = k · t/(1.1k) = t/1.1

The variance of N is also easy to compute:

    Var N = \sum_i Var S_i ≤ t/1.1

So N has an average of about .91 t and a standard deviation of less than √t, so by Chebyshev's inequality

    Pr[Algorithm's output ≥ 1.1k] = Pr[N ≥ t]
        = Pr[N − E N ≥ t − t/1.1]
        ≤ (Var N) / (t/11)^2
        ≤ (t/1.1) · 121 / t^2 = 110/t = 110/3,000 < 3.7%

Similarly,

    Pr[Algorithm's output ≤ .9k] = Pr[t/s_t ≤ .9k] = Pr[#{i : r_i ≤ t/(.9k)} ≤ t]

and if we call

    N' := #{i : r_i ≤ t/(.9k)}

we see that

    E N' = t/.9   and   Var N' ≤ t/.9

and

    Pr[Algorithm's output ≤ .9k] = Pr[N' ≤ t]
        = Pr[E N' − N' ≥ t/.9 − t]
        ≤ (Var N') / (t/9)^2
        ≤ (t/.9) · 81 / t^2 = 90/t = 90/3,000 = 3%

So all together we have

    Pr[.9k ≤ Algorithm's output ≤ 1.1k] ≥ 1 − 3.7% − 3% ≥ 93%
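As a sanity check on these numbers (an added illustration, not part of the handout), one can simulate the estimator directly: only the hashes of the k distinct labels matter, so it suffices to draw k uniform values and look at the t-th smallest. Chebyshev's inequality guarantees that at least 93% of the trials land within 10% of k; in practice essentially all of them do.

    import random

    def trial(k=100_000, t=3_000, rng=random):
        # hashes of the k distinct labels; the rest of the stream is irrelevant here
        hashes = sorted(rng.random() for _ in range(k))
        return t / hashes[t - 1]           # estimate = t / (t-th smallest hash)

    random.seed(0)
    estimates = [trial() for _ in range(200)]
    good = sum(1 for e in estimates if 0.9 * 100_000 <= e <= 1.1 * 100_000)
    print(f"{good}/200 trials within 10% of the true count")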