4/26/2017. More algorithms for streams: Each element of the data stream is a tuple. Given a list of keys S, determine which tuples of the stream are in S.


Cody Park
5 years ago


Note to other teachers and users of these slides: We would be delighted if you found our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site.

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University (Adapted Version)

More algorithms for streams:
(1) Filtering a data stream: Bloom filters. Select elements with property x from the stream.
(2) Counting distinct elements: Flajolet-Martin. Number of distinct elements in the last k elements of the stream.
(3) Estimating moments: AMS method. Estimate the standard deviation of the last k elements.
(4) Counting frequent items.

Each element of the data stream is a tuple. Given a list of keys S, determine which tuples of the stream are in S. The obvious solution is a hash table. But suppose we do not have enough memory to store all of S in a hash table. For example, we might be processing millions of filters on the same stream.

Example: spam filtering. We know 1 billion good email addresses. If an email comes from one of these, it is NOT spam.

Given a set of keys S that we want to filter:
(1) Create a bit array B of n bits, initially all 0s.
(2) Choose a hash function h with range [0, n).
(3) Hash each member s ∈ S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1.
(4) Hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1.

[Figure: an item is passed through hash function h into bit array B.] If the item hashes to a bucket set to 0, drop it: it is surely not in S. If the item hashes to a bucket that at least one of the items in S hashed to, output it, since it may be in S.

This creates false positives but no false negatives: if the item is in S we surely output it; if not, we may still output it.

Example: |S| = 1 billion email addresses, |B| = 1 GB = 8 billion bits. If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives). Approximately 1/8 of the bits are set to 1, so about 1/8 of the addresses not in S get through to the output (false positives). Actually, it is less than 1/8, because more than one valid address might hash to the same bit.
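The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash function h is built from SHA-256 (the slides do not name one), the helper names `bucket`, `build_bit_array` and `filter_stream` are hypothetical, and the array uses one byte per bit for readability rather than a packed bit array.

```python
import hashlib

def bucket(key: str, n: int) -> int:
    """Hash a key to a bucket in [0, n). SHA-256 is an assumed choice of h."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def build_bit_array(keys, n: int) -> bytearray:
    """Set B[h(s)] = 1 for every s in S (one byte per bit, for clarity)."""
    bits = bytearray(n)  # initially all 0s
    for s in keys:
        bits[bucket(s, n)] = 1
    return bits

def filter_stream(stream, bits: bytearray):
    """Yield only the stream elements a with B[h(a)] == 1."""
    n = len(bits)
    for a in stream:
        if bits[bucket(a, n)] == 1:
            yield a

# Hypothetical example data.
good = ["alice@example.com", "bob@example.com"]
B = build_bit_array(good, n=64)
stream = ["alice@example.com", "eve@example.net", "bob@example.com"]
passed = list(filter_stream(stream, B))
```

Every address in `good` is guaranteed to appear in `passed`; an address outside `good` appears only if it happens to collide with a set bit.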

More accurate analysis for the number of false positives. Consider: if we throw m darts into n equally likely targets, what is the probability that a target gets at least one dart? In our case, the targets are the bits/buckets and the darts are the hash values of the items.

We have m darts and n targets. The probability that a given target X is not hit by any dart is

$$\left(1 - \frac{1}{n}\right)^{m} = \left(1 - \frac{1}{n}\right)^{n (m/n)} \approx e^{-m/n},$$

because $(1 - 1/n)^n \to 1/e$ as $n \to \infty$. The probability that at least one dart hits target X is therefore

$$1 - \left(1 - \frac{1}{n}\right)^{n (m/n)} \approx 1 - e^{-m/n}.$$

The expected fraction of 1s in the array B equals the probability of a false positive, $1 - e^{-m/n}$.

Example: $10^9$ darts, $8 \cdot 10^9$ targets. The expected fraction of 1s in B is $1 - e^{-1/8} \approx 0.1175$. Compare with our earlier estimate: 1/8 = 0.125.

Bloom filter. Consider |S| = m, |B| = n, and use k independent hash functions $h_1, \dots, h_k$.

Initialization: Set B to all 0s. Hash each element s ∈ S using each hash function $h_i$, and set $B[h_i(s)] = 1$ (for each i = 1, ..., k). Note that we have a single array B.

Run-time: When a stream element with key x arrives, if $B[h_i(x)] = 1$ for all i = 1, ..., k, then declare that x is in S. That is, x hashes to a bucket set to 1 for every hash function $h_i$. Otherwise, discard the element x.
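A minimal sketch of the k-hash-function filter described above might look as follows. The class name `BloomFilter` and its methods are hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which is an assumption rather than anything the slides prescribe; the array again uses one byte per bit for readability.

```python
import hashlib

class BloomFilter:
    """Single array B of n positions shared by k hash functions."""

    def __init__(self, n: int, k: int):
        self.n = n
        self.k = k
        self.bits = bytearray(n)  # B, initially all 0s

    def _positions(self, key: str):
        # Assumed construction: h_i(key) = SHA-256("i:key") mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        """Initialization step: set B[h_i(s)] = 1 for each i."""
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key: str) -> bool:
        """Run-time step: True only if B[h_i(x)] == 1 for all i."""
        return all(self.bits[p] == 1 for p in self._positions(key))

# Hypothetical usage.
bloom = BloomFilter(n=1024, k=3)
for address in ["alice@example.com", "bob@example.com"]:
    bloom.add(address)
```

`might_contain` never returns False for a key that was added, which is the no-false-negatives property; a True answer for any other key is a false positive.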

What fraction of the bit vector B are 1s? We are throwing k·m darts at n targets, so the fraction of 1s is $1 - e^{-km/n}$.

But we have k independent hash functions, and we only let the element x through if all k hash x to a bucket of value 1. So the false positive probability is $(1 - e^{-km/n})^k$.

Example: m = 1 billion, n = 8 billion.
k = 1: $(1 - e^{-1/8}) \approx 0.1175$
k = 2: $(1 - e^{-1/4})^2 \approx 0.0489$

What happens as we keep increasing k? [Figure: false positive probability plotted against the number of hash functions, k.]

The optimal value of k is $(n/m) \ln 2$. In our case, the optimal k = 8 ln(2) ≈ 5.54, so we use k = 6. Error at k = 6: $(1 - e^{-6/8})^6 \approx 0.0216$.

Bloom filters guarantee no false negatives and use limited memory. They are great for pre-processing before more expensive checks. They are also suitable for hardware implementation, since the hash function computations can be parallelized.
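The numbers in this analysis are easy to check numerically. The following sketch evaluates the false positive formula $(1 - e^{-km/n})^k$ and the optimal k for the running example; the function names are hypothetical.

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> float:
    """(1 - e^{-km/n})^k for m keys, n bits and k hash functions."""
    return (1.0 - math.exp(-k * m / n)) ** k

def optimal_k(m: int, n: int) -> float:
    """Real-valued optimum (n/m) * ln 2; round to an integer in practice."""
    return (n / m) * math.log(2)

m, n = 10**9, 8 * 10**9  # 1 billion keys, 8 billion bits
for k in (1, 2, 6):
    print(k, round(false_positive_rate(m, n, k), 4))
print("optimal k:", round(optimal_k(m, n), 2))
```

Evaluating the function for a range of k shows the rate falling until k is near (n/m) ln 2 and rising again afterwards, because extra hash functions also fill the array with more 1s.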

Counting distinct elements. Problem: the data stream consists of elements chosen from a universe of size N. Maintain a count of the number of distinct elements seen so far. The obvious approach is to maintain the set of elements seen so far, that is, to keep a hash table of all the distinct elements seen so far.

Applications:
How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).
How many different Web pages does each customer request in a week?
How many distinct products have we sold in the last week?

Real problem: what if we do not have space to maintain the set of elements seen so far? Then we estimate the count in an unbiased way, accepting that the count may have a little error, but limiting the probability that the error is large.

Flajolet-Martin approach: Pick a hash function h that maps each of the N elements to at least $\log_2 N$ bits. For each stream element a, let r(a) be the number of trailing 0s in h(a), i.e., r(a) is the position of the first 1 counting from the right. For example, if h(a) = 12, then 12 is 1100 in binary, so r(a) = 2. Record R = the maximum r(a) seen, that is, $R = \max_a r(a)$ over all the items a seen so far. The estimated number of distinct elements is $2^R$.
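As a rough sketch of the procedure just described, the code below tracks the maximum number of trailing zeros and returns $2^R$. The 64-bit hash built from SHA-256 and the function names are assumptions made for illustration; a single hash function gives a very coarse estimate (always a power of two).

```python
import hashlib

def trailing_zeros(x: int, width: int = 64) -> int:
    """r = number of trailing 0 bits of x (defined as `width` when x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

def hash64(item: str, seed: int = 0) -> int:
    """Assumed hash function h: 64 bits taken from a salted SHA-256."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def fm_estimate(stream, seed: int = 0) -> int:
    """Flajolet-Martin estimate 2^R, where R is the maximum r(a) seen."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(hash64(a, seed)))
    return 2 ** R

print(trailing_zeros(12))  # 12 is 1100 in binary
```

Only the single integer R is kept in memory, and repeated occurrences of the same element cannot change it, since they hash to the same value.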

Very rough, heuristic intuition for why Flajolet-Martin works: h(a) hashes a with equal probability to any of N values. Then h(a) is a sequence of $\log_2 N$ bits, where a $2^{-r}$ fraction of all a's have a tail of r zeros. About 50% of a's hash to ***0, and about 25% of a's hash to **00. So, if the longest tail we saw was r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far. In other words, we need to hash about $2^r$ items before we see one with a zero-suffix of length r.

Now we show why Flajolet-Martin works. Formally, we will show that the probability of finding a tail of r zeros goes to 1 if $m \gg 2^r$ and goes to 0 if $m \ll 2^r$, where m is the number of distinct elements seen so far in the stream. Thus, $2^R$ will almost always be around m.

The probability that a given h(a) ends in at least r zeros is $2^{-r}$: h(a) hashes elements uniformly at random, and the probability that a random number ends in at least r zeros is $2^{-r}$. Then the probability of NOT seeing a tail of length r among m elements is

$$(1 - 2^{-r})^m,$$

where $1 - 2^{-r}$ is the probability that a given h(a) ends in fewer than r zeros, and the m-th power is the probability that all m of them end in fewer than r zeros.

Note:

$$(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}}.$$

So the probability of NOT finding a tail of length r behaves as follows.

If $m \ll 2^r$, then $(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 1$ as $m/2^r \to 0$. So the probability of finding a tail of length r tends to 0.

If $m \gg 2^r$, then $(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 0$ as $m/2^r \to \infty$. So the probability of finding a tail of length r tends to 1.

Thus, $2^R$ will almost always be around m.
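The two limits can be checked numerically. This small sketch compares the exact probability $(1 - 2^{-r})^m$ of not seeing a tail of length r with the approximation $e^{-m 2^{-r}}$, for m far below, equal to, and far above $2^r$. The function names and the choice r = 10 are illustrative assumptions.

```python
import math

def prob_no_tail(m: int, r: int) -> float:
    """Exact probability that none of m hash values ends in at least r zeros."""
    return (1.0 - 2.0 ** -r) ** m

def prob_no_tail_approx(m: int, r: int) -> float:
    """Approximation e^{-m / 2^r} used in the argument above."""
    return math.exp(-m * 2.0 ** -r)

r = 10  # 2^r = 1024
for m in (2**4, 2**10, 2**16):
    print(m, prob_no_tail(m, r), prob_no_tail_approx(m, r))
```

For m = 16 the probability of not finding the tail is close to 1, and for m = 65536 it is essentially 0, matching the two cases of the derivation.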

$E[2^R]$ is actually infinite: the probability halves when $R \to R+1$, but the value doubles. For this reason, simply taking the average over different hash functions does not work well.

How are the samples $R_i$ combined? Partition your samples into small groups, take the median of each group, and then take the average of the medians.

Generalization: moments. Suppose a stream has elements chosen from a set A of N values. Let $m_i$ be the number of times value i occurs in the stream. The k-th moment is

$$\sum_{i \in A} (m_i)^k.$$

0th moment = number of distinct elements (the problem just considered).
1st moment = count of the number of elements = length of the stream (easy to compute).
2nd moment = surprise number S = a measure of how uneven the distribution is.
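To make the definition concrete, here is a small sketch that computes the k-th moment exactly by counting every value. It is only feasible when all counts fit in memory, which is exactly the assumption that streaming methods drop; the function name `kth_moment` and the sample stream are hypothetical.

```python
from collections import Counter

def kth_moment(stream, k: int) -> int:
    """Exact k-th moment: sum over distinct values i of (m_i)^k."""
    counts = Counter(stream)  # m_i for every value i that occurs
    return sum(m ** k for m in counts.values())

# Hypothetical stream of length 100 with 11 distinct values:
# one value occurs 10 times, ten values occur 9 times each.
stream = ["v0"] * 10 + [f"v{i}" for i in range(1, 11) for _ in range(9)]

distinct = kth_moment(stream, 0)  # 0th moment: number of distinct elements
length = kth_moment(stream, 1)    # 1st moment: length of the stream
surprise = kth_moment(stream, 2)  # 2nd moment: surprise number
```

The same function applied to a very skewed stream of the same length gives a much larger second moment, which is why it serves as a measure of unevenness.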

Example: surprise number. Consider a stream of length 100 with 11 distinct values.
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Average = 100/11, surprise S = 910.
Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Average = 100/11, surprise S = 8,110.

AMS method [Alon, Matias, and Szegedy]. The AMS method works for all moments and gives an unbiased estimate. We will just concentrate on the 2nd moment S.

We pick and keep track of many variables X. For each variable X we store X.el and X.val: X.el corresponds to the item i, and X.val corresponds to the count of item i. Note that this requires a count in main memory, so the number of Xs is limited. Our goal is to compute $S = \sum_i m_i^2$.

How do we set X.val and X.el? Assume the stream has length n (we relax this later).
(1) Pick some random time t (t < n) to start, so that any time is equally likely.
(2) Let the stream have item i at time t. We set X.el = i.
(3) Then we maintain a count c (X.val = c) of the number of i's in the stream starting from the chosen time t.

The estimate of the 2nd moment ($\sum_i m_i^2$) is then

$$S = f(X) = n (2c - 1).$$

Note that we will keep track of multiple Xs ($X_1, X_2, \dots, X_k$), and our final estimate will be

$$S = \frac{1}{k} \sum_{j=1}^{k} f(X_j).$$
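For a stream of known length n that fits in a list, the estimator can be sketched directly. The function names are hypothetical, positions are 0-based, and rescanning the stream for each variable is a simplification for clarity; a real streaming implementation would update all counts in a single pass.

```python
import random

def estimate_at(stream, t: int) -> int:
    """f(X) = n(2c - 1) for the variable X that starts at position t."""
    n = len(stream)
    el = stream[t]                                 # X.el
    c = sum(1 for x in stream[t:] if x == el)      # X.val: count from t onwards
    return n * (2 * c - 1)

def ams_second_moment(stream, num_vars: int, rng: random.Random) -> float:
    """Average f(X_j) over num_vars uniformly random start positions."""
    n = len(stream)
    picks = [rng.randrange(n) for _ in range(num_vars)]
    return sum(estimate_at(stream, t) for t in picks) / num_vars

stream = list("aabbbaba")  # m_a = 4, m_b = 4, so the 2nd moment is 32
approx = ams_second_moment(stream, num_vars=4, rng=random.Random(0))
# Averaging over every possible start position recovers the exact value.
exact = sum(estimate_at(stream, t) for t in range(len(stream))) / len(stream)
```

With only a few variables `approx` can be far from 32; `exact` illustrates that the estimator is unbiased, since its average over all start times equals the true second moment.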

Expectation analysis. Let $c_t$ be the number of times the item at time t appears from time t onwards ($c_1 = m_a$, $c_2 = m_a - 1$, $c_3 = m_b$, $c_4 = m_b - 1$ in the example), and let $m_x$ be the frequency of element x in the stream.

[Figure: timeline of the example stream a, a, b, b, b, a, b, a.]

$$E[f(X)] = \frac{1}{n} \sum_{t=1}^{n} n (2 c_t - 1) = 2 \sum_{t=1}^{n} c_t - n.$$

Note that, for each item i, the values $c_t$ at its occurrences are $m_i, m_i - 1, \dots, 1$, so

$$\sum_{t=1}^{n} c_t = \sum_i \frac{m_i (m_i + 1)}{2}.$$

Since $\sum_i m_i = n$, it follows that $E[f(X)] = \sum_i m_i^2$.

In practice: compute f(X) = n(2c − 1) for as many variables X as you can fit in memory. Average them in groups, then take the median of the averages.

Problem: streams never end. We assumed there was a number n, the number of positions in the stream. But real streams go on forever, so n is a variable: the number of inputs seen so far.
(1) The variables X have n as a factor, so keep n separately and just hold the count in X.
(2) Suppose we can only store k counts. We must throw some Xs out as time goes on. Objective: each starting time t is selected with probability k/n. Solution (fixed-size sampling): choose the first k times for the k variables. When the n-th element arrives (n > k), choose it with probability k/n. If you choose it, throw one of the previously stored variables X out, with equal probability.
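The fixed-size sampling scheme can be sketched as a small class that keeps n separately and stores at most k (element, count) pairs. The class and method names are hypothetical, and the final estimate here is the plain average of the stored variables rather than a median of group averages.

```python
import random

class StreamingAMS:
    """Second-moment estimate over an unbounded stream with at most k variables."""

    def __init__(self, k: int, seed: int = 0):
        self.k = k
        self.n = 0                 # number of inputs seen so far, kept separately
        self.vars = []             # each entry is [X.el, X.val]
        self.rng = random.Random(seed)

    def update(self, item) -> None:
        self.n += 1
        # Every stored variable that tracks this item counts the new occurrence.
        for var in self.vars:
            if var[0] == item:
                var[1] += 1
        if len(self.vars) < self.k:
            # The first k positions are always chosen as start times.
            self.vars.append([item, 1])
        elif self.rng.random() < self.k / self.n:
            # Choose the n-th position with probability k/n and evict a
            # uniformly random stored variable.
            self.vars[self.rng.randrange(self.k)] = [item, 1]

    def estimate(self) -> float:
        """Average of n(2c - 1) over the stored variables (needs n >= 1)."""
        return sum(self.n * (2 * c - 1) for _, c in self.vars) / len(self.vars)

sketch = StreamingAMS(k=8)
for item in "aabbbaba":
    sketch.update(item)
```

When k is at least the stream length, every start time is kept and the estimate equals the exact second moment; for longer streams the estimate is randomized, and each start time survives with probability k/n.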

Generalization to higher-order moments. For estimating the k-th moment we essentially use the same algorithm but change the estimate:
For k = 2 we used $n (2c - 1)$.
For k = 3 we use $n (3c^2 - 3c + 1)$ (where c = X.val).
Generally: estimate = $n (c^k - (c-1)^k)$.
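The general estimate is a one-line function. The sketch below, with hypothetical names, checks that it reduces to the k = 2 and k = 3 forms above and that averaging it over every start position of a small stream gives the exact k-th moment, because the terms $c^k - (c-1)^k$ telescope to $m_i^k$ for each item.

```python
def moment_estimate(n: int, c: int, k: int) -> int:
    """AMS estimate n(c^k - (c-1)^k) from stream length n and count c = X.val."""
    return n * (c ** k - (c - 1) ** k)

def average_over_all_starts(stream, k: int) -> float:
    """Average the estimate over every start position (its expectation)."""
    n = len(stream)
    total = 0
    for t in range(n):
        c = sum(1 for x in stream[t:] if x == stream[t])  # count from t onwards
        total += moment_estimate(n, c, k)
    return total / n

stream = list("aabbbaba")  # m_a = 4, m_b = 4
third = average_over_all_starts(stream, 3)  # exact 3rd moment is 4^3 + 4^3
```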
