Streaming - 2: Bloom Filters, Distinct Item Counting, Computing Moments

Transcription (credits: uploaded by Toby Brown)
Outline

More algorithms for streams:
(1) Filtering a data stream: Bloom filters. Select elements with property x from the stream.
(2) Counting distinct elements: Flajolet-Martin. Number of distinct elements in the last k elements of the stream.
(3) Estimating moments: AMS method. Estimate the std. dev. of the last k elements.
Balls into Bins

Consider: if we throw m balls into n equally likely bins, what is the probability that a bin does not get a ball? It is (1 - 1/n)^m; for m = n this is (1 - 1/n)^n, which tends to e^{-1} as n grows.

Consider: if we throw m balls into n bins with probabilities 2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, ..., what is the probability that the k-th bin does not get a ball?
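The first claim above is easy to check numerically; a small sketch (the function name is ours):

```python
import math

def p_bin_empty(m, n):
    """Probability that a fixed bin gets none of m balls
    thrown uniformly into n bins: (1 - 1/n)^m."""
    return (1.0 - 1.0 / n) ** m

# With m = n balls, the probability approaches e^{-1} as n grows.
for n in (10, 100, 10_000):
    print(n, p_bin_empty(n, n), math.exp(-1))
```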
Filtering Data Streams

Each element of the data stream is a tuple. Given a list of keys S, determine which tuples of the stream are in S.

Obvious solution: a hash table. But suppose we do not have enough memory to store all of S in a hash table; e.g., we might be processing millions of filters on the same stream.
Applications

Example: spam filtering. We know 1 billion "good" email addresses; if an email comes from one of these, it is NOT spam.

Publish-subscribe systems: you are collecting lots of messages (news articles); people express interest in certain sets of keywords; determine whether each message matches a user's interest.
First Cut Solution (1)

Given a set of keys S that we want to filter:
- Create a bit array B of n bits, initially all 0s.
- Choose a hash function h with range [0, n).
- Hash each member s of S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1.
- Hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1.
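The four steps above can be sketched as follows (a minimal illustration; the example keys, the array size n, and the use of Python's built-in hash as h are our assumptions):

```python
n = 1000        # number of bits in B
B = [0] * n     # bit array, initially all 0s

def h(key):
    # One hash function with range [0, n); Python's built-in
    # hash stands in for the lecture's h.
    return hash(key) % n

S = {"alice@example.com", "bob@example.com"}   # hypothetical key set
for s in S:
    B[h(s)] = 1                                # set the bit for every member of S

def maybe_in_S(item):
    """True  -> item may be in S (false positives possible);
       False -> item is surely not in S (no false negatives)."""
    return B[h(item)] == 1
```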
First Cut Solution (2)

If an item hashes to a bucket set to 0, drop it: it is surely not in S. If it hashes to a bucket that at least one of the items in S hashed to, output it, since it may be in S.

This creates false positives but no false negatives: if the item is in S we surely output it; if not, we may still output it.
First Cut Solution (3)

|S| = 1 billion email addresses; |B| = 1 GB = 8 billion bits.

If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives).

Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives). Actually, less than 1/8th, because more than one address might hash to the same bit.
Analysis: Balls into Bins (1)

A more accurate analysis for the number of false positives. Consider: if we throw m balls into n equally likely bins, what is the probability that a bin gets at least one ball?

In our case: targets = bits/bins; darts (balls) = hash values of items.
Analysis: Balls into Bins (2)

We have m balls and n bins. What is the probability that a bin gets at least one ball?

- Probability some target bin X is not hit by one dart: (1 - 1/n).
- Probability target X is not hit by any of the m darts: (1 - 1/n)^m.
- Probability at least one dart hits target X: 1 - (1 - 1/n)^m.
- Rewriting the exponent equivalently: 1 - ((1 - 1/n)^n)^{m/n}; since (1 - 1/n)^n equals 1/e as n -> infinity, this is approximately 1 - e^{-m/n}.
Analysis: Balls into Bins (3)

A false positive is like choosing a random bin. Fraction of 1s in the array B = probability of a false positive = 1 - e^{-m/n}.

Example: 10^9 balls, 8 * 10^9 bins. Fraction of 1s in B = 1 - e^{-1/8} = 0.1175. Compare with our earlier estimate: 1/8 = 0.125.
Multiple Hash Functions

With several hash functions H1, H2, H3 hashing into a single array, the final array is the union of all the bins each function sets.

When testing an item, only allow it through if all of its bits are set; otherwise it is discarded.
Bloom Filter

Consider |S| = m, |B| = n. Use k independent hash functions h_1, ..., h_k. (Note: we have a single array B!)

Initialization: set B to all 0s. Hash each element s of S using each hash function h_i, and set B[h_i(s)] = 1 (for each i = 1, ..., k).

Run-time: when a stream element with key x arrives, if B[h_i(x)] = 1 for all i = 1, ..., k, then declare that x is in S; that is, x hashes to a bucket set to 1 for every hash function h_i. Otherwise, discard the element x.
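Putting initialization and run-time together, a minimal Bloom-filter sketch (salted SHA-1 hashes stand in for the k independent hash functions; the class and method names are ours):

```python
import hashlib

class BloomFilter:
    """Single bit array B of n bits, probed by k salted hash functions."""

    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits)   # one byte per bit, for simplicity

    def _positions(self, key):
        # k "independent" hash functions, simulated by salting SHA-1 with i.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1          # set B[h_i(key)] = 1 for each i

    def might_contain(self, key):
        # True only if ALL k bits are set; otherwise the key is surely not in S.
        return all(self.bits[pos] for pos in self._positions(key))
```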
Bloom Filter - Analysis

What fraction of the bit vector B are 1s? We are throwing k * m balls at n bins, so the fraction of 1s is (1 - e^{-km/n}).

But we have k independent hash functions, and we only let the element x through if all k hash x to a bin of value 1. So, the false positive probability = (1 - e^{-km/n})^k.
Bloom Filter - Analysis (2)

m = 1 billion, n = 8 billion:
- k = 1: (1 - e^{-1/8}) = 0.1175
- k = 2: (1 - e^{-1/4})^2 = 0.0489

What happens as we keep increasing k? The false positive probability first falls, then rises again (plot: false positive probability vs. the number of hash functions k).

Optimal value of k: (n/m) ln(2). In our case: optimal k = 8 ln(2) = 5.54, so use k = 6. Error at k = 6: (1 - e^{-6/8})^6 = 0.0216.
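The numbers above can be reproduced by evaluating the false-positive formula for each k; a quick sketch (names are ours):

```python
import math

def fp_rate(k, m, n):
    """Approximate Bloom-filter false-positive probability: (1 - e^{-km/n})^k."""
    return (1 - math.exp(-k * m / n)) ** k

m, n = 1e9, 8e9   # 1 billion keys, 8 billion bits; only the ratio m/n matters
rates = {k: fp_rate(k, m, n) for k in range(1, 13)}
best_k = min(rates, key=rates.get)   # should sit near (n/m) ln 2 = 5.54
```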
Bloom Filter: Wrap-up

Bloom filters guarantee no false negatives, and use limited memory. Great for pre-processing before more expensive checks. Suitable for hardware implementation: hash function computations can be parallelized.

Is it better to have 1 big B or k small Bs? It is the same: (1 - e^{-km/n})^k vs. (1 - e^{-m/(n/k)})^k. But keeping 1 big B is simpler.
Counting Distinct Elements

Problem: the data stream consists of elements chosen from a universe of size N. Maintain a count of the number of distinct elements seen so far.

Obvious approach: maintain the set of elements seen so far; that is, keep a hash table of all the distinct elements seen so far.
Applications

- How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).
- How many different Web pages does each customer request in a week?
- How many distinct products have we sold in the last week?
Using Small Storage

Real problem: what if we do not have space to maintain the set of elements seen so far? Estimate the count in an unbiased way. Accept that the count may have a little error, but limit the probability that the error is large.
Flajolet-Martin Approach

Pick a hash function h that maps each of the N elements to at least log2 N bits.

For each stream element a, let r(a) be the number of trailing 0s in h(a): r(a) = position of the first 1, counting from the right. E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2.

Record R = the maximum r(a) seen: R = max_a r(a), over all the items a seen so far.

Estimated number of distinct elements = 2^R.
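A sketch of the whole procedure (a SHA-1 hash truncated to 32 bits stands in for h; the function names are ours):

```python
import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0s in the binary form of x (we define 0 -> 0)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1   # isolate lowest set bit, get its position

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2^R, where R is the
    maximum number of trailing zeros over all hashed items."""
    R = 0
    for a in stream:
        h = int(hashlib.sha1(str(a).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R

# Example from the slide: 12 is 1100 in binary, so r = 2.
```

Note that duplicates cannot change the estimate, since hashing the same item always yields the same r(a); that is exactly why the sketch counts distinct elements.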
Why It Works: Intuition

A very rough and heuristic intuition for why Flajolet-Martin works:
- h(a) hashes a with equal probability to any of N values.
- Then h(a) is a sequence of log2 N bits, where a 2^{-r} fraction of all a's have a tail of r zeros: about 50% of a's hash to ***0, about 25% of a's hash to **00, and so on.
- So, if we saw a longest tail of r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far.
- It takes hashing about 2^r items before we see one with a zero-suffix of length r.
Why It Works: More Formally

What is the probability that a given h(a) ends in at least r zeros? Since h hashes elements uniformly at random, it is 2^{-r}.

So the probability that a given h(a) ends in fewer than r zeros is 1 - 2^{-r}, and the probability that all of m distinct items end in fewer than r zeros is (1 - 2^{-r})^m.
Why It Works: More Formally (cont.)

Note: (1 - 2^{-r})^m = ((1 - 2^{-r})^{2^r})^{m 2^{-r}} ≈ e^{-m 2^{-r}}.

The probability of NOT finding a tail of length r is (1 - 2^{-r})^m ≈ e^{-m 2^{-r}}:
- If m << 2^r, then the probability tends to 1: e^{-m 2^{-r}} -> 1 as m/2^r -> 0. So the probability of finding a tail of length r tends to 0.
- If m >> 2^r, then the probability tends to 0: e^{-m 2^{-r}} -> 0 as m/2^r -> infinity. So the probability of finding a tail of length r tends to 1.

Thus, 2^R will almost always be around m!
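The two limiting regimes above can be checked numerically; a small sketch (the function name is ours):

```python
import math

def p_no_tail(m, r):
    """Probability that none of m hashed items ends in at least r zeros:
    (1 - 2^{-r})^m, since each item independently has such a tail with
    probability 2^{-r}. Approximately e^{-m * 2^{-r}}."""
    return (1 - 2.0 ** (-r)) ** m

# m << 2^r: a tail of length r almost surely does NOT appear (prob near 1).
# m >> 2^r: a tail of length r almost surely DOES appear (prob near 0).
small_m = p_no_tail(10, 20)      # 10 items, threshold 2^20: close to 1
large_m = p_no_tail(10**6, 5)    # a million items, threshold 2^5: close to 0
```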
Why It Doesn't Work

E[2^R] is actually infinite: each time R increases by 1 the probability halves, but the value 2^R doubles. The workaround involves using many hash functions h_i and getting many samples of R_i. How are the samples combined? A simple average is susceptible to occasional large values; the median of the samples is always a power of 2. Solution: partition the samples into small groups, take the average within each group, and then take the median of the group averages.
Generalization: Moments

Suppose a stream has elements chosen from a set A of N values. Let m_i be the number of times value i occurs in the stream. The k-th moment is sum_{i in A} (m_i)^k.
Special Cases

For the moment sum_{i in A} (m_i)^k:
- 0th moment = number of distinct elements (the problem just considered).
- 1st moment = count of the numbers of elements = length of the stream (easy to compute).
- 2nd moment = "surprise number" S, a measure of how uneven the distribution is.
Example: Surprise Number

Stream of length 100, with 11 distinct values.

Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise S = 910.

Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise S = 8,110.
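The two surprise numbers can be checked directly:

```python
def surprise(counts):
    """2nd moment S = sum of squared item counts."""
    return sum(c * c for c in counts)

even = [10] + [9] * 10     # stream of length 100, 11 distinct values
skewed = [90] + [1] * 10   # same length and distinct count, but uneven
# surprise(even) = 100 + 10 * 81 = 910; surprise(skewed) = 8100 + 10 = 8110
```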
AMS Method [Alon, Matias, and Szegedy]

AMS gives an unbiased estimate; it works for all moments, but here we concentrate on the 2nd moment S. We pick and keep track of many variables X; for each X we store X.el (an item of the stream) and X.val (a count associated with X.el).

One Random Variable (X)

How to set X.el and X.val:
- Pick a time t (1 <= t <= n) uniformly at random.
- Let X.el be the item i seen at time t, and maintain c = X.val = the number of occurrences of i in the stream from time t onward (so c starts at 1 and is incremented on every later occurrence of i).
- The estimate of the 2nd moment from this one variable is f(X) = n (2c - 1).
Expectation Analysis

Stream: a a a b b b a b a. Let m_i be the total count of item i in the stream (we are assuming the stream has length n).

E[f(X)] = (1/n) * sum over all times t of n (2 c_t - 1). Group the times by the value seen: for item i,
- the time t when the last i is seen has c_t = 1,
- the time t when the penultimate i is seen has c_t = 2,
- ...
- the time t when the first i is seen has c_t = m_i.
Expectation Analysis (cont.)

So E[f(X)] = (1/n) * sum over items i of n (1 + 3 + 5 + ... + (2 m_i - 1)). Since 1 + 3 + 5 + ... + (2 m_i - 1) = (m_i)^2, we get E[f(X)] = sum_i (m_i)^2 = S: the estimate is unbiased.
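The one-variable estimate n(2c - 1) can be simulated by averaging over many random starting times (a sketch to check unbiasedness, not a constant-memory streaming implementation; names are ours):

```python
import random

def ams_estimate_second_moment(stream, trials=200, seed=0):
    """Average many AMS estimates n*(2c - 1), where c counts occurrences
    of the item at a uniformly random time t from t onward."""
    rng = random.Random(seed)
    n = len(stream)
    total = 0
    for _ in range(trials):
        t = rng.randrange(n)
        c = stream[t:].count(stream[t])   # c_t: occurrences of stream[t] from t on
        total += n * (2 * c - 1)
    return total / trials

stream = list("aaabbbaba")   # m_a = 5, m_b = 4, so the true S = 25 + 16 = 41
```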
Higher-Order Moments

For estimating the k-th moment we use the same algorithm but change the estimate: for k = 2 we used n (2c - 1); in general we use n (c^k - (c - 1)^k), e.g. n (3c^2 - 3c + 1) for k = 3. This works because the sum over c = 1, ..., m_i of (c^k - (c - 1)^k) telescopes to (m_i)^k.
Combining Samples

In practice: compute f(X) = n (2c - 1) for as many variables X as you can store, and take their average as the estimate. For an error guarantee, average the samples within small groups and take the median of the group averages.
Streams Never End: Fixups

(1) The variables X have n as a factor: keep n separately and just hold the count in X.

(2) Suppose we can only store k counts. We must throw some Xs out as time goes on.

Objective: each starting time t is selected with probability k/n.

Solution (fixed-size sampling!):
- Choose the first k times for k variables.
- When the nth element arrives (n > k), choose it with probability k/n.
- If you choose it, throw one of the previously stored variables X out, with equal probability.
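The fixed-size sampling solution above is reservoir sampling; a minimal sketch (the function name is ours):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep k items so that, after n arrivals, each item has been
    selected with probability k/n."""
    rng = random.Random(seed)
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)                # keep the first k items
        elif rng.random() < k / n:          # keep the nth item with prob k/n
            sample[rng.randrange(k)] = x    # evict one stored item uniformly
    return sample
```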
That's it!
CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2
MA/CS 109 Lecture 7 Back To Exponen:al Growth Popula:on Models Homework this week 1. Due next Thursday (not Tuesday) 2. Do most of computa:ons in discussion next week 3. If possible, bring your laptop
More informationCounting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109
1 Chris Piech CS 109 Counting Lecture Notes #1 Sept 24, 2018 Based on a handout by Mehran Sahami with examples by Peter Norvig Although you may have thought you had a pretty good grasp on the notion of
More informationRAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response
RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova Google & USC Presented By: Pat Pannuto RAPPOR, What is is good for? (Absolutely something!)
More informationPolynomials and Gröbner Bases
Alice Feldmann 16th December 2014 ETH Zürich Student Seminar in Combinatorics: Mathema:cal So
More informationComputer Vision. Pa0ern Recogni4on Concepts Part I. Luis F. Teixeira MAP- i 2012/13
Computer Vision Pa0ern Recogni4on Concepts Part I Luis F. Teixeira MAP- i 2012/13 What is it? Pa0ern Recogni4on Many defini4ons in the literature The assignment of a physical object or event to one of
More informationThe Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)
The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set
More informationCSE 473: Ar+ficial Intelligence. Example. Par+cle Filters for HMMs. An HMM is defined by: Ini+al distribu+on: Transi+ons: Emissions:
CSE 473: Ar+ficial Intelligence Par+cle Filters for HMMs Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All
More informationCSCI 2670 Introduction to Theory of Computing
CSCI 267 Introduction to Theory of Computing Agenda Last class Reviewed syllabus Reviewed material in Chapter of Sipser Assigned pages Chapter of Sipser Questions? This class Begin Chapter Goal for the
More informationImage Data Compression. Dirty-paper codes Alexey Pak, Lehrstuhl für Interak<ve Echtzeitsysteme, Fakultät für Informa<k, KIT
Image Data Compression Dirty-paper codes 1 Reminder: watermarking with side informa8on Watermark embedder Noise n Input message m Auxiliary informa
More informationTangent lines, cont d. Linear approxima5on and Newton s Method
Tangent lines, cont d Linear approxima5on and Newton s Method Last 5me: A challenging tangent line problem, because we had to figure out the point of tangency.?? (A) I get it! (B) I think I see how we
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationCSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015
CSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015 Luke ZeElemoyer Slides adapted from Carlos Guestrin Predic5on of con5nuous variables Billionaire says: Wait, that s not what
More informationCSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule.
CSE 473: Ar+ficial Intelligence Markov Models - II Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationBellman s Curse of Dimensionality
Bellman s Curse of Dimensionality n- dimensional state space Number of states grows exponen
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationAttacks on hash functions. Birthday attacks and Multicollisions
Attacks on hash functions Birthday attacks and Multicollisions Birthday Attack Basics In a group of 23 people, the probability that there are at least two persons on the same day in the same month is greater
More informationLecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox
Lecture 04: Balls and Bins: Overview In today s lecture we will start our study of balls-and-bins problems We shall consider a fundamental problem known as the Recall: Inequalities I Lemma Before we begin,
More informationWavelets & Mul,resolu,on Analysis
Wavelets & Mul,resolu,on Analysis Square Wave by Steve Hanov More comics at http://gandolf.homelinux.org/~smhanov/comics/ Problem set #4 will be posted tonight 11/21/08 Comp 665 Wavelets & Mul8resolu8on
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationLogis&c Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Logis&c Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationPar$$oned Elias- Fano Indexes
Par$$oned Elias- Fano Indexes Giuseppe O)aviano ISTI- CNR, Pisa Rossano Venturini Università di Pisa Inverted indexes Docid Document 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3
More informationClass Notes. Examining Repeated Measures Data on Individuals
Ronald Heck Week 12: Class Notes 1 Class Notes Examining Repeated Measures Data on Individuals Generalized linear mixed models (GLMM) also provide a means of incorporang longitudinal designs with categorical
More informationCSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on
CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on Professor Wei-Min Shen Week 13.1 and 13.2 1 Status Check Extra credits? Announcement Evalua/on process will start soon
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More informationCSE 373: Data Structures and Algorithms Pep Talk; Algorithm Analysis. Riley Porter Winter 2017
CSE 373: Data Structures and Algorithms Pep Talk; Algorithm Analysis Riley Porter Announcements Op4onal Java Review Sec4on: PAA A102 Tuesday, January 10 th, 3:30-4:30pm. Any materials covered will be posted
More informationECEN 689 Special Topics in Data Science for Communications Networks
ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 5 Optimizing Fixed Size Samples Sampling as
More informationParallelizing Gaussian Process Calcula1ons in R
Parallelizing Gaussian Process Calcula1ons in R Christopher Paciorek UC Berkeley Sta1s1cs Joint work with: Benjamin Lipshitz Wei Zhuo Prabhat Cari Kaufman Rollin Thomas UC Berkeley EECS (formerly) IBM
More informationSta$s$cal Significance Tes$ng In Theory and In Prac$ce
Sta$s$cal Significance Tes$ng In Theory and In Prac$ce Ben Cartere8e University of Delaware h8p://ir.cis.udel.edu/ictir13tutorial Hypotheses and Experiments Hypothesis: Using an SVM for classifica$on will
More informationLecture 2: Streaming Algorithms
CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 2: Streaming Algorithms Prof. Moses Chariar Scribes: Stephen Mussmann 1 Overview In this lecture, we first derive a concentration inequality
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationAcceptably inaccurate. Probabilistic data structures
Acceptably inaccurate Probabilistic data structures Hello Today's talk Motivation Bloom filters Count-Min Sketch HyperLogLog Motivation Tape HDD SSD Memory Tape HDD SSD Memory Speed Tape HDD SSD Memory
More informationCS 124 Math Review Section January 29, 2018
CS 124 Math Review Section CS 124 is more math intensive than most of the introductory courses in the department. You re going to need to be able to do two things: 1. Perform some clever calculations to
More informationLecture Lecture 25 November 25, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,
More informationIntroduction to discrete probability. The rules Sample space (finite except for one example)
Algorithms lecture notes 1 Introduction to discrete probability The rules Sample space (finite except for one example) say Ω. P (Ω) = 1, P ( ) = 0. If the items in the sample space are {x 1,..., x n }
More informationDiscrete Distributions
A simplest example of random experiment is a coin-tossing, formally called Bernoulli trial. It happens to be the case that many useful distributions are built upon this simplest form of experiment, whose
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationExact data mining from in- exact data Nick Freris
Exact data mining from in- exact data Nick Freris Qualcomm, San Diego October 10, 2013 Introduc=on (1) Informa=on retrieval is a large industry.. Biology, finance, engineering, marke=ng, vision/graphics,
More informationNetworks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource
Networks Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks in biology Protein-Protein Interaction Network of Yeast Transcriptional regulatory network of E.coli Experimental
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More informationWith Question/Answer Animations. Chapter 7
With Question/Answer Animations Chapter 7 Chapter Summary Introduction to Discrete Probability Probability Theory Bayes Theorem Section 7.1 Section Summary Finite Probability Probabilities of Complements
More informationPar$$oned Elias- Fano indexes. Giuseppe O)aviano Rossano Venturini
Par$$oned Elias- Fano indexes Giuseppe O)aviano Rossano Venturini Inverted indexes Core data structure of Informa$on Retrieval Documents are sequences of terms 1: [it is what it is not] 2: [what is a]
More informationCS 6140: Machine Learning Spring 2017
CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Logis@cs Assignment
More informationRegression Part II. One- factor ANOVA Another dummy variable coding scheme Contrasts Mul?ple comparisons Interac?ons
Regression Part II One- factor ANOVA Another dummy variable coding scheme Contrasts Mul?ple comparisons Interac?ons One- factor Analysis of variance Categorical Explanatory variable Quan?ta?ve Response
More information6.854 Advanced Algorithms
6.854 Advanced Algorithms Homework Solutions Hashing Bashing. Solution:. O(log U ) for the first level and for each of the O(n) second level functions, giving a total of O(n log U ) 2. Suppose we are using
More informationA Framework for Protec/ng Worker Loca/on Privacy in Spa/al Crowdsourcing
A Framework for Protec/ng Worker Loca/on Privacy in Spa/al Crowdsourcing Nov 12 2014 Hien To, Gabriel Ghinita, Cyrus Shahabi VLDB 2014 1 Mo/va/on Ubiquity of mobile users 6.5 billion mobile subscrip/ons,
More informationOp#mal Control of Nonlinear Systems with Temporal Logic Specifica#ons
Op#mal Control of Nonlinear Systems with Temporal Logic Specifica#ons Eric M. Wolff 1 Ufuk Topcu 2 and Richard M. Murray 1 1 Caltech and 2 UPenn University of Michigan October 1, 2013 Autonomous Systems
More informationCS174 Final Exam Solutions Spring 2001 J. Canny May 16
CS174 Final Exam Solutions Spring 2001 J. Canny May 16 1. Give a short answer for each of the following questions: (a) (4 points) Suppose two fair coins are tossed. Let X be 1 if the first coin is heads,
More information