Streaming - 2: Bloom Filters, Distinct Item Counting, Computing Moments

Transcription (credits: uploaded by Toby Brown)
Outline

More algorithms for streams:
(1) Filtering a data stream: Bloom filters. Select elements with property x from the stream.
(2) Counting distinct elements: Flajolet-Martin. Number of distinct elements in the last k elements of the stream.
(3) Estimating moments: AMS method. Estimate the std. dev. of the last k elements.
Balls into Bins

Consider: if we throw m balls into n equally likely bins, what is the probability that a bin does not get a ball? It is (1 - 1/n)^m; for m = n this is (1 - 1/n)^n, which tends to e^{-1} as n grows.

Consider: if we throw m balls into n bins with probabilities 2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, ..., what is the probability that the k-th bin does not get a ball?
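The first claim above is easy to check numerically; a small sketch (the function name is ours):

```python
import math

def p_bin_empty(m, n):
    """Probability that a fixed bin gets none of m balls
    thrown uniformly into n bins: (1 - 1/n)^m."""
    return (1.0 - 1.0 / n) ** m

# With m = n balls, the probability approaches e^{-1} as n grows.
for n in (10, 100, 10_000):
    print(n, p_bin_empty(n, n), math.exp(-1))
```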
Filtering Data Streams

Each element of the data stream is a tuple. Given a list of keys S, determine which tuples of the stream are in S.

Obvious solution: a hash table. But suppose we do not have enough memory to store all of S in a hash table; e.g., we might be processing millions of filters on the same stream.
Applications

Example: spam filtering. We know 1 billion "good" email addresses; if an email comes from one of these, it is NOT spam.

Publish-subscribe systems: you are collecting lots of messages (news articles); people express interest in certain sets of keywords; determine whether each message matches a user's interest.
First Cut Solution (1)

Given a set of keys S that we want to filter:
- Create a bit array B of n bits, initially all 0s.
- Choose a hash function h with range [0, n).
- Hash each member s of S to one of n buckets, and set that bit to 1, i.e., B[h(s)] = 1.
- Hash each element a of the stream and output only those that hash to a bit that was set to 1, i.e., output a if B[h(a)] == 1.
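The four steps above can be sketched as follows (a minimal illustration; the example keys, the array size n, and the use of Python's built-in hash as h are our assumptions):

```python
n = 1000        # number of bits in B
B = [0] * n     # bit array, initially all 0s

def h(key):
    # One hash function with range [0, n); Python's built-in
    # hash stands in for the lecture's h.
    return hash(key) % n

S = {"alice@example.com", "bob@example.com"}   # hypothetical key set
for s in S:
    B[h(s)] = 1                                # set the bit for every member of S

def maybe_in_S(item):
    """True  -> item may be in S (false positives possible);
       False -> item is surely not in S (no false negatives)."""
    return B[h(item)] == 1
```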
First Cut Solution (2)

If an item hashes to a bucket set to 0, drop it: it is surely not in S. If it hashes to a bucket that at least one of the items in S hashed to, output it, since it may be in S.

This creates false positives but no false negatives: if the item is in S we surely output it; if not, we may still output it.
First Cut Solution (3)

|S| = 1 billion email addresses; |B| = 1 GB = 8 billion bits.

If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives).

Approximately 1/8 of the bits are set to 1, so about 1/8th of the addresses not in S get through to the output (false positives). Actually, less than 1/8th, because more than one address might hash to the same bit.
Analysis: Balls into Bins (1)

A more accurate analysis for the number of false positives. Consider: if we throw m balls into n equally likely bins, what is the probability that a bin gets at least one ball?

In our case: targets = bits/bins; darts (balls) = hash values of items.
Analysis: Balls into Bins (2)

We have m balls and n bins. What is the probability that a bin gets at least one ball?

- Probability some target bin X is not hit by one dart: (1 - 1/n).
- Probability target X is not hit by any of the m darts: (1 - 1/n)^m.
- Probability at least one dart hits target X: 1 - (1 - 1/n)^m.
- Rewriting the exponent equivalently: 1 - ((1 - 1/n)^n)^{m/n}; since (1 - 1/n)^n equals 1/e as n -> infinity, this is approximately 1 - e^{-m/n}.
Analysis: Balls into Bins (3)

A false positive is like choosing a random bin. Fraction of 1s in the array B = probability of a false positive = 1 - e^{-m/n}.

Example: 10^9 balls, 8 * 10^9 bins. Fraction of 1s in B = 1 - e^{-1/8} = 0.1175. Compare with our earlier estimate: 1/8 = 0.125.
Multiple Hash Functions

With several hash functions H1, H2, H3 hashing into a single array, the final array is the union of all the bins each function sets.

When testing an item, only allow it through if all of its bits are set; otherwise it is discarded.
Bloom Filter

Consider |S| = m, |B| = n. Use k independent hash functions h_1, ..., h_k. (Note: we have a single array B!)

Initialization: set B to all 0s. Hash each element s of S using each hash function h_i, and set B[h_i(s)] = 1 (for each i = 1, ..., k).

Run-time: when a stream element with key x arrives, if B[h_i(x)] = 1 for all i = 1, ..., k, then declare that x is in S; that is, x hashes to a bucket set to 1 for every hash function h_i. Otherwise, discard the element x.
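Putting initialization and run-time together, a minimal Bloom-filter sketch (salted SHA-1 hashes stand in for the k independent hash functions; the class and method names are ours):

```python
import hashlib

class BloomFilter:
    """Single bit array B of n bits, probed by k salted hash functions."""

    def __init__(self, n_bits, k):
        self.n, self.k = n_bits, k
        self.bits = bytearray(n_bits)   # one byte per bit, for simplicity

    def _positions(self, key):
        # k "independent" hash functions, simulated by salting SHA-1 with i.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1          # set B[h_i(key)] = 1 for each i

    def might_contain(self, key):
        # True only if ALL k bits are set; otherwise the key is surely not in S.
        return all(self.bits[pos] for pos in self._positions(key))
```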
Bloom Filter - Analysis

What fraction of the bit vector B are 1s? We are throwing k * m balls at n bins, so the fraction of 1s is (1 - e^{-km/n}).

But we have k independent hash functions, and we only let the element x through if all k hash x to a bin of value 1. So, the false positive probability = (1 - e^{-km/n})^k.
Bloom Filter - Analysis (2)

m = 1 billion, n = 8 billion:
- k = 1: (1 - e^{-1/8}) = 0.1175
- k = 2: (1 - e^{-1/4})^2 = 0.0489

What happens as we keep increasing k? The false positive probability first falls, then rises again (plot: false positive probability vs. the number of hash functions k).

Optimal value of k: (n/m) ln(2). In our case: optimal k = 8 ln(2) = 5.54, so use k = 6. Error at k = 6: (1 - e^{-6/8})^6 = 0.0216.
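The numbers above can be reproduced by evaluating the false-positive formula for each k; a quick sketch (names are ours):

```python
import math

def fp_rate(k, m, n):
    """Approximate Bloom-filter false-positive probability: (1 - e^{-km/n})^k."""
    return (1 - math.exp(-k * m / n)) ** k

m, n = 1e9, 8e9   # 1 billion keys, 8 billion bits; only the ratio m/n matters
rates = {k: fp_rate(k, m, n) for k in range(1, 13)}
best_k = min(rates, key=rates.get)   # should sit near (n/m) ln 2 = 5.54
```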
Bloom Filter: Wrap-up

Bloom filters guarantee no false negatives, and use limited memory. Great for pre-processing before more expensive checks. Suitable for hardware implementation: hash function computations can be parallelized.

Is it better to have 1 big B or k small Bs? It is the same: (1 - e^{-km/n})^k vs. (1 - e^{-m/(n/k)})^k. But keeping 1 big B is simpler.
Counting Distinct Elements

Problem: the data stream consists of elements chosen from a universe of size N. Maintain a count of the number of distinct elements seen so far.

Obvious approach: maintain the set of elements seen so far; that is, keep a hash table of all the distinct elements seen so far.
Applications

- How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).
- How many different Web pages does each customer request in a week?
- How many distinct products have we sold in the last week?
Using Small Storage

Real problem: what if we do not have space to maintain the set of elements seen so far? Estimate the count in an unbiased way. Accept that the count may have a little error, but limit the probability that the error is large.
Flajolet-Martin Approach

Pick a hash function h that maps each of the N elements to at least log2 N bits.

For each stream element a, let r(a) be the number of trailing 0s in h(a): r(a) = position of the first 1, counting from the right. E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2.

Record R = the maximum r(a) seen: R = max_a r(a), over all the items a seen so far.

Estimated number of distinct elements = 2^R.
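A sketch of the whole procedure (a SHA-1 hash truncated to 32 bits stands in for h; the function names are ours):

```python
import hashlib

def trailing_zeros(x):
    """r(a): number of trailing 0s in the binary form of x (we define 0 -> 0)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1   # isolate lowest set bit, get its position

def fm_estimate(stream):
    """Flajolet-Martin: estimate the distinct count as 2^R, where R is the
    maximum number of trailing zeros over all hashed items."""
    R = 0
    for a in stream:
        h = int(hashlib.sha1(str(a).encode()).hexdigest(), 16) & 0xFFFFFFFF
        R = max(R, trailing_zeros(h))
    return 2 ** R

# Example from the slide: 12 is 1100 in binary, so r = 2.
```

Note that duplicates cannot change the estimate, since hashing the same item always yields the same r(a); that is exactly why the sketch counts distinct elements.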
Why It Works: Intuition

A very rough and heuristic intuition for why Flajolet-Martin works:
- h(a) hashes a with equal probability to any of N values.
- Then h(a) is a sequence of log2 N bits, where a 2^{-r} fraction of all a's have a tail of r zeros: about 50% of a's hash to ***0, about 25% of a's hash to **00, and so on.
- So, if we saw a longest tail of r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far.
- It takes hashing about 2^r items before we see one with a zero-suffix of length r.
Why It Works: More Formally

What is the probability that a given h(a) ends in at least r zeros? Since h hashes elements uniformly at random, it is 2^{-r}.

So the probability that a given h(a) ends in fewer than r zeros is 1 - 2^{-r}, and the probability that all of m distinct items end in fewer than r zeros is (1 - 2^{-r})^m.
Why It Works: More Formally (cont.)

Note: (1 - 2^{-r})^m = ((1 - 2^{-r})^{2^r})^{m 2^{-r}} ≈ e^{-m 2^{-r}}.

The probability of NOT finding a tail of length r is (1 - 2^{-r})^m ≈ e^{-m 2^{-r}}:
- If m << 2^r, then the probability tends to 1: e^{-m 2^{-r}} -> 1 as m/2^r -> 0. So the probability of finding a tail of length r tends to 0.
- If m >> 2^r, then the probability tends to 0: e^{-m 2^{-r}} -> 0 as m/2^r -> infinity. So the probability of finding a tail of length r tends to 1.

Thus, 2^R will almost always be around m!
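The two limiting regimes above can be checked numerically; a small sketch (the function name is ours):

```python
import math

def p_no_tail(m, r):
    """Probability that none of m hashed items ends in at least r zeros:
    (1 - 2^{-r})^m, since each item independently has such a tail with
    probability 2^{-r}. Approximately e^{-m * 2^{-r}}."""
    return (1 - 2.0 ** (-r)) ** m

# m << 2^r: a tail of length r almost surely does NOT appear (prob near 1).
# m >> 2^r: a tail of length r almost surely DOES appear (prob near 0).
small_m = p_no_tail(10, 20)      # 10 items, threshold 2^20: close to 1
large_m = p_no_tail(10**6, 5)    # a million items, threshold 2^5: close to 0
```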
Why It Doesn't Work

E[2^R] is actually infinite: each time R increases by 1 the probability halves, but the value 2^R doubles. The workaround involves using many hash functions h_i and getting many samples of R_i. How are the samples combined? A simple average is susceptible to occasional large values; the median of the samples is always a power of 2. Solution: partition the samples into small groups, take the average within each group, and then take the median of the group averages.
Generalization: Moments

Suppose a stream has elements chosen from a set A of N values. Let m_i be the number of times value i occurs in the stream. The k-th moment is sum_{i in A} (m_i)^k.
Special Cases

For the moment sum_{i in A} (m_i)^k:
- 0th moment = number of distinct elements (the problem just considered).
- 1st moment = count of the numbers of elements = length of the stream (easy to compute).
- 2nd moment = "surprise number" S, a measure of how uneven the distribution is.
Example: Surprise Number

Stream of length 100, with 11 distinct values.

Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise S = 910.

Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise S = 8,110.
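The two surprise numbers can be checked directly:

```python
def surprise(counts):
    """2nd moment S = sum of squared item counts."""
    return sum(c * c for c in counts)

even = [10] + [9] * 10     # stream of length 100, 11 distinct values
skewed = [90] + [1] * 10   # same length and distinct count, but uneven
# surprise(even) = 100 + 10 * 81 = 910; surprise(skewed) = 8100 + 10 = 8110
```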
AMS Method [Alon, Matias, and Szegedy]

AMS gives an unbiased estimate; it works for all moments, but here we concentrate on the 2nd moment S. We pick and keep track of many variables X; for each X we store X.el (an item of the stream) and X.val (a count associated with X.el).

One Random Variable (X)

How to set X.el and X.val:
- Pick a time t (1 <= t <= n) uniformly at random.
- Let X.el be the item i seen at time t, and maintain c = X.val = the number of occurrences of i in the stream from time t onward (so c starts at 1 and is incremented on every later occurrence of i).
- The estimate of the 2nd moment from this one variable is f(X) = n (2c - 1).
Expectation Analysis

Stream: a a a b b b a b a. Let m_i be the total count of item i in the stream (we are assuming the stream has length n).

E[f(X)] = (1/n) * sum over all times t of n (2 c_t - 1). Group the times by the value seen: for item i,
- the time t when the last i is seen has c_t = 1,
- the time t when the penultimate i is seen has c_t = 2,
- ...
- the time t when the first i is seen has c_t = m_i.
Expectation Analysis (cont.)

So E[f(X)] = (1/n) * sum over items i of n (1 + 3 + 5 + ... + (2 m_i - 1)). Since 1 + 3 + 5 + ... + (2 m_i - 1) = (m_i)^2, we get E[f(X)] = sum_i (m_i)^2 = S: the estimate is unbiased.
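The one-variable estimate n(2c - 1) can be simulated by averaging over many random starting times (a sketch to check unbiasedness, not a constant-memory streaming implementation; names are ours):

```python
import random

def ams_estimate_second_moment(stream, trials=200, seed=0):
    """Average many AMS estimates n*(2c - 1), where c counts occurrences
    of the item at a uniformly random time t from t onward."""
    rng = random.Random(seed)
    n = len(stream)
    total = 0
    for _ in range(trials):
        t = rng.randrange(n)
        c = stream[t:].count(stream[t])   # c_t: occurrences of stream[t] from t on
        total += n * (2 * c - 1)
    return total / trials

stream = list("aaabbbaba")   # m_a = 5, m_b = 4, so the true S = 25 + 16 = 41
```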
Higher-Order Moments

For estimating the k-th moment we use the same algorithm but change the estimate: for k = 2 we used n (2c - 1); in general we use n (c^k - (c - 1)^k), e.g. n (3c^2 - 3c + 1) for k = 3. This works because the sum over c = 1, ..., m_i of (c^k - (c - 1)^k) telescopes to (m_i)^k.
Combining Samples

In practice: compute f(X) = n (2c - 1) for as many variables X as you can store, and take their average as the estimate. For an error guarantee, average the samples within small groups and take the median of the group averages.
Streams Never End: Fixups

(1) The variables X have n as a factor: keep n separately and just hold the count in X.

(2) Suppose we can only store k counts. We must throw some Xs out as time goes on.

Objective: each starting time t is selected with probability k/n.

Solution (fixed-size sampling!):
- Choose the first k times for k variables.
- When the nth element arrives (n > k), choose it with probability k/n.
- If you choose it, throw one of the previously stored variables X out, with equal probability.
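The fixed-size sampling solution above is reservoir sampling; a minimal sketch (the function name is ours):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Keep k items so that, after n arrivals, each item has been
    selected with probability k/n."""
    rng = random.Random(seed)
    sample = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            sample.append(x)                # keep the first k items
        elif rng.random() < k / n:          # keep the nth item with prob k/n
            sample[rng.randrange(k)] = x    # evict one stored item uniformly
    return sample
```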
That's it!
CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2
MA/CS 109 Lecture 7 Back To Exponen:al Growth Popula:on Models Homework this week 1. Due next Thursday (not Tuesday) 2. Do most of computa:ons in discussion next week 3. If possible, bring your laptop
More informationCounting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109
1 Chris Piech CS 109 Counting Lecture Notes #1 Sept 24, 2018 Based on a handout by Mehran Sahami with examples by Peter Norvig Although you may have thought you had a pretty good grasp on the notion of
More informationRAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response
RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova Google & USC Presented By: Pat Pannuto RAPPOR, What is is good for? (Absolutely something!)
More informationPolynomials and Gröbner Bases
Alice Feldmann 16th December 2014 ETH Zürich Student Seminar in Combinatorics: Mathema:cal So
More informationComputer Vision. Pa0ern Recogni4on Concepts Part I. Luis F. Teixeira MAP- i 2012/13
Computer Vision Pa0ern Recogni4on Concepts Part I Luis F. Teixeira MAP- i 2012/13 What is it? Pa0ern Recogni4on Many defini4ons in the literature The assignment of a physical object or event to one of
More informationThe Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)
The Market-Basket Model Association Rules Market Baskets Frequent sets A-priori Algorithm A large set of items, e.g., things sold in a supermarket. A large set of baskets, each of which is a small set
More informationCSE 473: Ar+ficial Intelligence. Example. Par+cle Filters for HMMs. An HMM is defined by: Ini+al distribu+on: Transi+ons: Emissions:
CSE 473: Ar+ficial Intelligence Par+cle Filters for HMMs Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All
More informationCSCI 2670 Introduction to Theory of Computing
CSCI 267 Introduction to Theory of Computing Agenda Last class Reviewed syllabus Reviewed material in Chapter of Sipser Assigned pages Chapter of Sipser Questions? This class Begin Chapter Goal for the
More informationImage Data Compression. Dirty-paper codes Alexey Pak, Lehrstuhl für Interak<ve Echtzeitsysteme, Fakultät für Informa<k, KIT
Image Data Compression Dirty-paper codes 1 Reminder: watermarking with side informa8on Watermark embedder Noise n Input message m Auxiliary informa
More informationTangent lines, cont d. Linear approxima5on and Newton s Method
Tangent lines, cont d Linear approxima5on and Newton s Method Last 5me: A challenging tangent line problem, because we had to figure out the point of tangency.?? (A) I get it! (B) I think I see how we
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationCSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015
CSE446: Linear Regression Regulariza5on Bias / Variance Tradeoff Winter 2015 Luke ZeElemoyer Slides adapted from Carlos Guestrin Predic5on of con5nuous variables Billionaire says: Wait, that s not what
More informationCSE 473: Ar+ficial Intelligence. Probability Recap. Markov Models - II. Condi+onal probability. Product rule. Chain rule.
CSE 473: Ar+ficial Intelligence Markov Models - II Daniel S. Weld - - - University of Washington [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationBellman s Curse of Dimensionality
Bellman s Curse of Dimensionality n- dimensional state space Number of states grows exponen
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationAttacks on hash functions. Birthday attacks and Multicollisions
Attacks on hash functions Birthday attacks and Multicollisions Birthday Attack Basics In a group of 23 people, the probability that there are at least two persons on the same day in the same month is greater
More informationLecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox
Lecture 04: Balls and Bins: Overview In today s lecture we will start our study of balls-and-bins problems We shall consider a fundamental problem known as the Recall: Inequalities I Lemma Before we begin,
More informationWavelets & Mul,resolu,on Analysis
Wavelets & Mul,resolu,on Analysis Square Wave by Steve Hanov More comics at http://gandolf.homelinux.org/~smhanov/comics/ Problem set #4 will be posted tonight 11/21/08 Comp 665 Wavelets & Mul8resolu8on
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationLogis&c Regression. Robot Image Credit: Viktoriya Sukhanova 123RF.com
Logis&c Regression These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationPar$$oned Elias- Fano Indexes
Par$$oned Elias- Fano Indexes Giuseppe O)aviano ISTI- CNR, Pisa Rossano Venturini Università di Pisa Inverted indexes Docid Document 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3
More informationClass Notes. Examining Repeated Measures Data on Individuals
Ronald Heck Week 12: Class Notes 1 Class Notes Examining Repeated Measures Data on Individuals Generalized linear mixed models (GLMM) also provide a means of incorporang longitudinal designs with categorical
More informationCSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on
CSCI 360 Introduc/on to Ar/ficial Intelligence Week 2: Problem Solving and Op/miza/on Professor Wei-Min Shen Week 13.1 and 13.2 1 Status Check Extra credits? Announcement Evalua/on process will start soon
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More informationCSE 373: Data Structures and Algorithms Pep Talk; Algorithm Analysis. Riley Porter Winter 2017
CSE 373: Data Structures and Algorithms Pep Talk; Algorithm Analysis Riley Porter Announcements Op4onal Java Review Sec4on: PAA A102 Tuesday, January 10 th, 3:30-4:30pm. Any materials covered will be posted
More informationECEN 689 Special Topics in Data Science for Communications Networks
ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 5 Optimizing Fixed Size Samples Sampling as
More informationParallelizing Gaussian Process Calcula1ons in R
Parallelizing Gaussian Process Calcula1ons in R Christopher Paciorek UC Berkeley Sta1s1cs Joint work with: Benjamin Lipshitz Wei Zhuo Prabhat Cari Kaufman Rollin Thomas UC Berkeley EECS (formerly) IBM
More informationSta$s$cal Significance Tes$ng In Theory and In Prac$ce
Sta$s$cal Significance Tes$ng In Theory and In Prac$ce Ben Cartere8e University of Delaware h8p://ir.cis.udel.edu/ictir13tutorial Hypotheses and Experiments Hypothesis: Using an SVM for classifica$on will
More informationLecture 2: Streaming Algorithms
CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 2: Streaming Algorithms Prof. Moses Chariar Scribes: Stephen Mussmann 1 Overview In this lecture, we first derive a concentration inequality
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationAcceptably inaccurate. Probabilistic data structures
Acceptably inaccurate Probabilistic data structures Hello Today's talk Motivation Bloom filters Count-Min Sketch HyperLogLog Motivation Tape HDD SSD Memory Tape HDD SSD Memory Speed Tape HDD SSD Memory
More informationCS 124 Math Review Section January 29, 2018
CS 124 Math Review Section CS 124 is more math intensive than most of the introductory courses in the department. You re going to need to be able to do two things: 1. Perform some clever calculations to
More informationLecture Lecture 25 November 25, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,
More informationIntroduction to discrete probability. The rules Sample space (finite except for one example)
Algorithms lecture notes 1 Introduction to discrete probability The rules Sample space (finite except for one example) say Ω. P (Ω) = 1, P ( ) = 0. If the items in the sample space are {x 1,..., x n }
More informationDiscrete Distributions
A simplest example of random experiment is a coin-tossing, formally called Bernoulli trial. It happens to be the case that many useful distributions are built upon this simplest form of experiment, whose
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationExact data mining from in- exact data Nick Freris
Exact data mining from in- exact data Nick Freris Qualcomm, San Diego October 10, 2013 Introduc=on (1) Informa=on retrieval is a large industry.. Biology, finance, engineering, marke=ng, vision/graphics,
More informationNetworks. Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource
Networks Can (John) Bruce Keck Founda7on Biotechnology Lab Bioinforma7cs Resource Networks in biology Protein-Protein Interaction Network of Yeast Transcriptional regulatory network of E.coli Experimental
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More informationWith Question/Answer Animations. Chapter 7
With Question/Answer Animations Chapter 7 Chapter Summary Introduction to Discrete Probability Probability Theory Bayes Theorem Section 7.1 Section Summary Finite Probability Probabilities of Complements
More informationPar$$oned Elias- Fano indexes. Giuseppe O)aviano Rossano Venturini
Par$$oned Elias- Fano indexes Giuseppe O)aviano Rossano Venturini Inverted indexes Core data structure of Informa$on Retrieval Documents are sequences of terms 1: [it is what it is not] 2: [what is a]
More informationCS 6140: Machine Learning Spring 2017
CS 6140: Machine Learning Spring 2017 Instructor: Lu Wang College of Computer and Informa@on Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Logis@cs Assignment
More informationRegression Part II. One- factor ANOVA Another dummy variable coding scheme Contrasts Mul?ple comparisons Interac?ons
Regression Part II One- factor ANOVA Another dummy variable coding scheme Contrasts Mul?ple comparisons Interac?ons One- factor Analysis of variance Categorical Explanatory variable Quan?ta?ve Response
More information6.854 Advanced Algorithms
6.854 Advanced Algorithms Homework Solutions Hashing Bashing. Solution:. O(log U ) for the first level and for each of the O(n) second level functions, giving a total of O(n log U ) 2. Suppose we are using
More informationA Framework for Protec/ng Worker Loca/on Privacy in Spa/al Crowdsourcing
A Framework for Protec/ng Worker Loca/on Privacy in Spa/al Crowdsourcing Nov 12 2014 Hien To, Gabriel Ghinita, Cyrus Shahabi VLDB 2014 1 Mo/va/on Ubiquity of mobile users 6.5 billion mobile subscrip/ons,
More informationOp#mal Control of Nonlinear Systems with Temporal Logic Specifica#ons
Op#mal Control of Nonlinear Systems with Temporal Logic Specifica#ons Eric M. Wolff 1 Ufuk Topcu 2 and Richard M. Murray 1 1 Caltech and 2 UPenn University of Michigan October 1, 2013 Autonomous Systems
More informationCS174 Final Exam Solutions Spring 2001 J. Canny May 16
CS174 Final Exam Solutions Spring 2001 J. Canny May 16 1. Give a short answer for each of the following questions: (a) (4 points) Suppose two fair coins are tossed. Let X be 1 if the first coin is heads,
More information