1 Streaming - 2: Bloom Filters, Distinct Item Counting, Computing Moments (credits: www.mmds.org)

5 Outline
More algorithms for streams:
(1) Filtering a data stream: Bloom filters. Select elements with property x from the stream.
(2) Counting distinct elements: Flajolet-Martin. Number of distinct elements in the last k elements of the stream.
(3) Estimating moments: AMS method. Estimate the standard deviation of the last k elements.

7 Balls into bins
Consider: if we throw m balls into n equally likely bins, what is the probability that a given bin does not get a ball? It is $(1 - 1/n)^m$; for $m = n$ this is $(1 - 1/n)^n \approx e^{-1}$.
Consider: if we throw m balls into n bins with probabilities $2^{-1}, 2^{-2}, 2^{-3}, 2^{-4}, \dots$, what is the probability that the k-th bin does not get a ball?

9 Filtering Data Streams
Each element of the data stream is a tuple.
Given a list of keys S, determine which tuples of the stream are in S.
Obvious solution: a hash table.
But suppose we do not have enough memory to store all of S in a hash table. E.g., we might be processing millions of filters on the same stream.

11 Applications
Example: email spam filtering. We know 1 billion good email addresses. If an email comes from one of these, it is NOT spam.
Publish-subscribe systems: you are collecting lots of messages (news articles). People express interest in certain sets of keywords. Determine whether each message matches a user's interest.

17 First Cut Solution (1)
Given a set of keys S that we want to filter:
Create a bit array B of n bits, initially all 0s.
Choose a hash function h with range [0, n).
Hash each member $s \in S$ to one of n buckets and set that bit to 1, i.e., B[h(s)] = 1.
Hash each element a of the stream and output only those that hash to a bit that was set to 1: output a if B[h(a)] == 1.
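
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash function (SHA-256 reduced modulo n), the helper names `bucket`, `build_bit_array`, and `filter_stream`, and the sample addresses are all hypothetical and not part of the slides; one byte per bit is used for readability rather than a packed bit array.

```python
import hashlib

def bucket(key: str, n: int) -> int:
    """Hash a key to a bucket index in [0, n). SHA-256 is an illustrative choice."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n

def build_bit_array(keys, n: int) -> bytearray:
    """Set B[h(s)] = 1 for every s in S (one byte per bit, for readability)."""
    bits = bytearray(n)  # initially all 0s
    for s in keys:
        bits[bucket(s, n)] = 1
    return bits

def filter_stream(stream, bits: bytearray):
    """Yield only the stream elements a with B[h(a)] == 1."""
    n = len(bits)
    for a in stream:
        if bits[bucket(a, n)] == 1:
            yield a

# Hypothetical sample data: two known-good keys and a short stream.
good = ["alice@example.com", "bob@example.com"]
B = build_bit_array(good, n=64)
stream = ["alice@example.com", "eve@example.org", "bob@example.com"]
passed = list(filter_stream(stream, B))
```

Every key in `good` is guaranteed to appear in `passed`; whether the unknown key also gets through depends on where it happens to hash, which is exactly the false-positive behaviour discussed next.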

19 First Cut Solution (2)
[Figure: an item passes through hash function h into bit array B.]
Output the item since it may be in S: it hashes to a bucket that at least one of the items in S hashed to.
Drop the item if it hashes to a bucket set to 0: it is surely not in S.
Creates false positives but no false negatives. If the item is in S we surely output it; if not, we may still output it.

21 First Cut Solution (3)
|S| = 1 billion email addresses; |B| = 1 GB = 8 billion bits.
If the email address is in S, then it surely hashes to a bucket that has the bit set to 1, so it always gets through (no false negatives).
Approximately 1/8 of the bits are set to 1, so about 1/8 of the addresses not in S get through to the output (false positives). Actually, less than 1/8, because more than one address might hash to the same bit.

24 Analysis: Balls into Bins (1)
A more accurate analysis of the number of false positives.
Consider: if we throw m balls into n equally likely bins, what is the probability that a bin gets at least one ball?
In our case: bins (targets) = bits; balls (darts) = hash values of items.

30 Analysis: Balls into Bins (2)
We have m balls and n bins. What is the probability that a bin gets at least one ball?
Probability that some target X is not hit by a given dart: $1 - 1/n$.
Probability that none of the m darts hits target X: $(1 - 1/n)^m$.
Probability that at least one dart hits target X: $1 - (1 - 1/n)^m$.
Equivalently: $1 - (1 - 1/n)^{n(m/n)}$.
Since $(1 - 1/n)^n \to 1/e$ as $n \to \infty$, this is approximately $1 - e^{-m/n}$.

32 Analysis: Balls into Bins (3)
A false positive is like choosing a random bin.
Fraction of 1s in the array B = probability of a false positive = $1 - e^{-m/n}$.
Example: $10^9$ balls, $8 \cdot 10^9$ bins. Fraction of 1s in B = $1 - e^{-1/8} \approx 0.1175$. Compare with our earlier estimate: $1/8 = 0.125$.
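
As a quick numeric check of this analysis, the sketch below compares the exact expression $1 - (1 - 1/n)^m$ with the approximation $1 - e^{-m/n}$ for the example's values of m and n. The function names are illustrative; `math.log1p` is used only to keep the exact expression numerically stable for very large n.

```python
import math

def fraction_of_ones_exact(m: int, n: int) -> float:
    """1 - (1 - 1/n)^m, computed via log1p for numerical stability."""
    return 1.0 - math.exp(m * math.log1p(-1.0 / n))

def fraction_of_ones_approx(m: int, n: int) -> float:
    """The approximation 1 - e^(-m/n)."""
    return 1.0 - math.exp(-m / n)

m, n = 10**9, 8 * 10**9          # 1 billion keys, 8 billion bits
exact = fraction_of_ones_exact(m, n)
approx = fraction_of_ones_approx(m, n)
naive = m / n                    # the earlier estimate of 1/8
print(f"exact={exact:.6f} approx={approx:.6f} naive={naive:.6f}")
```

For these sizes the exact and approximate values agree to many decimal places, and both sit slightly below the naive 1/8 estimate.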

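33 Multiple Hash Functions
[Figure: hash functions H1, H2, H3 each set bits; the final array is the union of all bins.]

34 Multiple Hash Functions
[Figure: an item is hashed by H1, H2, H3; allow it only if all bits are set, otherwise it is discarded.]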

37 Bloom Filter
Consider: |S| = m, |B| = n.
Use k independent hash functions $h_1, \dots, h_k$ (note: we have a single array B!).
Initialization: set B to all 0s. Hash each element $s \in S$ using each hash function $h_i$ and set $B[h_i(s)] = 1$ (for each i = 1, ..., k).
Run-time: when a stream element with key x arrives, if $B[h_i(x)] = 1$ for all i = 1, ..., k, then declare that x is in S. That is, x hashes to a bucket set to 1 for every hash function $h_i$. Otherwise discard the element x.
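
The initialization and run-time steps above might look as follows in Python. This is a minimal sketch, not a production implementation: the class name `BloomFilter`, its method names, and the way the k hash functions are simulated (salting SHA-256 with the index i) are all assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Single bit array B of n bits shared by k hash functions."""

    def __init__(self, n_bits: int, k_hashes: int):
        self.n = n_bits
        self.k = k_hashes
        self.bits = bytearray(n_bits)  # B, all 0s; one byte per bit for clarity

    def _positions(self, key: str):
        # Simulate h_1..h_k by salting SHA-256 with the index i (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key: str) -> None:
        """Initialization step: set B[h_i(s)] = 1 for each i."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        """Run-time step: True only if B[h_i(x)] == 1 for all i."""
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter(n_bits=1024, k_hashes=3)
members = [f"user{i}@example.com" for i in range(50)]  # hypothetical keys
for s in members:
    bf.add(s)
```

`bf.might_contain(s)` is True for every added key (no false negatives); for a key that was never added it is usually False but may occasionally be True.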

41 Bloom Filter -- Analysis
What fraction of the bit vector B are 1s? We are throwing $k \cdot m$ balls at n bins, so the fraction of 1s is $1 - e^{-km/n}$.
But we have k independent hash functions, and we only let the element x through if all k of them hash x to a bin of value 1.
So, false positive probability = $(1 - e^{-km/n})^k$.

46 Bloom Filter Analysis (2)
m = 1 billion, n = 8 billion.
k = 1: $(1 - e^{-1/8}) \approx 0.1175$
k = 2: $(1 - e^{-1/4})^2 \approx 0.0489$
What happens as we keep increasing k?
[Plot: false positive probability versus number of hash functions, k.]
Optimal value of k: $(n/m) \ln 2$.
In our case: optimal k = $8 \ln 2 \approx 5.54 \approx 6$. Error at k = 6: $(1 - e^{-6/8})^6 \approx 0.0216$.
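
To see how the false positive probability behaves as k grows, the following sketch evaluates $(1 - e^{-km/n})^k$ for k = 1, ..., 20 with the slide's m and n, and compares the best integer k with $(n/m) \ln 2$. The function name and the range of k are illustrative choices.

```python
import math

def false_positive_rate(m: int, n: int, k: int) -> float:
    """False positive probability (1 - e^(-k*m/n))^k of a Bloom filter."""
    return (1.0 - math.exp(-k * m / n)) ** k

m, n = 10**9, 8 * 10**9
rates = {k: false_positive_rate(m, n, k) for k in range(1, 21)}
best_k = min(rates, key=rates.get)          # integer k with the lowest rate
optimal_k_real = (n / m) * math.log(2)      # (n/m) ln 2

for k in (1, 2, 6, 20):
    print(f"k={k:2d}  false positive rate={rates[k]:.4f}")
print(f"best integer k={best_k}, (n/m) ln 2={optimal_k_real:.2f}")
```

The rate first falls and then rises again: more hash functions make a match harder to get by chance, but they also fill the array with 1s faster.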

47 Bloom Filter: Wrap-up
Bloom filters guarantee no false negatives and use limited memory. Great for pre-processing before more expensive checks.
Suitable for hardware implementation: hash function computations can be parallelized.
Is it better to have 1 big B or k small Bs? It is the same: $(1 - e^{-km/n})^k$ vs. $(1 - e^{-m/(n/k)})^k$. But keeping 1 big B is simpler.

49 Counting Distinct Elements
Problem: the data stream consists of elements chosen from a universe of size N. Maintain a count of the number of distinct elements seen so far.
Obvious approach: maintain the set of elements seen so far. That is, keep a hash table of all the distinct elements seen so far.

52 Applications
How many different words are found among the Web pages being crawled at a site? Unusually low or high numbers could indicate artificial pages (spam?).
How many different Web pages does each customer request in a week?
How many distinct products have we sold in the last week?

53 Using Small Storage
Real problem: what if we do not have space to maintain the set of elements seen so far?
Estimate the count in an unbiased way.
Accept that the count may have a little error, but limit the probability that the error is large.

57 Flajolet-Martin Approach
Pick a hash function h that maps each of the N elements to at least $\log_2 N$ bits.
For each stream element a, let r(a) be the number of trailing 0s in h(a): r(a) = position of the first 1 counting from the right (starting at 0). E.g., say h(a) = 12; then 12 is 1100 in binary, so r(a) = 2.
Record R = the maximum r(a) seen: $R = \max_a r(a)$, over all the items a seen so far.
Estimated number of distinct elements = $2^R$.
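
A minimal Python sketch of this single-hash estimator is shown below. The 64-bit hash derived from SHA-256 and the function names `trailing_zeros`, `hash64`, and `flajolet_martin` are assumptions made for illustration. With one hash function the result is always a power of two and can be far from the true count.

```python
import hashlib

def trailing_zeros(x: int, width: int = 64) -> int:
    """r(a): number of trailing 0 bits in x (width if x == 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

def hash64(item: str) -> int:
    """Illustrative 64-bit hash built from SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def flajolet_martin(stream) -> int:
    """Return 2^R where R is the maximum tail length seen in the stream."""
    R = 0
    for a in stream:
        R = max(R, trailing_zeros(hash64(a)))
    return 2 ** R

# Hypothetical stream: 1000 distinct items, each repeated three times.
items = [f"item-{i}" for i in range(1000)] * 3
estimate = flajolet_martin(items)
```

Because R depends only on the set of hash values, repeating an item never changes the estimate, which is why the method counts distinct elements rather than stream length.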

62 Why It Works: Intuition
Very rough and heuristic intuition for why Flajolet-Martin works:
h(a) hashes a with equal probability to any of N values.
Then h(a) is a sequence of $\log_2 N$ bits, where a $2^{-r}$ fraction of all a's have a tail of r zeros. About 50% of a's hash to ***0. About 25% of a's hash to **00.
So, if the longest tail we saw was r = 2 (i.e., an item hash ending in *100), then we have probably seen about 4 distinct items so far.
So, we need to hash about $2^r$ items before we see one with a zero-suffix of length r.

67 Why It Works: More formally
Probability that a given h(a) ends in fewer than r zeros: $1 - 2^{-r}$.
Probability that the hash values of all m distinct elements end in fewer than r zeros: $(1 - 2^{-r})^m$.

72 Why It Works: More formally
Note: $(1 - 2^{-r})^m = \left((1 - 2^{-r})^{2^r}\right)^{m 2^{-r}} \approx e^{-m 2^{-r}}$.
Probability of NOT finding a tail of length r:
If $m \ll 2^r$, then the probability tends to 1: $(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 1$ as $m/2^r \to 0$. So, the probability of finding a tail of length r tends to 0.
If $m \gg 2^r$, then the probability tends to 0: $(1 - 2^{-r})^m \approx e^{-m 2^{-r}} \to 0$ as $m/2^r \to \infty$. So, the probability of finding a tail of length r tends to 1.
Thus, $2^R$ will almost always be around m!
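
The two limiting cases can be checked numerically. The sketch below evaluates $(1 - 2^{-r})^m$ and its approximation $e^{-m 2^{-r}}$ for one fixed m and several values of r; the choice m = 1000 and the function names are illustrative.

```python
import math

def prob_no_tail(m: int, r: int) -> float:
    """Probability that none of m hash values ends in at least r zeros."""
    return (1.0 - 2.0 ** -r) ** m

def prob_no_tail_approx(m: int, r: int) -> float:
    """The approximation e^(-m * 2^-r)."""
    return math.exp(-m * 2.0 ** -r)

m = 1000
small_r = prob_no_tail(m, 3)    # m >> 2^r: a tail of length 3 is almost certain
near_r = prob_no_tail(m, 10)    # m close to 2^r = 1024: roughly 1/e
large_r = prob_no_tail(m, 30)   # m << 2^r: a tail of length 30 is very unlikely

for r in (3, 10, 30):
    print(r, prob_no_tail(m, r), prob_no_tail_approx(m, r))
```

The transition from "almost surely seen" to "almost surely not seen" happens around $2^r \approx m$, which is the reason $2^R$ lands near m.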

74 Why It Doesn't Work

75 Generalization: Moments
Suppose a stream has elements chosen from a set A of N values.
Let $m_i$ be the number of times value i occurs in the stream.
The k-th moment is $\sum_{i \in A} (m_i)^k$.

79 Special Cases
$\sum_{i \in A} (m_i)^k$
0th moment = number of distinct elements. The problem just considered.
1st moment = count of the number of elements = length of the stream. Easy to compute.
2nd moment = surprise number S = a measure of how uneven the distribution is.

82 Example: Surprise Number
Stream of length 100, 11 distinct values.
Item counts: 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9. Surprise S = 910.
Item counts: 90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1. Surprise S = 8,110.
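
The two surprise numbers can be reproduced exactly with a few lines of Python. This sketch stores every count, which is precisely what a streaming algorithm tries to avoid; it is here only to make the definition of the k-th moment concrete. The function name `kth_moment` and the integer item labels are illustrative.

```python
from collections import Counter

def kth_moment(stream, k: int) -> int:
    """Exact k-th moment: sum over distinct items i of (m_i)^k."""
    counts = Counter(stream)
    return sum(c ** k for c in counts.values())

# Streams of length 100 over 11 distinct values, matching the item counts above.
even = [0] * 10 + [i for i in range(1, 11) for _ in range(9)]
skewed = [0] * 90 + list(range(1, 11))

print(kth_moment(even, 0), kth_moment(even, 1), kth_moment(even, 2))
print(kth_moment(skewed, 0), kth_moment(skewed, 1), kth_moment(skewed, 2))
```

Both streams have the same 0th and 1st moments, so only the 2nd moment distinguishes the even distribution from the skewed one.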

84 [Alon, Matias, and Szegedy] AMS Method

86 One Random Variable (X)

92 Expectation Analysis
[Figure: an example stream of a's and b's annotated with per-item counts.]
$m_i$ = total count of item i in the stream (we are assuming the stream has length n).
Group times by the value seen:
Time t when the last i is seen ($c_t = 1$).
Time t when the penultimate i is seen ($c_t = 2$).
Time t when the first i is seen ($c_t = m_i$).

94 Expectation Analysis
[Figure: the same example stream with per-item counts.]

96 Higher-Order Moments

98 Combining Samples

102 Streams Never End: Fixups
(1) The variables X have n as a factor: keep n separately; just hold the count in X.
(2) Suppose we can only store k counts. We must throw some Xs out as time goes on.
Objective: each starting time t is selected with probability k/n.
Solution (fixed-size sampling!): choose the first k times for k variables. When the n-th element arrives (n > k), choose it with probability k/n. If you choose it, throw one of the previously stored variables X out, with equal probability.
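
The fixed-size sampling rule in (2) can be sketched as follows. For simplicity the sketch keeps the sampled stream elements themselves rather than the X variables started at those times; the function name `reservoir_sample` and the optional `rng` parameter are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k: int, rng=None):
    """Keep a sample of size k in which each position is present with probability k/n."""
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)            # choose the first k times
        elif rng.random() < k / n:            # choose the n-th with probability k/n
            reservoir[rng.randrange(k)] = item  # evict one stored entry uniformly
    return reservoir

sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
```

The memory used stays at k entries no matter how long the stream grows, which is what makes the rule suitable for streams that never end.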

103 That's it!


More information

A Model for Learned Bloom Filters, and Optimizing by Sandwiching

A Model for Learned Bloom Filters, and Optimizing by Sandwiching A Model for Learned Bloom Filters, and Optimizing by Sandwiching Michael Mitzenmacher School of Engineering and Applied Sciences Harvard University michaelm@eecs.harvard.edu Abstract Recent work has suggested

More information

Application: Bucket Sort

Application: Bucket Sort 5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

CSE 473: Ar+ficial Intelligence. Hidden Markov Models. Bayes Nets. Two random variable at each +me step Hidden state, X i Observa+on, E i

CSE 473: Ar+ficial Intelligence. Hidden Markov Models. Bayes Nets. Two random variable at each +me step Hidden state, X i Observa+on, E i CSE 473: Ar+ficial Intelligence Bayes Nets Daniel Weld [Most slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at hnp://ai.berkeley.edu.]

More information
