B490 Mining the Big Data


Slide 1: Finding Similar Items (B490 Mining the Big Data). Qin Zhang.

Slides 2-7: Motivations

Finding similar documents/webpages/images:
- (Approximate) mirror sites. Application: we don't want to show both in Google results.
- Plagiarism, large quotations. Application: I am sure you know...
- Similar topics/articles from various places. Application: cluster articles by the same story.
- Google image search.
- Social network analysis.
- Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.

Slides 8-11: How can we compare similarity?

Convert the data (homework, webpages, images) into an object in an abstract space in which we know how to measure distance, and how to do so efficiently.

What abstract space? For example, the Euclidean space over R^d, the L_1 space, the L_∞ space, ...

Slides 12-15: Three essential techniques

- k-gram: convert documents to sets.
- Min-hashing: convert large sets to short signatures, while preserving similarity.
- Locality-sensitive hashing (LSH): map the sets to some space so that similar sets will be close in distance.

The usual way of finding similar items:

    Docs --(k-gram)--> a bag of strings of length k from the doc --(Min-hashing)--> a signature representing the sets --(LSH)--> points in some space

Slide 16: Jaccard Distance and Min-hashing

Slides 17-23: Convert documents to sets

k-gram: a sequence of k characters that appears in the doc. E.g., for doc = abcab and k = 2, the set of all 2-grams is {ab, bc, ca}.

Questions (see the sketch after this list):
- What is a good k?
- Shall we count replicas?
- How do we deal with white space and stop words (e.g., a, you, for, the, to, and, that)?
- Capitalization? Punctuation?
- Characters vs. words?
- How about different languages?
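A minimal sketch of character-level k-gram extraction; the function name and the choice to return a Python set (thereby not counting replicas) are mine, not from the slides:

```python
def kgrams(doc: str, k: int) -> set:
    """All length-k substrings of doc; duplicates collapse because we return a set."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: doc = "abcab", k = 2
print(kgrams("abcab", 2))   # {'ab', 'bc', 'ca'}
```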

Slide 24: Jaccard similarity

[Figure: two overlapping sets with intersection size 3 and union size 8, giving Jaccard similarity 3/8.]

Formally, the Jaccard similarity of two sets C_1, C_2 is

    JS(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|.

It can be turned into a distance function if we define d_JS(C_1, C_2) = 1 - JS(C_1, C_2). Actually, any similarity measure can be converted to a distance function.
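A one-line implementation over Python sets (a sketch; the function name is mine):

```python
def jaccard(c1: set, c2: set) -> float:
    """JS(C1, C2) = |C1 intersect C2| / |C1 union C2|."""
    return len(c1 & c2) / len(c1 | c2)

A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 5, 7, 9}
print(jaccard(A, B))       # 0.375 = 3/8
print(1 - jaccard(A, B))   # the Jaccard distance d_JS
```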

Slides 25-26: Jaccard similarity with clusters

Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B? (|A ∩ B| = |{0, 2, 5}| = 3 and |A ∪ B| = 8, so JS(A, B) = 3/8.)

With clusters: we may have some items which basically represent the same thing, so we group items into clusters. E.g., C_1 = {0, 1, 2}, C_2 = {3, 4}, C_3 = {5, 6}, C_4 = {7, 8, 9}. For instance, C_1 may represent action movies, C_2 comedies, C_3 documentaries, and C_4 horror movies.

Now we can represent A_clu = {C_1, C_3} and B_clu = {C_1, C_2, C_3, C_4}, and

    JS_clu(A, B) = JS(A_clu, B_clu) = |{C_1, C_3}| / |{C_1, C_2, C_3, C_4}| = 2/4 = 1/2.
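A sketch of the item-to-cluster mapping (the helper name is mine):

```python
def to_clusters(items: set, clusters: dict) -> set:
    """Map a set of items to the set of cluster labels it touches."""
    return {name for name, members in clusters.items() if items & members}

clusters = {"C1": {0, 1, 2}, "C2": {3, 4}, "C3": {5, 6}, "C4": {7, 8, 9}}
A = {0, 1, 2, 5, 6}
B = {0, 2, 3, 5, 7, 9}
A_clu = to_clusters(A, clusters)   # {'C1', 'C3'}
B_clu = to_clusters(B, clusters)   # {'C1', 'C2', 'C3', 'C4'}
print(len(A_clu & B_clu) / len(A_clu | B_clu))   # 0.5
```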

Slides 27-30: Min-hashing

Given sets S_1 = {1, 2, 5}, S_2 = {3}, S_3 = {2, 3, 4, 6}, S_4 = {1, 4, 6}, we can represent them as an item-set matrix:

    Item  S_1  S_2  S_3  S_4
     1     1    0    0    1
     2     1    0    1    0
     3     0    1    1    0
     4     0    0    1    1
     5     1    0    0    0
     6     0    0    1    1

Step 1: Randomly permute the rows (items), e.g., into the order 2, 6, 5, 3, 1, 4.

Step 2: Record the first 1 in each column, i.e., the first item in permuted order that the set contains: m(S_1) = 2, m(S_2) = 3, m(S_3) = 2, m(S_4) = 6.

Step 3: Estimate the Jaccard similarity JS(S_i, S_j) as

    X = 1 if m(S_i) = m(S_j), and 0 otherwise.
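A sketch of the permutation view; the slide's permuted matrix did not survive transcription, so the specific row order below is an assumed one that reproduces the recorded m values:

```python
import random

sets = {"S1": {1, 2, 5}, "S2": {3}, "S3": {2, 3, 4, 6}, "S4": {1, 4, 6}}

def minhash(perm: list, s: set) -> int:
    """Return the first item of the permutation that belongs to s."""
    return next(i for i in perm if i in s)

perm = [2, 6, 5, 3, 1, 4]          # one ordering consistent with the slide
for name, s in sets.items():
    print(name, minhash(perm, s))   # m(S1)=2, m(S2)=3, m(S3)=2, m(S4)=6

perm = random.sample(range(1, 7), 6)   # a fresh random permutation per repetition
```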

Slides 31-32: Min-hashing (cont.)

Lemma: Pr[m(S_i) = m(S_j)] = E[X] = JS(S_i, S_j). (Proof on the board.)

To approximate the Jaccard similarity within ε error with probability at least 1 - δ, repeat k = 1/ε^2 · log(1/δ) times and then take the average. (Chernoff bound on board.)

Slides 33-36: Min-hashing (cont.)

How to implement min-hashing efficiently? Make one pass over the data.

- Maintain k random hash functions {h_1, h_2, ..., h_k}, with each h_i : [N] → [N] chosen at random (more on the next slide).
- Maintain k counters {c_1, ..., c_k}, with each c_i (1 ≤ i ≤ k) initialized to ∞.

Algorithm Min-hashing:
    For each i ∈ S do
        for j = 1 to k do
            if h_j(i) < c_j then c_j = h_j(i)
    Output m_j(S) = c_j for each j ∈ [k]

The estimator is

    JS_k(S_i, S_i') = (1/k) · Σ_{j=1}^k 1(m_j(S_i) = m_j(S_i')).

Collisions? Use the Birthday Paradox. (A runnable sketch follows.)
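A runnable sketch of the one-pass algorithm. Truly random functions h_i : [N] → [N] are impractical, so this assumes the linear scheme (a·x + b) mod N from the hash-function slide that follows:

```python
import random

N = 2**31 - 1   # a large prime; hash range [N]

def make_hashes(k: int):
    """k random-looking hash functions h(x) = (a*x + b) mod N."""
    params = [(random.randrange(1, N), random.randrange(N)) for _ in range(k)]
    return [lambda x, a=a, b=b: (a * x + b) % N for a, b in params]

def signature(s, hashes):
    """One pass over the set: keep the minimum hash value per function."""
    sig = [float("inf")] * len(hashes)
    for i in s:
        for j, h in enumerate(hashes):
            sig[j] = min(sig[j], h(i))
    return sig

def estimate_js(sig1, sig2):
    """Fraction of coordinates on which the two signatures agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

hashes = make_hashes(100)
s1, s3 = {1, 2, 5}, {2, 3, 4, 6}
print(estimate_js(signature(s1, hashes), signature(s3, hashes)))
# close to JS(S1, S3) = |{2}| / |{1,2,3,4,5,6}| = 1/6
```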

Slides 37-39: Random hash functions

A random hash function h ∈ H maps a set A to a set B so that, over the random choice of h ∈ H:

1. The location h(x) ∈ B for x ∈ A is equally likely (uniform).
2. (Stronger) For any two x ≠ x' ∈ A, h(x) and h(x') are independent over the random choice of h ∈ H.

Even stronger: independence over any two disjoint subsets. Good, but hard to achieve in theory.

Some practical hash functions h : U → [m]:
- Modular hashing: h(x) = x mod m.
- Multiplicative hashing: h_a(x) = ⌊m · frac(x · a)⌋.
- Using primes: let p ∈ [m, 2m] be a prime, and let a, b ∈_R [p]. Then h(x) = ((ax + b) mod p) mod m.
- Tabulation hashing: h(x) = H_1[x_1] ⊕ ... ⊕ H_k[x_k], where ⊕ is bitwise exclusive or, x is viewed as k characters x_1 ... x_k, and each H_i is a random mapping table into [m] (stored in a fast cache).
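A sketch of the "using primes" scheme; the concrete choice m = 100, p = 101 is mine for illustration:

```python
import random

def prime_hash(m: int, p: int):
    """h(x) = ((a*x + b) mod p) mod m, with a, b drawn at random from [p]."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda x: ((a * x + b) % p) % m

h = prime_hash(m=100, p=101)     # 101 is a prime in [100, 200]
print([h(x) for x in range(10)])
```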

Slides 40-44: Finding similar items

Given a set of n = 1,000,000 items, we ask two questions:
- Which items are similar? We don't want to check all ~n^2 pairs.
- Given a query item, which others are similar to that item? We don't want to check all n items.

What if the set is n points in the plane R^2?
- Choice 1: hierarchical indexing trees (e.g., range tree, kd-tree, B-tree). Problem: they do not scale to high dimensions.
- Choice 2: lay down a (random) grid. Similar to the locality-sensitive hashing that we will explore.

Slide 45: Locality Sensitive Hashing

Slides 46-49: Locality Sensitive Hashing (LSH)

Definition: h is (l, u, p_l, p_u)-sensitive with respect to a distance function d if
- Pr[h(a) = h(b)] > p_l whenever d(a, b) < l;
- Pr[h(a) = h(b)] < p_u whenever d(a, b) > u.

For this definition to make sense, we need p_u < p_l for u > l. Ideally, we want p_l - p_u to be large even when u - l is small; then we can repeat and further amplify this effect (next slide).

Define d(a, b) = 1 - JS(a, b). The family of min-hashing functions is (d_1, d_2, 1 - d_1, 1 - d_2)-sensitive for any 0 ≤ d_1 < d_2 ≤ 1.

Not good enough: d_2 - d_1 = (1 - d_1) - (1 - d_2). Can we improve it?

Slide 50: Partition into bands

[Figure: docs D_1, D_2, D_3, D_4, each with a column of t min-hash values h_1, ..., h_t, split into b bands of r = t/b rows per band.]

Candidate pairs are those that hash to the same bucket for at least one band. Tune b and r to catch most similar pairs, but few non-similar pairs.

Slide 51: A bite of theory for LSH (using min-hashing)

Let s = JS(D_1, D_2) be the probability that D_1 and D_2 have a hash collision in one row. Then:
- s^r = probability that all hashes collide in one band;
- 1 - s^r = probability that not all collide in one band;
- (1 - s^r)^b = probability that in no band do all hashes collide;
- f(s) = 1 - (1 - s^r)^b = probability that all hashes collide in at least one band.

Slide 52: A bite of theory for LSH (cont.)

E.g., if we choose t = 15, r = 3 and b = 5, then f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48, f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) = 0.999.
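A few lines that reproduce the slide's table (the defaults r = 3, b = 5 match the slide's t = 15 example):

```python
def f(s: float, r: int = 3, b: int = 5) -> float:
    """Probability that all r hashes collide in at least one of b bands."""
    return 1 - (1 - s**r) ** b

for s in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]:
    print(f"f({s}) = {f(s):.3f}")
# f(0.5) = 0.487, f(0.8) = 0.970, f(0.9) = 0.999, ...
```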

Slides 53-54: A bite of theory for LSH (cont.)

Choice of r and b: usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up between r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^{1/r}. So given a similarity s that we want to use as a cut-off, we can solve s = (1/b)^{1/r} with r = t/b to yield r ≈ log_s(1/t).

If there is no budget on r and b, the S-curve gets sharper as they increase.

Slides 55-56: LSH using Euclidean distance

Besides the min-hashing distance function, let's use the Euclidean distance function for LSH:

    d_E(u, v) = ||u - v||_2 = sqrt( Σ_{i=1}^d (v_i - u_i)^2 ).

Hashing function h:
1. First take a random unit vector u ∈ R^d. A unit vector u satisfies ||u||_2 = 1, that is, d_E(u, 0) = 1. We will see how to generate a random unit vector later.
2. Project a, b ∈ P ⊂ R^d onto u: a_u = ⟨a, u⟩ = Σ_{i=1}^d a_i u_i. This is contractive: |a_u - b_u| ≤ ||a - b||_2.
3. Create bins of size γ on u (that is, on R^1), better with a random shift z ∈ [0, γ). h(a) = the index of the bin that a falls into.

Slide 57: Property of LSH using Euclidean distance

The hash function h above is (γ/2, 2γ, 1/2, 1/3)-sensitive:
- If ||a - b||_2 < γ/2, then Pr[h(a) = h(b)] > 1/2.
- If ||a - b||_2 > 2γ, then Pr[h(a) = h(b)] < 1/3.
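A sketch of one such hash function. The slides defer how to generate a random unit vector; here I assume the standard recipe of normalizing a vector of independent Gaussians:

```python
import math
import random

def make_euclidean_lsh(d: int, gamma: float):
    """Project onto a random unit vector u, shift by z in [0, gamma), bucket by gamma."""
    g = [random.gauss(0, 1) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    u = [x / norm for x in g]            # random unit vector
    z = random.uniform(0, gamma)         # random shift
    def h(a):
        a_u = sum(ai * ui for ai, ui in zip(a, u))   # projection <a, u>
        return math.floor((a_u + z) / gamma)          # bin index
    return h

h = make_euclidean_lsh(d=3, gamma=1.0)
print(h((0.1, 0.2, 0.3)), h((0.1, 0.2, 0.35)))   # nearby points usually collide
```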

Slides 58-59: Entropy LSH

Problems with the basic LSH: it needs many hash functions h_1, h_2, ..., h_t, which consume a lot of space and are hard to implement in the distributed computation model.

Entropy LSH solves the first issue. When querying h(q), additionally query several offsets q + δ_i (1 ≤ i ≤ L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r. Intuition: the data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it. This needs far fewer hash functions.

Slide 60: Distances

Slides 61-63: Distance functions

A distance d : X × X → R_+ is a metric if:
(M1) (non-negativity) d(a, b) ≥ 0
(M2) (identity) d(a, b) = 0 if and only if a = b
(M3) (symmetry) d(a, b) = d(b, a)
(M4) (triangle inequality) d(a, b) ≤ d(a, c) + d(c, b)

A distance that satisfies (M1), (M3) and (M4) is called a pseudometric. A distance that satisfies (M1), (M2) and (M4) is called a quasimetric.

Slides 64-66: Distance functions (cont.)

L_p distances on two vectors a = (a_1, ..., a_d), b = (b_1, ..., b_d):

    d_p(a, b) = ||a - b||_p = ( Σ_{i=1}^d |a_i - b_i|^p )^{1/p}

- L_2: the Euclidean distance.
- L_1: also known as the Manhattan distance; d_1(a, b) = Σ_{i=1}^d |a_i - b_i|.
- L_0: d_0(a, b) = d - Σ_{i=1}^d 1(a_i = b_i), the number of coordinates in which a and b differ. If a_i, b_i ∈ {0, 1}, then this is called the Hamming distance.
- L_∞: d_∞(a, b) = max_{i=1..d} |a_i - b_i|.

Slide 67: Distance functions (cont.)

Jaccard distance: d_J(A, B) = 1 - JS(A, B) = 1 - |A ∩ B| / |A ∪ B|. It is a metric; proof (on board).

Slides 68-71: Distance functions (cont.)

Cosine distance: a pseudometric.

    d_cos(a, b) = 1 - ⟨a, b⟩ / (||a||_2 ||b||_2) = 1 - (Σ_{i=1}^d a_i b_i) / (||a||_2 ||b||_2)

We can also develop an LSH function h for d_cos(·, ·), as follows. Choose a random vector v ∈ R^d, then let

    h_v(a) = +1 if ⟨v, a⟩ > 0, and -1 otherwise.

It is (γ, φ, (π - γ)/π, φ/π)-sensitive for any γ < φ ∈ [0, π].
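A sketch of the random-hyperplane hash; drawing v from independent Gaussians is my assumption (the slide only says "a random vector"):

```python
import random

def make_cosine_lsh(d: int):
    """Random-hyperplane LSH for cosine distance: the sign of <v, a>."""
    v = [random.gauss(0, 1) for _ in range(d)]
    return lambda a: 1 if sum(vi * ai for vi, ai in zip(v, a)) > 0 else -1

h = make_cosine_lsh(d=3)
print(h((1.0, 0.0, 0.0)), h((0.9, 0.1, 0.0)))   # small angle -> likely equal
```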

Slide 72: Distance functions (cont.)

Edit distance. Consider two strings a, b, and let d_ed(a, b) = the number of operations needed to make a = b, where an operation can insert a letter, delete a letter, or substitute a letter. Application: Google's auto-correct. It is expensive to compute.
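The standard dynamic program, for reference; its O(|a|·|b|) running time is the expense alluded to above:

```python
def edit_distance(a: str, b: str) -> int:
    """d[i][j] = edit distance between the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute (or match)
    return d[m][n]

print(edit_distance("kitten", "sitting"))   # 3
```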

Slide 73: Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v_1, ..., v_n} and E = {e_1, ..., e_m}, the graph distance d_G(v_i, v_j) between any pair v_i, v_j ∈ V is defined as the length of the shortest path between v_i and v_j.

Slides 74-76: Distance functions (cont.)

Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p_1, ..., p_d} and Q = {q_1, ..., q_d},

    d_KL(P, Q) = Σ_{i=1}^d p_i ln(p_i / q_i).

A distance that is NOT a metric (for one, it is not symmetric). It can be written as H(P, Q) - H(P), the cross-entropy minus the entropy.
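A small sketch making the asymmetry concrete (the example distributions are mine):

```python
import math

def kl(p, q):
    """d_KL(P, Q) = sum_i p_i ln(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P, Q = [0.5, 0.5], [0.9, 0.1]
print(kl(P, Q), kl(Q, P))   # the two values differ: KL is not symmetric
```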

Slide 77: Distance functions (cont.)

Given two multisets A, B in the grid [Δ]^2 with |A| = |B| = N, the Earth-Mover Distance (EMD) is defined as the minimum cost of a perfect matching between the points in A and B, that is,

    EMD(A, B) = min over bijections π : A → B of Σ_{a ∈ A} ||a - π(a)||_1.

Good for comparing images.

Slide 78: Thank you! Some slides are based on the MMDS book and Jeff Phillips' lecture notes.
