B490 Mining the Big Data
Finding Similar Items
Qin Zhang
Motivations

Finding similar documents/webpages/images:
- (Approximate) mirror sites. Application: we don't want to show both in Google search results.
- Plagiarism, large quotations. Application: I am sure you know.
- Similar topics/articles from various places. Application: cluster articles about the same story.
- Google image search.

Social network analysis.

Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.
How can we compare similarity?

Convert the data (homework, webpages, images) into an object in an abstract space where we know how to measure distance, and how to do it efficiently.

What abstract space? For example, the Euclidean space over R^d, the L_1 space, the L_infinity space, ...
Three essential techniques

- k-gram: convert documents to sets.
- Min-hashing: convert large sets to short signatures, while preserving similarity.
- Locality-sensitive hashing (LSH): map the sets to some space so that similar sets will be close in distance.

The usual way of finding similar items:
Docs --(k-gram)--> a bag of strings of length k from the doc --(Min-hashing)--> a signature representing the sets --(LSH)--> points in some space
Jaccard Distance and Min-hashing
Convert documents to sets

k-gram: a sequence of k characters that appears in the doc. E.g., for doc = abcab and k = 2, the set of all 2-grams is {ab, bc, ca}.

- Question: What is a good k?
- Question: Shall we count replicas?
- Question: How to deal with white space and stop words (e.g., a, you, for, the, to, and, that)?
- Question: Capitalization? Punctuation?
- Question: Characters vs. words?
- Question: How about different languages?
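The 2-gram example above can be sketched in Python (function name is illustrative; using a set discards replicas, and whitespace/capitalization are left untouched):

```python
def k_grams(doc, k):
    """Return the set of all character k-grams (length-k substrings) in doc."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

# The slide's example: doc = "abcab" with k = 2
print(sorted(k_grams("abcab", 2)))  # ['ab', 'bc', 'ca']
```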
Jaccard similarity

Example: intersection size 3, union size 8, so the Jaccard similarity is 3/8.

Formally, the Jaccard similarity of two sets C_1, C_2 is

    JS(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|.

It can be turned into a distance function by defining d_JS(C_1, C_2) = 1 - JS(C_1, C_2). In fact, any similarity measure can be converted to a distance function this way.
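A direct implementation of JS and d_JS (the convention for two empty sets is an assumption, not from the slides):

```python
def jaccard_similarity(c1, c2):
    """JS(C1, C2) = |C1 intersect C2| / |C1 union C2|."""
    c1, c2 = set(c1), set(c2)
    if not c1 and not c2:
        return 1.0  # assumed convention: two empty sets are identical
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    """d_JS(C1, C2) = 1 - JS(C1, C2)."""
    return 1.0 - jaccard_similarity(c1, c2)
```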
Jaccard similarity with clusters

Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B? JS(A, B) = |{0, 2, 5}| / |{0, 1, 2, 3, 5, 6, 7, 9}| = 3/8.

With clusters: we may have some items which basically represent the same thing, so we group objects into clusters. E.g., C_1 = {0, 1, 2}, C_2 = {3, 4}, C_3 = {5, 6}, C_4 = {7, 8, 9}. For instance, C_1 may represent action movies, C_2 comedies, C_3 documentaries, and C_4 horror movies.

Now we can represent A_clu = {C_1, C_3}, B_clu = {C_1, C_2, C_3, C_4}, and

    JS_clu(A, B) = JS(A_clu, B_clu) = |{C_1, C_3}| / |{C_1, C_2, C_3, C_4}| = 1/2.
Min-hashing

Given sets S_1 = {1, 2, 5}, S_2 = {3}, S_3 = {2, 3, 4, 6}, S_4 = {1, 4, 6}, we can represent them as a characteristic matrix:

    Item  S_1  S_2  S_3  S_4
    1      1    0    0    1
    2      1    0    1    0
    3      0    1    1    0
    4      0    0    1    1
    5      1    0    0    0
    6      0    0    1    1

Step 1: Randomly permute the rows (items).

Step 2: Record the first 1 in each column, i.e., the first item (in permuted order) contained in each set. E.g., for one permutation: m(S_1) = 2, m(S_2) = 3, m(S_3) = 2, m(S_4) = 6.

Step 3: Estimate the Jaccard similarity JS(S_i, S_j) by

    X = 1 if m(S_i) = m(S_j), and 0 otherwise.
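Steps 1-3 can be sketched with an explicit random permutation (function and variable names are illustrative):

```python
import random

def minhash(sets, universe, seed):
    """One random permutation of the items; m(S) = first item of S in permuted order."""
    perm = list(universe)
    random.Random(seed).shuffle(perm)                 # Step 1: randomly permute the rows
    return {name: next(x for x in perm if x in s)     # Step 2: first 1 in each column
            for name, s in sets.items()}

sets = {"S1": {1, 2, 5}, "S2": {3}, "S3": {2, 3, 4, 6}, "S4": {1, 4, 6}}
m = minhash(sets, universe=range(1, 7), seed=0)
```

Averaging the Step-3 indicator over many independent permutations estimates JS, which is the content of the lemma on the next slide.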
Min-hashing (cont.)

Lemma: Pr[m(S_i) = m(S_j)] = E[X] = JS(S_i, S_j). (Proof on the board.)

To approximate the Jaccard similarity within ε error with probability at least 1 - δ, repeat k = O(1/ε^2 · log(1/δ)) times and take the average. (Chernoff bound on the board.)
Min-hashing (cont.)

How to implement min-hashing efficiently? Make one pass over the data.

- Maintain k random hash functions {h_1, h_2, ..., h_k}, with each h_j: [N] -> [N] chosen at random (more on the next slide).
- Maintain k counters {c_1, ..., c_k}, with each c_j (1 <= j <= k) initialized to infinity.

Algorithm Min-hashing
    For each i in S do
        for j = 1 to k do
            If h_j(i) < c_j then c_j = h_j(i)
    Output m_j(S) = c_j for each j in [k]

Estimate: JS_k(S_i, S_j) = (1/k) * sum_{l=1}^{k} 1(m_l(S_i) = m_l(S_j)).

Collisions? Use the Birthday Paradox to bound their probability.
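A sketch of the one-pass algorithm above. As an assumption to keep collisions vanishingly unlikely, it hashes into a large prime range rather than literally [N] -> [N]; names are illustrative:

```python
import random

P = 2 ** 61 - 1  # a Mersenne prime, comfortably larger than the universe

def make_hashes(k, seed=0):
    """k random hash functions h_j(x) = (a_j * x + b_j) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash_signature(S, hashes):
    """One pass over S, maintaining k counters initialized to infinity."""
    c = [float("inf")] * len(hashes)
    for i in S:
        for j, (a, b) in enumerate(hashes):
            h = (a * i + b) % P
            if h < c[j]:
                c[j] = h
    return c

def estimate_js(sig1, sig2):
    """JS_k: fraction of the k coordinates on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```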
Random hash functions

A random hash function h ∈ H maps a set A to a set B such that, over the random choice of h ∈ H:

1. For each x ∈ A, the location h(x) ∈ B is equally likely to be any element of B.
2. (Stronger) For any two x ≠ x' ∈ A, h(x) and h(x') are independent.

Even stronger: independence over any disjoint subsets. Good, but hard to achieve in theory.

Some practical hash functions h: U -> [m]:

- Modular hashing: h(x) = x mod m.
- Multiplicative hashing: h_a(x) = floor(m * frac(x * a)).
- Using primes: let p ∈ [m, 2m] be a prime and pick a, b uniformly at random from [p]; then h(x) = ((a*x + b) mod p) mod m.
- Tabulation hashing: h(x) = H_1[x_1] XOR ... XOR H_k[x_k], where XOR is bitwise exclusive-or and each H_i is a random mapping table into [m] (stored in a fast cache).
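The "using primes" scheme, sketched with a fixed large Mersenne prime rather than one drawn from [m, 2m] (an assumption that keeps the sketch short; any prime >= m works for universality):

```python
import random

def make_prime_hash(m, p=2 ** 61 - 1, seed=None):
    """h(x) = ((a*x + b) mod p) mod m, with a, b drawn at random mod the prime p."""
    rng = random.Random(seed)
    a = rng.randrange(1, p)  # a != 0
    b = rng.randrange(p)
    return lambda x: ((a * x + b) % p) % m

h = make_prime_hash(10, seed=7)
```

Once (a, b) are drawn, h is a fixed deterministic function; the randomness is only in the choice of the function from the family.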
Finding similar items

Given a set of n = 1,000,000 items, we ask two questions:

- Which items are similar? We don't want to check all n^2 pairs.
- Given a query item, which others are similar to it? We don't want to check all n items.

What if the set is n points in the plane R^2?

- Choice 1: hierarchical indexing trees (e.g., range tree, kd-tree, B-tree). Problem: they do not scale to high dimensions.
- Choice 2: lay down a (random) grid. Similar to the locality-sensitive hashing that we will explore.
Locality Sensitive Hashing
Locality Sensitive Hashing (LSH)

Definition: h is (l, u, p_l, p_u)-sensitive with respect to a distance function d if
- Pr[h(a) = h(b)] > p_l whenever d(a, b) < l, and
- Pr[h(a) = h(b)] < p_u whenever d(a, b) > u.

For this definition to make sense, we need p_u < p_l for u > l. Ideally, we want p_l - p_u to be large even when u - l is small; then we can repeat to further amplify this effect (next slide).

Define d(a, b) = 1 - JS(a, b). The family of min-hashing functions is (d_1, d_2, 1 - d_1, 1 - d_2)-sensitive for any 0 <= d_1 < d_2 <= 1. Not good enough: d_2 - d_1 = (1 - d_1) - (1 - d_2). Can we improve it?
Partition into bands

Take t min-hash values h_1, ..., h_t for each document D_1, D_2, D_3, D_4, ..., and split the t rows of the signature matrix into b bands of r = t/b rows each.

Candidate pairs are those that hash to the same bucket in at least 1 band. Tune b and r to catch most similar pairs, but few non-similar pairs.
A bite of theory for LSH (using min-hashing)

With t min-hash values split into b bands of r = t/b rows, let s = JS(D_1, D_2) be the probability that D_1 and D_2 collide on a single hash. Then:

- s^r = probability that all hashes collide in one particular band
- 1 - s^r = probability that not all of them collide in that band
- (1 - s^r)^b = probability that in no band do all hashes collide
- f(s) = 1 - (1 - s^r)^b = probability that all hashes collide in at least 1 band

E.g., if we choose t = 15, r = 3 and b = 5, then f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48, f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) ≈ 1.
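The S-curve f is one line to compute, and reproduces the table above for t = 15, r = 3, b = 5:

```python
def f(s, r, b):
    """Probability of becoming a candidate pair: collide in at least one of b bands of r rows."""
    return 1 - (1 - s ** r) ** b

for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    print(f"f({s}) = {f(s, 3, 5):.3f}")
```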
A bite of theory for LSH (cont.)

Choice of r and b: usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up among r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^(1/r). So given a similarity s that we want to use as a cut-off, we can solve for r = t/b in s = (1/b)^(1/r) to yield r ≈ log_s(1/t). If there is no budget on r and b, then as they increase the S-curve gets sharper.
LSH using Euclidean distance

Besides the min-hashing distance function, let's use the Euclidean distance function for LSH:

    d_E(u, v) = ||u - v||_2 = sqrt( sum_{i=1}^{d} (v_i - u_i)^2 )

Hashing function h:

1. First take a random unit vector u ∈ R^d. (A unit vector u satisfies ||u||_2 = 1, i.e., d_E(u, 0) = 1. We will see how to generate a random unit vector later.)
2. Project each point a ∈ P ⊆ R^d onto u: a_u = <a, u> = sum_{i=1}^{d} a_i u_i. This is contractive: |a_u - b_u| <= ||a - b||_2.
3. Create bins of size γ on u (that is, on R^1), better with a random shift z ∈ [0, γ). Then h(a) = index of the bin that a_u falls into.
Property of LSH using Euclidean distance

The hash function h above is (γ/2, 2γ, 1/2, 1/3)-sensitive:
- If ||a - b||_2 < γ/2, then Pr[h(a) = h(b)] > 1/2.
- If ||a - b||_2 > 2γ, then Pr[h(a) = h(b)] < 1/3.
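A sketch of this hash function. Sampling the random unit vector as normalized Gaussian coordinates is an assumption (a standard way to get a uniform direction; the slides defer this point):

```python
import math
import random

def make_euclidean_lsh(d, gamma, seed=0):
    """Project onto a random unit vector u, then bin the line into intervals of
    size gamma with a random shift z in [0, gamma)."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]          # Gaussian direction...
    norm = math.sqrt(sum(x * x for x in v))
    u = [x / norm for x in v]                        # ...normalized: a random unit vector
    z = rng.uniform(0, gamma)
    def h(a):
        a_u = sum(ai * ui for ai, ui in zip(a, u))   # contractive projection <a, u>
        return math.floor((a_u + z) / gamma)         # index of the bin
    return h
```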
Entropy LSH

Problems with the basic LSH: it needs many hash functions h_1, h_2, ..., h_t, which consume a lot of space, and it is hard to implement in the distributed computation model.

Entropy LSH solves the first issue. When querying h(q), additionally query several offsets q + δ_i (1 <= i <= L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r. Intuition: the data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it. This needs far fewer hash functions.
Distances
Distance functions

A distance d: X × X -> R_+ is a metric if
(M1) (non-negativity) d(a, b) >= 0
(M2) (identity) d(a, b) = 0 if and only if a = b
(M3) (symmetry) d(a, b) = d(b, a)
(M4) (triangle inequality) d(a, b) <= d(a, c) + d(c, b)

A distance that satisfies (M1), (M3) and (M4) is called a pseudometric. A distance that satisfies (M1), (M2) and (M4) is called a quasimetric.
Distance functions (cont.)

L_p distances on two vectors a = (a_1, ..., a_d), b = (b_1, ..., b_d):

    d_p(a, b) = ||a - b||_p = ( sum_{i=1}^{d} |a_i - b_i|^p )^(1/p)

- L_2: the Euclidean distance.
- L_1: also known as the Manhattan distance, d_1(a, b) = sum_{i=1}^{d} |a_i - b_i|.
- L_0: d_0(a, b) = sum_{i=1}^{d} 1(a_i ≠ b_i), the number of coordinates that differ. If a_i, b_i ∈ {0, 1}, this is called the Hamming distance.
- L_infinity: d_infinity(a, b) = max_{i=1}^{d} |a_i - b_i|.
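All of these can be computed by one small function (the handling of p = 0 and p = infinity follows the definitions above):

```python
def lp_distance(a, b, p):
    """d_p(a, b) = (sum_i |a_i - b_i|^p)^(1/p), with the L_0 and L_inf conventions."""
    diffs = [abs(ai - bi) for ai, bi in zip(a, b)]
    if p == 0:                    # L_0: number of coordinates that differ
        return sum(d != 0 for d in diffs)
    if p == float("inf"):         # L_inf: largest coordinate difference
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)
```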
Distance functions (cont.)

Jaccard distance: d_J(A, B) = 1 - JS(A, B) = 1 - |A ∩ B| / |A ∪ B|. It is a metric. (Proof on the board.)
Distance functions (cont.)

Cosine distance (a pseudometric):

    d_cos(a, b) = 1 - <a, b> / (||a||_2 ||b||_2) = 1 - ( sum_{i=1}^{d} a_i b_i ) / (||a||_2 ||b||_2)

We can also develop an LSH function h for d_cos(·, ·), as follows. Choose a random vector v ∈ R^d, and let

    h_v(a) = +1 if <v, a> > 0, and -1 otherwise.

It is (γ, φ, (π - γ)/π, φ/π)-sensitive for any γ < φ in [0, π].
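A sketch of this family (often called SimHash). Drawing v with Gaussian coordinates is an assumption; only the direction of v matters, so any spherically symmetric choice works:

```python
import random

def make_cosine_lsh(d, seed=0):
    """h_v(a) = +1 if <v, a> > 0 else -1, for a random Gaussian vector v."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    return lambda a: 1 if sum(vi * ai for vi, ai in zip(v, a)) > 0 else -1
```

Note that h_v sees only the direction of a, which is why it pairs with a distance that ignores vector lengths.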
Distance functions (cont.)

Edit distance. Consider two strings a, b, and let d_ed(a, b) = the number of operations needed to turn a into b, where an operation inserts a letter, deletes a letter, or substitutes a letter. Application: Google's auto-correct. It is expensive to compute.
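Though expensive on long inputs, d_ed is a classic dynamic program; a space-efficient sketch keeping one row of the table:

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions, each of cost 1."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))  # dp[j] = distance from the empty prefix of a to b[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i                       # prev holds the diagonal entry
        for j in range(1, n + 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                           # delete a[i-1]
                dp[j - 1] + 1,                       # insert b[j-1]
                prev + (a[i - 1] != b[j - 1]))       # substitute (free if equal)
    return dp[n]
```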
Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v_1, ..., v_n} and E = {e_1, ..., e_m}, the graph distance d_G between any pair v_i, v_j ∈ V is defined as the length of the shortest path between v_i and v_j.
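On an unweighted graph, d_G can be computed by breadth-first search; a sketch using an adjacency-list dictionary (the representation is an assumption):

```python
import math
from collections import deque

def graph_distance(adj, s, t):
    """d_G(s, t): number of edges on a shortest path, via breadth-first search."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        if v == t:
            return dist[v]
        for w in adj.get(v, ()):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return math.inf  # t is unreachable from s
```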
Distance functions (cont.)

Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p_1, ..., p_d} and Q = {q_1, ..., q_d},

    d_KL(P, Q) = sum_{i=1}^{d} p_i ln(p_i / q_i).

It is a distance that is NOT a metric (for instance, it is not symmetric). It can be written as H(P, Q) - H(P), the cross-entropy minus the entropy.
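A direct computation, using the common convention that terms with p_i = 0 contribute 0 (an assumption; the slides do not discuss zero probabilities):

```python
import math

def kl_divergence(P, Q):
    """d_KL(P, Q) = sum_i p_i * ln(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)
```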
Distance functions (cont.)

Given two multisets A, B in a two-dimensional grid with |A| = |B| = N, the Earth-Mover Distance (EMD) is defined as the minimum cost of a perfect matching between the points of A and B, that is,

    EMD(A, B) = min over bijections π: A -> B of sum_{a ∈ A} ||a - π(a)||_1.

Good for comparing images.
Thank you!

Some slides are based on the MMDS book and Jeff Phillips' lecture notes.
More informationAlgorithms for Querying Noisy Distributed/Streaming Datasets
Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias
More informationLecture 3 Sept. 4, 2014
CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.
More informationLecture 24: Approximate Counting
CS 710: Complexity Theory 12/1/2011 Lecture 24: Approximate Counting Instructor: Dieter van Melkebeek Scribe: David Guild and Gautam Prakriya Last time we introduced counting problems and defined the class
More informationV.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74
V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:
More informationLecture 11: Hash Functions, Merkle-Damgaard, Random Oracle
CS 7880 Graduate Cryptography October 20, 2015 Lecture 11: Hash Functions, Merkle-Damgaard, Random Oracle Lecturer: Daniel Wichs Scribe: Tanay Mehta 1 Topics Covered Review Collision-Resistant Hash Functions
More information2 How many distinct elements are in a stream?
Dealing with Massive Data January 31, 2011 Lecture 2: Distinct Element Counting Lecturer: Sergei Vassilvitskii Scribe:Ido Rosen & Yoonji Shin 1 Introduction We begin by defining the stream formally. Definition
More informationCryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Module No. # 01 Lecture No. # 08 Shannon s Theory (Contd.)
More informationLecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.
COMS 4995-3: Advanced Algorithms Feb 15, 2017 Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing. Instructor: Alex Andoni Scribes: Weston Jackson, Edo Roth 1 Introduction Today s lecture is
More informationCOS597D: Information Theory in Computer Science September 21, Lecture 2
COS597D: Information Theory in Computer Science September 1, 011 Lecture Lecturer: Mark Braverman Scribe: Mark Braverman In the last lecture, we introduced entropy H(X), and conditional entry H(X Y ),
More informationApproximate counting: count-min data structure. Problem definition
Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem
More information5199/IOC5063 Theory of Cryptology, 2014 Fall
5199/IOC5063 Theory of Cryptology, 2014 Fall Homework 2 Reference Solution 1. This is about the RSA common modulus problem. Consider that two users A and B use the same modulus n = 146171 for the RSA encryption.
More informationMAT2377. Ali Karimnezhad. Version September 9, Ali Karimnezhad
MAT2377 Ali Karimnezhad Version September 9, 2015 Ali Karimnezhad Comments These slides cover material from Chapter 1. In class, I may use a blackboard. I recommend reading these slides before you come
More informationMeasures. 1 Introduction. These preliminary lecture notes are partly based on textbooks by Athreya and Lahiri, Capinski and Kopp, and Folland.
Measures These preliminary lecture notes are partly based on textbooks by Athreya and Lahiri, Capinski and Kopp, and Folland. 1 Introduction Our motivation for studying measure theory is to lay a foundation
More informationLecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.
CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the
More informationThe first bound is the strongest, the other two bounds are often easier to state and compute. Proof: Applying Markov's inequality, for any >0 we have
The first bound is the strongest, the other two bounds are often easier to state and compute Proof: Applying Markov's inequality, for any >0 we have Pr (1 + ) = Pr For any >0, we can set = ln 1+ (4.4.1):
More informationLinear Sketches A Useful Tool in Streaming and Compressive Sensing
Linear Sketches A Useful Tool in Streaming and Compressive Sensing Qin Zhang 1-1 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M =
More informationCS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =
CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14 1 Probability First, recall a couple useful facts from last time about probability: Linearity of expectation: E(aX + by ) = ae(x)
More informationTwo-batch liar games on a general bounded channel
Two-batch liar games on a general bounded channel R.B. Ellis 1 K.L. Nyman 2 1 Illinois Institute of Technology 2 Loyola University Chicago BilleraFest Ellis, Nyman (June 14, 2008) Liar Games BilleraFest
More information6.1 Occupancy Problem
15-859(M): Randomized Algorithms Lecturer: Anupam Gupta Topic: Occupancy Problems and Hashing Date: Sep 9 Scribe: Runting Shi 6.1 Occupancy Problem Bins and Balls Throw n balls into n bins at random. 1.
More informationWhat makes groups different?
What makes groups different? James B. Wilson Department of Mathematics http://www.math.colostate.edu/ jwilson Why the interest in symmetry Beyond aesthetics and curiosity, symmetry receives attention because:
More informationINTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class
Dr. Thomas E. Hicks Data Abstractions Homework - Hashing -1 - INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University Data Set - SSN's from UTSA Class 467 13 3881 498 66 2055 450 27 3804 456 49 5261
More informationCSE 321 Discrete Structures
CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold
More information6.842 Randomness and Computation Lecture 5
6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its
More informationRandomized Algorithms
Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More informationCS425: Algorithms for Web Scale Data
CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org Customer
More informationCSE 190, Great ideas in algorithms: Pairwise independent hash functions
CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining
More informationCOMPSCI 514: Algorithms for Data Science
COMPSCI 514: Algoritms for Data Science Arya Mazumdar University of Massacusetts at Amerst Fall 2018 Lecture 11 Locality Sensitive Hasing Midterm exam Average 28.96 out of 35 (82.8%). One exam is still
More informationDiscrete Probability
Discrete Probability Counting Permutations Combinations r- Combinations r- Combinations with repetition Allowed Pascal s Formula Binomial Theorem Conditional Probability Baye s Formula Independent Events
More informationLecture 4: Hashing and Streaming Algorithms
CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 4: Hashing and Streaming Algorithms Lecturer: Shayan Oveis Gharan 01/18/2017 Scribe: Yuqing Ai Disclaimer: These notes have not been subjected
More informationAlgorithms lecture notes 1. Hashing, and Universal Hash functions
Algorithms lecture notes 1 Hashing, and Universal Hash functions Algorithms lecture notes 2 Can we maintain a dictionary with O(1) per operation? Not in the deterministic sense. But in expectation, yes.
More informationLecture 18: March 15
CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may
More informationHash tables. Hash tables
Basic Probability Theory Two events A, B are independent if Conditional probability: Pr[A B] = Pr[A] Pr[B] Pr[A B] = Pr[A B] Pr[B] The expectation of a (discrete) random variable X is E[X ] = k k Pr[X
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More informationOutline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties
Outline Approximation: Theory and Algorithms Nikolaus Augsten Free University of Bozen-Bolzano Faculty of Computer Science DIS Unit 3 March 13, 2009 2 3 Nikolaus Augsten (DIS) Approximation: Theory and
More informationHash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a
Hash Tables Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a mapping from U to M = {1,..., m}. A collision occurs when two hashed elements have h(x) =h(y).
More informationPart 1: Hashing and Its Many Applications
1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random
More information12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]
Calvin: There! I finished our secret code! Hobbes: Let s see. Calvin: I assigned each letter a totally random number, so the code will be hard to crack. For letter A, you write 3,004,572,688. B is 28,731,569½.
More informationLecture 4: Counting, Pigeonhole Principle, Permutations, Combinations Lecturer: Lale Özkahya
BBM 205 Discrete Mathematics Hacettepe University http://web.cs.hacettepe.edu.tr/ bbm205 Lecture 4: Counting, Pigeonhole Principle, Permutations, Combinations Lecturer: Lale Özkahya Resources: Kenneth
More information6.854 Advanced Algorithms
6.854 Advanced Algorithms Homework Solutions Hashing Bashing. Solution:. O(log U ) for the first level and for each of the O(n) second level functions, giving a total of O(n log U ) 2. Suppose we are using
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationAditya Bhaskara CS 5968/6968, Lecture 1: Introduction and Review 12 January 2016
Lecture 1: Introduction and Review We begin with a short introduction to the course, and logistics. We then survey some basics about approximation algorithms and probability. We also introduce some of
More information2030 LECTURES. R. Craigen. Inclusion/Exclusion and Relations
2030 LECTURES R. Craigen Inclusion/Exclusion and Relations The Principle of Inclusion-Exclusion 7 ROS enumerates the union of disjoint sets. What if sets overlap? Some 17 out of 30 students in a class
More informationWarm-up Using the given data Create a scatterplot Find the regression line
Time at the lunch table Caloric intake 21.4 472 30.8 498 37.7 335 32.8 423 39.5 437 22.8 508 34.1 431 33.9 479 43.8 454 42.4 450 43.1 410 29.2 504 31.3 437 28.6 489 32.9 436 30.6 480 35.1 439 33.0 444
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationProximity problems in high dimensions
Proximity problems in high dimensions Ioannis Psarros National & Kapodistrian University of Athens March 31, 2017 Ioannis Psarros Proximity problems in high dimensions March 31, 2017 1 / 43 Problem definition
More informationWhy duplicate detection?
Near-Duplicates Detection Naama Kraus Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze Some slides are courtesy of Kira Radinsky Why duplicate detection?
More informationRandom Lifts of Graphs
27th Brazilian Math Colloquium, July 09 Plan of this talk A brief introduction to the probabilistic method. A quick review of expander graphs and their spectrum. Lifts, random lifts and their properties.
More informationIntroduction to Hash Tables
Introduction to Hash Tables Hash Functions A hash table represents a simple but efficient way of storing, finding, and removing elements. In general, a hash table is represented by an array of cells. In
More informationRecap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?
CS276A Information Retrieval Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces Lecture 7 This
More informationIntroduction to Randomized Algorithms III
Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November 2017 1 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability
More informationCS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)
CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose
More informationLecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events
Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events Discrete Structures II (Summer 2018) Rutgers University Instructor: Abhishek
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Advanced Frequent Pattern Mining & Locality Sensitivity Hashing Huan Sun, CSE@The Ohio State University /7/27 Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan
More informationMotivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis
Motivation Introduction to Algorithms Hash Tables CSE 680 Prof. Roger Crawfis Arrays provide an indirect way to access a set. Many times we need an association between two sets, or a set of keys and associated
More informationTopics in Approximation Algorithms Solution for Homework 3
Topics in Approximation Algorithms Solution for Homework 3 Problem 1 We show that any solution {U t } can be modified to satisfy U τ L τ as follows. Suppose U τ L τ, so there is a vertex v U τ but v L
More informationLecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox
Lecture 04: Balls and Bins: Overview In today s lecture we will start our study of balls-and-bins problems We shall consider a fundamental problem known as the Recall: Inequalities I Lemma Before we begin,
More informationComputational Models - Lecture 3 1
Computational Models - Lecture 3 1 Handout Mode Iftach Haitner and Yishay Mansour. Tel Aviv University. March 13/18, 2013 1 Based on frames by Benny Chor, Tel Aviv University, modifying frames by Maurice
More information1 Alphabets and Languages
1 Alphabets and Languages Look at handout 1 (inference rules for sets) and use the rules on some examples like {a} {{a}} {a} {a, b}, {a} {{a}}, {a} {{a}}, {a} {a, b}, a {{a}}, a {a, b}, a {{a}}, a {a,
More information1 Take-home exam and final exam study guide
Math 215 - Introduction to Advanced Mathematics Fall 2013 1 Take-home exam and final exam study guide 1.1 Problems The following are some problems, some of which will appear on the final exam. 1.1.1 Number
More information