B490 Mining the Big Data

Similar documents
7 Distances. 7.1 Metrics. 7.2 Distances L p Distances

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Similarity Search. Stony Brook University CSE545, Fall 2016

Algorithms for Data Science: Lecture on Finding Similar Items

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bhattacharya

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

1 Finding Similar Items

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

1 Maintaining a Dictionary

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Finding similar items

Analysis of Algorithms I: Perfect Hashing

COMPSCI 514: Algorithms for Data Science

Bloom Filters and Locality-Sensitive Hashing

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

CS246 Final Exam, Winter 2011

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

1 Difference between grad and undergrad algorithms

Lecture 8 HASHING!!!!!

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

CS246 Final Exam. March 16, :30AM - 11:30AM

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Cache-Oblivious Hashing

University of Florida CISE department Gator Engineering. Clustering Part 1

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

compare to comparison and pointer based sorting, binary trees

Lecture: Analysis of Algorithms (CS )

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Notes. Combinatorics. Combinatorics II. Notes. Notes. Slides by Christopher M. Bourke Instructor: Berthe Y. Choueiry. Spring 2006

Algorithms for Querying Noisy Distributed/Streaming Datasets

Lecture 3 Sept. 4, 2014

Lecture 24: Approximate Counting

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

Lecture 11: Hash Functions, Merkle-Damgaard, Random Oracle

2 How many distinct elements are in a stream?

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

COS597D: Information Theory in Computer Science September 21, Lecture 2

Approximate counting: count-min data structure. Problem definition

5199/IOC5063 Theory of Cryptology, 2014 Fall

MAT2377. Ali Karimnezhad. Version September 9, Ali Karimnezhad

Measures. 1 Introduction. These preliminary lecture notes are partly based on textbooks by Athreya and Lahiri, Capinski and Kopp, and Folland.

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.

The first bound is the strongest, the other two bounds are often easier to state and compute. Proof: Applying Markov's inequality, for any >0 we have

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Two-batch liar games on a general bounded channel

6.1 Occupancy Problem

What makes groups different?

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

CSE 321 Discrete Structures

6.842 Randomness and Computation Lecture 5

Randomized Algorithms

CS425: Algorithms for Web Scale Data

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

CS5112: Algorithms and Data Structures for Applications

COMPSCI 514: Algorithms for Data Science

Discrete Probability

Lecture 4: Hashing and Streaming Algorithms

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Lecture 18: March 15

Hash tables. Hash tables

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

Part 1: Hashing and Its Many Applications

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

Lecture 4: Counting, Pigeonhole Principle, Permutations, Combinations Lecturer: Lale Özkahya

6.854 Advanced Algorithms

Algorithms for Data Science

14.1 Finding frequent elements in stream

Aditya Bhaskara CS 5968/6968, Lecture 1: Introduction and Review 12 January 2016

2030 LECTURES. R. Craigen. Inclusion/Exclusion and Relations

Warm-up Using the given data Create a scatterplot Find the regression line

Ad Placement Strategies

Proximity problems in high dimensions

Why duplicate detection?

Random Lifts of Graphs

Introduction to Hash Tables

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?

Introduction to Randomized Algorithms III

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events

CSE 5243 INTRO. TO DATA MINING

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Topics in Approximation Algorithms Solution for Homework 3

Lecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox

Computational Models - Lecture 3 1

1 Alphabets and Languages

1 Take-home exam and final exam study guide

Transcription:

B490 Mining the Big Data. 1 Finding Similar Items. Qin Zhang

Motivations

Finding similar documents/webpages/images:
- (Approximate) mirror sites. Application: we don't want to show both in Google search results.
- Plagiarism and large quotations. Application: I am sure you know.
- Similar topics/articles from various places. Application: cluster articles covering the same story.
- Google image search.
- Social network analysis.
- Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.

How can we compare similarity?

Convert the data (homework, webpages, images) into an object in an abstract space in which we know how to measure distance, and how to do so efficiently.

What abstract space? For example, the Euclidean space R^d, the L_1 space, the L_∞ space, ...

Three essential techniques

- k-grams: convert documents to sets.
- Min-hashing: convert large sets to short signatures, while preserving similarity.
- Locality-sensitive hashing (LSH): map the sets to some space so that similar sets land close to each other.

The usual pipeline for finding similar items:
Docs -> (k-grams) -> a bag of strings of length k from each doc -> (Min-hashing) -> a signature representing each set -> (LSH) -> points in some space.

Jaccard Distance and Min-hashing

Convert documents to sets

k-gram: a sequence of k characters that appears in the doc. E.g., if doc = abcab and k = 2, then the set of all 2-grams is {ab, bc, ca}.

Question: What is a good k?
Question: Shall we count replicas?
Question: How do we deal with white space and stop words (e.g., a, you, for, the, to, and, that)?
Question: Capitalization? Punctuation?
Question: Characters vs. words?
Question: How about different languages?
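
A minimal Python sketch of the k-gram step (illustrative; the function name kgrams and the choice to keep the document as a raw string are my own, not from the slides):

    def kgrams(doc, k=2):
        # Return the set of all length-k substrings (character k-grams);
        # using a set means replicas are not counted.
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    # The slide's example: doc = "abcab", k = 2 gives {"ab", "bc", "ca"}.
    assert kgrams("abcab", 2) == {"ab", "bc", "ca"}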

Jaccard similarity

Example: if the intersection has size 3 and the union has size 8, the Jaccard similarity is 3/8.

Formally, the Jaccard similarity of two sets C_1, C_2 is JS(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|. It can be turned into a distance function by defining d_JS(C_1, C_2) = 1 - JS(C_1, C_2); in fact, any similarity measure can be converted to a distance function this way.
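
A direct Python transcription of the formula (illustrative helper names):

    def jaccard_similarity(c1, c2):
        c1, c2 = set(c1), set(c2)
        if not c1 and not c2:
            return 1.0  # convention for two empty sets
        return len(c1 & c2) / len(c1 | c2)

    def jaccard_distance(c1, c2):
        return 1.0 - jaccard_similarity(c1, c2)

    # Intersection size 3, union size 8 -> 3/8 (the sets from the next slide).
    assert jaccard_similarity({0, 1, 2, 5, 6}, {0, 2, 3, 5, 7, 9}) == 3 / 8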

Jaccard similarity with clusters

Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B?

With clusters: some items basically represent the same thing, so we group items into clusters. E.g., C_1 = {0, 1, 2}, C_2 = {3, 4}, C_3 = {5, 6}, C_4 = {7, 8, 9}. For instance, C_1 may represent action movies, C_2 comedies, C_3 documentaries, and C_4 horror movies. Now we can represent A_clu = {C_1, C_3}, B_clu = {C_1, C_2, C_3, C_4}, and

JS_clu(A, B) = JS(A_clu, B_clu) = |{C_1, C_3}| / |{C_1, C_2, C_3, C_4}| = 2/4 = 0.5.

Min-hashing

Given sets S_1 = {1, 2, 5}, S_2 = {3}, S_3 = {2, 3, 4, 6}, S_4 = {1, 4, 6}, we can represent them as a matrix:

Item  S_1  S_2  S_3  S_4
1      1    0    0    1
2      1    0    1    0
3      0    1    1    0
4      0    0    1    1
5      1    0    0    0
6      0    0    1    1

Step 1: Randomly permute the rows:

Item  S_1  S_2  S_3  S_4
2      1    0    1    0
5      1    0    0    0
6      0    0    1    1
1      1    0    0    1
4      0    0    1    1
3      0    1    1    0

Step 2: Record the item of the first 1 in each column: m(S_1) = 2, m(S_2) = 3, m(S_3) = 2, m(S_4) = 6.

Step 3: Estimate the Jaccard similarity JS(S_i, S_j) by
X = 1 if m(S_i) = m(S_j), and X = 0 otherwise.
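
A short Python sketch of this permutation-based min-hash on the slide's sets (illustrative; variable names are my own):

    import random

    sets = {"S1": {1, 2, 5}, "S2": {3}, "S3": {2, 3, 4, 6}, "S4": {1, 4, 6}}
    items = [1, 2, 3, 4, 5, 6]

    def minhash_once(sets, items, rng):
        # Step 1: randomly permute the items (the rows of the matrix).
        perm = items[:]
        rng.shuffle(perm)
        rank = {item: pos for pos, item in enumerate(perm)}
        # Step 2: for each set, the item of the "first 1" in its column is the
        # member with the smallest position in the permutation.
        return {name: min(s, key=rank.get) for name, s in sets.items()}

    m = minhash_once(sets, items, random.Random(0))
    # Step 3: X = 1 if m["S_i"] == m["S_j"] else 0 estimates JS(S_i, S_j).
    print(m)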

Min-hashing (cont.)

Lemma: Pr[m(S_i) = m(S_j)] = E[X] = JS(S_i, S_j). (Proof on the board.)

To approximate the Jaccard similarity within ε error with probability at least 1 - δ, repeat k = (1/ε²) log(1/δ) times and take the average of the X's. (Chernoff bound on the board.)

Min-hashing (cont.)

How to implement min-hashing efficiently? Make one pass over the data.
- Maintain k random hash functions {h_1, h_2, ..., h_k}, with each h_j : [N] -> [N^3] chosen at random (more on the next slide).
- Maintain k counters {c_1, ..., c_k}, with each c_j initialized to +∞.

Algorithm Min-hashing(S)
  For each i ∈ S do
    for j = 1 to k do
      if h_j(i) < c_j then c_j = h_j(i)
  Output m_j(S) = c_j for each j ∈ [k]

Estimator: JS_k(S, S') = (1/k) Σ_{j=1}^{k} 1(m_j(S) = m_j(S')).

Collisions? With the range enlarged to [N^3], the birthday paradox says collisions among the N items are unlikely.
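
A Python sketch of this one-pass implementation (illustrative; it builds the hash functions from the ((a·x + b) mod p) mod m recipe on the next slide, but with one fixed large prime rather than a prime in [m, 2m], and all names are my own):

    import random

    def make_hashes(k, N, seed=0):
        # k random hash functions h_j : [N] -> [N^3].
        rng = random.Random(seed)
        m, p = N ** 3, 2 ** 61 - 1  # p is a Mersenne prime, comfortably > m
        params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
        return [lambda x, a=a, b=b: ((a * x + b) % p) % m for a, b in params]

    def minhash_signature(S, hashes):
        # One pass over S: for each h_j keep the smallest hash value seen.
        sig = [float("inf")] * len(hashes)
        for i in S:
            for j, h in enumerate(hashes):
                if h(i) < sig[j]:
                    sig[j] = h(i)
        return sig

    def estimate_js(sig1, sig2):
        # Fraction of coordinates on which the two signatures agree.
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

    hashes = make_hashes(k=200, N=6)
    s1 = minhash_signature({1, 2, 5}, hashes)
    s3 = minhash_signature({2, 3, 4, 6}, hashes)
    print(estimate_js(s1, s3))  # true JS({1,2,5}, {2,3,4,6}) = 1/6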

Random hash functions

A random hash function h ∈ H maps a set A to a set B such that, over the random choice of h ∈ H:
1. For each x ∈ A, the location h(x) ∈ B is equally likely.
2. (Stronger) For any two x ≠ x' ∈ A, h(x) and h(x') are independent over the random choice of h ∈ H.
(Even stronger: independence for any two disjoint subsets. Good, but hard to achieve.)

Some practical hash functions h : U -> [m]:
- Modular hashing: h(x) = x mod m.
- Multiplicative hashing: h_a(x) = ⌊m · frac(x · a)⌋.
- Using primes: let p ∈ [m, 2m] be a prime and pick a, b ∈_R [p]; then h(x) = ((ax + b) mod p) mod m.
- Tabulation hashing: h(x) = H_1[x_1] ⊕ ... ⊕ H_k[x_k], where the key x is viewed as k characters x_1 ... x_k, ⊕ is bitwise exclusive or, and each H_i is a random mapping table into [m] (stored in a fast cache).
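
Python sketches of these constructions (illustrative; the parameter choices, e.g. the golden-ratio constant and the fixed primes, are common defaults I picked, not prescribed by the slide):

    import math
    import random

    def modular_hash(x, m):
        return x % m

    def multiplicative_hash(x, m, a=0.6180339887):
        # h_a(x) = floor(m * frac(x * a)); a = frac(golden ratio) is a classic choice.
        return int(m * math.modf(x * a)[0])

    def prime_hash_factory(m, p=2_147_483_647, rng=random.Random(0)):
        # ((a*x + b) mod p) mod m with p prime (here 2^31 - 1 for simplicity,
        # rather than a prime in [m, 2m] as on the slide).
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    def tabulation_hash_factory(chars=4, bits=8, m=2 ** 16, rng=random.Random(1)):
        # View the key as `chars` chunks of `bits` bits; XOR the table entries.
        tables = [[rng.randrange(m) for _ in range(2 ** bits)] for _ in range(chars)]
        def h(x):
            out = 0
            for i in range(chars):
                out ^= tables[i][(x >> (bits * i)) & (2 ** bits - 1)]
            return out
        return h

    h = prime_hash_factory(m=100)
    print(modular_hash(12345, 100), multiplicative_hash(12345, 100), h(12345))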

Finding similar items

Given a set of n = 1,000,000 items, we ask two questions:
- Which items are similar? We don't want to check all n² pairs.
- Given a query item, which other items are similar to it? We don't want to check all n items.

What if the set is n points in the plane R²?
- Choice 1: hierarchical indexing trees (e.g., range tree, kd-tree, B-tree). Problem: they do not scale to high dimensions.
- Choice 2: lay down a (random) grid. This is similar to the locality-sensitive hashing that we will explore next.

Locality Sensitive Hashing

Locality Sensitive Hashing (LSH)

Definition: h is (l, u, p_l, p_u)-sensitive with respect to a distance function d if
- Pr[h(a) = h(b)] > p_l whenever d(a, b) < l;
- Pr[h(a) = h(b)] < p_u whenever d(a, b) > u.
For this definition to make sense, we need p_u < p_l for u > l.

Ideally, we want p_l - p_u to be large even when u - l is small. Then we can repeat to amplify this effect further (next slide).

Define d(a, b) = 1 - JS(a, b). The family of min-hash functions is (d_1, d_2, 1 - d_1, 1 - d_2)-sensitive for any 0 ≤ d_1 < d_2 ≤ 1.

Not good enough: d_2 - d_1 = (1 - d_1) - (1 - d_2). Can we improve it?

Partition into bands

Signature matrix (one column per document, one row per min-hash function):

      D_1  D_2  D_3  D_4
h_1    1    2    4    0
h_2    2    0    1    3
h_3    5    3    3    0
h_4    4    0    1    1
...
h_t    1    2    3    0

Partition the t rows into b bands of r = t/b rows each. Candidate pairs are those that hash to the same bucket in at least one band. Tune b and r to catch most similar pairs but few non-similar pairs.
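
A Python sketch of the banding step (illustrative; it buckets on the tuple of the r signature values in each band, and the toy signatures below are made up):

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidates(signatures, b, r):
        # signatures: dict doc_id -> list of t = b*r min-hash values.
        # A pair becomes a candidate if it agrees on all r rows of some band.
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for doc, sig in signatures.items():
                buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
            for docs in buckets.values():
                candidates.update(combinations(sorted(docs), 2))
        return candidates

    sigs = {"D1": [1, 2, 5, 4, 0, 1], "D2": [2, 0, 3, 0, 1, 1],
            "D3": [4, 1, 3, 1, 0, 1], "D4": [0, 3, 0, 1, 2, 2]}
    print(lsh_candidates(sigs, b=3, r=2))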

A bite of theory for LSH (using min-hashing)

Using the signature matrix from the previous slide, with b bands of r = t/b rows each, let s = JS(D_1, D_2) be the probability that D_1 and D_2 collide on a single min-hash row. Then:
- s^r = probability that all hashes collide in one given band;
- 1 - s^r = probability that not all of them collide in one given band;
- (1 - s^r)^b = probability that in no band do all hashes collide;
- f(s) = 1 - (1 - s^r)^b = probability that all hashes collide in at least one band.

A bite of theory for LSH (cont.)

Recall that s = JS(D_1, D_2) is the probability that D_1 and D_2 have a hash collision in one row, and f(s) = 1 - (1 - s^r)^b. E.g., if we choose t = 15, r = 3 and b = 5, then
f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48,
f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) = 0.998.
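
These numbers can be reproduced with a few lines of Python:

    r, b = 3, 5
    f = lambda s: 1 - (1 - s ** r) ** b
    for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"f({s}) = {f(s):.3f}")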

A bite of theory for LSH (cont.)

Choice of r and b. Usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up between r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^(1/r). So given a similarity s that we want to use as the cut-off, we can solve for r = t/b in s = (1/b)^(1/r), which yields r ≈ log_s(1/t). If there is no budget on r and b, the S-curve gets sharper as they increase.

LSH using Euclidean distance

Besides the Jaccard distance used with min-hashing, let's use the Euclidean distance for LSH:
d_E(a, b) = ||a - b||_2 = sqrt( Σ_{i=1}^{d} (a_i - b_i)² ).

Hash function h:
1. Take a random unit vector u ∈ R^d. A unit vector u satisfies ||u||_2 = 1, that is, d_E(u, 0) = 1. (We will see how to generate a random unit vector later.)
2. Project each point a ∈ P ⊂ R^d onto u: a_u = ⟨a, u⟩ = Σ_{i=1}^{d} a_i u_i. This is contractive: |a_u - b_u| ≤ ||a - b||_2.
3. Create bins of size γ on u (that is, on R^1), preferably with a random shift z ∈ [0, γ). Then h(a) = the index of the bin that a_u falls into.
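
A Python sketch of this hash function (illustrative; it draws a Gaussian vector and normalizes it, which is one standard way to get a random unit vector, and the names are my own):

    import math
    import random

    def make_euclidean_lsh(d, gamma, rng=random.Random(0)):
        # Random unit vector u (normalized Gaussian) and random shift z in [0, gamma).
        g = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        u = [x / norm for x in g]
        z = rng.uniform(0, gamma)
        def h(a):
            a_u = sum(ai * ui for ai, ui in zip(a, u))  # projection <a, u>
            return math.floor((a_u + z) / gamma)         # index of the bin
        return h

    h = make_euclidean_lsh(d=3, gamma=1.0)
    print(h((0.10, 0.20, 0.30)), h((0.12, 0.19, 0.31)))  # close points often collide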

Property of LSH using Euclidean distance

With the hash function h from the previous slide (project onto a random unit vector u and bucket into bins of size γ), h is (γ/2, 2γ, 1/2, 1/3)-sensitive:
- if ||a - b||_2 < γ/2, then Pr[h(a) = h(b)] > 1/2;
- if ||a - b||_2 > 2γ, then Pr[h(a) = h(b)] < 1/3.

Entropy LSH

Problems with the basic LSH: it needs many hash functions h_1, h_2, ..., h_t, which consume a lot of space and are hard to implement in the distributed computation model.

Entropy LSH addresses the first issue. When querying h(q), additionally query several offsets q + δ_i (1 ≤ i ≤ L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r. Intuition: data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it. This needs far fewer hash functions.

Distances

Distance functions

A distance d : X × X -> R^+ is a metric if
(M1) (non-negativity) d(a, b) ≥ 0;
(M2) (identity) d(a, b) = 0 if and only if a = b;
(M3) (symmetry) d(a, b) = d(b, a);
(M4) (triangle inequality) d(a, b) ≤ d(a, c) + d(c, b).

A distance that satisfies (M1), (M3) and (M4) is called a pseudometric. A distance that satisfies (M1), (M2) and (M4) is called a quasimetric.

Distance functions (cont.)

L_p distances on two vectors a = (a_1, ..., a_d), b = (b_1, ..., b_d):
d_p(a, b) = ||a - b||_p = ( Σ_{i=1}^{d} |a_i - b_i|^p )^(1/p)

- L_2: the Euclidean distance.
- L_1: also known as the Manhattan distance; d_1(a, b) = Σ_{i=1}^{d} |a_i - b_i|.
- L_0: d_0(a, b) = d - Σ_{i=1}^{d} 1(a_i = b_i), the number of coordinates in which a and b differ. If a_i, b_i ∈ {0, 1}, this is the Hamming distance.
- L_∞: d_∞(a, b) = max_{i=1,...,d} |a_i - b_i|.
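
A small Python helper for these distances (illustrative names):

    def lp_distance(a, b, p):
        return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

    def l0_distance(a, b):
        # Number of coordinates in which a and b differ (Hamming distance on bits).
        return sum(ai != bi for ai, bi in zip(a, b))

    def linf_distance(a, b):
        return max(abs(ai - bi) for ai, bi in zip(a, b))

    a, b = (1, 2, 3), (2, 2, 5)
    print(lp_distance(a, b, 1), lp_distance(a, b, 2), l0_distance(a, b), linf_distance(a, b))
    # 3.0, ~2.236, 2, 2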

Distance functions (cont.)

Jaccard distance: d_J(A, B) = 1 - JS(A, B) = 1 - |A ∩ B| / |A ∪ B|. (Proof that it is a metric: on the board.)

Distance functions (cont.)

Cosine distance: d_cos(a, b) = 1 - ⟨a, b⟩ / (||a||_2 ||b||_2) = 1 - ( Σ_{i=1}^{d} a_i b_i ) / (||a||_2 ||b||_2). It is a pseudometric.

We can also develop an LSH function h for d_cos(·, ·) as follows. Choose a random vector v ∈ R^d, and let
h_v(a) = +1 if ⟨v, a⟩ > 0, and -1 otherwise.
This family is (γ, φ, (π - γ)/π, φ/π)-sensitive for any γ < φ ∈ [0, π].
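
A Python sketch of this sign-of-a-random-projection hash (illustrative):

    import random

    def make_cosine_lsh(d, rng=random.Random(0)):
        v = [rng.gauss(0, 1) for _ in range(d)]  # random vector v in R^d
        def h(a):
            return 1 if sum(vi * ai for vi, ai in zip(v, a)) > 0 else -1
        return h

    h = make_cosine_lsh(3)
    print(h((1.0, 0.5, 0.2)), h((2.0, 1.1, 0.4)))  # nearly parallel vectors usually agree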

Distance functions (cont.)

Edit distance. For two strings a and b, d_ed(a, b) = the minimum number of operations needed to turn a into b, where an operation inserts a letter, deletes a letter, or substitutes a letter. Application: Google's auto-correct. It is expensive to compute.
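
The expense is the standard O(|a|·|b|) dynamic program; a minimal Python version (illustrative):

    def edit_distance(a, b):
        # prev[j] = edit distance between the current prefix of a and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # delete ca
                               cur[j - 1] + 1,             # insert cb
                               prev[j - 1] + (ca != cb)))  # substitute (or match)
            prev = cur
        return prev[-1]

    print(edit_distance("kitten", "sitting"))  # 3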

Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v_1, ..., v_n} and E = {e_1, ..., e_m}, the graph distance d_G(v_i, v_j) between any pair v_i, v_j ∈ V is defined as the length of the shortest path between v_i and v_j.
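
For an unweighted graph this is just breadth-first search; a small sketch (illustrative; the adjacency-list input format is my choice):

    from collections import deque

    def graph_distance(adj, src, dst):
        # adj: dict vertex -> list of neighbours; returns #edges on a shortest path.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            if v == dst:
                return dist[v]
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return float("inf")  # dst is unreachable from src

    adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2", "v4"], "v4": ["v3"]}
    print(graph_distance(adj, "v1", "v4"))  # 3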

Distance functions (cont.)

Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p_1, ..., p_d} and Q = {q_1, ..., q_d},
d_KL(P, Q) = Σ_{i=1}^{d} p_i ln(p_i / q_i).
It is a distance that is NOT a metric (for one thing, it is not symmetric). It can be written as H(P, Q) - H(P), the cross-entropy between P and Q minus the entropy of P.
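
A direct transcription of the formula in Python (illustrative; it assumes q_i > 0 wherever p_i > 0):

    import math

    def kl_divergence(P, Q):
        # d_KL(P, Q) = sum_i p_i * ln(p_i / q_i); terms with p_i = 0 contribute 0.
        return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

    P, Q = [0.5, 0.5], [0.9, 0.1]
    print(kl_divergence(P, Q), kl_divergence(Q, P))  # different values: not symmetric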

Distance functions (cont.)

Earth-Mover Distance (EMD). Given two multisets A, B of points in a two-dimensional grid with |A| = |B| = N, the EMD is defined as the minimum cost of a perfect matching between the points of A and B:
EMD(A, B) = min over bijections π : A -> B of Σ_{a ∈ A} ||a - π(a)||_1.
Good for comparing images.
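
Since |A| = |B|, the EMD is a minimum-cost perfect matching under L_1 costs; a sketch using SciPy's assignment solver (assuming NumPy and SciPy are available; scipy.optimize.linear_sum_assignment solves exactly this matching problem):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def emd(A, B):
        # A, B: equal-size lists of 2-D grid points.
        A, B = np.asarray(A), np.asarray(B)
        cost = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)  # pairwise L1 costs
        rows, cols = linear_sum_assignment(cost)                  # optimal matching
        return cost[rows, cols].sum()

    print(emd([(0, 0), (2, 2)], [(1, 0), (2, 3)]))  # 1 + 1 = 2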

Thank you! Some slides are based on the MMDS book http://infolab.stanford.edu/~ullman/mmds.html and Jeff Phillips' lecture notes http://www.cs.utah.edu/~jeffp/teaching/cs5955.html