B490 Mining the Big Data

Similar documents
7 Distances. 7.1 Metrics. 7.2 Distances L p Distances

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Similarity Search. Stony Brook University CSE545, Fall 2016

Algorithms for Data Science: Lecture on Finding Similar Items

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bhattacharya

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

1 Finding Similar Items

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

1 Maintaining a Dictionary

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Finding similar items

Analysis of Algorithms I: Perfect Hashing

COMPSCI 514: Algorithms for Data Science

Bloom Filters and Locality-Sensitive Hashing

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

CS246 Final Exam, Winter 2011

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

1 Difference between grad and undergrad algorithms

Lecture 8 HASHING!!!!!

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

CS246 Final Exam. March 16, :30AM - 11:30AM

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Cache-Oblivious Hashing

University of Florida CISE department Gator Engineering. Clustering Part 1

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

compare to comparison and pointer based sorting, binary trees

Lecture: Analysis of Algorithms (CS )

Optimal Data-Dependent Hashing for Approximate Near Neighbors

Notes. Combinatorics. Combinatorics II. Notes. Notes. Slides by Christopher M. Bourke Instructor: Berthe Y. Choueiry. Spring 2006

Algorithms for Querying Noisy Distributed/Streaming Datasets

Lecture 3 Sept. 4, 2014

Lecture 24: Approximate Counting

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

Lecture 11: Hash Functions, Merkle-Damgaard, Random Oracle

2 How many distinct elements are in a stream?

Cryptography and Network Security Prof. D. Mukhopadhyay Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Lecture 9 Nearest Neighbor Search: Locality Sensitive Hashing.

COS597D: Information Theory in Computer Science September 21, Lecture 2

Approximate counting: count-min data structure. Problem definition

5199/IOC5063 Theory of Cryptology, 2014 Fall

MAT2377. Ali Karimnezhad. Version September 9, Ali Karimnezhad

Measures. 1 Introduction. These preliminary lecture notes are partly based on textbooks by Athreya and Lahiri, Capinski and Kopp, and Folland.

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.

The first bound is the strongest, the other two bounds are often easier to state and compute. Proof: Applying Markov's inequality, for any >0 we have

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Two-batch liar games on a general bounded channel

6.1 Occupancy Problem

What makes groups different?

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

CSE 321 Discrete Structures

6.842 Randomness and Computation Lecture 5

Randomized Algorithms

CS425: Algorithms for Web Scale Data

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

CS5112: Algorithms and Data Structures for Applications

COMPSCI 514: Algorithms for Data Science

Discrete Probability

Lecture 4: Hashing and Streaming Algorithms

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Lecture 18: March 15

Hash tables. Hash tables

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

Part 1: Hashing and Its Many Applications

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

Lecture 4: Counting, Pigeonhole Principle, Permutations, Combinations Lecturer: Lale Özkahya

6.854 Advanced Algorithms

Algorithms for Data Science

14.1 Finding frequent elements in stream

Aditya Bhaskara CS 5968/6968, Lecture 1: Introduction and Review 12 January 2016

2030 LECTURES. R. Craigen. Inclusion/Exclusion and Relations

Warm-up Using the given data Create a scatterplot Find the regression line

Ad Placement Strategies

Proximity problems in high dimensions

Why duplicate detection?

Random Lifts of Graphs

Introduction to Hash Tables

Recap of the last lecture. CS276A Information Retrieval. This lecture. Documents as vectors. Intuition. Why turn docs into vectors?

Introduction to Randomized Algorithms III

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

Lecture 8: Conditional probability I: definition, independence, the tree method, sampling, chain rule for independent events

CSE 5243 INTRO. TO DATA MINING

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Topics in Approximation Algorithms Solution for Homework 3

Lecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox

Computational Models - Lecture 3 1

1 Alphabets and Languages

1 Take-home exam and final exam study guide

Transcription:

B490 Mining the Big Data. 1 Finding Similar Items. Qin Zhang

Motivations

Finding similar documents/webpages/images:
- (Approximate) mirror sites. Application: we don't want to show both in Google search results.
- Plagiarism and large quotations. Application: I am sure you know.
- Similar topics/articles from various places. Application: cluster articles covering the same story.
- Google image search.
- Social network analysis.
- Finding Netflix users with similar tastes in movies, e.g., for personalized recommendation systems.

How can we compare similarity?

Convert the data (homework, webpages, images) into an object in an abstract space in which we know how to measure distance, and how to do so efficiently.

What abstract space? For example, the Euclidean space R^d, the L_1 space, the L_∞ space, ...

Three essential techniques

- k-grams: convert documents to sets.
- Min-hashing: convert large sets to short signatures, while preserving similarity.
- Locality-sensitive hashing (LSH): map the sets to some space so that similar sets land close to each other.

The usual pipeline for finding similar items:
Docs -> (k-grams) -> a bag of strings of length k from each doc -> (Min-hashing) -> a signature representing each set -> (LSH) -> points in some space.

Jaccard Distance and Min-hashing

Convert documents to sets

k-gram: a sequence of k characters that appears in the doc. E.g., if doc = abcab and k = 2, then the set of all 2-grams is {ab, bc, ca}.

Question: What is a good k?
Question: Shall we count replicas?
Question: How do we deal with white space and stop words (e.g., a, you, for, the, to, and, that)?
Question: Capitalization? Punctuation?
Question: Characters vs. words?
Question: How about different languages?
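
A minimal Python sketch of the k-gram step (illustrative; the function name kgrams and the choice to keep the document as a raw string are my own, not from the slides):

    def kgrams(doc, k=2):
        # Return the set of all length-k substrings (character k-grams);
        # using a set means replicas are not counted.
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    # The slide's example: doc = "abcab", k = 2 gives {"ab", "bc", "ca"}.
    assert kgrams("abcab", 2) == {"ab", "bc", "ca"}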

Jaccard similarity

Example: if the intersection has size 3 and the union has size 8, the Jaccard similarity is 3/8.

Formally, the Jaccard similarity of two sets C_1, C_2 is JS(C_1, C_2) = |C_1 ∩ C_2| / |C_1 ∪ C_2|. It can be turned into a distance function by defining d_JS(C_1, C_2) = 1 - JS(C_1, C_2); in fact, any similarity measure can be converted to a distance function this way.
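
A direct Python transcription of the formula (illustrative helper names):

    def jaccard_similarity(c1, c2):
        c1, c2 = set(c1), set(c2)
        if not c1 and not c2:
            return 1.0  # convention for two empty sets
        return len(c1 & c2) / len(c1 | c2)

    def jaccard_distance(c1, c2):
        return 1.0 - jaccard_similarity(c1, c2)

    # Intersection size 3, union size 8 -> 3/8 (the sets from the next slide).
    assert jaccard_similarity({0, 1, 2, 5, 6}, {0, 2, 3, 5, 7, 9}) == 3 / 8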

Jaccard similarity with clusters

Consider two sets A = {0, 1, 2, 5, 6}, B = {0, 2, 3, 5, 7, 9}. What is the Jaccard similarity of A and B?

With clusters: some items basically represent the same thing, so we group items into clusters. E.g., C_1 = {0, 1, 2}, C_2 = {3, 4}, C_3 = {5, 6}, C_4 = {7, 8, 9}. For instance, C_1 may represent action movies, C_2 comedies, C_3 documentaries, and C_4 horror movies. Now we can represent A_clu = {C_1, C_3}, B_clu = {C_1, C_2, C_3, C_4}, and

JS_clu(A, B) = JS(A_clu, B_clu) = |{C_1, C_3}| / |{C_1, C_2, C_3, C_4}| = 2/4 = 0.5.

Min-hashing

Given sets S_1 = {1, 2, 5}, S_2 = {3}, S_3 = {2, 3, 4, 6}, S_4 = {1, 4, 6}, we can represent them as a matrix:

Item  S_1  S_2  S_3  S_4
1      1    0    0    1
2      1    0    1    0
3      0    1    1    0
4      0    0    1    1
5      1    0    0    0
6      0    0    1    1

Step 1: Randomly permute the rows:

Item  S_1  S_2  S_3  S_4
2      1    0    1    0
5      1    0    0    0
6      0    0    1    1
1      1    0    0    1
4      0    0    1    1
3      0    1    1    0

Step 2: Record the item of the first 1 in each column: m(S_1) = 2, m(S_2) = 3, m(S_3) = 2, m(S_4) = 6.

Step 3: Estimate the Jaccard similarity JS(S_i, S_j) by
X = 1 if m(S_i) = m(S_j), and X = 0 otherwise.
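
A short Python sketch of this permutation-based min-hash on the slide's sets (illustrative; variable names are my own):

    import random

    sets = {"S1": {1, 2, 5}, "S2": {3}, "S3": {2, 3, 4, 6}, "S4": {1, 4, 6}}
    items = [1, 2, 3, 4, 5, 6]

    def minhash_once(sets, items, rng):
        # Step 1: randomly permute the items (the rows of the matrix).
        perm = items[:]
        rng.shuffle(perm)
        rank = {item: pos for pos, item in enumerate(perm)}
        # Step 2: for each set, the item of the "first 1" in its column is the
        # member with the smallest position in the permutation.
        return {name: min(s, key=rank.get) for name, s in sets.items()}

    m = minhash_once(sets, items, random.Random(0))
    # Step 3: X = 1 if m["S_i"] == m["S_j"] else 0 estimates JS(S_i, S_j).
    print(m)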

Min-hashing (cont.)

Lemma: Pr[m(S_i) = m(S_j)] = E[X] = JS(S_i, S_j). (Proof on the board.)

To approximate the Jaccard similarity within ε error with probability at least 1 - δ, repeat k = (1/ε²) log(1/δ) times and take the average of the X's. (Chernoff bound on the board.)

Min-hashing (cont.)

How to implement min-hashing efficiently? Make one pass over the data.
- Maintain k random hash functions {h_1, h_2, ..., h_k}, with each h_j : [N] -> [N^3] chosen at random (more on the next slide).
- Maintain k counters {c_1, ..., c_k}, with each c_j initialized to +∞.

Algorithm Min-hashing(S)
  For each i ∈ S do
    for j = 1 to k do
      if h_j(i) < c_j then c_j = h_j(i)
  Output m_j(S) = c_j for each j ∈ [k]

Estimator: JS_k(S, S') = (1/k) Σ_{j=1}^{k} 1(m_j(S) = m_j(S')).

Collisions? With the range enlarged to [N^3], the birthday paradox says collisions among the N items are unlikely.
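
A Python sketch of this one-pass implementation (illustrative; it builds the hash functions from the ((a·x + b) mod p) mod m recipe on the next slide, but with one fixed large prime rather than a prime in [m, 2m], and all names are my own):

    import random

    def make_hashes(k, N, seed=0):
        # k random hash functions h_j : [N] -> [N^3].
        rng = random.Random(seed)
        m, p = N ** 3, 2 ** 61 - 1  # p is a Mersenne prime, comfortably > m
        params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(k)]
        return [lambda x, a=a, b=b: ((a * x + b) % p) % m for a, b in params]

    def minhash_signature(S, hashes):
        # One pass over S: for each h_j keep the smallest hash value seen.
        sig = [float("inf")] * len(hashes)
        for i in S:
            for j, h in enumerate(hashes):
                if h(i) < sig[j]:
                    sig[j] = h(i)
        return sig

    def estimate_js(sig1, sig2):
        # Fraction of coordinates on which the two signatures agree.
        return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

    hashes = make_hashes(k=200, N=6)
    s1 = minhash_signature({1, 2, 5}, hashes)
    s3 = minhash_signature({2, 3, 4, 6}, hashes)
    print(estimate_js(s1, s3))  # true JS({1,2,5}, {2,3,4,6}) = 1/6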

Random hash functions

A random hash function h ∈ H maps a set A to a set B such that, over the random choice of h ∈ H:
1. For each x ∈ A, the location h(x) ∈ B is equally likely.
2. (Stronger) For any two x ≠ x' ∈ A, h(x) and h(x') are independent over the random choice of h ∈ H.
(Even stronger: independence for any two disjoint subsets. Good, but hard to achieve.)

Some practical hash functions h : U -> [m]:
- Modular hashing: h(x) = x mod m.
- Multiplicative hashing: h_a(x) = ⌊m · frac(x · a)⌋.
- Using primes: let p ∈ [m, 2m] be a prime and pick a, b ∈_R [p]; then h(x) = ((ax + b) mod p) mod m.
- Tabulation hashing: h(x) = H_1[x_1] ⊕ ... ⊕ H_k[x_k], where the key x is viewed as k characters x_1 ... x_k, ⊕ is bitwise exclusive or, and each H_i is a random mapping table into [m] (stored in a fast cache).
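
Python sketches of these constructions (illustrative; the parameter choices, e.g. the golden-ratio constant and the fixed primes, are common defaults I picked, not prescribed by the slide):

    import math
    import random

    def modular_hash(x, m):
        return x % m

    def multiplicative_hash(x, m, a=0.6180339887):
        # h_a(x) = floor(m * frac(x * a)); a = frac(golden ratio) is a classic choice.
        return int(m * math.modf(x * a)[0])

    def prime_hash_factory(m, p=2_147_483_647, rng=random.Random(0)):
        # ((a*x + b) mod p) mod m with p prime (here 2^31 - 1 for simplicity,
        # rather than a prime in [m, 2m] as on the slide).
        a, b = rng.randrange(1, p), rng.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    def tabulation_hash_factory(chars=4, bits=8, m=2 ** 16, rng=random.Random(1)):
        # View the key as `chars` chunks of `bits` bits; XOR the table entries.
        tables = [[rng.randrange(m) for _ in range(2 ** bits)] for _ in range(chars)]
        def h(x):
            out = 0
            for i in range(chars):
                out ^= tables[i][(x >> (bits * i)) & (2 ** bits - 1)]
            return out
        return h

    h = prime_hash_factory(m=100)
    print(modular_hash(12345, 100), multiplicative_hash(12345, 100), h(12345))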

Finding similar items

Given a set of n = 1,000,000 items, we ask two questions:
- Which items are similar? We don't want to check all n² pairs.
- Given a query item, which other items are similar to it? We don't want to check all n items.

What if the set is n points in the plane R²?
- Choice 1: hierarchical indexing trees (e.g., range tree, kd-tree, B-tree). Problem: they do not scale to high dimensions.
- Choice 2: lay down a (random) grid. This is similar to the locality-sensitive hashing that we will explore next.

Locality Sensitive Hashing

Locality Sensitive Hashing (LSH)

Definition: h is (l, u, p_l, p_u)-sensitive with respect to a distance function d if
- Pr[h(a) = h(b)] > p_l whenever d(a, b) < l;
- Pr[h(a) = h(b)] < p_u whenever d(a, b) > u.
For this definition to make sense, we need p_u < p_l for u > l.

Ideally, we want p_l - p_u to be large even when u - l is small. Then we can repeat to amplify this effect further (next slide).

Define d(a, b) = 1 - JS(a, b). The family of min-hash functions is (d_1, d_2, 1 - d_1, 1 - d_2)-sensitive for any 0 ≤ d_1 < d_2 ≤ 1.

Not good enough: d_2 - d_1 = (1 - d_1) - (1 - d_2). Can we improve it?

Partition into bands

Signature matrix (one column per document, one row per min-hash function):

      D_1  D_2  D_3  D_4
h_1    1    2    4    0
h_2    2    0    1    3
h_3    5    3    3    0
h_4    4    0    1    1
...
h_t    1    2    3    0

Partition the t rows into b bands of r = t/b rows each. Candidate pairs are those that hash to the same bucket in at least one band. Tune b and r to catch most similar pairs but few non-similar pairs.
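
A Python sketch of the banding step (illustrative; it buckets on the tuple of the r signature values in each band, and the toy signatures below are made up):

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidates(signatures, b, r):
        # signatures: dict doc_id -> list of t = b*r min-hash values.
        # A pair becomes a candidate if it agrees on all r rows of some band.
        candidates = set()
        for band in range(b):
            buckets = defaultdict(list)
            for doc, sig in signatures.items():
                buckets[tuple(sig[band * r:(band + 1) * r])].append(doc)
            for docs in buckets.values():
                candidates.update(combinations(sorted(docs), 2))
        return candidates

    sigs = {"D1": [1, 2, 5, 4, 0, 1], "D2": [2, 0, 3, 0, 1, 1],
            "D3": [4, 1, 3, 1, 0, 1], "D4": [0, 3, 0, 1, 2, 2]}
    print(lsh_candidates(sigs, b=3, r=2))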

A bite of theory for LSH (using min-hashing)

Using the signature matrix from the previous slide, with b bands of r = t/b rows each, let s = JS(D_1, D_2) be the probability that D_1 and D_2 collide on a single min-hash row. Then:
- s^r = probability that all hashes collide in one given band;
- 1 - s^r = probability that not all of them collide in one given band;
- (1 - s^r)^b = probability that in no band do all hashes collide;
- f(s) = 1 - (1 - s^r)^b = probability that all hashes collide in at least one band.

A bite of theory for LSH (cont.)

Recall that s = JS(D_1, D_2) is the probability that D_1 and D_2 have a hash collision in one row, and f(s) = 1 - (1 - s^r)^b. E.g., if we choose t = 15, r = 3 and b = 5, then
f(0.1) = 0.005, f(0.2) = 0.04, f(0.3) = 0.13, f(0.4) = 0.28, f(0.5) = 0.48,
f(0.6) = 0.70, f(0.7) = 0.88, f(0.8) = 0.97, f(0.9) = 0.998.
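
These numbers can be reproduced with a few lines of Python:

    r, b = 3, 5
    f = lambda s: 1 - (1 - s ** r) ** b
    for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
        print(f"f({s}) = {f(s):.3f}")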

A bite of theory for LSH (cont.)

Choice of r and b. Usually there is a budget of t hash functions one is willing to use. Given (a budget of) t hash functions, how does one divvy them up between r and b?

The threshold τ where f has the steepest slope is about τ ≈ (1/b)^(1/r). So given a similarity s that we want to use as the cut-off, we can solve for r = t/b in s = (1/b)^(1/r), which yields r ≈ log_s(1/t). If there is no budget on r and b, the S-curve gets sharper as they increase.

LSH using Euclidean distance

Besides the Jaccard distance used with min-hashing, let's use the Euclidean distance for LSH:
d_E(a, b) = ||a - b||_2 = sqrt( Σ_{i=1}^{d} (a_i - b_i)² ).

Hash function h:
1. Take a random unit vector u ∈ R^d. A unit vector u satisfies ||u||_2 = 1, that is, d_E(u, 0) = 1. (We will see how to generate a random unit vector later.)
2. Project each point a ∈ P ⊂ R^d onto u: a_u = ⟨a, u⟩ = Σ_{i=1}^{d} a_i u_i. This is contractive: |a_u - b_u| ≤ ||a - b||_2.
3. Create bins of size γ on u (that is, on R^1), preferably with a random shift z ∈ [0, γ). Then h(a) = the index of the bin that a_u falls into.
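
A Python sketch of this hash function (illustrative; it draws a Gaussian vector and normalizes it, which is one standard way to get a random unit vector, and the names are my own):

    import math
    import random

    def make_euclidean_lsh(d, gamma, rng=random.Random(0)):
        # Random unit vector u (normalized Gaussian) and random shift z in [0, gamma).
        g = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in g))
        u = [x / norm for x in g]
        z = rng.uniform(0, gamma)
        def h(a):
            a_u = sum(ai * ui for ai, ui in zip(a, u))  # projection <a, u>
            return math.floor((a_u + z) / gamma)         # index of the bin
        return h

    h = make_euclidean_lsh(d=3, gamma=1.0)
    print(h((0.10, 0.20, 0.30)), h((0.12, 0.19, 0.31)))  # close points often collide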

Property of LSH using Euclidean distance

With the hash function h from the previous slide (project onto a random unit vector u and bucket into bins of size γ), h is (γ/2, 2γ, 1/2, 1/3)-sensitive:
- if ||a - b||_2 < γ/2, then Pr[h(a) = h(b)] > 1/2;
- if ||a - b||_2 > 2γ, then Pr[h(a) = h(b)] < 1/3.

Entropy LSH

Problems with the basic LSH: it needs many hash functions h_1, h_2, ..., h_t, which consume a lot of space and are hard to implement in the distributed computation model.

Entropy LSH addresses the first issue. When querying h(q), additionally query several offsets q + δ_i (1 ≤ i ≤ L), chosen randomly from the surface of a ball B(q, r) for some carefully chosen radius r. Intuition: data points close to the query point q are highly likely to hash either to the same value as h(q) or to a value very close to it. This needs far fewer hash functions.

Distances

Distance functions

A distance d : X × X -> R^+ is a metric if
(M1) (non-negativity) d(a, b) ≥ 0;
(M2) (identity) d(a, b) = 0 if and only if a = b;
(M3) (symmetry) d(a, b) = d(b, a);
(M4) (triangle inequality) d(a, b) ≤ d(a, c) + d(c, b).

A distance that satisfies (M1), (M3) and (M4) is called a pseudometric. A distance that satisfies (M1), (M2) and (M4) is called a quasimetric.

Distance functions (cont.)

L_p distances on two vectors a = (a_1, ..., a_d), b = (b_1, ..., b_d):
d_p(a, b) = ||a - b||_p = ( Σ_{i=1}^{d} |a_i - b_i|^p )^(1/p)

- L_2: the Euclidean distance.
- L_1: also known as the Manhattan distance; d_1(a, b) = Σ_{i=1}^{d} |a_i - b_i|.
- L_0: d_0(a, b) = d - Σ_{i=1}^{d} 1(a_i = b_i), the number of coordinates in which a and b differ. If a_i, b_i ∈ {0, 1}, this is the Hamming distance.
- L_∞: d_∞(a, b) = max_{i=1,...,d} |a_i - b_i|.
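
A small Python helper for these distances (illustrative names):

    def lp_distance(a, b, p):
        return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1 / p)

    def l0_distance(a, b):
        # Number of coordinates in which a and b differ (Hamming distance on bits).
        return sum(ai != bi for ai, bi in zip(a, b))

    def linf_distance(a, b):
        return max(abs(ai - bi) for ai, bi in zip(a, b))

    a, b = (1, 2, 3), (2, 2, 5)
    print(lp_distance(a, b, 1), lp_distance(a, b, 2), l0_distance(a, b), linf_distance(a, b))
    # 3.0, ~2.236, 2, 2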

Distance functions (cont.)

Jaccard distance: d_J(A, B) = 1 - JS(A, B) = 1 - |A ∩ B| / |A ∪ B|. (Proof that it is a metric: on the board.)

Distance functions (cont.)

Cosine distance: d_cos(a, b) = 1 - ⟨a, b⟩ / (||a||_2 ||b||_2) = 1 - ( Σ_{i=1}^{d} a_i b_i ) / (||a||_2 ||b||_2). It is a pseudometric.

We can also develop an LSH function h for d_cos(·, ·) as follows. Choose a random vector v ∈ R^d, and let
h_v(a) = +1 if ⟨v, a⟩ > 0, and -1 otherwise.
This family is (γ, φ, (π - γ)/π, φ/π)-sensitive for any γ < φ ∈ [0, π].
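
A Python sketch of this sign-of-a-random-projection hash (illustrative):

    import random

    def make_cosine_lsh(d, rng=random.Random(0)):
        v = [rng.gauss(0, 1) for _ in range(d)]  # random vector v in R^d
        def h(a):
            return 1 if sum(vi * ai for vi, ai in zip(v, a)) > 0 else -1
        return h

    h = make_cosine_lsh(3)
    print(h((1.0, 0.5, 0.2)), h((2.0, 1.1, 0.4)))  # nearly parallel vectors usually agree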

Distance functions (cont.)

Edit distance. For two strings a and b, d_ed(a, b) = the minimum number of operations needed to turn a into b, where an operation inserts a letter, deletes a letter, or substitutes a letter. Application: Google's auto-correct. It is expensive to compute.
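
The expense is the standard O(|a|·|b|) dynamic program; a minimal Python version (illustrative):

    def edit_distance(a, b):
        # prev[j] = edit distance between the current prefix of a and b[:j].
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # delete ca
                               cur[j - 1] + 1,             # insert cb
                               prev[j - 1] + (ca != cb)))  # substitute (or match)
            prev = cur
        return prev[-1]

    print(edit_distance("kitten", "sitting"))  # 3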

Distance functions (cont.)

Graph distance. Given a graph G = (V, E), where V = {v_1, ..., v_n} and E = {e_1, ..., e_m}, the graph distance d_G(v_i, v_j) between any pair v_i, v_j ∈ V is defined as the length of the shortest path between v_i and v_j.
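
For an unweighted graph this is just breadth-first search; a small sketch (illustrative; the adjacency-list input format is my choice):

    from collections import deque

    def graph_distance(adj, src, dst):
        # adj: dict vertex -> list of neighbours; returns #edges on a shortest path.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            if v == dst:
                return dist[v]
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        return float("inf")  # dst is unreachable from src

    adj = {"v1": ["v2"], "v2": ["v1", "v3"], "v3": ["v2", "v4"], "v4": ["v3"]}
    print(graph_distance(adj, "v1", "v4"))  # 3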

Distance functions (cont.)

Kullback-Leibler (KL) divergence. Given two discrete distributions P = {p_1, ..., p_d} and Q = {q_1, ..., q_d},
d_KL(P, Q) = Σ_{i=1}^{d} p_i ln(p_i / q_i).
It is a distance that is NOT a metric (for one thing, it is not symmetric). It can be written as H(P, Q) - H(P), the cross-entropy between P and Q minus the entropy of P.
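
A direct transcription of the formula in Python (illustrative; it assumes q_i > 0 wherever p_i > 0):

    import math

    def kl_divergence(P, Q):
        # d_KL(P, Q) = sum_i p_i * ln(p_i / q_i); terms with p_i = 0 contribute 0.
        return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

    P, Q = [0.5, 0.5], [0.9, 0.1]
    print(kl_divergence(P, Q), kl_divergence(Q, P))  # different values: not symmetric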

Distance functions (cont.)

Earth-Mover Distance (EMD). Given two multisets A, B of points in a two-dimensional grid with |A| = |B| = N, the EMD is defined as the minimum cost of a perfect matching between the points of A and B:
EMD(A, B) = min over bijections π : A -> B of Σ_{a ∈ A} ||a - π(a)||_1.
Good for comparing images.
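
Since |A| = |B|, the EMD is a minimum-cost perfect matching under L_1 costs; a sketch using SciPy's assignment solver (assuming NumPy and SciPy are available; scipy.optimize.linear_sum_assignment solves exactly this matching problem):

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def emd(A, B):
        # A, B: equal-size lists of 2-D grid points.
        A, B = np.asarray(A), np.asarray(B)
        cost = np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)  # pairwise L1 costs
        rows, cols = linear_sum_assignment(cost)                  # optimal matching
        return cost[rows, cols].sum()

    print(emd([(0, 0), (2, 2)], [(1, 0), (2, 3)]))  # 1 + 1 = 2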

Thank you! Some slides are based on the MMDS book http://infolab.stanford.edu/~ullman/mmds.html and Jeff Phillips' lecture notes http://www.cs.utah.edu/~jeffp/teaching/cs5955.html