Piazza Recitation session: Review of linear algebra. Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Similar documents
Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bhattacharya

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Similarity Search. Stony Brook University CSE545, Fall 2016

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Algorithms for Data Science: Lecture on Finding Similar Items

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

B490 Mining the Big Data

Finding similar items

Today's topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

1 Finding Similar Items

CSE 5243 INTRO. TO DATA MINING

COMPSCI 514: Algorithms for Data Science

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246 Final Exam, Winter 2011

Bloom Filters and Locality-Sensitive Hashing

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms

CS246 Final Exam. March 16, :30AM - 11:30AM

1 Maintaining a Dictionary

CMU CS 462/662 (INTRO TO COMPUTER GRAPHICS) HOMEWORK 0.0 MATH REVIEW/PREVIEW LINEAR ALGEBRA

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CS425: Algorithms for Web Scale Data

Review: Linear and Vector Algebra

2 How many distinct elements are in a stream?

Analysis of Algorithms I: Perfect Hashing

Lecture: Analysis of Algorithms (CS )

COMPSCI 514: Algorithms for Data Science

Why duplicate detection?

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

Quantum Error Correcting Codes and Quantum Cryptography. Peter Shor M.I.T. Cambridge, MA 02139

Vectors and their uses

CS5112: Algorithms and Data Structures for Applications

Linear Algebra, Summer 2011, pt. 3

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

MATH 167: APPLIED LINEAR ALGEBRA Chapter 3

Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

LINEAR ALGEBRA KNOWLEDGE SURVEY

Approximate counting: count-min data structure. Problem definition

Lecture 8 HASHING!!!!!

Lecture 2: Vector Spaces, Metric Spaces

Chapter 1 Review of Equations and Inequalities

MITOCW ocw-18_02-f07-lec02_220k

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Text Processing and High-Dimensional Data

N/4 + N/2 + N = 2N 2.

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?

Efficient Data Reduction and Summarization

The Growth of Functions. A Practical Introduction with as Little Theory as possible

MATRIX DETERMINANTS. 1 Reminder Definition and components of a matrix

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

6 Linear Systems of Equations

MATH 167: APPLIED LINEAR ALGEBRA Least-Squares

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

7 Distances. 7.1 Metrics. 7.2 Distances L p Distances

Introduction How it works Theory behind Compressed Sensing. Compressed Sensing. Huichao Xue. CS3750 Fall 2011

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

Solutions for Chapter Solutions for Chapter 17. Section 17.1 Exercises

Ad Placement Strategies

Analysis-3 lecture schemes

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

ECS171: Machine Learning

CSE 321 Discrete Structures

CS60021: Scalable Data Mining. Large Scale Machine Learning

The following are generally referred to as the laws or rules of exponents. x a x b = x a+b (5.1) 1 x b a (5.2) (x a ) b = x ab (5.

STA141C: Big Data & High Performance Statistical Computing

Math 320-3: Lecture Notes Northwestern University, Spring 2015

Frequency Estimators

1 Vector Calculus. 1.1 Dot and Cross Product. hu, vi Euc := u v cos(q). Vector Calculus Review

Vector Geometry. Chapter 5

(x 1 +x 2 )(x 1 x 2 )+(x 2 +x 3 )(x 2 x 3 )+(x 3 +x 1 )(x 3 x 1 ).

4.1 Distance and Length

Jeffrey D. Ullman Stanford University

x 1 x 2. x 1, x 2,..., x n R. x n

STA141C: Big Data & High Performance Statistical Computing

The Gram Schmidt Process

Finite Mathematics : A Business Approach

The Gram Schmidt Process

NOTES ON LINEAR ALGEBRA CLASS HANDOUT

Math 3361-Modern Algebra Lecture 08 9/26/ Cardinality

CS154 Final Examination

CPSC 467: Cryptography and Computer Security

Linear Classifiers and the Perceptron

MTH 250 Graded Assignment 4

EM (cont.) November 26 th, Carlos Guestrin 1

Let s start by reviewing what we learned last time. Here is the basic line of reasoning for Einstein Solids

CS 124 Math Review Section January 29, 2018

Introduction to Randomized Algorithms III

Transcription:

Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547

Piazza Recitation session: Review of linear algebra. Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here). Deadlines next Thu, 11:59 PM: HW0, HW1. How to find teammates for the project? Piazza Team Search. Make sure you have a good dataset accessible.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

Task: Given a large number (N in the millions or billions) of documents, find near-duplicates. Problem: Too many documents to compare all pairs. Solution: Hash documents so that similar documents hash into the same bucket. Documents in the same bucket are then candidate pairs, whose similarity is then evaluated.

The pipeline: Document → Shingling → the set of strings of length k that appear in the document → Min-Hashing → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.

A k-shingle (or k-gram) is a sequence of k tokens that appears in the document. Example: k = 2; D1 = abcab. Set of 2-shingles: C1 = S(D1) = {ab, bc, ca}. Represent a doc by the set of hash values of its k-shingles. A natural similarity measure is then the Jaccard similarity: sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|. The similarity of two documents is the Jaccard similarity of their shingle sets.
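The following minimal sketch (not from the slides; the function names are illustrative) computes character k-shingles and the Jaccard similarity of two shingle sets:

def shingles(doc, k=2):
    # Set of k-shingles (length-k substrings) of a document
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def jaccard(c1, c2):
    # sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
    return len(c1 & c2) / len(c1 | c2)

# Example from the slide: k = 2, D1 = abcab
c1 = shingles("abcab", 2)      # {'ab', 'bc', 'ca'}
c2 = shingles("abcabc", 2)
print(jaccard(c1, c2))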

Min-Hashing: Convert large sets into short signatures while preserving similarity: Pr[h(C1) = h(C2)] = sim(D1, D2). [Figure: an input matrix (shingles × documents), a random permutation of its rows, and the resulting signature matrix M. The similarities of columns and of signatures approximately match: for column pairs 1-3, 2-4, 1-2, 3-4 the Col/Col similarities are 0.75, 0.75, 0, 0 and the Sig/Sig agreement rates are 0.67, 1.00, 0, 0.]
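A minimal MinHash sketch, assuming the standard trick of using random universal hash functions in place of explicit row permutations (names and parameters are illustrative, not the course code):

import random

def minhash_signature(shingle_set, num_hashes=100, prime=2_147_483_647, seed=0):
    # One (a, b) pair per simulated permutation; h_i(x) = (a*x + b) mod prime
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    # hash() stands in for a fixed shingle-to-integer mapping
    return [min((a * hash(x) + b) % prime for x in shingle_set) for a, b in coeffs]

def signature_similarity(sig1, sig2):
    # Fraction of agreeing positions; approximates the Jaccard similarity
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)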

Hash columns of the signature matrix M: similar columns likely hash to the same bucket. Divide matrix M into b bands of r rows each (so M has b·r rows). Candidate column pairs are those that hash to the same bucket for at least one band. [Figure: matrix M divided into b bands of r rows, each band hashed to buckets; the probability of sharing a bucket rises sharply around the similarity threshold s.]
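A sketch of the bands technique (illustrative names, not from the slides): each band of r signature values is used as a bucket key, and any two items that share a bucket in at least one band become a candidate pair.

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    # signatures: dict mapping doc_id -> signature (list of length b*r)
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            key = tuple(sig[band * r:(band + 1) * r])   # the band is the bucket key
            buckets[key].append(doc_id)
        for ids in buckets.values():
            candidates.update(tuple(sorted(p)) for p in combinations(ids, 2))
    return candidates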

The general picture: Points → Hash func. → Signatures: short integer signatures that reflect point similarity → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity. Two steps: design a locality-sensitive hash function (for a given distance metric), then apply the bands technique.

The S-curve is where the magic happens. Remember: the probability of equal hash values equals the similarity, Pr[hπ(C1) = hπ(C2)] = sim(D1, D2); plotted against the similarity t of two sets, this is a straight line, and it is what the hash code gives you. What we want is a step function: no chance of becoming a candidate if t < s, probability 1 if t > s, where s is the threshold. How do we get a step function? By choosing r and b!

Remember: b bands, r rows per band. Let sim(C1, C2) = s. What is the probability that at least one band is equal? Pick some band (r rows). The probability that the elements in a single row of columns C1 and C2 are equal is s. The probability that all rows in a band are equal is s^r. The probability that some row in a band is not equal is 1 - s^r. The probability that no band is equal is (1 - s^r)^b. The probability that at least one band is equal is 1 - (1 - s^r)^b. So P(C1, C2 is a candidate pair) = 1 - (1 - s^r)^b.
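A quick numerical check of this formula (illustrative snippet, not from the slides):

def candidate_prob(s, r, b):
    # P(at least one of b bands of r rows agrees) = 1 - (1 - s**r)**b
    return 1 - (1 - s ** r) ** b

# With r = 5, b = 10 (50 hash functions): a pair with similarity 0.8 almost
# always becomes a candidate, while a pair with similarity 0.3 rarely does.
print(candidate_prob(0.8, 5, 10))   # ~0.98
print(candidate_prob(0.3, 5, 10))   # ~0.02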

Picking r and b to get the best S-curve: 50 hash functions (r = 5, b = 10). [Figure: probability of sharing a bucket as a function of similarity s; the curve has the desired S shape.]

Given a fixed threshold s, we want to choose r and b such that P(candidate pair) has a step right around s. [Figure: four panels plotting prob = 1 - (1 - t^r)^b against similarity, for r = 1..10 with b = 1; r = 1 with b = 1..10; r = 5 with b = 1..50; and r = 10 with b = 1..50.]

Recap: Min-Hashing produces signatures, short vectors that represent the sets and reflect their similarity; Locality-Sensitive Hashing then produces candidate pairs, those pairs of signatures that we need to test for similarity.

We have used LSH to find similar documents. More generally, we found similar columns in large sparse matrices with high Jaccard similarity. Can we use LSH for other distance measures, e.g., Euclidean distance or cosine distance? Let's generalize what we've learned!

d(·) is a distance measure if it is a function from pairs of points x, y to real numbers such that: d(x, y) ≥ 0; d(x, y) = 0 iff x = y; d(x, y) = d(y, x); and d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality). Jaccard distance for sets = 1 - Jaccard similarity. Cosine distance for vectors = the angle between the vectors. Euclidean distances: the L2 norm, d(x, y) = the square root of the sum of the squares of the differences between x and y in each dimension, is the most common notion of distance; the L1 norm is the sum of the absolute values of the differences in each dimension (Manhattan distance, the distance if you travel along coordinates only).
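Illustrative implementations of these distance measures (a sketch, not from the slides):

import math

def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

def cosine_distance(x, y):
    # Angle between x and y, in radians (range [0, pi])
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norms = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return math.acos(dot / norms)

def l2_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def l1_distance(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))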

For Min-Hashing signatures, we got a Min-Hash function for each permutation of rows. A hash function is any function that allows us to say whether two elements are "equal". Shorthand: h(x) = h(y) means "h says x and y are equal". A family of hash functions is any set of hash functions from which we can pick one at random efficiently. Example: the set of Min-Hash functions generated from permutations of rows.

Suppose we have a space S of points with a distance measure d(x, y) (critical assumption). A family H of hash functions is said to be (d1, d2, p1, p2)-sensitive if for any x and y in S: (1) if d(x, y) < d1, then the probability over all h ∈ H that h(x) = h(y) is at least p1; (2) if d(x, y) > d2, then the probability over all h ∈ H that h(x) = h(y) is at most p2. With an LS family we can do LSH!

[Figure: Pr[h(x) = h(y)] plotted against the distance d(x, y): small distance means high probability (at least p1 for d < d1); large distance means low probability of hashing to the same value (at most p2 for d > d2). Notice it is a distance, not a similarity, hence the S-curve is flipped!]

Let S = the space of all sets, d = the Jaccard distance, and H = the family of Min-Hash functions for all permutations of rows. Then for any hash function h ∈ H: Pr[h(x) = h(y)] = 1 - d(x, y). This simply restates the theorem about Min-Hashing in terms of distances rather than similarities.

Claim: the Min-Hash family H is a (1/3, 2/3, 2/3, 1/3)-sensitive family for S and d. If the distance is < 1/3 (so the similarity is ≥ 2/3), then the probability that the Min-Hash values agree is at least 2/3. In general, for Jaccard distance, Min-Hashing gives a (d1, d2, (1 - d1), (1 - d2))-sensitive family for any d1 < d2.

Can we reproduce the S-curve effect we saw before for any LS family? Yes: the bands technique we learned for signature matrices carries over to this more general setting, so we can do LSH with any (d1, d2, p1, p2)-sensitive family. Two constructions: the AND construction, like the rows in a band, and the OR construction, like using many bands.

AND construction: Given family H, construct family H' consisting of r functions from H. For h = [h1, ..., hr] in H', we say h(x) = h(y) if and only if hi(x) = hi(y) for all i, 1 ≤ i ≤ r. Note this corresponds to creating a band of size r. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, (p1)^r, (p2)^r)-sensitive. Proof: use the fact that the hi's are independent. This lowers the probability for large distances (good) but also lowers the probability for small distances (bad).

Independence of hash functions (HFs) really means that the probability of two HFs saying "yes" is the product of each saying "yes". But two particular hash functions could be highly correlated, for example, in Min-Hash, if their permutations agree in the first one million entries. However, the probabilities in the definition of an LS family are over all possible members of H and H' (i.e., the average case, not the worst case).

OR construction: Given family H, construct family H' consisting of b functions from H. For h = [h1, ..., hb] in H', h(x) = h(y) if and only if hi(x) = hi(y) for at least one i. Theorem: if H is (d1, d2, p1, p2)-sensitive, then H' is (d1, d2, 1 - (1 - p1)^b, 1 - (1 - p2)^b)-sensitive. Proof: use the fact that the hi's are independent. This raises the probability for small distances (good) but also raises the probability for large distances (bad).
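A small sketch of the two constructions as operations on a family of hash functions (representing the composite as a pairwise "do they agree?" predicate is my simplification, not the slides'):

import random

def and_construction(family, r, rng=random):
    # Composite says "equal" only if all r sampled functions agree: p -> p**r
    hs = [rng.choice(family) for _ in range(r)]
    return lambda x, y: all(h(x) == h(y) for h in hs)

def or_construction(family, b, rng=random):
    # Composite says "equal" if at least one of b sampled functions agrees: p -> 1 - (1 - p)**b
    hs = [rng.choice(family) for _ in range(b)]
    return lambda x, y: any(h(x) == h(y) for h in hs)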

AND makes all probabilities shrink, but by choosing r correctly we can make the lower probability approach 0 while the higher one does not. OR makes all probabilities grow, but by choosing b correctly we can make the higher probability approach 1 while the lower one does not. [Figure: probability of sharing a bucket vs. similarity of a pair of items for the AND construction (r = 1..10, b = 1) and the OR construction (r = 1, b = 1..10).]

By choosing b and r correctly, we can make the lower probability approach 0 while the higher one approaches 1. As for the signature matrix, we can use the AND construction followed by the OR construction, or vice versa, or any alternating sequence of ANDs and ORs.

The r-way AND followed by the b-way OR construction is exactly what we did with Min-Hashing. AND: if bands match in all r values, hash to the same bucket. OR: columns that share ≥ 1 common bucket → candidate. Take points x and y such that Pr[h(x) = h(y)] = s. A single function from H makes (x, y) a candidate pair with probability s; the construction makes (x, y) a candidate pair with probability 1 - (1 - s^r)^b, the S-curve! Example: take H and construct H' by the AND construction with r = 4. Then, from H', construct H'' by the OR construction with b = 4.

s     p = 1 - (1 - s^4)^4
.2    .0064
.3    .0320
.4    .0985
.5    .2275
.6    .4260
.7    .6666
.8    .8785
.9    .9860

r = 4, b = 4 transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .8785, .0064)-sensitive family.
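The table can be reproduced directly from the formula (illustrative check):

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9):
    p = 1 - (1 - s ** 4) ** 4           # 4-way AND followed by 4-way OR
    print(f"{s:.1f}  {p:.4f}")          # e.g., 0.2 -> 0.0064, 0.8 -> 0.8785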

Picking r and b to get desired performance: 50 hash functions (r = 5, b = 10). [Figure: Prob(candidate pair) vs. similarity s, with the threshold s marked.] Blue area X: the false negative rate. These are pairs with sim > s, but the X fraction of them won't share a band and thus will never become candidates; we will never consider these pairs for the (slow/exact) similarity calculation! Green area Y: the false positive rate. These are pairs with sim < s that we will nonetheless consider as candidates. This is not too bad: we will consider them for the (slow/exact) similarity computation and then discard them.

Picking r and b to get desired performance: 50 hash functions (r · b = 50). [Figure: Prob(candidate pair) vs. similarity s for r = 2, b = 25; r = 5, b = 10; and r = 10, b = 5, with the threshold s marked.]

OR-AND composition: apply a b-way OR construction followed by an r-way AND construction. This transforms similarity s (probability p) into (1 - (1 - s)^b)^r, the same S-curve mirrored horizontally and vertically. Example: take H and construct H' by the OR construction with b = 4. Then, from H', construct H'' by the AND construction with r = 4.

s     p = (1 - (1 - s)^4)^4
.1    .0140
.2    .1215
.3    .3334
.4    .5740
.5    .7725
.6    .9015
.7    .9680
.8    .9936

The example transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9936, .1215)-sensitive family.

Example: apply the (4, 4) OR-AND construction followed by the (4, 4) AND-OR construction. This transforms a (.2, .8, .8, .2)-sensitive family into a (.2, .8, .9999996, .0008715)-sensitive family. Note this family uses 256 (= 4·4·4·4) of the original hash functions.
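A quick check of these cascade numbers (illustrative snippet):

def and_then_or(p, r, b):     # r-way AND followed by b-way OR
    return 1 - (1 - p ** r) ** b

def or_then_and(p, b, r):     # b-way OR followed by r-way AND
    return (1 - (1 - p) ** b) ** r

for p in (0.2, 0.8):
    q = or_then_and(p, 4, 4)            # (4, 4) OR-AND construction
    print(p, and_then_or(q, 4, 4))      # then the (4, 4) AND-OR construction
# 0.2 maps to ~0.00087 and 0.8 maps to ~0.9999996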

Fixed point: for each AND-OR S-curve 1 - (1 - s^r)^b, there is a threshold t for which 1 - (1 - t^r)^b = t. Above t, high probabilities are increased; below t, low probabilities are decreased. You improve the sensitivity as long as the low probability is less than t and the high probability is greater than t. Iterate as you like. A similar observation holds for the OR-AND type of S-curve, (1 - (1 - s)^b)^r.

[Figure: Prob(candidate pair) vs. s; above the threshold t the probability is raised, below t it is lowered.]

Pick any two distances d1 < d2. Start with a (d1, d2, (1 - d1), (1 - d2))-sensitive family. Apply constructions to amplify it into a (d1, d2, p1, p2)-sensitive family, where p1 is almost 1 and p2 is almost 0. The closer to 0 and 1 we get, the more hash functions must be used!

LSH methods for other distance metrics: cosine distance uses random hyperplanes; Euclidean distance uses projections onto lines. The pipeline is the same: Points → Hash func. (design a (d1, d2, p1, p2)-sensitive family of hash functions for that particular distance metric) → Signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing (amplify the family using AND and OR constructions) → Candidate pairs: those pairs of signatures that we need to test for similarity. The details depend on the distance function used.

Two instances of the pipeline (Data → Hash func. → Signatures: short integer signatures that reflect their similarity → Locality-Sensitive Hashing → Candidate pairs). For documents: a 0/1 shingle matrix → MinHash → integer signatures → bands technique → candidate pairs. For data points: vectors → random hyperplanes → +/- sketches → bands technique → candidate pairs.

Cosine distance = the angle between the vectors from the origin to the points in question: d(A, B) = θ = arccos(A·B / (‖A‖ ‖B‖)). It has range [0, π] (equivalently [0°, 180°]). We can divide θ by π to get a distance in the range [0, 1]; cosine similarity is then 1 - d(A, B). But cosine similarity is more often defined directly as cos(θ) = A·B / (‖A‖ ‖B‖), which has range [-1, 1] for general vectors and range [0, 1] for non-negative vectors (angles up to 90°).

For cosine distance, there is a technique called random hyperplanes, similar to Min-Hashing. The random hyperplanes method is a (d1, d2, (1 - d1/π), (1 - d2/π))-sensitive family for any d1 and d2. Reminder: (d1, d2, p1, p2)-sensitive means (1) if d(x, y) < d1, then the probability that h(x) = h(y) is at least p1, and (2) if d(x, y) > d2, then the probability that h(x) = h(y) is at most p2.

Each vector v determines a hash function h_v with two buckets: h_v(x) = +1 if v·x ≥ 0, and h_v(x) = -1 if v·x < 0. The LS family H is the set of all functions derived from any vector. Claim: for points x and y, Pr[h(x) = h(y)] = 1 - d(x, y)/π.

Proof idea: look in the plane of x and y. [Figure: vectors x and y with angle θ between them; a hyperplane normal to a random v that falls between x and y gives h(x) ≠ h(y), while one outside the angle gives h(x) = h(y).] Note: what is important is whether the hyperplane is outside the angle, not whether the vector v is inside it.

So Prob[the hyperplane falls between x and y] = θ/π, and therefore P[h(x) = h(y)] = 1 - θ/π = 1 - d(x, y)/π.

Pick some number of random vectors and hash your data with each vector. The result is a signature (sketch) of +1's and -1's for each data point. It can be used for LSH just as we used the Min-Hash signatures for Jaccard distance; amplify using AND/OR constructions.
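A minimal random-hyperplane sketch (illustrative; it uses Gaussian random vectors, while the next slide notes that +1/-1 components suffice):

import numpy as np

def random_hyperplane_sketch(X, num_hashes=64, seed=0):
    # X: (n_points, dim) array; returns an (n_points, num_hashes) array of +/-1
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((X.shape[1], num_hashes))   # one random vector per hash
    return np.where(X @ V >= 0, 1, -1)

def estimated_angle(s1, s2):
    # Pr[signs disagree] = theta / pi, so the disagreement rate estimates the angle
    return np.mean(s1 != s2) * np.pi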

It is expensive to pick a random vector in M dimensions for large M: we would have to generate M random numbers. A more efficient approach: it suffices to consider only vectors v consisting of +1 and -1 components. Why? Assuming the data is random, vectors of +1/-1 cover the entire space evenly (and do not introduce any bias).

Idea: hash functions correspond to lines. Partition the line into buckets of size a. Hash each point to the bucket containing its projection onto the line. An element of the signature is the bucket id for that given projection line. Nearby points always project close together; distant points are rarely in the same bucket.
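A minimal sketch of such a projection-based hash for Euclidean distance (the random offset of the bucket grid is an assumption added here; the slide only fixes the bucket width a):

import numpy as np

def projection_hash(X, a, num_lines=32, seed=0):
    # X: (n_points, dim) array; returns an (n_points, num_lines) array of bucket ids
    rng = np.random.default_rng(seed)
    lines = rng.standard_normal((X.shape[1], num_lines))
    lines /= np.linalg.norm(lines, axis=0)           # random unit directions
    offsets = rng.uniform(0, a, size=num_lines)      # random shift of the bucket grid
    return np.floor((X @ lines + offsets) / a).astype(int)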

[Figure: points projected onto a line divided into buckets of size a.] Lucky case: points that are close hash into the same bucket, while distant points end up in different buckets. Two unlucky cases: unlucky quantization (close points straddle a bucket boundary) and unlucky projection (distant points happen to project close together).


[Figure: two points at distance d projected onto a randomly chosen line with bucket width a.] If d ≪ a, then the chance that the points land in the same bucket is at least 1 - d/a.

If d ≫ a, then θ, the angle between the line joining the points and the randomly chosen line, must be close to 90° for there to be any chance the points go to the same bucket, since the projected distance is d·cos θ. [Figure: two points at distance d, projection of length d cos θ onto the line, bucket width a.]

If the points are at distance d < a/2, the probability they are in the same bucket is at least 1 - d/a ≥ 1/2. If the points are at distance d > 2a apart, then they can be in the same bucket only if d·cos θ ≤ a, i.e., cos θ ≤ 1/2, i.e., 60° < θ < 90°, which happens with at most 1/3 probability. This yields an (a/2, 2a, 1/2, 1/3)-sensitive family of hash functions for any a. Amplify using AND-OR cascades.

Summary of the two pipelines: Documents → 0/1 shingle matrix → MinHash → integer signatures → bands technique → candidate pairs; Data points → random hyperplanes → +/- sketches → bands technique → candidate pairs. In general: design a (d1, d2, p1, p2)-sensitive family of hash functions for the particular distance metric to turn data into short signatures that reflect their similarity, then amplify the family using AND and OR constructions (locality-sensitive hashing) to obtain the candidate pairs that we need to test for similarity.

The property P(h(C1) = h(C2)) = sim(C1, C2) of the hash function h is the essential part of LSH, without which we can't do anything. LS hash functions transform data into signatures so that the bands technique (AND and OR constructions) can then be applied.