Similarity Search. Stony Brook University CSE545, Fall 2016

1 Similarity Search Stony Brook University CSE545, Fall 2016

2 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings Entity Resolution Fingerprint Matching

3 Finding Similar Items: What we will cover Set Similarity Shingling Minhashing Locality-sensitive hashing Embeddings Distance Metrics High-Degree of Similarity

4 Document Similarity Challenge: How to represent the document in a way that can be efficiently encoded and compared?

5 Shingles Goal: Convert documents to sets

6 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams ) - sequence of k characters

7 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams) - sequence of k characters E.g. k=2, doc = "abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd}

8 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams) - sequence of k characters E.g. k=2, doc = "abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd} Similar documents will have many common shingles; changing words or their order has minimal effect. In practice use 5 < k < 10.

9 Shingles Goal: Convert documents to sets k-shingles (aka character n-grams) - sequence of k characters k should be large enough that any given shingle appearing in a document is highly unlikely (e.g. < .1% chance) E.g. k=2, doc = "abcdabd": shingles(doc, 2) = {ab, bc, cd, da, bd} Can hash large shingles to smaller values (e.g. 9-shingles into 4 bytes) Can also use words (aka n-grams). Similar documents will have many common shingles; changing words or their order has minimal effect. In practice use 5 < k < 10.
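
A minimal sketch of the shingling step in Python, following the slide's example (the function name shingles follows the slide's notation):

    def shingles(doc, k):
        # Set of all length-k character substrings of doc
        return {doc[i:i + k] for i in range(len(doc) - k + 1)}

    print(shingles("abcdabd", 2))  # {'ab', 'bc', 'cd', 'da', 'bd'}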

10 Shingles Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

11 Minhashing Goal: Convert sets to shorter ids, signatures

12 Minhashing - Background Goal: Convert sets to shorter ids, signatures Characteristic Matrix: rows are set elements (e.g. shingles), columns are sets; an entry is 1 when the set contains the element. Jaccard Similarity: sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| (Leskovec et al., 2014)

13 Minhashing - Background Goal: Convert sets to shorter ids, signatures Characteristic Matrix: rows are set elements (e.g. shingles), columns are sets; an entry is 1 when the set contains the element -- often very sparse! (lots of zeros) Jaccard Similarity: sim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| (Leskovec et al., 2014)

14 Minhashing - Background Characteristic Matrix:
       S1  S2
  ab    1   1
  bc    0   1
  de    0   1
  ah    1   1
  ha    0   0
  ed    1   1
  ca    0   1
Jaccard Similarity: sim(S1, S2)?

15 Minhashing - Background Characteristic Matrix (1s to count marked with *):
       S1   S2
  ab    1*   1*
  bc    0    1*
  de    0    1*
  ah    1*   1*
  ha    0    0
  ed    1*   1*
  ca    0    1*
Jaccard Similarity: sim(S1, S2)?

16 Minhashing - Background Characteristic Matrix (as above) Jaccard Similarity: sim(S1, S2) = 3/6 (# rows where both have a 1 / # rows where at least one has a 1)

17 Minhashing - Background Characteristic Matrix (as above) How many different rows are possible?

18 Minhashing - Background Characteristic Matrix (as above) How many different rows are possible? 1, 1 -- type a; 1, 0 -- type b; 0, 1 -- type c; 0, 0 -- type d. sim(S1, S2) = a / (a + b + c)
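
The row-type counting can be made concrete with a small sketch (the helper name jaccard_from_columns is an illustrative choice); the columns are S1 and S2 from the matrix above:

    def jaccard_from_columns(col1, col2):
        # a = rows of type (1,1); b = (1,0); c = (0,1); type-d rows (0,0) are ignored
        a = sum(1 for x, y in zip(col1, col2) if x == 1 and y == 1)
        b = sum(1 for x, y in zip(col1, col2) if x == 1 and y == 0)
        c = sum(1 for x, y in zip(col1, col2) if x == 0 and y == 1)
        return a / (a + b + c)

    # Columns S1, S2 in row order ab, bc, de, ah, ha, ed, ca (as above)
    S1 = [1, 0, 0, 1, 0, 1, 0]
    S2 = [1, 1, 1, 1, 0, 1, 1]
    print(jaccard_from_columns(S1, S2))  # 3 / 6 = 0.5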

19 Shingles Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

20 Minhashing Characteristic Matrix:
       S1  S2  S3  S4
  ab    1   0   1   0
  bc    1   0   0   1
  de    0   1   0   1
  ah    0   1   0   1
  ha    0   1   0   1
  ed    1   0   1   0
  ca    1   0   1   0
(Leskovec et al., 2014)

21 Minhashing Characteristic Matrix (as above) Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. (Leskovec et al., 2014)

22 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. (Leskovec et al., 2014)

24 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. h(S1) = ed # permuted row 2 h(S2) = ha # permuted row 1 h(S3) = ? (Leskovec et al., 2014)

25 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. h(S1) = ed # permuted row 2 h(S2) = ha # permuted row 1 h(S3) = ed # permuted row 2 h(S4) = ? (Leskovec et al., 2014)

26 Minhashing Characteristic Matrix (as above), permuted order: 1 ha, 2 ed, 3 ab, 4 bc, 5 ca, 6 ah, 7 de Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to the first row where the set appears. h(S1) = ed # permuted row 2 h(S2) = ha # permuted row 1 h(S3) = ed # permuted row 2 h(S4) = ha # permuted row 1 (Leskovec et al., 2014)

27 Minhashing Characteristic Matrix (as above) Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to rows. Signature matrix: M Record the first row where each set had a 1 in the given permutation:
       S1  S2  S3  S4
  h1    2   1   2   1
h1(S1) = ed # permuted row 2 h1(S2) = ha # permuted row 1 h1(S3) = ed # permuted row 2 h1(S4) = ha # permuted row 1 (Leskovec et al., 2014)

30-34 Minhashing Characteristic Matrix (as above) Minhash function: h Based on a permutation of the rows in the characteristic matrix, h maps sets to rows. Signature matrix: M Record the first row where each set had a 1 in the given permutation. Two further permutations fill in the remaining signature rows:
       S1  S2  S3  S4
  h1    2   1   2   1
  h2    2   1   4   1
  h3    1   2   1   2
(Leskovec et al., 2014)

35 Minhashing Property of the signature matrix: for any hi (i.e. any row), the probability that hi(S1) = hi(S2) is the same as Sim(S1, S2). Characteristic matrix and signature matrix M as above. (Leskovec et al., 2014)

36 Minhashing Property of the signature matrix: for any hi (i.e. any row), the probability that hi(S1) = hi(S2) is the same as Sim(S1, S2). Thus, the similarity of signatures S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree. (Leskovec et al., 2014)

37 Minhashing Property of the signature matrix (as above). Thus, the similarity of signatures S1, S2 is the fraction of minhash functions (i.e. rows) in which they agree. Estimate with a random sample of permutations (i.e. ~100). (Leskovec et al., 2014)

38 Minhashing Similarity of signatures = fraction of minhash functions (i.e. rows) in which they agree. Estimated Sim(S1, S3) = # rows where signatures agree / # rows = 2/3 (Leskovec et al., 2014)

39 Minhashing Similarity of signatures = fraction of minhash functions (i.e. rows) in which they agree. Estimated Sim(S1, S3) = # rows where signatures agree / # rows = 2/3 Real Sim(S1, S3) = type a / (a + b + c) = 3/4 (Leskovec et al., 2014)

40 Minhashing Similarity of signatures = fraction of minhash functions (i.e. rows) in which they agree. Estimated Sim(S1, S3) = 2/3 Real Sim(S1, S3) = type a / (a + b + c) = 3/4 Try Sim(S1, S4) and Sim(S1, S2). (Leskovec et al., 2014)
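
A sketch of permutation-based minhashing on the characteristic matrix above, estimating similarity as the fraction of rows on which signatures agree (100 permutations, per the slide; the dictionary layout and seed are illustrative choices):

    import random

    # Characteristic matrix columns for S1..S4, row order ab, bc, de, ah, ha, ed, ca (as above)
    cm = {"S1": [1, 1, 0, 0, 0, 1, 1],
          "S2": [0, 0, 1, 1, 1, 0, 0],
          "S3": [1, 0, 0, 0, 0, 1, 1],
          "S4": [0, 1, 1, 1, 1, 0, 0]}

    def minhash(col, perm):
        # First row (in permuted order) where the set has a 1;
        # returns the row's index -- only agreement matters for the estimate
        for r in perm:
            if col[r] == 1:
                return r

    random.seed(0)
    perms = [random.sample(range(7), 7) for _ in range(100)]
    sig = {s: [minhash(col, p) for p in perms] for s, col in cm.items()}

    # Similarity of signatures = fraction of minhash functions on which they agree
    est = sum(a == b for a, b in zip(sig["S1"], sig["S3"])) / len(perms)
    print(est)  # close to the true Jaccard similarity 3/4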

41 Minhashing To implement Problem: Can't actually do permutations (huge space) Can't randomly grab rows according to an order (random disk seeks = slow!)

42 Minhashing To implement Problem: Can't reasonably do permutations (huge space) Can't randomly grab rows according to an order (random disk seeks = slow!) Solution: Use random hash functions. Setup: Pick ~100 hash functions, hashes Store M[i][s] = a potential minimum h_i(r) # initialized to infinity (num hashes x num sets)

43 Minhashing To implement Problem: Can't reasonably do permutations (huge space) Can't randomly grab rows according to an order (random disk seeks = slow!) Solution: Use random hash functions. Setup: Pick ~100 hash functions, hashes Store M[i][s] = a potential minimum h_i(r) # initialized to infinity (num hashes x num sets) Algorithm:
    for r in rows of cm:  # cm is the characteristic matrix
        compute h_i(r) for all i in hashes  # produces ~100 precomputed values
        for each set s in row r:
            if cm[r][s] == 1:
                for i in hashes:  # check which hash produces the smallest value
                    if h_i(r) < M[i][s]:
                        M[i][s] = h_i(r)

44 Minhashing To implement (setup and algorithm as on the previous slide) Known as efficient minhashing.

45 Minhashing What hash functions to use? Start with a decent function E.g. h1(x) = ascii(string) % large_prime_number Add a random multiple and addition E.g. h2(x) = (a*ascii(string) + b) % large_prime_number
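
Combining the last few slides, a minimal runnable sketch in Python. The choice of prime, the number of hash functions, and hashing the row index r (rather than the shingle string itself, as the slide does) are illustrative assumptions:

    import random

    P = 2_147_483_647  # a large prime (assumed choice; any large prime works)
    NUM_HASHES = 100
    # One pair (a_i, b_i) per hash function: h_i(r) = (a_i * r + b_i) % P
    coeffs = [(random.randrange(1, P), random.randrange(P)) for _ in range(NUM_HASHES)]

    def minhash_signatures(cm, num_sets):
        # cm: list of rows, each row a list of 0/1 flags over the sets
        M = [[float("inf")] * num_sets for _ in range(NUM_HASHES)]
        for r, row in enumerate(cm):
            hs = [(a * r + b) % P for a, b in coeffs]  # precompute h_i(r) for all i
            for s, bit in enumerate(row):
                if bit == 1:
                    for i in range(NUM_HASHES):
                        if hs[i] < M[i][s]:  # keep the running minimum per hash, per set
                            M[i][s] = hs[i]
        return M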

46 Minhashing Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document).

47 Minhashing Problem: Even if hashing, sets of shingles are large (e.g. 4 bytes => 4x the size of the document). New Problem: Even if the signatures are small, it can be computationally expensive to find similar pairs. E.g. 1m documents: 1,000,000 choose 2 ≈ 500,000,000,000 pairs

48 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity.

49 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. If we wanted the similarity for all pairs of documents, could anything be done?

50 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. Approach: Hash multiple times: similar items are likely in the same bucket.

51 Locality-Sensitive Hashing Goal: find pairs of minhashes likely to be similar (in order to then test more precisely for similarity). Candidate pairs: pairs of elements to be evaluated for similarity. Approach: Hash multiple times: similar items are likely in the same bucket. Applied to MinHash: hash columns of the signature matrix; candidate pairs are those that end up in the same bucket. (LSH is a type of near-neighbor search)

52 Locality-Sensitive Hashing Step 1: Add bands (divide the signature matrix into b bands of r rows each) (Leskovec et al., 2014)

53 Locality-Sensitive Hashing Step 1: Add bands Can be tuned to catch most true-positives with least false-positives. (Leskovec et al., 2014)

54 Locality-Sensitive Hashing Step 2: Hash columns within bands (Leskovec et al., 2014)

57 Locality-Sensitive Hashing Step 2: Hash columns within bands Criteria for being a candidate pair: they end up in the same bucket for at least 1 band. (Leskovec et al., 2014)

58 Locality-Sensitive Hashing Step 2: Hash columns within bands Simplification: There are enough buckets compared to rows per band that columns must be identical in order to hash to the same bucket. Thus, we only need to check whether columns are identical within a band. (Leskovec et al., 2014)
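
A sketch of the banding step under the simplification above: the band's slice of each column serves directly as the bucket key, so two columns share a bucket only if they are identical within that band. The function name and in-memory signature matrix M (b*r rows, one column per set) are assumptions for illustration:

    from collections import defaultdict
    from itertools import combinations

    def lsh_candidate_pairs(M, b, r):
        # M: signature matrix with b*r rows; columns are sets
        candidates = set()
        num_sets = len(M[0])
        for band in range(b):
            buckets = defaultdict(list)
            for s in range(num_sets):
                # The band's column slice is the bucket key (identical-within-band rule)
                key = tuple(M[band * r + i][s] for i in range(r))
                buckets[key].append(s)
            for cols in buckets.values():  # any two columns in a bucket are candidates
                candidates.update(combinations(cols, 2))
        return candidates

With the 100-row signature matrix above, the next slide's setup corresponds to lsh_candidate_pairs(M, b=20, r=5).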

59 Realistic Example: Probabilities of agreement 100,000 documents 100 random permutations/hash functions/rows => if 4-byte integers then 40MB to hold the signature matrix => still 100k choose 2 is a lot (~5 billion) 20 bands of 5 rows Want 80% Jaccard Similarity

60 Realistic Example: Probabilities of agreement (setup as above: 100,000 documents, 100 hash functions/rows, 20 bands of 5 rows, want 80% Jaccard Similarity) P(S1 == S2 in a band): probability S1 and S2 agree within a given band

61 Realistic Example: Probabilities of agreement (setup as above) For any row, P(S1 == S2) = .8 P(S1 == S2 in a given band) = .8^5 = .328 => P(S1 != S2 in a given band) = 1 - .328 = .672 P(S1 != S2 in all 20 bands): probability S1 and S2 do not agree in any band

62 Realistic Example: Probabilities of agreement (setup as above) For any row, P(S1 == S2) = .8 P(S1 == S2 in a given band) = .8^5 = .328 => P(S1 != S2 in a given band) = 1 - .328 = .672 P(S1 != S2 in all 20 bands) = .672^20 = .00035

63 Realistic Example: Probabilities of agreement (setup as above) For any row, P(S1 == S2) = .8 P(S1 == S2 in a given band) = .8^5 = .328 => P(S1 != S2 in a given band) = 1 - .328 = .672 P(S1 != S2 in all 20 bands) = .672^20 = .00035 What if we want 40% Jaccard Similarity?
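
The same arithmetic in a few lines; the 40% case answers the question above (the helper name p_candidate is an illustrative choice):

    def p_candidate(s, r=5, b=20):
        # Probability that a pair with Jaccard similarity s agrees in at least one band
        return 1 - (1 - s ** r) ** b

    print(p_candidate(0.8))  # ~0.9996: misses only ~.00035 of 80%-similar pairs
    print(p_candidate(0.4))  # ~0.186: most 40%-similar pairs never become candidates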

64 Document Similarity Pipeline Shingling Minhashing Locality-sensitive hashing

65 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim).

66 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim). Typical properties of a distance metric d: d(x, x) = 0 d(x, y) = d(y, x) d(x, y) <= d(x, z) + d(z, y)

67 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim). There are other metrics of similarity, e.g.: Euclidean Distance Cosine Distance Edit Distance Hamming Distance

68 Distance Metrics The pipeline gives us a way to find near-neighbors in high-dimensional space based on Jaccard Distance (1 - Jaccard Sim). There are other metrics of similarity, e.g.: Euclidean Distance ("L2 Norm") Cosine Distance Edit Distance Hamming Distance

70 Locality Sensitive Hashing - Theory LSH can be generalized to many distance metrics by converting output to a probability and providing a lower bound on the probability of being similar.

71 Locality Sensitive Hashing - Theory LSH can be generalized to many distance metrics by converting output to a probability and providing a lower bound on the probability of being similar. E.g. for Euclidean distance: Choose random lines (analogous to hash functions in minhashing) Project the two points onto each line; they match if the two points fall within the same interval.
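
A rough sketch of that idea for points in the plane; the interval width w, the 20 random lines, and the example points are illustrative assumptions:

    import math
    import random

    def random_line_hash(w=4.0):
        # Project onto a random unit direction; the bucket is which
        # width-w interval along the line the projection lands in
        theta = random.uniform(0, 2 * math.pi)
        ux, uy = math.cos(theta), math.sin(theta)
        offset = random.uniform(0, w)  # random shift of the interval grid
        return lambda x, y: int((x * ux + y * uy + offset) // w)

    hashes = [random_line_hash() for _ in range(20)]
    p, q = (1.0, 2.0), (1.5, 2.2)  # two nearby points
    print(sum(h(*p) == h(*q) for h in hashes), "of", len(hashes), "lines match")

Close points project to the same interval on most lines; distant points rarely do, which gives the lower bound on the match probability.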
