Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

Size: px
Start display at page:

Download "Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara"

Transcription

1 Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee Pallickara Computer Science, W5.B.2 W5.B.3 FAQs Compute the similarity between the two strings s1=farmers Insurance, s2 = Liberty Insurance V W CosSimilarity(V,W ) = cos(α) = V W = What is the Jaccard similarity for the same case? W5.B.4 How can we compute TF/IDF based similarity analysis? W5.B.5 Programming Assignment 2 Using n-grams 1

2 Spring 2017 W5.B.6 W5.B.7 Using n-grams A string is divided into smaller tokens of size n. q-grams or n-grams Size of n-gram is string with a length n Size of n-gram words is string with a length n words Generating n-grams Slide a window of size n over the string Generating n-grams s1 = Henri Waternoose s2 = Henry Waternose Generate 3-grams Q-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_w, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e## Q-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_w, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e## W5.B.8 W5.B.9 n-gram based token similarity n-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_w, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e## n-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_w, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e## 13 overlaps among total 22 distinct q-grams. Overlaps: number of two item pairs having overlap Jaccard similarity StringJaccard(s 1,s 2 ) = = 0.59 n-gram based token similarity n-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_w, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e## n-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_w, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e## Using the cosine similarity with tf-idf weights CosSimilarity(V,W ) =? W5.B.10 W5.B.11 Using n-grams Using the same similarity measures used in other token based similarity computations How large should n be? [1/2] n should be picked large enough that the probability of any given n-gram appearing in any given document is low Less sensitive to typographical errors discretization Vs. discretisation What if we change the size of n? Our corpus of documents is s Picking n = 5 Printable ASCII characters 27 5 = 14,348,907 possible 5-grams Is it really true with real s? 2

3 Spring 2017 W5.B.12 W5.B.13 How large should n be? [2/2] All characters do not appear with equal probability Imagine that there are only 20 characters and estimate the number of n-grams as 20 n For research articles, a choice of n = 9 is considered safe Applications of Near-neighbor search W5.B.14 W5.B.15 Near Neighbor(NN) Search Also known as proximity search, similarity search Finding closest or most similar points The post-office problem Assigning to a residence the nearest post office Hashing n-grams Creating buckets and use the bucket numbers as the shingles For the set of 9-grams, Map each of those 9-grams to a bucket number in the range 0 to bytes of data is compacted to 4 bytes Can hash-based approaches provide NN search results? W5.B.16 W5.B.17 Similarity-preserving summaries of set Signatures Replacing large sets of n-grams by much smaller representations Applications of Near-neighbor search: MinHashing We should be able to compare the signatures of two sets and estimate the Jaccard similarity Of the underlying sets from the signatures alone 3

4 Spring 2017 W5.B.18 W5.B.19 Matrix representations of sets Element {a,b,c,d,e S1 = {a,d S2 = {c S3 = {b,d,e S4 = {a,c,d MinHashing Signature generating algorithm Minhash of the characteristic matrix Minhash(π) of a set is the number of the row (element) with first non-zero in the permuted order π π = (b,e,a,d,c) Element Element Permuted order 1 Permuted order 2 W5.B.20 W5.B.21 MinHash and Jaccard Similarity Theorem: P(minhash(S) = minhash (T)) =JaccardSIM (S,T) MinHash and Jaccard Similarity Theorem: P(minhash(S) = minhash (T)) =JaccardSIM (S,T) Proof: X = number of rows with 1 for both S and T (e.g. x = 1) Y = number of rows with either S or T have 1, but not both (e.g. y = 2) Z = number of rows with both 0 (e.g. z = 2) Element S T b 0 0 e 0 0 a 1 0 d 1 1 c 0 1 Proof: X = number of rows with 1 for both S and T (e.g. x = 1) Y = number of rows with either S or T have 1, but not both (e.g. y = 2) Z = number of rows with both 0 (e.g. z = 2) Element S T b 0 0 e 0 0 a 1 0 d 1 1 c 0 1 P(minhash(S) = minhash (T)) Probability that a row of type X is before all of the rows of type Y in a random permuted order is P(minhash(S) = minhash (T)) Probability that a row of type X is before all of the rows of type Y in a random permuted order is, X/(X+Y) = JaccardSIM (S,T) W5.B.22 W5.B.23 MinHash and permuting the order What if we change the order of Elements? Can we simulate the effect of a random permutation? It is NOT feasible to permute a large characteristic matrix explicitly N element will need N! permutations! Element Element Permuted order 1 Permuted order 2 MinHash Signatures Pick (at random) some number n of permutations of the rows of the characteristic matrix M E.g. 100 permutations or several hundred permutations The minhash functions are determined by these permutations h 1, h 2, h 3,. h n MinHash signature for S Vector [h 1 (S), h 2 (S), h 3 (S),, h n (S)] 4

5 Spring 2017 W5.B.24 W5.B.25 Representing the MinHash Signatures Form a signature matrix The i th column of M is replaced by the min hash signature of the i th column Compressed form for a sparse matrix Element >10 9 elements? Using a random hash function [1/2] A Random hash function Maps row numbers to as many buckets as there are rows 0,1,, k-1 to bucket numbers 0 ~ k-1 Maps some pairs of integers to the same bucket Leaves other buckets unfilled However, not too many collisions W5.B.26 W5.B.27 Using a random hash function [2/2] Pick n randomly chosen hash functions h 1, h 2,. h n on the rows In the signature matrix, let SIG(i,c) be the element of the signature matrix for the i th hash function and column c Initially, the set SIG(i,c) to for all i and c For each row r Compute h 1 (r), h 2 (r),. h n (r) For each column c If c has 0 in row, do nothing If c has 1 in row r, then for each i = 1,2,,n set SIG(i,c) to the smaller of the current value of SIG(i,c) and h i (r) Row (element) S1 Row S2 (element) S1 S3 S2 S4 S3 Hash S4 1 Hash M 0 1 Hash Hash N-2 1 Hash 0 M N-1 0 Hash 0 M N>>M W5.B.28 W5.B.29 (Row # +1) mod 5 Row (element) Hash Hash Hash M Hash M Row (element) X+1 mod x +1 mod 5 (3 x Row # +1) mod 5 5

6 Spring 2017 W5.B.30 W5.B.31 h1 h2 h 1 (x) = (x + 1) mod 5 h 2 (x) = (3x +1) mod 5 In the previous slide, h 1 (0) and h 2 (0) are both 1 The row numbered 0 has 1 s in S1 and S4 h1 1 1 h2 1 1 h1 1 1 h2 1 1 Row number 1 Only in S3 is 1 Hash value H 1 (1)=2 and h 2 (1)=4 h h W5.B.32 W5.B.33 h h Row number 2 S2 and S4 Hash value h 1 (2)=3 and h 2 (2)=2 h h Row number 3 S1, S3 and S4 have 1 Hash value h 1 (3)=4 and h 2 (3)=0 h h h h Calculating a single MinHash value with MapReduce W5.B.34 Suppose you have some documents and have stored n-grams of these documents in a large table W5.B.35 Approach 1. Partitioning by rows The table must be partitioned across the processors by rows Each processor receives a subset of the n-grams for every document Each column n-grams for a single document Each row r r th n-gram for all the documents Can each processor compute the minhash for each document? No. It should pass the hash values and group them based on the document ID (Assume that document ID is stored separately) Compute a minhash value for each of your documents using a single hash function 6

7 Spring 2017 W5.B.36 W5.B.37 Design your map function What should be the input <key, value>? What should be the output <key, value>? map (k: docid, v: (row_number,freq) { if (freq >0) emit_intermediate(ik = k, iv =hash(row_number)) Design your reduce function MapReduce system will group the hashes by documentid Reduce function will find the minimum hash value within the group reduce (ik: docid, ivs: hashval[]) { var minhash := INFINITY for each iv in ivs { if iv < minhash { minhash := iv emit_final(fk = ik, fv = minhash) W5.B.38 W5.B.39 Approach 2. Partitioning by column All the k-grams for a document go to the same processor Do you need a reduce function? map (k: docid, v: kgram_row_with_value[]) { var minhash := INFINITY for each kgram with non 0 vlaue in v { var h := hash(kgram_row) if h < minhash { minhash := h emit_intermediate(ik = k, iv = minhash) W5.B.40 W5.B.41 Finding the most similar documents Number of pairs of documents to compare is too high Locality sensitive hashing 1M documents, signatures of length 250 ( 4 Byte each ) 1M x 1,000Bytes = 1GB Number of comparisons = 1M C 2 (Half of trillion pairs) 1ms per calculation of similarity 6 days to complete computing 7

8 Spring 2017 W5.B.42 W5.B.43 Locality-sensitive hashing (LSH) Do we need to calculate the similarity for all of the pairs? Near-neighbor search LSH Hash items several times, so that similar items are more likely to be hashed to the same bucket than dissimilar items Assign the candidate pair to the same bucket Reduce false positives Dissimilar pairs in the same bucket Reduce false negatives Similar pairs in different buckets W5.B.44 Dividing a signature matrix into 4 bands and 3 rows per band BAND 1 BAND 2 BAND 3 D 10 D 11 D 12 D D11(1,2,1) and D13 (1,2,1) will go to the same bucket (1,3,1) and (0,3,0) will NOT go to the same bucket unless they show the same values in the other band The first band of MinHash signature W5.B.45 Analysis of the Banding Technique [1/2] Suppose that we use b bands of r rows each Suppose that a particular pair of documents have Jaccard Similarity s P(the minhash signatures for these documents agree in any one particular row of the signature matrix) = s BAND 4 W5.B.46 W5.B.47 Analysis of the Banding Technique [2/2] The probability that documents (signatures) become a candidate pair (Signatures should match at least in ONE band): Threshold The value of similarity s to determine two documents are similar An approximation of the threshold is (1/b) 1/r P(the signatures agree in all rows of one particular band)=s r P(the signatures do not agree in at least one row of a particular band)=1-s r P(the signatures do not agree in all rows of any of the bands)= (1-s r ) b P(the signatures agree in all the rows of at least one band)= 1-(1-s r ) b 8

9 Spring 2017 W5.B.48 W5.B.49 For example If there are 16 bands and each band contains 4 rows S=0.5 If there are 8 bands and each band contains 8 rows S=0.77 If there are 4 bands and each band contains 16 rows S=0.91 Suppose that there are 20 bands (b=20), and each band includes 5 rows (r=5). P(the signatures agree in all the rows of at least one band)=1-(1-s r ) b s 1-(1-s r ) b

Finding similar items

Finding similar items Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the

More information

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

Similarity Search. Stony Brook University CSE545, Fall 2016

Similarity Search. Stony Brook University CSE545, Fall 2016 Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya CS62: Scalable Data Mining Similarity Search and Hashing Sourangshu Bha>acharya Finding Similar Items Distance Measures Goal: Find near-neighbors in high-dim. space We formally define near neighbors as

More information

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here) 4/0/9 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 Piazza Recitation session: Review of linear algebra Location: Thursday, April, from 3:30-5:20 pm in SIG 34

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes

More information

B490 Mining the Big Data

B490 Mining the Big Data B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations

More information

Algorithms for Data Science: Lecture on Finding Similar Items

Algorithms for Data Science: Lecture on Finding Similar Items Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar

More information

1 Finding Similar Items

1 Finding Similar Items 1 Finding Similar Items This chapter discusses the various measures of distance used to find out similarity between items in a given set. After introducing the basic similarity measures, we look at how

More information

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 4 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

COMPSCI 514: Algorithms for Data Science

COMPSCI 514: Algorithms for Data Science COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018 Lecture 9 Similarity Queries Few words about the exam The exam is Thursday (Oct 4) in two days In

More information

Why duplicate detection?

Why duplicate detection? Near-Duplicates Detection Naama Kraus Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze Some slides are courtesy of Kira Radinsky Why duplicate detection?

More information

CSE 321 Discrete Structures

CSE 321 Discrete Structures CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Sandeep Tata Jignesh M. Patel Department of Electrical Engineering and Computer Science University of Michigan 22 Hayward Street,

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

CS 584 Data Mining. Association Rule Mining 2

CS 584 Data Mining. Association Rule Mining 2 CS 584 Data Mining Association Rule Mining 2 Recall from last time: Frequent Itemset Generation Strategies Reduce the number of candidates (M) Complete search: M=2 d Use pruning techniques to reduce M

More information

Bloom Filters and Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,

More information

COMPSCI 514: Algorithms for Data Science

COMPSCI 514: Algorithms for Data Science COMPSCI 514: Algoritms for Data Science Arya Mazumdar University of Massacusetts at Amerst Fall 2018 Lecture 11 Locality Sensitive Hasing Midterm exam Average 28.96 out of 35 (82.8%). One exam is still

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Advanced Frequent Pattern Mining & Locality Sensitivity Hashing Huan Sun, CSE@The Ohio State University /7/27 Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

More information

Database Systems CSE 514

Database Systems CSE 514 Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter 2017 1 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section

More information

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74 V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Text Processing and High-Dimensional Data

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Text Processing and High-Dimensional Data Lecture Notes to Winter Term 2017/2018 Text Processing and High-Dimensional Data Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour,

More information

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction Tim Roughgarden & Gregory Valiant April 12, 2017 1 The Curse of Dimensionality in the Nearest Neighbor Problem Lectures #1 and

More information

An Efficient Partition Based Method for Exact Set Similarity Joins

An Efficient Partition Based Method for Exact Set Similarity Joins An Efficient Partition Based Method for Exact Set Similarity Joins Dong Deng Guoliang Li He Wen Jianhua Feng Department of Computer Science, Tsinghua University, Beijing, China. {dd11,wenhe1}@mails.tsinghua.edu.cn;{liguoliang,fengjh}@tsinghua.edu.cn

More information

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record.

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record. CS 5633 -- Spring 25 Symbol-table problem Hashing Carola Wenk Slides courtesy of Charles Leiserson with small changes by Carola Wenk CS 5633 Analysis of Algorithms 1 Symbol table holding n records: record

More information

Some notes on streaming algorithms continued

Some notes on streaming algorithms continued U.C. Berkeley CS170: Algorithms Handout LN-11-9 Christos Papadimitriou & Luca Trevisan November 9, 016 Some notes on streaming algorithms continued Today we complete our quick review of streaming algorithms.

More information

FAQs. Fitting h θ (x)

FAQs. Fitting h θ (x) 10/15/2018 - FALL 2018 W9.A.0.0 10/15/2018 - FALL 2018 W9.A.1 FAs How to iprove the ter project proposal? Brainstor with your teaates Make an appointent with e (eeting ust include all of the tea ebers)

More information

We might divide A up into non-overlapping groups based on what dorm they live in:

We might divide A up into non-overlapping groups based on what dorm they live in: Chapter 15 Sets of Sets So far, most of our sets have contained atomic elements (such as numbers or strings or tuples (e.g. pairs of numbers. Sets can also contain other sets. For example, {Z,Q} is a set

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Lecture: Analysis of Algorithms (CS )

Lecture: Analysis of Algorithms (CS ) Lecture: Analysis of Algorithms (CS483-001) Amarda Shehu Spring 2017 1 Outline of Today s Class 2 Choosing Hash Functions Universal Universality Theorem Constructing a Set of Universal Hash Functions Perfect

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms

More information

1 Maintaining a Dictionary

1 Maintaining a Dictionary 15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition

More information

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of

More information

arxiv: v1 [cs.it] 16 Aug 2017

arxiv: v1 [cs.it] 16 Aug 2017 Efficient Compression Technique for Sparse Sets Rameshwar Pratap rameshwar.pratap@gmail.com Ishan Sohony PICT, Pune ishangalbatorix@gmail.com Raghav Kulkarni Chennai Mathematical Institute kulraghav@gmail.com

More information

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured.

What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. What is Text mining? To discover the useful patterns/contents from the large amount of data that can be structured or unstructured. Text mining What can be used for text mining?? Classification/categorization

More information

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

DATA MINING LECTURE 3. Frequent Itemsets Association Rules DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.

More information

U.C. Berkeley CS276: Cryptography Luca Trevisan February 5, Notes for Lecture 6

U.C. Berkeley CS276: Cryptography Luca Trevisan February 5, Notes for Lecture 6 U.C. Berkeley CS276: Cryptography Handout N6 Luca Trevisan February 5, 2009 Notes for Lecture 6 Scribed by Ian Haken, posted February 8, 2009 Summary The encryption scheme we saw last time, based on pseudorandom

More information

CS 484 Data Mining. Association Rule Mining 2

CS 484 Data Mining. Association Rule Mining 2 CS 484 Data Mining Association Rule Mining 2 Review: Reducing Number of Candidates Apriori principle: If an itemset is frequent, then all of its subsets must also be frequent Apriori principle holds due

More information

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) = CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14 1 Probability First, recall a couple useful facts from last time about probability: Linearity of expectation: E(aX + by ) = ae(x)

More information

compare to comparison and pointer based sorting, binary trees

compare to comparison and pointer based sorting, binary trees Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment Anshumali Shrivastava and Ping Li Cornell University and Rutgers University WWW 25 Florence, Italy May 2st 25 Will Join

More information

Pivot Selection Techniques

Pivot Selection Techniques Pivot Selection Techniques Proximity Searching in Metric Spaces by Benjamin Bustos, Gonzalo Navarro and Edgar Chávez Catarina Moreira Outline Introduction Pivots and Metric Spaces Pivots in Nearest Neighbor

More information

DATA MINING - 1DL360

DATA MINING - 1DL360 DATA MINING - 1DL36 Fall 212" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/ht12 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology, Uppsala

More information

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction

High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,

More information

Dealing with Text Databases

Dealing with Text Databases Dealing with Text Databases Unstructured data Boolean queries Sparse matrix representation Inverted index Counts vs. frequencies Term frequency tf x idf term weights Documents as vectors Cosine similarity

More information

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park 1617 Preview Data Structure Review COSC COSC Data Structure Review Linked Lists Stacks Queues Linked Lists Singly Linked List Doubly Linked List Typical Functions s Hash Functions Collision Resolution

More information

Relational Nonlinear FIR Filters. Ronald K. Pearson

Relational Nonlinear FIR Filters. Ronald K. Pearson Relational Nonlinear FIR Filters Ronald K. Pearson Daniel Baugh Institute for Functional Genomics and Computational Biology Thomas Jefferson University Philadelphia, PA Moncef Gabbouj Institute of Signal

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION doi:10.1038/nature11875 Method for Encoding and Decoding Arbitrary Computer Files in DNA Fragments 1 Encoding 1.1: An arbitrary computer file is represented as a string S 0 of

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

Introduction to Hash Tables

Introduction to Hash Tables Introduction to Hash Tables Hash Functions A hash table represents a simple but efficient way of storing, finding, and removing elements. In general, a hash table is represented by an array of cells. In

More information

D B M G Data Base and Data Mining Group of Politecnico di Torino

D B M G Data Base and Data Mining Group of Politecnico di Torino Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org Customer

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Association Rules. Fundamentals

Association Rules. Fundamentals Politecnico di Torino Politecnico di Torino 1 Association rules Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket counter Association rule

More information

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions.

D B M G. Association Rules. Fundamentals. Fundamentals. Elena Baralis, Silvia Chiusano. Politecnico di Torino 1. Definitions. Definitions Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Itemset is a set including one or more items Example: {Beer, Diapers} k-itemset is an itemset that contains k

More information

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example

D B M G. Association Rules. Fundamentals. Fundamentals. Association rules. Association rule mining. Definitions. Rule quality metrics: example Association rules Data Base and Data Mining Group of Politecnico di Torino Politecnico di Torino Objective extraction of frequent correlations or pattern from a transactional database Tickets at a supermarket

More information

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n Aside: Golden Ratio Golden Ratio: A universal law. Golden ratio φ = lim n a n+b n a n = 1+ 5 2 a n+1 = a n + b n, b n = a n 1 Ruta (UIUC) CS473 1 Spring 2018 1 / 41 CS 473: Algorithms, Spring 2018 Dynamic

More information

Bloom Filters, Minhashes, and Other Random Stuff

Bloom Filters, Minhashes, and Other Random Stuff Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?

More information

Efficient Data Reduction and Summarization

Efficient Data Reduction and Summarization Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information

More information

Minwise hashing for large-scale regression and classification with sparse data

Minwise hashing for large-scale regression and classification with sparse data Minwise hashing for large-scale regression and classification with sparse data Nicolai Meinshausen (Seminar für Statistik, ETH Zürich) joint work with Rajen Shah (Statslab, University of Cambridge) Simons

More information

Counting. Mukulika Ghosh. Fall Based on slides by Dr. Hyunyoung Lee

Counting. Mukulika Ghosh. Fall Based on slides by Dr. Hyunyoung Lee Counting Mukulika Ghosh Fall 2018 Based on slides by Dr. Hyunyoung Lee Counting Counting The art of counting is known as enumerative combinatorics. One tries to count the number of elements in a set (or,

More information

Lecture 8 HASHING!!!!!

Lecture 8 HASHING!!!!! Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)

More information

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms

DATA MINING LECTURE 4. Frequent Itemsets, Association Rules Evaluation Alternative Algorithms DATA MINING LECTURE 4 Frequent Itemsets, Association Rules Evaluation Alternative Algorithms RECAP Mining Frequent Itemsets Itemset A collection of one or more items Example: {Milk, Bread, Diaper} k-itemset

More information

Chap 2: Classical models for information retrieval

Chap 2: Classical models for information retrieval Chap 2: Classical models for information retrieval Jean-Pierre Chevallet & Philippe Mulhem LIG-MRIM Sept 2016 Jean-Pierre Chevallet & Philippe Mulhem Models of IR 1 / 81 Outline Basic IR Models 1 Basic

More information

Probabilistic Hashing for Efficient Search and Learning

Probabilistic Hashing for Efficient Search and Learning Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1 Probabilistic Hashing for Efficient Search and Learning Ping Li Department of Statistical Science Faculty of omputing

More information

COMP 5331: Knowledge Discovery and Data Mining

COMP 5331: Knowledge Discovery and Data Mining COMP 5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei And slides provide by Raymond

More information

Lecture Notes for Chapter 6. Introduction to Data Mining

Lecture Notes for Chapter 6. Introduction to Data Mining Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004

More information

Geometry of Similarity Search

Geometry of Similarity Search Geometry of Similarity Search Alex Andoni (Columbia University) Find pairs of similar images how should we measure similarity? Naïvely: about n 2 comparisons Can we do better? 2 Measuring similarity 000000

More information

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30 CSCB63 Winter 2019 Week10 - Lecture 2 - Hashing Anna Bretscher March 21, 2019 1 / 30 Today Hashing Open Addressing Hash functions Universal Hashing 2 / 30 Open Addressing Open Addressing. Each entry in

More information

Recoverabilty Conditions for Rankings Under Partial Information

Recoverabilty Conditions for Rankings Under Partial Information Recoverabilty Conditions for Rankings Under Partial Information Srikanth Jagabathula Devavrat Shah Abstract We consider the problem of exact recovery of a function, defined on the space of permutations

More information

1 Information retrieval fundamentals

1 Information retrieval fundamentals CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving

More information

COMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform

COMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform COMS E6998-9 F15 Lecture 22: Linearity Testing Sparse Fourier Transform 1 Administrivia, Plan Thu: no class. Happy Thanksgiving! Tue, Dec 1 st : Sergei Vassilvitskii (Google Research) on MapReduce model

More information

Analysis of Algorithms I: Perfect Hashing

Analysis of Algorithms I: Perfect Hashing Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the

More information

Efficient Binary Fuzzy Measure Representation and Choquet Integral Learning

Efficient Binary Fuzzy Measure Representation and Choquet Integral Learning Efficient Binary Fuzzy Measure Representation and Choquet Integral Learning M. Islam 1 D. T. Anderson 2 X. Du 2 T. Havens 3 C. Wagner 4 1 Mississippi State University, USA 2 University of Missouri, USA

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Groups that Distribute over Stars

Groups that Distribute over Stars Groups that Distribute over Stars Arthur Holshouser 3600 Bullard St Charlotte, NC, USA, 808 Harold Reiter Department of Mathematics UNC Charlotte Charlotte, NC 83 hbreiter@emailunccedu 1 Abstract Suppose

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu Task: Find coalitions in signed networks Incentives: European

More information

Locality Sensitive Hashing

Locality Sensitive Hashing Locality Sensitive Hashing February 1, 016 1 LSH in Hamming space The following discussion focuses on the notion of Locality Sensitive Hashing which was first introduced in [5]. We focus in the case of

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

Proximity problems in high dimensions

Proximity problems in high dimensions Proximity problems in high dimensions Ioannis Psarros National & Kapodistrian University of Athens March 31, 2017 Ioannis Psarros Proximity problems in high dimensions March 31, 2017 1 / 43 Problem definition

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

Data Mining Concepts & Techniques

Data Mining Concepts & Techniques Data Mining Concepts & Techniques Lecture No. 04 Association Analysis Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro

More information

Lecture 14 October 16, 2014

Lecture 14 October 16, 2014 CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 14 October 16, 2014 Scribe: Jao-ke Chin-Lee 1 Overview In the last lecture we covered learning topic models with guest lecturer Rong Ge.

More information

Lecture 3 Sept. 4, 2014

Lecture 3 Sept. 4, 2014 CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

6.1 Occupancy Problem

6.1 Occupancy Problem 15-859(M): Randomized Algorithms Lecturer: Anupam Gupta Topic: Occupancy Problems and Hashing Date: Sep 9 Scribe: Runting Shi 6.1 Occupancy Problem Bins and Balls Throw n balls into n bins at random. 1.

More information

TDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1

TDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 TDDD43 Information Retrieval Fang Wei-Kleiner ADIT/IDA Linköping University Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 Outline 1. Introduction 2. Inverted index 3. Ranked Retrieval tf-idf

More information

Scoring, Term Weighting and the Vector Space

Scoring, Term Weighting and the Vector Space Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J

More information

Optimal Blocking by Minimizing the Maximum Within-Block Distance

Optimal Blocking by Minimizing the Maximum Within-Block Distance Optimal Blocking by Minimizing the Maximum Within-Block Distance Michael J. Higgins Jasjeet Sekhon Princeton University University of California at Berkeley November 14, 2013 For the Kansas State University

More information

CS168: The Modern Algorithmic Toolbox Lecture #19: Expander Codes

CS168: The Modern Algorithmic Toolbox Lecture #19: Expander Codes CS168: The Modern Algorithmic Toolbox Lecture #19: Expander Codes Tim Roughgarden & Gregory Valiant June 1, 2016 In the first lecture of CS168, we talked about modern techniques in data storage (consistent

More information