Today's topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara
Spring 2017 CS435 BIG DATA
Sangmi Lee Pallickara, Computer Science

W5.B.1 Today's topics
PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE
- FAQs
- Minhash
- Minhash signatures
- Calculating Minhash with MapReduce
- Locality Sensitive Hashing

W5.B.2-3 FAQs
Compute the similarity between the two strings s1 = "farmers Insurance" and s2 = "Liberty Insurance":
  CosSimilarity(V, W) = cos(α) = (V · W) / (|V| |W|)
What is the Jaccard similarity for the same case?

W5.B.4 FAQs
How can we compute a TF/IDF-based similarity analysis?

W5.B.5 Programming Assignment 2
Using n-grams
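As a quick sketch of the W5.B.3 FAQ, both measures can be computed by treating each string as a set of lower-cased word tokens (the tokenization is an assumption here, not fixed by the slide):

```python
# Cosine and Jaccard similarity for s1 = "farmers Insurance" and
# s2 = "Liberty Insurance", using binary word-count vectors.
import math

def word_vector(s):
    """Binary term-frequency vector as a dict: word -> 1."""
    return {w.lower(): 1 for w in s.split()}

def cosine(v, w):
    dot = sum(v[t] * w[t] for t in v if t in w)
    return dot / (math.sqrt(sum(x * x for x in v.values())) *
                  math.sqrt(sum(x * x for x in w.values())))

def jaccard(v, w):
    a, b = set(v), set(w)
    return len(a & b) / len(a | b)

v = word_vector("farmers Insurance")
w = word_vector("Liberty Insurance")
print(cosine(v, w))   # one shared term out of two each: 1/(sqrt(2)*sqrt(2)) = 0.5
print(jaccard(v, w))  # {insurance} out of {farmers, liberty, insurance} = 1/3
```

With a single shared token ("insurance"), the cosine similarity is 0.5 and the Jaccard similarity is 1/3.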
W5.B.6 Using n-grams
A string is divided into smaller tokens of size n (q-grams or n-grams).
- An n-gram is a substring of length n
- An n-gram of words is a sequence of n words

W5.B.7 Generating n-grams
Slide a window of size n over the string.
s1 = "Henri Waternoose", s2 = "Henry Waternose"; generate 3-grams:
- 3-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_W, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e##}
- 3-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_W, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e##}

W5.B.8 n-gram based token similarity
13 overlaps among 22 total distinct 3-grams (an overlap is an n-gram appearing in both sets).
Jaccard similarity: StringJaccard(s1, s2) = 13/22 ≈ 0.59

W5.B.9 n-gram based token similarity
Using the cosine similarity with tf-idf weights: CosSimilarity(V, W) = ?

W5.B.10 Using n-grams
Use the same similarity measures as in other token-based similarity computations.
What if we change the size of n?

W5.B.11 How large should n be? [1/2]
n should be picked large enough that the probability of any given n-gram appearing in any given document is low.
- Less sensitive to typographical errors: "discretization" vs. "discretisation"
- Suppose our corpus of documents is emails, and we pick n = 5
- With 27 characters (letters and a whitespace character), 27^5 = 14,348,907 possible 5-grams
- Is this really true for real emails?
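The 3-gram example above can be reproduced with a short sketch: pad with '#', write spaces as '_' (as the slides do), slide a window of size n, then take the Jaccard similarity of the two q-gram sets.

```python
# Generate character q-grams with '#' padding and compare two strings.
def qgrams(s, n=3):
    s = "#" * (n - 1) + s.replace(" ", "_") + "#" * (n - 1)
    return {s[i:i + n] for i in range(len(s) - n + 1)}

g1 = qgrams("Henri Waternoose")
g2 = qgrams("Henry Waternose")
print(len(g1 & g2), len(g1 | g2))             # 13 overlaps among 22 distinct 3-grams
print(round(len(g1 & g2) / len(g1 | g2), 2))  # 0.59
```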
W5.B.12 How large should n be? [2/2]
All characters do not appear with equal probability.
- Imagine that there are effectively only 20 characters, and estimate the number of n-grams as 20^n
- For research articles, a choice of n = 9 is considered safe

W5.B.13 Applications of near-neighbor search

W5.B.14 Near Neighbor (NN) Search
Also known as proximity search or similarity search: finding the closest or most similar points.
The post-office problem: assigning to a residence the nearest post office.

W5.B.15 Hashing n-grams
Create buckets and use the bucket numbers as the shingles.
- For the set of 9-grams, map each 9-gram to a bucket number in the range 0 to 2^32 - 1
- 9 bytes of data are compacted to 4 bytes
Can hash-based approaches provide NN search results?

W5.B.16 Similarity-preserving summaries of sets
Signatures: replacing large sets of n-grams with much smaller representations.
We should be able to compare the signatures of two sets and estimate the Jaccard similarity of the underlying sets from the signatures alone.

W5.B.17 Applications of near-neighbor search: MinHashing
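The bucketing on W5.B.15 can be sketched as follows; zlib.crc32 is used only as a convenient stand-in for "some hash function onto 0 .. 2^32 - 1", and the sample 9-grams are made up:

```python
# Map each 9-character gram to a 4-byte bucket number.
import zlib

def bucket(ngram):
    # crc32 already returns an unsigned value in 0 .. 2**32 - 1
    return zlib.crc32(ngram.encode("utf-8"))

for g in ("touchdown", "victories"):   # two example 9-grams
    b = bucket(g)
    print(g, b)
    assert 0 <= b < 2 ** 32            # 9 bytes compacted to 4 bytes
```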
W5.B.18 Matrix representation of sets
Universal set of elements: {a, b, c, d, e}
S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, S4 = {a, c, d}

Characteristic matrix:
Element  S1  S2  S3  S4
a         1   0   0   1
b         0   0   1   0
c         0   1   0   1
d         1   0   1   1
e         0   0   1   0

W5.B.19 MinHashing
Signature generating algorithm: the minhash of the characteristic matrix.
Minhash(π) of a set is the number of the row (element) with the first non-zero entry in the permuted order π, e.g., π = (b, e, a, d, c).

W5.B.20 MinHash and Jaccard Similarity
Theorem: P(minhash(S) = minhash(T)) = JaccardSIM(S, T)

W5.B.21 Proof:
- x = number of rows with 1 for both S and T (in the example below, x = 1)
- y = number of rows where either S or T has a 1, but not both (e.g., y = 2)
- z = number of rows with 0 in both (e.g., z = 2)

Element  S  T
b        0  0
e        0  0
a        1  0
d        1  1
c        0  1

P(minhash(S) = minhash(T)) = probability that a row of type x comes before all of the rows of type y in a random permuted order = x/(x + y) = JaccardSIM(S, T)

W5.B.22 MinHash and permuting the order
What if we change the order of elements? Can we simulate the effect of a random permutation?
It is NOT feasible to permute a large characteristic matrix explicitly: N elements have N! possible permutations.

W5.B.23 MinHash Signatures
Pick (at random) some number n of permutations of the rows of the characteristic matrix M, e.g., 100 or several hundred permutations.
The minhash functions h1, h2, h3, ..., hn are determined by these permutations.
MinHash signature for S: the vector [h1(S), h2(S), h3(S), ..., hn(S)]
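The theorem can be checked by brute force on the five-row example (S = {a, d}, T = {c, d}): enumerating all 5! = 120 permuted orders, the minhashes collide in exactly x/(x + y) = 1/3 of them.

```python
# Exhaustively verify P(minhash(S) = minhash(T)) = JaccardSIM(S, T)
# on the slide's example sets.
from itertools import permutations

S = {"a", "d"}
T = {"c", "d"}
elements = ["a", "b", "c", "d", "e"]

def minhash(perm, s):
    # index of the first row (in permuted order) whose entry is 1
    return next(i for i, e in enumerate(perm) if e in s)

hits = sum(minhash(p, S) == minhash(p, T) for p in permutations(elements))
print(hits, "of 120")   # 40 of 120 -> 1/3, the Jaccard similarity x/(x+y)
```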
W5.B.24 Representing the MinHash signatures
Form a signature matrix: the i-th column of M is replaced by the minhash signature of the i-th column.
This is a compressed form for a sparse matrix. But what if there are >10^9 elements?

W5.B.25 Using a random hash function [1/2]
A random hash function maps row numbers to as many buckets as there are rows: 0, 1, ..., k-1 to bucket numbers 0 ~ k-1.
- It maps some pairs of integers to the same bucket
- It leaves other buckets unfilled
- However, there are not too many collisions

W5.B.26 Using a random hash function [2/2]
Pick n randomly chosen hash functions h1, h2, ..., hn on the rows.
Let SIG(i, c) be the element of the signature matrix for the i-th hash function and column c.
- Initially, set SIG(i, c) to infinity for all i and c
- For each row r:
  - Compute h1(r), h2(r), ..., hn(r)
  - For each column c:
    - If c has 0 in row r, do nothing
    - If c has 1 in row r, then for each i = 1, 2, ..., n set SIG(i, c) to the smaller of the current value of SIG(i, c) and hi(r)

W5.B.27 [Figure: N rows of the characteristic matrix hashed into M buckets, with N >> M]

W5.B.28-29 Example hash functions over rows 0..4:
h1(x) = (x + 1) mod 5
h2(x) = (3x + 1) mod 5
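The SIG update rule can be sketched directly, run on the characteristic matrix for S1..S4 (elements a..e as rows 0..4) and the example hash functions h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5:

```python
# One pass over the rows of the characteristic matrix, keeping the
# minimum hash value seen for each (hash function, column) pair.
INF = float("inf")
rows = [          # columns: S1, S2, S3, S4
    [1, 0, 0, 1],  # a (row 0)
    [0, 0, 1, 0],  # b (row 1)
    [0, 1, 0, 1],  # c (row 2)
    [1, 0, 1, 1],  # d (row 3)
    [0, 0, 1, 0],  # e (row 4)
]
hs = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

SIG = [[INF] * 4 for _ in hs]          # one signature row per hash function
for r, row in enumerate(rows):
    hv = [h(r) for h in hs]
    for c, bit in enumerate(row):
        if bit:                        # only columns with a 1 in row r
            for i in range(len(hs)):
                SIG[i][c] = min(SIG[i][c], hv[i])

print(SIG)  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```

The final matrix gives signatures S1 = (1, 0), S2 = (3, 2), S3 = (0, 0), S4 = (1, 0).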
W5.B.30-33 Computing the signature matrix with h1(x) = (x + 1) mod 5 and h2(x) = (3x + 1) mod 5
- Row 0: h1(0) and h2(0) are both 1; row 0 has 1s in S1 and S4
  SIG: S1 = (1, 1), S4 = (1, 1)
- Row 1: only S3 has a 1; h1(1) = 2 and h2(1) = 4
  SIG: S3 = (2, 4)
- Row 2: S2 and S4 have 1s; h1(2) = 3 and h2(2) = 2
  SIG: S2 = (3, 2); S4 keeps (1, 1)
- Row 3: S1, S3, and S4 have 1s; h1(3) = 4 and h2(3) = 0
  SIG: S1 = (1, 0), S3 = (2, 0), S4 = (1, 0)
- Row 4: only S3 has a 1; h1(4) = 0 and h2(4) = 3
  Final signature matrix: S1 = (1, 0), S2 = (3, 2), S3 = (0, 0), S4 = (1, 0)

W5.B.34 Calculating a single MinHash value with MapReduce
Suppose you have some documents and have stored the n-grams of these documents in a large table.
Task: compute a minhash value for each of your documents using a single hash function.

W5.B.35 Approach 1. Partitioning by rows
The table is partitioned across the processors by rows, so each processor receives a subset of the n-grams for every document.
- Each column holds the n-grams for a single document
- Each row r holds the r-th n-gram for all the documents
Can each processor compute the minhash for each document?
- No. It must pass the hash values on and group them by document ID (assume that the document ID is stored separately).
W5.B.36 Design your map function
What should be the input <key, value>? What should be the output <key, value>?

  map(k: docid, v: (row_number, freq)) {
    if (freq > 0)
      emit_intermediate(ik = k, iv = hash(row_number))
  }

W5.B.37 Design your reduce function
The MapReduce system will group the hashes by document ID; the reduce function finds the minimum hash value within the group.

  reduce(ik: docid, ivs: hashval[]) {
    var minhash := INFINITY
    for each iv in ivs {
      if iv < minhash {
        minhash := iv
      }
    }
    emit_final(fk = ik, fv = minhash)
  }

W5.B.38 Approach 2. Partitioning by columns
All the n-grams for a document go to the same processor.
Do you need a reduce function?

W5.B.39
  map(k: docid, v: ngram_row_with_value[]) {
    var minhash := INFINITY
    for each ngram with non-zero value in v {
      var h := hash(ngram_row)
      if h < minhash {
        minhash := h
      }
    }
    emit_intermediate(ik = k, iv = minhash)
  }
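The map/reduce pseudocode for Approach 1 can be rendered as a toy, single-process simulation; the table contents and hash_fn here are made up for illustration (hash_fn reuses h2 from the earlier example):

```python
# Map emits (docid, hash(row_number)) for every non-zero cell; the
# "shuffle" groups by docid; reduce keeps the minimum hash per document.
from collections import defaultdict

def hash_fn(row_number):
    return (3 * row_number + 1) % 5    # stand-in for the assignment's hash

def map_fn(docid, cell):               # cell = (row_number, freq)
    row_number, freq = cell
    if freq > 0:
        yield docid, hash_fn(row_number)

def reduce_fn(docid, hash_values):
    return docid, min(hash_values)     # the minhash for this document

# table[docid] = (row_number, freq) pairs, spread across workers by row
table = {"doc1": [(0, 1), (3, 2)], "doc2": [(1, 1), (2, 1)]}

groups = defaultdict(list)             # shuffle phase
for docid, cells in table.items():
    for cell in cells:
        for ik, iv in map_fn(docid, cell):
            groups[ik].append(iv)

result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result)  # {'doc1': 0, 'doc2': 2}
```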
W5.B.40 Finding the most similar documents
The number of pairs of documents to compare is too high.
- 1M documents, signatures of length 250 (4 bytes each): 1M x 1,000 bytes = 1 GB
- Number of comparisons: 1M choose 2, about half a trillion pairs
- Even at 1 microsecond per similarity calculation, it takes about 6 days to complete the computation

W5.B.41 Locality sensitive hashing

W5.B.42 Locality-sensitive hashing (LSH)
Do we need to calculate the similarity for all of the pairs? Near-neighbor search.

W5.B.43 LSH
Hash items several times, so that similar items are more likely to be hashed to the same bucket than dissimilar items; candidate pairs are the items assigned to the same bucket.
- Reduce false positives: dissimilar pairs hashed to the same bucket
- Reduce false negatives: similar pairs hashed to different buckets

W5.B.44 Dividing a signature matrix into 4 bands of 3 rows each
[Figure: signature matrix for documents D10, D11, D12, D13 split into BAND 1 .. BAND 4]
In the first band of the MinHash signatures, D11 = (1,2,1) and D13 = (1,2,1) will go to the same bucket; (1,3,1) and (0,3,0) will NOT go to the same bucket unless they show the same values in another band.

W5.B.45 Analysis of the banding technique [1/2]
Suppose we use b bands of r rows each, and that a particular pair of documents has Jaccard similarity s.
P(the minhash signatures for these documents agree in any one particular row of the signature matrix) = s
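A minimal sketch of the banding step: split each signature into b bands of r rows, bucket documents band by band, and call two documents a candidate pair if they share a bucket in at least one band. The signatures below are invented, with D11 and D13 agreeing in their first band as on slide W5.B.44.

```python
# Band-by-band bucketing; using the band tuple itself as the bucket key
# stands in for hashing it into a large bucket array.
from collections import defaultdict

def candidate_pairs(signatures, b, r):
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])
            buckets[chunk].append(doc)   # in practice: hash(chunk) % n_buckets
        for docs in buckets.values():
            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    pairs.add(tuple(sorted((docs[i], docs[j]))))
    return pairs

sigs = {"D11": [1, 2, 1, 0, 3, 0],
        "D13": [1, 2, 1, 5, 6, 7],
        "D12": [0, 3, 0, 9, 9, 9]}
print(candidate_pairs(sigs, b=2, r=3))   # {('D11', 'D13')}: they agree in band 1
```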
9 Spring 2017 W5.B.48 W5.B.49 For example If there are 16 bands and each band contains 4 rows S=0.5 If there are 8 bands and each band contains 8 rows S=0.77 If there are 4 bands and each band contains 16 rows S=0.91 Suppose that there are 20 bands (b=20), and each band includes 5 rows (r=5). P(the signatures agree in all the rows of at least one band)=1-(1-s r ) b s 1-(1-s r ) b
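The formulas above are easy to evaluate directly; note that for 4 bands of 16 rows the approximation gives (1/4)^(1/16) ≈ 0.917, which the slide rounds down to 0.91.

```python
# Candidate-pair probability and approximate threshold for b bands of r rows.
def p_candidate(s, b, r):
    return 1 - (1 - s ** r) ** b

def threshold(b, r):
    return (1 / b) ** (1 / r)

print(round(threshold(16, 4), 2))        # 0.5  (16 bands of 4 rows)
print(round(threshold(8, 8), 2))         # 0.77 (8 bands of 8 rows)
print(round(threshold(4, 16), 2))        # 0.92 (4 bands of 16 rows)
print(round(p_candidate(0.5, 20, 5), 2)) # 0.47 at s = 0.5 with b=20, r=5
```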
Efficient Binary Fuzzy Measure Representation and Choquet Integral Learning M. Islam 1 D. T. Anderson 2 X. Du 2 T. Havens 3 C. Wagner 4 1 Mississippi State University, USA 2 University of Missouri, USA
More informationPart I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim
More informationGroups that Distribute over Stars
Groups that Distribute over Stars Arthur Holshouser 3600 Bullard St Charlotte, NC, USA, 808 Harold Reiter Department of Mathematics UNC Charlotte Charlotte, NC 83 hbreiter@emailunccedu 1 Abstract Suppose
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu Task: Find coalitions in signed networks Incentives: European
More informationLocality Sensitive Hashing
Locality Sensitive Hashing February 1, 016 1 LSH in Hamming space The following discussion focuses on the notion of Locality Sensitive Hashing which was first introduced in [5]. We focus in the case of
More informationRandomized Algorithms
Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models
More informationProximity problems in high dimensions
Proximity problems in high dimensions Ioannis Psarros National & Kapodistrian University of Athens March 31, 2017 Ioannis Psarros Proximity problems in high dimensions March 31, 2017 1 / 43 Problem definition
More informationFrequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:
Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify
More informationData Mining Concepts & Techniques
Data Mining Concepts & Techniques Lecture No. 04 Association Analysis Naeem Ahmed Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology Jamshoro
More informationLecture 14 October 16, 2014
CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 14 October 16, 2014 Scribe: Jao-ke Chin-Lee 1 Overview In the last lecture we covered learning topic models with guest lecturer Rong Ge.
More informationLecture 3 Sept. 4, 2014
CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More information6.1 Occupancy Problem
15-859(M): Randomized Algorithms Lecturer: Anupam Gupta Topic: Occupancy Problems and Hashing Date: Sep 9 Scribe: Runting Shi 6.1 Occupancy Problem Bins and Balls Throw n balls into n bins at random. 1.
More informationTDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1
TDDD43 Information Retrieval Fang Wei-Kleiner ADIT/IDA Linköping University Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1 Outline 1. Introduction 2. Inverted index 3. Ranked Retrieval tf-idf
More informationScoring, Term Weighting and the Vector Space
Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J
More informationOptimal Blocking by Minimizing the Maximum Within-Block Distance
Optimal Blocking by Minimizing the Maximum Within-Block Distance Michael J. Higgins Jasjeet Sekhon Princeton University University of California at Berkeley November 14, 2013 For the Kansas State University
More informationCS168: The Modern Algorithmic Toolbox Lecture #19: Expander Codes
CS168: The Modern Algorithmic Toolbox Lecture #19: Expander Codes Tim Roughgarden & Gregory Valiant June 1, 2016 In the first lecture of CS168, we talked about modern techniques in data storage (consistent
More information