Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
|
|
- Lindsay Murphy
- 6 years ago
- Views:
Transcription
1 Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
2 Many problems can be expressed as finding similar objects: Find near(est)-neighbors Example Applications: Pages with similar words For duplicate detection, clustering by topic Customers who purchased similar products knn classification, collaborative filtering Images with similar features Image recommendation Record linkage (deduplication) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
3 [Hays and Efros, SIGGRAPH 7] nearest neighbors from a collection of million images J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 3
4 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 4 Given: (High dimensional) data points x, x, For example: Image is a vector of pixel colors [ ] And some distance function d(x, x ) Which quantifies the distance between x and x Goal: Find all pairs of data points (x i, x j ) that are within some distance threshold d x i, x j s
5 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Given: (High dimensional) data points x, x, For example: Image is a vector of pixel colors [ ] And some distance function d(x, x ) Which quantifies the distance between x and x Goal: Find all pairs of data points (x i, x j ) that are within some distance threshold d x i, x j s Naïve solution would take O N where N is the number of data points
6 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Hash objects to buckets such that objects that are similar hash to the same bucket Only compare candidates in each bucket Benefits: Instead of O(N ) comparisons, we need O(N) to find similar documents Hash functions depend on similarity functions
7 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 7 Goal: Given a large number (N in the millions or billions) of documents, find near duplicate pairs Applications: Mirror websites, or approximate mirrors Similar news articles at many news sites Problems: Too many documents to compare all pairs
8 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 8 Shingling: Convert documents to sets Simple approaches: Document = set of words appearing in document Document = set of important words
9 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 9 Need to account for ordering of words! Document = set of Shingles A set of k-shingles (or k-grams) is a set of k- sequence tokens that appears in the doc Tokens can be characters, words Example: k=; D = abcab Set of -shingles: S(D ) = {ab, bc, ca}
10 A natural similarity measure is the Jaccard similarity: sim(d, D ) = C C / C C J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
11 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Encode sets using / (bit, boolean) vectors One dimension per element in the universal set Interpret set intersection as bitwise AND, and set union as bitwise OR Example: C = ; C = Size of intersection = 3; size of union = 4, Jaccard similarity = 3/4 Distance: d(c,c ) = (Jaccard similarity) = /4
12 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Shingles Rows = elements (shingles) Columns = sets (documents) in row e and column s if and only if e is a member of s Column similarity is the Jaccard similarity of the corresponding sets Typical matrix is sparse! Example: sim(c,c ) =? Documents
13 Shingles Rows = elements (shingles) Columns = sets (documents) in row e and column s if and only if e is a member of s Column similarity is the Jaccard similarity of the corresponding sets Typical matrix is sparse! Example: sim(c,c ) =? Size of intersection = 3; size of union = 6, Jaccard similarity = 3/6 d(c,c ) = (Jaccard similarity) = 3/6 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Documents 3
14 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Suppose we need to find near-duplicate documents among N = million documents Naïvely, we would have to compute pairwise Jaccard similarities for every pair of docs N(N )/ 5* comparisons At 5 secs/day and 6 comparisons/sec, it would take 5 days For N = million, it takes more than a year
15 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Key Idea: hash each column C to a small signature h(c): () h(c) is small enough that the signature fits in RAM () sim(c, C ) is the same as the similarity of signatures h(c ) and h(c ) Locality sensitive hashing: If sim(c,c ) is high, then with high prob. h(c ) = h(c ) If sim(c,c ) is low, then with high prob. h(c ) h(c ) Expect that most pairs of near duplicate docs hash into the same bucket!
16 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 7 Input matrix (Shingles x Documents)
17 Signature matrix M J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, nd element of the permutation is the first to map to a 4 th element of the permutation is the first to map to a Input matrix (Shingles x Documents) Permutation
18 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 9 Imagine the rows of the boolean matrix permuted under random permutation Define a hash function h (C) = the index of the first (in the permuted order ) row in which column C has value : h (C) = min (C) Use several (e.g., ) independent hash functions (that is, permutations) to create a signature of a column
19 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Permuting rows even once is prohibitive Row hashing! Pick K hash functions k i Ordering under k i gives a random row permutation! How to pick a random hash function h(x)? Universal hashing: h a,b (x)= (a x+b) mod N where: a,b random integers
20 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, One-pass implementation For each column C Initialize all hash values for each permutation i: sig(c)[i] = For each row If there is a in column C Update hash value of column C if the row number in the current permutation is smaller than current value If k i < sig(c)[i], then sig(c)[i] k i
21 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets,
22 4 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4 Input matrix (Shingles x Documents) Permutation
23 One bit matching (given a ) Pr[h (C ) = h (C )] = J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Signature matrix M 4
24 One bit matching (given ) Pr[h (C ) = h (C )] = Sim(C, C) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Signature matrix M 4
25 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 8 Given cols C and C, rows may be classified as: C C A B C D a = # rows of type A, etc. Note: sim(c, C ) = a/(a +b +c) Then: Pr[h(C ) = h(c )] = Sim(C, C ) Look down the cols C and C until we see a If it s a type-a row, then h(c ) = h(c ) If a type-b or type-c row, then not
26 9 One given bit matching Pr[h (C ) = h (C )] = sim(c, C ) The expected similarity of the signatures (defined as fractions of matching values) sim[h (C ) = h (C )] = J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4
27 3 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) The expected similarity of the signatures (defined as fractions of matching values) sim[h (C ) = h (C )] = sim(c, C ) J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4
28 3 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Similarities: Col/Col Sig/Sig Signature matrix M 4 Input matrix (Shingles x Documents) Permutation
29 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 3 Key Idea: hash each column C to a small signature h(c): () h(c) is small enough that the signature fits in RAM () sim(c, C ) is the same as the similarity of signatures h(c ) and h(c ) Locality sensitive hashing: If sim(c,c ) is high, then with high prob. h(c ) = h(c ) If sim(c,c ) is low, then with high prob. h(c ) h(c ) Expect that most pairs of near duplicate docs hash into the same bucket!
30 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 33 Probability of sharing a bucket No chance if t < s Similarity threshold s Probability = if t > s Similarity t =sim(c, C ) of two sets
31 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 34 Probability of matching bit Similarity t =sim(c, C ) of two sets
32 35 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4
33 36 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = sim(c, C ) K J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, Signature matrix M 4
34 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 37 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = sim(c, C ) K Any bit matching (OR): Pr[any h (C ) = h (C )] = Signature matrix M 4
35 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 38 One given bit matching: Pr[h (C ) = h (C )] = sim(c, C ) Similarity of signatures: sim[h (C ) = h (C )] = sim(c, C ) All bits matching (AND): Pr[h(C ) = h(c )] = sim(c, C ) K Any bit matching (OR): Pr[any h (C ) = h (C )] = - (-sim(c, C )) K Signature matrix M 4
36 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, b bands r rows per band One signature Signature matrix M
37 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 48 Divide matrix M into b bands of r rows Candidate column pairs are those that hash to the same values for any band Tune b and r to catch most similar pairs, but few non-similar pairs
38 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Columns C and C have similarity t Prob. of one given band (r rows) matching = Prob. of any band matching =
39 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 5 Columns C and C have similarity t Prob. of one given band (r rows) matching = t r Prob. of any band matching = - ( - t r ) b
40 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 54 Probability of sharing a bucket s ~ (/b) /r - ( -t r ) b Similarity t=sim(c, C ) of two sets
41 Prob. sharing a bucket J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 55 Tradeoff of false positive and false negatives Example: 5 hash-functions (r=5, b=) Blue area: False Negative rate Green area: False Positive rate Similarity
42 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6
43 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 6 Given a (d,d,p,p )-sensitive family F AND-construction AND of r members of F (d,d,p r,p r )-sensitive Mirrors the effect of r rows in a single band OR-construction OR of b members of F (d,d,(-p ) b,(-p ) b )-sensitive Mirrors the effect of combining multiple bands Select r and b to increase p and decrease p
44 Hashing function: a vector of d dimensions is hashed into the ith bit value (d, d, -d/d, -d/d)-sensitive AND/OR constructions J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 63
45 J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 64 Distance function: d = theta A randomly chosen vector v f Given two vectors x and y, f(x) = f(y) iff v f x and v f y have the same sign (d,d, -d/8, -d/8)-sensitive
46 A randomly chosen line with segments of length a A point is hashed to the bucket in which its projection onto the line lies (d,d, p,p)-sensitive J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, 65
COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from
COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard
More informationCS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya
CS62: Scalable Data Mining Similarity Search and Hashing Sourangshu Bha>acharya Finding Similar Items Distance Measures Goal: Find near-neighbors in high-dim. space We formally define near neighbors as
More informationPiazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)
4/0/9 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 Piazza Recitation session: Review of linear algebra Location: Thursday, April, from 3:30-5:20 pm in SIG 34
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationHigh Dimensional Search Min- Hashing Locality Sensi6ve Hashing
High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of
More informationCSE 5243 INTRO. TO DATA MINING
CSE 5243 INTRO. TO DATA MINING Advanced Frequent Pattern Mining & Locality Sensitivity Hashing Huan Sun, CSE@The Ohio State University /7/27 Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan
More informationSimilarity Search. Stony Brook University CSE545, Fall 2016
Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages
More informationCOMPSCI 514: Algorithms for Data Science
COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018 Lecture 9 Similarity Queries Few words about the exam The exam is Thursday (Oct 4) in two days In
More informationMining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationDATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 4 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationWhy duplicate detection?
Near-Duplicates Detection Naama Kraus Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze Some slides are courtesy of Kira Radinsky Why duplicate detection?
More informationAlgorithms for Data Science: Lecture on Finding Similar Items
Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More informationB490 Mining the Big Data
B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More informationCS425: Algorithms for Web Scale Data
CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org Customer
More informationSlide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.
Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University http://www.mmds.org #1: C4.5 Decision Tree - Classification (61 votes) #2: K-Means - Clustering
More informationToday s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara
Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee
More informationMining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationCS246 Final Exam, Winter 2011
CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationMining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More information1 Finding Similar Items
1 Finding Similar Items This chapter discusses the various measures of distance used to find out similarity between items in a given set. After introducing the basic similarity measures, we look at how
More informationTheory of LSH. Distance Measures LS Families of Hash Functions S-Curves
Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment 1 Caramanis/Sanghavi Due: Thursday, Feb. 7, 2013. (Problems 1 and
More informationFinding similar items
Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the
More informationFrequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:
Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify
More informationMining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More informationLeast Squares Classification
Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting
More informationLecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Text Processing and High-Dimensional Data
Lecture Notes to Winter Term 2017/2018 Text Processing and High-Dimensional Data Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour,
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 14: Exponential decay; convolution Ramin Zabih Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman,
More informationData and Algorithms of the Web
Data and Algorithms of the Web Link Analysis Algorithms Page Rank some slides from: Anand Rajaraman, Jeffrey D. Ullman InfoLab (Stanford University) Link Analysis Algorithms Page Rank Hubs and Authorities
More informationCS425: Algorithms for Web Scale Data
CS: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS. The original slides can be accessed at: www.mmds.org J. Leskovec,
More informationCS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction
CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction Tim Roughgarden & Gregory Valiant April 12, 2017 1 The Curse of Dimensionality in the Nearest Neighbor Problem Lectures #1 and
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University.
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu What is the structure of the Web? How is it organized? 2/7/2011 Jure Leskovec, Stanford C246: Mining Massive
More informationIntroduction to Randomized Algorithms III
Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November 2017 1 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability
More informationLecture 14 October 16, 2014
CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 14 October 16, 2014 Scribe: Jao-ke Chin-Lee 1 Overview In the last lecture we covered learning topic models with guest lecturer Rong Ge.
More informationError Detection and Correction: Small Applications of Exclusive-Or
Error Detection and Correction: Small Applications of Exclusive-Or Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Exclusive-Or (XOR,
More informationRecommendation Systems
Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 Web pages are not equally important www.joe-schmoe.com
More informationEfficient Data Reduction and Summarization
Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information
More informationCS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)
CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose
More informationHomework #1 RELEASE DATE: 09/26/2013 DUE DATE: 10/14/2013, BEFORE NOON QUESTIONS ABOUT HOMEWORK MATERIALS ARE WELCOMED ON THE FORUM.
Homework #1 REEASE DATE: 09/6/013 DUE DATE: 10/14/013, BEFORE NOON QUESTIONS ABOUT HOMEWORK MATERIAS ARE WECOMED ON THE FORUM. Unless granted by the instructor in advance, you must turn in a printed/written
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More information1 [15 points] Frequent Itemsets Generation With Map-Reduce
Data Mining Learning from Large Data Sets Final Exam Date: 15 August 2013 Time limit: 120 minutes Number of pages: 11 Maximum score: 100 points You can use the back of the pages if you run out of space.
More informationInformation Gain. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University.
te to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them
More informationCSE 321 Discrete Structures
CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold
More informationMetric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)
Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy
More informationAnomaly Detection. Jing Gao. SUNY Buffalo
Anomaly Detection Jing Gao SUNY Buffalo 1 Anomaly Detection Anomalies the set of objects are considerably dissimilar from the remainder of the data occur relatively infrequently when they do occur, their
More informationScalable Algorithms for Distribution Search
Scalable Algorithms for Distribution Search Yasuko Matsubara (Kyoto University) Yasushi Sakurai (NTT Communication Science Labs) Masatoshi Yoshikawa (Kyoto University) 1 Introduction Main intuition and
More informationJeffrey D. Ullman Stanford University
Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some
More informationproximity similarity dissimilarity distance Proximity Measures:
Similarity Measures Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering nearest neighbor classification and anomaly detection. The
More informationProbabilistic Hashing for Efficient Search and Learning
Ping Li Probabilistic Hashing for Efficient Search and Learning July 12, 2012 MMDS2012 1 Probabilistic Hashing for Efficient Search and Learning Ping Li Department of Statistical Science Faculty of omputing
More informationCS60021: Scalable Data Mining. Dimensionality Reduction
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Dimensionality Reduction Sourangshu Bhattacharya Assumption: Data lies on or near a
More informationPart-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287
Part-of-Speech Tagging + Neural Networks 3: Word Embeddings CS 287 Review: Neural Networks One-layer multi-layer perceptron architecture, NN MLP1 (x) = g(xw 1 + b 1 )W 2 + b 2 xw + b; perceptron x is the
More informationV.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74
V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:
More informationClassification. Classification is similar to regression in that the goal is to use covariates to predict on outcome.
Classification Classification is similar to regression in that the goal is to use covariates to predict on outcome. We still have a vector of covariates X. However, the response is binary (or a few classes),
More informationCS540 ANSWER SHEET
CS540 ANSWER SHEET Name Email 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 1 2 Final Examination CS540-1: Introduction to Artificial Intelligence Fall 2016 20 questions, 5 points
More informationDATA MINING LECTURE 8. Dimensionality Reduction PCA -- SVD
DATA MINING LECTURE 8 Dimensionality Reduction PCA -- SVD The curse of dimensionality Real data usually have thousands, or millions of dimensions E.g., web documents, where the dimensionality is the vocabulary
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationLecture Introduction. 2 Linear codes. CS CTT Current Topics in Theoretical CS Oct 4, 2012
CS 59000 CTT Current Topics in Theoretical CS Oct 4, 01 Lecturer: Elena Grigorescu Lecture 14 Scribe: Selvakumaran Vadivelmurugan 1 Introduction We introduced error-correcting codes and linear codes in
More informationPredicting New Search-Query Cluster Volume
Predicting New Search-Query Cluster Volume Jacob Sisk, Cory Barr December 14, 2007 1 Problem Statement Search engines allow people to find information important to them, and search engine companies derive
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationRegularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline
Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function
More informationMATH 433 Applied Algebra Lecture 22: Review for Exam 2.
MATH 433 Applied Algebra Lecture 22: Review for Exam 2. Topics for Exam 2 Permutations Cycles, transpositions Cycle decomposition of a permutation Order of a permutation Sign of a permutation Symmetric
More informationText Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)
Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed
More informationCS 231A Section 1: Linear Algebra & Probability Review
CS 231A Section 1: Linear Algebra & Probability Review 1 Topics Support Vector Machines Boosting Viola-Jones face detector Linear Algebra Review Notation Operations & Properties Matrix Calculus Probability
More informationApproximate counting: count-min data structure. Problem definition
Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem
More informationCOMS W4995 Introduction to Cryptography October 12, Lecture 12: RSA, and a summary of One Way Function Candidates.
COMS W4995 Introduction to Cryptography October 12, 2005 Lecture 12: RSA, and a summary of One Way Function Candidates. Lecturer: Tal Malkin Scribes: Justin Cranshaw and Mike Verbalis 1 Introduction In
More informationHashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille
Hashing Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing Philip Bille Hashing Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing
More informationLecture 5: Clustering, Linear Regression
Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-2 STATS 202: Data mining and analysis Sergio Bacallado September 19, 2018 1 / 23 Announcements Starting next week, Julia Fukuyama
More informationELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties
ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties Prof. James She james.she@ust.hk 1 Last lecture 2 Selected works from Tutorial
More informationLecture 5: Clustering, Linear Regression
Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering
More informationCS224W: Social and Information Network Analysis Jure Leskovec, Stanford University
CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu Task: Find coalitions in signed networks Incentives: European
More informationHashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing
Philip Bille Dictionaries Dictionary problem. Maintain a set S U = {,..., u-} supporting lookup(x): return true if x S and false otherwise. insert(x): set S = S {x} delete(x): set S = S - {x} Dictionaries
More informationTemporal and spatial approaches for land cover classification.
Temporal and spatial approaches for land cover classification. Ryabukhin Sergey sergeyryabukhin@gmail.com Abstract. This paper describes solution for Time Series Land Cover Classification Challenge (TiSeLaC).
More informationNeural Networks. Single-layer neural network. CSE 446: Machine Learning Emily Fox University of Washington March 10, /9/17
3/9/7 Neural Networks Emily Fox University of Washington March 0, 207 Slides adapted from Ali Farhadi (via Carlos Guestrin and Luke Zettlemoyer) Single-layer neural network 3/9/7 Perceptron as a neural
More informationPrivacy-Preserving Data Mining
CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,
More informationCS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #15: Mining Streams 2 Today s Lecture More algorithms for streams: (1) Filtering a data stream: Bloom filters Select elements with property
More informationCoding for Digital Communication and Beyond Fall 2013 Anant Sahai MT 1
EECS 121 Coding for Digital Communication and Beyond Fall 2013 Anant Sahai MT 1 PRINT your student ID: PRINT AND SIGN your name:, (last) (first) (signature) PRINT your Unix account login: ee121- Prob.
More informationDatabase Systems CSE 514
Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter 2017 1 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section
More informationCS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University
CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University TheFind.com Large set of products (~6GB compressed) For each product A=ributes Related products Craigslist About 3 weeks of data
More informationLecture: Analysis of Algorithms (CS )
Lecture: Analysis of Algorithms (CS483-001) Amarda Shehu Spring 2017 1 Outline of Today s Class 2 Choosing Hash Functions Universal Universality Theorem Constructing a Set of Universal Hash Functions Perfect
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationVC-dimension for characterizing classifiers
VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to
More informationLecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.
CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the
More informationStatistical Problem. . We may have an underlying evolving system. (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t
Markov Chains. Statistical Problem. We may have an underlying evolving system (new state) = f(old state, noise) Input data: series of observations X 1, X 2 X t Consecutive speech feature vectors are related
More informationBayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington
Bayesian Classifiers and Probability Estimation Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington 1 Data Space Suppose that we have a classification problem The
More informationReconnaissance d objetsd et vision artificielle
Reconnaissance d objetsd et vision artificielle http://www.di.ens.fr/willow/teaching/recvis09 Lecture 6 Face recognition Face detection Neural nets Attention! Troisième exercice de programmation du le
More information6.1 Occupancy Problem
15-859(M): Randomized Algorithms Lecturer: Anupam Gupta Topic: Occupancy Problems and Hashing Date: Sep 9 Scribe: Runting Shi 6.1 Occupancy Problem Bins and Balls Throw n balls into n bins at random. 1.
More informationTopic 4 Randomized algorithms
CSE 103: Probability and statistics Winter 010 Topic 4 Randomized algorithms 4.1 Finding percentiles 4.1.1 The mean as a summary statistic Suppose UCSD tracks this year s graduating class in computer science
More informationDecision Trees. Machine Learning CSEP546 Carlos Guestrin University of Washington. February 3, 2014
Decision Trees Machine Learning CSEP546 Carlos Guestrin University of Washington February 3, 2014 17 Linear separability n A dataset is linearly separable iff there exists a separating hyperplane: Exists
More informationHigh Dimensional Geometry, Curse of Dimensionality, Dimension Reduction
Chapter 11 High Dimensional Geometry, Curse of Dimensionality, Dimension Reduction High-dimensional vectors are ubiquitous in applications (gene expression data, set of movies watched by Netflix customer,
More informationMachine Learning. Theory of Classification and Nonparametric Classifier. Lecture 2, January 16, What is theoretically the best classifier
Machine Learning 10-701/15 701/15-781, 781, Spring 2008 Theory of Classification and Nonparametric Classifier Eric Xing Lecture 2, January 16, 2006 Reading: Chap. 2,5 CB and handouts Outline What is theoretically
More information