COMPSCI 514: Algorithms for Data Science

Similar documents
Similarity Search. Stony Brook University CSE545, Fall 2016

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

High Dimensional Search Min-Hashing Locality Sensitive Hashing

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bhattacharya

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Algorithms for Data Science: Lecture on Finding Similar Items

B490 Mining the Big Data

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

CS5112: Algorithms and Data Structures for Applications

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing

1 Finding Similar Items

Today's topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

CS425: Algorithms for Web Scale Data

Finding similar items

CSE 5243 INTRO. TO DATA MINING

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Text Processing and High-Dimensional Data

Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur

NCC Education Limited. Substitution Topic

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Announcements Monday, November 13

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

CS5112: Algorithms and Data Structures for Applications

Why duplicate detection?

Announcements Monday, November 13

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Database Systems CSE 514

CS246 Final Exam, Winter 2011

Name Period. Date: Essential Question: Does the function ( ) have an inverse function? Explain your answer.

CS246 Final Exam. March 16, :30AM - 11:30AM

COMPSCI 514: Algorithms for Data Science

1.1 Administrative Stuff

MATRIX DETERMINANTS. 1 Reminder Definition and components of a matrix

Relational Nonlinear FIR Filters. Ronald K. Pearson

2.6 Complexity Theory for Map-Reduce. Star Joins

Homework Assignment 6 Answers

Recommendation Systems

CS60021: Scalable Data Mining. Dimensionality Reduction

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

26 Group Theory Basics

CS60021: Scalable Data Mining. Large Scale Machine Learning

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence

Probability (Devore Chapter Two)

Privacy in Statistical Databases

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

MLE/MAP + Naïve Bayes

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Slide source: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University.

Fooling Sets and. Lecture 5

CSE446: non-parametric methods Spring 2017

Exploring Large Graphs

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

1111: Linear Algebra I

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014

compare to comparison and pointer based sorting, binary trees

Content-Addressable Memory Associative Memory Lernmatrix Association Heteroassociation Learning Retrieval Reliability of the answer

Course overview. Heikki Mannila Laboratory for Computer and Information Science Helsinki University of Technology

Exam Question 10: Differential Equations. June 19, Applied Mathematics: Lecture 6. Brendan Williamson. Introduction.

COMP 175 COMPUTER GRAPHICS. Lecture 04: Transform 1. COMP 175: Computer Graphics February 9, Erik Anderson 04 Transform 1

Algorithmic methods of data mining, Autumn 2007, Course overview0-0. Course overview

Math 1021, Linear Algebra 1. Section: A at 10am, B at 2:30pm

Lecture 10: The Normal Distribution. So far all the random variables have been discrete.

Matrix-Vector Products and the Matrix Equation Ax = b

Social Choice and Networks

Machine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)

Algorithms for Nearest Neighbors

CS 124 Math Review Section January 29, 2018

Hash tables. Hash tables

CS425: Algorithms for Web Scale Data

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

MATH 122 SYLLABUS HARVARD UNIVERSITY MATH DEPARTMENT, FALL 2014

Intelligent Data Analysis Lecture Notes on Document Mining

Section 4.6 Negative Exponents

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

March 19 - Solving Linear Systems

Lecture 2e Row Echelon Form (pages 73-74)

MA 1128: Lecture 08 03/02/2018. Linear Equations from Graphs And Linear Inequalities

Data preprocessing. DataBase and Data Mining Group 1. Data set types. Tabular Data. Document Data. Transaction Data. Ordered Data

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Stat 609: Mathematical Statistics I (Fall Semester, 2016) Introduction

Today s Menu. Administrativia Two Problems Cutting a Pizza Lighting Rooms

Intelligent Data Analysis. Mining Textual Data. School of Computer Science University of Birmingham

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

Testing a Hash Function using Probability

Optimal Data-Dependent Hashing for Approximate Near Neighbors

I am trying to keep these lessons as close to actual class room settings as possible.

Select/Special Topics in Atomic Physics Prof. P. C. Deshmukh Department of Physics Indian Institute of Technology, Madras

Perform the same three operations as above on the values in the matrix, where some notation is given as a shorthand way to describe each operation:

Ch. 7 Statistical Intervals Based on a Single Sample

CS264: Beyond Worst-Case Analysis Lecture #15: Topic Modeling and Nonnegative Matrix Factorization

MATH 433 Applied Algebra Lecture 22: Review for Exam 2.

Decision Tree Algorithm Week 4

Last Time. x + 3y = 6 x + 2y = 1. x + 3y = 6 y = 1. 2x + 4y = 8 x 2y = 1. x + 3y = 6 2x y = 7. Lecture 2

Section F Ratio and proportion

CSE 494/598 Lecture-6: Latent Semantic Indexing. **Content adapted from last year s slides

Transcription:

COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018

Lecture 9 Similarity Queries

A few words about the exam
The exam is Thursday (Oct 4), in two days
In this room, at this time
Duration: 1 hour
Syllabus: up to the end of last class
Focus: clustering
An extensive list of chapters was provided on Piazza; concentrate on the things that were discussed in detail in class.

From the two books
From the Blum et al. book: Ch 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9.1, 3.9.2, 3.9.4 (all exercises in Ch 3 are relevant); Ch 4.1, 4.8 (Ex. 4.1-4.5, 4.11, 4.12, 4.25, 4.55-4.57); Ch 7.1, 7.2, 7.3, 7.4, 7.5 (Ex. 7.1-7.21, 7.25-7.29)
From MMDS (exercises are at the end of each section): Ch 5.1, 5.5; Ch 7.1, 7.3, 7.4; Ch 10.4; Ch 11.1, 11.2, 11.3
Only things that were covered in class are in the syllabus
Concentrate on topics that were not covered by the first homework
It is fine if you learn something by mistake

Finding similar items [Hays and Efros, SIGGRAPH 2007] J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Finding near-neighbors in a high-dimensional space
Pages with similar words: for data de-duplication, classification by topic
Customers who purchased similar products: product classification
Images with similar features
The space is high dimensional; the database can be huge

Finding near-neighbors in a high-dimensional space
Near neighbors: points that are only a small distance away
We have to define a distance or similarity, for example the Jaccard distance/similarity
For any two sets A and B, the Jaccard similarity is |A ∩ B| / |A ∪ B|
For any two sets A and B, the Jaccard distance is 1 − |A ∩ B| / |A ∪ B|

Jaccard similarity of sets (MMDS Section 3.1.1)
To turn the problem of textual similarity into one of set intersection, we use a technique called shingling, which is introduced in Section 3.2.
The Jaccard similarity of sets S and T is SIM(S, T) = |S ∩ T| / |S ∪ T|, that is, the ratio of the size of the intersection of S and T to the size of their union.
The Jaccard distance of S and T is 1 − SIM(S, T).
Example 3.1: In Fig. 3.1 we see two sets S and T. There are three elements in their intersection and a total of eight elements that appear in S or T or both. Thus SIM(S, T) = 3/8 and the Jaccard distance is 5/8.
[Figure 3.1: Two sets with Jaccard similarity 3/8]
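For concreteness, a minimal Python sketch of Jaccard similarity and distance on built-in sets. The function names and the two example sets are illustrative (not from the lecture); the sets are chosen only so that the numbers match Example 3.1.

def jaccard_similarity(s, t):
    # SIM(S, T) = |S ∩ T| / |S ∪ T| for two finite sets.
    if not s and not t:
        return 1.0  # convention: two empty sets are considered identical
    return len(s & t) / len(s | t)

def jaccard_distance(s, t):
    # Jaccard distance = 1 - SIM(S, T).
    return 1.0 - jaccard_similarity(s, t)

# Hypothetical sets with 3 elements in common and 8 in the union.
S = {1, 2, 3, 4, 5}
T = {3, 4, 5, 6, 7, 8}
print(jaccard_similarity(S, T))  # 0.375 = 3/8
print(jaccard_distance(S, T))    # 0.625 = 5/8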

Running example: similarity of documents
Jaccard distance for textual similarity
To find plagiarism: there is no simple automated process, since the documents are not exactly the same
To find the same news story from different sources
To match mirror pages

Documents to sets: shingling
Jaccard distance for textual similarity
Document: a string of characters
k-shingle: a substring of length k within the document
Example: abcdabd
Set of 2-shingles: {ab, bc, cd, da, bd}
Bag of 2-shingles: {ab, bc, cd, da, ab, bd}
We might remove whitespace altogether, or replace each sequence of whitespace characters by a single whitespace
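A small sketch (an assumed helper, not from the slides) that extracts the set of k-shingles of a document, with the optional whitespace normalization mentioned above.

def shingles(doc, k=2, collapse_whitespace=True):
    # Return the set of k-shingles (length-k substrings) of a document.
    # collapse_whitespace replaces each run of whitespace by a single space.
    if collapse_whitespace:
        doc = " ".join(doc.split())
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(sorted(shingles("abcdabd", k=2)))  # ['ab', 'bc', 'bd', 'cd', 'da']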

Shingling: choosing the shingle size
k = 1: bad
k should be picked large enough that the probability of any given shingle appearing in any given document is low
k very large: not enough statistics, also bad
For large documents, k = 9 is considered safe
Hashing shingles: give each shingle a number, and instead of the set of shingles keep the set of numbers
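A hedged sketch of the "hash the shingles" idea: each 9-shingle is mapped to an integer bucket, and we keep the set of integers instead of the set of strings. The function name, the use of md5, and the number of buckets are illustrative choices, not prescribed by the lecture.

import hashlib

def shingle_ids(doc, k=9, buckets=2**32):
    # Replace each k-shingle by a bucket number in [0, buckets),
    # keeping the smaller set of integers instead of the raw strings.
    ids = set()
    for i in range(len(doc) - k + 1):
        shingle = doc[i:i + k]
        # md5 is used here only as a convenient deterministic string hash;
        # any good hash into `buckets` values would do.
        h = int(hashlib.md5(shingle.encode("utf-8")).hexdigest(), 16)
        ids.add(h % buckets)
    return ids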

Shingles built from words
First rule: avoid bare stop words (and, such, to, etc.), which carry little information on their own
Instead, use a stop word followed by the next two or three words as a shingle: "and other world leaders", etc.
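An illustrative sketch of word-based shingles built from a stop word plus the following words. The stop-word list here is a tiny placeholder, and the function name and parameters are hypothetical.

STOP_WORDS = {"and", "such", "to", "the", "a", "of"}  # illustrative list only

def stopword_shingles(text, following=2):
    # A shingle is a stop word followed by the next `following` words,
    # e.g. 'and other world' from 'and other world leaders'.
    words = text.lower().split()
    out = set()
    for i, w in enumerate(words):
        if w in STOP_WORDS and i + following < len(words):
            out.add(" ".join(words[i:i + following + 1]))
    return out

print(stopword_shingles("Bush and other world leaders met"))
# {'and other world'}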

Summarizing the shingles: how to go from sets to signatures
Matrix representation of sets (MMDS Section 3.3.1): the characteristic matrix
Before explaining how it is possible to construct small signatures from large sets, it is helpful to visualize a collection of sets as their characteristic matrix. The columns of the matrix correspond to the sets, and the rows correspond to elements of the universal set from which the elements of the sets are drawn. There is a 1 in row r and column c if the element for row r is a member of the set for column c; otherwise the value in position (r, c) is 0.

Element  S1  S2  S3  S4
  a       1   0   0   1
  b       0   0   1   0
  c       0   1   0   1
  d       1   0   1   1
  e       0   0   1   0
Figure 3.2: A matrix representing four sets

U = {a, b, c, d, e}; S1 = {a, d}; S2 = {c}; etc.
Example 3.6: Fig. 3.2 is an example of a matrix representing sets chosen from the universal set {a, b, c, d, e}. Here S1 = {a, d}, S2 = {c}, S3 = {b, d, e}, and S4 = {a, c, d}. The top row and leftmost column are not part of the matrix; they are present only to remind us what the rows and columns represent.
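A small sketch (assumed code, not from the text) that builds the characteristic matrix of Fig. 3.2 from the four sets. In practice such a matrix is sparse, so one would store only the positions of the 1's; the dense version is built here only for visualization.

universe = ["a", "b", "c", "d", "e"]
sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}

# One row per element of the universe, one column per set.
matrix = {elem: [1 if elem in sets[s] else 0 for s in ("S1", "S2", "S3", "S4")]
          for elem in universe}
for elem, row in matrix.items():
    print(elem, row)
# a [1, 0, 0, 1]
# b [0, 0, 1, 0]
# c [0, 1, 0, 1]
# d [1, 0, 1, 1]
# e [0, 0, 1, 0]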

Sets to signatures: minhashing
Hash the columns of the characteristic matrix. How?
Pick a permutation of the rows.
The hash value of a column is the element in the first row, in the permuted order, where the column has a 1.

Minhashing (MMDS Section 3.3.2)
In this section we shall learn how a minhash is computed in principle; in later sections we shall see how a good approximation to the minhash is computed in practice.
To minhash a set represented by a column of the characteristic matrix, pick a permutation of the rows. The minhash value of any column is the number of the first row, in the permuted order, in which the column has a 1.

Example 3.7: Let us suppose we pick the order of rows beadc for the matrix of Fig. 3.2. This permutation defines a minhash function h that maps sets to rows. Let us compute the minhash value of set S1 according to h. The column for set S1 has 0 in row b, so we proceed to row e, the second in the permuted order. There is again a 0 in the column for S1, so we proceed to row a, where we find a 1. Thus h(S1) = a.

Permutation of rows: b, e, a, d, c

Element  S1  S2  S3  S4
  b       0   0   1   0
  e       0   0   1   0
  a       1   0   0   1
  d       1   0   1   1
  c       0   1   0   1
Figure 3.3: A permutation of the rows of Fig. 3.2

h(S1) = a; h(S2) = c; h(S3) = b; h(S4) = a

It is important to remember that the characteristic matrix is unlikely to be the way the data is stored; it is useful only as a way to visualize the data. One reason not to store the data as a matrix is that these matrices are almost always sparse (they have many more 0's than 1's) in practice, so it saves space to represent a sparse matrix of 0's and 1's by the positions in which the 1's appear.
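The following sketch mirrors Example 3.7: it scans a permutation of the rows and returns the first row in which the set has a 1. The permutation and sets match Fig. 3.3; the function name is illustrative.

def minhash(column_set, permutation):
    # Minhash of a set under a given ordering of the rows:
    # the first row in permuted order in which the set has a 1.
    for row in permutation:
        if row in column_set:
            return row
    return None  # an empty set has no row with a 1

perm = ["b", "e", "a", "d", "c"]  # the permutation "beadc" of Example 3.7
sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"}, "S4": {"a", "c", "d"}}
for name, s in sets.items():
    print(name, minhash(s, perm))
# S1 a, S2 c, S3 b, S4 a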

Why minhashing?
For any two sets S, T, the probability that the minhashes (under a uniformly random permutation of the rows) are equal is
P(h(S) = h(T)) = SIM(S, T),
the Jaccard similarity of the two sets. Why?

Element  S1  S2  S3  S4
  b       0   0   1   0
  e       0   0   1   0
  a       1   0   0   1
  d       1   0   1   1
  c       0   1   0   1

(Hint: ignore rows where both columns are 0. The minhashes agree exactly when the first remaining row, in the permuted order, has a 1 in both columns; each of the |S ∪ T| remaining rows is equally likely to come first, and |S ∩ T| of them have a 1 in both columns.)
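To make the claim concrete, here is a small Monte Carlo check (not from the lecture): it draws uniformly random permutations of the rows and estimates P(h(S1) = h(S4)), which should approach SIM(S1, S4) = 2/3. The function names, seed, and trial count are illustrative.

import random

def minhash(s, order):
    # First row in the permuted order in which the set has a 1 (None if empty).
    return next((row for row in order if row in s), None)

def estimate_collision_probability(S, T, universe, trials=100_000, seed=0):
    # Estimate Pr[h(S) = h(T)] over uniformly random row permutations.
    rng = random.Random(seed)
    rows = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(rows)
        if minhash(S, rows) == minhash(T, rows):
            hits += 1
    return hits / trials

S1, S4 = {"a", "d"}, {"a", "c", "d"}
print(estimate_collision_probability(S1, S4, "abcde"))
# ≈ 2/3, the Jaccard similarity |S1 ∩ S4| / |S1 ∪ S4|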

Minhash signature
Take n (say 100) random permutations
Each defines a minhash function: h_1, h_2, ..., h_n
The minhash signature of S is the vector [h_1(S), h_2(S), ..., h_n(S)]
n is much smaller than the number of rows of the characteristic matrix
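A minimal sketch of building signatures with explicit random permutations, as on this slide (in practice, permutations are usually simulated with random hash functions, a refinement covered later in MMDS Ch. 3 and not shown here). The universe, the example sets, and n = 100 are the toy values used above.

import random

def minhash_signature(s, permutations):
    # Signature [h_1(S), ..., h_n(S)]: one minhash per permutation.
    # Assumes s is non-empty so each permutation contains a row with a 1.
    return [next(row for row in perm if row in s) for perm in permutations]

universe = list("abcde")
rng = random.Random(42)
n = 100
permutations = [rng.sample(universe, len(universe)) for _ in range(n)]

sig_S1 = minhash_signature({"a", "d"}, permutations)
sig_S4 = minhash_signature({"a", "c", "d"}, permutations)
# The fraction of agreeing positions estimates the Jaccard similarity (2/3 here).
agree = sum(a == b for a, b in zip(sig_S1, sig_S4)) / n
print(agree)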