Why duplicate detection?

Similar documents
COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Finding similar items

Similarity Search. Stony Brook University CSE545, Fall 2016

Analysis of Algorithms I: Perfect Hashing

CSE 321 Discrete Structures

High Dimensional Search: Min-Hashing, Locality Sensitive Hashing

Introduction to Information Retrieval

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bhattacharya

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Algorithms for Data Science: Lecture on Finding Similar Items

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

Expectation of geometric distribution

Probabilistic Near-Duplicate. Detection Using Simhash

Algorithms for Data Science

Lecture: Analysis of Algorithms (CS )

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Expectation of geometric distribution. Variance and Standard Deviation. Variance: Examples

Discrete Structures for Computer Science

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

HW2 Solutions Problem 1: 2.22 Find the sign and inverse of the permutation shown in the book (and below).

CS246 Final Exam, Winter 2011

Database Systems CSE 514

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

14.1 Finding frequent elements in stream

B490 Mining the Big Data

Hash tables. Hash tables

CPSC 467: Cryptography and Computer Security

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record.

Hash tables. Hash tables

CPSC 467: Cryptography and Computer Security

1 Maintaining a Dictionary

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Hash tables. Hash tables

1 Finding Similar Items

COMPSCI 514: Algorithms for Data Science

ORTHOGONAL ARRAYS OF STRENGTH 3 AND SMALL RUN SIZES

Hash-based Indexing: Application, Impact, and Realization Alternatives

Generating Subfields. Mark van Hoeij. June 15, 2017

Error Detection and Correction: Small Applications of Exclusive-Or

Advanced Counting Techniques. Chapter 8

Hashing, Hash Functions. Lecture 7

Design and Analysis of Algorithms

Introduction to Hash Tables

Discrete Probability

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Topic 4 Randomized algorithms

FRACTIONAL FACTORIAL DESIGNS OF STRENGTH 3 AND SMALL RUN SIZES

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Does Unlabeled Data Help?

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

CSE 312, 2017 Winter, W.L.Ruzzo. 5. independence [ ]

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

4.4 Uniform Convergence of Sequences of Functions and the Derivative

0-1 Knapsack Problem

Chapter 4: Interpolation and Approximation. October 28, 2005

Data Mining and Matrices

Tricks of the Trade in Combinatorics and Arithmetic

Randomized Algorithms, Spring 2014: Project 2

Randomized Algorithms. Lecture 4. Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010

Topics in Approximation Algorithms Solution for Homework 3

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation

B669 Sublinear Algorithms for Big Data

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

CSE101: Design and Analysis of Algorithms. Ragesh Jaiswal, CSE, UCSD

10.3 Random variables and their expectation 301

GALOIS THEORY : LECTURE 4 LEO GOLDMAKHER

Lecture 12: Interactive Proofs

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

CMPS 561 Boolean Retrieval. Ryan Benton Sept. 7, 2011

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Entropy Rate of Stochastic Processes

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity

Some notes on streaming algorithms continued

Ma/CS 6a Class 4: Primality Testing

Lecture 7: Fingerprinting. David Woodruff Carnegie Mellon University

Coloring Vertices and Edges of a Path by Nonempty Subsets of a Set

Chapter 9: Relations Relations

CPSC 467: Cryptography and Computer Security

Recommendation Systems

Advanced Counting Techniques

Hashing Data Structures. Ananda Gunawardena

Part 1: Hashing and Its Many Applications

Integer Sorting on the word-ram

k-protected VERTICES IN BINARY SEARCH TREES

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

15-855: Intensive Intro to Complexity Theory Spring Lecture 16: Nisan s PRG for small space

CSCE 222 Discrete Structures for Computing

Efficient Data Reduction and Summarization

Functions and cardinality (solutions) sections A and F TA: Clive Newstead 6 th May 2014

MAT137 - Term 2, Week 2

Transcription:

Near-Duplicates Detection
Naama Kraus

Slides are based on the Introduction to Information Retrieval book by Manning, Raghavan and Schütze. Some slides are courtesy of Kira Radinsky.

Why duplicate detection?
About 30-40% of the pages on the Web are (near-)duplicates of other pages, e.g. mirror sites. Search engines try to avoid indexing duplicate pages, in order to:
- save storage
- save processing time
- avoid returning duplicate pages in search results, improving the user's search experience
The goal: detect duplicate pages.

Exact-duplicates detection
A naïve approach detects exact duplicates: map each page to some fingerprint, e.g. a 64-bit value. If two web pages have equal fingerprints, check whether their contents are equal.

Near-duplicates
What about near-duplicates, pages that are almost identical? They are common on the Web; e.g., two copies of a page that differ only in a date. Eliminating near-duplicates is desirable!

The challenge
How can we detect near-duplicates efficiently? Exhaustively comparing all pairs of web pages wouldn't scale.
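The fingerprint idea above can be sketched in a few lines. The use of a truncated SHA-256 digest as the 64-bit fingerprint is an illustrative assumption, not the slides' specific scheme; any well-mixing 64-bit hash would serve.

```python
import hashlib

def fingerprint64(page_text: str) -> int:
    """Map a page to a 64-bit fingerprint (here: first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(page_text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def exact_duplicates(page_a: str, page_b: str) -> bool:
    """Cheap fingerprint test first; on a match, confirm with a full comparison."""
    if fingerprint64(page_a) != fingerprint64(page_b):
        return False
    return page_a == page_b  # guard against (rare) fingerprint collisions
```

The full comparison on a fingerprint match is what makes the test exact despite the hashing.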

Shingling
The k-shingles of a document d are defined to be the set of all consecutive sequences of k terms in d, where k is a positive integer. E.g., the 4-shingles of "My name is Inigo Montoya. You killed my father. Prepare to die" are:
{ my name is inigo, name is inigo montoya, is inigo montoya you, inigo montoya you killed, montoya you killed my, you killed my father, killed my father prepare, my father prepare to, father prepare to die }

Computing Similarity
Intuition: two documents are near-duplicates if their shingle sets are nearly the same. We measure similarity using the Jaccard coefficient, the degree of overlap between two sets. Denoting by S(d) the set of shingles of document d:
J(S(d1), S(d2)) = |S(d1) ∩ S(d2)| / |S(d1) ∪ S(d2)|
If J exceeds a preset threshold (e.g. 0.9), declare d1 and d2 near-duplicates.
Issue: this computation is costly and is done pairwise. How can we compute Jaccard efficiently?
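The two definitions above translate directly into code; a minimal sketch (the function names and the naive tokenization are illustrative assumptions):

```python
def k_shingles(doc: str, k: int) -> set:
    """Set of all consecutive sequences of k terms in doc."""
    terms = doc.lower().replace(".", "").split()
    return {" ".join(terms[i:i + k]) for i in range(len(terms) - k + 1)}

def jaccard(s1: set, s2: set) -> float:
    """Jaccard coefficient: |S1 & S2| / |S1 | S2|."""
    return len(s1 & s2) / len(s1 | s2)

d = "My name is Inigo Montoya. You killed my father. Prepare to die"
shingles = k_shingles(d, 4)  # the 9 distinct 4-shingles listed on the slide
```

Here `jaccard(k_shingles(d1, 4), k_shingles(d2, 4))` is the quantity the slide compares against the threshold.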

Hashing shingles
Map each shingle to a hash value, an integer over a large space, say 64 bits. Let H(di) denote the set of hash values derived from S(di). We need to detect pairs of documents whose sets H(·) have a large overlap. How can we do this efficiently? See the next slides.

Permuting
Let π be a random permutation over the hash-value space: π: N → N such that x ≠ y ⇒ π(x) ≠ π(y). Let P(di) denote the set of permuted hash values in H(di), i.e. P(di) = π(H(di)), and let xi be the smallest integer in P(di): xi = min P(di).
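A direct rendering of the permutation idea, on a toy universe small enough to permute explicitly (an assumption for illustration; at 2^64 an explicit permutation is infeasible, which is exactly the problem the row-hashing trick later addresses):

```python
import random

def min_hash(hash_values, perm):
    """x_i = min of the permuted hash values, i.e. min of pi(H(d_i))."""
    return min(perm[h] for h in hash_values)

# Toy stand-in for the 64-bit hash-value space.
universe = list(range(100))
shuffled = universe[:]
random.Random(0).shuffle(shuffled)
pi = dict(zip(universe, shuffled))   # a random permutation pi: x -> pi(x)

H1 = {3, 17, 42, 99}                 # hash values of one document's shingles
x1 = min_hash(H1, pi)                # the document's min-hash under pi
```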

Illustration
(Figure: the document's shingles are hashed to 64-bit values, shown as points on the number line [0, 2^64); the permutation π moves them to new positions on the line; the minimum permuted value is picked.)

Key Theorem
Theorem: J(S(di), S(dj)) = P(xi = xj), where xi and xj are computed under the same permutation.
Intuition: if the shingle sets of two documents are nearly the same, then after a random permutation there is a high probability that their minimal values are equal.
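The theorem can be checked empirically: over many random permutations, the fraction of permutations on which the two min-hashes coincide approaches the Jaccard coefficient. A simulation sketch (the sets, universe size, and trial count are illustrative choices):

```python
import random

def minhash_collision_rate(s1, s2, universe, trials=20000, seed=0):
    """Estimate P(x_i = x_j) over random permutations of the universe."""
    rng = random.Random(seed)
    elems = list(universe)
    hits = 0
    for _ in range(trials):
        rng.shuffle(elems)
        rank = {e: r for r, e in enumerate(elems)}  # one random permutation
        if min(rank[e] for e in s1) == min(rank[e] for e in s2):
            hits += 1
    return hits / trials

s1, s2 = {0, 2, 3, 5}, {0, 1, 3}
# True Jaccard: |{0, 3}| / |{0, 1, 2, 3, 5}| = 2/5 = 0.4
rate = minhash_collision_rate(s1, s2, range(8))
```

With 20,000 trials the estimate should sit within a few hundredths of 0.4.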

Proof (1)
View the sets S1, S2 as columns of a matrix A, with one row for each element in the universe; a_ij = 1 indicates the presence of item i in set j.
Example:
S1 S2
0  1
1  0
1  1
0  0
1  1
0  1
Jaccard(S1, S2) = 2/5 = 0.4

Proof (2)
For columns Si, Sj there are four types of rows:
    Si Sj
A:  1  1
B:  1  0
C:  0  1
D:  0  0
Let A also denote the number of rows of type A (and similarly B, C, D). Clearly, J(S1, S2) = A/(A+B+C).

Proof (3)
Let π be a random permutation of the rows of A. Denote by P(Sj) the column that results from applying π to the j-th column, and let xi be the index of the first row in which the column P(Si) has a 1.
P(S1) P(S2)
0     1
1     0
1     1
0     0
1     1
0     1

Proof (4)
We've seen that J(S1, S2) = A/(A+B+C). Claim: P(xi = xj) = A/(A+B+C).
Why? Look down columns P(Si), P(Sj) until the first non-type-D row; that row determines xi or xj (the smaller of the two, or both if they are equal). Now xi = xj exactly when that first non-type-D row is of type A. Since the permutation is random, the probability that the first non-type-D row is of type A is A/(A+B+C). Hence P(xi = xj) = J(S1, S2).

Sketches
Our Jaccard coefficient test is thus probabilistic: we need to estimate P(xi = xj).
Method: pick k (~200) random row permutations π. The sketch of di is the list of its xi values, one per permutation; the list has length k.
Jaccard estimation: the fraction of permutations on which the sketch values agree,
|Sketch(di) ∩ Sketch(dj)| / k

Example sketches (the value recorded is the original index of the first row, in permuted order, whose column entry is 1):
     S1 S2 S3
R1   1  0  1
R2   0  1  1
R3   1  0  0
R4   1  0  1
R5   0  1  0

                   S1 S2 S3
Perm 1 = (12345):  1  2  1
Perm 2 = (54321):  4  5  4
Perm 3 = (34512):  3  5  4

Estimated similarities:
1-2: 0/3, 1-3: 2/3, 2-3: 0/3
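The sketch table above can be reproduced in a few lines. Here a permutation is represented as the order in which rows are scanned, and the sketch value is the original (1-based) index of the first row containing a 1, matching the slide's convention; the helper names are illustrative.

```python
def sketch_value(column, row_order):
    """Original index of the first row, in permuted order, with a 1."""
    for r in row_order:
        if column[r - 1] == 1:
            return r
    return None

cols = {  # rows R1..R5 from the slide
    "S1": [1, 0, 1, 1, 0],
    "S2": [0, 1, 0, 0, 1],
    "S3": [1, 1, 0, 1, 0],
}
perms = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]
sketch = {c: [sketch_value(v, p) for p in perms] for c, v in cols.items()}

def est(a, b):
    """Estimated Jaccard: fraction of permutations whose values agree."""
    return sum(x == y for x, y in zip(sketch[a], sketch[b])) / len(perms)
```

Running this recovers the slide's sketches and the 0/3, 2/3, 0/3 similarity estimates.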

Algorithm for Clustering Near-Duplicate Documents
1. Compute the sketch of each document.
2. From each sketch, produce a list of <shingle, docid> pairs.
3. Group all pairs by shingle value.
4. For any shingle that is shared by more than one document, output a triplet <smaller-docid, larger-docid, 1> for each pair of docids sharing that shingle.
5. Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docid, larger-docid, # common shingles>.
6. Join any pair of documents whose number of common shingles exceeds a chosen threshold, using a union-find algorithm.
7. Each resulting connected component of the union-find structure is a cluster of near-duplicate documents.
The implementation nicely fits the map-reduce programming paradigm.

Implementation Trick
Permuting the universe even once is prohibitive.
Row hashing: pick P hash functions h_k; the ordering of the rows under h_k gives a (pseudo-)random permutation of the rows.
One-pass implementation: for each column C_i and hash function h_k, keep a slot for the min-hash value, initialized to infinity. Scan the rows in arbitrary order looking for 1's; whenever row R_j has a 1 in column C_i, then for each h_k, if h_k(j) < slot(C_i, h_k), set slot(C_i, h_k) ← h_k(j).
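The one-pass scheme above can be sketched as follows; the matrix layout and function name are illustrative assumptions, and the caller supplies the hash functions h_k:

```python
INF = float("inf")

def one_pass_minhash(rows, hash_funcs):
    """rows[j-1] is row R_j as a list of 0/1 entries, one per column C_i.
    Returns slots, where slots[i][k] is the running minimum of h_k(j)
    over all rows j that have a 1 in column C_i."""
    n_cols = len(rows[0])
    slots = [[INF] * len(hash_funcs) for _ in range(n_cols)]
    for j, row in enumerate(rows, start=1):      # single scan over the rows
        hashes = [h(j) for h in hash_funcs]      # compute each h_k(j) once
        for i, bit in enumerate(row):
            if bit:                              # row R_j has a 1 in C_i
                for k, hv in enumerate(hashes):
                    if hv < slots[i][k]:
                        slots[i][k] = hv         # slot(C_i, h_k) <- h_k(j)
    return slots
```

Each column's final slot values form its min-hash signature without ever materializing a permutation.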

Example
     C1 C2
R1   1  0
R2   0  1
R3   1  1
R4   1  0
R5   0  1

Hash functions: h(x) = x mod 5, g(x) = (2x+1) mod 5. (Denote R5 as R0, i.e., plug x = 5 into the formulas, so h(5) = 0.)

Scanning the rows, with slot pairs written as (h-slot, g-slot) and "-" for infinity:
                      C1 slots   C2 slots
R1: h(1)=1, g(1)=3    (1, 3)     (-, -)
R2: h(2)=2, g(2)=0    (1, 3)     (2, 0)
R3: h(3)=3, g(3)=2    (1, 2)     (2, 0)
R4: h(4)=4, g(4)=4    (1, 2)     (2, 0)
R5: h(5)=0, g(5)=1    (1, 2)     (0, 0)

Final signatures: C1 → (1, 2), C2 → (0, 0).
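The final slot values in the trace above can be checked with a short self-contained script (using the slide's h and g, and treating R5 as x = 5 in the formulas):

```python
INF = float("inf")

rows = [[1, 0], [0, 1], [1, 1], [1, 0], [0, 1]]   # R1..R5, columns C1, C2
h = lambda x: x % 5
g = lambda x: (2 * x + 1) % 5

slots = {("C1", "h"): INF, ("C1", "g"): INF,
         ("C2", "h"): INF, ("C2", "g"): INF}
for j, row in enumerate(rows, start=1):            # one pass over the rows
    for col, bit in zip(("C1", "C2"), row):
        if bit:                                    # row R_j has a 1 in col
            slots[(col, "h")] = min(slots[(col, "h")], h(j))
            slots[(col, "g")] = min(slots[(col, "g")], g(j))
```

The loop ends with C1 → (h=1, g=2) and C2 → (h=0, g=0), matching the slide's final signatures.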