Why duplicate detection?

Size: px
Start display at page:

Download "Why duplicate detection?"

Transcription

1 Near-Duplicates Detection Naama Kraus Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze Some slides are courtesy of Kira Radinsky Why duplicate detection? About 30-40% of the pages on the Web are (near) duplicates of other pages. E.g., mirror sites Search engines try to avoid indexing duplicate pages Save storage Save processing time Avoid returning duplicate pages in search results Improved user s search experience The goal: detect duplicate pages 1

2 Exact-duplicates detection A naïve approach detect exact duplicates Map each page to some fingerprint, e.g. 64-bit If two web pages have an equal fingerprint check if content is equal Near-duplicates What about near-duplicates? Pages that are almost identical. Common on the Web. E.g., only date differs. Eliminating near duplicates is desired! The challenge How to efficiently detect near duplicates? Exhaustively comparing all pairs of web pages wouldn t scale. 2

3 Shingling K-shingles of a document d is defined to be the set of all consecutive sequences of k terms in d k is a positive integer E.g., 4-shingles of My name is Inigo Montoya. You killed my father. Prepare to die : { my name is inigo name is inigo montoya is inigo montoya you inigo montoya you killed montoya you killed my, you killed my father killed my father prepare my father prepare to father prepare to die } Computing Similarity Intuition: two documents are near-duplicates if their shingles sets are nearly the same. Measure similarity using Jaccard coefficient Degree of overlap between two sets Denote by S (d) the set of shingles of document d J(S(d1),S(d2)) = S (d1) S (d2) / S (d1) S (d2) If J exceeds a preset threshold (e.g. 0.9) declare d1,d2 near duplicates. Issue: computation is costly and done pairwise How can we compute Jaccard efficiently? 3

4 Hashing shingles Map each shingle into a hash value integer Over a large space, say 64 bits H(di) denotes the hash values set derived from S(di) Need to detect pairs whose sets H() have a large overlap How to do this efficiently? In next slides Permuting Let p be a random permutation over the hash values space. π: N N, x y: π x π y Let P(di) denote the set of permuted hash values in H(di) P d i = π H d i Let xi be the smallest integer in P(di) x i = min P d i 4

5 Illustration Document Start with 64-bit H(shingles) Permute on the number line with p 2 64 Pick the min value Key Theorem Theorem: J(S(di),S(dj)) = P(xi = xj) xi, xj of the same permutation Intuition: if shingle sets of two documents are nearly the same and we randomly permute, then there is a high probability that the minimal values are equal. 5

6 Proof (1) View sets S1,S2 as columns of a matrix A one row for each element in the universe. a ij = 1 indicates presence of item i in set j Example S 1 S Jaccard(S 1,S 2 ) = 2/5 = Proof (2) For columns S i, S j, four types of rows S i S j A 1 1 B 1 0 C 0 1 D 0 0 Let A = # of rows of type A Clearly, J(S1,S2) = A/(A+B+C) 6

7 Proof (3) Let p be a random permutation of the rows of A Denote by P(Sj) the column that results from applying p to the j-th column Let xi be the index of the first row in which the column P(Si) has a 1 P(S1) P(S2) Proof (4) We ve seen that J(S1,S2) = A/(A+B+C) Claim: P(xi=xj) = A/(A+B+C) Why? Look down columns Si, Sj until first non-type- D row I.e., look for xi or xj (the smallest or both if they are equal) P(xi) = P(xj) type A row As we picked a random permutation, the probability for a type A row is A/(A+B+C) P(xi=xj) = J(S1,S2) 7

8 Sketches Thus our Jaccard coefficient test is probabilistic Need to estimate P(xi=xj) Method: Pick k (~200) random row permutations P Sketch di = list of xi values for each permutation List is of length k Jaccard estimation: Fraction of permutations where sketch values agree Sketch di Sketch dj / k Example Sketches S 1 S 2 S 3 R R R R R S 1 S 2 S 3 Perm 1 = (12345) Perm 2 = (54321) Perm 3 = (34512) Similarities /3 2/3 0/3 8

9 Algorithm for Clustering Near-Duplicate Documents 1.Compute the sketch of each document 2.From each sketch, produce a list of <shingle, docid> pairs 3.Group all pairs by shingle value 4.For any shingle that is shared by more than one document, output a triplet <smaller-docid, larger-docid, 1>for each pair of docids sharing that shingle 5.Sort and aggregate the list of triplets, producing final triplets of the form <smaller-docid, larger-docid, # common shingles> 6.Join any pair of documents whose number of common shingles exceeds a chosen threshold using a Union-Find algorithm 7.Each resulting connected component of the UF algorithm is a cluster of near-duplicate documents Implementation nicely fits the map-reduce programming paradigm Implementation Trick Permuting universe even once is prohibitive Row Hashing Pick P hash functions h k Ordering under h k gives random permutation of rows One-pass Implementation For each C i and h k, keep slot for min-hash value Initialize all slot(c i,h k ) to infinity Scan rows in arbitrary order looking for 1 s Suppose row R j has 1 in column C i For each h k, if h k (j) < slot(c i,h k ), then slot(c i,h k ) h k (j) 9

10 Example C 1 C 2 R R R R R C 1 slots h(1) = g(1) = h(2) = g(2) = C 2 slots h(3) = g(3) = h(x) = x mod 5 g(x) = 2x+1 mod 5 (Denote R 5 as R 0 ) h(4) = g(4) = h(5) = g(5) =

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Finding similar items

Finding similar items Finding similar items CSE 344, section 10 June 2, 2011 In this section, we ll go through some examples of finding similar item sets. We ll directly compare all pairs of sets being considered using the

More information

Similarity Search. Stony Brook University CSE545, Fall 2016

Similarity Search. Stony Brook University CSE545, Fall 2016 Similarity Search Stony Brook University CSE545, Fall 20 Finding Similar Items Applications Document Similarity: Mirrored web-pages Plagiarism; Similar News Recommendations: Online purchases Movie ratings

More information

Analysis of Algorithms I: Perfect Hashing

Analysis of Algorithms I: Perfect Hashing Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the

More information

CSE 321 Discrete Structures

CSE 321 Discrete Structures CSE 321 Discrete Structures March 3 rd, 2010 Lecture 22 (Supplement): LSH 1 Jaccard Jaccard similarity: J(S, T) = S*T / S U T Problem: given large collection of sets S1, S2,, Sn, and given a threshold

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

Introduction to Information Retrieval

Introduction to Information Retrieval Introduction to Information Retrieval http://informationretrieval.org IIR 19: Size Estimation & Duplicate Detection Hinrich Schütze Institute for Natural Language Processing, Universität Stuttgart 2008.07.08

More information

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) = CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14 1 Probability First, recall a couple useful facts from last time about probability: Linearity of expectation: E(aX + by ) = ae(x)

More information

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara

Today s topics. Example continued. FAQs. Using n-grams. 2/15/2017 Week 5-B Sangmi Pallickara Spring 2017 W5.B.1 CS435 BIG DATA Today s topics PART 1. LARGE SCALE DATA ANALYSIS USING MAPREDUCE FAQs Minhash Minhash signature Calculating Minhash with MapReduce Locality Sensitive Hashing Sangmi Lee

More information

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya

CS60021: Scalable Data Mining. Similarity Search and Hashing. Sourangshu Bha>acharya CS62: Scalable Data Mining Similarity Search and Hashing Sourangshu Bha>acharya Finding Similar Items Distance Measures Goal: Find near-neighbors in high-dim. space We formally define near neighbors as

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages

More information

Algorithms for Data Science: Lecture on Finding Similar Items

Algorithms for Data Science: Lecture on Finding Similar Items Algorithms for Data Science: Lecture on Finding Similar Items Barna Saha 1 Finding Similar Items Finding similar items is a fundamental data mining task. We may want to find whether two documents are similar

More information

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here) 4/0/9 Tim Althoff, UW CS547: Machine Learning for Big Data, http://www.cs.washington.edu/cse547 Piazza Recitation session: Review of linear algebra Location: Thursday, April, from 3:30-5:20 pm in SIG 34

More information

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing) CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose

More information

Expectation of geometric distribution

Expectation of geometric distribution Expectation of geometric distribution What is the probability that X is finite? Can now compute E(X): Σ k=1f X (k) = Σ k=1(1 p) k 1 p = pσ j=0(1 p) j = p 1 1 (1 p) = 1 E(X) = Σ k=1k (1 p) k 1 p = p [ Σ

More information

Probabilistic Near-Duplicate. Detection Using Simhash

Probabilistic Near-Duplicate. Detection Using Simhash Probabilistic Near-Duplicate Detection Using Simhash Sadhan Sood, Dmitri Loguinov Presented by Matt Smith Internet Research Lab Department of Computer Science and Engineering Texas A&M University 27 October

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based

More information

Lecture: Analysis of Algorithms (CS )

Lecture: Analysis of Algorithms (CS ) Lecture: Analysis of Algorithms (CS483-001) Amarda Shehu Spring 2017 1 Outline of Today s Class 2 Choosing Hash Functions Universal Universality Theorem Constructing a Set of Universal Hash Functions Perfect

More information

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Algorithms lecture notes 1. Hashing, and Universal Hash functions Algorithms lecture notes 1 Hashing, and Universal Hash functions Algorithms lecture notes 2 Can we maintain a dictionary with O(1) per operation? Not in the deterministic sense. But in expectation, yes.

More information

Expectation of geometric distribution. Variance and Standard Deviation. Variance: Examples

Expectation of geometric distribution. Variance and Standard Deviation. Variance: Examples Expectation of geometric distribution Variance and Standard Deviation What is the probability that X is finite? Can now compute E(X): Σ k=f X (k) = Σ k=( p) k p = pσ j=0( p) j = p ( p) = E(X) = Σ k=k (

More information

Discrete Structures for Computer Science

Discrete Structures for Computer Science Discrete Structures for Computer Science William Garrison bill@cs.pitt.edu 6311 Sennott Square Lecture #24: Probability Theory Based on materials developed by Dr. Adam Lee Not all events are equally likely

More information

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction Tim Roughgarden & Gregory Valiant April 12, 2017 1 The Curse of Dimensionality in the Nearest Neighbor Problem Lectures #1 and

More information

compare to comparison and pointer based sorting, binary trees

compare to comparison and pointer based sorting, binary trees Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:

More information

HW2 Solutions Problem 1: 2.22 Find the sign and inverse of the permutation shown in the book (and below).

HW2 Solutions Problem 1: 2.22 Find the sign and inverse of the permutation shown in the book (and below). Teddy Einstein Math 430 HW Solutions Problem 1:. Find the sign and inverse of the permutation shown in the book (and below). Proof. Its disjoint cycle decomposition is: (19)(8)(37)(46) which immediately

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Database Systems CSE 514

Database Systems CSE 514 Database Systems CSE 514 Lecture 8: Data Cleaning and Sampling CSEP514 - Winter 2017 1 Announcements WQ7 was due last night (did you remember?) HW6 is due on Sunday Weston will go over it in the section

More information

2 = = 0 Thus, the number which is largest in magnitude is equal to the number which is smallest in magnitude.

2 = = 0 Thus, the number which is largest in magnitude is equal to the number which is smallest in magnitude. Limits at Infinity Two additional topics of interest with its are its as x ± and its where f(x) ±. Before we can properly discuss the notion of infinite its, we will need to begin with a discussion on

More information

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30 CSCB63 Winter 2019 Week10 - Lecture 2 - Hashing Anna Bretscher March 21, 2019 1 / 30 Today Hashing Open Addressing Hash functions Universal Hashing 2 / 30 Open Addressing Open Addressing. Each entry in

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

B490 Mining the Big Data

B490 Mining the Big Data B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations

More information

Hash tables. Hash tables

Hash tables. Hash tables Basic Probability Theory Two events A, B are independent if Conditional probability: Pr[A B] = Pr[A] Pr[B] Pr[A B] = Pr[A B] Pr[B] The expectation of a (discrete) random variable X is E[X ] = k k Pr[X

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 14 October 16, 2013 CPSC 467, Lecture 14 1/45 Message Digest / Cryptographic Hash Functions Hash Function Constructions Extending

More information

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record.

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record. CS 5633 -- Spring 25 Symbol-table problem Hashing Carola Wenk Slides courtesy of Charles Leiserson with small changes by Carola Wenk CS 5633 Analysis of Algorithms 1 Symbol table holding n records: record

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 16 October 30, 2017 CPSC 467, Lecture 16 1/52 Properties of Hash Functions Hash functions do not always look random Relations among

More information

1 Maintaining a Dictionary

1 Maintaining a Dictionary 15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

1 Finding Similar Items

1 Finding Similar Items 1 Finding Similar Items This chapter discusses the various measures of distance used to find out similarity between items in a given set. After introducing the basic similarity measures, we look at how

More information

COMPSCI 514: Algorithms for Data Science

COMPSCI 514: Algorithms for Data Science COMPSCI 514: Algorithms for Data Science Arya Mazumdar University of Massachusetts at Amherst Fall 2018 Lecture 9 Similarity Queries Few words about the exam The exam is Thursday (Oct 4) in two days In

More information

ORTHOGONAL ARRAYS OF STRENGTH 3 AND SMALL RUN SIZES

ORTHOGONAL ARRAYS OF STRENGTH 3 AND SMALL RUN SIZES ORTHOGONAL ARRAYS OF STRENGTH 3 AND SMALL RUN SIZES ANDRIES E. BROUWER, ARJEH M. COHEN, MAN V.M. NGUYEN Abstract. All mixed (or asymmetric) orthogonal arrays of strength 3 with run size at most 64 are

More information

Hash-based Indexing: Application, Impact, and Realization Alternatives

Hash-based Indexing: Application, Impact, and Realization Alternatives : Application, Impact, and Realization Alternatives Benno Stein and Martin Potthast Bauhaus University Weimar Web-Technology and Information Systems Text-based Information Retrieval (TIR) Motivation Consider

More information

Generating Subfields. Mark van Hoeij. June 15, 2017

Generating Subfields. Mark van Hoeij. June 15, 2017 June 15, 2017 Overview Papers: 1 (vh, Klüners, Novocin) ISSAC 2011. 2 The Complexity of Computing all Subfields of an Algebraic Number Field (Szutkoski, vh), Submitted to JSC. 3 Functional Decomposition

More information

Error Detection and Correction: Small Applications of Exclusive-Or

Error Detection and Correction: Small Applications of Exclusive-Or Error Detection and Correction: Small Applications of Exclusive-Or Greg Plaxton Theory in Programming Practice, Fall 2005 Department of Computer Science University of Texas at Austin Exclusive-Or (XOR,

More information

Advanced Counting Techniques. Chapter 8

Advanced Counting Techniques. Chapter 8 Advanced Counting Techniques Chapter 8 Chapter Summary Applications of Recurrence Relations Solving Linear Recurrence Relations Homogeneous Recurrence Relations Nonhomogeneous Recurrence Relations Divide-and-Conquer

More information

Hashing, Hash Functions. Lecture 7

Hashing, Hash Functions. Lecture 7 Hashing, Hash Functions Lecture 7 Symbol-table problem Symbol table T holding n records: x record key[x] Other fields containing satellite data Operations on T: INSERT(T, x) DELETE(T, x) SEARCH(T, k) How

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms CSE 101, Winter 2018 Design and Analysis of Algorithms Lecture 5: Divide and Conquer (Part 2) Class URL: http://vlsicad.ucsd.edu/courses/cse101-w18/ A Lower Bound on Convex Hull Lecture 4 Task: sort the

More information

Introduction to Hash Tables

Introduction to Hash Tables Introduction to Hash Tables Hash Functions A hash table represents a simple but efficient way of storing, finding, and removing elements. In general, a hash table is represented by an array of cells. In

More information

Discrete Probability

Discrete Probability Discrete Probability Counting Permutations Combinations r- Combinations r- Combinations with repetition Allowed Pascal s Formula Binomial Theorem Conditional Probability Baye s Formula Independent Events

More information

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22

Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 22 Random Variables and Expectation Question: The homeworks of 20 students are collected in, randomly shuffled and returned to the students.

More information

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park 1617 Preview Data Structure Review COSC COSC Data Structure Review Linked Lists Stacks Queues Linked Lists Singly Linked List Doubly Linked List Typical Functions s Hash Functions Collision Resolution

More information

Topic 4 Randomized algorithms

Topic 4 Randomized algorithms CSE 103: Probability and statistics Winter 010 Topic 4 Randomized algorithms 4.1 Finding percentiles 4.1.1 The mean as a summary statistic Suppose UCSD tracks this year s graduating class in computer science

More information

FRACTIONAL FACTORIAL DESIGNS OF STRENGTH 3 AND SMALL RUN SIZES

FRACTIONAL FACTORIAL DESIGNS OF STRENGTH 3 AND SMALL RUN SIZES FRACTIONAL FACTORIAL DESIGNS OF STRENGTH 3 AND SMALL RUN SIZES ANDRIES E. BROUWER, ARJEH M. COHEN, MAN V.M. NGUYEN Abstract. All mixed (or asymmetric) orthogonal arrays of strength 3 with run size at most

More information

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves

Theory of LSH. Distance Measures LS Families of Hash Functions S-Curves Theory of LSH Distance Measures LS Families of Hash Functions S-Curves 1 Distance Measures Generalized LSH is based on some kind of distance between points. Similar points are close. Two major classes

More information

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m)

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m) BALLS, BINS, AND RANDOM GRAPHS We use the Chernoff bound for the Poisson distribution (Theorem 5.4) to bound this probability, writing the bound as Pr(X 2: x) :s ex-ill-x In(x/m). For x = m + J2m In m,

More information

Does Unlabeled Data Help?

Does Unlabeled Data Help? Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline

More information

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille Hashing Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing Philip Bille Hashing Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

More information

CSE 312, 2017 Winter, W.L.Ruzzo. 5. independence [ ]

CSE 312, 2017 Winter, W.L.Ruzzo. 5. independence [ ] CSE 312, 2017 Winter, W.L.Ruzzo 5. independence [ ] independence Defn: Two events E and F are independent if P(EF) = P(E) P(F) If P(F)>0, this is equivalent to: P(E F) = P(E) (proof below) Otherwise, they

More information

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing Philip Bille Dictionaries Dictionary problem. Maintain a set S U = {,..., u-} supporting lookup(x): return true if x S and false otherwise. insert(x): set S = S {x} delete(x): set S = S - {x} Dictionaries

More information

4.4 Uniform Convergence of Sequences of Functions and the Derivative

4.4 Uniform Convergence of Sequences of Functions and the Derivative 4.4 Uniform Convergence of Sequences of Functions and the Derivative Say we have a sequence f n (x) of functions defined on some interval, [a, b]. Let s say they converge in some sense to a function f

More information

0-1 Knapsack Problem

0-1 Knapsack Problem KP-0 0-1 Knapsack Problem Define object o i with profit p i > 0 and weight w i > 0, for 1 i n. Given n objects and a knapsack capacity C > 0, the problem is to select a subset of objects with largest total

More information

Chapter 4: Interpolation and Approximation. October 28, 2005

Chapter 4: Interpolation and Approximation. October 28, 2005 Chapter 4: Interpolation and Approximation October 28, 2005 Outline 1 2.4 Linear Interpolation 2 4.1 Lagrange Interpolation 3 4.2 Newton Interpolation and Divided Differences 4 4.3 Interpolation Error

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 08 Boolean Matrix Factorization Rainer Gemulla, Pauli Miettinen June 13, 2013 Outline 1 Warm-Up 2 What is BMF 3 BMF vs. other three-letter abbreviations 4 Binary matrices, tiles,

More information

Tricks of the Trade in Combinatorics and Arithmetic

Tricks of the Trade in Combinatorics and Arithmetic Tricks of the Trade in Combinatorics and Arithmetic Zachary Friggstad Programming Club Meeting Fast Exponentiation Given integers a, b with b 0, compute a b exactly. Fast Exponentiation Given integers

More information

Randomized Algorithms, Spring 2014: Project 2

Randomized Algorithms, Spring 2014: Project 2 Randomized Algorithms, Spring 2014: Project 2 version 1 March 6, 2014 This project has both theoretical and practical aspects. The subproblems outlines a possible approach. If you follow the suggested

More information

Randomized Algorithms. Lecture 4. Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010

Randomized Algorithms. Lecture 4. Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010 Randomized Algorithms Lecture 4 Lecturer: Moni Naor Scribe by: Tamar Zondiner & Omer Tamuz Updated: November 25, 2010 1 Pairwise independent hash functions In the previous lecture we encountered two families

More information

Topics in Approximation Algorithms Solution for Homework 3

Topics in Approximation Algorithms Solution for Homework 3 Topics in Approximation Algorithms Solution for Homework 3 Problem 1 We show that any solution {U t } can be modified to satisfy U τ L τ as follows. Suppose U τ L τ, so there is a vertex v U τ but v L

More information

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 4. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 4 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Fall 203 Vazirani Note 2 Random Variables: Distribution and Expectation We will now return once again to the question of how many heads in a typical sequence

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store

More information

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74 V.4 MapReduce. System Architecture 2. Programming Model 3. Hadoop Based on MRS Chapter 4 and RU Chapter 2!74 Why MapReduce? Large clusters of commodity computers (as opposed to few supercomputers) Challenges:

More information

CSE101: Design and Analysis of Algorithms. Ragesh Jaiswal, CSE, UCSD

CSE101: Design and Analysis of Algorithms. Ragesh Jaiswal, CSE, UCSD Course Overview Material that will be covered in the course: Basic graph algorithms Algorithm Design Techniques Greedy Algorithms Divide and Conquer Dynamic Programming Network Flows Computational intractability

More information

10.3 Random variables and their expectation 301

10.3 Random variables and their expectation 301 10.3 Random variables and their expectation 301 9. For simplicity, assume that the probabilities of the birth of a boy and of a girl are the same (which is not quite so in reality). For a certain family,

More information

GALOIS THEORY : LECTURE 4 LEO GOLDMAKHER

GALOIS THEORY : LECTURE 4 LEO GOLDMAKHER GALOIS THEORY : LECTURE 4 LEO GOLDMAKHER 1. A BRIEF REVIEW OF S n The rest of the lecture relies on familiarity with the symmetric group of n elements, denoted S n (this is the group of all permutations

More information

Lecture 12: Interactive Proofs

Lecture 12: Interactive Proofs princeton university cos 522: computational complexity Lecture 12: Interactive Proofs Lecturer: Sanjeev Arora Scribe:Carl Kingsford Recall the certificate definition of NP. We can think of this characterization

More information

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis Motivation Introduction to Algorithms Hash Tables CSE 680 Prof. Roger Crawfis Arrays provide an indirect way to access a set. Many times we need an association between two sets, or a set of keys and associated

More information

CMPS 561 Boolean Retrieval. Ryan Benton Sept. 7, 2011

CMPS 561 Boolean Retrieval. Ryan Benton Sept. 7, 2011 CMPS 561 Boolean Retrieval Ryan Benton Sept. 7, 2011 Agenda Indices IR System Models Processing Boolean Query Algorithms for Intersection Indices Indices Question: How do we store documents and terms such

More information

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins 11 Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins Wendy OSBORN a, 1 and Saad ZAAMOUT a a Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge,

More information

Entropy Rate of Stochastic Processes

Entropy Rate of Stochastic Processes Entropy Rate of Stochastic Processes Timo Mulder tmamulder@gmail.com Jorn Peters jornpeters@gmail.com February 8, 205 The entropy rate of independent and identically distributed events can on average be

More information

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity Cell-obe oofs and Nondeterministic Cell-obe Complexity Yitong Yin Department of Computer Science, Yale University yitong.yin@yale.edu. Abstract. We study the nondeterministic cell-probe complexity of static

More information

Some notes on streaming algorithms continued

Some notes on streaming algorithms continued U.C. Berkeley CS170: Algorithms Handout LN-11-9 Christos Papadimitriou & Luca Trevisan November 9, 016 Some notes on streaming algorithms continued Today we complete our quick review of streaming algorithms.

More information

Ma/CS 6a Class 4: Primality Testing

Ma/CS 6a Class 4: Primality Testing Ma/CS 6a Class 4: Primality Testing By Adam Sheffer Reminder: Euler s Totient Function Euler s totient φ(n) is defined as follows: Given n N, then φ n = x 1 x < n and GCD x, n = 1. In more words: φ n is

More information

Lecture 7: Fingerprinting. David Woodruff Carnegie Mellon University

Lecture 7: Fingerprinting. David Woodruff Carnegie Mellon University Lecture 7: Fingerprinting David Woodruff Carnegie Mellon University How to Pick a Random Prime How to pick a random prime in the range {1, 2,, M}? How to pick a random integer X? Pick a uniformly random

More information

Coloring Vertices and Edges of a Path by Nonempty Subsets of a Set

Coloring Vertices and Edges of a Path by Nonempty Subsets of a Set Coloring Vertices and Edges of a Path by Nonempty Subsets of a Set P.N. Balister E. Győri R.H. Schelp April 28, 28 Abstract A graph G is strongly set colorable if V (G) E(G) can be assigned distinct nonempty

More information

Chapter 9: Relations Relations

Chapter 9: Relations Relations Chapter 9: Relations 9.1 - Relations Definition 1 (Relation). Let A and B be sets. A binary relation from A to B is a subset R A B, i.e., R is a set of ordered pairs where the first element from each pair

More information

CPSC 467: Cryptography and Computer Security

CPSC 467: Cryptography and Computer Security CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 15 October 20, 2014 CPSC 467, Lecture 15 1/37 Common Hash Functions SHA-2 MD5 Birthday Attack on Hash Functions Constructing New

More information

Recommendation Systems

Recommendation Systems Recommendation Systems Popularity Recommendation Systems Predicting user responses to options Offering news articles based on users interests Offering suggestions on what the user might like to buy/consume

More information

Advanced Counting Techniques

Advanced Counting Techniques . All rights reserved. Authorized only for instructor use in the classroom. No reproduction or further distribution permitted without the prior written consent of McGraw-Hill Education. Advanced Counting

More information

Hashing Data Structures. Ananda Gunawardena

Hashing Data Structures. Ananda Gunawardena Hashing 15-121 Data Structures Ananda Gunawardena Hashing Why do we need hashing? Many applications deal with lots of data Search engines and web pages There are myriad look ups. The look ups are time

More information

Part 1: Hashing and Its Many Applications

Part 1: Hashing and Its Many Applications 1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

k-protected VERTICES IN BINARY SEARCH TREES

k-protected VERTICES IN BINARY SEARCH TREES k-protected VERTICES IN BINARY SEARCH TREES MIKLÓS BÓNA Abstract. We show that for every k, the probability that a randomly selected vertex of a random binary search tree on n nodes is at distance k from

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

15-855: Intensive Intro to Complexity Theory Spring Lecture 16: Nisan s PRG for small space

15-855: Intensive Intro to Complexity Theory Spring Lecture 16: Nisan s PRG for small space 15-855: Intensive Intro to Complexity Theory Spring 2009 Lecture 16: Nisan s PRG for small space For the next few lectures we will study derandomization. In algorithms classes one often sees clever randomized

More information

CSCE 222 Discrete Structures for Computing

CSCE 222 Discrete Structures for Computing CSCE 222 Discrete Structures for Computing Sets and Functions Dr. Hyunyoung Lee Based on slides by Andreas Klappenecker 1 Sets Sets are the most fundamental discrete structure on which all other discrete

More information

Efficient Data Reduction and Summarization

Efficient Data Reduction and Summarization Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information

More information

Functions and cardinality (solutions) sections A and F TA: Clive Newstead 6 th May 2014

Functions and cardinality (solutions) sections A and F TA: Clive Newstead 6 th May 2014 Functions and cardinality (solutions) 21-127 sections A and F TA: Clive Newstead 6 th May 2014 What follows is a somewhat hastily written collection of solutions for my review sheet. I have omitted some

More information

MAT137 - Term 2, Week 2

MAT137 - Term 2, Week 2 MAT137 - Term 2, Week 2 This lecture will assume you have watched all of the videos on the definition of the integral (but will remind you about some things). Today we re talking about: More on the definition

More information