CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

Similar documents
Distinct-Value Synopses for Multiset Operations

Efficient Data Reduction and Summarization

Lecture 3 Sept. 4, 2014

Algorithms for Data Science: Lecture on Finding Similar Items

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

7 Distances. 7.1 Metrics. 7.2 Distances L p Distances

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

CS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014

Introduction to Randomized Algorithms III

Probabilistic Hashing for Efficient Search and Learning

1 Approximate Quantiles and Summaries

6 Distances. 6.1 Metrics. 6.2 Distances L p Distances

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

COMPSCI 514: Algorithms for Data Science

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Data Streams & Communication Complexity

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Multimedia Databases 1/29/ Indexes for Multimedia Data Indexes for Multimedia Data Indexes for Multimedia Data

Bloom Filter Redux. CS 270 Combinatorial Algorithms and Data Structures. UC Berkeley, Spring 2011

CS264: Beyond Worst-Case Analysis Lecture #18: Smoothed Complexity and Pseudopolynomial-Time Algorithms

CS264: Beyond Worst-Case Analysis Lecture #15: Smoothed Complexity and Pseudopolynomial-Time Algorithms

CS 347 Parallel and Distributed Data Processing

Lecture 4 Thursday Sep 11, 2014

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen

Minwise hashing for large-scale regression and classification with sparse data

Machine Learning Theory Lecture 4

Data Dependencies in the Presence of Difference

Lecture: Analysis of Algorithms (CS )

Succinct Approximate Counting of Skewed Data

B490 Mining the Big Data

Lecture 4: Hashing and Streaming Algorithms

Approximate counting: count-min data structure. Problem definition

14.1 Finding frequent elements in stream

Lecture 3 Frequency Moments, Heavy Hitters

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Lecture 2. Frequency problems

A Tight Lower Bound for Dynamic Membership in the External Memory Model

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Locally Differentially Private Protocols for Frequency Estimation. Tianhao Wang, Jeremiah Blocki, Ninghui Li, Somesh Jha

Bloom Filters, Minhashes, and Other Random Stuff

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

CS246 Final Exam, Winter 2011

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.

CS168: The Modern Algorithmic Toolbox Lecture #4: Dimensionality Reduction

Databases 2011 The Relational Algebra

CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University

CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Analysis of Algorithms I: Perfect Hashing

P Q1 Q2 Q3 Q4 Q5 Tot (60) (20) (20) (20) (60) (20) (200) You are allotted a maximum of 4 hours to complete this exam.

Lecture 1 : Probabilistic Method

One Permutation Hashing for Efficient Search and Learning

Notes for Lecture 14 v0.9

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Filters. Alden Walker Advisor: Miller Maley. May 23, 2007

7 Algorithms for Massive Data Problems

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

1 Maintaining a Dictionary

Bloom Filters and Locality-Sensitive Hashing

Lower Bounds for Dynamic Connectivity (2004; Pǎtraşcu, Demaine)

Probabilistic Near-Duplicate. Detection Using Simhash

Lecture 3: Policy Evaluation Without Knowing How the World Works / Model Free Policy Evaluation

Mixture distributions in Exams MLC/3L and C/4

X = X X n, + X 2

Algorithms for Nearest Neighbors

New cardinality estimation algorithms for HyperLogLog sketches

A Framework for Estimating Stream Expression Cardinalities

Ad Placement Strategies

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

Piazza Recitation session: Review of linear algebra Location: Thursday, April 11, from 3:30-5:20 pm in SIG 134 (here)

Accelerating Stochastic Optimization

Nearest Neighbor Classification Using Bottom-k Sketches

1 Estimating Frequency Moments in Streams

A Simple and Efficient Estimation Method for Stream Expression Cardinalities

Dictionary: an abstract data type

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES

Lecture 21: October 19

AMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM

Proof In the CR proof. and

Schema Refinement and Normal Forms Chapter 19

Why duplicate detection?

Algorithms for Querying Noisy Distributed/Streaming Datasets

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

2 How many distinct elements are in a stream?

Sketching in Adversarial Environments

6.842 Randomness and Computation Lecture 5

PCP Theorem and Hardness of Approximation

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

1 AC 0 and Håstad Switching Lemma

Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent

Calculus (Math 1A) Lecture 5

Optimization and Optimal Control in Banach Spaces

Introduction to Proofs in Analysis. updated December 5, By Edoh Y. Amiran Following the outline of notes by Donald Chalice INTRODUCTION

Parallel Numerical Algorithms

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Lecture Lecture 25 November 25, 2014

Transcription:

CS6931 Database Seminar Lecture 6: Set Operations on Massive Data

Set Resemblance and MinWise Hashing/Independent Permutation

Basics. Consider two sets S1, S2 ⊆ U = {0, 1, 2, ..., D−1} (e.g., D = 2^64). Let f1 = |S1|, f2 = |S2|, and a = |S1 ∩ S2|. The resemblance R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a) is a popular measure of set similarity.

Basics and Applications. A set S ⊆ U = {0, 1, ..., D−1} can be viewed as a 0/1 vector in D dimensions. Application, shingling: each document (Web page) can be viewed as a set of w-shingles. For example, after parsing, the sentence "welcome to school of computing" becomes, for w = 1: {welcome, to, school, of, computing}; for w = 2: {welcome to, to school, school of, of computing}; for w = 3: {welcome to school, to school of, school of computing}. Shingling generates extremely high-dimensional vectors, e.g., D = (10^5)^w.
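
As a concrete illustration, here is a minimal w-shingling sketch in Python; the helper name shingles and the whitespace tokenization are my own assumptions, not code from the lecture.

```python
def shingles(text, w):
    """Return the set of w-shingles (w consecutive tokens) of a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

sentence = "welcome to school of computing"
print(shingles(sentence, 1))  # the 5 one-word shingles
print(shingles(sentence, 2))  # the 4 two-word shingles
print(shingles(sentence, 3))  # the 3 three-word shingles
```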

Near-Duplicate Detection. Each document (Web page) is parsed into a set of w-shingles. Suppose there are roughly 10^10 English Web pages, i.e., N = 10^10 sets. Q: Can we find duplicate documents (e.g., plagiarized papers) both fast and with affordable memory? A: If resemblance is the similarity measure, then minwise hashing may provide a highly practical algorithm for duplicate detection.

MinWise Hashing/Independent Permutation. Suppose a random permutation π is performed on U, i.e., π: U → U, where U = {0, 1, ..., D−1}. A simple argument shows that, for S1 ⊆ U and S2 ⊆ U, Pr[min(π(S1)) = min(π(S2))] = |S1 ∩ S2| / |S1 ∪ S2| = R. Why?
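
The identity can be checked empirically; the following Monte Carlo sketch (my own illustration, with an arbitrarily chosen small universe and sets) compares the collision rate of the two minima against the true resemblance.

```python
import random

D = 500
U = list(range(D))
S1 = set(range(0, 150))
S2 = set(range(50, 200))              # |S1 n S2| = 100, |S1 u S2| = 200
R = len(S1 & S2) / len(S1 | S2)       # true resemblance = 0.5

trials, hits = 5000, 0
for _ in range(trials):
    pi = U[:]
    random.shuffle(pi)                # pi[x] is the image of element x under the permutation
    if min(pi[x] for x in S1) == min(pi[x] for x in S2):
        hits += 1
print(R, hits / trials)               # the empirical collision rate should be close to R
```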

To Reduce Variance. After k independent permutations π_1, π_2, ..., π_k, one can estimate R without bias, as a binomial: R̂ = (1/k) Σ_{j=1}^{k} 1{min(π_j(S1)) = min(π_j(S2))}. Why? Var(R̂) = R(1 − R)/k. Why? Because a = (R/(1 + R)) (f1 + f2), we can estimate a from the estimated R̂.
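
A short derivation answering both "why?" prompts (my own working, following directly from the definitions above):

```latex
% Each indicator X_j = 1{min(pi_j(S1)) = min(pi_j(S2))} is Bernoulli(R), independent across j.
\[
  \hat{R} = \frac{1}{k}\sum_{j=1}^{k} X_j, \qquad
  \mathbb{E}[\hat{R}] = \frac{1}{k}\sum_{j=1}^{k} \mathbb{E}[X_j] = R, \qquad
  \operatorname{Var}(\hat{R}) = \frac{1}{k^2}\sum_{j=1}^{k} \operatorname{Var}(X_j) = \frac{R(1-R)}{k}.
\]
% Solving R = a/(f_1 + f_2 - a) for a gives:
\[
  a = \frac{R}{1+R}\,(f_1 + f_2), \qquad
  \hat{a} = \frac{\hat{R}}{1+\hat{R}}\,(f_1 + f_2).
\]
```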

How to implement minwise independent permutations and other set operations. There is a STOC '98 paper on min-wise independent permutations. Or, in practice, you can simulate a minwise independent permutation using a k-wise independent hash function: this usually works pretty well. MinWise synopsis: S(A) = {min(π_1(A)), min(π_2(A)), ..., min(π_k(A))}. How to compute S(A ∪ B)? Given S(A), S(B), |A|, and |B|, how to estimate |A ∪ B| and |A ∩ B|?
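
One possible reading of these questions in code, using salted hash functions in place of true minwise independent permutations (as the slide notes is common in practice); all helper names and the blake2b choice are my own assumptions, not the lecture's implementation.

```python
import hashlib

def h(x, j):
    """The j-th simulated 'permutation': a salted 64-bit hash of element x."""
    return int(hashlib.blake2b(f"{j}:{x}".encode(), digest_size=8).hexdigest(), 16)

def synopsis(A, k):
    """S(A) = (min pi_1(A), ..., min pi_k(A))."""
    return [min(h(x, j) for x in A) for j in range(k)]

def union_synopsis(SA, SB):
    """S(A u B): the minimum over A u B is the smaller of the two minima, coordinate-wise."""
    return [min(a, b) for a, b in zip(SA, SB)]

def estimate(SA, SB, fA, fB):
    """Estimate |A n B| and |A u B| from the synopses and the set sizes."""
    R_hat = sum(a == b for a, b in zip(SA, SB)) / len(SA)   # fraction of matching minima
    inter = R_hat / (1 + R_hat) * (fA + fB)                 # a = R/(1+R) (f1 + f2)
    return inter, fA + fB - inter

A, B = set(range(0, 2000)), set(range(1000, 3000))          # |A n B| = 1000, |A u B| = 3000
SA, SB = synopsis(A, 100), synopsis(B, 100)
assert union_synopsis(SA, SB) == synopsis(A | B, 100)       # S(A u B) from S(A), S(B) alone
print(estimate(SA, SB, len(A), len(B)))                     # roughly (1000, 3000)
```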

Storage Problem of Minwise Hashing. Each hashed value, e.g., min(π(S1)), is usually stored using 64 bits. For typical applications, k = 50 or 100 hashed values are required. The total storage can be prohibitive: N × 100 × 64 bits = 8 TB if N = 10^10! Storage problems also cause computational problems, of course.

b-bit MinWise Hashing. Basic idea: only store the lowest b bits of each hashed value, for small b. Intuition for why this works: when two sets are identical, the lowest b bits of their hashed values are of course also equal; when two sets are similar, the lowest b bits of their hashed values should also be similar (true?). Therefore, hopefully, we do not need many bits to obtain useful information, especially since real applications often care about pairs with reasonably large resemblance (e.g., R ≥ 0.5). For more details on why this works, refer to the resource on the web.
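
A rough numerical sketch of the intuition (my own, not the lecture's analysis): keep only the lowest b bits of each minhash and correct for the chance, about 1/2^b when the sets are small relative to the universe, that two different minima still agree on those bits.

```python
import random

def minhash(S, seed):
    """Simulate min(pi(S)) with a salted hash in place of a true random permutation."""
    return min(hash((seed, x)) for x in S)

random.seed(1)
pool = random.sample(range(2**30), 3000)
S1, S2 = set(pool[:2000]), set(pool[1000:])        # |S1 n S2| = 1000, |S1 u S2| = 3000
R = len(S1 & S2) / len(S1 | S2)                    # true resemblance = 1/3

k, b = 256, 2
mask = (1 << b) - 1                                # keep only the lowest b bits
matches = sum((minhash(S1, j) & mask) == (minhash(S2, j) & mask) for j in range(k))
p_hat = matches / k                                # inflated match rate, roughly R + (1 - R)/2**b
c = 1 / 2**b                                       # approximate accidental-collision rate
print(R, (p_hat - c) / (1 - c))                    # corrected estimate should be close to R
```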

General Set Operations and Distributed Sets. Kevin Beyer, Peter J. Haas, Berthold Reinwald, Yannis Sismanis, Rainer Gemulla.

SIGMOD '07. Introduction. Estimating the number of distinct values (DV) is crucial for: data integration & cleaning (e.g., schema discovery, duplicate detection), query optimization, network monitoring, and materialized-view selection for datacubes. Exact DV computation (sort/scan/count or a hash table) is impractical; the problem is bad scalability. Approximate DV synopses: a 25-year-old literature of hashing-based techniques.

SIGMOD '07. Motivation: A Synopsis Warehouse. [Figure: a full-scale warehouse of data partitions feeds a warehouse of synopses S_{1,1}, S_{1,2}, ..., S_{n,m}, which can be combined into synopses such as S_{*,*} or S_{1-2,3-7}, etc.] Goal: discover partition characteristics and relationships to other partitions: keys, functional dependencies, similarity metrics (Jaccard). Similar to Bellman [DJMS02]. Accuracy challenge: small synopsis sizes, many distinct values.

SIGMOD '07. Outline: background on the KMV synopsis; an unbiased low-variance DV estimator (optimality, asymptotic error analysis for synopsis sizing); compound partitions (union, intersection, set difference); multiset difference: AKMV synopses; deletions; empirical evaluation.

SIGMOD '07. K-Min Value (KMV) Synopsis. [Figure: the values of a partition (a, b, a, a, e, ..., with D distinct values in all) are hashed onto [0, 1]; the k smallest hash values U_(1) < U_(2) < ... < U_(k) are retained.] Hashing = dropping DVs uniformly on [0, 1]. KMV synopsis: L = {U_(1), U_(2), ..., U_(k)}. Since E[U_(k)] ≈ k/D, i.e., D ≈ k/U_(k), this leads naturally to the basic estimator [BJK+02]: D̂_BE = k/U_(k). All classic estimators approximate the basic estimator. Expected construction cost: O(N + k log log D). Space: O(k log D).
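
A minimal KMV sketch in Python, assuming a SHA-1-based hash mapped to [0, 1) and the helper names kmv_synopsis and basic_estimate; this is my own illustration of the construction, not code from the paper.

```python
import hashlib

def h(value):
    """Hash a value to a pseudo-uniform point in [0, 1)."""
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) / 16**40

def kmv_synopsis(values, k):
    """Keep the k smallest distinct hash values seen in the partition."""
    return sorted({h(v) for v in values})[:k]

def basic_estimate(L, k):
    """Basic estimator D_BE = k / U_(k) from [BJK+02]."""
    return k / L[k - 1]

partition = [i % 5000 for i in range(20_000)]   # 20,000 rows, 5,000 distinct values
L = kmv_synopsis(partition, k=256)
print(basic_estimate(L, k=256))                 # should land near 5000
```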

Intuition: look at spacings. [Figure: example with k = 4 and D = 7, showing the spacings V_1, V_2, V_3, V_4 between consecutive hashed values on [0, 1].] E[V] ≈ 1/D, so that D ≈ 1/E[V]. Estimate D as 1/Avg(V_1, ..., V_k), i.e., as k/Sum(V_1, ..., V_k), i.e., as k/U_(k). This is biased upward (Jensen's inequality: if X is a random variable and φ is a convex function, then φ(E[X]) ≤ E[φ(X)], because the secant line of a convex function lies above its graph), so change k to k − 1.

SIGMOD '07. New Synopses & Estimators. Better estimators for classic KMV synopses: better accuracy (unbiased, low mean-square error), exact error bounds (in the paper), and asymptotic error bounds for sizing the synopses. Augmented KMV synopsis (AKMV): permits DV estimates for compound partitions and can handle deletions and incremental updates. [Figure: synopsis S_A of partition A and synopsis S_B of partition B are combined into a synopsis S_{A op B} of the compound partition A op B.]

SIGMOD '07. Unbiased DV Estimator from the KMV Synopsis. Unbiased estimator [Cohen97]: D̂_UB = (k − 1)/U_(k). Exact error analysis based on the theory of order statistics; asymptotically optimal as k becomes large (MLE theory). Analysis with many DVs. Theorem: for large D, Var(D̂_UB) ≈ D²/(k − 2). Proof sketch: show that the spacings U_(i) − U_(i−1) are approximately exponential for large D, then apply [Cohen97]. Use the above formula to size synopses a priori.
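
The bias removal can be seen in a quick simulation (my own sketch, not from the paper): hashing D distinct values amounts to dropping D uniform points on [0, 1], so we can draw U_(k) directly and compare the basic estimator k/U_(k) with the unbiased (k − 1)/U_(k).

```python
import heapq
import random
import statistics

D, k, trials = 5000, 16, 1000
basic, unbiased = [], []
for _ in range(trials):
    # D distinct values hashed uniformly onto [0, 1]; keep the k-th smallest point.
    u_k = heapq.nsmallest(k, (random.random() for _ in range(D)))[-1]
    basic.append(k / u_k)
    unbiased.append((k - 1) / u_k)
print(statistics.mean(basic))      # noticeably above D = 5000 (upward bias)
print(statistics.mean(unbiased))   # close to D = 5000
```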

SIGMOD '07. Outline: background on the KMV synopsis; an unbiased low-variance DV estimator (optimality, asymptotic error analysis for synopsis sizing); compound partitions (union, intersection, set difference); multiset difference: AKMV synopses; deletions; empirical evaluation.

SIGMOD '07. (Multiset) Union of Partitions. [Figure: the k-min synopses L_A and L_B, each a set of points on [0, 1], are merged into a combined k-min synopsis L whose largest retained value is U_(k).] Combine KMV synopses: L = L_A ⊕ L_B, the k smallest values in L_A ∪ L_B. Theorem: L is a KMV synopsis of the multiset union A ∪ B. Can use the previous unbiased estimator: D̂_UB = (k − 1)/U_(k).
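
A small sketch of the combine step, reusing the same hypothetical SHA-1-based hash as in the KMV example above (again my own helper names, not the paper's code):

```python
import hashlib

def h(value):
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) / 16**40

def kmv(values, k):
    return sorted({h(v) for v in values})[:k]

def combine(LA, LB, k):
    """L = L_A (+) L_B: the k smallest hash values appearing in either synopsis."""
    return sorted(set(LA) | set(LB))[:k]

k = 256
A, B = range(0, 6000), range(4000, 9000)     # |D(A u B)| = 9000
L = combine(kmv(A, k), kmv(B, k), k)
print((k - 1) / L[k - 1])                    # unbiased DV estimate for A u B, near 9000
```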

SIGMOD '07. (Multiset) Intersection of Partitions. L = L_A ⊕ L_B as with union (contains k elements). Note: L corresponds to a uniform random sample of the DVs in A ∪ B. Let K = # values in L that are also in D(A ∩ B). Theorem: K can be computed from L_A and L_B alone. K/k estimates the Jaccard ratio D(A ∩ B)/D(A ∪ B), and D̂ = (k − 1)/U_(k) estimates D(A ∪ B). Unbiased estimator of the number of DVs in the intersection: D̂_∩ = (K/k) · (k − 1)/U_(k). See the paper for the variance of this estimator. The approach extends to general compound partitions built from ordinary set operations.
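
Continuing the same hypothetical setup, a sketch of the intersection estimate; the key observation in code is that a hash value in the combined synopsis belongs to D(A ∩ B) exactly when it appears in both L_A and L_B:

```python
import hashlib

def h(value):
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) / 16**40

def kmv(values, k):
    return sorted({h(v) for v in values})[:k]

k = 256
A, B = range(0, 6000), range(4000, 9000)     # |D(A n B)| = 2000, |D(A u B)| = 9000
LA, LB = kmv(A, k), kmv(B, k)
L = sorted(set(LA) | set(LB))[:k]            # combined KMV synopsis of A u B
sLA, sLB = set(LA), set(LB)
K = sum(1 for v in L if v in sLA and v in sLB)
print(K / k)                                 # estimates the Jaccard ratio, about 2000/9000
print(K / k * (k - 1) / L[k - 1])            # unbiased estimate of |D(A n B)|, near 2000
```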

SIGMOD '07. Multiset Differences: AKMV Synopsis. Augment the KMV synopsis with multiplicity counters: L⁺ = (L, c). Space: O(k log D + k log M), where M = max multiplicity. Proceed almost exactly as before, i.e., L⁺(E \ F) = (L_E ⊕ L_F, (c_E − c_F)⁺). Unbiased DV estimator: D̂ = (K₊/k) · (k − 1)/U_(k), where K₊ is the number of positive counters. Closure property: for G = E op F, the AKMV synopses L⁺_E = (L_E, c_E) and L⁺_F = (L_F, c_F) combine into L⁺_G = (L_E ⊕ L_F, h_op(c_E, c_F)). Can also handle deletions.
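
A last sketch, under the same hashing assumption, of an AKMV synopsis with multiplicity counters and the multiset-difference estimate; the helper names akmv and difference are my own, not the paper's:

```python
import hashlib
from collections import Counter

def h(value):
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16) / 16**40

def akmv(multiset, k):
    """AKMV synopsis L+ = (L, c): the k smallest hash values with their multiplicities."""
    counts = Counter(multiset)
    return dict(sorted((h(v), c) for v, c in counts.items())[:k])

def difference(LE, LF, k):
    """AKMV synopsis of the multiset difference E \\ F: counters (c_E - c_F)+."""
    keys = sorted(set(LE) | set(LF))[:k]
    return {u: max(LE.get(u, 0) - LF.get(u, 0), 0) for u in keys}

k = 256
E = list(range(0, 6000))                      # each value once
F = list(range(4000, 9000)) * 2               # each value twice, so c_E - c_F <= 0 there
LG = difference(akmv(E, k), akmv(F, k), k)
K_plus = sum(1 for c in LG.values() if c > 0) # number of positive counters
u_k = max(LG)                                 # k-th smallest hash value in the combined synopsis
print(K_plus / k * (k - 1) / u_k)             # DV estimate for E \ F, near 4000
```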