Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Similar documents
Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University

Bloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004.

Algorithms for Data Science

Introduction to Randomized Algorithms III

PAPER Adaptive Bloom Filter : A Space-Efficient Counting Algorithm for Unpredictable Network Traffic

Bloom Filter Redux. CS 270 Combinatorial Algorithms and Data Structures. UC Berkeley, Spring 2011

Bloom Filters, general theory and variants

arxiv: v1 [cs.ds] 3 Feb 2018

CSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Hash tables. Hash tables

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

Lecture 4 Thursday Sep 11, 2014

Hash tables. Hash tables

Part 1: Hashing and Its Many Applications

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Using the Power of Two Choices to Improve Bloom Filters

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

A Model for Learned Bloom Filters, and Optimizing by Sandwiching

Bloom Filters, Minhashes, and Other Random Stuff

Cuckoo Hashing and Cuckoo Filters

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

6.1 Occupancy Problem

Lecture 4 February 2nd, 2017

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Weighted Bloom Filter

Tight Bounds for Sliding Bloom Filters

Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

1 Maintaining a Dictionary

Multiple-Site Distributed Spatial Query Optimization using Spatial Semijoins

6 Filtering and Streaming

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

An Early Traffic Sampling Algorithm

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Cache-Oblivious Hashing

Lecture 2. Frequency problems

Explicit and Efficient Hash Functions Suffice for Cuckoo Hashing with a Stash

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Data Structures and Algorithm. Xiaoqing Zheng

CS369N: Beyond Worst-Case Analysis Lecture #6: Pseudorandom Data and Universal Hashing

1 Difference between grad and undergrad algorithms

CS483 Design and Analysis of Algorithms

The Bloom Paradox: When not to Use a Bloom Filter

Hashing. Dictionaries Hashing with chaining Hash functions Linear Probing

RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response

md5bloom: Forensic Filesystem Hashing Revisited

So far we have implemented the search for a key by carefully choosing split-elements.

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Foundations of Data Mining

Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions

B490 Mining the Big Data

Hashing Data Structures. Ananda Gunawardena

Analysis of Algorithms I: Perfect Hashing

Count-Min Tree Sketch: Approximate counting for NLP

Lecture 3 Sept. 4, 2014

Lecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox

Advanced Data Structures

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Information Retrieval

COS598D Lecture 3 Pseudorandom generators from one-way functions

Databases Exam HT2016 Solution

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:

Array-based Hashtables

Outline. Approximation: Theory and Algorithms. Application Scenario. 3 The q-gram Distance. Nikolaus Augsten. Definition and Properties

Database Systems SQL. A.R. Hurson 323 CS Building

SPACE-EFFICIENT DATA SKETCHING ALGORITHMS FOR NETWORK APPLICATIONS

Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing

Solutions. Prelim 2[Solutions]

Schema Refinement & Normalization Theory: Functional Dependencies INFS-614 INFS614, GMU 1

Inferential Time-Decaying Bloom Filters

Computer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage:

V.4 MapReduce. 1. System Architecture 2. Programming Model 3. Hadoop. Based on MRS Chapter 4 and RU Chapter 2 IR&DM 13/ 14 !74

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Dictionary: an abstract data type

Lecture 4: Hashing and Streaming Algorithms

CBFM cutted Bloom filter matrix for multi-dimensional membership query

Amortized Analysis (chap. 17)

Acceptably inaccurate. Probabilistic data structures

Approximate counting: count-min data structure. Problem definition

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,

Dictionary: an abstract data type

django in the real world

A Lecture on Hashing. Aram-Alexandre Pooladian, Alexander Iannantuono March 22, Hashing. Direct Addressing. Operations - Simple

Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004

Module 1: Analyzing the Efficiency of Algorithms

Algorithmic Methods of Data Mining, Fall 2005, Course overview 1. Course overview

Databases 2011 The Relational Algebra

Introduction to Data Management. Lecture #6 (Relational DB Design Theory)

Lecture Note 2. 1 Bonferroni Principle. 1.1 Idea. 1.2 Want. Material covered today is from Chapter 1 and chapter 4

1 Hashing. 1.1 Perfect Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m)

Ad Placement Strategies

Transcription:

Lecture 24: Bloom Filters Wednesday, June 2, 2010 1

Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2

Lecture on Bloom Filters Not described in the textbook! Lecture based in part on: Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4): 485 509 Bloom, Burton H. (1970), "Space/time trade-offs in hash coding with allowable errors", Communications of the ACM 13 (7): 422 42 3

Example (from Pig Latin lecture) Users(name, age) Pages(user, url) SELECT Pages.url, count(*) as cnt FROM Users, Pages WHERE Users.age in [18..25] GROUP BY Pages.url ORDER DESC cnt 4

Example Problem: many pages, but only a few visited by users with age 18..25 Pig s solution: MAP phase send all pages and all users to the reducers How can we reduce communication cost? 5

Hash Maps Let S = {x 1, x 2,..., x n } be a set of elements Let m > n Hash function h : S {1, 2,, m} S = {x 1, x 2,..., x n } H= 1 2 m 0 0 1 0 1 1 0 0 1 1 0 0 6

0 0 1 0 1 1 0 0 0 1 0 1 Hash Map = Dictionary The hash map acts like a dictionary Insert(x, H) = set bit h(x) to 1 Collisions are possible Member(y, H) = check if bit h(y) is 1 False positives are possible Delete(y, H) = not supported! Extensions possible, see later 7

0 0 1 0 1 1 0 0 0 1 0 1 Example (cont d) Map-Reduce task 1 Map task: compute a hash map H for users with age in [18..25]. Several Map tasks in parallel. Reduce task: combine all hash maps using OR. One single reducer suffices. Map-Reduce task 2 Why don t we lose any pages? Map tasks 1: map each user to the appropriate region Map tasks 2: map each page in H to appropriate region Reduce task: do the join 8

Analysis Let S = {x 1, x 2,..., x n } Let j = a specific bit in H (1 j m) What is the probability that j remains 0 after inserting all n elements from S into H? Will compute in two steps 9

0 0 0 0 1 0 0 0 0 0 0 0 Analysis Recall H = m Let s insert only x i into H What is the probability that bit j is 0? 10

0 0 0 0 1 0 0 0 0 0 0 0 Analysis Recall H = m Let s insert only x i into H What is the probability that bit j is 0? Answer: p = 1 1/m 11

0 0 1 0 1 1 0 0 0 1 0 1 Analysis Recall H = m, S = {x 1, x 2,..., x n } Let s insert all elements from S in H What is the probability that bit j remains 0? 12

0 0 1 0 1 1 0 0 0 1 0 1 Analysis Recall H = m, S = {x 1, x 2,..., x n } Let s insert all elements from S in H What is the probability that bit j remains 0? Answer: p = (1 1/m) n 13

0 0 1 0 1 1 0 0 0 1 0 1 Probability of False Positives Take a random element y, and check member(y,h) What is the probability that it returns true? 14

0 0 1 0 1 1 0 0 0 1 0 1 Probability of False Positives Take a random element y, and check member(y,h) What is the probability that it returns true? Answer: it is the probability that bit h(y) is 1, which is f = 1 (1 1/m) n 1 e -n/m 15

Bloom Filters Introduced by Burton Bloom in 1970 Improve the false positive ratio Idea: use k independent hash functions 16

Bloom Filter = Dictionary Insert(x, H) = set bits h 1 (x),..., h k (x) to 1 Collisions are possible Member(y, H) = check if bits h 1 (y),..., h k (y) are 1 False positives are possible Delete(z, H) = not supported! Extensions possible, see later 17

Example Bloom Filter k=3 Insert(x,H) Member(y,H) y 1 = is not in H (why?); y 2 may be in H (why?) 18

Bloom Filter Principle Wherever a list or set is used, and space is at a premium, consider using a Bloom filter if the effect of false positives can be mitigated 19

Choosing k Two competing forces: If k = large Test more bits for member(y,h) lower false positive rate More bits in H are 1 higher false positive rate If k = small More bits in H are 0 lower positive rate Test fewer bits for member(y,h) higher rate 20

Analysis Recall H = m, #hash functions = k Let s insert only x i into H What is the probability that bit j is 0? 21

Analysis Recall H = m, #hash functions = k Let s insert only x i into H What is the probability that bit j is 0? Answer: p = (1 1/m) k 22

Analysis Recall H = m, S = {x 1, x 2,..., x n } Let s insert all elements from S in H What is the probability that bit j remains 0? 23

Analysis Recall H = m, S = {x 1, x 2,..., x n } Let s insert all elements from S in H What is the probability that bit j remains 0? Answer: p = (1 1/m) kn e -kn/m 24

Probability of False Positives Take a random element y, and check member(y,h) What is the probability that it returns true? 25

Probability of False Positives Take a random element y, and check member(y,h) What is the probability that it returns true? Answer: it is the probability that all k bits h 1 (y),, h k (y) are 1, which is: f = (1-p) k (1 e -kn/m ) k 26

Optimizing k For fixed m, n, choose k to minimize the false positive rate f Denote g = ln(f) = k ln(1 e -kn/m ) Goal: find k to minimize g k = ln 2 m /n 27

Bloom Filter Summary Given n = S, m = H, choose k = ln 2 m /n hash functions Probability that some bit j is 1 p e -kn/m = ½ Expected distribution m/2 bits 1, m/2 bits 0 Probability of false positive f = (1-p) k (½) k =(½) (ln 2)m/n (0.6185) m/n 28

Bloom Filter Summary In practice one sets m = cn, for some constant c Thus, we use c bits for each element in S Then f (0.6185) c = constant Example: m = 8n, then k = 8(ln 2) = 5.545 (use 6 hash functions) f (0.6185) m/n = (0.6185) 8 0.02 (2% false positives) Compare to a hash table: f 1 e -n/m = 1-e -1/8 0.11 29

Set Operations Intersection and Union of Sets: Set S Bloom filter H Set S Bloom filter H How do we computed the Bloom filter for the intersection of S and S? 30

Set Operations Intersection and Union: Set S Bloom filter H Set S Bloom filter H How do we computed the Bloom filter for the intersection of S and S? Answer: bit-wise AND: H H 31

Counting Bloom Filter Goal: support delete(z, H) Keep a counter for each bit j Insertion increment counter Deletion decrement counter Overflow keep bit 1 forever Using 4 bits per counter: Probability of overflow 1.37 10-15 m 32

Application: Dictionaries Bloom originally introduced this for hyphenation 90% of English words can be hyphenated using simple rules 10% require table lookup Use bloom filter to check if lookup needed 33

Application: Distributed Caching Web proxies maintain a cache of (URL, page) pairs If a URL is not present in the cache, they would like to check the cache of other proxies in the network Transferring all URLs is expensive! Instead: compute Bloom filter, exchange periodically 34