Part 1: Hashing and Its Many Applications

Similar documents
12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Introduction to Randomized Algorithms III

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

arxiv: v1 [cs.ds] 3 Feb 2018

Lecture 24: Bloom Filters. Wednesday, June 2, 2010


15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Bloom Filters, general theory and variants

Algorithms for Data Science

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Application: Bucket Sort

Approximate counting: count-min data structure. Problem definition

CS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014

CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University

Balls & Bins. Balls into Bins. Revisit Birthday Paradox. Load. SCONE Lab. Put m balls into n bins uniformly at random

Lecture 04: Balls and Bins: Birthday Paradox. Birthday Paradox

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

The Bloom Paradox: When not to Use a Bloom Filter

The first bound is the strongest, the other two bounds are often easier to state and compute. Proof: Applying Markov's inequality, for any >0 we have

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,

A Model for Learned Bloom Filters, and Optimizing by Sandwiching

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

In a five-minute period, you get a certain number m of requests. Each needs to be served from one of your n servers.

Part 2: Random Routing and Load Balancing

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Randomized Load Balancing:The Power of 2 Choices

CS246 Final Exam. March 16, :30AM - 11:30AM

Ad Placement Strategies

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

6 Filtering and Streaming

Introduction to Hash Tables

Cuckoo Hashing and Cuckoo Filters

Maintaining Significant Stream Statistics over Sliding Windows

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Using the Power of Two Choices to Improve Bloom Filters

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

Randomized algorithm

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Acceptably inaccurate. Probabilistic data structures

Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, Morten Houmøller Nygaard,

B490 Mining the Big Data

Hash tables. Hash tables

Hash tables. Hash tables

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

Tight Bounds for Sliding Bloom Filters

Cache-Oblivious Hashing

Course overview. Heikki Mannila Laboratory for Computer and Information Science Helsinki University of Technology

Classic Network Measurements meets Software Defined Networking

RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response

Algorithmic methods of data mining, Autumn 2007, Course overview0-0. Course overview

Weighted Bloom Filter

Bloom Filters, Minhashes, and Other Random Stuff

Data-Intensive Distributed Computing

Succinct Approximate Counting of Skewed Data

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

Data Mining Recitation Notes Week 3

Bayesian Classifiers and Probability Estimation. Vassilis Athitsos CSE 4308/5360: Artificial Intelligence I University of Texas at Arlington

B669 Sublinear Algorithms for Big Data

Lecture 4. P r[x > ce[x]] 1/c. = ap r[x = a] + a>ce[x] P r[x = a]

Tail Inequalities. The Chernoff bound works for random variables that are a sum of indicator variables with the same distribution (Bernoulli trials).

So far we discussed random number generators that need to have the maximum length period.

14.1 Finding frequent elements in stream

6.854 Advanced Algorithms

Lecture 4: Hashing and Streaming Algorithms

CSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13

Bloom Filter Redux. CS 270 Combinatorial Algorithms and Data Structures. UC Berkeley, Spring 2011

compare to comparison and pointer based sorting, binary trees

Introduction to Cryptography Lecture 4

A Reservoir Sampling Algorithm with Adaptive Estimation of Conditional Expectation

Hashing. Martin Babka. January 12, 2011

Pr[A B] > Pr[A]Pr[B]. Pr[A B C] = Pr[(A B) C] = Pr[A]Pr[B A]Pr[C A B].

Distance-Sensitive Bloom Filters

Block Heavy Hitters Alexandr Andoni, Khanh Do Ba, and Piotr Indyk

CS 580: Algorithm Design and Analysis

Data Stream Methods. Graham Cormode S. Muthukrishnan

Randomized Algorithms

Some notes on streaming algorithms continued

An Early Traffic Sampling Algorithm

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

Homework 10 Solution

Count-Min Tree Sketch: Approximate counting for NLP

13. RANDOMIZED ALGORITHMS

10-701/15-781, Machine Learning: Homework 4

2 How many distinct elements are in a stream?

MAT 585: Johnson-Lindenstrauss, Group testing, and Compressed Sensing

Problem 1: Compactness (12 points, 2 points each)

Solutions to Problem Set 4

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

6.1 Occupancy Problem

Randomized Algorithms. Zhou Jun

9 Searching the Internet with the SVD

The Count-Min-Sketch and its Applications

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2

Bandwidth: Communicate large complex & highly detailed 3D models through lowbandwidth connection (e.g. VRML over the Internet)

Tight Bounds for Distributed Functional Monitoring


Part 1: Hashing and Its Many Applications
Sid C-K Chau
Chi-Kin.Chau@cl.cam.ac.uk
http://www.cl.cam.ac.uk/~cc25/teaching

Why Randomized Algorithms?

Randomized algorithms are algorithms that make random choices during the execution. We also make lots of random choices every day, because of:
- Lack of information
- Convenience and simplicity
- The wish to diversify risk and try luck!

These reasons apply to algorithmic design, but unscrupulous random choices may end with useless results.

Question: How do we make smart random choices?
- In practice: simple random choices often work amazingly well
- In theory: simple maths can justify these simple random choices

Applications of Randomized Algorithms

Randomized algorithms are especially useful for applications with:
- Large data sets and insufficient memory
- Limited computational power
- Incomplete prior knowledge
- Minor fault tolerability

A long list of applications includes:
- Information retrieval, databases, bioinformatics (e.g. Google, DNA matching)
- Networking, telecommunications (e.g. AT&T)
- Optimization, data prediction, financial trading
- Artificial intelligence, machine learning
- Graphics, multimedia, computer games
- Information security, and a lot more

A Key Example: Hashing

[Figure: string-based hashing for an address book, mapping entries to short hashes such as "0A", "084", "ffe2", "5908", "d3e"]

Hashing enables large-scale, fast data processing:
- It expedites the performance of large data/file systems in search engines (Google, Bing)
- It enables fast response times in small low-power devices (iPhone, iPad)

Hashing is a random sampling/projection of some more complicated data objects (e.g. strings, graphs, functions, sets, data structures):
- E.g. string-based hashing maps an input string to a shorter hash (string) by a hash function
- We assume that a hash function is selected randomly (without a priori knowledge) from a large class of hash functions
- Hence, when we do not specify the detailed implementation of a particular hash function, the behaviour of hashing appears probabilistic
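As a toy illustration of string-based hashing, the sketch below maps strings to a small number of buckets with a polynomial rolling hash. The construction and the base 131 are illustrative assumptions; the slides deliberately leave the concrete hash function unspecified.

```python
def poly_hash(s: str, n: int, base: int = 131) -> int:
    """Map a string to a bucket in {0, ..., n-1} via a polynomial rolling hash.

    The base and the construction are illustrative assumptions; any
    function from a suitable random class of hash functions would do.
    """
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % n
    return h

# Hash some address-book entries into n = 16 buckets.
buckets = {name: poly_hash(name, 16)
           for name in ("B. Roberts", "K. G. Smith", "A. Williams")}
```

Since the function is deterministic, the same string always lands in the same bucket; the "randomness" comes from the (hypothetical) random choice of hash function.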

Balls and Bins Model

[Figure: m balls thrown into n bins]

A generic model for hashing is the balls-and-bins model: throw m balls into n bins, such that each ball is uniformly randomly distributed among the bins.

Interpretations of the model:
- Hashing: balls = data objects, bins = hashes
- Coupon collector problem: balls = coupons, bins = types of coupons
- Birthday attack problem: balls = people, bins = birthdates

Key questions:
- Efficiency: How many non-empty bins?
- Performance: What is the maximum number of balls over all the bins?

The balls-and-bins model is a random model; its behaviour is naturally analysed by probability theory.
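The model is easy to probe empirically. This sketch throws m balls into n bins and reports the two quantities asked above; the fixed seed is an assumption added for reproducibility.

```python
import random
from collections import Counter

def balls_into_bins(m: int, n: int, seed: int = 0) -> tuple[int, int]:
    """Throw m balls into n bins uniformly at random.

    Returns (number of non-empty bins, maximum load over all bins).
    The fixed seed is an assumption added for reproducibility.
    """
    rng = random.Random(seed)
    loads = Counter(rng.randrange(n) for _ in range(m))
    return len(loads), max(loads.values())

non_empty, heaviest = balls_into_bins(m=1000, n=1000)
# With m = n, roughly a (1 - 1/e) ~ 63% fraction of bins ends up non-empty.
```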

Poisson Approximation

The probability that bin i has r balls follows the binomial distribution:

  P(X_i = r) = C(m, r) (1/n)^r (1 - 1/n)^{m-r}
             = (1/r!) [m(m-1)···(m-r+1) / n^r] (1 - 1/n)^{m-r}

But this expression can be too unwieldy. When m and n are very large, we can approximate:

  m(m-1)···(m-r+1) / n^r ≈ (m/n)^r  and  (1 - 1/n)^{m-r} ≈ e^{-m/n}

Hence,

  P(X_i = r) ≈ e^{-m/n} (m/n)^r / r!

This is known as the Poisson distribution, Po(r) = e^{-μ} μ^r / r!, whose mean is μ = m/n.

The probability of a non-empty bin is P(X_i ≠ 0) ≈ 1 - Po(0) = 1 - e^{-μ}.
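A quick numerical check (taking m = n = 1000, so μ = 1) confirms how close the approximation is:

```python
from math import comb, exp, factorial

def binom_pmf(r: int, m: int, n: int) -> float:
    """Exact P(X_i = r): bin i receives r of the m balls, each with prob. 1/n."""
    return comb(m, r) * (1 / n) ** r * (1 - 1 / n) ** (m - r)

def poisson_pmf(r: int, mu: float) -> float:
    """Poisson approximation Po(r) = e^{-mu} * mu^r / r!."""
    return exp(-mu) * mu ** r / factorial(r)

# With m = n = 1000 (mu = 1): binom_pmf(0, 1000, 1000) ~ 0.3677
# versus poisson_pmf(0, 1.0) = 1/e ~ 0.3679.
```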

Maximum Load

Recall a well-known technique called the union bound:

  P(X_1 ≥ r or ... or X_n ≥ r) ≤ P(X_1 ≥ r) + ... + P(X_n ≥ r)

Hence the probability that some bin has at least M balls satisfies

  P(max_{i=1,...,n} X_i ≥ M) ≤ n · P(X_i ≥ M)

If M > μ = m/n, then P(X_i ≥ M) ≤ e^{-μ} (eμ/M)^M (shown by a Chernoff bound in the homework).

If m = n (hence μ = 1) and M = 3 ln n / ln ln n, then

  n · P(X_i ≥ M) ≤ n e^{-1} (e/M)^M = e^{-1} n (e ln ln n / (3 ln n))^{3 ln n / ln ln n} → 0 as n → ∞

Therefore, the maximum load exceeds 3 ln n / ln ln n only with vanishing probability (i.e., P(max_{i=1,...,n} X_i ≥ 3 ln n / ln ln n) → 0 as n → ∞). Or we say that the maximum load is less than 3 ln n / ln ln n with high probability.

[Figure: plot of the maximum-load bound for n = 200, ..., 1000]
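A simulation (seeded for reproducibility; the helper below is an illustration, not from the slides) shows that the observed maximum load for m = n sits well below the 3 ln n / ln ln n bound:

```python
import math
import random

def max_load(n: int, seed: int = 1) -> int:
    """Throw m = n balls into n bins and return the maximum load."""
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):
        loads[rng.randrange(n)] += 1
    return max(loads)

n = 10_000
bound = 3 * math.log(n) / math.log(math.log(n))   # ~ 12.4 for n = 10,000
observed = max_load(n)
# The observed maximum load is typically around ln n / ln ln n ~ 4,
# comfortably below the high-probability bound.
```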

Bloom Filter

[Figure: a 7-bit Bloom filter "1 0 0 1 1 0 1" holding B. Roberts, K. G. Smith, A. Williams, C. Phillips and D. Johnson; H. Morrison is correctly rejected (not a member), while A. Clare is accepted as a false positive]

Instead of hashing from a string, we also consider more complicated objects. A Bloom filter maps a set of strings to an n-bit string:
- There are k hash functions; each hash function h_j maps a string to a value in {1,...,n}
- We initially set the Bloom filter to be an n-bit zero string
- To include a string s in the Bloom filter, we set the h_j(s)-th bit in the Bloom filter to one, for every j = 1,...,k
- To validate whether a string s is a member of a Bloom filter, we check if the h_j(s)-th bit in the Bloom filter is one, for every j = 1,...,k

A string belonging to a Bloom filter will be confirmed as a member by the validation (i.e., there is no false negative). However, it is possible that a string not belonging to a Bloom filter will also be confirmed as a member by the validation (i.e., there can be false positives).
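A minimal Bloom filter sketch following the description above. Deriving the k hash functions from salted SHA-256 is an implementation assumption not made in the slides, which only require k independent hash functions into {1,...,n}.

```python
import hashlib

class BloomFilter:
    """n-bit Bloom filter with k hash functions."""

    def __init__(self, n: int, k: int):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _positions(self, s: str):
        # Derive k hash values from salted SHA-256 (an assumption;
        # any family of k independent hash functions would do).
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{s}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, s: str) -> None:
        for p in self._positions(s):
            self.bits[p] = 1

    def __contains__(self, s: str) -> bool:
        return all(self.bits[p] for p in self._positions(s))

bf = BloomFilter(n=938, k=7)
for name in ("B. Roberts", "K. G. Smith", "A. Williams",
             "C. Phillips", "D. Johnson"):
    bf.add(name)
# Members are always found (no false negatives); a non-member is
# rejected unless it happens to be a false positive.
```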

Applications of Bloom Filter

A Bloom filter is a compact representation of a set of strings, useful for applications with minor fault tolerance to false positives:
1) Spell and password checkers with a set of unsuitable words
2) Distributed database queries
3) Content distribution and web caches
4) Peer-to-peer networks
5) Packet filtering and measurement of pre-defined flows
6) Information security, computer graphics, etc.

[Figure: two examples. Distributed database query: a node holding the filter "0 1 1 0 1 0 1" asks "I want to know who has C. Edwards, K. Johnson, and ...?" against remote sets {C. Phillips, B. Roberts, K. G. Smith} and {A. Williams, D. Johnson, K. G. Smith}. Peer-to-peer networks: two peers exchange Bloom filters to ask "Do we have any common items?"]

Optimization of Bloom Filter

We want to minimize the number of false positives. There are m strings to be included in an n-bit Bloom filter, and there are k hash functions; each hash function h_j maps a string to a value in {1,...,n}.

The probability that a particular bit in the Bloom filter becomes one after including m strings is

  1 - (1 - 1/n)^{km} ≈ 1 - e^{-km/n},  assuming that n and m are very large

Consider validating whether a random string is included in the Bloom filter or not. The probability that the validation succeeds is

  f_{m,n}(k) = (1 - (1 - 1/n)^{km})^k ≈ (1 - e^{-km/n})^k

f_{m,n}(k) is also the probability of a false positive. Hence, we want to minimize f_{m,n}(k) with respect to k:

  d ln f_{m,n}(k) / dk = ln(1 - e^{-km/n}) + (km/n) e^{-km/n} / (1 - e^{-km/n})

Setting d ln f_{m,n}(k)/dk = 0 gives k = (n/m) ln 2 and f_{m,n}(k) = (1/2)^k ≈ (0.6185)^{n/m}.

For instance, if m = 100 and f_{m,n}(k) = 0.01, then n ≈ 959 and k = 7.

Heavy Hitter Problem

[Figure: a stream of items with multiple occurrences, e.g. d a a d d c d c c c b b b b b b b b a a a]

Find the most frequent items in a stream:
- In a network, find the users who consume the most bandwidth by observing a stream of packets
- In a search engine, find the most queried phrases
- From the transactions of a supermarket, find the most purchased items

The heavy hitter problem:
- There is a stream of items with multiple occurrences
- We want to find the items with the most occurrences, observing the stream continuously
- We do not know the number of distinct items a priori
- We are only allowed to use storage space much smaller than the number of items in the stream

Algorithms that process a stream of data with tight space consumption are called streaming algorithms.

Count-Min Sketch

[Figure: k rows of counters, one row per hash function, each with m/k columns; observing an item increments one counter in each row]

We use an approach similar to the Bloom filter, called a count-min sketch:
- A sketch is an array of k × m/k counters {C_{i,j}}
- There are k hash functions; each hash function h_i maps an item to a value in {1,...,m/k}
- Initially set all counters to zero (C_{i,j} = 0)
- When we observe an item s in the stream, increase the h_i(s)-th counter in row i (C_{i,h_i(s)} = C_{i,h_i(s)} + 1) for every i
- At the end, we obtain the number of occurrences of an item s as the minimum of all the counters that s maps to: N(s) = min{C_{i,h_i(s)} : i = 1,...,k}

N(s) is of course an overestimate of the true number of occurrences, because multiple items can be mapped to the same counter by a hash function. However, N(s) is not far from the true value.
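A direct sketch of the data structure described above; as with the Bloom filter example, deriving the hash functions h_i from salted SHA-256 is an implementation assumption.

```python
import hashlib

class CountMinSketch:
    """k rows of m // k counters; row i is indexed by hash function h_i."""

    def __init__(self, m: int, k: int):
        self.k, self.width = k, m // k
        self.C = [[0] * self.width for _ in range(k)]

    def _col(self, i: int, s: str) -> int:
        # h_i derived from salted SHA-256 (an implementation assumption).
        digest = hashlib.sha256(f"{i}:{s}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, s: str) -> None:
        for i in range(self.k):
            self.C[i][self._col(i, s)] += 1

    def estimate(self, s: str) -> int:
        """N(s): never an underestimate of the true count."""
        return min(self.C[i][self._col(i, s)] for i in range(self.k))

cms = CountMinSketch(m=125, k=5)
for item in "b" * 9 + "a" * 3 + "c" * 2:
    cms.add(item)
# estimate("b") is at least the true count 9 and at most the stream length 14.
```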

Principle of Count-Min Sketch

Let the true number of occurrences of item s be T(s), and let the total number of occurrences of all items be T. We will show that the probability that N(s) ≥ T(s) + εT is at most (k/(εm))^k.

Let X_t be the random item at time t = 1,...,T. Then the counter

  C_{i,h_i(s)} = Σ_{t=1}^{T} 1[h_i(X_t) = h_i(s)]

is a random variable, where 1[·] is an indicator function. We obtain the expected deviation of C_{i,h_i(s)} from T(s):

  E[C_{i,h_i(s)} - T(s)] = E[ Σ_{t: X_t ≠ s} 1[h_i(X_t) = h_i(s)] ]
                         = Σ_{t: X_t ≠ s} P(h_i(X_t) = h_i(s))
                         ≤ T · P(h_i(X_t) = h_i(s)) = T · (k/m)

since each row has m/k counters, so two distinct items collide with probability k/m.

Recall Markov's inequality: P(X ≥ x) ≤ E[X]/x for positive x, since E[X] = Σ_y y P(X = y) ≥ 0 · P(X < x) + x · P(X ≥ x).

Hence, P(C_{i,h_i(s)} ≥ T(s) + εT) ≤ E[C_{i,h_i(s)} - T(s)] / (εT) ≤ k/(εm).

(Continued on the next slide.)

Principle of Count-Min Sketch (continued)

Following from the last slide: since P(C_{i,h_i(s)} ≥ T(s) + εT) ≤ k/(εm) and the k hash functions are independent,

  P(N(s) ≥ T(s) + εT) = P(min_{i=1,...,k} C_{i,h_i(s)} ≥ T(s) + εT) ≤ (k/(εm))^k

Minimizing (k/(εm))^k with respect to k gives k = εm/e, and then

  P(N(s) ≥ T(s) + εT) ≤ e^{-εm/e}

If we let k = ln(1/δ) and m = (e/ε) ln(1/δ), then P(N(s) ≥ T(s) + εT) ≤ δ.

Therefore, ε is a tolerance threshold that bounds the deviation of N(s) in the count-min sketch, and δ is an error probability that bounds the probability of N(s) deviating from T(s) by more than εT.

For example, if we set ε = 0.1 and δ = 0.01, then the number of counters we need is m = 125 and the number of hash functions is k = 5 (note that both m and k are independent of the number of items in the stream).

Streaming algorithms can do much more powerful tasks than finding the most frequent items, such as estimating the distributions, correlations and other statistics of a stream of items in a continuous fashion.
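The parameter choices above can be packaged as a small helper (a sketch; the ceiling/rounding to integers is my own choice):

```python
from math import ceil, e, log

def cms_parameters(eps: float, delta: float) -> tuple[int, int]:
    """Counters m and hash functions k so that
    P[N(s) >= T(s) + eps * T] <= delta, via k = ln(1/delta) and
    m = (e/eps) ln(1/delta).  Integer rounding is a choice made here.
    """
    k = ceil(log(1 / delta))
    m = round((e / eps) * log(1 / delta))
    return m, k

# eps = 0.1, delta = 0.01 gives m = 125 counters and k = 5 hash
# functions, both independent of the stream length.
```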

Summary

- Randomized algorithms are algorithms that make smart random choices during execution
- Hashing is a key example that enables large-scale and fast data processing
- A simple balls-and-bins model can characterize the probabilistic properties of hashing (e.g. maximum load)
- A Bloom filter is an example that generates a hash to determine membership of a set of strings
- Streaming algorithms use random compact data structures (sketches) to determine the statistics of a stream of items in a continuous fashion
- Hashing can be regarded as a random projection from a high-dimensional space of data to a low-dimensional space of hashes

References

Main reference: Mitzenmacher and Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis
- Chapter 5.2-5.3: Balls-and-bins model, Poisson distribution
- Chapter 5.5.4: Bloom filter
- Chapter 13.4: Count-min sketch

Additional references:
- Broder and Mitzenmacher, "Network Applications of Bloom Filters: A Survey", Internet Mathematics 1 (4), pp. 485-509
- Cormode and Hadjieleftheriou, "Finding the frequent items in streams of data", Communications of the ACM, Oct 2009, pp. 97-105

More related materials are available at http://www.cl.cam.ac.uk/~cc25/teaching