Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Similar documents
1 Maintaining a Dictionary

Algorithms lecture notes 1. Hashing, and Universal Hash functions

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Analysis of Algorithms I: Perfect Hashing

6.1 Occupancy Problem

Lecture: Analysis of Algorithms (CS )

Hashing. Dictionaries Hashing with chaining Hash functions Linear Probing

1 Difference between grad and undergrad algorithms

compare to comparison and pointer based sorting, binary trees

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Algorithms for Data Science

Hash tables. Hash tables

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Hash tables. Hash tables

Lecture 4 Thursday Sep 11, 2014

Hash tables. Hash tables

Lecture 8 HASHING!!!!!

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Lecture 4 February 2nd, 2017

1 Hashing. 1.1 Perfect Hashing

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Explicit and Efficient Hash Functions Suffice for Cuckoo Hashing with a Stash

CS 591, Lecture 6 Data Analytics: Theory and Applications Boston University

Lecture 4: Hashing and Streaming Algorithms

Lecture 3 Sept. 4, 2014

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

Introduction to Hash Tables

Lecture Note 2. 1 Bonferroni Principle. 1.1 Idea. 1.2 Want. Material covered today is from Chapter 1 and chapter 4

So far we have implemented the search for a key by carefully choosing split-elements.

? 11.5 Perfect hashing. Exercises

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Expectation of geometric distribution

Expectation of geometric distribution. Variance and Standard Deviation. Variance: Examples

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

Some notes on streaming algorithms continued

Lecture 23: Alternation vs. Counting

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

14.1 Finding frequent elements in stream

Hashing Data Structures. Ananda Gunawardena

A Lecture on Hashing. Aram-Alexandre Pooladian, Alexander Iannantuono March 22, Hashing. Direct Addressing. Operations - Simple

CS483 Design and Analysis of Algorithms

B490 Mining the Big Data

Lecture Lecture 3 Tuesday Sep 09, 2014

Advanced Data Structures

Module 1: Analyzing the Efficiency of Algorithms

1 Randomized Computation

Advanced Data Structures

Application: Bucket Sort

Problem. Maintain the above set so as to support certain

Lecture 2. Frequency problems

CSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m)

Approximate counting: count-min data structure. Problem definition

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Lecture 7: Fingerprinting. David Woodruff Carnegie Mellon University

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Lecture 6 September 13, 2016

CPSC 467: Cryptography and Computer Security

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Data Structure. Mohsen Arab. January 13, Yazd University. Mohsen Arab (Yazd University ) Data Structure January 13, / 86

Searching. Constant time access. Hash function. Use an array? Better hash function? Hash function 4/18/2013. Chapter 9

2 How many distinct elements are in a stream?

CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms

Array-based Hashtables

CS361 Homework #3 Solutions

Hashing, Hash Functions. Lecture 7

Hash Functions for Priority Queues

Big Data. Big data arises in many forms: Common themes:

Problem Set 4 Solutions

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

2 Completing the Hardness of approximation of Set Cover

s 1 if xπy and f(x) = f(y)

Design and Analysis of Algorithms

Integer Linear Programs

1 Estimating Frequency Moments in Streams

6.854 Advanced Algorithms

Data Structures and Algorithm. Xiaoqing Zheng

Lecture 5. 1 Review (Pairwise Independence and Derandomization)

Ad Placement Strategies

Frequency Estimators

6.842 Randomness and Computation Lecture 5

Randomness and Computation March 13, Lecture 3

Lecture 24: Approximate Counting

Dictionary: an abstract data type

Cuckoo Hashing and Cuckoo Filters

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

Cache-Oblivious Hashing

Outline. CS38 Introduction to Algorithms. Max-3-SAT approximation algorithm 6/3/2014. randomness in algorithms. Lecture 19 June 3, 2014

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Collision. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST

Priority queues implemented via heaps

5199/IOC5063 Theory of Cryptology, 2014 Fall

1 AC 0 and Håstad Switching Lemma

Succinct Data Structures for Approximating Convex Functions with Applications

Transcription:

Lecture 5: Hashing David Woodruff Carnegie Mellon University

Hashing Universal hashing Perfect hashing

Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of length at most 80 Let S be a subset of U, which is a small dictionary S could be all English words Support operations to maintain the dictionary Insert(x): add the key x to S Query(x): is the key x in S? Delete(x): remove the key x from S

Dictionary Models Static: don t support insert and delete operations, just optimize for fast query operations For example, the English dictionary does not change much Could use a sorted array with binary search Insertion only: just support insert and query operations Dynamic: support insert, delete, and query operations Could use a balanced search tree (AVL trees) to get O(log S ) time per operation Hashing is an alternative approach, often the fastest and most convenient

Formal Hashing Setup Universe U is very large E.g., set of ASCII strings of length 80 is 128 Care about a small subset S U. Let N = S. S could be the names of all students in this class Our data structure is an array A of size M and a hash function h: U 0, 1,, M 1}. Typically M U, so can t just store each key x in A[x] Insert(x) will try to place key x in A[h(x)] But what if h(x) = h(y) for x y?we let each entry of A be a linked list. To insert an element x into A[h(x)], insert it at the top of the list Hope linked lists are small

How to Choose the Hash Function h? Want it to be unlikely that h(x) = h(y) for different keys x and y Want our array size M to be O(N), where N is number of keys Want to quickly compute h(x) given x We will treat this computation as O(1) time How long do Query(x) and Delete(x) take? O(length of list A[h(x)]) time How long does Insert(x) take? O(1) time no matter what How long can the lists A[h(x)] be?

Bad Sets Exist for any Hash Function Claim: For any hash function h: U > {0, 1, 2,, M 1}, if U N1 M1, there is a set S of N elements of U that all hash to the same location Proof: If every location had at most N 1 elements of U hashing to it, we would have U N1 M There s no good hash function h that works for every S. Thoughts? Universal Hashing: Randomly choose h! Show for any sequence of insert, query, and delete operations, the expected number of operations, over a random h, is small

Universal Hashing Definition: A set H of hash functions h, where each h in H maps U > {0, 1, 2,, M 1} is universal if for all x y, Pr h x h y 1 M The condition holds for every x y, and the randomness is only over the choice of h from H Equivalently, for every x y, we have:

Universal Hashing Examples

Examples that are Not Universal Note that a and b collide with probability more than 1/M = 1/2

Universal Hashing Example The following hash function is universal with M = {0,1,2}

Using Universal Hashing Theorem: If H is universal, then for any set S Uwith S = N, for any x S, if we choose h at random from H, the expected number of collisions between x and other elements in S is less than N/M. Proof: For y Swith y x, let C 1if h(x) = h(y), otherwise C 0 Let C C E C Prh x h y be the total number of collisions with x By linearity of expectation, E C EC

Using Universal Hashing Corollary: If H is universal, for any sequence of L insert, query, and delete operations in which there are at most M keys in the data structure at any time, the expected cost of the L operations for a random h His O(L) Assumes the time to compute h is O(1) Proof: For any operation in the sequence, its expected cost is O(1) by the last theorem, so the expected total cost is O(L) by linearity of expectation

But how to Construct a Universal Hash Family? Suppose and M Let A be a random m x u binary matrix, and h(x) = Ax mod 2 Claim: for,

But how to Construct a Universal Hash Family? Claim: For x y, Pr h x h y Proof: A x mod 2 A x mod 2, where A is the i th column of A If h(x) = h(y), then Ax=Ay mod 2, so A(x y) = 0 mod 2 If xy, there exists an i for which x y Fix A for all ji, which fixes b = A x y mod 2 A(x y) = 0 mod 2 if and only if A = b Pr A b So h(x) = Ax mod 2 is universal

k wise Independent Families Definition: A hash function family H is k universal if for every set of k distinct keys x,,x and every set of k values v,,v 0, 1,, M 1, Pr[h x v AND h x v AND AND h x v = If H is 2 universal, then it is universal. Why? h(x) = Ax mod 2 for a random binary A is not 2 universal. Why? Exercise: Show Ax + b mod 2 is 2 universal, where A in 0,1 and b 0,1 are chosen independently and uniformly at random

More Universal Hashing Given a key x, suppose x = x,,x where each x 0, 1,, M 1 Suppose M is prime Choose random r,..,r 0, 1,, M 1 and define h x r x r x r x mod M Claim: the family of such hash functions is universal, that is, Pr h x h y for all distinct x and y

More Efficient Universal Hashing Claim: the family of such hash functions is universal, that is, Pr h x h y for all x y Proof: Since x y, there is an i for which x y Let h x r x, and h(x) = h (x) + r x mod M If h(x) = h(y), then h (x) + r x = h (y) + r y mod M So r x y h y h x mod M, or r This happens with probability exactly 1/M mod M

Perfect Hashing If we fix the dictionary S of size N, can we find a hash function h so that all query(x) operations take constant time? Claim: If H is universal and M N, then Pr no collisions in S Proof: How many pairs {x,y} of distinct x,y in S are there? Answer: N(N 1)/2 For each pair, the probability of a collision is at most 1/M Pr[exists a collision] (N(N 1)/2)/M Just try a random h and check if there are any collisions Problem: our hash table has M N space! How can we get O(N) space?

Perfect Hashing in O(N) Space 2 Level Scheme Choose a hash function h:u 1,2,,Nfrom a universal family Let L be the number of items x in S for which h(x) = i Choose N second level hash functions h,h,,h, where h :U 1,,L By previous analysis, can choose hash functions h,h,,h so that there are no collisions, so O(1) time Hash table size is How big is that??,, L

Perfect Hashing in O(N) Space 2 Level Scheme Theorem: If we pick h from a universal family H, then Pr L 4N 1 2,, Proof: It suffices to show E L 2Nand apply Markov s inequality Let C, 1if h(x) = h(y). By counting collisions on both sides, L, C, If x = y, then C, 1. If xy, then E C, PrC, 1 E L E C, N E C, NNN1/N, < 2N So choose a random h in H, check if,, L 4N, and if so, then choose h,,h