Hash Tables


Basic Probability Theory
Two events A, B are independent if Pr[A ∩ B] = Pr[A] · Pr[B].
Conditional probability: Pr[A | B] = Pr[A ∩ B] / Pr[B].
The expectation of a (discrete) random variable X is E[X] = ∑_k k · Pr[X = k]. In particular, if the values of X are 0 or 1, then E[X] = Pr[X = 1].

Basic Probability Theory
Union bound: for events A_1, …, A_k, Pr[⋃_{i=1}^k A_i] ≤ ∑_{i=1}^k Pr[A_i].
Linearity of expectation: for random variables X_1, …, X_k, E[∑_{i=1}^k X_i] = ∑_{i=1}^k E[X_i].

Dictionary
Definition: A dictionary is a data structure that stores a set S of elements, where each element x has a field x.key, and supports the following operations:
Search(S, k): return an element x ∈ S with x.key = k.
Insert(S, x): add x to S.
Delete(S, x): delete x from S.
We will assume that the keys are from U = {0, …, u − 1} and that the keys are distinct.

Direct addressing
S is stored in an array T[0..m − 1]. The entry T[k] contains a pointer to the element with key k, if such an element exists, and NULL otherwise.
Search(T, k): return T[k].
Insert(T, x): T[x.key] ← x.
Delete(T, x): T[x.key] ← NULL.
What is the problem with this structure? (The table size equals the universe size u, which may be huge.)
[Figure: a direct-address table T[0..9] storing keys 2, 3, 5, 8 with their satellite data; the remaining slots are NULL.]
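The operations above can be sketched in a few lines of Python. This is a minimal illustration under the slide's assumptions (keys from a small universe {0, …, u − 1}, elements stored as (key, satellite data) pairs); the class and variable names are ours, not the lecture's.

```python
class DirectAddressTable:
    """Minimal direct-address table: one slot per possible key."""
    def __init__(self, u):
        self.T = [None] * u          # table size = universe size u

    def insert(self, x):             # x is a (key, satellite data) pair
        self.T[x[0]] = x

    def search(self, k):
        return self.T[k]             # None plays the role of NULL

    def delete(self, x):
        self.T[x[0]] = None

T = DirectAddressTable(10)
T.insert((2, "a"))
T.insert((5, "b"))
print(T.search(5))   # (5, 'b')
print(T.search(7))   # None
```

Every operation is Θ(1), but the table occupies Θ(u) space regardless of how many elements it holds.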

Hashing
In order to reduce the space complexity of direct addressing, map the keys to a smaller range {0, …, m − 1} using a hash function. This creates the problem of collisions (two or more keys that are mapped to the same value). There are two ways to handle collisions: chaining and open addressing.

Hash table with chaining
Let h : U → {0, …, m − 1} be a hash function (m < u). S is stored in a table T[0..m − 1] of linked lists. The element x ∈ S is stored in the list T[h(x.key)].
Search(T, k): search the list T[h(k)].
Insert(T, x): insert x at the head of T[h(x.key)].
Delete(T, x): delete x from T[h(x.key)].
Example: S = {6, 9, 19, 26, 30}, m = 5, h(x) = x mod 5. The resulting lists are T[0] = (30), T[1] = (26, 6), T[4] = (19, 9).
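The chained table can be sketched directly from the slide, using Python lists in place of linked lists (an assumption for brevity; list index 0 plays the role of the list head). The class name is illustrative.

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.T = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return k % self.m                 # the slide's h(x) = x mod m

    def insert(self, k):
        self.T[self._h(k)].insert(0, k)   # insert at the head of the chain

    def search(self, k):
        return k in self.T[self._h(k)]    # scan the chain T[h(k)]

    def delete(self, k):
        self.T[self._h(k)].remove(k)

H = ChainedHashTable(5)
for k in [6, 9, 19, 26, 30]:
    H.insert(k)
print(H.T[1])        # [26, 6]  (26 was inserted after 6, so it is at the head)
print(H.search(19))  # True
```

Inserting at the head makes Insert Θ(1); Search and Delete cost is proportional to the chain length.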

Analysis
Assumption of simple uniform hashing: any element is equally likely to hash into any of the m slots, independently of where other elements have hashed.
This assumption holds when the keys are chosen uniformly and independently at random (with repetitions), and the hash function satisfies |{k ∈ U : h(k) = i}| = u/m for every i ∈ {0, …, m − 1}.
We want to analyze the performance of hashing under the assumption of simple uniform hashing. This is the balls into bins problem: suppose we randomly place n balls into m bins, and let X be the number of balls in bin 1. The time complexity of a random search in a hash table is Θ(1 + X).

The expectation of X
Let α = n/m. Claim: E[X] = α.
Proof. Let I_j be a random variable which is 1 if the j-th ball is in bin 1, and 0 otherwise. Then X = ∑_{j=1}^n I_j, so
E[X] = E[∑_{j=1}^n I_j] = ∑_{j=1}^n E[I_j] = ∑_{j=1}^n Pr[I_j = 1] = n · (1/m) = α.

The distribution of X
Claim: Pr[X = r] ≈ e^{−α} α^r / r! (i.e., X has approximately a Poisson distribution).
Proof.
Pr[X = r] = (n choose r) (1/m)^r (1 − 1/m)^{n−r} = [n(n−1)⋯(n−r+1) / r!] · (1/m^r) · (1 − 1/m)^{n−r}.
If m and n are large, then n(n−1)⋯(n−r+1) ≈ n^r and (1 − 1/m)^{n−r} ≈ e^{−n/m}. Thus Pr[X = r] ≈ e^{−n/m} (n/m)^r / r!.
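The binomial-to-Poisson approximation in the claim is easy to check numerically. A small sketch, using the exact binomial probabilities via `math.comb` and the illustrative parameters n = m = 10,000 (so α = 1):

```python
from math import comb, exp, factorial

n = m = 10_000          # illustrative choice; load factor alpha = n/m = 1
alpha = n / m
for r in range(5):
    # exact Pr[X = r] for n balls in m bins: binomial(n, 1/m)
    exact = comb(n, r) * (1 / m) ** r * (1 - 1 / m) ** (n - r)
    # the claimed Poisson approximation e^{-alpha} * alpha^r / r!
    poisson = exp(-alpha) * alpha ** r / factorial(r)
    print(r, round(exact, 6), round(poisson, 6))
```

For these parameters the two columns agree to roughly four decimal places, as the claim predicts.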

The maximum number of balls in a bin
Let Y be the maximum number of balls in some bin.
Claim: For m = n, Pr[Y ≥ R] ≤ 1/n^{1.5}, where R = e · ln n / ln ln n.
Proof. Let X_i be the number of balls in bin i. By the union bound,
Pr[Y ≥ r] ≤ ∑_{i=1}^n Pr[X_i ≥ r] = n · Pr[X ≥ r].
Pr[X ≥ r] ≤ (n choose r) (1/n)^r ≤ (n^r / r!) · (1/n^r) = 1/r! ≤ (e/r)^r,
where the last inequality is true since e^r = ∑_{i=0}^∞ r^i / i! ≥ r^r / r!.

The distribution of max_i X_i
Proof (continued).
Pr[Y ≥ R] ≤ n (e/R)^R = n (ln ln n / ln n)^{e ln n / ln ln n}
= e^{ln n} · e^{(e ln n / ln ln n)(ln ln ln n − ln ln n)}
= e^{ln n − e ln n + e ln n · ln ln ln n / ln ln n}.
For large enough n, e · ln ln ln n / ln ln n ≤ e − 2.5, so the exponent is at most ln n − e ln n + (e − 2.5) ln n = −1.5 ln n. Hence Pr[Y ≥ R] ≤ e^{−1.5 ln n} = 1/n^{1.5}.

Rehashing
We wish to maintain n = O(m) in order to have Θ(1) expected search time. This can be achieved by rehashing. Suppose we want n ≤ m. When the table has m elements and a new element is inserted, create a new table of size 2m and copy all elements into the new table. The cost of a single rehashing is Θ(n), but since the table size doubles each time, the amortized cost per insertion is Θ(1).
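A doubling table can be sketched as follows, combining rehashing with the chaining scheme above. The class name and starting size are our choices, not the lecture's.

```python
class GrowingTable:
    """Chained hash table that keeps n <= m by doubling and rehashing."""
    def __init__(self):
        self.m = 4                                # illustrative initial size
        self.T = [[] for _ in range(self.m)]
        self.n = 0

    def insert(self, k):
        if self.n == self.m:                      # table full: rehash
            old = [k2 for slot in self.T for k2 in slot]
            self.m *= 2                           # new table of size 2m
            self.T = [[] for _ in range(self.m)]
            for k2 in old:                        # Theta(n) copy
                self.T[k2 % self.m].append(k2)
        self.T[k % self.m].append(k)
        self.n += 1

G = GrowingTable()
for k in range(20):
    G.insert(k)
print(G.m)   # 32  (the table doubled 4 -> 8 -> 16 -> 32)
```

Each doubling costs Θ(n), but doublings are geometrically spaced, which is why the amortized insertion cost stays Θ(1).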

Universal hash functions
Definition: A collection H of hash functions is universal if for every pair of distinct keys x, y ∈ U, Pr_{h ∈ H}[h(x) = h(y)] ≤ 2/m.
Example: Let p be a prime number larger than u, and define
f_{a,b}(x) = ((ax + b) mod p) mod m,
H_{p,m} = {f_{a,b} : a ∈ {1, 2, …, p − 1}, b ∈ {0, 1, …, p − 1}}.
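Drawing a random member of H_{p,m} and estimating the collision probability empirically can be sketched as follows. The prime p = 2³¹ − 1 and the table size m = 100 are illustrative choices; the function names are ours.

```python
import random

p = 2**31 - 1            # a Mersenne prime, assumed larger than the universe
m = 100

def random_hash():
    """Draw f_{a,b} from H_{p,m} uniformly at random."""
    a = random.randrange(1, p)       # a in {1, ..., p-1}
    b = random.randrange(0, p)       # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

random.seed(0)                       # fixed seed for reproducibility
x, y = 12345, 67890
trials = 10_000
coll = sum(1 for _ in range(trials) if (h := random_hash())(x) == h(y))
print(coll / trials)   # empirically close to 1/m = 0.01, within the 2/m bound
```

The definition only promises a bound of 2/m over the random choice of h; the empirical rate for this particular family is typically close to 1/m.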

Universal hash functions
Theorem: Suppose that H is a universal collection of hash functions. If a hash table for S is built using a randomly chosen h ∈ H, then for every k ∈ U, the expected time of Search(S, k) is Θ(1 + n/m).
Proof. Let X be the length of the list T[h(k)]. Then X = ∑_{y ∈ S} I_y, where I_y = 1 if h(y.key) = h(k) and I_y = 0 otherwise.
E[X] = E[∑_{y ∈ S} I_y] = ∑_{y ∈ S} E[I_y] = ∑_{y ∈ S} Pr_{h ∈ H}[h(y.key) = h(k)] ≤ 1 + 2n/m.
(The term 1 accounts for the possible element with key k itself; each of the other at most n elements collides with k with probability at most 2/m.)

Time complexity of chaining
Under the assumption of simple uniform hashing, the expected time of a search is Θ(1 + α).
If α = Θ(1), and under the assumption of simple uniform hashing, the worst-case time of a search is Θ(log n / log log n), with probability at least 1 − 1/n^{Θ(1)}.
If the hash function is chosen from a universal collection at random, the expected time of a search is Θ(1 + α).
The worst-case time of an insert is Θ(1) if there is no rehashing.

Interpreting keys as natural numbers
How can we convert floats or ASCII strings to natural numbers? An ASCII string can be interpreted as a number in base 128.
Example: For the string CLRS, the ASCII values of the characters are C = 67, L = 76, R = 82, S = 83. So CLRS is (67 · 128³) + (76 · 128²) + (82 · 128¹) + (83 · 128⁰) = 141,764,947.
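The base-128 interpretation is a one-liner; a sketch (the function name is ours):

```python
def string_to_number(s):
    """Interpret an ASCII string as a natural number in base 128."""
    n = 0
    for c in s:
        n = n * 128 + ord(c)   # shift left one base-128 digit, add the next
    return n

print(string_to_number("CLRS"))   # 141764947
```

Note that the loop is itself an instance of Horner's rule (next slide) with x = 128.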

Horner's rule
Horner's rule: a_d x^d + a_{d−1} x^{d−1} + ⋯ + a_1 x + a_0 = (⋯((a_d x + a_{d−1}) x + a_{d−2}) x + ⋯) x + a_0.
Code:
y = a_d
for i = d − 1 downto 0
    y = a_i + x·y
If d is large, the value of y becomes too big. Solution: evaluate the polynomial modulo p:
y = a_d
for i = d − 1 downto 0
    y = (a_i + x·y) mod p
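The modular variant of the pseudocode above translates directly to Python; a sketch (function name and coefficient order are our conventions):

```python
def horner_mod(coeffs, x, p):
    """Evaluate a_d x^d + ... + a_1 x + a_0 (mod p).
    coeffs lists the coefficients from a_d down to a_0."""
    y = 0
    for a in coeffs:
        y = (a + x * y) % p   # reduce mod p at every step, so y stays small
    return y

# 3*10^2 + 2*10 + 1 = 321
print(horner_mod([3, 2, 1], 10, 10**9 + 7))   # 321
```

Reducing mod p inside the loop keeps every intermediate value below p², which is the whole point of the modular version.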

Rabin-Karp pattern matching algorithm
Suppose we are given strings P and T, and we want to find all occurrences of P in T. The Rabin-Karp algorithm is as follows:
Compute h(P).
For every substring T′ of T of length |P|: if h(T′) = h(P), check whether T′ = P.
The values h(T′) for all T′ can be computed in Θ(|T|) total time using a rolling hash.
Example: Let T = BGUCS and P = GUC, and let T₁ = BGU, T₂ = GUC. Then
h(T₁) = (66 · 128² + 71 · 128 + 85) mod p
h(T₂) = (71 · 128² + 85 · 128 + 67) mod p = ((h(T₁) − 66 · 128²) · 128 + 67) mod p.
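The algorithm with the rolling-hash update above can be sketched as follows. The modulus p = 10⁹ + 7 is an illustrative choice; the function name is ours.

```python
def rabin_karp(T, P, p=10**9 + 7, base=128):
    """Return the start indices of all occurrences of P in T."""
    m = len(P)
    hp = 0
    for c in P:                           # h(P) via Horner's rule
        hp = (hp * base + ord(c)) % p
    ht = 0
    for c in T[:m]:                       # hash of the first length-m window
        ht = (ht * base + ord(c)) % p
    top = pow(base, m - 1, p)             # weight of the leading character
    matches = []
    for s in range(len(T) - m + 1):
        if ht == hp and T[s:s+m] == P:    # verify, to rule out collisions
            matches.append(s)
        if s + m < len(T):                # roll: drop T[s], append T[s+m]
            ht = ((ht - ord(T[s]) * top) * base + ord(T[s + m])) % p
    return matches

print(rabin_karp("BGUCS", "GUC"))   # [1]
```

Each rolling step is Θ(1), so all window hashes together cost Θ(|T|); the explicit comparison on a hash match keeps the output correct even when two different windows collide.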

Applications Data deduplication: Suppose that we have many files, and some files have duplicates. In order to save storage space, we want to store only one instance of each distinct file. Distributed storage: Suppose we have many files, and we want to store them on several servers.

Distributed storage Let m denote the number of servers. The simple solution is to use a hash function h : U → {1, …, m} and assign file x to server h(x). The problem with this solution is that if we add a server, we need to rehash, which moves most files between servers.

Distributed storage: Consistent hashing
Suppose that the servers have identifiers s₁, …, s_m. Let h : U → [0, 1] be a hash function. For each server i, associate the point h(s_i) on the unit circle. For each file f, assign f to the server whose point is the first point encountered when traversing the unit circle anti-clockwise starting from h(f).
[Figure: three server points on the unit circle.]


Distributed storage: Consistent hashing
When a new server m + 1 is added, let i be the server whose point is the first server point after h(s_{m+1}). We only need to reassign some of the files that were assigned to server i. The expected number of file reassignments is n/(m + 1).
[Figure: server 4 is added to the circle; only the files in the arc between h(s₄) and server i's point move.]
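A consistent-hashing ring can be sketched with a sorted list and binary search. This is an illustration, not the lecture's construction: we use MD5 as a stand-in for the hash h into [0, 1), traverse the ring in the increasing direction (mirroring the slides' anti-clockwise traversal), and all names are ours.

```python
import bisect
import hashlib

def point(s):
    """Map a string to a pseudo-random point in [0, 1) (MD5 as a stand-in for h)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) / 2**128

class ConsistentHash:
    def __init__(self, servers):
        self.ring = sorted((point(s), s) for s in servers)

    def server_for(self, f):
        i = bisect.bisect(self.ring, (point(f), ""))   # first server point after h(f)
        return self.ring[i % len(self.ring)][1]        # wrap around the circle

    def add_server(self, s):
        bisect.insort(self.ring, (point(s), s))

ch = ConsistentHash(["s1", "s2", "s3"])
files = ["file%d" % i for i in range(100)]
before = {f: ch.server_for(f) for f in files}
ch.add_server("s4")
after = {f: ch.server_for(f) for f in files}
moved = sum(before[f] != after[f] for f in files)
print(moved)   # only the files in the new server's arc move (expected ~ n/(m+1))
```

Note that every file that moves is reassigned to the new server s4; files assigned to the other servers are untouched, which is exactly the property rehashing lacks.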

Linear probing
In the following, we assume that the elements in the hash table are keys with no satellite information. To insert an element k, try to insert it at T[h(k)]. If T[h(k)] is not empty, try T[(h(k) + 1) mod m], then T[(h(k) + 2) mod m], etc.
Code:
for i = 0, …, m − 1
    j = (h(k) + i) mod m
    if T[j] = NULL or T[j] = DELETED
        T[j] = k
        return
error "hash table overflow"
Example (m = 10, h(k) = k mod 10): inserting 14, 20, 34, 19, 55, 49 in this order gives
insert(T, 14): slot 4.
insert(T, 20): slot 0.
insert(T, 34): slot 4 is occupied, so slot 5.
insert(T, 19): slot 9.
insert(T, 55): slot 5 is occupied, so slot 6.
insert(T, 49): slots 9 and 0 are occupied, so slot 1.

Search
How do we perform Search? Example: search(T, 55) must probe slot 5 (which contains 34) before finding 55 in slot 6.

Search
Search(T, k) is performed as follows:
for i = 0, …, m − 1
    j = (h(k) + i) mod m
    if T[j] = k
        return j
    if T[j] = NULL
        return NULL
return NULL
(Note that the loop skips DELETED slots, since DELETED matches neither k nor NULL.)
Example: search(T, 25) probes slots 5 (34), 6 (55), and 7 (NULL), and returns NULL.
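The insert and search pseudocode above can be sketched together in Python; `None` plays the role of NULL, and the class name is ours. Running the slides' example sequence reproduces the table state shown in the figures.

```python
class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.T = [None] * m

    def _h(self, k):
        return k % self.m                      # the slides' h(k) = k mod 10

    def insert(self, k):
        for i in range(self.m):
            j = (self._h(k) + i) % self.m
            if self.T[j] is None or self.T[j] == "DELETED":
                self.T[j] = k
                return j
        raise OverflowError("hash table overflow")

    def search(self, k):
        for i in range(self.m):
            j = (self._h(k) + i) % self.m
            if self.T[j] == k:                 # found: return the slot index
                return j
            if self.T[j] is None:              # empty slot: k is absent
                return None
        return None                            # scanned the whole table

T = LinearProbingTable(10)
for k in [14, 20, 34, 19, 55, 49]:
    T.insert(k)
print(T.search(55))   # 6    (probes slot 5, then finds 55 in slot 6)
print(T.search(25))   # None (probes slots 5, 6, then hits the empty slot 7)
```

Search stops at an empty slot but continues past "DELETED" markers, which is what makes delete method 1 below compatible with it.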

Delete
Delete method 1: To delete element k, find its slot using Search and store the special value DELETED there. Example: delete(T, 34) replaces the 34 in slot 5 with DELETED.

Delete
Delete method 2: Erase k from the table (replace it by NULL) and also erase all the elements in the block of k (the maximal run of occupied slots containing k). Then reinsert the latter elements into the table.
Example: delete(T, 34) erases slots 4, 5, 6 (containing 14, 34, 55) and then reinserts 14 and 55; 14 returns to slot 4, and 55 moves to slot 5.

Open addressing
Open addressing is a generalization of linear probing. Let h : U × {0, …, m − 1} → {0, …, m − 1} be a hash function such that ⟨h(k, 0), h(k, 1), …, h(k, m − 1)⟩ is a permutation of ⟨0, …, m − 1⟩ for every k ∈ U. The slots examined during search/insert are h(k, 0), then h(k, 1), h(k, 2), etc.
Example: h(k, i) = (h₁(k) + i·h₂(k)) mod 13, where h₁(k) = k mod 13 and h₂(k) = 1 + (k mod 11). For insert(T, 14): h₁(14) = 1 and h₂(14) = 4, so the probes are slot 1 (occupied by 79), slot 5 (occupied by 98), and slot 9, where 14 is placed.

Open addressing
Insertion is the same as in linear probing:
for i = 0, …, m − 1
    j = h(k, i)
    if T[j] = NULL or T[j] = DELETED
        T[j] = k
        return
error "hash table overflow"
Deletion is done using delete method 1 defined above (using the special value DELETED).

Double hashing
In the double hashing method, h(k, i) = (h₁(k) + i·h₂(k)) mod m for some hash functions h₁ and h₂. The value h₂(k) must be relatively prime to m for the entire hash table to be searched. This can be ensured by either:
taking m to be a power of 2 and an h₂ whose image contains only odd numbers, or
taking m to be a prime number and an h₂ whose image contains integers from {1, …, m − 1}.
For example, h₁(k) = k mod m and h₂(k) = 1 + (k mod m′), where m is prime and m′ < m.
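The permutation property can be checked concretely with the example functions above (m = 13 prime, m′ = 11); the function name is ours.

```python
def probe_sequence(k, m, m2):
    """Probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m,
    with h1(k) = k mod m and h2(k) = 1 + (k mod m2), for m prime and m2 < m."""
    h1 = k % m
    h2 = 1 + (k % m2)
    return [(h1 + i * h2) % m for i in range(m)]

seq = probe_sequence(14, 13, 11)
print(seq)                              # [1, 5, 9, 0, 4, 8, 12, 3, 7, 11, 2, 6, 10]
print(sorted(seq) == list(range(13)))   # True: the sequence visits every slot
```

Because m = 13 is prime and h₂(14) = 4 is nonzero mod 13, the step size is relatively prime to m, so the probe sequence is a permutation of {0, …, 12}, exactly as the slide requires.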

Time complexity of open addressing
Assume uniform hashing: the probe sequence of each key is equally likely to be any of the m! permutations of {0, …, m − 1}. Assuming uniform hashing and no deletions, the expected number of probes in a search is
at most 1/(1 − α) for an unsuccessful search, and
at most (1/α) ln(1/(1 − α)) for a successful search.