CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30


1 CSCB63 Winter 2019, Week 10, Lecture 2: Hashing. Anna Bretscher. March 21

2 Today

Hashing:
Open Addressing
Hash functions
Universal Hashing

3 Open Addressing

Open Addressing. Each entry in the hash table stores a fixed number $c$ of elements. This has the immediate implication that we can only use it when $n \le cm$, where $n$ is the number of elements and $m$ is the number of entries. For simplicity we will assume this capacity $c$ is 1. We will talk about expanding the table later...

Q. How can we insert a new element if we get a collision?
A. Find a new location to store the new element. We also need to be able to find it again later, so we search a well-defined sequence of other locations in the hash table until we find one that is not full. This sequence is called a probe sequence.

4 Probe Sequences

We will look at each of the following methods for generating a probe sequence:
linear probing: try $A[(h(k) + i) \bmod m]$, $i = 0, 1, 2, \ldots$
quadratic probing: try $A[(h(k) + c_1 i + c_2 i^2) \bmod m]$
double hashing: try $A[(h(k) + i \cdot h'(k)) \bmod m]$, where $h'$ is another hash function

Linear Probing: The easiest open addressing strategy is linear probing. For a hash table of size $m$, key $k$ and hash function $h(k)$, the probe sequence is calculated as:
$s_i = (h(k) + i) \bmod m$ for $i = 0, 1, 2, \ldots$

Q. What is the value of $s_0$ (the home location for the item)?
A. $h(k)$
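The linear probing rule above can be sketched in a few lines of Python. This is only an illustration: the function names, the use of `None` for an empty slot, and the example hash `k % 7` are assumptions, not part of the lecture.

```python
def linear_probe_sequence(home, m):
    """Yield s_i = (home + i) mod m for i = 0, 1, ..., m - 1."""
    for i in range(m):
        yield (home + i) % m


def insert(table, key, h):
    """Store key in the first free slot on its probe sequence; return that slot."""
    for slot in linear_probe_sequence(h(key), len(table)):
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def search(table, key, h):
    """Return the slot holding key, or None if an empty slot is reached first."""
    for slot in linear_probe_sequence(h(key), len(table)):
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
    return None


# Keys 10, 17 and 24 all have home location 3 when h(k) = k mod 7,
# so they end up in the consecutive slots 3, 4 and 5.
table = [None] * 7
h = lambda k: k % 7
slots = [insert(table, k, h) for k in (10, 17, 24)]
```

INSERT and SEARCH walk the same sequence, which is why they cost the same amount of work.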

5 Linear Probing

Q. What is the problem with linear probing?
A. Clustering.

Q. What happens when we hash to something within a group of filled locations?
A. We have to probe the whole group until we reach an empty slot, and we then increase the size of the cluster. As a result, two keys that didn't necessarily share the same "home" location end up with almost identical probe sequences.

6 Non-Linear Probing

Non-linear probing includes schemes where the probe sequence does not involve steps of fixed size.

Example. Quadratic probing, where the probe sequence is calculated as:
$s_i = (h(k) + c_1 i + c_2 i^2) \bmod m$ for $i = 0, 1, 2, \ldots$

Q. Now what problem may occur?
A. Probe sequences will still be identical for elements that hash to the same home location.
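A small sketch can make the answer above concrete. The constants $c_1 = c_2 = 1$, the table size 8 and the hash `k % m` are illustrative assumptions; the point is only that two keys with the same home location get the same sequence.

```python
def quadratic_probe_sequence(home, m, c1=1, c2=1):
    """Return [s_0, ..., s_{m-1}] with s_i = (home + c1*i + c2*i^2) mod m."""
    return [(home + c1 * i + c2 * i * i) % m for i in range(m)]


m = 8
seq_11 = quadratic_probe_sequence(11 % m, m)  # home location 3
seq_19 = quadratic_probe_sequence(19 % m, m)  # home location 3 as well
```

Here `seq_11` and `seq_19` are identical, so the two keys compete for exactly the same slots in the same order. With these particular constants the sequence also revisits slots instead of covering the whole table, which is one reason $c_1$, $c_2$ and $m$ have to be chosen together.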

7 Double Hashing

In double hashing we use a different hash function $h_2(k)$ to calculate the step size. The probe sequence is:
$s_i = (h(k) + i \cdot h_2(k)) \bmod m$ for $i = 0, 1, 2, \ldots$

Note that $h_2(k)$ should not be 0 for any $k$. Also, we want to choose $h_2$ so that, if $h(k_1) = h(k_2)$ for two keys $k_1, k_2$, it won't be the case that $h_2(k_1) = h_2(k_2)$. That is, the two hash functions don't cause collisions on the same pairs of keys.
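As a rough sketch of the scheme above, the code below uses $h(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod (m - 1))$. Both choices are assumptions made for the example (the second is a common textbook choice that is never 0); the lecture does not prescribe them.

```python
def h(k, m):
    """Primary hash: the home location."""
    return k % m


def h2(k, m):
    """Secondary hash: the step size, always in 1..m-1, so never 0."""
    return 1 + (k % (m - 1))


def double_hash_sequence(k, m):
    """Return [s_0, ..., s_{m-1}] with s_i = (h(k) + i * h2(k)) mod m."""
    return [(h(k, m) + i * h2(k, m)) % m for i in range(m)]


m = 7  # a prime table size, so every nonzero step visits all m slots
seq_10 = double_hash_sequence(10, m)  # home 3, step 5
seq_17 = double_hash_sequence(17, m)  # home 3, step 6
```

Keys 10 and 17 share the home location 3 but have different step sizes, so their probe sequences diverge after the first probe.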

8 Analysis of Open Addressing

Notice that in open addressing, INSERT and SEARCH take the same amount of work. Let's consider the complexity of INSERT for a key $k$.

It's not hard to come up with worst-case situations where the above types of open addressing require $\Theta(n)$ time for INSERT.

To simplify the analysis of the average case, we make some assumptions:
the hash table has $m$ locations;
the hash table contains $n$ elements and we want to insert a new key $k$;
the probe sequence for $k$ is random, that is, its probe sequence is equally likely to be any permutation of $(0, 1, \ldots, m-1)$.

9 Average Insert Time under Open Addressing

Let $T$ denote the number of probes performed in the INSERT.

Q. Then the average-case time for insert is the expected time $E(T)$. What is $E(T)$?
A. $$E(T) = \sum_{i=0}^{m-1} i \cdot \Pr(T = i)$$

Q. What is $\Pr(T = i)$?
A. This is hard. Find another way to express $T = i$:
$$\Pr(T = i) = \Pr(T \ge i) - \Pr(T \ge i + 1)$$

Let $A_i$ denote the event that every location up until the $i$-th probe is occupied. Then $T \ge i$ iff $A_1, A_2, \ldots, A_{i-1}$ all occur, so
$$\Pr(T \ge i) = \Pr(A_1 \cap A_2 \cap \cdots \cap A_{i-1}) = \Pr(A_1) \cdot \Pr(A_2 \mid A_1) \cdot \Pr(A_3 \mid A_1 \cap A_2) \cdots \Pr(A_{i-1} \mid A_1 \cap \cdots \cap A_{i-2})$$

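10 E(T)

So far we have: $A_i$ denotes the event that every location up until the $i$-th probe is occupied. Then $T \ge i$ iff $A_1, A_2, \ldots, A_{i-1}$ all occur, so
$$\Pr(T \ge i) = \Pr(A_1) \cdot \Pr(A_2 \mid A_1) \cdot \Pr(A_3 \mid A_1 \cap A_2) \cdots \Pr(A_{i-1} \mid A_1 \cap \cdots \cap A_{i-2})$$

Q. What is $\Pr(A_j \mid A_1 \cap \cdots \cap A_{j-1})$?
A. Intuition. We need the number of elements that we have not seen so far over the number of slots we have not seen so far. For $j \ge 1$,
$$\Pr(A_j \mid A_1 \cap \cdots \cap A_{j-1}) = \frac{n - (j-1)}{m - (j-1)},$$
$$\Pr(T \ge i) = \frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n - (i-2)}{m - (i-2)} \le \left(\frac{n}{m}\right)^{i-1} = a^{i-1},$$
where $a = n/m$ is the load factor.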

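11 Average Case Complexity

Now we can calculate the expected value of $T$, or the average-case complexity of INSERT.

$$
\begin{aligned}
E(T) &= \sum_{i=0}^{m-1} i \cdot \Pr(T = i) \\
&\le \sum_{i=1}^{\infty} i \cdot \Pr(T = i) \\
&= \sum_{i=1}^{\infty} i \left(\Pr(T \ge i) - \Pr(T \ge i+1)\right) \\
&= \sum_{i=1}^{\infty} i \Pr(T \ge i) - \sum_{i=1}^{\infty} i \Pr(T \ge i+1) \\
&= \sum_{i=1}^{\infty} i \Pr(T \ge i) - \sum_{i=1}^{\infty} \left((i+1) \Pr(T \ge i+1) - 1 \cdot \Pr(T \ge i+1)\right) \\
&= \sum_{i=1}^{\infty} i \Pr(T \ge i) - \sum_{i=2}^{\infty} i \Pr(T \ge i) + \sum_{i=2}^{\infty} \Pr(T \ge i) \\
&= \Pr(T \ge 1) + \sum_{i=2}^{\infty} i \Pr(T \ge i) - \sum_{i=2}^{\infty} i \Pr(T \ge i) + \sum_{i=2}^{\infty} \Pr(T \ge i)
\end{aligned}
$$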

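12 Average Case Complexity

$$
\begin{aligned}
E(T) &\le \Pr(T \ge 1) + \sum_{i=2}^{\infty} i \Pr(T \ge i) - \sum_{i=2}^{\infty} i \Pr(T \ge i) + \sum_{i=2}^{\infty} \Pr(T \ge i) \\
&= \Pr(T \ge 1) + \sum_{i=2}^{\infty} \Pr(T \ge i) \\
&= \sum_{i=1}^{\infty} \Pr(T \ge i) \\
&\le \sum_{i=1}^{\infty} a^{i-1} \qquad \text{(from previous slides)} \\
&= \sum_{i=0}^{\infty} a^{i} \\
&= \frac{1}{1 - a}.
\end{aligned}
$$

Note: $a < 1$ since $n < m$. The bigger the load factor $a = n/m$, the longer it takes to insert something. This is what we expect, intuitively.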
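The bound can be checked numerically for small tables. The sketch below evaluates $E(T) = \sum_{i \ge 1} \Pr(T \ge i)$ exactly, using the product formula for $\Pr(T \ge i)$ under the random-probe-sequence assumption, and compares it with $1/(1-a)$. The function names are illustrative, and the code assumes $n < m$.

```python
from fractions import Fraction


def prob_T_at_least(i, n, m):
    """Pr(T >= i) = product over j = 1..i-1 of (n - (j-1)) / (m - (j-1))."""
    prob = Fraction(1)
    for j in range(1, i):
        prob *= Fraction(n - (j - 1), m - (j - 1))
    return prob


def expected_probes(n, m):
    """E(T) as the sum of Pr(T >= i); the terms are zero once i > n + 1."""
    return sum(prob_T_at_least(i, n, m) for i in range(1, n + 2))


def bound(n, m):
    """The upper bound 1 / (1 - a) with load factor a = n / m."""
    return 1 / (1 - Fraction(n, m))


half_full = expected_probes(5, 10)     # exact value 11/6
nearly_full = expected_probes(9, 10)   # much larger, as the bound predicts
```

For $n = 5$, $m = 10$ the exact expectation is $11/6 \approx 1.83$, just under the bound $1/(1 - 0.5) = 2$.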

13 Remove, under Open Addressing

Under open addressing, there are two approaches for REMOVE:
1. Find an existing key to fill the hole. Tricky for linear probing, impossible for double hashing.
2. Mark the cell as deactivated. (Do not mark it as free!)

With the second approach, each cell has 3 possibilities:
Free: can insert here, can stop searching here.
Deactivated: can insert here, cannot stop searching here.
Stores a key.

Deactivated cells accumulate as junk and slow down all operations. Remove is problematic under open addressing.
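The three cell states can be sketched as follows. This is a minimal illustration that assumes linear probing, Python's built-in `hash` as the hash function, and two sentinel objects named `FREE` and `DEACTIVATED`; none of these names come from the lecture.

```python
FREE = object()         # never used: insert here, stop searching here
DEACTIVATED = object()  # key was removed: insert here, keep searching


class OpenAddressingSet:
    def __init__(self, m):
        self.slots = [FREE] * m

    def _probe(self, key):
        """Linear probe sequence starting at the key's home location."""
        m = len(self.slots)
        home = hash(key) % m
        for i in range(m):
            yield (home + i) % m

    def contains(self, key):
        for s in self._probe(key):
            if self.slots[s] is FREE:
                return False  # a free cell ends the search
            if self.slots[s] is not DEACTIVATED and self.slots[s] == key:
                return True
        return False

    def insert(self, key):
        if self.contains(key):
            return
        for s in self._probe(key):
            if self.slots[s] is FREE or self.slots[s] is DEACTIVATED:
                self.slots[s] = key  # deactivated cells can be reused
                return
        raise RuntimeError("table is full")

    def remove(self, key):
        for s in self._probe(key):
            if self.slots[s] is FREE:
                return
            if self.slots[s] is not DEACTIVATED and self.slots[s] == key:
                self.slots[s] = DEACTIVATED  # do not mark as free
                return
```

If `remove` marked the cell as free instead, a later search for a key stored further along the same probe sequence would stop too early and report it missing.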

14 Hash Functions: Division Method

Assume each key is an integer.
$h(k) = k \bmod m$
Simple, but susceptible to regular patterns in the keys, which lead to more collisions.

Q. How can we improve this?
A. Using a prime number for $m$, the length of the array, reduces this problem.
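A quick experiment shows the effect of regular patterns. The key set (multiples of 8) and the two table sizes are assumptions chosen for the illustration.

```python
from collections import Counter


def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


def max_bucket_load(keys, m):
    """Largest number of keys that share one hash value."""
    return max(Counter(division_hash(k, m) for k in keys).values())


keys = range(0, 400, 8)                      # 50 keys, all multiples of 8
load_power_of_two = max_bucket_load(keys, 16)
load_prime = max_bucket_load(keys, 17)
```

With $m = 16$ every key lands in slot 0 or slot 8, so one slot receives 25 keys. With the prime $m = 17$ the same keys spread over all 17 slots and no slot receives more than 3.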

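15 Hash Function: Multiplication Method

In theory: Pick a real constant $A$ with $0 < A < 1$.
$$h(k) = \lfloor m \cdot \mathrm{fraction}(k \cdot A) \rfloor$$

In practice: Assume each key is a $w$-bit natural number. Define $A$ by picking a $w$-bit constant $s$ with $0 < s < 2^w$ and letting $A = s/2^w$. Use $m = 2^p$, for some $0 \le p < w$.
$$
\begin{aligned}
h(k) &= \lfloor m \cdot \mathrm{fraction}(k \cdot A) \rfloor \\
&= \lfloor 2^p \cdot \mathrm{fraction}(k \cdot s / 2^w) \rfloor \\
&= \left\lfloor 2^p \cdot \frac{(k \cdot s) \bmod 2^w}{2^w} \right\rfloor \\
&= \left\lfloor \frac{(k \cdot s) \bmod 2^w}{2^{w-p}} \right\rfloor
\end{aligned}
$$

What does $(k \cdot s) \bmod 2^w$ return? The lower $w$ bits of $k \cdot s$.
What does dividing by $2^{w-p}$ do? It returns the upper $p$ bits of the lower $w$ bits of $k \cdot s$.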

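16 Multiplication Method in Practice

$$h(k) = \lfloor m \cdot \mathrm{fraction}(k \cdot A) \rfloor = \left\lfloor \frac{(k \cdot s) \bmod 2^w}{2^{w-p}} \right\rfloor$$

To compute the hash function $h(k)$ we can simply:
1. Obtain $k \cdot s$ as a $2w$-bit integer.
2. Retain the lower $w$ bits of $k \cdot s$.
3. Retain the upper $p$ bits of the result of step 2.

Summary. $h(k) = ((k \cdot s) \bmod 2^w) \gg (w - p)$, where $\gg$ is the right-shift operator.

Want $A$ to be irrational: often use the golden ratio for $A$ (in practice its fractional part, $(\sqrt{5} - 1)/2 \approx 0.618$, since $0 < A < 1$) and work backwards to define $s$ as the closest $w$-bit approximation.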
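The three steps translate directly into bit operations. In this sketch the parameter names follow the slide ($k$, $s$, $w$, $p$); the 32-bit constant `2654435769` is $\lfloor 2^{32} \cdot (\sqrt{5} - 1)/2 \rfloor$, a commonly used choice for $s$ that is assumed here rather than taken from the lecture.

```python
def multiplication_hash(k, s, w, p):
    """h(k) = ((k * s) mod 2^w) >> (w - p), giving a value in 0 .. 2^p - 1."""
    product = k * s                          # step 1: up to 2w bits
    low_w_bits = product & ((1 << w) - 1)    # step 2: keep the lower w bits
    return low_w_bits >> (w - p)             # step 3: keep their upper p bits


GOLDEN_S_32 = 2654435769  # assumed constant: floor(2^32 * (sqrt(5) - 1) / 2)

# Table of size m = 2^10 = 1024 with 32-bit keys.
example = [multiplication_hash(k, GOLDEN_S_32, 32, 10) for k in range(5)]
```

Python integers are unbounded, so the mask in step 2 is what models the $w$-bit machine word; in C the same effect comes for free from unsigned overflow.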

17 Hash Function: Polynomial Hash

Used when each key consists of multiple machine words (e.g., a string). Pick a constant $a$, not equal to zero or one. If your key is internally the machine words $x_0, \ldots, x_{k-1}$:
$$h(\langle x_0, \ldots, x_{k-1} \rangle) = (x_0 a^{k-1} + x_1 a^{k-2} + \cdots + x_{k-2} a + x_{k-1}) \bmod m$$

Compute by:
c = 0   (some people use a non-zero constant here)
for i in 0..k-1:
    c = c * a + x_i
    c = c mod m

In practice people use xor instead of +.
Every bit contributes. Order contributes too.
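The loop above is Horner's rule, and a runnable version is short. The constants $a = 31$ and the moduli used below are arbitrary choices for the example.

```python
def polynomial_hash(words, a, m):
    """Evaluate (x_0 a^(k-1) + ... + x_(k-2) a + x_(k-1)) mod m by Horner's rule."""
    c = 0
    for x in words:
        c = (c * a + x) % m  # reducing at every step keeps c below m
    return c


forward = polynomial_hash([1, 2, 3], 31, 1000)   # 1*961 + 2*31 + 3 = 1026 -> 26
backward = polynomial_hash([3, 2, 1], 31, 1000)  # 3*961 + 2*31 + 1 = 2946 -> 946
text = polynomial_hash(b"abc", 31, 2**32)        # bytes iterate as integers
```

The two orderings of the same words hash to different values, which is the "order contributes too" property.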

18 Hash Function: FNV-1 (Fowler-Noll-Vo)

FNV-1 is a family of hash functions, one for each word size, e.g., there is one for 32 bits, one for 64 bits, etc. They work by chopping your key into 8-bit words (bytes).

32-bit FNV-1:
hash = FNV_offset_basis         # a fixed constant called the FNV offset basis
for i in 0..k-1:                # for each byte of data
    hash = hash × FNV_prime     # a fixed constant: an FNV prime number (multiplication is modulo $2^{32}$)
    hash = hash XOR byte_i

Do your own hash mod m afterwards as needed (not part of FNV-1).

Links: Wikipedia article, Noll's FNV page. On the down side: Problems with Hash Tables.

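19 Hash Function: FNV-1a

FNV-1a is like FNV-1 but with xor before multiply.

32-bit FNV-1a:
hash = FNV_offset_basis         # the same FNV offset basis as in FNV-1
for i in 0..k-1:                # for each byte of data
    hash = hash XOR byte_i
    hash = hash × FNV_prime     # the same FNV prime as in FNV-1 (multiplication is modulo $2^{32}$)

Recommended over FNV-1 for being more random and uniform.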
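Both variants can be sketched together. The two constants below are the published 32-bit FNV parameters (offset basis 2166136261 and prime 16777619, i.e. `0x01000193`); they are filled in here from the public FNV specification, not from this copy of the slides, and the function names are illustrative.

```python
FNV32_OFFSET_BASIS = 2166136261  # 0x811C9DC5
FNV32_PRIME = 16777619           # 0x01000193
MASK32 = 0xFFFFFFFF              # keeps arithmetic modulo 2^32


def fnv1_32(data: bytes) -> int:
    """32-bit FNV-1: multiply first, then xor in the next byte."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h = (h * FNV32_PRIME) & MASK32
        h ^= byte
    return h


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: xor in the next byte first, then multiply."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & MASK32
    return h


def slot(key: bytes, m: int) -> int:
    """Reduce the hash to a table index; this step is not part of FNV itself."""
    return fnv1a_32(key) % m
```

The only difference between the two functions is the order of the two lines inside the loop.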

20 Problems with Hashing

When the set $S$ of keys is unknown, we can no longer assume a uniform distribution. Further, regular patterns can be found, making any deterministic hashing scheme vulnerable to malicious slowdowns.

Links: ocert advisory # , LWN article

Q. What might be a solution?
A. Create a family of hash functions and select one at random. This is called universal hashing.

21 Universal Hashing

Definition. A family $H$ of hash functions is universal iff: for any two keys $j$ and $k$ with $j \ne k$, and a table of size $m$, at most $|H|/m$ functions in $H$ satisfy $h(j) = h(k)$.

I.e., if we randomly pick $h$ from $H$ with uniform probability, then $\Pr(h(j) = h(k)) \le 1/m$. Why? At most $|H|/m$ of the $|H|$ equally likely choices collide on $j$ and $k$.

We can think of this as being "equivalent" to a hash function that maps keys to hash codes randomly.

Note. Given a set of keys, we randomly select a hash function $h$ from $H$ and use this function for every key in our set.

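22 Universal Hashing: Expected Number of Collisions

Q. What would you hope the expected number of collisions is?
A. With $n$ keys hashed to a table of size $m$ and nicely spread out: $O(n/m)$.

Proof. Let $S$ be a set of $n$ keys. Let $j$ be a key not in $S$. Randomly pick $h$ from $H$.

Q. How many keys in $S$ does $j$ collide with?
A. Let random variable $C$ be the number of such collisions. For each $k \in S$, let indicator random variable $X_k$ be 1 when $j$ collides with $k$ (and 0 otherwise).
$$
\begin{aligned}
E(C) &= E\Big(\sum_{k \in S \setminus \{j\}} X_k\Big) \\
&= \sum_{k \in S \setminus \{j\}} E(X_k) \\
&= \sum_{k \in S \setminus \{j\}} \Pr(h(j) = h(k)) \\
&\le \sum_{k \in S \setminus \{j\}} 1/m \\
&= |S \setminus \{j\}| / m \le n/m
\end{aligned}
$$
Since $j \notin S$ here, $S \setminus \{j\} = S$ and the bound is exactly $n/m$; if $j$ were itself a member of $S$, the same argument would give $(n-1)/m < n/m$.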

23 A Universal Family

Find a prime $p$ large enough such that $m < p$ and every key $k$ (assume integer) satisfies $0 \le k < p$. Define:
$$f_{a,b}(k) = (a \cdot k + b) \bmod p$$
$$h_{a,b}(k) = f_{a,b}(k) \bmod m = ((a \cdot k + b) \bmod p) \bmod m$$

Universal Family: $H = \{h_{a,b} : (0 < a < p) \wedge (0 \le b < p)\}$

Randomly pick $a$ from 1 to $p - 1$, and pick $b$ from 0 to $p - 1$.

Q. How many choices are there for $a$ and $b$?
A. $(p - 1) \cdot p$ choices.
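Picking a member of this family at random takes only a few lines. In the sketch below the function name `make_universal_hash` and the optional `rng` argument are assumptions for the example; the caller is responsible for supplying a prime $p$ larger than $m$ and every key.

```python
import random


def make_universal_hash(p, m, rng=random):
    """Pick h_{a,b} uniformly from H and return it together with (a, b)."""
    a = rng.randrange(1, p)  # 0 < a < p
    b = rng.randrange(0, p)  # 0 <= b < p

    def h(k):
        return ((a * k + b) % p) % m

    return h, (a, b)


# One hash function is chosen once and then used for every key in the set.
h, (a, b) = make_universal_hash(p=101, m=10, rng=random.Random(0))
codes = [h(k) for k in range(101)]
```

Passing a seeded `random.Random` makes the choice reproducible for testing; a real table would draw fresh randomness so that an adversary cannot predict $(a, b)$.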

24 Proving H is a universal family

In order to prove that $H$ is a universal family, we need to show: for keys $k, j$ such that $k \ne j$, $\Pr(h(j) = h(k)) \le 1/m$. I.e.,
$$\Pr(h(j) = h(k)) = \frac{\text{number of } (a, b) \text{ pairs that cause a collision}}{\text{number of } (a, b) \text{ pairs}} \le 1/m.$$

Q. How many $(a, b)$ pairs are there in total?
A. We said $p(p - 1)$.

25 Overview

1. Show that $f_{a,b}$ has no collisions.
2. Show that for keys $j, k$ with $j \ne k$: for every $(r, s)$ with $0 \le r < p$, $0 \le s < p$, $r \ne s$, there exists a unique $(a, b)$ such that $f_{a,b}(j) = (aj + b) \bmod p = r$ and $f_{a,b}(k) = (ak + b) \bmod p = s$. This means there is a 1-1 correspondence between $(a, b)$ and $(r, s)$.
3. Because of this 1-1 correspondence, to show that the fraction of $(a, b)$ pairs that cause a collision $h_{a,b}(j) = h_{a,b}(k)$ for $j \ne k$ is at most $1/m$, we can count the number of $(r, s)$ pairs such that $r \ne s$ and $r \equiv s \pmod{m}$.

26 No Collisions from $f_{a,b}$

Recall: Prime $p$ is large enough such that $m < p$ and every key $k$ (assume integer) satisfies $0 \le k < p$.
$$f_{a,b}(k) = (a \cdot k + b) \bmod p$$
$$h_{a,b}(k) = f_{a,b}(k) \bmod m = ((a \cdot k + b) \bmod p) \bmod m$$

Claim. Let $j$ and $k$ be different keys. Then $f_{a,b}(j) \ne f_{a,b}(k)$.

Proof. Assume otherwise. Without loss of generality, assume $k < j$. Suppose $f_{a,b}(j) = f_{a,b}(k)$. Then:
$(a \cdot j + b) - (a \cdot k + b)$ is a multiple of $p$
$\Rightarrow a \cdot (j - k)$ is a multiple of $p$
$\Rightarrow a$ or $j - k$ is a multiple of $p$, because $p$ is prime.
But $0 < a < p$ and $0 < j - k < p$, so neither can be a multiple of $p$. Contradiction.

27 A One-One Correspondence

Claim. Given keys $j, k$ with $j \ne k$: for every $(r, s)$ with $0 \le r < p$, $0 \le s < p$, $r \ne s$, there exists a unique $(a, b)$ such that $f_{a,b}(j) = (aj + b) \bmod p = r$ and $f_{a,b}(k) = (ak + b) \bmod p = s$.

In other words, there is a one-one correspondence between $(a, b)$'s and $(r, s)$'s.

Proof. Left as an exercise - or see Ch

Now, we count how many $(a, b)$'s cause $h_{a,b}(j) = h_{a,b}(k)$ (collisions). Henceforth we will count how many $(r, s)$'s cause $r \bmod m = s \bmod m$ but $r \ne s$.
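One way to see the correspondence is to solve the two congruences for $(a, b)$. Subtracting them gives $a(j - k) \equiv r - s \pmod{p}$, and since $p$ is prime, $j - k$ has an inverse modulo $p$. The sketch below is one possible route to the exercise, not necessarily the intended proof; `solve_ab` is a hypothetical helper name, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def solve_ab(j, k, r, s, p):
    """Return the (a, b) with (a*j + b) mod p == r and (a*k + b) mod p == s."""
    inverse = pow((j - k) % p, -1, p)  # modular inverse of j - k
    a = ((r - s) * inverse) % p
    b = (r - a * j) % p
    return a, b


# Exhaustive check for one small case: p = 7, keys j = 2 and k = 5.
p, j, k = 7, 2, 5
pairs = {
    (r, s): solve_ab(j, k, r, s, p)
    for r in range(p)
    for s in range(p)
    if r != s
}
```

Every $(r, s)$ with $r \ne s$ yields a valid pair with $0 < a < p$, and the 42 results are all distinct, matching the $p(p - 1) = 42$ members of $H$.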

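28 How Many $(a, b)$'s Cause $h_{a,b}$ Collisions

We want to prove (given keys $j, k$ with $j \ne k$, and random $a, b$) that the number of collisions satisfies:
$$|\{(a, b) : h_{a,b}(j) = h_{a,b}(k)\}| \le p(p - 1)/m = |H|/m$$
In other words, $H$ is a universal family of hash functions.

$$
\begin{aligned}
|\{(a, b) : h_{a,b}(j) = h_{a,b}(k)\}| &= |\{(a, b) : f_{a,b}(j) \bmod m = f_{a,b}(k) \bmod m\}| \\
&= |\{(r, s) : r \ne s \wedge r \bmod m = s \bmod m\}| \\
&= \sum_{r=0}^{p-1} |\{s : r \ne s \wedge r \bmod m = s \bmod m\}| \\
&= \sum_{r=0}^{p-1} \left(|\{s : r \bmod m = s \bmod m\}| - 1\right) \\
&\le \sum_{r=0}^{p-1} \left(\lceil p/m \rceil - 1\right)
\end{aligned}
$$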

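29 How Many $(a, b)$'s Cause $h_{a,b}$ Collisions

We have the bound $\sum_{r=0}^{p-1} \left(\lceil p/m \rceil - 1\right)$.

Q. What is the largest value that $\lceil p/m \rceil$ can be?
A. Notice that $\lceil p/m \rceil \in \{p/m, (p + 1)/m, \ldots, (p + m - 1)/m\}$. So:
$$
\begin{aligned}
\sum_{r=0}^{p-1} \left(\lceil p/m \rceil - 1\right) &\le \sum_{r=0}^{p-1} \left(\frac{p + m - 1}{m} - \frac{m}{m}\right) \\
&= \sum_{r=0}^{p-1} \frac{p - 1}{m} \\
&= p(p - 1)/m \\
&= \frac{|H|}{m}
\end{aligned}
$$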
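For small parameters the bound can be confirmed by brute force. The sketch below counts the colliding $(a, b)$ pairs directly; the values $p = 11$ and $m = 4$ are arbitrary choices for the check.

```python
def count_colliding_pairs(j, k, p, m):
    """Count the (a, b) with 0 < a < p, 0 <= b < p and h_{a,b}(j) == h_{a,b}(k)."""
    return sum(
        1
        for a in range(1, p)
        for b in range(p)
        if ((a * j + b) % p) % m == ((a * k + b) % p) % m
    )


p, m = 11, 4
counts = {
    (j, k): count_colliding_pairs(j, k, p, m)
    for j in range(p)
    for k in range(j + 1, p)
}
worst = max(counts.values())
```

For every pair of distinct keys the count stays at or below $p(p - 1)/m = 27.5$, as the proof guarantees.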

30 Rehashing

We need to enlarge the array when enough keys are inserted. Think dynamic array...

Changing the table size $m$ $\Rightarrow$ changing the hash function $\Rightarrow$ move every key!

1. Choose a new $m$, approximately twice the old one.
2. Allocate a new array.
3. For every key: re-compute its hash value and store it in the new array.

We want twice the array size to get the good amortized cost. Can you determine the amortized cost?

See also Problems with Hash Tables again. It seems many implementations do it wrong: they add rather than multiply.
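The effect of doubling can be observed by counting how many keys are moved during rehashing. This sketch assumes linear probing, Python's built-in `hash`, an initial size of 4, and a policy of doubling whenever the table would become more than half full; all of these are choices made for the illustration.

```python
class GrowingTable:
    def __init__(self, m=4):
        self.slots = [None] * m
        self.n = 0        # number of keys stored
        self.moves = 0    # keys re-inserted during all rehashes so far

    @staticmethod
    def _place(slots, key):
        """Linear-probing insert into slots; return False if key is present."""
        m = len(slots)
        i = hash(key) % m
        while slots[i] is not None:
            if slots[i] == key:
                return False
            i = (i + 1) % m
        slots[i] = key
        return True

    def _rehash(self, new_m):
        """Allocate a new array and move every key into it."""
        new_slots = [None] * new_m
        for key in self.slots:
            if key is not None:
                self._place(new_slots, key)
                self.moves += 1
        self.slots = new_slots

    def insert(self, key):
        if 2 * (self.n + 1) > len(self.slots):  # would exceed load factor 1/2
            self._rehash(2 * len(self.slots))   # multiply, do not add
        if self._place(self.slots, key):
            self.n += 1


table = GrowingTable()
for key in range(1000):
    table.insert(key)
```

After 1000 insertions the rehashes have moved 2 + 4 + ... + 512 = 1022 keys in total, which is fewer than two moves per insertion. Growing by a fixed additive amount instead would make the total number of moves quadratic in the number of insertions.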
