Introduction to Hash Tables

Hash Functions

A hash table represents a simple but efficient way of storing, finding, and removing elements. In general, a hash table is represented by an array of cells. In its simplest form, each cell is either empty, contains a reference to a stored element, or contains a reference to a list of stored elements. The size n of a hash table equals the number of cells in the array.

To insert an element x into a hash table of size n, one requires a hash function h that takes as input x and returns a (usually large) integer h(x). The value h(x) mod n then provides the index of a table cell where x may be placed. If no element is currently stored in this cell, then x may be placed there. Otherwise, x may have to be relocated, depending on whether or not the table allows multiple elements to be stored in a single cell.

In practice, h evaluates only a portion of x, called its key, that uniquely identifies x. For example, an employee record may consist of multiple kilobytes of data, including a string s of ten digits that serves as the employee's unique identifier. Then h(s) may be used to insert the record into a hash table.

Example 1. Given a hash table of size 10, let the set of elements to be hashed be the set of 5-digit strings. Define h(x) to be the integer value of the string x. For example, h(00020) = 20, while h(00000) = 0. Use h to hash the elements 54371, 51323, 16170, 64199, 44344, 96787, 19898, 00002, 55435 to cells of the table.

Example 1 Solution.
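The insertion rule described above can be sketched in Python; the element set and hash function below are those of Example 1, where h(x) is the integer value of the string and the cell index is h(x) mod n:

```python
# A minimal sketch of hashing the Example 1 elements into a table of
# size 10, where h(x) is the integer value of the 5-digit string x.

n = 10
table = [None] * n

for s in ["54371", "51323", "16170", "64199", "44344",
          "96787", "19898", "00002", "55435"]:
    index = int(s) % n           # h(x) mod n
    if table[index] is None:     # place x only if the cell is empty
        table[index] = s
    else:
        print(f"collision at cell {index}: {s} vs {table[index]}")

print(table)
```

Running this shows that each of the nine elements lands in a distinct cell, so no relocation is needed here; collision handling is taken up later in these notes.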
Properties of a good hash function h

Efficient to compute: h(x) should require a constant number of steps per bit of x. This is essential, since every table operation involving x will require a computation of h(x).

Uniform: h should ideally map elements randomly and uniformly over the entire range of integers. This gives each element an equally likely chance of being inserted into any of the table cells.

Suppose element e is represented by an array of m bytes b_0, b_1, ..., b_{m-1}. Then a common method for defining a hash function h(e) is to consider the key polynomial

p(x) = b_0 x^{m-1} + b_1 x^{m-2} + ... + b_{m-1}.

Then we may define h(e) = p(a), where a is an appropriately chosen positive integer that helps h attain the uniformity property. For example, a = 37 has been shown to give adequate results.

Example 2. Provide the key polynomial in the case that e is the string "Hello!".

We now show that an element's key polynomial can be computed in O(m) steps using Horner's algorithm. Given polynomial p(x) = b_0 x^{m-1} + b_1 x^{m-2} + ... + b_{m-1}, Horner's algorithm begins by computing p_0(x) = b_0. Now assume p_k(x) has been computed for some k ≥ 0; then p_{k+1}(x) is computed via the equation p_{k+1}(x) = x p_k(x) + b_{k+1}, so that p_{m-1}(x) = p(x).
Example 3. Given the polynomial p(x) = 2x^3 - 3x^2 + 5x - 7, show the sequence of polynomials that are evaluated leading up to the evaluation of p(x) using Horner's algorithm. Show how the algorithm can be used to compute p(-2).

Horner's Algorithm

Input: coefficient array b[0 : m-1].
Input: value x.
Initialize sum: sum ← 0.
For each i from 0 to m-1:
    sum ← sum · x + b[i].
Return sum.
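The pseudocode above translates directly into Python. Applying it to the Example 3 polynomial p(x) = 2x^3 - 3x^2 + 5x - 7 (the signs here are one reasonable reading of the example):

```python
# Horner's algorithm: evaluate b[0]*x^(m-1) + b[1]*x^(m-2) + ... + b[m-1]
# using one multiply and one add per coefficient, i.e. O(m) steps.

def horner(b, x):
    total = 0
    for coeff in b:
        total = total * x + coeff   # sum <- sum * x + b[i]
    return total

# p(-2) = 2(-8) - 3(4) + 5(-2) - 7 = -45
print(horner([2, -3, 5, -7], -2))   # -45
```

Since a key polynomial is just such a polynomial with byte values as coefficients, `horner(bytes_of_e, 37)` computes the hash h(e) = p(37) in O(m) steps.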
Collision Handling

Given a hash table T of size n, a collision occurs in T whenever an element x is to be inserted into T, and h(x) mod n yields the index of a table cell that already contains an inserted element. The two most common methods for resolving a collision are separate chaining and open-address probing.

Separate Chaining

Separate chaining handles collisions by allowing multiple elements to be inserted at the same table cell. This is accomplished by letting the cell reference a list of all elements inserted at that cell. Separate chaining is a very simple and convenient way of handling collisions, and can be efficient provided the lists do not become too long.

The load factor of a hash table is defined as λ = m/n, where m is the number of hashed items and n is the table size.

Theorem 1. Let λ = m/n be the load factor of a hash table. Under the assumption that separate chaining is used to resolve collisions, the average length of a chain is λ.

Proof. Letting L_i, i = 1, 2, ..., n, denote the length of chain i, and noting that the chain lengths sum to the total number m of hashed items, we see that the average chain length is

(1/n)(L_1 + L_2 + ... + L_n) = m/n = λ,

and the result is proved.

From Theorem 1, and the facts that

1. successfully finding data in a list of length L takes L/2 steps on average, and
2. unsuccessfully finding data in a list of length L takes L steps on average,

it follows that the complexity of hashing depends directly on λ, and so λ should be kept a small constant, such as λ = 1.
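A minimal separate-chaining table can be sketched as follows; each cell holds a Python list playing the role of the chain, and the hash function is passed in as a parameter (the identity hash below matches Example 1):

```python
# A minimal separate-chaining hash table sketch.

class ChainedHashTable:
    def __init__(self, n):
        self.n = n
        self.cells = [[] for _ in range(n)]        # one chain per cell

    def insert(self, x, h):
        self.cells[h(x) % self.n].append(x)        # collisions extend the chain

    def find(self, x, h):
        return x in self.cells[h(x) % self.n]      # linear scan of one chain

    def load_factor(self):
        return sum(len(c) for c in self.cells) / self.n   # lambda = m/n

identity = lambda x: x
t = ChainedHashTable(10)
for v in [54371, 51323, 16170, 64199]:
    t.insert(v, h=identity)

print(t.find(51323, h=identity))   # True
print(t.load_factor())             # 0.4
```

By Theorem 1, a `find` here scans a chain of average length equal to `load_factor()`, which is why the load factor should be kept a small constant.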
Example 4. Assuming the data and hash function from Example 1, hash the elements into a table of size 10 using separate chaining. Assume additional elements 12345, 23423, 17654, 12343.

The following is stated without proof. A proof can be found in Mitzenmacher and Upfal's Probability and Computing.

Theorem 2. If n elements are uniformly hashed into a hash table of size n using separate chaining, then with probability approaching 1 as n approaches infinity, the longest chain of the table will be Θ(ln n / ln ln n).

It turns out one can do much better if, instead of hashing to only one bin, one hashes to, say, d ≥ 2 bins, and places the element in the bin of least size.

Theorem 3. Suppose that n elements are uniformly hashed into a table of size n using separate chaining in the following manner. For each element, d ≥ 2 index locations are determined via d hash functions, and the element is placed into the shortest of the corresponding d lists (with ties broken randomly). Then after all n elements have been hashed, with probability 1 − o(1/n), the longest list will have length at most ln ln n / ln d + O(1).
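The d-choice scheme of Theorem 3 can be sketched for d = 2 as follows. The two hash functions below are illustrative choices, not taken from the lecture; any two independent-looking hash functions would do:

```python
# A sketch of 2-choice hashing: each element is hashed by two hash
# functions and placed into the shorter of the two chains.

import random

n = 100
chains = [[] for _ in range(n)]

def h1(x):
    return (x * 2654435761) % n        # illustrative multiplicative hash

def h2(x):
    return (x * 40503 + 12345) % n     # a second, illustrative hash

random.seed(1)
for x in random.sample(range(10**6), n):   # hash n elements into n cells
    i, j = h1(x), h2(x)
    target = i if len(chains[i]) <= len(chains[j]) else j
    chains[target].append(x)           # choose the shorter chain

print(max(len(c) for c in chains))     # longest chain stays small
```

With a single hash function the longest chain would typically grow like ln n / ln ln n; with two choices it drops to roughly ln ln n / ln 2 + O(1), which is why this trick is known as "the power of two choices."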
Open-Address Probing Collision Resolution

The method of open-address probing allows only one inserted element per cell, and thus must find a new location for an element x for which I = h(x) mod n is the index of an already-occupied cell. In this case the values (I + f(i)) mod n, i = 1, 2, ..., are computed until one of them yields the index of an unoccupied cell. The following are some possibilities for the probing function f.

Linear probing: f(i) = ai + b is a linear function.

Quadratic probing: f(i) = ai^2 is a quadratic function.

Random probing: the next cell to be probed is selected randomly and uniformly using a pseudorandom number generator.

Double hashing: f(i) = i·h_2(x) for some second hash function h_2.

Example 5. Hash 1.34, 1.45, 2.56, 5.12, 5.34, 4.34 using hash function h(x) = ⌊x⌋ and i) linear probing with f(i) = i; ii) quadratic probing with f(i) = i^2 for collision resolution. Assume a table size of 13.
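Open-address insertion with a pluggable probing function can be sketched as follows, applied to the Example 5 data with h(x) = ⌊x⌋ and table size 13; pass f(i) = i for linear probing or f(i) = i² for quadratic probing:

```python
# A sketch of open-address insertion: probe (I + f(i)) mod n for
# i = 0, 1, 2, ... until an empty cell is found.

import math

def probe_insert(table, x, f):
    n = len(table)
    base = math.floor(x) % n                 # I = h(x) mod n, with h(x) = floor(x)
    i = 0
    while table[(base + f(i)) % n] is not None:
        i += 1                               # cell occupied: try the next probe
    table[(base + f(i)) % n] = x

table = [None] * 13
for x in [1.34, 1.45, 2.56, 5.12, 5.34, 4.34]:
    probe_insert(table, x, f=lambda i: i)    # linear probing, f(i) = i

print(table)
```

With linear probing, 1.45 collides with 1.34 at cell 1 and moves to cell 2, which in turn pushes 2.56 from cell 2 to cell 3; this clustering effect is one motivation for quadratic probing.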
Example 6. Assuming open addressing with a random probe function, determine the expected number of probes that are needed to hash m elements into a table of size n.
Theorem 4. If p is prime, then the first ⌈p/2⌉ perfect squares (mod p) are distinct. As a corollary, if a hash-table size is prime and the load factor is less than 1/2, then every collision can be successfully re-addressed via quadratic probing.

Proof. Let m_1 and m_2 be integers from {0, 1, ..., ⌈p/2⌉ − 1}, and suppose m_1^2 ≡ m_2^2 (mod p). Then by the definition of mod, we have p | (m_2^2 − m_1^2), which implies p | (m_2 − m_1)(m_2 + m_1), which implies either p | (m_2 − m_1) or p | (m_2 + m_1). In the first case we must have m_1 = m_2, since |m_2 − m_1| < p. The latter case is impossible when m_1 ≠ m_2, since then 0 < m_2 + m_1 < p, and so p cannot divide this sum. Therefore, if m_1 is distinct from m_2, then their squares must also be distinct modulo p.
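Theorem 4 is easy to check empirically; the following sketch verifies, for several primes p, that the squares of 0, 1, ..., ⌈p/2⌉ − 1 are pairwise distinct mod p:

```python
# Empirical check of Theorem 4: for prime p, the first ceil(p/2)
# perfect squares are distinct mod p.

def distinct_squares(p):
    half = (p + 1) // 2                       # ceil(p/2)
    squares = {(m * m) % p for m in range(half)}
    return len(squares) == half               # distinct iff no square repeats

print(all(distinct_squares(p) for p in [2, 3, 5, 7, 11, 13, 10007]))  # True
```

This is why quadratic probing with a prime table size and load factor below 1/2 always finds an empty cell: the first ⌈p/2⌉ probe offsets hit ⌈p/2⌉ distinct cells, and fewer than half the cells are occupied.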
Universal Hashing

Even for a good hash function h, it is possible for an adversary to choose a set of elements that causes h to perform badly by producing several collisions. Universal hashing solves this problem by instead providing an entire family of hash functions, from which one function is randomly selected to perform the hashing. It is assumed that the adversary does not know which hash function from the family will be selected. We call a collection H_n of hash functions universal if, for each pair of distinct elements k and l, the number of hash functions h in H_n for which h(k) = h(l) is at most |H_n|/n, where n is the range of each hash function. In other words, for a randomly selected hash function, the chance of a collision between k and l is at most 1/n. It is not hard to prove the existence of universal families of hash functions, but this goes beyond the scope of this lecture. Moreover, it can be proven that, for a universal hashing family H_n, any sequence of m insert, find, and remove operations on a table of size n takes expected time Θ(m).

Perfect Hashing

Perfect hashing is the notion of designing a hash function and table so that find operations do not incur any collisions when retrieving an element. Perfect hashing can be achieved when the data to be stored in the table remains fixed. Perfect hashing schemes can be built using universal hashing along with a dual-table approach: the first table contains references to secondary tables, each with its own hash function, and the hash functions are chosen carefully so that the second hash yields zero collisions. Again, the details go beyond the scope of this lecture.

Secure Hash Algorithm (SHA)

A secure hash algorithm is one that is used for mapping a sequence of characters (such as a password) into a pseudorandom word whose length ranges anywhere from 128 to 512 bits, depending on the algorithm.
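One classic universal family, due to Carter and Wegman (not described in the lecture), is h_{a,b}(k) = ((a·k + b) mod p) mod n, where p is a prime exceeding every key, a is drawn uniformly from {1, ..., p−1}, and b from {0, ..., p−1}. Selecting (a, b) at random selects one function from the family:

```python
# A sketch of drawing a random hash function from the Carter-Wegman
# universal family h_{a,b}(k) = ((a*k + b) mod p) mod n.

import random

p = 10**9 + 7                       # a prime larger than any key we hash

def random_universal_hash(n):
    a = random.randrange(1, p)      # a in {1, ..., p-1}
    b = random.randrange(0, p)      # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % n

random.seed(0)
h = random_universal_hash(10)
print([h(k) for k in [54371, 51323, 16170]])   # three indices in range(10)
```

Since the adversary cannot predict which (a, b) pair was drawn, any fixed pair of distinct keys collides with probability at most about 1/n, matching the definition above.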
SHAs undergo rigorous testing by the National Institute of Standards and Technology. One desirable property of a SHA is that it produces uncorrelated output words for two input words, even if the input words differ by only a single character. This makes it very difficult for an attacker who possesses the hashed word w (obtained via a security breach) to find a preimage that hashes to w.
Exercises.

1. Given elements 4371, 1323, 6173, 4199, 4344, 9679, 1989, hash function h(x) = x, and table size 10, hash the elements using i) separate chaining, ii) linear probing with f(i) = i, and iii) quadratic probing with f(i) = i^2.

2. Given hash function h(x) = ⌊x⌋ and a hash table of size 13, use quadratic probing with f(i) = i^2 to hash the numbers 2.12, 2.31, 6.21, 2.99, 2.56, 11.94, 11.00. Draw the resulting hash table.

3. Suppose we are using a random hash function that hashes m elements into a table of size n. What is the expected number of collisions? In other words, what is the expected size of the set {(x_i, x_j) : i ≠ j and h(x_i) = h(x_j)}? Hint: define appropriate indicator random variables, and use the fact that the expectation of a sum equals the sum of the expectations.

4. When using separate chaining to handle collisions within a hash table T, suppose we insist on representing each chain as a sorted array, and that a linear-time merge procedure is performed whenever a newly hashed element is inserted into the chain. If T has load factor λ, then provide big-O expressions for the average time it will take to i) successfully find an element in T, ii) unsuccessfully find an element in T, and iii) insert an element into T. Explain.

5. Suppose we are storing a set of m keys in a hash table of size n. Show that if the keys are drawn from a universe whose size exceeds mn, then there is a subset of m keys that all hash to the same slot, so that the worst-case searching time for separate-chaining hashing is Θ(m).

6. Suppose we want to determine if the string s = s_1 s_2 ⋯ s_k is a substring of the much larger string a_1 a_2 ⋯ a_n. One approach is to compute h(s) with some hash function h. Then, for each i = 1, 2, ..., n − k + 1, compute h(a_i a_{i+1} ⋯ a_{i+k−1}). If this value is identical to h(s), then compare this substring with s, and return true if the strings are identical. Otherwise, proceed to the next substring.
Argue that the running time of this algorithm is O(k(n − k)). Prove that, if we assume the hash function is h(x_0 x_1 ⋯ x_l) = Σ_{i=0}^{l} x_{l−i} · 37^i, then the running time can be reduced to O(n). Hint: argue that h(a_{i+1} a_{i+2} ⋯ a_{i+k}) can be computed in Θ(1) steps assuming h(a_i a_{i+1} ⋯ a_{i+k−1}) has been computed.

7. Given two lists of integers of sizes m and n respectively, describe an algorithm that runs in time O(m + n) and computes the intersection of the lists. Argue that your algorithm is correct and has the desired running time.
Hints and Answers.

1. i) Separate chaining: cell 1: 4371; cell 3: 1323, 6173; cell 4: 4344; cell 9: 4199, 9679, 1989.

Linear probing (cell: element):
0: 9679, 1: 4371, 2: 1989, 3: 1323, 4: 6173, 5: 4344, 9: 4199 (cells 6-8 empty).

Quadratic probing (cell: element):
0: 9679, 1: 4371, 3: 1323, 4: 6173, 5: 4344, 8: 1989, 9: 4199 (cells 2, 6, 7 empty).

2. Quadratic probing (cell: element):
2: 2.12, 3: 2.31, 5: 2.56, 6: 6.21, 7: 11.00, 11: 2.99, 12: 11.94 (remaining cells empty).

3. Let I_ij be an indicator random variable that equals 1 if x_i and x_j hash to the same cell, where 1 ≤ i < j ≤ m. Then E[I_ij] equals the probability that x_i and x_j hash to the same cell, which equals 1/n (why?). Since there are m(m − 1)/2 different indicator variables, it follows that the expected size of {(x_i, x_j) : i ≠ j and h(x_i) = h(x_j)} equals m(m − 1)/(2n).

4. Successful and unsuccessful searches: O(log L), where L is the chain size, since binary search can be used. Insertions: O(log L) by using a binary heap at each cell to dynamically sort the incoming elements.

5. Use the pigeonhole principle.

6. Letting b_i = a_{i+1} a_{i+2} ⋯ a_{i+k}, show that h(b_{i+1}) = (h(b_i) − a_{i+1} · 37^{k−1}) · 37 + a_{i+k+1}. In other words, the hash of b_{i+1} can be obtained from the hash of b_i in a constant number of operations.

7. Hash the first list into a table T. This takes O(m) steps. Then for each element a of the second list, hash a to check whether a ∈ T, and add a to the output list if so. This requires an additional O(n) steps, for a total of O(m + n) steps.
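The rolling-hash identity from hint 6 can be verified directly; the sketch below uses h(c_1 ⋯ c_k) = c_1·37^{k−1} + ⋯ + c_k on character codes, with an arbitrary sample string and window size:

```python
# Verifying the rolling-hash identity: the hash of each window of
# length k follows from the previous window's hash in O(1) steps.

def h(s):
    total = 0
    for c in s:                      # Horner evaluation at x = 37
        total = total * 37 + ord(c)
    return total

text, k = "hashing tables", 5        # sample text and window size
prev = h(text[0:k])                  # hash of the first window, from scratch
for i in range(1, len(text) - k + 1):
    # slide the window: drop text[i-1], append text[i+k-1]
    prev = (prev - ord(text[i - 1]) * 37 ** (k - 1)) * 37 + ord(text[i + k - 1])
    assert prev == h(text[i:i + k])  # matches a from-scratch computation
print("rolling hash verified")
```

This constant-time update is exactly what reduces the substring search of exercise 6 from O(k(n − k)) to O(n).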