Electrical & Computer Engineering, University of Waterloo, Canada
February 6, 2007
Lecture 9: Hash tables

Recall that a hash table consists of:
- m slots into which we are placing items;
- a map h : K → [0, m − 1] from key values to slots.

We put n keys k_1, k_2, ..., k_n into locations h(k_1), h(k_2), ..., h(k_n). In the ideal situation we can then locate keys with O(1) operations.
Horner's Rule I

Horner's rule gives an efficient method for evaluating hash functions for sequences, e.g., strings. Consider a hash function of the form

    h(k) = k mod m

If we wish to hash a string such as "hello", we can interpret it as a long binary number: in ASCII, the bytes of "hello" are the integers [104, 101, 108, 108, 111], so the string corresponds to the number

    104·2^32 + 101·2^24 + 108·2^16 + 108·2^8 + 111

and we want to compute this number mod m.

Horner's Rule II

Horner's rule is a general trick for evaluating a polynomial. We write

    ax^3 + bx^2 + cx + d = (ax^2 + bx + c)x + d = ((ax + b)x + c)x + d

so that instead of computing x^3, x^2, ... we have only multiplications:

    t_1 = ax + b
    t_2 = t_1·x + c
    t_3 = t_2·x + d

Trivia: some early CPUs included an instruction opcode for applying Horner's rule. It may be making a comeback!
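The rewriting above can be sketched in a few lines (the function name is ours, not from the lecture); note that evaluating the byte values of "hello" at x = 256 reproduces exactly the big base-256 number described above.

```python
def horner(coeffs, x):
    """Evaluate a polynomial with coefficients given highest-degree
    first at x using Horner's rule: one multiplication and one
    addition per coefficient, no explicit powers of x."""
    acc = 0
    for c in coeffs:
        acc = acc * x + c
    return acc

# ax^3 + bx^2 + cx + d with (a, b, c, d) = (2, 3, 5, 7) at x = 10:
# ((2*10 + 3)*10 + 5)*10 + 7
print(horner([2, 3, 5, 7], 10))  # → 2357

# The ASCII bytes of "hello" evaluated at x = 256 give the integer
# value of the string interpreted as a base-256 number:
print(horner([104, 101, 108, 108, 111], 256) == int.from_bytes(b"hello", "big"))  # → True
```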
Horner's Rule III

To use Horner's rule for hashing: to compute (a·2^24 + b·2^16 + c·2^8 + d) mod m,

    t_1 = (a·2^8 + b) mod m
    t_2 = (t_1·2^8 + c) mod m
    t_3 = (t_2·2^8 + d) mod m

Note that multiplying by 2^k is simply a shift by k bits.

Why this works. In short, algebra. The integers Z form a ring under multiplication and addition. The hash function h(k) = k mod m can be interpreted as a homomorphism from the ring Z of integers to the ring Z/mZ of integers modulo m. Homomorphisms preserve structure in the following sense: if we write + for integer addition, and ⊕ for addition modulo m,

    h(a + b) = h(a) ⊕ h(b)

i.e., it doesn't matter whether we compute (a + b) mod m, or compute (a mod m) and (b mod m) and add with modular arithmetic: we get the same answer either way.

Horner's Rule IV

Similarly, if we write · for multiplication in Z, and ⊗ for multiplication in Z/mZ,

    h(a · b) = h(a) ⊗ h(b)

Horner's rule works precisely because h : Z → Z/mZ is a homomorphism:

    h(((a·2^8 + b)·2^8 + c)·2^8 + d)
      = ((h(a) ⊗ h(2^8) ⊕ h(b)) ⊗ h(2^8) ⊕ h(c)) ⊗ h(2^8) ⊕ h(d)

This can be optimized to use fewer applications of h, as above. In this form it is obvious why m = 2^8 is a horrible choice for a hash table size: 2^8 mod 2^8 = 0, so

    ((h(a) ⊗ 0 ⊕ h(b)) ⊗ 0 ⊕ h(c)) ⊗ 0 ⊕ h(d) = h(d)

i.e., the hash value depends only on the last byte. Similarly, if we used m = 2^16, we would have h(2^16) = 0, which would remove all but the last two bytes from the hash value computation. For background on algebra see, e.g., [1, 9, 7].
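A minimal sketch of the modular Horner hash (our own helper, base 256 = one byte per step), which also demonstrates the degenerate choice m = 2^8:

```python
def hash_string(s, m):
    """Hash an ASCII string by Horner's rule, reducing mod m at every
    step so intermediate values stay small."""
    h = 0
    for byte in s.encode("ascii"):
        h = (h * 256 + byte) % m
    return h

# Equivalent to interpreting the string as a base-256 number mod m:
print(hash_string("hello", 101) == int.from_bytes(b"hello", "big") % 101)  # → True

# Degenerate choice m = 2**8: since 256 mod 256 == 0, only the last
# byte survives -- every string ending in 'o' hashes to ord('o') = 111.
print(hash_string("hello", 256), hash_string("potato", 256))  # → 111 111
```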
Collisions

A collision occurs when two keys map to the same location in the hash table, i.e., there are distinct x, y ∈ K such that h(x) = h(y). Strategies for handling collisions:

1. Pick a value of m large enough so that collisions are rare, and can be easily dealt with, e.g., by maintaining a short overflow list of items whose hash slot is already occupied.
2. Pick the hash function h to avoid collisions.
3. Put another data structure in each hash table slot (a list, tree, or another hash table).
4. If a hash slot is full then try some other slots in some fixed sequence (open addressing).

Collision Strategy 1: Pick m big I

Let's see how big m must be for the probability of collisions to be small. Two cases:

- n > m: there must be a collision, by the pigeonhole principle.¹
- n ≤ m: there may or may not be a collision.

The "birthday problem": what is the probability that amongst n people, at least two share the same birthday? This is a hashing problem: people are keys, days of the year are slots, and h maps people to their birthdays. If n ≥ 23, then the probability of two people having the same birthday is > 1/2. (Counterintuitive, but true.) The birthday problem analysis is straightforward to adapt to hashing.
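The birthday numbers quoted above are easy to check numerically (a sketch; the function name is ours):

```python
def p_all_distinct(n, m):
    """Probability that n uniformly hashed keys occupy n distinct
    slots out of m -- the birthday problem with m 'days'."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

# With 23 people and 365 days, the chance of at least one shared
# birthday already exceeds 1/2; with 22 people it does not.
print(1 - p_all_distinct(23, 365))  # ≈ 0.507
print(1 - p_all_distinct(22, 365))  # ≈ 0.476
```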
Collision Strategy 1: Pick m big II

Suppose the hash function h and the distribution of keys cooperate to produce a uniform distribution of keys into hash table slots. Recall that with a uniform distribution, probability may be computed by simple counting:

    Pr(event E happens) = (# outcomes in which E happens) / (# outcomes)

First we count the number of hash functions without collisions: there are m choices of where to put the first key; m − 1 choices of where to put the second key; ...; m − n + 1 choices of where to put the n-th key. The number of hash functions with no collisions is

    m↓n = m·(m − 1)···(m − n + 1) = m!/(m − n)!

(Note².)

Collision Strategy 1: Pick m big III

Next we count the number of hash functions allowing collisions: there are m choices of where to put the first key; m choices of where to put the second key; ...; m choices of where to put the n-th key. The number of hash functions allowing collisions is m^n. The probability of a collision-free arrangement is

    p = m! / ((m − n)!·m^n)

Asymptotic estimate of ln p, assuming m ≫ n:

    ln p = −n²/(2m) + n/(2m) + O(n³/m²)     (1)

Here we have used Stirling's approximation and ln(m − n) = ln m − n/m − O(n²/m²). Two cases: if n² ∈ o(m) then ln p → 0; if n² ∈ ω(m) then ln p → −∞.
Collision Strategy 1: Pick m big IV

Recall that if ln p = x + ε then

    p = e^(x+ε) = e^x·e^ε = e^x·(1 + ε + O(ε²))     (Taylor series)
      = e^x·(1 + O(ε))  if ε ∈ o(1)

so the probability of a collision-free arrangement is

    p = e^(−n(n−1)/(2m))·(1 + O(n³/m²)) ≈ e^(−n(n−1)/(2m))

Collision Strategy 1: Pick m big V

Interpretation:

- If m ∈ ω(n²) there are no collisions (almost surely).
- If m ∈ o(n²) there is a collision (almost surely).

i.e., if we want a low probability of collisions, our hash table has to be quadratic (or more) in the number of items.

¹ If m + 1 pigeons are placed in m pigeonholes, there must be two pigeons in the same hole. (Replace "pigeons" with "keys", and "pigeonholes" with "hash slots".)
² The handy notation m↓n is called a "falling power" [8].
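The exact product and the approximation e^(−n(n−1)/(2m)) can be compared directly (a sketch; function names are ours):

```python
import math

def p_exact(n, m):
    """Exact probability of a collision-free placement: m!/(m-n)!/m^n."""
    p = 1.0
    for i in range(n):
        p *= (m - i) / m
    return p

def p_approx(n, m):
    """The asymptotic approximation exp(-n(n-1)/(2m))."""
    return math.exp(-n * (n - 1) / (2 * m))

# For m >> n the two agree closely; as m grows, p approaches 1.
n = 100
for m in [5_000, 50_000, 500_000]:
    print(m, round(p_exact(n, m), 4), round(p_approx(n, m), 4))
```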
Threshold functions

m = (1/2)n² is an example of a threshold function: below the threshold, the asymptotic probability of the event is 0; above the threshold, the asymptotic probability of the event is 1.

[Figure: probability of no collision (0 to 1) vs. hash table size m, over m = n, n^(2−ɛ), n², n^(2+ɛ), n³; the probability rises sharply from 0 to 1 around m ≈ n².]

Collision Strategy 1: pick m big

Picking m big is not an effective strategy for handling collisions. For n = 1000 elements, a table of (p, m) pairs shows how big m must be to achieve each desired probability p of no collisions. [Table values not preserved in the transcription.]
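Under the approximation p ≈ e^(−n(n−1)/(2m)) from the previous slides, the required table size for a desired collision-free probability p can be computed directly (a sketch; `m_needed` is our own helper):

```python
import math

def m_needed(n, p):
    """Smallest table size m for which the approximate probability
    exp(-n(n-1)/(2m)) of a collision-free placement is at least p."""
    return math.ceil(n * (n - 1) / (2 * math.log(1 / p)))

# For n = 1000 keys, even a 50% chance of no collision needs a table
# of ~720,000 slots, and higher confidence pushes m into the millions.
n = 1000
for p in [0.5, 0.9, 0.99]:
    print(p, m_needed(n, p))
```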
Collision Strategy 1: pick m big (continued)

The analysis of collisions in hashing demonstrates two pigeonhole principles. The simplest pigeonhole principle states that if you put m + 1 pigeons in m holes, there must be one hole with 2 pigeons. With respect to hash tables, the pigeonhole principle applies as follows: if a hash table with m slots is used to store m + 1 elements, there is a collision.

The probability-of-collision analysis of the previous slide demonstrates a probabilistic pigeonhole principle: if you put ω(√n) pigeons in n holes, there is a hole with 2 pigeons almost surely (i.e., with probability converging to 1 as n → ∞).

Collision Strategy 2: pick h carefully I

Can we pick our hash function h to avoid collisions? For example, if we use hash functions of the form

    h(k) = ⌊m·{kφ}⌋

(where {x} denotes the fractional part of x), we could try random values of φ ∈ (0, 1) until we found one that was collision-free. We have a probability of success

    p ≈ e^(−n(n−1)/(2m))·(1 + o(1))

Geometric distribution:

- Probability of success p, probability of failure 1 − p.
- Each trial independent, identically distributed.
- Probability that k tries are needed for success: (1 − p)^(k−1)·p.
- Mean: 1/p.
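Combining the success probability p ≈ e^(−n(n−1)/(2m)) with the geometric mean 1/p gives the expected number of random hash functions we would have to try (a sketch; the helper name is ours):

```python
import math

def expected_tries(n, m):
    """Expected number of independent random hash functions tried
    until a collision-free one is found, assuming success probability
    p = exp(-n(n-1)/(2m)) per trial (geometric distribution, mean 1/p)."""
    p = math.exp(-n * (n - 1) / (2 * m))
    return 1 / p

# For n = 1000: unless m is on the order of n^2, the expected number
# of trials is astronomical.
n = 1000
for m in [10_000, 100_000, 1_000_000]:
    print(m, expected_tries(n, m))
```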
Collision Strategy 2: pick h carefully II

For n = 1000, the expected number of failures before we find a collision-free hash table is (1 − p)/p = 1/p − 1. [Table of (m, expected failures) values not preserved in the transcription.] Picking hash functions randomly in this manner is unlikely to be practical. There are better strategies: see [6, 2].

Collision Strategy 3: secondary data structures I

By far the most common technique for handling collisions is to put a secondary data structure in each hash table slot:

- a linked list ("chaining");
- a binary search tree (BST);
- another hash table.

Let α = n/m be the load factor: the average number of items per hash table slot. Assuming uniform distribution of keys into slots:

- Linked lists require 1 + α steps (on average) to find a key.
- Suitable BSTs require 1 + max(c log α, 0) steps (on average).³
- Using secondary hash tables of size quadratic in the number of elements in the slot, one can achieve O(1) lookups on average, and require only Θ(n) space.
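A minimal chaining table, as in the linked-list variant above (a sketch using Python lists as the chains; not production code):

```python
class ChainedHashTable:
    """Collision strategy 3 with chaining: each of the m slots holds a
    list of the keys that hashed there."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _slot(self, key):
        return hash(key) % self.m

    def insert(self, key):
        bucket = self.slots[self._slot(key)]
        if key not in bucket:      # keep keys unique within a chain
            bucket.append(key)

    def contains(self, key):       # cost ~ 1 + (chain length)
        return key in self.slots[self._slot(key)]

t = ChainedHashTable(8)
for k in [1, 9, 17, 4]:            # 1, 9, 17 all collide in slot 1 when m = 8
    t.insert(k)
print(t.contains(9), t.contains(5))  # → True False
```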
Collision Strategy 3: secondary data structures II

Analysis of secondary hash tables: let N_i be a random variable indicating the number of items landing in slot i. Then

    E[N_i] = α
    Var[N_i] = n·(1/m)(1 − 1/m)     (Binomial(n, 1/m) variance)

The space required for the secondary hash tables is proportional to

    E[ Σ_{1≤i≤m} N_i² ] = Σ_{1≤i≤m} E[N_i²]
                        = m·(Var[N_i] + α²)
                        = m·( n(1/m)(1 − 1/m) + n²/m² )
                        = n²/m + n − n/m

Plus space Θ(m) for the primary hash table: Θ(m + n²/m + n) in total. Choosing m = Θ(n) yields linear space.

³ The max(·) deals with the possibility that α < 1, in which case log α < 0.

Collision Strategy 4: open addressing I

Open addressing is a family of techniques for resolving collisions that do not require secondary data structures. This has the advantage of not requiring any dynamic memory allocation. In the simplest scenario we have a function s : H → H that is ideally a permutation of the hash values, for example the linear probing function

    s(x) = (x + 1) mod m

When we attempt to insert a key k, we look in slots h(k), s(h(k)), s(s(h(k))), etc. until an empty slot is found. To find a key k, we look in slots h(k), s(h(k)), s(s(h(k))), etc. until either k or an empty slot is found.
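Linear probing can be sketched as follows (h(k) = k mod m is a stand-in hash; the sketch assumes the table never completely fills, otherwise the probe loop would not terminate):

```python
def lp_insert(table, key):
    """Open addressing with linear probing s(x) = (x + 1) mod m:
    scan forward from h(key) until an empty slot (None) is found."""
    m = len(table)
    i = key % m                      # h(k) = k mod m, for this sketch
    while table[i] is not None:
        i = (i + 1) % m              # s(x) = (x + 1) mod m
    table[i] = key

def lp_find(table, key):
    """Probe the same sequence; an empty slot means 'not present'."""
    m = len(table)
    i = key % m
    while table[i] is not None:
        if table[i] == key:
            return i
        i = (i + 1) % m
    return None

table = [None] * 8
for k in [3, 11, 19]:                # all hash to slot 3; probing spills
    lp_insert(table, k)              # them into slots 3, 4, 5
print(table)
print(lp_find(table, 19), lp_find(table, 5))  # → 5 None
```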
Collision Strategy 4: open addressing II

However, linear probing performs badly as the hash table becomes fuller: we tend to get clumps/clusters, i.e., long sequences h(k), s(h(k)), s(s(h(k))), ... where all the slots are occupied (see e.g. [10]). Performance can be good for not-very-full tables, e.g. α < 2/3; as α → 1, operations begin to take Θ(√n) time [5].

Quadratic probing offers less clumping: try slots h_0(k), h_1(k), ..., where

    h_i(k) = (h(k) + i²) mod m

and h(k) is an initial fixed hash function. If m is prime, the first ⌈m/2⌉ slots of the sequence h_i(k) are guaranteed distinct.

Double hashing uses two hash functions, h_1 and h_2:

    h_i(k) = (h_1(k) + i·h_2(k)) mod m

h_1(k) gives an initial slot to try; h_2(k) gives a stride (this reduces to linear probing when h_2(k) = 1).

Collision Strategy 4: open addressing III

Under favourable conditions, an open addressing scheme behaves like a geometric distribution when searching for an open slot: the probability of finding an empty slot is 1 − α, so the expected number of trials is 1/(1 − α). Note the catastrophe as α → 1.
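The three probe sequences can be generated side by side (a sketch; the choices of h_1 and h_2 below are simple stand-ins, not from the lecture):

```python
def probe_sequence(key, m, kind):
    """First m probe indices for linear, quadratic, or double hashing."""
    h1 = key % m                     # initial slot (stand-in hash)
    h2 = 1 + (key % (m - 1))         # nonzero stride for double hashing
    seq = []
    for i in range(m):
        if kind == "linear":
            seq.append((h1 + i) % m)
        elif kind == "quadratic":
            seq.append((h1 + i * i) % m)
        else:                        # double hashing
            seq.append((h1 + i * h2) % m)
    return seq

m = 7                                # prime table size
# Linear and double hashing (stride coprime to m) reach every slot;
# quadratic probing may revisit slots after the first ~m/2 probes.
print(probe_sequence(10, m, "linear"))
print(probe_sequence(10, m, "double"))
print(probe_sequence(10, m, "quadratic"))
```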
Summary of collision strategies

    Strategy                E[access time]          Space
    -----------------------------------------------------------
    Choose m big            O(1)                    Ω(n²)
    Linked list             1 + α                   O(n + m)
    Binary search tree      1 + max(c log α, 0)     O(n + m)
    Secondary hash tables   O(1)                    O(n + m)
    Open addressing         1/(1 − α)               O(m)

Open addressing can be quite effective if α ≪ 1, but fails catastrophically as α → 1.

If unexpectedly n ≫ m (e.g. we have far more data than we designed for), then α → ∞. For example, if m ∈ O(1) and n ∈ ω(1):

- Linked lists have O(n) accesses;
- BSTs have O(log n) accesses, and so offer a gentler failure mode.

If the hash function is badly nonuniform:

- Linked lists can be O(n);
- BSTs will have O(log n);
- Secondary hash tables may require O(n²) space.

To summarize: hash table + BST will give fast search times, and let you sleep at night.

To maintain O(1) access times as n → ∞, it is necessary to maintain m = Θ(n). This can be done by choosing an allowable interval α ∈ [c_1, c_2]; when α > c_2, resize the hash table to make α = c_1. So long as c_2 > c_1, this strategy adds O(1) amortized time per insertion, as in dynamic arrays.
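The resizing policy can be sketched with a chaining table (our own class names; the constants c_1 = 0.25 and c_2 = 0.75 are illustrative choices, not from the lecture):

```python
class ResizingTable:
    """Keep the load factor α = n/m inside [c1, c2]: when an insert
    pushes α past c2, rebuild with a larger m so α returns to c1.
    Each rehash costs O(n) but is paid for by the Θ(n) inserts since
    the previous resize, giving O(1) amortized time per insertion."""

    def __init__(self, c1=0.25, c2=0.75, m=8):
        self.c1, self.c2, self.m, self.n = c1, c2, m, 0
        self.slots = [[] for _ in range(m)]

    def insert(self, key):
        self.slots[hash(key) % self.m].append(key)
        self.n += 1
        if self.n / self.m > self.c2:
            self._resize()

    def _resize(self):
        old = self.slots
        self.m = max(8, int(self.n / self.c1))   # new m so that α = c1
        self.slots = [[] for _ in range(self.m)]
        for bucket in old:                        # rehash everything
            for key in bucket:
                self.slots[hash(key) % self.m].append(key)

t = ResizingTable()
for k in range(100):
    t.insert(k)
print(t.m, t.n / t.m <= 0.75)   # load factor never exceeds c2
```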
Applications of hashing I

Hashing is a ubiquitous concept, used not just for maintaining collections but also for:

- cryptography
- combinatorics
- data mining
- computational geometry
- databases
- router traffic analysis

An example: probabilistic counting.

Probabilistic Counting I

Problem: estimate the number of unique elements in a LARGE collection (e.g., a database, a data stream) without requiring much working space.

This is useful for query optimization in databases [11]: e.g., to evaluate the join A ⋈ B ⋈ C we can do either A ⋈ (B ⋈ C) or (A ⋈ B) ⋈ C; one of these might be very fast, the other very slow. Rough estimates of the sizes of B ⋈ C vs. A ⋈ B let us decide which strategy will be faster.
Probabilistic Counting II

A less serious (but more readily understood) example: Shakespeare's complete works contain N = 884,647 words (or so), of which n = 28,239 are unique (or so). Let w be the average word length, and let N_max ≥ n be a prior estimate on n. Problem: estimate n, the number of unique words used. Approaches:

1. Sorting: put all 884,647 words in a list and sort, then count. (Time O(Nw log N), space O(Nw).)
2. Trie: scan through the words and build a trie, with counters at each node; requires O(nw) space (neglecting the size of counters).
3. Super-LogLog probabilistic counting [3]: use 128 bytes of space, and obtain an estimate of the number of unique words with error 9.4%.

Probabilistic Counting III

Inputs: a multiset A of elements, possibly with many duplicates (e.g., Shakespeare's plays). Problem: estimate card(A), the number of unique elements in A (e.g., the number of distinct words Shakespeare used).

A simple starting idea: hash the objects into an m-element hash table. Instead of storing keys, just count the number of elements landing in each hash slot. Extreme cases to illustrate the principle:

- Elements of A are all different: we will get an even distribution in the hash table.
- Elements of A are all the same: we will get one hash table slot with all the elements!

The shape of the hash table distribution reflects the frequency of duplicates.
Probabilistic Counting: Linear Counting

Linear Counting [11]: compute hash values in the range [0, N_max); maintain a bitmap representing which slots of the hash table would be occupied, and estimate n from the sparsity of the hash table. This uses Θ(N_max) bits, i.e., on the order of card(A) bits. Room for improvement: the precise sparsity pattern doesn't matter, just the number of full vs. empty slots.

Probabilistic Counting IV

Probabilistic Counting [4]: compute hash values in the range [0, N_max). Instead of counting hash values directly, count the occurrence of hash values matching certain bit patterns:

    Pattern      Expected occurrences
    xxxxxxx1     2⁻¹ card(A)
    xxxxxx10     2⁻² card(A)
    xxxxx100     2⁻³ card(A)
    xxxx1000     2⁻⁴ card(A)
    ...          ...

Use these counts to estimate card(A). To improve accuracy, use m different hash functions. This uses Θ(m log N_max) storage, and delivers accuracy of O(m^(−1/2)).
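The pattern-counting idea can be sketched as follows. The multiplicative hash and the correction constant 0.77351 (from the Flajolet-Martin analysis) are illustrative stand-ins, not the lecture's choices; the key property is that duplicates cannot change the sketch.

```python
def trailing_zeros(x):
    """rho(x): number of trailing zero bits, i.e., which pattern
    ...1, ...10, ...100 the value x matches."""
    if x == 0:
        return 32
    n = 0
    while x & 1 == 0:
        x >>= 1
        n += 1
    return n

def fm_estimate(items):
    """Flajolet-Martin-style cardinality estimate: hash each item,
    record which trailing-bit patterns occurred (one bitmap), and read
    off the first pattern never seen."""
    seen = 0                              # bit r set iff some hash had rho = r
    for x in items:
        h = (x * 2654435761) % (2 ** 32)  # fixed multiplicative hash (stand-in)
        seen |= 1 << trailing_zeros(h)
    r = 0                                 # first pattern that never occurred
    while seen & (1 << r):
        r += 1
    return int(2 ** r / 0.77351)          # bias-corrected estimate

# Duplicates do not change the bitmap, so only distinct items count:
print(fm_estimate(list(range(1000)) * 50) == fm_estimate(range(1000)))  # → True
```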
Probabilistic Counting: Super-LogLog

Super-LogLog [3] requires Θ(log log N_max) bits. With 1.28 kB of memory it can estimate card(A) to within an accuracy of 2.5% for N_max ≈ 130 million.

Probabilistic counters: count to N using log log N bits: we need log N states, which can be encoded in log log N bits.

References I

[1] Stanley Burris and H. P. Sankappanavar. A Course in Universal Algebra. Springer-Verlag.

[2] Martin Dietzfelbinger, Anna Karlin, Kurt Mehlhorn, and Friedhelm Meyer auf der Heide. Dynamic perfect hashing: Upper and lower bounds. SIAM J. Comput., 23(4).

[3] Marianne Durand and Philippe Flajolet. Loglog counting of large cardinalities (extended abstract). In Giuseppe Di Battista and Uri Zwick, editors, ESA, volume 2832 of Lecture Notes in Computer Science. Springer.
References II

[4] Philippe Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2).

[5] Philippe Flajolet, Patricio V. Poblete, and Alfredo Viola. On the analysis of linear probing hashing. Algorithmica, 22(4).

[6] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3).

References III

[7] Joseph A. Gallian. Contemporary Abstract Algebra. D. C. Heath and Company, Toronto, 3rd edition.

[8] Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. Concrete Mathematics: A Foundation for Computer Science. Addison-Wesley, Reading, MA, USA, second edition.

[9] Saunders MacLane and Garrett Birkhoff. Algebra. Chelsea Publishing Co., New York, third edition.
References IV

[10] Robert Sedgewick and Philippe Flajolet. An Introduction to the Analysis of Algorithms. Addison-Wesley, Reading, MA.

[11] Kyu-Young Whang, Brad T. Vander-Zanden, and Howard M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Trans. Database Syst., 15(2).
More information1 Approximate Quantiles and Summaries
CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity
More informationCS483 Design and Analysis of Algorithms
CS483 Design and Analysis of Algorithms Lectures 2-3 Algorithms with Numbers Instructor: Fei Li lifei@cs.gmu.edu with subject: CS483 Office hours: STII, Room 443, Friday 4:00pm - 6:00pm or by appointments
More informationChapter 7 Randomization Algorithm Theory WS 2017/18 Fabian Kuhn
Chapter 7 Randomization Algorithm Theory WS 2017/18 Fabian Kuhn Randomization Randomized Algorithm: An algorithm that uses (or can use) random coin flips in order to make decisions We will see: randomization
More informationAsymptotic Analysis. Slides by Carl Kingsford. Jan. 27, AD Chapter 2
Asymptotic Analysis Slides by Carl Kingsford Jan. 27, 2014 AD Chapter 2 Independent Set Definition (Independent Set). Given a graph G = (V, E) an independent set is a set S V if no two nodes in S are joined
More information? 11.5 Perfect hashing. Exercises
11.5 Perfect hashing 77 Exercises 11.4-1 Consider inserting the keys 10; ; 31; 4; 15; 8; 17; 88; 59 into a hash table of length m 11 using open addressing with the auxiliary hash function h 0.k/ k. Illustrate
More informationCollision. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST
Collision Kuan-Yu Chen ( 陳冠宇 ) 2018/12/17 @ TR-212, NTUST Review Hash table is a data structure in which keys are mapped to array positions by a hash function When two or more keys map to the same memory
More informationLecture 4 Thursday Sep 11, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture 4 Thursday Sep 11, 2014 Prof. Jelani Nelson Scribe: Marco Gentili 1 Overview Today we re going to talk about: 1. linear probing (show with 5-wise independence)
More informationCPSC 467: Cryptography and Computer Security
CPSC 467: Cryptography and Computer Security Michael J. Fischer Lecture 16 October 30, 2017 CPSC 467, Lecture 16 1/52 Properties of Hash Functions Hash functions do not always look random Relations among
More informationCS369N: Beyond Worst-Case Analysis Lecture #6: Pseudorandom Data and Universal Hashing
CS369N: Beyond Worst-Case Analysis Lecture #6: Pseudorandom Data and Universal Hashing Tim Roughgarden April 4, 204 otivation: Linear Probing and Universal Hashing This lecture discusses a very neat paper
More information6.1 Occupancy Problem
15-859(M): Randomized Algorithms Lecturer: Anupam Gupta Topic: Occupancy Problems and Hashing Date: Sep 9 Scribe: Runting Shi 6.1 Occupancy Problem Bins and Balls Throw n balls into n bins at random. 1.
More informationINTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class
Dr. Thomas E. Hicks Data Abstractions Homework - Hashing -1 - INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University Data Set - SSN's from UTSA Class 467 13 3881 498 66 2055 450 27 3804 456 49 5261
More informationElectrical & Computer Engineering University of Waterloo Canada February 26, 2007
: Electrical & Computer Engineering University of Waterloo Canada February 26, 2007 We want to choose the best algorithm or data structure for the job. Need characterizations of resource use, e.g., time,
More informationLecture 2 September 4, 2014
CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 2 September 4, 2014 Scribe: David Liu 1 Overview In the last lecture we introduced the word RAM model and covered veb trees to solve the
More informationSome notes on streaming algorithms continued
U.C. Berkeley CS170: Algorithms Handout LN-11-9 Christos Papadimitriou & Luca Trevisan November 9, 016 Some notes on streaming algorithms continued Today we complete our quick review of streaming algorithms.
More informationLecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing
Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of
More informationAbstract Data Type (ADT) maintains a set of items, each with a key, subject to
Lecture Overview Dictionaries and Python Motivation Hash functions Chaining Simple uniform hashing Good hash functions Readings CLRS Chapter,, 3 Dictionary Problem Abstract Data Type (ADT) maintains a
More informationCSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13
CSCB63 Winter 2019 Week 11 Bloom Filters Anna Bretscher March 30, 2019 1 / 13 Today Bloom Filters Definition Expected Complexity Applications 2 / 13 Bloom Filters (Specification) A bloom filter is a probabilistic
More informationCSE525: Randomized Algorithms and Probabilistic Analysis April 2, Lecture 1
CSE525: Randomized Algorithms and Probabilistic Analysis April 2, 2013 Lecture 1 Lecturer: Anna Karlin Scribe: Sonya Alexandrova and Eric Lei 1 Introduction The main theme of this class is randomized algorithms.
More informationModule 9: Tries and String Matching
Module 9: Tries and String Matching CS 240 - Data Structures and Data Management Sajed Haque Veronika Irvine Taylor Smith Based on lecture notes by many previous cs240 instructors David R. Cheriton School
More informationGrade 11/12 Math Circles Fall Nov. 5 Recurrences, Part 2
1 Faculty of Mathematics Waterloo, Ontario Centre for Education in Mathematics and Computing Grade 11/12 Math Circles Fall 2014 - Nov. 5 Recurrences, Part 2 Running time of algorithms In computer science,
More informationMining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s
Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make
More informationDivide and Conquer. Maximum/minimum. Median finding. CS125 Lecture 4 Fall 2016
CS125 Lecture 4 Fall 2016 Divide and Conquer We have seen one general paradigm for finding algorithms: the greedy approach. We now consider another general paradigm, known as divide and conquer. We have
More informationCryptographic Hash Functions
Cryptographic Hash Functions Çetin Kaya Koç koc@ece.orst.edu Electrical & Computer Engineering Oregon State University Corvallis, Oregon 97331 Technical Report December 9, 2002 Version 1.5 1 1 Introduction
More informationCosc 412: Cryptography and complexity Lecture 7 (22/8/2018) Knapsacks and attacks
1 Cosc 412: Cryptography and complexity Lecture 7 (22/8/2018) Knapsacks and attacks Michael Albert michael.albert@cs.otago.ac.nz 2 This week Arithmetic Knapsack cryptosystems Attacks on knapsacks Some
More informationChapter 6 Randomization Algorithm Theory WS 2012/13 Fabian Kuhn
Chapter 6 Randomization Algorithm Theory WS 2012/13 Fabian Kuhn Randomization Randomized Algorithm: An algorithm that uses (or can use) random coin flips in order to make decisions We will see: randomization
More informationLecture 7: More Arithmetic and Fun With Primes
IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Advanced Course on Computational Complexity Lecture 7: More Arithmetic and Fun With Primes David Mix Barrington and Alexis Maciel July
More informationLecture 6: Introducing Complexity
COMP26120: Algorithms and Imperative Programming Lecture 6: Introducing Complexity Ian Pratt-Hartmann Room KB2.38: email: ipratt@cs.man.ac.uk 2015 16 You need this book: Make sure you use the up-to-date
More informationHashing. Data organization in main memory or disk
Hashing Data organization in main memory or disk sequential, binary trees, The location of a key depends on other keys => unnecessary key comparisons to find a key Question: find key with a single comparison
More informationCOMP251: Hashing. Jérôme Waldispühl School of Computer Science McGill University. Based on (Cormen et al., 2002)
COMP251: Hashing Jérôme Waldispühl School of Computer Science McGill University Based on (Cormen et al., 2002) Table S with n records x: Problem DefiniNon X Key[x] InformaNon or data associated with x
More informationLecture 8 HASHING!!!!!
Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)
More informationCSE 190, Great ideas in algorithms: Pairwise independent hash functions
CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required
More informationIntroduction to Randomized Algorithms III
Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November 2017 1 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability
More informationFinding Succinct. Ordered Minimal Perfect. Hash Functions. Steven S. Seiden 3 Daniel S. Hirschberg 3. September 22, Abstract
Finding Succinct Ordered Minimal Perfect Hash Functions Steven S. Seiden 3 Daniel S. Hirschberg 3 September 22, 1994 Abstract An ordered minimal perfect hash table is one in which no collisions occur among
More informationRandomized Algorithms, Spring 2014: Project 2
Randomized Algorithms, Spring 2014: Project 2 version 1 March 6, 2014 This project has both theoretical and practical aspects. The subproblems outlines a possible approach. If you follow the suggested
More informationOn the average-case complexity of Shellsort
Received: 16 February 2015 Revised: 24 November 2016 Accepted: 1 February 2017 DOI: 10.1002/rsa.20737 RESEARCH ARTICLE On the average-case complexity of Shellsort Paul Vitányi 1,2 1 CWI, Science Park 123,
More informationLecture 11: Hash Functions, Merkle-Damgaard, Random Oracle
CS 7880 Graduate Cryptography October 20, 2015 Lecture 11: Hash Functions, Merkle-Damgaard, Random Oracle Lecturer: Daniel Wichs Scribe: Tanay Mehta 1 Topics Covered Review Collision-Resistant Hash Functions
More informationData Structures and Algorithm. Xiaoqing Zheng
Data Structures and Algorithm Xiaoqing Zheng zhengxq@fudan.edu.cn MULTIPOP top[s] = 6 top[s] = 2 3 2 8 5 6 5 S MULTIPOP(S, x). while not STACK-EMPTY(S) and k 0 2. do POP(S) 3. k k MULTIPOP(S, 4) Analysis
More informationUNIFORM HASHING IN CONSTANT TIME AND OPTIMAL SPACE
UNIFORM HASHING IN CONSTANT TIME AND OPTIMAL SPACE ANNA PAGH AND RASMUS PAGH Abstract. Many algorithms and data structures employing hashing have been analyzed under the uniform hashing assumption, i.e.,
More informationLecture 1: Asymptotics, Recurrences, Elementary Sorting
Lecture 1: Asymptotics, Recurrences, Elementary Sorting Instructor: Outline 1 Introduction to Asymptotic Analysis Rate of growth of functions Comparing and bounding functions: O, Θ, Ω Specifying running
More informationdata structures and algorithms lecture 2
data structures and algorithms 2018 09 06 lecture 2 recall: insertion sort Algorithm insertionsort(a, n): for j := 2 to n do key := A[j] i := j 1 while i 1 and A[i] > key do A[i + 1] := A[i] i := i 1 A[i
More information
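As a concrete illustration of the Horner's-rule hashing described in this lecture, here is a short sketch (not from the original notes; the function name and the choice of Python are illustrative). It interprets a string's bytes as a base-2^8 number and reduces mod m after every step, so intermediate values never grow large:

```python
def horner_hash(key: str, m: int) -> int:
    """Hash a string by Horner's rule: treat its bytes as digits of a
    base-256 number and take the result mod m, reducing at each step."""
    h = 0
    for byte in key.encode("ascii"):
        h = (h * 2**8 + byte) % m  # shift left 8 bits, add next byte, reduce
    return h

# The lecture's warning in action: with m = 2**8, every h * 2**8 term
# vanishes mod m, so the hash depends only on the last byte.
assert horner_hash("hello", 2**8) == ord("o")

# With a modulus that is not a power of the base (e.g. a prime),
# every byte contributes, and the result matches reducing the full
# base-256 number directly -- the homomorphism property h(a*b + c).
assert horner_hash("hello", 101) == int.from_bytes(b"hello", "big") % 101
```

The two assertions mirror the algebra in the notes: because k → k mod m is a ring homomorphism, reducing after each multiply-and-add gives the same answer as reducing the whole number at the end, and m = 2^8 is a degenerate choice because 2^8 mod 2^8 = 0.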