Round 5: Hashing. Tommi Junttila. Aalto University School of Science Department of Computer Science


CS-A1140 Data Structures and Algorithms, Autumn 2017. Tommi Junttila (Aalto University).

Material
In the book Introduction to Algorithms, 3rd ed. (online via Aalto lib): the sections on hash tables (Chapter 11), and these slides (quadratic probing etc. are not in the book).
Similar material elsewhere:
  Section 3.4 in Algorithms, 4th ed.
  The hashing chapter in the OpenDSA book.
External links:
  MIT OCW video on hashing with chaining.
  MIT OCW video on open addressing and cryptographic hashing.

In many applications, dictionaries with INSERT, SEARCH, and DELETE operations are enough.
With hashing we can perform these in O(1) time on average.
  The worst-case time requirement can be Θ(n), but with good design this is extremely improbable.
  Finding the smallest and largest elements takes Θ(n) time in the worst case.
Implementations:
  C++11 standard library unordered_set and unordered_map
  Java HashSet and HashMap
  Scala HashSet and HashMap
  ...

Intro: small key universe, bit sets and direct-access tables

Bit sets
Let us first assume that the number n of all possible keys is small, and that the keys are, or can be easily mapped to, the integers $0,\ldots,n-1$.
A set data structure on these keys is easy to implement as a bit set (aka bit array or bit vector):
  Allocate an array $a = a_0 a_1 \ldots a_{m-1}$ of $m = \lceil n/8 \rceil$ bytes.
  A key $k \in \{0,\ldots,n-1\}$ belongs to the set if and only if bit $k \bmod 8$ is 1 in the byte $a_{\lfloor k/8 \rfloor}$.
Example: The array stores a subset of the keys {0,1,...,999}. It includes the keys 1, 4, 806, and 999 but not, for instance, the key 805. Key 806 = 100 · 8 + 6, for example, is bit 6 of byte $a_{100}$.
(Figure: the byte array a with its bit and byte indices.)

It is easy to implement the operations INSERT, SEARCH, and DELETE so that they run in constant time.
Bit sets are a very memory-efficient way of representing dense sets (i.e., sets that include a large fraction of the possible keys).
For sparse sets, space is wasted and, for instance, listing all the keys in the set becomes expensive because every bit must be scanned.
Some implementations:
  BitSet in Scala
  bitset in the C++ standard library
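The bit-level bookkeeping described above can be sketched as follows. This is only an illustrative sketch: the class name `ByteBitSet` and its method names are assumptions made here, not code from the slides or from any library.

```scala
// A minimal bit set over the keys 0..n-1, stored in an array of bytes.
// Hypothetical helper class for illustration only.
final class ByteBitSet(n: Int) {
  require(n >= 0)
  // ceil(n / 8) bytes, all bits initially 0 (empty set)
  private val bytes = new Array[Byte]((n + 7) / 8)

  private def check(k: Int): Unit =
    require(0 <= k && k < n, s"key $k is outside 0..${n - 1}")

  // Key k lives in bit (k mod 8) of byte floor(k / 8).
  def contains(k: Int): Boolean = {
    check(k)
    (bytes(k / 8) & (1 << (k % 8))) != 0
  }

  def insert(k: Int): Unit = {
    check(k)
    bytes(k / 8) = (bytes(k / 8) | (1 << (k % 8))).toByte
  }

  def delete(k: Int): Unit = {
    check(k)
    bytes(k / 8) = (bytes(k / 8) & ~(1 << (k % 8))).toByte
  }
}
```

Each operation touches exactly one byte, which is why all three run in constant time.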

Direct-access tables
Suppose that we have a small set of possible keys.
For instance, in the two-letter country codes of ISO 3166-1 alpha-2, drawn from the alphabet {A,B,...,Z} of 26 letters, there are 26 · 26 = 676 possible codes (of which 249, such as FI and UK, are actually assigned to some meaning).
We can implement a dictionary mapping country codes to objects (e.g., capital city names) by having a direct-access table with 676 entries.
The value v of the code $c_1 c_2$ is simply stored in the entry $\mathit{index}(c_1 c_2) = f(c_1) \cdot 26 + f(c_2)$ of the array, where f(A) = 0, f(B) = 1, ..., f(Z) = 25.
INSERT, SEARCH, and DELETE are now easy to implement in O(1) time.

Example: Mapping country codes to the entries in a direct-access table arr.
(Figure: the codes AA, AB, AC, ..., DE, ..., FI, ..., UK, ..., US, ..., ZZ pointing to the entries 0, 1, 2, ..., 82, ..., 138, ..., 530, ..., 538, ..., 675 of arr.)

Implementing a country code map in Scala:

    import scala.reflect.ClassTag

    class CountryMap[B >: Null]()(implicit tag: ClassTag[B]) {
      // One entry per possible two-letter code: 26 * 26 = 676
      private val arr = new Array[B](676)
      // Position of a letter in the alphabet: 'A' -> 0, ..., 'Z' -> 25
      private def f(c: Char) = c.toInt - 'A'.toInt
      private def isValidCode(code: String) =
        code.length == 2 &&
          0 <= f(code(0)) && f(code(0)) < 26 &&
          0 <= f(code(1)) && f(code(1)) < 26
      def index(code: String): Int = {
        require(isValidCode(code))
        f(code(0)) * 26 + f(code(1))
      }
      def apply(code: String): Option[B] = {
        val v = arr(index(code))
        if (v == null) None else Some(v)
      }
      def update(code: String, value: B) = {
        require(value != null)
        arr(index(code)) = value
      }
      def delete(code: String) = { arr(index(code)) = null }
    }

Note: use of nulls is generally discouraged in Scala (use Option instead) but excusable inside data structure implementations.
Extending the class with a constant-time size operation is easy.
Example: Building a direct-access table mapping country codes to some capital city names.

    val capital = new CountryMap[String]()
    capital("DE") = "Berlin"
    capital("FI") = "Helsinki"
    capital("UK") = "London"
    capital("US") = "Washington"
    println(capital("FI"))

produces Some(Helsinki).
(Figure: arr with Berlin at the entry of DE, Helsinki at the entry of FI, London at entry 530 (UK), and Washington at entry 538 (US); all other entries up to 675 (ZZ) are null.)

Hashing and hash tables

Extend the idea of direct-access tables to very large (or even infinite) key universes U.
At any given time, only a subset $K \subseteq U$ of the possible keys is in use.
The main idea is to have a hash table of m entries and then use a hash function $h : U \to \{0,1,\ldots,m-1\}$ to map each key to an index in the table.
In the ideal case, each key maps to a different index, but in general this is difficult to obtain efficiently, and there will be collisions when two keys should be put in the same table index.
We will discuss the design of hash functions later.

Example: Assume a hash table with m = 13 entries and a hash function for strings implemented in Scala 2.11.8 with

    def h(s: String) = math.abs(s.hashCode) % 13

Many strings map to different indices and the idea works as is.
(Figure: the keys Germany, Finland, United States, Denmark, United Kingdom, and Sweden mapped to distinct entries of arr holding Berlin, Helsinki, Washington, Copenhagen, London, and Stockholm.)
But some strings map to the same index, causing collisions.
(Figure: the same table after adding the key Austria, which maps to the same index as Denmark.)

How probable are collisions?
Suppose a hash function h under the simple uniform hashing assumption: any key is equally likely to hash to any value in $\{0,1,\ldots,m-1\}$, independently of the other keys.
Under this assumption, the number of randomly drawn keys needed to produce at least one collision with probability p is approximately

    $\sqrt{2m \ln\frac{1}{1-p}}$

As an example, for a hash table with m = 1,000,000 entries we need to insert only 1178 random keys to produce at least one collision with probability 0.5.
As a consequence, collisions become rather likely quite soon.
A special case of this is called the birthday paradox: there are 365 days in a year, but if we have 23 randomly selected people in the same room, the probability that two of them have the same birthday is at least 0.5 (assuming that birthdays are evenly distributed over the year).
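The estimate can be checked numerically with a small sketch. The object and function names below are assumptions made for this illustration; the first function evaluates the approximation $\sqrt{2m \ln(1/(1-p))}$, and the second computes the exact collision probability by multiplying the probabilities that each successive key avoids all earlier ones.

```scala
object BirthdayBound {
  /** Approximate number of uniformly random keys needed to see at least
    * one collision with probability p in a table of m entries. */
  def keysForCollision(m: Double, p: Double): Double = {
    require(m > 0 && 0 < p && p < 1)
    math.sqrt(2 * m * math.log(1 / (1 - p)))
  }

  /** Exact probability that n uniformly random keys hashed into m entries
    * contain at least one collision. */
  def collisionProbability(n: Int, m: Int): Double = {
    var noCollision = 1.0
    // The (i+1)-th key must avoid the i entries already taken.
    for (i <- 0 until n) noCollision *= (m - i).toDouble / m
    1.0 - noCollision
  }
}
```

With m = 365 the approximation gives a value between 22 and 23, and the exact probability crosses 0.5 between 22 and 23 keys, in line with the birthday paradox.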

Collision resolution

In order to handle the inevitable collisions, we need a collision resolution scheme.
In the following, we'll see chaining, and open addressing with some variants.
We need an additional definition: the load factor α of a hash table with m entries when there are n keys stored in it is α = n/m.

Chaining
Also called separate chaining and open hashing.
The idea: each entry in the hash table starts a linked list, and the key/value pairs are stored in the list.
After finding the index h(k), and thus the correct list for a key k, the rest is as with mutable linked lists:
  search: traverse the list to see if the key is stored in an entry
  insertion: traverse the list and insert a new entry at the end of the list if the key was not found
  deletion: traverse the list and remove the entry if the key was found

Example: Collision handling with chaining
The entries in the linked lists are drawn as [key | value | next entry], where
  key is the key (or actually a reference to it in the case of a non-primitive type),
  value is the value (or a reference to it), and
  next entry is a reference to the next entry in the list (null if last).
(Figure: arr with 13 entries and the following chains.)
  index 0: Germany -> Berlin
  index 1: Finland -> Helsinki
  index 3: United States -> Washington
  index 8: Denmark -> Copenhagen, followed by Austria -> Vienna
  index 10: United Kingdom -> London
  index 11: Sweden -> Stockholm

Example: Implementing sets with hashing and chaining
If we are interested not in maps but in sets, we just drop the value field and use entries of the form [key | next entry] in the lists.
Inserting the numbers 131, 9833, 344, 6, 17, 434, 653 and -13 in a hash set of integers with the hash function h(k) = k mod m produces a hash table in which keys with the same remainder share a chain.
(Figure: the resulting table arr with its chains.)
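A hash set with chaining can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the course's implementation: the class name `ChainedIntSet` is made up here, the chains are immutable Scala `List`s instead of hand-written mutable linked lists, the table size is fixed, and the demo keys in the usage note are chosen for this example only.

```scala
// Hash set of integers with separate chaining and a fixed number m of chains.
// Hypothetical class for illustration only.
final class ChainedIntSet(m: Int) {
  require(m > 0)
  private val buckets = Array.fill(m)(List.empty[Int])

  // h(k) = k mod m; floorMod keeps the index in 0..m-1 for negative keys too.
  private def h(k: Int): Int = java.lang.Math.floorMod(k, m)

  // Search: traverse the chain at index h(k).
  def contains(k: Int): Boolean = buckets(h(k)).contains(k)

  // Insertion: append to the end of the chain if the key is not there yet.
  def insert(k: Int): Unit = {
    val i = h(k)
    if (!buckets(i).contains(k)) buckets(i) = buckets(i) :+ k
  }

  // Deletion: drop the key from its chain if present.
  def delete(k: Int): Unit = {
    val i = h(k)
    buckets(i) = buckets(i).filter(_ != k)
  }

  // Exposes one chain so that the table layout can be inspected.
  def chain(i: Int): List[Int] = buckets(i)
}
```

For instance, with m = 13 the keys 5, 18, and 31 all end up in the chain at index 5, while 26 and -13 share the chain at index 0.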

unordered_set in C++11
The GNU ISO C++ library implements open hashing.
Hash functions and key equality comparators are already implemented for basic types.

    #include <iostream>
    #include <unordered_set>

    int main() {
      // A set of small prime numbers
      std::unordered_set<int> myset = {3, 5, 7, 11, 13, 17, 19, 23, 29};
      myset.erase(13);             // erasing by key
      myset.erase(myset.begin());  // erasing by iterator
      std::cout << "myset contains:";
      for (const int& x : myset) std::cout << " " << x;
      std::cout << std::endl;
      return 0;
    }

One possible output (note that the elements are not ordered):

    myset contains: ...

Similarly for unordered multisets, maps, and multimaps.

Analysis
Suppose that we have inserted n keys in a hash table with m entries.
Under the simple uniform hashing assumption, the expected number of keys in the linked list at an index i is n/m.
Assume that computing the hash value h(k) for each key takes O(1) time.
The cost of searching for a key in the hash table is O(1 + n/m), i.e. O(1 + α), on average, both when the key is not found and when it is found in the table.
When n is proportional to m, i.e. $n \le cm$ for some constant c, implying $\alpha \le c$, then searching takes time O(1 + cm/m) = O(1) on average.
If insertion and deletion operations first search whether the key is in the table, then they also take time O(1) on average if $n \le cm$ for some constant c.

The worst-case behaviour occurs when all the keys hash to the same index: the hash table effectively reduces to a linked list, and inserting, searching, and deleting keys require Θ(n) time.
Therefore, good design of hash functions is important.

Rehashing
How large a hash table should we allocate in the beginning if we do not know how many keys will be seen in the future?
Or, what should we do when the load factor grows too large?
The answer is rehashing: start with a smallish hash table and then grow its size when the load factor rises above a certain threshold.
When the hash table is resized, all the keys (or key/value pairs) must be reinserted into it, as their indices in the table have probably changed (hence the term rehashing).
What is a good value of the load factor for triggering rehashing? There is no single best answer; the GNU ISO C++ library version 4.6 triggers rehashing when the load factor is 1.0 and doubles the number of entries.
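A rehashing step for a chained table might look like the following sketch. It assumes integer keys, the hash function h(k) = k mod m, and chains represented as Scala `List`s; the object and function names are hypothetical and chosen for this illustration.

```scala
object Rehash {
  // h(k) = k mod m, with floorMod so that negative keys get a valid index.
  def index(k: Int, m: Int): Int = java.lang.Math.floorMod(k, m)

  /** Load factor alpha = n / m of a chained table. */
  def loadFactor(table: Array[List[Int]]): Double =
    table.map(_.length).sum.toDouble / table.length

  /** Returns a table with twice as many entries holding the same keys.
    * Every key is reinserted because its index depends on the table size. */
  def grow(old: Array[List[Int]]): Array[List[Int]] = {
    val bigger = Array.fill(2 * old.length)(List.empty[Int])
    for (chain <- old; k <- chain) {
      val i = index(k, bigger.length)
      bigger(i) = k :: bigger(i)
    }
    bigger
  }
}
```

A caller would typically invoke `grow` whenever `loadFactor` exceeds its chosen threshold; doubling keeps the amortized cost of reinsertion constant per inserted key, as with dynamically grown arrays.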

Open addressing
Another collision resolution scheme.
Also (very confusingly) called closed hashing.
Idea: use an array large enough to hold all the inserted keys.
Each array index stores one key/value pair (or simply the key if we are implementing a set instead of a map), or is null when it is free.
Thus the load factor is at most 1.0 in open addressing.
When a collision occurs during an insert, probe another index in the table until a free one is found.
Similarly, when searching for a key, one probes indices until the key is found or an empty slot is encountered.

For probing, we need to define which index is tried next.
To do this, we define a hash function

    $h : U \times \{0,1,\ldots,m-1\} \to \{0,1,\ldots,m-1\}$

where the second argument is the probe number.
For each key k, the probe sequence is thus $h(k,0), h(k,1), \ldots, h(k,m-1)$.
To ensure that every index in the table is probed at some point, the probe sequence must be a permutation of $\{0,1,\ldots,m-1\}$ for every key k.

Linear probing
The simplest way to define probe sequences.
Given an auxiliary hash function $h' : U \to \{0,1,\ldots,m-1\}$.
  h' is usually the hash function provided by the class or user (e.g., hashCode in Scala).
Define the hash function

    $h(k,i) = (h'(k) + i) \bmod m$

Clearly, the probe sequence $h(k,0), h(k,1), \ldots, h(k,m-1)$ is a permutation of $\{0,1,\ldots,m-1\}$.

Example: Collision resolution with linear probing
The entries in the table are drawn as [key | value], where again key is the key (or actually a reference to it in the case of a non-primitive type), and value is the value (or a reference to it).
Inserting the mappings
  Germany -> Berlin
  Finland -> Helsinki
  United States -> Washington
  United Kingdom -> London
  Denmark -> Copenhagen
  Austria -> Vienna
  Sweden -> Stockholm
in a table of size 13 in this order by using our earlier hash function and linear probing gives the hash table shown in the figure.
When inserting the mapping Austria -> Vienna, we observe that the index 8 is already taken by Denmark, probe the next one, see that it is free, and insert the key/value pair there.
(Figure: arr with Germany -> Berlin, Finland -> Helsinki, United States -> Washington, Denmark -> Copenhagen at index 8, Austria -> Vienna at index 9, United Kingdom -> London, and Sweden -> Stockholm.)

Example: Sets with hashing and linear probing
When implementing sets instead of maps, the table entries consist of the key values (or references to them) only.
For instance, inserting the numbers 131, 9833, 344, 6, 17, 434, 653 and -13 in a hash table of integers with m = 13 by using the auxiliary hash function h'(k) = k:
  1. 131 is inserted to h(131,0) = (... + 0) mod 13 = ...
  2. 9833 is inserted to h(9833,0) = (... + 0) mod 13 = ...
  3. 344 is inserted to h(344,0) = (... + 0) mod 13 = ...
  4. 6 is inserted to h(6,0) = (6 + 0) mod 13 = ...
  5. 17 is inserted to h(17,0) = (17 + 0) mod 13 = 4
  6. as h(434,0) = (... + 0) mod 13 = 9 is occupied, the next index h(434,1) = (... + 1) mod 13 = 10 is probed, found free, and 434 is inserted there
  7. as h(653,0) = (653 + 0) mod 13 = 3 is occupied, the next indices h(653,1) = 4 and h(653,2) = 5 are probed and found occupied, and finally 653 is inserted to h(653,3) = 6
  8. as h(-13,0) = 0 is occupied, -13 is inserted to the next free index h(-13,1) = 1
(Figure: the resulting table arr.)
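The probing loop behind such an example can be sketched as below. The class `LinearProbingSet` is a hypothetical illustration, not the course implementation: it supports only insertion and search, uses the auxiliary hash function h'(k) = k mod m, and never resizes. The keys in the usage note are chosen for this sketch.

```scala
// Open addressing with linear probing, h(k,i) = (h'(k) + i) mod m.
// Hypothetical class for illustration only; no deletion, no rehashing.
final class LinearProbingSet(m: Int) {
  require(m > 0)
  private val table = Array.fill[Option[Int]](m)(None)

  private def h(k: Int, i: Int): Int =
    (java.lang.Math.floorMod(k, m) + i) % m

  /** Index of k if present, otherwise the first free index on its probe
    * sequence; -1 if the table is full and k is absent. */
  private def find(k: Int): Int = {
    var i = 0
    while (i < m) {
      val j = h(k, i)
      if (table(j).isEmpty || table(j).contains(k)) return j
      i += 1
    }
    -1
  }

  def contains(k: Int): Boolean = {
    val j = find(k)
    j >= 0 && table(j).contains(k)
  }

  /** Inserts k and returns the index where it is stored. */
  def insert(k: Int): Int = {
    val j = find(k)
    require(j >= 0, "table is full")
    table(j) = Some(k)
    j
  }
}
```

With m = 13, inserting 3, 16, and 29 (all with h' = 3) fills the indices 3, 4, and 5; a subsequent insert of 4 then has to walk past 4 and 5 and lands at index 6, although its own hash value collides with only one of the earlier keys.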

In the following examples, for representational simplicity, we only consider hash sets for integers; generalization to maps over arbitrary types is straightforward.
Linear probing suffers from the problem of primary clustering: an empty slot preceded by i occupied slots gets filled with probability (i + 1)/m instead of 1/m, and thus occupied slots start to cluster.
Example: Suppose that h'(k) = k for integer-valued keys and m = 17. Inserting the values 1, 50, ..., 0, 38, 35 in this order with linear probing produces the hash table in the figure, where the arrows show the probe sequence for the key 35. Note that only 1, 50, and 35 of the above keys hash to the same value.
(Figure: the resulting table with the probe sequence of the key 35.)

Quadratic probing
Eliminates primary clustering with more complex probing sequences.
Define

    $h(k,i) = (h'(k) + c_1 i + c_2 i^2) \bmod m$

where $c_1$ and $c_2$ are some positive constants.
To make the probe sequence a permutation of $\{0,1,\ldots,m-1\}$, the values $c_1$, $c_2$, and m must be constrained.
Example: Assume that m = 11 (i.e., a prime) and $h(k,i) = (h'(k) + i + i^2) \bmod m$. If h'(k) = 0, then the probe sequence is 0,2,6,1,9,8,9,1,6,2,0. This is not a permutation of {0,1,...,10} but probes only 6 of the 11 possible slots. Thus, if the load factor is high, this probe sequence may not find any free slot.

Example: Assume that m is a power of two. In this case, it can be proven that

    $h(k,i) = \left(h'(k) + \frac{i(i+1)}{2}\right) \bmod m = (h'(k) + 0.5i + 0.5i^2) \bmod m$

produces probe sequences that are permutations of $\{0,1,\ldots,m-1\}$. For instance, if $m = 2^3 = 8$ and h'(k) = 0, then the probe sequence is 0,1,3,6,2,7,5,4.
Example: Suppose that h'(k) = k for integer-valued keys and m = 16. Insert the values 1, 50, ..., 0, 38, 35 in this order with the quadratic probing hash function $h(k,i) = (h'(k) + \frac{i(i+1)}{2}) \bmod m$.
(Figure: the hash table after the insertions; arrows show the probe sequences.)


Quadratic probing does not suffer from primary clustering.
But two keys $k_1$ and $k_2$ with the same hash value $h'(k_1) = h'(k_2)$ have the same probe sequence.
This (less severe) problem with quadratic probing is called secondary clustering.
Another shortcoming of linear and quadratic probing is the following:
  For hash tables of size m, there are m! possible probe sequences.
  But for any auxiliary hash function h', both linear and quadratic probing (with fixed constants) explore at most m of those, one per value of h'(k).
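Whether a given quadratic scheme visits every slot is easy to check by generating the probe sequence. The following sketch (object and function names are assumptions of this illustration) covers the two schemes used in the examples: $i + i^2$ and the triangular numbers $i(i+1)/2$.

```scala
object ProbeSequences {
  /** Probe sequence for h(k,i) = (start + i + i^2) mod m, i = 0..m-1. */
  def quadratic(start: Int, m: Int): List[Int] =
    (0 until m).map(i => (start + i + i * i) % m).toList

  /** Probe sequence for h(k,i) = (start + i(i+1)/2) mod m, i = 0..m-1. */
  def triangular(start: Int, m: Int): List[Int] =
    (0 until m).map(i => (start + i * (i + 1) / 2) % m).toList

  /** True if the sequence visits each of the indices 0..m-1 exactly once. */
  def isPermutation(seq: List[Int], m: Int): Boolean =
    seq.sorted == (0 until m).toList
}
```

Here `start` plays the role of h'(k). For m = 11 the first scheme revisits slots, whereas the triangular scheme with m = 8 or m = 16 reaches every slot exactly once.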

Double hashing
Produces more probe sequences than linear and quadratic probing by using another hash function for generating the probe sequence steps.
The general form is

    $h(k,i) = (h_1(k) + i \cdot h_2(k)) \bmod m$

Double hashing can produce $m^2$ different probe sequences.
Again, to obtain probe sequences that are permutations, we must constrain the function $h_2$.
This can be done by forcing the value $h_2(k)$ to be relatively prime to the hash table size m.

Example: One convenient way to make the probe sequence $h(k,0), h(k,1), \ldots, h(k,m-1)$ a permutation of $\{0,1,\ldots,m-1\}$ is to require that m is a power of two and $h_2(k)$ is an odd number.
Example: Another way to make the probe sequence $h(k,0), h(k,1), \ldots, h(k,m-1)$ a permutation of $\{0,1,\ldots,m-1\}$ is to require that m is a prime number and $h_2(k)$ is a positive number less than m. For instance, for integer keys we could choose $h_1(k) = k \bmod m$ and $h_2(k) = 1 + (k \bmod m')$ for some m' slightly less than m.
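The second construction can be tried out with a short sketch. The concrete sizes m = 13 and m' = 11, as well as the object and function names, are assumptions chosen for this illustration.

```scala
object DoubleHashing {
  val m = 13       // table size, a prime (assumed for this sketch)
  val mPrime = 11  // slightly less than m (assumed for this sketch)

  def h1(k: Int): Int = java.lang.Math.floorMod(k, m)
  // 1 <= h2(k) <= mPrime < m, so h2(k) is relatively prime to the prime m.
  def h2(k: Int): Int = 1 + java.lang.Math.floorMod(k, mPrime)

  /** h(k,i) = (h1(k) + i * h2(k)) mod m */
  def probe(k: Int, i: Int): Int = (h1(k) + i * h2(k)) % m

  def probeSequence(k: Int): List[Int] =
    (0 until m).map(i => probe(k, i)).toList
}
```

The keys 14 and 27 both have $h_1 = 1$, but their steps $h_2$ differ (4 and 6), so their probe sequences diverge after the first probe; this is how double hashing avoids secondary clustering.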

Removing keys
Removing keys when using chaining was straightforward.
With open addressing, we must ensure that removing a key does not leave holes in the probe sequences of other keys.
Example: Suppose that h'(k) = k for integer-valued keys, m = 16, and we use quadratic probing with the hash function $h(k,i) = (h'(k) + \frac{i(i+1)}{2}) \bmod m$ to insert the keys 3, 20 and 35 in a hash table.
(Figure: the table with 3 in slot 3, 20 in slot 4, and 35 in slot 6.)
If we now delete the key 20 by simply removing it, slot 4 becomes free.
(Figure: the table with 3 in slot 3 and 35 in slot 6.)
But in this hash table the search operation does not find the key 35 anymore, as it probes the slots 3 and 4 only, concluding that the key is not in the set when it sees the free slot at index 4.

One solution is to replace the deleted key with some special tombstone value del in the hash table.
Search and deletion treat these values as real keys, but insertion can overwrite them.
If the table has too many such tombstones, one should probably garbage collect them by rehashing all the values in the same table so that search operations do not perform unnecessary work.
Example: Consider again the hash table of the previous example. Delete the key 20 by putting the tombstone value in its place.
(Figure: the table with 3 in slot 3, del in slot 4, and 35 in slot 6.)
Now the search operation can work unmodified and finds the key 35 after probing the slots 3, 4 and 6. If we now insert the key 19, we probe the slots 3 and 4 and then insert the key in slot 4.
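The tombstone idea can be sketched with a small open-addressing set. Everything named below (`TombstoneSet`, the slot states, `slotOf`) is a hypothetical illustration; the sketch assumes that m is a power of two and uses the probing function $h(k,i) = (h'(k) + i(i+1)/2) \bmod m$ with h'(k) = k mod m. It does not rehash.

```scala
// Open addressing with tombstones; m is assumed to be a power of two so that
// the triangular probe sequence visits every slot.
final class TombstoneSet(m: Int) {
  private val Free = 0
  private val Full = 1
  private val Del = 2                    // tombstone
  private val state = new Array[Int](m)  // all slots start as Free
  private val keys = new Array[Int](m)

  private def h(k: Int, i: Int): Int =
    java.lang.Math.floorMod(java.lang.Math.floorMod(k, m) + i * (i + 1) / 2, m)

  /** Slot holding k, or -1. Tombstones are skipped; a Free slot ends the search. */
  def slotOf(k: Int): Int = {
    var i = 0
    while (i < m) {
      val j = h(k, i)
      if (state(j) == Free) return -1
      if (state(j) == Full && keys(j) == k) return j
      i += 1
    }
    -1
  }

  def contains(k: Int): Boolean = slotOf(k) >= 0

  /** Inserts k unless present, reusing the first tombstone or free slot on
    * its probe sequence. Returns the slot where k is stored. */
  def insert(k: Int): Int = {
    val existing = slotOf(k)
    if (existing >= 0) return existing
    var i = 0
    while (i < m) {
      val j = h(k, i)
      if (state(j) != Full) { state(j) = Full; keys(j) = k; return j }
      i += 1
    }
    throw new IllegalStateException("table is full")
  }

  /** Replaces k by a tombstone so that other probe sequences stay intact. */
  def delete(k: Int): Unit = {
    val j = slotOf(k)
    if (j >= 0) state(j) = Del
  }
}
```

With m = 16, inserting 3, 20, and 35 and then deleting 20 leaves a tombstone in slot 4, so 35 is still found in slot 6, and a later insert of 19 reuses slot 4.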

Analysis
For the analysis, assume uniform hashing: the probe sequence of each key is equally likely to be any of the m! permutations of $\{0,1,\ldots,m-1\}$.
Of the probing methods presented, double hashing is closest to this requirement.
Theorem (11.6 in Introduction to Algorithms, 3rd ed. (online via Aalto lib)): Under the uniform hashing assumption, the expected number of probes in an unsuccessful search is at most $1/(1-\alpha)$.
Recall that the load factor α is less than 1 in a non-full hash table in open addressing, and thus

    $1/(1-\alpha) = 1 + \alpha + \alpha^2 + \alpha^3 + \cdots$

Informal intuitive explanation:
  The 1 comes from the fact that at least one probe is done.
  The first probed slot was occupied with probability α, and thus a second probe is done with probability α.
  The first and second probed slots were both occupied with probability $\alpha^2$, and thus a third probe is done with probability $\alpha^2$.
  ...

Therefore, if we keep the load factor below some constant, then insertion, search, and deletion take constant time on average.
For instance, if we keep the load factor at or below 0.5, then the expected number of probes is at most 2.
Note: the deleted values (tombstones) in the table count towards the load factor in this case!
If deletions are known to occur often, then separate chaining is probably a better collision resolution approach.
Again, when an open addressing hash table becomes too full, we perform rehashing: grow the size of the table and reinsert the keys into this new larger table.
As in the case of dynamically grown arrays, the size of the hash table is usually approximately doubled.
What is a good value of the load factor for triggering rehashing? There is no single best answer, but usually one uses something like 0.50 or 0.75 for open addressing.
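To get a feeling for how quickly the bound $1/(1-\alpha)$ grows with the load factor, it can be evaluated directly. The helper below is a hypothetical illustration that only restates the bound of the theorem.

```scala
object ProbeBound {
  /** Upper bound 1 / (1 - alpha) on the expected number of probes in an
    * unsuccessful search under the uniform hashing assumption. */
  def unsuccessfulSearchBound(alpha: Double): Double = {
    require(0 <= alpha && alpha < 1, "load factor must be in [0, 1)")
    1.0 / (1.0 - alpha)
  }
}
```

The bound is 2 probes at load factor 0.5 and 4 probes at 0.75, and it grows without limit as the load factor approaches 1, which is why the thresholds 0.50 and 0.75 are common choices.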

Building hash functions

In simple uniform hashing, each key should be equally likely to hash to any of the m values, independently of where any other key has hashed to.
But we do not necessarily know the distribution of the inserted keys, nor are they necessarily drawn independently.
A good approach computes hash values in a way that we expect to be independent of the patterns that appear in the keys.
  As an example, when building a compiler symbol table we do not want the common strings i and j to hash to neighbouring values, especially if we use open addressing and linear probing.
As a rough guide: the more random the hash value looks, the better.
For map/set hash table use, the hash value should also be very efficient to compute.

Hashing integers
We first show some ways to produce hash functions for integer keys.
We have already seen the division method, in which the hash value is the remainder when dividing the key by the hash table size m:

    $h(k) = k \bmod m$

This is fast to compute, as only one division operation is required.
But it may be a bad choice if m is a power of two with $m = 2^p$: only the p least significant bits of the key influence the hash value.
If possible, m should rather be a prime number not too close to a power of two.

The multiplication method produces a hash value for a w-bit integer key k by multiplying it with another, well-chosen constant w-bit integer A, to a 2w-bit integer $r = r_{2w-1} r_{2w-2} \ldots r_0 = kA$, and taking the hash value from the most significant bits of the least significant half $r_{w-1} \ldots r_0$.
Here m is usually a power of two, $m = 2^p$, so that one can simply take the p most significant bits of $r_{w-1} \ldots r_0$ by shifting and masking.
For instance, consider the Scala function

    def h(x: Int): Int = (x * 2654435769L).toInt

Now
  h(1).toHexString = 9e3779b9
  h(2).toHexString = 3c6ef372
  h(3).toHexString = daa66d2b
and so on, looking quite random.
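Turning the mixed value into a table index for $m = 2^p$ only needs an unsigned right shift. The sketch below assumes w = 32 and the multiplier 0x9e3779b9 (decimal 2654435769, roughly $2^{32}$ divided by the golden ratio); the object and function names are hypothetical.

```scala
object MultiplicationHash {
  // Assumed multiplier: 2654435769 = 0x9e3779b9.
  private val A = 2654435769L

  /** Low 32 bits of the 64-bit product x * A. */
  def mix(x: Int): Int = (x * A).toInt

  /** Index in a table of m = 2^p entries: the p most significant bits of
    * the low 32-bit half of the product. */
  def index(x: Int, p: Int): Int = {
    require(1 <= p && p <= 31)
    mix(x) >>> (32 - p)
  }
}
```

For p = 4 (m = 16), the consecutive keys 1, 2, and 3 are spread to the indices 9, 3, and 13, the leading hexadecimal digits of their mixed values.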

Hashing strings
To translate a string $s = c_0 \ldots c_{n-1}$ into an integer, the characters $c_i$ in the string can be processed one by one.
E.g., the current Java implementations use the hash function

    $h(s) = \sum_{i=0}^{n-1} c_i \cdot 31^{n-1-i}$

From openjdk Java 8 (with software caching of the hash value):

    public final class String
        implements java.io.Serializable, Comparable<String>, CharSequence {
        /** The value is used for character storage. */
        private final char value[];

        /** Cache the hash code for the string */
        private int hash; // Default to 0
        ...
        public int hashCode() {
            int h = hash;
            if (h == 0 && value.length > 0) {
                char val[] = value;
                for (int i = 0; i < value.length; i++) {
                    h = 31 * h + val[i];
                }
                hash = h;
            }
            return h;
        }
    }
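The loop evaluates the polynomial by Horner's rule, with ordinary 32-bit overflow. A short Scala sketch of the same computation (the object and function names are assumptions of this illustration) makes it easy to compare against the built-in `hashCode`:

```scala
object StringHash {
  /** Polynomial string hash h(s) = sum of c_i * 31^(n-1-i), evaluated by
    * Horner's rule with wrap-around Int arithmetic. */
  def polyHash(s: String): Int =
    s.foldLeft(0)((h, c) => 31 * h + c)
}
```

For example, `StringHash.polyHash("ab")` is 97 · 31 + 98 = 3105, and on the JVM the result agrees with `"ab".hashCode`.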

Hashing compound objects
Computing hash values for compound objects (objects with fields, arrays) can be done by combining the (hash) values of the components.
One of the simplest examples is computing the hash code for integer arrays in openjdk Java 6:

    public static int hashCode(int a[]) {
        if (a == null)
            return 0;
        int result = 1;
        for (int element : a)
            result = 31 * result + element;
        return result;
    }

Currently Scala uses a somewhat more complex hash function based on MurmurHash3.
It tries to produce good hash values (few collisions) and be fast.
From the Scala source code:

    final def arrayHash[@specialized T](a: Array[T], seed: Int): Int = {
      var h = seed
      var i = 0
      while (i < a.length) {
        h = mix(h, a(i).##) // ## is hashCode
        i += 1
      }
      finalizeHash(h, a.length)
    }

    final def mix(hash: Int, data: Int): Int = {
      var h = mixLast(hash, data)
      h = rotl(h, 13) // rotl is Integer.rotateLeft
      h * 5 + 0xe6546b64
    }

    final def mixLast(hash: Int, data: Int): Int = {
      var k = data
      k *= 0xcc9e2d51
      k = rotl(k, 15)
      k *= 0x1b873593
      hash ^ k
    }

    /** Finalize a hash to incorporate the length and make sure all bits avalanche. */
    final def finalizeHash(hash: Int, length: Int): Int = avalanche(hash ^ length)

    /** Force all bits of the hash to avalanche. Used for finalizing the hash. */
    private final def avalanche(hash: Int): Int = {
      var h = hash
      h ^= h >>> 16
      h *= 0x85ebca6b
      h ^= h >>> 13
      h *= 0xc2b2ae35
      h ^= h >>> 16
      h
    }

More
Naturally, a lot of hash functions for different purposes have been proposed.
See, e.g., the following links:
  libsupc%b%b/hash_bytes.cc?view=markup#l74
