Data Structure Review and Hash Tables (slides courtesy of Dr. Sang-Eon Park)


Preview

Data Structure Review (COSC)
- Linked Lists: singly linked list, doubly linked list. Typical functions: insert into a sorted list, insert as the last element (or as the first element?), delete, search.
- Stacks
- Queues

Hash Tables
- Hash functions
- Collision resolution: chaining and open addressing (probe sequences)

Data Structure Review (COSC)

Stacks: LIFO. Typical functions for managing a stack: Push, Pop, IsEmpty, IsFull, Peek (or Top). Stack implementations: using a linked list, or using an array.

Queues: FIFO. Typical functions for managing a queue: EnQueue, DeQueue, IsEmpty, IsFull, Peek (or Front). Queue implementations: using a linked list, or using an array.

Hash Tables (Direct-Address Table)

Many applications require a dynamic data structure supporting the so-called dictionary operations: insert, search, and delete. A hash table is an effective data structure for implementing these operations. Even though the worst-case running time of a search is linear, O(n), under reasonable assumptions the average case takes constant time, O(1).

Direct addressing is a simple technique that works well when the universe U of keys is reasonably small. For U = {0, 1, ..., m-1} we can use an array T[0..m-1] with one slot per possible key.

[Figure: a direct-address table over the universe U = {0, 1, ..., 9}. Only the keys in K, the set of actual keys used, occupy their slots; each occupied slot stores the key together with its satellite data, and every other slot is empty.]
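To make direct addressing concrete, here is a minimal C sketch (not from the slides): a table indexed directly by the key, so insert, search, and delete each touch exactly one slot and run in O(1). The element type, the universe size U_SIZE, and the function names are illustrative assumptions.

    #include <stdio.h>

    #define U_SIZE 10                 /* |U|: one slot per possible key (assumed small) */

    typedef struct Element {          /* key plus satellite data (illustrative) */
        int key;
        const char *data;
    } Element;

    static Element *T[U_SIZE];        /* T[k] points to the element with key k, or NULL */

    void direct_address_insert(Element *x) { T[x->key] = x; }      /* O(1) */
    Element *direct_address_search(int k)  { return T[k]; }        /* O(1) */
    void direct_address_delete(Element *x) { T[x->key] = NULL; }   /* O(1) */

    int main(void) {
        Element e = { 6, "satellite data for key 6" };
        direct_address_insert(&e);
        Element *found = direct_address_search(6);
        printf("key %d -> %s\n", found->key, found->data);
        direct_address_delete(&e);
        return 0;
    }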

Hash Tables

Problems with direct addressing:
- If the universe U is large, storing a table of size |U| may be impractical or impossible (e.g., when the range of keys is all positive integers).
- Often the set K of keys that are actually used/stored is small compared to U, so most of the space allocated for T is wasted (e.g., the range of keys is enormous while only a relatively small number of keys ever occur as inputs).

Hash tables are a good alternative:
- When K is much smaller than U, a hash table requires significantly less space than a direct-address table; storage requirements can be reduced to Θ(|K|).
- The average-case search time can be O(1). However, this is not the worst case.

A hash table is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name or ID number), to their associated values.

The big idea: instead of storing an element with key k in slot k, use a function h and store the element in slot h(k). We call h a hash function,

    h: U -> {0, 1, ..., m-1},

such that h(k) is a legal slot number in the hash table T. We say that k hashes to slot h(k).

[Figure: a hash table T with slots 0 through m-1 and a hash function h; the keys in K (the actual keys used, a subset of the universe U) are stored in slots h(k1), h(k2), ....]

Collisions: a collision occurs when more than one key hashes to the same slot. This can happen because there are more possible keys than slots (|U| > m, where m is the number of slots in the hash table). For a given set K of keys with |K| <= m, a collision may or may not happen; when |K| > m, collisions will occur. Therefore, a hash table implementation must be able to handle collisions in all cases.

Two common approaches for resolving collisions: chaining and open addressing.
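As a small illustration of a collision (my example, not the slides'), the sketch below uses a toy hash function h(k) = k mod m with m = 10 slots and shows two distinct keys mapping to the same slot; the keys 12 and 22 are arbitrary choices.

    #include <stdio.h>

    /* Toy hash function: h(k) = k mod m, with m = 10 slots (an assumption). */
    static int h(int k) { int m = 10; return ((k % m) + m) % m; }

    int main(void) {
        int k1 = 12, k2 = 22;                     /* two distinct keys */
        printf("h(%d) = %d, h(%d) = %d\n", k1, h(k1), k2, h(k2));
        if (h(k1) == h(k2))
            printf("collision: both keys hash to slot %d\n", h(k1));
        return 0;
    }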

Hash Tables (Collision Resolution by Chaining)

[Figure: the hash table from before, now with a collision: two distinct keys hash to the same slot, h(k_i) = h(k_j).]

Collision resolution by chaining: instead of containing an actual element, slot j contains a pointer to the head of a list storing all elements that hash to j (or to the sentinel, if a circular, doubly linked list with a sentinel is used). If there are no such elements, slot j contains NULL.

[Figure: a worked example of chaining. Keys (names with ID numbers) are inserted in the order Sam Smith, Sue Marker, Sandra Lee, Ted Doe, Sam Parker, Mike Backer; keys that hash to the same slot are chained together in that slot's linked list.]

Operations in Hashing with Chaining

Insertion: CHAINED-HASH-INSERT(T, x)
- Insert x at the head of list T[h(x->key)].
- Worst-case running time is O(1).
- Assumes that the element being inserted isn't already in the list; it would take an additional search to check whether it was.

Search: CHAINED-HASH-SEARCH(T, k)
- Search for an element with key k in list T[h(k)].
- Running time is proportional to the length of the list of elements in slot h(k).

Deletion: CHAINED-HASH-DELETE(T, x)
- Delete x from the list T[h(x->key)].
- We are given a pointer x to the element to delete, so no search is needed to find it.
- Worst-case running time is O(1) if the lists are doubly linked.
- If the lists are singly linked, deletion takes as long as searching, because we must find x's predecessor in its list in order to correctly update the next pointers.
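A minimal C sketch of the three operations above, assuming integer keys, a hypothetical toy hash function h(k) = k mod M with M = 8 slots, and chains kept as doubly linked lists so that deletion by pointer is O(1). The type and function names are illustrative, not the slides' code.

    #include <stdio.h>
    #include <stdlib.h>

    #define M 8                      /* number of slots (assumed) */

    typedef struct Node {
        int key;
        struct Node *prev, *next;    /* doubly linked for O(1) delete */
    } Node;

    static Node *table[M];           /* slot j holds the head of its chain, or NULL */

    static int h(int k) { return ((k % M) + M) % M; }   /* toy hash function */

    /* Insert at the head of list T[h(key)]; O(1), assumes key not already present. */
    Node *chained_hash_insert(int key) {
        Node *x = malloc(sizeof *x);
        int j = h(key);
        x->key = key;
        x->prev = NULL;
        x->next = table[j];
        if (table[j]) table[j]->prev = x;
        table[j] = x;
        return x;
    }

    /* Search list T[h(k)]; time proportional to the chain length. */
    Node *chained_hash_search(int k) {
        Node *x = table[h(k)];
        while (x && x->key != k) x = x->next;
        return x;                    /* NULL if not found */
    }

    /* Delete given a pointer to the node; O(1) because the list is doubly linked. */
    void chained_hash_delete(Node *x) {
        if (x->prev) x->prev->next = x->next;
        else         table[h(x->key)] = x->next;
        if (x->next) x->next->prev = x->prev;
        free(x);
    }

    int main(void) {
        chained_hash_insert(5);
        chained_hash_insert(13);     /* 13 mod 8 == 5: collides with key 5 and chains in slot 5 */
        Node *x = chained_hash_search(13);
        printf("found %d in slot %d\n", x->key, h(x->key));
        chained_hash_delete(x);
        return 0;
    }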

Analysis of Hashing with Chaining

We can analyze hashing with chaining in terms of the load factor, the average number of elements stored per slot:
- Let n be the number of elements in the hash table and m the number of slots.
- The load factor is α = n/m; it can be greater than, equal to, or less than 1.

The worst case occurs when all n keys hash to the same slot: we get a single list of length n, and the worst-case search time is Θ(n) plus the time to compute the hash function. The average case depends on how well the hash function distributes the keys among the slots.

Average-case performance of hashing with chaining:
- We assume simple uniform hashing: any given element is equally likely to hash into any of the m slots.
- For j = 0, 1, ..., m-1, denote the length of list T[j] by n_j. Then n = n_0 + n_1 + ... + n_{m-1}, where n is the total number of elements stored.
- The average value of n_j is E[n_j] = α = n/m.
- Assume the hash function is computed in O(1) time, so the time required to search for an element with key k depends on the length n_{h(k)} of the list T[h(k)].

To analyze the average case we consider two cases:
1. If the hash table contains no element with key k, the search is unsuccessful.
2. If the hash table does contain an element with key k, the search is successful.

Theorem 1) An unsuccessful search takes expected time Θ(1 + α).
Theorem 2) A successful search takes expected time Θ(1 + α).

Hash Tables (Collision Resolution by Open Addressing)

Hash collisions can also be resolved using open addressing. In open addressing, all elements occupy the hash table itself: every key is stored in one of the table's own slots, and each slot contains either a key or NIL.

[Figure: an open-addressing example. Keys (names with ID numbers) are inserted in the order Sam Smith, Sue Marker, Sandra Lee, Ted Doe, Sam Parker, Mike Backer; each key is stored directly in a slot of the table.]

To search for key k:
- Compute h(k) and examine slot h(k). Examining a slot is known as a probe.
- If slot h(k) contains key k, the search is successful. If the slot contains NIL, the search is unsuccessful.
- If slot h(k) contains a key that is not k, we compute the index of some other slot, based on k and on which probe we are on (counting from 0: 0th, 1st, 2nd, etc.).
- Keep probing until we either find key k (successful search) or find a slot holding NIL (unsuccessful search).

We need the sequence of slots probed to be a permutation of the slot numbers <0, 1, ..., m-1>, so that we examine all slots if we have to, and so that we never examine any slot more than once. Thus the hash function takes the probe number as a second argument:

    h: U x P -> P

(where U is the set of keys and P = {0, 1, ..., m-1} is the set of slot numbers). The requirement that the sequence of slots be a permutation of 0, 1, ..., m-1 is equivalent to requiring that the probe sequence h(k, 0), h(k, 1), ..., h(k, m-1) be a permutation of <0, 1, ..., m-1>.

To insert, act as though we're searching, and insert at the first NIL slot we find.

HASH-SEARCH returns the index of a slot containing key k, or NIL if the search is unsuccessful.

    HASH_SEARCH(T, k)
    {
        i = 0;
        rval = NIL;
        do {
            j = h(k, i);
            if (T[j] == k)
                rval = j;
            i = i + 1;
        } while (T[j] != NIL and T[j] != k and i < m);
        return rval;
    }
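The pseudocode above translates into C roughly as follows. This sketch assumes a small table of M = 8 integer slots, NIL represented by -1, and, purely as a placeholder, a linear probe sequence h(k, i) = (k + i) mod M (probe sequences are defined later in these notes).

    #include <stdio.h>

    #define M 8                        /* number of slots (assumed) */
    #define NIL (-1)                   /* sentinel marking an empty slot */

    static int T[M];                   /* open-addressing table: keys live in the slots themselves */

    /* Probe sequence h(k, i); linear probing is used here only as a placeholder. */
    static int h(int k, int i) { return (k + i) % M; }

    /* Returns the index of the slot containing key k, or NIL if k is not present. */
    int hash_search(int k) {
        int i = 0, rval = NIL, j;
        do {
            j = h(k, i);
            if (T[j] == k)
                rval = j;
            i = i + 1;
        } while (T[j] != NIL && T[j] != k && i < M);
        return rval;
    }

    int main(void) {
        for (int j = 0; j < M; j++) T[j] = NIL;   /* start with every slot empty */
        T[3] = 11;                                /* 11 hashes to slot 3 */
        T[4] = 19;                                /* 19 also hashes to slot 3; it was placed one probe later */
        printf("11 found in slot %d\n", hash_search(11));
        printf("19 found in slot %d\n", hash_search(19));
        printf("7 present? slot = %d (NIL means not found)\n", hash_search(7));
        return 0;
    }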

Hash Tables (Collision Resolution by Open Addressing)

HASH-INSERT probes along the same sequence, places the key in the first empty slot it finds, and reports an overflow if no empty slot exists.

    HASH_INSERT(T, k)
    {
        i = 0;
        stop = false;
        do {
            j = h(k, i);
            if (T[j] == NIL) {
                T[j] = k;
                stop = true;
            } else {
                i = i + 1;
            }
        } while (!stop and i < m);
        if (i == m)
            error "hash table overflow";
    }

Deletion: we cannot just put NIL into the slot containing the key we want to delete.

[Figure: the open-addressing example table after deleting Ted Doe by writing NIL into his slot. A subsequent search for Mike Backer, whose probe sequence passes through that slot, now stops at the NIL slot and fails even though Mike Backer is still in the table.]

Solution: use a special value DELETED instead of NIL when marking a slot as empty during deletion.
- Search should treat DELETED as though the slot holds a key that does not match the one being searched for.
- Insertion should treat DELETED as though the slot were empty, so that it can be reused.

The disadvantage of using DELETED is that search time is no longer dependent on the load factor.

    HASH_INSERT(T, k)
    {
        i = 0;
        stop = false;
        do {
            j = h(k, i);
            if (T[j] == NIL or T[j] == DELETED) {
                T[j] = k;
                stop = true;
            } else
                i = i + 1;
        } while (!stop and i < m);
        if (i == m)
            error "hash table overflow";
    }
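A sketch in C of open addressing with the DELETED marker, under the same assumptions as the previous sketch (M = 8 integer slots, -1 for NIL, -2 for DELETED, and a placeholder linear probe). It restructures the pseudocode's do/while loops as bounded for loops, which is an implementation choice, not the slides' code.

    #include <stdio.h>

    #define M 8                        /* number of slots (assumed) */
    #define NIL (-1)                   /* empty slot */
    #define DELETED (-2)               /* tombstone: a key was deleted here */

    static int T[M];

    static int h(int k, int i) { return (k + i) % M; }   /* placeholder linear probe */

    /* Insert treats DELETED like an empty slot, so tombstones get reused. */
    int hash_insert(int k) {
        for (int i = 0; i < M; i++) {
            int j = h(k, i);
            if (T[j] == NIL || T[j] == DELETED) { T[j] = k; return j; }
        }
        return NIL;                    /* hash table overflow */
    }

    /* Search treats DELETED like a non-matching key and keeps probing. */
    int hash_search(int k) {
        for (int i = 0; i < M; i++) {
            int j = h(k, i);
            if (T[j] == k)   return j;
            if (T[j] == NIL) return NIL;    /* a truly empty slot ends the search */
        }
        return NIL;
    }

    /* Delete marks the slot DELETED rather than NIL, so later searches keep going. */
    void hash_delete(int k) {
        int j = hash_search(k);
        if (j != NIL) T[j] = DELETED;
    }

    int main(void) {
        for (int j = 0; j < M; j++) T[j] = NIL;
        hash_insert(11); hash_insert(19); hash_insert(27);   /* all hash to slot 3, then probe onward */
        hash_delete(19);                                     /* leaves a tombstone in slot 4 */
        printf("27 still found in slot %d\n", hash_search(27));
        printf("reinserting 35 reuses slot %d\n", hash_insert(35));
        return 0;
    }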

Probe Sequences

A probe sequence is the list of slot locations that an open-addressing method produces as alternatives in the event of a collision. Three common methods:
- Linear probing
- Quadratic probing
- Double hashing

Linear Probing

Given an ordinary (auxiliary) hash function h': U -> {0, 1, ..., m-1}, the method of linear probing uses the hash function

    h(k, i) = (h'(k) + i) mod m,   for i = 0, 1, ..., m-1.

The initial probe determines the entire sequence, so there are only m possible probe sequences. Linear probing is easy to implement, but it suffers from primary clustering: the probe sequences for two different keys may overlap and build up runs of occupied slots. As a result, the average search and insertion times increase.

Example) With linear probing, two different keys k1 and k2 may overlap and create clustering:
- the probe sequence for key k1 is {h'(k1), h'(k1) + 1, h'(k1) + 2, ...} = {10, 11, 12, 13, ...}
- the probe sequence for key k2 is {h'(k2), h'(k2) + 1, h'(k2) + 2, ...} = {7, 8, 9, 10, 11, 12, 13, ...}

Quadratic Probing

The probe sequence starts at h'(k) and jumps around in the table according to a quadratic function of the probe number:

    h(k, i) = (h'(k) + c1*i + c2*i^2) mod m,

where c1 and c2 are constants. We must constrain c1, c2, and m in order to ensure that we get a full permutation of {0, 1, ..., m-1}. If two distinct keys have the same h'(k) value, then they have the same probe sequence; this results in secondary clustering.

Double Hashing

Use two auxiliary hash functions, h1 and h2. h1 gives the initial probe, and h2 determines the remaining probes:

    h(k, i) = (h1(k) + i*h2(k)) mod m.

h2(k) must be relatively prime to m (no factors in common other than 1) in order to guarantee that the probe sequence is a full permutation of <0, 1, ..., m-1>. Two ways to arrange this:
- Choose m to be a power of 2 and design h2 so that it always produces an odd number.
- Let m be prime and design h2 so that 1 < h2(k) < m.

Collision Resolution by Open Addressing: Analysis

Assumptions:
- The analysis is in terms of the load factor α. We assume that the table never completely fills, so we always have n < m, i.e., α < 1.
- Uniform hashing.
- No deletion.
- In a successful search, each key is equally likely to be searched for.

Theorems:
- The expected number of probes in an unsuccessful search is at most 1/(1 - α).
- The expected number of probes in a successful search is at most (1/α) ln(1/(1 - α)).
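The three probe-sequence formulas above can be written down directly. In the sketch below the auxiliary functions h', h1, h2 and the constants m = 11, c1 = c2 = 1 are illustrative choices (and, as the text notes, c1, c2, and m would have to be constrained for quadratic probing to visit every slot).

    #include <stdio.h>

    #define M 11                         /* number of slots; a prime, chosen for illustration */

    static int h_aux(int k)  { return k % M; }             /* ordinary hash function h'(k) */
    static int h1(int k)     { return k % M; }             /* double hashing: initial probe */
    static int h2(int k)     { return 1 + (k % (M - 1)); } /* double hashing: step, 1 <= h2(k) < M */

    /* Linear probing:    h(k, i) = (h'(k) + i) mod m */
    static int linear(int k, int i)    { return (h_aux(k) + i) % M; }

    /* Quadratic probing: h(k, i) = (h'(k) + c1*i + c2*i^2) mod m, with c1 = c2 = 1 (assumed) */
    static int quadratic(int k, int i) { return (h_aux(k) + i + i * i) % M; }

    /* Double hashing:    h(k, i) = (h1(k) + i*h2(k)) mod m */
    static int dbl(int k, int i)       { return (h1(k) + i * h2(k)) % M; }

    int main(void) {
        int k = 25;
        printf("probe sequences for key %d (first 5 probes):\n", k);
        for (int i = 0; i < 5; i++)
            printf("i=%d  linear=%d  quadratic=%d  double=%d\n",
                   i, linear(k, i), quadratic(k, i), dbl(k, i));
        return 0;
    }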

Hash Functions

What makes a good hash function? Ideally, the hash function satisfies the simple uniform hashing assumption (SUHA). SUHA is a basic assumption that facilitates the mathematical analysis of hash tables: it states that a hypothetical hashing function will evenly distribute items into the slots of a hash table, and moreover that each item to be hashed has an equal probability of being placed into any slot, regardless of the elements already placed.

In practice, it's usually not possible to satisfy this assumption: we usually do not know in advance the probability distribution that keys are drawn from, and the keys may not be drawn independently. Instead, heuristics based on the domain of the keys are often used to create a hash function that performs reasonably well.

Hash Functions: Keys as Natural Numbers

Hash functions typically assume that the keys are natural numbers. When they are not, we need a way to convert them to natural numbers. For example, if the keys are character strings, we could convert them using their character codes. With key = "Father", the ASCII values are F = 70, a = 97, t = 116, h = 104, e = 101, r = 114. Since there are 128 basic ASCII code values, we can convert the string "Father" to the natural number

    70*128^5 + 97*128^4 + 116*128^3 + 104*128^2 + 101*128 + 114.

Hash Functions: The Division Method

A key k is mapped into one of m slots by taking the remainder of k divided by m:

    h(k) = k mod m

Advantage: fast, since it requires just one division operation.
Disadvantage: we have to avoid certain values of m. Powers of 2 are bad: if m = 2^p for an integer p, then h(k) is just the least significant p bits of k.
Good choice for m: a prime number not too close to an exact power of 2.

Hash Functions: The Multiplication Method

The multiplication method creates a slot number as follows:
1. Select a constant value A in the range 0 < A < 1.
2. Multiply key k by A.
3. Extract the fractional part of kA.
4. Multiply the fractional part by m.
5. Take the floor of the result.

    h(k) = floor(m * (kA mod 1)),

where "kA mod 1" is the fractional part of kA, which is between 0 and 1. An advantage of this approach is that the value of m is not critical.
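A hedged C sketch of the ideas above: interpreting a string as a radix-128 natural number, then hashing it with the division method and with a naive floating-point version of the multiplication method. The table sizes 701 and 1024 and the constant A = 0.6180339887 are illustrative choices; for long strings one would reduce modulo m while accumulating to avoid overflow.

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Interpret a character string as a radix-128 natural number (the slides'
     * "Father" example). 64-bit arithmetic is enough for short strings. */
    static uint64_t string_to_key(const char *s) {
        uint64_t k = 0;
        for (; *s; s++)
            k = k * 128 + (unsigned char)*s;     /* Horner's rule, base 128 */
        return k;
    }

    /* Division method: h(k) = k mod m. */
    static uint64_t h_division(uint64_t k, uint64_t m) { return k % m; }

    /* Multiplication method, written naively with floating point:
     * h(k) = floor(m * (kA mod 1)).  The word-level trick on the next page
     * computes the same thing without floating point. */
    static uint64_t h_multiplication(uint64_t k, uint64_t m, double A) {
        double frac = fmod((double)k * A, 1.0); /* fractional part of kA */
        return (uint64_t)((double)m * frac);
    }

    int main(void) {
        uint64_t k = string_to_key("Father");   /* 70*128^5 + 97*128^4 + ... + 114 */
        uint64_t m_div = 701;                   /* prime, not near a power of 2 */
        uint64_t m_mul = 1024;                  /* a power of 2 is fine for the multiplication method */
        printf("key(\"Father\")      = %llu\n", (unsigned long long)k);
        printf("division h(k)       = %llu (m = %llu)\n",
               (unsigned long long)h_division(k, m_div), (unsigned long long)m_div);
        printf("multiplication h(k) = %llu (m = %llu)\n",
               (unsigned long long)h_multiplication(k, m_mul, 0.6180339887),
               (unsigned long long)m_mul);
        return 0;
    }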

Hash Functions: The Multiplication Method (Implementation)

A power of 2 is usually chosen for m in order to facilitate easy implementation of the function on most computers:
- Select m = 2^p for some integer p.
- Suppose the word size of the machine is w bits, and assume that k fits into a single word (k takes w bits).
- Let s be an integer in the range 0 < s < 2^w (s takes w bits), and restrict A to be a fraction of the form s/2^w, so that s = A * 2^w.
- Multiply k by s. Since we are multiplying two w-bit words, the result is 2w bits: k*s = r1 * 2^w + r0, where r1 is the high-order word of the product and r0 is the low-order word.

r1 holds the integer part of kA (that is, floor(kA)) and r0 holds the fractional part of kA (kA mod 1 = kA - floor(kA)). Think of the binary point (the analog of the decimal point, but for binary representation) as lying between r1 and r0. Since we don't care about the integer part of kA, we can forget about r1 and just use part of r0.

Since we want floor(m * (kA mod 1)), we could get that value by shifting r0 to the left by p = log2(m) bits (m = 2^p) and then taking the p bits that end up to the left of the binary point. But we don't actually need to shift: those p bits are simply the p most significant bits of r0. So we can just take these bits after having formed r0 by multiplying k by s.

Example: let m = 8 (= 2^3, which implies p = 3), w = 5, and k = 21. We must select s with 0 < s < 2^w, i.e., 0 < s < 32; let's select s = 13. Then

    A = 13/32
    kA = 21 * (13/32) = 273/32 = 8 + 17/32.

We only use the fraction, kA mod 1 = 17/32, to hash, so

    h(k) = floor(m * (kA mod 1)) = floor(8 * 17/32) = floor(17/4) = 4.

Using the word-level implementation: k*s = 21 * 13 = 273 = 8 * 32 + 17, so r1 = 8 and r0 = 17. Written in w = 5 bits, r0 = 10001. Taking the p = 3 most significant bits of r0 gives 100 in binary, or 4 in decimal, so that h(k) = 4.

    w = 5 bits:
        k  = 21  = 10101
        s  = 13  = 01101
        ks = 273 = 01000 10001
        r1 = 01000 (= 8),  r0 = 10001 (= 17)
        h(k) = top p = 3 bits of r0 = 100 (binary) = 4

How to choose A: the multiplication method works with any legal value of A, but it works better with some values than with others, depending on the keys being hashed. Knuth suggests using A ≈ (√5 - 1)/2 ≈ 0.618.
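The bit-level recipe above can be written as a short C function. This is a sketch, not the slides' code: it assumes a word size of w = 32 bits and uses s = 2654435769, which is floor(A * 2^32) for Knuth's suggested A ≈ (√5 - 1)/2; the choice p = 10 (so m = 1024) is arbitrary. A second helper redoes the slides' small w = 5, m = 8, k = 21, s = 13 example the same way and should print 4.

    #include <stdio.h>
    #include <stdint.h>

    /* Multiplication-method hash via the word-level trick: with m = 2^p and
     * A = s / 2^w, compute k*s, keep the low-order w-bit word r0, and take
     * its p most significant bits. */
    #define W 32
    static const uint32_t S = 2654435769u;     /* floor(((sqrt(5)-1)/2) * 2^32) */

    static uint32_t h_mult(uint32_t k, unsigned p) {
        uint64_t product = (uint64_t)k * S;    /* 2w-bit product: r1*2^w + r0 */
        uint32_t r0 = (uint32_t)product;       /* low-order word: fractional part of kA */
        return r0 >> (W - p);                  /* p most significant bits of r0 */
    }

    /* The slides' small example, done the same way with w = 5, m = 8, s = 13. */
    static unsigned h_mult_small(unsigned k) {
        unsigned product = k * 13;             /* 21 * 13 = 273 */
        unsigned r0 = product & 0x1F;          /* low 5 bits: 17 = 10001 in binary */
        return r0 >> (5 - 3);                  /* top 3 bits of r0: 100 in binary = 4 */
    }

    int main(void) {
        printf("small example: h(21) = %u (expected 4)\n", h_mult_small(21));
        for (uint32_t k = 0; k < 5; k++)       /* 32-bit version, table of m = 2^10 slots */
            printf("h_mult(%u) = %u\n", k, h_mult(k, 10));
        return 0;
    }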
