Introduction to Algorithms 6.046J/18.410J
Lecture 7
Prof. Piotr Indyk

Data Structures
Role of data structures: encapsulate data and support certain operations (e.g., INSERT, DELETE, SEARCH).
Our focus: efficiency of the operations.
Algorithms vs. data structures.
Introduction to Algorithms, February 27, 2003, L7.2

Symbol-table problem
Symbol table T holding n records: each record x has a key key[x] and other fields containing satellite data.
Operations on T: INSERT(T, x), DELETE(T, x), SEARCH(T, k).
How should the data structure T be organized?

Direct-access table
IDEA: Suppose that the set of keys is $K \subseteq \{0, 1, \ldots, m-1\}$, and keys are distinct. Set up an array T[0 . . m-1]:
$$T[k] = \begin{cases} x & \text{if } k \in K \text{ and } key[x] = k, \\ \text{NIL} & \text{otherwise.} \end{cases}$$
Then, operations take Θ(1) time.
Problem: The range of keys can be large: 64-bit numbers (which represent 18,446,744,073,709,551,616 different keys), character strings (even larger!).
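The direct-access idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DirectAccessTable` is hypothetical, records are arbitrary Python objects, and `None` stands in for NIL.

```python
class DirectAccessTable:
    """Direct-access table for distinct integer keys in {0, ..., m-1}."""

    def __init__(self, m):
        self.slots = [None] * m      # None plays the role of NIL

    def insert(self, key, record):
        self.slots[key] = record     # Theta(1): the key is the array index

    def delete(self, key):
        self.slots[key] = None

    def search(self, key):
        return self.slots[key]       # the record, or None if absent


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
print(table.search(3))  # record with key 3
print(table.search(5))  # None
```

The array needs one slot per possible key, which is exactly why the approach breaks down for 64-bit keys or strings.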

Hash functions
Solution: Use a hash function h to map the universe U of all keys into $\{0, 1, \ldots, m-1\}$.
As each key is inserted, h maps it to a slot of T.
When a record to be inserted maps to an already occupied slot in T, a collision occurs.
[Figure: keys $k_1, \ldots, k_5$ from $K \subseteq U$ mapped by h to slots 0 through m-1 of T, with $h(k_2) = h(k_5)$.]

Resolving collisions by chaining
Records in the same slot are linked into a list.
[Figure: slot i of T points to a linked list holding 49, 86, and 52, where h(49) = h(86) = h(52) = i.]
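A minimal sketch of chaining, under assumptions that are not in the slides: the class name `ChainedHashTable` is hypothetical, Python lists stand in for linked lists, and the hash function is simply `key % m`.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list of (key, value) pairs."""

    def __init__(self, m):
        self.m = m
        self.slots = [[] for _ in range(m)]   # one chain per slot

    def _h(self, key):
        return key % self.m                   # assumed hash function

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)       # overwrite an existing key
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None                           # key not present

    def delete(self, key):
        h = self._h(key)
        self.slots[h] = [(k, v) for k, v in self.slots[h] if k != key]


t = ChainedHashTable(9)
for key, value in [(5, "a"), (14, "b"), (23, "c")]:
    t.insert(key, value)                      # all three keys land in slot 5
print(len(t.slots[5]))  # 3
print(t.search(14))     # b
```

Search cost is proportional to the length of the chain in the probed slot, which is what the analysis of chaining measures.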

Analysis of chaining
We make the assumption of simple uniform hashing: Each key k in the set K of keys is equally likely to be hashed to any slot of table T, independent of where other keys are hashed.
Let n be the number of keys in the table, and let m be the number of slots.
Define the load factor of T to be α = n/m = average number of keys per slot.

Search cost
Expected time to search for a record with a given key = Θ(1 + α): the 1 accounts for applying the hash function and accessing the slot, and the α for searching the list.
Expected search time = Θ(1) if α = O(1), or equivalently, if n = O(m).

Choosing a hash function
The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice as long as their deficiencies can be avoided.
Desiderata: A good hash function should distribute the keys uniformly into the slots of the table. Regularity in the key distribution should not affect this uniformity.

Division method
Assume all keys are integers, and define h(k) = k mod m.
Deficiency: Don't pick an m that has a small divisor d. A preponderance of keys that are congruent modulo d can adversely affect uniformity.
Extreme deficiency: If $m = 2^r$, then the hash doesn't even depend on all the bits of k: if $k = 1011000111011010_2$ and r = 6, then $h(k) = 011010_2$, the low-order 6 bits of k.
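Both deficiencies are easy to observe directly. The sketch below is illustrative only; `h_division` and `slots_used` are hypothetical helper names, and the sample keys (a 16-bit key and the even numbers below 100) are assumptions chosen for the demonstration.

```python
def h_division(k, m):
    """Division-method hash: h(k) = k mod m."""
    return k % m


def slots_used(keys, m):
    """Set of slots that the given keys occupy under the division method."""
    return {h_division(k, m) for k in keys}


# Extreme deficiency: with m = 2^r only the low-order r bits of k matter.
k = 0b1011000111011010
r = 6
print(format(h_division(k, 2 ** r), "06b"))                    # 011010
flipped = k ^ (1 << 10)                                        # change a high bit
print(h_division(flipped, 2 ** r) == h_division(k, 2 ** r))    # True

# Small divisor: even keys with m = 10 can only reach the even slots.
even_keys = range(0, 100, 2)
print(sorted(slots_used(even_keys, 10)))   # [0, 2, 4, 6, 8]
print(len(slots_used(even_keys, 11)))      # 11
```

With the prime m = 11 the same keys spread over every slot, which is the motivation for choosing a prime table size.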

Division method (continued)
h(k) = k mod m.
Pick m to be a prime not too close to a power of 2 or 10 and not otherwise used prominently in the computing environment.
Annoyance: Sometimes, making the table size a prime is inconvenient.
But this method is popular, although the next method we'll see is usually superior.

Multiplication method
Assume that all keys are integers, $m = 2^r$, and our computer has w-bit words. Define
$$h(k) = (A \cdot k \bmod 2^w) \;\text{rsh}\; (w - r),$$
where rsh is the bitwise right-shift operator and A is an odd integer in the range $2^{w-1} < A < 2^w$.
Don't pick A too close to $2^w$.
Multiplication modulo $2^w$ is fast. The rsh operator is fast.
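
Multiplication method example
$h(k) = (A \cdot k \bmod 2^w) \;\text{rsh}\; (w - r)$
Suppose that $m = 8 = 2^3$ and that our computer has w = 7-bit words:
         1011001 = A
       x 1101011 = k
  10010100110011
The low-order 7 bits of the product are 0110011, and their top r = 3 bits give $h(k) = 011_2 = 3$.
[Figure: modular wheel with positions 0 through 7, marking where A, 2A, and 3A land.]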
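The multiplication method maps directly onto integer arithmetic. The following sketch uses a hypothetical function name `h_mult`; the values A = 89 and k = 107 are the 7-bit operands 1011001 and 1101011 written in decimal.

```python
def h_mult(k, A, w, r):
    """Multiplication-method hash: (A * k mod 2^w) rsh (w - r)."""
    return ((A * k) % (1 << w)) >> (w - r)


A = 0b1011001   # 89, an odd integer with 2^6 < A < 2^7
k = 0b1101011   # 107
print(format(A * k, "b"))         # 10010100110011
print(h_mult(k, A, w=7, r=3))     # 3
```

On real hardware the reduction modulo $2^w$ comes for free from word-sized multiplication overflow; Python integers are unbounded, so the sketch applies the modulus explicitly.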

Dot-product method
Randomized strategy: Let m be prime. Decompose key k into r + 1 digits, each with value in the set $\{0, 1, \ldots, m-1\}$. That is, let $k = \langle k_0, k_1, \ldots, k_r \rangle$, where $0 \le k_i < m$.
Pick $a = \langle a_0, a_1, \ldots, a_r \rangle$ where each $a_i$ is chosen randomly from $\{0, 1, \ldots, m-1\}$.
Define
$$h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m.$$
Excellent in practice, but expensive to compute.

A weakness of hashing as we saw it
Problem: For any hash function h, a set of keys exists that can cause the average access time of a hash table to skyrocket. An adversary can pick all keys from $\{k \in U : h(k) = i\}$ for some slot i.
IDEA: Choose the hash function at random, independently of the keys. Even if an adversary can see your code, he or she cannot find a bad set of keys, since he or she doesn't know exactly which hash function will be chosen.

Universal hashing
Definition. Let U be a universe of keys, and let H be a finite collection of hash functions, each mapping U to $\{0, 1, \ldots, m-1\}$. We say H is universal if for all $x, y \in U$, where $x \ne y$, we have
$$|\{h \in H : h(x) = h(y)\}| = |H|/m.$$
That is, the chance of a collision between x and y is 1/m if we choose h randomly from H.
[Figure: the set H, with the subset $\{h : h(x) = h(y)\}$ of size $|H|/m$ marked.]

Universality is good
Theorem. Let h be a hash function chosen (uniformly) at random from a universal set H of hash functions. Suppose h is used to hash n arbitrary keys into the m slots of a table T. Then, for a given key x, we have
E[#collisions with x] < n/m.

Proof of theorem
Proof. Let $C_x$ be the random variable denoting the total number of collisions of keys in T with x, and let
$$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$
Note: $E[c_{xy}] = 1/m$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$.

Proof (continued)
$$E[C_x] = E\Big[\sum_{y \in T - \{x\}} c_{xy}\Big] \quad \text{(take expectation of both sides)}$$
$$= \sum_{y \in T - \{x\}} E[c_{xy}] \quad \text{(linearity of expectation)}$$
$$= \sum_{y \in T - \{x\}} \frac{1}{m} \quad (E[c_{xy}] = 1/m)$$
$$= \frac{n-1}{m} \quad \text{(algebra)},$$
which is less than n/m, as claimed.

Constructing a set of universal hash functions
Let m be prime. Decompose key k into r + 1 digits, each with value in the set $\{0, 1, \ldots, m-1\}$. That is, let $k = \langle k_0, k_1, \ldots, k_r \rangle$, where $0 \le k_i < m$.
Randomized strategy: Pick $a = \langle a_0, a_1, \ldots, a_r \rangle$ where each $a_i$ is chosen randomly from $\{0, 1, \ldots, m-1\}$.
Define the dot product, modulo m:
$$h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m.$$
How big is $H = \{h_a\}$? $|H| = m^{r+1}$. REMEMBER THIS!
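A sketch of drawing one function from this family. The helper names `key_digits` and `make_dot_product_hash` are hypothetical, and the decomposition of an integer key into base-m digits (least significant first) is an assumption; the slides only require r + 1 digits with values below m.

```python
import random


def key_digits(k, m, r):
    """Base-m digits <k_0, ..., k_r> of the integer k, least significant first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    if k != 0:
        raise ValueError("key does not fit in r + 1 base-m digits")
    return digits


def make_dot_product_hash(m, r, rng=random):
    """Pick a = <a_0, ..., a_r> at random and return (h_a, a); m should be prime."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        # Dot product of a with the digits of k, modulo m.
        return sum(ai * ki for ai, ki in zip(a, key_digits(k, m, r))) % m

    return h, a


h, a = make_dot_product_hash(m=7, r=2)
print(key_digits(10, 7, 2))   # [3, 1, 0]
print(0 <= h(10) < 7)         # True
```

Each call to `make_dot_product_hash` selects one of the $m^{r+1}$ possible vectors a, so the choice is independent of the keys that are hashed later.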

Universality of dot-product hash functions
Theorem. The set $H = \{h_a\}$ is universal.
Proof. Suppose that $x = \langle x_0, x_1, \ldots, x_r \rangle$ and $y = \langle y_0, y_1, \ldots, y_r \rangle$ are distinct keys. Thus, they differ in at least one digit position, wlog position 0. For how many $h_a \in H$ do x and y collide?
We must have $h_a(x) = h_a(y)$, which implies that
$$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}.$$

Proof (continued)
Equivalently, we have
$$\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$$
or
$$a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m},$$
which implies that
$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}.$$

Fact from number theory
Theorem. Let m be prime. For any $z \in Z_m$ such that $z \ne 0$, there exists a unique $z^{-1} \in Z_m$ such that $z \cdot z^{-1} \equiv 1 \pmod{m}$.
Example: m = 7.
  z       1  2  3  4  5  6
  z^{-1}  1  4  5  2  3  6
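The inverse table can be reproduced mechanically. The sketch below computes inverses with Fermat's little theorem ($z^{m-2} \equiv z^{-1} \pmod{m}$ for prime m), which is a technique swapped in for illustration rather than something the slides prescribe; `inverse_table` is a hypothetical name.

```python
def inverse_table(m):
    """Map each nonzero z in Z_m to its inverse, for prime m (Fermat's little theorem)."""
    return {z: pow(z, m - 2, m) for z in range(1, m)}


table = inverse_table(7)
print(table)  # {1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6}
print(all((z * inv) % 7 == 1 for z, inv in table.items()))  # True
```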

Back to the proof
We have
$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},$$
and since $x_0 \ne y_0$, an inverse $(x_0 - y_0)^{-1}$ must exist, which implies that
$$a_0 \equiv \Big(-\sum_{i=1}^{r} a_i (x_i - y_i)\Big) \cdot (x_0 - y_0)^{-1} \pmod{m}.$$
Thus, for any choices of $a_1, a_2, \ldots, a_r$, exactly one choice of $a_0$ causes x and y to collide.

Proof (completed)
Q. How many $h_a$'s cause x and y to collide?
A. There are m choices for each of $a_1, a_2, \ldots, a_r$, but once these are chosen, exactly one choice for $a_0$ causes x and y to collide, namely
$$a_0 = \Big(\Big(-\sum_{i=1}^{r} a_i (x_i - y_i)\Big) \cdot (x_0 - y_0)^{-1}\Big) \bmod m.$$
Thus, the number of $h_a$'s that cause x and y to collide is $m^r \cdot 1 = m^r = |H|/m$.
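For small parameters the count can be checked by brute force. This sketch enumerates every vector a and counts how many make two distinct keys collide; `colliding_count` is a hypothetical name, and keys are passed directly as digit tuples.

```python
from itertools import product


def colliding_count(x, y, m):
    """Number of vectors a with h_a(x) == h_a(y); x and y are digit tuples."""
    count = 0
    for a in product(range(m), repeat=len(x)):
        hx = sum(ai * xi for ai, xi in zip(a, x)) % m
        hy = sum(ai * yi for ai, yi in zip(a, y)) % m
        if hx == hy:
            count += 1
    return count


m, r = 5, 2
x, y = (1, 2, 3), (4, 2, 0)              # r + 1 = 3 digits each, x != y
print(colliding_count(x, y, m), m ** r)  # 25 25
```

Out of the $m^{r+1} = 125$ functions, exactly $m^r = 25$ collide on this pair, matching $|H|/m$.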

Perfect hashing
Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes Θ(1) time in the worst case.
IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!
[Figure: level-1 table T with slots 0 through 6; each nonempty slot points to a level-2 table S that stores its size, its hash parameter a, and its keys without collisions.]

Collisions at level 2
Theorem. Let H be a class of universal hash functions for a table of size $m = n^2$. Then, if we use a random $h \in H$ to hash n keys into the table, the expected number of collisions is at most 1/2.
Proof. By the definition of universality, the probability that 2 given keys in the table collide under h is $1/m = 1/n^2$. Since there are $\binom{n}{2}$ pairs of keys that can possibly collide, the expected number of collisions is
$$\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.$$

No collisions at level 2
Corollary. The probability of no collisions is at least 1/2.
Proof. Markov's inequality says that for any nonnegative random variable X, we have $\Pr\{X \ge t\} \le E[X]/t$. Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.
Thus, just by testing random hash functions in H, we'll quickly find one that works.

Analysis of storage
For the level-1 hash table T, choose m = n, and let $n_i$ be the random variable for the number of keys that hash to slot i in T. By using $n_i^2$ slots for the level-2 hash table $S_i$, the expected total storage required for the two-level scheme is therefore
$$E\Big[\sum_{i=0}^{m-1} \Theta(n_i^2)\Big] = \Theta(n),$$
since the analysis is identical to the analysis from recitation of the expected running time of bucket sort. (For a probability bound, apply Markov.)
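The two-level scheme can be sketched as follows. Several choices here are assumptions rather than part of the lecture: the hash family is the standard $((a k + b) \bmod P) \bmod m$ construction (used because level-2 table sizes $n_i^2$ are generally not prime, which the dot-product family requires), keys are nonnegative integers below the prime P, the key set is nonempty, and all function names are hypothetical. The sketch does not retry level 1 when total storage turns out large.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m with random a, b (assumed family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m


def build_perfect_table(keys, rng):
    keys = list(set(keys))              # the scheme needs distinct keys
    n = len(keys)
    h1 = random_hash(n, rng)            # level 1: m = n slots
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)

    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2         # level 2: n_i^2 slots
        while True:                     # retry until collision-free
            h2 = random_hash(size, rng) if size else None
            slots = [None] * size
            for k in bucket:
                if slots[h2(k)] is not None:
                    break               # collision: draw another h2
                slots[h2(k)] = k
            else:
                break                   # every key placed without collision
        level2.append((h2, slots))
    return h1, level2


def perfect_search(table, k):
    """Worst-case constant time: one level-1 hash, one level-2 hash."""
    h1, level2 = table
    h2, slots = level2[h1(k)]
    return bool(slots) and slots[h2(k)] == k


table = build_perfect_table([10, 22, 37, 40, 52, 60, 70, 72, 75], random.Random(1))
print(perfect_search(table, 37))  # True
print(perfect_search(table, 38))  # False
```

Because each level-2 draw succeeds with probability at least 1/2, the retry loop is expected to run at most twice per bucket.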

Resolving collisions by open addressing
No storage is used outside of the hash table itself.
Insertion systematically probes the table until an empty slot is found.
The hash function depends on both the key and probe number:
$$h : U \times \{0, 1, \ldots, m-1\} \to \{0, 1, \ldots, m-1\}.$$
The probe sequence $\langle h(k,0), h(k,1), \ldots, h(k,m-1) \rangle$ should be a permutation of $\{0, 1, \ldots, m-1\}$.
The table may fill up, and deletion is difficult (but not impossible).

Example of open addressing
Insert key k = 496:
0. Probe h(496,0): collision.
1. Probe h(496,1): collision.
2. Probe h(496,2): the slot is empty, so 496 is inserted there.
[Figure: table T with slots 0 through m-1 holding four existing keys; the first two probes hit occupied slots and the third finds an empty one.]

Example of open addressing
Search for key k = 496:
0. Probe h(496,0)
1. Probe h(496,1)
2. Probe h(496,2)
Search uses the same probe sequence, terminating successfully if it finds the key and unsuccessfully if it encounters an empty slot.
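Insertion and search under open addressing can be sketched as below. The probe order used here (start at k mod m and step by one slot) is an assumption made to keep the example short, and the function names are hypothetical; deletion is omitted.

```python
def probe_sequence(k, m):
    """Assumed probe order for the sketch: start at k mod m and step by 1."""
    return [(k + i) % m for i in range(m)]


def oa_insert(table, k):
    """Place k in the first empty slot along its probe sequence; return the slot."""
    for slot in probe_sequence(k, len(table)):
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("hash table is full")


def oa_search(table, k):
    """Follow the same probe sequence; stop at the key or at an empty slot."""
    for slot in probe_sequence(k, len(table)):
        if table[slot] is None:
            return None          # empty slot: unsuccessful search
        if table[slot] == k:
            return slot
    return None                  # probed every slot without finding k


table = [None] * 8
oa_insert(table, 8)              # lands in slot 0
oa_insert(table, 16)             # slot 0 is taken, lands in slot 1
print(oa_insert(table, 496))     # 2 (collisions at slots 0 and 1 first)
print(oa_search(table, 496))     # 2
print(oa_search(table, 24))      # None
```

As in the slide example, key 496 collides on its first two probes and is inserted on the third.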

Probing strategies
Linear probing: Given an ordinary hash function h'(k), linear probing uses the hash function
h(k,i) = (h'(k) + i) mod m.
This method, though simple, suffers from primary clustering, where long runs of occupied slots build up, increasing the average search time. Moreover, the long runs of occupied slots tend to get longer.

Probing strategies
Double hashing: Given two ordinary hash functions $h_1(k)$ and $h_2(k)$, double hashing uses the hash function
$$h(k,i) = (h_1(k) + i \cdot h_2(k)) \bmod m.$$
This method generally produces excellent results, but $h_2(k)$ must be relatively prime to m. One way is to make m a power of 2 and design $h_2(k)$ to produce only odd numbers.
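The relative-primality requirement can be seen by listing probe sequences. In this sketch the particular `h1` and `h2` are assumptions chosen only so that `h2` is always odd; `double_hash_probes` is a hypothetical name.

```python
def double_hash_probes(k, m, h1, h2):
    """Probe sequence h(k, i) = (h1(k) + i * h2(k)) mod m for i = 0..m-1."""
    return [(h1(k) + i * h2(k)) % m for i in range(m)]


M = 16  # a power of 2


def h1(k):
    return k % M


def h2(k):
    return (2 * (k // M) + 1) % M   # always odd, so relatively prime to M


probes = double_hash_probes(496, M, h1, h2)
print(sorted(probes) == list(range(M)))   # True: every slot is visited once

# An even step size shares a factor with M and skips most of the table.
bad = double_hash_probes(496, M, h1, lambda k: 4)
print(sorted(set(bad)))                   # [0, 4, 8, 12]
```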

Analysis of open addressing
We make the assumption of uniform hashing: Each key is equally likely to have any one of the m! permutations as its probe sequence.
Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 - α).

Proof of the theorem
Proof. At least one probe is always necessary.
With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
With probability (n-1)/(m-1), the second probe hits an occupied slot, and a third probe is necessary.
With probability (n-2)/(m-2), the third probe hits an occupied slot, etc.
Observe that $\frac{n-i}{m-i} < \frac{n}{m} = \alpha$ for $i = 1, 2, \ldots, n$.
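
Proof (continued)
Therefore, the expected number of probes is
$$1 + \frac{n}{m}\Big(1 + \frac{n-1}{m-1}\Big(1 + \frac{n-2}{m-2}\Big(\cdots\Big(1 + \frac{1}{m-n+1}\Big)\cdots\Big)\Big)\Big)$$
$$\le 1 + \alpha(1 + \alpha(1 + \alpha(\cdots(1 + \alpha)\cdots)))$$
$$\le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots = \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1 - \alpha}.$$
The textbook has a more rigorous proof.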

Implications of the theorem
If α is constant, then accessing an open-addressed hash table takes constant time.
If the table is half full, then the expected number of probes is 1/(1 - 0.5) = 2.
If the table is 90% full, then the expected number of probes is 1/(1 - 0.9) = 10.
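The bound and the nested expression from the proof can be compared numerically. This is a small sketch with hypothetical function names; `expected_probes` evaluates the nested product 1 + n/m (1 + (n-1)/(m-1) (1 + ...)) from the inside out and assumes n < m.

```python
def probes_bound(alpha):
    """Upper bound 1 / (1 - alpha) on expected probes in an unsuccessful search."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1 / (1 - alpha)


def expected_probes(n, m):
    """Evaluate 1 + n/m (1 + (n-1)/(m-1) (1 + ...)) from the inside out."""
    total = 1.0
    for i in range(n - 1, -1, -1):          # i = n-1, ..., 0
        total = 1.0 + (n - i) / (m - i) * total
    return total


for n, m in [(5, 10), (9, 10), (900, 1000)]:
    alpha = n / m
    print(f"alpha={alpha:.1f}  nested={expected_probes(n, m):.3f}  "
          f"bound={probes_bound(alpha):.1f}")
```

For every load factor shown, the nested value stays below the bound, and the gap narrows as the table grows with α held fixed.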