Hashing and Amortization


Lecture 10: Hashing and Amortization
Supplemental reading in CLRS: Chapter 11; Chapter 17 intro; Section 17.1

10.1 Arrays and Hashing

Arrays are very useful. The items in an array are statically addressed, so that inserting, deleting, and looking up an element each take O(1) time. Thus, arrays are a terrific way to encode functions {1, ..., n} → T, where T is some range of values and n is known ahead of time. For example, taking T = {0, 1}, we find that an array A of n bits is a great way to store a subset of {1, ..., n}: we set A[i] = 1 if and only if i is in the set (see Figure 10.1). Or, interpreting the bits as binary digits, we can use an n-bit array to store an integer between 0 and 2^n − 1. In this way, we will often identify the set {0,1}^n with the set {0, ..., 2^n − 1}.

[Figure 10.1: a 12-bit array A = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0] encoding the set {2, 4, 5, 8, 11} ⊆ {1, ..., 12}.]

What if we wanted to encode subsets of an arbitrary domain U, rather than just {1, ..., n}? Or, to put things differently, what if we wanted a keyed (or associative) array, where the keys could be arbitrary strings? While the workings of such data structures (such as dictionaries in Python) are abstracted away in many programming languages, there is usually an array-based solution working behind the scenes. Implementing associative arrays amounts to finding a way to turn a key into an array index. Thus, we are looking for a suitable function U → {1, ..., n}, called a hash function. Equipped with this function, we can perform key lookup:

    U --(hash function)--> {1, ..., n} --(array lookup)--> T

(see Figure 10.2). This particular implementation of associative arrays is called a hash table.

[Figure 10.2: an associative array with keys in U and values in T, implemented as an array of (key, value) pairs equipped with a hash function h : U → {1, ..., n}; in the picture, h(key_3) = 3, h(key_1) = 5, and h(key_2) = 6.]

There is a problem, however. Typically, the domain U is much larger than {1, ..., n}. For any hash function h : U → {1, ..., n}, there is some i such that at least |U|/n elements are mapped to i. The set h^{-1}(i) of all elements mapped to i is called the load on i, and when this load contains more than one of the keys we are trying to store in our hash table, we say there is a collision at i. Collisions are a problem for us: if two keys map to the same index, then what should we store at that index? We have to store both values somehow. For now let's say we do this in the simplest way possible: storing at each index i of the array a linked list (or, more abstractly, some sort of bucket-like object) consisting of all values whose keys are mapped to i. Thus, lookup takes O(1 + |h^{-1}(i)|) time, which may be poor if there are collisions at i.

(If you are expecting lots of collisions, a more efficient way to handle things is to create a two-layered hash table, where each element of A is itself a hash table with its own, different hash function. In order to have a collision in a two-layer hash table, the same pair of keys must collide under two different hash functions. If the hash functions are chosen well, e.g., if they are chosen randomly, then this is extremely unlikely. Of course, if you want to be even more sure that collisions won't occur, you can make a three-layer hash table, and so on. There is a trade-off, though: introducing unnecessary layers of hashing comes with a time and space overhead which, while it may not show up in the big-O analysis, makes a difference in practical applications.)

Rather than thinking about efficient ways to handle collisions, let's try to reason about the probability of having collisions if we choose our hash functions well.

10.2 Hash Families

Without any prior information about which elements of U will occur as keys, the best we can do is to choose our hash function h at random from a suitable hash family. A hash family on U is a set H of functions U → {1, ..., n}. Technically speaking, H should come equipped with a probability distribution, but usually we just take the uniform distribution on H, so that each hash function is equally likely to be chosen. If we want to avoid collisions, it is reasonable to hope that, for any fixed x_1, x_2 ∈ U (x_1 ≠ x_2), the values h(x_1) and h(x_2) are completely uncorrelated as h ranges through the sample space H. This leads to the following definition:

Definition. A hash family H on U is said to be universal if, for any x_1, x_2 ∈ U (x_1 ≠ x_2), we have

    Pr[h(x_1) = h(x_2)] ≤ 1/n.
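For a small universe, the definition above can be checked by brute force. The sketch below (parameters are my own choices) verifies it for the family of all functions U → {0, ..., n−1}, where for every pair x_1 ≠ x_2 the fraction of functions with h(x_1) = h(x_2) is exactly 1/n:

```python
import itertools

# Brute-force check of universality on a toy universe: a family is
# universal if, for every pair of distinct keys, the fraction of
# functions in the family mapping them to the same slot is <= 1/n.
# U and n below are arbitrary small parameters for illustration.

U = [0, 1, 2, 3]          # universe of keys
n = 3                     # table size

def is_universal(family, universe, n):
    for x1, x2 in itertools.combinations(universe, 2):
        collisions = sum(1 for h in family if h[x1] == h[x2])
        if collisions / len(family) > 1 / n:
            return False
    return True

# The family of ALL functions U -> {0, ..., n-1}, each encoded as a dict:
all_functions = [dict(zip(U, vals))
                 for vals in itertools.product(range(n), repeat=len(U))]
```

By contrast, a family of constant functions collides with probability 1 on every pair, so the same checker rejects it.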

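The object these guarantees are about is the chained table of Section 10.1 with h drawn at random. A minimal sketch (the class and its names are my own; a lazily sampled random mapping stands in for a structured universal family):

```python
import random

# A chained hash table: each of the n slots holds a bucket (a Python
# list standing in for the linked list) of (key, value) pairs whose
# keys hash to that slot. Sampling h(key) uniformly on first use
# simulates drawing h from the family of all functions U -> {1,...,n}.

class RandomChainedTable:
    def __init__(self, n=8, seed=None):
        self.n = n
        self.buckets = [[] for _ in range(n)]
        self.rng = random.Random(seed)
        self.h = {}                        # lazily sampled random function

    def _index(self, key):
        if key not in self.h:
            self.h[key] = self.rng.randrange(self.n)
        return self.h[key]

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for j, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: overwrite
                bucket[j] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        # Cost is O(1 + load on h(key)): only one bucket is scanned.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```

With this cost model, the propositions below bound the expected bucket length scanned by `lookup`.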
Similarly, H is said to be ε-universal if for any x_1 ≠ x_2 we have Pr[h(x_1) = h(x_2)] ≤ ε.

The consequences of the above hypotheses with regard to collisions are as follows:

Proposition 10.1. Let H be a universal hash family on U. Fix some subset S ⊆ U and some element x ∈ U. Pick h ∈ H at random. Then the expected number of elements of S that map to h(x) is at most 1 + |S|/n. In symbols,

    E_h[ |h^{-1}(h(x)) ∩ S| ] ≤ 1 + |S|/n.

If H is ε-universal rather than universal, then the same holds with 1 + |S|/n replaced by 1 + ε|S|.

Proof. For a proposition φ with random parameters, let I_φ be the indicator random variable which equals 1 if φ is true and equals 0 otherwise. The fact that H is universal means that for each x' ∈ U \ {x} we have E[I_{h(x')=h(x)}] ≤ 1/n. Thus by the linearity of expectation, we have

    E_h[ |h^{-1}(h(x)) ∩ S| ] = E[ I_{x∈S} + Σ_{x'∈S, x'≠x} I_{h(x')=h(x)} ]
                              = I_{x∈S} + Σ_{x'∈S, x'≠x} E[ I_{h(x')=h(x)} ]
                              ≤ 1 + |S|/n.

The reasoning is almost identical when H is ε-universal rather than universal.

Corollary 10.2. For a hash table in which the hash function is chosen from a universal family, insertion, deletion, and lookup have expected running time O(1 + |S|/n), where S ⊆ U is the set of keys which actually occur. If instead the hash family is ε-universal, then the operations have expected running time O(1 + ε|S|).

Corollary 10.3. Consider a hash table of size n with keys in U, whose hash function is chosen from a universal hash family. Let S ⊆ U be the set of keys which actually occur. If |S| = O(n), then insertion, deletion, and lookup have expected running time O(1).

Let H be a universal hash family on U. If |S| = O(n), then the expected load on each index is O(1). Does this mean that a typical hash table has O(1) load at each index? Surprisingly, the answer is no, even when the hash function is chosen well. We'll see this below when we look at examples of universal hash families.

Examples 10.4.

1. The set of all functions h : U → {1, ..., n} is certainly universal. In fact, we could not hope to get any more balanced than this:

- For any x ∈ U, the random variable h(x) (where h is chosen at random) is uniformly distributed on the set {1, ..., n}.
- For any pair x_1 ≠ x_2, the random variables h(x_1), h(x_2) are independent. In fact, for any finite subset {x_1, ..., x_k} ⊆ U, the tuple (h(x_1), ..., h(x_k)) is uniformly distributed on {1, ..., n}^k.
- The load on each index i is a binomial random variable with parameters (|S|, 1/n).

Fact. When p is small and N is large enough that Np is moderately sized, the binomial distribution with parameters (N, p) is approximated by the Poisson distribution with parameter Np. That is, if X is a binomial random variable with parameters (N, p), then

    Pr[X = k] ≈ (Np)^k e^{−Np} / k!.

In our case, N = |S| and p = 1/n. Thus, if L_i is the load on index i, then

    Pr[L_i = k] ≈ (|S|/n)^k e^{−|S|/n} / k!.

For example, if |S| = n, then

    Pr[L_i = 0] ≈ e^{−1} ≈ 0.3679,
    Pr[L_i = 1] ≈ e^{−1} ≈ 0.3679,
    Pr[L_i = 2] ≈ (1/2) e^{−1} ≈ 0.1839,
    ...

Further calculation shows that, when |S| = n, we have

    E[ max_i L_i ] = Θ(lg n / lg lg n).

Moreover, with high probability, max_i L_i does not exceed O(lg n / lg lg n). Thus, a typical hash table with |S| = n and h chosen uniformly from the set of all functions looks like Figure 10.3: about 37% of the buckets are empty, about 37% of the buckets have one element, and about 26% of the buckets have more than one element, including some buckets with Θ(lg n / lg lg n) elements.

2. In Problem Set 4 we considered the hash family

    H = { h_p : p ≤ k and p is prime },

where h_p : {0, ..., 2^m − 1} → {0, ..., k − 1} is the function h_p(x) = x mod p. In Problem 4(a) you proved that, for each x ≠ y, we have Pr[h_p(x) = h_p(y)] ≤ ml/k.
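The roughly 37% / 37% / 26% split claimed for Example 1 is easy to observe empirically. A short experiment (n = 10,000 and the seed are arbitrary choices) throws |S| = n keys into n buckets with a uniformly random function and measures the fraction of empty and singleton buckets, which should both be close to 1/e ≈ 0.3679:

```python
import math
import random

# Empirical load distribution for |S| = n keys hashed by a uniformly
# random function into n buckets. By the Poisson approximation above,
# Pr[L_i = 0] and Pr[L_i = 1] are each about e^{-1}.

def load_fractions(n=10_000, seed=1):
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):                 # |S| = n keys
        loads[rng.randrange(n)] += 1   # each key lands in a uniform bucket
    empty = loads.count(0) / n
    single = loads.count(1) / n
    return empty, single

empty, single = load_fractions()
```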

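The mod-a-random-prime family of Example 2 can also be probed numerically. The key fact behind its analysis is that h_p(x) = h_p(y) forces p to divide x − y, and an m-bit difference has fewer than m prime factors, so only a small fraction of primes are "bad" for any fixed pair. The sketch below (the particular x, y, and k are arbitrary choices of mine) computes that fraction exactly:

```python
# For the family h_p(x) = x mod p with p a uniformly random prime <= k,
# the collision probability of a fixed pair (x, y) is the fraction of
# primes p <= k dividing x - y.

def primes_up_to(k):
    """Sieve of Eratosthenes: all primes <= k."""
    sieve = [True] * (k + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(k ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, k + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def collision_fraction(x, y, k):
    """Fraction of primes p <= k with x mod p == y mod p."""
    ps = primes_up_to(k)
    bad = sum(1 for p in ps if (x - y) % p == 0)
    return bad / len(ps)

frac = collision_fraction(123456789, 987654321, 1000)
```

Since a number below 2^30 has at most nine distinct prime factors while there are 168 primes up to 1000, the measured fraction is small, as the bound predicts.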
[Figure 10.3: a typical hash table with |S| = n and h chosen uniformly from the family of all functions U → {1, ..., n}; maximum load Θ(lg n / lg lg n).]

3. In Problem Set 5, we fixed a prime p and considered the hash family

    H = { h_a : a ∈ Z_p^m },

where h_a : Z_p^m → Z_p is the dot product

    h_a(x) = x · a = Σ_{i=1}^m x_i a_i  (mod p).

4. In Problem Set 6, we fixed a prime p and positive integers m and k and considered the hash family

    H = { h_A : A ∈ Z_p^{k×m} },

where h_A : Z_p^m → Z_p^k is the function h_A(x) = Ax.

5. If H_1 is an ε_1-universal hash family of functions {0,1}^m → {0,1}^k and H_2 is an ε_2-universal hash family of functions {0,1}^k → {0,1}^l, then

    H = H_2 ∘ H_1 = { h_2 ∘ h_1 : h_1 ∈ H_1, h_2 ∈ H_2 }

is an (ε_1 + ε_2)-universal hash family of functions {0,1}^m → {0,1}^l. (To fully specify H, we have to give not just a set but also a probability distribution. The hash families H_1 and H_2 come with probability distributions, so there is an induced distribution on H_1 × H_2. We then equip H with the distribution induced by the map H_1 × H_2 → H, (h_1, h_2) ↦ h_2 ∘ h_1. You could consider this a mathematical technicality if you wish: if H_1 and H_2 are given uniform distributions, as they typically are, then the distribution on H_1 × H_2 is also uniform. The distribution on H need not be uniform, however: an element of H is more likely to be chosen if it can be expressed in multiple ways as the composition of an element of H_2 with an element of H_1.) To see this, note that for any x ≠ x', the union bound gives

    Pr_{h_1∈H_1, h_2∈H_2}[ h_2(h_1(x)) = h_2(h_1(x')) ]
        = Pr[ h_1(x) = h_1(x') or (h_1(x) ≠ h_1(x') and h_2(h_1(x)) = h_2(h_1(x'))) ]
        ≤ Pr[ h_1(x) = h_1(x') ] + Pr[ h_1(x) ≠ h_1(x') and h_2(h_1(x)) = h_2(h_1(x')) ]
        ≤ ε_1 + ε_2.

In choosing the parameters to build a hash table, there is a tradeoff. Making n larger decreases the likelihood of collisions, and thus decreases the expected running time of operations on the table, but also requires the allocation of more memory, much of which is not even used to store data. In situations where avoiding collisions is worth the memory cost (or in applications other than hash tables, when the corresponding tradeoff is worth it), we can make n much larger than |S|.

Proposition 10.5. Let H be a universal hash family U → {1, ..., n}. Let S ⊆ U be the set of keys that occur. Then the expected number of collisions is at most (|S| choose 2)/n. In symbols,

    E[ Σ_{{x, x'} ⊆ S, x ≠ x'} I_{h(x)=h(x')} ] ≤ (|S| choose 2) · (1/n),

where the sum runs over unordered pairs of distinct elements of S.

Proof. There are (|S| choose 2) pairs of distinct elements of S, and each pair has probability at most 1/n of causing a collision. The result follows from linearity of expectation.

Corollary 10.6. If n ≥ |S|^2, then the expected number of collisions is less than 1/2, and the probability that a collision exists is less than 1/2.

Proof. Apply the Markov bound.

Thus, if n is sufficiently large compared to |S|, a typical hash table consists mostly of empty buckets, and with high probability there is at most one element in each bucket. As we mentioned above, choosing a large n for a hash table is expensive in terms of space. While the competing goals of fast table operations and low storage cost are a fact of life if nothing is known about S in advance, we will see in recitation that, if S is known in advance, it is feasible to construct a perfect hash table, i.e., a hash table in which there are no collisions. Of course, the smallest value of n for which this is possible is n = |S|. As we will see in recitation, there are reasonably efficient algorithms to construct a perfect hash table with n = O(|S|).

10.3 Amortization

What if the size of S is not known in advance? In order to allocate the array for a hash table, we must choose the size at creation time, and may not change it later.
If |S| turns out to be significantly greater than n, then there will always be lots of collisions, no matter which hash function we choose. Luckily, there is a simple and elegant solution to this problem: table doubling. The idea is to start with some particular table size n = O(1). If the table gets filled, simply create a new table of size 2n and migrate all the old elements to it. While this migration operation is costly, it happens infrequently enough that, on the whole, the strategy of table doubling is efficient.

Let's take a closer look. To simplify matters, let's assume that only insertions and lookups occur, with no deletions. What is the worst-case cost of a single operation on the hash table?

Lookup: O(1), as usual.
Insertion: O(n), if we have to double the table.

Thus, the worst-case total running time of k operations (k = |S|) on the hash table is

    O(1 + 2 + ··· + k) = O(k^2).

The crucial observation is that this bound is not tight. Table doubling only happens after the second, fourth, eighth, etc., insertions. Thus, the total cost of k insertions is

    k · O(1) + O( Σ_{j=1}^{lg k} 2^j ) = O(k) + O(2k) = O(k).

Thus, in any sequence of insertion and lookup operations on a dynamically doubled hash table, the average, or amortized, cost per operation is O(1). This sort of analysis, in which we consider the total cost of a sequence of operations rather than the cost of a single step, is called amortized analysis. In the next lecture we will introduce methods of analyzing amortized running time.
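The accounting above can be reproduced in a few lines. This sketch (class and cost model are my own) charges 1 unit per insertion plus the number of migrated elements on each doubling; over k insertions the total stays below a small constant times k, even though a single insertion can cost Θ(k):

```python
# Table doubling with an explicit cost counter: doubling after the
# 2nd, 4th, 8th, ... insertions costs 1 + 2 + 4 + ... + 2^{lg k} < 2k
# in migrations total, so the amortized cost per insertion is O(1).

class DoublingTable:
    def __init__(self):
        self.capacity = 1          # start with table size n = O(1)
        self.items = []
        self.total_cost = 0

    def insert(self, item):
        self.total_cost += 1       # unit cost of the insertion itself
        if len(self.items) == self.capacity:
            # Double and migrate: cost proportional to current size.
            self.total_cost += len(self.items)
            self.capacity *= 2
        self.items.append(item)

table = DoublingTable()
k = 10_000
for i in range(k):
    table.insert(i)
```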

MIT OpenCourseWare
http://ocw.mit.edu

6.046J / 18.410J Design and Analysis of Algorithms
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.