OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES


Peter M. Maurer

Why Hashing is θ(n)

As in binary search, hashing assumes that keys are stored in an array which is indexed by an integer. However, hashing attempts to bypass the normal searching mechanism by translating the search key into an integer and using that integer to access the item directly. To be specific, assume that the underlying array has an index range from 1 to M, and that the hashing function is H. Given a string s, H(s) is an integer in the range 1..M. The hashing function is θ(1), and the array access is also θ(1), so at first glance it appears that hashing must also be θ(1). However, this is not the case.

Unfortunately, the function H(s) is not one-to-one. In other words, there are two keys s1 and s2 such that s1 ≠ s2 but H(s1) = H(s2). This is known as a collision. When a collision occurs, one of the keys must be located somewhere in the table other than the position given by H(s). There are several methods for determining the position of the second key when it is initially placed in the table. One is to do a linear search for the next available slot and place the key there. Another method (which is generally considered to be best) is to convert the array into a table of pointers and store all colliding keys as linked lists. The table contains a pointer to the head of each linked list.

All of the most popular methods for resolving collisions are based on some form of linear search. It is this linear search that makes hashing θ(n). To see why this is so, first consider the worst case. Obviously, the more collisions that occur, the worse the running time will be, so let's assume that every key causes a collision. In other words, the hashing function H(s) returns the same value for every string s. For table lookups, the worst case is still the case where the key does not appear in the table. Under these conditions, every table lookup will require a linear search of n items, which requires θ(n) time.
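A minimal sketch of the linked-list (chaining) scheme in Python; the class name and the degenerate hash function are illustrative choices, not from the notes:

```python
# A chained hash table: an array of M buckets, each holding a list of
# colliding keys. (ChainedHashTable and the default modular hash are
# illustrative, not the notes' own code.)

class ChainedHashTable:
    def __init__(self, M, h=None):
        self.M = M
        # Default hash: Python's hash() reduced mod M, a stand-in for H(s).
        self.h = h if h is not None else (lambda s: hash(s) % M)
        self.buckets = [[] for _ in range(M)]

    def insert(self, key):
        bucket = self.buckets[self.h(key)]
        if key not in bucket:       # preliminary lookup to avoid duplicates
            bucket.append(key)

    def search(self, key):
        # Linear search within one chain -- the step that makes worst-case
        # lookup theta(n) when every key collides.
        return key in self.buckets[self.h(key)]

# Degenerate hash function: every key collides, so search degrades to a
# linear scan of all n keys in a single chain.
worst = ChainedHashTable(8, h=lambda s: 0)
for k in ("apple", "pear", "plum"):
    worst.insert(k)
assert worst.search("plum") and not worst.search("kiwi")
assert len(worst.buckets[0]) == 3   # all three keys in one chain
```

With a well-distributed hash the same table spreads keys across all M chains; only the choice of h changes.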
Although it is theoretically possible to create a worst-case set of strings for any statically defined hashing function, most hashing functions are designed to make the worst case extremely unlikely. To gain some appreciation of how well hashing could be expected to perform on average, let's assume that H(s) gives an evenly distributed set of indices. Because binary search is able to handle an arbitrary number of keys, it is only fair to assume that our hashing algorithm can do the same. This eliminates those classes of hashing algorithms that store all keys in a fixed-size table; strictly speaking, these algorithms are not correct because they fail when the size of the input exceeds the size of the table. The linked-list type of hash table does not suffer from this deficiency. We will assume that the hash table is of fixed size, and that the number of keys, n, is much larger than the maximum table index, M. Once M keys have been added to the table, a collision must occur on the next key (unless, of course, one has already occurred). Under the assumption that indices are evenly distributed, the number of keys that hash to any

particular index, i, is equal to n/M. During a table lookup, the average case would require a linear search of n/(2M) keys, which is again θ(n).

From the above analysis, it would appear that hashing is not as efficient as binary search. In fact, if we are dealing with millions or billions of keys, this will probably be the case. However, it is important to note that although hashing is linear, it is M times faster than a normal linear search, and M is generally a reasonably large number. In asymptotic notation, the constant 1/M is absorbed into the constant of proportionality. One of the motivations for using asymptotic notation is that constants of proportionality tend to be fairly close to one another, and hence can be safely ignored. This is not the case for hashing: its constant of proportionality is extremely small compared to that for linear search. Hashing is typically used in cases where the number of entries in the table will not be significantly larger than M. If we assume that this is the case, and that the hashing function H(s) is well designed, hashing will exhibit behavior very close to θ(1) and will perform significantly faster than binary search.

Another important point is the amount of time required to create the table. Hashing is usually used in an environment where the table must be created dynamically one key at a time, and where table lookups are interspersed with table insertions. If a sorted array were used to create the table, each insertion would require approximately θ(n) operations to insert a key in its proper position. This would give θ(n^2) performance for inserting n keys. There are methods for creating binary search trees on the fly, but rebalancing the trees requires a significant amount of overhead. (This is a whole topic unto itself, and I don't want to digress too far here.) Again, under the two assumptions that H(s) is well designed and that the number of keys, n, is not significantly larger than the size of the table, M, the amount of time required to insert a key into the hash table will be approximately θ(1).
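The n/M chain-length figure is easy to check with a small simulation; this is a sketch in which an idealized random hash stands in for a well-designed H(s), and the parameters are arbitrary:

```python
import random

# Insert n keys into M buckets using an idealized, evenly distributed hash
# (random.randrange stands in for a well-designed H(s)).
random.seed(1)
M, n = 64, 64_000
chains = [0] * M                    # chain length per bucket
for _ in range(n):
    chains[random.randrange(M)] += 1

# On average each bucket holds n/M keys, so an average successful lookup
# scans about n/(2M) of them.
assert sum(chains) / M == n / M     # exactly n/M = 1000 keys per bucket
assert max(chains) < 2 * n / M      # no chain wildly above the mean
```

The average is exactly n/M by construction; the interesting observation is that with an even distribution no individual chain strays far from it.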
This gives approximately θ(n) performance for inserting n keys. However, if the number of keys is significantly larger than the size of the table, or if the probability of collisions is high, then the time bound for inserting n keys will be θ(n^2). Strictly speaking, the amount of time required to insert a new entry into a linked-list hash table is θ(1) if you don't care about inserting duplicates. The worst-case θ(n) performance is based on the assumption that a preliminary table lookup will be required to avoid inserting duplicates.

Improving Worst-Case Hashing

The worst-case analysis of hashing was based on the assumption that a linear search would be required to resolve collisions. This assumption causes a factor of n to appear in all time bounds. If the performance of collision resolution could be improved, it should be possible to improve the worst-case time bound. Suppose that instead of a linear search, a binary search were used to resolve collisions. In this case, the time bound for table lookups could be reduced to lg(n/(2M)), which is θ(lg n).

Another intriguing possibility is to use hashing itself to resolve collisions. Assume that if a collision occurs using H(s), a second function H1(s) will be used to resolve the collision. To keep things simple, let's assume that second-level hash tables are used to store the keys hashed by H1(s). This would require one second-level table for each

entry in the first table. Since it is possible for H1(s) to produce collisions, some assumptions must be made about collision resolution in the second-level hash tables. If we assume that a standard linear search is used to resolve second-level collisions, worst-case behavior would occur when every key caused a collision on both levels, giving a worst-case performance of θ(n). Assuming evenly distributed keys at both levels, and an extremely large value for n, the number of comparisons that would be required is n/(2MN), where M is the size of the first-level table and N is the size of the second-level tables. If we assume all tables are the same size, the result is n/(2M^2). As yet, we have not obtained any improvement in the asymptotic time bound, although we have significantly reduced the constant of proportionality.

Let's carry this idea to its natural conclusion. Assume that we have an infinite sequence of hashing functions H0(s), H1(s), ..., Hi(s), ..., where H0(s) will be used for the level-one tables, H1(s) will be used for the level-two tables, and so forth. We will assume that these functions are applied successively until no collision occurs. Immediately we have a problem with our worst-case assumption of getting a collision on every key, because this would imply an infinite recursion. If we were clever enough to design such an algorithm, then we would certainly be clever enough to guarantee that this would not occur. Let us assume, instead, that we get evenly distributed keys at every level, and try to determine just how well we could do under these circumstances. For simplicity, assume that all tables are of size M, and that the number of keys, n, is much larger than M. As we add levels to the hash table, the amount of space for storing keys (or pointers to linked lists of keys) becomes larger. M keys can be stored at level 1, M^2 keys can be stored at level 2, M^3 keys at level 3, and so forth. At some level k we will have enough space to store all n keys, and at this level, linear searches will no longer be required.
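The capacity growth just described can be checked numerically; this is an illustrative sketch (the helper name levels_needed is mine, not from the notes):

```python
# With tables of size M at every level, level k of the multi-level scheme
# can hold M**k keys, so the depth needed for n keys is the smallest k
# with M**k >= n.

def levels_needed(n, M):
    k, capacity = 1, M
    while capacity < n:
        k += 1
        capacity *= M
    return k

# With M = 100: one level holds 100 keys, two levels hold 100**2 = 10,000.
assert levels_needed(100, 100) == 1
assert levels_needed(101, 100) == 2
assert levels_needed(10_000, 100) == 2
assert levels_needed(10_001, 100) == 3
```

The returned depth is the smallest k with M^k >= n, which is log_M n rounded up, matching the bound derived next.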
We have reached this level when M^k >= n. Note that k = log_M n. Since the operations at each level require constant time, this gives us a time bound of θ(lg n).

Perfect Hashing

As pointed out above, it is the resolution of collisions that prevents hashing from being θ(1). Suppose it were possible to eliminate collisions entirely. This would require a hash function that was somehow adaptable to its environment, at least to the degree that it could handle a table of arbitrary size. Let us assume further that we have a function H(s) that is one-to-one. In other words, if H(s1) = H(s2), then s1 = s2. This isn't really a reasonable assumption, for the following reasons. As pointed out above, it is necessary to assume that the size of the table can be arbitrarily large, which implies that the length of the strings must be arbitrarily large as well, since there are only a finite number of characters that could occur in any string. However, let's assume that the length of the strings is limited to some reasonable value, say 8. Let's also assume that the strings are limited to certain reasonable characters, say upper- and lowercase letters. Including the null string, this gives us 52^0 + 52^1 + 52^2 + 52^3 + 52^4 + 52^5 + 52^6 + 52^7 + 52^8 different strings that could be used as an argument to H. To simplify matters, let's assume that all strings are exactly 8 characters long, which gives us a total of 52^8 different strings that could be used

as an input to H. Now, if we assume that our table is indexed by an unsigned 32-bit integer, we would have a total of 2^32 table entries. Observe that 2^32 = 16^8. Which is bigger, 52^8 or 16^8? And just how big is a table with 2^32 entries, anyway? If we do nothing more than store the string, this will require 2^32 * 2^3 = 2^35 bytes. To put this in perspective, 2^35 bytes is 32 gigabytes.

Nevertheless, let us pursue this dream of perfect hashing a bit further. If H(s) truly were a one-to-one function, then collisions would be impossible. Assuming that H(s) runs in θ(1) time, hashing would indeed be a θ(1) algorithm. However, the assumption that H(s) is a θ(1) process is not realistic either. Remember that it is necessary for our algorithm to be able to handle an arbitrarily large number of strings. If H(s) is truly a perfect hashing function, it must look at every character of s. To see why this is so, imagine that there is a character in s at position i that is not examined by H. Any string that differed from s only in position i would hash to the same index as s, contradicting the assumption that H was one-to-one. Since there are only a finite number of characters that can appear in a string, the necessity of handling an arbitrarily large number of strings demands that we also be able to handle arbitrarily long strings. Since H is required to examine every character of every string it processes, it must take more time for long strings than for short strings.

The minimum permissible length for the longest string in a set of n strings is proportional to lg n. For example, with a 256-character alphabet, if we have a set of 258 strings, we know that at least one string must have a length of at least 2: there is one string of length zero and 256 strings of length 1, for a total of 257, so the 258th string must have a length of at least 2. Continuing in this fashion, any set of strings of size n = 1 + sum_{i=0}^{k} 256^i must have a string of length at least k+1. Furthermore, log_256 n = log_256(1 + sum_{i=0}^{k} 256^i) is approximately k+1. Since all logarithms are proportional to one another, k+1 is proportional to lg n.
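The counting claims above are pure arithmetic and can be verified directly (the helper name strings_up_to is a hypothetical label, not from the notes):

```python
# Numeric check of the counting arguments in the Perfect Hashing section.

def strings_up_to(k, alphabet=256):
    # Number of strings of length 0..k over the given alphabet:
    # sum of alphabet**i for i = 0..k.
    return sum(alphabet ** i for i in range(k + 1))

# 1 empty string + 256 strings of length 1 = 257, so any set of 258
# strings must contain one of length at least 2.
assert strings_up_to(1) == 257

# 52**8 eight-character letter strings vastly outnumber the 2**32 = 16**8
# slots of a table indexed by an unsigned 32-bit integer.
assert 52 ** 8 > 2 ** 32
assert 2 ** 32 == 16 ** 8

# Storing 8 bytes per entry for 2**32 entries takes 2**35 bytes = 32 GiB.
assert 2 ** 32 * 2 ** 3 == 32 * 2 ** 30
```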
Thus, even a perfect hashing algorithm must be at least of order θ(lg n).