OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES


Peter M. Maurer

Why Hashing is θ(n)

As in binary search, hashing assumes that keys are stored in an array which is indexed by an integer. However, hashing attempts to bypass the normal searching mechanism by translating the search key into an integer and using that integer to access the item directly. To be specific, assume that the underlying array has an index range from 1 to M, and that the hashing function is H. Given a string s, H(s) is an integer in the range 1..M. The hashing function is θ(1), and the array access is also θ(1), so at first glance it appears that hashing must also be θ(1). However, this is not the case.

Unfortunately, the function H(s) is not one-to-one. In other words, there are two keys s1 and s2 such that s1 ≠ s2 but H(s1) = H(s2). This is known as a collision. When a collision occurs, one of the keys must be located somewhere in the table other than the position given by H(s). There are several methods for determining the position of the second key when it is initially placed in the table. One is to do a linear search for the next available slot and place the key there. Another method (which is generally considered to be best) is to convert the array into a table of pointers and store all colliding keys as linked lists. The table contains a pointer to the head of each linked list.

All of the most popular methods for resolving collisions are based on some form of linear search. It is this linear search that makes hashing θ(n). To see why this is so, first consider the worst case. Obviously, the more collisions that occur, the worse the running time will be, so let's assume that every key causes a collision. In other words, the hashing function H(s) returns the same value for every string s. For table lookups, the worst case is still the case where the key does not appear in the table. Under these conditions, every table lookup will require a linear search of n items, which requires θ(n) time.
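A minimal sketch of the linked-list (chaining) scheme in Python; the class name and the degenerate hash function are illustrative choices, not from the notes:

```python
# A chained hash table: an array of M buckets, each holding a list of
# colliding keys. (ChainedHashTable and the default modular hash are
# illustrative, not the notes' own code.)

class ChainedHashTable:
    def __init__(self, M, h=None):
        self.M = M
        # Default hash: Python's hash() reduced mod M, a stand-in for H(s).
        self.h = h if h is not None else (lambda s: hash(s) % M)
        self.buckets = [[] for _ in range(M)]

    def insert(self, key):
        bucket = self.buckets[self.h(key)]
        if key not in bucket:       # preliminary lookup to avoid duplicates
            bucket.append(key)

    def search(self, key):
        # Linear search within one chain -- the step that makes worst-case
        # lookup theta(n) when every key collides.
        return key in self.buckets[self.h(key)]

# Degenerate hash function: every key collides, so search degrades to a
# linear scan of all n keys in a single chain.
worst = ChainedHashTable(8, h=lambda s: 0)
for k in ("apple", "pear", "plum"):
    worst.insert(k)
assert worst.search("plum") and not worst.search("kiwi")
assert len(worst.buckets[0]) == 3   # all three keys in one chain
```

With a well-distributed hash the same table spreads keys across all M chains; only the choice of h changes.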
Although it is theoretically possible to create a worst-case set of strings for any statically defined hashing function, most hashing functions are designed to make the worst case extremely unlikely. To gain some appreciation of how well hashing could be expected to perform on average, let's assume that H(s) gives an evenly distributed set of indices. Because binary search is able to handle an arbitrary number of keys, it is only fair to assume that our hashing algorithm can do the same. This eliminates those classes of hashing algorithms that store all keys in a fixed-size table; strictly speaking, these algorithms are not correct because they fail when the size of the input exceeds the size of the table. The linked-list type of hash table does not suffer from this deficiency. We will assume that the hash table is of fixed size, and that the number of keys, n, is much larger than the maximum table index, M. Once M keys have been added to the table, a collision must occur on the next key (unless, of course, one has already occurred). Under the assumption that indices are evenly distributed, the number of keys that hash to any

particular index, i, is equal to n/M. During a table lookup, the average case would require a linear search of n/(2M) keys, which is again θ(n).

From the above analysis, it would appear that hashing is not as efficient as binary search. In fact, if we are dealing with millions or billions of keys, this will probably be the case. However, it is important to note that although hashing is linear, it is M times faster than a normal linear search, and M is generally a reasonably large number. In asymptotic notation, the constant 1/M is absorbed into the constant of proportionality. One of the motivations for using asymptotic notation is that constants of proportionality tend to be fairly close to one another, and hence can be safely ignored. This is not the case for hashing: its constant of proportionality is extremely small compared to that for linear search. Hashing is typically used in cases where the number of entries in the table will not be significantly larger than M. If we assume that this is the case, and that the hashing function H(s) is well designed, hashing will exhibit behavior very close to θ(1) and will perform significantly faster than binary search.

Another important point is the amount of time required to create the table. Hashing is usually used in an environment where the table must be created dynamically one key at a time, and where table lookups are interspersed with table insertions. If a sorted array were used to create the table, each insertion would require approximately θ(n) operations to insert a key in its proper position. This would give θ(n^2) performance for inserting n keys. There are methods for creating binary search trees on the fly, but rebalancing the trees requires a significant amount of overhead. (This is a whole topic unto itself, and I don't want to digress too far here.) Again, under the two assumptions that H(s) is well designed and that the number of keys, n, is not significantly larger than the size of the table, M, the amount of time required to insert a key into the hash table will be approximately θ(1).
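The n/M chain-length figure is easy to check with a small simulation; this is a sketch in which an idealized random hash stands in for a well-designed H(s), and the parameters are arbitrary:

```python
import random

# Insert n keys into M buckets using an idealized, evenly distributed hash
# (random.randrange stands in for a well-designed H(s)).
random.seed(1)
M, n = 64, 64_000
chains = [0] * M                    # chain length per bucket
for _ in range(n):
    chains[random.randrange(M)] += 1

# On average each bucket holds n/M keys, so an average successful lookup
# scans about n/(2M) of them.
assert sum(chains) / M == n / M     # exactly n/M = 1000 keys per bucket
assert max(chains) < 2 * n / M      # no chain wildly above the mean
```

The average is exactly n/M by construction; the interesting observation is that with an even distribution no individual chain strays far from it.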
This gives approximately θ(n) performance for inserting n keys. However, if the number of keys is significantly larger than the size of the table, or if the probability of collisions is high, then the time bound for inserting n keys will be θ(n^2). Strictly speaking, the amount of time required to insert a new entry into a linked-list hash table is θ(1) if you don't care about inserting duplicates. The worst-case θ(n) performance is based on the assumption that a preliminary table lookup will be required to avoid inserting duplicates.

Improving Worst-Case Hashing

The worst-case analysis of hashing was based on the assumption that a linear search would be required to resolve collisions. This assumption causes a factor of n to appear in all time bounds. If the performance of collision resolution could be improved, it should be possible to improve the worst-case time bound. Suppose that instead of a linear search, a binary search were used to resolve collisions. In this case, the time bound for table lookups could be reduced to lg(n/(2M)), which is θ(lg n).

Another intriguing possibility is to use hashing itself to resolve collisions. Assume that if a collision occurs using H(s), a second function H1(s) will be used to resolve the collision. To keep things simple, let's assume that second-level hash tables are used to store the keys hashed by H1(s). This would require one second-level table for each

entry in the first table. Since it is possible for H1(s) to produce collisions, some assumptions must be made about collision resolution in the second-level hash tables. If we assume that a standard linear search is used to resolve second-level collisions, worst-case behavior would occur when every key caused a collision on both levels, giving a worst-case performance of θ(n). Assuming evenly distributed keys at both levels, and an extremely large value for n, the number of comparisons that would be required is n/(2MN), where M is the size of the first-level table and N is the size of the second-level tables. If we assume all tables are the same size, the result is n/(2M^2). As yet, we have not obtained any improvement in the asymptotic time bound, although we have significantly reduced the constant of proportionality.

Let's carry this idea to its natural conclusion. Assume that we have an infinite sequence of hashing functions H0(s), H1(s), ..., Hi(s), ..., where H0(s) will be used for the level-one tables, H1(s) will be used for the level-two tables, and so forth. We will assume that these functions are applied successively until no collision occurs. Immediately we have a problem with our worst-case assumption of getting a collision on every key, because this would imply an infinite recursion. If we were clever enough to design such an algorithm, then we would certainly be clever enough to guarantee that this would not occur. Let us assume, instead, that we get evenly distributed keys at every level, and try to determine just how well we could do under these circumstances. For simplicity, assume that all tables are of size M, and that the number of keys, n, is much larger than M. As we add levels to the hash table, the amount of space for storing keys (or pointers to linked lists of keys) becomes larger. M keys can be stored at level 1, M^2 keys can be stored at level 2, M^3 keys at level 3, and so forth. At some level k we will have enough space to store all n keys, and at this level, linear searches will no longer be required.
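The capacity growth just described can be checked numerically; this is an illustrative sketch (the helper name levels_needed is mine, not from the notes):

```python
# With tables of size M at every level, level k of the multi-level scheme
# can hold M**k keys, so the depth needed for n keys is the smallest k
# with M**k >= n.

def levels_needed(n, M):
    k, capacity = 1, M
    while capacity < n:
        k += 1
        capacity *= M
    return k

# With M = 100: one level holds 100 keys, two levels hold 100**2 = 10,000.
assert levels_needed(100, 100) == 1
assert levels_needed(101, 100) == 2
assert levels_needed(10_000, 100) == 2
assert levels_needed(10_001, 100) == 3
```

The returned depth is the smallest k with M^k >= n, which is log_M n rounded up, matching the bound derived next.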
We have reached this level when M^k >= n. Note that k = log_M n. Since the operations at each level require constant time, this gives us a time bound of θ(lg n).

Perfect Hashing

As pointed out above, it is the resolution of collisions that prevents hashing from being θ(1). Suppose it were possible to eliminate collisions entirely. This would require a hash function that was somehow adaptable to its environment, at least to the degree that it could handle a table of arbitrary size. Let us assume further that we have a function H(s) that is one-to-one. In other words, if H(s1) = H(s2), then s1 = s2. This isn't really a reasonable assumption, for the following reasons. As pointed out above, it is necessary to assume that the size of the table can be arbitrarily large, which implies that the length of the strings must be arbitrarily large as well, since there are only a finite number of characters that could occur in any string. However, let's assume that the length of the strings is limited to some reasonable value, say 8. Let's also assume that the strings are limited to certain reasonable characters, say upper- and lowercase letters. Including the null string, this gives us 52^0 + 52^1 + 52^2 + 52^3 + 52^4 + 52^5 + 52^6 + 52^7 + 52^8 different strings that could be used as an argument to H. To simplify matters, let's assume that all strings are exactly 8 characters long, which gives us a total of 52^8 different strings that could be used

as an input to H. Now, if we assume that our table is indexed by an unsigned 32-bit integer, we would have a total of 2^32 table entries. Observe that 2^32 = 16^8. Which is bigger, 52^8 or 16^8? And just how big is a table with 2^32 entries, anyway? If we do nothing more than store the string, this will require 2^32 * 2^3 = 2^35 bytes. To put this in perspective, 2^35 bytes is 32 gigabytes.

Nevertheless, let us pursue this dream of perfect hashing a bit further. If H(s) truly were a one-to-one function, then collisions would be impossible. Assuming that H(s) runs in θ(1) time, hashing would indeed be a θ(1) algorithm. However, the assumption that H(s) is a θ(1) process is not realistic either. Remember that it is necessary for our algorithm to be able to handle an arbitrarily large number of strings. If H(s) is truly a perfect hashing function, it must look at every character of s. To see why this is so, imagine that there is a character in s at position i that is not examined by H. Any string that differed from s only in position i would hash to the same index as s, contradicting the assumption that H was one-to-one. Since there are only a finite number of characters that can appear in a string, the necessity of handling an arbitrarily large number of strings demands that we also be able to handle arbitrarily long strings. Since H is required to examine every character of every string it processes, it must take more time for long strings than for short strings.

The minimum permissible length for the longest string in a set of n strings is proportional to lg n. For example, with a 256-character alphabet, if we have a set of 258 strings, we know that at least one string must have a length of at least 2: there is one string of length zero and 256 strings of length 1, for a total of 257, so the 258th string must have a length of at least 2. Continuing in this fashion, any set of strings of size n = 1 + sum_{i=0}^{k} 256^i must have a string of length at least k+1. Furthermore, log_256 n = log_256(1 + sum_{i=0}^{k} 256^i) is approximately k+1. Since all logarithms are proportional to one another, k+1 is proportional to lg n.
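The counting claims above are pure arithmetic and can be verified directly (the helper name strings_up_to is a hypothetical label, not from the notes):

```python
# Numeric check of the counting arguments in the Perfect Hashing section.

def strings_up_to(k, alphabet=256):
    # Number of strings of length 0..k over the given alphabet:
    # sum of alphabet**i for i = 0..k.
    return sum(alphabet ** i for i in range(k + 1))

# 1 empty string + 256 strings of length 1 = 257, so any set of 258
# strings must contain one of length at least 2.
assert strings_up_to(1) == 257

# 52**8 eight-character letter strings vastly outnumber the 2**32 = 16**8
# slots of a table indexed by an unsigned 32-bit integer.
assert 52 ** 8 > 2 ** 32
assert 2 ** 32 == 16 ** 8

# Storing 8 bytes per entry for 2**32 entries takes 2**35 bytes = 32 GiB.
assert 2 ** 32 * 2 ** 3 == 32 * 2 ** 30
```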
Thus, even a perfect hashing algorithm must be at least of order θ(lg n).