Hashing and Amortization


Lecture 10: Hashing and Amortization
Supplemental reading in CLRS: Chapter 11; Chapter 17 intro; Section 17.1

10.1 Arrays and Hashing

Arrays are very useful. The items in an array are statically addressed, so that inserting, deleting, and looking up an element each take O(1) time. Thus, arrays are a terrific way to encode functions {1, ..., n} → T, where T is some range of values and n is known ahead of time. For example, taking T = {0, 1}, we find that an array A of n bits is a great way to store a subset of {1, ..., n}: we set A[i] = 1 if and only if i is in the set (see Figure 10.1). Or, interpreting the bits as binary digits, we can use an n-bit array to store an integer between 0 and 2^n − 1. In this way, we will often identify the set {0,1}^n with the set {0, ..., 2^n − 1}.

[Figure 10.1: a 12-bit array A = [0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0] encoding the set {2, 4, 5, 8, 11} ⊆ {1, ..., 12}.]

What if we wanted to encode subsets of an arbitrary domain U, rather than just {1, ..., n}? Or, to put things differently, what if we wanted a keyed (or associative) array, where the keys could be arbitrary strings? While the workings of such data structures (such as dictionaries in Python) are abstracted away in many programming languages, there is usually an array-based solution working behind the scenes. Implementing associative arrays amounts to finding a way to turn a key into an array index. Thus, we are looking for a suitable function U → {1, ..., n}, called a hash function. Equipped with this function, we can perform key lookup:

    U --(hash function)--> {1, ..., n} --(array lookup)--> T

(see Figure 10.2). This particular implementation of associative arrays is called a hash table.

[Figure 10.2: an associative array with keys in U and values in T, implemented as an array of (key, value) pairs equipped with a hash function h : U → {1, ..., n}; in the picture, h(key_3) = 3, h(key_1) = 5, and h(key_2) = 6.]

There is a problem, however. Typically, the domain U is much larger than {1, ..., n}. For any hash function h : U → {1, ..., n}, there is some i such that at least |U|/n elements are mapped to i. The set h^{-1}(i) of all elements mapped to i is called the load on i, and when this load contains more than one of the keys we are trying to store in our hash table, we say there is a collision at i. Collisions are a problem for us: if two keys map to the same index, then what should we store at that index? We have to store both values somehow. For now let's say we do this in the simplest way possible: storing at each index i of the array a linked list (or, more abstractly, some sort of bucket-like object) consisting of all values whose keys are mapped to i. Thus, lookup takes O(1 + |h^{-1}(i)|) time, which may be poor if there are collisions at i.

(If you are expecting lots of collisions, a more efficient way to handle things is to create a two-layered hash table, where each element of A is itself a hash table with its own, different hash function. In order to have a collision in a two-layer hash table, the same pair of keys must collide under two different hash functions. If the hash functions are chosen well, e.g., if they are chosen randomly, then this is extremely unlikely. Of course, if you want to be even more sure that collisions won't occur, you can make a three-layer hash table, and so on. There is a trade-off, though: introducing unnecessary layers of hashing comes with a time and space overhead which, while it may not show up in the big-O analysis, makes a difference in practical applications.)

Rather than thinking about efficient ways to handle collisions, let's try to reason about the probability of having collisions if we choose our hash functions well.

10.2 Hash Families

Without any prior information about which elements of U will occur as keys, the best we can do is to choose our hash function h at random from a suitable hash family. A hash family on U is a set H of functions U → {1, ..., n}. Technically speaking, H should come equipped with a probability distribution, but usually we just take the uniform distribution on H, so that each hash function is equally likely to be chosen. If we want to avoid collisions, it is reasonable to hope that, for any fixed x_1, x_2 ∈ U (x_1 ≠ x_2), the values h(x_1) and h(x_2) are completely uncorrelated as h ranges through the sample space H. This leads to the following definition:

Definition. A hash family H on U is said to be universal if, for any x_1, x_2 ∈ U (x_1 ≠ x_2), we have

    Pr[h(x_1) = h(x_2)] ≤ 1/n.
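For a small universe, the definition above can be checked by brute force. The sketch below (parameters are my own choices) verifies it for the family of all functions U → {0, ..., n−1}, where for every pair x_1 ≠ x_2 the fraction of functions with h(x_1) = h(x_2) is exactly 1/n:

```python
import itertools

# Brute-force check of universality on a toy universe: a family is
# universal if, for every pair of distinct keys, the fraction of
# functions in the family mapping them to the same slot is <= 1/n.
# U and n below are arbitrary small parameters for illustration.

U = [0, 1, 2, 3]          # universe of keys
n = 3                     # table size

def is_universal(family, universe, n):
    for x1, x2 in itertools.combinations(universe, 2):
        collisions = sum(1 for h in family if h[x1] == h[x2])
        if collisions / len(family) > 1 / n:
            return False
    return True

# The family of ALL functions U -> {0, ..., n-1}, each encoded as a dict:
all_functions = [dict(zip(U, vals))
                 for vals in itertools.product(range(n), repeat=len(U))]
```

By contrast, a family of constant functions collides with probability 1 on every pair, so the same checker rejects it.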

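The object these guarantees are about is the chained table of Section 10.1 with h drawn at random. A minimal sketch (the class and its names are my own; a lazily sampled random mapping stands in for a structured universal family):

```python
import random

# A chained hash table: each of the n slots holds a bucket (a Python
# list standing in for the linked list) of (key, value) pairs whose
# keys hash to that slot. Sampling h(key) uniformly on first use
# simulates drawing h from the family of all functions U -> {1,...,n}.

class RandomChainedTable:
    def __init__(self, n=8, seed=None):
        self.n = n
        self.buckets = [[] for _ in range(n)]
        self.rng = random.Random(seed)
        self.h = {}                        # lazily sampled random function

    def _index(self, key):
        if key not in self.h:
            self.h[key] = self.rng.randrange(self.n)
        return self.h[key]

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for j, (k, _) in enumerate(bucket):
            if k == key:                   # key already present: overwrite
                bucket[j] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        # Cost is O(1 + load on h(key)): only one bucket is scanned.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None
```

With this cost model, the propositions below bound the expected bucket length scanned by `lookup`.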
Similarly, H is said to be ε-universal if for any x_1 ≠ x_2 we have Pr[h(x_1) = h(x_2)] ≤ ε.

The consequences of the above hypotheses with regard to collisions are as follows:

Proposition 10.1. Let H be a universal hash family on U. Fix some subset S ⊆ U and some element x ∈ U. Pick h ∈ H at random. Then the expected number of elements of S that map to h(x) is at most 1 + |S|/n. In symbols,

    E_h[ |h^{-1}(h(x)) ∩ S| ] ≤ 1 + |S|/n.

If H is ε-universal rather than universal, then the same holds with 1 + |S|/n replaced by 1 + ε|S|.

Proof. For a proposition φ with random parameters, let I_φ be the indicator random variable which equals 1 if φ is true and equals 0 otherwise. The fact that H is universal means that for each x' ∈ U \ {x} we have E[I_{h(x')=h(x)}] ≤ 1/n. Thus by the linearity of expectation, we have

    E_h[ |h^{-1}(h(x)) ∩ S| ] = E[ I_{x∈S} + Σ_{x'∈S, x'≠x} I_{h(x')=h(x)} ]
                              = I_{x∈S} + Σ_{x'∈S, x'≠x} E[ I_{h(x')=h(x)} ]
                              ≤ 1 + |S|/n.

The reasoning is almost identical when H is ε-universal rather than universal.

Corollary 10.2. For a hash table in which the hash function is chosen from a universal family, insertion, deletion, and lookup have expected running time O(1 + |S|/n), where S ⊆ U is the set of keys which actually occur. If instead the hash family is ε-universal, then the operations have expected running time O(1 + ε|S|).

Corollary 10.3. Consider a hash table of size n with keys in U, whose hash function is chosen from a universal hash family. Let S ⊆ U be the set of keys which actually occur. If |S| = O(n), then insertion, deletion, and lookup have expected running time O(1).

Let H be a universal hash family on U. If |S| = O(n), then the expected load on each index is O(1). Does this mean that a typical hash table has O(1) load at each index? Surprisingly, the answer is no, even when the hash function is chosen well. We'll see this below when we look at examples of universal hash families.

Examples 10.4.

1. The set of all functions h : U → {1, ..., n} is certainly universal. In fact, we could not hope to get any more balanced than this:

- For any x ∈ U, the random variable h(x) (where h is chosen at random) is uniformly distributed on the set {1, ..., n}.
- For any pair x_1 ≠ x_2, the random variables h(x_1), h(x_2) are independent. In fact, for any finite subset {x_1, ..., x_k} ⊆ U, the tuple (h(x_1), ..., h(x_k)) is uniformly distributed on {1, ..., n}^k.
- The load on each index i is a binomial random variable with parameters (|S|, 1/n).

Fact. When p is small and N is large enough that Np is moderately sized, the binomial distribution with parameters (N, p) is approximated by the Poisson distribution with parameter Np. That is, if X is a binomial random variable with parameters (N, p), then

    Pr[X = k] ≈ (Np)^k e^{−Np} / k!.

In our case, N = |S| and p = 1/n. Thus, if L_i is the load on index i, then

    Pr[L_i = k] ≈ (|S|/n)^k e^{−|S|/n} / k!.

For example, if |S| = n, then

    Pr[L_i = 0] ≈ e^{−1} ≈ 0.3679,
    Pr[L_i = 1] ≈ e^{−1} ≈ 0.3679,
    Pr[L_i = 2] ≈ (1/2) e^{−1} ≈ 0.1839,
    ...

Further calculation shows that, when |S| = n, we have

    E[ max_i L_i ] = Θ(lg n / lg lg n).

Moreover, with high probability, max_i L_i does not exceed O(lg n / lg lg n). Thus, a typical hash table with |S| = n and h chosen uniformly from the set of all functions looks like Figure 10.3: about 37% of the buckets are empty, about 37% of the buckets have one element, and about 26% of the buckets have more than one element, including some buckets with Θ(lg n / lg lg n) elements.

2. In Problem Set 4 we considered the hash family

    H = { h_p : p ≤ k and p is prime },

where h_p : {0, ..., 2^m − 1} → {0, ..., k − 1} is the function h_p(x) = x mod p. In Problem 4(a) you proved that, for each x ≠ y, we have Pr[h_p(x) = h_p(y)] ≤ ml/k.
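The roughly 37% / 37% / 26% split claimed for Example 1 is easy to observe empirically. A short experiment (n = 10,000 and the seed are arbitrary choices) throws |S| = n keys into n buckets with a uniformly random function and measures the fraction of empty and singleton buckets, which should both be close to 1/e ≈ 0.3679:

```python
import math
import random

# Empirical load distribution for |S| = n keys hashed by a uniformly
# random function into n buckets. By the Poisson approximation above,
# Pr[L_i = 0] and Pr[L_i = 1] are each about e^{-1}.

def load_fractions(n=10_000, seed=1):
    rng = random.Random(seed)
    loads = [0] * n
    for _ in range(n):                 # |S| = n keys
        loads[rng.randrange(n)] += 1   # each key lands in a uniform bucket
    empty = loads.count(0) / n
    single = loads.count(1) / n
    return empty, single

empty, single = load_fractions()
```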

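The mod-a-random-prime family of Example 2 can also be probed numerically. The key fact behind its analysis is that h_p(x) = h_p(y) forces p to divide x − y, and an m-bit difference has fewer than m prime factors, so only a small fraction of primes are "bad" for any fixed pair. The sketch below (the particular x, y, and k are arbitrary choices of mine) computes that fraction exactly:

```python
# For the family h_p(x) = x mod p with p a uniformly random prime <= k,
# the collision probability of a fixed pair (x, y) is the fraction of
# primes p <= k dividing x - y.

def primes_up_to(k):
    """Sieve of Eratosthenes: all primes <= k."""
    sieve = [True] * (k + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(k ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, k + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def collision_fraction(x, y, k):
    """Fraction of primes p <= k with x mod p == y mod p."""
    ps = primes_up_to(k)
    bad = sum(1 for p in ps if (x - y) % p == 0)
    return bad / len(ps)

frac = collision_fraction(123456789, 987654321, 1000)
```

Since a number below 2^30 has at most nine distinct prime factors while there are 168 primes up to 1000, the measured fraction is small, as the bound predicts.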
[Figure 10.3: a typical hash table with |S| = n and h chosen uniformly from the family of all functions U → {1, ..., n}; maximum load Θ(lg n / lg lg n).]

3. In Problem Set 5, we fixed a prime p and considered the hash family

    H = { h_a : a ∈ Z_p^m },

where h_a : Z_p^m → Z_p is the dot product

    h_a(x) = x · a = Σ_{i=1}^m x_i a_i  (mod p).

4. In Problem Set 6, we fixed a prime p and positive integers m and k and considered the hash family

    H = { h_A : A ∈ Z_p^{k×m} },

where h_A : Z_p^m → Z_p^k is the function h_A(x) = Ax.

5. If H_1 is an ε_1-universal hash family of functions {0,1}^m → {0,1}^k and H_2 is an ε_2-universal hash family of functions {0,1}^k → {0,1}^l, then

    H = H_2 ∘ H_1 = { h_2 ∘ h_1 : h_1 ∈ H_1, h_2 ∈ H_2 }

is an (ε_1 + ε_2)-universal hash family of functions {0,1}^m → {0,1}^l. (To fully specify H, we have to give not just a set but also a probability distribution. The hash families H_1 and H_2 come with probability distributions, so there is an induced distribution on H_1 × H_2. We then equip H with the distribution induced by the map H_1 × H_2 → H, (h_1, h_2) ↦ h_2 ∘ h_1. You could consider this a mathematical technicality if you wish: if H_1 and H_2 are given uniform distributions, as they typically are, then the distribution on H_1 × H_2 is also uniform. The distribution on H need not be uniform, however: an element of H is more likely to be chosen if it can be expressed in multiple ways as the composition of an element of H_2 with an element of H_1.) To see this, note that for any x ≠ x', the union bound gives

    Pr_{h_1∈H_1, h_2∈H_2}[ h_2(h_1(x)) = h_2(h_1(x')) ]
        = Pr[ h_1(x) = h_1(x') or (h_1(x) ≠ h_1(x') and h_2(h_1(x)) = h_2(h_1(x'))) ]
        ≤ Pr[ h_1(x) = h_1(x') ] + Pr[ h_1(x) ≠ h_1(x') and h_2(h_1(x)) = h_2(h_1(x')) ]
        ≤ ε_1 + ε_2.

In choosing the parameters to build a hash table, there is a tradeoff. Making n larger decreases the likelihood of collisions, and thus decreases the expected running time of operations on the table, but also requires the allocation of more memory, much of which is not even used to store data. In situations where avoiding collisions is worth the memory cost (or in applications other than hash tables, when the corresponding tradeoff is worth it), we can make n much larger than |S|.

Proposition 10.5. Let H be a universal hash family U → {1, ..., n}. Let S ⊆ U be the set of keys that occur. Then the expected number of collisions is at most (|S| choose 2)/n. In symbols,

    E[ Σ_{{x, x'} ⊆ S, x ≠ x'} I_{h(x)=h(x')} ] ≤ (|S| choose 2) · (1/n),

where the sum runs over unordered pairs of distinct elements of S.

Proof. There are (|S| choose 2) pairs of distinct elements of S, and each pair has probability at most 1/n of causing a collision. The result follows from linearity of expectation.

Corollary 10.6. If n ≥ |S|^2, then the expected number of collisions is less than 1/2, and the probability that a collision exists is less than 1/2.

Proof. Apply the Markov bound.

Thus, if n is sufficiently large compared to |S|, a typical hash table consists mostly of empty buckets, and with high probability there is at most one element in each bucket. As we mentioned above, choosing a large n for a hash table is expensive in terms of space. While the competing goals of fast table operations and low storage cost are a fact of life if nothing is known about S in advance, we will see in recitation that, if S is known in advance, it is feasible to construct a perfect hash table, i.e., a hash table in which there are no collisions. Of course, the smallest value of n for which this is possible is n = |S|. As we will see in recitation, there are reasonably efficient algorithms to construct a perfect hash table with n = O(|S|).

10.3 Amortization

What if the size of S is not known in advance? In order to allocate the array for a hash table, we must choose the size at creation time, and may not change it later.
If |S| turns out to be significantly greater than n, then there will always be lots of collisions, no matter which hash function we choose. Luckily, there is a simple and elegant solution to this problem: table doubling. The idea is to start with some particular table size n = O(1). If the table gets filled, simply create a new table of size 2n and migrate all the old elements to it. While this migration operation is costly, it happens infrequently enough that, on the whole, the strategy of table doubling is efficient.

Let's take a closer look. To simplify matters, let's assume that only insertions and lookups occur, with no deletions. What is the worst-case cost of a single operation on the hash table?

Lookup: O(1), as usual.
Insertion: O(n), if we have to double the table.

Thus, the worst-case total running time of k operations (k = |S|) on the hash table is

    O(1 + 2 + ··· + k) = O(k^2).

The crucial observation is that this bound is not tight. Table doubling only happens after the second, fourth, eighth, etc., insertions. Thus, the total cost of k insertions is

    k · O(1) + O( Σ_{j=1}^{lg k} 2^j ) = O(k) + O(2k) = O(k).

Thus, in any sequence of insertion and lookup operations on a dynamically doubled hash table, the average, or amortized, cost per operation is O(1). This sort of analysis, in which we consider the total cost of a sequence of operations rather than the cost of a single step, is called amortized analysis. In the next lecture we will introduce methods of analyzing amortized running time.
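The accounting above can be reproduced in a few lines. This sketch (class and cost model are my own) charges 1 unit per insertion plus the number of migrated elements on each doubling; over k insertions the total stays below a small constant times k, even though a single insertion can cost Θ(k):

```python
# Table doubling with an explicit cost counter: doubling after the
# 2nd, 4th, 8th, ... insertions costs 1 + 2 + 4 + ... + 2^{lg k} < 2k
# in migrations total, so the amortized cost per insertion is O(1).

class DoublingTable:
    def __init__(self):
        self.capacity = 1          # start with table size n = O(1)
        self.items = []
        self.total_cost = 0

    def insert(self, item):
        self.total_cost += 1       # unit cost of the insertion itself
        if len(self.items) == self.capacity:
            # Double and migrate: cost proportional to current size.
            self.total_cost += len(self.items)
            self.capacity *= 2
        self.items.append(item)

table = DoublingTable()
k = 10_000
for i in range(k):
    table.insert(i)
```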

MIT OpenCourseWare
http://ocw.mit.edu

6.046J / 18.410J Design and Analysis of Algorithms
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.