Algorithms Design & Analysis. Hash Tables


Recap
Lower bound
Order statistics

Today's topics
Direct-access table
Hash tables
Hash functions
Universal hashing
Perfect hashing
Open addressing

Symbol-table problem
Symbol table T holding n records. Each record x has a key, key[x], and other fields containing satellite data.
Operations on T: INSERT(T, x), DELETE(T, x), SEARCH(T, k).
How should the data structure T be organized?

Direct-access table
IDEA: Suppose that the set of keys is $K \subseteq \{0, 1, \ldots, m-1\}$, and keys are distinct. Set up an array $T[0 \,..\, m-1]$:
\[
T[k] =
\begin{cases}
x & \text{if } k \in K \text{ and } key[x] = k, \\
\text{NIL} & \text{otherwise.}
\end{cases}
\]
Then, operations take $\Theta(1)$ time.
Problem: The range of keys can be large: 64-bit numbers (which represent 18,446,744,073,709,551,616 different keys), character strings (even larger!).
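
The direct-access idea can be sketched in a few lines of Python. This is only an illustration under the slide's assumptions (distinct integer keys in the range 0 to m-1); the class name `DirectAccessTable` and its method names are hypothetical, and `None` stands in for NIL.

```python
class DirectAccessTable:
    """Direct-access table for distinct integer keys in {0, ..., m-1}."""

    def __init__(self, m):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * m

    def insert(self, key, record):
        self.slots[key] = record

    def delete(self, key):
        self.slots[key] = None

    def search(self, key):
        return self.slots[key]


table = DirectAccessTable(8)
table.insert(3, "record with key 3")
print(table.search(3))  # the stored record
print(table.search(5))  # None, i.e. NIL
```

Every operation is a single array access, which is where the constant-time bound comes from. The cost is one slot per possible key, which is what makes the approach impractical for large key ranges.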

Hash functions
Solution: Use a hash function h to map the universe U of all keys into $\{0, 1, \ldots, m-1\}$.
[Figure: keys $k_1, \ldots, k_5$ from the set $K \subseteq U$ are mapped by h to slots 0 through m-1 of table T; here $h(k_2) = h(k_5)$.]
When a record to be inserted maps to an already occupied slot in T, a collision occurs.

Resolving collisions by chaining
Records in the same slot are linked into a list.
[Figure: slot i of T points to a list holding the keys 49, 86, and 52.]
h(49) = h(86) = h(52) = i
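
A minimal sketch of chaining, assuming integer keys and, purely as a placeholder, the hash function h(k) = k mod m. The class name and method names are hypothetical, and Python lists stand in for the linked lists on the slide.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m, hash_fn=None):
        self.m = m
        # Assumption: use k mod m unless the caller supplies a hash function.
        self.hash_fn = hash_fn or (lambda k: k % m)
        self.slots = [[] for _ in range(m)]  # one chain per slot

    def insert(self, key, value):
        chain = self.slots[self.hash_fn(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:              # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self.hash_fn(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self.hash_fn(key)
        self.slots[idx] = [(k, v) for k, v in self.slots[idx] if k != key]


table = ChainedHashTable(37)
table.insert(49, "a")
table.insert(86, "b")   # 86 mod 37 == 49 mod 37, so both land in one chain
print(table.search(86))
```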

Analysis of chaining
We make the assumption of simple uniform hashing: Each key $k \in K$ is equally likely to be hashed to any slot of table T, independent of where other keys are hashed.
Let n be the number of keys in the table, and let m be the number of slots.
Define the load factor of T to be $\alpha = n/m$ = average number of keys per slot.

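Search cost
The expected time to search for a record with a given key is $\Theta(1 + \alpha)$: the 1 accounts for applying the hash function and accessing the slot, and the $\alpha$ for searching the list.
The expected search time is $\Theta(1)$ if $\alpha = O(1)$, or equivalently, if $n = O(m)$.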

Choosing a hash function
The assumption of simple uniform hashing is hard to guarantee, but several common techniques tend to work well in practice as long as their deficiencies can be avoided.
Desiderata:
A good hash function should distribute the keys uniformly into the slots of the table.
Regularity in the key distribution should not affect this uniformity.

Division method
Assume all keys are integers, and define $h(k) = k \bmod m$.
Deficiency: Don't pick an m that has a small divisor d. A preponderance of keys that are congruent modulo d can adversely affect uniformity.

Division method
Extreme deficiency: If $m = 2^r$, then the hash doesn't even depend on all the bits of k. For example, with r = 6, h(k) consists of only the 6 low-order bits of k.
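
The extreme deficiency is easy to observe directly. In the sketch below, the two 16-bit keys are hypothetical values chosen only so that they differ everywhere except in their 6 low-order bits; `division_hash` is an illustrative name.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


r = 6
m = 2 ** r

# Hypothetical keys that agree only in their low-order r bits (011010).
k1 = 0b1011000111011010
k2 = 0b0100111000011010

print(division_hash(k1, m) == division_hash(k2, m))  # True
print(division_hash(k1, m) == k1 & (m - 1))          # True: just a bit mask
```

With a power-of-two table size, the high-order bits of the key never influence the slot, so any regularity in the low-order bits translates directly into collisions.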

Division method
$h(k) = k \bmod m$.
Pick m to be a prime not too close to a power of 2 or 10 and not otherwise used prominently in the computing environment.
Annoyance: Sometimes, making the table size a prime is inconvenient.
But this method is popular, although the next method we'll see is usually superior.

Dot-product method
Randomized strategy: Let m be prime. Decompose key k into r + 1 digits, each with value in the set $\{0, 1, \ldots, m-1\}$. That is, let $k = \langle k_0, k_1, \ldots, k_r \rangle$, where $0 \le k_i < m$.
Pick $a = \langle a_0, a_1, \ldots, a_r \rangle$ where each $a_i$ is chosen randomly from $\{0, 1, \ldots, m-1\}$.
Define
\[
h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m.
\]
Excellent in practice, but expensive to compute.

A weakness of hashing as we saw it
Problem: For any hash function h, a set of keys exists that can cause the average access time of a hash table to skyrocket.
An adversary can pick all keys from $\{k \in U : h(k) = i\}$ for some slot i.

A weakness of hashing as we saw it
IDEA: Choose the hash function at random, independently of the keys.
Even if an adversary can see your code, he or she cannot find a bad set of keys, since he or she doesn't know exactly which hash function will be chosen.

Universal hashing
Definition. Let U be a universe of keys, and let H be a finite collection of hash functions, each mapping U to $\{0, 1, \ldots, m-1\}$. We say H is universal if for all $x, y \in U$, where $x \ne y$, we have
\[
|\{h \in H : h(x) = h(y)\}| = |H|/m.
\]
That is, the chance of a collision between x and y is 1/m if we choose h randomly from H.
[Figure: the set $\{h : h(x) = h(y)\}$ drawn as a region of size $|H|/m$ inside H.]

Universality is good
Theorem. Let h be a hash function chosen (uniformly) at random from a universal set H of hash functions. Suppose h is used to hash n arbitrary keys into the m slots of a table T. Then, for a given key x, we have
\[
E[\#\text{collisions with } x] < n/m.
\]

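Proof of the theorem
Proof. Let $C_x$ be the random variable denoting the total number of collisions of keys in T with x, and let
\[
c_{xy} =
\begin{cases}
1 & \text{if } h(x) = h(y), \\
0 & \text{otherwise.}
\end{cases}
\]
Note: $E[c_{xy}] = 1/m$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$.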

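Proof (cont.)
\[
\begin{aligned}
E[C_x] &= E\Bigl[\sum_{y \in T - \{x\}} c_{xy}\Bigr] && \text{Take expectation of both sides.} \\
&= \sum_{y \in T - \{x\}} E[c_{xy}] && \text{Linearity of expectation.} \\
&= \sum_{y \in T - \{x\}} \frac{1}{m} && E[c_{xy}] = 1/m. \\
&= \frac{n-1}{m} && \text{Algebra.}
\end{aligned}
\]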

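Constructing a set of universal hash functions
Let m be prime. Decompose key k into r + 1 digits, each with value in the set $\{0, 1, \ldots, m-1\}$. That is, let $k = \langle k_0, k_1, \ldots, k_r \rangle$, where $0 \le k_i < m$.
Randomized strategy: Pick $a = \langle a_0, a_1, \ldots, a_r \rangle$ where each $a_i$ is chosen randomly from $\{0, 1, \ldots, m-1\}$.
Define the dot product, modulo m:
\[
h_a(k) = \sum_{i=0}^{r} a_i k_i \bmod m.
\]
How big is $H = \{h_a\}$? $|H| = m^{r+1}$.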
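
The construction can be sketched as follows. The function names `to_digits` and `make_dot_product_hash` are hypothetical, and storing the digits least significant first is an assumption; any fixed digit order works as long as it is used consistently.

```python
import random


def to_digits(k, m, r):
    """Decompose k into r + 1 base-m digits <k_0, ..., k_r>, least significant first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    if k != 0:
        raise ValueError("key needs more than r + 1 base-m digits")
    return digits


def make_dot_product_hash(m, r, rng=random):
    """Pick a = <a_0, ..., a_r> uniformly at random and return the function h_a."""
    a = [rng.randrange(m) for _ in range(r + 1)]

    def h(k):
        # Dot product of a with the digits of k, modulo the prime m.
        return sum(ai * ki for ai, ki in zip(a, to_digits(k, m, r))) % m

    return h


h = make_dot_product_hash(m=7, r=2)
print(h(100))  # some slot in 0..6; the value depends on the random choice of a
```

Each call to `make_dot_product_hash` draws one member of the family H, so there are m^(r+1) possible functions.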

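Universality of dot-product hash functions
Theorem. The set $H = \{h_a\}$ is universal.
Proof. Suppose that $x = \langle x_0, x_1, \ldots, x_r \rangle$ and $y = \langle y_0, y_1, \ldots, y_r \rangle$ are distinct keys. Thus, they differ in at least one digit position, wlog position 0. For how many $h_a \in H$ do x and y collide?
We must have $h_a(x) = h_a(y)$, which implies that
\[
\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}.
\]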

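Proof (cont.)
Equivalently, we have
\[
\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}
\]
or
\[
a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m},
\]
which implies that
\[
a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}.
\]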

Fact from number theory
Theorem. Let m be prime. For any $z \in \mathbb{Z}_m$ such that $z \ne 0$, there exists a unique $z^{-1} \in \mathbb{Z}_m$ such that
\[
z \cdot z^{-1} \equiv 1 \pmod{m}.
\]
Example: m = 7.
z       1  2  3  4  5  6
z^{-1}  1  4  5  2  3  6
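
The inverse table for m = 7 can be reproduced with Python's built-in three-argument `pow`, which accepts an exponent of -1 from Python 3.8 onward. The variable names are illustrative.

```python
m = 7

# Modular inverse of every nonzero element of Z_m.
inverses = {z: pow(z, -1, m) for z in range(1, m)}
print(inverses)  # {1: 1, 2: 4, 3: 5, 4: 2, 5: 3, 6: 6}

# Each product z * z^{-1} is congruent to 1 modulo m.
print(all(z * inv % m == 1 for z, inv in inverses.items()))  # True
```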

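Back to the proof
We have
\[
a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},
\]
and since $x_0 \ne y_0$, an inverse $(x_0 - y_0)^{-1}$ must exist, which implies that
\[
a_0 \equiv \Bigl(-\sum_{i=1}^{r} a_i (x_i - y_i)\Bigr) \cdot (x_0 - y_0)^{-1} \pmod{m}.
\]
Thus, for any choices of $a_1, a_2, \ldots, a_r$, exactly one choice of $a_0$ causes x and y to collide.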

Proof completed
Q. How many $h_a$'s cause x and y to collide?
A. There are m choices for each of $a_1, a_2, \ldots, a_r$, but once these are chosen, exactly one choice for $a_0$ causes x and y to collide, namely
\[
a_0 = \Bigl(\Bigl(-\sum_{i=1}^{r} a_i (x_i - y_i)\Bigr) \cdot (x_0 - y_0)^{-1}\Bigr) \bmod m.
\]
Thus, the number of $h_a$'s that cause x and y to collide is $m^r \cdot 1 = m^r = |H|/m$.
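
The counting argument can be checked by brute force for small parameters. The sketch below enumerates every vector a for a small prime m and counts those for which two given keys collide; the keys are written directly as digit tuples, and the specific values are hypothetical.

```python
from itertools import product


def count_colliding(x, y, m):
    """Count vectors a in {0..m-1}^(r+1) with h_a(x) == h_a(y).

    x and y are digit tuples <x_0, ..., x_r> and <y_0, ..., y_r>.
    """
    count = 0
    for a in product(range(m), repeat=len(x)):
        hx = sum(ai * xi for ai, xi in zip(a, x)) % m
        hy = sum(ai * yi for ai, yi in zip(a, y)) % m
        if hx == hy:
            count += 1
    return count


m = 5
x, y = (1, 2, 3), (4, 2, 0)      # distinct keys with r + 1 = 3 digits
print(count_colliding(x, y, m))  # 25, which is m**r for r = 2
```

Out of the 5^3 = 125 functions in the family, exactly 125/5 = 25 make the two keys collide, matching |H|/m.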

Perfect hashing
Given a set of n keys, construct a static hash table of size m = O(n) such that SEARCH takes $\Theta(1)$ time in the worst case.
IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!
[Figure: first-level table T with slots 0 to 6. Each nonempty slot j stores the size m and hash parameter a of a second-level table $S_j$, which holds the keys hashing to slot j without collisions.]

Collisions at level 2
Theorem. Let H be a class of universal hash functions for a table of size $m = n^2$. Then, if we use a random $h \in H$ to hash n keys into the table, the expected number of collisions is at most 1/2.

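Collisions at level 2
Proof. By the definition of universality, the probability that 2 given keys in the table collide under h is $1/m = 1/n^2$. Since there are $\binom{n}{2}$ pairs of keys that can possibly collide, the expected number of collisions is
\[
\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.
\]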

No collisions at level 2
Corollary. The probability of no collisions is at least 1/2.
Proof. Markov's inequality says that for any nonnegative random variable X, we have
\[
\Pr\{X \ge t\} \le E[X]/t.
\]
Applying this inequality with t = 1, we find that the probability of 1 or more collisions is at most 1/2.
Thus, just by testing random hash functions in H, we'll quickly find one that works.

Analysis of storage
Theorem. If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions, then
\[
E\Bigl[\sum_{j=0}^{m-1} n_j^2\Bigr] < 2n,
\]
where $n_j$ is the number of keys hashing to slot j.

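Proof of the theorem
Use the fact that $a^2 = a + 2\binom{a}{2}$:
\[
\begin{aligned}
E\Bigl[\sum_{j=0}^{m-1} n_j^2\Bigr]
&= E\Bigl[\sum_{j=0}^{m-1} \Bigl(n_j + 2\binom{n_j}{2}\Bigr)\Bigr] \\
&= E\Bigl[\sum_{j=0}^{m-1} n_j\Bigr] + 2\,E\Bigl[\sum_{j=0}^{m-1} \binom{n_j}{2}\Bigr] && \text{Linearity of expectation.} \\
&= E[n] + 2\,E\Bigl[\sum_{j=0}^{m-1} \binom{n_j}{2}\Bigr] \\
&= n + 2\,E\Bigl[\sum_{j=0}^{m-1} \binom{n_j}{2}\Bigr] && n \text{ is not a random variable.}
\end{aligned}
\]
The summation $\sum_{j=0}^{m-1} \binom{n_j}{2}$ is just the total number of collisions.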

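Proof of the theorem (cont.)
By the properties of universal hashing, the expected value of the summation $\sum_{j=0}^{m-1} \binom{n_j}{2}$ is at most
\[
\binom{n}{2} \cdot \frac{1}{m} = \frac{n(n-1)}{2m} = \frac{n-1}{2},
\]
since m = n; thus
\[
E\Bigl[\sum_{j=0}^{m-1} n_j^2\Bigr] \le n + 2 \cdot \frac{n-1}{2} = 2n - 1 < 2n.
\]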

Analysis of storage (cont.)
Corollary. If we store n keys in a hash table of size m = n using a hash function h randomly chosen from a universal class of hash functions, and we set the size of each secondary hash table to $m_j = n_j^2$ for $j = 0, 1, \ldots, m-1$, then the expected amount of storage required for all secondary hash tables in a perfect hashing scheme is less than 2n.
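
The two-level scheme can be sketched as below. Several choices here are assumptions and not taken from the slides: the hash family is the Carter-Wegman form h(k) = ((a·k + b) mod P) mod m, swapped in because the secondary table sizes $n_j^2$ are generally not prime; keys are assumed to be distinct integers in [0, P) and the key set nonempty; all function names are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (assumed hash family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m


def build_perfect_table(keys, rng=None):
    """Build a static two-level table for a nonempty list of distinct keys."""
    rng = rng or random.Random()
    n = len(keys)
    h1 = random_hash(n, rng)              # level 1: m = n slots
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)

    secondary = []
    for bucket in buckets:
        size = len(bucket) ** 2           # level 2: m_j = n_j^2 slots
        while True:                       # retry until no collisions occur
            h2 = random_hash(size, rng) if size else None
            slots = [None] * size
            collision = False
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:
                    collision = True
                    break
                slots[j] = k
            if not collision:
                break
        secondary.append((h2, slots))
    return h1, secondary


def search(table, k):
    """Worst-case constant-time lookup: one hash per level."""
    h1, secondary = table
    h2, slots = secondary[h1(k)]
    return bool(slots) and slots[h2(k)] == k


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
table = build_perfect_table(keys)
print(all(search(table, k) for k in keys))  # True
```

By the corollary on collisions at level 2, each retry loop succeeds with probability at least 1/2, so it is expected to run at most twice per bucket.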

Resolving collisions by open addressing
No storage is used outside of the hash table itself.
Insertion systematically probes the table until an empty slot is found.
The hash function depends on both the key and probe number:
\[
h : U \times \{0, 1, \ldots, m-1\} \to \{0, 1, \ldots, m-1\}.
\]
The probe sequence $\langle h(k,0), h(k,1), \ldots, h(k,m-1) \rangle$ should be a permutation of $\{0, 1, \ldots, m-1\}$.
The table may fill up, and deletion is difficult (but not impossible).

Example of open addressing
Insert key k = 496:
0. Probe h(496, 0): the slot is occupied, so a collision occurs.
1. Probe h(496, 1): the slot holds key 586, so another collision occurs.
2. Probe h(496, 2): the slot is empty, so 496 is inserted there.
Search uses the same probe sequence, terminating successfully if it finds the key and unsuccessfully if it encounters an empty slot.
[Figure: table T with slots 0 to m-1, showing the probed slots.]

Probing strategies
Linear probing: Given an ordinary hash function $h'(k)$, linear probing uses the hash function
\[
h(k, i) = (h'(k) + i) \bmod m.
\]
This method, though simple, suffers from primary clustering, where long runs of occupied slots build up, increasing the average search time. Moreover, the long runs of occupied slots tend to get longer.
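
A minimal sketch of an open-addressed table using linear probing. The auxiliary hash h'(k) = k mod m is only a placeholder assumption, the class and method names are hypothetical, and deletion is left out because it needs extra care in open addressing.

```python
class LinearProbingTable:
    """Open-addressed hash table with h(k, i) = (h'(k) + i) mod m."""

    def __init__(self, m, h_prime=None):
        self.m = m
        # Assumption: auxiliary hash h'(k) = k mod m unless one is supplied.
        self.h_prime = h_prime or (lambda k: k % m)
        self.slots = [None] * m

    def probe(self, k, i):
        return (self.h_prime(k) + i) % self.m

    def insert(self, k):
        """Probe until an empty slot (or the key itself) is found."""
        for i in range(self.m):
            j = self.probe(k, i)
            if self.slots[j] is None or self.slots[j] == k:
                self.slots[j] = k
                return j
        raise OverflowError("hash table is full")

    def search(self, k):
        """Follow the same probe sequence; stop at the key or an empty slot."""
        for i in range(self.m):
            j = self.probe(k, i)
            if self.slots[j] is None:
                return None
            if self.slots[j] == k:
                return j
        return None


table = LinearProbingTable(8)
print(table.insert(3))   # 3
print(table.insert(11))  # 4: slot 3 is taken, so the next slot is used
print(table.insert(19))  # 5: the run of occupied slots keeps growing
```

The three keys all hash to slot 3, and each insertion lengthens the same run, which is the primary clustering effect described above.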

Probing strategies
Double hashing: Given two ordinary hash functions $h_1(k)$ and $h_2(k)$, double hashing uses the hash function
\[
h(k, i) = (h_1(k) + i \cdot h_2(k)) \bmod m.
\]
This method generally produces excellent results, but $h_2(k)$ must be relatively prime to m. One way is to make m a power of 2 and design $h_2(k)$ to produce only odd numbers.
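
The relative-primality requirement can be seen by listing a full probe sequence. In this sketch, `h1` and `h2` are hypothetical illustrative choices: m is a power of 2 and `h2` is built to return only odd values.

```python
def double_hash_probe_sequence(k, m, h1, h2):
    """Return <h(k,0), ..., h(k,m-1)> for h(k,i) = (h1(k) + i*h2(k)) mod m."""
    return [(h1(k) + i * h2(k)) % m for i in range(m)]


m = 16                                   # a power of 2


def h1(k):
    return k % m                         # assumption: illustrative primary hash


def h2(k):
    return 2 * (k % (m // 2)) + 1        # assumption: always odd, so coprime to m


seq = double_hash_probe_sequence(123, m, h1, h2)
print(sorted(seq) == list(range(m)))     # True: every slot is probed exactly once

# With an even step size the sequence cycles through only a few slots.
bad = double_hash_probe_sequence(5, m, h1, lambda k: 4)
print(sorted(set(bad)))                  # [1, 5, 9, 13]
```

When the step shares a factor with m, part of the table is never probed, so an insertion can fail even though empty slots remain.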

Analysis of open addressing
We make the assumption of uniform hashing: Each key is equally likely to have any one of the m! permutations as its probe sequence.
Theorem. Given an open-addressed hash table with load factor $\alpha = n/m < 1$, the expected number of probes in an unsuccessful search is at most $1/(1 - \alpha)$.

Proof of the theorem
Proof.
At least one probe is always necessary.
With probability n/m, the first probe hits an occupied slot, and a second probe is necessary.
With probability (n-1)/(m-1), the second probe hits an occupied slot, and a third probe is necessary.
With probability (n-2)/(m-2), the third probe hits an occupied slot, etc.
Observe that
\[
\frac{n-i}{m-i} < \frac{n}{m} = \alpha \quad \text{for } i = 1, 2, \ldots, n.
\]

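Proof (continued)
Therefore, the expected number of probes is
\[
\begin{aligned}
& 1 + \frac{n}{m}\Bigl(1 + \frac{n-1}{m-1}\Bigl(1 + \frac{n-2}{m-2}\Bigl(\cdots\Bigl(1 + \frac{1}{m-n+1}\Bigr)\cdots\Bigr)\Bigr)\Bigr) \\
&\quad \le 1 + \alpha(1 + \alpha(1 + \alpha(\cdots(1 + \alpha)\cdots))) \\
&\quad \le 1 + \alpha + \alpha^2 + \alpha^3 + \cdots \\
&\quad = \sum_{i=0}^{\infty} \alpha^i \\
&\quad = \frac{1}{1 - \alpha}.
\end{aligned}
\]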

Implications of the theorem
If $\alpha$ is constant, then accessing an open-addressed hash table takes constant time.
If the table is half full, then the expected number of probes is 1/(1 - 0.5) = 2.
If the table is 90% full, then the expected number of probes is 1/(1 - 0.9) = 10.
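
As a sanity check, the nested expression from the proof can be evaluated exactly and compared with the bound 1/(1 - α). The function name is hypothetical, and exact rational arithmetic is used to avoid floating-point rounding.

```python
from fractions import Fraction


def exact_expected_probes(n, m):
    """Evaluate 1 + (n/m)(1 + ((n-1)/(m-1))(1 + ...)) from the inside out.

    Assumes 0 <= n < m (the table is not full) and uniform hashing.
    """
    total = Fraction(1)
    for i in reversed(range(n)):          # i = n-1, ..., 1, 0
        total = 1 + Fraction(n - i, m - i) * total
    return total


for n, m in [(50, 100), (90, 100)]:
    alpha = Fraction(n, m)
    exact = exact_expected_probes(n, m)
    bound = 1 / (1 - alpha)
    print(f"alpha = {float(alpha):.2f}: exact = {float(exact):.3f}, bound = {float(bound):.0f}")
```

For a table of 100 slots the exact values sit slightly below the bounds of 2 and 10, which is consistent with the theorem being an upper bound.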

Amortized analysis
Next week

Multiplication method
Assume that all keys are integers, $m = 2^r$, and our computer has w-bit words. Define
\[
h(k) = (A \cdot k \bmod 2^w) \;\text{rsh}\; (w - r),
\]
where rsh is the bit-wise right-shift operator and A is an odd integer in the range $2^{w-1} < A < 2^w$.
Don't pick A too close to $2^w$.
Multiplication modulo $2^w$ is fast.
The rsh operator is fast.

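Multiplication method example
\[
h(k) = (A \cdot k \bmod 2^w) \;\text{rsh}\; (w - r)
\]
Suppose that $m = 8 = 2^3$ and that our computer has w = 7-bit words. The product $A \cdot k$ is reduced modulo $2^7$, and the top r = 3 bits of the 7-bit result give h(k).
[Figure: binary multiplication of A by k, with the bits selected as h(k) marked, next to a modular wheel.]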
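
The formula translates directly into bit operations. In the sketch below, the 7-bit values chosen for A and k are illustrative assumptions, since the operands of the slide's example are not recoverable here; the function name is hypothetical.

```python
def multiplication_hash(k, A, w, r):
    """Multiplication method: h(k) = (A*k mod 2^w) rsh (w - r)."""
    low_w_bits = (A * k) & ((1 << w) - 1)   # A*k mod 2^w via a bit mask
    return low_w_bits >> (w - r)            # keep the top r of those w bits


w, r = 7, 3          # 7-bit words, m = 2**3 = 8 slots
A = 0b1011001        # assumption: an odd multiplier with 2**(w-1) < A < 2**w
k = 0b1101011        # assumption: an illustrative 7-bit key

print(multiplication_hash(k, A, w, r))       # 3, i.e. binary 011
```

Here A·k = 89 · 107 = 9523, which is 51 modulo 128, or 0110011 in binary; shifting right by w - r = 4 bits leaves 011.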