Hashing. Alexandra Stefan



Hash tables
Tables:
- Direct access table (or key-index table): key => index
- Hash table: key => hash value => index
Main components:
- Hash function
- Collision resolution: different keys mapped to the same index
- Dynamic hashing: reallocate the table as needed. If an Insert operation brings the load factor to 1/2, double the table. If a Delete operation brings the load factor to 1/8, halve the table.
Properties:
- Good time-space trade-off
- Good for: Search, insert, delete: O(1)
- Not good for: Select, sort: not supported, must use a different method
Reading: chapter 11, CLRS (chapter 14, Sedgewick, has more complexity analysis)
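The doubling/halving policy above can be sketched as follows. This is a minimal sketch under the thresholds stated on the slide; `resized_capacity` is an illustrative helper name, not from the slides.

```c
#include <assert.h>

/* Resizing policy from the slide (thresholds are the slide's 1/2 and 1/8):
 * double the table when an insert brings the load factor N/M up to 1/2,
 * halve it when a delete brings the load factor down to 1/8. */
int resized_capacity(int n, int m) {
    if (2 * n >= m)           /* load factor >= 1/2 after an insert */
        return 2 * m;
    if (8 * n <= m && m > 1)  /* load factor <= 1/8 after a delete */
        return m / 2;
    return m;                 /* no resize needed */
}
```

Keeping the load factor inside [1/8, 1/2) this way bounds chain lengths while avoiding thrashing: a table that was just doubled must lose three quarters of its items before it shrinks again.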

Example
Let M = 10, h(k) = k%10
Insert keys:
6 -> 6
15 -> 5
20 -> 0
37 -> 7
3 -> 3
25 -> 5 collision
35 -> 5 collision
29 -> 9

index | k
0 | 20
3 | 3
5 | 15
6 | 6
7 | 37
(indexes 1, 2, 4, 8, 9 empty)

Collision resolution:
- Separate chaining
- Open addressing
  - Linear probing
  - Quadratic probing
  - Double hashing

Hash functions
M - table size. h - hash function that maps a key to an index.
We want random-like behavior: any key can be mapped to any index with equal probability.
Typical functions:
- h(k,M) = floor( ((k-s)/(t-s)) * M ), where s <= k < t.
  Simple; good if keys are random, not so good otherwise.
- h(k,M) = k % M
  Best M is a prime number. (Avoid M that has a power of 2 as a factor; it will generate more collisions.) Choose M to be a prime number that is closest to the desired table size. If M = 2^p, it uses only the lower-order p bits => bad; ideally use all bits.
- h(k,M) = floor(M * (k*A mod 1)), 0 < A < 1 (in CLRS). Good A = 0.618033 (the golden ratio). Useful when M is not prime (can pick M to be a power of 2).
- Alternative: h(k,M) = (16161 * (unsigned)k) % M (from Sedgewick)
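A minimal sketch of two of the functions above: the division method and the CLRS multiplication method. The function names are illustrative, not from the slides.

```c
#include <assert.h>

/* Division method: h(k,M) = k % M. Best when M is prime. */
int hash_mod(int k, int M) {
    return k % M;
}

/* Multiplication method (CLRS): h(k,M) = floor(M * (k*A mod 1)),
 * with A close to the golden ratio. Works even when M is not prime,
 * e.g. a power of 2. */
int hash_mult(int k, int M) {
    const double A = 0.618033;
    double x = k * A;
    double frac = x - (long)x;   /* k*A mod 1 */
    return (int)(M * frac);
}
```

For example, with M = 10 and k = 25: hash_mod gives 5, while hash_mult gives floor(10 * 0.4508...) = 4.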

Collision Resolution: Separate Chaining
α = N/M (N items in the table, M table size) - load factor
Separate chaining:
- Each table entry points to a list of all items whose keys were mapped to that index.
- Requires extra space for links in lists.
- Lists will be short. On average, size α.
- Preferred when the table must support deletions.
Operations:
- Chain_Insert(T,x) - O(1): insert x in list T[h(x.key)] at the beginning. No search for duplicates.
- Chain_Delete(T,x) - O(1): delete x from list T[h(x.key)].
- Chain_Search(T,k) - Θ(1+α) (both successful and unsuccessful): search in list T[h(k)] for an item x with x.key == k.
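The operations above can be sketched as a minimal chaining table for integer keys. The names `chain_insert` and `chain_search` mirror the slide's Chain_Insert/Chain_Search; the fixed global table is a simplification for illustration.

```c
#include <stdlib.h>

/* Minimal separate-chaining table for integer keys, M = 10. */
#define M 10

typedef struct node { int key; struct node *next; } node;

static node *table[M];

static int h(int k) { return k % M; }

/* O(1): insert at the beginning of the chain, no duplicate check. */
void chain_insert(int k) {
    node *x = malloc(sizeof *x);
    x->key = k;
    x->next = table[h(k)];
    table[h(k)] = x;
}

/* Theta(1 + alpha) expected: scan only the one chain for the key. */
int chain_search(int k) {
    for (node *p = table[h(k)]; p != NULL; p = p->next)
        if (p->key == k) return 1;
    return 0;
}
```

Inserting the running example's keys 6, 15, 20, 37, 3, 25 puts 25 and 15 in the same chain at index 5, with 25 at the front because insertion is at the head.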

Separate Chaining
Example: insert 25. Let M = 10, h(k) = k%10.
Insert keys:
6 -> 6
15 -> 5
20 -> 0
37 -> 7
3 -> 3
25 -> 5 collision

index | keys
0 | 20
3 | 3
5 | 25 -> 15
6 | 6
7 | 37

Inserting at the beginning of the list is advantageous in cases where the more recent data is more likely to be accessed again (e.g., new students will have to go see several offices at UTA and so they will be looked up more frequently than continuing students).
α = 6/10

Collision Resolution: Open Addressing
α = N/M (N items in the table, M table size) - load factor
Open addressing: use empty cells in the table to store colliding elements.
- M > N
- α - ratio of used cells in the table (< 1).
- Probing - examining slots in the table. Number of probes = number of slots examined.
Linear probing: h(k,i,M) = (h1(k) + i) % M
- If the slot where the key is hashed is taken, use the next available slot.
- Long chains.
Quadratic probing: h(k,i,M) = (h1(k) + c1*i + c2*i^2) % M
- If two keys hash to the same value, they follow the same set of probes.
Double hashing: h(k,i,M) = (h1(k) + i * h2(k)) % M
- h2(k) should not evaluate to 0. (E.g. use: h2(k) = 1 + k%(M-1).)
- Use a second hash value as the jump size (as opposed to size 1 in linear probing).
- Want: h2(k) relatively prime with M. Either:
  - M prime and h2(k) = 1 + k%(M-1), or
  - M = 2^p and h2(k) odd (M and h2(k) will be relatively prime since all the divisors of M are powers of 2, thus even).
See figure 14.10, page 596 (Sedgewick) for the clustering produced by linear probing and double hashing.
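The three probe formulas can be written directly as functions. This sketch uses the constants from the worked example on the following slides (M = 10, c1 = 2, c2 = 1, h2(k) = 1 + (k%9)); those specific choices are the worksheet's, not general requirements.

```c
#include <assert.h>

/* Probe sequences for the M = 10 worked example. */
#define M 10

static int h1(int k) { return k % M; }

/* Linear: step 1 each probe. */
int linear_probe(int k, int i)    { return (h1(k) + i) % M; }

/* Quadratic with c1 = 2, c2 = 1, as in the worksheet. */
int quadratic_probe(int k, int i) { return (h1(k) + 2*i + i*i) % M; }

/* Double hashing with h2(k) = 1 + (k % 9): never 0, so the
 * probe sequence always advances. */
int double_probe(int k, int i) {
    int h2 = 1 + (k % 9);
    return (h1(k) + i * h2) % M;
}
```

For k = 25, quadratic_probe yields the worksheet's slot sequence 5, 8, 3, 0 and double_probe yields 5, 3, 1, 9, ...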

Open Addressing: Quadratic - Worksheet
M = 10, h1(k) = k%10. Table already contains keys: 6, 15, 20, 37, 3.
Next we want to insert 25: h1(25) = 5 (collision: 25 with 15).

Linear probing: h(k,i,M) = (h1(k) + i) % M (try slots: 5, 6, 7, 8)
Quadratic probing example: h(k,i,M) = (h1(k) + 2i + i^2) % M (try slots: 5, 8)
- Inserting 35 (not shown in table): try slots 5, 8, 3, 0.

(probe) i | h1(k) + 2i + i^2 | %10
0 | |
1 | |
2 | |
3 | |

index | Linear | Quadratic | Double hashing h2(k)=1+(k%7) | Double hashing h2(k)=1+(k%9)
0 | 20 | 20 | 20 | 20
3 | 3 | 3 | 3 | 3
5 | 15 | 15 | 15 | 15
6 | 6 | 6 | 6 | 6
7 | 37 | 37 | 37 | 37

Where will 29 be inserted now (after 35)?

Open Addressing: Quadratic - Answers
M = 10, h1(k) = k%10. Table already contains keys: 6, 15, 20, 37, 3.
Next we want to insert 25: h1(25) = 5 (collision: 25 with 15).

Linear probing: h(k,i,M) = (h1(k) + i) % M (try slots: 5, 6, 7, 8)
Quadratic probing example: h(k,i,M) = (h1(k) + 2i + i^2) % M (try slots: 5, 8)
- Inserting 35 (not shown in table): try slots 5, 8, 3, 0.

(probe) i | h1(k) + 2i + i^2 | %10
0 | 5+0 = 5 | 5
1 | 5+3 = 8 | 8
2 | 5+8 = 13 | 3
3 | 5+2*3+3^2 = 5+15 = 20 | 0

index | Linear | Quadratic | Double hashing h2(k)=1+(k%7) | Double hashing h2(k)=1+(k%9)
0 | 20 | 20 | 20 | 20
1 |  |  |  | 25
3 | 3 | 3 | 3 | 3
5 | 15 | 15 | 15 | 15
6 | 6 | 6 | 6 | 6
7 | 37 | 37 | 37 | 37
8 | 25 | 25 |  | 

Where will 29 be inserted now (after 35)?

Open Addressing: Double Hashing - Worksheet
M = 10, h1(k) = k%10. Table already contains keys: 6, 15, 20, 37, 3.
Try to insert 25: h1(25) = 5 (collision: 25 with 15).
Double hashing example: h(k,i,M) = (h1(k) + i * h2(k)) % M
Choice of h2 matters:
- h2a(k) = 1+(k%7): h2a(25) = 1+(25%7) = 1+4 = 5 => h(k,i,M) = (5 + i*5)%M => slots: 5, 0, 5, 0, ... Cannot insert 25.
- h2b(k) = 1+(k%9): h2b(25) = 1+(25%9) = 1+7 = 8 => h(k,i,M) = (5 + i*8)%M => slots: 5, 3, 1, 9, 7, 5, ...

(probe) i | index | (h1(k) + i*h2a(k))%M = (5+i*5)%10
0 | |
1 | |
2 | |
3 | |

(probe) i | index | (h1(k) + i*h2b(k))%M = (5+i*8)%10
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |

index | Linear | Quadratic | Double hashing h2(k)=1+(k%7) | Double hashing h2(k)=1+(k%9)
0 | 20 | 20 | 20 | 20
3 | 3 | 3 | 3 | 3
5 | 15 | 15 | 15 | 15
6 | 6 | 6 | 6 | 6
7 | 37 | 37 | 37 | 37
8 | 25 | 25 |  | 

Where will 29 be inserted now?

Open Addressing: Double Hashing - Answers
M = 10, h1(k) = k%10. Table already contains keys: 6, 15, 20, 37, 3.
Try to insert 25: h1(25) = 5 (collision: 25 with 15).
Double hashing example: h(k,i,M) = (h1(k) + i * h2(k)) % M
Choice of h2 matters:
- h2a(k) = 1+(k%7): h2a(25) = 1+(25%7) = 1+4 = 5 => h(k,i,M) = (5 + i*5)%M => slots: 5, 0, 5, 0, ... Cannot insert 25.
- h2b(k) = 1+(k%9): h2b(25) = 1+(25%9) = 1+7 = 8 => h(k,i,M) = (5 + i*8)%M => slots: 5, 3, 1, 9, 7, 5, ...

(probe) i | index | (h1(k) + i*h2a(k))%M = (5+i*5)%10
0 | 5 | (5+0)%10 = 5
1 | 0 | (5+5)%10 = 0
2 | 5 | (5+2*5)%10 = 15%10 = 5, cycles back to 5 => cannot insert 25
3 | 0 | (5+3*5)%10 = 20%10 = 0

(probe) i | index | (h1(k) + i*h2b(k))%M = (5+i*8)%10
0 | 5 | (5+0)%10 = 5
1 | 3 | (5+8)%10 = 3
2 | 1 | (5+2*8)%10 = 21%10 = 1
3 | 9 | (5+3*8)%10 = 29%10 = 9
4 | 7 | (5+4*8)%10 = 37%10 = 7
5 | 5 | (5+5*8)%10 = 45%10 = 5, cycles back to 5

index | Linear | Quadratic | Double hashing h2(k)=1+(k%7) | Double hashing h2(k)=1+(k%9)
0 | 20 | 20 | 20 | 20
1 |  |  |  | 25
3 | 3 | 3 | 3 | 3
5 | 15 | 15 | 15 | 15
6 | 6 | 6 | 6 | 6
7 | 37 | 37 | 37 | 37
8 | 25 | 25 |  | 

Where will 29 be inserted now?
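The cycle above is exactly what happens when h2(k) shares a factor with M: the probe sequence visits only M/gcd(h2(k), M) distinct slots. A small check of that claim for M = 10 (the function name is illustrative):

```c
#include <assert.h>

/* Count how many distinct slots (h1 + i*h2) % M visits in M probes.
 * If gcd(h2, M) = g > 1, only M/g slots are ever examined --
 * the cycle seen in the worksheet when h2(25) = 5 and M = 10. */
#define M 10

int distinct_slots(int h1, int h2) {
    int seen[M] = {0}, count = 0;
    for (int i = 0; i < M; i++) {
        int s = (h1 + i * h2) % M;
        if (!seen[s]) { seen[s] = 1; count++; }
    }
    return count;
}
```

With h1 = 5: jump h2 = 5 reaches only 2 slots (5 and 0), jump h2 = 8 reaches 5 slots, while any h2 relatively prime to 10 (e.g. 3) reaches all 10.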

Open Addressing: Quadratic vs. Double Hashing
M = 10, h1(k) = k%10. Table already contains keys: 6, 15, 20, 37, 3.
Try to insert 25: h1(25) = 5 (collision: 25 with 15).

Quadratic hashing with 2i + i^2: h(k) = (h1(k) + 2i + i^2)%M, where h1(k) = k%M:
(probe) i | index | h(k) = (h1(k) + 2i + i^2)%M = (5 + 2i + i^2)%10 (k=25)
0 | |
1 | |
2 | |
3 | |

Double hashing: h(k) = (h1(k) + i*h2(k))%M, where h1(k) = k%M and h2(k) = 1+(k%(M-1)) = 1+(k%9); h2(25) = 1+(25%9) = 1+7 = 8:
(probe) i | index | h(k) = (h1(k) + i*h2(k))%M = (5 + i*8)%10 (for k=25)
0 | |
1 | |
2 | |
3 | |
4 | |
5 | |

index | Linear | Quadratic | Double hashing h2(k)=1+(k%7) | Double hashing h2(k)=1+(k%9)
0 | 20 | 20 | 20 | 20
1 |  |  |  | 25
3 | 3 | 3 | 3 | 3
5 | 15 | 15 | 15 | 15
6 | 6 | 6 | 6 | 6
7 | 37 | 37 | 37 | 37
8 | 25 | 25 |  | 

Choice of h2 matters. See h2(k) = 1+(k%7): h2(25) = 1+4 = 5 => h(25) cycles: 5, 0, 5, 0 => could not insert 25.

Where will 29 be inserted now?

Open Addressing: Quadratic vs. Double Hashing - Answers
M = 10, h1(k) = k%10. Table already contains keys: 6, 15, 20, 37, 3.
Try to insert 25: h1(25) = 5 (collision: 25 with 15).

Quadratic hashing with 2i + i^2: h(k) = (h1(k) + 2i + i^2)%M, where h1(k) = k%M:
(probe) i | index | h(k) = (5 + 2i + i^2)%10 (k=25)
0 | 5 | 5+0 = 5
1 | 8 | 5+3 = 8
2 | 3 | 5+8 = 13
3 | 0 | 5+2*3+3^2 = 5+15 = 20

Double hashing: h(k) = (h1(k) + i*h2(k))%M, where h1(k) = k%M and h2(k) = 1+(k%(M-1)) = 1+(k%9); h2(25) = 1+(25%9) = 1+7 = 8:
(probe) i | index | h(k) = (5 + i*8)%10 (for k=25)
0 | 5 | (5+0)%10 = 5
1 | 3 | (5+8)%10 = 3
2 | 1 | (5+2*8)%10 = 21%10 = 1
3 | 9 | (5+3*8)%10 = 29%10 = 9
4 | 7 | (5+4*8)%10 = 37%10 = 7
5 | 5 | (5+5*8)%10 = 45%10 = 5, cycles back to 5

index | Linear | Quadratic | Double hashing h2(k)=1+(k%7) | Double hashing h2(k)=1+(k%9)
0 | 20 | 20 | 20 | 20
1 |  |  |  | 25
3 | 3 | 3 | 3 | 3
5 | 15 | 15 | 15 | 15
6 | 6 | 6 | 6 | 6
7 | 37 | 37 | 37 | 37
8 | 25 | 25 |  | 

Choice of h2 matters. See h2(k) = 1+(k%7): h2(25) = 1+4 = 5 => h(25) cycles: 5, 0, 5, 0 => could not insert 25.

Where will 29 be inserted now?

Search and Deletion in Open Addressing
Searching: report as not found when you land on an EMPTY cell.
Deletion: mark the cell as DELETED, not as an EMPTY cell. Otherwise you will break the chain and not be able to find the elements following in that chain.
E.g., with linear probing and hash function h(k,i,10) = (k + i) % 10: insert 15, 25, 35, 5; search for 5; then delete 25 and search for 5 or 35.
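The EMPTY vs. DELETED distinction can be sketched for the slide's example, h(k,i,10) = (k + i) % 10. The function names and sentinel values are illustrative; the point is that search stops at EMPTY but probes past DELETED.

```c
#include <assert.h>

/* Linear-probing table with tombstones for non-negative integer keys. */
#define M 10
#define EMPTY   (-1)
#define DELETED (-2)

static int table[M];

void table_init(void) { for (int i = 0; i < M; i++) table[i] = EMPTY; }

void oa_insert(int k) {
    for (int i = 0; i < M; i++) {
        int s = (k + i) % M;
        if (table[s] == EMPTY || table[s] == DELETED) { table[s] = k; return; }
    }
}

int oa_search(int k) {
    for (int i = 0; i < M; i++) {
        int s = (k + i) % M;
        if (table[s] == EMPTY) return 0;  /* chain ends at an EMPTY cell */
        if (table[s] == k) return 1;      /* keep probing past DELETED */
    }
    return 0;
}

void oa_delete(int k) {
    for (int i = 0; i < M; i++) {
        int s = (k + i) % M;
        if (table[s] == EMPTY) return;    /* not in the table */
        if (table[s] == k) { table[s] = DELETED; return; }
    }
}
```

Running the slide's scenario: 15, 25, 35, 5 land in slots 5, 6, 7, 8. After deleting 25, slot 6 becomes DELETED rather than EMPTY, so searches for 5 and 35 still probe through it and succeed; marking it EMPTY would make them fail.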

Open Addressing: Clustering
Linear probing - primary clustering: the longer the chain, the higher the probability that it will increase.
Given a chain of size T in a table of size M, what is the probability that this chain will increase after a new insertion?
Quadratic probing?

Expected Time Complexity for Hash Operations (under perfect conditions)

Operation \ Method | Separate chaining | Open addressing
Successful search | Θ(1+α) | (1/α)ln(1/(1-α))
Unsuccessful search | Θ(1+α) | 1/(1-α)
Insert | Θ(1) - when: insert at the beginning and no search for duplicates | 1/(1-α)
Delete | Θ(1) - assumes: doubly-linked list and the node with the item to be deleted is given | The time complexity does not depend only on α, but also on the deleted cells. In such cases separate chaining may be preferred to open addressing, as its behavior has better guarantees.
Reference | Theorems 11.1 and 11.2 | Theorems 11.6 and 11.8 and Corollary 11.7 in CLRS

Perfect conditions: simple uniform hashing (separate chaining); uniform hashing (open addressing).
α = N/M is the load factor.

Perfect Hashing
Similar to separate chaining, but use another hash table instead of a linked list.
Can be done for static keys (once the keys are stored in the table, the set of keys never changes).
Corollary 11.11 (CLRS): Suppose that we store N keys in a hash table using perfect hashing. Then the expected storage used for the secondary hash tables is less than 2N.

Hashing Strings
Given strings with 7-bit encoded characters:
- View the string as a number in base 128 where each letter is a digit.
- Convert it to base 10.
- Apply mod M.
Example: "now": n = 110, o = 111, w = 119:
h("now", M) = (110*128^2 + 111*128^1 + 119) % M
See Sedgewick page 576 (figure 14.3) for the importance of M for collisions: M = 31 is better than M = 64.
When M = 64, it is a divisor of 128 => (110*128^2 + 111*128^1 + 119) % M = 119 % M => only the last letter is used.
=> If lowercase English letters: a,...,z => 97,...,122 => (% 64) => 26 indexes, [33,...,58], used out of the [0,63] available. Additional collisions will happen due to the last-letter distribution (how many words end in "t"?).
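The "only the last letter is used" claim is easy to check numerically with the slide's example word. This is a sketch of that one calculation; `hash_now` is an illustrative name, hard-coded to "now".

```c
#include <assert.h>

/* "now" as a base-128 number: n = 110, o = 111, w = 119. */
long now_as_number(void) {
    return 110L * 128 * 128 + 111L * 128 + 119;  /* = 1816567 */
}

/* h("now", M) for a given table size M. */
int hash_now(int M) {
    return (int)(now_as_number() % M);
}
```

Since 64 divides 128, every 128^j term vanishes mod 64 and hash_now(64) equals 119 % 64 = 55: the first two letters contribute nothing. With the prime M = 31, all three letters influence the result.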

Hashing Long Strings
For long strings, the previous method will result in number overflow. E.g., with 32-bit integers and an 11-character word: c*128^10 = c*(2^7)^10 = c*2^70.
(View the word as a number in base 128; each letter is a "digit".)
Solution: partial calculations.
Replace: c10*128^10 + c9*128^9 + ... + c1*128^1 + c0
with: ((...(c10*128 + c9)*128 + c8)*128 + ... + c1)*128 + c0
Compute it iteratively and apply mod M at each iteration:

// (Sedgewick)
int hash(char *v, int M) {
    int h = 0, a = 128;
    for (; *v != '\0'; v++)
        h = (a*h + *v) % M;
    return h;
}

// improvement: use 127, not 128: (Sedgewick)
int hash(char *v, int M) {
    int h = 0, a = 127;
    for (; *v != '\0'; v++)
        h = (a*h + *v) % M;
    return h;
}
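To see that applying % M at every iteration changes nothing, the Horner-style hash above can be compared against a direct base-128 evaluation on a string short enough not to overflow. This sketch restates the slide's a = 128 version next to the direct computation (names are illustrative):

```c
#include <assert.h>

/* Horner's rule with % M at each step, as in the slide's hash(). */
int hash_iter(const char *v, int M) {
    int h = 0;
    for (; *v != '\0'; v++)
        h = (128 * h + *v) % M;
    return h;
}

/* Direct base-128 value -- only safe for short strings,
 * which is exactly why the iterative version exists. */
long base128_value(const char *v) {
    long x = 0;
    for (; *v != '\0'; v++)
        x = 128 * x + *v;
    return x;
}
```

For "now", base128_value gives 1816567 and hash_iter("now", 31) gives 1816567 % 31 = 29; the two agree for any M because (a*h + c) % M depends only on h % M.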