A Lecture on Hashing

Aram-Alexandre Pooladian, Alexander Iannantuono

March 22, 2017

This is the scribing of a lecture given by Luc Devroye on the 17th of March 2017 for Honours Algorithms and Data Structures (COMP 252). The subject was hashing and its uses.

Hashing

Definition 1. A hash function, h(·), is a map from the space of keys, K, to another space, M = {0, 1, ..., m - 1} (we note that the cardinality of M, |M|, is m). Hence, we write h : K → M, and note that the mapping usually appears to be random and uniformly distributed.

Example 2. Consider a hash function that takes the last two digits of a phone number. Hence, this would map a phone number, say 514.844.1395 ↦ 95. Using this hash function, we have that m = 100, and thus M = {0, 1, ..., 99} (see the sketch after the list below).

There are different methods of using hash functions in data structures that permit dictionary operations. We looked into the following three methods during this lecture:

1. Direct Addressing
2. Hashing with Chaining
3. Open Addressing
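Before turning to these, here is Example 2 as a minimal sketch in Python (the function name and the phone-number format are ours, not from the lecture):

    def h(phone: str) -> int:
        """Hash a phone number to its last two digits, so m = 100."""
        digits = [c for c in phone if c.isdigit()]
        return int("".join(digits[-2:]))

    assert h("514.844.1395") == 95  # lands in M = {0, 1, ..., 99}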

Direct Addressing

For a key k, we have h(k) = k. The table that the keys are mapped to is defined as T = T[0], T[1], ..., T[m - 1]. This table is represented in the computer as an array and is typically implemented as an exogenous data structure: it holds pointers to the actual data (see Figure 1). For a data item x, its key is denoted by key[x].

[Figure 1: Example of a direct hash table.]

Operations - Simple

The following dictionary operations work under the assumption that all keys are unique. They all require O(1) time.

    INSERT(x, T)
        T[h(key[x])] = x

    DELETE(x, T)
        T[h(key[x])] = NIL

    SEARCH(k, T)
        if T[k] == NIL
            return Not Found
        else
            return T[k]

Avoiding Initialization

A possible issue with implementing the hash table in this specific way occurs when |K| ≪ |M|. Then we would need to initialize a table that would be very large, arguably larger than the memory at our disposal. Even if this were possible, we would be wasting storage space, because many slots in our table would never get used. Lastly, the initialization itself would take O(m) time, since we set every slot to NIL. We need to do better.

A possible solution would be to simply identify end points (our smallest and largest values in M) and consider everything between these points to be our table. However, this could result in our table T containing garbage values, and it takes Θ(m) time to delete the garbage.

To avoid initialization, we instead use a second table, T′, such that |T′| = |T|, and a stack S. As we shall see below, with this method we are able to identify garbage by the clever use of forward and backward pointers. The stack S has positions indexed by 1 to Last[S]. T′[k] points to the place in S where the stored key k is recorded, and S[i] = k acts as a check. We rewrite the basic dictionary operations:

Operations - Avoiding Initialization

    INSERT(x, T)
        Last[S] += 1        // a new element is added to the stack
        k = key[x]          // note: the keys are still unique
        T[k] = x            // pointer to x
        T′[k] = Last[S]
        S[Last[S]] = k      // acts as a back-pointer

    DELETE(x, T)
        k = key[x]
        T[k] = NIL          // step not strictly required
        p = T′[k]           // position in S that we want to free
        q = S[Last[S]]      // key currently at the top of the stack
        Last[S] -= 1        // decrement the size of the stack
        T′[q] = p           // the top key now lives at position p
        S[p] = q

    SEARCH(k, T)
        if 1 ≤ T′[k] ≤ Last[S] and S[T′[k]] == k
            return T[k]
        else
            return Not Found

[Figure 2: Process of avoiding initialization. The dotted line illustrates the DELETE operation.]
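Below is a sketch of the initialization-avoiding table in Python, following the pseudocode above. Python has no uninitialized memory, so the arrays here are necessarily allocated clean and the sketch only simulates the idea; we also index the stack from 0 rather than 1, and write Tp for T′:

    class DirectTable:
        """Direct addressing over keys {0, ..., m-1}; Tp and S let us
        recognize garbage slots without ever initializing T."""
        def __init__(self, m):
            # On a real machine T and Tp could hold arbitrary garbage;
            # only S[0..last-1] is ever trusted.
            self.T = [None] * m   # T[k] -> item with key k
            self.Tp = [0] * m     # Tp[k] -> position of k in the stack S
            self.S = [0] * m      # stack of keys currently stored
            self.last = 0         # Last[S]; the stack occupies S[0..last-1]

        def insert(self, k, x):
            self.T[k] = x
            self.Tp[k] = self.last
            self.S[self.last] = k      # back-pointer: S certifies k is live
            self.last += 1

        def search(self, k):
            p = self.Tp[k]
            if 0 <= p < self.last and self.S[p] == k:  # the certificate check
                return self.T[k]
            return None                # garbage slot: k was never inserted

        def delete(self, k):
            p = self.Tp[k]             # position of k in the stack
            q = self.S[self.last - 1]  # key at the top of the stack
            self.S[p] = q              # move the top key into k's slot
            self.Tp[q] = p
            self.last -= 1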

Hashing with Chaining

As before, we have our table T of size m, but now instead of containing pointers to our data, it contains pointers to linked lists that contain the data, i.e., T[k] is a pointer to the head of a linked list. All elements x in that list have keys that hash to k, that is, h(key[x]) = k. This is to say that the linked lists are somewhat nested inside the array. We will see that this is a convenient way to handle collisions. That being said, before we continue, we must first formally define a collision.

Definition 3. A collision is said to occur when, for a given hash function h, there exist k1, k2 ∈ K such that k1 ≠ k2 and h(k1) = h(k2). In other words, h is not injective.

Operations

    INSERT(x, T)
        k = h(key[x])
        LinkedListINSERT(x, T[k])

    DELETE(x, T)
        k = h(key[x])
        LinkedListDELETE(x, T[k])

    SEARCH(k, T)
        k′ = h(k)
        return LinkedListSEARCH(k, T[k′])   // reports whether k is there

[Figure 3: Process of hashing with chaining. Dotted lines indicate pointers to the heads of the linked lists.]
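A compact sketch of chaining in Python; for simplicity, Python lists stand in for the linked lists, and Python's built-in hash reduced mod m stands in for h:

    class ChainedHashTable:
        def __init__(self, m):
            self.m = m
            self.T = [[] for _ in range(m)]   # T[k] plays the role of a chain

        def _h(self, key):
            return hash(key) % self.m         # stand-in for the hash function h

        def insert(self, key, value):
            self.T[self._h(key)].append((key, value))

        def search(self, key):
            for k, v in self.T[self._h(key)]:  # walk the chain for slot h(key)
                if k == key:
                    return v
            return None

        def delete(self, key):
            chain = self.T[self._h(key)]
            self.T[self._h(key)] = [(k, v) for k, v in chain if k != key]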

Expected Time for Unsuccessful Search

Let T_U be the expected time for an unsuccessful search. Assume that each key gets mapped uniformly at random to one of the m chains. Let E_S be the expected list size, which is n/m, n being the number of items and m being |T|. This ratio is known as the load factor and is denoted by α. We have

    T_U = 1 + E_S = 1 + n/m = 1 + α,

where the +1 is due to overhead. We also have that the total space used, T_S, is n cells + m headers, or

    T_S = n + m = n(1 + m/n) = n(1 + 1/α).

We then notice a trade-off between T_U and T_S, as illustrated in Figure 4. It is recommended that α ≈ 1 in order to have good speed for the operations at reasonable space.

[Figure 4: The dotted green circle outlines the desired sweet spot for the load factor.]

Discussion of Result

If we keep α fixed, then the expected time of an INSERT, SEARCH or DELETE operation is O(1), regardless of how large n is. This makes hashing a formidable competitor to binary search trees.

Dynamic Hashing

In practice, if we consider the load factor as a function of time, i.e., as dictionary operations are performed on the hash table, we would like it to remain within reasonable bounds. To keep it there, we check whether the current load factor is within our bounds before performing an INSERT or DELETE and, if it is beyond a bound, we do what is known as rehashing: the hash table size is changed to keep the load factor near a target value, and every element in the table is hashed again into the new table. Consider the following example (a code sketch follows below):

Example 4. We set our lower and upper bounds to 1/2 and 2 respectively; thus α ∈ (1/2, 2). If α goes below or reaches 1/2, we make a new table T′ of size m/2 and iterate through all of the elements in the table T, inserting each of them into T′. We then relabel T′ as T. On the other hand, if α goes above or reaches 2, we rehash to a table T′ of size 2m. This ensures that α remains close to 1. Rehashing takes O(n) time, but its amortized cost is O(1).

[Figure 5: Load factor as a function of time with dynamic hashing.]
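A sketch of the rehashing policy of Example 4, reusing the ChainedHashTable sketch above (the initial size and the exact thresholds are the example's, not a tuned implementation; delete assumes the key is present):

    class DynamicChainedTable(ChainedHashTable):
        """Keeps the load factor alpha = n/m inside (1/2, 2) by rehashing."""
        def __init__(self, m=8):
            super().__init__(m)
            self.n = 0

        def _rehash(self, new_m):
            items = [pair for chain in self.T for pair in chain]
            self.m, self.T = new_m, [[] for _ in range(new_m)]
            for k, v in items:                 # reinsert everything: O(n)
                self.T[self._h(k)].append((k, v))

        def insert(self, key, value):
            if (self.n + 1) / self.m >= 2:     # alpha about to reach 2
                self._rehash(2 * self.m)       # double the table
            super().insert(key, value)
            self.n += 1

        def delete(self, key):
            super().delete(key)
            self.n -= 1
            if self.m > 1 and self.n / self.m <= 0.5:  # alpha reached 1/2
                self._rehash(self.m // 2)      # halve the table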

Radix Sort

Radix sort can be regarded as a version of hashing with chaining. We demonstrate how radix sort works for integers given in decimal. As an example, take

    319 329 77 729 321 327 322.

We first create ten linked lists (so m = 10) and perform what is called bucketing, using the hash function

    h(x) = x mod 10,

where x is simply the integer we want to sort. We place the integers in their respective buckets as follows:

    0: -   1: 321   2: 322   3: -   4: -   5: -   6: -   7: 77, 327   8: -   9: 319, 329, 729

Then we concatenate the linked lists from left to right,

    321 322 77 327 319 329 729,

and repeat the process with the function

    h(x) = ⌊x/10⌋ mod 10.

Their buckets are

    0: -   1: 319   2: 321, 322, 327, 329, 729   3: -   4: -   5: -   6: -   7: 77   8: -   9: -

Doing a third and final round (on the hundreds digit) would place them all in order. We note that this procedure takes time n × (number of rounds), so in this small example the time and space are indeed O(n).

Assume now that we wish to sort n numbers from the space {1, 2, ..., n^17} in O(n) time and space (in the RAM model). The number of rounds would be log_10 n^17 = 17 log_10 n, leading to time complexity Θ(n log n), which is not good! A solution is to change the base from 10 to something else, like n. Then any number in the range [0, n^17) can be written as

    x = x_0 + x_1 n + x_2 n^2 + ... + x_16 n^16,

with the coefficients being

    x_0 = x mod n,
    x_1 = ((x - x_0)/n) mod n,
    x_2 = ((x - x_0 - x_1 n)/n^2) mod n,

and so on. Writing the integers in this fashion and using radix sort gives us hash tables of size n and time complexity O(n), since only 17 rounds are required. A sketch follows.
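A sketch of radix sort by bucketing in Python; base 10 reproduces the worked example, and taking the base to be n gives the O(n) behaviour discussed above:

    def radix_sort(a, base=10):
        """LSD radix sort: bucket by each digit, least significant first."""
        rounds = 1                               # enough rounds to cover max(a)
        while base ** rounds <= max(a):
            rounds += 1
        for r in range(rounds):
            buckets = [[] for _ in range(base)]  # m = base chains
            for x in a:
                # h(x) = (x // base^r) mod base: the r-th digit of x
                buckets[(x // base ** r) % base].append(x)
            a = [x for chain in buckets for x in chain]  # concatenate left to right
        return a

    # The worked example: three rounds in base 10.
    print(radix_sort([319, 329, 77, 729, 321, 327, 322]))
    # [77, 319, 321, 322, 327, 329, 729]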

Open Addressing

This method does not require any linked lists, but it does need the number of elements n to be at most |T| = m. The idea behind open addressing is to store a key in the first available slot: we start by checking whether T[h(k)] is free and, if not, we generate a new index every time we find a full slot, until we find an empty one. An interesting difference in open addressing, compared to the previous methods, is that a key's location in the table is affected by the order in which the keys were inserted. To consider examples of hashing functions, we must first define the concept of a probe sequence.

Definition 5. A probe sequence is the sequence of indices of the slots checked in an attempt to insert a key into the table T. We use the notation h(k, 0), h(k, 1), ..., h(k, m - 1) for the probe sequence, where it is understood that this must form a permutation of 0, 1, ..., m - 1.

    INSERT(x, T)
        k = key[x]
        j = 0                        // counts the number of attempts
        while j < m and T[h(k, j)] ≠ NIL
            j += 1
        if j == m                    // used all possible options
            return Table is Full
        else
            T[h(k, j)] = x

    SEARCH(k, T)
        j = 0
        while j < m and T[h(k, j)] ≠ NIL
            if key[T[h(k, j)]] == k
                return T[h(k, j)]
            j += 1
        return Not Found

[Figure 6: A probe sequence, illustrated. The hatched slots are filled.]
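A sketch of open addressing in Python. The abstract probe sequence h(k, j) is instantiated here with linear probing, which is only defined in the next section; deletion is omitted, since simply clearing a slot would break later searches:

    def probe(h0, j, m):
        return (h0 + j) % m   # linear probing; any permutation-valued h(k, j) works

    class OpenAddressTable:
        EMPTY = object()      # sentinel for never-used slots (NIL)

        def __init__(self, m):
            self.m = m
            self.T = [self.EMPTY] * m

        def insert(self, key, value):
            h0 = hash(key) % self.m
            for j in range(self.m):          # try h(k,0), h(k,1), ..., h(k,m-1)
                i = probe(h0, j, self.m)
                if self.T[i] is self.EMPTY:
                    self.T[i] = (key, value)
                    return
            raise RuntimeError("table is full")

        def search(self, key):
            h0 = hash(key) % self.m
            for j in range(self.m):
                i = probe(h0, j, self.m)
                if self.T[i] is self.EMPTY:  # an empty slot ends the probe sequence
                    return None
                if self.T[i][0] == key:
                    return self.T[i][1]
            return None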

Speed Comparison with the Chaining Method

Recall that we have a table of size m with n ≤ m items stored. Assume that each item is inserted using an independent, uniformly random permutation for h(k, 0), ..., h(k, m - 1) (this is unrealistic, but nevertheless insightful). Let us look at the operation INSERT in order to compare with the chaining method. We have

    P[the 1st attempt finds an empty slot] = (m - n)/m = 1 - α,

and thus

    P[the 1st attempt finds an occupied slot] = 1 - (1 - α) = α.

Similarly, the probability that the first two attempts both find occupied slots is

    (n/m) · ((n - 1)/(m - 1)) ≤ α^2,

and, in general, the probability that the first k attempts all find occupied slots is

    (n/m) · ((n - 1)/(m - 1)) · ... · ((n - k + 1)/(m - k + 1)) ≤ α^k,

since (n - i)/(m - i) ≤ n/m when n ≤ m. Writing O_k for the event that the k-th attempt is the first to find an empty slot, the expected time to insert, E[T_INSERT], is

    E[T_INSERT] = 1·P[O_1] + 2·P[O_2] + 3·P[O_3] + ...
                = (P[O_1] + P[O_2] + ...) + (P[O_2] + P[O_3] + ...) + (P[O_3] + ...) + ...

The first parenthesized sum equals 1; the second is the probability that the first attempt fails, which is α; the third is the probability that the first two attempts fail, which is at most α^2; and so on. Hence

    E[T_INSERT] ≤ 1 + α + α^2 + ... = 1/(1 - α).

This is worse than chaining (since 1 + α ≤ 1/(1 - α)), but gets close when α is small. It is recommended to keep α small, preferably less than 0.5. If we maintain that, then, just like chaining, we have a formidable competitor to binary search trees in dictionary applications.

[Figure 7: The expected insert time for open addressing grows without bound as α → 1, whereas chaining does not have this behaviour.]
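A quick numerical look at the two curves in Figure 7, comparing 1 + α for chaining against the bound 1/(1 - α) for open addressing (the values are computed here, not taken from the notes):

    for alpha in [0.1, 0.25, 0.5, 0.75, 0.9]:
        chaining = 1 + alpha           # expected unsuccessful-search time, chaining
        open_addr = 1 / (1 - alpha)    # upper bound on expected insert, open addressing
        print(f"alpha={alpha:5}: chaining ~ {chaining:.2f}, open addressing <= {open_addr:.2f}")

At α = 0.5 the bounds are 1.5 versus 2.0, while at α = 0.9 they are 1.9 versus 10.0, which is why a small load factor matters so much more for open addressing.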

Examples of Hashing Functions

Linear Probing

Linear probing is a method of probing whose probe sequence is generated by one of the equations below:

    h(k, j) = (h(k) + j) mod m    or    h(k, j) = (h(k) + c·j) mod m, where gcd(c, m) = 1.

Remark 6. Linear probing, while easy to implement, has a flaw known as primary clustering. Clustering reveals itself as the slots in the table T fill up: long runs of occupied slots build up, negatively affecting the search and insert times.

Random Probing

Random probing is another method of probing, whose probe sequence is generated by

    h(k, j) = (h(k) + d_j) mod m,

where d_0, d_1, ..., d_{m-1} is a sequence that forms a permutation of {0, 1, ..., m - 1} and appears to be, or is close to, a truly random uniform permutation. An example is the linear congruential sequence

    d_0 = 0 and d_{i+1} = (a·d_i + 1) mod m,

where a is an integer. It is well known that d_0, ..., d_{m-1} is a permutation of M if and only if a - 1 is a multiple of every prime that divides m, where 4 is also counted as a "prime" here (i.e., if 4 divides m, then 4 must divide a - 1). If m is prime, this condition forces a ≡ 1 (mod m). If m = 100, then a must be 1, 21, 41, 61 or 81. A sketch of this check follows.
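A sketch verifying the permutation condition for m = 100 (the criterion above is the Hull–Dobell theorem specialized to increment 1; the function name is ours):

    def lcg_sequence(a, m):
        """First m terms of d_0 = 0, d_{i+1} = (a*d_i + 1) mod m."""
        d, seq = 0, []
        for _ in range(m):
            seq.append(d)
            d = (a * d + 1) % m
        return seq

    # For m = 100, exactly a in {1, 21, 41, 61, 81} yield a permutation of M,
    # since a - 1 must be divisible by 2, 5 and 4, i.e. by 20.
    good = [a for a in range(1, 100) if sorted(lcg_sequence(a, 100)) == list(range(100))]
    print(good)  # [1, 21, 41, 61, 81]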