Introduction to Hash Tables

Hash Functions

A hash table represents a simple but efficient way of storing, finding, and removing elements. In general, a hash table is represented by an array of cells. In its simplest form, each cell is either empty, contains a reference to a stored element, or contains a reference to a list of stored elements. The size n of a hash table equals the number of cells in the array.

To insert an element x into a hash table of size n, one requires a hash function h that takes as input x and returns a (usually large) integer h(x). The value h(x) mod n then provides the index of a table cell where x may be placed. If there is no element currently stored in this cell, then x may be placed there. Otherwise, x may have to be relocated, depending on whether or not the table allows multiple elements to be stored in a single cell.

In practice, h evaluates only a portion of x, called its key, that uniquely identifies x. For example, an employee record may consist of multiple kilobytes of data, including a string s of ten digits that serves as the employee's unique identifier. Then h(s) may be used to insert the record into a hash table.

Example 1. Given a hash table of size 10, let the set of elements to be hashed be the set of 5-digit strings. Define h by letting h(x) equal the integer value of the string. For example, h(00020) = 20, while h(00000) = 0. Use h to hash the elements 54371, 51323, 16170, 64199, 44344, 96787, 19898, 00002, 55435 to cells of the table.

Example 1 Solution. Since h(x) mod 10 equals the last digit of each string, the elements hash to cells 1, 3, 0, 9, 4, 7, 8, 2, and 5, respectively, with no collisions.
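As a quick illustration of the indexing rule h(x) mod n, here is a minimal Python sketch of Example 1 (the names h and table are illustrative, not from the notes):

def h(x):
    """Hash function from Example 1: the integer value of the digit string."""
    return int(x)

n = 10
table = [None] * n
for x in ["54371", "51323", "16170", "64199", "44344",
          "96787", "19898", "00002", "55435"]:
    index = h(x) % n      # cell index; here, simply the last digit of x
    table[index] = x      # no collisions occur for this particular data set

print(table)
# ['16170', '54371', '00002', '51323', '44344',
#  '55435', None, '96787', '19898', '64199']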

Properties of a good hash function h

Efficient: Computing h(x) should require a constant number of steps per bit of x. This is essential, since every table operation involving x will require a computation of h(x).

Uniform: h should ideally map elements randomly and uniformly over the entire range of integers. This gives each element an equally likely chance of being inserted into any of the table cells.

Suppose element e is represented by an array of m bytes b_0, b_1, ..., b_{m-1}. Then a common method for defining a hash function h(e) is to consider the key polynomial

    p(x) = b_0 x^{m-1} + b_1 x^{m-2} + ... + b_{m-1}.

Then we may define h(e) = p(a), where a is an appropriately chosen positive integer that helps h attain the uniformity property. For example, a = 37 has been shown to give adequate results.

Example 2. Provide the key polynomial in the case that e is the string Hello!.

We now show that an element's key polynomial can be computed in O(m) steps using Horner's algorithm. Given polynomial p(x) = b_0 x^{m-1} + b_1 x^{m-2} + ... + b_{m-1}, Horner's algorithm begins by computing p_0(x) = b_0. Now assume p_k(x) has been computed for some k ≥ 0; then p_{k+1}(x) is computed via the equation

    p_{k+1}(x) = x p_k(x) + b_{k+1}.
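As a small worked instance of the key polynomial (added here for illustration, using a different string from Example 2): for the two-byte string "Hi", with ASCII codes b_0 = 72 ('H') and b_1 = 105 ('i'), the key polynomial is p(x) = 72x + 105, so with a = 37 we get h("Hi") = p(37) = 72 * 37 + 105 = 2769.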

Example 3. Given the polynomial p(x) = 2x^3 - 3x^2 + 5x - 7, show the sequence of polynomials that are evaluated leading up to the evaluation of p(x) using Horner's algorithm. Show how the algorithm can be used to compute p(-2).

Horner's Algorithm
Input: coefficient array b[0 : m-1].
Input: value x.
Initialize sum: sum ← 0.
For each i from 0 to m-1:
    sum ← sum * x + b[i].
Return sum.
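The same algorithm renders naturally in Python; the following sketch (function names are my own) evaluates both Example 3's polynomial and a string's key polynomial at a = 37:

def horner(b, x):
    """Evaluate p(x) = b[0]*x^(m-1) + b[1]*x^(m-2) + ... + b[m-1]
    in O(m) steps using Horner's rule."""
    total = 0
    for coeff in b:
        total = total * x + coeff
    return total

def key_hash(e, a=37):
    """Key-polynomial hash of a string e, evaluated at x = a."""
    return horner([ord(ch) for ch in e], a)

print(horner([2, -3, 5, -7], -2))   # p(-2) = -45 for p(x) = 2x^3 - 3x^2 + 5x - 7
print(key_hash("Hello!") % 10)      # cell index in a table of size 10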

Collision Handling

Given a hash table T of size n, a collision occurs in T whenever an element x is to be inserted into T and h(x) mod n yields the index of a table cell that already contains an inserted element. The two most common methods for resolving a collision are separate chaining and open-address probing.

Separate chaining

Separate chaining handles collisions by allowing multiple elements to be inserted within the same table cell. This is accomplished by letting the cell reference a list of all elements inserted in the cell. Separate chaining represents a very simple and convenient way of handling collisions, and can be efficient provided the lists do not become too long.

The load factor of a hash table is defined as λ = m/n, where m is the number of hashed items and n is the table size.

Theorem 1. Let λ = m/n be the load factor of a hash table. Under the assumption that separate chaining is used to resolve collisions, the average length of a chain is λ.

Proof. Letting L_i, i = 1, 2, ..., n, denote the length of chain i, we see that the average chain length is

    (1/n)(L_1 + L_2 + ... + L_n) = m/n = λ,

and the result is proved.

From Theorem 1, and the facts that

1. successfully finding data in a list will on average take L/2 steps, where L is the length of the list, and

2. unsuccessfully finding data in a list will take L steps on average,

it follows that the complexity of hashing is directly dependent upon λ, and so λ should be made a small constant, such as λ = 1.
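The following is a minimal separate-chaining sketch in Python (the class and method names are my own illustration, and Python's built-in hash stands in for h):

class ChainedHashTable:
    """Hash table that resolves collisions with separate chaining."""

    def __init__(self, n):
        self.n = n
        self.cells = [[] for _ in range(n)]     # each cell holds a chain (list)

    def insert(self, x, h=hash):
        self.cells[h(x) % self.n].append(x)

    def find(self, x, h=hash):
        return x in self.cells[h(x) % self.n]   # scans a single chain only

table = ChainedHashTable(10)
for x in [54371, 51323, 16170, 64199, 44344]:
    table.insert(x)
print(table.find(16170))   # True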

Example 4. Assuming the data and hash function from Example 1, hash the elements into a table of size 10 using separate chaining. Assume additional elements 12345, 23423, 17654, 12343.

The following is stated without proof. A proof can be found in Mitzenmacher and Upfal's Probability and Computing.

Theorem 2. If n elements are uniformly hashed into a hash table of size n using separate chaining, then with probability approaching 1 as n approaches infinity, the longest chain of the table will be Θ(ln n / ln ln n).

It turns out one can do much better if, instead of hashing to only one bin, one hashes to, say, d ≥ 2 bins and chooses the bin of least size in which to place the element.

Theorem 3. Suppose that n elements are uniformly hashed into a table of size n using separate chaining in the following manner. For each element, d ≥ 2 index locations are determined via d hash functions, and the element is placed into the shortest of the corresponding d lists (with ties broken randomly). Then after all n elements have been hashed, with probability 1 - o(1/n), the longest list will have length at most ln ln n / ln d + O(1).
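A small simulation, under the assumption of uniformly random hash choices, illustrates the gap between one choice and two choices (a sketch added here, not part of the original notes):

import random

def max_load(n, d):
    """Throw n balls into n bins, sending each ball to the least-loaded
    of d uniformly random candidate bins; return the maximum bin load."""
    bins = [0] * n
    for _ in range(n):
        candidates = [random.randrange(n) for _ in range(d)]
        bins[min(candidates, key=lambda b: bins[b])] += 1
    return max(bins)

n = 100000
print(max_load(n, 1))   # one choice: grows like ln n / ln ln n (Theorem 2)
print(max_load(n, 2))   # two choices: grows like ln ln n / ln 2 (Theorem 3)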

Open-address probing collision resolution

The method of open-address probing allows only one inserted element per cell, and thus must find a new location for an element x for which I = h(x) mod n yields the index of an already-occupied cell. In response, the values (I + f(i)) mod n, i = 1, 2, ..., are computed until one of them yields an unoccupied cell index. The following are some possibilities for the probing function f.

Linear probing: f(i) = ai + b is a linear function.

Quadratic probing: f(i) = ai^2 is a quadratic function.

Random probing: the next cell to be probed is randomly and uniformly selected using a pseudorandom number generator.

Double hashing: f(i) = i h_2(x) for some second hash function h_2.

Example 5. Hash 1.34, 1.45, 2.56, 5.12, 5.34, 4.34 using hash function h(x) = ⌊x⌋ and i) linear probing with f(i) = i; ii) quadratic probing with f(i) = i^2 for collision resolution. Assume a table size of 13.
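A minimal open-addressing insert in Python (an illustrative sketch; it assumes the probe sequence eventually reaches an empty cell, which Theorem 4 below guarantees for quadratic probing when the table size is prime and the load factor stays below 1/2):

import math

def insert(table, x, h, f):
    """Open addressing: probe cells (h(x) + f(i)) mod n for i = 0, 1, 2, ...
    until an empty cell is found. Assumes such a cell is reachable."""
    n = len(table)
    I = h(x) % n
    i = 0
    while table[(I + f(i)) % n] is not None:
        i += 1
    table[(I + f(i)) % n] = x

table = [None] * 13
for x in [1.34, 1.45, 2.56, 5.12, 5.34, 4.34]:
    insert(table, x, h=math.floor, f=lambda i: i * i)   # quadratic probing
print(table)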

Example 6. Assuming open addressing with a random probe function, determine the expected number of probes that are needed to hash m elements into a table of size n.
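One standard way to approach Example 6 (a sketch added here, not part of the original notes): assume that each probe lands on a uniformly random cell. When i elements are already stored, a probe finds an empty cell with probability (n - i)/n, so the number of probes for the (i+1)-st insertion is geometric with expectation n/(n - i). Summing over i = 0, 1, ..., m-1 gives

    Σ_{i=0}^{m-1} n/(n - i) = n(H_n - H_{n-m}) ≈ n ln(n/(n - m)),

where H_k denotes the k-th harmonic number. Equivalently, the average number of probes per insertion is about (1/λ) ln(1/(1 - λ)) for load factor λ = m/n.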

Theorem 4. If p is prime, then the first ⌈p/2⌉ perfect squares (mod p) are distinct. As a corollary, if a hash-table size is prime and the load factor is less than 1/2, then every collision can be successfully re-addressed via quadratic probing.

Proof. Let m_1 and m_2 be integers from {0, 1, ..., ⌈p/2⌉ - 1}, and suppose m_1^2 ≡ m_2^2 (mod p). Then by definition of mod, we have p | (m_2^2 - m_1^2), which implies p | (m_2 - m_1)(m_2 + m_1), which implies either p | (m_2 - m_1) or p | (m_2 + m_1). In the first case, we must then have m_1 = m_2, since |m_2 - m_1| < p. The latter case is impossible when m_1 is distinct from m_2, since then 0 < m_2 + m_1 < p, and so p cannot possibly divide this sum. Therefore, if m_1 is distinct from m_2, then their squares must also be distinct modulo p.
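A quick numeric check of Theorem 4 in Python (illustration only):

# For a prime p, the first ceil(p/2) squares mod p should all be distinct.
p = 13
squares = [(i * i) % p for i in range((p + 1) // 2)]
print(squares)                            # [0, 1, 4, 9, 3, 12, 10]
print(len(set(squares)) == len(squares))  # True: no repeats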

Universal Hashing

Even for a good hash function h, it is possible for an adversary to choose a set of elements that cause h to perform badly by producing several collisions. Universal hashing solves this problem by instead providing an entire family of hash functions; when the table is created, one of the hash functions is randomly selected to perform the hashing. It is assumed that the adversary does not know which hash function from the family will be selected. We call a collection H_n of hash functions universal if, for each pair of distinct keys k and l, the number of hash functions h in H_n for which h(k) = h(l) is at most |H_n|/n, where n is the range of each hash function. In other words, for a randomly selected h, the chance of a collision between k and l is at most 1/n. It is not hard to prove the existence of universal families of hash functions, but this goes beyond the scope of this lecture. Moreover, it can be proven that, for a universal hashing family H_n and any sequence of m insert, find, and remove operations using a table of size n, the expected time to handle these operations is Θ(m).

Perfect Hashing

Perfect hashing is the notion of designing a hash function and table so that find operations do not incur any collisions when retrieving an element. Perfect hashing can be achieved when the data to be stored in the table remains fixed. Perfect hashing schemes can be achieved using universal hashing along with a two-level table approach. The first table contains references to other tables, each one with its own hash function. The hash functions are chosen carefully so that the second hash yields zero collisions. Again, the details go beyond the scope of this lecture.

Secure Hash Algorithm (SHA)

A secure hash algorithm is one that is used for mapping a sequence of characters (such as a password) into a pseudorandom word whose length ranges anywhere from 128 to 512 bits, depending on the algorithm. SHAs undergo rigorous testing by the National Institute of Standards and Technology. One desirable property of a SHA is that it produces uncorrelated output words for two input words, even if the input words differ by only a single character. This makes it very difficult for a hacker who might possess the hashed word w (obtained via a security breach) and is searching for a preimage that hashes to w.
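A sketch of one classical universal family, the Carter-Wegman construction h_{a,b}(k) = ((a*k + b) mod p) mod n (illustrative Python; the prime and names are my own choices):

import random

p = 2_147_483_647          # a prime larger than any key we will hash
n = 10                     # table size (range of each hash function)

def random_hash(p, n):
    """Pick h_{a,b}(k) = ((a*k + b) mod p) mod n uniformly from the family."""
    a = random.randrange(1, p)
    b = random.randrange(p)
    return lambda k: ((a * k + b) % p) % n

h = random_hash(p, n)      # the adversary cannot predict this choice
print(h(54371), h(51323))  # any two distinct keys collide with probability
                           # at most about 1/n over the choice of h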

Exercises.

1. Given elements 4371, 1323, 6173, 4199, 4344, 9679, 1989, hash function h(x) = x, and table size 10, hash the elements using i) separate chaining, ii) linear probing with f(i) = i, and iii) quadratic probing with f(i) = i^2.

2. Given hash function h(x) = ⌊x⌋ and a hash table of size 13, use quadratic probing with f(i) = i^2 to hash the numbers 2.12, 2.31, 6.21, 2.99, 2.56, 11.94, 11.00. Draw the resulting hash table.

3. Suppose we are using a random hash function h that hashes m elements into a table of size n. What is the expected number of collisions? In other words, what is the expected size of the set {(x_i, x_j) : i < j and h(x_i) = h(x_j)}? Hint: define appropriate indicator random variables, and use the fact that the expectation of a sum equals the sum of the expectations.

4. When using separate chaining to handle collisions within a hash table T, suppose we insist on representing each chain as a sorted array, and that a linear-time merge procedure is performed whenever a newly hashed element is inserted into the chain. If T has load factor λ, then provide big-O expressions for the average time it will take to i) successfully find an element in T, ii) unsuccessfully find an element in T, and iii) insert an element into T. Explain.

5. Suppose that we are storing a set of m keys in a hash table of size n. Show that if the keys are drawn from a universe whose size exceeds mn, then there is a subset of m keys that all hash to the same slot, so that the worst-case searching time for separate-chaining hashing is Θ(m).

6. Suppose we want to determine if the string s = s_1 s_2 ... s_k is a substring of the much larger string a_1 a_2 ... a_n. One approach is to compute h(s) with some hash function h. Then, for each i = 1, 2, ..., n - k + 1, compute h(a_i a_{i+1} ... a_{i+k-1}). If this value is identical to h(s), then compare this substring with s, and return true if the strings are identical. Otherwise, proceed to the next substring. Argue that the running time of this algorithm is O(k(n - k)). Prove that, if we assume the hash function is h(x_0 ... x_l) = Σ_{i=0}^{l} x_{l-i} * 37^i, then the running time can be reduced to O(n). Hint: argue that h(a_{i+1} a_{i+2} ... a_{i+k}) can be computed in Θ(1) steps once h(a_i a_{i+1} ... a_{i+k-1}) has been computed.

7. Given two lists of integers of sizes m and n respectively, describe an algorithm that runs in time O(m + n) and computes the intersection of the lists. Argue that your algorithm is correct and has the desired running time.

Hints and Answers.

1. i) Separate chaining: cell 1: 4371; cell 3: 1323, 6173; cell 4: 4344; cell 9: 4199, 9679, 1989.

ii) Linear probing: cell 0: 9679, cell 1: 4371, cell 2: 1989, cell 3: 1323, cell 4: 6173, cell 5: 4344, cell 9: 4199; cells 6-8 are empty.

iii) Quadratic probing: cell 0: 9679, cell 1: 4371, cell 3: 1323, cell 4: 6173, cell 5: 4344, cell 8: 1989, cell 9: 4199; cells 2, 6, and 7 are empty.

2. The resulting table is: cell 2: 2.12, cell 3: 2.31, cell 5: 2.56, cell 6: 6.21, cell 7: 11.00, cell 11: 2.99, cell 12: 11.94; all other cells are empty.

3. Let I_ij be an indicator random variable that is set to 1 if x_i and x_j hash to the same cell, where 1 ≤ i < j ≤ m. Then E[I_ij] equals the probability that x_i and x_j hash to the same cell, which equals 1/n (why?). Then, since there are m(m-1)/2 different indicator variables, it follows that the expected size of {(x_i, x_j) : i < j and h(x_i) = h(x_j)} equals m(m-1)/(2n).

4. Successful and unsuccessful searches: O(log L), where L is the chain length, since binary search can be used. Insertions: O(log L) by using a binary heap at each cell to dynamically sort the incoming elements.

5. Use the pigeonhole principle.

6. Letting b_i = a_{i+1} a_{i+2} ... a_{i+k}, show that h(b_{i+1}) = (h(b_i) - a_{i+1} * 37^{k-1}) * 37 + a_{i+k+1}. In other words, the hash for b_{i+1} can be obtained from the hash of b_i in a constant number of operations.

7. Hash the first list into a table T. This takes O(m) steps. Then, for each a in the second list, hash a to check if a ∈ T, and add a to the output list if so. This requires an additional O(n) steps, for a total of O(m + n) steps.
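A Python sketch of the rolling-hash idea from Exercise 6 and Hint 6 (a minimal illustration; function names are my own, and in practice one would reduce the hash modulo a large prime to keep the values small):

def poly_hash(s, a=37):
    """h(x_0 ... x_l) = sum of x_{l-i} * a^i, i.e. Horner evaluation at a."""
    total = 0
    for ch in s:
        total = total * a + ord(ch)
    return total

def is_substring(s, text, a=37):
    """Rabin-Karp style scan in O(n) expected time, using the rolling hash
    h(b_{i+1}) = (h(b_i) - text[i]*a^(k-1)) * a + text[i+k] from Hint 6."""
    k, n = len(s), len(text)
    if k > n:
        return False
    target = poly_hash(s, a)
    window = poly_hash(text[:k], a)
    lead = a ** (k - 1)                  # weight of the window's first character
    for i in range(n - k + 1):
        if window == target and text[i:i+k] == s:
            return True                  # hash match confirmed by comparison
        if i + k < n:                    # slide the window one position right
            window = (window - ord(text[i]) * lead) * a + ord(text[i + k])
    return False

print(is_substring("ash", "hash tables"))   # True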