Fundamental Algorithms

Similar documents
Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Hash Tables. Direct-Address Tables Hash Functions Universal Hashing Chaining Open Addressing. CS 5633 Analysis of Algorithms Chapter 11: Slide 1

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record.

Data Structures and Algorithm. Xiaoqing Zheng

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Hashing, Hash Functions. Lecture 7

Lecture: Analysis of Algorithms (CS )

Advanced Implementations of Tables: Balanced Search Trees and Hashing

COMP251: Hashing. Jérôme Waldispühl School of Computer Science McGill University. Based on (Cormen et al., 2002)

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

Collision. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

Hashing. Data organization in main memory or disk

Hashing. Dictionaries Hashing with chaining Hash functions Linear Probing

Hash tables. Hash tables

Hash tables. Hash tables

Hash tables. Hash tables

Searching, mainly via Hash tables

Introduction to Hash Tables

? 11.5 Perfect hashing. Exercises

Searching. Constant time access. Hash function. Use an array? Better hash function? Hash function 4/18/2013. Chapter 9

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

Introduction to Hashtables

CSE 502 Class 11 Part 2

So far we have implemented the search for a key by carefully choosing split-elements.

Analysis of Algorithms I: Perfect Hashing

Hashing. Why Hashing? Applications of Hashing

Data Structures and Algorithm. Xiaoqing Zheng

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Hashing Data Structures. Ananda Gunawardena

A Lecture on Hashing. Aram-Alexandre Pooladian, Alexander Iannantuono March 22, Hashing. Direct Addressing. Operations - Simple

Round 5: Hashing. Tommi Junttila. Aalto University School of Science Department of Computer Science

n = p 1 p 2 p r = q 1 q 2 q m, then r = m (i.e. the number of primes in any prime decomposition

1 Maintaining a Dictionary

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Hashing. Hashing DESIGN & ANALYSIS OF ALGORITHM

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Module 1: Analyzing the Efficiency of Algorithms

CS483 Design and Analysis of Algorithms

Hashing. Martin Babka. January 12, 2011

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

Lecture 8: Number theory

Lecture 7: Amortized analysis

Array-based Hashtables

Abstract Data Type (ADT) maintains a set of items, each with a key, subject to

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Algorithms (II) Yu Yu. Shanghai Jiaotong University

compare to comparison and pointer based sorting, binary trees

Algorithms for Data Science

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

CS 591, Lecture 6 Data Analytics: Theory and Applications Boston University

Lecture Lecture 3 Tuesday Sep 09, 2014

Proof by Contradiction

1 Hashing. 1.1 Perfect Hashing

Lecture Notes for Chapter 17: Amortized Analysis

Tutorial 4. Dynamic Set: Amortized Analysis

Counting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109

Integer factorization, part 1: the Q sieve. part 2: detecting smoothness. D. J. Bernstein

Lecture 12: Hash Tables [Sp 15]

Finding Divisors, Number of Divisors and Summation of Divisors. Jane Alam Jan

The set of integers will be denoted by Z = {, -3, -2, -1, 0, 1, 2, 3, 4, }

Hashing. Alexandra Stefan

6.854 Advanced Algorithms

Problem Set 4 Solutions

Lecture 8 HASHING!!!!!

Algorithms for NLP. Language Modeling III. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

Lecture 7: More Arithmetic and Fun With Primes

N/4 + N/2 + N = 2N 2.

Notes. Number Theory: Applications. Notes. Number Theory: Applications. Notes. Hash Functions I

Optimisation and Operations Research

Big O 2/14/13. Administrative. Does it terminate? David Kauchak cs302 Spring 2013

CS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d

Graduate Analysis of Algorithms Dr. Haim Levkowitz

Math 131 notes. Jason Riedy. 6 October, Linear Diophantine equations : Likely delayed 6

Fundamental Algorithms

4.5 Applications of Congruences

6.1 Occupancy Problem

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

This is a recursive algorithm. The procedure is guaranteed to terminate, since the second argument decreases each time.

Adders, subtractors comparators, multipliers and other ALU elements

Number Theory. Modular Arithmetic

Integer factorization, part 1: the Q sieve. D. J. Bernstein

1. Introduction. Mathematisches Institut, Seminars, (Y. Tschinkel, ed.), p Universität Göttingen,

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

CS369N: Beyond Worst-Case Analysis Lecture #6: Pseudorandom Data and Universal Hashing

Lecture 4: Divide and Conquer: van Emde Boas Trees

Integers and Division

Distributed Systems Part II Solution to Exercise Sheet 10

Turing Machine Variants

Data structures Exercise 1 solution. Question 1. Let s start by writing all the functions in big O notation:

Linear Congruences. The equation ax = b for a, b R is uniquely solvable if a 0: x = b/a. Want to extend to the linear congruence:

Lecture 2: A Las Vegas Algorithm for finding the closest pair of points in the plane

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

Topic Contents. Factoring Methods. Unit 3: Factoring Methods. Finding the square root of a number

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

Commutative Rings and Fields

Transcription:

Chapter 5: Hash Tables, Winter 2018/19 1 Fundamental Algorithms Chapter 5: Hash Tables Jan Křetínský Winter 2018/19

Chapter 5: Hash Tables, Winter 2018/19 2 Generalised Search Problem Definition (Search Problem) Input: a sequence or set A of n elements A, and an x A. Output: Index i {1,..., n} with x = A[i], or NIL, if x A. complexity depends on data structure complexity of operations to set up data structure? (insert/delete)

Chapter 5: Hash Tables, Winter 2018/19 2 Generalised Search Problem Definition (Search Problem) Input: a sequence or set A of n elements A, and an x A. Output: Index i {1,..., n} with x = A[i], or NIL, if x A. complexity depends on data structure complexity of operations to set up data structure? (insert/delete) Definition (Generalised Search Problem) Store a set of objects consisting of a key and additional data: Object := ( key : Integer,. record : Data ) ; search/insert/delete objects in this set

Chapter 5: Hash Tables, Winter 2018/19 3 Direct-Address Tables Definition (table as data structure) similar to array: access element via index usually contains elements only for some of the indices

Chapter 5: Hash Tables, Winter 2018/19 3 Direct-Address Tables Definition (table as data structure) similar to array: access element via index usually contains elements only for some of the indices Direct-Address Table: assume: limited number of values for the keys: U = {0, 1,..., m 1} allocate table of size m use keys directly as index

Chapter 5: Hash Tables, Winter 2018/19 4 Direct-Address Tables (2) D i r A d d r I n s e r t ( T : Table, x : Object ) { T [ x. key ] := x ; }

Chapter 5: Hash Tables, Winter 2018/19 4 Direct-Address Tables (2) D i r A d d r I n s e r t ( T : Table, x : Object ) { T [ x. key ] := x ; } DirAddrDelete ( T : Table, x : Object ){ T [ x. key ] := NIL ; }

Chapter 5: Hash Tables, Winter 2018/19 4 Direct-Address Tables (2) D i r A d d r I n s e r t ( T : Table, x : Object ) { T [ x. key ] := x ; } DirAddrDelete ( T : Table, x : Object ){ T [ x. key ] := NIL ; } DirAddrSearch ( T : Table, key : Integer ){ return T [ key ] ; }

Chapter 5: Hash Tables, Winter 2018/19 5 Direct-Address Tables (3) Advantage: very fast: search/delete/insert is Θ(1)

Chapter 5: Hash Tables, Winter 2018/19 5 Direct-Address Tables (3) Advantage: very fast: search/delete/insert is Θ(1) Disadvantages: m has to be small, or otherwise, the table has to be very large! if only few elements are stored, lots of table elements are unused (waste of memory) all keys need to be distinct (they should be, anyway)

Chapter 5: Hash Tables, Winter 2018/19 6 Hash Tables Idea: compute index from key Wanted: function h that maps a given key to an index, has a relatively small range of values, and can be computed efficiently,

Chapter 5: Hash Tables, Winter 2018/19 6 Hash Tables Idea: compute index from key Wanted: function h that maps a given key to an index, has a relatively small range of values, and can be computed efficiently, Definition (hash function, hash table) Such a function h is called a hash function. The respective table is called a hash table.

Chapter 5: Hash Tables, Winter 2018/19 7 Hash Tables Insert, Delete, Search HashInsert ( T : Table, x : Object ) { T [ h ( x. key ) ] := x ; }

Chapter 5: Hash Tables, Winter 2018/19 7 Hash Tables Insert, Delete, Search HashInsert ( T : Table, x : Object ) { T [ h ( x. key ) ] := x ; } HashDelete ( T : Table, x : Object ) { T [ h ( x. key ) ] : = NIL ; }

Chapter 5: Hash Tables, Winter 2018/19 7 Hash Tables Insert, Delete, Search HashInsert ( T : Table, x : Object ) { T [ h ( x. key ) ] := x ; } HashDelete ( T : Table, x : Object ) { T [ h ( x. key ) ] : = NIL ; } HashSearch ( T : Table, x : Object ) { return T [ h ( x. key ) ] ; }

Chapter 5: Hash Tables, Winter 2018/19 8 So Far: Naive Hashing Advantages: still very fast: search/delete/insert is Θ(1), if h is Θ(1) size of the table can be chosen freely, provided there is an appropriate hash function h

Chapter 5: Hash Tables, Winter 2018/19 8 So Far: Naive Hashing Advantages: still very fast: search/delete/insert is Θ(1), if h is Θ(1) size of the table can be chosen freely, provided there is an appropriate hash function h Disadvantages: values of h have to be distinct for all keys however: impossible to find a hash function that produces distinct values for any set of stored data

Chapter 5: Hash Tables, Winter 2018/19 8 So Far: Naive Hashing Advantages: still very fast: search/delete/insert is Θ(1), if h is Θ(1) size of the table can be chosen freely, provided there is an appropriate hash function h Disadvantages: values of h have to be distinct for all keys however: impossible to find a hash function that produces distinct values for any set of stored data ToDo: deal with collisions: objects with different keys that share a common hash value have to be stored in the same table element

Chapter 5: Hash Tables, Winter 2018/19 9 Resolve Collisions by Chaining Idea: use a table of containers containers can hold an arbitrarily large amount of data using (linked) lists as containers: chaining

Chapter 5: Hash Tables, Winter 2018/19 9 Resolve Collisions by Chaining Idea: use a table of containers containers can hold an arbitrarily large amount of data using (linked) lists as containers: chaining ChainHashInsert ( T : Table, x : Object ) { i n s e r t x i n t o T [ h ( x. key ) ] ; }

Chapter 5: Hash Tables, Winter 2018/19 9 Resolve Collisions by Chaining Idea: use a table of containers containers can hold an arbitrarily large amount of data using (linked) lists as containers: chaining ChainHashInsert ( T : Table, x : Object ) { i n s e r t x i n t o T [ h ( x. key ) ] ; } ChainHashDelete ( T : Table, x : Object ) { d e lete x from T [ h ( x. key ) ] ; }

Chapter 5: Hash Tables, Winter 2018/19 10 Resolve Collisions by Chaining ChainHashSearch ( T : Table, x : Object ) { return ListSearch ( x, T [ h ( x. key ) ] ) ;! r e s u l t : reference to x or NIL, i f x not found ; }

Chapter 5: Hash Tables, Winter 2018/19 10 Resolve Collisions by Chaining ChainHashSearch ( T : Table, x : Object ) { return ListSearch ( x, T [ h ( x. key ) ] ) ;! r e s u l t : reference to x or NIL, i f x not found ; } Advantages: hash function no longer has to return distinct values still very fast, if the lists are short

Chapter 5: Hash Tables, Winter 2018/19 10 Resolve Collisions by Chaining ChainHashSearch ( T : Table, x : Object ) { return ListSearch ( x, T [ h ( x. key ) ] ) ;! r e s u l t : reference to x or NIL, i f x not found ; } Advantages: hash function no longer has to return distinct values still very fast, if the lists are short Disadvantages: delete/search is Θ(k), if k elements are in the accessed list worst case: all elements stored in one single list (very unlikely).

Chapter 5: Hash Tables, Winter 2018/19 11 Chaining Average Search Complexity Assumptions: hash table has m slots (table of m lists) contains n elements load factor: α = n m h(k) can be computed in O(1) for all k all values of h are equally likely to occur

Chapter 5: Hash Tables, Winter 2018/19 11 Chaining Average Search Complexity Assumptions: hash table has m slots (table of m lists) contains n elements load factor: α = n m h(k) can be computed in O(1) for all k all values of h are equally likely to occur Search complexity: on average, the list corresponding to the requested key will have α elements unsuccessful search: compare the requested key with all objects in the list, i.e. O(α) operations successful search: requested key last in the list; also O(α) operations

Chapter 5: Hash Tables, Winter 2018/19 11 Chaining Average Search Complexity Assumptions: hash table has m slots (table of m lists) contains n elements load factor: α = n m h(k) can be computed in O(1) for all k all values of h are equally likely to occur Search complexity: on average, the list corresponding to the requested key will have α elements unsuccessful search: compare the requested key with all objects in the list, i.e. O(α) operations successful search: requested key last in the list; also O(α) operations Expected: Average complexity: O(α) operations

Chapter 5: Hash Tables, Winter 2018/19 12 Hash Functions A good hash function should: satisfy the assumption of even distribution: each key is equally likely to be hashed to any of the slots: k : h(k)=j be easy to compute (P(key = k)) = 1 m for all j = 0,..., m 1 be non-smooth : keys that are close together should not produce hash values that are close together (to avoid clustering)

Chapter 5: Hash Tables, Winter 2018/19 12 Hash Functions A good hash function should: satisfy the assumption of even distribution: each key is equally likely to be hashed to any of the slots: k : h(k)=j be easy to compute (P(key = k)) = 1 m for all j = 0,..., m 1 be non-smooth : keys that are close together should not produce hash values that are close together (to avoid clustering) Simplest choice: h = k mod m (m a prime number) easy to compute; even distribution if keys evenly distributed however: not non-smooth

Chapter 5: Hash Tables, Winter 2018/19 13 The Multiplication Method for Integer Keys Two-step method 1. multiply k by constant 0 < γ < 1, and extract fractional part of kγ 2. multiply by m, and use integer part as hash value: h(k) := m(γk mod 1) = m(γk γk )

Chapter 5: Hash Tables, Winter 2018/19 13 The Multiplication Method for Integer Keys Two-step method 1. multiply k by constant 0 < γ < 1, and extract fractional part of kγ 2. multiply by m, and use integer part as hash value: h(k) := m(γk mod 1) = m(γk γk ) Remarks: value of m uncritical; e.g. m = 2 p value of γ needs to be chosen well in practice: use fix-point arithmetics non-integer keys: use encoding to integers (ASCII, byte encoding,... )

Chapter 5: Hash Tables, Winter 2018/19 14 Open Addressing Definition no containers: table contains objects each slot of the hash table either contains an object or NIL to resolve collisions, more than one position is allowed for a specific key

Chapter 5: Hash Tables, Winter 2018/19 14 Open Addressing Definition no containers: table contains objects each slot of the hash table either contains an object or NIL to resolve collisions, more than one position is allowed for a specific key Hash function: generates sequence of hash table indices: General approach: h : U {0,..., m 1} {0,..., m 1} store object in the first empty slot specified by the probe sequence empty slot in the hash table guaranteed, if the probe sequence h(k, 0), h(k, 1),..., h(k, m 1) is a permutation of 0, 1,..., m 1

Chapter 5: Hash Tables, Winter 2018/19 15 Open Addressing Algorithms OpenHashInsert ( T : Table, x : Object ) : Integer { f o r i from 0 to m 1 do { j := h ( x. key, i ) ; i f T [ j ]= NIL then { T [ j ] := x ; return j ; } } cast e r r o r hash t a b l e overflow }

Chapter 5: Hash Tables, Winter 2018/19 15 Open Addressing Algorithms OpenHashInsert ( T : Table, x : Object ) : Integer { f o r i from 0 to m 1 do { j := h ( x. key, i ) ; i f T [ j ]= NIL then { T [ j ] := x ; return j ; } } cast e r r o r hash t a b l e overflow } OpenHashSearch ( T : Table, k : Integer ) : Object { i := 0; while T [ h ( k, i ) ] <> NIL and i < m { i f k = T [ h ( k, i ) ]. key then return T [ h ( k, i ) ] ; i := i +1; } return NIL ; }

Chapter 5: Hash Tables, Winter 2018/19 16 Open Addressing Linear Probing Hash function: h(k, i) := (h 0 (k) + i) mod m first slot to be checked is T[h 0 (k)] second probe slot is T[h 0 (k) + 1], then T[h 0 (k) + 2], etc. wrap around to T[0] after T[m 1] has been checked

Chapter 5: Hash Tables, Winter 2018/19 16 Open Addressing Linear Probing Hash function: h(k, i) := (h 0 (k) + i) mod m first slot to be checked is T[h 0 (k)] second probe slot is T[h 0 (k) + 1], then T[h 0 (k) + 2], etc. wrap around to T[0] after T[m 1] has been checked Main problem: clustering continuous sequences of occupied slots ( clusters ) cause lots of checks during searching and inserting clusters tend to grow, because all objects that are hashed to a slot inside the cluster will increase it slight (but minor) improvement: h(k, i) := (h 0 (k) + ci) mod m

Chapter 5: Hash Tables, Winter 2018/19 16 Open Addressing Linear Probing Hash function: h(k, i) := (h 0 (k) + i) mod m first slot to be checked is T[h 0 (k)] second probe slot is T[h 0 (k) + 1], then T[h 0 (k) + 2], etc. wrap around to T[0] after T[m 1] has been checked Main problem: clustering continuous sequences of occupied slots ( clusters ) cause lots of checks during searching and inserting clusters tend to grow, because all objects that are hashed to a slot inside the cluster will increase it slight (but minor) improvement: h(k, i) := (h 0 (k) + ci) mod m Main advantage: simple and fast easy to implement cache efficient!

Chapter 5: Hash Tables, Winter 2018/19 17 Open Addressing Quadratic Probing Hash function: h(k, i) := (h 0 (k) + c 1 i + c 2 i 2 ) mod m how to chose constants c 1 and c 2? objects with identical h 0 (k) still have the same sequence of hash values ( secondary clustering )

Chapter 5: Hash Tables, Winter 2018/19 17 Open Addressing Quadratic Probing Hash function: h(k, i) := (h 0 (k) + c 1 i + c 2 i 2 ) mod m how to chose constants c 1 and c 2? objects with identical h 0 (k) still have the same sequence of hash values ( secondary clustering ) Idea: double hashing h(k, i) := (h 0 (k) + i h 1 (k)) mod m if h 0 is identical for two keys, h 1 will generate different probe sequences

Chapter 5: Hash Tables, Winter 2018/19 18 Open Addressing Double Hashing h(k, i) := (h 0 (k) + i h 1 (k)) mod m How to choose h 0 and h 1 :

Chapter 5: Hash Tables, Winter 2018/19 18 Open Addressing Double Hashing h(k, i) := (h 0 (k) + i h 1 (k)) mod m How to choose h 0 and h 1 : range of h 0 : U {0,..., m 1} (cover entire table) h 1 (k) must never be 0 (no probe sequence generated) h 1 (k) should be prime to m for all k probe sequence will try all slots if d is the greatest common divisor of h 1 (k) and m, only 1 d hash slots will be probed of the

Chapter 5: Hash Tables, Winter 2018/19 18 Open Addressing Double Hashing h(k, i) := (h 0 (k) + i h 1 (k)) mod m How to choose h 0 and h 1 : range of h 0 : U {0,..., m 1} (cover entire table) h 1 (k) must never be 0 (no probe sequence generated) h 1 (k) should be prime to m for all k probe sequence will try all slots if d is the greatest common divisor of h 1 (k) and m, only 1 d hash slots will be probed of the Possible choices: m = 2 M and let h 1 generate odd numbers, only m a prime number, and h 1 : U {1,..., m 1 } with m 1 < m

Chapter 5: Hash Tables, Winter 2018/19 19 Open Addressing Deletion Problem remaining: how to delete?

Chapter 5: Hash Tables, Winter 2018/19 19 Open Addressing Deletion Problem remaining: how to delete? search entry, remove it does not work: insert 3, 7, 8 having same hash-value, then delete 7 how to find 8?

Chapter 5: Hash Tables, Winter 2018/19 19 Open Addressing Deletion Problem remaining: how to delete? search entry, remove it does not work: insert 3, 7, 8 having same hash-value, then delete 7 how to find 8? do not delete, just mark as deleted

Chapter 5: Hash Tables, Winter 2018/19 19 Open Addressing Deletion Problem remaining: how to delete? search entry, remove it does not work: insert 3, 7, 8 having same hash-value, then delete 7 how to find 8? do not delete, just mark as deleted Next problem: searching stops if first empty entry found after many deletions: lots of unnecessary comparisons!

Chapter 5: Hash Tables, Winter 2018/19 20 Open Addressing Deletion (2) Deletion general problem for open hashing only solution : new construction of table after some deletions hash tables therefore commonly don t support deletion

Chapter 5: Hash Tables, Winter 2018/19 20 Open Addressing Deletion (2) Deletion general problem for open hashing only solution : new construction of table after some deletions hash tables therefore commonly don t support deletion Inserting inserting efficient, but too many inserts not enough space if ratio α too big, new construction of table with larger size

Chapter 5: Hash Tables, Winter 2018/19 20 Open Addressing Deletion (2) Deletion general problem for open hashing only solution : new construction of table after some deletions hash tables therefore commonly don t support deletion Inserting inserting efficient, but too many inserts not enough space if ratio α too big, new construction of table with larger size Still... searching faster than O(log n) possible