Introduction to Hashtables

Similar documents
Collision. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST

Hash Tables. Direct-Address Tables Hash Functions Universal Hashing Chaining Open Addressing. CS 5633 Analysis of Algorithms Chapter 11: Slide 1

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Hash tables. Hash tables

Hash tables. Hash tables

Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Searching. Constant time access. Hash function. Use an array? Better hash function? Hash function 4/18/2013. Chapter 9

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

Hashing. Data organization in main memory or disk

Hash tables. Hash tables

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Data Structures and Algorithm. Xiaoqing Zheng

CS 591, Lecture 6 Data Analytics: Theory and Applications Boston University

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record.

Lecture: Analysis of Algorithms (CS )

Hashing. Dictionaries Hashing with chaining Hash functions Linear Probing

Hashing. Hashing DESIGN & ANALYSIS OF ALGORITHM

Array-based Hashtables

Hashing, Hash Functions. Lecture 7

Hashing. Martin Babka. January 12, 2011

So far we have implemented the search for a key by carefully choosing split-elements.

Fundamental Algorithms

A Lecture on Hashing. Aram-Alexandre Pooladian, Alexander Iannantuono March 22, Hashing. Direct Addressing. Operations - Simple

Round 5: Hashing. Tommi Junttila. Aalto University School of Science Department of Computer Science

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

Hashing Data Structures. Ananda Gunawardena

CSE 502 Class 11 Part 2

? 11.5 Perfect hashing. Exercises

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

Hashing. Why Hashing? Applications of Hashing

Introduction to Hash Tables

1 Hashing. 1.1 Perfect Hashing

6.1 Occupancy Problem

1 Maintaining a Dictionary

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Searching, mainly via Hash tables

Algorithms for Data Science

Data Structures and Algorithm. Xiaoqing Zheng

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Cache-Oblivious Hashing

COMP251: Hashing. Jérôme Waldispühl School of Computer Science McGill University. Based on (Cormen et al., 2002)

Cuckoo Hashing and Cuckoo Filters

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

CS483 Design and Analysis of Algorithms

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

CS 580: Algorithm Design and Analysis

CSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13

Advanced Data Structures

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

2. This exam consists of 15 questions. The rst nine questions are multiple choice Q10 requires two

Data Structures and Algorithms Winter Semester

Module 1: Analyzing the Efficiency of Algorithms

4.5 Applications of Congruences

1 ListElement l e = f i r s t ; / / s t a r t i n g p o i n t 2 while ( l e. next!= n u l l ) 3 { l e = l e. next ; / / next step 4 } Removal

The set of integers will be denoted by Z = {, -3, -2, -1, 0, 1, 2, 3, 4, }

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Abstract Data Type (ADT) maintains a set of items, each with a key, subject to

1 2 3 style total. Circle the correct answer; no explanation is required. Each problem in this section counts 5 points.

Explicit and Efficient Hash Functions Suffice for Cuckoo Hashing with a Stash

Basics of hashing: k-independence and applications

Analysis of Algorithms I: Perfect Hashing

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m)

Application: Bucket Sort

compare to comparison and pointer based sorting, binary trees

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

CSE 548: (Design and) Analysis of Algorithms

Amortized analysis. Amortized analysis

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Exam 1. March 12th, CS525 - Midterm Exam Solutions

2. Exact String Matching

Dictionary: an abstract data type

Premaster Course Algorithms 1 Chapter 3: Elementary Data Structures

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

Lecture Note 2. 1 Bonferroni Principle. 1.1 Idea. 1.2 Want. Material covered today is from Chapter 1 and chapter 4

Solution suggestions for examination of Logic, Algorithms and Data Structures,

Binary Search Trees. Lecture 29 Section Robb T. Koether. Hampden-Sydney College. Fri, Apr 8, 2016

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

CPSC 320 Sample Final Examination December 2013

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Solutions. Prelim 2[Solutions]

s 1 if xπy and f(x) = f(y)

AMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM

B669 Sublinear Algorithms for Big Data

GF(2 m ) arithmetic: summary

The Problem with ASICBOOST

Advanced Algorithm Design: Hashing and Applications to Compact Data Representation

Tutorial 4. Dynamic Set: Amortized Analysis

MA/CSSE 473 Day 05. Factors and Primes Recursive division algorithm

Module 9: Tries and String Matching

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

The Market-Basket Model. Association Rules. Example. Support. Applications --- (1) Applications --- (2)

Introduction to Randomized Algorithms III

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

Rainbow Tables ENEE 457/CMSC 498E

CSE 202 Dynamic Programming II

Transcription:

Introduction to HashTables Boise State University March 5th 2015

Hash Tables: What Problem Do They Solve What Problem Do They Solve? Why not use arrays for everything? 1 Arrays can be very wasteful: Example Social Security Numbers 999-99-9999 1 billion entries only 319 million people in US as of 2014. Array would be 68 % empty. 2 Non-numeric objects: Store strings in the array. LLAPLN19312015 Need to map string to a numeric index. This map is very similar to a hash function.

Hash Tables: The Basics Hash Table A[] Array to hold keys,values. insert(a, key, value); Save key and value in A. O(1) int finda, key); Find the key. O(1) delete(a,key); Delete Key/Value. O(1) int hash(key); Compute Hash Value. O(1)

Hash Function Two Basic Hash Approaches division h(k) = k mod m. Use the remainder of k/m. m comes from the size of what object? multiplication h(k) = m(ka ka ) where 0 < A < 1. This equation will range from zero to m.

Hash Function What makes a good hash function: Uniform Hashing Keys are equally like to go to any hash value. Fully Utilize Can generate a hash value for any table entry. Fast Hash function computes in O(1) Minimizes Collisions For a set of keys, keeps collisions to a minimum.

Hash Tables Collisions Since the hash function is applied to unbounded keys there are going to be keys that generate the same hash value. These are called Collisions. *It is not the size of the hash table that causes the collision but the nature of the hash function.

Hash Tables Solving Collisions: There are two approaches: 1 Separate Chaining: store the keys in a linked list anchored at the hash value. 2 Open Addressing: rehash repeatedly to find an empty space in the hash table.

Separate Chaining Separate Chaining Hash = k mod 7 Keys = [7, 0, 56, 2,51,49,5,54, 12, 26, 9, 19, 16] H=0 49 56 0 7 H=1 H=2 16 9 51 2 H=3 H=4 H=5 H=6 19 26 12 54 5

Separate Chaining Separate Chaining: set method O(1) 1 i n t i n s e r t (A, key, v a l u e ) 2 h i = hash ( key ) ; // g e t the hash v a l u e. 3 l i s t p = A[ h i ] ; // g e t l i s t l i n k e d l i s t 4 l i s t p. i n s e r t ( key, v a l u e ) ; // i n s e r t at f r o n t. 5 r e t u r n 1 ; // a l w a y s room with s e p a r a t e c h a i n i n g. Separate Chaining: get method O(n) 1 i n t f i n d (A, key ) 2 h i = hash ( key ) ; // g e t the hash v a l u e. 3 l i s t p = A[ h i ] ; // g e t l i s t l i n k e d l i s t 4 do { i f ( l l i s t p. key == key ) r e t u r n l i s t p. v a l u e ; l i s t p = l i s t p. n e x t ; } w h i l e ( l i s t ) ; 5 r e t u r n ERROR; // key not p r e s e n t Separate Chaining: delete method O(n) 1 d e l e t e (A, key ) 2 h i = hash ( key ) ; // g e t the hash v a l u e. 3 l i s t p = A[ h i ] ; // g e t l i s t l i n k e d l i s t 4 l i s t p. d e l e t e ( key ) ; // assume b u i l t i n method

Hash Tables Open Addressing If their is a collision we rehash the key to generate a new key. How many times do we rehash until we give up? Hash functions with rehashing linear We use the hash function h(k) = (h (k) + i) mod m, specific case h(k) = (k mod 7 + i) mod m. quadratic We use the hash function h(k) = (h (k) + c 1 i + c 2 i 2 ) mod m h(k) = (k mod 7 + c 1 i + c 2 i 2 ) mod 7, first rehash matches linear when c 1 = c 2 = 1/2. double We use the hash function h(k) = (h 1 (k) + i h 2 (k)) mod m, h(k) = (k mod 7 + i (k mod 2)) mod 7 General Linear and Quadratic are specializations of double hashing, linear h 2 (k) = 1, quadratic i h 2 (k) = c 1 i + c 2 i 2.

Open Addressing Open Addressing: insert method O(m) 1 i n t i n s e r t (A, key, v a l u e ) 2 i = 0 ; h i = key mod A. l e n g t h ; // g e t the hash v a l u e 3 w h i l e ( i < A. l e n g t h && A[ h i ]!= EMPTY) 4 { 5 i = i + 1 ; 6 h i = ( key mod A. l e n g t h + i ) mod A. l e n g t h ) ; 7 } 8 i f ( i >= A. l e n g t h ) r e t u r n FALSE ; // Table i s f u l l. 9 A[ h i ] = v a l u e ; // assumes f u l l y u t i l i z e d. 10 r e t u r n TRUE;

Open Addressing 1 i n t f i n d (A, key ) 2 i = 0 ; h i = hash ( key ) ; // g e t the hash v a l u e. 3 w h i l e ( i < A. l e n g t h && A[ h i ]!= key && A[ h i ]!= EMPTY) { i = i + 1 ; h i = ( key mod A. l e n g t h + i ) mod A. l e n g t h ) ; 4 i f ( i >= A. l e n g t h ) r e t u r n FALSE ; // no key 5 i f (A[ h i ] == EMPTY) r e t u r n FALSE ; // no key 6 r e t u r n A[ h i ] ; // found the key 1 d e l e t e (A, key ) 2 i = 0 ; 3 h i = hash ( key ) ; // g e t the hash v a l u e. 4 w h i l e ( i < A. l e n g t h && A[ h i ]!= EMPTY && A[ h i ]!= key ) { i = i + 1 ; h i = ( key mod A. l e n g t h + i ) mod A. l e n g t h ) ; 5 i f ( i >= A. l e n g t h ) r e t u r n ; // no such key t a b l e f u l l 6 i f (A[ h i ] == EMPTY ) r e t u r n ; // no such key 7 A[ h i ] = EMPTY; // what i s wrong with t h i s?

Hash Tables Key Issues in Hash Table Performance Uniform Hashing For any random key the probability of hashing to any specific location in the table is equal to 1/ A. Or alternatively, the probability for table locations are the same. This is known as uniform hashing. Fully Utilize For Open Addressing, a rehashing scheme is said to fully utilize the table if given all entries in the table are full, for some finite number of steps the rehashing scheme will have inspected every location in the table. Load Factor The Load Factor for a hash table is equal to n/m where n is the number of keys in the table and m is the size of the table.

Hash Tables What makes a good Uniform Hashing function? Let s look at the mod function. For example k mod 7. For any series of numbers, say 0 through 21, it will cycle through the numbers 0 through 6 exactly 3 times. What is the probability of a random key falling at given index i: Given a key k what is the probability that it k mod 7 = i for a fixed i? How many numbers in 0 to 21 mod to i? 3. What is the probability that k is one of those numbers? 3/21 = 1/7. So the probability that k fall at any specific index i is 1/7. So the mod function provides uniform hashing.

Full Utilization Fully Utilize a table: definition: - example problem showing failure. Hash Functions: Linear: h(k) = (k mod 8 + i) mod 8 Quad: h(k) = (k mod 8 + c 1 i + c 2 i) mod 8, c 1 = c 2 = 1/2 Double: h(k) = (k mod 8 + i (k mod 7)) mod 8

Load Factor Load Factor is the ratio of the number of elements to the size of the table: Load Factor = Bounds for Load Factor α = n m Separate Chaining In separate chaining the Load Factor is unbounded because the table can hold more list elements than indices in the table. Open Addressing In Open Addressing the Load Factor is bounded to the range 0 <= α <= 1. Good Load Factors What are good values for load factors for each method?

Load Factor

Load Factor

Expected Probes Expected Probes for an Unsuccessful Search using linear probing. HT 8 9 15 17 18 COUNT 3 2 1 2 1 1 3 2 1 1 Expected Probes = A i=1 p(i) Sum = 3 + 2 + 1 + 2 + 1 + 1 + 3 + 2 + 1 + 1 = 17, A = 10 A Expected Probes = 17/10 = 1.7. Load Factor = 5/10 = 0.5. Challenge Problem: How to calculate probes in a successful search?

Expected Number of Probes Expected Number of Probes for Open Addressing A probe is any inspection of the table. Minimum probe for insert, find, or delete is 1 probe. For Unsuccessful search expected number of probes is at most where α = n/m < 1. 1 (1 α) 1 Inserting an element takes at most on average. (1 α) where α = n/m < 1 The number of probes in a successful search is at most where α = n/m < 1. 1 α ln 1 (1 α)

Cuckoo Hashing A type of Open Addressing. Cuckoo Hashing Uses two (or more) hash functions. If the first hash fails, the second is used. If the second hash fails, the old key/value in the Table is replaced with the new one and the old value is rehashed using alternating rehashing, swapping old values as they occur. This technique leads to very space efficient tables and very rapid table lookups. What fundamental computer science principal is exhibited here?