Hashing Data Structures. Ananda Gunawardena

Similar documents
Motivation. Dictionaries. Direct Addressing. CSE 680 Prof. Roger Crawfis

Advanced Implementations of Tables: Balanced Search Trees and Hashing

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Introduction to Hash Tables

Hashing. Why Hashing? Applications of Hashing

Lecture: Analysis of Algorithms (CS )

Searching. Constant time access. Hash function. Use an array? Better hash function? Hash function 4/18/2013. Chapter 9

Algorithms for Data Science

Introduction to Hashtables

Fundamental Algorithms

Symbol-table problem. Hashing. Direct-access table. Hash functions. CS Spring Symbol table T holding n records: record.

Hashing, Hash Functions. Lecture 7

1 Maintaining a Dictionary

Searching, mainly via Hash tables

Hashing. Data organization in main memory or disk

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Array-based Hashtables

Round 5: Hashing. Tommi Junttila. Aalto University School of Science Department of Computer Science

Data Structures and Algorithm. Xiaoqing Zheng

Collision. Kuan-Yu Chen ( 陳冠宇 ) TR-212, NTUST

Hashing. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing. Philip Bille

Hash tables. Hash tables

Hashing. Hashing. Dictionaries. Dictionaries. Dictionaries Chained Hashing Universal Hashing Static Dictionaries and Perfect Hashing

Hash tables. Hash tables

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

INTRODUCTION TO HASHING Dr. Thomas Hicks Trinity University. Data Set - SSN's from UTSA Class

Hash tables. Hash tables

Hashing. Dictionaries Hashing with chaining Hash functions Linear Probing

6.1 Occupancy Problem

Algorithms lecture notes 1. Hashing, and Universal Hash functions

A Lecture on Hashing. Aram-Alexandre Pooladian, Alexander Iannantuono March 22, Hashing. Direct Addressing. Operations - Simple

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CSCB63 Winter Week10 - Lecture 2 - Hashing. Anna Bretscher. March 21, / 30

? 11.5 Perfect hashing. Exercises

Hash Tables. Direct-Address Tables Hash Functions Universal Hashing Chaining Open Addressing. CS 5633 Analysis of Algorithms Chapter 11: Slide 1

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

1 Hashing. 1.1 Perfect Hashing

compare to comparison and pointer based sorting, binary trees

So far we have implemented the search for a key by carefully choosing split-elements.

12 Hash Tables Introduction Chaining. Lecture 12: Hash Tables [Fa 10]

CS483 Design and Analysis of Algorithms

Analysis of Algorithms I: Perfect Hashing

Hashing. Martin Babka. January 12, 2011

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

1 Difference between grad and undergrad algorithms

Lecture 8 HASHING!!!!!

Application: Bucket Sort

CS 591, Lecture 6 Data Analytics: Theory and Applications Boston University

CS 580: Algorithm Design and Analysis

Module 1: Analyzing the Efficiency of Algorithms

cse 311: foundations of computing Fall 2015 Lecture 12: Primes, GCD, applications

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

Lecture 4 Thursday Sep 11, 2014

N/4 + N/2 + N = 2N 2.

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

Basics of hashing: k-independence and applications

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)


Hashing. Hashing DESIGN & ANALYSIS OF ALGORITHM

Abstract Data Type (ADT) maintains a set of items, each with a key, subject to

Cuckoo Hashing and Cuckoo Filters

Finding Succinct. Ordered Minimal Perfect. Hash Functions. Steven S. Seiden 3 Daniel S. Hirschberg 3. September 22, Abstract

COMP251: Hashing. Jérôme Waldispühl School of Computer Science McGill University. Based on (Cormen et al., 2002)

MA/CSSE 473 Day 05. Factors and Primes Recursive division algorithm

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

E23: Hotel Management System Wen Yunlu Hu Xing Chen Ke Tang Haoyuan Module: EEE 101

Electrical & Computer Engineering University of Waterloo Canada February 6, 2007

String Matching. Thanks to Piotr Indyk. String Matching. Simple Algorithm. for s 0 to n-m. Match 0. for j 1 to m if T[s+j] P[j] then

Randomized Load Balancing:The Power of 2 Choices

13. RANDOMIZED ALGORITHMS

13. RANDOMIZED ALGORITHMS

Cryptographic Hash Functions

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Overview. More counting Permutations Permutations of repeated distinct objects Combinations

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

Finding similar items

Problem. Maintain the above set so as to support certain

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

past balancing schemes require maintenance of balance info at all times, are aggresive use work of searches to pay for work of rebalancing

CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University

CSE 548: (Design and) Analysis of Algorithms

Quiz 1 Solutions. (a) f 1 (n) = 8 n, f 2 (n) = , f 3 (n) = ( 3) lg n. f 2 (n), f 1 (n), f 3 (n) Solution: (b)

Introduction. An Introduction to Algorithms and Data Structures

CSE 502 Class 11 Part 2

Lecture 3 Sept. 4, 2014

Data Structures in Java

Announcements. CSE332: Data Abstractions Lecture 2: Math Review; Algorithm Analysis. Today. Mathematical induction. Dan Grossman Spring 2010

Advanced Data Structures

Bloom Filters, general theory and variants

cse 311: foundations of computing Spring 2015 Lecture 12: Primes, GCD, applications

Problem Set 4 Solutions

Chapter 13. Randomized Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

CS 580: Algorithm Design and Analysis

Lecture 4: Stacks and Queues

B669 Sublinear Algorithms for Big Data

PRIMES Math Problem Set

Solution suggestions for examination of Logic, Algorithms and Data Structures,

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.


Hashing 15-121 Data Structures Ananda Gunawardena

Hashing

Why do we need hashing? Many applications deal with large amounts of data. Search engines and web pages perform myriad look-ups, and those look-ups are time critical. Typical data structures such as arrays and lists may not be sufficient to handle look-ups efficiently. In general, hashing is used when look-ups need to occur in near constant time, O(1).

Why do we need hashing? Consider the internet (2002 data): according to the Internet Software Consortium survey at http://www.isc.org/, in 2001 there were 125,888,197 internet hosts, and the number was growing by 20% every six months! Using the best possible binary search, it takes on average about 27 iterations to find an entry. According to a survey by NUA at http://www.nua.ie/, there were 513.41 million users worldwide.

Why do we need hashing? We need something that can do better than binary search, O(log N); we want O(1). Solution: hashing. In fact, hashing is used in web searches, spell checkers, databases, compilers, password tables, and many other applications.

Building an index using HashMaps. [Figure: an inverted index in which each WORD (jezebel, jezer, jezerit, jeziah, jeziel, jezliah, jezoar, jezrahliah, jezreel, ...) is mapped to NDOCS and a PTR to its postings list of (DOCID, OCCUR, POS 1, POS 2, ...) entries.] More on this in Graphs.

The concept. Suppose we need a better way to maintain a table (for example, a dictionary) that supports insert and search in O(1).

Big Idea in Hashing. Let S = {a_1, a_2, ..., a_m} be a set of objects that we need to map into a table of size N. Find a function H such that H: S -> [1..N]. Ideally we'd like to have a 1-1 map, but it is not easy to find one, and the function must also be easy to compute. It is a good idea to pick a prime as the table size to get a better distribution of values. Assume each a_i is a 16-bit integer. Of course there is the trivial map H(a_i) = a_i, but this may not be practical. Why? (A direct-address table would need 2^16 slots regardless of how few keys are actually stored.)

Finding a Hash Function. Assume that N = 5 and the values we need to insert are cab, bea, bad, etc. Let a = 0, b = 1, c = 2, and so on. Define H such that H[data] = (sum of the character values) mod N. Then H[cab] = (2+0+1) mod 5 = 3, H[bea] = (1+4+0) mod 5 = 0, H[bad] = (1+0+3) mod 5 = 4.
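A minimal Java sketch of this additive hash (the method name sumHash and the a=0..z letter values are illustrative assumptions, not from the slides):

    public class AdditiveHash {
        // Maps 'a'..'z' to 0..25, sums the values, and reduces mod the table size.
        static int sumHash(String data, int n) {
            int sum = 0;
            for (int i = 0; i < data.length(); i++)
                sum += data.charAt(i) - 'a';
            return sum % n;
        }

        public static void main(String[] args) {
            System.out.println(sumHash("cab", 5)); // 3
            System.out.println(sumHash("bea", 5)); // 0
            System.out.println(sumHash("bad", 5)); // 4
        }
    }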

Collisions. What if the values we need to insert are abc, cba, bca, etc.? They all map to the same location under our map H (obviously H is not a good hash function). This is called a collision. When collisions occur, we need to handle them. Collisions can be reduced by choosing a good hash function.

Choosing a Hash Function. A good hash function must be easy to compute and must avoid collisions. How do we find a good hash function? A bad hash function: let S be a string and H(S) = Σ S_i, where S_i is the i-th character of S. Why is this bad?

Choosing a Hash Function? Question: think of hashing 10,000 five-letter words into a table of size 10,000 using the map H defined by H(a_0 a_1 a_2 a_3 a_4) = Σ a_i (i = 0, 1, ..., 4). If we use H, what would the key distribution look like? (With letter values 0..25, the sum of five letters is at most 125, so all 10,000 words pile into little more than the first hundred slots.)
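A quick experiment (a sketch, not from the slides; the random five-letter words, the fixed seed, and the 0..25 letter values are assumptions) that shows how few buckets this hash ever touches in a table of size 10,000:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    public class BadHashDistribution {
        public static void main(String[] args) {
            Random rng = new Random(42);
            Set<Integer> bucketsUsed = new HashSet<>();
            for (int w = 0; w < 10000; w++) {
                int sum = 0;
                for (int i = 0; i < 5; i++)
                    sum += rng.nextInt(26);   // a random letter valued 0..25
                bucketsUsed.add(sum % 10000); // the "sum of the letters" hash
            }
            // The sum can only range over 0..125, so at most 126 buckets are ever used.
            System.out.println("Buckets used: " + bucketsUsed.size());
        }
    }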

Choosing a Hash Function. Suppose we need to hash a set of strings S = {S_i} to a table of size N. Let H(S_i) = (Σ_j S_i[j] · d^j) mod N, where S_i[j] is the j-th character of string S_i and d is a radix (for example, 128). How expensive is it to compute this function? What is the cost of direct calculation? Is direct calculation always possible? Is there a cheaper way to calculate it? Hint: use Horner's Rule.

Code

    public static int hash(String key, int n) {
        int value = 0;
        for (int i = 0; i < key.length(); i++)
            value = (value * 128 + key.charAt(i)) % n;
        return value;
    }

What does this function return if "guna" is hashed into a table of size 101? What is the complexity of this code in terms of the string length? What are some of the problems with this function?
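A small, self-contained driver for the function above (a sketch; the class name is mine). Reducing mod n inside the loop keeps value below n, so value * 128 + key.charAt(i) cannot overflow an int for any reasonable table size:

    public class HornerHashDemo {
        // Horner's-rule hash over radix 128, reducing mod n at every step.
        public static int hash(String key, int n) {
            int value = 0;
            for (int i = 0; i < key.length(); i++)
                value = (value * 128 + key.charAt(i)) % n;
            return value;
        }

        public static void main(String[] args) {
            // With standard ASCII character codes this prints 62.
            System.out.println(hash("guna", 101));
        }
    }

The loop runs once per character, so the cost is linear in the length of the string.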

Collisions. Hash functions can be many-to-one: they can map different search keys to the same hash key, e.g. hash1('a') == 9 == hash1('w'). We must therefore compare the search key with the record found; if the match fails, there is a collision.

Collision resolving strategies: separate chaining, and open addressing (linear probing, quadratic probing, double hashing, etc.).

Separate Chaining. Collisions can be resolved by keeping, for each table slot, a list of all the keys that hash to that slot.

Separate Chaining. Use an array of linked lists: LinkedList[] table = new LinkedList[N], where N is the table size. Define the load factor of the table as λ = (number of keys) / (size of the table); with chaining, λ can be more than 1. We still need a good hash function to distribute keys evenly for searches and updates.
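A minimal separate-chaining sketch in Java (the class name, String keys, and generics are illustrative assumptions on top of the slide's LinkedList[] idea):

    import java.util.LinkedList;

    public class ChainedHashTable {
        private final LinkedList<String>[] table;

        @SuppressWarnings("unchecked")
        public ChainedHashTable(int n) {
            table = new LinkedList[n];
            for (int i = 0; i < n; i++)
                table[i] = new LinkedList<>();
        }

        // Non-negative bucket index derived from the built-in hashCode.
        private int hash(String key) {
            return (key.hashCode() & 0x7fffffff) % table.length;
        }

        public void insert(String key) {
            LinkedList<String> bucket = table[hash(key)];
            if (!bucket.contains(key)) // store each key at most once
                bucket.add(key);
        }

        public boolean contains(String key) {
            return table[hash(key)].contains(key);
        }
    }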

Linear Probing. The idea: the table remains a simple array of size N. On insert(x), compute f(x) mod N; if that cell is full, find another by sequentially searching for the next available slot: go to f(x)+1, f(x)+2, etc. On find(x), compute f(x) mod N; if the cell doesn't match, look elsewhere along the same sequence. The linear probing function can be given by h(x, i) = (f(x) + i) mod N (i = 1, 2, ...).
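A sketch of linear-probing insert and find for integer keys (the identity hash f(x) = x and the Integer[] table with null meaning "empty" are my simplifying assumptions):

    public class LinearProbingTable {
        private final Integer[] table; // null means an empty cell

        public LinearProbingTable(int n) {
            table = new Integer[n];
        }

        public void insert(int x) {
            int home = x % table.length;              // f(x) mod N
            for (int i = 0; i < table.length; i++) {
                int slot = (home + i) % table.length; // probe f(x) + i
                if (table[slot] == null || table[slot] == x) {
                    table[slot] = x;
                    return;
                }
            }
            throw new IllegalStateException("table is full");
        }

        public boolean find(int x) {
            int home = x % table.length;
            for (int i = 0; i < table.length; i++) {
                int slot = (home + i) % table.length;
                if (table[slot] == null) return false; // empty cell: x cannot be further on
                if (table[slot] == x) return true;
            }
            return false;
        }
    }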

[Figure 20.4: A linear probing hash table after each insertion. From Data Structures & Problem Solving Using Java, 2nd ed., Mark Allen Weiss, 2002, Addison Wesley.]

Linear Probing Example. Consider H(key) = key mod 6 (assume N = 6). H(11) = 5, H(10) = 4, H(17) = 5, H(16) = 4, H(23) = 5. Draw the hash table after each insertion. (Worked through in order: 11 goes to slot 5 and 10 to slot 4; 17 collides at 5 and wraps around to slot 0; 16 collides at 4, probes 5 and 0, and lands in slot 1; 23 collides at 5, probes 0 and 1, and lands in slot 2. Final table: slot 0 = 17, 1 = 16, 2 = 23, 3 = empty, 4 = 10, 5 = 11.)

Clustering Problem. Clustering is a significant problem in linear probing. Why? [Figure: illustration of primary clustering in linear probing (b) versus no clustering (a) and the less significant secondary clustering in quadratic probing (c); long lines represent occupied cells, and the load factor is 0.7. From Data Structures & Problem Solving Using Java, 2nd ed., Mark Allen Weiss, 2002, Addison Wesley.]

Linear Probing. How about deleting items from the hash table? An item in a hash table is connected to others in the table by the probe sequences that pass over it (much as in a BST, where removing a node affects finding its descendants), so physically deleting items will break finding the others. Lazy delete: just mark the item as inactive rather than removing it.

Lazy Delete. Naive removal can leave gaps! [Figure: snapshots of a linear-probing table after Insert f, after Remove e, and during Find f; an entry written "3 f" means search key f with hash key 3. Because f was placed past e during probing, physically removing e leaves an empty cell, and the later search for f stops at that gap and misses f.]

Lazy Delete. Clever removal. [Figure: the same snapshots, but Remove e only marks its cell as "gone"; the later Find f probes past the "gone" cell and still reaches f. An entry written "3 f" means search key f with hash key 3.]
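A sketch of lazy deletion layered on linear probing (the parallel keys/gone arrays, String keys, and method names are my assumptions for illustration):

    public class LazyDeleteTable {
        private final String[] keys;  // null = cell never used
        private final boolean[] gone; // true = lazily deleted (inactive)

        public LazyDeleteTable(int n) {
            keys = new String[n];
            gone = new boolean[n];
        }

        private int hash(String key) {
            return (key.hashCode() & 0x7fffffff) % keys.length;
        }

        public void insert(String key) {
            int home = hash(key);
            for (int i = 0; i < keys.length; i++) {
                int slot = (home + i) % keys.length;
                if (keys[slot] == null || gone[slot]) { // empty or inactive: reuse
                    keys[slot] = key;
                    gone[slot] = false;
                    return;
                }
                if (keys[slot].equals(key)) return;     // already present and active
            }
            throw new IllegalStateException("table is full");
        }

        public boolean find(String key) {
            int home = hash(key);
            for (int i = 0; i < keys.length; i++) {
                int slot = (home + i) % keys.length;
                if (keys[slot] == null) return false;   // a real gap: stop searching
                // An inactive cell is probed past, so keys stored beyond it stay findable.
                if (!gone[slot] && keys[slot].equals(key)) return true;
            }
            return false;
        }

        public void remove(String key) {
            int home = hash(key);
            for (int i = 0; i < keys.length; i++) {
                int slot = (home + i) % keys.length;
                if (keys[slot] == null) return;         // not present
                if (!gone[slot] && keys[slot].equals(key)) {
                    gone[slot] = true;                  // lazy delete: mark, don't erase
                    return;
                }
            }
        }
    }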

Load Factor (open addressing). Definition: the load factor λ of a probing hash table is the fraction of the table that is full. The load factor ranges from 0 (empty) to 1 (completely full). It is better to keep the load factor under 0.7; double the table size and rehash if the load factor gets high. The cost of the hash function f(x) must be minimized. When collisions occur, linear probing can always find an empty cell (as long as the table is not full), but clustering can be a problem.
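A sketch of the grow-and-rehash step (the 0.7 threshold comes from the slide; the class name, the plain size-doubling policy, and integer keys are my assumptions; in practice the new size is usually rounded up to a prime):

    public class RehashingTable {
        private Integer[] table = new Integer[7];
        private int size = 0;

        // Linear probing: first empty slot (or the slot already holding x).
        private int slotFor(int x, Integer[] t) {
            int home = (x % t.length + t.length) % t.length;
            for (int i = 0; i < t.length; i++) {
                int slot = (home + i) % t.length;
                if (t[slot] == null || t[slot] == x) return slot;
            }
            return -1; // full; cannot happen because we rehash below the 0.7 threshold
        }

        public void insert(int x) {
            if ((size + 1) > 0.7 * table.length) rehash();
            int slot = slotFor(x, table);
            if (table[slot] == null) size++;
            table[slot] = x;
        }

        // Double the table size and reinsert every key with the new modulus.
        private void rehash() {
            Integer[] old = table;
            table = new Integer[old.length * 2];
            size = 0;
            for (Integer x : old)
                if (x != null) insert(x);
        }
    }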

Quadratic Probing

Quadratic probing. Another open addressing method: resolve collisions by examining cells 1, 4, 9, ... away from the original probe point. Collision policy: define h_0(k), h_1(k), h_2(k), h_3(k), ..., where h_i(k) = (hash(k) + i^2) mod size. Caveat: quadratic probing may not find a vacant cell! The table must be less than half full (λ < 1/2). (Linear probing, in contrast, always finds a cell.)

Quadratic probing: another issue. Suppose the table size is 16. The probe offsets that will be tried are: 1 mod 16 = 1, 4 mod 16 = 4, 9 mod 16 = 9, 16 mod 16 = 0, 25 mod 16 = 9, 36 mod 16 = 4, 49 mod 16 = 1, 64 mod 16 = 0, 81 mod 16 = 1, ... only four different values!
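A quick check (a sketch; the comparison against a prime table size such as 17 is my addition) that counts how many distinct quadratic-probe offsets i^2 mod size actually occur:

    import java.util.TreeSet;

    public class QuadraticOffsets {
        static void report(int size) {
            TreeSet<Integer> offsets = new TreeSet<>();
            for (int i = 0; i < size; i++)
                offsets.add((i * i) % size); // offsets repeat with period size
            System.out.println("size " + size + ": " + offsets.size()
                    + " distinct offsets " + offsets);
        }

        public static void main(String[] args) {
            report(16); // only 4 distinct offsets: {0, 1, 4, 9}
            report(17); // a prime size gives (17 + 1) / 2 = 9 distinct offsets
        }
    }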

[Figure 20.6: A quadratic probing hash table after each insertion; note that the table size was poorly chosen because it is not a prime number. From Data Structures & Problem Solving Using Java, 2nd ed., Mark Allen Weiss, 2002, Addison Wesley.]