Basics of hashing: k-independence and applications


Rasmus Pagh

Agenda
- Load balancing using hashing
  - Analysis using bounded independence
- Implementation of small independence
- Case studies:
  - Approximate membership
  - Hashing with linear probing
- Exercise: Space-efficient linear probing

Prerequisites
I assume you are familiar with the notions of:
- a hash table
- modular arithmetic [and perhaps finite fields]
- the expected value of a random variable
You can read about these things in e.g. CLRS or http://www.daimi.au.dk/~bromille/notes/un.pdf

Load balancing by hashing
Goal: Distribute an unknown, possibly dynamic, set S of items approximately evenly among a set of buckets.
Examples: hash tables, SSDs, distributed key-value stores, distributed computation, network routing, parallel algorithms, ...
Main tool: random choice of assignment.

n items into n buckets
Assume for now: items are placed uniformly and independently into buckets. What is the probability that k items end up in one particular bucket? Use a union bound over all size-k subsets:
$\binom{n}{k} \cdot n^{-k} < (n^k/k!) \cdot n^{-k} = 1/k!$
Conclusion: the probability of having some bucket with k items is at most n/k!, so the largest bucket has size O(log n / log log n) with high probability.
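As a quick sanity check (not from the slides), here is a minimal Python simulation of this balls-into-bins experiment; the function name and trial counts are arbitrary choices:

```python
import math
import random

def avg_max_load(n: int, trials: int = 50) -> float:
    """Average maximum bucket load when n items are placed
    uniformly and independently into n buckets."""
    total = 0
    for _ in range(trials):
        counts = [0] * n
        for _ in range(n):
            counts[random.randrange(n)] += 1
        total += max(counts)
    return total / trials

for n in (10**3, 10**4, 10**5):
    print(n, avg_max_load(n), math.log(n) / math.log(math.log(n)))
```

The observed maximum stays within a small constant factor of log n / log log n, as the union bound predicts.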

n items into r buckets
Use a better bound on binomial coefficients: $\binom{n}{k} < (en/k)^k$.
Upper bound on the probability of k items landing in a particular bucket:
$\sum_{K \subseteq S,\, |K|=k} r^{-k} \le \binom{n}{k}\, r^{-k} < (en/k)^k\, r^{-k} = (en/kr)^k$
Conclusion: if $k > 2en/r > 2\log_2 r$, then $(en/kr)^k < 2^{-k} < 1/r^2$, so by a union bound over the r buckets the probability of k items in any single bucket is < 1/r.

k-independence
Observation: the proofs only used probabilities of events involving k items.
Consequence: it suffices that the hash function behaves fully randomly when considering sets of k hash values.
Definition: a random hash function h is k-independent if for all choices of distinct $x_1, \dots, x_k$ the values $h(x_1), \dots, h(x_k)$ are independent [and uniformly distributed].
How do you implement k-independent hashing?

Polynomial hashing
A random polynomial of degree k-1 as hash function (assuming keys x come from a field F), with coefficients $a_0, \dots, a_{k-1}$ chosen uniformly at random:
$p(x) = \sum_{i=0}^{k-1} a_i x^i$
This is k-independent! Why? [Because a degree-(k-1) polynomial can take any combination of values on k distinct points, each combination being equally likely.]
Map to a smaller range in any balanced way.
Divide-and-conquer Horner's rule: $p(x) = x\, p_{\mathrm{odd}}(x^2) + p_{\mathrm{even}}(x^2)$ reduces data dependencies!
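A minimal Python sketch of this construction (names like make_poly_hash are mine; Python's big integers stand in for the register-level tricks on the next slide):

```python
import random

P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1; keys must lie in [0, P)

def make_poly_hash(k: int, r: int):
    """A k-independent hash function from [0, P) to [0, r):
    a random degree-(k-1) polynomial over GF(P), mapped down to r buckets."""
    coeffs = [random.randrange(P) for _ in range(k)]  # a_0, ..., a_{k-1}

    def h(x: int) -> int:
        # Horner's rule: p(x) = (...(a_{k-1}*x + a_{k-2})*x + ...)*x + a_0
        acc = 0
        for a in reversed(coeffs):
            acc = (acc * x + a) % P
        return acc % r  # map to the smaller range in a (nearly) balanced way

    return h

h = make_poly_hash(k=5, r=1024)
print(h(42), h(43))
```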

Implementing field operations (work by Tobias Christiani)
For GF(2^64): use the new CLMUL instruction with a sparse irreducible polynomial.
- Time for k-independence: ca. 3k ns.
For GF(p), p = 2^61 - 1 (a Mersenne prime): use double 64-bit registers and special code for the modulo, x % p = (x >> 61) + (x & p).
- Time for k-independence: ca. k ns.
This is the fastest known method for keys of 61 bits, up to more than 100-independence.
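A sketch of the Mersenne reduction in Python; the conditional subtraction at the end is needed because the shift-and-add can leave a value in [p, 2p):

```python
P = (1 << 61) - 1

def mod_mersenne(x: int) -> int:
    """Reduce x mod p = 2^61 - 1 without division, valid for x < p^2,
    using 2^61 ≡ 1 (mod p): if x = hi*2^61 + lo, then x ≡ hi + lo."""
    x = (x >> 61) + (x & P)
    return x - P if x >= P else x

a, b = 123456789012345678, 987654321098765432
assert mod_mersenne(a % P * (b % P)) == (a * b) % P
```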

Tomorrow: Double tabulation
[Photo: Mikkel Thorup on Danish TV.]

2-independence
Degree-1 polynomial: h(x) = ((ax + b) mod p) mod r.
Property: 2-independent. If a ≠ 0: collision probability ≈ 1/r.
For a set S of n elements and x ∉ S: Pr[h(x) ∈ h(S)] ≤ n/r.
The probability of no collisions at all is at least 1 - n^2/r.
So we can map to a (say) 128-bit signature with extremely small risk of collision.
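A Python sketch of this 2-independent scheme (the prime p = 2^61 - 1 and the empirical collision count are my additions):

```python
import random

P = (1 << 61) - 1  # a prime larger than any key we hash

def make_2indep(r: int):
    """h(x) = ((a*x + b) mod p) mod r, a random degree-1 polynomial."""
    a = random.randrange(1, P)  # a != 0
    b = random.randrange(P)
    return lambda x: ((a * x + b) % P) % r

r = 2**20
h = make_2indep(r)
keys = random.sample(range(10**9), 1000)
print(len(keys) - len({h(x) for x in keys}), "collisions")
# Expected colliding pairs: about (1000 choose 2)/r ≈ 0.48, so usually 0 or 1.
```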

Storing a set of signatures
From the last slide: for a set S of n elements and x ∉ S, Pr[h(x) ∈ h(S)] ≤ n/r.
Suppose r = 2n and we store h(S) as a bitmap:
0110010110111000110111101010101001100010010111010
This allows us to determine whether x ∈ S with false positive probability ≤ 1/2, at a cost of 2 bits per item.
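A sketch of this approximate-membership structure in Python (essentially a one-function Bloom filter; the class name is mine):

```python
import random

P = (1 << 61) - 1

class SignatureBitmap:
    """Store h(S) as a bitmap of r = 2n bits: no false negatives,
    false positive probability at most n/r = 1/2."""
    def __init__(self, items):
        self.r = 2 * len(items)
        a, b = random.randrange(1, P), random.randrange(P)
        self.h = lambda x: ((a * x + b) % P) % self.r
        self.bits = bytearray((self.r + 7) // 8)
        for x in items:
            i = self.h(x)
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, x) -> bool:
        i = self.h(x)
        return bool(self.bits[i // 8] & (1 << (i % 8)))

S = random.sample(range(10**9), 1000)
f = SignatureBitmap(S)
assert all(f.might_contain(x) for x in S)  # members are always found
fp = sum(f.might_contain(x) for x in range(10**6)) / 10**6
print("false positive rate ≈", fp)         # at most about 1/2
```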

Linear probing
A simple method for placing a set of items into a hash table: no pointers, just keys and vacant space. One of the first hash tables invented, and still practically important.

Hashing with linear probing
[Animated figures, not transcribed: each key x is placed in cell h(x), or, if that cell is occupied, in the next vacant cell to the right.]
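A minimal linear probing table in Python, sketching what the figures show (fixed capacity, no deletions; names are mine):

```python
class LinearProbingTable:
    """Open addressing with linear probing: key x goes into cell h(x),
    or failing that into the next vacant cell, wrapping around."""
    def __init__(self, r: int, h):
        self.slots = [None] * r   # no pointers, just keys and vacant space
        self.r, self.h = r, h

    def insert(self, x):
        i = self.h(x)
        while self.slots[i] is not None:
            if self.slots[i] == x:
                return            # already present
            i = (i + 1) % self.r  # sequential, cache-friendly scan
        self.slots[i] = x

    def contains(self, x) -> bool:
        i = self.h(x)
        while self.slots[i] is not None:
            if self.slots[i] == x:
                return True
            i = (i + 1) % self.r
        return False              # first vacant cell ends the search
```

With r = 2n slots for n keys, the analysis below shows that insertions take expected O(1) probes when h is drawn from a 5-independent family.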

Race car vs golf car
[Images: a race car at 389 km/h vs a golf car at 20 km/h.]
Linear probing uses a sequential scan and is thus cache-friendly. There is an order-of-magnitude speed difference between sequential and random memory access!

History of linear probing
First described in 1954. Analyzed in 1962 by D. Knuth, aged 24, assuming the hash function h is fully random; over 30 subsequent papers use this assumption. Since 2007: we know simple, efficient hash functions that make linear probing provably work!

Modern proof
Idea: link the number of steps used to insert an item x to the sizes of the intervals around h(x) that are full of hash values.
Notation: $L_I = |\{x \in S \mid h(x) \in I\}|$

Lemma: If insertion of a key x requires k probes, then there exists an interval I of length at least k such that h(x) ∈ I and $L_I \ge |I|$.
[Figure: x stored at the end of a run of occupied cells; the interval I contains h(x).]
Consequence: the insertion time is at most the number of full intervals around h(x).

How many full intervals?
Assume that r = 2n, so $E[L_I] = |I|/2$. By Chernoff bounds:
$\Pr[L_I > 2E[L_I]] < (e/4)^{E[L_I]}$
(Chernoff bounds are covered in books on randomized algorithms, or e.g. www.cs.uiuc.edu/~jeffe/teaching/algorithms/notes/11-chernoff.pdf.)
Expected number of full intervals around h(x), since there are k intervals of length k containing h(x):
$< \sum_{k=1}^{n} k\,(e/4)^{k/2} = O(1)$
This assumes that the values h(x) are independent!
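The constant hidden in the O(1) is modest; a quick numeric check of the series (my addition):

```python
import math

x = (math.e / 4) ** 0.5                 # so that (e/4)^(k/2) = x^k, x ≈ 0.824
partial = sum(k * x**k for k in range(1, 1000))
closed = x / (1 - x) ** 2               # closed form of sum k*x^k for |x| < 1
print(partial, closed)                  # both ≈ 26.7
```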

With 7-independence
Fix a particular interval I containing h(x). We want to analyze the probability that L_I reaches |I| items. Define $\ell(I) = \Pr[h(y) \in I] \le |I|/(2n)$ and
$Y_x = \begin{cases} 1 - \ell(I) & \text{if } h(x) \in I \\ -\ell(I) & \text{otherwise} \end{cases}$
Observation: $\sum_{x \in S} Y_x = L_I - E[L_I] = L_I - n\,\ell(I)$

6th moment tail bound
$\Pr\big[\sum_{x \in S} Y_x > |I|/2\big] = \Pr\big[(\sum_{x \in S} Y_x)^6 > (|I|/2)^6\big] < E\big[(\sum_{x \in S} Y_x)^6\big] / (|I|/2)^6 < 512/|I|^3$
The first inequality is Markov's. The second inequality requires that the variables Y_x are 6-independent (and a calculation).

Concluding the argument
The expected number of full intervals around h(x) is bounded by
$\sum_{k=1}^{n} \sum_{I \ni h(x),\, |I|=k} \Pr[L_I \ge |I|] \le \sum_{k=1}^{n} k\,(512/k^3) < 512 \sum_{k=1}^{\infty} 1/k^2 = O(1)$
Since the insertion time is at most the number of full intervals around h(x), the expected insertion time is O(1).
Tighter analysis: 5-independence works; 4-independence does not.

Some references
- Pătrașcu and Thorup: On the k-Independence Required by Linear Probing and Minwise Independence. http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf (particularly Section 1.1)
- Pagh, Pagh, and Ružić: Linear Probing with 5-wise Independence. http://www.itu.dk/people/pagh/papers/linear-sigest.pdf
- Thorup: String Hashing for Linear Probing. https://www.siam.org/proceedings/soda/2009/soda09_072_thorupm.pdf

Epilogue: Deterministic hashing
Java string hashing (signed 32-bit arithmetic): $h(a_1 a_2 \dots a_n) = a_n + 31 \cdot h(a_1 a_2 \dots a_{n-1})$
Collisions:
- h("Aa") = h("BB") = 2112 (equivalent substrings)
- h("AaAa") = h("AaBB") = h("BBAa") = h("BBBB") = 2031744
Recent heuristic hash functions, with a focus on evaluation time: MurmurHash, CityHash, SipHash.
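These collision values are easy to verify; a quick script (my addition) reproducing Java's String.hashCode in Python:

```python
def java_hash(s: str) -> int:
    """Java's String.hashCode: h = 31*h + c over the characters,
    in wrapping signed 32-bit arithmetic."""
    h = 0
    for c in s:
        h = (31 * h + ord(c)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

print(java_hash("Aa"), java_hash("BB"))                          # both 2112
print({java_hash(s) for s in ("AaAa", "AaBB", "BBAa", "BBBB")})  # one value
# Since "Aa" and "BB" collide, any concatenation of such blocks collides too,
# giving exponentially many colliding strings - the basis of the attacks below.
```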

(Some) people are starting to care!
Crosby & Wallach: Denial of Service via Algorithmic Complexity Attacks. Usenix Security '03.
- Follow-ups: Chaos Communication Congress '11, '12.
Java, C++, and C# libraries still use deterministic hashing.
- Java falls back to a BST for long hash chains!
NEW: Ruby 1.9, Python 3.3, [Perl 5.18] now use random hashing [if deterministic hashing fails].
