Testing a Hash Function using Probability


Suppose you have a huge square turnip field with 1000 turnips growing in it, all perfectly evenly spaced in a regular pattern. Suppose also that the Germans fly over your field and drop 10 bombs totally at random, all falling on your turnip field. Each bomb is so powerful that it completely destroys the one turnip it lands closest to. How many turnips would you have left?

Sounds easy: start with 1000, 10 are destroyed, so 990 are left. Except that there is a possibility that two bombs will land on the same turnip, so only nine will be destroyed. Not very likely, but certainly not impossible. You could even find three bombs landing on the same turnip, or two landing on one turnip and another two landing on another. It is even possible that all 10 bombs will land on the same turnip. Each turnip has a 1 in 1000 chance, or a 0.001 probability, of being hit each time a bomb is dropped, so for each turnip the probability of being hit all ten times is 0.001^10, or 10^-30.

Being left with only 990 turnips is the worst possible case. The best possible case is 999, and anything in between is also possible. Exactly what are the probabilities? That is what the Poisson distribution is all about. There are a large number of opportunities (1000 turnips) for an event that is individually rare (each turnip has only about a 0.01 chance of being hit during the raid), but over the whole world of opportunities it is inevitable, and is in fact going to happen 10 times. The key thing is the average number of events per opportunity. In our case this is the average number of bombs per turnip, 10/1000. This average is given the symbol λ, a lower-case lambda or Greek L.

λ = 0.01

If you want to know the probability of any particular turnip being hit by N bombs, the Poisson distribution tells us that

p(N) = e^(-λ) · λ^N / N!
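For concreteness, here is a minimal C sketch of that formula. It is only an illustration; the name poisson_p is my own choice, not something defined in these notes.

#include <math.h>

/* Poisson probability p(N) = e^(-lambda) * lambda^N / N!
   lambda is the average number of events per opportunity
   (bombs per turnip, birthdays per day, strings per table slot). */
double poisson_p(double lambda, int n)
{
    double p = exp(-lambda);      /* p(0) = e^(-lambda) */
    for (int k = 1; k <= n; k++)
        p *= lambda / k;          /* builds lambda^n / n! one factor at a time */
    return p;
}

Multiplying in lambda/k one factor at a time avoids computing a huge factorial separately, which matters once N gets large.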

In our example, N is limited to the range 0 to 10, λ = 0.01, and e^(-0.01) = 0.9900498 (e ≈ 2.71828183). Remember that 0! = 1 and anything to the power of 0 is 1.

p(0)  = 0.99
p(1)  = 0.0099
p(2)  = 0.000049
p(3)  = 0.00000017
p(4)  = 0.00000000041
p(5)  = 0.00000000000083
p(6)  = 0.0000000000000014
p(7)  = 0.0000000000000000019
p(8)  = 0.0000000000000000000025
p(9)  = 0.0000000000000000000000027
p(10) = 0.0000000000000000000000000027

For each turnip, there is a 0.99 chance of not being hit at all. With 1000 turnips, that means we really do expect to see 990 surviving, but only about 9.9 of them get hit exactly once. Over the course of about twenty raids, we would probably see one case of a turnip being hit more than once. And on average we would have to sit through 371,000,000,000,000,000,000,000 raids to see a single turnip being hit by all ten bombs even once.

All in all, this isn't looking very useful, but now look at another example. In a class of 29 students, what are the chances that two will share a birthday? With 365 days to spread 29 students over, it looks like only about an 8% chance of a coincidence (29/365). But the correct analysis is that the average number of students per birthday is 29/365, which is 0.079452. Each day of the year can expect just 0.079452 birthdays to land on it.

λ = 0.079452
p(0) = e^(-λ)     = 0.9236    (92.4% of days have no birthday on them)
p(1) = λ · p(0)   = 0.0733    (7.3% of days have one birthday on them)
p(2) = λ · p(1)/2 = 0.00292   (0.29% of days have two birthdays on them)

But 0.29% of days is 0.29% of 365, which is about 1.06. Meaning that for any random group of 29 people, on average, there will be just over one shared birthday.
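As a quick check of the birthday figures, here is a short sketch reusing the hypothetical poisson_p function above; it prints roughly 0.924, 0.073, 0.0029, and just over one day carrying two birthdays.

#include <stdio.h>

double poisson_p(double lambda, int n);   /* from the sketch above */

int main(void)
{
    double lambda = 29.0 / 365.0;          /* average students per day of the year */
    for (int n = 0; n <= 2; n++)
        printf("p(%d) = %.5f\n", n, poisson_p(lambda, n));
    printf("days with exactly two birthdays = %.3f\n", 365.0 * poisson_p(lambda, 2));
    return 0;
}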

So what? German bombs, people, and strings are all the same kind of thing. Turnips, days of the year, and hash table positions are all the same kind of thing.

From the turnip's point of view, being blown up by a bomb is an unlikely event; it probably isn't going to happen. From the bomb's point of view, landing on a turnip is an absolute certainty. From the day-of-the-year's point of view, someone in a small group of people having their birthday on it is unlikely. From the person-in-the-group's point of view, having their birthday land on some day of the year is a certainty. From the point of view of one of the thousands of positions in a hash table, any particular string landing on it is quite unlikely. From a string's point of view, finding a place in a hash table is a certainty: every string has a hash value. It all works the same way.

If we have a hash table whose array contains 10,000 pointers and we eventually store 5,000 strings in it, what would we expect to happen? If the hash function is working properly, we will get a random distribution of strings in the array, just like the distribution of people on days of the year. In this case, λ = 5000/10000 = 0.5.

p(0) = e^(-λ)            = 0.6065
p(1) = e^(-λ) · λ        = 0.3033
p(2) = e^(-λ) · λ^2 / 2  = 0.0758
p(3) = e^(-λ) · λ^3 / 6  = 0.0126
p(4) = e^(-λ) · λ^4 / 24 = 0.0016
p(5) = e^(-λ) · λ^5 / 120 = 0.0002
p(6) = e^(-λ) · λ^6 / 720 = 0.0000

Interpretation: p(2) is 0.0758, so every one of the 10,000 positions in the hash table has a 0.0758 probability of containing two strings. Therefore we should expect 758 of the hash table's linked lists to have a length of 2. Similarly, we should expect 6065 entries to be empty, 3033 linked lists to contain only one string, and only 2 entries in the whole table to have 5 strings in them. Notice how the numbers add up to 1.0000? We would expect to have no linked lists at all with a length greater than 5.
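The expected count for each list length is just the table size multiplied by p(length). A minimal sketch (again relying on the hypothetical poisson_p above) that reproduces the 6065 / 3033 / 758 / ... figures:

#include <stdio.h>

double poisson_p(double lambda, int n);   /* from the earlier sketch */

int main(void)
{
    int table_size = 10000, num_strings = 5000;
    double lambda = (double)num_strings / table_size;    /* 0.5 */
    for (int len = 0; len <= 6; len++)
        printf("lists of length %d: expected about %.0f\n",
               len, table_size * poisson_p(lambda, len));
    return 0;
}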

Of course, these are just the most likely figures; we can't expect nature to duplicate them exactly. But any properly working hash function should deliver that shape of distribution whenever λ is 0.5, i.e. whenever the hash table appears to be at half capacity.

[Bar chart: expected number of linked lists of each length for λ = 0.5, i.e. number of strings in the table = 0.5 times the array size; the counts run from 0 up to about 7000.]

To test a hash function:
1. Make your hash table quite large.
2. Read a large number of random strings into it (perhaps the text of a book).
3. Calculate λ = number of strings / size of table.
4. Make your program count how many linked lists are empty, how many have one string in them, how many have two, and so on.
5. Calculate the same numbers again, but this time the probabilistically expected values, using the Poisson formula.
6. Display the two sets of numbers (a code sketch of this comparison appears below), something like this:

number of empty lists: expected = 6065  actual = 6110
number with length 1:  expected = 3033  actual = 2980
number with length 2:  expected = 758   actual = 789
... etc.

You'll soon notice if the numbers are significantly different.

Side note: when you would think a hash table is full (the number of strings in it is the same as the size of its array, so λ = 1), these are the probabilities:

linked list length    probability
        0               0.3679
        1               0.3679
        2               0.1839
        3               0.0613
        4               0.0153
        5               0.0031
        6               0.0005
                        0.9999 total, so only 0.0001 left over

Even under such conditions, there should be no long lists, and a hash table remains a very fast-to-search storage system.
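Here is how steps 4 to 6 might look in C. This is only a sketch under assumed names: it presumes the hash table is an array of pointers to singly linked nodes (table, TABLE_SIZE, and struct node below are my assumptions, not the course's actual definitions) and reuses the poisson_p helper sketched earlier.

#include <stdio.h>

#define TABLE_SIZE 10000
#define MAX_LEN    20                      /* longest list length worth reporting */

struct node { char *str; struct node *next; };
extern struct node *table[TABLE_SIZE];     /* assumed: the hash table, filled in elsewhere */

double poisson_p(double lambda, int n);    /* from the earlier sketch */

void report(int num_strings)
{
    int actual[MAX_LEN + 1] = { 0 };

    /* step 4: count how many linked lists there are of each length */
    for (int i = 0; i < TABLE_SIZE; i++)
    {
        int len = 0;
        for (struct node *p = table[i]; p != NULL; p = p->next)
            len++;
        if (len > MAX_LEN)
            len = MAX_LEN;
        actual[len]++;
    }

    /* steps 5 and 6: expected counts from the Poisson formula, printed side by side */
    double lambda = (double)num_strings / TABLE_SIZE;
    for (int len = 0; len <= 6; len++)
        printf("number with length %d: expected = %.0f  actual = %d\n",
               len, TABLE_SIZE * poisson_p(lambda, len), actual[len]);
}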

[Figures: the shape of the Poisson distribution when λ is small (much less than 1), when λ = 1, and when λ is large (much greater than 1).]