Testing a Hash Function using Probability

Suppose you have a huge square turnip field with 1000 turnips growing in it. They are all perfectly evenly spaced in a regular pattern. Suppose also that the Germans fly over your field and drop 10 bombs totally at random, all falling on your turnip field. Each bomb is so powerful that it completely destroys the one turnip it lands closest to. How many turnips would you have left?

Sounds easy: start with 1000, 10 are destroyed, so 990 are left. Except that there is a possibility that two bombs will land on the same turnip, so only nine will be destroyed. Not very likely, but certainly not impossible. You could even find three bombs landing on the same turnip, or two landing on one and another two landing on another. It is even possible that all 10 bombs will land on the same turnip. Each turnip has a 1 in 1000 chance, or a 0.001 probability, of being hit by any one bomb, so for each turnip the probability of being hit by all ten bombs is 0.001^10, or 10^-30. Not very likely, but certainly not impossible.

Being left with only 990 turnips is the worst possible case. The best possible case is 999, and anything in between is also possible. Exactly what are the probabilities? That is what the Poisson distribution is all about. There are a large number of opportunities (1000) for an event that is individually rare (any particular turnip is unlikely to be hit), but over the whole world of opportunities is inevitable, and is in fact going to happen 10 times. The key thing is the average number of events per opportunity. In our case this is the average number of bombs per turnip, 10/1000. This average is given the symbol λ, a lower-case lambda or Greek L.

    λ = 0.01

If you want to know the probability of any particular turnip being hit by exactly N bombs, the Poisson distribution tells us that

    p(N) = e^-λ · λ^N / N!
In our example, N is limited to the range 0 to 10.

    λ = 0.01
    e^-0.01 = 0.9900498   (e ≈ 2.71828183)

Remember that 0! = 1 and anything to the power of 0 is 1.

    p(0)  = 0.99
    p(1)  = 0.0099
    p(2)  = 0.000049
    p(3)  = 0.00000017
    p(4)  = 0.00000000041
    p(5)  = 0.00000000000083
    p(6)  = 0.0000000000000014
    p(7)  = 0.0000000000000000019
    p(8)  = 0.0000000000000000000025
    p(9)  = 0.0000000000000000000000027
    p(10) = 0.0000000000000000000000000027

For each turnip, there is a 0.99 chance of not being hit at all. With 1000 turnips, that means we really do expect to see 990 surviving, but only 9.9 of them get hit exactly once. Over the course of 10 raids, we would probably see one case of a turnip being hit more than once. And on average, we would have to sit through about 371,000,000,000,000,000,000,000 raids to see a single case of one turnip being hit by all ten bombs. All in all, this isn't looking very useful, but now look at another example...

In a class of 29 students, what are the chances that two will share a birthday? With 365 days to spread 29 students over, it looks like only about an 8% chance of a coincidence (29/365). But the correct analysis is that the average number of students per birthday is 29/365, which is roughly 0.079452. Each day of the year can expect just 0.079452 birthdays to fall on it.

    λ = 0.079452
    p(0) = e^-λ     = 0.9236     (92.4% of days have no birthday on them)
    p(1) = λ·p(0)   = 0.0733     (7.3% of days have exactly one birthday on them)
    p(2) = λ·p(1)/2 = 0.00292    (0.29% of days have exactly two birthdays on them)

But 0.29% of days is 0.29% of 365, which is about 1.06. Meaning that for any random group of 29 people there will, on average, be just over one day with a shared birthday.
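All of these figures are easy to check with a few lines of code. Here is a minimal Python sketch (the function name poisson is my own choice, not anything standard) that reproduces the turnip and birthday numbers:

```python
import math

def poisson(lam, n):
    """Probability of exactly n events when the average per opportunity is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)

# Turnip field: 10 bombs spread over 1000 turnips, so lam = 10/1000.
for n in range(4):
    print(f"p({n}) = {poisson(0.01, n):.10f}")

# Birthdays: 29 students spread over 365 days.
lam = 29 / 365
print(f"expected days with two birthdays: {poisson(lam, 2) * 365:.2f}")
```

Running this gives p(0) ≈ 0.9900, p(1) ≈ 0.0099, and just over one day of the year carrying two birthdays, matching the figures above.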
So what? German bombs, people, and strings are all the same kind of thing. Turnips, days of the year, and hash table positions are all the same kind of thing. From the turnip's point of view, being blown up by a bomb is an unlikely event: it probably isn't going to happen. From the bomb's point of view, landing on a turnip is an absolute certainty. From the day-of-the-year's point of view, someone in a small group of people having their birthday on it is unlikely. From the person-in-the-group's point of view, having their birthday land on some day of the year is a certainty. From the point of view of one of the thousands of positions in a hash table, any particular string landing on it is quite unlikely. From a string's point of view, finding a place in a hash table is a certainty: every string has a hash value. It all works the same way.

If we have a hash table whose array contains 10,000 pointers and we eventually store 5,000 strings in it, what would we expect to happen? If the hash function is working properly, we will get a random distribution of strings in the array, just like the distribution of people over days-of-the-year. In this case:

    λ = 5000/10000 = 0.5
    p(0) = e^-λ           = 0.6065
    p(1) = e^-λ · λ       = 0.3033
    p(2) = e^-λ · λ^2/2   = 0.0758
    p(3) = e^-λ · λ^3/6   = 0.0126
    p(4) = e^-λ · λ^4/24  = 0.0016
    p(5) = e^-λ · λ^5/120 = 0.0002
    p(6) = e^-λ · λ^6/720 = 0.0000

Interpretation: p(2) is 0.0758. Every one of the 10,000 positions in the hash table has a 0.0758 probability of containing exactly two strings, so we should expect 758 of the hash table's linked lists to have a length of 2. Similarly, we should expect 6065 entries to be empty, 3033 linked lists to contain only one string, and only 2 entries in the whole table to have 5 strings in them. Notice how the probabilities add up to 1.0000? We would expect to see no linked lists at all with a length greater than 5.
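Turning those probabilities into expected list counts is just a multiplication by the table size. A short Python sketch, assuming the 10,000-pointer, 5,000-string table described above (the constant names are my own):

```python
import math

TABLE_SIZE = 10_000    # number of pointers in the hash table's array
NUM_STRINGS = 5_000    # number of strings stored, giving lam = 0.5

lam = NUM_STRINGS / TABLE_SIZE
for length in range(7):
    p = math.exp(-lam) * lam ** length / math.factorial(length)
    print(f"lists of length {length}: probability {p:.4f}, expected count {p * TABLE_SIZE:.0f}")
```

The expected counts come out as 6065, 3033, 758, 126, 16, 2, and 0, which is exactly the interpretation given above.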
Of course, these are just the most likely figures; we can't expect nature to duplicate them exactly. But any properly working hash function should deliver that shape of distribution whenever λ is 0.5, i.e. whenever the hash table appears to be at half capacity.

[Figure: bar chart of the expected number of lists of each length for λ = 0.5, i.e. number of strings in the table = 0.5 × array size; vertical axis runs from 0 to 7000.]

To test a hash function:
1. Make your hash table quite large.
2. Read a large number of random strings into it (perhaps the text of a book).
3. Calculate λ = number of strings / size of table.
4. Make your program count how many linked lists are empty, how many have one string in them, how many have two, and so on.
5. Calculate the expected numbers corresponding to the counts from step 4, this time using the Poisson formula.
6. Display the two sets of numbers, something like this:

    number of empty lists:  expected = 6065   actual = 6110
    number with length 1:   expected = 3033   actual = 2980
    number with length 2:   expected =  758   actual =  789
    ...etc.

You'll soon notice if the numbers are significantly different.

Side note: when you would think a hash table is full, i.e. the number of strings in it is the same as the size of its array, λ = 1, and these are the probabilities:

    linked list length    probability
    0                     0.3679
    1                     0.3679
    2                     0.1839
    3                     0.0613
    4                     0.0153
    5                     0.0031
    6                     0.0005
                          ------
                          0.9999 total, so only 0.0001 left over

Even under such conditions there should be no long lists, and a hash table remains a very fast-to-search storage system.
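The six steps above can be sketched directly in code. In this Python version, my_hash is a hypothetical stand-in (a simple polynomial hash, not a recommendation); replace it with the hash function you actually want to test:

```python
import math
import random
import string
from collections import Counter

def my_hash(s: str, table_size: int) -> int:
    # Stand-in hash function: replace with the one under test.
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % table_size
    return h

def test_hash(table_size=10_000, num_strings=5_000, seed=1):
    random.seed(seed)
    # Steps 1-2: build a large table and insert random strings into it.
    table = [[] for _ in range(table_size)]
    for _ in range(num_strings):
        s = "".join(random.choices(string.ascii_lowercase, k=8))
        table[my_hash(s, table_size)].append(s)
    # Step 3: the average number of strings per list.
    lam = num_strings / table_size
    # Step 4: count how many lists have each length.
    actual = Counter(len(bucket) for bucket in table)
    # Steps 5-6: compare the actual counts with the Poisson expectations.
    for length in range(7):
        expected = table_size * math.exp(-lam) * lam ** length / math.factorial(length)
        print(f"number with length {length}: expected = {expected:6.0f}   actual = {actual[length]:6d}")
    return actual

test_hash()
```

With λ = 0.5 the expected column reads 6065, 3033, 758, and so on; a properly working hash function should print actual counts close to those, and a badly skewed column is your cue to look at the hash function again.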
[Figures: the shape of the Poisson distribution when λ is small (much less than 1), when λ = 1, and when λ is large (much more than 1).]