An Exponential Approximation

Interact

The goal of this section is to understand how the chance of at least one collision behaves as a function of the number of individuals n, when there are N hash values and N is large compared to n.

We know that chance is

P(at least one collision) = 1  n1i=0NiN

While this gives an exact formula for the chance, it doesn’t give us a sense of how the function grows. Let’s see if we can develop an approximation that has a simpler form and is therefore easier to study.

The main steps in the approximation will be used repeatedly in this course, so we will set them out in some detail here.

Step 1. Only Approximate Terms that Need Approximation

While this might seem obvious, it’s worth noting as it can save a lot of unnecessary fiddling. We are trying to approximate

1  n1i=0NiN

so all we need to approximate is

n1i=0NiN

We can subtract the approximation from 1 at the end.

In other words, we’ll approximate P(no collision) instead.

Step 2. Use log to Convert Products to Sums

Our formula is a product, but we are much better at working with sums. The log function helps us turn the product into a sum:

log(P(no collision)) = n1i=0log(NiN)

Once we have an approximation to log(P(no collision)), we can use exponentiation to convert it to an approximation for what we want, which is P(no collision).

Step 3. Use Properties of log

This is usually the step where the main calculation happens. Remember that log(1+x)x for small x, where the symbol here means that the ratio of the two sides goes to 1 as x goes to 0. The approximation might not be great for larger x but let’s try it out anyway.

log(P(no collision)) = n1i=0log(NiN)= n1i=0log(1iN) n1i=0(iN)= 1Nn1i=0i=1N(n1)n2

by the formula for the sum of the first n1 positive integers.

Step 4. Invert as Needed to Complete the Approximation

The hard work has been done, and now we just have to clean things up. Step 3 gave us

log(P(no collision))  1N(n1)n2

and so by exponentiation on both sides we get

P(no collision)  e(n1)n/2N  en2/2N

Finally, P(at least one collision)  1e(n1)n/2N  1en2/2N

Now you can see why the P(at least one collision) rises sharply as a function of the number of people. Remember that N is fixed and n varies between 1 and N. As n increases, (n1)n increases fast, essentially like n2. So n2/2N decreases fast, which makes en2/2N drop sharply; and that makes 1en2/2N shoot up.

It’s worth noting that there is only one approxmation in the entire calculation: it’s in the line in the middle of Step 3, where we use log(1+x)x for small x. We will encounter this approximation several times in the course.

How Good is the Approximation?

To see how the exponential approximation compares with the exact probabilities, let’s work in the context of birthdays; you can change N in the code if you prefer a different setting.

To see the entire sequence of steps, we will redo our exact calculations and augment them with a column of approximations. We’ll use the somewhat more careful approximation of the two above.

N = 365 

def p_no_match(n):
    individuals_array = np.arange(n)
    return np.prod((N - individuals_array)/N)

trials = np.arange(1, N+1, 1)
results = Table().with_column('Trials', trials)
different = results.apply(p_no_match, 'Trials')

results = results.with_column(
    'P(at least one match)', 1 - different,
    'Exponential Approximation', 1 - np.e**( -(trials - 1)*trials/(2*N) )
)

results
</tbody> </tbody> </tbody> </tbody> </tbody> </tbody> </tbody> </tbody> </tbody>
Trials P(at least one match) Exponential Approximation
1 0 0
2 0.00273973 0.00273598
3 0.00820417 0.00818549
4 0.0163559 0.016304
5 0.0271356 0.0270254
6 0.0404625 0.0402629
7 0.0562357 0.0559104
8 0.0743353 0.0738438
9 0.0946238 0.0939222
10 0.116948 0.115991

... (355 rows omitted)

The first 10 approximations look pretty good. Let’s take a look at some more.

results.scatter('Trials')
plt.xlim(0, N/3)
plt.ylim(0, 1);

png

On the scale of this graph, the blue dots (the exact values) are almost indistinguishable from the gold (our exponential approximation). You can run the code again with the less careful approximation that replaces (n1)n by n2 and see that the approximation is still excellent.

What we learn from the second form of the approximation is that the chance that there is at least one collision among the n assigned values is essentially 1ecn2 where c is a positive constant.

We will encounter the function 1ecx2 again when we study the Rayleigh distribution later in the course.