14.4. SciPy and Normal Curves

14.4.1. Plotting Normal Curves

The prob140 function Plot_norm takes three arguments and displays the corresponding normal curve. The arguments are:

  • the interval over which to draw the curve, as a list or array with the two endpoints

  • the mean

  • the SD

from prob140 import Plot_norm

Plot_norm([-4, 4], 0, 1)
../../_images/f4c0085e033c9cb9ba6d7a7f2798d8a2491c39c837644227565dc0e006acd2bf.png

You can shade all the area to the left of a point \(x\) by providing \(x\) as the right_end of the interval \((-\infty, x]\).

Plot_norm([-4, 4], 0, 1, right_end=1.5)
../../_images/70f2f428dfabafa301874159a7aa8330208843a005833b6b6856cf9348ffa6af.png

All the area to the right of a point:

Plot_norm([-4, 4], 0, 1, left_end=1.5)
../../_images/865183bb623267ecafe097de26fdf833d55fc76d12098ee02321988dfd9b2d2b.png

The area between two points:

Plot_norm([-4, 4], 0, 1, left_end=-1, right_end=1.5)
../../_images/a3229016222a7f9137d1e9997abb0fe5052afb6e1430dbfb36ccf4fe2946da1b.png

14.4.2. \(\Phi\) and \(\Phi^{-1}\)

All the areas displayed above can be expressed in terms of the standard normal cdf \(\Phi\).

Recall that the standard normal cdf \(\Phi\) is the function defined by

\[ \Phi(x) = \int_{-\infty}^x \phi(z)dz ~, ~~~~ -\infty < x < \infty \]

where \(\phi\) is the standard normal curve.

For each \(x\), the value of \(\Phi(x)\) is an area under the standard normal curve. The function \(\Phi\) takes a real number \(x\) as its argument and returns a proportion \(p\) which is all the area to the left of \(x\) under the standard normal curve.

../../_images/a286eb229b85973a4d54e1b22c8eab8f10a159d5ee4e0199c4652b0ec4fbb480.png
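Since \(\Phi(x)\) is defined as an area, it can be approximated by numerical integration. The sketch below is only an illustration of the definition, not how SciPy computes the cdf. It uses Python's standard library, a midpoint Riemann sum, and a lower limit of \(-10\) standing in for \(-\infty\) (the area to the left of \(-10\) is negligible). The names phi, Phi_riemann, and Phi_erf are made up for this example. For comparison, it also uses the identity \(\Phi(x) = \frac{1}{2}\big(1 + \text{erf}(x/\sqrt{2})\big)\), where erf is the error function available as math.erf.

```python
import math

def phi(z):
    """Standard normal curve (density) at z."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi_riemann(x, lower=-10.0, steps=100_000):
    """Approximate the area under phi to the left of x by a midpoint Riemann sum.

    The lower limit -10 stands in for -infinity (an assumption of this sketch).
    """
    width = (x - lower) / steps
    return width * sum(phi(lower + (i + 0.5) * width) for i in range(steps))

def Phi_erf(x):
    """The same area written in terms of the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

area_riemann = Phi_riemann(1)
area_erf = Phi_erf(1)
print(area_riemann, area_erf)   # both approximately 0.8413
```

The two numbers agree to many decimal places, which is a useful sanity check that \(\Phi\) really is just an area under \(\phi\).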

It will also be helpful to go the other way, and identify the \(x\) such that \(\Phi(x)\) is a specified value \(p\). In other words, we will need \(\Phi^{-1}\), the inverse of \(\Phi\), which is determined by

\[ \Phi(x) ~ = ~ p ~~ \iff ~~ \Phi^{-1}(p) = x \]

For each \(p\) in the interval \((0, 1)\), the value of \(\Phi^{-1}(p)\) is a point on the horizontal axis of the graph of the standard normal curve.

../../_images/11b044d6fde2f73a51cdcd0048b49dfccf2e4d45c7cd4cde3569e485b109f383.png
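Because \(\Phi\) is continuous and strictly increasing, \(\Phi^{-1}(p)\) can be located by bisection: repeatedly halve an interval that is known to contain the answer. The sketch below illustrates the definition of the inverse and is not the algorithm SciPy uses. The function names and the starting interval \([-10, 10]\) are assumptions made for this example, so it only handles values of \(p\) whose answer lies in that interval.

```python
import math

def Phi(x):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def Phi_inverse(p, lo=-10.0, hi=10.0, tol=1e-12):
    """Find x with Phi(x) = p by bisection on [lo, hi]."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid      # the answer is to the right of mid
        else:
            hi = mid      # the answer is at or to the left of mid
    return (lo + hi) / 2

x = Phi_inverse(0.9)
print(x)          # approximately 1.2816
print(Phi(x))     # approximately 0.9
```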

14.4.3. \(\Phi\) and \(\Phi^{-1}\) in SciPy

As we noted in the previous section, there is no closed-form formula for \(\Phi\), and so there isn’t one for \(\Phi^{-1}\) either. But most computational systems provide excellent numerical approximations.

In SciPy the approximations are in the familiar stats module. For the standard normal cdf, use stats.norm.cdf just as you used stats.binom.cdf and so on. By default, stats.norm.cdf is based on the standard normal curve.

The area to the left of \(1\) under the standard normal curve:

from scipy import stats

stats.norm.cdf(1)
0.84134474606854293

The area between \(-1\) and \(1\) under the standard normal curve can be found by using the cdf and subtraction in a familiar way:

stats.norm.cdf(1) - stats.norm.cdf(-1)
0.68268949213708585
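The same two ideas, the cdf and subtraction, give the three kinds of shaded area plotted at the start of this section: to the left of \(1.5\), to the right of \(1.5\), and between \(-1\) and \(1.5\). A short sketch, assuming the import `from scipy import stats`; the variable names are illustrative.

```python
from scipy import stats

# Area to the left of 1.5
left_area = stats.norm.cdf(1.5)

# Area to the right of 1.5: the total area 1 minus the area to the left
right_area = 1 - stats.norm.cdf(1.5)

# Area between -1 and 1.5
between_area = stats.norm.cdf(1.5) - stats.norm.cdf(-1)

print(round(left_area, 4), round(right_area, 4), round(between_area, 4))
# 0.9332 0.0668 0.7745
```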

In the examples above, we started with a point or points on the horizontal axis and used the cdf \(\Phi\) to find a related area. We can also go backwards, by specifying an area and using \(\Phi^{-1}\) to find a related point on the horizontal axis.

For example, if you want \(x\) such that \(\Phi(x) = 0.9\), you can use the percent point function stats.norm.ppf. The name comes from the expression “90% point” of the distribution, or equivalently, the 90th percentile.

stats.norm.ppf(0.9)
1.2815515655446004
../../_images/e329d3e56f55975e1389354c6f52a26769c283277693fc8e4de57cd0a9fa7ecc.png

By the definition of an inverse, we should have \(\Phi(\Phi^{-1}(0.9)) = 0.9\). Let’s check that.

stats.norm.cdf(stats.norm.ppf(0.9))
0.89999999999999991
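The result differs from 0.9 in the last decimal place because of floating point rounding, so checks like this are better made with a tolerance than with ==. The sketch below repeats the round trip using the standard library class statistics.NormalDist, whose cdf and inv_cdf methods play the roles of \(\Phi\) and \(\Phi^{-1}\). It is offered as an independent cross-check, not as a replacement for stats.norm.

```python
import math
from statistics import NormalDist

Z = NormalDist()            # standard normal: mean 0, SD 1

x = Z.inv_cdf(0.9)          # Phi^{-1}(0.9)
p = Z.cdf(x)                # Phi(Phi^{-1}(0.9))

print(x)                            # approximately 1.2816
print(math.isclose(p, 0.9))         # True
```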

14.4.4. Example

Suppose the weights of a sample of 100 people are i.i.d. with a mean of 150 pounds and an SD of 20 pounds. Then the total weight of the sampled people is roughly normal with mean \(100 \times 150 = 15,000\) pounds and SD \(\sqrt{100} \times 20 = 200\) pounds.

Who cares about the total weight of a random group of people? Ask those who construct stadiums, elevators, and airplanes.

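# Approximate distribution of total weight
from datascience import make_array

n = 100        # sample size
mu = 150       # mean weight of one person, in pounds
sigma = 20     # SD of the weight of one person, in pounds

mean = n*mu                # mean of the total weight
sd = (n**0.5)*sigma        # SD of the total weight

# Plot from 4 SDs below the mean to 4 SDs above
plot_interval = make_array(mean-4*sd, mean+4*sd)

Plot_norm(plot_interval, mean, sd)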
../../_images/194a16896f31b7db854aebe7ad3b7dc8816c586f04c506a9e2f415a02ab746d3.png

The chance that the total weight of the sampled people is less than 15,100 pounds is approximately the gold area below. The CLT allows us to use the normal curve as an approximation to the unknown exact distribution of the total weight.

Plot_norm(plot_interval, mean, sd, right_end=15100)
../../_images/7124008026522125d0cfd34e20f67dc54393a108c5f3ad34439b68865f90fdbd.png

The function stats.norm.cdf takes the mean and SD as optional arguments. Remember that the names mean and sd were assigned in an earlier cell. Also remember that the answer below is not exact but an approximation based on the CLT.

stats.norm.cdf(15100, mean, sd)
0.69146246127401312
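One way to get a feel for the quality of the approximation is to simulate. The normal approximation uses only the mean and SD of the individual weights, but a simulation has to assume a shape for their distribution. The sketch below uses a gamma distribution with mean 150 and SD 20 purely for illustration. That choice, the seed, and the number of repetitions are assumptions of this sketch, not part of the example.

```python
import random
import statistics

random.seed(140)                     # arbitrary seed, for reproducibility

n, mu, sigma = 100, 150, 20

# A gamma distribution with shape k and scale theta has
# mean k*theta and SD sqrt(k)*theta.
theta = sigma**2 / mu                # scale: 400/150
k = mu / theta                       # shape: 56.25

def total_weight():
    """Total weight of one simulated sample of n people."""
    return sum(random.gammavariate(k, theta) for _ in range(n))

repetitions = 5000
totals = [total_weight() for _ in range(repetitions)]

sim_mean = statistics.mean(totals)
sim_sd = statistics.pstdev(totals)
sim_chance = sum(t < 15100 for t in totals) / repetitions

print(sim_mean, sim_sd, sim_chance)  # compare with 15000, 200, and about 0.69
```

Under these assumptions the simulated proportion should land close to the normal approximation, though it will vary a little with the seed and the number of repetitions.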

To find the approximate 90th percentile of the distribution of the total weight, you can use stats.norm.ppf with the mean and SD as arguments.

stats.norm.ppf(0.9, mean, sd)
15256.310313108919

The conclusion is \(P(S \le 15256) \approx 0.9\) where \(S\) denotes the total weight.

14.4.5. Using Standard Units

While it is convenient to be able to enter the mean and SD as arguments to stats.norm.cdf and stats.norm.ppf, the fundamental curve is the standard normal curve. All the others are obtained by linear transformations.

Therefore all the calculations above can be done in terms of the standard normal cdf by standardizing, and all normal approximations can (and will) be written in terms of \(\Phi\). We don’t need to use a different cdf for each mean and SD.
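In symbols, the reasoning can be sketched as follows. Here \(X\) is a placeholder, introduced only for this illustration, for any random variable that is normal with mean \(\mu\) and SD \(\sigma\). Since \(\sigma > 0\), the event \(X \le x\) is the same as the event that \(X\) in standard units is at most \(x\) in standard units, and \(X\) in standard units is standard normal:

\[ P(X \le x) ~ = ~ P\Big( \frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma} \Big) ~ = ~ \Phi \Big( \frac{x - \mu}{\sigma} \Big) \]

Setting the right hand side equal to \(p\) and solving for \(x\) gives the corresponding percentile:

\[ \Phi \Big( \frac{x - \mu}{\sigma} \Big) = p ~~ \iff ~~ x = \mu + \sigma \Phi^{-1}(p) \]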

For example, we can redo the two calculations above as follows.

To find the approximate chance that the total weight is less than 15100 pounds, first standardize 15100 and then use the standard normal cdf:

\[ P(S < 15100) ~ \approx ~ \Phi \big{(} \frac{15100 - 15000}{200} \big{)} \]

The calculation gives the same answer as before.

z = (15100 - mean)/sd

stats.norm.cdf(z) 
0.69146246127401312

To find the 90th percentile of the approximate distribution of \(S\), first find the 90th percentile of the standard normal curve. This value is the 90th percentile of any normal curve, measured in standard units.

z = stats.norm.ppf(0.9)
z
1.2815515655446004

Now convert the standard units back to pounds. The 90th percentile of the distribution of \(S\) is approximately \(\Phi^{-1}(0.9)\cdot200 + 15000\). The numerical answer is the same as before.

x = z*sd + mean
x
15256.310313108919
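The same two steps work for any percentile: find the percentile in standard units, then convert to pounds. A minimal sketch, using the standard library's statistics.NormalDist in place of stats.norm.ppf so that it runs without SciPy; the helper name percentile_in_pounds is hypothetical.

```python
from statistics import NormalDist

mean = 15000    # mean of the total weight, in pounds
sd = 200        # SD of the total weight, in pounds

def percentile_in_pounds(p):
    """Approximate 100p-th percentile of the total weight."""
    z = NormalDist().inv_cdf(p)     # percentile in standard units
    return z * sd + mean            # convert standard units to pounds

for p in [0.5, 0.9, 0.95, 0.99]:
    print(p, round(percentile_in_pounds(p), 2))
```

The value for 0.9 should match the answer above, and the value for 0.5 is the mean, because the normal curve is symmetric about its mean.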