{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"# HIDDEN\n",
"import warnings\n",
"warnings.filterwarnings('ignore')\n",
"from datascience import *\n",
"from prob140 import *\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('fivethirtyeight')\n",
"%matplotlib inline\n",
"from scipy import stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Chi-Squared Distributions ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The gamma family has two important branches. The first consists of gamma $(r, \\lambda)$ distributions with integer shape parameter $r$, as you saw in the previous section.\n",
"\n",
"The other important branch consists of gamma $(r, \\lambda)$ distributions that have *half-integer* shape parameter $r$, that is, when $r = n/2$ for a positive integer $n$. Notice that this branch contains the one above: every integer $r$ is also half of the integer $n = 2r$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chi-Squared $(1)$ ###\n",
"\n",
"We have already seen the fundamental member of the branch. Let $Z$ be a standard normal random variable and let $V = Z^2$. By the change of variable formula for densities, we found the density of $V$ to be\n",
"\n",
"$$\n",
"f_V(v) ~ = ~ \\frac{1}{\\sqrt{2\\pi}} v^{-\\frac{1}{2}} e^{-\\frac{1}{2} v}, ~~~~ v > 0\n",
"$$\n",
"\n",
"That's the gamma $(1/2, 1/2)$ density. It is also called the *chi-squared density with 1 degree of freedom,* which we will abbreviate to chi-squared (1)."
]
},
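{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough numerical check of the density of $V = Z^2$, the sketch below compares the formula with SciPy's chi-squared and gamma densities and with a simulation. The seed, sample size, and evaluation points are arbitrary choices made for illustration. Note that SciPy's `stats.gamma` takes a shape `a` and a `scale` equal to 1/rate, so rate $1/2$ corresponds to `scale=2`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# Evaluation points (arbitrary, chosen for illustration)\n",
"points = np.array([0.1, 0.5, 1.0, 2.0, 4.0])\n",
"\n",
"# Density of V = Z^2 from the change of variable formula\n",
"formula_pdf = points ** (-0.5) * np.exp(-points / 2) / np.sqrt(2 * np.pi)\n",
"\n",
"# The same density under its two names; rate 1/2 means scale 2 in SciPy\n",
"gamma_pdf = stats.gamma.pdf(points, a=0.5, scale=2)\n",
"chi2_pdf = stats.chi2.pdf(points, df=1)\n",
"\n",
"# Simulate V = Z^2 and compare empirical and exact CDF values\n",
"rng = np.random.default_rng(0)\n",
"v = rng.standard_normal(100_000) ** 2\n",
"empirical_cdf = np.array([np.mean(v <= p) for p in points])\n",
"exact_cdf = stats.chi2.cdf(points, df=1)\n",
"\n",
"print(np.allclose(formula_pdf, gamma_pdf), np.allclose(formula_pdf, chi2_pdf))\n",
"print(np.round(empirical_cdf, 3))\n",
"print(np.round(exact_cdf, 3))\n",
"```\n",
"\n",
"The three density arrays should agree up to floating point error, and the empirical proportions should be close to the exact CDF values."
]
},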
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-input",
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" VIDEO"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# VIDEO: Chi-Squared Distributions\n",
"from IPython.display import YouTubeVideo\n",
"\n",
"YouTubeVideo('TXM-yzqYmwg')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From Chi-Squared $(1)$ to Chi-Squared $(n)$ ###\n",
"\n",
"When we were establishing the properties of the standard normal density, we discovered that if $Z_1$ and $Z_2$ are independent standard normal then $Z_1^2 + Z_2^2$ has the exponential $(1/2)$ distribution. We saw this by comparing two different settings in which the Rayleigh distribution arises. But that wasn't a particularly illuminating reason for why $Z_1^2 + Z_2^2$ should be exponential. \n",
"\n",
"But now we know that the sum of independent gamma variables with the same rate is also gamma; the shape parameter adds up and the rate remains the same. Therefore $Z_1^2 + Z_2^2$ is a gamma $(1, 1/2)$ variable. That's the same distribution as exponential $(1/2)$, as you showed in exercises. This explains why the sum of squares of two i.i.d. standard normal variables has the exponential $(1/2)$ distribution.\n",
"\n",
"If $Z_1, Z_2, Z_3$ are i.i.d. standard normal variables, then:\n",
"\n",
"- $Z_1^2$ has the gamma $(1/2, 1/2)$ distribution\n",
"- $Z_1^2 + Z_2^2$ has the gamma $(1/2 + 1/2, 1/2)$ distribution\n",
"- $Z_1^2 + Z_2^2 + Z_3^2$ has the gamma $(1/2 + 1/2 + 1/2, 1/2)$ distribution\n",
"\n",
"Now let $Z_1, Z_2, \\ldots, Z_n$ be i.i.d. standard normal variables. Then $Z_1^2, Z_2^2, \\ldots, Z_n^2$ are i.i.d. chi-squared $(1)$ variables. That is, each of them has the gamma $(1/2, 1/2)$ distribution. \n",
"\n",
"By induction, $Z_1^2 + Z_2^2 + \\cdots + Z_n^2$ has the gamma $(n/2, 1/2)$ distribution. This is called the *chi-squared distribution with $n$ degrees of freedom,* which we will abbreviate to chi-squared $(n)$.\n",
"\n",
"In data science, these distributions often arise when we work with the *sum of squares of normal errors*. This is usually part of a *mean squared error* calculation."
]
},
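{
"cell_type": "markdown",
"metadata": {},
"source": [
"The claim that a sum of $n$ squared i.i.d. standard normals is gamma $(n/2, 1/2)$ can be checked by simulation. The following is a minimal sketch, assuming $n = 5$, an arbitrary seed, and 100,000 repetitions; it measures the Kolmogorov-Smirnov distance between the simulated sums and the gamma $(n/2, 1/2)$ CDF.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"rng = np.random.default_rng(1)\n",
"n = 5            # degrees of freedom (illustrative choice)\n",
"reps = 100_000   # number of simulated sums\n",
"\n",
"# Each row holds n i.i.d. standard normals; square and sum across the row\n",
"z = rng.standard_normal((reps, n))\n",
"sums = (z ** 2).sum(axis=1)\n",
"\n",
"# gamma (n/2, 1/2): shape n/2 and rate 1/2, which is scale 2 in SciPy\n",
"target = stats.gamma(a=n / 2, scale=2)\n",
"\n",
"# Largest gap between the empirical CDF of the sums and the target CDF\n",
"ks = stats.kstest(sums, target.cdf)\n",
"print(ks.statistic)\n",
"print(sums.mean(), target.mean())\n",
"```\n",
"\n",
"A small Kolmogorov-Smirnov statistic indicates that the simulated sums are consistent with the chi-squared $(n)$ distribution."
]
},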
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Chi-Squared Distribution with $n$ Degrees of Freedom ###\n",
"For a positive integer $n$, the random variable $X$ has the *chi-squared distribution with $n$ degrees of freedom* if the distribution of $X$ is gamma $(n/2, 1/2)$. That is, $X$ has density\n",
"\n",
"$$\n",
"f_X(x) ~ = ~ \\frac{\\frac{1}{2}^{\\frac{n}{2}}}{\\Gamma(\\frac{n}{2})} x^{\\frac{n}{2} - 1} e^{-\\frac{1}{2}x}, ~~~~ x > 0\n",
"$$\n",
"\n",
"Here are the graphs of the chi-squared densities for degrees of freedom 2 through 5."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"remove_input"
]
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"# NO CODE\n",
"x = np.arange(0, 14, 0.01)\n",
"y2 = stats.chi2.pdf(x, 2)\n",
"y3 = stats.chi2.pdf(x, 3)\n",
"y4 = stats.chi2.pdf(x, 4)\n",
"y5 = stats.chi2.pdf(x, 5)\n",
"plt.plot(x, y2, lw=2, label='2 df')\n",
"plt.plot(x, y3, lw=2, label='3 df')\n",
"plt.plot(x, y4, lw=2, label='4 df')\n",
"plt.plot(x, y5, lw=2, label='5 df')\n",
"plt.legend()\n",
"plt.xlabel('$v$')\n",
"plt.title('Chi-Squared $(n)$ Densities for $n = 2, 3, 4, 5$');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The chi-squared (2) distribution is exponential because it is the gamma $(1, 1/2)$ distribution. This distribution has three names:\n",
"\n",
"- chi-squared (2)\n",
"- gamma (1, 1/2)\n",
"- exponential (1/2)"
]
},
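{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three names can be compared numerically. This sketch evaluates the density under each name with SciPy on an arbitrary grid, along with the exponential $(1/2)$ formula written out directly. In SciPy, both `stats.gamma` and `stats.expon` use `scale` = 1/rate.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# Grid of positive values (illustrative choice)\n",
"x = np.linspace(0.01, 10, 50)\n",
"\n",
"chi2_pdf = stats.chi2.pdf(x, df=2)            # chi-squared (2)\n",
"gamma_pdf = stats.gamma.pdf(x, a=1, scale=2)  # gamma (1, 1/2)\n",
"expon_pdf = stats.expon.pdf(x, scale=2)       # exponential (1/2)\n",
"direct_pdf = 0.5 * np.exp(-0.5 * x)           # (1/2) e^{-x/2}\n",
"\n",
"print(np.allclose(chi2_pdf, gamma_pdf))\n",
"print(np.allclose(chi2_pdf, expon_pdf))\n",
"print(np.allclose(chi2_pdf, direct_pdf))\n",
"```"
]
},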
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mean and Variance ###\n",
"You know that if $T$ has the gamma $(r, \\lambda)$ density then \n",
"\n",
"$$\n",
"E(T) ~ = ~ \\frac{r}{\\lambda} ~~~~~~~~~~~~ SD(T) = \\frac{\\sqrt{r}}{\\lambda}\n",
"$$\n",
"\n",
"If $X$ has the chi-squared $(n)$ distribution then $X$ is gamma $(n/2, 1/2)$. So\n",
"\n",
"$$\n",
"E(X) ~ = ~ \\frac{n/2}{1/2} ~ = ~ n\n",
"$$\n",
"\n",
"Thus **the expectation of a chi-squared random variable is its degrees of freedom**.\n",
"\n",
"The SD is\n",
"$$\n",
"SD(X) ~ = ~ \\frac{\\sqrt{n/2}}{1/2} ~ = ~ \\sqrt{2n}\n",
"$$"
]
},
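{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formulas $E(X) = n$ and $SD(X) = \\sqrt{2n}$ can be compared with the values SciPy reports for `stats.chi2`. The degrees of freedom below are arbitrary illustrative choices.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from scipy import stats\n",
"\n",
"# A few degrees of freedom to check (illustrative choice)\n",
"dfs = np.array([1, 2, 5, 10])\n",
"\n",
"means = stats.chi2.mean(dfs)   # should equal n\n",
"sds = stats.chi2.std(dfs)      # should equal sqrt(2n)\n",
"\n",
"for n, m, s in zip(dfs, means, sds):\n",
"    print(n, m, s, np.sqrt(2 * n))\n",
"```"
]
},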
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Estimating the Normal Variance ###\n",
"Suppose $X_1, X_2, \\ldots, X_n$ are i.i.d. normal $(\\mu, \\sigma^2)$ variables, and that you are in a setting in which you know $\\mu$ and are trying to estimate $\\sigma^2$. \n",
"\n",
"Let $Z_i$ be $X_i$ in standard units, so that $Z_i = (X_i - \\mu)/\\sigma$. Define the random variable $T$ as follows:\n",
"\n",
"$$\n",
"T ~ = ~ \\sum_{i=1}^n Z_i^2 ~ = ~ \\frac{1}{\\sigma^2}\\sum_{i=1}^n (X_i - \\mu)^2\n",
"$$\n",
"\n",
"Then $T$ has the chi-squared $(n)$ distribution and $E(T) = n$. Now define $W$ by\n",
"\n",
"$$\n",
"W ~ = ~ \\frac{\\sigma^2}{n} T ~ = ~ \\frac{1}{n} \\sum_{i=1}^n (X_i - \\mu)^2\n",
"$$\n",
"\n",
"Then $W$ can be computed based on the sample since $\\mu$ is known. And since $W$ is a linear tranformation of $T$ it is easy to see that $E(W) = \\sigma^2$. \n",
"\n",
"So we have constructed an unbiased estimate of $\\sigma^2$. It is the mean squared deviation from the known population mean.\n",
"\n",
"But typically, $\\mu$ is not known. In that case you need a different estimate of $\\sigma^2$ since you can't compute $W$ as defined above. You showed in exercises that\n",
"\n",
"$$\n",
"S^2 ~ = ~ \\frac{1}{n-1}\\sum_{i=1}^n (X_i - \\bar{X})^2\n",
"$$\n",
"\n",
"is an unbiased estimate of $\\sigma^2$ regardless of the distribution of the $X_i$'s. When the $X_i$'s are normal, as is the case here, it turns out that $S^2$ is a linear transformation of a chi-squared $(n-1)$ random variable. We will show that later in the course."
]
},
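{
"cell_type": "markdown",
"metadata": {},
"source": [
"A simulation can illustrate that both $W$ and $S^2$ are unbiased for $\\sigma^2$. This is a sketch under assumed values: $\\mu = 10$, $\\sigma = 3$, $n = 8$, the seed, and the number of repetitions are all hypothetical choices made only for the illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(2)\n",
"mu, sigma, n = 10.0, 3.0, 8   # hypothetical parameters and sample size\n",
"reps = 50_000                 # number of simulated samples\n",
"\n",
"# Each row is one sample X_1, ..., X_n\n",
"x = rng.normal(mu, sigma, size=(reps, n))\n",
"\n",
"# W: mean squared deviation from the known mean mu\n",
"w = np.mean((x - mu) ** 2, axis=1)\n",
"\n",
"# S^2: deviations from the sample mean, with divisor n - 1\n",
"s2 = np.var(x, axis=1, ddof=1)\n",
"\n",
"print(sigma ** 2, w.mean(), s2.mean())\n",
"```\n",
"\n",
"Both averages should be close to $\\sigma^2 = 9$. Using `ddof=0` in place of `ddof=1` would give the divisor $n$ and an average visibly below $\\sigma^2$."
]
},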
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### \"Degrees of Freedom\" ###\n",
"The example above helps explain the strange term \"degrees of freedom\" for the parameter of the chi-squared distribution. \n",
"- When $\\mu$ is known, you have $n$ independent centered normals $(X_i - \\mu)$ that you can use to estimate $\\sigma^2$. That is, you have $n$ degrees of freedom in constructing your estimate.\n",
"- When $\\mu$ is not known, you are using all $n$ of $X_1 - \\bar{X}, X_2 - \\bar{X}, \\ldots, X_n - \\bar{X}$ in your estimate, but they are not independent. They are the deviations of the list $X_1, X_2, \\ldots , X_n$ from their average $\\bar{X}$, and hence their sum is 0. If you know $n-1$ of them, the final one is determined. So you only have $n-1$ degrees of freedom."
]
},
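{
"cell_type": "markdown",
"metadata": {},
"source": [
"The constraint on the deviations from $\\bar{X}$ is easy to see numerically. In this small sketch, with an arbitrary seed and a sample of size 6, the deviations sum to 0 up to floating point error, and the last deviation is recovered from the other $n-1$.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(3)\n",
"x = rng.normal(size=6)          # a sample of size n = 6 (illustrative)\n",
"\n",
"# Deviations from the sample mean\n",
"dev = x - x.mean()\n",
"\n",
"# Because the deviations sum to 0, the last one is minus the sum of the rest\n",
"last_from_others = -dev[:-1].sum()\n",
"\n",
"print(dev.sum())\n",
"print(dev[-1], last_from_others)\n",
"```"
]
},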
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}