{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"# HIDDEN\n",
"from datascience import *\n",
"from prob140 import *\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"plt.style.use('fivethirtyeight')\n",
"%matplotlib inline\n",
"from scipy import stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Regression Equation ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The equation of the regression line for predicting $Y$ based on $X$ can be written in several equivalent ways. The regression equation, and the error in the regression estimate, are best understood in standard units. All the other representations follow by straightforward algebra.\n",
"\n",
"Let $X$ and $Y$ be bivariate normal with parameters $(\\mu_X, \\mu_Y, \\sigma_X^2, \\sigma_Y^2, \\rho)$. Then, as we have seen, the best predictor $E(Y \\mid X)$ is a linear function of $X$ and hence the formula for $E(Y \\mid X)$ is also the equation of the regression line."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### In Standard Units ###\n",
"Let $X_{su}$ be $X$ in standard units and $Y_{su}$ be $Y$ in standard units. The regression equation is\n",
"\n",
"$$\n",
"E(Y_{su} \\mid X_{su}) ~ = ~ \\rho X_{su}\n",
"$$\n",
"\n",
"and the amount of error in the prediction is measured by\n",
"\n",
"$$\n",
"SD(Y_{su} \\mid X_{su}) ~ = ~ \\sqrt{1 - \\rho^2}\n",
"$$\n",
"\n",
"The conditional SD is in the same units as the prediction. The conditional variance is\n",
"\n",
"$$\n",
"Var(Y_{su} \\mid X_{su}) ~ = ~ 1 - \\rho^2\n",
"$$\n",
"\n",
"We know more than just the conditional expectation and conditional variance. We know that the conditional distribution of $Y_{su}$ given $X_{su}$ is normal. This allows us to find conditional probabilities given $X_{su}$, by the usual normal curve methods. For example, \n",
"\n",
"$$\n",
"P(Y_{su} < y_{su} \\mid X_{su} = x_{su}) ~ = ~ \\Phi \\big{(} \\frac{y_{su} - \\rho x_{su}}{\\sqrt{1-\\rho^2}} \\big{)}\n",
"$$\n",
"\n",
"In one of Galton's famous data sets, the distribution of the heights of father-son pairs was roughly bivariate normal with a correlation of 0.5. Of the fathers whose heights were two SDs above average, about what percent had sons whose heights were more than 2 SDs above average?\n",
"\n",
"By the regression effect, you know this answer has to be less than 50%. If $Y_{su}$ denotes the height of a randomly picked son in standard units, and $X_{su}$ the height of his father in standard units, then the proportion is approximately\n",
"\n",
"$$\n",
"P(Y_{su} > 2 \\mid X_{su} = 2) ~ = ~ 1 - \\Phi \\big{(} \\frac{2 - 0.5\\times2}{\\sqrt{1-0.5^2}} \\big{)}\n",
"$$\n",
"\n",
"which is approximately 12.4%."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.12410653949496186"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"1 - stats.norm.cdf(2, 0.5*2, np.sqrt(1-0.5**2))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-input",
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"