{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove_cell"
]
},
"outputs": [],
"source": [
"# HIDDEN\n",
"from datascience import *\n",
"from prob140 import *\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"plt.style.use('fivethirtyeight')\n",
"%matplotlib inline\n",
"from scipy import stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Multiple Regression ##"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Regression provides one way of predicting a numerical variable, called a *response*, based on other variables called *predictor variables*. The multiple regression model says in essence that\n",
"\n",
"$$\n",
"\\text{response} ~ = ~ \\text{linear combination of predictor variables} + \\text{ random noise }\n",
"$$\n",
"\n",
"You can think of the first term on the right hand side as a *signal*. The problem is that we don't get to observe the signal. The observed response is the sum of the signal and the noise. The data scientist's task is to use the observations to extract the signal as accurately as possible.\n",
"\n",
"It is worth looking more closely at exactly what is linear in linear regression, now that we are allowing more than one predictor variable. For example, notice that you can fit a quadratic function of $x$ by using the two predictor variables $x_1 = x$ and $x_2 = x^2$. Then the signal\n",
"\n",
"$$\n",
"\\beta_0 + \\beta_1x_1 + \\beta_2x_2 ~ = ~ \\beta_0 + \\beta_1x + \\beta_2x^2\n",
"$$ \n",
"\n",
"is a quadratic function of $x$. But it is linear in the coefficients, and it is a linear combination of the two predictor variables $x_1$ and $x_2$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The Model ###\n",
"\n",
"As in all of statistical inference, properties of estimates depend on the assumptions under which they are calculated. The *multiple regression model* is a commonly used set of assumptions that describes a particular kind of linear relation between a numerical response variable and a set of predictor variables. You should use it only if you believe that it makes sense for your data.\n",
"\n",
"The model assumes that there are $n$ individuals, on each of whom you have measured the response and the predictor variables. For $1 \\le i \\le n$, the relation between the variables is assumed to be\n",
"\n",
"$$\n",
"Y_i = \\beta_0 + \\beta_1x_{i,1} + \\beta_2x_{i,2} + \\cdots + \\beta_{p-1}x_{i,p-1} + \\epsilon_i\n",
"$$\n",
"\n",
"in the notation described below.\n",
"\n",
"- $x_{i,1}, x_{i,2}, \\ldots, x_{i,p-1}$ are the observed constant values of $p-1$ predictor variables for individual $i$. They are not random variables. If you prefer to think of the predictor variables as random, this model assumes that you have conditioned on them.\n",
"\n",
"- The intercept $\\beta_0$ and slopes $\\beta_1, \\beta_2, \\ldots, \\beta_{p-1}$ are unobservable constants and are parameters of the model. There are $p$ of them, hence the notation $p$ for \"parameters\".\n",
"\n",
"- $\\epsilon_i$ is an unobservable random error that has the normal $(0, \\sigma^2)$ distribution for some unobservable $\\sigma^2$, and $\\epsilon_1, \\epsilon_2, \\ldots, \\epsilon_n$ are i.i.d.\n",
"\n",
"- $Y_i$ is the observable response of individual $i$. It is random because $\\epsilon_i$ is one of its components.\n",
"\n",
"We will assume that $n > p$, that is, we will assume we have more individuals than parameters. Indeed in this course it is fine for you to think of $n$ as much larger than $p$. \n",
"\n",
"Two special cases are already familiar.\n",
"\n",
"**$p = 1$: Prediction by a Constant**\n",
"\n",
"When $p = 1$ there is just one parameter: the intercept. There are no predictor variables at all. The model says that for each individual $i$, the response is $Y_i = \\beta_0 + \\epsilon_i$. This is a case of trying to estimate the response by a constant. \n",
"\n",
"**$p = 2$: Simple Linear Regression**\n",
"\n",
"The two parameters are the intercept and a slope. The model says that for each individual $i$, the response is $Y_i = \\beta_0 + \\beta_1x_{i,1} + \\epsilon_i$. That is, the response is the value on a hidden straight line, plus some normal noise. This is the simple regression model you used in Data 8."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Signal and Noise: Matrix Representation ###\n",
"For any $p$, the model can be written compactly as\n",
"\n",
"$$\n",
"\\mathbf{Y} ~ = ~ \\mathbf{X}\\boldsymbol{\\beta} + \\boldsymbol{\\epsilon}\n",
"$$\n",
"\n",
"in the matrix notation described below.\n",
"\n",
"- The *design matrix* $\\mathbf{X}$ is an $n \\times p$ matrix of real numbers, not random variables. Column 0 of $\\mathbf{X}$ is a vector of 1's and Column $j$ for $1 \\le j \\le p-1$ consists of the $n$ observations on the $j$th predictor variable. For each $i$ in the range 1 through $n$, Row $i$ contains the values of all the predictor variables for individual $i$.\n",
"\n",
"- The *parameter vector* $\\boldsymbol{\\beta} = [\\beta_0 ~~ \\beta_1 ~~ \\ldots ~~ \\beta_{p-1}]^T$ is a $p \\times 1$ vector of the coefficients.\n",
"\n",
"- The *error vector* $\\boldsymbol{\\epsilon}$ is an $n \\times 1$ multivariate normal $(\\mathbf{0}, \\sigma^2\\mathbf{I}_n)$ random vector. Its mean vector is an $n \\times 1$ vector of 0's and $\\mathbf{I}_n$ is the $n \\times n$ identity matrix.\n",
"\n",
"- The *response vector* $\\mathbf{Y}$ is a random vector that is the sum of the linear signal $\\mathbf{X}\\boldsymbol{\\beta}$ and the normal noise $\\boldsymbol{\\epsilon}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ordinary Least Squares ###\n",
"Based on the observations of the predictor variables and the response, the goal is to find the best estimates of the intercept and slopes in the model.\n",
"\n",
"These estimates can then be used to predict the response of a new individuals, assuming that the model holds for the new individual as well.\n",
"\n",
"We must select a criterion by which we will decide whether one estimate is better than another. To develop one such criterion, start by noting that any linear function of the predictor variables can be written as $\\mathbf{X}\\boldsymbol{\\gamma}$ where $\\boldsymbol{\\gamma}$ is some $p \\times 1$ vector of coefficients. Think of $\\mathbf{X}\\boldsymbol{\\gamma}$ as an estimate of $\\mathbf{Y}$. Then the error in the estimate is $\\mathbf{Y} - \\mathbf{X}\\boldsymbol{\\gamma}$.\n",
"\n",
"The goal of *ordinary least squares* (OLS) is to find the vector $\\boldsymbol{\\gamma}$ that minimises the mean squared error\n",
"\n",
"$$\n",
"MSE(\\boldsymbol{\\gamma}) ~ = ~ \\frac{1}{n} \\sum_{i=1}^n (Y_i - (\\mathbf{X}\\boldsymbol{\\gamma})_i)^2\n",
"$$\n",
"\n",
"This is the same as the $\\boldsymbol{\\gamma}$ that minimizes the sum of squared errors\n",
"\n",
"$$\n",
"SSE(\\boldsymbol{\\gamma}) ~ = ~ \\sum_{i=1}^n (Y_i - (\\mathbf{X}\\boldsymbol{\\gamma})_i)^2\n",
"$$\n",
"\n",
"Again for compactness it will help to use matrix notation. For an $n \\times 1$ vector $\\mathbf{w}$, \n",
"\n",
"$$\n",
"\\sum_{i=1}^n w_i^2 ~ = ~ \\mathbf{w}^T\\mathbf{w} ~ = ~ \\mathbf{w} \\cdot \\mathbf{w} ~ = ~ \\| \\mathbf{w} \\|^2\n",
"$$\n",
"\n",
"which is sometimes called the *squared norm* of $\\mathbf{w}$.\n",
"\n",
"In this notation, the goal of OLS is to find the $p \\times 1$ vector $\\hat{\\boldsymbol{\\beta}}$ that minimizes $\\| \\mathbf{Y} - \\mathbf{X}\\boldsymbol{\\gamma}\\|^2$ over all vectors $\\boldsymbol{\\gamma}$.\n",
"\n",
"Typically you will also have to estimate the unknown error variance $\\sigma^2$. But we will not cover that in this class except in the case $p = 1$."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-input",
"hide-output"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"