Best Predictor¶
We end with the solution to the problem of prediction based on more than one predictor variable.
Let $p$ (for predictors) be a positive integer. Suppose the random vector
$$
\begin{bmatrix} Y \\ X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}
$$

is multivariate normal, and suppose we are trying to predict $Y$ based on the $p$ predictor variables
$$
\mat{X} ~ = ~ \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}
$$

You will recognize the following as extensions of results you know well from predicting $Y$ based on a single $X$.
- The conditional distribution of $Y$ given $\mat{X} = \mat{x}$ is normal.
- The conditional expectation of $Y$ given $\mat{X}$ is the least squares predictor of $Y$ given $\mat{X}$.
- The conditional expectation of $Y$ given $\mat{X}$ is linear, and hence the best overall predictor (by the criterion of least squares) is the same as the best linear predictor.
To complete the specification of the normal conditional distribution of $Y$ given $\mat{X}$, we have to identify the conditional mean and the conditional variance. We will not derive these quantities, but we will demonstrate how their formulas are clear analogs of the familiar calculations when there is only one predictor variable.
To keep our variables organized and our notation compact, we will partition the random vector and its mean vector.
$$
\begin{bmatrix} Y \\ X_1 \\ X_2 \\ \vdots \\ X_p \end{bmatrix}
~ = ~
\begin{bmatrix} Y \\ \mat{X} \end{bmatrix}
~~~~~~~~~~~~~~~
\begin{bmatrix} \mu_Y \\ \mu_{X_1} \\ \mu_{X_2} \\ \vdots \\ \mu_{X_p} \end{bmatrix}
~ = ~
\begin{bmatrix} \mu_Y \\ \mat{\mu_X} \end{bmatrix}
$$

We can partition the covariance matrix as well, according to the demarcating lines shown below.

$$
\left[\begin{array}{c|c}
\sigma_Y^2 & \bsymb{\Sigma_{Y,X}} \\
\hline
\bsymb{\Sigma_{X,Y}} & \bsymb{\Sigma_{X,X}}
\end{array}\right]
$$
- The top left (1, 1) element is the variance of $Y$.
- The bottom right $p \times p$ matrix $\bsymb{\Sigma_{X,X}}$ is a covariance matrix in its own right: the covariance matrix of the $p$ predictor variables.
- The top right row vector $\bsymb{\Sigma_{Y,X}}$ and the bottom left column vector $\bsymb{\Sigma_{X,Y}}$ are transposes of each other. Each contains the covariances between the response variable $Y$ and the predictor variables.
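To make the partition concrete, here is a minimal NumPy sketch. The covariance matrix `Sigma` below is an assumed illustrative example, not a matrix from the text; slicing pulls out the four blocks just described.

```python
import numpy as np

# Illustrative covariance matrix of [Y, X_1, X_2]; the numbers are made up.
# The first row and column correspond to Y, the rest to the predictors.
Sigma = np.array([
    [10.0, 3.0, 2.0],
    [ 3.0, 4.0, 1.0],
    [ 2.0, 1.0, 5.0],
])

sigma_Y_sq = Sigma[0, 0]       # Var(Y): the top left (1, 1) element
Sigma_YX   = Sigma[0:1, 1:]    # 1 x p row vector of covariances of Y with the X's
Sigma_XY   = Sigma[1:, 0:1]    # p x 1 column vector, the transpose of Sigma_YX
Sigma_XX   = Sigma[1:, 1:]     # p x p covariance matrix of the predictors

print(sigma_Y_sq)                          # 10.0
print(Sigma_YX)                            # [[3. 2.]]
print(np.allclose(Sigma_XY, Sigma_YX.T))   # True
```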
Conditional Expectation¶
We will start by recalling how to predict $Y$ based on one predictor variable $X$. As in previous sections, let $a^*$ be the slope of the regression line. Our first step is to write $a^*$ in terms of elements of the covariance matrix of $X$ and $Y$.
$$
a^* ~ = ~ \rho \frac{\sigma_Y}{\sigma_X} ~ = ~ \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y} \cdot \frac{\sigma_Y}{\sigma_X} ~ = ~ \sigma_{Y,X}({\sigma_X^2})^{-1}
$$

The equation of the regression line is
\begin{align*}
E(Y \mid X = x) ~ &= ~ a^*x + (\mu_Y - a^*\mu_X) \\
&= ~ a^*(x - \mu_X) + \mu_Y \\
&= ~ \sigma_{Y,X}({\sigma_X^2})^{-1}(x - \mu_X) + \mu_Y
\end{align*}

To extend this result to the case where $\mat{X}$ has $p$ components, replace each of the quantities in the formula by its matrix counterpart.
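Before making that replacement, here is a quick numerical check of the one-predictor formula: it simulates a large bivariate normal sample and compares $\sigma_{Y,X}(\sigma_X^2)^{-1}$ with the slope of the least squares line fitted to the sample. The mean vector and covariance matrix are assumed illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for (X, Y); the numbers are made up.
mu = np.array([1.0, 2.0])            # [mu_X, mu_Y]
cov = np.array([[4.0, 3.0],          # [[Var(X),    Cov(X, Y)],
                [3.0, 9.0]])         #  [Cov(X, Y), Var(Y)   ]]

a_star = cov[1, 0] / cov[0, 0]       # sigma_{Y,X} (sigma_X^2)^{-1} = 0.75

# Fit a least squares line to a large simulated sample.
x, y = rng.multivariate_normal(mu, cov, size=200_000).T
slope, intercept = np.polyfit(x, y, 1)

print(a_star)        # 0.75
print(slope)         # close to 0.75
print(intercept)     # close to mu_Y - a_star * mu_X = 1.25
```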
Regression of $Y$ on $\mat{X}$¶
$$
E(Y \mid \mat{X} = \mat{x}) ~ = ~ \bsymb{\Sigma_{Y,X}}\bsymb{\Sigma_{X,X}}^{-1} \big( \mat{x} - \mat{\mu_X} \big) + \mu_Y
$$

The result should be quite believable though we have proved it only in the case of one predictor variable. There are several ways to prove it in the general case. One way is to use the division rule to find the conditional density of $Y$ given $\mat{X}$. A more insightful way is to write multivariate normal variables in terms of independent normal variables and then make use of the independence, just as we did when we had one predictor.
Whether or not you go through the proof in the future, keep in mind that order matters when you are multiplying matrices: you can only multiply matrices if their shapes allow it. In the formula for $E(Y \mid \mat{X} = \mat{x})$, the mean $\mu_Y$ is a scalar. So $\bsymb{\Sigma_{Y,X}}\bsymb{\Sigma_{X,X}}^{-1} \big( \mat{x} - \mat{\mu_X} \big)$ had better be a scalar as well. It is, because:
- $\bsymb{\Sigma_{Y,X}}$ is a $1 \times p$ row vector
- $\bsymb{\Sigma_{X,X}}^{-1}$ is a $p \times p$ matrix
- $\big( \mat{x} - \mat{\mu_X} \big)$ is a $p \times 1$ column vector
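Here is a minimal sketch of the matrix arithmetic, again with assumed illustrative numbers rather than values from the text. It also confirms that the three shapes listed above multiply out to a scalar. The linear system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is the standard numerically stable choice.

```python
import numpy as np

# Illustrative parameters for (Y, X_1, X_2); the numbers are made up.
mu_Y = 5.0
mu_X = np.array([[1.0],
                 [2.0]])                  # p x 1 column vector of predictor means

Sigma_YX = np.array([[3.0, 2.0]])         # 1 x p
Sigma_XX = np.array([[4.0, 1.0],
                     [1.0, 5.0]])         # p x p

x = np.array([[2.0],
              [1.0]])                     # the value of X at which to predict

# E(Y | X = x) = Sigma_YX Sigma_XX^{-1} (x - mu_X) + mu_Y,
# computed with a linear solve instead of an explicit inverse.
cond_mean = Sigma_YX @ np.linalg.solve(Sigma_XX, x - mu_X) + mu_Y

print(cond_mean.shape)     # (1, 1): a 1xp times pxp times px1 product is a scalar
print(cond_mean.item())    # the predicted value of Y
```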
Conditional Variance¶
Return to predicting $Y$ based on just one variable $X$, and this time write $\rho^2$ in terms of elements of the covariance matrix of $X$ and $Y$.
$$
\rho^2 ~ = ~ \frac{\sigma_{X,Y}^2}{\sigma_X^2 \sigma_Y^2} ~ = ~ \sigma_{Y,X}({\sigma_X^2})^{-1}\sigma_{X,Y}({\sigma_Y^2})^{-1}
$$

Thus for every $x$,

$$
Var(Y \mid X = x) ~ = ~ (1 - \rho^2)\sigma_Y^2 ~ = ~ \sigma_Y^2 - \rho^2\sigma_Y^2 ~ = ~ \sigma_Y^2 - \sigma_{Y,X}({\sigma_X^2})^{-1}\sigma_{X,Y}
$$
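The fact that the answer is the same for every $x$ can be seen empirically. The sketch below, with assumed illustrative parameters, estimates the variance of $Y$ among sample points whose $X$ values fall near two different values of $x$ and compares both estimates with $(1 - \rho^2)\sigma_Y^2$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative bivariate normal parameters for (X, Y); the numbers are made up.
mu = np.array([1.0, 2.0])                  # [mu_X, mu_Y]
cov = np.array([[4.0, 3.0],
                [3.0, 9.0]])               # Var(X) = 4, Cov(X, Y) = 3, Var(Y) = 9

rho = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
print((1 - rho**2) * cov[1, 1])            # (1 - rho^2) sigma_Y^2 = 6.75

# Empirical check: the variance of Y among sample points with X close to a
# fixed value x is about the same, no matter which x we pick.
x, y = rng.multivariate_normal(mu, cov, size=1_000_000).T
for x0 in (0.0, 2.0):
    near = np.abs(x - x0) < 0.05
    print(x0, np.var(y[near]))             # both close to 6.75
```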
As before, replace each of the quantities in the formula for $Var(Y \mid X = x)$ by its matrix counterpart to extend the result to the case of $p$ predictors.
Mean Square Error of Regression¶
$$ Var(Y \mid \mat{X} = \mat{x}) ~ = ~ \sigma_Y^2 ~ - ~ \bsymb{\Sigma_{Y,X}}(\bsymb{\Sigma_{X,X}})^{-1}\bsymb{\Sigma_{X,Y}} $$
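Here is a minimal sketch of this formula with assumed illustrative covariance blocks like those used earlier. As a consistency check, it also confirms that when $p = 1$ the matrix expression reduces to the familiar $(1 - \rho^2)\sigma_Y^2$.

```python
import numpy as np

# Illustrative blocks of the partitioned covariance matrix; numbers are made up.
sigma_Y_sq = 10.0
Sigma_YX   = np.array([[3.0, 2.0]])        # 1 x p
Sigma_XY   = Sigma_YX.T                    # p x 1
Sigma_XX   = np.array([[4.0, 1.0],
                       [1.0, 5.0]])        # p x p

# Var(Y | X = x): note that it does not depend on the value x.
cond_var = sigma_Y_sq - (Sigma_YX @ np.linalg.solve(Sigma_XX, Sigma_XY)).item()
print(cond_var)

# With p = 1 the matrix formula reduces to the familiar (1 - rho^2) sigma_Y^2.
sigma_X_sq, sigma_XY = 4.0, 3.0
rho = sigma_XY / np.sqrt(sigma_X_sq * sigma_Y_sq)
print(sigma_Y_sq - sigma_XY * (1 / sigma_X_sq) * sigma_XY)   # matrix formula at p = 1
print((1 - rho**2) * sigma_Y_sq)                             # same value: 7.75
```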