Best Linear Predictor
Let $Y$ and the $p \times 1$ vector $X$ be jointly distributed, and suppose you are trying to predict $Y$ based on a linear function of $X$. For the predictor
$$ \hat{Y}_{c,d} = c^TX + d $$
the mean squared error of prediction is
$$ MSE(\hat{Y}_{c,d}) = E\big((Y - \hat{Y}_{c,d})^2\big) $$
In this section we will identify the linear predictor that minimizes the mean squared error. We will also find the variance of the error made by this best predictor.
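The mean squared error of any linear predictor depends on the joint distribution only through means, variances, and covariances. The sketch below illustrates this with made-up moments for $p = 2$; the numbers and the helper name `mse_linear` are assumptions for illustration, not part of the text. It uses the standard identity that a mean squared error equals the variance of the error plus the squared mean of the error.

```python
import numpy as np

# Hypothetical population moments (illustrative values only)
mu_X = np.array([1.0, -1.0])           # E(X), with p = 2
mu_Y = 3.0                             # E(Y)
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])       # covariance matrix of X
Sigma_XY = np.array([2.0, 1.0])        # Cov(X_i, Y) for i = 1, 2
var_Y = 5.0                            # Var(Y)

def mse_linear(c, d):
    """MSE of the predictor c^T X + d, computed from moments.

    E((Y - c^T X - d)^2) = Var(Y - c^T X) + (E(Y) - c^T E(X) - d)^2
    """
    c = np.asarray(c, dtype=float)
    error_variance = var_Y - 2 * c @ Sigma_XY + c @ Sigma_X @ c
    error_mean = mu_Y - c @ mu_X - d
    return error_variance + error_mean ** 2

# Predicting Y by the constant mu_Y gives MSE equal to Var(Y)
mse_constant = mse_linear([0.0, 0.0], mu_Y)

# One particular linear predictor that uses only the first coordinate of X
mse_other = mse_linear([0.5, 0.0], 2.5)
```

With these moments, `mse_constant` is 5 and `mse_other` is 4, so even a crude linear function of $X$ can improve on the constant predictor.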
A Linear Predictor
In the case of simple regression, we found the best linear predictor by using calculus to minimize the mean squared error over all slopes and intercepts. We could do the multivariable version of that calculation here. But because of the work we did in the case of one predictor, we will take a different approach.
We will guess the answer based on the answer in the case of simple regression, and then establish that our guess is correct.
In the case of simple regression, we wrote the regression equation in the form
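$$ \hat{Y} = \sigma_{Y,X}(\sigma_X^2)^{-1}(X - \mu_X) + \mu_Y $$
Now define
$$ \hat{Y}_b = \Sigma_{Y,X}\Sigma_X^{-1}(X - \mu_X) + \mu_Y = b^T(X - \mu_X) + \mu_Y $$
where
$$ b = \Sigma_X^{-1}\Sigma_{X,Y} $$
is the $p \times 1$ vector of the coefficients of the linear function.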
Clearly $\hat{Y}_b$ is a linear predictor of $Y$ based on $X$. We will show that it is the least squares linear predictor. The steps will follow those that we used to show that conditional expectation is the least squares predictor among all predictors.
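As a concrete illustration of the coefficient vector $b = \Sigma_X^{-1}\Sigma_{X,Y}$, here is a minimal sketch using hypothetical moments with $p = 2$. The numerical values and the function name `predict` are assumptions made for the example. The code solves the linear system $\Sigma_X b = \Sigma_{X,Y}$ instead of inverting $\Sigma_X$, which is a common numerical choice and gives the same $b$.

```python
import numpy as np

# Hypothetical population moments (illustrative values only)
mu_X = np.array([1.0, -1.0])           # E(X)
mu_Y = 3.0                             # E(Y)
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])       # covariance matrix of X (positive definite)
Sigma_XY = np.array([2.0, 1.0])        # Cov(X_i, Y) for i = 1, 2

# b solves Sigma_X b = Sigma_XY, that is, b = Sigma_X^{-1} Sigma_XY
b = np.linalg.solve(Sigma_X, Sigma_XY)

def predict(x):
    """Value of the predictor b^T (x - mu_X) + mu_Y at an observed x."""
    x = np.asarray(x, dtype=float)
    return b @ (x - mu_X) + mu_Y

at_mean = predict(mu_X)        # the predictor returns mu_Y when x = mu_X
elsewhere = predict([2.0, 0.0])
```

Here `b` works out to $(3/7, 2/7)$, and evaluating the predictor at $x = \mu_X$ returns $\mu_Y$.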
Projection
Notice that $E(\hat{Y}_b) = \mu_Y$. The predictor has the same mean as the variable being predicted.
Define the error in the prediction to be
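$$ W = Y - \hat{Y}_b $$
Then
$$ E(W) = 0 $$
We will now show that $W$ is uncorrelated with all linear combinations of elements of $X$.
$$
\begin{aligned}
Cov(W, a^TX) &= Cov(Y - \hat{Y}_b, a^TX) \\
&= Cov(Y, a^TX) - Cov(\hat{Y}_b, a^TX) \\
&= Cov(Y, a^TX) - Cov(b^TX, a^TX) \\
&= a^T\Sigma_{X,Y} - a^T\Sigma_X b \\
&= a^T\Sigma_{X,Y} - a^T\Sigma_X\Sigma_X^{-1}\Sigma_{X,Y} \\
&= 0
\end{aligned}
$$
Because $E(W) = 0$, we also have $E(Wa^TX) = Cov(W, a^TX) = 0$ for all $a$.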
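The calculation above can be checked numerically. The sketch below evaluates $a^T\Sigma_{X,Y} - a^T\Sigma_X b$ for several arbitrary directions $a$, using hypothetical moments with $p = 2$. The values, the seed, and the helper name `cov_error_with` are all assumptions for illustration.

```python
import numpy as np

# Hypothetical population moments (illustrative values only)
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])       # covariance matrix of X
Sigma_XY = np.array([2.0, 1.0])        # Cov(X_i, Y) for i = 1, 2

b = np.linalg.solve(Sigma_X, Sigma_XY)

def cov_error_with(a):
    """Cov(W, a^T X) = a^T Sigma_XY - a^T Sigma_X b, where W = Y - Y_hat_b."""
    a = np.asarray(a, dtype=float)
    return a @ Sigma_XY - a @ Sigma_X @ b

# Five arbitrary directions a, drawn with a fixed seed for reproducibility
rng = np.random.default_rng(0)
directions = rng.normal(size=(5, 2))
covariances = np.array([cov_error_with(a) for a in directions])
```

Every entry of `covariances` is zero up to floating-point rounding, whichever directions are drawn.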
Least Squares
To show that $\hat{Y}_b$ minimizes the mean squared error, start with an exercise: show that the best linear predictor must have the same mean as the variable being predicted. That is, show that the best linear predictor must have mean $\mu_Y$.
Once you have done that, you can restrict the search for the best linear predictor to all unbiased linear predictors. Define the generic one of these by
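$$ \hat{Y}_h = h^T(X - \mu_X) + \mu_Y $$
where $h$ is some $p \times 1$ vector of coefficients. Then
$$
\begin{aligned}
MSE(\hat{Y}_h) &= E\big((Y - \hat{Y}_h)^2\big) \\
&= E\big(((Y - \hat{Y}_b) + (\hat{Y}_b - \hat{Y}_h))^2\big) \\
&= E\big((Y - \hat{Y}_b)^2\big) + E\big((\hat{Y}_b - \hat{Y}_h)^2\big) + 2E\big((Y - \hat{Y}_b)(\hat{Y}_b - \hat{Y}_h)\big) \\
&= MSE(\hat{Y}_b) + E\big((\hat{Y}_b - \hat{Y}_h)^2\big) + 2E\big(W(b - h)^T(X - \mu_X)\big) \\
&= MSE(\hat{Y}_b) + E\big((\hat{Y}_b - \hat{Y}_h)^2\big) \\
&\ge MSE(\hat{Y}_b)
\end{aligned}
$$
The cross term vanishes because $E(W) = 0$ and $E(Wa^TX) = 0$ for every $a$, applied here with $a = b - h$.
Regression Equation and Predicted Values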
The least squares linear predictor is given by
$$ \hat{Y} = b^T(X - \mu_X) + \mu_Y = \Sigma_{Y,X}\Sigma_X^{-1}(X - \mu_X) + \mu_Y $$
This is the same as $\hat{Y}_b$. We are just dropping the subscript for convenience, now that we have established that it is the best linear predictor.
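The least squares property can be illustrated numerically. For an unbiased linear predictor with coefficient vector $h$, the excess mean squared error over the best predictor is $E\big((\hat{Y}_b - \hat{Y}_h)^2\big) = (b - h)^T\Sigma_X(b - h)$, which is never negative. The sketch below checks this for a few candidate vectors, using hypothetical moments with $p = 2$. The values, the candidates, and the helper name `mse_unbiased` are assumptions for the example.

```python
import numpy as np

# Hypothetical population moments (illustrative values only)
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])       # covariance matrix of X
Sigma_XY = np.array([2.0, 1.0])        # Cov(X_i, Y) for i = 1, 2
var_Y = 5.0                            # Var(Y)

b = np.linalg.solve(Sigma_X, Sigma_XY)

def mse_unbiased(h):
    """MSE of h^T (X - mu_X) + mu_Y, which equals Var(Y - h^T X)."""
    h = np.asarray(h, dtype=float)
    return var_Y - 2 * h @ Sigma_XY + h @ Sigma_X @ h

mse_best = mse_unbiased(b)

# A few arbitrary competing coefficient vectors
candidates = [np.array([0.0, 0.0]),
              np.array([0.5, 0.0]),
              np.array([1.0, -1.0])]

# Excess MSE of each candidate, computed two ways
excess = np.array([mse_unbiased(h) - mse_best for h in candidates])
gaps = np.array([(b - h) @ Sigma_X @ (b - h) for h in candidates])
```

The arrays `excess` and `gaps` agree, and every entry is positive because none of the candidates equals `b`.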
As we have seen above, the predictor is unbiased:
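$$ E(\hat{Y}) = E(Y) $$
The variance of the predicted values is
$$
\begin{aligned}
Var(\hat{Y}) &= b^T\Sigma_X b \\
&= \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_X\Sigma_X^{-1}\Sigma_{X,Y} \\
&= \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_{X,Y}
\end{aligned}
$$
Error Variance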
The error in the prediction is $W = Y - \hat{Y}$. Because $\hat{Y}$ is a linear function of $X$, we have
$$ 0 = Cov(W, \hat{Y}) = Cov(Y - \hat{Y}, \hat{Y}) = Cov(Y, \hat{Y}) - Var(\hat{Y}) $$
Therefore
$$ Cov(Y, \hat{Y}) = Var(\hat{Y}) $$
The variance of the error is
$$
\begin{aligned}
Var(W) &= Cov(Y - \hat{Y}, Y - \hat{Y}) \\
&= Var(Y) - 2Cov(Y, \hat{Y}) + Var(\hat{Y}) \\
&= Var(Y) - Var(\hat{Y}) \\
&= \sigma_Y^2 - \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_{X,Y}
\end{aligned}
$$
In the case of simple regression under the bivariate normal model, we saw that the error variance was
$$ \sigma_Y^2 - \sigma_{Y,X}(\sigma_X^2)^{-1}\sigma_{X,Y} $$
This is a special case of the more general formula that we have established here. The bivariate normal assumption isn’t needed.
As in the case of simple regression, we have made no assumption about the joint distribution of $Y$ and $X$ other than to say that $\Sigma_X$ is positive definite. Regardless, there is a unique best linear predictor of $Y$ based on $X$.
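To illustrate that none of these formulas relies on normality, the sketch below builds a deliberately non-normal data set and treats its empirical distribution as the joint distribution of $X$ and $Y$. The data-generating recipe, the seed, and the variable names are assumptions chosen for the example. Because the moments are computed from the same data, the identities for $Var(\hat{Y})$, $Var(W)$, and $Cov(Y, \hat{Y})$ should hold up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical non-normal data: skewed predictors, nonlinear signal, uniform noise
X = rng.exponential(size=(n, 2))
Y = X[:, 0] ** 2 - X[:, 1] + rng.uniform(-1, 1, size=n)

# Moments of the empirical distribution (ddof=0 matches np.var's default)
S = np.cov(np.column_stack([X, Y]), rowvar=False, ddof=0)
Sigma_X, Sigma_XY, var_Y = S[:2, :2], S[:2, 2], S[2, 2]

# Best linear predictor and its error
b = np.linalg.solve(Sigma_X, Sigma_XY)
Y_hat = (X - X.mean(axis=0)) @ b + Y.mean()
W = Y - Y_hat

# Formulas from the text
var_pred_formula = Sigma_XY @ b              # Sigma_YX Sigma_X^{-1} Sigma_XY
var_err_formula = var_Y - var_pred_formula   # sigma_Y^2 minus the line above

# Direct computations for comparison
var_pred_direct = Y_hat.var()
var_err_direct = W.var()
cov_Y_pred = np.cov(Y, Y_hat, ddof=0)[0, 1]
```

Comparing `var_pred_direct` with `var_pred_formula`, `var_err_direct` with `var_err_formula`, and `cov_Y_pred` with `var_pred_direct` shows agreement in each case, even though the relationship between $Y$ and $X$ here is not linear and nothing is normally distributed.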