Best Linear Predictor

Let Y and the p×1 vector X be jointly distributed, and suppose you are trying to predict Y based on a linear function of X. For the predictor

$$
\hat{Y}_{c,d} = c^T X + d
$$

the mean squared error of prediction is

$$
MSE(\hat{Y}_{c,d}) = E\big((Y - \hat{Y}_{c,d})^2\big)
$$

In this section we will identify the linear predictor that minimizes the mean squared error. We will also find the variance of the error made by this best predictor.
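As a concrete illustration of the quantity being minimized, here is a minimal sketch of the sample analogue of the mean squared error, in which the expectation is replaced by an average over observed rows. The function name `empirical_mse` and the toy data are hypothetical, chosen only for this example.

```python
import numpy as np

def empirical_mse(c, d, X, y):
    """Average squared error of the linear predictor c^T x + d over the rows of X."""
    predictions = X @ c + d
    return np.mean((y - predictions) ** 2)

# Hypothetical toy data: 4 observations of p = 2 predictors.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])
y = np.array([3.0, 7.0, 11.0, 15.0])   # here y happens to equal x1 + x2 exactly

# The predictor with c = (1, 1), d = 0 reproduces y, so its error is zero.
mse_exact = empirical_mse(np.array([1.0, 1.0]), 0.0, X, y)

# Shifting the intercept by 1 makes every prediction off by exactly 1.
mse_shifted = empirical_mse(np.array([1.0, 1.0]), 1.0, X, y)

print(mse_exact, mse_shifted)
```

Different choices of the coefficient vector and intercept give different errors; the rest of the section identifies the choice that makes the population version of this quantity as small as possible.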

A Linear Predictor

In the case of simple regression, we found the best linear predictor by using calculus to minimize the mean squared error over all slopes and intercepts. We could do the multivariable version of that calculation here. But because of the work we did in the case of one predictor, we will take a different approach.

We will guess the answer based on the answer in the case of simple regression, and then establish that our guess is correct.

In the case of simple regression, we wrote the regression equation in the form

$$
\hat{Y} = \sigma_{Y,X}(\sigma_X^2)^{-1}(X - \mu_X) + \mu_Y
$$

Now define

$$
\hat{Y}_b = \Sigma_{Y,X}\Sigma_X^{-1}(X - \mu_X) + \mu_Y = b^T(X - \mu_X) + \mu_Y
$$

where

$$
b = \Sigma_X^{-1}\Sigma_{X,Y}
$$

is the p×1 vector of the coefficients of the linear function.
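To make the coefficient vector concrete, the following sketch computes it with NumPy from a covariance matrix of the predictors and a vector of covariances between the predictors and the response. All numerical values, and the helper name `predict`, are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical population moments for p = 2 predictors (illustrative values only).
mu_X = np.array([1.0, -2.0])          # mean vector of X
mu_Y = 5.0                            # mean of Y
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])      # covariance matrix of X, positive definite
Sigma_XY = np.array([2.0, 1.0])       # covariances Cov(X_i, Y)

# b = Sigma_X^{-1} Sigma_XY, computed by solving Sigma_X b = Sigma_XY
# rather than forming the inverse explicitly.
b = np.linalg.solve(Sigma_X, Sigma_XY)

def predict(x):
    """Value of b^T (x - mu_X) + mu_Y at a point x."""
    return b @ (x - mu_X) + mu_Y

print(b)              # the coefficients, here 3/7 and 2/7
print(predict(mu_X))  # at x = mu_X the prediction is mu_Y
```

Solving the linear system is a design choice made here for numerical stability; it gives the same vector as multiplying by the inverse matrix.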

Clearly $\hat{Y}_b$ is a linear predictor of $Y$ based on $X$. We will show that it is the least squares linear predictor. The steps will follow those that we used to show that conditional expectation is the least squares predictor among all predictors.

Projection

Notice that $E(\hat{Y}_b) = \mu_Y$, because $E(X - \mu_X) = 0$. The predictor has the same mean as the variable being predicted.

Define the error in the prediction to be

$$
W = Y - \hat{Y}_b
$$

Then

E(W) = 0

We will now show that W is uncorrelated with all linear combinations of elements of X.

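$$
\begin{align*}
Cov(W, a^T X) &= Cov(Y - \hat{Y}_b, a^T X) \\
&= Cov(Y, a^T X) - Cov(\hat{Y}_b, a^T X) \\
&= Cov(Y, a^T X) - Cov(b^T X, a^T X) \\
&= a^T \Sigma_{X,Y} - a^T \Sigma_X b \\
&= a^T \Sigma_{X,Y} - a^T \Sigma_X \Sigma_X^{-1} \Sigma_{X,Y} \\
&= 0
\end{align*}
$$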

Because $E(W) = 0$, we also have $E(W a^T X) = Cov(W, a^T X) = 0$ for all $a$.
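The zero covariance can be checked numerically. The sketch below treats a simulated sample as the whole distribution, so the sample means and sample covariances play the role of the population moments. The data-generating choices (exponential predictors, a nonlinear response) are hypothetical and deliberately non-normal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3

# Hypothetical, deliberately non-normal data: exponential predictors and a
# nonlinear relationship between Y and X.
X = rng.exponential(size=(n, p))
y = X[:, 0] ** 2 - np.sin(X[:, 1]) + rng.uniform(-1, 1, size=n)

# Sample covariance matrix of (X, Y), standing in for the population moments.
S = np.cov(np.column_stack([X, y]), rowvar=False)
Sigma_X, Sigma_XY = S[:p, :p], S[:p, p]

# Coefficients and predicted values, using sample means for mu_X and mu_Y.
b = np.linalg.solve(Sigma_X, Sigma_XY)
y_hat = (X - X.mean(axis=0)) @ b + y.mean()
W = y - y_hat                                  # prediction error

a = rng.normal(size=p)                         # an arbitrary coefficient vector a
cov_W_aX = np.cov(W, X @ a)[0, 1]

print(W.mean(), cov_W_aX)                      # both are zero up to rounding error
```

The check passes even though the relationship between the response and the predictors is far from linear, which is consistent with the derivation using nothing beyond means and covariances.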

Least Squares

To show that $\hat{Y}_b$ minimizes the mean squared error, start with an exercise: show that the best linear predictor must have the same mean as the variable being predicted. That is, show that the best linear predictor must have mean $\mu_Y$.

Once you have done that, you can restrict the search for the best linear predictor to all unbiased linear predictors. Define the generic one of these by

$$
\hat{Y}_h = h^T(X - \mu_X) + \mu_Y
$$

where h is some p×1 vector of coefficients. Then

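$$
\begin{align*}
MSE(\hat{Y}_h) &= E\big((Y - \hat{Y}_h)^2\big) \\
&= E\big(((Y - \hat{Y}_b) + (\hat{Y}_b - \hat{Y}_h))^2\big) \\
&= E\big((Y - \hat{Y}_b)^2\big) + E\big((\hat{Y}_b - \hat{Y}_h)^2\big) + 2E\big((Y - \hat{Y}_b)(\hat{Y}_b - \hat{Y}_h)\big) \\
&= MSE(\hat{Y}_b) + E\big((\hat{Y}_b - \hat{Y}_h)^2\big) + 2E\big(W(b - h)^T(X - \mu_X)\big) \\
&= MSE(\hat{Y}_b) + E\big((\hat{Y}_b - \hat{Y}_h)^2\big) \\
&\ge MSE(\hat{Y}_b)
\end{align*}
$$

The cross term is zero because $E(W) = 0$ and $E(W a^T X) = 0$ for every $a$, as shown in the Projection section. The final inequality holds because the remaining expectation is that of a square.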
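The comparison can be illustrated numerically. The sketch below uses hypothetical population moments and the expansion of the mean squared error of an unbiased linear predictor in terms of those moments. It checks that each competitor's excess error equals the quadratic form $(b - h)^T \Sigma_X (b - h)$, which is what $E((\hat{Y}_b - \hat{Y}_h)^2)$ works out to.

```python
import numpy as np

# Hypothetical population moments (illustrative values only).
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])      # covariance matrix of X
Sigma_XY = np.array([2.0, 1.0])       # covariances Cov(X_i, Y)
var_Y = 3.0                           # variance of Y

def mse(h):
    """MSE of the unbiased predictor h^T (X - mu_X) + mu_Y.

    Expanding E((Y - mu_Y - h^T (X - mu_X))^2) gives
    Var(Y) - 2 h^T Sigma_XY + h^T Sigma_X h.
    """
    return var_Y - 2 * h @ Sigma_XY + h @ Sigma_X @ h

b = np.linalg.solve(Sigma_X, Sigma_XY)

# Five arbitrary competing coefficient vectors h.
rng = np.random.default_rng(1)
H = rng.normal(size=(5, 2))

gaps = np.array([mse(h) - mse(b) for h in H])                 # excess MSE
quad = np.array([(b - h) @ Sigma_X @ (b - h) for h in H])     # (b-h)^T Sigma_X (b-h)

print(gaps)
print(quad)
```

Because the covariance matrix of the predictors is positive definite, the quadratic form is strictly positive whenever the competitor differs from the best coefficient vector, so no other unbiased linear predictor ties with it.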

Regression Equation and Predicted Values

The least squares linear predictor is given by

$$
\hat{Y} = b^T(X - \mu_X) + \mu_Y = \Sigma_{Y,X}\Sigma_X^{-1}(X - \mu_X) + \mu_Y
$$

This is the same as $\hat{Y}_b$. We are just dropping the subscript for convenience, now that we have established that it is the best linear predictor.

As we have seen above, the predictor is unbiased:

$$
E(\hat{Y}) = E(Y)
$$

The variance of the predicted values is

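$$
\begin{align*}
Var(\hat{Y}) &= b^T \Sigma_X b \\
&= \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_X\Sigma_X^{-1}\Sigma_{X,Y} \\
&= \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_{X,Y}
\end{align*}
$$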

Error Variance

The error in the prediction is $W = Y - \hat{Y}$. Because $\hat{Y}$ is a linear function of $X$, and $W$ is uncorrelated with every linear function of $X$, we have

$$
0 = Cov(W, \hat{Y}) = Cov(Y - \hat{Y}, \hat{Y}) = Cov(Y, \hat{Y}) - Var(\hat{Y})
$$

Therefore

$$
Cov(Y, \hat{Y}) = Var(\hat{Y})
$$

The variance of the error is

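$$
\begin{align*}
Var(W) &= Cov(Y - \hat{Y}, Y - \hat{Y}) \\
&= Var(Y) - 2Cov(Y, \hat{Y}) + Var(\hat{Y}) \\
&= Var(Y) - Var(\hat{Y}) \\
&= \sigma_Y^2 - \Sigma_{Y,X}\Sigma_X^{-1}\Sigma_{X,Y}
\end{align*}
$$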
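As a small numerical illustration of the variance of the predicted values and the variance of the error, the sketch below evaluates both from hypothetical population moments. The specific numbers are assumptions made for this example.

```python
import numpy as np

# Hypothetical population moments (illustrative values only).
Sigma_X = np.array([[4.0, 1.0],
                    [1.0, 2.0]])      # covariance matrix of X
Sigma_XY = np.array([2.0, 1.0])       # covariances Cov(X_i, Y)
var_Y = 3.0                           # variance of Y

b = np.linalg.solve(Sigma_X, Sigma_XY)

var_pred = b @ Sigma_X @ b            # Var(Y_hat) = b^T Sigma_X b
var_pred_alt = Sigma_XY @ b           # Sigma_{Y,X} Sigma_X^{-1} Sigma_{X,Y}
var_error = var_Y - var_pred          # Var(W) = Var(Y) - Var(Y_hat)

print(var_pred, var_pred_alt, var_error)
```

For these values the variance of Y, which is 3, splits into 8/7 for the predicted values and 13/7 for the error.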

In the case of simple regression under the bivariate normal model, we saw that the error variance was

$$
\sigma_Y^2 - \sigma_{Y,X}(\sigma_X^2)^{-1}\sigma_{X,Y}
$$

This is a special case of the more general formula that we have established here. The bivariate normal assumption isn’t needed.
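The reduction to the one-predictor case can be seen by treating the variance of the single predictor as a 1×1 covariance matrix. The sketch below uses hypothetical values for the variances and the covariance.

```python
import numpy as np

# Hypothetical moments for a single predictor X and response Y.
var_X, cov_XY, var_Y = 4.0, 2.0, 3.0

# Simple-regression form of the error variance.
scalar_form = var_Y - cov_XY * (1 / var_X) * cov_XY

# General form with p = 1: Sigma_X is a 1x1 matrix, Sigma_XY a length-1 vector.
Sigma_X = np.array([[var_X]])
Sigma_XY = np.array([cov_XY])
matrix_form = var_Y - Sigma_XY @ np.linalg.solve(Sigma_X, Sigma_XY)

print(scalar_form, matrix_form)   # both equal 2.0 for these values
```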

As in the case of simple regression, we have made no assumption about the joint distribution of $Y$ and $X$ other than to say that $\Sigma_X$ is positive definite. Regardless, there is a unique best linear predictor of $Y$ based on $X$.