8.4. Additivity#

Calculating an expectation by plugging into the definition works in simple cases, but it can often be cumbersome or lack insight. The most powerful result for calculating expectation turns out not to be the definition. It looks rather innocuous:

8.4.1. Additivity of Expectation#

Let X and Y be two random variables defined on the same probability space. Then

$$E(X+Y) ~ = ~ E(X) + E(Y)$$

Before we look more closely at this result, note that we are assuming that all the expectations exist; we will assume this throughout the course.

And now note that there are no assumptions about the relation between X and Y. They could be dependent or independent. Regardless, the expectation of the sum is the sum of the expectations. This makes the result powerful.


Additivity follows easily from the definition of X+Y and the definition of expectation on the domain space. First note that the random variable X+Y is the function defined by

$$(X+Y)(\omega) ~ = ~ X(\omega) + Y(\omega) ~~~~ \text{for all } \omega \in \Omega$$

Thus a “value of X+Y weighted by the probability” can be written as

$$(X+Y)(\omega)P(\omega) ~ = ~ X(\omega)P(\omega) + Y(\omega)P(\omega)$$

Sum the two sides over all $\omega \in \Omega$ to prove additivity of expectation.
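
The proof can be checked computationally. Here is a small sketch (the outcome labels, values, and probabilities are hypothetical, chosen for illustration) that computes both sides of the additivity formula exactly on a four-outcome space where $X$ and $Y$ are dependent:

```python
# A small probability space; X and Y are dependent by construction.
# (Outcomes, values, and probabilities are hypothetical, for illustration.)
space = {
    # omega: (X(omega), Y(omega), P(omega))
    'a': (1, 1, 0.1),
    'b': (1, 3, 0.2),
    'c': (2, 3, 0.3),
    'd': (5, 0, 0.4),
}

E_X = sum(x * p for x, y, p in space.values())
E_Y = sum(y * p for x, y, p in space.values())
E_sum = sum((x + y) * p for x, y, p in space.values())

print(E_X, E_Y, E_sum)  # E_sum matches E_X + E_Y despite the dependence
```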

Quick Check

Let $X$ and $Y$ be random variables on the same space, with $E(X) = 5$ and $E(Y) = 3$.

(a) Find $E(X - Y)$.

(b) Find $E(2X - 8Y + 7)$.

By induction, additivity extends to any finite number of random variables. If $X_1, X_2, \ldots, X_n$ are random variables defined on the same probability space, then

$$E(X_1 + X_2 + \cdots + X_n) ~ = ~ E(X_1) + E(X_2) + \cdots + E(X_n)$$

regardless of the dependence structure of $X_1, X_2, \ldots, X_n$.

If you are trying to find an expectation, then the way to use additivity is to write your random variable as a sum of simpler variables whose expectations you know or can calculate easily.

8.4.2. $E(X^2)$ for a Poisson Variable $X$#

Let $X$ have the Poisson $(\mu)$ distribution. In earlier sections we showed that $E(X) = \mu$ and $E(X(X-1)) = \mu^2$.

Now $X^2 = X(X-1) + X$. The random variables $X(X-1)$ and $X$ are both functions of $X$, so they are not independent of each other. But additivity of expectation doesn’t require independence, so we can use it to see that

$$E(X^2) ~ = ~ E(X(X-1)) + E(X) ~ = ~ \mu^2 + \mu$$

We will use this fact later when we study the variability of X.

It is worth noting that it is not easy to calculate $E(X^2)$ directly, since

$$E(X^2) ~ = ~ \sum_{k=0}^{\infty} k^2 e^{-\mu} \frac{\mu^k}{k!}$$

is not an easy sum to simplify.
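
Though the sum is hard to simplify analytically, it is easy to check numerically. The sketch below truncates the infinite series at $k = 100$, where the Poisson tail is negligible for the (arbitrarily chosen) value of $\mu$, and compares the result to $\mu^2 + \mu$:

```python
from math import exp, factorial

mu = 3.7   # an arbitrary choice of the Poisson parameter, for illustration

# Truncate the infinite sum; the Poisson tail beyond k = 100 is negligible here
approx = sum(k**2 * exp(-mu) * mu**k / factorial(k) for k in range(101))
exact = mu**2 + mu

print(approx, exact)   # the truncated sum agrees with mu**2 + mu
```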

8.4.3. Sample Sum#

Let $X_1, X_2, \ldots, X_n$ be a sample drawn at random from a numerical population that has mean $\mu$, and let the sample sum be

$$S_n ~ = ~ X_1 + X_2 + \cdots + X_n$$

Then, regardless of whether the sample was drawn with or without replacement, each $X_i$ has the same distribution as the population. This is clearly true if the sampling is with replacement, and, as we saw in an earlier chapter, it is true by symmetry if the sampling is without replacement.

So, regardless of whether the sample is drawn with or without replacement, $E(X_i) = \mu$ for each $i$, and hence

$$E(S_n) ~ = ~ E(X_1) + E(X_2) + \cdots + E(X_n) ~ = ~ n\mu$$

We can use this to estimate a population mean based on a sample mean.
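
A simulation sketch of this result, using a small hypothetical population and sampling without replacement:

```python
import random

random.seed(0)
population = [1, 2, 2, 5, 8, 13, 21, 34]   # hypothetical numerical population
mu = sum(population) / len(population)     # population mean
n = 3
reps = 50_000

total = 0
for _ in range(reps):
    sample = random.sample(population, n)  # simple random sample, no replacement
    total += sum(sample)

print(total / reps, n * mu)   # the average sample sum is close to n * mu
```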

8.4.4. Unbiased Estimator#

Suppose a random variable X is being used to estimate a fixed numerical parameter θ. Then X is called an estimator of θ.

The bias of $X$ is the difference $E(X) - \theta$. The bias measures the amount by which the estimator exceeds the parameter, on average. The bias can be negative if the estimator tends to underestimate the parameter.

If the bias of an estimator is 0 then the estimator is called unbiased. So X is an unbiased estimator of θ if E(X)=θ.

If an estimator is unbiased, and you use it to generate estimates repeatedly and independently, then in the long run the average of all the estimates is equal to the parameter being estimated. On average, the unbiased estimator is neither higher nor lower than the parameter. That’s usually considered a good quality in an estimator.

In practical terms, if a data scientist wants to estimate an unknown parameter based on a random sample $X_1, X_2, \ldots, X_n$, the data scientist has to come up with a statistic to use as the estimator.

Recall from Data 8 that a statistic is a number computed from the sample. In other words, a statistic is a numerical function of $X_1, X_2, \ldots, X_n$.

Constructing an unbiased estimator of a parameter $\theta$ therefore amounts to finding a statistic $T = g(X_1, X_2, \ldots, X_n)$ for a function $g$ such that $E(T) = \theta$.

8.4.5. Unbiased Estimators of a Population Mean#

As in the sample sum example above, let $S_n$ be the sum of a sample $X_1, X_2, \ldots, X_n$ drawn at random from a population that has mean $\mu$. The standard statistical notation for the average of $X_1, X_2, \ldots, X_n$ is $\bar{X}_n$. So

$$\bar{X}_n ~ = ~ \frac{S_n}{n}$$

Then, regardless of whether the draws were made with replacement or without,

$$
\begin{align*}
E(\bar{X}_n) ~ &= ~ \frac{E(S_n)}{n} ~~~~~~ \text{(linear function rule)} \\
&= ~ \frac{n\mu}{n} ~~~~~~~~~ \text{(since } E(S_n) = n\mu \text{)} \\
&= ~ \mu
\end{align*}
$$

Thus the sample mean is an unbiased estimator of the population mean.

It is worth noting that $X_1$ by itself is also an unbiased estimator of $\mu$, since $E(X_1) = \mu$. So is $X_j$ for any fixed $j$, as is $(X_1 + X_9)/2$, or indeed any linear combination of the sample whose coefficients add up to 1.

But it seems clear that using the sample mean as the estimator is better than using just one sampled element, even though both are unbiased. This is true, and is related to how variable the estimators are. We will address this later in the course.
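
A simulation sketch makes the difference visible. Using a hypothetical population of the integers 1 through 100, both the single sampled element $X_1$ and the sample mean are centered at $\mu$, but the sample mean is far less spread out:

```python
import random

random.seed(1)
population = list(range(1, 101))   # hypothetical population; its mean is 50.5
n = 10
reps = 50_000

est_first, est_mean = [], []
for _ in range(reps):
    sample = random.sample(population, n)
    est_first.append(sample[0])        # estimator: just one sampled element
    est_mean.append(sum(sample) / n)   # estimator: the sample mean

def avg(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = avg(xs)
    return avg([(x - m) ** 2 for x in xs]) ** 0.5

print(avg(est_first), avg(est_mean))   # both near 50.5: both unbiased
print(sd(est_first), sd(est_mean))     # the sample mean is much less variable
```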

Quick Check

Let $X_1, X_2, X_3$ be i.i.d. Poisson $(\mu)$ random variables, and suppose the value of $\mu$ is unknown. Is $0.4X_1 + 0.2X_2 + 0.4X_3$ an unbiased estimator of $\mu$?


8.4.6. First Unbiased Estimator of a Maximum Possible Value#

Suppose we have a sample $X_1, X_2, \ldots, X_n$ drawn at random from $1, 2, \ldots, N$ for some fixed $N$, and we are trying to estimate $N$.

How can we use the sample to construct an unbiased estimator of N? By definition, such an estimator must be a function of the sample and its expectation must be N.

In other words, we have to construct a statistic that has expectation N.

Each $X_i$ has the uniform distribution on $1, 2, \ldots, N$. This is true for sampling with replacement as well as for simple random sampling, by symmetry.

The expectation of each of the uniform variables is $(N+1)/2$, as we have seen earlier. So if $\bar{X}_n$ is the sample mean, then

$$E(\bar{X}_n) ~ = ~ \frac{N+1}{2}$$

Clearly, $\bar{X}_n$ is not an unbiased estimator of $N$. That’s not surprising, because $N$ is the maximum possible value of each observation and $\bar{X}_n$ should be somewhere in the middle of all the possible values.

But because $E(\bar{X}_n)$ is a linear function of $N$, we can figure out how to create an unbiased estimator of $N$.

Remember that our job is to create a function of the sample $X_1, X_2, \ldots, X_n$ in such a way that the expectation of that function is $N$.

Start by inverting the linear function, that is, by isolating N in the equation above.

$$2E(\bar{X}_n) - 1 ~ = ~ N$$

This tells us what we have to do to the sample $X_1, X_2, \ldots, X_n$ to get an unbiased estimator of $N$.

We should just use the statistic $T_1 = 2\bar{X}_n - 1$ as the estimator. It is unbiased because $E(T_1) = N$ by the calculation above.
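
For small $N$ and $n$, the unbiasedness of $T_1$ can also be confirmed by exact enumeration of all the equally likely samples. A sketch with $N = 10$ and $n = 3$:

```python
from itertools import combinations
from fractions import Fraction

N, n = 10, 3
samples = list(combinations(range(1, N + 1), n))   # all possible SRS outcomes

# Average T1 = 2 * (sample mean) - 1 over all samples, in exact arithmetic
mean_t1 = sum(2 * Fraction(sum(s), n) - 1 for s in samples) / len(samples)
print(mean_t1)   # exactly N
```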

Quick Check

In the setting above, what is the bias of $2\bar{X}_n$ as an estimator of $N$? Does it tend to overestimate on average, or underestimate?

8.4.7. Second Unbiased Estimator of the Maximum Possible Value#

The calculation above stems from a problem the Allied forces faced in World War II. Germany had a seemingly never-ending fleet of Panzer tanks, and the Allies needed to estimate how many they had. They decided to base their estimates on the serial numbers of the tanks that they saw.

Here is a picture of one from Wikipedia.

Panzer Tank

Notice the serial number on the top left. When tanks were disabled or destroyed, it was discovered that their parts had serial numbers too. The ones from the gear boxes proved very useful.

The idea was to model the observed serial numbers as random draws from 1,2,,N and then estimate N. This is of course a very simplified model of reality. But estimates based on even such simple probabilistic models proved to be quite a bit more accurate than those based on the intelligence gathered by the Allies. For example, in August 1942, intelligence estimates were that Germany was producing 1,550 tanks per month. The prediction based on the probability model was 327 per month. After the war, German records showed that the actual production rate was 342 per month.

The model was that the draws were made at random without replacement from the integers 1 through N.

In the example above, we constructed the random variable $T_1$ to be an unbiased estimator of $N$ under this model.

The Allied statisticians instead started with M, the sample maximum:

$$M ~ = ~ \max\{X_1, X_2, \ldots, X_n\}$$

The sample maximum M is a biased estimator of N, because we know that its value is always less than or equal to N. Its average value therefore will be somewhat less than N.

To correct for this, the Allied statisticians imagined a row of N spots for the serial numbers 1 through N, with marks at the spots corresponding to the observed serial numbers. The visualization below shows an outcome in the case N=20 and n=3.

gaps

  • There are N=20 spots in all.

  • From these, we take a simple random sample of size n=3. Those are the gold spots.

  • The remaining $N - n = 17$ spots are colored blue.

The n=3 sampled spots create n+1=4 blue “gaps” between sampled values: one before the leftmost gold spot, two between successive gold spots, and one after the rightmost gold spot that is at position M.

A key observation is that because of the symmetry of simple random sampling, the lengths of all four gaps have the same distribution.

But of course we don’t get to see all the gaps. In the sample, we can see all but the last gap, as in the figure below. The red question mark reminds you that the gap to the right of M is invisible to us.

mystery gap

If we could see the gap to the right of M, we would see N. But we can’t. So we can try to do the next best thing, which is to augment M by the estimated size of that gap.

Since we can see all of the spots and their colors up to and including M, we can see n out of the n+1 gaps. The lengths of the gaps all have the same distribution by symmetry, so we can estimate the length of a single gap by the average length of all the gaps that we can see.

We can see $M$ spots, of which $n$ are the sampled values. So the total length of all $n$ visible gaps is $M - n$. Therefore

$$\text{estimated length of one gap} ~ = ~ \frac{M - n}{n}$$

So the Allied statisticians decided to improve upon M by using the augmented maximum as their estimator:

$$T_2 ~ = ~ M + \frac{M - n}{n}$$

By algebra, this estimator can be rewritten as

$$T_2 ~ = ~ M \cdot \frac{n+1}{n} ~ - ~ 1$$

Is T2 an unbiased estimator of N? To answer this, we have to find its expectation. Since T2 is a linear function of M, we’ll find the expectation of M first.

Here once again is the visualization of what’s going on.

gaps

Let $G$ be the length of the last gap. Then $M = N - G$.

There are $n+1$ gaps, made up of the $N - n$ unsampled values. Since they all have the same expected length,

$$E(G) ~ = ~ \frac{N - n}{n+1}$$

So

$$E(M) ~ = ~ N - \frac{N - n}{n+1} ~ = ~ \frac{(N+1)n}{n+1}$$

Recall that the Allied statisticians’ estimate of N is

$$T_2 ~ = ~ M \cdot \frac{n+1}{n} ~ - ~ 1$$

Now

$$E(T_2) ~ = ~ E(M) \cdot \frac{n+1}{n} - 1 ~ = ~ \frac{(N+1)n}{n+1} \cdot \frac{n+1}{n} - 1 ~ = ~ N$$

Thus the augmented maximum T2 is an unbiased estimator of N.
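
As a computational check, for small $N$ and $n$ we can average $T_2$ over all equally likely samples in exact arithmetic; the result is $N$ exactly. A sketch with $N = 10$ and $n = 3$:

```python
from itertools import combinations
from fractions import Fraction

N, n = 10, 3
samples = list(combinations(range(1, N + 1), n))   # all possible SRS outcomes

# T2 = M * (n+1)/n - 1, where M is the sample maximum
mean_t2 = sum(Fraction(max(s) * (n + 1), n) - 1 for s in samples) / len(samples)
print(mean_t2)   # exactly N
```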

Quick Check

A gardener in Berkeley has 23 blue flower pots in a row. She picks a simple random sample of 5 of them and colors the selected pots gold. What is the expected number of blue flower pots at the end of the row?

8.4.8. Which Estimator to Use?#

The Allied statisticians thus had two unbiased estimators of N from which to choose. They went with T2 instead of T1 because T2 has less variability.

We will quantify this later in the course. For now, here is a simulation of distributions of the two estimators in the case N=300 and n=30. The simulation is based on 5000 repetitions of drawing a simple random sample of size 30 from the integers 1 through 300.
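
Here is a sketch of such a simulation (this code is a reconstruction, not the book’s own):

```python
import random

random.seed(3)
N, n, reps = 300, 30, 5000

t1, t2 = [], []
for _ in range(reps):
    sample = random.sample(range(1, N + 1), n)   # SRS from 1, 2, ..., N
    t1.append(2 * sum(sample) / n - 1)           # T1 = 2 * (sample mean) - 1
    t2.append(max(sample) * (n + 1) / n - 1)     # T2 = augmented maximum

def avg(xs):
    return sum(xs) / len(xs)

def sd(xs):
    m = avg(xs)
    return avg([(x - m) ** 2 for x in xs]) ** 0.5

print(avg(t1), avg(t2))   # both centered near N = 300
print(sd(t1), sd(t2))     # T2 has far less spread than T1
```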

[Figure: empirical histograms of the two estimators $T_1$ and $T_2$]

You can see why T2 is a better estimator than T1.

  • Both are unbiased, so both empirical histograms are balanced at around 300, the true value of $N$.

  • The empirical distribution of $T_2$ is clustered much closer to the true value 300 than the empirical distribution of $T_1$.

For a recap, take another look at the accuracy of the Allied statisticians’ estimates based on $T_2$, reported earlier in this section. Not bad for an estimator based on a model that assumes nothing more complicated than simple random sampling!