Conditional Distributions
To understand the relation between two variables you must examine the conditional behavior of each of them given the value of the other. Towards this goal, we will start with some theory and then illustrate it by an example.
Recall the division rule $$ P(B \mid A) = \frac{P(AB)}{P(A)} $$ and let's see what it looks like in the context of random variables.
Let $X$ and $Y$ be two random variables defined on the same space. If $x$ is a possible value of $X$ and $y$ a possible value of $Y$, then $$ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P(X = x)} $$
Therefore, for a fixed value $x^*$ of $X$, the conditional distribution of $Y$, given $X = x^*$, is the collection of probabilities $$ P(Y = y \mid X = x^*) = \frac{P(X = x^*, Y = y)}{P(X = x^*)} $$ where $y$ ranges over all the values of $Y$. Keep in mind that $y$ represents values of the variable here. The value $x^*$ is the particular value of $X$ that was observed; it is a constant.
The Probabilities in a Conditional Distribution Sum to 1
In a distribution, the probabilities have to sum to 1. To see that this is true for the conditional distribution defined above, start by using the fundamental rule: find $P(X = x^*)$ by partitioning the event $\{ X = x^* \}$ according to the values of $Y$:
$$ P(X = x^*) = \sum_{\text{all }y} P(X = x^*, Y = y) $$

Now let's sum the probabilities in the conditional distribution of $Y$ given $X = x^*$, and see if the sum is 1.
\begin{align*} \sum_{\text{all }y} P(Y = y \mid X = x^*) &= \sum_{\text{all }y} \frac{P(X = x^*, Y = y)}{P(X = x^*)} \\ \\ &= \frac{1}{P(X = x^*)} \sum_{\text{all }y} P(X = x^*, Y = y) \\ &= \frac{1}{P(X = x^*)} \cdot P(X = x^*) \\ &= 1 \end{align*}

Example: A Randomly Picked Coin
Suppose I have one fair coin and three biased coins. Suppose that each of the biased coins lands heads with chance 0.9. I pick one of the coins at random, toss it twice and tell you the number of heads.
The goal of this example is to see how the information about the number of heads affects your opinion about which coin was picked.
Let $R$ be the probability with which the random coin lands heads. The possible values of $R$ are 0.5 and 0.9, and the probability distribution of $R$ is given by the following table.
Table().values([0.5, 0.9]).probability([0.25, 0.75])
Let $H$ be the number of heads in the two tosses. Then the joint distribution of $R$ and $H$ consists of terms of the form $$ P(R = r, H = h) ~~ \text{where } r \in \{0.5, 0.9\} \text{ and } h \in \{ 0, 1, 2 \} $$
There are six such terms. Let's work out two of them directly, and then we will use the prob140 library to calculate the entire joint distribution. By the multiplication rule,

$$ P(R = 0.9, H = 2) = P(R = 0.9)P(H = 2 \mid R = 0.9) = 0.75 \times 0.9^2 = 0.6075 $$

$$ P(R = 0.5, H = 1) = P(R = 0.5)P(H = 1 \mid R = 0.5) = 0.25 \times 2 \times 0.5 \times 0.5 = 0.125 $$

We have used the fact that $\{H = 1\}$ happens if either there is a head followed by a tail or a tail followed by a head.
To get the prob140 methods values and probability_function to do the rest of the work, we will start by listing the possible values.
- $R$ has possible values 0.5 and 0.9
- $H$ has possible values 0, 1, and 2
The syntax below creates the possible values of the pair of variables and puts them in a table. Now that there are two sets of possible values, you precede each set by the name of the corresponding random variable.
joint_tbl = Table().values('R', [0.5, 0.9], 'H', [0, 1, 2])
joint_tbl
That's the outcome space. It consists of the six possible pairs. Now for the probabilities in the joint distribution, let's define a function that computes them.
def joint_probs(r, h):
    """Return P(R = r, H = h)"""

    # The distribution of the number of heads in two tosses
    # of a coin that lands heads with chance r
    heads_2_tosses = make_array((1-r)**2, 2*r*(1-r), r**2)

    if r == 0.5:
        return 0.25*heads_2_tosses.item(h)
    elif r == 0.9:
        return 0.75*heads_2_tosses.item(h)
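As a quick sanity check (a sketch, assuming the cell above has been run so that joint_probs is defined), the function reproduces the two probabilities we worked out by hand:

# Compare with the multiplication-rule calculations above
print(joint_probs(0.9, 2))    # 0.6075, up to floating-point rounding
print(joint_probs(0.5, 1))    # 0.125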
We now use the probability_function method on joint_tbl to create all the joint probabilities:
joint_tbl = joint_tbl.probability_function(joint_probs)
joint_tbl
The final step is to convert this to a conventional joint distribution table.
joint_dist = joint_tbl.toJoint()
joint_dist
The values of $P(R = 0.9, H = 2)$ and $P(R = 0.5, H = 1)$ agree with those that we calculated directly.
Let's check that the marginal of $R$ agrees with the assumption that the coin is picked at random from one fair coin and three biased coins.
joint_dist.marginal('R')
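As a cross-check (a sketch reusing the joint_probs function defined above rather than prob140), summing the joint probabilities over the values of $h$ gives the same marginal:

# Marginal of R: sum the joint probabilities over all values of h
print(sum(joint_probs(0.5, h) for h in [0, 1, 2]))    # 0.25
print(sum(joint_probs(0.9, h) for h in [0, 1, 2]))    # 0.75, up to floating-point rounding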
Now suppose I tell you that I picked the fair coin. Then the $R=0.9$ column is irrelevant. We can think of our original outcome space being reduced to the three cells in the R=0.5 column, and all probabilities being renormalized relative to the total probability $P(R = 0.5) = 0.25$ in that column.

Similarly, if I told you I had picked one of the biased coins, you would restrict your calculations to the R=0.9 column.
The conditional_dist method does all of this for you. For the conditional distributions of $H$ given values of $R$, use 'H' as the first argument and 'R' as the second.
joint_dist.conditional_dist('H', 'R')
To check that you understand what has been calculated, try calculating a few of these values directly. For example, $$ P(H = 2 \mid R = 0.9) = \frac{0.6075}{0.75} = 0.81 $$ is the chance that you get two heads given that you tossed a biased coin. This makes sense, as the biased coins land heads with chance 0.9 and therefore have chance $0.9^2 = 0.81$ of producing two heads.
You can see that the first column of this table contains the distribution of the number of heads in two tosses of a fair coin. The second column contains the distribution of the number of heads in two tosses of a coin that lands heads with chance 0.9. And the last column contains the distribution of the number of heads in two tosses of a coin picked at random from our four coins.
The three distributions are different. Knowing which coin was picked changes the distribution of the number of heads.
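Here is one more direct check, done with plain arithmetic rather than prob140 (a sketch; the joint probabilities are the ones implied by the setup, such as $P(R = 0.9, H = 2) = 0.75 \times 0.81 = 0.6075$). Dividing the $R = 0.9$ column of the joint distribution by $P(R = 0.9)$ produces the second of the three distributions just described, and its probabilities sum to 1, as shown at the start of this section.

# Conditional distribution of H given R = 0.9:
# divide each joint probability P(R = 0.9, H = h) by the marginal P(R = 0.9)
joint_R_biased = [0.0075, 0.135, 0.6075]          # P(R = 0.9, H = h) for h = 0, 1, 2
cond_H_given_biased = [p / 0.75 for p in joint_R_biased]

print(cond_H_given_biased)        # approximately [0.01, 0.18, 0.81]
print(sum(cond_H_given_biased))   # 1.0, up to floating-point rounding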
Our original problem was the other way around: given the value of $H$, what can we say about $R$?
joint_dist.conditional_dist('R', 'H')
Read this table from the bottom row upwards. Remember that the coin was randomly picked.
- If I tell you nothing about the number of heads, all you can tell me about the coin is that there is 25% chance that it is fair and 75% chance that it is a biased coin.
- If I tell you that I got 0 heads, your opinion about the coin changes dramatically in favor of the fair coin. The biased coins have a very small chance of producing no heads, so even though one of them was very likely to have been picked, the data tilt the balance in favor of the fair coin.
- If I tell you that I got 1 head, you're in a bit of a quandary. The biased coins have a modest chance (18%) of producing 1 head compared to the 50% chance that the fair coin produces 1 head. But the coin picked had a 75% chance of being biased compared to a 25% chance of being fair. The size of these two effects makes you quite uncertain about which kind of coin to lean towards. You have only a slight lean towards 'biased', as the quick calculation after this list confirms.
- If I tell you that I got 2 heads, your opinion shifts dramatically towards the biased coins. Not only are they very likely to produce two heads, they are also very likely to have been picked.
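Here is the 1-head calculation referred to in the list above, done by hand (a sketch in plain Python; the numbers come from the setup of the example). It is just the division rule applied to the event $\{H = 1\}$: multiply each prior probability by the corresponding chance of 1 head, then divide by the total.

# P(R = r | H = 1) for the two possible values of r
prior_fair, prior_biased = 0.25, 0.75      # chances of picking the fair coin / a biased coin

like_fair = 2 * 0.5 * 0.5      # P(H = 1 | R = 0.5)
like_biased = 2 * 0.9 * 0.1    # P(H = 1 | R = 0.9)

p_one_head = prior_fair * like_fair + prior_biased * like_biased    # P(H = 1)

print(prior_fair * like_fair / p_one_head)        # about 0.48: P(R = 0.5 | H = 1)
print(prior_biased * like_biased / p_one_head)    # about 0.52: the slight lean towards 'biased'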
Updating Your Opinion Based on Data
This is a simple example of something that comes up often in machine learning.
- You start out with a prior opinion about an unknown quantity. In our case the prior distribution was that the fair coin would be picked with chance 25%.
- For every value of the unknown quantity, the data have a likelihood. For each of our four coins, we know the likelihood of getting any specified number of heads given that we tossed that coin.
- After you see the data, your opinion about the unknown quantity might change, sometimes quite drastically. The change depends on the prior as well as the likelihood, as the short sketch after this list shows for our coins.
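To make the prior-likelihood-posterior pattern concrete, here is a minimal sketch of the update for our coin example. It is not part of prob140; it uses NumPy directly, and the function name posterior is just for illustration.

import numpy as np

def posterior(prior, likelihood):
    """Update a prior distribution using the likelihood of the observed data:
    multiply term by term, then divide by the sum so the result is a distribution."""
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

# Prior: P(R = 0.5) = 0.25, P(R = 0.9) = 0.75
prior = np.array([0.25, 0.75])

# Likelihood of seeing 2 heads, for each value of R
likelihood_two_heads = np.array([0.5**2, 0.9**2])

print(posterior(prior, likelihood_two_heads))    # approximately [0.093, 0.907]

Compare with the conditional distribution of $R$ given $H = 2$ in the table above: the prior 75% chance of a biased coin rises to about 91% after seeing two heads.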