11.3. Metropolis Algorithm#
The goal of Markov Chain Monte Carlo (MCMC) is to generate random samples from complicated high dimensional distributions about which we have incomplete information. For example, it might be that we don’t know the normalizing constant of the distribution, as we saw in the code breaking example of the previous section.
See More
Imagine a large state space
Now suppose we are interested in a particular probability distribution on this space. We are going to call this probability distribution
In the code-breaking setting, we are interested in the mode of
Since can’t compute
That is what MCMC does. The procedure relies on a few observations.
Suppose we can create a Markov Chain
that has our specified distribution as its stationary distribution. Then for large the distribution of will be approximately .Suppose we can construct such a chain. By the definition of a mode,
is most likely to be at or near the mode of , when is large. So by running the chain for a long time we will be able to identify the mode of .Creating a chain involves creating a transition matrix. To create a transition matrix that results in
as the stationary distribution, the easiest way is to try to ensure that the detailed balance equations are solved.The detailed balance equations are equivalent to
The right hand side only involves the transition probabilities of the chain that we want to create. The left hand side only involves ratios of the terms in
11.3.1. Metropolis Algorithm#
Exactly who proposed the first algorithm to create such a Markov Chain is the subject of some debate. A general version was proposed by Hastings. Here we will describe an earlier version attributed to Metropolis and co-authors in 1953.
The goal is to create a transition matrix
The algorithm starts with any symmetric, irreducible transition matrix
The algorithm then introduces additional randomization to create a new chain that is irreducible and aperiodic and has
Here are the rules that determine the transitions of the new chain.
Suppose the chain is at
at time , that is, suppose . Pick a state according to the proposal probability . This is the candidate state to which your chain might move.Define the acceptance ratio
If
, set .If
, toss a coin that lands heads with chance .If the coin lands heads, set
.If the coin lands tails, set
.
Repeat all the steps, with
as the starting value.
Thus the new chain either moves to the state picked according to
The new chain is irreducible because the proposal chain is irreducible. It is aperiodic because it can stay in place. So it has a steady state distribution.
The algorithm says that this steady state distribution is the same as the distribution
See More
11.3.2. How to Think About the Algorithm#
Before we prove that the algorithm works, let’s examine what it is doing in the context of decoders.
First notice that we are requiring
Fix any starting decoder and call it
To decide whether or not the chain should move to
The algorithm does this by comparing the acceptance ratio
If
, the likelihood of is at least as large that of , so you accept the proposal and move to .If
, the proposed decoder has less likelihood than the current , so it is tempting to stay at . But this risks the chain getting stuck at a local maximum. The algorithm provides a chance to avoid this, by tossing a biased coin. If the coin lands heads, the chain moves to even though has a lower likelihood than the current decoder . The idea is that from this new position there might be paths to decoders that have the highest likelihoods of all.
11.3.3. The Algorithm Works#
We will now show that the detailed balance equations are solved by the desired limit distribution
Take any two distinct states
11.3.3.1. Case 1: #
Then
Therefore
11.3.3.2. Case 2: #
Then
Now
Therefore
which is the same as
11.3.3.3. Case 2: #
Reverse the roles of
That’s it! A simple and brilliant idea that provides a solution to a difficult problem. In lab, you will see it in action when you implement the algorithm to decode text.