MCMC methods are frequently used in machine learning to approximate expectations of a complicated probability distributions. This guide introduces Metropolis-Hastings MCMC, shows a demo implementation, and provides comparisons with other methods.
All of the algorithms in this class consist of two basic steps:
This concept was first formalized by Nicholas Metropolis et al. in 1952 for the specific case of the Boltzmann distribution. The Metropolis method works on symmetric distributions and was later generalized for arbitrary distributions regardless of symmetry by W.K. Hastings in 1970. The resulting algorithm is referred to as Metropolis-Hastings Markov chain Monte Carlo.
There are a variety of MCMC methods, each with its own merits. Gibbs sampling is a particularly popular method because easier to implement than Metropolis-Hastings, may run faster, and requires no tuning. However, it requires that the conditional distribution of each variable be known. Since this is not always the case, the Metropolis-Hastings algorithm is often used instead.
This joint distribution is formed by taking the product of two functions. $f(\theta)$: The prior distribution representing the underlying probability distribution from which the data is believed to arise. The second quantity is the likelihood function for the data, $L(\theta|D)$ which incorporates the effect of the data on the model. Since $$L(\theta | D) = P(D|\theta)$$ We write the likelihood function as $P(D|\theta)$.
Using the prior and the likelihood we can form a full probability model as $$P(D, \theta) = P(D|\theta) P(\theta)$$
It is possible to obtain the posterior distribution by applying Bayes Rule to $P(D|\theta)$ $$P(\theta|D) = \frac{P(\theta)P(D|\theta)}{P(D)}$$
Where $P(D) = \int P(D, \theta)d\theta = \int P(D|\theta) P(\theta) d\theta$. So we obtain $$P(\theta|D) = \frac{P(\theta)P(D|\theta)}{\int P(\theta) P(D|\theta)d \theta}$$ This describes the probability that the parameters of interest assume particular values given the observed data. Now the goal becomes finding the expectation of a function of $\theta$, $E(f(\theta))$.
Using the posterior distribution, we can write this as $$E(f(\theta) | D) = \frac{\int f(\theta)P(\theta)P(D|\theta) d \theta}{\int P(\theta) P(D|\theta)d \theta}$$
However, the exact computation of this expectation is often impossible for real world data. Among other methods, Markov Chain Monte Carlo offers a way of numerically approximating this quantity.
The denominator of this formula is the normalization constant which is required to ensure that $E(f(x))$ is a probability density function. Recall that a probability density function has the property that the the probability of the entire sample space is $1$. Often this value is not actually known and is too difficult to compute. Often we can only calculate this expectation up to a constant of proportionality. $$E(f(x)) \propto \int f(x) \pi (x) dx$$ The amazing thing is that with Markov Chain Monte Carlo it is not necessary to compute the normalizing factor for the computation of the expectation.
Let $X$ denote a vector of $k$ random variables with distribution $\pi(.)$. If we can accurately sample from $\pi(.)$ then we can use Monte Carlo integration to approximate the expectation as follows: $$E(f(x)) \approx \frac{1}{n}\sum_{t=1}^n f(X_t)$$ This formula relies on the law of large numbers to state that the maximum likelihood estimator of the sample mean of $f(X)$ can be approximated by the arithmetic mean of a sufficiently large number of samples of $f(X)$.
That is, if we can accurately draw $\lbrace X_1, X_2, \dots X_n \rbrace$ then we could apply Monte Carlo integration to find the desired expectation.
Since $\pi(.)$ may be very complicated, drawing these samples may be non-trivial. To address this problem we turn to Markov Chains.
If we can build a Markov Chain having $\pi(.)$ as its stationary distribution then we can use it to draw the necessary samples to perform Monte Carlo integration. Hence the name, Markov Chain Monte Carlo.
The chain must be initialized with some initial $X_0$. We must run this chain for a sufficiently long time such that this initial value is "forgotten" and eventually the chain will begin to draw dependent samples from its stationary distribution, $\pi(.)$. This process is referred to as burn-in. The values obtained after the burn-in may then be used to perform the approximation.
For a Markov Chain of length $n$ with burn-in $m$, we can write the ergodic average [1] of $f$ as $$\bar{f} = \frac{1}{n-m} \sum_{t=m+1}^n f(X_t) $$
Which will converge to our desired expectation, $\hat{\mu}$ by the ergodic theorem [1].
The values of $m$ and $n$ must both be chosen to be sufficiently large by analyzing the output of the Markov Chain.
Notice that computing the expectation using Monte Carlo integration does not require the difficult computation of the normalization factor.
Now we must develop a method of constructing a Markov Chain with $\pi(.)$ as its stationary distribution. One method of doing this is to use the Metropolis-Hastings Algorithm.
It draws a sample from the proposal and accepts it with probability $\alpha$. If the sample is rejected the chain remains unchanged at state $t+1$. The proposal distribution should resemble $\pi(.)$ but may depend on the current sample $X_t$. Choosing a good proposal distribution will be covered later, but a multivariate normal distribution usually provides a good approximation.
The algorithm draws a point $Y$ from the proposal distribution given the current state of the chain. $Y$ is accepted as the next point in the chain with probability $\alpha$: $$\alpha (X_t,Y) = \min \Big(1, \frac{\pi(Y)q(X_t|Y)}{\pi(X_t)q(Y|X_t)}\Big )$$
If it it is rejected then current point becomes the next point, $X_{t+1} = X_t$. This process is repeated for $t = 1 \dots n$. Note that normalizing constant need not be known because it drops out of the ratio $\pi(Y)/\pi(X)$ [3].
However, picking a good proposal distribution is essential to obtaining reliable results with a chain of practical running time. A poor choice of a proposal distribution may result in slow convergence and slow mixing. The mixing of the chain, refers to how the chain moves about the support set of $\pi(.)$. Ideally, the chain will possess rapid mixing, drawing samples rapidly moving around the support set. Under this condition, the chain is more likely to continuously draw samples throughout the support in the correct proportions.
On the other hand, chains with slow mixing will draw many samples from one region of the support and then slowly move to another region. Thus, chains with slow mixing must be run for much longer in order to counteract the bias that is introduced by focusing on select portions of the support.
Another important consideration is the computational efficiency associated with the proposal distribution. To prevent bottle necks in computation, a distribution should be chosen that can be quickly sampled and evaluated.
Large sample theory states that the posterior distribution of the parameters approaches a multivariate normal distribution [4]. This is particularly convenient because it is easy to quickly sample from such a distribution. When using a single or multivariate normal proposal, the means and covariance matrix should be tuned to obtain good mixing for the problem at hand. It is common to use either the means directly from $X_t$ or a function thereof as the means in the proposal distribution for $X_{t+1}$. The purpose of this is to allow the proposal distribution to draw its next sample near the last accepted sample with some specified variance.
Proposal distributions with larger variance will make larger jumps around the support at the cost of each sample having a higher probability of being rejected. So proposals with especially large variance are prone to perform $X_{t+1} = X_t$ very frequently.
On the other hand, proposals with small variances are more likely to result in the chain accepting the proposed point more frequently. Likewise, too small of variance will again result in slow mixing. Even though the proposed samples will be accepted frequently, the chain will take too small of steps from the last accepted sample to quickly move around the support.
However, in practice it is often unnecessary to perform this analysis. Instead, the values of the chain can be plotted against time. This allows for the analysis of both the rate of convergence and the mixing of the chain.
These plots allow us to guess a minimum burn-in with reasonable certainty and also determine how well the chain mixes for the given proposal distribution.
The first figure with proposal distribution $N(X, 0.1)$ mixes too slowly to be considered convergent with 1000 iterations.
The second figure with proposal distribution $N(X, 0.5$ mixes considerably better. We would take its burn-in to be about $200$ iterations.
The third figure with proposal distribution $N(X, 10)$ also mixes well. We would take its burn-in to be about $170$ iterations.
The last figure with proposal $N(X, 50)$ does not mix well at all. The variance is too high so most of the proposed samples are rejected. It almost always maintains its current state and hence stays in particular regions of the support set.
A simple way of determining stopping time is to run the simulation several times with different starting values for the same number of iterations. If the estimates differ by a statistically significant amount then the running time should be increased. If even very long running times result in varying approximations then the proposal distribution may warrant revision.
In general, the longer the chain is run, the better the estimation will be. However, there is certainly a point of diminishing returns. Once the chain has converged to the stationary distribution, future samples should be drawn in the correct proportions. It can be seen that simply increasing the running time does not necessarily mean that the estimate will improve. In fact, by looking at the Monte Carlo Standard Error (to be introduced shortly), we can see that adding samples can decrease the confidence in the estimate if the samples demonstrate high variance from $\hat{\mu}$.
Picking a starting value near the mode of the chain may seem like a good method of reducing the required burn-in. According to Gilks, et al., even if the chain appears to converge immediately, it is still necessary to run the burn-in long enough to let the chain forget its starting value. However, since these are Markov chains, the next value in the chain is only dependent on the prior value in the chain. So if the chain is started within the support set of the stationary distribution, it may only take a few steps for the chain to forget its initial value. \vspace{0.2in}
Computationally, it makes sense to run one chain for every processor available unless there are other tasks waiting for the processors. When using multiple chains it is important to be careful of how the multiple estimates are handled.For example it may seem like a good idea to take the arithmetic mean of the estimates to get a better estimator This may, in principle, yield a numerically better approximation of the expectation but makes error reporting more complicated. Since the variance of each chain my differ significantly, the MCSE of one chain may not be representative of the others. Instead of attempting to combine the results of multiple chains together one should consider accepting the chain that has the lowest Monte Carlo Standard Error.
where $\sigma^2 = \text{var}(f(X_i))$. This is the actual population variance. Since we don't know the population variance $\sigma^2$ we need to estimate this quantity. The Monte Carlo Standard Error is a consistent estimate of the standard deviation $\sigma$ such as $$\hat{\sigma} = \sqrt{\frac{1}{n}\sum_{i=1}^n \Big(g(X_i) - \hat{\mu} \Big)^2 }$$
With this estimator we can use Slutsky's Theorem to attain $$Z = \frac{\hat{\mu}_n - \mu}{\frac{\hat{\sigma}_n}{\sqrt{n}}} \longrightarrow N(0, 1), \quad n \to \infty$$
Applying basic transformations we see $$\hat{\mu} \sim N\Big(\mu, \frac{\hat{\sigma}^2}{n}\Big), \quad n \to \infty$$
The MCSE tells us how much we can expect our $\hat{\mu}$ to deviate from the true population $\mu$. Note that as more samples are added, the MCSE does not necessarily decrease. In fact, it only decreases if the samples are close to the estimated mean $\hat{\mu}$. Consider the case of $\pi(.)$ having many peaks and the chain has attained a good estimate $\hat{\mu}$ after $t$ iterations. Then if at $t+1$ iterations the chain begins to explore a peak far from $\hat{\mu}$ the MCSE will increase as more samples from this peak are added.
By reporting the MCSE, $\hat{\sigma}$, and the number of samples used, $n$, enough information is provided to create confidence intervals for the expectation $\hat{\mu}$ at any $\alpha$ value. Taking $\alpha = 0.05$ gives $1-\alpha = 0.95$ or $95$ percent confidence interval.
Let $Z$ represent the standardized distribution of our variable. To construct a $95$ percent confidence interval for $\hat{\mu}$ we find $z$ such that $$P(-z \leq Z \leq z) = 0.95$$ By the definition of CDF: $$\Phi(z) = P(Z \leq z) = 1-\frac{\alpha}{2} = 0.975$$ $$z = \Phi^{-1}(\Phi(z)) = \Phi^{-1}(0.975) = 1.96$$ Then we construct the confidence interval as $$P(-1.96 \leq Z \leq 1.96) = 0.95$$ $$P(\hat{\mu} - 1.96 \frac{\hat{\sigma}}{\sqrt{n}} \leq \mu \leq \hat{\mu} + 1.96 \frac{\hat{\sigma}}{\sqrt{n}}) = 0.95$$
Determining when to stop the chain programmatically will yield much more precise results. This can be achieved by
The MCSE should be chosen depending on the confidence level desired. Smaller MCSE results in tighter confidence intervals for a given $\alpha$ level.
This method has surely been used before, but we have not found reference to it in the the literature reviewed thus far.
/*** Pseudocode ***/
// Construct the Markov Chain
X0 = ... #initialize X0
chain = [X0] #initialize chain
for i in range(1, max_iterations):
Y = random sample from q(.|chain[i-1])
U = random sample from Uniform(0,1)
if U <= alpha(chain[i-1], Y):
chain[i] = Y
else:
chain[i] = chain[i-1]
// Monte-Carlo Integration
total = 0
for i in range(burnin, max_iterations):
total = total + chain[i]
total = total/(max_iterations-burnin)
The software implementation provided as a supplement and is intended to provide a demonstration of Markov Chain Monte Carlo in a single dimension. It is written in C and must invoked from within the open source statistical software package R. The implementation relies upon several R functions to efficiently sample from the proposal distribution and evaluate $\pi(.)$ at a given $x$. To make calling the source code easier we provide an R script that wraps the call to C in a function, calls the function with specified arguments, and plots a histogram of the chain after excluding the burn-in.
This Metropolis-Hastings sampler draws samples according to the pseudo-code in the preceding section. It outputs the approximated expectation and displays a histogram of chain.
To demonstrate the sampler we have run this to approximate a standard normal and a Chi-square distribution with $6$ degrees of freedom.
plots/HistNormal.pdf}}
Approximation of standard normal with $15,000$ samples.
plots/HistChi6.pdf
Approximation of Chi-square with 6 degrees of freedom with $30,000$ samples.
R CMD SHLIB MCMC.cNext, start R in the same directory
RNow in R, to run the R script enter
source("MCMC.r")
chainSize = 1000 burnin = 200
The first function, pi, returns the value of the proposal distribution when evaluated at some point $X$. We have set this as a normal distribution with
return dnorm(X, 0, 1, FALSE);
where X is the point, 0 is the mean, 1 is the variance, and the log option is set to false.
The second function, q, returns a random sample from the proposal distribution. We have set this as a single-variate normal with
return rnorm(Y,10);where Y is the mean and 10 is the variance.
The multi-dimensional version is nearly identical. Instead of $X$ being a point at which the posterior is evaluated, it is regarded as a vector. For example, the posterior distribution may depend on $n$ parameters, so we would have $X = x_1, x_2, \dots, x_n$. Then it would be evaluated as $\pi (X) = \pi(x_1, x_2, \dots x_n)$. The goal is still to approximate $E(f(X))$ but we must do this by varying each $x_i \in X$.
Obviously, as the number of dimensions are increased so too will be the necessary running time. This is especially true if a random sample of $n$ dimensions is drawn and then accepted or rejected as a whole. To counteract this we introduce single-component Metropolis-Hastings.
One iteration of single-component Metropolis-Hastings takes $h$ update steps to complete. A total of $h$ candidates must be drawn for $X$ and the distribution from which they are drawn may depend on the new values of the already updated components and the prior values of the non-updated components. That is, the proposal distribution for $Y_{.i}$ may depend on $X_{t.-1} = {X_{t+1.1},\dots X_{t+1.i-1}, X_{t.i}, \dots , X_{t.h}}$.
The $i^{th}$ proposal is then accepted with probability $$ \alpha(X_{.-i}, X_{.i}, Y_{.i}) = \text{min}\Big(1, \frac{\pi(Y_{.i}|X_{.-i})q_i(X_{.i}|Y_{.i},X_{.-i})} {\pi(X_{.i}|Y_{.-i})q_i(Y_{.i}|X_{.i},X_{.-i})}\Big) $$
Here is a brief overview of some other algorithms that are worth considering.
If we have data $D$, parameters $\theta_1 , \theta_2$ and can sample from $$p(\theta_1 | \theta_2, D) \text{ and } p(\theta_2 | \theta_1, D)$$ then we can construct a Markov chain $${(\theta_1^{(1)},\theta_2^{(1)}),(\theta_1^{(2)},\theta_2^{(2)}), \dots, (\theta_1^{(n)},\theta_2^{(n)})}$$ by sampling $$\theta_1^{(k)} \sim p(\theta_1|\theta_2^{(k-1)}, D) \qquad \text{ and } \qquad \theta_2^{(k)} \sim p(\theta_2|\theta_1^{(k)}, D)$$ for $k = 1, \dots, n$. The stationary distribution of the Markov chain will then be the desired $p(\theta_1, \theta_2 | D)$. Modern-day Superman to some yet-to-be-found Lois Lane, part-time voider of warranties, and full-time engineer with a penchant This can be generalized to any dimension by sampling each of the parameters as $$\theta_i^{(k)} \sim p(\theta_i | \theta_1^{(k)}, \dots, \theta_{i-1}^{(k)}, \theta_{i+1}^{(k-1)}, \dots, \theta_n^{(k-1)}, D)\text{.}$$
Some optimizations exist to avoid random walks. Hybrid Monte Carlo or Hamiltonian Monte Carlo tries to avoid random walks by imposing momentum on the chain using Hamiltonian dynamics [7]. The chain then moves across the sample space more rapidly, decreasing the correlation between the samples and converging to the stationary distribution more rapidly.
The HMC method allows for global moves while keeping the acceptance probability high [12]. Global jumps are not usually made in the Metropolis-Hastings algorithm because the acceptance probability would be prohibitively low. This problem is handled with single-component Metropolis Hastings (discussed earlier) which only updates one component of $\theta$ at a time. As a result the update step of HMC is much faster since the update is done at once instead of iteratively updating each individual component.
In many practical situations the expectation in the E step is an intractable integral. One way to solve this problem is to use MCMC to sample from $p(x_h | x_v, \theta^{(old)})$ by drawing $N$ dependent samples. According to An Introduction to MCMC for Machine Learning by Andrieu, et al. we can then replace $Q(\theta)$ with $$\hat{Q}(\theta) = \frac{1}{N}\sum_{j=1}^{N} \text{log}(p(x_h^{(j)}, x_v|\theta))$$
In summary, MCMC-EM works by sampling from the unobservable data and using this data to approximate the log-likelihood of the parameters of interest. The log-likelihood is then maximized and the process starts over again using the new and improved parameters.
Expanding these methods into multiple dimensions is generally achieved by treating the multidimensional integral as multiple one dimensional integrals using Fubini's Theorem. The problem with this method is that the computational cost increases exponentially with dimension. If 1,000 points are required in 1-D then typically 1,000,000 are required in 2-D and 1,000,000,000 in 3-D [10].
The curse of dimensionality is partially overcome by the sparse grid scheme. The method is based on 1-D quadratures but combines the results in a more sophisticated manner using Smolyak's quadrature rule [11]. This method works well in many situations and has the advantage of giving precise error estimation for a normalized posterior. Sparse grids do not rely on the concepts of burn-in or chain length but do require precise choice of a basis functions. A wrong basis function will result in a wrong result. Similarly, in MCMC a poor choice in proposal will hinder convergence but not yield a wrong answer.
Among the MCMC methods, the general purpose Metropolis-Hastings sampler is a very diverse algorithm that can be applied to almost all situations. The exception to this is when one needs to handle a mixture of distributions having different dimensions. In this case the even more general and more complicated Metropolis-Hastings-Green method should be used. If the conditional distributions for each variable are known, Gibbs sampling can provide just as good of a chain and requires no tuning.
Like all algorithms, Metropolis-Hastings MCMC is not a silver bullet. Fast convergence requires a well thought-out prior and rapid mixing often requires significant tuning. High-dimensional problems require massive number of samples which can be computationally expensive. This is especially true when a large number of multi-dimensional integrals need to be handled. In this situation it may be advisable to implement the optimizations mentioned in the Avoiding Random Walks section.
Despite these problems, MCMC works surprisingly well. By taking advantage of MCSE, estimates can be made at almost any level of desired precision at the cost of additional computer time. This is what makes an MCMC estimate worth using. It is no surprise that since Metropolis introduced the method in 1952 it has gained popularity in every field that requires the computation of high-dimensional problems.
Regardless of the application, most people using MCMC are interested in computing the expectation of a probability distribution, usually a Bayesian posterior. In physics this may be the expected location of an electron or in bioinformatics it may be the probability that a single nucleotide polymorphism (SNP) is a causal for a given disease model.
In every case, picking an adequate proposal is one of the most important challenges. A multivariate normal is often a good choice, but the covariance matrix needs to be tuned for the data at hand. Without strong prior knowledge of the distribution from which the data arise, this requires monitoring how well the chain is mixing for each parameter and making adjustments based on this information. Fortunately, this is not very difficult. Unfortunately, this requires a human in the loop. These modifications can be made programmatically at the cost of a more complicated software implementation.
Ensuring that a chain has good mixing and determining a proper burn-in value are important considerations. The mixing of the chain is generally determined by variance of the proposal, also known as "step-size". When using a multi-variate proposal, the step-size for each parameter can be set independently. Mixing and burn-in can be analyzed visually as demonstrated the figures in this guide. However, the parameters must be analyzed one at a time unless 3-D graphing techniques are used.
There are various MCMC implementations available for download on the web. The BUGS (Bayesian inference Using Gibbs Sampling) Project provides a comprehensive software package for Gibbs Sampling. Charles Geyer at the University of Minnesota has released an R package for doing "Simple, but general" random-walk Metropolis-Hastings. It is available here.
[1] D. J. S. Walter R. Gilks, Markov Chain Monte Carlo in Practice. First CRC Press, 1 ed., 1998.
[2] C. J. Geyer, “Practical markov chain monte carlo,” Statistical Science, vol. 7, no. 4, 1992.
[3] “xbeta: Metropolis-hastings algorithm.” http://xbeta.org/wiki/show/ Metropolis-Hastings+algorithm.
[4] “The mcmc procedure: Tuning the proposal distribution.” http://support.sas.com/documentation/cdl/en/statug/59654/HTML/ default/statug mcmc sect022.htm, 2008.
[5] C. Geyer, “Introduction to markov chain monte carlo slides.”.
[6] C. Andrieu, “An introduction to mcmc for machine learning,” 2003.
[7] Wikipedia, “Markov chain monte carlo — wikipedia, the free encyclopedia,” 2009. [Online; accessed 10-May-2009].
[8] D. Propp, James; Wilson, “Coupling from the past: a user’s guide,” Microsurveys in discrete probability, p. 181192, 1998. MR1630414.
[9] C. J. Geyer, “The metropolis-hastings-green algorithm.” http://www.stat. umn.edu/geyer/f05/8931/bmhg.pdf, December 2003.
[10] D. H. Bailey, “High-precision, high-dimension integration,” tech. rep.
[11] Wikipedia, “Numerical integration — wikipedia, the free encyclopedia.” http://en.wikipedia.org/w/index.php?title=Numerical integration, 2009. [Online; accessed 14-May-2009].
[12] Heermann, Dieter W. Slides on Hybrid Monte Carlo. http://wwwcp.tphys.uni-heidelberg.de/comp-phys/Lecture-Theory-8-Hybrid-Monte-Carlo.pdf