Monte Carlo Methods HW4
1. Hamiltonian Monte Carlo for Gaussian mixture model
Consider
the gaussian mixture model given by
\[f(x) = \frac{1}{3}\mathcal N\biggl( \begin{pmatrix} -4 \\ -6 \end{pmatrix}, \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix} \biggr) + \frac{1}{3}\mathcal N\biggl(\begin{pmatrix} 0 \\ 0\end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \biggr) + \frac{1}{3}\mathcal N\biggl( \begin{pmatrix} 6\\8 \end{pmatrix}, \begin{pmatrix} 1 & -0.9 \\ -0.9 & 1 \end{pmatrix}\biggr).\]
(a) Explain the update step of Hamiltonian Monte Carlo (HMC) and compute the log-posterior and first gradient of the log-posterior for the Gaussian mixture model.
1) update step
We update \(x\) and \(\phi\) via leapfrog steps. Before the leapfrog steps, we draw the momentum \(\phi\) from the normal distribution \(\mathcal N_2(0,M)\). Within each leapfrog step, \(\phi\) is first advanced a half step using the gradient of the log-posterior (‘phi_half’ in the code); then \(x\) is advanced a full step using ‘phi_half’ and the inverse of \(M\); finally, the remaining half step of \(\phi\) is taken using the gradient evaluated at the new \(x\). This process is repeated \(L\) times, after which a Metropolis accept/reject step is applied.
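As a minimal illustration of the update step (a sketch under simplified assumptions: a 1-D standard normal target with \(M = 1\), not the mixture used below), the leapfrog integrator nearly conserves the Hamiltonian \(H(x,\phi) = -\log f(x) + \phi^2/2\):

```r
# Hedged 1-D sketch of the leapfrog update for a standard normal target
# (so grad log f(x) = -x) with unit mass matrix M = 1.
set.seed(1)
grad_log_f <- function(x) -x
H <- function(x, phi) -dnorm(x, log = TRUE) + phi^2 / 2  # Hamiltonian

x   <- 1.0
phi <- rnorm(1)            # draw momentum from N(0, 1)
ep  <- 0.1; L <- 25
H0  <- H(x, phi)
for (l in 1:L) {
  phi_half <- phi + ep / 2 * grad_log_f(x)       # half step for momentum
  x        <- x + ep * phi_half                  # full step for position
  phi      <- phi_half + ep / 2 * grad_log_f(x)  # second half step for momentum
}
abs(H(x, phi) - H0)  # small: leapfrog nearly conserves H
```

Because \(H\) is nearly conserved along the trajectory, the MH acceptance ratio computed after the \(L\) leapfrog steps tends to be close to 1, which is consistent with the high acceptance rates observed below.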
2) gradient of log-posterior
The log posterior is simply \[\log f(x) = \log\biggl( \sum_{k=1}^Kw_k\mathcal N(x; \mu_k,\Sigma_k) \biggr)\] where \(K=3\), \(w_k = 1/3\) for all \(k\), and \(\mu_k,\Sigma_k\) are the corresponding mean vectors and covariance matrices; e.g. \(\mu_1 = (-4,-6)\) and \(\Sigma_1 = \begin{bmatrix}1&0.9\\0.9&1\end{bmatrix}\).
The gradient of \(\log f(x)\) can be written as \[\begin{aligned}\nabla_x\log f(x) &= \biggl[\frac{\partial}{\partial x_1}\log f(x) ,\frac{\partial}{\partial x_2}\log f(x) \biggr]^T\\ &= \biggl[\frac{1}{f(x)}\frac{\partial}{\partial x_1}f(x) , \frac{1}{f(x)}\frac{\partial}{\partial x_2}f(x) \biggr]^T \\ &= \frac{\nabla_x f(x)}{f(x)}. \end{aligned}\] Let the \(k\)th Gaussian pdf be \(g_k\) for simplicity, then \[\nabla_x f(x) = \sum_{k=1}^Kw_k\nabla_x g_k(x).\] Since \[\begin{aligned} \frac{\partial}{\partial x}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) = 2\Sigma_k^{-1}x-2\Sigma_k^{-1}\mu_k, \end{aligned}\] and \[g_k(x) = \frac{1}{(2\pi)^{2/2}|\Sigma_k|^{1/2}}\exp\biggl\{ -\frac{1}{2}(x - \mu_k)^T\Sigma_k^{-1}(x - \mu_k) \biggr\},\] we have \[\begin{aligned} \nabla_x g_k(x) = g_k(x)\cdot\bigl(-\Sigma_k^{-1}(x - \mu_k) \bigr). \end{aligned}\] To sum up, \[\begin{aligned} \nabla_x \log f(x) = -\frac{1}{f(x)}\sum_{k=1}^Kw_kg_k(x)\Sigma_k^{-1}(x-\mu_k) \end{aligned}\]
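The derived formula can be verified numerically by comparing it with a central-difference approximation (a self-contained check; the helper names ‘dens_k’, ‘grad_analytic’ and ‘num_grad’ are ours, not part of the assignment code):

```r
# Self-contained numerical check of the analytic gradient of log f(x).
mu  <- list(c(-4, -6), c(0, 0), c(6, 8))
Sig <- list(matrix(c(1, 0.9, 0.9, 1), 2), diag(2),
            matrix(c(1, -0.9, -0.9, 1), 2))
dens_k <- function(x, k) {                 # bivariate normal pdf, by hand
  d <- x - mu[[k]]
  exp(-0.5 * drop(t(d) %*% solve(Sig[[k]]) %*% d)) /
    (2 * pi * sqrt(det(Sig[[k]])))
}
f <- function(x) sum(sapply(1:3, function(k) dens_k(x, k))) / 3
grad_analytic <- function(x) {             # formula derived above
  s <- Reduce(`+`, lapply(1:3, function(k)
    (1/3) * dens_k(x, k) * drop(solve(Sig[[k]]) %*% (x - mu[[k]]))))
  -s / f(x)
}
num_grad <- function(g, x, h = 1e-6) {     # central differences
  sapply(1:2, function(i) {
    e <- c(0, 0); e[i] <- h
    (g(x + e) - g(x - e)) / (2 * h)
  })
}
x0 <- c(-1, -1)
max(abs(grad_analytic(x0) - num_grad(function(z) log(f(z)), x0)))  # tiny
```

At \(x_0 = (-1,-1)\) both approaches give a gradient of approximately \((1, 1)\), dominated by the second (standard normal) component, matching the output of ‘gradient_log_density’ later in this report.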
(b) Write code for HMC using the leapfrog process and generate samples from the gaussian mixture model.
1) initial values
The following code shows how to set the mean vectors and covariance matrices for the Gaussian mixture model. The lists of means (‘mu’) and covariances (‘Sigma’) are used to compute the log-posterior, and the inverses of the covariance matrices are used in its gradient.
rm(list = ls())
set.seed(2023311161)
#install.packages("numDeriv")
#install.packages("RcppDist")
library(Rcpp)
## Warning: package 'Rcpp' was built under R version 4.3.3
library(RcppArmadillo)
library(RcppDist)
## Warning: package 'RcppDist' was built under R version 4.3.3
library(numDeriv)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.3.2
# Define parameters for the Gaussian mixture components
mu1 <- c(-4, -6)
mu2 <- c(0, 0)
mu3 <- c(6, 8)
mu <- list(mu1 = mu1, mu2 = mu2, mu3 = mu3)
Sigma1 <- matrix(c(1, 0.9, 0.9, 1), nrow = 2)
Sigma2 <- diag(2)
Sigma3 <- matrix(c(1, -0.9, -0.9, 1), nrow = 2)
Sigma <- list(Sigma1 = Sigma1, Sigma2 = Sigma2, Sigma3 = Sigma3)
Sigma1_inv <- solve(Sigma1)
Sigma2_inv <- solve(Sigma2)
Sigma3_inv <- solve(Sigma3)
The matrix M is used as the covariance matrix of the momentum vector \(\phi \sim \mathcal N_2(0, M)\). The total number of iterations is 25000 with a burn-in of 5000. The initial value is (-1,-1), which lies near the \(1\)st and \(2\)nd modes of the mixture; the step size (ep, \(\epsilon\)) is 1/2, and the number of leapfrog steps is L = 50.
M <- matrix(c(3,-2,-2,4), nrow = 2)
M_inv <- solve(M)
n_iter <- 25000;burnin <- 5000
x <- matrix(c(-1,-1), nrow = 1) #1 by 2 matrix, not vector
ep <- 0.5;L <- 50
2) RcppDist
Before introducing the log-posterior and its gradient function, we introduce some Rcpp functions that behave the same as those in the ‘mvtnorm’ package but run faster. Computing the log-posterior and generating the momentum \(\phi\) require ‘dmvnorm’ and ‘rmvnorm’ from the ‘mvtnorm’ package. However, these functions involve matrix calculations, which are slow in base R. So, we use the RcppDist package (based on RcppArmadillo), which also provides ‘dmvnorm’ and ‘rmvnorm’ but runs in C++ to reduce computation time. These functions are exported as ‘dmvnorm_rcpp’ and ‘rmvnorm_rcpp’, respectively.
// [[Rcpp::depends(RcppArmadillo, RcppDist)]]
#include <RcppArmadillo.h>
#include <RcppDist.h>
using namespace Rcpp;
// [[Rcpp::export]]
arma::vec dmvnorm_rcpp(const arma::mat& x, const arma::vec& mean, const arma::mat& sigma, bool logd = false) {
return dmvnorm(x, mean, sigma, logd);
}
// [[Rcpp::export]]
arma::mat rmvnorm_rcpp(int n, const arma::vec& mean, const arma::mat& sigma) {
return rmvnorm(n, mean, sigma);
}
This Rcpp code is stored in the “MC_HW4_kohanjun.cpp” file.
sourceCpp("MC_HW4_kohanjun.cpp")
3) log-posterior using log-sum-exp-trick
The following shows how to obtain the log-posterior of the given \(f(x)\). However, directly applying \(\log\) to the value of \(f(x)\) can return unstable values such as ‘-Inf’. That is, when \[\log f(x) = \log\biggl( \sum_{k=1}^Kw_k\mathcal N(x; \mu_k,\Sigma_k) \biggr)\] and \(f(x)\) is extremely small, every component density underflows to zero in floating point, so the direct application of the logarithm is unstable in R. So, we use the so-called log-sum-exp trick as follows. \[\begin{aligned} \log f(x) &= \log\biggl( \sum_{k=1}^Kw_k\mathcal N(x; \mu_k,\Sigma_k) \biggr) \\ &= \log\biggl(\sum_{k=1}^K \exp\Bigl\{ \log\Bigl( w_k\mathcal N(x;\mu_k,\Sigma_k) \Bigr)\Bigr\}\biggr) \\ &= \log\biggl(\sum_{k=1}^K\exp\Bigl\{\underbrace{\log w_k + \log\mathcal N(x;\mu_k,\Sigma_k)}_{:=\alpha_k}\Bigr\}\biggr),\quad \alpha := \max_{1\le k\le K}\alpha_k \\ &= \log\biggl(\exp(\alpha)\sum_{k=1}^K\exp\{\alpha_k-\alpha\} \biggr) \\ &=\alpha + \log\biggl( \sum_{k=1}^K\exp\{\alpha_k-\alpha\} \biggr) \\ &= \alpha + \log\biggl(\sum_{k=1}^K\exp\Bigl\{\log w_k + \log\mathcal N(x;\mu_k,\Sigma_k)-\alpha \Bigr\}\biggr). \end{aligned}\] In the code below, ‘log_comp_densities’ stores the log normal density value of each \(k\)th component, and ‘ws’ holds the identical weights \(1/K\).
log_posterior <- function(x) {
K <- 3
log_comp_densities <- numeric(K)
ws <- rep(1/K,K)
for (k in 1:K) {
log_comp_densities[k] <- dmvnorm_rcpp(x,mean = mu[[k]], sigma = Sigma[[k]], logd = TRUE)
}
log_weights <- log(ws)
# Log-sum-exp trick
alpha <- max(log_weights + log_comp_densities)
log_density <- alpha + log(sum(exp(log_weights + log_comp_densities - alpha)))
return(log_density)
}
The following example code checks that the ‘log_posterior’ function returns the same density value as the true \(f(x)\).
exp(log_posterior(x))
## [1] 0.01951661
as.numeric((1/3)*
(dmvnorm_rcpp(x, mean=mu1, sigma=Sigma1)+
dmvnorm_rcpp(x, mean=mu2, sigma=Sigma2) +
dmvnorm_rcpp(x, mean=mu3, sigma=Sigma3)))
## [1] 0.01951661
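The benefit of the trick appears far from all three modes, where every term \(w_k\mathcal N(x;\mu_k,\Sigma_k)\) underflows to zero in double precision. A minimal self-contained illustration with hypothetical \(\alpha_k\) values:

```r
# Hypothetical log-scale component values alpha_k, as they might occur
# far from every mode of the mixture.
alpha_k <- c(-2000, -2010, -2050)
log(sum(exp(alpha_k)))          # naive evaluation underflows: -Inf
a <- max(alpha_k)
a + log(sum(exp(alpha_k - a)))  # log-sum-exp: about -1999.99995
```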
4) gradient
The way to obtain the gradient of \(\log f(x)\) is suggested in 2) of (a).
gradient_log_density <- function(x){
p1 <- -sum(dmvnorm_rcpp(x, mean = mu1, sigma = Sigma1))*Sigma1_inv%*%(t(x) - mu1)
p2 <- -sum(dmvnorm_rcpp(x, mean = mu2, sigma = Sigma2))*Sigma2_inv%*%(t(x) - mu2)
p3 <- -sum(dmvnorm_rcpp(x, mean = mu3, sigma = Sigma3))*Sigma3_inv%*%(t(x) - mu3)
return( t( (1/3)*(p1+p2+p3)/exp(log_posterior(x)) ) )
}
gradient_log_density(x)
## [,1] [,2]
## [1,] 1 0.9999992
5) HMC algorithm
The following ‘HMC’ function implements the HMC algorithm. ‘x_p’ and ‘phi_p’ store the current values of \(x\) and \(\phi\), respectively. The leapfrog step described in (a) is carried out \(L\) times. After computing the MH acceptance ratio (‘log_acc_ratio’), we decide whether to accept the proposed \(x\) (‘x_p’). If it is accepted, we add \(1\) to ‘acc’ and store ‘x_p’ in ‘x’ and the ‘x_samples’ matrix. At the end, the function prints the overall acceptance ratio of the HMC algorithm.
HMC <- function(x){
acc <- 0
x_samples <- matrix(0, nrow = n_iter, ncol = 2)
for(iter in 1:n_iter){
x_p <- x # 1 by 2 matrix
phi <- rmvnorm_rcpp(1, mean = c(0,0), sigma = M)
phi_p <- phi # momentum
for(l in 1:L){ # leapfrog
phi_half <- phi_p + 1/2*ep*gradient_log_density(x_p)
x_p <- x_p + t(ep*M_inv%*%t(phi_half))
phi_p <- phi_half + 1/2*ep*gradient_log_density(x_p)
}
log_acc_ratio <- log_posterior(x_p) +
sum( dmvnorm_rcpp(phi_p, mean = c(0,0), sigma = M, logd = TRUE ) ) -
log_posterior(x) -
sum( dmvnorm_rcpp(phi, mean = c(0,0), sigma = M , logd = TRUE) )
if(log(runif(1)) < log_acc_ratio){
x <- x_p
acc <- acc+1
}
x_samples[iter, ] <- x
}
print(acc/n_iter)
return(x_samples)
}
6) result-1
We first show the result for the initial value x = (-1,-1). The result matrix ‘x_samples_12’ is converted into a dataframe for later use with ggplot. The acceptance ratio is extremely high.
x_samples_12 <- HMC(x)
## [1] 0.9826
xs_data_12 <- as.data.frame(x_samples_12[(burnin+1):n_iter,])
colnames(xs_data_12) <- c('x1','x2')
The traceplots of \(x_1\) and \(x_2\) show that both lie in the first and second Gaussian distributions.
par(mfrow = c(1,2))
ts.plot(x_samples_12[(burnin+1):n_iter,1],
main = "Traceplot of x1 using HMC", ylab = 'x1')
ts.plot(x_samples_12[(burnin+1):n_iter,2],
main = "Traceplot of x2 using HMC", ylab = 'x2')
The following code is for contour plot of \(f(x)\).
x1_seq <- seq(-10, 10, length.out = 100)
x2_seq <- seq(-10, 15, length.out = 100)
grid <- expand.grid(x = x1_seq, y = x2_seq)
gm_density <- function(x, y) {
density <- 0
for (i in 1:3) {
density <- density + (1/3)*dmvnorm_rcpp(matrix(c(x, y),nrow=1),
mean = mu[[i]],
sigma = Sigma[[i]])
}
return(density)
}
grid$density <- mapply(gm_density, grid$x, grid$y)
This ggplot shows both the contour plot (red and blue) and scatter plot of generated ‘x_samples_12’ with gray color. This also shows that the generated samples are stuck in the \(1\)st and \(2\)nd Gaussians.
ggplot() +
geom_point(data = xs_data_12, aes(x = x1, y = x2), color = "black", size = 0.3, alpha=0.1) +
geom_contour(data = grid, aes(x = x, y = y, z = density, color = after_stat(level))) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Contour Plot of f(x) with Scatter Points using HMC", x = "x1", y = "x2") +
theme_minimal()
The following marginal histograms show the same conclusions.
h1 <- ggplot()+
geom_histogram(data = xs_data_12,
aes(x = x1, y = after_stat(density)),
binwidth = 0.5, fill = 'skyblue', color = 'black')+
labs(title = "Histogram of x1 using HMC", x = 'x1', y = 'freq')
h2 <- ggplot()+
geom_histogram(data = xs_data_12,
aes(x = x2, y = after_stat(density)),
binwidth = 0.5, fill = 'skyblue', color = 'black')+
labs(title = "Histogram of x2 using HMC", x = 'x2', y = 'freq')
grid.arrange(h1, h2, nrow = 1)
7) result-2
Now, we run the ‘HMC’ function with a different initial value x = (6,6), which is closer to the \(3\)rd Gaussian. The results are stored in ‘x_samples_3’ and ‘xs_data_3’, respectively. The acceptance ratio is also high.
set.seed(2023311161)
x <- matrix(c(6,6), nrow = 1)
x_samples_3 <- HMC(x)
## [1] 0.87804
xs_data_3 <- as.data.frame(x_samples_3[(burnin+1):n_iter,])
colnames(xs_data_3) <- c('x1','x2')
The traceplots show that the samples lie in the \(3\)rd Gaussian distribution.
par(mfrow = c(1,2))
ts.plot(x_samples_3[(burnin+1):n_iter,1],
main = "Traceplot of x1 using HMC", ylab = 'x1')
ts.plot(x_samples_3[(burnin+1):n_iter,2],
main = "Traceplot of x2 using HMC", ylab = 'x2')
The scatter plot overlaid with the same contour plot also shows that the generated samples are in the \(3\)rd Gaussian mode.
ggplot() +
geom_point(data = xs_data_3, aes(x = x1, y = x2), color = "black", size = 0.3, alpha=0.1) +
geom_contour(data = grid, aes(x = x, y = y, z = density, color = after_stat(level))) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Contour Plot of f(x) with Scatter Points using HMC", x = "x1", y = "x2") +
theme_minimal()
The marginal histograms suggest the same conclusion.
h1 <- ggplot()+
geom_histogram(data = xs_data_3,
aes(x = x1, y = after_stat(density)),
binwidth = 0.5, fill = 'skyblue', color = 'black')+
labs(title = "Histogram of x1 using HMC", x = 'x1', y = 'freq')
h2 <- ggplot()+
geom_histogram(data = xs_data_3,
aes(x = x2, y = after_stat(density)),
binwidth = 0.5, fill = 'skyblue', color = 'black')+
labs(title = "Histogram of x2 using HMC", x = 'x2', y = 'freq')
grid.arrange(h1, h2, nrow = 1)
8) conclusion
The difference between 6) and 7) is significant, although the only change was in the initial value of \(x\). This is because the HMC algorithm struggles to traverse between well-separated modes.
HMC uses gradient information from the log-density to drive its dynamics. While this is beneficial for efficient exploration within a mode, it can hinder movement between modes, because the gradients near one mode provide no information about the location of other, distant modes. That is, the low-density region between distinct modes acts as an energy barrier that impedes movement.
(c) Use the Metropolis-Adjusted Langevin Algorithm (MALA) and the Metropolis-Hastings (MH) algorithm to generate samples from the Gaussian mixture distribution and compare the results from HMC.
1) MALA
The MALA algorithm is as follows:
- Propose a new \(x^*\) by setting \[x^* = x^{(t)} + \frac{\sigma^2}{2}\nabla_x\log f(x^{(t)}) + \sigma\epsilon\] where \(\epsilon\sim\mathcal N_2(0, I_2)\).
- For a proposal density \(g\), calculate the MH ratio \[R_{MALA} = \frac{f(x^*)g(x^{(t)}|x^*)}{f(x^{(t)})g(x^*|x^{(t)})}.\]
- Set \(x^{(t+1)} = x^*\) with probability \(\min\{1, R_{MALA}\}\).
i. x = (-1,-1)
Here the algorithm sets \(\sigma\) to 1/2, and the first initial value is (-1,-1) again.
set.seed(2023311161)
sigma <- 0.5
x <- matrix(c(-1,-1), nrow = 1) #1 by 2 matrix, not vector
The proposal density is a bivariate normal \[g(x^*|x) = \mathcal N_2\Bigl(x^*; x+\frac{\sigma^2}{2}\nabla_x\log f(x), \sigma^2I_2 \Bigr)\] with mean \(x+\frac{\sigma^2}{2}\nabla_x\log f(x)\) and covariance \(\sigma^2 I_2\). To compute the mean conveniently, we define the ‘prop_m’ function as follows.
prop_m <- function(x){
return( x + sigma^2/2*gradient_log_density(x) )
}
The following ‘MALA’ function implements the MALA algorithm. ‘eps’ is the generated bivariate standard normal, and ‘x_temp’, corresponding to \(x^*\) in the \(1\)st step of the algorithm, stores the proposal. We decide whether ‘x_temp’ becomes the next \(x\) based on the MH ratio ‘log_acc_ratio’. At the end, the function prints the overall acceptance ratio.
MALA <- function(x){
acc <- 0
x_samples_lang <- matrix(0, nrow = n_iter, ncol = 2)
for(iter in 1:n_iter){
eps <- rmvnorm_rcpp(1, mean = c(0,0), sigma = diag(2))
x_temp <- prop_m(x) + sigma*eps
log_acc_ratio <- log_posterior(x_temp) +
sum(dmvnorm_rcpp(x, mean = prop_m(x_temp), sigma = sigma^2*diag(2), logd = TRUE)) -
log_posterior(x) -
sum(dmvnorm_rcpp(x_temp, mean = prop_m(x), sigma = sigma^2*diag(2), logd = TRUE))
if(log(runif(1)) < log_acc_ratio){
x <- x_temp
acc <- acc+1
}
x_samples_lang[iter,] <- x
}
print(acc/n_iter)
return(x_samples_lang)
}
The result of the MALA algorithm with x = (-1,-1) is stored in ‘x_samples_lang_12’ and ‘xs_data_lang_12’, respectively.
x_samples_lang_12 <- MALA(x)
## [1] 0.86652
xs_data_lang_12 <- as.data.frame(x_samples_lang_12[(burnin+1):n_iter,])
colnames(xs_data_lang_12) <- c('x1','x2')
The traceplots show that the generated samples lie in the \(1\)st and \(2\)nd modes of the Gaussian mixture.
par(mfrow = c(1,2))
ts.plot(x_samples_lang_12[(burnin+1):n_iter,1],
main = "Traceplot of x1 using MALA", ylab = "x1")
ts.plot(x_samples_lang_12[(burnin+1):n_iter,2],
main = "Traceplot of x2 using MALA", ylab = "x2")
The contour plot and scatter plot show the same result.
ggplot() +
geom_point(data = xs_data_lang_12, aes(x = x1, y = x2), color = "black", size = 0.3, alpha=0.1) +
geom_contour(data = grid, aes(x = x, y = y, z = density, color = after_stat(level))) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Contour Plot of f(x) with Scatter Points using MALA", x = "x1", y = "x2") +
theme_minimal()
The marginal histograms show that the samples are in the \(1\)st and \(2\)nd modes.
h1 <- ggplot()+
geom_histogram(data = xs_data_lang_12, aes(x = x1, y = after_stat(density)),
binwidth = 0.5,fill = 'lightgreen', color = 'black')+
labs(title = "Histogram of x1 using MALA", x = 'x1', y = 'freq')
h2 <- ggplot()+
geom_histogram(data = xs_data_lang_12, aes(x=x2, y = after_stat(density)),
binwidth = 0.5, fill = 'lightgreen', color = 'black') +
labs(title = "Histogram of x2 using MALA", x = 'x2', y = 'freq')
grid.arrange(h1, h2, nrow = 1)
ii. x = (6,6)
Now, we repeat the MALA algorithm with a different initial value x = (6,6).
set.seed(2023311161)
x <- matrix(c(6,6), nrow = 1) #1 by 2 matrix, not vector
x_samples_lang_3 <- MALA(x)
## [1] 0.70196
xs_data_lang_3 <- as.data.frame(x_samples_lang_3[(burnin+1):n_iter,])
colnames(xs_data_lang_3) <- c('x1','x2')
The traceplots show that the generated samples are in the \(3\)rd Gaussian.
par(mfrow = c(1,2))
ts.plot(x_samples_lang_3[(burnin+1):n_iter,1],
main = "Traceplot of x1 using MALA", ylab = "x1")
ts.plot(x_samples_lang_3[(burnin+1):n_iter,2],
main = "Traceplot of x2 using MALA", ylab = "x2")
The scatter plot and marginal histograms also show the same conclusions.
ggplot() +
geom_point(data = xs_data_lang_3, aes(x = x1, y = x2), color = "black", size = 0.3, alpha=0.1) +
geom_contour(data = grid, aes(x = x, y = y, z = density, color = after_stat(level))) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Contour Plot of f(x) with Scatter Points using MALA", x = "x1", y = "x2") +
theme_minimal()
h1 <- ggplot()+
geom_histogram(data = xs_data_lang_3, aes(x = x1, y = after_stat(density)),
binwidth = 0.5,fill = 'lightgreen', color = 'black')+
labs(title = "Histogram of x1 using MALA", x = 'x1', y = 'freq')
h2 <- ggplot()+
geom_histogram(data = xs_data_lang_3, aes(x=x2, y = after_stat(density)),
binwidth = 0.5, fill = 'lightgreen', color = 'black') +
labs(title = "Histogram of x2 using MALA", x = 'x2', y = 'freq')
grid.arrange(h1, h2, nrow = 1)
par(mfrow = c(1,1))
The result of MALA is similar to that of HMC. This is because MALA also makes use of gradient information from the log of the multimodal density, which makes the transition from one mode to another extremely difficult.
2) basic MH algorithm
Let the proposal density be \(p(x^*|x) := \mathcal N(x^*; x, 5^2I_2)\). Then,
- sample \(x^*\) from \(p(\cdot|x^{(t)})\)
- obtain the MH ratio \[R_{MH} = \frac{f(x^*)p(x^{(t)}|x^*)}{f(x^{(t)})p(x^*|x^{(t)})}.\] Since this Gaussian proposal is symmetric, \(p(x^{(t)}|x^*) = p(x^*|x^{(t)})\), so the ratio reduces to \(f(x^*)/f(x^{(t)})\); the code still computes the full ratio for clarity.
- set \(x^{(t+1)} = x^*\) with probability \(\min\{1, R_{MH}\}\).
This algorithm is written in the following code (MH_algo).
MH_algo <- function(x){
acc <- 0
x_samples <- data.frame(x1 = numeric(n_iter),
x2 = numeric(n_iter))
for(iter in 1:n_iter){
x_p <- rmvnorm_rcpp(1, mean = x, sigma = sd^2*diag(2))
log_MHR <- log_posterior(x_p) +
dmvnorm_rcpp(x, mean = x_p, sigma = sd^2*diag(2), logd = TRUE) -
log_posterior(x) -
dmvnorm_rcpp(x_p, mean = x, sigma = sd^2*diag(2), logd = TRUE)
if(log(runif(1)) < log_MHR){
x <- x_p
acc <- acc+1
}
x_samples[iter,] <- x
}
print(acc/n_iter)
return(x_samples)
}
i. x=(-1,-1)
To compare with the above algorithms, we use the same initial values: x = (-1,-1) first, and x = (6,6) for the next run. The result is in the ‘x_MH_12’ dataframe.
set.seed(2023311161)
x <- matrix(c(-1,-1),nrow = 1)
sd <- 5
x_MH <- MH_algo(x)
## [1] 0.06336
x_MH_12 <- x_MH[(burnin+1):n_iter,]
The traceplots show that the generated samples visit all 3 modes, more uniformly than with the former algorithms.
par(mfrow = c(1,2))
ts.plot(x_MH_12[,1],
main = "trace plot of x1 using MH", ylab = 'x1')
ts.plot(x_MH_12[,2],
main = "trace plot of x2 using MH", ylab = 'x2')
The scatter plot is more explicit: the samples cover all 3 modes. Each of the marginal histograms shows all 3 Gaussian components.
ggplot()+
geom_point(data = x_MH_12, aes(x = x1, y = x2), color = 'black', size = 0.3, alpha = 0.1) +
geom_contour(data = grid, aes(x=x, y=y, z=density, color = after_stat(level))) +
scale_color_gradient(low = 'blue', high = 'red')+
labs(title = "Contour Plot of f(x) with Scatter Points using MH", x='x1', y='x2') +
theme_minimal()
h1 <- ggplot()+
geom_histogram(data = x_MH_12, aes(x=x1, y = after_stat(density)),
binwidth = 0.5, fill = 'orange', color = 'black') +
labs(title = "Histogram of x1 using MH", x = 'x1', y = 'freq')
h2 <- ggplot()+
geom_histogram(data = x_MH_12, aes(x=x2, y=after_stat(density) ),
binwidth = 0.5, fill = 'orange', color='black') +
labs(title = "Histogram of x2 using MH", x = 'x2', y = 'freq')
grid.arrange(h1,h2, nrow =1)
ii. x = (6,6)
Now, let's look at the result for the initial value x = (6,6). The traceplots show that the samples again lie in all 3 modes.
set.seed(2023311161)
x <- matrix(c(6,6), nrow = 1)
x_MH <- MH_algo(x)
## [1] 0.05912
x_MH_3 <- x_MH[(burnin+1):n_iter,]
par(mfrow = c(1,2))
ts.plot(x_MH_3[,1],
main = "trace plot of x1 using MH", ylab = 'x1')
ts.plot(x_MH_3[,2],
main = "trace plot of x2 using MH", ylab = 'x2')
The scatter plot and histograms also show that generated samples are in the region of all 3 modes.
ggplot()+
geom_point(data = x_MH_3, aes(x = x1, y = x2), color = 'black', size = 0.3, alpha = 0.1) +
geom_contour(data = grid, aes(x=x, y=y, z=density, color = after_stat(level))) +
scale_color_gradient(low = 'blue', high = 'red')+
labs(title = "Contour Plot of f(x) with Scatter Points using MH", x='x1', y='x2') +
theme_minimal()
h1 <- ggplot()+
geom_histogram(data = x_MH_3, aes(x=x1, y = after_stat(density)),
binwidth = 0.5, fill = 'orange', color = 'black') +
labs(title = "Histogram of x1 using MH", x = 'x1', y = 'freq')
h2 <- ggplot()+
geom_histogram(data = x_MH_3, aes(x=x2, y=after_stat(density) ),
binwidth = 0.5, fill = 'orange', color='black') +
labs(title = "Histogram of x2 using MH", x = 'x2', y = 'freq')
grid.arrange(h1,h2, nrow =1)
par(mfrow = c(1,1))
Although the acceptance ratio of the MH algorithm is relatively low, the generated samples reflect the original distribution \(f(x)\) much better than those from HMC and MALA. This is because MH does not rely on the gradient of the density, and its wide proposal (sd = 5) lets the chain jump directly between modes.
2. Approximate Bayesian computation MCMC code for the susceptible-infected-removed (SIR) model
Model Definition:
• \(N\) : Population size.
• \(y\) : Total number of infectives out of the initial susceptible population of size \(N\).
• \(S(t), I(t), R(t)\) : susceptibles, infectives and recovered individuals at time \(t\), respectively.
• \(S(t) + I(t) + R(t) = N\) for all \(t\). Initially \((S(0), I(0),R(0)) = (N − 1, 1, 0)\)
• \(\beta,\gamma\): Infection rate and recovery rate, respectively. \[\beta \sim \text{Exp}(1),\quad \gamma = 1\]
• Dynamics of the process \(\{(S(t), I(t)) : t \ge 0\}\) are as follows:
- If chain currently at \((S(t), I(t))\), it can jump to \[\begin{cases}(S(t) − 1, I(t) + 1) &\text{(infection) at rate } \beta S(t)I(t)/N \\ (S(t), I(t) − 1) & \text{(removal) at rate } \gamma I(t)\end{cases}\]
- Thus time spent in \(S(t), I(t)\) is \[T_{S,I} \sim \text{Exp}\Bigl(\frac{\beta S(t)I(t)}{N} + \gamma I(t) \Bigr)\]
- When chain leaves \((S(t), I(t))\), then it can jump to \[\begin{cases} (S(t) − 1, I(t) + 1) & \text{with prob } [\beta S(t)]/[\beta S(t) + N\gamma] \\ (S(t), I(t) − 1) &\text{with prob } [N\gamma]/[\beta S(t) + N\gamma] \end{cases}\]
(a) Implement ABC-MCMC to estimate Infection rate \((\beta)\) for the SIR model (hint. \(N = 100, y = 70\) and \(S(y) = |y' − y|\) where \(y'\) is simulated value).
To clarify the notation: \(y\) represents the total number of individuals who have ever become infected over the course of the outbreak, while \(I(t)\) denotes the number of currently infectious individuals at time \(t\).
1) SIR model
The main point of ‘SIR_model’ is that the probability that the next event is an infection is \(\frac{\beta S(t)}{\beta S(t) + N\gamma}\), while that of a removal is \(\frac{N\gamma}{\beta S(t) + N\gamma}\). The \(\beta\) (as well as \(\gamma\)) stays fixed until the final \(y^*\) is generated. When an infection occurs, 1 is subtracted from \(S\) and 1 is added to \(I\). When a removal occurs, \(I\) decreases to \(I-1\) and \(R\) increases by 1. The following ‘SIR_model’ function reflects all of these points.
rm(list = ls())
SIR_model <- function(beta, gamma = 1, N = 100, initial_I = 1, initial_R = 0) {
S <- N - initial_I - initial_R
I <- initial_I
R <- initial_R
y_star <- I # Cumulative number of infections
while (I > 0) {
infection_rate <- beta * S * I / N
recovery_rate <- gamma * I
total_rate <- infection_rate + recovery_rate
if (runif(1) < infection_rate / total_rate) {
# Infection
S <- S - 1
I <- I + 1
y_star <- y_star + 1
} else {
# Recovery
I <- I - 1
R <- R + 1
}
}
return(y_star)
}
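‘SIR_model’ simulates only the embedded jump chain, which suffices here because \(y^*\) does not depend on the holding times. If the trajectory in time were needed, the exponential holding times \(T_{S,I}\) from the model definition could be added as in the following sketch (the function name ‘sir_gillespie’ is ours, not part of the assignment code):

```r
# Hedged sketch: the jump chain plus exponential holding times T_{S,I}.
sir_gillespie <- function(beta, gamma = 1, N = 100) {
  S <- N - 1; I <- 1; t <- 0
  while (I > 0) {
    rate_inf <- beta * S * I / N            # infection rate
    rate_rec <- gamma * I                   # removal rate
    t <- t + rexp(1, rate_inf + rate_rec)   # holding time in (S, I)
    if (runif(1) < rate_inf / (rate_inf + rate_rec)) {
      S <- S - 1; I <- I + 1                # infection
    } else {
      I <- I - 1                            # removal
    }
  }
  c(final_size = N - S, duration = t)       # total infections, epidemic length
}
set.seed(1)
sir_gillespie(beta = 2)
```

Note that the jump probability \(\texttt{rate\_inf}/(\texttt{rate\_inf}+\texttt{rate\_rec})\) simplifies to \(\beta S/(\beta S + N\gamma)\), the same probability used in ‘SIR_model’.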
2) ABC MCMC
The ABC MCMC algorithm for this question is as follows:
- Using the current \(\beta\), propose a new one \(\beta^* \sim q(\cdot|\beta)\)
- Generate \(y^*\) from the SIR model using \(\beta^*\)
- If \(|y - y^*|>\epsilon\), stay at the current \(\beta\) and go back to step 1
- If not, obtain the ratio \[R_{ABC} = \frac{p(\beta^*)q(\beta | \beta^*)}{p(\beta)q(\beta^*|\beta)},\] which does not involve the likelihood and uses only the prior \(p\) and the proposal \(q\). Then, accept the proposed \(\beta^*\) with probability \(\min\{1, R_{ABC}\}\).
In the ‘abc_mcmc’ function, the proposal density is log-normal: \(\beta^* \sim \text{LogNormal}(\beta, \sigma^2)\), i.e. ‘rlnorm’ with ‘meanlog’ equal to the current \(\beta\). The prior is the \(\text{Exp}(1)\) distribution given above.
abc_mcmc <- function(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd) {
# Initial values
beta_samples <- numeric(n_iter)
acc <- 0
for (i in 1:n_iter) {
beta_star <- rlnorm(1, meanlog = beta, sdlog = sd)
y_star <- SIR_model(beta_star, gamma, N, initial_I = 1, initial_R = 0)
if (abs(y_star - y_obs) <= ep) {
log_acc_ratio <- dexp(beta_star, rate = beta_param, log = TRUE) +
dlnorm(beta, meanlog = beta_star, sdlog = sd, log = TRUE) -
dexp(beta, rate = beta_param, log = TRUE) -
dlnorm(beta_star, meanlog = beta, sdlog = sd, log = TRUE)
if (log(runif(1)) < log_acc_ratio) {
beta_samples[i] <- beta_star
acc <- acc + 1
beta <- beta_star
} else {
beta_samples[i] <- beta
}
} else {
# Reject and stay at the previous beta
beta_samples[i] <- beta
}
}
result <- list(beta_samples = beta_samples, acc_rate = acc/n_iter)
return(result)
}
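The Hastings correction in ‘log_acc_ratio’ (the two ‘dlnorm’ terms) is necessary because the log-normal proposal is not symmetric, as a quick numerical check shows (the values 1, 2 and 3 below are arbitrary illustrations):

```r
# The log-normal proposal density is asymmetric: q(b* | b) != q(b | b*),
# so the forward and backward proposal densities do not cancel in the ratio.
b <- 1; b_star <- 2; s <- 3
q_forward  <- dlnorm(b_star, meanlog = b, sdlog = s)  # q(beta* | beta)
q_backward <- dlnorm(b, meanlog = b_star, sdlog = s)  # q(beta | beta*)
c(q_forward, q_backward)  # the two values clearly differ
```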
3) other initial settings
Now, we set the initial values as follows. ‘y_obs’ is the observed total number of infectives, ‘ep’ corresponds to \(\epsilon\), ‘beta_param’ is the rate parameter of the exponential prior \(\text{Exp}(1)\), and ‘sd’ is the \(\sigma\) of the log-normal proposal.
set.seed(2023311161)
N <- 100
y_obs <- 70
gamma <- 1
ep <- 10
n_iter <- 50000
beta <- 1 # initial beta
beta_param <- 1
sd <- 3
burnin <- 10000
4) result
The posterior samples of \(\beta\) are in the first element of the ‘result_a’ list, and the second element gives the acceptance ratio, which is about 1%.
result_a <- abc_mcmc(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd)
beta_samples_a <- result_a[[1]]
result_a[[2]] # acc rate
## [1] 0.01214
The posterior mean and median are about 1.7 ~ 1.8, and the traceplot shows the chain oscillating around 1.7. The histogram shows the same result.
mean(beta_samples_a[burnin:n_iter]);median(beta_samples_a[burnin:n_iter])
## [1] 1.803883
## [1] 1.746795
ts.plot(beta_samples_a[burnin:n_iter],
main = bquote("Traceplot of " ~ beta),
ylab = bquote(~ beta))
hist(beta_samples_a[burnin:n_iter] ,freq = FALSE,
main = bquote("Histogram of " ~ beta),
xlab = bquote(~ beta))
(b) Change model setting - Use another distribution for \(\beta\), Change observed \(y\) and \(N\), Change recovery rate \(\gamma\) - and discuss the results.
1) \(\beta \sim \text{LogNormal}(2,1^2)\)
Instead of the exponential distribution, \(\beta\) is now given a log-normal prior with parameters (2, 1). Like the exponential, the log-normal is supported on the positive reals, so it is a valid choice of prior for \(\beta\). The ‘log_acc_ratio’ in the ‘abc_mcmc_b’ function uses the log-normal prior instead of the exponential distribution.
abc_mcmc_b <- function(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd) {
# Initial values
beta_samples <- numeric(n_iter)
acc <- 0
for (i in 1:n_iter) {
beta_star <- rlnorm(1, meanlog = beta, sdlog = sd)
y_star <- SIR_model(beta_star, gamma, N, initial_I = 1, initial_R = 0)
if (abs(y_star - y_obs) <= ep) {
log_acc_ratio <- dlnorm(beta_star, meanlog = beta_param, sdlog = 1, log = TRUE) +
dlnorm(beta, meanlog = beta_star, sdlog = sd, log = TRUE) -
dlnorm(beta, meanlog = beta_param, sdlog = 1, log = TRUE) -
dlnorm(beta_star, meanlog = beta, sdlog = sd, log = TRUE)
if (log(runif(1)) < log_acc_ratio) {
beta_samples[i] <- beta_star
acc <- acc + 1
beta <- beta_star
} else {
beta_samples[i] <- beta
}
} else {
beta_samples[i] <- beta
}
}
result <- list(beta_samples = beta_samples, acc_rate = acc/n_iter)
return(result)
}
All of the initial values are the same, except for ‘beta_param’, which is now the log-mean of the log-normal prior.
set.seed(2023311161)
N <- 100
y_obs <- 70
gamma <- 1
ep <- 10
n_iter <- 50000
beta <- 1
beta_param <- 2
sd <- 3
burnin <- 10000
The result is in the first element of the list ‘result_b1’, and it returns about 1% acceptance ratio.
result_b1 <- abc_mcmc_b(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd)
beta_samples_b1 <- result_b1[[1]]
result_b1[[2]] # acc rate
## [1] 0.01316
The posterior mean and median are about 1.95, and the trace plot and histogram show consistent results.
mean(beta_samples_b1[burnin:n_iter]);median(beta_samples_b1[burnin:n_iter])
## [1] 1.968705
## [1] 1.950626
ts.plot(beta_samples_b1[burnin:n_iter],
main = bquote("Traceplot of " ~ beta ~ "with log-normal prior"),
ylab = bquote(~ beta))
hist(beta_samples_b1[burnin:n_iter] ,freq = FALSE,
main = bquote("Histogram of " ~ beta ~ "with log-normal prior"),
xlab = bquote(~ beta))
In summary, after switching to the log-normal prior for \(\beta\), the posterior mean and median, as well as the acceptance ratio, are slightly higher than with the exponential prior.
2) \(y = 90\)
From this section on, we use the ‘abc_mcmc’ function again, which uses the exponential prior for \(\beta\). Now, the observed total number of infectives is 90.
set.seed(2023311161)
N <- 100
y_obs <- 90
gamma <- 1
ep <- 10
n_iter <- 50000
beta <- 1
beta_param <- 1
sd <- 3
burnin <- 10000
The ABC MCMC algorithm gives an acceptance ratio of about 3.9%, much higher than in the case of \(y=70\).
result_b2 <- abc_mcmc(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd)
beta_samples_b2 <- result_b2[[1]]
result_b2[[2]] # acc rate
## [1] 0.03926
The posterior mean and median are about 2.6 ~ 3.0, much higher than in the case \(y=70\). The trace plot oscillates around 3, and the histogram also has its highest frequency around 3.
mean(beta_samples_b2[burnin:n_iter]);median(beta_samples_b2[burnin:n_iter])
## [1] 2.973725
## [1] 2.681776
ts.plot(beta_samples_b2[burnin:n_iter],
main = bquote("Traceplot of " ~ beta ~ "when y = 90"), ylab = bquote(~ beta))
hist(beta_samples_b2[burnin:n_iter] ,freq = FALSE,
main = bquote("Histogram of " ~ beta ~ "when y = 90"), xlab = bquote(~beta))
3) \(N = 200\)
Now, the total population size is increased from 100 to 200.
set.seed(2023311161)
N <- 200
y_obs <- 70
gamma <- 1
ep <- 10
n_iter <- 50000
beta <- 1
beta_param <- 1
sd <- 3
burnin <- 10000
The acceptance ratio drops sharply, to about 0.25%.
result_b3 <- abc_mcmc(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd)
beta_samples_b3 <- result_b3[[1]]
result_b3[[2]]
## [1] 0.00258
The posterior mean and median are around 1.24. The trace plot shows that proposals are rejected most of the time.
mean(beta_samples_b3[burnin:n_iter]);median(beta_samples_b3[burnin:n_iter])
## [1] 1.240804
## [1] 1.238706
ts.plot(beta_samples_b3[burnin:n_iter],
main = bquote("Trace plot of " ~ beta ~ "when N = 200"), ylab = bquote(~beta))
hist(beta_samples_b3[burnin:n_iter] ,freq = FALSE,
main = bquote("Histogram of " ~ beta ~ "when N = 200"), xlab = bquote(~beta))
4) \(\gamma = 1/2\)
Now, the recovery rate is decreased to 0.5.
set.seed(2023311161)
N <- 100
y_obs <- 70
gamma <- 1/2
ep <- 10
n_iter <- 50000
beta <- 1
beta_param <- 1
sd <- 3
burnin <- 10000
The acceptance ratio is around 1.4%.
result_b4 <- abc_mcmc(y_obs, N, gamma, ep, n_iter, beta, beta_param, sd)
beta_samples_b4 <- result_b4[[1]]
result_b4[[2]]
## [1] 0.01486
The posterior mean and median are around 0.9 ~ 0.93. The traceplot shows the posterior oscillates around 0.9, and the histogram has the highest frequency around 0.9 as well.
mean(beta_samples_b4[burnin:n_iter]);median(beta_samples_b4[burnin:n_iter])
## [1] 0.9247872
## [1] 0.9073891
ts.plot(beta_samples_b4[burnin:n_iter],
main = bquote("Trace plot of "~ beta ~ "when" ~ gamma ~ "= 1/2"), ylab = bquote(~beta))
hist(beta_samples_b4[burnin:n_iter] ,freq = FALSE,
main = bquote("Histogram of " ~ beta ~ "when" ~ gamma ~ " = 1/2"),xlab = bquote(~beta))