Q1. Consider a linear mixed model: For \(y_{i,j} \in\mathbb R, X_{i,j} \in \mathbb R^{1\times p}, \beta \in \mathbb R^p, \tau^2 \in (0,\infty)\) and \(\sigma^2 \in (0,\infty)\), \[y_{i,j} = X_{i,j}\beta + \gamma_i + \epsilon_{i,j},\quad \gamma_i \sim N(0, \tau^2), \quad\epsilon_{i,j} \sim N(0, \sigma^2),\] \[\quad j = 1, \dots,n_i,\quad i = 1, \dots,m.\]

(a) Write down the model in matrix notation as \[y = X\beta + Z\gamma + \epsilon, \gamma \sim N_m(0_m, \tau^2I_m), \epsilon \sim N_N(0_N, \sigma^2I_N),\] by specifying matrices \(X \in \mathbb R^{N\times p}\) and \(Z \in \mathbb R^{N\times m}\), where \(N = \sum_{i=1}^mn_i\).

\(X_{i,j}\) is the row of \(X\) corresponding to observation \(j\) of group \(i\), and \(\gamma_i\) is the \(i\)th element of \(\gamma \sim N_m(0,\tau^2 I_m)\). The design matrix \(Z\) is block diagonal, with \(i\)th diagonal block \(1_{n_i} \in \mathbb R^{n_i}\) (a vector of ones) and zeros elsewhere. That is, \[\begin{aligned} Z = \begin{bmatrix}1_{n_1} & & & \\ & 1_{n_2} & & \\ & & \ddots & \\ & & & 1_{n_m} \end{bmatrix}. \end{aligned}\]

(b) Derive the marginal distribution of \(y\) integrating \(\gamma\) out of the model. Specify the mean and covariance of the marginal distribution.

For starters, we have \((y|\gamma) \sim N_N(X\beta + Z\gamma, \sigma^2 I_N)\) and \(\gamma \sim N_m(0, \tau^2 I_m)\). Using these facts, \[\mathbb E[y] = \mathbb E_{\gamma}[\mathbb E[y|\gamma]] = \mathbb E_{\gamma}[X\beta+Z\gamma] = X\beta,\] and \[\begin{aligned}\text{Var}[y] &= \mathbb E_{\gamma}[\text{Var}[y|\gamma]] + \text{Var}_{\gamma}[\mathbb E[y|\gamma]] \\ &= \mathbb E_{\gamma}[\sigma^2 I_N] + \text{Var}_{\gamma}[X\beta+Z\gamma] \\ &= \sigma^2 I_N + \tau^2ZZ^T. \end{aligned}\] So, our purpose is to show that \(y \sim N_N(X\beta, \sigma^2 I_N + \tau^2ZZ^T)\).
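Before the formal derivation, the mean and covariance can be checked by simulation. A quick sketch (sizes and parameter values below are chosen arbitrarily for illustration):

```r
# Monte Carlo check of E[y] = X beta and Var[y] = sigma^2 I + tau^2 Z Z^T
# for a tiny example (m = 2 groups, n_i = 2 observations each).
set.seed(1)
m <- 2; ni <- 2; N <- m * ni; p <- 2
X <- matrix(rnorm(N * p), ncol = p); beta <- c(1, -1)
Z <- kronecker(diag(1, m), rep(1, ni))   # block-diagonal Z as in (a)
sigma2 <- 0.5; tau2 <- 0.5

sims <- replicate(50000, {
  gamma <- rnorm(m, 0, sqrt(tau2))
  as.numeric(X %*% beta + Z %*% gamma + rnorm(N, 0, sqrt(sigma2)))
})

max(abs(rowMeans(sims) - X %*% beta))                              # close to 0
max(abs(cov(t(sims)) - (sigma2 * diag(N) + tau2 * Z %*% t(Z))))    # close to 0
```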

\[\begin{aligned} p(y,\gamma) &= p(y|\gamma)\,p(\gamma) \\ &\propto \exp\Bigl\{-\frac{1}{2}(\underbrace{y-X\beta}_{:=\mu}-Z\gamma)^T(\sigma^2 I_N)^{-1}(y-X\beta-Z\gamma)\Bigr\}\exp\Bigl\{-\frac{1}{2}\gamma^T(\tau^2 I_m)^{-1}\gamma\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}(\mu-Z\gamma)^T(\mu-Z\gamma) - \frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\gamma^TZ^TZ\gamma - \frac{1}{2\tau^2}\gamma^T\gamma + \frac{1}{\sigma^2}\mu^TZ\gamma - \frac{1}{2\sigma^2}\mu^T\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\gamma^T\underbrace{\Bigl(\frac{1}{\sigma^2}Z^TZ + \frac{1}{\tau^2}I_m\Bigr)}_{:=\Sigma^{-1}}\gamma + \gamma^T\Sigma^{-1}\,\frac{\Sigma Z^T\mu}{\sigma^2} - \frac{1}{2\sigma^2}\mu^T\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr) + \frac{1}{2}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr) - \frac{1}{2\sigma^2}\mu^T\mu\Bigr\} \\ &= \underbrace{\exp\Bigl\{-\frac{1}{2}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)\Bigr\}}_{\propto\ \text{pdf of } N_m(\Sigma Z^T\mu/\sigma^2,\,\Sigma)}\, \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)\Bigr\}, \end{aligned}\] where the completion of the square uses \(\Sigma^{-1}(\Sigma Z^T\mu/\sigma^2) = Z^T\mu/\sigma^2\).

Since \(p(y)\) is obtained by integrating \(p(y,\gamma)\) with respect to \(\gamma\), we have

\[\begin{aligned} p(y) &\propto \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2\sigma^4}\mu^TZ\Sigma Z^T\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2\sigma^4}\mu^TZ\Bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\Bigr)^{-1} Z^T\mu\Bigr\}. \end{aligned}\]

Recall that our goal is to show \(p(y) \propto \exp\{-\frac{1}{2}\mu^T(\sigma^2I_N + \tau^2ZZ^T)^{-1}\mu\}\). Now, the Woodbury matrix identity gives \[\begin{aligned} \Bigl(\sigma^2I_N + \tau^2ZZ^T\Bigr)^{-1} &= \frac{1}{\sigma^2}I_N - \frac{1}{\sigma^2}I_NZ\Bigl(\frac{1}{\tau^2}I_m + Z^T\frac{1}{\sigma^2}I_NZ\Bigr)^{-1}Z^T\frac{1}{\sigma^2}I_N \\ &= \frac{1}{\sigma^2}I_N - \frac{1}{\sigma^4}Z\Bigl(\frac{1}{\tau^2}I_m + \frac{1}{\sigma^2}Z^TZ\Bigr)^{-1}Z^T. \end{aligned}\]
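As an aside, the Woodbury inversion can be verified numerically in R for a small case (the sizes below are chosen arbitrarily for illustration):

```r
# Numerical check of (sigma^2 I + tau^2 Z Z^T)^{-1}
#   = I/sigma^2 - Z (I_m/tau^2 + Z^T Z/sigma^2)^{-1} Z^T / sigma^4
sigma2 <- 0.5; tau2 <- 0.5
m <- 3; ni <- 2; N <- m * ni
Z <- kronecker(diag(1, m), rep(1, ni))   # block-diagonal Z as in (a)

lhs <- solve(sigma2 * diag(N) + tau2 * Z %*% t(Z))
rhs <- diag(N)/sigma2 -
  (1/sigma2^2) * Z %*% solve(diag(m)/tau2 + t(Z) %*% Z / sigma2) %*% t(Z)

max(abs(lhs - rhs))   # numerically zero
```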

Thus, we conclude that \[\begin{aligned} p(y) &\propto \exp\Bigl\{-\frac{1}{2}\mu^T\Bigl[\frac{1}{\sigma^2}I_N - \frac{1}{\sigma^4}Z\Bigl(\frac{1}{\tau^2}I_m+\frac{1}{\sigma^2}Z^TZ\Bigr)^{-1} Z^T\Bigr]\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\mu^T\Bigl(\sigma^2I_N + \tau^2ZZ^T\Bigr)^{-1} \mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}(y-X\beta)^T\Bigl(\sigma^2I_N + \tau^2ZZ^T\Bigr)^{-1} (y-X\beta)\Bigr\}, \end{aligned}\] which means \[\begin{aligned} y \sim N_N\Bigl(X\beta, \sigma^2I_N + \tau^2ZZ^T\Bigr). \end{aligned}\]

(c) Assume that \(X\) is of full column rank. With an improper uniform prior \(p(\beta, \sigma, \tau) \propto 1\), construct a Gibbs sampler for \((\beta, \gamma, \sigma^2, \tau^2)\) by deriving the following full conditional distribution:

- \(p(\beta|\gamma, \sigma^2, \tau^2, y)\)
- \(p(\gamma|\beta, \sigma^2, \tau^2, y)\)
- \(p(\sigma^2|\beta, \gamma, \tau^2, y)\)
- \(p(\tau^2|\beta, \gamma,\sigma^2, y)\)

The joint posterior of the four parameters factorizes as follows:

\[\begin{aligned} p(\beta, \gamma, \sigma^2, \tau^2|y) &\propto p(y|\beta, \gamma, \sigma^2, \tau^2)p(\beta, \gamma, \sigma^2, \tau^2) \\ &= p(y|\beta, \gamma, \sigma^2, \tau^2)p(\beta, \sigma^2, \tau^2)p(\gamma|\beta,\sigma^2,\tau^2) \\ &= p(y|\beta, \gamma, \sigma^2, \tau^2)p(\beta, \sigma^2, \tau^2)p(\gamma|\tau^2). \end{aligned}\]

Going back to the question (a), we have \[\begin{aligned}y = X\beta + Z\gamma + \epsilon,\end{aligned}\] where \[\begin{aligned}\epsilon \sim N_N(0,\sigma^2I_N).\end{aligned}\]

When \(\beta\), \(\gamma\), and \(\sigma^2\) are fixed, \(y\) follows a normal distribution with mean \(X\beta + Z\gamma\) and covariance \(\sigma^2I_N\): \[\begin{aligned}(y|\beta,\gamma,\sigma^2) \sim N_N(X\beta + Z\gamma,\sigma^2I_N).\end{aligned}\]

The prior is stated for \((\beta, \sigma, \tau)\), so the prior for \((\beta, \sigma^2, \tau^2)\) follows by a change of variables: \[\begin{aligned} p(\beta, \sigma^2, \tau^2) &= p(\beta, \sigma, \tau)|J|,\end{aligned}\] where \[\begin{aligned} J &= \begin{bmatrix} \frac{\partial \sigma}{\partial \sigma^2} & \frac{\partial \sigma}{\partial \tau^2} \\ \frac{\partial \tau}{\partial \sigma^2} & \frac{\partial \tau}{\partial \tau^2} \end{bmatrix} = \begin{bmatrix} \frac{1}{2\sigma} & 0 \\ 0 & \frac{1}{2\tau} \end{bmatrix} . \end{aligned}\] So, \[\begin{aligned} p(\beta, \sigma^2, \tau^2) = p(\beta, \sigma, \tau)\frac{1}{4\sigma\tau} \propto \frac{1}{\sigma\tau} . \end{aligned}\]

The density \(p(\gamma|\tau^2)\) is already given: \((\gamma|\tau^2) \sim N_m(0, \tau^2 I_m)\), so the posterior can be written as

\[\begin{aligned}p(\beta, \gamma, \sigma^2, \tau^2|y) \propto N_N(X\beta + Z\gamma,\sigma^2I_N)\cdot\frac{1}{\sigma\tau}\cdot N_m(0, \tau^2 I_m).\end{aligned}\]

\(\text{1) }\beta\)

\[\begin{aligned} p(\beta|\gamma, \sigma^2, \tau^2, y) &\propto N_N(X\beta + Z\gamma,\sigma^2I_N) \\ &\propto \exp\Bigl\{-\frac{1}{2}(y-X\beta-Z\gamma)^T\frac{1}{\sigma^2}(y-X\beta-Z\gamma)\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\Bigl(\|y-Z\gamma\|^2-2\beta^TX^T(y-Z\gamma) + \beta^TX^TX\beta\Bigr)\Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2}\Bigl(\beta^T\underbrace{\bigl(X^TX/\sigma^2\bigr)}_{:=\Sigma_{\beta}^{-1}}\beta - 2\beta^TX^T(y-Z\gamma)/\sigma^2\Bigr) \Bigr\}\\ &= \exp\Bigl\{-\frac{1}{2}\Bigl(\beta^T\Sigma_{\beta}^{-1}\beta - 2\beta^T\Sigma_{\beta}^{-1}\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2 \Bigr) \Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2}\Bigl(\beta-\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2 \Bigr)^T\Sigma_{\beta}^{-1}\Bigl(\beta-\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2 \Bigr)\Bigr\} \end{aligned}\]

So, \[\begin{aligned} (\beta|\gamma, \sigma^2, \tau^2, y) \sim N_p\bigl(\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2, \Sigma_{\beta}\bigr),\end{aligned}\] where \[\begin{aligned}\Sigma_{\beta}=(X^TX/\sigma^2)^{-1}.\end{aligned}\]

\(\text{2) } \gamma\)

\[\begin{aligned} p(\gamma|\beta,\sigma^2,\tau^2,y) &\propto N_N(X\beta + Z\gamma,\sigma^2I_N)\cdot N_m(0, \tau^2 I_m) \\ &\propto \exp\Bigl\{-\frac{1}{2}(y-X\beta-Z\gamma)^T\frac{1}{\sigma^2}(y-X\beta-Z\gamma)\Bigr\} \exp\Bigl\{-\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= \exp\Bigl\{ -\frac{1}{2\sigma^2}\Bigl(\gamma^TZ^TZ\gamma - 2\gamma^TZ^T(y-X\beta) + \|y-X\beta\|^2\Bigr) -\frac{1}{2\tau^2}\gamma^TI_m\gamma\Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2}\Bigl(\gamma^T\underbrace{\bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\bigr)}_{:= \Sigma_{\gamma}^{-1}}\gamma -2\gamma^TZ^T(y-X\beta)/\sigma^2 \Bigr) \Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\Bigl( \gamma^T\Sigma_{\gamma}^{-1}\gamma -2\gamma^T\Sigma_{\gamma}^{-1}\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2 \Bigr) \Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2} \Bigl( (\gamma-\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2)^T\Sigma_{\gamma}^{-1}(\gamma-\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2) \Bigr) \Bigr\}. \end{aligned}\]

So, \[\begin{aligned}(\gamma| \beta, \sigma^2, \tau^2, y) \sim N_m\bigl(\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2, \Sigma_{\gamma}\bigr),\end{aligned}\] where \[\begin{aligned} \Sigma_{\gamma}=\Bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\Bigr)^{-1}.\end{aligned}\]

\(\text{3) } \sigma^2\)

\[\begin{aligned} p(\sigma^2|\beta,\gamma,\tau^2,y) &\propto N_N(X\beta + Z\gamma,\sigma^2I_N)\cdot\frac{1}{\sigma} \\ &\propto \frac{1}{|\sigma^2 I_N|^{1/2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\|y-X\beta-Z\gamma\|^2\Bigr\}\frac{1}{\sigma} \\ &= \frac{1}{(\sigma^2)^{N/2}}\frac{1}{(\sigma^2)^{1/2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\|y-X\beta-Z\gamma\|^2\Bigr\} \\ &= (\sigma^2)^{-\frac{N+1}{2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\|y-X\beta-Z\gamma\|^2\Bigr\} \\ &= (\sigma^2)^{-(\frac{N-1}{2} + 1)}\exp\Bigl\{-\frac{(N-1)\cdot\|y-X\beta-Z\gamma\|^2/(N-1)}{2\sigma^2}\Bigr\}. \end{aligned}\] So, \[\begin{aligned}(\sigma^2|\beta,\gamma,\tau^2,y) \sim \text{Inv-}\chi^2\Bigl(N-1, \frac{\|y-X\beta-Z\gamma\|^2}{N-1} \Bigr).\end{aligned}\]

\(\text{4) } \tau^2\)

\[\begin{aligned} p(\tau^2|\beta,\gamma,\sigma^2,y) &\propto \frac{1}{\tau}\cdot N_m(0, \tau^2 I_m) \\ &\propto \frac{1}{\tau}\frac{1}{|\tau^2I_m|^{1/2}}\exp\Bigl\{-\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= (\tau^2)^{-\frac{m+1}{2}}\exp\Bigl\{-\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= (\tau^2)^{-(\frac{m-1}{2}+1)}\exp\Bigl\{-\frac{(m-1)\gamma^T\gamma/(m-1)}{2\tau^2}\Bigr\} . \end{aligned}\] So, \[\begin{aligned}(\tau^2|\beta,\gamma,\sigma^2,y) \sim \text{Inv-}\chi^2\Bigl(m-1, \frac{\gamma^T\gamma}{m-1} \Bigr).\end{aligned}\]
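If the extraDistr package is unavailable, a scaled \(\text{Inv-}\chi^2(\nu, s^2)\) draw can also be sketched in base R, using the fact that \(\nu s^2/X \sim \text{Inv-}\chi^2(\nu, s^2)\) when \(X \sim \chi^2_\nu\). The function name below is my own:

```r
# Minimal sketch: draw from a scaled inverse chi-square Inv-chisq(nu, s2)
# using base R only, via nu * s2 / chisq_nu.
rinvchisq_base <- function(n, nu, s2) {
  nu * s2 / rchisq(n, df = nu)
}

set.seed(1)
draws <- rinvchisq_base(1e5, nu = 10, s2 = 0.5)
mean(draws)   # should be near nu*s2/(nu-2) = 0.625
```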

(d) For \(m = 10\), \(n_i = 20\) for all \(i\), \(\beta = (1, 2, 3)^T\), \(\sigma^2 = 0.5\), and \(\tau^2 = 0.5\), generate each element of \(X\) from the standard normal distribution and \(y\) from the above distributional law. Using R or Python, program and run your Gibbs sampler in (c) to obtain the posterior distribution. Draw the marginal posterior distributions for \(\beta\), \(\tau^2\), and \(\sigma^2\) (histograms) and the joint posterior distribution for \((\tau^2, \sigma^2)\) (a scatter plot or a contour plot).

For starters, the packages “MASS” and “extraDistr” should be installed to use the multivariate normal and scaled inverse chi-square distributions.

set.seed(2023311161)

library(MASS)
#install.packages("extraDistr")
library(extraDistr)
## Warning: package 'extraDistr' was built under R version 4.3.3

The following code shows how to generate 10,000 samples using the Gibbs sampler. The block-diagonal matrix \(Z\) can be constructed with the base R function ‘kronecker’, which performs the Kronecker product of two matrices: if \(A\) is an \(r \times s\) matrix and \(B\) is a \(p \times q\) matrix, then the Kronecker product of \(A\) and \(B\) is \[\begin{aligned}A\otimes B = \begin{bmatrix}a_{1,1}B & \dots & a_{1,s}B \\ \vdots & \ddots & \vdots \\ a_{r,1}B &\dots & a_{r,s}B\end{bmatrix},\end{aligned}\] where \(a_{i,j} = [A]_{i,j}\). Taking \(A := I_m\) and \(B := 1_{20}\) gives \[\begin{aligned} I_m\otimes 1_{20} = \begin{bmatrix}1_{20} & \dots & 0_{20} \\ \vdots & \ddots & \vdots \\ 0_{20} &\dots & 1_{20}\end{bmatrix} = Z.\end{aligned}\] The rest of the code is straightforward.

q1_gibbs <- function(iter_num){
  
  # set parameters
  m <- 10;ni <- 20;N <- m * ni;p <- 3
  
  # generate X and y
  beta_0 <- c(1, 2, 3)
  sigma2_0 <- 0.5
  tau2_0 <- 0.5
  
  X <- matrix(rnorm(N * p), ncol = p)
  Z <- kronecker(diag(1, m), rep(1, ni))
  gamma_0 <- rnorm(m, 0, sqrt(tau2_0))
  y <- X %*% beta_0 + Z %*% gamma_0 + rnorm(N, 0, sqrt(sigma2_0))
  
  
  # Initial values for Gibbs sampler and variables for generated values
  beta <- rep(0, p);gamma <- rep(0, m)
  sigma2 <- 1;tau2 <- 1
  
  beta_gibbs <- matrix(0, nrow = iter_num, ncol = p)
  sigma2_gibbs <- rep(0,iter_num)
  tau2_gibbs <- rep(0,iter_num)
  
  # generate Gibbs sampling
  for (i in 1:iter_num) {
    
    # beta
    Cov_beta <- solve(t(X)%*%X/sigma2)
    mu_beta <- Cov_beta%*%(t(X)%*%(y - Z%*%gamma)/sigma2)
    beta <- mvrnorm(1, mu_beta, Cov_beta)
    
    # gamma
    Cov_gamma <- solve(diag(1/tau2, m) + t(Z)%*%Z/sigma2)
    mu_gamma <- Cov_gamma%*%(t(Z)%*%(y - X%*%beta)/sigma2)
    gamma <- mvrnorm(1, mu_gamma, Cov_gamma)
    
    # sigma2 (note: rinvchisq's third argument is the square root of the scale)
    scale_sigma2 <- sum((y - X%*%beta - Z%*%gamma)^2)/(N-1)
    sigma2 <- rinvchisq(1, N-1, sqrt(scale_sigma2))
    
    # tau2
    tau2 <- rinvchisq(1, m-1, sqrt(sum(gamma^2)/(m-1)))
    
    # Store samples
    beta_gibbs[i, ] <- beta
    sigma2_gibbs[i] <- sigma2
    tau2_gibbs[i] <- tau2
  }
  
  result <- as.data.frame(cbind(beta_gibbs, sigma2_gibbs, tau2_gibbs))
  colnames(result) <- c('beta1', 'beta2','beta3','sigma^2','tau^2')
  
  return(result)
  
}

q1_result <- q1_gibbs(10000)

The marginal posterior distribution of \(\beta\) is as follows.

# marginal posterior of beta
par(mfrow = c(1,3))

hist(q1_result$beta1, 
     main = expression( "Posterior distribution of" ~ beta[1] ),
     xlab = expression(~beta[1]) )

hist(q1_result$beta2,
     main = expression( "Posterior distribution of" ~ beta[2] ),
     xlab = expression(~beta[2]) )

hist(q1_result$beta3,
     main = expression( "Posterior distribution of" ~ beta[3] ),
     xlab = expression(~beta[3]) )

The marginal posterior distributions of \(\tau^2, \sigma^2\) and their joint posterior are as follows.

# marginal posterior of tau^2, sigma^2, and their joint posterior
par(mfrow = c(1,2))

hist(q1_result$`tau^2`, 
     main = expression( "Posterior distribution of" ~ tau^2 ),
     xlab = expression(~tau^2) )

hist(q1_result$`sigma^2`,
     main = expression( "Posterior distribution of" ~ sigma^2),
     xlab = expression(~sigma^2) )

par(mfrow = c(1,1))

plot(q1_result$`tau^2`, q1_result$`sigma^2`, 
     main= expression( "Joint Posterior of" ~ tau^2 ~"and"~sigma^2),
     xlab = expression(~tau^2), ylab = expression(~sigma^2))

Q2. For a response variable \(y \in \mathbb R^n\), design matrix \(X \in \mathbb R^{n\times p}\), and regression coefficients \(\beta \in \mathbb R^p\), consider a generalized linear model with likelihood of the form \[p(y|X, \beta, \phi) = \exp\Bigl(\sum_{i=1}^nL(y_i|\eta_i,\phi)\Bigr),\]where \(L\) is the log-likelihood function for the individual observations, \(\eta_i=X_i\beta\) is the linear predictor for individual \(i\) with row vector \(X_i\), and \(\phi\) is a dispersion parameter.

(a) For the maximum likelihood estimator \(\hat\beta \in \mathbb R^p\) of \(\beta\), show that the likelihood can be approximated by the normal distribution with mean \(\hat\beta\) and covariance matrix \(V\), where \[V = \Bigl[X^T \text{diag}\{−L''(y_i|\hat\eta_i, \phi)\}X\Bigr]^{−1},\qquad L''(y_i|\hat\eta_i, \phi) =\frac{d^2}{d\eta_i^2}L(y_i|\eta_i,\phi)\Big|_{\eta_i=\hat\eta_i}\] with \(\hat\eta_i = X_i\hat\beta\).

Let \(\mathcal L(\eta) := \sum_{i=1}^nL(y_i|\eta_i,\phi)\) for simplicity, where \(\eta = (\eta_1,\dots, \eta_n)^T\). We use a second-order Taylor expansion of \(\mathcal L\) at \(\eta = \hat\eta\): \[\mathcal L(\eta) \approx \mathcal L(\hat\eta) + (\eta-\hat\eta)^T\nabla\mathcal L(\hat\eta) + \frac{1}{2}(\eta-\hat\eta)^T\mathbf H_{\mathcal L(\hat\eta)}(\eta-\hat\eta),\] where \(\mathbf H_{\mathcal L(\hat\eta)} := \mathbf H_{\mathcal L(\eta)}|_{\eta=\hat\eta}\). So,

\[\begin{aligned} p(y|X,\beta,\phi) &\approx \exp\Bigl\{ \mathcal L(\hat\eta) + (\eta-\hat\eta)^T\nabla\mathcal L(\hat\eta) + \frac{1}{2}(\eta-\hat\eta)^T\mathbf H_{\mathcal L(\hat\eta)}(\eta-\hat\eta) \Bigr\} \\ &\propto \exp\Bigl\{ (\beta-\hat\beta)^TX^T\nabla\mathcal L(\hat\eta) - \frac{1}{2}(\eta-\hat\eta)^T\bigl(-\mathbf H_{\mathcal L(\hat\eta)}\bigr)(\eta-\hat\eta) \Bigr\} \\ &= \exp\Bigl\{ -\frac{1}{2}(\beta-\hat\beta)^TX^T\bigl(-\mathbf H_{\mathcal L(\hat\eta)}\bigr)X(\beta-\hat\beta) \Bigr\}, \end{aligned}\] where we used \(\eta-\hat\eta = X(\beta-\hat\beta)\), and the linear term vanishes because \(\hat\beta\) is the MLE: \(X^T\nabla\mathcal L(\hat\eta) = \nabla_{\beta}\,\mathcal L(X\beta)\big|_{\beta=\hat\beta} = 0\). Hence, as a function of \(\beta\), the likelihood is approximately proportional to the density of \(N_p\Bigl(\hat\beta, \bigl(X^T\bigl(-\mathbf H_{\mathcal L(\hat\eta)}\bigr)X\bigr)^{-1}\Bigr)\). By the definition of the Hessian matrix, \[\begin{aligned} \bigl[\mathbf H_{\mathcal L(\eta)} \bigr]_{i,j} = \frac{\partial^2\mathcal L}{\partial\eta_i\partial\eta_j} = \begin{cases}\frac{\partial^2L}{\partial \eta_i^2},& i=j \\ \\ 0,& i\neq j,\end{cases} \end{aligned}\] so the Hessian is a diagonal matrix whose \(i\)th diagonal element is \(\frac{\partial^2L}{\partial \eta_i^2}\): \[\mathbf H_{\mathcal L(\hat\eta)} = \text{diag}\{L''(y_i|\hat\eta_i,\phi)\}.\] Thus, the likelihood is approximated by the multivariate (\(p\)-dimensional) normal distribution with mean vector \(\hat\beta\) and covariance matrix \(V := \bigl(X^T\text{diag}\{-L''(y_i|\hat\eta_i,\phi)\}X\bigr)^{-1}\).

(b) For the logistic regression model expressed as \[p(y|\beta) = \prod_{i=1}^n\binom{k_i}{y_i}\Bigl(\frac{e^{\eta_i}}{1+e^{\eta_i}}\Bigr)^{y_i}\Bigl(\frac{1}{1+e^{\eta_i}}\Bigr)^{k_i-y_i},\] find the expression for \(V\) as a function of \(\hat\beta\).

\[\begin{aligned} \mathcal L(\eta) = \sum_{i=1}^n\Bigl\{\log\binom{k_i}{y_i} + y_i\log\frac{e^{\eta_i}}{1+e^{\eta_i}} +(k_i-y_i)\log\frac{1}{1+e^{\eta_i}}\Bigr\} = \sum_{i=1}^nL(y_i|\eta_i). \end{aligned}\]

The first derivative of \(L\) is \[\begin{aligned} \frac{\partial L}{\partial \eta_i} &= y_i\cdot\frac{1}{(\frac{e^{\eta_i}}{1+e^{\eta_i}})}\cdot\frac{ e^{\eta_i}(1+e^{\eta_i}) - e^{\eta_i}e^{\eta_i} }{(1+e^{\eta_i})^2} + (k_i-y_i)\cdot \frac{1}{(\frac{1}{1+e^{\eta_i}})}\cdot\frac{-e^{\eta_i}}{(1+e^{\eta_i})^2} \\ &= y_i\cdot\frac{1}{1+e^{\eta_i}} + (k_i-y_i)\cdot \frac{-e^{\eta_i}}{1+e^{\eta_i}} \\ &= y_i + (-k_i)\frac{e^{\eta_i}+1-1}{1+e^{\eta_i}}\\ &= (y_i-k_i) + \frac{k_i}{1+e^{\eta_i}}, \end{aligned}\] and the second derivative is \[\begin{aligned} \frac{\partial^2 L}{\partial \eta_i^2} &= k_i\frac{-e^{\eta_i}}{(1+e^{\eta_i})^2}. \end{aligned}\] Thus, \[\begin{aligned} \mathbf H_{\mathcal L(\hat\eta)} &= \text{diag}\Bigl\{\frac{-k_ie^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2}\Bigr\} \qquad \text{and} \\ V &=\Bigl(X^T\text{diag}\Bigl\{\frac{k_ie^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2}\Bigr\}X \Bigr)^{-1}. \end{aligned}\]
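As an optional sanity check (my own addition, not part of the solution), the analytic second derivative can be compared with a central finite-difference approximation for a single observation; the values of \(k\), \(y\), and \(\eta\) below are arbitrary:

```r
# Finite-difference check of d^2 L / d eta^2 for the binomial log-likelihood
# L(eta) = y*eta - k*log(1 + e^eta) (up to the constant log choose(k, y)).
L_fun <- function(eta, y, k) y * eta - k * log(1 + exp(eta))

k <- 5; y <- 3; eta <- 0.7; h <- 1e-4
num_d2 <- (L_fun(eta + h, y, k) - 2 * L_fun(eta, y, k) + L_fun(eta - h, y, k)) / h^2
ana_d2 <- -k * exp(eta) / (1 + exp(eta))^2   # the analytic result above

abs(num_d2 - ana_d2)   # small
```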

(c) Let \(n = 500\) and \(\beta = (1, 2)^T\). Generate each element of \(X \in \mathbb R^{500\times2}\) from the standard normal distribution and then generate \(y\in\{0,1\}^{500}\) using the logistic regression model (This means that \(k_i = 1\) for all \(i\)). Find the maximum likelihood estimator \(\hat\beta\) using the glm function in R or using the one equivalent in Python (statsmodels or scikit-learn). With the improper prior \(p(\beta) \propto 1\), obtain the posterior distribution of \(\beta\) using the following three methods:

* Laplace approximation
* Independent Metropolis-Hastings
* Random walk Metropolis-Hastings.

Draw the marginal posteriors of \(\beta\) obtained by the above three methods on a single panel. Compare the posterior distributions. Furthermore, compare the performance between the independent MH and the random walk MH.

We generate a design matrix \(X\) with standard normal entries and the corresponding response \(y\) from the logistic regression model as follows:

rm(list=ls())
set.seed(2023311161)

X <- matrix(rnorm(1000), nrow=500)
b <- c(1,2)
y <- rbinom(500,1, plogis(X%*%b))

The MLE of \(\beta\) is obtained using the ‘glm’ function in R as follows; the variable ‘bhat’ holds the estimate.

model <- glm(y ~ 0 + X[,1] + X[,2], family = binomial) # 0 : no intercept
bhat <- as.numeric(coefficients(model)) 
bhat
## [1] 1.010806 2.056443

The following shows how to obtain the posterior under uniform prior \(p(\beta) \propto 1\).

1) Laplace approximation


The above results of (a) and (b) can be applied so that \[p(\beta|y) \propto p(y|\beta) \approx N_p\Bigl(\hat\beta, \bigl(X^T\text{diag}\Bigl\{\frac{e^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2} \Bigr\}X\bigr)^{-1} \Bigr).\]

The MLE is already obtained right above, so it remains to compute the matrix \(V = \bigl(X^T\text{diag}\Bigl\{\frac{e^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2} \Bigr\}X\bigr)^{-1}\) in R. The vector of diagonal elements is saved in the variable ‘L’, and posterior draws of \(\beta\) are then generated with the ‘mvrnorm’ function.

set.seed(2023311161)
library(MASS)
g_num = 10000

L <- as.numeric(exp(X%*%bhat)/(1+exp(X%*%bhat))^2)
V <- solve(t(X)%*%diag(L)%*%X)

b_post_lap <- as.data.frame(mvrnorm(n=g_num, mu = bhat, Sigma = V))
colnames(b_post_lap) <- c('beta1','beta2')

For reference, the scatter plot of generated \(\beta_1,\beta_2\) are as follows:

plot(b_post_lap$beta1, b_post_lap$beta2,
     main = expression("Posterior Distribution of " * beta * " using Laplace Approximation"),
     xlab = expression("" * beta * "1"),
     ylab = expression("" * beta * "2"),
     col = "blue", pch = 16)

The histogram of \(\beta_1\) and \(\beta_2\) are as follows:

par(mfrow = c(1, 2))

hist(b_post_lap$beta1,
     main=expression("Histogram of "*beta*"1 (Laplace)"),
     xlab = expression("" * beta * "1"),
     col = "skyblue",
     freq = FALSE)

hist(b_post_lap$beta2,
     main=expression("Histogram of "*beta*"2 (Laplace)"),
     xlab = expression("" * beta * "2"),
     col = "darkblue",
     freq = FALSE)

2) Independent Metropolis-Hastings


The proposal density at step \(t\) is independent of the previous state: \(J_t(\beta^*) = N(\hat\beta, V)\), the Laplace approximation to the posterior. The procedure of independent MH is as follows: 1) choose an initial value, say \(\beta_0 := (2,1)^T\); 2) generate a proposal \(\beta^* \sim N(\hat\beta, V)\), whose density is \(\propto \exp\bigl\{-\frac{1}{2}(\beta^*-\hat\beta)^TV^{-1}(\beta^*-\hat\beta)\bigr\}\), and \(u\sim \text{Unif}(0,1)\); 3) if \(u < \min\Bigl\{ 1, \frac{p(\beta^*|y)\exp\{-\frac{1}{2}(\beta_{t-1}-\hat\beta)^TV^{-1}(\beta_{t-1}-\hat\beta)\}}{p(\beta_{t-1}|y)\exp\{-\frac{1}{2}(\beta^*-\hat\beta)^TV^{-1}(\beta^*-\hat\beta)\}} \Bigr\}\), set \(\beta_t := \beta^*\); otherwise, set \(\beta_t := \beta_{t-1}\).

When \(k_i=1\), the likelihood becomes \[p(y|\beta) = \prod_{i=1}^n\Bigl(\frac{e^{X_i\beta}}{1+e^{X_i\beta}}\Bigr)^{y_i}\Bigl(\frac{1}{1+e^{X_i\beta}}\Bigr)^{1-y_i},\qquad y_i\in\{0,1\},\] and under the flat prior the posterior satisfies \(p(\beta|y) \propto p(y|\beta)\).

This function can be implemented in R (‘post_beta’) as follows:

set.seed(2023311161)

post_beta <- function(b,X,y){
  prob_vec <- (exp(X%*%b)/(1+exp(X%*%b)))^y*(1/(1+exp(X%*%b)))^(1-y)
  return(prod(prob_vec))
}
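With \(n = 500\), a product of 500 probabilities can underflow in double precision. A numerically safer variant (my own sketch, not used in the code below) works on the log scale and exponentiates only the difference of log-posteriors inside the acceptance ratio:

```r
# Log-posterior of beta under the flat prior (up to an additive constant).
# Uses sum_i [ y_i*eta_i - log(1 + e^{eta_i}) ], the binomial log-likelihood
# with k_i = 1.
log_post_beta <- function(b, X, y) {
  eta <- as.numeric(X %*% b)
  sum(y * eta - log1p(exp(eta)))
}

# The acceptance ratio would then use
# exp(log_post_beta(b_temp, X, y) - log_post_beta(b_prev, X, y)) in place of
# post_beta(b_temp, X, y) / post_beta(b_prev, X, y).
```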

Using this function and the MH algorithm above, the ‘indep_MH’ function generates draws of \(\beta\) by independent Metropolis-Hastings.

indep_MH <- function(g_num, bhat,X, V){
  betas <- matrix(c(2,1), nrow=1)
  
  for(i in 1:g_num){
    b_temp <- mvrnorm(1,mu = bhat, Sigma = V)
    b_prev <- as.numeric(t(betas[i,])) 
    u <- runif(1)
    
    # acceptance ratio: [p(b*|y) * J(b_prev)] / [p(b_prev|y) * J(b*)]
    acc_prob <- min(1, ( post_beta(b_temp,X,y)*exp(-1/2*t(b_prev-bhat)%*%solve(V)%*%(b_prev-bhat)) )/
                       ( post_beta(b_prev,X,y)*exp(-1/2*t(b_temp-bhat)%*%solve(V)%*%(b_temp-bhat)) ))
    
    if(u<acc_prob){
      betas <- rbind(betas, b_temp)
    }
    else{betas <- rbind(betas,b_prev)}
    
  }
  
  betas <- betas[-1,]
  rownames(betas) <- 1:g_num
  return(betas)
}

We may obtain the posterior using this function, and for reference, the scatter plot is as follows:

b_post_ind <- indep_MH(g_num, bhat, X, V)

par(mfrow = c(1, 1))
plot(b_post_ind[,1], b_post_ind[,2],
     main = expression("Posterior of " * beta * " using Independent M-H"),
     xlab = expression("" * beta * "1"),
     ylab = expression("" * beta * "2"),
     col = "green", pch = 16)

The histogram (marginal posterior) of each \(\beta\) is given as follows:

par(mfrow = c(1, 2))

hist(b_post_ind[,1],
     main=expression("Histogram of "*beta*"1 (Independent MH)"),
     xlab = expression("" * beta * "1"),
     col = "seagreen1",
     freq = FALSE)

hist(b_post_ind[,2],
     main=expression("Histogram of "*beta*"2 (Independent MH)"),
     xlab = expression("" * beta * "2"),
     col = "darkgreen",
     freq = FALSE)

3) Random walk Metropolis-Hastings


The proposal density at step \(t\) depends on the previous state: \(J_t(\beta^*|\beta_{t-1}) = N(\beta_{t-1}, (2.38^2/d)V)\), where \(d = 2\) is the dimension of \(\beta\) and \(2.38^2/d\) is the standard scaling for random-walk proposals. The procedure of random walk MH is as follows: 1) choose an initial value, say \(\beta_0 := (2,1)^T\); 2) generate \(\beta^* \sim N(\beta_{t-1}, (2.38^2/d)V)\) and \(u\sim \text{Unif}(0,1)\); 3) if \(u < \min\Bigl\{ 1, \frac{p(\beta^*|y)}{p(\beta_{t-1}|y)} \Bigr\}\), set \(\beta_t := \beta^*\); otherwise, set \(\beta_t := \beta_{t-1}\).

Step 3) is valid because the proposal is symmetric, so the proposal densities cancel in the acceptance ratio. Given this, and reusing the ‘post_beta’ function, we define the ‘randomwalk_MH’ function in R to generate posterior draws of \(\beta\).

set.seed(2023311161)

randomwalk_MH <- function(g_num, X, V){
  betas <- matrix(c(2,1), nrow=1)
  
  for(i in 1:g_num){
    b_prev <- as.numeric(t(betas[i,])) 
    b_temp <- mvrnorm(1,mu = b_prev, Sigma = 2.38^2/2*V)
    
    u <- runif(1)
    
    acc_prob <- min( 1, post_beta(b_temp,X,y)/post_beta(b_prev,X,y) )
    
    if(u<acc_prob){
      betas <- rbind(betas, b_temp)
    }
    else{betas <- rbind(betas,b_prev)}
    
  }
  
  betas <- betas[-1,]
  rownames(betas) <- 1:g_num
  return(betas)
}

The scatter plot of \(\beta_1,\beta_2\) is as follows:

b_post_rw <- randomwalk_MH(g_num, X, V)
par(mfrow = c(1, 1))
plot(b_post_rw[,1], b_post_rw[,2],
     main = expression("Posterior of " * beta * " using Random Walk M-H"),
     xlab = expression("" * beta * "1"),
     ylab = expression("" * beta * "2"),
     col = "red", pch = 16)

The histogram (marginal posterior) of each \(\beta\) is given as follows:

par(mfrow = c(1, 2))

hist(b_post_rw[,1],
     main=expression("Histogram of "*beta*"1 (Random walk MH)"),
     xlab = expression("" * beta * "1"),
     col = "pink",
     freq = FALSE)

hist(b_post_rw[,2],
     main=expression("Histogram of "*beta*"2 (Random walk MH)"),
     xlab = expression("" * beta * "2"),
     col = "brown",
     freq = FALSE)

4) Comparison of performance


Based on the above results, the independent MH chain mixes quickly here: its proposal \(N(\hat\beta, V)\) closely matches the target posterior, so the acceptance rate is high and successive draws are nearly independent. The random walk MH chain moves locally, occasionally wandering through low-probability values, so its draws are more autocorrelated and it approaches the high-density region more slowly. In short, independent MH is more efficient in this example; note, however, that this advantage relies on the Laplace approximation matching the posterior well, which tends to become harder as the dimension of \(\beta\) grows.
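The mixing difference can be quantified with acceptance rates and lag-1 autocorrelations. A self-contained toy sketch on a one-dimensional standard normal target (all names and settings here are my own, chosen to make the contrast easy to see):

```r
# Toy comparison: independence MH (proposal equal to the target) vs
# random-walk MH on a N(0,1) target; compare lag-1 autocorrelation.
set.seed(1)
n_iter <- 20000
target_log <- function(x) dnorm(x, log = TRUE)

ind_chain <- rw_chain <- numeric(n_iter)
x_ind <- x_rw <- 0
acc_ind <- acc_rw <- 0
for (i in 1:n_iter) {
  # independence MH with proposal N(0,1): ratio is exactly 1, always accepted
  x_star <- rnorm(1)
  r <- exp(target_log(x_star) - target_log(x_ind) +
           dnorm(x_ind, log = TRUE) - dnorm(x_star, log = TRUE))
  if (runif(1) < r) { x_ind <- x_star; acc_ind <- acc_ind + 1 }
  ind_chain[i] <- x_ind

  # random-walk MH with a deliberately small step size 0.5
  x_star <- x_rw + rnorm(1, 0, 0.5)
  r <- exp(target_log(x_star) - target_log(x_rw))
  if (runif(1) < r) { x_rw <- x_star; acc_rw <- acc_rw + 1 }
  rw_chain[i] <- x_rw
}

acc_ind / n_iter                         # 1 (proposal equals target)
acc_rw / n_iter                          # high, but the moves are small
acf(ind_chain, plot = FALSE)$acf[2]      # lag-1 autocorrelation near 0
acf(rw_chain, plot = FALSE)$acf[2]       # lag-1 autocorrelation clearly positive
```

The independence chain produces essentially i.i.d. draws when the proposal matches the target, whereas the random-walk chain's small local moves induce strong autocorrelation; the same mechanism explains the behavior observed for the two posterior samplers above.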

END