Bayesian Statistics HW3
Q1. Consider a linear mixed model: For \(y_{i,j} \in\mathbb R, X_{i,j} \in \mathbb R^{1\times p}, \beta \in \mathbb R^p, \tau^2 \in (0,\infty)\) and \(\sigma^2 \in (0,\infty)\), \[y_{i,j} = X_{i,j}\beta + \gamma_i + \epsilon_{i,j},\quad \gamma_i \sim N(0, \tau^2), \quad\epsilon_{i,j} \sim N(0, \sigma^2),\] \[\quad j = 1, \dots,n_i,\quad i = 1, \dots,m.\]
(a) Write down the model in matrix notation as \[y = X\beta + Z\gamma + \epsilon, \gamma \sim N_m(0_m, \tau^2I_m), \epsilon \sim N_N(0_N, \sigma^2I_N),\] by specifying matrices \(X \in \mathbb R^{N\times p}\) and \(Z \in \mathbb R^{N\times m}\), where \(N = \sum_{i=1}^mn_i\).
\(X_{i,j}\) is the row of \(X\) corresponding to the \(j\)th observation in group \(i\), and \(\gamma_i\) is the \(i\)th element of \(\gamma \sim N_m(0,\tau^2 I_m)\). The design matrix \(Z\) is then block diagonal, with \(i\)th diagonal block \(1_{n_i} \in \mathbb R^{n_i}\) and zeros elsewhere. That is, \[\begin{aligned} Z = \begin{bmatrix}1_{n_1} & & & \\ & 1_{n_2} & & \\ & & \ddots & \\ & & & 1_{n_m} \end{bmatrix}. \end{aligned}\]
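For concreteness, the block structure of \(Z\) can be checked numerically. The following Python sketch (illustrative only, with made-up group sizes, separate from the R code used later) builds \(Z\) for \(m = 3\) groups:

```python
import numpy as np

# Hypothetical small example: m = 3 groups with sizes n_i = (2, 3, 2)
n = [2, 3, 2]
m, N = len(n), sum(n)

# Z has one column per group; the rows of group i carry a 1 in column i
Z = np.zeros((N, m))
row = 0
for i, n_i in enumerate(n):
    Z[row:row + n_i, i] = 1.0
    row += n_i

print(Z)
```

Each row of \(Z\) has exactly one 1, so \(Z\gamma\) assigns the group effect \(\gamma_i\) to every observation in group \(i\).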
(b) Derive the marginal distribution of \(y\) integrating \(\gamma\) out of the model. Specify the mean and covariance of the marginal distribution.
For starters, we have \((y|\gamma) \sim N_N(X\beta + Z\gamma, \sigma^2 I_N)\) and \(\gamma \sim N_m(0, \tau^2 I_m)\). Using these facts, \[\mathbb E[y] = \mathbb E_{\gamma}[\mathbb E_y[y|\gamma]] = \mathbb E_{\gamma}[X\beta+Z\gamma] = X\beta,\] and \[\begin{aligned}\text{Var}[y] &= \mathbb E_{\gamma}[\text{Var}[y|\gamma]] + \text{Var}_{\gamma}\mathbb E[y|\gamma] \\ &= \mathbb E_{\gamma}[\sigma^2 I_N] + \text{Var}_{\gamma}[X\beta+Z\gamma] \\ &= \sigma^2 I_N + \tau^2ZZ^T. \end{aligned}\] So, our purpose is to show that \(y \sim N_N(X\beta, \sigma^2 I_N + \tau^2ZZ^T)\).
\[\begin{aligned} p(y,\gamma) &= p(y|\gamma) p(\gamma) \\ &\propto \exp\Bigl\{-\frac{1}{2}(\underbrace{y-X\beta}_{:=\mu}-Z\gamma)^T(\sigma^2 I_N)^{-1}(y-X\beta-Z\gamma)\Bigr\}\exp\Bigl\{-\frac{1}{2}\gamma^T(\tau^2 I_m)^{-1}\gamma\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}(\mu-Z\gamma)^T(\mu-Z\gamma) - \frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\gamma^T\underbrace{\Bigl(\frac{1}{\sigma^2}Z^TZ + \frac{1}{\tau^2}I_m\Bigr)}_{:=\Sigma^{-1}}\gamma + \frac{1}{\sigma^2}\mu^TZ\gamma - \frac{1}{2\sigma^2}\mu^T\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\gamma^T\Sigma^{-1}\gamma + \gamma^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr) - \frac{1}{2\sigma^2}\mu^T\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr) + \frac{1}{2}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr) - \frac{1}{2\sigma^2}\mu^T\mu\Bigr\} \quad (\text{completing the square in }\gamma)\\ &=\underbrace{\exp\Bigl\{-\frac{1}{2}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\gamma - \frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)\Bigr\}}_{\propto \text{pdf of } N(\Sigma Z^T\mu/\sigma^2,\,\Sigma)} \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)\Bigr\}. \end{aligned}\]
Since \(p(y)\) is obtained by integrating \(p(y,\gamma)\) with respect to \(\gamma\), we have
\[\begin{aligned} p(y) &\propto \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)^T\Sigma^{-1}\Bigl(\frac{\Sigma Z^T\mu}{\sigma^2}\Bigr)\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2\sigma^4}\mu^TZ\Sigma Z^T\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\mu^T\mu + \frac{1}{2\sigma^4}\mu^TZ\Bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\Bigr)^{-1} Z^T\mu\Bigr\}. \end{aligned}\]
Recall that our goal is to show \(p(y) \propto \exp\{-\frac{1}{2}\mu^T(\sigma^2I_N + \tau^2ZZ^T)^{-1}\mu\}\). Now we may use the Woodbury matrix identity, so that \[\begin{aligned} \Bigl(\sigma^2I_N + \tau^2ZZ^T\Bigr)^{-1} &= \frac{1}{\sigma^2}I_N - \frac{1}{\sigma^2}I_NZ\Bigl(\frac{1}{\tau^2}I_m + Z^T\frac{1}{\sigma^2}I_NZ\Bigr)^{-1}Z^T\frac{1}{\sigma^2}I_N \\ &= \frac{1}{\sigma^2}I_N - \frac{1}{\sigma^4}Z\Bigl(\frac{1}{\tau^2}I_m + \frac{1}{\sigma^2}Z^TZ\Bigr)^{-1}Z^T. \end{aligned}\]
Thus, we conclude that \[\begin{aligned} p(y) &\propto \exp\Bigl\{-\frac{1}{2}\mu^T\Bigl[\frac{1}{\sigma^2}I_N - \frac{1}{\sigma^4}Z\Bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\Bigr)^{-1} Z^T\Bigr]\mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\mu^T\Bigl(\sigma^2I_N + \tau^2ZZ^T\Bigr)^{-1} \mu\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}(y-X\beta)^T\Bigl(\sigma^2I_N + \tau^2ZZ^T\Bigr)^{-1} (y-X\beta)\Bigr\}, \end{aligned}\] which means \[\begin{aligned} y \sim N_N\Bigl(X\beta, \sigma^2I_N + \tau^2ZZ^T\Bigr). \end{aligned}\]
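The Woodbury step can be verified numerically. A Python sketch (illustrative sizes, separate from the assignment's R code) compares the expansion against a direct inverse:

```python
import numpy as np

# Illustrative sizes: m = 4 groups, 3 observations each
m, n_i = 4, 3
N = m * n_i
sigma2, tau2 = 0.5, 0.5

# Z = I_m kron 1_{n_i}, as in part (a)
Z = np.kron(np.eye(m), np.ones((n_i, 1)))

direct = np.linalg.inv(sigma2 * np.eye(N) + tau2 * Z @ Z.T)

# Woodbury: (sigma^2 I_N + tau^2 Z Z^T)^{-1}
#         = I_N/sigma^2 - (1/sigma^4) Z (I_m/tau^2 + Z^T Z/sigma^2)^{-1} Z^T
inner = np.linalg.inv(np.eye(m) / tau2 + Z.T @ Z / sigma2)
woodbury = np.eye(N) / sigma2 - Z @ inner @ Z.T / sigma2**2

print(np.allclose(direct, woodbury))  # True
```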
(c) Assume that \(X\) is of full
column rank. With an improper uniform prior \(p(\beta, \sigma, \tau) \propto 1\),
construct a Gibbs sampler for \((\beta,
\gamma, \sigma^2, \tau^2)\) by deriving the following full
conditional distribution:
- \(p(\beta|\gamma, \sigma^2, \tau^2, y)\)
- \(p(\gamma|\beta, \sigma^2, \tau^2,
y)\)
- \(p(\sigma^2|\beta, \gamma,
\tau^2, y)\)
- \(p(\tau^2|\beta,
\gamma,\sigma^2, y)\)
The joint posterior of all four parameters factorizes as follows:
\[\begin{aligned} p(\beta, \gamma, \sigma^2, \tau^2|y) &\propto p(y|\beta, \gamma, \sigma^2, \tau^2)p(\beta, \gamma, \sigma^2, \tau^2) \\ &= p(y|\beta, \gamma, \sigma^2, \tau^2)p(\beta, \sigma^2, \tau^2)p(\gamma|\beta,\sigma^2,\tau^2) \\ &= p(y|\beta, \gamma, \sigma^2, \tau^2)p(\beta, \sigma^2, \tau^2)p(\gamma|\tau^2). \end{aligned}\]
Going back to the question (a), we have \[\begin{aligned}y = X\beta + Z\gamma + \epsilon,\end{aligned}\] where \[\begin{aligned}\epsilon \sim N_N(0,\sigma^2I_N).\end{aligned}\]
When \(\beta, \gamma,\) and \(\sigma^2\) are fixed, \(y\) follows a normal distribution with mean \(X\beta + Z\gamma\) and covariance \(\sigma^2I_N\): \[\begin{aligned}(y|\beta,\gamma,\sigma^2) \sim N_N(X\beta + Z\gamma,\sigma^2I_N).\end{aligned}\]
There is a relationship between \(p(\beta, \sigma, \tau)\) and \(p(\beta, \sigma^2, \tau^2)\). That is, \[\begin{aligned} p(\beta, \sigma^2, \tau^2) &= p(\beta, \sigma, \tau)|J|\end{aligned}\] where \[\begin{aligned} J &= \begin{bmatrix} \frac{\partial \sigma}{\partial \sigma^2} & \frac{\partial \sigma}{\partial \tau^2} \\ \frac{\partial \tau}{\partial \sigma^2} & \frac{\partial \tau}{\partial \tau^2} \end{bmatrix} = \begin{bmatrix} \frac{1}{2\sigma} & 0 \\ 0 & \frac{1}{2\tau} \end{bmatrix} . \end{aligned}\] So, \[\begin{aligned} p(\beta, \sigma^2, \tau^2) = p(\beta, \sigma, \tau)\frac{1}{4\sigma\tau} \propto \frac{1}{\sigma\tau} . \end{aligned}\]
The density \(p(\gamma|\tau^2)\) is already given: \((\gamma|\tau^2) \sim N_m(0, \tau^2 I_m)\), so the posterior can be written as
\[\begin{aligned}p(\beta, \gamma, \sigma^2, \tau^2|y) \propto N_N(X\beta + Z\gamma,\sigma^2I_N)\cdot\frac{1}{\sigma\tau}\cdot N_m(0, \tau^2 I_m).\end{aligned}\]
\(\text{1) }\beta\)
\[\begin{aligned} p(\beta|\gamma, \sigma^2, \tau^2, y) &\propto N_N(X\beta + Z\gamma,\sigma^2I_N) \\ &\propto \exp\Bigl\{-\frac{1}{2}(y-X\beta-Z\gamma)^T\frac{1}{\sigma^2}(y-X\beta-Z\gamma)\Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2\sigma^2}\Bigl(\|y-Z\gamma\|^2-2\beta^TX^T(y-Z\gamma) + \beta^TX^TX\beta\Bigr)\Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2}\Bigl(\beta^T\underbrace{\bigl(X^TX/\sigma^2\bigr)}_{:=\Sigma_{\beta}^{-1}}\beta - 2\beta^TX^T(y-Z\gamma)/\sigma^2\Bigr) \Bigr\}\\ &= \exp\Bigl\{-\frac{1}{2}\Bigl(\beta^T\Sigma_{\beta}^{-1}\beta - 2\beta^T\Sigma_{\beta}^{-1}\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2 \Bigr) \Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2}\Bigl(\beta-\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2 \Bigr)^T\Sigma_{\beta}^{-1}\Bigl(\beta-\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2 \Bigr)\Bigr\} \end{aligned}\]
So, \[\begin{aligned} (\beta|\gamma, \sigma^2, \tau^2, y) \sim N_p\bigl(\Sigma_{\beta}X^T(y-Z\gamma)/\sigma^2, \Sigma_{\beta}\bigr),\end{aligned}\] where \[\begin{aligned}\Sigma_{\beta}=(X^TX/\sigma^2)^{-1}.\end{aligned}\]
\(\text{2) } \gamma\)
\[\begin{aligned} p(\gamma|\beta,\sigma^2,\tau^2,y) &\propto N_N(X\beta + Z\gamma,\sigma^2I_N)\cdot N_m(0, \tau^2 I_m) \\ &\propto \exp\Bigl\{-\frac{1}{2}(y-X\beta-Z\gamma)^T\frac{1}{\sigma^2}(y-X\beta-Z\gamma)\Bigr\} \exp\Bigl\{-\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= \exp\Bigl\{ -\frac{1}{2\sigma^2}\Bigl(\gamma^TZ^TZ\gamma - 2\gamma^TZ^T(y-X\beta) + \|y-X\beta\|^2\Bigr) -\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2}\Bigl(\gamma^T\underbrace{\bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\bigr)}_{:= \Sigma_{\gamma}^{-1}}\gamma -2\gamma^TZ^T(y-X\beta)/\sigma^2 \Bigr) \Bigr\} \\ &= \exp\Bigl\{-\frac{1}{2}\Bigl( \gamma^T\Sigma_{\gamma}^{-1}\gamma -2\gamma^T\Sigma_{\gamma}^{-1}\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2 \Bigr) \Bigr\} \\ &\propto \exp\Bigl\{-\frac{1}{2} \Bigl(\gamma-\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2\Bigr)^T\Sigma_{\gamma}^{-1}\Bigl(\gamma-\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2\Bigr) \Bigr\}. \end{aligned}\]
So, \[\begin{aligned}(\gamma| \beta, \sigma^2, \tau^2, y) \sim N_m\bigl(\Sigma_{\gamma}Z^T(y-X\beta)/\sigma^2, \Sigma_{\gamma}\bigr),\end{aligned}\] where \[\begin{aligned} \Sigma_{\gamma}=\Bigl(\frac{1}{\sigma^2}Z^TZ+\frac{1}{\tau^2}I_m\Bigr)^{-1}.\end{aligned}\]
\(\text{3) } \sigma^2\)
\[\begin{aligned} p(\sigma^2|\beta,\gamma,\tau^2,y) &\propto N_N(X\beta + Z\gamma,\sigma^2I_N)\cdot\frac{1}{\sigma} \\ &\propto \frac{1}{|\sigma^2 I_N|^{1/2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\|y-X\beta-Z\gamma\|^2\Bigr\}\frac{1}{\sigma} \\ &= \frac{1}{(\sigma^2)^{N/2}}\frac{1}{(\sigma^2)^{1/2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\|y-X\beta-Z\gamma\|^2\Bigr\} \\ &= (\sigma^2)^{-\frac{N+1}{2}}\exp\Bigl\{-\frac{1}{2\sigma^2}\|y-X\beta-Z\gamma\|^2\Bigr\} \\ &= (\sigma^2)^{-(\frac{N-1}{2} + 1)}\exp\Bigl\{-\frac{(N-1)\cdot\|y-X\beta-Z\gamma\|^2/(N-1)}{2\sigma^2}\Bigr\}. \end{aligned}\] So, \[\begin{aligned}(\sigma^2|\beta,\gamma,\tau^2,y) \sim \text{Inv-}\chi^2\Bigl(N-1, \frac{\|y-X\beta-Z\gamma\|^2}{N-1} \Bigr).\end{aligned}\]
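As a sanity check on this distribution, a draw from \(\text{Scaled-Inv-}\chi^2(\nu, s^2)\) can be generated as \(\nu s^2/W\) with \(W \sim \chi^2_\nu\). The Python sketch below (illustrative values, separate from the R code used later) confirms the sample mean is close to the theoretical mean \(\nu s^2/(\nu-2)\):

```python
import numpy as np

rng = np.random.default_rng(1)
nu, s2 = 199, 0.5  # e.g. nu = N - 1 with N = 200

# If W ~ chi^2_nu, then nu * s2 / W ~ Scaled-Inv-chi^2(nu, s2)
draws = nu * s2 / rng.chisquare(nu, size=200_000)

# theoretical mean: nu * s2 / (nu - 2)
print(draws.mean())
```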
\(\text{4) } \tau^2\)
\[\begin{aligned} p(\tau^2|\beta,\gamma,\sigma^2,y) &\propto \frac{1}{\tau}\cdot N_m(0, \tau^2 I_m) \\ &\propto \frac{1}{\tau}\frac{1}{|\tau^2I_m|^{1/2}}\exp\Bigl\{-\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= (\tau^2)^{-\frac{m+1}{2}}\exp\Bigl\{-\frac{1}{2\tau^2}\gamma^T\gamma\Bigr\} \\ &= (\tau^2)^{-(\frac{m-1}{2}+1)}\exp\Bigl\{-\frac{(m-1)\cdot\gamma^T\gamma/(m-1)}{2\tau^2}\Bigr\} . \end{aligned}\] So, \[\begin{aligned}(\tau^2|\beta,\gamma,\sigma^2,y) \sim \text{Inv-}\chi^2\Bigl(m-1, \frac{\gamma^T\gamma}{m-1} \Bigr).\end{aligned}\]
(d) For \(m = 10\), \(n_i = 20, ^\forall i\), \(\beta = (1, 2, 3)^T\), \(\sigma^2 = 0.5\), and \(\tau^2 = 0.5\), generate each element of \(X\) from the standard normal distribution and \(y\) from the above distributional law. Using R or Python, program and run your Gibbs sampler in (c) to obtain the posterior distribution. Draw the marginal posterior distributions for \(\beta\), \(\tau^2\), and \(\sigma^2\) (histograms) and the joint posterior distribution for \((\tau^2, \sigma^2)\) (a scatter plot or a contour plot).
For starters, the packages “MASS” and “extraDistr” should be installed to use the multivariate normal and scaled inverse chi-squared distributions.
set.seed(2023311161)
library(MASS)
#install.packages("extraDistr")
library(extraDistr)
## Warning: package 'extraDistr' was built under R version 4.3.3
The following code shows how to generate 10000 samples using the Gibbs sampler. The block diagonal matrix \(Z\) can be built with the base R function ‘kronecker’, which performs the Kronecker product of two matrices: if \(A\) is an \(r \times s\) matrix and \(B\) is a \(p \times q\) matrix, then the Kronecker product of \(A\) and \(B\) is \[\begin{aligned}A\otimes B = \begin{bmatrix}a_{1,1}B & \dots & a_{1,s}B \\ \vdots & \ddots & \vdots \\ a_{r,1}B &\dots & a_{r,s}B\end{bmatrix},\end{aligned}\] where \(a_{i,j} = [A]_{i,j}\). Setting \(A := I_m\) and \(B := 1_{20}\) gives \[\begin{aligned} I_m\otimes 1_{20} = \begin{bmatrix}1_{20} & \dots & 0_{20} \\ \vdots & \ddots & \vdots \\ 0_{20} &\dots & 1_{20}\end{bmatrix} = Z.\end{aligned}\] The rest of the code is straightforward.
q1_gibbs <- function(iter_num){
  # set parameters
  m <- 10; ni <- 20; N <- m * ni; p <- 3
  # generate X and y
  beta_0 <- c(1, 2, 3)
  sigma2_0 <- 0.5
  tau2_0 <- 0.5
  X <- matrix(rnorm(N * p), ncol = p)
  Z <- kronecker(diag(1, m), rep(1, ni))
  gamma_0 <- rnorm(m, 0, sqrt(tau2_0))
  y <- X %*% beta_0 + Z %*% gamma_0 + rnorm(N, 0, sqrt(sigma2_0))
  # initial values for the Gibbs sampler and storage for the draws
  beta <- rep(0, p); gamma <- rep(0, m)
  sigma2 <- 1; tau2 <- 1
  beta_gibbs <- matrix(0, nrow = iter_num, ncol = p)
  sigma2_gibbs <- rep(0, iter_num)
  tau2_gibbs <- rep(0, iter_num)
  # run the Gibbs sampler
  for (i in 1:iter_num) {
    # beta
    Cov_beta <- solve(t(X) %*% X / sigma2)
    mu_beta <- Cov_beta %*% (t(X) %*% (y - Z %*% gamma) / sigma2)
    beta <- mvrnorm(1, mu_beta, Cov_beta)
    # gamma
    Cov_gamma <- solve(diag(1 / tau2, m) + t(Z) %*% Z / sigma2)
    mu_gamma <- Cov_gamma %*% (t(Z) %*% (y - X %*% beta) / sigma2)
    gamma <- mvrnorm(1, mu_gamma, Cov_gamma)
    # sigma2 ~ Scaled-Inv-chi^2(N - 1, s^2); rinvchisq's tau is sqrt of the scale
    scale_sigma2 <- sum((y - X %*% beta - Z %*% gamma)^2) / (N - 1)
    sigma2 <- rinvchisq(1, N - 1, sqrt(scale_sigma2))
    # tau2 ~ Scaled-Inv-chi^2(m - 1, gamma'gamma / (m - 1))
    tau2 <- rinvchisq(1, m - 1, sqrt(sum(gamma^2) / (m - 1)))
    # store samples
    beta_gibbs[i, ] <- beta
    sigma2_gibbs[i] <- sigma2
    tau2_gibbs[i] <- tau2
  }
  result <- as.data.frame(cbind(beta_gibbs, sigma2_gibbs, tau2_gibbs))
  colnames(result) <- c('beta1', 'beta2', 'beta3', 'sigma^2', 'tau^2')
  return(result)
}
q1_result <- q1_gibbs(10000)
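As a cross-check of the sampler's logic (a Python/NumPy sketch, not the assignment's R code), the same conjugate updates can be run on freshly simulated data; the posterior means of \(\beta\) should land near the true \((1, 2, 3)^T\):

```python
import numpy as np

rng = np.random.default_rng(2023)

# simulate data with the same layout: m = 10 groups of 20, p = 3
m, n_i, p = 10, 20, 3
N = m * n_i
beta0, sigma2_0, tau2_0 = np.array([1.0, 2.0, 3.0]), 0.5, 0.5

X = rng.standard_normal((N, p))
Z = np.kron(np.eye(m), np.ones((n_i, 1)))
y = X @ beta0 + Z @ rng.normal(0, np.sqrt(tau2_0), m) + rng.normal(0, np.sqrt(sigma2_0), N)

beta, gamma, sigma2, tau2 = np.zeros(p), np.zeros(m), 1.0, 1.0
draws = []
for _ in range(2000):
    # beta | rest
    cov_b = np.linalg.inv(X.T @ X / sigma2)
    beta = rng.multivariate_normal(cov_b @ (X.T @ (y - Z @ gamma) / sigma2), cov_b)
    # gamma | rest
    cov_g = np.linalg.inv(np.eye(m) / tau2 + Z.T @ Z / sigma2)
    gamma = rng.multivariate_normal(cov_g @ (Z.T @ (y - X @ beta) / sigma2), cov_g)
    # sigma2, tau2 | rest: scaled inverse chi-squared via nu * s^2 / chi^2_nu
    resid = y - X @ beta - Z @ gamma
    sigma2 = resid @ resid / rng.chisquare(N - 1)
    tau2 = gamma @ gamma / rng.chisquare(m - 1)
    draws.append(np.concatenate([beta, [sigma2, tau2]]))

draws = np.array(draws)
print(draws[500:].mean(axis=0))  # posterior means of (beta1..3, sigma^2, tau^2)
```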
The marginal posterior distribution of \(\beta\) is as follows.
# marginal posterior of beta
par(mfrow = c(1, 3))
hist(q1_result$beta1,
     main = expression("Posterior distribution of" ~ beta[1]),
     xlab = expression(~beta[1]))
hist(q1_result$beta2,
     main = expression("Posterior distribution of" ~ beta[2]),
     xlab = expression(~beta[2]))
hist(q1_result$beta3,
     main = expression("Posterior distribution of" ~ beta[3]),
     xlab = expression(~beta[3]))
The marginal posterior distributions of \(\tau^2, \sigma^2\) and their joint posterior are as follows.
# marginal posterior of tau^2, sigma^2, and their joint posterior
par(mfrow = c(1, 2))
hist(q1_result$`tau^2`,
     main = expression("Posterior distribution of" ~ tau^2),
     xlab = expression(~tau^2))
hist(q1_result$`sigma^2`,
     main = expression("Posterior distribution of" ~ sigma^2),
     xlab = expression(~sigma^2))
par(mfrow = c(1, 1))
plot(q1_result$`tau^2`, q1_result$`sigma^2`,
     main = expression("Joint Posterior of" ~ tau^2 ~ "and" ~ sigma^2),
     xlab = expression(~tau^2), ylab = expression(~sigma^2))
Q2. For a response variable \(y \in \mathbb R^n\), design matrix \(X \in \mathbb R^{n\times p}\), and regression coefficients \(\beta \in \mathbb R^p\), consider a generalized linear model with likelihood of the form \[p(y|X, \beta, \phi) = \exp\Bigl(\sum_{i=1}^nL(y_i|\eta_i,\phi)\Bigr),\]where \(L\) is the log-likelihood function for the individual observations, \(\eta_i=X_i\beta\) is the linear predictor for individual \(i\) with row vector \(X_i\), and \(\phi\) is a dispersion parameter.
(a) For the maximum likelihood estimator \(\hat\beta \in \mathbb R^p\) of \(\beta\), show that the likelihood can be approximated by the normal distribution with mean \(\hat\beta\) and covariance matrix \(V\), where \[V = \Bigl[X^T \text{diag}\{−L''(y_i|\hat\eta_i, \phi)\}X\Bigr]^{−1},\qquad L''(y_i|\hat\eta_i, \phi) =\frac{d^2}{d\eta_i^2}L(y_i|\eta_i,\phi)\Big|_{\eta_i=\hat\eta_i}\] with \(\hat\eta_i = X_i\hat\beta\).
Let \(\mathcal L(\eta) := \sum_{i=1}^nL(y_i|\eta_i,\phi)\) for simplicity, where \(\eta = (\eta_1,\dots, \eta_n)^T = X\beta \in \mathbb R^n\). A second-order Taylor expansion of \(\mathcal L\) at \(\eta = \hat\eta\) gives \[\mathcal L(\eta) \approx \mathcal L(\hat\eta) + (\eta-\hat\eta)^T\nabla\mathcal L(\hat\eta) + \frac{1}{2}(\eta-\hat\eta)^T\mathbf H_{\mathcal L(\hat\eta)}(\eta-\hat\eta),\] where \(\mathbf H_{\mathcal L(\hat\eta)} := \mathbf H_{\mathcal L(\eta)}|_{\eta=\hat\eta}\). So,
\[\begin{aligned} p(y|X,\beta,\phi) &\approx \exp\Bigl\{ \mathcal L(\hat\eta) + (\eta-\hat\eta)^T\nabla\mathcal L(\hat\eta) + \frac{1}{2}(\eta-\hat\eta)^T\mathbf H_{\mathcal L(\hat\eta)}(\eta-\hat\eta) \Bigr\} \\ &\propto \exp\Bigl\{ -\frac{1}{2}(\eta-\hat\eta)^T\bigl(-\mathbf H_{\mathcal L(\hat\eta)}\bigr)(\eta-\hat\eta) \Bigr\} \\ &=\exp\Bigl\{ -\frac{1}{2}(\beta-\hat\beta)^TX^T\bigl(-\mathbf H_{\mathcal L(\hat\eta)}\bigr)X(\beta-\hat\beta) \Bigr\}, \end{aligned}\] where the linear term drops out because \(\eta - \hat\eta = X(\beta-\hat\beta)\) and \(X^T\nabla\mathcal L(\hat\eta) = 0\) at the MLE \(\hat\beta\). Hence the likelihood, viewed as a function of \(\beta\), is proportional to the density of \(N_p\Bigl(\hat\beta, \bigl(X^T\bigl(-\mathbf H_{\mathcal L(\hat\eta)}\bigr)X\bigr)^{-1}\Bigr)\). By the definition of the Hessian matrix, \[\begin{aligned} \bigl[\mathbf H_{\mathcal L(\eta)} \bigr]_{i,j} = \frac{\partial^2\mathcal L}{\partial\eta_i\partial\eta_j} = \begin{cases}\frac{\partial^2L}{\partial \eta_i^2},& i=j \\ \\ 0,& i\neq j,\end{cases} \end{aligned}\] so the Hessian is diagonal with \(i\)th entry \(\frac{\partial^2L}{\partial \eta_i^2}\); evaluated at \(\hat\eta\), \[\mathbf H_{\mathcal L(\hat\eta)} = \text{diag}\{L''(y_i|\hat\eta_i,\phi)\}.\] Thus, the likelihood is approximated by a multivariate (\(p\)-dimensional) normal distribution with mean vector \(\hat\beta\) and covariance matrix \(V := \bigl(X^T\text{diag}\{-L''(y_i|\hat\eta_i,\phi)\}X\bigr)^{-1}\).
(b) For the logistic regression model expressed as \[p(y|\beta) = \prod_{i=1}^n\binom{k_i}{y_i}\Bigl(\frac{e^{\eta_i}}{1+e^{\eta_i}}\Bigr)^{y_i}\Bigl(\frac{1}{1+e^{\eta_i}}\Bigr)^{k_i-y_i},\] find the expression for \(V\) as a function of \(\hat\beta\).
\[\begin{aligned} \mathcal L(\eta) = \sum_{i=1}^n\Bigl\{\log\binom{k_i}{y_i} + y_i\log\frac{e^{\eta_i}}{1+e^{\eta_i}} +(k_i-y_i)\log\frac{1}{1+e^{\eta_i}}\Bigr\} = \sum_{i=1}^nL(y_i|\eta_i). \end{aligned}\]
The first derivative of \(L\) is as follows: \[\begin{aligned} \frac{\partial L}{\partial \eta_i} &= y_i\cdot\frac{1}{(\frac{e^{\eta_i}}{1+e^{\eta_i}})}\cdot\frac{ e^{\eta_i}(1+e^{\eta_i}) - e^{\eta_i}e^{\eta_i} }{(1+e^{\eta_i})^2} + (k_i-y_i)\cdot \frac{1}{(\frac{1}{1+e^{\eta_i}})}\cdot\frac{-e^{\eta_i}}{(1+e^{\eta_i})^2} \\ &= y_i\cdot\frac{1}{1+e^{\eta_i}} + (k_i-y_i)\cdot \frac{-e^{\eta_i}}{1+e^{\eta_i}} \\ &= y_i + (-k_i)\frac{e^{\eta_i}+1-1}{1+e^{\eta_i}}\\ &= (y_i-k_i) + \frac{k_i}{1+e^{\eta_i}}, \end{aligned}\] and the second derivative goes as follows: \[\begin{aligned} \frac{\partial^2 L}{\partial \eta_i^2} &= k_i\frac{-e^{\eta_i}}{(1+e^{\eta_i})^2}. \end{aligned}\] Thus, \[\begin{aligned} \mathbf H_{\mathcal L(\hat\eta)} &= \text{diag}\Bigl\{\frac{-k_ie^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2}\Bigr\} \qquad \text{and} \\ V &=\Bigl(X^T\text{diag}\Bigl\{\frac{k_ie^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2}\Bigr\}X \Bigr)^{-1}. \end{aligned}\]
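The second-derivative formula can be checked against a central finite difference. A Python sketch with illustrative values of \(\eta_i, y_i, k_i\):

```python
import numpy as np

def L(eta, y, k):
    # binomial log-likelihood term, dropping the constant log C(k, y)
    return y * eta - k * np.log1p(np.exp(eta))

eta, y, k, h = 0.7, 3, 5, 1e-5

# central second difference approximating d^2 L / d eta^2
fd = (L(eta + h, y, k) - 2 * L(eta, y, k) + L(eta - h, y, k)) / h**2

analytic = -k * np.exp(eta) / (1 + np.exp(eta))**2
print(fd, analytic)
```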
(c) Let \(n = 500\) and \(\beta = (1, 2)^T\). Generate each element
of \(X \in \mathbb R^{500\times2}\)
from the standard normal distribution and then generate \(y\in\{0,1\}^{500}\) using the logistic
regression model (This means that \(k_i =
1\) for all \(i\)). Find the
maximum likelihood estimator \(\hat\beta\) using the glm function in R or
using the one equivalent in Python (statsmodels or scikit-learn). With
the improper prior \(p(\beta) \propto
1\), obtain the posterior distribution of \(\beta\) using the following three methods:
* Laplace approximation
* Independent Metropolis-Hastings
* Random walk Metropolis-Hastings.
Draw the marginal
posteriors of \(\beta\) obtained by the
above three methods on a single panel. Compare the posterior
distributions. Furthermore, compare the performance between the
independent MH and the random walk MH.
We may obtain the design matrix \(X\) with standard Gaussian entries and the corresponding response \(y\) from the logistic regression model as follows:
rm(list=ls())
set.seed(2023311161)
X <- matrix(rnorm(1000), nrow=500)
b <- c(1,2)
y <- rbinom(500,1, plogis(X%*%b))
The MLE of \(\beta\) is obtained using the ‘glm’ function in R as follows. The variable ‘bhat’ holds the MLE.
model <- glm(y ~ 0 + X[,1] + X[,2], family = binomial) # 0 : no intercept
bhat <- as.numeric(coefficients(model))
bhat
## [1] 1.010806 2.056443
The following shows how to obtain the posterior under uniform prior \(p(\beta) \propto 1\).
1) Laplace approximation
The above results of (a) and (b) can be applied so that \[p(\beta|y) \propto p(y|\beta) \approx
N_p\Bigl(\hat\beta,
\bigl(X^T\text{diag}\Bigl\{\frac{e^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2}
\Bigr\}X\bigr)^{-1} \Bigr).\]
The MLE is already obtained right above, so it remains to compute the matrix \(V = \bigl(X^T\text{diag}\Bigl\{\frac{e^{\hat\eta_i}}{(1+e^{\hat\eta_i})^2} \Bigr\}X\bigr)^{-1}\) in R. The vector of diagonal elements is stored in the variable ‘L’, and posterior draws of \(\beta\) are then generated with the ‘mvrnorm’ function.
set.seed(2023311161)
library(MASS)
g_num = 10000
L <- as.numeric(exp(X%*%bhat)/(1+exp(X%*%bhat))^2)
V <- solve(t(X)%*%diag(L)%*%X)
b_post_lap <- as.data.frame(mvrnorm(n=g_num, mu = bhat, Sigma = V))
colnames(b_post_lap) <- c('beta1','beta2')
For reference, the scatter plot of the generated \((\beta_1,\beta_2)\) pairs is as follows:
plot(b_post_lap$beta1, b_post_lap$beta2, main = expression("Posterior Distribution of " * beta * " using Laplace Approximation"), xlab = expression("" * beta * "1"), ylab = expression("" * beta * "2"), col = "blue", pch = 16)
The histograms of \(\beta_1\) and \(\beta_2\) are as follows:
par(mfrow = c(1, 2))
hist(b_post_lap$beta1,
     main = expression("Histogram of " * beta * "1 (Laplace)"),
     xlab = expression("" * beta * "1"),
     col = "skyblue",
     freq = FALSE)
hist(b_post_lap$beta2,
     main = expression("Histogram of " * beta * "2 (Laplace)"),
     xlab = expression("" * beta * "2"),
     col = "darkblue",
     freq = FALSE)
2) Independent Metropolis-Hastings
The proposal density at time \(t\) is independent of the previous state: \(J_t(\beta^*) \propto N(\hat\beta, V)\). The procedure of the independent MH is as follows: 1) choose an initial value, say \(\beta_0 := (2,1)^T\); 2) generate a candidate \(\beta^* \sim N(\hat\beta, V)\), whose density is proportional to \(\exp\bigl\{-\frac{1}{2}(\beta^*-\hat\beta)^TV^{-1}(\beta^*-\hat\beta)\bigr\}\), and \(u\sim \text{Unif}(0,1)\); 3) if \(u < \min\Bigl\{ 1, \frac{p(\beta^*|y)\exp\{-\frac{1}{2}(\beta_{t-1}-\hat\beta)^TV^{-1}(\beta_{t-1}-\hat\beta)\}}{p(\beta_{t-1}|y)\exp\{-\frac{1}{2}(\beta^*-\hat\beta)^TV^{-1}(\beta^*-\hat\beta)\}} \Bigr\}\), set \(\beta_t := \beta^*\); otherwise set \(\beta_t := \beta_{t-1}\).
When \(k_i=1\), the posterior under the flat prior is proportional to the likelihood: \[p(\beta|y) \propto \prod_{i=1}^n\Bigl(\frac{e^{X_i\beta}}{1+e^{X_i\beta}}\Bigr)^{y_i}\Bigl(\frac{1}{1+e^{X_i\beta}}\Bigr)^{1-y_i},\qquad y_i\in\{0,1\}.\]
This density (up to normalization) can be implemented by defining a function in R (‘post_beta’) as follows:
set.seed(2023311161)
post_beta <- function(b, X, y){
  # unnormalized posterior of beta under the flat prior (k_i = 1)
  prob_vec <- (exp(X %*% b) / (1 + exp(X %*% b)))^y * (1 / (1 + exp(X %*% b)))^(1 - y)
  return(prod(prob_vec))
}
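One caveat: ‘post_beta’ multiplies 500 probabilities, which can underflow for \(\beta\) far from \(\hat\beta\). A safer variant works on the log scale, with the acceptance test then comparing \(\log u\) to a log-ratio. A Python sketch of the idea (hypothetical helper ‘log_post_beta’, assuming \(k_i = 1\)):

```python
import numpy as np

def log_post_beta(b, X, y):
    # log posterior of beta up to a constant (flat prior, k_i = 1);
    # logaddexp(0, eta) = log(1 + exp(eta)), computed stably
    eta = X @ b
    return float(np.sum(y * eta - np.logaddexp(0.0, eta)))

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))
b_true = np.array([1.0, 2.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ b_true))).astype(float)

print(log_post_beta(b_true, X, y))
```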
Using this function and the above MH algorithm, the ‘indep_MH’ function generates draws of \(\beta\) via independent Metropolis-Hastings.
indep_MH <- function(g_num, bhat, X, V){
  betas <- matrix(c(2, 1), nrow = 1)
  V_inv <- solve(V)
  for(i in 1:g_num){
    b_temp <- mvrnorm(1, mu = bhat, Sigma = V)
    b_prev <- as.numeric(betas[i, ])
    u <- runif(1)
    # acceptance ratio: [p(b*|y) J(b_prev)] / [p(b_prev|y) J(b*)];
    # note the minus sign in both proposal-density exponents
    acc_prob <- min(1, (post_beta(b_temp, X, y) * exp(-1/2 * t(b_prev - bhat) %*% V_inv %*% (b_prev - bhat))) /
                       (post_beta(b_prev, X, y) * exp(-1/2 * t(b_temp - bhat) %*% V_inv %*% (b_temp - bhat))))
    if(u < acc_prob){
      betas <- rbind(betas, b_temp)
    } else {
      betas <- rbind(betas, b_prev)
    }
  }
  betas <- betas[-1, ]
  rownames(betas) <- 1:g_num
  return(betas)
}
We may obtain the posterior using this function, and for reference, the scatter plot is as follows:
b_post_ind <- indep_MH(g_num, bhat, X, V)
par(mfrow = c(1, 1))
plot(b_post_ind[,1], b_post_ind[,2], main = expression("Posterior of " * beta * " using Independent M-H"), xlab = expression("" * beta * "1"), ylab = expression("" * beta * "2"), col = "green", pch = 16)
The histogram (marginal posterior) of each \(\beta\) is given as follows:
par(mfrow = c(1, 2))
hist(b_post_ind[,1],
     main = expression("Histogram of " * beta * "1 (Independent MH)"),
     xlab = expression("" * beta * "1"),
     col = "seagreen1",
     freq = FALSE)
hist(b_post_ind[,2],
     main = expression("Histogram of " * beta * "2 (Independent MH)"),
     xlab = expression("" * beta * "2"),
     col = "darkgreen",
     freq = FALSE)
3) Random walk Metropolis-Hastings
The proposal density at time \(t\) depends on the previous state: \(J_t(\beta^*|\beta_{t-1}) \propto N(\beta_{t-1}, (2.38^2/d)V)\), where \(d = 2\) is the dimension of \(\beta\). The procedure of the random walk MH is as follows: 1) choose an initial value, say \(\beta_0 := (2,1)^T\); 2) generate a candidate \(\beta^* \sim N(\beta_{t-1}, (2.38^2/d)V)\) and \(u\sim \text{Unif}(0,1)\); 3) if \(u < \min\bigl\{ 1, \frac{p(\beta^*|y)}{p(\beta_{t-1}|y)} \bigr\}\), set \(\beta_t := \beta^*\); otherwise set \(\beta_t := \beta_{t-1}\).
The third step is valid because the proposal is symmetric, so the proposal densities cancel in the acceptance ratio. Using the ‘post_beta’ function, we define the ‘randomwalk_MH’ function in R to generate posterior draws of \(\beta\).
set.seed(2023311161)
randomwalk_MH <- function(g_num, X, V){
  betas <- matrix(c(2, 1), nrow = 1)
  for(i in 1:g_num){
    b_prev <- as.numeric(betas[i, ])
    b_temp <- mvrnorm(1, mu = b_prev, Sigma = 2.38^2 / 2 * V)
    u <- runif(1)
    # symmetric proposal: the ratio reduces to the posterior ratio
    acc_prob <- min(1, post_beta(b_temp, X, y) / post_beta(b_prev, X, y))
    if(u < acc_prob){
      betas <- rbind(betas, b_temp)
    } else {
      betas <- rbind(betas, b_prev)
    }
  }
  betas <- betas[-1, ]
  rownames(betas) <- 1:g_num
  return(betas)
}
The scatter plot of \(\beta_1,\beta_2\) is as follows:
b_post_rw <- randomwalk_MH(g_num, X, V)
par(mfrow = c(1, 1))
plot(b_post_rw[,1], b_post_rw[,2], main = expression("Posterior of " * beta * " using Random Walk M-H"), xlab = expression("" * beta * "1"), ylab = expression("" * beta * "2"), col = "red", pch = 16)
The histogram (marginal posterior) of each \(\beta\) is given as follows:
par(mfrow = c(1, 2))
hist(b_post_rw[,1],
     main = expression("Histogram of " * beta * "1 (Random walk MH)"),
     xlab = expression("" * beta * "1"),
     col = "pink",
     freq = FALSE)
hist(b_post_rw[,2],
     main = expression("Histogram of " * beta * "2 (Random walk MH)"),
     xlab = expression("" * beta * "2"),
     col = "brown",
     freq = FALSE)
4) Comparison of the performance
Based on the above results, the independent MH chain moves toward the posterior mean quickly, while the random walk MH chain wanders through outlying values and converges more slowly. In short, the independent MH is more efficient than the random walk MH for this problem, because its proposal \(N(\hat\beta, V)\) closely matches the target posterior; this advantage relies on having such a good approximation available.
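The comparison can be made quantitative with simple diagnostics such as the acceptance rate and the lag-1 autocorrelation of each chain. A minimal Python sketch of the two diagnostics on a toy chain (illustrative only, with hypothetical helper names):

```python
import numpy as np

def acceptance_rate(chain):
    # fraction of iterations where the chain actually moved
    moves = np.any(chain[1:] != chain[:-1], axis=-1)
    return float(moves.mean())

def lag1_autocorr(x):
    # lag-1 sample autocorrelation of a single coordinate
    x = x - x.mean()
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))

# toy "chain": a random walk where roughly 60% of proposals are rejected
rng = np.random.default_rng(2)
accepted = rng.random((1000, 1)) < 0.4
steps = rng.standard_normal((1000, 2)) * accepted
chain = np.cumsum(steps, axis=0)

print(acceptance_rate(chain), lag1_autocorr(chain[:, 0]))
```

Applied to `b_post_ind` and `b_post_rw`, a higher acceptance rate with lower autocorrelation indicates the more efficient sampler.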