Bayesian Statistics HW2
Q1 Consider analyzing count data, \(y = (y_1, . . . , y_n)^T , y_i \in \mathbb{N}_0 := \{0, 1, 2, . . . \}, i = 1, . . . , n\).
(a) The basic approach is using the Poisson model, \[p(y_i|\theta) =
\frac{\theta^{y_i}e^{-\theta}}{y_i!}, \quad y_i \in \mathbb{N}_0, i = 1,
. . . , n.\] With an improper prior \(p(\theta) \propto 1, \theta > 0\), show
that the posterior distribution \(p(\theta|y)\) is a gamma distribution. For
the dataset countdata.txt, draw the posterior density of
\(\theta\). [Do not use Monte Carlo
simulation. You can directly visualize the density curve.]
solution
\[ \begin{aligned} p(\theta|y) \propto p(\theta)p(y|\theta) \propto 1 \cdot \prod_{i=1}^{n} \frac{\theta^{y_i}e^{-\theta}}{y_i!} \propto \theta^{\sum_{i=1}^{n} y_i} \exp\{-n\theta\}. \end{aligned} \]
Recall that \(x^{\alpha-1} \exp\{-x\beta\}\) is the kernel of the gamma distribution \(x \sim \Gamma(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are the shape and rate parameters, respectively. This shows that:
\[ \theta|y \sim \Gamma\left(\sum_{i=1}^{n} y_i + 1, n\right). \]
The graph of the posterior density is drawn as follows:
rm(list = ls())
library(ggplot2)
set.seed(2023311161)
countdata = read.table("countdata.txt")
y = countdata[,1]
x <- seq(0, 50, by = 0.01)
pdf <- dgamma(x, shape = sum(y) + 1, rate = length(y)) # posterior Gamma(sum(y) + 1, n)
plot1 <- ggplot(data.frame(x, pdf), aes(x = x, y = pdf)) +
geom_line(color = "blue", linewidth = 1) +
labs(x = "x", y = "Density", title = "Gamma Distribution") +
xlim(40, 50)
plot1
(b) For the dataset countdata.txt, perform a posterior predictive check by generating replicates from the posterior predictive distribution through Monte Carlo simulation. Choose appropriate test quantities. Draw conclusions based on the results.
The posterior predictive distribution is as follows: \[p(\tilde y|y) = \int p(\tilde y|\theta)p(\theta|y)d\theta.\]
We can sample from the posterior predictive distribution by first drawing \(\theta\) from \(p(\theta|y)\) and then drawing the replicate \(\tilde y\) from \(p(\tilde y|\theta)\) (Monte Carlo). The following shows the function random_pois and the histogram of the replicated data.
random_pois <- function(N, y) {
  yt <- numeric(N)
  for (i in 1:N) {
    theta <- rgamma(1, shape = sum(y) + 1, rate = length(y)) # theta ~ p(theta|y)
    yt[i] <- rpois(1, theta)                                 # y tilde ~ p(y tilde|theta)
  }
  return(yt)
}
yt <- random_pois(1000,y)
plot2 <- ggplot(data.frame(yt), aes(x = yt)) +
geom_histogram(aes(y = after_stat(density)),fill = "gray50", color = "black", bins = 30) +
labs(x = "x", y = "Density")
plot2
The purpose of this posterior predictive check is to investigate whether the replicated data cover the overdispersed region far from the mean. The ppcheck_pois function therefore counts how often the third quartile of a replicate is at least that of the observed data \(y\), or the first quartile of a replicate is at most that of \(y\).
ppcheck_pois <- function(M, y, N) {
  pp <- 0
  for (i in 1:M) {
    ytil <- random_pois(N, y)
    if (quantile(ytil, probs = 0.75) >= quantile(y, probs = 0.75) ||
        quantile(ytil, probs = 0.25) <= quantile(y, probs = 0.25)) {
      pp <- pp + 1
    }
  }
  return(pp / M)
}
ppcheck_pois(200,y,1000)
## [1] 0
This shows that the replicates rarely cover the region far from the mean: the Poisson model fails to reproduce the spread of the observed data.
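A complementary test quantity that targets overdispersion directly is the dispersion index \(T(y) = \operatorname{var}(y)/\operatorname{mean}(y)\), which equals 1 in expectation under a Poisson model. The sketch below (the helper name dispersion_ppc is hypothetical, not part of the assignment) estimates the posterior predictive p-value for this statistic, assuming y and random_pois from the code above are already in the workspace.

```r
## Sketch: posterior predictive p-value for the dispersion index var/mean.
## Assumes y and random_pois() are defined as above.
dispersion_ppc <- function(M, y, N) {
  t_obs <- var(y) / mean(y)         # observed dispersion index
  t_rep <- replicate(M, {
    yt <- random_pois(N, y)         # one replicate dataset of size N
    var(yt) / mean(yt)              # dispersion index of the replicate
  })
  mean(t_rep >= t_obs)              # fraction of replicates at least as dispersed
}
dispersion_ppc(200, y, 1000)
```

A p-value near 0 here would agree with the quartile-based check: Poisson replicates almost never attain the dispersion of the observed data.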
(c) An unfortunate property of the Poisson model is that it cannot
model overdispersed data or data in which the variance is greater than
the mean. This is because Poisson model has one parameter. An
alternative is considering the Poisson-gamma mixture model, \(y_i|\theta_i \sim Poi(\theta_i)\) and \(\theta_i|(\alpha, \beta) \sim \Gamma(\alpha,
\beta), \alpha > 0, \beta > 0\). Hence, the density of
\(y_i\) is given by: \[\begin{aligned} p(y_i|\alpha, \beta) &=
\int_{0}^{\infty} p(y_i|\theta_i)p(\theta_i|\alpha, \beta)d\theta_i \\
&= \binom{\alpha + y_i - 1}{y_i} \left(\frac{\beta}{\beta +
1}\right)^{\alpha} \left(\frac{1}{\beta + 1}\right)^{y_i}, \quad y_i \in
\mathbb{N}_0, \quad i = 1, \dots, n. \end{aligned} \] Note that
this is the density of a negative binomial distribution if \(\alpha\) is a positive integer. Despite the
flexibility of \(\alpha\) being any
positive value in our setting, this model is commonly referred to as the
negative binomial model for count data. For the dataset
countdata.txt, draw a contour plot of the unnormalized
posterior density of \((\alpha,
\beta)\) with an improper prior \(p(\alpha, \beta) \propto 1\). [Do not use
Monte Carlo simulation. You can directly draw a contour plot, though
some iterative trials may be necessary to find a suitable range for
visualization.]
solution
The posterior distribution \(p(\alpha, \beta|y)\) can be obtained as follows:
\[ \begin{aligned} p(\alpha, \beta|y) &\propto p(\alpha, \beta)p(y|\alpha, \beta) \\ &\propto 1 \times \prod_{i=1}^{n} \binom{\alpha + y_i - 1}{y_i} \left( \frac{\beta}{\beta + 1} \right)^{\alpha} \left( \frac{1}{\beta + 1} \right)^{y_i} \\ &= \left\{ \prod_{i=1}^{n} \binom{\alpha + y_i - 1}{y_i} \right\} \left( \frac{\beta}{\beta + 1} \right)^{n\alpha} \left( \frac{1}{\beta + 1} \right)^{\sum_i y_i}. \end{aligned} \]
Using this result, an R function can be defined that works as the unnormalized posterior density.
f <- function(a, b, y) {
  n <- length(y)
  result <- 1
  for (i in 1:n) {
    temp <- choose(a + y[i] - 1, y[i]) * (b / (b + 1))^a * (1 / (b + 1))^y[i]
    result <- result * temp
  }
  return(result)
}
The range of \(\alpha\) and \(\beta\) is set to \(10 \leq \alpha \leq 35\) and \(0.2 \leq \beta \leq 0.8\), with grid spacings of \(0.1\) and \(0.01\), respectively. The density values are saved in the variable z.
a_values <- seq(10, 35, by = 0.1)    # 251 grid points
b_values <- seq(0.2, 0.8, by = 0.01) # 61 grid points
z <- outer(a_values, b_values, FUN = Vectorize(function(a, b) f(a, b, y)))
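As an aside, the product of \(n\) small factors in f can underflow to 0 for larger datasets. A numerically safer variant (a sketch; logf and z_safe are hypothetical names) accumulates the log-density and rescales by the maximum before exponentiating, which leaves the contour shape unchanged since only ratios of the unnormalized density matter.

```r
## Sketch: log-scale version of the unnormalized posterior density.
## Assumes y, a_values, b_values are defined as above.
logf <- function(a, b, y) {
  sum(lchoose(a + y - 1, y)) +              # log of the product of binomial coefficients
    length(y) * a * log(b / (b + 1)) -      # n * alpha * log(beta / (beta + 1))
    sum(y) * log(b + 1)                     # -sum(y) * log(beta + 1)
}
logz <- outer(a_values, b_values, FUN = Vectorize(function(a, b) logf(a, b, y)))
z_safe <- exp(logz - max(logz))             # proportional to z, safe from underflow
```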
The corresponding contour plot is displayed below.
contour(a_values, b_values, z, nlevels = 15,
xlab = "a", ylab = "b",
main = "Contour Plot of f(a, b)")
(d) For the dataset countdata.txt, perform the
posterior predictive check by generating replicates from the posterior
predictive distribution using Monte Carlo simulation. Use the same test
quantities as those used in (b). To generate samples from the posterior
predictive distribution, first draw \((\alpha,
\beta)\) from the posterior distribution using the “grid
sampling”, and then generate a replicate using the data distribution.
You will need to find a suitable range and grid points for the grid
sampling. Draw conclusions based on the results.
solution
The posterior predictive distribution for mixture model is as follows:
\[ p(\tilde{y}|y) = \int p(\tilde{y}|\alpha, \beta)p(\alpha, \beta|y)d(\alpha, \beta). \]
The replicate \(\tilde{y}\) follows a negative binomial distribution, Neg-Bin\((\alpha, \beta)\), given the drawn \((\alpha, \beta)\).
For grid approximation (grid sampling), the same \(\alpha, \beta\) grids determined in (c) are used again. Each value in z can then be viewed as an unnormalized discrete density value at the corresponding \((\alpha, \beta)\), so z divided by its total sum serves as the approximated (normalized) posterior over the grid.
z_gridapp <- z / sum(z)
The following random_ab function draws \((\alpha, \beta)\) using the inverse-CDF method and then generates the replicate \(\tilde{y}\). For each iteration (n in the R code), generate u from the uniform distribution \(Unif(0, 1)\) and accumulate the grid probabilities (prob in the code) until the (approximated) CDF exceeds u. The indices reached at that point identify the sampled \((\alpha, \beta)\), and the nth replicate is generated with the rnbinom function. Note that the prob parameter of rnbinom must be \(\frac{\beta}{\beta + 1}\), not \(\beta\). After N iterations (N: the total number of draws), the generated \((\alpha, \beta)\) and \(\tilde{y}\) are returned in a data frame.
random_ab <- function(N, a, b, z) {
  anew <- c()
  bnew <- c()
  ytilda <- c()
  for (n in 1:N) {
    prob <- 0
    u <- runif(1)
    for (i in 1:nrow(z)) {
      for (j in 1:ncol(z)) {
        if (u < prob) { break }  # stop once the cumulative probability passes u
        prob <- prob + z[i, j]
        aidx <- i
        bidx <- j
      }
      if (u < prob) { break }
    }
    anew[n] <- a[aidx]
    bnew[n] <- b[bidx]
    ytilda[n] <- rnbinom(1, a[aidx], b[bidx] / (b[bidx] + 1))
  } # end of n loop
  return(data.frame(a = anew, b = bnew, y = ytilda))
} # end of function
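The hand-rolled inverse-CDF double loop above can equivalently be replaced by R's built-in weighted sampling over the flattened grid (a sketch; random_ab2 is a hypothetical alternative, not the assignment's required method). sample draws flat indices with probabilities proportional to z, and arrayInd maps them back to (row, column) = (\(\alpha\), \(\beta\)) grid positions.

```r
## Sketch: grid sampling via sample() instead of a manual inverse-CDF scan.
random_ab2 <- function(N, a, b, z) {
  idx <- sample(length(z), N, replace = TRUE, prob = as.vector(z))
  ij  <- arrayInd(idx, dim(z))    # column-major flat index -> (row, col)
  anew <- a[ij[, 1]]              # row index -> alpha value
  bnew <- b[ij[, 2]]              # column index -> beta value
  ytilda <- rnbinom(N, size = anew, prob = bnew / (bnew + 1))
  data.frame(a = anew, b = bnew, y = ytilda)
}
```

Both functions target the same discrete approximation; the built-in version is vectorized and avoids the nested loops.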
Again, a function for the posterior predictive check, ppcheck_mix, is defined similarly to (b): it checks whether the third quartile of the replicate is greater than or equal to that of the observed data, or the first quartile of the replicate is less than or equal to that of the observed data.
ppcheck_mix <- function(M, y, N, a, b, z) {
  pp <- 0
  for (i in 1:M) {
    ytil <- random_ab(N, a, b, z)$y
    if (quantile(ytil, probs = 0.75) >= quantile(y, probs = 0.75) ||
        quantile(ytil, probs = 0.25) <= quantile(y, probs = 0.25)) {
      pp <- pp + 1
    }
  }
  return(pp / M)
}
ppcheck_mix(200, y, 1000, a_values, b_values, z_gridapp)
## [1] 0.735
The result is much higher than in (b). This implies that the negative binomial model in (d) is far more likely to generate replicates as dispersed as the observed data (variance greater than the mean), and thus accommodates the overdispersion that the Poisson model cannot.
Q2 \[y|(\beta,\sigma^2) \sim N_n(X\beta,\sigma^2 I)\] with prior \(p(\beta,\sigma^2) \propto \sigma^{-2}\).
(a) Derive the conditional posterior distribution of \(\beta\) given \(\sigma^2\), say \(p(\beta|\sigma^2,y)\).
We have \(p(y|\beta,\sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\bigl\{-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\bigr\}\), so \[\begin{aligned} p(\beta,\sigma^2|y) &\propto p(y|\beta,\sigma^2)p(\beta,\sigma^2) \\ &\propto (\sigma^2)^{-n/2-1}\exp\bigl\{-\tfrac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\bigr\} \\ &= (\sigma^2)^{-n/2-1}\exp\bigl\{-\tfrac{1}{2\sigma^2}(y-X\hat\beta +X\hat\beta -X\beta)^T(y-X\hat\beta +X\hat\beta -X\beta)\bigr\} \\ &= (\sigma^2)^{-n/2-1}\exp\Bigl\{-\tfrac{1}{2\sigma^2}\bigl[(y-X\hat\beta)^T(y-X\hat\beta) + \underbrace{2(y-X\hat\beta)^T(X\hat\beta-X\beta)}_{1)}+(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\bigr]\Bigr\} \\ &= (\sigma^2)^{-n/2-1}\exp\Bigl\{-\tfrac{1}{2\sigma^2}\bigl[(y-X\hat\beta)^T(y-X\hat\beta) +(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\bigr]\Bigr\}. \end{aligned}\]
The underbraced term 1) vanishes because \(\hat\beta\) solves the normal equations, \(X^T(y-X\hat\beta) = 0\). Then \(\exp\bigl\{-\frac{1}{2\sigma^2}(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\bigr\}\) is the only factor of \(p(\beta,\sigma^2|y)\) involving \(\beta\). So, \[p(\beta|\sigma^2,y) \propto \exp\Bigl\{-\frac{1}{2\sigma^2}(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\Bigr\},\] which is the kernel of a multivariate normal with mean \(\hat\beta\) and covariance \(\sigma^2(X^TX)^{-1}\); thus \(\beta|(\sigma^2,y) \sim N_p(\hat\beta,\sigma^2(X^TX)^{-1})\).
(b) Let \(\beta_1\) be the first element of \(\beta\). Derive the conditional posterior distribution of \(\beta_1\) given \(\sigma^2\), say \(p(\beta_1|\sigma^2,y)\).
We may use the result of (a), that is, \[\beta_1|(\sigma^2,y) \sim N(\hat\beta_1,[\sigma^2(X^TX)^{-1}]_{1,1})\] where \(\hat\beta = (X^TX)^{-1}X^Ty\).
(c) Derive the conditional posterior distribution of \(\sigma^2\) given \(\beta\), say \(p(\sigma^2|\beta,y)\).
Return to \(p(\beta,\sigma^2|y)\): \[p(\beta,\sigma^2|y) \propto (\sigma^2)^{-n/2-1}\exp\Bigl\{-\frac{1}{2\sigma^2}\bigl[(y-X\hat\beta)^T(y-X\hat\beta) +(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\bigr]\Bigr\} = (\sigma^2)^{-n/2-1}\exp\Bigl\{-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\Bigr\},\] where the two quadratic forms recombine because \(X^T(y-X\hat\beta)=0\). Viewed as a function of \(\sigma^2\) with \(\beta\) fixed, this is the kernel of a scaled inverse chi-square distribution: \(\text{Inv-}\chi^2(\theta|\nu,s^2)\) has pdf \[p(\theta) \propto \theta^{-(\nu/2+1)}\exp\{-\nu s^2/(2\theta)\}.\] Matching terms, we conclude \[\sigma^2|(\beta,y) \sim \text{Inv-}\chi^2\Bigl(n,\frac{(y-X\beta)^T(y-X\beta)}{n}\Bigr).\]
(d) By marginalizing out \(\sigma^2\) from \(p(\beta,\sigma^2|y)\), derive the marginal posterior distribution of \(\beta\), say \(p(\beta|y)\).
1) Suppose \(z|u\sim N_p(\mu,u\Sigma)\) and \(u\sim \text{Inv-}\chi^2(\nu,s^2)\), then \(z\sim t_{\nu}(\mu,s^2\Sigma)\)
\[\begin{aligned} p(z) &\propto \int p(z|u)p(u)du \\ &\propto \int|u\Sigma|^{-1/2}\exp\bigl\{-\frac{1}{2u}\underbrace{(z-\mu)^T\Sigma^{-1}(z-\mu)}_{:=\Delta_z}\bigr\}u^{-(\nu/2+1)}\exp\Bigl\{\frac{-\nu s^2}{2u}\Bigr\}du \\ &\propto \int \underbrace{u^{-p/2}u^{-(\nu/2+1)}\exp\biggl\{-\frac{\Delta_z+\nu s^2}{2u}\biggr\}}_{2)}du \\ & \Bigl( \text{2) is a kernel of Inv-}\chi^2(\nu+p,\frac{\Delta_z+\nu s^2}{\nu+p}) \Bigr)\\ &= \frac{\Gamma(\frac{\nu+p}{2})}{(\frac{\nu+p}{2})^{\frac{\nu+p}{2}}\Bigl(\sqrt{\frac{\Delta_z+\nu s^2}{\nu+p}}\Bigr)^{\nu+p}} \\ &\propto \biggl(\frac{\Delta_z+\nu s^2}{\nu+p}\biggr)^{-(\nu+p)/2} \\ &\propto \biggl(\frac{\Delta_z+\nu s^2}{\nu}\biggr)^{-(\nu+p)/2} \\ &\propto \biggl(1+\frac{\Delta_z}{\nu s^2}\biggr)^{-(\nu+p)/2} \\ &= \Bigl\{1+\frac{1}{\nu}(z-\mu)^T\frac{\Sigma^{-1}}{s^2}(z-\mu)\Bigr\}^{-(\nu+p)/2}, \end{aligned}\] which is a kernel of \(t_{\nu}(\mu,s^2\Sigma)\).
2) Apply 1) with \(z \Rightarrow (\beta|y)\) and \(u \Rightarrow (\sigma^2|y)\):
We have \(\beta|(\sigma^2,y)\sim N_p(\hat\beta,\sigma^2(X^TX)^{-1})\) and \(\sigma^2|y \sim\text{Inv-}\chi^2(n-p,s^2)\), where \(s^2=\frac{1}{n-p}(y-X\hat\beta)^T(y-X\hat\beta)\), the result of (f). Thus, \(\beta|y\sim t_{n-p}(\hat\beta,s^2(X^TX)^{-1})\).
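The composition in (d) can be checked numerically: drawing \(\sigma^2|y\) and then \(\beta|\sigma^2,y\) should yield marginal draws of \(\beta_1\) whose standardized quantiles match \(t_{n-p}\). The sketch below uses simulated data (the design matrix, yreg, and all sizes are hypothetical, since the assignment does not specify a regression dataset).

```r
## Sketch: Monte Carlo sanity check that beta_1 | y is t_{n-p} distributed.
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))    # hypothetical design matrix
yreg <- X %*% c(1, 2, -1) + rnorm(n)            # hypothetical response
bhat <- solve(t(X) %*% X, t(X) %*% yreg)        # OLS estimate (X^T X)^{-1} X^T y
s2 <- sum((yreg - X %*% bhat)^2) / (n - p)      # s^2 from part (f)
draws <- replicate(5000, {
  sig2 <- (n - p) * s2 / rchisq(1, n - p)       # sigma^2 ~ Inv-chi^2(n - p, s^2)
  beta <- bhat + t(chol(sig2 * solve(t(X) %*% X))) %*% rnorm(p)  # beta | sigma^2, y
  beta[1]
})
## Standardized draws should follow t_{n-p}; compare e.g. the 97.5% quantiles:
c(quantile((draws - bhat[1]) / sqrt(s2 * solve(t(X) %*% X)[1, 1]), 0.975),
  qt(0.975, n - p))
```

The two printed quantiles should agree up to Monte Carlo error, consistent with \(\beta|y\sim t_{n-p}(\hat\beta,s^2(X^TX)^{-1})\).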
(e) By marginalizing out \(\sigma^2\) from \(p(\beta_1,\sigma^2|y)\), derive the marginal posterior distribution of \(\beta_1\), say \(p(\beta_1|y)\).
We may directly use the result of (d): \[\beta_1|y\sim t_{n-p}(\hat\beta_1,s^2[(X^TX)^{-1}]_{1,1}).\]
(f) By marginalizing out \(\beta\) from \(p(\beta,\sigma^2|y)\), derive the marginal posterior distribution of \(\sigma^2\), say \(p(\sigma^2|y)\).
Recall that \(p(\beta,\sigma^2|y) \propto (\sigma^2)^{-n/2-1}\exp\bigl\{-\frac{1}{2\sigma^2}\bigl[(y-X\hat\beta)^T(y-X\hat\beta) +(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\bigr]\bigr\}\). The factor \(\exp\bigl\{-\frac{1}{2\sigma^2}(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\bigr\}\) is the kernel of \(N_p(\hat\beta,\sigma^2(X^TX)^{-1})\), so \[\int\exp\Bigl\{-\tfrac{1}{2\sigma^2}(\hat\beta-\beta)^TX^TX(\hat\beta-\beta)\Bigr\}d\beta=(2\pi)^{p/2}|\sigma^2(X^TX)^{-1}|^{1/2} = (2\pi\sigma^2)^{p/2}|X^TX|^{-1/2}.\] Then, \[\begin{aligned} \int p(\beta,\sigma^2|y)d\beta &\propto (\sigma^2)^{-n/2-1}\exp\bigl\{-\tfrac{1}{2\sigma^2}(y-X\hat\beta)^T(y-X\hat\beta)\bigr\}\cdot(2\pi\sigma^2)^{p/2}|X^TX|^{-1/2} \\ &\propto (\sigma^2)^{-(n-p)/2-1}\exp\Bigl\{-\frac{n-p}{2\sigma^2}\frac{(y-X\hat\beta)^T(y-X\hat\beta)}{n-p}\Bigr\}, \end{aligned}\] which is the kernel of \(\text{Inv-}\chi^2\Bigl(n-p,\frac{(y-X\hat\beta)^T(y-X\hat\beta)}{n-p}\Bigr)\). Thus, \[\sigma^2|y\sim\text{Inv-}\chi^2\Bigl(n-p,\frac{(y-X\hat\beta)^T(y-X\hat\beta)}{n-p}\Bigr).\]