Bayesian Statistics HW5
Q1. Consider the finite mixture modeling for the galaxy dataset using location-scale Gaussian mixtures. For \(H\) mixture components, the conjugate prior for \(\pi = (\pi_1, \ldots, \pi_H)\) is a Dirichlet prior, \(\pi \sim \textsf{Dirichlet}(\alpha_1, \ldots, \alpha_H)\).
(a) Setting \(\alpha_h = 1\) for all \(h\) leads to a flat prior on the simplex and is hence widely accepted. For each of \(H = 1, 2, \ldots, 10\), find the pointwise posterior mean curve and 95% credible band of the density function using Gibbs sampling. Specify the values of hyper-parameters and initial values you use.
To begin, a location-scale Gaussian mixture has the form \[\begin{aligned} g(y\mid\pi,\mu,\tau^2) = \sum_{h=1}^H \pi_h\,\textsf{N}(y\mid\mu_h , \tau_h^2).\end{aligned}\]
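As a quick illustration, this density can be evaluated on a grid as below; the weights, means, and variances are illustrative toy values, not fitted quantities.

```r
# Evaluate g(y) = sum_h pi_h * N(y | mu_h, tau_h^2) at each grid point.
mix_density <- function(y, pis, mu, tau2) {
  rowSums(sapply(seq_along(pis), function(h) {
    pis[h] * dnorm(y, mean = mu[h], sd = sqrt(tau2[h]))
  }))
}
y_toy <- seq(-4, 4, length.out = 101)
g_toy <- mix_density(y_toy, pis = c(0.3, 0.7), mu = c(-1, 2), tau2 = c(1, 0.5))
```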
This code uses the ‘extraDistr’ package to generate Dirichlet and inverse-gamma random variables.
rm(list=ls())
set.seed(2023311161)
library(extraDistr) # for dirichlet
Extract the data from ‘galaxy.txt’ and store it in the ‘speeds’ vector.
galaxy_data <- read.table("galaxy.txt", header = TRUE)
speeds <- galaxy_data$speed
n <- length(speeds)
One of our goals is to obtain the estimated density \(g(y|\pi)\) at each \(y\), so we let ‘y_grid’ be an equally spaced vector running from the minimum to the maximum of ‘speeds’.
n_iter <- 1000
burn_in <- 100
n_grid <- 100
y_grid <- seq(min(speeds), max(speeds), length.out = n_grid)
Lists for storing the Gibbs sampler output are declared as follows.
########## (a)
all_density_estimates_a <- list()
all_mean_density_a <- list()
all_credible_lower_a <- list()
all_credible_upper_a <- list()
For each \(H\), let ‘alpha’ be a vector of ones of length \(H\), ‘mu0’ the mean of the ‘speeds’ vector, ‘kappa’ 1, ‘a_tau’ 2, and ‘b_tau’ 1 (corresponding to \(\alpha_h = 1\ \forall h\), and \(\mu_0, \kappa, a_{\tau}, b_{\tau}\)). Using these hyperparameters, we set initial values of ‘mu’, ‘tau2’, ‘pis’, and ‘z’ (corresponding to \(\mu, \tau^2, \pi, z\), where \(z\) is a vector of component labels with \(z_i\in\{1,\dots,H\}\), \(i=1,\dots,n\)).
In the Gibbs sampler, the full conditional of each \(z_i\) is \[\begin{aligned} \mathbb P(z_i=h\mid y,\pi,\mu,\tau^2) \propto \pi_h\,\textsf{N}(y_i\mid\mu_h,\tau_h^2). \end{aligned}\] The ‘probs’ vector holds these unnormalized probabilities for observation \(i\), so \(z_i\) is updated with the ‘sample’ function using ‘probs’ as the probability vector (‘sample’ rescales it internally). Throughout, the ‘\(-\)’ notation denotes conditioning on all remaining parameters and the data \(y\).
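In isolation, this label update looks like the following sketch, with toy values standing in for the current draws of \(\pi, \mu, \tau^2\):

```r
set.seed(1)
y_toy <- c(9.2, 10.1, 20.8, 21.5)  # toy observations
H_toy <- 2
pis_toy <- c(0.5, 0.5); mu_toy <- c(10, 21); tau2_toy <- c(1, 1)
z_toy <- integer(length(y_toy))
for (i in seq_along(y_toy)) {
  # Unnormalized full-conditional probabilities; sample() rescales them.
  probs <- pis_toy * dnorm(y_toy[i], mean = mu_toy, sd = sqrt(tau2_toy))
  z_toy[i] <- sample(1:H_toy, 1, prob = probs)
}
```

With components this well separated, each observation is assigned to its nearby component with probability essentially 1.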
Since the posterior of \((\mu_h, \tau_h^2)\) is normal-inverse-gamma, we have \[\begin{aligned} \mu_h \mid \tau_h^2, - \sim \textsf{N}(\hat\mu_h, \hat\kappa_h\tau_h^2), \end{aligned}\] where \(n_h = \sum_{i=1}^{n}\mathbf{1}_{z_i=h}\), \(\bar y_h = n_h^{-1}\sum_{i:z_i=h}y_i\), \(\hat\kappa_h = (\kappa^{-1}+n_h)^{-1}\), and \(\hat\mu_h=\hat\kappa_h(\kappa^{-1}\mu_0+n_h\bar y_h)\). Also, \[\begin{aligned} \tau_h^2 \mid - \sim \text{Inv-}\Gamma(\hat a_h, \hat b_h), \end{aligned}\] where \(\hat a_h=a_{\tau}+n_h/2\) and \[\hat b_h = b_{\tau}+\frac{1}{2}\biggl( \sum_{i:z_i=h}(y_i - \bar y_h)^2+\frac{n_h}{1+\kappa n_h}(\bar y_h - \mu_0)^2 \biggr).\] So the \(h\)th elements of \(\tau^2\) and \(\mu\) are drawn with ‘rinvgamma’ and ‘rnorm’ respectively, \(\tau_h^2\) first since \(\mu_h\) is drawn conditionally on it, using the parameters above.
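A standalone sketch of this blocked draw for a single component, with toy data (‘y_h_toy’ plays the role of \(\{y_i : z_i = h\}\), and the hyperparameter values match those used below). Base R's ‘rgamma’ is used here, since an inverse-gamma draw is the reciprocal of a gamma draw; ‘extraDistr::rinvgamma’ is equivalent.

```r
set.seed(1)
y_h_toy <- c(1.8, 2.1, 2.4)
mu0_toy <- 2; kappa_toy <- 1; a_tau_toy <- 2; b_tau_toy <- 1
n_h  <- length(y_h_toy); ybar <- mean(y_h_toy)
kappa_hat <- 1 / (1 / kappa_toy + n_h)
mu_hat    <- kappa_hat * (mu0_toy / kappa_toy + n_h * ybar)
a_hat <- a_tau_toy + n_h / 2
b_hat <- b_tau_toy + 0.5 * sum((y_h_toy - ybar)^2) +
         0.5 * n_h / (1 + kappa_toy * n_h) * (ybar - mu0_toy)^2
tau2_h <- 1 / rgamma(1, shape = a_hat, rate = b_hat)              # tau_h^2 | -
mu_h   <- rnorm(1, mean = mu_hat, sd = sqrt(kappa_hat * tau2_h))  # mu_h | tau_h^2, -
```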
The posterior of \(\pi\) is again a Dirichlet distribution; \[\begin{aligned} \pi \mid - \sim \textsf{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_H+n_H), \end{aligned}\] which here is \(\textsf{Dirichlet}(1 + n_1, \dots, 1+n_H)\). \(H=1\) is a special case because ‘rdirichlet’ requires a parameter vector of length at least 2, so we simply set \(\pi_1=1\) for all iterations.
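The weight update in isolation, where the counts ‘ns_toy’ are illustrative:

```r
library(extraDistr)  # for rdirichlet
set.seed(1)
alpha_toy <- rep(1, 3)      # prior Dirichlet parameters
ns_toy    <- c(40, 30, 12)  # illustrative component counts n_1, ..., n_H
# rdirichlet needs length >= 2, so H = 1 is handled separately.
pis_toy <- if (length(alpha_toy) == 1) 1 else rdirichlet(1, alpha_toy + ns_toy)
```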
To obtain the pointwise posterior mean and the corresponding 95% credible band, the ‘iter’th row of the ‘density_estimates’ matrix stores \(\sum_{h=1}^H\pi_h\textsf{N}(y\mid\mu_h,\tau_h^2)\) evaluated at every grid point.
After discarding burn-in samples, the pointwise posterior mean ‘mean_density’ is obtained from the ‘density_estimates’ matrix, corresponding to \[\begin{aligned} \frac{1}{S}\sum_{s=1}^S\sum_{h=1}^H\pi_h^{(s)}\textsf{N}(y\mid\mu_h^{(s)},\tau_h^{2(s)}), \end{aligned}\] where \(S\) is the number of retained draws. Similarly, the pointwise 0.025 and 0.975 quantiles are obtained with the ‘apply’ function.
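These summaries in isolation, applied to a stand-in matrix of post-burn-in density draws (‘dens_toy’ plays the role of ‘density_estimates’ after burn-in removal):

```r
set.seed(1)
S_toy <- 200; G_toy <- 50
dens_toy <- matrix(rexp(S_toy * G_toy), S_toy, G_toy)  # stand-in draws
mean_toy  <- colMeans(dens_toy)                        # pointwise posterior mean
lower_toy <- apply(dens_toy, 2, quantile, probs = 0.025)
upper_toy <- apply(dens_toy, 2, quantile, probs = 0.975)
```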
In addition to a histogram of the original ‘speeds’ data, the posterior mean density and 95% credible band are plotted for each \(H\).
for (H in 1:10) {
alpha <- rep(1, H) # Dirichlet prior parameters for pi
mu0 <- mean(speeds) # Prior mean for mu_h
kappa <- 1 # Prior precision for mu_h
a_tau <- 2 # Prior shape parameter for tau2_h
b_tau <- 1 # Prior scale parameter for tau2_h
mu <- rnorm(H, mean = mu0, sd = 1)
tau2 <- rinvgamma(H, alpha = a_tau, beta = b_tau)
pis <- rep(1/H, H)
z <- sample(1:H, n, replace = TRUE)
mu_gibbs <- matrix(0, nrow = n_iter, ncol = H)
tau2_gibbs <- matrix(0, nrow = n_iter, ncol = H)
pis_gibbs <- matrix(0, nrow = n_iter, ncol = H)
density_estimates <- matrix(0, nrow = n_iter, ncol = n_grid)
for (iter in 1:n_iter) {
# Update z_i
for (i in 1:n) {
probs <- pis * dnorm(speeds[i], mean = mu, sd = sqrt(tau2))
z[i] <- sample(1:H, 1, prob = probs )
}
# Update mu_h, tau2_h
ns <- numeric(H)
for (h in 1:H) {
n_h <- sum(z == h)
# If no data points are assigned to component h, skip the update
if (n_h == 0) { next }
ybar_h <- mean(speeds[z == h])
kappa_h <- 1/(1/kappa + n_h)
mu_h <- kappa_h*(mu0/kappa + n_h*ybar_h)
a_h <- a_tau + n_h/2
b_h <- b_tau + 0.5*sum((speeds[z == h] - ybar_h)^2) + 0.5*(n_h/(1 + kappa*n_h))*(ybar_h - mu0)^2
tau2[h] <- rinvgamma(1, alpha = a_h, beta = b_h) # draw tau2_h first,
mu[h] <- rnorm(1, mean = mu_h, sd = sqrt(kappa_h*tau2[h])) # then mu_h | tau2_h
ns[h] <- n_h
}
# Update pi
alpha_post <- alpha + ns
pis <- if (H == 1) 1 else rdirichlet(1, alpha_post) # rdirichlet needs length(alpha) >= 2
# Store samples
mu_gibbs[iter, ] <- mu
tau2_gibbs[iter, ] <- tau2
pis_gibbs[iter, ] <- pis
# Estimate density
density_estimates[iter, ] <- rowSums(sapply(1:H, function(h) {
pis[h] * dnorm(y_grid, mean = mu[h], sd = sqrt(tau2[h])) # length(y_grid) by H matrix
}))
}
# burn-in
mu_gibbs <- mu_gibbs[-(1:burn_in), ]
tau2_gibbs <- tau2_gibbs[-(1:burn_in), ]
pis_gibbs <- pis_gibbs[-(1:burn_in), ]
density_estimates <- density_estimates[-(1:burn_in), ]
# Posterior mean density
mean_density <- colMeans(density_estimates)
# 95% credible intervals
credible_lower <- apply(density_estimates, 2, quantile, probs = 0.025)
credible_upper <- apply(density_estimates, 2, quantile, probs = 0.975)
# Store results
all_density_estimates_a[[H]] <- density_estimates
all_mean_density_a[[H]] <- mean_density
all_credible_lower_a[[H]] <- credible_lower
all_credible_upper_a[[H]] <- credible_upper
# Plot results
hist(speeds, freq = FALSE, col = 'gray', border = 'gray', breaks = 20, cex.main = 1,
main = bquote("Posterior Mean Density and 95% CI when " ~ alpha == 1 ~ " and H =" ~ .(H)),
xlab = 'Speed', ylab = 'Density',
ylim = range(c(credible_lower,credible_upper, max(density(speeds)$y), min(density(speeds)$y))) )
lines(y_grid, mean_density, col = 'blue', lwd = 1)
lines(y_grid, credible_lower, col = 'red', lwd = 1, lty = 2)
lines(y_grid, credible_upper, col = 'red', lwd = 1, lty = 2)
legend('topright', legend = c('Posterior Mean', '95% CI'),
col = c('blue', 'red'), lty = c(1, 2), lwd = 2, cex = 0.7)
}
Additionally, the following plot shows how the posterior mean density changes across \(H=1,\dots,10\). As \(H\) grows (light blue to dark blue), the mean density tracks the shape of the data, including its multiple modes, more closely.
maxmean <- max(sapply(all_mean_density_a, max))
hist(speeds, freq = FALSE, col = 'gray', border = 'gray', breaks = 20, cex.main = 1,
main = bquote("Histogram of galaxy data Posterior Mean Densities when "~ alpha == 1),
xlab = 'Speed', ylab = 'Density',
ylim = range(0, max(density(speeds)$y) , maxmean) )
col_list <- c('cyan1','cyan2', 'cyan3','cyan4',
'deepskyblue1','deepskyblue2','deepskyblue3','deepskyblue4',
'blue1','blue4')
for(H in 1:10){
lines(y_grid, all_mean_density_a[[H]], col = col_list[H], lwd = 2)
}
(b) An alternative is to use \(\alpha_h = 1/H\) for all \(h\). Repeat (a) using this \(\alpha_h\), and compare the results.
Most of the R code is identical to that of (a); the only change is the ‘alpha’ vector, whose elements are now all \(1/H\).
##### (b)
all_density_estimates_b <- list()
all_mean_density_b <- list()
all_credible_lower_b <- list()
all_credible_upper_b <- list()
for (H in 1:10) {
alpha <- rep(1/H, H) # Dirichlet prior
mu0 <- mean(speeds) # Prior mean for mu_h
kappa <- 1 # Prior precision for mu_h
a_tau <- 2 # Prior shape parameter for tau2_h
b_tau <- 1 # Prior scale parameter for tau2_h
mu <- rnorm(H, mean = mu0, sd = 1)
tau2 <- rinvgamma(H, alpha = a_tau, beta = b_tau)
pis <- rep(1/H, H)
z <- sample(1:H, n, replace = TRUE)
mu_gibbs <- matrix(0, nrow = n_iter, ncol = H)
tau2_gibbs <- matrix(0, nrow = n_iter, ncol = H)
pis_gibbs <- matrix(0, nrow = n_iter, ncol = H)
density_estimates <- matrix(0, nrow = n_iter, ncol = n_grid)
for (iter in 1:n_iter) {
# Update z_i
for (i in 1:n) {
probs <- pis * dnorm(speeds[i], mean = mu, sd = sqrt(tau2))
z[i] <- sample(1:H, 1, prob = probs )
}
# Update mu_h, tau2_h
ns <- numeric(H)
for (h in 1:H) {
n_h <- sum(z == h)
# If no data points are assigned to component h, skip the update
if (n_h == 0) { next }
ybar_h <- mean(speeds[z == h])
kappa_h <- 1/(1/kappa + n_h)
mu_h <- kappa_h*(mu0/kappa + n_h*ybar_h)
a_h <- a_tau + n_h/2
b_h <- b_tau + 0.5*sum((speeds[z == h] - ybar_h)^2) + 0.5*(n_h/(1 + kappa*n_h))*(ybar_h - mu0)^2
tau2[h] <- rinvgamma(1, alpha = a_h, beta = b_h) # draw tau2_h first,
mu[h] <- rnorm(1, mean = mu_h, sd = sqrt(kappa_h*tau2[h])) # then mu_h | tau2_h
ns[h] <- n_h
}
# Update pi
alpha_post <- alpha + ns
pis <- if (H == 1) 1 else rdirichlet(1, alpha_post) # rdirichlet needs length(alpha) >= 2
# Store samples
mu_gibbs[iter, ] <- mu
tau2_gibbs[iter, ] <- tau2
pis_gibbs[iter, ] <- pis
# Estimate density
density_estimates[iter, ] <- rowSums(sapply(1:H, function(h) {
pis[h] * dnorm(y_grid, mean = mu[h], sd = sqrt(tau2[h])) # length(y_grid) by H matrix
}))
}
# Discard burn-in samples
mu_gibbs <- mu_gibbs[-(1:burn_in), ]
tau2_gibbs <- tau2_gibbs[-(1:burn_in), ]
pis_gibbs <- pis_gibbs[-(1:burn_in), ]
density_estimates <- density_estimates[-(1:burn_in), ]
# Posterior mean density
mean_density <- colMeans(density_estimates)
# 95% credible intervals
credible_lower <- apply(density_estimates, 2, quantile, probs = 0.025)
credible_upper <- apply(density_estimates, 2, quantile, probs = 0.975)
# Store results
all_density_estimates_b[[H]] <- density_estimates
all_mean_density_b[[H]] <- mean_density
all_credible_lower_b[[H]] <- credible_lower
all_credible_upper_b[[H]] <- credible_upper
# Plot results
hist(speeds, freq = FALSE, col = 'gray', border = 'gray', breaks = 20, cex.main = 1,
main = bquote("Posterior Mean Density and 95% CI when " ~ alpha == 1/H ~ " and H =" ~ .(H)),
xlab = 'Speed', ylab = 'Density',
ylim = range(c(credible_lower,credible_upper, max(density(speeds)$y), min(density(speeds)$y))) )
lines(y_grid, mean_density, col = 'blue', lwd = 1)
lines(y_grid, credible_lower, col = 'red', lwd = 1, lty = 2)
lines(y_grid, credible_upper, col = 'red', lwd = 1, lty = 2)
legend('topright', legend = c('Posterior Mean', '95% CI'),
col = c('blue', 'red'), lty = c(1, 2), lwd = 2, cex = 0.7)
}
maxmean <- max(sapply(all_mean_density_b, max))
hist(speeds, freq = FALSE, col = 'gray', border = 'gray', breaks = 20, cex.main = 1,
main = bquote("Histogram of galaxy data Posterior Mean Densities when "~ alpha == 1/H),
xlab = 'Speed', ylab = 'Density',
ylim = range(0, max(density(speeds)$y) , maxmean) )
for(H in 1:10){
lines(y_grid, all_mean_density_b[[H]], col = col_list[H], lwd = 2)
}
As in (a), the posterior mean density tracks the data more closely as \(H\) increases. However, the fits in (a) with \(\alpha_h = 1\) follow the data more closely than those in (b) with \(\alpha_h=1/H\), which is consistent with the smaller Dirichlet parameters in (b) shrinking the weights of the extra components toward zero.