Q1. Consider a univariate nonparametric regression model \[y_i = \mu(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, 0.1^2), \quad i = 1, \dots, n. \] where \(\mu(x) = \sin^3(x)\) and \(x_i = \frac{2i - 1}{1000}, \quad i = 1, \dots, n\), with \(n = 500\). We will perform Bayesian inference on the mean function using the basis expansion with knot selection (Method 2). We define the full model using cubic splines with \(L\) interior equally spaced knots, resulting in \(H = L+4\) basis terms, and then we choose important knots from the \(L\) candidates. We denote the indicator variables for knots by \(\gamma = (\gamma_1, \ldots, \gamma_L)\), where \(\gamma_\ell = 1\) if the \(\ell\)th knot is included and \(\gamma_\ell = 0\) otherwise. We set \(L = 30\).

(a) The first step is to verify that using any type of spline basis is essentially identical. Consider two \(\gamma\)’s as follows:
- \(\gamma_\ell = 1\) if \(\ell\) is odd; \(\gamma_\ell = 0\) if \(\ell\) is even;
- \(\gamma_\ell = 0\) if \(\ell\) is odd; \(\gamma_\ell = 1\) if \(\ell\) is even.
For each of these \(\gamma\)’s, generate design matrices using the truncated power basis, polynomial radial basis, and B-spline basis. Using software, verify that the column spaces of the three design matrices are identical.
[Hint: Among several possible ways, you can check the equivalence of column spaces by examining their orthogonal projection matrices, as the orthogonal projection matrix is unique for a given column space.]
[Hint: You can generate B-spline basis functions using the ‘bs’ function (with an intercept) in R. There are a few equivalent ones in Python, e.g., ‘BSpline’ function in ‘scikit-fda’.]

Let us define two distinct gamma vectors as follows: ‘gamma_odd’ has 1 in the odd positions and 0 in the even positions, while ‘gamma_even’ has 1 in the even positions and 0 in the odd positions.

rm(list=ls())
#########     (a)

library(splines)

n <- 500;L <- 30
x <- (2*(1:n)-1)/1000


gamma_odd <- rep(c(1, 0), length.out = L) # 1,0,1,0, ... 
gamma_even <- rep(c(0, 1), length.out = L) # 0,1,0,1, ...

The \(L = 30\) equally spaced interior knots between 0 and 1 (excluding both endpoints) are defined as follows.

knots <- seq(0, 1, length.out = L + 2)[-c(1, L + 2)]

The following shows how to construct the truncated power basis. For the cubic case, every basis function other than \(1,x,x^2,x^3\) has the form \(\max\{0, x-x_l\}^3\), where \(x_l\) is one of the knots, \(l=1,\dots,30\). The \(l\)th column of the first ‘basis’ matrix contains \(\max\{0, x-x_l\}^3\) evaluated at each element of the input vector ‘x’. The ‘basis’ matrix is then subset to the columns where the ‘gamma’ vector equals 1. Finally, the columns \(1,x,x^2,x^3\) are prepended to the ‘basis’ matrix, and the result is returned.

trunc_p_basis <- function(x, knots, gamma) {
  basis <- sapply(knots, function(k) (x - k)^3*(x > k))  # max(0, x - k)^3
  basis <- basis[, gamma == 1, drop = FALSE]             # keep selected knots only
  cbind(1, x, x^2, x^3, basis)                           # prepend polynomial terms
}

The polynomial radial basis is defined in the same way, except that the function passed to ‘sapply’ is ‘abs(x - k)^3’.

polyn_r_basis <- function(x, knots, gamma) {
  basis <- sapply(knots, function(k) abs(x - k)^3)  # |x - k|^3
  basis <- basis[, gamma == 1, drop = FALSE]        # keep selected knots only
  cbind(1, x, x^2, x^3, basis)                      # prepend polynomial terms
}

The ‘proj_mat’ function computes the orthogonal projection matrix onto the column space of its argument.

proj_mat <- function(W) { W %*% solve(t(W) %*% W) %*% t(W) }
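A numerically stabler way to obtain the same projection (a sketch, not part of the assignment's code) is to build it from a QR decomposition instead of inverting \(W^TW\). The matrix ‘W’ below is a small illustrative design matrix, not one from this problem.

```r
# Projection via QR: avoids explicitly inverting t(W) %*% W
proj_mat_qr <- function(W) {
  Q <- qr.Q(qr(W))  # orthonormal basis of col(W)
  Q %*% t(Q)        # orthogonal projection onto col(W)
}

W <- cbind(1, 1:10, (1:10)^2)            # small polynomial design
P1 <- W %*% solve(t(W) %*% W) %*% t(W)   # normal-equations projection
P2 <- proj_mat_qr(W)
max(abs(P1 - P2))                        # agrees up to rounding error
```

Both constructions give the same matrix; the QR route is preferable when \(W^TW\) is ill-conditioned, as can happen with many nearly collinear basis columns.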

Now consider the ‘gamma_odd’ case first. The design matrices generated by the truncated power basis, the polynomial radial basis, and the B-spline basis, together with their corresponding projection matrices, are as follows:

W_tp_odd <- trunc_p_basis(x, knots, gamma_odd)
W_pr_odd <- polyn_r_basis(x, knots, gamma_odd)
W_bs_odd <- bs(x, knots = knots[gamma_odd == 1], degree = 3, intercept = TRUE)

proj_tp_odd <- proj_mat(W_tp_odd)
proj_pr_odd <- proj_mat(W_pr_odd)
proj_bs_odd <- proj_mat(W_bs_odd)

All pairwise maximum differences are close to 0, on the order of \(10^{-6}\) to \(10^{-5}\):

max(abs(proj_tp_odd - proj_pr_odd))
## [1] 1.890928e-05
max(abs(proj_tp_odd - proj_bs_odd))
## [1] 1.886291e-05
max(abs(proj_pr_odd - proj_bs_odd))
## [1] 1.328363e-06

Up to numerical error, the three projection matrices are identical, so their column spaces are identical as well.
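An equivalent column-space check (a sketch, separate from the projection-matrix approach above): two matrices span the same column space iff the rank of their column-wise concatenation equals the rank of each matrix. Illustrated here on two different bases of the space of quadratic polynomials, not on the spline bases themselves.

```r
# col(A) = col(B) iff rank(cbind(A, B)) == rank(A) == rank(B)
x0 <- seq(0, 1, length.out = 20)
A <- cbind(1, x0, x0^2)             # monomial basis
B <- cbind(1, 1 + x0, (1 - x0)^2)   # another basis of the same space
rank_of <- function(M) qr(M)$rank
c(rank_of(A), rank_of(B), rank_of(cbind(A, B)))   # all three equal 3
```

Since concatenating the two bases does not increase the rank, every column of one lies in the span of the other.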

Next, consider the ‘gamma_even’ case. The design matrices generated by the truncated power basis, the polynomial radial basis, and the B-spline basis, together with their corresponding projection matrices, are as follows:

W_tp_even <- trunc_p_basis(x, knots, gamma_even)
W_pr_even <- polyn_r_basis(x, knots, gamma_even)
W_bs_even <- bs(x, knots = knots[gamma_even == 1], degree = 3, intercept = TRUE)


proj_tp_even <- proj_mat(W_tp_even)
proj_pr_even <- proj_mat(W_pr_even)
proj_bs_even <- proj_mat(W_bs_even)

Again, all pairwise maximum differences are close to 0, on the order of \(10^{-7}\) to \(10^{-5}\):

max(abs(proj_tp_even - proj_pr_even))
## [1] 7.651545e-05
max(abs(proj_tp_even - proj_bs_even))
## [1] 2.719679e-07
max(abs(proj_pr_even - proj_bs_even))
## [1] 7.650571e-05

Up to numerical error, the three projection matrices are identical, so their column spaces are identical as well.

(b) We now explore a model-averaged estimate of the target function. Using the above model with given \(x_i\), generate \(y_i\), \(i = 1, \ldots, n\). With the design matrix generated by the polynomial radial basis functions, put the g-prior on the coefficients \(\beta_\gamma\) with \(g = n\), the Jeffreys prior on \(\sigma^2\), and a Bernoulli prior \(p(\gamma_\ell = 1) = 1/2\), \(\ell = 1, \ldots, L\). The sampling scheme can be summarized as follows:
- Draw \(\gamma_\ell\) from \(p(\gamma_\ell | \gamma_{(-\ell)}, y)\), \(\ell = 1, \ldots, L\) (Gibbs sampling);
- Draw \(\sigma^2\) from \(p(\sigma^2 | \gamma, y)\);
- Draw \(\beta_\gamma\) from \(p(\beta_\gamma | \gamma, \sigma^2, y)\).
For \(x_0 = j/1000\), \(j = 1, 2, \ldots, 999\), obtain the model averaged pointwise posteriors for \(\mu(x_0)\). Draw the posterior mean and the 95% credible interval for every \(x_0\), which portrays the posterior mean curve and the 95% credible band on \([0, 1]\). Provide a table that shows the marginal posterior of \(\gamma_\ell\), \(p(\gamma_\ell = 1 | y)\), \(\ell = 1, \ldots, L\).

To generate \(y_i\), we first set the parameters and define the mean function.

library(MASS) 
library(extraDistr)
## Warning: package 'extraDistr' was built under R version 4.3.3
set.seed(2023311161)

sigma2 <- 0.1^2
g <- n
pi_prior <- 0.5
n_iter <- 1000
burn_in <- 100

mu <- function(x) { sin(x)^3 }
y <- mu(x) + rnorm(n, mean = 0, sd = sqrt(sigma2))

Then generate an initial gamma vector from the Bernoulli prior.

gam <- rbinom(L, 1, pi_prior)

Empty matrices and a vector are defined to store the Gibbs samples.

gam_gibbs <- matrix(0, nrow = n_iter, ncol = L)
sigma2_gibbs <- numeric(n_iter)
beta_gibbs <- matrix(0, nrow = n_iter, ncol = L + 4)

Before getting into the main Gibbs sampler code, we need a function for the marginal likelihood of \(y\) given \(\gamma\): \(p(y|\gamma)\). Suppose \(p(\gamma)\) is discrete, \(\sigma^2 \sim \text{Inv-}\Gamma(a,b)\), and \(\beta_{\gamma}|\sigma^2,\gamma\sim\textsf{N}_{|\gamma|}(\beta_0, \sigma^2\Psi_0)\); then \[p(y|\gamma) \propto |I_n + W_{\gamma}\Psi_0W_{\gamma}^T|^{-1/2}\cdot\biggl\{b + \frac{1}{2}(y-W_{\gamma}\beta_0)^T(I_n + W_{\gamma}\Psi_0W_{\gamma}^T)^{-1}(y-W_{\gamma}\beta_0) \biggr\}^{-(a+n/2)}.\] However, directly inverting a high-dimensional matrix such as \(\bigl(I_n + W_{\gamma}\Psi_0W_{\gamma}^T \bigr)^{-1}\) is far too costly, so it is desirable to reduce the dimension of the matrix being inverted as much as possible. By the Woodbury matrix identity and the g-prior \(\Psi_0=g(W_{\gamma}^TW_{\gamma})^{-1}\), we have \[\begin{aligned} \bigl(I_n + W_{\gamma}\Psi_0W_{\gamma}^T\bigr)^{-1} &= I_n^{-1} - I_n^{-1}W_{\gamma}\bigl(\Psi_0^{-1} + W_{\gamma}^TI_nW_{\gamma}\bigr)^{-1}W_{\gamma}^TI_n^{-1} \\ &= I_n-W_{\gamma}\bigl(\Psi_0^{-1} + W_{\gamma}^TW_{\gamma}\bigr)^{-1}W_{\gamma}^T\\ &= I_n-\frac{g}{g+1}W_{\gamma}\bigl(W_{\gamma}^TW_{\gamma}\bigr)^{-1}W_{\gamma}^T. \end{aligned}\] The matrix \(W_{\gamma}\) is an \(n \times |\gamma|\) matrix, where \(|\gamma| = \sum_{l=1}^L\mathbf{1}(\gamma_l=1) + 4\), so \(W_{\gamma}^TW_{\gamma}\) is at most a \(34\times34\) matrix. Since inverting a \(p\times p\) matrix costs \(O(p^3)\), the cost of the inversion is reduced by a factor of at least \((34/500)^3 \approx 3\times10^{-4}\).
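The Woodbury/g-prior simplification above can be verified numerically (a sketch; the small design matrix ‘W’ and value ‘g0’ below are arbitrary, chosen only to check the identity):

```r
# Check (I_n + W Psi0 W^T)^{-1} = I_n - g/(g+1) W (W^T W)^{-1} W^T
# with Psi0 = g (W^T W)^{-1}
set.seed(1)
n0 <- 8; g0 <- 5
W <- cbind(1, rnorm(n0), rnorm(n0))
Psi0 <- g0 * solve(t(W) %*% W)
lhs <- solve(diag(n0) + W %*% Psi0 %*% t(W))
rhs <- diag(n0) - g0/(g0 + 1) * W %*% solve(t(W) %*% W) %*% t(W)
max(abs(lhs - rhs))   # essentially zero
```

The identity holds exactly because \(W\Psi_0W^T = g\,P_W\) for the projection \(P_W\), whose eigenvalues make the inverse explicit.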

Under the g-prior, \(\beta_0 = 0\) and \(\Psi_0=g(W_{\gamma}^TW_{\gamma})^{-1}\); the Jeffreys prior for \(\sigma^2\) corresponds to \(a=b=0\). The marginal likelihood then simplifies to \[\begin{aligned} p(y|\gamma) &\propto |I + W_{\gamma}\Psi_0W_{\gamma}^T|^{-1/2}\cdot\biggl\{\frac{1}{2}y^T(I + W_{\gamma}\Psi_0W_{\gamma}^T)^{-1}y \biggr\}^{-(n/2)} \\ &=\biggl| I_n-\frac{g}{g+1}W_{\gamma}\bigl(W_{\gamma}^TW_{\gamma}\bigr)^{-1}W_{\gamma}^T\biggr|^{1/2} \cdot\biggl\{\frac{1}{2}y^T\biggl(I_n-\frac{g}{g+1}W_{\gamma}\bigl(W_{\gamma}^TW_{\gamma}\bigr)^{-1}W_{\gamma}^T \biggr)y \biggr\}^{-(n/2)} . \end{aligned}\] The ‘p_y_given_gam’ function below computes this quantity (up to the proportionality constant).

# marginal likelihood p(y | gamma), up to a normalizing constant
p_y_given_gam <- function(y, x, knots, W_gam){
  n <- length(y)
  
  # V = I_n - g/(g+1) * W (W^T W)^{-1} W^T, with g = n
  V <- diag(n) - n/(n+1)*W_gam%*%solve(t(W_gam)%*%W_gam)%*%t(W_gam)
  R <- (1/2)*t(y)%*%V%*%y
  
  return(sqrt(det(V))*R^(-n/2))
}

As previously mentioned, \(\gamma\), \(\sigma^2\), and \(\beta_{\gamma}\) are sampled with a Gibbs sampler, so we need the full conditional distribution of each.

In order to sample the \(l\)th element of the ‘gamma’ vector, we need \(p(\gamma_l | \gamma_{-l},y)\), where \(-l\) denotes all elements except the \(l\)th. Recall that the posterior of \(\gamma\) is \(p(\gamma|y) \propto p(\gamma)p(y|\gamma)\). Then, \[\begin{aligned} p(\gamma_l =1| \gamma_{-l},y) &=\frac{p(\gamma_l=1 | \gamma_{-l},y)}{p(\gamma_l =1| \gamma_{-l},y) + p(\gamma_l =0| \gamma_{-l},y)} \\ &= \frac{p(\gamma_l=1 | \gamma_{-l},y) p(\gamma_{-l}|y) }{p(\gamma_l =1| \gamma_{-l},y)p(\gamma_{-l}|y) + p(\gamma_l =0| \gamma_{-l},y)p(\gamma_{-l}|y)} \\ &= \frac{p(\gamma_l=1, \gamma_{-l}|y)}{p(\gamma_l=1, \gamma_{-l}|y) + p(\gamma_l=0, \gamma_{-l}|y)} \\ &= \frac{p(\gamma_l=1, \gamma_{-l})p(y|\gamma_l=1,\gamma_{-l}) }{p(\gamma_l=1, \gamma_{-l})p(y|\gamma_l=1,\gamma_{-l}) + p(\gamma_l=0, \gamma_{-l})p(y|\gamma_l=0,\gamma_{-l})} \\ &= \frac{p(\gamma_l=1)p(y|\gamma_l=1,\gamma_{-l})}{p(\gamma_l=1)p(y|\gamma_l=1,\gamma_{-l})+p(\gamma_l=0)p(y|\gamma_l=0,\gamma_{-l})} \\ &= \frac{p(y|\gamma_l=1,\gamma_{-l})}{p(y|\gamma_l=1,\gamma_{-l})+p(y|\gamma_l=0,\gamma_{-l})} \end{aligned}\] where the fifth equality holds by the prior independence across \(l\), and the last equality holds because the Bernoulli prior gives \(p(\gamma_l=1)= p(\gamma_l=0)=1/2\). Hence \(\gamma_l| \gamma_{-l},y\) follows a Bernoulli distribution with the above probability, and it remains to compute \(p(y|\gamma_l=1,\gamma_{-l})\) and \(p(y|\gamma_l=0,\gamma_{-l})\).
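For large \(n\), the two marginal likelihoods can underflow to 0 in double precision, so the Bernoulli probability above is often computed on the log scale. A minimal sketch, where ‘log_ml1’ and ‘log_ml0’ stand for \(\log p(y|\gamma_l=1,\gamma_{-l})\) and \(\log p(y|\gamma_l=0,\gamma_{-l})\) (the two numeric values below are made up purely for illustration):

```r
# Log-scale computation of p(gamma_l = 1 | gamma_{-l}, y):
# subtract the max before exponentiating so the ratio stays finite
log_ml1 <- -1234.5   # hypothetical log marginal likelihood, gamma_l = 1
log_ml0 <- -1236.2   # hypothetical log marginal likelihood, gamma_l = 0
m <- max(log_ml1, log_ml0)
p1 <- exp(log_ml1 - m) / (exp(log_ml1 - m) + exp(log_ml0 - m))
p1   # well-defined even though exp(log_ml1) itself underflows to 0
```

The result equals \(1/(1+\exp(\text{log\_ml0} - \text{log\_ml1}))\), so only the difference of the two log values matters.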

For the \(l\)th knot, let ‘gam_temp’ be a temporary gamma vector that flips \(\gamma_l\) from 1 to 0 (or from 0 to 1). When \(\gamma_l = 1\), the design matrix \(W_{\gamma}\) is built from the polynomial radial basis using the current gamma vector and \(W_0\) using the temporary one; when \(\gamma_l = 0\), the roles are reversed. Then ‘p_gam1_y’ stores the value of ‘p_y_given_gam’ with design matrix ‘W_gam’, while ‘p_gam0_y’ stores the value with design matrix ‘W_0’. Finally, ‘p_gam1’, representing \(p(\gamma_l =1| \gamma_{-l},y)\), is computed as ‘p_gam1_y/(p_gam1_y + p_gam0_y)’, and \(\gamma_l\) is updated by drawing from \(\textsf{Bernoulli}(p(\gamma_l =1| \gamma_{-l},y))\).

The full conditional of \(\sigma^2\) is \[\begin{aligned} \sigma^2|\gamma, y \sim \text{Inv-}\Gamma\biggl(\frac{n}{2}, \frac{1}{2}y^T\bigl(I_n+W_{\gamma}\Psi_0W_{\gamma}^T \bigr)^{-1}y \biggr), \end{aligned}\] using \(a=b=0\) and \(\beta_0 = 0\). Using the same inversion formula, \[\begin{aligned} \sigma^2|\gamma, y \sim \text{Inv-}\Gamma\biggl(\frac{n}{2}, \frac{1}{2}y^T\biggl(I_n-\frac{g}{g+1}W_{\gamma}\bigl(W_{\gamma}^TW_{\gamma}\bigr)^{-1}W_{\gamma}^T \biggr)y \biggr). \end{aligned}\] So we sample \(\sigma^2\) using ‘rinvgamma’ from the ‘extraDistr’ package.
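If ‘extraDistr’ is unavailable, an \(\text{Inv-}\Gamma(a,b)\) draw can equivalently be obtained as the reciprocal of a \(\Gamma(a, \text{rate}=b)\) draw (a sketch; the shape and rate values below are arbitrary test values, not the ones used in the sampler):

```r
# Inv-Gamma(a, b) via 1 / Gamma(a, rate = b); sanity-check the mean b/(a-1)
set.seed(1)
a <- 250; b <- 2.5
draws <- 1 / rgamma(1e5, shape = a, rate = b)
mean(draws)   # close to the theoretical mean b / (a - 1)
```

This reciprocal construction is exactly the relationship ‘rinvgamma’ relies on.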

The full conditional of \(\beta_{\gamma}\) is \[\begin{aligned} \beta_{\gamma}|\sigma^2,\gamma,y \sim \textsf{N}_{|\gamma|}\biggl(\frac{g}{g+1}\bigl(W_{\gamma}^TW_{\gamma}\bigr)^{-1}W_{\gamma}^Ty, \frac{g}{g+1}\sigma^2\bigl(W_{\gamma}^TW_{\gamma}\bigr)^{-1}\biggr), \end{aligned}\] so \(\beta_{\gamma}\) can be sampled with the ‘mvrnorm’ function. Note that the stored coefficient vector is \((L+4)\)-dimensional: its nonzero entries correspond to \(1,x,x^2,x^3\) and to the knots with \(\gamma_\ell=1\), while the entries for \(\gamma_\ell=0\) remain zero. Each sampled ‘beta_gamma’ is therefore written into the ‘iter’th row of the ‘beta_gibbs’ matrix in the matching positions: the first 4 entries always, and the remainder in the positions with \(\gamma_\ell=1\).

# Gibbs sampler main code
for (iter in 1:n_iter) {
  
  
  # Sample gamma
  for (l in 1:L) {
    gam_temp <- gam
    gam_temp[l] <- 1 - gam[l]
    
    if(gam[l] == 1){
      W_gam <- polyn_r_basis(x, knots, gam)
      W_0 <- polyn_r_basis(x, knots, gam_temp)
    }
    else{ # gam[l] == 0
      W_gam <- polyn_r_basis(x, knots, gam_temp)
      W_0 <- polyn_r_basis(x, knots, gam)
    }
    
    p_gam1_y <- p_y_given_gam(y,x,knots,W_gam)
    p_gam0_y <- p_y_given_gam(y,x,knots,W_0)
    
    p_gam1 <- p_gam1_y/(p_gam1_y + p_gam0_y)
    
    gam[l] <- rbinom(1, 1, p_gam1)
  }
  
  
  # Sample sigma2
  W_gam <- polyn_r_basis(x, knots, gam)
  V <- diag(n) - n/(n+1)*W_gam%*%solve(t(W_gam)%*%W_gam)%*%t(W_gam)
  alpha_post <- n/2
  beta_post <- 0.5*t(y)%*%V%*%y
  sigma2 <- rinvgamma(1, alpha = alpha_post, beta = beta_post)
    
  
  # Sample beta_gamma
  V_beta <- sigma2*(g/(1 + g))*solve( t(W_gam)%*%W_gam )
  mu_beta <- (g/(1 + g))*solve( t(W_gam)%*%W_gam )%*%t(W_gam)%*%y
  beta_gamma <- mvrnorm(1, mu_beta, V_beta)
  
  
  # Store samples
  gam_gibbs[iter, ] <- gam
  sigma2_gibbs[iter] <- sigma2
  beta_gibbs[iter, c(TRUE,TRUE,TRUE,TRUE, gam == 1)] <- beta_gamma
}

# burn-in
gam_gibbs <- gam_gibbs[-(1:burn_in), ]
sigma2_gibbs <- sigma2_gibbs[-(1:burn_in)]
beta_gibbs <- beta_gibbs[-(1:burn_in), ]

After running the Gibbs sampler, define the grid of x values (written as \(x_0\) in the question) and an empty matrix for the posterior draws of the mean function.

# Posterior mean and credible intervals
x_grid <- (1:999)/1000
mu_post_gibbs <- matrix( 0, nrow = nrow(gam_gibbs) , ncol = length(x_grid) )

Given \(\gamma\), the nonparametric regression model can be written as \[\begin{aligned} y = W_{\gamma}\beta_{\gamma}+\epsilon. \end{aligned}\] So each posterior draw of the mean function is \(\mu = W_{\gamma}\beta_{\gamma}\), and \(\mu(x_{0,j})\) corresponds to \((W_{\gamma}\beta_{\gamma})_j\), where \(W_{\gamma}\) is built from the x grid with \(x_{0,j} = j/1000\). The draws are stored in the ‘mu_post_gibbs’ matrix, and the pointwise mean, 0.025 quantile, and 0.975 quantile are stored in ‘mu_posterior_mean’, ‘mu_posterior_lower’, and ‘mu_posterior_upper’, respectively.

for (i in 1:nrow(gam_gibbs)) { 
  gam_i <- gam_gibbs[i, ]
  beta_i <- beta_gibbs[i, c(TRUE,TRUE,TRUE,TRUE, gam_i == 1)]
  W_i <- polyn_r_basis(x_grid, knots, gam_i)
  mu_post_gibbs[i,] <- W_i%*%beta_i
}


mu_posterior_mean <- apply(mu_post_gibbs, 2, mean)
mu_posterior_lower <- apply(mu_post_gibbs, 2, quantile, probs = 0.025)
mu_posterior_upper <- apply(mu_post_gibbs, 2, quantile, probs = 0.975)

The resulting plots are given as follows.

# Plot the results
plot(x_grid, mu(x_grid), type = 'l', col = 'black', lwd = 2, 
     ylim = range(mu_posterior_lower, mu_posterior_upper),
     xlab = 'x', ylab = 'mu(x)', main = 'Posterior Mean and 95% Credible Interval', cex.main = 1)
lines(x_grid, mu_posterior_mean, col = 'blue', lwd = 1)
lines(x_grid, mu_posterior_lower, col = 'red', lwd = 1, lty = 2)
lines(x_grid, mu_posterior_upper, col = 'red', lwd = 1, lty = 2)
legend('topleft', legend = c('True Function', 'Posterior Mean', '95% CI'),
       col = c('black', 'blue', 'red'), lty = c(1, 1, 2), lwd = 1, cex=0.7)

The following table shows the estimated marginal posterior inclusion probabilities \(p(\gamma_l = 1 \mid y)\) for each \(l\), computed as the mean of the sampled \(\gamma_l\) values.

# Table of marginal posterior of gamma
gamma_post_mean <- colMeans(gam_gibbs)
gamma_post_table <- data.frame(Knot = 1:L, P_gamma = gamma_post_mean)
print(gamma_post_table)
##    Knot    P_gamma
## 1     1 0.05222222
## 2     2 0.03777778
## 3     3 0.04222222
## 4     4 0.04777778
## 5     5 0.03444444
## 6     6 0.05111111
## 7     7 0.06000000
## 8     8 0.04111111
## 9     9 0.03555556
## 10   10 0.04555556
## 11   11 0.05222222
## 12   12 0.04555556
## 13   13 0.04444444
## 14   14 0.04000000
## 15   15 0.03222222
## 16   16 0.03444444
## 17   17 0.04666667
## 18   18 0.03777778
## 19   19 0.04333333
## 20   20 0.04777778
## 21   21 0.04888889
## 22   22 0.04555556
## 23   23 0.05222222
## 24   24 0.05000000
## 25   25 0.03222222
## 26   26 0.03777778
## 27   27 0.04555556
## 28   28 0.03666667
## 29   29 0.05444444
## 30   30 0.04555556

The following shows the sampled \(\gamma\) vectors (one column per iteration) for the first 5 post-burn-in iterations.

gam_gibbs_tb <- t(gam_gibbs)
gam_gibbs_tb[,(1:5)]
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    0    0    0    0    0
##  [2,]    0    0    0    0    0
##  [3,]    0    0    0    0    0
##  [4,]    0    0    0    0    1
##  [5,]    1    0    0    1    0
##  [6,]    0    0    0    0    0
##  [7,]    0    0    0    0    0
##  [8,]    0    0    0    0    0
##  [9,]    0    0    0    0    0
## [10,]    0    0    1    0    0
## [11,]    1    0    0    0    0
## [12,]    0    0    0    0    0
## [13,]    0    0    0    0    0
## [14,]    0    0    0    0    0
## [15,]    1    0    0    0    0
## [16,]    0    0    0    0    0
## [17,]    0    0    0    0    0
## [18,]    0    0    0    0    0
## [19,]    0    0    0    0    0
## [20,]    0    0    0    0    0
## [21,]    0    0    0    0    0
## [22,]    1    1    0    0    0
## [23,]    0    0    0    0    0
## [24,]    0    0    0    0    0
## [25,]    0    0    0    0    0
## [26,]    0    0    0    0    0
## [27,]    0    0    0    0    0
## [28,]    0    0    1    0    0
## [29,]    0    0    0    0    0
## [30,]    0    0    0    0    0

Q2. Consider the nonparametric regression model \(y_i = \mu(x_i) + \epsilon_i\), \(\epsilon_i \sim \textsf{N}(0, \sigma^2)\), for \(x_i \in \mathbb{R}^p\). Suppose that \(\mu\) is endowed with a Gaussian process prior \(\mu \sim \textsf{GP}(0, k)\), where \(k\) is a covariance function.

(a) Let \(\bar{\mu} = (\mu(x_1), \ldots, \mu(x_n)) \in \mathbb{R}^n\) and \(\tilde{\mu} = (\mu(\tilde{x}_1), \ldots, \mu(\tilde{x}_m)) \in \mathbb{R}^m\) for \(\tilde{x}_i\), \(i = 1, \ldots, m\), which do not overlap with \(x_i\), \(i = 1, \ldots, n\). Show that \[ \begin{pmatrix} y \\ \tilde{\mu} \end{pmatrix} \Bigg| \sigma^2 \sim \textsf{N}_{n+m} \left( 0, \begin{pmatrix} K(x, x) + \sigma^2 I_n & K(x, \tilde{x}) \\ K(\tilde{x}, x) & K(\tilde{x}, \tilde{x}) \end{pmatrix} \right), \] where \(K(x, x)\) is the covariance matrix for the observed data points, \(K(x, \tilde{x})\) is the cross-covariance matrix between the observed and new data points, \(K(\tilde{x}, x)\) is the cross-covariance matrix (transpose of \(K(x, \tilde{x})\)), and \(K(\tilde{x}, \tilde{x})\) is the covariance matrix for the new data points.

Let \(\mu := \bar\mu\) and \(K := K(x,x)\) for simplicity. The main goal is to obtain the distribution of \(y|\sigma^2\). The given nonparametric regression model can be written as \(y|\mu,\sigma^2 \sim \textsf{N}_{n}(\mu, \sigma^2I_n)\), with density \[p(y|\mu,\sigma^2) \propto \exp\biggl\{-\frac{1}{2}(y-\mu)^T\Bigl(\frac{1}{\sigma^2}I_n\Bigr)(y-\mu)\biggr\}.\] Then, \[\begin{aligned} p(y|\mu,\sigma^2)p(\mu) &\propto \exp\biggl\{-\frac{1}{2}(y-\mu)^T\Bigl(\frac{1}{\sigma^2}I_n\Bigr)(y-\mu)\biggr\}\exp\biggl\{-\frac{1}{2}\mu^TK^{-1}\mu\biggr\} \\ &= \exp\biggl\{-\frac{1}{2}\biggl(\mu^T\Bigl(\frac{1}{\sigma^2}I_n+K^{-1} \Bigr)\mu - 2\mu^T\frac{1}{\sigma^2}I_ny + y^T\frac{1}{\sigma^2}I_ny\biggr)\biggr\} \\ &=\exp\biggl\{-\frac{1}{2}\biggl(\mu^T\bigl(\sigma^{-2}I_n+K^{-1} \bigr)\mu -2\mu^T\bigl(\sigma^{-2}I_n+K^{-1} \bigr)\sigma^{-2}\bigl(\sigma^{-2}I_n+K^{-1} \bigr)^{-1}y + y^T\sigma^{-2}\bigl(\sigma^{-2}I_n+K^{-1} \bigr)^{-1}\sigma^{-2}y\biggr)\\ &\qquad -\frac{1}{2}\biggl(y^T(\sigma^{-2}I_n)y - y^T\sigma^{-2}\bigl(\sigma^{-2}I_n+K^{-1} \bigr)^{-1}\sigma^{-2}y \biggr)\biggr\} \end{aligned}\] So the right-hand factor \(\exp\biggl\{-\frac{1}{2}\biggl(y^T(\sigma^{-2}I_n)y - y^T\bigl(\sigma^{2}I_n+\sigma^4K^{-1} \bigr)^{-1}y \biggr)\biggr\}\) is the kernel of the distribution of \(y|\sigma^2\): as a function of \(\mu\), the left-hand factor is a normal kernel whose integral over \(\mu\) does not depend on \(y\).
Using the Woodbury matrix identity, \[\begin{aligned} \bigl(\sigma^{2}I_n+\sigma^4K^{-1} \bigr)^{-1} &= \bigl(\sigma^{2}I_n+(\sigma^2I)K^{-1}(\sigma^2I) \bigr)^{-1} \\ &= \sigma^{-2}I - \sigma^{-2}I\sigma^{2}I\bigl(K + \sigma^{2}I\sigma^{-2}I\sigma^{2}I\bigr)^{-1}\sigma^{2}I\sigma^{-2}I \\ &= \sigma^{-2}I-\bigl( K+\sigma^{2}I \bigr)^{-1}, \end{aligned}\] so \[\begin{aligned} p(y|\sigma^2) &\propto \exp\biggl\{-\frac{1}{2}\biggl(y^T(\sigma^{-2}I_n)y - y^T\bigl(\sigma^{2}I_n+\sigma^4K^{-1} \bigr)^{-1}y \biggr)\biggr\} \\ &= \exp\biggl\{-\frac{1}{2}y^T\bigl( K+\sigma^{2}I_n \bigr)^{-1}y\biggr\}, \end{aligned}\] which is the kernel of \(\textsf{N}_n \bigl(0, K + \sigma^2I_n \bigr)\); hence \(y|\sigma^2 \sim \textsf{N}_n(0, K + \sigma^2I_n)\). The covariance between \(y\) and \(\tilde\mu\) is \[\begin{aligned} \text{Cov}[y, \tilde\mu|\sigma^2] &= \text{Cov}[\mu, \tilde\mu] + \text{Cov}[\epsilon,\tilde\mu|\sigma^2] = K(x,\tilde x), \end{aligned}\] because \(\epsilon\) is independent of \(\tilde\mu\) (so the second term vanishes) and \(\text{Cov}[\mu, \tilde\mu] = K(x,\tilde x)\) under the GP prior. Thus, \[\begin{pmatrix} y \\ \tilde{\mu} \end{pmatrix} \Bigg| \sigma^2 \sim \textsf{N}_{n+m} \biggl( 0, \begin{pmatrix} K(x, x) + \sigma^2 I_n & K(x, \tilde{x}) \\ K(\tilde{x}, x) & K(\tilde{x}, \tilde{x}) \end{pmatrix} \biggr).\]
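The cross-covariance identity \(\text{Cov}[y_i, \mu(\tilde x_j)|\sigma^2] = k(x_i, \tilde x_j)\) can be sanity-checked by Monte Carlo (a sketch: a toy kernel with \(\tau = l = 1\), two observed points and one new point; the specific inputs and noise level are arbitrary):

```r
# Draw (mu(x1), mu(x2), mu(x~1)) jointly from the GP prior, add noise to get y1,
# and compare the empirical Cov[y1, mu(x~1)] with k(x1, x~1)
library(MASS)
set.seed(1)
k <- function(a, b) exp(-(a - b)^2)
pts <- c(0.2, 0.5, 0.8)                      # x_1, x_2, x~_1
K <- outer(pts, pts, k)
draws <- mvrnorm(50000, mu = rep(0, 3), Sigma = K)
y1 <- draws[, 1] + rnorm(50000, sd = 0.5)    # y_1 = mu(x_1) + eps_1
cov(y1, draws[, 3])                          # close to k(0.2, 0.8) = exp(-0.36)
```

The noise term contributes nothing to the cross-covariance, exactly as in the derivation.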

(b) Suppose the nonparametric regression setup in Q1, where \(\mu(x) = \sin^3(x)\), \(\sigma^2 = 0.1^2\), \(x_i = \frac{2i-1}{1000}\), \(i = 1, \ldots, n\), with \(n = 500\). Consider a Gaussian process prior on \(\mu\) with the squared exponential covariance function, \[ k(x, x') = \tau^2 \exp \left( -\frac{(x - x')^2}{l^2} \right). \] Use the Jeffreys prior (log-uniform) for \(\sigma^2\). With a few different prespecified values of \(\tau\) and \(l\), obtain the posterior distribution of \(\tilde{\mu} = (\mu(\tilde{x}_1), \ldots, \mu(\tilde{x}_{999}))\) for the grid points \((\tilde{x}_1, \ldots, \tilde{x}_{999}) = (1/1000, \ldots, 999/1000)\) by sampling \((\tilde{\mu}, \sigma^2)\) from the joint posterior, i.e.,
- Draw \(\sigma^2\) from \(p(\sigma^2 | y)\) using grid sampling (on some suitable range);
- Draw \(\tilde{\mu}\) from \(p(\tilde{\mu} | y, \sigma^2)\).
Using the posterior of \(\tilde{\mu}\), draw the posterior mean curve and the 95% credible band on \((0, 1)\). Compare the results for different values of \(\tau\) and \(l\). Do not use Stan or something similar; write your own sampling code (the model does not require MCMC since we fix \(\tau\) and \(l\), so the use of Stan or similar software should be avoided).

The initial setup is identical to that of Q1, as follows. The ‘MASS’ library provides multivariate normal sampling.

rm(list=ls())
library(MASS) 
#########     (b)
set.seed(2023311161)

# initial values
n <- 500
x <- (2*(1:n)-1)/1000
L <- 30
sigma2 <- 0.1^2
mu <- function(x) { sin(x)^3 }
y <- mu(x) + rnorm(n, mean = 0, sd = sqrt(sigma2))
x_grid <- (1:999)/1000

The \((\tau, l)\) pairs will be \((1, 0.1), (1, 0.5), (5, 0.5)\).

taus <- c(1, 1, 5)
ls <- c(0.1, 0.5, 0.5)

Given the squared exponential covariance function, the ‘cov_mat’ function returns the covariance matrix between two input vectors.

cov_mat <- function(x, x_p, tau, l) {
  tau^2*exp(-(outer(x, x_p, "-")^2)/(l^2))
}

The following ‘post_gp’ is the core function for this question. Three covariance matrices, \(K(x,x)\), \(K(x,\tilde x)\), and \(K(\tilde x,\tilde x)\), are computed for the given \(\tau\) and \(l\) values. The posterior of \(\sigma^2\) is sampled by grid sampling, with the grid for ‘sigma2’ set between 0.001 and 0.1.

The posterior of \(\sigma^2\) is

\[\begin{aligned} p(\sigma^2 | y) &\propto p(y|\sigma^2)p(\sigma^2) \\ &\propto|K(x,x) + \sigma^2I_n|^{-1/2}\exp\biggl\{-\frac{1}{2}y^T\bigl(K(x,x) + \sigma^2I_n \bigr)^{-1}y \biggr\}\cdot\sigma^{-2}. \end{aligned}\] If we evaluate this pdf directly, the determinant \(|K(x,x) + \sigma^2I_n|\) overflows numerically, so we work with the log posterior and a log determinant: \[\begin{aligned} \log p(\sigma^2 | y) = -\frac{1}{2}\log|K(x,x) + \sigma^2I_n| -\frac{1}{2}y^T\bigl(K(x,x) + \sigma^2I_n \bigr)^{-1}y -\log\sigma^2 + C, \end{aligned}\] where \(C\) is a constant. The ‘log_posterior_sigma2’ vector holds the log posterior at each ‘sigma2_grid’ value. Applying ‘exp’ directly to ‘log_posterior_sigma2’ would return 0 or \(\infty\), so we subtract its maximum before exponentiating, giving ‘posterior_sigma2’. After normalizing this vector, we sample one element of ‘sigma2_grid’ using ‘posterior_sigma2’ as the probability vector.
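The grid-sampling step can be illustrated in isolation (a sketch: a Gamma(3, 10) density stands in for the \(\sigma^2\) posterior, and the grid endpoints are arbitrary):

```r
# Grid sampling: evaluate a log density on a grid, subtract the max before
# exponentiating (to avoid overflow/underflow), normalize, then draw
set.seed(1)
grid <- seq(0.01, 1, length.out = 200)
logp <- dgamma(grid, shape = 3, rate = 10, log = TRUE)  # stand-in log posterior
p <- exp(logp - max(logp))
p <- p / sum(p)
draws <- sample(grid, 5000, replace = TRUE, prob = p)
mean(draws)   # close to the Gamma(3, 10) mean of 0.3
```

Subtracting the maximum changes the unnormalized weights only by a common factor, which the normalization removes.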

Sampling posterior ‘mu’ is straightforward, because it can be derived from (a) and properties of conditional multivariate normal distribution that \[\begin{aligned} p(\tilde\mu | y,\sigma^2) &\sim \textsf{N}_{m}\Bigl( \mathbb E[\tilde\mu | y,\sigma^2], \text{Cov}[\tilde\mu | y,\sigma^2] \Bigr), \\ \mathbb E[\tilde\mu | y,\sigma^2] &= K(\tilde x,x)\bigl(K(x,x) + \sigma^2I_n \bigr)^{-1}y, \\ \text{Cov}[\tilde\mu | y,\sigma^2] &= K(\tilde x,\tilde x) - K(\tilde x,x)\bigl(K(x,x) + \sigma^2I_n \bigr)^{-1}K(x,\tilde x). \end{aligned}\] We can sample ‘mu’ vector by ‘mvrnorm’ function and using the above mean vector and covariance matrix.

Repeat this ‘n_iter’ times and remove the first ‘burn_in’ samples. For each ‘x_grid’ value, ‘mu_posterior_mean’ stores the posterior mean of \(\mu\), while ‘mu_posterior_lower’ and ‘mu_posterior_upper’ store the 0.025 and 0.975 posterior quantiles, respectively.

post_gp <- function(tau, l, y, x, x_grid) {
  n <- length(y)   # sample size (avoid relying on the global 'n')
  
  # covariance matrices
  K <- cov_mat(x, x, tau, l)
  K_xgrid <- cov_mat(x_grid, x_grid, tau, l)
  K_x_xgrid <- cov_mat(x, x_grid, tau, l)
  
  
  # grid for sigma2
  sigma2_grid <- seq(0.001, 0.1, length.out = 100)

  
  # Storage for samples
  n_iter <- 100
  burn_in <- 10
  sigma2_samples <- numeric(n_iter)
  mu_samples <- matrix(0, ncol = length(x_grid), nrow = n_iter)
  
  

  # log posterior of sigma2 over the grid; it does not depend on any other
  # sampled quantity, so compute it once before the loop
  log_posterior_sigma2 <- sapply(sigma2_grid, function(s2) {
    K_inv <- solve(K + s2 * diag(n))
    log_det_K <- determinant(K + s2 * diag(n), logarithm = TRUE)$modulus[1]
    quad_form <- t(y) %*% K_inv %*% y
    
    return(-0.5*log_det_K - 0.5*quad_form - log(s2))
  } )
  
  # Normalize after subtracting the maximum to avoid overflow
  posterior_sigma2 <- exp(log_posterior_sigma2 - max(log_posterior_sigma2) )
  posterior_sigma2 <- posterior_sigma2 / sum(posterior_sigma2)
  
  for (iter in 1:n_iter) {
    
    # Sample sigma2 using grid sampling
    sigma2 <- sample(sigma2_grid, size = 1, prob = posterior_sigma2)
    sigma2_samples[iter] <- sigma2
    
    
    # Sample mu 
    K_inv <- solve(K + sigma2 * diag(n)) # n by n
    post_mu_cov <- K_xgrid - t(K_x_xgrid) %*% K_inv %*% K_x_xgrid # 999 by 999
    post_mu_mean <- t(K_x_xgrid) %*% K_inv %*% y #length 999
    mu_samples[iter, ] <- mvrnorm(1, post_mu_mean, post_mu_cov)
    
  }
  
  # burn-in
  sigma2_samples <- sigma2_samples[-(1:burn_in)]
  mu_samples <- mu_samples[-(1:burn_in), ]
  
  # posterior mean and CI
  mu_posterior_mean  <- apply(mu_samples, 2, mean)
  mu_posterior_lower <- apply(mu_samples, 2, quantile, probs = 0.025)
  mu_posterior_upper <- apply(mu_samples, 2, quantile, probs = 0.975)
  
  list(mean = mu_posterior_mean, lower = mu_posterior_lower, upper = mu_posterior_upper)
}

Each iteration involves inversions and draws with \(n \times n\) and \(999 \times 999\) matrices, which is far more costly than the computations in (a), so the number of iterations is inevitably set small compared to the other questions.

After defining the ‘post_gp’ function, we plot the results for all three cases: \((\tau, l) = (1, 0.1), (1, 0.5), (5, 0.5)\).

# plot the result
for (i in 1:length(taus)) {
  tau <- taus[i]
  l <- ls[i]
  
  posterior_result <- post_gp(tau, l, y, x, x_grid)
  
  plot(x_grid, mu(x_grid), type = 'l', col = 'black', lwd = 2,
       ylim = range(posterior_result$lower, posterior_result$upper),
       xlab = 'x', ylab = 'mu(x)', 
       main = paste('GP Posterior with tau =', tau, ', l =', l), cex.main = 1)
  lines(x_grid, posterior_result$mean, col = 'blue', lwd = 1)
  lines(x_grid, posterior_result$lower, col = 'red', lwd = 1, lty = 2)
  lines(x_grid, posterior_result$upper, col = 'red', lwd = 1, lty = 2)
  legend('topleft', legend = c('True Function', 'Posterior Mean', '95% CI'),
         col = c('black', 'blue', 'red'), lty = c(1, 1, 2), lwd = 2, cex = 0.7)
  
}

Recall the squared exponential covariance function \[\begin{aligned} k(x_i,x_j) = \tau^2\exp\biggl\{ -\frac{(x_i-x_j)^2}{l^2}\biggr\}. \end{aligned}\] As \(l\) becomes larger, the posterior mean and the corresponding quantile curves become smoother. This is because, for fixed \(\tau\), increasing \(l\) increases the correlation between the function values at any two inputs. Increasing \(\tau\) with \(l\) fixed scales up the magnitude of the covariance (and hence the prior variance of \(\mu\)) while leaving the correlation structure unchanged.
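This can be quantified directly: the correlation implied by the kernel at a fixed distance \(d\) depends only on \(l\), since \(\tau^2\) cancels (a small sketch with an arbitrary distance \(d = 0.1\)):

```r
# corr(mu(x), mu(x')) = k(x, x') / tau^2 = exp(-d^2 / l^2) at distance d
d <- 0.1
corr_at <- function(l) exp(-d^2 / l^2)
c(corr_at(0.1), corr_at(0.5))   # exp(-1) vs exp(-0.04): larger l, higher correlation
```

At distance 0.1, the correlation rises from about 0.37 for \(l = 0.1\) to about 0.96 for \(l = 0.5\), which matches the visibly smoother curves in the \(l = 0.5\) plots.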