Bayesian Statistics HW6
Q1. Consider the following Bayesian hierarchical mixture model. Let \(\mathbf{Y} = \{Y_i\}_{i=1}^n\) denote a set of observations that are grouped into \(J\) clusters following different Gaussian distributions, and let \(\mathbf{Z} = \{Z_i\}_{i=1}^n\) denote a set of latent variables for allocating each observation to a cluster such that \(Z_i = j\) if and only if the \(i\)th observation is allocated to the \(j\)th cluster, for \(j = 1, \dots, J\). The Bayesian hierarchical mixture model is expressed as \[ \begin{aligned} Y_i \mid (\mu, \sigma^2, Z) &\overset{\text{ind}}{\sim} \mathcal N(\mu_{Z_i}, \sigma^2), \quad i = 1, \dots, n, \\ Z_i \mid \mathbf{V} &\overset{\text{iid}}{\sim} \prod_{j=1}^J \pi_j(\mathbf{V})^{\delta_j(Z_i)}, \quad i = 1, \dots, n, \\ V_j \mid \alpha &\overset{\text{iid}}{\sim} \text{Beta}(1, \alpha), \quad j = 1, \dots, J-1, \quad V_J = 1, \\ \mu_j \mid \sigma^2 &\overset{\text{iid}}{\sim} \mathcal N(0, \kappa \sigma^2), \quad j = 1, \dots, J, \\ \sigma^2 &\sim \text{Inv-Gamma}(\gamma, \beta),\end{aligned}\] where \(\mu = (\mu_1, \dots, \mu_J)\) denotes a set of cluster-specific means of a normal distribution, \(\sigma^2\) denotes a common variance across clusters, \(\mathbf{V} = \{V_j\}_{j=1}^{J-1}\) denotes a set of random variables from a so-called stick-breaking process with \(V_J = 1\), so that \[ \pi_j(\mathbf{V}) = V_j \prod_{l<j} (1 - V_l)\] becomes a mixture probability weight, \(\delta_j(Z_i)\) is a delta function such that a probability measure is concentrated at \(Z_i = j\), and \((\alpha, \kappa, \gamma, \beta)\) are known constants. Our goal is to construct the Gibbs sampler that simulates the target posterior distribution of the model, \(p(\mathbf{Z}, \mathbf{V}, \mu, \sigma^2 \mid \mathbf{Y})\).
(*) Obtain the full posterior.
\[\begin{aligned} p(\mathbf{Z}, \mathbf{V}, \mu, \sigma^2 \mid \mathbf{Y}) &\propto p(\mathbf Y|\mathbf{Z}, \mathbf{V}, \mu, \sigma^2)p(\mathbf{Z}, \mathbf{V}, \mu, \sigma^2) \\ &\propto p(\mathbf Y|\mathbf{Z}, \mu, \sigma^2)p(\mathbf Z | \mathbf V)p(\mathbf{V}, \mu, \sigma^2) \\ &\propto p(\mathbf Y|\mathbf{Z}, \mu, \sigma^2)p(\mathbf Z | \mathbf V)p(\mathbf V)p(\mu|\sigma^2)p(\sigma^2) \end{aligned}\]
(a) Construct a sampling step for \(\mathbf{Z}\) in the Gibbs sampler.
1). Conditional posterior of \(Z_i\)
Recall that \[ Z_i \mid \mathbf{V} \;\sim\; \prod_{j=1}^J \pi_j(\mathbf{V})^{\delta_j(Z_i)}, \] where \[ \pi_j(\mathbf{V}) \;=\; V_j \prod_{l<j} (1 - V_l), \] and \[ Y_i \mid (Z_i = j, \mu_j, \sigma^2) \;\sim\; N(\mu_j, \sigma^2). \]
Given all other parameters \(\mu, \sigma^2, \mathbf{V}\) and the observed data \(Y_i\), the posterior probability for \(Z_i\) taking a particular cluster value \(j\) is proportional to the prior \(\pi_j(\mathbf{V})\) times the likelihood of \(Y_i\) under that cluster’s parameters:
\[ p(Z_i = j \mid \mathbf{Y}, \mu, \sigma^2, \mathbf{V}) \;\propto\; \pi_j(\mathbf{V}) \,\cdot\, f\bigl(Y_i \mid \mu_j, \sigma^2\bigr), \] where \(f(\cdot)\) is the density of the normal distribution \(\mathcal N(\mu_j, \sigma^2)\). Explicitly,
\[ f\bigl(Y_i \mid \mu_j, \sigma^2\bigr) \;=\; \frac{1}{\sqrt{2\pi\,\sigma^2}} \exp\!\Bigl(\!-\frac{(Y_i - \mu_j)^2}{2\,\sigma^2}\Bigr). \]
2). Normalizing to obtain the full conditional
To obtain a proper probability, we normalize over \(j = 1, \dots, J\):
\[ p(Z_i = j \mid \mathbf{Y}, \mu, \sigma^2, \mathbf{V}) \;=\; \frac{\pi_j(\mathbf{V})\, f\bigl(Y_i \mid \mu_j, \sigma^2\bigr)} {\sum_{k=1}^J \pi_k(\mathbf{V})\, f\bigl(Y_i \mid \mu_k, \sigma^2\bigr)}. \]
3). Gibbs sampling step
In a Gibbs sampler, we update each \(Z_i\) in turn (or all \(Z_i\) in a block) by:
i). Compute unnormalized weights for \(j = 1, \dots, J\): \[ w_j \;=\; \pi_j(\mathbf{V}) \times f\bigl(Y_i \mid \mu_j, \sigma^2\bigr). \]
ii). Normalize to get the probabilities: \[ p(Z_i = j \mid \dots) \;=\; \frac{w_j}{\sum_{k=1}^J w_k}. \]
iii). Sample \(Z_i\) from the discrete distribution with the above probabilities.
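As a sketch, the three steps above could be implemented as follows (NumPy assumed; cluster labels are taken 0-based, so cluster \(j\) corresponds to index `j-1`, and `rng` is a `numpy.random.Generator`):

```python
import numpy as np

def stick_breaking_weights(V):
    """Turn stick-breaking variables V (with V[-1] = 1) into weights
    pi_j = V_j * prod_{l<j} (1 - V_l)."""
    V = np.asarray(V, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V[:-1])))
    return V * remaining

def sample_Z(Y, mu, sigma2, V, rng):
    """One Gibbs update of all cluster labels Z (0-based)."""
    pi = stick_breaking_weights(V)
    # Step i): unnormalized log-weights log pi_j - (Y_i - mu_j)^2 / (2 sigma^2);
    # the common factor 1/sqrt(2 pi sigma^2) cancels in the normalization.
    logw = np.log(pi)[None, :] - 0.5 * (Y[:, None] - mu[None, :]) ** 2 / sigma2
    logw -= logw.max(axis=1, keepdims=True)       # guard against underflow
    probs = np.exp(logw)
    probs /= probs.sum(axis=1, keepdims=True)     # step ii): normalize over j
    # Step iii): inverse-CDF draw from each row's categorical distribution
    u = rng.random(len(Y))
    return (np.cumsum(probs, axis=1) < u[:, None]).sum(axis=1)
```

Working in log scale before normalizing avoids numerical underflow when \(Y_i\) is far from every \(\mu_j\).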
(b) Construct a sampling step for \(\mathbf{V}\) in the Gibbs sampler.
1). Prior of \(V_j\)
For \(j = 1, \dots, J-1\):
\[ V_j \mid \alpha \sim \text{Beta}(1, \alpha), \]
and \(V_J = 1\) is fixed. The Beta density is as follows:
\[ p(V_j) = \frac{\Gamma(1 + \alpha)}{\Gamma(1)\Gamma(\alpha)} V_j^{0} (1 - V_j)^{\alpha - 1} = \alpha (1 - V_j)^{\alpha - 1}. \]
2). Likelihood Contribution
The stick-breaking process defines the mixture weight for cluster \(j\):
\[ \pi_j(\mathbf{V}) = V_j \prod_{l<j} (1 - V_l). \]
Given the cluster assignment \(Z_i\), the likelihood contribution related to \(V_j\) depends on how many observations are assigned to cluster \(j\) and how many to clusters after \(j\).
For each \(i\):

- If \(Z_i = j\), the weight \(\pi_j(\mathbf{V}) = V_j \prod_{l<j} (1 - V_l)\) contributes a factor of \(V_j\).
- If \(Z_i = k\) for some \(k > j\), the weight \(\pi_k(\mathbf{V})\) contains the factor \((1 - V_j)\).

Thus, the likelihood contribution of \(V_j\) is \(V_j\) raised to the number of observations assigned to cluster \(j\), times \((1 - V_j)\) raised to the number of observations assigned to clusters after \(j\).
3). Sufficient Statistics
Define:

- \(n_j = \sum_{i=1}^n \delta_j(Z_i)\): the number of observations assigned to cluster \(j\);
- \(m_j = \sum_{i=1}^n \sum_{k>j} \delta_k(Z_i)\): the number of observations assigned to clusters after \(j\).
4). Posterior Derivation
Combining the prior with the likelihood, and keeping only the factors of \(p(\mathbf Z \mid \mathbf V)\) that involve \(V_j\):
\[ \begin{aligned} p(V_j \mid \mathbf{Z}) &\propto p(V_j) \times p(\mathbf Z \mid \mathbf V) \\ &\propto \left[ \alpha (1 - V_j)^{\alpha - 1} \right] \times V_j^{n_j} (1 - V_j)^{m_j} \\ &\propto V_j^{n_j} (1 - V_j)^{\alpha - 1 + m_j}, \end{aligned} \]
which is the Beta kernel. Thus,
\[ V_j \mid \mathbf{Z} \sim \text{Beta}(1 + n_j, \alpha + m_j). \]
5). Summary: Gibbs Sampling Step
For each \(j = 1, \dots, J-1\):
- Compute:
- \(n_j = \#\{i : Z_i = j\}\)
- \(m_j = \#\{i : Z_i > j\}\)
- Sample:
\[ V_j \sim \text{Beta}(1 + n_j, \alpha + m_j). \]
Note that \(V_J = 1\) is deterministic.
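A minimal sketch of this step (NumPy assumed; labels `Z` are 0-based, `rng` is a `numpy.random.Generator`):

```python
import numpy as np

def sample_V(Z, J, alpha, rng):
    """One Gibbs update of the stick-breaking variables (labels 0-based)."""
    Z = np.asarray(Z)
    V = np.ones(J)                       # V[J-1] = 1 stays deterministic
    for j in range(J - 1):
        n_j = int(np.sum(Z == j))        # observations in cluster j
        m_j = int(np.sum(Z > j))         # observations in clusters after j
        V[j] = rng.beta(1 + n_j, alpha + m_j)
    return V
```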
(c) Construct a sampling step for \(\mu\) in the Gibbs sampler.
1). Prior of \(\mu_j\)
Each \(\mu_j\) has
\[ \mu_j \mid \sigma^2 \sim N(0, \kappa \sigma^2), \quad j = 1, \dots, J. \] That is, \[p(\mu_j | \sigma^2) \propto \exp\left( -\frac{1}{2\kappa \sigma^2} \mu_j^2 \right).\]
2). Likelihood contribution for \(\mu_j\)
Given the cluster assignment \(Z_i\), the observations in cluster \(j\) are:
\[ Y_i \mid (Z_i = j, \mu_j, \sigma^2) \sim \mathcal N(\mu_j, \sigma^2). \]
For all observations assigned to cluster \(j\) (i.e., \(Z_i = j\)), the joint likelihood is:
\[ \prod_{i: Z_i = j} \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(Y_i - \mu_j)^2}{2\sigma^2} \right). \]
Equivalently, as a function of \(\mu_j\):
\[ \prod_{i: Z_i = j} p(Y_i \mid Z_i = j, \mu_j, \sigma^2) \propto \exp\left( -\frac{1}{2\sigma^2} \sum_{i: Z_i = j} (Y_i - \mu_j)^2 \right). \]
3). Combine prior and likelihood
Multiplying prior and likelihood:
\[\begin{aligned} \log p(\mu_j \mid \mathbf{Y}, \mathbf{Z}, \sigma^2) &\propto -\frac{1}{2\kappa \sigma^2} \mu_j^2 - \frac{1}{2\sigma^2} \sum_{i: Z_i = j} (Y_i - \mu_j)^2 \\ &= -\frac{1}{2\kappa \sigma^2} \mu_j^2 - \frac{1}{2\sigma^2} \left( \sum_{i: Z_i = j} Y_i^2 - 2\mu_j \sum_{i: Z_i = j} Y_i + n_j \mu_j^2 \right) \\ &\propto -\frac{1}{2\sigma^2} \left( \frac{1}{\kappa} + n_j \right) \mu_j^2 + \frac{1}{\sigma^2} \left( \sum_{i: Z_i = j} Y_i \right) \mu_j. \end{aligned}\]
where \(n_j = \sum_{i=1}^n \delta_j(Z_i)\) is the number of observations in cluster \(j\).
4). Posterior distribution
Define:
\[ S_j = \sum_{i: Z_i = j} Y_i. \]
Then:
\[ \log p(\mu_j \mid \mathbf{Y}, \mathbf{Z}, \sigma^2) \propto -\frac{1}{2\sigma^2} \left( \frac{1}{\kappa} + n_j \right) \left( \mu_j^2 - 2 \frac{S_j}{\frac{1}{\kappa} + n_j} \mu_j \right). \]
Completing the square:
\[ \propto -\frac{1}{2\sigma^2} \left( \frac{1}{\kappa} + n_j \right) \left( \mu_j - \frac{S_j}{\frac{1}{\kappa} + n_j} \right)^2. \]
Thus, the conditional posterior is:
\[ \mu_j \mid \mathbf{Y}, \mathbf{Z}, \sigma^2 \sim N\left( \frac{S_j}{\frac{1}{\kappa} + n_j}, \; \frac{\sigma^2}{\frac{1}{\kappa} + n_j} \right). \]
5). Gibbs Sampling Step
For each \(j = 1, \dots, J\):
i). Compute:
- \(n_j = \#\{i : Z_i = j\}\),
- \(S_j = \sum_{i: Z_i = j} Y_i\).
ii). Sample:
\[ \mu_j \sim N\left( \frac{S_j}{\frac{1}{\kappa} + n_j}, \; \frac{\sigma^2}{\frac{1}{\kappa} + n_j} \right). \]
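A sketch of this step (NumPy assumed; labels `Z` are 0-based). Note that an empty cluster (\(n_j = 0\), \(S_j = 0\)) automatically falls back to a draw from the prior \(\mathcal N(0, \kappa\sigma^2)\):

```python
import numpy as np

def sample_mu(Y, Z, sigma2, kappa, J, rng):
    """One Gibbs update of the cluster means (labels 0-based)."""
    mu = np.empty(J)
    for j in range(J):
        in_j = (Z == j)
        n_j = in_j.sum()                  # n_j = #{i : Z_i = j}
        S_j = Y[in_j].sum()               # S_j = sum of Y_i in cluster j
        prec = 1.0 / kappa + n_j          # (1/kappa + n_j)
        # mu_j ~ N(S_j / (1/kappa + n_j), sigma^2 / (1/kappa + n_j))
        mu[j] = rng.normal(S_j / prec, np.sqrt(sigma2 / prec))
    return mu
```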
(d) Construct a sampling step for \(\sigma^2\) in the Gibbs sampler.
1) Prior for \(\sigma^2\)
The prior is an Inverse-Gamma distribution:
\[ \sigma^2 \sim \text{Inv-Gamma}(\gamma, \beta). \]
That is, the density:
\[ p(\sigma^2) \propto (\sigma^2)^{-(\gamma + 1)} \exp\left( -\frac{\beta}{\sigma^2} \right). \]
2) Likelihood contributions involving \(\sigma^2\)
There are two parts in the model where \(\sigma^2\) appears:
i). Likelihood of \(Y_i\):
For all observations:
\[ Y_i \mid (Z_i = j, \mu_j, \sigma^2) \sim N(\mu_j, \sigma^2). \]
This gives the likelihood contribution:
\[ \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(Y_i - \mu_{Z_i})^2}{2\sigma^2} \right). \]
ii). Prior of \(\mu_j\):
Each \(\mu_j\) has:
\[ \mu_j \mid \sigma^2 \sim N(0, \kappa \sigma^2). \]
This gives:
\[ \prod_{j=1}^J \frac{1}{\sqrt{2\pi \kappa \sigma^2}} \exp\left( -\frac{\mu_j^2}{2\kappa \sigma^2} \right). \]
3) Joint likelihood in terms of \(\sigma^2\)
Combining all terms:
- From the likelihood of \(Y_i\):
The log-likelihood contribution is:
\[ -\frac{n}{2} \log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (Y_i - \mu_{Z_i})^2. \]
- From the prior of \(\mu_j\):
There are \(J\) terms:
\[ -\frac{J}{2} \log(\sigma^2) - \frac{1}{2\kappa \sigma^2} \sum_{j=1}^J \mu_j^2. \]
- From the Inverse-Gamma prior:
\[ -(\gamma + 1) \log(\sigma^2) - \frac{\beta}{\sigma^2}. \]
The total log posterior of \(\sigma^2\) is proportional to:
\[ -\left( \gamma + 1 + \frac{n}{2} + \frac{J}{2} \right) \log(\sigma^2)- \frac{1}{2\sigma^2} \left( \sum_{i=1}^n (Y_i - \mu_{Z_i})^2 + \frac{1}{\kappa} \sum_{j=1}^J \mu_j^2 + 2\beta \right). \]
4) Recognize posterior form
This is the kernel of an Inverse-Gamma distribution. Specifically, the posterior is:
\[ \sigma^2 \mid \mathbf{Y}, \mathbf{Z}, \mu \sim \text{Inv-Gamma}(\gamma_{\text{post}}, \beta_{\text{post}}), \]
where:
\[ \gamma_{\text{post}} = \gamma + \frac{n}{2} + \frac{J}{2}, \]
\[ \beta_{\text{post}} = \beta + \frac{1}{2} \sum_{i=1}^n (Y_i - \mu_{Z_i})^2 + \frac{1}{2\kappa} \sum_{j=1}^J \mu_j^2. \]
5) Gibbs Sampling Step
- Compute:
- Residual sum of squares:
\[ S_1 = \sum_{i=1}^n (Y_i - \mu_{Z_i})^2. \]
- Prior sum:
\[ S_2 = \sum_{j=1}^J \mu_j^2. \]
- Update hyperparameters:
\[ \gamma_{\text{post}} = \gamma + \frac{n}{2} + \frac{J}{2}, \]
\[ \beta_{\text{post}} = \beta + \frac{1}{2} S_1 + \frac{1}{2\kappa} S_2. \]
- Sample:
\[ \sigma^2 \sim \text{Inv-Gamma}(\gamma_{\text{post}}, \beta_{\text{post}}). \]
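A sketch of this step (NumPy assumed; `mu[Z]` relies on 0-based labels). Since NumPy has no inverse-gamma sampler, we use the fact that if \(X \sim \text{Gamma}(a, \text{rate}=b)\) then \(1/X \sim \text{Inv-Gamma}(a, b)\):

```python
import numpy as np

def sample_sigma2(Y, Z, mu, kappa, gamma, beta, rng):
    """One Gibbs update of the common variance sigma^2."""
    n, J = len(Y), len(mu)
    S1 = np.sum((Y - mu[Z]) ** 2)                  # residual sum of squares
    S2 = np.sum(mu ** 2)                           # prior sum of squares
    shape = gamma + 0.5 * n + 0.5 * J              # gamma_post
    rate = beta + 0.5 * S1 + S2 / (2.0 * kappa)    # beta_post
    # rng.gamma takes a *scale* parameter, so pass 1/rate and invert the draw
    return 1.0 / rng.gamma(shape, 1.0 / rate)
```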