Statistical Computing HW7
Q1. Consider the street gambling called ‘Yabawi’ (‘야바위’ in Korean). Suppose \(K \ge 3\) cups are placed upside down on a table and one ball is in one of the cups. The dealer repeats the process of randomly choosing two cups and swapping their positions. We assign a state to each location of the ball, i.e., state i means the ball is in the ith cup. The resulting state space is \(\{1,2,\dots, K\} = \Omega\).
i) This process is a Markov chain. Explain why.
Let \(X_t\) denote the state of the ball at time \(t\), where \(i_t \in \Omega\) for all \(t=0,1,2,\dots\). Write \(A_{t+1,t}\) for the event \((X_{t+1} = i_{t+1}|X_t = i_t)\) (the state at time \(t+1\) is \(i_{t+1}\) given that the state at time \(t\) is \(i_t\)), and write \(B_{t-1}\) for the event \((X_{t-1} = i_{t-1},\dots, X_0=i_0)\) (the state at time \(t-1\) is \(i_{t-1}\), …, and the state at time 0 is \(i_0\)). If this process is a Markov chain (MC), the following should hold: \[\mathbb P(X_{t+1} = i_{t+1}|X_t = i_t,\dots,X_0=i_0) = \mathbb P(X_{t+1} = i_{t+1}|X_t = i_t), \text{ that is,}\\ \mathbb P(A_{t+1,t}|B_{t-1}) = \mathbb P(A_{t+1,t}).\]
(1) \(\mathbb P(X_{t+1} = i_{t+1} | X_t=i_t)\)
Since the initial position is chosen uniformly at random, the probability that the initial (t = 0) state equals any given cup in \(\{1,2,\dots, K\}\) is 1/K. Given that the state at time \(t\), \(X_t\), is \(i_t\) (\(t\) a non-negative integer), the next state \(X_{t+1}\) equals \(i_t\) exactly when the swapped pair does not include the cup containing the ball, which has probability \[\frac{\binom{K-1}{2}}{\binom{K}{2}} = \frac{K-2}{K}.\]
Under the same condition, the state \(X_{t+1}\) equals \(i_{t+1} \neq i_t\) only when the dealer swaps the \(i_t\)th and \(i_{t+1}\)th cups, which is a single pair among the \(\binom{K}{2}\) equally likely pairs. So, the probability \(\mathbb P(X_{t+1} = i_{t+1} | X_t=i_t)\) for \(i_{t+1} \neq i_t\) is \(\frac{1}{\binom{K}{2}} = \frac{2}{K(K-1)}\). Expression ① summarizes \(\mathbb P(X_{t+1} = i_{t+1} | X_t=i_t)\).
\[\mathbb P(X_{t+1} = i_{t+1} | X_t=i_t) = \begin{cases} \frac{2}{K(K-1)}, \quad i_{t+1} \neq i_t. \tag{①} \\[3ex] \frac{K-2}{K}, \quad i_{t+1} = i_t. \end{cases}\]
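The probabilities in ① can be sanity-checked by direct simulation. The sketch below is not part of the assignment; K = 5 and the number of swaps are illustrative choices.

```r
# Monte Carlo check of the stay probability (K-2)/K from expression (1).
set.seed(42)
K <- 5
n_swaps <- 10^5
pos <- 1                    # ball starts under cup 1
stays <- 0                  # count steps where the ball does not move
for (s in 1:n_swaps) {
  pair <- sample(1:K, 2)    # dealer picks two distinct cups uniformly
  if (pos == pair[1]) {
    pos <- pair[2]
  } else if (pos == pair[2]) {
    pos <- pair[1]
  } else {
    stays <- stays + 1
  }
}
stays / n_swaps             # should be close to (K-2)/K = 0.6
```

The empirical stay frequency matches \((K-2)/K\) closely, consistent with ①.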
(2) \(\mathbb P(A_{t+1,t} \cap B_{t-1}) = \mathbb P(A_{t+1,t}) \mathbb P(B_{t-1})\)
The distribution of the state \(X_{t+1}\) is determined solely by the state at time \(t\), not by the states at the earlier times (\(t-1\) down to 0). For example, when the ball is in the \(i\)th cup at time \(t\), the probability that the ball is in the \(j\)th cup at time \(t+1\) depends only on whether \(i = j\) or not (①). So, once the state of the ball at time \(t\) is determined, the states of the ball at times \(0,1,\dots,t-1\) do not affect the probability of the state at time \(t+1\). This means \(A_{t+1,t}\) and \(B_{t-1}\) are independent. That is, \(\mathbb P(A_{t+1,t} \cap B_{t-1}) = \mathbb P(A_{t+1,t}) \mathbb P(B_{t-1})\).
(3) \(\mathbb P(B_{t-1}) > 0\)
The initial state \(X_0\) is chosen uniformly at random, so \(\mathbb P(X_0 = i_0) = 1/K\); denote this probability by \(b_0\). Once the initial state is determined, the probability of the state at time 1, \(\mathbb P(X_1=i_1|X_0 = i_0)\), depends only on the previous state \(i_0\) due to ① and the logic of (2). Denote \(\mathbb P(X_1 = i_1|X_0 = i_0)\) by \(b_1\); it equals either \(\frac{2}{K(K-1)}\) or \(\frac{K-2}{K}\). After \(X_1\) is determined, the probability of the state at time 2, \(\mathbb P(X_2 = i_2| X_1=i_1, X_0 = i_0)\), also depends only on the previous state \(i_1\) for the same reason. Denote \(\mathbb P(X_2 = i_2| X_1=i_1, X_0 = i_0) = \mathbb P(X_2 = i_2| X_1=i_1)\) by \(b_2\), again either \(\frac{2}{K(K-1)}\) or \(\frac{K-2}{K}\). Repeating this rule, \(\mathbb P(X_{t-1}=i_{t-1}|X_{t-2} = i_{t-2},\dots,X_0 = i_0) = \mathbb P(X_{t-1}=i_{t-1}|X_{t-2} = i_{t-2})\) can be written as \(b_{t-1}\), which is likewise either \(\frac{2}{K(K-1)}\) or \(\frac{K-2}{K}\).
By the chain rule of conditional probability, the following holds: \[\begin{aligned} \mathbb P(X_{t-1} = i_{t-1} \cap \dots \cap X_0 = i_0) &= \mathbb P(X_{t-1} = i_{t-1}|X_{t-2} = i_{t-2}, \dots, X_0 = i_0) \cdot \mathbb P(X_{t-2} = i_{t-2}, \dots, X_0 = i_0) \\ &= \mathbb P(X_{t-1} = i_{t-1}|X_{t-2} = i_{t-2}, \dots, X_0 = i_0) \cdot \mathbb P(X_{t-2} = i_{t-2} | X_{t-3} = i_{t-3}, \dots, X_0 = i_0) \cdot \mathbb P(X_{t-3} = i_{t-3}, \dots, X_0 = i_0) \\ & \qquad \qquad \vdots \\ &= \mathbb P(X_{t-1} = i_{t-1}|X_{t-2} = i_{t-2}, \dots, X_0 = i_0) \cdots \mathbb P(X_1=i_1|X_0=i_0)\cdot\mathbb P(X_0 = i_0) \\ &:= b_{t-1}\cdots b_1\cdot b_0, \end{aligned}\] where \(b_s = \mathbb P(X_s = i_s|X_{s-1} = i_{s-1},\dots, X_0=i_0)\) for \(s\ge1\), and \(b_0 = \mathbb P(X_0 = i_0)\). Each \(b_s\), \(s = 0,1,\dots, t-1\), is greater than 0 (since \(K \ge 3\), the values \(\frac{1}{K}, \frac{2}{K(K-1)}, \frac{K-2}{K}\) are all positive), so the probability \(\mathbb P(B_{t-1}) = b_{t-1}\cdots b_1\cdot b_0\) is greater than 0.
(4) conclusion, \(\mathbb P(X_{t+1} = i_{t+1}|X_t=i_t,\dots, X_0 = i_0) =\mathbb P(X_{t+1} = i_{t+1}|X_t=i_t)\)
By the definition of conditional probability, the following expression holds: \(\mathbb P(A_{t+1,t}\cap B_{t-1}) = \mathbb P(A_{t+1,t}|B_{t-1})\, \mathbb P(B_{t-1})\).
Since \(\mathbb P(A_{t+1,t}\cap B_{t-1}) = \mathbb P(A_{t+1,t})\, \mathbb P(B_{t-1})\) by (2) and \(\mathbb P(B_{t-1}) > 0\) by (3), we get \(\mathbb P(A_{t+1,t}|B_{t-1}) = \mathbb P(A_{t+1,t})\).
Thus, \(\mathbb P(X_{t+1} = i_{t+1}|X_t=i_t,\dots,X_0=i_0) = \mathbb P(X_{t+1} = i_{t+1}|X_t=i_t)\), and this process is a Markov Chain.
ii) Find the transition probability matrix of this Markov chain.
Let the transition probability matrix be \(\mathbf P\), whose elements are \([\mathbf P]_{i,j},\ \ i,j=1,\dots,K\). It is shown in ① of question i) that \([\mathbf P]_{i,j} = \frac{2}{K(K-1)}, \ (j \neq i)\) and \([\mathbf P]_{i,i} = \frac{K-2}{K}\). So, all of the diagonal entries of \(\mathbf P\) are \(\frac{K-2}{K}\), and all off-diagonal entries are \(\frac{2}{K(K-1)}\).
\[\mathbf P = \begin{bmatrix} \frac{K-2}{K} & \frac{2}{K(K-1)} & \dots & \frac{2}{K(K-1)} \\ \frac{2}{K(K-1)} & \frac{K-2}{K} & \dots & \frac{2}{K(K-1)} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{2}{K(K-1)} & \frac{2}{K(K-1)} & \dots & \frac{K-2}{K} \end{bmatrix}.\]
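As a quick check (not required by the problem), \(\mathbf P\) can be built in R for an illustrative K and each row verified to sum to 1, as every row of a transition matrix must:

```r
# Build the transition matrix for an illustrative K and check row sums.
K <- 4
P <- matrix(2 / (K * (K - 1)), nrow = K, ncol = K)  # off-diagonal entries
diag(P) <- (K - 2) / K                              # diagonal entries
rowSums(P)                                          # every row sums to 1
```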
iii) This Markov chain is irreducible. Explain why.
A Markov chain is irreducible if for every pair \(i,j \in \Omega\) there is some \(t\) with \([\mathbf P^t]_{i,j} = \mathbb P(X_t = j|X_0 = i) > 0\). All of the elements of the transition probability matrix are positive (since \(K \ge 3\), the values \(\frac{2}{K(K-1)}\) and \(\frac{K-2}{K}\) are greater than 0), so \([\mathbf P]_{i,j} = \mathbb P(X_1 = j|X_0 = i) > 0\) for all \(i,j \in \Omega\). To sum up, for all i and j, \(t = 1\) already satisfies \(\mathbb P(X_t = j|X_0 = i) > 0\). Thus, the Markov chain is irreducible.
iv) This Markov chain is aperiodic. Explain why.
For aperiodicity, it suffices to find some \(t > 0\) and some state \(j \in \Omega\) such that \(\mathbb P(X_t = j|X_0 = j) > 0\) and \(\mathbb P(X_{t+1} = j|X_0 = j) > 0\): the period of \(j\) must then divide two consecutive integers, so it equals 1.
In the previous questions, it is shown that \([\mathbf P^1]_{j,j} = \frac{K-2}{K} > 0, \ j = 1,2,\dots,K\). Then, \[\begin{aligned} \\ [\mathbf P^2]_{j,j} &= \sum_{i=1}^K [\mathbf P]_{j,i}[\mathbf P]_{i,j} \\ &= \Bigl([\mathbf P]_{j,j} \Bigr)^2 + \sum_{i \neq j}[\mathbf P]_{j,i}[\mathbf P]_{i,j} \\ &= \left(\frac{K-2}{K}\right)^2 + (K-1)\left(\frac{2}{K(K-1)} \right)^2. \end{aligned}\]
We have the following conditions: \[K\ge 3 \to K-1 > 0, \quad \left(\frac{K-2}{K}\right)^2 > 0, \text{ and } \left(\frac{2}{K(K-1)} \right)^2 > 0\]
So, \([\mathbf P^2]_{j,j} = \mathbb P(X_2=j|X_0=j) > 0\).
To sum up, there is \(t = 1\) such that \(\mathbb P(X_t = j| X_0 = j) > 0\) and \(\mathbb P(X_{t+1} = j| X_0 = j) > 0\) for some state \(j\) (actually it holds for all \(j\)). Thus, this Markov Chain is aperiodic.
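The closed form for \([\mathbf P^2]_{j,j}\) derived above can be confirmed by squaring \(\mathbf P\) numerically (K = 5 is an illustrative choice):

```r
# Compare the closed-form diagonal of P^2 with direct matrix multiplication.
K <- 5
P <- matrix(2 / (K * (K - 1)), K, K)
diag(P) <- (K - 2) / K
P2 <- P %*% P
closed_form <- ((K - 2) / K)^2 + (K - 1) * (2 / (K * (K - 1)))^2
c(P2[1, 1], closed_form)    # the two numbers agree
```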
v) This Markov chain is time reversible. Explain why.
(1) stationary probabilities \(\pi\)
Since this Markov chain is irreducible with a finite state space, it has a unique stationary distribution \(\pi = \{\pi_j \mid j = 1,2,\dots,K\}\), where \(\pi_j = \lim_{n \to \infty}\frac{1}{n}\sum_{t=1}^n I(X_t = j)\). So, there is a unique solution of \(\pi_j = \sum_{i=1}^K\pi_ip_{i,j}\) and \(\sum_{j=1}^K \pi_j = 1\), where \(p_{i,j} = [\mathbf P]_{i,j}\) and \(i,j \in \Omega\).
For a specific \(j\), these two conditions read \[\begin{cases} \pi_j = \frac{K-2}{K}\pi_j + \frac{2}{K(K-1)}\sum_{i \neq j}\pi_i, \tag{②} \\[3ex] 1 = \sum_{i=1}^K \pi_i. \end{cases}\]
The first line of ② becomes \(\frac{2}{K(K-1)}\sum_{i \neq j}\pi_i = \frac{2}{K} \pi_j\).
Inserting the second line of ② into the previous equation leads to \[\frac{2}{K(K-1)}(1-\pi_j) = \frac{2}{K}\pi_j.\]
Dividing both sides by the common factor \(2/K\), which is greater than 0, leads to \[\frac{1}{K-1}(1-\pi_j) = \pi_j.\]
\((K-1)\pi_j = 1-\pi_j\) and \(K\pi_j = 1\), so \(\pi_j = 1/K\).
Thus, the stationary probability \(\pi_j\) is \(1/K\) for all \(j = 1,\dots,K\).
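The analytic result \(\pi_j = 1/K\) can be cross-checked numerically, for example by extracting the left eigenvector of \(\mathbf P\) for eigenvalue 1 (K = 5 is an illustrative choice):

```r
# Numerical stationary distribution: the left eigenvector of P for eigenvalue 1.
K <- 5
P <- matrix(2 / (K * (K - 1)), K, K)
diag(P) <- (K - 2) / K
e <- eigen(t(P))                               # left eigenvectors of P
v <- Re(e$vectors[, which.max(Re(e$values))])  # eigenvector for eigenvalue 1
v / sum(v)                                     # normalizes to rep(1/K, K)
```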
(2) time reversible
An irreducible Markov Chain is time reversible if \(\pi_i p_{i,j} = \pi_j p_{j,i}\) for all \(i,j\).
\(\pi_j\) is \(1/K\) for all \(j = 1,\dots,K\), and \(p_{i,j} = \frac{2}{K(K-1)}\) for all \(i,j\) such that \(i \neq j\). For these i and j, \(\pi_i p_{i,j} = \frac{1}{K}\frac{2}{K(K-1)}\) and \(\pi_j p_{j,i} = \frac{1}{K}\frac{2}{K(K-1)}\), while for \(i = j\) the condition holds trivially. Since \(\pi_i p_{i,j} = \pi_j p_{j,i}\) for all i and j, this Markov chain is time reversible.
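Detailed balance can also be verified numerically: the matrix of probability flows \(\pi_i p_{i,j}\) should be symmetric. A small check with the illustrative choice K = 6:

```r
# Detailed balance check: pi_i * p_ij must equal pi_j * p_ji for all i, j.
K <- 6
P <- matrix(2 / (K * (K - 1)), K, K)
diag(P) <- (K - 2) / K
pi_vec <- rep(1 / K, K)        # the stationary distribution found above
flows <- pi_vec * P            # element (i, j) is pi_i * p_ij
max(abs(flows - t(flows)))     # 0 up to floating point error
```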
Q2. We want to generate a sample from the unnormalized density \[g(x)=\exp\{−5|x−5|^{1.5}\} + \exp \{−0.1|x+5|^{3.2}\}\]using the random walk Metropolis-Hastings algorithm with a normal proposal with standard deviation \(\sigma\).
i) Draw a graph of g to see its overall shape.
The function g(x) is a sum of two exponential terms, whose modes appear at x = 5 and x = -5. The interval for ‘x’ is chosen as [-15, 15]. The function ‘g’ in R is defined to output the value of g(x).
x <- seq(-15,15,0.01)
g <- function(x){
return(exp(-5*(abs(x-5))^1.5) + exp(-0.1*(abs(x+5))^3.2))
}
plot(x, g(x), main = 'plot of g(x)', type = 'l')
Question Q2 is equivalent to generating random numbers from the pdf \(f(x) = g(x)/B\), where \(B = \int g(x)\,dx\). The value of B, although difficult to obtain analytically, is estimated numerically to be about 4.295964. This B will be used afterwards.
B <- integrate(g, -15, 15)$value
B
## [1] 4.295964
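Since g decays very quickly in both tails, truncating the integral at [-15, 15] should cost essentially nothing. A quick check (widening the range to [-20, 20], an arbitrary choice) confirms this:

```r
# Check that truncating the normalizing integral at [-15, 15] is harmless.
g <- function(x) exp(-5 * abs(x - 5)^1.5) + exp(-0.1 * abs(x + 5)^3.2)
B_narrow <- integrate(g, -15, 15)$value
B_wide <- integrate(g, -20, 20)$value
c(B_narrow, B_wide)    # the two estimates agree closely
```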
ii) Starting from −15 as an initial value, generate an MCMC sample of size \(n = 10^5\) using \(\sigma = 1.25\). Look at the sequence to determine a suitable burn-in period. After a burn-in, draw a histogram (with fine enough bins), a trace plot (e.g., ts.plot), and an autocorrelation plot (e.g. acf).
A new function ‘generate_MCMC’ is defined to generate samples from the given \(f(x) = g(x)/B\). The only input of the function is the standard deviation ‘sd’. The initial state (‘xt’) is -15. At every iteration, this function generates a proposal x’ (‘xtemp’ in code) from the proposal density \[q(x'|x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x' - x)^2\right\},\] where x is ‘xt’ and \(\sigma\) is ‘sd’, and a number (‘u’) from the uniform distribution \(\text{Unif}(0,1)\). Since \(q(x'|x)=q(x|x')\) due to the square term \((x' - x)^2\), the acceptance probability \(\alpha\) is \[\begin{aligned} \alpha &= \min\left\{\frac{f(x')q(x|x')}{f(x)q(x'|x)}, 1\right\} \\ &= \min\left\{\frac{g(x')/B}{g(x)/B}, 1\right\} \\ &= \min\left\{\frac{\exp\{−5|x'−5|^{1.5}\} + \exp \{−0.1|x'+5|^{3.2}\}}{\exp\{−5|x−5|^{1.5}\} + \exp \{−0.1|x+5|^{3.2}\}} ,1\right\}. \end{aligned}\]
If ‘u’ is less than \(g(x')/g(x)\), set the next state to the proposed value (‘xtemp’); if ‘u’ is greater than or equal to \(g(x')/g(x)\), set the next state to the current state (‘xt’). Then store this value as the ith element of the ‘X’ vector and update ‘xt’. After the iterations end, the function returns ‘X’ as output.
stu_id <- 2011142127
set.seed(stu_id)
generate_MCMC <- function(sd){
  xt <- -15                                 # initial state
  X <- numeric(10^5)                        # preallocate the output vector
  for(i in 1:10^5){
    xtemp <- rnorm(1, xt, sd)               # proposal x' ~ N(xt, sd^2)
    u <- runif(1)
    if(u < g(xtemp)/g(xt)){X[i] <- xtemp}   # accept with prob min{g(x')/g(x), 1}
    else{X[i] <- xt}                        # reject: keep the current state
    xt <- X[i]
  }
  return(X)
}
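A diagnostic worth computing alongside the plots below is the empirical acceptance rate. Since the proposal is continuous, the chain (almost surely) repeats a value only when a proposal is rejected, so the rate can be recovered from the trace itself. The helper `accept_rate` below is an addition of mine, not part of the assignment:

```r
# Estimate the Metropolis acceptance rate from a chain's trace:
# a transition where the state changed must have been an accepted proposal.
accept_rate <- function(x) mean(diff(x) != 0)

# toy example: the chain moves on 3 of its 4 transitions
accept_rate(c(0, 1, 1, 2, 3))    # 0.75
```

Applied to the chains generated below, this gives a rough sense of how each \(\sigma\) trades off between tiny cautious moves and frequent rejections.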
(1) burn-in & ts.plot
Using the function, the MCMC sample of size \(10^5\) with \(\sigma = 1.25\) can be generated, which is saved as ‘X_2’ in R. The following plots show the trace plot of X_2: the left one is the original, and the right one shows the first 2000 values. The values of X_2 quickly reach the region where the chain settles (around -5), so the burn-in period can be small compared to the total number of X_2 values (10^5). In this case, 1000 was chosen as the burn-in period (1% of the original sample X_2).
par(mfrow = c(1,2))
X_2 <- generate_MCMC(1.25)
ts.plot(X_2, main = 'trace plot when sd = 1.25',ylim = c(-15, 5))
ts.plot(X_2[1:2000], main = 'trace plot when sd = 1.25',ylim = c(-15, 5))
X_2 <- X_2[-seq(1,1000,1)]
The following plots show the ts.plot of X_2 after discarding 1000 (burn-in period) values: the left one is the plot of X_2 after burn-in, and the right one shows the first 1000 values of the X_2.
par(mfrow = c(1,2))
ts.plot(X_2, main = 'trace plot after burn-in, sd = 1.25',ylim = c(-15, 10))
ts.plot(X_2[1:1000], main = 'trace plot after burn-in, sd = 1.25')
(2) histogram & autocorrelation
The left plot below is the histogram of X_2 after burn-in. For reference, the graph of f(x) is overlaid in red. The bin width is set to a small value (0.1) to compare the shape of the histogram with the true pdf. The right plot is the autocorrelation plot; the ACF becomes almost 0 before lag 20.
par(mfrow = c(1,2))
hist(X_2, freq = FALSE, main = 'histogram when sd = 1.25',
xlim = c(-15, 10), breaks = seq(-15,10,0.1))
lines(x, g(x)/B, col = 'red')
acf(ts(X_2), lag.max = 80, main = 'autocorrelation when sd = 1.25')
iii) Repeat ii) with \(\sigma = 3.5\).
(1) burn-in & ts.plot
Using the function, the MCMC sample of size \(10^5\) with \(\sigma = 3.5\) can be generated, which is saved as ‘X_3’ in R. The following plots show the trace plot of X_3: the left one is the original, and the right one shows the first 2000 values. The values of X_3 quickly reach the oscillating region, so the burn-in period can be small compared to the total number of X_3 values (10^5). Again, 1000 was chosen as the burn-in period.
set.seed(stu_id)
par(mfrow = c(1,2))
X_3 <- generate_MCMC(3.5)
ts.plot(X_3, main = 'trace plot when sd = 3.5',ylim = c(-15, 10))
ts.plot(X_3[1:2000], main = 'trace plot when sd = 3.5',ylim = c(-15, 10))
The following plots show the ts.plot of X_3 after discarding 1000 (burn-in period) values: the left one is the plot of X_3 after burn-in, and the right one shows the first 1000 values. When \(\sigma = 1.25\), all of the values of X_2 are around x = -5: there is no generated number in X_2 close to x = 5. However, the numbers of X_3 oscillate between x = -5 and x = 5.
X_3 <- X_3[-seq(1,1000,1)]
par(mfrow = c(1,2))
ts.plot(X_3, main = 'trace plot after burn-in, sd = 3.5',ylim = c(-15, 10))
ts.plot(X_3[1:1000], main = 'trace plot after burn-in, sd = 3.5')
(2) histogram & autocorrelation
The left plot below is the histogram of X_3 after burn-in, again with bin width 0.1. For reference, the graph of f(x) is overlaid in red. The right plot is the autocorrelation plot; the ACF does not reach 0 even at lag 80.
par(mfrow = c(1,2))
hist(X_3, freq = FALSE, main = 'histogram when sd = 3.5',
xlim = c(-10, 10), breaks = seq(-15,10,0.1))
lines(x, g(x)/B, col = 'red')
acf(ts(X_3), lag.max = 80, main = 'autocorrelation when sd = 3.5')
iv) Repeat ii) with \(\sigma = 10\).
(1) burn-in & ts.plot
Using the function, the MCMC sample of size \(10^5\) with \(\sigma = 10\) can be generated, which is saved as ‘X_4’ in R. The following plots show the trace plot of X_4: the left one is the original, and the right one shows the first 2000 values. The values of X_4 quickly reach the oscillating region, so the burn-in period can be small compared to the total number of X_4 values (10^5). Again, 1000 was chosen as the burn-in period.
set.seed(stu_id)
par(mfrow = c(1,2))
X_4 <- generate_MCMC(10)
ts.plot(X_4, main = 'trace plot when sd = 10',ylim = c(-15, 10))
ts.plot(X_4[1:2000], main = 'trace plot when sd = 10',ylim = c(-15, 10))
The following plots show the ts.plot of X_4 after discarding the first 1000 (burn-in) values: the left one is the plot of X_4 after burn-in, and the right one shows its first 1000 values. In this case, the numbers of X_4 also oscillate between x = -5 and x = 5. The difference between \(\sigma = 3.5\) and \(\sigma = 10\) is that the values in X_4 move between the two regions more easily (from x = 5 to x = -5 and vice versa), compared to X_3, which tends to stay in one region longer.
X_4 <- X_4[-seq(1,1000,1)]
par(mfrow = c(1,2))
ts.plot(X_4, main = 'trace plot after burn-in, sd = 10',ylim = c(-15, 10))
ts.plot(X_4[1:1000], main = 'trace plot after burn-in, sd = 10')
(2) histogram & autocorrelation
The left plot below is the histogram of X_4 after burn-in, again with bin width 0.1. For reference, the graph of f(x) is overlaid in red. The right plot is the autocorrelation plot; the ACF becomes 0 before lag 30.
par(mfrow = c(1,2))
hist(X_4, freq = FALSE, main = 'histogram when sd = 10',
xlim = c(-10, 10), breaks = seq(-15,10,0.1))
lines(x, g(x)/B, col = 'red')
acf(ts(X_4), lag.max = 80, main = 'autocorrelation when sd = 10')
v) Based on all of the results above, determine which \(\sigma\) is the best and which is the worst for this problem.
(1) worst case
When \(\sigma = 1.25\), no value of X_2 lies around x = 5. The generated numbers of X_2 therefore fail to represent the original density function f(x). Thus, \(\sigma = 1.25\) is the worst case.
(2) best case
The histograms of X_3 and X_4 are both similar to the original pdf f(x). The main difference between \(\sigma = 3.5\) and \(\sigma = 10\) lies in the autocorrelation plots. When \(\sigma = 10\), the ACF becomes almost 0 around lag 30; when \(\sigma = 3.5\), the ACF is still significant even around lag 80. This implies that the sample X_4 becomes almost independent much faster than X_3. Since a sample closer to independence is better, \(\sigma = 10\) is the best case.
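The comparison of ACF decay can be made quantitative with a rough effective sample size (ESS) estimate. The sketch below, my own addition, truncates at the first non-positive autocorrelation; it is a crude stand-in for more careful estimators such as `effectiveSize` in the `coda` package:

```r
# Rough ESS: n / (1 + 2 * sum of the initial positive autocorrelations).
# This simple truncation rule is illustrative, not coda's estimator.
ess <- function(x, max_lag = 100) {
  rho <- acf(x, lag.max = max_lag, plot = FALSE)$acf[-1]  # drop lag 0
  cut <- which(rho <= 0)[1]              # first non-positive autocorrelation
  if (!is.na(cut)) rho <- rho[seq_len(cut - 1)]
  length(x) / (1 + 2 * sum(rho))
}

set.seed(1)
ess(rnorm(1000))          # near 1000 for an independent sample
ess(cumsum(rnorm(1000)))  # far smaller for a strongly correlated chain
```

Comparing `ess(X_3)` with `ess(X_4)` would express the "σ = 10 mixes faster" conclusion as a single number per chain.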
vi) Explain why checking only a histogram, a trace plot and an autocorrelation plot can be dangerous when using MCMC.
A proper example is the case \(\sigma = 1.25\). Checking only a histogram or a trace plot, one would wrongly conclude that all the mass is centered around x = -5 and that there is nothing around x = 5. This is revealed to be false only by overlaying the pdf f(x) on the histogram. So, checking the histogram or trace plot alone is not an appropriate method.
Likewise, although faster decay of the autocorrelation means the sample is closer to independent and hence of higher quality, this alone does not determine the quality of the sample. The sample X_2 for \(\sigma = 1.25\) has the lowest autocorrelation, with an ACF that reaches 0 fastest (around lag 20), but this sample does not represent the original pdf f(x) at all. So, having the lowest ACF should not be the only criterion for MCMC.