Statistical Computing HW8
Q1. Suppose \((X_1, \dots , X_d)\) has the uniform distribution on the d-dimensional unit hypersphere, with density of the form \[f(x_1,\dots,x_d) = c_d\, I(x_1^2+\dots+ x_d^2 < 1),\] where \(I\) is an indicator function and \(c_d\) is the normalizing constant. Since \(c_d\) is the reciprocal of the volume of the d-dimensional unit hypersphere, it is easy to obtain \[c_d = \frac{\Gamma(1 + \frac{d}{2})}{\pi^{d/2}}.\]
i) Consider the acceptance-rejection method with the uniform proposal distribution on the d-dimensional hypercube \([−1,1]^d\). Find an expression for the acceptance probability as a function of d. Explain why this acceptance-rejection algorithm can be extremely inefficient for large d.
Let \(g(x) = g(x_1, \dots, x_d)\) be the density of the uniform distribution on the hypercube \([−1,1]^d\). Since \(\int_{-1}^1\dots\int_{-1}^1 g(x)dx_1\dots dx_d = 1\) and \(g(x)\) is constant, \(g(x) = \frac{1}{2^d},\ (|x_i| < 1, \ ^\forall i=1,\dots, d)\). Then, the supremum of \(f(x)/g(x)\) is \[\begin{aligned} c &= \sup_{x}\frac{f(x)}{g(x)} \\ &= \frac{c_d}{1/2^d} \\ &= \frac{2^d\Gamma(1 + d/2)}{\pi^{d/2}} \\ &= \left(\frac{2}{\sqrt \pi} \right)^d\Gamma \left(1 + \frac{d}{2}\right). \end{aligned}\] So, the acceptance probability is \[\frac{1}{c} = \frac{1}{\left(\frac{2}{\sqrt \pi} \right)^d\Gamma \left(1 + \frac{d}{2}\right)}.\]
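As a quick numerical illustration (a sketch of my own, not required by the question), the acceptance probability \(1/c\) can be evaluated in R for several values of \(d\):

```r
# Acceptance probability 1/c = (sqrt(pi)/2)^d / gamma(1 + d/2);
# the function name 'accept_prob' is illustrative, not from the assignment.
accept_prob <- function(d) (sqrt(pi)/2)^d / gamma(1 + d/2)

accept_prob(2)   # pi/4    ~ 0.785
accept_prob(5)   # pi^2/60 ~ 0.164
accept_prob(20)  # ~ 2.5e-08: almost every proposal is rejected
```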
The following paragraphs show why the value \(c\) increases as \(d\) increases.
Since \(\frac{2}{\sqrt \pi} > 1\), the term \(\left(\frac{2}{\sqrt \pi} \right)^d\) will increase as d increases.
The factor \(\Gamma\left(1 + \frac{d}{2}\right)\) takes a different form depending on the parity of \(d\).
If \(d = 2k, (k = 1,2,3,\dots,)\) then \(\Gamma\left(1 + \frac{d}{2}\right) = \Gamma(1+k) = k!\).
If \(d = 2k-1, (k = 1,2,3,\dots,)\) then \[\begin{aligned} \Gamma\left(1 + \frac{d}{2}\right) &= \Gamma\left(k + \frac{1}{2}\right) \\ &= \left(k - \frac{1}{2}\right)\Gamma\left(k - \frac{1}{2}\right) \\ &= \left(k - \frac{1}{2}\right)\dots\left(1 - \frac{1}{2}\right) \Gamma\left(\frac{1}{2} \right) \\ &= \left(k - \frac{1}{2}\right)\dots\left(1 - \frac{1}{2}\right) \sqrt \pi \end{aligned}\]
When \(d = 2k\), \[\begin{aligned} \frac{\Gamma\left(1 + \frac{d+1}{2}\right)}{\Gamma\left(1 + \frac{d}{2}\right)} &= \frac{\left((k+1)-\frac{1}{2}\right)\left(k-\frac{1}{2}\right)\dotsm\left(1-\frac{1}{2}\right)\sqrt \pi}{k!}\\ &= \frac{(k+1)-\frac{1}{2}}{k}\cdot\frac{k-\frac{1}{2}}{k-1}\dotsm\frac{2-\frac{1}{2}}{1}\cdot\left(1-\frac{1}{2}\right) \sqrt \pi. \end{aligned}\]
It is true that \(\frac{2-1/2}{1}\cdot\left(1-\frac{1}{2}\right) \sqrt \pi \approx 1.329\) and each fraction term \(\frac{(k+1) -1/2}{k}\) is greater than 1 for all positive integers \(k\). Then, \[\frac{\Gamma\left(1 + \frac{d+1}{2}\right)}{\Gamma\left(1 + \frac{d}{2}\right)} > 1,\quad ^\forall d = 2k.\]
When \(d = 2k-1\), \[\begin{aligned} \frac{\Gamma\left(1 + \frac{d+1}{2}\right)}{\Gamma\left(1 + \frac{d}{2}\right)} &= \frac{k!}{\left(k-\frac{1}{2}\right)\dotsm\left(1-\frac{1}{2}\right)\sqrt \pi} \\ &= \frac{k}{k-\frac{1}{2}}\dotsm\frac{1}{1-\frac{1}{2}}\cdot(\sqrt \pi)^{-1}. \end{aligned}\]
Note that \(\frac{1}{1-1/2}\cdot(\sqrt \pi)^{-1} \approx 1.128\), and each fraction term \(\frac{k}{k-1/2}\) is greater than 1 for all positive integers \(k\). So, \[\frac{\Gamma\left(1 + \frac{d+1}{2}\right)}{\Gamma\left(1 + \frac{d}{2}\right)} > 1,\quad ^\forall d = 2k-1.\]
Since \(\frac{\Gamma\left(1 + \frac{d+1}{2}\right)}{\Gamma\left(1 + \frac{d}{2}\right)} > 1,\quad ^\forall d = 1,2,\dots\), \(\Gamma\left(1 + \frac{d}{2}\right)\) increases as \(d\) increases. Also, \(\left(\frac{2}{\sqrt\pi}\right)^d\) increases exponentially as \(d\) increases. In other words, the expected number of iterations until acceptance, which is equal to \(c\), grows at least exponentially in \(d\). The rejection method becomes much slower as \(c\) grows far beyond 1; equivalently, the acceptance probability \(1/c\) becomes extremely small.
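The monotonicity of \(\Gamma\left(1 + \frac{d}{2}\right)\) can also be verified numerically; a small sketch (my own check, not part of the proof above):

```r
# Check that successive ratios gamma(1 + (d+1)/2) / gamma(1 + d/2)
# all exceed 1 for d = 1, ..., 30, i.e. gamma(1 + d/2) is increasing in d.
d <- 1:30
ratios <- gamma(1 + (d + 1)/2) / gamma(1 + d/2)
all(ratios > 1)  # TRUE
```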
Thus, the acceptance-rejection algorithm is inefficient for large \(d\).
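To make the inefficiency concrete, here is a minimal sketch of the acceptance-rejection sampler (function and variable names are my own, not from the assignment); its empirical acceptance rate tracks \(1/c\) and collapses as \(d\) grows:

```r
# Acceptance-rejection sampling from the uniform distribution on the
# d-dimensional unit hypersphere, with Unif[-1,1]^d proposals.
ar_hypersphere <- function(n, d) {
  draws <- matrix(NA_real_, nrow = n, ncol = d)
  n_prop <- 0                      # total proposals, to track efficiency
  for (j in 1:n) {
    repeat {
      n_prop <- n_prop + 1
      x <- runif(d, -1, 1)         # proposal from Unif[-1,1]^d
      if (sum(x^2) < 1) break      # accept iff the point is inside the ball
    }
    draws[j, ] <- x
  }
  list(draws = draws, accept_rate = n / n_prop)
}

set.seed(1)
ar_hypersphere(1000, 5)$accept_rate  # close to pi^2/60 ~ 0.164
```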
ii) Now we consider using the Gibbs sampling. Find the distribution of \(X_i\) conditional on the others, i.e., the distribution of \(X_i|(X_{-i} = x_{-i})\). Specify the type of distribution and its parameters.
When \(d = 1\), the hypersphere is simply the interval from -1 to 1. Then \(f(x)\) is the pdf of a uniform distribution: \(X \sim \text{Unif}(−1,1)\), so \(f(x) =\frac{1}{2},\ −1< x < 1\). The CDF is \(F(x) = \frac{1}{2}(x+1)\).
As written in the question, \(X_{-i}\) and \(x_{-i}\) denote all the random variables and all the observed values other than the \(i\)th. Similarly, \(x_{-i}^2\) will denote the sum of squares of all observed values other than the \(i\)th from now on: \(x_{-i}^2 = \sum_{k \neq i}x_k^2\).
When \(x_{-i}\) is fixed, the range of \(X_i\) satisfies \(X_i^2 < 1-x_{-i}^2\). Without this constraint, \(X_i\) follows \(\text{Unif}(−1,1)\) as previously mentioned. So, the CDF of \(X_i\) conditional on \(-\sqrt{1-x_{-i}^2} < X_i < \sqrt{1-x_{-i}^2}\) and its pdf, which is a truncated distribution, are as follows: \[H(x_i) = \frac{F(x_i) - F(-\sqrt{1-x_{-i}^2})}{F(\sqrt{1-x_{-i}^2}) - F(-\sqrt{1-x_{-i}^2})}, \\[3ex] h(x_i) = \frac{f(x_i)}{F(\sqrt{1-x_{-i}^2}) - F(-\sqrt{1-x_{-i}^2})}, \quad -\sqrt{1-x_{-i}^2} < x_i < \sqrt{1-x_{-i}^2}.\]
The numerator \(f(x_i)\) is \(1/2\), and the denominator \(F(\sqrt{1-x_{-i}^2}) - F(-\sqrt{1-x_{-i}^2})\) is \(\frac{1}{2}\Bigl\{(\sqrt{1-x_{-i}^2} + 1) - (-\sqrt{1-x_{-i}^2} + 1) \Bigr\} = \frac{1}{2}\cdot 2\sqrt{1 - x_{-i}^2} = \sqrt{1-x_{-i}^2}\). So, \[h(x_i) = \frac{1}{2\sqrt{1-x_{-i}^2}},\quad -\sqrt{1-x_{-i}^2} < x_i < \sqrt{1-x_{-i}^2}.\]
The obtained pdf \(h(x_i)\) is the pdf of \(X_i\) conditional on the other values, i.e., \(f(x_i|X_{-i} = x_{-i})\). This is the pdf of a uniform distribution: \(X_i \mid X_{-i} = x_{-i} \sim \text{Unif}(-\sqrt{1-x_{-i}^2}, \sqrt{1-x_{-i}^2})\).
iii) For \(d = 5\), generate \(n = 1000\) random draws using a fixed-scan Gibbs sampler with initial values zero (after a reasonable burn-in period). Store the values whenever every cycle is completed, i.e., obtain a sequence of random vectors \(\Bigl( X_1^{(1)},\dots,X_d^{(1)}\Bigr), \dots, \Bigl( X_1^{(1000)},\dots,X_d^{(1000)}\Bigr)\). Draw a scatter plot of \((X_1,X_2)\).
The following function ‘hw8_q1’ generates ‘n’ Gibbs draws starting from initial values (‘xt’) of zero. It was shown in question ii) that the conditional pdf \(f_{i|-i}\) is that of \(\text{Unif}(-\sqrt{1-x_{-i}^2}, \sqrt{1-x_{-i}^2})\). So, in every \(j\)th cycle \((j = 1,\dots,n)\), ‘hw8_q1’ updates the \(i\)th coordinate ‘xt[i]’ by drawing from \(\text{Unif}(-\sqrt{1-x_{-i}^2}, \sqrt{1-x_{-i}^2})\), for \(i = 1, \dots, 5\). Since the scan is fixed, the updated value ‘xt[i]’ is already used when generating ‘xt[i+1]’. After a full cycle, the vector ‘xt’, which corresponds to \(\Bigl(X_1^{(t)},\dots,X_5^{(t)}\Bigr)\), is appended as the next column of ‘X’. The \(i\)th row of the matrix ‘X’ holds the values of \(X_i\), and the \(j\)th column holds the \(j\)th draw of each \(X_i\).
stu_id <- 2011142127
set.seed(stu_id)
hw8_q1 <- function(n){
  xt <- rep(0, 5)                   # initial values: all zero
  X <- c()
  for(j in 1:n){
    for(i in 1:5){                  # fixed-scan update of each coordinate
      r <- sqrt(1 - sum(xt[-i]^2))  # half-width of the conditional support
      xt[i] <- runif(1, -r, r)      # draw from Unif(-r, r)
    }
    X <- cbind(X, xt)               # store the completed cycle
  }
  colnames(X) <- 1:n
  rownames(X) <- c('X1','X2','X3','X4','X5')
  return(X)
}
X <- hw8_q1(2000)
Then we obtain ‘n’ random vectors, stored as the columns of the result matrix. This result ‘X’ does not yet discard the burn-in data, so the initial ‘X’ matrix is given more than 1000 columns, leaving room to determine the burn-in period.
The following figures show the ts.plot of each \(X_i\) for the first 500 draws. None of the chains shows extreme values early on; each \(X_i\) occurs frequently around 0 and oscillates around 0 from the very first iterations. So, the proper burn-in period can be much smaller than the total number of draws required (1000). Here, the burn-in period was set to 100.
Columns 101 through 1100 of the X matrix are taken as the 1000 random draws from the Gibbs sampler.
par(mfrow = c(2,3))
ts.plot(X[1,][1:500], main = 'ts.plot of X1')
ts.plot(X[2,][1:500], main = 'ts.plot of X2')
ts.plot(X[3,][1:500], main = 'ts.plot of X3')
ts.plot(X[4,][1:500], main = 'ts.plot of X4')
ts.plot(X[5,][1:500], main = 'ts.plot of X5')
For reference, the histograms of X1, …, X5 for the 1000 draws after the burn-in period are given below.
X <- X[,101:1100]
X1 <- X[1,]
X2 <- X[2,]
X3 <- X[3,]
X4 <- X[4,]
X5 <- X[5,]
par(mfrow = c(2,3))
hist(X1, main = 'Histogram of X1', freq = FALSE)
hist(X2, main = 'Histogram of X2', freq = FALSE)
hist(X3, main = 'Histogram of X3', freq = FALSE)
hist(X4, main = 'Histogram of X4', freq = FALSE)
hist(X5, main = 'Histogram of X5', freq = FALSE)
Finally, the scatter plot of X1 and X2 is presented below. Since X1, …, X5 satisfy \(X_1^2 + \dots + X_5^2 < 1\), it follows that \(X_1^2 + X_2^2 < 1\). The red circle representing \(X_1^2 + X_2^2 = 1\) is drawn for reference to show that all points lie inside it.
par(mfrow = c(1,1))
plot(X1,X2, main = 'plot of X1 and X2', xlim = c(-1,1), ylim = c(-1,1))
ucircle <- cbind(cos((0:360)/180 * pi), sin((0:360)/180 * pi))
lines(ucircle, type = "l", col = "red")
Q2. Poisson regression is often used to model the relationship between a count response variable and predictor variables. We consider a Poisson regression model with one predictor variable \(x_i \in \mathbb R\) and a regression coefficient \(\beta \in \mathbb R\):\[\mathbb P(Y_i = y) = \frac{\lambda_i^y e^{-\lambda_i}}{y!}, \quad i=1,2,\dots,n,\]where \(Y_i \in \{0,1,2,\dots\}\) is a response variable and \(\lambda_i = e^{\beta x_i}\) is the mean response (i.e., \(\lambda_i = \mathbb E[Y_i]\)).
i) Write down the likelihood function and the log-likelihood function of the model.
The probability that \(Y_i = y_i\) is \(\mathbb P(Y_i = y_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!}\). Then, the likelihood function \(L(\beta)\) is the product of all these probabilities: \[L(\beta) = \prod_{i=1}^n \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} = \prod_{i=1}^n\frac{e^{\beta x_i y_i} e^{-\lambda_i}}{y_i!}.\]
The log-likelihood function is the log of \(L(\beta)\): \[\begin{aligned} l(\beta) &= \log\left(\prod_{i=1}^n e^{\beta x_i y_i} e^{-\lambda_i}\right) - \log\left(\prod_{i=1}^ny_i! \right) \\ &= \sum_{i=1}^n\log(e^{\beta x_i y_i} e^{-\lambda_i}) - \sum_{i=1}^n \log(y_i!) \\ &= \sum_{i=1}^n\Bigl(\beta x_i y_i - \lambda_i -\log(y_i!)\Bigr) \\ &= \sum_{i=1}^n\Bigl(\beta x_i y_i - e^{\beta x_i} -\log(y_i!)\Bigr). \end{aligned}\]
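The log-likelihood above can be transcribed directly into R; a minimal sketch (the function name ‘loglik’ is mine, and ‘lfactorial’ computes \(\log(y!)\) stably):

```r
# l(beta) = sum(beta*x*y - exp(beta*x) - log(y!)); 'loglik' is an
# illustrative name, not from the assignment.
loglik <- function(beta, x, y) {
  sum(beta * x * y - exp(beta * x) - lfactorial(y))
}

# Sanity check against R's built-in Poisson log-density:
x <- c(0.5, -1.2, 0.3); y <- c(1, 0, 2); beta <- 0.7
all.equal(loglik(beta, x, y),
          sum(dpois(y, exp(beta * x), log = TRUE)))  # TRUE
```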
ii) Find the score function and the (expected) Fisher information of \(\beta\).
The score function \(S(\beta)\) is the (partial) derivative of the log-likelihood with respect to \(\beta\): \[\begin{aligned} S(\beta) &= \frac{d}{d\beta}l(\beta) \\ & = \frac{d}{d\beta}\sum_{i=1}^n\Bigl(\beta x_i y_i - e^{\beta x_i} -\log(y_i!)\Bigr) \\ &= \sum_{i=1}^n(x_iy_i - x_ie^{\beta x_i}). \end{aligned}\]
The (expected) Fisher information of \(\beta\), \(I(\beta)\), is the negative expectation of the second (partial) derivative of the log-likelihood with respect to \(\beta\): \[\begin{aligned} I(\beta) &= -\mathbb E\left[\frac{d^2}{d\beta^2}l(\beta) \right] \\ &= -\mathbb E\left[\frac{d^2}{d\beta^2}\sum_{i=1}^n(X_iY_i - X_ie^{\beta X_i}) \right] \\ &= \mathbb E\Bigl[\sum_{i=1}^n X_i^2e^{\beta X_i} \Bigr] \\ &= n\mathbb E[X^2 e^{\beta X}]. \end{aligned}\] Note that \(X\) denotes a random draw from \(\{x_1,\dots,x_n\}\), and the \(X_i\) are i.i.d. copies of \(X\).
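As a sketch (function names are mine), the score and the information in their summed forms can be checked against a numerical derivative of the log-likelihood:

```r
# Score S(beta) and Fisher information I(beta) in their observed-sum
# forms, as used for the Fisher scoring update in part iii).
score  <- function(beta, x, y) sum(x * y - x * exp(beta * x))
fisher <- function(beta, x) sum(x^2 * exp(beta * x))

# Numerical check: a central difference of l(beta) matches the score.
loglik <- function(beta, x, y) sum(beta * x * y - exp(beta * x) - lfactorial(y))
x <- c(0.4, -0.8, 1.1); y <- c(2, 0, 3); beta <- 0.5; h <- 1e-6
num_score <- (loglik(beta + h, x, y) - loglik(beta - h, x, y)) / (2 * h)
abs(num_score - score(beta, x, y)) < 1e-4  # TRUE
```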
iii) We want to find the maximum likelihood estimator of the Poisson regression model above. For such a task, show that the Newton-Raphson method and the Fisher scoring method are equivalent for this particular model. Using the method, give a specific updating rule for \(\beta\), say \(\beta^{(t)} \to \beta^{(t+1)}\).
For simplicity, \(\beta^{(t)}\) will be denoted by \(\beta_t\) from now on. \(\beta_t\) means the \(t\)th updated value of \(\beta\).
The solution of the nonlinear equation \(\frac{d}{d\beta}l(\beta) = l'(\beta) = 0\) is the MLE of \(\beta\). According to the Newton method, the updating rule is as follows: \(\beta_{t+1} = \beta_t - \frac{l'(\beta_t)}{l''(\beta_t)}\). Based on the Fisher Scoring method, the updating rule is as follows: \(\beta_{t+1} = \beta_t + \frac{l'(\beta_t)}{I(\beta_t)}\). In order for the two methods to be identical, the two values \(-l''(\beta_t)\) and \(I(\beta_t)\) should be identical.
It has already been shown that \(-l''(\beta) = \sum_{i=1}^nx_i^2e^{\beta x_i}\). The Fisher information is \(I(\beta) = n\,\mathbb E[X^2e^{\beta X}]\), where \(X\) is drawn from \(\{x_1,\dots,x_n\}\) with probability \(1/n\) each. Then the expectation can be written as \(\sum_{i=1}^n x_i^2e^{\beta x_i}\cdot\frac{1}{n}\). Multiplying by \(n\), this term becomes \(-l''(\beta)\).
Thus, \(-l''(\beta) = I(\beta)\), so the two methods are identical. The updating rule, as previously mentioned, is \[\beta_{t+1} = \beta_t + \frac{l'(\beta_t)}{I(\beta_t)}.\]
iv) For n = 1000, generate \(x_i, \ i=1,\dots,n\) independently from \(N(0,1)\), and then generate \(Y_i, \ i=1,\dots,n\) independently from \(\text{Pois}(e^{\beta x_i})\) with \(\beta = 1.5\). With the generated dataset, find the maximum likelihood estimate of \(\beta\) using the updating rule you obtained in iii) with initial value zero. Iterate the procedure until the relative error gets smaller than \(\epsilon = 10^{-5}\), i.e., \(\Bigl|\frac{\beta^{(t)} - \beta^{(t-1)}}{\beta^{(t-1)}} \Bigr| < \epsilon\).
The function ‘hw8_q2’ returns a dataframe containing the MLE of \(\beta\), the relative error, and the number of iterations. Its input serves as the initial value of \(\beta\). It generates 1000 random numbers ‘x’ from \(N(0,1)\) and then generates each \(y_i\) from a Poisson distribution with mean \(e^{1.5 x_i}\). In every iteration ‘i’, it calculates \(l'(\beta_t)\) and \(I(\beta_t)\) (‘dl’ and ‘I.b’ in the R code, respectively) and updates the next \(\beta\) (‘b.next’) to \(\beta_t + \frac{l'(\beta_t)}{I(\beta_t)}\). If the absolute value of \(\frac{\beta_{t+1} - \beta_t}{\beta_t}\) is less than \(10^{-5}\) (and \(\beta_t \neq 0\)), the iteration stops and ‘b.next’ becomes the MLE of \(\beta\).
set.seed(stu_id)
hw8_q2 <- function(b.t){
  x <- rnorm(1000, 0, 1)              # predictors from N(0,1)
  y <- c()
  for(i in 1:1000){
    y[i] <- rpois(1, exp(1.5*x[i]))   # responses from Pois(e^{1.5 x_i})
  }
  i <- 0
  while(1){
    i <- i + 1
    dl <- sum(x*y - x*exp(b.t*x))     # score l'(beta_t)
    I.b <- sum(x^2*exp(b.t*x))        # Fisher information I(beta_t)
    b.next <- b.t + dl/I.b            # Fisher scoring / Newton-Raphson update
    if(b.t != 0 & abs((b.next - b.t)/b.t) < 10^(-5)){break}
    b.t <- b.next
  }
  ans <- data.frame(MLE = b.next,
                    error = abs((b.next - b.t)/b.t),
                    iter = i)
  return(ans)
}
The question requires the initial value of \(\beta\) to be 0, so ‘0’ is passed to the function ‘hw8_q2’. Then, the MLE is estimated to be 1.487, the relative error is 6.5588\(\times 10^{-9}\), and the number of iterations is 14.
answer <- hw8_q2(0)
answer
## MLE error iter
## 1 1.48705 6.558829e-09 14
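As a cross-check (my own addition, not required by the question), R’s built-in ‘glm’ with a Poisson family and no intercept fits the same model \(\lambda_i = e^{\beta x_i}\), so its coefficient should essentially reproduce the Newton-Raphson MLE when the data are regenerated the same way:

```r
# Regenerate the data with the same seed and generating steps as hw8_q2,
# then fit with glm; the coefficient should be close to the MLE above.
stu_id <- 2011142127
set.seed(stu_id)
x <- rnorm(1000, 0, 1)
y <- c()
for(i in 1:1000){
  y[i] <- rpois(1, exp(1.5*x[i]))
}
coef(glm(y ~ x - 1, family = poisson))  # close to 1.487
```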