Statistical Computing HW6
Q1. In Homework 1, we considered the Monte Carlo simulation for estimating \(\pi\). To construct estimators, the following two facts were used: \[(1)\ 4\mathbb E[I(V_1^2 + V_2^2 \le 1)] =\pi, \\[1ex] (2)\ 6\mathbb E[I(V_1^2 + V_2^2 + V_3^2 \le 1)] = \pi,\] where \(I\) is the indicator function and \(V_k \sim \text{Unif}(-1,1),\ k =1,2,3\). We observed that the estimator constructed from (1) converges faster than the one from (2). This is because the estimator from (1) has the smaller variance. Since each indicator variable is Bernoulli distributed, the variances are easily obtained and compared. Now we consider a different approach for comparing the variances. Show that the estimator with (1) is a conditioning estimator constructed from the estimator with (2).
To sum up, the main point is to find a proper condition that leads to \[4\mathbb E[I(V_1^2 + V_2^2 \le 1)] = 6\mathbb E[I(V_1^2 + V_2^2 + V_3^2 \le 1) \mid \text{condition}].\] The following argument focuses on which condition leads to this equation. To start with, let the indicator \(I(V_1^2 + V_2^2 \le 1)\) be ‘C’ and the indicator \(I(V_1^2 + V_2^2 + V_3^2 \le 1)\) be ‘S’, similar to my notation in HW1 Q2. In other words, \(C \sim \text{Bernoulli}(\pi/4)\) and \(S \sim \text{Bernoulli}(\pi/6)\). As mentioned in HW1, this makes sense because \(V_1, V_2, V_3\) are all uniformly distributed on the same support, which corresponds to choosing points uniformly in a cube of edge 2.
| \(C\) | \(0\) | \(1\) | \(S\) | \(0\) | \(1\) |
|---|---|---|---|---|---|
| \(P(C)\) | \(1-\pi/4\) | \(\pi/4\) | \(P(S)\) | \(1-\pi/6\) | \(\pi/6\) |
It is known that a conditional expectation is a function of the conditioning variable. For example, \(\mathbb E[X_1|X_2] = f(X_2)\), where the function \(f\) may take diverse forms, such as \(X_2^3\), \(2X_2^2\), and so on. The equation to be proved requires the conditional expectation to be the random variable ‘C’ multiplied by a constant, which is indeed a function of ‘C’. So, the following explanation attempts to show that \(4C = 6\mathbb E[S|C]\).
Before calculating the conditional expectation, let us categorize all of the cases into four subcases and obtain the conditional probabilities. Note that ‘C’ and ‘S’ take only the values 0 and 1.
(1) C=1, when \(V_1^2 + V_2^2 \le 1\)
Under the case ‘C = 1’, there are two subcases, including \(S = 1, (V_1^2 + V_2^2 + V_3^2 \le 1)\) given \(C = 1\), and \(S = 0, (V_1^2 + V_2^2 + V_3^2 > 1)\) given \(C = 1\).
The first subcase, ‘S = 1 given C = 1’, is straightforward because if \(V_1^2 + V_2^2 + V_3^2 \le 1\) holds, then \(V_1^2 + V_2^2 \le 1\) is always true. In other words, \(\{S = 1\} \subset \{C = 1\}\), so \(\mathbb P(S = 1,C=1) = \mathbb P(S=1)\). Then, \[\begin{aligned} \mathbb P(S=1|C=1) &= \mathbb P(S = 1,C=1)/\mathbb P(C=1) \\ & = \mathbb P(S=1)/\mathbb P(C=1) \\ &= (\pi/6)/(\pi/4) = 2/3. \end{aligned}\]
The second subcase, ‘S = 0 given C = 1’, corresponds to the points outside the sphere (with radius 1) and inside the cylinder (with radius 1 and height 2). The volume of this region is \(2\pi*1^2\ (\text{volume of cylinder}) - \frac{4}{3}\pi*1^3\ (\text{volume of sphere}) = \frac{2}{3}\pi\), and \(\mathbb P(S=0,C=1) = \frac{2}{3}\pi/8 = \pi/12\). Thus, \[\mathbb P(S=0|C=1) = \mathbb P(S=0,C=1)/\mathbb P(C=1) = (\pi/12)/(\pi/4) = 1/3.\]
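The geometric probabilities above can be checked numerically. The following is a minimal sketch (the seed and sample size are arbitrary choices), sampling points uniformly in the cube of edge 2 and counting those inside the cylinder but outside the sphere:

```r
# Monte Carlo check (sketch): P(S = 0, C = 1) should be close to pi/12,
# the volume between the sphere and the cylinder divided by 8.
set.seed(1)
n <- 10^6
V1 <- runif(n, -1, 1); V2 <- runif(n, -1, 1); V3 <- runif(n, -1, 1)
C <- V1^2 + V2^2 <= 1             # inside the cylinder
S <- V1^2 + V2^2 + V3^2 <= 1      # inside the sphere
mean(C & !S)                      # close to pi/12, about 0.2618
```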
(2) C = 0: when \(V_1^2 + V_2^2 > 1\)
There are two subcases, S = 1 given C = 0, and S = 0 given C = 0.
The first subcase, S = 1 given C = 0, is also straightforward: when \(V_1^2 + V_2^2 > 1\), there is no case in which \(V_1^2 + V_2^2 + V_3^2\) is less than or equal to 1. In other words, \(\{S=1\} \cap \{C=0\} = \emptyset\), which leads to \(\mathbb P(S=1,C=0) = 0\). So, \[\mathbb P(S=1|C=0) = \mathbb P(S=1,C=0)/\mathbb P(C=0) = 0.\]
The second subcase, S = 0 given C = 0, corresponds to the points outside the cylinder and inside the cube (with edge length 2). The volume of this region is \((4-\pi)*2\), so \(\mathbb P(S=0,C=0) = (4-\pi)*2/8 = 1-\pi/4\). Thus, \[\mathbb P(S=0|C=0) = \mathbb P(S=0,C=0)/\mathbb P(C=0) = (1-\pi/4)/(1-\pi/4) = 1.\]
(3) conditional expectation for each C
\[\begin{aligned} \mathbb E[S|C=1] &= \sum_{s\in\{0,1\}}s*\mathbb P(S=s|C=1) \\ &= 0*(\frac{1}{3}) + 1*(\frac{2}{3}) = \frac{2}{3}. \end{aligned}\]
Note that 2/3 can be expressed as (2/3)*1.
\[\begin{aligned} \mathbb E[S|C=0] &= \sum_{s\in\{0,1\}}s*\mathbb P(S=s|C=0) \\ &= 0*1 + 1*0 = 0. \end{aligned}\]
Note that 0 can be expressed as (2/3)*0.
(4) conclusion
To sum up, \(\mathbb E[S|C=c] = (2/3)*c\) for \(c = 0,1\). So, \(\mathbb E[S|C] = (2/3)*C\). Multiplying each side by 6, the following expression is obtained: \[4C = \mathbb E[6S|C].\]
The random variable C is identical to the indicator \(I(V_1^2 + V_2^2 \le 1)\), and the random variable S is the same as \(I(V_1^2 + V_2^2 + V_3^2 \le 1)\). Therefore, the proper condition is ‘C’, i.e., the indicator \(I(V_1^2 + V_2^2 \le 1)\), and the expression above can be written in terms of indicators as follows: \[4*I(V_1^2 + V_2^2 \le 1) = \mathbb E[6*I(V_1^2 + V_2^2 + V_3^2 \le 1)|I(V_1^2 + V_2^2 \le 1)].\]
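As a numerical illustration of the identity just derived, the following sketch (seed and sample size are arbitrary) compares the crude estimator built from fact (2) with the conditioning estimator \(4C\); both target \(\pi\), but the conditioned one has the smaller variance:

```r
set.seed(1)
n <- 10^5
V1 <- runif(n, -1, 1); V2 <- runif(n, -1, 1); V3 <- runif(n, -1, 1)
C <- as.numeric(V1^2 + V2^2 <= 1)
S <- as.numeric(V1^2 + V2^2 + V3^2 <= 1)
est_raw  <- mean(6 * S)    # estimator from fact (2)
est_cond <- mean(4 * C)    # conditioning estimator, since E[6S|C] = 4C
c(est_raw, est_cond)       # both close to pi
var(6 * S) > var(4 * C)    # TRUE: conditioning reduces the variance
```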
Q2. Consider estimating a small probability \(\mathbb P(X \in A)\), where \(X \sim F\) with density \(f\). Suppose another distribution \(G\) with density \(g\) is given to us, for which we can calculate \(p = \mathbb P(Y \in A), Y \sim G\). Assume also that sampling \(Y\) from \(G\) conditional on \(\{Y \in A\}\) is easily doable. Let this conditional distribution be \(G_A\), i.e., \(G_A(y) = \mathbb P(Y \le y|Y \in A)\), \(Y \sim G\). We can construct an estimator as \[ \hat \theta = \frac{p}{n}\sum_{i=1}^n \frac{f(Y_i)}{g(Y_i)},\quad Y_i \stackrel{\text{i.i.d}}{\sim} G_A, \quad i=1,\dots,n.\]
i) Show that \(\mathbb E_Y[\hat \theta] = \mathbb P(X \in A)\), where \(X \sim F\).
\[\begin{aligned} \mathbb P(X \in A) &= \mathbb E_X [I(X\in A)]\\ &= \int_{-\infty}^{\infty} I(x \in A)f(x)dx &\qquad ①\\ &= \int_{-\infty}^{\infty} I(x \in A)\frac{f(x)}{g(x)}g(x)dx &\qquad ② \end{aligned}\]
Expression ① is the expectation of a function of a variable that follows \(F\), while expression ② is the expectation of a function of a variable that follows \(G\). This problem denotes such a variable by \(Y \sim G\), so changing ‘x’ into ‘y’ gives better notation. Then, \[\int_{-\infty}^{\infty} I(y \in A)\frac{f(y)}{g(y)}g(y)dy = \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)}\right].\]
Due to the law of total expectation, the below expression holds. \[\mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)}\right] \\ = \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A\right]\mathbb P(Y \in A) + \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A^c\right]\mathbb P(Y \in A^c).\]
When \(Y \in A^c\), the indicator \(I(Y \in A)\) becomes 0. Then, the latter conditional expectation \(\mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A^c\right]\) becomes the expectation of 0’s. Then, \[\begin{aligned} \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)}\right] &= \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A\right]\mathbb P(Y \in A) + 0 \\ &= \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A\right]*p &\qquad ③. \end{aligned}\]
The indicator function \(I(Y \in A)\) is always 1 when \(Y \in A\). According to the definition of \(\hat \theta\), each \(Y_i\) follows the distribution \(G_A\), the truncated version of \(G\), so every \(Y_i\) lies in \(A\); in other words, \(I(Y_i \in A)\) is always 1. For any of the \(n\) samples \(Y_i\), the above equation becomes \[p* \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A\right] = p*\mathbb E_Y\left[1*\frac{f(Y_i)}{g(Y_i)}\right].\]
The \(Y_i\) are identically distributed from \(G_A\) and independent. Then, the expectations \(\mathbb E\left[\frac{f(Y_i)}{g(Y_i)}\right]\) are all identical and \[\begin{aligned} p*\mathbb E_Y\left[1*\frac{f(Y_i)}{g(Y_i)}\right] &= p*\frac{1}{n}\sum_{i = 1}^n \mathbb E_Y\left[\frac{f(Y_i)}{g(Y_i)}\right] \\ &= \frac{p}{n}\mathbb E_Y\left[\sum_{i = 1}^n\frac{f(Y_i)}{g(Y_i)} \right] \\ &= \mathbb E_Y\left[\frac{p}{n} \sum_{i = 1}^n\frac{f(Y_i)}{g(Y_i)} \right] \\ &= \mathbb E_Y[\hat \theta]. \end{aligned}\]
This proves \(\mathbb E_Y[\hat \theta] = \mathbb P(X \in A)\). Let this value be \(\theta\).
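The estimator can also be tried end-to-end on a concrete example. The following is a minimal sketch under hypothetical choices \(F = N(0,1)\), \(A = (3,\infty)\), and \(G = N(0,2^2)\) (none of these are specified in the problem); sampling from \(G_A\) uses the inverse-CDF restricted to \((G(3), 1)\), which is one easy way to sample \(G\) conditional on \(\{Y \in A\}\):

```r
set.seed(1)
n  <- 10^5
lo <- 3                                  # A = (3, Inf), a hypothetical choice
p  <- 1 - pnorm(lo, mean = 0, sd = 2)    # p = P(Y in A) for Y ~ G = N(0, 4)
u  <- runif(n, pnorm(lo, 0, 2), 1)       # uniform on (G(3), 1)
y  <- qnorm(u, 0, 2)                     # Y_i ~ G_A via the inverse CDF
theta_hat <- p / n * sum(dnorm(y, 0, 1) / dnorm(y, 0, 2))
theta_hat                                # close to P(X > 3) = 1 - pnorm(3)
```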
ii) Show that \(\hat \theta\) is constructed by combining the importance sampling, the conditioning, and the stratified sampling.
(1) importance sampling
For example, when estimating \(\lambda = \mathbb E_f[h(X)]\), where \(h(X)\) is a function of \(X\) and \(\mathbb E_f\) means an expectation with respect to a random variable \(X\) that has the pdf \(f\), the method of estimating can be expressed in a different way. To explain, \[\begin{aligned} \lambda = \mathbb E_f[h(X)] &= \int h(x)f(x)dx \\ &= \int h(x)\frac{f(x)}{g(x)}g(x)dx \\ &= \mathbb E_g\left[h(Y)\frac{f(Y)}{g(Y)} \right], \end{aligned}\] where \(Y \sim G\).
Instead of using the random variable \(X\) with pdf \(f\) directly for estimating \(\lambda\), generating the random variable \(Y\) from a different pdf \(g\) and averaging the values of \(h(Y)\frac{f(Y)}{g(Y)}\) estimates the same value \(\lambda\) and may reduce the variance; this is called ‘importance sampling’.
Using ③ of the preceding proof, the value \(\mathbb E_Y[\hat \theta]\) becomes \(p*\mathbb E_g\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)}\right]\), which contains the general form of importance sampling. Note that \(Y_i \in A\) and \(I(Y_i \in A) = 1\) for all \(i\), so the following expression holds: \[\mathbb E_g\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)}\right] = \mathbb E_g\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)} \Big | Y_i \in A\right],\] which contains the importance sampling form as well. To sum up, \(\mathbb E_Y[\hat \theta]\) has the importance sampling form \(p*\mathbb E_g\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)}\Big | Y_i \in A\right]\), where \(I(Y_i \in A)\) plays the role of \(h(Y)\) in the previous paragraph.
(2) stratified sampling and conditioning
For example, when estimating \(\lambda = \mathbb E[Z]\), the main point of stratified sampling is to estimate \(\mathbb E[Z|Y =y_i]\) under the known values \(\mathbb P(Y = y_i)=p_i,\ i =1,\dots,n\). Letting the estimate of \(\mathbb E[Z|Y =y_i]\) be \(\hat {\mathbb E}[Z|Y =y_i]\), \(\lambda\) can be estimated by \(\sum_{i=1}^n p_i*\hat {\mathbb E}[Z|Y =y_i]\). Note that \(\mathbb E[Z|Y =y_i]\) is an unknown value for all \(i\). [Stratified sampling]
The process of estimating \(\lambda = \mathbb E[Z]\) by conditioning is to calculate \(\mathbb E[Z|Y]\) for all generated \(Y = y_i\), multiply by the estimated probability \(\mathbb P(Y=y_i) = \hat p_i\), and add the values. That is, \(\lambda \approx \sum_{i=1}^n \hat p_i\mathbb E[Z|Y = y_i]\). Note that \(\mathbb E[Z|Y =y_i]\) is a known value for all \(i\). [Conditioning]
The explanation above is based on the case where one type is always known and the other type is always unknown (or vice versa). However, there exist cases where some of the probabilities (or all of them) are known and some of the conditional expectations (or all of them) are known. In these cases, a combination of the two methods (stratified + conditioning) is also possible. For example, \(\lambda\) can be estimated as \(p_1\mathbb E[Z|Y=y_1] + \hat p_2\mathbb E[Z|Y=y_2] + p_3\hat{\mathbb E}[Z|Y = y_3] + \dots\)
We have shown that the expectation of \(\hat \theta\) can be expressed as follows: \[\mathbb E[\hat \theta] = \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A\right]\mathbb P(Y \in A) + \mathbb E_Y\left[I(Y \in A)*\frac{f(Y)}{g(Y)} \Big| Y \in A^c\right]\mathbb P(Y \in A^c)\]
After sampling the \(Y\), with \(Y_i \in A\) and \(Y_j \in A^c\), \(\mathbb E[\hat \theta]\) can be represented as follows, where \(m\) is any suitable positive integer. \[\mathbb E[\hat \theta] = \frac{1}{n}\sum_{i=1}^n p*\mathbb E_Y\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)} \Big| Y_i \in A\right] + \frac{1}{m}\sum_{j=1}^m(1-p)*\mathbb E_Y\left[I(Y_j \in A)*\frac{f(Y_j)}{g(Y_j)} \Big| Y_j \in A^c\right] \]
The value of \(p\) is given, so it is a known value. Also, the conditional expectation \(\mathbb E_Y\left[I(Y_j \in A)*\frac{f(Y_j)}{g(Y_j)} \Big| Y_j \in A^c\right]\) is a known value (= 0) because the indicator \(I(Y_j \in A)\) is always 0. The other conditional expectation \(\mathbb E_Y\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)} \Big| Y_i \in A\right]\) is an unknown value. So, the expectation of \(\hat \theta\) combines known and unknown probabilities and conditional expectations. Note that the conditional expectations have the importance sampling form. For reference, the first summation \(\frac{1}{n}\sum_{i=1}^n p*\mathbb E_Y\left[I(Y_i \in A)*\frac{f(Y_i)}{g(Y_i)} \Big| Y_i \in A\right]\) is the exact form of stratified sampling.
Q3. Consider the problem of estimating \(\mathbb P[U \le a]\) for a uniform random variable \(U \sim \text{Unif}(0,1)\) and a constant \(a\in(0,1)\). The probability can be expressed as \(\mathbb E[I(U \le a)]\), where \(I\) is the indicator function. Rather than the simple average estimator using \(I(U \le a)\), we attempt to reduce the variance using a random variable of the form \[g(U) = I(U \le a) + c(U^k - \mathbb E[U^k]),\] where \(k\) and \(c\) are constants.
i) Plugging in the best \(c\) into \(g(U)\), find an expression for the variance of \(g(U)\) as a function of \(a\) and \(k\).
For brevity, write the indicator \(I(U \le a)\) simply as \(I\) from now on. The variance of \(g(U)\) is as follows: \[\begin{aligned} \text{Var}[g(U)] &= \text{Var}[I + c*U^k - c*\mathbb E[U^k]] \\ &= \text{Var}[I + c*U^k] \because \mathbb E[U^k] \text{ is constant} \\ &= \text{Var}[I] + c^2\text{Var}[U^k] + 2c*Cov[I, U^k] \\ &= \text{Var}[I] + \text{Var}[U^k]*\left(c^2 + 2*\frac{Cov[I,U^k]}{\text{Var}[U^k]}*c\right) \\ &= \text{Var}[I] + \text{Var}[U^k]*\left(c + \frac{Cov[I, U^k]}{\text{Var}[U^k]} \right)^2 - \frac{(Cov[I,U^k])^2}{\text{Var}[U^k]}. \end{aligned}\]
The above expression attains its minimum when \(c = -\frac{Cov[I,U^k]}{\text{Var}[U^k]}\), which is the best c. Plugging the best c into the above equation, \(\text{Var}[g(U)] = \text{Var}[I] - \frac{(Cov[I,U^k])^2}{\text{Var}[U^k]}\).
(1) \(\text{Var}[I]\)
\(\mathbb E[I] = \mathbb P(U \le a) = \int_0^a1\ du = a\), and \(I\) follows a Bernoulli distribution with parameter \(a\): \(I \sim \text{Bernoulli}(a)\). So, the variance of \(I\) is \(\text{Var}[I] = a(1-a)\).
(2) \(\text{Var}[U^k]\)
\[\mathbb E[U^k] = \int_0^1 u^k*1\ du = \left[\frac{1}{k+1}*u^{k+1} \right]_{u=0}^{u=1} = \frac{1}{k+1}.\]
Using this formula, \(\mathbb E[U^{2k}] = \frac{1}{2k+1}\).
Thus, \[\text{Var}[U^k] = \mathbb E[U^{2k}] - (\mathbb E[U^k])^2 = \frac{1}{2k+1} - \left(\frac{1}{k+1} \right)^2 = \frac{k^2}{(2k+1)(k+1)^2}.\]
(3) \(Cov[I, U^k]\)
\(\mathbb E[I] = a\) and \(\mathbb E[U^k] = \frac{1}{k+1}\), so \[Cov[I, U^k] = \mathbb E\left[(I-a)\left(U^k - \frac{1}{k+1} \right) \right] = \mathbb E[I*U^k] - a*\frac{1}{k+1}.\]
By the law of total expectation, \[\begin{aligned} \mathbb E[I*U^k] &= \mathbb E[I*U^k|U \le a]\mathbb P(U \le a)+ \mathbb E[I*U^k|U > a]\mathbb P(U > a) \\ &= \mathbb E[I*U^k|U \le a]\mathbb P(U \le a) \quad \because I=0 \text{ when } U > a \\ &= \mathbb E[U^k|U \le a]*a \end{aligned}\]
Let the CDF of \(U\) be \(F\) with density \(f\). Then the density \(h\) of \(U\) conditional on \(0 \le U \le a\) (the truncated distribution) is as follows: \[h(u) = \frac{f(u)}{F(a) - F(0)} = \frac{1}{a}, \qquad 0 \le u \le a.\]
Then, the conditional expectation can be denoted by \[\mathbb E[U^k|U \le a] = \int_0^au^kh(u)du = \frac{a^k}{k+1}.\]
Thus, \[\begin{aligned} Cov[I, U^k] &= \mathbb E[U^k|U \le a]*a - \frac{a}{k+1} \\ &= \frac{a^k}{k+1}*a - \frac{a}{k+1} \\ &= \frac{a}{k+1}(a^k-1). \end{aligned}\]
(4) conclusion
Thus, \[\begin{aligned} \text{Var}[g(U)] &= \text{Var}[I] - \frac{(Cov[I,U^k])^2}{\text{Var}[U^k]} \\ &= a(1-a) - \frac{(\frac{a}{k+1}(a^k-1))^2}{\frac{k^2}{(2k+1)(k+1)^2}} \\ &= a(1-a) - \frac{1}{k^2}*(2k+1)a^2(a^k-1)^2. \end{aligned}\]
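As a quick sanity check of this closed-form expression, the following sketch compares it with the empirical variance of \(g(U)\) at, say, \(a = 0.5\) and \(k = 1\) (both arbitrary choices), using the best \(c = -Cov[I,U^k]/\text{Var}[U^k]\):

```r
a <- 0.5; k <- 1
var_formula <- a*(1 - a) - (2*k + 1)/k^2 * a^2 * (a^k - 1)^2    # analytic value
c_best <- -(a/(k + 1)*(a^k - 1)) / (k^2/((2*k + 1)*(k + 1)^2))  # best c
set.seed(1)
u <- runif(10^6)
g <- (u <= a) + c_best*(u^k - 1/(k + 1))
var(g)                                   # close to var_formula = 0.0625
```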
ii) With a given \(a\), we are certainly interested in \(k\) which minimizes the variance of \(g(U)\). Unfortunately, the expression for the variance involves both an exponential term and a polynomial term in \(k\), and hence the optimal \(k\) cannot be obtained analytically. We can still find the optimal \(k\) in a numerical way. For each of \(a = 0.2, 0.5, 0.8\), use the ‘optim’ function to find the optimal \(k\). Find the variance of \(g(U)\) for each \(‘a’\) and the corresponding \(k\), and calculate the percentage decrease compared to using \(I(U \le a)\).
Before using the ‘optim’, a function named ‘fn’ is defined to use it as a parameter of ‘optim’ function. This function works as the variance of \(g(U)\) for the given value ‘a’.
fn <- function(a, par){
k <- par
return(a*(1-a) - 1/k^2*(2*k+1)*a^2*(a^k - 1)^2)
}
The ‘optim’ function requires some parameters to operate. The ‘par’ is the initial parameter to be calculated, and the ‘fn’ is the function where optimization (in this case, the minimization) occurs.
For reference, ‘optim’ suggests the “Brent” method for one-dimensional problems, which requires the minimum and maximum of the interval of k. However, according to the notice, any setting that makes the estimation possible is allowed. So the initial ‘par’ was set to 1, which returns a warning along with the optimal k and the minimum variance. Note that the warning causes no problem according to the notice.
The result of ‘optim’ when a = 0.2 is saved in ‘op.2’. The optimal k value when a = 0.2 (k.2) is 0.1466797 and the corresponding minimum variance of g(U) is 0.05367902. The value of k and the minimal variance are stored in op.2$par and op.2$value, respectively. Since the variance of \(I(U \le a)\) is \(a(1-a)\), \(\text{Var}[I(U \le 0.2)] = 0.2*0.8 = 0.16\). Thus, the percentage of variance reduction is (0.16 – 0.05367902)/0.16 = 66.45%.
op.2 <- optim(par = 1, fn = fn, a = 0.2)
## Warning in optim(par = 1, fn = fn, a = 0.2): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize()
op.2$value
## [1] 0.05367902
k.2 <- op.2$par
k.2
## [1] 0.1466797
0.2*(1-0.2)
## [1] 0.16
(0.2*(1-0.2) - op.2$value)/(0.2*(1-0.2)) # 66.45%
## [1] 0.6645061
The result of ‘optim’ when a = 0.5 is saved in ‘op.5’. The optimal k value when a = 0.5 (k.5) is 1.164062 and the corresponding minimum variance of g(U) is 0.06171835. The value of k and the minimal variance are stored in op.5$par and op.5$value, respectively. The variance of \(I(U \le 0.5)\) is \(0.5*0.5 = 0.25\), so the percentage of variance reduction is (0.25 – 0.06171835)/0.25 = 75.31%.
op.5 <- optim(par = 1, fn = fn, a = 0.5)
## Warning in optim(par = 1, fn = fn, a = 0.5): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize()
op.5$value
## [1] 0.06171835
k.5 <- op.5$par
k.5
## [1] 1.164062
0.5*(1-0.5)
## [1] 0.25
(0.5*(1-0.5) - op.5$value)/(0.5*(1-0.5)) # 75.31%
## [1] 0.7531266
The result of ‘optim’ when a = 0.8 is saved in ‘op.8’. The optimal k value when a = 0.8 (k.8) is 4.973438 and the corresponding minimum variance of g(U) is 0.03271169. The value of k and the minimal variance are stored in op.8$par and op.8$value, respectively. The variance of \(I(U \le 0.8)\) is 0.8*0.2 = 0.16, so the percentage of variance reduction is (0.16 – 0.03271169)/0.16 = 79.555%.
op.8 <- optim(par = 1, fn = fn, a = 0.8)
## Warning in optim(par = 1, fn = fn, a = 0.8): one-dimensional optimization by Nelder-Mead is unreliable:
## use "Brent" or optimize()
op.8$value
## [1] 0.03271169
k.8 <- op.8$par
k.8
## [1] 4.973438
0.8*(1-0.8)
## [1] 0.16
(0.8*(1-0.8) - op.8$value)/(0.8*(1-0.8)) # 79.555%
## [1] 0.7955519
iii) For each of \(a = 0.2, 0.5, 0.8\) with the corresponding optimal \(k\), overlay graphs of \(I(U \le a)\) and \(u^k\) as functions of \(u,\ 0 < u < 1\), in order to see the relationship between the binary indicator \(I(U \le a)\) and the control variable \(U^k\). Also, as mentioned in class, another way of doing this is comparing the graph of \(I(V \le a^k)\) with a straight line by transforming \(v = u^k\). Draw the graph of \(I(v \le a^k),\ 0 < v < 1\). Based on the two plots, explain why such a \(k\) must be obtained as the optimal value for each \(a\).
To start with, a function that works as an indicator function has to be defined. It receives a vector and a number; the number serves as the criterion for whether each element of the vector is less than or equal to it. If an element satisfies the criterion, the function returns 1 for it; otherwise 0. The following code shows the function (indicator.a) and how it works.
indicator.a <- function(us, a){
I <- rep(0,length(us))
I[us <= a] <- 1
return(I)
}
indicator.a(seq(0,1,0.1),0.2)
## [1] 1 1 1 0 0 0 0 0 0 0 0
The following code makes the plots of the indicator function (a unit step function) and \(u^k\). The plots show a = 0.2, a = 0.5, and a = 0.8 from the left. The red, green, and blue curves show \(y = u^k\), where k is the optimal value for each a.
u <- seq(0,1,0.01)
par(mfrow = c(1,3))
plot(u, indicator.a(u, 0.2), type = 'l', ylab = 'y',
main = 'plot of indicator and u^k when a = 0.2')
lines(u, u^k.2, col = 'red')
plot(u, indicator.a(u, 0.5), type = 'l', ylab = 'y',
main = 'plot of indicator and u^k when a = 0.5')
lines(u, u^k.5, col = 'green')
plot(u, indicator.a(u, 0.8), type = 'l', ylab = 'y',
main = 'plot of indicator and u^k when a = 0.8')
lines(u, u^k.8, col = 'blue')
The optimal k values for each a are as follows: k.2 = 0.147, k.5 = 1.16, and k.8 = 4.97, as computed in the previous question. The reason such k are chosen is that they reduce the variance of g(U) as much as possible. The equation below can be obtained by manipulating the form of \(\text{Var}[g(U)]\): \[\text{Var}[g(U)] = \text{Var}[I] - \frac{(Cov[I,U^k])^2}{\text{Var} [U^k]} = \text{Var}[I]\left(1 - \text{Corr}[I,U^k]^2 \right)\]
The value of \(\text{Var}[I]\) is fixed for each a, so how much the variance reduces depends on the correlation between \(I\) and \(U^k\).
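Since \(\text{Var}[g(U)] = \text{Var}[I](1-\text{Corr}[I,U^k]^2)\) at the best \(c\), the minimized variance can be recovered from the empirical correlation alone. A sketch for \(a = 0.2\), hard-coding the optimal \(k \approx 0.1467\) reported earlier (seed and sample size are arbitrary):

```r
set.seed(1)
a <- 0.2; k <- 0.1466797     # optimal k reported above
u <- runif(10^6)
I <- as.numeric(u <= a)
rho <- cor(I, u^k)           # empirical correlation between I and U^k
var(I) * (1 - rho^2)         # close to the minimized Var[g(U)], about 0.0537
```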
(1) \(I(U \le a)\) vs \(U^k\)
The absolute value of the correlation between two random variables, say X and Y, is maximal (= 1) when they have a linear relationship. To explain, \(Y = b_0 + b_1X\) gives absolute correlation 1, where \(b_0\) and \(b_1\) are constants and \(b_1 \neq 0\). Note that the correlation is negative if \(b_1\) is negative. \(U^k\) increases from 0 to 1, while the indicator \(I\) moves from 1 to 0, the opposite direction, so the sign of the correlation between the two variables is expected to be negative. In such a case, applying a minus sign to one of them makes the correlation positive while its magnitude is unchanged. So each indicator is multiplied by (-1) and shifted by 1 to show the correlation clearly in one plot for each case of \(a\). The plots below show the shifted indicator (1 - indicator) and \(u^k\) for each case of \(a\). Note that adding a constant does not affect the correlation.
par(mfrow = c(1,3))
plot(u, 1-indicator.a(u, 0.2), type = 'l', ylab = 'y',
main = '1-indicator and u^k when a = 0.2')
lines(u, u^k.2, col = 'red')
plot(u, 1-indicator.a(u, 0.5), type = 'l', ylab = 'y',
main = '1-indicator and u^k when a = 0.5')
lines(u, u^k.5, col = 'green')
plot(u, 1-indicator.a(u, 0.8), type = 'l', ylab = 'y',
main = '1-indicator and u^k when a = 0.8')
lines(u, u^k.8, col = 'blue')
The main point of analyzing these plots (indicator versus u^k) is that the graphs become similar in shape when the correlation is high with the same sign, as if one graph tries to mimic the shape of the other. In this context, the optimal k values were chosen.
The red curve in the first plot, \(y = u^{k.2}\), increases rapidly when u is between 0 and about 0.2, and increases slowly when u > 0.2. So the red curve rises rapidly early on and stays nearly flat for 0.2 < u < 1. This is quite consistent with the (1 - indicator) function, which jumps from 0 to 1 early (u = 0.2) and keeps the value 1 over a long interval (0.2 < u < 1). As k decreases from 1, the shape of \(u^k\) becomes more concave, so most values of \(u^k\) get close to 1. Thus k.2, much smaller than 1, was chosen.
The green curve in the second plot, \(y = u^{k.5}\), is close to the straight line y = u. A curve that is flat for 0 < u < 0.5, rises rapidly around u = 0.5, and is flat again for 0.5 < u < 1 would have extremely high correlation with (1 - indicator). However, a single power function \(u^k\) can only be linear, concave, or convex; it cannot be concave and convex simultaneously, which such an ideal curve would require. Under this constraint, a curve close to a straight line is the best case for the highest correlation with the (1 - indicator) function. That is why k.5 is very close to 1.
The blue curve in the third plot, \(y = u^{k.8}\), increases slowly for 0 < u < 0.8 and rapidly for u between 0.8 and 1. So the blue curve stays nearly flat for 0 < u < 0.8 and rises rapidly near 1. This is quite consistent with the (1 - indicator) function, which keeps the value 0 over a long interval (0 < u < 0.8) and jumps from 0 to 1 at the rear part (u = 0.8). As k increases from 1, the shape of \(u^k\) becomes more convex, so most values of \(u^k\) get close to 0. Thus k.8, much bigger than 1, was chosen.
Since such k.2, k.5, and k.8 make the correlation between each (1 - indicator) and \(u^k\) highly positive, \(I(U \le a)\) and \(U^k\) have a highly negative correlation, which leads to the minimum variance of g(U).
(2) \(I(V \le a^k)\) vs \(V\)
v <- seq(0,1,0.01)
par(mfrow = c(1,3))
plot(v, indicator.a(v, 0.2^k.2), type = 'l', xlab = 'v', ylab = 'y',
main = 'plot of indicator w.r.t v when a = 0.2')
lines(v, v, col = 'red')
plot(v, indicator.a(v, 0.5^k.5), type = 'l', xlab = 'v', ylab = 'y',
main = 'plot of indicator w.r.t v when a = 0.5')
lines(v, v, col = 'green')
plot(v, indicator.a(v, 0.8^k.8), type = 'l', xlab = 'v', ylab = 'y',
main = 'plot of indicator w.r.t v when a = 0.8')
lines(v, v, col = 'blue')
The plots above are the graphs of the indicator function \(I(V \le a^k)\) for each a (black) and \(y = v\) (red, green, and blue for a = 0.2, 0.5, 0.8). After the transformation \(v = u^k\), comparing the similarity of the shapes of the graphs would be meaningless, so these plots are interpreted from a different viewpoint.
To compare more intuitively, plots where the indicator is multiplied by (-1) and shifted by 1, i.e., \(1 - I(V \le a^k)\) together with \(y = v\), are also given below. Also, for the cases a = 0.2 and a = 0.8, the thickness of part of each graph is different: after \(u^k\) is transformed to v, the values of v are no longer uniformly spread, so treating all parts as equally weighted is not valid. The different thickness (how much v is compacted) is drawn to emphasize this.
par(mfrow = c(1,3))
v2 <- seq(0.2^k.2,1,0.01)
ones.2 <- rep(1,length(v2))
plot(v, 1-indicator.a(v, 0.2^k.2), type = 'l', xlab = 'v', ylab = 'y',
main = '1-indicator and v when a = 0.2')
lines(v, v, col = 'red')
lines(v2,v2, col = 'red', lwd = 4)
lines(v2,ones.2, col = 'black', lwd = 4)
plot(v, 1-indicator.a(v, 0.5^k.5), type = 'l', xlab = 'v', ylab = 'y',
main = '1-indicator and v when a = 0.5')
lines(v, v, col = 'green', lwd = 2)
v8 <- seq(0,0.8^k.8, 0.01)
ones.8 <- rep(0, length(v8))
plot(v, 1-indicator.a(v, 0.8^k.8), type = 'l', xlab = 'v', ylab = 'y',
main = '1-indicator and v when a = 0.8')
lines(v, v, col = 'blue')
lines(v8,v8, col = 'blue', lwd = 4)
lines(v8,ones.8, col = 'black', lwd = 4)
stu_id <- 2011142127
set.seed(stu_id)
us <- runif(10^6)
par(mfrow = c(1,3))
hist(us^k.2, freq = FALSE, main = 'histogram of u^k when a = 0.2')
hist(us^k.5, freq = FALSE, main = 'histogram of u^k when a = 0.5')
hist(us^k.8, freq = FALSE, main = 'histogram of u^k when a = 0.8')
When a = 0.2, most of the values of \(u^k\) are close to 1 (bigger than 0.8) because the value of k is close to 0. When a = 0.8, most of the values of \(u^k\) are close to 0 (less than 0.1) because k is large (close to 5). The k when a = 0.5 is close to 1, so the change from ‘u’ to ‘v’ does not have a significant effect, and the values are quite evenly distributed (although slightly skewed). The histograms of \(u^k\) for \(10^6\) numbers generated from \(\text{Unif}(0,1)\) are given above for an intuitive comparison.
When a = 0.2, most of the (1 - indicator) values are 1 because a large share (apparently more than 80%) of the \(u^k\) values are bigger than 0.8. Many values of ‘v’ close to 1 are concentrated in the right part of the plot, and many of the 1’s returned by (1 - indicator) are concentrated in the same region. Since large amounts of similar values are gathered in one part, this results in a high positive correlation. If k were set to another value such as k.8 while keeping the same (1 - indicator) function, the amount of \(u^k\) close to 1 would decrease dramatically, and the correlation with (1 - indicator) would decrease as well.
A similar interpretation is possible for a = 0.8. According to the histogram, an extremely large share of the \(u^k\) values are smaller than 0.1, and a large number of 0’s are returned by the corresponding (1 - indicator). Since many 0’s and values of \(u^k\) close to 0 are concentrated in the left part, this also yields a high positive correlation. If k decreased to k.2, most of the \(u^k\) values would be close to 1, making the correlation with (1 - indicator) small.
Also, k.5 is very close to 1, so the transformed values u^k.5 are evenly distributed compared to the other cases; the situation is similar to comparing \(I(U \le 0.5)\) with \(U^{k.5}\).
Since such k makes the correlation between (1 - indicator) and v highly positive, it also makes the correlation between \(I(V \le a^k)\) and v highly negative, which leads to the minimum variance of g(U).
(3) conclusion
The above paragraphs have shown that there is a high positive correlation both between \(1 - I(u \le a)\) and \(u^k\) and between \(1 - I(v \le a^k)\) and \(v\). As previously mentioned, multiplying by -1 and adding the constant 1 does not affect the magnitude of the correlation, only its sign. Thus, k.2, k.5, and k.8 were chosen to make the correlation between the indicator and u (or v) highly negative and the variance of g(U) as small as possible.
iv) Let \(U_i, \ i=1,\dots,n\) be an i.i.d random sample of \(U\). Sampling \(n = 10^6\) uniform random numbers, estimate the variances of the estimators \(n^{-1}\sum_{i=1}^ng(U_i)\) for \(a = 0.2, 0.5, 0.8\). Calculating the estimated variance of \(n^{-1}\sum_{i=1}^nI(U_i \le a)\), estimate the percentage decrease for \(a = 0.2, 0.5, 0.8\). Compare the estimated percentage decreases with the theoretical ones obtained above.
Every built-in basic function in R is available in this homework, so functions such as ‘var’ and ‘cov’ will be used.
The following code defines a function for estimating the variance of \(n^{-1}\sum_{i=1}^ng(U_i)\), the sample mean of \(g(U)\); its inputs are a vector of uniform random numbers, the value ‘a’, and the corresponding ‘k’. The optimal value of ‘c’ is obtained by estimating the covariance of \(I(U \le a)\) and \(U^k\) and the variance of \(U^k\). Using the estimated ‘c’, the value of \(g(U_i)\) is obtained for each \(U_i,\ i=1,\dots,10^6\); these values are saved in the ‘gs.ui’ vector, whose length is the same as the input vector (us). Since the estimated variance of a sample mean (corresponding to \(n^{-1}\sum_{i=1}^ng(U_i)\)) is the sample variance divided by the number of elements (\(Var[\bar X] = S^2/n, \ S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i - \bar X)^2\)), the function returns the (sample) variance of ‘gs.ui’ divided by n (= the length of ‘us’).
set.seed(stu_id)
n <- 10^6
us <- runif(n)
est_var_g <- function(us, a, k){
  n <- length(us)
  c <- -cov(indicator.a(us, a), us^k)/var(us^k)   # estimated optimal c
  gs.ui <- indicator.a(us, a) + c*(us^k - 1/(k+1)) # g(U_i), i = 1, ..., n
  S.2 <- var(gs.ui)                                # sample variance of g(U_i)
  return(S.2/n)                                    # estimated variance of the sample mean
}
The variance of \(n^{-1}\sum_{i=1}^nI(U_i \le a)\) can be obtained by applying the ‘var’ function to the indicator values (indicator.a) and dividing by the number of elements.
When a = 0.2, the estimated variance of \(n^{-1}\sum_{i=1}^ng(U_i)\) is about \(5.35\times10^{-8}\) and the estimated variance of \(n^{-1}\sum_{i=1}^nI(U_i \le a)\) is about \(1.59\times10^{-7}\). Then, the percentage of variance reduction is \((1.59\times10^{-7}- 5.35\times10^{-8})/(1.59\times10^{-7}) \approx 66.41\%\) .
# for the case a = 0.2
var_g_2 <- est_var_g(us, 0.2, k.2) # 5.3539*10^(-8)
var_i_2 <- var(indicator.a(us, 0.2))/n # 1.594*10^(-7)
(var_i_2 - var_g_2)/var_i_2 # 66.41155 %
## [1] 0.6641155
The theoretical percentage of variance reduction in ii) was about 66.45%, so the variance reduction achieved by \(n^{-1}\sum_{i=1}^ng(U_i)\) is slightly lower than the ideal one.
When a = 0.5, the estimated variance of \(n^{-1}\sum_{i=1}^ng(U_i)\) is about \(6.18\times10^{-8}\) and the estimated variance of \(n^{-1}\sum_{i=1}^nI(U_i \le a)\) is about \(2.50\times10^{-7}\). Then, the percentage of variance reduction is \((2.50\times10^{-7} - 6.18\times10^{-8})/(2.50\times10^{-7}) \approx 75.28\%\) .
# for the case a = 0.5
var_g_5 <- est_var_g(us, 0.5, k.5) # 6.18*10^(-8)
var_i_5 <- var(indicator.a(us, 0.5))/n # 2.500*10^(-7)
(var_i_5 - var_g_5)/var_i_5 # 75.27761 %
## [1] 0.7527761
The theoretical percentage of variance reduction in ii) was about 75.31%, so the variance reduction achieved by \(n^{-1}\sum_{i=1}^ng(U_i)\) is again slightly lower than the ideal one.
When a = 0.8, the estimated variance of \(n^{-1}\sum_{i=1}^ng(U_i)\) is about \(3.27\times10^{-8}\) and the estimated variance of \(n^{-1}\sum_{i=1}^nI(U_i \le a)\) is about \(1.60\times10^{-7}\). Then, the percentage of variance reduction is \((1.60\times10^{-7} - 3.27\times10^{-8})/(1.60\times10^{-7}) \approx 79.546\%\).
# for the case a = 0.8
var_g_8 <- est_var_g(us, 0.8, k.8) # 3.27*10^(-8)
var_i_8 <- var(indicator.a(us, 0.8))/n # 1.59887*10^(-7)
(var_i_8 - var_g_8)/var_i_8 # 79.54615 %
## [1] 0.7954615
The theoretical percentage of variance reduction in ii) was about 79.555%, so the variance reduction achieved by \(n^{-1}\sum_{i=1}^ng(U_i)\) is again slightly lower than the ideal one.
The reason the estimated variance reduction is slightly lower than the ideal case is that it uses the estimated optimal value of c, which differs from the ideal ‘c’. Note that the variance of the control-variable estimator is \[\text{Var}[I] + \text{Var}[U^k]\left(c + \frac{\text{Cov}[I, U^k]}{\text{Var}[U^k]} \right)^2 - \frac{(\text{Cov}[I,U^k])^2}{\text{Var}[U^k]},\] a quadratic function of ‘c’. So a value of c slightly different from the optimal \(c = -\frac{\text{Cov}[I, U^k]}{\text{Var}[U^k]}\) makes the variance slightly larger than the minimum variance.
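This quadratic behaviour in c can be illustrated numerically: the sample variance of g(U) is minimized exactly at the sample estimate \(c = -\text{Cov}/\text{Var}\), and moving c away from that value increases it. A self-contained sketch (a = 0.2 and k = 0.147 as before; the 10% perturbations of c are arbitrary choices):

```r
# the sample variance of g(U) = I + c*(U^k - E[U^k]) is quadratic in c
# and is minimized exactly at the sample estimate c = -Cov/Var
set.seed(1)
u <- runif(10^5)
a <- 0.2; k <- 0.147
ind <- as.numeric(u <= a)
c.opt <- -cov(ind, u^k)/var(u^k)              # estimated optimal c
var.g <- function(cc) var(ind + cc*(u^k - 1/(k + 1)))
v.opt  <- var.g(c.opt)                        # variance at the optimum
v.off1 <- var.g(c.opt*1.1)                    # 10% away from the optimum
v.off2 <- var.g(c.opt*0.9)
# v.opt is the smallest of the three, as the quadratic form predicts
```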
Q4 The above problem may not have a practical meaning, as it is very obvious that \(\mathbb P(U \le a) = a\), and we must know a to find the optimal \(k\). Still, this is a good toy example to understand how the control variable method works. Now we consider a more practical situation: a problem of estimating \(\mathbb P(h(U) \le a)\) for some complicated function h which is not invertible. In this case, finding the optimal k may be impossible. Suppose \[h(u) = \sin(20\pi u) + 20\sin\left(\frac{\pi(u-1)}{2} \right), \quad 0 < u < 1.\] Clearly, h is not invertible.
i) We want to estimate \(\mathbb P[h(U) \le -19.1]\). A control variable estimator can be constructed by \[g(U) = I(h(U) \le -19.1) + c\left(U^k - \mathbb E[U^k]\right).\] Now, an expression for the variance of \(g(U)\) is not available, and hence the optimal k is not easily obtained. However, we can still utilize the informal graphical approach in Q3 iii) (both types of graphs or only one). Using a graphical analysis through trial and error, find a reasonable \(k\).
To analyze it with the graphical approach, the indicator function \(I(h(U) \le -19.1)\) should be defined. The input is a vector, and ‘h’ is defined to represent the function h(u). At the indices where the value of ‘h’ is less than or equal to -19.1, the ‘I’ vector, initially filled with 0’s, is set to ‘1’. The output is the resulting ‘I’ vector. The plot below, produced with indicator.h, resembles the unit step function.
indicator.h <- function(us){
  I <- rep(0, length(us))                       # indicator values, initially 0
  h <- sin(20*pi*us) + 20*sin(pi*(us-1)/2)      # h(u)
  I[h <= -19.1] <- 1
  return(I)
}
par(mfrow = c(1,1))
uh <- seq(0,1,1/100000)
plot(uh, indicator.h(uh), type = 'l', ylab = 'y', xlab = 'u',
main = 'plot of indicator I(h(u) <= -19.1)',cex.main = 1)
Except for some points, most u in the interval \(0 < u < 0.2\) make the indicator equal to ‘1’. Analyzing the plot, the total length of the set of u where the indicator is ‘1’ appears to be about 0.2. Then, this plot may be regarded as similar to the step-function graph in Q3 iii) when a = 0.2.
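The visual estimate of ‘about 0.2’ can be cross-checked by the fraction of points on a fine grid where h(u) ≤ -19.1 (a sanity check only; the grid size is an arbitrary choice):

```r
# fraction of a fine grid on (0, 1) where h(u) <= -19.1
h <- function(u) sin(20*pi*u) + 20*sin(pi*(u - 1)/2)
uh <- seq(0, 1, length.out = 10^5 + 1)
mean(h(uh) <= -19.1)    # roughly 0.16
```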
The main point of this question is to find a proper ‘k’ that makes the variance of g(U) considerably small. As previously mentioned, the optimal k makes the absolute value of the correlation between the indicator function \(I(h(U) \le -19.1)\) and \(U^k\) as large as possible. So, the following process searches for the best ‘k’ starting from k.2 = 0.147, the value of k when a = 0.2 in questions i) through iv). As k decreases (0.147 -> 0.14 -> 0.12), the correlation decreases as well. Increasing k instead, the correlation reaches its maximum at k = 0.16 and then decreases again. The maximum correlation is estimated as 0.6786662.
abs(cor(indicator.h(us), us^k.2)) # k.2 ~ 0.147
## [1] 0.678647
abs(cor(indicator.h(us), us^(0.14)))
## [1] 0.6785976
abs(cor(indicator.h(us), us^(0.12)))
## [1] 0.6782822
abs(cor(indicator.h(us), us^(0.15)))
## [1] 0.6786616
abs(cor(indicator.h(us), us^(0.16))) # chosen
## [1] 0.6786662
abs(cor(indicator.h(us), us^(0.17)))
## [1] 0.6786137
abs(cor(indicator.h(us), us^(0.18)))
## [1] 0.678506
Thus, k = 0.16 will be used as the reasonable value of k from now on (k.h).
k.h <- 0.16
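The trial-and-error scan above can also be written as a small grid search over k, which makes the peak easier to see. A self-contained sketch (the grid range 0.10–0.25, the sample size, and the seed are arbitrary choices, so the selected k may differ slightly from sample to sample):

```r
# grid search for the k maximizing |cor(I(h(U) <= -19.1), U^k)|
set.seed(1)
uu <- runif(10^5)
ind.h <- as.numeric(sin(20*pi*uu) + 20*sin(pi*(uu - 1)/2) <= -19.1)
ks <- seq(0.10, 0.25, by = 0.01)
cors <- sapply(ks, function(k) abs(cor(ind.h, uu^k)))
ks[which.max(cors)]     # a reasonable k for this sample
```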
ii) Sampling \(n = 10^6\) uniform random numbers, \(U_i \sim \text{Unif}(0,1)\) (i.i.d.), \(i = 1,\dots, n\), estimate \(Cov[I(h(U) \le -19.1), U^k]\) with your \(k\) obtained in i) to find a suitable value of \(c\), and then estimate \(\mathbb P(h(U) \le -19.1)\) using both the control variable estimator and the naive average estimator. Give your estimates and the estimated variances of the two estimators. Find the estimated percentage of variance reduction.
(1) \(Cov[I(h(U) \le -19.1), U^k]\)
set.seed(stu_id)
us <- runif(n)
Generating 10^6 random numbers from Unif(0, 1) (denoted by ‘us’ in R code), the covariance can be easily obtained as follows. The estimated covariance is about -0.03.
# 1: cov
cov(indicator.h(us), us^k.h) # -0.03
## [1] -0.03002794
(2) c
Then, the optimal value of c can be estimated by \(-Cov[I(h(U) \le -19.1), U^k]/Var[U^k]\), so the proper value of c is estimated as 2.0865.
# 2 : c.h
c.h <- -cov(indicator.h(us), us^k.h)/var((us)^k.h) # 2.0865
c.h
## [1] 2.086522
(3) estimation of \(\mathbb P(h(U) \le -19.1)\) and variance using \(g(U)\)
The function ‘ctrl_v_h’ below returns a vector of controlled estimator values \(I(h(U_i) \le -19.1) + c.h\left(U_i^{k.h} -\frac{1}{1+k.h} \right)\). The ‘I’ in the ‘ctrl_v_h’ function works as the indicator function \(I(h(U) \le -19.1)\). Then, the probability \(\mathbb P(h(U) \le -19.1)\) can be estimated by averaging the outputs of ‘ctrl_v_h’. For this random sample, the average is about 0.16276. The variance of \(g(U)\) can be estimated by applying the ‘var’ function to the obtained outputs, which gives about 0.073.
ctrl_v_h <- function(us, k, c){
  I <- rep(0, length(us))                                  # indicator I(h(u) <= -19.1)
  I[sin(20*pi*us) + 20*sin(pi*(us-1)/2) <= -19.1] <- 1
  Z <- I + c*(us^k - 1/(k+1))                              # controlled estimator g(u)
  return(Z)
}
gs <- ctrl_v_h(us, k.h, c.h)
sum(gs)/n # 0.16276
## [1] 0.1627605
var(gs) # 0.073
## [1] 0.07337647
(4) estimation of \(\mathbb P(h(U) \le -19.1)\) and variance using \(I(h(U) \le -19.1)\)
The probability \(\mathbb P(h(U) \le -19.1)\) can be obtained by averaging the outputs of the naive estimator \(I(h(U) \le -19.1)\), which gives about 0.1624. The variance of this indicator is estimated as about 0.136.
mean(indicator.h(us)) # 0.1624
## [1] 0.162406
var(indicator.h(us)) # 0.136
## [1] 0.1360304
(5) estimated percentage of variance reduction
The reduction rate is the difference between the naive variance and the variance of g(U), divided by the naive variance. So, the reduction rate is estimated as around 46.06%.
(var(indicator.h(us)) - var(gs))/var(indicator.h(us)) # 46.06%
## [1] 0.4605878