Statistical Computing HW5
Q1. Consider using bootstrap to approximate the sampling distribution of the sample median of a sample from the exponential distribution with rate 1, i.e., \(F = \text{Exp}(1)\). For an odd number n of observations, we define the sample median as the \(\frac{n+1}{2}\)th order statistic. Generate \(n = 5\) observations from \(F = \text{Exp}(1)\).
i) Using the five generated observations, obtain a bootstrap sample of size \(10^6\), and draw a histogram of the bootstrap-based sampling distribution of the sample median.
A simple function (‘hw5_1’) that receives a vector (‘obs’) and an integer (‘n’) as inputs and returns a vector (‘med’) of length \(10^6\) is required. For every iteration, this function extracts ‘n’ elements from the input ‘obs’ with replacement, select the median among them, and save it to the ith element of ‘med’ vector. After the \(10^6\) iteration is finished, this function outputs the ‘med’ vector.
stu_id <- 2011142127
set.seed(stu_id)
hw5_1 <- function(obs, n){
med <- c()
for(i in 1:10^6){
temp <- sample(obs, n, replace = TRUE)
med[i] <- median(temp)
}
return(med)
}
This question requires 5 observations for bootstrap sample from exponential distribution with rate parameter 1, so ‘rexp’ function is used to make 5 random numbers (‘observ.5’) from Exp(1). The values of ‘observ.5’ are as follows.
n <- 5
observ.5 <- rexp(n, 1)
observ.5
## [1] 1.57180296 1.12157825 0.03541378 0.22081847 0.25550808
Inserting this vector ‘observ.5’ with the integer ‘n’ to the ‘hw5_1’ function, the function will return a vector of length 106. Each element of this vector (‘samp_med.5’) is the sample median from the 5 extracted elements of ‘observ.5’ with replacement, which is bootstrap sample.
Before generating the histogram, some manipulation is needed in the ‘breaks’ parameter. The maximum values of the following continuous graphs (ii, iii, and iv) are about 1. If there is no restriction in the interval of x-axis, the resulting histogram will have extremely high value as its maximum, making the continuous graphs almost invisible or extremely close to x-axis. As a result, the interval of x-axis is adjusted so as to make the maximum of histogram becomes as low as possible. Also, it is desirable to maintain the relativity between the histogram values. For example, the adjusted interval should still make the histogram value at 0.255 the biggest value.
samp_med.5 <- hw5_1(observ.5, n)
tb <- table(samp_med.5)
names(tb) <- c('0.03541','0.2208', '0.2555', '1.1216', '1.5718')
tb
## 0.03541 0.2208 0.2555 1.1216 1.5718
## 57799 259855 364656 259798 57892
Based on the given seed number, the generated 5 numbers are 0.0354, 0.2208, 0.2555, 1.1216, and 1.5718 in sequence. This implies the histogram will only accumulate the 5 values discretely. Instead of assigning identical intervals (such as 0.01), different intervals are given according to the amount of the numbers which are shown in the ‘table(samp_med.5)’. The second and third number, 0.2208 and 0.255, has the second and the first largest amount respectively, but their distance is very close to each other. These numbers are also close to 0, so it is difficult to assign large intervals which would lead to connected histogram. In contrast, if narrow intervals were given, this would result in extremely high density. So, the proper intervals were assigned in ‘brk’ variable. After this manipulation, the maximum density value becomes approximate to 2.
brk <- c(0,0.04,0.08,0.23,0.25,0.43,1.05,1.2,1.56,1.6,2)
hist(samp_med.5, breaks = brk, freq = FALSE, ylim = c(0.0, 2.0),
main = 'Histogram among 5 samples')
ii) Find the density of the exact sampling distribution of the sample median, and overlay the density on the histogram obtained in i)
It is known that the kth order statistics from \(F\) among ‘n’ number of samples from has the following pdf \(g_k\).
\[g_k(x) = \frac{n!}{(k-1)!(n-k)!}\{F(x)\}^{k-1}\{1-F(x)\}^{n-k}f(x).\]
When ‘n’ is odd, the index of sample median is \(k = \frac{n+1}{2}\). Then, the pdf \(g_k\) of the sample median is as follows.
\[g_k(x) = \frac{n!}{\Bigl((\frac{n-1}{2})!\Bigr)^2}\{F(x)\}^{\frac{n-1}{2}}\{1-F(x)\}^{\frac{n-1}{2}}f(x).\]
The distribution \(F\) follows exponential with rate 1, so \(F(x) = 1 - e^{-x}\) and \(f(x) = e^{-x}\). Using this expression and n = 5,
\[g_3(x) = \frac{5!}{(2!)^2}\{1 - e^{-x}\}^{2}\{e^{-x}\}^2e^{-x}.\] This pdf is written in R as follows, and the graph is in the last part of iv) as previously mentioned.
x <- seq(0,2,0.01)
G <- factorial(n)/(factorial((n-1)/2))^2 # 30
g.3 <- G*(1-exp(-x))^(2)*(exp(-x))^(3)
iii) It is well known that, for a continuous distribution \(F\) with density \(f\), the sample median \(\bar X\)with a sample of size \(n\) has the asymptotic distribution,\[\sqrt{n}(\bar X - m) \stackrel{D}{\to} N \Bigl(0, \frac{1}{4\{f(x)\}^2} \Bigr),\]where ‘m’ is the population median of \(F, F(m) = 0.5\). Based on this result, we can construct an approximate distribution of the sample median \(N \bigl(0, \frac{1}{4\{f(x)\}^2} \bigr)\). Overlay the density on the plot obtained in i) and ii).
The asymptotic distribution form appears similar to general form of CLT. So this distribution is called ‘CLT’ simply from now on for convenience. although it is sort of different of real CLT. Also, define the sample median among ‘n’ samples as \(\tilde X_n\).
The population median ‘m’ is calculated by \(F(m) = 0.5\), which is \(1-e^{-m} = 0.5\). The value of \(m\) is \(\log(2)\), so \(f(m) = e^{-\log2} = 1/2\). Then \(4n\{f(m)\}^2 = 4n(1/2)^2\). So, the distribution of sample median is asymptotically as follows:
\[\sqrt{n}(\tilde X_n - \log2) \stackrel{D}{\to} N \Bigl(0, 1\Bigr).\]
That is, \(\tilde X_n\) is a normal distribution with mean \(\log2\) and variance \(1/n\), so the pdf of \(\tilde X_n\), denoted by \(\tilde h_n\) is, \[\tilde h_n = \frac{1}{\sqrt{2\pi\frac{1}{n}}}\exp\left\{-\frac{(x-\log2)^2}{2*\frac{1}{n}}\right\}.\] This pdf is written in R as follows, and this green graph is in the last part of the question iv).
h.5 <- sqrt(n/(2*pi))*exp(-n/2*(x-log(2))^2)
iv) Supposing that \(F\) is unknown
to us, we can instead use \(N\Bigl(\tilde x,
1/(4n*\{f(\tilde x)\}^2) \Bigr)\) where \(\tilde x\) is the sample median of the five
generated observations and \(\hat f\)
is a kernel density estimator of \(f\).
Using R, \(\hat f(\tilde x)\) can be
obtained as:
m=median(x) # x is a vector of the observed
values
density(x, from=m, to=m, n=1)$y
Overlay the
density on the plots obtained in i){iii). Compare the four
distributions.
Using the given condition, \(\hat f(\tilde x)\) can be calculated with ‘density’ function in R and the median of observed 5 values (‘m.5’ = 0.2555). The observe median \(\tilde x\) corresponds to ‘m.5’. Applying this method, the ‘density’ can be obtained (‘d.5’), and the value of y in this ‘density’ is 0.55379 (‘d.5$y’). The pdf \(\hat k_n\)of sample median \(\tilde X_n\)can be estimated as follows. \[\begin{aligned} \hat k_n &= \frac{1}{\sqrt{2\pi*\frac{1}{4n\{\hat f(\tilde x)\}^2}}}\exp\left\{-\frac{(x-\tilde x)^2}{2*\frac{1}{4n\{\hat f(\tilde x)\}^2}}\right\} \\ &= \sqrt{\frac{4n}{2\pi}}\hat f(\tilde x)\exp\left\{-2n\{\hat f(\tilde x)\}^2(x-\tilde x)^2\right\}. \end{aligned}\] The following code shows how to obtain \(\hat k_5\) using \(n=5, \tilde x = m.5\), and \(\hat f(\tilde x) = d.5\$y\).
m.5 <- median(observ.5)
m.5
## [1] 0.2555081
d.5 <- density(observ.5, from=m.5, to=m.5, n=1)
d.5$y
## [1] 0.5537908
k.5 <- sqrt(4*n/(2*pi))*(d.5$y)*exp(-2*n*(d.5$y)^2*(x-m.5)^2)
The histgram overlaid by each pdf obtained from ii),iii), and iv) is as follows.
hist(samp_med.5, breaks = brk, freq = FALSE, ylim = c(0.0, 2.0),
main = 'Histogram among 5 samples')
lines(x, g.3, col = 'red')
lines(x, h.5, col = 'blue')
lines(x, k.5, col = 'green')
legend('topright', legend = c('exact', 'CLT','kernel'),
col = c('red','blue','green'), pch = 16, cex = 0.9)
To sum up, the histogram overlaid by the 3 colored graphs are in the above. The exact distribution of sample median is red-colored, the asymptotic distribution of sample median (similar to CLT) is blue-colored, and the distribution with kernel density is green-colored. Each graph is denoted by ‘exact’, ‘CLT’, and ‘kernel’ in the legend.
Based on the 5 observed values, the most inconsistent graph is the blue CLT graph. The CLT assumes that the mean, which has the biggest frequency among its support, is log2 = 0.693. However, there is no histogram around 0.693; rather, the most frequent interval of histogram is around 0.2 and 0.3 and the second most frequent one is about 1.12. Since the blue graph has the highest value at the point where there is absolutely no histogram, the CLT does not explain the observed values and their bootstrap samples.
The red graph, the exact pdf of the sample median, has the largest value at around 0.5, so it appears to explain the bootstrap samples better than the blue graph. However, the green graph, the pdf using the kernel density estimator, is the best model, because the point where the pdf has the biggest value is consistent with the point where the histogram has the highest value. It is also expressed in the formula of the pdf with kernel density, which is the normal distribution and whose mean is the median of observed values, in this case the 0.2555.
Q2. Repeat i) - iv) of Q1 with \(n = 101\) observations drawn from \(\text{Exp}(1)\). Discuss the results.
1) histogram
Most of the logics and reasons are identical to those in i), including the hw5_1 function. The main difference is the number of observations, from 5 to 101. The 101 observations are saved in ‘observ.101’ variable using ‘rexp’ function, and bootstrap sample of size 106 (samp_med.101) is obtained through the function ‘hw5_1’.
The following code shows how to generate the histogram of ‘samp_med.101’. As presented at the end of the question v), the bootstrap sampled medians are quite centered around 0.6, compared to the previous question. Also, the difference between the maximum density value and the following continuous graphs are not significant without any manipulation of x-axis interval. So there is no specific change in the number of intervals. The maximum and minimum of x, y axis is determined so as to show the graphs and histogram clearly.
The histogram is presented at the end of this question v) with the overlaid continuous graphs. including this question, the number of observations becomes large, so presenting the observed values and the manipulation of x-axis interval will be omitted.
set.seed(stu_id)
# 5-1
n <- 101
observ.101 <- rexp(n, 1)
samp_med.101 <- hw5_1(observ.101, n)
hist(samp_med.101, freq = FALSE, xlim =c(0.0,1.0), ylim = c(0, 5.5),
main = 'Histogram among 101 samples')
2) exact distribution of sample median
The formula of \(g_k(x)\) the exact distribution, is almost same, except for the fact that \(n = 101\) and the corresponding k is (101+1)/2 = 51. So, \(g_{51}(x)\) is as follows. \[g_{51}(x) = \frac{101!}{(50!)^2}(1-e^{-x})^{50}(e^{-x})^{51}.\]
The following code shows how to obtain the pdf of the exact distribution \(g_{51}(x)\) and draw it to the histogram, which is shown in the last part of the question.
x <- seq(0,2,0.01)
G <- factorial(n)/(factorial((n-1)/2))^2
G
## [1] 1.019003e+31
g.51 <- G*(1-exp(-x))^(50)*(exp(-x))^(51)
3) CLT
The pdf of asymptotic distribution is almost same, except for \(n = 101\). The median of population is still \(\log2\). Then, given \(n = 101\), the pdf \(\tilde h_n(x)\) is as follows. \[\tilde h_{101} = \sqrt{\frac{101}{2\pi}}\exp\left\{-\frac{101}{2}(x - \log2)^2 \right\}.\] The following code shows how to obtain this pdf (h.101) and add it to the histogram. This continuous graph is in the last part of this question vi) as well.
h.101 <- sqrt(n/(2*pi))*exp(-n/2*(x-log(2))^2)
4) kernel density
The median of observed values is 0.5924 (m.101). Inserting this median and all of the observed values (observ.101) to the ‘density’ function, the kernel density estimator can be obtained (d.101$y) by 0.53718. Since the pdf is \[\hat k_n(x) = \sqrt{\frac{4n}{2\pi}}\hat f(\tilde x)\exp\left\{-2n\{\hat f(\tilde x)\}^2(x-\tilde x)^2\right\},\] \(\hat k_{101}(x)\) can be obtained as follows (k.101).
m.101 <- median(observ.101)
m.101
## [1] 0.5924458
d.101 <- density(observ.101, from=m.101, to=m.101, n=1)
d.101$y
## [1] 0.5371822
k.101 <- sqrt(4*n/(2*pi))*(d.101$y)*exp(-2*n*(d.101$y)^2*(x-m.101)^2)
5) histogram and pdfs
The histgram overlaid by each pdf obtained from 2),3), and 4) is as follows.
hist(samp_med.101, freq = FALSE, xlim =c(0.0,1.0), ylim = c(0, 5.5),
main = 'Histogram among 101 samples')
lines(x, g.51, col = 'red')
lines(x, h.101, col = 'blue')
lines(x, k.101, col = 'green')
legend('topleft', legend = c('exact', 'CLT','kernel'),
col = c('red','blue','green'), pch = 16, cex = 0.9)
In the above are the histogram and the 3 continuous graphs. After the number of observations becomes 101, the pdf of exact sample median and CLT become greatly similar. Since the median of observed values is 0.592 and the area of the highest frequency of histogram is around 0.6, the CLT and exact pdf of sample median don’t explain the bootstrap sample completely yet. Since the pdf based on kernel density has normal distribution with 0.592 as its mean, this pdf explains the histogram quite well compared to other red and blue curves.
Q3. Repeat i) - iv) of Q1 with \(n = 501\) observations drawn from \(\text{Exp}(1)\). Discuss the results.
1) histogram
The only difference with the previous ones is the number of observations, n = 501. In this case, presenting the observed values will be omitted too. The 501 observations are saved in ‘observ.501’ variable using ‘rexp’ function, and bootstrap sample of size 106 (samp_med.501) is obtained through the function ‘hw5_1’.
The following code shows how to generate the histogram of ‘samp_med.501’. This histogram is also centered and symmetric, but the centered value here is about 0.7. There is no specific change in the number of intervals for the same reason with v). The maximum and minimum of x, y axis is determined so as to show the graphs and histogram clearly.
The histogram is presented at the end of this question vi) with the overlaid continuous graphs.
set.seed(stu_id)
n <- 501
observ.501 <- rexp(n, 1)
samp_med.501 <- hw5_1(observ.501, n)
hist(samp_med.501, freq = FALSE, xlim =c(0.4,1.0), ylim = c(0, 12),
main = 'Histogram among 501 samples')
2) exact distribution of sample median
The formula of \(g_k(x)\)the exact distribution, is almost same, except for the fact that \(n = 501\) and the corresponding k is (501+1)/2 = 251. So, \(g_{251}(x)\) is as follows. \[g_{251}(x) = \frac{501!}{(250!)^2}(1-e^{-x})^{50}(e^{-x})^{51}.\]
Note that typing this formula directly to the R code will return NaN, since the R perceives the large factorial (ex. 250!) as infinite. So, an approximation is needed with the ‘lfactorial’ function, which returns the log value of the factorial of given integer. After obtaining the value \(\log \Bigl(\frac{501!}{(250!)^2} \Bigr) = \log(501!) - 2*\log(250!)\) and taking exponential to the value, the result will be approximate value of \(\frac{501!}{(250!)^2}\) The rest of the expression is almost similar to the previous one.
The following code shows how to obtain the pdf of the exact distribution \(g_{251}(x)\) and draw it to the histogram, which is shown in the last part of the question.
x <- seq(0,2,0.01)
G <- exp(lfactorial(501)-2*(lfactorial(250)))
g.251 <- G*(1-exp(-x))^(250)*(exp(-x))^(251)
3) CLT
The pdf of asymptotic distribution is almost same, except for \(n = 501\). The median of population is still \(\log2\). Then, given \(n = 501\), the pdf \(\tilde h_n\) is as follows: \[\tilde h_{501}(x) = \sqrt{\frac{501}{2\pi}}\exp\left\{-\frac{501}{2} (x - \log2)^2 \right\}.\] The following code shows how to obtain this pdf (h.501) and add it to the histogram. This continuous graph is in the last part of this question vi) as well.
h.501 <- sqrt(n/(2*pi))*exp(-n/2*(x-log(2))^2)
4) kernel density
The median of observed values is 0.6851 (m.501). Inserting this median and all of the observed values (observ.501) to the ‘density’ function, the kernel density estimator can be obtained (d.501$y) by 0.4891. Since the pdf is \[\hat k_n(x) = \sqrt{\frac{4n}{2\pi}}\hat f(\tilde x)\exp\left\{-2n\{\hat f(\tilde x)\}^2(x-\tilde x)^2\right\},\] \(\hat k_{501}(x)\) can be obtained as follows (k.501).
m.501 <- median(observ.501)
m.501
## [1] 0.6851052
d.501 <- density(observ.501, from=m.501, to=m.501, n=1)
d.501$y
## [1] 0.4891372
k.501 <- sqrt(4*n/(2*pi))*(d.501$y)*exp(-2*n*(d.501$y)^2*(x-m.501)^2)
5) histogram and pdfs
The histgram overlaid by each pdf obtained from 2),3), and 4) is as follows.
hist(samp_med.501, freq = FALSE, xlim =c(0.4,1.0), ylim = c(0, 12),
main = 'Histogram among 501 samples')
lines(x, g.251, col = 'red')
lines(x, h.501, col = 'blue')
lines(x, k.501, col = 'green')
legend('topright', legend = c('exact', 'CLT','kernel'),
col = c('red','blue','green'), pch = 16, cex = 0.9)
The above plot shows the histogram and the 3 continuous graphs. Not only are the exact pdf of sample median and pdf of CLT almost identical, but they are also very close to the kernel-based pdf. Naturally, all of the pdfs have the highest value at around median of observed values, 0.6851.
Then, it is estimated that the greater number of observations ‘n’ and the consecutive bootstrap sampling of Exp(1) will lead to a histogram similar to the pdf of normal distribution whose mean is about log(2) and the variance is 1/n, when \(X \sim F\).