Elements of Statistical Learning Ch3

1. (Ex 3.6) Show that the ridge regression estimate is the mean (and mode) of the posterior distribution, under a Gaussian prior \(\beta \sim \mathcal N_p(0, \tau^2 I_p)\) and Gaussian sampling model \(y \sim \mathcal N_N(X\beta, \sigma^2 I_N)\) with \(\lambda = \sigma^2/\tau^2\).

Solution

The posterior distribution of \(\beta\) is \(p(\beta | y) \propto p(y | \beta) p(\beta)\).
\(p(y | \beta) \propto \exp \left\{ -\frac{1}{2\sigma^2} | y - X\beta |^2 \right\}\) and
\(p(\beta) \propto \exp \left\{ -\frac{1}{2\tau^2} |\beta|^2 \right\}\).

Thus,
\(p(y | \beta) p(\beta) \propto \exp \left\{ -\frac{1}{2} \left( \frac{1}{\sigma^2} | y - X\beta |^2 + \frac{1}{\tau^2} |\beta|^2 \right) \right\}\).

This can be rewritten as:
\(\propto \exp \left\{ -\frac{1}{2} (\beta - \mu_\beta)' \Sigma^{-1} (\beta - \mu_\beta) \right\}\).

The exponents must agree up to an additive constant not involving \(\beta\), so
\(\frac{1}{\sigma^2} (y - X\beta)' (y - X\beta) + \frac{1}{\tau^2} \beta' \beta = (\beta - \mu_\beta)' \Sigma^{-1} (\beta - \mu_\beta) + \text{const} \quad \forall \beta\).

Matching the terms quadratic in \(\beta\),
\(\frac{1}{\sigma^2} \beta' X' X \beta + \frac{1}{\tau^2} \beta' \beta = \beta' \Sigma^{-1} \beta\).

Thus,
\(\Sigma^{-1} = \frac{1}{\sigma^2} X' X + \frac{1}{\tau^2} I_p = \frac{1}{\sigma^2} \left( X'X + \frac{\sigma^2}{\tau^2} I_p \right)\).

For the mean, matching the terms linear in \(\beta\),
\(-\frac{2}{\sigma^2} \beta' X' y = -2 \beta' \Sigma^{-1} \mu_\beta\),
so
\(\Sigma^{-1} \mu_\beta = \frac{1}{\sigma^2} X' y\)
\(\quad \Rightarrow \mu_\beta = \frac{1}{\sigma^2} \Sigma X' y = \frac{1}{\sigma^2} \cdot \sigma^2 \left( X'X + \frac{\sigma^2}{\tau^2} I_p \right)^{-1} X' y = \left( X'X + \frac{\sigma^2}{\tau^2} I_p \right)^{-1} X' y\),

which is the ridge regression estimator with \(\lambda = \sigma^2/\tau^2\). Since the posterior is Gaussian, its mean and mode coincide, so the ridge estimate is both the posterior mean and mode.
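As a quick numerical check, here is a minimal NumPy sketch (simulated data, with made-up values of \(\sigma^2\) and \(\tau^2\)) comparing the ridge estimator with the posterior mean \(\mu_\beta = \frac{1}{\sigma^2}\Sigma X'y\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 4
sigma2, tau2 = 2.0, 0.5          # assumed noise and prior variances
lam = sigma2 / tau2              # ridge penalty implied by the prior

X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + np.sqrt(sigma2) * rng.standard_normal(N)

# ridge estimator
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# posterior mean: Sigma^{-1} = X'X/sigma2 + I/tau2,  mu = Sigma X'y / sigma2
Sigma = np.linalg.inv(X.T @ X / sigma2 + np.eye(p) / tau2)
mu_beta = Sigma @ X.T @ y / sigma2

print(np.allclose(beta_ridge, mu_beta))   # True
```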

2. (Ex 3.12) Show that the ridge regression estimates can be obtained by ordinary least squares regression on an augmented data set: we augment the centered matrix \(X\) with \(p\) additional rows \(\sqrt \lambda I_p\), and augment \(y\) with \(p\) zeros. By introducing artificial data having response value zero, the fitting procedure is forced to shrink the coefficients toward zero.

Solution

Let
\(\tilde{X} = \begin{bmatrix} X \\ \sqrt{\lambda} I_p \end{bmatrix}\)
and
\(\tilde{y} = \begin{bmatrix} y \\ 0_p \end{bmatrix}\).

Then,
\(\tilde{X}' \tilde{X} = \begin{bmatrix} X' & \sqrt{\lambda} I_p \end{bmatrix} \begin{bmatrix} X \\ \sqrt{\lambda} I_p \end{bmatrix} = X'X + \lambda I_p\)
and

\(\tilde{X}' \tilde{y} = \begin{bmatrix} X' & \sqrt{\lambda} I_p \end{bmatrix} \begin{bmatrix} y \\ 0_p \end{bmatrix} = X' y\).

So,
\(\tilde{\beta} = (\tilde{X}' \tilde{X})^{-1} \tilde{X}' \tilde{y} = (X'X + \lambda I_p)^{-1} X' y = \hat{\beta}^{\text{Ridge}}\).
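A minimal NumPy sketch (simulated data, arbitrary \(\lambda\)) confirming that OLS on the augmented data reproduces the ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 50, 4, 3.0
X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# augmented data: p extra rows sqrt(lam) * I with response 0
X_tilde = np.vstack([X, np.sqrt(lam) * np.eye(p)])
y_tilde = np.concatenate([y, np.zeros(p)])

beta_ols_aug = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(beta_ols_aug, beta_ridge))   # True
```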

3. Show that \(E[||\hat{\mu}^{\text{JS}} - \mu||^2] < E[||\hat{\mu}^{\text{ML}} - \mu||^2] = p\), where \[\hat{\mu}^{\text{JS}} = \left(1 - \frac{p-2}{||Z||^2} \right)Z,\] \(Z_i \sim N(\mu_i, 1)\) independently for \(i=1, \dots, p\), \(p \geq 3\), and \(\mu = (\mu_1, \dots, \mu_p)\).

Solution

Let \(\hat{\mu} = (\hat{\mu}_1, \dots, \hat{\mu}_p)'\) be a random estimator of \(\mu\), not necessarily unbiased.
Then,
\[(\hat{\mu}_i - \mu_i)^2 = (\hat{\mu}_i - Z_i + Z_i - \mu_i)^2 = (\hat{\mu}_i - Z_i)^2 + (Z_i - \mu_i)^2 + 2 (\hat{\mu}_i - Z_i)(Z_i - \mu_i).\]

1) Expectation Calculation

Taking expectations of both sides,

\[ E[(\hat{\mu}_i - \mu_i)^2] = E[(\hat{\mu}_i - Z_i)^2] + E[(Z_i - \mu_i)^2] + 2E[(\hat{\mu}_i - Z_i)(Z_i - \mu_i)]. \]

Since \(E[Z_i - \mu_i] = 0\), the cross term is a covariance:

\[ E[(\hat{\mu}_i - Z_i)(Z_i - \mu_i)] = \text{Cov}[\hat{\mu}_i - Z_i, Z_i - \mu_i] = \text{Cov}[Z_i, \hat{\mu}_i] - V[Z_i] = \text{Cov}[Z_i, \hat{\mu}_i] - 1. \]

Thus,
\[ E[(\hat{\mu}_i - \mu_i)^2] = E[(\hat{\mu}_i - Z_i)^2] - 1 + 2\text{Cov}[Z_i, \hat{\mu}_i], \]

and

\[ E[||\hat{\mu} - \mu||^2] = E[||\hat{\mu} - Z||^2] - p + 2 \sum_{i=1}^{p} \text{Cov}[Z_i, \hat{\mu}_i]. \]

2) Covariance Calculation

We claim that
\[ \text{Cov}[Z_i, \hat{\mu}_i] = E \left[ \frac{d\hat{\mu}_i}{dZ_i} \right] \]
(Stein's lemma), which we now derive.

(Let \(\varphi(x)\) be the p.d.f. of \(x \sim N(0,1)\).)

\[ \text{Cov}[\hat{\mu}_i, Z_i] = E_{Z_i}[(\hat{\mu}_i - \bar{\mu}_i)(Z_i - \mu_i)]. \]

(Let \(\bar{\mu}_i = E[\hat{\mu}_i]\).)

\[ = \int (\hat{\mu}_i - \bar{\mu}_i) (Z_i - \mu_i) \varphi(Z_i - \mu_i) dZ_i. \]

(Since \(\varphi'(x) = -x \varphi(x)\).)

\[ = \int (\hat{\mu}_i - \bar{\mu}_i) \left( - \frac{d}{dZ_i} \varphi(Z_i - \mu_i) \right) dZ_i. \]

Using integration by parts (the boundary term vanishes because \(\varphi\) decays to zero at \(\pm\infty\)),

\[ = -\left[ (\hat{\mu}_i - \bar{\mu}_i) \varphi(Z_i - \mu_i) \right]_{-\infty}^{\infty} + \int \frac{d}{dZ_i} (\hat{\mu}_i - \bar{\mu}_i) \varphi(Z_i - \mu_i) dZ_i \]

\[ = \int \frac{d\hat{\mu}_i}{dZ_i} \varphi(Z_i - \mu_i) dZ_i = E \left[ \frac{d\hat{\mu}_i}{dZ_i} \right]. \]
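As a sanity check of Stein's lemma, here is a small Monte Carlo sketch with an arbitrary smooth function \(h(Z) = \tanh(Z)\) and a made-up mean (both chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.5                                  # made-up mean
Z = rng.normal(mu, 1.0, size=1_000_000)   # Z ~ N(mu, 1)

h = np.tanh(Z)                            # a smooth function of Z
dh = 1.0 - np.tanh(Z) ** 2                # its derivative

cov = np.mean((Z - mu) * h)               # Cov[Z, h(Z)], since E[Z - mu] = 0
print(cov, dh.mean())                     # Stein's lemma: the two agree
```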

3) Expected Norm of the James–Stein Estimator

For \(\hat{\mu}^{\text{JS}} = \left(1 - \frac{p-2}{||Z||^2}\right)Z\), we have \(\hat{\mu}^{\text{JS}}_i - Z_i = -\frac{(p-2)Z_i}{||Z||^2}\), so

\[ ||\hat{\mu}^{\text{JS}} - Z||^2 = \frac{(p-2)^2}{||Z||^2}, \qquad \frac{d\hat{\mu}^{\text{JS}}_i}{dZ_i} = 1 - (p-2)\frac{||Z||^2 - 2Z_i^2}{||Z||^4}. \]

Summing the derivatives over \(i\),

\[ \sum_{i=1}^{p} \frac{d\hat{\mu}^{\text{JS}}_i}{dZ_i} = p - (p-2)\frac{p||Z||^2 - 2||Z||^2}{||Z||^4} = p - \frac{(p-2)^2}{||Z||^2}. \]

Substituting into 1) and 2),

\[ E[||\hat{\mu}^{\text{JS}} - \mu||^2] = E\left[\frac{(p-2)^2}{||Z||^2}\right] - p + 2\left(p - E\left[\frac{(p-2)^2}{||Z||^2}\right]\right) = p - E\left[ \frac{(p-2)^2}{||Z||^2} \right]. \]

4) Conclusion

A single observation \(Z\) gives \(\hat{\mu}^{\text{ML}} = Z\), so
\[ E[||\hat{\mu}^{\text{ML}} - \mu||^2] = E[||Z - \mu||^2] = p. \]
Thus,
\[ E[||\hat{\mu}^{\text{JS}} - \mu||^2] = E[||\hat{\mu}^{\text{ML}} - \mu||^2] - E\left[ \frac{(p-2)^2}{||Z||^2} \right]. \]
Since \(E\left[ \frac{1}{||Z||^2} \right] > 0\) and \((p-2)^2 > 0\) for \(p \geq 3\), we obtain
\[ E[||\hat{\mu}^{\text{JS}} - \mu||^2] < p. \]
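A small Monte Carlo sketch comparing the two risks (the true mean vector and the dimension below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_rep = 10, 100_000
mu = np.linspace(-1, 1, p)                  # an arbitrary true mean vector

Z = rng.normal(mu, 1.0, size=(n_rep, p))    # Z_i ~ N(mu_i, 1)
norm2 = (Z ** 2).sum(axis=1, keepdims=True)

mu_ml = Z                                    # maximum likelihood estimate
mu_js = (1 - (p - 2) / norm2) * Z            # James-Stein shrinkage

risk_ml = ((mu_ml - mu) ** 2).sum(axis=1).mean()   # ~ p
risk_js = ((mu_js - mu) ** 2).sum(axis=1).mean()   # strictly smaller
print(risk_ml, risk_js)
```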

4. Show that \(\hat{\theta}^{\text{lasso}}\) which minimizes \[ F(\theta) = (z - \theta)^2 + \lambda |\theta| \] is given by \[ \hat{\theta}^{\text{lasso}} = \text{sign}(z)(|z| - \lambda/2)_+, \quad (z \in \mathbb{R}, \lambda > 0). \]

Solution

We have: \[ F(\theta) = z^2 - 2\theta z + \theta^2 + \lambda |\theta|. \]

Case 1: \(z \geq 0\)

For \(\theta < 0\), both \((z - \theta)^2\) and \(\lambda|\theta| = -\lambda\theta\) decrease as \(\theta\) increases to \(0\), so \(F\) is decreasing there and the minimizer satisfies \(\theta \geq 0\).

When \(\theta \geq 0\),
\[ F(\theta) = z^2 - 2z\theta + \theta^2 + \lambda\theta = \theta^2 + (\lambda - 2z)\theta + z^2, \]
a parabola in \(\theta\) with vertex at \(\theta = z - \lambda/2\). Thus,
\[ \arg\min_{\theta \geq 0} F(\theta) = \begin{cases} z - \lambda/2, & (z \geq \lambda/2) \\ 0, & (z < \lambda/2) \end{cases}. \]

  • If \(z \geq \lambda/2\), the vertex lies in \([0, \infty)\) and
    \[ F(z - \lambda/2) = z^2 - (z - \lambda/2)^2 = \lambda z - \lambda^2/4 \leq z^2 = F(0), \]
    so \(\hat{\theta} = z - \lambda/2\).
  • If \(z < \lambda/2\), the vertex is negative, so \(F\) is increasing on \([0, \infty)\) and \(\hat{\theta} = 0\).

In both subcases,
\[ \hat{\theta} = \text{sign}(z)(|z| - \lambda/2)_+. \]


Case 2: \(z < 0\)

For \(\theta > 0\), both \((z - \theta)^2\) and \(\lambda\theta\) increase with \(\theta\), so the minimizer satisfies \(\theta \leq 0\).

When \(\theta \leq 0\),
\[ F(\theta) = z^2 - 2z\theta + \theta^2 - \lambda\theta = \theta^2 - (\lambda + 2z)\theta + z^2, \]
a parabola with vertex at \(\theta = z + \lambda/2\).

  • If \(z \leq -\lambda/2\), the vertex lies in \((-\infty, 0]\) and
    \[ F(z + \lambda/2) = z^2 - (z + \lambda/2)^2 = -\lambda z - \lambda^2/4 \leq z^2 = F(0), \]
    so
    \[ \hat{\theta} = z + \lambda/2 = -(|z| - \lambda/2) = \text{sign}(z)(|z| - \lambda/2)_+. \]
  • If \(-\lambda/2 < z < 0\), the vertex is positive, so \(F\) is decreasing on \((-\infty, 0]\) and \(\hat{\theta} = 0 = \text{sign}(z)(|z| - \lambda/2)_+\) (since \(|z| < \lambda/2\)).

Thus, we conclude that
\[ \hat{\theta}^{\text{lasso}} = \text{sign}(z)(|z| - \lambda/2)_+. \]
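A brute-force sketch checking the closed form against grid minimization of \(F\) (the values of \(z\) and \(\lambda\) are arbitrary):

```python
import numpy as np

def soft(z, lam):
    """Closed-form minimizer: sign(z) * (|z| - lam/2)_+."""
    return np.sign(z) * max(abs(z) - lam / 2, 0.0)

lam = 1.5
theta = np.linspace(-10, 10, 2_000_001)           # fine grid for brute force
for z in (-3.0, -0.4, 0.0, 0.6, 2.5):
    F = (z - theta) ** 2 + lam * np.abs(theta)
    print(z, theta[np.argmin(F)], soft(z, lam))   # grid argmin vs formula
```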

5. (Ex. 3.23) Consider a regression problem with all variables and response having mean zero and standard deviation one. Suppose also that each variable has identical absolute correlation with the response: \[\frac{1}{N}|\langle x_j, y\rangle| = \lambda,\quad j=1,...,p.\] Let \(\hat\beta\) be the least-squares coefficient of \(y\) on \(X\), and let \(u(\alpha) = \alpha X\hat\beta\) for \(\alpha \in [0, 1]\) be the vector that moves a fraction \(\alpha\) toward the least squares fit \(u\). Let RSS be the residual sum-of-squares from the full least squares fit.

(a) Show that \[\frac{1}{N}\langle x_j, y-u(\alpha)\rangle = (1 - \alpha)\lambda,\quad j=1,...,p\] and hence the correlations of each \(x_j\) with the residuals remain equal in magnitude as we progress toward \(u\).

Solution

Note that
\[ \frac{1}{N} \langle x_i, y - u(\alpha) \rangle \]
is the \(i\)-th element of
\[ \frac{1}{N} X^T(y - u(\alpha)). \]

\[\begin{aligned} X^T(y - u(\alpha)) &= X^T(y - \alpha X(X^TX)^{-1}X^Ty) \\ &= X^Ty - \alpha X^TX(X^TX)^{-1}X^Ty \\ &=(1 - \alpha)X^Ty. \end{aligned}\]

We have
\[ \langle x_i, y \rangle = N\lambda, \quad \forall i = 1, \dots, p \]
(w.l.o.g. taking positive signs, flipping the sign of \(x_i\) if necessary),
so the \(i\)-th element of
\[ \frac{1}{N} X^T(y - u(\alpha)) \]
is
\[ \left[ \frac{1}{N} X^T(y - u(\alpha)) \right]_i = \frac{1}{N} (1 - \alpha) [X^Ty]_i = \frac{1}{N} (1 - \alpha) N\lambda. \]

Thus,
\[ \frac{1}{N} \langle x_i, y - u(\alpha) \rangle = \frac{1}{N} (1 - \alpha) \langle x_i, y \rangle = \frac{1}{N} (1 - \alpha) N\lambda = (1 - \alpha) \lambda. \]
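The key identity \(X^T(y - u(\alpha)) = (1-\alpha)X^Ty\) holds for any full-column-rank \(X\); a minimal NumPy check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, alpha = 80, 5, 0.3
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
u = alpha * X @ beta_ls            # move a fraction alpha toward the LS fit

lhs = X.T @ (y - u)
rhs = (1 - alpha) * X.T @ y
print(np.allclose(lhs, rhs))       # True
```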

(b) Show that these absolute correlations are all equal to \[\lambda(\alpha) = \frac{1-\alpha}{\sqrt{(1-\alpha)^2 + \frac{\alpha(2-\alpha)}{N}RSS } }\lambda,\] and they decrease monotonically to zero.

Solution

Each \(x_i\) has mean zero and standard deviation one, so \(\frac{1}{N} \langle x_i, x_i \rangle = 1\), i.e., \(||x_i||^2 = N\). Then,

\[\begin{aligned} &\frac{\langle x_i, y - u(\alpha) \rangle}{\{ \langle x_i, x_i \rangle \}^{\frac{1}{2}} \{ \langle y - u(\alpha), y - u(\alpha) \rangle \}^{\frac{1}{2}}} \\ &= \frac{\frac{1}{N} \langle x_i, y - u(\alpha) \rangle}{1 \cdot \left\{ \frac{1}{N} \langle y - u(\alpha), y - u(\alpha) \rangle \right\}^{\frac{1}{2}}} \\ &= \frac{(1 - \alpha) \lambda}{\left\{ \frac{1}{N} \langle y - u(\alpha), y - u(\alpha) \rangle \right\}^{\frac{1}{2}}}. \end{aligned}\]

Now,
\[ \langle y - \alpha X\hat{\beta}, y - \alpha X\hat{\beta} \rangle = y^Ty - 2\alpha y^TX\hat{\beta} + \alpha^2 \hat{\beta}^TX^TX \hat{\beta}. \]

Since \(X^TX\hat{\beta} = X^Ty\) (LSE), we get

\[ = y^Ty + (\alpha^2 - 2\alpha) y^TX\hat{\beta}. \]

Since SSE (RSS) is
\[ \langle y - X\hat{\beta}, y - X\hat{\beta} \rangle, \]
the \(\alpha = 1\) case above gives

\[ \text{SSE} = y^Ty - y^TX\hat{\beta}, \quad \text{i.e., } y^TX\hat{\beta} = y^Ty - \text{SSE}. \]

Thus,

\[\begin{aligned} \langle y - \alpha X\hat{\beta}, y - \alpha X\hat{\beta} \rangle & = y^Ty + (\alpha^2 - 2\alpha) (y^Ty - \text{SSE}) \\ &= (\alpha^2 - 2\alpha + 1) y^Ty - (\alpha^2 - 2\alpha) \text{SSE}. \end{aligned}\]

Since \(y\) has mean zero and standard deviation one, \(\frac{1}{N} y^Ty = 1\), so

\[ y^Ty = N. \]

Using this property,

\[ \langle y - \alpha X\hat{\beta}, y - \alpha X\hat{\beta} \rangle = N(1 - \alpha)^2 + \alpha(2 - \alpha) \text{SSE}. \]

Thus,

\[ \frac{\langle x_i, y - u(\alpha) \rangle}{\left\{ \langle x_i, x_i \rangle \right\}^{\frac{1}{2}} \left\{ \langle y - u(\alpha), y - u(\alpha) \rangle \right\}^{\frac{1}{2}}} = \frac{(1 - \alpha) \lambda}{\sqrt{(1 - \alpha)^2 + \frac{\text{SSE}}{N} \alpha (2 - \alpha)}} = \lambda(\alpha). \]

To see the monotonicity, rewrite

\[ \lambda(\alpha) = \frac{(1 - \alpha) \lambda}{\sqrt{(1 - \alpha)^2 + \frac{\text{SSE}}{N} \alpha (2 - \alpha)}} = \frac{\lambda}{\sqrt{1 + \frac{\text{SSE}}{N} \cdot \frac{\alpha (2 - \alpha)}{(1 - \alpha)^2}}}. \]

On \((0, 1)\), \(\alpha(2 - \alpha)\) is increasing and \((1 - \alpha)^2\) is decreasing, so \(\frac{\alpha(2 - \alpha)}{(1 - \alpha)^2}\) increases from \(0\) to \(\infty\). Hence \(\lambda(\alpha)\) decreases monotonically from \(\lambda(0) = \lambda\) to \(\lambda(1) = 0\).
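A NumPy sketch checking the formula on standardized simulated data (columns of \(X\) and \(y\) centered and scaled so that squared norms equal \(N\)). Generic simulated data will not satisfy the equal-correlation hypothesis, so the identity is verified per variable, with \(\lambda_j = \frac{1}{N}\langle x_j, y\rangle\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, alpha = 200, 4, 0.4
X = rng.standard_normal((N, p))
y = rng.standard_normal(N)

# standardize so that column means are 0 and squared norms equal N
X = (X - X.mean(0)) / X.std(0)
y = (y - y.mean()) / y.std()

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - alpha * X @ beta_ls
sse = ((y - X @ beta_ls) ** 2).sum()

lam_j = X.T @ y / N                                         # per-variable lambda
lhs = X.T @ resid / (np.sqrt(N) * np.linalg.norm(resid))    # corr(x_j, residual)
rhs = (1 - alpha) * lam_j / np.sqrt((1 - alpha) ** 2 + alpha * (2 - alpha) * sse / N)
print(np.allclose(lhs, rhs))                                # True
```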

6. (Ex. 3.25) Derive expressions to identify the next variable to enter the active set at step \(k + 1\), and the value of \(\alpha\) at which this occurs.

Solution

Step 1: Solve for \(\alpha_b\)

At step \(k\), let \(A_k\) be the active set, \(r_k\) the current residual, and \(\delta_k = (X_{A_k}^T X_{A_k})^{-1} X_{A_k}^T r_k\) the LAR direction, so that the residual along the path is \(r_k(\alpha) = r_k - \alpha X_{A_k} \delta_k\). All active variables \(x_a\, (a \in A_k)\) share the same absolute correlation with \(r_k(\alpha)\), and a non-active variable \(x_b\) enters when its correlation catches up, i.e., when \(x_b^T r_k(\alpha) = \pm x_a^T r_k(\alpha)\).

Equating with the same sign,

\[ x_a^T r_k - \alpha x_a^T X_{A_k} \delta_k = x_b^T r_k - \alpha x_b^T X_{A_k} \delta_k \]

\[ \Rightarrow \alpha_b = \frac{x_b^T r_k - x_a^T r_k}{x_b^T X_{A_k} \delta_k - x_a^T X_{A_k} \delta_k}. \]

Equating with opposite signs,

\[ x_a^T r_k - \alpha x_a^T X_{A_k} \delta_k = \alpha x_b^T X_{A_k} \delta_k - x_b^T r_k \]

\[ \Rightarrow \alpha_b = \frac{x_b^T r_k + x_a^T r_k}{x_b^T X_{A_k} \delta_k + x_a^T X_{A_k} \delta_k}. \]

For each \(b\), keep the root with \(\alpha_b \in (0, 1]\).


Step 2: Choose the Smallest \(\alpha_b\)

The next variable to enter is the one whose correlation catches up first, i.e., at the smallest crossing point:

\[ b^* = \arg\min_{b \in A_k^C} \alpha_b, \]

and it enters the active set at \(\alpha = \alpha_{b^*}\).
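A minimal sketch of one such step on simulated data (a hypothetical setup, starting from the single most correlated variable): for each non-active \(x_b\) we compute both roots and keep the smallest admissible one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 6
X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

# step k = 1: the active set holds the variable most correlated with y
r = y.copy()
a = int(np.argmax(np.abs(X.T @ r)))
A = [a]

XA = X[:, A]
delta = np.linalg.solve(XA.T @ XA, XA.T @ r)   # LAR direction delta_k
d = XA @ delta                                  # X_{A_k} delta_k

best_alpha, best_b = np.inf, None
for b in range(p):
    if b in A:
        continue
    # solve x_a'r - alpha x_a'd = +-(x_b'r - alpha x_b'd) for alpha
    for sgn in (+1.0, -1.0):
        den = X[:, a] @ d - sgn * (X[:, b] @ d)
        if den == 0:
            continue
        alpha = (X[:, a] @ r - sgn * (X[:, b] @ r)) / den
        if 0 < alpha <= 1 and alpha < best_alpha:
            best_alpha, best_b = alpha, b

print("next variable:", best_b, "enters at alpha =", best_alpha)
```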

7. (Ex. 3.27) For given \(\lambda > 0\), let \(\hat\beta(\lambda)\) be the lasso solution that minimizes \[\frac{1}{2}\Vert y - X\beta\Vert_2^2 + \lambda\Vert \beta\Vert_1\] and \(\mathcal B_{\lambda}\) be the active set of variables in the solution.

(a) Show that for any active variable \(x_j (j \in \mathcal B_{\lambda})\) we have \[x_j'(y - X \hat\beta(\lambda)) = \lambda\cdot\text{sign}(\hat\beta_j(\lambda)).\]

Solution

Let
\[ Q(\beta) = \frac{1}{2} || y - X\beta ||_2^2 + \lambda ||\beta||_1, \]
which expands to:
\[ Q(\beta) = \frac{1}{2} (y^Ty - 2\beta^T X^T y + \beta^T X^T X \beta) + \lambda ||\beta||_1. \]

For active variable \(x_i\), where \(\beta_i \neq 0\), we compute the derivative:

\[\begin{aligned} \frac{\partial}{\partial \beta_i} Q(\beta) &= \frac{\partial}{\partial \beta_i} \left( -\beta^T X^T y + \frac{1}{2} \beta^T X^T X \beta \right) + \lambda \text{sign}(\beta_i) \\ &= [-X^T y + X^T X \beta]_i + \lambda \text{sign}(\beta_i) \\ &= (- x_i^T y + x_i^T X \beta) + \lambda \text{sign}(\beta_i). \end{aligned}\]

Since \(\hat{\beta}(\lambda)\) minimizes \(Q\), the derivative must vanish at every active coordinate:

\[ [-X^T y + X^T X \hat{\beta}]_i + \lambda \, \text{sign}(\hat{\beta}_i) = 0, \]

and it follows that

\[ x_i^T (y - X \hat{\beta}) = \lambda \, \text{sign}(\hat{\beta}_i). \]

(b) Show that for any non-active variable \(x_k, (k \notin \mathcal B_{\lambda})\) we have \[|x_k'(y - X \hat\beta(\lambda))| \le \lambda.\]

Solution

For a non-active variable \(x_k\), we have \(\hat{\beta}_k = 0\). Since \(f(x) = |x|\) is not differentiable at \(x = 0\), we use its subdifferential there:

\[ \partial f(0) = [-1, 1]. \]

The optimality condition becomes the subgradient inclusion

\[ 0 \in -x_k^T y + x_k^T X \hat{\beta} + \lambda \, \partial f(0), \]

so

\[ x_k^T (y - X \hat{\beta}) = \lambda s \quad \text{for some } s \in [-1, 1], \]

and therefore,

\[ |x_k^T (y - X \hat{\beta})| \leq \lambda. \]
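Conditions (a) and (b) are the KKT conditions of the lasso. A self-contained sketch verifying them numerically, using a basic coordinate-descent solver on simulated data (a hypothetical setup, not from the text); the per-coordinate update is the soft-thresholding of Problem 4, with threshold \(\lambda\) rather than \(\lambda/2\) because of the \(\frac{1}{2}\) factor here:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=2000):
    """Coordinate descent for (1/2)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    norms = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual
            beta[j] = soft(X[:, j] @ r_j, lam) / norms[j]
    return beta

rng = np.random.default_rng(0)
N, p, lam = 100, 8, 5.0
X = rng.standard_normal((N, p))
y = X @ np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0]) + rng.standard_normal(N)

beta = lasso_cd(X, y, lam)
grad = X.T @ (y - X @ beta)
for j in range(p):
    if beta[j] != 0:           # active: x_j'(y - Xb) = lam * sign(b_j)
        print(j, grad[j], lam * np.sign(beta[j]))
    else:                      # non-active: |x_j'(y - Xb)| <= lam
        print(j, abs(grad[j]), "<=", lam)
```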

(c) Suppose that the set of active predictors is unchanged for \(\lambda_0 ≥ \lambda ≥ \lambda_1\). Show that there is a vector \(\gamma_0\) such that \[\hat\beta(\lambda) = \hat\beta(\lambda_0) - (\lambda - \lambda_0)\gamma_0\]

Solution

From (a) and (b),

\[ x_i^T (y - X \hat{\beta}(\lambda)) = \lambda \, \text{sign}(\hat{\beta}_i(\lambda)) \quad \text{for active } x_i, \]

\[ |x_i^T (y - X \hat{\beta}(\lambda))| \leq \lambda \quad \text{for non-active } x_i. \]

So, for every \(i\),

\[ x_i^T (y - X \hat{\beta}(\lambda)) = \lambda a_i, \quad \text{where } -1 \leq a_i \leq 1, \]

or in matrix form,

\[ X^T (y - X \hat{\beta}(\lambda)) = \lambda a, \quad \text{where } a = (a_1, \dots, a_p)'. \]

Since the active set and the signs of the active coefficients are unchanged for \(\lambda_1 \leq \lambda \leq \lambda_0\), the vector \(a\) is constant on this interval.

From this,

\[ X^T X \hat{\beta}(\lambda) = X^T y - \lambda a. \]

Solving for \(\hat{\beta}(\lambda)\) (assuming \(X^TX\) is invertible),

\[ \hat{\beta}(\lambda) = (X^T X)^{-1} (X^T y - \lambda a). \]

Differencing at \(\lambda\) and \(\lambda_0\), with \(a\) constant,

\[ \hat{\beta}(\lambda) - \hat{\beta}(\lambda_0) = -(\lambda - \lambda_0)(X^T X)^{-1} a, \]

so the claim holds with

\[ \gamma_0 = (X^T X)^{-1} a. \]
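A quick numerical check in the special case of an orthonormal design, where \(X^TX = I_p\), the lasso has the closed form \(\hat\beta_j(\lambda) = \text{sign}(x_j^Ty)(|x_j^Ty| - \lambda)_+\), and \(\gamma_0 = a\) is just the sign vector. The interval and coefficients below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 5
Q, _ = np.linalg.qr(rng.standard_normal((N, p)))   # orthonormal columns: X'X = I
X = Q
y = X @ np.array([4.0, -3.0, 2.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(N)

def beta_hat(lam):
    z = X.T @ y
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # closed-form lasso

lam0, lam1 = 0.5, 1.0        # for this seed the active set is fixed on [0.5, 1.0]
b0, b1 = beta_hat(lam0), beta_hat(lam1)
gamma0 = np.sign(b0)         # here gamma_0 = (X'X)^{-1} a = a, the sign vector
print(np.allclose(b1, b0 - (lam1 - lam0) * gamma0))        # True
```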