Elements of Statistical Learning Ch4

1. Let \(G\) denote the class label in this subsection. Define \(Y = (Y_1, \dots, Y_K)\), where \(Y_k = 1\) if \(G = k\) and \(Y_k = 0\) otherwise. \(\mathbf Y\) is an \(N \times K\) indicator response matrix, each row of which contains a single 1. Fit a linear regression model to each column of \(\mathbf Y\) simultaneously: \[\begin{aligned} \hat{Y} &= X (X^T X)^{-1} X^T Y, \\ \hat{B} &= (X^T X)^{-1} X^T Y, \quad (p+1) \times K \text{ coefficient matrix},\end{aligned}\] where \(X\) is the model matrix with \(p+1\) columns for the \(p\) inputs and an intercept. For a new input \(x\), compute the fitted output \(\hat{f}(x)^T = (1, x^T) \hat{B}\) and classify to the largest component: \(\hat{G}(x) = \arg\max_{k \in G} \hat{f}_k(x).\) Then, show that the sum of the fitted values satisfies \[\sum_{k \in G} \hat{f}_k(x) = 1.\]

Solution

Let
\[ X = \begin{bmatrix} 1_N & X_{(p)} \end{bmatrix}, \]
where \(1_N = (1, \dots, 1)^T\) and \(X_{(p)}\) is an \(N \times p\) matrix.

\[X^T X = \begin{bmatrix} 1_N^T 1_N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix} = \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}.\]

Since each row of \(Y\) contains a single 1, \(Y \mathbf 1_K = 1_N\), so
\[ X^T Y\mathbf 1_K = \begin{bmatrix} 1_N^T \\ X_{(p)}^T \end{bmatrix} 1_N = \begin{bmatrix} N \\ X_{(p)}^T 1_N \end{bmatrix}. \]

Then,
\[ \hat{B}\mathbf 1_K = (X^T X)^{-1} X^T Y\mathbf 1_K = \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}^{-1} \begin{bmatrix} N \\ X_{(p)}^T 1_N \end{bmatrix}. \]

Using the property

\[ \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}\begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}^{-1}= \begin{bmatrix} 1 & 0_p^T \\ 0_p & I_p \end{bmatrix}, \]

we get

\[ \hat{B}\mathbf 1_K = \begin{bmatrix} 1 \\ 0_p \end{bmatrix}, \quad \text{where } 0_p = (0, \dots, 0)^T \in \mathbb{R}^p. \]

Since

\[ \hat{f}(x)^T = (1, x^T) \hat{B}, \] we have \[ \sum_{k=1}^{K} \hat{f}_k(x) = \hat{f}(x)^T 1_K = (1, x^T) \hat{B} 1_K= (1, x^T) \begin{bmatrix} 1 \\ 0_p \end{bmatrix} = 1. \]

Thus, we have proved that
\[ \sum_{k=1}^{K} \hat{f}_k(x) = 1. \]
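This can be checked numerically. The sketch below (synthetic data; `numpy` assumed available, all names hypothetical) fits the indicator regression and confirms that the fitted values sum to 1 at an arbitrary input:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 50, 3, 4

# Model matrix with an intercept column, and an indicator response matrix Y.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
Y = np.eye(K)[rng.integers(0, K, size=N)]   # each row contains a single 1

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (p+1) x K coefficient matrix

# Fitted vector at an arbitrary new input x: f_hat(x)^T = (1, x^T) B_hat.
x = rng.normal(size=p)
f_hat = np.concatenate([[1.0], x]) @ B_hat
print(f_hat.sum())                          # sums to 1 regardless of x
```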

2. (Ex 4.2) Suppose we have features \(x \in \mathbb{R}^p\), a two-class response, with class sizes \(n_1, n_2\), and the target coded as \(-n/n_1, n/n_2\).

(a) Show that the LDA rule classifies to class 2 if \[ x^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log\left(\frac{n_1}{n}\right) - \log\left(\frac{n_2}{n}\right),\] and class 1 otherwise.

Solution

The LDA rule classifies to class 2 if \(\delta_2(x) > \delta_1(x)\). That is,

\[ x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 > x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1. \]

In this case, \(\hat{\pi}_1 = \frac{n_1}{n}\) and \(\hat{\pi}_2 = \frac{n_2}{n}\) (\(n = n_1 + n_2\)). Then,

\[ x^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \frac{n_1}{n} - \log \frac{n_2}{n} \]

so the LDA rule classifies to class 2 exactly when the displayed inequality holds, and to class 1 otherwise.
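As a numerical sanity check, the following sketch (synthetic two-class Gaussian data; names hypothetical) confirms that the threshold form of the rule agrees with directly comparing \(\delta_1\) and \(\delta_2\):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 2, 30, 40
n = n1 + n2

X1 = rng.normal(loc=0.0, size=(n1, p))
X2 = rng.normal(loc=1.0, size=(n2, p))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled within-class covariance estimate (divisor n - 2).
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
Si = np.linalg.inv(S)
pi1, pi2 = n1 / n, n2 / n

def delta(x, mu, pi):
    # Linear discriminant function with plug-in estimates.
    return x @ Si @ mu - 0.5 * mu @ Si @ mu + np.log(pi)

x = rng.normal(size=p)
lhs = x @ Si @ (mu2 - mu1)
rhs = 0.5 * mu2 @ Si @ mu2 - 0.5 * mu1 @ Si @ mu1 + np.log(pi1) - np.log(pi2)

# The rearranged threshold rule agrees with comparing the discriminants.
print((lhs > rhs) == (delta(x, mu2, pi2) > delta(x, mu1, pi1)))
```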

(b) Consider minimization of the least squares criterion \[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta^T x_i)^2. \] Show that the solution \(\hat{\beta}\) satisfies \[ \left[(n - 2) \hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B \right] \hat{\beta} = n(\hat{\mu}_2 - \hat{\mu}_1),\] where \(\hat{\Sigma}_B = (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T\).

Solution

Let

\[ x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p, \quad Y = (y_1, \dots, y_n)^T \in \mathbb{R}^n, \quad \beta = (\beta_1, \dots, \beta_p)^T \in \mathbb{R}^p. \]

We first show that \(\hat{\beta}\) satisfies the centered normal equation:

\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right). \]

Define:

\[ X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad X^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}, \]

and

\[ X^T X = \begin{bmatrix} n & \sum_{i=1}^{n} x_i^T \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i x_i^T \end{bmatrix}. \]

Also,

\[ X^T Y = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}. \]

The least squares estimate (LSE) \((\hat{\beta}_0, \hat{\beta}^T)^T\) satisfies:

\[ X^T X \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{bmatrix} = X^T Y. \]

That is,

\[ \begin{bmatrix} n & \sum_{i=1}^{n} x_i^T \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i x_i^T \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{bmatrix}= \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}. \]

We obtain:

\[ \sum_{i=1}^{n} \hat{\beta}_0 + \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i, \]

\[ \left( \sum_{i=1}^{n} x_i \right) \hat{\beta}_0 + \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i. \]

Since,

\[ \hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^T \hat{\beta}, \]

we substitute into the second equation to obtain

\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right). \]

It remains to evaluate both sides under the coding \(y_i = -n/n_1\) for class 1 and \(y_i = n/n_2\) for class 2. For the right-hand side, \(\sum_i y_i = -\frac{n}{n_1} n_1 + \frac{n}{n_2} n_2 = 0\) and \(\sum_i y_i x_i = -\frac{n}{n_1}\, n_1 \hat{\mu}_1 + \frac{n}{n_2}\, n_2 \hat{\mu}_2 = n(\hat{\mu}_2 - \hat{\mu}_1)\). For the left-hand side, write \(\bar{x} = \frac{1}{n} \sum_i x_i = \frac{n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2}{n}\) and use the pooled-covariance identity \((n-2)\hat{\Sigma} = \sum_i x_i x_i^T - n_1 \hat{\mu}_1 \hat{\mu}_1^T - n_2 \hat{\mu}_2 \hat{\mu}_2^T\):

\[\begin{aligned} \sum_i x_i x_i^T - n \bar{x} \bar{x}^T &= (n-2)\hat{\Sigma} + n_1 \hat{\mu}_1 \hat{\mu}_1^T + n_2 \hat{\mu}_2 \hat{\mu}_2^T - \frac{1}{n}(n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2)(n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2)^T \\ &= (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T = (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B. \end{aligned}\]

Hence \(\left[ (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B \right] \hat{\beta} = n(\hat{\mu}_2 - \hat{\mu}_1)\), as required.
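The target identity of part (b) can be verified numerically; a sketch with synthetic data and the \(-n/n_1,\ n/n_2\) coding (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 3, 20, 35
n = n1 + n2

X1 = rng.normal(size=(n1, p))
X2 = rng.normal(loc=0.5, size=(n2, p))
y = np.concatenate([np.full(n1, -n / n1), np.full(n2, n / n2)])

# Least squares fit with intercept; beta is the slope part.
Xd = np.hstack([np.ones((n, 1)), np.vstack([X1, X2])])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
Sigma_B = np.outer(mu2 - mu1, mu2 - mu1)

lhs = ((n - 2) * Sigma + n1 * n2 / n * Sigma_B) @ beta
rhs = n * (mu2 - mu1)
print(np.allclose(lhs, rhs))
```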

(c) Hence show that \(\hat{\Sigma}_B \hat{\beta}\) is in the direction \((\hat{\mu}_2 - \hat{\mu}_1)\) and thus \[\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1).\] Therefore, the least squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

Solution

We now show that \(\hat{\Sigma}_B \hat{\beta}\) is a scalar multiple of \((\hat{\mu}_2 - \hat{\mu}_1)\):

\[ \hat{\Sigma}_B \hat{\beta} = (\hat{\mu}_2 - \hat{\mu}_1) (\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta} = c\, (\hat{\mu}_2 - \hat{\mu}_1), \quad \text{where } c = (\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta} \in \mathbb{R}, \]

so \(\hat{\Sigma}_B \hat{\beta}\) is in the direction of \((\hat{\mu}_2 - \hat{\mu}_1)\). Substituting this into the identity from part (b) gives

\[ (n - 2)\, \hat{\Sigma} \hat{\beta} = n (\hat{\mu}_2 - \hat{\mu}_1) - \frac{n_1 n_2}{n}\, c\, (\hat{\mu}_2 - \hat{\mu}_1) \propto (\hat{\mu}_2 - \hat{\mu}_1), \]

which implies:

\[ \hat{\beta} \propto \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1). \]

(d) Show that this result holds for any (distinct) coding of the two classes.

Solution

Using the centered normal equation from part (b), we may use any coding \(t_1 \neq t_2\) for the two classes. With \(y_i = t_1\) for class 1 and \(y_i = t_2\) for class 2,

\[ \text{RHS} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right) = \frac{n_1 n_2}{n}\,(t_2 - t_1)(\hat{\mu}_2 - \hat{\mu}_1) \propto (\hat{\mu}_2 - \hat{\mu}_1). \]

Also,

\[ \left( (n - 2) \hat{\Sigma} + \frac{n_1 n_2}{n} (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T \right) \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1). \]

Since \(\hat{\Sigma}_B \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1)\), as in part (c), it follows that

\[ (n - 2) \hat{\Sigma} \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1), \]

and hence,

\[ \hat{\beta} \propto \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1). \]
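A quick numerical check of this coding-invariance claim (synthetic data; the codings below are chosen arbitrarily, all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2 = 3, 25, 30
n = n1 + n2

X1 = rng.normal(size=(n1, p))
X2 = rng.normal(loc=1.0, size=(n2, p))
Xd = np.hstack([np.ones((n, 1)), np.vstack([X1, X2])])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)   # LDA direction

# Any distinct coding (t1, t2) yields a slope parallel to lda_dir.
results = []
for t1, t2 in [(-n / n1, n / n2), (0.0, 1.0), (3.0, -7.0)]:
    y = np.concatenate([np.full(n1, t1), np.full(n2, t2)])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]
    proj = (beta @ lda_dir) / (lda_dir @ lda_dir) * lda_dir  # projection onto lda_dir
    results.append(np.allclose(beta, proj))
print(all(results))
```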

3. (Ex 4.3) Suppose we transform the original predictors \(X\) to \(\hat{Y}\) via linear regression. In detail, let: \(\hat{Y} = X(X^T X)^{-1} X^T Y = X \hat{B},\) where \(Y\) is the indicator response matrix. Similarly, for any input \(x \in \mathbb{R}^p\), we get a transformed vector:\(\hat{x} = \hat{B}^T x \in \mathbb{R}^K.\) Show that LDA using \(\hat{Y}\) is identical to LDA in the original space.

Solution

(1) Preliminary

Recall that each row of \(Y\) has a single 1 and \(Y \in \mathbb{R}^{N \times K}\).

The main point is to show that the linear discriminant function w.r.t. \(X\), \(\delta_k(x)\), is identical to that w.r.t. \(\hat{Y}\), \(\hat{\delta}_k(\hat{y})\).

\[ \delta_k(x) = \log \pi_k + x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \]

\[ \Rightarrow \hat{\delta}_k(\hat{y}) = \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k, \quad \hat{y} := \hat{B}^T x. \]

Let \(\tilde{\pi}_k\), \(\tilde{\mu}_k\), \(\tilde{\Sigma}\) be the estimators of class probability, class mean, and pooled class covariance in the transformed space, respectively.

  1. Class proportions are unchanged by transforming the inputs, so \(\tilde{\pi}_k = \hat{\pi}_k = \frac{N_k}{N}\).

    \[ Y_k : \text{the } k\text{-th column of } Y, \text{ the indicator of group } k, \quad X^T Y_k \in \mathbb{R}^{p}. \]

  2. \[\begin{aligned} \tilde{\mu}_k &= \frac{1}{N_k} \sum_{i: y_i = k} \hat{B}^T x_i \\ &= \hat{B}^T \frac{1}{N_k} \sum_{i: y_i = k} x_i = \hat{B}^T \frac{1}{N_k} X^T Y_k = \hat{B}^T \hat{\mu}_k, \end{aligned}\]

where \[ \hat{B} = (X^T X)^{-1} X^T Y \in \mathbb{R}^{p \times K}. \]

  3. \[\begin{aligned} \tilde{\Sigma} &= \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( \hat{y}_i - \tilde{\mu}_k ) ( \hat{y}_i - \tilde{\mu}_k )^T \\ &= \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( \hat{B}^T x_i - \hat{B}^T \hat{\mu}_k )( \hat{B}^T x_i - \hat{B}^T \hat{\mu}_k )^T \\ &= \hat{B}^T \left[ \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( x_i - \hat{\mu}_k )( x_i - \hat{\mu}_k )^T \right] \hat{B} = \hat{B}^T \hat{\Sigma} \hat{B}. \end{aligned}\]

  4. \[\begin{aligned} (N-K) \hat{\Sigma} &= \sum_{k=1}^{K} \sum_{i: y_i = k} ( x_i - \hat{\mu}_k )( x_i - \hat{\mu}_k )^T \\ & = \sum_i x_i x_i^T - 2 \sum_{k=1}^{K} \sum_{i: y_i = k} x_i \hat{\mu}_k^T + \sum_{k=1}^{K} N_k \hat{\mu}_k \hat{\mu}_k^T \\ &= \sum_i x_i x_i^T - 2 \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X + \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X \\ &= X^T X - \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X \\ &= X^T X - X^T Y D Y^T X, \quad ( D = \text{diag} \{ N_1^{-1}, N_2^{-1}, \dots, N_K^{-1} \} ) \\ &= X^T (I - Y D Y^T) X. \end{aligned}\]

  5. So the discriminant in the transformed space is

\[ \hat{\delta}_k(\hat{y}) = \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k. \]

We simply assume that \(\tilde{\Sigma} = \hat{B}^T \hat{\Sigma} \hat{B} \in \mathbb{R}^{K \times K}\) is invertible.

(2) \(\hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k\)

\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = (\hat{B}^T x)^T (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T \frac{1}{N_k} X^T Y_k, \] and \[\begin{aligned} \hat{B}^T \hat{\Sigma} \hat{B} &= \frac{1}{N-K} \hat{B}^T X^T (I - Y D Y^T) X \hat{B} \\ &= \frac{1}{N-K} \left( \hat{B}^T X^T X \hat{B} - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( Y^T X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T Y - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( Y^T X (X^T X)^{-1} X^T Y - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( H - H D H \right) \end{aligned}\] by letting \(H := \hat{B}^T X^T Y\), so \(H^T = H\).

For nonzero \(a \in \mathbb{R}^K\),

\[ a^T H a = a^T Y^T X (X^T X)^{-1} X^T Y a = c^T (X^T X)^{-1} c, \quad ( c = X^T Y a \in \mathbb{R}^p ). \]

Assuming \(K < \min(N, p)\), \(X^T Y\) is of full column rank and thus \(c \neq 0\).

Since \((X^T X)^{-1}\) is P.D., \(c^T (X^T X)^{-1} c = a^T H a > 0\), and thus \(H\) is P.D. (and invertible). Then,

\[ (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} = (N-K) (H - H D H)^{-1} = (N-K) (I - D H)^{-1} H^{-1}, \]

\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = x^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T \frac{1}{N_k} X^T Y_k = \frac{1}{N_k} \left[ x^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y \right]_k, \] and

\[\begin{aligned} \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y &= (N-K) \hat{B} (I - D H)^{-1} H^{-1} \hat{B}^T X^T Y \\ &= (N-K) \hat{B} (I - D H)^{-1} \quad \text{∵ } H = \hat{B}^T X^T Y \\ &= \hat{\Sigma}^{-1} \left[ (N-K) \hat{\Sigma} \right] \hat{B} (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} X^T (I - Y D Y^T) X \hat{B} (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} (X^T X \hat{B} - X^T Y D Y^T X \hat{B}) (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} (X^T Y - X^T Y D H) (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} X^T Y (I - D H) (I - D H)^{-1} = \hat{\Sigma}^{-1} X^T Y. \end{aligned}\]

So,

\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = \frac{1}{N_k} \left[ x^T \hat{\Sigma}^{-1} X^T Y \right]_k = x^T \hat{\Sigma}^{-1} \left( \frac{1}{N_k} X^T Y_k \right) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k. \]
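The key cancellation \(\hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y = \hat{\Sigma}^{-1} X^T Y\) can be checked numerically; a sketch with synthetic data and no intercept column, so that \(\hat{B}\) is \(p \times K\) (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, K = 60, 4, 3

X = rng.normal(size=(N, p))
g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # all classes occupied
Y = np.eye(K)[g]
D = np.diag(1.0 / Y.sum(axis=0))

B = np.linalg.solve(X.T @ X, X.T @ Y)                   # p x K regression coefficients
Sigma = X.T @ (np.eye(N) - Y @ D @ Y.T) @ X / (N - K)   # pooled within-class covariance

lhs = B @ np.linalg.solve(B.T @ Sigma @ B, B.T @ X.T @ Y)
rhs = np.linalg.solve(Sigma, X.T @ Y)
print(np.allclose(lhs, rhs))
```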

(3) \(\tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k\)

Let \(\tilde{M} = [\tilde{\mu}_1 \dots \tilde{\mu}_K]\) be a \(K \times K\) matrix. Since

\[ \tilde{\mu}_k = \hat{B}^T \frac{1}{N_k} X^T Y_k = \hat{B}^T X^T \left( \frac{1}{N_k} Y_k \right), \]

stacking the columns gives \(\tilde{M} = \hat{B}^T X^T Y D\). Then, using the previous identity \(\hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y = \hat{\Sigma}^{-1} X^T Y\) and \(\tilde{\mu}_k = \hat{B}^T \hat{\mu}_k\),

\[ \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{M} = \hat{\mu}_k^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y D = \hat{\mu}_k^T \hat{\Sigma}^{-1} X^T Y D. \]

Thus,

\[ \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = \hat{\mu}_k^T \hat{\Sigma}^{-1} X^T [Y D]_k = \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k. \]

(4) Conclusion

\[\begin{aligned} \hat{\delta}_k(\hat{y}) &= \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k \\ &= \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k = \delta_k(x). \quad \blacksquare \end{aligned}\]
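Putting the pieces together, the following sketch (synthetic data, no intercept column; names hypothetical) checks that the discriminants computed from the transformed quantities match those computed in the original space:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, K = 80, 5, 3

X = rng.normal(size=(N, p))
g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # all classes occupied
Y = np.eye(K)[g]
Nk = Y.sum(axis=0)
D = np.diag(1.0 / Nk)
pi = Nk / N

B = np.linalg.solve(X.T @ X, X.T @ Y)                   # p x K
Sigma = X.T @ (np.eye(N) - Y @ D @ Y.T) @ X / (N - K)   # pooled covariance (original space)
M = X.T @ Y @ D                                         # column k is mu_hat_k

Sigma_t, M_t = B.T @ Sigma @ B, B.T @ M                 # transformed-space estimates
Si, Sti = np.linalg.inv(Sigma), np.linalg.inv(Sigma_t)

x = rng.normal(size=p)
y_hat = B.T @ x

d_orig = np.array([np.log(pi[k]) + x @ Si @ M[:, k] - 0.5 * M[:, k] @ Si @ M[:, k]
                   for k in range(K)])
d_trans = np.array([np.log(pi[k]) + y_hat @ Sti @ M_t[:, k] - 0.5 * M_t[:, k] @ Sti @ M_t[:, k]
                    for k in range(K)])
print(np.allclose(d_orig, d_trans))
```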

4. Consider the classification problem with \(K\) classes and the \(p\)-dimensional predictor \(X\). Let \(s = K - 1\) and \(p \geq s\). For \(k = 1, \dots, s\), we code class \(k\) as an indicator \(s\)-vector \(Y = (Y_1, \dots, Y_s)\) with \(Y_k = 1\) and \(Y_i = 0\) for \(i \neq k\), and class \(K\) is coded as \(Y = (0, \dots, 0)\). Let \(S_{11}, S_{22}, S_{12}\) denote the sample covariance matrices scaled by \(N\) for \(X, Y\), and \((X, Y)\). Show that \[S_B = N^{-1} S_{12} S_{22}^{-1} S_{21}\] where \(S_B\) is the between-class covariance matrix.

Solution

(1) Preliminary

Let \(H = I_N - \frac{1}{N} 1_N 1_N^T = I_N - \frac{1}{N} J_N\) be the centering matrix,
where \(J_N := 1_N 1_N^T\) and \(1_N := (1, \dots, 1)^T \in \mathbb{R}^N\).

Write \(S := K - 1\;(= s)\). The matrix \(Y \in \mathbb{R}^{N \times S}\) is defined as \(Y = [Y_1, \dots, Y_S]\),
where each \(Y_k \in \{0, 1\}^N\) is an indicator vector and \(N_k = Y_k^T 1_N\) \((k = 1, \dots, S)\).

\[\begin{aligned} N^{-1} Y^T 1_N &= \bar{Y} \in \mathbb{R}^S, \quad \text{where } (\bar{Y})_k = \frac{1}{N} \sum_{i} Y_{ik} = \frac{N_k}{N}, \\ N_k^{-1} X^T Y_k &= \bar X_k \in \mathbb{R}^p, \quad \text{where } (\bar{X}_k)_i \text{ is the class-} k \text{ average of the } i \text{th column of } X, \\ N^{-1} X^T 1_N &= \bar{X} \in \mathbb{R}^p, \quad \text{where } (\bar{X})_i \text{ is the overall average of the } i \text{th column of } X. \end{aligned}\]

(2) \(S_{12}\)

\[\begin{aligned} S_{12} &= N^{-1} X^T H Y = N^{-1} X^T (I_N - N^{-1} 1_N 1_N^T) Y \\ &= N^{-1} X^T Y - N^{-1} X^T 1_N \, N^{-1} 1_N^T Y \\ &= N^{-1} X^T Y - N^{-1} X^T 1_N \begin{bmatrix} \frac{N_1}{N} & \dots & \frac{N_S}{N} \end{bmatrix} \qquad \because N^{-1} 1_N^T Y = \bar{Y}^T \\ &= \frac{1}{N} \begin{bmatrix} X^T Y_1 - X^T 1_N \frac{N_1}{N} & \dots & X^T Y_S - X^T 1_N \frac{N_S}{N} \end{bmatrix} \\ &= \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix}. \end{aligned}\]

(3) \(S_{22}^{-1}\)

\[ S_{22} = \frac{1}{N} Y^T H Y = \frac{1}{N} Y^T (I_N - N^{-1} J_N) Y = \frac{1}{N} Y^T Y - N^{-2}\, Y^T J_N Y. \]

1) \(Y^T Y = \text{diag} \{ N_1, \dots, N_S \}\)

\(Y_k\) is an indicator vector showing whether observation \(i\) is in class \(k\). Each observation belongs to exactly one class, so for \(k \neq \ell\), \((Y_k)_i\) and \((Y_\ell)_i\) cannot both be 1. Hence \((Y_k)_i (Y_\ell)_i = 0\) for all \(i\), and \(Y_k^T Y_\ell = 0\).

Note that \(Y_k^T Y_k = N_k\). Then

\[ Y^T Y = \begin{bmatrix} Y_1^T \\ \vdots \\ Y_S^T \end{bmatrix} [Y_1, \dots, Y_S] = \text{diag} \{ N_1, \dots, N_S \} \]

2) \(S_{22}^{-1}\) using Woodbury Matrix identity

Since the Woodbury matrix identity is
\((A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}\), we may use it as follows:

\[\begin{aligned} S_{22}^{-1} &= \left( \frac{1}{N} Y^T Y - \frac{1}{N^2} Y^T 1_N 1_N^T Y \right)^{-1} \\ &= N (Y^T Y)^{-1} - N (Y^T Y)^{-1} \frac{1}{N} Y^T 1_N \left( -1 + \frac{1}{N} 1_N^T Y \, N (Y^T Y)^{-1} \frac{1}{N} Y^T 1_N \right)^{-1} \frac{1}{N} 1_N^T Y \, N (Y^T Y)^{-1} \\ &= N (Y^T Y)^{-1} - (Y^T Y)^{-1} Y^T 1_N \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} 1_N^T Y (Y^T Y)^{-1}, \end{aligned}\]
with \(A = \frac{1}{N} Y^T Y\), \(U = \frac{1}{N} Y^T 1_N\), \(C = -1\), and \(V = \frac{1}{N} 1_N^T Y\).


3) \(1_N^T Y (Y^T Y)^{-1} Y^T 1_N\) and \((Y^T Y)^{-1} Y^T 1_N\)

\[\begin{aligned} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N &= 1_N^T Y \,\text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} Y^T 1_N \\ &= (N_1, \dots, N_S) \,\text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 \\ \vdots \\ N_S \end{bmatrix} \\ &= N_1 + \dots + N_S = N - N_K \end{aligned}\]

Then,

\[ \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} = \left( -1 + \frac{1}{N} (N - N_K) \right)^{-1} = \left( - \frac{N_K}{N} \right)^{-1} = - \frac{N}{N_K}, \] and \[ (Y^T Y)^{-1} Y^T 1_N = \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 \\ \vdots \\ N_S \end{bmatrix} = 1_S. \]

4) Conclusion

\[ S_{22}^{-1} = N (Y^T Y)^{-1} - 1_S \left( - \frac{N}{N_K} \right) 1_S^T = N (Y^T Y)^{-1} + \frac{N}{N_K}\, 1_S 1_S^T. \]
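The closed form for \(S_{22}^{-1}\) can be verified numerically (synthetic indicator matrix; the first \(K\) observations are forced to cover every class, all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 40, 4
s = K - 1

g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # every class occupied
Y = np.eye(K)[g][:, :s]            # class K coded as the zero vector
Nk = Y.sum(axis=0)
NK = N - Nk.sum()                  # size of class K

H = np.eye(N) - np.ones((N, N)) / N
S22 = Y.T @ H @ Y / N

closed_form = N * np.diag(1.0 / Nk) + (N / NK) * np.ones((s, s))
print(np.allclose(np.linalg.inv(S22), closed_form))
```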


(4) \(N S_{12} S_{22}^{-1} S_{21}\)

It remains to show

\[ N S_{12} S_{22}^{-1} S_{21} = S_B := \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T. \]

\[\begin{aligned} N S_{12} S_{22}^{-1} S_{21} &= N \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \left( N (Y^T Y)^{-1} + \frac{N}{N_K} 1_S 1_S^T \right) \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} \\ &= \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \left( \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} + \frac{1}{N_K} 1_S 1_S^T \right) \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} \end{aligned}\]

We split this into two parts:

1)

\[ \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} = \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T. \]

2)

\[ \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} 1_S = \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}). \]

Note that \((N_k \bar{X}_k)_i\) is the sum of the \(X\)'s in the \(i\)th column whose class is \(k\), and \((N \bar{X})_i\) is the sum of the \(X\)'s in the \(i\)th column over all classes \((1, \dots, S, K)\), \(i = 1, \dots, p\). Then, \((N \bar{X})_i - \sum_{k=1}^{S} (N_k \bar{X}_k)_i\) is the sum of the \(X\)'s in the \(i\)th column whose class is \(K\).
Using this, \(\sum_{k=1}^{S} N_k \bar{X}_k = N \bar{X} - N_K \bar{X}_K\).

Also, \(\sum_{k=1}^{S} N_k \bar{X} = (N - N_K) \bar{X}\). So,

\[ \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) = N \bar{X} - N_K \bar{X}_K - (N - N_K) \bar{X} = -N_K (\bar{X}_K - \bar{X}). \]

The sign cancels in the outer product, so

\[ \frac{1}{N_K} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} 1_S 1_S^T \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} = N_K (\bar{X}_K - \bar{X}) (\bar{X}_K - \bar{X})^T. \]

3) Conclusion

Hence,

\[\begin{aligned} N S_{12} S_{22}^{-1} S_{21} &= \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T + N_K (\bar{X}_K - \bar{X}) (\bar{X}_K - \bar{X})^T \\ &= \sum_{k=1}^{K} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T = S_B. \quad \blacksquare \end{aligned}\]
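Finally, the whole identity \(N S_{12} S_{22}^{-1} S_{21} = S_B\) can be verified numerically on synthetic data (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, K = 50, 3, 4
s = K - 1

X = rng.normal(size=(N, p))
g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # every class occupied
Y = np.eye(K)[g][:, :s]            # class K coded as the zero vector

H = np.eye(N) - np.ones((N, N)) / N
S12 = X.T @ H @ Y / N
S22 = Y.T @ H @ Y / N

# Between-class covariance, summing over all K classes.
xbar = X.mean(axis=0)
S_B = sum((g == k).sum() * np.outer(X[g == k].mean(axis=0) - xbar,
                                    X[g == k].mean(axis=0) - xbar)
          for k in range(K))

lhs = N * S12 @ np.linalg.inv(S22) @ S12.T
print(np.allclose(lhs, S_B))
```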