1. Let \(G\) denote the class label in this subsection. Define \(Y = (Y_1, \dots, Y_K)\) where \(Y_k = 1\) if \(G = k\), else \(Y_k = 0\). \(\mathbf Y\) is an \(N \times K\) indicator response matrix, where each row contains a single 1. Fit a linear regression model to each column of \(\mathbf Y\) simultaneously: \[\begin{aligned} \hat{Y} &= X (X^T X)^{-1} X^T Y \\ \hat{B} &= (X^T X)^{-1} X^T Y, \quad (p+1) \times K \text{ coefficient matrix},\end{aligned}\] where \(X\) is the model matrix with \((p+1)\) columns for the \(p\) inputs and an intercept. With input \(x\), compute the fitted output \(\hat{f}(x)^T = (1, x^T) \hat{B}\) and identify the largest component and classify accordingly \(\hat{G}(x) = \arg\max_{k \in G} \hat{f}_k(x).\) Then, show that the sum of the fitted values satisfies:\[\sum_{k \in G} \hat{f}_k(x) = 1.\]
Solution
Let
\[ X = \begin{bmatrix} 1_N & X_{(p)}
\end{bmatrix}, \]
where \(1_N = (1, \dots, 1)^T\) and
\(X_{(p)}\) is an \(N \times p\) matrix.
\[X^T X = \begin{bmatrix} 1_N^T 1_N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix} = \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}.\]
Since each row of \(Y\) contains a single 1, \(Y \mathbf 1_K = 1_N\), so
\[ X^T Y\mathbf 1_K = \begin{bmatrix} 1_N^T \\ X_{(p)}^T \end{bmatrix} 1_N = \begin{bmatrix} N \\ X_{(p)}^T 1_N \end{bmatrix}. \]
Then,
\[
\hat{B}\mathbf 1_K = (X^T X)^{-1} X^T Y\mathbf 1_K = \begin{bmatrix}
N & 1_N^T X_{(p)} \\
X_{(p)}^T 1_N & X_{(p)}^T X_{(p)}
\end{bmatrix}^{-1}
\begin{bmatrix}
N \\
X_{(p)}^T 1_N
\end{bmatrix}.
\]
Note that the vector on the right is exactly the first column of \(X^T X\), so multiplying by \((X^T X)^{-1}\) picks out the first column of the identity matrix:
\[ \hat{B}\mathbf 1_K = (X^T X)^{-1} \begin{bmatrix} N \\ X_{(p)}^T 1_N \end{bmatrix} = \begin{bmatrix} 1 \\ 0_p \end{bmatrix}, \quad \text{where } 0_p = (0, \dots, 0)^T \in \mathbb{R}^p. \]
Since
\[ \hat{f}(x)^T = (1, x^T) \hat{B}, \] we have \[ \sum_{k=1}^{K} \hat{f}_k(x) = \hat{f}(x)^T 1_K = (1, x^T) \hat{B} 1_K= (1, x^T) \begin{bmatrix} 1 \\ 0_p \end{bmatrix} = 1. \]
Thus, we have proved that
\[ \sum_{k=1}^{K} \hat{f}_k(x) = 1.
\]
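The row-sum identity can be checked numerically. Below is a minimal numpy sketch on synthetic data (all names, sizes, and the random seed are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 50, 3, 4

# Model matrix with intercept column and indicator response matrix.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
Y = np.eye(K)[rng.integers(0, K, size=N)]   # each row has a single 1

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (X^T X)^{-1} X^T Y
x_new = np.append(1.0, rng.normal(size=p))  # (1, x^T)
f_hat = x_new @ B_hat

print(np.isclose(f_hat.sum(), 1.0))  # True
```

The check holds for any \(x\), since \(\hat{B}\mathbf 1_K = (1, 0_p^T)^T\) does not depend on the query point.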
2. (Ex 4.2) Suppose we have features \(x \in \mathbb{R}^p\) and a two-class response, with class sizes \(n_1, n_2\), and the target coded as \(-n/n_1\) for class 1 and \(n/n_2\) for class 2.
(a) Show that the LDA rule classifies to class 2 if \[ x^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log\left(\frac{n_1}{n}\right) - \log\left(\frac{n_2}{n}\right),\] and class 1 otherwise.
Solution
The LDA classifies to class 2 if \(\delta_2(x) > \delta_1(x)\). That is,
\[ x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 > x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1. \]
In this case, \(\hat{\pi}_1 = \frac{n_1}{n}\) and \(\hat{\pi}_2 = \frac{n_2}{n}\), where \(n = n_1 + n_2\). Rearranging, \(\delta_2(x) > \delta_1(x)\) if and only if
\[ x^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \frac{n_1}{n} - \log \frac{n_2}{n}, \]
so the LDA rule classifies to class 2 exactly when this inequality holds, and to class 1 otherwise.
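As a numerical sanity check of the equivalence between \(\delta_2(x) > \delta_1(x)\) and the rearranged inequality, here is a numpy sketch with arbitrary (hypothetical) estimates; any positive definite \(\hat\Sigma\) works:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 3, 12, 20
n = n1 + n2

# Hypothetical sample estimates; the identity is purely algebraic.
mu1, mu2 = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma_inv = np.linalg.inv(A @ A.T + np.eye(p))   # inverse of a PD matrix

def delta(x, mu, pi):
    # LDA discriminant function delta_k(x) with shared covariance
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(pi)

threshold = (0.5 * mu2 @ Sigma_inv @ mu2 - 0.5 * mu1 @ Sigma_inv @ mu1
             + np.log(n1 / n) - np.log(n2 / n))
agree = all(
    (x @ Sigma_inv @ (mu2 - mu1) > threshold)
    == (delta(x, mu2, n2 / n) > delta(x, mu1, n1 / n))
    for x in rng.normal(size=(100, p))
)
print(agree)  # True
```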
(b) Consider minimization of the least squares criterion \[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta^T x_i)^2. \] Show that the solution \(\hat{\beta}\) satisfies \[ \left[(n - 2) \hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B \right] \hat{\beta} = n(\hat{\mu}_2 - \hat{\mu}_1),\] where \(\hat{\Sigma}_B = (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T\).
Solution
Let
\[ x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p, \quad Y = (y_1, \dots, y_n)^T \in \mathbb{R}^n, \quad \beta = (\beta_1, \dots, \beta_p)^T \in \mathbb{R}^p. \]
We will show that eliminating \(\hat{\beta}_0\) from the normal equations yields:
\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right). \]
Define:
\[ X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad X^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}, \]
and
\[ X^T X = \begin{bmatrix} n & \sum_{i=1}^{n} x_i^T \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i x_i^T \end{bmatrix}. \]
Also,
\[ X^T Y = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}. \]
The least squares estimate (LSE) solution \((\hat{\beta}_0, \hat{\beta}^T)^T\) satisfies:
\[ X^T X \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{bmatrix} = X^T Y. \]
That is,
\[ \begin{bmatrix} n & \sum_{i=1}^{n} x_i^T \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i x_i^T \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{bmatrix}= \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}. \]
We obtain:
\[ n \hat{\beta}_0 + \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i, \]
\[ \left( \sum_{i=1}^{n} x_i \right) \hat{\beta}_0 + \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i. \]
Since,
\[ \hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^T \hat{\beta}, \]
we substitute:
\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right). \]
It remains to identify both sides. With the coding \(y_i = -n/n_1\) for class 1 and \(y_i = n/n_2\) for class 2, \(\sum_i y_i = -n + n = 0\), so the right-hand side is
\[ \sum_{i=1}^{n} y_i x_i = -\frac{n}{n_1} \sum_{i \in \text{class 1}} x_i + \frac{n}{n_2} \sum_{i \in \text{class 2}} x_i = n(\hat{\mu}_2 - \hat{\mu}_1). \]
For the left-hand side, write \(\sum_i x_i = n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2\) and use \((n-2)\hat{\Sigma} = \sum_i x_i x_i^T - n_1 \hat{\mu}_1 \hat{\mu}_1^T - n_2 \hat{\mu}_2 \hat{\mu}_2^T\):
\[\begin{aligned} \sum_i x_i x_i^T - \frac{1}{n} \Big( \sum_i x_i \Big) \Big( \sum_i x_i^T \Big) &= (n-2)\hat{\Sigma} + n_1 \hat{\mu}_1 \hat{\mu}_1^T + n_2 \hat{\mu}_2 \hat{\mu}_2^T - \frac{1}{n} (n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2)(n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2)^T \\ &= (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T = (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B, \end{aligned}\]
since \(n_1 - n_1^2/n = n_2 - n_2^2/n = n_1 n_2 / n\). Combining the two sides gives the stated identity.
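The identity can be verified numerically. A minimal numpy sketch with synthetic two-class data (all names, sizes, and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 4, 15, 25
n = n1 + n2

X1 = rng.normal(size=(n1, p))                  # class 1
X2 = rng.normal(loc=1.0, size=(n2, p))         # class 2
y = np.concatenate([np.full(n1, -n / n1), np.full(n2, n / n2)])

# Least squares with intercept; keep only the slope part beta_hat.
Xd = np.hstack([np.ones((n, 1)), np.vstack([X1, X2])])
beta_hat = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma_hat = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
Sigma_B = np.outer(mu2 - mu1, mu2 - mu1)

lhs = ((n - 2) * Sigma_hat + n1 * n2 / n * Sigma_B) @ beta_hat
print(np.allclose(lhs, n * (mu2 - mu1)))  # True
```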
(c) Hence show that \(\hat{\Sigma}_B \hat{\beta}\) is in the direction \((\hat{\mu}_2 - \hat{\mu}_1)\) and thus \[\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1).\] Therefore, the least squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.
Solution
We now show that \(\hat{\Sigma}_B \hat{\beta}\) is a scalar multiple of \(\hat{\mu}_2 - \hat{\mu}_1\). Indeed,
\[ \hat{\Sigma}_B \hat{\beta} = (\hat{\mu}_2 - \hat{\mu}_1) (\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta} = c (\hat{\mu}_2 - \hat{\mu}_1), \quad \text{where } c = (\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta} \in \mathbb{R}. \]
Substituting into the identity from part (b),
\[ (n - 2) \hat{\Sigma} \hat{\beta} = n(\hat{\mu}_2 - \hat{\mu}_1) - \frac{n_1 n_2}{n} c (\hat{\mu}_2 - \hat{\mu}_1) \propto (\hat{\mu}_2 - \hat{\mu}_1), \]
which implies:
\[ \hat{\beta} \propto \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1). \]
(d) Show that this result holds for any (distinct) coding of the two classes.
Solution
Repeating the derivation in part (b) with an arbitrary coding \(t_1 \neq t_2\) for the two classes gives \(\sum_i y_i x_i - \frac{1}{n} \left( \sum_i x_i \right) \left( \sum_i y_i \right) = \frac{n_1 n_2}{n} (t_2 - t_1)(\hat{\mu}_2 - \hat{\mu}_1)\), so
\[ \text{RHS} \propto (\hat{\mu}_2 - \hat{\mu}_1). \]
The left-hand side does not involve the coding, so
\[ \left( (n - 2) \hat{\Sigma} + \frac{n_1 n_2}{n} (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T \right) \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1). \]
Thus,
\[ (n - 2) \hat{\Sigma} \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1), \]
and hence,
\[ \hat{\beta} \propto \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1). \]
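The coding-invariance of the direction can be checked numerically. A numpy sketch with synthetic data and a few arbitrary codings (names, sizes, and codings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2 = 3, 10, 14
n = n1 + n2
X1, X2 = rng.normal(size=(n1, p)), rng.normal(loc=2.0, size=(n2, p))
Xd = np.hstack([np.ones((n, 1)), np.vstack([X1, X2])])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma_hat = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
lda_dir = np.linalg.solve(Sigma_hat, mu2 - mu1)   # Sigma^{-1}(mu2 - mu1)

def unit(v):
    return v / np.linalg.norm(v)

# Several distinct codings (t1, t2); each should give the same direction.
cosines = []
for t1, t2 in [(-n / n1, n / n2), (0.0, 1.0), (-1.0, 1.0), (3.0, -7.0)]:
    y = np.concatenate([np.full(n1, t1), np.full(n2, t2)])
    beta_hat = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]
    cosines.append(abs(unit(beta_hat) @ unit(lda_dir)))

print(np.allclose(cosines, 1.0))  # True
```

The absolute cosine is used because the sign of \(\hat\beta\) flips when the coding is reversed.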
3. (Ex 4.3) Suppose we transform the original predictors \(X\) to \(\hat{Y}\) via linear regression. In detail, let: \(\hat{Y} = X(X^T X)^{-1} X^T Y = X \hat{B},\) where \(Y\) is the indicator response matrix. Similarly, for any input \(x \in \mathbb{R}^p\), we get a transformed vector:\(\hat{x} = \hat{B}^T x \in \mathbb{R}^K.\) Show that LDA using \(\hat{Y}\) is identical to LDA in the original space.
Solution
(1) Preliminary
Recall that each row of \(Y\) has a single 1 and \(Y \in \mathbb{R}^{N \times K}\).
The main point is to show that the linear discriminant function w.r.t. \(x\), \(\delta_k(x)\), is identical to the one w.r.t. \(\hat{y} = \hat{B}^T x\), \(\hat{\delta}_k(\hat{y})\):
\[ \delta_k(x) = \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k, \]
\[ \hat{\delta}_k(\hat{y}) = \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k. \]
Here \(\tilde{\pi}_k\), \(\tilde{\mu}_k\), \(\tilde{\Sigma}\) denote the estimated class probability, class mean, and pooled covariance in the transformed space. Since the class labels are unchanged by the transformation,
\(\tilde{\pi}_k = \hat{\pi}_k = \frac{N_k}{N}.\)
Let \(Y_k\) denote the \(k\)-th column of \(Y\), the indicator of class \(k\), so that \(X^T Y_k \in \mathbb{R}^{p}\) (here \(X\) is taken without the intercept column, so that \(\hat{B} \in \mathbb{R}^{p \times K}\) and \(\hat{y} = \hat{B}^T x \in \mathbb{R}^K\)).
\[\begin{aligned} \tilde{\mu}_k &= \frac{1}{N_k} \sum_{i: y_i = k} \hat{B}^T x_i = \hat{B}^T \left( \frac{1}{N_k} \sum_{i: y_i = k} x_i \right) = \hat{B}^T \frac{1}{N_k} X^T Y_k = \hat{B}^T \hat{\mu}_k, \end{aligned}\]
where \[ \hat{B} = (X^T X)^{-1} X^T Y \in \mathbb{R}^{p \times K}. \]
\[\begin{aligned} \tilde{\Sigma} &= \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( \hat{y}_i - \tilde{\mu}_k ) ( \hat{y}_i - \tilde{\mu}_k )^T \\ &= \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( \hat{B}^T x_i - \hat{B}^T \hat{\mu}_k )( \hat{B}^T x_i - \hat{B}^T \hat{\mu}_k )^T \\ &= \hat{B}^T \left[ \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( x_i - \hat{\mu}_k )( x_i - \hat{\mu}_k )^T \right] \hat{B} = \hat{B}^T \hat{\Sigma} \hat{B}. \end{aligned}\]
\[\begin{aligned} (N-K) \hat{\Sigma} &= \sum_{k=1}^{K} \sum_{i: y_i = k} ( x_i - \hat{\mu}_k )( x_i - \hat{\mu}_k )^T \\ &= \sum_i x_i x_i^T - 2 \sum_{k=1}^{K} \sum_{i: y_i = k} x_i \hat{\mu}_k^T + \sum_{k=1}^{K} N_k \hat{\mu}_k \hat{\mu}_k^T \\ &= \sum_i x_i x_i^T - 2 \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X + \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X \\ &= X^T X - \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X \\ &= X^T X - X^T Y D Y^T X, \quad ( D = \text{diag} \{ N_1^{-1}, N_2^{-1}, \dots, N_K^{-1} \} ) \\ &= X^T (I - Y D Y^T) X \end{aligned}\]
So \(\hat{\delta}_k\) can be written as
\[ \hat{\delta}_k(\hat{y}) = \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k. \]
We simply assume that \(\hat{\Sigma} \in \mathbb{R}^{p \times p}\) and \(\tilde{\Sigma} \in \mathbb{R}^{K \times K}\) are invertible.
(2) \(\hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k\)
\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = (\hat{B}^T x)^T (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T \frac{1}{N_k} X^T Y_k, \] and \[\begin{aligned} \hat{B}^T \hat{\Sigma} \hat{B} &= \frac{1}{N-K} \hat{B}^T X^T (I - Y D Y^T) X \hat{B} \\ &= \frac{1}{N-K} \left( \hat{B}^T X^T X \hat{B} - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( Y^T X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T Y - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( Y^T X (X^T X)^{-1} X^T Y - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( H - H D H \right) \end{aligned}\] by letting \(H := \hat{B}^T X^T Y = Y^T X (X^T X)^{-1} X^T Y\), so \(H^T = H\).
For nonzero \(a \in \mathbb{R}^K\),
\[ a^T H a = a^T Y^T X (X^T X)^{-1} X^T Y a = c^T (X^T X)^{-1} c, \quad ( c = X^T Y a \in \mathbb{R}^p ). \]
Assuming \(K < \min(N, p)\), \(X^T Y\) has full column rank and thus \(c \neq 0\).
Since \((X^T X)^{-1}\) is P.D., \(c^T (X^T X)^{-1} c = a^T H a > 0\), and thus \(H\) is P.D. (and invertible). Then,
\[ (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} = (N-K) (H - H D H)^{-1} = (N-K) (I - D H)^{-1} H^{-1}, \]
\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = (\hat{B}^T x)^T (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T \frac{1}{N_k} X^T Y_k = \frac{1}{N_k} \left[ x^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y \right]_k, \] where \([\,\cdot\,]_k\) denotes the \(k\)-th column, and
\[\begin{aligned} \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y &= (N-K) \hat{B} (I - D H)^{-1} H^{-1} \hat{B}^T X^T Y \\ &= (N-K) \hat{B} (I - D H)^{-1} \quad \text{∵ } H = \hat{B}^T X^T Y \\ &= (N-K) \hat{\Sigma}^{-1} \hat{\Sigma} \hat{B} (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} X^T (I - Y D Y^T) X \hat{B} (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} (X^T X \hat{B} - X^T Y D Y^T X \hat{B}) (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} (X^T Y - X^T Y D H) (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} X^T Y (I - D H) (I - D H)^{-1} = \hat{\Sigma}^{-1} X^T Y \end{aligned}\]
So,
\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = \frac{1}{N_k} \left[ x^T \hat{\Sigma}^{-1} X^T Y \right]_k = x^T \hat{\Sigma}^{-1} \left( \frac{1}{N_k} X^T Y_k \right) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k. \]
(3) \(\tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k\)
Let \(\tilde{M} = [\tilde{\mu}_1 \cdots \tilde{\mu}_K] \in \mathbb{R}^{K \times K}\). Since
\[ \tilde{\mu}_k = \hat{B}^T \frac{1}{N_k} X^T Y_k = \hat{B}^T X^T Y D e_k, \]
where \(e_k\) is the \(k\)-th standard basis vector, we have \(\tilde{M} = \hat{B}^T X^T Y D\). By the identity from (2),
\[ \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{M} = \hat{\mu}_k^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y D = \hat{\mu}_k^T \hat{\Sigma}^{-1} X^T Y D. \]
Thus,
\[ \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = \hat{\mu}_k^T \hat{\Sigma}^{-1} X^T [Y D]_k = \hat{\mu}_k^T \hat{\Sigma}^{-1} \left( \frac{1}{N_k} X^T Y_k \right) = \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k. \]
(4) Conclusion
\[\begin{aligned} \hat{\delta}_k(\hat{y}) &= \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k \\ &= \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k = \delta_k(x). \quad \blacksquare \end{aligned}\]
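The equality of discriminant functions can be checked numerically under the no-intercept convention used in this solution (\(\hat{B} \in \mathbb{R}^{p \times K}\)). A numpy sketch with synthetic data (all names, sizes, and class counts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, K = 60, 5, 3
counts = np.array([18, 22, 20])                 # N_1, ..., N_K
G = np.repeat(np.arange(K), counts)
X = rng.normal(size=(N, p))                     # no intercept column
Y = np.eye(K)[G]

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)       # (X^T X)^{-1} X^T Y
D = np.diag(1.0 / counts)
Sigma_hat = X.T @ (np.eye(N) - Y @ D @ Y.T) @ X / (N - K)  # pooled within-class
Mu = X.T @ Y @ D                                # columns are mu_hat_k

def discriminants(x, Mu, Sigma, pi):
    # LDA discriminant values for all K classes
    Si = np.linalg.inv(Sigma)
    return np.array([np.log(pi[k]) + x @ Si @ Mu[:, k]
                     - 0.5 * Mu[:, k] @ Si @ Mu[:, k] for k in range(K)])

pi = counts / N
x = rng.normal(size=p)
d_orig = discriminants(x, Mu, Sigma_hat, pi)
d_tran = discriminants(B_hat.T @ x, B_hat.T @ Mu, B_hat.T @ Sigma_hat @ B_hat, pi)
print(np.allclose(d_orig, d_tran))  # True
```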
4. Consider the classification problem with \(K\) classes and the \(p\)-dimensional predictor \(X\). Let \(S = K - 1\) and \(p \geq S\). For \(k = 1, \dots, S\), we code class \(k\) as an indicator \(S\)-vector \(Y = (Y_1, \dots, Y_S)\) with \(Y_k = 1\) and \(Y_i = 0\) for \(i \neq k\), and class \(K\) is coded as \(Y = (0, \dots, 0)\). Let \(S_{11}, S_{22}, S_{12}\) denote the sample covariance matrices scaled by \(N\) for \(X, Y\), and \((X, Y)\). Show that \[S_B = N^{-1} S_{12} S_{22}^{-1} S_{21}\] where \(S_B\) is the between-class covariance matrix.
Solution
(1) Preliminary
Let \(H = I_N - \frac{1}{N} J_N\) be the centering matrix, where \(J_N := 1_N 1_N^T\) and \(1_N := (1, \dots, 1)^T \in \mathbb{R}^N\).
The matrix \(Y \in \mathbb{R}^{N \times S}\) is defined as \(Y = [Y_1, \dots, Y_S]\), where \(Y_k \in \mathbb{R}^N\) has entries \((Y_k)_i \in \{0, 1\}\) and \(N_k = Y_k^T 1_N\) \((k = 1, \dots, S)\).
\[\begin{aligned} N^{-1} Y^T 1_N &= \bar{Y} \in \mathbb{R}^S, \quad \text{where } (\bar{Y})_k = \frac{1}{N} \sum_{i} Y_{ik} = \frac{N_k}{N}, \\ N_k^{-1} X^T Y_k &= \bar X_k \in \mathbb{R}^p, \quad \text{where } (\bar{X}_k)_i \text{ is the mean of the } i \text{th column of } X \text{ over class } k, \\ N^{-1} X^T 1_N &= \bar{X} \in \mathbb{R}^p, \quad \text{where } (\bar{X})_i \text{ is the mean of the } i \text{th column of } X. \end{aligned}\]
(2) \(S_{12}\)
\[\begin{aligned} S_{12} &= N^{-1} X^T H Y = N^{-1} X^T (I_N - N^{-1} 1_N 1_N^T) Y \\ &= N^{-1} X^T Y - N^{-1} X^T 1_N N^{-1} 1_N^T Y \\ &= N^{-1} X^T Y - N^{-1} X^T 1_N \begin{bmatrix} \frac{N_1}{N} & \dots & \frac{N_S}{N} \end{bmatrix} \qquad \because 1_N^T Y = (N_1, \dots, N_S) \\ &= \frac{1}{N} \begin{bmatrix} X^T Y_1 - X^T 1_N \frac{N_1}{N} & \dots & X^T Y_S - X^T 1_N \frac{N_S}{N} \end{bmatrix} \\ &= \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix}. \end{aligned}\]
(3) \(S_{22}^{-1}\)
\[ S_{22} = \frac{1}{N} Y^T H Y = \frac{1}{N} Y^T (I_N - N^{-1} J_N) Y= \frac{1}{N} Y^T Y - N^{-1} Y^T J_N Y. \]
1) \(Y^T Y = \text{diag} \{ N_1, \dots, N_S \}\)
\(Y_k\) is an indicator vector showing whether observation \(i\) is in class \(k\). Each observation belongs to exactly one class, so \((Y_k)_i\) and \((Y_\ell)_i\) cannot both be 1 for \(k \neq \ell\); hence \((Y_k)_i (Y_\ell)_i = 0\) and \(Y_k^T Y_\ell = 0\).
Note that \(Y_k^T Y_k = N_k\). Then
\[ Y^T Y = \begin{bmatrix} Y_1^T \\ \vdots \\ Y_S^T \end{bmatrix} [Y_1, \dots, Y_S] = \text{diag} \{ N_1, \dots, N_S \} \]
2) \(S_{22}^{-1}\) using Woodbury Matrix identity
The Woodbury matrix identity is
\[ (A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}. \]
Writing \(S_{22} = \frac{1}{N} Y^T Y + Y^T \left( -\frac{1}{N^2} \right) 1_N 1_N^T Y\), we apply it with \(A = \frac{1}{N} Y^T Y\), \(U = Y^T 1_N\), \(C = -\frac{1}{N^2}\), \(V = 1_N^T Y\):
\[\begin{aligned} S_{22}^{-1} &= \left( \frac{1}{N} Y^T Y - \frac{1}{N^2} Y^T 1_N 1_N^T Y \right)^{-1} \\ &= N (Y^T Y)^{-1} - N (Y^T Y)^{-1} Y^T 1_N \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} \\ &= N (Y^T Y)^{-1} - (Y^T Y)^{-1} Y^T 1_N \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} 1_N^T Y (Y^T Y)^{-1}. \end{aligned}\]
3) \(1_N^T Y (Y^T Y)^{-1} Y^T 1_N\) and \((Y^T Y)^{-1} Y^T 1_N\)
\[\begin{aligned} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N &= 1_N^T Y \, \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} Y^T 1_N \\ &= (N_1, \dots, N_S) \, \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 \\ \vdots \\ N_S \end{bmatrix} \\ &= N_1 + \dots + N_S = N - N_K \end{aligned}\]
Then,
\[ \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} = \left( -1 + \frac{N - N_K}{N} \right)^{-1} = \left( -\frac{N_K}{N} \right)^{-1} = - \frac{N}{N_K} \] and \[ (Y^T Y)^{-1} Y^T 1_N = \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 \\ \vdots \\ N_S \end{bmatrix} = 1_S \]
4) Conclusion
\[ S_{22}^{-1} = N (Y^T Y)^{-1} - 1_S \left( - \frac{N}{N_K} \right) 1_S^T = N (Y^T Y)^{-1} + \frac{N}{N_K} 1_S 1_S^T. \]
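The closed form for \(S_{22}^{-1}\) can be checked numerically. A numpy sketch (class sizes are arbitrary illustrative choices):

```python
import numpy as np

N, K = 40, 4
S = K - 1
counts = np.array([10, 8, 12, 10])        # N_1, ..., N_K
G = np.repeat(np.arange(K), counts)
Y = np.eye(K)[G][:, :S]                   # class K coded as all zeros
NK = counts[K - 1]

H = np.eye(N) - np.ones((N, N)) / N       # centering matrix
S22 = Y.T @ H @ Y / N

# N (Y^T Y)^{-1} + (N / N_K) 1_S 1_S^T, with Y^T Y = diag{N_1, ..., N_S}
closed = N * np.diag(1.0 / counts[:S]) + (N / NK) * np.ones((S, S))
print(np.allclose(np.linalg.inv(S22), closed))  # True
```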
(4) \(N S_{12} S_{22}^{-1} S_{21}\)
It remains to show
\[ N S_{12} S_{22}^{-1} S_{21} = S_B := \sum_{k=1}^{K} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T, \]
the between-class scatter matrix; this is the stated identity up to the normalization convention for \(S_{12}\), \(S_{22}\), and \(S_B\).
\[\begin{aligned} N S_{12} S_{22}^{-1} S_{21} &= N \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \left( N (Y^T Y)^{-1} + \frac{N}{N_K} 1_S 1_S^T \right) \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} \\ &= \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \left( \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} + \frac{1}{N_K} 1_S 1_S^T \right) \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} \end{aligned}\]
We split this into two parts:
1)
\[ \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} = \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T. \]
2)
\[ \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} 1_S = \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}). \]
Note that \((N_k \bar{X}_k)_i\) is the sum of the \(i\)th column of \(X\) over class \(k\), and \((N \bar{X})_i\) is the sum of the \(i\)th column of \(X\) over all classes \((1, \dots, S, K)\), \(i = 1, \dots, p\). Then \((N \bar{X})_i - \sum_{k=1}^{S} (N_k \bar{X}_k)_i\) is the sum of the \(i\)th column of \(X\) over class \(K\). Using this, \(\sum_{k=1}^{S} N_k \bar{X}_k = N \bar{X} - N_K \bar{X}_K\).
Also, \(\sum_{k=1}^{S} N_k \bar{X} = (N - N_K) \bar{X}\). So
\[ \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) = (N \bar{X} - N_K \bar{X}_K) - (N - N_K) \bar{X} = -N_K (\bar{X}_K - \bar{X}). \]
The sign cancels in the outer product, so
\[ \frac{1}{N_K} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} 1_S 1_S^T \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} = N_K (\bar{X}_K - \bar{X}) (\bar{X}_K - \bar{X})^T. \]
3) Conclusion
Hence,
\[\begin{aligned} N S_{12} S_{22}^{-1} S_{21} &= \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T + N_K (\bar{X}_K - \bar{X}) (\bar{X}_K - \bar{X})^T \\ &= \sum_{k=1}^{K} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T = S_B. \end{aligned}\]
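The final identity can also be verified numerically. A numpy sketch with synthetic data (sizes, class counts, and the seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, K = 40, 3, 4
S = K - 1
counts = np.array([10, 8, 12, 10])        # class sizes N_1, ..., N_K
G = np.repeat(np.arange(K), counts)
X = rng.normal(size=(N, p))
Y = np.eye(K)[G][:, :S]                   # class K coded as all zeros

H = np.eye(N) - np.ones((N, N)) / N       # centering matrix
S12 = X.T @ H @ Y / N
S22 = Y.T @ H @ Y / N

# Between-class scatter over all K classes.
xbar = X.mean(axis=0)
S_B = sum(counts[k] * np.outer(X[G == k].mean(axis=0) - xbar,
                               X[G == k].mean(axis=0) - xbar)
          for k in range(K))

print(np.allclose(N * S12 @ np.linalg.solve(S22, S12.T), S_B))  # True
```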