Elements of Statistical Learning Ch4

1. Let \(G\) denote the class label in this subsection. Define \(Y = (Y_1, \dots, Y_K)\), where \(Y_k = 1\) if \(G = k\) and \(Y_k = 0\) otherwise. \(\mathbf Y\) is an \(N \times K\) indicator response matrix, each row of which contains a single 1. Fit a linear regression model to each column of \(\mathbf Y\) simultaneously: \[\begin{aligned} \hat{Y} &= X (X^T X)^{-1} X^T Y, \\ \hat{B} &= (X^T X)^{-1} X^T Y, \quad (p+1) \times K \text{ coefficient matrix},\end{aligned}\] where \(X\) is the model matrix with \(p+1\) columns for the \(p\) inputs and an intercept. For a new input \(x\), compute the fitted output \(\hat{f}(x)^T = (1, x^T) \hat{B}\) and classify to the largest component: \(\hat{G}(x) = \arg\max_{k \in G} \hat{f}_k(x).\) Then, show that the sum of the fitted values satisfies \[\sum_{k \in G} \hat{f}_k(x) = 1.\]

Solution

Let
\[ X = \begin{bmatrix} 1_N & X_{(p)} \end{bmatrix}, \]
where \(1_N = (1, \dots, 1)^T\) and \(X_{(p)}\) is an \(N \times p\) matrix.

\[X^T X = \begin{bmatrix} 1_N^T 1_N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix} = \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}.\]

Since each row of \(Y\) contains a single 1, \(Y \mathbf 1_K = 1_N\), so
\[ X^T Y\mathbf 1_K = \begin{bmatrix} 1_N^T \\ X_{(p)}^T \end{bmatrix} 1_N = \begin{bmatrix} N \\ X_{(p)}^T 1_N \end{bmatrix}. \]

Then,
\[ \hat{B}\mathbf 1_K = (X^T X)^{-1} X^T Y\mathbf 1_K = \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}^{-1} \begin{bmatrix} N \\ X_{(p)}^T 1_N \end{bmatrix}. \]

Using the property

\[ \begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}\begin{bmatrix} N & 1_N^T X_{(p)} \\ X_{(p)}^T 1_N & X_{(p)}^T X_{(p)} \end{bmatrix}^{-1}= \begin{bmatrix} 1 & 0_p^T \\ 0_p & I_p \end{bmatrix}, \]

we get

\[ \hat{B}\mathbf 1_K = \begin{bmatrix} 1 \\ 0_p \end{bmatrix}, \quad \text{where } 0_p = (0, \dots, 0)^T \in \mathbb{R}^p. \]

Since

\[ \hat{f}(x)^T = (1, x^T) \hat{B}, \] we have \[ \sum_{k=1}^{K} \hat{f}_k(x) = \hat{f}(x)^T 1_K = (1, x^T) \hat{B} 1_K= (1, x^T) \begin{bmatrix} 1 \\ 0_p \end{bmatrix} = 1. \]

Thus, we have proved that
\[ \sum_{k=1}^{K} \hat{f}_k(x) = 1. \]
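This can be checked numerically. The sketch below (synthetic data; `numpy` assumed available, all names hypothetical) fits the indicator regression and confirms that the fitted values sum to 1 at an arbitrary input:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 50, 3, 4

# Model matrix with an intercept column, and an indicator response matrix Y.
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
Y = np.eye(K)[rng.integers(0, K, size=N)]   # each row contains a single 1

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # (p+1) x K coefficient matrix

# Fitted vector at an arbitrary new input x: f_hat(x)^T = (1, x^T) B_hat.
x = rng.normal(size=p)
f_hat = np.concatenate([[1.0], x]) @ B_hat
print(f_hat.sum())                          # sums to 1 regardless of x
```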

2. (Ex 4.2) Suppose we have features \(x \in \mathbb{R}^p\), a two-class response, with class sizes \(n_1, n_2\), and the target coded as \(-n/n_1, n/n_2\).

(a) Show that the LDA rule classifies to class 2 if \[ x^T \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log\left(\frac{n_1}{n}\right) - \log\left(\frac{n_2}{n}\right),\] and class 1 otherwise.

Solution

The LDA rule classifies to class 2 if \(\delta_2(x) > \delta_1(x)\). That is,

\[ x^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 + \log \hat{\pi}_2 > x^T \hat{\Sigma}^{-1} \hat{\mu}_1 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \hat{\pi}_1. \]

In this case, \(\hat{\pi}_1 = \frac{n_1}{n}\) and \(\hat{\pi}_2 = \frac{n_2}{n}\) (\(n = n_1 + n_2\)). Then,

\[ x^T \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1) > \frac{1}{2} \hat{\mu}_2^T \hat{\Sigma}^{-1} \hat{\mu}_2 - \frac{1}{2} \hat{\mu}_1^T \hat{\Sigma}^{-1} \hat{\mu}_1 + \log \frac{n_1}{n} - \log \frac{n_2}{n} \]

so the LDA rule classifies to class 2 exactly when the displayed inequality holds, and to class 1 otherwise.
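As a numerical sanity check, the following sketch (synthetic two-class Gaussian data; names hypothetical) confirms that the threshold form of the rule agrees with directly comparing \(\delta_1\) and \(\delta_2\):

```python
import numpy as np

rng = np.random.default_rng(1)
p, n1, n2 = 2, 30, 40
n = n1 + n2

X1 = rng.normal(loc=0.0, size=(n1, p))
X2 = rng.normal(loc=1.0, size=(n2, p))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Pooled within-class covariance estimate (divisor n - 2).
S = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
Si = np.linalg.inv(S)
pi1, pi2 = n1 / n, n2 / n

def delta(x, mu, pi):
    # Linear discriminant function with plug-in estimates.
    return x @ Si @ mu - 0.5 * mu @ Si @ mu + np.log(pi)

x = rng.normal(size=p)
lhs = x @ Si @ (mu2 - mu1)
rhs = 0.5 * mu2 @ Si @ mu2 - 0.5 * mu1 @ Si @ mu1 + np.log(pi1) - np.log(pi2)

# The rearranged threshold rule agrees with comparing the discriminants.
print((lhs > rhs) == (delta(x, mu2, pi2) > delta(x, mu1, pi1)))
```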

(b) Consider minimization of the least squares criterion \[ \sum_{i=1}^{n} (y_i - \beta_0 - \beta^T x_i)^2. \] Show that the solution \(\hat{\beta}\) satisfies \[ \left[(n - 2) \hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B \right] \hat{\beta} = n(\hat{\mu}_2 - \hat{\mu}_1),\] where \(\hat{\Sigma}_B = (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T\).

Solution

Let

\[ x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p, \quad Y = (y_1, \dots, y_n)^T \in \mathbb{R}^n, \quad \beta = (\beta_1, \dots, \beta_p)^T \in \mathbb{R}^p. \]

We first show that \(\hat{\beta}\) satisfies the centered normal equation:

\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right). \]

Define:

\[ X = \begin{bmatrix} 1 & x_1^T \\ 1 & x_2^T \\ \vdots & \vdots \\ 1 & x_n^T \end{bmatrix}, \quad X^T = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{bmatrix}, \]

and

\[ X^T X = \begin{bmatrix} n & \sum_{i=1}^{n} x_i^T \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i x_i^T \end{bmatrix}. \]

Also,

\[ X^T Y = \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}. \]

The least squares estimate (LSE) \((\hat{\beta}_0, \hat{\beta}^T)^T\) satisfies:

\[ X^T X \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{bmatrix} = X^T Y. \]

That is,

\[ \begin{bmatrix} n & \sum_{i=1}^{n} x_i^T \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i x_i^T \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta} \end{bmatrix}= \begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} y_i x_i \end{bmatrix}. \]

We obtain:

\[ \sum_{i=1}^{n} \hat{\beta}_0 + \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i, \]

\[ \left( \sum_{i=1}^{n} x_i \right) \hat{\beta}_0 + \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i. \]

Since,

\[ \hat{\beta}_0 = \frac{1}{n} \sum_{i=1}^{n} y_i - \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right)^T \hat{\beta}, \]

we substitute into the second equation to obtain

\[ \left( \sum_{i=1}^{n} x_i x_i^T \right) \hat{\beta} - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i^T \right) \hat{\beta} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right). \]

It remains to evaluate both sides under the coding \(y_i = -n/n_1\) for class 1 and \(y_i = n/n_2\) for class 2. For the right-hand side, \(\sum_i y_i = -\frac{n}{n_1} n_1 + \frac{n}{n_2} n_2 = 0\) and \(\sum_i y_i x_i = -\frac{n}{n_1}\, n_1 \hat{\mu}_1 + \frac{n}{n_2}\, n_2 \hat{\mu}_2 = n(\hat{\mu}_2 - \hat{\mu}_1)\). For the left-hand side, write \(\bar{x} = \frac{1}{n} \sum_i x_i = \frac{n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2}{n}\) and use the pooled-covariance identity \((n-2)\hat{\Sigma} = \sum_i x_i x_i^T - n_1 \hat{\mu}_1 \hat{\mu}_1^T - n_2 \hat{\mu}_2 \hat{\mu}_2^T\):

\[\begin{aligned} \sum_i x_i x_i^T - n \bar{x} \bar{x}^T &= (n-2)\hat{\Sigma} + n_1 \hat{\mu}_1 \hat{\mu}_1^T + n_2 \hat{\mu}_2 \hat{\mu}_2^T - \frac{1}{n}(n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2)(n_1 \hat{\mu}_1 + n_2 \hat{\mu}_2)^T \\ &= (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T = (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B. \end{aligned}\]

Hence \(\left[ (n-2)\hat{\Sigma} + \frac{n_1 n_2}{n} \hat{\Sigma}_B \right] \hat{\beta} = n(\hat{\mu}_2 - \hat{\mu}_1)\), as required.
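The target identity of part (b) can be verified numerically; a sketch with synthetic data and the \(-n/n_1,\ n/n_2\) coding (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
p, n1, n2 = 3, 20, 35
n = n1 + n2

X1 = rng.normal(size=(n1, p))
X2 = rng.normal(loc=0.5, size=(n2, p))
y = np.concatenate([np.full(n1, -n / n1), np.full(n2, n / n2)])

# Least squares fit with intercept; beta is the slope part.
Xd = np.hstack([np.ones((n, 1)), np.vstack([X1, X2])])
beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
Sigma_B = np.outer(mu2 - mu1, mu2 - mu1)

lhs = ((n - 2) * Sigma + n1 * n2 / n * Sigma_B) @ beta
rhs = n * (mu2 - mu1)
print(np.allclose(lhs, rhs))
```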

(c) Hence show that \(\hat{\Sigma}_B \hat{\beta}\) is in the direction \((\hat{\mu}_2 - \hat{\mu}_1)\) and thus \[\hat{\beta} \propto \hat{\Sigma}^{-1}(\hat{\mu}_2 - \hat{\mu}_1).\] Therefore, the least squares regression coefficient is identical to the LDA coefficient, up to a scalar multiple.

Solution

We now show that \(\hat{\Sigma}_B \hat{\beta}\) is a scalar multiple of \((\hat{\mu}_2 - \hat{\mu}_1)\):

\[ \hat{\Sigma}_B \hat{\beta} = (\hat{\mu}_2 - \hat{\mu}_1) (\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta} = c\, (\hat{\mu}_2 - \hat{\mu}_1), \quad \text{where } c = (\hat{\mu}_2 - \hat{\mu}_1)^T \hat{\beta} \in \mathbb{R}, \]

so \(\hat{\Sigma}_B \hat{\beta}\) is in the direction of \((\hat{\mu}_2 - \hat{\mu}_1)\). Substituting this into the identity from part (b) gives

\[ (n - 2)\, \hat{\Sigma} \hat{\beta} = n (\hat{\mu}_2 - \hat{\mu}_1) - \frac{n_1 n_2}{n}\, c\, (\hat{\mu}_2 - \hat{\mu}_1) \propto (\hat{\mu}_2 - \hat{\mu}_1), \]

which implies:

\[ \hat{\beta} \propto \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1). \]

(d) Show that this result holds for any (distinct) coding of the two classes.

Solution

Using the centered normal equation from part (b), we may use any coding \(t_1 \neq t_2\) for the two classes. With \(y_i = t_1\) for class 1 and \(y_i = t_2\) for class 2,

\[ \text{RHS} = \sum_{i=1}^{n} y_i x_i - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)\left( \sum_{i=1}^{n} y_i \right) = \frac{n_1 n_2}{n}\,(t_2 - t_1)(\hat{\mu}_2 - \hat{\mu}_1) \propto (\hat{\mu}_2 - \hat{\mu}_1). \]

Also,

\[ \left( (n - 2) \hat{\Sigma} + \frac{n_1 n_2}{n} (\hat{\mu}_2 - \hat{\mu}_1)(\hat{\mu}_2 - \hat{\mu}_1)^T \right) \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1). \]

Since \(\hat{\Sigma}_B \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1)\), as in part (c), it follows that

\[ (n - 2) \hat{\Sigma} \hat{\beta} \propto (\hat{\mu}_2 - \hat{\mu}_1), \]

and hence,

\[ \hat{\beta} \propto \hat{\Sigma}^{-1} (\hat{\mu}_2 - \hat{\mu}_1). \]
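A quick numerical check of this coding-invariance claim (synthetic data; the codings below are chosen arbitrarily, all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n1, n2 = 3, 25, 30
n = n1 + n2

X1 = rng.normal(size=(n1, p))
X2 = rng.normal(loc=1.0, size=(n2, p))
Xd = np.hstack([np.ones((n, 1)), np.vstack([X1, X2])])

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (n - 2)
lda_dir = np.linalg.solve(Sigma, mu2 - mu1)   # LDA direction

# Any distinct coding (t1, t2) yields a slope parallel to lda_dir.
results = []
for t1, t2 in [(-n / n1, n / n2), (0.0, 1.0), (3.0, -7.0)]:
    y = np.concatenate([np.full(n1, t1), np.full(n2, t2)])
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0][1:]
    proj = (beta @ lda_dir) / (lda_dir @ lda_dir) * lda_dir  # projection onto lda_dir
    results.append(np.allclose(beta, proj))
print(all(results))
```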

3. (Ex 4.3) Suppose we transform the original predictors \(X\) to \(\hat{Y}\) via linear regression. In detail, let: \(\hat{Y} = X(X^T X)^{-1} X^T Y = X \hat{B},\) where \(Y\) is the indicator response matrix. Similarly, for any input \(x \in \mathbb{R}^p\), we get a transformed vector:\(\hat{x} = \hat{B}^T x \in \mathbb{R}^K.\) Show that LDA using \(\hat{Y}\) is identical to LDA in the original space.

Solution

(1) Preliminary

Recall that each row of \(Y\) has a single 1 and \(Y \in \mathbb{R}^{N \times K}\).

The main point is to show that the linear discriminant function w.r.t. \(X\), \(\delta_k(x)\), is identical to that w.r.t. \(\hat{Y}\), \(\hat{\delta}_k(\hat{y})\).

\[ \delta_k(x) = \log \pi_k + x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k \]

\[ \Rightarrow \hat{\delta}_k(\hat{y}) = \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k, \quad \hat{y} := \hat{B}^T x. \]

Let \(\tilde{\pi}_k\), \(\tilde{\mu}_k\), \(\tilde{\Sigma}\) be the estimators of class probability, class mean, and pooled class covariance in the transformed space, respectively.

  1. Class proportions are unchanged by transforming the inputs, so \(\tilde{\pi}_k = \hat{\pi}_k = \frac{N_k}{N}\).

    \[ Y_k : \text{the } k\text{-th column of } Y, \text{ the indicator of group } k, \quad X^T Y_k \in \mathbb{R}^{p}. \]

  2. \[\begin{aligned} \tilde{\mu}_k &= \frac{1}{N_k} \sum_{i: y_i = k} \hat{B}^T x_i \\ &= \hat{B}^T \frac{1}{N_k} \sum_{i: y_i = k} x_i = \hat{B}^T \frac{1}{N_k} X^T Y_k = \hat{B}^T \hat{\mu}_k, \end{aligned}\]

where \[ \hat{B} = (X^T X)^{-1} X^T Y \in \mathbb{R}^{p \times K}. \]

  3. \[\begin{aligned} \tilde{\Sigma} &= \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( \hat{y}_i - \tilde{\mu}_k ) ( \hat{y}_i - \tilde{\mu}_k )^T \\ &= \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( \hat{B}^T x_i - \hat{B}^T \hat{\mu}_k )( \hat{B}^T x_i - \hat{B}^T \hat{\mu}_k )^T \\ &= \hat{B}^T \left[ \frac{1}{N-K} \sum_{k=1}^{K} \sum_{i: y_i = k} ( x_i - \hat{\mu}_k )( x_i - \hat{\mu}_k )^T \right] \hat{B} = \hat{B}^T \hat{\Sigma} \hat{B}. \end{aligned}\]

  4. \[\begin{aligned} (N-K) \hat{\Sigma} &= \sum_{k=1}^{K} \sum_{i: y_i = k} ( x_i - \hat{\mu}_k )( x_i - \hat{\mu}_k )^T \\ & = \sum_i x_i x_i^T - 2 \sum_{k=1}^{K} \sum_{i: y_i = k} x_i \hat{\mu}_k^T + \sum_{k=1}^{K} N_k \hat{\mu}_k \hat{\mu}_k^T \\ &= \sum_i x_i x_i^T - 2 \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X + \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X \\ &= X^T X - \sum_{k=1}^{K} \frac{1}{N_k} X^T Y_k Y_k^T X \\ &= X^T X - X^T Y D Y^T X, \quad ( D = \text{diag} \{ N_1^{-1}, N_2^{-1}, \dots, N_K^{-1} \} ) \\ &= X^T (I - Y D Y^T) X. \end{aligned}\]

  5. So the discriminant in the transformed space is

\[ \hat{\delta}_k(\hat{y}) = \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k. \]

We simply assume that \(\tilde{\Sigma} = \hat{B}^T \hat{\Sigma} \hat{B} \in \mathbb{R}^{K \times K}\) is invertible.

(2) \(\hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k\)

\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = (\hat{B}^T x)^T (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T \frac{1}{N_k} X^T Y_k, \] and \[\begin{aligned} \hat{B}^T \hat{\Sigma} \hat{B} &= \frac{1}{N-K} \hat{B}^T X^T (I - Y D Y^T) X \hat{B} \\ &= \frac{1}{N-K} \left( \hat{B}^T X^T X \hat{B} - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( Y^T X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T Y - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( Y^T X (X^T X)^{-1} X^T Y - \hat{B}^T X^T Y D Y^T X \hat{B} \right) \\ &= \frac{1}{N-K} \left( H - H D H \right) \end{aligned}\] by letting \(H := \hat{B}^T X^T Y\), so \(H^T = H\).

For nonzero \(a \in \mathbb{R}^K\),

\[ a^T H a = a^T Y^T X (X^T X)^{-1} X^T Y a = c^T (X^T X)^{-1} c, \quad ( c = X^T Y a \in \mathbb{R}^p ). \]

Assuming \(K < \min(N, p)\), \(X^T Y\) is of full column rank and thus \(c \neq 0\).

Since \((X^T X)^{-1}\) is P.D., \(c^T (X^T X)^{-1} c = a^T H a > 0\), and thus \(H\) is P.D. (and invertible). Then,

\[ (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} = (N-K) (H - H D H)^{-1} = (N-K) (I - D H)^{-1} H^{-1}, \]

\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = x^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T \frac{1}{N_k} X^T Y_k = \frac{1}{N_k} \left[ x^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y \right]_k, \] and

\[\begin{aligned} \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y &= (N-K) \hat{B} (I - D H)^{-1} H^{-1} \hat{B}^T X^T Y \\ &= (N-K) \hat{B} (I - D H)^{-1} \quad \text{∵ } H = \hat{B}^T X^T Y \\ &= \hat{\Sigma}^{-1} \left[ (N-K) \hat{\Sigma} \right] \hat{B} (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} X^T (I - Y D Y^T) X \hat{B} (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} (X^T X \hat{B} - X^T Y D Y^T X \hat{B}) (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} (X^T Y - X^T Y D H) (I - D H)^{-1} \\ &= \hat{\Sigma}^{-1} X^T Y (I - D H) (I - D H)^{-1} = \hat{\Sigma}^{-1} X^T Y. \end{aligned}\]

So,

\[ \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = \frac{1}{N_k} \left[ x^T \hat{\Sigma}^{-1} X^T Y \right]_k = x^T \hat{\Sigma}^{-1} \left( \frac{1}{N_k} X^T Y_k \right) = x^T \hat{\Sigma}^{-1} \hat{\mu}_k. \]
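The key cancellation \(\hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y = \hat{\Sigma}^{-1} X^T Y\) can be checked numerically; a sketch with synthetic data and no intercept column, so that \(\hat{B}\) is \(p \times K\) (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p, K = 60, 4, 3

X = rng.normal(size=(N, p))
g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # all classes occupied
Y = np.eye(K)[g]
D = np.diag(1.0 / Y.sum(axis=0))

B = np.linalg.solve(X.T @ X, X.T @ Y)                   # p x K regression coefficients
Sigma = X.T @ (np.eye(N) - Y @ D @ Y.T) @ X / (N - K)   # pooled within-class covariance

lhs = B @ np.linalg.solve(B.T @ Sigma @ B, B.T @ X.T @ Y)
rhs = np.linalg.solve(Sigma, X.T @ Y)
print(np.allclose(lhs, rhs))
```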

(3) \(\tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k\)

Let \(\tilde{M} = [\tilde{\mu}_1 \dots \tilde{\mu}_K]\) be a \(K \times K\) matrix. Since

\[ \tilde{\mu}_k = \hat{B}^T \frac{1}{N_k} X^T Y_k = \hat{B}^T X^T \left( \frac{1}{N_k} Y_k \right), \]

stacking the columns gives \(\tilde{M} = \hat{B}^T X^T Y D\). Then, using the previous identity \(\hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y = \hat{\Sigma}^{-1} X^T Y\) and \(\tilde{\mu}_k = \hat{B}^T \hat{\mu}_k\),

\[ \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{M} = \hat{\mu}_k^T \hat{B} (\hat{B}^T \hat{\Sigma} \hat{B})^{-1} \hat{B}^T X^T Y D = \hat{\mu}_k^T \hat{\Sigma}^{-1} X^T Y D. \]

Thus,

\[ \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k = \hat{\mu}_k^T \hat{\Sigma}^{-1} X^T [Y D]_k = \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k. \]

(4) Conclusion

\[\begin{aligned} \hat{\delta}_k(\hat{y}) &= \log \tilde{\pi}_k + \hat{y}^T \tilde{\Sigma}^{-1} \tilde{\mu}_k - \frac{1}{2} \tilde{\mu}_k^T \tilde{\Sigma}^{-1} \tilde{\mu}_k \\ &= \log \hat{\pi}_k + x^T \hat{\Sigma}^{-1} \hat{\mu}_k - \frac{1}{2} \hat{\mu}_k^T \hat{\Sigma}^{-1} \hat{\mu}_k = \delta_k(x). \quad \blacksquare \end{aligned}\]
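Putting the pieces together, the following sketch (synthetic data, no intercept column; names hypothetical) checks that the discriminants computed from the transformed quantities match those computed in the original space:

```python
import numpy as np

rng = np.random.default_rng(7)
N, p, K = 80, 5, 3

X = rng.normal(size=(N, p))
g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # all classes occupied
Y = np.eye(K)[g]
Nk = Y.sum(axis=0)
D = np.diag(1.0 / Nk)
pi = Nk / N

B = np.linalg.solve(X.T @ X, X.T @ Y)                   # p x K
Sigma = X.T @ (np.eye(N) - Y @ D @ Y.T) @ X / (N - K)   # pooled covariance (original space)
M = X.T @ Y @ D                                         # column k is mu_hat_k

Sigma_t, M_t = B.T @ Sigma @ B, B.T @ M                 # transformed-space estimates
Si, Sti = np.linalg.inv(Sigma), np.linalg.inv(Sigma_t)

x = rng.normal(size=p)
y_hat = B.T @ x

d_orig = np.array([np.log(pi[k]) + x @ Si @ M[:, k] - 0.5 * M[:, k] @ Si @ M[:, k]
                   for k in range(K)])
d_trans = np.array([np.log(pi[k]) + y_hat @ Sti @ M_t[:, k] - 0.5 * M_t[:, k] @ Sti @ M_t[:, k]
                    for k in range(K)])
print(np.allclose(d_orig, d_trans))
```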

4. Consider the classification problem with \(K\) classes and the \(p\)-dimensional predictor \(X\). Let \(s = K - 1\) and \(p \geq s\). For \(k = 1, \dots, s\), we code class \(k\) as an indicator \(s\)-vector \(Y = (Y_1, \dots, Y_s)\) with \(Y_k = 1\) and \(Y_i = 0\) for \(i \neq k\), and class \(K\) is coded as \(Y = (0, \dots, 0)\). Let \(S_{11}, S_{22}, S_{12}\) denote the sample covariance matrices scaled by \(N\) for \(X, Y\), and \((X, Y)\). Show that \[S_B = N^{-1} S_{12} S_{22}^{-1} S_{21}\] where \(S_B\) is the between-class covariance matrix.

Solution

(1) Preliminary

Let \(H = I_N - \frac{1}{N} 1_N 1_N^T = I_N - \frac{1}{N} J_N\) be the centering matrix,
where \(J_N := 1_N 1_N^T\) and \(1_N := (1, \dots, 1)^T \in \mathbb{R}^N\).

Write \(S := K - 1\;(= s)\). The matrix \(Y \in \mathbb{R}^{N \times S}\) is defined as \(Y = [Y_1, \dots, Y_S]\),
where each \(Y_k \in \{0, 1\}^N\) is an indicator vector and \(N_k = Y_k^T 1_N\) \((k = 1, \dots, S)\).

\[\begin{aligned} N^{-1} Y^T 1_N &= \bar{Y} \in \mathbb{R}^S, \quad \text{where } (\bar{Y})_k = \frac{1}{N} \sum_{i} Y_{ik} = \frac{N_k}{N}, \\ N_k^{-1} X^T Y_k &= \bar X_k \in \mathbb{R}^p, \quad \text{where } (\bar{X}_k)_i \text{ is the class-} k \text{ average of the } i \text{th column of } X, \\ N^{-1} X^T 1_N &= \bar{X} \in \mathbb{R}^p, \quad \text{where } (\bar{X})_i \text{ is the overall average of the } i \text{th column of } X. \end{aligned}\]

(2) \(S_{12}\)

\[\begin{aligned} S_{12} &= N^{-1} X^T H Y = N^{-1} X^T (I_N - N^{-1} 1_N 1_N^T) Y \\ &= N^{-1} X^T Y - N^{-1} X^T 1_N \, N^{-1} 1_N^T Y \\ &= N^{-1} X^T Y - N^{-1} X^T 1_N \begin{bmatrix} \frac{N_1}{N} & \dots & \frac{N_S}{N} \end{bmatrix} \qquad \because N^{-1} 1_N^T Y = \bar{Y}^T \\ &= \frac{1}{N} \begin{bmatrix} X^T Y_1 - X^T 1_N \frac{N_1}{N} & \dots & X^T Y_S - X^T 1_N \frac{N_S}{N} \end{bmatrix} \\ &= \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix}. \end{aligned}\]

(3) \(S_{22}^{-1}\)

\[ S_{22} = \frac{1}{N} Y^T H Y = \frac{1}{N} Y^T (I_N - N^{-1} J_N) Y = \frac{1}{N} Y^T Y - N^{-2}\, Y^T J_N Y. \]

1) \(Y^T Y = \text{diag} \{ N_1, \dots, N_S \}\)

\(Y_k\) is an indicator vector showing whether observation \(i\) is in class \(k\). Each observation belongs to exactly one class, so for \(k \neq \ell\), \((Y_k)_i\) and \((Y_\ell)_i\) cannot both be 1. Hence \((Y_k)_i (Y_\ell)_i = 0\) for all \(i\), and \(Y_k^T Y_\ell = 0\).

Note that \(Y_k^T Y_k = N_k\). Then

\[ Y^T Y = \begin{bmatrix} Y_1^T \\ \vdots \\ Y_S^T \end{bmatrix} [Y_1, \dots, Y_S] = \text{diag} \{ N_1, \dots, N_S \} \]

2) \(S_{22}^{-1}\) using Woodbury Matrix identity

Since the Woodbury matrix identity is
\((A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}\), we may use it as follows:

\[\begin{aligned} S_{22}^{-1} &= \left( \frac{1}{N} Y^T Y - \frac{1}{N^2} Y^T 1_N 1_N^T Y \right)^{-1} \\ &= N (Y^T Y)^{-1} - N (Y^T Y)^{-1} \frac{1}{N} Y^T 1_N \left( -1 + \frac{1}{N} 1_N^T Y \, N (Y^T Y)^{-1} \frac{1}{N} Y^T 1_N \right)^{-1} \frac{1}{N} 1_N^T Y \, N (Y^T Y)^{-1} \\ &= N (Y^T Y)^{-1} - (Y^T Y)^{-1} Y^T 1_N \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} 1_N^T Y (Y^T Y)^{-1}, \end{aligned}\]
with \(A = \frac{1}{N} Y^T Y\), \(U = \frac{1}{N} Y^T 1_N\), \(C = -1\), and \(V = \frac{1}{N} 1_N^T Y\).


3) \(1_N^T Y (Y^T Y)^{-1} Y^T 1_N\) and \((Y^T Y)^{-1} Y^T 1_N\)

\[\begin{aligned} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N &= 1_N^T Y \,\text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} Y^T 1_N \\ &= (N_1, \dots, N_S) \,\text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 \\ \vdots \\ N_S \end{bmatrix} \\ &= N_1 + \dots + N_S = N - N_K \end{aligned}\]

Then,

\[ \left( -1 + \frac{1}{N} 1_N^T Y (Y^T Y)^{-1} Y^T 1_N \right)^{-1} = \left( -1 + \frac{1}{N} (N - N_K) \right)^{-1} = \left( - \frac{N_K}{N} \right)^{-1} = - \frac{N}{N_K}, \] and \[ (Y^T Y)^{-1} Y^T 1_N = \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 \\ \vdots \\ N_S \end{bmatrix} = 1_S. \]

4) Conclusion

\[ S_{22}^{-1} = N (Y^T Y)^{-1} - 1_S \left( - \frac{N}{N_K} \right) 1_S^T = N (Y^T Y)^{-1} + \frac{N}{N_K}\, 1_S 1_S^T. \]
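The closed form for \(S_{22}^{-1}\) can be verified numerically (synthetic indicator matrix; the first \(K\) observations are forced to cover every class, all names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 40, 4
s = K - 1

g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # every class occupied
Y = np.eye(K)[g][:, :s]            # class K coded as the zero vector
Nk = Y.sum(axis=0)
NK = N - Nk.sum()                  # size of class K

H = np.eye(N) - np.ones((N, N)) / N
S22 = Y.T @ H @ Y / N

closed_form = N * np.diag(1.0 / Nk) + (N / NK) * np.ones((s, s))
print(np.allclose(np.linalg.inv(S22), closed_form))
```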


(4) \(N S_{12} S_{22}^{-1} S_{21}\)

It remains to show

\[ N S_{12} S_{22}^{-1} S_{21} = S_B := \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T. \]

\[\begin{aligned} N S_{12} S_{22}^{-1} S_{21} &= N \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \left( N (Y^T Y)^{-1} + \frac{N}{N_K} 1_S 1_S^T \right) \frac{1}{N} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} \\ &= \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \left( \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} + \frac{1}{N_K} 1_S 1_S^T \right) \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} \end{aligned}\]

We split this into two parts:

1)

\[ \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} \text{diag} \left\{ \frac{1}{N_1}, \dots, \frac{1}{N_S} \right\} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} = \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T. \]

2)

\[ \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} 1_S = \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}). \]

Note that \((N_k \bar{X}_k)_i\) is the sum of the \(X\)'s in the \(i\)th column whose class is \(k\), and \((N \bar{X})_i\) is the sum of the \(X\)'s in the \(i\)th column over all classes \((1, \dots, S, K)\), \(i = 1, \dots, p\). Then, \((N \bar{X})_i - \sum_{k=1}^{S} (N_k \bar{X}_k)_i\) is the sum of the \(X\)'s in the \(i\)th column whose class is \(K\).
Using this, \(\sum_{k=1}^{S} N_k \bar{X}_k = N \bar{X} - N_K \bar{X}_K\).

Also, \(\sum_{k=1}^{S} N_k \bar{X} = (N - N_K) \bar{X}\). So,

\[ \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) = N \bar{X} - N_K \bar{X}_K - (N - N_K) \bar{X} = -N_K (\bar{X}_K - \bar{X}). \]

The sign cancels in the outer product, so

\[ \frac{1}{N_K} \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X}) & \dots & N_S (\bar{X}_S - \bar{X}) \end{bmatrix} 1_S 1_S^T \begin{bmatrix} N_1 (\bar{X}_1 - \bar{X})^T \\ \vdots \\ N_S (\bar{X}_S - \bar{X})^T \end{bmatrix} = N_K (\bar{X}_K - \bar{X}) (\bar{X}_K - \bar{X})^T. \]

3) Conclusion

Hence,

\[\begin{aligned} N S_{12} S_{22}^{-1} S_{21} &= \sum_{k=1}^{S} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T + N_K (\bar{X}_K - \bar{X}) (\bar{X}_K - \bar{X})^T \\ &= \sum_{k=1}^{K} N_k (\bar{X}_k - \bar{X}) (\bar{X}_k - \bar{X})^T = S_B. \quad \blacksquare \end{aligned}\]
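Finally, the whole identity \(N S_{12} S_{22}^{-1} S_{21} = S_B\) can be verified numerically on synthetic data (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, K = 50, 3, 4
s = K - 1

X = rng.normal(size=(N, p))
g = np.concatenate([np.arange(K), rng.integers(0, K, size=N - K)])  # every class occupied
Y = np.eye(K)[g][:, :s]            # class K coded as the zero vector

H = np.eye(N) - np.ones((N, N)) / N
S12 = X.T @ H @ Y / N
S22 = Y.T @ H @ Y / N

# Between-class covariance, summing over all K classes.
xbar = X.mean(axis=0)
S_B = sum((g == k).sum() * np.outer(X[g == k].mean(axis=0) - xbar,
                                    X[g == k].mean(axis=0) - xbar)
          for k in range(K))

lhs = N * S12 @ np.linalg.inv(S22) @ S12.T
print(np.allclose(lhs, S_B))
```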