Elements of Statistical Learning CH7

1. Let \(A = \{a_{ij}\}, B = \{b_{ij}\}\) be \((n \times n)\) positive semidefinite (PSD) matrices

(a) Show that \(A \otimes B\) is also PSD

proof

Note that the proof works when \(B\) is any \((m \times m)\) PSD matrix (\(n \neq m\) in general).

1) \((A \otimes B)(C \otimes D) = (AC) \otimes (BD) \quad (C \in \mathbb{R}^{n \times k}, D \in \mathbb{R}^{m \times l})\)

\[\begin{aligned} (A \otimes B)(C \otimes D) &= \begin{bmatrix} a_{1,1}B & \cdots & a_{1,n}B \\ \vdots & \ddots & \vdots \\ a_{n,1}B & \cdots & a_{n,n}B \end{bmatrix} \begin{bmatrix} c_{1,1}D & \cdots & c_{1,k}D \\ \vdots & \ddots & \vdots \\ c_{n,1}D & \cdots & c_{n,k}D \end{bmatrix} \\ &= \begin{bmatrix} \sum_{s=1}^{n} a_{1,s}c_{s,1}BD & \cdots & \sum_{s=1}^{n} a_{1,s}c_{s,k}BD \\ \vdots & \ddots & \vdots \\ \sum_{s=1}^{n} a_{n,s}c_{s,1}BD & \cdots & \sum_{s=1}^{n} a_{n,s}c_{s,k}BD \end{bmatrix} \\ &= \begin{bmatrix} [AC]_{1,1} BD & \cdots & [AC]_{1,k}BD \\ \vdots & \ddots & \vdots \\ [AC]_{n,1} BD & \cdots & [AC]_{n,k}BD \end{bmatrix} \\ &= (AC) \otimes (BD) \end{aligned}\]
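As a quick sanity check, 1) can also be verified numerically. The following is a minimal sketch in NumPy; the matrix sizes and random entries are illustrative choices, not from the text.

```python
# Numerical check of the mixed-product property (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD).
import numpy as np

rng = np.random.default_rng(0)
n, m, k, l = 3, 4, 2, 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, k))
D = rng.standard_normal((m, l))

lhs = np.kron(A, B) @ np.kron(C, D)   # (A ⊗ B)(C ⊗ D), an (nm × kl) matrix
rhs = np.kron(A @ C, B @ D)           # (AC) ⊗ (BD)
assert np.allclose(lhs, rhs)
```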

cf). For \(x \in \mathbb{R}^{n \times 1}, y \in \mathbb{R}^{m \times 1}\), \(x \otimes y = \begin{bmatrix} x_1 y \\ \vdots \\ x_n y \end{bmatrix} \in \mathbb{R}^{nm \times 1}\)

2) \(Ax = \lambda x, \quad By = \mu y \quad \Rightarrow \quad (A \otimes B)(x \otimes y) = \lambda \mu (x \otimes y)\)

Using 1) directly:

\[(A \otimes B)(x \otimes y) = (Ax) \otimes (By) = (\lambda x) \otimes (\mu y) = \lambda \mu (x \otimes y).\]

3) conclusion

Suppose \(\lambda_1, \cdots, \lambda_n\) and \(\mu_1, \cdots, \mu_m\) are the eigenvalues of \(A\) and \(B\), respectively, with corresponding orthonormal eigenvectors \(x_1, \cdots, x_n\) and \(y_1, \cdots, y_m\) (these exist because \(A, B\) are symmetric). By 2), each \(x_i \otimes y_j\) is an eigenvector of \(A \otimes B\) with eigenvalue \(\lambda_i \mu_j\), and the \(nm\) vectors \(x_i \otimes y_j\) form an orthonormal basis of \(\mathbb{R}^{nm}\), so \(\{\lambda_i \mu_j : 1 \leq i \leq n, 1 \leq j \leq m\}\) is the full set of eigenvalues of \(A \otimes B\). Moreover, \(A \otimes B\) is symmetric, since \((A \otimes B)^T = A^T \otimes B^T = A \otimes B\). \(A, B\) are PSD, so \(\lambda_i \geq 0\) and \(\mu_j \geq 0\) \(\forall i,j\). Since every eigenvalue \(\lambda_i \mu_j \geq 0\), \(A \otimes B\) is PSD.
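The spectrum claim in 2)–3) can be confirmed numerically as well. In the sketch below the PSD inputs are built as \(XX^T\); this construction and the sizes are assumptions for illustration.

```python
# Check that the eigenvalues of A ⊗ B are the pairwise products λ_i μ_j,
# and that they are all nonnegative when A and B are PSD.
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
X = rng.standard_normal((n, n)); A = X @ X.T   # PSD by construction
Y = rng.standard_normal((m, m)); B = Y @ Y.T

lam = np.linalg.eigvalsh(A)
mu = np.linalg.eigvalsh(B)
kron_eigs = np.sort(np.linalg.eigvalsh(np.kron(A, B)))
products = np.sort(np.outer(lam, mu).ravel())

assert np.allclose(kron_eigs, products)   # spectrum of A ⊗ B is {λ_i μ_j}
assert kron_eigs.min() >= -1e-10          # hence A ⊗ B is PSD
```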


(b) Show that the Hadamard product of \(A\) and \(B\) is PSD (Schur Product Thm)

proof

The Hadamard product \(A \circ B\) is defined as follows: \([A \circ B]_{i,j} = a_{ij} b_{ij}\).

The following 1) holds for rectangular \(A, B \in \mathbb{R}^{m \times n}\) (\(n \neq m\) in general), \(x = (x_1, \cdots, x_m)^T \in \mathbb{R}^m\), and \(y = (y_1, \cdots, y_n)^T \in \mathbb{R}^n\).

1) \(x^T (A \circ B) y = \text{tr} \left\{ A^T \text{diag}(x) B \, \text{diag}(y) \right\}\)

\[\begin{aligned} x^T (A \circ B) y &= \sum_{i=1}^{m} \sum_{j=1}^{n} x_i (a_{ij} b_{ij}) y_j = \sum_{i=1}^{m} \sum_{j=1}^{n} (x_i a_{ij})(b_{ij} y_j) \\ &= \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ \text{diag}(x) \, A \right]_{i,j} \left[ B \, \text{diag}(y) \right]_{i,j} \\ &= \sum_{i=1}^{m} \sum_{j=1}^{n} \left[ A^T \text{diag}(x) \right]_{j,i} \left[ B \, \text{diag}(y) \right]_{i,j} \\ &= \sum_{j=1}^{n} \left[ A^T \text{diag}(x) B \, \text{diag}(y) \right]_{j,j} \\ &= \text{tr} \left\{ A^T \text{diag}(x) B \, \text{diag}(y) \right\} \end{aligned}\]
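Identity 1) is easy to test numerically for rectangular \(A, B\); the test matrices and vectors in this minimal sketch are arbitrary illustrative choices.

```python
# Numerical check of xᵀ(A ∘ B)y = tr{Aᵀ diag(x) B diag(y)} for A, B ∈ ℝ^{m×n}.
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((m, n))
x = rng.standard_normal(m)
y = rng.standard_normal(n)

lhs = x @ (A * B) @ y                               # A * B is the Hadamard product
rhs = np.trace(A.T @ np.diag(x) @ B @ np.diag(y))
assert np.allclose(lhs, rhs)
```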

2) \(A \circ B \succeq 0\), conclusion.

The above result holds in particular when \(m = n\) and \(y = x \in \mathbb{R}^n\). That is, \[ x^T (A \circ B) x = \text{tr} \left\{ A \, \text{diag}(x) B \, \text{diag}(x) \right\}, \quad \because A^T = A \text{ since } A \text{ is symmetric}. \]

\(A, B\) are symmetric, so \(A^{\frac{1}{2}}\) and \(B^{\frac{1}{2}}\) can be defined via the spectral decompositions \(A = \Gamma_A \Lambda_A \Gamma_A^T\) and \(B = \Gamma_B \Lambda_B \Gamma_B^T\), where

\(\Lambda_A := \text{diag}([\lambda_{A,1}, \cdots, \lambda_{A,n}])\) and \(\Lambda_B := \text{diag}([\lambda_{B,1}, \cdots, \lambda_{B,n}])\), with
\(\lambda_{A,i}, \lambda_{B,i} \geq 0 \quad \forall i = 1, \cdots, n\) (\(\because A, B\) are PSD).

So, \(\Lambda_A^{\frac{1}{2}} := \text{diag}([\sqrt{\lambda_{A,1}}, \cdots, \sqrt{\lambda_{A,n}}])\) and \(\Lambda_B^{\frac{1}{2}} := \text{diag}([\sqrt{\lambda_{B,1}}, \cdots, \sqrt{\lambda_{B,n}}])\) are well defined.

Define \(A^{\frac{1}{2}} := \Gamma_A \Lambda_A^{\frac{1}{2}} \Gamma_A^T\) and \(B^{\frac{1}{2}} := \Gamma_B \Lambda_B^{\frac{1}{2}} \Gamma_B^T\), so that \(A^{\frac{1}{2}} A^{\frac{1}{2}} = A\) and \(B^{\frac{1}{2}} B^{\frac{1}{2}} = B\), with both square roots symmetric.

Then,

\[\begin{aligned} x^T (A \circ B) x &= \text{tr} \left\{ A \, \text{diag}(x) B \, \text{diag}(x) \right\} \\ &= \text{tr} \left\{ A^{\frac{1}{2}} A^{\frac{1}{2}} \text{diag}(x) B^{\frac{1}{2}} B^{\frac{1}{2}} \text{diag}(x) \right\} \\ &= \text{tr} \left\{ A^{\frac{1}{2}} \text{diag}(x) B^{\frac{1}{2}} B^{\frac{1}{2}} \text{diag}(x) A^{\frac{1}{2}} \right\} \quad (\text{cyclic property of the trace}) \\ &= \text{tr} \left\{ C^T C \right\} \quad \left( \text{let } C = B^{\frac{1}{2}} \text{diag}(x) A^{\frac{1}{2}}, \text{ so } C^T = A^{\frac{1}{2}} \text{diag}(x) B^{\frac{1}{2}} \right) \\ &= \sum_{i=1}^{n} \left[ C^T C \right]_{i,i} = \sum_{i=1}^{n} \sum_{j=1}^{n} C_{j,i}^2 \geq 0. \end{aligned}\]

This holds for all \(x \in \mathbb{R}^n\), so \(A \circ B\) is PSD. \(\square\)
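The argument of 2) can be mirrored numerically: the minimal sketch below builds \(A^{\frac{1}{2}}, B^{\frac{1}{2}}\) from eigendecompositions as above, forms \(C = B^{\frac{1}{2}} \text{diag}(x) A^{\frac{1}{2}}\), and confirms \(x^T (A \circ B) x = \|C\|_F^2 \geq 0\). The random PSD construction is an illustrative assumption.

```python
# Empirical confirmation of the Schur product theorem via C = B^{1/2} diag(x) A^{1/2}.
import numpy as np

def psd_sqrt(M):
    """Symmetric square root via the spectral decomposition Γ Λ^{1/2} Γᵀ."""
    lam, gamma = np.linalg.eigh(M)
    return gamma @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ gamma.T

rng = np.random.default_rng(3)
n = 5
X = rng.standard_normal((n, n)); A = X @ X.T   # PSD by construction
Y = rng.standard_normal((n, n)); B = Y @ Y.T
x = rng.standard_normal(n)

C = psd_sqrt(B) @ np.diag(x) @ psd_sqrt(A)
assert np.isclose(x @ (A * B) @ x, np.sum(C ** 2))   # = ‖C‖_F² ≥ 0
assert np.linalg.eigvalsh(A * B).min() >= -1e-10     # A ∘ B is PSD
```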


2. (Ex. 7.4) Consider the in-sample error and the training error under squared-error loss \[\text{Err}_{\text{in}} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\mathcal{Y}^0} \left[ (Y_i^0 - \hat{f}(x_i))^2 \right],\] \[\text{err} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2.\] Show that the average optimism in the training error is \[\omega = \frac{2}{N} \sum_{i=1}^{N} \text{Cov}_{\mathcal{Y}} [y_i, \hat{f}(x_i)].\]

proof

\(\mathbb{E}_{\mathcal{Y}}\) denotes expectation over the training responses \(\mathcal{Y} = \{y_1, \cdots, y_N\}\).
\(\mathcal{Y}^0 = \{Y_1^0, \cdots, Y_N^0\}\) denotes new responses drawn at the same training inputs \(x_i\), independently of the training set, so \(\mathcal{Y} \perp \mathcal{Y}^0\) and \(Y_i^0 \overset{d}{=} y_i\) for each \(i\).

\[\begin{aligned} \omega &= \mathbb{E}_{\mathcal{Y}} \left[ \text{Err}_{\text{in}} - \text{err} \right] = \mathbb{E}_{\mathcal{Y}} \left[ \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{\mathcal{Y}^0} \left[ (Y_i^0 - \hat{f}(x_i))^2 \right] - \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(x_i))^2 \right] \\ &= \frac{1}{N} \mathbb{E}_{\mathcal{Y}} \left[ \sum_{i=1}^{N} \mathbb{E}_{\mathcal{Y}^0} \left[ (Y_i^0)^2 - 2 Y_i^0 \hat{f}(x_i) + \hat{f}(x_i)^2 \right] - \sum_{i=1}^{N} \left( y_i^2 - 2 y_i \hat{f}(x_i) + \hat{f}(x_i)^2 \right) \right] \\ &= \frac{1}{N} \mathbb{E}_{\mathcal{Y}} \left[ \sum_{i=1}^{N} \left( \mathbb{E}_{\mathcal{Y}^0} \left[ (Y_i^0)^2 - 2 Y_i^0 \hat{f}(x_i) \right] - \left( y_i^2 - 2 y_i \hat{f}(x_i) \right) \right) \right] \quad (\because \hat{f}(x_i) \text{ depends only on } \mathcal{Y}, \text{ so the } \hat{f}(x_i)^2 \text{ terms cancel}) \\ &= \frac{1}{N} \sum_{i=1}^{N} \left( \mathbb{E}_{\mathcal{Y}^0} [(Y_i^0)^2] - \mathbb{E}_{\mathcal{Y}} [y_i^2] + 2 \mathbb{E}_{\mathcal{Y}} [y_i \hat{f}(x_i)] - 2 \mathbb{E}_{\mathcal{Y}} \left[ \mathbb{E}_{\mathcal{Y}^0} [Y_i^0] \, \hat{f}(x_i) \right] \right) \quad (\because \mathcal{Y} \perp \mathcal{Y}^0) \\ &= \frac{2}{N} \sum_{i=1}^{N} \left( \mathbb{E}_{\mathcal{Y}} [y_i \hat{f}(x_i)] - \mathbb{E}_{\mathcal{Y}^0} [Y_i^0] \, \mathbb{E}_{\mathcal{Y}} [\hat{f}(x_i)] \right) \quad (\because Y_i^0 \overset{d}{=} y_i \Rightarrow \mathbb{E}_{\mathcal{Y}^0} [(Y_i^0)^2] = \mathbb{E}_{\mathcal{Y}} [y_i^2]; \ \mathbb{E}_{\mathcal{Y}^0} [Y_i^0] \text{ is a constant}) \\ &= \frac{2}{N} \sum_{i=1}^{N} \left( \mathbb{E}_{\mathcal{Y}} [y_i \hat{f}(x_i)] - \mathbb{E}_{\mathcal{Y}} [y_i] \, \mathbb{E}_{\mathcal{Y}} [\hat{f}(x_i)] \right) \quad (\because Y_i^0 \overset{d}{=} y_i \Rightarrow \mathbb{E}_{\mathcal{Y}^0} [Y_i^0] = \mathbb{E}_{\mathcal{Y}} [y_i]) \\ &= \frac{2}{N} \sum_{i=1}^{N} \text{Cov}_{\mathcal{Y}} [y_i, \hat{f}(x_i)]. \quad \square \end{aligned}\]
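For a concrete check of the result, recall that for a linear smoother \(\hat{f} = H y\) with i.i.d. noise of variance \(\sigma^2\), \(\sum_i \text{Cov}_{\mathcal{Y}}[y_i, \hat{f}(x_i)] = \sigma^2 \text{tr}(H)\), which equals \(\sigma^2 p\) for OLS with \(p\) predictors, so \(\omega = 2\sigma^2 p / N\). The Monte Carlo sketch below estimates \(\omega = \mathbb{E}_{\mathcal{Y}}[\text{Err}_{\text{in}} - \text{err}]\) directly and compares it with this value; the linear model and parameter settings are assumptions for illustration.

```python
# Monte Carlo illustration of ω = (2/N) Σ Cov(y_i, f̂(x_i)) for OLS.
import numpy as np

rng = np.random.default_rng(7)
N, p, sigma, reps = 50, 5, 1.0, 20000
X = rng.standard_normal((N, p))
f_true = X @ rng.standard_normal(p)        # fixed inputs, fixed mean function
H = X @ np.linalg.inv(X.T @ X) @ X.T       # OLS hat matrix: f̂ = H y

Y = f_true + sigma * rng.standard_normal((reps, N))    # training responses 𝒴
Y0 = f_true + sigma * rng.standard_normal((reps, N))   # new responses 𝒴⁰ at the same x_i
F_hat = Y @ H.T                            # row r holds f̂ fitted to training set Y[r]

err = np.mean((Y - F_hat) ** 2, axis=1)    # training error per replication
err_in = np.mean((Y0 - F_hat) ** 2, axis=1)  # in-sample error estimate per replication

print(err_in.mean() - err.mean())          # Monte Carlo estimate of ω
print(2 * sigma**2 * p / N)                # theory: (2/N) σ² tr(H) = 2σ²p/N = 0.2
```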