Elements of Statistical Learning Ch12

1. Let \(\hat{f}\) be the minimizer of the conditional risk \[ \mathbb{E} [ L(Y, f(X)) \mid X = x ] \] where \(Y \in \{-1, 1\}\). Let \(p(x) = \mathbb{P}(Y = 1 \mid X = x)\). Then express \(\text{sign}(\hat{f})\) in terms of \(p(x)\) for each of the following losses.

We write \(f(x)\) as \(f\) and \(p(x)\) as \(p\) for simplicity.

(a) \(L(Y, f(x)) = (Y - f(x))^2\)

solution

\[ \begin{align*} \mathbb{E}_Y[(Y - f)^2] &= (1 - f)^2 p + (-1 - f)^2 (1 - p) \\ &= (1 - 2f + f^2) p + (1 + 2f + f^2) (1 - p) \\ &= f^2 (p + 1 - p) + f (-2p + 2(1 - p)) + (p + 1 - p) \\ &= f^2 - 2(2p - 1) f + 1 \\ &= \{ f - (2p - 1) \}^2 + 1 - (2p - 1)^2 \end{align*} \]

\(\Rightarrow\) minimized when \(f = 2p - 1\).

So, \(\hat{f}(x) = 2p(x) - 1\) and
\[ \text{sign}(\hat{f}) = \text{sign}(2p(x) - 1). \]
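The closed form can be sanity-checked numerically (an illustrative sketch, not part of the solution): minimize the conditional risk over a fine grid of \(f\) values and compare the grid minimizer with \(2p - 1\). The grid range and the sample values of \(p\) are arbitrary choices.

```python
# Grid check of part (a): for Y in {-1, 1} with P(Y = 1 | X = x) = p,
# the conditional risk R(f) = (1 - f)^2 p + (-1 - f)^2 (1 - p)
# should be minimized at f = 2p - 1.

def risk(f, p):
    """Conditional risk E[(Y - f)^2 | X = x]."""
    return (1 - f) ** 2 * p + (-1 - f) ** 2 * (1 - p)

grid = [i / 10000 for i in range(-10000, 10001)]  # f in [-1, 1], step 1e-4
for p in (0.1, 0.3, 0.5, 0.9):
    f_star = min(grid, key=lambda f: risk(f, p))
    assert abs(f_star - (2 * p - 1)) < 1e-3
```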

(b) \(L(Y, f(x)) = (1 - Y f(x))_+ , \quad \text{where} \quad Z_+ = Z \cdot I(Z \geq 0)\)

solution

\[ R(f) := \mathbb{E}_Y[(1 - Y f(x))_+] = (1 - f)_+ p + (1 + f)_+ (1 - p) \]

Case 1: \(f(x) \leq -1\)

\[ R(f) = (1 - f) p + 0 = (1 - f) p \]

Since \(p \geq 0\), \(R(f)\) is non-increasing in \(f\) on \((-\infty, -1]\) (strictly decreasing when \(p > 0\)).

So, \[ \arg \min_{f \leq -1} R(f) = -1 \] and this holds for all \(p(x) \in [0, 1]\).

Case 2: \(f(x) \geq 1\)

\[ R(f) = (1 + f) (1 - p), \] which is non-decreasing in \(f\) on \([1, \infty)\) (strictly increasing when \(p < 1\)).

So, \[ \arg \min_{f \geq 1} R(f) = 1 \] which holds for all \(p \in [0, 1]\).

Case 3: \(-1 \leq f(x) \leq 1\), so \(1 - f \geq 0\) and \(1 + f \geq 0\)

\(f\) is restricted to \(f \in [-1, 1]\) and \(Y \in \{-1, 1\}\). So, \(1 - Yf\) is non-negative: \((1 - Yf)_+ = 1 - Yf\). Then, \(R(f) = \mathbb{E}_Y[(1 - Yf)_+] = 1 - \mathbb{E}_Y[Yf]\).

The minimizer of \(R(f)\) is equivalent to the maximizer of \(\mathbb{E}_Y[Yf]\). Then, \[ \mathbb{E}_Y[Yf] = f(x) p(x) + (-f(x))(1 - p(x)) = f(x)(2p(x) - 1) \]

  • \(\mathbb{E}_Y[Yf]\) is increasing w.r.t \(f\) when \(2p - 1 > 0\), so \[ \arg\max_{-1 \leq f \leq 1} \mathbb{E}_Y[Yf] = 1. \]
  • When \(p = \frac{1}{2}\), \(\mathbb{E}_Y[Yf] = 0\) for every \(f\), so every \(f \in [-1, 1]\) is a maximizer and the minimizer is not unique.
  • \(\mathbb{E}_Y[Yf]\) is decreasing w.r.t \(f\) when \(2p - 1 < 0\), so \[ \arg\max_{-1 \leq f \leq 1} \mathbb{E}_Y[Yf] = -1. \]

Excluding the tie at \(p = \frac{1}{2}\), the equivalent expression is: \[ \arg\min_{-1 \leq f \leq 1} R(f) = \begin{cases} -1, & 0 < p < \frac{1}{2} \\ 1, & \frac{1}{2} < p < 1 \end{cases} \]

Conclusion

Hence, \[ \arg\min_{f} R(f) = \begin{cases} -1, & 0 < p < \frac{1}{2} \\ 1, & \frac{1}{2} < p < 1 \end{cases} \] and therefore \[ \text{sign}(\hat{f}) = \text{sign}(2p(x) - 1). \]
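The case analysis can also be checked on a grid (a sketch with arbitrarily chosen \(p\) values; for \(p \neq \frac{1}{2}\) the grid minimizer should land exactly on \(\pm 1\)):

```python
# Grid check of part (b): the hinge conditional risk
# R(f) = (1 - f)_+ p + (1 + f)_+ (1 - p) is minimized at
# f = -1 when p < 1/2 and at f = +1 when p > 1/2.

def hinge_risk(f, p):
    return max(1 - f, 0.0) * p + max(1 + f, 0.0) * (1 - p)

grid = [i / 1000 for i in range(-3000, 3001)]  # f in [-3, 3]
for p in (0.1, 0.3, 0.7, 0.9):
    f_star = min(grid, key=lambda f: hinge_risk(f, p))
    expected = -1.0 if p < 0.5 else 1.0
    assert f_star == expected
```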


(c) \(L(Y, f(x)) = \exp(-Y f(x))\)

solution

\[ R(f) := \mathbb{E}_Y[e^{-Yf}] = e^{-f} p + e^{f} (1 - p) \]

Differentiate: \[ \frac{d}{df} R(f) = -e^{-f} p + e^{f} (1 - p) = -p \cdot \frac{1}{e^{f}} + (1 - p) e^{f} \]

Set derivative to 0: \[ -p \cdot \frac{1}{e^{f}} + (1 - p) e^{f} = 0 \\ \Leftrightarrow (1 - p) e^{2f} = p \\ \Leftrightarrow e^{2f} = \frac{p}{1 - p} \\ \Leftrightarrow f = \frac{1}{2} \log \left( \frac{p}{1 - p} \right) \]

\(p \geq \frac{1}{2} \Leftrightarrow \frac{p}{1 - p} \geq 1 \Leftrightarrow \hat{f} \geq 0\), and the reverse inequalities hold for \(p < \frac{1}{2}\). So, \[ \text{sign}(\hat{f}) = \text{sign}(2p(x) - 1). \]
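A quick numerical sketch of the closed form (arbitrary \(p\) values; the grid minimizer should land within one grid step of \(\frac{1}{2} \log \frac{p}{1-p}\)):

```python
import math

# Grid check of part (c): the exponential-loss risk
# R(f) = e^{-f} p + e^{f} (1 - p) is minimized at f = 0.5 * log(p / (1 - p)).

def exp_risk(f, p):
    return math.exp(-f) * p + math.exp(f) * (1 - p)

grid = [i / 1000 for i in range(-5000, 5001)]  # f in [-5, 5], step 1e-3
for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    f_star = min(grid, key=lambda f: exp_risk(f, p))
    f_hat = 0.5 * math.log(p / (1 - p))
    assert abs(f_star - f_hat) < 2e-3
```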

(d) \(L(Y, f(x)) = \log(1 + e^{-Y f(x)})\)

solution

\[ R(f) := \mathbb{E}_Y \left[ \log(1 + e^{-Y f(x)}) \right] = \log(1 + e^{-f}) p + \log(1 + e^{f}) (1 - p) \]

Differentiate:

\[ \frac{d}{df} R = p \cdot \frac{-e^{-f}}{1 + e^{-f}} + (1 - p) \cdot \frac{e^{f}}{1 + e^{f}} \]

Set derivative to zero:

\[ p \cdot \frac{-e^{-f}}{1 + e^{-f}} + (1 - p) \cdot \frac{e^{f}}{1 + e^{f}} = 0 \]

Move the first term to the right-hand side:

\[ p \cdot \frac{e^{-f}}{1 + e^{-f}} = \frac{(1 - p) \cdot e^{f}}{1 + e^{f}} \]

Multiply the numerator and denominator of the left-hand side by \(e^{f}\):

\[ \frac{p}{1 + e^{f}} = \frac{(1 - p) \cdot e^{f}}{1 + e^{f}} \]

Simplify:

\[ p = (1 - p) e^{f} \]

Rearrange:

\[ e^{f} = \frac{p}{1 - p} \]

Take logarithm:

\[ \hat{f} = \log \left( \frac{p}{1 - p} \right) \]

Note that:

\[ p \geq \frac{1}{2} \Leftrightarrow \frac{p}{1 - p} \geq 1 \Leftrightarrow \hat{f} \geq 0 \] So,

\[ \text{sign}(\hat{f}) = \text{sign}(2p(x) - 1). \]
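As in part (c), the log-odds closed form can be verified on a grid (arbitrary \(p\) values, purely illustrative):

```python
import math

# Grid check of part (d): the logistic risk
# R(f) = log(1 + e^{-f}) p + log(1 + e^{f}) (1 - p)
# is minimized at the log-odds f = log(p / (1 - p)).

def logistic_risk(f, p):
    return math.log(1 + math.exp(-f)) * p + math.log(1 + math.exp(f)) * (1 - p)

grid = [i / 1000 for i in range(-6000, 6001)]  # f in [-6, 6], step 1e-3
for p in (0.1, 0.3, 0.7, 0.9):
    f_star = min(grid, key=lambda f: logistic_risk(f, p))
    f_hat = math.log(p / (1 - p))
    assert abs(f_star - f_hat) < 2e-3
```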

2.

(1) Show that, at the optimum of the soft-margin SVM problem, \[\hat{\xi}_i = (1 - y_i (w^T x_i + b))_+\]

solution

Below are the Lagrangian of the soft-margin SVM problem and the corresponding KKT conditions (with \(C > 0\)):

\[ \mathcal{L}((w, b, \xi), (\alpha, \beta)) = \frac{1}{2} \|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left\{ y_i (w^T x_i + b) - (1 - \xi_i) \right\} - \sum_i \beta_i \xi_i \]

Conditions:

(i) Primal feasibility: \(y_i (w^T x_i + b) - (1 - \xi_i) \geq 0\), \(\xi_i \geq 0\) \(\quad \forall i\)

(ii) Dual feasibility: \(\alpha_i \geq 0\), \(\beta_i \geq 0\) \(\quad \forall i\)

(iii) Complementary slackness: \(\alpha_i \left\{ y_i (w^T x_i + b) - (1 - \xi_i) \right\} = 0\), \(\beta_i \xi_i = 0\) \(\quad \forall i\)

(iv) Stationarity: \[ \frac{\partial \mathcal{L}}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0, \quad \frac{\partial \mathcal{L}}{\partial b} = \sum_i \alpha_i y_i = 0, \quad \frac{\partial \mathcal{L}}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \quad \forall i \]

Let \(f_i = w^T x_i + b\) for simplicity.

Case 1: \(1 - y_i f_i \leq 0\)

Assume that \(\xi_i > 0\). Then \(\beta_i \xi_i = 0\) forces \(\beta_i = 0\) (by KKT (iii)). Stationarity \(C - \alpha_i - \beta_i = 0\) then gives \(\alpha_i = C > 0\) (by KKT (iv)), so complementary slackness yields \(y_i f_i - 1 + \xi_i = 0\) (by KKT (iii)). This leads to: \[ \xi_i = 1 - y_i f_i \leq 0, \] which contradicts the assumption \(\xi_i > 0\). Thus, \(\xi_i = 0\) (since \(\xi_i \geq 0\) by KKT (i)).

Case 2: \(1 - y_i f_i > 0\)

\(\xi_i \geq 1 - y_i f_i > 0\) (by KKT (i)), so \(\xi_i > 0\) and accordingly \(\beta_i = 0\) (by KKT (iii)). Since \(\alpha_i = C > 0\) (by KKT (iv)), \(y_i f_i - 1 + \xi_i = 0\) (by KKT (iii)). Thus: \[ \xi_i = 1 - y_i f_i \]

Therefore: \[ \hat{\xi}_i = (1 - y_i f_i)_+ \]

Note that the two statements are equivalent: \[ \left\{ \xi_i \geq 0 \ \&\ \xi_i = 1 - y_i f_i \geq 0 \right\} \ \Longleftrightarrow \ \left\{ \xi_i = (1 - y_i f_i)_+ \right\}. \]

(2) Show that: \[ \min_{w, b, \xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i : \xi_i \geq 0, \ \xi_i \geq 1 - y_i f_i \right\} = \min_{w, b} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i (1 - y_i f_i)_+ \right\}\]

solution

1) “\(\geq\)” direction:

The constraints \(\xi_i \geq 0\) and \(\xi_i \geq 1 - y_i f_i\) are together equivalent to \(\xi_i \geq (1 - y_i f_i)_+\). Hence, for any feasible \((w, b, \xi)\),

\[ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \geq \frac{1}{2} \|w\|^2 + C \sum_i (1 - y_i f_i)_+ \geq \min_{w, b} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i (1 - y_i f_i)_+ \right\}, \]

so the left-hand minimum is at least the right-hand minimum.

2) “\(\leq\)” direction:

For any \((w, b)\), the choice \(\xi_i = (1 - y_i f_i)_+\) is feasible and makes the two objectives equal, so

\[ \min_{w, b, \xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i : \xi_i \geq 0, \ \xi_i \geq 1 - y_i f_i \right\} \leq \min_{w, b} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i (1 - y_i f_i)_+ \right\}. \]

3) Conclusion:

\[\min_{w, b, \xi} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i \xi_i : \xi_i \geq 0, \ \xi_i \geq 1 - y_i f_i \right\}= \min_{w, b} \left\{ \frac{1}{2} \|w\|^2 + C \sum_i (1 - y_i f_i)_+ \right\} \]
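As a small numerical illustration (toy random data and an arbitrary fixed \((w, b)\), not from the book), the inner minimization over \(\xi\) can be checked directly: the smallest feasible slack for each \(i\) is exactly \((1 - y_i f_i)_+\), so the two objectives coincide once \(\xi\) is optimized out.

```python
import random

random.seed(0)
C = 1.0
# Toy data: 20 points in R^2 with random labels (purely illustrative).
data = [([random.gauss(0, 1), random.gauss(0, 1)], random.choice([-1, 1]))
        for _ in range(20)]
w, b = [0.5, -0.3], 0.1  # an arbitrary fixed (w, b)

def f(x):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

norm_term = 0.5 * sum(wj ** 2 for wj in w)

# Smallest slacks satisfying xi_i >= 0 and xi_i >= 1 - y_i f_i:
xi_opt = [max(1 - y * f(x), 0.0) for x, y in data]

constrained_obj = norm_term + C * sum(xi_opt)
hinge_obj = norm_term + C * sum(max(1 - y * f(x), 0.0) for x, y in data)
assert abs(constrained_obj - hinge_obj) < 1e-12

# Each xi_opt[i] is feasible, and any strictly smaller value violates a constraint:
for (x, y), xi in zip(data, xi_opt):
    margin = 1 - y * f(x)
    assert xi >= 0 and xi >= margin
    assert xi - 1e-9 < 0 or xi - 1e-9 < margin
```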