Softmax Explained

Reference: UFLDL Softmax Regression

Logistic Regression

training dataset is: $$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}),\dots, (x^{(m)}, y^{(m)})\}$$

In the binary classification setting the labels are $$y^{(i)} \in \{0, 1\} $$ and the hypothesis function is

$$h_\theta(x) = \frac{1}{1+\exp(-\theta^Tx)} $$

and the model parameters θ are trained to minimize the cost function $$J(\theta) = - \frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta (x^{(i)}) + (1-y^{(i)}) \log( 1- h_\theta(x^{(i)}))\right]$$
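
As a quick illustration (not part of the UFLDL text), here is a minimal NumPy sketch of the sigmoid hypothesis and this cost; the names sigmoid, logistic_cost, theta, X and y are my own, not the tutorial's:

import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """Average logistic-regression cost J(theta).

    Assumed layout: X is an (m, n+1) design matrix with an intercept column,
    theta is an (n+1,) parameter vector, y is an (m,) vector of 0/1 labels.
    """
    h = sigmoid(X.dot(theta))  # h_theta(x^(i)) for every example
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))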

Softmax Regression

Generalizing from logistic regression to multiple classes:

training dataset is: $$\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}),\dots, (x^{(m)}, y^{(m)})\}$$ and the target categories are $$ y^{(i)} \in \{ 1, 2, \dots, k\}$$

then the predicted probability for each class j is $$ P(y=j \mid x; \theta) $$ and the hypothesis function is

$$ h_{\theta} (x^{(i)}) = \begin{bmatrix} p(y^{(i)}=1|x^{(i)}; \theta) \\ p(y^{(i)}=2|x^{(i)}; \theta) \\ \vdots \\ p(y^{(i)}=k|x^{(i)}; \theta) \\ \end{bmatrix} = \frac {1} { \sum_{j=1}^k e^{ \theta_j^T x^{(i)} } } \bullet \begin{bmatrix} e^{ \theta_1^T x^{(i)} } \\ e^{ \theta_2^T x^{(i)} } \\ \vdots \\ e^{ \theta_k^T x^{(i)} } \\ \end{bmatrix} $$

and we can write θ as a k-by-(n+1) matrix, where k is the number of labels and n+1 is the number of input features plus the intercept term: $$ \theta = \begin{bmatrix} \mbox{---} \theta_1^T \mbox{---} \\ \mbox{---} \theta_2^T \mbox{---} \\ \vdots \\ \mbox{---} \theta_k^T \mbox{---} \\ \end{bmatrix} $$
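
A minimal sketch of this hypothesis, assuming Theta is the k-by-(n+1) matrix just defined and x is a single (n+1)-dimensional input (variable names are mine):

import numpy as np

def softmax_hypothesis(Theta, x):
    """Return the k-vector [p(y=1|x;theta), ..., p(y=k|x;theta)]."""
    logits = Theta.dot(x)       # theta_j^T x for j = 1, ..., k
    exps = np.exp(logits)
    return exps / np.sum(exps)  # normalize so the probabilities sum to 1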

the cost function will be $$ J(\theta) = - \frac{1}{m} \left[\sum_{i=1}^m \sum_{j=1}^k 1 \{y^{(i)}=j\} \log \frac {e^{ \theta_j^T x^{(i)} }} {\sum_{l=1}^k e^{ \theta_l^T x^{(i)} }} \right] $$ (a code sketch follows the notes below). Notes:

  1. 1{•} is the indicator function: it evaluates to 1 when the condition inside the braces is true and to 0 otherwise.
  2. the term inside the log is the predicted class probability $$ p(y^{(i)} = j | x^{(i)} ; \theta) = \frac{e^{\theta_j^T x^{(i)}}}{\sum_{l=1}^k e^{ \theta_l^T x^{(i)}} } $$
  3. this cost function reduces to the logistic regression cost in the binary (k = 2) setting: $$ \begin{align} J(\theta) &= -\frac{1}{m} \left[ \sum_{i=1}^m (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) + y^{(i)} \log h_\theta(x^{(i)}) \right] \\ &= - \frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=0}^{1} 1\left\{y^{(i)} = j\right\} \log p(y^{(i)} = j | x^{(i)} ; \theta) \right] \end{align} $$
  4. There is no known closed-form way to solve for the minimum of J(θ), so we have to use an iterative algorithm such as gradient descent or L-BFGS.
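
Putting notes 1 and 2 together, a hedged NumPy sketch of this cost (the names Theta, X and Y_onehot are my own convention, not the tutorial's):

import numpy as np

def softmax_cost(Theta, X, Y_onehot):
    """J(theta) as defined above.

    Assumed shapes: Theta is (k, n+1), X is (m, n+1),
    Y_onehot is (m, k) with Y_onehot[i, j] = 1{y^(i) = j}.
    """
    logits = X.dot(Theta.T)                    # theta_j^T x^(i), shape (m, k)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)  # p(y^(i) = j | x^(i); theta)
    return -np.sum(Y_onehot * np.log(probs)) / X.shape[0]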

We use gradient descent to minimize J(θ). The gradient of the cost function with respect to θ_j is $$ \nabla_{\theta_j} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m}{ \left[ x^{(i)} \left( 1\{ y^{(i)} = j\} - p(y^{(i)} = j | x^{(i)}; \theta) \right) \right] } $$ and on each iteration we perform the update $$ \theta_j := \theta_j - \alpha \nabla_{\theta_j}J(\theta), \forall j = 1, \dots, k $$
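
A minimal gradient-descent step under the same assumed names (Theta, X, Y_onehot) plus a learning rate alpha; this is a sketch of the update rule, not a reference implementation:

import numpy as np

def softmax_gradient(Theta, X, Y_onehot):
    """Stack of the gradients nabla_{theta_j} J(theta), shape (k, n+1)."""
    m = X.shape[0]
    probs = np.exp(X.dot(Theta.T))
    probs /= probs.sum(axis=1, keepdims=True)
    # row j is -(1/m) * sum_i x^(i) * (1{y^(i) = j} - p(y^(i) = j | x^(i); theta))
    return -(Y_onehot - probs).T.dot(X) / m

def gradient_descent_step(Theta, X, Y_onehot, alpha=0.1):
    """theta_j := theta_j - alpha * nabla_{theta_j} J(theta) for all j."""
    return Theta - alpha * softmax_gradient(Theta, X, Y_onehot)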

Practical usage


In [98]:
"""Softmax."""
import numpy as np
scores = np.array([3.0, 1.0, 0.2])
def softmax(x):
    """Compute softmax values for each sets of scores in x."""
    return np.exp(x) / np.sum(np.exp(x), axis=0)
    
print(scores, softmax(scores))


(array([ 3. ,  1. ,  0.2]), array([ 0.8360188 ,  0.11314284,  0.05083836]))

In [92]:
# Plot softmax curves: as x grows, the probability assigned to the first (varying)
# score approaches 1 while the probabilities of the two constant scores go to 0.
import matplotlib.pyplot as plt
x = np.arange(-2.0, 6.0, 0.1)
scores = np.vstack([x, np.ones_like(x), 0.2 * np.ones_like(x)])
plt.plot(x, softmax(scores).T, linewidth=2)
plt.show()



In [94]:
scores = np.array([[1, 2, 3, 6],
                   [2, 4, 5, 6],
                   [3, 8, 7, 4]])
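# Added note: softmax (defined above) normalizes along axis=0, so applying it to
# this 2-D array treats each *column* as one set of scores, and each column of
# the result sums to 1.
print(softmax(scores))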

In [99]:
# Floating-point precision demo: adding 1e-6 to 1e9 one million times
# should give exactly 1.0, but limited float precision yields about 0.95.
x = 1000000000
y = 0.000001
for i in range(1000000):
    x += y
x -= 1000000000
print(x)


0.953674316406
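
This loss of precision also motivates a numerically safer softmax: with large scores, np.exp in the naive version above can overflow and the ratios become unstable. A common remedy, added here as a sketch (it is not in the original cells), is to shift the scores by their maximum before exponentiating, which leaves the result unchanged mathematically:

def softmax_stable(x):
    """Numerically safer softmax: subtract the per-column max before exp.

    Reuses np from the first cell; same axis=0 convention as softmax above.
    """
    shifted = x - np.max(x, axis=0)  # shifting by a constant does not change the ratios
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=0)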
