Target is:
$$ y \in \{0, 1\} $$(since this is a binary classification problem)
We will fit:
$$ \hat{y} = f(\mathbf{x}, \mathbf{\theta}) = \sigma(\mathbf{\theta}^T \mathbf{x}) $$for $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$
where we have $n$ datapoints, with corresponding targets $y_1, y_2, \dots, y_n$.
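To make the model concrete, here is a minimal NumPy sketch of the prediction function (the names `sigmoid` and `predict`, and the example data, are just illustrative choices, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """y_hat = sigma(theta^T x) for each row x of X.

    theta has shape (m,), X has shape (n, m): n datapoints, m features."""
    return sigmoid(X @ theta)

# Example: n = 4 datapoints with m = 3 features each.
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 2.0, 0.1],
              [1.0, -1.5, -0.4]])
theta = np.zeros(3)
print(predict(theta, X))  # all 0.5 when theta is zero
```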
MSE is:
$$ L = \frac{1}{n} \sum_{i=1}^n (\sigma(\theta^T\mathbf{x}_i) - y_i)^2 $$How to differentiate $\sigma(x)$?
We have:
$$ \sigma(x) = \frac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1} $$So:
$$ \frac{d \sigma(x)}{dx} = (-1)(1 + e^{-x})^{-2}(e^{-x}(-1)) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \sigma(x)(1 - \sigma(x)) $$(using $\frac{1}{1 + e^{-x}} = \sigma(x)$ and $\frac{e^{-x}}{1 + e^{-x}} = 1 - \sigma(x)$).
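As a quick sanity check of this derivative, here is a minimal sketch comparing the closed-form expression against a central finite difference (the helper names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d sigma / dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_grad(z), numeric)  # both approximately 0.2217
```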
What about differentiating $\sigma(f(x))$? We have:
$$ \sigma(f(x)) = (1 + e^{-f(x)})^{-1} $$So:
$$ \frac{d\, \sigma(f(x))}{dx} = (-1)(1 + e^{-f(x)})^{-2}\left(e^{-f(x)} \left(-\frac{df(x)}{dx}\right)\right) $$Now let's differentiate the loss with respect to a single component $\theta_j$:
$$ \frac{\partial L}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^n \left( 2(\sigma(\theta^T \mathbf{x}_i) - y_i) \frac{\partial\, \sigma(\theta^T \mathbf{x}_i)}{\partial \theta_j} \right) $$Let's write $\theta^T \mathbf{x}$ as $\theta'^T\mathbf{x}' + \theta_j x_j$, where $\theta = \theta' + \{ \theta_j \}$ and $\mathbf{x} = \mathbf{x}' + \{ x_j \}$, i.e. $\theta'$ and $\mathbf{x}'$ are $\theta$ and $\mathbf{x}$ with the $j$-th component removed. Then we have:
$$ L = \frac{1}{n} \sum_{i=1}^n \left( \frac{1}{1 + e^{-\theta'^T\mathbf{x}'_i - \theta_j x_{i,j}}} - y_i \right)^2 $$Take:
$$ L_i = \left( \frac{1} {1 + \exp(-\theta'^T\mathbf{x}'_i - \theta_j x_{i,j})} - y_i \right)^2 $$For the term $1/(1 + \exp(- \theta'^T\mathbf{x}'_i - \theta_j x_{i,j})) = \sigma(\theta^T\mathbf{x}_i)$, the derivative with respect to $\theta_j$ is (applying the $\sigma(f(x))$ result above, with $f = \theta'^T\mathbf{x}'_i + \theta_j x_{i,j}$ and $\partial f / \partial \theta_j = x_{i,j}$):
$$ \frac{\partial\, \sigma(\theta^T\mathbf{x}_i)}{\partial \theta_j} = \sigma(\theta^T\mathbf{x}_i)\left(1 - \sigma(\theta^T\mathbf{x}_i)\right) x_{i,j} $$So:
$$ \frac{\partial L } {\partial \theta_j } = \frac{1}{n} \sum_{i=1}^n \frac{\partial L_i} {\partial \theta_j} = \frac{1}{n} \sum_{i=1}^n 2\left(\sigma(\theta^T\mathbf{x}_i) - y_i\right)\sigma(\theta^T\mathbf{x}_i)\left(1 - \sigma(\theta^T\mathbf{x}_i)\right) x_{i,j} $$
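To check this gradient numerically, here is a small sketch (the function names and random data are illustrative only) that computes $\partial L / \partial \theta_j$ from the expression above and compares it against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(theta, X, y):
    p = sigmoid(X @ theta)
    return np.mean((p - y) ** 2)

def mse_grad(theta, X, y):
    # dL/dtheta_j = (1/n) * sum_i 2 * (sigma_i - y_i) * sigma_i * (1 - sigma_i) * x_{i,j}
    p = sigmoid(X @ theta)
    return (2.0 / len(y)) * (X.T @ ((p - y) * p * (1 - p)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
theta = rng.normal(size=3)

analytic = mse_grad(theta, X, y)
eps = 1e-6
numeric = np.array([
    (mse_loss(theta + eps * e, X, y) - mse_loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic)
print(numeric)  # should match the analytic gradient closely
```

The rest of these notes switch from this squared-error objective to a maximum-likelihood one, which ends up with a simpler gradient.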
We want to model $y$ as:
$$ p(y=1 \mid \mathbf{\theta}, \mathbf{x}) = f(\mathbf{x}, \theta) = \sigma(\theta^T \mathbf{x}) $$We want to choose a $\theta$ which maximizes $p(\mathcal{Y} \mid \theta, \mathcal{X})$, i.e. the 'maximum likelihood' solution. (Alternative estimators, such as a MAP solution that incorporates a prior over $\theta$, would in general give a different $\theta$.)
The likelihood function is:
$$ \mathcal{L}_\theta = p( \mathcal{Y} \mid \mathcal{X}, \theta) $$Assuming the datapoints are independent, the log likelihood is:
$$ \log \mathcal{L}_\theta = \sum_{i=1}^n \log p(y_i \mid \mathbf{x}_i, \theta) $$We can write each $p(y_i \mid \mathbf{x}_i, \theta)$ as the product $\sigma(\theta^T\mathbf{x}_i)^{y_i}(1 - \sigma(\theta^T\mathbf{x}_i))^{1 - y_i}$: since $y_i$ is either $0$ or $1$, one of the two factors has exponent $0$ and so equals $1$, while the other factor is the probability we want, so multiplying them is correct. (A sum of the two terms would not be correct, since it would be offset by $1$.)
So:
$$ \log p(y_i \mid \x_i, \thetav) = \log \left( \sigma(\thetav^T\x_i)^{y_i}(1 - \sigma(\thetav^T\x_i))^{(1 - y_i)} \right) $$So:
$$ \log \mathcal{L}_\thetav = \sum_{i=1}^n \left( y_i \log(\sigma(\thetav^T \x_i)) + (1 - y_i) \log(1 - \sigma(\thetav^T \x_i)) \right) $$To maximize the log likelihood, we can take the derivative with respect to each component $\theta_j$, and e.g. use gradient ascent on this.
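As a sketch of what this objective looks like in code (reusing the `sigmoid` helper from earlier; the function name `log_likelihood` and the example data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # sum_i [ y_i * log(sigma(theta^T x_i)) + (1 - y_i) * log(1 - sigma(theta^T x_i)) ]
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])
print(log_likelihood(theta, X, y))  # a negative scalar; closer to 0 is better
```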
We can write the log likelihood as $\log \mathcal{L}_\thetav = \sum_{i=1}^n \mathcal{L}_\thetav^i$, where we define:
$$ \mathcal{L}_\thetav^i = y_i \log(\sigma(\thetav^T \x_i)) + (1-y_i) \log( 1 - \sigma(\thetav^T \x_i)) $$Let's define $\x'_i$ to be such that $\x_i = \x'_i + \{ x_{i,j} \}$, and $\theta'$ to be such that $\thetav = \theta' + \{ \theta_j \}$ (the same component split as in the MSE section above).
So, looking at the left hand expression, we have
$$ \frac{\partial\, y_i \log(\sigma(\theta'^T \x'_i + \theta_j x_{i,j}))} {\partial \theta_j} $$Using https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x/1225116#1225116 as a reference:
Let's say we want to find the derivative of $y = \sigma(x) = (1 + \exp(-x))^{-1}$. So we have:
$$ \frac{dy}{dx} = (-1)(1 + \exp(-x))^{-2}(\exp(-x))(-1) $$Ok, but in our case we have something more like $y = \sigma(f(x)) = (1 + \exp(-f(x)))^{-1}$. What is the derivative of this?
So, a factor of $f'(x)$ greater than the earlier derivative, and we need to substitute $x$ by $f(x)$. So the derivative in this case will be:
$$ \frac{d\, \sigma(f(x))}{dx} = \sigma(f(x))\left(1 - \sigma(f(x))\right) f'(x) $$Therefore:
$$ \frac{ \partial\, \sigma(\theta'^T \x'_i + \theta_j x_{i,j})} {\partial \theta_j} = \sigma(\thetav^T\x_i)\left(1 - \sigma(\thetav^T\x_i)\right) x_{i,j} $$Next, using the derivation at the bottom of http://ucanalytics.com/blogs/gradient-descent-logistic-regression-simplified-step-step-visual-guide/ as a guide, i.e. chain-ruling the loss function:
We want:
$$ \frac{\partial \log \mathcal{L}_i} {\partial \theta_j} $$for each $j \in \{1, \dots, m \}$, where $m$ is the number of features.
We have:
$$ \log \mathcal{L}_i = y_i \log (\sigma(\thetav^T\x_i)) + (1 - y_i) \log (1 - \sigma(\thetav^T\x_i)) $$We can substitute a variable for the predicted probability $\sigma(\thetav^T \mathbf{x}_i)$. Let's denote this variable as $F(\thetav, \mathbf{x}_i) = \sigma(\thetav^T \mathbf{x}_i)$. So we have:
$$ \log \mathcal{L}_i = y_i \log(F) + (1 - y_i)\log(1 - F) $$And using the chain rule we have:
$$ \frac{\partial \log \mathcal{L}_i } {\partial \theta_j} = \frac{\partial \log \mathcal{L}_i} {\partial F} \frac{\partial F} {\partial \theta_j} $$Next, let's substitute the linear function $\thetav^T \mathbf{x}_i$, i.e. the 'logit', as the variable $z_i$ (which is a scalar, since it's the result of a dot product). So we have:
$$ F = \sigma(z_i(\thetav, \x_i)) $$And thus, using chain rule on $F$ we have:
$$ \frac{ \partial F} {\partial \theta_j} = \frac{\partial F} {\partial z_i} \frac{\partial z_i} {\partial \theta_j} $$Finally, the entire chain rule expression for the derivative of the log likelihood will be:
$$ \frac{\partial \log \mathcal{L}_i} {\partial \theta_j} = \frac{\partial \log \mathcal{L}_i } {\partial F} \frac{\partial F} {\partial z_i} \frac{\partial z_i} {\partial \theta_j} $$Looking at each term, one by one, we have:
$$ \frac{\partial \log \mathcal{L}_i} {\partial F} = y_i \frac{1} {F} + (1 - y_i) \frac{-1}{1 - F} $$or we can also write as:
$$ \frac{\partial \log \mathcal{L}_i}{\partial F} = \frac{y_i}{F} + \frac{1- y_i}{F - 1} $$Looking at the next term, we have:
$$ \frac{\partial F }{\partial z_i} = \frac{\partial}{\partial z_i} \sigma(z_i) = \sigma(z_i)\left(1 - \sigma(z_i)\right) = F(1 - F) $$(as derived earlier)
Lastly we have:
$$ \frac{\partial z_i}{\partial \theta_j} = \frac{\partial }{\partial \theta_j} \thetav^T \mathbf{x}_i = x_{i,j} $$
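To make the three factors concrete, here is a small numerical sketch (illustrative names and data) that evaluates each factor for a single datapoint and checks that their product matches a finite-difference estimate of $\partial \log \mathcal{L}_i / \partial \theta_j$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik_i(theta, x_i, y_i):
    F = sigmoid(theta @ x_i)
    return y_i * np.log(F) + (1 - y_i) * np.log(1 - F)

x_i = np.array([1.0, -0.5, 2.0])
y_i = 1.0
theta = np.array([0.3, 0.1, -0.2])
j = 1  # the component of theta we differentiate with respect to

z = theta @ x_i                           # z_i = theta^T x_i
F = sigmoid(z)                            # F = sigma(z_i)
dL_dF = y_i / F - (1 - y_i) / (1 - F)     # d log L_i / dF
dF_dz = F * (1 - F)                       # dF / dz_i
dz_dtheta_j = x_i[j]                      # dz_i / dtheta_j
analytic = dL_dF * dF_dz * dz_dtheta_j

eps = 1e-6
e_j = np.eye(3)[j]
numeric = (log_lik_i(theta + eps * e_j, x_i, y_i)
           - log_lik_i(theta - eps * e_j, x_i, y_i)) / (2 * eps)
print(analytic, numeric)  # should match closely
```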
So, overall, we have:
$$ \frac{\partial \log \mathcal{L}_i} {\partial \theta_j} = \frac{\partial \log \mathcal{L}_i}{\partial F} \frac{\partial F}{\partial z_i} \frac{\partial z_i}{\partial \theta_j} = \left( \frac{y_i}{F} - \frac{1 - y_i}{1 - F} \right) F(1 - F)\, x_{i,j} = \left( y_i(1 - F) - (1 - y_i)F \right) x_{i,j} = (y_i - F)\, x_{i,j} $$So, for all dimensions $j \in \{1, \dots, m \}$, we have:
$$ \nabla_\thetav \log \mathcal{L}_i = (y_i - F)\, \x_i $$Finally, for all $i \in \{1, \dots, n \}$, we have:
$$ \nabla_\thetav \log \mathcal{L}_\thetav = \sum_{i=1}^n \left( y_i - \sigma(\thetav^T \x_i) \right) \x_i $$
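Putting it all together, here is a minimal gradient-ascent sketch using this gradient (the learning rate, iteration count, and synthetic data are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary-classification data.
rng = np.random.default_rng(42)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_theta = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=n) < sigmoid(X @ true_theta)).astype(float)

# Gradient ascent on the log likelihood:
#   grad = sum_i (y_i - sigma(theta^T x_i)) x_i
theta = np.zeros(m)
lr = 0.1
for step in range(500):
    p = sigmoid(X @ theta)
    grad = X.T @ (y - p)
    theta += lr * grad / n   # average over n to keep the step size stable
print(theta)  # should end up roughly in the direction of true_theta
```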