Target is:
$$ y \in \{0, 1\} $$(since this is a binary classification problem)
We will fit:
$$ \hat{y} = f(\mathbf{x}, \mathbf{\theta}) = \sigma(\mathbf{\theta}^T \mathbf{x}) $$for $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$
where we have $n$ datapoints, with corresponding targets $y_1, y_2, \dots, y_n$.
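To make the model concrete, here is a minimal NumPy sketch of the prediction function (the names `sigmoid` and `predict`, and the example data, are just illustrative choices, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """y_hat = sigma(theta^T x) for each row x of X.

    theta has shape (m,), X has shape (n, m): n datapoints, m features."""
    return sigmoid(X @ theta)

# Example: n = 4 datapoints with m = 3 features each.
X = np.array([[1.0, 0.5, -1.2],
              [1.0, -0.3, 0.8],
              [1.0, 2.0, 0.1],
              [1.0, -1.5, -0.4]])
theta = np.zeros(3)
print(predict(theta, X))  # all 0.5 when theta is zero
```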
MSE is:
$$ L = \frac{1}{n} \sum_{i=1}^n (\sigma(\theta^T\mathbf{x}_i) - y_i)^2 $$How to differentiate $\sigma(x)$?
We have:
$$ \sigma(x) = \frac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1} $$So:
$$ \frac{d \sigma(x)}{dx} = (-1)(1 + e^{-x})^{-2}(e^{-x}(-1)) = \frac{e^{-x}}{(1 + e^{-x})^{2}} = \sigma(x)(1 - \sigma(x)) $$(using $\frac{1}{1 + e^{-x}} = \sigma(x)$ and $\frac{e^{-x}}{1 + e^{-x}} = 1 - \sigma(x)$).
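As a quick sanity check of this derivative, here is a minimal sketch comparing the closed-form expression against a central finite difference (the helper names are just illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d sigma / dz = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = 0.7
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoid_grad(z), numeric)  # both approximately 0.2217
```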
What about differentiating $\sigma(f(x))$? We have:
$$ \sigma(f(x)) = (1 + e^{-f(x)})^{-1} $$So:
$$ \frac{d\, \sigma(f(x))}{dx} = (-1)(1 + e^{-f(x)})^{-2}\left(e^{-f(x)} \left(-\frac{df(x)}{dx}\right)\right) $$Now let's differentiate the loss with respect to a single component $\theta_j$:
$$ \frac{\partial L}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^n \left( 2(\sigma(\theta^T \mathbf{x}_i) - y_i) \frac{\partial\, \sigma(\theta^T \mathbf{x}_i)}{\partial \theta_j} \right) $$Let's write $\theta^T \mathbf{x}$ as $\theta'^T\mathbf{x}' + \theta_j x_j$, where $\theta = \theta' + \{ \theta_j \}$ and $\mathbf{x} = \mathbf{x}' + \{ x_j \}$, i.e. $\theta'$ and $\mathbf{x}'$ are $\theta$ and $\mathbf{x}$ with the $j$-th component removed. Then we have:
$$ L = \frac{1}{n} \sum_{i=1}^n \left( \frac{1}{1 + e^{-\theta'^T\mathbf{x}'_i - \theta_j x_{i,j}}} - y_i \right)^2 $$Take:
$$ L_i = \left( \frac{1} {1 + \exp(-\theta'^T\mathbf{x}'_i - \theta_j x_{i,j})} - y_i \right)^2 $$For the term $1/(1 + \exp(- \theta'^T\mathbf{x}'_i - \theta_j x_{i,j})) = \sigma(\theta^T\mathbf{x}_i)$, the derivative with respect to $\theta_j$ is (applying the $\sigma(f(x))$ result above, with $f = \theta'^T\mathbf{x}'_i + \theta_j x_{i,j}$ and $\partial f / \partial \theta_j = x_{i,j}$):
$$ \frac{\partial\, \sigma(\theta^T\mathbf{x}_i)}{\partial \theta_j} = \sigma(\theta^T\mathbf{x}_i)\left(1 - \sigma(\theta^T\mathbf{x}_i)\right) x_{i,j} $$So:
$$ \frac{\partial L } {\partial \theta_j } = \frac{1}{n} \sum_{i=1}^n \frac{\partial L_i} {\partial \theta_j} = \frac{1}{n} \sum_{i=1}^n 2\left(\sigma(\theta^T\mathbf{x}_i) - y_i\right)\sigma(\theta^T\mathbf{x}_i)\left(1 - \sigma(\theta^T\mathbf{x}_i)\right) x_{i,j} $$
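To check this gradient numerically, here is a small sketch (the function names and random data are illustrative only) that computes $\partial L / \partial \theta_j$ from the expression above and compares it against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse_loss(theta, X, y):
    p = sigmoid(X @ theta)
    return np.mean((p - y) ** 2)

def mse_grad(theta, X, y):
    # dL/dtheta_j = (1/n) * sum_i 2 * (sigma_i - y_i) * sigma_i * (1 - sigma_i) * x_{i,j}
    p = sigmoid(X @ theta)
    return (2.0 / len(y)) * (X.T @ ((p - y) * p * (1 - p)))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
theta = rng.normal(size=3)

analytic = mse_grad(theta, X, y)
eps = 1e-6
numeric = np.array([
    (mse_loss(theta + eps * e, X, y) - mse_loss(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic)
print(numeric)  # should match the analytic gradient closely
```

The rest of these notes switch from this squared-error objective to a maximum-likelihood one, which ends up with a simpler gradient.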
We want to model $y$ as:
$$ p(y=1 \mid \mathbf{\theta}, \mathbf{x}) = f(\mathbf{x}, \theta) = \sigma(\theta^T \mathbf{x}) $$We want to choose a $\theta$ which maximizes $p(\mathcal{Y} \mid \theta, \mathcal{X})$, i.e. the 'maximum likelihood' solution. (Alternative estimators, such as a MAP solution that incorporates a prior over $\theta$, would in general give a different $\theta$.)
The likelihood function is:
$$ \mathcal{L}_\theta = p( \mathcal{Y} \mid \mathcal{X}, \theta) $$Assuming the datapoints are independent, the log likelihood is:
$$ \log \mathcal{L}_\theta = \sum_{i=1}^n \log p(y_i \mid \mathbf{x}_i, \theta) $$We can write each $p(y_i \mid \mathbf{x}_i, \theta)$ as the product $\sigma(\theta^T\mathbf{x}_i)^{y_i}(1 - \sigma(\theta^T\mathbf{x}_i))^{1 - y_i}$: since $y_i$ is either $0$ or $1$, one of the two factors has exponent $0$ and so equals $1$, while the other factor is the probability we want, so multiplying them is correct. (A sum of the two terms would not be correct, since it would be offset by $1$.)
So:
$$ \log p(y_i \mid \x_i, \thetav) = \log \left( \sigma(\thetav^T\x_i)^{y_i}(1 - \sigma(\thetav^T\x_i))^{(1 - y_i)} \right) $$So:
$$ \log \mathcal{L}_\thetav = \sum_{i=1}^n \left( y_i \log(\sigma(\thetav^T \x_i)) + (1 - y_i) \log(1 - \sigma(\thetav^T \x_i)) \right) $$To maximize the log likelihood, we can take the derivative with respect to each component $\theta_j$, and e.g. use gradient ascent on this.
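As a sketch of what this objective looks like in code (reusing the `sigmoid` helper from earlier; the function name `log_likelihood` and the example data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # sum_i [ y_i * log(sigma(theta^T x_i)) + (1 - y_i) * log(1 - sigma(theta^T x_i)) ]
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
y = np.array([1.0, 0.0, 1.0])
theta = np.array([0.1, -0.2])
print(log_likelihood(theta, X, y))  # a negative scalar; closer to 0 is better
```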
We can write the log likelihood as $\log \mathcal{L}_\thetav = \sum_{i=1}^n \mathcal{L}_\thetav^i$, where we define:
$$ \mathcal{L}_\thetav^i = y_i \log(\sigma(\thetav^T \x_i)) + (1-y_i) \log( 1 - \sigma(\thetav^T \x_i)) $$Let's define $\x'_i$ to be such that $\x_i = \x'_i + \{ x_{i,j} \}$, and $\theta'$ to be such that $\thetav = \theta' + \{ \theta_j \}$ (the same component split as in the MSE section above).
So, looking at the left hand expression, we have
$$ \frac{\partial\, y_i \log(\sigma(\theta'^T \x'_i + \theta_j x_{i,j}))} {\partial \theta_j} $$Using https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x/1225116#1225116 as a reference:
Let's say we want to find the derivative of $y = \sigma(x) = (1 + \exp(-x))^{-1}$. So we have:
$$ \frac{dy}{dx} = (-1)(1 + \exp(-x))^{-2}(\exp(-x))(-1) $$Ok, but in our case we have something more like $y = \sigma(f(x)) = (1 + \exp(-f(x)))^{-1}$. What is the derivative of this?
So, a factor of $f'(x)$ greater than the earlier derivative, and we need to substitute $x$ by $f(x)$. So the derivative in this case will be:
$$ \frac{d\, \sigma(f(x))}{dx} = \sigma(f(x))\left(1 - \sigma(f(x))\right) f'(x) $$Therefore:
$$ \frac{ \partial\, \sigma(\theta'^T \x'_i + \theta_j x_{i,j})} {\partial \theta_j} = \sigma(\thetav^T\x_i)\left(1 - \sigma(\thetav^T\x_i)\right) x_{i,j} $$Next, using the derivation at the bottom of http://ucanalytics.com/blogs/gradient-descent-logistic-regression-simplified-step-step-visual-guide/ as a guide, i.e. chain-ruling the loss function:
We want:
$$ \frac{\partial \log \mathcal{L}_i} {\partial \theta_j} $$for each $j \in \{1, \dots, m \}$, where $m$ is the number of features.
We have:
$$ \log \mathcal{L}_i = y_i \log (\sigma(\thetav^T\x_i)) + (1 - y_i) \log (1 - \sigma(\thetav^T\x_i)) $$We can substitute a variable for the predicted probability $\sigma(\thetav^T \mathbf{x}_i)$. Let's denote this variable as $F(\thetav, \mathbf{x}_i) = \sigma(\thetav^T \mathbf{x}_i)$. So we have:
$$ \log \mathcal{L}_i = y_i \log(F) + (1 - y_i)\log(1 - F) $$And using the chain rule we have:
$$ \frac{\partial \log \mathcal{L}_i } {\partial \theta_j} = \frac{\partial \log \mathcal{L}_i} {\partial F} \frac{\partial F} {\partial \theta_j} $$Next, let's substitute the linear function $\thetav^T \mathbf{x}_i$, i.e. the 'logit', as the variable $z_i$ (which is a scalar, since it's the result of a dot product). So we have:
$$ F = \sigma(z_i(\thetav, \x_i)) $$And thus, using chain rule on $F$ we have:
$$ \frac{ \partial F} {\partial \theta_j} = \frac{\partial F} {\partial z_i} \frac{\partial z_i} {\partial \theta_j} $$Finally, the entire chain rule expression for the derivative of the log likelihood will be:
$$ \frac{\partial \log \mathcal{L}_i} {\partial \theta_j} = \frac{\partial \log \mathcal{L}_i } {\partial F} \frac{\partial F} {\partial z_i} \frac{\partial z_i} {\partial \theta_j} $$Looking at each term, one by one, we have:
$$ \frac{\partial \log \mathcal{L}_i} {\partial F} = y_i \frac{1} {F} + (1 - y_i) \frac{-1}{1 - F} $$or we can also write as:
$$ \frac{\partial \log \mathcal{L}_i}{\partial F} = \frac{y_i}{F} + \frac{1- y_i}{F - 1} $$Looking at the next term, we have:
$$ \frac{\partial F }{\partial z_i} = \frac{\partial}{\partial z_i} \sigma(z_i) = \sigma(z_i)\left(1 - \sigma(z_i)\right) = F(1 - F) $$(as derived earlier)
Lastly we have:
$$ \frac{\partial z_i}{\partial \theta_j} = \frac{\partial }{\partial \theta_j} \thetav^T \mathbf{x}_i = x_{i,j} $$
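To make the three factors concrete, here is a small numerical sketch (illustrative names and data) that evaluates each factor for a single datapoint and checks that their product matches a finite-difference estimate of $\partial \log \mathcal{L}_i / \partial \theta_j$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_lik_i(theta, x_i, y_i):
    F = sigmoid(theta @ x_i)
    return y_i * np.log(F) + (1 - y_i) * np.log(1 - F)

x_i = np.array([1.0, -0.5, 2.0])
y_i = 1.0
theta = np.array([0.3, 0.1, -0.2])
j = 1  # the component of theta we differentiate with respect to

z = theta @ x_i                           # z_i = theta^T x_i
F = sigmoid(z)                            # F = sigma(z_i)
dL_dF = y_i / F - (1 - y_i) / (1 - F)     # d log L_i / dF
dF_dz = F * (1 - F)                       # dF / dz_i
dz_dtheta_j = x_i[j]                      # dz_i / dtheta_j
analytic = dL_dF * dF_dz * dz_dtheta_j

eps = 1e-6
e_j = np.eye(3)[j]
numeric = (log_lik_i(theta + eps * e_j, x_i, y_i)
           - log_lik_i(theta - eps * e_j, x_i, y_i)) / (2 * eps)
print(analytic, numeric)  # should match closely
```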
So, overall, we have:
$$ \frac{\partial \log \mathcal{L}_i} {\partial \theta_j} = \frac{\partial \log \mathcal{L}_i}{\partial F} \frac{\partial F}{\partial z_i} \frac{\partial z_i}{\partial \theta_j} = \left( \frac{y_i}{F} - \frac{1 - y_i}{1 - F} \right) F(1 - F)\, x_{i,j} = \left( y_i(1 - F) - (1 - y_i)F \right) x_{i,j} = (y_i - F)\, x_{i,j} $$So, for all dimensions $j \in \{1, \dots, m \}$, we have:
$$ \nabla_\thetav \log \mathcal{L}_i = (y_i - F)\, \x_i $$Finally, for all $i \in \{1, \dots, n \}$, we have:
$$ \nabla_\thetav \log \mathcal{L}_\thetav = \sum_{i=1}^n \left( y_i - \sigma(\thetav^T \x_i) \right) \x_i $$
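Putting it all together, here is a minimal gradient-ascent sketch using this gradient (the learning rate, iteration count, and synthetic data are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary-classification data.
rng = np.random.default_rng(42)
n, m = 200, 3
X = rng.normal(size=(n, m))
true_theta = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=n) < sigmoid(X @ true_theta)).astype(float)

# Gradient ascent on the log likelihood:
#   grad = sum_i (y_i - sigma(theta^T x_i)) x_i
theta = np.zeros(m)
lr = 0.1
for step in range(500):
    p = sigmoid(X @ theta)
    grad = X.T @ (y - p)
    theta += lr * grad / n   # average over n to keep the step size stable
print(theta)  # should end up roughly in the direction of true_theta
```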