Logistic Regression

Given an input x, we want the probability that y = 1, where 0 and 1 denote the two classes of a binary classification problem.

In order to obtain a probability between 0 and 1 we apply the sigmoid function $$\hat{y} = a = \sigma(z) = \frac{1}{1+e^{-z}}$$ to the linear combination $$z = w^Tx + b$$
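A minimal sketch of this forward computation for a single sample (the values of w, b and x below are made up for illustration):

import numpy as np

def sigmoid(z):
    # squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -0.3])   # weights, one per feature (example values)
b = 0.1                     # bias (example value)
x = np.array([1.0, 2.0])    # one sample with n_x = 2 features

z = np.dot(w, x) + b        # linear part z = w^T x + b
y_hat = sigmoid(z)          # predicted probability that y = 1
print(y_hat)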

Loss-function

The loss-function for one sample can be given by $$L(\hat{y},y)=\frac{1}{2}(\hat{y} - y)^2$$

With this loss, however, the optimization problem becomes non-convex (many local optima). Instead we use

$$L(\hat{y},y)=-(y\log\hat{y} + (1-y)\log(1-\hat{y}))$$

If y = 1, this simplifies to $L(\hat{y},y)=-\log\hat{y}$, so a large $\hat{y}$ (close to 1) minimizes the loss.

If y = 0, it simplifies to $L(\hat{y},y)=-\log(1-\hat{y})$, so a small $\hat{y}$ (close to 0) minimizes the loss.
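A small sketch to check both cases numerically (the probabilities passed in are made up):

import numpy as np

def loss(y_hat, y):
    # cross-entropy loss for a single sample
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(loss(0.9, 1))   # y = 1, y_hat close to 1 -> small loss
print(loss(0.1, 1))   # y = 1, y_hat close to 0 -> large loss
print(loss(0.1, 0))   # y = 0, y_hat close to 0 -> small loss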

Cost-function

The cost function averages the loss over all m training samples:

$$J(w,b) = \frac{1}{m}\sum\limits_{i=1}^m L(\hat{y}^{(i)},y^{(i)})$$

Initializing the weights and the bias with 0 or at random (the choice does not matter much because the cost function is convex) and then calculating the cost function for the m samples with $n_x$ features each constitutes the forward step of logistic regression.
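A loop-based sketch of this forward step (a vectorized version follows in the Vectorization section); X is assumed to hold one sample per column and Y the labels in {0, 1}:

import numpy as np

def forward_cost(w, b, X, Y):
    # w: (n_x,) weights, b: scalar bias, X: (n_x, m) samples, Y: (m,) labels
    m = X.shape[1]
    cost = 0.0
    for i in range(m):
        z = np.dot(w, X[:, i]) + b                # linear part for sample i
        a = 1.0 / (1.0 + np.exp(-z))              # predicted probability
        cost += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
    return cost / m

w = np.zeros(2)                                   # initialize with 0
b = 0.0
X = np.array([[1.0, 2.0, 0.5], [0.0, 1.0, 1.5]])  # 3 made-up samples, n_x = 2
Y = np.array([1, 0, 1])
print(forward_cost(w, b, X, Y))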

The backward step updates the weights and the bias:

$$w_i = w_i -\alpha \frac{dJ(w, b)}{dw_i}$$
$$b = b -\alpha \frac{dJ(w, b)}{db}$$

This means that if the gradient is negative, the parameter is increased, and vice versa. The learning rate $\alpha$ controls the step size so that the iteration actually converges.
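A small sketch of a single update step (the gradient values here are made up; in practice they come from the backpropagation step below):

import numpy as np

w = np.array([0.5, -0.3])    # current weights (example values)
b = 0.1                      # current bias
dw = np.array([0.2, -0.1])   # dJ/dw (made-up gradient)
db = 0.05                    # dJ/db (made-up gradient)

alpha = 0.01                 # learning rate (assumed value)
w = w - alpha * dw           # a negative gradient component increases the weight
b = b - alpha * db
print(w, b)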

The needed partial derivatives are calculated using the chain rule (backpropagation).

This means that, for example:

$$\frac{dL}{dw_i} = \frac{dz(w,b)}{dw_i}\cdot \frac{da(z)}{dz} \cdot \frac{dL(a, y)}{da}$$

so

$$\frac{dL}{dw_i} = (x_i) \cdot (a(1-a)) \cdot (-\frac{y}{a} + \frac{1-y}{1-a}) = x_i(a-y)$$
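The result can be verified numerically with a finite-difference check (a sketch with made-up values for w, b, x and y):

import numpy as np

def loss(w, b, x, y):
    a = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w = np.array([0.4, -0.7])
b = 0.2
x = np.array([1.5, -0.5])
y = 1.0

a = 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
analytic = x * (a - y)                       # dL/dw_i = x_i (a - y)

eps = 1e-6
numeric = np.array([(loss(w + eps * e, b, x, y) - loss(w - eps * e, b, x, y)) / (2 * eps)
                    for e in np.eye(2)])     # central difference per weight
print(analytic)
print(numeric)                               # should agree closely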

Vectorization

As the following example shows, a Python for-loop is around 300 times slower than the vectorized calculation, because the vectorized version runs in optimized native code and can exploit SIMD parallelism.


import numpy as np
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# dot product with an explicit Python loop
tic = time.time()
c = 0
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print("For loop: " + str(1000 * (toc - tic)) + "ms")

# the same dot product, vectorized with NumPy
tic = time.time()
c = np.dot(a, b)
toc = time.time()
print("Vectorized version: " + str(1000 * (toc - tic)) + "ms")


For loop: 326.5037536621094ms
Vectorized version: 0.9288787841796875ms

Therefore we use vectors:

$X$ is the $(n_x, m)$ matrix containing the data samples, one sample per column
$Z = [z_1, z_2, ..., z_m] = w^TX + b$
$A = [a_1, a_2, ..., a_m] = [\sigma(z_1), \sigma(z_2), ...] = \sigma(Z)$

and our code simplifies to

$dZ = A - Y$
$dw = \frac{1}{m}XdZ^T$
$db = \frac{1}{m}\,\texttt{np.sum}(dZ)$
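
Putting the vectorized forward and backward steps together, a minimal training loop could look like the following sketch (X of shape (n_x, m), Y of shape (1, m); the function name and default hyperparameters are assumptions):

import numpy as np

def train(X, Y, alpha=0.1, iterations=1000):
    n_x, m = X.shape
    w = np.zeros((n_x, 1))                  # initialize weights and bias with 0
    b = 0.0
    for _ in range(iterations):
        Z = np.dot(w.T, X) + b              # forward step: Z = w^T X + b
        A = 1.0 / (1.0 + np.exp(-Z))        # A = sigmoid(Z)
        dZ = A - Y                          # backward step
        dw = np.dot(X, dZ.T) / m            # dw = (1/m) X dZ^T
        db = np.sum(dZ) / m                 # db = (1/m) np.sum(dZ)
        w = w - alpha * dw                  # gradient-descent update
        b = b - alpha * db
    return w, b

X = np.array([[0.0, 1.0, 2.0, 3.0]])        # 4 made-up samples, n_x = 1
Y = np.array([[0, 0, 1, 1]])
w, b = train(X, Y)
print(w, b)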