Logistic Regression

(before Multinomial logistic regression)

We want to predict the probability of an input belonging to one of two classes.


Study case :

Classify the zero and one digits from MNist dataset

a) Dataset !

  • Input: Images of size 28*28 where a one or two is present
  • Output: 0 if the input is a 0, 1 otherwise

Lets... Load it, check it, quick, rewrite it !


In [8]:
import keras
import utils
reload(utils)
from utils import *
%pylab inline

# Get the input data (All the zeros and ones in the dataset)
(x, y), (x_, y_) = keras.datasets.mnist.load_data()
X = x[np.where(y<=1)]
Y = y[np.where(y<=1)]
Y = np.array(Y, dtype='int')

# Reshape the images to vectors
X = X.reshape(X.shape[0], -1)
X = X / 255.  # Normalize inputs

# Vizualize the digits
pylab.rcParams['figure.figsize'] = (15, 4)
for i in range(12):
    plt.subplot(2, 6, i+1)
    plt.imshow(X[i].reshape([28, 28]))
plt.show()


Populating the interactive namespace from numpy and matplotlib
0.995973154362 0.0134918877146
0.998420844848 0.00517915074765

b) Classifier

We use a logistic regression, (You may want to read this : http://cs229.stanford.edu/notes/cs229-notes1.pdf) :

The Cost is a function of the true output $Y$ and the prediction $p$, which itself is a function of a linear activation $s(x)$

  • linear unit : $ s = (W^t \cdot X + b) $
  • prediction : $ p(s) = \frac{1}{1 + e^{-s}} $
  • Cost : $ C(y, p) = - y \ln(p) - (1-y)(\ln(1-p)) $

To use gradient descent, we have to compute the gradient of the cost with respect to w :

$ \frac{dC}{dW} $

We take adventage of the chain rule :

$ \frac{dC}{dW} = \frac{dC}{dp} \cdot \frac{dp}{ds} \cdot \frac{ds}{dw} $


We derive each terms : \begin{align} \frac{dC}{dp} &= - \frac{y}{p} - (-1) \cdot \frac{1-y}{1-p} \\ &= - \frac{y}{p} + \frac{1-y}{1-p} \\ &= \frac{-y + y.p + p - y.p}{p \cdot (1-p)} \\ &= \frac{-y+p}{p \cdot (1-p)} \end{align}


\begin{align} \frac{dp}{ds} &= - \frac{-e^{-s}}{1 + e^{-s}} \\ &= \frac{-e^{-s}}{1 + e^{-s}} \\ &= \frac{e^{-s} + 1 - 1}{(1 + e^{-s})^2} \\ &= \frac{e^{-s} + 1}{(1 + e^{-s})^2} - \frac{1}{(1 + e^{-s})^2} \\ &= \frac{1}{1 + e^{-s}} - \left(\frac{1}{1 + e^{-s}}\right)^2 \\ &= p - p^2 \\ &= p \cdot (1-p) \end{align}
\begin{align} \frac{ds}{dw} = x \end{align}

All together, we have : \begin{align} \frac{dC}{dW} &= \frac{dC}{dp} \cdot \frac{dp}{ds} \cdot \frac{ds}{dw} \\ &= \frac{-y+p}{p \cdot (1-p)} \cdot p \cdot (1-p) \cdot x \\ &= (-y+p) \cdot x \\ &= (p-y) \cdot x \end{align}


In [ ]:
# Set-up the weights
W = np.random.random((784,))-.5


# Train
for _ in range(2):
    acc = []
    losses = []
    for x,y in zip(X, Y):
        pred = linear(x, W)
        pred = sigmoid(pred)
        acc.append(round(pred)==y)
        loss = nll(pred, y)
        losses.append(loss)
        update = (pred - y) * x 
        W = W - .02 * update
        
    print sum(acc) / float(len(acc)), sum(losses)/len(losses)

In [ ]:
gen = batch_generator(1)
valid_gen = batch_generator(100)
X_valid, Y_valid = valid_gen.next()

W = np.random.normal(size=IMG_SIZE * IMG_SIZE)
b = np.random.normal()

log = lambda x: np.log(x + 1e-8)
exp = lambda x: np.exp(x + 1e-8)

alph_ = 1.6732632423543772848170429916717
lambd_ = 1.0507009873554804934193349852946
linear = lambda x: np.dot(W.T, x) + b
sigm = lambda x: 1 / (1 + exp(-x))
elu = lambda x, alpha: np.maximum(x, alpha * (exp(x) - 1))
selu = lambda x: lambd_ * elu(x, alph_)
nll = lambda p, y: - y * log(p) - (1 - y) * log(1 - p)

In [4]:
def prob(X):
    return sigm(linear(X))


def loss(X, y):
    # loss = -   y .ln(  sigm(WT.X+b))
    #        -(1-y).ln(1-sigm(WT.X+b))
    p = prob(X)
    return nll(p, y)


def gradient_loss(X, y):
    # d.loss / d.W = (p-y).X
    p = prob(X)
    return ((p - y) * X)


def evaluate():
    probs = np.array(map(prob, X_valid))
    loss = nll(probs, Y_valid)
    loss = loss.mean()
    probs = map(round, probs)
    accuracy = sum(probs == Y_valid)
    return accuracy, loss
    

losses = []
alpha = 0.001
for epoch in range(60):
    _loss = 0
    alpha *= 0.95
    for _ in range(2000):
        X, Y = gen.next()
        X, Y = X[0], Y[0]
        _loss += loss(X, Y) 
        W = W - alpha * gradient_loss(X, Y)
    losses.append(_loss / 2000)
    print epoch, losses[-1], evaluate(), alpha


0 5.58640704366 (50, 5.4980701891512709) 0.00095
1 4.78260540237 (49, 4.9118047612289111) 0.0009025
2 4.6779473795 (45, 4.6402717779130489) 0.000857375
3 4.19205914058 (51, 4.6260771507937468) 0.00081450625
4 4.18138088763 (46, 4.1006190950133581) 0.0007737809375
5 4.03061164749 (45, 4.4078093967995144) 0.000735091890625
6 4.00313752114 (45, 3.957636464813949) 0.000698337296094
7 4.00241508545 (49, 3.973483561712158) 0.000663420431289
8 4.00545516721 (51, 3.937356929954051) 0.000630249409725
9 3.81162025656 (49, 3.8196276208981157) 0.000598736939238
10 3.75474551368 (44, 3.7994947362272757) 0.000568800092276
11 3.67568308959 (51, 3.964439740141497) 0.000540360087663
12 3.42577373362 (48, 3.6926151078334852) 0.00051334208328
13 3.55937114202 (48, 3.7084235642235086) 0.000487674979116
14 3.41879286207 (46, 3.9666758384427658) 0.00046329123016
15 3.39976138532 (47, 3.5711137297466298) 0.000440126668652
16 3.3747754989 (48, 3.6306529687410336) 0.000418120335219
17 3.34353080411 (49, 3.5043989749561422) 0.000397214318458
18 3.45143836959 (46, 3.5396416316390775) 0.000377353602535
19 3.38311020896 (48, 3.6162072502742206) 0.000358485922409
20 3.65954267922 (45, 3.9217008552532251) 0.000340561626288
21 3.37195837989 (48, 3.4380338061133342) 0.000323533544974
22 3.59789978355 (48, 3.4677439475656207) 0.000307356867725
23 3.3151534129 (47, 3.4253391760336727) 0.000291989024339
24 3.26146686929 (43, 3.8125735065031923) 0.000277389573122
25 3.35922874342 (48, 3.3612935805733799) 0.000263520094466
26 3.29183966871 (48, 3.3980402156421676) 0.000250344089742
27 3.24541810663 (48, 3.3968571035898121) 0.000237826885255
28 3.24288093327 (47, 3.3604785749531789) 0.000225935540993
29 3.10334828855 (47, 3.3236261364939161) 0.000214638763943
30 3.27249630872 (46, 3.5237206743978202) 0.000203906825746
31 3.29942450772 (49, 3.4054770497688658) 0.000193711484459
32 3.17544266729 (49, 3.3728691027481319) 0.000184025910236
33 3.10782246643 (50, 3.3443847813287619) 0.000174824614724
34 3.10312452758 (47, 3.3039372099099946) 0.000166083383988
35 3.11039406756 (47, 3.2903956767977558) 0.000157779214788
36 3.01876422007 (46, 3.2788950255654354) 0.000149890254049
37 3.14998150618 (47, 3.25227618355389) 0.000142395741346
38 3.12190866438 (48, 3.3807178492031729) 0.000135275954279
39 3.0572036775 (48, 3.2333472677610353) 0.000128512156565
40 3.1628851468 (46, 3.2511798478980301) 0.000122086548737
41 3.28884401217 (47, 3.2339452518996978) 0.0001159822213
42 3.1401521861 (47, 3.2573326251449384) 0.000110183110235
43 2.95490715026 (50, 3.2890192946274635) 0.000104673954723
44 3.21924685707 (46, 3.3661650359708473) 9.94402569871e-05
45 2.97125865137 (47, 3.2201365731455325) 9.44682441377e-05
46 3.0105916419 (50, 3.2484709422427915) 8.97448319309e-05
47 3.08820207266 (46, 3.1789324939692341) 8.52575903343e-05
48 2.93127857403 (46, 3.2043797585783413) 8.09947108176e-05
49 2.94007153269 (50, 3.2407554657565334) 7.69449752767e-05
50 3.02839157852 (48, 3.2201767826141094) 7.30977265129e-05
51 2.99339057912 (46, 3.2046952169412712) 6.94428401872e-05
52 3.0208836675 (45, 3.1900665488175846) 6.59706981779e-05
53 2.86364003746 (48, 3.2863857720911978) 6.2672163269e-05
54 3.07503077162 (50, 3.2166618088890231) 5.95385551055e-05
55 3.17568124151 (51, 3.2215881147922469) 5.65616273503e-05
56 2.86760908392 (48, 3.1844395905381906) 5.37335459827e-05
57 3.14490379642 (45, 3.1681813108390378) 5.10468686836e-05
58 3.13355313854 (47, 3.174980458644062) 4.84945252494e-05
59 3.02964391155 (46, 3.1529717218506721) 4.6069798987e-05

In [5]:
plt.plot(losses)
plt.show()



In [6]:
def prob(X):
    return sigm(selu(linear(X)))


def loss(X, y):
    # loss = -   y .ln(  sigm(WT.X+b))
    #        -(1-y).ln(1-sigm(WT.X+b))
    p = prob(X)
    return nll(p, y)


def gradient_loss(X, y):
    # d.loss / d.W = (p-y).X
    p = prob(X)
    if linear(X) <= 0:
        return X * (p - y) * (p + lambd_ * lambd_)
    else: 
        return X * (p - y) * lambd_


def evaluate():
    probs = np.array(map(prob, X_valid))
    loss = nll(probs, Y_valid)
    loss = loss.mean()
    probs = map(round, probs)
    accuracy = sum(probs == Y_valid)
    return accuracy, loss
    

losses = []
alpha = 0.001
for epoch in range(30):
    _loss = 0
    alpha *= 0.95
    for _ in range(2000):
        X, Y = gen.next()
        X, Y = X[0], Y[0]
        _loss += loss(X, Y)
        W = W - alpha * gradient_loss(X, Y)
    losses.append(_loss / 2000)
    print epoch, losses[-1], evaluate(), alpha


0 3.83802654556 (48, 4.4013385659404829) 0.00095
1 3.99140416499 (47, 4.2539705687866336) 0.0009025
2 3.89035629104 (46, 3.6237028880768731) 0.000857375
3 3.66249415031 (45, 4.8846913033125334) 0.00081450625
4 3.91929414034 (50, 4.0081923277662641) 0.0007737809375
5 3.85921859301 (46, 4.3803376646699403) 0.000735091890625
6 3.83240592633 (48, 4.0173954833909065) 0.000698337296094
7 3.66728704475 (48, 4.243323909749213) 0.000663420431289
8 3.67886858276 (44, 3.3552966672141293) 0.000630249409725
9 3.66690426904 (46, 3.9491893318703633) 0.000598736939238
10 3.93367755577 (46, 4.1890998085971374) 0.000568800092276
11 3.8107728032 (51, 5.6864281864492447) 0.000540360087663
12 3.60085335135 (50, 4.0974383676268342) 0.00051334208328
13 3.51783716129 (46, 4.2104113075881919) 0.000487674979116
14 3.87682400969 (49, 5.0915732144784833) 0.00046329123016
15 3.82137540006 (49, 5.4474757443576332) 0.000440126668652
16 3.63645619623 (44, 4.4993748975961321) 0.000418120335219
17 3.58141992227 (48, 4.8485145499985158) 0.000397214318458
18 3.56220959678 (43, 4.2079632250124197) 0.000377353602535
19 3.70978241656 (45, 4.0416780129651206) 0.000358485922409
20 3.72722209821 (46, 3.83724839712358) 0.000340561626288
21 3.63700121302 (51, 5.122721724813422) 0.000323533544974
22 3.72064874475 (48, 4.7728162919681578) 0.000307356867725
23 3.60130726083 (48, 4.0350406058257393) 0.000291989024339
24 3.63626757999 (46, 4.4730889104746456) 0.000277389573122
25 3.56857326331 (47, 4.2661337098623111) 0.000263520094466
26 3.75013823155 (46, 3.1673430289452784) 0.000250344089742
27 3.61434961727 (47, 4.2575482296559359) 0.000237826885255
28 3.45183108962 (49, 3.862873020423069) 0.000225935540993
29 3.62864557311 (47, 4.7574814872222619) 0.000214638763943

In [7]:
plt.plot(losses)
plt.show()