In [2]:
%matplotlib notebook
%load_ext Cython

# -*- coding: utf-8 -*-
"""
Created on 2018-01-15

@author: talbot
"""

import numpy as np
import matplotlib.pyplot as plt
from scipy.linalg import eigh

In [3]:
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: right;
    vertical-align: middle;
}
</style>
""")


Out[3]:

The perceptron

Hugues Talbot

In this notebook/presentation, we study the simplest neural network from the ground up. In spite of its extreme simplicity, we show that it has promising capabilities. Most importantly, this study allows us to demonstrate the basic elements of neural network analysis and optimisation.

$\newcommand{\V}[1]{\mathbf{\sf #1}}$ $\newcommand{\M}[1]{\mathbf{#1}}$

A single artificial neuron

The perceptron is a single artificial "neuron", the elementary building block of an artificial neural network. (Figure: schematic of a perceptron, with inputs $x_1,\ldots,x_n$, weights, a summation node and an activation function.)

As shown in the figure above, the artificial neuron takes as input a vector of real numbers $\V{x}=\{x_1, x_2, \ldots, x_n\}$. It computes a weighted sum of these inputs and composes the result with an activation function $\sigma$, which acts like a threshold. The output, denoted $y$, is a real number.

A typical activation function is a sigmoid. This is a smooth function that takes any real number as input and outputs a value between 0 and 1.
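
As a minimal illustration, the forward pass of a single neuron can be sketched in a few lines of NumPy (the names `sigmoid` and `neuron` and the numerical values below are purely illustrative):

In [ ]:
import numpy as np

def sigmoid(z):
    """Logistic activation: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """A single artificial neuron: weighted sum of the inputs, then activation."""
    return sigmoid(np.dot(w, x) + b)

## illustrative values: three inputs, three weights, one bias
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(neuron(x, w, b))   ## a single real number between 0 and 1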

The activation function

Here we show the plot of a sigmoid, for which one possible formula is

\begin{equation} f(x) = \frac{1}{1+\exp(-x)} \end{equation}

This particular sigmoid is called the logistic function. It tends to 0 as $x$ tends to $-\infty$ and to 1 as $x$ tends to $+\infty$.


In [4]:
x=np.arange(-5,+5,step=0.1)
y=1/(1+np.exp(-x))
plt.figure()
plt.plot(x,y)
plt.show()


Interpretation of the perceptron

The perceptron acts as a binary classifier. Given a set of inputs together with a binary output $\{0,1\}$, the weights $w_i$ can be optimized so that the classification error is minimized. The trained classifier can then be used on new data. Since we use the logistic sigmoid as activation function, this approach is known as logistic regression.

Concept of a loss function

In classification or regression, we typically train a system to predict a result given a set of inputs. The difference between the predicted outcome and the measured one is classically called the loss. There are many possible loss functions depending on the type of outcome.

The quadratic loss

In linear regression, it is typical to consider a quadratic loss

$$ {\cal L}_2(\V{x},\V{y}) = \| \M{A}\V{x}-\V{y} \|^2, $$

where $\|\cdot\|$ denotes the Euclidean (or $\ell_2$) norm and $\M{A}$ is a linear operator in matrix form. This loss gives rise to the well-known and powerful least-squares methods. It is well suited to linear regression because in this setting the ${\cal L}_2$ loss is convex.

This particular loss is very common because, in statistical terms, it corresponds to the negative log-likelihood under Gaussian noise. Let us review these concepts on the next slide.

Loss as a log-likelihood

Consider a parametrized family of probability density functions (for a continuous distribution) or probability mass functions (for a discrete distribution), $$ \V{x} \mapsto f(\V{x}\, | \,\theta), $$ where $\theta$ is the vector of parameters. We call the likelihood the function $$ \theta \mapsto f(\V{x} \,|\, \theta). $$ Typically, this function is written $$ {\cal L}(\theta\, |\, \V{x}) = f(\V{x}\, |\, \theta). $$

The method of maximum likelihood estimation

Given the above, the method of maximum likelihood estimation (MLE) finds the value of the parameter vector $\hat{\theta}$ that maximizes the likelihood function ${\cal L}(\theta\, |\, x)$. $$ \hat{\theta} = \underset{\theta}{\text{argmax}}\;{\cal L}(\theta\,|\,\V{x}) $$ As is typical, we go for a minimization instead of a maximization, and we take the log of the likelihood. The value of $\hat{\theta}$ is unchanged. $$ \hat{\theta} = \underset{\theta}{\text{argmin}}\;-\log{\cal L}(\theta\,|\,\V{x}) $$ The function $-\log{\cal L}(\theta\,|\,\V{x})$ is often called the negative log-likelihood.
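
As a small numerical illustration (a sketch, with an illustrative helper `neg_log_likelihood`), we can recover the MLE of the mean of a Gaussian sample by minimizing the negative log-likelihood directly; it coincides with the sample mean, as expected.

In [ ]:
import numpy as np
from scipy.optimize import minimize_scalar

np.random.seed(0)
x = np.random.normal(2.5, 1.0, 500)  ## synthetic sample with true mean 2.5 and known sigma=1

def neg_log_likelihood(mu):
    ## negative log-likelihood of a Gaussian with known sigma, up to additive constants
    return np.sum((x - mu)**2) / 2.0

res = minimize_scalar(neg_log_likelihood)
print("MLE of the mean :", res.x)     ## numerical minimizer of the negative log-likelihood
print("Sample mean     :", x.mean())  ## closed-form MLE; the two should agree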

MLE estimation example

A typical case is when an observation represented by a linear operator $\M{H}$ is corrupted by noise. If we assume the noise is centered additive white Gaussian, we write

$$ \V{y} = \M{H}\V{x} + \eta, $$

where $\eta \sim G(0,\sigma)$. We can write

$$ \M{H}\V{x} - \V{y} \sim G(0,\sigma), $$

and so the likelihood of the observation $\V{y}$, given $\V{x}$ and $\sigma$, is $$ P(\V{y}\,|\,\V{x},\sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\|\M{H}\V{x}-\V{y}\|^2}{2\sigma^2}\right). $$

Following the method of maximum likelihood, the best estimator for $\V{x}$ is (by taking the log and ignoring constants)

$$ \hat{\V{x}} = \underset{\V{x}}{\text{argmin}} \; \|\M{H}\V{x} - \V{y}\|^2, $$

i.e. nothing more than the method of least squares mentioned above. Since $\M{H}$ can be an arbitrary linear operator, it can represent very sophisticated apparatuses: MRI machines, CT scanners, non-trivial cameras, etc. This estimator is at the basis of many famous and varied methods in signal and image processing.

Gradient descent

With the method of maximum likelihood estimation, we will need to compute $$ (\hat{\V{w}},\hat{b}) = \underset{\V{w},b}{\text{argmin}}\;{\cal L}(\V{a},\V{y}). $$ One way to do this is the method of gradient descent, which we first illustrate on linear regression.

To solve the Least-Square regression problem, we seek a solution to

$$ \V{x}^\star = \underset{\V{x}}{\operatorname{argmin}}\; f_0(\V{x}), \qquad f_0(\V{x}) = \| \M{A} \V{x} - \V{b}\|^2_2 $$

The gradient of this objective function is

$$ \nabla f_0 (\V{x}) = 2 \M{A}^\top (\M{A} \V{x} - \V{b}) $$

The gradient descent algorithm is basically \begin{align} \text{do} \{ & \\ & \V{x}_{k+1} = \V{x}_{k} - \alpha \nabla f_0(\V{x}_k) \\ \} & \;\text{while}\; (|\V{x}_k - \V{x}_{k+1}| > \varepsilon) \end{align}

$\alpha$ is the stepsize, also called the learning rate. In the following function it is fixed. Note that the maximum usable stepsize is governed by the maximum curvature of the objective, or equivalently by the Lipschitz constant of its gradient.


In [4]:
def iterative_sol_gd(A, y, step, epsilon=1e-5, maxiter=10000):
    """
    This function solves a least-squares problem by a gradient descent approach
    with a fixed stepsize.
    """
    AT = A.T
    pn = AT.dot(y) * 0.  ## initial estimate: a zero vector with the right shape
    iter = 0
    while True:
        Grad = 2*AT.dot(A.dot(pn) - y)  ## gradient of ||A p - y||^2 at pn
        pn1 = pn - step*Grad            ## gradient step
        diff = pn1-pn
        if (diff.T.dot(diff) < epsilon*epsilon):  ## stop when the update is small enough
            print("OK (%d iterations)" % iter)
            break
        pn = pn1
        iter += 1
        if (iter > maxiter):
            print("Exceeded maximum number of iterations (%d)" % maxiter)
            break
    return pn1

In [5]:
def iterative_linear_regression(eps=1e-5, LF=1.5):
    """
        Iterative solution to the least-squares problem by gradient descent.
        LF is the Lipschitz factor: theoretically 1.0 is the safe choice and 2.0 diverges;
        in practice a value slightly below 2.0 (e.g. 1.99) gives faster convergence.
    """
    # input range
    x = np.arange(1,4,0.01)
    n = x.size
    # linear model, corrupted by noise
    y1 = 3.0 * x + 2.0
    y = y1 + np.random.normal(0,0.5,n)
    # linear regression retrieves the model
    A= np.array((x, np.ones(n)))
    AT=A.T
    y= y.T
    # compute the Hessian of the system
    A2=AT.dot(A)
    # compute the largest eigenvalue
    leigh = eigh(A2,eigvals_only=True)
    mleigh = max(leigh)  
    print("Lipschitz constant = %f" % (2*mleigh))
    L= 1./(2*mleigh)
    ## print("Maximum stepsize = %f" % (2*L))
    res = iterative_sol_gd(AT,y,step=LF*L,epsilon=eps)
    print("Estimated parameters : p1=%.5f, p2=%.5f" % (res[0],res[1]))
    solutionA = np.dot(np.linalg.inv(np.dot(AT.T,AT)),np.dot(AT.T,y))
    print("Expected parameters  : p1=%.5f, p2=%.5f" % (solutionA[0],solutionA[1]))
    linesol1 = res[0]*x + res[1]
    linesol2 = solutionA[0]*x + solutionA[1]
    fig = plt.figure()
    plt.plot(x,y,"g.", x,linesol1,"--k", x,linesol2,"b")
    fig.show()
    return

In [6]:
iterative_linear_regression(eps=1e-7,LF=1.6)


Lipschitz constant = 4727.902862
OK (614 iterations)
Estimated parameters : p1=3.02595, p2=1.97274
Expected parameters  : p1=3.02595, p2=1.97275

Logistic loss

With the perceptron, the output is a real number between 0 and 1, which we compare to a binary ground truth; the system is no longer linear, so the quadratic loss is not suitable. Recall that in standard linear regression we make predictions of the form $$ \hat{y} = \V{w}^\top\V{x} + b. $$ With logistic regression, we seek an answer $\hat{y} \in \{0,1\}$, so we compose the previous model with a "smooth" thresholding function, the sigmoid: $$ \hat{y} = \sigma(\V{w}^\top\V{x} + b). $$

Derivation of the logistic loss in the scalar case

The output will look like a probability $\hat{y} \in [0,1]$. We also have $\sigma(-z) = 1-\sigma(z)$. We may write \begin{align} P(y=1\,|\,z) &= \sigma(z) = \frac{1}{1+e^{-z}}\\ P(y=0\,|\,z) &= 1-\sigma(z) = \frac{1}{1+e^{z}}. \end{align} Since $y$ is either 0 or 1, these may be expressed more compactly in a single equation as $$ P(y\,|\,z) = \sigma(z)^y\,(1-\sigma(z))^{1-y}. $$ The negative log-likelihood is therefore \begin{align} {\cal L}(z,y) &= -\log(\sigma(z)^y\,(1-\sigma(z))^{1-y})\\ &= -(y\log(\sigma(z)) + (1-y)\log(1-\sigma(z))). \end{align}

Logistic loss in the vectorial case

The vectorial case is essentially the same: with $z = \V{w}^\top\V{x} + b$ (still a scalar), \begin{align} {\cal L}(z,y) &= -\log(\sigma(z)^y\,(1-\sigma(z))^{1-y})\\ &= -(y\log(\sigma(z)) + (1-y)\log(1-\sigma(z))); \end{align} note that $y$ also remains a scalar.

If we write $a = \sigma(\V{w}^\top\V{x} + b)$, we can also write \begin{equation} {\cal L}(a,y) = -(y\log a + (1-y)\log(1-a)). \label{eq:logitloss} \end{equation}

Gradient descent for the 1D logistic regression

To take things gradually, we first consider a simple one-dimensional logistic regression problem. All derivations here are in 1D, so the quantities involved, such as $a$ and $y$, are scalars.

In 1D, the model is

$$ a = \sigma(z) = \sigma(w x + b) $$

We recall the logistic loss: $$ {\cal L}(a,y) = -(y\log(a) + (1-y)\log(1-a)) $$ Differentiating with respect to $a$ yields $$ \frac{\partial {\cal L}}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a} $$

Recall also that $$ a = \sigma(z) = \frac{1}{1+\exp(-z)} $$

For the derivative with respect to $z$, we use the chain rule:

$$ \frac{\partial{\cal L}}{\partial z} = \frac{\partial{\cal L}}{\partial a}\frac{\partial a}{\partial z} $$

We have $$ \frac{\partial a}{\partial z} = \frac{\partial \sigma(z)}{\partial z} = \frac{\exp(-z)}{(1+\exp(-z))^2} = a (1-a) $$ Hence, with the chain rule \begin{align} \frac{\partial{\cal L}}{\partial z} &= \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a (1-a)\\ &= -y (1-a) + a (1-y) \\ &= a - y \end{align}

From this derivative we can compute the differentials for the parameters of the regression $w$ and $b$:

$$ \frac{\partial {\cal L}}{\partial w} = \frac{\partial {\cal L}}{\partial z}\frac{\partial z}{\partial w} = (a-y)\,x, \qquad \frac{\partial {\cal L}}{\partial b} = \frac{\partial {\cal L}}{\partial z}\frac{\partial z}{\partial b} = a-y $$
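
As a quick sanity check, these derivatives can be verified numerically with finite differences; the following is a small self-contained sketch with arbitrary illustrative values.

In [ ]:
import numpy as np

def sigma(z):
    return 1/(1+np.exp(-z))

def loss(w, b, x, y):
    a = sigma(w*x + b)
    return -(y*np.log(a) + (1-y)*np.log(1-a))

## arbitrary illustrative values
w, b, x, y = 0.7, -0.3, 1.5, 1.0
a = sigma(w*x + b)

## analytical gradients derived above
dw_analytic = (a - y)*x
db_analytic = (a - y)

## centered finite-difference approximations
eps = 1e-6
dw_numeric = (loss(w+eps, b, x, y) - loss(w-eps, b, x, y)) / (2*eps)
db_numeric = (loss(w, b+eps, x, y) - loss(w, b-eps, x, y)) / (2*eps)

print(dw_analytic, dw_numeric)  ## should agree to about 6 decimal places
print(db_analytic, db_numeric)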

From these, the gradient descent is straightforward. The code is in the next slide.


In [99]:
def sigma(z):
    return (1/(1+np.exp(-z)))

def iterative_sol_1d_logistic_gd(x, y, step, epsilon=1e-5, maxiter=10000):
    """
    This function solves the 1d logistic regression problem by a gradient descent approach
    with a fixed stepsize.
    x and y are given vectors with the same size. x is real and y is binary.
    """
    J = np.zeros(maxiter+1) ; dw = 0 ; db = 0
    w = np.zeros(maxiter+1) ; b = np.zeros(maxiter+1)
    i = 0
    while True:
        z = w[i]*x + b[i]
        a = sigma(z)
        ## compute the loss
        J[i] = -(y.T.dot(np.log(a)) + (1-y).T.dot(np.log(1-a)))
        dz = a - y
        dw = x.dot(dz)
        db = np.sum(dz) ## equivalent to np.ones(y.shape[0]).dot(dz)
        w[i+1] = w[i] - step*dw
        b[i+1] = b[i] - step*db
        if ((dw*dw+db*db) < epsilon*epsilon):
            print("OK, iteration=", i)
            break
        i += 1
        if (i >= maxiter):
            print("Exceeded maximum number of iterations (%d)" % maxiter)
            break
    return (w,b,J,i)

In [100]:
myT=np.random.random(10)*6+1
myF=np.random.random(10)*7-2
Y = np.concatenate((np.ones(10),np.zeros(10)))
X = np.concatenate((myT,myF))
plt.figure()
plt.scatter(X,Y)
plt.show()



In [10]:
(w,b,J,i) = iterative_sol_1d_logistic_gd(X,Y,step=0.02,epsilon=1e-5,maxiter=100000)
trueW = w[0:i] ; trueB = b[0:i] ; trueJ = J[0:i]
lastW = trueW[-1] ; lastB = trueB[-1] ; lastJ = trueJ[-1]
print(lastW,lastB)
plt.figure()
plt.title("Log10 of the loss convergence")
plt.plot(np.log10(trueJ-lastJ))
plt.show()


OK, iteration= 324
0.5511033395985682 -1.008393660946268
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:7: RuntimeWarning: divide by zero encountered in log10
  import sys

In [11]:
x = np.arange(-2,6,step=0.1)
plt.figure()
plt.scatter(X,Y)
plt.plot(x,sigma(lastW*x+lastB), color='r')
plt.show()
print("Inflexion point = ", -lastB/lastW)


Inflexion point =  1.8297723647996706

Multi-dimensional logistic regression

We now adapt the formulas to the case where $\V{w}=\{w_1,w_2,\ldots,w_n\}$ is a vector; $b$ remains a scalar. The loss $\eqref{eq:logitloss}$ was already written in terms of the vector $\V{w}$ and remains unchanged. The code adaptation is fairly straightforward.


In [131]:
## Normal python implementation

def iterative_sol_nd_logistic_gd(X, y, step, warm_restart=None, b=0, 
                                 epsilon=1e-5, maxiter=10000, verbose=False):
    """
    This function solves the N-d logistic regression problem by a gradient descent approach
    with a fixed stepsize.
    X is a NxM matrix, where N is the dimension of the problem and M the number of samples. 
    y is a binary vector of length M.
    """
    assert len(X.shape) == 2
    N = X.shape[0] ## dimension of the problem
    J = np.zeros(maxiter+1)  

    ## now these are vectors of length N
    if warm_restart is None:
        w = np.zeros(N, dtype=np.double)  ## use np.double directly; DTYPE is only defined in the Cython cell
        b = 0
    else:
        w = warm_restart
        b = b # in the arglist

    if (verbose):
        ## progress report as things can be very slow
        stepiter=int(maxiter/50) ## print 50 dots

    i = 0
    while True:
        z = w.T.dot(X) + b ## should be a row vector of length M
        a = sigma(z) ## same
        ## compute the loss
        J[i] = -(y.T.dot(np.log(a)) + (1-y).T.dot(np.log(1-a)))
        dz = a - y
        dw = (1./N)*X.dot(dz.T) ## should be a vector of length N
        db = (1./N)*np.sum(dz)  ## equivalent to np.ones(y.shape[0]).dot(dz)
        w = w - step*dw
        b = b - step*db
        if ((dw.T.dot(dw)+db*db) < epsilon*epsilon):
            print("OK, iteration=", i)
            break
        i += 1
        if (verbose):
            if (i%stepiter == 0):
                print('.', end='')
        if (i >= maxiter):
            print("Exceeded maximum number of iterations (%d)" % maxiter)
            print("Gradient norm still=", np.sqrt(dw.T.dot(dw)+db*db))
            break
    return (w,b,J,i)

In [125]:
%%cython

## because of the cython environment, we need to redefine the needed functions

import numpy as np
cimport numpy as np
cimport cython



DTYPE = np.double
ctypedef np.double_t DTYPE_t

@cython.boundscheck(False) # turn off bounds-checking for entire function
def citerative_sol_nd_logistic_gd(np.ndarray[DTYPE_t,ndim=2] X, y, step, warm_restart=None, b=0, 
                                 epsilon=1e-5, maxiter=10000, verbose=False):
    """
    This function solves the N-d logistic regression problem by a gradient descent approach
    with a fixed stepsize.
    X is a NxM matrix, where N is the dimension of the problem and M the number of samples. 
    y is a binary vector of length M.
    """
    #assert len(X.shape) == 2
    cdef int i
    cdef int N = X.shape[0] ## dimension of the problem
    cdef np.ndarray[DTYPE_t,ndim=1] J = np.zeros(maxiter+1,dtype=DTYPE)  
    cdef double db = 0
    cdef np.ndarray[DTYPE_t,ndim=1] dz
    cdef np.ndarray[DTYPE_t,ndim=1] dw
    cdef np.ndarray[DTYPE_t,ndim=1] w
    cdef np.ndarray[DTYPE_t,ndim=1] z
    cdef np.ndarray[DTYPE_t,ndim=1] a
    ## now these are vectors of length N
    if warm_restart is None:
        w = np.zeros(N,dtype=DTYPE) 
        b = 0
    else:
        w = warm_restart
        b = b # in the arglist
    

    if (verbose):
        ## progress report as things can be very slow
        stepiter=int(maxiter/50) ## print 50 dots

    i = 0
    while True:
        z = w.T.dot(X) + b ## should be a row vector of length M
        a = 1/(1+np.exp(-z)) ## same
        ## compute the loss
        J[i] = -(y.T.dot(np.log(a)) + (1-y).T.dot(np.log(1-a)))
        dz = a - y
        dw = (1./N)*X.dot(dz.T) ## should be a vector of length N
        db = (1./N)*np.sum(dz)  ## equivalent to np.ones(y.shape[0]).dot(dz)
        w = w - step*dw
        b = b - step*db
        if ((dw.T.dot(dw)+db*db) < epsilon*epsilon):
            print("OK, iteration=", i)
            break
        i += 1
        if (verbose):
            if (i%stepiter == 0):
                print('.', end='')
        if (i >= maxiter):
            print("Exceeded maximum number of iterations (%d)" % maxiter)
            print("Gradient norm still=", np.sqrt(dw.T.dot(dw)+db*db))
            break
    return (w,b,J,i)

2-d illustration

Here we create an artificial example with some Gaussian distributions. In this example, the blue dots are the negative samples and the red dots the positive ones. We apply Logistic Regression to form a decision model separating both populations.


In [132]:
## create two populations
M=100
myT = np.random.randn(M,2)*2+(6,5)
myF = np.random.randn(M,2)*1.5+(3,3)
X=np.concatenate((myF,myT))
y = np.concatenate((np.zeros(M),np.ones(M)))
plt.figure()
plt.scatter(X[0:M,0],X[0:M,1],color="blue")
plt.scatter(X[M:,0],X[M:,1],color="red")
plt.show()



In [134]:
(w,b,J,i) = iterative_sol_nd_logistic_gd(X.T,y,step=2.5e-3,epsilon=1e-7,maxiter=20000)
trueJ = J[0:i] ; lastJ = trueJ[-1]
print("w=",w, " b=", b)
plt.figure()
plt.title("Log10 of the loss convergence")
plt.plot(np.log10(trueJ-lastJ))
plt.show()


OK, iteration= 9419
w= [0.87846647 0.58538669]  b= -6.076119124307143
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: divide by zero encountered in log10
  
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:6: RuntimeWarning: invalid value encountered in log10
  

In [103]:
X.shape


Out[103]:
(200, 2)

Best separation

The "best" separation between the two populations is the line where $a = \V{w}^\top\V{x} + b = 0$. This line defines a unique hyperplane. In 2D, the hyperplane is simply a line, this can be written as

\begin{align} w_1 x_1 + w_2 x_2 + b & = 0\\ x_1 = 0 ; x_2 &= -\frac{b}{w_2} \\ x_2 = 0 ; x_1 &= -\frac{b}{w_1} \end{align}

This line can be interpreted as the line of equi-probability since $\sigma(0) = 0.5$. Said otherwise, points on this line are as likely to belong to the blue or red class, given the training data.


In [104]:
plt.figure()
plt.scatter(X[0:M,0],X[0:M,1],color="blue")
plt.scatter(X[M:,0],X[M:,1],color="red")
plt.plot([(-b/w[0]), 0], [0, (-b/w[1])], 'k-')

plt.show()


Accuracy and test data

What we've just done is generate some (artificial) data and fit a logistic regression model separating two distributions. In real problems these distributions are unknown, but here we know them perfectly since we simulated them. It is interesting to revisit some classification principles and metrics on this example.

The first thing to realize is that we classify as positive the dots for which $\sigma \geq 0.5$, and as negative the remainder. In doing so we make some mistakes, since some blue dots lie beyond the equiprobability line and some red dots lie before it.

There are 4 classes of points: the red dots that are correctly classified as positive (True Positives), the red dots that are incorrectly classified as negative (False Negatives), the blue dots that are incorrectly classified as positive (False Positives), and finally the blue dots that are correctly classified as negative (True Negatives).

The arrangement of these 4 numbers in a matrix is called the confusion matrix:

                              True condition     False condition
Predicted condition positive  True Positives     False Positives
Predicted condition negative  False Negatives    True Negatives

One way to summarize the mistakes made is to compute a few ratios. The two most common in medicine are the sensitivity and the specificity. More specifically,

sensitivity (or true positive rate, TPR) = $\frac{\text{TP}}{\text{TP}+\text{FN}}$ ; specificity (or true negative rate TNR) = $\frac{\text{TN}}{\text{FP}+\text{TN}}$

Overall accuracy is $\text{accuracy} = \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$; for a balanced sample such as this one, it is also the average of the two ratios above. Generally, the higher the accuracy the better, but in some applications it is preferable to trade false negatives against false positives, or vice versa.
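
To make these formulas concrete, here is a small helper (an illustrative sketch) computing the three ratios from the four counts; the example numbers correspond to the sensitivity and specificity reported below for the 2-d illustration (TP=88, FP=11, FN=12, TN=89).

In [ ]:
def rates(TP, FP, FN, TN):
    """Sensitivity (TPR), specificity (TNR) and accuracy from the confusion matrix counts."""
    sensitivity = TP / (TP + FN)
    specificity = TN / (FP + TN)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    return sensitivity, specificity, accuracy

print(rates(88, 11, 12, 89))  ## -> (0.88, 0.89, 0.885)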


In [105]:
def logit_test(x):
    return(sigma(w.dot(x.T)+b))

def classif_results(X, cut=0.5):
    """ Returns the confusion matrix of a balanced logistic regression
    """
    M = int(X.shape[0]/2) ## assuming as many negatives as positives, as is the case here
    res = logit_test(X)
    neg = res[0:M] ; pos = res[M:(2*M)]
    TP = pos[pos >= cut].shape[0] ## red dots correctly classified as red
    FN = M-TP
    TN = neg[neg < cut].shape[0]  ## blue dots correctly classified as blue
    FP = M-TN
    return((TP,FP,FN,TN))

TP,FP,FN,TN = classif_results(X)
print("Sensitivity = ", TP/(TP+FN), " Specificity = ", TN/(FP+TN))


Sensitivity =  0.88  Specificity =  0.89

With the default cutoff at 0.5, the two ratios are about the same. We can see how these numbers change as we vary the cutoff. To display this compactly, we plot them in a scatter plot called the Receiver Operating Characteristic (ROC). Typically the abscissa is the false positive rate (or 1 - specificity) and the ordinate is the true positive rate (or sensitivity).


In [106]:
v=np.arange(10,-5,step=-0.2)
roc = np.zeros((v.shape[0],2))
bestacc = 0
besti = 0
for i in range(0,v.shape[0]):
    cut = sigma(v[i])
    TP,FP,FN,TN = classif_results(X,cut)
    roc[i,0]=1-(TN/(FP+TN))
    roc[i,1]=TP/(TP+FN)
    acc = (TP+TN)/(TP+TN+FN+FP)
    if (acc > bestacc):
        bestacc = acc
        besti = i
    
plt.figure()
plt.plot(roc[:,0],roc[:,1])
plt.scatter(roc[:,0],roc[:,1],color='red')
plt.scatter(roc[besti,0],roc[besti,1],color='green')
plt.title("Receiver Operating Characteristic")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")

print("Best accuracy (green dot) = ", bestacc, ", for FPR=%.3g"%roc[besti,0]," TPR=%.3g"%roc[besti,1], ", i.e. cutoff=%g"%sigma(v[besti]))


Best accuracy (green dot) =  0.905 , for FPR=0.15  TPR=0.96 , i.e. cutoff=0.310026

Remarks:

As can be seen, the best cutoff in terms of accuracy is not 0.5 but close to 0.3 in this example. This is because logistic regression does not try to optimize accuracy. However, choosing a cutoff other than 0.5 introduces a bias in the method, which is often undesirable. This is an early example of the unavoidable bias-variance tradeoff present in all machine learning methods, which will merit consideration in a later course.

Applying logistic regression to a real problem

Here we apply our logistic regression approach to a real problem, the Wisconsin Breast Cancer dataset. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.


In [135]:
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
print("Dataset keys=", cancer.keys())
print("Dataset features = ", cancer['feature_names'], "\n There are", len(cancer['feature_names']), "features")


Dataset keys= dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])
Dataset features =  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension'] 
 There are 30 features

In [136]:
import pandas

data = np.c_[cancer.data, cancer.target]
columns = np.append(cancer.feature_names, ["target"])
frame = pandas.DataFrame(data, columns=columns)

We show a few data points below. Overall there are 569 cases. The last column is the target, i.e. what the model should predict: in this dataset, 0 is malignant and 1 is benign.


In [137]:
print("Number of cases = ", len(frame))
frame[1:10]


Number of cases =  569
Out[137]:
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension target
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0.0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0.0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0.0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0.0
5 12.45 15.70 82.57 477.1 0.12780 0.17000 0.15780 0.08089 0.2087 0.07613 ... 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440 0.0
6 18.25 19.98 119.60 1040.0 0.09463 0.10900 0.11270 0.07400 0.1794 0.05742 ... 27.66 153.20 1606.0 0.1442 0.2576 0.3784 0.1932 0.3063 0.08368 0.0
7 13.71 20.83 90.20 577.9 0.11890 0.16450 0.09366 0.05985 0.2196 0.07451 ... 28.14 110.60 897.0 0.1654 0.3682 0.2678 0.1556 0.3196 0.11510 0.0
8 13.00 21.82 87.50 519.8 0.12730 0.19320 0.18590 0.09353 0.2350 0.07389 ... 30.73 106.20 739.3 0.1703 0.5401 0.5390 0.2060 0.4378 0.10720 0.0
9 12.46 24.04 83.97 475.9 0.11860 0.23960 0.22730 0.08543 0.2030 0.08243 ... 40.68 97.65 711.4 0.1853 1.0580 1.1050 0.2210 0.4366 0.20750 0.0

9 rows × 31 columns

Training and testing subsets

We split the data set into training and test subsets. This is very common, but it is not the only way to check that the classifier generalizes correctly.
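
For instance, k-fold cross-validation is a common alternative; a minimal sketch using scikit-learn's cross_val_score on the same data (assuming the frame defined above, with the choice of classifier and cv=5 purely illustrative) could look like this:

In [ ]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

## 5-fold cross-validation: each sample is used for testing exactly once
scores = cross_val_score(LogisticRegression(max_iter=10000),
                         frame[frame.columns[:-1]], frame.target, cv=5)
print("Accuracy per fold:", scores)
print("Mean accuracy    :", scores.mean())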


In [138]:
from sklearn.model_selection import train_test_split

X = frame[frame.columns[:-1]]
y = frame.target
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   random_state = 0)
print("Number of training samples = ", len(y_train), ", number of test samples =", len(y_test))
X_train_matrix=X_train.values ## much faster if we work on plain NumPy arrays (as_matrix() is deprecated)
y_train_vector=y_train.values ## same


Number of training samples =  426 , number of test samples = 143

Running the algorithm

We now run the algorithm we wrote earlier. Unfortunately in this case, we perform gradient descent in a high-dimensional space and it can be very slow. In order to improve things somewhat we perform a "warm restart" of the algorithm, which allows us to increase the step size. This is still very slow. In subsequent work, we will see how to improve this aspect of the algorithm.


In [141]:
import time

(w,b,J,i) = iterative_sol_nd_logistic_gd(X_train_matrix.T,y_train_vector,step=1e-6,epsilon=1e-5,maxiter=40000)
print("w0=",w, " b0=", b)
# WR1
tic=time.time()
(w,b,J,i) = citerative_sol_nd_logistic_gd(X_train_matrix.T,y_train_vector, 
                                          warm_restart=w,b=b,
                                          step=2e-6,epsilon=1e-5,maxiter=4000000,verbose=True)
toc=time.time()
elapsed = toc - tic
print("Elapsed=",elapsed,"s")
trueJ = J[0:i] ; lastJ = trueJ[-1]
print("w=",w, " b=", b)
plt.figure()
plt.title("Log10 of the loss convergence")
plt.plot(np.log10(trueJ-lastJ))
plt.show()


Exceeded maximum number of iterations (40000)
Gradient norm still= 1.9193838173945408
w0= [ 3.06567553e-02 -1.52891420e-02  1.22433171e-01  1.58078414e-02
 -8.86997130e-05 -1.77312927e-03 -2.73969369e-03 -1.14441339e-03
 -2.44585392e-04  7.57721717e-05  3.58658290e-04  1.07009528e-03
 -6.44874693e-03 -4.64621239e-02 -1.02928744e-05 -3.66632760e-04
 -4.88267219e-04 -1.36133014e-04 -6.61222314e-05 -1.85556003e-05
  3.19579088e-02 -4.67171997e-02  6.25284633e-02 -3.17632588e-02
 -3.71902458e-04 -6.26194848e-03 -8.02338367e-03 -2.32126183e-03
 -1.26220914e-03 -3.78065951e-04]  b0= 0.0046488588074805
..................................................Exceeded maximum number of iterations (4000000)
Gradient norm still= 0.17012074218707512
Elapsed= 230.7371950149536 s
w= [ 1.15403853  0.08805253  0.29958839 -0.0126695  -0.05332953 -0.20547086
 -0.2933679  -0.13206    -0.10751195 -0.01258522  0.04868811  0.5181866
  0.17881257 -0.10679954 -0.00332276 -0.02742737 -0.04938617 -0.01600683
 -0.01762243 -0.00119852  1.17566286 -0.25581802 -0.32501    -0.01552529
 -0.09297941 -0.63638425 -0.8475159  -0.2645346  -0.26864782 -0.06395347]  b= 0.22004854906823415
/usr/local/lib/python3.6/site-packages/ipykernel_launcher.py:17: RuntimeWarning: divide by zero encountered in log10

In [142]:
trueJ


Out[142]:
array([77.26393036, 77.26370933, 77.26348831, ..., 45.1865083 ,
       45.18650656, 45.18650483])

Classification results on the real data

Even though convergence is not achieved, we use the weights we have learned to perform the classification and see how we fare.


In [143]:
def confusion(w,b,X,y):
    z = w.T.dot(X.T) + b
    prob = (sigma(z) >= 0.5).astype(int) ## prediction
    trueP = (y == 1)
    ptp = prob[trueP]
    TP=np.sum(ptp) ## works because the outcomes are binary
    FN=np.sum(1-ptp)
    trueN = (y == 0)
    ptn = prob[trueN]
    TN = np.sum(1-ptn)
    FP = np.sum(ptn)
    return((TP,FP,FN,TN))


print("Training set:")
print("Confusion matrix=")
TP,FP,FN,TN=confusion(w,b,X_train_matrix,y_train_vector)
print(np.array([[TP,FP],[FN,TN]]))
print("Sensitivity = %.3g"% (TP/(TP+FN)), " Specificity = %.3g"% (TN/(FP+TN)))

print("Test set: ")
X_test_matrix=X_test.values ## convert to plain NumPy arrays, as for the training set
y_test_vector=y_test.values ## same
TP,FP,FN,TN=confusion(w,b,X_test_matrix,y_test_vector)
print(np.array([[TP,FP],[FN,TN]]))
print("Sensitivity = %.3g"% (TP/(TP+FN)), " Specificity = %.3g"% (TN/(FP+TN)))


Training set:
Confusion matrix=
[[260  10]
 [  7 149]]
Sensitivity = 0.974  Specificity = 0.937
Test set: 
[[85  1]
 [ 5 52]]
Sensitivity = 0.944  Specificity = 0.981

Not bad! Recall that in this dataset the positive class (1) is benign. A sensitivity of 94.4% on the test set means we correctly identify 94.4% of benign cases, while a specificity of 98.1% means that only 1.9% (100% - 98.1%) of malignant cases are misclassified as benign: a single missed cancer in this test set. We also notice that the results on the test set are not very different from those on the training set. This is a feature of low-variance classifiers such as logistic regression: they tend not to overfit the training data.

Professional version of logistic regression

Scikit-learn provides an efficient version of logistic regression that converges quickly. Let's use that.


In [144]:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification, make_blobs
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.datasets import load_breast_cancer

# Load the cancer data
(X_cancer, y_cancer) = load_breast_cancer(return_X_y = True)
X_train_pro, X_test_pro, y_train_pro, y_test_pro = train_test_split(X_cancer, y_cancer,
                                                   random_state = 0)
# Fit the logistic regression classifier
clf = LogisticRegression().fit(X_train_pro, y_train_pro)

In [145]:
print(clf.coef_[0])
print(clf.intercept_)


[ 1.72213733  0.0898126   0.10599231 -0.00713483 -0.12840879 -0.33349496
 -0.49935642 -0.26507865 -0.26760624 -0.02149464  0.03647235  0.98652964
  0.11708904 -0.10871701 -0.00796627  0.01056313 -0.02918485 -0.02818257
 -0.03431327  0.00856821  1.35830906 -0.28904236 -0.24983316 -0.02012386
 -0.21696407 -1.02745376 -1.44794323 -0.53377399 -0.64855852 -0.10913297]
[0.35462971]

In [146]:
# Compute and print the accuracy scores
print('Accuracy of Logistic regression classifier on training set: {:.2f}'
     .format(clf.score(X_train_pro, y_train_pro)))
print('Accuracy of Logistic regression classifier on test set: {:.2f}'
     .format(clf.score(X_test_pro, y_test_pro)))
y_predicted=clf.predict(X_test_pro)
# Compute and print confusion matrix
confusion = confusion_matrix(y_test_pro, y_predicted)
print(confusion)
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
print('Accuracy: {:.3f}'.format(accuracy_score(y_test_pro, y_predicted)))
print('Precision: {:.3f}'.format(precision_score(y_test_pro, y_predicted)))
print('Recall: {:.3f}'.format(recall_score(y_test_pro, y_predicted)))
print('F1: {:.3f}'.format(f1_score(y_test_pro, y_predicted)))


Accuracy of Logistic regression classifier on training set: 0.96
Accuracy of Logistic regression classifier on test set: 0.96
[[52  1]
 [ 5 85]]
Accuracy: 0.958
Precision: 0.988
Recall: 0.944
F1: 0.966

The results are exactly the same! (Note that scikit-learn's confusion_matrix lists class 0 first, so the matrix is permuted with respect to ours.)

Conclusion

In this series of slides we have introduced the perceptron, the simplest instance of a neural network: a simulation of a single neuron. We have seen that it is exactly equivalent to logistic regression, a classical machine learning technique, and we have shown how to train it via gradient descent. This technique is not very complicated, but it is not very efficient either. Nevertheless, in spite of its simplicity, it can achieve very good results. Future work will involve cascades of neurons arranged in artificial neural networks (ANNs), but will reuse much of the same machinery.


In [ ]: