Introduction to Logistic Regression

Scott Hendrickson (2014-03-20)

Classification:

  • Learn classification of vectors from labeled training data
  • Vectors of N dimensions
  • Labels: {0,1}
  • Generalize to predict class of new vectors

In [2]:
import numpy as np
import pandas as pd 
from ggplot import *
%matplotlib inline

Simple 1-D Example


In [13]:
dat_1d = {"x"    :[2, 2, 1, 0.25, -0.4, -1, -2, -2, 1], 
       "label":[0, 0, 0,    0,    1,  1,  1,  1, 1]}
df_1d = pd.DataFrame.from_dict(data = dat_1d)
df_1d


Out[13]:
label x
0 0 2.00
1 0 2.00
2 0 1.00
3 0 0.25
4 1 -0.40
5 1 -1.00
6 1 -2.00
7 1 -2.00
8 1 1.00

9 rows × 2 columns


In [14]:
df_1d["label_fact"] = pd.Categorical.from_array([str(x) for x in df_1d.label])
ggplot(aes(x="x",y="label", color="label_fact"), data=df_1d) + geom_point()


Out[14]:
<ggplot: (287567401)>

Sigmoid Function

Properties of function of distance from boundary:

  • Continuous
  • Domain of all reals
  • Range [0,1]
  • Monotonically increasing and differentiable ($\frac{d s(x)}{dx} > 0 \text{ } \forall \text{ } x$)

$s(x) = \frac{1}{1 + \exp(-x)}$


In [15]:
def sigmoid(x):
    # vector in, vector out
    return 1./(1. + np.exp(-x))

# Let's see it
d_sig = pd.DataFrame.from_dict({"x": np.linspace(-8,8,30), 
                                "y": sigmoid(np.linspace(-8,8,30))})
ggplot(aes(x="x", y="y"), data = d_sig) + geom_line(color="orange")


Out[15]:
<ggplot: (287556145)>
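The derivative of the sigmoid is $s'(x) = s(x)[1 - s(x)]$, which is positive for every $x$, so the function is monotonically increasing. A quick numeric check of that claim (the grid and step size below are arbitrary):

In [ ]:
# Check ds/dx = s(x)*(1 - s(x)) > 0 everywhere with a central finite difference
xs = np.linspace(-8, 8, 30)
h = 1e-5
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2. * h)
analytic = sigmoid(xs) * (1. - sigmoid(xs))
print(np.allclose(numeric, analytic))   # True: derivative matches s(x)*(1 - s(x))
print((numeric > 0).all())              # True: monotonically increasing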

Model

The model predicts a probability $y(x) \in [0,1]$ for the label using the sigmoid. The argument to the sigmoid is calculated from the input data, e.g.:

For 1-D input vectors:

$z = \beta_0 + \beta_1 x$

Or 2-D input vectors:

$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

$\ldots$

Model prediction in 1-D:

$y(x) = s(\beta_0 + \beta_1 x) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x))}$
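As a concrete illustration, here is the 1-D model evaluated for a couple of made-up parameter values (these $\beta$s are assumptions for the example, not a fit):

In [ ]:
# Illustrative (not fitted) parameters: y(x) = sigmoid(beta_0 + beta_1*x)
beta = [1.0, -2.0]                       # assumed values for illustration only
xs = np.array([-2., 0., 0.5, 2.])
print(sigmoid(beta[0] + beta[1] * xs))   # above 0.5 for x < 0.5, exactly 0.5 at x = 0.5 (z = 0), below 0.5 for x > 0.5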

Cost Function

We need a cost function whose minimization gives the boundary, the N-dimensional plane defined by $[\beta_0, \beta_1, \ldots ]$.

Least Squares?

$Cost = \frac{1}{m} \sum{\frac{1}{2}(l(x) - y(x))^2}$

This is the cost when the model prediction is $y(x)$ and the actual label is $l(x)$. With the sigmoid inside the square, this function is not convex in the parameters, so local minima are a challenge.

Why is this a problem?

Local minima mean gradient descent may not find the global optimum. We would like a convex cost function so that gradient descent is guaranteed to converge to the global minimum.
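One way to see this is to scan the least-squares cost over a grid of $\beta_1$ values for the 1-D toy data above (a sketch; fixing $\beta_0 = 0$ and the grid range are arbitrary simplifications). Because the sigmoid saturates, the cost flattens out for large $|\beta_1|$, so the gradient can vanish far from a good solution and there is no convexity guarantee:

In [ ]:
# Scan the squared-error cost over beta_1 (beta_0 fixed at 0) on the 1-D toy data
b1_grid = np.linspace(-10, 10, 201)
ls_cost = [np.mean(0.5 * (df_1d.label - sigmoid(b1 * df_1d.x))**2) for b1 in b1_grid]
d_ls = pd.DataFrame.from_dict({"b1": b1_grid, "cost": ls_cost})
ggplot(aes(x="b1", y="cost"), data=d_ls) + geom_line(color="red")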

Logistic Regression Cost Function

For model y(x):

$C = -[1-l(x)]\ln[1-y(x)] - l(x) \ln[y(x)]$

$C = -[\text{cost for label 0}] - [\text{cost for label 1}]$


In [17]:
def cost(label_arr, model_arr):
    # inputs are np.array([])
    # average over the training set
    c = -(1. - label_arr)*np.log(1. - model_arr) - label_arr*np.log(model_arr)
    return np.sum(c)/len(c)
print "Model  ~ Label: {:.4}".format(cost(np.array([0,0,0,1,1,1]), np.array([0.1,0.2,0.05,0.99,0.98,0.99])))
print "Model !~ Label: {:.4}".format(cost(np.array([0,0,0,1,1,1]), np.array([0.99,0.98,0.99,0.1,0.2,0.05])))


Model  ~ Label: 0.07002
Model !~ Label: 3.338

In [18]:
def model_1d(model_vec, df):
    # Calculate model output
    z = model_vec[0] + model_vec[1] * df.x
    return sigmoid(z)
    
def model_cost(model_vec, df, model):
    # model_vec is python vector
    # df is data frame with column x and column label
    # model is a function to calculate the model output
    return cost(df.label, model(model_vec, df))

In [19]:
from scipy.optimize import minimize as mini
res_1d = mini(model_cost, x0=[0.1,0.1], args=(df_1d, model_1d)) 
print (res_1d)


   status: 0
  success: True
     njev: 13
     nfev: 52
 hess_inv: array([[ 14.24001119,  -8.10781827],
       [ -8.10781827,  14.28838055]])
      fun: 0.3234494073359467
        x: array([ 0.92795283, -1.97511698])
  message: 'Optimization terminated successfully.'
      jac: array([  3.44961882e-06,   4.65661287e-06])
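The fitted parameters define the 1-D decision boundary where $z = \beta_0 + \beta_1 x = 0$, i.e. $x = -\beta_0/\beta_1$ (about 0.47 for the fit above); the model output crosses 0.5 there:

In [ ]:
# Decision boundary in 1-D: z = beta_0 + beta_1*x = 0  =>  x = -beta_0/beta_1
x_boundary = -res_1d.x[0] / res_1d.x[1]
print("boundary at x = {:.3}".format(x_boundary))
print("P(label=1) at boundary: {:.3}".format(sigmoid(res_1d.x[0] + res_1d.x[1] * x_boundary)))  # 0.5 by construction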

In [20]:
df_pred_1d = pd.DataFrame.from_dict({"x": np.linspace(-3,3,99)})
df_pred_1d["y"] = model_1d(res_1d.x, df_pred_1d)
df_pred_1d.head()


Out[20]:
x y
0 -3.000000 0.998945
1 -2.938776 0.998810
2 -2.877551 0.998657
3 -2.816327 0.998485
4 -2.755102 0.998290

5 rows × 2 columns


In [21]:
ggplot(aes(x="x",y="label", color="label_fact"), data=df_1d) + \
  geom_point() + \
  geom_line(aes(x="x", y="y"), color="green", data=df_pred_1d)


Out[21]:
<ggplot: (282128693)>

Example - 2D


In [22]:
dat_2d = {   "x1":[-1.0, 0.3, 2.05, 0.95,  2,-1.1,-2.4,-0.9,-2.1], 
             "x2":[ 2.0, 0.6,  2.1,  2.1, -1,-1  ,-1.8,  -2, 0.0], 
       "label":[0,0,0,0,0,1,1,1,1]}
df_2d = pd.DataFrame.from_dict(data = dat_2d)
df_2d["label_fact"] = pd.Categorical.from_array([str(x) for x in df_2d.label])
df_2d


Out[22]:
label x1 x2 label_fact
0 0 -1.00 2.0 0
1 0 0.30 0.6 0
2 0 2.05 2.1 0
3 0 0.95 2.1 0
4 0 2.00 -1.0 0
5 1 -1.10 -1.0 1
6 1 -2.40 -1.8 1
7 1 -0.90 -2.0 1
8 1 -2.10 0.0 1

9 rows × 4 columns


In [23]:
ggplot(aes(x="x1",y="x2", color="label_fact"), data=df_2d) + geom_point()


Out[23]:
<ggplot: (287556169)>

In [24]:
def model_2d(model_vec, df):
    z = model_vec[0] + model_vec[1] * df.x1 + model_vec[2] * df.x2
    return sigmoid(z)

In [25]:
res_2d = mini(model_cost, x0=[1,1,0.5], args=(df_2d, model_2d)) 
print (res_2d)


   status: 0
  success: True
     njev: 19
     nfev: 95
 hess_inv: array([[ 26585.66512643,  35936.64997437,  28285.71275239],
       [ 35936.64997437,  48579.4332378 ,  38236.01146309],
       [ 28285.71275239,  38236.01146309,  30096.51214756]])
      fun: 4.171345112083083e-06
        x: array([-5.27977744, -8.66641472, -7.1442268 ])
  message: 'Optimization terminated successfully.'
      jac: array([  1.12009656e-06,   1.18361356e-07,   5.67908933e-06])

In [27]:
# z = 0 defines a line
x1 = np.linspace(-3,3,27)
x2 = -(res_2d.x[0] + x1*res_2d.x[1])/res_2d.x[2]
df_pred_2d = pd.DataFrame.from_dict({ "x1": x1, 
                                      "x2": x2  })
df_pred_2d["y"] = model_2d(res_2d.x, df_pred_2d)
df_pred_2d.head()


Out[27]:
x1 x2 y
0 -3.000000 2.900169 0.5
1 -2.769231 2.620231 0.5
2 -2.538462 2.340293 0.5
3 -2.307692 2.060355 0.5
4 -2.076923 1.780417 0.5

5 rows × 3 columns


In [29]:
ggplot(aes(x="x1", y="x2"),data=df_pred_2d) + geom_line() + geom_point(aes(x="x1",y="x2", color="label_fact"), data=df_2d)


Out[29]:
<ggplot: (287531813)>

In [31]:
from scipy.stats import norm
# Make some data
npts = 65
x1 = np.concatenate((norm.rvs(loc=2, scale=1.5,   size=npts),  norm.rvs(loc=-2, scale=1.5, size=npts)))
x2 = np.concatenate((norm.rvs(loc=2, scale=1.5, size=npts),  norm.rvs(loc=-1, scale=1.5,   size=npts)))
lab = [0]*npts + [1]*npts
# Plots and fits
dat_gaus = {  "x1": x1, 
              "x2": x2 , 
           "label": lab }
df_2d_gaus = pd.DataFrame.from_dict(data = dat_gaus)
df_2d_gaus["label_fact"] = pd.Categorical.from_array([str(x) for x in df_2d_gaus.label])
# Fit
res_2d_gaus = mini(model_cost, x0=[0.1,1.2,0.5], args=(df_2d_gaus, model_2d))
# z = 0 defines a line
x1 = np.linspace(-4, 4, 27)
x2 = -(res_2d_gaus.x[0] + x1*res_2d_gaus.x[1])/res_2d_gaus.x[2]
print res_2d_gaus
df_2d_gaus_pred = pd.DataFrame.from_dict({"x1": x1, "x2": x2})
df_2d_gaus_pred["y"] = model_2d(res_2d_gaus.x, df_2d_gaus)
ggplot(aes(x="x1", y="x2"), data=df_2d_gaus_pred) + \
     geom_line(color="red") + \
     geom_point(aes(x="x1",y="x2", color="label_fact"), data=df_2d_gaus)


   status: 0
  success: True
     njev: 25
     nfev: 125
 hess_inv: array([[  62.21889931,  -38.26727073,  -24.90246783],
       [ -38.26727073,  136.80563705,   59.50383824],
       [ -24.90246783,   59.50383824,   68.78974595]])
      fun: 0.060892790277256276
        x: array([ 1.06422726, -2.93912367, -2.09587989])
  message: 'Optimization terminated successfully.'
      jac: array([  2.24681571e-06,  -3.12458724e-07,   1.49244443e-06])
Out[31]:
<ggplot: (287566385)>
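To use the fit as a classifier, threshold the model output at 0.5 and compare with the labels. Note this is training-set accuracy only; see the end notes on holding out a test set:

In [ ]:
# Classify by thresholding P(label=1) at 0.5 and score against the training labels
pred = (model_2d(res_2d_gaus.x, df_2d_gaus) > 0.5).astype(int)
print("training accuracy: {:.3}".format(np.mean(pred.values == df_2d_gaus.label.values)))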

In [34]:
def model_quad(model_vec, df):
    # quadratic in x1: z = b0 + b1*x2 + b2*x1^2 + b3*x1
    z = model_vec[0] + model_vec[1]*df.x2 + model_vec[2]*df.x1*df.x1 + model_vec[3]*df.x1
    return sigmoid(z)
# Make some data
npts = 100
x1 = np.concatenate((norm.rvs(loc=-3,scale=1, size=npts), 
                     norm.rvs(loc=0, scale=1, size=npts),
                     norm.rvs(loc=0, scale=1, size=npts),
                     norm.rvs(loc=3, scale=1, size=npts)))
x2 = np.concatenate((norm.rvs(loc=4, scale=6, size=npts), 
                     norm.rvs(loc=-5, scale=2, size=npts),
                     norm.rvs(loc=6, scale=2, size=npts),
                     norm.rvs(loc=4, scale=6, size=npts)))
lab = [0]*npts + [1]*npts + [0]*npts + [0]*npts
# Plots and fits
dat_quad = {    "x1": x1, 
                "x2": x2, 
             "label": lab }
df_quad = pd.DataFrame.from_dict(data = dat_quad)
df_quad["label_fact"] = pd.Categorical.from_array([str(x) for x in df_quad.label])
# Fit
res_quad = mini(model_cost, x0=[0.5, 1.2, -1, 2], args=(df_quad, model_quad))
print res_quad
# z = 0 defines the boundary curve (a parabola in x1)
x1 = np.linspace(-3,3,21)
x2 = -(res_quad.x[0] + res_quad.x[3]*x1 + res_quad.x[2]*x1*x1)/res_quad.x[1]
df_quad_pred = pd.DataFrame.from_dict({"x1": x1, "x2": x2})
df_quad_pred["y"] = model_2d(res_quad.x, df_quad)
ggplot(aes(x="x1", y="x2"), data=df_quad_pred) + \
     geom_line(color="red") + \
     geom_point(aes(x="x1",y="x2", color="label_fact"), data=df_quad)


   status: 0
  success: True
     njev: 33
     nfev: 198
 hess_inv: array([[ 81.81109734,   0.46931549, -19.57898876,   3.00398338],
       [  0.46931549,   4.06837566,   5.65185489,  -2.85425827],
       [-19.57898876,   5.65185489,  18.31695346,  -9.71437105],
       [  3.00398338,  -2.85425827,  -9.71437105,  28.75195032]])
      fun: 0.08494712883483503
        x: array([ 0.41457294, -0.78381036, -1.17937652,  0.32299065])
  message: 'Optimization terminated successfully.'
      jac: array([ -3.28756869e-07,  -1.57207251e-06,  -4.22820449e-07,
        -4.17232513e-07])
Out[34]:
<ggplot: (287541013)>

EndNotes:

Follow-up ideas that are important to success with machine learning:

  • Split the data into a training set and a test set; hold out the test set and measure error on both sets during training to estimate the point where generalization breaks down due to over-fitting. This may mean you don't use a canned optimizer as I did in these demos.
  • Dimensionality: as you add dimensions to your input vector, it gets harder to fill the space with data ($Volume \propto d^{N_d}$ for linear size $d$ and $N_d$ dimensions). This means adding another dimension requires a lot more data!
  • Regularization helps limit the complexity of the model. Add a term like $C = \ldots + \lambda \sum{\beta_i}^2$ to penalize having many large model parameters $\beta$ in the model (see the sketch after this list).
  • Don't use this code, use scikit-learn! (A minimal example follows below.)
  • Look at Bayesian approaches that give you distributions of separating surfaces.
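A minimal sketch of the regularization idea, reusing model_cost from above; the penalty weight lam and the choice to leave the intercept unpenalized are illustrative assumptions (in practice, tune lam on held-out data):

In [ ]:
# L2-regularized cost: original cost plus a penalty on the size of the betas (intercept excluded)
def model_cost_reg(model_vec, df, model, lam=0.1):
    penalty = lam * np.sum(np.array(model_vec[1:])**2)
    return model_cost(model_vec, df, model) + penalty

res_reg = mini(model_cost_reg, x0=[1, 1, 0.5], args=(df_2d_gaus, model_2d, 0.1))
print(res_reg.x)  # betas are pulled toward zero relative to the unregularized fit

And the scikit-learn equivalent, which applies L2 regularization by default (strength controlled by the C parameter):

In [ ]:
from sklearn.linear_model import LogisticRegression
X = df_2d_gaus[["x1", "x2"]].values
y = df_2d_gaus.label.values
clf = LogisticRegression()            # L2 penalty by default; smaller C means stronger regularization
clf.fit(X, y)
print(clf.intercept_, clf.coef_)      # compare with the betas from the hand-rolled fit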
