Introduction to Logistic Regression

Scott Hendrickson (2014-03-20)

Classification:

  • Learn classification of vectors from labeled training data
  • Vectors of N dimensions
  • Labels: {0,1}
  • Generalize to predict class of new vectors

In [2]:
import numpy as np
import pandas as pd 
from ggplot import *
%matplotlib inline

Simple 1-D Example


In [13]:
dat_1d = {"x"    :[2, 2, 1, 0.25, -0.4, -1, -2, -2, 1], 
       "label":[0, 0, 0,    0,    1,  1,  1,  1, 1]}
df_1d = pd.DataFrame.from_dict(data = dat_1d)
df_1d


Out[13]:
label x
0 0 2.00
1 0 2.00
2 0 1.00
3 0 0.25
4 1 -0.40
5 1 -1.00
6 1 -2.00
7 1 -2.00
8 1 1.00

9 rows × 2 columns


In [14]:
df_1d["label_fact"] = pd.Categorical.from_array([str(x) for x in df_1d.label])
ggplot(aes(x="x",y="label", color="label_fact"), data=df_1d) + geom_point()


Out[14]:
<ggplot: (287567401)>

Sigmoid Function

Properties of function of distance from boundary:

  • Continuous
  • Domain of all reals
  • Range [0,1]
  • Monotonically increasing and differentiable ($\frac{d s(x)}{dx} > 0 \text{ } \forall \text{ } x$)

$s(x) = \frac{1}{1 + \exp(-x)}$


In [15]:
def sigmoid(x):
    # vector in, vector out
    return 1./(1. + np.exp(-x))

# Let's see it
d_sig = pd.DataFrame.from_dict({"x": np.linspace(-8,8,30), 
                                "y": sigmoid(np.linspace(-8,8,30))})
ggplot(aes(x="x", y="y"), data = d_sig) + geom_line(color="orange")


Out[15]:
<ggplot: (287556145)>
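The derivative of the sigmoid is $s'(x) = s(x)[1 - s(x)]$, which is positive for every $x$, so the function is monotonically increasing. A quick numeric check of that claim (the grid and step size below are arbitrary):

In [ ]:
# Check ds/dx = s(x)*(1 - s(x)) > 0 everywhere with a central finite difference
xs = np.linspace(-8, 8, 30)
h = 1e-5
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2. * h)
analytic = sigmoid(xs) * (1. - sigmoid(xs))
print(np.allclose(numeric, analytic))   # True: derivative matches s(x)*(1 - s(x))
print((numeric > 0).all())              # True: monotonically increasing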

Model

The model predicts a probability $y(x) \in [0,1]$ for the label using the sigmoid. The argument to the sigmoid is calculated from the input data, e.g.:

For 1-D input vectors:

$z = \beta_0 + \beta_1 x$

Or 2-D input vectors:

$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2$

$\ldots$

Model prediction in 1-D:

$y(x) = s(\beta_0 + \beta_1 x) = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x))}$
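As a concrete illustration, here is the 1-D model evaluated for a couple of made-up parameter values (these $\beta$s are assumptions for the example, not a fit):

In [ ]:
# Illustrative (not fitted) parameters: y(x) = sigmoid(beta_0 + beta_1*x)
beta = [1.0, -2.0]                       # assumed values for illustration only
xs = np.array([-2., 0., 0.5, 2.])
print(sigmoid(beta[0] + beta[1] * xs))   # above 0.5 for x < 0.5, exactly 0.5 at x = 0.5 (z = 0), below 0.5 for x > 0.5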

Cost Function

We need a cost function whose minimization gives the boundary, the N-dimensional plane defined by $[\beta_0, \beta_1, \ldots ]$.

Least Squares?

$Cost = \frac{1}{m} \sum{\frac{1}{2}(l(x) - y(x))^2}$

This is the cost when the model prediction is $y(x)$ and the actual label is $l(x)$. With the sigmoid inside the square, this function is not convex in the parameters, so local minima are a challenge.

Why is this a problem?

Local minima mean gradient descent may not find the global optimum. We would like a convex cost function so that gradient descent is guaranteed to converge to the global minimum.
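One way to see this is to scan the least-squares cost over a grid of $\beta_1$ values for the 1-D toy data above (a sketch; fixing $\beta_0 = 0$ and the grid range are arbitrary simplifications). Because the sigmoid saturates, the cost flattens out for large $|\beta_1|$, so the gradient can vanish far from a good solution and there is no convexity guarantee:

In [ ]:
# Scan the squared-error cost over beta_1 (beta_0 fixed at 0) on the 1-D toy data
b1_grid = np.linspace(-10, 10, 201)
ls_cost = [np.mean(0.5 * (df_1d.label - sigmoid(b1 * df_1d.x))**2) for b1 in b1_grid]
d_ls = pd.DataFrame.from_dict({"b1": b1_grid, "cost": ls_cost})
ggplot(aes(x="b1", y="cost"), data=d_ls) + geom_line(color="red")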

Logistic Regression Cost Function

For model y(x):

$C = -[1-l(x)]\ln[1-y(x)] - l(x) \ln[y(x)]$

$C = -[\text{cost for label 0}] - [\text{cost for label 1}]$


In [17]:
def cost(label_arr, model_arr):
    # inputs are np.array([])
    # average over the training set
    c = -(1. - label_arr)*np.log(1. - model_arr) - label_arr*np.log(model_arr)
    return np.sum(c)/len(c)
print "Model  ~ Label: {:.4}".format(cost(np.array([0,0,0,1,1,1]), np.array([0.1,0.2,0.05,0.99,0.98,0.99])))
print "Model !~ Label: {:.4}".format(cost(np.array([0,0,0,1,1,1]), np.array([0.99,0.98,0.99,0.1,0.2,0.05])))


Model  ~ Label: 0.07002
Model !~ Label: 3.338

In [18]:
def model_1d(model_vec, df):
    # Calculate model output
    z = model_vec[0] + model_vec[1] * df.x
    return sigmoid(z)
    
def model_cost(model_vec, df, model):
    # model_vec is python vector
    # df is data frame with column x and column label
    # model is a function to calculate the model output
    return cost(df.label, model(model_vec, df))

In [19]:
from scipy.optimize import minimize as mini
res_1d = mini(model_cost, x0=[0.1,0.1], args=(df_1d, model_1d)) 
print (res_1d)


   status: 0
  success: True
     njev: 13
     nfev: 52
 hess_inv: array([[ 14.24001119,  -8.10781827],
       [ -8.10781827,  14.28838055]])
      fun: 0.3234494073359467
        x: array([ 0.92795283, -1.97511698])
  message: 'Optimization terminated successfully.'
      jac: array([  3.44961882e-06,   4.65661287e-06])
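The fitted parameters define the 1-D decision boundary where $z = \beta_0 + \beta_1 x = 0$, i.e. $x = -\beta_0/\beta_1$ (about 0.47 for the fit above); the model output crosses 0.5 there:

In [ ]:
# Decision boundary in 1-D: z = beta_0 + beta_1*x = 0  =>  x = -beta_0/beta_1
x_boundary = -res_1d.x[0] / res_1d.x[1]
print("boundary at x = {:.3}".format(x_boundary))
print("P(label=1) at boundary: {:.3}".format(sigmoid(res_1d.x[0] + res_1d.x[1] * x_boundary)))  # 0.5 by construction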

In [20]:
df_pred_1d = pd.DataFrame.from_dict({"x": np.linspace(-3,3,99)})
df_pred_1d["y"] = model_1d(res_1d.x, df_pred_1d)
df_pred_1d.head()


Out[20]:
x y
0 -3.000000 0.998945
1 -2.938776 0.998810
2 -2.877551 0.998657
3 -2.816327 0.998485
4 -2.755102 0.998290

5 rows × 2 columns


In [21]:
ggplot(aes(x="x",y="label", color="label_fact"), data=df_1d) + \
  geom_point() + \
  geom_line(aes(x="x", y="y"), color="green", data=df_pred_1d)


Out[21]:
<ggplot: (282128693)>

Example - 2D


In [22]:
dat_2d = {   "x1":[-1.0, 0.3, 2.05, 0.95,  2,-1.1,-2.4,-0.9,-2.1], 
             "x2":[ 2.0, 0.6,  2.1,  2.1, -1,-1  ,-1.8,  -2, 0.0], 
       "label":[0,0,0,0,0,1,1,1,1]}
df_2d = pd.DataFrame.from_dict(data = dat_2d)
df_2d["label_fact"] = pd.Categorical.from_array([str(x) for x in df_2d.label])
df_2d


Out[22]:
label x1 x2 label_fact
0 0 -1.00 2.0 0
1 0 0.30 0.6 0
2 0 2.05 2.1 0
3 0 0.95 2.1 0
4 0 2.00 -1.0 0
5 1 -1.10 -1.0 1
6 1 -2.40 -1.8 1
7 1 -0.90 -2.0 1
8 1 -2.10 0.0 1

9 rows × 4 columns


In [23]:
ggplot(aes(x="x1",y="x2", color="label_fact"), data=df_2d) + geom_point()


Out[23]:
<ggplot: (287556169)>

In [24]:
def model_2d(model_vec, df):
    z = model_vec[0] + model_vec[1] * df.x1 + model_vec[2] * df.x2
    return sigmoid(z)

In [25]:
res_2d = mini(model_cost, x0=[1,1,0.5], args=(df_2d, model_2d)) 
print (res_2d)


   status: 0
  success: True
     njev: 19
     nfev: 95
 hess_inv: array([[ 26585.66512643,  35936.64997437,  28285.71275239],
       [ 35936.64997437,  48579.4332378 ,  38236.01146309],
       [ 28285.71275239,  38236.01146309,  30096.51214756]])
      fun: 4.171345112083083e-06
        x: array([-5.27977744, -8.66641472, -7.1442268 ])
  message: 'Optimization terminated successfully.'
      jac: array([  1.12009656e-06,   1.18361356e-07,   5.67908933e-06])

In [27]:
# z = 0 defines a line
x1 = np.linspace(-3,3,27)
x2 = -(res_2d.x[0] + x1*res_2d.x[1])/res_2d.x[2]
df_pred_2d = pd.DataFrame.from_dict({ "x1": x1, 
                                      "x2": x2  })
df_pred_2d["y"] = model_2d(res_2d.x, df_pred_2d)
df_pred_2d.head()


Out[27]:
x1 x2 y
0 -3.000000 2.900169 0.5
1 -2.769231 2.620231 0.5
2 -2.538462 2.340293 0.5
3 -2.307692 2.060355 0.5
4 -2.076923 1.780417 0.5

5 rows × 3 columns


In [29]:
ggplot(aes(x="x1", y="x2"),data=df_pred_2d) + geom_line() + geom_point(aes(x="x1",y="x2", color="label_fact"), data=df_2d)


Out[29]:
<ggplot: (287531813)>

In [31]:
from scipy.stats import norm
# Make some data
npts = 65
x1 = np.concatenate((norm.rvs(loc=2, scale=1.5,   size=npts),  norm.rvs(loc=-2, scale=1.5, size=npts)))
x2 = np.concatenate((norm.rvs(loc=2, scale=1.5, size=npts),  norm.rvs(loc=-1, scale=1.5,   size=npts)))
lab = [0]*npts + [1]*npts
# Plots and fits
dat_gaus = {  "x1": x1, 
              "x2": x2 , 
           "label": lab }
df_2d_gaus = pd.DataFrame.from_dict(data = dat_gaus)
df_2d_gaus["label_fact"] = pd.Categorical.from_array([str(x) for x in df_2d_gaus.label])
# Fit
res_2d_gaus = mini(model_cost, x0=[0.1,1.2,0.5], args=(df_2d_gaus, model_2d))
# z = 0 defines a line
x1 = np.linspace(-4, 4, 27)
x2 = -(res_2d_gaus.x[0] + x1*res_2d_gaus.x[1])/res_2d_gaus.x[2]
print res_2d_gaus
df_2d_gaus_pred = pd.DataFrame.from_dict({"x1": x1, "x2": x2})
df_2d_gaus_pred["y"] = model_2d(res_2d_gaus.x, df_2d_gaus)
ggplot(aes(x="x1", y="x2"), data=df_2d_gaus_pred) + \
     geom_line(color="red") + \
     geom_point(aes(x="x1",y="x2", color="label_fact"), data=df_2d_gaus)


   status: 0
  success: True
     njev: 25
     nfev: 125
 hess_inv: array([[  62.21889931,  -38.26727073,  -24.90246783],
       [ -38.26727073,  136.80563705,   59.50383824],
       [ -24.90246783,   59.50383824,   68.78974595]])
      fun: 0.060892790277256276
        x: array([ 1.06422726, -2.93912367, -2.09587989])
  message: 'Optimization terminated successfully.'
      jac: array([  2.24681571e-06,  -3.12458724e-07,   1.49244443e-06])
Out[31]:
<ggplot: (287566385)>
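To use the fit as a classifier, threshold the model output at 0.5 and compare with the labels. Note this is training-set accuracy only; see the end notes on holding out a test set:

In [ ]:
# Classify by thresholding P(label=1) at 0.5 and score against the training labels
pred = (model_2d(res_2d_gaus.x, df_2d_gaus) > 0.5).astype(int)
print("training accuracy: {:.3}".format(np.mean(pred.values == df_2d_gaus.label.values)))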

In [34]:
def model_quad(model_vec, df):
    # quadratic in x1: z = b0 + b1*x2 + b2*x1^2 + b3*x1
    z = model_vec[0] + model_vec[1]*df.x2 + model_vec[2]*df.x1*df.x1 + model_vec[3]*df.x1
    return sigmoid(z)
# Make some data
npts = 100
x1 = np.concatenate((norm.rvs(loc=-3,scale=1, size=npts), 
                     norm.rvs(loc=0, scale=1, size=npts),
                     norm.rvs(loc=0, scale=1, size=npts),
                     norm.rvs(loc=3, scale=1, size=npts)))
x2 = np.concatenate((norm.rvs(loc=4, scale=6, size=npts), 
                     norm.rvs(loc=-5, scale=2, size=npts),
                     norm.rvs(loc=6, scale=2, size=npts),
                     norm.rvs(loc=4, scale=6, size=npts)))
lab = [0]*npts + [1]*npts + [0]*npts + [0]*npts
# Plots and fits
dat_quad = {    "x1": x1, 
                "x2": x2, 
             "label": lab }
df_quad = pd.DataFrame.from_dict(data = dat_quad)
df_quad["label_fact"] = pd.Categorical.from_array([str(x) for x in df_quad.label])
# Fit
res_quad = mini(model_cost, x0=[0.5, 1.2, -1, 2], args=(df_quad, model_quad))
print res_quad
# z = 0 defines the boundary curve (a parabola in x1)
x1 = np.linspace(-3,3,21)
x2 = -(res_quad.x[0] + res_quad.x[3]*x1 + res_quad.x[2]*x1*x1)/res_quad.x[1]
df_quad_pred = pd.DataFrame.from_dict({"x1": x1, "x2": x2})
df_quad_pred["y"] = model_2d(res_quad.x, df_quad)
ggplot(aes(x="x1", y="x2"), data=df_quad_pred) + \
     geom_line(color="red") + \
     geom_point(aes(x="x1",y="x2", color="label_fact"), data=df_quad)


   status: 0
  success: True
     njev: 33
     nfev: 198
 hess_inv: array([[ 81.81109734,   0.46931549, -19.57898876,   3.00398338],
       [  0.46931549,   4.06837566,   5.65185489,  -2.85425827],
       [-19.57898876,   5.65185489,  18.31695346,  -9.71437105],
       [  3.00398338,  -2.85425827,  -9.71437105,  28.75195032]])
      fun: 0.08494712883483503
        x: array([ 0.41457294, -0.78381036, -1.17937652,  0.32299065])
  message: 'Optimization terminated successfully.'
      jac: array([ -3.28756869e-07,  -1.57207251e-06,  -4.22820449e-07,
        -4.17232513e-07])
Out[34]:
<ggplot: (287541013)>

EndNotes:

Follow-up ideas that are important to success with machine learning:

  • Split the data into a training set and a test set; hold out the test set and measure error on both sets during training to estimate the point where generalization breaks down due to over-fitting. This may mean you don't use a canned optimizer as I did in these demos.
  • Dimensionality: as you add dimensions to your input vector, it gets harder to fill the space with data ($Volume \propto d^{N_d}$ for linear size $d$ and $N_d$ dimensions). This means adding another dimension requires a lot more data!
  • Regularization helps limit the complexity of the model. Add a term like $C = \ldots + \lambda \sum{\beta_i}^2$ to penalize having many large model parameters $\beta$ in the model (see the sketch after this list).
  • Don't use this code, use scikit-learn! (A minimal example follows below.)
  • Look at Bayesian approaches that give you distributions of separating surfaces.
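A minimal sketch of the regularization idea, reusing model_cost from above; the penalty weight lam and the choice to leave the intercept unpenalized are illustrative assumptions (in practice, tune lam on held-out data):

In [ ]:
# L2-regularized cost: original cost plus a penalty on the size of the betas (intercept excluded)
def model_cost_reg(model_vec, df, model, lam=0.1):
    penalty = lam * np.sum(np.array(model_vec[1:])**2)
    return model_cost(model_vec, df, model) + penalty

res_reg = mini(model_cost_reg, x0=[1, 1, 0.5], args=(df_2d_gaus, model_2d, 0.1))
print(res_reg.x)  # betas are pulled toward zero relative to the unregularized fit

And the scikit-learn equivalent, which applies L2 regularization by default (strength controlled by the C parameter):

In [ ]:
from sklearn.linear_model import LogisticRegression
X = df_2d_gaus[["x1", "x2"]].values
y = df_2d_gaus.label.values
clf = LogisticRegression()            # L2 penalty by default; smaller C means stronger regularization
clf.fit(X, y)
print(clf.intercept_, clf.coef_)      # compare with the betas from the hand-rolled fit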
