Non-Linear Logistic Regression

In the last session we looked at the basic concepts of logistic regression.

  • Logistic classification is about predicting one category or another.
  • Models give us numerical values.
  • The sigmoid function converts those numerical values into probabilities, which we then turn into categories.
  • We used a new penalty function (the log-loss) that is small when a prediction is close to correct and very large when it is confidently wrong (see the sketch just below).
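
For concreteness, here is a minimal sketch of the sigmoid and of that penalty function (plain NumPy; the function names are chosen here just for illustration):

In [ ]:
# A minimal sketch of the two ideas above (illustration only)
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p_pred, eps=1e-15):
    # Small penalty when the predicted probability is close to the true label,
    # huge penalty when the model is confidently wrong
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(log_loss(np.array([1]), np.array([0.99])))   # ~0.01  (confident and correct)
print(log_loss(np.array([1]), np.array([0.01])))   # ~4.6   (confident and wrong)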

In [1]:
# Import our usual libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Set up the path
import os
# OS-independent way to navigate the file system
# Data directory is one directory up in relation to directory of this notebook
data_dir_root = os.path.normpath(os.getcwd() + os.sep + os.pardir + os.sep + "Data")
# Where the file is
file_url = data_dir_root + os.sep + "forged-bank-notes.csv"
#file_url

In [3]:
# Load the data
# header=0 drops the header row in the csv file
data = pd.read_csv(file_url, header=0, names=['V1', 'V2', 'V3', 'V4', 'Genuine'])

In [4]:
# Set up pairs of input features
# (we'll display the first few rows of one pair a couple of cells below)
inputs_v1_v2 = data[['V1', 'V2']]
inputs_v3_v4 = data[['V3', 'V4']]
inputs_v1_v3 = data[['V1', 'V3']]
inputs_v1_v4 = data[['V1', 'V4']]
inputs_v2_v3 = data[['V2', 'V3']]
inputs_v2_v4 = data[['V2', 'V4']]

Let's start where we left off last time.

We were looking at a bank notes dataset. The dataset has features V1, V2, V3, and V4.

We were looking just at V1 and V2 -- two features are simple enough to visualize easily.

We'll continue to look at V1 and V2...


In [5]:
# What the first few rows of the dataset look like --
#  for just the V1 and V2 features.
inputs_v1_v2.head()


Out[5]:
V1 V2
0 3.62160 8.6661
1 4.54590 8.1674
2 3.86600 -2.6383
3 3.45660 9.5228
4 0.32924 -4.4552

In [6]:
# And here's what the first few rows of the output/target look like

# Set up the output and 
# display the first few rows of the output/target
output = data[['Genuine']]
output.head()


Out[6]:
Genuine
0 0
1 0
2 0
3 0
4 0

In [7]:
# Set up the training data
X_train_v1_v2 = {'data': inputs_v1_v2.values, 'feature1': 'V1', 'feature2': 'V2'}
X_train_v3_v4 = {'data': inputs_v3_v4.values, 'feature1': 'V3', 'feature2': 'V4'} 
X_train_v1_v3 = {'data': inputs_v1_v3.values, 'feature1': 'V1', 'feature2': 'V3'}
X_train_v1_v4 = {'data': inputs_v1_v4.values, 'feature1': 'V1', 'feature2': 'V4'}
X_train_v2_v3 = {'data': inputs_v2_v3.values, 'feature1': 'V2', 'feature2': 'V3'}
X_train_v2_v4 = {'data': inputs_v2_v4.values, 'feature1': 'V2', 'feature2': 'V4'}
X_train_v1_v2['data'].shape


Out[7]:
(1372, 2)

In [8]:
# Set up the target data 
y = output.values

# Change the shape of y to suit scikit learn's array shape requirements
y_train = np.array(list(y.squeeze()))
y_train.shape


Out[8]:
(1372,)
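
(A small optional aside: equivalent, slightly more idiomatic ways to get the same (1372,) array are to flatten the (1372, 1) column with ravel() or to take the column directly from the DataFrame.)

In [ ]:
# Equivalent ways to get the same (1372,) target array (optional)
y_train_alt = output.values.ravel()      # flatten the (1372, 1) column
y_train_alt2 = data['Genuine'].values    # or select the column directly
print(y_train_alt.shape, y_train_alt2.shape)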

In [9]:
# Split the data into positive (genuine) and negative (forged) categories
# (used below when plotting scatters of the feature values)
positive = data[data['Genuine'].isin([1])]
negative = data[data['Genuine'].isin([0])]

In [10]:
# Set up the logistic regression model from SciKit Learn
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
# Solvers that seem to work well are 'liblinear' and 'newton-cg'
lr = LogisticRegression(C=100.0, random_state=0, solver='liblinear', verbose=2)
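
A quick note on the settings: in scikit-learn, C is the inverse of the regularization strength, so C=100.0 means fairly weak regularization and the model is allowed to fit the training data closely. A small sketch, just to illustrate the effect of C on the fitted coefficients:

In [ ]:
# Illustration: smaller C = stronger regularization = coefficients pulled toward 0
for c in [0.01, 1.0, 100.0]:
    m = LogisticRegression(C=c, random_state=0, solver='liblinear')
    m.fit(X_train_v1_v2['data'], y_train)
    print(c, m.coef_.squeeze())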

In [11]:
# Train the model and find the optimal parameter values
lr.fit(X_train_v1_v2['data'], y_train)


[LibLinear]
Out[11]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=2, warm_start=False)

At this point, (just imagine that) we've:

  • visualized the data
  • defined the task we'd like to accomplish
  • defined the model
  • defined the penalty for being wrong
  • used an iterative algorithm (like gradient descent) to find the optimal values of the parameters (a minimal version is sketched just below)

(Can you picture all of this from the dataset point of view?)
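
As promised above, here is a minimal sketch of what an iterative algorithm like gradient descent does for logistic regression (plain NumPy, with a learning rate and iteration count chosen purely for illustration -- scikit-learn's solvers are more sophisticated and also apply L2 regularization, so the numbers will not match exactly):

In [ ]:
# Minimal batch gradient descent for logistic regression (illustration only)
X = X_train_v1_v2['data']
Xb = np.c_[np.ones(len(X)), X]           # prepend a column of 1s for the bias w0
w = np.zeros(Xb.shape[1])                # start with all parameters at zero

learning_rate = 0.01
for _ in range(10000):
    p = 1.0 / (1.0 + np.exp(-Xb @ w))    # predicted probability that Genuine = 1
    gradient = Xb.T @ (p - y_train) / len(y_train)
    w = w - learning_rate * gradient     # step downhill on the log-loss

print(w)   # roughly comparable to lr.intercept_ and lr.coef_ from the fit above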


In [12]:
# These are the optimal values of w0, w1 and w2
w0 = lr.intercept_[0]
w1 = lr.coef_.squeeze()[0]
w2 = lr.coef_.squeeze()[1]
print("w0: {}\nw1: {}\nw2: {}".format(w0, w1, w2))


w0: 0.5865973329054549
w1: -1.1107052345243296
w2: -0.27255003089824
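
With only two features, the decision boundary is the straight line where the model is exactly 50/50, i.e. where w0 + w1*V1 + w2*V2 = 0. A small sketch that solves this for V2 as a function of V1, just to connect these numbers to the boundary line drawn in the plots below:

In [ ]:
# Points on the linear decision boundary: w0 + w1*V1 + w2*V2 = 0
# Solving for V2:  V2 = -(w0 + w1*V1) / w2
v1_line = np.linspace(data['V1'].min(), data['V1'].max(), 50)
v2_line = -(w0 + w1 * v1_line) / w2
print(list(zip(v1_line[:3], v2_line[:3])))   # a few (V1, V2) points on the boundary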

In [13]:
# Function for plotting class boundaries
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score

def poly_boundary_plot(XTrain, YTrain, degree, show_contours=0):
    
    # XTrain has to have exactly 2 features for this visualization to work
    
    # Transform the training inputs
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(XTrain['data'])
    # NOTE: the poly function adds a bias value of 1 to each row of input data -- 
    #  default setting is include_bias=True
    
    # Set up the logistic regression model from SciKit Learn
    # Solvers that seem to work well are 'liblinear' and 'newton-cg'
    lr = LogisticRegression(C=100.0, random_state=0, solver='liblinear', verbose=2)
    
    # Fit the polynomial data to the simple linear logistic regression model we have
    lr.fit(X_train_poly, YTrain);
    
    # Create a grid of feature values
    
    # Find the min and max values of the two features 
    # Make grid values
    GRID_INCREMENT = 0.02
    x1_min = XTrain['data'][:, 0].min()
    x1_max = XTrain['data'][:, 0].max()

    x2_min = XTrain['data'][:, 1].min()
    x2_max = XTrain['data'][:, 1].max()
    
    xx1, xx2 = np.mgrid[x1_min:x1_max:GRID_INCREMENT, x2_min:x2_max:GRID_INCREMENT]
    #xx1.shape, xx2.shape
    
    # Create the grid
    grid = np.c_[xx1.ravel(), xx2.ravel()]
    grid.shape
    
    # The predictions of the model
    preds_poly = lr.predict(poly.fit_transform(grid))
    preds_poly_probs = lr.predict_proba(poly.fit_transform(grid))
    preds_poly_probs_0 = preds_poly_probs[:, 0]
    preds_poly_probs_1 = preds_poly_probs[:, 1]
    
    #return preds_poly, preds_poly_probs, preds_poly_probs_0, preds_poly_probs_1
    
    # Where did the model misclassify banknotes?
    # Keep in mind we are only using the two features passed in via XTrain
    ## CAUTION: uses the global `data` DataFrame defined outside this function
    model_preds = lr.predict(X_train_poly)
    errors_poly = data[data['Genuine'] != model_preds]
    #errors_poly
    
    # Get some classification performance metrics
    accuracy = metrics.accuracy_score(YTrain, model_preds)
    report = metrics.classification_report(YTrain, model_preds)
    confusion_matrix = metrics.confusion_matrix(YTrain, model_preds, labels=None, sample_weight=None)
    
    # Plot the boundary
    fig, ax = plt.subplots(figsize=(15,10))

    ax.scatter(positive[XTrain['feature1']], positive[XTrain['feature2']], s=30, c='b', marker='.', label='Genuine')
    ax.scatter(negative[XTrain['feature1']], negative[XTrain['feature2']], s=30, c='r', marker='.', label='Forged')

    ax.set_xlabel(XTrain['feature1'])
    ax.set_ylabel(XTrain['feature2'])

    # Now plot black circles around data points that were incorrectly predicted
    ax.scatter(errors_poly[XTrain['feature1']], errors_poly[XTrain['feature2']], facecolors="none", edgecolors="m", s=80, label="Wrongly Classified")

    # Finally plot the line which represents the decision boundary
    #ax.plot(x1, x2, color="green", linestyle="--", marker=None, label="boundary")
    # And plot the contours that separate the 1s from the 0s
    plt.contour(xx1,xx2,preds_poly.reshape(xx1.shape), colors='g', linewidths=1)
    if show_contours == 1:
        # preds_poly_probs_0 for contours of probability of 0 -- i.e. prob(forged bank note)
        # preds_poly_probs_1 for contours of probability of 1 -- i.e. prob(genuine bank note)
        contour_probs = preds_poly_probs_1
        cs = plt.contour(xx1,xx2,contour_probs.reshape(xx1.shape), linewidths=0.7)
        plt.clabel(cs, inline=1, fontsize=12)

    ax.legend(loc='lower right')
    
    title = 'Logistic Regression\n'
    title = title + 'Bank Note Validation Based on Feature Values ' + XTrain['feature1'] + ' and ' + XTrain['feature2'] + '\n'
    title = title + 'Polynomial Degree: ' + str(degree) + '\n'
    title = title + 'Number of misclassified points = ' + str(len(errors_poly))

    plot = plt.title(title);
    
    return errors_poly, accuracy, confusion_matrix, report, plot
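
To see concretely what PolynomialFeatures does inside this function, here is a tiny sketch on a single made-up two-feature row (the values are invented purely for illustration):

In [ ]:
# What PolynomialFeatures produces for one 2-feature row [a, b]
demo_row = np.array([[2.0, 3.0]])                 # made-up values, for illustration
print(PolynomialFeatures(degree=2).fit_transform(demo_row))
# [[1. 2. 3. 4. 6. 9.]]  i.e. [1, a, b, a^2, a*b, b^2] -- the leading 1 is the bias term
print(PolynomialFeatures(degree=5).fit_transform(demo_row).shape)   # (1, 21) columns at degree 5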

...and this is what we saw last time for linear logistic regression


In [14]:
# Linear (degree=1) logistic regression -- what we saw last time
# NOTE: with show_contours=0 only the decision boundary is drawn, not the probability contours
errors, accuracy, conf_matrix, report, plot = poly_boundary_plot(X_train_v1_v2, 
                    y_train, 
                    degree=1, 
                    show_contours=0)


[LibLinear]

In [15]:
# Which rows of the dataset are misclassified?
errors


Out[15]:
V1 V2 V3 V4 Genuine
4 0.329240 -4.455200 4.57180 -0.988800 0
7 2.092200 -6.810000 8.46360 -0.602160 0
18 1.447900 -4.879400 8.34280 -2.108600 0
21 0.329200 -4.455200 4.57180 -0.988800 0
25 0.705700 -5.498100 8.33680 -2.871500 0
26 1.143200 -3.741300 5.57770 -0.635780 0
30 -0.248110 -0.177970 4.90680 0.154290 0
34 -1.616200 0.809080 8.16280 0.608170 0
86 -1.131300 1.903700 7.53390 1.022000 0
95 -0.644720 -4.606200 8.34700 -2.709900 0
97 2.161600 -6.880400 8.15170 -0.081048 0
99 -2.741900 11.403800 2.53940 -5.579300 0
107 -1.858400 7.886000 -1.66430 -1.838400 0
119 0.929700 -3.797100 4.64290 -0.295700 0
123 -1.507500 1.922400 7.14660 0.891360 0
145 0.519500 -3.263300 3.08950 -0.984900 0
172 -1.695200 1.065700 8.82940 0.949550 0
178 -0.131440 -1.777500 8.33160 0.352140 0
181 -0.117830 -1.578900 8.03000 -0.028031 0
190 0.929700 -3.797100 4.64290 -0.295700 0
195 -1.858400 7.886000 -1.66430 -1.838400 0
227 -2.647900 10.137400 -1.33100 -5.470700 0
230 -1.338900 1.552000 7.08060 1.031000 0
233 0.469010 -0.633210 7.38480 0.365070 0
241 -1.393100 1.566400 7.53820 0.784030 0
268 0.929700 -3.797100 4.64290 -0.295700 0
277 1.363800 -4.775900 8.41820 -1.883600 0
280 0.209770 -0.461460 7.72670 0.909460 0
283 0.322300 -0.898080 8.08830 0.692220 0
285 1.774700 -6.433400 8.15000 -0.898280 0
... ... ... ... ... ...
1163 0.681800 4.850400 -5.21330 -6.104300 1
1168 0.740670 1.729900 -3.19630 -0.145700 1
1169 0.982960 3.422600 -3.96920 -1.711600 1
1210 0.025013 3.399800 -4.43270 -4.265500 1
1216 0.600500 0.999450 -2.21260 0.097399 1
1217 0.616520 3.894400 -4.72750 -4.394800 1
1223 1.340300 4.132300 -4.70180 -2.598700 1
1224 0.268770 4.987000 -5.15080 -6.391300 1
1229 0.238740 2.087900 -3.35220 -0.665530 1
1230 0.621200 3.677100 -4.07710 -2.071100 1
1271 -0.071503 3.741200 -4.54150 -4.252600 1
1277 0.562320 1.001500 -2.27260 -0.006049 1
1278 0.539360 3.894400 -4.81660 -4.341800 1
1284 1.208000 4.074400 -4.76350 -2.612900 1
1285 0.295200 4.885600 -5.14900 -6.232300 1
1290 0.110320 1.974100 -3.36680 -0.652590 1
1291 0.523740 3.644000 -4.07460 -1.990900 1
1319 0.663650 -0.045533 -0.18794 0.234470 1
1324 0.062525 2.930100 -3.54670 -2.673700 1
1332 0.904070 3.370800 -4.49870 -3.696500 1
1338 1.581000 0.869090 -2.31380 0.824120 1
1339 1.551400 3.801300 -4.91430 -3.748300 1
1344 1.345100 0.235890 -1.87850 1.325800 1
1345 2.227900 4.095100 -4.80370 -2.111200 1
1346 1.257200 4.873100 -5.28610 -5.874100 1
1351 1.318300 1.901700 -3.31110 0.065071 1
1352 1.489600 3.428800 -4.03090 -1.425900 1
1353 0.115920 3.221900 -3.43020 -2.845700 1
1359 0.066129 2.491400 -2.94010 -0.621560 1
1367 0.406140 1.349200 -1.45010 -0.559490 1

157 rows × 5 columns


In [16]:
# Classification accuracy
accuracy


Out[16]:
0.88556851311953355

In [17]:
# Confusion Matrix
print(conf_matrix)


[[693  69]
 [ 88 522]]

In [18]:
# True negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = conf_matrix.ravel()
tn, fp, fn, tp


Out[18]:
(693, 69, 88, 522)
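
From these four counts we can compute precision, recall, and the F1-score for the "genuine" class by hand; they should match the class-1 row of the report printed in the next cell.

In [ ]:
# Precision, recall, and F1 for class 1 (genuine), computed by hand
precision_1 = tp / (tp + fp)   # of the notes predicted genuine, how many really were
recall_1 = tp / (tp + fn)      # of the truly genuine notes, how many did we catch
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
print(precision_1, recall_1, f1_1)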

In [19]:
# Precision, recall, f1-score
print(report)


             precision    recall  f1-score   support

          0       0.89      0.91      0.90       762
          1       0.88      0.86      0.87       610

avg / total       0.89      0.89      0.89      1372

Non-Linear Logistic Regression


In [20]:
# Non-linear (degree=5) logistic regression
# NOTE: The contours are probabilities that the bank note is genuine
errors, accuracy, conf_matrix, report, plot = poly_boundary_plot(X_train_v1_v2, 
                                                y_train, 
                                                degree=5, 
                                                show_contours=1)


[LibLinear]
/anaconda/lib/python3.6/site-packages/sklearn/linear_model/base.py:340: RuntimeWarning: overflow encountered in exp
  np.exp(prob, prob)

In [21]:
# Which rows of the dataset are misclassified?
errors


Out[21]:
V1 V2 V3 V4 Genuine
17 0.803550 2.847300 4.34390 0.601700 0
30 -0.248110 -0.177970 4.90680 0.154290 0
31 1.488400 3.627400 3.30800 0.489210 0
34 -1.616200 0.809080 8.16280 0.608170 0
49 1.634900 3.286000 2.87530 0.087054 0
79 1.131700 3.964700 3.39790 0.843510 0
86 -1.131300 1.903700 7.53390 1.022000 0
95 -0.644720 -4.606200 8.34700 -2.709900 0
123 -1.507500 1.922400 7.14660 0.891360 0
159 1.356600 4.235800 2.13410 0.321100 0
170 0.570600 -0.024800 1.24210 -0.562100 0
172 -1.695200 1.065700 8.82940 0.949550 0
180 1.642600 3.014900 0.22849 -0.147000 0
213 0.680870 2.325900 4.90850 0.549980 0
216 1.323400 3.296400 0.23620 -0.119840 0
224 -0.360380 4.115800 3.11430 -0.371990 0
225 1.356200 3.213600 4.34650 0.786620 0
226 0.570600 -0.024800 1.24210 -0.562100 0
230 -1.338900 1.552000 7.08060 1.031000 0
233 0.469010 -0.633210 7.38480 0.365070 0
241 -1.393100 1.566400 7.53820 0.784030 0
242 1.640600 3.548800 1.39640 -0.364240 0
248 0.379800 0.709800 0.75720 -0.444400 0
265 1.594000 4.705500 1.37580 0.081882 0
280 0.209770 -0.461460 7.72670 0.909460 0
288 0.441250 2.948700 4.32250 0.715500 0
312 1.590200 2.294800 3.24030 0.184040 0
322 -1.734400 2.017500 7.76180 0.935320 0
326 -0.161080 -6.462400 8.35730 -1.521600 0
329 1.698800 2.909400 2.90440 0.110330 0
... ... ... ... ... ...
910 0.552980 -3.461900 1.70480 1.100800 1
911 2.031000 1.852000 -3.01210 0.003003 1
917 1.563100 0.895990 -1.97020 0.654720 1
918 2.391700 4.556500 -4.98880 -2.898700 1
923 0.428300 -0.949810 -1.07310 0.321100 1
971 0.003120 -4.006100 1.79560 0.917220 1
972 1.351800 1.059500 -2.34370 0.399980 1
978 0.758960 0.291760 -1.65060 0.838340 1
980 0.636550 5.202200 -5.21590 -6.121100 1
1007 -0.172960 -1.181600 1.38180 0.733600 1
1014 0.143290 -1.088500 1.00390 0.487910 1
1032 0.163580 -3.358400 1.37490 1.356900 1
1040 1.787500 4.780000 -5.13620 -3.236200 1
1045 -0.014902 -1.024300 -0.94024 0.649550 1
1068 0.318030 -0.993260 1.09470 0.886190 1
1075 0.376370 -0.823580 0.78543 0.745240 1
1093 0.744280 -3.772300 1.61310 1.575400 1
1094 2.017700 1.798200 -2.95810 0.209900 1
1100 1.437800 0.668370 -2.02670 1.027100 1
1101 2.194300 4.550300 -4.97600 -2.725400 1
1106 -0.126900 -1.150500 -0.95138 0.578430 1
1154 -0.360250 -4.449000 2.10670 0.943080 1
1161 0.704900 0.171740 -1.78590 0.361190 1
1319 0.663650 -0.045533 -0.18794 0.234470 1
1327 -0.240370 -1.783700 2.13500 1.241800 1
1337 0.234600 -4.515200 2.11950 1.444800 1
1338 1.581000 0.869090 -2.31380 0.824120 1
1344 1.345100 0.235890 -1.87850 1.325800 1
1345 2.227900 4.095100 -4.80370 -2.111200 1
1350 -0.218880 -2.203800 -0.09540 0.564210 1

98 rows × 5 columns


In [22]:
# Classification accuracy
accuracy


Out[22]:
0.9285714285714286

In [23]:
# Confusion Matrix
print(conf_matrix)


[[696  66]
 [ 32 578]]

In [24]:
# True negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = conf_matrix.ravel()
tn, fp, fn, tp


Out[24]:
(696, 66, 32, 578)

In [25]:
# Precision, recall, f1-score
print(report)


             precision    recall  f1-score   support

          0       0.96      0.91      0.93       762
          1       0.90      0.95      0.92       610

avg / total       0.93      0.93      0.93      1372

At some point, just making the model more and more complex starts to produce diminishing returns. At that point, it's more data -- more rows, or more features -- that will help.

We've been working with just 2 of the 4 features -- why not work with all the features available to us? This gives us more predictive power but makes it hard to visualize the boundaries.

We can, however, see how our predictions are going by looking at the rows in the dataset that are misclassified.
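
One way to check for diminishing returns -- and to score the model on data it has not seen during training -- is cross-validation. cross_val_score is already imported above; here is a sketch of how it could be used to compare polynomial degrees (the exact scores will vary):

In [ ]:
# Sketch: compare polynomial degrees on held-out folds instead of the training data
for d in [1, 2, 3, 5]:
    X_poly = PolynomialFeatures(d).fit_transform(X_train_v1_v2['data'])
    model = LogisticRegression(C=100.0, random_state=0, solver='liblinear')
    print(d, cross_val_score(model, X_poly, y_train, cv=5).mean())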

Use all 4 features instead of just V1 and V2


In [26]:
# Set up the inputs
inputs_all = data[['V1', 'V2', 'V3', 'V4']]

In [27]:
# Here are some key stats on the inputs
inputs_all.describe()


Out[27]:
V1 V2 V3 V4
count 1372.000000 1372.000000 1372.000000 1372.000000
mean 0.433735 1.922353 1.397627 -1.191657
std 2.842763 5.869047 4.310030 2.101013
min -7.042100 -13.773100 -5.286100 -8.548200
25% -1.773000 -1.708200 -1.574975 -2.413450
50% 0.496180 2.319650 0.616630 -0.586650
75% 2.821475 6.814625 3.179250 0.394810
max 6.824800 12.951600 17.927400 2.449500

In [28]:
# Turn the inputs into an array of training data
X_all_train = inputs_all.values
X_all_train.shape


Out[28]:
(1372, 4)

In [29]:
# Sanity check
X_all_train[0:3]


Out[29]:
array([[ 3.6216 ,  8.6661 , -2.8073 , -0.44699],
       [ 4.5459 ,  8.1674 , -2.4586 , -1.4621 ],
       [ 3.866  , -2.6383 ,  1.9242 ,  0.10645]])

In [30]:
# The output remains the same
y_train.shape


Out[30]:
(1372,)

In [31]:
# Use the same logistic regression model as before
# Train the model and find the optimal parameter values
lr.fit(X_all_train, y_train)


[LibLinear]
Out[31]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=2, warm_start=False)

In [32]:
# These are the optimal values of w0, w1, w2, w3, and w4
w0 = lr.intercept_[0]
w1 = lr.coef_.squeeze()[0]
w2 = lr.coef_.squeeze()[1]
w3 = lr.coef_.squeeze()[2]
w4 = lr.coef_.squeeze()[3]
print("w0: {}\nw1: {}\nw2: {}\nw3: {}\nw4: {}".format(w0, w1, w2, w3, w4))


w0: 6.7977093065581125
w1: -7.264170431822384
w2: -3.885217128663818
w3: -4.8902809890361185
w4: -0.5436532238010893
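
To connect these numbers back to the sigmoid, here is a sketch that computes the predicted probability for the first bank note by hand and compares it with predict_proba (used in the cells below):

In [ ]:
# Predicted probability that the first bank note is genuine, computed by hand
x_first = X_all_train[0]
z = w0 + w1*x_first[0] + w2*x_first[1] + w3*x_first[2] + w4*x_first[3]
p_genuine = 1.0 / (1.0 + np.exp(-z))
print(p_genuine)                                  # essentially 0 for this forged note
print(lr.predict_proba(X_all_train[:1])[0, 1])    # should agree with the line above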

In [33]:
# Genuine or fake for the entire data set
y_all_pred = lr.predict(X_all_train)
print(y_all_pred)


[0 0 0 ..., 1 1 1]

In [34]:
lr.score(X_all_train, y_train)


Out[34]:
0.99198250728862969
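
(lr.score is just the classification accuracy -- the fraction of rows classified correctly. With 11 misclassified rows out of 1372, as we'll see a few cells below, that works out to 1 - 11/1372 ≈ 0.992.)

In [ ]:
# score == accuracy: fraction of rows classified correctly
print(metrics.accuracy_score(y_train, y_all_pred))   # same value as above
print(1 - 11/1372)                                   # 11 misclassified rows (see below)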

In [35]:
# The probabilities of [Genuine = 0, Genuine = 1]
y_all_pred_probs = lr.predict_proba(X_all_train)
print(y_all_pred_probs)


[[  1.00000000e+00   9.37582493e-18]
 [  1.00000000e+00   2.49232658e-20]
 [  9.99999999e-01   1.24646312e-09]
 ..., 
 [  1.41340716e-01   8.58659284e-01]
 [  4.74684318e-03   9.95253157e-01]
 [  7.96637064e-07   9.99999203e-01]]
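
(The class predictions from lr.predict above are just these probabilities thresholded at 0.5 -- a quick sketch to confirm:)

In [ ]:
# Class predictions are the Genuine=1 probabilities thresholded at 0.5
manual_preds = (y_all_pred_probs[:, 1] >= 0.5).astype(int)
print(np.array_equal(manual_preds, y_all_pred))   # expected: True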

In [36]:
# Where did the model misclassify banknotes?
errors = data[data['Genuine'] != y_all_pred]
print('Number of Misclassifications = {}'.format(len(errors)))
errors


Number of Misclassifications = 11
Out[36]:
V1 V2 V3 V4 Genuine
145 0.51950 -3.2633 3.0895 -0.984900 0
320 0.51950 -3.2633 3.0895 -0.984900 0
351 0.51950 -3.2633 3.0895 -0.984900 0
671 0.51947 -3.2633 3.0895 -0.984920 0
716 0.51950 -3.2633 3.0895 -0.984900 0
911 2.03100 1.8520 -3.0121 0.003003 1
918 2.39170 4.5565 -4.9888 -2.898700 1
942 -3.37930 -13.7731 17.9274 -2.032300 1
1094 2.01770 1.7982 -2.9581 0.209900 1
1101 2.19430 4.5503 -4.9760 -2.725400 1
1345 2.22790 4.0951 -4.8037 -2.111200 1

Lesson: With enough data (here, enough informative features), a linear model is often good enough.

Summary

We now have in our toolkit ways to make numerical and categorical predictions.

Can you think of a prediction that doesn't predict a numerical value or a category?

Moreover, our dataset can contain any number of features and our features can be complex.

We know how to take linear models and make them into non-linear models to capture more complex patterns in our data.

We can even bandy about fancy terms like logistic regression, penalty functions, gradient descent, support vector machines, and neural networks!

