In the last session we looked at the basic concepts of logistic regression.
In [1]:
# Import our usual libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# Set up the path
import os
# OS-independent way to navigate the file system
# Data directory is one directory up in relation to directory of this notebook
data_dir_root = os.path.normpath(os.path.join(os.getcwd(), os.pardir, "Data"))
# Where the file is
file_url = os.path.join(data_dir_root, "forged-bank-notes.csv")
#file_url
In [3]:
# Load the data
# header=0 treats the first row of the csv file as the header row;
# names= then replaces those column labels with our own
data = pd.read_csv(file_url, header=0, names=['V1', 'V2', 'V3', 'V4', 'Genuine'])
In [4]:
# Set up the inputs:
# one DataFrame for each pair of features
inputs_v1_v2 = data[['V1', 'V2']]
inputs_v3_v4 = data[['V3', 'V4']]
inputs_v1_v3 = data[['V1', 'V3']]
inputs_v1_v4 = data[['V1', 'V4']]
inputs_v2_v3 = data[['V2', 'V3']]
inputs_v2_v4 = data[['V2', 'V4']]
Let's start where we left off last time.
We were looking at a bank notes dataset. The dataset has features V1, V2, V3, and V4.
We were looking at just V1 and V2 -- to keep things simple enough to visualize easily.
We'll continue to look at V1 and V2...
In [5]:
# What the first few rows of the dataset look like --
# for just the V1 and V2 features.
inputs_v1_v2.head()
Out[5]:
In [6]:
# Set up the output and
# display the first few rows of the output/target
output = data[['Genuine']]
output.head()
Out[6]:
In [7]:
# Set up the training data
X_train_v1_v2 = {'data': inputs_v1_v2.values, 'feature1': 'V1', 'feature2': 'V2'}
X_train_v3_v4 = {'data': inputs_v3_v4.values, 'feature1': 'V3', 'feature2': 'V4'}
X_train_v1_v3 = {'data': inputs_v1_v3.values, 'feature1': 'V1', 'feature2': 'V3'}
X_train_v1_v4 = {'data': inputs_v1_v4.values, 'feature1': 'V1', 'feature2': 'V4'}
X_train_v2_v3 = {'data': inputs_v2_v3.values, 'feature1': 'V2', 'feature2': 'V3'}
X_train_v2_v4 = {'data': inputs_v2_v4.values, 'feature1': 'V2', 'feature2': 'V4'}
X_train_v1_v2['data'].shape
Out[7]:
In [8]:
# Set up the target data
y = output.values
# scikit-learn expects the targets as a 1-D array of shape (n_samples,),
# so flatten y from shape (n_samples, 1)
y_train = y.ravel()
y_train.shape
Out[8]:
In [9]:
# Set up the positive (genuine) and negative (forged) categories
# These are used later for the scatter plots of one feature versus another
positive = data[data['Genuine'].isin([1])]
negative = data[data['Genuine'].isin([0])]
In [10]:
# Set up the logistic regression model from SciKit Learn
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
# Solvers that seem to work well are 'liblinear' and 'newton-cg'
lr = LogisticRegression(C=100.0, random_state=0, solver='liblinear', verbose=2)
In [11]:
# Train the model and find the optimal parameter values
lr.fit(X_train_v1_v2['data'], y_train)
Out[11]:
At this point, (just imagine that) we've:
(Can you picture all of this from the dataset point of view?)
In [12]:
# These are the optimal values of w0, w1 and w2
w0 = lr.intercept_[0]
w1 = lr.coef_.squeeze()[0]
w2 = lr.coef_.squeeze()[1]
print("w0: {}\nw1: {}\nw2: {}".format(w0, w1, w2))
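To see what these weights mean, here is a minimal sketch that recomputes the model's probabilities by hand and derives the boundary line from w0, w1 and w2. It runs on synthetic two-feature data from `make_blobs`, not on the bank notes file, and the names `X_demo`, `y_demo`, `demo_lr` and `sigmoid` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for two features such as V1 and V2 (assumption: NOT the bank notes data)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

demo_lr = LogisticRegression(C=100.0, random_state=0, solver='liblinear')
demo_lr.fit(X_demo, y_demo)

w0 = demo_lr.intercept_[0]
w1, w2 = demo_lr.coef_.squeeze()

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Probability of class 1, computed by hand from the weights
z = w0 + w1 * X_demo[:, 0] + w2 * X_demo[:, 1]
manual_probs = sigmoid(z)

# The same probability as reported by scikit-learn
sk_probs = demo_lr.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_probs, sk_probs))

# The decision boundary is where w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w0 + w1*x1) / w2 (assumes w2 is not zero)
x1_line = np.linspace(X_demo[:, 0].min(), X_demo[:, 0].max(), 5)
x2_line = -(w0 + w1 * x1_line) / w2
```

Points on that line get a probability of exactly 0.5, which is why it separates the two predicted classes.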
In [13]:
# Function for plotting class boundaries
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

def poly_boundary_plot(XTrain, YTrain, degree, show_contours=0):
    # XTrain has to have exactly 2 features for this visualization to work

    # Transform the training inputs
    poly = PolynomialFeatures(degree)
    X_train_poly = poly.fit_transform(XTrain['data'])
    # NOTE: PolynomialFeatures adds a bias value of 1 to each row of input data --
    # default setting is include_bias=True

    # Set up the logistic regression model from SciKit Learn
    # Solvers that seem to work well are 'liblinear' and 'newton-cg'
    lr = LogisticRegression(C=100.0, random_state=0, solver='liblinear', verbose=2)

    # Fit the simple linear logistic regression model to the polynomial features
    lr.fit(X_train_poly, YTrain)

    # Create a grid of feature values
    # Find the min and max values of the two features
    GRID_INCREMENT = 0.02
    x1_min, x1_max = XTrain['data'][:, 0].min(), XTrain['data'][:, 0].max()
    x2_min, x2_max = XTrain['data'][:, 1].min(), XTrain['data'][:, 1].max()

    # Make grid values
    xx1, xx2 = np.mgrid[x1_min:x1_max:GRID_INCREMENT, x2_min:x2_max:GRID_INCREMENT]
    grid = np.c_[xx1.ravel(), xx2.ravel()]

    # The predictions of the model on the grid
    # (transform rather than fit_transform: poly was already fitted on the training inputs)
    grid_poly = poly.transform(grid)
    preds_poly = lr.predict(grid_poly)
    preds_poly_probs = lr.predict_proba(grid_poly)
    preds_poly_probs_0 = preds_poly_probs[:, 0]
    preds_poly_probs_1 = preds_poly_probs[:, 1]

    # Where did the model misclassify banknotes?
    # Keep in mind we are only using the two features in XTrain
    ## CAUTION: uses the notebook-level variables data, positive and negative
    model_preds = lr.predict(X_train_poly)
    errors_poly = data[data['Genuine'] != model_preds]

    # Get some classification performance metrics
    accuracy = metrics.accuracy_score(YTrain, model_preds)
    report = metrics.classification_report(YTrain, model_preds)
    confusion_matrix = metrics.confusion_matrix(YTrain, model_preds, labels=None, sample_weight=None)

    # Plot the data points
    fig, ax = plt.subplots(figsize=(15,10))
    ax.scatter(positive[XTrain['feature1']], positive[XTrain['feature2']], s=30, c='b', marker='.', label='Genuine')
    ax.scatter(negative[XTrain['feature1']], negative[XTrain['feature2']], s=30, c='r', marker='.', label='Forged')
    ax.set_xlabel(XTrain['feature1'])
    ax.set_ylabel(XTrain['feature2'])

    # Now plot magenta circles around data points that were incorrectly predicted
    ax.scatter(errors_poly[XTrain['feature1']], errors_poly[XTrain['feature2']], facecolors="none", edgecolors="m", s=80, label="Wrongly Classified")

    # Finally plot the decision boundary:
    # the contour that separates the predicted 1s from the predicted 0s
    plt.contour(xx1, xx2, preds_poly.reshape(xx1.shape), colors='g', linewidths=1)

    if show_contours == 1:
        # preds_poly_probs_0 for contours of probability of 0 -- i.e. prob(forged bank note)
        # preds_poly_probs_1 for contours of probability of 1 -- i.e. prob(genuine bank note)
        contour_probs = preds_poly_probs_1
        cs = plt.contour(xx1, xx2, contour_probs.reshape(xx1.shape), linewidths=0.7)
        plt.clabel(cs, inline=1, fontsize=12)

    ax.legend(loc='lower right')
    title = 'Logistic Regression\n'
    title = title + 'Bank Note Validation Based on Feature Values ' + XTrain['feature1'] + ' and ' + XTrain['feature2'] + '\n'
    title = title + 'Polynomial Degree: ' + str(degree) + '\n'
    title = title + 'Number of misclassified points = ' + str(len(errors_poly))
    plot = plt.title(title)

    return errors_poly, accuracy, confusion_matrix, report, plot
...and this is what we saw last time for linear logistic regression
In [14]:
# Logistic regression with degree 1 -- what we saw last time
# NOTE: show_contours=0, so only the decision boundary is drawn;
# set show_contours=1 to add contours of the probability that the bank note is genuine
errors, accuracy, conf_matrix, report, plot = poly_boundary_plot(X_train_v1_v2,
                                                                 y_train,
                                                                 degree=1,
                                                                 show_contours=0)
In [15]:
# Which rows of the dataset are misclassified?
errors
Out[15]:
In [16]:
# Classification accuracy
accuracy
Out[16]:
In [17]:
# Confusion Matrix
print(conf_matrix)
In [18]:
# True negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = conf_matrix.ravel()
tn, fp, fn, tp
Out[18]:
In [19]:
# Precision, recall, f1-score
print(report)
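The numbers in the report can be worked out directly from the four confusion-matrix counts. The sketch below shows the arithmetic on a handful of made-up labels and predictions (1 = genuine, 0 = forged); `y_true` and `y_pred` are hypothetical and are not taken from the bank notes data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions, invented for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

# Accuracy: fraction of all predictions that are correct
accuracy = (tp + tn) / (tp + tn + fp + fn)
# Precision: of the notes predicted genuine, the fraction that really are genuine
precision = tp / (tp + fp)
# Recall: of the notes that really are genuine, the fraction the model found
recall = tp / (tp + fn)
# F1-score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("accuracy: {}\nprecision: {}\nrecall: {}\nf1: {}".format(accuracy, precision, recall, f1))
```

These are the values for the positive class (label 1); the report lists the same quantities for each class in turn.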
In [20]:
# Logistic regression with degree-5 polynomial features
# NOTE: The contours are probabilities that the bank note is genuine
errors, accuracy, conf_matrix, report, plot = poly_boundary_plot(X_train_v1_v2,
                                                                 y_train,
                                                                 degree=5,
                                                                 show_contours=1)
In [21]:
# Which rows of the dataset are misclassified?
errors
Out[21]:
In [22]:
# Classification accuracy
accuracy
Out[22]:
In [23]:
# Confusion Matrix
print(conf_matrix)
In [24]:
# True negatives, false positives, false negatives, and true positives
tn, fp, fn, tp = conf_matrix.ravel()
tn, fp, fn, tp
Out[24]:
In [25]:
# Precision, recall, f1-score
print(report)
At some point, just making the model more and more complex will start to produce diminishing returns. At this point it's more data that will help.
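One way to look for diminishing returns is to compare cross-validated accuracy across polynomial degrees. The sketch below is a rough illustration under several assumptions: it uses synthetic data from `make_moons` rather than the bank notes, it adds a `StandardScaler` step that this notebook does not use (to help the solver with high-degree features), and the names `X_demo`, `y_demo` and `cv_scores` are made up. Note also that it scores held-out folds, whereas the accuracy above is measured on the training data.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic two-feature data (assumption: NOT the bank notes data)
X_demo, y_demo = make_moons(n_samples=300, noise=0.25, random_state=0)

cv_scores = {}
for degree in [1, 2, 3, 5, 8]:
    # Polynomial features -> scaling -> the same kind of logistic regression model as above
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LogisticRegression(C=100.0, random_state=0, solver='liblinear'),
    )
    # Mean accuracy over 5 held-out folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    cv_scores[degree] = scores.mean()
    print("degree {}: mean CV accuracy = {:.3f}".format(degree, cv_scores[degree]))
```

If the scores stop improving (or get worse) as the degree goes up, extra model complexity is no longer buying anything.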
We've been working with just 2 of the 4 features -- why not work with all the features available to us? This gives us more predictive power but makes it hard to visualize the boundaries.
We can, however, see how our predictions are going by looking at the rows in the dataset that are misclassified.
In [26]:
# Set up the inputs
inputs_all = data[['V1', 'V2', 'V3', 'V4']]
In [27]:
# Here are some key stats on the inputs
inputs_all.describe()
Out[27]:
In [28]:
# Turn the inputs into an array of training data
X_all_train = inputs_all.values
X_all_train.shape
Out[28]:
In [29]:
# Sanity check
X_all_train[0:3]
Out[29]:
In [30]:
# The output remains the same
y_train.shape
Out[30]:
In [31]:
# Use the same logistic regression model as before
# Train the model and find the optimal parameter values
lr.fit(X_all_train, y_train)
Out[31]:
In [32]:
# These are the optimal values of w0, w1, w2, w3, and w4
w0 = lr.intercept_[0]
w1 = lr.coef_.squeeze()[0]
w2 = lr.coef_.squeeze()[1]
w3 = lr.coef_.squeeze()[2]
w4 = lr.coef_.squeeze()[3]
print("w0: {}\nw1: {}\nw2: {}\nw3: {}\nw4: {}".format(w0, w1, w2, w3, w4))
In [33]:
# Genuine or fake for the entire data set
y_all_pred = lr.predict(X_all_train)
print(y_all_pred)
In [34]:
lr.score(X_all_train, y_train)
Out[34]:
In [35]:
# The probabilities of [Genuine = 0, Genuine = 1]
y_all_pred_probs = lr.predict_proba(X_all_train)
print(y_all_pred_probs)
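The predicted classes and the predicted probabilities are two views of the same model. As a small sketch on synthetic four-feature data (from `make_classification`, not the bank notes file; `X_demo`, `y_demo` and `demo_lr` are illustrative names), each row of probabilities sums to 1, and the predicted class is simply the one whose probability is above 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic four-feature, two-class data (assumption: NOT the bank notes data)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, n_informative=3,
                                     n_redundant=1, random_state=0)

demo_lr = LogisticRegression(C=100.0, random_state=0, solver='liblinear')
demo_lr.fit(X_demo, y_demo)

# Column 0 is the probability of class 0, column 1 the probability of class 1
probs = demo_lr.predict_proba(X_demo)
row_sums = probs.sum(axis=1)

# Thresholding the class-1 probability at 0.5 reproduces predict()
manual_pred = (probs[:, 1] > 0.5).astype(int)
same = np.array_equal(manual_pred, demo_lr.predict(X_demo))
print(same)
```

Moving the threshold away from 0.5 is one way to trade false positives against false negatives.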
In [36]:
# Where did the model misclassify banknotes?
errors = data[data['Genuine'] != y_all_pred]
print('Number of Misclassifications = {}'.format(len(errors)))
errors
Out[36]:
Lesson: With enough data, a linear model is often good enough.
We now have in our toolkit ways to make numerical and categorical predictions.
Can you think of a prediction that doesn't predict a numerical value or a category?
Moreover, our dataset can contain any number of features and our features can be complex.
We know how to take linear models and make them into non-linear models to capture more complex patterns in our data.
We can even bandy about fancy terms like logistic regression, penalty functions, gradient descent, support vector machines, and neural networks!