ACKNOWLEDGEMENT
Much of the code in this notebook comes from John D. Wittenauer's notebooks covering the exercises in Andrew Ng's Machine Learning course on Coursera; that work is Wittenauer's and Ng's and is acknowledged as such. I've also used some code from Sebastian Raschka's book Python Machine Learning.
You're in charge of Quality Assurance at a semiconductor manufacturing plant. You have the data on a sample of microchips that have each been through two tests. You have the two scores for each microchip in the sample. In addition, based on the scores, you have labels for each microchip on whether it was accepted or rejected. Your goal is to determine if a new microchip that's been tested should be accepted or rejected based on the scores it receives.
The structure of this problem is the same as the one we had in the previous chapter. The difference is in the data. Let's take a look at it.
In [1]:
# Use the functions from another notebook in this notebook
%run SharedFunctions.ipynb
In [2]:
# Import our usual libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [3]:
import os
path = os.getcwd() + '/Data/ex2data2.txt'
data2 = pd.read_csv(path, header=None, names=['Test 1', 'Test 2', 'Accepted'])
data2.head()
Out[3]:
In [4]:
data2.shape
Out[4]:
We have 118 rows of data in our dataset; each row has 3 columns -- 2 inputs and the associated output classification.
In [5]:
positive = data2[data2['Accepted'].isin([1])]
negative = data2[data2['Accepted'].isin([0])]
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(positive['Test 1'], positive['Test 2'], s=30, c='b', marker='s', label='Accepted')
ax.scatter(negative['Test 1'], negative['Test 2'], s=30, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
ax.set_title('Full Dataset')
Out[5]:
Notice something immediately? There's no way to draw a straight line that separates the blue squares from the red Xes. The decision boundary needs to be a curve, which means this is a non-linear problem. We're going to "manufacture" a number of new features in order to tackle it. You'll see that it's not hard to draw non-linear boundaries; what's much harder is to draw a non-linear boundary and still have a useful model with which to make predictions -- predictions about the future, as Yogi Berra would say. Regularization is the technique for making sure that complex models with non-linear decision boundaries can still be useful.
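For reference, here is the regularized logistic regression cost from Ng's course, which a sanity-check cell later in this chapter reproduces numerically; the first sum is the usual cross-entropy cost and the second term penalizes large coefficients:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\Big[-y^{(i)}\log h_\theta(x^{(i)}) - \big(1 - y^{(i)}\big)\log\big(1 - h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

The larger the regularization constant $\lambda$, the harder the model is pushed toward small coefficients and a smoother decision boundary.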
Instead of getting ahead of ourselves, let's take this step by step.
We'll see why this split makes sense a bit later, but for now, let's split our one dataset into two datasets -- one with 80% of the original data (the Training set) and the other with 20% of the original data (the Test set).
In [6]:
# Split the data into Training and Test (80-20 split into training and test sets)
from sklearn.model_selection import train_test_split
X = data2.iloc[:, 0:2].values
y = data2.iloc[:, 2].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [7]:
# Pickle the data for later use in the Neural Network notebook
import pickle
pkl_file_path = os.getcwd() + '/Data/ex2data2.pkl'
with open(pkl_file_path, 'wb') as pkl_file:
    pickle.dump([X_train, X_test, y_train, y_test], pkl_file)
In [8]:
X_test
Out[8]:
In [9]:
# Bring the training inputs and training outputs together
#X_train.shape
#y_train.shape
training_data = np.c_[X_train, y_train]
# Get all inputs that have positive outputs -- i.e., Accepted
pos_train_inputs = np.array([tr_d for tr_d in training_data if tr_d[2] == 1])[:,0:2]
#pos_train_inputs
# Get all the inputs that have negative outputs -- i.e., Rejected
neg_train_inputs = np.array([tr_d for tr_d in training_data if tr_d[2] == 0])[:,0:2]
#neg_train_inputs
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(pos_train_inputs[:,0], pos_train_inputs[:,1], s=30, c='b', marker='s', label='Accepted')
ax.scatter(neg_train_inputs[:,0], neg_train_inputs[:,1], s=30, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
ax.set_title('Training Dataset')
Out[9]:
The training dataset looks just like the full dataset -- it just contains fewer data points.
In [10]:
# Bring the test inputs and test outputs together
#X_test.shape
#y_test.shape
test_data = np.c_[X_test, y_test]
# Get all inputs that have positive outputs -- i.e., Accepted
pos_test_inputs = np.array([t_d for t_d in test_data if t_d[2] == 1])[:,0:2]
#pos_test_inputs
# Get all the inputs that have negative outputs -- i.e., Rejected
neg_test_inputs = np.array([t_d for t_d in test_data if t_d[2] == 0])[:,0:2]
#neg_test_inputs
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(pos_test_inputs[:,0], pos_test_inputs[:,1], s=30, c='b', marker='s', label='Accepted')
ax.scatter(neg_test_inputs[:,0], neg_test_inputs[:,1], s=30, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
ax.set_title('Test Dataset')
Out[10]:
The test dataset looks a lot sparser than the full dataset. There's a good reason for that, which we'll get to later in this chapter.
We only have two inputs -- the scores on tests 1 and 2. However, we need a non-linear decision boundary between the microchips that are accepted and those that are rejected. Within logistic regression, the way to get one is to manufacture more features from the two features we already have.
To make a point, we're going to go crazy with the manufacturing of features: our model will be a sixth-degree polynomial in the two test scores. The systematic way to construct it is to include every term of the form $x_1^i x_2^j$ with $i + j \le 6$; counting those terms (including the constant) gives 28 of them. These are the 28 terms we'll use to fit the training data:

$$h_\theta(x) = g\big(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2 + \theta_6 x_1^3 + \dots + \theta_{27} x_2^6\big), \qquad g(z) = \frac{1}{1 + e^{-z}}$$
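As a quick check on that count, the monomials $x_1^i x_2^j$ with $i + j \le 6$ can be enumerated directly. This little snippet (not part of the original notebook, purely illustrative) confirms there are 28 of them, matching the number of columns PolynomialFeatures(6) will produce below.
d = 6
# enumerate every monomial x1**i * x2**j with i + j <= d
terms = [(i, j) for i in range(d + 1) for j in range(d + 1) if i + j <= d]
print(len(terms))   # 28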
In [11]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(6)
X_train_poly6 = poly.fit_transform(X_train)
X_test_poly6 = poly.transform(X_test)   # reuse the mapping fitted on the training data
# NOTE: PolynomialFeatures also prepends a bias column of 1s to each row -- the default setting is include_bias=True
In [12]:
X_train_poly6.shape
Out[12]:
In [13]:
X_train.shape
Out[13]:
We've increased the number of features from 2 to 28 by using a 6th degree polynomial. Let's use logistic regression to solve the problem.
We've made the inputs much more complicated -- so before we hand them to a logistic regression algorithm that depends on some form of gradient descent to find its solution, we should check whether they need to be normalized for the algorithm to operate efficiently. As the first row below shows, the original test scores already lie roughly between -1 and 1.2, so the polynomial features stay bounded and we can proceed without explicit scaling.
In [14]:
# first row of the training inputs
X_train_poly6[0]
Out[14]:
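If the raw scores had been on very different scales, one standard option would be to standardize the polynomial features before fitting. A minimal sketch, assuming scikit-learn's StandardScaler (not used elsewhere in this notebook):
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training features only, then apply the same
# transformation to the test features to avoid leaking test-set statistics
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_poly6)
X_test_scaled = scaler.transform(X_test_poly6)
# The constant bias column becomes all zeros after centering, which is harmless
# here because scikit-learn's LogisticRegression fits its own intercept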
In [35]:
# Implement logistic regression using Scikit Learn
# Large C means low regularization lambda; Small C means high regularization lambda
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import cross_val_score
lr = LogisticRegression(solver='newton-cg', C=100)
lr.fit(X_train_poly6, y_train)
Out[35]:
In [36]:
# Classification accuracy on the training data
lr.score(X_train_poly6, y_train)
Out[36]:
In [37]:
# How did this classifier do on the test data?
predicted = lr.predict(X_test_poly6)
print(metrics.accuracy_score(y_test, predicted))
print(metrics.roc_auc_score(y_test, predicted))
In [18]:
lr.intercept_
Out[18]:
In [19]:
# The optimal values of theta
lr.coef_
Out[19]:
In [20]:
# Sanity checks
m = X_train_poly6.shape[0]
print('Number of samples:', m)
print('lr.coef_ shape:', lr.coef_.shape)
# Compute the predicted probabilities from the inputs, the learned coefficients, and the intercept
print('X_train_poly6 shape:', X_train_poly6.shape)
dotProd = sigmoid(X_train_poly6.dot(lr.coef_.T) + lr.intercept_)
y_col = y_train.reshape(-1, 1)
cost_first_term = np.multiply(-y_col, np.log(dotProd))
cost_second_term = np.multiply(1 - y_col, np.log(1 - dotProd))
non_reg_cost = np.sum(cost_first_term - cost_second_term) / m
print('Optimal Non-Regularized Cost:', non_reg_cost)
# Regularize the cost -- in scikit-learn's parameterization lambda is roughly 1/C
Lambda = 1. / 100
theta_squared = np.square(lr.coef_)
theta_squared[0, 0] = 0   # don't penalize the coefficient of the bias column
reg_theta = np.sum(theta_squared)
reg_term = (Lambda / (2 * m)) * reg_theta
print('Regularization Term:', reg_term)
reg_cost = non_reg_cost + reg_term
print('Regularized Cost:', reg_cost)
In [21]:
# Contour plot of the decision boundary
# Make grid values
xx1, xx2 = np.mgrid[-1:1.5:.02, -1:1.5:.02]
In [22]:
xx1
Out[22]:
In [23]:
xx2
Out[23]:
In [24]:
# Create the grid
grid = np.c_[xx1.ravel(), xx2.ravel()]
grid.shape
Out[24]:
In [25]:
# Get the probabilities for each grid value
# Note: the grid must go through the same polynomial feature mapping as the training data
probs = lr.predict_proba(poly.transform(grid))
probs.shape
Out[25]:
In [26]:
probs_bin = lr.predict(poly.transform(grid))
probs_bin.shape
Out[26]:
In [27]:
grid.shape
Out[27]:
In [28]:
probs_to_binary = []
for prob in probs:
    if prob[1] > 0.5:
        probs_to_binary.append(1)
    else:
        probs_to_binary.append(0)
In [29]:
# number of grid points predicted as Accepted (1)
np.sum(probs_to_binary)
Out[29]:
In [31]:
#plt.contour(xx1,xx2,np.array(probs_to_binary).reshape(xx1.shape))
fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(pos_test_inputs[:,0], pos_test_inputs[:,1], s=30, c='b', marker='s', label='Accepted')
ax.scatter(neg_test_inputs[:,0], neg_test_inputs[:,1], s=30, c='r', marker='x', label='Rejected')
ax.legend()
ax.set_xlabel('Test 1 Score')
ax.set_ylabel('Test 2 Score')
ax.set_title('Test Dataset')
plt.contour(xx1,xx2,probs_bin.reshape(xx1.shape), colors='y', linewidths=0.5)
Out[31]:
Let's recap. We chose logistic regression, a linear classification model. But our problem needed a non-linear boundary. So we took the two inputs we had and "manufactured" a 6th-degree polynomial set of inputs from them. We could have chosen any degree for the polynomial; our choice is based on maximizing the predictive value of our classifier, which in turn means maximizing the classifier's success rate over the test data set.
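That predictive use is the whole point of the exercise: given the two test scores for a brand-new microchip, we push them through the same polynomial mapping and ask the fitted classifier for a verdict. A minimal sketch, with made-up scores purely for illustration:
# Hypothetical test scores for a new microchip (illustrative values only)
new_chip = np.array([[0.25, -0.4]])
# Apply the same 6th-degree polynomial mapping used on the training data,
# then ask the fitted model for a class label and a probability
new_chip_poly6 = poly.transform(new_chip)
print('Accepted?', lr.predict(new_chip_poly6)[0] == 1)
print('P(accepted):', lr.predict_proba(new_chip_poly6)[0, 1])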
While the parameters of the model are the coefficients learned from the training data set, there are also two hyperparameters in our model: the regularization constant C and the degree n of our polynomial. The parameters are learned; the hyperparameters must be chosen by the practitioner. There are best practices for choosing them, but they involve a good deal of judgment acquired from experience. Induction is hard!
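One common way to bring some discipline to that judgment is cross-validation on the training set: try a grid of candidate values and keep the one that scores best on held-out folds. Here's a minimal sketch using scikit-learn's GridSearchCV; the candidate C values are arbitrary, not tuned recommendations. Searching over the polynomial degree n works the same way, except each candidate degree needs its own PolynomialFeatures transform.
from sklearn.model_selection import GridSearchCV
# Search over a small, arbitrary grid of regularization strengths
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(LogisticRegression(solver='newton-cg'), param_grid, cv=5)
search.fit(X_train_poly6, y_train)
print('Best C:', search.best_params_['C'])
print('Best cross-validated accuracy:', search.best_score_)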
In [496]:
# A fancy boundary plot from Michael Waskom
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0,
                           weights=[.5, .5], random_state=15)
clf = LogisticRegression().fit(X[:100], y[:100])
xx, yy = np.mgrid[-5:5:.01, -5:5:.01]
grid = np.c_[xx.ravel(), yy.ravel()]
probs = clf.predict_proba(grid)[:, 1].reshape(xx.shape)
f, ax = plt.subplots(figsize=(8, 6))
contour = ax.contourf(xx, yy, probs, 25, cmap="RdBu",
vmin=0, vmax=1)
ax_c = f.colorbar(contour)
ax_c.set_label("$P(y = 1)$")
ax_c.set_ticks([0, .25, .5, .75, 1])
ax.scatter(X[100:,0], X[100:, 1], c=y[100:], s=50,
cmap="RdBu", vmin=-.2, vmax=1.2,
edgecolor="white", linewidth=1)
ax.set(aspect="equal",
xlim=(-5, 5), ylim=(-5, 5),
xlabel="$X_1$", ylabel="$X_2$")
Out[496]:
In [ ]: