In [1]:
%matplotlib inline
# Setup - import some packages we'll need
import numpy as np
import matplotlib.pyplot as plt
Logistic regression is a method for fitting a regression model with one dependent variable (DV) and one or more independent variables (IVs). The difference between logistic and linear regression is that logistic regression predicts membership in discrete categories, meaning that $y | x$ is drawn from a Bernoulli distribution rather than a Gaussian. Linear regression is more appropriate when the dependent variable is continuous.
Advantages of logistic regression:
- Fast to train and evaluate, with few hyperparameters to tune.
- Outputs class probabilities, and its coefficients are directly interpretable.
Disadvantages of logistic regression:
- Assumes a linear decision boundary in the features, so it can underfit complex relationships.
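Before reaching for scikit-learn, it may help to see the core of the model in isolation: a linear combination of the features is passed through the logistic (sigmoid) function to give the Bernoulli parameter $P(y=1|x)$. A minimal pure-Python sketch for the binary case (the weights here are invented for illustration, not fitted to any data):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_binary(x, w, b):
    """P(y = 1 | x) for a binary logistic model with weights w and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# Illustrative (not fitted) weights for a 2-feature problem
w, b = [1.5, -2.0], 0.3
p = predict_proba_binary([2.0, 1.0], w, b)  # score = 1.5*2 - 2*1 + 0.3 = 1.3
label = 1 if p >= 0.5 else 0
```

scikit-learn's multi-class `LogisticRegression` generalizes this same idea beyond two classes.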
In [2]:
from sklearn import datasets
# Import some data to fit. We'll use the iris data and fit only to the first two features.
iris = datasets.load_iris()
X = iris.data[:, :2]
Y = iris.target
n_classes = len(set(Y))
In [3]:
# Plot the training data and take a look at the classes
fig = plt.figure(figsize=(6,6))
ax = fig.add_subplot(111)
md = {0:'o',1:'^',2:'s'}
cm = plt.cm.Set1
for i in range(n_classes):
    inds = (Y==i)
    ax.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
               cmap=plt.cm.Set1,marker=md[i],s=50)
ax.set_xlabel(iris['feature_names'][0])
ax.set_ylabel(iris['feature_names'][1]);
In [4]:
# Train the logistic regression model
h = 0.02 # step size in the mesh
# Create an instance of the classifier
from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e5)
# Fit the data with the classifier
logreg.fit(X, Y)
# Create a 2D grid to evaluate the classifier on
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Evaluate the classifier at every point in the grid
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])
# Reshape the output so that it can be overplotted on our grid
Z = Z.reshape(xx.shape)
In [5]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
# Plot the training points
for i in range(n_classes):
    inds = (Y==i)
    ax1.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=plt.cm.Set1,marker=md[i],s=50)
# Plot the classifier with the training points on top
ax2.pcolormesh(xx, yy, Z, cmap=cm)
for i in range(n_classes):
    inds = (Y==i)
    ax2.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=cm,marker=md[i],s=50,edgecolor='k')
# Label the axes and set the limits
for ax in (ax1,ax2):
    ax.set_xlabel(iris['feature_names'][0])
    ax.set_ylabel(iris['feature_names'][1])
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max());
So now you have a predictor for the class of any future flower based on the input data (sepal length and width). For example:
In [6]:
length = 6.0
width = 3.2
# Just the discrete answer; note the feature order matches the training
# data (sepal length first, then sepal width)
data = np.array([length,width]).reshape(1,-1)
pred_class = logreg.predict(data)[0]
target_name = iris['target_names'][pred_class]
print("Overall predicted class of the new flower is {0:}.\n".format(target_name))
# Probabilities for all the classes
pred_probs = logreg.predict_proba(data)
for name,prob in zip(iris['target_names'],pred_probs[0]):
    print("\tProbability of class {0:12} is {1:.2f}%.".format(name,prob*100.))
Decision trees are a non-parametric, supervised method for both classification and regression. They work by building a series of increasingly deep rules that partition the data on the independent variables in order to predict the dependent variable. Each split is a simple threshold on one feature, chosen to optimize a criterion such as the Gini coefficient, cross-entropy, or misclassification rate.
Advantages of decision trees:
- Simple to understand and visualize; the learned rules can be inspected directly.
- Requires little data preparation (no need to normalize features).
Disadvantages of decision trees:
- Prone to overfitting without constraints such as a maximum depth.
- Small changes in the data can produce very different trees.
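To make the split criterion concrete, here is a toy pure-Python sketch of the Gini impurity a tree minimizes when choosing a threshold; this illustrates the idea, not scikit-learn's internal implementation:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    props = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in props)

def split_impurity(xs, labels, threshold):
    """Weighted Gini impurity after splitting feature values xs at threshold."""
    left = [l for x, l in zip(xs, labels) if x <= threshold]
    right = [l for x, l in zip(xs, labels) if x > threshold]
    n = len(labels)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# Toy data: a threshold at 2.5 separates the two classes perfectly,
# so it achieves the minimum (zero) impurity among the candidates
xs = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
best = min((split_impurity(xs, labels, t), t) for t in [1.5, 2.5, 3.5])
```

A real tree repeats this search over every feature at every node, recursing until a stopping criterion is met.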
In [7]:
# Let's try it out again on the iris dataset.
from sklearn import tree
tree_classifier = tree.DecisionTreeClassifier()
tree_classifier.fit(X,Y)
# Evaluate the classifier at every point in the grid
Z = tree_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
# Reshape the output so that it can be overplotted on our grid
Z = Z.reshape(xx.shape)
In [8]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
# Plot the training points
for i in range(n_classes):
    inds = (Y==i)
    ax1.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=plt.cm.Set1,marker=md[i],s=50)
# Plot the classifier with the training points on top
ax2.pcolormesh(xx, yy, Z, cmap=cm)
for i in range(n_classes):
    inds = (Y==i)
    ax2.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=cm,marker=md[i],s=50,edgecolor='k')
# Label the axes and set the limits
for ax in (ax1,ax2):
    ax.set_xlabel(iris['feature_names'][0])
    ax.set_ylabel(iris['feature_names'][1])
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max());
This model has higher accuracy on the training set because it carves out small niches and isolated regions for individual classes. However, that illustrates the danger of overfitting: performance on new test data will likely be worse.
As with logistic regression, you can extract both discrete predictions and probabilities for each class:
In [9]:
# Just the discrete answer
data = np.array([length,width]).reshape(1,-1)
pred_class = tree_classifier.predict(data)[0]
target_name = iris['target_names'][pred_class]
print("Overall predicted class of the new flower is {0:}.\n".format(target_name))
# Probabilities for all the classes
pred_probs = tree_classifier.predict_proba(data)
for name,prob in zip(iris['target_names'],pred_probs[0]):
    print("\tProbability of class {0:12} is {1:.2f}%.".format(name,prob*100.))
How can we avoid overfitting? One option is to limit the maximum depth of the tree:
In [10]:
fig,axarr = plt.subplots(2,3,figsize=(15,10))
for depth,ax in zip(range(1,7),axarr.ravel()):
    tree_depthlim = tree.DecisionTreeClassifier(max_depth=depth)
    tree_depthlim.fit(X,Y)
    # Evaluate the classifier at every point in the grid
    Z = tree_depthlim.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Plot the classifier with the training points on top
    ax.pcolormesh(xx, yy, Z, cmap=cm)
    for i in range(n_classes):
        inds = (Y==i)
        ax.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                   cmap=cm,marker=md[i],s=50,edgecolor='k')
    # Label each panel and set the limits
    ax.set_title('Max. depth = {0}'.format(depth))
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max());
Random forest (RF) is an example of an ensemble method of classification: it takes many individual estimators, each with some element of randomness built in, and combines their results to reduce the total variance (potentially at the cost of a small increase in bias).
Random forests are built from individual decision trees. Each tree is trained on a bootstrap sample of the data and, rather than picking the best split among all features at each node, the algorithm picks the best split from a random subset of the features. This means the trees will make slightly different classifications even when grown from the same training set. The scikit-learn implementation of RF combines the classifiers by averaging their probabilistic predictions.
Advantages:
- Usually much more accurate than a single decision tree, with far less overfitting.
- Robust to noise and requires little tuning to get reasonable results.
Disadvantages:
- Harder to interpret than a single tree.
- Slower to train and evaluate than simpler models.
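The combination step itself is simple soft voting: each tree produces a class-probability vector, and the forest averages them. A minimal sketch of that rule (the per-tree probabilities below are invented for illustration):

```python
def average_proba(tree_probs):
    """Average per-class probabilities across trees (soft voting)."""
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    return [sum(p[c] for p in tree_probs) / n_trees for c in range(n_classes)]

# Three hypothetical trees' class-probability outputs for one sample
tree_probs = [
    [0.8, 0.2, 0.0],
    [0.6, 0.3, 0.1],
    [0.7, 0.2, 0.1],
]
forest_proba = average_proba(tree_probs)
# The predicted class is the one with the highest averaged probability
pred = forest_proba.index(max(forest_proba))
```

Because each tree sees different data and features, their individual errors tend to cancel in the average.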
In [11]:
from sklearn.ensemble import RandomForestClassifier
rf_classifier = RandomForestClassifier(max_depth=5, n_estimators=10)
rf_classifier.fit(X,Y)
# Evaluate the classifier at every point in the grid
Z = rf_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
# Reshape the output so that it can be overplotted on our grid
Z = Z.reshape(xx.shape)
In [12]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
# Plot the training points
for i in range(n_classes):
    inds = (Y==i)
    ax1.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=plt.cm.Set1,marker=md[i],s=50)
# Plot the classifier with the training points on top
ax2.pcolormesh(xx, yy, Z, cmap=cm)
for i in range(n_classes):
    inds = (Y==i)
    ax2.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=cm,marker=md[i],s=50,edgecolor='k')
# Label the axes and set the limits
for ax in (ax1,ax2):
    ax.set_xlabel(iris['feature_names'][0])
    ax.set_ylabel(iris['feature_names'][1])
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max());
Support vector machines are another class of supervised learning methods. They work by finding the weights that define a set of hyperplanes separating the classes with maximum margin. Like decision trees, they can be used for both classification and regression.
Advantages:
- Effective in high-dimensional spaces, even when the number of features exceeds the number of samples.
- Memory-efficient, since the decision function uses only a subset of the training points (the support vectors).
Disadvantages:
- Does not directly provide probability estimates; these require an expensive internal cross-validation.
- Performance is sensitive to the choice of kernel and regularization parameters.
The heart of an SVC is the kernel; this "mathematical trick" lets the algorithm work efficiently in a high-dimensional feature space by computing only inner products between pairs of mapped points, rather than the full coordinate transformation. The choice of kernel also determines the possible shapes of the discriminating hyperplanes. scikit-learn provides four built-in kernels (linear, polynomial, RBF, and sigmoid), and you can also define your own.
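For intuition, the default RBF kernel depends only on the squared distance between two points, $k(x, x') = \exp(-\gamma \|x - x'\|^2)$. A toy pure-Python version (the gamma value here is arbitrary, chosen for illustration):

```python
import math

def rbf_kernel(x1, x2, gamma=0.5):
    """RBF (Gaussian) kernel: similarity decays with squared distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * sq_dist)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0])  # identical points -> 1.0
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0])   # distance 5 -> near 0
```

Because similarity falls off smoothly with distance, the RBF kernel yields curved, localized decision boundaries rather than straight lines.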
In [13]:
from sklearn import svm
svm_classifier = svm.SVC()
svm_classifier.fit(X,Y)
# Evaluate the classifier at every point in the grid
Z = svm_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
# Reshape the output so that it can be overplotted on our grid
Z = Z.reshape(xx.shape)
In [14]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(12,6))
# Plot the training points
for i in range(n_classes):
    inds = (Y==i)
    ax1.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=plt.cm.Set1,marker=md[i],s=50)
# Plot the classifier with the training points on top
ax2.pcolormesh(xx, yy, Z, cmap=cm)
for i in range(n_classes):
    inds = (Y==i)
    ax2.scatter(X[inds,0],X[inds,1],c=cm(int(i/float(n_classes-1) * 255)),
                cmap=cm,marker=md[i],s=50,edgecolor='k')
# Label the axes and set the limits
for ax in (ax1,ax2):
    ax.set_xlabel(iris['feature_names'][0])
    ax.set_ylabel(iris['feature_names'][1])
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max());
In [15]:
# Just the discrete answer
data = np.array([length,width]).reshape(1,-1)
pred_class = svm_classifier.predict(data)[0]
target_name = iris['target_names'][pred_class]
print("Overall predicted class of the new flower is {0:}.\n".format(target_name))
So what has this shown? Hopefully a practical introduction to the basic theory and implementation of several of the most widely used classification algorithms in machine learning. I've left out several well-known ones that also have good implementations in scikit-learn, including naive Bayes, nearest neighbors, and discriminant analysis. More advanced methods include deep learning techniques such as neural networks.
One big thing this example didn't cover is how to choose between these methods. While the advantages and disadvantages I've listed may initially steer you in certain directions, if you have sufficient data you should try several of them and evaluate their performance. This should be done with a proper train/test split, evaluating accuracy with something like a ROC curve. Beyond that, there's a real need to plot and examine your data (if it's high-dimensional, look at various projections or reduce the dimensionality) and check whether your classifications make sense. This is key to catching problems like overfitting or badly tuned parameters that can skew the prediction you're trying to make.
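The train/test split mentioned above is easy to sketch by hand (in practice, scikit-learn's `train_test_split` does this and more); here is a minimal pure-Python version with a fixed seed for reproducibility:

```python
import random

def train_test_split(rows, labels, test_frac=0.25, seed=0):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(rows) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])

# Toy data: 8 samples with 2 features each
rows = [[i, i + 1] for i in range(8)]
labels = [i % 2 for i in range(8)]
X_tr, y_tr, X_te, y_te = train_test_split(rows, labels)
```

You would fit the classifier on `X_tr`/`y_tr` only, then score it on the held-out `X_te`/`y_te` to get an honest estimate of generalization.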
Massive thanks to the team behind scikit-learn for this outstanding implementation.