Classification problems are a broad category of machine learning problems in which the goal is to predict a value from a discrete, finite set of classes.
In this example, we'll build a classifier to predict which species a flower belongs to.
In [1]:
import pandas as pd
iris = pd.read_csv('../datasets/iris.csv')
In [2]:
# Print some info about the dataset
iris.info()
In [3]:
iris['Class'].unique()
Out[3]:
In [4]:
iris.describe()
Out[4]:
In [15]:
# Create a scatterplot for sepal length and sepal width
import matplotlib.pyplot as plt
%matplotlib inline
sl = iris['Sepal_length']
sw = iris['Sepal_width']
# Create a scatterplot of these two properties using plt.scatter()
# Assign different colors to each data point according to the class it belongs to
plt.scatter(sl[iris['Class'] == 'Iris-setosa'], sw[iris['Class'] == 'Iris-setosa'], color='red')
plt.scatter(sl[iris['Class'] == 'Iris-versicolor'], sw[iris['Class'] == 'Iris-versicolor'], color='green')
plt.scatter(sl[iris['Class'] == 'Iris-virginica'], sw[iris['Class'] == 'Iris-virginica'], color='blue')
# Specify labels for the X and Y axis
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
# Show graph
plt.show()
In [16]:
# Create a scatterplot for petal length and petal width
pl = iris['Petal_length']
pw = iris['Petal_width']
# Create a scatterplot of these two properties using plt.scatter()
# Assign different colors to each data point according to the class it belongs to
plt.scatter(pl[iris['Class'] == 'Iris-setosa'], pw[iris['Class'] == 'Iris-setosa'], color='red')
plt.scatter(pl[iris['Class'] == 'Iris-versicolor'], pw[iris['Class'] == 'Iris-versicolor'], color='green')
plt.scatter(pl[iris['Class'] == 'Iris-virginica'], pw[iris['Class'] == 'Iris-virginica'], color='blue')
# Specify labels for the X and Y axis
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
# Show graph
plt.show()
We'll use scikit-learn's LogisticRegression to build our classifier.
In [9]:
X = iris.drop('Class', axis=1)
t = iris['Class'].values
RANDOM_STATE = 4321
# Use sklearn's train_test_split() method to split our data into a training set and a test set.
from sklearn.model_selection import train_test_split
Xtr, Xts, ytr, yts = train_test_split(X, t, random_state=RANDOM_STATE)
In [10]:
# Use the training set to build a LogisticRegression model
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(max_iter=1000).fit(Xtr, ytr) # Fit a logistic regression model (max_iter raised so the solver converges)
In [11]:
# Use the LogisticRegression's score() method to assess the model accuracy in the training set
lr.score(Xtr, ytr)
Out[11]:
In [12]:
# Use the LogisticRegression's score() method to assess the model accuracy in the test set
lr.score(Xts, yts)
Out[12]:
Scores like the one calculated above are usually not what we want to assess: score() only returns the mean accuracy of the predictions against the actual classes in the dataset we pass to it.
Consider what happens, for instance, when you're training a model to classify whether someone has a disease and 99% of the people don't have it. What can go wrong if you use a score like the one above to evaluate your model? Hint: what would be the score of a classifier that always predicts "no disease" in this case?
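The pitfall in the hint can be made concrete with a quick sketch (synthetic data, not part of the original notebook): a "classifier" that always predicts "healthy" on a 99%-healthy population scores 99% accuracy while missing every sick patient.

```python
import numpy as np

y_true = np.array([0] * 99 + [1])   # 99 healthy (0), 1 sick (1)
y_pred = np.zeros(100, dtype=int)   # a degenerate model: always predict "healthy"

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.99 -- looks excellent, yet the one sick patient is missed
```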
Simple accuracy scores are usually not recommended for classification problems. There are at least three different metrics that are commonly used, depending on the context:

- Precision: of the examples predicted as belonging to a class, the fraction that actually belong to it.
- Recall: of the examples that actually belong to a class, the fraction that were predicted as belonging to it.
- F1-score: the harmonic mean of precision and recall.
Some other common evaluation methods for classification models include ROC chart analysis and the related concept of Area Under Curve (AUC).
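As a hedged sketch of AUC on a binary problem (the iris task is multiclass, so a tiny synthetic example is used here instead):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities for the positive class

auc = roc_auc_score(y_true, y_scores)
print(auc)  # 0.75
```

An AUC of 1.0 means the model ranks every positive example above every negative one; 0.5 is no better than chance.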
What metric would you prioritise in the case of the disease classifier described before? What are the costs of false positives and false negatives in this case?
In [13]:
# scikit-learn provides a function called "classification_report" that summarizes the three metrics above
# for a given classification model on a dataset.
from sklearn.metrics import classification_report
# Use this function to print a classification metrics report for the trained classifier.
# See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
print(classification_report(yts, lr.predict(Xts)))
Another useful technique to inspect the results given by a classification model is to take a look at its confusion matrix. This is a K x K matrix (where K is the number of distinct classes identified by the classifier) that gives us, in position (i, j), how many examples belonging to class i were classified as belonging to class j.
That can give us insights on which classes may require more attention.
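Before applying it to iris, here is a tiny sketch of how to read a confusion matrix, using hypothetical 2-class data (not the notebook's results):

```python
from sklearn.metrics import confusion_matrix

y_true = ['cat', 'cat', 'dog', 'dog', 'dog']
y_pred = ['cat', 'dog', 'dog', 'dog', 'cat']

cm = confusion_matrix(y_true, y_pred, labels=['cat', 'dog'])
print(cm)
# [[1 1]    row 0: of 2 actual cats, 1 was predicted cat, 1 predicted dog
#  [1 2]]   row 1: of 3 actual dogs, 1 was predicted cat, 2 predicted dog
```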
In [14]:
from sklearn.metrics import confusion_matrix
# Use scikit-learn's confusion_matrix to understand which classes were misclassified.
# See http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
confusion_matrix(yts, lr.predict(Xts))
Out[14]:
In the example above, what would you investigate? Which classes is the classifier having difficulty discriminating?
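One way to dig into that question is to normalize each row of the matrix, so entry (i, j) becomes the fraction of class-i examples predicted as class j. A sketch with hypothetical 3x3 counts (the same shape as the iris matrix, but not the notebook's actual output):

```python
import numpy as np

cm = np.array([[13, 0, 0],
               [0, 15, 1],
               [0, 2, 7]])  # hypothetical counts, not the notebook's output

# Divide each row by its total so rows sum to 1.
row_norm = cm / cm.sum(axis=1, keepdims=True)
print(np.round(row_norm, 2))
```

A perfectly separated class shows 1.0 on the diagonal; off-diagonal mass in rows 1 and 2 here would suggest the last two classes are the ones being confused.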