Supervised Machine Learning - scikit learn

The example uses the Iris Dataset. (The Iris dataset section is adatped from an example from Analyics Vidhya)

https://en.wikipedia.org/wiki/Iris_flower_data_set


In [ ]:
import numpy as np
import matplotlib as mp
from sklearn import datasets
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

In [ ]:
# Load the sample data set from the datasets module
dataset = datasets.load_iris()

In [ ]:
# Display the data in the test dataset
dataset

In [ ]:
# Species of Iris in the dataset
dataset['target_names']

Iris Setosa

Iris Versicolor

Iris Virginica


In [ ]:
# Names of the type of information recorded about an Iris - called features
dataset['feature_names']

In [ ]:
# First 10 sets of Iris data
dataset['data'][:10]

In [ ]:
# The classification of each of the first 10 sets of Iris data - the target
dataset['target'][:10]

Here 0 equates to setosa the first entry in the 'target_names' array


In [ ]:
# Now we create our model
model = LogisticRegression()
# We train it by passing in the test data and the actual results
model.fit(dataset.data, dataset.target)

In [ ]:
# We use the model to create predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# Using the metrics module we see the results of the model
metrics.accuracy_score(expected, predicted, normalize=True, sample_weight=None)

Digging deeper using metrics

Accuracy score, Classification report & Confusion matix

Here we will use a simple example to show metrics you can use: accuracy, classification reports and confusion matrices.

  • y_true is the test data
  • y_pred is the prediction

In [ ]:
y_true = ["cat", "ant", "cat", "cat", "ant", "bird", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat", "bird"]

In [ ]:
metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None)

5 correct predictions out of 7 values. 71% accuracy


In [ ]:
print(metrics.classification_report(y_true, y_pred,
    target_names=["ant", "bird", "cat"]))

Here we can see that the predictions:

  • precision = $2/3 = 0.67$ (2 ants in test data and matched but found an extra 1 in prediction).
  • recall = $2/2 = 1$ (2 ants in test data and these matched in prediction).
  • f1-score = $(0.67 + 1) / 2 = 0.8$ mean of precision and recall.
  • support shows that there are 2 ants, 2 birds and 3 cats in the test data.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html


In [ ]:
metrics.confusion_matrix(y_true, y_pred)

In the confusion_matrix the labels give the order of the rows.

  • ant was correctly categorised twice and was never miss categorised
  • bird was correctly categorised once and was categorised as cat once
  • cat was correctly categorised twice and was categorised as an ant once

Back to Iris predictions


In [ ]:
print(metrics.classification_report(expected, predicted,target_names=dataset['target_names']))

In [ ]:
print (metrics.confusion_matrix(expected, predicted))

In the confusion_matrix the labels give the order of the rows.

  • setosa was correctly and was never miss categorised
  • versicolor was correctly categorised 45 times and was categorised as virginica 5 times
  • virginica was correctly categorised 49 times and was categorised as versicolor once