This tutorial uses the iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set), which comes bundled with sklearn.
In [1]:
#Import dataset
from sklearn.datasets import load_iris
iris = load_iris()
Each flower in this dataset has the following features and labels:
Our train_test_split function will separate the data as follows:
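As a quick way to see the features and labels for yourself, the short sketch below prints the names stored on the dataset object. It relies only on the `feature_names`, `target_names`, and `data` attributes of the object returned by `load_iris`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The four measurements recorded for every flower (the features)
print(iris.feature_names)

# The three species the model will learn to predict (the labels)
print(iris.target_names)

# Number of flowers and number of features per flower
print(iris.data.shape)
```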
In [2]:
#Import train_test_split
from sklearn.model_selection import train_test_split
In [3]:
#Hold out 20% of the data for testing; random_state makes the split reproducible
features_train, features_test, labels_train, labels_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=1)
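To check what the split produced, a minimal sketch like the one below prints the shape of each piece. With `test_size=0.2` on 150 flowers, 30 rows go to the test set and the remaining 120 to the training set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Features are 2-D (rows, columns); labels are 1-D (one label per row)
print(features_train.shape, labels_train.shape)
print(features_test.shape, labels_test.shape)
```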
This tutorial uses the RandomForestClassifier model for our predictions, but you can experiment with other classifiers. To do so, import another classifier and replace the relevant code in this section.
In [4]:
#Import classifier
from sklearn.ensemble import RandomForestClassifier
In [5]:
#Create an instance of the RandomForestClassifier
rfc = RandomForestClassifier()
In [6]:
#Fit our model to the training features and labels
rfc.fit(features_train, labels_train)
Out[6]:
In [7]:
#Predict a label for each set of measurements in the test features
rfc_predictions = rfc.predict(features_test)
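To illustrate swapping in another classifier, as suggested at the start of this section, here is a hedged sketch using `KNeighborsClassifier`. The choice of this classifier and the variable names `knn` and `knn_predictions` are ours for illustration, not part of the original tutorial; any sklearn classifier with `fit` and `predict` should slot in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Same three steps as before: create the model, fit it, predict
knn = KNeighborsClassifier()
knn.fit(features_train, labels_train)
knn_predictions = knn.predict(features_test)

print(knn_predictions)
```

Only the import and the line that creates the model change; the fitting and prediction calls stay the same.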
Understanding our predictions
Our predictions will be an array of 0s, 1s, and 2s, depending on which species of iris our model believes each set of measurements represents.
In [8]:
print(rfc_predictions)
To interpret this, consider the first set of measurements in features_test:
In [9]:
print(features_test[0])
Our model believes that these measurements correspond to a setosa iris (label 0).
In [10]:
print(rfc_predictions[0])
In this case, our model is correct, since the true label indicates that this was a setosa iris (label 0).
In [11]:
print(labels_test[0])
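The same comparison can be made for the whole test set at once. The sketch below translates the numeric labels into species names by indexing `iris.target_names`, then lists the positions where prediction and true label disagree. It fixes `random_state=1` on the classifier so that reruns give the same result; that seed is an assumption added here, and the tutorial's own model is created without one, so its results may differ slightly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Assumption: a fixed seed, used only to make this example repeatable
rfc = RandomForestClassifier(random_state=1)
rfc.fit(features_train, labels_train)
rfc_predictions = rfc.predict(features_test)

# Indexing target_names with an array of labels gives an array of names
predicted_names = iris.target_names[rfc_predictions]
true_names = iris.target_names[labels_test]
print(predicted_names[:5])
print(true_names[:5])

# Positions in the test set where the model was wrong
mismatches = np.flatnonzero(rfc_predictions != labels_test)
print("Misclassified test indices:", mismatches)
```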
For this section we will import two metrics from sklearn: confusion_matrix and classification_report. They will help us understand how well our model did.
In [12]:
#Import pandas to create the confusion matrix dataframe
import pandas as pd
#Import classification_report and confusion_matrix to evaluate our model
from sklearn.metrics import classification_report, confusion_matrix
As seen in the confusion matrix below, most predictions are accurate, but our model misclassified one versicolor specimen (it predicted virginica).
In [13]:
#Create a dataframe with the confusion matrix (rows: true species, columns: predicted species)
confusion_df = pd.DataFrame(
    confusion_matrix(labels_test, rfc_predictions),
    columns=["Predicted " + name for name in iris.target_names],
    index=iris.target_names,
)
In [14]:
confusion_df
Out[14]:
As the classification report below shows, our model achieves about 97% precision, recall, and accuracy.
In [15]:
print(classification_report(labels_test, rfc_predictions))
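The accuracy figure in the report is simply the fraction of test flowers that were labelled correctly. As a sketch, the code below computes it two ways: with `accuracy_score` from `sklearn.metrics`, and by hand as the mean of the element-wise comparison. The classifier's `random_state=1` is an assumption added for repeatability, so the exact number may differ from the report above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1
)

# Assumption: a fixed seed, used only to make this example repeatable
rfc = RandomForestClassifier(random_state=1)
rfc.fit(features_train, labels_train)
rfc_predictions = rfc.predict(features_test)

# Accuracy via sklearn
accuracy = accuracy_score(labels_test, rfc_predictions)

# Accuracy by hand: the share of predictions that match the true labels
manual_accuracy = np.mean(rfc_predictions == labels_test)

print(accuracy, manual_accuracy)
```

With 30 test flowers, a single misclassification corresponds to 29/30, or roughly 97%.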
Documentation with full explanation of parameters and use: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.
Some useful parameters to experiment with:
An optional feature to take advantage of:
See Wikipedia if confusion matrices are new to you (https://en.wikipedia.org/wiki/Confusion_matrix) or if you want an explanation of precision and recall (https://en.wikipedia.org/wiki/Precision_and_recall).
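To connect the two ideas, here is a small sketch of how per-class precision and recall can be read off a confusion matrix. The counts in `example_matrix` are hypothetical, chosen only to mirror the situation described earlier (one versicolor predicted as virginica); they are not the tutorial's actual output. Rows are true species and columns are predicted species, matching sklearn's `confusion_matrix` layout.

```python
import numpy as np

# Hypothetical confusion matrix (rows: true, columns: predicted)
# Class order: setosa, versicolor, virginica
example_matrix = np.array([
    [10, 0, 0],
    [0, 9, 1],
    [0, 0, 10],
])

# Correct predictions for each class sit on the diagonal
correct = np.diag(example_matrix)

# Precision: of everything predicted as a class, how much was right (column sums)
precision = correct / example_matrix.sum(axis=0)

# Recall: of everything truly in a class, how much was found (row sums)
recall = correct / example_matrix.sum(axis=1)

print("precision:", precision)
print("recall:", recall)
```

In this made-up example the misclassified flower lowers recall for versicolor (9 of 10 found) and precision for virginica (10 of 11 predictions correct), while setosa is unaffected.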