Implementing the Random Forest Classifier from scikit-learn

1. Import dataset

This tutorial uses the iris dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) which comes preloaded with sklearn.


In [1]:
#Import dataset
from sklearn.datasets import load_iris
iris = load_iris()

2. Prepare training and testing data

Each flower in this dataset has the following features and label:

  • features - measurements of the flower petals and sepals
  • labels - the flower species (setosa, versicolor, or virginica) represented as a 0, 1, or 2.

Our train_test_split function will separate the data as follows:

  • (features_train, labels_train) - 80% of the data prepared for training
  • (features_test, labels_test) - 20% of the data prepared for making our predictions and evaluating our model

In [2]:
#Import train_test_split
from sklearn.model_selection import train_test_split

In [3]:
features_train, features_test, labels_train, labels_test = train_test_split(iris.data,iris.target,test_size=0.2,random_state=1)

3. Create and fit the Random Forest Classifier

This tutorial uses the RandomForestClassifier model for our predictions, but you can experiment with other classifiers. To do so, import another classifier and replace the relevant code in this section.
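For example, here is a minimal sketch of swapping in a different classifier while keeping the rest of the workflow unchanged (DecisionTreeClassifier is used purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
features_train, features_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1)

# Drop-in replacement for RandomForestClassifier: the fit/predict
# interface is the same across sklearn classifiers.
clf = DecisionTreeClassifier()
clf.fit(features_train, labels_train)
predictions = clf.predict(features_test)
```

Because sklearn classifiers share the same fit/predict interface, the evaluation code later in this tutorial works unchanged with any of them.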


In [4]:
#Import classifier
from sklearn.ensemble import RandomForestClassifier

In [5]:
#Create an instance of the RandomForestClassifier
rfc = RandomForestClassifier()

In [6]:
#Fit our model to the training features and labels
rfc.fit(features_train,labels_train)


Out[6]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

4. Make predictions using the Random Forest Classifier


In [7]:
rfc_predictions = rfc.predict(features_test)

Understanding our predictions

Our predictions will be an array of 0s, 1s, and 2s, indicating which species our model believes each set of measurements represents.


In [8]:
print(rfc_predictions)


[0 1 1 0 2 1 2 0 0 2 1 0 2 1 1 0 1 1 0 0 1 1 2 0 2 1 0 0 1 2]

To interpret this, consider the first set of measurements in features_test:


In [9]:
print(features_test[0])


[ 5.8  4.   1.2  0.2]

Our model believes that these measurements correspond to a setosa iris (label 0).


In [10]:
print(rfc_predictions[0])


0

In this case, our model is correct, since the true label indicates that this was a setosa iris (label 0).


In [11]:
print(labels_test[0])


0

5. Evaluate our model

For this section we will import two metrics from sklearn: confusion_matrix and classification_report. They will help us understand how well our model did.


In [12]:
#Import pandas to create the confusion matrix dataframe
import pandas as pd

#Import classification_report and confusion_matrix to evaluate our model
from sklearn.metrics import classification_report, confusion_matrix

As seen in the confusion matrix below, most predictions are accurate, but our model misclassified one versicolor specimen (predicting virginica instead).


In [13]:
#Create a dataframe with the confusion matrix
confusion_df = pd.DataFrame(confusion_matrix(labels_test, rfc_predictions),
             columns=["Predicted " + name for name in iris.target_names],
             index = iris.target_names)

In [14]:
confusion_df


Out[14]:
           Predicted setosa  Predicted versicolor  Predicted virginica
setosa                   11                     0                    0
versicolor                0                    12                    1
virginica                 0                     0                    6

As seen in the classification report below, our model achieves 97% average precision, recall, and f1-score (29 of 30 test specimens classified correctly).


In [15]:
print(classification_report(labels_test,rfc_predictions))


             precision    recall  f1-score   support

          0       1.00      1.00      1.00        11
          1       1.00      0.92      0.96        13
          2       0.86      1.00      0.92         6

avg / total       0.97      0.97      0.97        30


Note on the RandomForestClassifier from sklearn

Documentation with full explanation of parameters and use: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

Some useful parameters to experiment with:

  • min_samples_leaf (the minimum number of samples allowed in each leaf)
  • n_estimators (the number of decision trees in the forest)
  • max_features (the size of the subset of features to be examined at each split)
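A minimal sketch of setting these parameters (the values below are illustrative starting points, not tuned settings):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    min_samples_leaf=2,    # minimum number of samples allowed in each leaf
    max_features="sqrt",   # size of the feature subset examined at each split
    random_state=1,        # fixed seed for reproducibility
)

iris = load_iris()
rfc.fit(iris.data, iris.target)
```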

An optional feature to take advantage of:

  • oob_score (a way of seeing how well the estimator did by cross-validating on the "out of bag" data, i.e. the samples for each tree that were left out of its bootstrap sample). This would be useful if you didn't want to split your dataset into a training dataset and a test dataset.
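A sketch of using oob_score: the whole dataset is used for fitting, and the out-of-bag estimate takes the place of a held-out test score.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# With oob_score=True, each tree is scored on the samples left out of
# its bootstrap sample, so no separate test split is required.
rfc = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rfc.fit(iris.data, iris.target)

print(rfc.oob_score_)  # out-of-bag accuracy estimate, between 0 and 1
```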

Note on metrics

Check out Wikipedia if confusion matrices are new to you (https://en.wikipedia.org/wiki/Confusion_matrix) or if you want an explanation of precision and recall (https://en.wikipedia.org/wiki/Precision_and_recall).
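To make the connection concrete, here is a small sketch (using made-up toy labels) of computing precision and recall for one class directly from the confusion matrix counts:

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only.
true_labels = [0, 0, 1, 1, 1, 2]
predicted   = [0, 1, 1, 1, 2, 2]
cm = confusion_matrix(true_labels, predicted)

# Rows are true classes, columns are predicted classes.
tp = cm[1, 1]                      # class 1 correctly predicted as class 1
precision = tp / cm[:, 1].sum()    # true positives / everything predicted as class 1
recall    = tp / cm[1, :].sum()    # true positives / everything actually class 1
```

Running classification_report on the same labels would report these same per-class values.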