Title: Random Forest Classifier Example
Slug: random_forest_classifier_example_scikit
Summary: A random forest classifier example using scikit-learn.
Date: 2016-09-21 12:00
Category: Machine Learning
Tags: Trees And Forests
Authors: Chris Albon
This tutorial is based on Yhat's 2013 tutorial on Random Forests in Python. If you want a good summary of the theory and uses of random forests, I suggest you check out their guide. In the tutorial below, I annotate, correct, and expand on a short code example of random forests they present at the end of the article. Specifically, I 1) update the code so it runs in the latest version of pandas and Python, 2) write detailed comments explaining what is happening in each step, and 3) expand the code in a number of ways.
Let's get started!
The data for this tutorial is famous. Called the iris dataset, it contains four variables measuring various parts of the flowers of three related iris species, plus a fifth variable with the species name. The reason it is so famous in the machine learning and statistics communities is that the data requires very little preprocessing (i.e. there are no missing values, all features are floating point numbers, etc.).
In [1]:
# Load the library with the iris dataset
from sklearn.datasets import load_iris
# Load scikit's random forest classifier library
from sklearn.ensemble import RandomForestClassifier
# Load pandas
import pandas as pd
# Load numpy
import numpy as np
# Set random seed
np.random.seed(0)
In [2]:
# Create an object called iris with the iris data
iris = load_iris()
# Create a dataframe with the four feature variables
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# View the top 5 rows
df.head()
Out[2]:
In [3]:
# Add a new column with the species names; this is what we are going to try to predict
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# View the top 5 rows
df.head()
Out[3]:
In [4]:
# Create a new column that, for each row, generates a random number between 0 and 1, and
# if that value is less than or equal to .75, sets the value of that cell to True
# and False otherwise. This is a quick and dirty way of randomly assigning some rows to
# be used as the training data and some as the test data.
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
# View the top 5 rows
df.head()
Out[4]:
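As an aside, scikit-learn also ships a helper that does this kind of split in one line. The short sketch below is just an illustration of the more standard approach; it produces an exact 75/25 split rather than an approximate one, and the train_df and test_df names are mine, not part of the tutorial's flow.
# A quick sketch of the same idea using scikit-learn's built-in splitter.
# In recent versions of scikit-learn it lives in sklearn.model_selection.
from sklearn.model_selection import train_test_split

# Split the dataframe into an exact 75% training / 25% test split
train_df, test_df = train_test_split(df, test_size=0.25, random_state=0)
print(len(train_df), len(test_df))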
In [5]:
# Create two new dataframes, one with the training rows, one with the test rows
train, test = df[df['is_train']==True], df[df['is_train']==False]
In [6]:
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))
In [7]:
# Create a list of the feature column's names
features = df.columns[:4]
# View features
features
Out[7]:
In [8]:
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
y = pd.factorize(train['species'])[0]
# View target
y
Out[8]:
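If you are curious how factorize did that mapping, it actually returns two things: the integer codes and the unique labels in the order the codes were assigned. A quick illustrative sketch:
# pd.factorize returns a tuple: (integer codes, unique labels in code order)
codes, uniques = pd.factorize(train['species'])
# codes is the same array we stored in y above;
# uniques tells you which species name each digit (0, 1, 2) stands for
uniques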
In [9]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
clf = RandomForestClassifier(n_jobs=2, random_state=0)
# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
clf.fit(train[features], y)
Out[9]:
Huzzah! We have done it! We have officially trained our random forest Classifier! Now let's play with it. The Classifier model itself is stored in the clf variable.
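If you want to poke around inside clf, the fitted forest exposes a few handy attributes. A small illustrative sketch (nothing here changes the model):
# The individual fitted decision trees live in the estimators_ attribute
len(clf.estimators_)
# The hyperparameters the forest was built with (we left most at their defaults)
clf.get_params()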
If you have been following along, you will know we only trained our classifier on part of the data, leaving the rest out. This is, in my humble opinion, the most important part of machine learning. Why? Because by leaving out a portion of the data, we have a set of data to test the accuracy of our model!
Let's do that now.
In [10]:
# Apply the Classifier we trained to the test data (which, remember, it has never seen before)
clf.predict(test[features])
Out[10]:
What are you looking at above? Remember that we coded each of the three species of plant as 0, 1, or 2. The list of numbers above shows which species our model predicts each plant to be, based on the sepal length, sepal width, petal length, and petal width. How confident is the classifier about each plant? We can see that too.
In [11]:
# View the predicted probabilities of the first 10 observations
clf.predict_proba(test[features])[0:10]
Out[11]:
There are three species of plant, thus [ 1. , 0. , 0. ] tells us that the classifier is certain that the plant is the first class. Taking another example, [ 0.9, 0.1, 0. ] tells us that the classifier gives a 90% probability the plant belongs to the first class and a 10% probability the plant belongs to the second class. Because 90% is greater than 10%, the classifier predicts the plant is the first class.
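Under the hood, predict simply picks the class with the highest of these probabilities. Since our classes are coded 0, 1, and 2, the column index of the largest probability is the predicted class, which you can check yourself with a quick sketch:
# For each test row, take the column index (0, 1, or 2) of the largest probability.
# This should match the output of clf.predict(test[features]) above.
np.argmax(clf.predict_proba(test[features]), axis=1)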
Now that we have predicted the species of all plants in the test data, we can compare our predicted species with each plant's actual species.
In [12]:
# Map each predicted plant class back to its actual English species name
preds = iris.target_names[clf.predict(test[features])]
In [13]:
# View the PREDICTED species for the first five observations
preds[0:5]
Out[13]:
In [14]:
# View the ACTUAL species for the first five observations
test['species'].head()
Out[14]:
That looks pretty good! At least for the first five observations. Now let's look at all the data.
A confusion matrix can be, no pun intended, a little confusing to interpret at first, but it is actually very straightforward. The columns are the species we predicted for the test data and the rows are the actual species for the test data. So, if we take the top row, we can see that we predicted all 13 setosa plants in the test data perfectly. However, in the next row, we predicted five of the versicolor plants correctly, but mispredicted two of the versicolor plants as virginica.
The short explanation of how to interpret a confusion matrix is: anything on the diagonal was classified correctly and anything off the diagonal was classified incorrectly.
In [15]:
# Create confusion matrix
pd.crosstab(test['species'], preds, rownames=['Actual Species'], colnames=['Predicted Species'])
Out[15]:
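If you would rather have a single accuracy number (or a plain-array version of the matrix above), scikit-learn's metrics module can give you both. A short sketch comparing the predicted names against the actual names in the test data:
from sklearn.metrics import accuracy_score, confusion_matrix

# Share of test plants whose species we predicted correctly
print(accuracy_score(test['species'], preds))
# The same counts as the crosstab above, as a plain numpy array
confusion_matrix(test['species'], preds)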
While we don't get regression coefficients like with OLS, we do get a score telling us how important each feature was in classifying. This is one of the most powerful parts of random forests, because we can clearly see that petal width was more important in classification than sepal width.
In [16]:
# View a list of the features and their importance scores
list(zip(train[features], clf.feature_importances_))
Out[16]:
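To make that list a little easier to read, you can drop the scores into a pandas Series and sort them, largest first. A quick sketch:
# Feature importances as a pandas Series, sorted from most to least important
pd.Series(clf.feature_importances_, index=features).sort_values(ascending=False)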