A single decision tree, tasked to learn a dataset, might not be able to perform well due to outliers and to the breadth and depth of the data's complexity.
So instead of relying on a single tree, random forests rely on a multitude of cleverly grown decision trees.
Each tree within the forest is allowed to become highly specialised in a specific area while still retaining some general knowledge about most areas. When a random forest classifies a sample, it is actually each tree in the forest casting a vote on which label that sample should be assigned.
Instead of sharing the entire dataset with each decision tree, the forest performs an operation that is essentially a train / test split of the training data: each decision tree in the forest randomly samples, with replacement, from the overall training set. In doing so, each tree is trained on its own subspace of the data, and the variation between trees is controlled. This technique is known as tree bagging, or bootstrap aggregating.
In addition to the tree bagging of training samples at the forest level, each individual decision tree further 'feature bags' at each node-branch split. This is helpful because some datasets contain a feature that is highly correlated with the target (the 'y' label). By selecting a random sample of the features at every split, if such a feature exists, it will not show up on as many branches of the tree, and there will be more diversity among the features examined.
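As a minimal sketch of both ideas (an illustration only, not how scikit-learn implements it internally; it assumes X_train and y_train are NumPy arrays with integer class labels), each toy tree below is fit on a bootstrap sample and considers a random subset of the features at every split via max_features='sqrt', and the forest classifies by majority vote:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_toy_forest(X_train, y_train, n_trees=10, seed=0):
    rng = np.random.RandomState(seed)
    n = X_train.shape[0]
    trees = []
    for _ in range(n_trees):
        # tree bagging: draw n rows with replacement (a bootstrap sample)
        idx = rng.randint(0, n, size=n)
        # feature bagging: max_features='sqrt' makes the tree consider a
        # random subset of the features at every node-branch split
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=rng.randint(1 << 30))
        tree.fit(X_train[idx], y_train[idx])
        trees.append(tree)
    return trees

def predict_toy_forest(trees, X_new):
    # each tree casts a vote; the most voted label wins
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)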
Check my post to see more details about Random Forests!
As an example, we will predict human activity by looking at data from wearables.
For this, we train a random forest on the public domain Human Activity Dataset titled "Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements", containing 165,633 data points.
Within the dataset there are five target activities: sitting, sitting down, standing, standing up, and walking.
These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle.
The original dataset can be found on the UCI Machine Learning Repository.
A copy can also be found on GitHub (URL below) and on Kaggle.
In [1]:
import pandas as pd
import time
In [3]:
# Grab the DLA HAR dataset from the links above;
# we assume it is stored in a local datasets folder
#
# Load up the dataset into dataframe 'X'
#
X = pd.read_csv("../datasets/dataset-har-PUC-rio-ugulino.csv", sep=';', low_memory=False)
In [4]:
X.head(2)
Out[4]:
In [5]:
X.describe()
Out[5]:
In [6]:
#
# An easy way to show which rows have NaNs in them:
print (X[pd.isnull(X).any(axis=1)])
Great, no NaNs here. Let's go on.
In [7]:
#
# Encode the gender column: 0 as male, 1 as female
#
X.gender = X.gender.map({'Woman':1, 'Man':0})
#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',','.').astype(float)
X.body_mass_index = X.body_mass_index.str.replace(',','.').astype(float)
#
# Check data types
print (X.dtypes)
In [8]:
# column z4 is type "object". Something is wrong with the dataset.
#
# Convert that column into numeric
# Use errors='raise'. This will alert you if something ends up being
# problematic
#
#
# INFO: an error is raised ... you will see it if you try the conversion
#
# print (X[pd.isnull(X).any(axis=1)])
# 122076 --> z4 = -14420-11-2011 04:50:23.713
#
# !! Data point #122076 is a wrongly coded record;
# change it or drop it before calling the to_numeric method:
#
#X.at[122076, 'z4'] = -144  # change to the correct value
# I keep this row for later and drop it from the dataset
wrongRow = X.loc[122076]
X.drop(X.index[[122076]], inplace=True)
X.z4 = pd.to_numeric(X.z4, errors='raise')
print (X.dtypes)
# everything ok now
In [9]:
# Activity to predict is in "class" column
# Encode 'y' value as a dummies version of dataset's "class" column
#
y = pd.get_dummies(X['class'].copy())
# this produces a 5 column wide dummies dataframe as the y value
#
# Get rid of the user and class columns in X
#
X.drop(['class','user'], axis=1, inplace=True)
print (X.head(2))
In [10]:
print (y.head())
In [11]:
#
# Split data into test / train sets
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=7)
In [12]:
#
# Create an RForest classifier 'model'
#
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0)
You can check the scikit-learn documentation to see all possible parameters.
The ones used here: n_estimators (the number of trees in the forest), max_depth (the maximum depth of each tree) and random_state (the seed of the random number generator, for reproducibility).
Other useful / important ones: max_features (how many features to consider when looking for the best split), criterion (the function that measures the quality of a split) and oob_score (whether to use out-of-bag samples to estimate the generalisation accuracy; we will use it below).
In [13]:
print ("Fitting...")
s = time.time()
model.fit(X_train, y_train)
print("completed in: ", time.time() - s, "seconds")
Note that it takes a much longer time to train a forest than a single decision tree.
This is the score based on the test dataset that we split off earlier. Note how good it is.
In [14]:
print ("Scoring...")
s = time.time()
score = model.score(X_test, y_test)
print ("Score: ", round(score*100, 3))
print ("Scoring completed in: ", time.time() - s)
These are the top 5 features used in the classification.
They are all related to the movement sensors; gender and age are not among them.
In [15]:
# Extract feature importances
fi = (pd.DataFrame({'feature': list(X_train.columns),
                    'importance': model.feature_importances_})
        .sort_values('importance', ascending=False))
# Display
fi.head()
Out[15]:
In [16]:
outputClassPredictionExample = wrongRow['class']
forPredictionExample = wrongRow.drop(labels=['class','user']) # remove class and user
forPredictionExample.z4 = -144 # correct the value
print("We use this example for prediction later:")
print(forPredictionExample)
print("The class shall be: ", outputClassPredictionExample)
In [17]:
model.predict(forPredictionExample.values.reshape(1, -1))
Out[17]:
Remember that these were the categories for the classes:
In [18]:
y_test.iloc[0]
Out[18]:
The fourth one is "standing up". Seems that the model predicted correctly.
Since each tree within the forest is trained on only a subset of the overall training set, the forest ensemble is able to test its own error.
It does this by scoring each tree's predictions against that tree's out-of-bag samples: the training samples that were withheld from that particular tree during training.
One of the advantages of using the out-of-bag (OOB) error is that it eliminates the need to split your data into training / test sets before feeding it into the forest model, since the split is part of the forest algorithm itself. However, the OOB error metric often underestimates the actual performance improvement and the optimal number of training iterations.
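To make the mechanism concrete, here is a rough sketch of how an OOB estimate can be computed (plain NumPy, assuming integer class labels 0..k-1; this is an illustration, not scikit-learn's actual code path): each sample collects votes only from the trees whose bootstrap draw left it out, and the accuracy of those majority votes is the OOB score.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_score_sketch(X, y, n_trees=30, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    n = X.shape[0]
    n_classes = int(y.max()) + 1                   # assumes labels are 0..k-1
    votes = np.zeros((n, n_classes))
    rng = np.random.RandomState(seed)
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)            # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)      # rows this tree never saw
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=rng.randint(1 << 30))
        tree.fit(X[idx], y[idx])
        # a sample only collects votes from trees it was out-of-bag for
        votes[oob, tree.predict(X[oob]).astype(int)] += 1
    seen = votes.sum(axis=1) > 0                   # samples with at least one vote
    return np.mean(votes[seen].argmax(axis=1) == y[seen])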
In [19]:
modelOOB = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0,
                                  oob_score=True)
In [20]:
print ("Fitting...")
s = time.time()
modelOOB.fit(X, y)
print("completed in: ", time.time() - s, "seconds")
The time needed is similar.
Let's check the score:
In [21]:
# Display the OOB Score of data
scoreOOB = modelOOB.oob_score_
print ("OOB Score: ", round(scoreOOB*100, 3))
The out-of-bag estimate is not far from the more precise score computed on the test dataset.
And now we predict the same user's movement. The class output should be "standing up", the fourth one.
In [22]:
modelOOB.predict(forPredictionExample.values.reshape(1, -1))
Out[22]:
Yup!