Random Forest

A single decision tree, tasked to learn a dataset, might not perform well due to outliers and the breadth and depth of the data's complexity.
So instead of relying on a single tree, random forests rely on a multitude of cleverly grown decision trees.
Each tree within the forest is allowed to become highly specialised in a specific area while still retaining some general knowledge about most areas. When a random forest classifies, each tree in the forest casts a vote on the label it thinks a specific sample should be assigned.

Instead of sharing the entire dataset with each decision tree, the forest performs an operation that is essentially a train / test split of the training data. Each decision tree in the forest randomly samples, with replacement, from the overall training data set. By doing so, each tree exists in an independent subspace and the variation between trees is controlled. This technique is known as tree bagging, or bootstrap aggregating.
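
As a rough illustration of tree bagging (a toy sketch, not the forest's internal code), each tree's training set can be thought of as a bootstrap sample drawn with replacement from the training data:

import pandas as pd

# Toy training data (hypothetical values)
train = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                      'label': [0, 1, 0, 1, 0, 1]})

# One bootstrap sample per tree: draw len(train) rows with replacement,
# so some rows repeat and others are left out (the "out-of-bag" samples)
for i in range(3):
    bootstrap = train.sample(n=len(train), replace=True, random_state=i)
    oob = train.index.difference(bootstrap.index)
    print("Tree", i, "bootstrap rows:", sorted(bootstrap.index), "out-of-bag rows:", list(oob))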

In addition to the tree bagging of training samples at the forest level, each individual decision tree further 'feature bags' at each node-branch split. This is helpful because some datasets contain a feature that is strongly correlated with the target (the 'y' label). By selecting a random subset of features at every split, such a feature, if it exists, won't show up on as many branches of the tree, and there is more diversity among the features examined.
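
Feature bagging can be sketched in a similarly simplified way: at every split only a random subset of the features is considered as split candidates (in scikit-learn this is controlled by the max_features parameter; the feature names below are illustrative):

import numpy as np

# Hypothetical feature names, similar to the accelerometer columns used later
features = np.array(['x1', 'y1', 'z1', 'x2', 'y2', 'z2', 'x3', 'y3', 'z3'])

rng = np.random.default_rng(42)
n_candidates = int(np.sqrt(len(features)))  # a common choice: sqrt of the number of features

# Each split considers a different random subset of candidate features
for split in range(3):
    candidates = rng.choice(features, size=n_candidates, replace=False)
    print("Split", split, "candidate features:", list(candidates))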

Check my post to see more details about Random Forests!

Human activity prediction

As an example, we will predict human activity by looking at data from wearables.
For this, we train a random forest against a public domain Human Activity Dataset titled Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements, containing 165633 data points.

Within the dataset, there are five target activities:

  • Sitting
  • Sitting Down
  • Standing
  • Standing Up
  • Walking

These activities were captured from 30 people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle.

Read the data

The original dataset can be found on the UCI Machine Learning Repository.

A copy can also be found here on GitHub (URL is below) and on Kaggle.


In [1]:
import pandas as pd
import time

In [3]:
# Grab the DLA HAR dataset from the links above
# we assume it is stored in a datasets folder


#
# Load up the dataset into dataframe 'X'
#
X = pd.read_csv("../datasets/dataset-har-PUC-rio-ugulino.csv", sep=';', low_memory=False)

In [4]:
X.head(2)


Out[4]:
user gender age how_tall_in_meters weight body_mass_index x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 class
0 debora Woman 46 1,62 75 28,6 -3 92 -63 -23 18 -19 5 104 -92 -150 -103 -147 sitting
1 debora Woman 46 1,62 75 28,6 -3 94 -64 -21 18 -18 -14 104 -90 -149 -104 -145 sitting

In [5]:
X.describe()


Out[5]:
age weight x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4
count 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000 165633.000000
mean 38.265146 70.819408 -6.649327 88.293667 -93.164611 -87.827504 -52.065047 -175.055200 17.423515 104.517167 -93.881726 -167.641448 -92.625171
std 13.184091 11.296527 11.616238 23.895829 39.409423 169.435194 205.159763 192.816615 52.635388 54.155843 45.389646 38.311342 19.968610
min 28.000000 55.000000 -306.000000 -271.000000 -603.000000 -494.000000 -517.000000 -617.000000 -499.000000 -506.000000 -613.000000 -702.000000 -526.000000
25% 28.000000 55.000000 -12.000000 78.000000 -120.000000 -35.000000 -29.000000 -141.000000 9.000000 95.000000 -103.000000 -190.000000 -103.000000
50% 31.000000 75.000000 -6.000000 94.000000 -98.000000 -9.000000 27.000000 -118.000000 22.000000 107.000000 -90.000000 -168.000000 -91.000000
75% 46.000000 83.000000 0.000000 101.000000 -64.000000 4.000000 86.000000 -29.000000 34.000000 120.000000 -80.000000 -153.000000 -80.000000
max 75.000000 83.000000 509.000000 533.000000 411.000000 473.000000 295.000000 122.000000 507.000000 517.000000 410.000000 -13.000000 86.000000

Pre-processing the data

What we want to do is to predict the activity class based on the accelerometer's data from the wearables.


In [6]:
#
# An easy way to show which rows have NaNs in them:
print (X[pd.isnull(X).any(axis=1)])


Empty DataFrame
Columns: [user, gender, age, how_tall_in_meters, weight, body_mass_index, x1, y1, z1, x2, y2, z2, x3, y3, z3, x4, y4, z4, class]
Index: []

Great, no NaNs here. Let's go on.


In [7]:
#
# Encode the gender column: 0 as male, 1 as female
#
X.gender  = X.gender.map({'Woman':1, 'Man':0})

#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',','.').astype(float)
X.body_mass_index = X.body_mass_index.str.replace(',','.').astype(float)

#
# Check data types
print (X.dtypes)


user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                     object
class                  object
dtype: object

In [8]:
# column z4 is type "object". Something is wrong with the dataset.

#
# Convert that column into numeric
# Use errors='raise'. This will alert you if something ends up being
# problematic
#

#
# INFO: an error is raised ... you will find it if you try the method below
#
# print (X[pd.isnull(X).any(axis=1)])
# 122076 --> z4 =    -14420-11-2011 04:50:23.713
#
# !! The data point #122076 is a wrongly coded record:
# change it or drop it before calling the to_numeric method
#
# X.at[122076, 'z4'] = -144  # change to the correct value

# I keep this value for later and drop it from the dataset
wrongRow = X.loc[122076]
X.drop(X.index[[122076]], inplace=True)

X.z4 = pd.to_numeric(X.z4, errors='raise')

print (X.dtypes)
# everything ok now


user                   object
gender                  int64
age                     int64
how_tall_in_meters    float64
weight                  int64
body_mass_index       float64
x1                      int64
y1                      int64
z1                      int64
x2                      int64
y2                      int64
z2                      int64
x3                      int64
y3                      int64
z3                      int64
x4                      int64
y4                      int64
z4                      int64
class                  object
dtype: object

Extract the target values


In [9]:
# Activity to predict is in "class" column
# Encode 'y' value as a dummies version of dataset's "class" column
#
y = pd.get_dummies(X['class'].copy())
# this produces a 5 column wide dummies dataframe as the y value

#
# Get rid of the user and class columns in X
#
X.drop(['class','user'], axis=1, inplace=True) 


print (X.head(2))


   gender  age  how_tall_in_meters  weight  body_mass_index  x1  y1  z1  x2  \
0       1   46                1.62      75             28.6  -3  92 -63 -23   
1       1   46                1.62      75             28.6  -3  94 -64 -21   

   y2  z2  x3   y3  z3   x4   y4   z4  
0  18 -19   5  104 -92 -150 -103 -147  
1  18 -18 -14  104 -90 -149 -104 -145  

In [10]:
print (y.head())


   sitting  sittingdown  standing  standingup  walking
0        1            0         0           0        0
1        1            0         0           0        0
2        1            0         0           0        0
3        1            0         0           0        0
4        1            0         0           0        0

Split the dataset into training and test


In [11]:
# 
# Split  data into test / train sets
#
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=7)

Train the Random Forest model


In [12]:
#
# Create a Random Forest classifier 'model'
#
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=30, max_depth= 20, random_state=0)

You can check the scikit-learn documentation to see all possible parameters; a short configuration sketch follows the lists below.

The ones used here:

  • n_estimators: integer, optional (default=100) The number of trees in the forest. Note that this default changed from 10 to 100 (following the progress in computing performance and memory).
  • max_depth: integer or None, optional (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. Setting a limit helps with the computing time and memory needed. Not setting a max depth leads to unpruned, fully grown trees which, depending on the dataset, can require a large memory footprint.
  • oob_score: bool (default=False) Whether to use out-of-bag samples to estimate the generalization accuracy.
  • random_state: int, RandomState instance or None, optional (default=None) Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider

And other useful / important ones:

  • criterion: string, optional (default=”gini”) The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Same as for decision trees.
  • bootstrap: boolean, optional (default=True) Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
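
Putting a few of these parameters together, a classifier could for example be configured like this (an illustrative sketch only, not the model trained below):

from sklearn.ensemble import RandomForestClassifier

exampleModel = RandomForestClassifier(n_estimators=100,  # number of trees in the forest
                                      max_depth=None,    # grow each tree until its leaves are pure
                                      criterion='gini',  # split quality measure
                                      bootstrap=True,    # sample the training data with replacement
                                      oob_score=True,    # estimate accuracy on out-of-bag samples
                                      random_state=0)    # reproducible randomness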

In [13]:
print ("Fitting...")
s = time.time()

model.fit(X_train, y_train)

print("completed in: ", time.time() - s, "seconds")


Fitting...
completed in:  29.361350059509277 seconds

Note that it takes much longer to train a forest than a single decision tree.
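
If you want to verify this, a single decision tree can be timed on the same training split (a minimal sketch for comparison; the tree and its parameters are illustrative only):

from sklearn.tree import DecisionTreeClassifier

singleTree = DecisionTreeClassifier(max_depth=20, random_state=0)

s = time.time()
singleTree.fit(X_train, y_train)
print("Single tree fitted in: ", time.time() - s, "seconds")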

Now we compute the score on the test dataset that we split off earlier. Note how good it is.


In [14]:
print ("Scoring...")
s = time.time()

score = model.score(X_test, y_test)

print ("Score: ", round(score*100, 3))
print ("Scoring completed in: ", time.time() - s)


Scoring...
Score:  99.286
Scoring completed in:  1.3961372375488281

These are the top 5 features used in the classification.
They are all related to the movements; neither gender nor age appears.


In [15]:
# Extract feature importances
fi = pd.DataFrame({'feature': list(X_train.columns),
                   'importance': model.feature_importances_}).\
                    sort_values('importance', ascending = False)

# Display
fi.head()


Out[15]:
feature importance
7 z1 0.174239
12 y3 0.126232
10 z2 0.114018
9 y2 0.110663
6 y1 0.074385

Example prediction

Let's use the wrong row that we extracted earlier from the dataset as a prediction example. But first we need to correct it:


In [16]:
outputClassPredictionExample = wrongRow['class']

forPredictionExample = wrongRow.drop(labels=['class','user']) # remove class and user
forPredictionExample.z4 = -144 # correct the value

print("We use this example for prediction later:")
print(forPredictionExample)
print("The class shall be: ", outputClassPredictionExample)


We use this example for prediction later:
gender                   0
age                     75
how_tall_in_meters    1.67
weight                  67
body_mass_index         24
x1                      -8
y1                     101
z1                    -120
x2                     -13
y2                      91
z2                    -101
x3                      17
y3                     123
z3                    -108
x4                    -207
y4                     -82
z4                    -144
Name: 122076, dtype: object
The class shall be:  standingup

In [17]:
model.predict(forPredictionExample.values.reshape(1, -1))


Out[17]:
array([[0, 0, 0, 1, 0]], dtype=uint8)

Remember that these were the categories for the classes:


In [18]:
y_test.iloc[0]


Out[18]:
sitting        0
sittingdown    0
standing       0
standingup     0
walking        1
Name: 130833, dtype: uint8

The fourth one is "standing up", so the model predicted correctly.
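
To read a one-hot prediction without counting positions by hand, the predicted column can be mapped back to its class name; a small sketch using the dummies columns of y:

import numpy as np

pred = model.predict(forPredictionExample.values.reshape(1, -1))
# The columns of the dummies dataframe y are in the same order as the prediction vector
print("Predicted class: ", y.columns[np.argmax(pred[0])])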

Out-of-bag error instead of splitting into train and test

Since each tree within the forest is only trained on a subset of the overall training set, the forest ensemble has the ability to error-test itself.
It does this by scoring each tree's predictions against that tree's out-of-bag samples. A tree's out-of-bag samples are those forest training samples that were withheld from that specific tree during training.

One of the advantages of using the out-of-bag (OOB) error is that it eliminates the need to split your data into a training / testing set before feeding it into the forest model, since that's part of the forest algorithm. However, the OOB error metric often underestimates the actual performance improvement and the optimal number of training iterations.


In [19]:
modelOOB = RandomForestClassifier(n_estimators=30, max_depth= 20, random_state=0, 
                               oob_score=True)

In [20]:
print ("Fitting...")
s = time.time()

modelOOB.fit(X, y)

print("completed in: ", time.time() - s, "seconds")


Fitting...
completed in:  22.724381923675537 seconds

The time needed is similar.
Let's check the score:


In [21]:
# Display the OOB Score of data
scoreOOB = modelOOB.oob_score_
print ("OOB Score: ", round(scoreOOB*100, 3))


OOB Score:  99.753

The out-of-bag estimation is not far away from the more precise score estimated from the test dataset.

And now we predict the same sample's movement. The class output should be "standing up", the fourth one:


In [22]:
modelOOB.predict(forPredictionExample.values.reshape(1, -1))


Out[22]:
array([[0, 0, 0, 1, 0]], dtype=uint8)

Yup!