A single decision tree, tasked to learn a dataset, might not be able to perform well due to outliers and to the breadth and depth of the data's complexity.
So instead of relying on a single tree, random forests rely on a multitude of cleverly grown decision trees.
Each tree within the forest is allowed to become highly specialised in a specific area while still retaining some general knowledge about most areas. When a random forest classifies a sample, it is actually each tree in the forest casting a vote on which label that sample should be assigned.
Instead of sharing the entire dataset with each decision tree, the forest performs an operation that is essentially a train / test split of the training data: each decision tree in the forest randomly samples, with replacement, from the overall training set. In doing so, each tree is trained on its own subspace of the data, and the variation between trees is controlled. This technique is known as tree bagging, or bootstrap aggregating.
In addition to the tree bagging of training samples at the forest level, each individual decision tree further 'feature bags' at each node-branch split. This is helpful because some datasets contain a feature that is highly correlated with the target (the 'y' label). By selecting a random sample of the features at every split, if such a feature exists, it will not show up on as many branches of the tree, and there will be more diversity among the features examined.
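As a minimal sketch of both ideas (an illustration only, not how scikit-learn implements it internally; it assumes X_train and y_train are NumPy arrays with integer class labels), each toy tree below is fit on a bootstrap sample and considers a random subset of the features at every split via max_features='sqrt', and the forest classifies by majority vote:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_toy_forest(X_train, y_train, n_trees=10, seed=0):
    rng = np.random.RandomState(seed)
    n = X_train.shape[0]
    trees = []
    for _ in range(n_trees):
        # tree bagging: draw n rows with replacement (a bootstrap sample)
        idx = rng.randint(0, n, size=n)
        # feature bagging: max_features='sqrt' makes the tree consider a
        # random subset of the features at every node-branch split
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=rng.randint(1 << 30))
        tree.fit(X_train[idx], y_train[idx])
        trees.append(tree)
    return trees

def predict_toy_forest(trees, X_new):
    # each tree casts a vote; the most voted label wins
    votes = np.stack([t.predict(X_new) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)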
Check my post to see more details about Random Forests!
As an example, we will predict human activity by looking at data from wearables.
For this, we train a random forest on the public domain Human Activity Dataset titled "Wearable Computing: Accelerometers' Data Classification of Body Postures and Movements", containing 165,633 data points.
Within the dataset there are five target activities: sitting, sitting down, standing, standing up, and walking.
These activities were captured from four people wearing accelerometers mounted on their waist, left thigh, right arm, and right ankle.
The original dataset can be found on the UCI Machine Learning Repository.
A copy can also be found on GitHub (URL below) and on Kaggle.
In [1]:
import pandas as pd
import time
In [3]:
# Grab the DLA HAR dataset from the links above;
# we assume it is stored in a local datasets folder
#
# Load up the dataset into dataframe 'X'
#
X = pd.read_csv("../datasets/dataset-har-PUC-rio-ugulino.csv", sep=';', low_memory=False)
In [4]:
X.head(2)
Out[4]:
In [5]:
X.describe()
Out[5]:
In [6]:
#
# An easy way to show which rows have NaNs in them:
print (X[pd.isnull(X).any(axis=1)])
Great, no NaNs here. Let's go on.
In [7]:
#
# Encode the gender column: 0 as male, 1 as female
#
X.gender = X.gender.map({'Woman':1, 'Man':0})
#
# Clean up any column with commas in it
# so that they're properly represented as decimals instead
#
X.how_tall_in_meters = X.how_tall_in_meters.str.replace(',','.').astype(float)
X.body_mass_index = X.body_mass_index.str.replace(',','.').astype(float)
#
# Check data types
print (X.dtypes)
In [8]:
# column z4 is type "object". Something is wrong with the dataset.
#
# Convert that column into numeric
# Use errors='raise'. This will alert you if something ends up being
# problematic
#
#
# INFO: an error is raised ... you will see it if you try the conversion
#
# print (X[pd.isnull(X).any(axis=1)])
# 122076 --> z4 = -14420-11-2011 04:50:23.713
#
# !! Data point #122076 is a wrongly coded record;
# change it or drop it before calling the to_numeric method:
#
#X.at[122076, 'z4'] = -144  # change to the correct value
# I keep this row for later and drop it from the dataset
wrongRow = X.loc[122076]
X.drop(X.index[[122076]], inplace=True)
X.z4 = pd.to_numeric(X.z4, errors='raise')
print (X.dtypes)
# everything ok now
In [9]:
# Activity to predict is in "class" column
# Encode 'y' value as a dummies version of dataset's "class" column
#
y = pd.get_dummies(X['class'].copy())
# this produces a 5 column wide dummies dataframe as the y value
#
# Get rid of the user and class columns in X
#
X.drop(['class','user'], axis=1, inplace=True)
print (X.head(2))
In [10]:
print (y.head())
In [11]:
#
# Split data into test / train sets
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=7)
In [12]:
#
# Create an RForest classifier 'model'
#
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0)
You can check the scikit-learn documentation to see all possible parameters.
The ones used here: n_estimators (the number of trees in the forest), max_depth (the maximum depth of each tree) and random_state (the seed of the random number generator, for reproducibility).
Other useful / important ones: max_features (how many features to consider when looking for the best split), criterion (the function that measures the quality of a split) and oob_score (whether to use out-of-bag samples to estimate the generalisation accuracy; we will use it below).
In [13]:
print ("Fitting...")
s = time.time()
model.fit(X_train, y_train)
print("completed in: ", time.time() - s, "seconds")
Note that it takes a much longer time to train a forest than a single decision tree.
This is the score based on the test dataset that we split off earlier. Note how good it is.
In [14]:
print ("Scoring...")
s = time.time()
score = model.score(X_test, y_test)
print ("Score: ", round(score*100, 3))
print ("Scoring completed in: ", time.time() - s)
These are the top 5 features used in the classification.
They are all related to the movement sensors; gender and age are not among them.
In [15]:
# Extract feature importances
fi = (pd.DataFrame({'feature': list(X_train.columns),
                    'importance': model.feature_importances_})
        .sort_values('importance', ascending=False))
# Display
fi.head()
Out[15]:
In [16]:
outputClassPredictionExample = wrongRow['class']
forPredictionExample = wrongRow.drop(labels=['class','user']) # remove class and user
forPredictionExample.z4 = -144 # correct the value
print("We use this example for prediction later:")
print(forPredictionExample)
print("The class shall be: ", outputClassPredictionExample)
In [17]:
model.predict(forPredictionExample.values.reshape(1, -1))
Out[17]:
Remember that these were the categories for the classes:
In [18]:
y_test.iloc[0]
Out[18]:
The fourth one is "standing up". Seems that the model predicted correctly.
Since each tree within the forest is trained on only a subset of the overall training set, the forest ensemble is able to test its own error.
It does this by scoring each tree's predictions against that tree's out-of-bag samples: the training samples that were withheld from that particular tree during training.
One of the advantages of using the out-of-bag (OOB) error is that it eliminates the need to split your data into training / test sets before feeding it into the forest model, since the split is part of the forest algorithm itself. However, the OOB error metric often underestimates the actual performance improvement and the optimal number of training iterations.
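To make the mechanism concrete, here is a rough sketch of how an OOB estimate can be computed (plain NumPy, assuming integer class labels 0..k-1; this is an illustration, not scikit-learn's actual code path): each sample collects votes only from the trees whose bootstrap draw left it out, and the accuracy of those majority votes is the OOB score.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def oob_score_sketch(X, y, n_trees=30, seed=0):
    X, y = np.asarray(X), np.asarray(y)
    n = X.shape[0]
    n_classes = int(y.max()) + 1                   # assumes labels are 0..k-1
    votes = np.zeros((n, n_classes))
    rng = np.random.RandomState(seed)
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=n)            # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)      # rows this tree never saw
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=rng.randint(1 << 30))
        tree.fit(X[idx], y[idx])
        # a sample only collects votes from trees it was out-of-bag for
        votes[oob, tree.predict(X[oob]).astype(int)] += 1
    seen = votes.sum(axis=1) > 0                   # samples with at least one vote
    return np.mean(votes[seen].argmax(axis=1) == y[seen])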
In [19]:
modelOOB = RandomForestClassifier(n_estimators=30, max_depth=20, random_state=0,
                                  oob_score=True)
In [20]:
print ("Fitting...")
s = time.time()
modelOOB.fit(X, y)
print("completed in: ", time.time() - s, "seconds")
The time needed is similar.
Let's check the score:
In [21]:
# Display the OOB Score of data
scoreOOB = modelOOB.oob_score_
print ("OOB Score: ", round(scoreOOB*100, 3))
The out-of-bag estimate is not far from the more precise score computed on the test dataset.
And now we predict the same user's movement. The class output should be "standing up", the fourth one.
In [22]:
modelOOB.predict(forPredictionExample.values.reshape(1, -1))
Out[22]:
Yup!