Random Forest classifier example


In [156]:
from sklearn.ensemble import RandomForestClassifier

Watch a video showing the types of activity that were recorded


In [157]:
from IPython.display import YouTubeVideo, HTML
YouTubeVideo("XOEN9W05_4A")


Out[157]:
(embedded YouTube video: https://www.youtube.com/watch?v=XOEN9W05_4A)
In [158]:
#The Donald Bren School of Information and Computer Sciences - University of California, Irvine

info_file = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.names'
data_zip = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00240/UCI%20HAR%20Dataset.zip'

In [159]:
!curl $info_file


===================================================================================================
Human Activity Recognition Using Smartphones Dataset
Version 1.0
===================================================================================================
Jorge L. Reyes-Ortiz(1,2), Davide Anguita(1), Alessandro Ghio(1), Luca Oneto(1) and Xavier Parra(2)
1 - Smartlab - Non-Linear Complex Systems Laboratory
DITEN - Università degli Studi di Genova, Genoa (I-16145), Italy. 
2 - CETpD - Technical Research Centre for Dependency Care and Autonomous Living
Universitat Politècnica de Catalunya (BarcelonaTech). Vilanova i la Geltrú (08800), Spain
activityrecognition '@' smartlab.ws 
===================================================================================================

The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data. 

The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain. See 'features_info.txt' for more details. 

For each record it is provided:
======================================

- Triaxial acceleration from the accelerometer (total acceleration) and the estimated body acceleration.
- Triaxial Angular velocity from the gyroscope. 
- A 561-feature vector with time and frequency domain variables. 
- Its activity label. 
- An identifier of the subject who carried out the experiment.

The dataset includes the following files:
=========================================

- 'README.txt'

- 'features_info.txt': Shows information about the variables used on the feature vector.

- 'features.txt': List of all features.

- 'activity_labels.txt': Links the class labels with their activity name.

- 'train/X_train.txt': Training set.

- 'train/y_train.txt': Training labels.

- 'test/X_test.txt': Test set.

- 'test/y_test.txt': Test labels.

The following files are available for the train and test data. Their descriptions are equivalent. 

- 'train/subject_train.txt': Each row identifies the subject who performed the activity for each window sample. Its range is from 1 to 30. 

- 'train/Inertial Signals/total_acc_x_train.txt': The acceleration signal from the smartphone accelerometer X axis in standard gravity units 'g'. Every row shows a 128 element vector. The same description applies for the 'total_acc_x_train.txt' and 'total_acc_z_train.txt' files for the Y and Z axis. 

- 'train/Inertial Signals/body_acc_x_train.txt': The body acceleration signal obtained by subtracting the gravity from the total acceleration. 

- 'train/Inertial Signals/body_gyro_x_train.txt': The angular velocity vector measured by the gyroscope for each window sample. The units are radians/second. 

Notes: 
======
- Features are normalized and bounded within [-1,1].
- Each feature vector is a row on the text file.
- The units used for the accelerations (total and body) are 'g's (gravity of earth -> 9.80665 m/seg2).
- The gyroscope units are rad/seg.
- A video of the experiment including an example of the 6 recorded activities with one of the participants can be seen in the following link: http://www.youtube.com/watch?v=XOEN9W05_4A

For more information about this dataset please contact: activityrecognition '@' smartlab.ws

License:
========
Use of this dataset in publications must be acknowledged by referencing the following publication [1] 

[1] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013. 

This dataset is distributed AS-IS and no responsibility implied or explicit can be addressed to the authors or their institutions for its use or misuse. Any commercial use is prohibited.

Other Related Publications:
===========================
[2] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, Jorge L. Reyes-Ortiz.  Energy Efficient Smartphone-Based Activity Recognition using Fixed-Point Arithmetic. Journal of Universal Computer Science. Special Issue in Ambient Assisted Living: Home Care.   Volume 19, Issue 9. May 2013

[3] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. Human Activity Recognition on Smartphones using a Multiclass Hardware-Friendly Support Vector Machine. 4th International Workshop of Ambient Assited Living, IWAAL 2012, Vitoria-Gasteiz, Spain, December 3-5, 2012. Proceedings. Lecture Notes in Computer Science 2012, pp 216-223. 

[4] Jorge Luis Reyes-Ortiz, Alessandro Ghio, Xavier Parra-Llanas, Davide Anguita, Joan Cabestany, Andreu Català. Human Activity and Motion Disorder Recognition: Towards Smarter Interactive Cognitive Environments. 21th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium 24-26 April 2013.  

==================================================================================================
Jorge L. Reyes-Ortiz, Alessandro Ghio, Luca Oneto, Davide Anguita and Xavier Parra. November 2013.
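
The README above describes separating gravity from body acceleration with a 0.3 Hz Butterworth low-pass filter at a 50 Hz sampling rate. A minimal sketch of that kind of preprocessing with scipy (the filter order and the random placeholder signal are assumptions, not taken from the dataset):

import numpy as np
from scipy.signal import butter, filtfilt

fs = 50.0        # sampling rate in Hz (from the README)
cutoff = 0.3     # low-pass cutoff in Hz (from the README)

# Butterworth low-pass filter; the 3rd-order choice is an assumption
b, a = butter(N=3, Wn=cutoff / (fs / 2), btype='low')

total_acc = np.random.randn(128)      # placeholder for one 128-sample acceleration window
gravity = filtfilt(b, a, total_acc)   # low-frequency (gravity) component
body_acc = total_acc - gravity        # remaining body-motion component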

The data is extracted from the .zip file on the UCI website


In [160]:
import io
import zipfile

import pandas as pd
import requests

# Download the zip archive and read the training features and labels straight from it
r = requests.get(data_zip)
z = zipfile.ZipFile(io.BytesIO(r.content))
X_train_df = pd.read_csv(z.open('UCI HAR Dataset/train/X_train.txt'), skipinitialspace=True, sep=' ', header=None)
X_train = X_train_df.values
y_train_df = pd.read_csv(z.open('UCI HAR Dataset/train/y_train.txt'), header=None)
y_train = y_train_df[0].values

The dataset contains readings from the phone's accelerometer and gyroscope as subjects perform a range of activities. The raw signals are then processed into 561 features from the initial 6 channels (3-axis accelerometer and 3-axis gyroscope).
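
The archive also contains the raw windowed signals under 'Inertial Signals'; a minimal sketch of loading one channel (this reuses the z ZipFile object from the cell above; the expected shape is an assumption based on the README's 128-readings-per-window description):

# X-axis total acceleration: one row per 2.56 s window, 128 samples per row
raw_acc_x = pd.read_csv(z.open('UCI HAR Dataset/train/Inertial Signals/total_acc_x_train.txt'),
                        delim_whitespace=True, header=None)
raw_acc_x.shape   # expected: one row per training window, 128 columns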


In [161]:
X_train.shape


Out[161]:
(7352, 561)

In [162]:
y_train.shape


Out[162]:
(7352,)

A training set of 7352 readings (X_train), each labelled as one of six activities (y_train), is used to train a random forest classifier.
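
Before training, it is worth checking that the six classes are reasonably balanced; a quick sketch:

import numpy as np

# Number of training windows per activity label (1-6)
labels, counts = np.unique(y_train, return_counts=True)
dict(zip(labels, counts))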


In [163]:
clf = RandomForestClassifier(n_jobs=2,n_estimators=150)
clf.fit(X_train, y_train)


Out[163]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=2,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

As a quick sanity check, the model classifies its own training data; accuracy should be at or very close to 100%.


In [164]:
from sklearn.metrics import accuracy_score, roc_auc_score

y_predict = clf.predict(X_train)

In [165]:
accuracy = accuracy_score(y_train, y_predict)
print('---'*20)
print('\n')
print("A Random Forest Classifier with 150 estimators was {}% accurate on the test data.".format(round(accuracy*100,2)))
print('\n')
print('---'*20)


------------------------------------------------------------


A Random Forest Classifier with 150 estimators was 100.0% accurate on the training data.


------------------------------------------------------------

A second unseen dataset of 2947 datapoints is used to test the classifier.


In [166]:
X_test_df = pd.read_csv(z.open('UCI HAR Dataset/test/X_test.txt'), skipinitialspace=True, sep=' ', header=None)
X_test = X_test_df.values
y_test_df = pd.read_csv(z.open('UCI HAR Dataset/test/y_test.txt'), header=None)
y_test = y_test_df[0].values

In [167]:
X_test.shape


Out[167]:
(2947, 561)

In [168]:
y_test.shape


Out[168]:
(2947,)

Predictions for the unseen dataset are created.


In [169]:
y_predict_test = clf.predict(X_test)

The predictions are compared with the true test labels to measure accuracy.


In [170]:
test_score = accuracy_score(y_test, y_predict_test)
print('---'*20)
print('\n')
print('A Random Forest Classifier with 150 estimators, trained on the training data, was {}% accurate at labelling the unseen test data.'.format(round(test_score*100,2)))
print('\n')
print('---'*20)


------------------------------------------------------------


A Random Forest Classifier with 150 estimators, trained on the training data, was 93.01% accurate at labelling the unseen test data.


------------------------------------------------------------

In [171]:
import numpy as np
import matplotlib.pyplot as plt

# Plot the feature importances in ascending order (on a sorted copy, leaving the model untouched)
order = np.sort(clf.feature_importances_)
plt.plot(order)


Out[171]:
[<matplotlib.lines.Line2D at 0x11f4b6ef0>]

In [172]:
importance = pd.DataFrame(clf.feature_importances_)

In [173]:
feature = pd.read_csv(z.open('UCI HAR Dataset/features.txt'),sep=' ',header=None)

In [174]:
feature['importance'] = clf.feature_importances_

In [175]:
feature.columns = [0, 'transform', 'importance']

The features with an importance greater than 0.015 are:


In [178]:
for feat in feature.transform[feature.importance>.015].values:
    print(feat,feature.importance[feature.transform==feat].values)


tGravityAcc-mean()-X [ 0.03375814]
tGravityAcc-mean()-Y [ 0.02971596]
tGravityAcc-max()-X [ 0.02051876]
tGravityAcc-max()-Y [ 0.01746774]
tGravityAcc-min()-X [ 0.03098886]
tGravityAcc-min()-Y [ 0.02094589]
tGravityAcc-energy()-X [ 0.04043527]
tGravityAcc-energy()-Y [ 0.01599929]
angle(X,gravityMean) [ 0.0250903]
angle(Y,gravityMean) [ 0.02249183]
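
The same ranking can be written more compactly with pandas; a sketch using the feature DataFrame built above:

# Features with importance above 0.015, sorted from most to least important
feature[feature.importance > 0.015].sort_values('importance', ascending=False)[['transform', 'importance']]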

In [179]:
activity_labels = pd.read_csv(z.open('UCI HAR Dataset/activity_labels.txt'),index_col=0,sep=' ',header = None)

In [259]:
labels_activity = [x.lower() for x in activity_labels[1].tolist()]
labels_activity


Out[259]:
['walking',
 'walking_upstairs',
 'walking_downstairs',
 'sitting',
 'standing',
 'laying']

In [264]:
from sklearn import metrics

# Per-class F1 scores; labels 1-6 map onto the activity names in order
label_ids = sorted(set(y_train))
# testing score
score_test = metrics.f1_score(y_test, y_predict_test, labels=label_ids, average=None)
# training score
score_train = metrics.f1_score(y_train, y_predict, labels=label_ids, average=None)

print('\n', 'metrics.f1_score for each category in training set', '\n', '----'*10)
for act, dec in zip(labels_activity, score_train):
    print('{} - {}%'.format(act, round(dec*100, 2)))

print('\n', 'metrics.f1_score for each category in test set', '\n', '----'*10)
for act, dec in zip(labels_activity, score_test):
    print('{} - {}%'.format(act, round(dec*100, 2)))


 metrics.f1_score for each category in training set 
 ----------------------------------------
walking - 100.0%
walking_upstairs - 100.0%
walking_downstairs - 100.0%
sitting - 100.0%
standing - 100.0%
laying - 100.0%

 metrics.f1_score for each category in test set 
 ----------------------------------------
walking - 93.53%
walking_upstairs - 91.2%
walking_downstairs - 90.2%
sitting - 90.55%
standing - 91.42%
laying - 100.0%
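
scikit-learn can also produce the per-class precision, recall, and F1 in a single call; a sketch using classification_report on the test predictions (target_names assumed to follow the same 1-6 label order as labels_activity):

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the test set
print(classification_report(y_test, y_predict_test, target_names=labels_activity))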

The original paper used an SVC (support vector classifier) in R to classify the data. Here is a comparison in Python using the same type of classifier.


In [235]:
from sklearn import svm
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (14.0, 10.0)

# Fit a linear SVC and predict the test set
classifier = svm.SVC(kernel='linear')
y_predictSVC = classifier.fit(X_train, y_train).predict(X_test)

y_values = [0, 1, 2, 3, 4, 5]
activity = labels_activity

# Compute the confusion matrix (rows: true labels, columns: predicted labels)
conf_arr = confusion_matrix(y_test, y_predictSVC)

# Row-normalised confusion matrix (fraction of each true class); computed but not plotted here
norm_conf = conf_arr / conf_arr.sum(axis=1, keepdims=True)

# Plot the raw counts
plt.matshow(conf_arr, cmap=plt.cm.winter)
plt.title('Confusion matrix for SVM', y=1.5)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.yticks(y_values, activity)
plt.xticks(y_values, activity, rotation=90)
plt.grid(False)

# Annotate each cell with its count
width = len(conf_arr)
height = len(conf_arr[0])
for x in range(width):
    for y in range(height):
        plt.annotate(str(conf_arr[x][y]), xy=(y, x),
                     horizontalalignment='center',
                     verticalalignment='center')


The confusion matrix from the SVC in scikit-learn (Python) is very close to the result the original authors obtained using R (see the paper: https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2013-84.pdf).


In [242]:
accuracy = accuracy_score(y_test, y_predictSVC)
print('---'*20)
print('\n')
print("A SVC was {}% accurate on the test data.".format(round(accuracy*100,2)))
print('\n')
print('---'*20)


------------------------------------------------------------


An SVC was 96.4% accurate on the test data.


------------------------------------------------------------

The same confusion-matrix plot, this time for the Random Forest classifier trained earlier.


In [236]:
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (14.0, 10.0)

y_values = [0, 1, 2, 3, 4, 5]
activity = labels_activity

# Confusion matrix for the Random Forest predictions made earlier (y_predict_test)
conf_arr = confusion_matrix(y_test, y_predict_test)

# Row-normalised confusion matrix (fraction of each true class); computed but not plotted here
norm_conf = conf_arr / conf_arr.sum(axis=1, keepdims=True)

# Plot the raw counts
plt.matshow(conf_arr, cmap=plt.cm.winter)
plt.title('Confusion matrix for Random Forest Classifier', y=1.5)
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')

plt.yticks(y_values, activity)
plt.xticks(y_values, activity, rotation=90)
plt.grid(False)

# Annotate each cell with its count
width = len(conf_arr)
height = len(conf_arr[0])
for x in range(width):
    for y in range(height):
        plt.annotate(str(conf_arr[x][y]), xy=(y, x),
                     horizontalalignment='center',
                     verticalalignment='center')
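
Since the same plotting code appears twice, it could be factored into a small helper function; a minimal sketch, assuming the same inputs used above:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion(y_true, y_pred, class_names, title):
    """Plot a confusion matrix with the count annotated in each cell."""
    conf = confusion_matrix(y_true, y_pred)
    plt.matshow(conf, cmap=plt.cm.winter)
    plt.title(title, y=1.5)
    plt.colorbar()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    ticks = range(len(class_names))
    plt.yticks(ticks, class_names)
    plt.xticks(ticks, class_names, rotation=90)
    for i in range(conf.shape[0]):
        for j in range(conf.shape[1]):
            plt.annotate(str(conf[i][j]), xy=(j, i),
                         horizontalalignment='center',
                         verticalalignment='center')

# e.g. plot_confusion(y_test, y_predict_test, labels_activity, 'Confusion matrix for Random Forest Classifier')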



In [241]:
accuracy = accuracy_score(y_test, y_predict_test)
print('---'*20)
print('\n')
print("A Random Forest Classifier with 150 estimators was {}% accurate on the test data.".format(round(accuracy*100,2)))
print('\n')
print('---'*20)


------------------------------------------------------------


A Random Forest Classifier with 150 estimators was 93.01% accurate on the test data.


------------------------------------------------------------
