Using Cerebral Cortex with Machine Learning Tools

Scenario: Classify the type of motion from a smartphone's accelerometer and gyroscope sensors.

This example is based on a Kaggle competition and kernel: https://www.kaggle.com/morrisb/what-does-your-smartphone-know-about-you

Reference: Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2013), Bruges, Belgium, 24-26 April 2013.

Initialize Cerebral Cortex


In [ ]:
%reload_ext autoreload
from util.dependencies import *
from settings import USER_ID
import pandas as pd

# Initialize the Cerebral Cortex kernel from the deployment's configuration directory
CC = Kernel("/home/md2k/cc_conf/")

# Limit pandas display output in the notebook
pd.options.display.max_rows = 20
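
To confirm that the kernel initialized correctly, the available streams can be listed. This is an optional sketch, assuming the CerebralCortex Kernel object exposes a list_streams() method; the exact return type may vary across versions.


In [ ]:
# Optional sanity check: list the streams registered with this Cerebral Cortex instance
# (assumes Kernel.list_streams() is available in this CerebralCortex-Kernel version)
available_streams = CC.list_streams()
print(available_streams)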

Load both data and labels into datastreams

These are then converted to pandas DataFrames for the remainder of the example.


In [ ]:
# Load the feature and activity-label datastreams from Cerebral Cortex
both_datastream = CC.get_stream('Kaggle-Features')
label_datastream = CC.get_stream('Kaggle-ActivityLabels')

# Convert each datastream to a pandas DataFrame
both_dataframe = both_datastream.to_pandas().data
label_dataframe = label_datastream.to_pandas().data

# Drop the Cerebral Cortex metadata columns, which are not needed for modeling
both_dataframe = both_dataframe.drop(['timestamp', 'localtime', 'version', 'user'], axis=1)
label_dataframe = label_dataframe.drop(['timestamp', 'localtime', 'version', 'user'], axis=1)

Count the number of events in each activity class


In [ ]:
label_dataframe.groupby('Activity').size().reset_index(name='Counts')
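
The class balance can also be visualized. The sketch below is not part of the original example; it plots the counts above as a horizontal bar chart and assumes matplotlib is installed in the notebook environment.


In [ ]:
import matplotlib.pyplot as plt  # assumed to be available in the environment

# Plot the number of events per activity class
activity_counts = label_dataframe.groupby('Activity').size().sort_values()
activity_counts.plot(kind='barh', title='Events per activity class')
plt.xlabel('Count')
plt.tight_layout()
plt.show()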

Examine the data


In [ ]:
both_dataframe
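
Beyond viewing the raw table, basic summary statistics give a feel for the feature scales before standardization. The following optional cell uses only standard pandas methods.


In [ ]:
# Shapes of the feature and label tables, followed by per-feature summary statistics
print(both_dataframe.shape, label_dataframe.shape)
both_dataframe.describe().T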

Prepare the data for model training

The required packages are imported into the notebook before transforming the data.

  1. The data is cleaned to remove any subject-specific information
  2. The data is scaled using StandardScaler
  3. Principal Component Analysis (PCA) is used to identify the features with the most descriptive power
  4. The text labels are encoded with LabelEncoder, because most ML algorithms do not work well with text labels
  5. A train-test split is created for model building

In [ ]:
from lightgbm import LGBMClassifier

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA

# Separate the feature columns from the per-row metadata
feature_data = both_dataframe.copy()
data_data = feature_data.pop('Data')
subject_data = feature_data.pop('subject')

# Scale the features to zero mean and unit variance
feature_data = StandardScaler().fit_transform(feature_data)

# Reduce dimensionality with PCA, keeping 95% of the variance (speeds up training)
feature_data = PCA(n_components=0.95, random_state=3).fit_transform(feature_data)

# Encode the text activity labels as integers and create the train-test split
label_encoded = LabelEncoder().fit_transform(label_dataframe.Activity)
X_train, X_test, y_train, y_test = train_test_split(feature_data, label_encoded, random_state=3)
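
To see how aggressively PCA reduces the dimensionality, the fitted PCA object can be kept and its explained variance inspected. The sketch below is optional; it simply re-runs the scaling and PCA steps from the previous cell with the same 0.95 variance threshold and the same 'Data' and 'subject' columns.


In [ ]:
# Optional: refit PCA while keeping the fitted object, to inspect the retained components
features = both_dataframe.drop(columns=['Data', 'subject'])
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=0.95, random_state=3)
pca.fit(scaled)

print('Components retained:', pca.n_components_)
print('Total explained variance: {:.3f}'.format(pca.explained_variance_ratio_.sum()))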

Train a classifier

A gradient boosting machine (GBM) classifier built from gradient-boosted decision trees is trained on the PCA-reduced features. This example uses 50 boosted trees; classification accuracy peaks at about 0.959 with 500 or more trees, but with only 50 it still reaches about 0.937.


In [ ]:
number_of_estimators = 50

# Create and train the model
lgbm = LGBMClassifier(n_estimators=number_of_estimators)
lgbm = lgbm.fit(X_train, y_train)

# Evaluate the model on the held-out test split
score = accuracy_score(y_true=y_test, y_pred=lgbm.predict(X_test))
print('Classification Accuracy:', score)
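
Beyond overall accuracy, a per-class breakdown shows which activities the model confuses most. The following optional sketch uses scikit-learn's classification_report and confusion_matrix on the same test split; the class names are recovered by relying on LabelEncoder assigning integer codes in sorted label order.


In [ ]:
from sklearn.metrics import classification_report, confusion_matrix

# LabelEncoder assigns codes in sorted label order, so sorted unique activities recover the names
class_names = sorted(label_dataframe.Activity.unique())

y_pred = lgbm.predict(X_test)
print(classification_report(y_test, y_pred,
                            labels=list(range(len(class_names))),
                            target_names=class_names))
print(confusion_matrix(y_test, y_pred))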