Hello World! This notebook describes the decision-tree-based machine learning model I created to segment the users of the Habits app.

Looking around the data set


In [546]:
# Clear all variable values
%reset


Once deleted, variables cannot be recovered. Proceed (y/[n])? y

In [547]:
# Import the required modules
import pandas as pd
import numpy as np
#import scipy as sp

In [548]:
# Simple function call to read in the user data file.
# The argument parse_dates takes a list of columns, which are to be parsed as dates
user_data_raw = pd.read_csv("janacare_user-engagement_Aug2014-Apr2016.csv", parse_dates = [-3,-2,-1])

In [549]:
# data metrics
user_data_raw.shape # (rows, columns)


Out[549]:
(372, 19)

In [550]:
# data metrics
user_data_raw.dtypes # data type of each column


Out[550]:
user_id                                                        float64
num_modules_consumed                                           float64
num_glucose_tracked                                            float64
num_of_days_steps_tracked                                      float64
num_of_days_food_tracked                                       float64
num_of_days_weight_tracked                                     float64
insulin_a1c_count                                              float64
cholesterol_count                                              float64
hemoglobin_count                                               float64
watching_videos (binary - 1 for yes, blank/0 for no)           float64
weight                                                         float64
height                                                           int64
bmi                                                              int64
age                                                              int64
gender                                                          object
has_diabetes                                                   float64
first_login                                             datetime64[ns]
last_activity                                           datetime64[ns]
age_on_platform                                                 object
dtype: object

The column name watching_videos (binary - 1 for yes, blank/0 for no) is too long and contains special characters, so let's rename it to watching_videos


In [551]:
user_data_to_clean = user_data_raw.rename(columns = {'watching_videos (binary - 1 for yes, blank/0 for no)':'watching_videos'})

In [552]:
# Some basic statistical information on the data
user_data_to_clean.describe()


Out[552]:
user_id num_modules_consumed num_glucose_tracked num_of_days_steps_tracked num_of_days_food_tracked num_of_days_weight_tracked insulin_a1c_count cholesterol_count hemoglobin_count watching_videos weight height bmi age has_diabetes
count 371.00000 69.000000 91.000000 120.000000 78.000000 223.000000 47.000000 15.000000 0.0 97.0 372.000000 372.000000 372.000000 372.000000 39.000000
mean 13850.74124 12.072464 17.769231 53.433333 29.576923 3.210762 5.170213 4.733333 NaN 1.0 72.074597 169.306452 25.325269 49.223118 0.512821
std 12773.29800 13.693406 38.881894 80.690792 47.019344 4.490778 12.694263 1.709915 NaN 0.0 14.744092 16.112564 5.194763 13.487788 0.506370
min 4288.00000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 3.000000 NaN 1.0 40.000000 120.000000 5.000000 11.000000 0.000000
25% 6075.50000 3.000000 2.000000 6.750000 2.000000 1.000000 1.000000 4.000000 NaN 1.0 62.000000 162.000000 22.000000 39.000000 0.000000
50% 7462.00000 8.000000 5.000000 19.500000 10.500000 2.000000 2.000000 4.000000 NaN 1.0 70.000000 167.000000 25.000000 49.500000 1.000000
75% 15258.00000 15.000000 12.500000 65.000000 31.500000 3.000000 3.000000 5.000000 NaN 1.0 80.000000 172.000000 27.000000 60.000000 1.000000
max 49766.00000 78.000000 260.000000 469.000000 229.000000 40.000000 78.000000 10.000000 NaN 1.0 165.000000 349.000000 56.000000 77.000000 1.000000

Data Clean up

In the last section of looking around, I saw that many rows have missing or garbage values (see the first row of the table above). These can cause errors when computing anything that uses them, hence a clean-up is required.

We will clean up only those columns that are being used as features.

  • num_modules_consumed
  • num_glucose_tracked
  • num_of_days_food_tracked
  • watching_videos

The next two columns will not be cleaned, as they contain time data, which in my opinion should not be imputed

  • first_login
  • last_activity

In [553]:
# Lets check the health of the data set
user_data_to_clean.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 19 columns):
user_id                       371 non-null float64
num_modules_consumed          69 non-null float64
num_glucose_tracked           91 non-null float64
num_of_days_steps_tracked     120 non-null float64
num_of_days_food_tracked      78 non-null float64
num_of_days_weight_tracked    223 non-null float64
insulin_a1c_count             47 non-null float64
cholesterol_count             15 non-null float64
hemoglobin_count              0 non-null float64
watching_videos               97 non-null float64
weight                        372 non-null float64
height                        372 non-null int64
bmi                           372 non-null int64
age                           372 non-null int64
gender                        372 non-null object
has_diabetes                  39 non-null float64
first_login                   372 non-null datetime64[ns]
last_activity                 302 non-null datetime64[ns]
age_on_platform               372 non-null object
dtypes: datetime64[ns](2), float64(12), int64(3), object(2)
memory usage: 55.3+ KB

As is visible from the data type of the last column (age_on_platform), pandas is not recognising it as a date type. This will make things difficult, so I delete this column and add a new one, since its data can be recreated as age_on_platform = last_activity - first_login


In [554]:
# Let's first delete the last column
user_data_to_clean_del_last_col = user_data_to_clean.drop("age_on_platform", axis=1)

In [555]:
# Check if the column has been deleted; the number of columns changed from 19 to 18
user_data_to_clean_del_last_col.shape


Out[555]:
(372, 18)

In [556]:
# Point 'user_data_to_clean' at the reduced data frame (note: this is a reference, not a copy)
user_data_to_clean = user_data_to_clean_del_last_col

While eyeballing the data I noticed that some cells of the column first_login have a greater value than the corresponding cell of last_activity. These cells need to be swapped, since it is not possible to have first_login > last_activity


In [557]:
# Run a loop through the data frame and check each row for this anomaly; if found, swap the values
for index, row in user_data_to_clean.iterrows():
    if row.first_login > row.last_activity:
        temp_date_var = row.first_login
        user_data_to_clean.set_value(index, 'first_login', row.last_activity)
        user_data_to_clean.set_value(index, 'last_activity', temp_date_var)
        #print "\tSw\t" + "first\t" + row.first_login.isoformat() + "\tlast\t" + row.last_activity.isoformat()
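The row-by-row swap above can also be done in a vectorized way; here is a sketch on a toy frame (the column names match ours, but the data is made up):

```python
import pandas as pd

# Toy frame: row 0 has first_login after last_activity and must be swapped
df = pd.DataFrame({
    'first_login':   pd.to_datetime(['2015-01-10', '2015-03-01']),
    'last_activity': pd.to_datetime(['2015-01-01', '2015-04-01']),
})

# Boolean mask of the anomalous rows, then swap the two columns in one shot;
# .values is needed so pandas does not re-align on column names
mask = df['first_login'] > df['last_activity']
df.loc[mask, ['first_login', 'last_activity']] = \
    df.loc[mask, ['last_activity', 'first_login']].values
```

Rows where the mask is False are left untouched.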

In [558]:
# Create new column 'age_on_platform' which has the corresponding value in date type format
user_data_to_clean["age_on_platform"] = user_data_to_clean["last_activity"] - user_data_to_clean["first_login"]

In [559]:
# Check the result in first few rows
user_data_to_clean["age_on_platform"].head(5)


Out[559]:
0   151 days
1   129 days
2   211 days
3   235 days
4     3 days
Name: age_on_platform, dtype: timedelta64[ns]

In [560]:
# Lets check the health of the data set
user_data_to_clean.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 19 columns):
user_id                       371 non-null float64
num_modules_consumed          69 non-null float64
num_glucose_tracked           91 non-null float64
num_of_days_steps_tracked     120 non-null float64
num_of_days_food_tracked      78 non-null float64
num_of_days_weight_tracked    223 non-null float64
insulin_a1c_count             47 non-null float64
cholesterol_count             15 non-null float64
hemoglobin_count              0 non-null float64
watching_videos               97 non-null float64
weight                        372 non-null float64
height                        372 non-null int64
bmi                           372 non-null int64
age                           372 non-null int64
gender                        372 non-null object
has_diabetes                  39 non-null float64
first_login                   372 non-null datetime64[ns]
last_activity                 302 non-null datetime64[ns]
age_on_platform               302 non-null timedelta64[ns]
dtypes: datetime64[ns](2), float64(12), int64(3), object(1), timedelta64[ns](1)
memory usage: 55.3+ KB

The second column of the table above gives the number of non-null values in each column. As is visible for the columns of interest, e.g. num_modules_consumed has ONLY 69 non-null values out of 372 rows
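The per-column sparsity can also be quantified directly as a fraction; a quick sketch on a toy column:

```python
import numpy as np
import pandas as pd

# Toy column: 2 of the 4 values are missing
df = pd.DataFrame({'num_modules_consumed': [1.0, np.nan, np.nan, 4.0]})

# isnull() gives a boolean frame; the mean of each column is then
# the fraction of missing values in that column
missing_fraction = df.isnull().mean()
```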


In [561]:
# Let's remove all columns from the data set that do not have to be imputed
user_data_to_impute = user_data_to_clean.drop(["user_id", "watching_videos", "num_of_days_steps_tracked", "num_of_days_weight_tracked", "insulin_a1c_count", "weight", "height", "bmi", "age", "gender", "has_diabetes", "first_login", "last_activity", "age_on_platform", "hemoglobin_count", "cholesterol_count"], 1 )

In [562]:
user_data_to_impute.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 3 columns):
num_modules_consumed        69 non-null float64
num_glucose_tracked         91 non-null float64
num_of_days_food_tracked    78 non-null float64
dtypes: float64(3)
memory usage: 8.8 KB

The next 3 cells describe the steps to impute data using the KNN strategy; sadly this did not work well for our data set! One possible reason could be that the columns are too sparse to find a neighbour.

In future this method could be combined with mean imputation, so the values not covered by KNN get replaced with mean values.
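A sketch of that combination, assuming the KNN step leaves un-imputable cells at 0 (as observed above); `mean_fallback` is a hypothetical helper written for this notebook, not part of any library:

```python
import numpy as np
import pandas as pd

def mean_fallback(imputed, original):
    """Replace cells that were missing in `original` and left at 0 by the
    KNN step with the column mean of the observed values."""
    result = imputed.copy()
    col_means = original.mean()              # NaNs are ignored by default
    mask = original.isna() & (imputed == 0)  # missing AND not filled by KNN
    for col in result.columns:
        result.loc[mask[col], col] = col_means[col]
    return result

# Toy example: the NaN was left at 0 by "KNN" and gets the column mean (2.0)
original = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
imputed = pd.DataFrame({'a': [1.0, 0.0, 3.0]})
result = mean_fallback(imputed, original)
```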


In [563]:
# Import Imputation method KNN
##from fancyimpute import KNN

In [564]:
# First let's convert the pandas data frame into a NumPy array. We do this since the data frame
# needs to be transposed, which is only possible if the data is a NumPy array.
##user_data_to_impute_np_array = user_data_to_impute.as_matrix()
# Lets Transpose it
##user_data_to_impute_np_array_transposed = user_data_to_impute_np_array.T

In [565]:
# Run the KNN method on the data.   function usage X_filled_knn = KNN(k=3).complete(X_incomplete)
##user_data_imputed_knn_np_array = KNN(k=5).complete(user_data_to_impute_np_array_transposed)

The above 3 steps are for KNN-based imputation, which did not work well: 804 items could not be imputed and got replaced with zero.

Let's use the simpler method that is provided by scikit-learn itself


In [566]:
# Lets use simpler method that is provided by Scikit Learn itself
# import the function
from sklearn.preprocessing import Imputer

In [567]:
# Create an object of class Imputer, with the relevant parameters
imputer_object = Imputer(missing_values='NaN', strategy='mean', axis=0, copy=False)

In [568]:
# Impute the data and save the generated Numpy array
user_data_imputed_np_array = imputer_object.fit_transform(user_data_to_impute)
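The same mean strategy can be cross-checked with pandas alone, which avoids the round-trip through a NumPy array (a sketch on a toy column):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_modules_consumed': [1.0, np.nan, 5.0]})

# fillna with the per-column means; the observed values 1 and 5 average to 3
filled = df.fillna(df.mean())
```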

Since user_data_imputed_np_array is a NumPy array, we need to convert it back to a pandas data frame


In [569]:
# Create a list with the names of all the columns in the NumPy array;
# the exact order of the columns has to be maintained
column_names_of_imputed_np_array = ['num_modules_consumed', 'num_glucose_tracked', 'num_of_days_food_tracked']
# create the Pandas data frame from the Numpy array
user_data_imputed_data_frame = pd.DataFrame(user_data_imputed_np_array, columns=column_names_of_imputed_np_array)
# Check if the data frame created now is proper
user_data_imputed_data_frame.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 3 columns):
num_modules_consumed        372 non-null float64
num_glucose_tracked         372 non-null float64
num_of_days_food_tracked    372 non-null float64
dtypes: float64(3)
memory usage: 8.8 KB

Now let's add back the useful columns that we had removed from the data set; these are

  • last_activity
  • age_on_platform
  • watching_videos

In [570]:
# using the Series constructor from pandas
user_data_imputed_data_frame['last_activity'] = pd.Series(user_data_to_clean['last_activity'])
user_data_imputed_data_frame['age_on_platform'] = pd.Series(user_data_to_clean['age_on_platform'])
# Check if everything is OK
user_data_imputed_data_frame.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 5 columns):
num_modules_consumed        372 non-null float64
num_glucose_tracked         372 non-null float64
num_of_days_food_tracked    372 non-null float64
last_activity               302 non-null datetime64[ns]
age_on_platform             302 non-null timedelta64[ns]
dtypes: datetime64[ns](1), float64(3), timedelta64[ns](1)
memory usage: 14.6 KB

As mentioned in the column description for watching_videos, a blank or missing value means '0', also known as 'not watching'.

Since scikit-learn models can ONLY deal with numerical values, let's convert all blanks to '0'


In [571]:
# fillna(0) function will fill all blank cells with '0'
user_data_imputed_data_frame['watching_videos'] = pd.Series(user_data_to_clean['watching_videos'].fillna(0))
user_data_imputed_data_frame.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 372 entries, 0 to 371
Data columns (total 6 columns):
num_modules_consumed        372 non-null float64
num_glucose_tracked         372 non-null float64
num_of_days_food_tracked    372 non-null float64
last_activity               302 non-null datetime64[ns]
age_on_platform             302 non-null timedelta64[ns]
watching_videos             372 non-null float64
dtypes: datetime64[ns](1), float64(4), timedelta64[ns](1)
memory usage: 17.5 KB

Finally, the columns last_activity and age_on_platform still have missing values, as evident from the table above. Since this is time data, which in my opinion should not be imputed, we will drop the rows that are missing it.


In [572]:
# Since only these two columns have null values, we can run the function *dropna()* on the whole data frame
# All rows with missing data get dropped
user_data_imputed_data_frame.dropna(axis=0, inplace=True)
user_data_imputed_data_frame.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 0 to 370
Data columns (total 6 columns):
num_modules_consumed        302 non-null float64
num_glucose_tracked         302 non-null float64
num_of_days_food_tracked    302 non-null float64
last_activity               302 non-null datetime64[ns]
age_on_platform             302 non-null timedelta64[ns]
watching_videos             302 non-null float64
dtypes: datetime64[ns](1), float64(4), timedelta64[ns](1)
memory usage: 16.5 KB

Labelling the Raw data

Now comes the code that labels the provided data based on the rules mentioned below, so it can be used as training data for the classifier.

This table defines the set of rules used to assign labels to the training data

| label | age_on_platform | last_activity | num_modules_consumed | num_of_days_food_tracked | num_glucose_tracked | watching_videos |
|---|---|---|---|---|---|---|
| Generic rule (ignore) | converted to days | measured from 16 Apr | Good >= 3/week, Bad < 3/week | Good >= 30, Bad < 30 | Good >= 4/week, Bad < 4/week | Good = 1, Bad = 0 |
| good_new_user = 1 | >= 30 days && < 180 | <= 2 days | >= 12 | >= 20 | >= 16 | Good = 1 |
| bad_new_user = 2 | >= 30 days && < 180 | > 2 days | < 12 | < 20 | < 16 | Bad = 0 |
| good_mid_term_user = 3 | >= 180 days && < 360 | <= 7 days | >= 48 | >= 30 | >= 96 | Good = 1 |
| bad_mid_term_user = 4 | >= 180 days && < 360 | > 7 days | < 48 | < 30 | < 96 | Bad = 0 |
| good_long_term_user = 5 | >= 360 days | <= 14 days | >= 48 | >= 30 | >= 192 | Good = 1 |
| bad_long_term_user = 6 | >= 360 days | > 14 days | < 48 | < 30 | < 192 | Bad = 0 |
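The age-based binning from the table can also be expressed in a vectorized way with np.select; a sketch on a toy frame where age_on_platform is already in days (an assumption for this sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age_on_platform': [45, 200, 400, 10]})

conditions = [
    (df.age_on_platform >= 30) & (df.age_on_platform < 180),   # new user
    (df.age_on_platform >= 180) & (df.age_on_platform < 360),  # mid-term user
    df.age_on_platform >= 360,                                 # long-term user
]
choices = ['new', 'mid_term', 'long_term']

# Rows matching no condition fall into the 'ignore' bucket (label 0 above)
df['cohort'] = np.select(conditions, choices, default='ignore')
```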

In [573]:
# This if/else section bins the rows based on the criteria for labels mentioned in the table above

user_data_imputed_data_frame_labeled = user_data_imputed_data_frame

for index, row in user_data_imputed_data_frame.iterrows():
    
    if row["age_on_platform"] >= np.timedelta64(30, 'D') and row["age_on_platform"] < np.timedelta64(180, 'D'):    
        if row['last_activity'] <= np.datetime64(2, 'D') and\
            row['num_modules_consumed'] >= 12 and\
            row['num_of_days_food_tracked'] >= 20 and\
            row['num_glucose_tracked'] >= 16 and\
            row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 1)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 2)
    
    elif row["age_on_platform"] >= np.timedelta64(180, 'D') and row["age_on_platform"] < np.timedelta64(360, 'D'):
        if row['last_activity'] <= np.datetime64(7, 'D') and\
            row['num_modules_consumed'] >= 48 and\
            row['num_of_days_food_tracked'] >= 30 and\
            row['num_glucose_tracked'] >= 96 and\
            row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 3)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 4)
            
    elif row["age_on_platform"] >= np.timedelta64(360, 'D'):
        if row['last_activity'] <= np.datetime64(14, 'D') and\
            row['num_modules_consumed'] >= 48 and\
            row['num_of_days_food_tracked'] >= 30 and\
            row['num_glucose_tracked'] >= 192 and\
            row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 5)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 6)
    else:
        user_data_imputed_data_frame_labeled.set_value(index, 'label', 0)
        
user_data_imputed_data_frame_labeled['label'].unique()


Out[573]:
array([ 2.,  4.,  0.,  6.])

The output above shows that only 2, 4, 6 and 0 were assigned as labels, which means there are no good users in any of the three (new, mid, long-term) categories.

One reason is the last_activity check: comparing a datetime64 timestamp with np.datetime64(2, 'D') (which is a date in January 1970) can never be true, so the 'good' branches are unreachable. Consequently I either need to fix the label selection model or get better data (which has good users) :P


In [574]:
# Look at basic info for this Labeled data frame
user_data_imputed_data_frame_labeled.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 0 to 370
Data columns (total 7 columns):
num_modules_consumed        302 non-null float64
num_glucose_tracked         302 non-null float64
num_of_days_food_tracked    302 non-null float64
last_activity               302 non-null datetime64[ns]
age_on_platform             302 non-null timedelta64[ns]
watching_videos             302 non-null float64
label                       302 non-null float64
dtypes: datetime64[ns](1), float64(5), timedelta64[ns](1)
memory usage: 18.9 KB

One major limitation of scikit-learn is the data types it can deal with for features.

The data type of last_activity is datetime64 and that of age_on_platform is timedelta64; these we need to convert to a numerical type.
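The datetime-to-seconds conversion used below can be checked on a single known value; one day after the epoch should give 86400 seconds (a sketch):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['1970-01-02']))

# Subtracting the epoch gives timedeltas; dividing by one second yields floats
epoch_seconds = (s - pd.Timestamp('1970-01-01')) / pd.Timedelta(seconds=1)
```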


In [575]:
# Let's start with the column last_activity
# This function takes a datetime64 value and converts it into a float value that represents seconds since the epoch
def convert_datetime64_to_from_epoch(dt64):
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    return ts
# Lets apply this function on last_activity column
user_data_imputed_data_frame_labeled_datetime64_converted = user_data_imputed_data_frame_labeled
user_data_imputed_data_frame_labeled_datetime64_converted['last_activity'] = user_data_imputed_data_frame_labeled['last_activity'].apply(convert_datetime64_to_from_epoch)
user_data_imputed_data_frame_labeled_datetime64_converted.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 0 to 370
Data columns (total 7 columns):
num_modules_consumed        302 non-null float64
num_glucose_tracked         302 non-null float64
num_of_days_food_tracked    302 non-null float64
last_activity               302 non-null float64
age_on_platform             302 non-null timedelta64[ns]
watching_videos             302 non-null float64
label                       302 non-null float64
dtypes: float64(6), timedelta64[ns](1)
memory usage: 18.9 KB
/home/bigboy/local_bin/janacare/virtenv/lib/python2.7/site-packages/ipykernel/__main__.py:5: DeprecationWarning: parsing timezone aware datetimes is deprecated; this will raise an error in the future

In [576]:
# Now it's time to convert the timedelta64 column named age_on_platform
def convert_timedelta64_to_sec(td64):
    ts = (td64 / np.timedelta64(1, 's'))
    return ts
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted = user_data_imputed_data_frame_labeled_datetime64_converted
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted['age_on_platform'] = user_data_imputed_data_frame_labeled_datetime64_converted['age_on_platform'].apply(convert_timedelta64_to_sec)
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 302 entries, 0 to 370
Data columns (total 7 columns):
num_modules_consumed        302 non-null float64
num_glucose_tracked         302 non-null float64
num_of_days_food_tracked    302 non-null float64
last_activity               302 non-null float64
age_on_platform             302 non-null float64
watching_videos             302 non-null float64
label                       302 non-null float64
dtypes: float64(7)
memory usage: 18.9 KB

In [577]:
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.describe()


Out[577]:
num_modules_consumed num_glucose_tracked num_of_days_food_tracked last_activity age_on_platform watching_videos label
count 302.000000 302.000000 302.000000 3.020000e+02 3.020000e+02 302.000000 302.000000
mean 12.072464 17.824758 29.576923 1.453187e+09 1.422882e+07 0.321192 2.642384
std 6.508527 21.239030 23.781469 1.498342e+07 1.275145e+07 0.467709 1.814831
min 1.000000 1.000000 1.000000 1.410480e+09 0.000000e+00 0.000000 0.000000
25% 12.072464 17.769231 29.576923 1.443182e+09 4.168800e+06 0.000000 2.000000
50% 12.072464 17.769231 29.576923 1.454112e+09 1.036800e+07 0.000000 2.000000
75% 12.072464 17.769231 29.576923 1.460765e+09 2.062800e+07 1.000000 4.000000
max 78.000000 260.000000 229.000000 1.480810e+09 5.762880e+07 1.000000 6.000000

In [578]:
# Save the labeled data frame as excel file
from pandas import options
options.io.excel.xlsx.writer = 'xlsxwriter'
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.to_excel('user_data_imputed_data_frame_labeled.xlsx')

Training and Testing the ML algorithm

Let's move on to the thing we have all been waiting for: model training and testing.

For training the model we need two lists: one list with only the labels column, and a second that is actually a list of lists, with each sublist containing the full row of feature columns.

Before we do anything we need to separate out 30% of the data for testing purposes
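Instead of slicing by position as below, scikit-learn's train_test_split shuffles before splitting, which avoids any ordering bias in the file (a sketch; the toy column names are made up):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the labeled data
df = pd.DataFrame({'feature': range(10), 'label': [0, 1] * 5})

# Hold out 30% for testing; random_state makes the shuffle reproducible
train, test = train_test_split(df, test_size=0.3, random_state=42)
```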


In [579]:
# Total number of rows is 302; 30% of that is ~90
user_data_imputed_data_frame_labeled_training = user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.ix[90:]
user_data_imputed_data_frame_labeled_training.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 90 to 370
Data columns (total 7 columns):
num_modules_consumed        212 non-null float64
num_glucose_tracked         212 non-null float64
num_of_days_food_tracked    212 non-null float64
last_activity               212 non-null float64
age_on_platform             212 non-null float64
watching_videos             212 non-null float64
label                       212 non-null float64
dtypes: float64(7)
memory usage: 13.2 KB

In [580]:
# Let's first make our list from the labels column
label_list = user_data_imputed_data_frame_labeled_training['label'].values.tolist()
# Check data type of elements of the list
type(label_list[0])


Out[580]:
float

In [581]:
# Lets convert the data type of all elements of the list to int
label_list_training = map(int, label_list)
# Check data type of elements of the list
type(label_list_training[5])


Out[581]:
int

Ideally the datetime64 & timedelta64 derived columns would be handled more cleanly here; the issue is that scikit-learn methods can only deal with numerical and string features, which is why they were converted to floats above. I am still trying to sort this out.


In [582]:
# Now to create the other list of lists with features as elements
# before that we will have to remove the Labels column
user_data_imputed_data_frame_UNlabeled_training = user_data_imputed_data_frame_labeled_training.drop(['label'] ,1)
user_data_imputed_data_frame_UNlabeled_training.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 90 to 370
Data columns (total 6 columns):
num_modules_consumed        212 non-null float64
num_glucose_tracked         212 non-null float64
num_of_days_food_tracked    212 non-null float64
last_activity               212 non-null float64
age_on_platform             212 non-null float64
watching_videos             212 non-null float64
dtypes: float64(6)
memory usage: 11.6 KB

In [583]:
# As you may notice, the data type of watching_videos is float, while it should be int
user_data_imputed_data_frame_UNlabeled_training['watching_videos'] = user_data_imputed_data_frame_UNlabeled_training['watching_videos'].apply(lambda x: int(x))
user_data_imputed_data_frame_UNlabeled_training.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 212 entries, 90 to 370
Data columns (total 6 columns):
num_modules_consumed        212 non-null float64
num_glucose_tracked         212 non-null float64
num_of_days_food_tracked    212 non-null float64
last_activity               212 non-null float64
age_on_platform             212 non-null float64
watching_videos             212 non-null int64
dtypes: float64(5), int64(1)
memory usage: 11.6 KB

In [584]:
# Finally lets create the list of list from the row contents
features_list_training = map(list, user_data_imputed_data_frame_UNlabeled_training.values)

It's time to train the model


In [585]:
from sklearn import tree

In [586]:
classifier = tree.DecisionTreeClassifier() # We create an instance of the Decision tree object
classifier = classifier.fit(features_list_training, label_list_training) # Train the classifier

In [587]:
# Testing data is the first 91 rows; note that .ix[:90] is inclusive, so row 90 also appears in the training set
user_data_imputed_data_frame_labeled_testing = user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.ix[:90]

# Take the labels into a separate list
label_list_test = user_data_imputed_data_frame_labeled_testing['label'].values.tolist()
label_list_test = map(int, label_list_test)

# Drop the label column
user_data_imputed_data_frame_UNlabeled_testing = user_data_imputed_data_frame_labeled_testing.drop(['label'], 1)
# Check if everything looks OK
user_data_imputed_data_frame_UNlabeled_testing.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 91 entries, 0 to 90
Data columns (total 6 columns):
num_modules_consumed        91 non-null float64
num_glucose_tracked         91 non-null float64
num_of_days_food_tracked    91 non-null float64
last_activity               91 non-null float64
age_on_platform             91 non-null float64
watching_videos             91 non-null float64
dtypes: float64(6)
memory usage: 5.0 KB

In [588]:
# Finally lets create the list of list from the row contents for testing
features_list_test = map(list, user_data_imputed_data_frame_UNlabeled_testing.values)

In [589]:
len(features_list_test)


Out[589]:
91

In [592]:
# The prediction results for the first twenty rows of the test data set
print list(classifier.predict(features_list_test[:20]))


[2, 2, 4, 4, 0, 4, 2, 2, 4, 2, 4, 2, 2, 4, 4, 2, 2, 6, 4, 2]

In [593]:
# The labels for the test data set, as assigned by the labelling code above
print label_list_test[:20]


[2, 2, 4, 4, 0, 4, 2, 2, 4, 2, 4, 2, 2, 4, 4, 2, 2, 6, 4, 2]
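Rather than eyeballing the two lists, scikit-learn's accuracy_score quantifies the agreement (a sketch with the first five values copied from the output above):

```python
from sklearn.metrics import accuracy_score

predicted = [2, 2, 4, 4, 0]
actual = [2, 2, 4, 4, 0]

# Fraction of positions where the two lists agree
acc = accuracy_score(actual, predicted)
```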

Now it seems to be doing very well!! :D

One obvious reason could be that the model currently does not factor in two important feature columns


In [ ]: