Hello World! This notebook describes the decision-tree-based machine learning model I built to segment the users of the Habits app.
In [546]:
# This is to clear all variable values
%reset
In [547]:
# Import the required modules
import pandas as pd
import numpy as np
#import scipy as sp
In [548]:
# Read in the user data file.
# The parse_dates argument takes a list of columns to be parsed as dates
user_data_raw = pd.read_csv("janacare_user-engagement_Aug2014-Apr2016.csv", parse_dates = [-3,-2,-1])
In [549]:
# data metrics
user_data_raw.shape # Rows, columns
Out[549]:
In [550]:
# data metrics
user_data_raw.dtypes # data type of columns
Out[550]:
The column name watching_videos (binary - 1 for yes, blank/0 for no) is too long and contains special characters, so let's shorten it to watching_videos.
In [551]:
user_data_to_clean = user_data_raw.rename(columns = {'watching_videos (binary - 1 for yes, blank/0 for no)':'watching_videos'})
In [552]:
# Some basic statistical information on the data
user_data_to_clean.describe()
Out[552]:
In the last section of looking around, I saw that a lot of rows have missing or garbage values (see the first row of the table above). This can cause errors when computing anything using these rows, hence a clean-up is required.
We will clean up only those columns that are used as features.
The next two columns will not be cleaned, as they contain time data which, in my opinion, should not be imputed.
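As a quick check before cleaning, counting the missing values per column shows where imputation will be needed (a small sketch using pandas):
In [ ]:
# Count the missing (NaN) cells in each column, worst offenders first
user_data_to_clean.isnull().sum().sort_values(ascending=False)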
In [553]:
# Let's check the health of the data set
user_data_to_clean.info()
As is visible from the data type of the last column (age_on_platform), Pandas is not recognising it as a date format. This will make things difficult, so I delete this particular column and add a new one, since the data in age_on_platform can be recreated as age_on_platform = last_activity - first_login.
In [554]:
# Let's first delete the last column
user_data_to_clean_del_last_col = user_data_to_clean.drop("age_on_platform", axis=1)
In [555]:
# Check if the column has been deleted; the number of columns should change from 19 to 18
user_data_to_clean_del_last_col.shape
Out[555]:
In [556]:
# Rebind the working name to the reduced data frame (note: this is a reference, not a copy)
user_data_to_clean = user_data_to_clean_del_last_col
But on eyeballing the data I noticed that some cells of the column first_login hold a later date than the corresponding cell of last_activity. These cells need to be swapped, since it is not possible to have first_login > last_activity.
In [557]:
# Run a loop through the data frame and check each row for this anomaly; if found, swap the values
for index, row in user_data_to_clean.iterrows():
    if row.first_login > row.last_activity:
        temp_date_var = row.first_login
        user_data_to_clean.set_value(index, 'first_login', row.last_activity)
        user_data_to_clean.set_value(index, 'last_activity', temp_date_var)
        #print "\tSw\t" + "first\t" + row.first_login.isoformat() + "\tlast\t" + row.last_activity.isoformat()
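An equivalent, vectorized way to do the same swap (a sketch that avoids the row-wise loop by selecting the anomalous rows with a boolean mask):
In [ ]:
# Vectorized version of the swap above: mask the rows where first_login > last_activity
# and exchange the two columns in a single assignment
swap_mask = user_data_to_clean['first_login'] > user_data_to_clean['last_activity']
user_data_to_clean.loc[swap_mask, ['first_login', 'last_activity']] = \
    user_data_to_clean.loc[swap_mask, ['last_activity', 'first_login']].values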
In [558]:
# Create new column 'age_on_platform'; the difference of two datetime columns is a timedelta
user_data_to_clean["age_on_platform"] = user_data_to_clean["last_activity"] - user_data_to_clean["first_login"]
In [559]:
# Check the result in the first few rows
user_data_to_clean["age_on_platform"].head(5)
Out[559]:
In [560]:
# Let's check the health of the data set
user_data_to_clean.info()
The second column of the table above gives the number of non-null values in the respective column. As is visible, the columns of interest to us are mostly empty; e.g. num_modules_consumed has only 69 values out of a possible 371.
In [561]:
# Let's remove all columns from the data set that do not have to be imputed
user_data_to_impute = user_data_to_clean.drop(["user_id", "watching_videos", "num_of_days_steps_tracked", "num_of_days_weight_tracked", "insulin_a1c_count", "weight", "height", "bmi", "age", "gender", "has_diabetes", "first_login", "last_activity", "age_on_platform", "hemoglobin_count", "cholesterol_count"], axis=1)
In [562]:
user_data_to_impute.info()
In the future, this method could be combined with the mean imputation method, so that values not covered by KNN get replaced with mean values (a sketch of this combination follows the KNN cells below).
In [563]:
# Import Imputation method KNN
##from fancyimpute import KNN
In [564]:
# First let's convert the Pandas DataFrame into a NumPy array. We do this since the data frame needs to be transposed,
# which is only possible if the format is a NumPy array.
##user_data_to_impute_np_array = user_data_to_impute.as_matrix()
# Lets Transpose it
##user_data_to_impute_np_array_transposed = user_data_to_impute_np_array.T
In [565]:
# Run the KNN method on the data. Function usage: X_filled_knn = KNN(k=3).complete(X_incomplete)
##user_data_imputed_knn_np_array = KNN(k=5).complete(user_data_to_impute_np_array_transposed)
The above three steps are for KNN-based imputation, which did not work well: 804 items could not be imputed and were replaced with zero.
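For completeness, here is a hedged sketch of the KNN-plus-mean combination mentioned earlier, kept commented out like the KNN cells above. It assumes fancyimpute's old KNN(k).complete() API and that unimputable cells come back as zero:
In [ ]:
# Sketch only: KNN imputation with a column-mean fallback for the cells KNN misses
##X = user_data_to_impute.as_matrix()
##missing_mask = np.isnan(X)                          # remember which cells were empty
##X_knn = KNN(k=5).complete(X)
##col_means = np.nanmean(X, axis=0)                   # per-column means, ignoring NaNs
##rows, cols = np.where(missing_mask & (X_knn == 0))  # originally-missing cells KNN left at zero
##X_knn[rows, cols] = col_means[cols]                 # substitute the column mean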
In [566]:
# Let's use a simpler method that is provided by Scikit-Learn itself
# import the function
from sklearn.preprocessing import Imputer
In [567]:
# Create an object of class Imputer, with the relevant parameters
imputer_object = Imputer(missing_values='NaN', strategy='mean', axis=0, copy=False)
In [568]:
# Impute the data and save the generated Numpy array
user_data_imputed_np_array = imputer_object.fit_transform(user_data_to_impute)
In [569]:
# Create a list with the column names for all existing columns in the NumPy array;
# the exact order of the columns has to be maintained
column_names_of_imputed_np_array = ['num_modules_consumed', 'num_glucose_tracked', 'num_of_days_food_tracked']
# create the Pandas data frame from the Numpy array
user_data_imputed_data_frame = pd.DataFrame(user_data_imputed_np_array, columns=column_names_of_imputed_np_array)
# Check if the data frame created now is proper
user_data_imputed_data_frame.info()
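Rather than hand-typing the column names, they can also be taken straight from the frame that was fed to the imputer, which guarantees the order matches (a small alternative sketch):
In [ ]:
# Alternative: derive the column names from user_data_to_impute itself,
# since the imputed array keeps the column order of its input
column_names_of_imputed_np_array = list(user_data_to_impute.columns)
user_data_imputed_data_frame = pd.DataFrame(user_data_imputed_np_array,
                                            columns=column_names_of_imputed_np_array)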
In [570]:
# Using the Series constructor from Pandas
user_data_imputed_data_frame['last_activity'] = pd.Series(user_data_to_clean['last_activity'])
user_data_imputed_data_frame['age_on_platform'] = pd.Series(user_data_to_clean['age_on_platform'])
# Check if everything is OK
user_data_imputed_data_frame.info()
In [571]:
# fillna(0) fills all blank (NaN) cells with 0
user_data_imputed_data_frame['watching_videos'] = pd.Series(user_data_to_clean['watching_videos'].fillna(0))
user_data_imputed_data_frame.info()
In [572]:
# Since only these two columns have null values, we can run *dropna()* on the whole data frame
# All rows with missing data get dropped
user_data_imputed_data_frame.dropna(axis=0, inplace=True)
user_data_imputed_data_frame.info()
Now comes the code that, based on the rules mentioned below, labels the provided data so it can be used as training data for the classifier.
This table defines the set of rules used to assign labels to the training data:
| label | age_on_platform | last_activity | num_modules_consumed | num_of_days_food_tracked | num_glucose_tracked | watching_videos |
|---|---|---|---|---|---|---|
| Generic (ignore) | Converted to days | Measured from 16 Apr | Good >= 3/week, Bad < 3/week | Good >= 30, Bad < 30 | Good >= 4/week, Bad < 4/week | Good = 1, Bad = 0 |
| good_new_user = 1 | >= 30 days && < 180 | <= 2 days | >= 12 | >= 20 | >= 16 | Good = 1 |
| bad_new_user = 2 | >= 30 days && < 180 | > 2 days | < 12 | < 20 | < 16 | Bad = 0 |
| good_mid_term_user = 3 | >= 180 days && < 360 | <= 7 days | >= 48 | >= 30 | >= 96 | Good = 1 |
| bad_mid_term_user = 4 | >= 180 days && < 360 | > 7 days | < 48 | < 30 | < 96 | Bad = 0 |
| good_long_term_user = 5 | >= 360 days | <= 14 days | >= 48 | >= 30 | >= 192 | Good = 1 |
| bad_long_term_user = 6 | >= 360 days | > 14 days | < 48 | < 30 | < 192 | Bad = 0 |
In [573]:
# This if/else section bins the rows based on the criteria for labels mentioned in the table above.
# Note that this assignment binds a second name to the same frame, not a copy
user_data_imputed_data_frame_labeled = user_data_imputed_data_frame

# Recency of last_activity is measured from 16 Apr 2016, as per the rules table above
measurement_date = np.datetime64('2016-04-16')

for index, row in user_data_imputed_data_frame.iterrows():
    if row["age_on_platform"] >= np.timedelta64(30, 'D') and row["age_on_platform"] < np.timedelta64(180, 'D'):
        if (measurement_date - row['last_activity']) <= np.timedelta64(2, 'D') and\
           row['num_modules_consumed'] >= 12 and\
           row['num_of_days_food_tracked'] >= 20 and\
           row['num_glucose_tracked'] >= 16 and\
           row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 1)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 2)
    elif row["age_on_platform"] >= np.timedelta64(180, 'D') and row["age_on_platform"] < np.timedelta64(360, 'D'):
        if (measurement_date - row['last_activity']) <= np.timedelta64(7, 'D') and\
           row['num_modules_consumed'] >= 48 and\
           row['num_of_days_food_tracked'] >= 30 and\
           row['num_glucose_tracked'] >= 96 and\
           row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 3)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 4)
    elif row["age_on_platform"] >= np.timedelta64(360, 'D'):
        if (measurement_date - row['last_activity']) <= np.timedelta64(14, 'D') and\
           row['num_modules_consumed'] >= 48 and\
           row['num_of_days_food_tracked'] >= 30 and\
           row['num_glucose_tracked'] >= 192 and\
           row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 5)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 6)
    else:
        user_data_imputed_data_frame_labeled.set_value(index, 'label', 0)

user_data_imputed_data_frame_labeled['label'].unique()
Out[573]:
In [574]:
# Look at basic info for this Labeled data frame
user_data_imputed_data_frame_labeled.info()
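Besides unique(), the distribution of labels is worth checking, since a heavily skewed class balance affects how the tree trains (a quick check):
In [ ]:
# Count how many users landed in each label bin
user_data_imputed_data_frame_labeled['label'].value_counts()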
In [575]:
# Let's start with the column last_activity.
# This function takes a datetime64 value and converts it into a float
# representing the number of seconds elapsed since the Unix epoch
def convert_datetime64_to_from_epoch(dt64):
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    return ts

# Let's apply this function to the last_activity column
user_data_imputed_data_frame_labeled_datetime64_converted = user_data_imputed_data_frame_labeled
user_data_imputed_data_frame_labeled_datetime64_converted['last_activity'] = user_data_imputed_data_frame_labeled['last_activity'].apply(convert_datetime64_to_from_epoch)
user_data_imputed_data_frame_labeled_datetime64_converted.info()
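A quick sanity check of the conversion on a known date (1971-01-01 is exactly 365 days, i.e. 365 * 86400 = 31,536,000 seconds, after the epoch):
In [ ]:
# Should print 31536000.0
print convert_datetime64_to_from_epoch(np.datetime64('1971-01-01T00:00:00Z'))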
In [576]:
# Now it's time to convert the timedelta64 column named age_on_platform
def convert_timedelta64_to_sec(td64):
    ts = td64 / np.timedelta64(1, 's')
    return ts
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted = user_data_imputed_data_frame_labeled_datetime64_converted
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted['age_on_platform'] = user_data_imputed_data_frame_labeled_datetime64_converted['age_on_platform'].apply(convert_timedelta64_to_sec)
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.info()
In [577]:
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.describe()
Out[577]:
In [578]:
# Save the labeled data frame as an Excel file
from pandas import options
options.io.excel.xlsx.writer = 'xlsxwriter'
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.to_excel('user_data_imputed_data_frame_labeled.xlsx')
For training the model we need two lists: one with only the Labels column, and a second that is a list of lists, with each sub-list containing the full row of feature columns.
In [579]:
# Total number of rows is 302; 30% of that is ~90. Rows from index 90 onward form the training set
user_data_imputed_data_frame_labeled_training = user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.ix[90:]
user_data_imputed_data_frame_labeled_training.info()
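This is a straight positional split; if the rows carry any ordering, a randomized split such as scikit-learn's train_test_split would be safer. A sketch (not what the following cells use; train_test_split lives in sklearn.model_selection in newer versions):
In [ ]:
# Hedged alternative: a shuffled 70/30 split of the same frame
from sklearn.cross_validation import train_test_split
train_df, test_df = train_test_split(
    user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted,
    test_size=0.3, random_state=42)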
In [580]:
# Let's first make our list from the Labels column
label_list = user_data_imputed_data_frame_labeled_training['label'].values.tolist()
# Check data type of elements of the list
type(label_list[0])
Out[580]:
In [581]:
# Let's convert the data type of all elements of the list to int
label_list_training = map(int, label_list)
# Check data type of elements of the list
type(label_list_training[5])
Out[581]:
In [582]:
# Now to create the other list of lists, with features as elements.
# Before that, we have to remove the Labels column
user_data_imputed_data_frame_UNlabeled_training = user_data_imputed_data_frame_labeled_training.drop(['label'], axis=1)
user_data_imputed_data_frame_UNlabeled_training.info()
In [583]:
# As you may notice, the data type of watching_videos is float, while it should be int
user_data_imputed_data_frame_UNlabeled_training['watching_videos'] = user_data_imputed_data_frame_UNlabeled_training['watching_videos'].apply(lambda x: int(x))
user_data_imputed_data_frame_UNlabeled_training.info()
In [584]:
# Finally, let's create the list of lists from the row contents
features_list_training = map(list, user_data_imputed_data_frame_UNlabeled_training.values)
In [585]:
from sklearn import tree
In [586]:
classifier = tree.DecisionTreeClassifier() # We create an instance of the Decision tree object
classifier = classifier.fit(features_list_training, label_list_training) # Train the classifier
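Once the classifier is fitted, its feature_importances_ attribute shows how much each column contributes to the splits (a quick inspection):
In [ ]:
# Pair each feature name with its importance score from the fitted tree
for name, score in zip(user_data_imputed_data_frame_UNlabeled_training.columns,
                       classifier.feature_importances_):
    print name, score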
In [587]:
# Testing data is the first 90 rows (.ix slicing is label-inclusive, so :89 gives rows 0-89
# and avoids overlapping with the training set, which starts at row 90)
user_data_imputed_data_frame_labeled_testing = user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.ix[:89]
# Take the labels in a separate list
label_list_test = user_data_imputed_data_frame_labeled_testing['label'].values.tolist()
label_list_test = map(int, label_list_test)
# Drop the Label column
user_data_imputed_data_frame_UNlabeled_testing = user_data_imputed_data_frame_labeled_testing.drop(['label'], axis=1)
# Check if everything looks OK
user_data_imputed_data_frame_UNlabeled_testing.info()
In [588]:
# Finally, let's create the list of lists from the row contents for testing
features_list_test = map(list, user_data_imputed_data_frame_UNlabeled_testing.values)
In [589]:
len(features_list_test)
Out[589]:
In [592]:
# The prediction results for the first twenty rows of the test data set
print list(classifier.predict(features_list_test[:20]))
In [593]:
# The labels for the test data set, as assigned by the labelling code above
print label_list_test[:20]
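Rather than eyeballing the first twenty predictions against the labels, sklearn.metrics can score the whole test set (a minimal evaluation sketch):
In [ ]:
# Overall accuracy of the tree on the held-out test rows
from sklearn.metrics import accuracy_score
print accuracy_score(label_list_test, classifier.predict(features_list_test))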
In [ ]: