Hello World! This notebook describes the decision-tree-based machine learning model I built to segment the users of the Habits app.
In [546]:
# This is to clear all variable values
%reset
In [547]:
# Import the required modules
import pandas as pd
import numpy as np
#import scipy as sp
In [548]:
# Read in the user data file.
# The parse_dates argument takes a list of columns to be parsed as dates
user_data_raw = pd.read_csv("janacare_user-engagement_Aug2014-Apr2016.csv", parse_dates = [-3,-2,-1])
In [549]:
# data metrics
user_data_raw.shape # Rows, columns
Out[549]:
In [550]:
# data metrics
user_data_raw.dtypes # data type of columns
Out[550]:
The column name watching_videos (binary - 1 for yes, blank/0 for no) is too long and contains special characters, so let's shorten it to watching_videos.
In [551]:
user_data_to_clean = user_data_raw.rename(columns = {'watching_videos (binary - 1 for yes, blank/0 for no)':'watching_videos'})
In [552]:
# Some basic statistical information on the data
user_data_to_clean.describe()
Out[552]:
In the last section of looking around, I saw that a lot of rows have missing or garbage values (see the first row of the table above). This can cause errors when computing anything using these rows, hence a clean-up is required.
We will clean up only those columns that are used as features.
The next two columns will not be cleaned, as they contain time data which, in my opinion, should not be imputed.
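As a quick check before cleaning, counting the missing values per column shows where imputation will be needed (a small sketch using pandas):
In [ ]:
# Count the missing (NaN) cells in each column, worst offenders first
user_data_to_clean.isnull().sum().sort_values(ascending=False)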
In [553]:
# Let's check the health of the data set
user_data_to_clean.info()
As is visible from the data type of the last column (age_on_platform), Pandas is not recognising it as a date format. This will make things difficult, so I delete this particular column and add a new one, since the data in age_on_platform can be recreated as age_on_platform = last_activity - first_login.
In [554]:
# Let's first delete the last column
user_data_to_clean_del_last_col = user_data_to_clean.drop("age_on_platform", axis=1)
In [555]:
# Check if the column has been deleted; the number of columns should change from 19 to 18
user_data_to_clean_del_last_col.shape
Out[555]:
In [556]:
# Rebind the working name to the reduced data frame (note: this is a reference, not a copy)
user_data_to_clean = user_data_to_clean_del_last_col
But on eyeballing the data I noticed that some cells of the column first_login hold a later date than the corresponding cell of last_activity. These cells need to be swapped, since it is not possible to have first_login > last_activity.
In [557]:
# Run a loop through the data frame and check each row for this anomaly; if found, swap the values
for index, row in user_data_to_clean.iterrows():
    if row.first_login > row.last_activity:
        temp_date_var = row.first_login
        user_data_to_clean.set_value(index, 'first_login', row.last_activity)
        user_data_to_clean.set_value(index, 'last_activity', temp_date_var)
        #print "\tSw\t" + "first\t" + row.first_login.isoformat() + "\tlast\t" + row.last_activity.isoformat()
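An equivalent, vectorized way to do the same swap (a sketch that avoids the row-wise loop by selecting the anomalous rows with a boolean mask):
In [ ]:
# Vectorized version of the swap above: mask the rows where first_login > last_activity
# and exchange the two columns in a single assignment
swap_mask = user_data_to_clean['first_login'] > user_data_to_clean['last_activity']
user_data_to_clean.loc[swap_mask, ['first_login', 'last_activity']] = \
    user_data_to_clean.loc[swap_mask, ['last_activity', 'first_login']].values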
In [558]:
# Create new column 'age_on_platform'; the difference of two datetime columns is a timedelta
user_data_to_clean["age_on_platform"] = user_data_to_clean["last_activity"] - user_data_to_clean["first_login"]
In [559]:
# Check the result in the first few rows
user_data_to_clean["age_on_platform"].head(5)
Out[559]:
In [560]:
# Let's check the health of the data set
user_data_to_clean.info()
The second column of the table above gives the number of non-null values in the respective column. As is visible, the columns of interest to us are mostly empty; e.g. num_modules_consumed has only 69 values out of a possible 371.
In [561]:
# Let's remove all columns from the data set that do not have to be imputed
user_data_to_impute = user_data_to_clean.drop(["user_id", "watching_videos", "num_of_days_steps_tracked", "num_of_days_weight_tracked", "insulin_a1c_count", "weight", "height", "bmi", "age", "gender", "has_diabetes", "first_login", "last_activity", "age_on_platform", "hemoglobin_count", "cholesterol_count"], axis=1)
In [562]:
user_data_to_impute.info()
In the future, this method could be combined with the mean imputation method, so that values not covered by KNN get replaced with mean values (a sketch of this combination follows the KNN cells below).
In [563]:
# Import Imputation method KNN
##from fancyimpute import KNN
In [564]:
# First let's convert the Pandas DataFrame into a NumPy array. We do this since the data frame needs to be transposed,
# which is only possible if the format is a NumPy array.
##user_data_to_impute_np_array = user_data_to_impute.as_matrix()
# Lets Transpose it
##user_data_to_impute_np_array_transposed = user_data_to_impute_np_array.T
In [565]:
# Run the KNN method on the data. Function usage: X_filled_knn = KNN(k=3).complete(X_incomplete)
##user_data_imputed_knn_np_array = KNN(k=5).complete(user_data_to_impute_np_array_transposed)
The above three steps are for KNN-based imputation, which did not work well: 804 items could not be imputed and were replaced with zero.
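For completeness, here is a hedged sketch of the KNN-plus-mean combination mentioned earlier, kept commented out like the KNN cells above. It assumes fancyimpute's old KNN(k).complete() API and that unimputable cells come back as zero:
In [ ]:
# Sketch only: KNN imputation with a column-mean fallback for the cells KNN misses
##X = user_data_to_impute.as_matrix()
##missing_mask = np.isnan(X)                          # remember which cells were empty
##X_knn = KNN(k=5).complete(X)
##col_means = np.nanmean(X, axis=0)                   # per-column means, ignoring NaNs
##rows, cols = np.where(missing_mask & (X_knn == 0))  # originally-missing cells KNN left at zero
##X_knn[rows, cols] = col_means[cols]                 # substitute the column mean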
In [566]:
# Let's use a simpler method that is provided by Scikit-Learn itself
# import the function
from sklearn.preprocessing import Imputer
In [567]:
# Create an object of class Imputer, with the relevant parameters
imputer_object = Imputer(missing_values='NaN', strategy='mean', axis=0, copy=False)
In [568]:
# Impute the data and save the generated Numpy array
user_data_imputed_np_array = imputer_object.fit_transform(user_data_to_impute)
In [569]:
# Create a list with the column names for all existing columns in the NumPy array;
# the exact order of the columns has to be maintained
column_names_of_imputed_np_array = ['num_modules_consumed', 'num_glucose_tracked', 'num_of_days_food_tracked']
# create the Pandas data frame from the Numpy array
user_data_imputed_data_frame = pd.DataFrame(user_data_imputed_np_array, columns=column_names_of_imputed_np_array)
# Check if the data frame created now is proper
user_data_imputed_data_frame.info()
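Rather than hand-typing the column names, they can also be taken straight from the frame that was fed to the imputer, which guarantees the order matches (a small alternative sketch):
In [ ]:
# Alternative: derive the column names from user_data_to_impute itself,
# since the imputed array keeps the column order of its input
column_names_of_imputed_np_array = list(user_data_to_impute.columns)
user_data_imputed_data_frame = pd.DataFrame(user_data_imputed_np_array,
                                            columns=column_names_of_imputed_np_array)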
In [570]:
# Using the Series constructor from Pandas
user_data_imputed_data_frame['last_activity'] = pd.Series(user_data_to_clean['last_activity'])
user_data_imputed_data_frame['age_on_platform'] = pd.Series(user_data_to_clean['age_on_platform'])
# Check if everything is OK
user_data_imputed_data_frame.info()
In [571]:
# fillna(0) fills all blank (NaN) cells with 0
user_data_imputed_data_frame['watching_videos'] = pd.Series(user_data_to_clean['watching_videos'].fillna(0))
user_data_imputed_data_frame.info()
In [572]:
# Since only these two columns have null values, we can run *dropna()* on the whole data frame
# All rows with missing data get dropped
user_data_imputed_data_frame.dropna(axis=0, inplace=True)
user_data_imputed_data_frame.info()
Now comes the code that, based on the rules mentioned below, labels the provided data so it can be used as training data for the classifier.
This table defines the set of rules used to assign labels to the training data:
| label | age_on_platform | last_activity | num_modules_consumed | num_of_days_food_tracked | num_glucose_tracked | watching_videos |
|---|---|---|---|---|---|---|
| Generic (ignore) | Converted to days | Measured from 16 Apr | Good >= 3/week, Bad < 3/week | Good >= 30, Bad < 30 | Good >= 4/week, Bad < 4/week | Good = 1, Bad = 0 |
| good_new_user = 1 | >= 30 days && < 180 | <= 2 days | >= 12 | >= 20 | >= 16 | Good = 1 |
| bad_new_user = 2 | >= 30 days && < 180 | > 2 days | < 12 | < 20 | < 16 | Bad = 0 |
| good_mid_term_user = 3 | >= 180 days && < 360 | <= 7 days | >= 48 | >= 30 | >= 96 | Good = 1 |
| bad_mid_term_user = 4 | >= 180 days && < 360 | > 7 days | < 48 | < 30 | < 96 | Bad = 0 |
| good_long_term_user = 5 | >= 360 days | <= 14 days | >= 48 | >= 30 | >= 192 | Good = 1 |
| bad_long_term_user = 6 | >= 360 days | > 14 days | < 48 | < 30 | < 192 | Bad = 0 |
In [573]:
# This if/else section bins the rows based on the criteria for labels mentioned in the table above.
# Note that this assignment binds a second name to the same frame, not a copy
user_data_imputed_data_frame_labeled = user_data_imputed_data_frame

# Recency of last_activity is measured from 16 Apr 2016, as per the rules table above
measurement_date = np.datetime64('2016-04-16')

for index, row in user_data_imputed_data_frame.iterrows():
    if row["age_on_platform"] >= np.timedelta64(30, 'D') and row["age_on_platform"] < np.timedelta64(180, 'D'):
        if (measurement_date - row['last_activity']) <= np.timedelta64(2, 'D') and\
           row['num_modules_consumed'] >= 12 and\
           row['num_of_days_food_tracked'] >= 20 and\
           row['num_glucose_tracked'] >= 16 and\
           row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 1)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 2)
    elif row["age_on_platform"] >= np.timedelta64(180, 'D') and row["age_on_platform"] < np.timedelta64(360, 'D'):
        if (measurement_date - row['last_activity']) <= np.timedelta64(7, 'D') and\
           row['num_modules_consumed'] >= 48 and\
           row['num_of_days_food_tracked'] >= 30 and\
           row['num_glucose_tracked'] >= 96 and\
           row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 3)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 4)
    elif row["age_on_platform"] >= np.timedelta64(360, 'D'):
        if (measurement_date - row['last_activity']) <= np.timedelta64(14, 'D') and\
           row['num_modules_consumed'] >= 48 and\
           row['num_of_days_food_tracked'] >= 30 and\
           row['num_glucose_tracked'] >= 192 and\
           row['watching_videos'] == 1:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 5)
        else:
            user_data_imputed_data_frame_labeled.set_value(index, 'label', 6)
    else:
        user_data_imputed_data_frame_labeled.set_value(index, 'label', 0)

user_data_imputed_data_frame_labeled['label'].unique()
Out[573]:
In [574]:
# Look at basic info for this Labeled data frame
user_data_imputed_data_frame_labeled.info()
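Besides unique(), the distribution of labels is worth checking, since a heavily skewed class balance affects how the tree trains (a quick check):
In [ ]:
# Count how many users landed in each label bin
user_data_imputed_data_frame_labeled['label'].value_counts()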
In [575]:
# Let's start with the column last_activity.
# This function takes a datetime64 value and converts it into a float
# representing the number of seconds elapsed since the Unix epoch
def convert_datetime64_to_from_epoch(dt64):
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
    return ts

# Let's apply this function to the last_activity column
user_data_imputed_data_frame_labeled_datetime64_converted = user_data_imputed_data_frame_labeled
user_data_imputed_data_frame_labeled_datetime64_converted['last_activity'] = user_data_imputed_data_frame_labeled['last_activity'].apply(convert_datetime64_to_from_epoch)
user_data_imputed_data_frame_labeled_datetime64_converted.info()
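A quick sanity check of the conversion on a known date (1971-01-01 is exactly 365 days, i.e. 365 * 86400 = 31,536,000 seconds, after the epoch):
In [ ]:
# Should print 31536000.0
print convert_datetime64_to_from_epoch(np.datetime64('1971-01-01T00:00:00Z'))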
In [576]:
# Now it's time to convert the timedelta64 column named age_on_platform
def convert_timedelta64_to_sec(td64):
    ts = td64 / np.timedelta64(1, 's')
    return ts
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted = user_data_imputed_data_frame_labeled_datetime64_converted
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted['age_on_platform'] = user_data_imputed_data_frame_labeled_datetime64_converted['age_on_platform'].apply(convert_timedelta64_to_sec)
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.info()
In [577]:
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.describe()
Out[577]:
In [578]:
# Save the labeled data frame as an Excel file
from pandas import options
options.io.excel.xlsx.writer = 'xlsxwriter'
user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.to_excel('user_data_imputed_data_frame_labeled.xlsx')
For training the model we need two lists: one with only the Labels column, and a second that is a list of lists, with each sub-list containing the full row of feature columns.
In [579]:
# Total number of rows is 302; 30% of that is ~90. Rows from index 90 onward form the training set
user_data_imputed_data_frame_labeled_training = user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.ix[90:]
user_data_imputed_data_frame_labeled_training.info()
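This is a straight positional split; if the rows carry any ordering, a randomized split such as scikit-learn's train_test_split would be safer. A sketch (not what the following cells use; train_test_split lives in sklearn.model_selection in newer versions):
In [ ]:
# Hedged alternative: a shuffled 70/30 split of the same frame
from sklearn.cross_validation import train_test_split
train_df, test_df = train_test_split(
    user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted,
    test_size=0.3, random_state=42)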
In [580]:
# Let's first make our list from the Labels column
label_list = user_data_imputed_data_frame_labeled_training['label'].values.tolist()
# Check data type of elements of the list
type(label_list[0])
Out[580]:
In [581]:
# Let's convert the data type of all elements of the list to int
label_list_training = map(int, label_list)
# Check data type of elements of the list
type(label_list_training[5])
Out[581]:
In [582]:
# Now to create the other list of lists, with features as elements.
# Before that, we have to remove the Labels column
user_data_imputed_data_frame_UNlabeled_training = user_data_imputed_data_frame_labeled_training.drop(['label'], axis=1)
user_data_imputed_data_frame_UNlabeled_training.info()
In [583]:
# As you may notice, the data type of watching_videos is float, while it should be int
user_data_imputed_data_frame_UNlabeled_training['watching_videos'] = user_data_imputed_data_frame_UNlabeled_training['watching_videos'].apply(lambda x: int(x))
user_data_imputed_data_frame_UNlabeled_training.info()
In [584]:
# Finally, let's create the list of lists from the row contents
features_list_training = map(list, user_data_imputed_data_frame_UNlabeled_training.values)
In [585]:
from sklearn import tree
In [586]:
classifier = tree.DecisionTreeClassifier() # We create an instance of the Decision tree object
classifier = classifier.fit(features_list_training, label_list_training) # Train the classifier
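Once the classifier is fitted, its feature_importances_ attribute shows how much each column contributes to the splits (a quick inspection):
In [ ]:
# Pair each feature name with its importance score from the fitted tree
for name, score in zip(user_data_imputed_data_frame_UNlabeled_training.columns,
                       classifier.feature_importances_):
    print name, score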
In [587]:
# Testing data is the first 90 rows (.ix slicing is label-inclusive, so :89 gives rows 0-89
# and avoids overlapping with the training set, which starts at row 90)
user_data_imputed_data_frame_labeled_testing = user_data_imputed_data_frame_labeled_datetime64_timedelta64_converted.ix[:89]
# Take the labels in a separate list
label_list_test = user_data_imputed_data_frame_labeled_testing['label'].values.tolist()
label_list_test = map(int, label_list_test)
# Drop the Label column
user_data_imputed_data_frame_UNlabeled_testing = user_data_imputed_data_frame_labeled_testing.drop(['label'], axis=1)
# Check if everything looks OK
user_data_imputed_data_frame_UNlabeled_testing.info()
In [588]:
# Finally, let's create the list of lists from the row contents for testing
features_list_test = map(list, user_data_imputed_data_frame_UNlabeled_testing.values)
In [589]:
len(features_list_test)
Out[589]:
In [592]:
# The prediction results for the first twenty rows of the test data set
print list(classifier.predict(features_list_test[:20]))
In [593]:
# The labels for the test data set, as assigned by the labelling code above
print label_list_test[:20]
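Rather than eyeballing the first twenty predictions against the labels, sklearn.metrics can score the whole test set (a minimal evaluation sketch):
In [ ]:
# Overall accuracy of the tree on the held-out test rows
from sklearn.metrics import accuracy_score
print accuracy_score(label_list_test, classifier.predict(features_list_test))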
In [ ]: