Brian McKean

===========

BioTrak Health

https://www.biotrakhealth.com/

Session Management

Identify ineffective sessions during user trials

===========

Galvanize Data Science Immersion

Capstone Project

First data exploration



In [1]:

    
from docx import Document
from docx.shared import Inches
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid.parasite_axes import SubplotHost
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
import seaborn as sns
import requests
import scipy.stats as stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.externals import joblib
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_curve
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

Get Data from files



In [2]:

    
# Read in data
df_users = pd.read_json('users_02_26_2017.txt')
df_sessions = pd.read_json('sessions_02_26_2017.txt')
print df_users.columns
print df_sessions.columns









    



Index([u'YOB', u'_id', u'email', u'first_name', u'gender', u'last_name',
       u'userid'],
      dtype='object')
Index([u'Build#', u'Device_info_id', u'End_GMT_time', u'End_local_time',
       u'GMT_Date', u'GMT_date', u'Local_date', u'Session_data_header',
       u'Session_name', u'Session_name:', u'Session_time_interval',
       u'Session_type', u'Start_GMT_time', u'Start_local_time',
       u'Start_time_zone', u'User_info_id', u'Version#', u'_id',
       u'session_data', u'user_info_id', u'userid'],
      dtype='object')

Initial Data Exploration



In [3]:

    
print len(df_users), " users are set up"
print len(df_sessions), "sessions are uploaded"









    



66  users are set up
356 sessions are uploaded



In [4]:

    
# Read Ratings File
df_ratings = pd.read_csv('data/ratings.csv')
print "Number of ratings is ",len(df_ratings)









    



Number of ratings is  356



In [5]:

    
print "Rated Bad:", df_ratings[df_ratings['rating']=='Bad'].count()









    



Rated Bad: Unnamed: 0     237
session_num    237
rating         237
dtype: int64



In [6]:

    
print "Rated Good:",df_ratings[df_ratings['rating']=='Good'].count()









    



Rated Good: Unnamed: 0     119
session_num    119
rating         119
dtype: int64



In [7]:

    
df_sessions = df_sessions.ix[:356,:]

Set up scoring

After a review session the project firmware engineer and I graded 356 sessions. There were 237 bad and 119 good ratings. a

In addition to coming up with the ratings we jointly decided on some heuristics for determing a good and bad sessions

If the session was less than 60 seconds the session is bad. It is possible that a user may have an properly operating device over the 60 seconds, but we consider that not an effective use as there is not enough time to use the sessions as designed.
If the session, after 60 seconds, goes to 0 (Min) or 4095 (Max), the session is bad. Some of the sessions in this category show a significant amount of good data, but the fact that the user had problems idnicate someting that needs to be addressed.
- These two extremes indicate that an electrode has lost contact with the skin. Once a session has started the electrodes should continue to make proper contact until the end of the session.
Low signal values are not a problem.
Signal may be smooth with gradual peaks or there may be sharp peaks. The peaks come from activities such as jaw clenching that may be part of a session or may simply be the user exploring the operation of the device.

Add New Features

In order to incororate the heuristics I add the following features

min_len (T/F) -- is the length at least 3600 units (60 seconds)
pegged_L (T/F) -- at least one value of 0 after 60 seconds
pegged_H (T/F) -- at least one value of 4095 afte 60 seconds



In [8]:

    
# add ratings to session data
df_sessions['rating'] = df_ratings['rating']



In [9]:

    
print df_sessions.columns









    



Index([               u'Build#',        u'Device_info_id',
                u'End_GMT_time',        u'End_local_time',
                    u'GMT_Date',              u'GMT_date',
                  u'Local_date',   u'Session_data_header',
                u'Session_name',         u'Session_name:',
       u'Session_time_interval',          u'Session_type',
              u'Start_GMT_time',      u'Start_local_time',
             u'Start_time_zone',          u'User_info_id',
                    u'Version#',                   u'_id',
                u'session_data',          u'user_info_id',
                      u'userid',                u'rating'],
      dtype='object')

A litte data exploration



In [10]:

    
df_sessions.Session_name.unique()









    Out[10]:





array([u'Open', u'Breathing for Relief', u'Head and Neck Relax',
       u'3 Minute Relax', u'Deep Breathing', u'Forest Visualization',
       u'Deep Relax for Sleep', u'Elevator Excercise',
       u'Beach Visualization', u'Cloud Visualization',
       u'Extended Exercise', u'Breath and Relax', u'Progress Muscle Relax'], dtype=object)



In [11]:

    
df_sessions.Session_type.unique()









    Out[11]:





array([u'open', u'managed'], dtype=object)



In [12]:

    
df_groups = df_sessions.groupby('Session_name')



In [13]:

    
df_groups['Device_info_id'].count()









    Out[13]:





Session_name
3 Minute Relax            32
Beach Visualization        9
Breath and Relax           4
Breathing for Relief      10
Cloud Visualization        6
Deep Breathing             6
Deep Relax for Sleep       6
Elevator Excercise         7
Extended Exercise          1
Forest Visualization       6
Head and Neck Relax       10
Open                     257
Progress Muscle Relax      2
Name: Device_info_id, dtype: int64



In [14]:

    
df_results = df_sessions[['Session_name','userid']]



In [15]:

    
# Score is True if session is bad
# we are looking for bad sessions
df_results['score'] = np.where(df_sessions['rating']=='Good', False, True)









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()

We have to change format for the time series data. It starts as a string of comma delimted values. The last value is uploaded incorrectly in some sessions so we discard it.



In [16]:

    
# Pull session data dictionary info into it's own clumns
dfNew = df_sessions.join(pd.DataFrame(df_sessions["session_data"].to_dict()).T)



In [17]:

    
dfNew.columns









    Out[17]:





Index([               u'Build#',        u'Device_info_id',
                u'End_GMT_time',        u'End_local_time',
                    u'GMT_Date',              u'GMT_date',
                  u'Local_date',   u'Session_data_header',
                u'Session_name',         u'Session_name:',
       u'Session_time_interval',          u'Session_type',
              u'Start_GMT_time',      u'Start_local_time',
             u'Start_time_zone',          u'User_info_id',
                    u'Version#',                   u'_id',
                u'session_data',          u'user_info_id',
                      u'userid',                u'rating',
                u'average_data',    u'average_data_count',
               u'baseline_data',         u'data_interval',
                  u'final_data',          u'graphed_data',
          u'graphed_data_count',              u'raw_data',
              u'raw_data_count'],
      dtype='object')



In [18]:

    
# Start calculation of features
# Convert session data from string to array of ints
# -- Drop the last entry as there was a bug that put in a bad last entry
temp = dfNew['average_data'].str.replace(" ","").str.split(',').str[:-1].map(lambda xx: np.array([int(yy) for yy in xx]))



In [19]:

    
# Some of the time series are empty, if so insert an entry of one sample at 0
# Signal is inverted -- need to change it 
temp = temp.map(lambda x: np.array( [0] if not len(x) else x ))
temp = temp.map(lambda x: 4095 - x)



In [20]:

    
# Get the mean
temp2 = temp.map(lambda x: x.mean())



In [21]:

    
temp2.fillna(0,inplace=True)
df_results['avg_data_mean'] = temp2









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [22]:

    
# Get the max
t_max = temp.map(lambda x: x.max())



In [23]:

    
df_results['avg_data_max'] = t_max









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':



In [24]:

    
# Get the min
df_results['avg_data_min'] = temp.map(lambda x: x.min())









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [25]:

    
# Get the length
df_results['avg_data_len'] = temp.map(lambda x: len(x))









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [26]:

    
# Get the standard deviation
df_results['avg_data_std'] = temp.map(lambda x: x.std())









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [27]:

    
df_results.columns









    Out[27]:





Index([ u'Session_name',        u'userid',         u'score', u'avg_data_mean',
        u'avg_data_max',  u'avg_data_min',  u'avg_data_len',  u'avg_data_std'],
      dtype='object')



In [28]:

    
# Load in session num 
df_results['session_num'] = df_results.index









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app



In [29]:

    
# Add heuristics
df_results['min_len'] = df_results['avg_data_len']>3600
df_results['pegged_low'] = [False if (len(x)<3600 or min(x[3600:])>0) else True for x in temp]
df_results['pegged_high'] = [False if (len(x)<3600 or max(x[3600:])<4095) else True for x in temp]









    



/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()
/Users/brianmckean/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy



In [30]:

    
df_results.columns









    Out[30]:





Index([ u'Session_name',        u'userid',         u'score', u'avg_data_mean',
        u'avg_data_max',  u'avg_data_min',  u'avg_data_len',  u'avg_data_std',
         u'session_num',       u'min_len',    u'pegged_low',   u'pegged_high'],
      dtype='object')



In [31]:

    
df_key_columns = [x  for x in list(df_results.columns) if x not  in ['Session_name','userid','score']]
print df_key_columns









    



['avg_data_mean', 'avg_data_max', 'avg_data_min', 'avg_data_len', 'avg_data_std', 'session_num', 'min_len', 'pegged_low', 'pegged_high']



In [32]:

    
df_res_save = df_results.copy()



In [33]:

    
y = df_results['score'].as_matrix()
X_df = df_results[df_key_columns]
X = X_df.as_matrix()



In [34]:

    
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)



In [35]:

    
df_results.describe()









    Out[35]:






  
    
      
      avg_data_mean
      avg_data_max
      avg_data_min
      avg_data_len
      avg_data_std
      session_num
    
  
  
    
      count
      356.000000
      356.000000
      356.000000
      356.000000
      356.000000
      356.000000
    
    
      mean
      2080.052353
      3521.893258
      1021.233146
      13404.457865
      602.631308
      177.500000
    
    
      std
      997.930074
      883.259504
      1220.579913
      34697.181848
      424.079052
      102.912584
    
    
      min
      0.000000
      0.000000
      0.000000
      1.000000
      0.000000
      0.000000
    
    
      25%
      1317.220337
      3357.250000
      0.000000
      1152.000000
      264.431434
      88.750000
    
    
      50%
      2090.857233
      3901.000000
      423.000000
      3736.000000
      508.647631
      177.500000
    
    
      75%
      2837.760875
      4047.000000
      1945.250000
      13222.000000
      939.515026
      266.250000
    
    
      max
      4095.000000
      4095.000000
      4095.000000
      382019.000000
      1667.963040
      355.000000



In [36]:

    
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train,y_train)



In [37]:

    
scores = cross_val_score(clf, X_train, y_train, cv=10)



In [38]:

    
print scores
print scores.mean()









    



[ 0.83333333  0.79166667  0.95833333  0.91666667  0.79166667  0.875
  0.95833333  0.875       0.82608696  0.82608696]
0.865217391304



In [39]:

    
pred = clf.predict(X_train)



In [40]:

    
print clf.feature_importances_

for i in range(len(clf.feature_importances_)):
    print X_df.columns[i], clf.feature_importances_[i]
    
clf_save = clf
clf_name_save = X_df.columns
X_save = X_df.as_matrix()









    



[ 0.07482741  0.05206467  0.09356981  0.28436927  0.10351105  0.08949767
  0.15199783  0.14889914  0.00126317]
avg_data_mean 0.074827409527
avg_data_max 0.0520646674285
avg_data_min 0.0935698076556
avg_data_len 0.284369266812
avg_data_std 0.10351104636
session_num 0.0894976678294
min_len 0.151997825363
pegged_low 0.148899139956
pegged_high 0.00126316906828



In [41]:

    
pred_test = clf.predict(X_test)



In [42]:

    
clf.score(X_test,y_test)









    Out[42]:





0.92372881355932202



In [43]:

    
clf.get_params









    Out[43]:





<bound method RandomForestClassifier.get_params of RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)>



In [44]:

    
y_prob = clf.predict_proba(X_test)[:,1:]



In [45]:

    
missed = np.where([pred_test != y_test])[1]



In [46]:

    
for idx in missed:
    print df_results.iloc[idx]['session_num']



In [47]:

    
for i in range(len(missed)):
    print i, clf.predict_proba(X_test)[i]









    



0 [ 0.  1.]
1 [ 0.  1.]
2 [ 0.7  0.3]
3 [ 0.  1.]
4 [ 0.  1.]
5 [ 0.2  0.8]
6 [ 0.  1.]
7 [ 0.  1.]
8 [ 0.8  0.2]



In [ ]:



In [48]:

    
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)



In [49]:

    
print precision, recall, thresholds









    



[ 0.75238095  0.84615385  0.90588235  0.94805195  0.97260274  1.          1.
  1.          1.          1.          1.        ] [ 1.          0.97468354  0.97468354  0.92405063  0.89873418  0.88607595
  0.87341772  0.83544304  0.82278481  0.65822785  0.        ] [ 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1. ]



In [50]:

    
df = pd.DataFrame()



In [51]:

    
df['precision'] = precision



In [52]:

    
print len(precision), len(recall), len(thresholds)
thresholds = np.append(thresholds,1.0)
print len(precision), len(recall), len(thresholds)



In [53]:

    
df['recall'] = recall
df['thresholds'] = thresholds



In [54]:

    
%matplotlib inline
df.plot()









    Out[54]:





<matplotlib.axes._subplots.AxesSubplot at 0x12e04b050>



In [55]:

    
byScore = df_results.groupby('score').count()



In [56]:

    
print byScore['avg_data_min']









    



score
False    119
True     237
Name: avg_data_min, dtype: int64



In [57]:

    
## Add in the cycles and largest swing features

def count_crosses(s, pts=10, intv=600):
    '''
    Looks at series
    for each 'intv' points
    - count crossings of 'pts' lookback moving average
    INPUTS:
        s = list of measurements
        pts = how many pts to collect for moving average
        intv = interval to measure crossing (600 = 10 sec)
    OUTPUTS
        crossings / intv = average crossing per interval
        biggest_move = on a crossing biggest move

    '''
    if len(s) < intv:
        return 0, 0
    if pts >= intv:
        return 0,0
    s = np.array(s)
    crossing_counts = [0]  
    index = pts
    max_swing = 0
    up = True # True for last cross up, False for down
    while (index + intv < len(s)):
        #print index
        interval_crossings = 0
        for i in range(intv):
            last_n = s[index-pts:index]
            avg = last_n.mean()
            if up and s[index] < avg:
                up = False
                interval_crossings += 1
                swing = abs(s[index]-s[index-1])
                if swing > max_swing:
                    max_swing = swing
            if not up and s[index] > avg:
                up = True
                interval_crossings += 1
                swing = abs(s[index]-s[index-1])
                if swing > max_swing:
                    max_swing = swing
            index += 1
        crossing_counts.append(interval_crossings)
   
    time_sec = intv/60.0
    return np.array(crossing_counts).mean()/(2*time_sec), max_swing



In [58]:

    
df_results3 = df_res_save.copy()
cycles = []
swings = []

look_back = [1]
#look_back = [3]

for look in look_back:
    c = []
    s = []
    for idx, elem in enumerate(temp):
        a,b = count_crosses(elem, pts=look)
        c.append(a)
        s.append(b)
    cycles.append(c)
    swings.append(s)
    df_results3['cycles_{0}'.format(look)] = c
    df_results3['swings_{0}'.format(look)] = s
    
    #plt.hist(cycles)
    #plt.hist(swings)



In [59]:

    
#df_results3



In [60]:

    
df4 = df_results.copy()
df4['cycles_1'] = df_results3['cycles_1']
df4['swings_1'] = df_results3['swings_1']



In [ ]:



In [61]:

    
sns_plot = sns.pairplot(df4)
sns_plot.savefig("sns-pairplot.png")



In [62]:

    
y2 = df4['score'].as_matrix()
X_df2 = df4[['avg_data_mean',
        'avg_data_max',  'avg_data_min',  'avg_data_std', 'avg_data_len',
            'cycles_1','swings_1','min_len','pegged_low','pegged_high']]
X2 = X_df2.as_matrix()



In [63]:

    
X_train, X_test, y_train, y_test = train_test_split(
    X2, y2, test_size=0.33, random_state=42)



In [64]:

    
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train,y_train)
clf = clf.fit(X_train, y_train)



In [65]:

    
clf_save = clf
clf_name_save = X_df2.columns
X_save = X_df2.as_matrix()



In [66]:

    
scores = cross_val_score(clf, X_train, y_train, cv=10)



In [67]:

    
print scores
print np.average(scores)









    



[ 0.79166667  0.95833333  0.875       0.875       0.875       0.95833333
  0.91666667  0.95833333  0.82608696  0.82608696]
0.886050724638



In [68]:

    
print clf.feature_importances_

for i in range(len(clf.feature_importances_)):
    print X_df2.columns[i], clf.feature_importances_[i]









    



[ 0.07064774  0.05054319  0.09737334  0.10564576  0.25162631  0.1348587
  0.0572746   0.13476878  0.09468364  0.00257796]
avg_data_mean 0.070647739072
avg_data_max 0.050543185561
avg_data_min 0.0973733381049
avg_data_std 0.10564575551
avg_data_len 0.251626310731
cycles_1 0.134858702122
swings_1 0.0572745976741
min_len 0.134768777261
pegged_low 0.094683635819
pegged_high 0.00257795814473



In [ ]:



In [69]:

    
precision, recall, thresholds = precision_recall_curve(y_test, y_prob)
print precision, recall, thresholds









    



[ 0.75238095  0.84615385  0.90588235  0.94805195  0.97260274  1.          1.
  1.          1.          1.          1.        ] [ 1.          0.97468354  0.97468354  0.92405063  0.89873418  0.88607595
  0.87341772  0.83544304  0.82278481  0.65822785  0.        ] [ 0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1. ]



In [70]:

    
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.set_ylabel('Precision and Recall',fontsize=20)
x1 = np.arange(0,10)/5.0
#ax.plot(x1,df['precision'])
labels = [item.get_text() for item in ax.get_xticklabels()]
labels[1] = 'Testing'
ax.plot(df['precision'])
ax.plot(df['recall'])
#ax.plot(df['thresholds'])
ax.set_xlim(0,10)
#labels = 
ax.set_xticklabels(x1)
ax.set_title('Precision and Recall Versus Probability Threshold',fontsize=24,y=1.08)
ax.set_xlabel('Probability Threshold',fontsize=20)
#plt.ylabel('Density',fontsize=18)
#plt.title('Average Frequency in Tension Measurement', fontsize=24)
ax.tick_params(axis='both',labelsize=16)
plt.legend(fontsize=20,loc=0)
plt.show()
plt.savefig('prec-recall.png',bbox_inches='tight')









    












    





<matplotlib.figure.Figure at 0x12819ba90>



In [71]:

    
df4_t = df4[df4['score']==True]



In [72]:

    
df4_f = df4[df4['score']==False]



In [73]:

    
# Set up the matplotlib figure

sns.distplot( df4_t['cycles_1'], label = 'Bad')
sns.distplot( df4_f['cycles_1'], label = 'Good')
plt.xlabel('Cycles Per Second',fontsize=20)
plt.ylabel('Density',fontsize=20)
plt.title('Average Frequency in Tension Measurement', fontsize=24, y=1.10)
plt.tick_params(axis='both',labelsize=16)
plt.legend(fontsize=20)
plt.savefig('feat-freq.png',bbox_inches='tight')



In [74]:

    
sns.distplot( df4_t['avg_data_mean'],label = 'True')
sns.distplot( df4_f['avg_data_mean'], label = 'False')
plt.legend()









    Out[74]:





<matplotlib.legend.Legend at 0x12c8afd90>



In [75]:

    
sns.distplot( df4_t['avg_data_len']/60.0,label = 'Bad')
sns.distplot( df4_f['avg_data_len']/60.0, label = 'Good')
plt.xlabel('Seconds',fontsize=20)
plt.ylabel('Density',fontsize=20)
plt.title('Length of Session', fontsize=24, y=1.08)
plt.tick_params(axis='both',labelsize=16)
plt.legend(fontsize=20)
plt.savefig('feat-session-length.png',bbox_inches='tight')



In [76]:

    
sns.distplot( df4_t['avg_data_max'],label = 'Bad')
sns.distplot( df4_f['avg_data_max'], label = 'Good')
plt.legend(loc=0)









    Out[76]:





<matplotlib.legend.Legend at 0x1207f61d0>



In [77]:

    
sns.distplot( df4_t['avg_data_min'],label = 'Bad')
sns.distplot( df4_f['avg_data_min'], label = 'Good')
plt.legend()









    Out[77]:





<matplotlib.legend.Legend at 0x120a3efd0>



In [78]:

    
sns.distplot( df4_t['avg_data_std']/4095.0,label = 'Bad')
sns.distplot( df4_f['avg_data_std']/4095.0, label = 'Good')
plt.xlabel('Std in 0-10 Tension Scale',fontsize=20)
plt.ylabel('Density',fontsize=20)
plt.title('Standard Deviation in Tension Measurement', fontsize=24,y=1.08)
plt.tick_params(axis='both',labelsize=16)
plt.legend(fontsize=20)
plt.savefig('feat-std.png',bbox_inches='tight')



In [79]:

    
good = [len(np.where(df4_f['avg_data_std']<=60)[0]),
            len(np.where(df4_f['avg_data_std']>60)[0]) ]


bad = [len(np.where(df4_t['avg_data_std']<=60)[0]),
            len(np.where(df4_t['avg_data_std']>60)[0]) ]



In [80]:

    
fig, ax = plt.subplots()

index = np.arange(2)
bar_width = 0.35

opacity = 0.4
error_config = {'ecolor': '0.3'}


rects1 = plt.bar(index, 
                 bad, 
                 bar_width,
                 alpha=opacity,
                 color='b',
                 label='Bad')

rects2 = plt.bar(index + bar_width,
                 good,  
                 bar_width,
                 alpha=opacity,
                 color='g',
                 label='Good')

plt.xlabel('Session Length',fontsize=20)
plt.ylabel('Number of Sessions',fontsize=20)
plt.title('60 Second Minimum Length',fontsize=24,y=1.08)
plt.xticks(index + bar_width , ('Less than 60 seconds', 'At least 60 seconds'))
plt.legend(fontsize=16, loc=0)
plt.tick_params(axis='both',labelsize=16)
plt.tight_layout()
plt.show()
plt.savefig('feat-60-sec_min.png',bbox_inches='tight')









    












    





<matplotlib.figure.Figure at 0x120f3a510>



In [81]:

    
sns.distplot( df4_t['pegged_low'],kde=False, label = 'Bad')
sns.distplot( df4_f['pegged_low'],kde=False, label = 'Good')

plt.legend()









    Out[81]:





<matplotlib.legend.Legend at 0x1214410d0>



In [82]:

    
sns.distplot( df4_t['pegged_high'],kde=False, label = 'Bad')
sns.distplot( df4_f['pegged_high'], kde=False, label = 'Good')
plt.legend()









    Out[82]:





<matplotlib.legend.Legend at 0x1224f7610>



In [83]:

    
# Set up the matplotlib figure
df4.columns









    Out[83]:





Index([ u'Session_name',        u'userid',         u'score', u'avg_data_mean',
        u'avg_data_max',  u'avg_data_min',  u'avg_data_len',  u'avg_data_std',
         u'session_num',       u'min_len',    u'pegged_low',   u'pegged_high',
            u'cycles_1',      u'swings_1'],
      dtype='object')



In [84]:

    
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_train,y_train)

scores = cross_val_score(clf, X_train, y_train, cv=10)
print scores
print np.average(scores)









    



[ 0.79166667  0.95833333  0.875       0.875       0.875       0.95833333
  0.91666667  0.95833333  0.82608696  0.82608696]
0.886050724638



In [85]:

    
joblib.dump(clf, 'brtakrf_class.pkl')









    Out[85]:





['brtakrf_class.pkl']



In [ ]:



In [86]:

    
clfab = AdaBoostClassifier(base_estimator=None, 
 n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', 
                           random_state=42).fit(X_train,y_train)



In [87]:

    
scores_ab = cross_val_score(clfab, X_train, y_train, cv=10)
print scores_ab
print np.average(scores_ab)









    



[ 0.79166667  0.91666667  0.95833333  0.875       0.875       0.875       0.875
  0.875       0.7826087   0.7826087 ]
0.860688405797



In [88]:

    
clfgb = GradientBoostingClassifier(loss='deviance', learning_rate=0.1, 
                    n_estimators=100, subsample=1.0, 
                    criterion='friedman_mse', min_samples_split=2, 
                    min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
                    max_depth=3, min_impurity_split=1e-07, init=None, 
                    random_state=42, max_features=None, 
                    verbose=0, max_leaf_nodes=None, 
                    warm_start=False, presort='auto').fit(X_train,y_train)



In [89]:

    
scores_gb = cross_val_score(clfgb, X_train, y_train, cv=10)
print scores_gb
print np.average(scores_gb)









    



[ 0.875       0.91666667  0.91666667  0.875       0.91666667  0.91666667
  0.83333333  0.91666667  0.86956522  0.91304348]
0.894927536232



In [90]:

    
scores_rf = cross_val_score(clf, X_train, y_train, cv=10)
print scores_rf









    



[ 0.79166667  0.95833333  0.875       0.875       0.875       0.95833333
  0.91666667  0.95833333  0.82608696  0.82608696]



In [91]:

    
RF_pred = clf.predict(X2)



In [92]:

    
df4['RF_pred'] = RF_pred



In [93]:

    
y2 = df4['score'].as_matrix()
X_df2 = df4[['avg_data_mean',
        'avg_data_max',  'avg_data_min',  'avg_data_std', 'avg_data_len',
            'cycles_1','swings_1','min_len','pegged_low','pegged_high','RF_pred']]
X2 = X_df2.as_matrix()



In [94]:

    
X_train, X_test, y_train, y_test = train_test_split(
    X2, y2, test_size=0.33, random_state=42)



In [95]:

    
clfab = AdaBoostClassifier(base_estimator=None, n_estimators=50, 
            learning_rate=1.0, algorithm='SAMME.R', random_state=42).fit(X_train,y_train)

scores_ab = cross_val_score(clfab, X_train, y_train, cv=10)
print scores_ab
print np.average(scores_ab)









    



[ 1.          1.          1.          0.95833333  1.          1.
  0.95833333  1.          1.          0.95652174]
0.98731884058



In [96]:

    
scores_ab = cross_val_score(clfab, X_test, y_test, cv=10)
print scores_ab
print np.average(scores_ab)









    



[ 0.91666667  0.83333333  1.          0.91666667  1.          0.91666667
  0.91666667  0.91666667  1.          0.9       ]
0.931666666667



In [97]:

    
clfgb = GradientBoostingClassifier(loss='deviance', 
            learning_rate=0.1, n_estimators=100, subsample=1.0, 
            criterion='friedman_mse', min_samples_split=2,
            min_samples_leaf=1, min_weight_fraction_leaf=0.0, 
            max_depth=3, min_impurity_split=1e-07, init=None, 
            random_state=42, max_features=None, verbose=0, 
            max_leaf_nodes=None, warm_start=False, presort='auto').fit(X_train,y_train)
scores_gb = cross_val_score(clfgb, X_train, y_train, cv=10)
print scores_gb
print np.average(scores_gb)









    



[ 1.          1.          1.          0.95833333  1.          1.          1.
  1.          1.          0.95652174]
0.991485507246



In [98]:

    
scores_gb = cross_val_score(clfgb, X_test, y_test, cv=10)
print scores_gb
print np.average(scores_gb)









    



[ 0.83333333  0.83333333  0.91666667  1.          1.          0.91666667
  0.91666667  1.          1.          0.9       ]
0.931666666667



In [99]:

    
joblib.dump(clf, 'brtakgb_class.pkl')









    Out[99]:





['brtakgb_class.pkl']



In [100]:

    
trial_users = pd.read_csv('TrialUsers.csv', header=None)



In [101]:

    
df_trial = df4[df_results['userid'].isin(trial_users[0]) |df_results['userid'].str.contains('pilot') ]



In [102]:

    
y_trial = df_trial['score'].as_matrix()
X_trial_df = df_trial[['avg_data_mean',
        'avg_data_max',  'avg_data_min',  'avg_data_std', 'avg_data_len', 
            'cycles_1','swings_1','min_len','pegged_low','pegged_high']]



In [103]:

    
X_trial = X_trial_df.as_matrix()



In [104]:

    
trial_score = clf.score(X_trial, y_trial)
print trial_score



In [105]:

    
X_trial_df2= X_trial_df.copy()
X_trial_df2['RF_pred'] = clf.predict(X_trial)
X_trial2 = X_trial_df2.as_matrix()
X_train, X_test, y_train, y_test = train_test_split(
    X_trial2, y_trial, test_size=0.33, random_state=42)



In [106]:

    
trial_score_gb = clfgb.score(X_trial2, y_trial)
print trial_score_gb



In [107]:

    
trial_score_ab = clfab.score(X_trial2, y_trial)
print trial_score_ab



In [108]:

    
clfab.feature_importances_









    Out[108]:





array([ 0.04,  0.14,  0.06,  0.22,  0.08,  0.  ,  0.1 ,  0.  ,  0.  ,
        0.  ,  0.36])



In [109]:

    
len(X_trial[0])









    Out[109]:





10



In [110]:

    
print df_trial.columns
len(df_trial.columns)









    



Index([ u'Session_name',        u'userid',         u'score', u'avg_data_mean',
        u'avg_data_max',  u'avg_data_min',  u'avg_data_len',  u'avg_data_std',
         u'session_num',       u'min_len',    u'pegged_low',   u'pegged_high',
            u'cycles_1',      u'swings_1',       u'RF_pred'],
      dtype='object')






    Out[110]:





15



In [111]:

    
print X_df.columns
print X_df2.columns
print X_trial_df.columns
print X_trial_df2.columns









    



Index([u'avg_data_mean', u'avg_data_max', u'avg_data_min', u'avg_data_len',
       u'avg_data_std', u'session_num', u'min_len', u'pegged_low',
       u'pegged_high'],
      dtype='object')
Index([u'avg_data_mean', u'avg_data_max', u'avg_data_min', u'avg_data_std',
       u'avg_data_len', u'cycles_1', u'swings_1', u'min_len', u'pegged_low',
       u'pegged_high', u'RF_pred'],
      dtype='object')
Index([u'avg_data_mean', u'avg_data_max', u'avg_data_min', u'avg_data_std',
       u'avg_data_len', u'cycles_1', u'swings_1', u'min_len', u'pegged_low',
       u'pegged_high'],
      dtype='object')
Index([u'avg_data_mean', u'avg_data_max', u'avg_data_min', u'avg_data_std',
       u'avg_data_len', u'cycles_1', u'swings_1', u'min_len', u'pegged_low',
       u'pegged_high', u'RF_pred'],
      dtype='object')



In [112]:

    
print 'Bad=',len([x  for x in y_trial if x==True])
print 'Good=',len([x  for x in y_trial if x==False])









    



Bad= 33
Good= 31



In [113]:

    
y_pred2 = clfgb.predict(X_trial2)



In [114]:

    
print 'Bad=',len([x  for x in y_pred2 if x==True])
print 'Good=',len([x  for x in y_pred2 if x==False])









    



Bad= 31
Good= 33



In [115]:

    
right = y_trial == y_pred2



In [116]:

    
print 'Correct=',len([x  for x in right if x==True])
print 'Incorrect=',len([x  for x in right if x==False])









    



Correct= 62
Incorrect= 2



In [117]:

    
y_pred2 = clf.predict(X_trial)
right = y_trial == y_pred2
print 'Correct=',len([x  for x in right if x==True])
print 'Incorrect=',len([x  for x in right if x==False])









    



Correct= 62
Incorrect= 2



In [118]:

    
#df_trial[df_trial['score']!=df_trial['RF_pred']]



In [119]:

    
# Compute the correlation matrix
corr = df_results.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True, 
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True, annot_kws={"size": 9})
plt.xticks(rotation=45)
plt.yticks(rotation=0)









    Out[119]:





(array([ 0.5,  1.5,  2.5,  3.5,  4.5,  5.5,  6.5,  7.5,  8.5,  9.5]),
 <a list of 10 Text yticklabel objects>)



In [120]:

    
# Compute the correlation matrix
corr = df4.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True, 
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True, annot_kws={"size": 9})
plt.xticks(rotation=45)
plt.yticks(rotation=0)









    Out[120]:





(array([  0.5,   1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,
          9.5,  10.5,  11.5,  12.5]), <a list of 13 Text yticklabel objects>)



In [121]:

    
# Compute the correlation matrix
df9=df4.drop('RF_pred',axis=1)
corr = df9.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True, 
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax, annot=True, annot_kws={"size": 9})
plt.xticks(rotation=45)
plt.yticks(rotation=0)









    Out[121]:





(array([  0.5,   1.5,   2.5,   3.5,   4.5,   5.5,   6.5,   7.5,   8.5,
          9.5,  10.5,  11.5]), <a list of 12 Text yticklabel objects>)



In [122]:

    
forest = clf_save
names = clf_name_save
X = X_save

importances = forest.feature_importances_
print len(importances), len(forest.feature_importances_)
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)

indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %s (%f)" % (f + 1, names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), names[indices], rotation=45)
plt.xlim([-1, X.shape[1]])
plt.show()
plt.savefig('feat-imp.png',bbox_inches='tight')









    



10 10
Feature ranking:
1. feature avg_data_len (0.251626)
2. feature cycles_1 (0.134859)
3. feature min_len (0.134769)
4. feature avg_data_std (0.105646)
5. feature avg_data_min (0.097373)
6. feature pegged_low (0.094684)
7. feature avg_data_mean (0.070648)
8. feature swings_1 (0.057275)
9. feature avg_data_max (0.050543)
10. feature pegged_high (0.002578)






    












    





<matplotlib.figure.Figure at 0x123a8e690>

	avg_data_mean	avg_data_max	avg_data_min	avg_data_len	avg_data_std	session_num
count	356.000000	356.000000	356.000000	356.000000	356.000000	356.000000
mean	2080.052353	3521.893258	1021.233146	13404.457865	602.631308	177.500000
std	997.930074	883.259504	1220.579913	34697.181848	424.079052	102.912584
min	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000
25%	1317.220337	3357.250000	0.000000	1152.000000	264.431434	88.750000
50%	2090.857233	3901.000000	423.000000	3736.000000	508.647631	177.500000
75%	2837.760875	4047.000000	1945.250000	13222.000000	939.515026	266.250000
max	4095.000000	4095.000000	4095.000000	382019.000000	1667.963040	355.000000