Importing and Setup


In [1]:
import time
from datetime import datetime, timedelta

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc, precision_recall_curve

from prepare_data.load import cleaned_data, need, activity, client, case, advocate_demographics, advocate
from models.pipeline import run_random_forest, run_logistic_regression
from models.pipeline import make_testing_training, add_features_to_training_testing
from models.features import get_contacts_within_time, get_time_difference
from models.features import get_need_desk_success_dict, get_need_composite_index
from models.features import get_desk_need_dict, get_sum_need_days_per_case, get_time_delta
from models.features import get_contacts_within_date, get_outcomes_from_cases
from models.diagnostics import get_diagnostics
from models.visualization import make_recall_specificity_curve, make_roc_curve, plot_feature_importance

In [2]:
# A little bit of cleanup: keep only activity and need records tied to cases in the cleaned data
filtered_activity = activity[activity['Case ID'].isin(cleaned_data['case_id'])]
filtered_need = need[need['Case ID'].isin(cleaned_data['case_id'])]

In [ ]:
# Restrict to cases that stayed open longer than the feature window,
# so every case has a full 30 days of contact history
window_days = 30
data = cleaned_data
data = data[data['Days Until Case Closed'] > window_days]

Part A: Random Forest

We use a Random Forest to model whether a patient will disconnect from Health Leads.

A1 -- Features

Feature selection was performed beforehand by inspecting feature importances from earlier Random Forest runs; the most significant features are listed below.
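For reference, this kind of selection can be reproduced by ranking a fitted forest's feature_importances_. This is a minimal sketch, not the actual selection code (the earlier exploratory models are not shown in this notebook, and the helper name here is ours):

In [ ]:
def top_features_sketch(fitted_forest, candidate_cols, k=8):
    # Rank candidates by the forest's impurity-based importances and keep the top k
    ranked = sorted(zip(candidate_cols, fitted_forest.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, importance in ranked[:k]]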


In [18]:
outcome_col = 'binary_outcome'
feature_cols = [
    'num_contacts_within30',
    'num_attempts_within30',
    'num_contacts_within7',
    'num_attempts_within7',
    'num_contacts_within15',
    'num_attempts_within15',
    'p_all_needs_disconnect',
    'sum_median_days',
]

In [14]:
rf_clf = RandomForestClassifier(n_estimators=500, max_depth=None)

A2 -- Manually Separating Testing and Training Sets

We split the data chronologically rather than randomly: the earliest 80% of cases form the training set and are used to predict the most recent 20%, as sketched below.
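A minimal sketch of what such a date-based split could look like (the real implementation is make_testing_training in models/pipeline.py and may differ; the computed date column name is taken from the warning output further down):

In [ ]:
def chronological_split_sketch(data, percent_training=0.80):
    # Train on the earliest cases, test on the most recent ones
    data = data.copy()  # explicit copy also avoids the SettingWithCopyWarning seen below
    data['computed Case Date/Time Closed'] = pd.to_datetime(data['Case Date/Time Closed'])
    data = data.sort_values('computed Case Date/Time Closed')
    cutoff = int(len(data) * percent_training)
    return data.iloc[:cutoff], data.iloc[cutoff:]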


In [6]:
desk_names = cleaned_data['Case Desk Location'].unique()

In [7]:
# Create a 'Days Until Need Closed' column on the need dataframe; the feature builders below rely on it
need_date_opened_format = '%m/%d/%Y %H:%M'
need_date_closed_format = '%m/%d/%Y %H:%M'
need['Days Until Need Closed'] = need.apply(get_time_difference, args = ('Need Date Opened', 'Need Date Closed', need_date_opened_format, need_date_closed_format), axis = 1)
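For reference, a minimal sketch of the row-wise helper this apply expects (the actual get_time_difference lives in models/features.py and may handle missing values differently):

In [ ]:
def time_difference_sketch(row, open_col, close_col, open_fmt, close_fmt):
    # Days elapsed between two string-formatted timestamps on a single row
    opened = datetime.strptime(row[open_col], open_fmt)
    closed = datetime.strptime(row[close_col], close_fmt)
    return (closed - opened).days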

In [50]:
training, testing = make_testing_training(data, percent_training=0.80, random_split=False)


models/pipeline.py:13: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
  data['computed Case Date/Time Closed'] = pd.to_datetime(data['Case Date/Time Closed'])

In [51]:
training, testing = add_features_to_training_testing(training, testing, need, desk_names)

In [4]:
training.binary_outcome.value_counts(normalize=True)
# 0    0.591141
# 1    0.408859

In [5]:
testing.binary_outcome.value_counts(normalize=True)
# 0    0.631175
# 1    0.368825

A3 -- Running the Random Forest


In [52]:
fitted_rf_model, rf_diagnostics, rf_predicted_probs = run_random_forest(rf_clf, training, testing, feature_cols, outcome_col)
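A sketch of the fit/score pattern a wrapper like run_random_forest presumably follows (the actual implementation, including the diagnostics it returns, is in models/pipeline.py):

In [ ]:
def run_forest_sketch(clf, training, testing, feature_cols, outcome_col):
    # Fit on the training features, then return positive-class probabilities for the test set
    clf.fit(training[feature_cols], training[outcome_col])
    probs = clf.predict_proba(testing[feature_cols])[:, 1]
    return clf, probs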

A4 -- Diagnostics

We examine the ROC and Precision-Recall curves. Ideally, we'd like to find a point on the Precision-Recall curve with high precision (>75%) at a decent recall.


In [23]:
fpr, tpr, thresholds = make_roc_curve(testing[outcome_col], rf_predicted_probs, linewidth=2)



In [24]:
precision, recall, thresholds = make_recall_specificity_curve(testing[outcome_col], rf_predicted_probs, linewidth=2)
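Given the arrays returned above (assuming they come from sklearn's precision_recall_curve, whose final precision/recall point has no matching threshold), one way to pick an operating point is the highest-recall threshold that still clears 75% precision:

In [ ]:
ok = np.where(precision[:-1] >= 0.75)[0]
if len(ok):
    # Among thresholds meeting the precision bar, take the one with the best recall
    best = ok[np.argmax(recall[:-1][ok])]
    print('threshold %.3f gives precision %.3f at recall %.3f'
          % (thresholds[best], precision[best], recall[best]))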


A5 -- Diagnosing Feature Importance


In [25]:
plot_feature_importance(fitted_rf_model, feature_cols)


Feature ranking:
1. p_all_needs_disconnect - 0.339
2. sum_median_days - 0.282
3. num_contacts_within30 - 0.102
4. num_attempts_within30 - 0.095
5. num_attempts_within15 - 0.053
6. num_contacts_within15 - 0.051
7. num_contacts_within7 - 0.041
8. num_attempts_within7 - 0.037

It makes sense that the needs composite index (p_all_needs_disconnect) is one of the most significant features.

However, it's not clear why the summed median days per need (sum_median_days) is so significant.

Part B: Using Logistic Regression

B1 -- Running the LR


In [53]:
fitted_logit_model, logit_diagnostics, predicted_logit_probs = run_logistic_regression(training, testing, feature_cols, outcome_col)


Optimization terminated successfully.
         Current function value: 0.578508
         Iterations 6

B2 -- Diagnosing the LR


In [27]:
logit_diagnostics


Out[27]:
{'accuracy': 0.6601362862010222,
 'f1': 0.4981132075471698,
 'false negative rate': 0.5427251732101617,
 'false positive rate': 0.2213225371120108,
 'precision': 0.5469613259668509,
 'sensitivity/recall/tpr': 0.45727482678983833,
 'true negative rate': 0.7786774628879892}

In [28]:
fitted_logit_model.summary()


Out[28]:
                           Logit Regression Results
==============================================================================
Dep. Variable:         binary_outcome   No. Observations:                 4696
Model:                          Logit   Df Residuals:                     4687
Method:                           MLE   Df Model:                            8
Date:                Thu, 18 Sep 2014   Pseudo R-squ.:                  0.1212
Time:                        20:59:42   Log-Likelihood:                -2791.6
converged:                       True   LL-Null:                       -3176.6
                                        LLR p-value:                6.411e-161
==========================================================================================
                             coef    std err          z      P>|z|     [95.0% Conf. Int.]
------------------------------------------------------------------------------------------
num_contacts_within30     -0.3942      0.036    -10.807      0.000        -0.466    -0.323
num_attempts_within30      0.2728      0.035      7.768      0.000         0.204     0.342
num_contacts_within7       0.1917      0.059      3.267      0.001         0.077     0.307
num_attempts_within7      -0.1087      0.056     -1.928      0.054        -0.219     0.002
num_contacts_within15      0.0862      0.060      1.431      0.152        -0.032     0.204
num_attempts_within15     -0.1061      0.064     -1.670      0.095        -0.231     0.018
p_all_needs_disconnect     3.0252      0.183     16.543      0.000         2.667     3.384
sum_median_days            0.0071      0.001      9.200      0.000         0.006     0.009
intercept                 -1.5668      0.137    -11.397      0.000        -1.836    -1.297
==========================================================================================
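Since these coefficients are on the log-odds scale, they read more naturally as odds ratios. Assuming fitted_logit_model is the statsmodels results object the summary above implies, one line converts them:

In [ ]:
np.exp(fitted_logit_model.params)  # multiplicative effect on the odds of disconnecting, per unit increase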

In [29]:
fpr, tpr, thresholds = make_roc_curve(testing[outcome_col], predicted_logit_probs)



In [30]:
precision, recall, thresholds = make_recall_specificity_curve(testing[outcome_col], predicted_logit_probs)


Part C: Feature Importance Over Time

We look at how the impact of a client picking up the phone (a successful contact) versus not picking up (an unsuccessful attempt) changes over the life of a case.


In [32]:
# For each window length, make slots for the fitted coefficients and their ratio
window_range = range(1, 51)
params = {num: {'ratio': 0,
                'num_contacts_within{}'.format(num): 0,
                'num_attempts_within{}'.format(num): 0}
          for num in window_range}
# Only keep cases open longer than the largest window
data = cleaned_data[cleaned_data['Days Until Case Closed'] > window_range[-1]]

In [33]:
training, testing = make_testing_training(data, 0.8)

In [34]:
desk_names = data['Case Desk Location'].unique()

In [35]:
training, testing = add_features_to_training_testing(training, testing, need, desk_names)

In [ ]:
# Refit the logit for each window length and store the coefficients on
# contacts and attempts, along with their ratio
for num in window_range:
    feature_cols = [
        'num_contacts_within{}'.format(num),
        'num_attempts_within{}'.format(num),
        'Client Age',
        'p_all_needs_disconnect',
        'sum_median_days',
    ]
    fitted_logit_model, logit_diagnostics, pred_logit_probs = run_logistic_regression(training, testing, feature_cols, outcome_col)
    num_contacts_within_param = fitted_logit_model.params['num_contacts_within{}'.format(num)]
    num_attempts_within_param = fitted_logit_model.params['num_attempts_within{}'.format(num)]
    ratio = float(num_contacts_within_param) / num_attempts_within_param
    params[num]['ratio'] = ratio
    params[num]['num_contacts_within{}'.format(num)] = num_contacts_within_param
    params[num]['num_attempts_within{}'.format(num)] = num_attempts_within_param

In [37]:
contacts = [params[num]['num_contacts_within{}'.format(num)] for num in window_range]

In [38]:
attempts = [params[num]['num_attempts_within{}'.format(num)] for num in window_range]

In [42]:
plt.title('Impact on Disconnection')
plt.plot(window_range, [1 - exp(c) for c in contacts], 'g', linewidth = 2, label = 'Successful Phone Calls')
plt.plot(window_range, [exp(a) - 1 for a in attempts], 'b', linewidth = 2, label = 'Unsuccessful Phone Calls')
plt.xlabel('Days Since Case Opened')
plt.ylabel('% Impact')
plt.legend(loc='best')
fig = plt.gcf()
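The transforms in the plot convert each logit coefficient into a percentage effect on the odds: a one-unit increase multiplies the odds of disconnection by exp(coef), so a negative contact coefficient plots as the reduction 1 - exp(coef), and a positive attempt coefficient as the increase exp(coef) - 1. As a worked check against the Part B summary:

In [ ]:
beta = -0.3942           # num_contacts_within30 from the logit summary in B2
print(1 - np.exp(beta))  # ~0.326: one extra successful call cuts the odds by roughly a third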


As we brought up in our final presentation to Health Leads, this graph shows that the impact of successful phone calls trumps that of unsuccessful phone calls throughout the entire client relationship.

Given that, we suggested that Health Leads make more frequent calls in the beginning stages of a relationship.

Part D: Ensemble Logistic Regression & Random Forest


In [43]:
# Let's try a weighted combination of the RF & logistic regression probabilities
logit_weights = np.arange(0, 1.1, 0.1)

In [44]:
ensemble_diagnostics = {weight: {
                                 'roc': 0, 
                                 'precision-recall': 0} 
                        for weight in logit_weights}

In [54]:
for weight in logit_weights:
    ensemble_probs = weight*predicted_logit_probs + (1 - weight)*np.array(rf_predicted_probs)
    fpr, tpr, thresholds = roc_curve(testing[outcome_col], ensemble_probs, pos_label = 1)
    roc_auc = auc(fpr, tpr)
    precision, recall, thresholds = precision_recall_curve(testing[outcome_col], ensemble_probs)
    precision_recall_auc = auc(recall, precision)
    ensemble_diagnostics[weight]['roc'] = roc_auc
    ensemble_diagnostics[weight]['precision-recall'] = precision_recall_auc

In [55]:
plt.plot(logit_weights, [ensemble_diagnostics[weight]['roc'] for weight in logit_weights])


Out[55]:
[<matplotlib.lines.Line2D at 0x124ceb350>]

In [59]:
plt.plot(logit_weights, [ensemble_diagnostics[weight]['precision-recall'] for weight in logit_weights])


Out[59]:
[<matplotlib.lines.Line2D at 0x10fd93210>]

Judging by both AUC curves, the most predictive model is a combination of the Random Forest and logistic regression, with most of the weight placed on the Random Forest.
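As a quick check, the best blend weight under each criterion can be read straight off the grid built above:

In [ ]:
# Weight maximizing each area-under-curve metric
best_roc = max(logit_weights, key=lambda w: ensemble_diagnostics[w]['roc'])
best_pr = max(logit_weights, key=lambda w: ensemble_diagnostics[w]['precision-recall'])
best_roc, best_pr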