From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.
Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.
In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.
The competition is hosted on Kaggle.
The data section has three files: train.csv, test.csv, and sample_submission.csv.
The datasets have already been downloaded and are available in the ../data folder.
In [1]:
import numpy as np
import pandas as pd
In [2]:
#Read train, test and sample submission datasets
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
samplesub = pd.read_csv("../data/sample_submission.csv")
Exercise 1 Find column types for train and test.
In [ ]:
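One possible approach (a sketch): the dtypes attribute lists the type of every column.
In [ ]:
#dtypes lists the type of every column
print(train.dtypes)
print(test.dtypes)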
Exercise 2 Find unique column types for train and test
In [ ]:
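A possible sketch: chain unique() onto dtypes.
In [ ]:
#Unique column types in each dataset
print(train.dtypes.unique())
print(test.dtypes.unique())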
Exercise 3 Find number of rows and columns in train and test
In [ ]:
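A possible sketch: the shape attribute returns (rows, columns).
In [ ]:
print(train.shape, test.shape)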
Exercise 4 Find the columns that have missing values
Hint: look up the pandas function isnull
In [ ]:
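A possible sketch: isnull().sum() counts the missing values per column.
In [ ]:
#Keep only the columns where the count of missing values is non-zero
missing_counts = train.isnull().sum()
missing_counts[missing_counts > 0]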
Exercise 5 Find number of unsatisfied customers in the train dataset
In [3]:
#Create the labels
labels=train.iloc[:,-1]
In [4]:
#Find number of unsatisfied customers using `labels` (in this dataset, TARGET == 1 marks an unsatisfied customer)
labels.value_counts()
Exercise 6 Find features that show no variance
Question: Why is this important?
In [6]:
# Step 1: Find the standard deviation of each column
train_std = train.std()
In [7]:
# Step 2: Find the columns whose standard deviation is 0
columns_with_0_variance = train_std[train_std == 0]
In [8]:
#train.columns.values in columns_with_0_variance.index
train_columns = train.columns.values
columns_with_0_variance_columns = columns_with_0_variance.index.values
In [10]:
#Mark which train columns are in the zero-variance set (these will be dropped)
selected_columns = np.in1d(train_columns, columns_with_0_variance_columns)
In [11]:
len(selected_columns)
Out[11]:
In [12]:
#Create train and test feature sets
#Drop the first column (ID) and the last column (TARGET), then remove the zero-variance features
feature_mask = ~selected_columns[1:-1]
feature_names = train_columns[1:-1][feature_mask]
train_updated = train[feature_names]
test_updated = test[feature_names]
In [13]:
#Check that the number of columns in both datasets is the same
print(train_updated.shape, test_updated.shape)
In [14]:
#Check that the column names in train and test are the same
np.array_equal(train_updated.columns.values, test_updated.columns.values)
Out[14]:
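As an aside, the manual zero-variance removal above can also be done with scikit-learn's VarianceThreshold. A sketch, assuming as above that the first column is the ID and the last train column is the TARGET; note it returns arrays rather than DataFrames.
In [ ]:
from sklearn.feature_selection import VarianceThreshold
#threshold=0.0 removes exactly the zero-variance features
selector = VarianceThreshold(threshold=0.0)
train_reduced = selector.fit_transform(train.iloc[:, 1:-1])  #drop ID and TARGET
test_reduced = selector.transform(test.iloc[:, 1:])          #drop ID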
We will cover the following: regularization, logistic regression, and decision trees.
Regularization is tuning or selecting the preferred level of model complexity so your models are better at predicting (generalizing). If you don't do this your models may be too complex and overfit or too simple and underfit, either way giving poor predictions.
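In scikit-learn's LogisticRegression, the amount of regularization is controlled by the parameter C, the inverse of the regularization strength. A sketch (the C values are illustrative only, not tuned):
In [ ]:
from sklearn import linear_model
#Smaller C = stronger penalty = simpler model; larger C = weaker penalty = more complex model
simpler_model = linear_model.LogisticRegression(C=0.01)
more_complex_model = linear_model.LogisticRegression(C=100.0)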
In [15]:
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import model_selection  #cross_validation was removed from scikit-learn; model_selection replaces it
In [16]:
y = np.array(labels)
In [17]:
#Why do we need scaling? Features on very different scales distort the regularization penalty
#and slow down the optimizer, so we standardize each feature to zero mean and unit variance
scaler = preprocessing.StandardScaler()
scaler = scaler.fit(train_updated)
In [18]:
train_scaled = scaler.transform(train_updated)
#Remember - need to use the same scaler function on test
test_scaled = scaler.transform(test_updated)
In [19]:
#lr = linear_model.LogisticRegression()
logReg = linear_model.LogisticRegression(tol=0.1, n_jobs=6)
In [20]:
%timeit -n 1 -r 1 logReg.fit(train_scaled, y)
In [21]:
logRegPrediction = logReg.predict(test_scaled)
Exercise 7 Predict the probability of each customer being unsatisfied in the test dataset
In [ ]:
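A possible sketch (the variable name is illustrative): predict_proba returns one column per class, and column 1 is the probability of class 1, the unsatisfied customers.
In [ ]:
logRegProbabilities = logReg.predict_proba(test_scaled)[:, 1]
logRegProbabilities[:5]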
Exercise 8 Fit an L1-regularized model. Evaluate the results
In [ ]:
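A possible sketch, assuming the liblinear solver (which supports the L1 penalty) and AUC as the evaluation metric:
In [ ]:
logRegL1 = linear_model.LogisticRegression(penalty='l1', solver='liblinear', tol=0.1)
aucScores = model_selection.cross_val_score(logRegL1, train_scaled, y, scoring='roc_auc', cv=3)
print(aucScores.mean())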
Exercise 9 Add the predictions to the sample submission, save it as a CSV file, and submit the solution to Kaggle
In [ ]:
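A possible sketch, building on the probabilities from the earlier sketch and assuming the sample submission has an ID column and a TARGET column to fill in:
In [ ]:
samplesub['TARGET'] = logRegProbabilities
samplesub.to_csv('../data/logreg_submission.csv', index=False)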
In [148]:
from sklearn import tree
In [149]:
#Default settings grow a full tree, which can overfit; limiting max_depth is one way to regularize
decisionTreeModel = tree.DecisionTreeClassifier()
In [150]:
decisionTreeModel.fit(train_updated, y)
Out[150]:
Exercise 10 Predict using the decision tree model, save the predictions as a CSV file, and submit the solution to Kaggle
In [ ]:
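A possible sketch, mirroring the logistic regression submission above. The tree was fit on the unscaled train_updated, so predict on test_updated.
In [ ]:
treeProbabilities = decisionTreeModel.predict_proba(test_updated)[:, 1]
samplesub['TARGET'] = treeProbabilities
samplesub.to_csv('../data/decision_tree_submission.csv', index=False)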
In [ ]: