From frontline support teams to C-suites, customer satisfaction is a key measure of success. Unhappy customers don't stick around. What's more, unhappy customers rarely voice their dissatisfaction before leaving.
Santander Bank is asking Kagglers to help them identify dissatisfied customers early in their relationship. Doing so would allow Santander to take proactive steps to improve a customer's happiness before it's too late.
In this competition, you'll work with hundreds of anonymized features to predict if a customer is satisfied or dissatisfied with their banking experience.
The competition is hosted on Kaggle.
The data section has three files: train.csv, test.csv, and sample_submission.csv.
The datasets have already been downloaded and are available in the ../data folder.
In [1]:
import numpy as np
import pandas as pd
In [2]:
#Read train, test and sample submission datasets
train = pd.read_csv("../data/train.csv")
test = pd.read_csv("../data/test.csv")
samplesub = pd.read_csv("../data/sample_submission.csv")
Exercise 1 Find column types for train and test.
In [ ]:
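One possible approach (a sketch): the dtypes attribute lists the type of every column.
In [ ]:
#dtypes lists the type of every column
print(train.dtypes)
print(test.dtypes)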
Exercise 2 Find unique column types for train and test
In [ ]:
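A possible sketch: chain unique() onto dtypes.
In [ ]:
#Unique column types in each dataset
print(train.dtypes.unique())
print(test.dtypes.unique())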
Exercise 3 Find number of rows and columns in train and test
In [ ]:
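A possible sketch: the shape attribute returns (rows, columns).
In [ ]:
print(train.shape, test.shape)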
Exercise 4 Find the columns that have missing values
Hint: look up the pandas function isnull
In [ ]:
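A possible sketch: isnull().sum() counts the missing values per column.
In [ ]:
#Keep only the columns where the count of missing values is non-zero
missing_counts = train.isnull().sum()
missing_counts[missing_counts > 0]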
Exercise 5 Find number of unsatisfied customers in the train dataset
In [3]:
#Create the labels
labels=train.iloc[:,-1]
In [4]:
#Find number of unsatisfied customers using `labels` (in this dataset, TARGET == 1 marks an unsatisfied customer)
labels.value_counts()
Exercise 6 Find features that show no variance
Question: Why is this important?
In [6]:
# Step 1: Find the standard deviation of each column
train_std = train.std()
In [7]:
# Step 2: Find the columns whose standard deviation is 0
columns_with_0_variance = train_std[train_std == 0]
In [8]:
#train.columns.values in columns_with_0_variance.index
train_columns = train.columns.values
columns_with_0_variance_columns = columns_with_0_variance.index.values
In [10]:
#Mark which train columns are in the zero-variance set (these will be dropped)
selected_columns = np.in1d(train_columns, columns_with_0_variance_columns)
In [11]:
len(selected_columns)
Out[11]:
In [12]:
#Create train and test feature sets
#Drop the first column (ID) and the last column (TARGET), then remove the zero-variance features
feature_mask = ~selected_columns[1:-1]
feature_names = train_columns[1:-1][feature_mask]
train_updated = train[feature_names]
test_updated = test[feature_names]
In [13]:
#Check that the number of columns in both datasets is the same
print(train_updated.shape, test_updated.shape)
In [14]:
#Check that the column names in train and test are the same
np.array_equal(train_updated.columns.values, test_updated.columns.values)
Out[14]:
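As an aside, the manual zero-variance removal above can also be done with scikit-learn's VarianceThreshold. A sketch, assuming as above that the first column is the ID and the last train column is the TARGET; note it returns arrays rather than DataFrames.
In [ ]:
from sklearn.feature_selection import VarianceThreshold
#threshold=0.0 removes exactly the zero-variance features
selector = VarianceThreshold(threshold=0.0)
train_reduced = selector.fit_transform(train.iloc[:, 1:-1])  #drop ID and TARGET
test_reduced = selector.transform(test.iloc[:, 1:])          #drop ID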
We will cover the following: regularization, logistic regression, and decision trees.
Regularization is tuning or selecting the preferred level of model complexity so your models are better at predicting (generalizing). If you don't do this your models may be too complex and overfit or too simple and underfit, either way giving poor predictions.
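In scikit-learn's LogisticRegression, the amount of regularization is controlled by the parameter C, the inverse of the regularization strength. A sketch (the C values are illustrative only, not tuned):
In [ ]:
from sklearn import linear_model
#Smaller C = stronger penalty = simpler model; larger C = weaker penalty = more complex model
simpler_model = linear_model.LogisticRegression(C=0.01)
more_complex_model = linear_model.LogisticRegression(C=100.0)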
In [15]:
from sklearn import preprocessing
from sklearn import linear_model
from sklearn import model_selection  #cross_validation was removed from scikit-learn; model_selection replaces it
In [16]:
y = np.array(labels)
In [17]:
#Why do we need scaling? Features on very different scales distort the regularization penalty
#and slow down the optimizer, so we standardize each feature to zero mean and unit variance
scaler = preprocessing.StandardScaler()
scaler = scaler.fit(train_updated)
In [18]:
train_scaled = scaler.transform(train_updated)
#Remember - need to use the same scaler function on test
test_scaled = scaler.transform(test_updated)
In [19]:
#lr = linear_model.LogisticRegression()
logReg = linear_model.LogisticRegression(tol=0.1, n_jobs=6)
In [20]:
%timeit -n 1 -r 1 logReg.fit(train_scaled, y)
In [21]:
logRegPrediction = logReg.predict(test_scaled)
Exercise 7 Predict the probability of each customer being unsatisfied in the test dataset
In [ ]:
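A possible sketch (the variable name is illustrative): predict_proba returns one column per class, and column 1 is the probability of class 1, the unsatisfied customers.
In [ ]:
logRegProbabilities = logReg.predict_proba(test_scaled)[:, 1]
logRegProbabilities[:5]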
Exercise 8 Fit an L1-regularized model. Evaluate the results
In [ ]:
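A possible sketch, assuming the liblinear solver (which supports the L1 penalty) and AUC as the evaluation metric:
In [ ]:
logRegL1 = linear_model.LogisticRegression(penalty='l1', solver='liblinear', tol=0.1)
aucScores = model_selection.cross_val_score(logRegL1, train_scaled, y, scoring='roc_auc', cv=3)
print(aucScores.mean())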
Exercise 9 Add the predictions to the sample submission, save it as a CSV file, and submit the solution to Kaggle
In [ ]:
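A possible sketch, building on the probabilities from the earlier sketch and assuming the sample submission has an ID column and a TARGET column to fill in:
In [ ]:
samplesub['TARGET'] = logRegProbabilities
samplesub.to_csv('../data/logreg_submission.csv', index=False)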
In [148]:
from sklearn import tree
In [149]:
#Default settings grow a full tree, which can overfit; limiting max_depth is one way to regularize
decisionTreeModel = tree.DecisionTreeClassifier()
In [150]:
decisionTreeModel.fit(train_updated, y)
Out[150]:
Exercise 10 Predict using the decision tree model, save the predictions as a CSV file, and submit the solution to Kaggle
In [ ]:
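A possible sketch, mirroring the logistic regression submission above. The tree was fit on the unscaled train_updated, so predict on test_updated.
In [ ]:
treeProbabilities = decisionTreeModel.predict_proba(test_updated)[:, 1]
samplesub['TARGET'] = treeProbabilities
samplesub.to_csv('../data/decision_tree_submission.csv', index=False)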
In [ ]: