Customer Churn: losing/attrition of the customers from the company. Especially, the industries that the user acquisition is costly, it is crucially important for one company to reduce and ideally make the customer churn to 0 to sustain their recurring revenue. If you consider customer retention is always cheaper than customer acquisition and generally depends on the data of the user(usage of the service or product), it poses a great/exciting/hard problem for machine learning.
Dataset is from a telecom service provider where they have the service usage(international plan, voicemail plan, usage in daytime, usage in evenings and nights and so on) and basic demographic information(state and area code) of the user. For labels, I have a single data point whether the customer is churned out or not.
In [1]:
# Download the dataset
from urllib import request
response = request.urlopen('https://raw.githubusercontent.com/EricChiang/churn/master/data/churn.csv')
raw_data = response.read().decode('utf-8')
In [2]:
# Convert to numpy
import numpy as np
data = []
for line in raw_data.splitlines()[1:]:
words = line.split(',')
data.append(words)
data = np.array(data)
column_names = raw_data.splitlines()[0].split(',')
n_obs = data.shape[0]
In [3]:
print(column_names)
print(data.shape)
In [4]:
data[:2]
Out[4]:
In [5]:
# Select only the numeric features
X = data[:, [1,2,6,7,8,9,10]].astype(np.float)
# Convert bools to floats
X_ = (data[:, [4,5]] == 'no').astype(np.float)
X = np.hstack((X, X_))
Y = (data[:, -1] == 'True.').astype(np.int)
In [6]:
X[:2]
Out[6]:
In [7]:
print('Number of churn cases ', Y.sum())
In [8]:
# Insert code here
random_sample = np.random.rand(n_obs)
X_train, X_test = X[random_sample<0.6], X[random_sample>=0.6]
Y_train, Y_test = Y[random_sample<0.6], Y[random_sample>=0.6]
print(Y_train.shape, Y_test.shape)
In [9]:
# Insert code here
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, Y_train)
Out[9]:
In [10]:
# Insert code here
y_pred = clf.predict(X_test)
from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test, y_pred)
Out[10]:
In [11]:
(Y_test == y_pred).mean()
Out[11]: