Exercise 02

Estimate a classifier to predict churn

Problem Formulation

Customer Churn: losing/attrition of the customers from the company. Especially, the industries that the user acquisition is costly, it is crucially important for one company to reduce and ideally make the customer churn to 0 to sustain their recurring revenue. If you consider customer retention is always cheaper than customer acquisition and generally depends on the data of the user(usage of the service or product), it poses a great/exciting/hard problem for machine learning.

Data

Dataset is from a telecom service provider where they have the service usage(international plan, voicemail plan, usage in daytime, usage in evenings and nights and so on) and basic demographic information(state and area code) of the user. For labels, I have a single data point whether the customer is churned out or not.


In [1]:
# Download the dataset
from urllib import request
response = request.urlopen('https://raw.githubusercontent.com/EricChiang/churn/master/data/churn.csv')
raw_data = response.read().decode('utf-8')

In [2]:
# Convert to numpy
import numpy as np
data = []
for line in raw_data.splitlines()[1:]:
    words = line.split(',')
    data.append(words)
data = np.array(data)
column_names = raw_data.splitlines()[0].split(',')
n_obs = data.shape[0]

In [3]:
print(column_names)
print(data.shape)


['State', 'Account Length', 'Area Code', 'Phone', "Int'l Plan", 'VMail Plan', 'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins', 'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge', 'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls', 'Churn?']
(3333, 21)

In [4]:
data[:2]


Out[4]:
array([['KS', '128', '415', '382-4657', 'no', 'yes', '25', '265.100000',
        '110', '45.070000', '197.400000', '99', '16.780000', '244.700000',
        '91', '11.010000', '10.000000', '3', '2.700000', '1', 'False.'],
       ['OH', '107', '415', '371-7191', 'no', 'yes', '26', '161.600000',
        '123', '27.470000', '195.500000', '103', '16.620000', '254.400000',
        '103', '11.450000', '13.700000', '3', '3.700000', '1', 'False.']], 
      dtype='<U10')

In [5]:
# Select only the numeric features
X = data[:, [1,2,6,7,8,9,10]].astype(np.float)
# Convert bools to floats
X_ = (data[:, [4,5]] == 'no').astype(np.float)
X = np.hstack((X, X_))
Y = (data[:, -1] == 'True.').astype(np.int)

In [6]:
X[:2]


Out[6]:
array([[ 128.  ,  415.  ,   25.  ,  265.1 ,  110.  ,   45.07,  197.4 ,
           1.  ,    0.  ],
       [ 107.  ,  415.  ,   26.  ,  161.6 ,  123.  ,   27.47,  195.5 ,
           1.  ,    0.  ]])

In [7]:
print('Number of churn cases ', Y.sum())


Number of churn cases  483

Exercise 02.1

Split the training set in two sets with 70% and 30% of the data, respectively.


Partir la base de datos es dos partes de 70%


In [8]:
# Insert code here
random_sample = np.random.rand(n_obs)
X_train, X_test = X[random_sample<0.6], X[random_sample>=0.6]
Y_train, Y_test = Y[random_sample<0.6], Y[random_sample>=0.6]

print(Y_train.shape, Y_test.shape)


(2034,) (1299,)

Exercise 02.2

Train a logistic regression using the 70% set


Entrenar una regresion logistica usando la particion del 70%


In [9]:
# Insert code here
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, Y_train)


Out[9]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Exercise 02.3

a) Create a confusion matrix using the prediction on the 30% set.

b) Estimate the accuracy of the model in the 30% set


a) Estimar la matriz de confusion en la base del 30%.

b) Calcular el poder de prediccion usando la base del 30%.


In [10]:
# Insert code here
y_pred = clf.predict(X_test)

from sklearn.metrics import confusion_matrix
confusion_matrix(Y_test, y_pred)


Out[10]:
array([[1097,   25],
       [ 155,   22]])

In [11]:
(Y_test == y_pred).mean()


Out[11]:
0.86143187066974591