Student Name: Shaival Dalal
Student Netid: sd3462
This is a hands-on task where we build a predictive model using Decision Trees discussed in class. For this part, we will be using the data in cell2cell_data.csv (you can find this on NYU Classes).
These historical data consist of 39,859 customers: 19,901 customers that churned (i.e., left the company) and 19,958 that did not churn (see the "churndep"
variable). Here are the data set's 11 possible predictor variables for churning behavior:
| Pos. | Var. Name | Var. Description |
|------|-----------|------------------|
| 1 | revenue | Mean monthly revenue in dollars |
| 2 | outcalls | Mean number of outbound voice calls |
| 3 | incalls | Mean number of inbound voice calls |
| 4 | months | Months in service |
| 5 | eqpdays | Number of days the customer has had his/her current equipment |
| 6 | webcap | Handset is web capable |
| 7 | marryyes | Married (1=Yes; 0=No) |
| 8 | travel | Has traveled to non-US country (1=Yes; 0=No) |
| 9 | pcown | Owns a personal computer (1=Yes; 0=No) |
| 10 | creditcd | Possesses a credit card (1=Yes; 0=No) |
| 11 | retcalls | Number of calls previously made to the retention team |
The 12th column, the dependent variable "churndep", equals 1 if the customer churned, and 0 otherwise.
1. Load the data and prepare it for modeling. Note that the features are already processed for you, so the only thing needed here is to split the data into training and testing. Use pandas to create two data frames: train_df and test_df, where train_df has 80% of the data chosen uniformly at random without replacement (test_df should have the other 20%). Also, make sure to write your own code to do the splits. You may use any random() function from numpy, but do not use the data-splitting functions from sklearn.
(2 Points)
In [1]:
#Importing basic libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier #Decision Tree
import matplotlib.pyplot as plt # To plot graphs
from sklearn.metrics import accuracy_score # To test accuracy
from sklearn import tree
churn=pd.read_csv("../Datasets/Cell2Cell_data.csv")
# Set the training fraction to 80%; the remaining 20% of rows go to the test set (trainsize)
trainsize=0.8
# Setting a seed to reproduce results if needed
np.random.seed(3462)
# Draw a uniform random number in [0,1) for each row; rows with a value below trainsize (~80%) go to the training set
indx=np.random.rand(len(churn))<trainsize
train_churn=churn[indx]
test_churn=churn[~indx]
2. If we had to, how would we prove to ourselves or a colleague that our data was indeed randomly sampled on X? And by prove, I mean empirically, not just showing this person our code. Don't actually do the work, just describe in your own words a test you could run here. Hint: think about this in terms of selection bias and use notes from our 2nd lecture.
(1 Point)
**Answer**
We use NumPy's pseudo-random number generator to assign rows to the two sets, so in principle the split should be free of selection bias. To demonstrate this empirically, we could compare the distribution of each feature (and of churndep) between train_df and test_df: if the sampling is truly random, the summary statistics (means, variances, class proportions) should be statistically indistinguishable, which can be checked with two-sample tests such as a t-test or a Kolmogorov-Smirnov test. A systematic difference on any feature would indicate selection bias.
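A minimal sketch of such a check (assuming scipy is available; the Kolmogorov-Smirnov test is just one reasonable choice of two-sample test):

from scipy import stats
# Two-sample KS test for each predictor: under a uniformly random split,
# the train and test distributions should be statistically indistinguishable
for col in train_churn.columns.drop("churndep"):
    stat, p = stats.ks_2samp(train_churn[col], test_churn[col])
    print("%-10s KS statistic = %.3f, p-value = %.3f" % (col, stat, p))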
3. Now build and train a decision tree classifier using DecisionTreeClassifier() (manual page) on train_df to predict the "churndep" target variable. Make sure to use criterion='entropy' when instantiating an instance of DecisionTreeClassifier(). For all other settings you should use all of the default options.
(1 Point)
In [2]:
features_train=train_churn.loc[:,'revenue':'retcalls']
target_train=train_churn.loc[:,'churndep']
dtree=DecisionTreeClassifier(criterion="entropy")
trained=dtree.fit(features_train,target_train)
4. Using the resulting model from 2.3, show a bar plot of feature names and their feature importance (hint: check the attributes of the DecisionTreeClassifier() object directly in IPython or check the manual!).
(3 Points)
In [3]:
featurelength=np.arange(len(list(features_train)))
names=list(features_train)
importances=pd.DataFrame({"Features":list(features_train),"Importance":trained.feature_importances_})
importances.sort_values(by='Importance',ascending=False,inplace=True)
plt.figure(figsize=(10,5))
plt.title("Feature Importance")
plt.bar(featurelength,importances["Importance"],align="center",color="blue")
plt.xticks(featurelength,importances["Features"],rotation="60")
plt.xlabel('Features')
plt.ylabel('Importance')
## We can alternatively use a horizontal bar plot to represent the same information ##
'''
plt.barh(featurelength,importances["Importance"],align="center",color="blue")
plt.yticks(featurelength,importances["Features"])
plt.ylabel('Features')
plt.xlabel('Importance')
'''
plt.show()
5. Is the relationship between the top 3 most important features (as measured here) negative or positive? If your marketing director asked you to explain the top 3 drivers of churn, how would you interpret the relationship between these 3 features and the churn outcome? What "real-life" connection can you draw between each variable and churn?
(2 Points)
In [4]:
# Correlation between the top 3 features (by importance) and the churn indicator
column_names=importances["Features"][:3].tolist()
column_names.extend(["churndep"])
churn[column_names].corr()
Out[4]:
The top 3 features by importance are revenue, eqpdays and outcalls: mean monthly revenue (revenue), the number of days the customer has had his/her current equipment (eqpdays), and the mean number of outbound voice calls (outcalls).
The sign of each feature's correlation with churndep in the matrix above gives the direction of the relationship: a positive correlation means higher values of that feature are associated with a higher likelihood of churning, while a negative correlation means higher values are associated with staying. This is how we would explain the drivers to a marketing director; for example, a customer who has kept the same equipment for a long time (high eqpdays) may be tempted to switch carriers to obtain a newer handset, while revenue and outbound call volume reflect how heavily the customer uses and pays for the service.
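A small sketch (reusing the importances frame above) that pulls out just each top feature's correlation with churndep, whose sign gives the direction of the relationship:

# Correlation of each top-3 feature with the churn indicator
top3 = importances["Features"][:3].tolist()
churn[top3 + ["churndep"]].corr()["churndep"].drop("churndep")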
6. Using the classifier built in 2.3, try predicting "churndep" on both the train_df and test_df data sets. What is the accuracy on each?
(1 Point)
In [5]:
# Splitting test dataset into target and features
features_test=test_churn.loc[:,'revenue':'retcalls']
target_test=test_churn.loc[:,'churndep']
# Predicting target for train and test dataset
results_test=trained.predict(features_test)
results_train=trained.predict(features_train)
test_accuracy=accuracy_score(target_test,results_test)
train_accuracy=accuracy_score(target_train,results_train)
print("Accuracy for the test dataset is %.3f%% and accuracy for the training dataset is %.3f%%" %(test_accuracy*100,train_accuracy*100))
1. Generate a list of 10 values for each of the parameters min_samples_split and min_samples_leaf.
(1 Point)
In [6]:
# We can use graphviz to visualise the decision tree which may help us
# tree.export_graphviz(trained,out_file="DecisionTree")
splits=np.arange(10,1000,100)
leafnodes=np.arange(10,1000,100)
2. Explain in words your reasoning for choosing the above ranges.
**Answer**
The model we developed with default settings suffers from overfitting, as demonstrated by the large gap between its training and test accuracy in question 6. Increasing min_samples_split and min_samples_leaf restricts how far the tree can grow, so we sweep both parameters over a wide range (10 to 910 in steps of 100) to cover everything from a lightly constrained tree to a heavily pruned one and see where test accuracy peaks.
3. For each combination of values in 3.1 (there should be 100), build a new classifier and check the classifier's accuracy on the test data. Plot the test set accuracy for these options. Use the values of min_samples_split as the x-axis and generate a new series (line) for each value of min_samples_leaf.
(5 Points)
In [7]:
def DtreeIter(train_features,train_target,test_features,test_target,samplesplit,sampleleaf):
    # Fit a tree with the given pruning parameters and return its accuracy on the test set
    treeOpt=DecisionTreeClassifier(criterion="entropy",min_samples_split=samplesplit,min_samples_leaf=sampleleaf)
    treeOpt=treeOpt.fit(train_features,train_target)
    result_Opt=treeOpt.predict(test_features)
    return accuracy_score(test_target,result_Opt)

# For each min_samples_leaf value, record the test accuracy across all min_samples_split values
result_optimise=dict()
for nodes in leafnodes:
    result_optimise[nodes]=[DtreeIter(features_train,target_train,features_test,target_test,values,nodes) for values in splits]

# To find out the best parameter combination
optimal_leaf=max(result_optimise, key=lambda x: max(result_optimise[x]))
optimal_accuracy=max(result_optimise[optimal_leaf])
optimal_split=splits[result_optimise[optimal_leaf].index(optimal_accuracy)]
print("Optimal 'Sample Split Size' is %d and 'Optimal Leaf Samples' are %d. Best accuracy is %.2f%%" %(optimal_split,optimal_leaf,optimal_accuracy*100))

# Plot one line per min_samples_leaf value, with min_samples_split on the x-axis
plt.figure(figsize=(10,5))
styles=['b','r','y','g','c','m','k','b--','r--','y--']
for nodes,style in zip(leafnodes,styles):
    plt.plot(splits,result_optimise[nodes],style,label='Leaf={}'.format(nodes))
plt.legend(loc=4)
plt.xlabel('Min Sample Splits')
plt.ylabel('Accuracy')
plt.title('Classifier Accuracy')
plt.show()
4. Which configuration returns the best accuracy? What is this accuracy? (Note, if you don't see much variation in the test set accuracy across values of min_samples_split or min_samples_leaf, try redoing the above steps with a different range of values).
(1 Point)
**Answer**
When we set the Sample Split size to 710 and the Optimal Leaf Samples to 110, we get the best accuracy of 60.31%. This accuracy is the percentage of test-set observations for which the model predicts the correct churn outcome; the values predicted by the model are compared with the actual values in the test data set to compute this metric.
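As a quick sanity check (a sketch only, reusing the optimal_split and optimal_leaf values found by the grid search above), we could refit the pruned tree and confirm that the gap between training and test accuracy is much smaller than with the default tree:

# Refit the tree with the best pruning parameters and compare train/test accuracy
best_tree=DecisionTreeClassifier(criterion="entropy",min_samples_split=optimal_split,min_samples_leaf=optimal_leaf)
best_tree.fit(features_train,target_train)
print("Train accuracy: %.3f" % accuracy_score(target_train,best_tree.predict(features_train)))
print("Test accuracy:  %.3f" % accuracy_score(target_test,best_tree.predict(features_test)))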
5. If you were working for a marketing department, how would you use your churn production model in a real business environment? Explain why churn prediction might be good for the business and how one might improve churn by using this model.
(2 Points)
**Answer**
Churn prediction is an extremely important activity for any company. In a marketing context, churn can refer both to the loss of high-performing salespeople and to the loss of customers.
By referring to the churn prediction model, the company can identify the customers (or employees) most at risk of leaving and take decisive retention steps before they do, for example by targeting at-risk customers with retention offers or equipment upgrades rather than spending on blanket campaigns. Since retaining an existing customer is generally cheaper than acquiring a new one, this makes retention spending more efficient.
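As an illustration (a sketch only; the column name is illustrative, and in practice one would score with the pruned tree from 3.4 rather than the default tree), the model's predicted churn probabilities can be used to rank customers so the retention team contacts the riskiest ones first:

# Score each customer with the model's churn probability and surface the
# highest-risk customers for a targeted retention campaign
churn_scores=test_churn.copy()
churn_scores["churn_probability"]=trained.predict_proba(features_test)[:,1]
churn_scores.sort_values("churn_probability",ascending=False).head(10)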
1. Load the timeseries data set, and prepare the dataset by converting the variables to date-time format (hint: use date tools). (1 point)
In [18]:
from scipy import stats
from statsmodels.graphics.api import qqplot
fever=pd.read_csv("../Datasets/cases.csv")
# We can directly read and convert using the read_csv function by using the below command:
# fever=pd.read_csv("../Datasets/cases.csv",parse_dates=[0])
fever["YEAR"]=pd.to_datetime(fever["YEAR"],format="%Y")
2. Plot the autocorrelation function (ACF) and partial autocorrelation function (PACF) of the cases time series. (1 point)
In [19]:
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import pacf,acf
# ACF of the yearly case counts
plt.figure(figsize=(10,5))
plt.title("Autocorrelation Plot")
autocorrelation_plot(fever["YFCASES"])
# PACF values plotted as a simple line
plt.figure(figsize=(10,5))
plt.title("Partial Autocorrelation Plot")
plt.plot(pacf(fever["YFCASES"]))
plt.show()
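Alternatively (a sketch; the number of lags shown is an arbitrary choice), statsmodels provides dedicated ACF/PACF plots with confidence bands that make the significant lags easier to read:

import statsmodels.api as sm
# ACF and PACF with confidence intervals, stacked in one figure
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
sm.graphics.tsa.plot_acf(fever["YFCASES"], lags=40, ax=axes[0])
sm.graphics.tsa.plot_pacf(fever["YFCASES"], lags=40, ax=axes[1])
plt.show()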
3. Describe what the plots indicate (in terms of autocorrelation and autoregressive parameter (p) and moving average (q)). 2 points.
Some rules of thumb to recall:
Rule 1: If the ACF shows exponential decay, the PACF has a spike at lag 1, and no correlation for other lags, then use one autoregressive (p) parameter.
Rule 2: If the ACF shows a sine-wave shape pattern or a set of exponential decays, the PACF has spikes at lags 1 and 2, and no correlation for other lags, then use two autoregressive (p) parameters.
Rule 3: If the ACF has a spike at lag 1, no correlation for other lags, and the PACF damps out exponentially, then use one moving average (q) parameter.
Rule 4: If the ACF has spikes at lags 1 and 2, no correlation for other lags, and the PACF has a sine-wave shape pattern or a set of exponential decays, then use two moving average (q) parameters.
Rule 5: If the ACF shows exponential decay starting at lag 1, and the PACF shows exponential decay starting at lag 1, then use one autoregressive (p) and one moving average (q) parameter.
**Answer**
We use "Rule 2" and select the autocorrelation parameter as 2 i.e. p=2 and q=0
4. Another approach to assessing the presence of autocorrelation is by using the Durbin-Watson (DW) statistic. The value of the DW statistic is close to 2 if the errors are uncorrelated. What is DW for our data, and does this match what you observed from the ACF and PACF plots? (1 point)
In [20]:
# We run the Durbin-Watson test on the residuals of an OLS fit on a constant (i.e. deviations from the series mean)
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.stattools import durbin_watson
ols_residuals=OLS(fever["YFCASES"],np.ones(len(fever["YFCASES"]))).fit()
durbin_watson(ols_residuals.resid)
Out[20]:
**Answer**
5. Remove serial dependency by modeling a simple ARMA process with p and q as derived above. Take a look at what the resulting process looks like (plot). (1 point)
In [21]:
import statsmodels.api as sm
indexedFever=fever.set_index("YEAR")
cases=indexedFever.astype(float)
# Fit an ARMA(2,0) model, i.e. p=2 and q=0 as chosen above
arma_result = sm.tsa.ARMA(cases,(2,0)).fit()
# Dynamic in-sample prediction over the later part of the series, plotted against the observed cases
cases['forecast'] = arma_result.predict(start=260, end=309, dynamic=True)
cases[['YFCASES', 'forecast']].plot(figsize=(10, 5))
plt.show()
6. Calculate the residuals, and test the null hypothesis that the residuals come from a normal distribution, and construct a qq-plot. Do the results of the hypothesis test and qq-plot align? (1 point)
In [22]:
print(stats.normaltest(arma_result.resid))
figureP3 = plt.figure(figsize=(10,5))
ax = figureP3.add_subplot(1,1,1)
figP3 = qqplot(arma_result.resid, line='q', ax=ax, fit=True)
plt.show()
**Answer**
We performed a test of normality on the ARMA residuals and also constructed a qq-plot. The results of the normality test and the qq-plot do not align.
7. Now investigate the autocorrelation of your ARMA(p,q) model. Did it improve? These can be examined graphically, but a statistic will help. Next, we calculate the lag, autocorrelation (AC), Q statistic and Prob>Q. The Ljung–Box Q test is a type of statistical test of whether any of a group of autocorrelations of a time series are different from zero. The null hypothesis is, H0: The data are independently distributed (i.e. the correlations in the population from which the sample is taken are 0, so that any observed correlations in the data result from randomness of the sampling process). (Hint: use qstat in tsa.acf).
In [23]:
plt.figure(figsize=(15,5))
plt.title("Autocorrelation plot")
autocorrelation_plot(arma_result.resid)
plt.show()
acfValue=acf(arma_result.resid,qstat=True)
autocorrelation_value=acfValue[0]
qstat_value=acfValue[1]
p_value=acfValue[2]
acfValue
Out[23]:
**Answer**
8. Compute prediction for years 2009-2012 and analyze their fit against actual values. (1 point)
In [24]:
from pandas import datetime
begin_year = datetime(2009,1,1)
end_year = datetime(2012,1,1)
forecasted = arma_result.predict(start=begin_year, end=end_year)
forecasted
Out[24]:
9. Calculate the forecast error via MAE and MFE. (2 points) Reminders: Mean absolute error: The mean absolute error (MAE) value is computed as the average absolute error value. If MAE is zero the forecast is perfect. As compared to the mean squared error (MSE), this measure of fit "de-emphasizes" outliers (unique or rare large error values will affect the MAE less than the MSE).
Mean Forecast Error (MFE, also known as Bias). The MFE is the average error in the observations. A large positive MFE means that the forecast is undershooting the actual observations. A large negative MFE means the forecast is overshooting the actual observations. A value near zero is ideal, and generally a small value means a pretty good fit.
The MAE is a better indicator of fit than the MFE.
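Written out, with $y_t$ the observed cases and $\hat{y}_t$ the corresponding prediction (using the sign convention described above, error = actual minus forecast):

$$\mathrm{MAE}=\frac{1}{n}\sum_{t=1}^{n}\left|y_t-\hat{y}_t\right| \qquad \mathrm{MFE}=\frac{1}{n}\sum_{t=1}^{n}\left(y_t-\hat{y}_t\right)$$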
In [25]:
from sklearn.metrics import mean_absolute_error
ferror_begin=datetime(1700,1,1)
ferror_end=datetime(2008,1,1)
predictionARMA=arma_result.predict(start=ferror_begin,end=ferror_end)
# MAE: average absolute error of the in-sample predictions
MAE=mean_absolute_error(fever["YFCASES"],predictionARMA)
# MFE (bias): average signed error, actual minus predicted
MFE=np.mean(fever["YFCASES"].values-np.asarray(predictionARMA))
print("MAE is %f and MFE is %f" %(MAE,MFE))
**Answer**