TO DO
Acknowlegements
In this notebook we'll explore a dataset containing information about a telecom company's customers. It comes from an IBM Watson repository.
We've chosen this dataset because it is complicated enough to teach us things but not so complicated that it sidetracks us. It also gives us an opportunity to reason about a type of business problem that you will encounter (if you haven't already).
The dataset comes to us in the form of an Excel file. The name of the file is "WAFN-UseC-Telcom-Customer-Churn.xlsx". It's quite a cumbersome name, but we'll stick with it so we'll always know where it came from (you can quickly google the name in a pinch). And in general datasets come to us with various names and it's better to get to used to that right from the start.
We get data in two ways -- by creating it or by getting it from somewhere else. Once we get data, data scientists spend a surprising amount of time getting their heads wrapped around the data, cleaning it, and preparing it to be analyzed. We'll illustrate the main steps of this process here.
The objective of this notebook is to provide a template for data exploration -- the first and perhaps most critical step in any kind of data analysis project including machine learning.
There are two main reasons for visualizing data:
Missing values, data attributes with values that have different orders of magnitude, or skewed/unrepresentative data can all affect the quality of what you can infer from the data.
So how should data be visualized? There is no set recipe. Here are some guidelines.
First thing to do is to load up the data. This depends on the source of the input file -- it could come from a file on the local computer system or from a remote location like an Amazon AWS S3 bucket. For us, the file is already stored locally in the Data folder.
NOTE: We'll use some standard Python packages to load, manipulate, and visualize the data. These packages are tools that make our lives as data scientists a lot easier and less tedious. Packages are loaded using the "import" keyword or the "from A import B" or the "from A import B as C" locutions.
In [1]:
# We keep plotting simple and use common packages and defaults
import matplotlib.pyplot as plt
import seaborn as sns
# Set the aesthetics for Seaborn visuals
sns.set(context='notebook',
style='whitegrid',
palette='deep',
font='sans-serif',
font_scale=1.3,
color_codes=True,
rc=None)
%matplotlib inline
In [2]:
import os # to navigate the file system
import numpy as np # for number crunching
import pandas as pd # for data loading and manipulation
# OS-independent way to navigate the file system
# One directory up in relation to directory of this notebook
new_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)
# Where the file is
file_url = new_dir + os.sep + "Data" + os.sep + "WA_Fn-UseC_-Telco-Customer-Churn.xlsx"
file_url
Out[2]:
In [3]:
# Read the excel sheet into a pandas dataframe
df_churn = pd.read_excel(file_url, sheetname=0)
The data is not in a form that can be manipulated for exploration and visualization.
In [4]:
# Look at the first few lines of the data -- scroll to the right to see more columns
df_churn.head()
Out[4]:
How many rows (# of customers) and how many columns (# attritbutes of each customer) do we have?
In [5]:
# Number of rows and columns
num_rows, num_cols = df_churn.shape
num_rows, num_cols
Out[5]:
This meams we have 7,043 customers with each customer tagged with 21 individual attributes such as customerID, gender, SeniorCitizen, etc. Let's get a complete list of the attributes.
In [6]:
# Here is a list of the features with the first 5 values of each feature.
feature_list = list(df_churn)
# First 5 values of each feature in the list
first_5 = [list(df_churn[attribute][0:5]) for attribute in feature_list]
list(zip(feature_list, first_5))
Out[6]:
Some of the features have a discrete set of possible values (e.g., gender, PaymentMethod) while some others can take a range of values that need not be discrete (e.g., tenure, MonthlyCharges, TotalCharges).
Some of the features in our dataset are categorical -- their values come from a small handful of discrete possibilities. Features like gender and payment method fit are categorical. Categorical features are also known as discrete features. Let's separate the discrete/categorical features from the rest -- we'll get a better grip if we look at them separately first.
In [7]:
# Identify all the features that are categorical
# Feature index numbers that are *not* categorical.
# Just count from the dataset starting at customerID's index = 0
not_categorical = [0,5,18,19] # CustomerID is not a feature but a unique identifier
# the categorical features are then the complement of the above list
categorical = list(set(range(df_churn.shape[1])) - set(not_categorical))
# get the unique values of the categorical features
[[feature_list[feature_index], list(df_churn.iloc[:, feature_index].unique())] \
for feature_index in categorical]
Out[7]:
In [8]:
# Put the numerical features into a list for subsquent use
numeric_features = [feature_list[n] for n in not_categorical][1:] # CustomerID is not a numeric feature
numeric_features
Out[8]:
Notice that the SeniorCitizen attribute or feature is respresented numerically as a 1 or 0 -- but these numbers actually represent "Yes" or "No". In othere words, SeniorCitizen is a categorical attribute.
In [9]:
# View a selection of rows and columns in the df_churn dataframe
df_map = {'telco churn data': df_churn}
# We're calling our dataset 'telco churn data'
def table_view(data_frame_name, feature_list, start_row=3, end_row=5):
'''
Displays selected columns and rows of a data frame.
'''
# Verify the inputs are sane
# get the size of the dataframe
num_rows, num_cols = df_map[data_frame_name].shape
if (start_row < 0) | (start_row > num_rows) :
return print("Please use a valid Start Row number. It can be any number from 0 to {}".format(num_rows))
if (end_row < 0) | (end_row > num_rows) | (end_row < start_row + 1):
return print("Please use a valid End Row number. \
It can be any number from 0 to {} \
and must be greater than your start row number".format(num_rows))
view = df_map[data_frame_name][feature_list].iloc[start_row:end_row]
return view
In [10]:
# Using the table_view function defined above
# The name of our dataset (as defined above) is 'telco churn data'
# You can select any and any number of attributes in selected_columns
selected_columns = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']
table_view('telco churn data', selected_columns, 10, 14)
Out[10]:
In [11]:
# We can explore views in an interactive way
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from ipywidgets import Button, HBox, VBox
from IPython.display import display
from IPython.display import clear_output
In [12]:
# Layout the interactive view widgets
# Data Frame Chooser Dropdown
dataFrame = widgets.Select(
options=['telco churn data'],
value='telco churn data',
description='Data Source:',
disabled=False
)
# Start Row Text Field
startRow = widgets.IntText(
value=7,
description='Start Row:',
disabled=False
)
# End Row Text Field
endRow = widgets.IntText(
value=12,
description='End Row:',
disabled=False
)
# Attribute Selector (Multiple Select)
allFeatures = widgets.SelectMultiple(
options = feature_list,
rows = 20,
description = 'Select Mulitple Features:',
value = ['gender', 'SeniorCitizen', 'tenure', 'MonthlyCharges']
)
# Button
button = widgets.Button(
description='Show View',
disabled=False,
button_style='info', # 'success', 'info', 'warning', 'danger' or ''
tooltip='Go!',
icon=''
)
def on_button_clicked(b):
# Pass the values of the widgets to the table_view function
clear_output()
return print(table_view(dataFrame.value, list(allFeatures.value), startRow.value, endRow.value))
button.on_click(on_button_clicked)
In [13]:
# Display the elements
HBox([VBox([dataFrame, allFeatures, button]), VBox([startRow, endRow])])
In [14]:
# Here are summary statistics - rough format but still useful
#df_churn['TotalCharges'].astype(float)
df_churn[['tenure', 'MonthlyCharges', 'TotalCharges']].describe(include='all')
Out[14]:
In [15]:
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1, ncols=3, figsize=(12,4))
sns.boxplot(x=df_churn['tenure'], ax = ax1, palette='Set1')
#ax1.set_title("Tenure")
sns.boxplot(x=df_churn['MonthlyCharges'], ax = ax2, palette='Set2')
#ax2.set_title("Monthly Charge")
# There are some monthly charges missing -- supress for now because we haven't yet handled missing values
#sns.boxplot(x=df_churn['TotalCharges'], ax = ax3, palette='Set3')
#ax3.set_title("Total Charge")
plt.tight_layout()
In [16]:
# Set up the plot
def plotFeatureHist(feature_name):
fig, ax = plt.subplots(figsize=(12,7))
sns.distplot(df_churn[feature_name], kde=False)
return plt.show()
In [17]:
# How are the monthly charges distributed?
plotFeatureHist('MonthlyCharges')
In [18]:
# How are tenure and monthly charges related?
g = sns.JointGrid(x="MonthlyCharges", y="tenure", data=df_churn)
g = g.plot(sns.regplot, sns.distplot)
In [19]:
# Pairwise scatter plots of the numerical attributes
cols_numeric = ['tenure', 'MonthlyCharges', 'TotalCharges']
sns.set(style='whitegrid', context='notebook')
sns.pairplot(df_churn[cols_numeric], size=3.5)
Out[19]:
In [20]:
# Calculate the correlation table
# Not sure why SeniorCitizen appears but TotalCharges doesn't appear.
corr = df_churn.corr()
corr
Out[20]:
In [21]:
# Correlation Density Plot
feature_display_names = ['Tenure', 'Monthly Charges', 'Total Charges']
cm = df_churn.corr()
sns.set(font_scale=1)
# NOTE: fmt directive controls number of decimal points displayed
hm = sns.heatmap(cm,
cbar=True,
annot=True,
square=False,
fmt='.2f',
annot_kws={'size':14},
yticklabels=feature_display_names,
xticklabels=feature_display_names)
plt.title('Correlation of Numerical Features')
Out[21]:
In [22]:
# Remind ourselves of the categorical features in the dataset
[feature_list[feature_index] for feature_index in categorical]
Out[22]:
How are the categorical features of the dataset distributed? For example, are there very few senior citizens? Is it the case that an overwhelming number of customers in the dataset have no dependents? Understanding how the features are balanced will give us a sense of how generalizable the results obtained from the dataset will be.
In [23]:
# Set up the plot
def plotFeatureCount(feature_name, count_flag):
fig, ax = plt.subplots(figsize=(12,7))
if count_flag == 'Count':
ax = sns.countplot(x=feature_name, data=df_churn)
elif count_flag == 'Percentage':
x = df_churn[feature_name].unique()
y = [len([val for val in df_churn[feature_name] if val == x_val])/len(df_churn[feature_name]) * 100 \
for x_val in x]
ax = sns.barplot(x,y)
plt.ylabel(count_flag)
return plt.show()
In [24]:
# set up the plot for interactivity
# Dropdown
w_features = widgets.Dropdown(
options = [feature_list[feature_index] for feature_index in categorical],
description = 'Select Feature:',
value = 'gender',
button_style='info'
)
w_radio = widgets.RadioButtons(
options=['Count', 'Percentage'],
value='Count',
description='Display:',
disabled=False
)
def on_value_change(change):
# Pass the value of the dropdown to the plotFeatureCount function
clear_output()
return plotFeatureCount(w_features.value, w_radio.value)
w_features.observe(on_value_change)
w_radio.observe(on_value_change)
In [25]:
# Show a default plot
plotFeatureCount(w_features.value, w_radio.value)
# Show the widgets
HBox([w_features, w_radio])
In [26]:
# One way to visualize the effect of gender on churn -- i.e. interdependence between variables
from plotnine import *
(ggplot(df_churn, aes(x='Churn', fill='gender')) + geom_bar(position='fill'))
Out[26]:
In [27]:
# Does the streaming movies along with gender affect churn?
from plotnine import *
(ggplot(df_churn, aes(x='Churn', fill='gender')) + geom_bar(position='fill') + facet_wrap('~StreamingMovies'))
Out[27]:
In [28]:
# Density Plot
sns.kdeplot(df_churn.query("Churn == 'No'").tenure, shade=True, alpha=0.2, label='No', color='salmon')
sns.kdeplot(df_churn.query("Churn == 'Yes'").tenure, shade=True, alpha=0.2, label='Yes', color='dodgerblue')
plt.title('The first 20 months are critical')
plt.xlabel('Tenure')
Out[28]:
In [29]:
# Facet Plots
g = sns.FacetGrid(df_churn, row='Churn', col='gender', margin_titles=True)
g.map(sns.distplot, 'tenure')
Out[29]:
In [30]:
# Pivots
# Do monthly charges depend on gender? They don't seem to; but they do seem to depend on whether
# or not the person is a senior citizen -- 0 = Not a senior citizen, 1 = senior citizen
fig, ax = plt.subplots(figsize=(10,7))
sns.boxplot(x='gender', y='MonthlyCharges', hue="SeniorCitizen", data=df_churn, palette='Set3')
plt.legend(loc='upper right')
Out[30]:
In [31]:
# Jitter plot
# Do monthly charges depend on the length of contract?
from plotnine import *
(ggplot(df_churn, aes(x='Contract', y='MonthlyCharges')) + geom_jitter(position=position_jitter(0.4)))
Out[31]:
In [32]:
# Jitter plot
# How does online backup service affect monthly charges?
from plotnine import *
(ggplot(df_churn, aes(x='OnlineBackup', y='MonthlyCharges')) + geom_jitter(position=position_jitter(0.4)))
Out[32]:
In [33]:
# For each column in the dataset, add up the rows in which the column data is missing
# You can do this across the entire dataset for a quick look at what's missing
df_churn.isnull().sum()
Out[33]:
The zeros mean there are no missing data values. This is very nice, but unusual in most datasets, so we've lucked out. When there are missing values it takes effort and judgement to decide how to handle them. Sometimes it's not clear how to handle missing data even though there are a number of standard techniques to choose from.
One thing to watch out for is a value that seems to be missing, except that it really is an empty string like '' or a string with some spaces such as ' '. These usually trip up the plotting functions and that's one (stressful) way to identify them.
In [34]:
# For each of the features, find rows where they might be empty -- we'll have to handle these appropriately
def isEmpty(feature):
empty_rows = []
for i in range(len(df_churn)):
if isinstance(df_churn[feature][i], str):
empty_rows.append(i)
return empty_rows
In [35]:
# Which of the numerical features have no numerical values?
empty = [[feature,isEmpty(feature)] for feature in numeric_features]
empty
Out[35]:
In [36]:
# Let's have a look at these rows where one or more numberical features are empty.
df_churn.iloc[empty[2][1]]
Out[36]:
It turns out that TotalCharges are empty when tenure = 0. These rows do have a monthly charge. We'll just make the total charge equal the monthly charge in these cases.
In [37]:
df_churn.iloc[488]['MonthlyCharges']
Out[37]:
In [38]:
# Get the monthly charges for these rows
monthly_charges = [df_churn.iloc[loc]['MonthlyCharges'] for loc in empty[2][1]]
monthly_charges
#df_churn.set_value(488, 'TotalCharges', 52.555)
#df_churn.iloc[488]
Out[38]:
In [39]:
# For all customers whose tenure is 0 months, set the TotalCharges equal to the MonthlyCharges
# This is the new dataset
[df_churn.set_value(loc, 'TotalCharges', df_churn.iloc[loc]['MonthlyCharges']) for loc in empty[2][1]]
df_churn.shape
Out[39]:
In [40]:
# Let's check if the values of TotalCharges are as they should be.
df_churn.iloc[empty[2][1]][['MonthlyCharges','TotalCharges']]
Out[40]:
Data exploration and visualization are useful first steps in machine learning because they help give us ideas for how to investigate the data as well as prepare the data so that its suitable for machine learning. Some important activities include:
To systematicaly explore the dataset, take the following steps:
When there are too many features (columns) in the dataset to apply the methods above, there are other techiques that come into play both for visualization and for analysis. These techniques allow us to rank the importance of features so we can prioritize the ones to scrutinize. We'll get to these techiques when we study feature engineering later in the course.
In [ ]: