A Systematic Approach to Visualizing Data

Exploring a Telecom Customer Churn Dataset

TO DO

  • Nothing so far.

Acknowlegements

  • Thanks to David Wihl for fixing a plotting error.

Introduction

In this notebook we'll explore a dataset containing information about a telecom company's customers. It comes from an IBM Watson repository.

We've chosen this dataset because it is complicated enough to teach us things but not so complicated that it sidetracks us. It also gives us an opportunity to reason about a type of business problem that you will encounter (if you haven't already).

The dataset comes to us in the form of an Excel file. The name of the file is "WAFN-UseC-Telcom-Customer-Churn.xlsx". It's quite a cumbersome name, but we'll stick with it so we'll always know where it came from (you can quickly google the name in a pinch). And in general datasets come to us with various names and it's better to get to used to that right from the start.

We get data in two ways -- by creating it or by getting it from somewhere else. Once we get data, data scientists spend a surprising amount of time getting their heads wrapped around the data, cleaning it, and preparing it to be analyzed. We'll illustrate the main steps of this process here.

The objective of this notebook is to provide a template for data exploration -- the first and perhaps most critical step in any kind of data analysis project including machine learning.

There are two main reasons for visualizing data:

  • To know what you have
  • To get ideas for how to investigate the data

Missing values, data attributes with values that have different orders of magnitude, or skewed/unrepresentative data can all affect the quality of what you can infer from the data.

So how should data be visualized? There is no set recipe. Here are some guidelines.

1. Get the Data into Manipulable Form

First thing to do is to load up the data. This depends on the source of the input file -- it could come from a file on the local computer system or from a remote location like an Amazon AWS S3 bucket. For us, the file is already stored locally in the Data folder.

NOTE: We'll use some standard Python packages to load, manipulate, and visualize the data. These packages are tools that make our lives as data scientists a lot easier and less tedious. Packages are loaded using the "import" keyword or the "from A import B" or the "from A import B as C" locutions.


In [1]:
# We keep plotting simple and use common packages and defaults
import matplotlib.pyplot as plt
import seaborn as sns
# Set the aesthetics for Seaborn visuals
sns.set(context='notebook', 
        style='whitegrid', 
        palette='deep', 
        font='sans-serif', 
        font_scale=1.3, 
        color_codes=True, 
        rc=None)
%matplotlib inline

In [2]:
import os # to navigate the file system
import numpy as np # for number crunching
import pandas as pd # for data loading and manipulation

# OS-independent way to navigate the file system
# One directory up in relation to directory of this notebook
new_dir = os.path.normpath(os.getcwd() + os.sep + os.pardir)
# Where the file is
file_url = new_dir + os.sep + "Data" + os.sep + "WA_Fn-UseC_-Telco-Customer-Churn.xlsx"
file_url


Out[2]:
'/Users/js/Dropbox/Repos/MLCourse/Data/WA_Fn-UseC_-Telco-Customer-Churn.xlsx'

In [3]:
# Read the excel sheet into a pandas dataframe
df_churn = pd.read_excel(file_url, sheetname=0)

The data is not in a form that can be manipulated for exploration and visualization.

2. Get a Handle on the Structure of the Data

2a. Number of Rows and Columns


In [4]:
# Look at the first few lines of the data -- scroll to the right to see more columns
df_churn.head()


Out[4]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

5 rows × 21 columns

How many rows (# of customers) and how many columns (# attritbutes of each customer) do we have?


In [5]:
# Number of rows and columns
num_rows, num_cols = df_churn.shape
num_rows, num_cols


Out[5]:
(7043, 21)

This meams we have 7,043 customers with each customer tagged with 21 individual attributes such as customerID, gender, SeniorCitizen, etc. Let's get a complete list of the attributes.

2b. An Overview of Customer Features (or Attributes)

Here is a complete list of features (also known as attributes) that describe each of the 7,043 customers. We don't know yet if all customers are tagged with all 21 features -- we'll find out soon.


In [6]:
# Here is a list of the features with the first 5 values of each feature.
feature_list = list(df_churn)
# First 5 values of each feature in the list
first_5 = [list(df_churn[attribute][0:5]) for attribute in feature_list]
list(zip(feature_list, first_5))


Out[6]:
[('customerID',
  ['7590-VHVEG', '5575-GNVDE', '3668-QPYBK', '7795-CFOCW', '9237-HQITU']),
 ('gender', ['Female', 'Male', 'Male', 'Male', 'Female']),
 ('SeniorCitizen', [0, 0, 0, 0, 0]),
 ('Partner', ['Yes', 'No', 'No', 'No', 'No']),
 ('Dependents', ['No', 'No', 'No', 'No', 'No']),
 ('tenure', [1, 34, 2, 45, 2]),
 ('PhoneService', ['No', 'Yes', 'Yes', 'No', 'Yes']),
 ('MultipleLines', ['No phone service', 'No', 'No', 'No phone service', 'No']),
 ('InternetService', ['DSL', 'DSL', 'DSL', 'DSL', 'Fiber optic']),
 ('OnlineSecurity', ['No', 'Yes', 'Yes', 'Yes', 'No']),
 ('OnlineBackup', ['Yes', 'No', 'Yes', 'No', 'No']),
 ('DeviceProtection', ['No', 'Yes', 'No', 'Yes', 'No']),
 ('TechSupport', ['No', 'No', 'No', 'Yes', 'No']),
 ('StreamingTV', ['No', 'No', 'No', 'No', 'No']),
 ('StreamingMovies', ['No', 'No', 'No', 'No', 'No']),
 ('Contract',
  ['Month-to-month',
   'One year',
   'Month-to-month',
   'One year',
   'Month-to-month']),
 ('PaperlessBilling', ['Yes', 'No', 'Yes', 'No', 'Yes']),
 ('PaymentMethod',
  ['Electronic check',
   'Mailed check',
   'Mailed check',
   'Bank transfer (automatic)',
   'Electronic check']),
 ('MonthlyCharges',
  [29.850000000000001,
   56.950000000000003,
   53.850000000000001,
   42.299999999999997,
   70.700000000000003]),
 ('TotalCharges', [29.85, 1889.5, 108.15, 1840.75, 151.65]),
 ('Churn', ['No', 'No', 'Yes', 'No', 'Yes'])]

Some of the features have a discrete set of possible values (e.g., gender, PaymentMethod) while some others can take a range of values that need not be discrete (e.g., tenure, MonthlyCharges, TotalCharges).

2c. Categorical Feature and Their Possible Values

Some of the features in our dataset are categorical -- their values come from a small handful of discrete possibilities. Features like gender and payment method fit are categorical. Categorical features are also known as discrete features. Let's separate the discrete/categorical features from the rest -- we'll get a better grip if we look at them separately first.


In [7]:
# Identify all the features that are categorical

# Feature index numbers that are *not* categorical.
# Just count from the dataset starting at customerID's index = 0
not_categorical = [0,5,18,19] # CustomerID is not a feature but a unique identifier
# the categorical features are then the complement of the above list
categorical = list(set(range(df_churn.shape[1])) - set(not_categorical))
# get the unique values of the categorical features
[[feature_list[feature_index], list(df_churn.iloc[:, feature_index].unique())] \
    for feature_index in categorical]


Out[7]:
[['gender', ['Female', 'Male']],
 ['SeniorCitizen', [0, 1]],
 ['Partner', ['Yes', 'No']],
 ['Dependents', ['No', 'Yes']],
 ['PhoneService', ['No', 'Yes']],
 ['MultipleLines', ['No phone service', 'No', 'Yes']],
 ['InternetService', ['DSL', 'Fiber optic', 'No']],
 ['OnlineSecurity', ['No', 'Yes', 'No internet service']],
 ['OnlineBackup', ['Yes', 'No', 'No internet service']],
 ['DeviceProtection', ['No', 'Yes', 'No internet service']],
 ['TechSupport', ['No', 'Yes', 'No internet service']],
 ['StreamingTV', ['No', 'Yes', 'No internet service']],
 ['StreamingMovies', ['No', 'Yes', 'No internet service']],
 ['Contract', ['Month-to-month', 'One year', 'Two year']],
 ['PaperlessBilling', ['Yes', 'No']],
 ['PaymentMethod',
  ['Electronic check',
   'Mailed check',
   'Bank transfer (automatic)',
   'Credit card (automatic)']],
 ['Churn', ['No', 'Yes']]]

2d. Numerical Features

Three of the features in this dataset are numerical:

  • tenure
  • MonthlyCharges
  • TotalCharges

In [8]:
# Put the numerical features into a list for subsquent use
numeric_features = [feature_list[n] for n in not_categorical][1:] # CustomerID is not a numeric feature
numeric_features


Out[8]:
['tenure', 'MonthlyCharges', 'TotalCharges']

Notice that the SeniorCitizen attribute or feature is respresented numerically as a 1 or 0 -- but these numbers actually represent "Yes" or "No". In othere words, SeniorCitizen is a categorical attribute.

3. Visualize the Numerical Features

Let's build ourselves a handy way to look at any set of attributes we choose. We can use this to isolate and explore various groups of attributes.


In [9]:
# View a selection of rows and columns in the df_churn dataframe
df_map = {'telco churn data': df_churn}
# We're calling our dataset 'telco churn data'
def table_view(data_frame_name, feature_list, start_row=3, end_row=5):
    '''
    Displays selected columns and rows of a data frame.
    '''
    # Verify the inputs are sane
    # get the size of the dataframe
    num_rows, num_cols = df_map[data_frame_name].shape
    if (start_row < 0) | (start_row > num_rows) :
        return print("Please use a valid Start Row number. It can be any number from 0 to {}".format(num_rows))
    if (end_row < 0) | (end_row > num_rows) | (end_row < start_row + 1):
        return print("Please use a valid End Row number. \
                     It can be any number from 0 to {} \
                     and must be greater than your start row number".format(num_rows)) 
    
    view = df_map[data_frame_name][feature_list].iloc[start_row:end_row]
    
    return view

In [10]:
# Using the table_view function defined above
# The name of our dataset (as defined above) is 'telco churn data'
# You can select any and any number of attributes in selected_columns
selected_columns = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']
table_view('telco churn data', selected_columns, 10, 14)


Out[10]:
tenure SeniorCitizen MonthlyCharges TotalCharges
10 13 0 49.95 587.45
11 16 0 18.95 326.8
12 58 0 100.35 5681.1
13 49 0 103.70 5036.3

In [11]:
# We can  explore views in an interactive way
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from ipywidgets import Button, HBox, VBox
from IPython.display import display
from IPython.display import clear_output

In [12]:
# Layout the interactive view widgets

# Data Frame Chooser Dropdown
dataFrame = widgets.Select(
             options=['telco churn data'],
             value='telco churn data',
             description='Data Source:',
             disabled=False
            )

# Start Row Text Field
startRow = widgets.IntText(
            value=7,
            description='Start Row:',
            disabled=False
           )

# End Row Text Field
endRow = widgets.IntText(
           value=12,
           description='End Row:',
           disabled=False
          )

# Attribute Selector (Multiple Select)
allFeatures = widgets.SelectMultiple(
               options = feature_list,
               rows = 20,
               description = 'Select Mulitple Features:',
               value = ['gender', 'SeniorCitizen', 'tenure', 'MonthlyCharges']
              )

# Button
button = widgets.Button(
          description='Show View',
          disabled=False,
          button_style='info', # 'success', 'info', 'warning', 'danger' or ''
          tooltip='Go!',
          icon=''
         )

def on_button_clicked(b):
    # Pass the values of the widgets to the table_view function
    clear_output()
    return print(table_view(dataFrame.value, list(allFeatures.value), startRow.value, endRow.value))
    
button.on_click(on_button_clicked)

A Simple Display of Slices of the Dataset


In [13]:
# Display the elements
HBox([VBox([dataFrame, allFeatures, button]), VBox([startRow, endRow])])


Summary Statistics


In [14]:
# Here are summary statistics - rough format but still useful
#df_churn['TotalCharges'].astype(float)
df_churn[['tenure', 'MonthlyCharges', 'TotalCharges']].describe(include='all')


Out[14]:
tenure MonthlyCharges TotalCharges
count 7043.000000 7043.000000 7043
unique NaN NaN 6531
top NaN NaN
freq NaN NaN 11
mean 32.371149 64.761692 NaN
std 24.559481 30.090047 NaN
min 0.000000 18.250000 NaN
25% 9.000000 35.500000 NaN
50% 29.000000 70.350000 NaN
75% 55.000000 89.850000 NaN
max 72.000000 118.750000 NaN

Box Plots


In [15]:
fig, (ax1,ax2,ax3) = plt.subplots(nrows=1, ncols=3, figsize=(12,4))
sns.boxplot(x=df_churn['tenure'], ax = ax1, palette='Set1') 
#ax1.set_title("Tenure")
sns.boxplot(x=df_churn['MonthlyCharges'], ax = ax2, palette='Set2')
#ax2.set_title("Monthly Charge")
# There are some monthly charges missing -- supress for now because we haven't yet handled missing values
#sns.boxplot(x=df_churn['TotalCharges'], ax = ax3, palette='Set3')
#ax3.set_title("Total Charge")
plt.tight_layout()


Histogram


In [16]:
# Set up the plot
def plotFeatureHist(feature_name):
    fig, ax = plt.subplots(figsize=(12,7))
    sns.distplot(df_churn[feature_name], kde=False)
    return plt.show()

In [17]:
# How are the monthly charges distributed?
plotFeatureHist('MonthlyCharges')


4. Visualize Relationships Between Numerical Features


In [18]:
# How are tenure and monthly charges related?
g = sns.JointGrid(x="MonthlyCharges", y="tenure", data=df_churn)
g = g.plot(sns.regplot, sns.distplot)



In [19]:
# Pairwise scatter plots of the numerical attributes
cols_numeric = ['tenure', 'MonthlyCharges', 'TotalCharges']
sns.set(style='whitegrid', context='notebook')
sns.pairplot(df_churn[cols_numeric], size=3.5)


Out[19]:
<seaborn.axisgrid.PairGrid at 0x11439d2b0>

How are the numerical attributes correlated?


In [20]:
# Calculate the correlation table
# Not sure why SeniorCitizen appears but TotalCharges doesn't appear.
corr = df_churn.corr()
corr


Out[20]:
SeniorCitizen tenure MonthlyCharges
SeniorCitizen 1.000000 0.016567 0.220173
tenure 0.016567 1.000000 0.247900
MonthlyCharges 0.220173 0.247900 1.000000

In [21]:
# Correlation Density Plot
feature_display_names = ['Tenure', 'Monthly Charges', 'Total Charges']
cm = df_churn.corr()
sns.set(font_scale=1)
# NOTE: fmt directive controls number of decimal points displayed
hm = sns.heatmap(cm,
                cbar=True,
                annot=True,
                square=False,
                fmt='.2f',
                annot_kws={'size':14},
                yticklabels=feature_display_names,
                xticklabels=feature_display_names)
plt.title('Correlation of Numerical Features')


Out[21]:
<matplotlib.text.Text at 0x11459ebe0>

5. Visualize the Categorical Features


In [22]:
# Remind ourselves of the categorical features in the dataset
[feature_list[feature_index] for feature_index in categorical]


Out[22]:
['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'MultipleLines',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'Churn']

How Balanced are the Categorical Features?

How are the categorical features of the dataset distributed? For example, are there very few senior citizens? Is it the case that an overwhelming number of customers in the dataset have no dependents? Understanding how the features are balanced will give us a sense of how generalizable the results obtained from the dataset will be.


In [23]:
# Set up the plot
def plotFeatureCount(feature_name, count_flag):
    fig, ax = plt.subplots(figsize=(12,7))
    
    if count_flag == 'Count':
        ax = sns.countplot(x=feature_name, data=df_churn)
    elif count_flag == 'Percentage':
        x = df_churn[feature_name].unique()
        y = [len([val for val in df_churn[feature_name] if val == x_val])/len(df_churn[feature_name]) * 100 \
             for x_val in x]
        ax = sns.barplot(x,y)
    
    plt.ylabel(count_flag)
    return plt.show()

In [24]:
# set up the plot for interactivity
# Dropdown
w_features = widgets.Dropdown(
               options = [feature_list[feature_index] for feature_index in categorical],
               description = 'Select Feature:',
               value = 'gender',
               button_style='info'
              )

w_radio = widgets.RadioButtons(
    options=['Count', 'Percentage'],
    value='Count',
    description='Display:',
    disabled=False
)

def on_value_change(change):
    # Pass the value of the dropdown to the plotFeatureCount function
    clear_output()
    return plotFeatureCount(w_features.value, w_radio.value)

w_features.observe(on_value_change)
w_radio.observe(on_value_change)

Display Categorical Features


In [25]:
# Show a default plot
plotFeatureCount(w_features.value, w_radio.value)
# Show the widgets
HBox([w_features, w_radio])


EXERCISE 1

How balanced are the categorical features? Do you anticipate any problems using this dataset to predict if a customer will switch (be subject to churn) or not?

6. Visualize Relationships Between Categorical Features

Does gender make a difference for churn?


In [26]:
# One way to visualize the effect of gender on churn -- i.e. interdependence between variables
from plotnine import *
(ggplot(df_churn, aes(x='Churn', fill='gender')) + geom_bar(position='fill'))


/anaconda/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
Out[26]:
<ggplot: (293013182)>

Does a factor in addition to gender affect churn?


In [27]:
# Does the streaming movies along with gender affect churn?
from plotnine import *
(ggplot(df_churn, aes(x='Churn', fill='gender')) + geom_bar(position='fill') + facet_wrap('~StreamingMovies'))


Out[27]:
<ggplot: (-9223372036560993052)>

7. Visualize Relationships Between Numerical and Categorical Attributes

Does tenure affect churn?


In [28]:
# Density Plot
sns.kdeplot(df_churn.query("Churn == 'No'").tenure, shade=True, alpha=0.2, label='No', color='salmon')
sns.kdeplot(df_churn.query("Churn == 'Yes'").tenure, shade=True, alpha=0.2, label='Yes', color='dodgerblue')
plt.title('The first 20 months are critical')
plt.xlabel('Tenure')


Out[28]:
<matplotlib.text.Text at 0x117412b70>

Does tenure in addition to gender affect churn?


In [29]:
# Facet Plots
g = sns.FacetGrid(df_churn, row='Churn', col='gender', margin_titles=True)
g.map(sns.distplot, 'tenure')


Out[29]:
<seaborn.axisgrid.FacetGrid at 0x118394ef0>

Do senior citizens pay more per month?


In [30]:
# Pivots
# Do monthly charges depend on gender? They don't seem to; but they do seem to depend on whether 
# or not the person is a senior citizen -- 0 = Not a senior citizen, 1 = senior citizen
fig, ax = plt.subplots(figsize=(10,7))
sns.boxplot(x='gender', y='MonthlyCharges', hue="SeniorCitizen", data=df_churn, palette='Set3')
plt.legend(loc='upper right')


Out[30]:
<matplotlib.legend.Legend at 0x118d9a390>

Do monthly charges depend on the lenght of the contract?


In [31]:
# Jitter plot
# Do monthly charges depend on the length of contract?
from plotnine import *
(ggplot(df_churn, aes(x='Contract', y='MonthlyCharges')) + geom_jitter(position=position_jitter(0.4)))


Out[31]:
<ggplot: (294155542)>

Does having online backup service increase monthly charges?


In [32]:
# Jitter plot
# How does online backup service affect monthly charges?
from plotnine import *
(ggplot(df_churn, aes(x='OnlineBackup', y='MonthlyCharges')) + geom_jitter(position=position_jitter(0.4)))


Out[32]:
<ggplot: (-9223372036564670439)>

8. Find and Handle Missing Values


In [33]:
# For each column in the dataset, add up the rows in which the column data is missing
# You can do this across the entire dataset for a quick look at what's missing
df_churn.isnull().sum()


Out[33]:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

The zeros mean there are no missing data values. This is very nice, but unusual in most datasets, so we've lucked out. When there are missing values it takes effort and judgement to decide how to handle them. Sometimes it's not clear how to handle missing data even though there are a number of standard techniques to choose from.

One thing to watch out for is a value that seems to be missing, except that it really is an empty string like '' or a string with some spaces such as ' '. These usually trip up the plotting functions and that's one (stressful) way to identify them.


In [34]:
# For each of the features, find rows where they might be empty -- we'll have to handle these appropriately
def isEmpty(feature):
    empty_rows = []
    for i in range(len(df_churn)):
        if isinstance(df_churn[feature][i], str):
            empty_rows.append(i)
    
    return empty_rows

In [35]:
# Which of the numerical features have no numerical values?
empty = [[feature,isEmpty(feature)] for feature in numeric_features]
empty


Out[35]:
[['tenure', []],
 ['MonthlyCharges', []],
 ['TotalCharges',
  [488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754]]]

In [36]:
# Let's have a look at these rows where one or more numberical features are empty.
df_churn.iloc[empty[2][1]]


Out[36]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL Yes ... Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 No
753 3115-CZMZD Male 0 No Yes 0 Yes No No No internet service ... No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 No
936 5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL Yes ... Yes No Yes Yes Two year No Mailed check 80.85 No
1082 4367-NUYAO Male 0 Yes Yes 0 Yes Yes No No internet service ... No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 No
1340 1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL Yes ... Yes Yes Yes No Two year No Credit card (automatic) 56.05 No
3331 7644-OMVMY Male 0 Yes Yes 0 Yes No No No internet service ... No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 No
3826 3213-VVOLG Male 0 Yes Yes 0 Yes Yes No No internet service ... No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 No
4380 2520-SGTTA Female 0 Yes Yes 0 Yes No No No internet service ... No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 No
5218 2923-ARZLG Male 0 Yes Yes 0 Yes No No No internet service ... No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 No
6670 4075-WKNIU Female 0 Yes Yes 0 Yes Yes DSL No ... Yes Yes Yes No Two year No Mailed check 73.35 No
6754 2775-SEFEE Male 0 No Yes 0 Yes Yes DSL Yes ... No Yes No No Two year Yes Bank transfer (automatic) 61.90 No

11 rows × 21 columns

It turns out that TotalCharges are empty when tenure = 0. These rows do have a monthly charge. We'll just make the total charge equal the monthly charge in these cases.


In [37]:
df_churn.iloc[488]['MonthlyCharges']


Out[37]:
52.549999999999997

In [38]:
# Get the monthly charges for these rows
monthly_charges = [df_churn.iloc[loc]['MonthlyCharges'] for loc in empty[2][1]]
monthly_charges
#df_churn.set_value(488, 'TotalCharges', 52.555)
#df_churn.iloc[488]


Out[38]:
[52.549999999999997,
 20.25,
 80.849999999999994,
 25.75,
 56.049999999999997,
 19.850000000000001,
 25.350000000000001,
 20.0,
 19.699999999999999,
 73.349999999999994,
 61.899999999999999]

In [39]:
# For all customers whose tenure is 0 months, set the TotalCharges equal to the MonthlyCharges
# This is the new dataset
[df_churn.set_value(loc, 'TotalCharges', df_churn.iloc[loc]['MonthlyCharges']) for loc in empty[2][1]]
df_churn.shape


Out[39]:
(7043, 21)

In [40]:
# Let's check if the values of TotalCharges are as they should be.
df_churn.iloc[empty[2][1]][['MonthlyCharges','TotalCharges']]


Out[40]:
MonthlyCharges TotalCharges
488 52.55 52.55
753 20.25 20.25
936 80.85 80.85
1082 25.75 25.75
1340 56.05 56.05
3331 19.85 19.85
3826 25.35 25.35
4380 20.00 20
5218 19.70 19.7
6670 73.35 73.35
6754 61.90 61.9

Exercise 2

What other checks on the data would you do to catch weird values appearing in the data?

Summary

Data exploration and visualization are useful first steps in machine learning because they help give us ideas for how to investigate the data as well as prepare the data so that its suitable for machine learning. Some important activities include:

  • Finding and handling missing data
  • Identifying skewed features that require appropriate transformation
  • Ensuring that the data is representative (which has implications for how should use the data and the kinds of inferences we can plausibly make from it)
  • Seeing if features should be combined to improve prediction
  • Normalizing or scaling features if feature values range widely (e.g., one features has values in the range 1 to 10 while another feature has values in the range 100,000 to 1,000,000).

To systematicaly explore the dataset, take the following steps:

  1. Get the data into manipulable form.
  2. Get a handle on the structure of the data.
  3. Visualize the numerical features.
  4. Visualize the relationships between numerical features.
  5. Visualize the categorical features.
  6. Visualize the relationships between categorical features.
  7. Visualize the relationships between numerical and categorical features.
  8. Find and handle missing values.

When there are too many features (columns) in the dataset to apply the methods above, there are other techiques that come into play both for visualization and for analysis. These techniques allow us to rank the importance of features so we can prioritize the ones to scrutinize. We'll get to these techiques when we study feature engineering later in the course.


In [ ]: