Unlocking the Black Box: How to Visualize the Data Science Project Pipeline with the Yellowbrick Library

No matter whether you are a novice data scientist or a well-seasoned professional who has been working in the field for a long time, you have most likely faced the challenge of interpreting results generated at some stage of the data science pipeline, be it data ingestion and wrangling, feature selection or model evaluation. The issue becomes even more prominent when you need to present interim findings to a group of stakeholders or clients. How do you then deal with the long arrays of numbers, scientific notation and formulas that tell the story of your dataset? That's when a visualization library like Yellowbrick becomes an essential tool in the arsenal of any data scientist: it supports that endeavour by providing interpretable and comprehensive visualizations for every stage of a project pipeline.

Introduction

In this post we will explain how to integrate a visualization step into each stage of your project without having to create customized and time-consuming charts, while still getting the benefit of drawing the necessary insights from the data you are working with. Because, let's agree, unlike computers, the human eye perceives a graphical representation of information far better than it does bits and digits. The Yellowbrick machine learning visualization library serves exactly that purpose: to "create publication-ready figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models and assist in diagnosing problems throughout the machine learning workflow" ( http://www.scikit-yb.org/en/latest/about.html ).

For the purpose of this exercise we will be using the Absenteeism at Work dataset from the UCI Machine Learning Repository ( https://archive.ics.uci.edu/ml/machine-learning-databases/00445/ ). This dataset contains a mix of continuous, binary and hierarchical features, along with a continuous target representing the number of hours an employee has been absent from work. Such a variety in the data makes for an interesting wrangling, feature selection and model evaluation task, the results of which we will make sure to visualize along the way.

To begin, we will need to pip install and import the Yellowbrick Python library. To do that, simply run the following command from your command line: $ pip install yellowbrick

Once that's done, let's import Yellowbrick into the Jupyter Notebook along with the other essential packages, libraries and user-preference setup.


In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
from cycler import cycler
import matplotlib.style
import matplotlib as mpl
mpl.style.use('seaborn-white')
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from sklearn.cluster import KMeans
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier, RandomTreesEmbedding, GradientBoostingClassifier
import warnings
warnings.filterwarnings("ignore")
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split as tts
from sklearn.metrics import roc_curve
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from yellowbrick.features import Rank1D
from yellowbrick.features import Rank2D
from yellowbrick.classifier import ClassBalance
from yellowbrick.model_selection import LearningCurve
from yellowbrick.model_selection import ValidationCurve
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ClassificationReport
from yellowbrick.features.importances import FeatureImportances

Data Ingestion and Wrangling

Now we are ready to download the zipped archive containing the dataset directly from the UCI Machine Learning Repository and extract the data file. To perform this step, we will be using the urllib.request module, which helps with opening URLs (mostly HTTP) in a complex world.


In [2]:
import urllib.request

print('Beginning file download...')

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00445/Absenteeism_at_work_AAA.zip'  

# Specify the path to the folder you want the archive to be stored in,
# e.g. '/Users/Yara/Downloads/Absenteeism_at_work_AAA.zip'
urllib.request.urlretrieve(url, '/Users/Yara/Downloads/Absenteeism_at_work_AAA.zip')


Beginning file download...
Out[2]:
('/Users/Yara/Downloads/Absenteeism_at_work.zip',
 <http.client.HTTPMessage at 0x20445c48550>)

Next, unzip the archive and extract the CSV data file which we will be using. The zipfile module does that flawlessly.


In [4]:
import zipfile
         
fantasy_zip = zipfile.ZipFile('C:\\Users\\Yara\\Downloads\\Absenteeism_at_work_AAA.zip')
fantasy_zip.extract('Absenteeism_at_work.csv', 'C:\\Users\\Yara\\Downloads')
 
fantasy_zip.close()

Now load the data from the extracted CSV file, adjusting the path to wherever you stored it.


In [5]:
dataset = pd.read_csv('C:\\Users\\Yara\\Downloads\\Absenteeism_at_work.csv', delimiter=';')

Let's take a look at a couple of randomly selected rows from the loaded data set.


In [6]:
dataset.sample(10)


Out[6]:
ID Reason for absence Month of absence Day of the week Seasons Transportation expense Distance from Residence to Work Service time Age Work load Average/day ... Disciplinary failure Education Son Social drinker Social smoker Pet Weight Height Body mass index Absenteeism time in hours
208 28 19 5 3 3 225 26 9 28 378.884 ... 0 1 1 0 0 2 69 169 24 8
540 10 22 11 3 4 361 52 3 28 268.519 ... 0 1 1 1 0 4 80 172 27 8
732 10 22 7 4 1 361 52 3 28 264.604 ... 0 1 1 1 0 4 80 172 27 8
524 28 13 10 5 4 225 26 9 28 284.853 ... 0 1 1 0 0 2 69 169 24 1
629 10 19 3 2 2 361 52 3 28 222.196 ... 0 1 1 1 0 4 80 172 27 8
277 19 0 9 3 1 291 50 12 32 294.217 ... 1 1 0 1 0 0 65 169 23 0
136 11 22 1 5 2 289 36 13 33 308.593 ... 0 1 2 1 0 1 90 172 30 3
609 25 25 2 2 2 235 16 8 32 264.249 ... 0 3 0 0 0 0 75 178 25 3
460 22 23 7 5 1 179 26 9 30 230.290 ... 0 3 0 0 0 0 56 171 19 2
294 33 0 10 6 4 248 25 14 47 265.017 ... 1 1 2 0 0 1 86 165 32 0

10 rows × 21 columns


In [7]:
dataset.ID.count()


Out[7]:
740

As we can see, the selected dataset contains 740 instances, each representing an employed individual. The features provided in the dataset are those considered to be related to the number of hours an employee was absent from work (the target). For the purpose of this exercise, we will subjectively group all instances into 3 categories, thus converting the continuous target into a categorical one. To identify appropriate bins for the target, let's look at its min, max and mean values.


In [8]:
# Getting basic statistical information for the target
print(dataset.loc[:, 'Absenteeism time in hours'].mean())
print(dataset.loc[:, 'Absenteeism time in hours'].min())
print(dataset.loc[:, 'Absenteeism time in hours'].max())


6.924324324324324
0
120

Since approximately 7 hours of absence is the average value across our dataset, it makes sense to group the records in the following manner:

1) Low rate of absence (Low), if 'Absenteeism time in hours' value is < 6;

2) Medium rate of absence (Medium), if 'Absenteeism time in hours' value is between 6 and 30;

3) High rate of absence (High), if 'Absenteeism time in hours' value is > 30.

After grouping, we will explore the data further and select the relevant features from the dataset in order to predict an absenteeism category for the instances in the test portion of the data.
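As a side note, the same binning can be expressed more compactly with the pandas cut function. The sketch below is just an equivalent alternative to the np.where approach used in the next cell; the new 'Absenteeism category' column is purely illustrative and is not used further in this post.

# Hypothetical alternative: the same three bins built with pd.cut
# (equivalent to the np.where calls in the next cell)
bins = [-1, 5, 30, dataset['Absenteeism time in hours'].max()]
dataset['Absenteeism category'] = pd.cut(
    dataset['Absenteeism time in hours'],
    bins=bins,
    labels=[1, 2, 3]  # 1 = Low, 2 = Medium, 3 = High
).astype(int)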


In [9]:
dataset['Absenteeism time in hours'] = np.where(dataset['Absenteeism time in hours'] < 6, 1, dataset['Absenteeism time in hours'])
dataset['Absenteeism time in hours'] = np.where(dataset['Absenteeism time in hours'].between(6, 30), 2, dataset['Absenteeism time in hours'])
dataset['Absenteeism time in hours'] = np.where(dataset['Absenteeism time in hours'] > 30, 3, dataset['Absenteeism time in hours'])

In [10]:
#Let's look at the data now!
dataset.head()


Out[10]:
ID Reason for absence Month of absence Day of the week Seasons Transportation expense Distance from Residence to Work Service time Age Work load Average/day ... Disciplinary failure Education Son Social drinker Social smoker Pet Weight Height Body mass index Absenteeism time in hours
0 11 26 7 3 1 289 36 13 33 239.554 ... 0 1 2 1 0 1 90 172 30 1
1 36 0 7 3 1 118 13 18 50 239.554 ... 1 1 1 1 0 0 98 178 31 1
2 3 23 7 4 1 179 51 18 38 239.554 ... 0 1 0 1 0 0 89 170 31 1
3 7 7 7 5 1 279 5 14 39 239.554 ... 0 1 2 1 1 0 68 168 24 1
4 11 23 7 5 1 289 36 13 33 239.554 ... 0 1 2 1 0 1 90 172 30 1

5 rows × 21 columns

Once the target is taken care of, it's time to look at the features. Those that store unique identifiers and/or data which might 'leak' information to the model should be dropped from the dataset. For instance, the 'Reason for absence' feature stores information 'from the future': it is highly correlated with the target, yet it will not be available in a real-world business scenario when the model is run on new data.


In [11]:
dataset = dataset.drop(['ID', 'Reason for absence'], axis=1)

In [12]:
dataset.columns


Out[12]:
Index(['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index',
       'Absenteeism time in hours'],
      dtype='object')

We are now left with a set of features and a target to use in a machine learning model of our choice. So let's separate the features from the target, splitting our dataset into a matrix of features (X) and an array of target values (y).


In [13]:
features = ['Month of absence', 'Day of the week', 'Seasons',
       'Transportation expense', 'Distance from Residence to Work',
       'Service time', 'Age', 'Work load Average/day ', 'Hit target',
       'Disciplinary failure', 'Education', 'Son', 'Social drinker',
       'Social smoker', 'Pet', 'Weight', 'Height', 'Body mass index']
target = ['Absenteeism time in hours']

In [14]:
X = dataset.drop(['Absenteeism time in hours'], axis=1)
y = dataset.loc[:, 'Absenteeism time in hours']

In [15]:
# Setting up some visual preferences prior to visualizing data
class color:
   PURPLE = '\033[95m'
   CYAN = '\033[96m'
   DARKCYAN = '\033[36m'
   BLUE = '\033[94m'
   GREEN = '\033[92m'
   YELLOW = '\033[93m'
   RED = '\033[91m'
   BOLD = '\033[1m'
   UNDERLINE = '\033[4m'
   END = '\033[0m'

Exploratory Analysis and Feature Selection

Whenever one deals with a categorical target, it is important to remember to test the dataset for a class imbalance issue. Machine learning models struggle to perform well on imbalanced data, where one class is overrepresented while another is underrepresented. While such datasets are representative of real life (e.g. no company will have the majority, or even half, of its employees missing work on a massive scale), they need to be adjusted for machine learning purposes to improve an algorithm's ability to pick up the patterns present in the data.

To check for potential class imbalance in our data, we will use the ClassBalance visualizer from Yellowbrick.


In [16]:
# Calculating population breakdown by target category
Target = y.value_counts()
print(color.BOLD, 'Low:', color.END, Target[1])
print(color.BOLD, 'Medium:', color.END, Target[2])
print(color.BOLD, 'High:', color.END, Target[3])

# Creating class labels
classes = ["Low", "Medium", "High"]

# Instantiate the classification model and visualizer
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['red', 'limegreen', 'yellow'])
forest = RandomForestClassifier()
fig, ax = plt.subplots(figsize=(10, 7))
visualizer = ClassBalance(forest, classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(axis='x')

visualizer.fit(X, y)  # Fit the training data to the visualizer
visualizer.score(X, y)  # Evaluate the model on the test data
g = visualizer.show()


 Low:  468
 Medium:  244
 High:  28

There is an obvious class imbalance here; therefore, we can expect the model to have difficulty learning the patterns for the Medium and High categories, unless the data is resampled or a class weight parameter is applied within the selected model (if the chosen algorithm allows it).
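For illustration only, here is a minimal sketch of both adjustments mentioned above: passing class_weight='balanced' to an estimator that supports it, and naively upsampling the minority classes with sklearn.utils.resample. The resulting dataset_balanced frame is hypothetical and is not used in the rest of this post.

from sklearn.utils import resample

# Option 1: let the estimator reweight classes inversely to their frequencies
forest_weighted = RandomForestClassifier(class_weight='balanced')

# Option 2: naively upsample the minority classes to the size of the majority class
majority = dataset[dataset['Absenteeism time in hours'] == 1]
balanced_parts = [majority]
for label in (2, 3):
    minority = dataset[dataset['Absenteeism time in hours'] == label]
    balanced_parts.append(
        resample(minority, replace=True, n_samples=len(majority), random_state=42))
dataset_balanced = pd.concat(balanced_parts)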

With that said, let's proceed with assessing feature importance and selecting the features that will be used further in a model of our choice. The Yellowbrick library provides a number of convenient visualizers for feature analysis, and we will use a couple of them for demonstration purposes, as well as to make sure that consistent results are returned when different methods are applied.

The Rank1D visualizer utilizes the Shapiro-Wilk algorithm, which considers only a single feature at a time and assesses the normality of the distribution of instances with respect to that feature. Let's see how it works!


In [17]:
# Creating a 1D visualizer with the Shapiro feature ranking algorithm
fig, ax = plt.subplots(figsize=(10, 7))
visualizer = Rank1D(features=features, ax=ax, algorithm='shapiro')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()
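Under the hood, Rank1D with algorithm='shapiro' scores each feature with the Shapiro-Wilk test. Assuming the visualizer's score corresponds to the Shapiro-Wilk W statistic, a quick non-Yellowbrick sanity check with scipy would look like this:

from scipy.stats import shapiro

# Shapiro-Wilk W statistic per feature, sorted from lowest to highest
shapiro_scores = {col: shapiro(X[col])[0] for col in X.columns}
for col, w in sorted(shapiro_scores.items(), key=lambda item: item[1]):
    print(f'{col:35s} {w:.3f}')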


The Rank2D visualizer, in its turn, utilizes a ranking algorithm that considers pairs of features at a time, and lets the user select the ranking algorithm of their choice. We are going to experiment with covariance and Pearson correlation, and compare the results.


In [18]:
# Instantiate visualizer using covariance ranking algorithm
figsize=(10, 7)
fig, ax = plt.subplots(figsize=figsize)
visualizer = Rank2D(features=features, ax=ax, algorithm='covariance', colormap='summer')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()



In [19]:
# Instantiate visualizer using Pearson ranking algorithm
figsize=(10, 7)
fig, ax = plt.subplots(figsize=figsize)
visualizer = Rank2D(features=features, ax=ax, algorithm='pearson', colormap='winter')
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()


A visual representation of feature correlation makes it much easier to spot pairs of features with high or low correlation coefficients. For instance, lighter colours on both plots indicate strong correlation between such pairs of features as 'Body mass index' and 'Weight', or 'Seasons' and 'Month of absence'.
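The same observation can be cross-checked numerically without Yellowbrick; here is a small sketch using the pandas corr method to list the most strongly correlated feature pairs:

# Cross-check: largest absolute Pearson correlations between distinct feature pairs
corr = X.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
print(upper.stack().abs().sort_values(ascending=False).head(5))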

Another way of estimating feature importance relative to the model is to rank the features by the model's feature_importances_ attribute (or, for linear models, its coef_ attribute) once the data has been fitted. The Yellowbrick FeatureImportances visualizer utilizes these attributes to rank and plot the features' relative importances. Let's look at how this approach works with the Ridge, Lasso and ElasticNet models.


In [20]:
# Visualizing Ridge, Lasso and ElasticNet feature selection models side by side for comparison

# Ridge
# Create a new figure
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['red'])
fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(311)
labels = features
viz = FeatureImportances(Ridge(alpha=0.1), ax=ax, labels=labels, relative=False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

# Fit and display
viz.fit(X, y)
viz.show()

# ElasticNet
# Create a new figure
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['salmon'])
fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(312)
labels = features
viz = FeatureImportances(ElasticNet(alpha=0.01), ax=ax, labels=labels, relative=False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

# Fit and display
viz.fit(X, y)
viz.show()

# Lasso
# Create a new figure
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['purple'])
fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(313)
labels = features
viz = FeatureImportances(Lasso(alpha=0.01), ax=ax, labels=labels, relative=False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

# Fit and display
viz.fit(X, y)
viz.show()
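For a non-graphical cross-check, the raw values behind one of these plots can be pulled directly from the fitted estimator (for linear models like these, FeatureImportances falls back to the coef_ attribute). A minimal sketch for the Lasso model:

# Cross-check: raw Lasso coefficients behind the FeatureImportances plot above
lasso = Lasso(alpha=0.01).fit(X, y)
coef = pd.Series(lasso.coef_, index=X.columns)
print(coef.reindex(coef.abs().sort_values(ascending=False).index))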


Having analyzed the output of all the visualizations we used (the Shapiro algorithm, Pearson correlation ranking, covariance ranking, Lasso, Ridge and ElasticNet), we can now select the set of features with meaningful coefficient values (positive or negative). These are the features to be kept in the model:

  • Disciplinary failure
  • Day of the week
  • Seasons
  • Distance from Residence to Work
  • Number of children (Son)
  • Social drinker
  • Social smoker
  • Height
  • Weight
  • BMI
  • Pet
  • Month of absence

A graphic visualization of the feature coefficients, calculated in a number of different ways, significantly simplifies the feature selection process, as it provides an easy way to visually compare multiple values and keep only those which are meaningful to the model.

Now let's drop the features which didn't make the cut and proceed with creating models.


In [21]:
# Dropping features from X based on the feature importance visualizations above
X = X.drop(['Transportation expense', 'Age', 'Service time', 'Hit target', 'Education', 'Work load Average/day '], axis=1)

Some of the features that will be used further in the modeling stage are categorical (hierarchical) rather than binary and will require encoding. Let's look at the top few rows to see if we have any of those.


In [22]:
X.head()


Out[22]:
Month of absence Day of the week Seasons Distance from Residence to Work Disciplinary failure Son Social drinker Social smoker Pet Weight Height Body mass index
0 7 3 1 36 0 2 1 0 1 90 172 30
1 7 3 1 13 1 1 1 0 0 98 178 31
2 7 4 1 51 0 0 1 0 0 89 170 31
3 7 5 1 5 0 2 1 1 0 68 168 24
4 7 5 1 36 0 2 1 0 1 90 172 30

Looks like 'Month of absence', 'Day of the week' and 'Seasons' are not binary. Therefore, we'll use the pandas get_dummies function to encode them.


In [23]:
# Encoding some categorical features
X = pd.get_dummies(data=X, columns=['Month of absence', 'Day of the week', 'Seasons'])

In [24]:
X.head()


Out[24]:
Distance from Residence to Work Disciplinary failure Son Social drinker Social smoker Pet Weight Height Body mass index Month of absence_0 ... Month of absence_12 Day of the week_2 Day of the week_3 Day of the week_4 Day of the week_5 Day of the week_6 Seasons_1 Seasons_2 Seasons_3 Seasons_4
0 36 0 2 1 0 1 90 172 30 0 ... 0 0 1 0 0 0 1 0 0 0
1 13 1 1 1 0 0 98 178 31 0 ... 0 0 1 0 0 0 1 0 0 0
2 51 0 0 1 0 0 89 170 31 0 ... 0 0 0 1 0 0 1 0 0 0
3 5 0 2 1 1 0 68 168 24 0 ... 0 0 0 0 1 0 1 0 0 0
4 36 0 2 1 0 1 90 172 30 0 ... 0 0 0 0 1 0 1 0 0 0

5 rows × 31 columns


In [25]:
print(X.columns)


Index(['Distance from Residence to Work', 'Disciplinary failure', 'Son',
       'Social drinker', 'Social smoker', 'Pet', 'Weight', 'Height',
       'Body mass index', 'Month of absence_0', 'Month of absence_1',
       'Month of absence_2', 'Month of absence_3', 'Month of absence_4',
       'Month of absence_5', 'Month of absence_6', 'Month of absence_7',
       'Month of absence_8', 'Month of absence_9', 'Month of absence_10',
       'Month of absence_11', 'Month of absence_12', 'Day of the week_2',
       'Day of the week_3', 'Day of the week_4', 'Day of the week_5',
       'Day of the week_6', 'Seasons_1', 'Seasons_2', 'Seasons_3',
       'Seasons_4'],
      dtype='object')

Model Evaluation and Selection

Our matrix of features X is now ready to be fitted to a model, but first we need to split the data into train and test portions for further model validation.


In [26]:
# Perform 80/20 training/test split
X_train, X_test, y_train, y_test = tts(X, y, test_size=0.20, random_state=42)

For the purpose of model evaluation and selection we will be using Yellowbrick's ClassificationReport visualizer, which displays the precision, recall, F1 and support scores for the model. In order to support easier interpretation and problem detection, the report integrates numerical scores with a colour-coded heatmap. All heatmaps are normalized to the range from 0 to 1 to facilitate easy comparison of classification models across different classification reports.
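The same metrics can also be produced as plain text with scikit-learn's classification_report, which we imported earlier. A minimal sketch, here evaluated on the held-out 20% split for comparison (the choice of RandomForestClassifier is illustrative):

# Plain-text counterpart of the visual report, evaluated on the held-out test split
model = RandomForestClassifier(class_weight='balanced')
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test),
                            target_names=['Low', 'Medium', 'High']))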


In [27]:
# Creating a function to visualize estimators
def visual_model_selection(X, y, estimator):
    visualizer = ClassificationReport(estimator, classes=['Low', 'Medium', 'High'], cmap='PRGn')
    visualizer.fit(X, y)  
    visualizer.score(X, y)
    visualizer.show()

In [28]:
visual_model_selection(X, y, BaggingClassifier())



In [29]:
visual_model_selection(X, y, LogisticRegression(class_weight='balanced'))



In [30]:
visual_model_selection(X, y, KNeighborsClassifier())



In [31]:
visual_model_selection(X, y, RandomForestClassifier(class_weight='balanced'))



In [32]:
visual_model_selection(X, y, ExtraTreesClassifier(class_weight='balanced'))


For the purposes of this exercise we will consider the F1 score when estimating the models' performance and making a selection. The classification reports above make it clear that the tree-based ensemble algorithms performed best. We need to pay special attention to the F1 scores for the underrepresented classes, 'Medium' and 'High', as they contain significantly fewer instances than the 'Low' class. The high F1 scores across all three classes therefore indicate very strong performance by the following models: BaggingClassifier, RandomForestClassifier and ExtraTreesClassifier.

We will also use the ClassPredictionError visualizer for these models to confirm their strong performance.


In [33]:
# Visualizing class prediction error for the Bagging Classifier model
classes = ['Low', 'Medium', 'High']

mpl.rcParams['axes.prop_cycle'] = cycler('color', ['turquoise', 'cyan', 'teal', 'coral', 'blue', 'lime', 'lavender', 'lightblue', 'darkgreen', 'tan', 'salmon', 'gold', 'darkred', 'darkblue'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(311)
visualizer = ClassPredictionError(BaggingClassifier(), classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.show()

# Visualizing class prediction error for the Random Forest Classifier model
classes = ['Low', 'Medium', 'High']

mpl.rcParams['axes.prop_cycle'] = cycler('color', ['coral', 'tan', 'darkred'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(312)
visualizer = ClassPredictionError(RandomForestClassifier(class_weight='balanced'), classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.show()

# Visualizing class prediction error for the Extra Trees Classifier model
classes = ['Low', 'Medium', 'High']

mpl.rcParams['axes.prop_cycle'] = cycler('color', ['limegreen', 'yellow', 'orange'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(313)
visualizer = ClassPredictionError(ExtraTreesClassifier(class_weight='balanced'), classes=classes, ax=ax)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.grid(False)

visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
g = visualizer.show()


Model Optimization

Now we can conclude that the ExtraTreesClassifier seems to perform best, as it had no instances from the 'High' class reported under the 'Low' class.

However, decision trees become more prone to overfitting the deeper they grow, because at each level of the tree the partitions deal with a smaller subset of the data. One way to avoid overfitting is to adjust the depth of the tree. Yellowbrick's ValidationCurve visualizer lets us explore the relationship between the max_depth parameter and a cross-validated score; here we use accuracy with 3-fold cross-validation.

So let's proceed with hyperparameter tuning for our selected ExtraTreesClassifier model using the ValidationCurve visualizer!


In [37]:
# Performing Hyperparameter tuning 
# Validation Curve
mpl.rcParams['axes.prop_cycle'] = cycler('color', ['purple', 'darkblue'])

fig = plt.gcf()
fig.set_size_inches(10,10)
ax = plt.subplot(411)
viz = ValidationCurve(ExtraTreesClassifier(class_weight='balanced'), ax=ax, param_name="max_depth", param_range=np.arange(1, 11), cv=3, scoring="accuracy")

# Fit and show the visualizer
viz.fit(X, y)
viz.show()


We can observe on the above chart that even though the training score keeps rising continuously, the cross-validation score drops after max_depth=7. Therefore, we will choose that value for our selected model to optimize its performance.
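As a quick cross-check on that choice (a sketch only, not part of the original analysis; given the randomness of the trees it may not land on exactly the same depth), a small grid search over max_depth with the same scoring and cross-validation settings would look like this:

from sklearn.model_selection import GridSearchCV

# Hypothetical cross-check of the max_depth value suggested by the validation curve
search = GridSearchCV(ExtraTreesClassifier(class_weight='balanced'),
                      param_grid={'max_depth': np.arange(1, 11)},
                      scoring='accuracy', cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))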


In [38]:
visual_model_selection(X, y, ExtraTreesClassifier(class_weight='balanced', max_depth=7))


Conclusions

As we demonstrated in this article, visualization techniques prove to be a useful tool in the machine learning toolkit, and Yellowbrick provides a wide selection of visualizers to meet the needs of every step and stage of the data science project pipeline. From feature analysis and selection to model selection and optimization, Yellowbrick visualizers make it easy to decide which features to keep in the model, which model performs best, and how to tune the model's hyperparameters to achieve optimal performance. Moreover, visualizing algorithmic output makes it easier to present insights to an audience and stakeholders, and contributes to the interpretability of machine learning results.