In [1]:
%matplotlib inline

Yellowbrick Examples

This notebook is a sample of the examples that Yellowbrick provides.


In [2]:
import os
import sys 

# Modify the path 
sys.path.append("..")

import pandas as pd
import yellowbrick as yb 
import matplotlib.pyplot as plt

Load Medical Appointment Data

The data used in this example is hosted by Kaggle at the following link: https://www.kaggle.com/joniarroba/noshowappointments/downloads/medical-appointment-no-shows.zip

The data is part of a Kaggle challenge to discover whether it is possible to predict if a patient will show up for an appointment.

The data is downloaded, unzipped, and stored locally within a directory named data.
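Kaggle requires a signed-in download, so the archive is fetched manually. A minimal sketch of the unzip step, assuming the downloaded file is named medical-appointment-no-shows.zip and sits next to this notebook (both the filename and location are assumptions):

import os
import zipfile

# Extract the manually downloaded Kaggle archive into a local "data" directory
# so that the read_csv call below can find the CSV file.
os.makedirs("data", exist_ok=True)
with zipfile.ZipFile("medical-appointment-no-shows.zip") as archive:
    archive.extractall("data")

print(os.listdir("data"))   # expect to see No-show-Issue-Comma-300k.csv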


In [3]:
data = pd.read_csv("data/No-show-Issue-Comma-300k.csv")

In [4]:
data.head()


Out[4]:
Age Gender AppointmentRegistration ApointmentData DayOfTheWeek Status Diabetes Alcoolism HiperTension Handcap Smokes Scholarship Tuberculosis Sms_Reminder AwaitingTime
0 19 M 2014-12-16T14:46:25Z 2015-01-14T00:00:00Z Wednesday Show-Up 0 0 0 0 0 0 0 0 -29
1 24 F 2015-08-18T07:01:26Z 2015-08-19T00:00:00Z Wednesday Show-Up 0 0 0 0 0 0 0 0 -1
2 4 F 2014-02-17T12:53:46Z 2014-02-18T00:00:00Z Tuesday Show-Up 0 0 0 0 0 0 0 0 -1
3 5 M 2014-07-23T17:02:11Z 2014-08-07T00:00:00Z Thursday Show-Up 0 0 0 0 0 0 0 1 -15
4 38 M 2015-10-21T15:20:09Z 2015-10-27T00:00:00Z Tuesday Show-Up 0 0 0 0 0 0 0 1 -6

In [5]:
data.columns = ['Age','Gender','Appointment Registration','Appointment Date',
                   'Day Of Week','Status','Diabetes','Alcoholism','Hypertension','Handicap',
                   'Smoker','Scholarship','Tuberculosis','SMS Reminder','Awaiting Time']

In [6]:
data.describe()


Out[6]:
Age Diabetes Alcoholism Hypertension Handicap Smoker Scholarship Tuberculosis SMS Reminder Awaiting Time
count 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000 300000.000000
mean 37.808017 0.077967 0.025010 0.215890 0.020523 0.052370 0.096897 0.000450 0.574173 -13.841813
std 22.809014 0.268120 0.156156 0.411439 0.155934 0.222772 0.295818 0.021208 0.499826 15.687697
min -2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -398.000000
25% 19.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -20.000000
50% 38.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 -8.000000
75% 56.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 -4.000000
max 113.000000 1.000000 1.000000 1.000000 4.000000 1.000000 1.000000 1.000000 2.000000 -1.000000

In [7]:
features = ['Age','Gender','Appointment Registration','Appointment Date',
            'Day Of Week','Diabetes','Alcoholism','Hypertension','Handicap',
            'Smoker','Scholarship','Tuberculosis','SMS Reminder','Awaiting Time']

numerical_features = data.describe().columns.values

Feature Analysis

Feature analysis visualizers are designed to visualize instances in data space in order to detect features or targets that might impact downstream fitting. Because machine learning operates on high-dimensional data sets (usually at least 35 dimensions), the visualizers focus on aggregation, optimization, and other techniques to give overviews of the data. It is our intent that the steering process will allow the data scientist to zoom, filter, and explore the relationships between their instances and between dimensions.

At the moment we have three feature analysis visualizers implemented:

  • Rank2D: rank pairs of features to detect covariance
  • RadViz: plot data points along axes ordered around a circle to detect separability
  • Parallel Coordinates: plot instances as lines along vertical axes to detect clusters

Feature analysis visualizers implement the Transformer API from Scikit-Learn, meaning they can be used as intermediate transform steps in a Pipeline (particularly a VisualPipeline). They are instantiated in the same way, and then fit and transform are called on them, which draws the instances. Finally, show is called, which displays the image.
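As a sketch of the Pipeline usage described above (not run in this notebook), a feature visualizer can sit in front of an estimator as a pass-through transform step; this assumes the classes, X, and y prepared in the RadViz section below and the numerical_features defined above:

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import GaussianNB
from yellowbrick.features.radviz import RadViz

# The visualizer draws during fit/transform and passes X on to the
# downstream estimator.
model = Pipeline([
    ('viz', RadViz(classes=classes, features=numerical_features)),
    ('clf', GaussianNB()),
])

model.fit(X, y)
model.named_steps['viz'].show()   # Display the RadViz plot drawn during fitting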


In [8]:
# Feature Analysis Imports 
# NOTE that all these are available for import from the `yellowbrick.features` module 
from yellowbrick.features.rankd import Rank2D 
from yellowbrick.features.radviz import RadViz 
from yellowbrick.features.pcoords import ParallelCoordinates

Rank2D

Rank1D and Rank2D evaluate single features or pairs of features using a variety of metrics that score the features on the scale [-1, 1] or [0, 1], allowing them to be ranked. Similar in concept to SPLOMs, the scores are visualized on a lower-left triangle heatmap so that patterns between pairs of features can be easily discerned for downstream analysis.
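Rank1D is mentioned above but not demonstrated below; a minimal sketch of the single-feature variant, assuming the same X, y, and numerical_features prepared in the next cells and the Shapiro-Wilk ranking algorithm:

from yellowbrick.features.rankd import Rank1D

# Rank each numerical feature on its own and draw the ranks as a bar chart.
visualizer = Rank1D(features=numerical_features, algorithm='shapiro')

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.show()                   # Draw/show the data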


In [9]:
# To help interpret the column features being described in the visualization
pd.DataFrame(numerical_features)


Out[9]:
0
0 Age
1 Diabetes
2 Alcoholism
3 Hypertension
4 Handicap
5 Smoker
6 Scholarship
7 Tuberculosis
8 SMS Reminder
9 Awaiting Time

In [10]:
# For this visualizer numerical features are required 
X = data[numerical_features].values
y = data.Status.values

# Instantiate the visualizer with the Covariance ranking algorithm 
visualizer = Rank2D(features=numerical_features, algorithm='covariance')

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.show()                   # Draw/show the data


Diagnostic Interpretation from Rank2D(Covariance):

Some features share covariance with Age, but most of the features do not share any measurable covariance.


In [11]:
# Instantiate the visualizer with the Pearson ranking algorithm 
visualizer = Rank2D(features=numerical_features, algorithm='pearson')

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.show()                   # Draw/show the data


Diagnostic Interpretation from Rank2D(Pearson):

A few features share a positive linear relationship, mostly with Age and to a lesser extent with Diabetes, but most of the features do not demonstrate a relationship.

RadViz

RadViz is a multivariate data visualization algorithm that plots each feature dimension uniformly around the circumference of a circle, then plots points on the interior of the circle such that the point normalizes its values on the axes from the center to each arc. This mechanism allows as many dimensions as will easily fit on a circle, greatly expanding the dimensionality of the visualization.

Data scientists use this method to detect separability between classes, e.g. is there an opportunity to learn from the feature set or is there just too much noise?


In [12]:
# Need to specify the classes of interest
classes = data.Status.unique().tolist()

# For this visualizer numerical features are required
X = data[numerical_features].values
# Additional step here of converting the categorical target to 0's and 1's
y = data.Status.replace(classes, [0, 1]).values

In [13]:
# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=numerical_features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show the data


For regression, the RadViz visualizer should use a color sequence to display the target information, as opposed to discrete colors.

Diagnostic Interpretation from RadViz:

It doesn't appear from this visual that there is much differentiation between the classes. Interestingly, the plot does suggest that the other features have some relationship with Age.

Parallel Coordinates

Parallel coordinates displays each feature as a vertical axis spaced evenly along the horizontal, and each instance as a line drawn between each individual axis. This allows many dimensions; in fact given infinite horizontal space (e.g. a scrollbar) an infinite number of dimensions can be displayed!

Data scientists use this method to detect clusters of instances that have similar classes, and to note features that have high variance or different distributions.


In [14]:
# Instantiate the visualizer
visualizer = ParallelCoordinates(classes=classes, features=numerical_features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.show()         # Draw/show the data
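With 300,000 instances, drawing one line per instance can be slow and heavily overplotted. A minimal sketch of one workaround, plotting a random subsample of the rows (the sample size of 2,000 is an arbitrary choice):

import numpy as np

# Choose a random subset of instances to keep the plot fast and readable.
idx = np.random.choice(len(X), size=2000, replace=False)

visualizer = ParallelCoordinates(classes=classes, features=numerical_features)

visualizer.fit(X[idx], y[idx])      # Fit the sampled data to the visualizer
visualizer.transform(X[idx])        # Transform the sampled data
visualizer.show()                   # Draw/show the data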


Classifier Evaluation

Classification models attempt to predict a target in a discrete space, that is, assign an instance to one or more discrete categories. Classification score visualizers display the differences between classes as well as a number of classifier-specific visual evaluations. We currently have implemented three classifier evaluations:

  • ClassificationReport: Presents the precision, recall, and F1 scores of the classifier as a heatmap
  • ROCAUC: Presents the graph of receiver operating characteristics along with area under the curve
  • ClassBalance: Displays the difference between the class balances and support

Estimator score visualizers wrap Scikit-Learn estimators and expose the Estimator API such that they have fit(), predict(), and score() methods that call the appropriate estimator methods under the hood. Score visualizers can wrap an estimator and be passed in as the final step in a Pipeline or VisualPipeline.
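As a sketch of that usage (not run in this notebook), a score visualizer can serve as the final step of a Pipeline; the StandardScaler step here is added purely for illustration, and the X_train, y_train, X_test, and y_test splits are the ones created below:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport

# The visualizer wraps the estimator, forwarding fit and score to it while
# collecting what it needs to draw.
model = Pipeline([
    ('scale', StandardScaler()),
    ('viz', ClassificationReport(GaussianNB(), classes=classes)),
])

model.fit(X_train, y_train)
model.score(X_test, y_test)
model.named_steps['viz'].show()   # Display the classification report heatmap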


In [15]:
# Classifier Evaluation Imports 

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from yellowbrick.classifier import ClassificationReport, ROCAUC, ClassBalance

Classification Report

The classification report visualizer displays the precision, recall, and F1 scores for the model. It integrates numerical scores with a color-coded heatmap for easy interpretation and detection.


In [16]:
# Create the train and test data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [17]:
# Instantiate the classification model and visualizer 
bayes = GaussianNB()
visualizer = ClassificationReport(bayes, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data 
g = visualizer.show()             # Draw/show the data


ROCAUC

Plot the ROC to visualize the tradeoff between the classifier's sensitivity and specificity.


In [18]:
# Instantiate the classification model and visualizer 
logistic = LogisticRegression()
visualizer = ROCAUC(logistic)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data 
g = visualizer.show()             # Draw/show the data


ClassBalance

Class balance chart that shows the support for each class in the fitted classification model.


In [19]:
# Instantiate the classification model and visualizer 
forest = RandomForestClassifier()
visualizer = ClassBalance(forest, classes=classes)

visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer.score(X_test, y_test)  # Evaluate the model on the test data 
g = visualizer.show()             # Draw/show the data