In [1]:
%matplotlib inline
In [2]:
import os
import sys
# Modify the path so the local yellowbrick package can be imported
sys.path.append("..")
import pandas as pd
import yellowbrick as yb
import matplotlib.pyplot as plt
The data used in this example is hosted by Kaggle at the following link: https://www.kaggle.com/joniarroba/noshowappointments/downloads/medical-appointment-no-shows.zip
The data is part of a Kaggle challenge to discover whether it is possible to predict if a patient will show up for an appointment.
The archive is downloaded, unzipped, and stored locally within a directory named data; one way to unpack it is sketched below.
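The following cell is a minimal sketch of the unpacking step, assuming the zip archive has already been downloaded manually from Kaggle and saved next to this notebook (the local archive filename is an assumption):
In [ ]:
import zipfile

# Assumed local filename for the archive downloaded from Kaggle
archive = "medical-appointment-no-shows.zip"

# Extract the CSV into the data/ directory used by the cells below
with zipfile.ZipFile(archive) as zf:
    zf.extractall("data")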
In [3]:
data = pd.read_csv("data/No-show-Issue-Comma-300k.csv")
In [4]:
data.head()
Out[4]:
In [5]:
data.columns = ['Age','Gender','Appointment Registration','Appointment Date',
'Day Of Week','Status','Diabetes','Alcoholism','Hypertension','Handicap',
'Smoker','Scholarship','Tuberculosis','SMS Reminder','Awaiting Time']
In [6]:
data.describe()
Out[6]:
In [7]:
features = ['Age','Gender','Appointment Registration','Appointment Date',
'Day Of Week','Diabetes','Alcoholism','Hypertension','Handicap',
'Smoker','Scholarship','Tuberculosis','SMS Reminder','Awaiting Time']
numerical_features = data.describe().columns.values
Feature analysis visualizers are designed to visualize instances in data space in order to detect features or targets that might impact downstream fitting. Because machine learning operates on high-dimensional data sets (usually at least 35 features), the visualizers focus on aggregation, optimization, and other techniques to give overviews of the data. The intent is that this steering process will allow the data scientist to zoom, filter, and explore the relationships between their instances and between dimensions.
At the moment we have three feature analysis visualizers implemented: Rank2D, which ranks pairs of features to surface covariance and correlation structure; RadViz, which plots instances around a circle to assess class separability; and ParallelCoordinates, which plots instances as lines across vertical axes to reveal clusters.
Feature analysis visualizers implement the Transformer API from Scikit-Learn, meaning they can be used as intermediate transform steps in a Pipeline (particularly a VisualPipeline). They are instantiated in the same way as other transformers, and then fit and transform are called on them, which draws the instances correctly. Finally show() is called, which displays the image.
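That pipeline usage is not demonstrated in this notebook; the next cell is a minimal sketch of what it could look like, assuming the visualizer passes the feature matrix through unchanged to the final estimator (the step names and the choice of LogisticRegression are arbitrary):
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from yellowbrick.features.rankd import Rank2D

# Sketch only: assuming Rank2D passes X through unchanged, calling fit on the
# pipeline draws the feature ranking and then trains the final estimator.
model = Pipeline([
    ('rank2d', Rank2D(algorithm='pearson')),
    ('clf', LogisticRegression()),
])
# model.fit(X, y) would then both draw the ranking and fit the classifier.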
In [8]:
# Feature Analysis Imports
# NOTE that all these are available for import from the `yellowbrick.features` module
from yellowbrick.features.rankd import Rank2D
from yellowbrick.features.radviz import RadViz
from yellowbrick.features.pcoords import ParallelCoordinates
Rank1D and Rank2D evaluate single features or pairs of features using a variety of metrics that score the features on the scale [-1, 1] or [0, 1], allowing them to be ranked. Similar in concept to SPLOMs, the scores are visualized on a lower-left-triangle heatmap so that patterns between pairs of features can be easily discerned for downstream analysis.
In [9]:
# To help interpret the column features being described in the visualization
pd.DataFrame(numerical_features)
Out[9]:
In [10]:
# For this visualizer numerical features are required
X = data[numerical_features].values
y = data.Status.values
# Instantiate the visualizer with the Covariance ranking algorithm
visualizer = Rank2D(features=numerical_features, algorithm='covariance')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show()                   # Draw/show the data
In [11]:
# Instantiate the visualizer with the Pearson ranking algorithm
visualizer = Rank2D(features=numerical_features, algorithm='pearson')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show()                   # Draw/show the data
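Rank1D, mentioned above, is not exercised in this notebook. The next cell is a minimal sketch of a single-feature ranking, assuming Rank1D is importable from the same module and that its default Shapiro-Wilk ranking algorithm is appropriate here:
In [ ]:
from yellowbrick.features.rankd import Rank1D

# Sketch only: rank each numerical feature on its own
visualizer = Rank1D(features=numerical_features, algorithm='shapiro')
visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.show()                   # Draw/show the data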
RadViz is a multivariate data visualization algorithm that plots each feature dimension uniformly around the circumference of a circle, then plots points on the interior of the circle such that each point balances its normalized values on the axes, pulled from the center toward each arc. This mechanism allows as many dimensions as will easily fit on a circle, greatly expanding the dimensionality of the visualization.
Data scientists use this method to detect separability between classes, e.g. is there an opportunity to learn from the feature set, or is there just too much noise?
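To make the projection concrete, the next cell computes a RadViz position by hand using the standard formulation (this is illustrative, not Yellowbrick's internal code): features are min-max scaled to [0, 1], each feature axis becomes an anchor on the unit circle, and the instance lands at the normalized weighted average of the anchors.
In [ ]:
import numpy as np

def radviz_point(x, col_min, col_max):
    """Project one instance onto the RadViz circle (standard formulation)."""
    d = len(x)
    # Min-max scale each feature to [0, 1]
    # (assumes no constant columns; a real implementation would guard the division)
    s = (x - col_min) / (col_max - col_min)
    # Place one anchor per feature evenly around the unit circle
    theta = 2 * np.pi * np.arange(d) / d
    anchors = np.column_stack([np.cos(theta), np.sin(theta)])
    # The point is the average of the anchors, weighted by the scaled values
    return (s[:, None] * anchors).sum(axis=0) / s.sum()

# Example: project the first instance of the numerical feature matrix
radviz_point(X[0], X.min(axis=0), X.max(axis=0))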
In [12]:
# Need to specify the classes of interest
classes = data.Status.unique().tolist()
# For this visualizer numerical features are required
X = data[numerical_features].values
# Additional step here: encode the categorical target as 0's and 1's
y = data.Status.replace(classes, [0, 1]).values
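Equivalently, scikit-learn's LabelEncoder could handle this encoding step; note that it assigns integers to classes in sorted order, which may differ from the ordering of classes above (shown only as an alternative):
In [ ]:
from sklearn.preprocessing import LabelEncoder

# Alternative encoding of the Status target as integer class labels
y = LabelEncoder().fit_transform(data.Status)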
In [13]:
# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=numerical_features)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show()                   # Draw/show the data
For regression, the RadViz visualizer should use a color sequence to display the target information, as opposed to discrete colors.
Parallel coordinates displays each feature as a vertical axis spaced evenly along the horizontal, and each instance as a line drawn across each of those axes. This allows many dimensions; in fact, given infinite horizontal space (e.g. a scrollbar), an infinite number of dimensions can be displayed!
Data scientists use this method to detect clusters of instances that have similar classes, and to note features that have high variance or different distributions.
In [14]:
# Instantiate the visualizer
visualizer = ParallelCoordinates(classes=classes, features=numerical_features)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.show()                   # Draw/show the data
Classification models attempt to predict a target in a discrete space, that is, to assign each instance one or more categories. Classification score visualizers display the differences between classes as well as a number of classifier-specific visual evaluations. We currently have three classifier evaluations implemented: ClassificationReport, ROCAUC, and ClassBalance (imported below).
Estimator score visualizers wrap Scikit-Learn estimators and expose the Estimator API such that they have fit(), predict(), and score() methods that call the appropriate estimator methods under the hood. Score visualizers can wrap an estimator and be passed in as the final step in a Pipeline or VisualPipeline.
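That final-step usage is not shown below; the next cell is a minimal sketch of what it could look like, assuming a score visualizer behaves like the final estimator of a standard scikit-learn Pipeline (the scaling step and step names are arbitrary):
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport

# Sketch only: the visualizer wraps the estimator and sits at the end of the
# pipeline, so fit() and score() calls flow through to it.
model = Pipeline([
    ('scale', StandardScaler()),
    ('report', ClassificationReport(GaussianNB(), classes=classes)),
])
# model.fit(X_train, y_train) followed by model.score(X_test, y_test) would
# train the wrapped estimator and draw the classification report.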
In [15]:
# Classifier Evaluation Imports
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport, ROCAUC, ClassBalance
In [16]:
# Create the train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
In [17]:
# Instantiate the classification model and visualizer
bayes = GaussianNB()
visualizer = ClassificationReport(bayes, classes=classes)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show()               # Draw/show the data
In [18]:
# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = ROCAUC(logistic)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show()               # Draw/show the data
In [19]:
# Instantiate the classification model and visualizer
forest = RandomForestClassifier()
visualizer = ClassBalance(forest, classes=classes)
visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.show()               # Draw/show the data