Here I introduce pairplotr, a tool I developed to do pairwise plots of features, including mixtures of numerical and categorical ones, starting from a cleaned Pandas dataframe with neither missing data nor data id columns.
This demo imports an already cleaned Titanic dataset and demonstrates certain features of pairplotr.
Plot details vary according to whether they are on- or off-diagonal and whether the intersecting rows and columns correspond to numerical or categorical variables.
All descriptions assume the first row/column has index 1.
Here's a description of the types of subplot encountered:
In [2]:
%matplotlib inline
import sys
import pairplotr.pairplotr as ppr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]:
df = pd.read_pickle('trimmed_titanic_data.pkl')
In [4]:
df.info()
Note that the data has no missing values. This is required for the current version of pairplotr.
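As a minimal sketch of how this requirement could be checked before plotting, the snippet below counts missing values per column with plain pandas. The `toy` frame and its values are made up for illustration and are not the Titanic data; dropping rows is only one possible cleanup, and imputation may suit your data better.

```python
import pandas as pd

# Toy frame standing in for a dataset; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Fare": [7.25, 71.28, 8.05],
})

# Count missing values per column; every count should be zero before plotting.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# One possible cleanup: drop incomplete rows (imputation is another option).
complete = toy.dropna()
print(complete.isnull().values.any())
```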
In [5]:
df.head(10)
Additionally, the data must have no fields that could be considered an ID. For instance, the Titanic survival dataset had a PassengerId field that I removed. An ID-like field, once treated as categorical, has nearly as many distinct values as there are rows, which slows the code to a crawl.
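One rough way to spot ID-like fields is to look for columns in which every value is distinct. The helper `id_like_columns` below is hypothetical (it is not part of pairplotr) and the `toy` frame is invented. Continuous numerical columns can also be all-distinct, so treat the result as a list of candidates to review rather than columns to drop blindly.

```python
import pandas as pd

# Toy frame with an obvious ID column; the values are invented for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

def id_like_columns(frame):
    """Return the columns whose values are all distinct (hypothetical helper)."""
    return [col for col in frame.columns if frame[col].nunique() == len(frame)]

# Review the candidates, then drop the ones that really are identifiers.
candidates = id_like_columns(toy)
trimmed = toy.drop(columns=candidates)
print(candidates)
print(list(trimmed.columns))
```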
In [6]:
visualize_df = df.copy()
categorical_features = ['Survived','Pclass','Sex','Embarked','Title','Parch','SibSp']
for feature in categorical_features:
    visualize_df[feature] = visualize_df[feature].astype('category')
visualize_df.info()
Note that Parch and SibSp are numerical, though I find it easier to visualize them as categories because they take so few distinct values (max 8).
Now that the desired features have been cast to the category dtype, we can move on to graphing the pair plot.
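If you are unsure which numerical features are good candidates for this treatment, counting distinct values per column is a reasonable starting point. The sketch below uses an invented `toy` frame and an arbitrary threshold, `max_levels`, which is my own assumption and not a pairplotr setting.

```python
import pandas as pd

# Toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "SibSp": [1, 0, 0, 1, 3],
    "Parch": [0, 0, 2, 0, 1],
    "Age": [22.0, 38.0, 26.0, 35.0, 2.0],
})

# Arbitrary cutoff: columns with fewer distinct values are treated as categories.
max_levels = 4
low_cardinality = [col for col in toy.columns if toy[col].nunique() < max_levels]

# Cast each selected column, leaving the rest untouched.
converted = toy.copy()
for col in low_cardinality:
    converted[col] = converted[col].astype("category")

print(low_cardinality)
print(converted.dtypes)
```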
In [7]:
%%time
ppr.compare_data(visualize_df,fig_size=16)
We can also select specific features to graph using the plot_vars keyword argument:
In [8]:
%%time
ppr.compare_data(visualize_df,fig_size=16,plot_vars=['Survived','Sex','Pclass','Age','Fare'])
We can zoom in on individual plots by using the zoom keyword argument:
In [9]:
%%time
ppr.compare_data(visualize_df,fig_size=16,zoom=['Sex','Pclass'])
In [10]:
%%time
ppr.compare_data(visualize_df,fig_size=16,zoom=['Pclass','Age'],plot_medians=True)
Note how there is now a scale for the Age feature and the frequencies corresponding to each bin.
This currently only works for category vs category and category vs numerical comparisons and only for different features. This will be changed soon.
Additionally, we can color the points in numerical vs numerical comparisons according to the values of a particular feature using the scatter_plot_filter keyword argument:
In [11]:
%%time
ppr.compare_data(visualize_df,fig_size=16,scatter_plot_filter='Survived')
Here is an example interpretation using pairplotr:
Row/column 1/1 shows that survival (1) and death (0) are drawn in cyan and gray, respectively.
Row/column 3/1 indicates that most women survived (I'd guess about 80%).
Row/column 3/2 indicates that more than half of all women were from Pclasses 1 and 2. This makes me curious about what characteristics women from Pclass 3 might have.
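Visual estimates like these can be cross-checked numerically. As a sketch, `pd.crosstab` with `normalize="index"` gives the fraction of each outcome within each group. The `toy` frame below is invented, so its proportions say nothing about the actual Titanic data.

```python
import pandas as pd

# Toy frame; the values are invented and do not reflect the Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female",
            "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Each row sums to 1: the share of each Survived value within that Sex.
rates = pd.crosstab(toy["Sex"], toy["Survived"], normalize="index")
female_survival = rates.loc["female", 1]
print(rates)
```

The same call with two other columns (for example Sex and Pclass) would check a category vs category cell in the same way.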
We can slice the data using normal Pandas notation and use it with pairplotr. Here's an example that investigates women from Pclass 3:
In [12]:
%%time
where = (visualize_df['Sex']=='female')&(visualize_df['Pclass']==3) # Women from Pclass 3
ppr.compare_data(visualize_df[where],scatter_plot_filter='Survived')
Row/column 1/1 automatically shows that only about half of Pclass 3 women survived.
Row/column 8/1 is interesting. It seems to indicate that most women from Embarked values Q and C survived, while the bulk of Pclass 3 women from Embarked S died.
Row/column pairs 8/5 and 8/6 seem to indicate that Embarked S had a higher concentration of passengers with larger numbers of Siblings/Spouses and Parents/Children.
Additionally, row/column pairs 5/1 and 6/1 seem to indicate that women with fewer family members aboard had a better chance of surviving. Here I zoom in on these two figures to check:
In [13]:
%%time
where = (visualize_df['Sex']=='female')&(visualize_df['Pclass']==3) # Women from Pclass 3
ppr.compare_data(visualize_df[where],scatter_plot_filter='Survived',zoom=['SibSp','Survived'])
In [14]:
%%time
where = (visualize_df['Sex']=='female')&(visualize_df['Pclass']==3) # Women from Pclass 3
ppr.compare_data(visualize_df[where],scatter_plot_filter='Survived',zoom=['Parch','Survived'])
Indeed, more than half of Pclass 3 women with no family aboard survived, while fewer than half of those with family did.
I've introduced pairplotr and shown how to set features as categorical, graph mixed numerical/categorical features, restrict the graphed features, zoom in on individual plots, and graph subsets of the data. Additionally, I demonstrated a simple interpretation of the Titanic dataset.
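A comparison like this can also be confirmed with a quick group-by. The sketch below labels each row by whether SibSp plus Parch is greater than zero and averages Survived within each group. The `toy` frame is invented, and the code assumes integer columns; features that were cast to the category dtype would need converting back (for example with `.astype(int)`) before summing or averaging.

```python
import pandas as pd

# Toy frame; the values are invented and do not reflect the Titanic data.
toy = pd.DataFrame({
    "SibSp": [0, 0, 1, 2, 0, 1],
    "Parch": [0, 0, 0, 1, 0, 2],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Label each passenger by whether any family member was aboard.
has_family = (toy["SibSp"] + toy["Parch"]) > 0
family_label = has_family.map({True: "with family", False: "alone"})

# The mean of a 0/1 column is the survival rate within each group.
survival_by_family = toy["Survived"].astype(int).groupby(family_label).mean()
print(survival_by_family)
```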
I hope you find this tool useful and please give me any suggestions for improving it.