In [1]:
%matplotlib inline
from stemgraphic.num import density_plot
import pandas as pd
import warnings
import scipy.stats as stats
warnings.filterwarnings("ignore")
As Sherlock Holmes would say, "Data! Data! Data!"
We will grab a CSV of the Titanic dataset from R. Bilbro's Titanic repo. We use train.csv because it includes the Survived column; that is fine here, since we are only exploring the data, not training models:
In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/rebeccabilbro/titanic/master/data/train.csv', index_col='PassengerId')
In [3]:
df.head()
Out[3]:
In [4]:
df.describe(include='all')
Out[4]:
Let's look at the age density for the whole dataset. By default, density_plot draws a density curve estimate (density=True) with the area under the curve filled (density_fill=True), based on 1000 data points (display=1000) and a stem-and-leaf leaf order of 1 (leaf_order=1), which rounds or decimates values based on the calculated scale.
If you want to use full precision, set display=None and leaf_order=None. This can be expensive, depending on the number of secondary plots and the amount of data.
Let's briefly look at all the options using help:
In [5]:
help(density_plot)
Since we are passing a full dataframe, we specify the variable of interest with var. We also limit the curve to the variable's min/max (limit_var=True), since it makes no sense to look at the curve for impossible ages (i.e., less than zero):
In [6]:
density_plot(df, var='Age', limit_var=True);
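To see why limiting the curve to the observed range matters, here is a minimal sketch using `scipy.stats.gaussian_kde` on a handful of made-up ages. This is an assumption for illustration: it is not the Titanic data, and not necessarily the estimator density_plot uses internally. A Gaussian kernel estimate assigns positive density to negative ages unless the evaluation grid is clipped to the data's min/max.

```python
import numpy as np
from scipy import stats

# Made-up ages for illustration only (assumption: not taken from the dataset)
ages = np.array([0.5, 1.0, 2.0, 4.0, 22.0, 30.0, 35.0, 40.0, 58.0])

kde = stats.gaussian_kde(ages)

# Density evaluated at an impossible age: strictly positive with a Gaussian kernel
density_below_zero = kde(-1.0)[0]

# Restrict the evaluation grid to the observed range, in the spirit of limit_var=True
grid = np.linspace(ages.min(), ages.max(), 200)
limited_curve = kde(grid)

print(f"density at age -1: {density_below_zero:.4f}")
print(f"grid spans {grid[0]} to {grid[-1]}")
```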
We can also combine the density with various combinations of secondary plots. For example, histogram and box plot:
In [7]:
density_plot(df, var='Age', box=True, hist=True, limit_var=True);
Let's break this down by where the passengers embarked: Southampton (S), Cherbourg (C) and Queenstown (Q):
In [8]:
density_plot(df, var='Age', limit_var=True, hues='Embarked');
Wouldn't it be nice to have a legend that is a bit more explanatory than C, Q and S? If we prepare the data some, this is handled automatically by density_plot.
In [9]:
df2 = df.copy()
In [10]:
# survived indicator
df2.Survived = df2.Survived.astype('category')
df2.Survived = df2.Survived.cat.set_categories(['Did not survive', 'Survived'], rename=True)
In [11]:
# passenger class
df2.Pclass = df2.Pclass.astype('category')
df2.Pclass = df2.Pclass.cat.set_categories(['1st', '2nd', '3rd'], rename=True)
In [12]:
# embarked at
df2.Embarked = df2.Embarked.astype('category')
df2.Embarked = df2.Embarked.cat.set_categories(['Cherbourg', 'Queenstown', 'Southampton'], rename=True)
In [13]:
# sex is already spelled out, so we just need to convert to a category
df2.Sex = df2.Sex.astype('category')
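The renaming above relies on two things: astype('category') stores the distinct values as sorted categories, and the new labels are then applied by position. A minimal sketch on a toy Series (made-up values) makes that explicit. It uses `rename_categories`, a pandas method that does the same positional relabeling, and assigns the result back rather than modifying in place:

```python
import pandas as pd

# Toy data standing in for the Embarked column (made-up values)
ports = pd.Series(['S', 'C', 'Q', 'S', 'C'])

# astype('category') stores the distinct values as sorted categories: C, Q, S
ports = ports.astype('category')
original = list(ports.cat.categories)

# rename_categories relabels by position, so the new labels must follow that sorted order
ports = ports.cat.rename_categories(['Cherbourg', 'Queenstown', 'Southampton'])

print(original)
print(ports.tolist())
```

Listing the new labels in any other order would silently mislabel the ports, so it is worth checking the sorted categories before renaming.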
In [14]:
df2.head()
Out[14]:
Now we are ready to try our previous plot, but with categoricals. We'll also add a title:
In [15]:
density_plot(df2, var='Age', limit_var=True, hues='Embarked', title='Passenger Age by port of origin');
Let's look at the fare. With monetary values, box plots are useful alongside stem-and-leaf plots to get a better feel for the data. Pricing often follows a lognormal or a double gamma distribution. Let's check both.
In [16]:
density_plot(df2, var='Fare', box=True, fit=stats.dgamma, limit_var=True, title='Fare distribution vs double gamma');
In [17]:
density_plot(df2, var='Fare', box=True, fit=stats.lognorm, limit_var=True, title='Fare distribution vs log normal');
Either would work for a simulation, but it looks like a log normal distribution is a bit closer to reality.
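The visual comparison can be backed by a number. The sketch below fits both candidate distributions with `scipy.stats` and compares their log-likelihoods. It uses synthetic lognormal data (an assumption, to keep the example self-contained), so the result says nothing about the Titanic fares themselves. Applying it to the real fares would require handling any zero fares first, since a lognormal with loc=0 has no density at zero.

```python
import numpy as np
from scipy import stats

# Synthetic, strictly positive "fares" (assumption: illustration only, not the real data)
rng = np.random.default_rng(0)
fares = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

# Fit each candidate by maximum likelihood; loc is pinned to 0 for the lognormal
lognorm_params = stats.lognorm.fit(fares, floc=0)
dgamma_params = stats.dgamma.fit(fares)

# Total log-likelihood of the data under each fitted distribution (higher is better)
ll_lognorm = stats.lognorm.logpdf(fares, *lognorm_params).sum()
ll_dgamma = stats.dgamma.logpdf(fares, *dgamma_params).sum()

print(f"lognorm log-likelihood: {ll_lognorm:.1f}")
print(f"dgamma  log-likelihood: {ll_dgamma:.1f}")
```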
In [18]:
density_plot(df, var='Fare', density=True, hist=True, strip=True, jitter=True);
Multivariate density + hist + boxplot (note the boxplot is for the overall distribution, and in a different color):
In [19]:
density_plot(df2, var='Age', box=True, hist=True, limit_var=True, hues='Pclass');
Multivariate density + swarm:
In [20]:
density_plot(df2, var='Age', swarm=True, limit_var=True, hues='Pclass');
Using the kind keyword to combine multiple plot selections as a string:
In [21]:
density_plot(df2, var='Fare', kind='hist+rug+violin', limit_var=True, hues='Embarked');
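The kind string reads as a `+`-separated list of the boolean options used earlier (hist, rug, violin, and so on). A minimal sketch of that idea follows; `kind_to_flags` is a hypothetical helper for illustration, not stemgraphic's actual parsing code:

```python
def kind_to_flags(kind):
    """Turn 'hist+rug+violin' into {'hist': True, 'rug': True, 'violin': True}.

    Hypothetical helper: illustrates the idea only, not stemgraphic's own code.
    """
    return {name.strip(): True for name in kind.split('+') if name.strip()}


flags = kind_to_flags('hist+rug+violin')
print(flags)
```

Under that reading, the call above would behave like passing hist=True, rug=True and violin=True, which is an assumption worth checking against help(density_plot).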
Quick loop over the ports of origin, plotting passenger age with passenger class as hue:
In [22]:
for port in df2.Embarked.cat.categories:
    density_plot(df2[df2.Embarked == port], var='Age', box=True, title=port,
                 limit_var=True, hues='Pclass');