Density plots



In [1]:

    
%matplotlib inline
from stemgraphic.num import density_plot
import pandas as pd
import warnings
import scipy.stats as stats
warnings.filterwarnings("ignore")

As Sherlock Holmes would say: "Data! Data! Data!"

We will grab a CSV of the Titanic dataset from R. Bilbro's Titanic repo. We use train.csv because it includes the Survived column; that is fine here, since we are only looking at the data, not training models:



In [2]:

    
df = pd.read_csv('https://raw.githubusercontent.com/rebeccabilbro/titanic/master/data/train.csv', index_col='PassengerId')



In [3]:

    
df.head()









    Out[3]:

PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S



In [4]:

    
df.describe(include='all')









    Out[4]:

        Survived    Pclass      Name                       Sex   Age         SibSp       Parch       Ticket    Fare        Cabin    Embarked
count   891.000000  891.000000  891                        891   714.000000  891.000000  891.000000  891       891.000000  204      889
unique  NaN         NaN         891                        2     NaN         NaN         NaN         681       NaN         147      3
top     NaN         NaN         Dodge, Master. Washington  male  NaN         NaN         NaN         CA. 2343  NaN         B96 B98  S
freq    NaN         NaN         1                          577   NaN         NaN         NaN         7         NaN         4        644
mean    0.383838    2.308642    NaN                        NaN   29.699118   0.523008    0.381594    NaN       32.204208   NaN      NaN
std     0.486592    0.836071    NaN                        NaN   14.526497   1.102743    0.806057    NaN       49.693429   NaN      NaN
min     0.000000    1.000000    NaN                        NaN   0.420000    0.000000    0.000000    NaN       0.000000    NaN      NaN
25%     0.000000    2.000000    NaN                        NaN   20.125000   0.000000    0.000000    NaN       7.910400    NaN      NaN
50%     0.000000    3.000000    NaN                        NaN   28.000000   0.000000    0.000000    NaN       14.454200   NaN      NaN
75%     1.000000    3.000000    NaN                        NaN   38.000000   1.000000    0.000000    NaN       31.000000   NaN      NaN
max     1.000000    3.000000    NaN                        NaN   80.000000   8.000000    6.000000    NaN       512.329200  NaN      NaN
Let's look at age density for the whole dataset. By default, density_plot draws a density curve estimate (density=True) with the area under the curve filled (density_fill=True). It uses at most 1000 rows (display=1000), sampling if the data has more, and a stem-and-leaf leaf order of 1 (leaf_order=1), which quantizes the values based on the calculated scale.

If you want to use full precision, set display=None and leaf_order=None. This can be expensive, depending on the number of secondary plots and the amount of data.
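To make the two kinds of loss concrete, here is a minimal sketch of decimation (sampling down to at most display rows) and quantization (dropping digits below a chosen step). The helper names decimate and quantize are hypothetical, and this is not stemgraphic's actual implementation, which derives the step from scale and leaf_order and may round differently.

```python
import math
import random


def decimate(values, display=1000, random_state=None):
    """Keep at most `display` values, sampling without replacement if needed."""
    values = list(values)
    if display is None or len(values) <= display:
        return values
    rng = random.Random(random_state)  # a seed makes the sample reproducible
    return rng.sample(values, display)


def quantize(values, step):
    """Truncate each value down to a multiple of `step` (assumed step >= 1 here)."""
    return [math.floor(v / step) * step for v in values]


sampled = decimate(range(5000), display=1000, random_state=42)
print(len(sampled))                      # number of rows kept after decimation
print(quantize([22.0, 38.0, 0.42], 10))  # ages truncated to the tens
```

Passing display=None in this sketch skips the sampling step entirely, which mirrors the "full precision" setting described above.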

Let's briefly look at all the options using help:



In [5]:

    
help(density_plot)









    



Help on function density_plot in module stemgraphic.graphic:

density_plot(df, var=None, ax=None, bins=None, box=None, density=True, density_fill=True, display=1000, fig_only=True, fit=None, hist=None, hues=None, hue_labels=None, jitter=None, kind=None, leaf_order=1, legend=True, limit_var=False, norm_hist=None, random_state=None, rug=None, scale=None, singular=True, strip=None, swarm=None, title=None, violin=None, x_min=0, x_max=None, y_axis_label=True)
    density_plot.
    
    Various density and distribution plots conveniently packaged into one function. Density plot normally forces
    tails at each end which might go beyond the data. To force min/max to be driven by the data, use limit_var.
    To specify min and max use x_min and x_max instead. Nota Bene: defaults to _decimation_ and _quantization_ mode.
    
    See density_plot notebook for examples of the different combinations of plots.
    
    Why this instead of seaborn:
    
    Stem-and-leaf plots naturally quantize data. The amount of loss is based on scale and leaf_order and on the data
    itself. This function which wraps several seaborn distribution plots was added in order to compare various
    measures of density and distributions based on various levels of decimation (sampling, set through display)
    and of quantization (set through scale and leaf_order). Also, there is no option in seaborn to fill the area
    under the curve...
    
    
    :param df: list, numpy array, time series, pandas or dask dataframe
    :param var: variable to plot, required if df is a dataframe
    :param ax: matplotlib axes instance, usually from a figure or other plot
    :param bins: Specification of hist bins, or None to use Freedman-Diaconis rule
    :param box: bool, if True plots a box plot. Similar to using violin, use one or the other
    :param density: bool, if True (default) plots a density plot
    :param density_fill: bool, if True (default) fill the area under the density curve
    :param display: maximum number rows to use (1000 default) for calculations, forces sampling if < len(df)
    :param fig_only: bool, if True (default) returns fig, ax, else returns fix, ax, max_peak, true_min, true_max
    :param fit: object with fit method, returning a tuple that can be passed to a pdf method
    :param hist: bool, if True plot a histogram
    :param hues: optional, a categorical variable for multiple plots
    :param hue_labels: optional, if using a column that is an object and/or categorical needing translation
    :param jitter: for strip plots only, add jitter. strip + jitter is similar to using swarm, use one or the other
    :param leaf_order: the order of magnitude of the leaf. The higher the order, the less quantization.
    :param legend: bool, if True plots a legend
    :param limit_var: use min / max from the data, not density plot
    :param norm_hist: bool, if True histogram will be normed
    :param random_state: initial random seed for the sampling process, for reproducible research
    :param rug: bool, if True plot a rug plot
    :param scale: force a specific scale for building the plot. Defaults to None (automatic).
    :param singular: force display of a density plot using a singular value, by simulating values of each side
    :param strip: bool, if True displays a strip plot
    :param swarm: swarm plot, similar to strip plot. use one or the other
    :param title: if present, adds a title to the plot
    :param violin: bool, if True plots a violin plot. Similar to using box, use one or the other
    :param x_min: force X axis minimum value. See also limit_var
    :param x_max: force Y axis minimum value. See also limit_var
    :param y_axis_label: bool, if True displays y axis ticks and label
    :return: see fig_only

Since we are passing a full dataframe, we specify the variable of interest with var. We also limit the curve to the data's min/max with limit_var=True, since it makes no sense to look at the curve for impossible ages (i.e., less than zero).



In [6]:

    
density_plot(df, var='Age', limit_var=True);
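Why does a density curve extend past the data at all? A kernel density estimate places a small bell curve on every observation, so the kernels near the minimum spill below it. The sketch below is a hand-rolled Gaussian KDE using only the standard library, with a rule-of-thumb bandwidth; the ages list is a small hand-made sample, not the Titanic column, and stemgraphic (via seaborn) uses its own estimator.

```python
from statistics import NormalDist, stdev


def gaussian_kde(sample, bandwidth=None):
    """Return a function estimating the density of `sample` at a point x."""
    n = len(sample)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption, other rules exist).
        bandwidth = 1.06 * stdev(sample) * n ** (-1 / 5)
    kernels = [NormalDist(mu=x, sigma=bandwidth) for x in sample]

    def density(x):
        # Average of the Gaussian kernels centred on each observation.
        return sum(k.pdf(x) for k in kernels) / n

    return density


ages = [0.42, 2.0, 4.0, 22.0, 26.0, 28.0, 35.0, 35.0, 38.0, 54.0, 80.0]
density = gaussian_kde(ages)
print(density(-5.0) > 0)  # the estimate is positive even for a negative age
```

Clipping the x axis to the observed minimum and maximum, as limit_var=True does, hides those tails rather than changing the estimate.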

We can also combine the density with various combinations of secondary plots. For example, histogram and box plot:



In [7]:

    
density_plot(df, var='Age', box=True, hist=True, limit_var=True);
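The help text says that bins=None falls back to the Freedman-Diaconis rule. As a rough worked example, that rule sets the bin width to 2 * IQR / n^(1/3). The sketch below plugs in the Age figures from the describe() output (714 non-null values, quartiles 20.125 and 38.0, range 0.42 to 80.0); the function name is hypothetical, and the actual histogram may differ because density_plot samples and quantizes the data first.

```python
import math


def freedman_diaconis(q1, q3, n, x_min, x_max):
    """Return (bin width, number of bins) under the Freedman-Diaconis rule."""
    width = 2 * (q3 - q1) / n ** (1 / 3)
    bins = math.ceil((x_max - x_min) / width)
    return width, bins


# Values taken from the Age column of df.describe() above.
width, bins = freedman_diaconis(q1=20.125, q3=38.0, n=714, x_min=0.42, x_max=80.0)
print(f"bin width: {width:.2f} years, bins: {bins}")
```

For the Age column this works out to bins roughly four years wide.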

Let's break this down by where the passengers embarked: Southampton (S), Cherbourg (C) and Queenstown (Q).



In [8]:

    
density_plot(df, var='Age', limit_var=True, hues='Embarked');

Wouldn't it be nice to have a legend that is a bit more explanatory than C, Q and S? If we prepare the data a little, density_plot handles this automatically.

Categorical support



In [9]:

    
df2 = df.copy()



In [10]:

    
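# survived indicator: the sorted codes 0 and 1 are renamed in order
# (assigning the result back avoids the inplace argument, removed in pandas 2.0)
df2.Survived = df2.Survived.astype('category')
df2.Survived = df2.Survived.cat.set_categories(['Did not survived', 'survived'], rename=True)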



In [11]:

    
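# passenger class: the sorted codes 1, 2 and 3 are renamed in order
# (assigning the result back avoids the inplace argument, removed in pandas 2.0)
df2.Pclass = df2.Pclass.astype('category')
df2.Pclass = df2.Pclass.cat.set_categories(['1st', '2nd', '3rd'], rename=True)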



In [12]:

    
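# embarked at: the sorted codes C, Q and S are renamed in order
# (assigning the result back avoids the inplace argument, removed in pandas 2.0)
df2.Embarked = df2.Embarked.astype('category')
df2.Embarked = df2.Embarked.cat.set_categories(['Cherbourg', 'Queenstown', 'Southampton'], rename=True)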



In [13]:

    
# sex is already spelled out, so we just need to convert to a category
df2.Sex = df2.Sex.astype('category')
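The cells above rely on the categories being in sorted order, so the new labels have to be listed in that same order. A possibly safer variant, sketched here on a tiny hand-made Series rather than the Titanic data, is to pass a dict to rename_categories so each code is mapped explicitly. This is an alternative to the notebook's approach, not what the notebook itself does.

```python
import pandas as pd

# Toy data standing in for the Embarked column, including a missing value.
ports = pd.Series(["S", "C", None, "Q", "S"], dtype="category")
print(list(ports.cat.categories))  # categories are stored in sorted order

# Map each code to its label explicitly, so ordering no longer matters.
labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}
named = ports.cat.rename_categories(labels)
print(named.tolist())
```

Missing values stay missing after the rename, so the row counts per port are unchanged.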



In [14]:

    
df2.head()









    Out[14]:

PassengerId  Survived          Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
1            Did not survived  3rd     Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    Southampton
2            survived          1st     Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    Cherbourg
3            survived          3rd     Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    Southampton
4            survived          1st     Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   Southampton
5            Did not survived  3rd     Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    Southampton

Now we are ready to try our previous plot, but with categoricals. We'll also add a title:



In [15]:

    
density_plot(df2, var='Age', limit_var=True, hues='Embarked', title='Passenger Age by port of origin');

Let's look at the fare. With money, box plots are useful alongside stem-and-leaf plots for getting a better feel for the data. Pricing often follows a lognormal or a double gamma distribution. Let's check both.



In [16]:

    
density_plot(df2, var='Fare', box=True, fit=stats.dgamma, limit_var=True, title='Fare distribution vs double gamma');



In [17]:

    
density_plot(df2, var='Fare', box=True, fit=stats.lognorm, limit_var=True, title='Fare distribution vs log normal');

Either would work for a simulation, but it looks like a log normal distribution is a bit closer to reality.
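Judging the fit by eye can be backed up with a number. According to the help text, fit expects an object with a fit method whose result can be passed to a pdf method, which is the interface of scipy.stats distributions. The sketch below fits both candidates and compares their log-likelihoods (higher is better). It runs on synthetic lognormal draws, not the real Fare column, so its output says nothing about the Titanic data; fit_and_score is a hypothetical helper.

```python
import numpy as np
from scipy import stats

# Synthetic, positive, right-skewed data standing in for fares (an assumption).
rng = np.random.default_rng(0)
fares = rng.lognormal(mean=3.0, sigma=1.0, size=300)


def fit_and_score(dist, data):
    """Fit `dist` by maximum likelihood and return (params, log-likelihood)."""
    params = dist.fit(data)  # tuple: shape parameter(s), loc, scale
    loglik = float(np.sum(dist.logpdf(data, *params)))
    return params, loglik


results = {name: fit_and_score(getattr(stats, name), fares)
           for name in ("lognorm", "dgamma")}
for name, (params, loglik) in results.items():
    print(f"{name}: log-likelihood = {loglik:.1f}")
```

To apply the same comparison to the notebook's data, one would pass df2.Fare.dropna() instead of the synthetic array.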

Other variations

Density + histogram + strip plot with jitter:



In [18]:

    
density_plot(df, var='Fare', density=True, hist=True, strip=True, jitter=True);

Multivariate density + hist + boxplot (note the boxplot is for the overall distribution, and in a different color):



In [19]:

    
density_plot(df2, var='Age', box=True, hist=True, limit_var=True, hues='Pclass');

Multivariate density + swarm:



In [20]:

    
density_plot(df2, var='Age', swarm=True, limit_var=True, hues='Pclass');

Using the kind keyword to combine multiple plot selections as a string:



In [21]:

    
density_plot(df2, var='Fare', kind='hist+rug+violin', limit_var=True, hues='Embarked');
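Conceptually, a kind string is shorthand for switching on the matching boolean keywords. The sketch below shows one way such a string could be turned into flags. It is purely illustrative: parse_kind and KNOWN_PLOTS are hypothetical names, and stemgraphic's own parsing may differ.

```python
# Plot names taken from the boolean parameters listed in help(density_plot).
KNOWN_PLOTS = {"box", "density", "hist", "rug", "strip", "swarm", "violin"}


def parse_kind(kind):
    """Split a 'hist+rug+violin' style string into {plot_name: True} flags."""
    flags = {}
    for name in kind.split("+"):
        name = name.strip().lower()
        if name not in KNOWN_PLOTS:
            raise ValueError(f"unknown plot type: {name!r}")
        flags[name] = True
    return flags


print(parse_kind("hist+rug+violin"))
```

Under that reading, kind='hist+rug+violin' would be equivalent to passing hist=True, rug=True and violin=True.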

Quick loop over the ports of origin, plotting passenger age with passenger class as hue:



In [22]:

    
for port in df2.Embarked.cat.categories:
    density_plot(df2[df2.Embarked==port], var='Age', box=True, title=port,
                 limit_var=True, hues='Pclass');


