Plotnine: Introduction

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:

    
import pandas as pd

Plotnine

http://plotnine.readthedocs.io/en/stable/

Built on top of Matplotlib, plotnine provides:
1. High-level functions
2. An implementation of the Grammar of Graphics, which became famous due to the ggplot2 R package
3. A syntax that is highly similar to the ggplot2 R package
It also works well with Pandas



In [9]:

    
import plotnine as p9

Introduction

We will use the Titanic example data set:



In [10]:

    
titanic = pd.read_csv('../data/titanic.csv')



In [11]:

    
titanic.head()









    Out[11]:







  
    
      
      PassengerId
      Survived
      Pclass
      Name
      Sex
      Age
      SibSp
      Parch
      Ticket
      Fare
      Cabin
      Embarked
    
  
  
    
      0
      1
      0
      3
      Braund, Mr. Owen Harris
      male
      22.0
      1
      0
      A/5 21171
      7.2500
      NaN
      S
    
    
      1
      2
      1
      1
      Cumings, Mrs. John Bradley (Florence Briggs Th...
      female
      38.0
      1
      0
      PC 17599
      71.2833
      C85
      C
    
    
      2
      3
      1
      3
      Heikkinen, Miss. Laina
      female
      26.0
      0
      0
      STON/O2. 3101282
      7.9250
      NaN
      S
    
    
      3
      4
      1
      1
      Futrelle, Mrs. Jacques Heath (Lily May Peel)
      female
      35.0
      1
      0
      113803
      53.1000
      C123
      S
    
    
      4
      5
      0
      3
      Allen, Mr. William Henry
      male
      35.0
      0
      0
      373450
      8.0500
      NaN
      S

Let's consider the following question:

For each class on the Titanic, how many people survived and how many died?

To answer it, we need the size of the group of zeros (died) and the group of ones (survived) in the column Survived, for each value of Pclass. In Pandas terminology:



In [12]:

    
survived_stat = titanic.groupby(["Pclass", "Survived"]).size().rename('count').reset_index()
survived_stat
# Remark: `rename` gives the column with the counts a name
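To see what each step of this chain contributes, here is a minimal sketch on a small made-up data frame (the `toy` frame and its values are hypothetical and only mimic the two Titanic columns used above):

```python
import pandas as pd

# Hypothetical toy data, standing in for the Pclass and Survived columns
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# size() counts the rows per group and returns a Series
# indexed by the (Pclass, Survived) combinations
sizes = toy.groupby(["Pclass", "Survived"]).size()

# rename() names the counted values, reset_index() turns the
# index levels back into regular columns
toy_stat = sizes.rename("count").reset_index()
print(toy_stat)
```

The result has one row per (Pclass, Survived) combination and three regular columns, which is the shape the plotting examples further on expect.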

Providing this data in a bar chart with pure Pandas is still partly supported:



In [19]:

    
survived_stat.plot(x='Survived', y='count', kind='bar')
## A possible other way of plotting this could be using groupby again:   
#survived_stat.groupby('Pclass').plot(x='Survived', y='count', kind='bar') # (try yourself by uncommenting)









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x7f2a06821ad0>

but with mixed results...

Plotting libraries focussing on the grammar of graphics are really targeting these grouped plots. For example, the plotting of the resulting counts can be expressed in the grammar of graphics:



In [20]:

    
(p9.ggplot(survived_stat, 
           p9.aes(x='Survived', y='count', fill='factor(Survived)'))
    + p9.geom_bar(stat='identity', position='dodge')
    + p9.facet_wrap(facets='Pclass'))









    












    Out[20]:





<ggplot: (8738654613509)>

Moreover, these count operations are embedded in the typical Grammar of Graphics packages and we can do these operations directly on the original titanic data set in a single coding step:



In [21]:

    
(p9.ggplot(titanic,
           p9.aes(x='Survived', fill='factor(Survived)'))
    + p9.geom_bar(stat='count', position='dodge')
    + p9.facet_wrap(facets='Pclass'))









    












    Out[21]:





<ggplot: (8738654587613)>

Remember:

The Grammar of Graphics is especially suitable for these so-called tidy dataframe representations (see the section "What is `tidy`?" for more about `tidy` data)
plotnine is a library that supports the Grammar of Graphics

Building a plotnine graph

Building plots with plotnine is typically an iterative process. As illustrated in the introduction, a graph is set up by layering different elements on top of each other using the + operator. Wrapping the whole expression in parentheses () lets it span multiple lines in valid Python syntax.

data

Bind the plot to a specific data frame using the data argument:



In [22]:

    
(p9.ggplot(data=titanic))









    












    Out[22]:





<ggplot: (8738654625505)>

We haven't defined anything else, so only an empty figure is available.

aesthetics

Define aesthetics (aes) by selecting the variables used in the plot and linking them to presentation properties such as plotting size, shape, color, etc. You can interpret this as: how the variable will influence the plotted objects/geometries:

The most important aes are: x, y, alpha, color, colour, fill, linetype, shape, size and stroke



In [23]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare')))









    












    Out[23]:





<ggplot: (8738654341497)>

geometry

Still nothing plotted yet, as we have to define what kind of geometry will be used for the plot. The easiest is probably using points:



In [24]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
)









    












    Out[24]:





<ggplot: (8738654223565)>

EXERCISE:

Starting from the code of the last figure, adapt the code in such a way that the Sex variable defines the color of the points in the graph.
As both sex categories overlap, use an alternative geometry, the so-called geom_jitter



In [25]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare', color='Sex'))
     + p9.geom_jitter()
)









    












    Out[25]:





<ggplot: (8738654322953)>

These are the basic elements to have a graph, but other elements can be added to the graph:

labels

Change the labels:



In [26]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
     + p9.xlab("Cabin class")
)









    












    Out[26]:





<ggplot: (8738654159745)>

facets

Use the power of groupby and define facets to group the plot by a grouping variable:



In [27]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')#, dir='v')
)









    












    Out[27]:





<ggplot: (8738654159473)>

scales

Define scales for colors, axes, ...

For example, a log-version of the y-axis could support the interpretation of the lower numbers:



In [28]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point() 
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')
     + p9.scale_y_log10()
)









    



/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py:856: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)






    












    Out[28]:





<ggplot: (8738654159713)>
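The RuntimeWarning in the output above is what you would expect when some fares are equal to zero, as the logarithm of zero is not finite. A minimal sketch of one way to handle this, assuming it is acceptable for the analysis to leave out those rows (the `toy` frame below is made up):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing zero fares
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3],
    "Fare": [0.0, 71.28, 7.25, 0.0],
})

# Keep only the strictly positive fares before using a log scale
positive = toy[toy["Fare"] > 0]

# The log transform is now finite for every remaining row
log_fare = np.log10(positive["Fare"])
print(log_fare)
```

The filtered data frame can then be passed to `p9.ggplot` instead of the full one. Whether dropping these rows is appropriate depends on the question you want to answer.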

theme

Change the theme:



In [29]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point() 
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')
     + p9.scale_y_log10()
     + p9.theme_bw()
)









    



/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py:856: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)






    












    Out[29]:





<ggplot: (8738653896445)>

or changing specific theming elements, e.g. text size:



In [30]:

    
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point() 
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')
     + p9.scale_y_log10()
     + p9.theme_bw()
     + p9.theme(text=p9.element_text(size=14))
)









    



/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py:856: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)






    












    Out[30]:





<ggplot: (8738653826685)>

more...

adding statistical derivatives
changing the plot coordinate system

Remember:

Start with defining your data, aes variables and a geometry
Further extend your plot with scale_*, theme_*, xlab/ylab, facet_*

plotnine is built on top of Matplotlib

As plotnine is built on top of Matplotlib, we can still retrieve the Matplotlib figure object from plotnine for further customization:



In [31]:

    
myplot = (p9.ggplot(titanic, 
                    p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
)

The trick is to use the draw() method of the plotnine plot:



In [32]:

    
my_plt_version = myplot.draw()



In [33]:

    
my_plt_version.axes[0].set_title("Titanic fare price per cabin class")
ax2 = my_plt_version.add_axes([0.5, 0.5, 0.3, 0.3], label="ax2")
my_plt_version









    Out[33]:

Remember: As with Pandas plots, the underlying Matplotlib objects remain accessible from plotnine. Use `draw()` and the Matplotlib `Figure` is returned.

(OPTIONAL SECTION) Some more plotnine functionalities to remember...

Histogram: Getting the univariate distribution of the Age



In [34]:

    
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
     + p9.geom_histogram(bins=30))









    












    Out[34]:





<ggplot: (8738653893049)>

EXERCISE:

Make a histogram of the age, grouped by the Sex of the passengers
Make sure both graphs are underneath each other instead of next to each other to enhance comparison



In [35]:

    
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
     + p9.geom_histogram(bins=30)
     + p9.facet_wrap('Sex', nrow=2)
)









    












    Out[35]:





<ggplot: (8738654509773)>

boxplot/violin plot: Getting the univariate distribution of Age per Sex



In [36]:

    
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
     + p9.geom_boxplot())









    












    Out[36]:





<ggplot: (8738654216733)>

Actually, a violin plot provides more insight into the distribution:



In [37]:

    
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
     + p9.geom_violin()
)









    












    Out[37]:





<ggplot: (8738654108761)>

EXERCISE:

Make a violin plot of the Age for each `Sex`
Add `jitter` to the plot to see the actual data points
Adjust the transparency of the jitter dots to improve readability



In [38]:

    
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
     + p9.geom_violin()
     + p9.geom_jitter(alpha=0.2)
)









    












    Out[38]:





<ggplot: (8738654127629)>

regressions

plotnine supports a number of statistical functions with the [geom_smooth function](http://plotnine.readthedocs.io/en/stable/generated/plotnine.stats.stat_smooth.html#plotnine.stats.stat_smooth).

The available methods are:

* 'auto'       # Use loess if (n<1000), glm otherwise
* 'lm', 'ols'  # Linear Model
* 'wls'        # Weighted Linear Model
* 'rlm'        # Robust Linear Model
* 'glm'        # Generalized linear Model
* 'gls'        # Generalized Least Squares
* 'lowess'     # Locally Weighted Regression (simple)
* 'loess'      # Locally Weighted Regression
* 'mavg'       # Moving Average
* 'gpr'        # Gaussian Process Regressor

Each of these methods is provided by an existing Python library and integrated in plotnine, so make sure to have these dependencies installed (read the error message!)



In [39]:

    
(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']), 
           p9.aes(x='Fare', y='Age', color="Sex"))
     + p9.geom_point()
     + p9.geom_rug(alpha=0.2)
     + p9.geom_smooth(method='lm')
)









    



/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)






    












    Out[39]:





<ggplot: (8738654182813)>



In [40]:

    
(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']), 
           p9.aes(x='Fare', y='Age', color="Sex"))
     + p9.geom_point()
     + p9.geom_rug(alpha=0.2)
     + p9.geom_smooth(method='lm')
     + p9.facet_wrap("Survived")
     + p9.scale_color_brewer(type="qual")
)









    



/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)






    












    Out[40]:





<ggplot: (8738654046117)>

Need more plotnine inspiration?

Remember the [plotnine gallery](http://plotnine.readthedocs.io/en/stable/gallery.html) and the [great documentation](http://plotnine.readthedocs.io/en/stable/api.html)

Important resources to start from!

What is `tidy`?

If you're wondering what tidy data representations are, you can read the scientific paper by Hadley Wickham, http://vita.had.co.nz/papers/tidy-data.pdf.

Here, we just introduce the main principle very briefly:

Compare:

un-tidy

WWTP	Treatment A	Treatment B
Destelbergen	8.	6.3
Landegem	7.5	5.2
Dendermonde	8.3	6.2
Eeklo	6.5	7.2

versus

tidy

WWTP	Treatment	pH
Destelbergen	A	8.
Landegem	A	7.5
Dendermonde	A	8.3
Eeklo	A	6.5
Destelbergen	B	6.3
Landegem	B	5.2
Dendermonde	B	6.2
Eeklo	B	7.2

This is sometimes also referred to as wide (un-tidy) versus long (tidy) format for a specific variable... Plotnine (and other grammar of graphics libraries) work better on tidy data, as it better supports groupby-like operations!
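Converting the un-tidy table above into the tidy one can be done with Pandas. A minimal sketch using `melt`, with the values copied from the tables above (the variable names `untidy` and `tidy` are just chosen for this example):

```python
import pandas as pd

# The un-tidy table: one column per treatment
untidy = pd.DataFrame({
    "WWTP": ["Destelbergen", "Landegem", "Dendermonde", "Eeklo"],
    "Treatment A": [8.0, 7.5, 8.3, 6.5],
    "Treatment B": [6.3, 5.2, 6.2, 7.2],
})

# melt() stacks the treatment columns into a single pair of columns:
# one holding the former column name, one holding the measured value
tidy = untidy.melt(id_vars="WWTP", var_name="Treatment", value_name="pH")

# Keep only the treatment letter (A or B) as the category label
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "", regex=False)
print(tidy)
```

In the tidy version, Treatment is a regular column, so it can be used directly as an aesthetic or facet variable in plotnine.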

Remember:

A tidy data set is set up as follows:

Each variable forms a column and contains values
Each observation forms a row
Each type of observational unit forms a table.

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S
