Plotnine: Introduction

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (jorisvandenbossche@gmail.com, stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



In [1]:
import pandas as pd
plotnine:
  • Is built on top of Matplotlib, while providing
    1. High-level functions
    2. An implementation of the Grammar of Graphics, which became famous due to the ggplot2 R package
    3. A syntax that is highly similar to the ggplot2 R package
  • Works well with Pandas

In [9]:
import plotnine as p9

Introduction

We will use the Titanic example data set:


In [10]:
titanic = pd.read_csv('../data/titanic.csv')

In [11]:
titanic.head()


Out[11]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Let's consider the following question:

For each class on the Titanic, how many people survived and how many died?

Hence, we need the size of the groups of zeros (died) and ones (survived) in the Survived column, split by Pclass as well. In Pandas terminology:


In [12]:
survived_stat = titanic.groupby(["Pclass", "Survived"]).size().rename('count').reset_index()
survived_stat
# Remark: `rename` gives the resulting count column a name ('count')


Out[12]:
Pclass Survived count
0 1 0 80
1 1 1 136
2 2 0 97
3 2 1 87
4 3 0 372
5 3 1 119
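
The chained call above can be unpacked step by step. The following is a minimal sketch on a small, made-up data frame (the `toy` values are invented for illustration and are not the Titanic data):

```python
import pandas as pd

# Toy data (made up for illustration), mimicking the two relevant Titanic columns
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0, 1],
})

# size() counts the rows in each (Pclass, Survived) group and returns
# a Series with a two-level index
counts = toy.groupby(["Pclass", "Survived"]).size()

# rename() names the Series, so that reset_index() produces a regular
# data frame with a column called 'count'
toy_stat = counts.rename("count").reset_index()
print(toy_stat)
```

The result has one row per combination of Pclass and Survived, which is the shape a grouped bar chart needs.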

Showing this data in a bar chart with pure Pandas is possible to some extent:


In [19]:
survived_stat.plot(x='Survived', y='count', kind='bar')
## A possible other way of plotting this could be using groupby again:   
#survived_stat.groupby('Pclass').plot(x='Survived', y='count', kind='bar') # (try yourself by uncommenting)


Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f2a06821ad0>

but with mixed results...

Plotting libraries focussing on the grammar of graphics are really targeting these grouped plots. For example, the plotting of the resulting counts can be expressed in the grammar of graphics:


In [20]:
(p9.ggplot(survived_stat, 
           p9.aes(x='Survived', y='count', fill='factor(Survived)'))
    + p9.geom_bar(stat='identity', position='dodge')
    + p9.facet_wrap(facets='Pclass'))


Out[20]:
<ggplot: (8738654613509)>

Moreover, these count operations are embedded in the typical Grammar of Graphics packages and we can do these operations directly on the original titanic data set in a single coding step:


In [21]:
(p9.ggplot(titanic,
           p9.aes(x='Survived', fill='factor(Survived)'))
    + p9.geom_bar(stat='count', position='dodge')
    + p9.facet_wrap(facets='Pclass'))


Out[21]:
<ggplot: (8738654587613)>
Remember:
  • The Grammar of Graphics is especially suitable for these so-called tidy dataframe representations (see the section "What is tidy?" at the end for more about `tidy` data)
  • plotnine is a library that supports the Grammar of Graphics

Building a plotnine graph

Building plots with plotnine is typically an iterative process. As illustrated in the introduction, a graph is set up by layering different elements on top of each other using the + operator. Putting everything together in brackets () allows the expression to span multiple lines in valid Python syntax.

data

  • Bind the plot to a specific data frame using the data argument:

In [22]:
(p9.ggplot(data=titanic))


Out[22]:
<ggplot: (8738654625505)>

We haven't defined anything else, so just an empty figure is available.

aesthetics

  • Define aesthetics (aes) by selecting the variables used in the plot and linking them to presentation properties such as size, shape, color, etc. You can interpret this as: how the variable will influence the plotted objects/geometries:

The most important aes are: x, y, alpha, color, colour, fill, linetype, shape, size and stroke


In [23]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare')))


Out[23]:
<ggplot: (8738654341497)>

geometry

  • Still nothing plotted yet, as we have to define what kind of geometry will be used for the plot. The easiest is probably using points:

In [24]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
)


Out[24]:
<ggplot: (8738654223565)>
EXERCISE:
  • Starting from the code of the last figure, adapt the code in such a way that the Sex variable defines the color of the points in the graph.
  • As the points of both Sex categories overlap, use an alternative geometry, the so-called geom_jitter

In [25]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare', color='Sex'))
     + p9.geom_jitter()
)


Out[25]:
<ggplot: (8738654322953)>

These are the basic elements to have a graph, but other elements can be added to the graph:

labels


In [26]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
     + p9.xlab("Cabin class")
)


Out[26]:
<ggplot: (8738654159745)>

facets

  • Use the power of groupby and define facets to group the plot by a grouping variable:

In [27]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')#, dir='v')
)


Out[27]:
<ggplot: (8738654159473)>

scales

  • Define scales for colors, axes, ...

For example, a log-version of the y-axis could support the interpretation of the lower numbers:


In [28]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point() 
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')
     + p9.scale_y_log10()
)


/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py:856: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[28]:
<ggplot: (8738654159713)>
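
The RuntimeWarning in the output above suggests that some values of the Fare column are zero, for which the logarithm is not defined. A minimal sketch of what happens, using a made-up series of fares (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy fares (made up), including one zero
fares = pd.Series([0.0, 7.25, 71.28, 8.05], name="Fare")

# log10(0) evaluates to -infinity and numpy emits
# "divide by zero encountered in log10"
with np.errstate(divide="ignore"):  # silence the warning for this demo only
    logged = np.log10(fares)

# Keeping only the strictly positive values avoids the problem
positive = fares[fares > 0]
logged_positive = np.log10(positive)
```

On the real data, one option would be to pass a filtered data frame such as `titanic[titanic['Fare'] > 0]` to `ggplot` when a log scale is used, keeping in mind that the zero fares are then no longer shown.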

theme


In [29]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point() 
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')
     + p9.scale_y_log10()
     + p9.theme_bw()
)


/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py:856: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[29]:
<ggplot: (8738653896445)>

or changing specific theming elements, e.g. text size:


In [30]:
(p9.ggplot(titanic,
           p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point() 
     + p9.xlab("Cabin class")
     + p9.facet_wrap('Sex')
     + p9.scale_y_log10()
     + p9.theme_bw()
     + p9.theme(text=p9.element_text(size=14))
)


/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/pandas/core/series.py:856: RuntimeWarning: divide by zero encountered in log10
  result = getattr(ufunc, method)(*inputs, **kwargs)
Out[30]:
<ggplot: (8738653826685)>

more...

Remember:
  • Start with defining your data, aes variables and a geometry
  • Further extend your plot with scale_*, theme_*, xlab/ylab, facet_*

plotnine is built on top of Matplotlib

As plotnine is built on top of Matplotlib, we can still retrieve the Matplotlib figure object from plotnine for further customization:


In [31]:
myplot = (p9.ggplot(titanic, 
                    p9.aes(x='factor(Pclass)', y='Fare'))
     + p9.geom_point()
)

The trick is to use the draw() method of the plotnine plot object:


In [32]:
my_plt_version = myplot.draw()



In [33]:
my_plt_version.axes[0].set_title("Titanic fare price per cabin class")
ax2 = my_plt_version.add_axes([0.5, 0.5, 0.3, 0.3], label="ax2")
my_plt_version


Out[33]:
Remember: Similar to plotting with Pandas, plotnine gives access to the underlying Matplotlib object: call `draw()` and the Matplotlib `Figure` is returned.
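
To see what the two Matplotlib calls used above do, the following sketch applies them to a plain Matplotlib `Figure`. The figure created with `plt.subplots()` is only a stand-in for the object returned by `draw()`, and the plotted numbers are made up:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt

# Stand-in for the Figure returned by `myplot.draw()`
fig, ax = plt.subplots()
ax.scatter([1, 2, 3], [80, 20, 8])

# fig.axes is the list of all Axes in the Figure; the first one is the main panel
fig.axes[0].set_title("Titanic fare price per cabin class")

# add_axes takes [left, bottom, width, height] as fractions of the figure size
inset = fig.add_axes([0.5, 0.5, 0.3, 0.3], label="ax2")
inset.set_title("inset")
```

After these calls the figure contains two Axes objects: the original panel and the new, empty inset, which can be filled with any regular Matplotlib plotting call.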

(OPTIONAL SECTION) Some more plotnine functionalities to remember...

Histogram: Getting the univariate distribution of the Age


In [34]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
     + p9.geom_histogram(bins=30))


Out[34]:
<ggplot: (8738653893049)>
EXERCISE:
  • Make a histogram of the age, grouped by the Sex of the passengers
  • Make sure both graphs are underneath each other instead of next to each other to enhance comparison

In [35]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
     + p9.geom_histogram(bins=30)
     + p9.facet_wrap('Sex', nrow=2)
)


Out[35]:
<ggplot: (8738654509773)>

boxplot/violin plot: Getting the univariate distribution of Age per Sex


In [36]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
     + p9.geom_boxplot())


Out[36]:
<ggplot: (8738654216733)>

Actually, a violin plot provides more insight into the distribution:


In [37]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
     + p9.geom_violin()
)


Out[37]:
<ggplot: (8738654108761)>
EXERCISE:
  • Make a violin plot of the Age for each `Sex`
  • Add `jitter` to the plot to see the actual data points
  • Adjust the transparency of the jitter dots to improve readability

In [38]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
     + p9.geom_violin()
     + p9.geom_jitter(alpha=0.2)
)


Out[38]:
<ggplot: (8738654127629)>

regressions

plotnine supports a number of statistical functions with the [geom_smooth function](http://plotnine.readthedocs.io/en/stable/generated/plotnine.stats.stat_smooth.html#plotnine.stats.stat_smooth):

The available methods are:

* 'auto'       # Use loess if (n<1000), glm otherwise
* 'lm', 'ols'  # Linear Model
* 'wls'        # Weighted Linear Model
* 'rlm'        # Robust Linear Model
* 'glm'        # Generalized linear Model
* 'gls'        # Generalized Least Squares
* 'lowess'     # Locally Weighted Regression (simple)
* 'loess'      # Locally Weighted Regression
* 'mavg'       # Moving Average
* 'gpr'        # Gaussian Process Regressor

Each of these methods is provided by an existing Python library and integrated in plotnine, so make sure to have these dependencies installed (read the error message!)


In [39]:
(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']), 
           p9.aes(x='Fare', y='Age', color="Sex"))
     + p9.geom_point()
     + p9.geom_rug(alpha=0.2)
     + p9.geom_smooth(method='lm')
)


/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
Out[39]:
<ggplot: (8738654182813)>

In [40]:
(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']), 
           p9.aes(x='Fare', y='Age', color="Sex"))
     + p9.geom_point()
     + p9.geom_rug(alpha=0.2)
     + p9.geom_smooth(method='lm')
     + p9.facet_wrap("Survived")
     + p9.scale_color_brewer(type="qual")
)


/home/stijnvanhoey/miniconda3/envs/DS-python-data-analysis/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
Out[40]:
<ggplot: (8738654046117)>
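
To get a feeling for what the line drawn with `method='lm'` represents, the sketch below fits an ordinary least squares line to made-up data. `numpy.polyfit` is used here as a stand-in; it is not claimed to be what plotnine calls internally, and the confidence band shown by `geom_smooth` is not computed:

```python
import numpy as np

# Toy data (made up): y is roughly 2 * x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Ordinary least squares fit of a straight line (polynomial of degree 1);
# the coefficients are returned from highest to lowest degree
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at the observed x values
y_fit = slope * x + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In the plots above, a separate line of this kind is fitted for each Sex category (and for each facet), because the color aesthetic splits the data into groups.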

Need more plotnine inspiration?

Remember the [plotnine gallery](http://plotnine.readthedocs.io/en/stable/gallery.html) and the [great documentation](http://plotnine.readthedocs.io/en/stable/api.html)

Important resources to start from!

What is tidy?

If you're wondering what tidy data representations are, you can read the scientific paper by Hadley Wickham, http://vita.had.co.nz/papers/tidy-data.pdf.

Here, we just introduce the main principle very briefly:

Compare:

un-tidy

| WWTP         | Treatment A | Treatment B |
|--------------|-------------|-------------|
| Destelbergen | 8.0         | 6.3         |
| Landegem     | 7.5         | 5.2         |
| Dendermonde  | 8.3         | 6.2         |
| Eeklo        | 6.5         | 7.2         |

versus

tidy

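| WWTP         | Treatment | pH  |
|--------------|-----------|-----|
| Destelbergen | A         | 8.0 |
| Landegem     | A         | 7.5 |
| Dendermonde  | A         | 8.3 |
| Eeklo        | A         | 6.5 |
| Destelbergen | B         | 6.3 |
| Landegem     | B         | 5.2 |
| Dendermonde  | B         | 6.2 |
| Eeklo        | B         | 7.2 |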

This is sometimes also referred to as wide versus long format for a specific variable. Plotnine (and other Grammar of Graphics libraries) works better on tidy data, as it better supports groupby-like operations!
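
Converting the un-tidy table into the tidy one can be done with Pandas. The following is a minimal sketch using `melt`; the variable names `untidy` and `tidy` are chosen for illustration:

```python
import pandas as pd

# The un-tidy table from above: one column per treatment
untidy = pd.DataFrame({
    "WWTP": ["Destelbergen", "Landegem", "Dendermonde", "Eeklo"],
    "Treatment A": [8.0, 7.5, 8.3, 6.5],
    "Treatment B": [6.3, 5.2, 6.2, 7.2],
})

# melt() stacks the treatment columns: the former column names end up in
# 'Treatment' and the measured values in 'pH'
tidy = untidy.melt(id_vars="WWTP", var_name="Treatment", value_name="pH")

# Strip the 'Treatment ' prefix, so only the labels A and B remain
tidy["Treatment"] = tidy["Treatment"].str.replace("Treatment ", "", regex=False)
print(tidy)
```

With the data in this form, Treatment can be used directly as an aesthetic or facet variable, for example `p9.aes(x='WWTP', y='pH', fill='Treatment')`.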

Remember:

A tidy data set is set up as follows:
  • Each variable forms a column and contains values
  • Each observation forms a row
  • Each type of observational unit forms a table.