Plotnine: Introduction
DS Data manipulation, analysis and visualisation in Python
December, 2019
© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons
In [1]:
import pandas as pd
plotnine is a Python implementation of the grammar of graphics, modelled on the ggplot2 R package.
In [9]:
import plotnine as p9
We will use the Titanic example data set:
In [10]:
titanic = pd.read_csv('../data/titanic.csv')
In [11]:
titanic.head()
Out[11]:
Let's consider the following question:
For each class on the Titanic, how many people survived and how many died?
Hence, we need the size of the zeros (died) and ones (survived) groups of the column Survived, computed for each Pclass. In Pandas terminology:
In [12]:
survived_stat = titanic.groupby(["Pclass", "Survived"]).size().rename('count').reset_index()
survived_stat
# Remark: the `rename` syntax is to provide the count column a column name
Out[12]:
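To see what each step of the groupby chain contributes, here is a minimal sketch on a small made-up table. The `toy` DataFrame and its values are assumptions for illustration only, not the real Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic table (not the real file).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# size() counts the rows in each (Pclass, Survived) group and returns a
# Series with a two-level index; the Series itself has no name yet.
sizes = toy.groupby(["Pclass", "Survived"]).size()

# rename('count') names the Series, so reset_index() can turn it into a
# regular column next to the former index levels.
toy_stat = sizes.rename("count").reset_index()
print(toy_stat)
```

The result has one row per (Pclass, Survived) combination, which is the shape the bar charts below work with.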
Providing this data in a bar chart with pure Pandas is still partly supported:
In [19]:
survived_stat.plot(x='Survived', y='count', kind='bar')
## A possible other way of plotting this could be using groupby again:
#survived_stat.groupby('Pclass').plot(x='Survived', y='count', kind='bar') # (try yourself by uncommenting)
Out[19]:
but with mixed results...
Plotting libraries focussing on the grammar of graphics are really targeting these grouped plots. For example, the plotting of the resulting counts can be expressed in the grammar of graphics:
In [20]:
(p9.ggplot(survived_stat,
p9.aes(x='Survived', y='count', fill='factor(Survived)'))
+ p9.geom_bar(stat='identity', position='dodge')
+ p9.facet_wrap(facets='Pclass'))
Out[20]:
Moreover, these count operations are embedded in the typical Grammar of Graphics packages and we can do these operations directly on the original titanic data set in a single coding step:
In [21]:
(p9.ggplot(titanic,
p9.aes(x='Survived', fill='factor(Survived)'))
+ p9.geom_bar(stat='count', position='dodge')
+ p9.facet_wrap(facets='Pclass'))
Out[21]:
Building plots with plotnine is typically an iterative process. As illustrated in the introduction, a graph is set up by layering different elements on top of each other using the + operator. Wrapping the whole expression in parentheses () lets it span multiple lines in valid Python syntax.
In [22]:
(p9.ggplot(data=titanic))
Out[22]:
We haven't defined anything else, so only an empty figure is available.
The most important aes are: x, y, alpha, color, colour, fill, linetype, shape, size and stroke
In [23]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare')))
Out[23]:
In [24]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
)
Out[24]:
Here the Sex variable defines the color of the points in the graph, and geom_jitter is used instead of geom_point to spread out overlapping points:
In [25]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare', color='Sex'))
+ p9.geom_jitter()
)
Out[25]:
These are the basic elements to have a graph, but other elements can be added to the graph:
In [26]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
+ p9.xlab("Cabin class")
)
Out[26]:
Facets work like a groupby: define facets to split the plot into panels by a grouping variable:
In [27]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
+ p9.xlab("Cabin class")
+ p9.facet_wrap('Sex')#, dir='v')
)
Out[27]:
For example, a log-version of the y-axis could support the interpretation of the lower numbers:
In [28]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
+ p9.xlab("Cabin class")
+ p9.facet_wrap('Sex')
+ p9.scale_y_log10()
)
Out[28]:
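A small numeric sketch of why the log scale helps. The fares below are made up for illustration. Note that a value of zero has no finite logarithm, so any zero fares in the data cannot be placed on a log axis.

```python
import numpy as np

# Hypothetical fares, chosen only to illustrate the effect (not real data).
fares = np.array([0.0, 7.5, 8.0, 13.0, 26.0, 512.0])

# On a linear axis, every fare up to 26 sits in roughly the lowest 5% of the range.
linear_share = fares[4] / fares.max()

# On a log10 axis equal ratios take equal space, so the low fares spread out.
# log10(0) is -inf; the warning is silenced here to keep the output clean.
with np.errstate(divide="ignore"):
    logged = np.log10(fares)

print(round(linear_share, 3))
print(logged)
```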
In [29]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
+ p9.xlab("Cabin class")
+ p9.facet_wrap('Sex')
+ p9.scale_y_log10()
+ p9.theme_bw()
)
Out[29]:
Specific theming elements can also be changed, e.g. the text size:
In [30]:
(p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
+ p9.xlab("Cabin class")
+ p9.facet_wrap('Sex')
+ p9.scale_y_log10()
+ p9.theme_bw()
+ p9.theme(text=p9.element_text(size=14))
)
Out[30]:
* Required: data, aes variables and a geometry
* Optional: scale_*, theme_*, xlab/ylab, facet_*
As plotnine is built on top of Matplotlib, we can still retrieve the matplotlib figure object from plotnine for further customization:
In [31]:
myplot = (p9.ggplot(titanic,
p9.aes(x='factor(Pclass)', y='Fare'))
+ p9.geom_point()
)
The trick is to use the draw() method of the plotnine plot object:
In [32]:
my_plt_version = myplot.draw()
In [33]:
my_plt_version.axes[0].set_title("Titanic fare price per cabin class")
ax2 = my_plt_version.add_axes([0.5, 0.5, 0.3, 0.3], label="ax2")
my_plt_version
Out[33]:
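The two matplotlib calls used above can be tried on any figure. In this sketch a plain matplotlib figure stands in for the object returned by draw(); the data and the "inset" label are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# A plain matplotlib figure stands in for the object returned by draw().
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [10, 20, 15])

# fig.axes lists the axes in creation order; the first is the main panel.
fig.axes[0].set_title("Main panel")

# add_axes takes [left, bottom, width, height] as fractions of the figure.
inset = fig.add_axes([0.5, 0.5, 0.3, 0.3], label="inset")
inset.plot([1, 2, 3], [3, 1, 2])

print(len(fig.axes))
```

The rectangle passed to add_axes is in figure coordinates, so [0.5, 0.5, 0.3, 0.3] places the new axes in the upper right part of the figure regardless of the data range.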
Histogram: Getting the univariate distribution of Age
In [34]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
+ p9.geom_histogram(bins=30))
Out[34]:
The same histogram, split by the Sex of the passengers:
In [35]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Age'))
+ p9.geom_histogram(bins=30)
+ p9.facet_wrap('Sex', nrow=2)
)
Out[35]:
boxplot/violin plot: Getting the univariate distribution of Age per Sex
In [36]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
+ p9.geom_boxplot())
Out[36]:
Actually, a violin plot provides more insight into the distribution:
In [37]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
+ p9.geom_violin()
)
Out[37]:
In [38]:
(p9.ggplot(titanic.dropna(subset=['Age']), p9.aes(x='Sex', y='Age'))
+ p9.geom_violin()
+ p9.geom_jitter(alpha=0.2)
)
Out[38]:
regressions
plotnine supports a number of statistical functions with the geom_smooth function (http://plotnine.readthedocs.io/en/stable/generated/plotnine.stats.stat_smooth.html#plotnine.stats.stat_smooth).
The available methods are:
* 'auto' # Use loess if (n<1000), glm otherwise
* 'lm', 'ols' # Linear Model
* 'wls' # Weighted Linear Model
* 'rlm' # Robust Linear Model
* 'glm' # Generalized linear Model
* 'gls' # Generalized Least Squares
* 'lowess' # Locally Weighted Regression (simple)
* 'loess' # Locally Weighted Regression
* 'mavg' # Moving Average
* 'gpr' # Gaussian Process Regressor
Each of these functions is provided by an existing Python library and integrated in plotnine, so make sure to have these dependencies installed (read the error message!)
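As a rough idea of what the 'lm' method draws, the sketch below fits an ordinary least-squares line with numpy. This is a stand-in chosen for illustration, not the code path plotnine itself uses, and the data points are made up.

```python
import numpy as np

# Hypothetical points lying close to y = 2x + 1 (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# Degree-1 least-squares fit; polyfit returns the highest power first.
slope, intercept = np.polyfit(x, y, deg=1)

# Fitted values along the line, i.e. the center line of a linear smooth.
fitted = slope * x + intercept
print(slope, intercept)
```

A smoothing layer typically also shows an uncertainty band around such a line, which this sketch leaves out.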
In [39]:
(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']),
p9.aes(x='Fare', y='Age', color="Sex"))
+ p9.geom_point()
+ p9.geom_rug(alpha=0.2)
+ p9.geom_smooth(method='lm')
)
Out[39]:
In [40]:
(p9.ggplot(titanic.dropna(subset=['Age', 'Sex', 'Fare']),
p9.aes(x='Fare', y='Age', color="Sex"))
+ p9.geom_point()
+ p9.geom_rug(alpha=0.2)
+ p9.geom_smooth(method='lm')
+ p9.facet_wrap("Survived")
+ p9.scale_color_brewer(type="qual")
)
Out[40]:
If you're wondering what tidy data representations are, you can read the scientific paper by Hadley Wickham, http://vita.had.co.nz/papers/tidy-data.pdf.
Here, we just introduce the main principle very briefly:
* Each variable forms a column and contains values.
* Each observation forms a row.
* Each type of observational unit forms a table.
This is sometimes also referred to as wide versus long format for a specific variable. Plotnine (and other grammar of graphics libraries) works better on tidy data, as it better supports groupby-like operations!
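A minimal sketch of reshaping a table into tidy (long) form with pandas. The `wide` table, its column names and its numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per class, one column per outcome.
wide = pd.DataFrame({
    "Pclass":   [1, 2, 3],
    "died":     [2, 3, 5],
    "survived": [4, 3, 1],
})

# melt() stacks the outcome columns: each row becomes one observation
# (a class/outcome pair) and each variable gets its own column.
tidy = wide.melt(id_vars="Pclass", var_name="outcome", value_name="count")
print(tidy)
```

In the tidy version, `outcome` can be mapped directly to an aesthetic such as fill, or used as a facet, which is not possible while the outcomes are spread over two columns.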