Python's Visualization Landscape

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons



Remark:

The packages used in this notebook are not provided by default in the conda environment of the course. In case you want to try these features yourself, make sure to install these packages with conda.

To make some of the more general plotting packages available:

conda install -c conda-forge bokeh plotly altair vega

Additional advice will appear about making the vega nbextension available. It can be activated with the command:

jupyter nbextension enable vega --py --sys-prefix

To use the interaction between plotly and pandas, install cufflinks as well:

pip install cufflinks --upgrade

To run the large data set section, additional package installations are required:

conda install -c bokeh datashader holoviews

What have we done so far?

What we have encountered until now:


In [ ]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import plotnine as p9
import seaborn as sns

To 'grammar of graphics' or not to 'grammar of graphics'?

Introduction

There is titanic again...


In [ ]:
titanic = pd.read_csv("../data/titanic.csv")

Pandas plot...


In [ ]:
fig, ax = plt.subplots()
plt.style.use('ggplot')
survival_rate = titanic.groupby("Pclass")['Survived'].mean()
survival_rate.plot(kind='bar', color='grey', 
                   rot=0, figsize=(6, 4), ax=ax)
ylab = ax.set_ylabel("Survival rate")
xlab = ax.set_xlabel("Cabin class")

Plotnine plot...


In [ ]:
(p9.ggplot(titanic, p9.aes(x="factor(Pclass)", 
                           y="Survived"))   #add color/fill
     + p9.geom_bar(stat='stat_summary', width=0.5)
     + p9.theme(figure_size=(5, 3))
     + p9.ylab("Survival rate")
     + p9.xlab("Cabin class")
)

An important difference is the imperative approach from matplotlib versus the declarative approach from plotnine:

| imperative | declarative |
|---|---|
| Specify how something should be done | Specify what should be done |
| Manually specify the individual plotting steps | Individual plotting steps based on declaration |
| e.g. `for ax in axes: ax.plot(...` | e.g. `+ facet_wrap('my_variable')` |

(seaborn lands somewhere in between)
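
As a rough illustration of the imperative column, here is a minimal sketch on a small, made-up tidy data set (the column names `site`, `dose` and `value` are assumptions, not course data). Every panel is created and filled by hand; the comment at the end shows how a plotnine declaration would express the same intent.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tidy data set: one row per measurement
data = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "dose": [0.3, 0.6, 0.9, 0.3, 0.6, 0.9],
    "value": [7.2, 6.7, 5.2, 7.2, 6.8, 5.9],
})

# Imperative: we create the panels and fill them one by one ourselves
sites = data["site"].unique()
fig, axes = plt.subplots(ncols=len(sites), sharey=True, figsize=(6, 3))
for ax, site in zip(axes, sites):
    subset = data[data["site"] == site]
    ax.plot(subset["dose"], subset["value"], marker="o")
    ax.set_title(site)
    ax.set_xlabel("dose")
axes[0].set_ylabel("value")

# Declarative (plotnine, not executed here): the panels follow from
#   p9.ggplot(data, p9.aes("dose", "value")) + p9.geom_line() + p9.facet_wrap("site")
```

The loop grows with every extra panel variable, whereas the declaration only changes in its faceting term.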

Which approach to use is also a matter of personal preference.

Still, take the following elements into account:

  • When your data consists of only 1 factor variable, such as

| ID | variable 1 | variable 2 | variable ... |
|---|---|---|---|
| 1 | 0.2 | 0.8 | ... |
| 2 | 0.3 | 0.1 | ... |
| 3 | 0.9 | 0.6 | ... |
| 4 | 0.1 | 0.7 | ... |
| ... | ... | ... | ... |

the added value of using a grammar of graphics approach is LOW.

  • When working with time series data from sensors or continuous logging, such as

| datetime | station 1 | station 2 | station ... |
|---|---|---|---|
| 2017-12-20T17:50:46Z | 0.2 | 0.8 | ... |
| 2017-12-20T17:50:52Z | 0.3 | 0.1 | ... |
| 2017-12-20T17:51:03Z | 0.9 | 0.6 | ... |
| 2017-12-20T17:51:40Z | 0.1 | 0.7 | ... |
| ... | ... | ... | ... |

the added value of using a grammar of graphics approach is LOW.

  • When working with different experiments, different conditions, (factorial) experimental designs, such as

| ID | substrate | addition (ml) | measured_value |
|---|---|---|---|
| 1 | Eindhoven | 0.3 | 7.2 |
| 2 | Eindhoven | 0.6 | 6.7 |
| 3 | Eindhoven | 0.9 | 5.2 |
| 4 | Destelbergen | 0.3 | 7.2 |
| 5 | Destelbergen | 0.6 | 6.8 |
| ... | ... | ... | ... |

the added value of using a grammar of graphics approach is HIGH. Represent your data in tidy format to achieve maximal benefit!
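
If your data arrives in a wide layout like the station table above, pandas can reshape it to tidy (long) format. A minimal sketch, assuming made-up column names `station_1` and `station_2` and only two time stamps:

```python
import pandas as pd

# Hypothetical wide table: one column per station (names are made up)
wide = pd.DataFrame({
    "datetime": pd.to_datetime(["2017-12-20T17:50:46Z", "2017-12-20T17:50:52Z"]),
    "station_1": [0.2, 0.3],
    "station_2": [0.8, 0.1],
})

# Wide to long: one row per (datetime, station) observation
tidy = wide.melt(id_vars="datetime", var_name="station", value_name="value")
print(tidy)
```

In the tidy version, `station` is an ordinary column that a grammar of graphics package can map to color or facets.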

Remember:
  • These packages will support you towards static, publication quality figures in a variety of hardcopy formats
  • In general, start with a high-level function and finish with matplotlib
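
One way to read the last advice, sketched with made-up survival rates (the numbers are not computed from the titanic data): let a high-level function draw the bulk of the figure, then adjust the details on the same matplotlib Axes.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical survival rates per cabin class (values are made up)
rates = pd.Series([0.6, 0.5, 0.2], index=[1, 2, 3], name="rate")

fig, ax = plt.subplots(figsize=(6, 4))
rates.plot(kind="bar", color="grey", rot=0, ax=ax)  # high-level start

# Finish with plain matplotlib calls on the same Axes
ax.set_xlabel("Cabin class")
ax.set_ylabel("Survival rate")
ax.axhline(rates.mean(), color="black", linestyle="--")
ax.annotate("mean", xy=(0, rates.mean()),
            xytext=(0, 4), textcoords="offset points")
```

The same pattern applies to seaborn, which also returns or accepts matplotlib Axes.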

Still...

I've been wasting too much time on this one stupid detail for this publication graph


In [ ]:
fig.savefig("my_plot_with_one_issue.pdf")
Notice:
  • In the end... there is still Inkscape to the rescue!

Seaborn


In [ ]:
plt.style.use('seaborn-white')

Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

Seaborn provides a set of particularly interesting plot functions:

scatterplot matrix

We've already encountered the pairplot, a typical quick explorative plot function


In [ ]:
# the discharge data for a number of measurement stations as example
flow_data = pd.read_csv("../data/vmm_flowdata.csv", parse_dates=True, index_col=0)
flow_data = flow_data.dropna()
flow_data['year'] = flow_data.index.year
flow_data.head()

In [ ]:
# pairplot
sns.pairplot(flow_data, vars=["L06_347", "LS06_347", "LS06_348"], 
             hue='year', palette=sns.color_palette("Blues_d"), 
             diag_kind='kde', dropna=True)

heatmap

Let's just start from a Ghent data set: The city of Ghent provides data about migration in the different districts as open data, https://data.stad.gent/data/58


In [ ]:
district_migration = pd.read_csv("https://datatank.stad.gent/4/bevolking/wijkmigratieperduizend.csv", 
                                 sep=";", index_col=0)
district_migration.index.name = "wijk"
district_migration.head()

In [ ]:
# cleaning the column headers
district_migration.columns = [year[-4:] for year in district_migration.columns]
district_migration.head()

In [ ]:
#adding a total column
district_migration['TOTAAL'] = district_migration.sum(axis=1)

In [ ]:
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(district_migration, annot=True, fmt=".1f", linewidths=.5, 
            cmap="PiYG", ax=ax, vmin=-20, vmax=20)
ylab = ax.set_ylabel("")
ax.set_title("Migration of Ghent districts", size=14)

jointplot

jointplot provides a very convenient function to check the combined distribution of two variables in a DataFrame (bivariate plot)

Using the default options on the flow_data dataset


In [ ]:
g = sns.jointplot(data=flow_data, 
                  x='LS06_347', y='LS06_348')

In [ ]:
g = sns.jointplot(data=flow_data, 
                  x='LS06_347', y='LS06_348', 
                  kind="reg", space=0)

more options, applied on the migration data set:


In [ ]:
g = sns.jointplot(data=district_migration.transpose(), 
                  x='Oud Gentbrugge', y='Nieuw Gent - UZ', 
                  kind="kde", height=7, space=0) # kde
Notice!:
  • Be careful with the interpretation. The representations (`kde`, `regression`) are based on a very limited set of data points!

Adding the data points themselves provides at least this information to the user:


In [ ]:
g = (sns.jointplot(
        data=district_migration.transpose(), 
        x='Oud Gentbrugge', y='Nieuw Gent - UZ', 
        kind="scatter", height=7, space=0, stat_func=None,
        marginal_kws=dict(bins=20, rug=True)
        ).plot_joint(sns.kdeplot, zorder=0, 
                     n_levels=5, cmap='Reds'))
g.savefig("my_great_plot.pdf")

catplot and relplot

With catplot and relplot, Seaborn offers an interface that resembles the Grammar of Graphics


In [ ]:
sns.catplot(data=titanic, x="Survived",
            col="Pclass", kind="count")
Remember - Check the package galleries:

Interactivity and the web matter these days!

Bokeh

Bokeh is a Python interactive visualization library that targets modern web browsers for presentation


In [ ]:
from bokeh.plotting import figure, output_file, show

By default, Bokeh will open a new webpage to plot the figure. Still, the integration with notebooks is provided as well:


In [ ]:
from bokeh.io import output_notebook

In [ ]:
output_notebook()

In [ ]:
p = figure()
p.line(x=[1, 2, 3], y=[4,6,2])
show(p)
Notice!:
  • Bokeh does not support eps or pdf export of plots directly. Exporting to svg is available but still limited, see the documentation.

To accommodate the users of Pandas, a pd.DataFrame can also be used as the input for a Bokeh plot:


In [ ]:
from bokeh.models import ColumnDataSource
source_data = ColumnDataSource(data=flow_data)

In [ ]:
flow_data.head()

Useful to know when you want to use the index as well:

If the DataFrame has a named index column, then CDS will also have a column with this name. However, if the index name (or any subname of a MultiIndex) is None, then the CDS will have a column generically named index for the index.
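
A minimal sketch of this naming rule, using a small made-up DataFrame instead of flow_data (the values and dates are assumptions):

```python
import pandas as pd
from bokeh.models import ColumnDataSource

values = {"L06_347": [0.1, 0.2, 0.3]}
times = pd.date_range("2009-01-01", periods=3, freq="D")

# Unnamed index: the CDS column is generically called "index"
unnamed = ColumnDataSource(pd.DataFrame(values, index=times))
print(sorted(unnamed.column_names))

# Named index: the CDS column takes over the index name
named = ColumnDataSource(pd.DataFrame(values, index=times.rename("Time")))
print(sorted(named.column_names))
```

This is why the plots on flow_data can refer to `x='Time'`: the index of that DataFrame is named Time.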


In [ ]:
p = figure(x_axis_type="datetime", plot_height=300, plot_width=900)
p.line(x='Time', y='L06_347', source=source_data)
show(p)

The graph is set up by adding new elements to the figure object, e.g. adding annotations:


In [ ]:
from bokeh.models import ColumnDataSource, BoxAnnotation, Label

In [ ]:
p = figure(x_axis_type="datetime", plot_height=300, plot_width=900)
p.line(x='Time', y='L06_347', source=source_data)
p.circle(x='Time', y='L06_347', source=source_data, fill_alpha= 0.3, line_alpha=0.3)

alarm_box = BoxAnnotation(bottom=10, fill_alpha=0.3, 
                          fill_color='#ff6666')  # arbitrary value; this is NOT the real-case value
p.add_layout(alarm_box)

alarm_label = Label(text="Flood risk", x_units='screen', 
                    x= 10, y=10, text_color="#330000")
p.add_layout(alarm_label)

show(p)

Also this jointplot and this gapminder reproduction are based on Bokeh!

More Bokeh?

Plotly

plotly.py is an interactive, browser-based graphing library for Python


In [ ]:
import plotly

In recent years, plotly has been developed intensively and now provides a lot of functionality for interactive plotting, see https://plot.ly/python/#fundamentals. It consists of two main components: plotly provides all the basic components (the so-called plotly.graph_objects) to create plots, and plotly express provides a more high-level wrapper around plotly.graph_objects for rapid data exploration and figure generation. The latter focuses on tidy data representation.

As an example: create a histogram using the plotly graph_objects:


In [ ]:
import plotly.graph_objects as go

fig = go.Figure(data=[go.Histogram(x=titanic['Fare'].values)])
fig.show()

The same can be done in plotly express, which supports direct interaction with a Pandas DataFrame:


In [ ]:
import plotly.express as px

fig = px.histogram(titanic, x="Fare")
fig.show()
Notice!:
  • Prior versions of plotly.py contained functionality for creating figures in both "online" and "offline" modes. Version 4 of plotly is "offline"-only. Make sure you check the latest documentation and watch out for outdated Stack Overflow suggestions. The previous commercial/online version has been rebranded as Chart Studio.

As mentioned in the example, the interaction of plotly with Pandas is supported:

1. Indirectly, by using the plotly specific dictionary syntax:


In [ ]:
import plotly.graph_objects as go

df = flow_data[["L06_347", "LS06_348"]]

fig = go.Figure({
    "data": [{'x': df.index, 
              'y': df[col], 
              'name': col} for col in df.columns],  # remark, we use a list comprehension here ;-)
    "layout": {"title": {"text": "Streamflow data"}}
})
fig.show()

2. or using the plotly object oriented approach with graph objects:


In [ ]:
df = flow_data[["L06_347", "LS06_348"]]

fig = go.Figure()

for col in df.columns:
    fig.add_trace(go.Scatter(
                    x=df.index,
                    y=df[col],
                    name=col))
    
fig.layout=go.Layout(
        title=go.layout.Title(text="Streamflow data")
    )
fig.show()

3. or using the plotly express functionalities:


In [ ]:
df = flow_data[["L06_347", "LS06_348"]].reset_index()  # reset index, as plotly express can not use the index directly
df = df.melt(id_vars="Time")  # from wide to long format
df.head()

As mentioned, plotly express targets tidy data (cf. plotnine, ...), so we converted the data to tidy/long format before plotting:


In [ ]:
import plotly.express as px

fig = px.line(df, x='Time', y='value', color="variable", title="Streamflow data")
fig.show()

4. or by installing an additional package, cufflinks, which enables Pandas plotting with iplot instead of plot:


In [ ]:
import cufflinks as cf

df = flow_data[["L06_347", "LS06_348"]]
fig = df.iplot(kind='scatter', asFigure=True)
fig.show()

cufflinks applied on the data set of district migration:


In [ ]:
district_migration.transpose().iplot(kind='box', asFigure=True).show()
Plotly
  • Check the package gallery for plot examples.
  • Plotly express provides high level plotting functionalities and plotly graph objects the low level components.
  • More information about the cufflinks connection with Pandas is available here.

For R users...:

Both plotly and Bokeh provide interactivity (sliders,..), but are not the full equivalent of [`Rshiny`](https://shiny.rstudio.com/).
Functionality similar to Rshiny is provided by [`dash`](https://plot.ly/products/dash/), created by the same company as plotly.

You like web-development and Javascript?

Altair

Altair is a declarative statistical visualization library for Python, based on Vega-Lite.


In [ ]:
import altair as alt

Reconsider the titanic example from the start of this notebook:


In [ ]:
fig, ax = plt.subplots()
plt.style.use('ggplot')
survival_rate = titanic.groupby("Pclass")['Survived'].mean()
survival_rate.plot(kind='bar', color='grey', 
                   rot=0, figsize=(6, 4), ax=ax)
ylab = ax.set_ylabel("Survival rate")
xlab = ax.set_xlabel("Cabin class")

Translating this to Altair syntax:


In [ ]:
alt.Chart(titanic).mark_bar().encode(
    x=alt.X('Pclass:O', axis=alt.Axis(title='Cabin class')),
    y=alt.Y('mean(Survived):Q', 
            axis=alt.Axis(format='%', 
                          title='survival_rate'))
)

Similar to the aesthetics in plotnine, the influence of a variable on the plot can be expressed as an encoding:


In [ ]:
alt.Chart(titanic).mark_bar().encode(
    x=alt.X('Pclass:O', axis=alt.Axis(title='Cabin class')),
    y=alt.Y('mean(Survived):Q', 
            axis=alt.Axis(format='%', 
                          title='survival_rate')),
    column="Sex"
)

The typical ingredients of the grammar of graphics are available again:


In [ ]:
(alt.Chart(titanic)  # Link with the data
     .mark_circle().encode(  # defining a geometry
        x="Fare:Q",   # provide aesthetics by linking variables to channels
        y="Age:Q",
        column="Pclass:O",
        color="Sex:N",
))
# scales,... can be adjusted as well

For information on this ...:Q, ...:N,...:O, see the data type section of the documentation:

| Data Type | Shorthand Code | Description |
|---|---|---|
| quantitative | Q | a continuous real-valued quantity |
| ordinal | O | a discrete ordered quantity |
| nominal | N | a discrete unordered category |
| temporal | T | a time or date value |
Remember
  • Altair provides a pure-Python Grammar of Graphics implementation!
  • Altair is built on top of the Vega-Lite visualization grammar, which can be interpreted as a language to specify a graph (from data to figure).
  • Altair easily integrates with web-technology (HTML/Javascript)

Your data sets are HUGE?

When you're working with a lot of records, the visualization of the individual points does not always make sense as there are simply too many dots overlapping each other (check this notebook for a more detailed explanation).

Consider the open data set:

Bird tracking - GPS tracking of Lesser Black-backed Gulls and Herring Gulls breeding at the southern North Sea coast https://www.gbif.org/dataset/83e20573-f7dd-4852-9159-21566e1e691e with > 8e6 records

Working with such a data set on a local machine is not straightforward anymore, as this data set will consume a lot of memory to be handled by the default plotting libraries. Moreover, visualizing every single dot is not useful anymore at coarser zoom levels.

The package datashader provides a solution for this size of data sets and works together with other packages such as bokeh and holoviews.
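
The underlying idea, aggregating the points on a fixed grid instead of drawing each one, can be sketched with plain numpy. This is not how datashader is implemented, only an illustration of the principle on a hypothetical random point cloud:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical point cloud standing in for millions of GPS fixes
n_points = 1_000_000
x = rng.normal(loc=0.0, scale=1.0, size=n_points)
y = rng.normal(loc=0.0, scale=1.0, size=n_points)

# Aggregate the points on a fixed 800 x 800 grid: each cell stores a count
counts, x_edges, y_edges = np.histogram2d(x, y, bins=800)

# What gets drawn is the grid, whose size does not depend on n_points
print(counts.shape, int(counts.sum()))
```

Whatever the number of records, the image to render stays at 800 x 800 cells, which is what keeps plotting feasible for very large data sets.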

We download just a single year (e.g. 2018) of data from the gull data set and store it in the data folder. The 2018 data file has around 4.8 million records.


In [ ]:
import pandas as pd, holoviews as hv
from colorcet import fire
from datashader.geo import lnglat_to_meters
from holoviews.element.tiles import EsriImagery
from holoviews.operation.datashader import rasterize, shade

df = pd.read_csv('../data/HG_OOSTENDE-gps-2018.csv', usecols=['location-long', 'location-lat'])
df.columns = ['longitude', 'latitude']
df.loc[:,'longitude'], df.loc[:,'latitude'] = lnglat_to_meters(df.longitude, df.latitude)

In [ ]:
hv.extension('bokeh')

map_tiles  = EsriImagery().opts(alpha=1.0, width=800, height=800, bgcolor='black')
points     = hv.Points(df, ['longitude', 'latitude'])
rasterized = shade(rasterize(points, x_sampling=1, y_sampling=1, width=800, height=800), cmap=fire)

map_tiles * rasterized
When not to use datashader
  • Plotting less than 1e5 or 1e6 data points
  • When every datapoint matters; standard Bokeh will render all of them
  • For full interactivity (hover tools) with every datapoint

When to use datashader
  • Actual big data; when Bokeh/Matplotlib have trouble
  • When the distribution matters more than individual points
  • When you find yourself sampling or binning to better understand the distribution

([source](http://nbviewer.jupyter.org/github/bokeh/bokeh-notebooks/blob/master/tutorial/A2%20-%20Visualizing%20Big%20Data%20with%20Datashader.ipynb))
More alternatives for large data set visualisation that are worthwhile exploring:
  • vaex, which also provides on-the-fly binning or aggregation of the data on a grid.
  • Glumpy and Vispy, which both rely on OpenGL to achieve high performance

You want to dive deeper into Python viz?

For an overview of the status of Python visualisation packages and tools, have a look at the pyviz website.


In [ ]:
from IPython.display import Image
Image('https://raw.githubusercontent.com/rougier/python-visualization-landscape/master/landscape.png')

or check the interactive version here.