04. Working with Tabular Datasets

As we have already discovered, elements are simple wrappers around your data that provide a semantically meaningful representation. Tabular data (also called columnar data) is one of the most common, general, and versatile data formats, corresponding to how data is laid out in a spreadsheet. There are many different ways to put data into a tabular format, but for interactive analysis having tidy data provides flexibility and simplicity.

In this tutorial, all the information you have learned in the previous sections will finally pay off. We will discover how to facet data and use different element types to explore and visualize the data contained in a real dataset.


In [ ]:
import numpy as np
import scipy.stats as ss
import pandas as pd
import holoviews as hv
hv.extension('bokeh')
%opts Curve Scatter [tools=['hover']]

What is tabular, tidy data?


In [ ]:
macro_df = pd.read_csv('../data/macro.csv')
macro_df.head()

For tidy data, the columns of the table represent variables or dimensions and the rows represent observations.
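As a small standalone illustration (using made-up numbers, not the macro dataset), pandas' `melt` turns a "wide" table with one column per year into this tidy form, where each row is a single observation:

```python
import pandas as pd

# A "wide" table: one column per year (made-up numbers)
wide = pd.DataFrame({
    'country': ['A', 'B'],
    1980: [5.1, 7.2],
    1981: [5.3, 6.9],
})

# Melt into tidy form: one row per (country, year) observation
tidy = pd.melt(wide, id_vars=['country'], var_name='year', value_name='unem')
print(tidy)
```

After melting, 'country' and 'year' identify each observation and 'unem' holds the measured value, which is exactly the layout the macro dataset above already uses.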

Declaring dimensions

Mathematical variables can usually be described as dependent or independent. In HoloViews these correspond to value dimensions and key dimensions (respectively).

In this dataset 'country' and 'year' are independent variables or key dimensions, while the remainder are automatically inferred as value dimensions:


In [ ]:
macro = hv.Dataset(macro_df, kdims=['country', 'year'])
macro

We will also give the dimensions more sensible labels using redim.label:


In [ ]:
macro = macro.redim.label(growth='GDP Growth', unem='Unemployment', year='Year', country='Country')

Mapping dimensions to elements

Once we have a Dataset with multiple dimensions we can map these dimensions onto elements using the .to method. The method takes four main arguments:

  1. The element you want to convert to
  2. The key dimensions (or independent variables to display)
  3. The dependent variables to display
  4. The dimensions to group by

As a first simple example let's go through such a declaration:

  1. We will use a Curve
  2. Our independent variable will be the 'year'
  3. Our dependent variable will be 'unem'
  4. We will groupby the 'country'.

In [ ]:
curves = macro.to(hv.Curve, kdims='year', vdims='unem', groupby='country')
curves

Alternatively, we could group by the year and view the unemployment rate by country as Bars instead:


In [ ]:
%%opts Bars [width=600 xrotation=45]
bars = macro.sort('country').to(hv.Bars, kdims='country', vdims='unem', groupby='year')
bars

In [ ]:
# Exercise: Create a HeatMap using ``macro.to``, declaring kdims 'year' and 'country', and vdims 'growth'
# You'll need to declare ``width`` and ``xrotation`` plot options for HeatMap to make the plot readable
# You can also add ``tools=['hover']`` to get more info

In [ ]:

Displaying distributions

Often we want to summarize the distribution of values, e.g. to reveal the distribution of unemployment rates for each OECD country across time. This means we want to ignore the 'year' dimension in our dataset, letting it be summarized instead. To stop HoloViews from grouping by the extra variable, we pass an empty list to the groupby argument.


In [ ]:
%%opts BoxWhisker [width=800 xrotation=30] (box_fill_color=Palette('Category20'))
macro.to(hv.BoxWhisker, 'country', 'growth', groupby=[])

In [ ]:
# Exercise: Display the distribution of GDP growth by year using the BoxWhisker element

Faceting dimensions

In the previous section we discovered how to facet our data using the .overlay, .grid and .layout methods. Instead of the abstract FM modulation signals we used there, we now have concrete variables to group by, namely 'country' and 'year':


In [ ]:
%%opts Scatter [width=800 height=400 size_index='growth'] (color=Palette('Category20') size=5)
%%opts NdOverlay [legend_position='left']
macro.to(hv.Scatter, 'year', ['unem', 'growth']).overlay().relabel('OECD Unemployment 1960 - 1990')

In [ ]:
# Exercise: Instead of faceting using an .overlay() of Scatter elements, facet the data using a .grid() 
# of Curve or Area elements

In [ ]:


In [ ]:
# Exercise: You'll notice that you get quite a lot of countries in the grid. 
# You can try supplying a short list of countries to the `macro.select` method to get a more practical subset.
# Hint: You may want to pass the shared_yaxis=True plot option to GridSpace, to get a numeric axis

In [ ]:

Aggregating

Another common operation is computing aggregates. We can compute and visualize these easily using the aggregate method: supply the dimension(s) to aggregate by, along with a function and, optionally, a secondary spread function (spreadfn). Once we have computed the aggregate we can simply pass it to the Curve and ErrorBars:


In [ ]:
%%opts Curve [width=600]
agg = macro.aggregate('year', function=np.mean, spreadfn=np.std)
(hv.Curve(agg) * hv.ErrorBars(agg, kdims=['year'], vdims=['growth', 'growth_std']))
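For intuition, the same mean-plus-spread aggregation can be written in plain pandas. This is a sketch with made-up numbers rather than the macro data (and note that pandas' 'std' uses ddof=1, whereas np.std defaults to ddof=0):

```python
import pandas as pd

# Toy tidy data (made-up numbers)
df = pd.DataFrame({
    'year':   [1980, 1980, 1981, 1981],
    'growth': [2.0, 4.0, 1.0, 3.0],
})

# Mean per year plus a spread column, analogous to
# macro.aggregate('year', function=np.mean, spreadfn=np.std)
agg = df.groupby('year')['growth'].agg(growth='mean', growth_std='std').reset_index()
print(agg)
```

The resulting table has one row per year with 'growth' and 'growth_std' columns, matching the vdims consumed by Curve and ErrorBars above.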

In [ ]:
# Exercise: Display aggregate GDP growth by country, building it up in a series of steps
# Step 1. First, aggregate the data by country rather than by year, using
# np.mean and ss.sem as the function and spreadfn, respectively, then 
# make a `Bars` element from the resulting ``agg``

In [ ]:
agg = macro.aggregate('country', function=np.mean, spreadfn=ss.sem)
hv.Bars(agg)

In [ ]:
# Step 2: You should now have a bars plot, but with no error bars. To add the error bars,
# print the 'agg' as text to see which vdims are available (which will be different for 
# different spreadfns), then overlay ErrorBars as above but for the new kdims and
# the appropriate vdims
# Hint: You'll want to make the plot wider and use an xrotation to see the labels clearly

In [ ]:

Onward

The next section shows a similar approach, but for working with gridded data, in multidimensional array formats.