Tabular Datasets

In this guide we will explore how to work with tabular data in HoloViews. Tabular data has a fixed list of column headings, with values stored in an arbitrarily long list of rows. Spreadsheets, relational databases, CSV files, and many other typical data sources fit naturally into this format. HoloViews defines an extensible system of interfaces to load, manipulate, and visualize this kind of data, as well as allowing conversion of any of the non-tabular data types into tabular data for analysis or data interchange.

By default HoloViews will use one of these data storage formats for tabular data:

  • A pure Python dictionary containing 1D NumPy arrays for each column.

    {'x': np.array([0, 1, 2]), 'y': np.array([0, 1, 2])}

  • A purely NumPy array format for numeric data.

    np.array([[0, 0], [1, 1], [2, 3]])

  • Pandas DataFrames

    pd.DataFrame(np.array([[0, 0], [1, 1], [2, 3]]), columns=['x', 'y'])

  • Dask DataFrames

A number of additional standard constructors are supported:

  • A tuple of array (or array-like) objects

    ([0, 1, 2], [0, 1, 2])

  • A list of tuples:

    [(0, 0), (1, 1), (2, 2)]
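
The formats listed above are different layouts of the same table. As a minimal sketch in plain Python (independent of HoloViews, and using the hypothetical column names 'x' and 'y'), a tuple of columns can be rearranged into a list of row tuples or a dictionary of columns:

```python
# Plain-Python sketch: the same three-row table in three layouts.
columns = ([0, 1, 2], [0, 1, 2])           # tuple of columns
rows = list(zip(*columns))                 # list of row tuples
as_dict = dict(zip(['x', 'y'], columns))   # dictionary of columns

print(rows)
print(as_dict)
```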


In [ ]:
import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import opts

hv.extension('bokeh', 'matplotlib')

opts.defaults(opts.Scatter(size=10, padding=0.1))

A simple Dataset

Usually when working with data we have one or more independent variables, taking the form of categories, labels, discrete sample coordinates, or bins. We refer to these independent variables as key dimensions (or kdims for short) in HoloViews. The observed or dependent variables, on the other hand, are referred to as value dimensions (vdims), and are ordinarily measured or calculated given the independent variables. The simplest useful form of a Dataset object is therefore a column 'x' and a column 'y' corresponding to the key dimensions and value dimensions respectively. An obvious visual representation of this data is a Table:


In [ ]:
xs = np.linspace(0, 10, 11)
ys = np.sin(xs)

table = hv.Table((xs, ys), 'x', 'y')
table

However, this data has many more meaningful visual representations, and therefore the first important concept is that Dataset objects can be converted to other objects as long as their dimensionality allows it, meaning that you can easily create the different objects from the same data (and cast between the objects once created):


In [ ]:
(hv.Scatter(table) + hv.Curve(table) + hv.Area(table) + hv.Bars(table)).cols(2)

Each of these plots uses the same data, but represents a different assumption about the semantic meaning of that data -- the Scatter plot is appropriate if that data consists of independent samples, the Curve plot is appropriate for samples chosen from an underlying smooth function, and the Bars plot is appropriate for independent categories of data. Since all these plots have the same dimensionality, they can easily be converted to each other, but there is normally only one of these representations that is semantically appropriate for the underlying data. For this particular data, the semantically appropriate choice is Curve, since the y values are samples from the continuous function sin.

As a guide to which Elements can be converted to each other, those of the same dimensionality here should be interchangeable, because of the underlying similarity of their columnar representation:

  • 0D: BoxWhisker, Spikes, Distribution
  • 1D: Area, Bars, BoxWhisker, Curve, ErrorBars, Scatter, Spread
  • 2D: Bars, Bivariate, BoxWhisker, HeatMap, Points, VectorField
  • 3D: Scatter3D, TriSurface, VectorField

This categorization is based only on the kdims, which define the space in which the data has been sampled or defined. An Element can also have any number of value dimensions (vdims), which may be mapped onto various attributes of a plot such as the color, size, and orientation of the plotted items. For a reference of how to use these various Element types, see the Elements Reference.

Data types and Constructors

As discussed above, Dataset provides an extensible interface to store and operate on data in different formats. All interfaces support a number of standard constructors.

Storage formats

Dataset types can be constructed using one of three supported formats, (a) a dictionary of columns, (b) an NxD array with N rows and D columns, or (c) pandas dataframes:


In [ ]:
print(hv.Scatter({'x': xs, 'y': ys}) +
      hv.Scatter(np.column_stack([xs, ys])) +
      hv.Scatter(pd.DataFrame({'x': xs, 'y': ys})))

Literals

In addition to the main storage formats, Dataset Elements support construction from three Python literal formats: (a) an iterator of y-values, (b) a tuple of columns, and (c) an iterator of row tuples.


In [ ]:
print(hv.Scatter(ys) + hv.Scatter((xs, ys)) + hv.Scatter(zip(xs, ys)))

For these inputs, the data will need to be copied to a new data structure, having one of the three storage formats above. By default Dataset will try to construct a simple array, falling back to either pandas dataframes (if available) or the dictionary-based format if the data is not purely numeric. Additionally, the interfaces will try to maintain the provided data's type, so numpy arrays and pandas DataFrames will always be parsed first by their respective array and dataframe interfaces.


In [ ]:
df = pd.DataFrame({'x': xs, 'y': ys, 'z': ys*2})
print(type(hv.Scatter(df).data))

Dataset will attempt to parse the supplied data, falling back to each consecutive interface if the previous could not interpret the data. The default list of fallbacks and simultaneously the list of allowed datatypes is:


In [ ]:
hv.Dataset.datatype

Note these include grid based datatypes, which are covered in Gridded Datasets. To select a particular storage format explicitly, supply one or more allowed datatypes (note that the 'array' interface only supports data with matching types):


In [ ]:
print(type(hv.Scatter((xs.astype('float64'), ys), datatype=['array']).data))
print(type(hv.Scatter((xs, ys), datatype=['dictionary']).data))
print(type(hv.Scatter((xs, ys), datatype=['dataframe']).data))
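
The idea of falling back from a purely numeric array to a column-based format can be pictured with a simplified sketch. The helper pick_storage below is hypothetical and is not part of HoloViews; it only illustrates why mixed-type columns cannot share a single homogeneous array, and it ignores the dataframe option for brevity:

```python
import numpy as np

def pick_storage(columns):
    """Hypothetical helper, not a HoloViews function: return 'array' when
    every column is numeric, otherwise 'dictionary'."""
    kinds = {np.asarray(col).dtype.kind for col in columns}
    # 'b', 'i', 'u', 'f' are NumPy's boolean, signed, unsigned and float kinds
    return 'array' if kinds <= set('biuf') else 'dictionary'

numeric = pick_storage(([0, 1, 2], [0.0, 0.5, 1.0]))
mixed = pick_storage((['a', 'b', 'c'], [0.0, 0.5, 1.0]))
print(numeric, mixed)
```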

Sharing Data

Since the formats with labelled columns do not require any specific order, each Element can effectively become a view into a single set of data. By specifying different key and value dimensions, many Elements can show different values, while sharing the same underlying data source.


In [ ]:
overlay = hv.Scatter(df, 'x', 'y') * hv.Scatter(df, 'x', 'z')
overlay

We can quickly confirm that the data is actually shared:


In [ ]:
overlay.Scatter.I.data is overlay.Scatter.II.data

For columnar data, this approach is much more efficient than creating copies of the data for each Element, and allows for some advanced features like linked brushing in the Bokeh backend.

Converting to raw data

Column types make it easy to export the data to the three basic formats: arrays, dataframes, and a dictionary of columns.

Array

In [ ]:
table.array()

Pandas DataFrame

In [ ]:
table.dframe().head()

Dataset dictionary

In [ ]:
table.columns()

Creating tabular data from Elements using the .table and .dframe methods

If you have data in some other HoloViews element and would like to use the columnar data features, you can easily tabularize any of the core Element types into a Table Element. Similarly, the .dframe() method will convert an Element into a pandas DataFrame. These methods are very useful if you want to then transform the data into a different Element type, or to perform different types of analysis.

Tabularizing simple Elements

For a simple example, we can create a Curve of a sine function and cast it to a Table, with the same result as creating the Table directly from the data as done earlier in this user guide:


In [ ]:
xs = np.arange(10)
curve = hv.Curve(zip(xs, np.sin(xs)))
curve * hv.Scatter(curve) + hv.Table(curve)

Similarly, we can get a pandas dataframe of the Curve using curve.dframe():


In [ ]:
curve.dframe()

Tabularizing space containers

Even deeply nested objects can be deconstructed in this way, serializing them to make it easier to get your raw data out of a collection of specialized Element types. Let's say we want to make multiple observations of a noisy signal. We can collect the data into a HoloMap to visualize it and then call .collapse() to get a Dataset object to which we can apply operations or transformations to other Element types. Deconstructing nested data in this way only works if the data is homogeneous. In practical terms this requires that your data structure contains Elements (of any type) held in these Container types: NdLayout, GridSpace, HoloMap, and NdOverlay, with all dimensions consistent throughout (so that they can all fit into the same set of columns). To read more about these containers see the Dimensioned Containers guide.

Let's now look at an example using Image Elements. We will collect a number of observations of some noisy data into a HoloMap and display it:


In [ ]:
obs_hmap = hv.HoloMap({i: hv.Image(np.random.randn(10, 10), bounds=(0,0,3,3))
                       for i in range(3)}, kdims='Observation')
obs_hmap

Now we can serialize this data just as before, where this time we get a four-column (4D) table. The key dimensions of both the HoloMap and the Images, as well as the z-values of each Image, are all merged into a single table. We can visualize the samples we have collected by converting it to a Scatter3D object.


In [ ]:
hv.output(backend='matplotlib', size=150)

collapsed = obs_hmap.collapse()
scatter_layout = collapsed.to.scatter3d() + hv.Table(collapsed)
scatter_layout.opts(
    opts.Scatter3D(color='z', cmap='hot', edgecolor='black', s=50))

Here the z dimension is shown by color, as in the original images, and the other three dimensions determine where the datapoint is shown in 3D. This way of deconstructing objects will work for any data structure that satisfies the conditions described above, no matter how nested. If we vary the amount of noise while continuing to perform multiple observations, we can create an NdLayout of HoloMaps, one for each noise level, and animated by the observation number.


In [ ]:
extents = (0, 0, 3, 3)

error_hmap = hv.HoloMap({
    (i, j): hv.Image(j*np.random.randn(3, 3), bounds=extents)
    for i in range(3) for j in np.linspace(0, 1, 3)},
    ['Observation', 'noise'])

noise_layout = error_hmap.layout('noise')
noise_layout

And again, we can easily convert the object to a Table:


In [ ]:
noise_layout.table()
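
The flattening that happens when a container is tabularized can be sketched with NumPy alone. This is only an illustration of the idea: the dictionary below is a hypothetical stand-in for a HoloMap of Images, and integer grid indices are used as coordinates rather than the sample coordinates HoloViews would compute from the bounds:

```python
import numpy as np

# Hypothetical stand-in for a HoloMap of Images:
# observation index -> 2x2 grid of z-values
observations = {i: np.full((2, 2), float(i)) for i in range(3)}

# One row per grid cell: the container key, the grid coordinates, and the value
rows = [(obs, x, y, grid[y, x])
        for obs, grid in observations.items()
        for y in range(2)
        for x in range(2)]
flat = np.array(rows)
print(flat.shape)
```

Each of the three observations contributes four rows, and every row carries the outer key alongside the inner coordinates, which is why the data must be homogeneous to fit one set of columns.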

Applying operations to the data

Sorting by columns

Once data is in columnar form, it is simple to apply a variety of operations. For instance, Dataset objects can be sorted by their dimensions using the .sort() method. By default, this method will sort by the key dimensions in ascending order, but any other dimension(s) can be sorted by providing them as an argument list to the sort method. The reverse argument also allows sorting in descending order:


In [ ]:
hv.output(backend='bokeh')

bars = hv.Bars((['C', 'A', 'B', 'D'], [2, 7, 3, 4]))
(bars +
 bars.sort().relabel('sorted') +
 bars.sort(['y']).relabel('y-sorted') +
 bars.sort(reverse=True).relabel('reverse sorted')).cols(2)
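
For comparison, the same three orderings can be reproduced on the underlying columns with pandas alone. This is a rough analogy rather than a description of how .sort() is implemented, and the column names 'x' and 'y' are assumed:

```python
import pandas as pd

df = pd.DataFrame({'x': ['C', 'A', 'B', 'D'], 'y': [2, 7, 3, 4]})

by_key = df.sort_values('x')                       # like bars.sort()
by_value = df.sort_values('y')                     # like bars.sort(['y'])
reverse = df.sort_values('x', ascending=False)     # like bars.sort(reverse=True)

print(list(by_key['x']), list(by_value['x']), list(reverse['x']))
```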

Working with categorical or grouped data

Data is often grouped in various ways, and the Dataset interface provides various means to easily compare between groups and apply statistical aggregates. We'll start by generating some synthetic data with two groups along the x axis and 4 groups along the y axis.


In [ ]:
n = np.arange(1000)
xs = np.repeat(range(2), 500)
ys = n%4
zs = np.random.randn(1000)
table = hv.Table((xs, ys, zs), ['x', 'y'], 'z')
table

Since there are repeat observations of the same x- and y-values, we may want to reduce the data before we display it or else use a datatype that supports plotting distributions in this way. The BoxWhisker type allows doing exactly that:


In [ ]:
hv.BoxWhisker(table)

Aggregating/Reducing dimensions

Most types require the data to be non-duplicated before being displayed. For this purpose, HoloViews makes it easy to aggregate and reduce the data. These two operations are simple complements of each other--aggregate computes a statistic for each group in the supplied dimensions, while reduce combines all the groups except the supplied dimensions. Supplying only a function and no dimensions will simply aggregate or reduce all available key dimensions.


In [ ]:
hv.Bars(table).aggregate('x', function=np.mean) + hv.Bars(table).reduce(x=np.mean)

(A) aggregates by the x dimension, computing the mean of z for each x group, while (B) reduces the x dimension, leaving just the mean for each group along y.
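
As a rough analogy (not a description of HoloViews internals), aggregating by a set of dimensions resembles a pandas groupby on those dimensions, while reducing a dimension resembles a groupby on the key dimensions that remain. The sketch below uses a deterministic z column in place of the random values above, so the group counts and means are easy to check:

```python
import numpy as np
import pandas as pd

n = np.arange(1000)
df = pd.DataFrame({'x': np.repeat(range(2), 500),
                   'y': n % 4,
                   'z': n.astype(float)})  # deterministic stand-in for the random z

by_xy = df.groupby(['x', 'y'])['z'].mean()  # one mean per x/y group
by_x = df.groupby('x')['z'].mean()          # one mean per x group
by_y = df.groupby('y')['z'].mean()          # x combined away, one mean per y group

print(len(by_xy), len(by_x), len(by_y))
```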

Collapsing multiple Dataset Elements

When multiple observations are broken out into a HoloMap they can easily be combined using the collapse method. Here we create a number of Curves with increasingly larger y-values. By collapsing them with a function and a spreadfn we can compute the mean curve with a confidence interval. We then simply cast the collapsed Curve to a Spread and Curve Element to visualize them.


In [ ]:
hmap = hv.HoloMap({i: hv.Curve(np.arange(10)*i) for i in range(10)})
collapsed = hmap.collapse(function=np.mean, spreadfn=np.std)
hv.Spread(collapsed) * hv.Curve(collapsed) + hv.Table(collapsed)
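
The numbers behind this collapse can be sketched with NumPy alone, assuming the function and spreadfn are applied across the Curves at each x position:

```python
import numpy as np

# One row per Curve, matching np.arange(10) * i for i in range(10)
stack = np.array([np.arange(10) * i for i in range(10)])

mean_curve = stack.mean(axis=0)  # what function=np.mean computes per x position
spread = stack.std(axis=0)       # what spreadfn=np.std computes per x position

print(mean_curve[:3], spread[:3])
```

Since the multipliers 0 to 9 average to 4.5, the mean curve is 4.5 times np.arange(10), and the spread grows linearly with x in the same way.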