Working with large data using Datashader

The various plotting-library backends supported by HoloViews, such as Matplotlib and Bokeh, each have a variety of limitations on the amount of data that is practical to work with. Bokeh in particular mirrors your data directly into an HTML page viewable in your browser, which can cause problems when data sizes approach the limited memory available for each web page in current browsers.

Luckily, a visualization of even the largest dataset will be constrained by the resolution of your display device, and so one approach to handling such data is to pre-render or rasterize the data into a fixed-size array or image before sending it to the backend. The Datashader library provides a high-performance big-data server-side rasterization pipeline that works seamlessly with HoloViews to support datasets that are orders of magnitude larger than those supported natively by the plotting-library backends, including millions or billions of points even on ordinary laptops.

Here, we will see how and when to use Datashader with HoloViews Elements and Containers. For simplicity in this discussion we'll focus on simple synthetic datasets, but the Datashader docs include a wide variety of real datasets that give a much better idea of the power of using Datashader with HoloViews, and PyViz.org shows how to install and work with HoloViews and Datashader together.


In [ ]:
import numpy as np
import holoviews as hv
from holoviews import opts
import datashader as ds
from holoviews.operation.datashader import datashade, shade, dynspread, rasterize
from holoviews.operation import decimate
from holoviews import opts

hv.extension('bokeh','matplotlib')

decimate.max_samples=1000
dynspread.max_px=20
dynspread.threshold=0.5

def random_walk(n, f=5000):
    """Random walk in a 2D space, smoothed with a filter of length f"""
    xs = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
    ys = np.convolve(np.random.normal(0, 0.1, size=n), np.ones(f)/f).cumsum()
    xs += 0.1*np.sin(0.1*np.array(range(n-1+f))) # add wobble on x axis
    xs += np.random.normal(0, 0.005, size=n-1+f) # add measurement noise
    ys += np.random.normal(0, 0.005, size=n-1+f)
    return np.column_stack([xs, ys])

def random_cov():
    """Random covariance for use in generating 2D Gaussian distributions"""
    A = np.random.randn(2,2)
    return np.dot(A, A.T)

def time_series(T = 1, N = 100, mu = 0.1, sigma = 0.1, S0 = 20):  
    """Parameterized noisy time series"""
    dt = float(T)/N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size = N) 
    W = np.cumsum(W)*np.sqrt(dt) # standard brownian motion
    X = (mu-0.5*sigma**2)*t + sigma*W 
    S = S0*np.exp(X) # geometric brownian motion
    return S

Principles of datashading

Because HoloViews elements are fundamentally data containers, not visualizations, you can very quickly declare elements such as Points or Path containing datasets that may be as large as the full memory available on your machine (or even larger if using Dask dataframes). So even for very large datasets, you can easily specify a data structure that you can work with for making selections, sampling, aggregations, and so on. However, as soon as you try to visualize it directly with either the matplotlib or bokeh plotting extensions, the rendering process may be prohibitively expensive.

Let's start with a simple example we can visualize as normal:


In [ ]:
np.random.seed(1)
points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000,)),label="Points")
paths = hv.Path([random_walk(2000,30)], label="Paths")

points + paths

These browser-based plots are fully interactive, as you can see if you select the Wheel Zoom or Box Zoom tools and use your scroll wheel or click and drag.

Because all of the data in these plots gets transferred directly into the web browser, the interactive functionality will be available even on a static export of this figure as a web page. Note that even though the visualization above is not computationally expensive, even with just 1000 points as in the scatterplot above, the plot already suffers from overplotting, with later points obscuring previously plotted points.

With much larger datasets, these issues will quickly make it impossible to see the true structure of the data. We can easily declare 50X or 1000X larger versions of the same plots above, but if we tried to visualize them they would be nearly unusable even if the browser did not crash:


In [ ]:
np.random.seed(1)
points = hv.Points(np.random.multivariate_normal((0,0), [[0.1, 0.1], [0.1, 1.0]], (1000000,)),label="Points")
paths = hv.Path([0.15*random_walk(100000) for i in range(10)],label="Paths")

#points + paths  ## Danger! Browsers can't handle 1 million points!

Luckily, HoloViews Elements are just containers for data and associated metadata, not plots, so HoloViews can generate entirely different types of visualizations from the same data structure when appropriate. For instance, in the plot on the left below you can see the result of applying a decimate() operation acting on the points object, which will automatically downsample this million-point dataset to at most 1000 points at any time as you zoom in or out:


In [ ]:
decimate(points) + datashade(points) + datashade(paths)

Decimating a plot in this way can be useful, but it discards most of the data, yet still suffers from overplotting. If you have Datashader installed, you can instead use the datashade() operation to create a dynamic Datashader-based Bokeh plot. The middle plot above shows the result of using datashade() to create a dynamic Datashader-based plot out of an Element with arbitrarily large data. In the Datashader version, a new image is regenerated automatically on every zoom or pan event, revealing all the data available at that zoom level and avoiding issues with overplotting by dynamically rescaling the colors used. The same process is used for the line-based data in the Paths plot.

These two Datashader-based plots are similar to the native Bokeh plots above, but instead of making a static Bokeh plot that embeds points or line segments directly into the browser, HoloViews sets up a Bokeh plot with dynamic callbacks that render the data as an RGB image using Datashader instead. The dynamic re-rendering provides an interactive user experience even though the data itself is never provided directly to the browser. Of course, because the full data is not in the browser, a static export of this page (e.g. on holoviews.org or on anaconda.org) will only show the initially rendered version, and will not update with new images when zooming as it will when there is a live Python process available.

Though you can no longer have a completely interactive exported file, with the Datashader version on a live server you can now change the number of data points from 1000000 to 10000000 or more to see how well your machine will handle larger datasets. It will get a bit slower, but if you have enough memory, it should still be very usable, and should never crash your browser as transferring the whole dataset into your browser would. If you don't have enough memory, you can instead set up a Dask dataframe as shown in other Datashader examples, which will provide out-of-core and/or distributed processing to handle even the largest datasets.

The datashade() operation is actually a "macro" or shortcut that combines the two main computations done by datashader, namely shade() and rasterize():


In [ ]:
rasterize(points).hist() + shade(rasterize(points)) + datashade(points)

In all three of the above plots, rasterize() is being called to aggregate the data (a large set of x,y locations) into a rectangular grid, with each grid cell counting up the number of points that fall into it. In the plot on the left, only rasterize() is done, and the resulting numeric array of counts is passed to Bokeh for colormapping. Bokeh can then use dynamic (client-side, browser-based) operations in JavaScript, allowing users to have dynamic control over even static HTML plots. For instance, in this case, users can use the Box Select tool and select a range of the histogram shown, dynamically remapping the colors used in the plot to cover the selected range.

The other two plots should be identical. In both cases, the numerical array output of rasterize() is mapped into RGB colors by Datashader itself, in Python ("server-side"), which allows special Datashader computations like the histogram-equalization in the above plots and the "spreading" discussed below. The shade() and datashade() operations accept a cmap argument that lets you control the colormap used, which can be selected to match the HoloViews/Bokeh cmap option but is strictly independent of it. See hv.help(rasterize), hv.help(shade), and hv.help(datashade) for options that can be selected, and the Datashader web site for all the details. You can also try the lower-level hv.aggregate() (for points and lines) and `hv.regrid() (for image/raster data) operations, which may provide more control.

Spreading

The Datashader examples above treat points and lines as infinitesimal in width, such that a given point or small bit of line segment appears in at most one pixel. This approach ensures that the overall distribution of the points will be mathematically well founded -- each pixel will scale in value directly by the number of points that fall into it, or by the lines that cross it.

However, many monitors are sufficiently high resolution that the resulting point or line can be difficult to see---a single pixel may not actually be visible on its own, and its color may likely be very difficult to make out. To compensate for this, HoloViews provides access to Datashader's image-based "spreading", which makes isolated pixels "spread" into adjacent ones for visibility. There are two varieties of spreading supported:

  1. spread: fixed spreading of a certain number of pixels, which is useful if you want to be sure how much spreading is done regardless of the properties of the data.
  2. dynspread: spreads up to a maximum size as long as it does not exceed a specified fraction of adjacency between pixels.

Dynamic spreading is typically more useful, because it adjusts depending on how close the datapoints are to each other on screen. Both types of spreading require Datashader to do the colormapping (applying shade), because they operate on RGB pixels, not data arrays.

You can compare the results in the two plots below after zooming in:


In [ ]:
datashade(points) + dynspread(datashade(points))

Both plots show the same data, and look identical when zoomed out, but when zoomed in enough you should be able to see the individual data points on the right while the ones on the left are barely visible. The dynspread parameters typically need some hand tuning, as the only purpose of such spreading is to make things visible on a particular monitor for a particular observer; the underlying mathematical operations in Datashader do not normally need parameters to be adjusted.

The same operation works similarly for line segments:


In [ ]:
datashade(paths) + dynspread(datashade(paths))

Multidimensional plots

The above plots show two dimensions of data plotted along x and y, but Datashader operations can be used with additional dimensions as well. For instance, an extra dimension (here called k), can be treated as a category label and used to colorize the points or lines. Compared to a standard scatterplot that would suffer from overplotting, here the result will be merged mathematically by Datashader, completely avoiding any overplotting issues except local ones due to spreading:


In [ ]:
np.random.seed(3)
kdims=['d1','d2']
num_ks=8

def rand_gauss2d():
    return 100*np.random.multivariate_normal(np.random.randn(2), random_cov(), (100000,))

gaussians = {i: hv.Points(rand_gauss2d(), kdims) for i in range(num_ks)}
lines = {i: hv.Curve(time_series(N=10000, S0=200+np.random.rand())) for i in range(num_ks)}

gaussspread = dynspread(datashade(hv.NdOverlay(gaussians, kdims='k'), aggregator=ds.count_cat('k')))
linespread  = dynspread(datashade(hv.NdOverlay(lines,     kdims='k'), aggregator=ds.count_cat('k')))

(gaussspread + linespread).opts(opts.RGB(width=400))

Because Bokeh only ever sees an image, providing legends and keys has to be done separately, though we are working to make this process more seamless. For now, you can show a legend by adding a suitable collection of labeled points:


In [ ]:
# definition copied here to ensure independent pan/zoom state for each dynamic plot
gaussspread = dynspread(datashade(hv.NdOverlay(gaussians, kdims=['k']), aggregator=ds.count_cat('k')))

from datashader.colors import Sets1to3 # default datashade() and shade() color cycle
color_key = list(enumerate(Sets1to3[0:num_ks]))
color_points = hv.NdOverlay({k: hv.Points([0,0], label=str(k)).opts(color=v) for k, v in color_key})

(color_points * gaussspread).opts(width=600)

Here the dummy points are at 0,0 for this dataset, but would need to be at another suitable value for data that is in a different range.

Working with time series

HoloViews also makes it possible to datashade large timeseries using the datashade and rasterize operations:


In [ ]:
import pandas as pd

dates = pd.date_range(start="2014-01-01", end="2016-01-01", freq='1D') # or '1min'
curve = hv.Curve((dates, time_series(N=len(dates), sigma = 1)))
datashade(curve, cmap=["blue"]).opts(width=800)

HoloViews also supplies some operations that are useful in combination with Datashader timeseries. For instance, you can compute a rolling mean of the results and then show a subset of outlier points, which will then support hover, selection, and other interactive Bokeh features:


In [ ]:
from holoviews.operation.timeseries import rolling, rolling_outlier_std
smoothed = rolling(curve, rolling_window=50)
outliers = rolling_outlier_std(curve, rolling_window=50, sigma=2)

ds_curve = datashade(curve, cmap=["blue"])
spread = dynspread(datashade(smoothed, cmap=["red"]),max_px=1) 

(ds_curve * spread * outliers).opts(
    opts.Scatter(line_color="black", fill_color="red", size=10, tools=['hover', 'box_select'], width=800))

Note that the above plot will look blocky in a static export (such as on anaconda.org), because the exported version is generated without taking the size of the actual plot (using default height and width for Datashader) into account, whereas the live notebook automatically regenerates the plot to match the visible area on the page. The result of all these operations can be laid out, overlaid, selected, and sampled just like any other HoloViews element, letting you work naturally with even very large datasets.

Hover info

As you can see in the examples above, converting the data to an image using Datashader makes it feasible to work with even very large datasets interactively. One unfortunate side effect is that the original datapoints and line segments can no longer be used to support "tooltips" or "hover" information directly for RGB images generated with datashade; that data simply is not present at the browser level, and so the browser cannot unambiguously report information about any specific datapoint. Luckily, you can still provide hover information that reports properties of a subset of the data in a separate layer (as above), or you can provide information for a spatial region of the plot rather than for specific datapoints. For instance, in some small rectangle you can provide statistics such as the mean, count, standard deviation, etc:


In [ ]:
from holoviews.streams import RangeXY

fixed_hover = (datashade(points, width=400, height=400) *  
               hv.QuadMesh(rasterize(points, width=10, height=10, dynamic=False)))

dynamic_hover = (datashade(points, width=400, height=400) * 
                 hv.util.Dynamic(rasterize(points, width=10, height=10, streams=[RangeXY]), operation=hv.QuadMesh))

(fixed_hover + dynamic_hover).opts(opts.QuadMesh(tools=['hover'], alpha=0, hover_alpha=0.2))

In the above examples, the plot on the left provides hover information at a fixed spatial scale, while the one on the right reports on an area that scales with the zoom level so that arbitrarily small regions of data space can be examined, which is generally more useful (but requires a live Python server). Note that you can activate the hover tool for Image elements output by the rasterize operation.

Element types supported for Datashading

Fundamentally, what datashader does is to rasterize data, i.e., render a representation of it into a regularly gridded rectangular portion of a two-dimensional plane. Datashader natively supports four basic types of rasterization:

  • points: zero-dimensional objects aggregated by position alone, each point covering zero area in the plane and thus falling into exactly one grid cell of the resulting array (if the point is within the bounds being aggregated).
  • line: polyline/multiline objects (connected series of line segments), with each segment having a fixed length but zero width and crossing each grid cell at most once.
  • trimesh: irregularly spaced triangular grid, with each triangle covering a portion of the 2D plane and thus potentially crossing multiple grid cells (thus requiring interpolation/upsampling). Depending on the zoom level, a single pixel can also include multiple triangles, which then becomes similar to the points case (requiring aggregation/downsampling of all triangles covered by the pixel).
  • raster: an axis-aligned regularly gridded two-dimensional subregion of the plane, with each grid cell in the source data covering more than one grid cell in the output grid (requiring interpolation/upsampling), or with each grid cell in the output grid including contributions from more than one grid cell in the input grid (requiring aggregation/downsampling).

Datashader focuses on implementing those four cases very efficiently, and HoloViews in turn can use them to render a very large range of specific types of data:

Supported Elements

Other HoloViews elements could be supported, but do not currently have a useful datashaded representation:

Elements not yet supported

There are also other Elements that are not expected to be useful with datashader because they are isolated annotations, are already summaries or aggregations of other data, have graphical representations that are only meaningful at a certain size, or are text based:

Not useful to support

Examples of each supported Element type:


In [ ]:
hv.output(backend='matplotlib')

opts.defaults(
    opts.Image(aspect=1, axiswise=True, xaxis='bare', yaxis='bare'),
    opts.RGB(aspect=1, axiswise=True, xaxis='bare', yaxis='bare'),
    opts.Layout(vspace=0.1, hspace=0.1, sublabel_format=""))

np.random.seed(12)
N=100
pts = [(10*i/N, np.sin(10*i/N)) for i in range(N)]

x = y = np.linspace(0, 5, int(np.sqrt(N)))
xs,ys = np.meshgrid(x,y)
z = np.sin(xs)*np.cos(ys)

r = 0.5*np.sin(0.1*xs**2+0.05*ys**2)+0.5
g = 0.5*np.sin(0.02*xs**2+0.2*ys**2)+0.5
b = 0.5*np.sin(0.02*xs**2+0.02*ys**2)+0.5

opts2 = dict(filled=True, edge_color='z')
tri = hv.TriMesh.from_vertices(hv.Points(np.random.randn(N,3), vdims='z')).opts(**opts2)
(tri + tri.edgepaths + datashade(tri, aggregator=ds.mean('z')) + datashade(tri.edgepaths)).cols(2)

shadeable  = [elemtype(pts) for elemtype in [hv.Curve, hv.Scatter, hv.Points]]
shadeable += [hv.Path([pts])]
shadeable += [hv.Image((x,y,z)), hv.QuadMesh((x,y,z))]
shadeable += [hv.Graph(((np.zeros(N), np.arange(N)),))]
shadeable += [tri.edgepaths]
shadeable += [tri]
shadeable += [hv.operation.contours(hv.Image((x,y,z)), levels=10)]

rasterizable = [hv.RGB(np.dstack([r,g,b])), hv.HSV(np.dstack([g,b,r]))]

hv.Layout([dynspread(datashade(e.relabel(e.__class__.name))) for e in shadeable] + 
          [          rasterize(e.relabel(e.__class__.name))  for e in rasterizable]).cols(6)

Here we called datashade() on each Element type, letting Datashader do the full process of rasterization and shading, except that for RGB and HSV we only called rasterize() or else the results would have been converted into a monochrome image.

For comparison, you can see the corresponding non-datashaded plots (as long as you leave N lower than 10000 unless you have a long time to wait!):


In [ ]:
rgb_opts = opts.RGB(aspect=1, axiswise=True, xaxis='bare', yaxis='bare')
hv.Layout([e.relabel(e.__class__.name).opts(rgb_opts) for e in shadeable + rasterizable]).cols(6)

These two examples use Matplotlib, but if they were switched to Bokeh and you had a live server, they would support dynamic re-rendering on zoom and pan so that you could explore the full range of data available (e.g. even very large raster images, networks, paths, point clouds, or meshes).

Container types supported for datashading

In the above examples datashade() was called directly on each Element, but it can also be called on Containers, in which case each Element in the Container will be datashaded separately (for all Container types other than a Layout):


In [ ]:
hv.output(dpi=80, size=100)

curves = {'+':hv.Curve(pts), '-':hv.Curve([(x, -1.0*y) for x, y in pts])}

supported = [hv.HoloMap(curves,'sign'), hv.Overlay(list(curves.values())), hv.NdOverlay(curves), hv.GridSpace(hv.NdOverlay(curves))]
hv.Layout([datashade(e.relabel(e.__class__.name)) for e in supported]).cols(4)

In [ ]:
dynspread(datashade(hv.NdLayout(curves,'sign')))

Optimizing performance

Datashader and HoloViews have different design principles that are worth keeping in mind when using them in combination, if you want to ensure good overall performance. By design, Datashader supports only a small number of operations and datatypes, focusing only on what can be implemented very efficiently. HoloViews instead focuses on supporting the typical workflows of Python users, recognizing that the most computationally efficient choice is only going to be faster overall if it also minimizes the time users have to spend getting things working.

HoloViews thus helps you get something working quickly, but once it is working and you realize that you need to do this often or that it comes up against the limits of your computing hardware, you can consider whether you can get much better performance by considering the following issues and suggestions.

Use a Datashader-supported data structure

HoloViews helpfully tries to convert whatever data you have provided into what Datashader supports, which is good for optimizing your time to an initial solution, but will not always be the fastest approach computationally. If you ensure that you store your data in a format that Datashader understands already, HoloViews can simply pass it down to Datashader without copying or transforming it:

  1. For point, line, and trimesh data, Datashader supports Dask and Pandas dataframes, and so those two data sources will be fastest. Of those two, Dask Dataframes will usually be somewhat faster and also able to make use of distributed computational resources and out-of-core processing.
  2. For rasters, Datashader supports xarray objects, and so if your data is provided as an xarray plotting will be faster.

See the Datashader docs for examples of dealing with even quite large datasets (in the billions of points) on commodity hardware, including many HoloViews-based examples.

Cache initial processing with precompute=True

In the typical case of having datasets much larger than the plot resolution, HoloViews Datashader-based operations that work on the full dataset (rasterize, aggregate,regrid) are computationally expensive; the others are not (shade, spread, dynspread, etc.)

The expensive operations are all of type ResamplingOperation, which has a parameter precompute (see hv.help(hv.operation.datashader.rasterize), etc.) Precompute can be used to get faster performance in interactive usage by caching the last set of data used in plotting (after any transformations needed) and reusing it when it is requested again. precompute is False by default, because it requires using memory to store the cached data, but if you have enough memory, you can enable it so that repeated interactions (such as zooming and panning) will be much faster than the first one. In practice, most Datashader-plots don't need to do extensive precomputing, but enabling it for TriMesh plots (or anything based on TriMesh, such as QuadMesh) can greatly speed up interactive usage.

Project data only once

If you are working with geographic data using GeoViews that needs to be projected before display and/or before datashading, GeoViews will have to do this every time you update a plot, which can drown out the performance improvement you get by using Datashader. GeoViews allows you to project the entire dataset at once using gv.operation.project, and once you do this you should be able to use Datashader at full speed.

If you follow these suggestions, the combination of HoloViews and Datashader will allow you to work uniformly with data covering a huge range of sizes. Per session or per plot, you can trade off the ability to export user-manipulable plots against file size and browser compatibility, and allowing you to render even the largest dataset faithfully. HoloViews makes the full power of Datashader available in just a few lines of code, giving you a natural way to work with your data regardless of its size.