HoloViews supports even high-dimensional datasets easily, and the standard mechanisms discussed already work well as long as you select a small enough subset of the data to display at any one time. However, some datasets are just inherently large, even for a single frame of data, and cannot safely be transferred for display in any standard web browser. Luckily, HoloViews makes it simple for you to use the separate datashader library together with any of the plotting extension libraries, including Bokeh and Matplotlib. The datashader library is designed to complement standard plotting libraries by providing faithful visualizations for very large datasets, focusing on revealing the overall distribution, not just individual data points.
Datashader uses computations accelerated using Numba, making it fast to work with datasets of millions or billions of datapoints stored in Dask dataframes. Dask dataframes provide an API that is functionally equivalent to pandas, but allow working with data out of core while scaling out to many processors and even clusters. Here we will use Dask to load a large CSV file of taxi coordinates.
In [ ]:
import pandas as pd
import holoviews as hv
import dask.dataframe as dd
import datashader as ds
import geoviews as gv
from holoviews.operation.datashader import datashade, aggregate
hv.extension('bokeh')
As a first step we will load a large dataset using Dask. If you have followed the setup instructions you will have downloaded a large CSV containing 12 million taxi trips. Let's load this data using Dask to create a dataframe ddf:
In [ ]:
ddf = dd.read_csv('../data/nyc_taxi.csv', parse_dates=['tpep_pickup_datetime'])
ddf['hour'] = ddf.tpep_pickup_datetime.dt.hour
# Comment out the following line if your machine is low on RAM (<8GB); everything will be slower, but will still work
ddf = ddf.persist()
print('%s Rows' % len(ddf))
print('Columns:', list(ddf.columns))
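Note that Dask evaluates lazily: dd.read_csv and the column assignment above only build a task graph, and it is the persist() call that actually executes the graph and caches the result in memory. As a minimal sketch of this model (the filename and column name below are hypothetical placeholders, not part of this dataset):
In [ ]:
df = dd.read_csv('some_large_file.csv')  # hypothetical file; builds a lazy task graph
task = df['some_column'].mean()          # still lazy: just records the computation
result = task.compute()                  # runs the graph, streaming chunks out of core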
In previous sections we have already seen how to declare a set of Points from a pandas DataFrame. Here we do the same for a Dask dataframe, passing in the desired key dimensions:
In [ ]:
points = hv.Points(ddf, kdims=['dropoff_x', 'dropoff_y'])
We could now simply type points, and Bokeh will attempt to display this data as a standard Bokeh plot. Before doing that, however, remember that we have 12 million rows of data, and no current plotting program will handle this well! Instead of letting Bokeh see this data, let's convert it to something far more tractable using the datashade operation. This operation will aggregate the data on a 2D grid, apply shading to assign pixel colors to each bin in this grid, and build an RGB Element (just a fixed-size image) we can safely display:
In [ ]:
%opts RGB [width=600 height=500 bgcolor="black"]
datashade(points)
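For intuition, here is roughly the pipeline that datashade runs for each frame, expressed with the raw datashader API (a sketch only; the canvas size and colormap are illustrative choices, not the operation's defaults):
In [ ]:
import datashader.transfer_functions as tf
canvas = ds.Canvas(plot_width=600, plot_height=500)  # define the 2D grid of bins
agg = canvas.points(ddf, 'dropoff_x', 'dropoff_y')   # aggregate: count points per bin
tf.shade(agg, cmap=['lightblue', 'darkblue'])        # shade: map bin counts to pixel colors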
If you zoom in you will note that the plot re-renders depending on the zoom level, which allows the full dataset to be explored interactively even though only an image of it is ever sent to the browser. The way this works is that datashade is a dynamic operation that also declares some linked streams. These linked streams are automatically instantiated, and they dynamically supply the plot size, x_range, and y_range from the Bokeh plot to the operation based on your current viewport as you zoom or pan:
In [ ]:
datashade.streams
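If you do not need this dynamic behavior, you can bypass the streams and call the operation statically with explicit ranges (a sketch; the Web Mercator coordinate ranges below are illustrative values, not taken from the dataset):
In [ ]:
static_rgb = datashade(points, dynamic=False,
                       x_range=(-8250000, -8210000),  # illustrative easting range
                       y_range=(4940000, 4990000))    # illustrative northing range
static_rgb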
In [ ]:
# Exercise: Plot the taxi pickup locations ('pickup_x' and 'pickup_y' columns)
# Warning: Don't try to display hv.Points() directly; it's too big! Use datashade() for any display
# Optional: Change the cmap on the datashade operation to inferno
from datashader.colors import inferno
Using the GeoViews (geographic) extension for HoloViews, we can display a map in the background. Just declare a Bokeh WMTSTileSource and pass it to the gv.WMTS Element, which we can then overlay with the datashaded points:
In [ ]:
%opts RGB [xaxis=None yaxis=None]
import geoviews as gv
from bokeh.models import WMTSTileSource
url = 'https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{Z}/{Y}/{X}.jpg'
wmts = WMTSTileSource(url=url)
gv.WMTS(wmts) * datashade(points)
In [ ]:
# Exercise: Overlay the taxi pickup data on top of the Wikipedia tile source
wiki_url = 'https://maps.wikimedia.org/osm-intl/{Z}/{X}/{Y}@2x.png'
So far we have simply been counting taxi dropoffs, but our dataset is much richer than that. We have information about a number of variables, including the total cost of a taxi ride, the total_amount. Datashader provides a number of aggregator functions, which you can supply to the datashade operation. Here we use the ds.mean aggregator to compute the average cost of a trip at each dropoff location:
In [ ]:
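# Exclude extreme outliers by keeping only trips with total_amount below 1000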
selected = points.select(total_amount=(None, 1000))
selected.data = selected.data.persist()
gv.WMTS(wmts) * datashade(selected, aggregator=ds.mean('total_amount'))
In [ ]:
# Exercise: Use the ds.min or ds.max aggregator to visualize ``tip_amount`` by dropoff location
# Optional: Eliminate outliers by using select
The hour column we created when loading the data lets us explore the trips by time of day. By declaring the Dask dataframe as a HoloViews Dataset, we can use the .to method to group the points by hour into a DynamicMap, which the aggregate operation then renders one frame at a time:
In [ ]:
%opts Image [width=600 height=500 logz=True xaxis=None yaxis=None]
taxi_ds = hv.Dataset(ddf)
grouped = taxi_ds.to(hv.Points, ['dropoff_x', 'dropoff_y'], groupby=['hour'], dynamic=True)
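# redim.values supplies the hour values up front so the slider spans 0-23,
# even though the grouping itself is evaluated lazily (dynamic=True)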
aggregate(grouped).redim.values(hour=range(24))
In [ ]:
# Exercise: Facet the trips in the morning hours as an NdLayout using aggregate(grouped.layout())
# Hint: You can reuse the existing grouped variable or select a subset before using the .to method
The actual points are never given directly to Bokeh, and so the normal Bokeh hover (and other) tools will not normally be useful with datashader output. However, we can easily overlay an invisible QuadMesh to reveal information on hover, providing information about values in a local area while still only ever sending a fixed-size array to the browser to avoid issues with large data.
In [ ]:
%%opts QuadMesh [width=800 height=400 tools=['hover']] (alpha=0 hover_line_alpha=1 hover_fill_alpha=0)
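# Aggregate onto a coarse 40x20 grid so each QuadMesh cell summarizes a local area for hovering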
hover_info = aggregate(points, width=40, height=20, streams=[hv.streams.RangeXY]).map(hv.QuadMesh, hv.Image)
gv.WMTS(wmts) * datashade(points) * hover_info
As you can see, datashader requires a few extra steps, but it makes it practical to work with even quite large datasets on an ordinary laptop. On a 16GB machine, datasets 10X or 100X the size of the one used here should be very practical, as illustrated on the datashader web site.
The final section of this tutorial will show how to bring all of the concepts covered so far into an interactive web app for working with large or small datasets.