Exporting and Archiving

Most of the other user guides show you how to use HoloViews for interactive, exploratory visualization of your data, while the Applying Customization user guide shows how to use HoloViews completely non-interactively, generating and rendering images directly to disk using hv.save. In this notebook, we show how HoloViews works together with the Jupyter Notebook to establish a fully interactive yet also fully reproducible scientific or engineering workflow for generating reports or publications. That is, as you interactively explore your data and build visualizations in the notebook, you can automatically generate and export them as figures that will feed directly into your papers or web pages, along with records of how those figures were generated, and even store the actual data involved so that it can be re-analyzed later.


In [ ]:
import holoviews as hv
from holoviews import opts
from holoviews.operation import contours
hv.extension('matplotlib')

Exporting specific files

During interactive exploration in the Jupyter Notebook, your results are always visible within the notebook itself, but you can explicitly request that any visualization is also exported to an external file on disk:


In [ ]:
penguins = hv.RGB.load_image('../assets/penguins.png')
hv.save(penguins, 'penguin_plot', fmt='png')
penguins

This mechanism can be used to provide a clear link between the steps for generating the figure and the file on disk. You can now load the exported PNG image back into HoloViews, if you like, using hv.RGB.load_image, although the result would be a bit confusing due to the nested axes.

The fmt='png' part of the hv.save function call above specified that the file should be saved in PNG format, which is useful for posting on web pages or editing in raster-based graphics programs. Note that hv.save also accepts HoloMaps, which can be saved to formats such as 'scrubber', 'widgets' or even 'gif' or 'mp4' (if the necessary matplotlib dependencies are available).

If the file extension is part of the filename, that will automatically be used to set the format. Conversely, if the format is explicitly specified, then the extension does not have to be part of the filename (and any filename extension that is provided will be ignored). Sometimes the two pieces of information are independent: for instance, a filename ending in .html can support either the 'widgets' or 'scrubber' formats.
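
The precedence rule described above can be sketched in plain Python. The helper resolve_format below is hypothetical and only mirrors the behavior stated in the text (an explicit format wins, otherwise the extension is used); it is not HoloViews' actual implementation.

```python
import os

def resolve_format(filename, fmt=None):
    """Hypothetical helper: return (basename, format) following the rule above."""
    base, ext = os.path.splitext(filename)
    if fmt is not None:
        # An explicit format wins; any extension in the filename is ignored.
        return base, fmt
    if ext:
        # No explicit format: infer it from the filename extension.
        return base, ext.lstrip('.')
    raise ValueError("No format given and no extension to infer it from")

print(resolve_format('penguin_plot.png'))            # ('penguin_plot', 'png')
print(resolve_format('penguin_plot', fmt='svg'))     # ('penguin_plot', 'svg')
print(resolve_format('movie.html', fmt='scrubber'))  # ('movie', 'scrubber')
```

The last call shows the case where the two pieces of information are independent: the .html extension alone does not say whether 'widgets' or 'scrubber' is wanted, so the format is given explicitly.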

For a publication, you will usually want to select SVG format because this vector format preserves the full resolution of all text and drawing elements. SVG files can be used directly in some document preparation programs (e.g. LibreOffice), and can easily be converted and manipulated in vector graphics editors such as Inkscape.

Exporting notebooks

The hv.save function is useful when you want specific plots saved into specific files. Often, however, a notebook will contain an entire suite of results contained in multiple different cells, and manually specifying these cells and their filenames is error-prone, with a high likelihood of accidentally creating multiple files with the same name or using different names in different notebooks for the same objects.

To make the exporting process easier for large numbers of outputs, as well as more predictable, HoloViews also offers a powerful automatic notebook exporting facility, creating an archive of all your results. Automatic export is very useful in the common case of having a notebook that contains a series of figures to be used in a report or publication, particularly if you are repeatedly re-running the notebook as you finalize your results, and want the full set of current outputs to be available to an external document preparation system.

The advantage of using this archival system over simply converting the notebook to a static HTML file with nbconvert is that you can generate a collection of individual file assets in one or more desired file formats.

To turn on automatic adding of your files to the export archive, run hv.archive.auto():


In [ ]:
hv.archive.auto()

This object's behavior can be customized extensively; try pressing tab within the parentheses for a list of options, which are described more fully below.

By default, the output will go into a directory with the same name as your notebook, and the names for each object will be generated from the groups and labels used by HoloViews. Objects that contain HoloMaps are not exported by default, since those are usually rendered as animations that are not suitable for inclusion in publications, but you can change it to .auto(holomap='gif') if you want those as well.

Adding files to an archive

To see how the auto-exporting works, let's define a few HoloViews objects:


In [ ]:
penguins[:,:,'R'].relabel("Red") + penguins[:,:,'G'].relabel("Green") + penguins[:,:,'B'].relabel("Blue")

In [ ]:
penguins * hv.Arrow(0.15, 0.3, 'Penguin', '>')

In [ ]:
cs = contours(penguins[:,:,'R'], levels=[0.10,0.80])
overlay = penguins[:, :, 'R'] * cs
overlay.opts(
    opts.Contours(linewidth=1.3, cmap='Autumn'),
    opts.Image(cmap="gray"))

We can now list what has been captured, along with the names that have been generated:


In [ ]:
hv.archive.contents()

Here each object has resulted in two files, one in SVG format and one in Python "pickle" format (which appears as a zip file with extension .hvz in the listing). We'll ignore the pickle files for now, focusing on the SVG images.

The name generation code for these files is heavily customizable, but by default it consists of a list of dimension values and objects:

{dimension},{dimension},...{group}-{label},{group}-{label},....

The {dimension} shows what dimension values are included anywhere in this object, if it contains any high-level Dimensioned objects like HoloMap, NdOverlay, and GridSpace. Of course, nearly all HoloViews objects have dimensions, such as x and y in this case, but those dimensions are not used in the filenames because they are explicitly shown in the plots; only the top-level dimensions are used (those that determine which plot this is, not those that are shown in the plot itself).

The {group}-{label} information lists the names HoloViews uses for default titles and for attribute access for the various objects that make up a given displayed object. E.g. the first SVG image in the list is a Layout of the three given Image objects, and the second one is an Overlay of an RGB object and an Arrow object. This information usually helps distinguish one plot from another, because they will typically be plots of objects that have different labels.

If the generated names are not unique, a numerical suffix will be added to make them unique. A maximum filename length is enforced, which can be set with hv.archive.max_filename=num.
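
As a rough illustration of the two rules above (a length limit plus a numerical suffix for duplicates), here is a minimal sketch. The helper unique_name, the "-N" suffix style, and the default limit of 100 are all assumptions made for this example and may differ from what HoloViews does internally.

```python
def unique_name(name, existing, max_len=100):
    """Hypothetical helper: truncate `name`, then add a numeric suffix if it is taken."""
    name = name[:max_len]          # enforce the maximum filename length first
    if name not in existing:
        return name
    counter = 1
    while f"{name}-{counter}" in existing:
        counter += 1               # keep counting until an unused name is found
    return f"{name}-{counter}"

taken = set()
for label in ["Image-Red", "Image-Red", "Image-Red"]:
    name = unique_name(label, taken)
    taken.add(name)
    print(name)
# Image-Red
# Image-Red-1
# Image-Red-2
```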

If you prefer a fixed-width filename, you can use a hash for each name instead (or in addition), where :.8 specifies how many characters to keep from the hash:


In [ ]:
hv.archive.filename_formatter="{SHA:.8}"
cs

In [ ]:
hv.archive.contents()

You can see that the newest files added have the shorter, fixed-width format, though the names are no longer meaningful. If the filename_formatter had been set from the start, all filenames would have been of this type, which has both practical advantages (short names, all the same length) and disadvantages (no semantic clue about the contents).
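
The :.8 part of the formatter is ordinary Python string formatting: a precision applied to a string keeps only its first characters. The sketch below assumes that the SHA field is a hexadecimal digest of the file contents; SHA-256 and the sample bytes are used purely for illustration.

```python
import hashlib

data = b"<svg>example figure contents</svg>"   # stand-in for exported file contents
digest = hashlib.sha256(data).hexdigest()      # 64 hexadecimal characters

# The ":.8" format spec keeps only the first 8 characters of the string.
short = "{SHA:.8}".format(SHA=digest)

print(len(digest), len(short))   # 64 8
print(digest.startswith(short))  # True
```

A shorter prefix gives shorter filenames but raises the chance that two different files share the same name, so the number after the dot is a trade-off.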

Generated indexes

In addition to the files that were added to the archive for each of the cell outputs above, the archive exporter will also add an index.html file with a static copy of the notebook, with each cell labelled with the filename used to save it once hv.archive.export() is called (you can verify this for yourself after this call is executed below). This HTML file acts as a definitive index to your results, showing how they were generated and where they were exported on disk.

The exporter will also add a cleared, runnable copy of the notebook index.ipynb (with output deleted), so that you can later regenerate all of the output, with changes if necessary.

The exported archive will thus be a complete set of your results, along with a record of how they were generated, plus a recipe for regenerating them -- i.e., fully reproducible research! This HTML file and .ipynb file can then be submitted as supplemental materials for a paper, allowing any reader to build on your results, or they can just be kept privately so that future collaborators can start where this research left off.

Adding your own data to the archive

Of course, your results may depend on a lot of external packages, libraries, code files, and so on, which will not automatically be included or listed in the exported archive.

Luckily, the archive support is very general, and you can add any object to it that you want to be exported along with your output. For instance, you can store arbitrary metadata of your choosing, such as version control information, here as a JSON-format text file:


In [ ]:
import json
hv.archive.add(filename='metadata.json', 
               data=json.dumps({'repository':'git@github.com:ioam/holoviews.git',
                                'commit':'437e8d69'}), info={'mime_type':'text/json'})

The new file can now be seen in the contents listing:


In [ ]:
hv.archive.contents()

You can get a more direct list of filenames using the listing method:


In [ ]:
listing = hv.archive.listing()
listing

In this way, you should be able to automatically generate output files, with customizable filenames, storing any data or metadata you like along with them so that you can keep track of all the important information for reproducing these results later.
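
The metadata in the example above was typed by hand. As a minimal sketch, the same kind of JSON string could be assembled automatically. The helpers git_commit and build_metadata are hypothetical names introduced here, and the sketch assumes the notebook lives in a git checkout; when git is unavailable, the commit is simply recorded as null.

```python
import json
import platform
import subprocess
import sys

def git_commit():
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        result = subprocess.run(['git', 'rev-parse', 'HEAD'],
                                capture_output=True, text=True, check=False)
    except OSError:
        return None  # git is not installed or not on the PATH
    return result.stdout.strip() if result.returncode == 0 else None

def build_metadata():
    """Collect a few environment details as a JSON-format string."""
    return json.dumps({'commit': git_commit(),
                       'python': sys.version.split()[0],
                       'platform': platform.platform()})

metadata = build_metadata()
print(metadata)
```

The resulting string could then be passed as the data argument of hv.archive.add, in the same way as the hand-written dictionary in the cell above.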

Controlling the behavior of hv.archive

The hv.archive object provides numerous parameters that can be changed. You can e.g.:

  • output the whole directory to a single compressed ZIP or tar archive file (e.g. hv.archive.set_param(pack=True, archive_format='zip') or archive_format='tar')

  • generate a new directory or archive every time the notebook is run (hv.archive.unique_name=True); otherwise the old output directory is erased each time

  • choose your own name for the output directory or archive (e.g. hv.archive.export_name="{timestamp}")

  • change the format of the optional timestamp (e.g. to retain snapshots hourly, hv.archive.set_param(export_name="{timestamp}", timestamp_format="%Y_%m_%d-%H"))

  • select PNG output, at a specified rendering resolution: hv.archive.exporters=[hv.renderer('bokeh').instance(size=50)]

These options and any others listed above can all be set in the hv.archive.auto() call at the start, for convenience and to ensure that they apply to all of the files that are added.
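
To see why the timestamp format shown in the list above retains hourly snapshots, the sketch below expands it with Python's standard strftime codes. It assumes that the {timestamp} field is filled in with a strftime-style formatted time; the dates are fixed example values.

```python
from datetime import datetime

timestamp_format = "%Y_%m_%d-%H"   # year, month, day and hour only
export_name = "{timestamp}"

def expand(moment):
    """Expand the export name for a given point in time."""
    return export_name.format(timestamp=moment.strftime(timestamp_format))

first = expand(datetime(2024, 3, 5, 14, 30))
second = expand(datetime(2024, 3, 5, 14, 55))   # same hour as `first`
third = expand(datetime(2024, 3, 5, 15, 5))     # next hour

print(first)            # 2024_03_05-14
print(first == second)  # True
print(first == third)   # False
```

Because minutes and seconds are not part of the format, every export within the same hour maps to the same name, while an export in the next hour gets a new one.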

Writing the archive to disk

To actually write the files you have stored in the archive to disk, you need to call export() after any cell that might contain computation-intensive code. Usually it's best to do so as the last or nearly last cell in your notebook, though here we do it earlier because we wanted to show how to use the exported files.


In [ ]:
hv.archive.export()

Shortly after the export() command has been executed, the output should be available as a directory on disk, by default in the same directory as the notebook file, named with the name of the notebook:


In [ ]:
import os
# List the exported files, if the export directory already exists
if os.path.exists(hv.archive.notebook_name):
    print('\n'.join(sorted(os.listdir(hv.archive.notebook_name))))

For technical reasons to do with how the IPython Notebook interacts with JavaScript, if you use the Jupyter Notebook command Run all, the hv.archive.export() command is not actually executed when the cell with that call is encountered during the run. Instead, the export() is queued until after the final cell in the notebook has been executed. This asynchronous execution has several awkward but not serious consequences:

  • It is not possible for the export() cell to show whether any errors were encountered during exporting, because these will not occur until after the notebook has completed processing. To see any errors, you can run hv.archive.last_export_status() separately, after the Run all has completed. E.g. just press shift-[Enter] in the following cell, which will tell you whether the previous export was successful.

  • If you use Run all, the directory listing os.listdir() above will show the results from the previous time this notebook was run, since it executes before the export. Again, you can use shift-[Enter] to update the data once complete.

  • The Export name: in the output of hv.archive.export() will not always show the actual name of the directory or archive that will be created. In particular, it may say {notebook}, which when saving will actually expand to the name of your Jupyter Notebook.


In [ ]:
hv.archive.last_export_status()

Accessing your saved data

By default, HoloViews saves not only your rendered plots (PNG, SVG, etc.), but also the actual HoloViews objects that the plots visualize, which contain all your actual data. The objects are stored in compressed Python pickle files (.hvz), which are visible in the directory listings above but have been ignored until now. The plots are what you need for writing a document, but the raw data is a crucial record to keep as well. For instance, you can now load in the HoloViews object and manipulate it just as you could when it was originally defined. E.g. we can re-load our Levels Overlay file, which has the contours overlaid on top of the image, and easily pull out the underlying Image object:


In [ ]:
import os
from holoviews.core.io import Unpickler

# Pick the first pickled HoloViews object (.hvz) from the archive listing
hvz_file = [f for f in listing if f.endswith('hvz')][0]
path = os.path.join(hv.archive.notebook_name, hvz_file)

if os.path.isfile(path):
    print('Unpickling {filename}'.format(filename=hvz_file))
    # Use a context manager so the file handle is closed after loading
    with open(path, "rb") as f:
        obj = Unpickler.load(f)
    print(obj)
else:
    print('Could not find file {path}'.format(path=path))
    print('Current directory is {cwd}'.format(cwd=os.getcwd()))
    print('Containing files and directories: {listing}'.format(listing=os.listdir(os.getcwd())))

Given the Image, you can also access the underlying array data, because HoloViews objects are simply containers for your data and associated metadata. This means that years from now, as long as you can still run HoloViews, you can easily re-load and explore your data, plotting it in entirely different ways or running different analyses, even if you no longer have any of the original code you used to generate the data. All you need is HoloViews, which is permanently archived on GitHub and is fully open source and thus should always remain available. Because the data is stored conveniently in the archive alongside the figure that was published, you can see immediately which file corresponds to the data underlying any given plot in your paper, and immediately start working with the data, rather than laboriously trying to reconstruct the data from a saved figure.

If you do not want the pickle files, you can turn them off by changing hv.archive.auto() to:

hv.archive.auto(exporters=[hv.renderer('matplotlib').instance(holomap=None)])

Here, the exporters list has been updated to include the usual default exporters without the Pickler exporter that would usually be included.

Using HoloViews to do reproducible research

The export options from HoloViews help you establish a feasible workflow for doing reproducible research: starting from interactive exploration, either export specific files with hv.save, or enable hv.archive.auto(), which will store a copy of your notebook and its output ready for inclusion in a document but retaining the complete recipe for reproducing the results later.

Why reproducible research matters

To understand why these capabilities are important, let's consider the process by which scientific results are typically generated and published without HoloViews. Scientists and engineers use a wide variety of data-analysis tools, ranging from GUI-based programs like Excel spreadsheets, to mixed GUI/command-line programs like Matlab, to purely scriptable tools like matplotlib or bokeh. The process by which figures are created in any of these tools typically involves copying data from its original source, selecting it, transforming it, choosing portions of it to put into a figure, choosing the various plot options for a subfigure, combining different subfigures into a complete figure, generating a publishable figure file with the full figure, and then inserting that into a report or publication.

If using GUI tools, often the final figure is the only record of that process, and even just a few weeks or months later a researcher will often be completely unable to say precisely how a given figure was generated. Moreover, this process needs to be repeated whenever new data is collected, which is an error-prone and time-consuming process. The lack of records is a serious problem for building on past work and revisiting the assumptions involved, which greatly slows progress both for individual researchers and for the field as a whole. Graphical environments for capturing and replaying a user's GUI-based workflow have been developed, but these have greatly restricted the process of exploration, because they only support a few of the many analyses required, and thus they have rarely been successful in practice. With GUI tools it is also very difficult to "curate" the sequence of steps involved, i.e., eliminating dead ends, speculative work, and unnecessary steps, with a goal of showing the clear path from incoming data to a final figure.

In principle, using scriptable or command-line tools offers the promise of capturing the steps involved, in a form that can be curated. In practice, however, the situation is often no better than with GUI tools, because the data is typically taken through many manual steps that culminate in a published figure, and without a laboriously manually created record of what steps are involved, the provenance of a given figure remains unknown. Where reproducible workflows are created in this way, they tend to be "after the fact", as an explicit exercise to accompany a publication, and thus (a) they are rarely done, and (b) they are very difficult to do if any of the steps were not recorded originally.

A Jupyter notebook helps significantly to make the scriptable-tools approach viable, by recording both code and the resulting output, and can thus in principle act as a record for establishing the full provenance of a figure. But because typical plotting libraries require so much plotting-specific code before any plot is visible, the notebook quickly becomes unreadable. To make notebooks readable, researchers then typically move the plotting code for a specific figure to some external file, which then drifts out of sync with the notebook so that the notebook no longer acts as a record of the link between the original data and the resulting figure.

HoloViews provides the final missing piece in this approach, by allowing researchers to work directly with their data interactively in a notebook, using small amounts of code that focus on the data and analyses rather than plotting code, yet showing the results directly alongside the specification for generating them. This user guide describes how to use a Jupyter notebook with HoloViews to export your results in a way that preserves the information about how those results were generated, providing a clear chain of provenance and making reproducible research practical at last.

For more information on how HoloViews can help build a reproducible workflow, see our 2015 paper on using HoloViews for reproducible research.