Workdocs-Cloudfiles: non-embedded images in the notebook and labnotebook entries

Overview
Notebook Setup
Dropbox API Class
Custom figure class

Overview

A very nice feature of IPython/Jupyter notebooks is that they allow embedded images, without the need for potentially breakable references to external image files. The flip side is that notebooks containing lots of images become very large and unwieldy in various ways, particularly in combination with git and github repositories. Git and github are not designed for large binary objects such as .png files (or notebooks with embedded .png files): these do not diff well and can clog up repositories, causing all kinds of problems, such as obscenely slow git command line calls and warnings or downright refusals from github to accept pushes containing large files. Also, nbviewer does not work well with very large notebooks, and we very much want things to be working with nbviewer.

Now that I'm ramping up use of the LabNotebook, it is therefore becoming apparent that a different approach to images is needed.

Carl Boettiger's solution to this problem is to store all images on flickr, and simply link to them. He has a very streamlined solution to this which involves knitr and hash-based file indexing. I want a similar solution to this, but with the following differences:

I am working with python + ipython/jupyter + nbconvert, rather than R + knitr
I am using the ipynb-workdocs system, which involves having a master notebook that spawns into pdf files, html/nbviewer files, and reveal slideshows, with variable contents according to a per-cell tag

The solution I have come up with has two components.

The first is a mechanism for pushing figures to cloud storage as soon as they are added, so that they can be linked to in html and derivative nbviewer notebooks, labnotebook webpages, and web-hosted reveal slideshows. The second component is a custom figure class that uses IPython's custom display logic to expose a different representation for nbconverted latex (and subsequently compiled PDFs), which links to local files rather than cloud-hosted files.

For the cloud hosting, I played around with the flickr python api a bit, but kept getting errors, so turned to dropbox, which so far seems to be working very well. A dropbox api object is initialized at the top of the notebook, and a new folder is created if necessary. When a file is uploaded, the object first checks whether a file with the same name already exists, and if so deletes it before proceeding with the upload.

An additional advantage of having the dropbox api for uploading figures is that we can also upload the nbconverted pdf to the same folder, which we can also link to from the labnotebook webpage, the nbviewer notebook, and generally share with collaborators, rather than e-mailing large pdfs etc.

The end result is that notebooks (and their various nbconverted derivatives), sans all embedded images, are slimmed down enormously, and things such as a github-hosted LabNotebook repository become scalable in the medium term.
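To get a rough feel for how much of a notebook's size is down to embedded images, something like the following could be used. This is a sketch only: it assumes the nbformat v4 JSON layout (image data stored as base64 text under an 'image/png' key in each output's 'data' dict), and the function name `embedded_png_chars` and the demo notebook dict are made up for illustration.

```python
import base64
import json


def embedded_png_chars(nb):
    """Count the base64 characters taken up by image/png outputs in a notebook dict.

    Assumes the nbformat v4 layout: nb['cells'][i]['outputs'][j]['data']['image/png'].
    """
    total = 0
    for cell in nb.get('cells', []):
        for output in cell.get('outputs', []):
            payload = output.get('data', {}).get('image/png')
            if payload is None:
                continue
            # on disk, multi-line strings may be stored as a list of lines
            if isinstance(payload, list):
                payload = ''.join(payload)
            total += len(payload)
    return total


# tiny hand-made notebook dict with one fake 'image' (300 zero bytes)
fake_png = base64.b64encode(b'\x00' * 300).decode('ascii')
demo_nb = {
    'cells': [
        {'cell_type': 'markdown', 'source': 'no outputs here'},
        {'cell_type': 'code', 'source': 'plot()',
         'outputs': [{'output_type': 'display_data',
                      'data': {'image/png': fake_png}}]},
    ]
}

png_chars = embedded_png_chars(demo_nb)
total_chars = len(json.dumps(demo_nb))
print('%d of %d characters are embedded png data' % (png_chars, total_chars))
```

For a real notebook, the dict would come from `json.load` on the .ipynb file.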

Ok, let's get cracking:

Notebook Setup

Importage



In [5]:

    
# clear_output is used further down, after saving the matplotlib figure
from IPython.display import Image,display,clear_output

%matplotlib inline
from wand.image import Image as wi

Define some variables



In [17]:

    
# put some system-specific variables in the namespace ('le' dict)
%run ~/set_localenv_vars.py

# output folder
outdir = le['data_dir'] + '/about_workdocs-cloudfiles'
!mkdir -p $outdir


# local image files
imfile1 =  outdir + '/batwatchers_breakfast.png'
imfile2 =  outdir + '/mpl3d_example.png' # we will generate this

#Dropbox access token 
dbx_access_token = 'XXXXXXXXXXXXXXXXXXXX'

Calico document tools



In [9]:

    
%%javascript
IPython.load_extensions('calico-spell-check', 'calico-document-tools','calico-cell-tools');



In [10]:

    
#%load /home/jgriffiths/Code/libraries_of_mine/github/ipynb-thesis/Notebooks/chapter_utils.py

Dropbox API Class

This class takes care of uploading and getting links for new images



In [11]:

    
class cloudfiles_nb(object):

    
  def __init__(self,access_token):

    from dropbox.client import DropboxClient
  
    self.client = DropboxClient(access_token)
    
    self.base_dir = 'workdocs-cloudfiles'        
    self.folders_list = [p['path'].replace('/%s/' %self.base_dir, '')\
                         for p in self.client.metadata(self.base_dir)['contents']]
    self.upload_file_res = {}
    
  def initialize_folder(self,folder_name):
    

    self.thisfolder = '%s/%s' %(self.base_dir,folder_name)
    
    if folder_name in self.folders_list:
      print 'folder already exists'
      res = None
    else:
      print 'creating folder'
      res = self.client.file_create_folder(self.thisfolder)
        
    # do something for error
    
    return res

    
  def upload_file(self,filepath):
        
    filename = filepath.split('/')[-1]

    newfile = '%s/%s' %(self.thisfolder,filename)
    
    # if the filename already exists, delete and replace
    filecheck = self.client.search(self.thisfolder, filename)
    if filecheck: del_res = self.client.file_delete(newfile)
        
    # open in binary mode, since images and pdfs are binary files
    with open(filepath, 'rb') as f:
      res = self.client.put_file(newfile, f)
    
    return res


  def get_file_link(self,getfile):
        
    res = self.client.media('%s/%s' %(self.thisfolder,getfile))
    
    # something for error
    
    return res
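The check, delete, upload sequence can be tried out without a Dropbox account. The following is a minimal sketch under stated assumptions: `FakeClient` is a hypothetical in-memory stand-in that only mimics the three client method names used above (`search`, `file_delete`, `put_file`), and `replace_upload` is an illustrative helper, not part of the class. It compares full paths rather than trusting any search hit, on the assumption that a search can return partial name matches.

```python
import io
import posixpath


class FakeClient(object):
    """Hypothetical in-memory stand-in for the Dropbox client (illustration only)."""

    def __init__(self):
        self.files = {}  # cloud path -> bytes

    def search(self, folder, query):
        # return metadata-like dicts for files in `folder` whose name contains `query`
        prefix = '/' + folder.strip('/') + '/'
        return [{'path': p} for p in sorted(self.files)
                if p.startswith(prefix) and query in posixpath.basename(p)]

    def file_delete(self, path):
        del self.files['/' + path.strip('/')]
        return {'path': path, 'is_deleted': True}

    def put_file(self, path, fileobj):
        data = fileobj.read()
        self.files['/' + path.strip('/')] = data
        return {'path': '/' + path.strip('/'), 'bytes': len(data)}


def replace_upload(client, folder, filename, fileobj):
    """Delete any existing copy of `filename` in `folder`, then upload `fileobj`."""
    target = '/' + posixpath.join(folder.strip('/'), filename)
    hits = client.search(folder, filename)
    # only delete on an exact path match, not on a partial name match
    if any(hit['path'] == target for hit in hits):
        client.file_delete(target)
    return client.put_file(target, fileobj)


client = FakeClient()
folder = 'workdocs-cloudfiles/about_workdocs-cloudfiles'
replace_upload(client, folder, 'myfig.png', io.BytesIO(b'another file'))
replace_upload(client, folder, 'fig.png', io.BytesIO(b'first version'))
res = replace_upload(client, folder, 'fig.png', io.BytesIO(b'second'))
print(res)
```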

Everything goes inside a folder in my dropbox root called 'workdocs-cloudfiles'. Under that, we will have separate folders on a per-notebook basis, with the folders having the same names as the notebook files.

Initialize the cloud folder for this notebook, telling it the notebook name (and so the folder name) to use:



In [12]:

    
nb_name = 'about_workdocs-cloudfiles'

cnb = cloudfiles_nb(dbx_access_token)



In [13]:

    
res = cnb.initialize_folder(nb_name)
res









    



folder already exists

Just to demonstrate how to use the api, let's upload one of the image files:



In [14]:

    
imfile1









    Out[14]:





'/media/sf_SharedFolder/Data/about_workdocs-cloudfiles/batwatchers_breakfast.png'



In [291]:

    
res = cnb.upload_file(imfile1)
res









    Out[291]:





{u'bytes': 227764,
 u'client_mtime': u'Thu, 16 Apr 2015 08:16:56 +0000',
 u'icon': u'page_white_picture',
 u'is_dir': False,
 u'mime_type': u'image/png',
 u'modified': u'Thu, 16 Apr 2015 08:16:56 +0000',
 u'path': u'/workdocs-cloudfiles/about_workdocs-cloudfiles/batwatchers_breakfast.png',
 u'rev': u'41215009579ca',
 u'revision': 266773,
 u'root': u'dropbox',
 u'shareable': False,
 u'size': u'222.4 KB',
 u'thumb_exists': True}

Now searching for this file confirms it is there:



In [292]:

    
res = cnb.client.search(cnb.thisfolder,imfile1.split('/')[-1])
res









    Out[292]:





[{u'bytes': 227764,
  u'client_mtime': u'Thu, 16 Apr 2015 08:16:56 +0000',
  u'icon': u'page_white_picture',
  u'is_dir': False,
  u'mime_type': u'image/png',
  u'modified': u'Thu, 16 Apr 2015 08:16:56 +0000',
  u'modifier': None,
  u'path': u'/workdocs-cloudfiles/about_workdocs-cloudfiles/batwatchers_breakfast.png',
  u'read_only': False,
  u'rev': u'41215009579ca',
  u'revision': 266773,
  u'root': u'dropbox',
  u'size': u'222.4 KB',
  u'thumb_exists': True}]

We can get the url for the file and link to it in the notebook:



In [293]:

    
res = cnb.get_file_link('batwatchers_breakfast.png')
res









    Out[293]:





{u'expires': u'Thu, 16 Apr 2015 12:17:57 +0000',
 u'url': u'https://dl.dropboxusercontent.com/1/view/w9t0sbu73l11dex/workdocs-cloudfiles/about_workdocs-cloudfiles/batwatchers_breakfast.png'}
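Note that the returned dict carries an 'expires' timestamp, which in the output above falls roughly four hours after the upload time. A small check along the following lines could flag links that need refreshing. This is a sketch only: `link_expired` is a made-up helper name, and it uses `email.utils.parsedate_to_datetime`, which is Python 3 only, whereas this notebook runs Python 2.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def link_expired(link, now=None):
    """Return True if the link's 'expires' timestamp is in the past."""
    # the timestamp format is the RFC 2822 style date shown in the output above
    expires = parsedate_to_datetime(link['expires'])
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires


# only the 'expires' field is needed for the check
link = {'expires': 'Thu, 16 Apr 2015 12:17:57 +0000'}

before = datetime(2015, 4, 16, 12, 0, 0, tzinfo=timezone.utc)
after = datetime(2015, 4, 16, 13, 0, 0, tzinfo=timezone.utc)
print(link_expired(link, now=before))
print(link_expired(link, now=after))
```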



In [294]:

    
Image(res['url'], embed=False)









    Out[294]:

In general, however, we will use the custom figure class defined below for inserting figures, which uses html img tags rather than IPython's display.Image class.

We can also delete the file (note the slightly different form to the upload syntax):



In [289]:

    
cnb.client.file_delete('%s/%s' %(cnb.thisfolder, imfile1.split('/')[-1]))









    Out[289]:





{u'bytes': 0,
 u'client_mtime': u'Wed, 31 Dec 1969 23:59:59 +0000',
 u'icon': u'page_white_picture',
 u'is_deleted': True,
 u'is_dir': False,
 u'mime_type': u'image/png',
 u'modified': u'Thu, 16 Apr 2015 08:15:48 +0000',
 u'modifier': None,
 u'path': u'/workdocs-cloudfiles/about_workdocs-cloudfiles/batwatchers_breakfast.png',
 u'read_only': False,
 u'rev': u'41214009579ca',
 u'revision': 266772,
 u'root': u'dropbox',
 u'size': u'0 bytes',
 u'thumb_exists': True}

Now searching for the file returns nothing:



In [290]:

    
cnb.client.search(cnb.thisfolder,imfile1.split('/')[-1])









    Out[290]:





[]

Now let's incorporate this cloud storage mechanism into a custom figure class:

Custom figure class

This class uses the _repr_html_ and _repr_latex_ methods to put different text strings into the html-related and pdf-related nbconvert outputs, which refer to the uploaded cloud figures and to the original local image files, respectively.



In [19]:

    
class nb_fig(object):
    
  def __init__(self, local_file,label,cap,fignum,dropbox_obj,size=(500,400)):
    self.local_file = local_file
    self.size = size
    self.cap = cap
    self.label = label
    self.fignum = fignum

    res1 = dropbox_obj.upload_file(local_file)
    res2 = dropbox_obj.get_file_link(local_file.split('/')[-1])
    
    self.cloud_file = res2['url']
    
  def _repr_html_(self):
    html_str = '<center><img src="%s" alt="Just in case" \
                title="Figure %s. %s. %s" height="%spx" width="%spx" />\
                Figure %s. %s. %s </center>' %(self.cloud_file,
                                               self.fignum, self.label,self.cap,
                                               self.size[0],self.size[1],
                                               self.fignum,self.label,self.cap)
    return html_str

  # the 'newpage' here could possibly be replaced with something a bit better. 
  # I put it in because otherwise figures seem to break up text and section headings
  # in rather scrappy ways. 
  def _repr_latex_(self):
    ltx_str = r'\begin{figure}[htbp!] \centering \vspace{20pt} \begin{center} \
                \includegraphics[width=1.0\textwidth]{%s} \
                \end{center}{ \hspace*{\fill} \\} \caption[%s]{%s} \label{fig:%s} \
                \end{figure} \newpage' %(self.local_file,self.label,self.cap,self.label)
    return ltx_str
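One thing to watch with the latex representation is that the caption string is pasted into \caption{...} as-is, so characters such as _ or % in a caption would upset the latex compile. A possible guard is sketched below. `latex_escape` and `LATEX_SPECIALS` are hypothetical names, not part of the class above, and the mapping only covers the usual latex special characters.

```python
# hypothetical helper: maps each latex special character to an escaped form
LATEX_SPECIALS = {
    '\\': r'\textbackslash{}',
    '&': r'\&',
    '%': r'\%',
    '$': r'\$',
    '#': r'\#',
    '_': r'\_',
    '{': r'\{',
    '}': r'\}',
    '~': r'\textasciitilde{}',
    '^': r'\textasciicircum{}',
}


def latex_escape(text):
    """Escape latex special characters one at a time (so nothing is escaped twice)."""
    return ''.join(LATEX_SPECIALS.get(ch, ch) for ch in text)


caption = '50% of a_b & c'
print(latex_escape(caption))
```

The escaped string would then be passed as the caption in the latex string only; the html representation needs the unescaped text.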

The usage is fairly straightforward: when adding a figure, we provide the local image file, a figure label, a figure caption, a figure number, and the dropbox api object instance defined above (in that order), plus an optional size.

Let's start with the BB figure:



In [20]:

    
cap = 'Come to the batwatchers breakfast, where all is joy and happiness.'
label = 'BB figure'
fignum = '1.1'
im = nb_fig(imfile1,label,cap,fignum,cnb,size=(800,500))
display(im)









    




                Figure 1.1. BB figure. Come to the batwatchers breakfast, where all is joy and happiness.

Note a nice feature of using the html img tag the way we are doing is that if you hover your cursor over the image, you will see the full image title and caption flash up. Just an added tidbit :)
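One caveat with putting the caption into the title attribute is that a caption containing double quotes or ampersands would break the tag. A minimal sketch of a safer way to build it follows. `img_tag` is a hypothetical helper, and `html.escape` is the Python 3 standard library function (this notebook itself runs Python 2).

```python
from html import escape


def img_tag(url, title, width, height):
    """Build an <img> tag whose attributes are safe for arbitrary caption text."""
    return '<img src="%s" title="%s" width="%d" height="%d" />' % (
        escape(url, quote=True), escape(title, quote=True), width, height)


# the url and caption here are made up for illustration
tag = img_tag('https://example.com/fig.png',
              'Figure 1.1. The "BB" figure & friends', 800, 500)
print(tag)
```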

Now let's generate the second figure and insert:



In [21]:

    
#%load http://matplotlib.org/mpl_examples/mplot3d/contourf3d_demo2.py



In [355]:

    
"""
.. versionadded:: 1.1.0
   This demo depends on new features added to contourf3d.
"""

from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm

fig = plt.figure()
# add_subplot with projection='3d' also works on newer matplotlib versions
ax = fig.add_subplot(111, projection='3d')
X, Y, Z = axes3d.get_test_data(0.05)
ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
cset = ax.contourf(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
cset = ax.contourf(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
cset = ax.contourf(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)

ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)

#plt.show()


##
fig.savefig(imfile2,bbox_inches='tight')
plt.close()
clear_output()



In [22]:

    
cap = 'This is a nice example of 3D plotting with matplotlib.'
label = 'Mplot3D figure'
fignum = '2.1'
im = nb_fig(imfile2,label,cap,fignum,cnb,size=(800,500))
display(im)



                Figure 2.1. Mplot3D figure. This is a nice example of 3D plotting with matplotlib.

Using with nbconvert

The next thing to do after setting up the cloud storage api and creating the figures is run nbconvert.

A key point here is that the way I do this is to use ipynb-workdocs to tag cells for inclusion in html, pdf, and slides outputs. In general, the pdf-tagged cells are a small subset of the html-tagged cells, as I generally want to use html for more complete code documentation, and pdfs for summaries of key results. Rough notes, personal reminders, and anything else not intended for either html, pdf, or slideshows, is simply not tagged, and remains in the master notebook for only my eyes to see.
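The per-cell tag filtering amounts to something like the following. This is only a sketch of the idea and not how ipynb-workdocs is actually implemented: the 'workdocs_tags' metadata key and the `select_cells` function are assumptions made for illustration.

```python
def select_cells(nb, target):
    """Return a copy of the notebook dict keeping only the cells tagged for `target`."""
    kept = [cell for cell in nb['cells']
            if target in cell.get('metadata', {}).get('workdocs_tags', [])]
    out = dict(nb)
    out['cells'] = kept
    return out


# hand-made notebook dict; the 'workdocs_tags' key is hypothetical
demo_nb = {
    'cells': [
        {'source': 'Key result', 'metadata': {'workdocs_tags': ['html', 'pdf']}},
        {'source': 'Detailed code notes', 'metadata': {'workdocs_tags': ['html']}},
        {'source': 'Personal reminder', 'metadata': {}},
    ]
}

for target in ('html', 'pdf', 'slides'):
    sources = [c['source'] for c in select_cells(demo_nb, target)['cells']]
    print('%s: %s' % (target, sources))
```

Untagged cells drop out of every output, which matches the "for only my eyes" behaviour described above.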

In this notebook I have tagged a few of the main documentation paragraphs, section headings, and the figures for inclusion in the pdf.

The nbconvert command is run separately from the cells in this notebook. Picking up again after a brief interruption while I ran that command, let's now use the cloud api tool to upload the resulting PDF to the notebook's cloud folder:



In [365]:

    
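# note: nbc_dir (the directory containing the nbconvert outputs) must already be defined; 
# it is not set in the cells above
pdf_file = 'about_workdocs-cloudfiles__workdocs__2015-04-16/about_workdocs-cloudfiles__pdf_nb__2015-04-16_tidied.pdf'
cnb.upload_file(nbc_dir + '/' + pdf_file)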









    Out[365]:





{u'bytes': 381218,
 u'client_mtime': u'Thu, 16 Apr 2015 09:02:40 +0000',
 u'icon': u'page_white_acrobat',
 u'is_dir': False,
 u'mime_type': u'application/pdf',
 u'modified': u'Thu, 16 Apr 2015 09:02:40 +0000',
 u'path': u'/workdocs-cloudfiles/about_workdocs-cloudfiles/about_workdocs-cloudfiles__pdf_nb__2015-04-16_tidied.pdf',
 u'rev': u'41236009579ca',
 u'revision': 266806,
 u'root': u'dropbox',
 u'shareable': False,
 u'size': u'372.3 KB',
 u'thumb_exists': False}

We can grab the download link for the PDF:



In [366]:

    
cnb.get_file_link(pdf_file.split('/')[-1])









    Out[366]:





{u'expires': u'Thu, 16 Apr 2015 13:04:14 +0000',
 u'url': u'https://dl.dropboxusercontent.com/1/view/hxhqa41jyrrb4qh/workdocs-cloudfiles/about_workdocs-cloudfiles/about_workdocs-cloudfiles__pdf_nb__2015-04-16_tidied.pdf'}

As the PDF is only short, for completeness we'll also display its pages in the notebook (but not in the pdf output itself; i.e. the following cell is tagged 'html' but not 'pdf'):



In [383]:

    
for page in range(6):
  print '\n\nPDF Page %s' %(page+1)
  display(wi(filename=pdf_file + '[%s]' %page))

Styling



In [4]:

    
%run ~/set_localenv_vars.py
css_file = le['work_folder'] + '/masters/styles/CFDPython_css_modified_2.css'
from IPython.display import HTML,display
display(HTML(open(css_file, 'r').read()))