A very nice feature of IPython/Jupyter notebooks is that they allow embedded images, without the need for potentially breakable references to external image files. The flip side is that notebooks with lots of images become very large and unwieldy in various ways, particularly in combination with git and GitHub repositories. GitHub is not designed for passing around large binary objects such as .png files (or notebooks with embedded .png files): they do not diff well, and they can clog up repositories and cause all kinds of problems, such as obscenely slow git command line calls, and warnings or downright refusals from GitHub to accept pushes containing large files. Also, nbviewer does not work well with very large notebooks, and we very much want things to work with nbviewer.
Now that I'm ramping up use of the LabNotebook, it is therefore becoming apparent that a different approach to images is needed.
Carl Boettiger's solution to this problem is to store all images on flickr, and simply link to them. He has a very streamlined solution to this which involves knitr and hash-based file indexing. I want a similar solution to this, but with the following differences:
The solution I have come up with has two components.
The first is a mechanism for pushing figures to cloud storage as soon as they are added, so that they can be linked to in html and the derivative nbviewer notebook, labnotebook webpages, and web-hosted reveal slideshows. The second component is a custom figure class that uses IPython's custom display logic to expose a different command for nbconverted latex (and subsequently compiled PDFs), which links to local files rather than cloud-hosted files.
For the cloud hosting, I played around with the flickr python api a bit, but kept getting errors, so turned to dropbox, which so far seems to be working very well. The dropbox api object is initialized at the top of the notebook, and a new folder is created if necessary. When a file is uploaded, it first checks whether a file with the same name already exists, and if so deletes it before proceeding with the upload.
An additional advantage of having the dropbox api for uploading figures is that we can also upload the nbconverted pdf to the same folder, which we can also link to from the labnotebook webpage, the nbviewer notebook, and generally share with collaborators, rather than e-mailing large pdfs etc.
The end result is that notebooks (and various nbconverted derivatives) sans all embedded images are slimmed down enormously, and things such as a github-hosted LabNotebook repository become scalable in the medium term.
Ok, let's get cracking:
Importage
In [5]:
from IPython.display import Image, display, clear_output
%matplotlib inline
from wand.image import Image as wi
Define some variables
In [17]:
# put some system-specific variables in the namespace ('le' dict)
%run ~/set_localenv_vars.py
# output folder
outdir = le['data_dir'] + '/about_workdocs-cloudfiles'
!mkdir -p $outdir
# local image files
imfile1 = outdir + '/batwatchers_breakfast.png'
imfile2 = outdir + '/mpl3d_example.png' # we will generate this
#Dropbox access token
dbx_access_token = 'XXXXXXXXXXXXXXXXXXXX'
Calico document tools
In [9]:
%%javascript
IPython.load_extensions('calico-spell-check', 'calico-document-tools','calico-cell-tools');
In [10]:
#%load /home/jgriffiths/Code/libraries_of_mine/github/ipynb-thesis/Notebooks/chapter_utils.py
This class takes care of uploading and getting links for new images
In [11]:
class cloudfiles_nb(object):

    def __init__(self, access_token):  # app_key, app_secret
        from dropbox.client import DropboxClient
        self.client = DropboxClient(access_token)
        self.base_dir = 'workdocs-cloudfiles'
        # names of the per-notebook folders that already exist under base_dir
        self.folders_list = [p['path'].replace('/%s/' % self.base_dir, '')
                             for p in self.client.metadata(self.base_dir)['contents']]
        self.upload_file_res = {}

    def initialize_folder(self, folder_name):
        self.thisfolder = '%s/%s' % (self.base_dir, folder_name)
        if folder_name in self.folders_list:
            print('folder already exists')
            res = None
        else:
            print('creating folder')
            res = self.client.file_create_folder(self.thisfolder)
        # do something for error
        return res

    def upload_file(self, filepath):
        filename = filepath.split('/')[-1]
        newfile = '%s/%s' % (self.thisfolder, filename)
        # if the file already exists, delete and replace it
        filecheck = self.client.search(self.thisfolder, filename)
        if filecheck:
            self.client.file_delete(newfile)
        # open in binary mode so that image and pdf data are not mangled
        with open(filepath, 'rb') as f:
            res = self.client.put_file(newfile, f)
        return res

    def get_file_link(self, getfile):
        res = self.client.media('%s/%s' % (self.thisfolder, getfile))
        # something for error
        return res
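The delete-then-upload logic in upload_file can be tried out without a dropbox account. Here is a minimal sketch, assuming a hypothetical in-memory stand-in (FakeClient) that mimics only the three client calls used above (search, file_delete, put_file). It is not the dropbox api, and its search returns plain paths rather than metadata entries; it is just a way to check the replacement behaviour.

```python
import io


class FakeClient(object):
    """Hypothetical in-memory stand-in for the three client calls used above."""

    def __init__(self):
        self.files = {}  # maps cloud path -> file contents (bytes)

    def search(self, folder, query):
        # Return stored paths under `folder` whose file name contains `query`
        prefix = folder.rstrip('/') + '/'
        return [p for p in sorted(self.files)
                if p.startswith(prefix) and query in p[len(prefix):]]

    def file_delete(self, path):
        return self.files.pop(path)

    def put_file(self, path, fileobj):
        self.files[path] = fileobj.read()
        return {'path': path, 'bytes': len(self.files[path])}


def replace_upload(client, folder, filename, fileobj):
    """Delete any existing file with exactly this name, then upload the new one."""
    newfile = '%s/%s' % (folder, filename)
    if newfile in client.search(folder, filename):
        client.file_delete(newfile)
    return client.put_file(newfile, fileobj)


client = FakeClient()
folder = 'workdocs-cloudfiles/about_workdocs-cloudfiles'
replace_upload(client, folder, 'fig.png', io.BytesIO(b'first version'))
res = replace_upload(client, folder, 'fig.png', io.BytesIO(b'second'))
print(res)
```

One design choice in the sketch: it only deletes when the search result contains the exact target path. If the real search matches on partial file names, a hit for a different file would otherwise trigger a delete of a path that does not exist.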
Everything goes inside a folder in my dropbox root called 'workdocs-cloudfiles'. Under that, we will have separate folders on a per-notebook basis, with the folders having the same names as the notebook files.
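The folder layout just described can be written down as a small path helper. This is only an illustrative sketch: the function name cloud_path and the example file paths are made up here, and the class above builds the same paths with string formatting instead.

```python
import posixpath

BASE_DIR = 'workdocs-cloudfiles'


def cloud_path(notebook_file, local_file, base_dir=BASE_DIR):
    """Build '<base_dir>/<notebook name>/<file name>' for a local file."""
    # folder name = notebook file name without directory or extension
    nb_name = posixpath.splitext(posixpath.basename(notebook_file))[0]
    return posixpath.join(base_dir, nb_name, posixpath.basename(local_file))


# hypothetical example paths
print(cloud_path('notebooks/about_workdocs-cloudfiles.ipynb',
                 '/data/about_workdocs-cloudfiles/mpl3d_example.png'))
```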
Initialize the cloud folder for this notebook, telling it which notebook name (and so folder name) to use:
In [12]:
nb_name = 'about_workdocs-cloudfiles'
cnb = cloudfiles_nb(dbx_access_token)
In [13]:
res = cnb.initialize_folder(nb_name)
res
Just to demonstrate how to use the api, let's upload one of the image files:
In [14]:
imfile1
Out[14]:
In [291]:
res = cnb.upload_file(imfile1)
res
Out[291]:
Now searching for this file confirms it is there:
In [292]:
res = cnb.client.search(cnb.thisfolder,imfile1.split('/')[-1])
res
Out[292]:
We can get the url for the file and link to it in the notebook:
In [293]:
res = cnb.get_file_link('batwatchers_breakfast.png')
res
Out[293]:
In [294]:
Image(res['url'], embed=False)
Out[294]:
In general, however, we will use the custom figure class defined below for inserting figures, which uses html img tags rather than IPython's display.Image class.
We can also delete the file (note the slightly different form to the upload syntax):
In [289]:
cnb.client.file_delete('%s/%s' %(cnb.thisfolder, imfile1.split('/')[-1]))
Out[289]:
Now searching for the file will not show anything
In [290]:
cnb.client.search(cnb.thisfolder,imfile1.split('/')[-1])
Out[290]:
Now let's incorporate this cloud storage mechanism into a custom figure class:
This class uses the _repr_html_ and _repr_latex_ methods to put different text strings into html-related and pdf-related nbconvert outputs, which refer to the uploaded cloud figures and to the original local image files, respectively.
In [19]:
class nb_fig(object):

    def __init__(self, local_file, label, cap, fignum, dropbox_obj, size=(500, 400)):
        self.local_file = local_file
        self.size = size  # note: used below as (height, width)
        self.cap = cap
        self.label = label
        self.fignum = fignum
        # upload the image, then fetch a url that the html output can link to
        dropbox_obj.upload_file(local_file)
        res2 = dropbox_obj.get_file_link(local_file.split('/')[-1])
        self.cloud_file = res2['url']

    def _repr_html_(self):
        # the hover title and the visible caption use the same
        # 'Figure <number>. <label>. <caption>' ordering
        html_str = ('<center><img src="%s" alt="Just in case" '
                    'title="Figure %s. %s. %s" height="%spx" width="%spx" />'
                    'Figure %s. %s. %s </center>'
                    % (self.cloud_file,
                       self.fignum, self.label, self.cap,
                       self.size[0], self.size[1],
                       self.fignum, self.label, self.cap))
        return html_str

    # the 'newpage' here could possibly be replaced with something a bit better.
    # I put it in because otherwise figures seem to break up text and section headings
    # in rather scrappy ways.
    def _repr_latex_(self):
        ltx_str = (r'\begin{figure}[htbp!] \centering \vspace{20pt} \begin{center} '
                   r'\includegraphics[width=1.0\textwidth]{%s} '
                   r'\end{center}{ \hspace*{\fill} \\} \caption[%s]{%s} \label{fig:%s} '
                   r'\end{figure} \newpage'
                   % (self.local_file, self.label, self.cap, self.label))
        return ltx_str
The usage is fairly straightforward: when adding a figure, we provide it with the local image file, a figure label, a figure caption, a figure number, and the dropbox api object instance defined above.
Let's start with the BB figure:
In [20]:
cap = 'Come to the batwatchers breakfast, where all is joy and happiness.'
label = 'BB figure'
fignum = '1.1'
im = nb_fig(imfile1,label,cap,fignum,cnb,size=(800,500))
display(im)
A nice feature of using the html img tag this way is that if you hover your cursor over the image, you will see the full figure title and caption pop up. Just an added tidbit :)
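One thing to watch with the hover title: the caption ends up inside an html attribute, so a caption containing double quotes would cut the attribute short. Below is a minimal sketch of one way to guard against that, using the standard library's html.escape. The function name figure_html, the example url, and the (width, height) reading of size are assumptions for illustration, not part of the class above.

```python
from html import escape  # Python 3 standard library


def figure_html(url, fignum, label, cap, size=(500, 400)):
    """Return a centered <img> tag whose hover title carries the caption."""
    width, height = size  # assumption: size is (width, height) in pixels
    # escape quotes and angle brackets so the text is safe inside an attribute
    title = escape('Figure %s. %s. %s' % (fignum, label, cap), quote=True)
    return ('<center><img src="%s" alt="%s" title="%s" '
            'width="%d" height="%d" /><br/>%s</center>'
            % (escape(url, quote=True), title, title, width, height, title))


# hypothetical url and caption
tag = figure_html('https://example.com/fig.png', '1.1', 'BB figure',
                  'The "BB" breakfast', size=(800, 500))
print(tag)
```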
Now let's generate the second figure and insert:
In [21]:
#%load http://matplotlib.org/mpl_examples/mplot3d/contourf3d_demo2.py
In [355]:
"""
.. versionadded:: 1.1.0
This demo depends on new features added to contourf3d.
"""
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
from matplotlib import cm
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
X, Y, Z = axes3d.get_test_data(0.05)
ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3)
cset = ax.contourf(X, Y, Z, zdir='z', offset=-100, cmap=cm.coolwarm)
cset = ax.contourf(X, Y, Z, zdir='x', offset=-40, cmap=cm.coolwarm)
cset = ax.contourf(X, Y, Z, zdir='y', offset=40, cmap=cm.coolwarm)
ax.set_xlabel('X')
ax.set_xlim(-40, 40)
ax.set_ylabel('Y')
ax.set_ylim(-40, 40)
ax.set_zlabel('Z')
ax.set_zlim(-100, 100)
#plt.show()
##
fig.savefig(imfile2,bbox_inches='tight')
plt.close()
clear_output()
In [22]:
cap = 'This is a nice example of 3D plotting with matplotlib. '
label = 'Mplot3D figure. '
fignum = '2.1'
im = nb_fig(imfile2,label,cap,fignum,cnb,size=(800,500))
display(im)
The next thing to do after setting up the cloud storage api and creating the figures is to run nbconvert.
A key point here is that the way I do this is to use ipynb-workdocs to tag cells for inclusion in html, pdf, and slides outputs. In general, the pdf-tagged cells are a small subset of the html-tagged cells, as I generally want to use html for more complete code documentation, and pdfs for summaries of key results. Rough notes, personal reminders, and anything else not intended for either html, pdf, or slideshows, is simply not tagged, and remains in the master notebook for only my eyes to see.
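To make the tagging idea concrete, here is a rough sketch of selecting cells by tag from a notebook's JSON structure. It assumes that tags live in a list under each cell's metadata['tags'], which may well differ from how ipynb-workdocs stores them; the function select_cells and the toy notebook are made up for illustration.

```python
import json


def select_cells(notebook, tag):
    """Return a shallow copy of the notebook dict keeping only cells tagged `tag`."""
    kept = [cell for cell in notebook.get('cells', [])
            if tag in cell.get('metadata', {}).get('tags', [])]
    out = dict(notebook)
    out['cells'] = kept
    return out


# toy notebook: one cell for html and pdf, one for html only, one untagged
nb = {
    'nbformat': 4, 'nbformat_minor': 0, 'metadata': {},
    'cells': [
        {'cell_type': 'markdown', 'metadata': {'tags': ['html', 'pdf']},
         'source': ['Key result']},
        {'cell_type': 'code', 'metadata': {'tags': ['html']},
         'source': ['print(1)'], 'outputs': [], 'execution_count': None},
        {'cell_type': 'markdown', 'metadata': {},
         'source': ['Private note']},
    ],
}

html_nb = select_cells(nb, 'html')
pdf_nb = select_cells(nb, 'pdf')
print(len(html_nb['cells']), len(pdf_nb['cells']))
print(json.dumps(pdf_nb['cells'][0]['source']))
```

The untagged cell appears in neither output, which matches the idea that rough notes stay in the master notebook only.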
In this notebook I have tagged a few of the main documentation paragraphs, section headings, and the figures for inclusion in the pdf.
The nbconvert command is run separately from the cells in this notebook. Assuming the flow was interrupted briefly while I ran that command, let's now use the cloud api tool to upload the resulting PDF to the notebook cloud folder:
In [365]:
pdf_file = 'about_workdocs-cloudfiles__workdocs__2015-04-16/about_workdocs-cloudfiles__pdf_nb__2015-04-16_tidied.pdf'
cnb.upload_file(nbc_dir + '/' + pdf_file)
Out[365]:
We can grab the download link for the PDF:
In [366]:
cnb.get_file_link(pdf_file.split('/')[-1])
Out[366]:
As the PDF is only short, we'll also, for completeness, insert its pages into the notebook (but not into the pdf itself; i.e. the following cell is tagged 'html' but not 'pdf'):
In [383]:
# display the six pages of the PDF one by one
for page in range(6):
    print('\n\nPDF Page %s' % (page + 1))
    display(wi(filename=pdf_file + '[%s]' % page))
In [4]:
%run ~/set_localenv_vars.py
css_file = le['work_folder'] + '/masters/styles/CFDPython_css_modified_2.css'
from IPython.display import HTML,display
display(HTML(open(css_file, 'r').read()))