Overview

You can either run this as a notebook, changing one variable below, or use the command-line makepdf.py script.

Requirements

  • The standard scientific Python stack, which comes with the Anaconda Python distribution ... that's what you should be using anyway.
  • wordcloud, which can be installed with pip install wordcloud and requires PIL (PIL comes with the Anaconda Python distribution).
  • Weasyprint. It has some dependencies. You should be able to pip install weasyprint.

Strategy

In the end, we want a nice-looking PDF document. There are quite a few tools for generating PDFs. The most popular is reportlab (people seem to recommend reportlab's platypus for "simple" PDF generation). I wanted something a little easier to control. I thought about using a Markdown document (see CommonMark) as an intermediate format, with python-markdown2 to generate the HTML and xhtml2pdf to generate the PDF (similar to what is done here). In the end, it seemed easier to just write out the simple HTML myself. Similarly, I thought about Sphinx with reStructuredText, but that needs a working LaTeX environment to produce PDFs.

For the actual HTML to PDF conversion, some options include xhtml2pdf (which wraps reportlab) and pdfkit (which wraps wkhtmltopdf). I didn't like the looks of their PDFs, so I went with Weasyprint.

For the wordcloud, I used the free Raleway font, downloaded via FontSquirrel.

Running the thing!

To run this, all you need to do is change the xl_filename in the cell below, then run all of the cells.

This will create an HTML file, a wordcloud png, and a PDF all in the same directory as your Excel file.
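
As a rough sketch of where those files end up: the HTML and wordcloud names below follow the same pattern the later cells in this notebook use (base name plus '.html' and '-wordcloud.png'). The PDF name is an assumption (base name plus '.pdf'), and output_names is a hypothetical helper for illustration, not part of makepdf.

```python
import os


def output_names(xl_filename):
    """Derive the output file names that sit next to the Excel file."""
    base, ext = os.path.splitext(xl_filename)
    if ext != '.xlsx':
        raise ValueError('expected an .xlsx file, got %r' % xl_filename)
    return {
        'html': base + '.html',
        'wordcloud': base + '-wordcloud.png',
        'pdf': base + '.pdf',  # assumption: not confirmed by this notebook
    }


# Hypothetical input path, just to show the pattern.
names = output_names('data/example/Some-Course.xlsx')
for kind, path in sorted(names.items()):
    print(kind, path)
```

If the script names the PDF differently, only the 'pdf' entry needs to change.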

If you like, you can change the css used for formatting; it's at the top of the script, and passed on as the css argument to generatepdf.

You can add extra stop words (i.e. things not to use in the word cloud and word count). If you're doing it via the script, it's the -s option. If you pass it in via the stopwords argument to generatepdf, you'll need to pass in a complete list of stopwords. So, you probably want to take the set wordcloud.STOPWORDS and add your words to it.

Note: I used to do everything in the notebook. However, once I made a command-line script, it seemed like a giant error-prone mess to duplicate code. So, code is imported from the script here, and the script is decently documented.


In [ ]:
import makepdf
import wordcloud

ERRORS!!

At the moment, the cell below generates a ton of PangoFontDescription and PangoLayoutIter errors. These are annoying, but they don't affect the output. Proceed with confidence.


In [ ]:
xl_filename = 'data/MGLFall2015/Analytical-Physics-I--(Fall-2015-16).xlsx'
# Start from wordcloud's default stop words and add our own.
stopwords = wordcloud.STOPWORDS.union(['class', 'course'])
makepdf.generatepdf(xl_filename, stopwords=stopwords)

But what does it look like?

We know the output name of the html file, so let's just display it directly.


In [ ]:
from IPython.display import display, HTML 
import os

In [ ]:
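# Rebuild the output names from the Excel file name.
filename, ext = os.path.splitext(xl_filename)
assert ext == '.xlsx'
html_filename = filename + '.html'
wc_filename = filename + '-wordcloud.png'
# Read the generated HTML, closing the file when done.
with open(html_filename) as f:
    html = f.read()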

What output?!

For privacy reasons, the actual output is removed in the public version of this script.


In [ ]:
# Point the wordcloud image at its path relative to this notebook.
display(HTML(html.replace(os.path.basename(wc_filename), wc_filename)))
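
The replace call in the display cell swaps the image's bare file name for its path relative to the notebook. Assuming the generated HTML refers to the wordcloud by bare file name (which is what the replace implies), here is a minimal sketch of the effect. The HTML snippet and the path are made up for illustration.

```python
import os

wc_filename = 'data/example/Some-Course-wordcloud.png'  # hypothetical path
# Made-up stand-in for the generated HTML.
html = '<h1>Some Course</h1><img src="Some-Course-wordcloud.png">'

# Swap the bare file name for the full relative path.
basename = os.path.basename(wc_filename)
fixed = html.replace(basename, wc_filename)
print(fixed)
```

Without this, the image link only resolves when the HTML is opened from the same directory as the PNG.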
