You can either run this as a notebook, changing one variable below, or use the command-line makepdf.py script.
pip install wordcloud (requires PIL, which comes with the Anaconda Python distribution)
pip install weasyprint
In the end, we want a nice-looking PDF document. There are quite a few tools for generating PDFs. The most popular is reportlab (people seem to recommend reportlab's platypus for "simple" PDF generation). I wanted something a little easier to control. I thought about using a Markdown document (see CommonMark) as an intermediate format, with python-markdown2 to generate HTML and xhtml2pdf to generate the PDF (similar to what is done here). In the end, it seemed easier to just write out the simple HTML myself. Similarly, I thought about Sphinx with reStructuredText, but that needs a working LaTeX environment to produce PDFs.
For the actual HTML-to-PDF conversion, some options include xhtml2pdf (which wraps reportlab) and pdfkit (which wraps wkhtmltopdf). I didn't like the look of their PDFs, so I went with WeasyPrint.
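To make the "write the HTML yourself, then hand it to WeasyPrint" idea concrete, here is a minimal sketch. It is not the code in makepdf.py: the helper names build_html and write_pdf, the sample comments, and the one-line CSS are all made up for illustration. The only WeasyPrint calls used are weasyprint.HTML(string=..., base_url=...), weasyprint.CSS(string=...), and HTML.write_pdf(target, stylesheets=...).

```python
import html


def build_html(title, comments):
    """Build a small standalone HTML page: a heading plus one paragraph per comment."""
    # Escape user text so characters like & and < cannot break the markup.
    paragraphs = "\n".join(f"<p>{html.escape(c)}</p>" for c in comments)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><h1>{html.escape(title)}</h1>\n{paragraphs}\n</body></html>\n"
    )


def write_pdf(html_text, css_text, pdf_path, base_url="."):
    """Render html_text to pdf_path with WeasyPrint, styled by css_text."""
    # Imported lazily so build_html stays usable without WeasyPrint installed.
    import weasyprint

    # base_url lets relative links (e.g. an image next to the HTML file) resolve.
    document = weasyprint.HTML(string=html_text, base_url=base_url)
    document.write_pdf(pdf_path, stylesheets=[weasyprint.CSS(string=css_text)])


# Hypothetical sample data, only to show the shape of the output.
page = build_html("Course comments", ["Great labs & lectures", "More <examples> please"])
print(page)

# To produce the PDF (requires WeasyPrint to be installed):
# write_pdf(page, "body { font-family: sans-serif; }", "comments.pdf")
```

Keeping the HTML construction separate from the PDF step means you can open the HTML in a browser to check it before converting.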
For the wordcloud, I used the free Raleway font, downloaded via FontSquirrel.
To run this, all you need to do is change the xl_filename in the cell below, then run all of the cells.
This will create an HTML file, a wordcloud png, and a PDF all in the same directory as your Excel file.
If you like, you can change the CSS used for formatting. The default is defined at the top of the script and is passed as the css argument to generatepdf.
You can add extra stop words (i.e. words to leave out of the word cloud and word count). From the script, use the -s option. If you instead pass them in via the stopwords argument to generatepdf, you'll need to pass in the complete set of stop words, so you probably want to take the set wordcloud.STOPWORDS and add your words to it.
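As a rough sketch of building that complete set: the helper name extend_stopwords is hypothetical, and the tiny base set below only stands in for wordcloud.STOPWORDS so the snippet runs without wordcloud installed. Lowercasing the extra words is an assumption that words are compared in lowercase.

```python
def extend_stopwords(base, extra):
    """Return a new set with every word in base plus the cleaned-up extra words."""
    # Strip whitespace and lowercase the extras (assumption: matching is done
    # on lowercased words); blank entries are skipped. base is left unmodified.
    cleaned = {word.strip().lower() for word in extra if word.strip()}
    return set(base) | cleaned


# Stand-in for wordcloud.STOPWORDS, which is a much larger set of English words.
base = {"the", "and", "of"}
stopwords = extend_stopwords(base, ["Class", " course ", ""])
print(sorted(stopwords))
```

In the notebook you would pass wordcloud.STOPWORDS as base and hand the result to makepdf.generatepdf(xl_filename, stopwords=stopwords).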
Note: I used to do everything in the notebook. However, once I made a command-line script, it seemed like a giant error-prone mess to duplicate code. So, code is imported from the script here, and the script is decently documented.
In [ ]:
import makepdf
import wordcloud
In [ ]:
xl_filename = 'data/MGLFall2015/Analytical-Physics-I--(Fall-2015-16).xlsx'
stopwords = wordcloud.STOPWORDS.union(['class','course'])
makepdf.generatepdf(xl_filename,stopwords=stopwords)
In [ ]:
from IPython.display import display, HTML
import os
In [ ]:
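# Derive the output file names from the Excel file name
filename, ext = os.path.splitext(xl_filename)
assert ext == '.xlsx'
html_filename = filename + '.html'
wc_filename = filename + '-wordcloud.png'
# Read the generated HTML, closing the file when done
with open(html_filename) as f:
    html = f.read()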
In [ ]:
# Swap the bare image file name for its path so the image loads from the notebook's directory
display(HTML(html.replace(os.path.split(wc_filename)[-1], wc_filename)))