Chapter 0 - Stylometry with Python

Python for Stylometry

In this course, we will offer an introduction to a series of more advanced methods in current stylometry. We will do so using the attractive format of Jupyter/IPython notebooks. You are free to follow along with the static versions of these notebooks in the course repository, or to run the notebooks interactively on your own machine. In this part of the course, we will use Python throughout: a modern and lightweight alternative to other scripting languages in the field, such as Perl or R. Writing your own code for stylometric experiments has many advantages over using static GUIs; most notably, you gain much better control over, and understanding of, what is actually going on. Coding things yourself also gives you far more freedom in setting up the workflow that is right for you and your stylometric problems.

Choosing the stylometric programming environment that is right for you is a complex matter, and one which will strongly depend on personal taste and previous experience. Much also depends on the particular problems you are trying to solve and, in many cases, it doesn't hurt to be fluent in a number of different languages and coding environments. In general, however, we recommend staying away from commercial, closed-source alternatives, such as Matlab, which are less common in the Humanities anyway. Closed-source alternatives can be expensive, they limit your insight into the source code you rely on, and they severely restrict how easily your code can be shared.

In the small universe of stylometry (and of textual analysis in the Digital Humanities at large), Python and R seem to be competing for "world domination" at present. In recent years, I have personally come to favour Python as my main language over R, for the following reasons:

  • Python (IMHO) has a more intuitive syntax than R. The possibilities for diverse forms of looping, to give but one example, are excellent; see the short sketch after this list.
  • Out of the box, Python offers better support for processing strings and natural language in particular, also when it comes to less mainstream languages, e.g. right-to-left scripts such as Hebrew.
  • Python is rapidly becoming the most popular scripting language in Machine Learning, both for teaching in academia and for industrial prototyping at companies like Google.
  • A number of highly effective, easy-to-work-with third-party libraries are available for predictive analytics (e.g. scikit-learn).
  • In recent years, a very efficient scientific "ecosystem" has been developed, allowing the seamless installation and updating of third-party packages, using commands like pip, easy_install or conda.
  • Python works really well with Jupyter notebooks (like the present one), which are great for interactive data exploration as well as code sharing.
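
To give a flavour of the first two points, here is a minimal sketch of looping over a small collection of texts, including a right-to-left Hebrew string, which Python 3's Unicode strings handle out of the box. The example strings are invented placeholders, not part of the course data.

```python
# A minimal sketch of the kind of looping and string handling referred to
# above: in Python 3, all strings are Unicode, so right-to-left scripts
# such as Hebrew can be processed without any extra machinery.
texts = {
    "english": "the quick brown fox",
    "hebrew": "שלום עולם",  # "hello world", an invented placeholder
}

for language, text in texts.items():
    # split on whitespace and report the number of tokens and their lengths
    tokens = text.split()
    print(language, len(tokens), [len(token) for token in tokens])
```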

In particular, I like Python because it is a more versatile language than R when it comes to tasks beyond pure stylometry. Parsing XML, for instance, or setting up a webserver is easily done in Python, but harder to achieve with R; a brief example of XML parsing follows the list of caveats below. Nevertheless, Python also comes with a number of pronounced disadvantages, which we should be clear about from the start:

  • The difference between Python 2 and Python 3 can be very confusing for beginners, because some (even important) packages do not support both versions to the same extent, or in the same fashion. Apart from setting up your coding environment correctly, installing cutting-edge, third-party packages can be more complex in Python, because of the intricate web of dependencies that some packages assume.
  • The possibilities for data visualization in Python have lagged behind in recent years, especially when we compare matplotlib's default plots to R's gorgeous default output. Nevertheless, Python has been catching up, and with the recent introduction of excellent 'data viz' libraries such as seaborn and bokeh, it seems only a matter of time before it reaches a similar status.
  • Especially when dealing with a lot of data and complicated algorithms, Python can be slow in comparison to more mature programming languages such as Java or C++, although the same is true of most other scripting languages, including R.
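
To return to the versatility point made above, here is a minimal sketch of XML parsing using only Python's standard library (xml.etree.ElementTree). The snippet and its element names are purely hypothetical placeholders, not part of the course data.

```python
# A minimal sketch of XML parsing with the standard library: we parse a tiny,
# invented XML snippet and loop over its <text> elements.
import xml.etree.ElementTree as ET

xml_snippet = """
<corpus>
    <text author="Author A">First sample text.</text>
    <text author="Author B">Second sample text.</text>
</corpus>
"""

root = ET.fromstring(xml_snippet)
for text_node in root.findall("text"):
    # print the author attribute and the textual content of each element
    print(text_node.get("author"), "->", text_node.text)
```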

In conclusion: it is important to realize that Python might not be the ultimate answer to all your problems and needs in stylometry. At the same time, this course serves to demonstrate its capabilities and to suggest what it can bring to your projects.

The Scientific Python Ecosystem

In the world of scientific computing (which stylometry is increasingly part of), a highly stable "ecosystem" has been developed in recent years. In general, this ecosystem takes the form of:

  • a number of basic external packages which have become fairly standard components of any coder's workflow (e.g. numpy).
  • a number of easy ways to install and update these inter-related packages (e.g. using pip).

In general, installing Continuum's Anaconda distribution of Python will get you a long way when you are setting up your machine for the first time, and this is also the installation procedure we recommend for this course (see the repo's README for more details). A few typical commands for installing and updating packages are shown below.
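
For illustration only (numpy is used here merely as an example package), these are typical commands for installing and updating a package; inside a Jupyter notebook, prefixing a line with "!" passes it to the underlying shell, and the same commands can be run from a terminal without the "!".

```python
# Typical package-management commands, shown as notebook cells; the "!"
# prefix hands the line to the shell. numpy is only an example package.
!pip install numpy        # install with pip
!conda install -y numpy   # install with the Anaconda/conda toolchain
!conda update -y numpy    # update an already installed package
```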

In this course, we will offer an introduction to a number of basic packages which, when used in combination, offer a powerful and fully-fledged arsenal of stylometric tools. Much of the course material here stems from my recent experience in co-developing pystyl: an open-source Python library for stylometric analysis, which you are of course most welcome to check out or contribute to.

The main packages and associated topics that will be covered in this course include:

  • numpy: scientific number crunching and matrix manipulation, with an excursion to the pandas library built on top of it;
  • matplotlib: currently the default package in Python for visualizing scientific data, with excursions to seaborn and bokeh (which offer more aesthetically pleasing extensions);
  • sklearn: a standard library for Machine Learning and predictive data analytics in Python. We will also focus on this library because of its excellent support for text 'vectorization' (i.e. turning texts into numbers); a first taste is given in the sketch after this list;
  • (if there is time:) gensim: an awesome package for semantic analytics in Python, including word2vec-style embeddings and topic modeling.
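
As a first taste of the sklearn point above, the following minimal sketch (with an invented toy corpus) shows how scikit-learn's CountVectorizer turns a handful of documents into a numpy matrix of word counts:

```python
# A minimal sketch of text "vectorization": three invented toy documents are
# turned into a document-term matrix of raw word counts, which numpy can then
# treat as ordinary numbers.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)  # sparse document-term matrix

counts = np.asarray(X.todense())         # dense numpy array of counts
print(counts.shape)                      # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))    # the vocabulary, in alphabetical order
print(counts)
```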