In this course, we will offer an introduction to a series of more advanced methods in current stylometry. We will do so using the attractive format of Jupyter/IPython notebooks. You are free to follow along with the static versions of these notebooks in the course repository, or to run the notebooks interactively on your own machine. In this part of the course, we will use Python throughout, which is a modern and lightweight alternative to other scripting languages in the field, such as Perl or R. Writing your own code for stylometric experiments offers many advantages over using static GUIs; most notably, you will have much better control over, and understanding of, what is actually going on. As such, coding for yourself will allow you much more freedom in setting up the workflow that is right for you and your stylometric problems.
Choosing the stylometric programming environment that is right for you is a complex matter, and one that will strongly depend on personal taste and previous experience. Much also depends on the ad hoc problems you are trying to solve and, in many cases, it doesn't hurt to be fluent in a number of different languages and coding environments. In general, however, we do recommend staying away from commercial, closed-source alternatives, such as Matlab, which are less common in the Humanities anyway. Closed-source alternatives can be expensive and they limit your own insight into the bits of source code you use; moreover, they severely restrict the shareability of your code.
In the small universe of stylometry (and even textual analysis at large in the Digital Humanities), Python and R seem to be competing for "world domination" at present. In recent years, I have personally come to favour Python over R as my main language, for the following reasons:
scikit-learn). pip, easy_install or conda.
In particular, I like Python because it is a more versatile language than R when it comes to tasks other than pure stylometry. Parsing XML, for instance, or setting up a web server is easily done in Python, but more difficult to achieve in R. Nevertheless, Python also comes with a number of clear disadvantages, which we should be upfront about from the start:
matplotlib's plots to R's gorgeous default output. Nevertheless, Python has been catching up well recently, and with the recent introduction of excellent 'data viz' libraries such as seaborn or bokeh, it seems only a matter of time before Python reaches a similar status.
In conclusion: it is important to realize that Python might not be the ultimate answer to all your problems and needs in stylometry. At the same time, this course serves to demonstrate its capabilities and to suggest what it can bring to your projects.
In the world of scientific computing (which stylometry is increasingly part of), a highly stable "ecosystem" has been developed in recent years. In general, this ecosystem takes the form of:
numpy). pip).
In general, installing Continuum's Anaconda distribution of Python will get you a long way when you are setting up your machine for the first time, and this is also the installation procedure that we recommend for this course (see the repo's README for more details).
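Once your installation is in place, it can be useful to verify that the packages used in this course can actually be found by your Python interpreter. The snippet below is a minimal sketch of such a check, using only the standard library. The helper name `check_packages` and the list `COURSE_PACKAGES` are illustrative assumptions and not part of the course repository; the import names are taken from the packages mentioned in this course.

```python
from importlib.util import find_spec

# Import names of the packages mentioned in this course (an assumption:
# adjust this list to whatever the repo's README actually asks for).
COURSE_PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn", "gensim"]


def check_packages(names):
    """Return a dict mapping each import name to True if Python can locate it."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Print a small report without importing the (potentially heavy) packages.
    for name, found in check_packages(COURSE_PACKAGES).items():
        print(f"{name:<12} {'ok' if found else 'MISSING'}")
```

Because `find_spec` only locates a package without importing it, this check is fast, but it does not guarantee that the installed version is the one the notebooks expect.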
In this course, we will offer an introduction to a number of basic packages which, when used in combination, offer a powerful and fully fledged arsenal of stylometric tools. Much of the course material here stems from my recent experiences in co-developing pystyl, an open-source Python library for stylometric analysis, which you are of course most welcome to check out or contribute to.
The main packages, and the associated topics that will be covered in this course, include:
- numpy: scientific number crunching and matrix manipulation, with an excursion to the pandas library built on top of it;
- matplotlib: currently the default package in Python for visualizing scientific data, with excursions to seaborn and bokeh (which offer more aesthetically pleasing extensions etc.);
- sklearn: a standard library for Machine Learning and predictive data analytics in Python. We will also focus on this library in the context of its excellent support for text 'vectorization' (i.e. turning texts into numbers);
- gensim: an awesome package for semantic analytics in Python, including word2vec-style embeddings and topic modeling.
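To give a first impression of what text 'vectorization' means, here is a minimal sketch in plain Python. It is not sklearn's API (sklearn ships its own vectorizers, such as `CountVectorizer`); the function names `tokenize` and `vectorize`, the crude tokenization rule and the toy texts are all assumptions made purely for illustration. Each text is turned into a row of relative frequencies for the most common words in the whole collection.

```python
from collections import Counter


def tokenize(text):
    """Lowercase a text, split it on whitespace and strip surrounding punctuation."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def vectorize(texts, n_features=5):
    """Return (vocab, matrix): one row of relative word frequencies per text."""
    counts = [Counter(tokenize(text)) for text in texts]

    # Collect frequencies over the whole collection to pick the vocabulary.
    total = Counter()
    for c in counts:
        total.update(c)

    # Most frequent words first; ties are broken alphabetically so that
    # the result is reproducible.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [word for word, _ in ranked[:n_features]]

    matrix = []
    for c in counts:
        n_tokens = sum(c.values())
        # Relative frequency: raw count divided by the length of the text.
        matrix.append([c[w] / n_tokens if n_tokens else 0.0 for w in vocab])
    return vocab, matrix


# Toy input, invented for this example.
texts = ["The cat sat on the mat.", "The dog chased the cat."]
vocab, matrix = vectorize(texts, n_features=3)
```

The result is a small table with one row per text and one column per word, which is the kind of numeric representation that the numerical and Machine Learning packages listed above operate on.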