Corpus Description

I am using transcripts of presidential debates from http://www.debates.org/index.php?page=debate-transcripts. They cover the debates from 1960 through 2012, the most recent available.

I created a Python script, shared on GitHub, to download the debates to text files.
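The script itself is not reproduced here. As a rough sketch of how such a downloader could work (not the actual script), the code below collects links from the index page and saves each linked page to a file. The `marker` filter, the `debate_NNN.txt` file naming, and the helper names are all assumptions made for illustration; only the index URL and the `pres_debates` directory name come from this notebook.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen

INDEX_URL = "http://www.debates.org/index.php?page=debate-transcripts"


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def transcript_links(index_html, base_url=INDEX_URL, marker="transcript"):
    """Return absolute URLs of links whose href contains `marker`.

    `marker` is a hypothetical filter; adjust it to the real link pattern.
    """
    collector = LinkCollector()
    collector.feed(index_html)
    urls = []
    for href in collector.links:
        url = urljoin(base_url, href)
        # Skip the index page itself and any duplicate links.
        if marker in href and url != base_url and url not in urls:
            urls.append(url)
    return urls


def download_all(out_dir="pres_debates"):
    """Fetch the index, then save every linked page as a .txt file."""
    Path(out_dir).mkdir(exist_ok=True)
    with urlopen(INDEX_URL) as resp:
        index_html = resp.read().decode("utf-8", errors="replace")
    for i, url in enumerate(transcript_links(index_html)):
        with urlopen(url) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        # Hypothetical naming scheme: debate_000.txt, debate_001.txt, ...
        Path(out_dir, f"debate_{i:03d}.txt").write_text(page, encoding="utf-8")
```

Calling `download_all()` would need network access; `transcript_links` can be tried offline on any saved copy of the index page.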

The text is nested in HTML markup, which loosely separates the individual statements. Speaker names are also inserted to indicate who is speaking. The files my script generates total 4.3 megabytes.
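Because speaker names are embedded in the text, a transcript can in principle be split into speaker turns. The sketch below assumes that each turn starts on a new line with an upper-case label followed by a colon (for example `OBAMA:` or `MR. LEHRER:`). That label format is an assumption, not something verified against every transcript, and `split_turns` is a hypothetical helper name.

```python
import re

# Assumed label format: an upper-case name at the start of a line, then a colon.
SPEAKER = re.compile(r"^([A-Z][A-Z .'\-]*[A-Z]):\s*", re.MULTILINE)


def split_turns(text):
    """Split plain transcript text into (speaker, statement) pairs."""
    matches = list(SPEAKER.finditer(text))
    turns = []
    for i, match in enumerate(matches):
        # A statement runs until the next speaker label or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        statement = " ".join(text[match.end():end].split())
        turns.append((match.group(1), statement))
    return turns
```

Any text that appears before the first label is ignored by this sketch, and transcripts that format speaker names differently would need a different pattern.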

Corpus Size

With HTML:

String length: 4,560,289 characters

Number of space-separated pieces: 660,104

After using nltk.clean_html:

String length: 3,154,990 characters

Number of space-separated pieces: 574,570


In [15]:
from os import path

import nltk
from nltk.corpus.reader import PlaintextCorpusReader

In [16]:
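# Directory that holds the downloaded transcript files
corpus_path = path.join(path.curdir, "pres_debates")
# The file pattern is a regular expression, so the dot before "txt" is escaped
corpus = PlaintextCorpusReader(corpus_path, r".*\.txt")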

In [17]:
# Concatenate the raw text of every file in the corpus into one string
rawcorp = corpus.raw()

In [18]:
len(rawcorp)


Out[18]:
4560289

In [19]:
len(rawcorp.split(" "))


Out[19]:
660104

In [20]:
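# Note: nltk.clean_html exists only in NLTK 2.x; NLTK 3 removed it and points to BeautifulSoup's get_text()
cleaned = nltk.clean_html(rawcorp)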

In [22]:
len(cleaned)


Out[22]:
3154990

In [23]:
len(cleaned.split(" "))


Out[23]:
574570
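The cells above depend on `nltk.clean_html`, which is not available in NLTK 3. As one possible stand-in, here is a minimal sketch that strips markup with the standard library's `html.parser` and reports the same two measurements. It is not equivalent to `nltk.clean_html`, so its counts should not be expected to match the numbers above. `TextExtractor`, `strip_html`, and `size_stats` are names made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep text content and drop tags, plus anything inside script/style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def size_stats(text):
    """Report character count and two ways of counting pieces."""
    return {
        "chars": len(text),
        # Splits on single spaces only, as in the cells above.
        "space_split": len(text.split(" ")),
        # Splits on any run of whitespace, including newlines.
        "whitespace_split": len(text.split()),
    }
```

With the corpus loaded, `size_stats(strip_html(rawcorp))` would give the comparable figures. Note that `split(" ")` treats newlines as part of a piece and yields empty strings for consecutive spaces, so `split()` is usually closer to a word count.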