I am using transcripts of presidential debates from http://www.debates.org/index.php?page=debate-transcripts. They cover the debates from 1960 through the most recent ones in 2012.
I wrote a Python script, shared on GitHub, to download the debates to text files.
The text is nested in HTML markup, which loosely separates the statements. Speaker names are also inserted to indicate who is speaking. The files my script generates weigh in at 4.3 megabytes.
Raw corpus, HTML markup included:
String length: 4,560,289
Length split by spaces: 660,104
After nltk.clean_html:
String length: 3,154,990
Length split by spaces: 574,570
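The "split by spaces" figures are only a rough word count, because `split(" ")` cuts on every single space character: consecutive spaces produce empty strings, and words separated only by a newline stay glued together. A minimal sketch of the difference, using a made-up two-line sample rather than real transcript text:

```python
# Invented sample text (not taken from the transcripts): it contains
# a double space and a newline to show how the two splits differ.
sample = "KENNEDY:  Thank you.\nNIXON: Mr. Smith,  Mr. Kennedy."

# Splits on each individual space; double spaces yield "" entries
# and the newline is not treated as a separator.
by_space = sample.split(" ")

# Splits on runs of any whitespace (spaces, tabs, newlines).
by_whitespace = sample.split()

print(len(by_space), by_space)
print(len(by_whitespace), by_whitespace)
```

On real transcripts the two counts can differ in either direction, so the figures above are best read as approximate.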
In [15]:
from os import path
import nltk
from nltk.corpus.reader import PlaintextCorpusReader
In [16]:
corpus_path = path.join(path.curdir, "pres_debates")
corpus = PlaintextCorpusReader(corpus_path, r".*\.txt")  # escape the dot so only names ending in ".txt" match
In [17]:
rawcorp = corpus.raw()
In [18]:
len(rawcorp)
Out[18]:
In [19]:
len(rawcorp.split(" "))
Out[19]:
In [20]:
cleaned = nltk.clean_html(rawcorp)
In [22]:
len(cleaned)
Out[22]:
In [23]:
len(cleaned.split(" "))
Out[23]:
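Note that `nltk.clean_html` only works in older NLTK releases; in NLTK 3.x the function was removed and raises `NotImplementedError`, with a message recommending BeautifulSoup's `get_text()`. As a rough stand-in, here is a minimal sketch that strips tags with the standard library's `html.parser`. The names `TextExtractor` and `strip_html` are my own, hypothetical helpers, not part of NLTK, and the output will not match `clean_html` character for character, so the lengths above should not be expected to reproduce exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the text nodes and collapse runs of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


# Invented snippet, only to show the call.
print(strip_html("<p><b>NIXON:</b> Thank you &amp; good evening.</p>"))
```

In the notebook above this would be used as `cleaned = strip_html(rawcorp)`. One caveat of this design: joining text nodes with a space will split a word that is interrupted by an inline tag, so BeautifulSoup is probably the better choice for anything beyond a quick count.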