For my creative project I will write some code that analyzes Their Eyes Were Watching God sentence by sentence. The first step is to compute the sentiment of every single sentence in the book. The sentiment of a sentence is a number between $0$ and $1$ that describes how positive or negative the sentence is: values of $0$, $0.5$, and $1$ map to negative, neutral, and positive sentiment respectively. Using this we can visualize the progression of sentiment throughout the book and map events in the story onto it.
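As a rough illustration of the scale (this helper is purely for exposition, is an assumption of mine, and is not used in the analysis below):
In [ ]:
# Hypothetical helper: map a sentiment score in [0, 1] to a coarse label.
def sentiment_label(score):
    if score < 0.5:
        return 'negative'
    elif score > 0.5:
        return 'positive'
    return 'neutral'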
The first thing we do is load the text file containing the book and split it into sentences.
In [15]:
import nltk.data

# Read the raw text of the book and split it into sentences with the Punkt tokenizer.
with open('D:\\Temp\\Their-Eyes-Were-Watching-God-rmrju9.txt', 'r') as content_file:
    content = content_file.read()
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
tokens = sent_detector.tokenize(content)
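A quick sanity check on the split (the output here is only illustrative):
In [ ]:
# Illustrative: how many sentences did the tokenizer find, and what does the first one look like?
print 'number of sentences:', len(tokens)
print tokens[0]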
The next part is a little tricky to explain. Essentially, we train a classifier on the NLTK movie reviews corpus and wrap it in a function that, given a sentence, returns a sentiment value.
In [32]:
import numpy as np
from scipy.optimize import minimize
import sklearn.linear_model as ln
import nltk.classify.util
from nltk.classify import SklearnClassifier
from nltk.corpus import movie_reviews
import nltk.corpus as corpus
import nltk.tokenize as tokenize
import nltk.data as data

class GeneralTokenizerTools():
    def __init__(self):
        # English stopword list and the Punkt sentence tokenizer.
        self.stopwords = set(corpus.stopwords.words('english'))
        self.sent_tokenizer = data.load('tokenizers/punkt/english.pickle')

    def tokenize_remove_stopwords_sentence(self, sentence):
        return [x for x in tokenize.WhitespaceTokenizer().tokenize(sentence) if x not in self.stopwords]

    def tokenize_remove_stopwords_sentences(self, sentences):
        return [self.tokenize_remove_stopwords_sentence(x) for x in sentences]

    def tokenize_by_sentence(self, text):
        return self.tokenize_remove_stopwords_sentences(self.sent_tokenizer.tokenize(text))

def word_feats(words):
    # Bag-of-words features: every word in the document maps to True.
    return dict([(word, True) for word in words])

# Build labeled feature sets from the movie reviews corpus (0 = negative, 1 = positive).
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 0) for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 1) for f in posids]

# Use 99% of each class for training and hold out the rest for a quick accuracy check.
cut = .99
negcutoff = int(len(negfeats) * cut)
poscutoff = int(len(posfeats) * cut)
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print 'train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats))

# Logistic regression over the bag-of-words features, wrapped in NLTK's SklearnClassifier.
classifier = SklearnClassifier(ln.LogisticRegression(), sparse=False).train(trainfeats)
print 'accuracy:', nltk.classify.util.accuracy(classifier, testfeats)

tool = GeneralTokenizerTools()

def predict(x):
    # Probability distribution over {0: negative, 1: positive} for a single sentence.
    return classifier.prob_classify(dict([(y, True) for y in tool.tokenize_remove_stopwords_sentence(x)]))
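To sanity-check the classifier we can query it with a single sentence; the sentence below is just an example, and the probabilities depend on the trained model:
In [ ]:
# Illustrative: predict() returns an NLTK probability distribution over the two labels.
dist = predict("She was happy and free at last.")
print 'P(positive) =', dist.prob(1)
print 'P(negative) =', dist.prob(0)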
Next, some math magic (two smoothing helpers) that will make the plots a bit nicer to look at.
In [76]:
import matplotlib.pylab as plt
%pylab inline

def exponential_smoothing(x):
    # Simple exponential smoothing: y[0] = x[0], y[i] = alpha * x[i-1] + (1 - alpha) * y[i-1].
    def y(alpha, x):
        y = np.empty(len(x), float)
        y[0] = x[0]
        for i in xrange(1, len(x)):
            y[i] = x[i - 1] * alpha + y[i - 1] * (1 - alpha)
        return y

    def mape(alpha, x):
        # Mean squared percentage error between the smoothed and the raw series.
        diff = y(alpha, x) - x
        return np.mean((diff / x) ** 2)

    # Pick the alpha in [0, 1] that minimizes the error.
    guess = .5
    result = minimize(mape, guess, (x,), bounds=[(0, 1)], method='L-BFGS-B')
    print result
    return y(result.x[0], x)

def moving_average(x, y, step_size=.1, bin_size=1):
    # Average y inside a sliding window of width bin_size, evaluated every step_size.
    if x is None:
        x = np.arange(0, len(y), 1)
    y = np.array(y)
    bin_centers = np.arange(np.min(x), np.max(x) - 0.5 * step_size, step_size) + 0.5 * step_size
    bin_avg = np.zeros(len(bin_centers))
    for index in range(0, len(bin_centers)):
        bin_center = bin_centers[index]
        items_in_bin = y[(x > (bin_center - bin_size * 0.5)) & (x < (bin_center + bin_size * 0.5))]
        bin_avg[index] = np.average(items_in_bin)
    return bin_centers, bin_avg
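Concretely, the smoothing recurrence implemented above is $y_0 = x_0$ and $y_i = \alpha x_{i-1} + (1 - \alpha) y_{i-1}$, and L-BFGS-B searches for the $\alpha \in [0, 1]$ that minimizes the mean squared percentage error between the smoothed series and the original.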
We classify the sentiment of each sentence and then smooth everything out. For example, if we have the sentiments of 5 sentences near each other and 4 of them are positive, we treat the 5th one as positive too (even if it is negative). This makes the presentation a bit nicer.
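As a small illustration of the smoothing before applying it to the book (toy numbers, not from the text):
In [ ]:
# Toy example: a noisy 0/1 sequence averaged over a window of 4 neighbouring values.
toy = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
centers, averages = moving_average(None, toy, step_size=1, bin_size=4)
print averages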
In [67]:
# Probability of "positive" for every sentence, smoothed with a 100-sentence window.
page_count = 218
a_pos, b_pos = moving_average(None, [predict(x).prob(1) for x in tokens], bin_size=100)
The final stage is to visualize it.
In [73]:
plt.title("Sentiment of each Sentence")
plt.xlabel("Page")
plt.ylabel("Probability of Sentence Being Positive")
# Rescale the smoothed curve to [0, 1] and map sentence position onto page numbers.
plt.plot(np.arange(0, page_count, float(page_count) / len(b_pos)), (b_pos - min(b_pos)) / (max(b_pos) - min(b_pos)))
plt.show()
It is interesting to note that there are about three chunks (not counting the first 30 or so pages, which are the foreword) which correspond to the men she meets throughout the book. It seems counterintuitive that the most positive but sporadic chunk of the book occurred when she was with Logan, while the most consistent was with Tea Cake.
The next thing we will do is follow the trend of independence and love and how it correlates with the theme of a free woman throughout the book. To do this we analyze every individual sentence in the book to see whether it contains certain key words.
In [165]:
from scipy import stats

words_of_interest = ["independence", "independent", "free", "love", "like", "happy", "woman"]

def exists(x):
    # 1 if the sentence contains any keyword as a substring, else 0.
    for i in words_of_interest:
        if i in x:
            return 1
    return 0

# Indices of the sentences that mention at least one keyword, and a kernel density
# estimate of how those mentions are spread across the book.
a = np.array([x for x in range(len(tokens)) if exists(tokens[x]) == 1])
kde = stats.gaussian_kde(a)
plt.xlabel("Sentence")
plt.ylabel("Average Occurrence")
plt.plot(range(5000), [-np.log(kde(x) * 1000) for x in range(5000)])
plt.show()
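As a quick check of the keyword filter (the sentences below are made up, not from the book; note that the match is a plain substring test, so e.g. "like" would also match "likely"):
In [ ]:
# Illustrative: exists() does a simple substring test against the keyword list.
print exists("She felt free and happy")
print exists("The sun was setting")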
In this part of my project I analyze the words that correlate with the major themes of the book. It is interesting to note that the low point for the protagonist is in the middle of the book, while she reaches her peak at the end of the book with Tea Cake. It is also interesting to note that her levels of independence and love drop after the death of Tea Cake.