For this, we'll use the TextBlob library (http://textblob.readthedocs.org/en/dev/) and pandas (http://pandas.pydata.org/).
In [ ]:
from textblob import TextBlob
import pandas as pd
import matplotlib.pyplot as plt
import collections
import re
%matplotlib inline
I've already pulled down The Shunned House from Project Gutenberg (https://www.gutenberg.org/wiki/Main_Page) and saved it as a text file called 'lovecraft.txt'. Here we'll load it, decode it as UTF-8, and then instantiate a TextBlob object:
In [ ]:
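import io

# io.open decodes the file as UTF-8 while reading, on both Python 2 and 3,
# so no separate unicode() conversion is needed.
with io.open('lovecraft.txt', 'r', encoding='utf-8') as myfile:
    ushunned = myfile.read()
tb = TextBlob(ushunned)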
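Now we'll go through every sentence in the story and get the 'sentiment' of each one. Sentiment analysis in TextBlob returns two numbers: a polarity, from -1.0 (negative) to 1.0 (positive), and a subjectivity, from 0.0 (objective) to 1.0 (subjective). Here we'll extract just the polarity and write it to a CSV file: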
In [ ]:
sentences = tb.sentences
# Open the file once: write the header, then one row per sentence.
with open('shunned.csv', 'w') as text_file:
    text_file.write('number,polarity\n')
    for i, sentence in enumerate(sentences):
        pol = sentence.sentiment.polarity
        text_file.write(str(i) + ',' + str(pol) + '\n')
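Building each row by string concatenation works here because both fields are plain numbers. As an alternative sketch, Python's standard csv module can write the same two-column layout and takes care of quoting. The polarity values and the file name demo_polarity.csv below are made-up stand-ins, not output from TextBlob:

```python
import csv
import os
import tempfile

# Made-up polarity scores standing in for sentence.sentiment.polarity values.
demo_polarities = [0.1, -0.4, 0.0, 0.25]

# Write to a temporary directory so the example does not touch shunned.csv.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_polarity.csv')

with open(demo_path, 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(['number', 'polarity'])       # header row
    for number, polarity in enumerate(demo_polarities):
        writer.writerow([number, polarity])       # one row per sentence

# Read the file back to confirm its shape.
with open(demo_path, newline='') as csv_file:
    rows = list(csv.reader(csv_file))

print(rows[0])    # the header row
print(len(rows))  # header plus one row per score
```

Note that this sketch targets Python 3 (the newline='' argument to open is not available in Python 2's built-in open).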
Now we instantiate a dataframe by pulling in that csv:
In [ ]:
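# DataFrame.from_csv was removed from pandas; read_csv with index_col=0
# uses the 'number' column as the index in the same way.
df = pd.read_csv('shunned.csv', index_col=0)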
Let's plot our data! First let's just look at how the sentiment polarity changes from sentence to sentence:
In [ ]:
df.polarity.plot(figsize=(12,5), color='b', title='Sentiment Polarity for HP Lovecraft\'s The Shunned House')
plt.xlabel('Sentence number')
plt.ylabel('Sentiment polarity')
It's very up and down from sentence to sentence! There are some dark sentences (the ones below 0.0 polarity) and some positive ones (above 0.0 polarity), but overall the polarity hovers around 0.0.
One thing that may be interesting to look at is how the sentiment changes over the course of the book. To examine that further, I'm going to create a new column in the dataframe holding the cumulative sum of the polarity rating, using the pandas cumsum() method:
In [ ]:
df['cum_sum'] = df.polarity.cumsum()
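To see why a cumulative sum helps, here is a small sketch on made-up polarity scores (not values from the story). It uses itertools.accumulate from the standard library, which computes the same kind of running total that cumsum() does: a run of positive sentences makes the curve climb, and a run of negative ones makes it fall.

```python
from itertools import accumulate

# Made-up scores: four upbeat "sentences" followed by four gloomy ones.
# Multiples of 0.25 are used so the floating-point sums are exact.
demo_polarities = [0.25, 0.5, 0.25, 0.5, -0.5, -0.75, -0.5, -0.25]

# Running total: element k is the sum of scores 0 through k.
running_total = list(accumulate(demo_polarities))

print(running_total)
```

The individual scores jump around, but the running total rises steadily through the first half and then falls through the second, which is the kind of shape the cumulative plot is meant to reveal.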
So, now let's plot the results. How does the sentiment of Lovecraft's story change over the course of the book?
In [ ]:
df.cum_sum.plot(figsize=(12,5), color='r',
                title='Sentiment Polarity cumulative summation for HP Lovecraft\'s The Shunned House')
plt.xlabel('Sentence number')
plt.ylabel('Cumulative sum of sentiment polarity')
The climax of Lovecraft's story appears to be around sentence 255 or so. Things really drop off at that point and get dark, according to the TextBlob sentiment analysis.
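Rather than eyeballing the plot, one way to pin down where the curve turns is to find the sentence at which the running total peaks. The sketch below does this on made-up scores, so the index it reports says nothing about the actual story, and treating the peak as the 'climax' is an assumption made only for illustration. On the real dataframe, df.cum_sum.idxmax() should give the corresponding index label.

```python
from itertools import accumulate

# Made-up polarity scores (multiples of 0.25 keep the sums exact).
demo_polarities = [0.25, 0.5, -0.25, 0.5, 0.25, -0.5, -0.75, -0.5]

# Running total of the scores, as in the cum_sum column.
running_total = list(accumulate(demo_polarities))

# Position of the largest running total: the last point before the
# curve starts its decline in this toy data.
peak_index = max(range(len(running_total)), key=running_total.__getitem__)
peak_value = running_total[peak_index]

print(peak_index, peak_value)
```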
What's the dataframe look like?
In [ ]:
df.head()
Let's get some basic statistical information about sentence sentiments:
In [ ]:
df.describe()
For fun, let's just see what TextBlob thinks are the most negatively polar sentences in the short story:
In [ ]:
# Index labels are the sentence numbers, so they can index tb.sentences.
for i in df[df.polarity < -0.5].index:
    print(i, tb.sentences[i])
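The expression df[df.polarity < -0.5] is a boolean mask: the comparison produces one True/False flag per row, and indexing with those flags keeps only the rows marked True. Here is the same idea in plain Python, using invented sentences and invented scores (they are not TextBlob output):

```python
# Invented (sentence, polarity) pairs; these are not TextBlob scores.
demo_rows = [
    ("The garden was pleasant.", 0.5),
    ("A dreadful smell rose from the cellar.", -1.0),
    ("We walked to the door.", 0.0),
    ("The worst horror awaited us.", -0.75),
]

threshold = -0.5

# Step 1: the comparison yields one True/False flag per row (the "mask").
mask = [polarity < threshold for _, polarity in demo_rows]

# Step 2: keep the row numbers whose flag is True, much like taking
# .index of the filtered dataframe.
negative_index = [i for i, keep in enumerate(mask) if keep]

for i in negative_index:
    print(i, demo_rows[i][0])
```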
Let's take a quick peek at word frequencies using the re and collections libraries. Here we'll use the Counter class and its most_common() method to return a list of (word, count) tuples for the most common words in the story:
In [ ]:
# Reuse the text already loaded into ushunned instead of reopening the file.
words = re.findall(r'\w+', ushunned.lower())
collections.Counter(words).most_common(10)
In [ ]:
words = re.findall(r'\w+', ushunned.lower())
common = collections.Counter(words).most_common()
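To make the shape of that result concrete, here is a minimal sketch on a short made-up string rather than the story text. The regular expression \w+ pulls out runs of word characters, and most_common(n) returns the n highest counts as (word, count) tuples:

```python
import collections
import re

# A short made-up sample, standing in for the full story text.
demo_text = "The house was old. The cellar of the house was damp."

# Lowercase first so "The" and "the" are counted as the same word.
words = re.findall(r'\w+', demo_text.lower())

counts = collections.Counter(words)
top_three = counts.most_common(3)

print(top_three)
```

Calling most_common() with no argument, as in the cell above, returns every word ordered from most to least frequent.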
In [ ]:
df_freq = pd.DataFrame(common, columns=['word', 'freq'])
df_freq.set_index('word').head()
In [ ]: