We aren't all happy at the same time
A list-based sentiment analysis by Scott Golder and Michael Macy
Winning makes us happy.
by Sean J. Taylor (@seanjtaylor)
In [9]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
Open a notebook and copy this text over.
Then press Shift-Enter, or select Cell-Run from the pull-down menu, to run it.
What's in there?
In a new cell, type tweet and then run the cell.
In [10]:
tweet
Out[10]:
We can also print out the contents.
In [11]:
print tweet
Python doesn't care if you use ', ", or even ''' for your strings.
In [12]:
tweet = "We have some delightful new food in the cafeteria. Awesome!!!"
tweet
Out[12]:
Will this work?
tweet = Does anyone call New Haven "NeHa"?
Guess? Try it!
In [13]:
tweet = Does anyone call New Haven "NeHa"?
In [14]:
tweet = '''Does anyone call New Haven "NeHa"?'''
print tweet
In [15]:
tweet = 'Does anyone call New Haven "NeHa"?'
print tweet
In [16]:
['everything','in','brackets','separated','by','commas.']
Out[16]:
Think of these like variables.
In [17]:
positive_words = ['awesome', 'good', 'nice', 'super', 'fun']
In [18]:
print positive_words
We can add things to the list with .append().
In [19]:
positive_words.append('delightful')
In [20]:
print positive_words
Note that we didn't write positive_words = positive_words.append('delightful').
.append() modifies the content of the list in place and returns None, so that assignment would replace our list with None.
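Here is a minimal sketch of that behavior, using a made-up list called colors so it doesn't touch positive_words. It uses print() with parentheses so it should run under Python 2 or 3.

```python
# colors and result are hypothetical names used only for this illustration.
colors = ['red', 'green']
result = colors.append('blue')   # append changes colors in place

print(colors)    # the list now holds three items
print(result)    # append itself hands back None
```

If you had written colors = colors.append('blue'), colors would now be None and the list would be lost.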
In [21]:
positive_words.append(like)
In [22]:
new_word_to_add = 'like'
positive_words.append(new_word_to_add)
In [23]:
print positive_words
Your turn.
Make a list called negative_words that includes awful, lame, horrible and bad. Then print out the contents.
In [ ]:
In [24]:
negative_words = ['awful','lame','horrible','bad']
print negative_words
Combining lists
In [25]:
emotional_words = negative_words + positive_words
print emotional_words
Strings can be split to create lists. I do this a lot.
In [26]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
words = tweet.split()
print words
In [27]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
print tweet.split('.')
Unlike .append(), .split() doesn't alter the string. Strings are immutable.
In [28]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
print tweet.split()
print tweet
So when you modify a string, make sure you store the results somewhere.
In [29]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
words = tweet.split()
print words
In [30]:
print words
print len(words)
How long is tweet?
In [91]:
In [92]:
print tweet
len(tweet)
Out[92]:
With functions like len(), Python counts the number of items in a list and the number of characters in a string.
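A quick sketch of both cases side by side. The two sample values are invented for the example and are not from the tweet data.

```python
# sample_list and sample_string are illustrative names and values.
sample_list = ['delightful', 'new', 'food']
sample_string = 'new food'

print(len(sample_list))     # counts the items in the list
print(len(sample_string))   # counts the characters, including the space
```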
There are a couple more data types that you might need:
In [88]:
#tuple
row = (1,3,'fish')
print row
In [89]:
#sets
set([3,4,5,5])
Out[89]:
In [90]:
#Dictionary
article_1 = {'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957}
In [91]:
#And a list of dictionaries is awfully close to a JSON.
article_2 = {'title': 'Go, Dog. Go!', 'author': 'PD Eastman', 'Year': 1961}
articles = [article_1, article_2]
articles
Out[91]:
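To see how close a list of dictionaries is to JSON, here is a small sketch using the standard library's json module. The records are sample data in the same shape as the articles list, and the variable names are made up for the example.

```python
import json

# Sample records in the same shape as the articles list (illustrative data).
sample_articles = [
    {'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957},
    {'title': 'Green Eggs and Ham', 'author': 'Dr. Seuss', 'Year': 1960},
]

# json.dumps turns the list of dictionaries into a JSON string.
as_json = json.dumps(sample_articles, indent=2, sort_keys=True)
print(as_json)

# json.loads reads the string back into Python lists and dictionaries.
round_trip = json.loads(as_json)
print(round_trip[0]['title'])
```

The printed JSON looks almost identical to the Python literal, which is why the two are so easy to move between.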
In [32]:
for word in words:
print word
In [33]:
for word in words:
    print word
Note the colon at the end of the first line. Python will expect the next line to be indented.
We can also add conditionals, like if and else or elif.
In [34]:
for word in words:
    if word in positive_words:
        print word
Your turn. Take the following tweet and print out a plus sign for each positive word:
tweet_2 = "Food is lame today. I don't like it at all."
Don't peek.
In [34]:
In [35]:
tweet_2 = "Food is lame today. I don't like it at all."
words_2 = tweet_2.split()
for word in words_2:
    if word in positive_words:
        print '+'
In [36]:
tweet_2 = "Food is lame today. I don't like it at all."
words_2 = tweet_2.split()
for word in words_2:
    if word in positive_words:
        print '+'
    elif word in negative_words:
        print '-'
Like lists, we can combine strings with a +.
In [37]:
for word in words:
    if word in positive_words:
        print word + ' is a positive word.'
Why doesn't this work?
In [38]:
print 3 + ' is a number.'
In [39]:
print ['puppies','dogs'] + 'are pets.'
In [40]:
print '3' + ' is a number.'
print str(3) + ' is a number.'
print '%s is a number.' % 3
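The failing cells above stop with a TypeError because + only joins two things of the same kind. A small sketch, assuming you want to see the error message without halting the notebook:

```python
# Adding a number to a string raises TypeError; catch it to read the message.
try:
    sentence = 3 + ' is a number.'
except TypeError as error:
    sentence = None
    print('TypeError: ' + str(error))

# Converting the number first gives Python two strings to join.
fixed = str(3) + ' is a number.'
print(fixed)
```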
In [41]:
for some_number in [1,2,4,9]:
    sentence = str(some_number) + ' is a number.'
    print sentence
In [42]:
print tweet.lower()
But we can't do it with a list of things.
In [43]:
print words.lower()
So you'll need to clean either the whole sentence, each word, or both.
In [44]:
for word in words:
    word_lower = word.lower()
    if word_lower in positive_words:
        print word_lower + ' is a positive word.'
Even with the updated loop, we still don't find awesome!!! yet.
Why?
In [45]:
print 'awesome!!!'.strip('!')
Getting rid of '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
In [46]:
from string import punctuation
In [47]:
punctuation
Out[47]:
In [48]:
'awesome!!!!'.strip(punctuation)
Out[48]:
In [49]:
'awesome?!?!'.strip(punctuation)
Out[49]:
In [50]:
'awesome!!!! party'.strip(punctuation)
Out[50]:
In [51]:
print 'awesome!!! party'
print 'awesome!!! party'.replace('!','')
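The difference between the two methods is easy to miss: .strip() only trims the two ends of a string, while .replace() removes a character everywhere. A short sketch with a made-up string (text and cleaned are illustrative names):

```python
from string import punctuation

text = 'awesome!!! party!'

# strip only trims characters from the two ends of the string.
stripped = text.strip(punctuation)
print(stripped)

# replace removes every occurrence, wherever it appears.
replaced = text.replace('!', '')
print(replaced)

# Splitting first lets strip clean the ends of each word separately.
cleaned = [word.strip(punctuation) for word in text.split()]
print(cleaned)
```

Splitting and then stripping each word is the approach that fits a word-list lookup, since each word ends up clean on its own.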
In [52]:
word = 'awesome!!!'
word_processed = word.strip('!')
word_processed = word_processed.lower()
print word_processed
In [53]:
word = 'awesome!!!'
word_processed = word.strip('!').lower()
print word_processed
In [54]:
for word in words:
    word_processed = word.lower()
    word_processed = word_processed.strip(punctuation)
    if word_processed in positive_words:
        print word + ' is a positive word'
It worked!!!
But what we really care about is the count of words.
In [55]:
positive_counter = 0
for word in words:
    word_processed = word.lower()
    word_processed = word_processed.strip(punctuation)
    if word_processed in positive_words:
        positive_counter = positive_counter + 1
print positive_counter
Let's do this for real.
LIWC is what all the cool kids use, but their list is copyrighted. So we'll use lists of positive and negative words from:
Theresa Wilson, Janyce Wiebe and Paul Hoffmann (2005). "Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis." Proceedings of HLT/EMNLP 2005, Vancouver, Canada.
In [95]:
negative_file = open('negative.txt', 'r').read()
This is the basic way to open and read a text file.
'r' tells the operating system you want permission to read it.
.read() tells Python to import all the text.
Let's take a look at negative_file by slicing it up.
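One common variation, sketched here as an alternative rather than as what this notebook does, is to open the file inside a with block so Python closes it for you. The example writes its own tiny stand-in file first, so the path and contents are made up and negative.txt is not needed.

```python
import os
import tempfile

# Write a tiny stand-in word list so the example runs without negative.txt.
# The file name and its contents are invented for illustration.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_words.txt')
with open(demo_path, 'w') as demo_file:
    demo_file.write('awful\nlame\nhorrible\n')

# "with" closes the file automatically once the indented block finishes.
with open(demo_path, 'r') as demo_file:
    demo_text = demo_file.read()

print(repr(demo_text))     # repr shows the \n characters explicitly
print(demo_file.closed)    # the file is already closed here
```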
In [57]:
negative_file[:50]
Out[57]:
\n is an end-of-line character.
[:50] took the first 50 characters.
In [58]:
negative_list = negative_file.splitlines()
In [59]:
negative_list[:5]
Out[59]:
In [60]:
print negative_list[0]
In [61]:
print negative_list[0:1]
In [62]:
print negative_list[-5:]
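The slices used in the last few cells are easier to check by eye on a short list. A sketch with an invented list called letters:

```python
# A short made-up list so the results are easy to verify.
letters = ['a', 'b', 'c', 'd', 'e', 'f']

print(letters[:2])       # the first two items
print(letters[0])        # a single item, not a list
print(letters[0:1])      # a list that holds one item
print(letters[-2:])      # the last two items
print('sentiment'[:4])   # slicing works on strings too
```

Note that letters[0] and letters[0:1] are different things: the first is the string itself, the second is a list containing it.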
Let's do it again for the positive words
In [63]:
postive_file = open('positive.txt', 'r').read()
postive_list = postive_file.splitlines()
Quiz: How many words are in the two lists combined?
In [63]:
In [64]:
print len(postive_list) + len(negative_list)
A while back, I used the Twitter API to get some Tweets, or status updates, that mentioned President Obama.
In [65]:
obama_tweets = open('obama_tweets.txt', 'r').read()
obama_tweets = obama_tweets.splitlines()
Avoid copying and pasting code! Make a function!
In [66]:
def open_list(filename):
    list_file = open(filename, 'r').read()
    list_file = list_file.splitlines()
    return list_file
obama_tweets = open_list('obama_tweets.txt')
More slicing
In [67]:
obama_tweets[:5]
Out[67]:
Some from the middle?
In [68]:
print obama_tweets[52:55]
That's ugly.
In [69]:
for tweet in obama_tweets[52:55]:
    print tweet
Let's get going!!!
In [70]:
#loop, but don't go through everything yet
for tweet in obama_tweets[:5]:
    print tweet
    positive_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    #split by ' ' into a list of words
    words = tweet_processed.split()
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        if clean_word in postive_list:
            print clean_word
            positive_counter = positive_counter + 1
    print positive_counter, len(words)
Your turn. Add a negative_counter!
In [70]:
In [71]:
#loop, but don't go through everything yet
for tweet in obama_tweets[:5]:
    positive_counter = 0
    negative_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    #split by ' ' into a list of words
    words = tweet_processed.split()
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        if clean_word in postive_list:
            positive_counter = positive_counter + 1
        elif clean_word in negative_list:
            negative_counter = negative_counter + 1
    print positive_counter, negative_counter, len(words)
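Since the same counting steps run for every tweet, one option is to wrap them in a function, in the same spirit as open_list. This is only a sketch: count_sentiment and the two tiny demo lists are hypothetical names and data, standing in for the real word lists loaded from the text files.

```python
from string import punctuation

def count_sentiment(tweet, positive_words, negative_words):
    """Return (positive count, negative count, number of words) for one tweet."""
    positive_counter = 0
    negative_counter = 0
    words = tweet.lower().split()
    for word in words:
        clean_word = word.strip(punctuation)
        if clean_word in positive_words:
            positive_counter = positive_counter + 1
        elif clean_word in negative_words:
            negative_counter = negative_counter + 1
    return positive_counter, negative_counter, len(words)

# Tiny stand-in lists; the real ones come from positive.txt and negative.txt.
demo_positive = ['awesome', 'delightful', 'like']
demo_negative = ['awful', 'lame']

demo_counts = count_sentiment('Food is lame today. Awesome!!!',
                              demo_positive, demo_negative)
print(demo_counts)
```

Returning the three numbers together keeps the loop over tweets short, and the function can be tested on a single sentence before running it on the whole file.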
Now let's get this out of Python
In [72]:
import csv
In [73]:
csv_file = open('tweet_sentiment.csv','w')
csv_writer = csv.writer( csv_file )
In [74]:
#loop, but don't go through everything yet
for tweet in obama_tweets[:5]:
    positive_counter = 0
    negative_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    #split by ' ' into a list of words
    words = tweet_processed.split()
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        if clean_word in postive_list:
            positive_counter = positive_counter + 1
        elif clean_word in negative_list:
            negative_counter = negative_counter + 1
    csv_writer.writerow( [positive_counter, negative_counter, len(words)] )
csv_file.close()
Let's look at the results. If !cat doesn't work, try !type, which is the Windows equivalent.
In [75]:
!cat tweet_sentiment.csv
Why only 5 rows?
In [76]:
csv_file = open('tweet_sentiment.csv','w')
csv_writer = csv.writer( csv_file )
#loop
for tweet in obama_tweets[:]:
    positive_counter = 0
    negative_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    #split by ' ' into a list of words
    words = tweet_processed.split()
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        if clean_word in postive_list:
            positive_counter = positive_counter + 1
        if clean_word in negative_list:
            negative_counter = negative_counter + 1
    csv_writer.writerow( [ positive_counter, negative_counter, len(words)] )
csv_file.close()
The csv file can be read in other programs.
That's not Python
If you wanted to add a header, you could put
csv_writer.writerow( [ 'positive', 'negative', 'length'] )
before you start writing the values.
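Put together, writing a header and then the values might look like the sketch below. The rows are made-up numbers standing in for the per-tweet counts, and the file goes to a temporary folder under an invented name so nothing in your working directory is overwritten.

```python
import csv
import os
import tempfile

# Made-up rows standing in for the per-tweet counts.
demo_rows = [[1, 0, 12], [0, 2, 9]]

# Hypothetical output path in a temporary folder.
demo_csv = os.path.join(tempfile.mkdtemp(), 'demo_sentiment.csv')

with open(demo_csv, 'w') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['positive', 'negative', 'length'])  # header first
    csv_writer.writerows(demo_rows)                          # then the values

# Read the file back, skipping any blank lines a platform may add.
with open(demo_csv, 'r') as csv_file:
    demo_lines = [line for line in csv_file.read().splitlines() if line]

print(demo_lines[0])
print(len(demo_lines))
```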
In case you're wondering, I would probably write my code a little differently.
In [94]:
def clean_split(tweet):
    ''' Take sentence and return cleaned list of words '''
    return [word.strip(punctuation) for word in tweet.lower().split()]
#turn the lists into sets so we can do intersections
postive_set = set(postive_list)
negative_set = set(negative_list)
sentiment = []
for tweet in obama_tweets:
    words = clean_split(tweet)
    postive_counter = len( postive_set.intersection(words) )
    negative_counter = len( negative_set.intersection(words) )
    sentiment.append ( [ postive_counter , negative_counter, len(words)] )
with open('tweet_sentiment_2.csv','w') as csv_file:
    csv_writer = csv.writer( csv_file )
    csv_writer.writerows(sentiment)
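One thing to keep in mind with the set version: an intersection counts each distinct matching word once, while the earlier loops counted every occurrence. The sketch below shows the difference on an invented sentence. Using collections.Counter to keep per-occurrence counts is just one possible approach, not something this notebook relies on, and demo_positive_set is a tiny stand-in for the real word set.

```python
from collections import Counter
from string import punctuation

def clean_split(tweet):
    """Take a sentence and return a cleaned list of words."""
    return [word.strip(punctuation) for word in tweet.lower().split()]

# Tiny stand-in for the positive word set.
demo_positive_set = set(['good', 'awesome'])

demo_words = clean_split('Good food, good people. Awesome!!!')

# Intersection counts each distinct matching word once.
distinct_matches = len(demo_positive_set.intersection(demo_words))

# Counter remembers how many times each word occurs (0 if it never does).
word_counts = Counter(demo_words)
total_matches = sum(word_counts[word] for word in demo_positive_set)

print(distinct_matches)
print(total_matches)
```

Which number you want depends on whether a tweet that says "good" twice should score higher than one that says it once.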