We aren't all happy at the same time

A list-based sentiment analysis by Scott Golder and Michael Macy

Winning makes us happy.

by Sean J. Taylor (@seanjtaylor)

Data types

Strings


In [9]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

Open a notebook and copy this text over.

Then press shift-enter or select Cell-Run from the pull-down menu to run it.

What's in there?

In a new cell, type tweet and then run the cell.


In [10]:
tweet


Out[10]:
'We have some delightful new food in the cafeteria. Awesome!!!'

We can also print out the contents.


In [11]:
print tweet


We have some delightful new food in the cafeteria. Awesome!!!

Python doesn't care if you use ', ", or even ''' for your strings.


In [12]:
tweet = "We have some delightful new food in the cafeteria. Awesome!!!"
tweet


Out[12]:
'We have some delightful new food in the cafeteria. Awesome!!!'

Will this work?

tweet = Does anyone call New Haven "NeHa"?

Guess? Try it!


In [13]:
tweet = Does anyone call New Haven "NeHa"?


  File "<ipython-input-13-053a43dc44ba>", line 1
    tweet = Does anyone call New Haven "NeHa"?
                      ^
SyntaxError: invalid syntax

In [14]:
tweet = '''Does anyone call New Haven "NeHa"?'''

print tweet


Does anyone call New Haven "NeHa"?

In [15]:
tweet = 'Does anyone call New Haven "NeHa"?'

print tweet


Does anyone call New Haven "NeHa"?
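
A third option, not used above, is to escape the inner quotes with a backslash. A minimal sketch (the variable names are only for illustration, and print() is written with parentheses so the cell also runs on Python 3):

```python
# Three ways to put double quotes inside a string
single_quoted = 'Does anyone call New Haven "NeHa"?'
triple_quoted = '''Does anyone call New Haven "NeHa"?'''
escaped = "Does anyone call New Haven \"NeHa\"?"

# All three produce exactly the same string
print(single_quoted == triple_quoted == escaped)
```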

Lists - another way to store data


In [16]:
['everything','in','brackets','separated','by','commas.']


Out[16]:
['everything', 'in', 'brackets', 'separated', 'by', 'commas.']

Think of these like variables.


In [17]:
positive_words = ['awesome', 'good', 'nice', 'super', 'fun']

In [18]:
print positive_words


['awesome', 'good', 'nice', 'super', 'fun']

We can add things to the list with append.


In [19]:
positive_words.append('delightful')

In [20]:
print positive_words


['awesome', 'good', 'nice', 'super', 'fun', 'delightful']

Note that we didn't write positive_words = positive_words.append('delightful').

.append() modifies the content of the list.
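
To see why that assignment would be a mistake, here is a small sketch (demo_words and result are illustrative names; print() uses parentheses so it also runs on Python 3). Because .append() returns None, assigning its result back to the list name would replace the list with None.

```python
demo_words = ['awesome', 'good']

# .append() changes the list in place and returns None
result = demo_words.append('nice')

print(result)
print(demo_words)
```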


In [21]:
positive_words.append(like)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-21-d07de71c07d9> in <module>()
----> 1 positive_words.append(like)

NameError: name 'like' is not defined

In [22]:
new_word_to_add = 'like'
positive_words.append(new_word_to_add)

In [23]:
print positive_words


['awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']

Your turn.

Make a list called negative_words that includes awful, lame, horrible and bad. Print out the contents.


In [ ]:


In [24]:
negative_words = ['awful','lame','horrible','bad']
print negative_words


['awful', 'lame', 'horrible', 'bad']

Combining lists


In [25]:
emotional_words = negative_words + positive_words
print emotional_words


['awful', 'lame', 'horrible', 'bad', 'awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']
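
A short sketch of what + does and does not change, using small stand-in lists (the _demo names are assumptions for illustration). It also shows .extend(), which is the in-place counterpart for adding several items at once.

```python
negative_demo = ['awful', 'lame']
positive_demo = ['awesome', 'good']

# + builds a brand-new list and leaves both originals untouched
combined = negative_demo + positive_demo
print(combined)
print(negative_demo)

# .extend() adds the items to the existing list in place
negative_demo.extend(['horrible', 'bad'])
print(negative_demo)
```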

Strings can be split to create lists. I do this a lot.


In [26]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

words = tweet.split()
print words


['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']

In [27]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
print tweet.split('.')


['We have some delightful new food in the cafeteria', ' Awesome!!!']

Unlike .append(), .split() doesn't alter the string. Strings are immutable.


In [28]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
print tweet.split()
print tweet


['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']
We have some delightful new food in the cafeteria. Awesome!!!

So when you modify a string, make sure you store the results somewhere.


In [29]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
words = tweet.split()
print words


['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']

Most of the fun math is in numpy but we can count the length of objects.


In [30]:
print words
print len(words)


['We', 'have', 'some', 'delightful', 'new', 'food', 'in', 'the', 'cafeteria.', 'Awesome!!!']
10

How long is tweet?


In [91]:


In [92]:
print tweet 
len(tweet)


We have some delightful new food in the cafeteria. Awesome!!!
Out[92]:
61

With functions like len(), Python counts the number of items in a list and the number of characters in a string.
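
A quick side-by-side sketch of the two cases, using a short made-up sentence (demo_text and demo_words are illustrative names):

```python
demo_text = 'We have food'
demo_words = demo_text.split()

# For a list, len() counts the items
word_count = len(demo_words)

# For a string, len() counts every character, spaces included
char_count = len(demo_text)

print(word_count)
print(char_count)
```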

There are a couple more data types that you might need:


In [88]:
#tuple
row = (1,3,'fish')

print row


(1, 3, 'fish')

In [89]:
#sets
set([3,4,5,5])


Out[89]:
{3, 4, 5}

In [90]:
#Dictionary

article_1 = {'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957}

In [91]:
#And a list of dictionaries is awfully close to JSON.
article_2 = {'title': 'Go, Dog. Go!', 'author': 'PD Eastman', 'Year': 1961}

articles = [article_1, article_2]

articles


Out[91]:
[{'Year': 1957, 'author': 'Dr. Seuss', 'title': 'Cat in the Hat'},
 {'Year': 1961, 'author': 'PD Eastman', 'title': 'Go, Dog. Go!'}]
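
To make the JSON comparison concrete, here is a hedged sketch using the standard json module. It looks up a value by key, converts the list of dictionaries to a JSON string, and converts it back. The names as_json and round_trip are illustrative.

```python
import json

article_1 = {'title': 'Cat in the Hat', 'author': 'Dr. Seuss', 'Year': 1957}
articles = [article_1]

# Look up a value by its key
print(article_1['author'])

# json.dumps turns the list of dictionaries into a JSON string
as_json = json.dumps(articles)
print(as_json)

# json.loads reverses the conversion
round_trip = json.loads(as_json)
print(round_trip == articles)
```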

Loops

Were any of the words in our sentence in the positive word list?


In [32]:
for word in words:
print word


  File "<ipython-input-32-8d4b96107be0>", line 2
    print word
        ^
IndentationError: expected an indented block

In [33]:
for word in words:
    print word


We
have
some
delightful
new
food
in
the
cafeteria.
Awesome!!!

Note the colon at the end of the first line. Python will expect the next line to be indented.

We can also add conditionals, like if and else or elif.


In [34]:
for word in words:
    if word in positive_words:
        print word


delightful

Your turn. Take the following tweet and print out a plus sign for each positive word:

tweet_2 = "Food is lame today. I don't like it at all."

Don't peek.


In [34]:


In [35]:
tweet_2 = "Food is lame today. I don't like it at all."
words_2 = tweet_2.split()

for word in words_2:
    if word in positive_words:
        print '+'


+

In [36]:
tweet_2 = "Food is lame today. I don't like it at all."
words_2 = tweet_2.split()

for word in words_2:
    if word in positive_words:
        print '+'
    elif word in negative_words:
        print '-'


-
+
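
The else branch has not appeared in a cell yet. As a sketch, the version here collects a mark for every word, using '0' for words in neither list (the marks list and the '0' symbol are choices made for this illustration):

```python
positive_words = ['awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']
negative_words = ['awful', 'lame', 'horrible', 'bad']

tweet_2 = "Food is lame today. I don't like it at all."

marks = []
for word in tweet_2.split():
    if word in positive_words:
        marks.append('+')
    elif word in negative_words:
        marks.append('-')
    else:
        # Runs only when neither test above matched
        marks.append('0')

print(' '.join(marks))
```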

Like lists, we can combine strings with a +.


In [37]:
for word in words:
    if word in positive_words:
        print word + ' is a positive word.'


delightful is a positive word.

Why doesn't this work?


In [38]:
print 3 + ' is a number.'


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-43490b2e84f2> in <module>()
----> 1 print 3 + ' is a number.'

TypeError: unsupported operand type(s) for +: 'int' and 'str'

In [39]:
print ['puppies','dogs'] + 'are pets.'


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-4c744f2d275f> in <module>()
----> 1 print ['puppies','dogs'] + 'are pets.'

TypeError: can only concatenate list (not "str") to list

In [40]:
print '3' + ' is a number.'

print str(3) + ' is a number.'

print '%s is a number.' % 3


3 is a number.
3 is a number.
3 is a number.

In [41]:
for some_number in [1,2,4,9]:
    sentence = str(some_number) + ' is a number.'
    print sentence


1 is a number.
2 is a number.
4 is a number.
9 is a number.
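
Another option, not used in the cells above, is the .format() string method, which converts the number for you. A minimal sketch (the summary line is a made-up example):

```python
for some_number in [1, 2, 4, 9]:
    # {} is a placeholder that .format() fills in
    sentence = '{} is a number.'.format(some_number)
    print(sentence)

# Several values can be filled in at once, in order
summary = '{} positive and {} negative words'.format(2, 0)
print(summary)
```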

Text cleaning

Or why Awesome!!! wasn't a positive word


In [42]:
print tweet.lower()


we have some delightful new food in the cafeteria. awesome!!!

But we can't do it with a list of things.


In [43]:
print words.lower()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-43-14fb7edbf297> in <module>()
----> 1 print words.lower()

AttributeError: 'list' object has no attribute 'lower'

So you'll need to clean either the whole sentence, each word, or both.


In [44]:
for word in words:
    word_lower = word.lower()
    if word_lower in positive_words:
        print word_lower + ' is a positive word.'


delightful is a positive word.

Even with the updated loop, we still don't find awesome!!!.

Why?


In [45]:
print 'awesome!!!'.strip('!')


awesome

Getting rid of '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'


In [46]:
from string import punctuation

In [47]:
punctuation


Out[47]:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [48]:
'awesome!!!!'.strip(punctuation)


Out[48]:
'awesome'

In [49]:
'awesome?!?!'.strip(punctuation)


Out[49]:
'awesome'

In [50]:
'awesome!!!! party'.strip(punctuation)


Out[50]:
'awesome!!!! party'

In [51]:
print 'awesome!!! party'
print 'awesome!!! party'.replace('!','')


awesome!!! party
awesome party

In [52]:
word = 'awesome!!!'

word_processed = word.strip('!')
word_processed = word_processed.lower()

print word_processed


awesome

In [53]:
word = 'awesome!!!'
word_processed = word.strip('!').lower()
print word_processed


awesome

In [54]:
for word in words:
    word_processed = word.lower()
    word_processed = word_processed.strip(punctuation)
    if word_processed in positive_words:
        print word + ' is a positive word'


delightful is a positive word
Awesome!!! is a positive word

It worked!!!
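
Since the same two cleaning steps will be needed again, one possible sketch wraps them in a small helper. The name clean_word is an assumption for illustration, not something defined elsewhere in this notebook.

```python
from string import punctuation

def clean_word(word):
    '''Lower-case a word and strip punctuation from both ends.'''
    return word.lower().strip(punctuation)

print(clean_word('Awesome!!!'))
print(clean_word('cafeteria.'))
# Punctuation inside a word is kept, because .strip() only trims the ends
print(clean_word("don't"))
```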

But what we really care about is the count of words.


In [55]:
positive_counter = 0

for word in words:
    word_processed = word.lower()
    word_processed = word_processed.strip(punctuation)
    if word_processed in positive_words:
        positive_counter = positive_counter + 1
print positive_counter


2
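
The same count can be written more compactly with sum() and a generator expression. This is only a sketch of an alternative style; it repeats the word list and tweet so that it runs on its own.

```python
from string import punctuation

positive_words = ['awesome', 'good', 'nice', 'super', 'fun', 'delightful', 'like']
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'

# Add 1 for every cleaned word that appears in the positive list
positive_counter = sum(
    1 for word in tweet.split()
    if word.lower().strip(punctuation) in positive_words
)
print(positive_counter)
```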

Let's do this for real.

  • Get a real list of affect words.
  • Get a real list of tweets.
  • Output the results to a csv file for additional analysis.

LIWC is what all the cool kids use, but their list is copyrighted. So we'll use lists of positive and negative words from:

Theresa Wilson, Janyce Wiebe and Paul Hoffmann (2005). "Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis." Proceedings of HLT/EMNLP 2005, Vancouver, Canada.


In [95]:
negative_file = open('negative.txt', 'r').read()

The basic way to open and read a text file.

'r' tells the operating system you want permission to read the file.
.read() tells Python to import all the text.
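
The one-liner above never closes the file explicitly. A common alternative is a with block, which closes the file when the block ends. The sketch here is self-contained, so it first writes a tiny stand-in file; sample_words.txt and its three words are assumptions for the demo, not the real negative.txt.

```python
import os
import tempfile

# Hypothetical stand-in for negative.txt so the sketch runs anywhere
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_words.txt')
with open(sample_path, 'w') as handle:
    handle.write('abandoned\nabandonment\naberration\n')

# The with block closes the file automatically, even if an error occurs
with open(sample_path, 'r') as handle:
    sample_text = handle.read()

print(repr(sample_text))
print(handle.closed)
```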

Let's take a look at negative_file by slicing it up.


In [57]:
negative_file[:50]


Out[57]:
'abandoned\nabandonment\naberration\naberration\nabhorr'

\n is the end-of-line (newline) character.

[:50] took the first 50 characters.


In [58]:
negative_list = negative_file.splitlines()

In [59]:
negative_list[:5]


Out[59]:
['abandoned', 'abandonment', 'aberration', 'aberration', 'abhorred']

Python starts counting at 0.


In [60]:
print negative_list[0]


abandoned

In [61]:
print negative_list[0:1]


['abandoned']

In [62]:
print negative_list[-5:]


['wrought', 'yawn', 'zealot', 'zealous', 'zealously']
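
The slicing rules are easier to see on a tiny list. A sketch with a made-up list of letters: a single index returns one item, a slice returns a list, the end position is excluded, and negative numbers count from the end.

```python
letters = ['a', 'b', 'c', 'd', 'e']

first = letters[0]            # a single item
first_as_list = letters[0:1]  # a slice is always a list, even with one item
middle = letters[1:3]         # positions 1 and 2; position 3 is excluded
last_two = letters[-2:]       # negative numbers count from the end
last = letters[-1]

print(first)
print(first_as_list)
print(middle)
print(last_two)
print(last)
```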

Let's do it again for the positive words.


In [63]:
postive_file = open('positive.txt', 'r').read()
postive_list = postive_file.splitlines()

Quiz: How many words are in the two lists combined?


In [63]:


In [64]:
print len(postive_list) + len(negative_list)


6135
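
One caveat about that total: len() counts every line, and the negative list contains at least one duplicate ('aberration' appears twice in the first five entries shown earlier). A sketch of counting distinct words with a set, using only those five entries:

```python
# The first five entries of the negative list, as printed earlier
sample = ['abandoned', 'abandonment', 'aberration', 'aberration', 'abhorred']

total = len(sample)          # counts every entry, duplicates included
distinct = len(set(sample))  # counts distinct words only

print(total)
print(distinct)
```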

A while back, I used the Twitter API to get some Tweets, or status updates, that mentioned President Obama.


In [65]:
obama_tweets = open('obama_tweets.txt', 'r').read()
obama_tweets = obama_tweets.splitlines()

Avoid copying and pasting the same code! Make a function!


In [66]:
def open_list(filename):
    list_file = open(filename, 'r').read()
    list_file = list_file.splitlines()
    return list_file

obama_tweets = open_list('obama_tweets.txt')

More slicing


In [67]:
obama_tweets[:5]


Out[67]:
['Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.',
 'In his teen years, Obama has been known to use marijuana and cocaine.',
 'IPA Congratulates President Barack Obama for Leadership Regarding JOBS Act: WASHINGTON, Apr 05, 2012 (BUSINESS W... http://t.co/8le3DC8E',
 'RT @Professor_Why: #WhatsRomneyHiding - his connection to supporters of Critical Race Theory.... Oh wait, that was Obama, not Romney...',
 'RT @wardollarshome: Obama has approved more targeted assassinations than any modern US prez; READ & RT: http://t.co/bfC4gbBW']

Some from the middle?


In [68]:
print obama_tweets[52:55]


["Barack Obama President Ronald Reagan's Initial Actions Project: President Ronald Reagan was also facing an econo... http://t.co/8Go8oCpf", 'RT @TXGaryM: Yes #WHFail RT @jltho: This #WhatsRomneyHiding hashtag is entertaining. Is this another social media backfire from the Obama administration?', 'Barack Obama LONGBOARD Package CORE 7" TRUCKS 76mm BIGFOOT WHEELS: The newest addition to the Bigfoot Collection... http://t.co/cnHRuUBZ']

That's ugly.


In [69]:
for tweet in obama_tweets[52:55]:
    print tweet


Barack Obama President Ronald Reagan's Initial Actions Project: President Ronald Reagan was also facing an econo... http://t.co/8Go8oCpf
RT @TXGaryM: Yes #WHFail RT @jltho: This #WhatsRomneyHiding hashtag is entertaining. Is this another social media backfire from the Obama administration?
Barack Obama LONGBOARD Package CORE 7" TRUCKS 76mm BIGFOOT WHEELS: The newest addition to the Bigfoot Collection... http://t.co/cnHRuUBZ

Let's get going!!!


In [70]:
#loop, but don't go through everything yet
for tweet in obama_tweets[:5]:
    print tweet
    positive_counter=0
    #Lower case everything
    tweet_processed=tweet.lower()
    
    #split by ' ' into a list of words
    words = tweet_processed.split()
    
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        
        if clean_word in postive_list:
            print clean_word
            positive_counter=positive_counter+1
    
    print positive_counter,len(words)


Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.
nice
1 16
In his teen years, Obama has been known to use marijuana and cocaine.
0 13
IPA Congratulates President Barack Obama for Leadership Regarding JOBS Act: WASHINGTON, Apr 05, 2012 (BUSINESS W... http://t.co/8le3DC8E
0 17
RT @Professor_Why: #WhatsRomneyHiding - his connection to supporters of Critical Race Theory.... Oh wait, that was Obama, not Romney...
0 19
RT @wardollarshome: Obama has approved more targeted assassinations than any modern US prez; READ & RT: http://t.co/bfC4gbBW
modern
1 17

Your turn. Add a negative_counter!


In [70]:


In [71]:
#loop, but don't go through everything yet
for tweet in obama_tweets[:5]:

    positive_counter = 0
    negative_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    
    #split by ' ' into a list of words
    words = tweet_processed.split()
    
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        
        if clean_word in postive_list:
            positive_counter = positive_counter + 1
        elif clean_word in negative_list:
            negative_counter = negative_counter + 1
    
    print positive_counter, negative_counter, len(words)


1 0 16
0 0 13
0 0 17
0 0 19
1 0 17
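
The body of that loop could also be packaged as a function, in the same spirit as open_list. This is a sketch only: score_tweet is a hypothetical name, and the two short word lists are stand-ins for the real files.

```python
from string import punctuation

def score_tweet(tweet, positive_list, negative_list):
    '''Return (positive count, negative count, number of words) for one tweet.'''
    positive_counter = 0
    negative_counter = 0
    words = tweet.lower().split()
    for word in words:
        clean_word = word.strip(punctuation)
        if clean_word in positive_list:
            positive_counter = positive_counter + 1
        elif clean_word in negative_list:
            negative_counter = negative_counter + 1
    return positive_counter, negative_counter, len(words)

# Tiny stand-in word lists (assumptions, not the real files)
demo_positive = ['nice', 'modern']
demo_negative = ['lame']

demo_tweet = 'Obama has called the GOP budget social Darwinism. Nice try, but they believe in social creationism.'
demo_score = score_tweet(demo_tweet, demo_positive, demo_negative)
print(demo_score)
```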

Now let's get this out of Python


In [72]:
import csv

In [73]:
csv_file = open('tweet_sentiment.csv','w')
csv_writer = csv.writer( csv_file )

In [74]:
#loop, but don't go through everything yet
for tweet in obama_tweets[:5]:

    positive_counter = 0
    negative_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    
    #split by ' ' into a list of words
    words = tweet_processed.split()
    
    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        
        if clean_word in postive_list:
            positive_counter = positive_counter + 1
        elif clean_word in negative_list:
            negative_counter = negative_counter + 1
    
    csv_writer.writerow( [positive_counter, negative_counter, len(words)] )

csv_file.close()

Let's look at the results. If !cat doesn't work, try !type which is the Windows equivalent.


In [75]:
!cat tweet_sentiment.csv


1,0,16
0,0,13
0,0,17
0,0,19
1,0,17

Why only 5 rows?


In [76]:
csv_file = open('tweet_sentiment.csv','w')
csv_writer = csv.writer( csv_file )

#loop
for tweet in obama_tweets[:]:
    positive_counter = 0
    negative_counter = 0
    #Lower case everything
    tweet_processed = tweet.lower()
    
    #split by ' ' into a list of words
    words = tweet_processed.split()

    #Loop through each word in the tweet
    for word in words:
        clean_word = word.strip(punctuation)
        
        if clean_word in postive_list:
            positive_counter = positive_counter + 1
            
        if clean_word in negative_list:
            negative_counter = negative_counter + 1

    csv_writer.writerow( [ positive_counter, negative_counter, len(words)] )

csv_file.close()

The csv file can be read in other programs.

That's not Python

If you wanted to add a header, you could do so by putting

csv_writer.writerow( [ 'positive', 'negative', 'length'] )

before you start writing the values.
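
Here is a small end-to-end sketch of writing a header plus rows and reading them back. It is written for Python 3, where the csv documentation recommends opening the file with newline=''; in the Python 2 used in this notebook the file would be opened in 'wb' mode instead. The file name tweet_sentiment_demo.csv and the two rows are placeholders for the demo.

```python
import csv
import os
import tempfile

rows = [[1, 0, 16], [0, 0, 13]]  # two example result rows

# Hypothetical output file in a temporary folder
out_path = os.path.join(tempfile.mkdtemp(), 'tweet_sentiment_demo.csv')

with open(out_path, 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['positive', 'negative', 'length'])  # header first
    csv_writer.writerows(rows)                               # then the values

# Read the file back; note that every value comes back as a string
with open(out_path, 'r', newline='') as csv_file:
    read_back = list(csv.reader(csv_file))

for row in read_back:
    print(row)
```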

In case you're wondering, I would probably write my code a little differently.


In [94]:
def clean_split(tweet):
    ''' Take sentence and return cleaned list of words '''
    return [word.strip(punctuation) for word in tweet.lower().split()]

#turn the lists into sets so we can do intersections
#note: an intersection counts each distinct word once, even if it repeats in a tweet
postive_set  = set(postive_list)
negative_set = set(negative_list)


sentiment = []
for tweet in obama_tweets:
    words = clean_split(tweet)
    postive_counter =  len( postive_set.intersection(words) )
    negative_counter = len( negative_set.intersection(words) )
    sentiment.append ( [ postive_counter , negative_counter, len(words)] )

with open('tweet_sentiment_2.csv','w') as csv_file:
    csv_writer = csv.writer( csv_file )
    csv_writer.writerows(sentiment)
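
One thing to keep in mind with the set version is that it does not always give the same numbers as the earlier loop. A sketch with made-up words shows the difference when a word repeats within a tweet:

```python
positive_set = set(['good', 'nice'])
words = ['good', 'good', 'nice', 'food']

# Set intersection: each distinct matching word counts once
distinct_matches = len(positive_set.intersection(words))

# Looping over the words: every occurrence counts
total_matches = sum(1 for word in words if word in positive_set)

print(distinct_matches)
print(total_matches)
```

Which count is appropriate depends on whether repeated words should add to the score.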

Some takeaways

  1. That's most of what you need to know in Python.
    1. List comprehension
    2. Functions/Classes
  2. Start simple.
    1. Develop a solution for one case
    2. Scale up (so make sure your solution from step 1 is generalizable).