Homework #5 - Using Bag of Words on the 2016 Presidential Race

This homework assignment expands upon our in-class work analyzing Twitter data, focusing on the 2016 presidential elections. Please make sure to get started early, and come by the instructors' office hours if you have any questions. Office hours and locations can be found in the course syllabus. IMPORTANT: While it's fine if you talk to other people in class about this homework - and in fact we encourage it! - you are responsible for creating the solutions for this homework on your own, and each student must submit their own homework assignment.

Name

// Put your name here

Off to the races

Let's see how positive the 2016 Presidential candidate race is so far. I have downloaded Twitter feed data using the tweepy module. You can download the Tweepy.ipynb example from the class D2L website to see how this was done, but it's not necessary to do so if you don't want to. We are going to make a bar graph of each of the candidate's tweets and see how positive their campaign is so far.

The first thing to do is review how we downloaded the Twitter files using the code from the in-class assignment.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
from string import punctuation
import urllib.request

In [ ]:
files=['negative.txt','positive.txt']
path='http://www.unc.edu/~ncaren/haphazard/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)

In [ ]:
pos_sent = open("positive.txt").read()
positive_words=pos_sent.split('\n')
neg_sent = open("negative.txt").read()
negative_words=neg_sent.split('\n')

For this homework we are going to use new data downloaded in the last week. The file names are:

<<twitter_names>>_tweets.txt

Where twitter_names is the Twitter name of the individual politician. The following Twitter names are available for download:


In [ ]:
twitter_names = [ 'BarackObama', 'realDonaldTrump','HillaryClinton', 'BernieSanders', 'tedcruz']

So, for example, the file name for Barack Obama is BarackObama_tweets.txt. The first step is to rewrite the above loop to download all five files and save them to your local directory.

NOTE: these files are no longer on Dr. Neal Caren's Website. We have posted them locally at MSU, at http://www.msu.edu/~colbrydi. As a result, the full URL and file name for Barack Obama is http://www.msu.edu/~colbrydi/BarackObama_tweets.txt
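
The naming pattern above can be sketched as follows. This only illustrates how the URL and the local file name are assembled; base_url and tweet_file_url are names made up for this sketch, and the download call itself is left for Step 1.

```python
# Sketch: build the download URL for each politician's tweet file.
# base_url and tweet_file_url are hypothetical names used only in this example.
base_url = 'http://www.msu.edu/~colbrydi/'
twitter_names = ['BarackObama', 'realDonaldTrump', 'HillaryClinton',
                 'BernieSanders', 'tedcruz']

def tweet_file_url(name):
    """Return (url, local_file_name) for one Twitter name."""
    file_name = name + '_tweets.txt'
    return base_url + file_name, file_name

for name in twitter_names:
    url, file_name = tweet_file_url(name)
    print(url, '->', file_name)
```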

Step 1: Download Data

Download the tweet data


In [ ]:
# Write your code to download the tweet files from each of the politicians.

Step 2: Create a tweetcount function

Let's see if we can make a function (let's call it tweetcount). The function should take a tweet string, positive_words, and negative_words as inputs and return a "positiveness" ratio - i.e., count all the positive words and subtract all of the negative words and divide by the total words in the tweet (HINT: See the pre-class and in-class assignments!). This value should be a number that ranges from -1.0 to 1.0.
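
As a rough illustration of the ratio only (not the full function), the sketch below scores a list of words that has already been lowercased, stripped of punctuation, and split. The tiny word lists and the name ratio_from_words are made up for this example, and returning 0.0 for an empty list is an assumption; your tweetcount still needs to do the cleaning and splitting itself.

```python
def ratio_from_words(words, positive, negative):
    """(positive count - negative count) / total words, for a cleaned word list."""
    if not words:  # assumption: an empty tweet scores 0.0
        return 0.0
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return (pos - neg) / len(words)

# Made-up word lists and tweet words, for illustration only.
toy_positive = ['good', 'great']
toy_negative = ['bad']
toy_words = ['a', 'great', 'day', 'with', 'bad', 'traffic',
             'but', 'good', 'food', 'overall']

# Two positive words, one negative word, ten words in total.
print(ratio_from_words(toy_words, toy_positive, toy_negative))
```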

I put in a "stub" function to help get you started. Just fill in your own code.

A code "stub" is a common way to outline a program without getting into all of the details. You can define what your functions need as inputs and outputs without needing to write the function code. It is common practice to have the stub return a placeholder value (0.5 in this case). You can then use your "stub" to test other code that will call your function.


In [ ]:
def tweetcount(tweet, positive_words, negative_words):
    #Put your code here and modify the return statement
    return 0.5

Let's test this on a simple tweet to make sure everything is working. (Note: this is sometimes called "unit testing", and is considered to be very good programming practice.) What answer do you expect?

Since there are two positive words (delightful and Awesome) and 10 words in the tweet, we expect to get a value of 0.2. If you do not get 0.2 then something is wrong in your code. Go back and fix it until you do get 0.2.


In [ ]:
tweetcount('We have some delightful new food in the cafeteria. Awesome!!!', positive_words, negative_words)

Step 3: Bug Fixing

During the class assignment someone noticed an interesting problem when working with real tweet data. Consider the following test. Can you identify the simple difference between this test and the previous test?


In [ ]:
tweetcount('We have some delightful new  food in the cafeteria. Awesome!!!', positive_words, negative_words)

I assume you got a different answer (maybe 0.18181818181818182)? Take a moment and see if you can figure out why.

Hopefully you noticed there is an extra space between the words "new" and "food." Why would this produce a different answer?

Let's break down the problem some more. Using code from the pre-class assignment, let's remove the punctuation and split the words for both tweets:


In [ ]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
tweet_processed=tweet.lower()
for p in punctuation:
    tweet_processed=tweet_processed.replace(p,'')
words=tweet_processed.split(' ')
print(len(words))

tweet = 'We have some delightful new  food in the cafeteria. Awesome!!!'
tweet_processed=tweet.lower()
for p in punctuation:
    tweet_processed=tweet_processed.replace(p,'')
words=tweet_processed.split(' ')
print(len(words))

That is interesting. The original tweet had 10 words (what we would expect) and the tweet with the double space has one extra word. Where did that come from? Well, let's look by printing the entire list instead of just the length (we can do this because the list is short).


In [ ]:
tweet = 'We have some delightful new food in the cafeteria. Awesome!!!'
tweet_processed=tweet.lower()
for p in punctuation:
    tweet_processed=tweet_processed.replace(p,'')
words=tweet_processed.split(' ')
print(words)

tweet = 'We have some delightful new  food in the cafeteria. Awesome!!!'
tweet_processed=tweet.lower()
for p in punctuation:
    tweet_processed=tweet_processed.replace(p,'')
words=tweet_processed.split(' ')
print(words)

See the empty string represented by the two quotation marks? It seems that the split() function is adding a "word" which is completely empty between the double spaces. If you think about it, it makes sense, but this can cause trouble. The same problem is occurring in both the positive_words and negative_words lists. You can check this with the following command.


In [ ]:
if '' in positive_words:
    print('empty string in positive words')
if '' in negative_words:
    print('empty string in negative words')

This means that any time there is a double space in a tweet, the resulting empty string will be counted as both a positive and a negative word. This may not be a problem for the "total" emotion (since the net is zero); however, the extra words inflate the total word count of the tweet. The best way to avoid this problem is to just remove the empty strings. For example:


In [ ]:
words.remove('')
print(words)
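
To see where the empty strings come from, the sketch below compares split(' '), which splits on every single space, with split() called with no argument, which treats any run of whitespace as one separator. It also shows the same effect for split('\n') on text that ends in a newline. That is one plausible reason '' shows up in the word lists, but it is an assumption, since it depends on how the files actually end. The strings here are made up for illustration.

```python
tweet = 'new  food'                      # two spaces between the words
single_space_split = tweet.split(' ')    # splits at every single space
whitespace_split = tweet.split()         # collapses runs of whitespace
print(single_space_split)                # ['new', '', 'food']
print(whitespace_split)                  # ['new', 'food']

text = 'good\ngreat\n'                   # made-up stand-in for a word file
newline_split = text.split('\n')
print(newline_split)                     # ['good', 'great', '']
```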

Fix your function above to take the empty strings into account. Make sure you consider all cases: for example, what happens if there is no double space? What happens if there is more than one?
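
As a hedged illustration of those cases, the sketch below uses made-up lists to show that list.remove('') deletes only the first empty string and raises a ValueError when there is none, while a list comprehension handles zero, one, or many empty strings. This is one possible approach, not the only one.

```python
words = ['we', '', 'have', '', '', 'food']   # made-up list with several empty strings

one_removed = list(words)
one_removed.remove('')                       # removes only the first ''
print(one_removed)

cleaned = [w for w in words if w != '']      # drops every ''
print(cleaned)

no_empties = ['we', 'have', 'food']          # made-up list without any ''
try:
    no_empties.remove('')                    # nothing to remove
except ValueError as err:
    print('remove() failed:', err)
```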


In [ ]:
#Put your new function code here.
def tweetcount(tweet, positive_words, negative_words):
    #Put your code here
    return 0.5

Test your code here and make sure it returns 0.2 for both tests:


In [ ]:
tweetcount('We have some delightful new food in the cafeteria. Awesome!!!', positive_words, negative_words)

In [ ]:
tweetcount('We   have  some    delightful   new  food in the cafeteria.  Awesome!!!', positive_words, negative_words)

Step 4: Total average Tweet

Now let's make a second function (let's call it average_tweet). The function should take tweets_list, positive_words, and negative_words as inputs, loop over all of the tweets in the tweets_list, calculate the tweetcount for each tweet, and store the value. The last step in this function will be to average all of the tweetcounts and return the average.


In [ ]:
def average_tweet(tweets_list, positive_words, negative_words):
    total = 0
    ## put your code here. Note that this is not a trick question. 
    ## You may be able to do this in just a couple lines of code.
    return total / len(tweets_list)

Assuming we wrote this function correctly we can use it to test the tweets in an entire file. For example, the following should work:


In [ ]:
tweets = open('BarackObama_tweets.txt').read()
tweets = tweets.split('\n')
average_tweet(tweets, positive_words, negative_words)
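
One thing to watch for with real files: split('\n') can leave empty strings in the list of tweets (for example after a trailing newline or a blank line), and a tweet with zero words would mean dividing by zero in a ratio like the one in Step 2. A minimal sketch of filtering out blank lines first, using a made-up string in place of a real tweet file:

```python
raw = 'First tweet here\n\nSecond tweet here\n'   # made-up stand-in for file contents

# Keep only lines that contain something other than whitespace.
nonblank_tweets = [line for line in raw.split('\n') if line.strip() != '']
print(len(nonblank_tweets), nonblank_tweets)
```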

Step 5: loop over twitter_names

Now create a loop. For each politician, load their tweets file, calculate the average_tweet, and append the result to a list (one average for each politician).


In [ ]:
average_tweets = []
# Put your loop here. Again, if we call the functions, this code should be fairly short.

Step 6: Generate a bar graph of candidates

Finally, use the plt.bar function to generate a bar graph of candidates. See if you can label the x axis, the y axis, and each individual bar with the candidate's name.


In [ ]:
plt.bar(range(len(average_tweets)), average_tweets);

Step 7: Make your graph readable

Okay, now let's practice making the graph easier to read by using Google to find examples and adjust your code. For example, please search for the Python code to make the following adjustments to a figure:

  1. Add labels for the x and y axes (i.e. "Candidates" and "Tweet Emotion Measure" would work)
  2. Set x and y axis font size to 24 point
  3. Rename each bar to the candidate's Twitter name and set the font size to 20
  4. Set the y axis font size to 20
  5. Make the figure approximately the same height but span more of the notebook from left to right
  6. Center the labels on each of the bars.
  7. Set the x and y axis ticks to a font size of 20
  8. Set the Republican politicians (Trump and Cruz) to red and the Democratic politicians to blue (Obama, Clinton, and Sanders).

Question 1: Type in some example search keywords that you successfully used to help find the solutions. I did the first two to help you get started.

Modify the following list

  1. How do I add x and y labels to a matplotlib figure
  2. How do I change the xlabel font size in matplotlib

Question 2: Using what you learned from Google, try to generate a new (more readable) bar plot following the guidelines outlined above. Your figure should look something like this (note the values on this bar chart were picked for illustration, not correctness):


In [ ]:
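#Put your code here
# Wide figure so the plot spans more of the notebook
fig, ax = plt.subplots(1, 1, figsize=(20, 10))
# Center the bars (and therefore the labels) on their x positions
barlist = plt.bar(range(len(average_tweets)), average_tweets, align='center');

# Label each bar with the candidate's Twitter name
plt.xticks(range(len(average_tweets)), twitter_names, fontsize=20);
plt.yticks(fontsize=20)
plt.xlabel('Candidate', fontsize=24)
plt.ylabel('Emotion', fontsize=24)
# Republicans (Trump at index 1, Cruz at index 4) in red
barlist[1].set_color('r')
barlist[4].set_color('r')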
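
Setting bars by index (1 and 4) works only while twitter_names keeps its current order. As an alternative sketch, the colors can be derived from the names themselves. The republicans set below is simply taken from item 8 in the list above, and bar_colors is a name made up for this example.

```python
twitter_names = ['BarackObama', 'realDonaldTrump', 'HillaryClinton',
                 'BernieSanders', 'tedcruz']
republicans = {'realDonaldTrump', 'tedcruz'}   # from item 8 of the assignment list

# One color per bar, in the same order as twitter_names.
bar_colors = ['red' if name in republicans else 'blue' for name in twitter_names]
print(bar_colors)
```

A list like this could then be passed to plt.bar through its color argument instead of recoloring individual bars afterwards.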

Question 3: In your own words, describe what these results say about the politicians.

// Write your answer here

Question 4: Only one snapshot of data was provided for this homework. How does this snapshot limit the questions you can ask of the data? What type of data would you like to gather to make stronger claims?

// Write your answer here

Question 5: What other scientific questions can you ask using this type of data and model? Write new code to generate a different plot that answers one of the scientific questions that you have devised. Feel free to be expressive when you design your own metrics (for example, you could use the only_positive, only_negative, both and neither counts from the in-class assignment, or you can go completely beyond this). The goal here is to come up with something interesting that can be expressed with this data. More creative questions and answers will be rewarded appropriately!


In [ ]:
# Put your code here.  Add additional cells as necessary!

Question 6: What does your new graph say about the data?

// Put your answer here


Feedback (required!)

How long did you spend on the homework?

// Write your answer here


Congratulations, you're done!

How to submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "Homework assignments" folder. There will be a dropbox labeled "Homework 5". Upload this notebook there.