Day 16 In-class assignment: Data analysis and Modeling in Social Sciences

Part 3

The first part of this notebook is a copy of a blog post tutorial written by Dr. Neal Caren (University of North Carolina, Chapel Hill). The format was modified to fit into a Jupyter notebook, the code was ported from Python 2 to Python 3, and the content was adjusted to meet the goals of this class. Here is a link to the original tutorial:

http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-3/

Student Names

//Put the names of everybody in your group here!

Learning Goals

Natural Language Processing can be tricky to model compared to physical processes that follow known mathematical rules. A large part of modeling is trying to understand a model's limitations and determining what can be learned from the model despite those limitations:

  • Apply what we have learned from the Pre-class notebooks to build a Twitter "bag of words" model on real Twitter data.
  • Introduce you to a method for downloading data from the Internet.
  • Gain practice doing string manipulation.
  • Learn how to make Pie Charts in Python.

This assignment explains how to expand the code written in your pre-class assignment so that you can use it to explore the positive and negative sentiment of any set of texts. Specifically, we’ll look at looping over more than one tweet, incorporating a more complete dictionary, and exporting the results.

Earlier, we used a small list of words to measure positive sentiment. While the study in Science used the commercial LIWC dictionary, an alternate sentiment dictionary is produced by Theresa Wilson, Janyce Wiebe, and Paul Hoffmann at the University of Pittsburgh and is freely available. In both cases, the sentiment dictionaries are used in a fairly straightforward way: the more positive words in the text, the higher the text scores on the positive sentiment scale. While this has some drawbacks, the method is quite popular: the LIWC database has over 1,000 citations in Google Scholar, and the Wilson et al. database has more than 600.
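
To make the method concrete, here is a minimal sketch of bag-of-words scoring using a tiny made-up word list and sentence (these are illustrations only, not the real dictionaries or data):


In [ ]:
# Toy illustration only: a made-up word list and sentence, not the real dictionaries
sample_positive_words = ['good', 'great', 'happy']
sample_text = 'the results look good and we are happy'
sample_words = sample_text.split(' ')
positive_count = 0
for word in sample_words:
    if word in sample_positive_words:
        positive_count = positive_count + 1
print(positive_count / len(sample_words))  # 2 positive words out of 8 words -> 0.25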

Do the following individually

First, load some libraries we will be using in this notebook.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
from string import punctuation

Downloading

Since the Wilson et al. list combines negative and positive polarity words in one list, and includes both words and word stems, Dr. Caren cleaned it up a bit for us. You can download the positive list and the negative list using your browser, but you don’t have to. Python can do that.

First, you need to import one of the modules that Python uses to communicate with the Internet:


In [ ]:
import urllib.request

As with many commands, Python won’t return anything unless something goes wrong. In this case, the In [*] should simply change to a number like In [2]. Next, store the web address that you want to access in a string. You don’t have to do this, but it’s the type of thing that makes your code easier to read and allows you to scale up quickly when you want to download thousands of urls.


In [ ]:
url='http://www.unc.edu/~ncaren/haphazard/negative.txt'

You can also create a string with the name you want the file to have on your hard drive:


In [ ]:
file_name='negative.txt'

To download and save the file:


In [ ]:
urllib.request.urlretrieve(url, file_name)

This will download the file into your current directory. If you want it to go somewhere else, you can put the full path in the file_name string. You didn’t have to enter the url and the file name in the prior lines. Something like the following would have worked exactly the same:


In [ ]:
urllib.request.urlretrieve('http://www.unc.edu/~ncaren/haphazard/negative.txt','negative.txt')

Note that the location and filename are both surrounded by quotation marks because you want Python to use this information literally; they aren’t referring to a string object, like in our previous code. This line of code is actually quite readable, and in most circumstances this would be the most efficient thing to do. But there are actually three files that we want to get: the negative list, the positive list, and the list of tweets. And we can download the three using a pretty simple loop:


In [ ]:
files=['negative.txt','positive.txt','obama_tweets.txt']
path='http://www.unc.edu/~ncaren/haphazard/'
for file_name in files:
    urllib.request.urlretrieve(path+file_name,file_name)

The first line creates a new list with three items, the names of the three files to be downloaded. The second line creates a string object that stores the url path that they all share. The third line starts a loop over each of the items in the files list using file_name to reference each item in turn. The fourth line is indented, because it happens once for each item in the list as a result of the loop, and downloads the file. This is the same as the original download line, except the URL is now the combination of two strings, path and file_name. As noted previously, Python can combine strings with a plus sign, so the result from the first pass through the loop will be http://www.unc.edu/~ncaren/haphazard/negative.txt, which is where the file can be found. Note that this takes advantage of the fact that we don’t mind reusing the original file name. If we wanted to change it, or if there were different paths to each of the files, things would get slightly trickier.
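
If you did want each file to have a different name on your hard drive, one way to handle it is to loop over pairs of remote and local names. This is just a hypothetical sketch; the local names below are made up for illustration, while the remote names must stay the same:


In [ ]:
# Hypothetical variant: give each downloaded file its own local name.
# The local names here are made up; the remote names must match the server.
name_pairs = {'negative.txt': 'negative_words.txt',
              'positive.txt': 'positive_words.txt',
              'obama_tweets.txt': 'tweets.txt'}
path = 'http://www.unc.edu/~ncaren/haphazard/'
for remote_name, local_name in name_pairs.items():
    urllib.request.urlretrieve(path + remote_name, local_name)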

More fun with lists

Let’s take a look at the list of Tweets that we just downloaded. First, open the file:


In [ ]:
tweets = open("obama_tweets.txt").read()

As you might have guessed, this line is actually doing double duty: it opens the file and reads it into memory before it is stored in tweets. Since the file has one tweet on each line, we can turn it into a list of tweets by splitting it at the end-of-line character. The file was originally created on a Mac, so the end-of-line character is a \n (think \n for new line). On a Windows computer, the end-of-line character is \r\n (think \r for return and \n for new line). So if the file had been created on a Windows computer, you might need to strip out the extra character with something like windows_file=windows_file.replace('\r','') before you split the lines, but you don’t need to worry about that here, no matter what operating system you are using: the end-of-line character comes from the computer that made the file, not the computer you are currently using. To split the tweets into a list:


In [ ]:
tweets_list = tweets.split('\n')
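
Just for reference, if the file had been created on a Windows machine, the cleanup mentioned above could be combined with the split in one line (hypothetical here, since our file doesn’t need it):


In [ ]:
# Only needed for files created on Windows (not our case):
# strip the extra \r before splitting on \n.
tweets_list = tweets.replace('\r', '').split('\n')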

As always, you can check how many items are in the list:


In [ ]:
len(tweets_list)

You can print the entire list by typing print(tweets_list), but it will be very long. A more useful way to look at it is to print just some of the items. Since it’s a list, we can loop through the first few items so that each one prints on its own line.


In [ ]:
for tweet in tweets_list[0:5]:
    print(tweet)

Note the new [0:5] after the tweets_list but before the : that begins the loop. The first number tells Python where to make the first cut in the list. The potentially counterintuitive part is that this number doesn’t reference an actual item in the list, but rather a position between each item in the list–think about where the comma goes when lists are created or printed. Adding to the confusion, the position at the start of the list is 0. So, in this case, we are telling Python we want to slice our list starting at the beginning and continuing until the fifth comma, which is after the fifth item in the list.

So, if you wanted to just print the second item in the list, you could type:


In [ ]:
print(tweets_list[1:2])

OR


In [ ]:
print(tweets_list[1])

This slices the list from the first comma to the second comma, so the result is the second item in the list. Unless you have a computer science background, this may be confusing as it’s not the common way to think of items in lists.

As a shorthand, you can leave out the first number in the pair if you want to start at the very beginning, or leave out the last number if you want to go until the end. So, if you want to print out the first five tweets, you could just type print(tweets_list[:5]). There are several other shortcuts along these lines; we will cover some of them in other tutorials.
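
If slicing is still confusing, it can help to experiment with a small throwaway list; the list below is just for practice:


In [ ]:
# A toy list just for practicing slicing
letters = ['a', 'b', 'c', 'd', 'e']
print(letters[0:2])   # ['a', 'b'] -- from the start up to the second cut
print(letters[1])     # 'b'        -- the single item in position 1
print(letters[:3])    # ['a', 'b', 'c'] -- shorthand for "start at the beginning"
print(letters[3:])    # ['d', 'e']      -- shorthand for "go until the end"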

Now that we have our tweet list expanded, let’s load up the positive sentiment list and print out the first few entries:


In [ ]:
pos_sent = open("positive.txt").read()
positive_words=pos_sent.split('\n')
print(positive_words[:10])

Like the tweet list, this file contained each entry on its own line, so it loads exactly the same way. If you typed len(positive_words) you would find out that this list has 2,230 entries.
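
You can check that for yourself:


In [ ]:
len(positive_words)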

Preprocessing

In the pre-class assignment, we explored how to preprocess the tweets: remove the punctuation, convert to lower case, and examine whether or not each word was in the positive sentiment list. We can use this exact same code here with our long list. The one alteration is that instead of having just one tweet, we now have a list of 1,365 tweets, so we have to loop over that list.


In [ ]:
for tweet in tweets_list:
    positive_counter=0
    tweet_processed=tweet.lower()
    for p in punctuation:
        tweet_processed=tweet_processed.replace(p,'')
    words=tweet_processed.split(' ')
    for word in words:
        if word in positive_words:
            positive_counter=positive_counter+1
    print(positive_counter/len(words))

Do the next part with your partner

If you saw a string of numbers roll past you, it worked! To review, we start by looping over each item of the list. We set up a counter to hold the running total of the number of positive words found in the tweet. Then we make everything lower case and store it in tweet_processed. To strip out the punctuation, we loop over every item of punctuation, swapping out the punctuation mark with nothing.

The cleaned tweet is then converted to a list of words, split at the white spaces. Finally, we loop through each word in the tweet, and if the word is in our new and expanded list of positive words, we increase the counter by one. After cycling through each of the tweet words, the proportion of positive words is computed and printed.

The major problem with this script is that it is currently useless. It prints the positive sentiment results, but then doesn’t do anything with them. A more practical solution would be to store the results somehow. In a standard statistical package, we would generate a new variable that held our results. We can do something similar here by storing the results in a new list. Before we start the tweet loop, we add the line:


In [ ]:
positive_counts=[]

Then, instead of printing the proportion, we can append it to the list. If we first store the number of words in the tweet with word_count = len(words), the appending command is:

positive_counts.append(positive_counter/word_count)

Step 1: Make a list of counts. Copy the loop from above and rewrite it so that it uses the append command instead of printing.


In [ ]:
#Put your code here

for tweet in tweets_list:
    positive_counter=0
    tweet_processed=tweet.lower()
    for p in punctuation:
        tweet_processed=tweet_processed.replace(p,'')
    words=tweet_processed.split(' ')
    word_count = len(words)
    for word in words:
        if word in positive_words:
            positive_counter=positive_counter+1
    positive_counts.append(positive_counter/word_count)

This time, running the loop shouldn't produce any output, but it will build up a list of the proportions. Let's do a quick check to confirm that we ended up with one score for each tweet:


In [ ]:
len(positive_counts)

The next step is to plot a histogram of the data to see the distribution of positive texts:

Step 2: make a histogram of the positive counts.


In [ ]:
#Put your code here
plt.hist(positive_counts, 100, facecolor='green');
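
If you want the plot to be easier to read, you can also add axis labels in the same cell (optional; the label text here is just a suggestion):


In [ ]:
# Optional: the same histogram with labeled axes
plt.hist(positive_counts, 100, facecolor='green')
plt.xlabel('Proportion of positive words in tweet')
plt.ylabel('Number of tweets')
plt.show()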

Step 3: Subtract negative values. Now redo the calculation from Step 1, but this time also subtract negative words (i.e., your measurement can now have a positive or negative value):


In [ ]:
#Put your code here
neg_sent = open("negative.txt").read()
negative_words=neg_sent.split('\n')
positive_counts=[]  # start a fresh list so the Step 1 scores are not mixed in
for tweet in tweets_list:
    positive_counter=0
    tweet_processed=tweet.lower()
    for p in punctuation:
        tweet_processed=tweet_processed.replace(p,'')
    words=tweet_processed.split(' ')
    word_count = len(words)
    for word in words:
        if word in positive_words:
            positive_counter=positive_counter+1
        if word in negative_words:
            positive_counter=positive_counter-1
    positive_counts.append(positive_counter/word_count)

Step 4: Generate positive/negative histogram. Generate a second histogram using range -5 to 5 and 20 bins.


In [ ]:
#Put your code here
plt.hist(positive_counts, 20, facecolor='green', range=[-5, 5]);

Another way to model the "bag of words" is to evaluate if the tweet has only positive words, only negative words, both positive and negative words or neither positive nor negative words. Rewrite your code to keep track of all four totals.

Step 5: Count "types" of tweets. Rewrite the code from Steps 1 & 3 to determine whether each tweet has only positive words, only negative words, both positive and negative words, or neither. Keep running totals of the number of each kind of tweet.


In [ ]:
only_positive=0
only_negative=0
both_pos_and_neg=0
neither_pos_nor_neg=0
#Put your code here.

for tweet in tweets_list:
    positive_counter=0
    negative_counter=0
    tweet_processed=tweet.lower()
    for p in punctuation:
        tweet_processed=tweet_processed.replace(p,'')
    words=tweet_processed.split(' ')
    for word in words:
        if word in positive_words:
            positive_counter=positive_counter+1
        if word in negative_words:
            negative_counter=negative_counter+1
    if positive_counter > 0:
        if negative_counter > 0:
            both_pos_and_neg=both_pos_and_neg+1
        else:
            only_positive=only_positive+1
    else:
        if negative_counter > 0:
            only_negative=only_negative+1
        else:
            neither_pos_nor_neg=neither_pos_nor_neg+1

Step 6: Check your answer. If everything went as planned, you should be able to add all four totals and it will be equal to the total number of tweets!


In [ ]:
#Run this code. It should output True. 
print(only_positive)
print(only_negative)
print(both_pos_and_neg)
print(neither_pos_nor_neg)
only_positive + only_negative + both_pos_and_neg + neither_pos_nor_neg == len(tweets_list)

Step 7: Make a pie chart of your results. Now we are going to plot the results using matplotlib's pie function. If you used the variable names above, this should just work.


In [ ]:
# The slices will be ordered and plotted counter-clockwise.
labels = 'positive', 'both', 'negative', 'neither'
sizes = [only_positive, both_pos_and_neg, only_negative, neither_pos_nor_neg]
colors = ['yellowgreen', 'yellow','red',  'lightcyan']
explode = (0.1, 0, 0.1, 0)  

plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90);
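
If the pie comes out looking squashed, you can redraw it with an equal aspect ratio so it is a true circle (optional):


In [ ]:
# Optional: force a 1:1 aspect ratio so the pie is drawn as a circle
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal');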

Feedback on this assignment

Question 1: What questions do you (or does your group) have after this assignment?

Put your answer here.

Question 2: What does this data tell you about the tweets?

Put your answer here.

Submit this assignment

Log into the course Desire2Learn website (d2l.msu.edu) and go to the "In-class assignments" and the "Day 16" Dropbox folder. Upload this notebook there. You only have to upload one notebook per pair or group - just make sure that everybody's name is at the top of the notebook!