This notebook is a copy of a multiple blog post tutorial written by Dr. Neal Caren (University of North Carolina, Chapel Hill). The format was modified to fit into a Jupyter Notebook, ported from python2 to python3, and adjusted to meet the goals of this class.
Please read though the notebook, execute the code cells (in order) and add answers to the questions.
This first section is from Part 1 of Dr. Neal Caren's Blog post. Here is a link to the original post:
http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/
Motivation
In September of 2011, Science magazine printed an article by Cornell sociologists Scott Golder and Michael Macy that examined how trends in positive and negative attitudes varied over the day and the week. To do this, they collected 500 million Tweets produced by more than two million people. They found fascinating daily and weekly trends in attitudes. It’s a great example of the sort of interesting things social scientists can do with online social network data. More generally, the growth of what computer scientists call “big data” presents social scientists with unique opportunities for researching old questions, along with empowering us to ask new questions. While some of this big data is only numbers, much of it also consists of text. Sociologists have long had tools to assist us in coding and analyzing dozens or even hundreds of text documents, but many of these tools are less useful when the number of documents is in the tens of thousands or millions. Every sociology professor, graduate student and undergraduate in the United States working together couldn’t code even the 1% daily sample of Tweets that Twitter provides free to researchers. Luckily, computer scientists have been working for quite a while on exactly this data problem - how do we collect, categorize and understand massive text databases?
It turns out that while the volume of data in a study such as Golder and Macy’s is intimidating, doing a project of this sort isn’t that complicated for the typical social scientist. The major challenges are (1) collecting and managing the data, (2) turning the text into numbers of some sort, and (3) analyzing the numbers. The third step involves techniques familiar to many quantitative researchers. Based on their supplementary file, it appears Golder and Macy used Stata to analyze the data.
Getting the Twitter data isn’t that difficult, although it does involve dealing with the Twitter Application Programming Interface, or API, a task most social scientist have not been trained to do. If you’re wondering, Facebook also has an API and you can use what are called “web scraping” techniques to gather data from blogs and other websites too. I’ll discuss these topics in other tutorials. (Note: the "I" referred to is Dr. Caren - consult his blog if you're interested!)
In this post and the next, I’ll walk through the basics of a popular way to convert text into meaningful numbers using the same analytic strategy that Golder and Macy used. While you can do this sort of analysis using one of several different programs or languages, one commonly used for this sort of quantitative text analysis is Python. It is free, used by millions (so there are lots of resources available), and relatively straightforward to learn. If you have a Mac, it’s already on your computer! There are no pull down menus in Python, though, so learning by fumbling around isn’t the best option. That’s what led me to write this series of tutorials.
This initial tutorial is aimed at social scientists who may be familiar with some statistical package like SPSS, Stata or SAS, but haven’t used Python. It walks through the basics of one type of text analysis using some sample text data, but swapping in your own data once you’ve got this up and running isn’t much harder.
Strings
Pretend that the first tweet you wanted to analyze was, “We have some delightful new food in the cafeteria. Awesome!!!” You might have a million more of these that you want to analyze, and we’ll get there eventually, but a good coding style is to start simple and then slowly add complexity. In this case, a simple way to start is with one tweet. To tell python about your tweet, type:
In [ ]:
tweet='We have some delightful new food in the cafeteria. Awesome!!!'
There are three things to note. The text is surrounded by a single quote (i.e. ') on each side. You can also use double quotes (i.e. ") or even triple single quotes (i.e. '''), but single quotes are the default Python style for entering a string.
To make sure that the tweet was typed correctly, you can just type the variable name:
In [ ]:
tweet
You can get almost the same response using a print
statement:
In [ ]:
print(tweet)
The only difference is that the first response was wrapped in single quotes and the second wasn’t. As a side note, the single quotes weren’t because you put them there. If you used double quotes, you would get the same thing:
In [ ]:
tweet="We have some delightful new food in the cafeteria. Awesome!!!"
tweet
Lists
Now, following Golder and Macy, we need to decide if this is a positive or negative opinion. If we had a large sample of the Tweets already coded by sentiment, we could try and figure out which words appeared more often in Tweets we considered positive, and which words appeared more often in Tweets we considered negative. In sociology, we want to predict whether the sentence is positive, negative, or neither, and we could use the presence or absence of words as predictors.
One straightforward way to approach the problem is to count the proportion of words that usually have a positive connotation and the proportion of words that have a negative connotation. This is a common analytic strategy in many fields, especially psychology. Golder and Macy’s Twitter study used the lists of positive and negative words that are part of the Linguistic Inquiry and Word Count (LIWC) project. This data is only available commercially, so I won’t include it this tutorial. There’s a similar dictionary that’s freely available, but we won’t use that just yet.
For now, you can just make your own list of positive words. We’ll swap in the official list before we are done. Off the top of my head, the words “good”, “nice”, “super”, and “fun” are words that I use when I’m trying to be positive. To put this list into Python:
In [ ]:
positive_words=['awesome','good','nice','super','fun']
“positive_words” is now the name of our list. There are only a few restrictions on what you can name your list (e.g., it can’t start with a number, or have spaces). To tell Python that we are creating a list, you put everything in brackets. Since the items in the list are strings, each goes in single quotes. The list form in Python is roughly analogous to a variable in statistical programs.
If you wanted to add an item to your list, you append
the list:
In [ ]:
positive_words.append('delightful')
positive_words
In this case, you start with the list name, followed by .append
, and then in parenthesis write the item that you want to add to your list. If you are adding a string, you put it in quotes. Otherwise, Python will think you are referencing something. Consider the following example. what error do you see? Fix it to include the word like in your list?
Question 1: Fix the following code so there is no error
In [ ]:
#FIX THIS CODE
positive_words.append(like)
positive_words
Now create a list of negative words and include; awful, lame, horrible, bad
Question 2: Create a list of negative words
In [ ]:
# create a list called negative_words and put your code here
negative_words=[]
negative_words
If you wanted to measure whether or not any emotion was expressed, you might create one list that combines all the positive and negative words. Rather than retyping them, you can just combine the lists with a plus sign:
In [ ]:
emotional_words=negative_words+positive_words
print(emotional_words)
From Strings to Lists
Later on, we’ll create a better list of positive and negative words, but for now let’s return to the original Tweet. The default strategy for this sort of analysis is to examine each word in the sentence on its own, regardless of word ordering. This is called a “bag of words” model. It has some obvious drawbacks (e.g. “This was not fun.” will show up as positive because of the presence of the word “fun” unless you somehow model it’s negation.), but, with a few tweaks, these models can be about as good at classification as an undergraduate RA.
Since our unit of analysis is the word and not the sentence, we want to split our sentence into words. We can do that by using the split
option:
In [ ]:
words= tweet.split(' ')
print(words)
Here, we’ve split our string “tweet“, making a cut every time there was a space. This new object is stored as “words“. As you can see from the results of the print command, the new object is displayed in brackets, so Python has created “words” as a list. In order to see how long many words are in the sentence, you use the len
(length) command, which will return the number of objects in a list.
In [ ]:
len(words)
This only works because we’ve split the sentence into a list of words. If we ask for the length of the original tweet, we get something less useful:
In [ ]:
len(tweet)
Python doesn’t know that you only care about words, so it defaulted to counting the number of characters.
Loops
Our first goal is to go through our list of words and see if any of them show up in our list of positive words. For starters, we can loop over each of the words in our sentence with a for
loop:
In [ ]:
for word in words:
print(word)
The for
tells Python that we are going to cycle through each element of a list. “word” is the name that I just made up that will hold each of the words. “in words” tells Python which list we want to iterate through, and the colon ends a line that declares a loop. Note that the second line is indented with a tab; others put four spaces. If you don’t indent, Python will report an error.
Question 3: Fix this loop
In [ ]:
#Fix this code too
for word in words:
print words
Conditionals
While this loop prints out each word (when the second line is appropriately tabbed), what we actually want to do is see if that word is in the list of positive (or negative) words.
In [ ]:
for word in words:
if word in positive_words:
print(word)
Here, we include a conditional: we only move to the print(word)
state if the value of “word” is in our list of positive words. So, the first time the loop cycles through and sees the value of “word” is “The”, so the loop skips the print(word)
line. Here, the if
line ends in a colon and the lines that should only occur if the conditions are met are doubled indented–once as a result of the for
and once because of the if
.
Not bad work. Why do you think the word “awesome” didn’t appear in the list?
Question 4: Why do you think the word “awesome” didn’t appear in the list?
//Put your answer here
We could make the print line a little more informative by adding some text that explains why it is randomly printing out the word “delightful”.
In [ ]:
for word in words:
if word in positive_words:
print(word+' is a positive word')
When Python sees a “+”, it attempts to combine the two items. In this case, since both “word” and “is a positive word” are strings, the result is a longer string. This is the same logic that we used above to combine the two lists of words to create a longer list. This also works for combining two or more numbers, but, you can’t use this strategy to combine a string and a number:
In [ ]:
2+'delightful'
Nicely, Python tells you the line number where there was a problem and a semi-informative error message: you were trying to combine a integer and string, which Python can’t do because it wouldn’t know what data type you would want to store it in. You could make this work by converting the number to a string :
In [ ]:
str(2)+'delightful'
You aren’t limited to combining just two items, any number of like objects can be put together with the +.
This section of the notebook is a copy of the second blog post tutorial written by Dr. Neal Caren (University of North Carolina, Chapel Hill). Here is a link to the original tutorial:
http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-2/
In this second part we will continue the example from Part 1 and compute the proportion of positive words vs. negative words in a tweet, after cleaning up the data a bit.
Preprocessing
You might have noticed in the previous example that while our loop matched “delightful”, it didn’t find “awesome”. Looking back at the list of words that printed when we printed every word in our tweet might provide some clues as to why this occurred. While we have “awesome” in our positive words list, we don’t have “Awesome!!!” and Python is looking for an exact match. In order to get the two versions to match, we would need to make the “A” lower case and remove the exclamation marks. This is called pre-processing, or "cleaning", the data. Shifting everything to lower case and stripping punctuation are the most common pre-processing tasks in natural language processing. Other common things to do are stemming words, which attempts to find the root of the word (e.g. “running” and “runs” both get reduced to “run”) and removing little words like “the”, “and”, or “if”, which are known as stop words.
Since removing capitalization and punctuation involves throwing away potentially meaningful variation, you should proceed with caution. For example, you might think that the “Awesome!!!” is different from “awesome”, that “WOW” is different from “wow”, or that “Cool!” is different from “Cool?”. In machine learning (a technique I will discuss in more detail at a later point), this is part of the art of “feature selection”. Social scientists have independent or explanatory variables that they use to explain their models, while computer scientists try to find the “features” with the most predictive power. In natural language processing, features can be more than the absence or presence of specific words. Word count, presence of parts of words, sentence complexity, use of the passive voice, presence of emoticons, or any other text attribute that can be expressed as a number can be included as a feature. I’m a fan of starting with just the words to get a baseline model, and then seeing if you can improve on it. And in this case, we don’t have punctuation or non-lower cases words coded in our list of emotional words, so the decision is made for us.
However, making strings lower case in Python is simple:
In [ ]:
tweet.lower()
So we either have to make it lower case when it is a full sentence, or we can do it to each individual word:
In [ ]:
for word in words:
print(word.lower())
Updating our loop, we still don’t find awesome yet:
In [ ]:
for word in words:
if word.lower() in positive_words:
print(word.lower()+' is a positive word')
This is because we have not removed the exclamation marks. If that was the only punctuation we wanted to remove, we could replace it with nothing:
In [ ]:
print(tweet.replace('!',''))
Python will let you use this technique to create a new string:
In [ ]:
tweet_noex=tweet.replace('!','')
print(tweet_noex)
Replace takes two options. The first is what you are looking for–in this case, the exclamation mark. The second is what you want to replace it with–in this case, nothing. As always, strings should be in quotation marks.
The new string could even have the same name as your old string:
In [ ]:
tweet=tweet.replace('!','')
print(tweet)
We’ve lost the original message, so this isn’t always the best policy. You might want to store your original string away some place for safe keeping, or create a new string name, such as “tweet_processed” that you update with each of your different preprocessing steps.
More than one string operation can be included in the same statement, so we could remove all the punctuation from the Tweet with something like:
In [ ]:
tweet='We have some delightful new food in the cafeteria. Awesome!!!'
tweet_processed=tweet.replace('!','').replace('.','')
print(tweet_processed)
You could even append the “.lower()” operation to this and do all the cleaning in one line, but you might have trouble figuring out what you did a month later when you come back to your code if you combine different types of operations. But, if you wanted to, you could put it together like this:
In [ ]:
tweet_processed=tweet.replace('!','').replace('.','')
tweet_processed=tweet_processed.lower()
words=tweet_processed.split(' ')
print(words)
The first line creates a new string tweet_processed that holds our original tweet minus the punctuation. Note that the second line has “tweet_processed” on both sides of the equal sign. If you kept “tweet.lower()” on the right hand side you would just be throwing away the punctuation stripping that you did in the first line.
While removing the period and exclamation mark work for this tweet, it isn’t a very good general solution, because it ignores the 30(!) other punctuation marks that could be used in a sentence. Since we want to develop a script that works more generally, we want to use a technique that can be flexible enough to handle more than periods and exclamation marks.
Importing Modules
Python has built-in all the punctuation you need to account for in all cases. You can access them by typing:
In [ ]:
from string import punctuation
print(punctuation)
Most of Python’s usefulness isn’t available to you when you start up the program. You need to selectively bring modules into memory. In this case, we are accessing the “string” module, which comes with your Python. Other modules are available from the web, and to do anything interesting with natural language processing, you’ll have to download and set some of them up.
Most of Python’s usefulness isn’t available to you when you start up the program. You need to selectively bring modules into memory. In this case, we are accessing the “string” module, which comes with your Python. Other modules are available from the web, and to do anything interesting with natural language processing, you’ll have to download and set some of them up.
There are faster and more elegant solutions, but a straightforward way to remove the punctuations is to loop through our new punctuation string and replace each instance in our sentence with nothing.
Just like a list, a string is an "iterable" object. This means that you can loop over it's parts using the for
command. Since a string is just a collection of chariceters (i.e. letters, numbers, spaces, puctuation, etc.), the for
command will pull out a single charicter for each iternation of the loop:
Like this:
In [ ]:
tweet_processed=tweet.lower()
for p in punctuation:
tweet_processed=tweet_processed.replace(p,'')
print(tweet_processed)
Now go back and look for our positive words
In [ ]:
for word in words:
if word in positive_words:
print(word+ ' is a positive word')
It worked! The first line created a new string that contained a lower-case version of the original Tweet. The second line began a loop over all the punctuation marks that could potentially be in the sentence. Since the punctuation item that we imported was originally stored as a string, we have to convert it to a list, which can happen on this same line. Python’s default splitting is between individual characters, which works perfectly here. The remainder of the script is the same as used above.
Putting it all together
The original quantity of interest was the fraction of positive words in the sentence. We already computed the denominator of the fraction when we computed the length of the string words using the len
command. One straightforward way to compute the numerator is with a counter that starts at zero and increases by one each time the loop finds a positive word.
In [ ]:
positive_counter = 0
for word in words:
if word in positive_words:
print(word+ ' is a positive word')
positive_counter=positive_counter+1
print(positive_counter)
positive_counter/len(words)
Question 1: So far our model only counts positive words. How would you include negative words in this model (more than one answer is fine)?
//Put your answer here
Question 2: What kinds of scientific questions can you ask using a "bag of words" measurements?
//Put your answer here
Question 3: List three or more limitations of the “bag of words” model. make sure you include an example sentense that would break the model. (you can start by repeating the “This was not fun” example from Part 1).
//Put your answer here
Question 4: Can you think of at least two ways to improve the "bag of words" model (for example by addressing the limitations in Question 2)? What are the pros and cons of your improvements?
//Put your answers here
Question 5: How is the "bag of words" model different from other scientific models we have discussed so far in class?
//Put your answer here
Question 6: What questions do you have, if any, about any of the topics discussed in this assignment after watching the video and reading the links?
// Put your answer here
Question 7: Do you have any further questions or comments about this material, or anything else that's going on in class?
// Put your answer here
Now, you just need to submit this assignment by uploading both notebooks to the course Desire2Learn web page. Go to the "Pre-class assignments" folder, find the dropbox link for Day 16, and upload it there.
In class we are going to have you use what you have learned to finish the "Bag of Words" model using real Tweet Data.
See you in class!
In [ ]: