The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier. [https://en.wikipedia.org/wiki/Bag-of-words_model]
In this tutorial [adapted from https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words] we'll create a bag-of-words representation of the dataset and use it as input to a machine learning algorithm.
In [ ]:
%cd C:/temp/
In [ ]:
import pandas as pd
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
"header=0" indicates that the first line of the file contains column names, "delimiter=\t" indicates that the fields are separated by tabs, and quoting=3 ignore doubled quotes
In [ ]:
print(train.columns.values)
print(train.shape)
We have 25,000 rows; let's check the first one.
In [ ]:
print train["review"][0]
First, we need to clean the text, removing the html markup. For this purpose, we'll use the Beautiful Soup library.
In [ ]:
from bs4 import BeautifulSoup
In [ ]:
example1 = BeautifulSoup(train["review"][0], "html.parser")
print(example1.get_text())
When considering how to clean the text, we should think about the data problem we are trying to solve. For many problems, it makes sense to remove punctuation. On the other hand, in this case, we are tackling a sentiment analysis problem, and it is possible that "!!!" or ":-(" could carry sentiment, and should be treated as words. In this tutorial, for simplicity, we remove the punctuation altogether.
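If you wanted to keep emoticons or repeated exclamation marks as tokens instead, one option (not used in this tutorial) is to match them explicitly before stripping the remaining non-letter characters. A minimal sketch, where letters_and_emoticons and emoticon_pattern are just illustrative names:
In [ ]:
import re
# Hypothetical alternative cleaning step: extract a few emoticon/punctuation
# tokens first, then remove the remaining non-letter characters as usual
emoticon_pattern = re.compile(r"(:-?\)|:-?\(|!{2,})")
def letters_and_emoticons(text):
    emoticons = emoticon_pattern.findall(text)    # e.g. [':-(', '!!!']
    letters = re.sub("[^a-zA-Z]", " ", text)      # strip everything else
    return letters.split() + emoticons            # keep the emoticons as extra tokens
print(letters_and_emoticons("What a waste of film :-( Terrible!!!"))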
To remove punctuation and numbers, we will use a package for dealing with regular expressions, called re, which is built into Python.
In [ ]:
import re
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",          # The pattern to search for (everything except letters)
                      " ",                  # The pattern to replace it with
                      example1.get_text())  # The text to search
print(letters_only)
We'll also convert our reviews to lower case and split them into individual words (a process called "tokenization")
In [ ]:
lower_case = letters_only.lower() # Convert to lower case
words = lower_case.split() # Split into words
print(words)
Finally, we need to decide how to deal with frequently occurring words that don't carry much meaning. Such words are called "stop words"; in English they include words such as "a", "and", "is", and "the". Conveniently, there are Python packages that come with stop word lists built in. Let's import a stop word list from the Python Natural Language Toolkit (NLTK).
In [ ]:
import nltk
#nltk.download() # Download text data sets, including stop words (A new window should open)
Now we can use nltk to get a list of stop words
In [ ]:
from nltk.corpus import stopwords # Import the stop word list
print(stopwords.words("english"))
Now we remove the stop words from our movie review:
In [ ]:
words = [w for w in words if not w in stopwords.words("english")]
print(words)
Now we have code to clean one review - but we still need to clean the other 25,000! To make our code reusable, let's wrap it in a function.
In [ ]:
def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #    a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    #
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space,
    #    and return the result
    return " ".join(meaningful_words)
At the end of the function we join the words back into one string. This makes the output easier to use in our Bag of Words model.
Now let's loop through and clean all of the training set at once (this might take a few minutes)
In [ ]:
num_reviews = train["review"].size
clean_train_reviews = []
for i in range(0, num_reviews):
    if (i + 1) % 2500 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    # Call our function for each one, and add the result to the list
    clean_train_reviews.append(review_to_words(train["review"][i]))
Now that we have our training reviews tidied up, how do we convert them to some kind of numeric representation for machine learning? One common approach is called a Bag of Words. The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. For example, consider the following two sentences:
Sentence 1: "The cat sat on the hat"
Sentence 2: "The dog ate the cat and the hat"
From these two sentences, our vocabulary is as follows:
{ the, cat, sat, on, hat, dog, ate, and }
To get our bags of words, we count the number of times each word occurs in each sentence. In Sentence 1, "the" appears twice, and "cat", "sat", "on", and "hat" each appear once, so the feature vector for Sentence 1 is:
{ the, cat, sat, on, hat, dog, ate, and }
Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 }
Similarly, the features for Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}
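We can check these counts with scikit-learn's CountVectorizer (the same class we use below); note that it lowercases the text and orders the vocabulary alphabetically, so the columns appear in a different order than in the hand-written list above. The names toy_vectorizer and toy_counts are just for illustration:
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(["The cat sat on the hat",
                                           "The dog ate the cat and the hat"])
print(toy_vectorizer.get_feature_names_out())  # vocabulary in alphabetical order: and, ate, cat, dog, hat, on, sat, the
print(toy_counts.toarray())                    # [[0 0 1 0 1 1 1 2]
                                               #  [1 1 1 1 1 0 0 3]]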
In the IMDB data, we have a very large number of reviews, which will give us a large vocabulary. To limit the size of the feature vectors, we should choose some maximum vocabulary size. Below, we use the 5000 most frequent words (remembering that stop words have already been removed).
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)
# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
print(train_data_features.shape)
Now that the Bag of Words model is trained, let's look at the vocabulary
In [ ]:
# Take a look at the words in the vocabulary
vocab = vectorizer.get_feature_names_out()
print(vocab)
In [ ]:
import numpy as np
# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist):
    print(count, tag)
At this point, we have numeric training features from the Bag of Words and the original sentiment labels for each feature vector, so let's do some supervised learning! Here, we'll use the Random Forest classifier. The Random Forest algorithm is included in scikit-learn (Random Forest uses many tree-based classifiers to make predictions, hence the "forest"). Below, we set the number of trees to 100 as a reasonable default value. More trees may or may not perform better (why?), but will certainly take longer to run. Likewise, the more features you include for each review, the longer this will take.
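If you're curious whether more trees actually help here, one optional (and fairly slow) check is to cross-validate a small and a large forest on the training features. A minimal sketch using scikit-learn's cross_val_score; forest_cv is just an illustrative name:
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
# Optional experiment (slow): 3-fold cross-validation accuracy for two forest sizes
for n_trees in (10, 100):
    forest_cv = RandomForestClassifier(n_estimators=n_trees, n_jobs=2)
    scores = cross_val_score(forest_cv, train_data_features, train["sentiment"], cv=3)
    print("%d trees: mean accuracy %.3f" % (n_trees, scores.mean()))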
First, we'll separate the dataset into a training and testing set for model evaluation.
In [ ]:
from sklearn.model_selection import train_test_split
random_state = np.random.RandomState(0)
X, y = train_data_features, train["sentiment"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25, random_state=random_state)
In [ ]:
print("Training the random forest...")
from sklearn.ensemble import RandomForestClassifier
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(n_estimators = 100, n_jobs=2)
from time import time
t0 = time()
# Fit the forest to the training set, using the bag of words as
# features and the sentiment labels as the response variable
forest = forest.fit(X_train, y_train)
print("... took %0.3fs" % (time() - t0))
Now we'll evaluate the performance of our classifier
In [ ]:
y_pred = forest.predict(X_test)
In [ ]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=['negative review', 'positive review']))
Now we'll do sentiment analysis of the reviews using AFINN-111, a well-known list of words rated for sentiment. See the description at http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
First, we'll load the data file and create a sentiment dictionary
In [ ]:
import pandas as pd
afinn = pd.read_csv("AFINN-111.txt", header=None, delimiter="\t")
sent_dict = dict(zip(afinn[0], afinn[1]))
print(sent_dict)
Next, we create a function to sum the sentiment associated with each word in a paragraph
In [ ]:
# Calculate the sentiment in the provided text
def sentiment_in_text(text, sent_dict):
    sentiment = 0.0
    words = text.split()
    for w in words:
        if w not in sent_dict:
            continue
        sentiment += float(sent_dict[w])
    return sentiment
We'll use our cleaned reviews in clean_train_reviews. Let's check the results on the first two.
Remember that negative values represent negative sentiments
In [ ]:
print(clean_train_reviews[0])
print(sentiment_in_text(clean_train_reviews[0], sent_dict))
print(clean_train_reviews[1])
print(sentiment_in_text(clean_train_reviews[1], sent_dict))
Why can this approach to sentiment analysis of movie reviews be problematic?
Remember that we always need to think about the context when doing data analysis
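For example, scoring words one at a time ignores negation and context: a review saying a movie is "not good" gets the same positive score as one saying it is "good" (assuming "not" itself has no entry in the word list). The sentences below are made up:
In [ ]:
# Negation flips the meaning, but not the word-level score
print(sentiment_in_text("good movie", sent_dict))
print(sentiment_in_text("not a good movie", sent_dict))  # same score as above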
Now we'll apply the function to the whole clean dataset
In [ ]:
sentiment_values = [sentiment_in_text(x, sent_dict) for x in clean_train_reviews] #This is a list comprehension expression
sentiment_values = np.array(sentiment_values) #We convert the list to a numpy array for easier manipulation
print(sentiment_values)
Then we'll convert these sentiment values into positive (1) and negative (0) labels, matching the ones in our dataset.
In [ ]:
y_pred_sent = [1 if x>0 else 0 for x in sentiment_values]
And we'll compare our results with the entire target vector (because we are not doing training at this point)
In [ ]:
print(metrics.classification_report(y, y_pred_sent, target_names=['negative review', 'positive review']))
Not bad for such a simple method. What can we say about the performance of our method? How can we improve the precision or recall?
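One simple lever is the decision threshold: instead of calling a review positive whenever its score is above 0, we can require a higher (or lower) total score, trading recall for precision on the positive class. A minimal sketch (the threshold values below are arbitrary):
In [ ]:
# Try a few cut-off values for the sentiment score
for threshold in (-5, 0, 5):
    y_pred_t = [1 if x > threshold else 0 for x in sentiment_values]
    print("threshold =", threshold)
    print(metrics.classification_report(y, y_pred_t,
                                        target_names=['negative review', 'positive review']))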
The sentiment_values that we just created could be used as an additional feature of the dataset for our classification task. Let's combine the bag of words with the sentiment values in an extended feature set. This could improve the classification performance, but could also be detrimental (why?), so let's check.
In [ ]:
#The bag-of-words is in the variable train_data_features
print(train_data_features.shape)
sentiment_values_matrix = sentiment_values.reshape(-1, 1)  # column vector with one value per review
print(sentiment_values_matrix.shape)
# numpy.hstack() stacks arrays in sequence horizontally (column-wise); the number of rows must match
X2 = np.hstack((sentiment_values_matrix, train_data_features))
print(X2.shape)
Now we can do classification again with our new feature set
In [ ]:
random_state = np.random.RandomState(0)
y = train["sentiment"]
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=.25, random_state=random_state)
In [ ]:
print("Training again the random forest...")
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100, n_jobs=2)
from time import time
t0 = time()
forest = forest.fit(X_train2, y_train2)
print("... took %0.3fs" % (time() - t0))
In [ ]:
y_pred2 = forest.predict(X_test2)
from sklearn import metrics
print(metrics.classification_report(y_test2, y_pred2, target_names=['negative review', 'positive review']))
Was the new feature set useful or not?
An important note about Random Forests: every time you train a Random Forest you will obtain a somewhat different forest (because it's random), so performance can differ between runs just because of the method, although the difference shouldn't be too big.
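If you need reproducible results, you can fix the forest's internal randomness by passing a seed through the random_state parameter; a minimal sketch (forest_fixed is just an illustrative name):
In [ ]:
# Fixing random_state makes repeated runs produce the same forest
forest_fixed = RandomForestClassifier(n_estimators=100, n_jobs=2, random_state=0)
forest_fixed = forest_fixed.fit(X_train2, y_train2)
print(metrics.classification_report(y_test2, forest_fixed.predict(X_test2),
                                    target_names=['negative review', 'positive review']))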
Note that when we use the Bag of Words for the test set, we only call "transform", not "fit_transform" as we did for the training set. In machine learning, you shouldn't use the test set to fit your model, otherwise you run the risk of overfitting. For this reason, we keep the test set off-limits until we are ready to make predictions.
In [ ]:
# Read the test data
test = pd.read_csv("testData.tsv", header=0, delimiter="\t", quoting=3)
# Verify that there are 25,000 rows and 2 columns
print(test.shape)
# Create an empty list and append the clean reviews one by one
num_reviews = len(test["review"])
clean_test_reviews = []
print("Cleaning and parsing the test set movie reviews...\n")
for i in range(0, num_reviews):
    if (i + 1) % 2500 == 0:
        print("Review %d of %d\n" % (i + 1, num_reviews))
    clean_review = review_to_words(test["review"][i])
    clean_test_reviews.append(clean_review)
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
# Re-fit the forest on the bag-of-words training features only, so the
# training and test feature sets match (the last forest above was trained
# on the extended feature set that also includes the sentiment values)
forest = RandomForestClassifier(n_estimators=100, n_jobs=2)
forest = forest.fit(train_data_features, train["sentiment"])
# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)
# Copy the results to a pandas dataframe with an "id" column and
# a "sentiment" column
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
# Use pandas to write the comma-separated output file
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )