It also includes many easy-to-use datasets in the nltk.corpus package; for example, we can download the movie_reviews dataset using the nltk.download function:
In [1]:
import nltk
In [2]:
nltk.download("movie_reviews")
Out[2]:
You can also list and download other datasets interactively by typing:
nltk.download()
in the Jupyter Notebook.
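If you are not sure whether a corpus is already available locally, a minimal sketch like the following (assuming the standard corpora/ resource path used by nltk.data) can check for it first; nltk.data.find raises a LookupError when the resource is missing:
In [ ]:
# Sketch: look up the corpus in the local NLTK data folders
# and download it only if it is not found.
try:
    nltk.data.find("corpora/movie_reviews")
    print("movie_reviews already downloaded")
except LookupError:
    nltk.download("movie_reviews")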
Once the data have been downloaded, we can import them from nltk.corpus
In [ ]:
from nltk.corpus import movie_reviews
The fileids method provided by all the datasets in nltk.corpus gives access to a list of all the files available.
In particular, the movie_reviews dataset contains 2000 text files, each of which is a review of a movie. They are already split into a neg folder for the negative reviews and a pos folder for the positive reviews:
In [ ]:
len(movie_reviews.fileids())
In [ ]:
movie_reviews.fileids()[:5]
In [ ]:
movie_reviews.fileids()[-5:]
fileids can also filter the available files based on their category, which is the name of the subfolder they are located in. Therefore we can get separate lists of positive and negative reviews.
In [ ]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')
In [ ]:
len(negative_fileids), len(positive_fileids)
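The category labels accepted by fileids come from the corpus itself; since movie_reviews is a categorized corpus, we can also list them directly with its categories method:
In [ ]:
# The subfolder names double as category labels.
movie_reviews.categories()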
We can inspect one of the reviews using the raw method of movie_reviews. Each file is split into sentences, and the curators of this dataset also removed any direct mention of the movie's rating from each review.
In [ ]:
print(movie_reviews.raw(fileids=positive_fileids[0]))
In [ ]:
romeo_text = """Why then, O brawling love! O loving hate!
O any thing, of nothing first create!
O heavy lightness, serious vanity,
Misshapen chaos of well-seeming forms,
Feather of lead, bright smoke, cold fire, sick health,
Still-waking sleep, that is not what it is!
This love feel I, that feel no love in this."""
The first step in Natural Language Processing is generally to split the text into words. This process might appear simple, but handling all the corner cases is very tedious; see for example all the punctuation issues we have to solve if we just start with a split on whitespace:
In [ ]:
romeo_text.split()
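To make the punctuation problem concrete, a quick sketch on the romeo_text defined above picks out the whitespace tokens that still carry punctuation attached to them:
In [ ]:
# Sketch: whitespace tokens with punctuation glued to them,
# e.g. "love!" or "health,", would need extra cleanup.
import string
[token for token in romeo_text.split() if token[-1] in string.punctuation]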
nltk has a sophisticated word tokenizer trained on English, named punkt. We first have to download its parameters:
In [ ]:
nltk.download("punkt")
Then we can use the word_tokenize function to properly tokenize this text; compare it to the whitespace splitting we used above:
In [ ]:
romeo_words = nltk.word_tokenize(romeo_text)
In [ ]:
romeo_words
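As a quick comparison (a small sketch, not part of the original analysis), we can count how many tokens each approach produces; word_tokenize yields more tokens because punctuation is split off into separate items:
In [ ]:
# The tokenizer separates punctuation into its own tokens,
# so it produces more items than a plain whitespace split.
len(romeo_text.split()), len(romeo_words)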
The good news is that the movie_reviews corpus already provides direct access to tokenized text through the words method:
In [ ]:
movie_reviews.words(fileids=positive_fileids[0])
The simplest model for analyzing text is to treat it as an unordered collection of words (bag-of-words). This generally allows us to infer the category, topic, or sentiment of the text.
From the bag-of-words model we can build features to be used by a classifier. Here we assume that each word is a feature that can either be True or False.
We implement this in Python as a dictionary where we associate True with each word in a sentence; if a word is missing, that is equivalent to assigning it False.
In [ ]:
{word:True for word in romeo_words}
In [ ]:
type(_)
In [ ]:
def build_bag_of_words_features(words):
    return {word: True for word in words}
In [ ]:
build_bag_of_words_features(romeo_words)
This is what we wanted, but we notice that punctuation like "!" and words useless for classification purposes, like "of" or "that", are also included.
Those words are called "stopwords", and nltk has a convenient corpus of them we can download:
In [ ]:
nltk.download("stopwords")
In [ ]:
import string
In [ ]:
string.punctuation
Using the punctuation characters from Python's string module and the English stopwords, we can build better features by filtering out the words that would not help in the classification:
In [ ]:
useless_words = nltk.corpus.stopwords.words("english") + list(string.punctuation)
#useless_words
#type(useless_words)
In [ ]:
def build_bag_of_words_features_filtered(words):
    return {
        word: 1 for word in words
        if word not in useless_words}
In [ ]:
build_bag_of_words_features_filtered(romeo_words)
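As a quick sanity check (a small sketch), we can compare how many features survive the filtering on the Romeo text:
In [ ]:
# Number of features before and after removing stopwords and punctuation.
len(build_bag_of_words_features(romeo_words)), len(build_bag_of_words_features_filtered(romeo_words))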
It is common to explore a dataset before starting the analysis. In this section we will find the most common words and plot their frequency.
Using the .words() function with no argument, we can extract the words from the entire dataset and check that there are about 1.6 million of them.
In [ ]:
all_words = movie_reviews.words()
len(all_words)/1e6
First we want to filter out the useless_words defined in the previous section; this will reduce the length of the dataset by more than a factor of 2:
In [ ]:
filtered_words = [word for word in movie_reviews.words() if word not in useless_words]
type(filtered_words)
In [ ]:
len(filtered_words)/1e6
The collections module of the standard library contains a Counter class that is handy for counting the frequencies of the words in our list:
In [ ]:
from collections import Counter
word_counter = Counter(filtered_words)
It also has a most_common() method to access the words with the highest counts:
In [ ]:
most_common_words = word_counter.most_common()[:10]
In [ ]:
most_common_words
Then we would like to visualize this using matplotlib.
First we use the Jupyter magic function
%matplotlib inline
to set up the Notebook so that plots are shown embedded in the Jupyter Notebook page. You can also try:
%matplotlib notebook
for a more interactive plotting interface, which however is not as well supported on all platforms and browsers.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
We can sort the word counts and plot their values on logarithmic axes to check the shape of the distribution. This visualization is particularly useful when comparing two or more datasets: a flatter distribution indicates a large vocabulary, while a more peaked distribution indicates a restricted vocabulary, often due to a focused topic or specialized language.
In [ ]:
sorted_word_counts = sorted(list(word_counter.values()), reverse=True)
plt.loglog(sorted_word_counts)
plt.ylabel("Freq")
plt.xlabel("Word Rank");
Another related plot is the histogram of sorted_word_counts, which displays how many words have a count in a specific range.
Of course the distribution is highly peaked at low counts, i.e. most of the words appear with a low count, so it is better to display it on semilogarithmic axes to inspect the tail of the distribution.
In [ ]:
plt.hist(sorted_word_counts, bins=50);
In [ ]:
plt.hist(sorted_word_counts, bins=50, log=True);
Using our build_bag_of_words_features_filtered function we can build the negative and positive features separately.
Basically, for each of the 1000 negative and 1000 positive reviews, we create one dictionary of the words and associate the label "neg" or "pos" with it.
In [ ]:
negative_features = [
    (build_bag_of_words_features_filtered(movie_reviews.words(fileids=[f])), 'neg')
    for f in negative_fileids
]
In [ ]:
print(negative_features[3])
In [ ]:
positive_features = [
    (build_bag_of_words_features_filtered(movie_reviews.words(fileids=[f])), 'pos')
    for f in positive_fileids
]
In [ ]:
print(positive_features[6])
In [ ]:
from nltk.classify import NaiveBayesClassifier
One of the simplest supervised machine learning classifiers is the Naive Bayes Classifier. It can be trained on 80% of the data to learn which words are generally associated with positive or with negative reviews.
In [ ]:
split = 800
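The value 800 corresponds to 80% of the 1000 reviews in each class; a small sketch makes the relationship explicit:
In [ ]:
# Sketch: 80% of the reviews in each class are used for training,
# the remaining 20% are kept aside as test data.
split = int(0.8 * len(positive_fileids))
split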
In [ ]:
sentiment_classifier = NaiveBayesClassifier.train(positive_features[:split]+negative_features[:split])
After training, we can check the accuracy on the training set, i.e. the same data used for training; we expect this to be a very high number because the algorithm has already "seen" those data. Accuracy is the fraction of the data that is classified correctly; we can turn it into a percentage:
In [ ]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[:split]+negative_features[:split])*100
The accuracy above is mostly a check that nothing went very wrong in the training; the real measure of performance is on the remaining 20% of the data that wasn't used in training, the test data:
In [ ]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[split:]+negative_features[split:])*100
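To make the definition of accuracy concrete, here is a small sketch that recomputes it by hand on the test data, classifying each review and counting the fraction of correct labels; it should match the number returned by nltk.classify.util.accuracy:
In [ ]:
# Sketch: accuracy = fraction of test reviews whose predicted label
# matches the true label.
test_features = positive_features[split:] + negative_features[split:]
correct = sum(1 for features, label in test_features
              if sentiment_classifier.classify(features) == label)
correct / len(test_features) * 100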
Accuracy here is around 70%, which is pretty good for such a simple model, considering that the estimated accuracy for a person is about 80%. We can finally print the most informative features, i.e. the words that best identify a positive or a negative review:
In [ ]:
sentiment_classifier.show_most_informative_features()
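Finally, the trained classifier can be applied to any new text by reusing the same feature extraction pipeline; the snippet below is a sketch with a made-up example sentence:
In [ ]:
# Sketch: tokenize a new (made-up) review, build the same filtered
# bag-of-words features, and ask the classifier for a label.
new_review = "A wonderful film with a brilliant cast and a clever plot."
new_features = build_bag_of_words_features_filtered(nltk.word_tokenize(new_review))
sentiment_classifier.classify(new_features)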