Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/ It is based on a tutorial by Nils Witt (https://github.com/n-witt/MachineLearningWithText_SS2017)
This is a tutorial on training and evaluating a simple Naive Bayes classifier for a simple text classification problem. In this tutorial you will:
It is assumed that you have some general knowledge on
We will start with a small example of three SMS messages. The texts of the messages are: "call you tonight", "Call me a cab", "please call me... PLEASE!". In order to do text classification we need to convert the texts into feature vectors. We will follow a very simple approach here: split each text into words, lowercase the words, ignore punctuation, and count how often each word occurs.
All of this can easily be done with the CountVectorizer from the sklearn library.
In [1]:
simple_train = ['call you tonight', 'Call me a cab', 'please call me... PLEASE!']
In [2]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
In [3]:
# learn the 'vocabulary' of the training data
vect.fit(simple_train)
Out[3]:
In [4]:
# examine the fitted vocabulary
vect.get_feature_names()
Out[4]:
Have you noticed that all words are lower case now? And that we ignored punctuation? Whether this is a good idea depends on the application. E.g. for detecting emotions in texts, smileys (punctuation) might be a helpful feature. But for now, let's keep it simple.
Now we generate a document-term matrix. In this matrix each row corresponds to one document and each column to one feature. Entry $(i,j)$ tells us how often word $j$ occurs in document $i$.
Note: The "how often" is only true if we use the count vectorizer. Instead of word count there are many other possible features.
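For instance (a small illustration that is not part of the original notebook), setting binary=True makes the CountVectorizer record only whether a word occurs at all instead of how often:
# illustration: binary occurrence features instead of raw counts
vect_binary = CountVectorizer(binary=True)
vect_binary.fit_transform(simple_train).toarray()  # entries are 0/1 instead of counts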
From the scikit-learn documentation:
In this scheme, features and samples are defined as follows:
- Each individual token occurrence frequency (normalized or not) is treated as a feature.
- The vector of all the token frequencies for a given document is considered a sample.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.
We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
In [5]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm
Out[5]:
In [6]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()
Out[6]:
We can use a pandas data frame to store the vector and the feature names together.
In [7]:
# examine the vocabulary and document-term matrix together
import pandas as pd
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())
Out[7]:
Since in general this matrix contains an awful lot of zeros (think of how few of all English words appear in a single SMS), the more efficient way to store the information is as a sparse matrix. For humans this is a bit harder to read.
From the scikit-learn documentation:
As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have many feature values that are zeros (typically more than 99% of them).
For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.
In order to be able to store such a matrix in memory but also to speed up operations, implementations will typically use a sparse representation such as the implementations available in the scipy.sparse package.
In [8]:
# check the type of the document-term matrix
type(simple_train_dtm)
Out[8]:
In [9]:
# examine the sparse matrix contents
print(simple_train_dtm)
In [10]:
# example text for model testing
simple_test = ["please don't call me, I don't like you"]
In [11]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()
Out[11]:
In [12]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())
Out[12]:
In [13]:
path = 'material/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])
In [14]:
sms.shape
Out[14]:
In [15]:
# examine the first 10 rows
sms.head(10)
Out[15]:
Let's first examine the class distribution and then convert the label to a numerical value.
In [16]:
# examine the class distribution
sms.label.value_counts()
Out[16]:
In [17]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})
In [18]:
# check that the conversion worked
sms.head(10)
Out[18]:
Now we have our text in the column message and our label in the column label_num. Let's have a look at the sizes.
In [19]:
# how to define X and y (from the SMS data) for use with COUNTVECTORIZER
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)
And at the text of the first 5 messages.
In [20]:
sms.message.head()
Out[20]:
We now prepare the data for the classifier. First we split it into a training and a test set. There is a convenient function train_test_split available that helps us with that. We use a fixed random state (random_state=42) to split randomly, but at the same time get the same results each time we run the code.
In [21]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
Now we use the data preprocessing knowledge from above and generate the vocabulary. We will do this ONLY on the training data set, because we presume to have no knowledge whatsoever about the test data set. So we don't know the test data's vocabulary.
In [22]:
# learn training data vocabulary, then use it to create a document-term matrix
vect = CountVectorizer()
vect.fit(X_train)
X_train_dtm = vect.transform(X_train)
In [23]:
# examine the document-term matrix
X_train_dtm
Out[23]:
Next we transform the test data set using the same vocabulary (that is, using the same vect object that internally knows the vocabulary).
In [24]:
# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm
Out[24]:
Now we are at the stage where we have a matrix of features and the corresponding labels. We can now train a classifier for spam detection on SMS messages. We will use multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
In [25]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
In [26]:
nb.fit(X_train_dtm, y_train)
y_test_pred = nb.predict(X_test_dtm)
In [27]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_test_pred)
Out[27]:
In [29]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_test_pred)
Out[29]:
Before we dig deeper: the vectorizer and the estimator have several fields that allow us to examine their internal state:
In [30]:
vect.vocabulary_
Out[30]:
In [31]:
X_train_tokens = vect.get_feature_names()
print(X_train_tokens[:50])
In [32]:
print(X_train_tokens[-50:])
In [33]:
# feature count per class
nb.feature_count_
Out[33]:
In [34]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
In [35]:
# create a table of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()
Out[35]:
In [36]:
tokens.sample(5, random_state=6)
Out[36]:
Naive Bayes counts the number of observations in each class
In [37]:
nb.class_count_
Out[37]:
Add 1 to ham and spam counts to avoid dividing by 0
In [38]:
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)
Out[38]:
In [39]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5, random_state=6)
Out[39]:
Calculate the ratio of spam-to-ham for each token
In [40]:
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)
Out[40]:
Examine the DataFrame sorted by spam_ratio
In [41]:
tokens.sort_values('spam_ratio', ascending=False)
Out[41]:
In [42]:
tokens.loc['00', 'spam_ratio']
Out[42]:
Stopwords are the most common words in a language. Examples are 'is', 'which' and 'the'. Usually it is beneficial to exclude these words in text processing tasks.
The CountVectorizer has a stop_words parameter:
In [43]:
vect = CountVectorizer(stop_words='english')
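For example (a quick illustrative check, not part of the original notebook), we could fit this vectorizer on the toy messages from the beginning and inspect the vocabulary:
vect.fit(simple_train)
print(vect.get_feature_names())  # common words such as 'me' and 'you' no longer appear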
n-grams concatenate n consecutive words to form a single token. The following configuration accounts for 1-grams and 2-grams:
In [44]:
vect = CountVectorizer(ngram_range=(1, 2))
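As a small illustration (again not in the original notebook), fitting on the toy messages now yields both single words and word pairs as tokens:
vect.fit(simple_train)
print(vect.get_feature_names())  # contains unigrams such as 'call' and bigrams such as 'call me'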
Often it is beneficial to exclude words that appear in the majority of documents or in just a couple of documents, that is, very frequent or very infrequent words. This can be achieved by using the max_df and min_df parameters of the vectorizer.
In [45]:
# ignore terms that appear in more than 50% of the documents
vect = CountVectorizer(max_df=0.5)
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(min_df=2)
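As a quick sanity check (not in the original notebook), applying the min_df=2 vectorizer to the toy messages keeps only the words that occur in at least two of the three texts:
vect.fit(simple_train)
print(vect.get_feature_names())  # only 'call' and 'me' occur in two or more messages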
The process of reducing a word to its word stem, base or root form is called stemming. Scikit-Learn has no powerful stemmer, but other libraries such as NLTK do.
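As a minimal sketch (assuming the nltk package is installed; this is not part of the original notebook), a stemmer from NLTK can be plugged into the CountVectorizer via a custom tokenizer:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
tokenize = CountVectorizer().build_tokenizer()  # reuse the vectorizer's default tokenization
def stemmed_tokenizer(text):
    # tokenize as usual, then reduce every token to its stem
    return [stemmer.stem(token) for token in tokenize(text)]
vect_stemmed = CountVectorizer(tokenizer=stemmed_tokenizer)
vect_stemmed.fit(['calling called calls'])
print(vect_stemmed.get_feature_names())  # all three word forms collapse to 'call'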
In [46]:
import numpy as np
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
First, we will compute the term frequency (alternatively: bag-of-words) $tf(t, d)$, i.e. the number of times a term $t$ occurs in a document $d$. Using Scikit-Learn we can quickly get those numbers:
In [47]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf
Out[47]:
In [48]:
cv.vocabulary_
Out[48]:
Secondly, we introduce the inverse document frequency ($idf$) by defining the document frequency $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:
$$\text{idf}(t) = \log{\frac{n_d}{1+\text{df}(d,t)}}$$
where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.
Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.
Now, let us calculate the idfs of the words "and", "is", and "shining":
In [49]:
n_docs = len(docs)
df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)
df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)
df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)
Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:
In [50]:
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))
In [51]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[51]:
Wait! Those numbers aren't the same!
Tf-idf in Scikit-Learn is calculated a little bit differently. Here, the constant 1 is added to the idf value itself instead of to the denominator of the df, i.e. $\text{idf}(t) = \log{\frac{n_d}{\text{df}(d,t)}} + 1$:
In [52]:
tf_and = 1
df_and = 1
tf_and * (np.log(n_docs / df_and) + 1)
Out[52]:
In [53]:
tf_is = 2
df_is = 3
tf_is * (np.log(n_docs / df_is) + 1)
Out[53]:
In [54]:
tf_shining = 1
df_shining = 2
tf_shining * (np.log(n_docs / df_shining) + 1)
Out[54]:
By default, Scikit-Learn performs a normalization. The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).
$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}}$$
Why is that useful? The normalization removes the influence of document length, so that documents of different lengths become comparable.
For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet' as follows:
In [55]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[55]:
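To double-check this (a small verification sketch, not part of the original notebook), we can take the raw tf-idf vector of the 3rd document from above and l2-normalize it by hand:
# raw (unnormalized, unsmoothed) tf-idf vector of the 3rd document
raw_tfidf = TfidfTransformer(smooth_idf=False, norm=None).fit_transform(tf).toarray()[-1]
# dividing by the Euclidean length should reproduce the values above
raw_tfidf / np.sqrt(np.sum(raw_tfidf ** 2))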
We are not quite there. Scikit-Learn also applies smoothing by default, which changes the original formula as follows:
$$\text{idf}(t) = \log{\frac{1+n_d}{1+\text{df}(d,t)}} + 1$$
In [56]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2')
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[56]:
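As a final sanity check (a sketch based on the smoothed formula above, not part of the original notebook), we can reproduce these values manually:
# document frequencies, taken directly from the count matrix
df = np.sum(tf > 0, axis=0)
# smoothed idf: log((1 + n_d) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + df)) + 1
# raw tf-idf for the 3rd document, then l2-normalization
raw = tf[-1] * idf
raw / np.sqrt(np.sum(raw ** 2))  # the first three entries should match the output above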