In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -p scikit-learn,numpy,scipy -v -d
When I was using scikit-learn's extremely handy TfidfVectorizer
, I had some trouble interpreting the results, since they are a little bit different from the standard textbook convention of how the term frequency-inverse document frequency (tf-idf) is calculated. Here, I put together a brief overview of how the TfidfVectorizer
works -- mainly as a personal reference sheet, but maybe it is useful to others as well.
Tf-idfs are a way to represent documents as feature vectors. Tf-idfs can be understood as a modification of the raw term frequencies (tf); the tf is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification.
For the following sections, let us consider the following dataset that consists of 3 documents:
In [2]:
import numpy as np
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
First, we will start with the raw term frequency tf(t, d), which is defined as the number of times a term t occurs in a document d.
Using the CountVectorizer
from scikit-learn, we can construct a bag-of-words model with the term frequencies as follows:
In [3]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf
Out[3]:
In [4]:
cv.vocabulary_
Out[4]:
Based on the vocabulary, the word "and" would be the 1st feature in each document vector in tf
, the word "is" the 2nd, the word "shining" the 3rd, etc.
In this section, we will take a brief look at how the tf-feature vector can be normalized, which will be useful later.
The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).
$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v_1^2 + v_2^2 + \dots + v_n^2}} = \frac{v}{\big(\sum_{i=1}^n v_i^2 \big)^{\frac{1}{2}}}$$For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet'
as follows:
$\frac{[1, 2, 1, 1, 1, 2, 1]}{\sqrt{1^2 + 2^2 + 1^2 + 1^2 + 1^2 + 2^2 + 1^2}} = \frac{[1, 2, 1, 1, 1, 2, 1]}{\sqrt{13}} = [ 0.2773501, 0.5547002, 0.2773501, 0.2773501, 0.2773501, 0.5547002, 0.2773501]$
In [5]:
tf_norm = tf[2] / np.sqrt(np.sum(tf[2]**2))
tf_norm
Out[5]:
In scikit-learn, we can use the TfidfTransformer
to normalize the term frequencies if we disable the inverse document frequency calculation (use_idf=False
and smooth_idf=False
):
In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False)
tf_norm = tfidf.fit_transform(tf).toarray()
print('Normalized term frequencies of document 3:\n %s' % tf_norm[-1])
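As a quick sanity check, we can verify numerically that the manually normalized vector from above and the TfidfTransformer result for document 3 agree:
# Both normalizations should produce the same vector for the 3rd document.
print(np.allclose(tf[2] / np.sqrt(np.sum(tf[2]**2)), tf_norm[-1]))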
Most commonly, the term frequency-inverse document frequency (tf-idf) is calculated as follows [1]:
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$
where idf is the inverse document frequency, which we will introduce in the next section.
In order to understand the inverse document frequency idf, let us first introduce the term document frequency $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:
$$\text{idf}(t) = \log{\frac{n_d}{1+\text{df}(d,t)}},$$
where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.
Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.
Now, let us calculate the idfs of the words "and," "is," and "shining":
In [7]:
n_docs = len(docs)
df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)
df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)
df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)
Using these idfs, we can now calculate the tf-idfs for the 3rd document:
In [8]:
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))
The tf-idfs in scikit-learn are calculated a little bit differently. Here, the constant 1 is added to the idf instead of to the denominator of the df, and the df itself is used without the +1:
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \big(\text{idf}(t) + 1\big), \quad \text{where} \quad \text{idf}(t) = \log{\frac{n_d}{\text{df}(d,t)}}.$$
We can demonstrate this by calculating the idfs manually using the equation above and comparing the results to the TfidfTransformer output using the settings use_idf=True, smooth_idf=False, norm=None
.
In the following examples, we will again be using the words "and," "is," and "shining" from document 3.
In [9]:
tf_and = 1
df_and = 1
tf_and * (np.log(n_docs / df_and) + 1)
Out[9]:
In [10]:
tf_is = 2
df_is = 3
tf_is * (np.log(n_docs / df_is) + 1)
Out[10]:
In [11]:
tf_shining = 1
df_shining = 2
tf_shining * (np.log(n_docs / df_shining) + 1)
Out[11]:
In [12]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][0:3]
Out[12]:
Now, let us calculate the normalized tf-idfs. Our feature vector of un-normalized tf-idfs for document 3 would look as follows if we applied the equation from the previous section to all words in the document:
In [13]:
tf_idfs_d3 = np.array([[ 2.09861229, 2.0, 1.40546511, 1.40546511, 1.40546511, 2.0, 1.40546511]])
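This hard-coded vector can also be derived directly from the tf array and the document frequencies; a small sketch using the same (non-smoothed) scikit-learn equation as above (tf_idfs_d3_check is just an illustrative name):
# Un-normalized tf-idfs of document 3, computed from the raw counts:
# df per term from the tf matrix, then the sklearn-style idf = log(n_d / df) + 1.
df_all = np.sum(tf > 0, axis=0)
tf_idfs_d3_check = tf[-1] * (np.log(n_docs / (1. * df_all)) + 1)
print(tf_idfs_d3_check)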
Using the l2-norm, we then normalize the tf-idfs as follows:
In [14]:
tf_idfs_d3_norm = tf_idfs_d3[-1] / np.sqrt(np.sum(tf_idfs_d3[-1]**2))
tf_idfs_d3_norm
Out[14]:
And finally, we compare the results to those that the TfidfTransformer
returns.
In [15]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1]
Out[15]:
Another parameter in the TfidfTransformer
is the smooth_idf
, which is described as
smooth_idf : boolean, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
So, our idf would then be defined as follows:
$$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(d,t)}}.$$
To confirm that we understand the smooth_idf
parameter correctly, let us walk through the 3-word example again:
In [16]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]
Out[16]:
In [17]:
tf_and = 1
df_and = 1
tf_and * (np.log((n_docs+1) / (df_and+1)) + 1)
Out[17]:
In [18]:
tf_is = 2
df_is = 3
tf_is * (np.log((n_docs+1) / (df_is+1)) + 1)
Out[18]:
In [19]:
tf_shining = 1
df_shining = 2
tf_shining * (np.log((n_docs+1) / (df_shining+1)) + 1)
Out[19]:
Finally, we now understand the default settings in the TfidfTransformer
, which are:
use_idf=True
smooth_idf=True
norm='l2'
And the equation can be summarized as
$\text{tf-idf}(t, d) = \text{tf}(t, d) \times (\text{idf}(t) + 1),$
where
$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(d,t)}},$
and the resulting tf-idf vectors are then l2-normalized (norm='l2').
In [20]:
tfidf = TfidfTransformer()
tfidf.fit_transform(tf).toarray()[-1]
Out[20]:
We can double-check this result by l2-normalizing the un-normalized smooth tf-idfs of document 3 manually:
In [21]:
smooth_tfidfs_d3 = np.array([[ 1.69314718, 2.0, 1.28768207, 1.28768207, 1.28768207, 2.0, 1.28768207]])
smooth_tfidfs_d3_norm = smooth_tfidfs_d3[-1] / np.sqrt(np.sum(smooth_tfidfs_d3[-1]**2))
smooth_tfidfs_d3_norm
Out[21]:
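To wrap up, here is a small sketch that reproduces the default TfidfTransformer output for all three documents directly from the raw term frequencies, using the equations above (manual_idf and manual_tfidf are just illustrative names):
# Reproduce TfidfTransformer(use_idf=True, smooth_idf=True, norm='l2') by hand.
df_all = np.sum(tf > 0, axis=0)                          # document frequencies
manual_idf = np.log((1. + n_docs) / (1. + df_all)) + 1   # smoothed idf plus the extra +1
manual_tfidf = tf * manual_idf                           # un-normalized tf-idfs
l2_norms = np.sqrt(np.sum(manual_tfidf**2, axis=1, keepdims=True))
manual_tfidf = manual_tfidf / l2_norms                   # l2-normalize each document vector
print(manual_tfidf)
The last row of manual_tfidf should match the result of the default TfidfTransformer shown above.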
[1] G. Salton and M. J. McGill. Introduction to modern information retrieval. 1983.