In [1]:
%load_ext watermark
%watermark -a 'Sebastian Raschka' -p scikit-learn,numpy,scipy -v -d


Sebastian Raschka 23/05/2015 

CPython 3.4.3
IPython 3.1.0

scikit-learn 0.16.1
numpy 1.9.2
scipy 0.15.1



Tf-idf Walkthrough for scikit-learn

When I was using scikit-learn's extremely handy TfidfVectorizer, I had some trouble interpreting the results, since they seemed to differ a little from the standard convention of how the term frequency-inverse document frequency (tf-idf) is calculated. Here, I just put together a brief overview of how the TfidfVectorizer works -- mainly as a personal reference sheet, but maybe it is useful to one or the other reader.



What are Tf-idfs all about?

Tf-idfs are a way to represent documents as feature vectors. Tf-idfs can be understood as a modification of the raw term frequencies (tf); the tf is the count of how often a particular word occurs in a given document. The concept behind the tf-idf is to downweight terms proportionally to the number of documents in which they occur. Here, the idea is that terms that occur in many different documents are likely unimportant or don't contain any useful information for Natural Language Processing tasks such as document classification.



Example data

For the following sections, let us consider the following dataset that consists of 3 documents:


In [2]:
import numpy as np
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])



Raw term frequency

First, we will start with the raw term frequency tf(t, d), which is defined as the number of times a term t occurs in a document d.


Alternative term frequency definitions include the binary term frequency, log-scaled term frequency, and augmented term frequency [1].
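As a quick aside, here is a minimal sketch (using a hypothetical count vector, not yet our example documents) of what these alternative weightings could look like, following the common textbook definitions -- binary: 1 if the term occurs at all; log-scaled: 1 + log(tf) for non-zero counts; augmented: 0.5 + 0.5 * tf / max(tf):

import numpy as np

raw_tf = np.array([3, 0, 1, 2])                    # hypothetical raw counts of one document

binary_tf = (raw_tf > 0).astype(int)               # [1, 0, 1, 1]

log_tf = np.zeros(raw_tf.shape)                    # zero counts stay 0; non-zero counts are dampened
log_tf[raw_tf > 0] = 1 + np.log(raw_tf[raw_tf > 0])

augmented_tf = 0.5 + 0.5 * raw_tf / raw_tf.max()   # scales counts relative to the most frequent term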


Using the CountVectorizer from scikit-learn, we can construct a bag-of-words model with the term frequencies as follows:


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf


Out[3]:
array([[0, 1, 1, 1, 0, 1, 0],
       [0, 1, 0, 0, 1, 1, 1],
       [1, 2, 1, 1, 1, 2, 1]], dtype=int64)

In [4]:
cv.vocabulary_


Out[4]:
{'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}

Based on the vocabulary, the word "and" would be the 1st feature in each document vector in tf, the word "is" the 2nd, the word "shining" the 3rd, etc.
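If we want to see that column order directly, we can sort the vocabulary by its indices -- a small convenience sketch that only uses the vocabulary_ attribute shown above:

sorted(cv.vocabulary_, key=cv.vocabulary_.get)
# ['and', 'is', 'shining', 'sun', 'sweet', 'the', 'weather']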



Normalized term frequency

In this section, we will take a brief look at how the tf-feature vector can be normalized, which will be useful later.

The most common way to normalize the raw term frequency is l2-normalization, i.e., dividing the raw term frequency vector $v$ by its length $||v||_2$ (L2- or Euclidean norm).

$$v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}} = \frac{v}{\big(\sum_{i=1}^n v_i^2 \big)^{\frac{1}{2}}}$$

For example, we would normalize our 3rd document 'The sun is shining and the weather is sweet' as follows:

$\frac{[1, 2, 1, 1, 1, 2, 1]}{\sqrt{1^2 + 2^2 + 1^2 + 1^2 + 1^2 + 2^2 + 1^2}} = \frac{[1, 2, 1, 1, 1, 2, 1]}{\sqrt{13}} = [ 0.2773501, 0.5547002, 0.2773501, 0.2773501, 0.2773501, 0.5547002, 0.2773501]$


In [5]:
tf_norm = tf[2] / np.sqrt(np.sum(tf[2]**2))
tf_norm


Out[5]:
array([ 0.2773501,  0.5547002,  0.2773501,  0.2773501,  0.2773501,
        0.5547002,  0.2773501])

In scikit-learn, we can use the TfidfTransformer to normalize the term frequencies if we disable the inverse document frequency calculation (use_idf=False and smooth_idf=False):


In [6]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False)
tf_norm = tfidf.fit_transform(tf).toarray()
print('Normalized term frequencies of document 3:\n %s' % tf_norm[-1])


Normalized term frequencies of document 3:
 [ 0.2773501  0.5547002  0.2773501  0.2773501  0.2773501  0.5547002
  0.2773501]



Term frequency-inverse document frequency -- tf-idf

Most commonly, the term frequency-inverse document frequency (tf-idf) is calculated as follows [1]:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$

where idf is the inverse document frequency, which we will introduce in the next section.



Inverse document frequency

In order to understand the inverse document frequency idf, let us first introduce the term document frequency $\text{df}(d,t)$, which is simply the number of documents $d$ that contain the term $t$. We can then define the idf as follows:

$$\text{idf}(t) = \log{\frac{n_d}{1+\text{df}(d,t)}},$$

where
$n_d$: The total number of documents
$\text{df}(d,t)$: The number of documents that contain term $t$.

Note that the constant 1 is added to the denominator to avoid a zero-division error if a term is not contained in any document in the test dataset.
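Instead of counting by hand, we could also read the document frequencies directly off the count matrix tf from above -- a term's df is simply the number of rows (documents) with a non-zero entry in that term's column. A small sketch:

df = (tf > 0).sum(axis=0)
df
# array([1, 3, 2, 2, 2, 3, 2]) -> 'and': 1, 'is': 3, 'shining': 2, 'sun': 2, 'sweet': 2, 'the': 3, 'weather': 2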

Now, let us calculate the idfs of the words "and", "is", and "shining":


In [7]:
n_docs = len(docs)

df_and = 1
idf_and = np.log(n_docs / (1 + df_and))
print('idf "and": %s' % idf_and)

df_is = 3
idf_is = np.log(n_docs / (1 + df_is))
print('idf "is": %s' % idf_is)

df_shining = 2
idf_shining = np.log(n_docs / (1 + df_shining))
print('idf "shining": %s' % idf_shining)


idf "and": 0.405465108108
idf "is": -0.287682072452
idf "shining": 0.0

Using those idfs, we can eventually calculate the tf-idfs for the 3rd document:


In [8]:
print('Tf-idfs in document 3:\n')
print('tf-idf "and": %s' % (1 * idf_and))
print('tf-idf "is": %s' % (2 * idf_is))
print('tf-idf "shining": %s' % (1 * idf_shining))


Tf-idfs in document 3:

tf-idf "and": 0.405465108108
tf-idf "is": -0.575364144904
tf-idf "shining": 0.0



Tf-idf in scikit-learn

The tf-idfs in scikit-learn are calculated a little bit differently. Here, the constant 1 is added to the idf itself instead of to the denominator of the df term:

$$\text{idf}(t) = \log{\frac{n_d}{\text{df}(d,t)}} + 1$$

We can demonstrate this by calculating the idfs manually using the equation above and comparing the results to the TfidfTransformer output with the settings use_idf=True, smooth_idf=False, norm=None. In the following examples, we will again be using the words "and", "is", and "shining" from document 3.


In [9]:
tf_and = 1 
df_and = 1 
tf_and * (np.log(n_docs / df_and) + 1)


Out[9]:
2.09861228866811

In [10]:
tf_is = 2
df_is = 3
tf_is * (np.log(n_docs / df_is) + 1)


Out[10]:
2.0

In [11]:
tf_shining = 1
df_shining = 2
tf_shining * (np.log(n_docs / df_shining) + 1)


Out[11]:
1.4054651081081644

In [12]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()[-1][0:3]


Out[12]:
array([ 2.09861229,  2.        ,  1.40546511])



Normalized tf-idf

Now, let us calculate the normalized tf-idfs. Our feature vector of un-normalized tf-idfs for document 3 would look as follows if we applied the equation from the previous section to all words in the document:


In [13]:
tf_idfs_d3 = np.array([[ 2.09861229, 2.0, 1.40546511, 1.40546511, 1.40546511, 2.0, 1.40546511]])

Using the l2-norm, we then normalize the tf-idfs as follows:


In [14]:
tf_idfs_d3_norm = tf_idfs_d3[-1] / np.sqrt(np.sum(tf_idfs_d3[-1]**2))
tf_idfs_d3_norm


Out[14]:
array([ 0.46572049,  0.44383662,  0.31189844,  0.31189844,  0.31189844,
        0.44383662,  0.31189844])

And finally, we compare them to the results that the TfidfTransformer returns.


In [15]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm='l2')
tfidf.fit_transform(tf).toarray()[-1]


Out[15]:
array([ 0.46572049,  0.44383662,  0.31189844,  0.31189844,  0.31189844,
        0.44383662,  0.31189844])



Smooth idf

Another parameter of the TfidfTransformer is smooth_idf, which is described in the documentation as

smooth_idf : boolean, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

So, our idf would then be defined as follows:

$$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1$$

To confirm that we understand the smooth_idf parameter correctly, let us walk through the 3-word example again:


In [16]:
tfidf = TfidfTransformer(use_idf=True, smooth_idf=True, norm=None)
tfidf.fit_transform(tf).toarray()[-1][:3]


Out[16]:
array([ 1.69314718,  2.        ,  1.28768207])

In [17]:
tf_and = 1 
df_and = 1 
tf_and * (np.log((n_docs+1) / (df_and+1)) + 1)


Out[17]:
1.6931471805599454

In [18]:
tf_is = 2
df_is = 3 
tf_is * (np.log((n_docs+1) / (df_is+1)) + 1)


Out[18]:
2.0

In [19]:
tf_shining = 1 
df_shining = 2
tf_shining * (np.log((n_docs+1) / (df_shining+1)) + 1)


Out[19]:
1.2876820724517808
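The same check can also be done for all seven terms at once. Below is a short numpy sketch that applies the smoothed idf equation to the whole count matrix (un-normalized, matching the norm=None output above):

df = (tf > 0).sum(axis=0)                        # document frequencies of all terms
smooth_idf = np.log((n_docs + 1) / (1 + df)) + 1
tf[-1] * smooth_idf                              # un-normalized tf-idfs of document 3
# array([ 1.69314718,  2.        ,  1.28768207,  1.28768207,  1.28768207,
#         2.        ,  1.28768207])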



TfidfVectorizer defaults

Finally, we now understand the default settings in the TfidfTransformer (the TfidfVectorizer uses the same weighting defaults), which are:

  • use_idf=True
  • smooth_idf=True
  • norm='l2'

And the equation can be summarized as
$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$
where
$\text{idf}(t) = \log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1,$
followed by l2-normalization of the resulting tf-idf vectors.


In [20]:
tfidf = TfidfTransformer()
tfidf.fit_transform(tf).toarray()[-1]


Out[20]:
array([ 0.40474829,  0.47810172,  0.30782151,  0.30782151,  0.30782151,
        0.47810172,  0.30782151])

In [21]:
smooth_tfidfs_d3 = np.array([[ 1.69314718, 2.0, 1.28768207, 1.28768207, 1.28768207, 2.0, 1.28768207]])
smooth_tfidfs_d3_norm = smooth_tfidfs_d3[-1] / np.sqrt(np.sum(smooth_tfidfs_d3[-1]**2))
smooth_tfidfs_d3_norm


Out[21]:
array([ 0.40474829,  0.47810172,  0.30782151,  0.30782151,  0.30782151,
        0.47810172,  0.30782151])
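Since this section's heading mentions the TfidfVectorizer: it simply chains the CountVectorizer and the TfidfTransformer into one step, so applying it with its default settings directly to the raw documents should reproduce the same vectors as above. A quick sketch as a sanity check:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer()                 # use_idf=True, smooth_idf=True, norm='l2' by default
tfidf_vect.fit_transform(docs).toarray()[-1]   # expected to match Out[20] above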



References

[1] G. Salton and M. J. McGill. Introduction to modern information retrieval. 1983.