Title: Term Frequency Inverse Document Frequency
Slug: tf-idf
Summary: How to weight word importance in unstructured text data as bags of words for machine learning in Python.
Date: 2017-09-09 12:00
Category: Machine Learning
Tags: Preprocessing Text
Authors: Chris Albon

Preliminaries


In [14]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

Create Text Data


In [15]:
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

Create Feature Matrix


In [16]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

# Show tf-idf feature matrix
feature_matrix.toarray()


Out[16]:
array([[ 0.        ,  0.        ,  0.        ,  0.89442719,  0.        ,
         0.        ,  0.4472136 ,  0.        ],
       [ 0.        ,  0.57735027,  0.        ,  0.        ,  0.        ,
         0.57735027,  0.        ,  0.57735027],
       [ 0.57735027,  0.        ,  0.57735027,  0.        ,  0.57735027,
         0.        ,  0.        ,  0.        ]])

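The weights above follow TfidfVectorizer's default settings: raw term counts are multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document vector is then L2-normalized. Below is a minimal sketch (not part of the notebook cells above) that reproduces the first row, 'I love Brazil. Brazil!', by hand; the counts and df dictionaries are written out manually for illustration.

# Reproduce the first document's tf-idf weights by hand
import numpy as np

n_docs = 3                            # total number of documents
counts = {'brazil': 2, 'love': 1}     # raw term counts in 'I love Brazil. Brazil!'
df = {'brazil': 1, 'love': 1}         # number of documents containing each term

# Smoothed idf used by TfidfVectorizer's default settings
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in counts}

# Raw tf-idf weights, then L2-normalize the document vector
raw = np.array([counts[t] * idf[t] for t in ['brazil', 'love']])
normalized = raw / np.linalg.norm(raw)

# Approximately [0.894, 0.447]: the nonzero entries (brazil, love) of the first row of Out[16]
print(normalized)
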
In [17]:
# Show feature names
tfidf.get_feature_names()


Out[17]:
['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']

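The feature names above are the learned vocabulary in column order. The same mapping is also available as a dictionary through the vectorizer's vocabulary_ attribute; a quick check using the tfidf object fitted earlier:

# Map each term to its column index in the feature matrix
print(tfidf.vocabulary_)

# For example, the column index of 'brazil' (3, matching its position in the list above)
print(tfidf.vocabulary_['brazil'])
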
View Feature Matrix As Data Frame


In [18]:
# Create data frame
pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names())


Out[18]:
     beats     best     both    brazil  germany       is      love   sweden
0  0.00000  0.00000  0.00000  0.894427  0.00000  0.00000  0.447214  0.00000
1  0.00000  0.57735  0.00000  0.000000  0.00000  0.57735  0.000000  0.57735
2  0.57735  0.00000  0.57735  0.000000  0.57735  0.00000  0.000000  0.00000
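
Once fitted, the same vectorizer can be reused to project new documents onto the vocabulary learned above with transform; tokens not seen during fitting are simply ignored. A short usage sketch (the new_text example string is made up; newer scikit-learn versions rename get_feature_names to get_feature_names_out):

# Transform new documents using the vocabulary and idf weights learned above
new_text = ['Sweden beats Brazil']
new_features = tfidf.transform(new_text)

# View as a data frame with the same columns as the training matrix
pd.DataFrame(new_features.toarray(), columns=tfidf.get_feature_names())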