Title: Term Frequency Inverse Document Frequency
Slug: tf-idf
Summary: How to weight word importance in unstructured text data as bags of words for machine learning in Python.
Date: 2017-09-09 12:00
Category: Machine Learning
Tags: Preprocessing Text
Authors: Chris Albon

Preliminaries


In [14]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

Create Text Data


In [15]:
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

Create Feature Matrix


In [16]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

# Show tf-idf feature matrix
feature_matrix.toarray()


Out[16]:
array([[ 0.        ,  0.        ,  0.        ,  0.89442719,  0.        ,
         0.        ,  0.4472136 ,  0.        ],
       [ 0.        ,  0.57735027,  0.        ,  0.        ,  0.        ,
         0.57735027,  0.        ,  0.57735027],
       [ 0.57735027,  0.        ,  0.57735027,  0.        ,  0.57735027,
         0.        ,  0.        ,  0.        ]])

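The weights above follow TfidfVectorizer's default settings: raw term counts are multiplied by a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and each document vector is then L2-normalized. Below is a minimal sketch (not part of the notebook cells above) that reproduces the first row, 'I love Brazil. Brazil!', by hand; the counts and df dictionaries are written out manually for illustration.

# Reproduce the first document's tf-idf weights by hand
import numpy as np

n_docs = 3                            # total number of documents
counts = {'brazil': 2, 'love': 1}     # raw term counts in 'I love Brazil. Brazil!'
df = {'brazil': 1, 'love': 1}         # number of documents containing each term

# Smoothed idf used by TfidfVectorizer's default settings
idf = {t: np.log((1 + n_docs) / (1 + df[t])) + 1 for t in counts}

# Raw tf-idf weights, then L2-normalize the document vector
raw = np.array([counts[t] * idf[t] for t in ['brazil', 'love']])
normalized = raw / np.linalg.norm(raw)

# Approximately [0.894, 0.447]: the nonzero entries (brazil, love) of the first row of Out[16]
print(normalized)
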
In [17]:
# Show feature names
tfidf.get_feature_names()


Out[17]:
['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']

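The feature names above are the learned vocabulary in column order. The same mapping is also available as a dictionary through the vectorizer's vocabulary_ attribute; a quick check using the tfidf object fitted earlier:

# Map each term to its column index in the feature matrix
print(tfidf.vocabulary_)

# For example, the column index of 'brazil' (3, matching its position in the list above)
print(tfidf.vocabulary_['brazil'])
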
View Feature Matrix As Data Frame


In [18]:
# Create data frame
pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names())


Out[18]:
     beats     best     both    brazil  germany       is      love   sweden
0  0.00000  0.00000  0.00000  0.894427  0.00000  0.00000  0.447214  0.00000
1  0.00000  0.57735  0.00000  0.000000  0.00000  0.57735  0.000000  0.57735
2  0.57735  0.00000  0.57735  0.000000  0.57735  0.00000  0.000000  0.00000
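
Once fitted, the same vectorizer can be reused to project new documents onto the vocabulary learned above with transform; tokens not seen during fitting are simply ignored. A short usage sketch (the new_text example string is made up; newer scikit-learn versions rename get_feature_names to get_feature_names_out):

# Transform new documents using the vocabulary and idf weights learned above
new_text = ['Sweden beats Brazil']
new_features = tfidf.transform(new_text)

# View as a data frame with the same columns as the training matrix
pd.DataFrame(new_features.toarray(), columns=tfidf.get_feature_names())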