Title: Term Frequency Inverse Document Frequency
Slug: tf-idf
Summary: How to weight word importance in bag-of-words representations of unstructured text using tf-idf for machine learning in Python.
Date: 2017-09-09 12:00
Category: Machine Learning
Tags: Preprocessing Text
Authors: Chris Albon
In [14]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
In [15]:
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])
In [16]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
# Show tf-idf feature matrix
feature_matrix.toarray()
Out[16]:
In [17]:
# Show feature names (get_feature_names() was removed in scikit-learn 1.2)
tfidf.get_feature_names_out()
Out[17]:
In [18]:
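# Create data frame with one column per feature
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())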
Out[18]:
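To see where the weights in the feature matrix come from, the calculation can be sketched by hand. This is a minimal sketch in plain Python, not scikit-learn's actual implementation. It assumes the documented `TfidfVectorizer` defaults: lowercasing, tokens of two or more word characters, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The names `tokenized`, `vocab`, `doc_freq`, `idf`, `tfidf_row`, and `manual_matrix` are illustrative only.

```python
import math
import re
from collections import Counter

# Same three documents as above
docs = ['I love Brazil. Brazil!',
        'Sweden is best',
        'Germany beats both']

# Lowercase and keep tokens of 2+ word characters
# (assumed to mirror the default token pattern, so "I" is dropped)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]

# Alphabetically sorted vocabulary, one column per term
vocab = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# Document frequency: number of documents containing each term
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Smoothed inverse document frequency (assumed default: smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf weights for one document."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

manual_matrix = [tfidf_row(tokens) for tokens in tokenized]

# Print each row rounded for readability
for row in manual_matrix:
    print([round(value, 4) for value in row])
```

Every term here occurs in exactly one document, so all idf values are equal and the weights depend only on the counts within each document. In the first document "brazil" appears twice and "love" once, giving weights of 2/√5 ≈ 0.8944 and 1/√5 ≈ 0.4472. In the other two documents each of the three terms gets 1/√3 ≈ 0.5774.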