A Data Model for Text

We start by importing the tools and loading the data.


In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
plt.style.use('ggplot')
%matplotlib inline
df = pd.read_table('data/preprocessed.tsv')

Quantifying text: The Term-Document Matrix

Three types of metrics:

  • Presence or absence of a word
  • The number of times a word appears
  • The probability of a word occurring

In [ ]:
corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness"
]

In [ ]:
vect = CountVectorizer(binary=True)
X = vect.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vect.get_feature_names_out())

In [ ]:
vect = CountVectorizer()
X = vect.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vect.get_feature_names_out())

In [ ]:
vect = TfidfVectorizer()
X = vect.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vect.get_feature_names_out())
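To see where the tf-idf numbers come from, the weighting can be reproduced by hand. This sketch follows scikit-learn's documented defaults (smooth_idf=True, l2 normalisation); the manual computation is an illustration added here, not part of the original notebook:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "it was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Raw term counts (the "tf" part)
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf, sklearn's default
tfidf = counts * idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalise rows

# Should match TfidfVectorizer with default settings
reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(tfidf, reference))
```

The key point: rare words get a larger idf weight than words appearing in every document, so "it", "was", "the", and "of" are down-weighted relative to "wisdom" or "foolishness".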

Back to the PyCon proposals dataset


In [ ]:
df.head()

Pick a corpus consisting only of the talk titles


In [ ]:
corpus = df['title']
vect = CountVectorizer(stop_words='english')
X = vect.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vect.get_feature_names_out())

What just happened?

The full vocabulary produced a very wide, sparse matrix: one column for every distinct word in the titles. Setting max_features=10 keeps only the ten most frequent terms:


In [ ]:
corpus = df['title']
vect = CountVectorizer(stop_words='english', max_features=10)
X = vect.fit_transform(corpus).toarray()
pd.DataFrame(X, columns=vect.get_feature_names_out())

Exercise: Find the top 10 most frequently occurring words in the "description" column of the dataset.


In [ ]:
# ENTER CODE HERE