Machine Learning with Text

Problem: Use the title and description of a talk to predict whether it might be selected.

Starting with the imports...



In [ ]:

    
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score



In [ ]:

    
df = pd.read_table('data/preprocessed.tsv', usecols=['title', 'description', 'selected'])
df.fillna(value="", inplace=True)



In [ ]:

    
y = df['selected'].astype(int).values

The Training & Prediction pipeline

Let's use the 'title' column as the corpus



In [ ]:

    
corpus = df['title']

Text Vectorization & the Term-Document (TD) Matrix



In [ ]:

    
vect = TfidfVectorizer(sublinear_tf=True, stop_words='english')
X = vect.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out()).head()
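
To see what the vectorizer produces without loading the dataset, here is a minimal sketch on a toy corpus. The three titles and the `toy_` names are made up for illustration and are not taken from the data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles, not taken from the dataset
toy_corpus = [
    "Machine learning with text",
    "Deep learning for text classification",
    "Building web apps with Python",
]

# Same settings as the notebook's vectorizer
toy_vect = TfidfVectorizer(sublinear_tf=True, stop_words='english')
toy_X = toy_vect.fit_transform(toy_corpus)

# One row per document, one column per vocabulary term
print(toy_X.shape)

# English stop words such as "with" are dropped from the vocabulary
print(sorted(toy_vect.vocabulary_))
```

Each row of the result is one document and each column is one term that survived stop-word removal, which is why the matrix gets wide quickly on a real corpus.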

Dimensionality Reduction



In [ ]:

    
svd = TruncatedSVD(n_components=250)
X = svd.fit_transform(X)
pd.DataFrame(X).head()
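
One way to judge whether a given number of components is reasonable is to check how much variance the reduced matrix retains. The sketch below assumes a random sparse matrix as a stand-in for a TF-IDF matrix; the sizes and `toy_` names are illustrative only.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a TF-IDF matrix: 100 documents, 50 terms
toy_X = sparse_random(100, 50, density=0.1, format='csr', random_state=0)

# Reduce 50 term columns to 10 dense components
toy_svd = TruncatedSVD(n_components=10, random_state=0)
toy_reduced = toy_svd.fit_transform(toy_X)

print(toy_reduced.shape)

# Fraction of the original variance kept by the 10 components
retained = toy_svd.explained_variance_ratio_.sum()
print(retained)
```

The same `explained_variance_ratio_` attribute is available on the notebook's `svd` object after fitting, so the check carries over directly.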

Training the Classifier



In [ ]:

    
gnb = GaussianNB()
gnb.fit(X, y)

Testing the Classifier



In [ ]:

    
predictions = gnb.predict(X)
# Accuracy on the training data itself, so this is an optimistic estimate
print(accuracy_score(y, predictions))
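
The score above is computed on the same rows the classifier was trained on. A held-out split usually gives a more honest estimate. The sketch below is one possible way to do that, shown on synthetic features rather than the talk data; the `toy_` names, the split size, and the random data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical synthetic data: 200 samples, 5 dense features, binary labels
rng = np.random.RandomState(0)
toy_X = rng.normal(size=(200, 5))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out 25% of the rows, keeping the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0
)

# Fit on the training part only, then score on the unseen part
toy_gnb = GaussianNB().fit(X_train, y_train)
toy_pred = toy_gnb.predict(X_test)

held_out_accuracy = accuracy_score(y_test, toy_pred)
held_out_recall = recall_score(y_test, toy_pred)
print(held_out_accuracy, held_out_recall)
```

In the text pipeline, the vectorizer and the SVD would also need to be fitted on the training rows only and then applied to the held-out rows with `transform`, otherwise information from the test set leaks into the features.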

Exercise 1: Use the 'description' column of the dataset as a corpus for the predictions



In [ ]:

    
# Retrieve the corpus from the dataset



In [ ]:

    
# Obtain the TD Matrix



In [ ]:

    
# Reduce the dimensionality of the TD matrix to 250 components



In [ ]:

    
# Train the classifier



In [ ]:

    
# Test the classifier

Exercise 2: Use a combination of 'title' and 'description' corpora for the training & predictions



In [ ]: