Best Practices for Preprocessing Natural Language Data

In this notebook, we improve the quality of our Project Gutenberg word vectors by adopting best practices for preprocessing natural language data.

N.B.: Some, all or none of these preprocessing steps may be helpful to a given downstream application.

Load dependencies


In [ ]:
# the initial block is copied from creating_word_vectors_with_word2vec.ipynb
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import gensim
from gensim.models.word2vec import Word2Vec
from gensim.models.phrases import Phraser, Phrases
from sklearn.manifold import TSNE
import pandas as pd
from bokeh.io import output_notebook, output_file
from bokeh.plotting import show, figure
import string
%matplotlib inline

In [ ]:
nltk.download('punkt')

In [ ]:
nltk.download('stopwords')

Load data


In [ ]:
nltk.download('gutenberg')

In [ ]:
from nltk.corpus import gutenberg

In [ ]:
len(gutenberg.fileids())

In [ ]:
gutenberg.fileids()

In [ ]:
gberg_sent_tokens = sent_tokenize(gutenberg.raw())

In [ ]:
gberg_sent_tokens[0:5]

In [ ]:
gberg_sent_tokens[1]

In [ ]:
word_tokenize(gberg_sent_tokens[1])

In [ ]:
word_tokenize(gberg_sent_tokens[1])[14]

In [ ]:
gberg_sents = gutenberg.sents() # convenience method: tokenizes the corpus into sentences of word tokens in one step

In [ ]:
gberg_sents[0:5]

In [ ]:
len(gutenberg.words())

Iteratively preprocess a sentence

a tokenized sentence:

In [ ]:
gberg_sents[4]
to lowercase:

In [ ]:
[w.lower() for w in gberg_sents[4]] # lowercase every token in the sentence
remove stopwords and punctuation:

In [ ]:
stpwrds = stopwords.words('english') + list(string.punctuation)

In [ ]:
stpwrds

In [ ]:
[w.lower() for w in gberg_sents[4] if w.lower() not in stpwrds] # drop stopwords and punctuation
stem words:

In [ ]:
stemmer = PorterStemmer()

In [ ]:
[stemmer.stem(w.lower()) for w in gberg_sents[4] if w.lower() not in stpwrds] # stem each remaining lowercased token
handle bigram collocations:

In [ ]:
phrases = Phrases(gberg_sents) # train detector

In [ ]:
bigram = Phraser(phrases) # create a more efficient Phraser object for transforming sentences

In [ ]:
bigram.phrasegrams # output count and score of each bigram
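
The score reported beside each count comes from Phrases' default scorer, which (roughly speaking) rewards word pairs that co-occur far more often than their individual frequencies would predict; pairs scoring above the threshold argument are joined by the Phraser. Below is a minimal sketch of that default formula, using made-up counts purely for illustration:

In [ ]:
def phrase_score(bigram_count, worda_count, wordb_count, len_vocab, min_count):
    # sketch of gensim's default Phrases scoring: (co-occurrence count minus min_count),
    # normalised by each component word's count and scaled by the vocabulary size
    return (bigram_count - min_count) * len_vocab / (worda_count * wordb_count)

phrase_score(bigram_count=500, worda_count=800, wordb_count=1000, len_vocab=20000, min_count=5) # ~12.4, above the default threshold of 10.0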

In [ ]:
"Jon lives in New York City".split()

In [ ]:
bigram["Jon lives in New York City".split()] # apply the trained detector; detected pairs are joined with an underscore

Preprocess the corpus


In [ ]:
lower_sents = []
for s in gberg_sents:
    lower_sents.append([w.lower() for w in s if w.lower() not in list(string.punctuation)]) # lowercase and strip punctuation only

In [ ]:
lower_sents[0:5]

In [ ]:
lower_bigram = Phraser(Phrases(lower_sents))

In [ ]:
lower_bigram.phrasegrams # miss taylor, mr woodhouse, mr weston

In [ ]:
lower_bigram["jon lives in new york city".split()]

In [ ]:
lower_bigram = Phraser(Phrases(lower_sents, min_count=32, threshold=64))
lower_bigram.phrasegrams
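
As a sanity check, the stricter detector can be re-applied to the same test sentence used above, to see which pairs still get joined under min_count=32 and threshold=64:

In [ ]:
lower_bigram["jon lives in new york city".split()] # re-check which pairs the stricter detector still joins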

In [ ]:
# as in Maas et al. (2011):
# - leave in stop words ("indicative of sentiment")
# - no stemming ("model learns similar representations of words of the same stem when data suggests it")
clean_sents = []
for s in lower_sents:
    clean_sents.append(lower_bigram[s]) # join detected bigrams in each lowercased, punctuation-free sentence

In [ ]:
clean_sents[0:9]

In [ ]:
clean_sents[6] # could consider removing stop words or common words
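
A quick spot check that the bigrams made it into the cleaned corpus is to count how many sentences contain one of the detected tokens (e.g. miss_taylor, noted above):

In [ ]:
sum(1 for s in clean_sents if 'miss_taylor' in s) # number of cleaned sentences containing the detected bigram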

Run word2vec


In [ ]:
# sg=1 selects skip-gram (rather than CBOW); min_count is raised here from its default of 5
# max_vocab_size can be used instead of min_count to cap the vocabulary
# model = Word2Vec(sentences=clean_sents, size=64, sg=1, window=10, min_count=10, seed=42, workers=8)
# model.save('../clean_gutenberg_model.w2v')

Explore model


In [ ]:
# skip re-training by loading the previously saved model with the next line:
model = gensim.models.Word2Vec.load('../clean_gutenberg_model.w2v')

In [ ]:
len(model.wv.vocab) # cf. ~17k tokens when the raw, unpreprocessed corpus is used

In [ ]:
len(model.wv['dog'])

In [ ]:
model.wv['dog']

In [ ]:
model.wv.most_similar('dog')

In [ ]:
model.wv.most_similar('think')

In [ ]:
model.wv.most_similar('day')

In [ ]:
model.wv.doesnt_match("morning afternoon evening dog".split())

In [ ]:
model.wv.similarity('morning', 'dog')
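
The similarity score is simply the cosine of the angle between the two word vectors; as a quick check, it can be reproduced directly with NumPy (a minimal sketch, assuming NumPy is available, as gensim already requires it):

In [ ]:
import numpy as np
v1, v2 = model.wv['morning'], model.wv['dog']
np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)) # should match model.wv.similarity('morning', 'dog') above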

In [ ]:
model.wv.most_similar('ma_am')

In [ ]:
model.wv.most_similar(positive=['father', 'woman'], negative=['man'])
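
Under the hood, the analogy is solved with vector arithmetic: most_similar (roughly) adds the positive vectors, subtracts the negative ones (after unit-normalising each), and ranks the rest of the vocabulary by cosine similarity to the result. A rough sketch with explicit arithmetic (note that, unlike most_similar, similar_by_vector does not exclude the input words from the results):

In [ ]:
target = model.wv['father'] - model.wv['man'] + model.wv['woman'] # un-normalised, so only an approximation of most_similar
model.wv.similar_by_vector(target)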

Reduce word vector dimensionality with t-SNE


In [ ]:
# tsne = TSNE(n_components=2, n_iter=1000)

In [ ]:
# X_2d = tsne.fit_transform(model.wv[model.wv.vocab])

In [ ]:
# coords_df = pd.DataFrame(X_2d, columns=['x','y'])
# coords_df['token'] = model.wv.vocab.keys()

In [ ]:
# coords_df.to_csv('../clean_gutenberg_tsne.csv', index=False)

Visualise


In [ ]:
coords_df = pd.read_csv('../clean_gutenberg_tsne.csv')

In [ ]:
coords_df.head()

In [ ]:
_ = coords_df.plot.scatter('x', 'y', figsize=(12,12), marker='.', s=10, alpha=0.2)

In [ ]:
output_notebook()

In [ ]:
subset_df = coords_df.sample(n=5000)

In [ ]:
p = figure(plot_width=800, plot_height=800)
_ = p.text(x=subset_df.x, y=subset_df.y, text=subset_df.token)

In [ ]:
show(p)

In [ ]:
# to also write the interactive plot to a standalone HTML file, call output_file() before show(), e.g.:
# output_file('../clean_gutenberg_tsne.html') # filename is just an example
# show(p)