Author: Matt Fortier matt.fortier@openmailbox.org
Creation Date: 2017-02-21 | Last Updated: 2017-08-01
Python 3.6
The goal here is to get a good feel for what kind of data is available to us, and in what general form the text at our disposal comes. We should experiment with different preprocessing steps and find one that gives us satisfactory results. The objective is always to extract as much of the useful/valuable information as possible, while ignoring the noise.
In [2]:
import re
import bs4
import numpy as np
import pandas as pd
In [3]:
fname_posts = '../data/processed/posts_processed.json'
In [4]:
df_posts = pd.read_json(fname_posts, lines=True)
In [5]:
df_posts.index = df_posts.id
df_posts.head(3)
Out[5]:
In [6]:
df_body = df_posts.body
df_body.iloc[0]
Out[6]:
Okay, so at a minimum we need to filter out the HTML tags and remove the backslashes. Let's prepare to do that by writing a simple function that leverages the excellent BeautifulSoup4 library.
In [7]:
def cleanup_text(text):
    # Parse the post's HTML and keep only the text inside <p> tags,
    # stripping out the stray backslashes along the way.
    soup = bs4.BeautifulSoup(text, "html5lib")
    return [p.get_text().translate({ord('\\'): None}) for p in soup.find_all('p')]
In [8]:
cleanup_text(df_body.iloc[0])
Out[8]:
We've got workable text here - let's fire up spaCy and run it through a full NLP pipeline.
spaCy is an NLP library that bundles a whole set of models and algorithms for the most popular NLP preprocessing tasks. It's very robust, and most of its results are very close to the current state of the art.
In [13]:
import spacy
nlp = spacy.load('en_core_web_md')
In [36]:
test_output = nlp(' '.join(cleanup_text(df_body.iloc[0])))
test_output
Out[36]:
Let's print out the result of sentence segmentation...
In [37]:
for num, sent in enumerate(test_output.sents):
    print('{}> {}\n'.format(num, sent))
What about Named Entities?
In [39]:
for num, ent in enumerate(test_output.ents):
    print('{}) {} - {}\n'.format(num+1, ent, ent.label_))
Part-of-Speech tagging identifies the syntactic function of each token/word.
Stemming and lemmatization are "normalizing" operations: they try to reduce words of the same family to a common root. Stemming does this by heuristically chopping off suffixes, while lemmatization maps each word to its dictionary form; a quick stemming contrast follows the lemma printout below.
In [42]:
for token, lemma, pos in ((t.orth_, t.lemma_, t.pos_) for t in test_output):
    print('{}|{} - {}\n'.format(token, lemma, pos))
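For contrast with spaCy's lemmas above, here is a minimal stemming sketch. It assumes NLTK is installed (NLTK isn't used anywhere else in this notebook), and the sample words are purely illustrative.
In [ ]:
from nltk.stem import PorterStemmer

# A stemmer chops suffixes heuristically, so it can yield non-words
# ('studies' -> 'studi'), whereas spaCy's lemmatizer maps words to
# dictionary forms ('studies' -> 'study').
stemmer = PorterStemmer()
for word in ['studies', 'studying', 'meeting', 'better']:
    print('{} -> {}'.format(word, stemmer.stem(word)))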
The token-level output looks good. Next, let's detect multi-word phrases with gensim: we'll write every post out as one lemmatized sentence per line, learn a bigram model over those sentences, then stack a trigram model on top of the bigram output.
In [45]:
from gensim.models import Phrases
from gensim.models.word2vec import LineSentence
In [142]:
def is_invalid_token(token):
    # Drop punctuation, whitespace and pronouns (lemma '-PRON-' in spaCy).
    return token.is_punct or token.is_space or token.lemma_ == '-PRON-'

def cleanup_html(corpus):
    # Generator version of cleanup_text: strip the HTML from each post
    # and join its paragraphs into a single string.
    for doc in corpus:
        soup = bs4.BeautifulSoup(doc, "html5lib")
        yield ' '.join([p.get_text().translate({ord('\\'): None}) for p in soup.find_all('p')])
In [143]:
unigram_fname = '../data/processed/unigram_processed_posts.txt'
In [144]:
%%time
with open(unigram_fname, 'w') as fout:
    # Stream every post through spaCy and write one lemmatized
    # sentence per line - the format gensim's LineSentence expects.
    for parsed_corpus in nlp.pipe(cleanup_html((i[1] for i in df_body.iteritems())), n_threads=8):
        for sent in parsed_corpus.sents:
            fout.write('{}\n'.format(' '.join([t.lemma_ for t in sent if not is_invalid_token(t)])))
In [145]:
unigram_sentences = LineSentence(unigram_fname)
In [151]:
bigram_model_fname = '../data/processed/bigram.model'
In [152]:
%%time
bigram_model = Phrases(unigram_sentences)
bigram_model.save(bigram_model_fname)
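As a quick sanity check, we can push a hand-made token list through the phrase model: token pairs that co-occurred often enough in the corpus get glued together with an underscore. (The tokens below are hypothetical; what actually gets joined depends on what the model learned from our posts.)
In [ ]:
# If e.g. 'machine' and 'learning' co-occur frequently in the corpus,
# the model emits them as a single 'machine_learning' token.
print(bigram_model[['machine', 'learning', 'is', 'hard']])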
In [153]:
bigram_fname = '../data/processed/bigram_processed_posts.txt'
In [154]:
with open(bigram_fname, 'w') as fout:
    for sent in unigram_sentences:
        bigram_sent = ' '.join(bigram_model[sent])
        fout.write(bigram_sent + '\n')
In [155]:
bigram_sentences = LineSentence(bigram_fname)
In [156]:
trigram_model_fname = '../data/processed/trigram.model'
In [157]:
%%time
trigram_model = Phrases(bigram_sentences)
trigram_model.save(trigram_model_fname)
In [158]:
trigram_fname = '../data/processed/trigram_processed_posts.txt'
In [159]:
with open(trigram_fname, 'w') as fout:
    for sent in bigram_sentences:
        trigram_sent = ' '.join(trigram_model[sent])
        fout.write(trigram_sent + '\n')
Finally, let's run every post through the complete pipeline - lemmatize, apply the bigram then the trigram model, and only then drop stopwords (so that phrases containing them can still form) - storing one post per line.
In [160]:
final_fname = '../data/processed/final_processed_trigram_posts.txt'
In [162]:
with open(final_fname, 'w') as fout:
    for parsed_post in nlp.pipe(cleanup_html((i[1] for i in df_body.iteritems())), n_threads=10):
        # Lemmatize, then promote bigrams and trigrams before
        # filtering stopwords: one fully processed post per line.
        unigram_post = [t.lemma_ for t in parsed_post if not is_invalid_token(t)]
        bigram_post = bigram_model[unigram_post]
        trigram_post = trigram_model[bigram_post]
        trigram_post = [term for term in trigram_post if term not in spacy.en.STOP_WORDS]
        fout.write('{}\n'.format(' '.join(trigram_post)))
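To confirm everything landed on disk as expected, let's peek at the first fully processed post:
In [ ]:
# Read back the first line of the final file: one lemmatized,
# phrased, stopword-free post per line.
with open(final_fname) as fin:
    print(next(fin))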