Natural Language Preprocessing


Gregory Antell & Emily Halket
December, 2016

This notebook provides a brief overview of common preprocessing steps for natural language processing (NLP). The goal is to get you started thinking about how to process your data, not to provide a formal pipeline. Analysis of unstructured text is often limited by the computational techniques available rather than by data collection.

Preprocessing follows a general series of steps, each requiring decisions that can substantially impact the final output if not considered carefully. For this tutorial, we will be emphasizing how different sources of text require different approaches for preprocessing and modeling. As you approach your own data, think about the implications of each decision on the outcome of your analysis.

Requirements

This tutorial requires several commonly used Python packages for data analysis and Natural Language Processing (NLP):

  • Pandas: for data structures and analysis in Python
  • NLTK: Natural Language Toolkit
  • gensim: for topic modelling
  • scikit-learn: for vectorization and other machine learning tools

In [3]:
# import requirements
import pandas as pd
import nltk
#import gensim
import spacy

Data

Here we will be exploring two different data sets:

  1. New York Times op-eds
  2. Stack Overflow questions and comments

While the New York Times data set consists of traditional English prose in substantially longer articles, the Stack Overflow data set is vastly different: its postings are shorter and more informal, and they are littered with code snippets and HTML markup.

In this repository, there is a subset of 100 op-ed articles from the New York Times. We will read these articles into a data frame. We will start off by looking at one article to illustrate the steps of preprocessing, and then we will compare both data sets to illustrate how the process is informed by the nature of the data.


In [17]:
# New York Times data
## read subset of data from csv file into pandas dataframe
df = pd.read_csv('1_100.csv')
## for now, choosing one article to illustrate preprocessing
article = df['full_text'][939]

# Stack Overflow data
## read subset of data from csv file into pandas dataframe
df2 = pd.read_csv('doc_200.csv')
## for now, choosing one posting to illustrate preprocessing
posting = df2['Document'][1]

Let's take a peek at the raw text of this article to see what we are dealing with!

Right off the bat, you can see that we have a mixture of uppercase and lowercase words, punctuation, and some character-encoding artifacts. The Stack Overflow posting also contains many HTML tags. All of these need to be addressed.


In [67]:
# NY Times
article[:500]


Out[67]:
'AMERICANS work some of the longest hours in the Western world, and many struggle to achieve a healthy balance between work and life. As a result, there is an understandable tendency to assume that the problem we face is one of quantity: We simply do not have enough free time. \xe2\x80\x9cIf I could just get a few more hours off work each week,\xe2\x80\x9d you might think, \xe2\x80\x9cI would be happier.\xe2\x80\x9d This may be true. But the situation, I believe, is more complicated than that. As I discovered in a study that I publ'

In [18]:
# Stack Overflow
posting[:500]


Out[18]:
'Adding scripting functionality to .NET applications <p>I have a little game written in C#. It uses a database as back-end. It\'s \na <a href="http://en.wikipedia.org/wiki/Collectible_card_game">trading card game</a>, and I wanted to implement the function of the cards as a script.</p>\n\n<p>What I mean is that I essentially have an interface, <code>ICard</code>, which a card class implements (<code>public class Card056 : ICard</code>) and which contains function that are called by the game.</p>\n\n<p>'

Preprocessing Text

After looking at our raw text, we know that there are a number of textual attributes that we will need to address before we can ultimately represent our text as quantified features. Using some built-in string methods, we can address the character encoding and mixed capitalization.


In [20]:
print(article[:500].decode('utf-8').lower())


americans work some of the longest hours in the western world, and many struggle to achieve a healthy balance between work and life. as a result, there is an understandable tendency to assume that the problem we face is one of quantity: we simply do not have enough free time. “if i could just get a few more hours off work each week,” you might think, “i would be happier.” this may be true. but the situation, i believe, is more complicated than that. as i discovered in a study that i publ

In [22]:
print(posting[:500].decode('utf-8').lower())


adding scripting functionality to .net applications <p>i have a little game written in c#. it uses a database as back-end. it's 
a <a href="http://en.wikipedia.org/wiki/collectible_card_game">trading card game</a>, and i wanted to implement the function of the cards as a script.</p>

<p>what i mean is that i essentially have an interface, <code>icard</code>, which a card class implements (<code>public class card056 : icard</code>) and which contains function that are called by the game.</p>

<p>
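
The lowercased Stack Overflow posting still contains HTML markup, which carries little linguistic information. One simple way to address it, sketched below, is to strip the tags with a regular expression; for messier markup, a dedicated HTML parser (e.g. BeautifulSoup) is more robust.


In [ ]:
import re

# strip HTML tags from the posting with a simple regular expression
# (a simplistic sketch; a full HTML parser is more robust for malformed markup)
cleaned_posting = re.sub(r'<[^>]+>', ' ', posting.decode('utf-8').lower())

# collapse the extra whitespace left behind by the removed tags
cleaned_posting = re.sub(r'\s+', ' ', cleaned_posting)

print(cleaned_posting[:300])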

1. Tokenization

In order to process text, it must be deconstructed into its constituent elements through a process termed tokenization. The tokens yielded by this process, often the individual words of a document, are its basic linguistic units.

A simplistic way to tokenize text relies on white space, such as in nltk.tokenize.WhitespaceTokenizer. Relying on white space alone, however, does not take punctuation into account, so some tokens will carry punctuation with them and require further preprocessing (e.g. 'account,'). Depending on your data, the punctuation may provide meaningful information, so you will want to think about whether it should be preserved or whether it can be removed. Tokenization is particularly challenging in the biomedical field, where many phrases contain substantial punctuation (parentheses, hyphens, etc.) and negation detection is critical.

NLTK contains many built-in modules for tokenization, such as nltk.tokenize.WhitespaceTokenizer and nltk.tokenize.RegexpTokenizer.

See also:
The Art of Tokenization

Negation's Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing

Example: Whitespace Tokenization

Here we apply the Whitespace Tokenizer to the sample article. Notice that we are again decoding characters (such as quotation marks) and lowercasing the text. Because we used white space as the marker between tokens, punctuation remains attached to some tokens (e.g. 'life.' and '\u201cif').


In [76]:
from nltk.tokenize import WhitespaceTokenizer
ws_tokenizer = WhitespaceTokenizer()

# tokenize example document
nyt_ws_tokens = ws_tokenizer.tokenize(article.decode('utf-8').lower())

print nyt_ws_tokens[:75]


[u'americans', u'work', u'some', u'of', u'the', u'longest', u'hours', u'in', u'the', u'western', u'world,', u'and', u'many', u'struggle', u'to', u'achieve', u'a', u'healthy', u'balance', u'between', u'work', u'and', u'life.', u'as', u'a', u'result,', u'there', u'is', u'an', u'understandable', u'tendency', u'to', u'assume', u'that', u'the', u'problem', u'we', u'face', u'is', u'one', u'of', u'quantity:', u'we', u'simply', u'do', u'not', u'have', u'enough', u'free', u'time.', u'\u201cif', u'i', u'could', u'just', u'get', u'a', u'few', u'more', u'hours', u'off', u'work', u'each', u'week,\u201d', u'you', u'might', u'think,', u'\u201ci', u'would', u'be', u'happier.\u201d', u'this', u'may', u'be', u'true.', u'but']

Example: Regular Expression Tokenization

By applying the regular expression tokenizer we can return a list of word tokens without punctuation.


In [77]:
from nltk.tokenize import RegexpTokenizer
re_tokenizer = RegexpTokenizer(r'\w+')

nyt_re_tokens = re_tokenizer.tokenize(article.decode('utf-8').lower())

In [78]:
print nyt_re_tokens[:100]


[u'americans', u'work', u'some', u'of', u'the', u'longest', u'hours', u'in', u'the', u'western', u'world', u'and', u'many', u'struggle', u'to', u'achieve', u'a', u'healthy', u'balance', u'between', u'work', u'and', u'life', u'as', u'a', u'result', u'there', u'is', u'an', u'understandable', u'tendency', u'to', u'assume', u'that', u'the', u'problem', u'we', u'face', u'is', u'one', u'of', u'quantity', u'we', u'simply', u'do', u'not', u'have', u'enough', u'free', u'time', u'if', u'i', u'could', u'just', u'get', u'a', u'few', u'more', u'hours', u'off', u'work', u'each', u'week', u'you', u'might', u'think', u'i', u'would', u'be', u'happier', u'this', u'may', u'be', u'true', u'but', u'the', u'situation', u'i', u'believe', u'is', u'more', u'complicated', u'than', u'that', u'as', u'i', u'discovered', u'in', u'a', u'study', u'that', u'i', u'published', u'with', u'my', u'colleague', u'chaeyoon', u'lim', u'in', u'the']

2. Stop Words

Depending on the application, many words provide little value when building an NLP model; these are termed stop words. Examples of stop words include pronouns, articles, prepositions, and conjunctions, but there are many other words, or otherwise non-meaningful tokens, that you may wish to remove. For instance, there may be artifacts from the web-scraping process that you need to remove.

Stop words can be determined and handled in many different ways, including:

  • Using a list of words determined a priori, either a standard list from the NLTK package or one modified from such a list based on domain knowledge of a particular subject

  • Sorting the terms by collection frequency (the total number of times each term appears in the document collection) and then taking the most frequent terms as a stop list, typically filtering them by hand for semantic content (see the sketch after this list).

  • Using no defined stop list at all, and dealing with text data in a purely statistical manner. In general, search engines do not use stop lists.
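
As a sketch of the frequency-based approach, we can count how often each token occurs and inspect the most frequent terms as stop word candidates (here on a single tokenized article; in practice you would count over the whole collection, and the cut-off is a judgment call):


In [ ]:
from collections import Counter

# count token frequencies in the tokenized, lowercased article
# (in practice, count over the entire document collection)
token_counts = Counter(nyt_re_tokens)

# the most frequent terms are candidate stop words
print(token_counts.most_common(10))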

As you work with your text, you may decide to iterate on this process. See also: Stop Words

Example: Stopword Corpus

For this example, we will use the English stop words corpus from NLTK.


In [88]:
from nltk.corpus import stopwords

# print the first 5 standard English stop words
stop_list = stopwords.words('english')
print stop_list[:5]

# print the type of the elements in the stop words list
print type(stop_list[0])


[u'i', u'me', u'my', u'myself', u'we']
<type 'unicode'>

Let's remove the stop words and compare to our original list of tokens from our regular expression tokenizer.


In [101]:
cleaned_tokens = []
stop_words = set(stopwords.words('english'))
for token in nyt_re_tokens:
    if token not in stop_words:
        cleaned_tokens.append(token)

In [102]:
print 'Number of tokens before removing stop words: %d' % len(nyt_re_tokens)
print 'Number of tokens after removing stop words: %d' % len(cleaned_tokens)


Number of tokens before removing stop words: 825
Number of tokens after removing stop words: 405

You can see that by removing stop words, we now have fewer than half as many tokens as in our original list. Taking a peek at the cleaned tokens, we can see that much of what makes the text read naturally to a human has been lost, but the key nouns, verbs, adjectives, and adverbs remain.


In [105]:
print cleaned_tokens[:50]


[u'americans', u'work', u'longest', u'hours', u'western', u'world', u'many', u'struggle', u'achieve', u'healthy', u'balance', u'work', u'life', u'result', u'understandable', u'tendency', u'assume', u'problem', u'face', u'one', u'quantity', u'simply', u'enough', u'free', u'time', u'could', u'get', u'hours', u'work', u'week', u'might', u'think', u'would', u'happier', u'may', u'true', u'situation', u'believe', u'complicated', u'discovered', u'study', u'published', u'colleague', u'chaeyoon', u'lim', u'journal', u'sociological', u'science', u'shortage', u'free']

You may notice from looking at this sample, however, that a potentially meaningful word has been removed: 'not'. This stop word corpus includes the words 'no', 'nor', and 'not', and so by removing these words we have removed negation.
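
If negation matters for your application, one option (a sketch, not part of the pipeline above) is to start from the NLTK list and explicitly keep the negation words:


In [ ]:
# build a custom stop word set that preserves negation
negation_words = {'no', 'nor', 'not'}
custom_stop_words = set(stopwords.words('english')) - negation_words

cleaned_tokens_with_negation = [t for t in nyt_re_tokens if t not in custom_stop_words]
print('Number of tokens after removing custom stop words: %d' % len(cleaned_tokens_with_negation))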

3. Stemming and Lemmatization


The overarching goal of stemming and lemmatization is to reduce the different forms of a word to a common base form. This will allow you to count occurrences of words in the vectorization step. In deciding how to reduce word forms, you will want to consider how much information you need to retain for your application. For instance, in many cases markers of tense and plurality are not informative, so removing these markers will reduce the number of features.

Stemming is the process of reducing a word to its root form by stripping inflection. For example, the stem of the word 'explained' is 'explain': passing the word through a stemmer removes the tense inflection. There are multiple approaches to stemming: Porter stemming, Porter2 (Snowball) stemming, and Lancaster stemming. You can read more in depth about these approaches.


In [140]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()

In [162]:
print 'Porter Stem of "explanation": %s' % porter.stem('explanation')
print 'Porter2 (Snowball) Stem of "explanation": %s' % snowball.stem('explanation')
print 'Lancaster Stem of "explanation": %s' % lancaster.stem('explanation')


Porter Stem of "explanation": explan
Porter2 (Snowball) Stem of "explanation": explan
Lancaster Stem of "explanation": expl

While stemming is a heuristic process that selectively removes the end of words, lemmatization is a more sophisticated process that takes into account variables such as part-of-speech, meaning, and context within a document or neighboring sentences.


In [152]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [153]:
print lemmatizer.lemmatize('explanation')


explanation

In this example, lemmatization retains a bit more information than stemming. Within stemming, the Lancaster method is more aggressive than Porter and Snowball. Remember that this step allows us to reduce words to a common base form so that we can shrink our feature space and count occurrences. How much information you need to retain will depend on your data and your application.

See also: Stemming and lemmatization
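
Note that the WordNet lemmatizer treats each word as a noun unless you pass a part-of-speech tag, so supplying one can change the result. A small sketch (the 'v' tag marks a verb):


In [ ]:
# without a POS tag the lemmatizer assumes a noun and leaves 'explained' unchanged
print(lemmatizer.lemmatize('explained'))

# with a verb POS tag the tense inflection is removed
print(lemmatizer.lemmatize('explained', pos='v'))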


Example: Stemming and Lemmatization

To illustrate the difference between stemming and lemmatization, we will apply both methods to our articles.


In [110]:
stemmed_tokens = []
lemmatized_tokens = []

# apply the Porter2 (Snowball) stemmer and the WordNet lemmatizer defined above
for token in cleaned_tokens:
    stemmed_tokens.append(snowball.stem(token))
    lemmatized_tokens.append(lemmatizer.lemmatize(token))

Let's take a look at a sample of our stemmed tokens:


In [121]:
print stemmed_tokens[:50]


[u'american', u'work', u'longest', u'hour', u'western', u'world', u'mani', u'struggl', u'achiev', u'healthi', u'balanc', u'work', u'life', u'result', u'understand', u'tendenc', u'assum', u'problem', u'face', u'one', u'quantiti', u'simpli', u'enough', u'free', u'time', u'could', u'get', u'hour', u'work', u'week', u'might', u'think', u'would', u'happier', u'may', u'true', u'situat', u'believ', u'complic', u'discov', u'studi', u'publish', u'colleagu', u'chaeyoon', u'lim', u'journal', u'sociolog', u'scienc', u'shortag', u'free']

In contrast, here are the same tokens in their lemmatized form:


In [122]:
print lemmatized_tokens[:50]


[u'american', u'work', u'longest', u'hour', u'western', u'world', u'many', u'struggle', u'achieve', u'healthy', u'balance', u'work', u'life', u'result', u'understandable', u'tendency', u'assume', u'problem', u'face', u'one', u'quantity', u'simply', u'enough', u'free', u'time', u'could', u'get', u'hour', u'work', u'week', u'might', u'think', u'would', u'happier', u'may', u'true', u'situation', u'believe', u'complicated', u'discovered', u'study', u'published', u'colleague', u'chaeyoon', u'lim', u'journal', u'sociological', u'science', u'shortage', u'free']

4. Vectorization

Often in natural language processing we want to represent our text as a quantitative set of features for subsequent analysis. One way to generate features from text is to count the occurrences of words. This approach is often referred to as a bag-of-words approach.

In the example of our article, we could represent the article as a vector of counts for each token. If we did the same for all of the other articles, we would have a set of vectors, with each vector representing an article. If we had only one article, we could instead split it into sentences and represent each sentence as a vector, as sketched below.
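
A minimal sketch of that sentence-level alternative, using NLTK's sentence tokenizer (an aside, not part of the pipeline below):


In [ ]:
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# split the raw article into sentences and treat each sentence as a document
sentences = sent_tokenize(article.decode('utf-8'))

# one row per sentence, one column per unique token
sentence_vects = CountVectorizer().fit_transform(sentences)
print(sentence_vects.shape)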

If we apply a count vectorizer to our article, we will have a vector with the length of the number of unique tokens.

Example: Count Vectorization of Article

For this example we will use the stemmed tokens from our article. We will need to join the tokens together to represent one article.

Check out the documentation for CountVectorizer in scikit-learn. You will see that there are a number of parameters that you can specify, including the maximum number of features. Depending on your data, you may choose to restrict the number of features, for example by removing words that appear least frequently.


In [192]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

In [180]:
stemmed_article = ' '.join(stemmed_tokens)

In [194]:
article_vect = vectorizer.fit_transform([stemmed_article])
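
To see what the vectorizer learned, we can pair each feature (stem) with its count. A quick sketch (with a single document the matrix has just one row; note that in newer versions of scikit-learn, get_feature_names has been replaced by get_feature_names_out):


In [ ]:
# one row (our single article), one column per unique stem
print(article_vect.shape)

# pair each stem with its count and look at the most frequent stems
counts = article_vect.toarray()[0]
vocab = vectorizer.get_feature_names()
print(sorted(zip(counts, vocab), reverse=True)[:10])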

Unigrams v. Bigrams v. Ngrams
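
Counting single tokens (unigrams) discards word order entirely. Counting pairs of adjacent tokens (bigrams) or longer n-grams preserves some local context, at the cost of a much larger feature space. In scikit-learn this is controlled by the ngram_range parameter of CountVectorizer; a brief sketch using the stemmed article from above:


In [ ]:
# count unigrams and bigrams together by setting ngram_range
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_vect = bigram_vectorizer.fit_transform([stemmed_article])

print('Number of unigram features: %d' % article_vect.shape[1])
print('Number of unigram + bigram features: %d' % bigram_vect.shape[1])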

tf-idf
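
Raw counts tend to be dominated by common words. Tf-idf (term frequency-inverse document frequency) weighting downweights terms that appear in many documents, so that terms distinctive to a document carry more weight. A minimal sketch with scikit-learn's TfidfVectorizer; tf-idf is only meaningful across multiple documents, so here we fit on both example documents purely for illustration:


In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

# fit tf-idf weights across both example documents (illustration only)
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_vects = tfidf_vectorizer.fit_transform([article.decode('utf-8'),
                                              posting.decode('utf-8')])

# one row per document, one column per term in the shared vocabulary
print(tfidf_vects.shape)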


In [ ]: