This notebook provides a brief overview of common steps taken in natural language preprocessing. The goal is to get you started thinking about how to process your data, not to provide a formal pipeline. Raw text must be cleaned and converted into quantitative features before it can be used in most analyses, and the decisions you make along the way shape what your model ultimately sees.
Preprocessing follows a general series of steps, each requiring decisions that can substantially impact the final output if not considered carefully. For this tutorial, we will be emphasizing how different sources of text require different approaches for preprocessing and modeling. As you approach your own data, think about the implications of each decision on the outcome of your analysis.
This tutorial requires several commonly used Python packages for data analysis and Natural Language Processing (NLP):
In [1]:
# import requirements
import pandas as pd
import nltk
import gensim
import spacy
Here we will be exploring two different data sets: a subset of op-ed articles from the New York Times and a collection of posts from Stack Overflow.
While the New York Times data set consists of traditional English prose and substantially longer articles, the Stack Overflow data set is vastly different: its posts are shorter, more informal, and frequently mix natural language with code snippets and technical terminology.
In this repository, there is a subset of 100 op-ed articles from the New York Times. We will read these articles into a data frame. We will start off by looking at one article to illustrate the steps of preprocessing, and then we will compare both data sets to illustrate how the process is informed by the nature of the data.
In [58]:
# read subset of data from csv file into pandas dataframe
df = pd.read_csv('1_100.csv')
# for now, choosing one article to illustrate preprocessing
article = df['full_text'][939]
In [67]:
article[:500]
Out[67]:
After looking at our raw text, we know that there are a number of textual attributes that we will need to address before we can ultimately represent our text as quantified features. Using some built-in string functions, we can address the character encoding and mixed capitalization.
In [68]:
article[:500].decode('utf-8').lower()
Out[68]:
In order to process text, it must be deconstructed into its constituent elements through a process termed tokenization. Often, the tokens yielded from this process are individual words in a document. Tokens represent the linguistic units of a document.
A simplistic way to tokenize text relies on white space, such as in nltk.tokenize.WhitespaceTokenizer. Relying on white space, however, does not take punctuation into account, so some tokens will include punctuation and require further preprocessing (e.g. 'account,'). Depending on your data, the punctuation may provide meaningful information, so you will want to think about whether it should be preserved or whether it can be removed. Tokenization is particularly challenging in the biomedical field, where many phrases contain substantial punctuation (parentheses, hyphens, etc.) and negation detection is critical.
NLTK contains many built-in modules for tokenization, such as nltk.tokenize.WhitespaceTokenizer and nltk.tokenize.RegexpTokenizer.
See also:
The Art of Tokenization
Here we apply the WhitespaceTokenizer to the sample article. Notice that we are again decoding characters (such as quotation marks) and converting everything to lowercase. Because we used white space as the marker between tokens, we still have punctuation attached to some tokens (e.g. 'life.' and '\u201cif').
In [76]:
from nltk.tokenize import WhitespaceTokenizer
ws_tokenizer = WhitespaceTokenizer()
# tokenize example document
nyt_ws_tokens = ws_tokenizer.tokenize(article.decode('utf-8').lower())
print nyt_ws_tokens[:75]
In [77]:
from nltk.tokenize import RegexpTokenizer
# match runs of word characters, which drops punctuation from the tokens
re_tokenizer = RegexpTokenizer(r'\w+')
nyt_re_tokens = re_tokenizer.tokenize(article.decode('utf-8').lower())
In [78]:
print nyt_re_tokens[:100]
Depending on the application, many words provide little value when building an NLP model; these are termed stop words. Examples of stop words include pronouns, articles, prepositions, and conjunctions, but there are many other words, or non-meaningful tokens, that you may wish to remove. For instance, there may be artifacts from the web scraping process that you need to remove.
Stop words can be determined and handled in many different ways, including relying on a predefined list (such as the NLTK stop word corpus used below), building a custom list tailored to your domain, or filtering out tokens based on their frequency in the corpus.
As you work with your text, you may decide to iterate on this process. See also: Stop Words
In [88]:
from nltk.corpus import stopwords
# build the list of standard English stop words
stop_list = [w for w in stopwords.words('english')]
# print the first 5 stop words
print stop_list[:5]
# print the type of the elements in the stop words list
print type(stop_list[0])
Let's remove the stop words and compare to our original list of tokens from our regular expression tokenizer.
In [101]:
cleaned_tokens = []
stop_words = set(stopwords.words('english'))
# keep only the tokens that do not appear in the stop word list
for token in nyt_re_tokens:
    if token not in stop_words:
        cleaned_tokens.append(token)
In [102]:
print 'Number of tokens before removing stop words: %d' % len(nyt_re_tokens)
print 'Number of tokens after removing stop words: %d' % len(cleaned_tokens)
You can see that by removing stop words, we now have fewer than half as many tokens as in our original list. Taking a peek at the cleaned tokens, we can see that much of the information that makes the sentence read naturally to a human has been lost, but the key nouns, verbs, adjectives, and adverbs remain.
In [105]:
print cleaned_tokens[:50]
You may notice from looking at this sample, however, that a potentially meaningful word has been removed: 'not'. The NLTK stop word corpus includes the words 'no', 'nor', and 'not', so by removing these words we have removed negation from the text. If negation is important for your analysis (for example, in sentiment analysis or clinical text), you may want to keep these words by customizing your stop word list.
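A minimal sketch of one way to keep negation, starting from the standard stop word list and removing the negation words from it before filtering (the variable names here are only illustrative):
In [ ]:
# keep negation by taking the negation words back out of the stop list
negation_words = set(['no', 'nor', 'not'])
custom_stop_words = set(stopwords.words('english')) - negation_words
tokens_with_negation = [t for t in nyt_re_tokens if t not in custom_stop_words]
print tokens_with_negation[:50]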
The overarching goal of stemming and lemmatization is to reduce the different forms of a word to a common base form. This allows you to count occurrences of words in the vectorization step. In deciding how to reduce the different forms of words, you will want to consider how much information you need to retain for your application. For instance, in many cases markers of tense and plurality are not informative, and removing these markers will reduce the number of features.
Stemming is the process of reducing a word to its root form by removing inflection. For example, the stem of the word 'explained' is 'explain'; passing the word through a stemmer removes the tense inflection. There are multiple approaches to stemming: Porter stemming, Porter2 (Snowball) stemming, and Lancaster stemming. You can read about these approaches in more depth.
In [140]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
In [162]:
print 'Porter Stem of "explanation": %s' % porter.stem('explanation')
print 'Porter2 (Snowball) Stem of "explanation": %s' % snowball.stem('explanation')
print 'Lancaster Stem of "explanation": %s' % lancaster.stem('explanation')
While stemming is a heuristic process that selectively removes the ends of words, lemmatization is a more sophisticated process that takes into account variables such as part of speech, meaning, and context within a document or neighboring sentences.
In [152]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
In [153]:
print lemmatizer.lemmatize('explanation')
In this example, lemmatization retains a bit more information than stemming. Within stemming, the Lancaster method is more aggressive than Porter and Snowball. Remember that this step allows us to reduce words to a common base form so that we can reduce our feature space and perform counting of occurrences. It will depend on your data and your application as to how much information you need to retain.
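One detail worth knowing: the WordNet lemmatizer treats every token as a noun unless you pass a part-of-speech tag, and supplying the tag can change the result. A small illustrative cell, using the lemmatizer we defined above:
In [ ]:
# with no part-of-speech tag, WordNet assumes the token is a noun and
# returns 'explained' unchanged
print lemmatizer.lemmatize('explained')
# tagging the token as a verb reduces it to its base form, 'explain'
print lemmatizer.lemmatize('explained', pos='v')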
See also: Stemming and lemmatization
In [110]:
stemmed_tokens = []
lemmatized_tokens = []
# stem and lemmatize each of the cleaned tokens; here we use the Snowball
# stemmer, but the Porter or Lancaster stemmer could be swapped in
for token in cleaned_tokens:
    stemmed_tokens.append(snowball.stem(token))
    lemmatized_tokens.append(lemmatizer.lemmatize(token))
Let's take a look at a sample of our stemmed tokens.
In [121]:
print stemmed_tokens[:50]
In contrast, here are the same tokens in their lemmatized form.
In [122]:
print lemmatized_tokens[:50]
Often in natural language processing we want to represent our text as a quantitative set of features for subsequent analysis. One way to generate features from text is to count the occurrences of words. This approach is often referred to as a bag of words approach.
In the example of our article, we could represent the article as a vector of counts for each token. If we did the same for all of the other articles, we would have a set of vectors with each vector representing an article. If we had only one article, then we could have split the article into sentences and then represented each sentence as a vector.
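As a rough sketch of that sentence-level alternative (illustrative only; we continue working with the full article below):
In [ ]:
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
# split the raw article into sentences and count word occurrences in each
sentences = sent_tokenize(article.decode('utf-8').lower())
sentence_vectorizer = CountVectorizer()
sentence_vects = sentence_vectorizer.fit_transform(sentences)
# rows correspond to sentences, columns to unique tokens
print sentence_vects.shape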
If we apply a count vectorizer to our article, we will get a vector whose length is the number of unique tokens.
For this example we will use the stemmed tokens from our article. We will need to join the tokens together to represent one article.
Check out the documentation for CountVectorizer in scikit-learn. You will see that there are a number of parameters that you can specify, including the maximum number of features. Depending on your data, you may choose to restrict the number of features by removing words that appear least frequently.
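For example, a vectorizer configured to cap the size of the vocabulary might look like the following (the parameter values are arbitrary and only for illustration):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
# illustrative settings: keep at most 500 features and ignore tokens that
# appear in fewer than 2 documents
capped_vectorizer = CountVectorizer(max_features=500, min_df=2)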
In [192]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
In [180]:
# join the stemmed tokens back into a single string representing the article
stemmed_article = ' '.join(stemmed_tokens)
In [194]:
article_vect = vectorizer.fit_transform([stemmed_article])
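To get a feel for the resulting vector, we can pair each token in the learned vocabulary with its count in the article. A minimal sketch, assuming the vectorizer and article_vect objects above:
In [ ]:
# map each vocabulary token to its count and show the most frequent ones
counts = article_vect.toarray()[0]
vocab = vectorizer.get_feature_names()
top_tokens = sorted(zip(vocab, counts), key=lambda x: x[1], reverse=True)
print top_tokens[:10]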