The goal of this notebook is to provide a brief overview of common steps taken during natural language preprocessing (NLP). When dealing with text data, the first major hurdle is figuring out how to go from a collection of strings to a format that statistical and machine learning models can understand. This resource is meant to get you started thinking about how to process your data, not to provide a formal pipeline.
Preprocessing follows a general series of steps, each requiring decisions that can substantially impact the final outcome of analyses if not considered carefully. Below we will be emphasizing how different sources of text require different approaches for preprocessing and modeling. As you approach your own data, think about the implications of each decision on the outcome of your analysis.
Note: Please send along any errata or comments you may have.
As a working example, we will be exploring data from New York Times op-ed articles. While a rich and appropriate data set, keep in mind that the examples below will likely be quite different from the data you are working with.
In the NY Times op-ed repository, there is a subset of 947 op-ed articles. To begin, we will look at one article to illustrate the steps of preprocessing. Later we will suggest some potential future directions for exploring the dataset in full.
In [1]:
import pandas as pd
# read subset of data from csv file into pandas dataframe
df = pd.read_csv('data_files/1_100.csv')
# get rid of any missing text data
df = df[pd.notnull(df['full_text'])]
# for now, choosing one article to illustrate preprocessing
article = df['full_text'][939]
In [2]:
# NY Times
article[:500]
Out[2]:
When working with text data, the goal is to process (remove, filter, and combine) the text in such a way that informative text is preserved and munged into a form that models can better understand. After looking at our raw text, we know that there are a number of textual attributes that we will need to address before we can ultimately represent our text as quantified features.
A common first step is to handle string encoding and formatting issues. Often it is easy to address the character encoding and mixed capitalization using Python's built-in functions. For our NY Times example, we will convert everything to UTF-8 encoding and convert all letters to lowercase.
In [3]:
print(article[:500].decode('utf-8').lower())
In order to process text, it must be deconstructed into its constituent elements through a process termed tokenization. Often, the tokens yielded from this process are simply individual words in a document. In certain cases, it can be useful to tokenize less conventional objects such as emoji or parts of HTML (or other code).
A simplistic way to tokenize text relies on white space, such as in nltk.tokenize.WhitespaceTokenizer. Relying on white space, however, does not take punctuation into account, so some tokens will include punctuation and may require further preprocessing (e.g. 'account,'). Depending on your data, the punctuation may provide meaningful information, so you will want to think about whether it should be preserved or whether it can be removed.
Tokenization is particularly challenging in the biomedical field, where many phrases contain substantial punctuation (parentheses, hyphens, etc.) that can't necessarily be ignored. Additionally, negation detection can be critical in this context, which provides an additional preprocessing challenge.
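As a hedged illustration, a custom regular expression pattern can keep hyphenated terms together rather than splitting on the hyphen; the pattern and example sentence below are only a sketch (using NLTK's RegexpTokenizer, introduced more formally below) and would need tuning for real biomedical text:

from nltk.tokenize import RegexpTokenizer
# sketch: treat hyphenated terms (e.g. 'beta-blocker') as single tokens
biomed_tokenizer = RegexpTokenizer(r'\w+(?:-\w+)*')
print(biomed_tokenizer.tokenize('the beta-blocker (metoprolol) was not effective'))
# -> ['the', 'beta-blocker', 'metoprolol', 'was', 'not', 'effective']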
NLTK contains many built-in modules for tokenization, such as nltk.tokenize.WhitespaceTokenizer and nltk.tokenize.RegexpTokenizer.
See also:
The Art of Tokenization
Negation's Not Solved: Generalizability Versus Optimizability in Clinical Natural Language Processing
Here we apply the Whitespace Tokenizer on our example article. Notice that we are again decoding characters (such as quotation marks) and using all lowercase characters. Because we used white space as the marker between tokens, we still have punctuation attached to some tokens (e.g. 'life.' and '\u201cif')
In [4]:
from nltk.tokenize import WhitespaceTokenizer
ws_tokenizer = WhitespaceTokenizer()
# tokenize example document
nyt_ws_tokens = ws_tokenizer.tokenize(article.decode('utf-8').lower())
print nyt_ws_tokens[:75]
In [5]:
from nltk.tokenize import RegexpTokenizer
re_tokenizer = RegexpTokenizer(r'\w+')
nyt_re_tokens = re_tokenizer.tokenize(article.decode('utf-8').lower())
In [6]:
print nyt_re_tokens[:100]
Critical thoughts: Decisions about tokens can be difficult. In general, it is best to start with common sense, intuition, and domain knowledge, and iterate based on overall model performance.
Depending on the application, many words provide little value when building an NLP model. Moreover, they may provide a source of "distraction" for models, since model capacity is used to understand words with low information content. Accordingly, these are termed stop words. Examples of stop words include pronouns, articles, prepositions, and conjunctions, but there are many other words, or non-meaningful tokens, that you may wish to remove.
Stop words can be determined and handled in many different ways, including:
As you work with your text, you may decide to iterate on this process. When in doubt, it is often a fruitful strategy to try the above bullets in order. See also: Stop Words
In [7]:
from nltk.corpus import stopwords
# here you can see the words included in the stop words corpus
print stopwords.words('english')
Let's remove the stop words and compare to our original list of tokens from our regular expression tokenizer.
In [8]:
cleaned_tokens = []
stop_words = set(stopwords.words('english'))
for token in nyt_re_tokens:
if token not in stop_words:
cleaned_tokens.append(token)
In [9]:
print 'Number of tokens before removing stop words: %d' % len(nyt_re_tokens)
print 'Number of tokens after removing stop words: %d' % len(cleaned_tokens)
You can see that by removing stop words, we now have fewer than half the number of tokens in our original list. Taking a peek at the cleaned tokens, we can see that a lot of the information that makes sentences human-readable has been lost, but the key nouns, verbs, adjectives, and adverbs remain.
In [10]:
print cleaned_tokens[:50]
Critical thoughts: You may notice from looking at this sample, however, that a potentially meaningful word has been removed: 'not'. This stop word corpus includes the words 'no', 'nor', and 'not', so by removing these words we have removed negation.
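If preserving negation matters for your application, one option is a sketch like the following, where the negation words are removed from the stop word set before filtering (which words to keep is a judgment call):

# keep negation by excluding these words from the stop word set
negation_words = {'no', 'nor', 'not'}
custom_stop_words = set(stopwords.words('english')) - negation_words
tokens_keeping_negation = [token for token in nyt_re_tokens if token not in custom_stop_words]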
The overarching goal of stemming and lemmatization is to reduce different forms of a word to a common base form. By performing stemming and lemmatization, the counts of word occurrences become much more informative when further processing the data (such as in vectorization, see below).
In deciding how to reduce the different forms of words, you will want to consider how much information you need to retain for your application. For instance, in many cases markers of tense and plurality are not informative, so removing these markers will allow you to reduce the number of features. In other cases, retaining these variations results in a better understanding of the underlying content.
Stemming is the process of representing the word as its root word while removing inflection. For example, the stem of the word 'explained' is 'explain'. By passing this word through a stemming function you would remove the tense inflection. There are multiple approaches to stemming: Porter stemming, Porter2 (snowball) stemming, and Lancaster stemming. You can read more in depth about these approaches.
In [11]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lancaster = LancasterStemmer()
In [12]:
print 'Porter Stem of "explanation": %s' % porter.stem('explanation')
print 'Porter2 (Snowball) Stem of "explanation": %s' % snowball.stem('explanation')
print 'Lancaster Stem of "explanation": %s' % lancaster.stem('explanation')
While stemming is a heuristic process that selectively removes the end of words, lemmatization is a more sophisticated process that can account for variables such as part-of-speech, meaning, and context within a document or neighboring sentences.
In [13]:
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
In [14]:
print lemmatizer.lemmatize('explanation')
In this example, lemmatization retains a bit more information than stemming. Within stemming, the Lancaster method is more aggressive than Porter and Snowball. Remember that this step allows us to reduce words to a common base form so that we can reduce our feature space and perform counting of occurrences. It will depend on your data and your application as to how much information you need to retain.
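To see the part-of-speech point in practice: WordNetLemmatizer treats every token as a noun unless told otherwise, and it accepts an optional pos argument. A small sketch:

# with the default (noun) part of speech, 'explained' is left unchanged
print(lemmatizer.lemmatize('explained'))
# telling the lemmatizer the token is a verb yields the base form
print(lemmatizer.lemmatize('explained', pos='v'))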
As a good starting point, see also: Stemming and lemmatization
In [15]:
stemmed_tokens = []
lemmatized_tokens = []
for token in cleaned_tokens:
stemmed_tokens.append(snowball.stem(token))
lemmatized_tokens.append(lemmatizer.lemmatize(token))
Let's take a look at a sample of our stemmed tokens
In [16]:
print stemmed_tokens[:50]
In contrast, here are the same tokens in their lemmatized form
In [17]:
print lemmatized_tokens[:50]
Looking at the above, it is clear that different strategies for generating tokens can retain different information. Moreover, given the transformations that stemming and lemmatization apply, the two approaches will retain different numbers of tokens in the overall vocabulary.
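As a quick check, we can compare the number of unique tokens each approach leaves us with (the exact counts will depend on your data):

# compare vocabulary sizes after stemming vs. lemmatization
print('unique stemmed tokens: %d' % len(set(stemmed_tokens)))
print('unique lemmatized tokens: %d' % len(set(lemmatized_tokens)))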
Critical thoughts: It's best to apply intuition and domain knowledge to get a feel for which strategy (or strategies) to begin with. In short, it's usually a good idea to optimize for a smaller number of unique tokens and greater interpretability, as long as this doesn't disagree with common sense and (sometimes more importantly) overall performance.
Often in natural language processing we want to represent our text as a quantitative set of features for subsequent analysis. We can refer to this as vectorization. One way to generate features from text is to count the occurrences of words. This approach is often referred to as a bag of words approach.
For the example of our article, we can represent the document as a vector of counts for each token. We can do the same for the other articles, and in the end we would have a set of vectors - with each vector representing an article. These vectors could then be used in the next phase of analysis (e.g. classification, document clustering, ...).
When we apply a count vectorizer to our corpus of articles, the output will be a matrix with the number of rows corresponding to the number of articles and the number of columns corresponding to the number of unique tokens across articles. You can imagine that if we have many articles in a corpus of varied content, the number of unique tokens could get quite large. Some of our preprocessing steps address this issue. In particular, the stemming/lemmatization step reduces the number of unique versions of a word that appear in the corpus. Additionally, it is possible to reduce the number of features by removing words that appear least frequently, or by removing words that are common to each article and therefore may not be informative for subsequent analysis.
For this example we will use the stemmed tokens from our article. We will need to join the tokens together to represent one article.
Check out the documentation for CountVectorizer in scikit-learn. You will see that there are a number of parameters that you can specify - including the maximum number of features. Depending on your data, you may choose to restrict the number of features by removing words that appear with least frequency (and this number may be set by cross-validation).
Example:
In [18]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
# stem our example article
stemmed_article = ' '.join(wd for wd in stemmed_tokens)
# perform a count-based vectorization of the document
article_vect = vectorizer.fit_transform([stemmed_article])
As shown below, we can see that the five most frequently occurring words in this article, titled "You Don't Need More Free Time," are time, work, weekend, people, and well:
In [19]:
freqs = [(word, article_vect.getcol(idx).sum()) for word, idx in vectorizer.vocabulary_.items()]
print 'top 5 words for op-ed titled "%s"' % df['title'][939]
print sorted(freqs, key=lambda x: -x[1])[0:5]
Now you can imagine that we could apply this count vectorizer to all of our articles. We could then use the word count vectors in a number of subsequent analyses (e.g. exploring the topics appearing across the corpus).
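For instance, assuming each article were preprocessed and joined into a single string as we did above, a sketch of vectorizing the whole corpus might look like the following (processed_articles is a hypothetical list of such strings):

from sklearn.feature_extraction.text import CountVectorizer
# processed_articles is a hypothetical list of preprocessed article strings
corpus_vectorizer = CountVectorizer()
corpus_count_matrix = corpus_vectorizer.fit_transform(processed_articles)
# one row per article, one column per unique token across the corpus
print(corpus_count_matrix.shape)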
We have mentioned that you may want to limit the number of features in your vector, and that one way to do this would be to only take the tokens that occur most frequently. Imagine again the above example of trying to differentiate between supporting and opposing documents in a political context. If the documents are all related to the same political initiative, then very likely there will be words related to the initiative that appear in both documents and thus have high frequency counts. If we cap the number of features by frequency, these words would likely be included, but will they be the most informative when trying to differentiate documents?
For many such cases we may want to use a vectorization approach called term frequency - inverse document frequency (tf-idf). Tf-idf allows us to weight words by their importance by considering how often a word appears in a given document and throughout the corpus. That is, if a word occurs frequently in a (preprocessed) document it should be important, yet if it also occurs frequently across many documents it is less informative and differentiating.
In our example, the name of the initiative would likely appear numerous times in each document for both opposing and supporting positions. Because the name occurs across all documents, this word would be down-weighted in importance. For a more in-depth read, these posts go into a bit more depth about text vectorization: tf-idf part 1 and tf-idf part 2.
Example:
To utilize tf-idf, we will add in additional articles from our dataset. We will need to preprocess the text from these articles, and then we can use TfidfVectorizer on our stemmed tokens.
To perform tf-idf transformations, we first need occurrence vectors for all of our articles using (as above) a count vectorizer. From there, we could use scikit-learn's TfidfTransformer to transform our matrix into a tf-idf matrix.
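A minimal sketch of that two-step route, again assuming a hypothetical processed_articles list of preprocessed article strings (the TfidfVectorizer used below simply combines the two steps):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# step 1: build the raw count matrix (processed_articles is hypothetical here)
count_matrix = CountVectorizer().fit_transform(processed_articles)
# step 2: re-weight the counts into tf-idf values
tfidf_matrix = TfidfTransformer().fit_transform(count_matrix)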
For a more complete example, consider a preprocessing pipeline where we first tokenize using regexp, remove standard stop words, perform stemming, and finally convert to tf-idf vectors:
In [20]:
def preprocess_article_content(text_df):
"""
Simple preprocessing pipeline which uses RegExp tokenization, sets basic token requirements, removes stop words, and stems the remaining tokens.
"""
print 'preprocessing article text...'
# tokenizer, stops, and stemmer
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english')) # can add more stop words to this set
stemmer = SnowballStemmer('english')
# process articles
article_list = []
for row, article in enumerate(text_df['full_text']):
cleaned_tokens = []
tokens = tokenizer.tokenize(article.decode('utf-8').lower())
for token in tokens:
if token not in stop_words:
if len(token) > 0 and len(token) < 20: # removes empty and very long tokens (likely non-words)
if not token[0].isdigit() and not token[-1].isdigit(): # removes tokens that start or end with a digit
stemmed_tokens = stemmer.stem(token)
cleaned_tokens.append(stemmed_tokens)
# add processed article to the list
article_list.append(' '.join(wd for wd in cleaned_tokens))
# echo results and return
print 'preprocessed content for %d articles' % len(article_list)
return article_list
# process articles
processed_article_list = preprocess_article_content(df)
# vectorize the articles and compute count matrix
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vectorizer = TfidfVectorizer()
tfidf_article_matrix = tf_vectorizer.fit_transform(processed_article_list)
print tfidf_article_matrix.shape
You can see that after applying the tf-idf vectorizer to our sample of 947 op-ed articles, we have a sparse matrix with 947 rows (each corresponding to an article) and 19,702 columns (each corresponding to a stemmed token). Depending on our application, we may choose to restrict the number of features (corresponding to the number of columns).
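For example, TfidfVectorizer accepts max_features, min_df, and max_df parameters for exactly this purpose; the values below are arbitrary placeholders that would need to be tuned (e.g. by cross-validation):

# keep only the 5,000 most frequent tokens, and drop tokens appearing in
# fewer than 2 articles or in more than 95% of articles
tf_vectorizer_small = TfidfVectorizer(max_features=5000, min_df=2, max_df=0.95)
tfidf_small_matrix = tf_vectorizer_small.fit_transform(processed_article_list)
print(tfidf_small_matrix.shape)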
When we decided to tokenize our corpus above, we decided to treat each word as a token. A collection of text represented by single words only is a unigram model of the data. This representation can often be surprisingly powerful, because the presence of single words can be hugely informative.
However, when dealing with natural language we often want to incorporate the structure that is present - grammar, syntactic meaning, and tone. The downside of unigrams is that they ignore the ordering of words, since word order is not captured by token counts. The simplest model that captures ordering and structure is one that treats neighboring word pairs as tokens; this is called a bigram model.
As an example, consider a document that has the words "good", "bad", and "project" in its corpus (with relatively similar count frequencies). From unigrams alone, it's not possible to tell whether the project is good or bad, because those adjectives could appear next to the subject "project" or in completely unrelated sentences. With bigrams, we might see the token "good project" appearing frequently, and we would then know significantly more about what the document says.
Choosing pairs of words (bigrams) is just the simplest choice we can make. We can generalize this to allow tokens of N words; these are called N-grams. When N=3 we refer to the tokens as trigrams, but for higher values of N we do not typically assign a unique name.
Best practices: Generally speaking, most NLP models want to have unigrams present. Very commonly bigrams are also important and are used to build high-quality models. Higher-order N-grams are typically less common, as the number of features (and the computational requirements) increases rapidly while yielding diminishing returns.
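In scikit-learn, for example, including bigrams alongside unigrams is just a matter of the ngram_range parameter; a short sketch using the articles we processed above:

from sklearn.feature_extraction.text import CountVectorizer
# count unigram and bigram features for each processed article
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_count_matrix = bigram_vectorizer.fit_transform(processed_article_list)
# the column count grows substantially once bigrams are included
print(bigram_count_matrix.shape)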
Another vectorization option is to use a word embedding model to generate vector representations of words. Word embedding models create non-linear representations of words that account for the context and neighboring language surrounding a word. A common model (with many pretrained implementations available) is Word2Vec.
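Word2Vec is not part of NLTK or scikit-learn; the sketch below assumes the gensim package, which is not used elsewhere in this notebook, and trains on our single example article purely for illustration (real embeddings need a much larger corpus). Note that older gensim versions name the vector_size parameter size.

from gensim.models import Word2Vec
# Word2Vec expects a corpus as a list of tokenized documents (a list of token lists);
# here we pass just our one cleaned article as a toy corpus
w2v_model = Word2Vec([cleaned_tokens], vector_size=100, window=5, min_count=1)
# look up the dense vector learned for one of our tokens
print(w2v_model.wv[cleaned_tokens[0]])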
Word embedding models have gained a lot of popularity, as they are able to capture syntactic meaning quite well. However, good vector representations are only appropriate for the corpus they are trained on, and often they will not generate good models for corpora that are significantly different. For instance, a Word2Vec model trained on literature may not be appropriate for Twitter or StackOverflow text data. The alternative in these cases is to retrain the model on the correct data, but this is hard - it requires a lot of data, choices, and computation to generate good representations. As a first approach, it's probably best to start with N-grams using counts or tf-idf weightings.
When thinking about NLP applications, there are a number of approaches to take!