Introduction to Language Processing Concepts

Original tutorial by Brian Lehman, with updates by Fiona Pigott

The goal of this tutorial is to introduce some basic vocabulary, ideas, and Python libraries for thinking about topic modeling, so that we have a good shared vocabulary before talking more in-depth about processing language with Python later. We'll spend some time defining vocabulary for topic modeling and using basic topic modeling tools.

A big thank-you to the good people at the Stanford NLP group, for their informative and helpful online book: https://nlp.stanford.edu/IR-book/.

Definitions.

  1. Document: a body of text (e.g., a Tweet)
  2. Tokenization: dividing a document into pieces (and maybe throwing away some characters); in English this often (but not necessarily) means words separated by spaces and punctuation.
  3. Text corpus: the set of documents that contains the text for the analysis (e.g., many Tweets)
  4. Stop words: words that occur so frequently, or have so little topical meaning, that they are excluded (e.g., "and")
  5. Vectorize: transform each document into a numerical vector
  6. Vector corpus: the set of documents transformed into vectors, e.g., such that each document is represented by tuples of (token_id, frequency of that token in the document)

In [ ]:
# first, get some text:
import fileinput
try:
    import ujson as json
except ImportError:
    import json
documents = []
for line in fileinput.FileInput("example_tweets.json"):
    documents.append(json.loads(line)["text"])

1) Document

In the case of the text that we just imported, each entry in the list is a "document"--a single body of text, hopefully with some coherent meaning.


In [ ]:
print("One document: \"{}\"".format(documents[0]))

2) Tokenization

We split each document into smaller pieces ("tokens") in a process called tokenization. Tokens can be counted and, most importantly, compared between documents. There are potentially many different ways to tokenize text--splitting on spaces, removing punctuation, dividing the document into n-character pieces--anything that gives us tokens that we can, hopefully, effectively compare across documents and derive meaning from.
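
As a quick illustration, here is a minimal sketch of two naive tokenization choices applied to the same (made-up) example string: a plain whitespace split and fixed-length character pieces.

In [ ]:
# a made-up example string, just to illustrate two naive tokenization choices
example = "Check out the #nlp tutorial, it's great!"
# 1) split on whitespace (punctuation stays attached to words)
print(example.split())
# 2) split into overlapping 3-character pieces
print([example[i:i + 3] for i in range(len(example) - 2)])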

Related to tokenization are processes called stemming and lemmatization, which can help when using tokens to model topics based on the meaning of a word. In the phrases "they run" and "he runs" (space-separated tokens: ["they", "run"] and ["he", "runs"]), the words "run" and "runs" mean basically the same thing, but are two different tokens. Stemming and/or lemmatization help us compare tokens with the same meaning but different spelling/suffixes.

Lemmatization:

Uses a dictionary of words and their possible morphologies to map many different forms of a base word ("lemma") to a single lemma, comparable across documents. E.g.: "run", "ran", "runs", and "running" might all map to the lemma "run"
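
For example, here is a minimal lemmatization sketch using NLTK's WordNetLemmatizer (this assumes the WordNet data has been downloaded, e.g. via nltk.download('wordnet'); the rest of this notebook only uses a stemmer).

In [ ]:
# a minimal lemmatization sketch (assumes the WordNet corpus is available)
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# treat each word as a verb ("v") so that irregular forms like "ran" map to "run"
print([lemmatizer.lemmatize(w, pos="v") for w in ["run", "ran", "runs", "running"]])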

Stemming:

Uses a set of heuristic rules to try to approximate lemmatization, without knowing the words in advance. For the English language, a simple and effective stemming algorithm might simply remove an "s" or an "ing" from the ends of words. E.g.: "run", "runs", and "running" all map to "run," but "ran" (an irregularly conjugated verb) would not.
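
To make that concrete, here is a toy suffix-stripping stemmer (purely illustrative, and much cruder than the Porter stemmer used below).

In [ ]:
# a toy suffix-stripping stemmer, for illustration only
def toy_stem(token):
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

# note that "running" becomes "runn" and "ran" is untouched --
# real stemmers like the Porter stemmer below use many more rules
print([toy_stem(w) for w in ["run", "runs", "running", "ran"]])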

Stemming is particularly interesting and applicable in social data, because while some words are decidedly not standard English, conventional rules of grammar still apply. A fan of the popular singer Justin Bieber might call herself a "belieber," while a group of fans call themselves "beliebers." You won't find "belieber" in any English lemmatization dictionary, but a good stemming algorithm will still map "belieber" and "beliebers" to the same token ("belieber", or even "belieb", if we remove the common suffix "er").


In [ ]:
from nltk.stem import porter
from nltk.tokenize import TweetTokenizer

# tokenize the documents
# find good information on tokenization:
# https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html
# find documentation on pre-made tokenizers and options here:
# http://www.nltk.org/api/nltk.tokenize.html
tknzr = TweetTokenizer(reduce_len = True)

# stem the documents
# find good information on stemming and lemmatization:
# https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
# find documentation on available pre-implemented stemmers here:
# http://www.nltk.org/api/nltk.stem.html
stemmer = porter.PorterStemmer()
for doc in documents[0:10]:
    tokenized = tknzr.tokenize(doc)
    stemmed = [stemmer.stem(x) for x in tokenized]
    print("Original document:\n{}\nTokenized result:\n{}\nStemmed result:\n{}\n".format(
        doc, tokenized, stemmed))

3) Text corpus

The text corpus is a collection of all of the documents (Tweets) that we're interested in modeling. Topic modeling and/or clustering on a corpus tends to work best if that corpus has some similar themes--this will mean that some tokens overlap, and we can get signal out of when documents share (or do not share) tokens.

Modeling text tends to get much harder as more distinct, uncommon, and unrelated tokens appear in the corpus, especially when we are working with social data, where tokens don't necessarily appear in a dictionary. This difficulty (of having many, many unrelated tokens as dimensions in our model) is one example of the curse of dimensionality.


In [ ]:
# number of documents in the corpus
print("There are {} documents in the corpus.".format(len(documents)))

4) Stop words:

Stop words are simply tokens that we've chosen to remove from the corpus, for any reason. In English, words like "and", "the", "a", "at", and "it" are common choices for stop words. The stop word list can also be edited to fit project requirements, in case some words are too common in a particular dataset to be meaningful (another way to do stop word removal is to simply remove any word that appears in more than some fixed percentage of documents).


In [ ]:
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))
print("The English stop words list provided by NLTK: ")
print(stopset)

stopset.update(["twitter"]) # add token
stopset.remove("i")   # remove token
print("\nAdd or remove stop words form the set: ")
print(stopset)
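
As an aside, frequency-based stop word removal (dropping any token that appears in more than some fixed fraction of documents) can be done directly in scikit-learn's CountVectorizer (used below) via its max_df parameter. A minimal sketch:

In [ ]:
# sketch: in addition to the explicit stop word list,
# drop any token that appears in more than 90% of documents
from sklearn.feature_extraction.text import CountVectorizer
freq_vectorizer = CountVectorizer(stop_words=list(stopset), max_df=0.9)
freq_counts = freq_vectorizer.fit_transform(documents)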

5) Vectorize:

Transform each document into a vector. There are several good choices that you can make about how to do this transformation, and I'll talk about each of them in a second.

In order to vectorize documents in a corpus (without any dimensional reduction around the vocabulary), think of each document as a row in a matrix, and each column as a word in the vocabulary of the entire corpus. In order to vectorize a corpus, we must read the entire corpus, assign one word to each column, and then turn each document into a row.

Example:
Documents: "I love cake", "I hate chocolate", "I love chocolate cake", "I love cake, but I hate chocolate cake" Stopwords: Say, because the word "but" is a conjunction, we want to make it a stop word (not include it in our document vectors) Vocabulary: "I" (column 1), "love" (column 2), "cake" (column 3), "hate" (column 4), "chocolate" (column 5) \begin{equation*} \begin{matrix} \text{"I love cake" } & =\\ \text{"I hate chocolate" } & =\\ \text{"I love chocolate cake" } & = \\ \text{"I love cake, but I hate chocolate cake"} & = \end{matrix} \qquad \begin{bmatrix} 1 & 1 & 1 & 0 & 0\\ 1 & 0 & 0 & 1 & 1\\ 1 & 1 & 1 & 0 & 1\\ 2 & 1 & 2 & 1 & 1 \end{bmatrix} \end{equation*}

Vectorization like this doesn't take into account word order (we call this property "bag of words"), and in the above example I am simply counting the frequency of each term in each document.
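
As a quick check, here is a minimal sketch that reproduces the toy example with scikit-learn's CountVectorizer (the same function used on the real corpus below). Two small adjustments: scikit-learn orders columns alphabetically by token rather than in the order listed above, and its default token pattern drops single-character tokens, so we override it to keep "I".

In [ ]:
# reproduce the toy example above with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["I love cake",
            "I hate chocolate",
            "I love chocolate cake",
            "I love cake, but I hate chocolate cake"]

toy_vectorizer = CountVectorizer(lowercase=False,           # keep "I" as-is
                                 token_pattern=r"\b\w+\b",  # keep single-character tokens
                                 stop_words=["but"])        # "but" is our only stop word
toy_matrix = toy_vectorizer.fit_transform(toy_docs)
print(toy_vectorizer.vocabulary_)  # maps each token to its column index
print(toy_matrix.todense())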


In [ ]:
# we're going to use the vectorizer functions that scikit learn provides

# define the tokenizer that we want to use
# must be a callable function that takes a document and returns a list of tokens
tknzr = TweetTokenizer(reduce_len = True)
stemmer = porter.PorterStemmer()
def myTokenizer(doc):
    return [stemmer.stem(x) for x in tknzr.tokenize(doc)]

# choose the stopword set that we want to use
stopset = set(stopwords.words('english'))
stopset.update(["http","https","twitter","amp"])

# vectorize
# we're using the scikit learn CountVectorizer function, which is very handy
# documentation here: 
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
vectorizer = CountVectorizer(tokenizer = myTokenizer, stop_words = stopset)
vectorized_documents = vectorizer.fit_transform(documents)

In [ ]:
vectorized_documents

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
_ = plt.hist(vectorized_documents.todense().sum(axis = 1))
_ = plt.title("Number of tokens per document")
_ = plt.xlabel("Number of tokens")
_ = plt.ylabel("Number of documents with x tokens")

In [ ]:
from numpy import logspace, ceil, histogram, array
# get the token frequency
token_freq = sorted(vectorized_documents.todense().astype(bool).sum(axis = 0).tolist()[0], reverse = False)
# make a histogram with log scales
bins = array([ceil(x) for x in logspace(0, 3, 5)])
widths = (bins[1:] - bins[:-1])
hist = histogram(token_freq, bins=bins)
hist_norm = hist[0]/widths
# plot (notice that most tokens only appear in one document)
plt.bar(bins[:-1], hist_norm, widths)
plt.xscale('log')
plt.yscale('log')
_ = plt.title("Number of documents in which each token appears")
_ = plt.xlabel("Number of documents")
_ = plt.ylabel("Number of tokens")

Bag of words

Taking all the words from a document, and sticking them in a bag. Order does not matter, which could cause a problem. "Alice loves cake" might have a different meaning than "Cake loves Alice."
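
A minimal sketch of what that means in practice: the two sentences below produce exactly the same bag-of-words vector.

In [ ]:
# "Alice loves cake" and "Cake loves Alice" get identical bag-of-words vectors
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer()
print(bow.fit_transform(["Alice loves cake", "Cake loves Alice"]).todense())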

Frequency

Counting the number of times a word appears in a document.

Tf-Idf (term frequency inverse document frequency):

A statistic that is intended to reflect how important a word is to a document in a collection or corpus. The Tf-Idf value increases proportionally to the number of times a word appears in the document and is inversely proportional to the frequency of the word in the corpus--this helps control words that are generally more common than others.

There are several different possibilities for computing the tf-idf statistic--choosing whether to normalize the vectors, choosing whether to use counts or the logarithm of counts, etc. I'm going to show how scikit-learn computes the tf-idf statistic by default, with more information available in the documentation of the scikit-learn TfidfVectorizer.

$tf(t)$ : Term frequency, the count of the number of times each term appears in the document.
$idf(d,t)$ : Inverse document frequency.
$df(d,t)$ : Document frequency, the count of the number of documents in which the term appears.
$n$ : The total number of documents in the corpus.

$$ tfidf(t) = tf(t) \cdot idf(d,t) = tf(t) \cdot \Big( \log\big(\frac{1 + n}{1 + df(d, t)}\big) + 1 \Big) $$

We also then take the Euclidean ($l2$) norm of each document vector, so that long documents (documents with many non-stopword tokens) have the same norm as shorter documents.
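
To make the formula concrete, here is a small sketch that computes tf-idf by hand on a made-up count matrix and checks it against scikit-learn's TfidfTransformer defaults (smoothed idf, l2 normalization).

In [ ]:
# verify the tf-idf formula above on a made-up term-count matrix
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [0, 1, 1],
                   [1, 0, 1]])  # 4 documents, 3 terms

# scikit-learn's defaults: smooth_idf=True, norm='l2'
print(TfidfTransformer().fit_transform(counts).todense()[0])

# the same computation by hand, for the first document
n = counts.shape[0]                     # number of documents
df = (counts > 0).sum(axis=0)           # document frequency of each term
idf = np.log((1.0 + n) / (1 + df)) + 1  # smoothed inverse document frequency
row = counts[0] * idf                   # tf * idf for document 0
print(row / np.linalg.norm(row))        # l2-normalize the document vector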


In [ ]:
# documentation on this scikit-learn function here:
# http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tfidf_vectorizer = TfidfVectorizer(tokenizer = myTokenizer, stop_words = stopset)
tfidf_vectorized_documents = tfidf_vectorizer.fit_transform(documents)

In [ ]:
tfidf_vectorized_documents

In [ ]:
# you can look at two vectors for the same document, from 2 different vectorizers:
tfidf_vectorized_documents[0].todense().tolist()[0]

In [ ]:
vectorized_documents[0].todense().tolist()[0]

That's all for now!

