Text Analysis


Text analysis is used to summarize or extract useful information from large amounts of unstructured text stored in documents. This opens up the opportunity of using text data alongside more conventional data sources (e.g., surveys and administrative data). The goal of text analysis is to take a large corpus of complex, unstructured text and extract the important messages from it in a comprehensible way.

Text Analysis can help with the following tasks:

  • Information retrieval: Help find relevant information in large databases, for example when conducting a systematic literature review.

  • Clustering and text categorization: Techniques like topic modeling can summarize a large corpus of text by finding the most important phrases.

  • Text Summarization: Create category-sensitive text summaries of a large corpus of text.

  • Machine Translation: Translate from one language to another.

In this tutorial we are going to analyze Reddit posts from May 2015 in order to classify which subreddit a post originated from, and also use topic modeling to categorize posts into topics made up of co-occurring words.

Glossary of Terms

  • Tokenize: Tokenization is the process by which text is separated into meaningful terms or phrases. In English this is fairly trivial, as words are separated by whitespace.

  • Stemming: Stemming is a type of text normalization where words that have different forms but the same essential meaning are reduced to a common root form. For example, "run," "runs," and "running" all stem to "run." Mapping an irregular form such as "went" to its dictionary form (lemma) "go" is the job of the related technique of lemmatization.

  • TFIDF: TFIDF (term frequency-inverse document frequency) is an example of feature engineering where the most important words are extracted by taking into account their frequency in individual documents and in the corpus as a whole.

  • Topic Modeling: Topic modeling is an unsupervised learning method where groups of co-occurring words are clustered into topics. Typically, the words in a cluster should be related and make sense together (e.g., boat, ship, captain). Individual documents can then fall into multiple topics.

  • LDA: LDA (latent Dirichlet allocation) is a type of probabilistic model commonly used for topic modelling.

  • Stop Words: Stop words are words that carry little semantic meaning, such as prepositions, articles and other very common words. They can often be ignored.
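
As a rough illustration of the tokenization and stop-word entries above, the sketch below splits a sentence into lowercase tokens and drops stop words. The stop-word set and the helper names (`tokenize`, `remove_stop_words`) are hand-picked assumptions for illustration only; the tutorial itself relies on NLTK's English stop-word list.

```python
import re

# Hypothetical, hand-picked stop-word set for illustration only.
TOY_STOP_WORDS = {"the", "a", "an", "of", "on", "in", "and", "is", "to"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]


tokens = tokenize("The captain of the ship is on the boat.")
content_tokens = remove_stop_words(tokens)
print(tokens)
print(content_tokens)
```

Only the words that carry the meaning of the sentence survive the filter, which is why stop words are usually removed before counting word frequencies.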

In [ ]:
from __future__ import print_function  # must be the first statement in the cell

import re
import time
from collections import Counter, OrderedDict

import matplotlib.pyplot as plt
import nltk
import numpy as np
import ujson
from six.moves import zip, range

from sklearn import preprocessing
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score
from nltk.corpus import stopwords
from nltk import SnowballStemmer

%matplotlib inline
nltk.download('stopwords') #download the latest stopwords

In [ ]:
def load_reddit(fname, ls_subreddits=None, MIN_CHAR=30):
    """
    Loads Reddit comments from a JSON file based on
    whether they are in the selected subreddits and
    have more than MIN_CHAR characters.

    Parameters
    ----------
    fname: str
        path to a file with one JSON-encoded comment per line
    ls_subreddits: ls[str]
        list of subreddits to select from; if empty or None,
        comments from every subreddit are kept
    MIN_CHAR: int
        minimum number of characters necessary to select
        a comment

    Returns
    -------
    corpus: array[str]
        np.array of selected reddit comments
    subreddit_id: array[str]
        np.array with the subreddit name of each selected comment
    """
    corpus = []
    subreddit_id = []
    with open(fname, 'r') as infile:
        for line in infile:
            dict_reddit_post = ujson.loads(line)
            subreddit = dict_reddit_post['subreddit']
            n_characters = len(dict_reddit_post['body'])
            if ls_subreddits: #check that the list is not empty
                in_ls_subreddits = subreddit in ls_subreddits
            else: #no filter given: keep every subreddit
                in_ls_subreddits = True
            grter_than_min = n_characters > MIN_CHAR
            if grter_than_min and in_ls_subreddits:
                corpus.append(dict_reddit_post['body'])
                subreddit_id.append(subreddit)
    return np.array(corpus), np.array(subreddit_id)

In [ ]:
def plot_precision_recall_n(y_true, y_prob, model_name, fname=None):
    """
    Create a precision-recall curve.

    Parameters
    ----------
    y_true: ls
        ls of ground truth labels
    y_prob: ls
        ls of predicted probabilities from the model
    model_name: str
        str of model used
    fname: str
        filename to save figure

    Returns
    -------
    Plot of precision and recall.
    """
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import precision_recall_curve
    y_score = np.asarray(y_prob) #boolean indexing below needs an array
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1] #drop the last point, which has no threshold
    recall_curve = recall_curve[:-1] #drop the last point, which has no threshold
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    name = model_name
    #assumed ending: title the plot, save it if a filename was given, then show it
    plt.title(name)
    if fname is not None:
        plt.savefig(fname)
    plt.show()

Load Data

Data Source: Reddit Comments from May 2015 in JSON format

For the supervised learning portion of the tutorial we will attempt to classify whether Reddit comments have come from the SuicideWatch or the depression subreddit. These two subreddits should be somewhat similar, so this poses a non-trivial challenge for a classifier.

In [ ]:
%%bash
# %%bash is a magic function that runs this cell as a bash script inside a Jupyter notebook.
# unzip the data
gunzip ./data/RC_2015-05.json.gz

In [ ]:
#grab data from the following subreddits
ls_subreddits = ['SuicideWatch', 'depression']
[corpus, subreddit_id] = load_reddit('./data/RC_2015-05.json', ls_subreddits, MIN_CHAR=30)

Preprocess the data

In [ ]:
#matches are non-word characters and digits to be replaced with spaces.
RE_PREPROCESS = r'\W+|\d+'  
#get rid of punctuation and make everything lowercase
processed_corpus = np.array( [ re.sub(RE_PREPROCESS, ' ', comment).lower() for comment in corpus] )
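
To see what this regular expression does in practice, here is a small check on a made-up comment. The sample string is an assumption for illustration; the pattern is the same one used in the cell above.

```python
import re

RE_PREPROCESS = r'\W+|\d+'  # same pattern as in the preprocessing cell

# Made-up comment, used only to illustrate the substitution.
sample = "I can't sleep... it's 3am AGAIN!"
cleaned = re.sub(RE_PREPROCESS, ' ', sample).lower()

# split() collapses the runs of spaces left behind by the substitution
print(cleaned.split())
```

Note that contractions are broken apart ("can't" becomes "can" and "t") and that digits are dropped, so "3am" is reduced to "am".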

Supervised Learning: Identify the Subreddit Section

In this section we are going to train a classifier to tag each comment with the subreddit in which it originally appeared. First we split our data into a training and a testing set, using the first 80% of the data as the training set and the remaining 20% as the testing set.

In [ ]:
#split the data into training and testing sets. 
#refactor this in the test train-split
train_set_size = int(0.8*len(subreddit_id))
train_idx = np.arange(0,train_set_size)
test_idx = np.arange(train_set_size, len(subreddit_id))

train_subreddit_id = subreddit_id[train_idx]
train_corpus = processed_corpus[train_idx]

test_subreddit_id = subreddit_id[test_idx]
test_corpus = processed_corpus[test_idx]

print('Training Labels', Counter(subreddit_id[train_idx]))
print('Testing Labels', Counter((subreddit_id[test_idx])))

Tokenize and stem to create features

Now that we have the data and have done a bit of preprocessing, we want to create features. We create a vectorizer object that finds the frequency of words in each of the documents while taking the importance of each word into account. For example, the words "the" or "for" may appear often in a document but have very little semantic value. Conversely, a document may contain specialized, obscure words that do not occur anywhere else in the corpus. These cases are managed by setting thresholds for the minimum and maximum document frequency (DF).
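
The effect of the two document-frequency thresholds can be sketched in plain Python. This is a simplified stand-in for the filtering idea, not the vectorizer's actual implementation; the function name `filter_by_document_frequency` and the toy documents are assumptions made for illustration.

```python
from collections import Counter


def filter_by_document_frequency(tokenized_docs, min_df=0.01, max_df=0.8):
    """Return the sorted vocabulary whose document frequency, expressed as a
    fraction of all documents, lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc_tokens in tokenized_docs:
        doc_freq.update(set(doc_tokens))  # count each word once per document
    return sorted(
        word for word, count in doc_freq.items()
        if min_df <= count / float(n_docs) <= max_df
    )


# Made-up, already tokenized documents.
toy_docs = [
    ["ship", "boat", "captain"],
    ["ship", "sail", "captain"],
    ["ship", "anchor", "captain"],
    ["ship", "boat"],
    ["ship", "boat", "harbor"],
]
vocab = filter_by_document_frequency(toy_docs, min_df=0.4, max_df=0.8)
print(vocab)
```

Here "ship" appears in every document and is dropped as too common, while "sail", "anchor" and "harbor" each appear in only one document and are dropped as too rare.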

In [ ]:
#parameters for vectorizer 
ANALYZER = "word" #unit of features are single words rather than phrases of words 
STRIP_ACCENTS = 'unicode'
NGRAM_RANGE = (1,2) #range of phrase lengths in words, from single words up to two-word phrases
MIN_DF = 0.01 # exclude words that appear in a smaller fraction of documents than this threshold
MAX_DF = 0.8  # exclude words that appear in a larger fraction of documents than this threshold

vectorizer = CountVectorizer(analyzer=ANALYZER,
                            tokenizer=None, # alternatively tokenize_and_stem but it will be slower 
                            stop_words = stopwords.words('english'),
                            min_df = MIN_DF,
                            max_df = MAX_DF)

TFIDF (Term Frequency Inverse Document Frequency) transforms a count matrix (what we created above) into a TFIDF representation. Words that occur throughout the entire corpus are reweighted to a lower weight, because such words have empirically been found to be less informative.
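
The reweighting can be written out for a single word. The sketch below assumes the formulas given in the scikit-learn documentation for `TfidfTransformer`: with smoothing, idf = ln((1 + n) / (1 + df)) + 1, and with sublinear scaling the raw count tf is replaced by 1 + ln(tf). The helper name `tfidf_weight` is hypothetical, and no vector normalization is applied.

```python
import math


def tfidf_weight(tf, df, n_docs, smooth_idf=True, sublinear_tf=True):
    """TF-IDF weight of one term in one document, before any normalization.

    tf: count of the term in the document
    df: number of documents that contain the term
    n_docs: total number of documents
    """
    if tf == 0:
        return 0.0
    tf_part = 1.0 + math.log(tf) if sublinear_tf else float(tf)
    if smooth_idf:
        # adding 1 to numerator and denominator avoids division by zero
        idf = math.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    else:
        idf = math.log(float(n_docs) / df) + 1.0
    return tf_part * idf


# A word found in every document keeps only the baseline weight,
# while a word found in a single document is boosted.
common = tfidf_weight(tf=1, df=100, n_docs=100)
rare = tfidf_weight(tf=1, df=1, n_docs=100)
print(common, rare)
```

With 100 documents, a word that occurs once in every document gets weight 1, while a word that occurs once in a single document gets a weight close to 5.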

In [ ]:
NORM = None #None turns off normalization of the document vectors
SMOOTH_IDF = True #prevents division by zero errors
SUBLINEAR_TF = True #replace TF with 1 + log(TF)
USE_IDF = True #flag to control whether to use IDF reweighting

transformer = TfidfTransformer(norm = NORM, use_idf = USE_IDF, smooth_idf = SMOOTH_IDF, sublinear_tf = SUBLINEAR_TF)

In [ ]:
#get the bag-of-words from the vectorizer and
#then use TFIDF to down-weight the tokens found throughout the text 
start_time = time.time()
train_bag_of_words = vectorizer.fit_transform( train_corpus ) #fit on the training set only so nothing leaks from the test set
test_bag_of_words = vectorizer.transform( test_corpus )
train_tfidf = transformer.fit_transform(train_bag_of_words)
test_tfidf = transformer.transform(test_bag_of_words)
features = vectorizer.get_feature_names() #on scikit-learn 1.0 or later use get_feature_names_out()
print('Time Elapsed: {0:.2f}s'.format(time.time()-start_time))

In [ ]:
#relabel our labels as a 0 or 1
le = preprocessing.LabelEncoder() 
le.fit(subreddit_id) #learn the mapping from subreddit name to integer label
subreddit_id_binary = le.transform(subreddit_id)

In [ ]:
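#make clear what are the features and what are the labels 
clf = LogisticRegression(penalty='l1', solver='liblinear') #liblinear supports the L1 penalty
mdl = clf.fit(train_tfidf, subreddit_id_binary[train_idx]) #features, labels
y_score = mdl.predict_proba( test_tfidf )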

Evaluation of the Supervised Model

To evaluate how well our classifier has done, we find the AUC (Area Under the Curve) of the ROC curve and plot a precision-recall curve.
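
Both metrics can be computed by hand on a tiny example. The sketch below uses the interpretation of AUC as the fraction of (positive, negative) pairs that the scores order correctly, and computes precision and recall at one threshold. The function names and the toy labels and scores are assumptions for illustration; the tutorial itself uses scikit-learn's implementations.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall_at(y_true, y_score, threshold):
    """Precision and recall when every score >= threshold is predicted positive."""
    predicted = [s >= threshold for s in y_score]
    true_pos = sum(1 for y, p in zip(y_true, predicted) if p and y == 1)
    n_predicted = sum(predicted)
    n_actual = sum(1 for y in y_true if y == 1)
    # by convention precision is 1 when nothing is predicted positive
    precision = true_pos / float(n_predicted) if n_predicted else 1.0
    recall = true_pos / float(n_actual)
    return precision, recall


# Made-up labels and scores for illustration.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_precision, toy_recall = precision_recall_at(toy_labels, toy_scores, 0.35)
print(toy_auc, toy_precision, toy_recall)
```

Lowering the threshold raises recall at the cost of precision, which is the trade-off the precision-recall plot makes visible across all thresholds.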

In [ ]:
auc = roc_auc_score( subreddit_id_binary[test_idx], y_score[:,1])

In [ ]:
plot_precision_recall_n(subreddit_id_binary[test_idx], y_score[:,1], 'LR')

Feature Importances

Find the five words that are most predictive of each subreddit

In [ ]:
coef = mdl.coef_.ravel()

dict_feature_importances = dict( zip(features, coef) )
orddict_feature_importances = OrderedDict( 
                                sorted(dict_feature_importances.items(), key=lambda x: x[1]) )

ls_sorted_features  = list(orddict_feature_importances.keys())

num_features = 5
subreddit0_features = ls_sorted_features[:num_features] #SuicideWatch
subreddit1_features = ls_sorted_features[-num_features:] #depression
print('SuicideWatch: ',subreddit0_features)
print('depression: ', subreddit1_features)
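
The sign convention behind this ranking can be checked on a toy example: in a binary logistic regression, the most negative coefficients push a prediction towards label 0 and the most positive ones towards label 1. The feature names and coefficient values below are made up for illustration and do not come from the fitted model.

```python
# Made-up feature names and coefficients for illustration.
toy_features = ["boat", "goal", "ship", "league", "captain", "coach"]
toy_coef = [-0.2, 1.1, -1.4, 0.9, -0.8, 0.3]

# Sort (feature, coefficient) pairs from most negative to most positive.
ranked = sorted(zip(toy_features, toy_coef), key=lambda pair: pair[1])

top_n = 2
class0_words = [name for name, _ in ranked[:top_n]]   # push towards label 0
class1_words = [name for name, _ in ranked[-top_n:]]  # push towards label 1
print(class0_words, class1_words)
```

Coefficients close to zero, such as the one for "boat" here, say little about either class, and with an L1 penalty many coefficients are exactly zero.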

See the predictions and how well they match up

In [ ]:
#sort the test comments by their predicted probability of belonging to depression
num_comments = 5
subreddit0_comment_idx = y_score[:,1].argsort()[:num_comments] #SuicideWatch
subreddit1_comment_idx = y_score[:,1].argsort()[-num_comments:] #depression

#combine both groups; these indices refer to positions in the testing set
top_comments_testing_set_idx = np.concatenate([subreddit0_comment_idx, 
                                               subreddit1_comment_idx])

In [ ]:
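#these are the comments the model is most sure of, num_comments for each subreddit
for i in top_comments_testing_set_idx:
    #assumed loop body: show the true subreddit, the predicted probability and the text
    print('{}: {:.3f}'.format(test_subreddit_id[i], y_score[i,1]))
    print(test_corpus[i])
    print('---')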

The predicted probability refers to the probability of a comment belonging to the depression subreddit. Therefore, comments belonging to the SuicideWatch subreddit will have a low probability. As can be seen from the comments with the most confident predictions for SuicideWatch and depression, the classifier does a good job. Note that the last entry among the depression comments is miscategorized. This is most likely due to the comment referring to depression and treatment.

Topic Modeling: Unsupervised Learning

In this portion of the tutorial we will be extracting topics, in the form of commonly co-occurring words, from the corpus of data.

In [ ]:

Load All the reddit Data

In [ ]:
start = time.time()
corpus_all, subreddit_id_all = load_reddit('./data/RC_2015-05.json',MIN_CHAR=250)
end = time.time()
print('Loading takes {0:.2f}s'.format(end-start))

Preprocess the data

In [ ]:
# Get rid of punctuation and set to lowercase  
start = time.time()
processed_corpus_all = [ re.sub( RE_PREPROCESS, ' ', comment).lower() for comment in corpus_all]

#tokenize the words
bag_of_words_all = vectorizer.fit_transform( processed_corpus_all ) 
end = time.time() 
#grab the features/vocabulary
features_all = vectorizer.get_feature_names() #on scikit-learn 1.0 or later use get_feature_names_out()
print("Processing took {}s".format(end - start))

In [ ]:
print(Counter(subreddit_id_all), len(subreddit_id_all))

Create topics

To create our topics we will use the LatentDirichletAllocation algorithm.

In [ ]:
start = time.time()
lda = LatentDirichletAllocation( n_components = N_TOPICS ) #this argument was called n_topics before scikit-learn 0.19
doctopic = lda.fit_transform( bag_of_words_all )
end = time.time() 
print("Processing took {}s".format(end- start)) # takes ~72s for 1 file, ~445s for 5 files

Display the top ten words for each topic

In [ ]:
ls_keywords = []
for i,topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:N_TOP_WORDS]
    keywords = ', '.join( features_all[j] for j in word_idx)
    ls_keywords.append(keywords) #keep the keywords of each topic for later use
    print(i, keywords)

First 50 comments with their majority topic

In [ ]:
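num_comments = 50
for comment_id in range(num_comments):
    topic_id = np.argsort(doctopic[comment_id])[::-1][0]
    #assumed loop body: show the majority topic next to the comment
    print(topic_id, processed_corpus_all[comment_id])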
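
The two lookups used in this section, the top words of a topic and the majority topic of a document, can be sketched without any libraries. All numbers below are made up, and the helper names `top_words` and `majority_topic` are hypothetical; in the tutorial the corresponding values come from `lda.components_` and `doctopic`.

```python
def top_words(topic_weights, vocabulary, n_top):
    """Words with the largest weights in one topic, highest first."""
    order = sorted(range(len(topic_weights)),
                   key=lambda j: topic_weights[j], reverse=True)
    return [vocabulary[j] for j in order[:n_top]]


def majority_topic(doc_topic_row):
    """Index of the topic with the largest weight for one document."""
    return max(range(len(doc_topic_row)), key=lambda k: doc_topic_row[k])


# Made-up numbers: 2 topics over a 5-word vocabulary, and 2 documents.
toy_vocab = ["boat", "ship", "captain", "goal", "league"]
toy_topic_word = [
    [9.0, 7.5, 6.0, 0.2, 0.1],  # topic 0
    [0.3, 0.1, 0.4, 8.0, 5.5],  # topic 1
]
toy_doc_topic = [
    [0.9, 0.1],  # document 0 is mostly topic 0
    [0.2, 0.8],  # document 1 is mostly topic 1
]
for doc_id, row in enumerate(toy_doc_topic):
    k = majority_topic(row)
    print(doc_id, k, top_words(toy_topic_word[k], toy_vocab, 3))
```

A document usually has some weight on several topics, so the majority topic is only a summary of its full topic distribution.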

Further Resources
