Text Analysis


Introduction

Text Analysis is used for summarizing or extracting useful information from large amounts of unstructured text stored in documents. This opens up the opportunity of using text data alongside more conventional data sources (e.g., surveys and administrative data). The goal of text analysis is to take a large corpus of complex, unstructured text data and extract important and meaningful messages in a comprehensible way.

Text Analysis can help with the following tasks:

  • Information retrieval: Help find relevant information in large databases, such as in a systematic literature review.

  • Clustering and text categorization: Techniques like topic modeling can summarize a large corpus of text by finding the most important phrases.

  • Text Summarization: Create category-sensitive text summaries of a large corpus of text.

  • Machine Translation: Translate from one language to another.

In this tutorial, we are going to analyze job advertisements from 2010-2015 using topic modeling to examine the content of our data and document classification to tag the type of job in the advertisement. First we will go over how to transform our data into a matrix that can be read in by an algorithm.

Glossary of Terms

  • Corpus: A corpus of documents is the set of all documents in the dataset.

  • Tokenize: Tokenization is the process by which text is separated into meaningful terms or phrases. In English this is fairly trivial, as words are separated by whitespace. (A short illustrative sketch of tokenization, stop-word removal, and stemming follows this glossary.)

  • Stemming: Stemming is a type of text normalization where words that have different forms but the same essential meaning are normalized to a common base form, such as the original dictionary form of the word. For example, "go," "went," and "goes" all share the lemma "go."

  • TFIDF: TFIDF (term frequency-inverse document frequency) is an example of feature engineering where the most important words are extracted by taking into account their frequency both in individual documents and in the entire corpus of documents as a whole.

  • Topic Modeling: Topic modeling is an unsupervised learning method where groups of co-occurring words are clustered into topics. Typically, the words in a cluster should be related and make sense (e.g., boat, ship, captain). Individual documents can then fall into multiple topics.

  • LDA: LDA (latent Dirichlet allocation) is a type of probabilistic model commonly used for topic modelling.

  • Stop Words: Stop words are words that have little semantic meaning like prepositions, articles and common nouns. They can often be ignored.
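
A minimal illustrative sketch (our addition, not part of the original tutorial) of tokenization, stop-word removal, and stemming on a single made-up sentence. It uses a tiny hard-coded stop-word list and NLTK's PorterStemmer so it runs on its own; the rest of the tutorial performs the same steps on the real corpus with CountVectorizer.


In [ ]:
from nltk import PorterStemmer

sentence = "The engineers went to the engineering meeting"
tokens = sentence.lower().split()                        #tokenize on whitespace
tiny_stopwords = {"the", "to", "a", "of"}                #tiny illustrative stop-word list
tokens = [t for t in tokens if t not in tiny_stopwords]  #remove stop words
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])                 #roughly ['engin', 'went', 'engin', 'meet']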


In [ ]:
%pylab inline 
from __future__ import print_function

import nltk
import ujson
import re
import time
import progressbar

import pandas as pd
from six.moves import zip, range 

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, auc
from sklearn import preprocessing
from collections import Counter, OrderedDict
from nltk.corpus import stopwords
from nltk import PorterStemmer

nltk.download('stopwords') #download the latest stopwords

Load the Data

Our dataset for this tutorial is a subset of job-ad data from 2010-2015 compiled by the Commonwealth of Virginia. The full data, and how this subset was created, can be found in the data folder of this tutorial.


In [ ]:
df_jobs_data = pd.read_csv('./data/jobs_subset.csv')

Explore the Data


In [ ]:
df_jobs_data.head()

Our table has 4 fields: normalizedTitle_onetName, normalizedTitle_onetCode, jobDescription, and title.

Onet (the Occupational Information Network) is an online database that contains hundreds of occupational definitions. https://en.wikipedia.org/wiki/Occupational_Information_Network

The normalizedTitle_onetName and normalizedTitle_onetCode fields are derived from the Onet database. We will use the names in the document-tagging portion of the tutorial. The jobDescription field contains the actual text of the job advertisement, and the title is derived from the jobDescription.

How many unique job titles are in this dataset?


In [ ]:
df_jobs_data.normalizedTitle_onetName.unique()

In [ ]:
df_jobs_data.title.unique()

In [ ]:
df_jobs_data.title.unique().shape

There are 5 unique categories of jobs under the Onet classification. There are too many unique job titles in the title field to display them all, but the shape of the array of unique titles shows there are 2496 of them.

Each job description contains a great deal of information in unstructured text. We can use text analysis to find overarching concepts in our corpus. This will allow us to discover the most important words and phrases in the job descriptions and give us a big-picture view of the content of our collection.

Topic Modeling

We are going to apply topic modeling, an unsupervised learning method, to find the high-level topics in our corpus as a first pass at exploring our data. As we apply topic modeling we will discuss ways of cleaning and preprocessing our data to get the best results.

Topic modeling is a broad subfield of machine learning and natural language processing. We are going to focus on one approach, Latent Dirichlet allocation (LDA). LDA is a fully Bayesian extension of probabilistic latent semantic indexing, itself a probabilistic extension of latent semantic analysis.

In topic modeling we first assume that topics exist in the corpus and that a small number of them can explain it. Topics, in this case, are ranked lists of words from our corpus, with the highest-probability words at the top. A single document can be explained by multiple topics: for instance, an article on net neutrality has to do with both technology and politics. The set of topics used by a document is known as the document's allocation, hence the name latent Dirichlet allocation: each document has an allocation of latent topics drawn from a Dirichlet distribution.
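
To make the idea of an allocation concrete, here is a small made-up document-topic matrix (our own illustration, not output from this dataset): each row is a document, each column is a topic, and each row sums to 1.


In [ ]:
import numpy as np

#a hypothetical document-topic allocation for three documents and three topics
doctopic_example = np.array([[0.85, 0.10, 0.05],   #mostly topic 0
                             [0.05, 0.90, 0.05],   #mostly topic 1
                             [0.45, 0.50, 0.05]])  #e.g., a net-neutrality article split between technology and politics
print(doctopic_example.sum(axis=1))    #each allocation sums to 1
print(doctopic_example.argmax(axis=1)) #the dominant topic for each document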

Processing text data

The first important step in working with text data is cleaning and processing it. This includes, but is not limited to, forming a corpus of text, tokenization, removing stop words, finding collocated words (N-grams), and stemming and lemmatization. Each of these steps will be discussed below. The ultimate goal is to transform our text data into a form an algorithm can work with: a sequence of symbols cannot be fed directly into an algorithm, because algorithms expect numerical feature vectors of fixed size rather than raw documents of variable length. We will transform our text corpus into a bag of n-grams for further analysis. In this form our text data is represented as a matrix where each row refers to a specific job description (document) and each column is the occurrence of a word (feature).

Bag of n-gram representation example

Ultimately, we want to take our collection of documents (our corpus) and convert it into a matrix. Fortunately, sklearn has a pre-built object, CountVectorizer, that can tokenize, eliminate stop words, identify n-grams, and stem our corpus, outputting a matrix in one step. Before we apply the vectorizer to our own corpus we are going to apply it to a toy example so we can understand the output and how a bag of words is represented.


In [ ]:
def create_bag_of_words(corpus,
                       NGRAM_RANGE=(0,1),
                       stop_words = None,
                        stem = False,
                       MIN_DF = 0.05,
                       MAX_DF = 0.95,
                       USE_IDF=False):
    """
    Turn a corpus of text into a bag-of-words.
    
    Parameters
    -----------
    corpus: ls
        list of documents in corpus
    NGRAM_RANGE: tuple
        range of N-grams, default (0,1)
    stop_words: ls
        list of commonly occurring words that have little semantic
        value
    stem: bool
        use a stemmer to stem words
    MIN_DF: float
        exclude words that have a frequency less than the threshold
    MAX_DF: float
        exclude words that have a frequency greater than the threshold
    USE_IDF: bool
        reweight the raw counts using TFIDF
    
    
    Returns
    -------
    bag_of_words: scipy sparse matrix
        scipy sparse matrix of text
    features:
        ls of words
    """
    #parameters for vectorizer 
    ANALYZER = "word" #unit of features are single words rather than phrases of words 
    STRIP_ACCENTS = 'unicode'
     
    if stem:
        tokenize = lambda x: [stemmer.stem(i) for i in x.split()]
    else:
        tokenize = None
    vectorizer = CountVectorizer(analyzer=ANALYZER,
                                tokenizer=tokenize, 
                                ngram_range=NGRAM_RANGE,
                                stop_words = stop_words,
                                strip_accents=STRIP_ACCENTS,
                                min_df = MIN_DF,
                                max_df = MAX_DF)
    
    bag_of_words = vectorizer.fit_transform( corpus ) #transform our corpus into a bag of words 
    features = vectorizer.get_feature_names()

    if USE_IDF:
        NORM = None #do not normalize the TFIDF vectors
        SMOOTH_IDF = True #prevents division-by-zero errors
        SUBLINEAR_IDF = True #replace TF with 1 + log(TF)
        transformer = TfidfTransformer(norm = NORM, smooth_idf = SMOOTH_IDF, sublinear_tf = SUBLINEAR_IDF)
        #get the bag-of-words from the vectorizer and
        #then use TFIDF to limit the tokens found throughout the text 
        tfidf = transformer.fit_transform(bag_of_words)
        
        return tfidf, features
    else:
        return bag_of_words, features

In [ ]:
toy_corpus = ['this is document one', 'this is document two', 'text analysis on documents is fun']

In [ ]:
toy_bag_of_words, toy_features = create_bag_of_words(toy_corpus)

The CountVectorizer outputs a matrix, in this case a sparse matrix: a matrix that has far more 0s than non-zero entries. To save space, scipy has special data structures that store only the non-zero entries rather than saving many, many 0s.
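
As a quick check (our addition), we can print the sparse matrix directly: scipy keeps only the non-zero entries, stored as (row, column) coordinates paired with their values.


In [ ]:
print(type(toy_bag_of_words))  #a scipy sparse matrix
print(toy_bag_of_words)        #only the non-zero entries, as (row, col)  value triples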


In [ ]:
toy_corpus

In [ ]:
np_bag_of_words = toy_bag_of_words.toarray()
np_bag_of_words

In [ ]:
toy_features

Our data has been transformed into a 3x9 matrix where each row corresponds to a document and each column corresponds to a feature. A 1 indicates the feature (word) is present in the document; a 0 indicates it is not. Our toy corpus is now ready to be analyzed. We illustrated the bag of n-grams with a toy example because the matrix for our much larger corpus of job ads would be far bigger and harder to interpret.

Word counts

As an initial look at the data we can examine the top few words in our corpus. We can sum the columns of the bag_of_words and then convert the result to a numpy array. From there we can zip the features and word counts into a dictionary and display the results.


In [ ]:
def get_word_counts(bag_of_words, feature_names):
    """
    Get the ordered word counts from a bag_of_words
    
    Parameters
    ----------
    bag_of_words: obj
        scipy sparse matrix from CountVectorizer
    feature_names: ls
        list of words
        
    Returns
    -------
    word_counts: dict
        Dictionary of word counts
    """
    np_bag_of_words = bag_of_words.toarray()
    word_count = np.sum(np_bag_of_words,axis=0)
    np_word_count = np.asarray(word_count).ravel()
    dict_word_counts = dict( zip(feature_names, np_word_count) )
    
    orddict_word_counts = OrderedDict( 
                                    sorted(dict_word_counts.items(), key=lambda x: x[1], reverse=True), )
    
    return orddict_word_counts

In [ ]:
get_word_counts(toy_bag_of_words, toy_features)

Text Corpora

First we need to form our corpus: a set of multiple similar documents. In our case, the corpus is the set of all job descriptions. We can pull the job descriptions out of the data frame as the underlying numpy array using the .values attribute.


In [ ]:
corpus = df_jobs_data['jobDescription'].values #pull all the jobDescriptions and put them in a numpy array 
corpus

In [ ]:
def create_topics(tfidf, features, N_TOPICS=3, N_TOP_WORDS=5,):
    """
    Given a matrix of features of text data generate topics
    
    Parameters
    -----------
    tfidf: scipy sparse matrix
        sparse matrix of text features
    features: ls
        list of feature names (words)
    N_TOPICS: int
        number of topics (default 3)
    N_TOP_WORDS: int
        number of top words to display in each topic (default 5)
        
    Returns
    -------
    ls_keywords: ls
        list of keywords for each topic
    doctopic: array
        numpy array of topic proportions for each document
    """
    
    with progressbar.ProgressBar(max_value=progressbar.UnknownLength) as bar:
        i=0
        lda = LatentDirichletAllocation( n_topics= N_TOPICS,
                                       learning_method='online') #create an object that will find N_TOPICS topics
        bar.update(i)
        i+=1
        doctopic = lda.fit_transform( tfidf )
        bar.update(i)
        i+=1
        
        ls_keywords = []
        for i,topic in enumerate(lda.components_):
            word_idx = np.argsort(topic)[::-1][:N_TOP_WORDS]
            keywords = ', '.join( features[i] for i in word_idx)
            ls_keywords.append(keywords)
            print(i, keywords)
            bar.update(i)
            i+=1
            
    return ls_keywords, doctopic

In [ ]:
corpus_bag_of_words, corpus_features = create_bag_of_words(corpus)

Let's examine our features.


In [ ]:
corpus_features

The first thing that should stand out in the feature list is that the first few entries are numbers with no real semantic meaning. There are also other low-value words, such as prepositions and articles, that carry no semantic meaning, and words like ability and abilities or accuracy and accurate that are quite similar and mean essentially the same thing. We should clean these types of words out of our corpus, as they just add noise to our analysis. Nevertheless, let's try creating topics first and see the quality of the results.


In [ ]:
get_word_counts(corpus_bag_of_words, corpus_features)

Our top words are articles, prepositions and conjunctions that tell us nothing about our corpus. Let's march on and create topics anyway.


In [ ]:
ls_corpus_keywords, corpus_doctopic = create_topics(corpus_bag_of_words, corpus_features)

Looking at these topics we have no real sense of what is in our corpus, except that some of the job ads are written in Spanish. The problem is that the top words in the topics are conjunctions and prepositions that carry no semantic information. We have to clean and process our data to get more meaningful information.

Text Cleaning and Normalization

To clean and normalize the text we will remove all special characters, numbers, and punctuation. Then we will make all the text lowercase so that words like "the" and "The" are counted as the same in our analysis. To remove the special characters, numbers and punctuation we will use regular expressions.

Regular Expressions

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -- Jamie Zawinski

Regular expressions, or regexes, match text in a document based on a set of rules and syntax. The name "regular expressions" comes from the mathematical theory they are based on. Regexes are useful for pulling specific pieces of information out of a large amount of text (e.g., email addresses, HTML tags, credit card numbers), and they can often match text much more quickly than plain-text searching, reducing development time. Some regular expressions can become quite complicated, at which point it may be better to write ordinary Python code instead. Keep in mind that there is a trade-off between optimization and understandability. In Python, the general philosophy is that code should be as understandable by people as possible, so you should tend toward the understandable side of things rather than over-optimizing. Your future self, code reviewers, people who inherit your code, and anyone else who has to make sense of it will appreciate it.

For our purposes we are going to use a regular expression to match all characters that are not letters -- punctuation, quotes, special characters and numbers -- replace them with spaces, and then make all the remaining characters lowercase.

A full tutorial on regular expressions is outside the scope of this tutorial; there are many good tutorials online. There is also a great interactive tool for developing and checking regular expressions, regex101.com.

We will be using the re library in Python for regular expression matching.


In [ ]:
#get rid of the punctuation and set all characters to lowercase
RE_PREPROCESS = r'\W+|\d+' #the regular expression that matches runs of non-word characters or digits

#get rid of punctuation and make everything lowercase
#the code below works by looping through the array of text
#for a given piece of text we invoke the `re.sub` command, passing in the regular expression
#and a space ' ' to substitute for all the matching characters
#we then invoke the `lower()` method on the output of the re.sub command
#to make all the remaining characters lowercase
#each cleaned document is then stored in a list
#once this list has been filled it is stored in a numpy array

processed_corpus = np.array( [ re.sub(RE_PREPROCESS, ' ', comment).lower() for comment in corpus] )

First description before cleaning:


In [ ]:
corpus[0]

First description after cleaning:


In [ ]:
processed_corpus[0]

Everything is lowercase, and all numbers and special characters have been removed. Our text is now normalized.

Tokenization

Now that we have cleaned our text we can tokenize it by deciding which terms and phrases are the most meaningful. In this case we want to split our text into individual words. Our words are separated by spaces, so as an example we can use the .split() method to turn a document into a list of words, splitting on whitespace. Normally the CountVectorizer handles this for us.


In [ ]:
tokens = processed_corpus[0].split()

In [ ]:
tokens

Stopwords

Stopwords are words that have very little semantic meaning and are found throughout a text. Having the word the or of will tell us nothing about our corpus, nor will they be meaningful features. Examples of stopwords are prepositions, articles and common nouns. To process the corpus even further we can eliminate these stopwords by checking whether they are in a list of commonly occurring stopwords.


In [ ]:
eng_stopwords =  stopwords.words('english')

In [ ]:
#sample of stopwords
eng_stopwords[::10]

In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,stop_words=eng_stopwords)
dict_processed_word_counts = get_word_counts(processed_bag_of_words, processed_features)
dict_processed_word_counts

Much better! Now let's see how this affects the topics that are produced. The top 20 words, though, are likely to appear in nearly all of the job ads, so let's add them to the stopwords and remove them as well.


In [ ]:
top_20_words = list(dict_processed_word_counts.keys())[:20]
domain_specific_stopwords = eng_stopwords + top_20_words
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords)

In [ ]:
dict_processed_word_counts = get_word_counts(processed_bag_of_words, processed_features)
dict_processed_word_counts

This is a bit better. Let's see what topics we produce.


In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features)

Now we are starting to get somewhere! There are a lot of jobs that have to do with law, engineering and medicine. We should increase the number of topics and the number of words per topic to see if we can understand more about our corpus.


In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                      N_TOPICS = 5,
                                                      N_TOP_WORDS= 10)

Adding more topics has revealed additional subtopics. Let's see if using 10 topics will tell us more.


In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                      N_TOPICS = 10,
                                                      N_TOP_WORDS= 15)

It looks like we have a good number of topics. Some of the top words are quite similar, such as engineering and engineer. We can reduce such words to their stems to further refine our features.

Stemming and Lemmatization

We can further process our text through stemming and lemmatization. Words can take on multiple forms with little change to their meaning. For example, "systems", "systematic" and "system" are all different words, but they all carry essentially the same meaning. We can replace all of these words with system without losing much meaning. The lemma is the original dictionary form of a word (e.g. lying and lie). There are several well-known stemming algorithms -- Porter, Snowball, Lancaster -- each with its own strengths and weaknesses. For this tutorial we are using the Porter stemmer.


In [ ]:
stemmer = PorterStemmer()
print(stemmer.stem('lies'))
print(stemmer.stem("lying"))
print(stemmer.stem('systematic'))
print(stemmer.stem("running"))

In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True)
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                      N_TOPICS = 10,
                                                      N_TOP_WORDS= 15)

Now it appears we have picked up some extra topics that describe the educational requirements of a job ad or its equal-opportunity clause.

N-grams

Individual words are not always the correct unit of analysis. Prematurely removing stopwords can mean losing phrases such as "kick the bucket", "commander in chief", or "sleeps with the fishes". Identifying these N-grams requires looking for patterns of words that often appear together in fixed patterns.
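
As a quick illustration (our addition, using a made-up sentence), CountVectorizer can extract bi-grams directly by setting ngram_range=(2, 2); the features it produces are pairs of adjacent words rather than single words.


In [ ]:
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_vectorizer.fit(['the commander in chief kicked the bucket'])
bigram_vectorizer.get_feature_names()  #bi-grams such as 'commander in', 'in chief', 'kicked the'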

Now let's transform our corpus into a bag of n-grams which in this case is a bag of bi-grams or bag of 2-grams.


In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True,
                                                                 NGRAM_RANGE=(0,2))
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                      N_TOPICS = 10,
                                                      N_TOP_WORDS= 15)

Notice one of the top words in one of the topics is "northrop grumman", a bi-gram!

TFIDF (Term Frequency Inverse Document Frequency)

A final step in cleaning and processing our text data is TFIDF (term frequency-inverse document frequency). TFIDF is an example of feature engineering where the most important words are extracted by taking into account their frequency in individual documents and in the entire corpus of documents as a whole. Words that appear in all documents are deemphasized while more meaningful words are emphasized.
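
Before applying TFIDF to the full corpus, here is a small sketch (our addition) of the reweighting on the toy corpus from earlier, assuming toy_bag_of_words and toy_features are still in memory. Terms that appear in more of the documents tend to receive lower weights.


In [ ]:
toy_transformer = TfidfTransformer(smooth_idf=True)
toy_tfidf = toy_transformer.fit_transform(toy_bag_of_words)
print(toy_features)
print(toy_tfidf.toarray().round(2))  #terms shared across documents are down-weighted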


In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True,
                                                                 NGRAM_RANGE=(0,2),
                                                                 USE_IDF = True)

In [ ]:
dict_word_counts = get_word_counts(processed_bag_of_words,
                   processed_features)

In [ ]:
dict_word_counts

The word counts have been reweighted to emphasize the more meaningful words of the corpus while deemphasizing those that are found throughout the corpus.


In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words, 
                                                       processed_features, 
                                                      N_TOPICS = 10,
                                                      N_TOP_WORDS= 15)

In [ ]:
#grab the topic_id of the majority topic for each document and store it in a list
ls_topic_id = [np.argsort(processed_doctopic[comment_id])[::-1][0] for comment_id in range(len(corpus))]
df_jobs_data['topic_id'] = ls_topic_id #add to the dataframe so we can compare with the job titles

Now that each row is tagged with a topic id, let's see how well the topics explain the job advertisements.


In [ ]:
topic_num = 0
print(processed_keywords[topic_num])
df_jobs_data[ df_jobs_data.topic_id == topic_num ].head(10)

Supervised Learning: Document Classification

Now we turn our attention to supervised learning. Previously, using topic modeling, we were inferring relationships within the data. In supervised learning, we produce a label, y, given some data x. In order to produce labels we first need examples our algorithm can learn from: a training set. Developing a training set can be very expensive, as it can require a large amount of human labor or linguistic expertise. Document classification is the case where our x are documents and our y is what each document is (e.g., the title of a job position). A common example of document classification is spam detection in email. In sentiment analysis, x is our documents and y is the state of the author; this can range from an author being happy or unhappy with a product to the author being politically conservative or liberal. There is also part-of-speech tagging, where x are individual words and y is the part of speech.

In this section we are going to train a classifier to classify job titles using our jobs dataset.

Load the Data


In [ ]:
df_train = pd.read_csv('./data/train_corpus_document_tagging.csv')
df_test = pd.read_csv('./data/test_corpus_document_tagging.csv')

In [ ]:
df_train.head()

In [ ]:
df_train['normalizedTitle_onetName'].unique()

In [ ]:
Counter(df_train['normalizedTitle_onetName'].values)

In [ ]:
df_test.head()

In [ ]:
df_test['normalizedTitle_onetName'].unique()

In [ ]:
Counter(df_test['normalizedTitle_onetName'].values)

Our data is job advertisements for credit analysts and financial examiners.

Process our Data

In order to feed our data into a classifier we need to pull out the labels (our y's) and a clean corpus of documents (our x's) for the training and testing sets.


In [ ]:
train_labels = df_train.normalizedTitle_onetName.values
train_corpus = np.array( [re.sub(RE_PREPROCESS, ' ', text).lower() for text in df_train.jobDescription.values])
test_labels = df_test.normalizedTitle_onetName.values
test_corpus = np.array( [re.sub(RE_PREPROCESS, ' ', text).lower() for text in df_test.jobDescription.values])
labels = np.append(train_labels, test_labels)

Just as we did in the unsupervised learning section, we have to transform our data. This time we have to transform our testing and training sets into two different bags-of-words. The classifier will learn from the training set, and we will evaluate its performance on the testing set.


In [ ]:
#parameters for vectorizer 
ANALYZER = "word" #unit of features are single words rather than phrases of words 
STRIP_ACCENTS = 'unicode'
TOKENIZER = None
NGRAM_RANGE = (0,2) #Range for phrases of words
MIN_DF = 0.01 # Exclude words that have a frequency less than the threshold
MAX_DF = 0.8  # Exclude words that have a frequency greater than the threshold 

vectorizer = CountVectorizer(analyzer=ANALYZER,
                            tokenizer=None, # alternatively tokenize_and_stem but it will be slower 
                            ngram_range=NGRAM_RANGE,
                            stop_words = stopwords.words('english'),
                            strip_accents=STRIP_ACCENTS,
                            min_df = MIN_DF,
                            max_df = MAX_DF)

In [ ]:
NORM = None #do not normalize the TFIDF vectors
SMOOTH_IDF = True #prevents division-by-zero errors
SUBLINEAR_IDF = True #replace TF with 1 + log(TF)
USE_IDF = True #flag to control whether to use TFIDF

transformer = TfidfTransformer(norm = NORM, smooth_idf = SMOOTH_IDF, sublinear_tf = SUBLINEAR_IDF)

#get the bag-of-words from the vectorizer and
#then use TFIDF to limit the tokens found throughout the text 
start_time = time.time()
train_bag_of_words = vectorizer.fit_transform( train_corpus ) #fit the vectorizer on the training corpus only
test_bag_of_words = vectorizer.transform( test_corpus )
if USE_IDF:
    train_tfidf = transformer.fit_transform(train_bag_of_words)
    test_tfidf = transformer.transform(test_bag_of_words)
features = vectorizer.get_feature_names()
print('Time Elapsed: {0:.2f}s'.format(
        time.time()-start_time))

We cannot pass the labels "Credit Analyst" or "Financial Examiner" into the classifier directly. Instead we need to encode them as 0s and 1s using the LabelEncoder from sklearn.


In [ ]:
#relabel our labels as a 0 or 1
le = preprocessing.LabelEncoder() 
le.fit(labels)
labels_binary = le.transform(labels)

We also need to create arrays of indices so we can access the training and testing sets accordingly.


In [ ]:
train_size = df_train.shape[0]
train_set_idx = np.arange(0,train_size)
test_set_idx = np.arange(train_size, len(labels))
train_labels_binary = labels_binary[train_set_idx]
test_labels_binary = labels_binary[test_set_idx]

The classifier we are using in this example is LogisticRegression. As we saw in the Machine Learning tutorial, we first decide on a classifier, then we fit the classifier to create a model. We can then test our model on the testing set by passing in its features. The model will output the probability of each document being classified as a Credit Analyst or Financial Examiner ad.


In [ ]:
clf = LogisticRegression(penalty='l1')
mdl = clf.fit(train_tfidf, labels_binary[train_set_idx]) #train the classifer to get the model
y_score = mdl.predict_proba( test_tfidf ) #score of the document being an ad for a Credit Analyst or Financial Examiner

Evaluation


In [ ]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: ls
        ls of ground truth labels
    y_prob: ls
        ls of predicted probabilities from the model
    model_name: str
        str of model name (e.g, LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    
    name = model_name
    plt.title(name)
    plt.show()

In [ ]:
plot_precision_recall_n(labels_binary[test_set_idx], y_score[:,1], 'LR')

If we examine our precision-recall curve we can see that precision is 1 and recall is 0.8 up to about 40 percent of the population. Unlike the previous example, where we used a precision-at-k curve to prioritize our resources, here we can use the precision-at-k curve to see which parts of the corpus can be tagged by the classifier and which should undergo a manual clerical review. Based on this we can decide which documents should be manually tagged by a person during a clerical review -- say, the portion of the population above 40%.
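
As a small sketch of that decision (our addition, with a hypothetical score cutoff of 0.8 rather than a tuned value), we can check what fraction of the test set would be tagged automatically at a given cutoff and how precise those automatic tags would be.


In [ ]:
cutoff = 0.8  #hypothetical score cutoff, not tuned
auto_tag_mask = y_score[:, 1] >= cutoff
print('fraction tagged automatically: {0:.2f}'.format(auto_tag_mask.mean()))
print('precision among auto-tagged: {0:.2f}'.format(
        (labels_binary[test_set_idx][auto_tag_mask] == 1).mean()))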

Alternatively, we can try to maximize the entire precision-recall space. In this case we need a different metric.


In [ ]:
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    
    Parameters
    ----------
    y_true: ls
        ground truth labels
    y_score: ls
        score output from model
    """
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score[:,1])
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:.2f}'.format(auc_val))
    plt.show()
    plt.clf()

In [ ]:
plot_precision_recall(labels_binary[test_set_idx],y_score)

If we look at the area under the curve, 0.96, we see we have a very good classifier. The AUC shows how accurate our scores are under different cut-off thresholds. As you may recall from the Machine Learning tutorial, the model outputs a score, and we then set a cutoff to bin each score as a 0 or 1. The closer our scores are to the true values, the more resilient they are to different cutoffs; for instance, if our scores were perfect our AUC would be 1.
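
roc_auc_score was imported at the top of the notebook but has not been used yet; as a complementary check (our addition, and a different metric from the AUC-PR above) we can also compute the ROC AUC of the scores on the test set.


In [ ]:
roc_auc_score(labels_binary[test_set_idx], y_score[:, 1])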

Feature Importances


In [ ]:
def display_feature_importances(coef,features, labels, num_features=10):
    """
    output feature importances
    
    Parameters
    ----------
    coef: numpy
        feature importances
    features: ls 
        feature names
    labels: ls
        labels for the classifier
    num_features: int
        number of features to output (default 10)
    
    """

    dict_feature_importances = dict( zip(features, coef) )
    orddict_feature_importances = OrderedDict( 
                                    sorted(dict_feature_importances.items(), key=lambda x: x[1]) )

    ls_sorted_features  = list(orddict_feature_importances.keys())

    label0_features = ls_sorted_features[:num_features] 
    label1_features = ls_sorted_features[-num_features:] 

    print(labels[0],label0_features)
    print(labels[1], label1_features)

In [ ]:
display_feature_importances(mdl.coef_.ravel(), features, ['Credit Analysts','Financial Examiner'])

The feature importances tell us which words are most relevant for predicting the type of job ad. We would expect words like credit, customer and candidate to be found in ads for a Credit Analyst, while words like review officer and compliance would be found in ads for a Financial Examiner.

Cross-validation

Recall from the machine learning tutorial that we are seeking to find the most general pattern in the data in order to have the most general model, one that will be successful at classifying new, unseen data. Our strategy above was an out-of-sample holdout set: we try to find a general pattern by randomly dividing our data into a training and test set based on some percentage split (e.g., 50-50 or 80-20). We train on the training set and evaluate on the test set, where we pretend the test set is unseen data. A significant drawback of this approach is that we may be lucky or unlucky with our random split. A possible solution is to create many random splits into training and testing sets and evaluate each split to estimate the performance of a given model.

A more sophisticated holdout training and testing procedure is cross-validation. In cross-validation we split our data into k folds, or k partitions, usually 5 or 10. We then iterate k times; in each iteration one of the folds is used as the testing set and the rest of the folds are combined to form the training set. We can then evaluate the performance at each iteration to estimate the performance of a given method. An advantage of using cross-validation is that all examples in the data are used in the training set at least once.


In [ ]:
def create_test_train_bag_of_words(train_corpus, test_corpus):
    """
    Create test and training set bag of words
    
    
    Parameters
    ----------
    train_corpus: ls
        ls of raw text for the training corpus
    test_corpus: ls
        ls of raw text for the testing corpus
        
    Returns
    -------
    (train_bag_of_words,test_bag_of_words): scipy sparse matrix
        bag-of-words representation of train and test corpus
    features: ls
        ls of words used as features. 
    """
    #parameters for vectorizer 
    ANALYZER = "word" #unit of features are single words rather than phrases of words 
    STRIP_ACCENTS = 'unicode'
    TOKENIZER = None
    NGRAM_RANGE = (0,2) #Range for phrases of words
    MIN_DF = 0.01 # Exclude words that have a frequency less than the threshold
    MAX_DF = 0.8  # Exclude words that have a frequency greater than the threshold 

    vectorizer = CountVectorizer(analyzer=ANALYZER,
                                tokenizer=None, # alternatively tokenize_and_stem but it will be slower 
                                ngram_range=NGRAM_RANGE,
                                stop_words = stopwords.words('english'),
                                strip_accents=STRIP_ACCENTS,
                                min_df = MIN_DF,
                                max_df = MAX_DF)
    
    NORM = None #do not normalize the TFIDF vectors
    SMOOTH_IDF = True #prevents division-by-zero errors
    SUBLINEAR_IDF = True #replace TF with 1 + log(TF)
    USE_IDF = True #flag to control whether to use TFIDF

    transformer = TfidfTransformer(norm = NORM, smooth_idf = SMOOTH_IDF, sublinear_tf = SUBLINEAR_IDF)

    #get the bag-of-words from the vectorizer and
    #then use TFIDF to limit the tokens found throughout the text 
    train_bag_of_words = vectorizer.fit_transform( train_corpus ) 
    test_bag_of_words = vectorizer.transform( test_corpus )
    if USE_IDF:
        train_tfidf = transformer.fit_transform(train_bag_of_words)
        test_tfidf = transformer.transform(test_bag_of_words)
    features = vectorizer.get_feature_names()

    
    return train_tfidf, test_tfidf, features

In [ ]:
from sklearn.cross_validation import StratifiedKFold
cv = StratifiedKFold(train_labels_binary, n_folds=5)
train_labels_binary = le.transform(train_labels)
for i, (train,test) in enumerate(cv):
    cv_train = train_corpus[train]
    cv_test = train_corpus[test]
    bag_of_words_train, bag_of_words_test, feature_names = create_test_train_bag_of_words(cv_train, 
                                                                                          cv_test)
    
    probas_ = clf.fit(bag_of_words_train, 
                      train_labels_binary[train]).predict_proba(bag_of_words_test)
    cv_test_labels = train_labels_binary[test]
    
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(cv_test_labels,
                                                                          probas_[:,1])
    auc_val = auc(recall_curve,precision_curve)
    plt.plot(recall_curve, precision_curve, label='AUC-PR {0} {1:.2f}'.format(i,auc_val))
    
plt.ylim(0,1.05)    
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc="lower left", fontsize='x-small')

In this case we did 5-fold cross-validation and plotted precision-recall curves for each iteration. We can then average the AUC-PR of each iteration to estimate the performance of our method.
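
A minimal sketch (our addition) of that averaging: it re-runs the same folds without the plots, collects the per-fold AUC-PR values in a list, and reports their mean.


In [ ]:
ls_fold_aucs = []
for train, test in StratifiedKFold(train_labels_binary, n_folds=5):
    fold_train_tfidf, fold_test_tfidf, _ = create_test_train_bag_of_words(train_corpus[train],
                                                                          train_corpus[test])
    fold_probas = clf.fit(fold_train_tfidf,
                          train_labels_binary[train]).predict_proba(fold_test_tfidf)
    fold_precision, fold_recall, _ = precision_recall_curve(train_labels_binary[test],
                                                            fold_probas[:, 1])
    ls_fold_aucs.append(auc(fold_recall, fold_precision))
print('mean AUC-PR across folds: {0:.2f}'.format(np.mean(ls_fold_aucs)))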

Examples of tagging


In [ ]:
num_comments = 2
label0_comment_idx = y_score[:,1].argsort()[:num_comments] #most confident examples of label 0 (the Credit Analyst class)
label1_comment_idx = y_score[:,1].argsort()[-num_comments:] #most confident examples of label 1 (the Financial Examiner class)
test_set_labels = labels[test_set_idx]
#convert back to the indices of the original dataset
top_comments_testing_set_idx = np.concatenate([label0_comment_idx, 
                                               label1_comment_idx])


#these are the examples the model is most confident about for each label
for i in top_comments_testing_set_idx:
    print(
        u"""{}:{}\n---\n{}\n===""".format(test_set_labels[i],
                                          y_score[i,1],
                                          test_corpus[i]))

These are the top-2 examples for each label that the model is most sure of. We can see our important feature words in these ads and how the model classified the advertisements.

Further Resources

A great resource for NLP in Python is Natural Language Processing with Python.

Exercises

Work through the Reddit_TextAnalysis.ipynb notebook.