Text analysis is used to extract useful information from or summarize a large amount of unstructured text stored in documents. This opens up the opportunity of using text data alongside more conventional data sources (e.g. surveys and administrative data). The goal of text analysis is to take a large corpus of complex and unstructured text data and extract important and meaningful messages in a comprehensible way.
Text analysis can help with the following tasks:
Information Retrieval: Find relevant information in a large database, for example when conducting a systematic literature review, a task that would be very time-consuming for humans to do manually.
Clustering and Text Categorization: Summarize a large corpus of text by finding the most important phrases, using methods like topic modeling.
Text Summarization: Create category-sensitive text summaries of a large corpus of text.
Machine Translation: Translate documents from one language to another.
In this tutorial, we are going to analyze social services descriptions using topic modeling to examine the content of our data and document classification to tag the type of facility each description belongs to.
In this tutorial, you will...
Corpus: A corpus is the set of all text documents used in your analysis; for example, your corpus of text may include hundreds of research articles.
Tokenize: Tokenization is the process by which text is separated into meaningful terms or phrases. In English this is easy to do for individual words, as they are separated by whitespace; however, it can get more complicated to automate determining which groups of words constitute meaningful phrases.
Stemming: Stemming is normalizing text by reducing all forms or conjugations of a word to the word's most basic form. In English, this can mean making a rule of removing the suffixes "ed" or "ing" from the end of all words, but it gets more complex. For example, "to go" is irregular, so you need to tell the algorithm that "went" and "goes" stem from a common lemma, and should be considered alternate forms of the word "go."
TF-IDF: TF-IDF (term frequency-inverse document frequency) is an example of feature engineering where the most important words are extracted by taking into account their frequency in individual documents and in the corpus as a whole.
Topic Modeling: Topic modeling is an unsupervised learning method where groups of words that often appear together are clustered into topics. Typically, the words in one topic should be related and make sense (e.g. boat, ship, captain). Individual documents can fall under one topic or multiple topics.
LDA: LDA (Latent Dirichlet Allocation) is a type of probabilistic model commonly used for topic modeling.
Stop Words: Stop words are words that have little semantic meaning but occur very frequently, like prepositions, articles and conjunctions. For example, every document (in English) will probably contain the words "and" and "the" many times. You will often remove them as part of preprocessing using a list of stop words.
In [ ]:
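from __future__ import print_function #__future__ imports must come before any other statement
%pylab inline
import numpy as np #also provided by %pylab; imported explicitly because later cells rely on np
import nltk
import ujson
import re
import time
import progressbar
import pandas as pd
from six.moves import zip, range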
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, auc
from sklearn import preprocessing
from collections import Counter, OrderedDict
from nltk.corpus import stopwords
from nltk import SnowballStemmer
In [ ]:
#nltk.download('stopwords') #download the latest stopwords
In [ ]:
df_socialservices_data = pd.read_csv('./data/socialservices.csv')
In [ ]:
df_socialservices_data.head()
Our table has 7 fields: FACID, facname, factype, facurl, facloc, abouturl, and textfromurl.
In [ ]:
df_socialservices_data.factype.unique()
In [ ]:
df_socialservices_data.facname.unique()
In [ ]:
df_socialservices_data.facname.unique().shape
There are 48 facilities, categorized into 4 unique facility types: education, income, health, and safety net.
We are going to apply topic modeling, an unsupervised learning method, to our corpus to find the high-level topics in our corpus as a "first go" for exploring our data. Through this process, we'll discuss how to clean and preprocess our data to get the best results.
Topic modeling is a broad subfield of machine learning and natural language processing. We are going to focus on a common modeling approach called Latent Dirichlet Allocation (LDA).
To use topic modeling, we first have to assume that topics exist in our corpus, and that some small number of these topics can "explain" the corpus. A topic in this context is a list of words from the corpus, ranked by probability. A single document can be explained by multiple topics. For instance, an article on net neutrality would fall under the topic "technology" as well as the topic "politics." The set of topics used by a document is known as the document's allocation. Hence the name Latent Dirichlet Allocation: each document has an allocation of latent topics drawn from a Dirichlet distribution.
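To make the idea of an allocation concrete, here is a minimal sketch in plain Python. All of the numbers, the two topic names, and the four-word vocabulary are made up for illustration; a fitted LDA model would estimate them from the corpus. It shows how a document's allocation mixes the topics' word probabilities into a single word distribution for that document.

```python
# made-up numbers: two topics, each a probability distribution over a 4-word vocabulary
toy_vocabulary = ['internet', 'provider', 'senate', 'vote']
topic_word = {
    'technology': [0.5, 0.4, 0.05, 0.05],
    'politics':   [0.1, 0.1, 0.4, 0.4],
}

# a hypothetical article on net neutrality: 60% technology, 40% politics
doc_allocation = {'technology': 0.6, 'politics': 0.4}

# probability of each word in this document = allocation-weighted mix of the topics
word_probs = [
    sum(doc_allocation[topic] * topic_word[topic][j] for topic in topic_word)
    for j in range(len(toy_vocabulary))
]

for word, prob in zip(toy_vocabulary, word_probs):
    print(word, round(prob, 2))
```

Because both the allocation and each topic sum to 1, the mixed word probabilities also sum to 1.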
The first important step in working with text data is cleaning and processing the data, which includes (but is not limited to) forming a corpus of text, tokenization, removing stop-words, finding words co-located together (N-grams), and stemming and lemmatization. Each of these steps will be discussed below.
The ultimate goal is to transform our text data into a form an algorithm can work with, because a document or a corpus of text cannot be fed directly into an algorithm. Algorithms expect numerical feature vectors with certain fixed sizes, and can't handle documents, which are basically sequences of symbols with variable length. We will be transforming our text corpus into a bag of n-grams to be further analyzed. In this form our text data is represented as a matrix where each row refers to a specific social service description (document) and each column is the occurrence count of a word or n-gram (feature).
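To make the matrix representation concrete, here is a minimal sketch in plain Python that builds a bag of words by hand. The two documents and the helper name `bag_of_words_by_hand` are made up for illustration, and this is not meant to mirror how any library implements the step internally.

```python
from collections import Counter

def bag_of_words_by_hand(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    # the vocabulary is every distinct word, sorted so the column order is stable
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# hypothetical mini-corpus, not taken from the social services data
toy_docs = ["food pantry and food bank", "legal aid and housing"]
toy_vocab, toy_matrix = bag_of_words_by_hand(toy_docs)

print(toy_vocab)
for row in toy_matrix:
    print(row)
```

Every document becomes a row of the same length, no matter how many words it contains, which is what lets an algorithm treat the corpus as a fixed-size numerical table.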
Ultimately, we want to take our collection of documents, the corpus, and convert it into a matrix. Fortunately, sklearn has a pre-built object, CountVectorizer, that can tokenize, eliminate stopwords, identify n-grams, and (when given a custom tokenizer) stem our corpus, and output a matrix in one step. Before we apply the vectorizer to our corpus of data, let's apply it to a toy example so that we see what the output looks like and how a bag of words is represented.
In [ ]:
def create_bag_of_words(corpus,
                        NGRAM_RANGE=(1,1),
                        stop_words = None,
                        stem = False,
                        MIN_DF = 0.05,
                        MAX_DF = 0.95,
                        USE_IDF=False):
    """
    Turn a corpus of text into a bag-of-words.

    Parameters
    -----------
    corpus: ls
        list of documents in corpus
    NGRAM_RANGE: tuple
        range of N-gram sizes to extract. Default (1,1), single words only
    stop_words: ls
        list of commonly occurring words that have little semantic
        value
    stem: bool
        use a stemmer to stem words
    MIN_DF: float
        exclude words that appear in a smaller fraction of documents than the threshold
    MAX_DF: float
        exclude words that appear in a larger fraction of documents than the threshold
    USE_IDF: bool
        reweight the counts with TF-IDF

    Returns
    -------
    bag_of_words: scipy sparse matrix
        scipy sparse matrix of text
    features:
        ls of words
    """
    #parameters for vectorizer
    ANALYZER = "word" #features are built from word tokens rather than characters
    STRIP_ACCENTS = 'unicode'
    stemmer = nltk.SnowballStemmer("english")

    if stem:
        tokenize = lambda x: [stemmer.stem(i) for i in x.split()]
    else:
        tokenize = None
    vectorizer = CountVectorizer(analyzer=ANALYZER,
                                 tokenizer=tokenize,
                                 ngram_range=NGRAM_RANGE,
                                 stop_words = stop_words,
                                 strip_accents=STRIP_ACCENTS,
                                 min_df = MIN_DF,
                                 max_df = MAX_DF)

    bag_of_words = vectorizer.fit_transform( corpus ) #transform our corpus into a bag of words
    #get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
    features = list(vectorizer.get_feature_names_out())

    if USE_IDF:
        NORM = None #do not normalize the document vectors
        SMOOTH_IDF = True #prevents division by zero errors
        SUBLINEAR_TF = True #replace TF with 1 + log(TF)
        transformer = TfidfTransformer(norm = NORM,smooth_idf = SMOOTH_IDF,sublinear_tf = SUBLINEAR_TF)
        #reweight the counts from the vectorizer with TF-IDF to
        #downweight tokens found throughout the corpus
        tfidf = transformer.fit_transform(bag_of_words)
        return tfidf, features
    else:
        return bag_of_words, features
In [ ]:
toy_corpus = ['this is document one', 'this is document two', 'text analysis on documents is fun']
In [ ]:
toy_bag_of_words, toy_features = create_bag_of_words(toy_corpus)
In [ ]:
toy_corpus
In [ ]:
toy_features
In [ ]:
np_bag_of_words = toy_bag_of_words.toarray()
np_bag_of_words
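Our data has been transformed from three documents into a 3 x 9 matrix, where each row in the matrix corresponds to a document, and each column corresponds to a feature (in the order they appear in toy_features). Each entry counts how many times the feature appears in the document; in this toy example every word occurs at most once per document, so a 1 indicates that the word is present and a 0 indicates that it is not. Note that there are only 9 columns even though the corpus contains 10 distinct words: "is" appears in every document, which puts it above the MAX_DF threshold of 0.95, so it is dropped.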
It is very common that this representation will be a "sparse" matrix, or a matrix that has a lot of 0s. With sparse matrices, it is often more efficient to keep track of which values aren't 0 and where those non-zero entries are located, rather than to save the entire matrix. To save space, the scipy library has special ways of storing sparse matrices in an efficient way.
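As a rough illustration of the idea, the sketch below keeps only the non-zero entries of a small matrix as (row, column, value) triples. The matrix and the helper name `to_coordinate_list` are made up, and the storage formats scipy actually uses are more elaborate than this; treat it as a picture of the principle rather than of the library.

```python
def to_coordinate_list(dense_rows):
    """Keep only the non-zero entries as (row, column, value) triples."""
    return [(i, j, value)
            for i, row in enumerate(dense_rows)
            for j, value in enumerate(row)
            if value != 0]

# a made-up 3 x 6 count matrix that is mostly zeros
dense = [
    [0, 0, 2, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 3],
]

triples = to_coordinate_list(dense)
n_cells = sum(len(row) for row in dense)

print(triples)
print(len(triples), "non-zero entries out of", n_cells, "cells")
```

The saving grows with the size of the vocabulary: a real corpus can have thousands of columns, while each document uses only a small fraction of them.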
Our toy corpus is now ready to be analyzed. We used this toy example to illustrate how a document is turned into a matrix to be used in text analysis. When you're applying this to real text data, the matrix will be much larger and harder to interpret, but it's important that you know the process.
In [ ]:
#solution
exercise_corpus = ['Batman is friends with Superman',
'Superman is enemies with Lex Luthor',
'Batman is enemies with Lex Luthor']
exercise_bag_of_words, exercise_features = create_bag_of_words(exercise_corpus)
In [ ]:
np_bag_of_words = exercise_bag_of_words.toarray()
In [ ]:
exercise_features
In [ ]:
np_bag_of_words
In [ ]:
def get_word_counts(bag_of_words, feature_names):
    """
    Get the ordered word counts from a bag_of_words

    Parameters
    ----------
    bag_of_words: obj
        scipy sparse matrix from CountVectorizer
    feature_names: ls
        list of words

    Returns
    -------
    word_counts: dict
        Dictionary of word counts, ordered from most to least frequent
    """
    np_bag_of_words = bag_of_words.toarray()
    word_count = np.sum(np_bag_of_words,axis=0) #sum each column to get the total count per word
    np_word_count = np.asarray(word_count).ravel()
    dict_word_counts = dict( zip(feature_names, np_word_count) )

    orddict_word_counts = OrderedDict(
        sorted(dict_word_counts.items(), key=lambda x: x[1], reverse=True) )

    return orddict_word_counts
In [ ]:
get_word_counts(toy_bag_of_words, toy_features)
Note that the words "document" and "documents" both appear separately in the list. Should they be treated as the same words, since one is just the plural of the other, or should they be considered distinct words? These are the types of decisions you will have to make in your preprocessing steps.
In [ ]:
get_word_counts(exercise_bag_of_words, exercise_features)
In [ ]:
corpus = df_socialservices_data['textfromurl'].values #pull all the descriptions and put them in a numpy array
corpus
In [ ]:
def create_topics(tfidf, features, N_TOPICS=3, N_TOP_WORDS=5):
    """
    Given a matrix of text features, generate topics with LDA

    Parameters
    -----------
    tfidf: scipy sparse matrix
        sparse matrix of text features
    features: ls
        list of feature names, in the column order of tfidf
    N_TOPICS: int
        number of topics (default 3)
    N_TOP_WORDS: int
        number of top words to display in each topic (default 5)

    Returns
    -------
    ls_keywords: ls
        list of keywords for each topic
    doctopic: array
        numpy array with the weight of each topic in each document
    """
    with progressbar.ProgressBar(max_value=progressbar.UnknownLength) as bar:
        step = 0
        #create an object that will find N_TOPICS topics
        #(older scikit-learn releases called this argument n_topics)
        lda = LatentDirichletAllocation( n_components= N_TOPICS,
                                         learning_method='online')
        bar.update(step)
        step += 1
        doctopic = lda.fit_transform( tfidf )
        bar.update(step)
        step += 1

        ls_keywords = []
        for topic_id, topic in enumerate(lda.components_):
            #indices of the N_TOP_WORDS highest-weighted words in this topic
            word_idx = np.argsort(topic)[::-1][:N_TOP_WORDS]
            keywords = ', '.join( features[idx] for idx in word_idx)
            ls_keywords.append(keywords)
            print(topic_id, keywords)
            bar.update(step)
            step += 1

    return ls_keywords, doctopic
In [ ]:
corpus_bag_of_words, corpus_features = create_bag_of_words(corpus)
Let's examine our features.
In [ ]:
corpus_features
The first aspect to notice about the feature list is that the first few entries are numbers that have no real semantic meaning. The feature list also includes numerous other useless words, such as prepositions and articles, that will just add noise to our analysis.
We can also notice the words action and activities, or the words addition and additional, are close enough to each other that it might not make sense to treat them as entirely separate words. Part of your cleaning and preprocessing duties will be manually inspecting your lists of features, seeing where these issues arise, and making decisions to either remove them from your analysis or address them separately.
Let's get the count of the number of times that each of the words appears in our corpus.
In [ ]:
get_word_counts(corpus_bag_of_words, corpus_features)
Our top words are articles, prepositions and conjunctions that are not informative whatsoever, so we're probably not going to come up with anything interesting ("garbage in, garbage out").
Nevertheless, let's forge blindly ahead and try to create topics, and see the quality of the results that we get.
In [ ]:
ls_corpus_keywords, corpus_doctopic = create_topics(corpus_bag_of_words, corpus_features)
These topics don't give us any real insight into what the data contains - one of the topics is "and, the, to, of, in"! There are some hints to the subjects of the websites ("YWCA", "youth") and their locations ("Evanston"), but the signal is being swamped by the noise.
The word "click" also comes up. This word might be useful in some contexts, but since we scraped this data from websites, it's likely that "click" is more related to the website itself (e.g. "Click here to find out more") as opposed to the content of the website.
We'll have to clean and process our data to get any meaningful information out of this text.
To clean and normalize text, we'll remove all special characters, numbers, and punctuation, so we're left with only the words themselves. Then we will make all the text lowercase; this uniformity will ensure that the algorithm doesn't treat "the" and "The" as different words, for example.
To remove the special characters, numbers and punctuation we will use regular expressions.
Regular Expressions, or "regexes" for short, let you find all the words or phrases in a document or text file that match a certain pattern. These rules are useful for pulling out useful information from a large amount of text. For example, if you want to find all email addresses in a document, you might look for everything that looks like some combination of letters, _, . followed by @, followed by more letters, and ending in .com or .edu. If you want to find all the credit card numbers in a document, you might look for everywhere you see the pattern "four numbers, space, four numbers, space, four numbers, space, four numbers." Regexes are also helpful if you are scraping information from websites, because you can use them to separate the content from the HTML code used for formatting the website.
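As a small illustration of the email example above, here is a sketch using Python's `re` module. The pattern is deliberately simple (it also allows digits) and the sample text is made up; real email addresses are far more varied, so treat this as a teaching example rather than a validator.

```python
import re

# letters, digits, "_" or "." before the "@", then a domain ending in .com or .edu
EMAIL_PATTERN = r'[A-Za-z0-9_.]+@[A-Za-z0-9.]+\.(?:com|edu)'

# made-up text for illustration
sample_text = "Contact jane_doe@example.com or the office at info@school.edu for details."

emails = re.findall(EMAIL_PATTERN, sample_text)
print(emails)
```

`re.findall` returns every non-overlapping match as a list of strings, which makes it convenient for pulling repeated patterns out of a long document.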
A full tutorial on regular expressions would be outside the scope of this tutorial, but many good tutorials can be found online. regex101.com is also a great interactive tool for developing and checking regular expressions.
"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." -- Jamie Zawinski
A word of warning: Regexes can work much more quickly than sorting through plain text by hand; however, if your regular expressions are becoming overly complicated, it's a good idea to find a simpler way to do what you want to do. Any developer should keep in mind there is a trade-off between optimization and understandability. The general philosophy of programming in Python is that your code should be as understandable to people as possible, because human time is more valuable than computer time. You should therefore lean toward understandability rather than overly optimizing your code to make it run as quickly as possible. Your future self, code reviewers, people who inherit your code, and anyone else who has to make sense of your code in the future will appreciate it.
For our purposes, we are going to use a regular expression to match all characters that are not letters -- punctuation, quotes, special characters and numbers -- and replace them with spaces. Then we'll make all of the remaining characters lowercase.
We will be using the re library in Python for regular expression matching.
In [ ]:
#get rid of the punctuations and set all characters to lowercase
RE_PREPROCESS = r'\W+|\d+' #matches runs of non-word characters (punctuation, whitespace) or digits; underscores count as word characters and are kept
#get rid of punctuation and make everything lowercase
#the code below works by looping through the array of text ("corpus")
#for a given piece of text ("comment") we invoke the `re.sub` command
#the `re.sub` command takes 3 arguments: (1) the regular expression to match,
#(2) what we want to substitute in place of that matching string (' ', a space)
#and (3) the text we want to apply this to.
#we then invoke the `lower()` method on the output of the `re.sub` command
#to make all the remaining characters lowercase.
#the result is a list, where each entry in the list is a cleaned version of the
#corresponding entry in the original corpus.
#we then make the list into a numpy array to use it in analysis
processed_corpus = np.array( [ re.sub(RE_PREPROCESS, ' ', comment).lower() for comment in corpus] )
In [ ]:
corpus[0]
This text includes a lot of useful information, but also includes some things we don't want or need. There are some weird special characters (like \xe2\x80\x94). There are also some numbers, which are informative and interesting to a human reading the text (phone numbers, addresses, "since 1899," "impacts the lives of nearly 20,000 children"), but when we break down the documents into individual words, the numbers will become meaningless. We'll also want to remove all punctuation, so that we can say any two things separated by a space are individual words.
In [ ]:
processed_corpus[0]
Everything is lowercase, and all numbers and special characters have been removed. Our text is now normalized.
Now that we've cleaned our text, we can tokenize it by deciding which words or phrases are the most meaningful. In this case, we'll want to split our text into individual words. Normally the CountVectorizer handles this for us.
To go from a whole document to a list of individual words, we can use the .split() method. By default, this method splits on whitespace between words, so we don't need to specify that explicitly.
In [ ]:
tokens = processed_corpus[0].split()
In [ ]:
tokens
Stopwords are words that are found commonly throughout a text and carry little semantic meaning. Examples of common stopwords are prepositions, articles and conjunctions. For example, the words "the" and "of" are totally ubiquitous, so they won't serve as meaningful features, whether to distinguish documents from each other or to tell what a given document is about. You may also run into words that you want to remove based on where you obtained your corpus of text or what it's about. There are many lists of common stopwords available for you to use, both for general documents and for specific contexts, so you don't have to start from scratch.
We can eliminate stopwords by checking all the words in our corpus against a list of commonly occurring stopwords.
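The check itself is simple. Here is a minimal sketch with a tiny hand-picked stop list and a made-up sentence; the list `toy_stopwords` and the helper `remove_stopwords` are illustrative only, and a real stop list is much longer.

```python
# a tiny hand-picked stop list for illustration only
toy_stopwords = {"the", "of", "and", "to", "in", "a"}

def remove_stopwords(tokens, stop_words):
    """Return the tokens that are not in the stop list, preserving order."""
    return [token for token in tokens if token not in stop_words]

sample_tokens = "the mission of the shelter is to serve families in need".split()
kept_tokens = remove_stopwords(sample_tokens, toy_stopwords)
print(kept_tokens)
```

Notice that "is" survives because it is not in the toy list, which is a reminder that the result depends entirely on which stop list you choose.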
In [ ]:
eng_stopwords = stopwords.words('english')
In [ ]:
eng_stopwords
In [ ]:
#sample of stopwords
#this is an example of slicing where we implicitly start at the beginning and move to the end
#we select every 10th entry in the array
eng_stopwords[::10]
Notice that this list includes "weren" and "hasn" as well as single letters ("t"). Why do you think these are contained in the list of stopwords?
In [ ]:
eng_stopwords[::5]
In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus, stop_words=eng_stopwords)
dict_processed_word_counts = get_word_counts(processed_bag_of_words, processed_features)
dict_processed_word_counts
Much better! Now this is starting to look like a reasonable representation of our corpus of text.
We mentioned that, in addition to stopwords that are common across all types of text analysis problems, there will also be specific stopwords based on the context of your domain. Notice how the top words include words like "services," "youth," "community," "mission"? It makes sense that these words are so common, but we'd expect to see them in every website in our corpus - after all, we're looking at websites of social service organizations in Chicago! - so they won't be very helpful in analysis.
One quick way to remove some of these domain-specific stopwords is by dropping some of your most frequent words. We'll start out by dropping the top 20. You'll want to change this number, playing with making it bigger and smaller, to see how it affects your resulting topics.
In [ ]:
top_20_words = list(dict_processed_word_counts.keys())[:20]
domain_specific_stopwords = eng_stopwords + top_20_words
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
stop_words=domain_specific_stopwords)
In [ ]:
dict_processed_word_counts = get_word_counts(processed_bag_of_words, processed_features)
dict_processed_word_counts
This is a bit better - although we still see some words that are probably very common ("care", "communities"), words like "catholic," "north," and "violence" will probably help us come up with more specific categories within the broader realm of social services. Let's see what topics we produce.
In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words,
processed_features)
Now we are starting to get somewhere! We can manipulate the number of topics we want to find and the number of words to use for each topic to see if we can understand more from our corpus.
In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 5,
N_TOP_WORDS= 10)
Some structure is starting to reveal itself - "legal" and "law" appear in the same topic, as do "violence," "domestic," and "women" (probably appearing in websites of women's shelters). Adding more topics has revealed larger subtopics. Let's see if increasing the number of topics gives us more information.
However, we can see that "donatebutton" and "companylogo" are still present - these are more likely artifacts of the websites than useful information about the charities! This is an iterative process - after seeing the results of some analysis, you will need to go back to the preprocessing step and add more words to your list of stopwords or change how you cleaned the data.
In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 10,
N_TOP_WORDS= 15)
This looks like a good number of topics for now. Some of the top words are quite similar, like "volunteer" and "volunteers," or "child" and "children." Let's move to stemming and lemmatization.
We can further process our text through stemming and lemmatization, or replacing words with their root or simplest form. For example "systems," "systematic," and "system" are all different words, but we can replace all these words with "system" without sacrificing much meaning.
A lemma is the original dictionary form of a word (e.g. the lemma for "lies," "lied," and "lying" is "lie"). The process of turning a word into its simplest form is stemming. There are several well known stemming algorithms -- Porter, Snowball, Lancaster -- that all have their respective strengths and weaknesses. For this tutorial, we'll use the Snowball stemmer.
In [ ]:
stemmer = SnowballStemmer("english")
print(stemmer.stem('lies'))
print(stemmer.stem("lying"))
print(stemmer.stem('systematic'))
print(stemmer.stem("running"))
In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True)
In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 10,
N_TOP_WORDS= 15)
Obviously, reducing a document to a bag of words means losing much of its meaning - we put words in certain orders, and group words together in phrases and sentences, precisely to give them more meaning. If you follow the processing steps we've gone through so far, splitting your document into individual words and then removing stopwords, you'll completely lose all phrases like "kick the bucket," "commander in chief," or "sleeps with the fishes."
One way to address this is to break down each document similarly, but rather than treating each word as an individual unit, treat each group of 2 words, or 3 words, or n words, as a unit. We call this a "bag of n-grams," where n is the number of words in each chunk. Then you can analyze which groups of words commonly occur together (in a fixed order).
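Generating n-grams from a list of tokens takes only a few lines. The sketch below uses a made-up phrase and a hypothetical helper named `make_ngrams`; it slides a window of n words across the tokens.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

phrase_tokens = "domestic violence legal services".split()

unigrams = make_ngrams(phrase_tokens, 1)
bigrams = make_ngrams(phrase_tokens, 2)

print(unigrams)
print(bigrams)
```

A bag of n-grams with a range such as 1 to 2 simply pools both lists, so single words and two-word phrases each get their own column.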
Let's transform our corpus into a bag of n-grams with n=2: a bag of 2-grams, AKA a bag of bi-grams.
In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True,
                                                                 NGRAM_RANGE=(1,2)) #single words and bi-grams
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 10,
N_TOP_WORDS= 15)
We can see that this lets us uncover patterns that we couldn't when we just used a bag of words: "north shore" and "domest violenc" come up as words. Note that this still includes the individual words, as well as the bi-grams.
A final step in cleaning and processing our text data is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is based on the idea that the words (or terms) that are most related to a certain topic will occur frequently in documents on that topic, and infrequently in others. We therefore reweight words so that we emphasize words that are distinctive to a document and suppress words that are common throughout the corpus, by weighting each word inversely to the number of documents it appears in.
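To see the reweighting at work, here is a small hand calculation in plain Python. It uses the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which is the form sklearn's TfidfTransformer applies when smooth_idf is on. The corpus size and document frequencies are made up, and for simplicity the sketch uses raw counts rather than the sublinear (1 + log) scaling.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Inverse document frequency with add-one smoothing."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1

# made-up numbers: a corpus of 4 documents
n_docs = 4
idf_common = smoothed_idf(n_docs, doc_freq=4)  # word found in every document
idf_rare = smoothed_idf(n_docs, doc_freq=1)    # word found in only one document

# suppose both words occur 3 times in the document we are scoring
term_count = 3
weight_common = term_count * idf_common
weight_rare = term_count * idf_rare

print(round(weight_common, 3))
print(round(weight_rare, 3))
```

Both words have the same raw count, but the word that appears in only one document ends up with nearly twice the weight of the word that appears everywhere.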
In [ ]:
processed_bag_of_words, processed_features = create_bag_of_words(processed_corpus,
                                                                 stop_words=domain_specific_stopwords,
                                                                 stem=True,
                                                                 NGRAM_RANGE=(1,2), #single words and bi-grams
                                                                 USE_IDF = True)
In [ ]:
dict_word_counts = get_word_counts(processed_bag_of_words,
processed_features)
In [ ]:
dict_word_counts
The word counts have been reweighted to emphasize the more meaningful words of the corpus, while de-emphasizing the words that are found commonly throughout the corpus.
In [ ]:
processed_keywords, processed_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 10,
N_TOP_WORDS= 15)
In [ ]:
exercise_keywords, exercise_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 5,
N_TOP_WORDS= 25)
In [ ]:
exercise_keywords, exercise_doctopic = create_topics(processed_bag_of_words,
processed_features,
N_TOPICS = 10,
N_TOP_WORDS= 25)
In [ ]:
#grab the topic_id of the majority topic for each document and store it in a list
ls_topic_id = [np.argsort(processed_doctopic[comment_id])[::-1][0] for comment_id in range(len(corpus))]
df_socialservices_data['topic_id'] = ls_topic_id #add to the dataframe so we can compare with the facility types
Now each row is tagged with a topic ID. Let's see how well the topics explain the social services by looking at the first topic, and seeing how similar the social services within that topic are to each other.
In [ ]:
topic_num = 0
print(processed_keywords[topic_num])
df_socialservices_data[ df_socialservices_data.topic_id == topic_num ].head(10)
In [ ]:
topic_num = 3
print(processed_keywords[topic_num])
df_socialservices_data[ df_socialservices_data.topic_id == topic_num ].head(10)
Previously, we used topic modeling to infer relationships between social service facilities within the data. That is an example of unsupervised learning: we were looking to uncover structure in the form of topics, or groups of agencies, but we did not necessarily know the ground truth of how many groups we should find or which agencies belonged in which group.
Now we turn our attention to supervised learning. In supervised learning, we have a known outcome or label (Y) that we want to produce given some data (X), and in general, we want to be able to produce this Y when we don't know it, or when we only have X.
In order to produce labels we need to first have examples our algorithm can learn from, a "training set." In the context of text analysis, developing a training set can be very expensive, as it can require a large amount of human labor or linguistic expertise. Document classification is an example of supervised learning in which we want to characterize our documents based on their contents (X). A common example of document classification is spam e-mail detection. Another example of supervised learning in text analysis is sentiment analysis, where X is our documents and Y is the state of the author. This "state" is dependent on the question you're trying to answer, and can range from the author being happy or unhappy with a product to the author being politically conservative or liberal. Another example is part-of-speech tagging, where X is the individual words and Y is the part of speech.
In this section, we'll train a classifier to classify social service agencies. Let's see if we can label a new website as belonging to facility type "income" or "health."
In [ ]:
df_socialservices_data.factype.value_counts()
In [ ]:
mask = df_socialservices_data.factype.isin(['income','health'])
In [ ]:
df_income_health = df_socialservices_data[mask]
In [ ]:
df_train, df_test = train_test_split(df_income_health, test_size=0.20, random_state=17)
In [ ]:
df_train.head()
In [ ]:
df_train['factype'].unique()
In [ ]:
Counter(df_train['factype'].values)
In [ ]:
df_test.head()
In [ ]:
df_test['factype'].unique()
In [ ]:
Counter(df_test['factype'].values)
In [ ]:
train_labels = df_train.factype.values
train_corpus = np.array( [re.sub(RE_PREPROCESS, ' ', text).lower() for text in df_train.textfromurl.values])
test_labels = df_test.factype.values
test_corpus = np.array( [re.sub(RE_PREPROCESS, ' ', text).lower() for text in df_test.textfromurl.values])
labels = np.append(train_labels, test_labels)
Just as we had done in the unsupervised learning context, we have to transform our data. This time we have to transform our testing and training set into two different bags of words. The classifier will learn from the training set, and we will evaluate the classifier's performance on the testing set.
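The key point is that the vocabulary (the columns of the matrix) is learned from the training set alone and then reused for the test set. The sketch below shows this in plain Python with made-up documents; the helpers `fit_vocabulary` and `transform_with_vocabulary` are hypothetical names that loosely mirror the fit and transform steps of a vectorizer.

```python
def fit_vocabulary(train_docs):
    """Learn the column order from the training documents only."""
    words = sorted({word for doc in train_docs for word in doc.split()})
    return {word: column for column, word in enumerate(words)}

def transform_with_vocabulary(docs, vocabulary):
    """Count words using a fixed vocabulary; unseen words are dropped."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for word in doc.split():
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

# made-up documents for illustration
toy_train_docs = ["health clinic", "income support clinic"]
toy_test_docs = ["dental clinic health"]

toy_vocabulary = fit_vocabulary(toy_train_docs)
toy_train_rows = transform_with_vocabulary(toy_train_docs, toy_vocabulary)
toy_test_rows = transform_with_vocabulary(toy_test_docs, toy_vocabulary)

print(toy_vocabulary)
print(toy_train_rows)
print(toy_test_rows)
```

The word "dental" never appears in the training documents, so it has no column and is ignored in the test row. This keeps the training and test matrices the same width and prevents information from the test set leaking into the features.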
In [ ]:
#parameters for vectorizer
ANALYZER = "word" #features are built from word tokens rather than characters
STRIP_ACCENTS = 'unicode'
TOKENIZER = None # alternatively tokenize_and_stem but it will be slower
NGRAM_RANGE = (1,2) #range for phrases of words: single words and bi-grams
MIN_DF = 0.01 # exclude words that appear in a smaller fraction of documents than the threshold
MAX_DF = 0.8 # exclude words that appear in a larger fraction of documents than the threshold
vectorizer = CountVectorizer(analyzer=ANALYZER,
                             tokenizer=TOKENIZER,
                             ngram_range=NGRAM_RANGE,
                             stop_words = stopwords.words('english'),
                             strip_accents=STRIP_ACCENTS,
                             min_df = MIN_DF,
                             max_df = MAX_DF)
In [ ]:
NORM = None #do not normalize the document vectors
SMOOTH_IDF = True #prevents division by zero errors
SUBLINEAR_TF = True #replace TF with 1 + log(TF)
USE_IDF = True #flag to control whether to use TFIDF
transformer = TfidfTransformer(norm = NORM,smooth_idf = SMOOTH_IDF,sublinear_tf = SUBLINEAR_TF)
#get the bag-of-words from the vectorizer and
#then use TFIDF to downweight the tokens found throughout the text
start_time = time.time()
#fit the vocabulary on the training set only, then reuse it for the test set
train_bag_of_words = vectorizer.fit_transform( train_corpus )
test_bag_of_words = vectorizer.transform( test_corpus )
if USE_IDF:
    train_tfidf = transformer.fit_transform(train_bag_of_words)
    test_tfidf = transformer.transform(test_bag_of_words)
#get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
features = list(vectorizer.get_feature_names_out())
print('Time Elapsed: {0:.2f}s'.format(
    time.time()-start_time))
We cannot pass the labels "income" or "health" directly to the classifier. Instead, we need to encode them as 0s and 1s using the LabelEncoder that is part of sklearn.
In [ ]:
#relabel our labels as a 0 or 1
le = preprocessing.LabelEncoder()
le.fit(labels)
labels_binary = le.transform(labels)
In [ ]:
list(zip(labels,labels_binary))
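To make the encoding concrete, here is a small sketch on a hand-written list of labels (the `toy_` names are assumptions for illustration). `LabelEncoder` sorts the classes alphabetically, so "health" becomes 0 and "income" becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written labels, invented for illustration
toy_labels = ["income", "health", "health", "income"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

print(list(toy_encoder.classes_))                         # classes in sorted order
print(list(toy_encoded))                                  # integer codes
print(list(toy_encoder.inverse_transform(toy_encoded)))   # back to the strings
```

The `inverse_transform` call is handy later on, when we want to read the model's output as "health" or "income" rather than 0 or 1.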
We also need to create arrays of indices so we can access the training and testing sets accordingly.
In [ ]:
train_size = df_train.shape[0]
train_set_idx = np.arange(0,train_size)
test_set_idx = np.arange(train_size, len(labels))
train_labels_binary = labels_binary[train_set_idx]
test_labels_binary = labels_binary[test_set_idx]
The classifier we are using in the example is LogisticRegression. As we saw in the Machine Learning tutorial, first we decide on a classifier, then we fit the classifier to the data to create a model. We can then test our model on the test set by passing the features (X) from our test set to get predicted labels. The model will output the probability of each document being classified as income or health.
In [ ]:
clf = LogisticRegression(penalty='l1', solver='liblinear') #the L1 penalty needs a solver that supports it, such as liblinear
mdl = clf.fit(train_tfidf, labels_binary[train_set_idx]) #train the classifier to get the model
y_score = mdl.predict_proba( test_tfidf ) #score of the document referring to an income or health agency
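The shape of the `predict_proba` output can be easier to see on a tiny example. The sketch below uses made-up one-feature data (all `toy_` names are assumptions) and shows that there is one column per class, ordered as in `classes_`, and that each row sums to 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy one-feature data, invented for illustration
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_clf = LogisticRegression(penalty='l1', solver='liblinear')
toy_model = toy_clf.fit(X_toy, y_toy)
toy_scores = toy_model.predict_proba(X_toy)

print(toy_model.classes_)       # column order of predict_proba
print(toy_scores.shape)         # (n_documents, n_classes)
print(toy_scores.sum(axis=1))   # each row sums to 1
```

This is why the later cells use `y_score[:,1]`: the second column holds the probability of the class encoded as 1.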
In [ ]:
def plot_precision_recall_n(y_true, y_prob, model_name):
    """
    y_true: ls
        ls of ground truth labels
    y_prob: ls
        ls of predicted probabilities from model
    model_name: str
        str of model name (e.g, LR_123)
    """
    from sklearn.metrics import precision_recall_curve
    y_score = y_prob
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true, y_score)
    precision_curve = precision_curve[:-1]
    recall_curve = recall_curve[:-1]
    pct_above_per_thresh = []
    number_scored = len(y_score)
    for value in pr_thresholds:
        num_above_thresh = len(y_score[y_score>=value])
        pct_above_thresh = num_above_thresh / float(number_scored)
        pct_above_per_thresh.append(pct_above_thresh)
    pct_above_per_thresh = np.array(pct_above_per_thresh)
    plt.clf()
    fig, ax1 = plt.subplots()
    ax1.plot(pct_above_per_thresh, precision_curve, 'b')
    ax1.set_xlabel('percent of population')
    ax1.set_ylabel('precision', color='b')
    ax1.set_ylim(0,1.05)
    ax2 = ax1.twinx()
    ax2.plot(pct_above_per_thresh, recall_curve, 'r')
    ax2.set_ylabel('recall', color='r')
    ax2.set_ylim(0,1.05)
    name = model_name
    plt.title(name)
    plt.show()
In [ ]:
plot_precision_recall_n(labels_binary[test_set_idx], y_score[:,1], 'LR')
If we examine our precision-recall curve we can see that our precision is 1 up to 40 percent of the population. We can use a "precision at k" curve to see what percent of the corpus can be tagged by the classifier, and which should undergo a manual clerical review. Based on this curve, we might say that we can use our classifier to tag the 25% of the documents that have the highest scores as 1, and manually review the rest.
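The "precision at k" idea can be sketched directly: sort the documents by score, keep the top fraction, and compute the share of true 1s among them. The helper `precision_at_top_k` and the toy scores below are hypothetical and only illustrate the calculation.

```python
import numpy as np

def precision_at_top_k(y_true, y_score, k_fraction):
    """Precision among the k_fraction of documents with the highest scores."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = max(1, int(round(k_fraction * len(y_score))))
    top_idx = np.argsort(y_score)[::-1][:n_top]   # highest scores first
    return y_true[top_idx].mean()

# Toy labels and scores, invented for illustration
toy_true = [1, 1, 0, 1, 0, 0, 1, 0]
toy_score = [0.95, 0.90, 0.40, 0.85, 0.30, 0.60, 0.55, 0.10]

print(precision_at_top_k(toy_true, toy_score, 0.25))   # top 2 of 8 documents
print(precision_at_top_k(toy_true, toy_score, 0.75))   # top 6 of 8 documents
```

In this toy data the top 25% are all true 1s, while widening the cut to 75% lets in false positives, which is the trade-off the curve visualizes.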
Alternatively, we can try to maximize the entire precision-recall space. In this case we need a different metric.
In [ ]:
from sklearn.metrics import precision_recall_curve, auc
def plot_precision_recall(y_true,y_score):
    """
    Plot a precision recall curve
    Parameters
    ----------
    y_true: ls
        ground truth labels
    y_score: ls
        score output from model
    """
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_true,y_score[:,1])
    plt.plot(recall_curve, precision_curve)
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    auc_val = auc(recall_curve,precision_curve)
    print('AUC-PR: {0:.2f}'.format(auc_val))
    plt.show()
    plt.clf()
In [ ]:
plot_precision_recall(labels_binary[test_set_idx],y_score)
The area under the precision-recall curve (AUC-PR) summarizes how accurate our scores are under different cutoff thresholds. The model outputs a score between 0 and 1. We specify a range of cutoff values and label each example as 1 if its score is at or above the cutoff and as 0 otherwise. The better our scores separate the true 1s from the true 0s, the more resilient the resulting labels are to different cutoffs. For instance, if our scores ranked every 1 above every 0, our AUC-PR would be 1.
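A small sketch can confirm this behaviour. Below, the hypothetical helper `auc_pr` computes the area under the precision-recall curve for two invented sets of scores: one that ranks every 1 above every 0, and one that does not.

```python
from sklearn.metrics import precision_recall_curve, auc

# Toy labels and scores, invented for illustration
toy_true = [0, 0, 1, 1]
perfect_scores = [0.1, 0.2, 0.8, 0.9]   # every 1 outranks every 0
mixed_scores = [0.1, 0.9, 0.2, 0.8]     # one 0 outranks both 1s

def auc_pr(y_true, y_score):
    """Area under the precision-recall curve."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return auc(recall, precision)

perfect_auc = auc_pr(toy_true, perfect_scores)
mixed_auc = auc_pr(toy_true, mixed_scores)
print(perfect_auc, mixed_auc)
```

The perfectly ranked scores reach an AUC-PR of 1, and the mixed scores fall below it.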
In [ ]:
def display_feature_importances(coef,features, labels, num_features=10):
    """
    output feature importances
    Parameters
    ----------
    coef: numpy
        feature importances
    features: ls
        feature names
    labels: ls
        labels for the classifier
    num_features: int
        number of features to output (default 10)
    """
    #use the coefficients passed in rather than reading them from the global model
    dict_feature_importances = dict( zip(features, coef) )
    orddict_feature_importances = OrderedDict(
        sorted(dict_feature_importances.items(), key=lambda x: x[1]) )
    ls_sorted_features = list(orddict_feature_importances.keys())
    label0_features = ls_sorted_features[:num_features] #most negative coefficients
    label1_features = ls_sorted_features[-num_features:] #most positive coefficients
    print(labels[0],label0_features)
    print(labels[1], label1_features)
In [ ]:
display_feature_importances(mdl.coef_.ravel(), features, ['health','income'])
The feature importances give us the words which are the most relevant for distinguishing the type of social service agency (between income and health). Some of these make sense ("city church" seems more likely to be health than income), but some don't make as much sense, or seem to be artifacts from the website that we should remove ("housing humancarelogo").
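One way to drop such artifacts is to add them to the stop word list before vectorizing. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's built-in `ENGLISH_STOP_WORDS` instead of the NLTK list used in this tutorial, and the `artifact_tokens` list and toy documents are invented.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Tokens we consider website artifacts (assumed list, for illustration)
artifact_tokens = ['humancarelogo']
custom_stop_words = list(ENGLISH_STOP_WORDS) + artifact_tokens

# Toy documents, invented for illustration
toy_docs = ["housing humancarelogo assistance", "health clinic humancarelogo"]

toy_vectorizer = CountVectorizer(stop_words=custom_stop_words)
toy_vectorizer.fit(toy_docs)
print(sorted(toy_vectorizer.vocabulary_))
```

After refitting with the extended list, the artifact token no longer appears among the features, so it cannot show up as an important word.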
Suppose we want the top 25 feature importances instead. We need to know which argument of the function controls this, and we can find out by consulting the docstring of the function.
From this docstring we can see that num_features is a keyword argument that is set to 10 by default. We can pass num_features=25 instead to get the top 25 feature importances.
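As a rough sketch of how to inspect a docstring and a default value programmatically, the example below uses a hypothetical function `toy_top_features` (not part of this tutorial) together with the standard `inspect` module.

```python
import inspect

def toy_top_features(scores, num_features=10):
    """Return the num_features largest values in scores (default 10)."""
    return sorted(scores, reverse=True)[:num_features]

# Read the docstring and the default of the keyword argument
print(inspect.getdoc(toy_top_features))
default = inspect.signature(toy_top_features).parameters['num_features'].default
print(default)

# Calling with and without the keyword argument
print(len(toy_top_features(range(100))))
print(len(toy_top_features(range(100), num_features=25)))
```

In a notebook, `help(toy_top_features)` or `toy_top_features?` displays the same docstring interactively.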
In [ ]:
display_feature_importances(mdl.coef_.ravel(),
features,
['health','income'],
num_features=25)
Recall from the machine learning tutorial that we are seeking to find the most general pattern in the data in order to have the most general model that will be successful at classifying new, unseen data. Our strategy above was the out-of-sample holdout set. With this strategy we try to find a general pattern by randomly dividing our data into a training and a test set based on some percentage split (e.g., 50-50 or 80-20). We train on the training set and evaluate on the test set, where we pretend that we don't have the labels for the test set. A significant drawback with this approach is that we may be lucky or unlucky with our random split, so our estimate of how we'd perform on truly new data may be overly optimistic or overly pessimistic. A possible solution is to create many random splits into training and testing sets and evaluate each split to estimate the performance of a given model.
A more sophisticated holdout training and testing procedure is cross-validation. In cross-validation we split our data into k folds or partitions, where k is usually 5 or 10. We then iterate k times. In each iteration, one of the folds is used as the test set, and the rest of the folds are combined to form the training set. We can then evaluate the performance at each iteration to estimate the performance of a given method. An advantage of using cross-validation is that every example is used for testing exactly once and for training in the remaining k-1 iterations.
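To see how the folds partition the data, the sketch below counts how often each example lands in the test and training sets under 5-fold stratified cross-validation. The ten toy labels are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Ten toy examples, five per class, invented for illustration
toy_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
toy_docs = np.arange(len(toy_labels)).reshape(-1, 1)   # stand-ins for documents

kfold = StratifiedKFold(n_splits=5)
times_tested = np.zeros(len(toy_labels), dtype=int)
times_trained = np.zeros(len(toy_labels), dtype=int)
for train_idx, test_idx in kfold.split(toy_docs, toy_labels):
    times_trained[train_idx] += 1
    times_tested[test_idx] += 1

print(times_tested)    # how many times each example was in a test fold
print(times_trained)   # how many times each example was in a training set
```

Each example is tested once and trained on four times, so no data is wasted and no example is evaluated by a model that saw it during training.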
In [ ]:
def create_test_train_bag_of_words(train_corpus, test_corpus):
    """
    Create test and training set bag of words
    Parameters
    ----------
    train_corpus: ls
        ls of raw text for train corpus.
    test_corpus: ls
        ls of raw text for test corpus.
    Returns
    -------
    (train_tfidf,test_tfidf): scipy sparse matrix
        TFIDF-weighted bag-of-words representation of train and test corpus
    features: ls
        ls of words used as features.
    """
    #parameters for vectorizer
    ANALYZER = "word" #features are built from word tokens rather than characters
    STRIP_ACCENTS = 'unicode'
    TOKENIZER = None
    NGRAM_RANGE = (1,2) #use single words and two-word phrases (a minimum n of 0 would add empty tokens)
    MIN_DF = 0.01 # Exclude words that appear in less than this fraction of documents
    MAX_DF = 0.8 # Exclude words that appear in more than this fraction of documents
    vectorizer = CountVectorizer(analyzer=ANALYZER,
                                 tokenizer=TOKENIZER, # alternatively tokenize_and_stem but it will be slower
                                 ngram_range=NGRAM_RANGE,
                                 stop_words = stopwords.words('english'),
                                 strip_accents=STRIP_ACCENTS,
                                 min_df = MIN_DF,
                                 max_df = MAX_DF)
    NORM = None #no normalization of the TFIDF vectors ('l1' or 'l2' would turn it on)
    SMOOTH_IDF = True #prevents division by zero errors
    SUBLINEAR_TF = True #replace TF with 1 + log(TF)
    USE_IDF = True #flag to control whether to use TFIDF
    transformer = TfidfTransformer(norm = NORM,smooth_idf = SMOOTH_IDF,sublinear_tf = SUBLINEAR_TF)
    #get the bag-of-words from the vectorizer and
    #then use TFIDF to down-weight the tokens found throughout the text
    train_bag_of_words = vectorizer.fit_transform( train_corpus )
    test_bag_of_words = vectorizer.transform( test_corpus )
    if USE_IDF:
        train_tfidf = transformer.fit_transform(train_bag_of_words)
        test_tfidf = transformer.transform(test_bag_of_words)
    else:
        #fall back to the raw counts
        train_tfidf = train_bag_of_words
        test_tfidf = test_bag_of_words
    features = vectorizer.get_feature_names_out() #scikit-learn >= 1.0; older versions use get_feature_names()
    return train_tfidf, test_tfidf, features
In [ ]:
from sklearn.model_selection import StratifiedKFold #sklearn.cross_validation was removed in newer scikit-learn releases
train_labels_binary = le.transform(train_labels)
cv = StratifiedKFold(n_splits=3)
for i, (train,test) in enumerate(cv.split(train_corpus, train_labels_binary)):
    cv_train = train_corpus[train]
    cv_test = train_corpus[test]
    bag_of_words_train, bag_of_words_test, feature_names = create_test_train_bag_of_words(cv_train,
                                                                                          cv_test)
    probas_ = clf.fit(bag_of_words_train,
                      train_labels_binary[train]).predict_proba(bag_of_words_test)
    cv_test_labels = train_labels_binary[test]
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(cv_test_labels,
                                                                          probas_[:,1])
    auc_val = auc(recall_curve,precision_curve)
    plt.plot(recall_curve, precision_curve, label='AUC-PR {0} {1:.2f}'.format(i,auc_val))
plt.ylim(0,1.05)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc="lower left", fontsize='x-small')
The cell above ran 3-fold cross-validation and plotted a precision-recall curve for each iteration; the cell below repeats the procedure with 5 folds. You can see that there is a marked difference between the iterations. We can then average the AUC-PR of each iteration to estimate the performance of our method.
In [ ]:
from sklearn.model_selection import StratifiedKFold #sklearn.cross_validation was removed in newer scikit-learn releases
train_labels_binary = le.transform(train_labels)
cv = StratifiedKFold(n_splits=5)
for i, (train,test) in enumerate(cv.split(train_corpus, train_labels_binary)):
    cv_train = train_corpus[train]
    cv_test = train_corpus[test]
    bag_of_words_train, bag_of_words_test, feature_names = create_test_train_bag_of_words(cv_train,
                                                                                          cv_test)
    probas_ = clf.fit(bag_of_words_train,
                      train_labels_binary[train]).predict_proba(bag_of_words_test)
    cv_test_labels = train_labels_binary[test]
    precision_curve, recall_curve, pr_thresholds = precision_recall_curve(cv_test_labels,
                                                                          probas_[:,1])
    auc_val = auc(recall_curve,precision_curve)
    plt.plot(recall_curve, precision_curve, label='AUC-PR {0} {1:.2f}'.format(i,auc_val))
plt.ylim(0,1.05)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend(loc="lower left", fontsize='x-small')
In [ ]:
df_test
In [ ]:
num_comments = 2
label0_comment_idx = y_score[:,1].argsort()[:num_comments] #lowest scores: most confident label 0
label1_comment_idx = y_score[:,1].argsort()[-num_comments:] #highest scores: most confident label 1
test_set_labels = labels[test_set_idx]
#combine the indices (these index into the testing set)
top_comments_testing_set_idx = np.concatenate([label0_comment_idx,
                                               label1_comment_idx])
#these are the 2 documents per label that the model is most sure of
for i in top_comments_testing_set_idx:
    print(
        u"""{}:{}\n---\n{}\n===""".format(test_set_labels[i],
                                          y_score[i,1],
                                          test_corpus[i]))
These are the top 2 examples that the model is the most sure of for each label. We can see our important feature words in the descriptions, which gives a hint of how the model made these classifications.
A great resource for NLP in Python is Natural Language Processing with Python.