This notebook presents practical methods for learning from natural text. Using simple combinations of deep learning, classification, and regression, I demonstrate how to predict a blogger's gender and age with high accuracy based on his or her blog posts. More specifically, I create text features using the Word2Vec deep learning model implemented in the Gensim Python package, and then perform classification and regression using the machine learning toolkits in GraphLab Create.
The notebook is divided into several sections, each of which can be executed independently, so feel free to skip ahead if you are impatient.
Required Python packages: BeautifulSoup4, Gensim, NLTK, and GraphLab Create.
Let's start!
Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip. You can use easy_install, too.) Also, consider using virtualenv for a cleaner installation experience instead of sudo.
% sudo pip install --upgrade beautifulsoup4
% sudo pip install --upgrade gensim
% sudo pip install --upgrade nltk
% sudo pip install --upgrade graphlab-create
You will need a product key for GraphLab Create.
You'll also need to install additional data from nltk. This will only need to be done once.
In [ ]:
# Uncomment if this is your first time using nltk
#import nltk
#nltk.download()
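If you prefer a non-interactive setup, downloading just the NLTK resources used later in this notebook should be enough. This is a minimal sketch; the interactive downloader above works just as well:

import nltk

# Download only the resources this notebook relies on:
# the Punkt sentence tokenizer and the English stop word list.
nltk.download('punkt')
nltk.download('stopwords')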
First we need to download a relevant dataset. For this notebook, I chose to use The Blog Authorship Corpus, which contains 681,288 posts as well as general details about each blogger, such as age, gender, industry, and even astrological sign (Schler et al. 2006). After downloading, unzip the corpus into /home/graphlab_create/data/blogs/xml. (If you use a different directory, make sure to change the BASE_DIR variable value in the code below.)
Each blogger's blog posts are formatted as an XML file that looks like this:
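(The snippet below is a simplified illustration based on the tags the parser expects; actual files contain many posts, and the blogger's personal details are encoded in the file name itself, in the form <id>.<gender>.<age>.<industry>.<sign>.xml.)

<Blog>
<date>02,August,2004</date>
<post>
First post text ...
</post>
<date>05,August,2004</date>
<post>
Second post text ...
</post>
</Blog>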
Unfortunately, some of the XML files are malformed, so instead of using a regular XML DOM parser such as minidom, I used the more robust BeautifulSoup package. The following code creates and saves an SFrame object that contains all the blog post data, one row per blogger.
Note that this parses 19,320 files and can take some time. It will also generate a bunch of warning messages about URLs, which we are not showing here. Don't worry about those. Feel free to get a cup of coffee and come back in a few minutes.
In [1]:
import os
import graphlab as gl
from bs4 import BeautifulSoup
BASE_DIR = "/home/graphlab_create/data/blogs" # NOTE: Update BASE_DIR to your own directory path
class BlogData2SFrameParser(object):
    # Column name constants
    ID = "id"
    GENDER = "gender"
    AGE = "age"
    SIGN = "sign"
    POSTS = "posts"
    DATES = "dates"
    INDUSTRY = "industry"

    def __init__(self, xml_files_dir, sframe_outpath):
        """
        Parse all the blog post XML files in xml_files_dir and insert them into an SFrame object,
        which is then saved to sframe_outpath
        :param xml_files_dir: the directory which contains the XML files of The Blog Authorship Corpus
        :param sframe_outpath: the output path in which to save the SFrame
        """
        self._bloggers_data = []
        for p in os.listdir(xml_files_dir):
            if p.endswith(".xml"):
                # Parse each XML file and convert it to a dict
                self._bloggers_data.append(self.parse_blog_xml_to_dict("%s%s%s" % (xml_files_dir, os.path.sep, p)))
        print "Successfully parsed %s blogs" % len(self._bloggers_data)
        # self._bloggers_data is a list of dicts which we can easily load into an SFrame object. However, the dict
        # objects are loaded into a single column named X1. To create a separate column for each dict key we use
        # the unpack function.
        self._sf = gl.SFrame(self._bloggers_data).unpack('X1')
        # Use the rename function to remove the "X1." prefix from the column names, then save the SFrame for later use
        self._sf.rename({c: c.replace("X1.", "") for c in self._sf.column_names()})
        self._sf.save(sframe_outpath)

    def parse_blog_xml_to_dict(self, path):
        """
        Parse the blog posts in the input XML file and return a dict with the blogger's personal details and posts
        :param path: the path of the XML file
        :return: dict with the blogger's personal details and posts
        :rtype: dict
        """
        blogger_dict = {}
        # Extract the blogger's personal details from the file name
        blog_id, gender, age, industry, sign = path.split(os.path.sep)[-1].split(".xml")[0].split(".")
        blogger_dict[self.ID] = blog_id
        blogger_dict[self.GENDER] = gender
        blogger_dict[self.AGE] = int(age)
        blogger_dict[self.INDUSTRY] = industry
        blogger_dict[self.SIGN] = sign
        blogger_dict[self.POSTS] = []
        blogger_dict[self.DATES] = []
        # The XML files are not well formatted, so we need to do some hacks.
        s = file(path, "r").read().replace("&nbsp;", " ")
        # First, strip the <Blog> and </Blog> tags at the beginning and end of the document
        s = s.replace("<Blog>", "").replace("</Blog>", "").strip()
        # Now, split the document into individual blog posts by the <date> tag
        for e in s.split("<date>")[1:]:
            # Separate the date stamp from the rest of the post
            date_and_post = e.split("</date>")
            blogger_dict[self.DATES].append(date_and_post[0].strip())
            post = date_and_post[1].replace("<post>", "").replace("</post>", "").strip()
            post = BeautifulSoup(post).get_text()
            blogger_dict[self.POSTS].append(post)
        if len(blogger_dict[self.DATES]) != len(blogger_dict[self.POSTS]):
            raise Exception("Mismatch between the number of posts and the number of dates in file %s" % path)
        return blogger_dict

    @property
    def sframe(self):
        return self._sf
sframe_save_path = "%s/blogs.sframe" % BASE_DIR
b = BlogData2SFrameParser("%s/xml" % BASE_DIR, sframe_save_path)
sf = b.sframe
Gensim reads its training input from disk, so we'll need some glue code to get things into the right format. We will use the SFrame.apply() function to create a separate text file for each blogger's posts. These text files are then used to construct our Word2Vec model.
In [2]:
os.mkdir("%s/txt" % BASE_DIR)
sf.apply(lambda r: file("%s/txt/%s.txt" % (BASE_DIR, r["id"]),"w").write("\n".join(r['posts']))).__materialize__()
Note: There's a mysterious call to '__materialize__()' in the last code block. SFrame and SArray operations are lazily evaluated. This is an optimization that allows SFrame to chain expensive operations together and perform them as needed. A side effect of this behavior is that operations may not be performed at the moment you make the call. In our case, we want to write out the data right away, so we call '__materialize__()' to force the SFrame to stop being lazy and materialize the results now.
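As a minimal illustration of this behavior (the post_counts SArray is hypothetical and not used later in the notebook):

# Nothing is computed yet; the apply is only queued up
post_counts = sf['posts'].apply(lambda posts: len(posts))
# Force the computation to run right now
post_counts.__materialize__()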
Now that we have the data in an SFrame, you can call '.show()' to visualize it in GraphLab Canvas.
In [3]:
gl.canvas.set_target('ipynb')
sf.show()
Out[3]:
After loading the blogs data into an SFrame object, the next step is to use the blog post texts to train a Word2Vec model. Without getting into too much detail, Word2Vec learns the semantic relationships between words. For our purposes, it's okay to treat Word2Vec as a magical black box that takes a word as input and returns a vector of numbers that represents that word and its meaning.
In this notebook, I use the Gensim package written by Radim Řehůřek to train a Word2Vec model from the set of blogs. This section contains the code for constructing the Word2Vec model. You can follow along and train your own model, or download the trained model directly (file1, file2, file3) and skip to the next section on creating and evaluating machine learning classification and regression models.
First, I create a TrainSentences class that takes as input a directory of English text files and returns an iterator that splits the text into sentences, and each sentence into a list of words.
In [4]:
import os
import gensim
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
BASE_DIR = "/home/graphlab_create/data/blogs" # NOTE: Update BASE_DIR to your own directory path
class TrainSentences(object):
    """
    Iterator class that returns sentences from the text files in an input directory
    """
    RE_WHITE_SPACES = re.compile(r"\s+")
    STOP_WORDS = set(stopwords.words("english"))

    def __init__(self, dirname):
        """
        Initialize a TrainSentences object with an input directory that contains text files for training
        :param dirname: directory name which contains the text files
        """
        self.dirname = dirname

    def __iter__(self):
        """
        Sentences iterator that returns sentences parsed from files in the input directory.
        Each sentence is returned as a list of words.
        """
        # First iterate over all files in the input directory
        for fname in os.listdir(self.dirname):
            # Read the file line by line (without loading the entire file into memory)
            for line in file(os.path.join(self.dirname, fname), "rb"):
                # Split the line into sentences using NLTK
                for s in txt2sentences(line, is_html=True):
                    # Split the sentence into words using a regex
                    w = txt2words(s, lower=True, is_html=False, remove_stop_words=False,
                                  remove_none_english_chars=True)
                    # Skip short sentences with fewer than 3 words
                    if len(w) < 3:
                        continue
                    yield w


def txt2sentences(txt, is_html=False, remove_none_english_chars=True):
    """
    Split the English text into sentences using NLTK
    :param txt: input text
    :param is_html: if True then remove HTML tags using BeautifulSoup
    :param remove_none_english_chars: if True then remove non-English chars from the text
    :return: generator that yields the sentences of the original input text, one at a time
    :rtype: generator
    """
    if is_html:
        txt = BeautifulSoup(txt).get_text()
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    # Split the text into sentences using the NLTK tokenizer
    for s in tokenizer.tokenize(txt):
        if remove_none_english_chars:
            # Remove non-English chars
            s = re.sub("[^a-zA-Z]", " ", s)
        yield s


def txt2words(txt, lower=True, is_html=False, remove_none_english_chars=True, remove_stop_words=True):
    """
    Split text into a list of words
    :param txt: the input text
    :param lower: if True then convert the text to lowercase
    :param is_html: if True then remove HTML tags using BeautifulSoup
    :param remove_none_english_chars: if True then remove non-English chars from the text
    :param remove_stop_words: if True then remove stop words from the text
    :return: list of words created from the input text according to the input parameters
    :rtype: list
    """
    if is_html:
        txt = BeautifulSoup(txt).get_text()
    if lower:
        txt = txt.lower()
    if remove_none_english_chars:
        txt = re.sub("[^a-zA-Z]", " ", txt)
    words = TrainSentences.RE_WHITE_SPACES.split(txt.strip())
    if remove_stop_words:
        # Remove stop words from the text
        words = [w for w in words if w not in TrainSentences.STOP_WORDS]
    return words
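As a quick sanity check of the preprocessing, here is what the helper functions above should produce on a small made-up snippet (the expected output is shown as a comment):

print txt2words("Hello, World! Visit <b>my blog</b>.", is_html=True)
# ['hello', 'world', 'visit', 'blog']   ('my' is removed as a stop word)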
Now I create a 'sentences' object and train the Word2Vec model using Gensim. (This will take a while: about 40 minutes on my 8-core desktop with 24GB of RAM. It also generates a lot of output, of which we only show an excerpt here.) Feel free to go and grab lunch.
In [5]:
sentences = TrainSentences("%s/txt" % BASE_DIR)
model = gensim.models.Word2Vec(sentences, size=300, workers=8, min_count=40)
model.save("%s/blog_posts_300_c_40.word2vec" % BASE_DIR)
Using mostly the default Word2Vec parameters, I construct a Word2Vec model that maps a word to a vector of size 300. If you are interested in constructing Word2Vec models faster or with more complex criteria, you can read more about it here: Deep learning with word2vec and gensim, Deep learning with word2vec, and Bag of words meets bags of popcorn.
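At this point a single word can be looked up directly in the model. A small sketch (assuming "coffee" occurs at least min_count=40 times in the corpus and therefore made it into the vocabulary):

print "coffee" in model   # True if the word is in the model's vocabulary
vec = model["coffee"]     # a 300-dimensional numpy array representing the word
print vec.shape           # (300,)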
Let's see if our model works by checking which words the model considers most similar to "lol" and "gemini." Notice that in our pre-processing steps, all words were converted to lowercase.
In [6]:
model.most_similar("lol")
Out[6]:
In [7]:
model.most_similar("gemini")
Out[7]:
Seems like the model works! The training process also saved the Word2Vec model on disk for later use. Now we are ready to use the trained model to analyze each blogger's posts and predict the blogger's attributes.
Now we are ready to try classification. We'll build a few classifiers to predict both the gender and the age category of the blogger based on his or her posts. Using the output vectors of Word2Vec as input features, the classifier can get much better results than using a bag-of-words model.
If you skipped the previous section, you can download the trained Word2Vec model from here: file1, file2, file3, and load it using the following code.
In [8]:
import gensim
BASE_DIR = "/home/graphlab_create/data/blogs" # NOTE: Update BASE_DIR to your own directory path
model_download_path = "%s/blog_posts_300_c_40.word2vec" % BASE_DIR
model = gensim.models.Word2Vec.load(model_download_path)
If you skipped the previous sections, you'll also need to load the blog data as an SFrame object:
In [9]:
import graphlab as gl
sframe_save_path = "%s/blogs.sframe" % BASE_DIR
sf = gl.load_sframe(sframe_save_path)
print sf.num_rows()
Before we can build the classifiers, we need to generate the necessary features. We'll benchmark the Word2Vec features against a simple alternative, n-gram features, which are also used in the sentiment analysis notebook.
In [10]:
# first we join the posts list to a single string
sf['posts'] = sf['posts'].apply(lambda posts:"\n".join(posts))
# Construct Bag-of-Words model and evaluate it
sf['1gram features'] = gl.text_analytics.count_ngrams(sf['posts'], 1)
sf['2gram features'] = gl.text_analytics.count_ngrams(sf['posts'], 2)
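Each n-gram column holds a sparse dictionary per blogger, mapping an n-gram to the number of times it appears in that blogger's posts. You can peek at one row to see the representation the classifier will consume (output not shown here):

# Inspect the 1-gram dictionary of the first blogger
print sf['1gram features'][0]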
Generating the Word2Vec average vectors requires four steps: split each blogger's posts into words, keep only the words that appear in the Word2Vec model's vocabulary, look up the vector for each remaining word, and average those vectors into a single 300-dimensional vector per blogger.
Below, we define the DeepTextAnalyzer object that converts text into its corresponding average vector representation.
In [11]:
from numpy import average
import graphlab as gl
import numpy as np
import gensim
class DeepTextAnalyzer(object):
    def __init__(self, word2vec_model):
        """
        Construct a DeepTextAnalyzer using the input Word2Vec model
        :param word2vec_model: a trained Word2Vec model
        """
        self._model = word2vec_model

    def txt2vectors(self, txt, is_html):
        """
        Convert the input text into an iterator that returns the corresponding vector representation of each
        word in the text, if it exists in the Word2Vec model
        :param txt: input text
        :param is_html: if True, then extract the text from the input HTML
        :return: iterator of vectors created from the words in the text using the Word2Vec model
        """
        words = txt2words(txt, is_html=is_html, lower=True, remove_none_english_chars=True)
        words = [w for w in words if w in self._model]
        for w in words:
            yield self._model[w]

    def txt2avg_vector(self, txt, is_html):
        """
        Calculate the average vector representation of the input text
        :param txt: input text
        :param is_html: if True, then extract the text from the input HTML
        :return: the average of the vector representations of the words in the text, or None if none of the
                 words appear in the Word2Vec model
        """
        vectors = self.txt2vectors(txt, is_html=is_html)
        vectors_sum = next(vectors, None)
        if vectors_sum is None:
            return None
        count = 1.0
        for v in vectors:
            count += 1
            vectors_sum = np.add(vectors_sum, v)
        # Calculate the average vector and replace +inf and -inf with finite numeric values
        avg_vector = np.nan_to_num(vectors_sum / count)
        return avg_vector
Using the DeepTextAnalyzer, we can calculate each blogger's average vector.
In [12]:
dt = DeepTextAnalyzer(model)
sf['vectors'] = sf['posts'].apply(lambda p: dt.txt2avg_vector(p, is_html=True))
sf['vectors'].head(1)
Out[12]:
Let's remove all rows with missing values from the SFrame and check the resulting columns.
In [13]:
sf = sf.dropna()
print sf.column_names()
Randomly split the data into a training set and a test set, then train a classifier on the training set and evaluate it on the test set. SFrame operations are lazily evaluated, so the lambda apply and dropna operations that we called above are actually queued up. The random split forces the SFrame to materialize the results, so this operation will take a while, because only now is it computing the vector representation of each blogger's posts.
In [14]:
train_set, test_set = sf.random_split(0.8, seed=5)
Okay, feature engineering is done! Let's see how well the single-word counts (unigrams) work in predicting the blogger's gender.
In [15]:
cls = gl.classifier.create(train_set, target='gender', features=['1gram features'])
baseline_result = cls.evaluate(test_set)
print baseline_result
That got us an accuracy of about 0.74. For kicks, we'll throw in the 2-gram features as well. This blows up the feature space to more than 13 million sparse features and makes training much more expensive. (My desktop was unresponsive for a few minutes because the training took up all available memory.) The results are less than overwhelming. Granted, the classifier hasn't been properly tuned, and performance should improve with better hyperparameter settings, but tuning itself is a computationally intensive process. Also, as we'll see, the Word2Vec features provide clear gains at a fraction of the computational cost.
In [16]:
cls2 = gl.classifier.create(train_set, target='gender',features=['2gram features', '1gram features'] )
ngram_result = cls2.evaluate(test_set)
print ngram_result
The bag-of-words classifiers achieved an accuracy of roughly 0.74-0.75. Let's see what Word2Vec features can do.
In [17]:
cls3 = gl.classifier.create(train_set, target='gender',features=['vectors'])
word2vec_result = cls3.evaluate(test_set)
print word2vec_result
And we got an accuracy of ~0.79. This accuracy for gender prediction using only the author's blog posts is probably better than the results obtained by Schler et al. 2006. (To actually prove that these results are better, we would need to add more tests, but that is a topic for another notebook.) Let's try to improve the results by training the classifier with some additional features: the blogger's industry and age.
In [18]:
cls4 = gl.classifier.create(train_set, target='gender', features=['vectors', 'industry', 'age'] )
word2vec_industry_age_result = cls4.evaluate(test_set)
print word2vec_industry_age_result
We got slightly better results with the additional features. We can use the SFrame to easily engineer additional features, such as the number of URLs in the posts or the total length of the blogger's posts.
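For example, a URL-count feature could be added with one more apply call. This is a rough sketch; the 'url_count' column is hypothetical and not used in the cells below, which instead add the total length of the posts:

# Count a crude proxy for the number of URLs in each blogger's posts
train_set['url_count'] = train_set['posts'].apply(lambda p: p.count("http://"))
test_set['url_count'] = test_set['posts'].apply(lambda p: p.count("http://"))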
In [19]:
train_set['posts_length'] = train_set['posts'].apply(lambda p: len(p))
test_set['posts_length'] = test_set['posts'].apply(lambda p: len(p))
cls5 = gl.classifier.create(train_set, target='gender',features=['vectors', 'industry', 'age', 'posts_length'] )
word2vec_industry_age_posts_length_result = cls5.evaluate(test_set)
print word2vec_industry_age_posts_length_result
We get pretty much the same results. Let us move on to predicting the blogger's age.
Let's first look at some basic statistics of the bloggers' age using the show() function.
In [20]:
sf.show(['age'])
Out[20]:
Notice that the bloggers are between 13 and 48 years old, with an average age of about 23. To predict a blogger's age we can use both regression and classification: a regression model can estimate the actual age, whereas a classifier can predict the blogger's age group. Let's start by constructing a regression model.
Constructing a regression model using the regression toolkit is really straightforward. We can try both linear regression and boosted trees regression, using only the Word2Vec features.
In [21]:
linear_model = gl.linear_regression.create(train_set, target='age',features=['vectors'])
linear_model.evaluate(test_set)
Out[21]:
In [22]:
boosted_tree_model = gl.boosted_trees_regression.create(train_set, target='age',features=['vectors'])
boosted_tree_model.evaluate(test_set)
Out[22]:
RMSE (root mean squared error) measures how far, in years, the estimates are from the real ages, with larger errors penalized more heavily. The linear regression model performed better in terms of RMSE than the boosted trees model. It is important to note that some of these blogs were written over a period of years, which makes it hard to predict the blogger's exact age. It may be better to instead predict the blogger's age group using a classification model.
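To make the metric concrete, here is a minimal sketch of how the reported RMSE could be recomputed by hand from the model's predictions, using the linear_model object trained above:

import math
# Predict ages for the test set and measure the root mean squared error ourselves
predictions = linear_model.predict(test_set)
errors = predictions - test_set['age']
rmse = math.sqrt((errors * errors).mean())
print rmse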
Similar to Schler et al. 2006, we divide bloggers in our dataset into one of three age categories: 10s (13-17), 20s (23-27), and 30s (33-42).
We remove from the SFrame the bloggers that fall outside of these age categories.
In [23]:
valid_age = range(13,18) + range(23,28) + range(33,43)
sf_age_categories = sf.filter_by(valid_age, 'age')
In [24]:
def get_age_category(age):
    if 13 <= age <= 17:
        return "10s"
    elif 23 <= age <= 27:
        return "20s"
    elif 33 <= age <= 42:
        return "30s"
    return None

sf['age_category'] = sf['age'].apply(lambda age: get_age_category(age))
sf_age_categories = sf.dropna()  # remove bloggers without an age category
print sf_age_categories.num_rows()
Now let's construct and evaluate an age-category classification model:
In [25]:
train_set2, test_set2 = sf_age_categories.random_split(0.8, seed=5)
cls = gl.classifier.create(train_set2, target='age_category', features=['vectors'])
age_category_result = cls.evaluate(test_set2)
print age_category_result
This shows that the features derived from Word2Vec can be used to predict the age category of a blogger with an accuracy of about 0.71.
In this notebook, we demonstrated that deep learning can generate useful features for predicting the gender and age of a blogger based on his or her blog post content. If you want to continue exploring this dataset yourself, there is a lot more that can be done. You can try to predict a blogger's astrological sign from his or her blogs. (I didn't succeed in building a prediction model that works better than random prediction, but you might!) You can try to predict a blogger's professional industry. You can also train the Word2Vec model on a different text corpus, such as Wikipedia, and see if that gives you better results. We hope that the methods and code presented in this notebook can help you solve other text analysis tasks.