Practical Deep Text Learning

This notebook presents practical methods for learning from natural text. Using simple combinations of deep learning, classification, and regression, I demonstrate how to predict a blogger's gender and age with high accuracy based on his or her blog posts. More specifically, I create text features using the Word2Vec deep learning model implemented in the Gensim Python package, and then perform classification and regression using the machine learning toolkits in GraphLab Create.

The notebook is divided into the following sections:

  0. Setup
  1. Preparing the Dataset
  2. Training a Word2Vec Model
  3. Creating & Evaluating Classifiers
  4. Where to Go From Here
  5. Further Reading

Each section can be executed independently. So feel free to skip ahead if you are impatient.

Required Python Packages:

  - BeautifulSoup (beautifulsoup4)
  - Gensim
  - NLTK
  - GraphLab Create

Let's start!

0. Setup

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip. You can use easy_install, too.) Also, consider using virtualenv for a cleaner installation experience instead of sudo.

% sudo pip install --upgrade beautifulsoup4
% sudo pip install --upgrade gensim
% sudo pip install --upgrade nltk
% sudo pip install --upgrade graphlab-create

You will need a product key for GraphLab Create.

You'll also need to install additional data from nltk. This will only need to be done once.


In [ ]:
# Uncomment if this is your first time using nltk
#import nltk
#nltk.download()
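
If you prefer not to open the interactive downloader, you can instead fetch just the resources this notebook uses (the Punkt sentence tokenizer and the English stop word list):

import nltk
nltk.download("punkt")      # sentence tokenizer used by txt2sentences below
nltk.download("stopwords")  # stop word list used by txt2words below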

1. Preparing the Dataset

First we need to download a relevant dataset. For this notebook, I chose the Blog Authorship Corpus, which contains 681,288 posts as well as general details about each blogger, such as age, gender, industry, and even astrological sign (Schler et al. 2006). After downloading, unzip the corpus into /home/graphlab_create/data/blogs/xml. (If you use a different directory, make sure to change the BASE_DIR variable value in the code below.)

Each blogger's blog posts are formatted as an XML file that looks like this:

<Blog>
  <date>Date1</date>
  <post>Blog Post Text1</post>
  <date>Date2</date>
  <post>Blog Post Text2</post>
  ...
</Blog>

Unfortunately, some of the XML files are malformed. So instead of using regular XML DOM Parsers such as minidom, I used the more robust BeautifulSoup package. The following code creates and saves an SFrame object that contains all the blog post data, one row per blogger.

Note that this parses 19,320 files and can take some time. It will also generate a bunch of warning messages about URLs, which we are not showing here. Don't worry about those. Feel free to get a cup of coffee and come back in a few minutes.


In [1]:
import os
import graphlab as gl
from bs4 import  BeautifulSoup

BASE_DIR = "/home/graphlab_create/data/blogs" # NOTE: Update BASE_DIR to your own directory path
class BlogData2SFrameParser(object):
    #Some constants
    ID = "id"
    GENDER = "gender"
    AGE = "age"
    SIGN = "sign"
    POSTS = "posts"
    DATES = "dates"
    INDUSTRY = "industry"

    def __init__(self, xml_files_dir, sframe_outpath):
        """
        Parse all the blog posts XML files in the xml_files_dir and insert them into an SFrame object,
        which is later saved to `sframe_outpath`
        :param xml_files_dir: the directory which contains XML files of the The Blog Authorship Corpus
        :param sframe_outpath: the out path to save the SFrame.
        """
        self._bloggers_data = []


        for p in os.listdir(xml_files_dir):
            if p.endswith(".xml"):
                #We parse each XML file and convert it to a dict
                self._bloggers_data.append(self.parse_blog_xml_to_dict("%s%s%s" % (xml_files_dir, os.path.sep, p)))
        print "Successfully parsed %s blogs" % len(self._bloggers_data)

        # self._bloggers_data is a list of dicts which we can easily load into an SFrame object. However, the dicts
        # are loaded into a single column named X1. To create a separate column for each dict key we use the unpack function.        
        self._sf = gl.SFrame(self._bloggers_data).unpack('X1')

        #Now we can use the rename function in order to remove the X1. prefix from the column names and save the SFrame for later use
        self._sf.rename({c:c.replace("X1.", "") for c in self._sf.column_names()} )        
        self._sf.save(sframe_outpath)


    def parse_blog_xml_to_dict(self, path):
        """
        Parse the blog posts in the input XML file and return a dict with the blogger's personal information and posts
        :param path: the path of the xml file
        :return: dict with the blogger's personal details and posts
        :rtype: dict
        """
        blogger_dict = {}
        #Extract the blogger personal details from the file name
        blog_id,gender,age,industry, sign = path.split(os.path.sep)[-1].split(".xml")[0].split(".")
        blogger_dict[self.ID] = blog_id
        blogger_dict[self.GENDER] = gender
        blogger_dict[self.AGE] = int(age)
        blogger_dict[self.INDUSTRY] = industry
        blogger_dict[self.SIGN] = sign
        blogger_dict[self.POSTS] = []
        blogger_dict[self.DATES] = []

        #The XML files are not well formatted, so we need to do some hacks.
        s = file(path,"r").read().replace(" ", " ")

        # First, strip the <Blog> and </Blog> tags at the beginning and end of the document
        s = s.replace("<Blog>", "").replace("</Blog>", "").strip()

        # Now, split the document into individual blog posts by the <date> tag
        for e in s.split("<date>")[1:]:
            # Separate the date stamp from the rest of the post
            date_and_post = e.split("</date>")
            blogger_dict[self.DATES].append(date_and_post[0].strip())
            post = date_and_post[1].replace("<post>","").replace("</post>","").strip()
            post = BeautifulSoup(post).get_text()
            blogger_dict[self.POSTS].append(post)


        if len(blogger_dict[self.DATES]) != len(blogger_dict[self.POSTS]):
            raise Exception("Warning: Mismatch between the number of posts and the number of dates in file %s" % path)

        return blogger_dict
    @property
    def sframe(self):
        return self._sf

sframe_save_path = "%s/blogs.sframe" % BASE_DIR
b = BlogData2SFrameParser("%s/xml" % BASE_DIR, sframe_save_path)
sf = b.sframe

Gensim reads its training input from disk, so we'll need some glue code to get things into the right format. We will use the SFrame.apply() function to create a separate text file for each blogger's posts. The created text files are then used to construct our Word2Vec model.


In [2]:
os.mkdir("%s/txt" % BASE_DIR)
sf.apply(lambda r: file("%s/txt/%s.txt" % (BASE_DIR, r["id"]),"w").write("\n".join(r['posts']))).__materialize__()

Note: There's a mysterious call to '__materialize__()' in the last code block. SFrame and SArray operations are lazily evaluated. This is an optimization that allows SFrame to chain expensive operations together and perform them as needed. A side effect of this behavior is that an operation may not be performed at the moment you make the call. In our case, we want to write out the data right away, so we call '__materialize__()' to force the SFrame to stop being lazy and materialize the results now.
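
As a small illustration of this laziness (using the SFrame from above), the apply call below returns almost immediately because evaluation is deferred, and '__materialize__()' forces the computation to actually run:

# The lambda below is only recorded here, not executed yet
post_counts = sf['posts'].apply(lambda posts: len(posts))
# Force the deferred computation to run now
post_counts.__materialize__()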

Now that we have the data in an SFrame, you can call '.show()' to visualize it in GraphLab Canvas.


In [3]:
gl.canvas.set_target('ipynb')
sf.show()


Out[3]:

2. Training a Word2Vec Model

After loading the blog data into an SFrame object, the next step is to use the blog post texts to train a Word2Vec model. Without getting into too much detail, Word2Vec learns the semantic relationships between words. For our purposes, it's okay to treat Word2Vec as a magical black box that takes a word as input and returns a vector of numbers that represents the word and its meaning.

In this notebook, I use the Gensim package written by Radim Řehůřek to train a Word2Vec model from the set of blogs. This section contains the code for constructing Word2Vec. You can follow along and train your own model or just directly download the trained model (file1, file2, file3) and skip to the next section on creating & evaluating Machine Learning classification and regression models.

First, I create a TrainSentences class that takes as input a directory of English text files and returns an iterator over the files' sentences, each split into a list of words.


In [4]:
import os
import gensim
import re
import nltk
from nltk.corpus import stopwords
from bs4 import BeautifulSoup

BASE_DIR = "/home/graphlab_create/data/blogs" # NOTE: Update BASE_DIR to your own directory path
class TrainSentences(object):
    """
    Iterator class that yields sentences from the text files in an input directory
    """
    RE_WHITE_SPACES = re.compile("\s+")
    STOP_WORDS = set(stopwords.words("english"))
    def __init__(self, dirname):
        """
        Initialize a TrainSentences object with an input directory that contains the text files for training
        :param dirname: directory name which contains the text files        
        """
        self.dirname = dirname

    def __iter__(self):
        """
        Sentence iterator that returns sentences parsed from the files in the input directory.
        Each sentence is returned as a list of words
        """
        #First iterate over all the files in the input directory
        for fname in os.listdir(self.dirname):
            # read the file line by line (without loading the entire file into memory)
            for line in file(os.path.join(self.dirname, fname), "rb"):
                # split the read line into sentences using NLTK
                for s in txt2sentences(line, is_html=True):
                    # split the sentence into words using regex
                    w =txt2words(s, lower=True, is_html=False, remove_stop_words=False,
                                                 remove_none_english_chars=True)
                    #skip short sentences with fewer than 3 words
                    if len(w) < 3:
                        continue
                    yield w

def txt2sentences(txt, is_html=False, remove_none_english_chars=True):
    """
    Split the English text into sentences using NLTK
    :param txt: input text.
    :param is_html: if True then remove HTML tags using BeautifulSoup
    :param remove_none_english_chars: if True then remove non-English chars from the text
    :return: generator that yields the sentences of the original input text, one at a time.
    :rtype: generator
    """
    if is_html:
        txt = BeautifulSoup(txt).get_text()
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

    # split text into sentences using nltk packages
    for s in tokenizer.tokenize(txt):
        if remove_none_english_chars:
            #remove non-English chars
            s = re.sub("[^a-zA-Z]", " ", s)
        yield s
    
def txt2words(txt, lower=True, is_html=False, remove_none_english_chars=True, remove_stop_words=True):
    """
    Split text into words list
    :param txt: the input text
    :param lower: if True then convert the text to lowercase.
    :param is_html: if True then remove HTML tags using BeautifulSoup
    :param remove_none_english_chars: if True then remove non-English chars from the text
    :param remove_stop_words: if True then remove stop words from the text
    :return: list of words created from the input text according to the input parameters.
    :rtype: list
    """
    if is_html:
        txt = BeautifulSoup(txt).get_text()
    if lower:
        txt = txt.lower()
    if remove_none_english_chars:
        txt = re.sub("[^a-zA-Z]", " ", txt)

    words = TrainSentences.RE_WHITE_SPACES.split(txt.strip())
    if remove_stop_words:
        #remove stop words from text
        words = [w for w in words if w not in TrainSentences.STOP_WORDS]
    return words
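
As a quick sanity check of these helper functions, here is a small, made-up example (assuming the NLTK stop word list has been downloaded):

print txt2words("Hello <b>World</b>! This is a TEST sentence.", is_html=True)
# ['hello', 'world', 'test', 'sentence']  -- HTML stripped, lowercased, non-letters and stop words removed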

Now I create a 'sentences' object and train the Word2Vec model using Gensim. (This will take a while--about 40 minutes on my 8-core desktop with 24GB of RAM. It also generates a lot of output, of which we are only showing an excerpt here.) Feel free to go and grab lunch.


In [5]:
sentences = TrainSentences("%s/txt" % BASE_DIR)
model = gensim.models.Word2Vec(sentences, size=300, workers=8, min_count=40)
model.save("%s/blog_posts_300_c_40.word2vec" % BASE_DIR)


[INFO] collecting all words and their counts
[INFO] training on 136804432 words took 2355.8s, 58070 words/s
[INFO] saving Word2Vec object under /home/graphlab_create/data/blogs/blog_posts_300_c_40.word2vec, separately None
[INFO] not storing attribute syn0norm
[INFO] storing numpy array 'syn0' to /home/graphlab_create/data/blogs/blog_posts_300_c_40.word2vec.syn0.npy
[INFO] storing numpy array 'syn1' to /home/graphlab_create/data/blogs/blog_posts_300_c_40.word2vec.syn1.npy

Using mostly the default Word2Vec parameters, I construct a Word2Vec model that maps each word to a vector of size 300. If you are interested in constructing Word2Vec models faster or with more complex criteria, you can read more about it here: Deep learning with word2vec and gensim, Deep learning with word2vec, and Bag of words meets bags of popcorn.
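
For example, a word that passed the min_count threshold can be looked up directly to get its 300-dimensional vector (a quick check, assuming "coffee" is in the model's vocabulary):

v = model["coffee"]   # the vector representation of the word "coffee"
print len(v)          # 300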

Let's see if our model works. Let's check which words the model considers most similar to "lol" and "gemini." Notice that in our pre-processing steps, all words were converted to lowercase.


In [6]:
model.most_similar("lol")


[INFO] precomputing L2-norms of word weight vectors
Out[6]:
[(u'haha', 0.7674974203109741),
 (u'lildevil', 0.720061182975769),
 (u'dynamitedg', 0.7186048030853271),
 (u'hahaha', 0.7165870666503906),
 (u'yea', 0.706267774105072),
 (u'fabityfabfab', 0.692556619644165),
 (u'hellokittylovzme', 0.6910575032234192),
 (u'shevy', 0.6903275847434998),
 (u'djkthegr', 0.6773291826248169),
 (u'sehne', 0.6644361615180969)]

In [7]:
model.most_similar("gemini")


Out[7]:
[(u'libra', 0.8335538506507874),
 (u'capricorn', 0.8181889057159424),
 (u'aquarius', 0.8059983849525452),
 (u'aries', 0.7940620183944702),
 (u'virgo', 0.7938823103904724),
 (u'scorpio', 0.7834067344665527),
 (u'pisces', 0.7768857479095459),
 (u'sagittarius', 0.7740199565887451),
 (u'taurus', 0.7478488683700562),
 (u'zodiac', 0.723747730255127)]

Seems like the model works! The training process also saved the Word2Vec model on disk for later use. Now we are ready to use the trained model to analyze each blogger's posts and predict the blogger's attributes.

3. Creating & Evaluating Classifiers

Now we are ready to try classification. We'll build a few classifiers to predict both the gender and the age category of the blogger based on his or her posts. Using the output vectors of Word2Vec as input features, the classifier can get much better results than using a bag-of-words model.

For those of you who skipped the previous section, you can download the trained Word2Vec model from here: file1, file2, file3, and load it using the following code.


In [8]:
import gensim
BASE_DIR = "/home/graphlab_create/data/blogs" # NOTE: Update BASE_DIR to your own directory path
model_download_path = "%s/blog_posts_300_c_40.word2vec" % BASE_DIR
model = gensim.models.Word2Vec.load(model_download_path)


[INFO] loading Word2Vec object from /home/graphlab_create/data/blogs/blog_posts_300_c_40.word2vec
[INFO] loading syn0 from /home/graphlab_create/data/blogs/blog_posts_300_c_40.word2vec.syn0.npy with mmap=None
[INFO] loading syn1 from /home/graphlab_create/data/blogs/blog_posts_300_c_40.word2vec.syn1.npy with mmap=None
[INFO] setting ignored attribute syn0norm to None

If you skipped the previous sections, you'll also need to load the blog data as an SFrame object:


In [9]:
import graphlab as gl
sframe_save_path = "%s/blogs.sframe" % BASE_DIR
sf = gl.load_sframe(sframe_save_path)
print sf.num_rows()


19320

3.1 Feature engineering

Before we can build the classifiers, we need to generate the necessary features. We'll benchmark the Word2Vec features against a simple alternative, n-gram features, which are also used in the sentiment analysis notebook.


In [10]:
# first we join the posts list to a single string
sf['posts'] = sf['posts'].apply(lambda posts:"\n".join(posts)) 

# Construct Bag-of-Words model and evaluate it
sf['1gram features'] = gl.text_analytics.count_ngrams(sf['posts'], 1)
sf['2gram features'] = gl.text_analytics.count_ngrams(sf['posts'], 2)

Generating the Word2Vec average vectors requires four steps:

  1. Turn each blogger's posts into a list of words.
  2. Use the Word2Vec model to map each word into its corresponding vector. (We'll do this only for words that are included in the Word2Vec model.)
  3. Calculate the average of all the word vectors.
  4. Lastly, use the calculated average vector as input to our classification algorithms.

Below, we define the DeepTextAnalyzer object that converts text into its corresponding average vector representation.


In [11]:
from numpy import average
import graphlab as gl
import numpy as np
import gensim

class DeepTextAnalyzer(object):
    def __init__(self, word2vec_model):
        """
        Construct a DeepTextAnalyzer using the input Word2Vec model
        :param word2vec_model: a trained Word2Vec model
        """
        self._model = word2vec_model

    def txt2vectors(self,txt, is_html):
        """
        Convert input text into an iterator that returns the corresponding vector representation of each
        word in the text, if it exists in the Word2Vec model
        :param txt: input text
        :param is_html: if True, then extract the text from the input HTML
        :return: iterator of vectors created from the words in the text using the Word2Vec model.
        """
        words = txt2words(txt,is_html=is_html, lower=True, remove_none_english_chars=True)
        words = [w for w in words if w in self._model]
        if len(words) != 0:
            for w in words:
                yield self._model[w]


    def txt2avg_vector(self, txt, is_html):
        """
        Calculate the average vector representation of the input text
        :param txt: input text
        :param is_html: if True, the input text is HTML
        :return: the average of the vector representations of the words in the text
        """
        vectors = self.txt2vectors(txt,is_html=is_html)
        vectors_sum = next(vectors, None)
        if vectors_sum is None:
            return None
        count =1.0
        for v in vectors:
            count += 1
            vectors_sum = np.add(vectors_sum,v)
        
        #calculate the average vector and replace +inf and -inf with finite numeric values 
        avg_vector = np.nan_to_num(vectors_sum/count)
        return avg_vector

Using the DeepTextAnalyzer, we can calculate each blogger's average vector.


In [12]:
dt = DeepTextAnalyzer(model)
sf['vectors'] = sf['posts'].apply(lambda p: dt.txt2avg_vector(p, is_html=True))
sf['vectors'].head(1)


Out[12]:
dtype: array
Rows: 1
[array('d', [0.012604535557329655, 0.054879602044820786, 0.0904444083571434, 0.012575246393680573, -0.06448926031589508, 0.03523287549614906, -0.0017349873669445515, 0.037302080541849136, 0.019732296466827393, -0.0028380516450852156, -0.03617171570658684, -0.030211569741368294, 0.059265512973070145, -0.016376517713069916, -0.02428954467177391, 0.016346734017133713, -0.0006845878669992089, 0.029808631166815758, -0.001393423997797072, 0.01046839077025652, -0.0536954440176487, 0.020142750814557076, -0.032764650881290436, -0.0060912203043699265, -0.0073578255251049995, 0.014595184475183487, 0.0012583564966917038, -0.0525243915617466, -0.01746908202767372, 0.01896553672850132, 0.04609547555446625, -0.06341788172721863, -0.016633527353405952, 0.041018836200237274, -0.0489996001124382, -0.004209735430777073, -0.013284904882311821, 0.04584050551056862, -0.024157751351594925, -0.015318844467401505, -0.03572612628340721, 0.010020504705607891, 0.01933750882744789, 0.0471440814435482, -0.038565292954444885, -0.004639579448848963, 0.025136299431324005, 0.017895253375172615, -0.0074853147380054, -0.05658518150448799, 0.019308101385831833, -0.04762481153011322, 0.015183519572019577, 0.04126725718379021, 0.012069158256053925, 0.04616081342101097, -0.04652436450123787, -0.0350184291601181, 0.046059202402830124, 0.010615011677145958, -0.02820979431271553, 0.017745980992913246, -0.05776236206293106, 0.018051480874419212, -0.02307523787021637, 0.025938035920262337, -0.001170123927295208, 0.021539734676480293, 0.04912836104631424, 0.00330935581587255, -0.04921851307153702, 0.031874556094408035, 0.007241516839712858, -0.06860924512147903, 0.012496729381382465, 0.02692459151148796, 0.002778894267976284, -0.024367745965719223, -0.029117828235030174, 0.0026462916284799576, 0.0013245047302916646, -0.007550460752099752, 0.031065719202160835, 0.008026349358260632, -0.043550144881010056, -0.027642332017421722, 0.058655980974435806, 0.019528070464730263, 0.028870102018117905, 0.043329592794179916, 0.016563354060053825, 0.030260786414146423, -0.00636585196480155, 0.013522556982934475, 0.01302525494247675, -0.04023490101099014, -0.019169269129633904, 0.006410487927496433, -0.027096742764115334, 0.00826142355799675, -0.0011532100616022944, 0.012507018633186817, -0.0212711114436388, -0.08576739579439163, -0.023442767560482025, 0.04792207106947899, -0.011149311438202858, -0.04573160782456398, -0.019624322652816772, -0.05888773500919342, -0.018478894606232643, -0.007574104238301516, -0.02120867744088173, -0.012790458276867867, 0.04612373560667038, 0.0035052138846367598, 0.05190417915582657, 0.004082612693309784, 0.0070382775738835335, -0.012935448437929153, 0.02881212718784809, 0.03776417300105095, -0.05116482079029083, -0.07703803479671478, -0.015154514461755753, -0.01737312227487564, -0.012305397540330887, 0.0259209256619215, -0.059005577117204666, -0.002980702556669712, -0.0011761231580749154, -0.001604782184585929, 0.051239192485809326, -0.036717817187309265, 0.04146266728639603, 0.008646705187857151, -0.013771510683000088, -0.0031840838491916656, -0.022171122953295708, 0.012290295213460922, -0.02277674525976181, -0.0016936497995629907, 0.016646938398480415, 0.015758031979203224, -0.009412403218448162, 0.01396714523434639, -0.03505697101354599, -0.021226583048701286, 0.020337166264653206, -0.016492506489157677, -0.013801313005387783, 0.04836142808198929, -0.01688632182776928, -0.014864296652376652, 0.027103174477815628, 0.000841900531668216, 0.011327732354402542, 0.018413545563817024, 0.059501659125089645, 
0.027884572744369507, 0.02568002976477146, 0.0023337737657129765, -0.005481530912220478, -0.03638114780187607, -0.036091599613428116, 0.051848117262125015, 0.016824305057525635, 0.10352840274572372, 0.009514165110886097, 0.016132302582263947, -0.05149019882082939, 0.008226466365158558, 0.04827508330345154, 0.06647384166717529, 0.00275649712421, 0.005567316431552172, 0.0016942773945629597, -0.004413999617099762, 0.016311438754200935, -0.030939340591430664, -0.0005547385080717504, -0.01593383215367794, 0.013364830054342747, -0.013141979463398457, 0.0011922522680833936, 0.006322520785033703, -0.0004896017489954829, -0.0028291114140301943, -0.009287447668612003, -0.021965233609080315, 0.03152938187122345, 0.00810872670263052, -0.028736872598528862, -0.023163029924035072, -0.03366544097661972, 0.020645910874009132, -0.03314598277211189, 0.03880348056554794, 0.009069174528121948, 0.00633469270542264, 0.09019562602043152, 0.04395481199026108, 0.021493952721357346, -0.0012203545775264502, -0.008003506809473038, 0.0005816234624944627, 0.027125393971800804, -0.05667264759540558, 0.038607753813266754, 0.04694390669465065, -0.00565087515860796, -0.024970751255750656, -0.020624076947569847, 0.011477457359433174, 0.0016564808320254087, -0.025250554084777832, -0.03610401600599289, 0.018454844132065773, -0.019847378134727478, 0.008955851197242737, 0.03807211294770241, -0.06424836069345474, -0.02729138359427452, -0.020875059068202972, -0.01452101580798626, -0.007545968517661095, 0.010136705823242664, -0.011983807198703289, 0.01491461880505085, -0.02777835913002491, -0.011003761552274227, -0.010576434433460236, -0.008059872314333916, 0.0036743979435414076, 0.03312287479639053, 0.03527246415615082, 0.0364689826965332, -0.0005058578681200743, -0.030745765194296837, 0.0609147809445858, 0.014962667599320412, 0.05409712344408035, 0.006046036258339882, -0.03300958499312401, -0.006111537106335163, -0.02362757734954357, -0.012275554239749908, 0.010423519648611546, -0.018905002623796463, 0.054251376539468765, -0.0059964340180158615, -0.05477603152394295, -0.022443506866693497, -0.0036348667927086353, -0.03470706194639206, -0.015869691967964172, -0.06002680957317352, -0.03329383209347725, -0.02543313056230545, -0.01381655689328909, -0.015787973999977112, -0.0037349367048591375, -0.03783893957734108, 0.00030305792461149395, 0.01221040915697813, 0.037274520844221115, 0.035005323588848114, -0.02690398320555687, -0.014468496665358543, 0.04189489781856537, 0.033100321888923645, -0.008360632695257664, 0.043549370020627975, -0.05221143364906311, -0.010840457864105701, 0.019972994923591614, -0.08662424981594086, -0.020530475303530693, 0.051277268677949905, -0.04399894177913666, -0.023578867316246033, 0.01965881511569023, -0.0009816517122089863, -0.017721623182296753, -0.018783023580908775, 0.024514226242899895, 0.031012779101729393, 0.033788781613111496, 0.022824862971901894, -0.023204617202281952, -0.010844551958143711, -0.04690912365913391, -0.010056046769022942, 0.017791585996747017, -0.03158311918377876, 0.0071300626732409, 0.0005410840385593474, 0.0021672432776540518, 0.031238645315170288, 0.12138962000608444])]

Let's remove all rows with missing values from the SFrame and look at the results in GraphLab Canvas.


In [13]:
sf = sf.dropna()
print sf.column_names()


['age', 'dates', 'gender', 'id', 'industry', 'posts', 'sign', '1gram features', '2gram features', 'vectors']

Randomly split the data into a training set and a test set; we will then train a classifier on the training set and evaluate it on the test set. Because SFrame operations are lazily evaluated, the apply and dropna operations that we called above are actually queued up, and the random split forces the SFrame to materialize the results. This operation will therefore take a while, because it is only now computing the vector representation of each blogger's posts.


In [14]:
train_set, test_set = sf.random_split(0.8, seed=5)

3.2 Predicting blogger gender

Okay, feature engineering is done! Let's see how well the single word counts (one-grams) work in predicting the blogger's gender.


In [15]:
cls = gl.classifier.create(train_set, target='gender', features=['1gram features'])
baseline_result = cls.evaluate(test_set)
print baseline_result


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14645
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 646395
PROGRESS: Number of coefficients    : 667424
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 1.000000  | 0.810433     | 0.950358          | 0.730469            |
PROGRESS: | 2         | 5        | 1.000000  | 1.539367     | 0.925982          | 0.660156            |
PROGRESS: | 3         | 6        | 1.000000  | 1.995312     | 0.968385          | 0.710938            |
PROGRESS: | 4         | 7        | 1.000000  | 2.436006     | 0.974053          | 0.704427            |
PROGRESS: | 5         | 8        | 1.000000  | 2.918556     | 0.986343          | 0.747396            |
PROGRESS: | 6         | 9        | 1.000000  | 3.359423     | 0.988119          | 0.750000            |
PROGRESS: | 10        | 13       | 1.000000  | 5.163912     | 0.990850          | 0.744792            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: SVM:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14645
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 646395
PROGRESS: Number of coefficients    : 667424
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 1.000000  | 0.704964     | 0.950358          | 0.730469            |
PROGRESS: | 2         | 5        | 1.000000  | 1.285862     | 0.947422          | 0.679688            |
PROGRESS: | 3         | 6        | 1.000000  | 1.696845     | 0.978081          | 0.735677            |
PROGRESS: | 4         | 7        | 1.000000  | 2.129355     | 0.982656          | 0.726562            |
PROGRESS: | 5         | 8        | 1.000000  | 2.542319     | 0.989211          | 0.747396            |
PROGRESS: | 6         | 9        | 1.000000  | 2.933923     | 0.989484          | 0.730469            |
PROGRESS: | 10        | 13       | 1.000000  | 4.584358     | 0.992694          | 0.734375            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.744792
PROGRESS: SVMClassifier                   : 0.734375
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|    female    |      female     |  1668 |
|    female    |       male      |  295  |
|     male     |      female     |  732  |
|     male     |       male      |  1204 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.7365991279815337}

That got us an accuracy of about 0.74. For kicks, we'll throw in the 2-gram features as well. This blows up the feature space to more than 13 million sparse features and makes the training much more expensive. (My desktop was unresponsive for a few minutes since the training took up all available memory.) The results are less than overwhelming. Granted, the classifier hasn't been properly tuned, and performance should improve with better hyperparameter settings. But tuning itself is a computationally intensive process. Also, as we'll see, the word2vec features provide clear gains, at a fraction of the computation cost.
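
As a sanity check, the reported accuracy matches the confusion matrix above: (1668 + 1204) correct predictions out of 3899 test bloggers, or about 0.737. If you do want to experiment with tuning, a minimal sketch (using the same GraphLab Create API as above; the parameter values here are only illustrative) is to train a single model type directly and adjust its regularization:

cls_tuned = gl.logistic_classifier.create(train_set, target='gender',
                                          features=['1gram features'],
                                          l2_penalty=0.05,        # illustrative regularization strength
                                          validation_set=None)    # skip the automatic validation split
print cls_tuned.evaluate(test_set)['accuracy']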


In [16]:
cls2 = gl.classifier.create(train_set, target='gender',features=['2gram features', '1gram features'] )
ngram_result = cls2.evaluate(test_set)
print ngram_result


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14656
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 2
PROGRESS: Number of unpacked features : 12887989
PROGRESS: Number of coefficients    : 13287908
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 4        | 1.000000  | 50.759790    | 0.993518          | 0.743725            |
PROGRESS: | 2         | 6        | 1.000000  | 71.944949    | 0.994678          | 0.749009            |
PROGRESS: | 3         | 7        | 1.000000  | 75.804266    | 0.994814          | 0.749009            |
PROGRESS: | 4         | 8        | 1.000000  | 79.621543    | 0.994883          | 0.754293            |
PROGRESS: | 5         | 9        | 1.000000  | 83.857531    | 0.996384          | 0.756935            |
PROGRESS: | 6         | 10       | 1.000000  | 89.148152    | 0.996725          | 0.752972            |
PROGRESS: | 7         | 11       | 1.000000  | 93.253768    | 0.996725          | 0.751651            |
PROGRESS: | 8         | 12       | 1.000000  | 97.345938    | 0.996588          | 0.745046            |
PROGRESS: | 9         | 13       | 1.000000  | 101.510659   | 0.996043          | 0.752972            |
PROGRESS: | 10        | 14       | 1.000000  | 105.875637   | 0.996930          | 0.758256            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: SVM:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14656
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 2
PROGRESS: Number of unpacked features : 12887989
PROGRESS: Number of coefficients    : 13287908
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 1.000000  | 97.441278    | 0.993518          | 0.743725            |
PROGRESS: | 2         | 5        | 1.000000  | 154.346597   | 0.994883          | 0.743725            |
PROGRESS: | 3         | 6        | 1.000000  | 171.045539   | 0.995633          | 0.745046            |
PROGRESS: | 4         | 7        | 1.000000  | 174.742157   | 0.995360          | 0.742404            |
PROGRESS: | 5         | 9        | 1.000000  | 201.992722   | 0.996043          | 0.743725            |
PROGRESS: | 6         | 10       | 1.000000  | 211.561595   | 0.995838          | 0.741083            |
PROGRESS: | 7         | 11       | 1.000000  | 215.532938   | 0.996452          | 0.745046            |
PROGRESS: | 8         | 12       | 1.000000  | 219.572237   | 0.996043          | 0.742404            |
PROGRESS: | 9         | 14       | 1.000000  | 242.171512   | 0.996043          | 0.742404            |
PROGRESS: | 10        | 15       | 1.000000  | 252.585340   | 0.996247          | 0.743725            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.758256
PROGRESS: SVMClassifier                   : 0.743725
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|    female    |      female     |  1669 |
|    female    |       male      |  294  |
|     male     |      female     |  660  |
|     male     |       male      |  1276 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.7553218774044627}

The bag-of-words classifiers achieved an accuracy of roughly 0.74-0.75. Let's see what Word2Vec features can do.


In [17]:
cls3 = gl.classifier.create(train_set, target='gender',features=['vectors'])
word2vec_result = cls3.evaluate(test_set)
print word2vec_result


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14652
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Number of coefficients    : 301
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 3.832028     | 0.795796          | 0.819974            |
PROGRESS: | 2         | 3        | 5.833947     | 0.795796          | 0.812089            |
PROGRESS: | 3         | 4        | 7.743807     | 0.796137          | 0.808147            |
PROGRESS: | 4         | 5        | 9.687033     | 0.796137          | 0.809461            |
PROGRESS: | 5         | 6        | 11.621431    | 0.796205          | 0.809461            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: SVM:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14652
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Number of coefficients    : 301
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 4        | 1.000000  | 0.192293     | 0.655269          | 0.667543            |
PROGRESS: | 2         | 6        | 1.000000  | 0.298776     | 0.665916          | 0.692510            |
PROGRESS: | 3         | 7        | 1.000000  | 0.369018     | 0.690145          | 0.699080            |
PROGRESS: | 4         | 8        | 1.000000  | 0.441306     | 0.700519          | 0.720105            |
PROGRESS: | 5         | 9        | 1.000000  | 0.506479     | 0.734302          | 0.758213            |
PROGRESS: | 6         | 10       | 1.000000  | 0.588495     | 0.723724          | 0.737188            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.809461
PROGRESS: SVMClassifier                   : 0.781866
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|    female    |      female     |  1572 |
|    female    |       male      |  391  |
|     male     |      female     |  408  |
|     male     |       male      |  1528 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.7950756604257502}

And we got an accuracy of ~0.79. This accuracy for gender prediction, which utilizes only the author's blog posts, is probably better than the results obtained by Schler et al. 2006 (in order to actually prove that these results are better, we would need to add more tests, but that is a topic for another notebook). Let's try to improve the results by training the classifier with some more features: the blogger's industry and age.


In [18]:
cls4 = gl.classifier.create(train_set, target='gender', features=['vectors', 'industry', 'age'] )
word2vec_industry_age_result = cls4.evaluate(test_set)
print word2vec_industry_age_result


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14650
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 302
PROGRESS: Number of coefficients    : 341
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 3.817053     | 0.802457          | 0.790301            |
PROGRESS: | 2         | 3        | 5.930517     | 0.804573          | 0.790301            |
PROGRESS: | 3         | 4        | 7.986661     | 0.804915          | 0.788991            |
PROGRESS: | 4         | 5        | 10.060300    | 0.804846          | 0.787680            |
PROGRESS: | 5         | 6        | 12.001943    | 0.804846          | 0.787680            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: SVM:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14650
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 3
PROGRESS: Number of unpacked features : 302
PROGRESS: Number of coefficients    : 341
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 4        | 1.000000  | 0.154809     | 0.656451          | 0.668414            |
PROGRESS: | 2         | 6        | 1.000000  | 0.261841     | 0.669898          | 0.664482            |
PROGRESS: | 3         | 7        | 1.000000  | 0.340653     | 0.695563          | 0.686763            |
PROGRESS: | 4         | 8        | 1.000000  | 0.415740     | 0.705392          | 0.705111            |
PROGRESS: | 5         | 9        | 1.000000  | 0.485970     | 0.731126          | 0.712975            |
PROGRESS: | 6         | 10       | 1.000000  | 0.556509     | 0.743276          | 0.715596            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.78768
PROGRESS: SVMClassifier                   : 0.750983
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|    female    |      female     |  1589 |
|    female    |       male      |  374  |
|     male     |      female     |  404  |
|     male     |       male      |  1532 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.800461656835086}

We got slightly better results with the additional features. We can use the SFrame to easily engineer more features, such as the total length of the posts (computed in the next cell), the number of URLs in the posts, and the total number of posts (a sketch of the latter two follows).
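
Here is a hedged sketch of the latter two features: since 'posts' was joined into a single string earlier, the number of posts can be recovered from the 'dates' column, and the URL count below is just a naive substring count. These columns could then be added to the features list when creating the classifier.

train_set['num_posts'] = train_set['dates'].apply(lambda dates: len(dates))
test_set['num_posts'] = test_set['dates'].apply(lambda dates: len(dates))
train_set['num_urls'] = train_set['posts'].apply(lambda p: p.count("http"))
test_set['num_urls'] = test_set['posts'].apply(lambda p: p.count("http"))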


In [19]:
train_set['posts_length'] = train_set['posts'].apply(lambda p: len(p))
test_set['posts_length'] = test_set['posts'].apply(lambda p: len(p))

cls5 = gl.classifier.create(train_set, target='gender',features=['vectors', 'industry', 'age', 'posts_length'] )
word2vec_industry_age_posts_length_result = cls5.evaluate(test_set)
print word2vec_industry_age_posts_length_result


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14733
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 303
PROGRESS: Number of coefficients    : 342
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 2        | 3.983127     | 0.802552          | 0.777941            |
PROGRESS: | 2         | 3        | 5.999601     | 0.804181          | 0.783824            |
PROGRESS: | 3         | 4        | 7.934816     | 0.803638          | 0.783824            |
PROGRESS: | 4         | 5        | 9.944985     | 0.803502          | 0.783824            |
PROGRESS: | 5         | 6        | 11.900313    | 0.803434          | 0.783824            |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: SVM:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14733
PROGRESS: Number of classes           : 2
PROGRESS: Number of feature columns   : 4
PROGRESS: Number of unpacked features : 303
PROGRESS: Number of coefficients    : 342
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 4        | 1.000000  | 0.182455     | 0.657639          | 0.630882            |
PROGRESS: | 2         | 6        | 1.000000  | 0.297339     | 0.670332          | 0.654412            |
PROGRESS: | 3         | 7        | 1.000000  | 0.367681     | 0.693749          | 0.676471            |
PROGRESS: | 4         | 8        | 1.000000  | 0.441621     | 0.705898          | 0.707353            |
PROGRESS: | 5         | 9        | 1.000000  | 0.514203     | 0.727822          | 0.719118            |
PROGRESS: | 6         | 10       | 1.000000  | 0.589245     | 0.742008          | 0.732353            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: LogisticClassifier, SVMClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: LogisticClassifier              : 0.783824
PROGRESS: SVMClassifier                   : 0.752941
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 4

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|    female    |      female     |  1580 |
|    female    |       male      |  383  |
|     male     |      female     |  396  |
|     male     |       male      |  1540 |
+--------------+-----------------+-------+
[4 rows x 3 columns]
, 'accuracy': 0.8002051808155938}

We get pretty much the same results. Let us move on to predicting the blogger's age.

3.3 Predicting blogger age

Let's first look at some basic statistics of the bloggers' age using the show() function.


In [20]:
sf.show(['age'])


Out[20]:

Notice that the bloggers are between 13 and 48 years old, with an average age of about 23. To predict a blogger's age we can use both classification and regression: a regression model can estimate the actual age, whereas a classifier can predict the age group of the blogger. Let's start by constructing a regression model.

3.3.1 Regression Models

Constructing a regression model using the regression toolkit is really straightforward. We can try both linear regression and boosted trees regression. We'll use only the Word2Vec features.


In [21]:
linear_model = gl.linear_regression.create(train_set, target='age',features=['vectors'])
linear_model.evaluate(test_set)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14642
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Number of coefficients    : 301
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+---------------+-----------------+--------------------+----------------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-rmse | Validation-rmse | Training-max_error | Validation-max_error |
PROGRESS: +-----------+----------+--------------+---------------+-----------------+--------------------+----------------------+
PROGRESS: | 1         | 2        | 1.397436     | 26.083441     | 27.646445       | 5.591975           | 5.852079             |
PROGRESS: +-----------+----------+--------------+---------------+-----------------+--------------------+----------------------+
PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

Out[21]:
{'max_error': 30.03233650607854, 'rmse': 5.741340607284524}

In [22]:
boosted_tree_model = gl.boosted_trees_regression.create(train_set, target='age',features=['vectors']) 
boosted_tree_model.evaluate(test_set)


PROGRESS: Boosted trees regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 15413
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter        RMSE Elapsed time
PROGRESS:      0   1.722e+01        0.48s
PROGRESS:      1   1.280e+01        0.66s
PROGRESS:      2   9.879e+00        0.89s
PROGRESS:      3   8.019e+00        1.05s
PROGRESS:      4   6.847e+00        1.21s
PROGRESS:      5   6.137e+00        1.38s
PROGRESS:      6   5.696e+00        1.54s
PROGRESS:      7   5.407e+00        1.87s
PROGRESS:      8   5.202e+00        2.05s
PROGRESS:      9   5.048e+00        2.22s
Out[22]:
{'max_error': 28.603580384618983, 'rmse': 6.124479850206823}

RMSE (root mean squared error) indicates roughly how many years the estimates are off from the real age, with larger errors penalized more heavily. The linear regression model performed better in terms of RMSE than the boosted trees model. It is important to note that some of these blogs were written over a span of years, so it can be hard to predict the blogger's exact age. It may be better to instead classify the blogger's age group using classification models.
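
As a quick, hedged sketch of what this metric means (reusing the linear model trained above), the RMSE can be recomputed by hand from the model's predictions:

import numpy as np

predictions = linear_model.predict(test_set)   # predicted ages for the test set
errors = predictions - test_set['age']         # per-blogger prediction errors
print np.sqrt((errors * errors).mean())        # should match the reported test RMSE (~5.74)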

3.3.2 Classification Models

Similar to Schler et al. 2006, we divide bloggers in our dataset into one of three age categories: 10s (13-17), 20s (23-27), and 30s (33-42).

We remove from the SFrame the bloggers whose ages fall outside these categories.


In [23]:
valid_age = range(13,18) + range(23,28) + range(33,43)
sf_age_categories = sf.filter_by(valid_age, 'age')

In [24]:
def get_age_category(age):
    if 13 <= age <=17:
        return "10s"
    elif 23 <= age <= 27:
        return "20s"
    elif 33 <= age <= 42:
        return "30s"    
    return None
        
sf['age_category'] = sf['age'].apply(lambda age: get_age_category(age))
sf_age_categories = sf.dropna() # remove bloggers without an age category
print sf_age_categories.num_rows()


18780

Now let's construct and evaluate an age category classification model:


In [25]:
train_set2, test_set2 = sf_age_categories.random_split(0.8, seed=5)
cls = gl.classifier.create(train_set2, target='age_category', features=['vectors'])
age_category_result = cls.evaluate(test_set2)
print age_category_result


PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.

PROGRESS: The following methods are available for this type of problem.
PROGRESS: Boosted trees classifier:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14208
PROGRESS: Number of classes           : 3
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Starting Boosted Trees
PROGRESS: --------------------------------------------------------
PROGRESS:   Iter      Accuracy          Elapsed time
PROGRESS:         (training) (validation)
PROGRESS:      0   7.319e-01   6.486e-01        0.84s
PROGRESS:      1   7.493e-01   6.641e-01        1.10s
PROGRESS:      2   7.601e-01   6.538e-01        1.53s
PROGRESS:      3   7.695e-01   6.628e-01        1.79s
PROGRESS:      4   7.817e-01   6.641e-01        2.05s
PROGRESS:      5   7.898e-01   6.615e-01        2.32s
PROGRESS:      6   7.984e-01   6.602e-01        2.58s
PROGRESS:      7   8.079e-01   6.731e-01        2.84s
PROGRESS:      8   8.141e-01   6.654e-01        3.12s
PROGRESS:      9   8.237e-01   6.705e-01        3.53s
PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 14208
PROGRESS: Number of classes           : 3
PROGRESS: Number of feature columns   : 1
PROGRESS: Number of unpacked features : 300
PROGRESS: Number of coefficients    : 602
PROGRESS: Starting L-BFGS
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes   | Step size | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: | 1         | 3        | 1.000000  | 0.160432     | 0.485783          | 0.504505            |
PROGRESS: | 2         | 5        | 1.000000  | 0.293274     | 0.643088          | 0.624196            |
PROGRESS: | 3         | 6        | 1.000000  | 0.384782     | 0.626689          | 0.577864            |
PROGRESS: | 4         | 7        | 1.000000  | 0.478458     | 0.671171          | 0.620335            |
PROGRESS: | 5         | 8        | 1.000000  | 0.568476     | 0.660825          | 0.622909            |
PROGRESS: | 6         | 9        | 1.000000  | 0.655969     | 0.681236          | 0.638353            |
PROGRESS: | 10        | 13       | 1.000000  | 0.988998     | 0.716005          | 0.671815            |
PROGRESS: +-----------+----------+-----------+--------------+-------------------+---------------------+
PROGRESS: BoostedTreesClassifier, LogisticClassifier
PROGRESS: The returned model will be chosen according to validation accuracy.
PROGRESS: Model selection based on validation accuracy:
PROGRESS: ---------------------------------------------
PROGRESS: BoostedTreesClassifier          : 0.670527670528
PROGRESS: LogisticClassifier              : 0.671815
PROGRESS: ---------------------------------------------
PROGRESS: Selecting LogisticClassifier based on validation set performance.
{'confusion_matrix': Columns:
	target_label	str
	predicted_label	str
	count	int

Rows: 9

Data:
+--------------+-----------------+-------+
| target_label | predicted_label | count |
+--------------+-----------------+-------+
|     10s      |       10s       |  1293 |
|     10s      |       20s       |  337  |
|     10s      |       30s       |   29  |
|     20s      |       10s       |  221  |
|     20s      |       20s       |  1365 |
|     20s      |       30s       |   39  |
|     30s      |       10s       |   37  |
|     30s      |       20s       |  437  |
|     30s      |       30s       |   37  |
+--------------+-----------------+-------+
[9 rows x 3 columns]
, 'accuracy': 0.7101449275362319}

This shows that the features derived from Word2Vec can be used to predict the age category of a blogger with an accuracy of about 0.71.

4. Where to Go From Here

In this notebook, we demonstrated that deep learning can generate useful features for predicting the gender and age of a blogger based on his or her blog post content. If you want to continue exploring this dataset yourself, there is a lot more that can be done. You can try to predict a blogger's astrological sign from his or her posts. (I didn't succeed in building a prediction model that works better than random prediction, but you might!) You can try to predict a blogger's professional industry. You can also train the Word2Vec model on a different text corpus, such as Wikipedia, and see if that gives you better results. We hope that the methods and code presented in this notebook can help you solve other text analysis tasks.
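
For example, predicting the astrological sign or the professional industry can be attempted with exactly the same pipeline, reusing the Word2Vec features computed above (a sketch, not a tuned model):

cls_sign = gl.classifier.create(train_set, target='sign', features=['vectors'])
print cls_sign.evaluate(test_set)['accuracy']

cls_industry = gl.classifier.create(train_set, target='industry', features=['vectors'])
print cls_industry.evaluate(test_set)['accuracy']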

5. Further Reading


In [ ]: