Elementary, My Dear Watson!

Text Analytics Tutorial using Sherlock Holmes Stories

In this notebook, I will demonstrate several text analytics techniques that can be used to analyze text corpora and extract interesting insights. Namely, throughout this notebook, I will use The Sherlock Holmes Books Collection to show how to (a) calculate various textual statistics; (b) create the social network among entities that appear in the books; (c) use a topic model to discover abstract topics in the text; and (d) utilize Word2Vec to find connections among various sections of the text and make some predictions.

To perform the text analytics presented in this notebook, we will use the NLTK Python package, GraphLab Create's SFrame and SGraph objects, GraphLab's text analytics toolkit, and the Word2Vec deep-learning-inspired model implemented in the Gensim Python package.

The notebook is divided into the following sections:

Each section can be executed independently, so feel free to skip ahead; just remember to import all the required packages and define all the needed functions.

Required Python Packages:

Let's do some text analytics!

0. Setup

Before we begin, make sure you have installed all the required Python packages. (The instructions below use pip. You can use easy_install, too.) Also, consider using virtualenv for a cleaner installation experience instead of sudo. I also recommend running the code via IPython Notebook.

$ sudo pip install --upgrade gensim
$ sudo pip install --upgrade nltk
$ sudo pip install --upgrade graphlab-create
$ sudo pip install --upgrade pyldavis

You will need a product key for GraphLab Create, and you will also need to make the Stanford Named Entity Recognizer work with Python NLTK.

After installing NLTK, download the punkt model from an interactive shell by importing nltk and running nltk.download(). In the resulting interactive window, navigate to the Models tab, select punkt, and download it to your system.
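
Alternatively, you can download the punkt model non-interactively from a Python shell:

import nltk
nltk.download('punkt')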

To prepare the Stanford Named Entity Recognizer to work on your system, make sure you read the following links: [1 - How to](http://textminingonline.com/how-to-use-stanford-named-entity-recognizer-ner-in-python-nltk-and-other-programming-languages), [2 - API](http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford), [3 - Stanford Parser FAQ](http://nlp.stanford.edu/software/parser-faq.shtml), [4 - download](http://nlp.stanford.edu/software/CRF-NER.html). 5 - On some computer configurations, it also seems to help to run the stanford-ner.jar file from the extracted directory.

Take note of what directory you downloaded and extracted the files to as you will need the location to pass the classifier and Stanford NER as arguments later in this notebook.

Also, in case you haven't already, make sure you are running the latest [Java JDK](http://www.oracle.com/technetwork/java/javase/downloads/index.html).
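
You can check which Java version is currently active from a terminal:

$ java -version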

For a quick test, you can open an IPython shell and try the following:

from nltk.tag import StanfordNERTagger
st = StanfordNERTagger('[path_to_your_downloaded_package_classifiers_directory]/english.all.3class.distsim.crf.ser.gz',
                       '[path_to_your_downloaded_package_root_directory]/stanford-ner.jar')
st.tag('When we turned him over, the Boots recognized him at once as being the same gentleman who had engaged the room under the name of Joseph Stangerson.'.split())

1. Preparing the Dataset

1.1 Constructing the Dataset

"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."
                                                -The Adventure of the Copper Beeches

Throughout this notebook, we will be analyzing the Sherlock Holmes stories collection. First, we will download the stories in ASCII format from the sherlock-holm.es website. The sherlock-holm.es website contains over sixty downloadable stories; we will use the following code to download the stories and insert them into an SFrame object.

Important Note: in some countries, such as the U.S., a few of Sherlock Holmes's books & stories are still under copyright restrictions. For more information, please consult the following website, and read the guidelines that appear at the end of the sherlock-holm.es ASCII download page.


In [ ]:
import re
import urllib2
import graphlab as gl

BASE_DIR = "/Users/pablo/fromHomeMac/sherlock/data" # NOTE: Update BASE_DIR to your own directory path

books_url = "http://sherlock-holm.es/ascii/"
re_books_links = re.compile("\"piwik_download\"\s+href=\"(?P<link>.*?)\">(?P<title>.*?)</a>", re.MULTILINE)
html = urllib2.urlopen(books_url).read()
books_list = [m.groupdict() for m in re_books_links.finditer(html)]
print books_list


[{'link': '/stories/plain-text/cano.txt', 'title': 'The Complete Canon'}, {'link': '/stories/plain-text/cnus.txt', 'title': 'The Canon \xe2\x80\x94 U.S. edition'}, {'link': '/stories/plain-text/advs.txt', 'title': 'The Adventures of Sherlock Holmes'}, {'link': '/stories/plain-text/mems.txt', 'title': 'The Memoirs of Sherlock Holmes'}, {'link': '/stories/plain-text/retn.txt', 'title': 'The Return of Sherlock Holmes'}, {'link': '/stories/plain-text/lstb.txt', 'title': 'His Last Bow'}, {'link': '/stories/plain-text/case.txt', 'title': 'The Case-Book of Sherlock Holmes'}, {'link': '/stories/plain-text/stud.txt', 'title': 'A Study In Scarlet'}, {'link': '/stories/plain-text/sign.txt', 'title': 'The Sign of the Four'}, {'link': '/stories/plain-text/houn.txt', 'title': 'The Hound of the Baskervilles'}, {'link': '/stories/plain-text/vall.txt', 'title': 'The Valley of Fear'}, {'link': '/stories/plain-text/scan.txt', 'title': 'A Scandal in Bohemia'}, {'link': '/stories/plain-text/redh.txt', 'title': 'The Red-Headed League'}, {'link': '/stories/plain-text/iden.txt', 'title': 'A Case of Identity'}, {'link': '/stories/plain-text/bosc.txt', 'title': 'The Boscombe Valley Mystery'}, {'link': '/stories/plain-text/five.txt', 'title': 'The Five Orange Pips'}, {'link': '/stories/plain-text/twis.txt', 'title': 'The Man with the Twisted Lip'}, {'link': '/stories/plain-text/blue.txt', 'title': 'The Adventure of the Blue Carbuncle'}, {'link': '/stories/plain-text/spec.txt', 'title': 'The Adventure of the Speckled Band'}, {'link': '/stories/plain-text/engr.txt', 'title': "The Adventure of the Engineer's Thumb"}, {'link': '/stories/plain-text/nobl.txt', 'title': 'The Adventure of the Noble Bachelor'}, {'link': '/stories/plain-text/bery.txt', 'title': 'The Adventure of the Beryl Coronet'}, {'link': '/stories/plain-text/copp.txt', 'title': 'The Adventure of the Copper Beeches'}, {'link': '/stories/plain-text/silv.txt', 'title': 'Silver Blaze'}, {'link': '/stories/plain-text/yell.txt', 'title': 'Yellow Face'}, {'link': '/stories/plain-text/stoc.txt', 'title': "The Stockbroker's Clerk"}, {'link': '/stories/plain-text/glor.txt', 'title': 'The \xe2\x80\x9cGloria Scott\xe2\x80\x9d'}, {'link': '/stories/plain-text/musg.txt', 'title': 'The Musgrave Ritual'}, {'link': '/stories/plain-text/reig.txt', 'title': 'The Reigate Puzzle'}, {'link': '/stories/plain-text/croo.txt', 'title': 'The Crooked Man'}, {'link': '/stories/plain-text/resi.txt', 'title': 'The Resident Patient'}, {'link': '/stories/plain-text/gree.txt', 'title': 'The Greek Interpreter'}, {'link': '/stories/plain-text/nava.txt', 'title': 'The Naval Treaty'}, {'link': '/stories/plain-text/fina.txt', 'title': 'The Final Problem'}, {'link': '/stories/plain-text/empt.txt', 'title': 'The Empty House'}, {'link': '/stories/plain-text/norw.txt', 'title': 'The Norwood Builder'}, {'link': '/stories/plain-text/danc.txt', 'title': 'The Dancing Men'}, {'link': '/stories/plain-text/soli.txt', 'title': 'The Solitary Cyclist'}, {'link': '/stories/plain-text/prio.txt', 'title': 'The Priory School'}, {'link': '/stories/plain-text/blac.txt', 'title': 'Black Peter'}, {'link': '/stories/plain-text/chas.txt', 'title': 'Charles Augustus Milverton'}, {'link': '/stories/plain-text/sixn.txt', 'title': 'The Six Napoleons'}, {'link': '/stories/plain-text/3stu.txt', 'title': 'The Three Students'}, {'link': '/stories/plain-text/gold.txt', 'title': 'The Golden Pince-Nez'}, {'link': '/stories/plain-text/miss.txt', 'title': 'The Missing Three-Quarter'}, {'link': '/stories/plain-text/abbe.txt', 
'title': 'The Abbey Grange'}, {'link': '/stories/plain-text/seco.txt', 'title': 'The Second Stain'}, {'link': '/stories/plain-text/wist.txt', 'title': 'Wisteria Lodge'}, {'link': '/stories/plain-text/card.txt', 'title': 'The Cardboard Box'}, {'link': '/stories/plain-text/redc.txt', 'title': 'The Red Circle'}, {'link': '/stories/plain-text/bruc.txt', 'title': 'The Bruce-Partington Plans'}, {'link': '/stories/plain-text/dyin.txt', 'title': 'The Dying Detective'}, {'link': '/stories/plain-text/lady.txt', 'title': 'Lady Frances Carfax'}, {'link': '/stories/plain-text/devi.txt', 'title': "The Devil's Foot"}, {'link': '/stories/plain-text/last.txt', 'title': 'His Last Bow'}, {'link': '/stories/plain-text/illu.txt', 'title': 'The Illustrious Client'}, {'link': '/stories/plain-text/blan.txt', 'title': 'The Blanched Soldier'}, {'link': '/stories/plain-text/maza.txt', 'title': 'The Mazarin Stone'}, {'link': '/stories/plain-text/3gab.txt', 'title': 'The Three Gables'}, {'link': '/stories/plain-text/suss.txt', 'title': 'The Sussex Vampire'}, {'link': '/stories/plain-text/3gar.txt', 'title': 'The Three Garridebs'}, {'link': '/stories/plain-text/thor.txt', 'title': 'Thor Bridge'}, {'link': '/stories/plain-text/cree.txt', 'title': 'The Creeping Man'}, {'link': '/stories/plain-text/lion.txt', 'title': "The Lion's Mane"}, {'link': '/stories/plain-text/veil.txt', 'title': 'The Veiled Lodger'}, {'link': '/stories/plain-text/shos.txt', 'title': 'Shoscombe Old Place'}, {'link': '/stories/plain-text/reti.txt', 'title': 'The Retired Colourman'}]

We got the books' titles and links; now let's download the books' texts.


In [ ]:
# Filter books due to copyright issues. In this code, we filtered the "The Complete Canon", "The Case-Book of Sherlock Holmes",
# and "The Canon — U.S. edition" books (for more information please read the note above).
filtered_books = set(["The Complete Canon", "The Case-Book of Sherlock Holmes", "The Canon — U.S. edition" ])
books_list = filter(lambda d: d['title'] not in filtered_books, books_list)

#Download the books' texts (to not overload the website, we download the texts sequentially rather than in parallel)
for d in books_list:
    d['text'] = urllib2.urlopen("http://sherlock-holm.es" + d['link']).read().strip()

Let's load the dict list into an SFrame object.


In [ ]:
sf = gl.SFrame(books_list).unpack("X1", column_name_prefix="")
sf.save("%s/books.sframe" % BASE_DIR)
sf.head(3)


[INFO] This non-commercial license of GraphLab Create is assigned to pablo@dato.com and will expire on January 02, 2020. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-1136 - Server binary: /Users/pablo/anaconda/envs/dato-env/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1453918014.log
[INFO] GraphLab Server Version: 1.8
Out[ ]:
+------------------------------+-------------------------------+-------------------------------+
|             link             |              text             |             title             |
+------------------------------+-------------------------------+-------------------------------+
| /stories/plain-text/advs.txt | THE ADVENTURES OF SHERLOCK... | The Adventures of Sherlock... |
| /stories/plain-text/mems.txt | THE MEMOIRS OF SHERLOCK HO... | The Memoirs of Sherlock Ho... |
| /stories/plain-text/retn.txt | THE RETURN OF SHERLOCK HOL... | The Return of Sherlock Hol... |
+------------------------------+-------------------------------+-------------------------------+
[3 rows x 3 columns]

2. Calculating Various Statistics

In this section, I will demonstrate how straightforward it is to utilize the GraphLab Create SFrame object to calculate & visualize various statistics.

In the previous section, we created an SFrame object which consists of 64 texts. Let us first load the SFrame object.


In [ ]:
import graphlab as gl
import re

BASE_DIR = "/Users/pablo/fromHomeMac/sherlock/data" # NOTE: Update BASE_DIR to your own directory path
gl.canvas.set_target('ipynb')
sf = gl.load_sframe("%s/books.sframe" % BASE_DIR)

Using Python, it is very easy to calculate the number of characters in a text: we just need to use the built-in len function. Let's calculate the number of characters in each downloaded text using the len function and SArray's apply function (notice that each column in an SFrame object is an SArray object).


In [ ]:
sf['chars_num'] = sf['text'].apply(lambda t: len(t))
sf.head(3)


Out[ ]:
+------------------------------+-------------------------------+-------------------------------+-----------+
|             link             |              text             |             title             | chars_num |
+------------------------------+-------------------------------+-------------------------------+-----------+
| /stories/plain-text/advs.txt | THE ADVENTURES OF SHERLOCK... | The Adventures of Sherlock... |   610886  |
| /stories/plain-text/mems.txt | THE MEMOIRS OF SHERLOCK HO... | The Memoirs of Sherlock Ho... |   511747  |
| /stories/plain-text/retn.txt | THE RETURN OF SHERLOCK HOL... | The Return of Sherlock Hol... |   662242  |
+------------------------------+-------------------------------+-------------------------------+-----------+
[3 rows x 4 columns]

Let's use the show function to visualize the distribution of text length across our downloaded texts.


In [ ]:
sf['chars_num'].show()


We can see that the mean number of characters in the downloaded stories is 95,020.42, and the maximal number of characters in a story is 662,242. Let's also calculate the number of words in each text. Calculating the number of words in a text is a little trickier, and there are several methods to perform this task, for example:


In [ ]:
text = """I think that you know me well enough, Watson, to understand that I am by no means a nervous man. At the same time,
it is stupidity rather than courage to refuse to recognize danger when it is close upon you."""

#using the split function
print text.split()


['I', 'think', 'that', 'you', 'know', 'me', 'well', 'enough,', 'Watson,', 'to', 'understand', 'that', 'I', 'am', 'by', 'no', 'means', 'a', 'nervous', 'man.', 'At', 'the', 'same', 'time,', 'it', 'is', 'stupidity', 'rather', 'than', 'courage', 'to', 'refuse', 'to', 'recognize', 'danger', 'when', 'it', 'is', 'close', 'upon', 'you.']

In [ ]:
#Using NLTK 
#Note: Remember to download the NLTK's punkt package by running nltk.download() from the Interactive Python Shell
import nltk
print nltk.word_tokenize(text)


['I', 'think', 'that', 'you', 'know', 'me', 'well', 'enough', ',', 'Watson', ',', 'to', 'understand', 'that', 'I', 'am', 'by', 'no', 'means', 'a', 'nervous', 'man', '.', 'At', 'the', 'same', 'time', ',', 'it', 'is', 'stupidity', 'rather', 'than', 'courage', 'to', 'refuse', 'to', 'recognize', 'danger', 'when', 'it', 'is', 'close', 'upon', 'you', '.']
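
We can also split the text into words using a simple regular expression; this defines the re_words_split pattern that is reused throughout the rest of the notebook:


In [ ]:
#using a regular expression
re_words_split = re.compile("(\w+)")
print re_words_split.findall(text)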

You can see that both the split function and the regular expression work pretty well. However, it is important to notice that the regular expression can mistakenly split words, such as "S.H.", into two words, while the split function doesn't remove punctuation. Therefore, if we want to be precise, we can use NLTK's tokenize package and remove punctuation from the results. Nevertheless, for our case, it is good enough to use the regular expression method to count words.


In [ ]:
sf['words_num'] = sf['text'].apply(lambda t: len(re_words_split.findall(t)))

We can also use NLTK to count the number of sentences in each story.


In [ ]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
def txt2sentences(txt, remove_none_english_chars=True):
    """
    Split the English text into sentences using NLTK
    :param txt: input text.
    :param remove_none_english_chars: if True then remove non-English chars from text
    :return: generator that yields the sentences of the original input text, one at a time.
    :rtype: generator
    """
    # decode to utf8 to avoid encoding problems - if someone has a better idea of how to solve the
    # encoding problem, I would love to learn about it.
    txt = txt.decode("utf8")
    # split the text into sentences using the NLTK package
    for s in tokenizer.tokenize(txt):
        if remove_none_english_chars:
            # remove non-English chars
            s = re.sub("[^a-zA-Z]", " ", s)
        yield s

sf['sentences_num'] = sf['text'].apply(lambda t: len(list(txt2sentences(t))))

In [ ]:
sf[['chars_num','words_num','sentences_num']].show()


Until now, I have calculated very basic text statistics. Let's try to do something more complicated, such as counting the number of times the words 'Sherlock', 'Watson', and 'Elementary' appear in each story. We will do this using GraphLab's text_analytics.count_words toolkit.

Note: To count how frequently a word appears in a text, one can also consider using the collections.Counter class.
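
For example, a quick check with collections.Counter on the sample sentence from above (a minimal illustration, separate from the pipeline used below):


In [ ]:
from collections import Counter
#count how many times each lower-cased token appears in the sample sentence
sample_counter = Counter(w.lower() for w in nltk.word_tokenize(text))
print sample_counter['watson']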


In [ ]:
sf['words_count'] = gl.text_analytics.count_words(sf['text'], to_lower=True)
sf['sherlock_count'] = sf['words_count'].apply(lambda d: d.get('sherlock',0))
sf['watson_count'] = sf['words_count'].apply(lambda d: d.get('watson',0))
sf['elementary_count'] = sf['words_count'].apply(lambda d: d.get('elementary',0))
sf[['sherlock_count', 'watson_count', 'elementary_count']].show()


It is nice to see that the mean number of times the word 'Sherlock' appears in the stories is 9.609, while the mean for the word 'Watson' is only 1.734. Moreover, there are stories, such as The Adventure of the Lion's Mane, in which the word 'Sherlock' doesn't appear even once.
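
For example, we can quickly list those stories by filtering on the count column (a small sanity check):


In [ ]:
#titles of the stories in which the word 'sherlock' does not appear at all
print sf[sf['sherlock_count'] == 0]['title']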

Let's try to use simple linear regression to predict the number of times the word 'Sherlock' appears in a text based on the number of times the word 'Watson' appears in the text.


In [ ]:
linear_reg = gl.linear_regression.create(sf, target='sherlock_count', features=['watson_count'])
linear_reg.show()


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 64
PROGRESS: Number of features          : 1
PROGRESS: Number of unpacked features : 1
PROGRESS: Number of coefficients    : 2
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 1.005666     | 81.541394          | 14.740310     |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

According to the simple linear regression, we have the following equation:

$$sherlock_{count} = 2.1404 \cdot watson_{count} + 5.8972$$
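
As a quick sanity check of the fitted model, we can feed it a couple of illustrative 'watson_count' values and compare the predictions with the equation above (a minimal sketch; the exact numbers depend on the fitted coefficients):


In [ ]:
#predict 'sherlock_count' for illustrative 'watson_count' values of 0 and 10
print linear_reg.predict(gl.SFrame({'watson_count': [0, 10]}))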

There are a lot of other really interesting insights that one can discover using a similar methodology; I leave the reader to discover them on their own. Let's move to the next section and create some nice graphs using various entity extraction tools.

3. Constructing Social Networks

"Listen, what I said before John, I meant it. I don’t have friends; I’ve just got one."
                                               -Sherlock, The Hounds of Baskerville, 2012 

One of my main fields of interest is social networks. I love to study and visualize graphs of various types of networks. One of the nice studies that I read not long ago shows that it is possible to create the social network of book characters. For example, Miranda et al. built and analyzed a social network utilizing the Odyssey of Homer.

To manually create a precise social network of Sherlock Holmes characters, we can read the stories and, whenever two characters have a conversation or appear in the same scene, add nodes with the two characters' names to the network (if they are not in the network already) and create a link between the two characters. In case we want to create a weighted social network, we can also add a weight to each link with the number of times the two characters talked to each other.

When processing a large text corpus, manually using this process to construct a social network can be very time-consuming. Therefore, we would like to perform this process automatically. One way to construct the social network is by using various NLP algorithms that analyze the text and "understand" the relationships between two entities. However, I am not familiar with open source tools that can analyze a text corpus and infer the connections between two entities with high precision.

In this section, I will demonstrate some very simple techniques that can be utilized to study the social connections among characters in Sherlock Holmes stories. These techniques won't create the most precise social network. However, the created network is sufficient to observe some interesting insights about the relationships among the stories' characters.

3.1 Constructing Social Network using Names List

Using this technique, we will split the downloaded Sherlock Holmes stories into sentences and, using a predefined list of names of book characters, create a social network with links among the stories' characters by adding a link between each two characters that appear in the same sentence. Let's start constructing the social network by splitting the stories into sentences.


In [ ]:
import graphlab as gl
import re,nltk
gl.canvas.set_target('ipynb')

BASE_DIR = "/Users/pablo/fromHomeMac/sherlock/data" # NOTE: Update BASE_DIR to your own directory path

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
def txt2sentences(txt, remove_none_english_chars=True):
    """
    Split the English text into sentences using NLTK
    :param txt: input text.
    :param remove_none_english_chars: if True then remove non-English chars from text
    :return: generator that yields the sentences of the original input text, one at a time.
    :rtype: generator
    """
    txt = txt.decode("utf8")
    # split the text into sentences using the NLTK package
    for s in tokenizer.tokenize(txt):
        if remove_none_english_chars:
            # remove non-English chars
            s = re.sub("[^a-zA-Z]", " ", s)
        yield s
        
sf = gl.load_sframe("%s/books.sframe" % BASE_DIR)
sf['sentences'] = sf['text'].apply(lambda t: list(txt2sentences(t)))

In [ ]:
sf_sentences = sf.flat_map(['title', 'text'], lambda t: [[t['title'],s.strip()] for s in txt2sentences(t['text'])])
sf_sentences = sf_sentences.rename({'text': 'sentence'})
re_words_split = re.compile("(\w+)")

#split each sentence into words
sf_sentences['words'] = sf_sentences['sentence'].apply(lambda s:re_words_split.findall(s))
sf_sentences.save("%s/sentences.sframe" % BASE_DIR)
sf_sentences.head(3)


Out[ ]:
+-------------------------------+-------------------------------+-------------------------------+
|             title             |            sentence           |             words             |
+-------------------------------+-------------------------------+-------------------------------+
| The Adventures of Sherlock... | THE ADVENTURES OF SHERLOCK... | [THE, ADVENTURES, OF, SHER... |
| The Adventures of Sherlock... | I have seldom heard him me... | [I, have, seldom, heard, h... |
| The Adventures of Sherlock... | In his eyes she eclipses a... | [In, his, eyes, she, eclip... |
+-------------------------------+-------------------------------+-------------------------------+
[3 rows x 3 columns]

We created an SFrame named sf_sentences in which each row contains a single sentence. Now let's find out which two or more characters from the following link appear in the same sentence. Notice that we only use the characters' unique names so we don't mix up characters with similar names. For example, the name Holmes can represent both Sherlock Holmes and Mycroft Holmes.


In [ ]:
main_characters_set = set(["Irene","Mycroft","Lestrade","Sherlock","Moran","Moriarty","Watson" ])
sf_sentences['characters'] = sf_sentences['words'].apply(lambda w: list(set(w) & main_characters_set))

Now the 'characters' column contains the names of the main characters that appear together in the same sentence. Let's use this information to create the characters' social network by constructing an SGraph object.


In [ ]:
import itertools
from collections import Counter
from graphlab import SGraph, Vertex, Edge

def get_characters_graph(sf, min_edge_strength=1):
    """
    Constructs a social network from an input SFrame. In the social network the vertices are the characters,
    and edges exist only between characters that appear in the same sentence at least min_edge_strength times
    :param sf: input SFrame object that contains a 'characters' column
    :param min_edge_strength: minimal connection strength between two characters.
    :return: SGraph object constructed from the input SFrame. The graph only contains edges with
        at least the input minimal strength between the characters.
    :rtype: gl.SGraph
    """
    #filter out sentences with fewer than two characters
    sf['characters_num'] = sf['characters'].apply(lambda l: len(l))
    sf = sf[sf['characters_num'] > 1]
    characters_links = []
    for l in sf['characters']:
        # if there are two or more characters in the same sentence, create all link combinations between
        # the characters (order doesn't matter)
        characters_links += itertools.combinations(l,2)

    #calculating the connection strength between each two characters
    c = Counter(characters_links)
    g = SGraph()

    edges_list = []
    for l,s in c.iteritems():
        if s < min_edge_strength:
            # filter out connections that appear fewer than min_edge_strength times
            continue
        edges_list.append(Edge(l[0], l[1], attr={'strength':s}))

    g = g.add_edges(edges_list)
    return g

g = get_characters_graph(sf_sentences)
g.show(vlabel="__id", elabel="strength", node_size=200)


According to Sherlock's social network, it can be noticed that Sherlock has two main social circles. The first is a circle of friends that includes Mycroft and Lestrade. Additionally, he has a circle of enemies that includes Moriarty and Moran. We can also notice that Watson is strongly connected both to Sherlock and to Sherlock's nemesis Moriarty.
Let's repeat the experiment, only this time we also add minor characters from the following link.


In [ ]:
minor_characters_set = set(["Irene","Mycroft","Lestrade","Sherlock","Moran","Moriarty","Watson","Baynes","Billy","Bradstreet","Gregson"
                            ,"Hopkins","Hudson","Shinwell","Athelney","Mary","Langdale","Toby","Wiggins"])

sf_sentences['characters'] = sf_sentences['words'].apply(lambda w: list(set(w) & minor_characters_set))
sf_sentences['characters_num'] = sf_sentences['characters'].apply(lambda l: len(l))
sf_sentences = sf_sentences[sf_sentences['characters_num'] > 1]

g = get_characters_graph(sf_sentences)
g.show(vlabel="__id", elabel="strength", node_size=200)


We got a more complex social network with the additional minor characters. I believe this social network graph can be improved by increasing the scope of the characters search from a single sentence to multiple sentences (one possible approach is sketched below), or by using characters' additional names and nicknames. I leave it to the reader to try to improve the graph further.
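
Here is a hedged sketch of the multi-sentence idea: build character co-occurrences over a sliding window of consecutive sentences instead of a single sentence (the window_size and min_edge_strength values are illustrative, and it assumes sf_sentences still holds one row per sentence in reading order):


In [ ]:
import itertools

def get_windowed_characters(sf, characters_set, window_size=3):
    """Return a list of character lists, one per window of consecutive sentences."""
    words_lists = list(sf['words'])
    windows = []
    for i in range(len(words_lists) - window_size + 1):
        #collect all the words that appear in the current window of sentences
        window_words = set(itertools.chain.from_iterable(words_lists[i:i + window_size]))
        windows.append(list(window_words & characters_set))
    return windows

sf_windows = gl.SFrame({'characters': get_windowed_characters(sf_sentences, minor_characters_set)})
g_windows = get_characters_graph(sf_windows, min_edge_strength=2)
g_windows.show(vlabel="__id", elabel="strength", node_size=200)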

3.2 Constructing Social Network using Named Entity Recognition

One of the disadvantages of the above method is that you need a predefined list of names to create the social network. However, in many cases this list is unavailable. Therefore, we need another method to find entities in the text. One common method to achieve this is Named Entity Recognition (NER). By using NER algorithms, we can classify elements in the text into predefined categories, such as the names of persons, organizations, and locations. There are many tools that can perform NER, such as OpenNLP, Stanford Named Entity Recognizer, and Rosette Entity Extractor. In this notebook, we will use the Stanford Named Entity Recognizer via NLTK. We will use NER algorithms to automatically construct an entity list of the most common characters in the books.

Please note that making NLTK run the Stanford Named Entity Recognizer can be non-trivial. For more details on how to make NLTK work with the Stanford Named Entity Recognizer, please read the information provided in the following links: 1, 2, & 3.

NOTE: running the next code section can take several minutes.


In [ ]:
from nltk.tag import StanfordNERTagger

sf_books =  gl.load_sframe("%s/books.sframe" % BASE_DIR)

#IMPORTANT: the directory that includes the Stanford Named Entity Recognizer files needs to be updated
# according to the local installation directory
#STANFORD_DIR = BASE_DIR + "/stanford-ner-2015-06-16/"

#we need to pass as parameters the path to the classifier we want to use and the path to stanford-ner.jar
st = StanfordNERTagger('/Users/pablo/fromHomeMac/sherlock/stanford-ner-2015-04-20/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/Users/pablo/Downloads/stanford-ner-2015-04-20/stanford-ner.jar')
st.java_options = "-Xmx4096m"

sf_books['sentences'] = sf_books['text'].apply(lambda t: list(txt2sentences(t)))
sf_books['words'] = sf_books['sentences'].apply(lambda l: [re_words_split.findall(s) for s in l])
sf_books['NER'] = sf_books['words'].apply(lambda w: st.tag_sents(w))
sf_books['person'] = sf_books['NER'].apply(lambda n: [e[0] for s in n for e in s if e[1] == 'PERSON'])

person_list = []
for p in sf_books['person']:
    person_list += p
    
print len(set(person_list))


1248

In [ ]:
from collections import Counter
c = Counter(person_list)
# We remove some mistakenly classified words, overly common names, etc. to make the constructed social network
# more readable.
characters_set = set(i[0] for i in c.most_common(200)) - set(['the', 'You', 'Mrs', 'He', 'Dr', 'me','did', 'Mr', 
                                      'Now', 'My', 'Miss', 'of', 'Sir', 'Here', 'All', 'Our', 'sir',
                                      'man', 'father', 'What', 'There', 'When', 'no', 'Lord', 'you', 'St',
                                      'John', 'James',  'Holmes', 'Arthur', 'Conan', 'Doyle', 'Lady'])

sf_sentences = gl.load_sframe("%s/sentences.sframe" % BASE_DIR)
sf_sentences['characters'] = sf_sentences['words'].apply(lambda w: list(set(w) & characters_set))
sf_sentences['characters_num'] = sf_sentences['characters'].apply(lambda l: len(l))
sf_sentences = sf_sentences[sf_sentences['characters_num'] > 1]
g = get_characters_graph(sf_sentences, min_edge_strength=3)
print g.summary()
g.show(vlabel="__id", elabel="strength", node_size=200)


{'num_edges': 351, 'num_vertices': 184}

In [ ]:
# adding a function to clean the graph, as in some cases the Stanford NER tags 'I' as a person.
def clean_graph(g, remove_entities_set):
    vertices = g.vertices[g.vertices["__id"].apply(lambda v: v not in remove_entities_set)] 
    edges = g.edges[g.edges.apply(lambda e: e["__src_id"] not in remove_entities_set and e["__dst_id"] not in remove_entities_set)]
    return gl.SGraph(vertices, edges)

In [ ]:
#cleaning the graph and displaying it again
g = clean_graph(g, {"I"})
g.show(vlabel="__id", elabel="strength", node_size=200)


The NER algorithm did a pretty good job, and most of the names of the identified entities look logical (at least to me). Additionally, we can understand the links between the various book characters. We can also notice that in many of the graph's components that have only two vertices, the connection is between a character's first and last names. Let's use the GraphLab graph_analytics toolkit and focus on the social network's largest component.


In [ ]:
def get_graph_largest_component(g):
    """
    Returns a graph with the largest component of the input graph
    :param g: input graph (SGraph object)
    :return: a graph of the largest component in the input graph
    :rtype: gl.SGraph
    """

    cc = gl.connected_components.create(g)
    #add to each vertex its component id
    g.vertices['component_id'] = cc['graph'].vertices['component_id']
    # find the component id of the largest component
    largest_component_id = cc['component_size'].sort('Count', ascending=False)[0]['component_id']
    largest_component_vertices = g.vertices.filter_by(largest_component_id, 'component_id')['__id']
    h = g.get_neighborhood(largest_component_vertices, 1)
    return h

h = get_graph_largest_component(g)
h.show(vlabel="__id", elabel="strength", node_size=300)


PROGRESS: +-----------------------------+
PROGRESS: | Number of components merged |
PROGRESS: +-----------------------------+
PROGRESS: | 166                         |
PROGRESS: | 0                           |
PROGRESS: +-----------------------------+

In [ ]:
h = clean_graph(h, {"I"})
h.show(vlabel="__id", elabel="strength", node_size=300)


According to the above graph, which can be created almost automatically, we can easily identify the main characters in the stories. Additionally, we can observe that the strongest connection is between Sherlock and Watson. Moreover, we can see various connections among the main and minor characters of the books. However, from only looking at the graph, it is non-trivial to understand the various communities and their relationships.

Using similar methods, we can learn more about each character by finding connections between persons and locations, and between persons and organizations; a possible starting point is sketched below. I leave it to the reader to find additional insights about the various characters on their own.
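
Here is a hedged sketch of one way to start: pair the PERSON and LOCATION entities that the Stanford NER tagged within the same sentence, reusing the 'NER' column computed above (the helper function is my own addition):


In [ ]:
from collections import Counter

def person_location_pairs(ner_sentences):
    """Return (person, location) pairs for entities tagged within the same sentence."""
    pairs = []
    for sentence in ner_sentences:
        persons = set(w for w, tag in sentence if tag == 'PERSON')
        locations = set(w for w, tag in sentence if tag == 'LOCATION')
        pairs += [(p, loc) for p in persons for loc in locations]
    return pairs

person_location_links = []
for ner in sf_books['NER']:
    person_location_links += person_location_pairs(ner)
print Counter(person_location_links).most_common(10)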

4. Topic Model

"I have known him for some time," said I, "but I never knew him do anything yet without a very good reason, and with that our conversation drifted off on to other topics.
                                                                 -Memoirs of Sherlock Holmes

According to the Wikipedia article, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. I personally find topic models an interesting tool for exploring a large text corpus. In this section, we are going to demonstrate how it is possible to utilize GraphLab's topic model toolkit with the pyLDAvis package to uncover topics in a set of documents. Namely, we will use GraphLab's topic model toolkit to analyze paragraphs in Sherlock Holmes stories.

We will start by separating each story into paragraphs.


In [ ]:
import graphlab as gl
import re

BASE_DIR = "/Users/pablo/fromHomeMac/sherlock/data" # NOTE: Update BASE_DIR to your own directory path
sf =  gl.load_sframe("%s/books.sframe" % BASE_DIR)
sf_paragraphs = sf.flat_map(['title', 'text'], lambda t: [[t['title'],p.strip()] for p in t['text'].split("\n\n")])
sf_paragraphs = sf_paragraphs.rename({'text': 'paragraph'})

Let's calculate the number of words in each paragraph, and filter out the paragraphs that have fewer than 25 words.


In [ ]:
re_words_split = re.compile("(\w+)")
sf_paragraphs['paragraph_words_number'] = sf_paragraphs['paragraph'].apply(lambda p: len(re_words_split.findall(p)) )
sf_paragraphs = sf_paragraphs[sf_paragraphs['paragraph_words_number'] >=25]

Using the stories' paragraphs as documents, we can utilize GraphLab's topic model toolkit to discover topics that appear in these paragraphs. We create a topic model with 10 topics to learn.

Note: the topic model results may be different in each run.


In [ ]:
docs =  gl.text_analytics.count_ngrams(sf_paragraphs['paragraph'], n=1)
stopwords = gl.text_analytics.stopwords()
# adding some additional stopwords to make the topic model more clear
stopwords |= set(['man', 'mr', 'sir', 'make', 'made', 'll', 'door', 'long', 'day', 'small']) 
docs = docs.dict_trim_by_keys(stopwords, exclude=True)	
docs = docs.dropna()
topic_model = gl.topic_model.create(docs, num_topics=10)


PROGRESS: Learning a topic model
PROGRESS:        Number of documents     11657
PROGRESS:            Vocabulary size     18734
PROGRESS:    Running collapsed Gibbs sampling
PROGRESS: +-----------+---------------+----------------+-----------------+
PROGRESS: | Iteration | Elapsed Time  | Tokens/Second  | Est. Perplexity |
PROGRESS: +-----------+---------------+----------------+-----------------+
PROGRESS: | 10        | 691.907ms     | 5.10963e+06    | 0               |
PROGRESS: +-----------+---------------+----------------+-----------------+

Let's view the most common words in each topic.


In [ ]:
topic_model.get_topics().print_rows(100)
topic_model.save("%s/topic_model" % BASE_DIR)


+-------+----------+------------------+
| topic |   word   |      score       |
+-------+----------+------------------+
|   0   |  watson  | 0.0249172864212  |
|   0   |   case   | 0.0183231362925  |
|   0   |  matter  | 0.0131338268433  |
|   0   |  young   | 0.0101521241764  |
|   0   |   dear   | 0.00934935807382 |
|   1   |  heard   | 0.0148204289721  |
|   1   |    "     | 0.0140823880733  |
|   1   |  london  | 0.0124359891451  |
|   1   |   put    | 0.0107612040286  |
|   1   |   back   | 0.0107044316517  |
|   2   |   life   | 0.0126994044378  |
|   2   |   room   | 0.0113810279897  |
|   2   |  thing   | 0.00934614260247 |
|   2   |  clear   | 0.00888757688141 |
|   2   |   head   | 0.00791312472414 |
|   3   |   room   |  0.026319693207  |
|   3   |   hand   | 0.0229133243962  |
|   3   |   left   | 0.0192597191394  |
|   3   |   side   | 0.0136831637474  |
|   3   |   half   | 0.0133809858691  |
|   4   | morning  | 0.0217436676387  |
|   4   |   back   |  0.01319229855   |
|   4   |  window  | 0.0118008893424  |
|   4   |  round   | 0.0100906155246  |
|   4   |  young   | 0.00832236632326 |
|   5   |  night   | 0.0180842149072  |
|   5   |  light   | 0.0142598435382  |
|   5   | brought  | 0.0112509631229  |
|   5   |  street  | 0.0100980276367  |
|   5   |   told   | 0.00857952724021 |
|   6   |  holmes  | 0.0378032415173  |
|   6   |   time   | 0.0299076826969  |
|   6   | thought  |  0.019438113858  |
|   6   | sherlock | 0.0148685230388  |
|   6   |  found   | 0.0132199997686  |
|   7   |    "     |  0.206028291106  |
|   7   |    "i    | 0.0349375618017  |
|   7   |   "it    |  0.019365019214  |
|   7   |   "the   | 0.00912148359554 |
|   7   |   wife   | 0.00891424443563 |
|   8   |    "     |  0.111275157582  |
|   8   |  holmes  | 0.0559314006385  |
|   8   |   face   | 0.0275724760557  |
|   8   |    "i    | 0.0234501993895  |
|   8   |   "you   | 0.0146209259627  |
|   9   |  great   | 0.0149938904351  |
|   9   |  house   |  0.013777634467  |
|   9   |  friend  | 0.0115148326658  |
|   9   |   work   |  0.010920847193  |
|   9   |  "well   | 0.0100157264725  |
+-------+----------+------------------+
[50 rows x 3 columns]

Reading the above table, we can understand some of the topics. However, it is still hard to get a good overall overview. Therefore, we will use the excellent pyLDAvis package, developed by Ben Mabey, to better visualize the various topics in the books.


In [ ]:
import pyLDAvis
import pyLDAvis.graphlab
pyLDAvis.enable_notebook()
pyLDAvis.graphlab.prepare(topic_model, docs)


Out[ ]:
[pyLDAvis interactive topic-model visualization]

From the above visualization, we can observe that the algorithm returned pretty interesting results. For example, one identified topic is related to Watson, locations (room, street, house, etc.), and time (days, hours, etc.), while another topic is related to Holmes, men, and murder. I recommend the reader to try to investigate the results by themselves; one way to start is sketched below. Moreover, I think that running the topic model algorithm on other text corpora can help to better understand this algorithm's advantages.
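
As one small experiment (a hedged sketch, reusing the stopwords set and topic_model from above), we can ask the trained model which topic it assigns to a new piece of text:


In [ ]:
#assign a topic to a made-up paragraph; predict returns the most probable topic id per document
sample = gl.SArray(["Holmes examined the room in the dim light and turned quietly to the inspector."])
sample_docs = gl.text_analytics.count_ngrams(sample, n=1).dict_trim_by_keys(stopwords, exclude=True)
print topic_model.predict(sample_docs)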

5. Finding Similar Paragraphs using Word2Vec

"I have notes of several similar cases, though none, as I remarked before, which were quite as prompt. My whole examination served to 
turn my conjecture into a certainty. Circumstantial evidence is occasionally very convincing, as when you find a trout in the milk, to 
quote Thoreau's example."
                                                                                                 -The Adventure of the Noble Bachelor

These days, no NLP related post can be complete without including the words "deep learning." Therefore, in this section I will demonstrate how to use the Word2Vec deep-learning-inspired algorithm to search for paragraphs that have similar text or writing style.

First, let's build a Word2Vec model using Sherlock's stories. We will construct the Word2Vec model using the Gensim package and a similar method to the one presented in Word2vec Tutorial and in my previous post.


In [ ]:
import graphlab as gl
import urllib2
import gensim
import nltk
import re

txt = urllib2.urlopen("https://sherlock-holm.es/stories/plain-text/cnus.txt").read()

re_words_split = re.compile("(\w+)")
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
def txt2words(s):
    s = re.sub("[^a-zA-Z]", " ", s).lower()
    return re_words_split.findall(s)

class MySentences(object):
        def __init__(self, txt):
            self._txt = txt.decode("utf8") 
            
        def __iter__(self):
            """
            Split the English text into sentences and then into words using NLTK
            :return: yields the list of words of each sentence in the original input text.
            :rtype: generator
            """
            # split the text into sentences using the NLTK package
            for s in tokenizer.tokenize(self._txt):
                yield txt2words(s)

sentences = MySentences(txt)
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=3, workers=4)


/Users/pablo/anaconda/envs/dato-env/lib/python2.7/site-packages/numpy/lib/utils.py:99: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
  warnings.warn(depdoc, DeprecationWarning)

We now have a trained Word2Vec model; let's see if it gives reasonable results:


In [ ]:
print model.most_similar("watson")
print model.most_similar("holmes")


[(u'mortimer', 0.7463816404342651), (u'dear', 0.6983240842819214), (u'colleague', 0.6832994222640991), (u'oh', 0.6541128158569336), (u'tut', 0.6502598524093628), (u'trevelyan', 0.6456252336502075), (u'ah', 0.6448663473129272), (u'jack', 0.6376811861991882), (u'god', 0.6342098712921143), (u'sake', 0.6111361980438232)]
[(u'lestrade', 0.7618856430053711), (u'mac', 0.7348191738128662), (u'mcmurdo', 0.7301948070526123), (u'phelps', 0.7230634689331055), (u'melas', 0.7184404134750366), (u'baynes', 0.7174237966537476), (u'douglas', 0.7089315056800842), (u'cornelius', 0.6709353923797607), (u'sternly', 0.6695888042449951), (u'barker', 0.6632921695709229)]

We find that the most similar word to Watson is Mortimer, and the most similar word to Holmes is Lestrade. These results sound logical enough.
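
We can run a couple more quick checks using gensim's similarity utilities (a small aside; the exact scores will vary between training runs):


In [ ]:
#cosine similarity between the vectors of two words
print model.similarity("sherlock", "holmes")
print model.similarity("holmes", "watson")
#find the word that doesn't fit with the others
print model.doesnt_match("holmes watson lestrade pipe".split())

Let us now calculate the average vector of each paragraph.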


In [ ]:
import graphlab as gl
import re
import numpy as np

BASE_DIR = r"/Users/pablo/fromHomeMac/sherlock/data" # NOTE: Update BASE_DIR to your own directory path
sf =  gl.load_sframe("%s/books.sframe" % BASE_DIR)
sf_paragraphs = sf.flat_map(['title', 'text'], lambda t: [[t['title'],p.strip()] for p in t['text'].split("\n\n")])
sf_paragraphs = sf_paragraphs.rename({'text': 'paragraph'})
sf_paragraphs['paragraph_words_number'] = sf_paragraphs['paragraph'].apply(lambda p: len(re_words_split.findall(p)) )
sf_paragraphs = sf_paragraphs[sf_paragraphs['paragraph_words_number'] >=25]

def txt2avg_vector(txt, w2v_model):
    words = [w for w in txt2words(txt.lower()) if w in w2v_model]
    v = np.mean([w2v_model[w] for w in words],axis=0)    
    return v

sf_paragraphs['mean_vector'] = sf_paragraphs['paragraph'].apply(lambda p: txt2avg_vector(p, model))

Now we have the mean vector value of each paragraph. Let's utilize the GraphLab Create nearest neighbors toolkit to identify paragraphs that have similar text or writing style. We will achieve that by calculating the nearest neighbors of each paragraph's mean vector.


In [ ]:
#constructing the nearest neighbors model
nn_model = gl.nearest_neighbors.create(sf_paragraphs, features=['mean_vector'])

#calculating the two nearest neighbors of each paragraph from all the paragraphs
r = nn_model.query(sf_paragraphs, k=2)
r.head(10)


PROGRESS: Starting ball tree nearest neighbors model training.
PROGRESS: +------------+--------------+
PROGRESS: | Tree level | Elapsed Time |
PROGRESS: +------------+--------------+
PROGRESS: | 0          | 166.646ms    |
PROGRESS: | 1          | 317.073ms    |
PROGRESS: | 2          | 470.327ms    |
PROGRESS: | 3          | 621.256ms    |
PROGRESS: | 4          | 694.293ms    |
PROGRESS: +------------+--------------+
PROGRESS: +--------------+-------------+--------------+
PROGRESS: | Query points | % Complete. | Elapsed Time |
PROGRESS: +--------------+-------------+--------------+
PROGRESS: | 1            | 0           | 81.39ms      |
PROGRESS: | ...          | ...         | ...          |
PROGRESS: | 7853         | 67.25       | 4m 40s       |
PROGRESS: | 7878         | 67.5        | 4m 41s       |
PROGRESS: | 7905         | 67.75       | 4m 42s       |
PROGRESS: | 7931         | 68          | 4m 43s       |
PROGRESS: | 7955         | 68          | 4m 44s       |
PROGRESS: | 7977         | 68.25       | 4m 45s       |
PROGRESS: | 8006         | 68.5        | 4m 46s       |
PROGRESS: | 8028         | 68.75       | 4m 47s       |
PROGRESS: | 8050         | 69          | 4m 48s       |
PROGRESS: | 8075         | 69.25       | 4m 49s       |
PROGRESS: | 8097         | 69.25       | 4m 50s       |
PROGRESS: | 8123         | 69.5        | 4m 51s       |
PROGRESS: | 8149         | 69.75       | 4m 52s       |
PROGRESS: | 8172         | 70          | 4m 53s       |
PROGRESS: | 8201         | 70.25       | 4m 54s       |
PROGRESS: | 8231         | 70.5        | 4m 55s       |
PROGRESS: | 8256         | 70.75       | 4m 56s       |
PROGRESS: | 8282         | 71          | 4m 57s       |
PROGRESS: | 8307         | 71.25       | 4m 58s       |
PROGRESS: | 8337         | 71.5        | 4m 59s       |
PROGRESS: | 8363         | 71.5        | 5m 0s        |
PROGRESS: | 8386         | 71.75       | 5m 1s        |
PROGRESS: | 8403         | 72          | 5m 2s        |
PROGRESS: | 8424         | 72.25       | 5m 3s        |
PROGRESS: | 8445         | 72.25       | 5m 4s        |
PROGRESS: | 8467         | 72.5        | 5m 5s        |
PROGRESS: | 8483         | 72.75       | 5m 6s        |
PROGRESS: | 8503         | 72.75       | 5m 7s        |
PROGRESS: | 8522         | 73          | 5m 8s        |
PROGRESS: | 8538         | 73          | 5m 9s        |
PROGRESS: | 8557         | 73.25       | 5m 10s       |
PROGRESS: | 8572         | 73.5        | 5m 11s       |
PROGRESS: | 8592         | 73.5        | 5m 12s       |
PROGRESS: | 8611         | 73.75       | 5m 13s       |
PROGRESS: | 8634         | 74          | 5m 14s       |
PROGRESS: | 8649         | 74          | 5m 15s       |
PROGRESS: | 8675         | 74.25       | 5m 16s       |
PROGRESS: | 8697         | 74.5        | 5m 17s       |
PROGRESS: | 8722         | 74.75       | 5m 18s       |
PROGRESS: | 8743         | 75          | 5m 19s       |
PROGRESS: | 8761         | 75          | 5m 20s       |
PROGRESS: | 8784         | 75.25       | 5m 21s       |
PROGRESS: | 8808         | 75.5        | 5m 22s       |
PROGRESS: | 8832         | 75.75       | 5m 23s       |
PROGRESS: | 8858         | 75.75       | 5m 24s       |
PROGRESS: | 8878         | 76          | 5m 25s       |
PROGRESS: | 8903         | 76.25       | 5m 26s       |
PROGRESS: | 8927         | 76.5        | 5m 27s       |
PROGRESS: | 8955         | 76.75       | 5m 28s       |
PROGRESS: | 8979         | 77          | 5m 29s       |
PROGRESS: | 9008         | 77.25       | 5m 30s       |
PROGRESS: | 9030         | 77.25       | 5m 31s       |
PROGRESS: | 9052         | 77.5        | 5m 32s       |
PROGRESS: | 9069         | 77.75       | 5m 33s       |
PROGRESS: | 9089         | 77.75       | 5m 34s       |
PROGRESS: | 9112         | 78          | 5m 35s       |
PROGRESS: | 9135         | 78.25       | 5m 36s       |
PROGRESS: | 9161         | 78.5        | 5m 37s       |
PROGRESS: | 9185         | 78.75       | 5m 38s       |
PROGRESS: | 9208         | 78.75       | 5m 39s       |
PROGRESS: | 9227         | 79          | 5m 40s       |
PROGRESS: | 9248         | 79.25       | 5m 41s       |
PROGRESS: | 9269         | 79.5        | 5m 42s       |
PROGRESS: | 9292         | 79.5        | 5m 43s       |
PROGRESS: | 9315         | 79.75       | 5m 44s       |
PROGRESS: | 9332         | 80          | 5m 45s       |
PROGRESS: | 9356         | 80.25       | 5m 46s       |
PROGRESS: | 9376         | 80.25       | 5m 47s       |
PROGRESS: | 9394         | 80.5        | 5m 48s       |
PROGRESS: | 9408         | 80.5        | 5m 49s       |
PROGRESS: | 9426         | 80.75       | 5m 50s       |
PROGRESS: | 9448         | 81          | 5m 51s       |
PROGRESS: | 9469         | 81          | 5m 52s       |
PROGRESS: | 9491         | 81.25       | 5m 53s       |
PROGRESS: | 9510         | 81.5        | 5m 54s       |
PROGRESS: | 9534         | 81.75       | 5m 55s       |
PROGRESS: | 9562         | 82          | 5m 56s       |
PROGRESS: | 9583         | 82          | 5m 57s       |
PROGRESS: | 9605         | 82.25       | 5m 58s       |
PROGRESS: | 9628         | 82.5        | 5m 59s       |
PROGRESS: | 9643         | 82.5        | 6m 0s        |
PROGRESS: | 9669         | 82.75       | 6m 1s        |
PROGRESS: | 9691         | 83          | 6m 2s        |
PROGRESS: | 9712         | 83.25       | 6m 3s        |
PROGRESS: | 9737         | 83.5        | 6m 4s        |
PROGRESS: | 9755         | 83.5        | 6m 5s        |
PROGRESS: | 9771         | 83.75       | 6m 6s        |
PROGRESS: | 9794         | 84          | 6m 7s        |
PROGRESS: | 9814         | 84          | 6m 8s        |
PROGRESS: | 9837         | 84.25       | 6m 9s        |
PROGRESS: | 9858         | 84.5        | 6m 11s       |
PROGRESS: | 9875         | 84.5        | 6m 11s       |
PROGRESS: | 9897         | 84.75       | 6m 12s       |
PROGRESS: | 9919         | 85          | 6m 13s       |
PROGRESS: | 9945         | 85.25       | 6m 14s       |
PROGRESS: | 9969         | 85.5        | 6m 15s       |
PROGRESS: | 9994         | 85.5        | 6m 16s       |
PROGRESS: | 10016        | 85.75       | 6m 17s       |
PROGRESS: | 10040        | 86          | 6m 18s       |
PROGRESS: | 10062        | 86.25       | 6m 19s       |
PROGRESS: | 10089        | 86.5        | 6m 20s       |
PROGRESS: | 10108        | 86.5        | 6m 21s       |
PROGRESS: | 10132        | 86.75       | 6m 22s       |
PROGRESS: | 10153        | 87          | 6m 23s       |
PROGRESS: | 10176        | 87.25       | 6m 24s       |
PROGRESS: | 10198        | 87.25       | 6m 25s       |
PROGRESS: | 10224        | 87.5        | 6m 27s       |
PROGRESS: | 10247        | 87.75       | 6m 27s       |
PROGRESS: | 10269        | 88          | 6m 28s       |
PROGRESS: | 10294        | 88.25       | 6m 29s       |
PROGRESS: | 10318        | 88.5        | 6m 30s       |
PROGRESS: | 10339        | 88.5        | 6m 32s       |
PROGRESS: | 10360        | 88.75       | 6m 33s       |
PROGRESS: | 10384        | 89          | 6m 34s       |
PROGRESS: | 10403        | 89          | 6m 35s       |
PROGRESS: | 10422        | 89.25       | 6m 36s       |
PROGRESS: | 10439        | 89.5        | 6m 36s       |
PROGRESS: | 10462        | 89.5        | 6m 37s       |
PROGRESS: | 10482        | 89.75       | 6m 39s       |
PROGRESS: | 10495        | 90          | 6m 40s       |
PROGRESS: | 10514        | 90          | 6m 41s       |
PROGRESS: | 10535        | 90.25       | 6m 41s       |
PROGRESS: | 10554        | 90.5        | 6m 42s       |
PROGRESS: | 10572        | 90.5        | 6m 43s       |
PROGRESS: | 10594        | 90.75       | 6m 44s       |
PROGRESS: | 10614        | 91          | 6m 46s       |
PROGRESS: | 10640        | 91.25       | 6m 46s       |
PROGRESS: | 10666        | 91.25       | 6m 47s       |
PROGRESS: | 10685        | 91.5        | 6m 49s       |
PROGRESS: | 10704        | 91.75       | 6m 50s       |
PROGRESS: | 10728        | 92          | 6m 51s       |
PROGRESS: | 10748        | 92          | 6m 52s       |
PROGRESS: | 10770        | 92.25       | 6m 53s       |
PROGRESS: | 10793        | 92.5        | 6m 54s       |
PROGRESS: | 10813        | 92.75       | 6m 55s       |
PROGRESS: | 10832        | 92.75       | 6m 56s       |
PROGRESS: | 10852        | 93          | 6m 57s       |
PROGRESS: | 10874        | 93.25       | 6m 58s       |
PROGRESS: | 10897        | 93.25       | 6m 59s       |
PROGRESS: | 10916        | 93.5        | 7m 0s        |
PROGRESS: | 10934        | 93.75       | 7m 1s        |
PROGRESS: | 10958        | 94          | 7m 2s        |
PROGRESS: | 10975        | 94          | 7m 3s        |
PROGRESS: | 10996        | 94.25       | 7m 4s        |
PROGRESS: | 11014        | 94.25       | 7m 5s        |
PROGRESS: | 11030        | 94.5        | 7m 6s        |
PROGRESS: | 11051        | 94.75       | 7m 7s        |
PROGRESS: | 11063        | 94.75       | 7m 8s        |
PROGRESS: | 11080        | 95          | 7m 9s        |
PROGRESS: | 11100        | 95          | 7m 10s       |
PROGRESS: | 11116        | 95.25       | 7m 11s       |
PROGRESS: | 11133        | 95.5        | 7m 12s       |
PROGRESS: | 11151        | 95.5        | 7m 13s       |
PROGRESS: | 11170        | 95.75       | 7m 14s       |
PROGRESS: | 11189        | 95.75       | 7m 15s       |
PROGRESS: | 11207        | 96          | 7m 16s       |
PROGRESS: | 11228        | 96.25       | 7m 17s       |
PROGRESS: | 11248        | 96.25       | 7m 18s       |
PROGRESS: | 11264        | 96.5        | 7m 19s       |
PROGRESS: | 11285        | 96.75       | 7m 20s       |
PROGRESS: | 11309        | 97          | 7m 21s       |
PROGRESS: | 11328        | 97          | 7m 22s       |
PROGRESS: | 11346        | 97.25       | 7m 23s       |
PROGRESS: | 11364        | 97.25       | 7m 24s       |
PROGRESS: | 11379        | 97.5        | 7m 25s       |
PROGRESS: | 11394        | 97.5        | 7m 26s       |
PROGRESS: | 11410        | 97.75       | 7m 27s       |
PROGRESS: | 11427        | 98          | 7m 28s       |
PROGRESS: | 11438        | 98          | 7m 29s       |
PROGRESS: | 11449        | 98          | 7m 30s       |
PROGRESS: | 11463        | 98.25       | 7m 31s       |
PROGRESS: | 11480        | 98.25       | 7m 32s       |
PROGRESS: | 11493        | 98.5        | 7m 33s       |
PROGRESS: | 11506        | 98.5        | 7m 34s       |
PROGRESS: | 11515        | 98.75       | 7m 35s       |
PROGRESS: | 11527        | 98.75       | 7m 36s       |
PROGRESS: | 11539        | 98.75       | 7m 37s       |
PROGRESS: | 11552        | 99          | 7m 38s       |
PROGRESS: | 11564        | 99          | 7m 39s       |
PROGRESS: | 11575        | 99.25       | 7m 40s       |
PROGRESS: | 11585        | 99.25       | 7m 41s       |
PROGRESS: | 11597        | 99.25       | 7m 42s       |
PROGRESS: | 11615        | 99.5        | 7m 43s       |
PROGRESS: | 11626        | 99.5        | 7m 44s       |
PROGRESS: | 11636        | 99.75       | 7m 45s       |
PROGRESS: | 11648        | 99.75       | 7m 46s       |
PROGRESS: | Done         |             | 7m 47s       |
PROGRESS: +--------------+-------------+--------------+
Out[ ]:
| query_label | reference_label | distance       | rank |
| 0           | 0               | 0.0            | 1    |
| 0           | 3326            | 0.106417322776 | 2    |
| 1           | 1               | 0.0            | 1    |
| 1           | 6453            | 0.0            | 2    |
| 2           | 2               | 0.0            | 1    |
| 2           | 6454            | 0.0            | 2    |
| 3           | 6455            | 0.0            | 1    |
| 3           | 3               | 0.0            | 2    |
| 4           | 6456            | 0.0            | 1    |
| 4           | 4               | 0.0            | 2    |
[10 rows x 4 columns]

Of course, the nearest neighbor of each paragraph is the paragraph itself. Therefore, let us filter out paragraph pairs with a distance of zero. Additionally, let's keep only pairs of paragraphs that are very close to each other (distance < 0.08).


In [ ]:
#filter out paragraph pairs that are exactly the same (distance of zero)
r = r[r['distance'] != 0]

#keep only paragraph pairs with distance < 0.08
r = r[r['distance'] < 0.08]
r


Out[ ]:
| query_label | reference_label | distance        | rank |
| 4358        | 4402            | 0.0756935211504 | 2    |
| 4402        | 4358            | 0.0756935211504 | 2    |
| 4664        | 8648            | 0.0771454554256 | 2    |
| 5202        | 3456            | 0.0750878786393 | 2    |
| 5569        | 5684            | 0.0762642005346 | 2    |
| 5684        | 5569            | 0.0762642005346 | 2    |
| 9146        | 9562            | 0.0377801868587 | 2    |
| 9562        | 9146            | 0.0377801868587 | 2    |
| 9652        | 8917            | 0.0687209701245 | 2    |
[? rows x 4 columns]
Note: Only the head of the SFrame is printed. This SFrame is lazily evaluated.
You can use len(sf) to force materialization.

Now, let's use join to match each query_label and reference_label value with its actual paragraph.


In [ ]:
# add row-number columns so each paragraph gets an index that matches the query/reference labels
sf_paragraphs = sf_paragraphs.add_row_number('query_label')
sf_paragraphs = sf_paragraphs.add_row_number('reference_label')

# join twice to attach both the query paragraph and the reference paragraph to each pair
sf_similar = r.join(sf_paragraphs, on="query_label").join(sf_paragraphs, on="reference_label")

In [ ]:
sf_similar[['paragraph','title', 'title.1', 'paragraph.1', 'distance']]


Out[ ]:
| paragraph                                             | title                             | title.1                           | paragraph.1                                           | distance        |
| The German was sent for but professed to know ...     | The Hound of the Baskervilles ... | His Last Bow                      | It was a blazing hot day in August. Baker Street ...  | 0.0750878786393 |
| At first this vague and terrible power was ...        | A Study In Scarlet                | A Study In Scarlet                | Had the wanderer remained awake for another half ...  | 0.0756935211504 |
| Had the wanderer remained awake for another half ...  | A Study In Scarlet                | A Study In Scarlet                | At first this vague and terrible power was ...        | 0.0756935211504 |
| "It only remains to indicate the part which ...       | The Hound of the Baskervilles ... | The Hound of the Baskervilles ... | Sir Henry was more pleased than surprise ...          | 0.0762642005346 |
| Sir Henry was more pleased than surprise ...          | The Hound of the Baskervilles ... | The Hound of the Baskervilles ... | "It only remains to indicate the part which ...       | 0.0762642005346 |
| At the mention of this gigantic sum we all ...        | The Sign of the Four              | The Empty House                   | All day I turned these facts over in my mind, ...     | 0.0771454554256 |
| Pictures for "The Adventure of the Missing ...        | The Missing Three-Quarter         | The Dancing Men                   | Pictures for "The Adventure of the Dancing ...        | 0.0687209701245 |
| Pictures for "The Adventure of the Golden ...         | The Golden Pince-Nez              | The Priory School                 | Pictures for "The Adventure of the Priory ...         | 0.0377801868587 |
| Pictures for "The Adventure of the Priory ...         | The Priory School                 | The Golden Pince-Nez              | Pictures for "The Adventure of the Golden ...         | 0.0377801868587 |
[9 rows x 5 columns]

Let's look at some of the similar paragraphs.


In [ ]:
print sf_similar[1]['paragraph']
print "-"*100
print sf_similar[1]['paragraph.1']


At first this vague and terrible power was exercised only upon the
     recalcitrants who, having embraced the Mormon faith, wished
     afterwards to pervert or to abandon it. Soon, however, it took a
     wider range. The supply of adult women was running short, and
     polygamy without a female population on which to draw was a barren
     doctrine indeed. Strange rumours began to be bandied about--rumours
     of murdered immigrants and rifled camps in regions where Indians had
     never been seen. Fresh women appeared in the harems of the
     Elders--women who pined and wept, and bore upon their faces the
     traces of an unextinguishable horror. Belated wanderers upon the
     mountains spoke of gangs of armed men, masked, stealthy, and
     noiseless, who flitted by them in the darkness. These tales and
     rumours took substance and shape, and were corroborated and
     re-corroborated, until they resolved themselves into a definite name.
     To this day, in the lonely ranches of the West, the name of the
     Danite Band, or the Avenging Angels, is a sinister and an ill-omened
     one.
----------------------------------------------------------------------------------------------------
Had the wanderer remained awake for another half hour a strange sight
     would have met his eyes. Far away on the extreme verge of the alkali
     plain there rose up a little spray of dust, very slight at first, and
     hardly to be distinguished from the mists of the distance, but
     gradually growing higher and broader until it formed a solid,
     well-defined cloud. This cloud continued to increase in size until it
     became evident that it could only be raised by a great multitude of
     moving creatures. In more fertile spots the observer would have come
     to the conclusion that one of those great herds of bisons which graze
     upon the prairie land was approaching him. This was obviously
     impossible in these arid wilds. As the whirl of dust drew nearer to
     the solitary bluff upon which the two castaways were reposing, the
     canvas-covered tilts of waggons and the figures of armed horsemen
     began to show up through the haze, and the apparition revealed itself
     as being a great caravan upon its journey for the West. But what a
     caravan! When the head of it had reached the base of the mountains,
     the rear was not yet visible on the horizon. Right across the
     enormous plain stretched the straggling array, waggons and carts, men
     on horseback, and men on foot. Innumerable women who staggered along
     under burdens, and children who toddled beside the waggons or peeped
     out from under the white coverings. This was evidently no ordinary
     party of immigrants, but rather some nomad people who had been
     compelled from stress of circumstances to seek themselves a new
     country. There rose through the clear air a confused clattering and
     rumbling from this great mass of humanity, with the creaking of
     wheels and the neighing of horses. Loud as it was, it was not
     sufficient to rouse the two tired wayfarers above them.

Although the text of the matched paragraphs is completely different, they still share several similar motifs. In the first paragraph a dog is leading his master, while in the second paragraph the boots take the dog's place. In both paragraphs the author uses somewhat similar motifs: "dreadful sight to see that huge black creature" and "saw something that made me feel sickish." I personally find these results quite interesting.

6. Where to Go From Here

"Thank you," said Holmes, "I only wished to ask you how you would go from here to the Strand."
                                                                              -The Red-Headed League

In this notebook, we presented a short and practical NLP tutorial that covered several common topics, such as NER, topic models, and Word2Vec. If you want to continue exploring this dataset yourself, there is a lot more that can be done. You can rerun this code on different texts (Harry Potter, Lord of the Rings, etc.). In addition, you can modify the above code to create social networks between persons and locations, or use GloVe instead of Word2Vec. Furthermore, you can run other graph theory algorithms, such as community detection, on the constructed social networks to uncover additional interesting insights (see the sketch below). We hope that the methods and code presented in this notebook can help you solve other text analysis tasks.
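
As an illustration of that last suggestion, below is a minimal sketch of running a community detection algorithm on the character social network. It assumes NetworkX is installed (it is not used elsewhere in this notebook) and provides greedy_modularity_communities, and that sg is the SGraph of characters built earlier; the variable name and the choice of algorithm are illustrative assumptions, not part of the original notebook.


In [ ]:
# A minimal sketch (assumption: `sg` is the character SGraph built earlier,
# and NetworkX with greedy_modularity_communities is available).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# get_edges() returns an SFrame with '__src_id' and '__dst_id' columns
edges = sg.get_edges()

# build an undirected NetworkX graph from the edge list
g = nx.Graph()
g.add_edges_from(zip(edges['__src_id'], edges['__dst_id']))

# print each detected community of characters
for i, community in enumerate(greedy_modularity_communities(g)):
    print i, sorted(community)


Communities found this way should roughly group characters that tend to appear together in the same stories.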

7. Further Reading