Pride & Prejudice analysis

Real text analysis

Now that we are familiar with spaCy, in this section we are going to analyse a real text: Pride & Prejudice.

We would like to:

  • Extract the names of all the characters from the book (e.g. Elizabeth, Darcy, Bingley)
  • Visualize the characters' occurrences relative to their position in the book
  • Automatically describe any character from the book
  • Find out which characters are mentioned in the context of marriage
  • Build keyword extraction that could be used to display a word cloud (example)

Load text file


In [4]:
def read_file(file_name):
    # Read the whole book as a single string; an explicit encoding avoids
    # platform-dependent defaults.
    with open(file_name, 'r', encoding='utf-8') as file:
        return file.read()

Process full text


In [5]:
import spacy

# 'en' is the spaCy 1.x/2.x model shortcut; newer spaCy versions load a
# named model such as 'en_core_web_sm' instead.
nlp = spacy.load('en')

# Process `text` with the spaCy NLP parser
text = read_file('data/pride_and_prejudice.txt')
processed_text = nlp(text)

In [6]:
# How many sentences are in the book (Pride & Prejudice)?
sentences = list(processed_text.sents)
print(len(sentences))

# Print sentences from index 10 to index 15, to make sure that we have parsed the correct book
print(sentences[10:15])


5464
["What is his name?"

"Bingley."

, "Is he married or single?"

"Oh!, Single, my dear, to be sure!, A single man of large fortune; four or
five thousand a year., What a fine thing for our girls!"

]

Find all the personal names


In [ ]:
# Extract all the personal names from Pride & Prejudice and count their occurrences. 
# Expected output is a list in the following form: [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266) ...].

from collections import Counter, defaultdict

def find_character_occurences(doc):
    """
    Return a list of characters from `doc` with their corresponding occurrence counts.
    
    :param doc: spaCy NLP parsed document
    :return: list of tuples in the form
        [('elizabeth', 622), ('darcy', 312), ('jane', 286), ('bennet', 266)]
    """
    
    characters = Counter()
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            characters[ent.lemma_] += 1
            
    return characters.most_common()

In [11]:
# Top 20 characters, skipping any empty lemmas returned by the parser
[n for n in find_character_occurences(processed_text)[:20] if n[0]]


Out[11]:
[('elizabeth', 608),
 ('darcy', 288),
 ('jane', 286),
 ('bennet', 248),
 ('wickham', 179),
 ('collins', 175),
 ('bingley', 166),
 ('lydia', 160),
 ('lizzy', 92),
 ('gardiner', 91),
 ('lady catherine', 71),
 ('kitty', 69),
 ('mary', 36),
 ('hurst', 32),
 ('phillips', 30),
 ('miss bingley', 30),
 ('catherine', 24),
 ('longbourn', 24),
 ('forster', 22)]
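
The list above still counts aliases of the same person separately (e.g. 'lizzy' vs. 'elizabeth', 'miss bingley' vs. 'bingley'). A minimal sketch for consolidating them follows; the alias map below is a hand-picked illustration, not derived from the text.


In [ ]:
# Merge counts for known aliases of the same character.
ALIASES = {'lizzy': 'elizabeth', 'eliza': 'elizabeth', 'miss bingley': 'bingley'}

def merge_aliases(occurrences, aliases):
    merged = Counter()
    for name, count in occurrences:
        merged[aliases.get(name, name)] += count
    return merged.most_common()

merge_aliases(find_character_occurences(processed_text), ALIASES)[:10]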

Plot characters' personal names as a time series


In [12]:
# Render Matplotlib plots inline in the notebook
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

In [13]:
# Plot characters' mentions as a time series relative to the position of each occurrence in the book.

def get_character_offsets(doc):
    """
    For every character in `doc`, collect the token offsets of all of its occurrences.
    The function returns a dictionary keyed by character lemma, with the list of offsets as the value.
    
    :param doc: spaCy NLP parsed document
    :return: dict object in the form
        {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    """
    
    character_offsets = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            character_offsets[ent.lemma_].append(ent.start)
            
    return dict(character_offsets)

character_occurences = get_character_offsets(processed_text)

In [14]:
from cycler import cycler

NUM_BINS = 10

def normalize(occurrences, normalization_constant):
    # Rescale raw token offsets by the given constant (e.g. the book length).
    return [o / float(normalization_constant) for o in occurrences]

def plot_character_timeseries(character_offsets, character_labels, normalization_constant=None):
    """
    Plot characters' personal names specified in the `character_labels` list as time series.
    
    :param character_offsets: dict object in the form {'elizabeth': [123, 543, 4534], 'darcy': [205, 2111]}
    :param character_labels: list of strings that should match some of the keys in `character_offsets`
    :param normalization_constant: int
    """
    x = [character_offsets[character_label] for character_label in character_labels]
    if normalization_constant:
        x = [normalize(offsets, normalization_constant) for offsets in x]

    with plt.style.context('fivethirtyeight'):
        # Set the colour cycle before plotting so that it actually takes effect.
        matplotlib.rcParams['axes.prop_cycle'] = cycler(color=['r', 'k', 'c', 'b', 'y', 'm', 'g', '#54a1FF'])

        # Use plt.hist only to bin the offsets, then redraw the bin counts as lines.
        plt.figure()
        n, bins, patches = plt.hist(x, NUM_BINS, label=character_labels)
        plt.clf()

        ax = plt.subplot(111)
        for i, counts in enumerate(n):
            ax.plot([float(j) / (NUM_BINS - 1) for j in range(len(counts))], counts, label=character_labels[i])

        ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

#plot_character_timeseries(character_occurences, ['darcy', 'bingley'], normalization_constant=len(processed_text))
plot_character_timeseries(character_occurences, ['darcy', 'bingley'])
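
The plt.hist call above is used only to bin the offsets (the histogram itself is cleared and the bin counts are redrawn as lines). An equivalent sketch that bins directly with np.histogram may be clearer; it assumes `character_occurences` and `processed_text` from the cells above.


In [ ]:
def plot_character_timeseries_np(character_offsets, character_labels):
    # Bin each character's offsets over the whole book and plot the counts
    # against the bin centres, rescaled to the [0, 1] range.
    fig, ax = plt.subplots()
    for label in character_labels:
        counts, bin_edges = np.histogram(
            character_offsets[label], bins=NUM_BINS, range=(0, len(processed_text)))
        bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0 / len(processed_text)
        ax.plot(bin_centers, counts, label=label)
    ax.legend(loc='center left', bbox_to_anchor=(1, 0.5))

plot_character_timeseries_np(character_occurences, ['darcy', 'bingley'])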


spaCy parse tree in action


In [16]:
# Find words (adjectives) that describe Mr. Darcy.

def get_character_adjectives(doc, character_lemma):
    """
    Find all the adjectives related to `character_lemma` in `doc`.
    
    :param doc: spaCy NLP parsed document
    :param character_lemma: string object
    :return: list of adjectives related to `character_lemma`
    """
    
    adjectives = []
    for ent in doc.ents:
        if ent.lemma_ == character_lemma:
            for token in ent.subtree:
                if token.pos_ == 'ADJ':  # for a stricter match, use: if token.dep_ == 'amod':
                    adjectives.append(token.lemma_)
    
    # Also catch predicative adjectives, e.g. "Darcy was proud": the entity is
    # the subject and the adjective is an 'acomp' child of its head verb.
    for ent in doc.ents:
        if ent.lemma_ == character_lemma:
            if ent.root.dep_ == 'nsubj':
                for child in ent.root.head.children:
                    if child.dep_ == 'acomp':
                        adjectives.append(child.lemma_)
    
    return adjectives

ch_adjs = get_character_adjectives(processed_text, 'darcy')
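
The comment in the loop above suggests switching to the 'amod' dependency. A sketch of that stricter variant, keeping only adjectival modifiers whose head token lies inside the matched entity:


In [ ]:
def get_character_amod_adjectives(doc, character_lemma):
    # Stricter variant of get_character_adjectives: keep only 'amod' tokens
    # that directly modify a token inside the entity span.
    adjectives = []
    for ent in doc.ents:
        if ent.lemma_ == character_lemma:
            for token in ent.subtree:
                if token.dep_ == 'amod' and ent.start <= token.head.i < ent.end:
                    adjectives.append(token.lemma_)
    return adjectives

sorted(set(get_character_amod_adjectives(processed_text, 'darcy')))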

In [20]:
ch_adjs = list(set(ch_adjs))  # remove duplicates
ch_adjs = sorted([c for c in ch_adjs if not ('-' in c or '_' in c)])  # drop hyphenated/compound artifacts
ch_adjs


Out[20]:
['abominable',
 'all',
 'answerable',
 'ashamed',
 'bad',
 'bewitch',
 'clever',
 'confidential',
 'dear',
 'deficient',
 'delighted',
 'disappointing',
 'eld',
 'engaged',
 'equal',
 'few',
 'fond',
 'good',
 'grave',
 'handsome',
 'heightened',
 'impatient',
 'intimate',
 'last',
 'late',
 'little',
 'many',
 'much',
 'natured',
 'opposite',
 'own',
 'poor',
 'present',
 'proud',
 'punctual',
 'same',
 'selfish',
 'sensible',
 'studious',
 'such',
 'superior',
 'sure',
 'surprised',
 'tall',
 'that',
 'tranquil',
 'true',
 'unappeasable',
 'unwilling',
 'unworthy',
 'which',
 'whose',
 'worth',
 'wretched']

In [21]:
# Find the characters that are 'talking', 'saying', or 'doing' the most by relating
# PERSON entities to their corresponding root verbs.

character_verb_counter = Counter()
VERB_LEMMA = 'say'

for ent in processed_text.ents:
    if ent.label_ == 'PERSON' and ent.root.head.lemma_ == VERB_LEMMA:
        character_verb_counter[ent.text] += 1

character_verb_counter.most_common(10)
        
# Find all the characters that got married in the book.
#
# Here is an example sentence from which this information could be extracted:
# 
# "her mother was talking to that one person (Lady Lucas) freely,
# openly, and of nothing else but her expectation that Jane would soon
# be married to Mr. Bingley."
#


Out[21]:
[('Elizabeth', 40),
 ('Bennet', 27),
 ('Jane', 15),
 ('Miss Bingley', 7),
 ('Gardiner', 5),
 ('Lydia', 5),
 ('Darcy', 4),
 ('Wickham', 4),
 ('Lady Catherine', 4),
 ('Collins', 4)]
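
The marriage exercise above is left open. One simple approach, sketched below rather than a full solution, is to count PERSON entities whose surrounding sentence contains some form of the verb 'marry':


In [ ]:
marriage_counter = Counter()
for ent in processed_text.ents:
    if ent.label_ == 'PERSON':
        # Count the character if any token in the same sentence lemmatizes
        # to 'marry' (covers 'married', 'marries', 'marrying', ...).
        if any(token.lemma_ == 'marry' for token in ent.sent):
            marriage_counter[ent.lemma_] += 1

marriage_counter.most_common(10)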

Extract Keywords


In [22]:
# Extract keywords using noun chunks from the news article (file 'article.txt').
# spaCy will pick some noun chunks that are not informative at all (e.g. we, what, who).
# Try to find a way to remove non-informative keywords.

article = read_file('data/article.txt')
doc = nlp(article)

keywords = Counter()
for chunk in doc.noun_chunks:
    if nlp.vocab[chunk.lemma_].prob < -8:  # the probability threshold of -8 is an arbitrarily chosen cutoff
        keywords[chunk.lemma_] += 1

keywords.most_common(20)


Out[22]:
[('-PRON-', 16),
 ('terrorism', 3),
 ('al - qaeda', 3),
 ('many country', 2),
 ('the ministry', 2),
 ('the responsibility', 2),
 ('interior', 2),
 ('social medium', 2),
 ('religion', 2),
 ('daesh', 2),
 ('saudi arabia', 2),
 ('excommunication', 2),
 ('god', 1),
 ('tyrant', 1),
 ('discussion', 1),
 ('-PRON- ideology', 1),
 ('the emergence', 1),
 ('the youth', 1),
 ('security expertise', 1),
 ('such a discourse', 1)]
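
'-PRON-' topping the list is spaCy's placeholder lemma for pronouns, which illustrates exactly the problem the exercise asks about. A minimal sketch of a filter that drops chunks headed by pronouns or stop words ('we', 'what', 'who', ...):


In [ ]:
def informative_keywords(doc, prob_threshold=-8, top_n=20):
    keywords = Counter()
    for chunk in doc.noun_chunks:
        # Skip chunks whose root is a pronoun or a stop word.
        if chunk.root.pos_ == 'PRON' or chunk.root.is_stop:
            continue
        if nlp.vocab[chunk.lemma_].prob < prob_threshold:
            keywords[chunk.lemma_] += 1
    return keywords.most_common(top_n)

informative_keywords(doc)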
