Exploratory analysis of quoted speech:

This notebook explores a theory about quoted language, using a test corpus of a few Amazon Kindle frequently highlighted passages from Pride and Prejudice (one of the most quoted novels on Amazon). We compare these with other text from the novel.

This analysis is not really meant to go in the paper, but is just for fun.


In [1]:
from collections import Counter
import pandas as pd
%matplotlib inline
from pylab import rcParams
from bs4 import BeautifulSoup
import textacy
rcParams['figure.figsize'] = 10, 4
import matplotlib.pyplot as plt
plt.style.use('ggplot')

In [2]:
import spacy
nlp = spacy.load('en')

In [3]:
with open('pride.txt') as f: 
    pride = f.read()

In [4]:
doc = nlp(pride)

In [7]:
def proportionWithTag(doc, tag):
    """ Returns the proportion of words in the document that have a certain POS tag. 
    If given a list instead of a tag, returns the proportions of words in the document 
    that have those tags."""
    totalWords = len(doc)
    if type(tag) == list: 
        wordsWithTag = [word for word in doc if word.tag_ in tag]
    else: 
        wordsWithTag = [word for word in doc if word.tag_ == tag]
    return len(wordsWithTag)/totalWords

def proportionWithLemma(doc, lemma): 
    totalWords = len(doc)
    wordsWithLemma = [word for word in doc if word.lemma_ == lemma]
    return len(wordsWithLemma)/totalWords

In [8]:
def beProportion(doc): 
    totalWords = len(doc)
    bes = [word for word in doc if word.lemma_ == 'be' and word.tag_ in verbtags] # 488 is "be" 
    return len(bes)/totalWords

From the Penn Treebank table: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

  1. VB Verb, base form
  2. VBD Verb, past tense
  3. VBG Verb, gerund or present participle
  4. VBN Verb, past participle
  5. VBP Verb, non-3rd person singular present
  6. VBZ Verb, 3rd person singular present

In [9]:
presentVerbTags = ['VB', 'VBG', 'VBP', 'VBZ']
verbtags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']

Pride and Prejudice Highlights

Scraped from: https://kindle.amazon.com/work/pride-prejudice-jane-austen-ebook/B00K1K87SQ/B008476HBM and from: https://web.archive.org/web/20140821065046/https://kindle.amazon.com/work/pride-prejudice-jane-austen-ebook/B000AGXUCE/B000JMLFLW

It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.

11996 Highlighters

Pride relates more to our opinion of ourselves, vanity to what we would have others think of us."

11799 Highlighters

In vain I have struggled. It will not do. My feelings will not be repressed. You must allow me to tell you how ardently I admire and love you.

5841 Highlighters

Happiness in marriage is entirely a matter of chance. If the dispositions of the parties are ever so well known to each other or ever so similar beforehand, it does not advance their felicity in the least. They always continue to grow sufficiently unlike afterwards to have their share of vexation; and it is better to know as little as possible of the defects of the person with whom you are to pass your life."

5788 Highlighters

There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.

5290 Highlighters

Popular Highlights

there are very few of us who have heart enough to be really in love without encouragement.

8300 Highlighters

Vanity and pride are different things, though the words are often used synonymously. A person may be proud without being vain. Pride relates more to our opinion of ourselves, vanity to what we would have others think of us."

7627 Highlighters

"Pride," observed Mary, who piqued herself upon the solidity of her reflections, "is a very common failing,

5494 Highlighters


In [10]:
quoted_passages = ["""It is a truth universally acknowledged, 
 that a single man in possession of a good fortune, must be in want of a wife.""",
"""In vain I have struggled. It will not do. My feelings will not be repressed. 
You must allow me to tell you how ardently I admire and love you.""",
"""Happiness in marriage is entirely a matter of chance. If the dispositions of the parties are ever 
so well known to each other or ever so similar beforehand, 
it does not advance their felicity in the least. 
They always continue to grow sufficiently unlike afterwards to have their share of vexation; 
and it is better to know as little as possible of the defects of the person with whom you are to pass your life.""",
"""There are few people whom I really love, and still fewer of whom I think well. 
The more I see of the world, the more am I dissatisfied with it; 
and every day confirms my belief of the inconsistency of all human characters, 
and of the little dependence that can be placed on the appearance of merit or sense.""",
""""Pride," observed Mary, who piqued herself upon the solidity of her reflections, "is a very common failing,""",
"""Vanity and pride are different things, though the words are often used synonymously. 
A person may be proud without being vain. Pride relates more to our opinion of ourselves, 
vanity to what we would have others think of us."""
]

In [11]:
joined = ' '.join(quoted_passages)

In [12]:
quoted = nlp(joined)

In [13]:
tagDict = {"CC": "Coordinating conjunction",
"DT": "Determiner",
"EX": "Existential there",
"IN": "Preposition or subordinating conjunction",
"JJ": "Adjective",
"JJR": "Adjective, comparative",
"JJS": "Adjective, superlative",
"MD": "Modal",
"NN": "Noun, singular or mass",
"NNS": "Noun, plural",
"NNP": "Proper noun, singular",
"NNPS": "Proper noun, plural",
"PDT": "Predeterminer",
"POS": "Possessive ending",
"PRP": "Personal pronoun",
"PRP$": "Possessive pronoun",
"RB": "Adverb",
"RBR": "Adverb, comparative",
"RBS": "Adverb, superlative",
"RP": "Particle",
"TO": "to",
"UH": "Interjection",
"VB": "Verb, base form",
"VBD": "Verb, past tense",
"VBG": "Verb, gerund or present participle",
"VBN": "Verb, past participle",
"VBP": "Verb, non-3rd person singular present",
"VBZ": "Verb, 3rd person singular present",
"WDT": "Wh-determiner",
"WP": "Wh-pronoun",
"WP$": "Possessive wh-pronoun",
"WRB": "Wh-adverb"}

In [14]:
tagset = list(tagDict.keys())

In [15]:
def compareTags(a, b, tagset):
    proportionsDict = {}
    for tag in tagset: 
        proportionsDict[tag] = [proportionWithTag(x, tag) for x in [a, b]]
    df = pd.DataFrame(proportionsDict).T
    df['factor'] = (df[1]/df[0])-1
    return df['factor']

In [16]:
compareTags(doc, quoted, tagset)


Out[16]:
CC     -0.127749
DT      0.370357
EX      1.041084
IN      0.120316
JJ      0.125845
JJR     4.834780
JJS     0.817311
MD      0.331852
NN      0.093802
NNP    -0.664153
NNPS   -1.000000
NNS     0.870761
PDT    -1.000000
POS    -1.000000
PRP    -0.028811
PRP$   -0.152297
RB      0.258460
RBR     0.886243
RBS    -1.000000
RP     -1.000000
TO      0.084048
UH      0.731046
VB      0.213268
VBD    -0.791826
VBG    -0.750679
VBN    -0.391365
VBP     2.236749
VBZ     1.562968
WDT    -0.274522
WP      2.295245
WP$    -1.000000
WRB    -0.294180
Name: factor, dtype: float64

In [17]:
def compareLemmas(a, b, lemmas):
    proportionsDict = {}
    for lemma in lemmas: 
        proportionsDict[lemma] = [proportionWithLemma(x, lemma) for x in [a, b]]
    df = pd.DataFrame(proportionsDict).T
    df['factor'] = df[1]/df[0]
    df['factor'].plot(kind="bar")

In [18]:
compareLemmas(doc, quoted, ['be', 'have', 'do'])