This notebook explores a theory about quoted language, using a test corpus of a few Amazon Kindle frequently highlighted passages from Pride and Prejudice (one of the most quoted novels on Amazon). We compare these with other text from the novel.
This analysis is not really meant to go in the paper, but is just for fun.
In [1]:
from collections import Counter
import pandas as pd
%matplotlib inline
from pylab import rcParams
from bs4 import BeautifulSoup
import textacy
rcParams['figure.figsize'] = 10, 4
import matplotlib.pyplot as plt
plt.style.use('ggplot')
In [2]:
import spacy
nlp = spacy.load('en')
In [3]:
with open('pride.txt') as f:
pride = f.read()
In [4]:
doc = nlp(pride)
In [7]:
def proportionWithTag(doc, tag):
""" Returns the proportion of words in the document that have a certain POS tag.
If given a list instead of a tag, returns the proportions of words in the document
that have those tags."""
totalWords = len(doc)
if type(tag) == list:
wordsWithTag = [word for word in doc if word.tag_ in tag]
else:
wordsWithTag = [word for word in doc if word.tag_ == tag]
return len(wordsWithTag)/totalWords
def proportionWithLemma(doc, lemma):
totalWords = len(doc)
wordsWithLemma = [word for word in doc if word.lemma_ == lemma]
return len(wordsWithLemma)/totalWords
In [8]:
def beProportion(doc):
totalWords = len(doc)
bes = [word for word in doc if word.lemma_ == 'be' and word.tag_ in verbtags] # 488 is "be"
return len(bes)/totalWords
From the Penn Treebank table: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
In [9]:
presentVerbTags = ['VB', 'VBG', 'VBP', 'VBZ']
verbtags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
Scraped from: https://kindle.amazon.com/work/pride-prejudice-jane-austen-ebook/B00K1K87SQ/B008476HBM and from: https://web.archive.org/web/20140821065046/https://kindle.amazon.com/work/pride-prejudice-jane-austen-ebook/B000AGXUCE/B000JMLFLW
It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.
11996 Highlighters
Pride relates more to our opinion of ourselves, vanity to what we would have others think of us."
11799 Highlighters
In vain I have struggled. It will not do. My feelings will not be repressed. You must allow me to tell you how ardently I admire and love you.
5841 Highlighters
Happiness in marriage is entirely a matter of chance. If the dispositions of the parties are ever so well known to each other or ever so similar beforehand, it does not advance their felicity in the least. They always continue to grow sufficiently unlike afterwards to have their share of vexation; and it is better to know as little as possible of the defects of the person with whom you are to pass your life."
5788 Highlighters
There are few people whom I really love, and still fewer of whom I think well. The more I see of the world, the more am I dissatisfied with it; and every day confirms my belief of the inconsistency of all human characters, and of the little dependence that can be placed on the appearance of merit or sense.
5290 Highlighters
Popular Highlights
there are very few of us who have heart enough to be really in love without encouragement.
8300 Highlighters
Vanity and pride are different things, though the words are often used synonymously. A person may be proud without being vain. Pride relates more to our opinion of ourselves, vanity to what we would have others think of us."
7627 Highlighters
"Pride," observed Mary, who piqued herself upon the solidity of her reflections, "is a very common failing,
5494 Highlighters
In [10]:
quoted_passages = ["""It is a truth universally acknowledged,
that a single man in possession of a good fortune, must be in want of a wife.""",
"""In vain I have struggled. It will not do. My feelings will not be repressed.
You must allow me to tell you how ardently I admire and love you.""",
"""Happiness in marriage is entirely a matter of chance. If the dispositions of the parties are ever
so well known to each other or ever so similar beforehand,
it does not advance their felicity in the least.
They always continue to grow sufficiently unlike afterwards to have their share of vexation;
and it is better to know as little as possible of the defects of the person with whom you are to pass your life.""",
"""There are few people whom I really love, and still fewer of whom I think well.
The more I see of the world, the more am I dissatisfied with it;
and every day confirms my belief of the inconsistency of all human characters,
and of the little dependence that can be placed on the appearance of merit or sense.""",
""""Pride," observed Mary, who piqued herself upon the solidity of her reflections, "is a very common failing,""",
"""Vanity and pride are different things, though the words are often used synonymously.
A person may be proud without being vain. Pride relates more to our opinion of ourselves,
vanity to what we would have others think of us."""
]
In [11]:
joined = ' '.join(quoted_passages)
In [12]:
quoted = nlp(joined)
In [13]:
tagDict = {"CC": "Coordinating conjunction",
"DT": "Determiner",
"EX": "Existential there",
"IN": "Preposition or subordinating conjunction",
"JJ": "Adjective",
"JJR": "Adjective, comparative",
"JJS": "Adjective, superlative",
"MD": "Modal",
"NN": "Noun, singular or mass",
"NNS": "Noun, plural",
"NNP": "Proper noun, singular",
"NNPS": "Proper noun, plural",
"PDT": "Predeterminer",
"POS": "Possessive ending",
"PRP": "Personal pronoun",
"PRP$": "Possessive pronoun",
"RB": "Adverb",
"RBR": "Adverb, comparative",
"RBS": "Adverb, superlative",
"RP": "Particle",
"TO": "to",
"UH": "Interjection",
"VB": "Verb, base form",
"VBD": "Verb, past tense",
"VBG": "Verb, gerund or present participle",
"VBN": "Verb, past participle",
"VBP": "Verb, non-3rd person singular present",
"VBZ": "Verb, 3rd person singular present",
"WDT": "Wh-determiner",
"WP": "Wh-pronoun",
"WP$": "Possessive wh-pronoun",
"WRB": "Wh-adverb"}
In [14]:
tagset = list(tagDict.keys())
In [15]:
def compareTags(a, b, tagset):
proportionsDict = {}
for tag in tagset:
proportionsDict[tag] = [proportionWithTag(x, tag) for x in [a, b]]
df = pd.DataFrame(proportionsDict).T
df['factor'] = (df[1]/df[0])-1
return df['factor']
In [16]:
compareTags(doc, quoted, tagset)
Out[16]:
In [17]:
def compareLemmas(a, b, lemmas):
proportionsDict = {}
for lemma in lemmas:
proportionsDict[lemma] = [proportionWithLemma(x, lemma) for x in [a, b]]
df = pd.DataFrame(proportionsDict).T
df['factor'] = df[1]/df[0]
df['factor'].plot(kind="bar")
In [18]:
compareLemmas(doc, quoted, ['be', 'have', 'do'])