Sandbox Exercise 1: Gender and Pronouns

Corpus

The exercise corpus is a limited selection of 19th-century British children's literature. The data were compiled by students in this course.

The raw data are found here.

That page has additional corpora, so search through it to see if anything sparks your interest.

I did some minimal cleaning to get the children's literature data into .csv format for our use. The delimiter in this file is a tab, so strictly speaking it is a tab-separated values (TSV) file, despite the .csv extension. As we've seen, we can specify that delimiter with the option "sep = '\t'" when we read the file into a pandas DataFrame.
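As a small reminder of how the delimiter option behaves, here is a sketch that reads tab-separated text from an in-memory string rather than from the real file. The column names and rows are invented for illustration and are not taken from the corpus.

```python
import io
import pandas

# Toy tab-separated data standing in for the real file.
# The column names here are hypothetical.
raw = (
    "title\tauthor gender\ttext\n"
    "A Tale\tFemale\tOnce upon a time\n"
    "B Story\tMale\tThe boy ran home\n"
)

# sep='\t' tells pandas to split fields on tabs instead of commas.
toy = pandas.read_csv(io.StringIO(raw), sep='\t')

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names parsed from the header row
```

Without the sep option, pandas would look for commas and would most likely put each whole line into a single column.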

Research has shown that women and men use different types of language. Our question to you: is this the case for 19th-century children's literature as well? What can we learn from comparing the language used by men and women authors in this corpus? To get you started, below are two more specific exercises, but we want you to be creative here!

Before you ask or explore specific questions, first get to know your data. Summarize what you can summarize, look at the first full text, maybe run some value_counts: whatever you need to learn what your data include and how they are formatted.
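A first look at the data might go something like the sketch below. It runs on a tiny invented DataFrame; only the "text" column is known to exist in the real data, and the other column names are assumptions you should replace with whatever df.columns shows.

```python
import pandas

# Toy stand-in for the corpus. Column names other than "text" are assumptions.
toy = pandas.DataFrame({
    "title": ["A Tale", "B Story", "C Fable"],
    "author gender": ["Female", "Male", "Female"],
    "text": [
        "Once upon a time there was a girl.",
        "The boy ran home.",
        "A fox met a crow.",
    ],
})

print(toy.shape)   # rows and columns
print(toy.dtypes)  # data type of each column

# How many texts fall into each category of a column.
gender_counts = toy["author gender"].value_counts()
print(gender_counts)

# The beginning of the first full text.
print(toy["text"].iloc[0][:200])

# Rough text length in whitespace-separated tokens, then summary statistics.
lengths = toy["text"].str.split().str.len()
print(lengths.describe())
```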

Exercise 1: Frequent and Distinctive Words

  1. What are the most frequent words in the corpus? What are the most frequent content words in the corpus? Think through the different ways we have identified content words. What are you learning about children's literature during this era?

  2. What words most distinguish male and female authors in the 19th century children's literature dataset? What are you learning from this?
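To make the two questions above concrete, here is a minimal sketch on two invented sentences. It uses a hand-made stop list as one possible way of picking out content words, and a plain difference in relative frequencies as one possible distinctiveness score; neither is necessarily the method we used in class. The group labels, the stop list, and the helper names are all made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

# Toy texts grouped by a hypothetical author-gender label.
texts = {
    "Female": "She went to the garden and she saw the roses in the garden",
    "Male": "He went to the ship and he saw the sea from the ship",
}

# A tiny illustrative stop list; a real one (such as NLTK's) is much longer.
stop_words = {"the", "to", "and", "in", "from", "she", "he"}

counts = {group: Counter(tokenize(t)) for group, t in texts.items()}

# Most frequent words overall, then with stop words filtered out.
all_counts = counts["Female"] + counts["Male"]
content_counts = Counter(
    {w: c for w, c in all_counts.items() if w not in stop_words}
)
print(all_counts.most_common(3))
print(content_counts.most_common(3))

def rel_freq(counter):
    """Convert raw counts to proportions of all tokens in that group."""
    total = sum(counter.values())
    return {w: c / total for w, c in counter.items()}

f_rel, m_rel = rel_freq(counts["Female"]), rel_freq(counts["Male"])
vocab = set(f_rel) | set(m_rel)

# Positive score: relatively more frequent in the "Female" texts.
diff = {w: f_rel.get(w, 0) - m_rel.get(w, 0) for w in vocab}
most_female = sorted(diff, key=diff.get, reverse=True)[:3]
most_male = sorted(diff, key=diff.get)[:3]
print(most_female, most_male)
```

Working with proportions rather than raw counts matters here because the two groups will not contain the same amount of text.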

Exercise 2: Dictionary Method

Research shows that women use more personal pronouns than men. Your task is to extend this research: explore whether men and women authors in our children's literature data differ in how often they use personal and possessive pronouns, and whether they differ in the way they use these pronouns.

For our purposes, here is a list of personal and possessive pronouns:

Personal pronouns: I, you, he, she, it, we, they, what, who, me, him, her, us, them
Possessive pronouns: mine, yours, his, hers, ours, theirs

  1. Calculate whether female authors use proportionally more personal pronouns, on average, compared to male authors in our collection of 19th century children's literature.
  2. Calculate whether female authors use proportionally more possessive pronouns, on average, compared to male authors in the same collection.
  3. Compare the way in which one personal pronoun and one possessive pronoun of your choosing is used by male and female authors. Do this by showing what other words are used in the same context as the pronouns you choose to compare.

Because of the length of these novels, I suggest not using a part-of-speech tagger. Instead, you may use the lists of personal and possessive pronouns given above.
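A list-based (dictionary) count could be sketched roughly as follows. The pronoun sets are copied from the lists above; the function name pronoun_share and the sample sentence are made up for illustration, and the regular-expression tokenizer is only one simple option.

```python
import re

# Pronoun lists copied from the exercise text, lowercased for matching.
PERSONAL = {"i", "you", "he", "she", "it", "we", "they", "what", "who",
            "me", "him", "her", "us", "them"}
POSSESSIVE = {"mine", "yours", "his", "hers", "ours", "theirs"}

def pronoun_share(text, pronouns):
    """Return the proportion of tokens in `text` that appear in `pronouns`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0  # avoid dividing by zero on empty texts
    hits = sum(1 for tok in tokens if tok in pronouns)
    return hits / len(tokens)

sample = "She gave him the book, and the book was hers."
personal_share = pronoun_share(sample, PERSONAL)
possessive_share = pronoun_share(sample, POSSESSIVE)
print(personal_share, possessive_share)
```

On the real data, one option is to apply this to every text, for example with df["text"].apply(lambda t: pronoun_share(t, PERSONAL)), store the result in a new column, and then average that column within each author-gender group. The name of the gender column is not shown in the starter code, so check df.columns first.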

Hint: Depending on your approach, you may need to use a lambda function for this assignment.
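For the third task, one simple way to see which words share a context with a pronoun is to count the words that fall within a small window around each occurrence. The sketch below assumes a window measured in tokens; the function name context_words and the sample text are invented for illustration. NLTK's Text objects offer concordance views as another route.

```python
import re
from collections import Counter

def context_words(text, target, window=2):
    """Count words within `window` tokens of each occurrence of `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    neighbours = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]    # up to `window` words before
            right = tokens[i + 1:i + 1 + window]   # up to `window` words after
            neighbours.update(left + right)
    return neighbours

sample = "She said she would go. Then she said nothing more."
ctx = context_words(sample, "she", window=1)
print(ctx.most_common())
```

Running this separately on the texts by male and by female authors, and then comparing the most common neighbours, gives a rough picture of how each group uses the pronoun.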


In [ ]:
#use this code to get started
import pandas
import nltk
import matplotlib.pyplot as plt

#read in our data, remove rows with missing texts, check the df variable
df = pandas.read_csv("corpora/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
df = df.dropna(subset=["text"])
df