Working with Texts in Python

Below is my solution to the exercises posed in the notebook 01-WorkingWithTexts.
As a reminder, you were asked to:

Run the cell below to read in the text of "Pride and Prejudice" and assign it to the variable "austen_string," then read in the text of Louisa May Alcott's "A Garland for Girls," a children's book, and assign it to the variable "alcott_string." With these variables, print the answers to the following questions:

  1. How many words are in each novel?
  2. How many words in each novel appear in title case?
  3. What is the approximate average word length in each novel?
  4. How many words longer than 7 characters are in each novel? (don't worry about punctuation for now)
  5. What proportion of the total words are the long words in each novel?

In [ ]:
austen_string = open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8').read()
alcott_string = open('../Data/Alcott_GarlandForGirls.txt', encoding='utf-8').read()

First, create variables that split the strings into lists of words. I'll print out the first 10 words to make sure everything looks good.


In [ ]:
austen_words = austen_string.split()
alcott_words = alcott_string.split()
print(austen_words[:10])
print(alcott_words[:10])
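
As a quick illustration of what split() does, here is a minimal sketch on a short sample. The variable names `sample` and `sample_words` are made up for this example and are not part of the exercise. Note that punctuation stays attached to the neighboring word, which matters for the later questions.

```python
# Hypothetical sample text, used only to illustrate str.split()
sample = "It is a truth\nuniversally   acknowledged, that a single man"

# With no argument, split() breaks on any run of whitespace
# (spaces, tabs, newlines) and never returns empty strings
sample_words = sample.split()
print(sample_words)

# Punctuation is not removed: 'acknowledged,' keeps its comma
print(len(sample_words))
```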

To count the number of words, simply take the length of each list and print the result. You can add print statements with labels to make the output easier to read.


In [ ]:
#How many words are in each novel?
print("Number of words in Pride and Prejudice:")
print(len(austen_words))
print("Number of words in A Garland for Girls:")
print(len(alcott_words))
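
If you want the label and the number on one line, an f-string is one option. This is only a sketch: `example_count` is a made-up number standing in for a real word count, and the `:,` format specifier adds thousands separators.

```python
# Made-up number for illustration; in the notebook this would be
# something like len(austen_words)
example_count = 1234567

# The ':,' format specifier inserts commas as thousands separators
line = f"Number of words in the example text: {example_count:,}"
print(line)
```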

To count the number of words that are in title case, we can use a list comprehension that keeps only the words for which the string method istitle() returns True.


In [ ]:
#How many words in each novel appear in title case?
print("Number of words in title case in Pride and Prejudice:")
print(len([word for word in austen_words if word.istitle()]))
print("Number of words in title case in A Garland for Girls:")
print(len([word for word in alcott_words if word.istitle()]))
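
It can help to check how istitle() treats a few edge cases before trusting the counts. The sketch below uses a hypothetical list named `samples`; the words are chosen for illustration and are not drawn from the counts above.

```python
# Hypothetical sample words to probe the behavior of str.istitle()
samples = ["Elizabeth", "Mr.", "I", "It's", "LONDON", "truth", "Netherfield,"]

# Map each word to whether Python considers it title case
title_flags = {word: word.istitle() for word in samples}
for word, flag in title_flags.items():
    print(word, flag)

# "It's" is not title case: the lowercase 's' follows an apostrophe,
# and istitle() expects an uppercase letter after an uncased character.
# Trailing punctuation ("Mr.", "Netherfield,") does not change the result.
title_count = sum(title_flags.values())
print(title_count)
```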

To get the average length of the words in each novel, we first transform each word into its length, using the len function and a list comprehension. I'll print out the first 10 word lengths to make sure we did it right. We can then sum the list and divide by the total number of words.


In [ ]:
austen_word_length = [len(word) for word in austen_words]
print(austen_word_length[:10])
alcott_word_length = [len(word) for word in alcott_words]
print(alcott_word_length[:10])
print("Average word length in Pride and Prejudice:")
print(sum(austen_word_length)/len(austen_word_length))
print("Average word length in A Garland for Girls:")
print(sum(alcott_word_length)/len(alcott_word_length))
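
Since the same calculation is repeated for both novels, one option is to wrap it in a small function. This is a sketch, not part of the original exercise: `average_word_length` is a hypothetical helper name, and the guard for an empty list is an added assumption to avoid dividing by zero.

```python
def average_word_length(words):
    """Return the mean number of characters per word, or 0.0 for an empty list."""
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)

# Hypothetical sample; the word lengths are 1, 5, 11 and 12
sample_words = "a truth universally acknowledged".split()
print(average_word_length(sample_words))
```

With the notebook's variables, this could be called as average_word_length(austen_words) and average_word_length(alcott_words).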

To count the number of long words, I create two new variables and use a list comprehension to keep only words longer than 7 characters. I then divide the length of each of these lists by the total number of words in that novel to get the proportion.


In [ ]:
#How many words longer than 7 characters are in each novel? (don't worry about punctuation for now)
#What proportion of the total words are the long words in each novel?

austen_long = [word for word in austen_words if len(word)>7]
print("Number of long words in Pride and Prejudice:")
print(len(austen_long))

alcott_long = [word for word in alcott_words if len(word)>7]
print("Number of long words in A Garland for Girls:")
print(len(alcott_long))

print("Proportion of words that are long in Pride and Prejudice:")
print(len(austen_long)/len(austen_words))
print("Proportion of words that are long in A Garland for Girls:")
print(len(alcott_long)/len(alcott_words))


We find what we might expect: there are proportionally fewer long words in the children's novel A Garland for Girls (9.5%) compared to the adult novel Pride and Prejudice (15%).
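
The exercise said not to worry about punctuation, but attached punctuation does add to a word's length, so it can push a word over the 7-character threshold. The sketch below shows one possible way to check the effect. The function `long_word_proportion` and its parameters are hypothetical names introduced for this example, and stripping with string.punctuation only removes ASCII punctuation from the ends of each word.

```python
import string

def long_word_proportion(words, min_length=7, strip_punctuation=False):
    """Return the proportion of words longer than min_length characters."""
    if strip_punctuation:
        # Remove leading and trailing ASCII punctuation from each word
        words = [word.strip(string.punctuation) for word in words]
        # Drop tokens that consisted only of punctuation
        words = [word for word in words if word]
    if not words:
        return 0.0
    long_words = [word for word in words if len(word) > min_length]
    return len(long_words) / len(words)

# Hypothetical sample: 'however,' and 'pleased.' are 8 characters
# with punctuation, but only 7 without it
sample_words = "She was, however, pleased.".split()
print(long_word_proportion(sample_words))
print(long_word_proportion(sample_words, strip_punctuation=True))
```

On this tiny sample the proportion drops from 0.5 to 0.0 once punctuation is stripped; the effect on the full novels would need to be measured rather than assumed.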