Working with Texts in Python

Adapted from a lesson by Teddy Roland

With only the tools we learned in the last tutorial, we can already do a good amount of text analysis. We need no special libraries or functions, just counting.

Lesson Goals

Get comfortable reading a text into Python and manipulating it
Apply Wednesday's lesson and do simple counts on a text
Start building more comfort with the Python programming language

Outline

On your own, use the tools we have already learned to answer a few questions about two novels
In small groups, compare your solutions and discuss any differences
Discuss in the larger group

Exploratory Natural Language Processing Tasks

Now that we have some of Python's basics in our toolkit, we can immediately perform the kind of task that is the bread and butter of text analysis: counting. When we first meet a text in the wild, we often want to learn a little about it before digging in deeply, so we start with simple questions like "How many words are in this text?" or "What is the average word length?"
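As a minimal sketch of what such counts can look like, the cell below works on a short sample string rather than a full novel. The names sample_string and sample_words are illustrative only and are not part of the lesson's data. Splitting on whitespace is just one rough way to decide what counts as a word.

```python
# A short sample string stands in for a full text (illustrative only)
sample_string = "It is a truth universally acknowledged"

# split() with no arguments breaks the string on any run of whitespace
sample_words = sample_string.split()

# "How many words are in this text?"
word_count = len(sample_words)

# "What is the average word length?"
# Add up the length of every word, then divide by the number of words
total_characters = 0
for word in sample_words:
    total_characters += len(word)
average_length = total_characters / word_count

print(word_count)
print(average_length)
```

Because this approach splits only on whitespace, any punctuation attached to a word would be counted as part of that word.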

Run the cell below to read in the text of Jane Austen's "Pride and Prejudice" and assign it to the variable "austen_string", and to read in the text of Louisa May Alcott's "A Garland for Girls," a children's book, and assign it to the variable "alcott_string". With these variables, print the answers to the following questions:

How many words are in each novel?
How many words in each novel appear in title case?
What is the approximate average word length in each novel? (don't worry about punctuation for now)
How many words longer than 7 characters are in each novel? (don't worry about punctuation for now)
What proportion of the total words are the long words in each novel?



In [ ]:

    
# Read in the texts; the with statement closes each file once it has been read
with open('../Data/Austen_PrideAndPrejudice.txt', encoding='utf-8') as austen_file:
    austen_string = austen_file.read()
with open('../Data/Alcott_GarlandForGirls.txt', encoding='utf-8') as alcott_file:
    alcott_string = alcott_file.read()

# Print the first 100 characters of each text to make sure everything is in order
print(austen_string[:100])
print(alcott_string[:100])
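One possible way to approach the remaining questions is sketched below on a single sample sentence, so that the results are easy to check by hand. The function name count_summary, its min_length parameter, and the sample sentence are assumptions made for illustration; this is not the lesson's official solution, and your own approach may reasonably differ.

```python
def count_summary(text, min_length=7):
    """Return simple counts for a string, splitting words on whitespace."""
    words = text.split()

    # Words in title case, e.g. "Bennet" (istitle() ignores punctuation)
    title_case_words = [word for word in words if word.istitle()]

    # Words longer than min_length characters; attached punctuation
    # still counts toward the length in this rough version
    long_words = [word for word in words if len(word) > min_length]

    return {
        "total_words": len(words),
        "title_case_words": len(title_case_words),
        "long_words": len(long_words),
        "long_word_proportion": len(long_words) / len(words),
    }


# Illustrative sample sentence, not read from the data files
sample = "Mr. Bennet was among the earliest of those who waited on Mr. Bingley."
summary = count_summary(sample)
print(summary)
```

The same function could then be called on the full texts, for example count_summary(austen_string) and count_summary(alcott_string). Note that "Bingley." is counted as a long word here only because its trailing period brings it to 8 characters, which is the kind of imprecision the questions ask us to set aside for now.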
