We have learned how to work with numbers using the Python package pandas, and how to work with text using built-in Python string functions and the NLTK package. To operationalize concepts and analyze the resulting numbers, we can combine these tools.
So far we have seen texts in the form of raw text files. Today we'll deal with text stored in a .csv file, which we can read into Python in the same way we read in the numerical dataset from the National Survey of Family Growth.
Data preparation
I created a .csv file from a collection of 19th-century children's literature. The data were compiled by students in this course.
The raw data are found here.
That page has additional corpora, so search through it to see if anything sparks your interest.
I did some minimal cleaning to get the children's literature data into .csv format for our use. The delimiter for this file is a tab, so technically it's a tab-separated values file, or .tsv. We can specify that delimiter with the option "sep = '\t'"
In [ ]:
import pandas
import nltk
import string
import matplotlib.pyplot as plt #note this last import statement. Why might we use "import as"?
#read in our data
df = pandas.read_csv("../Data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
df
Notice this is a typical dataframe, though with more string columns than numerical ones. The text is contained in the column 'text'.
Notice also there are missing texts. For now, we will drop these rows so we can move forward with text analysis. In your own work, you should justify any decision to drop missing data.
In [ ]:
df = df.dropna(subset=["text"])
df
In [ ]:
##Ex: Print the first text in the dataframe (starts with "A DOG WITH A BAD NAME").
###Hint: Remind yourself about the syntax for slicing a dataframe
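One possible solution, as a sketch: .iloc selects rows by position, so we can pull the first entry of the 'text' column directly.
In [ ]:
#one possible solution: select the first row of the 'text' column by position
print(df['text'].iloc[0])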
The first thing we probably want to do is describe our data, to make sure everything is in order. We can use the describe function for the numerical data, and the value_counts function for categorical data.
In [ ]:
print(df.describe()) #get descriptive statistics for all numerical columns
print()
print(df['author gender'].value_counts()) #frequency counts for categorical data
print()
print(df['year'].value_counts()) #treat year as a categorical variable
print()
print(df['year'].mode()) #find the year in which the most novels were published
We can do a few things by just using the metadata already present.
For example, we can use the groupby and the count() function to graph the number of books by male and female authors. This is similar to the value_counts() function, but allows us to plot the output.
In [ ]:
#create a pandas groupby object, grouped on author gender
grouped_gender = df.groupby("author gender")
print(grouped_gender['text'].count())
Let's graph the number of texts by gender of author.
In [ ]:
grouped_gender['text'].count().plot(kind = 'bar')
plt.show()
In [ ]:
#Ex: Create a variable called 'grouped_year' that groups the dataframe by year.
## Print the number of texts per year.
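If you're stuck, here is one possible solution (the plotting cell below assumes a variable named 'grouped_year' exists):
In [ ]:
#one possible solution: group the dataframe by the 'year' column
grouped_year = df.groupby("year")
print(grouped_year['text'].count())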
We can graph this via a line graph.
In [ ]:
grouped_year['text'].count().plot(kind = 'line')
plt.show()
Oops! That doesn't look right! matplotlib automatically displayed the years with an offset, which looks like scientific notation. We can set that option to False.
In [ ]:
plt.ticklabel_format(useOffset=False) #tell matplotlib not to offset the axis labels
grouped_year['text'].count().plot(kind = 'line')
plt.show()
We haven't done any text analysis yet. Let's apply some of our text analysis techniques to the text, add columns with the output, and analyze/visualize the output.
Luckily for us, pandas columns (Series) have an attribute called 'str' which allows us to access Python's built-in string functions.
For example, we can make the text lowercase, and assign this to a new column.
Note: You may get a "SettingWithCopyWarning" highlighted with a pink background. This is not an error; it is pandas telling you that, while the code is valid, you might be doing something you don't intend. In this case the warning is a false positive, but in most cases you should read the warning carefully and check your code.
In [ ]:
df['text_lc'] = df['text'].str.lower()
df
In [ ]:
##Ex: create a new column, 'text_split', that contains the lower case text split into a list.
####HINT: split on white space; don't tokenize it.
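One possible solution, as a sketch (the word-count cell below assumes the new column is named 'text_split'):
In [ ]:
#one possible solution: split the lower case text on white space
df['text_split'] = df['text_lc'].str.split()
df['text_split']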
We can also apply a function to each row. To get a word count for a text, we would take the length of the split list, like this:
len(text_split)
If we want to do this on every row in our dataframe, we can use the apply() function.
In [ ]:
df['word_count'] = df['text_split'].apply(len)
df
What is the average length of the novels in our data? With pandas, this is easy!
In [ ]:
df['word_count'].mean()
(These are long novels!) We can also group and slice our dataframe to do further analyses.
In [ ]:
###Ex: print the average novel length for male authors and female authors.
###### What conclusions might you draw from this?
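One possible solution, as a sketch, reusing the grouped_gender object from above:
In [ ]:
#one possible solution: take the mean word count within each gender group
grouped_gender['word_count'].mean()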
In [ ]:
###Ex: graph the average novel length by gender
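And one way to graph it, as a sketch:
In [ ]:
#one possible solution: plot the mean word count per gender as a bar chart
grouped_gender['word_count'].mean().plot(kind = 'bar')
plt.show()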
In [ ]:
##EX: Add error bars to your graph
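One way to add error bars, as a sketch: the pandas plot() function accepts a 'yerr' option, and the standard deviation of each group is a common choice of error bar.
In [ ]:
#one possible solution: use the group standard deviation as the error bar
grouped_gender['word_count'].mean().plot(kind = 'bar', yerr = grouped_gender['word_count'].std())
plt.show()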
This one is a bit tricky. If you're not quite there, no worries! We'll work through it together.
Ex: plot the average novel length by year, with error bars. Your x-axis should be year, and your y-axis number of words.
HINT: Copy and paste what we did above with gender, and then change the necessary variables and options. By my count, you should only have to change one variable, and one graph option.
In [ ]:
#Write your exercise solution here
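If you get stuck, here is one possible solution, following the gender example with the two changes the hint mentions (the grouping variable and the graph type):
In [ ]:
#one possible solution: group by year instead of gender, and use a line graph
plt.ticklabel_format(useOffset=False) #keep the years readable, as above
grouped_year['word_count'].mean().plot(kind = 'line', yerr = grouped_year['word_count'].std())
plt.show()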
If we want to apply NLTK functions, we can do so using .apply(). If we want to use a list comprehension on the split text, we have to introduce one more Python trick: the lambda function. This simply allows us to write our own small, unnamed function to apply to each row in our dataframe. For example, we may want to tokenize our text instead of splitting on white space. To do this we can use a lambda function.
Note: If you want to explore lambda functions more, see the notebook titled A-Bonus_LambdaFunctions.ipynb in this folder.
Because of the length of the novels, tokenizing the full texts takes a bit of time. We'll instead tokenize only the titles.
In [ ]:
df['title_tokens'] = df['title'].apply(nltk.word_tokenize)
df['title_tokens']
With this tokenized list we might want to, for example, remove punctuation. Again, we can use the lambda function, with list comprehension.
In [ ]:
df['title_tokens_clean'] = df['title_tokens'].apply(lambda x: [word for word in x if word not in list(string.punctuation)])
df['title_tokens_clean']
We may want to extract the text from our dataframe, to do further analyses on the text only. We can do this using the tolist() function and the join() function.
In [ ]:
novels = df['text'].tolist()
print(novels[:1])
In [ ]:
#turn all of the novels into one long string using the join function
cat_novels = ' '.join(novels) #join on a space so the novels don't run together
print(cat_novels[:100])
Motivating Question: Is there a difference in the average TTR for male and female authors?
To answer this, go step by step.
For computational reasons we will use the lists we created by splitting on white space, rather than tokenized text, so the results are approximate.
We first need to count the number of token types in each novel. We can do this in two steps. First, create a column that contains a list of the unique token types in each novel, by applying the set function.
In [ ]:
##Ex: create a new column, 'text_type', which contains a list of unique token types
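One possible solution, as a sketch: applying set() gives the unique tokens, and we wrap the result back in a list.
In [ ]:
#one possible solution: a lambda that turns each token list into a list of unique types
df['text_type'] = df['text_split'].apply(lambda x: list(set(x)))
df['text_type']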
In [ ]:
##Ex: create a new column, 'type_count', which is a count of the token types in each novel.
##Ex: create a new column, 'ttr', which contains the type-token ratio for each novel.
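One possible solution for both steps, as a sketch, using the 'word_count' column from above:
In [ ]:
#one possible solution
df['type_count'] = df['text_type'].apply(len) #number of unique token types per novel
df['ttr'] = df['type_count'] / df['word_count'] #type-token ratio
df['ttr']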
In [ ]:
##Ex: Print the average ttr by author gender
##Ex: Graph this result with error bars
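One possible solution, as a sketch: re-group on author gender so the new 'ttr' column is available, then reuse the error-bar pattern from above.
In [ ]:
#one possible solution
grouped_gender = df.groupby("author gender")
print(grouped_gender['ttr'].mean())
grouped_gender['ttr'].mean().plot(kind = 'bar', yerr = grouped_gender['ttr'].std())
plt.show()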