Exercise Solution for Pandas and Text Analysis

First, recreate our dataframe (df).



In [ ]:

    
import pandas
import nltk
import string
import matplotlib.pyplot as plt #note this last import statement. Why might we use "import as"?

#read in our data
df = pandas.read_csv("../Data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
df = df.dropna(subset=["text"])
df['text_lc'] = df['text'].str.lower()
df['text_split']=df['text_lc'].str.split()
df['word_count'] = df['text_split'].apply(len)
df



In [ ]:

    
##Ex: create a new column, 'text_split', that contains the lower case text split into a list of words.
####HINT: split on white space, don't tokenize it.

df['text_split'] = df['text_lc'].str.split()
df



In [ ]:

    
###Ex: print the average novel length for male authors and female authors.
###### What conclusions might you draw from this?

###Ex: graph the average novel length by gender

grouped_gender = df.groupby('author gender')
print(grouped_gender['word_count'].mean())
grouped_gender['word_count'].mean().plot(kind='bar')
plt.show()



In [ ]:

    
#What if we want to put error bars on this? We can add a 'yerr' option to our graph, and use the std() calculation to add error bars.
grouped_gender = df.groupby('author gender')
grouped_gender['word_count'].mean().plot(kind = 'bar', yerr = grouped_gender['word_count'].std())
plt.show()
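
To see what the error bars are built from, here is a minimal sketch on a made-up dataframe. The four rows and the labels 'Male' and 'Female' are assumptions for illustration only; the real values in the 'author gender' column may differ. The mean gives the bar height, and std() gives the length of the error bar for each group.

```python
import pandas

# Hypothetical toy data, not taken from childrens_lit.csv.bz2
toy = pandas.DataFrame({
    'author gender': ['Male', 'Male', 'Female', 'Female'],
    'word_count': [100, 200, 150, 250],
})

grouped_toy = toy.groupby('author gender')['word_count']
means = grouped_toy.mean()  # bar heights, one per group
stds = grouped_toy.std()    # sample standard deviation (ddof=1), one per group

print(means)
print(stds)
```

Passing stds as the 'yerr' option, as in the cell above, draws each error bar one standard deviation above and below the mean. A group with only one row has an undefined sample standard deviation, so pandas returns NaN for it and no error bar is drawn.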



In [ ]:

    
##Ex: plot the average novel length by year, with error bars. 
#Your x-axis should be year, and your y-axis number of words.
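plt.ticklabel_format(useOffset=False) #keeps matplotlib from showing the axis tick labels in offset notation
grouped_year = df.groupby('year')
#the error bars must come from the per-year standard deviation (grouped_year, not grouped_gender)
grouped_year['word_count'].mean().plot(kind = 'line', yerr = grouped_year['word_count'].std())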
plt.show()

Last Exercise:

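Motivating Question: Is there a difference in the average type-token ratio (TTR) for male and female authors? The TTR of a novel is the number of unique token types divided by the total number of tokens.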

To answer this, go step by step.

For computational reasons, we will use the list we created by splitting on whitespace rather than properly tokenized text, so the result is only approximate.

We first need to count the token types in each novel. We can do this in two steps. First, create a column that contains the set of unique token types, by applying the set function. Then count the elements in each set with the len function.
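
Before applying this to the whole dataframe, it may help to see the same steps on a single short string. This is a minimal sketch in plain Python; the sentence is made up and simply stands in for the text of one novel.

```python
# Hypothetical toy text standing in for one novel
text = "The cat sat on the mat and the dog sat too"

tokens = text.lower().split()  # split on whitespace, no real tokenization
types = set(tokens)            # the set keeps each distinct word once

token_count = len(tokens)      # total number of tokens
type_count = len(types)        # number of unique token types
ttr = type_count / token_count # type-token ratio

print(token_count, type_count, round(ttr, 3))
```

Because "the" and "sat" repeat, the set is smaller than the list, and the ratio falls below 1. The dataframe version does the same thing row by row with apply(set) and apply(len).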



In [ ]:

    
##Ex: create a new column, 'text_type', which contains the set of unique token types
df['text_type'] = df['text_split'].apply(set)
df['text_type']



In [ ]:

    
##Ex: create a new column, 'type_count', which is a count of the token types in each novel.
##Ex: create a new column, 'ttr', which contains the type-token ratio for each novel.

df['type_count'] = df['text_type'].apply(len)
df['ttr'] = df['type_count']/df['word_count']
df['ttr']



In [ ]:

    
##Ex: Print the average ttr by author gender
##Ex: Graph this result with error bars

grouped = df.groupby('author gender')
print(grouped['ttr'].mean())

grouped['ttr'].mean().plot(kind='bar', yerr= grouped['ttr'].std())
plt.show()