Additional Exercises for 02.27: Dictionary Method

Ex.

  1. Read in the childrens_lit.csv.bz2 file from the data folder.
  2. Come up with a hypothesis about what you think the sentiment of children's literature is.
  3. Do a sentiment analysis on a subset of children's literature using the dictionary method from lecture.
    • Use the positive and negative words from lecture

Question 1


In [ ]:
import pandas as pd
import nltk
import string
import matplotlib.pyplot as plt


#read in our data
df = pd.read_csv("../Data/childrens_lit.csv.bz2", sep = '\t', encoding = 'utf-8', compression = 'bz2', index_col=0)
df = df.dropna(subset=["text"])
df

Since there are too many children's books to analyze at once, we'll randomly select 5 books and run a sentiment analysis on them using the dictionary method.

Note: In case you're not familiar with seeds, seed() initializes the random number generator to a fixed state. If everyone passes the same number to seed(), everyone gets the same "random" results.


In [ ]:
import numpy as np
np.random.seed(1)
df = df.sample(5)
df
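
As a quick illustration of the note on seeding, here is a minimal sketch that uses Python's built-in random module instead of NumPy so that it runs without extra packages. The list of titles and the helper draw_sample are hypothetical and exist only for this example; the point is that two draws made with the same seed return the same sample.

```python
import random

# Hypothetical titles, used only for this illustration
titles = ["Book A", "Book B", "Book C", "Book D", "Book E", "Book F"]

def draw_sample(seed, k=3):
    """Draw k titles using a generator initialized with a fixed seed."""
    rng = random.Random(seed)  # generator starts from a fixed state
    return rng.sample(titles, k)

first = draw_sample(1)
second = draw_sample(1)

# Same seed, same starting state, so the two samples match
print(first == second)
```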

Question 2

Since these books are written for children, the overall sentiment is probably positive; that is, we expect a higher proportion of positive words than negative words.

Question 3


In [ ]:
# Lowercase the text, tokenize it, and drop punctuation tokens
df['text_lc'] = df['text'].str.lower()
df['text_split'] = df['text_lc'].apply(nltk.word_tokenize)
df['text_split_clean'] = df['text_split'].apply(lambda x : [word for word in x if word not in string.punctuation])
df
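
One thing to be aware of in the cell above: string.punctuation is a single string, so `word not in string.punctuation` is a substring test. Multi-character punctuation tokens such as the quote marks a tokenizer may emit are therefore kept. The sketch below shows the difference on a hypothetical token list; the helper is_all_punct is not from lecture, just one possible stricter filter.

```python
import string

# Hypothetical tokenizer output, including multi-character punctuation tokens
tokens = ["``", "hello", ",", "world", "''", "--", "!"]

# Substring test, as in the cell above
kept_substring = [t for t in tokens if t not in string.punctuation]

def is_all_punct(token):
    """True if the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

# Stricter filter: drop any token made up entirely of punctuation characters
kept_strict = [t for t in tokens if not is_all_punct(t)]

print(kept_substring)  # still contains "``", "''" and "--"
print(kept_strict)     # only the word tokens remain
```

Whether the stricter filter is worth using depends on the data: leftover punctuation tokens never match the word lists, but they do inflate text_length slightly.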

In [ ]:
df['text_length'] = df['text_split_clean'].apply(len)
df

In [ ]:
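# Load the positive and negative word lists from lecture (one word per line).
# Sets make the membership checks below much faster than lists.
with open("../Data/positive_words.txt", encoding='utf-8') as f:
    positive_words = set(f.read().split('\n'))
with open("../Data/negative_words.txt", encoding='utf-8') as f:
    negative_words = set(f.read().split('\n'))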

In [ ]:
df['num_pos_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in positive_words]))
df['num_neg_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in negative_words]))
df

In [ ]:
df['prop_pos_words'] = df['num_pos_words']/df['text_length']
df['prop_neg_words'] = df['num_neg_words']/df['text_length']
df
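
To make the arithmetic behind the columns above concrete, here is a minimal sketch of the same dictionary method on a single sentence. The sentence and the two tiny word sets are made up for illustration and are not the lecture's word lists, and the text is split on whitespace rather than with nltk.word_tokenize so the example has no dependencies.

```python
import string

# Hypothetical mini-lexicons, standing in for the lecture's word lists
positive_words = {"happy", "good", "love"}
negative_words = {"sad", "bad"}

text = "The happy rabbit had a good day, but the fox was sad."

# Lowercase, split on whitespace, and strip punctuation from each token
tokens = [w.strip(string.punctuation) for w in text.lower().split()]
tokens = [w for w in tokens if w]  # drop tokens that were only punctuation

# Count dictionary hits and convert them to proportions of all tokens
num_pos = sum(1 for w in tokens if w in positive_words)
num_neg = sum(1 for w in tokens if w in negative_words)
prop_pos = num_pos / len(tokens)
prop_neg = num_neg / len(tokens)

print(len(tokens), num_pos, num_neg)
print(round(prop_pos, 3), round(prop_neg, 3))
```

Comparing prop_pos_words with prop_neg_words for each sampled book in the same way is one simple check of the hypothesis from Question 2.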