Ex. Read in the childrens_lit.csv.bz2 file from the data folder.
In [ ]:
import pandas as pd
import nltk
import string
import matplotlib.pyplot as plt
# Read in our data (tab-separated, bz2-compressed, first column as the index)
df = pd.read_csv("../Data/childrens_lit.csv.bz2", sep='\t', encoding='utf-8', compression='bz2', index_col=0)
df = df.dropna(subset=["text"])
df
Since the full collection of children's literature is too large to analyze here, we'll randomly select 5 books and run a sentiment analysis on them using the dictionary method.
Note: In case you're not familiar with seeds, seed() initializes the random number generator to a fixed state. If everyone passes the same number to seed(), everyone gets the same result from the random operations that follow.
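To make the effect of seeding concrete, here is a minimal sketch that is separate from the exercise data. The variable names `first_draw` and `second_draw` are made up for illustration.

```python
import numpy as np

# Seed the global NumPy generator, then draw five integers.
np.random.seed(1)
first_draw = np.random.randint(0, 100, size=5)

# Re-seeding with the same value resets the generator to the same state,
# so the next draw replays the same sequence.
np.random.seed(1)
second_draw = np.random.randint(0, 100, size=5)

print(first_draw)
print(second_draw)
print(np.array_equal(first_draw, second_draw))  # True
```

pandas also accepts a `random_state` argument on `sample`, for example `df.sample(5, random_state=1)`, which keeps the seed local to that one call instead of changing the global generator.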
In [ ]:
import numpy as np
np.random.seed(1)
df = df.sample(5)
df
Since these books are written for children, we'd expect their overall sentiment to be positive.
In [ ]:
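# Your code here
# Lowercase the text so that matching against the word lists is case-insensitive
df['text_lc'] = df['text'].str.lower()
# Split each text into word tokens
df['text_split'] = df['text_lc'].apply(nltk.word_tokenize)
# Drop tokens that are punctuation marks
df['text_split_clean'] = df['text_split'].apply(lambda x : [word for word in x if word not in string.punctuation])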
df
In [ ]:
df['text_length'] = df['text_split_clean'].apply(len)
df
In [ ]:
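# Your code here
# Read each word list, closing the file once it has been read
with open("../Data/positive_words.txt", encoding='utf-8') as f:
    pos_sent = f.read()
with open("../Data/negative_words.txt", encoding='utf-8') as f:
    neg_sent = f.read()
# Store the words in sets: membership checks on a set are much faster than on a list
positive_words = set(pos_sent.split('\n'))
negative_words = set(neg_sent.split('\n'))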
In [ ]:
df['num_pos_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in positive_words]))
df['num_neg_words'] = df['text_split_clean'].apply(lambda x: len([word for word in x if word in negative_words]))
df
In [ ]:
df['prop_pos_words'] = df['num_pos_words']/df['text_length']
df['prop_neg_words'] = df['num_neg_words']/df['text_length']
df
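The whole dictionary method can also be sketched on a single sentence, without the data files. This is only an illustration under stated assumptions: the two mini word lists and the function name `sentiment_proportions` are hypothetical, and tokens are split on whitespace rather than with nltk.word_tokenize so that the example runs without downloading any NLTK data.

```python
import string

# Hypothetical mini word lists, for illustration only (not the files in the Data folder)
toy_positive = {"happy", "good", "love"}
toy_negative = {"sad", "bad"}

def sentiment_proportions(text, positive, negative):
    """Return (share of positive tokens, share of negative tokens) for one text."""
    # Lowercase, split on whitespace, and strip punctuation from both ends of each token
    tokens = [word.strip(string.punctuation) for word in text.lower().split()]
    # Drop tokens that were nothing but punctuation
    tokens = [word for word in tokens if word]
    if not tokens:
        return 0.0, 0.0
    num_pos = sum(1 for word in tokens if word in positive)
    num_neg = sum(1 for word in tokens if word in negative)
    return num_pos / len(tokens), num_neg / len(tokens)

sentence = "The happy rabbit had a good day, not a sad one."
prop_pos, prop_neg = sentiment_proportions(sentence, toy_positive, toy_negative)
print(prop_pos, prop_neg)
```

The sentence has 11 tokens, 2 of them in the positive list and 1 in the negative list, so the proportions are 2/11 and 1/11. Note that "not a sad one" still counts as negative: the dictionary method looks at words in isolation and ignores negation.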