In this notebook, we will be examining the sentiment and emotions of online fanfiction publications.
Our working hypothesis is that the genre of a work is heavily encapsulated by its emotional content and tone. Therefore, if we can successfully decompose textual data into its representative emotions, we can then predict its genre.
Our goal is two-fold. First, we would like to examine the sentimental and emotional composition of each genre. Second, we would like to see if we can predict the genre of a story based on its text and emotional composition.
In this section, we will set up our working environment.
In [1]:
# imports utility libraries
import pandas as pd
import numpy as np
import pickle
import datetime
import string
import re
import multiprocessing
import warnings
from joblib import Parallel, delayed
from itertools import cycle
from IPython.core.display import display, HTML
# import scipy
from scipy import interp
from scipy.stats import chisquare
# import graphics
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.colors import ListedColormap
%matplotlib inline
# import sklearn tools
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA, SparsePCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.preprocessing import scale, label_binarize, Binarizer, Normalizer
from sklearn.manifold import TSNE
from sklearn import metrics, svm, datasets
from sklearn.metrics import silhouette_score, accuracy_score, precision_score, roc_curve, auc
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
# import nltk tools
import nltk
from nltk import tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# import xgboost
import xgboost as xgb
# import gensim tools
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
from gensim.test.utils import common_corpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# import textblob
from textblob import TextBlob, Blobber
from textblob.sentiments import NaiveBayesAnalyzer, PatternAnalyzer
# stanford NLP
from pycorenlp import StanfordCoreNLP
stanford = StanfordCoreNLP('http://localhost:9000')
In [2]:
# configure graphics (taken from session notes)
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]
dark2_cmap = ListedColormap(dark2_colors)
def set_mpl_params():
rcParams['figure.figsize'] = (12, 6)
rcParams['figure.dpi'] = 100
rcParams['axes.prop_cycle'].by_key()['color'][1]
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
mpl.rcParams.update(mpl.rcParamsDefault)
set_mpl_params()
mpl.style.use('ggplot')
In [3]:
# downloads required NLTK corpora (uncomment and run once)
## nltk.download()
## nltk.download('punkt')
## nltk.download('vader_lexicon')
## nltk.download('movie_reviews')
# sets stop word list for text parsing
stop_words = stopwords.words('english')
In this section, we will be importing the cleaned fanfiction data as well as the emotional lexicons.
The fanfiction data that we will use is scraped from the official fanfiction.net site. We downloaded a random sample of 5,000 stories out of approximately 13 million. Of these 5,000 stories, approximately half were both still online (that is, not deleted) and written in English; this is the subset on which we will perform our analysis.
The python scripts we created to scrape the data can be found here.
Furthermore, for our emotion analysis, we have also downloaded and imported the emotion lexicons of Dr. Saif M. Mohammad, a Senior Research Scientist at the National Research Council Canada (NRC).
For the sake of readability, we have also gathered all the helper functions we will be calling into this section. These include functions for plotting/graphics, textual decomposition, clustering, and extracting passages/lines.
In [4]:
# opens cleaned data
with open ('../data/clean_data/df_story', 'rb') as fp:
df = pickle.load(fp)
df = df.loc[(df.state == 'online') & (df.language == 'English'), ].copy()
df['title_summary'] = df['title'] + ' ' + df['summary']
df['main_genre'] = [g[0] for g in df.genre]
df.index = range(df.shape[0])
# opens features data
with open ('../data/clean_data/feature_data', 'rb') as fp:
feature_data = pickle.load(fp)
In [5]:
# opens emotion lexicon data
emolex = pd.read_csv("../data/nrc_emotion_data/nrc_emotion_lexicon.txt", sep="\t", header=None)
emolex.columns = ['term', 'emotion', 'is_emotion']
intlex = pd.read_csv("../data/nrc_emotion_data/nrc_affectintensity_lexicon.txt", sep="\t")
intlex.columns = ['term', 'score', 'emotion']
hashlex = pd.read_csv("../data/nrc_emotion_data/nrc_hashtag_emotion_lexicon.txt", sep="\t", header=None)
hashlex.columns = ['emotion', 'hashtag', 'score']
hashlex['hashtag'] = [re.sub('#+', '', str(h)) for h in hashlex.hashtag]
vadlex = pd.read_csv("../data/nrc_emotion_data/nrc_vad_lexicon.txt", sep="\t")
vadlex.columns = ['term', 'valence', 'arousal', 'dominance']
# define emotion lists
base_emotions = list(set(emolex.emotion))
intensity_emotions = list(set(intlex.emotion))
vad_emotions = ['valence', 'arousal', 'dominance']
hash_emotions = list(set(hashlex.emotion))
In [6]:
def plot_breakdown(x, title, r=90, hide=False):
plt.rcParams['figure.figsize'] = [16, 3]
plt.bar(range(len(x)), x)
plt.xticks(range(len(x)), x.index, rotation=r)
if hide:
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title(title)
plt.show()
def plot_silhouette(x, y, title="Silhouette vs. k"):
plt.rcParams['figure.figsize'] = [16, 2]
plt.plot(x, y)
plt.scatter(x, y)
plt.title(title)
plt.show()
# plot t-sne scatter with colored groups
def plot_representation(matrix, group, title, alpha=0.5):
tsne_model = TSNE(n_components=2)
tsne_matrix = tsne_model.fit_transform(matrix)
tsne_df = pd.DataFrame(tsne_matrix)
tsne_df.columns = ["Dimension 1", "Dimension 2"]
tsne_df['group'] = group
fig = sns.lmplot('Dimension 1', 'Dimension 2',
data=tsne_df, hue="group", fit_reg=False,
                     scatter_kws={'alpha': alpha}, legend=False)
fig.fig.set_size_inches(16, 9)
plt.title(title)
plt.legend(loc='right', ncol=1)
plt.show()
# plot grouped scatter of observations by label
def plot_scenes(data, group, outlier=False):
data = data.copy()
data.columns = ['label', 'y', 'x']
data['group'] = group
data['counts'] = data.shape[0] / data['group'].map(data['group'].value_counts())
data['counts'] = (data['counts'] / np.max(data['counts'])) * 100
if outlier:
data['counts'] = [100 if g==-1 else 30 for g in group]
plt.rcParams['figure.figsize'] = [16, 12]
plt.scatter(data['x'], data['y'], s=data['counts'], c=data['group'], cmap=plt.get_cmap('tab20'))
plt.yticks(range(len(sorted(list(set(data['label']))))), sorted(list(set(data['label']))))
plt.xticks([])
plt.ylabel('')
plt.xlabel('')
plt.show()
def plot_emotions(header, emotions):
plot_vars = ['main_genre'] + [header + e for e in emotions]
plot_data = feature_data[plot_vars].groupby("main_genre").agg(sum)
plot_data['genre_counts'] = plot_data.index
plot_data['genre_counts'] = plot_data['genre_counts'].map(df['main_genre'].value_counts())
plot_data = plot_data.div(plot_data.sum(axis=1), axis=0) # normalize by genre
plot_data = plot_data.div(plot_data.sum(axis=0), axis=1) # normalize by emotion
plot_data = plot_data.subtract(plot_data.mean(axis=0), axis=1) # take deviation from mean
for emo in emotions:
fig, ax = plt.subplots(figsize=(16, 3))
plt.bar(plot_data.index, plot_data[header + emo])
plt.xticks(range(len(plot_data.index)),plot_data.index)
plt.title(emo)
plt.xticks(rotation=90)
plt.show()
# create custom tokenizer without stemmer
def base_tokenize(text):
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
tokens = [t.lower() for t in tokens if t.lower() not in stop_words]
return tokens
# create custom tokenizer without stemming (used for lexicon matching below)
def simple_tokenize(text):
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    tokens = [t.lower() for t in tokens if t.lower() not in stop_words]
    return tokens
# create custom tokenizer that includes stemmer
def custom_tokenize(text):
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
tokens = tokenizer.tokenize(text)
tokens = [stemmer.stem(w) for w in tokens]
tokens = [t for t in tokens if t not in stop_words]
tokens = [t for t in tokens if len(t) > 1]
return tokens
# turn bodies of text into word count vectors
def get_word_count(vectorizer, text):
count = vectorizer.fit_transform(text)
count = pd.DataFrame(count.toarray())
count.columns = vectorizer.get_feature_names()
return count
# get the top words across all observations
def get_top_words(count, n_top):
total_count = np.sum(count, axis=0)
sorted_index = np.flip(np.argsort(total_count), axis=0)
sorted_index = np.array(sorted_index[0:n_top])
top_words = np.array(count.columns)[sorted_index]
top_words_count = np.array(total_count)[sorted_index]
return top_words, top_words_count
def get_pca(x, n=100):
pca_model = PCA(n_components=n, random_state=0)
pca_matrix = pca_model.fit_transform(scale(x))
print("Explained variance: ", np.sum(pca_model.explained_variance_ratio_))
return pca_matrix
def get_lsa(x, n=100):
lsa_model = TruncatedSVD(n_components=n)
doc_matrix = lsa_model.fit_transform(x)
term_matrix = lsa_model.components_
topic_matrix = np.diag(lsa_model.singular_values_)
print("Explained variance: ", np.sum(lsa_model.explained_variance_ratio_))
return doc_matrix, term_matrix, topic_matrix
def get_doc2vec(x, n=25):
d2v_model = Doc2Vec(vector_size=n, min_count=1, window=5, workers=7, dm=1)
d2v_model.build_vocab(x)
d2v_model.train(x, total_examples=d2v_model.corpus_count, epochs=10)
return d2v_model
def plot_ROC(y_test, y_pred, n_classes=20):
# compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_pred.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    # interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
mean_tpr += interp(all_fpr, fpr[i], tpr[i])
# average and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
label='micro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["micro"]),
color='deeppink', linestyle=':', linewidth=2)
plt.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='navy', linestyle=':', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multiclass ROC')
plt.legend(loc="lower right")
plt.show()
In this section, we will perform some exploratory data analysis on the fanfiction data.
In [7]:
df_vars = ['storyid', 'title', 'summary', 'fandom', 'rated', 'words', 'main_genre']
df[df_vars].head()
Out[7]:
In [8]:
print('Number of observations: ', df.shape[0])
Let's take a look at the distribution of word lengths.
In [9]:
# examines distribution of number of words
df['words1k'] = df['words']/1000
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(16, 4))
sns.kdeplot(df['words1k'], shade=True, bw=.5, legend=False, ax=ax1)
sns.kdeplot(df['words1k'], shade=True, bw=.5, legend=False, ax=ax2)
plt.xlim(0,100)
plt.show()
The bulk of stories appear to be under 50 thousand words, with a high proportion between 0 and 20 thousand. In other words, we have a significant proportion of short stories and novelettes, some novellas, and fewer novels. Finally, there are a few "epics", ranging from 200 thousand to 600 thousand words.
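As a rough breakdown, we can bucket the stories using conventional fiction-length cutoffs (the cutoffs below are our own assumption, not categories defined by the site):
# bucket stories by rough, conventional length cutoffs (assumed, not site-defined)
length_bins = pd.cut(df['words'],
                     bins=[0, 7500, 17500, 40000, 200000, np.inf],
                     labels=['short story', 'novelette', 'novella', 'novel', 'epic'])
print(length_bins.value_counts())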
Working with full texts tens of thousands of words long is computationally intensive and may not provide much additional information for our purposes. As such, we will primarily use the titles and summaries as a proxy for the content of the stories.
How long are titles and summaries? Is there a systematic way authors write them? Do some words appear more often than others? Here we explore some of those questions.
Let's start by examining character and word count, respectively, for titles.
In [10]:
# examines word/character count of titles
title_cc = [len(row) for row in df['title']]
title_wc = [len(row.split()) for row in df['title']]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
pd.Series(title_cc).plot.hist(density=True, bins=np.arange(0, max(title_cc), 1), ax=ax1)
pd.Series(title_wc).plot.hist(density=True, bins=np.arange(0, max(title_wc), 1), ax=ax2)
plt.show()
The two distributions are nearly identical in shape. It would appear titles typically run 2-3 words, or 15-20 characters.
Now let's look at summaries.
In [11]:
# examines word/character count of summaries
summary_cc = [len(row) for row in df['summary']]
summary_wc = [len(row.split()) for row in df['summary']]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
pd.Series(summary_cc).plot.hist(density=True, bins=np.arange(0, max(summary_cc), 1), ax=ax1)
pd.Series(summary_wc).plot.hist(density=True, bins=np.arange(0, max(summary_wc), 1), ax=ax2)
plt.show()
Again, the shapes are similar. Note the spike at 255 in the character count; this is a vestige of the 255-character limit the site originally imposed on summaries. Overall, summary lengths appear fairly well dispersed.
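We can confirm that spike directly (a quick check, not part of the feature pipeline):
# count summaries sitting exactly at the historical 255-character cap
print(np.sum(np.array(summary_cc) == 255))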
Examining the most frequent words:
In [12]:
# get grams of text
onegram_vectorizer = CountVectorizer(tokenizer=custom_tokenize,
ngram_range=(1,1),
max_features=50)
text_onegram = get_word_count(onegram_vectorizer, df.title_summary)
# get top words
top_words, top_wc = get_top_words(text_onegram, n_top=25)
# plot top words
fig, ax = plt.subplots(figsize=(16, 6))
plt.bar(top_words,top_wc)
plt.xticks(range(len(top_words)),top_words)
plt.title("Top 1-grams")
plt.show()
Now let's look at the distribution of genres across the stories. Note that "General" includes stories that do not have a genre label.
In [13]:
# examines top genres individually
genres_indiv = [item for sublist in df['genre'] for item in sublist]
genres_indiv = pd.Series(genres_indiv).value_counts()
genres_indiv = genres_indiv / np.sum(genres_indiv)
# plot top words
fig, ax = plt.subplots(figsize=(16, 6))
plt.bar(genres_indiv.index, genres_indiv.values)
plt.xticks(range(len(genres_indiv.index)), genres_indiv.index)
plt.title("Genres")
plt.xticks(rotation=90)
plt.show()
Romance takes the lead! In fact, roughly 30% of all genre labels used are romance. Humor and drama come in second and third, respectively.
The least popular genres appear to be parody, crime, and spiritual.
We should keep this disproportionate distribution in mind when we later build our classifiers, since genre will be our target variable.
Since stories fall under more than one genre category, let us also see how the genres combine.
In [14]:
# creates contingency table
gen_pairs = df.loc[[len(row) > 1 for row in df.genre], 'genre']
gen1 = pd.Series([row[0][:3] for row in gen_pairs] + [row[1][:3] for row in gen_pairs])
gen2 = pd.Series([row[1][:3] for row in gen_pairs] + [row[0][:3] for row in gen_pairs])
cross = pd.crosstab(index=gen1, columns=gen2, colnames=[''])
cross.index.name = None  # drop the index name for a cleaner heatmap
# finds relative frequency
for col in cross.columns.values:
cross[col] = cross[col]/np.sum(cross[col])
# plots heatmap
f, ax = plt.subplots(figsize=(16, 16))
cm = sns.color_palette("Blues")
ax = sns.heatmap(cross, cmap=cm, cbar=False, robust=True,
square=True, linewidths=0.1, linecolor='white')
plt.show()
In terms of how genres cross, romance appears to pair with almost everything. It is particularly common with drama (the romantic drama) and humor (the rom-com). The only genre that shies away from romance is parody, which goes hand in hand with humor instead.
The second most crossed genre is adventure, which is often combined with fantasy, sci-fi, mystery, or suspense.
The third genre to note is angst, which is often combined with horror, poetry, or tragedy.
Finally, we would like to examine some of the most important words within each genre. This is done through TF-IDF. Words are processed and cleaned, with capitalization and punctuation removed. Stop words are also removed. Finally, the words are stemmed with the Snowball stemmer (also known as Porter2), an improved and slightly more aggressive version of the commonly used Porter stemmer.
In [15]:
# create vectorizers that converts text to weighted ngrams
get_tfidf = TfidfVectorizer(
tokenizer=custom_tokenize,
sublinear_tf=True,
strip_accents='unicode',
analyzer='word',
token_pattern=r'\w+',
ngram_range=(1,2),
max_features=1000)
# get all weighted one and two grams of titles
text_tfidf = get_word_count(get_tfidf, df.title_summary)
In [16]:
for g in set(df.main_genre):
tfidf_count = text_tfidf.loc[df.main_genre==g].mean(axis=0)
top_words = np.flip(np.argsort(tfidf_count.values), axis=0)[:5]
print(g, ':', ', '.join(tfidf_count.index[top_words]))
As we can see from above, different genres seem to have different distributions of words, and presumably different topics as well. Since we will be building a classifier for genre, it may make sense to include the topics of the text as potential features.
In this section, we will create features through two different methods.
The first is Latent Semantic Analysis (LSA), which represents each text as a vector of features, where each feature is a linear combination of the ngrams in the corpus. The second is Doc2Vec.
In [17]:
n_lsa = 50
lsa_docm_text, lsa_termm_text, lsa_topicm_text = get_lsa(text_tfidf, n=n_lsa)
With LSA, we have captured approximately 18 percent of the variance of the TF-IDF matrix in 50 aggregate features.
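As a quick sanity check on what these components capture, we can peek at the ngrams with the largest weights in the first few LSA components. This is a minimal sketch that assumes the get_tfidf vectorizer fitted above:
# ngrams with the largest absolute weights in the first few LSA components (illustrative only)
terms = np.array(get_tfidf.get_feature_names())
for i in range(3):
    top = np.argsort(np.abs(lsa_termm_text[i]))[::-1][:8]
    print('Component', i, ':', ', '.join(terms[top]))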
As a form of visualization, we further reduce our 50 features to two dimensions with t-SNE and plot each story, colored by the genre it belongs to. As seen below, while there appear to be clear clusters among the stories, they are not necessarily separated by genre.
In [18]:
plot_representation(lsa_docm_text, df['main_genre'], 'LSA representation')
Now we will switch over to vector representations using Doc2Vec. Unlike with the bag-of-words methods, the order in which words appear now matters, which hopefully captures some stylistic writing choices. We will use a window of 5 and a vector length of 20.
In [19]:
tagged_text = [TaggedDocument(words=custom_tokenize(t), tags=[str(i)])
for i, t in enumerate(df.title_summary)]
n_d2v = 20
d2v_text = get_doc2vec(tagged_text, n_d2v)
d2v_docm_text = d2v_text.docvecs.vectors_docs
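As an illustrative check (not part of the pipeline), we can ask the model which stories have the most similar vectors to story 0; the tags are the string indices we assigned above:
# stories whose Doc2Vec vectors are closest to story 0, by cosine similarity
for tag, sim in d2v_text.docvecs.most_similar('0', topn=3):
    print(round(sim, 3), df.title_summary[int(tag)][:80])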
From the plot below, we can see that our vector representation shows no discernible pattern between the extracted content and the genre.
In [20]:
plot_representation(d2v_docm_text, df['main_genre'], 'Doc2Vec representation')
In [21]:
feature_data = df[['storyid', 'main_genre']].copy()
feature_data[['LSA' + str(i) for i in range(n_lsa)]] = pd.DataFrame(lsa_docm_text)
feature_data[['D2V' + str(i) for i in range(n_d2v)]] = pd.DataFrame(d2v_docm_text)
When we tried to represent our text as numerical topic vectors, we saw that, at least in a two-dimensional representation of the features, there was no discernible grouping or clustering by genre.
Here, we will try to represent the text in a different way. Instead of extracting numerical vectors representing its topics or "meaning", we will extract only the sentiment of the text: whether its overall tone is positive or negative.
We will extract sentiment through three separate methods: the NLTK sentiment intensity analyzer, TextBlob, and Stanford CoreNLP.
The NLTK sentiment intensity analyzer is based on VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
In this approach, each word in the lexicon is rated on how positive or negative it is; the ratings were collected through Amazon's Mechanical Turk.
Unlike simpler sentiment analyzers, VADER doesn't just match words in the text against its lexicon; it also looks at context, considering capitalization and punctuation, slang and acronyms, and modifiers and negations/conjunctions.
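As a quick illustration of this context handling (the example sentences below are our own, not drawn from the data):
# VADER picks up intensity cues such as capitalization, punctuation, and negation
demo_sia = SentimentIntensityAnalyzer()
for example in ["The story was good.", "The story was GOOD!!!", "The story was not good."]:
    print(example, '->', demo_sia.polarity_scores(example))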
In [22]:
sia = SentimentIntensityAnalyzer()
sia_scores = []
for s in df.title_summary:
sia_scores.append(sia.polarity_scores(s))
In [23]:
feature_data['sia_pos'] = [s['pos'] for s in sia_scores]
feature_data['sia_neg'] = [s['neg'] for s in sia_scores]
feature_data['sia_neu'] = [s['neu'] for s in sia_scores]
feature_data['sia_compound'] = [s['compound'] for s in sia_scores]
VADER outputs four scores: positive, negative, neutral, and compound, an aggregate of the three.
In [24]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_pos", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
For positivity above, we can see romance and friendship both score highly, which makes a lot of intuitive sense.
In [25]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_neg", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
In terms of negativity, horror and tragedy scored highly, which also makes a lot of intuitive sense.
In [26]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_neu", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
In terms of neutrality, poetry scored the highest.
In [27]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_compound", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
In terms of overall score, humor, romance, spiritual, friendship, family, and crime all scored more positively.
Meanwhile, mystery, horror, sci-fi, supernatural, suspense, tragedy, and hurt-comfort all scored more negatively.
TextBlob is an alternative sentiment library that comes in two forms. The first uses the PatternAnalyzer, a lexicon-based analyzer (with part-of-speech handling) that returns the polarity and subjectivity of a text. The second uses a bag-of-words Naive Bayes classifier, which returns the probability that a text is positive.
The Naive Bayes analyzer is trained on the NLTK movie reviews corpus.
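As a minimal illustration of the two interfaces (the sentence is our own example), each analyzer returns a different sentiment tuple:
# the Naive Bayes analyzer returns (classification, p_pos, p_neg);
# the Pattern analyzer returns (polarity, subjectivity)
demo_nba = Blobber(analyzer=NaiveBayesAnalyzer())
demo_pa = Blobber(analyzer=PatternAnalyzer())
print(demo_nba("A thrilling and heartwarming adventure.").sentiment)
print(demo_pa("A thrilling and heartwarming adventure.").sentiment)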
In [28]:
nba = Blobber(analyzer=NaiveBayesAnalyzer())
pa = Blobber(analyzer=PatternAnalyzer())
def get_blob(s):
return nba(s).sentiment, pa(s).sentiment
num_cores = multiprocessing.cpu_count()
blob_scores = Parallel(n_jobs=num_cores)(delayed(get_blob)(s) for s in df.title_summary)
In [29]:
feature_data['blob_pos'] = [s[0].p_pos for s in blob_scores]
feature_data['blob_pol'] = [s[1].polarity for s in blob_scores]
feature_data['blob_sub'] = [s[1].subjectivity for s in blob_scores]
In [30]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="blob_pos", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
As we can see, TextBlob yields slightly different results. Adventure, mystery, sci-fi, spiritual, and crime are the genres yielding the most positive scores. Meanwhile, poetry and parody yield the most negative.
In [31]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="blob_pol", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
Polarity does not appear to be marked by clear differentiation across genres.
In [32]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="blob_sub", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
We can see that poetry has the lowest subjectivity score.
Finally, we will look at the Stanford CoreNLP sentiment analysis. Based on Richard Socher et al.'s Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, the model is a recursive neural network that uses tensors instead of matrices as parameters.
The model is trained on a sentiment treebank: a publicly released dataset in which individual phrases of various sizes are manually associated with sentiment labels. Instead of predicting sentiment from whole sentences directly, the model takes an incremental approach, scoring subparts of each sentence and then combining them.
To use the Stanford NLP, we had to first install and run the server.
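As a minimal sketch of what a single request returns (assuming the server above is running on port 9000), we can annotate one sentence of our own and read off the two fields used below:
# single illustrative request; we only use 'sentimentValue' and 'sentimentDistribution' downstream
demo = stanford.annotate("An unexpectedly dark turn for such a gentle story.",
                         properties={'annotators': 'sentiment', 'outputFormat': 'json'})
for sentence in demo['sentences']:
    print(sentence['sentimentValue'], sentence['sentimentDistribution'])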
In [33]:
# java -mx1g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 25000
stanford_scores = []
for s in df.title_summary:
    s = re.sub(r'[\[\]]+', '', s)
    result = stanford.annotate(s,
                               properties={
                                   'annotators': 'sentiment',
                                   'outputFormat': 'json',
                                   'timeout': '10000'
                               })
    # average the per-sentence scores so that each story contributes exactly one entry
    values = [int(r['sentimentValue']) for r in result['sentences']]
    dists = [r['sentimentDistribution'] for r in result['sentences']]
    stanford_scores.append((np.mean(values), np.mean(dists, axis=0)))
In [34]:
feature_data['stanford_score'] = [int(round(g[0])) for g in stanford_scores]  # round mean sentence score back to an integer label
feature_data['stanford_s1'] = [g[1][0] for g in stanford_scores]
feature_data['stanford_s2'] = [g[1][1] for g in stanford_scores]
feature_data['stanford_s3'] = [g[1][2] for g in stanford_scores]
feature_data['stanford_s4'] = [g[1][3] for g in stanford_scores]
feature_data['stanford_s5'] = [g[1][4] for g in stanford_scores]
In [35]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="stanford_s5", data=feature_data)
plt.xticks(rotation=90)
plt.ylim(0, 0.25)
plt.ylabel('')
plt.xlabel('')
plt.show()
For the positive end of the sentiment distribution, we can see humor scores the highest.
In [36]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="stanford_s1", data=feature_data)
plt.xticks(rotation=90)
plt.ylim(0, 0.25)
plt.ylabel('')
plt.xlabel('')
plt.show()
Meanwhile, crime is the highest for the negative end of the distribution.
Emotion analysis is an extension of sentiment analysis that replaces the positive-negative spectrum with the full range of emotions, such as joy, sadness, or anger.
Unlike for sentiment analysis, we were unable to find any off-the-shelf libraries for emotion analysis, so we will construct our own emotion metrics. To do this, we will use the NRC emotion lexicons: EmoLex, the Affect Intensity Lexicon, the VAD Lexicon, and the Hashtag Emotion Lexicon.
EmoLex, or the NRC Word-Emotion Association Lexicon, has annotations for eight emotions across 14,182 unigrams. It uses a binary coding in which 1 means a word is associated with an emotion and 0 means it is not. The lexicon was constructed manually through crowdsourcing.
Below, we have printed 10 random rows:
In [37]:
random_indices = list(np.random.choice(emolex.loc[emolex.is_emotion==1].index, 10))
emolex.iloc[random_indices]
Out[37]:
In [38]:
print(base_emotions)
The different emotions are shown above.
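As an illustration, a single word's full emotion vector can be recovered by pivoting its rows (here we pick one word and assume it appears in the lexicon):
# binary emotion vector for a single (illustrative) lexicon word
emolex.loc[emolex.term == 'abandoned'].pivot(index='term', columns='emotion', values='is_emotion')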
Now, for each text, we will construct a simple count for each emotion.
In [39]:
token_text = [simple_tokenize(t) for t in df.title_summary]
In [40]:
for emo in base_emotions:
emo_terms = set(emolex.loc[(emolex.emotion == emo) & (emolex.is_emotion == 1)].term)
emo_results = []
for summary in token_text:
emo_results.append(np.sum([w in emo_terms for w in summary]))
feature_data['nrc_' + emo] = emo_results
Below, we will show the normalized aggregated counts for each genre:
In [41]:
plot_emotions('nrc_', base_emotions)
The results align remarkably closely with our intuition! For example, horror scores highly in both disgust and fear, whereas tragedy scores highest in sadness.
The Affect Intensity Lexicon is a list of English words and their associations with four basic emotions (anger, fear, sadness, joy). Unlike the previous lexicon, here each word carries a real-valued intensity score for an emotion rather than a binary flag. Like the previous lexicon, it was constructed manually, using best-worst scaling.
The lexicon contains approximately 6,000 words.
In [42]:
random_indices = list(np.random.choice(range(intlex.shape[0]), 5))
intlex.iloc[random_indices]
Out[42]:
In [43]:
print(intensity_emotions)
In [44]:
for emo in intensity_emotions:
emo_terms = intlex.loc[intlex.emotion == emo].term.values
emo_values = intlex.loc[intlex.emotion == emo].score.values
emo_results = []
for summary in token_text:
emo_results.append(np.sum(emo_values[np.isin(emo_terms, summary)]))
feature_data['nrc_int_' + emo] = emo_results
As before, we will examine the emotions across genres.
In [45]:
plot_emotions('nrc_int_', intensity_emotions)
Sci-fi has the most fear, tragedy has the most sadness, crime has the most anger, and spiritual has the most joy.
The VAD lexicon is based on influential factor analysis studies, which have shown that the primary dimensions of word meaning are valence, arousal, and dominance.
The lexicon contains more than 20,000 words with valence, arousal, and dominance scores ranging from 0 to 1; it was created through manual annotation with scaling.
In [46]:
random_indices = list(np.random.choice(range(vadlex.shape[0]), 5))
vadlex.iloc[random_indices]
Out[46]:
In [47]:
print(vad_emotions)
In [48]:
val_results = []
aro_results = []
dom_results = []
for summary in token_text:
has_terms = np.isin(vadlex.term.values, summary)
val_results.append(np.sum(vadlex['valence'][has_terms]))
aro_results.append(np.sum(vadlex['arousal'][has_terms]))
dom_results.append(np.sum(vadlex['dominance'][has_terms]))
feature_data['nrc_vad_valence'] = val_results
feature_data['nrc_vad_arousal'] = aro_results
feature_data['nrc_vad_dominance'] = dom_results
In [49]:
plot_emotions('nrc_vad_', vad_emotions)
Spiritual has the highest valence, horror the highest arousal, and sci-fi the highest dominance.
The Hashtag Emotion Lexicon (built from the Twitter Emotion Corpus) is an automatically generated lexicon derived from tweets: a word's score is determined by its tendency to co-occur with emotion-word hashtags.
The advantage of this lexicon is that it is much larger than manually created lexicons and covers slang and internet phrases.
In [50]:
random_indices = list(np.random.choice(range(hashlex.shape[0]), 5))
hashlex.iloc[random_indices]
Out[50]:
In [51]:
print(hash_emotions)
In [52]:
for emo in hash_emotions:
emo_terms = hashlex.loc[hashlex.emotion == emo].hashtag.values
emo_values = hashlex.loc[hashlex.emotion == emo].score.values
emo_results = []
for summary in token_text:
emo_results.append(np.sum(emo_values[np.isin(emo_terms, summary)]))
feature_data['nrc_hash_' + emo] = emo_results
In [53]:
plot_emotions('nrc_hash_', hash_emotions)
While the results overlap substantially with those from the manually built lexicons, there are dissimilarities too. For example, spiritual rates very highly in disgust here, whereas the reverse is true in the EmoLex results.
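To quantify the overlap, we can correlate the EmoLex counts with the hashtag-lexicon scores for the emotions the two lexicons share (a quick sketch using the features built above):
# correlation between EmoLex counts and hashtag-lexicon scores for shared emotions
for emo in sorted(set(base_emotions) & set(hash_emotions)):
    print(emo, round(feature_data['nrc_' + emo].corr(feature_data['nrc_hash_' + emo]), 3))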
In our final section, we will try to predict genre based on the sentiments and emotions extracted from our text.
In [261]:
# sentiment variables
sia_vars = ['sia_pos', 'sia_neg', 'sia_neu', 'sia_compound']
blob_vars = ['blob_pos', 'blob_pol', 'blob_sub']
stanford_vars = ['stanford_s1', 'stanford_s2', 'stanford_s4', 'stanford_s5']
# emotion variables
base_emotion_vars = ['nrc_' + e for e in base_emotions]
intensity_emotion_vars = ['nrc_int_' + e for e in intensity_emotions]
hash_emotion_vars = ['nrc_hash_' + e for e in hash_emotions]
vad_emotion_vars = ['nrc_vad_' + e for e in vad_emotions]
sentiment_vars = sia_vars + blob_vars + stanford_vars
emotion_vars = base_emotion_vars + intensity_emotion_vars + hash_emotion_vars + vad_emotion_vars
In [433]:
classification_data = pd.DataFrame()
classification_data = feature_data[sentiment_vars + emotion_vars].copy()
#classification_data['storyid'] = df['storyid'].astype('int')
#classification_data['words'] = df['words'].astype('int')
#classification_data[['LSA' + str(i) for i in range(50)]] = feature_data[['LSA' + str(i) for i in range(50)]]
#classification_data[['D2V' + str(i) for i in range(20)]] = feature_data[['D2V' + str(i) for i in range(20)]]
In [407]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [408]:
clf = LogisticRegression(random_state=0, solver='newton-cg', max_iter=500,
multi_class='multinomial').fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [409]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
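For context, given the genre imbalance noted earlier, it helps to compare this against the majority-class baseline (a quick sketch):
# accuracy from always predicting the most common genre in the test split
print('Majority-class baseline: ', np.max(np.bincount(y_test)) / len(y_test))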
In [410]:
plot_ROC(y_test_bin, y_pred)
In [411]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [412]:
clf = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True, gamma='scale', random_state=0))
y_pred = clf.fit(X_train, y_train_bin).decision_function(X_test)
In [413]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print(accuracy_score(y_test, best_preds))
In [414]:
plot_ROC(y_test_bin, y_pred)
In [415]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [416]:
clf = KNeighborsClassifier(n_neighbors=n_classes).fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [417]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
In [418]:
plot_ROC(y_test_bin, y_pred)
In [419]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [420]:
clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [421]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
In [422]:
plot_ROC(y_test_bin, y_pred)
In [423]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [424]:
clf = MLPClassifier(max_iter=2000).fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [425]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
In [426]:
plot_ROC(y_test_bin, y_pred)
In [449]:
X = classification_data
Y = feature_data['main_genre'].factorize()[0]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [450]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
In [489]:
param = {
'booster': 'gbtree',
    'nthread': 8,
'max_depth': 3,
'eta': 0.1,
'silent': 1,
'reg_alpha': 10,
'reg_lambda': 10,
'gamma': 0,
    'objective': 'multi:softprob',            # multiclass objective that outputs class probabilities
    'num_class': len(np.unique(y_train))}     # the number of classes in this dataset
# note: early stopping would require passing an eval set and early_stopping_rounds to xgb.train itself
num_round = 500 # the number of training iterations
bst = xgb.train(param, dtrain, num_round)
y_pred = bst.predict(dtest)
In [490]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print(accuracy_score(y_test, best_preds))
In [491]:
plot_ROC(y_test_bin, y_pred)
In [485]:
fig, ax = plt.subplots(figsize=(16, 6))
xgb.plot_importance(bst, ax=ax, max_num_features = 15)
plt.show()