In this notebook, we will be examining the sentiment and emotions of online fanfiction publications.
Our working hypothesis is that the genre of a work is heavily encapsulated by its emotional content and tone. Therefore, if we can successfully decompose textual data into its representative emotions, we can then predict its genre.
Our goal is two-fold. First, we would like to examine the sentimental and emotional composition of each genre. Second, we would like to see if we can predict the genre of a story based on its text and emotional composition.
In this section, we will set up our working environment.
In [1]:
# imports utility libraries
import pandas as pd
import numpy as np
import pickle
import datetime
import string
import re
import multiprocessing
import warnings
from joblib import Parallel, delayed
from itertools import cycle
from IPython.core.display import display, HTML
# import scipy
from scipy import interp
from scipy.stats import chisquare
# import graphics
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.colors import ListedColormap
%matplotlib inline
# import sklearn tools
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.decomposition import PCA, SparsePCA, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.preprocessing import scale, label_binarize, Binarizer, Normalizer
from sklearn.manifold import TSNE
from sklearn import metrics, svm, datasets
from sklearn.metrics import silhouette_score, accuracy_score, precision_score, roc_curve, auc
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
# import nltk tools
import nltk
from nltk import tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# import xgboost
import xgboost as xgb
# import gensim tools
from gensim import corpora, models
from gensim.models.coherencemodel import CoherenceModel
from gensim.test.utils import common_corpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# import textblob
from textblob import TextBlob, Blobber
from textblob.sentiments import NaiveBayesAnalyzer, PatternAnalyzer
# stanford NLP
from pycorenlp import StanfordCoreNLP
stanford = StanfordCoreNLP('http://localhost:9000')
In [2]:
# configure graphics (taken from session notes)
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
(0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
(0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
(0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
(0.4, 0.6509803921568628, 0.11764705882352941),
(0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
(0.6509803921568628, 0.4627450980392157, 0.11372549019607843)]
dark2_cmap = ListedColormap(dark2_colors)
def set_mpl_params():
rcParams['figure.figsize'] = (12, 6)
rcParams['figure.dpi'] = 100
rcParams['axes.prop_cycle'].by_key()['color'][1]
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'white'
rcParams['patch.facecolor'] = dark2_colors[0]
rcParams['font.family'] = 'StixGeneral'
mpl.rcParams.update(mpl.rcParamsDefault)
set_mpl_params()
mpl.style.use('ggplot')
In [3]:
# downloads required NLTK corpora (uncomment and run once)
## nltk.download()
## nltk.download('punkt')
## nltk.download('vader_lexicon')
## nltk.download('movie_reviews')
# sets stop word list for text parsing
stop_words = stopwords.words('english')
In this section, we will be importing the cleaned fanfiction data as well as the emotional lexicons.
The fanfiction data that we will use is scraped from the official fanfiction.net site. We downloaded a random sample of 5,000 stories out of approximately 13 million. Of these 5,000 stories, approximately half were both still online (that is, not deleted) and written in English; this is the subset on which we will perform our analysis.
The python scripts we created to scrape the data can be found here.
Furthermore, for our emotion analysis, we have also downloaded and imported the emotion lexicons of Dr. Saif M. Mohammad, a Senior Research Scientist at the National Research Council Canada (NRC).
For the sake of readability, we have also gathered all the helper functions we will be calling into this section. These include functions for plotting/graphics, textual decomposition, clustering, and extracting passages/lines.
In [4]:
# opens cleaned data
with open ('../data/clean_data/df_story', 'rb') as fp:
df = pickle.load(fp)
df = df.loc[(df.state == 'online') & (df.language == 'English'), ].copy()
df['title_summary'] = df['title'] + ' ' + df['summary']
df['main_genre'] = [g[0] for g in df.genre]
df.index = range(df.shape[0])
# opens features data
with open ('../data/clean_data/feature_data', 'rb') as fp:
feature_data = pickle.load(fp)
In [5]:
# opens emotion lexicon data
emolex = pd.read_csv("../data/nrc_emotion_data/nrc_emotion_lexicon.txt", sep="\t", header=None)
emolex.columns = ['term', 'emotion', 'is_emotion']
intlex = pd.read_csv("../data/nrc_emotion_data/nrc_affectintensity_lexicon.txt", sep="\t")
intlex.columns = ['term', 'score', 'emotion']
hashlex = pd.read_csv("../data/nrc_emotion_data/nrc_hashtag_emotion_lexicon.txt", sep="\t", header=None)
hashlex.columns = ['emotion', 'hashtag', 'score']
hashlex['hashtag'] = [re.sub('#+', '', str(h)) for h in hashlex.hashtag]
vadlex = pd.read_csv("../data/nrc_emotion_data/nrc_vad_lexicon.txt", sep="\t")
vadlex.columns = ['term', 'valence', 'arousal', 'dominance']
# define emotion lists
base_emotions = list(set(emolex.emotion))
intensity_emotions = list(set(intlex.emotion))
vad_emotions = ['valence', 'arousal', 'dominance']
hash_emotions = list(set(hashlex.emotion))
In [6]:
def plot_breakdown(x, title, r=90, hide=False):
plt.rcParams['figure.figsize'] = [16, 3]
plt.bar(range(len(x)), x)
plt.xticks(range(len(x)), x.index, rotation=r)
if hide:
plt.tick_params(axis='x', bottom=False, labelbottom=False)
plt.title(title)
plt.show()
def plot_silhouette(x, y, title="Silhouette vs. k"):
plt.rcParams['figure.figsize'] = [16, 2]
plt.plot(x, y)
plt.scatter(x, y)
plt.title(title)
plt.show()
# plot t-sne scatter with colored groups
def plot_representation(matrix, group, title, alpha=0.5):
tsne_model = TSNE(n_components=2)
tsne_matrix = tsne_model.fit_transform(matrix)
tsne_df = pd.DataFrame(tsne_matrix)
tsne_df.columns = ["Dimension 1", "Dimension 2"]
tsne_df['group'] = group
fig = sns.lmplot('Dimension 1', 'Dimension 2',
data=tsne_df, hue="group", fit_reg=False,
                     scatter_kws={'alpha': alpha}, legend=False)
fig.fig.set_size_inches(16, 9)
plt.title(title)
plt.legend(loc='right', ncol=1)
plt.show()
# plot grouped scatter of observations by label
def plot_scenes(data, group, outlier=False):
data = data.copy()
data.columns = ['label', 'y', 'x']
data['group'] = group
data['counts'] = data.shape[0] / data['group'].map(data['group'].value_counts())
data['counts'] = (data['counts'] / np.max(data['counts'])) * 100
if outlier:
data['counts'] = [100 if g==-1 else 30 for g in group]
plt.rcParams['figure.figsize'] = [16, 12]
plt.scatter(data['x'], data['y'], s=data['counts'], c=data['group'], cmap=plt.get_cmap('tab20'))
plt.yticks(range(len(sorted(list(set(data['label']))))), sorted(list(set(data['label']))))
plt.xticks([])
plt.ylabel('')
plt.xlabel('')
plt.show()
def plot_emotions(header, emotions):
plot_vars = ['main_genre'] + [header + e for e in emotions]
plot_data = feature_data[plot_vars].groupby("main_genre").agg(sum)
plot_data['genre_counts'] = plot_data.index
plot_data['genre_counts'] = plot_data['genre_counts'].map(df['main_genre'].value_counts())
plot_data = plot_data.div(plot_data.sum(axis=1), axis=0) # normalize by genre
plot_data = plot_data.div(plot_data.sum(axis=0), axis=1) # normalize by emotion
plot_data = plot_data.subtract(plot_data.mean(axis=0), axis=1) # take deviation from mean
for emo in emotions:
fig, ax = plt.subplots(figsize=(16, 3))
plt.bar(plot_data.index, plot_data[header + emo])
plt.xticks(range(len(plot_data.index)),plot_data.index)
plt.title(emo)
plt.xticks(rotation=90)
plt.show()
# create custom tokenizer without stemmer
def base_tokenize(text):
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
tokens = [t.lower() for t in tokens if t.lower() not in stop_words]
return tokens
# create custom tokenizer without stemming (used for lexicon matching below)
def simple_tokenize(text):
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(text)
    tokens = [t.lower() for t in tokens if t.lower() not in stop_words]
    return tokens
# create custom tokenizer that includes stemmer
def custom_tokenize(text):
tokenizer = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
tokens = tokenizer.tokenize(text)
tokens = [stemmer.stem(w) for w in tokens]
tokens = [t for t in tokens if t not in stop_words]
tokens = [t for t in tokens if len(t) > 1]
return tokens
# turn bodies of text into word count vectors
def get_word_count(vectorizer, text):
count = vectorizer.fit_transform(text)
count = pd.DataFrame(count.toarray())
count.columns = vectorizer.get_feature_names()
return count
# get the top words across all observations
def get_top_words(count, n_top):
total_count = np.sum(count, axis=0)
sorted_index = np.flip(np.argsort(total_count), axis=0)
sorted_index = np.array(sorted_index[0:n_top])
top_words = np.array(count.columns)[sorted_index]
top_words_count = np.array(total_count)[sorted_index]
return top_words, top_words_count
def get_pca(x, n=100):
pca_model = PCA(n_components=n, random_state=0)
pca_matrix = pca_model.fit_transform(scale(x))
print("Explained variance: ", np.sum(pca_model.explained_variance_ratio_))
return pca_matrix
def get_lsa(x, n=100):
lsa_model = TruncatedSVD(n_components=n)
doc_matrix = lsa_model.fit_transform(x)
term_matrix = lsa_model.components_
topic_matrix = np.diag(lsa_model.singular_values_)
print("Explained variance: ", np.sum(lsa_model.explained_variance_ratio_))
return doc_matrix, term_matrix, topic_matrix
def get_doc2vec(x, n=25):
d2v_model = Doc2Vec(vector_size=n, min_count=1, window=5, workers=7, dm=1)
d2v_model.build_vocab(x)
d2v_model.train(x, total_examples=d2v_model.corpus_count, epochs=10)
return d2v_model
def plot_ROC(y_test, y_pred, n_classes=20):
# compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_pred.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))
    # interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
mean_tpr += interp(all_fpr, fpr[i], tpr[i])
# average and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])
# plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
label='micro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["micro"]),
color='deeppink', linestyle=':', linewidth=2)
plt.plot(fpr["macro"], tpr["macro"],
label='macro-average ROC curve (area = {0:0.2f})'
''.format(roc_auc["macro"]),
color='navy', linestyle=':', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multiclass ROC')
plt.legend(loc="lower right")
plt.show()
In this section, we will perform some exploratory data analysis on the fanfiction data.
In [7]:
df_vars = ['storyid', 'title', 'summary', 'fandom', 'rated', 'words', 'main_genre']
df[df_vars].head()
Out[7]:
In [8]:
print('Number of observations: ', df.shape[0])
Let's take a look at the distribution of word lengths.
In [9]:
# examines distribution of number of words
df['words1k'] = df['words']/1000
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True, figsize=(16, 4))
sns.kdeplot(df['words1k'], shade=True, bw=.5, legend=False, ax=ax1)
sns.kdeplot(df['words1k'], shade=True, bw=.5, legend=False, ax=ax2)
plt.xlim(0,100)
plt.show()
The bulk of stories appear to be under 50 thousand words, with a high proportion between 0 and 20 thousand. In other words, we have a significant proportion of short stories and novelettes, some novellas, and fewer novels. Finally, there are a few "epics", ranging from 200 thousand to 600 thousand words.
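As a rough breakdown, we can bucket the stories using conventional fiction-length cutoffs (the cutoffs below are our own assumption, not categories defined by the site):
# bucket stories by rough, conventional length cutoffs (assumed, not site-defined)
length_bins = pd.cut(df['words'],
                     bins=[0, 7500, 17500, 40000, 200000, np.inf],
                     labels=['short story', 'novelette', 'novella', 'novel', 'epic'])
print(length_bins.value_counts())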
Working with full texts tens of thousands of words long is computationally intensive and may not provide much additional information for our purposes. As such, we will primarily use the titles and summaries as a proxy for the content of the stories.
How long are titles and summaries? Is there a systematic way authors write them? Do some words appear more often than others? Here we explore some of those questions.
Let's start by examining character and word count, respectively, for titles.
In [10]:
# examines word/character count of titles
title_cc = [len(row) for row in df['title']]
title_wc = [len(row.split()) for row in df['title']]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
pd.Series(title_cc).plot.hist(density=True, bins=np.arange(0, max(title_cc), 1), ax=ax1)
pd.Series(title_wc).plot.hist(density=True, bins=np.arange(0, max(title_wc), 1), ax=ax2)
plt.show()
The two distributions are nearly identical in shape. It would appear titles typically run 2-3 words, or 15-20 characters.
Now let's look at summaries.
In [11]:
# examines word/character count of summaries
summary_cc = [len(row) for row in df['summary']]
summary_wc = [len(row.split()) for row in df['summary']]
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 4))
pd.Series(summary_cc).plot.hist(density=True, bins=np.arange(0, max(summary_cc), 1), ax=ax1)
pd.Series(summary_wc).plot.hist(density=True, bins=np.arange(0, max(summary_wc), 1), ax=ax2)
plt.show()
Again, the shapes are similar. Note the spike at 255 in the character count; this is a vestige of the 255-character limit the site originally imposed on summaries. Overall, summary lengths appear fairly well dispersed.
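We can confirm that spike directly (a quick check, not part of the feature pipeline):
# count summaries sitting exactly at the historical 255-character cap
print(np.sum(np.array(summary_cc) == 255))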
Examining the most frequent words:
In [12]:
# get grams of text
onegram_vectorizer = CountVectorizer(tokenizer=custom_tokenize,
ngram_range=(1,1),
max_features=50)
text_onegram = get_word_count(onegram_vectorizer, df.title_summary)
# get top words
top_words, top_wc = get_top_words(text_onegram, n_top=25)
# plot top words
fig, ax = plt.subplots(figsize=(16, 6))
plt.bar(top_words,top_wc)
plt.xticks(range(len(top_words)),top_words)
plt.title("Top 1-grams")
plt.show()
Now let's look at the distribution of genres across the stories. Note that "General" includes stories that do not have a genre label.
In [13]:
# examines top genres individually
genres_indiv = [item for sublist in df['genre'] for item in sublist]
genres_indiv = pd.Series(genres_indiv).value_counts()
genres_indiv = genres_indiv / np.sum(genres_indiv)
# plot top words
fig, ax = plt.subplots(figsize=(16, 6))
plt.bar(genres_indiv.index, genres_indiv.values)
plt.xticks(range(len(genres_indiv.index)), genres_indiv.index)
plt.title("Genres")
plt.xticks(rotation=90)
plt.show()
Romance takes the lead! In fact, roughly 30% of all genre labels used are romance. Humor and drama come in second and third, respectively.
The least popular genres appear to be parody, crime, and spiritual.
We should keep this disproportionate distribution in mind when we later build our classifiers, since genre will be our target variable.
Since stories fall under more than one genre category, let us also see how the genres combine.
In [14]:
# creates contingency table
gen_pairs = df.loc[[len(row) > 1 for row in df.genre], 'genre']
gen1 = pd.Series([row[0][:3] for row in gen_pairs] + [row[1][:3] for row in gen_pairs])
gen2 = pd.Series([row[1][:3] for row in gen_pairs] + [row[0][:3] for row in gen_pairs])
cross = pd.crosstab(index=gen1, columns=gen2, colnames=[''])
cross.index.name = None  # drop the index name for a cleaner heatmap
# finds relative frequency
for col in cross.columns.values:
cross[col] = cross[col]/np.sum(cross[col])
# plots heatmap
f, ax = plt.subplots(figsize=(16, 16))
cm = sns.color_palette("Blues")
ax = sns.heatmap(cross, cmap=cm, cbar=False, robust=True,
square=True, linewidths=0.1, linecolor='white')
plt.show()
In terms of how genres cross, romance appears to pair with almost everything. It is particularly common with drama (the romantic drama) and humor (the rom-com). The only genre that shies away from romance is parody, which goes hand in hand with humor instead.
The second most crossed genre is adventure, which is often combined with fantasy, sci-fi, mystery, or suspense.
The third genre to note is angst, which is often combined with horror, poetry, or tragedy.
Finally, we would like to examine some of the most important words within each genre. This is done through TF-IDF. Words are processed and cleaned, with capitalization and punctuation removed. Stop words are also removed. Finally, the words are stemmed with the Snowball stemmer (also known as Porter2), an improved and slightly more aggressive version of the commonly used Porter stemmer.
In [15]:
# create vectorizers that converts text to weighted ngrams
get_tfidf = TfidfVectorizer(
tokenizer=custom_tokenize,
sublinear_tf=True,
strip_accents='unicode',
analyzer='word',
token_pattern=r'\w+',
ngram_range=(1,2),
max_features=1000)
# get all weighted one and two grams of titles
text_tfidf = get_word_count(get_tfidf, df.title_summary)
In [16]:
for g in set(df.main_genre):
tfidf_count = text_tfidf.loc[df.main_genre==g].mean(axis=0)
top_words = np.flip(np.argsort(tfidf_count.values), axis=0)[:5]
print(g, ':', ', '.join(tfidf_count.index[top_words]))
As we can see from above, different genres seem to have different distributions of words, and presumably different topics as well. Since we will be building a classifier for genre, it may make sense to include the topics of the text as potential features.
In this section, we will create features through two different methods.
The first is Latent Semantic Analysis (LSA), which represents each text as a vector of features, where each feature is a linear combination of the ngrams in the corpus. The second is Doc2Vec.
In [17]:
n_lsa = 50
lsa_docm_text, lsa_termm_text, lsa_topicm_text = get_lsa(text_tfidf, n=n_lsa)
With LSA, we have captured approximately 18 percent of the variance of the TF-IDF matrix in 50 aggregate features.
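As a quick sanity check on what these components capture, we can peek at the ngrams with the largest weights in the first few LSA components. This is a minimal sketch that assumes the get_tfidf vectorizer fitted above:
# ngrams with the largest absolute weights in the first few LSA components (illustrative only)
terms = np.array(get_tfidf.get_feature_names())
for i in range(3):
    top = np.argsort(np.abs(lsa_termm_text[i]))[::-1][:8]
    print('Component', i, ':', ', '.join(terms[top]))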
As a form of visualization, we further reduce our 50 features to two dimensions with t-SNE and plot each story, colored by the genre it belongs to. As seen below, while there appear to be clear clusters among the stories, they are not necessarily separated by genre.
In [18]:
plot_representation(lsa_docm_text, df['main_genre'], 'LSA representation')
Now we will switch over to vector representations using Doc2Vec. Unlike with the bag-of-words methods, the order in which words appear now matters, which hopefully captures some stylistic writing choices. We will use a window of 5 and a vector length of 20.
In [19]:
tagged_text = [TaggedDocument(words=custom_tokenize(t), tags=[str(i)])
for i, t in enumerate(df.title_summary)]
n_d2v = 20
d2v_text = get_doc2vec(tagged_text, n_d2v)
d2v_docm_text = d2v_text.docvecs.vectors_docs
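As an illustrative check (not part of the pipeline), we can ask the model which stories have the most similar vectors to story 0; the tags are the string indices we assigned above:
# stories whose Doc2Vec vectors are closest to story 0, by cosine similarity
for tag, sim in d2v_text.docvecs.most_similar('0', topn=3):
    print(round(sim, 3), df.title_summary[int(tag)][:80])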
From the plot below, we can see that our vector representation shows no discernible pattern between the extracted content and the genre.
In [20]:
plot_representation(d2v_docm_text, df['main_genre'], 'Doc2Vec representation')
In [21]:
feature_data = df[['storyid', 'main_genre']].copy()
feature_data[['LSA' + str(i) for i in range(n_lsa)]] = pd.DataFrame(lsa_docm_text)
feature_data[['D2V' + str(i) for i in range(n_d2v)]] = pd.DataFrame(d2v_docm_text)
When we tried to represent our text as numerical topic vectors, we saw that, at least in a two-dimensional representation of the features, there was no discernible grouping or clustering by genre.
Here, we will try to represent the text in a different way. Instead of extracting numerical vectors representing its topics or "meaning", we will extract only the sentiment of the text: whether its overall tone is positive or negative.
We will extract sentiment through three separate methods: the NLTK sentiment intensity analyzer, TextBlob, and Stanford CoreNLP.
The NLTK sentiment intensity analyzer is based on VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.
In this approach, each word in the lexicon is rated on how positive or negative it is; the ratings were collected through Amazon's Mechanical Turk.
Unlike simpler sentiment analyzers, VADER doesn't just match words in the text against its lexicon; it also looks at context, considering capitalization and punctuation, slang and acronyms, and modifiers and negations/conjunctions.
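As a quick illustration of this context handling (the example sentences below are our own, not drawn from the data):
# VADER picks up intensity cues such as capitalization, punctuation, and negation
demo_sia = SentimentIntensityAnalyzer()
for example in ["The story was good.", "The story was GOOD!!!", "The story was not good."]:
    print(example, '->', demo_sia.polarity_scores(example))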
In [22]:
sia = SentimentIntensityAnalyzer()
sia_scores = []
for s in df.title_summary:
sia_scores.append(sia.polarity_scores(s))
In [23]:
feature_data['sia_pos'] = [s['pos'] for s in sia_scores]
feature_data['sia_neg'] = [s['neg'] for s in sia_scores]
feature_data['sia_neu'] = [s['neu'] for s in sia_scores]
feature_data['sia_compound'] = [s['compound'] for s in sia_scores]
VADER outputs four scores: positive, negative, neutral, and compound, an aggregate of the three.
In [24]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_pos", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
For positivity above, we can see romance and friendship both score highly, which makes a lot of intuitive sense.
In [25]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_neg", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
In terms of negativity, horror and tragedy scored highly, which also makes a lot of intuitive sense.
In [26]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_neu", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
In terms of neutrality, poetry scored the highest.
In [27]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="sia_compound", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
In terms of overall score, humor, romance, spiritual, friendship, family, and crime all scored more positively.
Meanwhile, mystery, horror, sci-fi, supernatural, suspense, tragedy, and hurt-comfort all scored more negatively.
TextBlob is an alternative sentiment library that comes in two forms. The first uses the PatternAnalyzer, a lexicon-based analyzer (with part-of-speech handling) that returns the polarity and subjectivity of a text. The second uses a bag-of-words Naive Bayes classifier, which returns the probability that a text is positive.
The Naive Bayes analyzer is trained on the NLTK movie reviews corpus.
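As a minimal illustration of the two interfaces (the sentence is our own example), each analyzer returns a different sentiment tuple:
# the Naive Bayes analyzer returns (classification, p_pos, p_neg);
# the Pattern analyzer returns (polarity, subjectivity)
demo_nba = Blobber(analyzer=NaiveBayesAnalyzer())
demo_pa = Blobber(analyzer=PatternAnalyzer())
print(demo_nba("A thrilling and heartwarming adventure.").sentiment)
print(demo_pa("A thrilling and heartwarming adventure.").sentiment)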
In [28]:
nba = Blobber(analyzer=NaiveBayesAnalyzer())
pa = Blobber(analyzer=PatternAnalyzer())
def get_blob(s):
return nba(s).sentiment, pa(s).sentiment
num_cores = multiprocessing.cpu_count()
blob_scores = Parallel(n_jobs=num_cores)(delayed(get_blob)(s) for s in df.title_summary)
In [29]:
feature_data['blob_pos'] = [s[0].p_pos for s in blob_scores]
feature_data['blob_pol'] = [s[1].polarity for s in blob_scores]
feature_data['blob_sub'] = [s[1].subjectivity for s in blob_scores]
In [30]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="blob_pos", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
As we can see, TextBlob yields slightly different results. Adventure, mystery, sci-fi, spiritual, and crime are the genres yielding the most positive scores. Meanwhile, poetry and parody yield the most negative.
In [31]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="blob_pol", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
Polarity does not appear to be marked by clear differentiation across genres.
In [32]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="blob_sub", data=feature_data)
plt.xticks(rotation=90)
plt.ylabel('')
plt.xlabel('')
plt.show()
We can see that poetry has the lowest subjectivity score.
Finally, we will look at the Stanford CoreNLP sentiment analysis. Based on Richard Socher et al.'s Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, the model is a recursive neural network that uses tensors instead of matrices as parameters.
The model is trained on a sentiment treebank: a publicly released dataset in which individual phrases of various sizes are manually associated with sentiment labels. Instead of predicting sentiment from whole sentences directly, the model takes an incremental approach, scoring subparts of each sentence and then combining them.
To use the Stanford NLP, we had to first install and run the server.
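As a minimal sketch of what a single request returns (assuming the server above is running on port 9000), we can annotate one sentence of our own and read off the two fields used below:
# single illustrative request; we only use 'sentimentValue' and 'sentimentDistribution' downstream
demo = stanford.annotate("An unexpectedly dark turn for such a gentle story.",
                         properties={'annotators': 'sentiment', 'outputFormat': 'json'})
for sentence in demo['sentences']:
    print(sentence['sentimentValue'], sentence['sentimentDistribution'])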
In [33]:
# java -mx1g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 25000
stanford_scores = []
for s in df.title_summary:
    s = re.sub(r'[\[\]]+', '', s)
    result = stanford.annotate(s,
                               properties={
                                   'annotators': 'sentiment',
                                   'outputFormat': 'json',
                                   'timeout': '10000'
                               })
    # average the per-sentence scores so that each story contributes exactly one entry
    values = [int(r['sentimentValue']) for r in result['sentences']]
    dists = [r['sentimentDistribution'] for r in result['sentences']]
    stanford_scores.append((np.mean(values), np.mean(dists, axis=0)))
In [34]:
feature_data['stanford_score'] = [int(round(g[0])) for g in stanford_scores]  # round mean sentence score back to an integer label
feature_data['stanford_s1'] = [g[1][0] for g in stanford_scores]
feature_data['stanford_s2'] = [g[1][1] for g in stanford_scores]
feature_data['stanford_s3'] = [g[1][2] for g in stanford_scores]
feature_data['stanford_s4'] = [g[1][3] for g in stanford_scores]
feature_data['stanford_s5'] = [g[1][4] for g in stanford_scores]
In [35]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="stanford_s5", data=feature_data)
plt.xticks(rotation=90)
plt.ylim(0, 0.25)
plt.ylabel('')
plt.xlabel('')
plt.show()
For the positive end of the sentiment distribution, we can see humor scores the highest.
In [36]:
fig, ax = plt.subplots(figsize=(16, 3))
ax = sns.boxplot(x="main_genre", y="stanford_s1", data=feature_data)
plt.xticks(rotation=90)
plt.ylim(0, 0.25)
plt.ylabel('')
plt.xlabel('')
plt.show()
Meanwhile, crime is the highest for the negative end of the distribution.
Emotion analysis is an extension of sentiment analysis that replaces the positive-negative spectrum with the full range of emotions, such as joy, sadness, or anger.
Unlike for sentiment analysis, we were unable to find any off-the-shelf libraries for emotion analysis, so we will construct our own emotion metrics. To do this, we will use the NRC emotion lexicons: EmoLex, the Affect Intensity Lexicon, the VAD Lexicon, and the Hashtag Emotion Lexicon.
EmoLex, or the NRC Word-Emotion Association Lexicon, has annotations for eight emotions across 14,182 unigrams. It uses a binary coding in which 1 means a word is associated with an emotion and 0 means it is not. The lexicon was constructed manually through crowdsourcing.
Below, we have printed 10 random rows:
In [37]:
random_indices = list(np.random.choice(emolex.loc[emolex.is_emotion==1].index, 10))
emolex.iloc[random_indices]
Out[37]:
In [38]:
print(base_emotions)
The different emotions are shown above.
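As an illustration, a single word's full emotion vector can be recovered by pivoting its rows (here we pick one word and assume it appears in the lexicon):
# binary emotion vector for a single (illustrative) lexicon word
emolex.loc[emolex.term == 'abandoned'].pivot(index='term', columns='emotion', values='is_emotion')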
Now, for each text, we will construct a simple count for each emotion.
In [39]:
token_text = [simple_tokenize(t) for t in df.title_summary]
In [40]:
for emo in base_emotions:
emo_terms = set(emolex.loc[(emolex.emotion == emo) & (emolex.is_emotion == 1)].term)
emo_results = []
for summary in token_text:
emo_results.append(np.sum([w in emo_terms for w in summary]))
feature_data['nrc_' + emo] = emo_results
Below, we will show the normalized aggregated counts for each genre:
In [41]:
plot_emotions('nrc_', base_emotions)
The results align remarkably closely with our intuition! For example, horror scores highly in both disgust and fear, whereas tragedy scores highest in sadness.
The Affect Intensity Lexicon is a list of English words and their associations with four basic emotions (anger, fear, sadness, joy). Unlike the previous lexicon, here each word carries a real-valued intensity score for an emotion rather than a binary flag. Like the previous lexicon, it was constructed manually, using best-worst scaling.
The lexicon contains approximately 6,000 words.
In [42]:
random_indices = list(np.random.choice(range(intlex.shape[0]), 5))
intlex.iloc[random_indices]
Out[42]:
In [43]:
print(intensity_emotions)
In [44]:
for emo in intensity_emotions:
emo_terms = intlex.loc[intlex.emotion == emo].term.values
emo_values = intlex.loc[intlex.emotion == emo].score.values
emo_results = []
for summary in token_text:
emo_results.append(np.sum(emo_values[np.isin(emo_terms, summary)]))
feature_data['nrc_int_' + emo] = emo_results
As before, we will examine the emotions across genres.
In [45]:
plot_emotions('nrc_int_', intensity_emotions)
Sci-fi has the most fear, tragedy has the most sadness, crime has the most anger, and spiritual has the most joy.
The VAD lexicon is based on influential factor analysis studies, which have shown that the primary dimensions of word meaning are valence, arousal, and dominance.
The lexicon contains more than 20,000 words with valence, arousal, and dominance scores ranging from 0 to 1; it was created through manual annotation with scaling.
In [46]:
random_indices = list(np.random.choice(range(vadlex.shape[0]), 5))
vadlex.iloc[random_indices]
Out[46]:
In [47]:
print(vad_emotions)
In [48]:
val_results = []
aro_results = []
dom_results = []
for summary in token_text:
has_terms = np.isin(vadlex.term.values, summary)
val_results.append(np.sum(vadlex['valence'][has_terms]))
aro_results.append(np.sum(vadlex['arousal'][has_terms]))
dom_results.append(np.sum(vadlex['dominance'][has_terms]))
feature_data['nrc_vad_valence'] = val_results
feature_data['nrc_vad_arousal'] = aro_results
feature_data['nrc_vad_dominance'] = dom_results
In [49]:
plot_emotions('nrc_vad_', vad_emotions)
Spiritual has the highest valence, horror the highest arousal, and sci-fi the highest dominance.
The Hashtag Emotion Lexicon (built from the Twitter Emotion Corpus) is an automatically generated lexicon derived from tweets: a word's score is determined by its tendency to co-occur with emotion-word hashtags.
The advantage of this lexicon is that it is much larger than manually created lexicons and covers slang and internet phrases.
In [50]:
random_indices = list(np.random.choice(range(hashlex.shape[0]), 5))
hashlex.iloc[random_indices]
Out[50]:
In [51]:
print(hash_emotions)
In [52]:
for emo in hash_emotions:
emo_terms = hashlex.loc[hashlex.emotion == emo].hashtag.values
emo_values = hashlex.loc[hashlex.emotion == emo].score.values
emo_results = []
for summary in token_text:
emo_results.append(np.sum(emo_values[np.isin(emo_terms, summary)]))
feature_data['nrc_hash_' + emo] = emo_results
In [53]:
plot_emotions('nrc_hash_', hash_emotions)
While the results overlap substantially with those from the manually built lexicons, there are dissimilarities too. For example, spiritual rates very highly in disgust here, whereas the reverse is true in the EmoLex results.
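To quantify the overlap, we can correlate the EmoLex counts with the hashtag-lexicon scores for the emotions the two lexicons share (a quick sketch using the features built above):
# correlation between EmoLex counts and hashtag-lexicon scores for shared emotions
for emo in sorted(set(base_emotions) & set(hash_emotions)):
    print(emo, round(feature_data['nrc_' + emo].corr(feature_data['nrc_hash_' + emo]), 3))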
In our final section, we will try to predict genre based on the sentiments and emotions extracted from our text.
In [261]:
# sentiment variables
sia_vars = ['sia_pos', 'sia_neg', 'sia_neu', 'sia_compound']
blob_vars = ['blob_pos', 'blob_pol', 'blob_sub']
stanford_vars = ['stanford_s1', 'stanford_s2', 'stanford_s4', 'stanford_s5']
# emotion variables
base_emotion_vars = ['nrc_' + e for e in base_emotions]
intensity_emotion_vars = ['nrc_int_' + e for e in intensity_emotions]
hash_emotion_vars = ['nrc_hash_' + e for e in hash_emotions]
vad_emotion_vars = ['nrc_vad_' + e for e in vad_emotions]
sentiment_vars = sia_vars + blob_vars + stanford_vars
emotion_vars = base_emotion_vars + intensity_emotion_vars + hash_emotion_vars + vad_emotion_vars
In [433]:
classification_data = pd.DataFrame()
classification_data = feature_data[sentiment_vars + emotion_vars].copy()
#classification_data['storyid'] = df['storyid'].astype('int')
#classification_data['words'] = df['words'].astype('int')
#classification_data[['LSA' + str(i) for i in range(50)]] = feature_data[['LSA' + str(i) for i in range(50)]]
#classification_data[['D2V' + str(i) for i in range(20)]] = feature_data[['D2V' + str(i) for i in range(20)]]
In [407]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [408]:
clf = LogisticRegression(random_state=0, solver='newton-cg', max_iter=500,
multi_class='multinomial').fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [409]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
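For context, given the genre imbalance noted earlier, it helps to compare this against the majority-class baseline (a quick sketch):
# accuracy from always predicting the most common genre in the test split
print('Majority-class baseline: ', np.max(np.bincount(y_test)) / len(y_test))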
In [410]:
plot_ROC(y_test_bin, y_pred)
In [411]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [412]:
clf = OneVsRestClassifier(svm.SVC(kernel='rbf', probability=True, gamma='scale', random_state=0))
y_pred = clf.fit(X_train, y_train_bin).decision_function(X_test)
In [413]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print(accuracy_score(y_test, best_preds))
In [414]:
plot_ROC(y_test_bin, y_pred)
In [415]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [416]:
clf = KNeighborsClassifier(n_neighbors=n_classes).fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [417]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
In [418]:
plot_ROC(y_test_bin, y_pred)
In [419]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [420]:
clf = GaussianNB().fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [421]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
In [422]:
plot_ROC(y_test_bin, y_pred)
In [423]:
X = scale(classification_data.values)
Y = feature_data['main_genre'].factorize()[0]
n_classes = len(set(Y))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [424]:
clf = MLPClassifier(max_iter=2000).fit(X_train, y_train)
y_pred = clf.predict_proba(X_test)
In [425]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print('Accuracy: ', accuracy_score(y_test, best_preds))
In [426]:
plot_ROC(y_test_bin, y_pred)
In [449]:
X = classification_data
Y = feature_data['main_genre'].factorize()[0]
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.35, random_state=0)
y_train_bin = label_binarize(y_train, classes=list(set(Y)))
y_test_bin = label_binarize(y_test, classes=list(set(Y)))
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
In [450]:
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)
In [489]:
param = {
'booster': 'gbtree',
    'nthread': 8,
'max_depth': 3,
'eta': 0.1,
'silent': 1,
'reg_alpha': 10,
'reg_lambda': 10,
'gamma': 0,
    'objective': 'multi:softprob',            # multiclass objective that outputs class probabilities
    'num_class': len(np.unique(y_train))}     # the number of classes in this dataset
# note: early stopping would require passing an eval set and early_stopping_rounds to xgb.train itself
num_round = 500 # the number of training iterations
bst = xgb.train(param, dtrain, num_round)
y_pred = bst.predict(dtest)
In [490]:
best_preds = np.asarray([np.argmax(line) for line in y_pred])
print(accuracy_score(y_test, best_preds))
In [491]:
plot_ROC(y_test_bin, y_pred)
In [485]:
fig, ax = plt.subplots(figsize=(16, 6))
xgb.plot_importance(bst, ax=ax, max_num_features = 15)
plt.show()