D. Kuster
July 2015
A few months ago, a great video of Kurt Vonnegut made the rounds on the web. In it, he describes an idea for plotting the simple shapes of stories as good vs. ill fortune across the length of the story. He says:
"There's no reason why these simple shapes of stories can't be fed into computers"
by which he probably meant:
"Once you have a sequence of numerical scores, it is easy to draw these lines using computers"
[Video: Kurt Vonnegut on the Shapes of Stories, via YouTube]
...but how to get the scores? When I watched the video, I just happened to be developing natural language processing tools for analyzing text. Swimming in that context, it was very natural to wonder whether a machine could score the fortunes of a story automatically.
The problem (of plotting the shapes of stories) is similar to sentiment analysis on the narrator's viewpoint. But it was not clear whether existing sentiment models were good enough to pick up on this signal across the length of a long story. Specifically, language is full of subtle contextual cues that can interact with each other. Sentiment models can struggle to balance the interactions among many positive vs. negative terms and to return stable predictions over long-range context windows. And even once you have the algorithmic bits worked out, validating the result can be tricky when human interpretations differ.
This notebook implements a series of experiments using indico's machine learning APIs to quickly test hypotheses. If you simply want to consume the story, keep on scrolling. If you want to tinker, maybe do some experiments of your own...this post was written as a Jupyter notebook for exactly that reason. Grab the notebook here on Github. And when you discover something interesting, please let us know; we love this stuff.
Using two hacks and a pre-trained sentiment model (multinomial logistic regression over n-grams with TF-IDF features), we can score the long-range sentiment of stories, books, and movie scripts. The models do a reasonable job of summarizing the "shapes of stories" directly from text. The method can be easily extended to search across databases for stories with a plot shape similar to a query story.
If you haven't watched the Vonnegut video yet, "listen and learn" is always a good first step. Few lectures are as short and sweet as this one!
Vonnegut gave us a method to describe the shapes of stories by plotting the protagonist's current fortune (good vs. ill) from the beginning to the end of the story. His method requires a human to read each story and interpret meaning through their own personal context and understanding of stories in general. We're going to automate the scoring method using Python and machine learning models for sentiment analysis.
To have any hope of automating a solution, we need clear, specific, solvable technical requirements. Then we'll write code to implement the functionality that satisfies each requirement.
The first step to solving a problem is to define it clearly in terms of desired outcomes and technical requirements. Let's do that next. We'll periodically update our progress by incrementing the numbers inside the brackets to indicate percent completion of each requirement.
The method must be:
- fully automated: emit sentiment scores while scanning a text from beginning to end [0%]
- fast and robust: score many samples across a long story [0%]
- comparable: measure the similarity of plot shapes, even across stories of unequal length [0%]
We start with a very simple experiment. Then we'll evaluate our technical requirements and prioritize how we want to extend and add features. There are (at least) three good reasons to start with a simple experiment: it validates the whole pipeline end-to-end before we invest in complexity, it establishes a baseline for later comparison, and it fails fast (and cheap) if the approach is hopeless.
In [1]:
import sys
import os
import pandas as pd # dataframes to store text samples + scores
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn # for more appealing plots
seaborn.set_style("darkgrid")
# Pretty printing
import pprint
pp = pprint.PrettyPrinter(indent=4)
# indico API
import indicoio # https://pypi.python.org/pypi/IndicoIo; install using `pip install -U indicoio`
In [2]:
# Is an indico API key defined?
if not indicoio.config.api_key:
    sys.exit("Unable to find an indico API key, please add one to your python environment. Aborting.")
Note: If this is your first time using an indico API, register for a free API key at indico.io. There are several ways to make your key available to Python; a good option is to define it in an environment variable. On Linux or OS X, you can simply add a line to ~/.bash_profile. For example: export INDICO_API_KEY=<your api key>
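Alternatively, you can set the key at the top of the notebook. A minimal sketch, reading the key from the environment so it never appears in the notebook itself (the check in the cell above reads this same attribute):

import os
import indicoio

# assign the key directly to the client's config
indicoio.config.api_key = os.getenv("INDICO_API_KEY")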
Link to full transcription of The Lion King movie, transcribed by fans
In [3]:
# Sample text from the beginning of `The Lion King` fan-transcribed script
input_text = """Rafiki puts the juice and sand he collects on
Simba's brow---a ceremonial crown. He then picks
Simba up and ascends to the point of Pride Rock.
Mufasa and Sarabi follow. With a crescendo in the
music and a restatement of the refrain, Rafiki
holds Simba up for the crowd to view."""
As a father of young children and an older brother who frequently pulled babysitting duty, I've watched The Lion King a gazillion times. So I can say that input_text
is reservedly positive, and it occurs at the beginning of The Lion King movie, when other positive things are happening in the movie (in general). As an English speaker, I don't read anything in the phrase as contradictory or particularly polarizing. The phrase "crescendo in the music" is a cue that something important (positive?) is happening, and life experience suggests the presentation of a new prince is a reservedly positive thing. Polite, modest smiles all around. If a sentiment score of 0.0 is totally negative, 1.0 is very positive, and 0.5 is neutral, I'd expect a sentiment score for the input text to be: 0.6 < score < 0.9.
You should make your own estimate too...do it now, before you see the predictions!
In [4]:
score = indicoio.batch_sentiment([input_text]) # make an API call to get sentiment score for the input text
print(score)
Is the score reasonable? Yes, the prediction agrees with my own estimate.
We only ran a single test (n=1), so we cannot tell the difference between a reproducible method and a happy accident. Let's apply what we learned with this simple experiment to the technical requirements:
The sentiment analysis API abstracts the machine learning process into an API call, so you can use a pre-trained model on any input text. Send a list of strings, get back a list of scores. It is optimized to be fast, robust, and "accurate enough" (90% accuracy on IMDB), so it is well suited to the task at hand, where we want to score many samples across a story.
Behind the scenes, huge corpora of labeled text data were used to train a multinomial logistic regression model to differentiate between text labeled as positive vs. text labeled as negative. When you send a list of strings to the API, each string is tokenized into n-grams to compute TF-IDF (term frequency-inverse document frequency) features. Then the model uses the pre-trained features to predict the positivity of the input text and returns a score. The models are deployed on a robust, load-balanced, distributed architecture, so multiple users can spray multiple requests and reliably get results.
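To make that concrete, here is a minimal sketch of a comparable model built with scikit-learn. This is an illustration of the technique, not indico's actual implementation, and the two training texts are made up:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny made-up labeled corpus: 1 = positive, 0 = negative
train_texts = ["what a wonderful, happy ending", "a bleak and miserable tale"]
train_labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigram + bigram TF-IDF features
    LogisticRegression(),                 # logistic regression over those features
)
model.fit(train_texts, train_labels)

# column 1 of predict_proba is the probability of the positive class,
# i.e., a sentiment score between 0.0 and 1.0
print(model.predict_proba(["a crescendo in the music"])[:, 1])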
Why use a pre-trained model behind an API? Because it abstracts away a whole mess of data collection, cleaning, feature engineering, pre-training, regularization, validation, deployment, and performance/scaling work. Using the API above, you can get sentiment scores in one line. For cases where you need to tweak things to get the best accuracy possible, or need to train on specific data for a specific purpose or domain, you should probably build your own custom model. But even then, pre-trained models can be great for running quick experiments to explore your data and quickly discover whether it is worth investing in a custom model.
Also, we haven't yet satisfied the main goal: draw the shape of a story using an automated method that emits scores as it scans a text from beginning to end. Technically, we could cut a story into chunks and paste each chunk in as input_text
. But that would be silly; the whole point is to automate the method so we can compare to Vonnegut's shapes of stories. Let's add functionality to automatically generate samples from a text, retrieve sentiment scores for each sample, and plot the results.
To extend functionality to scan across the text and emit a sequence of scores, we'll need to upgrade our code to use data structures that can contain sequences (rather than just single values).
We also need a strategy for generating samples from a chunk of input text. A naive strategy is to slice the input text at regularly spaced intervals. Although it might seem appealing for its simplicity, slicing introduces discontinuities in the context of the story, and this will cause problems with sentiment analysis. In other words, how should we choose where to make cuts so that we don't destroy information at the boundary between chunks? Sentiment analysis can be tricky, and we want to preserve all the information we can.
One useful solution to eliminate the slice boundaries is to sample from a sliding context window. This will require us to evaluate a greater number of samples, but the sentiment analysis model is fast, so no worries!
In [5]:
def sample_window(seq, window_size=10, stride=1):
    """
    Generator slides a window across the input sequence
    and returns samples; window size and stride define
    the context window
    """
    for pos in range(0, len(seq), stride):
        yield seq[pos : pos + window_size]
In [6]:
def merge(seq, stride=4):
    """
    Generator strides across the input sequence,
    combining the elements between each stride.
    """
    for pos in range(0, len(seq), stride):
        yield seq[pos : pos + stride]
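As a quick sanity check, here is what the two generators produce on a toy sequence (these examples aren't part of the pipeline below, but they show the overlap created by the sliding window):

seq = list(range(10))
print(list(merge(seq, 4)))             # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(sample_window(seq, 3, 2)))  # [[0, 1, 2], [2, 3, 4], [4, 5, 6], [6, 7, 8], [8, 9]]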
In [7]:
d = {}  # dictionary to store results (regardless of story lengths)

# Parse text
words = input_text.split()  # simplest tokenization method

# Merge words into chunks
### Note: this defines the granularity of context for sentiment analysis,
### might be worth experimenting with other sampling strategies!
merged_words = [" ".join(w) for w in merge(words, 5)]

# Sample a sliding window of context
delim = " "
samples = [delim.join(s) for s in sample_window(merged_words, 3, 1)]
pp.pprint(samples)  # comment this line out for big input!
d['samples'] = samples

# Score sentiment using indico API
print("\n Submitting %i samples..." % (len(samples)))
scores = indicoio.batch_sentiment(samples)
d['scores'] = scores
print(" ...done!")
In [8]:
# Load the dict into a dataframe; overkill for now but useful below
df = pd.DataFrame()
for k, v in d.items():
    df[k] = pd.Series(v)  # keys -> columns, values -> rows
df.plot(figsize = (16,8))
df # display the table of values
Out[8]:
Importantly, we satisfied the main goal of drawing the shape of a story using a fully-automated method that emits scores as it scans a text from beginning to end. We can compare these to Vonnegut's shapes of stories! Now let's improve the accuracy. We'll do it by finding more robust data, making our tests more rigorous, and validating the results in an experiment where human interpretation is relatively easy and transparent.
Now that we have a pipeline for drawing the shapes of stories, we need a better test. The first step is to find data.
To validate Vonnegut's hypothesis, I initially wanted to score the same stories he described. But I've only read Hamlet once, and that was more than enough. Vonnegut's stories may be archetypal fiction plots, but for me it was very hard to validate performance when I couldn't remember the context and sequence of events from those stories. Wait a second...he mentioned the Cinderella story, everyone knows that one, right?
I searched the web for a canonical version of Cinderella, but quickly discovered that the myth has dozens of variations. Having been exposed to many versions since childhood, I found it impossible to attribute my interpretation of the Cinderella story to any single context or version. It turns out academics hypothesize about how each version of Cinderella reflects the culture from whence it came...what a mess! This is the opposite of a good test for validating our plotlines. We want authoritative versions.
Finally, thinking "what is the most popular version of Cinderella?"...I definitely remember watching Disney's version of Cinderella! Were movie scripts a better test than written stories?
It turns out that movies have a number of useful constraints for the task at hand. Written stories are typically consumed in many sittings, many contexts, over many hours, but movies are:
- consumed in a single sitting, usually in a couple of hours or less
- experienced linearly, at a pace fixed by the film itself
- widely shared, so many people know the same authoritative version
Unfortunately, I couldn't find a good script of Disney's 1950 version of Cinderella freely available on the web. However, fans have transcribed many other movies, including The Lion King, Aladdin, The Little Mermaid, Sleeping Beauty, and more:
Now that we have multiple texts, we need to abstract the simple code above to iterate across a corpus of text files. Dataframes are a good data structure for storing and manipulating the result here. We'll also need to add some cleaning/munging code since movie scripts from the internet can be messy.
In [9]:
# define your corpus here as a list of text files
corpus = ["aladdin.txt",
"lionking.txt",
"mulan.txt",
"hunchback.txt",
"rescuersdownunder.txt",
"sleepingbeauty.txt",
"littlemermaid.txt"]
In [10]:
# New dict to hold data
d = {}

# Map names to input files on filesystem
root_fp = os.getcwd()
corpus_fp = os.path.join(root_fp, "texts")  # put your text files in ./texts
# print("Looking for input text files: '%s'" % corpus_fp)

for t in corpus:
    fp = os.path.join(corpus_fp, t)
    print(" Reading '%s'" % t)
    with open(fp, 'r') as f:
        text_name = t.split(".")[0]  # strip .txt file extensions
        sample_col = text_name + "_sample"
        score_col = text_name + "_sentiment"
        lines = []  # list to receive cleaned lines of text
        # Quick text cleaning and transformations
        for line in f:
            if not line.strip():  # there are many blank lines in movie scripts, ignore them
                continue
            line = line.replace("\n", " ").lower().strip().strip('*')  # chain any other text transformations here
            lines.append(line)
    print(" %i lines read from '%s' with size: %5.2f kb" % (len(lines), t, sum(map(len, lines)) / 1024.))

    # Construct a big string of clean text
    text = " ".join(lines)

    # split on sentences (period + space)
    delim = ". "
    sentences = [_ + delim for _ in text.split(delim)]  # regexes are the more robust (but less readable) way to do this...
    merged_sentences = [delim.join(s) for s in merge(sentences, 10)]  # merge sentences into chunks

    # split on words (whitespace)
    words = text.split()
    merged_words = [" ".join(w) for w in merge(words, 120)]  # merge words into chunks

    # Generate samples by sliding context window
    delim = " "
    samples = [delim.join(s) for s in sample_window(merged_words, 10, 1)]
    d[sample_col] = samples

    print(" submitting %i samples for '%s'" % (len(samples), text_name))
    # API call to get sentiment scores
    scores = indicoio.batch_sentiment(samples)
    d[score_col] = scores

print("\n...complete!")
In [11]:
df = pd.DataFrame()
for k, v in sorted(d.items()):  # sort to ensure the dataframe is defined by the longest sequence, which happens to be Aladdin
    df[k] = pd.Series(v)  # keys -> columns, values -> rows
In [12]:
print(len(df))
df.head(5) # inspect the first 5 rows...looks OK?
Out[12]:
In [13]:
# inspect the last 5 rows;
# since sequences are of unequal length, there should be a bunch of NaN's
# at the end for all but the longest sequence
df.tail(5)
Out[13]:
In [14]:
ax = df['lionking_sentiment'].plot(colormap = 'jet', figsize=(16,8))
ax.set_xlabel("sample")
ax.set_ylabel("sentiment_score")
ax.set_title("Lion King")
Out[14]:
In [15]:
# Pick out a few stories to compare visually
combo = pd.DataFrame()
combo['lionking_sentiment'] = df['lionking_sentiment']
combo['aladdin_sentiment'] = df['aladdin_sentiment']
# combo['littlemermaid_sentiment'] = df['littlemermaid_sentiment']
ax2 = combo.plot(colormap='jet', figsize = (16,8)) # ignore mismatched sequence lengths
ax2.set_xlabel("sample")
ax2.set_ylabel("sentiment_score")
Out[15]:
Looks like The Lion King and Aladdin have very similar plot lines, from the sequencing (song, then something happens) to the big hump of hero happiness followed by the valley of gloom. Are we detecting some kind of Disney formula for movies/stories? It's hard to compare these visually when the plotlines play out over different lengths. We'll need a better way to compare: some kind of similarity metric that is robust to unequal lengths.
Note: due to the way we're sampling context with a sliding window, there is a tendency for sentiment scores to return towards neutral at the end of the story as the window runs off the end and starts losing context.
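One simple mitigation, sketched here but not applied above, is to emit only full-width windows so the trailing samples never shrink:

def sample_window_full(seq, window_size=10, stride=1):
    """
    Like sample_window, but stops before the window would run off the
    end of the sequence, so every sample carries full context.
    """
    for pos in range(0, len(seq) - window_size + 1, stride):
        yield seq[pos : pos + window_size]

The tradeoff is that the story's final moments get fewer samples, so for now we'll keep the simpler version and just read the tail of each plot with some skepticism.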
In [16]:
# Pull out a single story here to test smoothing methods
df_sentiment = df["lionking_sentiment"].dropna()
In [17]:
df_roll = df_sentiment.rolling(10).mean() # moving average over a 10-sample window
ax = df_roll.plot(colormap = 'jet', figsize = (16, 8))
ax.set_xlabel("sample")
ax.set_ylabel("sentiment_score")
ax.set_title("Lion King, smoothed with rolling mean")
ax.set_xlim((0, 110))
Out[17]:
The moving-average smoothing isn't bad, and we're getting pretty close to Vonnegut's simple shapes! But since we used a sliding window to resample the text for sentiment analysis, another sliding-window method is probably not the best choice here, as it could falsely convey stronger confidence or stability than the underlying scores justify. We've also traded sensitivity for smoothness. Sentiment tends to be sensitive to the balance of positive vs. negative weights, so noise is probably a useful quantity to track, especially since we don't yet know how it varies across stories. Also, the bigger kernels take a while to accumulate samples, eating into the beginning of a story where interesting stuff can happen. Another method might be a better choice, one that doesn't consume data to build up statistical moments. Let's give Lowess smoothing a try and compare to the raw scores.
In [18]:
import statsmodels.api as sm
lowess = sm.nonparametric.lowess(df_sentiment, df_sentiment.index, frac = 0.05)
fig = plt.gcf()
plt.plot(df_sentiment.index, df_sentiment, '.') # plot the values as dots
plt.plot(lowess[:, 0], lowess[:, 1]) # plot the smoothed output as solid line
fig.set_size_inches(16,8)
plt.xlabel('sample')
plt.ylabel('sentiment_score')
plt.title('Lion King, smoothed with Lowess')
plt.xlim((0, 110))
Out[18]:
The method of dynamic time warping is great for comparing sequences that have arbitrary insertions between otherwise similar data. It can also solve our problem of comparing sequences of unequal lengths. Intuitively, DTW seems like a good fit for our problem. Let's test it.
Image from: Rakthanmanon et al. "Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping", Figure 3.
Read the linked papers for details, but the gist is that dynamic time warping gives us a way to map one sequence onto another, using dynamic programming and distance measures between elements in each sequence. The best mapping from one sequence to another is the path that minimizes the pairwise distances. Lower distances indicate sequences with higher similarity.
If we put one sequence (e.g., "lionking_sentiment") on the x-axis and another sequence (e.g., "aladdin_sentiment") on the y-axis, then the diagonal path from lower-left to upper-right illustrates the transformation that best maps the x sequence onto the y sequence. For two identical sequences, the path would be a perfect diagonal. For different sequences, the path reveals where each sequence is "warped" to accommodate the other.
For the knowledge hounds, here's a link to the original paper that introduced DTW and Rakthanmanon et al.
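For intuition, here is a minimal NumPy sketch of the DTW recurrence described above. The dtw package used in the next cell does the same dynamic programming and additionally backtracks the optimal path:

import numpy as np

def dtw_distance(x, y):
    """Accumulated-cost DTW distance between two 1-D sequences."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)  # accumulated cost matrix
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # pairwise distance between elements
            # extend whichever of the three allowed moves is cheapest
            acc[i, j] = cost + min(acc[i - 1, j],      # insertion
                                   acc[i, j - 1],      # deletion
                                   acc[i - 1, j - 1])  # match
    return acc[n, m]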
In [19]:
import dtw # `pip install dtw`
lionking = df['lionking_sentiment'].dropna()
aladdin = df['aladdin_sentiment'].dropna()
print(len(lionking), len(aladdin))
dist, cost, path = dtw.dtw(lionking, aladdin) # compute the best DTW mapping
print("Minimum distance found: %-8.4f" % dist)
In [20]:
from matplotlib import cm # custom colormaps
from matplotlib.pyplot import imshow
imshow(cost.T, origin = 'lower', cmap = cm.hot, interpolation = 'nearest')
plt.plot(path[0], path[1], 'w') # white line shows the best path
plt.xlim((-0.5, cost.shape[0]-0.5))
plt.ylim((-0.5, cost.shape[1]-0.5))
plt.xlabel("lion king")
plt.ylabel("aladdin")
Out[20]:
The Lion King and Aladdin have very similar plots! Hardly any warping is required to map one story sequence onto the other.
What about another pair of stories?
In [21]:
mermaid = df['littlemermaid_sentiment'].dropna()
print(len(lionking), len(mermaid))
dist, cost, path = dtw.dtw(lionking, mermaid)
print("Minimum distance found: %-8.4f" % dist)
In [22]:
from matplotlib import cm # import custom colormaps
from matplotlib.pyplot import imshow
imshow(cost.T, origin = 'lower', cmap = cm.hot, interpolation = 'nearest')
plt.plot(path[0], path[1], 'w') # white line for the best path
plt.xlim((-0.5, cost.shape[0]-0.5))
plt.ylim((-0.5, cost.shape[1]-0.5))
plt.xlabel("lion king")
plt.ylabel("little mermaid")
Out[22]:
The Lion King and The Little Mermaid appear to have similar plotlines, but there are gaps where things happen in The Lion King with no corresponding features in The Little Mermaid. This difference in story pacing could be because The Lion King's characters are thoroughly anthropomorphized and speak many lines throughout the movie, whereas the characters in The Little Mermaid tend to tell the story through action and visuals---the protagonist loses her voice! Or it could be a difference in transcript length or quality showing through...something to investigate more deeply. We can see from the DTW path that the plotlines differ for the first part of the movie, but the last half is very similar.
Using the DTW distance metric, it is straightforward to compare all the pairs of stories in our corpus. Using these distances to sort (or search) for similar (or different) stories is left as an exercise for the reader :) (though a quick sketch follows the next cell). There is probably a neat visualization to be made here, but it's beyond scope for now!
In [23]:
for i in corpus:
    for j in corpus:
        dist, cost, path = dtw.dtw(df[i.split(".")[0] + "_sentiment"].dropna(),
                                   df[j.split(".")[0] + "_sentiment"].dropna())
        print("DTW distance from %s to %s: '%-6.3f'" % (i.split(".")[0], j.split(".")[0], dist))
In [24]:
lowess_frac = 0.05 # same smoothing as above, balances detail and smoothness
lionking_lowess = sm.nonparametric.lowess(df['lionking_sentiment'], df['lionking_sentiment'].index, frac = lowess_frac)
aladdin_lowess = sm.nonparametric.lowess(df['aladdin_sentiment'], df['aladdin_sentiment'].index, frac = lowess_frac)
rescuers_lowess = sm.nonparametric.lowess(df['rescuersdownunder_sentiment'], df['rescuersdownunder_sentiment'].index, frac = lowess_frac)
hunchback_lowess = sm.nonparametric.lowess(df['hunchback_sentiment'], df['hunchback_sentiment'].index, frac = lowess_frac)
fig = plt.gcf()
plt.plot()
plt.plot(lionking_lowess[:, 0], lionking_lowess[:, 1])
plt.plot(aladdin_lowess[:, 0], aladdin_lowess[:, 1])
plt.plot(rescuers_lowess[:, 0], rescuers_lowess[:, 1])
plt.plot(hunchback_lowess[:, 0], hunchback_lowess[:, 1])
plt.xlabel('sample')
plt.ylabel('sentiment_score')
plt.title('4 similar Disney movies: [The Lion King, Aladdin, Rescuers Down Under, Hunchback of Notre Dame]')
fig.set_size_inches(16,8)
In [25]:
lowess_frac = 0.25 # heavy smoothing here to compare to Vonnegut
lionking_lowess = sm.nonparametric.lowess(df['lionking_sentiment'], df['lionking_sentiment'].index, frac = lowess_frac)
aladdin_lowess = sm.nonparametric.lowess(df['aladdin_sentiment'], df['aladdin_sentiment'].index, frac = lowess_frac)
rescuers_lowess = sm.nonparametric.lowess(df['rescuersdownunder_sentiment'], df['rescuersdownunder_sentiment'].index, frac = lowess_frac)
hunchback_lowess = sm.nonparametric.lowess(df['hunchback_sentiment'], df['hunchback_sentiment'].index, frac = lowess_frac)
fig = plt.gcf()
plt.plot()
plt.plot(lionking_lowess[:, 0], lionking_lowess[:, 1])
plt.plot(aladdin_lowess[:, 0], aladdin_lowess[:, 1])
plt.plot(rescuers_lowess[:, 0], rescuers_lowess[:, 1])
plt.plot(hunchback_lowess[:, 0], hunchback_lowess[:, 1])
plt.xlabel('sample')
plt.ylabel('sentiment_score')
plt.title('4 similar Disney movies: [The Lion King, Aladdin, Rescuers Down Under, Hunchback of Notre Dame]')
fig.set_size_inches(16,8)
When we compare many Disney movie scripts, a clear pattern emerges. Maybe this isn't intuitively surprising to you, but the fact that we discovered "the Disney formula" directly from the text of movie scripts is pretty cool!
Disney introduces characters and sets the scene in a variety of ways, but every story ends in similar fashion:
- a big hump of hero happiness
- a plunge into the valley of gloom
- a climb back out to a happy ending
Basically, the last half of each of these Disney scripts is very much like Vonnegut's "man-in-a-hole" story shape...but the beginning half can vary quite a lot.
This was fun to discover and write, and I hope you have enjoyed reading about it. If the stuff here ends up being helpful, forked, or incorporated into a cool project, please let us know! We love hearing about innovative applications!
Parallel exploration by a diverse bunch of investigators is a great way to discover cool stuff! It would be great if folks try these and share any discoveries:
- experiment with other sampling strategies (chunk sizes, window sizes, strides)
- try other smoothing methods and parameters, and see how the shapes change
- score other corpora: books, non-Disney scripts, other fan-transcribed movies
- use the DTW distances to search or cluster a larger database of stories by plot shape