In [2]:
# imports libraries
import pickle # import/export lists
import datetime # dates
import string # string parsing
import re # regular expression
import pandas as pd # dataframes
import numpy as np # numerical computation
import matplotlib.pyplot as plt # plot graphics
import seaborn as sns # graphics supplemental
from scipy.stats import chisquare # chi-squared
from nltk.corpus import stopwords # stop words
from IPython.display import display, HTML # display HTML
In [3]:
# opens cleaned data
with open('../clean_data/df_story', 'rb') as fp:
    df = pickle.load(fp)
# installs corpus
## import nltk
## nltk.download()
# creates subset of data of online stories
df_online = df.loc[df.state == 'online', ].copy()
# sets current year
cyear = datetime.datetime.now().year
# sets stop word list for text parsing
stop_word_list = stopwords.words('english')
In this section, we take a sample of ~5000 stories from fanfiction.net and break down some of their characteristics.
Let's begin by examining the current state of stories: online, deleted, or missing. Missing stories are stories whose URL has moved due to shifts in the fanfiction archiving system.
In [3]:
# examines state of stories
state = df['state'].value_counts()
# plots chart
(state/np.sum(state)).plot.bar()
plt.xticks(rotation=0)
plt.show()
Surprisingly, only ~60% of stories that were once published still remain on the site! This is in stark contrast to user profiles, of which fewer than 0.1% are deleted.
From this, we can only guess that authors actively take stories down, presumably to hide earlier works as their writing abilities improve or to replace them with rewrites. Authors who delete their profiles and stories that were deleted for fanfiction policy violations would also contribute to these figures.
Now let's examine the volume of stories published across time (that have survived on the site).
In [4]:
# examines when stories first created
df_online['pub_year'] = [int(row[2]) for row in df_online['published']]
entry = df_online['pub_year'].value_counts().sort_index()
# plots chart
(entry/np.sum(entry)).plot()
plt.xlim([np.min(entry.index.values), cyear-1])
plt.show()
We see a large jump starting in the 2010s, peaking around 2013, then a steady decline afterward. Unlike with profiles, there are no dips matching the Great Fanfiction Purges of 2002 and 2012.
The decline could be from a variety of factors. One could be competing fanfiction sites. Most notably, the nonprofit site, Archive of Our Own (AO3), started gaining traction due to its greater inclusivity of works and its tagging system that helps users to filter and search for works.
Another question to ask is if the increasing popularity of fanfiction is fueled by particular fandoms. It is well known in the fanfiction community that fandoms like Star Trek paved the road. Harry Potter and Naruto also held a dominating presence in the 2000s. Later on, we will try to quantify how much each of these fandoms contributed to the volume of fanfiction produced.
Now let's look at the distribution of genres across the stories. Note that "General" includes stories that do not have a genre label.
In [5]:
# examines top genres individually
genres_indiv = [item for sublist in df_online['genre'] for item in sublist]
genres_indiv = pd.Series(genres_indiv).value_counts()
# plots chart
(genres_indiv/np.sum(genres_indiv)).plot.bar()
plt.xticks(rotation=90)
plt.show()
Romance takes the lead! In fact, ~30% of the genre labels used are "Romance". In second and third place are Humor and Drama respectively.
The least popular genres appear to be Crime, Horror, and Mystery.
So far, nothing here deviates much from intuition. We'd expect derivative works to focus more on existing character relationships and/or the canonical world, and less on stand-alone plots and twists.
What about how the genres combine?
In [6]:
# creates contingency table
gen_pairs = df_online.loc[[len(row) > 1 for row in df_online.genre], 'genre']
gen1 = pd.Series([row[0][:3] for row in gen_pairs] + [row[1][:3] for row in gen_pairs])
gen2 = pd.Series([row[1][:3] for row in gen_pairs] + [row[0][:3] for row in gen_pairs])
cross = pd.crosstab(index=gen1, columns=gen2, colnames=[''])
cross.index.name = None # clears the index label (del on index.name fails in newer pandas)
# finds relative frequency
for col in cross.columns.values:
    cross[col] = cross[col]/np.sum(cross[col])
# plots heatmap
f, ax = plt.subplots(figsize=(6,6))
cm = sns.color_palette("Blues")
ax = sns.heatmap(cross, cmap=cm, cbar=False, robust=True,
                 square=True, linewidths=0.1, linecolor='white')
plt.show()
In terms of how genres cross, romance appears to pair with almost everything. Romance is particularly common with drama (the romantic drama) and humor (the rom-com). The only genre that shies away from romance is parody, which goes hand in hand with humor instead.
The second most crossed genre is adventure, which is often combined with fantasy, sci-fi, mystery, or suspense.
The third genre to note is angst, which is often combined with horror, poetry, or tragedy.
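To make the construction of the genre table easier to follow, here is a minimal sketch on toy data. The genre lists and the names `toy_genres`, `toy_cross`, and `toy_share` are made up for illustration; only the technique (listing each pair in both orders, then normalising each column) mirrors the cell above.

```python
import pandas as pd

# Toy genre lists standing in for df_online['genre'] (hypothetical data)
toy_genres = pd.Series([
    ['Romance', 'Drama'],
    ['Romance', 'Humor'],
    ['Humor', 'Parody'],
    ['Adventure'],          # single-genre stories are skipped
    ['Romance', 'Drama'],
])

# Keep only stories labelled with two genres
pairs = [g for g in toy_genres if len(g) > 1]

# List each pair in both orders so the resulting table is symmetric
first = pd.Series([g[0] for g in pairs] + [g[1] for g in pairs])
second = pd.Series([g[1] for g in pairs] + [g[0] for g in pairs])
toy_cross = pd.crosstab(index=first, columns=second)

# Normalise each column so it sums to 1: a column then shows how
# that genre's partners are distributed
toy_share = toy_cross / toy_cross.sum(axis=0)
print(toy_share.round(2))
```

Because each column is normalised separately, the heatmap should be read column by column: a dark cell means that the row genre is a frequent partner of the column genre, not necessarily the other way around.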
The breakdown of how stories are rated is given below.
In [7]:
# examines distribution of ratings
rated = df_online['rated'].value_counts()
rated.plot.pie(autopct='%.f', figsize=(5,5))
plt.ylabel('')
plt.show()
~40% of stories are rated T, ~40% rated K or K+, and ~20% are rated M.
Stories on the site are written predominantly in English. The next most common languages are Spanish, French, Indonesian, and Portuguese. However, these shares are all small, with Spanish making up only 5% and each of the other languages 3% or less.
In [8]:
# examines distribution of languages
top_language = df_online['language'].copy()
top_language[top_language != 'English'] = 'Non-English'
top_language = top_language.value_counts()
top_language.plot.pie(autopct='%.f', figsize=(5,5))
plt.ylabel('')
plt.show()
As of 2017, fanfiction.net has nine different media categories, plus crossovers. The breakdown of these is given below:
In [9]:
# examines distribution of media
media = df_online['media'].value_counts()
(media/np.sum(media)).plot.bar()
plt.xticks(rotation=90)
plt.show()
Anime/Manga is the most popular media category, taking up ~30% of all works. TV Shows and Books contribute ~20% each.
What about by fandom?
In [10]:
# examines distribution of fandoms
fandom = df_online['fandom'].value_counts()
(fandom[:10]/np.sum(fandom)).plot.bar()
plt.xticks(rotation=90)
plt.show()
The most popular fandom is, unsurprisingly, Harry Potter. However, it still constitutes a much smaller portion of the fanfiction base than initially assumed, at only ~10%.
One question we asked earlier is what fandoms contributed to the increases in stories over time.
In [11]:
df_online['top_fandom'] = df_online['fandom']
nottop = [row not in fandom[:10].index.values for row in df_online['fandom']]
df_online.loc[nottop, 'top_fandom'] = 'Other'
In [12]:
entry_fandom = pd.crosstab(df_online.pub_year, df_online.top_fandom)
entry_fandom = entry_fandom[np.append(fandom[:5].index.values, ['Other'])][:-1]
# plots chart
(entry_fandom/np.sum(entry)).plot.bar(stacked=True)
plt.gca().get_xaxis().set_label_text('') # gca() reuses the current axes instead of creating new ones
plt.legend(title=None, frameon=False)
plt.show()
It would appear that back in the year 2000, Harry Potter constituted nearly half of the fanfictions published. However, the overall growth in fanfiction is due to many other fandoms jumping in, with no one particular fandom holding sway.
Of the top 5 fandoms, Harry Potter and Naruto prove to be the most persistent in holding their volumes per year. Twilight saw a giant spike in popularity in 2009 and 2010 but has faded since.
Let's take a look at the distribution of word and chapter lengths.
In [13]:
# examines distribution of number of words
df_online['words1k'] = df_online['words']/1000
f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
# fill/bw_method replace the deprecated shade/bw arguments in newer seaborn
sns.kdeplot(df_online['words1k'], fill=True, bw_method=.5, legend=False, ax=ax1)
sns.kdeplot(df_online['words1k'], fill=True, bw_method=.5, legend=False, ax=ax2)
plt.xlim(0,100)
plt.show()
The bulk of stories appear to be less than 50 thousand words, with a high proportion between 0 and 20 thousand words. In other words, we have a significant proportion of short stories and novelettes, and some novellas. Novels are rarer. Finally, there are a few "epics", ranging from 200 thousand to 600 thousand words.
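If we wanted to put numbers on these length categories, one option is to bin the word counts with `pd.cut`. The sketch below uses made-up word counts, and the category boundaries (7,500 / 17,500 / 40,000 words) are an assumption based on one common convention; they are not taken from the analysis above.

```python
import pandas as pd

# Hypothetical word counts standing in for df_online['words']
toy_words = pd.Series([800, 5200, 12000, 30000, 95000, 410000])

# Assumed category boundaries in words (one common convention, not from this analysis)
bins = [0, 7500, 17500, 40000, float('inf')]
labels = ['short story', 'novelette', 'novella', 'novel']

# right=False makes each bin include its lower bound and exclude its upper bound
length_class = pd.cut(toy_words, bins=bins, labels=labels, right=False)

# Share of stories falling in each category, in category order
length_share = length_class.value_counts(normalize=True).sort_index()
print(length_share)
```

Moving the boundaries changes the shares, so any figures produced this way should be quoted together with the thresholds used.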
The number of chapters per story, unsurprisingly, follows a similarly skewed distribution.
In [14]:
# examines distribution of number of chapters
df_online['chapters'] = df_online['chapters'].fillna(1)
df_online['chapters'].plot.hist(density=True, # density replaces the removed normed argument
                                bins=np.arange(1, max(df_online.chapters)+1, 1))
plt.show()
Stories with over 20 chapters become exceedingly rare.
How often are stories completed?
In [15]:
# examines distribution of story status
status = df_online['status'].value_counts()
status.plot.pie(autopct='%.f', figsize=(5,5))
plt.show()
This is unexpected. It looks to be about an even split between complete and incomplete stories.
Let's see what types of stories are the completed ones.
In [16]:
complete = df_online.loc[df_online.status == 'Complete', 'chapters']
incomplete = df_online.loc[df_online.status == 'Incomplete', 'chapters']
plt.hist([complete, incomplete], density=True, range=[1,10])
plt.show()
Oneshots explain the large proportion of completed stories.
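The same point can be checked numerically by comparing the share of single-chapter stories within each status group. This is a minimal sketch on a toy table; the data and the `oneshot` column are hypothetical, while `status` and `chapters` follow the column names used above.

```python
import pandas as pd

# Toy table standing in for df_online (hypothetical values)
toy = pd.DataFrame({
    'status': ['Complete', 'Complete', 'Complete', 'Incomplete', 'Incomplete', 'Complete'],
    'chapters': [1, 1, 12, 3, 1, 1],
})

# Flag single-chapter stories (oneshots)
toy['oneshot'] = toy['chapters'] == 1

# The mean of a boolean column is the share of True values in each group
oneshot_share = toy.groupby('status')['oneshot'].mean()
print(oneshot_share)
```

A much higher oneshot share among complete stories than among incomplete ones would support the reading of the histogram above.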
Do authors publish more frequently on certain months or days?
In [17]:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
months = ['NA', 'January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']
In [18]:
# examines the month in which stories were first published
df_online['pub_month'] = [months[int(row[0])] for row in df_online['published']]
month = df_online['pub_month'].value_counts()
(month/np.sum(month)).plot.bar()
plt.xticks(rotation=90)
plt.axhline(y=0.0833, color='orange')
plt.show()
It appears some months are more popular than others. September, October, and November are the least popular months. Given that the majority of the user base is from the United States, and presumably consists of children and young adults, this is perhaps due to the academic calendar: school begins in the fall. Similarly, the three most popular months (December, July, and April) coincide with winter vacation, summer vacation, and spring break respectively. This is all purely speculative.
In [19]:
month_xs = chisquare(month)
One thing we can test is how likely it is that these differences are happenstance. After all, if you draw a bunch of stories at random, you might by chance get more stories published in certain months than others.
Using a chi-squared test, we found that, assuming the month really doesn't matter for publication, the probability of seeing a distribution at least as uneven as ours is ~3%.
That's fairly low. So we have some evidence against the idea that the month doesn't matter for volume of publication.
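For readers who want to see what the test is doing, here is a small worked sketch. The monthly counts are invented for illustration and are not the real data. With no expected frequencies supplied, `chisquare` compares the observed counts against equal counts in every category, which is the "month doesn't matter" assumption.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical story counts for January..December (not the real data)
toy_counts = np.array([90, 80, 85, 95, 88, 84, 100, 86, 70, 68, 72, 102])

# Null hypothesis: every month is equally likely
expected = np.full(12, toy_counts.sum() / 12)

# Test statistic by hand: sum of (observed - expected)^2 / expected
stat_by_hand = ((toy_counts - expected) ** 2 / expected).sum()

# chisquare assumes equal expected frequencies when f_exp is omitted
result = chisquare(toy_counts)

# Optional refinement (an assumption, not used above): weight each month
# by its number of days in a non-leap year
days_in_month = np.array([31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31])
expected_by_days = toy_counts.sum() * days_in_month / days_in_month.sum()
result_by_days = chisquare(toy_counts, f_exp=expected_by_days)

print(stat_by_hand, result.statistic, result.pvalue, result_by_days.pvalue)
```

The p-value is the probability, under the null hypothesis, of a test statistic at least as large as the one observed. Since months differ in length, the day-weighted version is arguably the fairer baseline, though the difference is usually small.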
In [20]:
# examines the day of the week on which stories were first published
dayofweek = [days[datetime.date(int(row[2]), int(row[0]), int(row[1])).weekday()]
             for row in df_online['published']]
dayofweek = pd.Series(dayofweek).value_counts()
(dayofweek/np.sum(dayofweek)).plot.bar()
plt.xticks(rotation=90)
plt.axhline(y=0.143, color='orange')
plt.show()
As for days of the week, publications are least likely to happen on a Friday.
In [21]:
dayofweek_xs = chisquare(dayofweek)
And how likely is it that this discrepancy between days is random? Well, the probability of seeing a distribution at least as uneven as ours, assuming that the day of the week doesn't matter for publication, is only ~0.07%.
That's only a fraction of a percent. Once again, we have evidence against the idea that the day of week doesn't matter for volume of publication.
Friday is at the end of a long week, so maybe people are eager to go out and spend time with friends? Reading and writing are more reserved for the quieter, lazier Sundays. At least that is the anecdote of this author.
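The chi-squared test only says that the days differ somewhere; it does not say which day drives the result. One way to look into that is through Pearson residuals, (observed - expected) / sqrt(expected), whose squares add up to the chi-squared statistic. The sketch below uses invented counts; `toy_days` is a hypothetical stand-in for the `dayofweek` series.

```python
import numpy as np
import pandas as pd

# Hypothetical story counts per weekday (not the real data)
toy_days = pd.Series(
    [150, 148, 152, 149, 120, 155, 160],
    index=['Monday', 'Tuesday', 'Wednesday', 'Thursday',
           'Friday', 'Saturday', 'Sunday'],
)

# Expected count per day if the day of the week does not matter
expected = toy_days.sum() / len(toy_days)

# Pearson residuals: large negative values mark under-represented days
residuals = (toy_days - expected) / np.sqrt(expected)

# The squared residuals sum to the chi-squared statistic
chi2_stat = (residuals ** 2).sum()

lowest = residuals.idxmin()
print(residuals.round(2))
print(lowest, round(chi2_stat, 2))
```

As a rough guide, residuals beyond about 2 in absolute value mark the categories that contribute most to the test statistic.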
How long are titles and summaries? Is there a systematic way authors write them? Do some words appear more often than others? Here we explore some of those questions.
Let's start by examining character and word count, respectively, for titles.
In [22]:
# examines word/character count of titles
title_cc = [len(row) for row in df_online['title']]
title_wc = [len(row.split()) for row in df_online['title']]
pd.Series(title_cc).plot.hist(density=True, bins=np.arange(0, max(title_cc), 1))
plt.show()
pd.Series(title_wc).plot.hist(density=True, bins=np.arange(0, max(title_wc), 1))
plt.show()
The two distributions are almost identical in shape. It would appear stories typically have 2-3 words in the title, or 15-20 characters.
Now let's look at summaries.
In [23]:
# examines word/character count of summaries
summary_cc = [len(row) for row in df_online['summary']]
summary_wc = [len(row.split()) for row in df_online['summary']]
pd.Series(summary_cc).plot.hist(density=True, bins=np.arange(0, max(summary_cc), 1))
plt.show()
pd.Series(summary_wc).plot.hist(density=True, bins=np.arange(0, max(summary_wc), 1))
plt.show()
Again, the shapes are similar. We can see the vestige of the original 255-character limit for summaries. Overall, it would appear summary lengths are pretty well dispersed.
Examining English stories only, let's see what the top 10 words most commonly used in titles are. This excludes stop words, such as "the", "is", or "are".
In [24]:
title_eng_wf = [row.lower().translate(str.maketrans('', '', string.punctuation)).split()
                for row in df_online.loc[df_online.language == 'English', 'title']]
title_eng_wf = [item for sublist in title_eng_wf for item in sublist]
title_eng_wf = pd.Series(title_eng_wf).value_counts()
In [25]:
print((title_eng_wf.loc[[row not in stop_word_list
                         for row in title_eng_wf.index.values]][:10]/np.sum(title_eng_wf)).to_string())
It appears that "love" is the most popular word in story titles. In fact, it accounts for about 1% of all the words! Then there are time indicator words like "time", "day", and "night". Interesting!
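Note how the share is computed: the counts exclude stop words, but the denominator is the total number of words, stop words included. The sketch below reproduces that calculation on a handful of invented titles, with a tiny hypothetical stop-word set standing in for the NLTK list.

```python
import string
from collections import Counter

# Invented titles and a tiny stand-in stop-word set (both hypothetical)
toy_titles = ["The Love of a Lifetime", "A Night to Remember",
              "Love and War", "One Last Time"]
toy_stop_words = {'the', 'of', 'a', 'to', 'and'}

# Lowercase, strip punctuation, and split into words
strip_punct = str.maketrans('', '', string.punctuation)
all_words = [w for t in toy_titles
             for w in t.lower().translate(strip_punct).split()]

# Denominator includes stop words, as in the cell above
total = len(all_words)

# Count only the non-stop words
content_counts = Counter(w for w in all_words if w not in toy_stop_words)
love_share = content_counts['love'] / total
print(content_counts.most_common(3), round(love_share, 3))
```

Dividing by the number of non-stop words instead would give larger percentages, so the choice of denominator is worth stating whenever these figures are quoted.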
What about story summaries?
In [26]:
summary_eng_wf = [row.lower().translate(str.maketrans('', '', string.punctuation)).split()
                  for row in df_online.loc[df_online.language == 'English', 'summary']]
summary_eng_wf = [item for sublist in summary_eng_wf for item in sublist]
summary_eng_wf = pd.Series(summary_eng_wf).value_counts()
In [27]:
print((summary_eng_wf.loc[[row not in stop_word_list
                           for row in summary_eng_wf.index.values]][:10]/np.sum(summary_eng_wf)).to_string())
Once again, "love" is a popular word. Also recurring are "one", "new", "life", and "story".
You also start seeing tags like "oneshot". Other tags like "first story", "read and review", and "don't like, don't read" may also be contributing to some of the other words on the list. To test this thoroughly, we would need to do more natural language processing with n-grams, which we will reserve for another exercise.
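As a taste of what that n-gram exercise could look like, here is a minimal bigram count using only the standard library. The summaries are invented, and the helper `bigrams` is a hypothetical name introduced for this sketch; it is not part of the analysis above.

```python
import string
from collections import Counter

# Invented summaries standing in for df_online['summary']
toy_summaries = [
    "Oneshot. Please read and review!",
    "My first story, so read and review.",
    "Don't like, don't read.",
]

def bigrams(text):
    """Lowercase, strip punctuation, and return adjacent word pairs."""
    words = text.lower().translate(str.maketrans('', '', string.punctuation)).split()
    return list(zip(words, words[1:]))

# Count every adjacent word pair across all summaries
bigram_counts = Counter(pair for s in toy_summaries for pair in bigrams(s))
print(bigram_counts.most_common(3))
```

Phrases like "read and review" would then show up as frequent consecutive pairs rather than as scattered single words.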
Here is a summary of what we have discovered thus far:
And finally, for the heck of it, introducing our new fanfiction, about fanfiction:
In [28]:
title = 'New Love'
summary = '''Remember the first time in your life that you wrote a story based off some character you really liked?
Thinking no one would find your little oneshot? Well, guess who now knows about it... AU.'''
attributes = 'Not Harry Potter - Rated: T - English - Romance/Humor - Words: 8,290 - Published: Dec 22, 2013'
display(HTML('<a href>' + title + '</a><br>' + summary + '<br><span style="color:grey">' + attributes + '</span>'))
Up next, we will take a look at profiles and the users on the site.