In [1]:

    
# imports libraries
import pickle										# import/export lists
import math											# mathematical functions
import datetime										# dates
import re 											# regular expression
import pandas as pd									# dataframes
import numpy as np									# numerical computation
import matplotlib.pyplot as plt						# plot graphics
import seaborn as sns								# graphics supplemental
import statsmodels.formula.api as smf				# statistical models
from statsmodels.stats.outliers_influence import (
    variance_inflation_factor as vif)				# vif



In [2]:

    
# opens cleaned data
with open ('df_stories', 'rb') as fp:
    df = pickle.load(fp)



In [3]:

    
# creates subset of data of online stories
df_online = df.loc[df.state == 'online', ].copy()



In [4]:

    
# sets current year
cyear = datetime.datetime.now().year

Fanfiction Story Analysis

In this section, we take a sample of ~5000 stories from fanfiction.net and break down some of their characteristics. Then using all available data from the site, we try to make predictions on the number of reviews a story is expected to have given select features. These predicted values can be used as benchmarks to see whether stories are overperforming or underperforming relative to their peers.

Data exploration

Let's begin by examining the current state of stories: online, deleted, or missing. Missing stories are stories whose URL has moved due to shifts in the fanfiction achiving system.



In [5]:

    
# examines state of stories
state = df['state'].value_counts()

# plots chart
(state/np.sum(state)).plot.bar()
plt.xticks(rotation=0)
plt.show()

Surprisingly, it appears only about ~60% of stories that were once published still remain on the site! This is in stark contrast to user profiles, where less than 0.1% are deleted.

From this, we can only guess that authors actively take stories down, presumably to hide earlier works as their writing abilities improve or to replace them with rewrites. Authors who delete their profiles and stories that were deleted for fanfiction policy violations would also contribute to these figures.

Now let's examine the volume of stories published across time.



In [6]:

    
# examines when stories first created
df_online['pub_year'] = [int(row[2]) for row in df_online['published']]
entry = df_online['pub_year'].value_counts().sort_index()

# plots chart
(entry/np.sum(entry)).plot()
plt.xlim([np.min(entry.index.values), cyear-1])
plt.show()

We see a large jump starting in the 2010s, peaking around 2013, then a steady decline afterward. Unlike with profiles, you do not see the dips matching the Great Fanfiction Purge of 2002 and 2012.

The decline could be from a variety of factors. One could be competing fanfiction sites. Most notably, the nonprofit site, Archive of Our Own (AO3), started gaining traction due to its greater inclusivity of works and its tagging system that helps users to filter and search for works.

Another question to ask is if the increasing popularity of fanfiction is fueled by particular fandoms. It is well known in the fanfiction community that fandoms like Star Trek paved the road. Harry Potter and Naruto also held a dominating presence in the 2000s. Later on, we will try to quantify how much each of these fandoms contributed to the volume of fanfiction produced.

Genres

Now let's look at the distribution across the stories. Note that "General" includes stories that do not have a genre label.



In [7]:

    
# examines top genres individually
genres_indiv = [item for sublist in df_online['genre'] for item in sublist]
genres_indiv = pd.Series(genres_indiv).value_counts()

# plots chart
(genres_indiv/np.sum(genres_indiv)).plot.bar()
plt.xticks(rotation=90)
plt.show()

Romance takes the lead! In fact, ~30% of the genre labels used is "Romance". In second and third place are Humor and Drama respectively.

The least popular genres appear to be Crime, Horror, and Mystery.

So far, nothing here deviates much from intuition. We'd expect derivative works to focus more on existing character relationships and/or the canonic world, and less on stand-alone plots and twists.

What about how the genres combine?



In [50]:

    
# creates contingency table
gen_pairs = df_online.loc[[len(row) > 1 for row in df_online.genre], 'genre']
gen1 = pd.Series([row[0][:3] for row in gen_pairs] + [row[1][:3] for row in gen_pairs])
gen2 = pd.Series([row[1][:3] for row in gen_pairs] + [row[0][:3] for row in gen_pairs])
cross = pd.crosstab(index=gen1, columns=gen2, colnames=[''])
del cross.index.name

# finds relative frequency
for col in cross.columns.values:
    cross[col] = cross[col]/np.sum(cross[col])
    
# plots heatmap
f, ax = plt.subplots(figsize=(6,6))
ax = sns.heatmap(cross, cmap=cm, cbar=False, robust=True, 
                 square=True, linewidths=0.1, linecolor='white')

plt.show()

In terms of how genres cross, romance appears to pair with almost everything. Romance is particularly common with drama (the romantic drama) and humor (the rom-com). The only genre that shies away from romance is parody, which goes in hand with humor instead.

The second most crossed genre is adventure, which is often combined with fantasy, sci-fi, mystery, or suspense.

The third genre to note is angst, which is often combined with horror, poetry, or tragedy.

Ratings

The breakdown of how stories are rated are given below.



In [9]:

    
# examines state of stories
rated = df_online['rated'].value_counts()
rated.plot.pie(autopct='%.f', figsize=(5,5))

plt.show()

~40% of stories are rated T, ~40% rated K or K+, and ~20% are rated M.

Media and fandoms

As for 2017, fanfiction.net has nine different media categories, plus crossovers. The breakdown of these is given below:



In [10]:

    
# examines distribution of media
media = df_online['media'].value_counts()
(media/np.sum(media)).plot.bar()
plt.xticks(rotation=90)

plt.show()

Anime/Manga is the most popular media, taking up approximately ~30% of all works. TV Shows and Books both contribute to ~20% each.

What about by fandom?



In [11]:

    
# examines distribution of media
fandom = df_online['fandom'].value_counts()
(fandom[:10]/np.sum(fandom)).plot.bar()
plt.xticks(rotation=90)

plt.show()

The most popular fandom is, unsurprisingly, Harry Potter. However, it still consitutes a much smaller portion of the fanfiction base than initially assumed, at only ~10%.

One question we asked earlier is what fandoms contributed to the increases in stories over time.



In [12]:

    
df_online['top_fandom'] = df_online['fandom']
nottop = [row not in fandom[:10].index.values for row in df_online['fandom']]
df_online.loc[nottop, 'top_fandom'] = 'Other'



In [13]:

    
entry_fandom = pd.crosstab(df_online.pub_year, df_online.top_fandom)
entry_fandom = entry_fandom[np.append(fandom[:5].index.values, ['Other'])][:-1]

# plots chart
(entry_fandom/np.sum(entry)).plot.bar(stacked=True)
plt.axes().get_xaxis().set_label_text('')
plt.legend(title=None, frameon=False)
plt.show()

It would appear that backin the year 2000, Harry Potter constituted nearly half of the fanfictions published. However, the overall growth in fanfiction is due to many other fandoms jumping in, with no one particular fandom holding sway.

Of the top 5 fandoms, Harry Potter and Naruto prove to be the most persistent in holding their volumes per year. Twilight saw a giant spike in popularity in 2009 and 2010 but faded since.

Word count, chapter length and completion status

Let's take a look at the distribution of word and chapter lengths.



In [14]:

    
# examines distribution of number of words
df_online['words1k'] = df_online['words']/1000

f, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
sns.kdeplot(df_online['words1k'], shade=True, bw=.5, legend=False, ax=ax1)
sns.kdeplot(df_online['words1k'], shade=True, bw=.5, legend=False, ax=ax2)
plt.xlim(0,100)

plt.show()

The bulk of stories appear to be less than 50 thousand words, with a high proportion between 0-20 thousand words. In other words, we have a significant proportion of short stories and novelettes, and some novellas. Novels become more rare. Finally, there are a few "epics", ranging from 200 thousand to 600 thousand words.

The number of chapters per story, unsurprisingly, follows a similarly skewed distribution.



In [15]:

    
# examines distribution of number of chapters
df_online['chapters'] = df_online['chapters'].fillna(1)
df_online['chapters'].plot.hist(normed=True, 
                                bins=np.arange(1, max(df_online.chapters)+1, 1))

plt.show()

Stories with over 20 chapters become exceedingly rare.

How often are stories completed?



In [16]:

    
# examines distribution of story status
status = df_online['status'].value_counts()
status.plot.pie(autopct='%.f', figsize=(5,5))

plt.show()

This is unexpected. It looks to be about an even split between completed and incompleted stories.

Let's see what types of stories are the completed ones.



In [17]:

    
complete = df_online.loc[df_online.status == 'Complete', 'chapters']
incomplete = df_online.loc[df_online.status == 'Incomplete', 'chapters']

plt.hist([complete, incomplete], normed=True, range=[1,10])

plt.show()

Oneshots explain the large proportion of completed stories.

Publication timing

Do authors publish more frequently on certain months or days?



In [18]:

    
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
months = ['NA', 'January', 'February', 'March', 'April', 'May', 'June', 'July', 
          'August', 'Septemeber', 'October', 'November', 'December']



In [19]:

    
# examines when stories first created
df_online['pub_month'] = [months[int(row[0])] for row in df_online['published']]
month = df_online['pub_month'].value_counts()
(month/np.sum(month)).plot.bar()
plt.xticks(rotation=90)
plt.axhline(y=0.0833, color='red')

plt.show()

It appears some months are more popular than others. September, October, and November are the least popular months. Given that the majority of the user base is from the United States, and presumably children and young adults, this is perhaps due to the timing of the academic calendar -- school begins in the fall.

Similarly, the three most popular months (December, July, and April) coincides with winter vacation, summer vacation, and spring break respectively.



In [20]:

    
# examines when stories first created
dayofweek = [days[datetime.date(int(row[2]), int(row[0]), int(row[1])).weekday()] 
             for row in df_online['published']]
dayofweek = pd.Series(dayofweek).value_counts()
(dayofweek/np.sum(dayofweek)).plot.bar()
plt.xticks(rotation=90)
plt.axhline(y=0.143, color='red')

plt.show()

As for days of the week, publications are least likely to happen on a Friday.

Performance benchmarking and prediction

The success of a story is typically judged by the number of reviews, favorites, or followers it recieves. Here, we will try to predict how successful a story will be given select observable features, as well as develop a way to benchmark existing stories. That is, if we were given a story's features, we can determine whether that story is overperforming or underperforming relative to its peers.

Reviews, favorites, and follows

First and foremost, let us examine the distribution of each of these "success" metrics.



In [135]:

    
# examines distribution of number of words

df_online['reviews'].fillna(0).plot.hist(normed=True, 
                                         bins=np.arange(0, 50, 1), histtype='step')
df_online['favs'].fillna(0).plot.hist(normed=True, 
                                         bins=np.arange(0, 50, 1), histtype='step')
df_online['follows'].fillna(0).plot.hist(normed=True, 
                                         bins=np.arange(0, 50, 1), histtype='step')
plt.xlim(0,50)

plt.show()



In [14]:

    
df_online.columns.values









    Out[14]:





array(['storyid', 'userid', 'title', 'summary', 'media', 'fandom', 'rated',
       'language', 'genre', 'characters', 'chapters', 'words', 'reviews',
       'favs', 'follows', 'updated', 'published', 'status', 'state',
       'pub_year'], dtype=object)



In [49]:

    
df_online['ratedM'] = [row == 'M' for row in df_online['rated']]
df_online['age'] = [cyear - int(row) for row in df_online['pub_year']]
df_online['fansize'] = [fandom[row] for row in df_online['fandom']]
df_online['complete'] = [row == 'Complete' for row in df_online['status']]



In [66]:

    
# runs OLS regression
formula = 'reviews ~ chapters + words1k + ratedM + age + fansize + complete'
reg = smf.ols(data=df_online, formula=formula).fit()
print(reg.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                reviews   R-squared:                       0.190
Model:                            OLS   Adj. R-squared:                  0.188
Method:                 Least Squares   F-statistic:                     103.0
Date:                Tue, 08 Aug 2017   Prob (F-statistic):          8.65e-117
Time:                        10:14:15   Log-Likelihood:                -15926.
No. Observations:                2639   AIC:                         3.187e+04
Df Residuals:                    2632   BIC:                         3.191e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -6.6472      4.920     -1.351      0.177     -16.295       3.001
ratedM[T.True]      19.9886      5.055      3.954      0.000      10.076      29.901
complete[T.True]     3.7254      3.996      0.932      0.351      -4.111      11.562
chapters             3.2265      0.399      8.090      0.000       2.444       4.009
words1k              1.1135      0.099     11.298      0.000       0.920       1.307
age                  0.2418      0.495      0.488      0.626      -0.730       1.213
fansize              0.0387      0.022      1.794      0.073      -0.004       0.081
==============================================================================
Omnibus:                     6223.232   Durbin-Watson:                   2.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         53684044.481
Skew:                          23.087   Prob(JB):                         0.00
Kurtosis:                     700.201   Cond. No.                         322.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



In [67]:

    
# runs OLS regression
formula = 'lnreviews ~ lnchapters + lnwords1k + ratedM + age + fansize + complete'
reg = smf.ols(data=df_online, formula=formula).fit()
print(reg.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:              lnreviews   R-squared:                       0.434
Model:                            OLS   Adj. R-squared:                  0.433
Method:                 Least Squares   F-statistic:                     336.5
Date:                Tue, 08 Aug 2017   Prob (F-statistic):          6.35e-321
Time:                        10:14:25   Log-Likelihood:                -3435.8
No. Observations:                2639   AIC:                             6886.
Df Residuals:                    2632   BIC:                             6927.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            1.1808      0.046     25.717      0.000       1.091       1.271
ratedM[T.True]       0.1468      0.045      3.262      0.001       0.059       0.235
complete[T.True]     0.2795      0.037      7.656      0.000       0.208       0.351
lnchapters           0.4958      0.028     17.563      0.000       0.440       0.551
lnwords1k            0.2247      0.019     11.710      0.000       0.187       0.262
age                  0.0442      0.004     10.127      0.000       0.036       0.053
fansize              0.0006      0.000      2.925      0.003       0.000       0.001
==============================================================================
Omnibus:                      113.198   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              203.572
Skew:                           0.334   Prob(JB):                     6.24e-45
Kurtosis:                       4.185   Cond. No.                         337.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



In [64]:

    
# runs OLS regression
formula = 'lnfavs ~ lnchapters + lnwords1k + ratedM + age + fansize'
reg = smf.ols(data=df_online, formula=formula).fit()
print(reg.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 lnfavs   R-squared:                       0.216
Model:                            OLS   Adj. R-squared:                  0.215
Method:                 Least Squares   F-statistic:                     139.2
Date:                Tue, 08 Aug 2017   Prob (F-statistic):          1.14e-130
Time:                        10:10:56   Log-Likelihood:                -3825.4
No. Observations:                2527   AIC:                             7663.
Df Residuals:                    2521   BIC:                             7698.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          2.0682      0.047     43.806      0.000       1.976       2.161
ratedM[T.True]     0.3411      0.056      6.080      0.000       0.231       0.451
lnchapters        -0.0870      0.035     -2.486      0.013      -0.156      -0.018
lnwords1k          0.3969      0.025     15.956      0.000       0.348       0.446
age               -0.0308      0.006     -5.409      0.000      -0.042      -0.020
fansize            0.0010      0.000      4.207      0.000       0.001       0.001
==============================================================================
Omnibus:                       92.491   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              102.129
Skew:                           0.475   Prob(JB):                     6.65e-23
Kurtosis:                       3.257   Cond. No.                         286.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.



In [32]:

    
# creates copy of only active users
df_active = df_profile.loc[df_profile.status != 'inactive', ].copy()

# creates age variable
df_active['age'] = 17 - pd.to_numeric(df_active['join_year'])
df_active.loc[df_active.age < 0, 'age'] = df_active.loc[df_active.age < 0, 'age'] + 100
df_active = df_active[['st', 'fa', 'fs', 'cc', 'age']]

# turns cc into binary
df_active.loc[df_active['cc'] > 0, 'cc'] = 1

Multicollinearity



In [33]:

    
# displays correlation matrix
df_active.corr()



In [34]:

    
# creates design_matrix 
X = df_active
X['intercept'] = 1

# displays variance inflation factor
vif_results = pd.DataFrame()
vif_results['VIF Factor'] = [vif(X.values, i) for i in range(X.shape[1])]
vif_results['features'] = X.columns
vif_results

Results indicate there is some correlation between two of the independent variables: 'fa' and 'fs', implying one of them may not be necessary in the model.

Nonlinearity

We know from earlier distributions that some of the variables are heavily right-skewed. We created some scatter plots to confirm that the assumption of linearity holds.



In [56]:

    
df_online['lnreviews'] = np.log(df_online['reviews']+1)
df_online['lnchapters'] = np.log(df_online['chapters'])
df_online['lnwords1k'] = np.log(df_online['words1k'])

sns.pairplot(data=df_online, y_vars=['lnreviews'], x_vars=['lnchapters', 'lnwords1k', 'age'])
sns.pairplot(data=df_online, y_vars=['reviews'], x_vars=['chapters', 'words', 'age'])

plt.show()



In [59]:

    
df_online['lnfavs'] = np.log(df_online['favs']+1)

sns.pairplot(data=df_online, y_vars=['lnfavs'], x_vars=['lnchapters', 'lnwords1k', 'age'])
sns.pairplot(data=df_online, y_vars=['favs'], x_vars=['chapters', 'words', 'age'])

plt.show()

The data is clustered around the zeros. Let's try a log transformation.

Regression Model



In [47]:

    
# runs OLS regression
formula = 'st ~ fa + fs + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     st   R-squared:                       0.199
Model:                            OLS   Adj. R-squared:                  0.196
Method:                 Least Squares   F-statistic:                     61.31
Date:                Thu, 03 Aug 2017   Prob (F-statistic):           2.70e-46
Time:                        18:53:22   Log-Likelihood:                -757.62
No. Observations:                 992   AIC:                             1525.
Df Residuals:                     987   BIC:                             1550.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0338      0.029     -1.171      0.242      -0.090       0.023
fa             0.1482      0.029      5.150      0.000       0.092       0.205
fs             0.0401      0.018      2.287      0.022       0.006       0.075
cc             0.6732      0.148      4.538      0.000       0.382       0.964
age            0.0290      0.004      6.847      0.000       0.021       0.037
==============================================================================
Omnibus:                      583.226   Durbin-Watson:                   2.123
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5544.765
Skew:                           2.580   Prob(JB):                         0.00
Kurtosis:                      13.370   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The log transformations helped increase the fit from and R-squared of ~0.05 to ~0.20.

From these results, we can see that:

A 1% change in number of authors favorited is associated with a ~15% change in the number of stories written.
A 1% change in number of stories favorited is associated with a ~4% change in the number of stories written.
Being in a community is associated with a ~0.7 increase in the number of stories written.
One more year on the site is associated with a ~3% change in the number of stories written.

We noted earlier that 'fa' and 'fs' had a correlation of ~0.7. As such, we reran the regression without 'fa' first, then again without 'fs'. The model without 'fs' yielded a better fit (R-squared), as well as AIC and BIC.



In [48]:

    
# runs OLS regression
formula = 'st ~ fa + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())









    



                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     st   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.192
Method:                 Least Squares   F-statistic:                     79.67
Date:                Thu, 03 Aug 2017   Prob (F-statistic):           3.69e-46
Time:                        18:53:27   Log-Likelihood:                -760.24
No. Observations:                 992   AIC:                             1528.
Df Residuals:                     988   BIC:                             1548.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0169      0.028     -0.605      0.545      -0.072       0.038
fa             0.1989      0.018     10.843      0.000       0.163       0.235
cc             0.7102      0.148      4.806      0.000       0.420       1.000
age            0.0281      0.004      6.636      0.000       0.020       0.036
==============================================================================
Omnibus:                      592.647   Durbin-Watson:                   2.130
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5757.058
Skew:                           2.627   Prob(JB):                         0.00
Kurtosis:                      13.568   Cond. No.                         59.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Without 'fs', we lost some information but not much:

A 1% change in number of authors favorited is associated with a ~20% change in the number of stories written.
Being in a community is associated with a ~0.7 increase in the number of stories written.
One more year on the site is associated with a ~3% change in the number of stories written.

All these results seem to confirm a basic intuition that the more active an user reads (as measured by favoriting authors and stories), the likely it is that user will write more stories. Being longer on the site and being part of a community is also correlated to publications.

To get a sense of the actual magnitude of these effects, let's attempt some plots:



In [99]:

    
def graph(formula, x_range):  
    y = np.array(x_range)
    x = formula(y)
    plt.plot(y,x)  

graph(lambda x : (np.exp(reg.params[0]+reg.params[1]*(np.log(x-1)))), 
      range(2,100,1))
graph(lambda x : (np.exp(reg.params[0]+reg.params[1]*(np.log(x-1))+reg.params[2])), 
      range(2,100,1))

plt.show()



In [98]:

    
ages = [0, 1, 5, 10, 15]
for age in ages:
    graph(lambda x : (np.exp(reg.params[0]+reg.params[1]*(np.log(x-1))+reg.params[3]*age)), 
          range(2,100,1))

plt.show()

	st	fa	fs	cc	age
st	1.000000	0.089321	0.142494	0.052937	0.170821
fa	0.089321	1.000000	0.706184	0.017645	0.007866
fs	0.142494	0.706184	1.000000	0.118110	0.011833
cc	0.052937	0.017645	0.118110	1.000000	0.113621
age	0.170821	0.007866	0.011833	0.113621	1.000000

	VIF Factor	features
0	1.051990	st
1	2.013037	fa
2	2.064973	fs
3	1.036716	cc
4	1.042636	age
5	2.824849	intercept