In [1]:
# imports libraries
import pickle										# import/export lists
import math											# mathematical functions
import datetime										# dates
import string
import re 											# regular expression
import pandas as pd									# dataframes
import numpy as np									# numerical computation
import matplotlib.pyplot as plt						# plot graphics
import seaborn as sns								# graphics supplemental
import statsmodels.formula.api as smf				# statistical models
from statsmodels.stats.outliers_influence import (
    variance_inflation_factor as vif)				# vif
from nltk.corpus import stopwords

In [3]:
# opens cleaned data
with open ('../clean_data/df_story', 'rb') as fp:
    df = pickle.load(fp)

In [4]:
# creates subset of data of online stories
df_online = df.loc[df.state == 'online', ].copy()

In [5]:
# sets current year
cyear = datetime.datetime.now().year

In [30]:
# sets stop word list for text parsing
stop_word_list = stopwords.words('english')

Fanfiction Story Analysis

Performance benchmarking and prediction

The success of a story is typically judged by the number of reviews, favorites, or followers it receives. Here, we will try to predict how successful a story will be given select observable features, as well as develop a way to benchmark existing stories. That is, given a story's features, we can determine whether that story is overperforming or underperforming relative to its peers.

Reviews, favorites, and follows

First and foremost, let us examine the distribution of each of these "success" metrics.


In [45]:
# examines distributions of reviews, favorites, and follows
# (density=True replaces the deprecated normed=True; linewidth is numeric)

df_online['reviews'].fillna(0).plot.hist(density=True, 
                                         bins=np.arange(0, 50, 1), alpha=0.5, histtype='step', linewidth=2)
df_online['favs'].fillna(0).plot.hist(density=True, 
                                         bins=np.arange(0, 50, 1), alpha=0.5, histtype='step', linewidth=2)
df_online['follows'].fillna(0).plot.hist(density=True, 
                                         bins=np.arange(0, 50, 1), alpha=0.5, histtype='step', linewidth=2)
plt.xlim(0,50)
plt.legend().set_visible(True)

plt.show()


As expected, reviews, favorites, and follows all have heavily right-skewed distributions. However, there are also differences. A story is most likely to have 1 or 2 reviews, not 0. A story is most likely to have 0 favorites, but otherwise the favorites distribution looks very similar to that of reviews. Follows deviate the most: about one-fourth of stories have 0 or 1 follows.

We assumed authors prefer having reviews first and foremost, then favorites, then follows. The data reveals that follows are actually the "rarest" of the three metrics, then favorites, and finally reviews.

This is in accordance with intuition. Anyone can sign a review, with or without an account. Only users with accounts can increase a story's favorite counter. Finally, follows are the biggest "hassle", as they send update messages to a follower's email inbox. Consequently, they are the least common.
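
Shares like the ones quoted above can be checked by counting how many stories fall at or below a given count. The following is a minimal sketch on made-up numbers; `share_at_most` and the toy `follows` list are illustrative names that do not appear in the original notebook.

```python
def share_at_most(counts, threshold):
    """Return the fraction of stories whose count is <= threshold.

    Missing values (None) are treated as 0, mirroring the fillna(0) above.
    """
    cleaned = [0 if c is None else c for c in counts]
    return sum(c <= threshold for c in cleaned) / len(cleaned)

# hypothetical toy data, NOT taken from df_online
follows = [0, 1, None, 4, 7, 2, 0, 12]

share_zero_or_one = share_at_most(follows, 1)
print(f"share with 0 or 1 follows: {share_zero_or_one:.2f}")
```

On the real data, the same quantity could presumably be obtained with `(df_online['follows'].fillna(0) <= 1).mean()`.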


In [14]:
df_online.columns.values


Out[14]:
array(['storyid', 'userid', 'title', 'summary', 'media', 'fandom', 'rated',
       'language', 'genre', 'characters', 'chapters', 'words', 'reviews',
       'favs', 'follows', 'updated', 'published', 'status', 'state',
       'pub_year'], dtype=object)

In [46]:
# creates regressor (independent) variables
df_online['ratedM'] = [row == 'M' for row in df_online['rated']]
df_online['age'] = [cyear - int(row) for row in df_online['pub_year']]
df_online['fansize'] = [fandom[row] for row in df_online['fandom']]
df_online['complete'] = [row == 'Complete' for row in df_online['status']]
df_online['lnchapters'] = np.log(df_online['chapters'])

In [47]:
# creates regressand (dependent) variables
df_online['lnreviews'] = np.log(df_online['reviews']+1)
df_online['lnfavs'] = np.log(df_online['favs']+1)
df_online['lnfollows'] = np.log(df_online['follows']+1)
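
The `+1` inside the logarithm matters because many stories have a count of 0, and log(0) is undefined. A small sketch of the transform using only the standard library follows; `ln_count` is a hypothetical helper, and `math.log1p(x)` computes log(1 + x), the scalar counterpart of `np.log(x + 1)` used above.

```python
import math

def ln_count(count):
    """Log-transform a non-negative count as log(count + 1)."""
    return math.log1p(count)

# a story with no reviews maps to 0 instead of raising an error
zero_reviews = ln_count(0)

# the transform is invertible: exp(y) - 1 recovers the original count
recovered = math.expm1(ln_count(25))

print(zero_reviews, round(recovered))
```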

In [59]:
df_online['lnfavs'] = np.log(df_online['favs']+1)

sns.pairplot(data=df_online, y_vars=['lnfavs'], x_vars=['lnchapters', 'lnwords1k', 'age'])
sns.pairplot(data=df_online, y_vars=['favs'], x_vars=['chapters', 'words', 'age'])

plt.show()



In [56]:
sns.pairplot(data=df_online, y_vars=['lnreviews'], x_vars=['lnchapters', 'lnwords1k', 'age'])
sns.pairplot(data=df_online, y_vars=['reviews'], x_vars=['chapters', 'words', 'age'])

plt.show()



In [66]:
# runs OLS regression
formula = 'reviews ~ chapters + words1k + ratedM + age + fansize + complete'
reg = smf.ols(data=df_online, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                reviews   R-squared:                       0.190
Model:                            OLS   Adj. R-squared:                  0.188
Method:                 Least Squares   F-statistic:                     103.0
Date:                Tue, 08 Aug 2017   Prob (F-statistic):          8.65e-117
Time:                        10:14:15   Log-Likelihood:                -15926.
No. Observations:                2639   AIC:                         3.187e+04
Df Residuals:                    2632   BIC:                         3.191e+04
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept           -6.6472      4.920     -1.351      0.177     -16.295       3.001
ratedM[T.True]      19.9886      5.055      3.954      0.000      10.076      29.901
complete[T.True]     3.7254      3.996      0.932      0.351      -4.111      11.562
chapters             3.2265      0.399      8.090      0.000       2.444       4.009
words1k              1.1135      0.099     11.298      0.000       0.920       1.307
age                  0.2418      0.495      0.488      0.626      -0.730       1.213
fansize              0.0387      0.022      1.794      0.073      -0.004       0.081
==============================================================================
Omnibus:                     6223.232   Durbin-Watson:                   2.008
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         53684044.481
Skew:                          23.087   Prob(JB):                         0.00
Kurtosis:                     700.201   Cond. No.                         322.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [67]:
# runs OLS regression
formula = 'lnreviews ~ lnchapters + lnwords1k + ratedM + age + fansize + complete'
reg = smf.ols(data=df_online, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:              lnreviews   R-squared:                       0.434
Model:                            OLS   Adj. R-squared:                  0.433
Method:                 Least Squares   F-statistic:                     336.5
Date:                Tue, 08 Aug 2017   Prob (F-statistic):          6.35e-321
Time:                        10:14:25   Log-Likelihood:                -3435.8
No. Observations:                2639   AIC:                             6886.
Df Residuals:                    2632   BIC:                             6927.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            1.1808      0.046     25.717      0.000       1.091       1.271
ratedM[T.True]       0.1468      0.045      3.262      0.001       0.059       0.235
complete[T.True]     0.2795      0.037      7.656      0.000       0.208       0.351
lnchapters           0.4958      0.028     17.563      0.000       0.440       0.551
lnwords1k            0.2247      0.019     11.710      0.000       0.187       0.262
age                  0.0442      0.004     10.127      0.000       0.036       0.053
fansize              0.0006      0.000      2.925      0.003       0.000       0.001
==============================================================================
Omnibus:                      113.198   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              203.572
Skew:                           0.334   Prob(JB):                     6.24e-45
Kurtosis:                       4.185   Cond. No.                         337.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
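
One way to use the log model for benchmarking is to compare a story's actual log review count with the value the fitted equation predicts; a positive gap suggests the story is overperforming relative to similar stories. The sketch below hard-codes the coefficients printed above. The example story, the helper names, the definition of `lnwords1k` as the log of words in thousands, and the units of `fansize` are all assumptions made purely for illustration.

```python
import math

# coefficients copied from the lnreviews regression output above
COEF = {
    'Intercept': 1.1808, 'ratedM': 0.1468, 'complete': 0.2795,
    'lnchapters': 0.4958, 'lnwords1k': 0.2247, 'age': 0.0442,
    'fansize': 0.0006,
}

def expected_lnreviews(chapters, words, rated_m, complete, age, fansize):
    """Predicted ln(reviews + 1) for a story with the given features."""
    return (COEF['Intercept']
            + COEF['ratedM'] * rated_m
            + COEF['complete'] * complete
            + COEF['lnchapters'] * math.log(chapters)
            + COEF['lnwords1k'] * math.log(words / 1000)  # assumed definition
            + COEF['age'] * age
            + COEF['fansize'] * fansize)

def performance_gap(actual_reviews, **features):
    """Actual minus predicted ln(reviews + 1); > 0 means overperforming."""
    return math.log1p(actual_reviews) - expected_lnreviews(**features)

# hypothetical story: 10 chapters, 25,000 words, not rated M, complete,
# 3 years old, fansize of 100 (in whatever unit the fandom lookup uses)
story = dict(chapters=10, words=25000, rated_m=0, complete=1, age=3, fansize=100)
predicted = expected_lnreviews(**story)
gap = performance_gap(60, **story)
print(f"predicted ln(reviews+1): {predicted:.3f}, gap: {gap:+.3f}")
```

Exponentiating the prediction gives only a rough review count, since a simple back-transformation of a log model tends to understate the mean.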

In [64]:
# runs OLS regression
formula = 'lnfavs ~ lnchapters + lnwords1k + ratedM + age + fansize'
reg = smf.ols(data=df_online, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 lnfavs   R-squared:                       0.216
Model:                            OLS   Adj. R-squared:                  0.215
Method:                 Least Squares   F-statistic:                     139.2
Date:                Tue, 08 Aug 2017   Prob (F-statistic):          1.14e-130
Time:                        10:10:56   Log-Likelihood:                -3825.4
No. Observations:                2527   AIC:                             7663.
Df Residuals:                    2521   BIC:                             7698.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept          2.0682      0.047     43.806      0.000       1.976       2.161
ratedM[T.True]     0.3411      0.056      6.080      0.000       0.231       0.451
lnchapters        -0.0870      0.035     -2.486      0.013      -0.156      -0.018
lnwords1k          0.3969      0.025     15.956      0.000       0.348       0.446
age               -0.0308      0.006     -5.409      0.000      -0.042      -0.020
fansize            0.0010      0.000      4.207      0.000       0.001       0.001
==============================================================================
Omnibus:                       92.491   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              102.129
Skew:                           0.475   Prob(JB):                     6.65e-23
Kurtosis:                       3.257   Cond. No.                         286.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In [32]:
# creates copy of only active users
df_active = df_profile.loc[df_profile.status != 'inactive', ].copy()

# creates age variable
df_active['age'] = 17 - pd.to_numeric(df_active['join_year'])
df_active.loc[df_active.age < 0, 'age'] = df_active.loc[df_active.age < 0, 'age'] + 100
df_active = df_active[['st', 'fa', 'fs', 'cc', 'age']]

# turns cc into binary
df_active.loc[df_active['cc'] > 0, 'cc'] = 1

Multicollinearity


In [33]:
# displays correlation matrix
df_active.corr()


Out[33]:
st fa fs cc age
st 1.000000 0.089321 0.142494 0.052937 0.170821
fa 0.089321 1.000000 0.706184 0.017645 0.007866
fs 0.142494 0.706184 1.000000 0.118110 0.011833
cc 0.052937 0.017645 0.118110 1.000000 0.113621
age 0.170821 0.007866 0.011833 0.113621 1.000000

In [34]:
# creates design matrix (copy, so the intercept column is not added to df_active)
X = df_active.copy()
X['intercept'] = 1

# displays variance inflation factor
vif_results = pd.DataFrame()
vif_results['VIF Factor'] = [vif(X.values, i) for i in range(X.shape[1])]
vif_results['features'] = X.columns
vif_results


Out[34]:
VIF Factor features
0 1.051990 st
1 2.013037 fa
2 2.064973 fs
3 1.036716 cc
4 1.042636 age
5 2.824849 intercept

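Results indicate there is some correlation between two of the independent variables, 'fa' and 'fs' (correlation of ~0.71, VIF of ~2), implying one of them may not be necessary in the model. A VIF of ~2 is well below the common rule-of-thumb thresholds of 5 or 10, so the collinearity is mild.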
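
For intuition, when only two regressors are involved the variance inflation factor reduces to 1 / (1 - r^2), where r is their pairwise correlation. The sketch below plugs in the 'fa'-'fs' correlation from the matrix above; the function name is hypothetical, and the result only approximates the reported VIFs because those also account for the other regressors.

```python
def two_variable_vif(r):
    """VIF when a regressor is explained by a single other regressor."""
    return 1.0 / (1.0 - r ** 2)

# correlation between 'fa' and 'fs' from the correlation matrix above
r_fa_fs = 0.706184
vif_fa_fs = two_variable_vif(r_fa_fs)
print(f"approximate VIF: {vif_fa_fs:.2f}")
```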

Nonlinearity

We know from earlier distributions that some of the variables are heavily right-skewed. We created some scatter plots to check whether the assumption of linearity holds.

The data is clustered around zero. Let's try a log transformation.
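
To illustrate why a log transformation helps with right-skewed counts, the sketch below compares the skewness of a small sample before and after applying log(x + 1). The data are invented for the example and `skewness` is a hypothetical helper, so this shows the mechanism only, not the notebook's actual result.

```python
import math

def skewness(values):
    """Population skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# hypothetical right-skewed counts, NOT taken from the data set
counts = [0, 0, 0, 1, 1, 2, 3, 5, 8, 40, 250]
logged = [math.log1p(c) for c in counts]

raw_skew = skewness(counts)
log_skew = skewness(logged)
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```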

Regression Model


In [47]:
# runs OLS regression
formula = 'st ~ fa + fs + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     st   R-squared:                       0.199
Model:                            OLS   Adj. R-squared:                  0.196
Method:                 Least Squares   F-statistic:                     61.31
Date:                Thu, 03 Aug 2017   Prob (F-statistic):           2.70e-46
Time:                        18:53:22   Log-Likelihood:                -757.62
No. Observations:                 992   AIC:                             1525.
Df Residuals:                     987   BIC:                             1550.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0338      0.029     -1.171      0.242      -0.090       0.023
fa             0.1482      0.029      5.150      0.000       0.092       0.205
fs             0.0401      0.018      2.287      0.022       0.006       0.075
cc             0.6732      0.148      4.538      0.000       0.382       0.964
age            0.0290      0.004      6.847      0.000       0.021       0.037
==============================================================================
Omnibus:                      583.226   Durbin-Watson:                   2.123
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5544.765
Skew:                           2.580   Prob(JB):                         0.00
Kurtosis:                      13.370   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The log transformations helped increase the fit from an R-squared of ~0.05 to ~0.20.

From these results, we can see that:

  • A 1% change in number of authors favorited is associated with a ~0.15% change in the number of stories written.
  • A 1% change in number of stories favorited is associated with a ~0.04% change in the number of stories written.
  • Being in a community is associated with a ~0.67 increase in the log of the number of stories written, which is roughly a doubling (exp(0.67) ≈ 1.96).
  • One more year on the site is associated with a ~3% change in the number of stories written.
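
The percentage readings of these coefficients can be checked numerically. The sketch below assumes, as the text states, that 'st', 'fa', and 'fs' enter the model in logs while 'cc' and 'age' do not (the transformation cell itself is not shown); the helper names are hypothetical.

```python
import math

def pct_change_loglog(beta, pct_x=1.0):
    """% change in y for a pct_x% change in x when both are logged."""
    return ((1 + pct_x / 100) ** beta - 1) * 100

def pct_change_loglevel(beta, delta_x=1.0):
    """% change in y for a delta_x unit change in x when only y is logged."""
    return (math.exp(beta * delta_x) - 1) * 100

# coefficients copied from the regression output above
fa_effect = pct_change_loglog(0.1482)     # 1% more authors favorited
cc_effect = pct_change_loglevel(0.6732)   # joining a community (0 -> 1)
age_effect = pct_change_loglevel(0.0290)  # one more year on the site
print(f"fa: {fa_effect:.2f}%, cc: {cc_effect:.1f}%, age: {age_effect:.2f}%")
```

For small coefficients the two readings nearly coincide with the coefficient itself, which is why the 'age' coefficient of 0.029 can be read as roughly 3%.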

We noted earlier that 'fa' and 'fs' had a correlation of ~0.7. As such, we reran the regression without 'fa' first, then again without 'fs'. Of the two, the model without 'fs' yielded the better fit (R-squared), as well as the better AIC and BIC.


In [48]:
# runs OLS regression
formula = 'st ~ fa + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     st   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.192
Method:                 Least Squares   F-statistic:                     79.67
Date:                Thu, 03 Aug 2017   Prob (F-statistic):           3.69e-46
Time:                        18:53:27   Log-Likelihood:                -760.24
No. Observations:                 992   AIC:                             1528.
Df Residuals:                     988   BIC:                             1548.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0169      0.028     -0.605      0.545      -0.072       0.038
fa             0.1989      0.018     10.843      0.000       0.163       0.235
cc             0.7102      0.148      4.806      0.000       0.420       1.000
age            0.0281      0.004      6.636      0.000       0.020       0.036
==============================================================================
Omnibus:                      592.647   Durbin-Watson:                   2.130
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5757.058
Skew:                           2.627   Prob(JB):                         0.00
Kurtosis:                      13.568   Cond. No.                         59.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
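
The AIC and BIC figures in the two summaries can be reproduced from the reported log-likelihoods with the standard formulas, which makes the comparison between the models explicit. This is a sketch with hypothetical helper names; the inputs are copied from the outputs above.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# values copied from the two summaries above; n_params counts the intercept
full = dict(log_likelihood=-757.62, n_params=5)      # st ~ fa + fs + cc + age
reduced = dict(log_likelihood=-760.24, n_params=4)   # st ~ fa + cc + age

aic_full, aic_reduced = aic(**full), aic(**reduced)
bic_full, bic_reduced = bic(n_obs=992, **full), bic(n_obs=992, **reduced)
print(round(aic_full), round(aic_reduced), round(bic_full), round(bic_reduced))
```

Lower is better for both criteria. On these figures the model with 'fs' has the slightly lower AIC, while the model without 'fs' has the slightly lower BIC, because BIC penalizes the extra parameter more heavily.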

Without 'fs', we lost some information but not much:

  • A 1% change in number of authors favorited is associated with a ~0.20% change in the number of stories written.
  • Being in a community is associated with a ~0.71 increase in the log of the number of stories written, which is roughly a doubling (exp(0.71) ≈ 2.03).
  • One more year on the site is associated with a ~3% change in the number of stories written.

All these results seem to confirm a basic intuition that the more actively a user reads (as measured by favoriting authors and stories), the more likely it is that the user will write more stories. Being on the site longer and being part of a community are also correlated with publications.

To get a sense of the actual magnitude of these effects, let's attempt some plots:


In [99]:
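def graph(formula, x_range):
    # evaluates formula over x_range and plots the resulting curve
    x = np.array(x_range)
    y = formula(x)
    plt.plot(x, y)

# predicted curve without community membership, then with it
# (coefficients are looked up by name instead of by position)
graph(lambda x : (np.exp(reg.params['Intercept']+reg.params['fa']*(np.log(x-1)))), 
      range(2,100,1))
graph(lambda x : (np.exp(reg.params['Intercept']+reg.params['fa']*(np.log(x-1))+reg.params['cc'])), 
      range(2,100,1))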

plt.show()



In [98]:
# plots one predicted curve per number of years on the site
ages = [0, 1, 5, 10, 15]
for age in ages:
    graph(lambda x : (np.exp(reg.params['Intercept']+reg.params['fa']*(np.log(x-1))+reg.params['age']*age)), 
          range(2,100,1))

plt.show()