In [7]:
# imports libraries
import pickle										# import/export lists
import datetime										# dates
import re 											# regular expression
import pandas as pd									# dataframes
import numpy as np									# numerical computation
import matplotlib.pyplot as plt						# plot graphics
import seaborn as sns								# graphics supplemental
import statsmodels.formula.api as smf				# statistical models
import statsmodels.discrete.discrete_model as sm
from sklearn.linear_model import LogisticRegression
from statsmodels.stats.outliers_influence import (
    variance_inflation_factor as vif)				# vif

Predicting publications

In this section, we will try to predict whether a user will be an author. If they are an author, we will try to predict the number of stories they write, based on observable characteristics such as the number of years they have been on the site, how much they read, and whether or not they are in a community.


In [2]:
# opens raw data
with open('../data/clean_data/df_profile', 'rb') as fp:
    df = pickle.load(fp)

# creates copy with active users only
df_active = df.loc[df.status != 'inactive'].copy()

Linear Modelling

For this problem, we will use regression analysis. That is, we have some information at hand, and we want to use that information to predict something. This works if there is some underlying relationship between our information and the thing we want to predict.

More specifically, we will use the following models:

Logistic regression

Logistic regression lets us predict the probability that a variable is (or is not) something. The model is as follows:

$Pr(Y_i=1) = \dfrac{1}{1+e^{-(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + ... + \beta_k X_{ik})}}$

in which $Y_i$ is the $i$th observation of the dependent variable (the variable of interest), and $X_{ij}$ is the $i$th observation of the $j$th independent variable.

For our model to work, we would need a few assumptions. These are:

  • Linearity between the log-odds and the independent variables / correct functional form
  • Strict exogeneity $E[\epsilon|X] = 0$
  • No multicollinearity
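
As a concrete illustration, here is a minimal numeric sketch of the logistic function, using made-up coefficients for a single independent variable:

In [ ]:
# hypothetical coefficients, for illustration only
beta_0, beta_1 = -2.0, 0.5

def logistic_probability(x):
    # maps the linear predictor beta_0 + beta_1*x into (0, 1)
    return 1 / (1 + np.exp(-(beta_0 + beta_1 * x)))

print(logistic_probability(0))   # ~0.12
print(logistic_probability(4))   # 0.50, where the linear predictor crosses zero
print(logistic_probability(10))  # ~0.95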

Linear regression

Linear regression lets us predict the value of something. The model is as follows:

$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + ... + \beta_k X_{ik} + \epsilon_i$

in which $Y_i$ is the $i$th observation of the dependent variable (the variable of interest), and $X_{ij}$ is the $i$th observation of the $j$th independent variable.

For this model to work, we would need a few assumptions. These are:

  • Linearity between dependent and independent variables / correct functional form
  • Strict exogeneity $E[\epsilon|X] = 0$
  • No multicollinearity
  • Normally distributed errors
  • Spherical errors (homoscedasticity and no autocorrelation)
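
For reference, here is a minimal sketch of how the coefficients are estimated, solving the normal equations $\hat{\beta} = (X^T X)^{-1} X^T y$ on simulated data:

In [ ]:
# simulates data satisfying y = 1 + 2x + noise
np.random.seed(0)
n = 500
x = np.random.normal(size=n)
y = 1 + 2 * x + np.random.normal(size=n)

# builds a design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# solves (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [1, 2]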

Probability of being an author

Before introducing any type of model, we can already see from our sample data that out of all active users, only ~18% of them are authors. As such, if we simply guess "not author" for all users, we will be correct ~82% of the time. Our goal is to come up with a classification method that has greater accuracy.


In [3]:
# examines status of users
status = df_active['status'].value_counts()

# plots chart
status.plot.pie(autopct='%.f', figsize=(5,5))
plt.ylabel('')
plt.show()
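
As a sanity check, the baseline accuracy of always guessing "not author" can be computed directly. A sketch, assuming a user with at least one story ('st' > 0) counts as an author:

In [ ]:
# treats any user with at least one story written as an author
is_author = df_active['st'] > 0

print('share of authors: ', round(is_author.mean(), 2))      # ~0.18
print('baseline accuracy:', round(1 - is_author.mean(), 2))  # ~0.82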


The first approach that we will try is the logit model, or logistic regression, as explained above.

$Pr(Author_i = 1) = \dfrac{1}{1+e^{-x_i^T \beta}}$

Here, the probability that user $i$ is an author is a function of a vector of characteristics $x$. These characteristics will be things that we can observe. In particular, we chose the following:

  • the number of years since they first joined the site
  • the number of favorite authors+stories they have (logarithmically scaled)
  • whether or not they have written a profile
  • whether or not they are in a community

One more characteristic that we would also like to include later, but have not yet retrieved the data for:

  • whether or not they have signed a review

The reasoning behind these variables is simple -- we assume that the longer and more involved a user is on the site, the more likely they are to publish.

From there, we randomly split our data into two subsets: one to be used for training (to build our model), and one for testing (to evaluate the accuracy of the model we built).

The model produced by the training data has statistically significant positive $\beta$ coefficients for all our explanatory variables, which is the relationship we expected. However, when we use our model to predict the testing set, we achieve only ~86% accuracy. This is an improvement over blind guessing, but only a slight one.
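
The notebook does not show that step here, so below is a minimal sketch of how the split and fit might look. The 'author' outcome is derived from 'st', and the feature set uses only columns shown elsewhere in this notebook ('age', 'fa', 'cc'); the profile dummy is omitted since it does not appear in the data above:

In [ ]:
from sklearn.model_selection import train_test_split

# a sketch, not the original code
y = (df_active['st'] > 0).astype(int)       # author indicator
X = pd.DataFrame({
    'age': df_active['age'],                # years since joining
    'fa': np.log(df_active['fa'] + 1),      # favorite authors, log-scaled
    'cc': df_active['cc'],                  # community membership
})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = LogisticRegression().fit(X_train, y_train)
print(clf.score(X_test, y_test))  # held-out accuracy (the text above reports ~86%)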

Estimating number of stories written

Multicollinearity


In [33]:
# displays correlation matrix
df_active.corr()


Out[33]:
          id        cc        fa        fs        st
id  1.000000 -0.101727 -0.056975 -0.057811 -0.156201
cc -0.101727  1.000000  0.089435  0.174949  0.300189
fa -0.056975  0.089435  1.000000  0.361057  0.120795
fs -0.057811  0.174949  0.361057  1.000000  0.173517
st -0.156201  0.300189  0.120795  0.173517  1.000000

In [10]:
# creates design matrix from the numeric regressors
X = df_active[['st', 'fa', 'fs', 'cc', 'age']].copy()
X['intercept'] = 1

# displays variance inflation factor
vif_results = pd.DataFrame()
vif_results['VIF Factor'] = [vif(X.values, i) for i in range(X.shape[1])]
vif_results['features'] = X.columns
vif_results


Out[10]:
   VIF Factor   features
0    1.051990         st
1    2.013037         fa
2    2.064973         fs
3    1.036716         cc
4    1.042636        age
5    2.824849  intercept

The correlation matrix and the VIFs indicate some correlation between two of the independent variables, 'fa' and 'fs' (~0.36), suggesting that one of them may not be necessary in the model.
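
For context, the VIF for regressor $j$ comes from regressing it on all the other regressors:

$VIF_j = \dfrac{1}{1 - R_j^2}$

so a VIF of ~2 for 'fa' and 'fs' corresponds to $R_j^2 \approx 0.5$ in those auxiliary regressions. This is well below the common rule-of-thumb cutoffs of 5 or 10, so the concern here is redundancy rather than a severe estimation problem.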

Nonlinearity

We know from earlier distributions that some of the variables are heavily right-skewed. We created some scatter plots to check whether the assumption of linearity holds.


In [11]:
# plots dependent variable against each regressor
sns.pairplot(data=df_active, y_vars=['st'], x_vars=['fa', 'fs', 'age'])

plt.show()


The data are clustered near zero. Let's try a log transformation.


In [12]:
# takes log(x+1) transformation (the +1 handles zero counts)
df_active['st'] = np.log(df_active['st']+1)
df_active['fa'] = np.log(df_active['fa']+1)
df_active['fs'] = np.log(df_active['fs']+1)

In [13]:
# re-examines linearity after the log transformation
sns.pairplot(data=df_active, y_vars=['st'], x_vars=['fa', 'fs', 'age'])

plt.show()


Regression Model


In [14]:
# runs OLS regression
formula = 'st ~ fa + fs + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     st   R-squared:                       0.199
Model:                            OLS   Adj. R-squared:                  0.196
Method:                 Least Squares   F-statistic:                     61.31
Date:                Mon, 07 Aug 2017   Prob (F-statistic):           2.70e-46
Time:                        13:07:36   Log-Likelihood:                -757.62
No. Observations:                 992   AIC:                             1525.
Df Residuals:                     987   BIC:                             1550.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0338      0.029     -1.171      0.242      -0.090       0.023
fa             0.1482      0.029      5.150      0.000       0.092       0.205
fs             0.0401      0.018      2.287      0.022       0.006       0.075
cc             0.6732      0.148      4.538      0.000       0.382       0.964
age            0.0290      0.004      6.847      0.000       0.021       0.037
==============================================================================
Omnibus:                      583.226   Durbin-Watson:                   2.123
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5544.765
Skew:                           2.580   Prob(JB):                         0.00
Kurtosis:                      13.370   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The log transformations increased the fit from an R-squared of ~0.05 to ~0.20.

From these results, we can see that (coefficients are on the log scale; see the back-transformation sketch below):

  • A 1% increase in the number of authors favorited is associated with a ~0.15% increase in the number of stories written.
  • A 1% increase in the number of stories favorited is associated with a ~0.04% increase in the number of stories written.
  • Being in a community is associated with a ~0.67 increase in log stories written, i.e. roughly a doubling.
  • One more year on the site is associated with ~3% more stories written.
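
Since 'st' is modeled on the log scale, a dummy coefficient $b$ implies a multiplicative effect of $e^b$. A quick sketch of the back-transformation for the coefficients above:

In [ ]:
# back-transforms the community and age coefficients from the log scale
print(np.exp(reg.params['cc']) - 1)   # ~0.96: roughly double the stories
print(np.exp(reg.params['age']) - 1)  # ~0.03: ~3% more stories per year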

We noted earlier that 'fa' and 'fs' were correlated (~0.36). As such, we reran the regression without 'fa' first, then again without 'fs'. The model without 'fs' yielded a better fit (R-squared) and lower information criteria (AIC and BIC) than the model without 'fa'; a sketch of the comparison follows.
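
A sketch of that comparison (only the winning model's full output is shown below):

In [ ]:
# compares dropping 'fa' versus dropping 'fs'
for formula in ['st ~ fs + cc + age', 'st ~ fa + cc + age']:
    fit = smf.ols(data=df_active, formula=formula).fit()
    print(formula, ' R2:', round(fit.rsquared, 3),
          ' AIC:', round(fit.aic, 1), ' BIC:', round(fit.bic, 1))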


In [15]:
# runs OLS regression
formula = 'st ~ fa + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     st   R-squared:                       0.195
Model:                            OLS   Adj. R-squared:                  0.192
Method:                 Least Squares   F-statistic:                     79.67
Date:                Mon, 07 Aug 2017   Prob (F-statistic):           3.69e-46
Time:                        13:07:37   Log-Likelihood:                -760.24
No. Observations:                 992   AIC:                             1528.
Df Residuals:                     988   BIC:                             1548.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0169      0.028     -0.605      0.545      -0.072       0.038
fa             0.1989      0.018     10.843      0.000       0.163       0.235
cc             0.7102      0.148      4.806      0.000       0.420       1.000
age            0.0281      0.004      6.636      0.000       0.020       0.036
==============================================================================
Omnibus:                      592.647   Durbin-Watson:                   2.130
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5757.058
Skew:                           2.627   Prob(JB):                         0.00
Kurtosis:                      13.568   Cond. No.                         59.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Without 'fs', we lost some information, but not much:

  • A 1% increase in the number of authors favorited is associated with a ~0.2% increase in the number of stories written.
  • Being in a community is associated with a ~0.71 increase in log stories written, i.e. roughly a doubling.
  • One more year on the site is associated with ~3% more stories written.

All these results seem to confirm a basic intuition: the more actively a user reads (as measured by favoriting authors and stories), the more stories that user is likely to write. Time on the site and community membership are also correlated with publishing.

To get a sense of the actual magnitude of these effects, let's attempt some plots:


In [16]:
def graph(formula, x_range):
    # plots formula(x) over a range of raw (untransformed) values
    x = np.array(x_range)
    y = formula(x)
    plt.plot(x, y)

# predicted stories vs. authors favorited, outside and inside a community
# (np.log(x+1) mirrors the log(x+1) transformation applied to 'fa' above)
graph(lambda x: np.exp(reg.params[0] + reg.params[1]*np.log(x+1)) - 1,
      range(0, 100))
graph(lambda x: np.exp(reg.params[0] + reg.params[1]*np.log(x+1) + reg.params[2]) - 1,
      range(0, 100))

plt.show()



In [17]:
# predicted stories vs. authors favorited, at several site ages
ages = [0, 1, 5, 10, 15]
for age in ages:
    graph(lambda x: np.exp(reg.params[0] + reg.params[1]*np.log(x+1) + reg.params[3]*age) - 1,
          range(0, 100))

plt.show()