In [7]:
# imports libraries
import pickle # import/export lists
import datetime # dates
import re # regular expression
import pandas as pd # dataframes
import numpy as np # numerical computation
import matplotlib.pyplot as plt # plot graphics
import seaborn as sns # graphics supplemental
import statsmodels.formula.api as smf # statistical models
import statsmodels.discrete.discrete_model as sm # discrete-choice models (logit)
from sklearn.linear_model import LogisticRegression # logistic regression
from statsmodels.stats.outliers_influence import (
variance_inflation_factor as vif) # vif
In this section, we will try to predict whether a user will be an author. If they are an author, we will try to predict the number of stories they would write based on observable characteristics, such as the number of years they have been on the site, how much they read, and whether or not they are in a community.
In [2]:
# opens raw data
with open('../data/clean_data/df_profile', 'rb') as fp:
    df = pickle.load(fp)
# creates a copy that keeps only active users
df_active = df.loc[df.status != 'inactive', :].copy()
For this problem, we will use regression analysis. That is, we have some information at hand, and we want to use that information to predict something. This works if there is some underlying relationship between our information and the thing we want to predict.
More specifically, we will use the following models:
Logistic regression lets us predict the probability that a variable is (or is not) something. The model is as follows:
$Pr(Y_i=1) = \dfrac{1}{1+e^{-(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + ... + \beta_k X_{ik} + \epsilon_i)}}$
in which $Y_i$ is the $i$th observation of the dependent variable (variable of interest), and $X_{ij}$ is the $i$th observation of the $j$th independent variable.
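As a quick numerical illustration (with made-up coefficients, not estimates from our data), the formula maps a linear combination of characteristics to a probability between 0 and 1:
# toy example of the logistic formula with hypothetical coefficients
beta = np.array([-2.0, 0.5, 0.3])      # beta_0, beta_1, beta_2 (made up)
x_i = np.array([1.0, 2.0, 4.0])        # 1 for the intercept, then X_i1, X_i2
linear_part = x_i @ beta               # beta_0 + beta_1*X_i1 + beta_2*X_i2
prob = 1 / (1 + np.exp(-linear_part))  # Pr(Y_i = 1), here ~0.55
print(prob)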
For our model to work, we would need a few assumptions. These are:
Linear regression lets us predict the value of something. The model is as follows:
$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + ... + \beta_k X_{ik} + \epsilon_i$
in which $Y_i$ is the $i$th observation of the dependent variable (variable of interest), and $X_{ij}$ is the $i$th observation of the $j$th independent variable.
For this model to work, we would need a few assumptions. These are:
Before introducing any type of model, we can already see from our sample data that out of all active users, only ~18% of them are authors. As such, if we simply guess "not author" for all users, we will be correct ~82% of the time. Our goal is to come up with a classification method that has greater accuracy.
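This baseline can be read directly off the class shares in our sample (the label for authors in the 'status' column is assumed to be 'author' below):
# shares of each status among active users; guessing "not author" for
# everyone is correct for the non-author share (~82%)
shares = df_active['status'].value_counts(normalize=True)
print(shares)
print('baseline accuracy:', 1 - shares.get('author', 0))  # 'author' label assumed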
In [3]:
# examines status of users
status = df_active['status'].value_counts()
# plots chart
status.plot.pie(autopct='%.f', figsize=(5,5))
plt.ylabel('')
plt.show()
The first approach that we will try is the logit model, or logistic regression, as explained above.
$Pr(Author_i = 1) = \dfrac{1}{1+e^{-x_i^t \beta}}$
Here, the probability that user $i$ is an author is a function of a vector of characteristics $x$. These characteristics will be things that we can observe. In particular, we chose the following:
One more characteristic that we would also like to include later, but have not yet retrieved the data for:
The reasoning behind these variables is simple -- we assume that the longer and more involved a user is on the site, the more likely they are to publish.
From there, we randomly split our data into two subsets: one to be used for training (to build our model), and one to be used for testing (to test the accuracy of the model we built).
The model produced by the training data has statistically significant positive $\beta$ coefficients for all our explanatory variables, which is the relationship we expected. However, when we use our model to predict the testing set, we achieve only ~86% accuracy. This is an improvement over blind guessing, but only a slight one.
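The cell that fit this model is not reproduced here; a minimal sketch of the approach, assuming a 0/1 indicator column for authorship (called 'author' below) and the same explanatory variables used in the regressions further down, could look like this:
from sklearn.model_selection import train_test_split

# hypothetical 0/1 column 'author'; 'fa', 'fs', 'cc', 'age' as predictors
features = ['fa', 'fs', 'cc', 'age']
X = df_active[features].copy()
X['intercept'] = 1
y = df_active['author']

# randomly splits the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# fits the logit model on the training set
# (sm is statsmodels.discrete.discrete_model, imported above)
logit = sm.Logit(y_train, X_train).fit()
print(logit.summary())

# classifies the testing set at a 0.5 threshold and computes accuracy
pred = (logit.predict(X_test) > 0.5).astype(int)
print('accuracy:', (pred == y_test).mean())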
In [33]:
# displays correlation matrix
df_active.corr()
Out[33]:
In [10]:
# creates design matrix (numeric columns only, copied so df_active is unchanged)
X = df_active.select_dtypes(include=np.number).copy()
X['intercept'] = 1
# displays variance inflation factor
vif_results = pd.DataFrame()
vif_results['VIF Factor'] = [vif(X.values, i) for i in range(X.shape[1])]
vif_results['features'] = X.columns
vif_results
Out[10]:
Results indicate there is some correlation between two of the independent variables: 'fa' and 'fs', implying one of them may not be necessary in the model.
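For reference, that pairwise correlation can be read off directly:
# pairwise correlation between the two favoriting measures
print(df_active[['fa', 'fs']].corr())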
We know from earlier distributions that some of the variables are heavily right-skewed. We created some scatter plots to check whether the assumption of linearity holds.
In [11]:
sns.pairplot(data=df_active, y_vars=['st'], x_vars=['fa', 'fs', 'age'])
plt.show()
The data are clustered near zero. Let's try a log transformation.
In [12]:
# applies log(x+1) transformations (the +1 handles zeros)
df_active['st'] = np.log(df_active['st']+1)
df_active['fa'] = np.log(df_active['fa']+1)
df_active['fs'] = np.log(df_active['fs']+1)
In [13]:
sns.pairplot(data=df_active, y_vars=['st'], x_vars=['fa', 'fs', 'age'])
plt.show()
In [14]:
# runs OLS regression
formula = 'st ~ fa + fs + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())
The log transformations helped increase the fit from an R-squared of ~0.05 to ~0.20.
From these results, we can see that:
We noted earlier that 'fa' and 'fs' had a correlation of ~0.7. As such, we reran the regression without 'fa' first, then again without 'fs'. The model without 'fs' yielded a better fit (higher R-squared) as well as lower AIC and BIC.
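Only the preferred model appears in the next cell; a sketch of how the two reduced models could be compared side by side:
# fits both reduced models and compares fit statistics
reg_no_fa = smf.ols(data=df_active, formula='st ~ fs + cc + age').fit()
reg_no_fs = smf.ols(data=df_active, formula='st ~ fa + cc + age').fit()
comparison = pd.DataFrame(
    {'R-squared': [reg_no_fa.rsquared, reg_no_fs.rsquared],
     'AIC': [reg_no_fa.aic, reg_no_fs.aic],
     'BIC': [reg_no_fa.bic, reg_no_fs.bic]},
    index=["without 'fa'", "without 'fs'"])
print(comparison)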
In [15]:
# runs OLS regression
formula = 'st ~ fa + cc + age'
reg = smf.ols(data=df_active, formula=formula).fit()
print(reg.summary())
Without 'fs', we lost some information but not much:
All these results seem to confirm a basic intuition: the more actively a user reads (as measured by favoriting authors and stories), the more stories that user is likely to write. Having been on the site longer and being part of a community are also correlated with publishing.
To get a sense of the actual magnitude of these effects, let's attempt some plots:
In [16]:
def graph(formula, x_range):
    # plots formula evaluated over x_range, with x_range on the x-axis
    x = np.array(x_range)
    y = formula(x)
    plt.plot(x, y)
# back-transforms the log-log fit: predicted 'st' as a function of 'fa',
# first with the 'cc' term off, then on (log(x+1) matches the transformation above)
graph(lambda x: np.exp(reg.params[0] + reg.params[1]*np.log(x+1)) - 1,
      range(2, 100, 1))
graph(lambda x: np.exp(reg.params[0] + reg.params[1]*np.log(x+1)
                       + reg.params[2]) - 1,
      range(2, 100, 1))
plt.show()
In [17]:
# plots the predicted curve for several values of 'age'
ages = [0, 1, 5, 10, 15]
for age in ages:
    graph(lambda x: np.exp(reg.params[0] + reg.params[1]*np.log(x+1)
                           + reg.params[3]*age) - 1,
          range(2, 100, 1))
plt.show()