In this example, we'll recommend questions on stats.stackexchange.com to users who might want to answer them.
The full CrossValidated dataset is available at https://archive.org/details/stackexchange. Helper functions to obtain and process it are defined in data.py, and we are going to use them here.
In [1]:
import data
(interactions, question_features,
 user_features, question_vectorizer,
 user_vectorizer) = data.read_data()  # This will download the data if not present
interactions is a matrix with entries equal to 1 if the i-th user posted an answer to the j-th question; the goal is to recommend questions to users who might answer them.
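For illustration, here is a sketch of how such an interaction matrix could be assembled from (user, question) answer pairs with scipy; the real construction lives in data.py, and the IDs below are made up:
import numpy as np
import scipy.sparse as sp

# Hypothetical pairs: user 0 answered questions 5 and 7, user 2 answered question 5.
user_ids = np.array([0, 0, 2])
question_ids = np.array([5, 7, 5])
values = np.ones_like(user_ids)

# Sparse users-by-questions matrix with 1s at answered positions.
demo_interactions = sp.coo_matrix((values, (user_ids, question_ids)), shape=(3, 10))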
question_features is a sparse matrix containing question metadata in the form of tags. question_vectorizer is a sklearn.feature_extraction.DictVectorizer instance that translates the tags into vector form.
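As a quick illustration of what a DictVectorizer does (toy tags, not the real data):
from sklearn.feature_extraction import DictVectorizer

# Each question's tags become a {tag: 1} dict; the vectorizer maps tags to columns.
tag_dicts = [{'bayesian': 1, 'prior': 1}, {'regression': 1}]
demo_vectorizer = DictVectorizer()
demo_matrix = demo_vectorizer.fit_transform(tag_dicts)  # 2 x 3 sparse matrix
print(demo_vectorizer.feature_names_)  # ['bayesian', 'prior', 'regression']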
user_features and user_vectorizer pertain to user features. In this case, we take users' 'About' sections: short snippets of natural language that (often) describe a given user's interests.
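The exact preprocessing lives in data.py; a minimal sketch of how raw 'About' text might be turned into a bag-of-words feature matrix (toy snippets, purely hypothetical):
from sklearn.feature_extraction.text import CountVectorizer

# Toy 'About' snippets standing in for the real user profiles.
abouts = ['Statistician interested in Bayesian methods',
          'Machine learning engineer who answers regression questions']
demo_user_vectorizer = CountVectorizer()
demo_user_features = demo_user_vectorizer.fit_transform(abouts)  # sparse users x vocabulary matrix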
Printing the matrices shows that we have around 3,200 users and 40,000 questions. Questions are described by one or more tags drawn from a set of about 1,200.
In [2]:
print(repr(interactions))
print(repr(question_features))
The tags matrix contains rows such as:
In [3]:
print(question_vectorizer.inverse_transform(question_features[:3]))
User features are exactly what we would expect from processing raw text:
In [4]:
print(user_vectorizer.inverse_transform(user_features[2]))
In [5]:
import model
import inspect
print(inspect.getsource(model.train_test_split))
train, test = model.train_test_split(interactions)
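If you'd rather not use the helper, LightFM ships a comparable utility (its behaviour may differ from model.train_test_split in details such as the split ratio, which is assumed here):
from lightfm.cross_validation import random_train_test_split

# Randomly assign 20% of the interactions to the test set.
train_alt, test_alt = random_train_test_split(interactions, test_percentage=0.2)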
In [6]:
print(inspect.getsource(model.fit_lightfm_model))
In [7]:
mf_model = model.fit_lightfm_model(train, epochs=1)
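For readers without the helper, the core of such a fit is just a few lines of the lightfm API (a sketch; the loss and dimensionality below are assumptions, and the printed source may use different hyperparameters):
from lightfm import LightFM

# A plain matrix factorization model trained on the interactions alone.
mf_sketch = LightFM(loss='warp', no_components=30)
mf_sketch.fit(train, epochs=1)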
The following function will compute the AUC score on the test set:
In [8]:
print(inspect.getsource(model.auc_lightfm))
mf_score = model.auc_lightfm(mf_model, test)
print(mf_score)
Oops. That's worse than random (possibly due to overfitting).
In this case, this is because the CrossValidated dataset is very sparse: there just aren't enough interactions to support a traditional collaborative filtering model. In general, we'd also like to recommend questions that have no answers yet, making the collaborative model doubly ineffective.
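(As a sanity check, the same sort of per-user ROC AUC can be computed with LightFM's bundled evaluation routine; a minimal sketch, assuming the mf_model and test objects from above:)
from lightfm.evaluation import auc_score

# Mean per-user ROC AUC on the held-out interactions; around 0.5 is random.
print(auc_score(mf_model, test).mean())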
To remedy this, we can try using a content-based model. The following code uses question tags to estimate a logistic regression model for each user, predicting the probability that a user would want to answer a given question.
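The helper's actual source is printed next; conceptually, it fits one classifier per user, roughly along these lines (a hypothetical sketch, assuming train is a scipy CSR matrix):
from sklearn.linear_model import LogisticRegression

# One model per user: predict from question tags whether this user answers a question.
def fit_one_user_model(train, question_features, user_idx):
    answered = train[user_idx].toarray().ravel()  # 1 where this user answered
    return LogisticRegression().fit(question_features, answered)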
In [9]:
print(inspect.getsource(model.fit_content_models))
Running this and evaluating the AUC score gives
In [10]:
content_models = model.fit_content_models(train, question_features)
In [11]:
content_score = model.auc_content_models(content_models, test, question_features)
print(content_score)
That's a bit better, but not great. In addition, a linear model of this form fails to capture tag similarity. For example, probit and logistic regression are closely related, yet the model cannot automatically transfer what it learns about one to the other.
What happens if we estimate the LightFM model with question features?
In [12]:
lightfm_model = model.fit_lightfm_model(train, post_features=question_features)
lightfm_score = model.auc_lightfm(lightfm_model, test, post_features=question_features)
print(lightfm_score)
We can add user features on top for a small additional improvement:
In [13]:
lightfm_model = model.fit_lightfm_model(train, post_features=question_features,
                                        user_features=user_features)
lightfm_score = model.auc_lightfm(lightfm_model, test, post_features=question_features,
                                  user_features=user_features)
print(lightfm_score)
This is quite a bit better, illustrating the fact that an embedding-based model can capture more interesting relationships between content features.
One additional advantage of metadata-based latent models is that they give us useful latent representations of the metadata features themselves, much in the way word embedding approaches like word2vec do for words.
The code below takes an input CrossValidated tag and finds tags that are close to it (in the cosine similarity sense) in the latent embedding space:
In [14]:
print(inspect.getsource(model.similar_tags))
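If you want the idea without the helper, cosine similarity over the model's item feature embeddings is all it takes (a sketch, not the printed implementation; it assumes the item features are exactly the tags):
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def similar_tags_sketch(model, vectorizer, tag, k=5):
    # Rows of item_embeddings correspond to item features, i.e. tags here.
    embeddings = model.item_embeddings
    query = embeddings[vectorizer.vocabulary_[tag]].reshape(1, -1)
    sims = cosine_similarity(embeddings, query).ravel()
    ranked = np.argsort(-sims)
    names = vectorizer.feature_names_
    return [names[i] for i in ranked if names[i] != tag][:k]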
Let's demonstrate this.
In [15]:
for tag in ['bayesian', 'regression', 'survival', 'p-value']:
    print('Tags similar to %s:' % tag)
    print(model.similar_tags(lightfm_model, question_vectorizer, tag)[:5])