Recommending questions on CrossValidated

In this example, we'll try to recommend questions to be answered to users of stats.stackexchange.com.

Loading the data

The full CrossValidated dataset is available at https://archive.org/details/stackexchange. Helper functions to obtain and process it are defined in data.py, and we are going to use them here.


In [1]:
import data

(interactions, question_features,
 user_features, question_vectorizer,
 user_vectorizer) = data.read_data() # This will download the data if not present

interactions is a matrix whose (i, j)-th entry is 1 if the i-th user posted an answer to the j-th question; the goal is to recommend questions to users who might answer them.

question_features is a sparse matrix containing question metadata in the form of tags. question_vectorizer is a sklearn.feature_extraction.DictVectorizer instance that translates the tags into vector form.

user_features and user_vectorizer pertain to user features. In this case, we take users' 'About' sections: short snippets of natural language that (often) describe a given user's interests.
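
As a reminder of what a DictVectorizer does, here is a tiny self-contained example (with made-up tags, not the fitted vectorizers above):

from sklearn.feature_extraction import DictVectorizer

toy_vectorizer = DictVectorizer()
# Each question is a dict of tag -> count; the vectorizer assigns one column per tag.
tag_matrix = toy_vectorizer.fit_transform([{'bayesian': 1, 'prior': 1},
                                           {'regression': 1}])
print(toy_vectorizer.get_feature_names())  # ['bayesian', 'prior', 'regression']
print(tag_matrix.toarray())                # [[1. 1. 0.]
                                           #  [0. 0. 1.]]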

Printing the matrices shows that we have around 3,200 users and 40,000 questions. Questions are described by one or more tags from a set of about 1,200.


In [2]:
print(repr(interactions))
print(repr(question_features))


<3212x40371 sparse matrix of type '<type 'numpy.int32'>'
	with 57891 stored elements in Compressed Sparse Row format>
<40371x1189 sparse matrix of type '<type 'numpy.int32'>'
	with 151020 stored elements in Compressed Sparse Row format>

The tags matrix contains rows such as


In [3]:
print(question_vectorizer.inverse_transform(question_features[:3]))


[{'elicitation': 1, 'prior': 1, 'intercept': 1, 'bayesian': 1}, {'intercept': 1, 'distributions': 1, 'normality': 1}, {'intercept': 1, 'open-source': 1, 'software': 1}]

User features are exactly what we would expect from processing raw text:


In [4]:
print(user_vectorizer.inverse_transform(user_features[2]))


[{'website': 1, 'blog': 1, 'and': 1, 'user_id:3': 1, 'genetics': 1, 'hunting': 1, 'job': 1, 'university': 1, 'vanderbilt': 1, 'find': 1, 'will': 1, 'near': 1, 'finishing': 1, 'human': 1, 'year': 1, 'end': 1, 'the': 1, 'twitter': 1}]

Fitting models

Train/test split

We can split the dataset into train and test sets by using utility functions defined in model.py.


In [5]:
import model
import inspect
print(inspect.getsource(model.train_test_split))
train, test = model.train_test_split(interactions)


def train_test_split(interactions):

    train = interactions.copy()
    test = interactions.copy()

    for i in range(len(train.data)):
        if random.random() < 0.2:
            train.data[i] = 0
        else:
            test.data[i] = 0

    train.eliminate_zeros()
    test.eliminate_zeros()

    return train, test
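
As a quick sanity check (not part of model.py), the split partitions the interactions, holding out roughly 20% of them for testing:

# Each interaction survives in exactly one of the two matrices.
assert train.nnz + test.nnz == interactions.nnz
print(float(test.nnz) / interactions.nnz)  # approximately 0.2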

Traditional MF model

Let's start with a traditional collaborative filtering model that does not use any metadata. We can do this using lightfm -- we simply do not pass in any metadata matrices. We'll use the following function to train a WARP model.


In [6]:
print(inspect.getsource(model.fit_lightfm_model))


def fit_lightfm_model(interactions, post_features=None, user_features=None, epochs=30):

    model = lightfm.LightFM(loss='warp',
                            learning_rate=0.01,
                            learning_schedule='adagrad',
                            user_alpha=0.0001,
                            item_alpha=0.0001,
                            no_components=30)

    model.fit(interactions,
              item_features=post_features,
              user_features=user_features,
              num_threads=4,
              epochs=epochs)

    return model


In [9]:
mf_model = model.fit_lightfm_model(train, epochs=1)

The following function will compute the AUC score on the test set:


In [11]:
print(inspect.getsource(model.auc_lightfm))
mf_score = model.auc_lightfm(mf_model, test)
print(mf_score)


def auc_lightfm(model, interactions, post_features=None, user_features=None):

    no_users, no_items = interactions.shape

    pid_array = np.arange(no_items, dtype=np.int32)

    scores = []

    for i in range(interactions.shape[0]):
        uid_array = np.empty(no_items, dtype=np.int32)
        uid_array.fill(i)
        predictions = model.predict(uid_array,
                                    pid_array,
                                    item_features=post_features,
                                    user_features=user_features,
                                    num_threads=4)
        y = np.squeeze(np.array(interactions[i].todense()))

        try:
            scores.append(roc_auc_score(y, predictions))
        except ValueError:
            # Just one class
            pass

    return sum(scores) / len(scores)

0.421300275439

Oops. That's worse than random (quite possibly due to overfitting).

This is because the CrossValidated dataset is very sparse: there just aren't enough interactions to support a traditional collaborative filtering model. In addition, we'd like to be able to recommend questions that have no answers yet, something a pure collaborative model cannot do at all, making it doubly ineffective here.
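
To put a number on that sparsity, we can compute the density of the interaction matrix directly (a one-off check, not part of model.py):

# Fraction of (user, question) pairs that have an answer.
density = float(interactions.nnz) / (interactions.shape[0] * interactions.shape[1])
print(density)  # roughly 0.00045: on average fewer than 20 answers per user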

Content-based model

To remedy this, we can try using a content-based model. The following code uses question tags to estimate a logistic regression model for each user, predicting the probability that a user would want to answer a given question.


In [7]:
print(inspect.getsource(model.fit_content_models))


def fit_content_models(interactions, post_features):

    models = []

    for user_row in interactions:
        y = np.squeeze(np.array(user_row.todense()))

        model = LogisticRegression(C=0.4)
        try:
            model.fit(post_features, y)
        except ValueError:
            # Just one class
            pass

        models.append(model)

    return models
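
The companion scoring function, auc_content_models, is called below but not printed here. A minimal sketch of what such a per-user evaluation might look like (assuming each fitted model exposes decision_function, and skipping users whose model could not be fitted or whose row contains only one class):

import numpy as np
from sklearn.metrics import roc_auc_score

def auc_content_models_sketch(models, interactions, post_features):

    scores = []

    for user_model, user_row in zip(models, interactions):
        y = np.squeeze(np.array(user_row.todense()))

        try:
            # Rank all questions for this user with their own logistic regression model.
            predictions = user_model.decision_function(post_features)
            scores.append(roc_auc_score(y, predictions))
        except (ValueError, AttributeError):
            # Unfitted model or only one class present: skip this user.
            continue

    return sum(scores) / len(scores)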

Running this and evaluating the AUC score gives


In [8]:
content_models = model.fit_content_models(train, question_features)

In [9]:
content_score = model.auc_content_models(content_models, test, question_features)
print(content_score)


0.662672376323

That's a bit better, but not great. In addition, a linear model of this form fails to capture tag similarity. For example, probit and logistic regression are closely related, yet the model will not automatically infer knowledge of one from knowledge of the other.
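
To make this concrete: under the one-hot tag encoding, two distinct tags occupy orthogonal dimensions, so what the model learns about one tag transfers nothing to the other (a small illustration using the fitted question_vectorizer; it assumes 'probit' is in the tag vocabulary):

# Indicator vectors for two closely related tags.
probit = question_vectorizer.transform([{'probit': 1}])
logistic = question_vectorizer.transform([{'logistic': 1}])

# The vectors are orthogonal: their dot product is zero.
print(probit.multiply(logistic).sum())  # 0.0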

Hybrid LightFM model

What happens if we estimate the LightFM model with question features?


In [10]:
lightfm_model = model.fit_lightfm_model(train, post_features=question_features)
lightfm_score = model.auc_lightfm(lightfm_model, test, post_features=question_features)
print(lightfm_score)


0.709716075859

We can add user features on top for a small additional improvement:


In [11]:
lightfm_model = model.fit_lightfm_model(train, post_features=question_features,
                                         user_features=user_features)
lightfm_score = model.auc_lightfm(lightfm_model, test, post_features=question_features,
                                  user_features=user_features)
print(lightfm_score)


0.712266019394

This is quite a bit better, illustrating the fact that an embedding-based model can capture more interesting relationships between content features.

Feature embeddings

One additional advantage of metadata-based latent models is that they give us useful latent representations of the metadata features themselves, much as word embedding approaches like word2vec do for words.

The code below takes an input CrossValidated tag and finds tags that are close to it (in the cosine similarity sense) in the latent embedding space:


In [28]:
print(inspect.getsource(model.similar_tags))


def similar_tags(model, vectorizer, tag, number=10):

    tag_idx = vectorizer.vocabulary_[tag]

    tag_embedding = model.item_embeddings[tag_idx]

    sim = (np.dot(model.item_embeddings, tag_embedding)
           / np.linalg.norm(model.item_embeddings, axis=1))

    return np.array(vectorizer.get_feature_names())[np.argsort(-sim)[1:1 + number]].tolist()

Let's demonstrate this.


In [27]:
for tag in ['bayesian', 'regression', 'survival', 'p-value']:
    print('Tags similar to %s:' % tag)
    print(model.similar_tags(lightfm_model, question_vectorizer, tag)[:5])


Tags similar to bayesian:
['mcmc', 'prior', 'markov-chain', 'jeffreys-prior', 'subject-specific']
Tags similar to regression:
['multiple-regression', 'panel-data', 'geomarketing', 'logistic', 'multicollinearity']
Tags similar to survival:
['cox-model', 'disease', 'epidemiology', 'matching', 'disaggregation']
Tags similar to p-value:
['t-test', 'scatterplot', 'proportion', 'statistical-significance', 'hypothesis-testing']