This is part of a content-based recommender system. I trained a few models on articles from the Autism Parents Magazine website and used them to predict how similar pairs of articles are. The models are trained elsewhere; here I only want to show the idea of how to validate a recommender model, without going into the details of the models themselves.
Our models find similarities among articles, but how can we test that the similarities make sense?
In the "AutismParentMagazine" website each article is assigned to a category, and we can use this classification to validate the model.
Let's define a function that tells whether two articles ($i$ and $j$) belong to the same category:
\begin{eqnarray}
f(i,j) &=& 1 \quad \text{if } i \text{ and } j \text{ are in the same category,} \nonumber\\
&=& 0 \quad \text{otherwise.} \nonumber
\end{eqnarray}
If the model finds that two articles $i$ and $j$ are similar and $f(i,j)=1$ (they belong to the same category), then they are a good match; conversely, if $f(i,j)=0$ they may not be a good match.
We can then evaluate the model, over all $N$ pairs of articles $i$, $j$ that it finds similar, as:
\begin{eqnarray}
\text{score} = \frac{1}{N} \sum_{i,j} f(i,j) .
\end{eqnarray}
To illustrate: if every pair of articles the model finds similar also belongs to the same category, then the score is 1.
More generally, the closer the score is to 1, the better the model.
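As a quick sanity check, here is a toy example of how $f$ and the score behave. The categories and similar pairs below are hypothetical, just to illustrate the arithmetic; they are not taken from the dataset.

# Toy example (hypothetical data): three articles and their categories
toy_categories = {0: "health", 1: "health", 2: "therapy"}

def f(i, j):
    """Return 1 if articles i and j share a category, 0 otherwise."""
    return 1 if toy_categories[i] == toy_categories[j] else 0

# Suppose the model flags these pairs as similar
similar_pairs = [(0, 1), (0, 2), (1, 2)]

score = sum(f(i, j) for i, j in similar_pairs) / len(similar_pairs)
print(score)  # 0.33...: only the pair (0, 1) shares a category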
In [1]:
import pandas as pd
import os
In [2]:
from collections import defaultdict
from gensim import corpora, models, similarities
def get_model_score(ids, matsim, categories):
    """Evaluate the score for a given model, following the equation defined above."""
    num_of_predictions = 3
    model_score = 0
    for id, doc in zip(ids, matsim.index):
        # Top matches for this article: a list of (other_id, score) tuples
        sims = matsim[doc]
        for other_id, score in sims:
            #print("ID {} OTHER_ID {} SCORE {}".format(id, other_id, score))
            category1 = categories[id]
            category2 = categories[other_id]
            # Skip the article itself; count pairs that share a category
            if id != other_id:
                if category1 == category2:
                    model_score += 1
    # Normalize by the total number of similar pairs considered
    N = len(ids) * num_of_predictions
    model_score = model_score / N
    return model_score
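A note on the matsim object: the function assumes a gensim similarity index built with num_best set, so that matsim[doc] returns a short list of (other_id, score) tuples rather than a full similarity row. The sketch below shows, under that assumption, how such an index could be built; it is not the actual training code (which lives in a separate notebook), and the toy token lists are hypothetical.

# Minimal sketch of building a similarity index like the ones loaded below
from gensim import corpora, models, similarities

# Hypothetical token lists; the real ones come from the articles in the CSV
texts = [["autism", "therapy", "speech"],
         ["therapy", "speech", "child"],
         ["school", "education", "teacher"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=10)

# num_best=4 returns the top 4 matches per query: the article itself plus
# num_of_predictions=3 others, matching the loop in get_model_score
matsim = similarities.MatrixSimilarity(lsi[corpus], num_best=4,
                                       num_features=lsi.num_topics)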
In [3]:
# Read datasets from file:
os.chdir('../data/')
# Read dataframe
input_fname="AutismParentMagazine-posts-tokens.csv"
# Get categories and ids from dataset
df = pd.read_csv(input_fname,index_col=0)
df.head(2)
categories=df['category']
ids=df.index
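Before scoring, it can help to glance at how the articles are spread across categories, since a model can look artificially good when one category dominates. A quick check, assuming the 'category' column read above:

# Distribution of articles per category (sanity check on the validation labels)
print(categories.value_counts())
print("Number of articles: {}".format(len(ids)))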
In [4]:
# Read models and evaluate the score
import pickle
# Read the LSI similarity index and evaluate its score
with open("lsi-matsim.save", "rb") as f:
    matsim = pickle.load(f)
model_score = get_model_score(ids, matsim, categories)
print("LSI model score {}".format(model_score))
# Read the LDA similarity index and evaluate its score
with open("lda-matsim.save", "rb") as f:
    matsim = pickle.load(f)
model_score = get_model_score(ids, matsim, categories)
print("LDA model score {}".format(model_score))
Take home message:
The LSI model seems to perform better in this case.
A score of 0.68 means that roughly 70% of the article pairs flagged as similar also belong to the same category.
On the other hand, the LDA model has a much lower score, 0.24, and hence is not performing well. Its performance could probably be improved by tweaking the model parameters (number of topics, etc.).
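As a sketch of what tweaking the parameters could look like, one could sweep over the number of topics and re-run the same validation. This is not the actual training code; the dictionary and corpus variables are assumptions carried over from the training step (as in the sketch after the validation function above).

# Hypothetical sketch: retrain LDA with different numbers of topics and
# re-score each model with the validation function defined above
for num_topics in (5, 10, 20, 40):
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    matsim = similarities.MatrixSimilarity(lda[corpus], num_best=4,
                                           num_features=num_topics)
    print("LDA, {} topics: score {:.2f}".format(
        num_topics, get_model_score(ids, matsim, categories)))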