This example continues the illustration of using pandas-munging capabilities within estimators to build features that draw from several rows, this time using NMF (nonnegative matrix factorization). We will use a single table from the MovieLens dataset (F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS)).
In this example too, we'll use only the dataset table describing the ratings themselves, i.e., the table in which each row records a single rating given by a specific user to a specific movie.
In [1]:
import os
from sklearn import base
import pandas as pd
import scipy as sp
import seaborn as sns
sns.set_style('whitegrid')
sns.despine()
import ibex
from ibex.sklearn import model_selection as pd_model_selection
from ibex.sklearn import decomposition as pd_decomposition
from ibex.sklearn import ensemble as pd_ensemble
%pylab inline
In [2]:
ratings = pd.read_csv(
    'movielens_data/ml-100k/u.data',
    sep='\t',
    header=None,
    names=['user_id', 'item_id', 'rating', 'timestamp'])
features = ['user_id', 'item_id']
ratings[features + ['rating']].head()
Out[2]:
In Simple Row-Aggregating Features In The Movielens Dataset we looked at direct attributes obtainable from the ratings: the average user and item rating. Here we'll use pandas to bring the dataset into a form in which we can find latent factors through NMF.
First we pivot the table so that we have a UI matrix of the users as rows, the items as columns, and the ratings as the values:
In [4]:
UI = pd.pivot_table(ratings, values='rating', index='user_id', columns='item_id')
UI
Out[4]:
We now use NMF for the decomposition, obtaining the user latent factors in U and the item latent factors in I:
In [5]:
d = pd_decomposition.NMF(n_components=20)
U = d.fit_transform(UI.fillna(0))
I = d.components_
Note that the Ibex version of NMF sets the index and columns of U and I appropriately.
In [6]:
U.head()
Out[6]:
In [7]:
I.head()
Out[7]:
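To make the note above concrete, here is a quick sanity check (a sketch, not part of the original notebook; it assumes the U, I, and UI frames produced above): U should have one row per user, I one column per item, and their matrix product should be a low-rank approximation of the zero-filled ratings matrix.
# Sketch: verify the factor frames are aligned with the pivoted matrix.
assert (U.index == UI.index).all()        # one row of U per user
assert (I.columns == UI.columns).all()    # one column of I per item
# The product of the factors is a low-rank approximation of the
# zero-filled user-item matrix.
approx = pd.DataFrame(U.values.dot(I.values), index=U.index, columns=I.columns)
# Mean squared reconstruction error of the NMF approximation.
print(((UI.fillna(0) - approx) ** 2).values.mean())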
Pandas makes it easy to merge the user and item latent factors onto the ratings, by user and by item, respectively.
In [8]:
ratings.head()
Out[8]:
In [10]:
rating_comps = pd.merge(
    ratings,
    U,
    left_on='user_id',
    right_index=True,
    how='left')
rating_comps = pd.merge(
    rating_comps,
    I.T,
    left_on='item_id',
    right_index=True,
    how='left')
rating_comps.head()
Out[10]:
Let's also merge into the results the number of occurrences of the users and items, respectively.
In [12]:
rating_comps = pd.merge(
    rating_comps,
    ratings.groupby(ratings.user_id).size().to_frame().rename(columns={0: 'user_id_count'}),
    left_on='user_id',
    right_index=True,
    how='left')
rating_comps = pd.merge(
    rating_comps,
    ratings.groupby(ratings.item_id).size().to_frame().rename(columns={0: 'item_id_count'}),
    left_on='item_id',
    right_index=True,
    how='left')
prd_features = [c for c in rating_comps if 'comp_' in c] + ['user_id_count', 'item_id_count']
rating_comps.head()
Out[12]:
We now have a dataframe of latent-factor and count features. Let's build a random forest regressor and fit it on this dataframe.
In [13]:
prd = pd_ensemble.RandomForestRegressor().fit(rating_comps[prd_features], ratings.rating)
prd.score(rating_comps[prd_features], ratings.rating)
Out[13]:
Finally, let's check the feature importances.
In [73]:
prd.feature_importances_.to_frame().plot(kind='barh');
We'll now build a Scikit-Learn / Pandas step that performs all of the above.
In [83]:
class RatingsFactorizer(base.BaseEstimator, base.TransformerMixin, ibex.FrameMixin):
    def fit(self, X, y):
        # Combine the id columns with the rating target into a single frame.
        X = pd.concat([X[['user_id', 'item_id']], y], axis=1)
        X.columns = ['user_id', 'item_id', 'rating']
        self._user_id_count = X.groupby(X.user_id).size().to_frame().rename(columns={0: 'user_id_count'})
        self._item_id_count = X.groupby(X.item_id).size().to_frame().rename(columns={0: 'item_id_count'})
        # Pivot the training data into a user-item ratings matrix.
        UI = pd.pivot_table(X, values='rating', index='user_id', columns='item_id')
        d = pd_decomposition.NMF(n_components=10)
        self._U = d.fit_transform(UI.fillna(0))
        self._I = d.components_
        return self

    def transform(self, X):
        rating_comps = pd.merge(
            X[['user_id', 'item_id']],
            self._U,
            left_on='user_id',
            right_index=True,
            how='left')
        rating_comps = pd.merge(
            rating_comps,
            self._I.T,
            left_on='item_id',
            right_index=True,
            how='left')
        rating_comps = pd.merge(
            rating_comps,
            self._user_id_count,
            left_on='user_id',
            right_index=True,
            how='left')
        rating_comps = pd.merge(
            rating_comps,
            self._item_id_count,
            left_on='item_id',
            right_index=True,
            how='left')
        prd_features = [c for c in rating_comps if 'comp_' in c] + ['user_id_count', 'item_id_count']
        # Users/items unseen during fit have no factors or counts; impute 0.
        return rating_comps[prd_features].fillna(0)
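As a quick smoke test (a sketch, not in the original notebook; the exact comp_ column names depend on how Ibex labels the components), the step can also be fit and applied on its own:
# Fit on the id columns with the ratings as the target, then transform
# the same ids into latent-factor and count features.
step = RatingsFactorizer().fit(ratings[features], ratings.rating)
step.transform(ratings[features]).head()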
We can now use cross validation to assess this scheme.
In [84]:
prd = RatingsFactorizer() | pd_ensemble.RandomForestRegressor()
hist(
    pd_model_selection.cross_val_score(
        prd,
        ratings[features],
        ratings.rating,
        cv=20,
        n_jobs=-1),
    color='grey');
xlabel('CV Score')
ylabel('Num Occurrences')
figtext(
    0,
    -0.1,
    'Histogram of cross-validated scores');
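Beyond the histogram, the scores can also be summarized numerically (a sketch, not in the original notebook; it re-runs the same cross validation as above):
scores = pd_model_selection.cross_val_score(
    prd,
    ratings[features],
    ratings.rating,
    cv=20,
    n_jobs=-1)
# The mean and standard deviation give a compact summary of the CV scores.
print(scores.mean(), scores.std())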