A Factorization Machine (FM) is one of the newer algorithms in the machine-learning space, and it is implemented in SAS Viya. FM is a general prediction algorithm, similar to Support Vector Machines, that can deal with very sparse data, an area where traditional machine-learning techniques struggle.
Recommendation engines are notoriously difficult to build because of data sparsity: we have many users and many rated items, but most users have rated very few of the items. We will therefore use a Factorization Machine to generate new movie recommendations for users.
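To see why ratings data is so sparse, consider a toy sketch (hypothetical users, movies, and ratings, not the data used below): one-hot encoding each (user, movie) pair produces a design matrix that is mostly zeros, and the fraction of zeros only grows with the catalog size.

```python
import numpy as np

# Hypothetical toy data: 4 users, 5 movies, only 6 observed ratings
ratings = [(0, 1, 4.0), (0, 3, 3.0), (1, 0, 5.0),
           (2, 2, 2.0), (3, 4, 4.0), (3, 1, 1.0)]
n_users, n_movies = 4, 5

# One-hot design matrix: one column per user plus one per movie
X = np.zeros((len(ratings), n_users + n_movies))
for row, (u, m, _) in enumerate(ratings):
    X[row, u] = 1.0              # user indicator
    X[row, n_users + m] = 1.0    # movie indicator

sparsity = 1.0 - X.sum() / X.size
print(f"{sparsity:.0%} of design-matrix entries are zero")
```

Even in this tiny example most entries are zero; with thousands of users and movies, traditional dense methods waste nearly all of their effort on zeros.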
This notebook has five parts.
In this section, we will load the necessary Python packages and set up a connection to our CAS server.
Our API is contained in the swat package, which converts the Python syntax below into requests the CAS server can understand and execute. Results are then brought back to the Python client.
In [29]:
#Load Packages
import swat
from swat import *
from swat.render import render_html
from matplotlib import pyplot as plt
import numpy as np
%matplotlib inline
from IPython.display import HTML
swat.options.cas.print_messages = True
In [30]:
# Connect to the session (cashost and casport must already be defined)
s = CAS(cashost, casport)
# Define directory and data file name
indata_dir="/viyafiles/ankram/Data"
indata='movie_reviews'
movie_info= 'Movies_10k_desc_final'
# Create a CAS library called DMLib pointing to the defined directory
## Note, need to specify the srctype is path, otherwise it defaults to HDFS
s.table.addCaslib(datasource={'srctype':'path'}, name='DMlib', path=indata_dir);
# Push the relevant table In-Memory if it does not already exist
## Note, this is a server side data load, not being loaded from the client
a = s.loadTable(caslib='DMlib', path=indata+'.sas7bdat', casout={'name':indata});
b = s.loadTable(caslib='DMlib', path=movie_info+'.sas7bdat', casout={'name':movie_info});
# Load necessary actionsets
actions = ['fedSQL', 'transpose','sampling','factmac','astore', 'recommend']
[s.loadactionset(i) for i in actions]
# Set variables for later use by models
target = 'rating'
class_inputs = ['usr_id', 'movie']
all_inputs = [target] + class_inputs
#Pointer Shortcut
indata_p = a.casTable
movie_info_p = b.casTable
Our goal is to recommend two new movies for each user.
In [31]:
# Overview of the movie data
print(len(movie_info_p), "Movies")
movie_info_p[movie_info_p.columns[0:7]].head()
Out[31]:
In [32]:
#Distribution of Parental Ratings
movie_info_p['parental_rating'].value_counts()
Out[32]:
In [33]:
print(len(indata_p), "Ratings")
print(len(indata_p[class_inputs[0]].value_counts()), "Users")
indata_p.head()
Out[33]:
In [34]:
freq_table = (s.fedSQL.execDirect('''
SELECT ''' + target + ''', count(*) as Frequency
FROM '''+ indata +'''
GROUP BY ''' + target + ''';
''')['Result Set'].sort_values(target)).set_index(np.arange(1,6))
plt.figure(figsize = (15, 10))
plt.bar(np.arange(1,6),freq_table['FREQUENCY'], align='center')
plt.xlabel('Actual Rating', size=20)
plt.ylabel('# of Ratings (Frequency)', size=20)
plt.title('Plot of Rating Frequency', size=25)
plt.xlim([.5,5.5]);
print(indata_p[target].mean(), "Average Review")
In [35]:
# Create a 70/30 stratified Split on Users
s.sampling.stratified(
table = dict(name = indata, groupby = class_inputs[0]),
output = dict(casOut = dict(name = indata + '_prt_' + class_inputs[0], replace = True), copyVars = 'ALL'),
samppct = 70,
partind = True,
seed = 123
)
# Create a 70/30 split for the movies
s.sampling.stratified(
table = dict(name = indata, groupby = class_inputs[1]),
output = dict(casOut = dict(name = indata + '_prt_' + class_inputs[1], replace = True), copyVars = 'ALL'),
samppct = 70,
partind = True,
seed = 123
)
# Combine the samples into one table that is stratified by both user and movie:
# a record goes to training (_PartInd_ = 1) if it was selected in either stratified sample
s.fedSQL.execDirect('''
CREATE TABLE ''' + indata +'''_prt {options replace=true} AS
SELECT
a.''' + class_inputs[0] + ''',
a.''' + class_inputs[1] + ''',
a.''' + target + ''',
CASE WHEN a._PartInd_ + b._PartInd_ > 0 THEN 1 ELSE 0 END AS _PartInd_
FROM
''' + indata + '_prt_' + class_inputs[0] + ''' a
INNER JOIN ''' + indata + '_prt_' + class_inputs[1] + ''' b
ON a.''' + class_inputs[0] + ' = b.' + class_inputs[0] + '''
AND a.''' + class_inputs[1] + ' = b.' + class_inputs[1] + ''';
''')
s.CASTable(indata + '_prt')[all_inputs].query('_PartInd_=0').head()
Out[35]:
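The CASE expression above flags a record for training if it was selected in either stratified sample. A pandas sketch of the same combination rule (hypothetical partition flags, not the actual tables):

```python
import pandas as pd

# Hypothetical partition flags from two stratified samples (1 = training)
part_user  = pd.DataFrame({'usr_id': [1, 1, 2], 'movie': [10, 11, 10],
                           '_PartInd_': [1, 0, 0]})
part_movie = pd.DataFrame({'usr_id': [1, 1, 2], 'movie': [10, 11, 10],
                           '_PartInd_': [0, 1, 0]})

merged = part_user.merge(part_movie, on=['usr_id', 'movie'],
                         suffixes=('_u', '_m'))
# Same rule as the CASE expression: training if selected in either sample
merged['_PartInd_'] = (merged['_PartInd__u'] + merged['_PartInd__m'] > 0).astype(int)
print(merged['_PartInd_'].tolist())
```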
Bias occurs because users unknowingly rate on different scales, so a four-star rating does not mean the same thing for two different users.
The Factorization Machine accounts for this bias by modeling a predicted rating as the sum of an overall average rating, a user bias, a movie bias, and a pairwise user-movie interaction term.
Factorization Machines account for these innate biases when making predictions, and can estimate the pairwise interactions between specific users and movies even in sparse data.
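A minimal numpy sketch of this decomposition (all bias and factor values below are hypothetical, with two latent factors as in the model trained later in this notebook):

```python
import numpy as np

def fm_predict(global_mean, user_bias, movie_bias, user_factors, movie_factors):
    """Factorization-machine rating prediction for one (user, movie) pair."""
    return (global_mean
            + user_bias
            + movie_bias
            + np.dot(user_factors, movie_factors))  # pairwise interaction

# Hypothetical values with two factors per user and per movie
pred = fm_predict(3.55, 0.20, -0.10,
                  np.array([0.3, -0.5]), np.array([0.4, 0.1]))
print(round(pred, 3))
```

The latent factor vectors are what let the model generalize: two users who have never rated the same movie can still be compared through their factors.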
In [36]:
indata_p[target].mean()
Out[36]:
In [37]:
# We can use SQL to find this, and further format using Python - sort_values() and head()
render_html(
s.fedSQL.execDirect('''
SELECT
''' + class_inputs[0] + ''',
COUNT(''' + target + ''') AS num_ratings,
AVG(''' + target + ''') AS avg_rating,
AVG(''' + target + ''')-3.55 AS user_bias
FROM ''' + indata + '''
GROUP BY usr_id
''')['Result Set'].sort_values(class_inputs[0]).head()
)
In [38]:
# I could use SQL to find this as well, but decided to use Python built-in functionality - groupby()
movie_bias = s.CASTable(indata).groupby(class_inputs[1])[target].summary(subset=['N','Mean']).concat_bygroups().Summary
movie_bias['Movie_Bias'] = movie_bias['Mean'] - 3.55
movie_bias.head()
Out[38]:
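The same bias arithmetic can be sketched locally with pandas (hypothetical data; note that 3.55 above is a hard-coded global mean, which could instead be computed as below):

```python
import pandas as pd

# Hypothetical ratings table
df = pd.DataFrame({'movie': [10, 10, 11, 11, 11],
                   'rating': [4, 5, 2, 3, 1]})

global_mean = df['rating'].mean()
movie_bias = df.groupby('movie')['rating'].mean() - global_mean
print(movie_bias.to_dict())
```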
In [39]:
#Join the Data together
s.fedSQL.execDirect('''
CREATE TABLE '''+ indata +'''_model {options replace=True} AS
SELECT
t1.*,
t2.year,
t2.parental_rating
FROM
''' + indata + '''_prt t1
LEFT JOIN ''' + movie_info +''' t2
ON t1.movie = t2.movieId
''')
s.dataStep.runCode('''
data '''+ indata +'''_model2 (replace=YES);
set '''+ indata +'''_model;
if parental_rating = ""
then parental_rating="None";
if year ne .;
run;
''')
s.CASTable('movie_reviews_model2').head()
Out[39]:
In [40]:
# Build the factorization machine
class_inputs = ['usr_id', 'movie','parental_rating', 'year']
r = s.factmac.factmac(
table = dict(name = indata + '_model2', where = '_PartInd_ = 1'),
inputs = class_inputs,
nominals = class_inputs,
target = target,
maxIter = 5,
nFactors = 2,
learnStep = 0.1,
seed = 12345,
savestate = dict(name = 'fm_model', replace = True)
)
r['FinalLoss']
Out[40]:
We calculate these fit statistics on the holdout sample to get an unbiased estimate of model performance on new data. We want to ensure that our engine makes robust predictions on new data and does not overfit.
Note: an RMSE of 1 means that on average we miss the actual rating by one star.
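As a pure-numpy illustration of the metric (hypothetical predictions and actuals, not the scored table):

```python
import numpy as np

actual    = np.array([4.0, 3.0, 5.0, 2.0])
predicted = np.array([3.5, 3.5, 4.0, 2.5])

error = predicted - actual           # signed error, as in the eval table
mse  = np.mean(error ** 2)           # Mean Squared Error
rmse = np.sqrt(mse)                  # Root Mean Squared Error, in "stars"
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}")
```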
In [41]:
# Score the factorization machine
s.CASTable(indata + '_model2').astore.score(
rstore = dict(name = 'fm_model'),
out = dict(name = indata + '_scored', replace = True),
copyVars = all_inputs + ['_PartInd_']
)
# Find the (predicted - actual) error rate on the validation set
s.fedSQL.execDirect('''
CREATE TABLE eval {options replace=true} AS
SELECT
a.*,
a.P_''' + target + ' - a.' + target + ''' AS error
FROM
''' + indata + '''_scored a
WHERE a._PartInd_= 0
''')
# Compute the Mean Squared Error and Root Mean Squared Error
s.fedSQL.execDirect('''
SELECT
AVG(error**2) AS MSE,
SQRT(AVG(error**2)) AS RMSE
FROM eval
''')
Out[41]:
In [42]:
rating = (s.fedSQL.execDirect('''
SELECT ''' + target + ''',
count(*) AS frequency,
AVG(P_''' + target + ''') AS avg_prediction
FROM eval
GROUP BY ''' + target + ''';
''')['Result Set'].sort_values(target)).set_index(np.arange(1,6))
rating
Out[42]:
In [43]:
plt.figure(figsize = (12, 8))
plt.bar(np.arange(1,6),rating['rating'], color='#eeefff', align='center')
plt.plot(rating['rating'], rating['AVG_PREDICTION'], linewidth=3, label='Average Prediction')
plt.xlabel('Actual Rating', size=20)
plt.ylabel('Average Predicted Rating', size=20)
plt.title('Plot of Average Predicted vs Actual Ratings', size=25)
plt.ylim([0,6])
plt.xlim([.5,5.5]);
In [44]:
# Transpose the data using the completely redesigned transpose CAS action - this is running multi-threaded
test=s.transpose(
table = dict(name = indata, groupBy = class_inputs[0], vars = target),
id = class_inputs[1],
casOut = dict(name = indata + '_transposed', replace = True)
)
s.CASTable(indata + '_transposed').head()
# Find the movies the users have not watched and predict their potential rating
s.transpose(
table = dict(name = indata + '_transposed', groupBy = class_inputs[0]),
casOut = dict(name = indata + '_long', replace = True)
)
s.dataStep.runcode('''
data ''' + indata + '''_long;
set ''' + indata + '''_long;
''' + class_inputs[1] + ''' = 1.0*_NAME_;
drop _NAME_;
run;
''')
s.fedSQL.execDirect('''
CREATE TABLE scoring_table{options replace=TRUE} AS
SELECT
a.*,
b.title,
b.year,
b.Parental_Rating,
b.genres
FROM
'''+ indata +'''_long a
INNER JOIN '''+ movie_info +''' b
ON
a.'''+ class_inputs[1] +''' = b.movieId
''')
Out[44]:
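The double transpose above effectively enumerates every (user, movie) combination, leaving the rating missing for movies a user has not watched; those missing rows become the scoring set. A pandas sketch of the same reshaping idea (hypothetical data):

```python
import pandas as pd

ratings = pd.DataFrame({'usr_id': [1, 1, 2],
                        'movie':  [10, 11, 10],
                        'rating': [4.0, 3.0, 5.0]})

# Wide: one row per user, one column per movie (NaN = not watched)
wide = ratings.pivot(index='usr_id', columns='movie', values='rating')

# Long again: every user-movie combination, missing ratings included
long = wide.reset_index().melt(id_vars='usr_id', var_name='movie',
                               value_name='rating')
unwatched = long[long['rating'].isna()]
print(unwatched[['usr_id', 'movie']].to_records(index=False).tolist())
```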
In [47]:
#Make Recommendations
astore = s.CASTable('scoring_table')[all_inputs].query(target + ' is null').astore.score(
rstore = dict(name = 'fm_model'),
out = dict(name = indata + '_scored_new', replace = True),
copyVars = class_inputs + ['title']
)
#See top recommendations per user
s.CASTable('movie_reviews_scored_new') \
.groupby(class_inputs[0]) \
.sort_values([class_inputs[0], 'P_' + target], ascending = [True, False]) \
.query("parental_rating ^= 'NA'") \
.head(2) \
.head(14)
Out[47]:
In [49]:
#Close the connection
s.close()