This notebook demonstrates many of the actions in the Recommender System (recommend) action set for SAS Cloud Analytic Services (CAS).
This example uses explicit ratings from the Book-Crossing data set[1](#data_attribution). The data preparation excludes implicit ratings (recorded as a rating of 0) and ratings whose ISBN does not appear in the books data set.
You must have access to a SAS Viya 3.3 release of CAS. To connect to CAS from Python, you must install the SAS Scripting Wrapper for Analytics Transfer (SWAT).
Copyright SAS Institute, Inc.
Initial setup

- Import packages, including the SAS Scripting Wrapper for Analytics Transfer (SWAT) and open source libraries
- Connect to CAS and start a session
- Import the books file
- Import the ratings file
- Upload the data frames to the server

Simple exploration

- Calculate the sparsity
- View the ratings distribution

Build the recommender system

- Partition the ratings table
- Calculate average ratings by item and user
- Explore the item ratings and user ratings

Build a matrix factorization model

- Sample the data for a hold-out group
- Build the model using ALS
- Make recommendations for one user

Build a KNN model

- Calculate similarity between users
- View rating history for similar users
- Recommend 10 books for one user

Combine search with recommendations

- Build a search index
- Create a simple filter table
- Make recommendations from the filter table
In [1]:
import html
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import swat
from IPython.core.display import display, HTML, Markdown
from io import StringIO
%matplotlib inline
In [2]:
s = swat.CAS("host.example.com", 5570)
s.loadactionset("dataPreprocess")
s.loadactionset("fedSql")
s.loadactionset("recommend")

data_dir = '/path/to/data/'
ratings_file = data_dir + 'BX-Book-Ratings.csv'
books_file = data_dir + 'BX-Books.csv'
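If you want to confirm which action sets are loaded in your session, a quick check uses the builtins action set, which is available in every CAS session:

s.builtins.actionSetInfo()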
In [3]:
# The file uses ISO-8859-1 encoding and contains HTML entities; unescape them first.
with open(books_file, 'r', encoding='iso-8859-1') as f:
    content = html.unescape(f.read())

books_df = pd.read_csv(StringIO(content),
                       header=0,
                       error_bad_lines=False,  # on newer pandas, use on_bad_lines='skip'
                       sep=';',
                       names=['ISBN', 'Title', 'Author', 'Year_Of_Publication', 'Publisher'],
                       usecols=['ISBN', 'Title', 'Author', 'Year_Of_Publication', 'Publisher'],
                       dtype={'Year_Of_Publication': str})

# Keep only rows with a well-formed 10-character ISBN.
books_df = books_df[books_df.ISBN.str.len() == 10]
In [4]:
books_df.head()
Out[4]:
In [5]:
# Keep only lines with a numeric user ID, a 10-character ISBN, and an
# explicit rating of 1-10. Implicit ratings (0) do not match the pattern.
pattern = re.compile(r'"\d+";"[0-9X]{10}";"[1-9][0]?"')

buffer = StringIO()
with open(ratings_file, 'r', encoding='iso-8859-1') as f:
    for line in f:
        if pattern.match(line):
            buffer.write(line)
buffer.seek(0)

ratings_df = pd.read_csv(buffer,
                         sep=';',
                         names=['User_ID', 'ISBN', 'Rating'],
                         dtype={'User_ID': str, 'ISBN': str, 'Rating': int})
buffer.close()

ratings_df.drop_duplicates(inplace=True)
ratings_df.head()
Out[5]:
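As a quick sanity check on the filter, you can test the pattern against sample lines; the two rows below are made up for illustration:

print(pattern.match('"276725";"034545104X";"8"') is not None)   # True: explicit rating 1-10 is kept
print(pattern.match('"276726";"0155061224";"0"') is not None)   # False: implicit rating 0 is excluded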
In [6]:
ratings = s.upload_frame(ratings_df,
                         casout=s.CASTable('ratings',
                                           replace=True,
                                           indexVars=['isbn']),
                         importoptions=dict(filetype="csv",
                                            vars=[dict(name="User_ID", type="double"),
                                                  dict(name="ISBN", type="CHAR", length=10),
                                                  dict(name="Rating", type="double")]))

books = s.upload_frame(books_df,
                       casout=s.CASTable('books', replace=True),
                       importoptions=dict(filetype="csv",
                                          vars=[dict(name="ISBN", type="CHAR", length=10)]))

display(Markdown('### Books'))
display(books.table.columninfo())
display(Markdown('### Ratings'))
display(ratings.table.columninfo())
Discard any ratings that do not have a corresponding ISBN in the books table.
In [7]:
original_row_count = len(ratings)

s.dataStep.runCode(code='''
data ratings;
    merge ratings(in=ratings) books(in=books keep=isbn);
    by isbn;
    if books and ratings then output;
run;
''')

final_row_count = len(ratings)

df = pd.DataFrame([[original_row_count], [final_row_count]],
                  columns=['Ratings Count'],
                  index=['Original', 'Final'])
df
Out[7]:
In [8]:
ratings['rating'].describe(stats=['mean', 'count', 'nmiss'])
Out[8]:
In [9]:
out = ratings.simple.distinct().Distinct.set_index('Column')
out
Out[9]:
In [10]:
# Store the number of rows in a variable
rating_count = len(ratings)
result = ratings.simple.distinct().Distinct.set_index('Column')
# Store the distinct number of users.
user_count = result.loc['User_ID', 'NDistinct']
# Store the distinct number of items.
item_count = result.loc['ISBN', 'NDistinct']
# Finally, here's the sparsity.
sparsity = 1.0 - (rating_count / (user_count * item_count))
df = pd.DataFrame([rating_count, user_count, item_count, sparsity],
                  index=['Ratings', 'Users', 'Items', 'Sparsity'],
                  columns=['Value'])
df
Out[10]:
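The cell above implements the usual sparsity formula for a ratings matrix:

$$\text{sparsity} = 1 - \frac{\#\text{ratings}}{\#\text{users} \times \#\text{items}}$$

A value close to 1 means that almost all user-item pairs are unrated, which is typical for recommender data.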
In [11]:
results = ratings['rating'].value_counts(sort=False)
results
Out[11]:
In [12]:
ax = results.plot.bar(
    title='Distribution of Ratings',
    figsize=(15, 5)
)
Subsequent actions are more efficient if the ratings table is partitioned once by item and a second copy is partitioned by user.
If you skip this step, many of the subsequent actions will make their own copy of the data, grouped by item or by user. That is convenient for a single action, but this notebook runs several actions, so performing the data transfer and reorganization once up front makes the later actions more memory- and CPU-efficient.
In [13]:
ratings_by_item = s.CASTable('ratings_by_item', replace=True)
result = ratings.groupby('ISBN').partition(casout=ratings_by_item)
ratings_by_user = s.CASTable('ratings_by_user', replace=True)
result = ratings.groupby('User_ID').partition(casout=ratings_by_user)
Now that we have two instances of the ratings table, partitioned by different variables, we can drop the original ratings table.
In [14]:
ratings.table.droptable()
Out[14]:
In [15]:
book_recommend = s.CASTable('bookRecommend', replace=True)
results = s.recommend.recomCreate(
    system=book_recommend,
    user='User_ID',
    item='ISBN',
    rate='Rating')
In [16]:
avg_user = s.CASTable('avg_user', replace=True)
results = ratings_by_user.recommend.recomRateinfo(
    label='avg_user_model',
    system=book_recommend,
    id='User_ID',
    sparseid='ISBN',
    sparseval='rating',
    casout=avg_user)
In [17]:
result = avg_user.head()
result
Out[17]:
You can view the ratings to confirm that the average is shown. If the first user
has a single rating, you can rerun the preceding cell with avg_user.query("_nratings_ = 2").head()
or a related query.
In [18]:
firstUser = result.loc[0,'User_ID']
count = result.loc[0,'_NRatings_']
ratings_by_user[ratings_by_user.user_id == firstUser].head(count)
Out[18]:
Create a table of average ratings by item.
In [19]:
avg_item = s.CASTable('avg_item', replace=True)
results = ratings_by_item.recommend.recomRateinfo(
    label='avg_item_model',
    system=book_recommend,
    id='isbn',
    sparseid='user_id',
    sparseval='rating',
    casout=avg_item)
In [20]:
avg_item.head()
Out[20]:
In [21]:
avg_user.query('_nratings_ > 3').sort_values('_stat_').head(10)
Out[21]:
In [22]:
avg_user.sort_values(['_stat_', '_nratings_'], ascending=False).head(10)
Out[22]:
In [23]:
s.fedSql.execDirect(query='''
    select t1.isbn,
           t1._stat_ as "Average Rating",
           t1._nratings_ as "Number of Ratings",
           t2.author, t2.title
    from avg_item as t1
    join books as t2 on (t1.isbn = t2.isbn)
    order by 3 desc
    limit 10
''')
Out[23]:
In [24]:
result = avg_item.query('_nratings_ > 10').sort_values('_stat_').head(10)
result
Out[24]:
In [25]:
# Store the ISBN for the first row.
first_isbn = result.loc[0, 'ISBN']

result = ratings_by_item['rating'].query("isbn eq '%s'" % first_isbn).dataPreprocess.histogram()
display(Markdown('#### Ratings Distribution for ISBN %s' % first_isbn))
display(result.BinDetails.loc[:, ['BinLowerBnd', 'NInBin', 'Percent']])
In [26]:
holdout_users = s.CASTable('holdout_users', replace=True)
ratings_by_user.recommend.recomSample(
    system=book_recommend,
    label='holdout_users',
    withhold=.2,
    hold=1,
    seed=1234,
    id='user_id',
    sparseid='isbn',
    casout=holdout_users
)
Out[26]:
In [27]:
holdout_users.head(10)
Out[27]:
In [28]:
als_u = s.CASTable('als_u', replace=True)
als_i = s.CASTable('als_i', replace=True)
result = s.recommend.recomAls(
    system=book_recommend,
    tableu=ratings_by_user,
    tablei=ratings_by_item,
    label='als1',
    casoutu=als_u,
    casouti=als_i,
    rateinfo=avg_user,
    maxiter=20,
    hold=holdout_users,
    seed=1234,
    details=True,
    k=20,
    stagnation=10,
    threshold=.1
)
result.ModelInfo.set_index('Descr')
Out[28]:
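The recomAls action factors the ratings matrix into k=20-dimensional user and item factor tables (als_u and als_i). A standard formulation of the objective that alternating least squares minimizes is shown below; the action's exact regularization is described in the SAS documentation:

$$\min_{U,V} \sum_{(u,i)\,\in\,\text{ratings}} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)$$

Each iteration alternates between solving for the user factors with the item factors fixed and vice versa, so the objective typically decreases steadily, as the iteration history below shows.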
In [29]:
ax = result.IterHistory.plot(
    x='Iteration', y='Objective',
    title='Objective Function',
    figsize=(9, 6)
)
result.IterHistory.set_index('Iteration')
Out[29]:
In [30]:
users = '104437'
recommendations = s.CASTable('recommendations', replace=True)
s.recommend.recomMfScore(
    system=book_recommend,
    label='als1',
    userlist=users,
    n=5,
    casout=recommendations
)
Out[30]:
In [31]:
s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from recommendations as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_
''')
Out[31]:
In [32]:
recommend_heldout = s.CASTable('recommend_heldout', replace=True)
s.recommend.recomMfScore(
    system=book_recommend,
    label='als1',
    usertable=holdout_users,
    n=5,
    casout=recommend_heldout
)
Out[32]:
In [33]:
result = s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from recommend_heldout as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_
''')

# There are many rows in the results. Print results for the first three users only.
# With n=5 recommendations per user, rows 0, 5, and 10 hold the first row for
# each of the first three users.
three = result['Result Set'].loc[[0, 5, 10], :'User_ID'].values
for user in np.nditer(three):
    display(Markdown('#### Recommendations for user %s ' % user))
    display(result['Result Set'].query('User_ID == %s' % user))
In [34]:
similar_users = s.CASTable("similar_users", replace=True)
ratings_by_user.recommend.recomSim(
    label="similar_users",
    system=book_recommend,
    id="user_id",
    sparseid="isbn",
    sparseval="rating",
    measure="cos",
    casout=similar_users,
    threshold=.2)
Out[34]:
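With measure="cos", the similarity between two users is the cosine of the angle between their sparse rating vectors:

$$\text{sim}(u, v) = \frac{\sum_i r_{ui}\, r_{vi}}{\sqrt{\sum_i r_{ui}^2}\;\sqrt{\sum_i r_{vi}^2}}$$

The threshold=.2 argument drops user pairs whose similarity falls below 0.2, which keeps the output table manageable.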
In [35]:
result = similar_users.query("user_id_1 = 104437 and user_id_2 = 199981").head(1)
display(result)

def one_users_ratings(user_id):
    result = s.fedSql.execDirect(query='''
        select t1.*, t2.author, t2.title
        from ratings_by_user as t1
        left outer join books as t2 on (t1.isbn = t2.isbn)
        where t1.user_id = {}
        order by author, isbn;
    '''.format(user_id))
    display(Markdown('#### Ratings by user %s' % user_id))
    display(result)

one_users_ratings(104437)
one_users_ratings(199981)
In [36]:
ratings_by_item.recommend.recomKnnTrain(
    label='knn1',
    system=book_recommend,
    similarity=similar_users,
    k=20,
    hold=holdout_users,
    rateinfo=avg_user,
    user=True  # indicate that the similarity table is by user, not by item
)
Out[36]:
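Scoring with this model predicts a user's rating for an item from the ratings of that user's k=20 most similar users. A common formulation of user-based k-NN prediction, which centers each neighbor's rating on the neighbor's average from rateinfo, is shown below; see the SAS documentation for the action's exact weighting:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_k(u)} \text{sim}(u, v)\,\left(r_{vi} - \bar{r}_v\right)}{\sum_{v \in N_k(u)} \left|\text{sim}(u, v)\right|}$$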
In [37]:
users = ['104437']
knn_recommended = s.CASTable("knn_recommended", replace=True)
s.recommend.recomKnnScore(
    system=book_recommend,
    label="knn1",
    userlist=users,
    n=10,
    casout=knn_recommended
)

s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from knn_recommended as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_;
''')
Out[37]:
First, build the search index.
Afterward, run search queries for terms.
In [38]:
book_search = s.CASTable("book_search", replace=True)
book_search.table.droptable(quiet=True)

books.recommend.recomSearchIndex(
    system=book_recommend,
    label='book_search',
    id='isbn')
Out[38]:
In [39]:
yoga_query = 'yoga fitness'
query_filter = s.CASTable("query_filter", replace=True)

result = book_search.recommend.recomSearchQuery(
    system=book_recommend,
    label='book_search',
    casout=query_filter,
    query=yoga_query,
    n=100)
In [40]:
query_filter.columnInfo()
Out[40]:
In [41]:
yoga_reader = '99955'
filtered_results = s.CASTable('filtered_results', replace=True)

# Score with the ALS model, restricting candidate items to the search results.
# Assign the action result to a separate variable so that filtered_results
# still references the output table.
result = s.recommend.recomMfScore(
    system=book_recommend,
    label='als1',
    filter=query_filter,
    userlist=yoga_reader,
    n=5,
    casout=filtered_results
)
In [42]:
s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from filtered_results as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_;
''')
Out[42]:
In [43]:
#s.close()
Data were used with permission from Dr. Cai-Nicolas Ziegler. The following publication describes the data set:
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. "Improving Recommendation Lists Through Topic Diversification." Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan.