This notebook demonstrates many of the actions in the Recommender System (recommend) action set for SAS Cloud Analytic Services (CAS).
This example uses explicit ratings from the Book-Crossing data set[1](#data_attribution). The data preparation excludes implicit ratings (recorded as a rating of 0) and ratings whose ISBN does not appear in the books data set.
You must have access to a SAS Viya 3.3 release of CAS. To connect to CAS from Python, you must install the SAS Scripting Wrapper for Analytics Transfer (SWAT).
Copyright SAS Institute, Inc.
Initial setup

- Import packages, including the SAS Scripting Wrapper for Analytics Transfer (SWAT) and open source libraries
- Connect to CAS and start a session
- Import the books file
- Import the ratings file
- Upload the data frames to the server

Simple exploration

- Calculate the sparsity
- View the ratings distribution

Build the recommender system

- Partition the ratings table
- Calculate average ratings by item and user
- Explore the item ratings and user ratings

Build a matrix factorization model

- Sample the data for a hold-out group
- Build the model using ALS
- Make recommendations for one user

Build a KNN model

- Calculate similarity between users
- View rating history for similar users
- Recommend 10 books for one user

Combine search with recommendations

- Build a search index
- Create a simple filter table
- Make recommendations from the filter table
In [1]:
import html
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import re
import swat
from IPython.core.display import display, HTML, Markdown
from io import StringIO
%matplotlib inline
In [2]:
s = swat.CAS("host.example.com", 5570)
s.loadactionset("dataPreprocess")
s.loadactionset("fedSql")
s.loadactionset("recommend")

data_dir = '/path/to/data/'
ratings_file = data_dir + 'BX-Book-Ratings.csv'
books_file = data_dir + 'BX-Books.csv'
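If you want to confirm which action sets are loaded in your session, a quick check uses the builtins action set, which is available in every CAS session:

s.builtins.actionSetInfo()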
In [3]:
# The file uses ISO-8859-1 encoding and contains HTML entities; unescape them first.
with open(books_file, 'r', encoding='iso-8859-1') as f:
    content = html.unescape(f.read())

books_df = pd.read_csv(StringIO(content),
                       header=0,
                       error_bad_lines=False,  # on newer pandas, use on_bad_lines='skip'
                       sep=';',
                       names=['ISBN', 'Title', 'Author', 'Year_Of_Publication', 'Publisher'],
                       usecols=['ISBN', 'Title', 'Author', 'Year_Of_Publication', 'Publisher'],
                       dtype={'Year_Of_Publication': str})

# Keep only rows with a well-formed 10-character ISBN.
books_df = books_df[books_df.ISBN.str.len() == 10]
In [4]:
books_df.head()
Out[4]:
In [5]:
# Keep only lines with a numeric user ID, a 10-character ISBN, and an
# explicit rating of 1-10. Implicit ratings (0) do not match the pattern.
pattern = re.compile(r'"\d+";"[0-9X]{10}";"[1-9][0]?"')

buffer = StringIO()
with open(ratings_file, 'r', encoding='iso-8859-1') as f:
    for line in f:
        if pattern.match(line):
            buffer.write(line)
buffer.seek(0)

ratings_df = pd.read_csv(buffer,
                         sep=';',
                         names=['User_ID', 'ISBN', 'Rating'],
                         dtype={'User_ID': str, 'ISBN': str, 'Rating': int})
buffer.close()

ratings_df.drop_duplicates(inplace=True)
ratings_df.head()
Out[5]:
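As a quick sanity check on the filter, you can test the pattern against sample lines; the two rows below are made up for illustration:

print(pattern.match('"276725";"034545104X";"8"') is not None)   # True: explicit rating 1-10 is kept
print(pattern.match('"276726";"0155061224";"0"') is not None)   # False: implicit rating 0 is excluded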
In [6]:
ratings = s.upload_frame(ratings_df,
                         casout=s.CASTable('ratings',
                                           replace=True,
                                           indexVars=['isbn']),
                         importoptions=dict(filetype="csv",
                                            vars=[dict(name="User_ID", type="double"),
                                                  dict(name="ISBN", type="CHAR", length=10),
                                                  dict(name="Rating", type="double")]))

books = s.upload_frame(books_df,
                       casout=s.CASTable('books', replace=True),
                       importoptions=dict(filetype="csv",
                                          vars=[dict(name="ISBN", type="CHAR", length=10)]))

display(Markdown('### Books'))
display(books.table.columninfo())
display(Markdown('### Ratings'))
display(ratings.table.columninfo())
Discard any ratings that do not have a corresponding ISBN in the books table.
In [7]:
original_row_count = len(ratings)

s.dataStep.runCode(code='''
data ratings;
    merge ratings(in=ratings) books(in=books keep=isbn);
    by isbn;
    if books and ratings then output;
run;
''')

final_row_count = len(ratings)

df = pd.DataFrame([[original_row_count], [final_row_count]],
                  columns=['Ratings Count'],
                  index=['Original', 'Final'])
df
Out[7]:
In [8]:
ratings['rating'].describe(stats=['mean', 'count', 'nmiss'])
Out[8]:
In [9]:
out = ratings.simple.distinct().Distinct.set_index('Column')
out
Out[9]:
In [10]:
# Store the number of rows in a variable
rating_count = len(ratings)
result = ratings.simple.distinct().Distinct.set_index('Column')
# Store the distinct number of users.
user_count = result.loc['User_ID', 'NDistinct']
# Store the distinct number of items.
item_count = result.loc['ISBN', 'NDistinct']
# Finally, here's the sparsity.
sparsity = 1.0 - (rating_count / (user_count * item_count))
df = pd.DataFrame([rating_count, user_count, item_count, sparsity],
                  index=['Ratings', 'Users', 'Items', 'Sparsity'],
                  columns=['Value'])
df
Out[10]:
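The cell above implements the usual sparsity formula for a ratings matrix:

$$\text{sparsity} = 1 - \frac{\#\text{ratings}}{\#\text{users} \times \#\text{items}}$$

A value close to 1 means that almost all user-item pairs are unrated, which is typical for recommender data.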
In [11]:
results = ratings['rating'].value_counts(sort=False)
results
Out[11]:
In [12]:
ax = results.plot.bar(
    title='Distribution of Ratings',
    figsize=(15, 5)
)
Subsequent actions are more efficient if the ratings table is partitioned once by item and a second copy is partitioned by user.
If you skip this step, many of the subsequent actions will make their own copy of the data, grouped by item or by user. That is convenient for a single action, but this notebook runs several actions, so performing the data transfer and reorganization once up front makes the later actions more memory- and CPU-efficient.
In [13]:
ratings_by_item = s.CASTable('ratings_by_item', replace=True)
result = ratings.groupby('ISBN').partition(casout=ratings_by_item)
ratings_by_user = s.CASTable('ratings_by_user', replace=True)
result = ratings.groupby('User_ID').partition(casout=ratings_by_user)
Now that we have two instances of the ratings table, partitioned by different variables, we can drop the original ratings table.
In [14]:
ratings.table.droptable()
Out[14]:
In [15]:
book_recommend = s.CASTable('bookRecommend', replace=True)
results = s.recommend.recomCreate(
    system=book_recommend,
    user='User_ID',
    item='ISBN',
    rate='Rating')
In [16]:
avg_user = s.CASTable('avg_user', replace=True)
results = ratings_by_user.recommend.recomRateinfo(
    label='avg_user_model',
    system=book_recommend,
    id='User_ID',
    sparseid='ISBN',
    sparseval='rating',
    casout=avg_user)
In [17]:
result = avg_user.head()
result
Out[17]:
You can view the ratings to confirm that the average is shown. If the first user
has a single rating, you can rerun the preceding cell with avg_user.query("_nratings_ = 2").head()
or a related query.
In [18]:
firstUser = result.loc[0,'User_ID']
count = result.loc[0,'_NRatings_']
ratings_by_user[ratings_by_user.user_id == firstUser].head(count)
Out[18]:
Create a table of average ratings by item.
In [19]:
avg_item = s.CASTable('avg_item', replace=True)
results = ratings_by_item.recommend.recomRateinfo(
    label='avg_item_model',
    system=book_recommend,
    id='isbn',
    sparseid='user_id',
    sparseval='rating',
    casout=avg_item)
In [20]:
avg_item.head()
Out[20]:
In [21]:
avg_user.query('_nratings_ > 3').sort_values('_stat_').head(10)
Out[21]:
In [22]:
avg_user.sort_values(['_stat_', '_nratings_'], ascending=False).head(10)
Out[22]:
In [23]:
s.fedSql.execDirect(query='''
    select t1.isbn,
           t1._stat_ as "Average Rating",
           t1._nratings_ as "Number of Ratings",
           t2.author, t2.title
    from avg_item as t1
    join books as t2 on (t1.isbn = t2.isbn)
    order by 3 desc
    limit 10
''')
Out[23]:
In [24]:
result = avg_item.query('_nratings_ > 10').sort_values('_stat_').head(10)
result
Out[24]:
In [25]:
# Store the ISBN for the first row.
first_isbn = result.loc[0, 'ISBN']

result = ratings_by_item['rating'].query("isbn eq '%s'" % first_isbn).dataPreprocess.histogram()
display(Markdown('#### Ratings Distribution for ISBN %s' % first_isbn))
display(result.BinDetails.loc[:, ['BinLowerBnd', 'NInBin', 'Percent']])
In [26]:
holdout_users = s.CASTable('holdout_users', replace=True)
ratings_by_user.recommend.recomSample(
    system=book_recommend,
    label='holdout_users',
    withhold=.2,
    hold=1,
    seed=1234,
    id='user_id',
    sparseid='isbn',
    casout=holdout_users
)
Out[26]:
In [27]:
holdout_users.head(10)
Out[27]:
In [28]:
als_u = s.CASTable('als_u', replace=True)
als_i = s.CASTable('als_i', replace=True)
result = s.recommend.recomAls(
    system=book_recommend,
    tableu=ratings_by_user,
    tablei=ratings_by_item,
    label='als1',
    casoutu=als_u,
    casouti=als_i,
    rateinfo=avg_user,
    maxiter=20,
    hold=holdout_users,
    seed=1234,
    details=True,
    k=20,
    stagnation=10,
    threshold=.1
)
result.ModelInfo.set_index('Descr')
Out[28]:
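The recomAls action factors the ratings matrix into k=20-dimensional user and item factor tables (als_u and als_i). A standard formulation of the objective that alternating least squares minimizes is shown below; the action's exact regularization is described in the SAS documentation:

$$\min_{U,V} \sum_{(u,i)\,\in\,\text{ratings}} \left( r_{ui} - \mathbf{u}_u^{\top} \mathbf{v}_i \right)^2 + \lambda \left( \lVert U \rVert_F^2 + \lVert V \rVert_F^2 \right)$$

Each iteration alternates between solving for the user factors with the item factors fixed and vice versa, so the objective typically decreases steadily, as the iteration history below shows.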
In [29]:
ax = result.IterHistory.plot(
    x='Iteration', y='Objective',
    title='Objective Function',
    figsize=(9, 6)
)
result.IterHistory.set_index('Iteration')
Out[29]:
In [30]:
users = '104437'
recommendations = s.CASTable('recommendations', replace=True)
s.recommend.recomMfScore(
    system=book_recommend,
    label='als1',
    userlist=users,
    n=5,
    casout=recommendations
)
Out[30]:
In [31]:
s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from recommendations as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_
''')
Out[31]:
In [32]:
recommend_heldout = s.CASTable('recommend_heldout', replace=True)
s.recommend.recomMfScore(
    system=book_recommend,
    label='als1',
    usertable=holdout_users,
    n=5,
    casout=recommend_heldout
)
Out[32]:
In [33]:
result = s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from recommend_heldout as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_
''')

# There are many rows in the results. Print results for the first three users only.
# With n=5 recommendations per user, rows 0, 5, and 10 hold the first row for
# each of the first three users.
three = result['Result Set'].loc[[0, 5, 10], :'User_ID'].values
for user in np.nditer(three):
    display(Markdown('#### Recommendations for user %s ' % user))
    display(result['Result Set'].query('User_ID == %s' % user))
In [34]:
similar_users = s.CASTable("similar_users", replace=True)
ratings_by_user.recommend.recomSim(
    label="similar_users",
    system=book_recommend,
    id="user_id",
    sparseid="isbn",
    sparseval="rating",
    measure="cos",
    casout=similar_users,
    threshold=.2)
Out[34]:
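With measure="cos", the similarity between two users is the cosine of the angle between their sparse rating vectors:

$$\text{sim}(u, v) = \frac{\sum_i r_{ui}\, r_{vi}}{\sqrt{\sum_i r_{ui}^2}\;\sqrt{\sum_i r_{vi}^2}}$$

The threshold=.2 argument drops user pairs whose similarity falls below 0.2, which keeps the output table manageable.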
In [35]:
result = similar_users.query("user_id_1 = 104437 and user_id_2 = 199981").head(1)
display(result)

def one_users_ratings(user_id):
    result = s.fedSql.execDirect(query='''
        select t1.*, t2.author, t2.title
        from ratings_by_user as t1
        left outer join books as t2 on (t1.isbn = t2.isbn)
        where t1.user_id = {}
        order by author, isbn;
    '''.format(user_id))
    display(Markdown('#### Ratings by user %s' % user_id))
    display(result)

one_users_ratings(104437)
one_users_ratings(199981)
In [36]:
ratings_by_item.recommend.recomKnnTrain(
    label='knn1',
    system=book_recommend,
    similarity=similar_users,
    k=20,
    hold=holdout_users,
    rateinfo=avg_user,
    user=True  # indicate that the similarity table is by user, not by item
)
Out[36]:
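Scoring with this model predicts a user's rating for an item from the ratings of that user's k=20 most similar users. A common formulation of user-based k-NN prediction, which centers each neighbor's rating on the neighbor's average from rateinfo, is shown below; see the SAS documentation for the action's exact weighting:

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N_k(u)} \text{sim}(u, v)\,\left(r_{vi} - \bar{r}_v\right)}{\sum_{v \in N_k(u)} \left|\text{sim}(u, v)\right|}$$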
In [37]:
users = ['104437']
knn_recommended = s.CASTable("knn_recommended", replace=True)
s.recommend.recomKnnScore(
    system=book_recommend,
    label="knn1",
    userlist=users,
    n=10,
    casout=knn_recommended
)

s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from knn_recommended as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_;
''')
Out[37]:
First, build the search index.
Afterward, run search queries for terms.
In [38]:
book_search = s.CASTable("book_search", replace=True)
book_search.table.droptable(quiet=True)

books.recommend.recomSearchIndex(
    system=book_recommend,
    label='book_search',
    id='isbn')
Out[38]:
In [39]:
yoga_query = 'yoga fitness'
query_filter = s.CASTable("query_filter", replace=True)

result = book_search.recommend.recomSearchQuery(
    system=book_recommend,
    label='book_search',
    casout=query_filter,
    query=yoga_query,
    n=100)
In [40]:
query_filter.columnInfo()
Out[40]:
In [41]:
yoga_reader = '99955'
filtered_results = s.CASTable('filtered_results', replace=True)

# Score with the ALS model, restricting candidate items to the search results.
# Assign the action result to a separate variable so that filtered_results
# still references the output table.
result = s.recommend.recomMfScore(
    system=book_recommend,
    label='als1',
    filter=query_filter,
    userlist=yoga_reader,
    n=5,
    casout=filtered_results
)
In [42]:
s.fedSql.execDirect(query='''
    select t1.*, t2.author, t2.title,
           t3._stat_ as "Average Rating", t3._nratings_ as "Number of Ratings"
    from filtered_results as t1
    left outer join books as t2 on (t1.isbn = t2.isbn)
    left outer join avg_item as t3 on (t1.isbn = t3.isbn)
    order by user_id, _rank_;
''')
Out[42]:
In [43]:
#s.close()
Data were used with permission from Dr. Cai-Nicolas Ziegler. The following publication describes the data set:
Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg Lausen. "Improving Recommendation Lists Through Topic Diversification." Proceedings of the 14th International World Wide Web Conference (WWW '05), May 10-14, 2005, Chiba, Japan.