In [1]:
%pylab inline
import pandas as pd

import numpy as np
from __future__ import division
import itertools

import matplotlib.pyplot as plt
import seaborn as sns

import logging
logger = logging.getLogger()


Populating the interactive namespace from numpy and matplotlib

9 Recommendation Systems

two broad groups:

  1. Content-based systems
    focus on the properties of items.

  2. Collaborative filtering systems
    focus on the relationship between users and items.

9.1 A Model for Recommendation Systems

The Utility Matrix

records the preferences given by users for certain items.


In [66]:
# Example 9.1
M = pd.DataFrame(index=['A', 'B', 'C', 'D'], columns=['HP1', 'HP2', 'HP3', 'TW', 'SW1', 'SW2', 'SW3'])

M.loc['A', ['HP1', 'TW', 'SW1']] = [4, 5, 1]
M.iloc[1, 0:3] = [5, 5, 4]
M.iloc[2, 3:-1] = [2, 4, 5]
M.iloc[3, [1, -1]] = [3, 3]

M_9_1 = M
M_9_1


Out[66]:
HP1 HP2 HP3 TW SW1 SW2 SW3
A 4 NaN NaN 5 1 NaN NaN
B 5 5 4 NaN NaN NaN NaN
C NaN NaN NaN 2 4 5 NaN
D NaN 3 NaN NaN NaN NaN 3

In practice, the matrix would be even sparser, with the typical user rating only a tiny fraction of all available items.

the goal of a recommendation system: to predict the blanks in the utility matrix.

  • in many applications the goal is slightly different:
    • it is not necessary to predict every blank entry, only to discover some entries in each row that are likely to be high.
    • it is not necessary to find all items with the highest expected ratings, only a large subset of them.

The Long Tail

  • physical institutions: provide only the most popular items.
  • online institutions: provide the entire range of items.

the long tail forces online institutions to recommend items to individual users:

  1. It is not possible to present all available items to the user.

  2. Neither can we expect users to have heard of each of the items they might like.

Applications of Recommendation Systems

  1. Product Recommendations

  2. Movie Recommendations

  3. News Articles

Populating the Utility Matrix

how to discover the value users place on items:

  1. We can ask users to rate items.
    cons: users are generally unwilling to rate, so the sample is biased toward the small fraction of users who do.

  2. We can make inferences from users' behavior.
    eg: items purchased/viewed/rated.

9.2 Content-Based Recommendations

9.2.1 Item Profiles

a record representing important characteristics of items.

Discovering Features
  1. for Documents
    idea: identify the words that characterize the topic of a document.
    namely, we expect a set of words to express the subject or main ideas of the document.

    1. eliminate stop words.
    2. compute the TF.IDF score for each remaining word in the document.
    3. take as the features of a document the $n$ words with the highest TF.IDF scores.
    (a rough sketch of these steps follows this list.)

    to measure the similarity of two documents, the distance measures we could use are:

    1. Jaccard distance
    2. cosine distance
      cosine distance of vectors is not affected by components in which both vectors have 0.
  2. for Images
    invite users to tag the items.
    cons: users are generally unwilling to tag $\implies$ there are not enough tags (and the tags are biased).
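
A rough sketch of the document-feature steps above (our helper top_tfidf_words; it assumes whitespace tokenization and the usual TF/IDF definitions, TF normalized by the document's maximum term frequency and IDF $= \log_2(N/n_i)$):

In [ ]:
from collections import Counter

def top_tfidf_words(docs, stop_words, n=5):
    """Return, for each document, the n words with the highest TF.IDF scores."""
    # 1. eliminate stop words (whitespace tokenization for simplicity)
    tokenized = [[w for w in doc.lower().split() if w not in stop_words] for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter()                                    # number of documents containing each word
    for words in tokenized:
        doc_freq.update(set(words))
    features = []
    for words in tokenized:
        tf = Counter(words)
        max_tf = max(tf.values()) if tf else 1
        # 2. TF.IDF = (f / max f) * log2(N / n_i) for each remaining word
        scores = {w: tf[w] / float(max_tf) * np.log2(n_docs / float(doc_freq[w])) for w in tf}
        # 3. keep the n highest-scoring words as the document's features
        features.append(sorted(scores, key=scores.get, reverse=True)[:n])
    return features

# e.g. top_tfidf_words(["the martian is a sci-fi movie", "the movie about a dog"],
#                      stop_words={'the', 'is', 'a', 'about'})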

generalizing the feature vector
  1. discrete feature $\to$ Boolean components.

  2. numerical feature $\to$ normalized value.
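
A tiny illustration with made-up data (hypothetical profile with one discrete and one numerical feature): pd.get_dummies turns the discrete feature into Boolean columns, and the numerical feature is normalized (here simply divided by its mean).

In [ ]:
profile = pd.DataFrame({'genre': ['sci-fi', 'drama', 'sci-fi'],
                        'length': [124, 90, 151]},
                       index=['F0', 'F1', 'F2'])

genre_bool = pd.get_dummies(profile['genre'])                  # discrete -> Boolean (0/1) columns
length_norm = profile['length'] / profile['length'].mean()     # numerical -> normalized value

pd.concat([genre_bool, length_norm], axis=1)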

9.2.5 User Profiles

create vectors with the same components as item profiles to describe the user's preferences.

They can be derived from the utility matrix and the item profiles:

  1. normalize the utility matrix (subtract each user's average rating, so values fall roughly in $[-1,1]$ and cosine distance becomes meaningful).

  2. each component of the user profile is the average, over the items the user has rated, of the utility value times the corresponding item-profile component.


In [3]:
# example 9.4

users_name = ['U', 'V']
items_name = ['F{}'.format(x) for x in range(4)]
features_name = ['Julia Roberts', 'others']

# utility matrix
M_uti = pd.DataFrame([
                        [3, 4, 5, 0],
                        [6, 2, 3, 5]  
                     ],
                     index=users_name,
                     columns=items_name
                    )

M_uti


Out[3]:
F0 F1 F2 F3
U 3 4 5 0
V 6 2 3 5

In [4]:
# item profile
M_item = pd.DataFrame(index=items_name, columns=features_name)

M_item.loc[:, features_name[0]] = 1
M_item = M_item.fillna(value=0)

M_item


Out[4]:
Julia Roberts others
F0 1 0
F1 1 0
F2 1 0
F3 1 0

In [5]:
M_uti.apply(lambda x: x - np.mean(x), axis=1)


Out[5]:
F0 F1 F2 F3
U 0 1 2 -3
V 2 -2 -1 1

In [6]:
M_user = M_uti.fillna(value=0).dot(M_item) / 4 #average = sum/len
M_user


Out[6]:
Julia Roberts others
U 3 0
V 4 0

9.2.6 Recommending Items to Users Based on Content

  1. to estimate:
    $$M_{utility}[user, item] = cosineDistance(M_{user}[user,:],\; M_{item}[item,:])$$

    the more similar the user and item profiles, the more likely we are to recommend the item.
    (a small ranking sketch follows this list.)

  2. classification algorithms:
    learn a classifier (machine learning) that predicts Recommend or Not:
    one classifier per user $\to$ takes too long to construct;
    usable only for relatively small problem sizes.
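
A minimal ranking sketch (our helper recommend): score each item by the cosine distance between the user profile and the item profile and return the closest items; it assumes both profiles share the same feature columns, as in Example 9.4.

In [ ]:
from scipy.spatial.distance import cosine

def recommend(user_profile, item_profiles, top_k=3):
    """Rank items by cosine distance to the user profile (smaller distance = more similar)."""
    distances = item_profiles.apply(lambda item: cosine(user_profile, item), axis=1)
    return distances.sort_values().head(top_k)

# e.g. recommend(M_user.loc['U'], M_item)    # with the profiles built in Example 9.4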


In [7]:
# exercise 9.2.1

raw_data = [
        [3.06, 2.68, 2.92],
        [500, 320, 640],
        [6, 4, 6]
    ]

M_item = pd.DataFrame(raw_data, index=['Processor Speed', 'Disk Size', 'Main-Memory Size'], columns=['A', 'B', 'C'])

# items: A, B, C; features: Processor Speed, Disk Size, ...
M_item


Out[7]:
A B C
Processor Speed 3.06 2.68 2.92
Disk Size 500.00 320.00 640.00
Main-Memory Size 6.00 4.00 6.00

In [12]:
# exercise 9.2.1 

# (d)
M_item.apply(lambda x: x / np.mean(x), axis=1)


Out[12]:
A B C
Processor Speed 1.060046 0.928406 1.011547
Disk Size 1.027397 0.657534 1.315068
Main-Memory Size 1.125000 0.750000 1.125000

In [13]:
# exercise 9.2.2 

# (a)
M_item.apply(lambda x: x - np.mean(x), axis=1)


Out[13]:
A B C
Processor Speed 0.173333 -0.206667 0.033333
Disk Size 13.333333 -166.666667 153.333333
Main-Memory Size 0.666667 -1.333333 0.666667

In [30]:
# exercise 9.2.3

M_uti = pd.DataFrame([[4, 2, 5]], index=['user'], columns=['A', 'B', 'C'])
M_uti


Out[30]:
A B C
user 4 2 5

In [31]:
# (a)
M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)
M_uti_nor


Out[31]:
A B C
user 0.333333 -1.666667 1.333333

In [32]:
# (b)
M_user = M_item.dot(M_uti_nor.T) / 3
M_user


Out[32]:
user
Processor Speed 0.148889
Disk Size 162.222222
Main-Memory Size 1.111111

In [79]:
logger.setLevel('WARN')

def create_user_profile(utility_matrix, item_profile):
    """Create a user profile by combining the utility matrix with the item profile (Section 9.2.5)."""
    
    assert np.array_equal(utility_matrix.columns, item_profile.columns), \
                      "utility matrix must have the same column names as the item profile."
    
    logger.info('utility_matrix: \n{}\n'.format(utility_matrix))  
    M_uti_notnull = np.ones(utility_matrix.shape)
    M_uti_notnull[utility_matrix.isnull().values] = 0
    logger.info('utility_matrix_notnull: \n{}\n'.format(M_uti_notnull))
    
    
    logger.info('item_profile: \n{}\n'.format(item_profile)) 
    M_item_notnull = np.ones(item_profile.shape)
    M_item_notnull[item_profile.isnull().values] = 0
    logger.info('item_profile_notnull: \n{}\n'.format(M_item_notnull))
    
    utility_matrix = utility_matrix.fillna(value=0)
    item_profile = item_profile.fillna(value=0)
    # divide each dot product by the number of entries that are non-null in both matrices
    M_user = item_profile.dot(utility_matrix.T).values / np.dot(M_item_notnull, M_uti_notnull.T)
    M_user[np.isinf(M_user)] = np.nan   # division by zero (no common non-null entries) -> NaN
    logger.info('M_user: \n{}\n'.format(M_user))
    
    return pd.DataFrame(M_user, index=item_profile.index, columns=utility_matrix.index)


M_uti = pd.DataFrame([[4, 2, 5], [1, np.nan, 3]], index=['userA', 'userB'], columns=['A', 'B', 'C'])
M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)
print('utility matrix: \n{}\n'.format(M_uti_nor))
print('item profile: \n{}\n'.format(M_item))

create_user_profile(M_uti_nor, M_item)


utility matrix: 
              A         B         C
userA  0.333333 -1.666667  1.333333
userB -1.000000       NaN  1.000000

item profile: 
                       A       B       C
Processor Speed     3.06    2.68    2.92
Disk Size         500.00  320.00  640.00
Main-Memory Size    6.00    4.00    6.00

Out[79]:
userA userB
Processor Speed 0.148889 -0.07
Disk Size 162.222222 70.00
Main-Memory Size 1.111111 0.00

9.3 Collaborative Filtering

identifying similar users and recommending what similar users like.

refining the data

  1. rounding the data
    eg: ratings of 3, 4 and 5 become "1"; ratings of 1 and 2 are treated as unrated.

  2. normalizing ratings
    subtract from each rating the average rating of that user.


In [68]:
# Fig 9.4
M_9_1


Out[68]:
HP1 HP2 HP3 TW SW1 SW2 SW3
A 4 NaN NaN 5 1 NaN NaN
B 5 5 4 NaN NaN NaN NaN
C NaN NaN NaN 2 4 5 NaN
D NaN 3 NaN NaN NaN NaN 3

In [70]:
# rounding the data
M_round = M_9_1.copy()
M_round[M_9_1 <= 2] = np.nan
M_round[M_9_1 > 2] = 1

M_round


Out[70]:
HP1 HP2 HP3 TW SW1 SW2 SW3
A 1 NaN NaN 1 NaN NaN NaN
B 1 1 1 NaN NaN NaN NaN
C NaN NaN NaN NaN 1 1 NaN
D NaN 1 NaN NaN NaN NaN 1

In [75]:
# normalizing ratings
M_norm = M_9_1.apply(lambda x: x - np.mean(x), axis=1)
M_norm


Out[75]:
HP1 HP2 HP3 TW SW1 SW2 SW3
A 0.666667 NaN NaN 1.666667 -2.333333 NaN NaN
B 0.333333 0.333333 -0.666667 NaN NaN NaN NaN
C NaN NaN NaN -1.666667 0.333333 1.333333 NaN
D NaN 0.000000 NaN NaN NaN NaN 0

9.3.2 The Duality of Similarity

  1. We can use information about users to recommend items, whereas even if we find pairs of similar items, it takes an additional step in order to recommend items to users.

    • find the $n$ most similar users $\to$ estimate $M[U,I]$ for user $U$ and item $I$.
      normalize the utility matrix first (subtract each user's average), then add the user's average back:
      \begin{align}
      M[U,I] = Ave(M[U,:]) + \frac{1}{n} \sum_{V \in S_n(U)} \bigl( M[V,I] - Ave(M[V,:]) \bigr)
      \end{align}
      where $S_n(U)$ is the set of the $n$ users most similar to $U$ that have rated $I$.
      (a small sketch of this estimate follows this list.)

    • find the $m$ most similar items $\to$ estimate $M[U,I]$.
      $$M[U,I] = Ave\bigl(M[U, S_m(I)]\bigr)$$
      where $S_m(I)$ is the set of the $m$ items most similar to $I$ that $U$ has rated.

    • in order to recommend items to user $U$, we need to fill in all or most of the entries of $M[U,:]$.
      tradeoff:

      1. user-user: after finding similar users once, we directly get predicted values for all candidate items.
        item-item: we must find similar items for each candidate item, one by one (the additional step), to fill $M[U,:]$.

      2. item-item similarity often provides more reliable information, because items are simpler than users (eg: a movie usually belongs to a single genre).

    • precompute preferred items for each user.
      the utility matrix evolves slowly $\implies$ compute it infrequently and assume that it remains fixed between recomputations.

  2. Items tend to be classifiable in simple terms (eg: genre), whereas individuals' tastes are complex.
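
A minimal sketch of the user-user estimate above (our helper predict_user_user): cosine similarity is computed between rows of the normalized utility matrix, with missing ratings filled by 0 only for the similarity computation.

In [ ]:
from scipy.spatial.distance import cosine

def predict_user_user(M, user, item, n=2):
    """Estimate M[user, item] from the n most similar users who rated the item."""
    M_norm = M.sub(M.mean(axis=1), axis=0)                  # subtract each user's average
    target = M_norm.loc[user].fillna(0)
    others = M_norm.drop(user).fillna(0)
    sims = others.apply(lambda row: 1 - cosine(target, row), axis=1).dropna()
    rated = M.loc[sims.index, item].notnull()               # keep only users who rated the item
    top = sims[rated].sort_values(ascending=False).head(n).index
    # the user's own average plus the average normalized rating of the similar users
    return M.loc[user].mean() + M_norm.loc[top, item].mean()

# e.g. predict_user_user(M_9_1.astype(float), 'A', 'HP2', n=2)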

9.3.3 Clustering Users and Items

A hierarchical approach is preferred:

  1. leave many clusters unmerged at first.

  2. cluster items, and average the corresponding entries of the utility matrix (sketched after the prediction rule below).

  3. cluster users, and average as well.

  4. repeat several times if we like.

Predicting $M[U,I]$:

  1. find the clusters containing $U$ and $I$: say $U \in C$ and $I \in D$.

  2. predict:

    \begin{equation}
    M[U,I] = \begin{cases}
        M_{revised}[C,D] & \quad \text{if that entry exists,}  \\
        \text{estimate using similar users/items (as in 9.3.2)} & \quad \text{otherwise.}
    \end{cases}
    \end{equation}
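
A small sketch of steps 2-3 (our helper revised_utility): it assumes user and item cluster assignments are already given (e.g. by a hierarchical clustering stopped early) and simply averages the utility matrix within each (user cluster, item cluster) block.

In [ ]:
def revised_utility(M, user_clusters, item_clusters):
    """Average the utility matrix within each (user cluster, item cluster) block, ignoring NaN."""
    M_rev = M.copy()
    M_rev.index = [user_clusters[u] for u in M.index]       # relabel rows by user cluster
    M_rev.columns = [item_clusters[i] for i in M.columns]   # relabel columns by item cluster
    M_rev = M_rev.groupby(level=0).mean()                   # average users within each cluster
    return M_rev.T.groupby(level=0).mean().T                # then average items within each cluster

# e.g. with the Fig 9.8 matrix defined below, merging users {A, B} and items {a, b}, {d, e}:
# revised_utility(M_uti,
#                 user_clusters={'A': 'C1', 'B': 'C1', 'C': 'C2'},
#                 item_clusters={'a': 'D1', 'b': 'D1', 'c': 'D2', 'd': 'D3', 'e': 'D3',
#                                'f': 'D4', 'g': 'D5', 'h': 'D6'})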


In [108]:
# Fig 9.8

raw_data = [
    [4, 5, np.nan, 5, 1, np.nan, 3, 2],
    [np.nan, 3, 4, 3, 1, 2, 1, np.nan],
    [2, np.nan, 1, 3, np.nan, 4, 5, 3]
]

import string
M_uti = pd.DataFrame(raw_data, index=list(string.uppercase[:3]), columns=list(string.lowercase[:8]))
M_uti


Out[108]:
a b c d e f g h
A 4 5 NaN 5 1 NaN 3 2
B NaN 3 4 3 1 2 1 NaN
C 2 NaN 1 3 NaN 4 5 3

In [140]:
logger.setLevel('WARN')
# exercise 9.3.1
from scipy.spatial.distance import jaccard, cosine
from itertools import combinations


def calc_distance_among_matrix(M, func_dis):
    for c in list(combinations(M.index, 2)):
        logger.info('c: {}'.format(c))
        u, v = M.loc[c[0]], M.loc[c[1]]
        logger.info('\n u:{},\n v:{}\n'.format(u.values,v.values))
        print('{} {}: {}'.format(c, func_dis.__name__, func_dis(u,v)))

# (a)
calc_distance_among_matrix(M_uti.notnull(), jaccard)


('A', 'B') jaccard: 0.5
('A', 'C') jaccard: 0.5
('B', 'C') jaccard: 0.5

In [141]:
# (b)
calc_distance_among_matrix(M_uti.fillna(value=0), cosine)


('A', 'B') cosine: 0.398959235991
('A', 'C') cosine: 0.385081306188
('B', 'C') cosine: 0.486129880223

In [142]:
# (c)
M_tmp = M_uti.copy()
M_tmp[M_uti < 3] = 0
M_tmp[M_uti >= 3] = 1 
calc_distance_among_matrix(M_tmp, jaccard)


('A', 'B') jaccard: 0.714285714286
('A', 'C') jaccard: 0.75
('B', 'C') jaccard: 0.875

In [144]:
# (d)
calc_distance_among_matrix(M_tmp.fillna(value=0), cosine)


('A', 'B') cosine: 0.42264973081
('A', 'C') cosine: 0.5
('B', 'C') cosine: 0.711324865405

In [146]:
# (e)
M_uti_nor = M_uti.apply(lambda x: x - np.mean(x), axis=1)
M_uti_nor


Out[146]:
a b c d e f g h
A 0.666667 1.666667 NaN 1.666667 -2.333333 NaN -0.333333 -1.333333
B NaN 0.666667 1.666667 0.666667 -1.333333 -0.333333 -1.333333 NaN
C -1.000000 NaN -2.000000 0.000000 NaN 1.000000 2.000000 0.000000

In [148]:
# (f)
calc_distance_among_matrix(M_uti_nor.fillna(value=0), cosine)


('A', 'B') cosine: 0.415693452532
('A', 'C') cosine: 1.11547005384
('B', 'C') cosine: 1.73957399695

In [149]:
# exercise 9.3.2
#todo

9.4 Dimensionality Reduction

UV-decomposition: approximate the $n \times m$ utility matrix as $$M \approx U V$$ where $U$ is $n \times d$ and $V$ is $d \times m$ for some small $d$.

measure of fit: RMSE (root-mean-square error) over the nonblank entries of $M$.
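
A small helper (ours) for this measure, assuming $M$, $U$ and $V$ are NumPy arrays and blanks in $M$ are np.nan:

In [ ]:
def rmse(M, U, V):
    """Root-mean-square error of U.dot(V) against the nonblank entries of M (np.nan = blank)."""
    mask = ~np.isnan(M)
    diff = M[mask] - U.dot(V)[mask]
    return np.sqrt(np.mean(diff ** 2))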

Building a Complete UV-Decomposition Algorithm

  1. Preprocessing of the matrix $M$.
    normalization:

    1. subtract: average rating of user $i$, then average rating of item $j$.
    2. subtract: first item, then user.
    3. subtract: half of average of item and half of average of user.
  2. Initializing $U$ and $V$.
    choice: give the elements of $UV$ the average of the nonblank elements of $M$
    $\implies$ every element of $U$ and $V$ should be $\sqrt{a/d}$,
    where $a$ is the average nonblank element of $M$ and $d$ is the length of the short sides of $U$ and $V$.

    to increase the chance that the local minimum we reach is the global minimum (see the sketch after this list):

    1. vary the initial values of $U$ and $V$:
      perturb the value $\sqrt{a/d}$ randomly.
    2. vary the way we seek the optimum.
  3. Performing the Optimization.
    different optimization paths:
    choose a permutation of the elements and follow that order in every round.

    Gradient Descent $\to$ stochastic gradient descent.

  4. Converging to a Minimum.
    track the amount of improvement in the RMSE obtained.

    stop condition:

    1. stop when that improvement in one round falls below a threshold.
    2. stop when the maximum improvement during a round is below a threshold.
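
A minimal sketch tying the steps above together (our uv_decompose, using the rmse helper defined earlier): initialize every element to $\sqrt{a/d}$ with a small random perturbation, repeatedly set each element of $U$ and $V$ to its optimal value given all the others, and stop when the RMSE improvement in a round falls below a threshold. This is an illustration of the element-wise optimization, not a tuned implementation.

In [ ]:
def uv_decompose(M, d=2, threshold=1e-4, max_rounds=100, seed=0):
    """Element-wise UV-decomposition of M (a NumPy array with np.nan for blank entries)."""
    n_rows, n_cols = M.shape
    mask = ~np.isnan(M)
    a = np.nanmean(M)
    rng = np.random.RandomState(seed)
    U = np.sqrt(a / d) + 0.01 * rng.randn(n_rows, d)   # initialize so that UV ~ average of M,
    V = np.sqrt(a / d) + 0.01 * rng.randn(d, n_cols)   # randomly perturbed

    prev = rmse(M, U, V)
    for _ in range(max_rounds):
        for r in range(n_rows):                        # optimize each element of U ...
            for s in range(d):
                j = mask[r, :]                         # nonblank columns of row r
                if not j.any():
                    continue
                err = M[r, j] - U[r, :].dot(V[:, j]) + U[r, s] * V[s, j]
                U[r, s] = (V[s, j] * err).sum() / (V[s, j] ** 2).sum()
        for s in range(d):                             # ... then each element of V
            for c in range(n_cols):
                i = mask[:, c]                         # nonblank rows of column c
                if not i.any():
                    continue
                err = M[i, c] - U[i, :].dot(V[:, c]) + U[i, s] * V[s, c]
                V[s, c] = (U[i, s] * err).sum() / (U[i, s] ** 2).sum()
        cur = rmse(M, U, V)
        if prev - cur < threshold:                     # stop when the improvement in a round is small
            break
        prev = cur
    return U, V

# e.g. U, V = uv_decompose(M_9_1.values.astype(float), d=2)
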
Avoiding Overfitting

solutions:

  1. move the value of a component only a fraction of the way from its current value toward its optimized value (see the small snippet after this list).

  2. Stop before the process has converged.

  3. Take several different $UV$ decompositions, and average their predictions.
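
Solution 1 is a damped update; a small helper (our notation, with $0 < \beta < 1$) that could replace the direct assignment in the uv_decompose sketch above:

In [ ]:
def damped_update(current, optimal, beta=0.3):
    """Move only a fraction beta of the way from the current value toward the optimized value."""
    return current + beta * (optimal - current)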


In [ ]:
# exercise 9.4.6

#todo

9.5 The Netflix Challenge

some facts:

  1. CineMatch was not a very good algorithm.

  2. A UV-decomposition algorithm gave a 7% improvement over CineMatch when coupled with normalization and a few other tricks.

  3. Combining different algorithms is a preferred strategy.

  4. Genre and other information from IMDB turned out not to be useful.

  5. The time of a rating turned out to be useful: ratings trend upward or downward over time.

todo: read the papers introduced in the chapter.