WNx - 06 June 2017 Practical Deep Learning I Lesson 4 CodeAlong
In [1]:
import theano
In [2]:
import sys, os
sys.path.insert(1, os.path.join('utils'))
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import print_function, division
In [3]:
path = "data/ml-latest-small/"
model_path = path + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)
batch_size = 64
In [4]:
ratings = pd.read_csv(path+'ratings.csv')
ratings.head()
Out[4]:
In [5]:
len(ratings)
Out[5]:
Just for display purposes, let's read in the movie names too:
In [6]:
movie_names = pd.read_csv(path+'movies.csv').set_index('movieId')['title'].to_dict()
In [7]:
users = ratings.userId.unique()
movies = ratings.movieId.unique()
In [8]:
userid2idx = {o:i for i,o in enumerate(users)}
movieid2idx = {o:i for i,o in enumerate(movies)}
We update the movie and user ids so that they are contiguous integers starting from zero, which is what an embedding layer expects.
In [9]:
ratings.movieId = ratings.movieId.apply(lambda x: movieid2idx[x])
ratings.userId = ratings.userId.apply(lambda x: userid2idx[x])
In [10]:
user_min, user_max, movie_min, movie_max = (ratings.userId.min(),
ratings.userId.max(), ratings.movieId.min(), ratings.movieId.max())
user_min, user_max, movie_min, movie_max
Out[10]:
In [11]:
n_users = ratings.userId.nunique()
n_movies = ratings.movieId.nunique()
n_users, n_movies
Out[11]:
This is the number of latent factors in each embedding.
In [12]:
n_factors = 50
np.random.seed(42)
Randomly split into training and validation.
In [13]:
msk = np.random.rand(len(ratings)) < 0.8
trn = ratings[msk]
val = ratings[~msk]
In [14]:
g=ratings.groupby('userId')['rating'].count()
topUsers=g.sort_values(ascending=False)[:15]
In [15]:
g=ratings.groupby('movieId')['rating'].count()
topMovies=g.sort_values(ascending=False)[:15]
In [16]:
top_r = ratings.join(topUsers, rsuffix='_r', how='inner', on='userId')
In [17]:
top_r = top_r.join(topMovies, rsuffix='_r', how='inner', on='movieId')
In [18]:
pd.crosstab(top_r.userId, top_r.movieId, top_r.rating, aggfunc=np.sum)
Out[18]:
In [19]:
user_in = Input(shape=(1,), dtype='int64', name='user_in')
u = Embedding(n_users, n_factors, input_length=1, W_regularizer=l2(1e-4))(user_in)
movie_in = Input(shape=(1,), dtype='int64', name='movie_in')
m = Embedding(n_movies, n_factors, input_length=1, W_regularizer=l2(1e-4))(movie_in)
In [20]:
x = merge([u, m], mode='dot')
x = Flatten()(x)
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')
In [21]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
validation_data=([val.userId, val.movieId], val.rating))
Out[21]:
In [22]:
from keras import backend as K
K.set_value(model.optimizer.lr, 0.01)  # update the learning rate of the already-compiled optimizer
In [23]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=3,
validation_data=([val.userId, val.movieId], val.rating))
Out[23]:
In [24]:
K.set_value(model.optimizer.lr, 0.001)
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6,
validation_data=([val.userId, val.movieId], val.rating))
Out[24]:
The best benchmarks are a bit over 0.9, so this model doesn't seem to be working that well...
The problem is likely to be that we don't have bias terms - that is, a single bias for each user and each movie representing how positive or negative each user is, and how good each movie is. We can add that easily by simply creating an embedding with one output for each movie and each user, and adding it to our output.
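To make the new architecture concrete, here is a rough sketch in plain NumPy of what the prediction for a single (user, movie) pair now looks like - the dot product of the two factor vectors plus the two scalar biases. The numbers are made up for illustration and are not taken from the trained model:

import numpy as np

# Illustrative only: tiny made-up factor vectors and biases for one user/movie pair.
user_factors = np.array([0.2, -0.1, 0.5])    # latent factors for one user
movie_factors = np.array([0.3, 0.4, -0.2])   # latent factors for one movie
user_bias, movie_bias = 0.1, 0.8             # how generous the user is, how well-liked the movie is

prediction = user_factors.dot(movie_factors) + user_bias + movie_bias
print(prediction)   # dot product of the factors plus both bias terms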
In [25]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)
In [26]:
user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)
In [27]:
def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)
In [28]:
ub = create_bias(user_in, n_users)
mb = create_bias(movie_in, n_movies)
In [29]:
x = merge([u, m], mode='dot')
x = Flatten()(x)
x = merge([x, ub], mode='sum')
x = merge([x, mb], mode='sum')
model = Model([user_in, movie_in], x)
model.compile(Adam(0.001), loss='mse')
In [30]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=1,
validation_data=([val.userId, val.movieId], val.rating))
Out[30]:
In [31]:
K.set_value(model.optimizer.lr, 0.01)
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=6, verbose=0,
validation_data=([val.userId, val.movieId], val.rating))
Out[31]:
In [32]:
K.set_value(model.optimizer.lr, 0.001)
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=10, verbose=0,
validation_data=([val.userId, val.movieId], val.rating))
Out[32]:
In [33]:
model.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=5,
validation_data=([val.userId, val.movieId], val.rating))
Out[33]:
This result is quite a bit better than the best benchmarks we could find with a quick Google search (the training loss is below 0.89), so this looks like a great approach!
In [34]:
model.save_weights(model_path + 'bias.h5')
model.load_weights(model_path + 'bias.h5')
We can use the model to generate predictions by passing a pair of ints - a user id and a movie id. For instance, this predicts that user #3 would really enjoy movie #6.
In [35]:
model.predict([np.array([3]), np.array([6])])
Out[35]:
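Note that these are the contiguous indices we created earlier, not the raw MovieLens ids. To score an original (userId, movieId) pair we would map it through the lookup dicts first - a sketch with assumed example ids (any pair that appears in ratings.csv would do):

raw_user, raw_movie = 15, 1580   # hypothetical raw MovieLens ids, chosen only for illustration
model.predict([np.array([userid2idx[raw_user]]), np.array([movieid2idx[raw_movie]])])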
In [36]:
g = ratings.groupby('movieId')['rating'].count()
topMovies = g.sort_values(ascending=False)[:2000]
topMovies = np.array(topMovies.index)
First, we'll look at the movie bias term. We create a 'model' - which in Keras is simply a way of associating one or more inputs with one or more outputs, using the functional API. Here, our input is the movie id (a single id), and the output is the movie bias (a single float).
In [37]:
get_movie_bias = Model(movie_in, mb)
movie_bias = get_movie_bias.predict(topMovies)
movie_ratings = [(b[0], movie_names[movies[i]]) for i, b in zip(topMovies, movie_bias)]
Now we can look at the top and bottom rated movies. These ratings are corrected for different levels of reviewer sentiment, as well as for the different types of movies that different reviewers watch.
In [38]:
sorted(movie_ratings, key=itemgetter(0))[:15]
Out[38]:
In [39]:
sorted(movie_ratings, key=itemgetter(0), reverse=True)[:15]
Out[39]:
Hey! I liked Avengers, lol
We can now do the same thing for the embeddings.
In [40]:
get_movie_emb = Model(movie_in, m)
movie_emb = np.squeeze(get_movie_emb.predict([topMovies]))
movie_emb.shape
Out[40]:
Because it's hard to interpret 50 latent factors per movie, we use PCA (Principal Component Analysis) to reduce them to just three components.
In [42]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
movie_pca = pca.fit(movie_emb.T).components_
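As a quick sanity check (not in the original notebook), we can ask how much of the variance in the embeddings these three components actually capture:

pca.explained_variance_ratio_   # fraction of variance explained by each of the 3 components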
In [43]:
fac0 = movie_pca[0]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac0, topMovies)]
Here's the 1st component. It seems to be 'critically acclaimed' or 'classic'.
In [44]:
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
Out[44]:
In [45]:
sorted(movie_comp, key=itemgetter(0))[:10]
Out[45]:
In [46]:
fac1 = movie_pca[1]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac1, topMovies)]
The 2nd is 'hollywood blockbuster'.
In [48]:
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
Out[48]:
In [50]:
sorted(movie_comp, key=itemgetter(0))[:10]
Out[50]:
In [51]:
fac2 = movie_pca[2]
movie_comp = [(f, movie_names[movies[i]]) for f,i in zip(fac2, topMovies)]
The 3rd is 'violent vs happy'.
In [52]:
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]
Out[52]:
In [53]:
sorted(movie_comp, key=itemgetter(0))[:10]
Out[53]:
We can draw a picture to see how various movies appear on the map of these components. This picture shows the 1st and 3rd components.
In [54]:
import sys
stdout, stderr = sys.stdout, sys.stderr # save notebook stdout and stderr
reload(sys)
sys.setdefaultencoding('utf-8')
sys.stdout, sys.stderr = stdout, stderr # restore notebook stdout and stderr
In [55]:
start=50; end=100
X = fac0[start:end]
Y = fac2[start:end]
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
for i, x, y in zip(topMovies[start:end], X, Y):
    plt.text(x,y,movie_names[movies[i]], color=np.random.rand(3)*0.7, fontsize=14)
plt.show()
Rather than creating a special purpose architecture (like our dot-product with bias earlier), it's often both easier and more accurate to use a standard neural network. Let's try it! Here, we simply concatenate the user and movie embeddings into a single vector, which we feed into the neural net.
In [56]:
user_in, u = embedding_input('user_in', n_users, n_factors, 1e-4)
movie_in, m = embedding_input('movie_in', n_movies, n_factors, 1e-4)
In [57]:
x = merge([u,m], mode='concat')
x = Flatten()(x)
x = Dropout(0.3)(x)
x = Dense(70, activation='relu')(x)
x = Dropout(0.75)(x)
x = Dense(1)(x)
NN = Model([user_in, movie_in], x)
NN.compile(Adam(0.001), loss='mse')
In [58]:
NN.fit([trn.userId, trn.movieId], trn.rating, batch_size=64, nb_epoch=8,
validation_data=([val.userId, val.movieId], val.rating))
Out[58]:
This improves even further on what was already an impressive result!
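As with the dot-product model, we could save these weights and generate predictions the same way - a sketch using the same (index-based) user and movie as before; the file name is just an assumption:

NN.save_weights(model_path + 'nn.h5')       # mirror the earlier bias.h5 save
NN.predict([np.array([3]), np.array([6])])  # predicted rating for user index 3, movie index 6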