Practical Deep Learning for Coders, v3

Lesson4_collab



In [ ]:

    
from fastai.collab import *
from fastai.tabular import *

Collaborative filtering example

collab models use data in a DataFrame of users, items, and ratings.
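As a rough illustration of the layout described above, the sketch below writes a few ratings by hand and maps the raw user and item IDs to contiguous indices, which is what an embedding lookup needs. Plain tuples stand in for DataFrame rows, the values are taken from the sample ratings shown in this notebook, and the helper `build_index` is hypothetical (it is not part of fastai). Reserving index 0 for unknown IDs is an assumption made for the example.

```python
# Toy ratings in (userId, movieId, rating) form
ratings_rows = [
    (73, 1097, 4.0),
    (561, 924, 3.5),
    (157, 260, 3.5),
    (358, 1210, 5.0),
    (130, 316, 2.0),
]

def build_index(values):
    """Map each distinct raw ID to a contiguous index starting at 1.

    Index 0 is left free for IDs never seen in training (an assumption).
    """
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}

user_index = build_index(u for u, _, _ in ratings_rows)
item_index = build_index(m for _, m, _ in ratings_rows)

# Each rating becomes (user row, item row, target) for the embedding tables
encoded = [(user_index[u], item_index[m], r) for u, m, r in ratings_rows]
print(encoded)
```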



In [ ]:

    
user,item,title = 'userId','movieId','title'



In [ ]:

    
path = untar_data(URLs.ML_SAMPLE)
path









    Out[ ]:





PosixPath('/home/ubuntu/.fastai/data/movie_lens_sample')



In [ ]:

    
ratings = pd.read_csv(path/'ratings.csv')
ratings.head()

That's all we need to create and train a model:



In [ ]:

    
data = CollabDataBunch.from_df(ratings, seed=42)



In [ ]:

    
y_range = [0,5.5]
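A minimal sketch of what y_range is for, assuming the model squashes its raw output with a sigmoid and rescales it to the given interval (this appears to be how fastai v1 treats y_range in its collab model; the function name `scale_to_range` is made up for illustration). Because a sigmoid only approaches its limits, an upper bound slightly above the maximum rating of 5 lets predictions actually reach 5.

```python
import math

def scale_to_range(raw, y_range=(0, 5.5)):
    """Squash a raw score with a sigmoid, then stretch it to [lo, hi]."""
    lo, hi = y_range
    return lo + (hi - lo) / (1 + math.exp(-raw))

# Larger raw scores approach, but never exceed, the upper bound
for raw in (-4.0, 0.0, 2.0, 4.0):
    print(raw, round(scale_to_range(raw), 3))
```

With an upper bound of exactly 5, a prediction of 5 stars would need an infinitely large raw score; with 5.5, a raw score of about ln(10) is enough.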



In [ ]:

    
learn = collab_learner(data, n_factors=50, y_range=y_range)



In [ ]:

    
learn.fit_one_cycle(3, 5e-3)









    




Total time: 00:03

epoch  train_loss  valid_loss
1      1.629454    0.982241
2      0.856353    0.678751
3      0.655987    0.669647

Movielens 100k

Let's try the full Movielens 100k dataset, available from http://files.grouplens.org/datasets/movielens/ml-100k.zip



In [ ]:

    
path=Config.data_path()/'ml-100k'



In [ ]:

    
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=[user,item,'rating','timestamp'])
ratings.head()



In [ ]:

    
movies = pd.read_csv(path/'u.item',  delimiter='|', encoding='latin-1', header=None,
                    names=[item, 'title', 'date', 'N', 'url', *[f'g{i}' for i in range(19)]])
movies.head()









    Out[ ]:







  
    
      
   movieId  title              date         N    url                                                g0  g1  g2  g3  g4  ...  g9  g10  g11  g12  g13  g14  g15  g16  g17  g18
0  1        Toy Story (1995)   01-Jan-1995  NaN  http://us.imdb.com/M/title-exact?Toy%20Story%2...  0   0   0   1   1   ...  0   0    0    0    0    0    0    0    0    0
1  2        GoldenEye (1995)   01-Jan-1995  NaN  http://us.imdb.com/M/title-exact?GoldenEye%20(...  0   1   1   0   0   ...  0   0    0    0    0    0    0    1    0    0
2  3        Four Rooms (1995)  01-Jan-1995  NaN  http://us.imdb.com/M/title-exact?Four%20Rooms%...  0   0   0   0   0   ...  0   0    0    0    0    0    0    1    0    0
3  4        Get Shorty (1995)  01-Jan-1995  NaN  http://us.imdb.com/M/title-exact?Get%20Shorty%...  0   1   0   0   0   ...  0   0    0    0    0    0    0    0    0    0
4  5        Copycat (1995)     01-Jan-1995  NaN  http://us.imdb.com/M/title-exact?Copycat%20(1995)  0   0   0   0   0   ...  0   0    0    0    0    0    0    1    0    0

5 rows × 24 columns



In [ ]:

    
len(ratings)









    Out[ ]:





100000



In [ ]:

    
rating_movie = ratings.merge(movies[[item, title]])
rating_movie.head()









    Out[ ]:







  
    
      
   userId  movieId  rating  timestamp  title
0  196     242      3       881250949  Kolya (1996)
1  63      242      3       875747190  Kolya (1996)
2  226     242      5       883888671  Kolya (1996)
3  154     242      3       879138235  Kolya (1996)
4  306     242      5       876503793  Kolya (1996)



In [ ]:

    
data = CollabDataBunch.from_df(rating_movie, seed=42, valid_pct=0.1, item_name=title)



In [ ]:

    
data.show_batch()









    




        
userId  title                                   target
126     Event Horizon (1997)                    1.0
44      Young Frankenstein (1974)               4.0
718     Star Trek: First Contact (1996)         4.0
506     Magnificent Seven, The (1954)           5.0
373     Good, The Bad and The Ugly, The (1966)  3.0



In [ ]:

    
y_range = [0,5.5]



In [ ]:

    
learn = collab_learner(data, n_factors=40, y_range=y_range, wd=1e-1)



In [ ]:

    
learn.lr_find()
learn.recorder.plot(skip_end=15)









    



LR Finder is complete, type {learner_name}.recorder.plot() to see the graph.



In [ ]:

    
learn.fit_one_cycle(5, 5e-3)









    




Total time: 00:30

epoch  train_loss  valid_loss
1      0.923900    0.946068
2      0.865458    0.890646
3      0.783896    0.836753
4      0.638374    0.815428
5      0.561979    0.814652



In [ ]:

    
learn.save('dotprod')

Here are some benchmarks on the same dataset for the popular Librec system for collaborative filtering. The best result they report is an RMSE of 0.91, which corresponds to an MSE of 0.91**2 = 0.83.
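The comparison above is plain arithmetic, sketched here for convenience. It assumes the loss reported during training is a mean squared error, so the benchmark RMSE has to be squared (or our loss square-rooted) before the two numbers can be compared. The validation loss is the final-epoch value from the training table in this notebook.

```python
import math

benchmark_rmse = 0.91      # best Librec RMSE quoted in the text
valid_mse = 0.814652       # final-epoch valid_loss from the run above

benchmark_mse = benchmark_rmse ** 2   # put the benchmark on the MSE scale
our_rmse = math.sqrt(valid_mse)       # or put our loss on the RMSE scale

print(f"benchmark MSE ~ {benchmark_mse:.3f}")
print(f"our RMSE      ~ {our_rmse:.3f}")
```

Either way, the model trained here comes out slightly ahead of the quoted benchmark on this particular validation split.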

Interpretation

Setup



In [ ]:

    
learn.load('dotprod');



In [ ]:

    
learn.model









    Out[ ]:





EmbeddingDotBias(
  (u_weight): Embedding(944, 40)
  (i_weight): Embedding(1654, 40)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1654, 1)
)
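The printed model holds one embedding matrix and one bias vector each for users and items. A minimal NumPy sketch of the raw score such a model could compute for a (user, item) pair follows: the dot product of the two 40-dimensional vectors plus both biases. The random weights, the `raw_score` helper, and the indices used are illustrative assumptions, not values from the trained learner, and the step that rescales the score into y_range is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 944, 1654, 40   # sizes from the model printout

# Random stand-ins for the learned parameters
u_weight = rng.normal(scale=0.01, size=(n_users, n_factors))
i_weight = rng.normal(scale=0.01, size=(n_items, n_factors))
u_bias = np.zeros(n_users)
i_bias = np.zeros(n_items)

def raw_score(users, items):
    """Dot product of user and item vectors, plus the two bias terms."""
    dot = (u_weight[users] * i_weight[items]).sum(axis=1)
    return dot + u_bias[users] + i_bias[items]

users = np.array([1, 2, 3])       # hypothetical user indices
items = np.array([10, 20, 30])    # hypothetical item indices
print(raw_score(users, items).shape)
```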



In [ ]:

    
g = rating_movie.groupby(title)['rating'].count()
top_movies = g.sort_values(ascending=False).index.values[:1000]
top_movies[:10]









    Out[ ]:





array(['Star Wars (1977)', 'Contact (1997)', 'Fargo (1996)', 'Return of the Jedi (1983)', 'Liar Liar (1997)',
       'English Patient, The (1996)', 'Scream (1996)', 'Toy Story (1995)', 'Air Force One (1997)',
       'Independence Day (ID4) (1996)'], dtype=object)

Movie bias



In [ ]:

    
movie_bias = learn.bias(top_movies, is_item=True)
movie_bias.shape









    Out[ ]:





torch.Size([1000])



In [ ]:

    
mean_ratings = rating_movie.groupby(title)['rating'].mean()
movie_ratings = [(b, i, mean_ratings.loc[i]) for i,b in zip(top_movies,movie_bias)]



In [ ]:

    
item0 = lambda o:o[0]



In [ ]:

    
sorted(movie_ratings, key=item0)[:15]









    Out[ ]:





[(tensor(-0.3667),
  'Children of the Corn: The Gathering (1996)',
  1.3157894736842106),
 (tensor(-0.3142),
  'Lawnmower Man 2: Beyond Cyberspace (1996)',
  1.7142857142857142),
 (tensor(-0.2926), 'Mortal Kombat: Annihilation (1997)', 1.9534883720930232),
 (tensor(-0.2708), 'Cable Guy, The (1996)', 2.339622641509434),
 (tensor(-0.2669), 'Striptease (1996)', 2.2388059701492535),
 (tensor(-0.2641), 'Free Willy 3: The Rescue (1997)', 1.7407407407407407),
 (tensor(-0.2511), 'Beautician and the Beast, The (1997)', 2.313953488372093),
 (tensor(-0.2418), 'Bio-Dome (1996)', 1.903225806451613),
 (tensor(-0.2345), "Joe's Apartment (1996)", 2.2444444444444445),
 (tensor(-0.2324), 'Island of Dr. Moreau, The (1996)', 2.1578947368421053),
 (tensor(-0.2266), 'Barb Wire (1996)', 1.9333333333333333),
 (tensor(-0.2219), 'Crow: City of Angels, The (1996)', 1.9487179487179487),
 (tensor(-0.2208), 'Grease 2 (1982)', 2.0),
 (tensor(-0.2151), 'Home Alone 3 (1997)', 1.894736842105263),
 (tensor(-0.2089), "McHale's Navy (1997)", 2.1884057971014492)]



In [ ]:

    
sorted(movie_ratings, key=lambda o: o[0], reverse=True)[:15]









    Out[ ]:





[(tensor(0.5913), "Schindler's List (1993)", 4.466442953020135),
 (tensor(0.5700), 'Titanic (1997)', 4.2457142857142856),
 (tensor(0.5623), 'Shawshank Redemption, The (1994)', 4.445229681978798),
 (tensor(0.5412), 'L.A. Confidential (1997)', 4.161616161616162),
 (tensor(0.5368), 'Rear Window (1954)', 4.3875598086124405),
 (tensor(0.5193), 'Star Wars (1977)', 4.3584905660377355),
 (tensor(0.5149), 'As Good As It Gets (1997)', 4.196428571428571),
 (tensor(0.5114), 'Silence of the Lambs, The (1991)', 4.28974358974359),
 (tensor(0.5097), 'Good Will Hunting (1997)', 4.262626262626263),
 (tensor(0.4946), 'Vertigo (1958)', 4.251396648044692),
 (tensor(0.4899), 'Godfather, The (1972)', 4.283292978208232),
 (tensor(0.4855), 'Boot, Das (1981)', 4.203980099502488),
 (tensor(0.4769), 'Usual Suspects, The (1995)', 4.385767790262173),
 (tensor(0.4743), 'Casablanca (1942)', 4.45679012345679),
 (tensor(0.4665), 'Close Shave, A (1995)', 4.491071428571429)]

Movie weights



In [ ]:

    
movie_w = learn.weight(top_movies, is_item=True)
movie_w.shape









    Out[ ]:





torch.Size([1000, 40])



In [ ]:

    
movie_pca = movie_w.pca(3)
movie_pca.shape









    Out[ ]:





torch.Size([1000, 3])
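For intuition, pca(3) reduces each 40-dimensional movie vector to 3 components. The sketch below does a comparable reduction with NumPy's SVD on random stand-in data. It is not fastai's implementation, which may differ in details, and `pca_sketch` is a hypothetical helper.

```python
import numpy as np

def pca_sketch(x, k):
    """Project the rows of x onto their top-k principal directions."""
    x_centered = x - x.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(x_centered, full_matrices=False)
    return x_centered @ vt[:k].T

rng = np.random.default_rng(0)
fake_movie_w = rng.normal(size=(1000, 40))   # stand-in for movie_w
fake_pca = pca_sketch(fake_movie_w, 3)
print(fake_pca.shape)
```

Each output column is one component, so transposing the result gives one vector per component, as the next cell does with fac0, fac1 and fac2.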



In [ ]:

    
fac0,fac1,fac2 = movie_pca.t()
movie_comp = [(f, i) for f,i in zip(fac0, top_movies)]



In [ ]:

    
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]









    Out[ ]:





[(tensor(1.2412), 'Home Alone 3 (1997)'),
 (tensor(1.2072), 'Jungle2Jungle (1997)'),
 (tensor(1.2000), 'Bio-Dome (1996)'),
 (tensor(1.1883), 'Leave It to Beaver (1997)'),
 (tensor(1.1570), 'Children of the Corn: The Gathering (1996)'),
 (tensor(1.1309), "McHale's Navy (1997)"),
 (tensor(1.1187), 'D3: The Mighty Ducks (1996)'),
 (tensor(1.0956), 'Congo (1995)'),
 (tensor(1.0950), 'Free Willy 3: The Rescue (1997)'),
 (tensor(1.0524), 'Cutthroat Island (1995)')]



In [ ]:

    
sorted(movie_comp, key=itemgetter(0))[:10]









    Out[ ]:





[(tensor(-1.0692), 'Casablanca (1942)'),
 (tensor(-1.0523), 'Close Shave, A (1995)'),
 (tensor(-1.0142), 'When We Were Kings (1996)'),
 (tensor(-1.0075), 'Lawrence of Arabia (1962)'),
 (tensor(-1.0034), 'Wrong Trousers, The (1993)'),
 (tensor(-0.9905), 'Chinatown (1974)'),
 (tensor(-0.9692), 'Ran (1985)'),
 (tensor(-0.9541), 'Apocalypse Now (1979)'),
 (tensor(-0.9523), 'Wallace & Gromit: The Best of Aardman Animation (1996)'),
 (tensor(-0.9369), 'Some Folks Call It a Sling Blade (1993)')]



In [ ]:

    
movie_comp = [(f, i) for f,i in zip(fac1, top_movies)]



In [ ]:

    
sorted(movie_comp, key=itemgetter(0), reverse=True)[:10]









    Out[ ]:





[(tensor(0.8788), 'Ready to Wear (Pret-A-Porter) (1994)'),
 (tensor(0.8263), 'Keys to Tulsa (1997)'),
 (tensor(0.8066), 'Nosferatu (Nosferatu, eine Symphonie des Grauens) (1922)'),
 (tensor(0.7730), 'Dead Man (1995)'),
 (tensor(0.7513), 'Three Colors: Blue (1993)'),
 (tensor(0.7492), 'Trainspotting (1996)'),
 (tensor(0.7414), 'Cable Guy, The (1996)'),
 (tensor(0.7330), 'Jude (1996)'),
 (tensor(0.7246), 'Clockwork Orange, A (1971)'),
 (tensor(0.7195), 'Stupids, The (1996)')]



In [ ]:

    
sorted(movie_comp, key=itemgetter(0))[:10]









    Out[ ]:





[(tensor(-1.2148), 'Braveheart (1995)'),
 (tensor(-1.1153), 'Titanic (1997)'),
 (tensor(-1.1148), 'Raiders of the Lost Ark (1981)'),
 (tensor(-0.8795), "It's a Wonderful Life (1946)"),
 (tensor(-0.8644), "Mr. Holland's Opus (1995)"),
 (tensor(-0.8619), 'Star Wars (1977)'),
 (tensor(-0.8558), 'Return of the Jedi (1983)'),
 (tensor(-0.8526), 'Pretty Woman (1990)'),
 (tensor(-0.8453), 'Independence Day (ID4) (1996)'),
 (tensor(-0.8450), 'Forrest Gump (1994)')]



In [ ]:

    
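# Pick 50 movies to plot. The random draw is immediately overridden by the
# next line, so the plot always shows the 50 most-rated movies.
idxs = np.random.choice(len(top_movies), 50, replace=False)
idxs = list(range(50))
X = fac0[idxs]
Y = fac2[idxs]
plt.figure(figsize=(15,15))
plt.scatter(X, Y)
# Label each point with its movie title in a random, fairly dark color
for i, x, y in zip(top_movies[idxs], X, Y):
    plt.text(x,y,i, color=np.random.rand(3)*0.7, fontsize=11)
plt.show()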

Output of ratings.head() for the sample dataset:

   userId  movieId  rating  timestamp
0  73      1097     4.0     1255504951
1  561     924      3.5     1172695223
2  157     260      3.5     1291598691
3  358     1210     5.0     957481884
4  130     316      2.0     1138999234

Output of ratings.head() for the ml-100k dataset:

   userId  movieId  rating  timestamp
0  196     242      3       881250949
1  186     302      3       891717742
2  22      377      1       878887116
3  244     51       2       880606923
4  166     346      1       886397596
