02 - Surprise Recommender System

Use a well-supported recommender package
Instead of homebrew matrix decomposition



In [99]:

    
import pandas as pd
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from jlab import load_test_data, get_test_detector_plane

Load up and prep the datasets

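Surprise requires a User, Item, Rating system
"Ratings" also need to be on the same scale with the same min/max values
Use melt and a scikit-learn scaler to achieve these things
In the spirit of the movie ratings system that's popularly used with Surprise, the first scaler cell sets Min/Max to 1/5
The StandardScaler cell after it replaces that scaler, so the values shown in the rest of this notebook are standardized and the rating scale is taken from the data's overall min/max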
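
As a small illustration of the melt step described above, the sketch below reshapes a toy wide table (made-up values, hypothetical column names) into the long track_id / variable / value layout that Surprise expects. It is not the notebook's data, just the shape of the transformation.

```python
import pandas as pd

# Toy wide table: one row per track, one column per kinematic (values are made up).
wide = pd.DataFrame({"x": [0.1, 0.2], "px": [1.5, -0.3], "x1": [0.4, 0.6]})
wide.index.name = "track_id"

# reset_index() turns the index into a column so melt can use it as the "user" id.
long = wide.reset_index().melt(id_vars=["track_id"])

print(long)
```

Each (track, kinematic) pair becomes one row, so a table with 2 tracks and 3 columns melts into 6 rows.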



In [2]:

    
scaler = MinMaxScaler(feature_range=(1,5))



In [100]:

    
scaler = StandardScaler()



In [101]:

    
# Load, fit the scaler, transform
X_train = pd.read_csv('MLchallenge2_training.csv')
X_train_scaled_values = scaler.fit_transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled_values, columns=X_train.columns,
                              index=X_train.index)

# Load, transform
X_test = load_test_data('test_in.csv')
X_test_scaled_values = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled_values, columns=X_test.columns,
                             index=X_test.index)
# While we're at it, get the detector plane that'll be used for evaluation
eval_planes = get_test_detector_plane(X_test)

# Combine datasets
X = (pd.concat([X_test_scaled, X_train_scaled], axis=0)
     .reset_index(drop=True))

# Melt the dataframe into a user/item/rating format
# For our purposes, it's trackID / kinematic / value
X.index.name = "track_id"
X_melt = X.reset_index().melt(id_vars=['track_id'])

# Also, load our truth values
X_true = pd.read_csv('test_prediction.csv', names=['x', 'y', 'px', 'py', 'pz'],
                     header=None)
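
One thing to watch in the cell above: melt keeps a row for every cell, including the NaN entries for missing detector planes. Whether Surprise tolerates NaN ratings is not something this notebook verifies, so the sketch below (toy data, made-up values) shows one option, dropping those rows before building a Dataset. Treat it as an assumption about what is desirable here, not as a required step.

```python
import numpy as np
import pandas as pd

# Toy table with a missing value, standing in for a track with unobserved planes.
wide = pd.DataFrame({"x": [0.1, 0.2], "x1": [0.4, np.nan]})
wide.index.name = "track_id"

long = wide.reset_index().melt(id_vars=["track_id"])

# Keep only the (track, kinematic) pairs that actually have a value.
observed = long.dropna(subset=["value"])

print(len(long), len(observed))
```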



In [102]:

    
X.head()









    Out[102]:

	x	y	z	px	py	pz	x1	y1	z1	px1	...	z23	px23	py23	pz23	x24	y24	z24	px24	py24	pz24
track_id
0	0.074161	0.112144	0.0	-1.622264	-0.354469	0.498673	-0.762725	0.024993	-0.173963	-1.690999	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	0.066620	-0.202530	0.0	0.684269	2.861297	0.705761	0.526532	1.105833	-0.173963	1.369923	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	-1.086878	-2.191303	0.0	0.425034	-0.142295	-1.191579	-0.542333	-2.184630	-0.173963	0.178825	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	1.530702	0.234789	0.0	-0.126673	0.454445	-0.173494	1.289792	0.485696	-0.173963	0.085667	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	1.290224	-1.584697	0.0	-0.066850	-0.075990	0.443141	1.077120	-1.405488	-0.173963	-0.093996	...	0.146014	-0.09545	-0.01542	0.440823	NaN	NaN	0.101972	NaN	NaN	NaN

5 rows × 150 columns



In [103]:

    
X_melt.sample(10)



In [105]:

    
X_true.head()



In [109]:

    
MIN = X.min().min()



In [110]:

    
MAX = X.max().max()

Train some Surprise predictors



In [111]:

    
from surprise import (
    SVD, SVDpp, SlopeOne, NMF, CoClustering, 
    KNNBasic, KNNWithMeans, KNNWithZScore,
    NormalPredictor, BaselineOnly
)

Simple workflow

Train with just 1k full tracks
Train set (with all detectors) starts at track_id 10000, so the query below takes track_ids 10000 through 10999



In [112]:

    
# A reader is still needed, but only the rating_scale param is required.
reader = Reader(rating_scale=(MIN, MAX))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(X_melt[['track_id', 'variable', 'value']]
                            .query('track_id >= 10000 and track_id < 11000'),
                            reader)



In [113]:

    
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)









    



Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2397  0.2667  0.2341  0.2468  0.0142  
MAE (testset)     0.1124  0.1067  0.1121  0.1104  0.0026  
Fit time          6.12    6.09    6.11    6.11    0.01    
Test time         6.07    0.54    0.31    2.31    2.66    






    Out[113]:





{'test_rmse': array([0.23972101, 0.2666763 , 0.23408972]),
 'test_mae': array([0.11241622, 0.1066782 , 0.1121256 ]),
 'fit_time': (6.117186069488525, 6.09119176864624, 6.1097939014434814),
 'test_time': (6.071123123168945, 0.5360040664672852, 0.3131752014160156)}
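
The Mean and Std columns in the printed table can be recovered from the returned dict. A quick check using the `test_rmse` values from the output above; note that the table's Std matches NumPy's default population standard deviation.

```python
import numpy as np

# test_rmse values copied from the cross_validate output above.
test_rmse = np.array([0.23972101, 0.2666763, 0.23408972])

mean_rmse = test_rmse.mean()
std_rmse = test_rmse.std()  # population std (ddof=0), NumPy's default

print(f"{mean_rmse:.4f} {std_rmse:.4f}")  # prints "0.2468 0.0142"
```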

Give them all a shot

See which ones to pursue



In [114]:

    
algo_dict = {'SVD': SVD(),
             'SVDpp': SVDpp(),
             'SlopeOne': SlopeOne(),
             'CoClustering': CoClustering(),
             'KNNWithMeans': KNNWithMeans(),
             'NormalPredictor': NormalPredictor(),
             'BaselineOnly': BaselineOnly()}

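# NOTE: the output below is from an earlier run whose dict also included NMF,
# which raised the ZeroDivisionError shown at the end.
for name, model in algo_dict.items():
    print(name)
    print(cross_validate(model, data, measures=['RMSE', 'MAE'], cv=3, verbose=True))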









    



SVD
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2289  0.2759  0.2260  0.2436  0.0229  
MAE (testset)     0.1130  0.1091  0.1083  0.1102  0.0021  
Fit time          6.20    6.00    5.95    6.05    0.11    
Test time         0.50    0.47    0.49    0.49    0.01    
{'test_rmse': array([0.22891096, 0.2759472 , 0.22598352]), 'test_mae': array([0.11304235, 0.10914811, 0.10827262]), 'fit_time': (6.203150033950806, 5.996194839477539, 5.954233169555664), 'test_time': (0.49585604667663574, 0.47057509422302246, 0.4896547794342041)}
SVDpp
Evaluating RMSE, MAE of algorithm SVDpp on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2503  0.2606  0.3129  0.2746  0.0274  
MAE (testset)     0.1541  0.1676  0.1684  0.1634  0.0066  
Fit time          134.15  133.91  133.54  133.86  0.25    
Test time         6.18    6.39    6.36    6.31    0.09    
{'test_rmse': array([0.25026616, 0.26060685, 0.31291403]), 'test_mae': array([0.15407927, 0.16755922, 0.16842247]), 'fit_time': (134.14813780784607, 133.90549397468567, 133.53579807281494), 'test_time': (6.184978723526001, 6.390493869781494, 6.357921123504639)}
SlopeOne
Evaluating RMSE, MAE of algorithm SlopeOne on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9088  0.8928  0.8996  0.9004  0.0066  
MAE (testset)     0.6598  0.6610  0.6626  0.6612  0.0011  
Fit time          0.45    0.41    0.43    0.43    0.02    
Test time         5.02    4.96    5.01    5.00    0.03    
{'test_rmse': array([0.90875587, 0.89276161, 0.89959105]), 'test_mae': array([0.65984765, 0.66103708, 0.66262992]), 'fit_time': (0.4522852897644043, 0.4134690761566162, 0.43015003204345703), 'test_time': (5.022490978240967, 4.957015752792358, 5.010500907897949)}
NMF






    



---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-114-fae06b225fb7> in <module>
     10 for algo in algo_dict:
     11     print(algo)
---> 12     print(cross_validate(algo_dict[algo], data, measures=['RMSE', 'MAE'], cv=3, verbose=True))

//anaconda3/lib/python3.7/site-packages/surprise/model_selection/validation.py in cross_validate(algo, data, measures, cv, return_train_measures, n_jobs, pre_dispatch, verbose)
     99                                            return_train_measures)
    100                     for (trainset, testset) in cv.split(data))
--> 101     out = Parallel(n_jobs=n_jobs, pre_dispatch=pre_dispatch)(delayed_list)
    102 
    103     (test_measures_dicts,

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

//anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

//anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

//anaconda3/lib/python3.7/site-packages/surprise/model_selection/validation.py in fit_and_score(algo, trainset, testset, measures, return_train_measures)
    162 
    163     start_fit = time.time()
--> 164     algo.fit(trainset)
    165     fit_time = time.time() - start_fit
    166     start_test = time.time()

//anaconda3/lib/python3.7/site-packages/surprise/prediction_algorithms/matrix_factorization.pyx in surprise.prediction_algorithms.matrix_factorization.NMF.fit()

//anaconda3/lib/python3.7/site-packages/surprise/prediction_algorithms/matrix_factorization.pyx in surprise.prediction_algorithms.matrix_factorization.NMF.sgd()

ZeroDivisionError: float division
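
Because the loop stops at the first exception, any algorithms queued after NMF never ran. A minimal sketch of one way around that: wrap each run in try/except and record the failure. `run_all` and the stand-in callables are hypothetical; in the notebook each callable would wrap a `cross_validate` call.

```python
def run_all(runs):
    """Call each zero-argument callable, collecting results and errors by name."""
    results, errors = {}, {}
    for name, run in runs.items():
        try:
            results[name] = run()
        except Exception as exc:  # keep going; record what failed and why
            errors[name] = repr(exc)
    return results, errors


# Stand-ins for cross_validate calls: one succeeds, one fails, one succeeds.
runs = {
    "ok_first": lambda: 0.24,
    "fails": lambda: 1 / 0,
    "ok_last": lambda: 0.27,
}

results, errors = run_all(runs)
print(results)
print(errors)
```

The failing entry ends up in `errors` and the run after it still completes.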

SVD, SVDpp, and KNN do well

Probably do even better with more data, but it takes time...
Move forward with SVDpp
Train on the tracks with track_id < 50000, then create a pred workflow for the detector of choice



In [115]:

    
data = Dataset.load_from_df(X_melt[['track_id', 'variable', 'value']]
                            .query('track_id < 50000'),
                            reader)
# Name the trainset `train`, which is what algo.fit(train) expects
train = data.build_full_trainset()



In [116]:

    
algo = SVDpp(n_factors=20, n_epochs=20)



In [117]:

    
algo.fit(train)









    Out[117]:





<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x1a739ba2b0>

Time to make predictions

Make a copy of our X_test
For each track, for each plane that we need to predict, predict x, y, px, py, pz



In [118]:

    
def get_kinematic_pred(algo, track_id, kinematic):
    return algo.predict(track_id, kinematic).est



In [119]:

    
def get_track_kinematic_pred_for_plane(algo, track_id, plane):
    kinematics = [k + str(int(plane))
                  for k in ['x', 'y', 'px', 'py', 'pz']]
    plane_dict = {kin: get_kinematic_pred(algo, track_id, kin)
                  for kin in kinematics}
    return plane_dict



In [120]:

    
get_track_kinematic_pred_for_plane(algo, 0, 15)









    Out[120]:





{'x15': 3.033402843600961,
 'y15': 2.978028441579763,
 'px15': 2.9850219111344884,
 'py15': 3.021867014894096,
 'pz15': 3.0038923592644524}



In [121]:

    
def fill_eval_plane_for_track(algo, X, track_id):
    plane = get_test_detector_plane(X.loc[track_id])
    plane_dict = get_track_kinematic_pred_for_plane(algo, track_id, plane)
    for kin in plane_dict:
        X.loc[track_id, kin] = plane_dict[kin]



In [122]:

    
X_pred_scaled = X_test_scaled.copy()



In [123]:

    
for ix in X_pred_scaled.index.values:
    fill_eval_plane_for_track(algo, X_pred_scaled, ix)



In [124]:

    
X_pred_values = scaler.inverse_transform(X_pred_scaled)
X_pred = pd.DataFrame(X_pred_values, columns=X_pred_scaled.columns,
                      index=X_pred_scaled.index)
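
The round trip above relies on the scaler fitted on the training data. A small sketch with made-up numbers showing that `StandardScaler` skips NaN when fitting statistics, leaves NaN in place when transforming, and that `inverse_transform` recovers the original units. The arrays and `toy_` names are hypothetical stand-ins for the train and test frames.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up "train" and "test" arrays; the test row has a missing value.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, np.nan]])

toy_scaler = StandardScaler().fit(toy_train)   # statistics come from train only
toy_test_scaled = toy_scaler.transform(toy_test)
had_nan = bool(np.isnan(toy_test_scaled[0, 1]))  # NaN stays NaN after transform

# Fill the missing entry with a prediction made in scaled space (here 0.0,
# i.e. the training mean), then map everything back to original units.
toy_test_scaled[0, 1] = 0.0
toy_test_back = toy_scaler.inverse_transform(toy_test_scaled)
print(had_nan, toy_test_back)
```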

Spot check!



In [125]:

    
for track in [20, 50, 1000, 5000]:
    plane = get_test_detector_plane(X_test.loc[track])
    print("PRED:\n", X_pred.loc[track, [kin + str(int(plane))
                             for kin in ['x', 'y', 'px', 'py', 'pz']]],
          "\n")
    print("TRUE:\n", X_true.loc[track], "\n\n-------------\n")









    



PRED:
 x11     50.211999
y11     49.927170
px11     0.450275
py11     0.454208
pz11     4.567603
Name: 20, dtype: float64 

TRUE:
 x      4.108309
y     19.228528
px    -0.039872
py     0.212423
pz     2.011996
Name: 20, dtype: float64 

-------------

PRED:
 x20     62.445658
y20     61.497072
px20     0.445079
py20     0.449278
pz20     4.566731
Name: 50, dtype: float64 

TRUE:
 x     18.472740
y    -19.001599
px     0.082835
py    -0.189903
pz     2.080941
Name: 50, dtype: float64 

-------------

PRED:
 x20     62.445658
y20     61.497072
px20     0.445079
py20     0.449278
pz20     4.566731
Name: 1000, dtype: float64 

TRUE:
 x     10.736746
y     13.892162
px     0.025428
py    -0.020759
pz     1.876688
Name: 1000, dtype: float64 

-------------

PRED:
 x13     57.438868
y13     56.440951
px13     0.450998
py13     0.455674
pz13     4.567316
Name: 5000, dtype: float64 

TRUE:
 x     -1.534074
y     39.745310
px     0.062563
py     0.165445
pz     1.700621
Name: 5000, dtype: float64 

-------------
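
Eyeballing four tracks only goes so far. A sketch of turning the comparison into numbers, using the PRED and TRUE values printed above for track 20; with the full frames the same subtraction would run over every track. The variable names are hypothetical.

```python
import pandas as pd

kinematics = ["x", "y", "px", "py", "pz"]

# Values copied from the track 20 printout above (predicted plane 11 vs truth).
pred_20 = pd.Series([50.211999, 49.927170, 0.450275, 0.454208, 4.567603],
                    index=kinematics)
true_20 = pd.Series([4.108309, 19.228528, -0.039872, 0.212423, 2.011996],
                    index=kinematics)

# Absolute error per kinematic, in original (unscaled) units.
abs_err_20 = (pred_20 - true_20).abs()
print(abs_err_20.round(3))
```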

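I've tinkered around a bit, and this is discouraging: the predictions barely change from track to track (tracks 50 and 1000 get identical values for plane 20) and land nowhere near the true values.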

Time to go back to the drawing board and try out a boring old sequence model :(

Output of X_melt.sample(10) (In [103]):

	track_id	variable	value
16769675	196994	px13	-0.313329
9507386	95740	py7	0.581009
13002815	112952	px10	1.071682
10043455	18006	y8	-1.700605
15673042	123366	py12	0.374578
29700965	33820	y24	-2.396870
14251962	134493	px11	0.096533
1672225	35417	z1	-0.173963
21452218	173714	z17	0.018756
23136029	16116	pz18	1.313962

Output of X_true.head() (In [105]):

	x	y	px	py	pz
0	-23.123945	3.142886	-0.235592	0.091612	2.413377
1	19.633486	32.319292	0.314376	0.316425	2.592952
2	-8.308506	-39.299613	-0.020097	-0.051232	0.948906
3	19.918838	10.664617	0.038102	0.047740	1.864014
4	13.649239	-20.616935	-0.015548	0.001471	2.323953
