02 - Surprise Recommender System

  • Use a well-supported recommender package
  • Instead of homebrew matrix decomposition

In [99]:
import pandas as pd
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from jlab import load_test_data, get_test_detector_plane

Load up and prep the datasets

  • Surprise requires a User, Item, Rating system
  • "Ratings" also need to be on the same scale with the same min/max values
  • Use melt and MinMaxScaler to achieve these things
  • In the spirit of the movie ratings system that's popularly used with Surprise, let's set Min/Max to 1/5

In [2]:
scaler = MinMaxScaler(feature_range=(1,5))

In [100]:
scaler = StandardScaler()

In [101]:
# Load, fit the scaler, transform
X_train = pd.read_csv('MLchallenge2_training.csv')
X_train_scaled_values = scaler.fit_transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled_values, columns=X_train.columns,
                              index=X_train.index)

# Load, transform
X_test = load_test_data('test_in.csv')
X_test_scaled_values = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled_values, columns=X_test.columns,
                             index=X_test.index)
# While we're at it, get the detector plane that'll be used for evaluation
eval_planes = get_test_detector_plane(X_test)

# Combine datasets
X = (pd.concat([X_test_scaled, X_train_scaled], axis=0)
     .reset_index(drop=True))

# Melt the dataframe into a user/item/rating format
# For our purposes, it's trackID / kinematic / value
X.index.name = "track_id"
X_melt = X.reset_index().melt(id_vars=['track_id'])

# Also, load our truth values
X_true = pd.read_csv('test_prediction.csv', names=['x', 'y', 'px', 'py', 'pz'],
                     header=None)

In [102]:
X.head()


Out[102]:
x y z px py pz x1 y1 z1 px1 ... z23 px23 py23 pz23 x24 y24 z24 px24 py24 pz24
track_id
0 0.074161 0.112144 0.0 -1.622264 -0.354469 0.498673 -0.762725 0.024993 -0.173963 -1.690999 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 0.066620 -0.202530 0.0 0.684269 2.861297 0.705761 0.526532 1.105833 -0.173963 1.369923 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1.086878 -2.191303 0.0 0.425034 -0.142295 -1.191579 -0.542333 -2.184630 -0.173963 0.178825 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 1.530702 0.234789 0.0 -0.126673 0.454445 -0.173494 1.289792 0.485696 -0.173963 0.085667 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1.290224 -1.584697 0.0 -0.066850 -0.075990 0.443141 1.077120 -1.405488 -0.173963 -0.093996 ... 0.146014 -0.09545 -0.01542 0.440823 NaN NaN 0.101972 NaN NaN NaN

5 rows × 150 columns


In [103]:
X_melt.sample(10)


Out[103]:
track_id variable value
16769675 196994 px13 -0.313329
9507386 95740 py7 0.581009
13002815 112952 px10 1.071682
10043455 18006 y8 -1.700605
15673042 123366 py12 0.374578
29700965 33820 y24 -2.396870
14251962 134493 px11 0.096533
1672225 35417 z1 -0.173963
21452218 173714 z17 0.018756
23136029 16116 pz18 1.313962

In [105]:
X_true.head()


Out[105]:
x y px py pz
0 -23.123945 3.142886 -0.235592 0.091612 2.413377
1 19.633486 32.319292 0.314376 0.316425 2.592952
2 -8.308506 -39.299613 -0.020097 -0.051232 0.948906
3 19.918838 10.664617 0.038102 0.047740 1.864014
4 13.649239 -20.616935 -0.015548 0.001471 2.323953

In [109]:
MIN = X.min().min()

In [110]:
MAX = X.max().max()

Train some Surprise predictors


In [111]:
from surprise import (
    SVD, SVDpp, SlopeOne, NMF, CoClustering, 
    KNNBasic, KNNWithMeans, KNNWithZScore,
    NormalPredictor, BaselineOnly
)

Simple workflow

  • Train with just 1k full tracks
  • Train set (with all detectors) starts after track_id 10000

In [112]:
# A reader is still needed but only the rating_scale param is requiered.
reader = Reader(rating_scale=(MIN, MAX))

# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(X_melt[['track_id', 'variable', 'value']]
                            .query('track_id >= 10000 and track_id < 11000'),
                            reader)

In [113]:
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)


Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2397  0.2667  0.2341  0.2468  0.0142  
MAE (testset)     0.1124  0.1067  0.1121  0.1104  0.0026  
Fit time          6.12    6.09    6.11    6.11    0.01    
Test time         6.07    0.54    0.31    2.31    2.66    
Out[113]:
{'test_rmse': array([0.23972101, 0.2666763 , 0.23408972]),
 'test_mae': array([0.11241622, 0.1066782 , 0.1121256 ]),
 'fit_time': (6.117186069488525, 6.09119176864624, 6.1097939014434814),
 'test_time': (6.071123123168945, 0.5360040664672852, 0.3131752014160156)}

Give them all a shot

  • See which ones to pursue

In [114]:
algo_dict = {'SVD': SVD(),
             'SVDpp': SVDpp(),
             'SlopeOne': SlopeOne(),
             'CoClustering': CoClustering(),
             'KNNWithMeans': KNNWithMeans(),
             'NormalPredictor': NormalPredictor(),
             'BaselineOnly': BaselineOnly()}

for algo in algo_dict:
    print(algo)
    print(cross_validate(algo_dict[algo], data, measures=['RMSE', 'MAE'], cv=3, verbose=True))


SVD
Evaluating RMSE, MAE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2289  0.2759  0.2260  0.2436  0.0229  
MAE (testset)     0.1130  0.1091  0.1083  0.1102  0.0021  
Fit time          6.20    6.00    5.95    6.05    0.11    
Test time         0.50    0.47    0.49    0.49    0.01    
{'test_rmse': array([0.22891096, 0.2759472 , 0.22598352]), 'test_mae': array([0.11304235, 0.10914811, 0.10827262]), 'fit_time': (6.203150033950806, 5.996194839477539, 5.954233169555664), 'test_time': (0.49585604667663574, 0.47057509422302246, 0.4896547794342041)}
SVDpp
Evaluating RMSE, MAE of algorithm SVDpp on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.2503  0.2606  0.3129  0.2746  0.0274  
MAE (testset)     0.1541  0.1676  0.1684  0.1634  0.0066  
Fit time          134.15  133.91  133.54  133.86  0.25    
Test time         6.18    6.39    6.36    6.31    0.09    
{'test_rmse': array([0.25026616, 0.26060685, 0.31291403]), 'test_mae': array([0.15407927, 0.16755922, 0.16842247]), 'fit_time': (134.14813780784607, 133.90549397468567, 133.53579807281494), 'test_time': (6.184978723526001, 6.390493869781494, 6.357921123504639)}
SlopeOne
Evaluating RMSE, MAE of algorithm SlopeOne on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9088  0.8928  0.8996  0.9004  0.0066  
MAE (testset)     0.6598  0.6610  0.6626  0.6612  0.0011  
Fit time          0.45    0.41    0.43    0.43    0.02    
Test time         5.02    4.96    5.01    5.00    0.03    
{'test_rmse': array([0.90875587, 0.89276161, 0.89959105]), 'test_mae': array([0.65984765, 0.66103708, 0.66262992]), 'fit_time': (0.4522852897644043, 0.4134690761566162, 0.43015003204345703), 'test_time': (5.022490978240967, 4.957015752792358, 5.010500907897949)}
NMF
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-114-fae06b225fb7> in <module>
     10 for algo in algo_dict:
     11     print(algo)
---> 12     print(cross_validate(algo_dict[algo], data, measures=['RMSE', 'MAE'], cv=3, verbose=True))

//anaconda3/lib/python3.7/site-packages/surprise/model_selection/validation.py in cross_validate(algo, data, measures, cv, return_train_measures, n_jobs, pre_dispatch, verbose)
     99                                            return_train_measures)
    100                     for (trainset, testset) in cv.split(data))
--> 101     out = Parallel(n_jobs=n_jobs, pre_dispatch=pre_dispatch)(delayed_list)
    102 
    103     (test_measures_dicts,

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self, iterable)
    919             # remaining jobs.
    920             self._iterating = False
--> 921             if self.dispatch_one_batch(iterator):
    922                 self._iterating = self._original_iterator is not None
    923 

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    757                 return False
    758             else:
--> 759                 self._dispatch(tasks)
    760                 return True
    761 

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in _dispatch(self, batch)
    714         with self._lock:
    715             job_idx = len(self._jobs)
--> 716             job = self._backend.apply_async(batch, callback=cb)
    717             # A job can complete so quickly than its callback is
    718             # called before we get here, causing self._jobs to

//anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    180     def apply_async(self, func, callback=None):
    181         """Schedule a func to be run"""
--> 182         result = ImmediateResult(func)
    183         if callback:
    184             callback(result)

//anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    547         # Don't delay the application, to avoid keeping the input
    548         # arguments in memory
--> 549         self.results = batch()
    550 
    551     def get(self):

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in __call__(self)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

//anaconda3/lib/python3.7/site-packages/joblib/parallel.py in <listcomp>(.0)
    223         with parallel_backend(self._backend, n_jobs=self._n_jobs):
    224             return [func(*args, **kwargs)
--> 225                     for func, args, kwargs in self.items]
    226 
    227     def __len__(self):

//anaconda3/lib/python3.7/site-packages/surprise/model_selection/validation.py in fit_and_score(algo, trainset, testset, measures, return_train_measures)
    162 
    163     start_fit = time.time()
--> 164     algo.fit(trainset)
    165     fit_time = time.time() - start_fit
    166     start_test = time.time()

//anaconda3/lib/python3.7/site-packages/surprise/prediction_algorithms/matrix_factorization.pyx in surprise.prediction_algorithms.matrix_factorization.NMF.fit()

//anaconda3/lib/python3.7/site-packages/surprise/prediction_algorithms/matrix_factorization.pyx in surprise.prediction_algorithms.matrix_factorization.NMF.sgd()

ZeroDivisionError: float division

SVD, SVDpp, and KNN do well

  • Probably do even better with more data, but it takes time...
  • Move forward with SVDpp
  • Train with 1k, create pred workflow for the detector of choice

In [115]:
data = Dataset.load_from_df(X_melt[['track_id', 'variable', 'value']]
                            .query('track_id < 50000'),
                            reader)
data = data.build_full_trainset()

In [116]:
algo = SVDpp(n_factors=20, n_epochs=20)

In [117]:
algo.fit(train)


Out[117]:
<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x1a739ba2b0>

Time to make predictions

  • Make a copy of our X_test
  • For each track, for each plane that we need to predict, predict x, y, px, py, pz

In [118]:
def get_kinematic_pred(algo, track_id, kinematic):
    return algo.predict(track_id, kinematic).est

In [119]:
def get_track_kinematic_pred_for_plane(algo, track_id, plane):
    kinematics = [k + str(int(plane))
                  for k in ['x', 'y', 'px', 'py', 'pz']]
    plane_dict = {kin: get_kinematic_pred(algo, track_id, kin)
                  for kin in kinematics}
    return plane_dict

In [120]:
get_track_kinematic_pred_for_plane(algo, 0, 15)


Out[120]:
{'x15': 3.033402843600961,
 'y15': 2.978028441579763,
 'px15': 2.9850219111344884,
 'py15': 3.021867014894096,
 'pz15': 3.0038923592644524}

In [121]:
def fill_eval_plane_for_track(algo, X, track_id):
    plane = get_test_detector_plane(X.loc[track_id])
    plane_dict = get_track_kinematic_pred_for_plane(algo, track_id, plane)
    for kin in plane_dict:
        X.loc[track_id, kin] = plane_dict[kin]

In [122]:
X_pred_scaled = X_test_scaled.copy()

In [123]:
for ix in X_pred_scaled.index.values:
    fill_eval_plane_for_track(algo, X_pred_scaled, ix)

In [124]:
X_pred_values = scaler.inverse_transform(X_pred_scaled)
X_pred = pd.DataFrame(X_pred_values, columns=X_pred_scaled.columns,
                      index=X_pred_scaled.index)

Spot check!


In [125]:
for track in [20, 50, 1000, 5000]:
    plane = get_test_detector_plane(X_test.loc[track])
    print("PRED:\n", X_pred.loc[track, [kin + str(int(plane))
                             for kin in ['x', 'y', 'px', 'py', 'pz']]],
          "\n")
    print("TRUE:\n", X_true.loc[track], "\n\n-------------\n")


PRED:
 x11     50.211999
y11     49.927170
px11     0.450275
py11     0.454208
pz11     4.567603
Name: 20, dtype: float64 

TRUE:
 x      4.108309
y     19.228528
px    -0.039872
py     0.212423
pz     2.011996
Name: 20, dtype: float64 

-------------

PRED:
 x20     62.445658
y20     61.497072
px20     0.445079
py20     0.449278
pz20     4.566731
Name: 50, dtype: float64 

TRUE:
 x     18.472740
y    -19.001599
px     0.082835
py    -0.189903
pz     2.080941
Name: 50, dtype: float64 

-------------

PRED:
 x20     62.445658
y20     61.497072
px20     0.445079
py20     0.449278
pz20     4.566731
Name: 1000, dtype: float64 

TRUE:
 x     10.736746
y     13.892162
px     0.025428
py    -0.020759
pz     1.876688
Name: 1000, dtype: float64 

-------------

PRED:
 x13     57.438868
y13     56.440951
px13     0.450998
py13     0.455674
pz13     4.567316
Name: 5000, dtype: float64 

TRUE:
 x     -1.534074
y     39.745310
px     0.062563
py     0.165445
pz     1.700621
Name: 5000, dtype: float64 

-------------

I've tinkered around a bit, and this is discouraging...

  • Time to go back to the drawing board and try out a boring old sequence model :(