Collaborative Filtering

This starter notebook, forked from last year's competition, implements collaborative filtering with Keras. The training data consists only of the win (1) or loss (0) label of each match and a categorical encoding of the team IDs.

The formula is essentially: Model prediction = (dot product of the two teams' embedding vectors) + (Team1 bias) + (Team2 bias). The network then passes this score through a final one-unit Dense layer.
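
As a minimal sketch of this formula, the snippet below computes the score for a single match with NumPy. The embedding size, the random values, and the helper names `match_score` and `sigmoid` are illustrative assumptions, not part of the notebook; the sigmoid stands in for a final activation that squashes the score into a win probability.

```python
import numpy as np

def match_score(e1, e2, b1, b2):
    """Raw score for one match: dot product of the two team vectors plus both biases."""
    return float(np.dot(e1, e2) + b1 + b2)

def sigmoid(z):
    """Map a raw score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)      # fixed seed, illustrative values only
e_team1 = rng.normal(size=4)        # hypothetical 4-dimensional embedding
e_team2 = rng.normal(size=4)
score = match_score(e_team1, e_team2, b1=0.2, b2=-0.1)
prob = sigmoid(score)
print(score, prob)
```

In the real model the embedding vectors and biases are learned parameters rather than random draws.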


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
from keras.layers import Input, Dense, Dropout, Flatten, Embedding, merge
from keras.regularizers import l2
from keras.optimizers import Adam
from keras.models import Model

In [4]:
dr = pd.read_csv("../input/RegularSeasonDetailedResults.csv")

In [5]:
dr.tail(n=30)

Preparing the training data

Encode each match twice: once as (winner, loser) with label 1, and once as (loser, winner) with label 0.
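
The encoding can be sketched on a toy example as follows. The team IDs and the array names (`games`, `wins`, `losses`, `rows`) are made up for illustration; the point is only that every match yields two mirrored rows, so the labels are balanced by construction.

```python
import numpy as np

# Hypothetical results: each row is (winner_id, loser_id).
games = np.array([[1101, 1104],
                  [1104, 1112]])

# Winner listed first -> label 1.
wins = np.column_stack([games[:, 0], games[:, 1], np.ones(len(games), dtype=int)])
# Same matches with the teams swapped -> label 0.
losses = np.column_stack([games[:, 1], games[:, 0], np.zeros(len(games), dtype=int)])

rows = np.vstack([wins, losses])    # columns: team1, team2, label
print(rows)
```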


In [6]:
# One row per game from the winner's perspective: label 1.
simple_df_1 = pd.DataFrame()
simple_df_1[["team1", "team2"]] = dr[["WTeamID", "LTeamID"]].copy()
simple_df_1["pred"] = 1

# The same games with the teams swapped: label 0.
simple_df_2 = pd.DataFrame()
simple_df_2[["team1", "team2"]] = dr[["LTeamID", "WTeamID"]].copy()
simple_df_2["pred"] = 0

simple_df = pd.concat((simple_df_1, simple_df_2), axis=0)
simple_df.head()

Count the unique teams in "team1". This number is later used as the size of each embedding table.


In [7]:
n = simple_df.team1.nunique()
n

In [8]:
trans_dict = {t: i for i, t in enumerate(simple_df.team1.unique())}
simple_df["team1"] = simple_df["team1"].apply(lambda x: trans_dict[x])
simple_df["team2"] = simple_df["team2"].apply(lambda x: trans_dict[x])
simple_df.head()
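
The remapping in the cell above matters because an embedding table with `n` rows can only be indexed with integers from 0 to `n - 1`, while the raw team IDs are large four-digit numbers. A minimal sketch of the same idea, using hypothetical IDs and the illustrative names `team_ids`, `to_index`, and `indices`:

```python
import numpy as np

team_ids = np.array([1104, 1101, 1104, 1112, 1101])   # hypothetical raw IDs

# Build the ID -> index lookup in order of first appearance.
to_index = {}
for t in team_ids.tolist():
    to_index.setdefault(t, len(to_index))

indices = np.array([to_index[t] for t in team_ids.tolist()])
print(to_index)
print(indices)
```

Every index is now smaller than the number of distinct teams, so it is a valid row of the embedding table.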

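Shuffle the training rows. Keras takes the validation split from the end of the array, and the concatenation above puts every label-1 row before every label-0 row, so without shuffling the validation set would contain only losses.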


In [9]:
train = simple_df.values
np.random.shuffle(train)
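
To see why the shuffle matters before a tail-based validation split, here is a small sketch on toy data. The helper `tail_split`, the seed, and the toy array are assumptions for illustration; the helper only approximates how a framework might hold out the last rows.

```python
import numpy as np

rng = np.random.default_rng(42)     # seed chosen only for reproducibility
# Toy stand-in for the training array: 6 "win" rows followed by 6 "loss" rows.
data = np.array([[i, i + 1, 1] for i in range(6)] + [[i + 1, i, 0] for i in range(6)])

def tail_split(arr, fraction):
    """Hold out the last `fraction` of the rows as a validation set."""
    n_val = int(len(arr) * fraction)
    return arr[:len(arr) - n_val], arr[len(arr) - n_val:]

_, val_unshuffled = tail_split(data, 0.33)
print(val_unshuffled[:, 2])         # only zeros: the tail holds nothing but losses

shuffled = data.copy()
rng.shuffle(shuffled)               # shuffles the rows in place along the first axis
_, val_shuffled = tail_split(shuffled, 0.33)
print(val_shuffled[:, 2])
```

Note that each match still appears twice (once per orientation), so a mirrored copy of a validation row can end up in the training part.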

In [10]:
def embedding_input(name, n_in, n_out, reg):
    inp = Input(shape=(1,), dtype="int64", name=name)
    return inp, Embedding(n_in, n_out, input_length=1, W_regularizer=l2(reg))(inp)

def create_bias(inp, n_in):
    x = Embedding(n_in, 1, input_length=1)(inp)
    return Flatten()(x)

Let's start building our CF network with Keras.

Start by creating an embedding and a bias for each of the two teams. The embeddings feed the dot product, and the biases are added to its result.


In [11]:
n_factors = 50

team1_in, t1 = embedding_input("team1_in", n, n_factors, 1e-4)
team2_in, t2 = embedding_input("team2_in", n, n_factors, 1e-4)

b1 = create_bias(team1_in, n)
b2 = create_bias(team2_in, n)

In [25]:
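x = merge([t1, t2], mode="dot")   # dot product of the two team embeddings
x = Flatten()(x)
x = merge([x, b1], mode="sum")    # add the team1 bias
x = merge([x, b2], mode="sum")    # add the team2 bias
# Sigmoid maps the score to a win probability in (0, 1);
# softmax over a single unit would always output 1.0.
x = Dense(1, activation="sigmoid")(x)
model = Model([team1_in, team2_in], x)
model.compile(Adam(0.001), loss="binary_crossentropy")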

In [26]:
model.summary()

Now that we have defined our network, it's time to find the set of parameters (embeddings, biases, and final-layer weights) that brings our predictions close to the actual match outcomes.

Let's learn these parameters by minimising the binary cross-entropy loss with the Adam optimisation algorithm.
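
For reference, the quantity being minimised can be sketched in a few lines of NumPy. The function below is an illustrative re-implementation, not the Keras internals, and the labels and predictions are made up.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)))

labels = [1, 0, 1, 0]
baseline = binary_crossentropy(labels, [0.5, 0.5, 0.5, 0.5])   # uninformed guess
better = binary_crossentropy(labels, [0.8, 0.3, 0.7, 0.2])     # hypothetical predictions
print(baseline, better)
```

Predicting 0.5 for every match gives a loss of ln 2 (about 0.693), so a validation loss near that value means the model is doing no better than a coin flip.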


In [27]:
train.shape
# print(train[:5])  # train is a NumPy array, so slice it rather than calling .head()

In [30]:
history = model.fit([train[:, 0], train[:, 1]], train[:, 2],
                    validation_split=0.33, batch_size=64, nb_epoch=5, verbose=2)

In [31]:
plt.plot(history.history["loss"])
plt.show()

In [32]:
# list all data in history
print(history.history.keys())
# summarize history for accuracy
#plt.plot(history.history['acc'])
#plt.plot(history.history['val_acc'])
#plt.title('model accuracy')
#plt.ylabel('accuracy')
#plt.xlabel('epoch')
#plt.legend(['train', 'test'], loc='upper left')
#plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [33]:
sub = pd.read_csv('../input/SampleSubmissionStage2.csv')
sub["team1"] = sub["ID"].apply(lambda x: trans_dict[int(x.split("_")[1])])
sub["team2"] = sub["ID"].apply(lambda x: trans_dict[int(x.split("_")[2])])
sub.head()
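
The ID parsing in the cell above can be sketched in isolation. The example ID, the helper `parse_matchup_id`, and the small `lookup` table are hypothetical; the sketch assumes the ID has three underscore-separated integer parts, with the two team IDs in the second and third positions as the cell's indexing implies.

```python
def parse_matchup_id(matchup_id):
    """Split an ID such as '2018_1101_1104' into (season, team1, team2) integers."""
    season, team1, team2 = (int(part) for part in matchup_id.split("_"))
    return season, team1, team2

# Hypothetical lookup table in the same shape as trans_dict (raw ID -> index).
lookup = {1101: 0, 1104: 1}

season, t1_raw, t2_raw = parse_matchup_id("2018_1101_1104")
# .get returns None for a team that never appeared in training,
# instead of raising KeyError as a plain lookup[...] would.
t1_idx, t2_idx = lookup.get(t1_raw), lookup.get(t2_raw)
print(season, t1_idx, t2_idx)
```

A team missing from the training data has no learned embedding, so checking for `None` before predicting is one way to catch that case early.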

In [34]:
sub["pred"] = model.predict([sub.team1, sub.team2])
sub = sub[["ID", "pred"]]
sub.head()

In [ ]:
sub.to_csv("CF_sm.csv", index=False)
