Match Watchability/Excitability Index

  • Data from Kaggle for match events for 5 leagues from 2011/12 - current season (Jan 2017).
  • Need table standings for each team within their league as of that match.
  • For each match, will need to calculate expected win probability to factor into excitement (as per fivethirtyeight)
  • Given time, factor in xG (and calculate xG)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import pairwise_distances
from sklearn import cluster

In [2]:
events_df = pd.read_csv('GA/Data/events.csv')

In [3]:
ginf = pd.read_csv('GA/Data/ginf.csv')

In [40]:
events_df.groupby('id_odsp').id_event.count().plot('hist')
plt.title('Distribution of Events by Match')
plt.xlabel('Events per Match')
plt.show()



In [14]:
## Depending on level of comparison, calculate either across leagues or entire dataset

print events_df.groupby('id_odsp').id_event.count().mean()
print events_df.groupby('id_odsp').id_event.count().median()

## Little difference between mean and median


103.703879215
103.0

In [15]:
## events by time
## spikes at minutes 45 and 90 (ends of halves)

events_df.groupby('time').id_event.count().plot('bar')
plt.show()



In [16]:
events_df.groupby('time').is_goal.sum().plot('bar')
plt.show()



In [19]:
## For loop to assign current home goals at that point in a given match
## Function w apply not working... going to use loop

goals = 0
events_df['prev_id'] = events_df.id_odsp.shift()
for i, row in events_df.iterrows():
    if row.id_odsp != row.prev_id:
        goals = 0
    if (row.is_goal == 1) and (row.side == 1):
        goals += 1
    events_df.loc[i, 'home_goals'] = goals

In [27]:
## away goals

goals = 0
events_df['prev_id'] = events_df.id_odsp.shift()
for i, row in events_df.iterrows():
    if row.id_odsp != row.prev_id:
        goals = 0
    if (row.is_goal == 1) and (row.side == 2):
        goals += 1
    events_df.loc[i, 'away_goals'] = goals

In [31]:
## Data is saved in csv - no more for loops!

#events_df.to_csv('soccer_events.csv', encoding='utf-8')

Calculating the average goal difference throughout a match.

The steps below calculate the average goal difference at each minute of a match to measure how close a given match is. First, I create a new column in the existing dataframe to determine the goal difference at that point in the match. I create a separate dataframe that shows the goal difference at each minute for each match, and then write a function to fill in all minutes for a match in which no events occurred using the preceding goal difference. The average of all values across the entire match is then calculated.


In [ ]:
## Dataframe was previously saved - read in data
df = pd.read_csv('soccer_events.csv')

In [ ]:
df['score_diff'] = abs(df.home_goals - df.away_goals)

## Pivot on score_diff for each game at each point in time
## aggfunc is 'last' to take last occurring difference - i.e., if two goals scored in same minute, take last event
time_diff = df.pivot_table(values = 'score_diff', index = 'id_odsp', columns = 'time', aggfunc = 'last')

## clean up column names a little bit to indicate minutes
time_diff.columns = [str(col) + '_min' for col in time_diff.columns]

## if no events in 0th minute, fill goal difference as 0
time_diff['0_min'].fillna(0, inplace = True)

## for minutes with no events in a match, fill with previous minutes' difference
## ffill each row up until last listed event

def fill_row(match):
    match = match.copy()
    non_nans = match[~match.apply(np.isnan)]
    start, end = non_nans.index[0], non_nans.index[-1]
    match.ix[start:end] = match.ix[start:end].fillna(method='ffill')
    return match

time_diff = time_diff.apply(fill_row, axis = 1)

## Fill up to 90th minute if no events

time_diff.ix[:,'78_min':'90_min'].ffill(axis = 1, inplace = True)

## avg goal differential in match
avg_time_diff = time_diff.transpose().mean().reset_index()
avg_time_diff.columns = [['id_odsp','avg_goal_diff']]
avg_time_diff.head()

The steps below calculate the total number of lead changes by match to indicate a higher amount of scoring, and also more excitement.


In [ ]:
for i, row in goal_df.iterrows():
    if row.home_goals > row.away_goals:
        leader = 'home'
    if row.away_goals > row.home_goals:
        leader = 'away'
    if row.home_goals == row.away_goals:
        leader = 'tied'
    goal_df.loc[i, 'leader'] = leader

goal_df['prev_leader'] = goal_df.groupby('id_odsp').leader.shift().fillna('tied')

## Another for loop to flag if score is lead change

for i, row in goal_df.iterrows():
    if row.id_odsp == row.prev_id:
        if row.leader != row.prev_leader:
            ld_chg = 1
        else: 
            ld_chg = 0
        goal_df.loc[i, 'lead_change'] = ld_chg

## Total lead changes by match
goal_df.groupby('id_odsp').lead_change.sum().sort_values().reset_index()

Now that all fields have been calculated, the steps below will join to the original dataframe with individual match data.


In [ ]:
## Adding lead changes into overall match data
match_df = pd.merge(ginf, goal_df.groupby('id_odsp').lead_change.sum().reset_index(), on = 'id_odsp', how = 'left')
## Will need to fillna for new column w 0 value to indicate no lead changes
match_df.lead_change.fillna(0, inplace = True)

## Match events
match_df = pd.merge(match_df, df.groupby('id_odsp').id_event.count().reset_index(), on = 'id_odsp', how = 'inner')

## Average goal differential
match_df = pd.merge(match_df, avg_time_diff, on = 'id_odsp', how = 'inner')

The next step is to use unsupervised clustering to create groups of matches according to the metrics calculated above.


In [ ]:
X = match_df[['lead_change','id_event','avg_goal_diff']]

sns.pairplot(X)
plt.show()

In [ ]:
X_scaled = preprocessing.StandardScaler().fit_transform(X)

X_scaled_df = pd.DataFrame(X_scaled, columns = X.columns)

k = 4
kmeans = cluster.KMeans(n_clusters=k, random_state = 43)
kmeans.fit(X_scaled)

labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_

print labels
print centroids
print inertia

print metrics.silhouette_score(X_scaled, labels, metric='euclidean')

match_df['cluster'] = labels

In [ ]:
X['cluster'] = labels
cols = X.columns[:-1]
sns.pairplot(X, x_vars=cols, y_vars= cols, hue='cluster')
plt.show()

In [ ]:
match_df.drop([u'odd_h', u'odd_d',u'odd_a', u'odd_over', u'odd_under', u'odd_bts', u'odd_bts_n'], axis = 1, inplace = True)

match_df['date'] = pd.to_datetime(match_df['date'])
match_df['month'] = match_df.date.apply(lambda x: x.month)
match_df['year'] = match_df.date.apply(lambda x: x.year)

centroid_df = pd.DataFrame(centroids, columns = X_scaled_df.columns)
centroid_df
  • Cluster 0 - Second Fewest Lead Changes (rank 3), Second Fewest Events(rank 3), Largest average goal differential (rank 4)
  • Cluster 1 - Most Lead Changes (rank 1), Second Most Events (rank 2), Second largest avg goal diff (rank 3)
  • Cluster 2 - Second most lead changes (rank 2), most events (rank 1), third largest avg goal diff (rank 2)
  • Cluster 3 - fewest lead chanegs (rank 4), fewest events (rank 4), smallest avg goal diff (rank 1)

Save match df with clusters and engineered features

match_df.to_csv('match_cluster.csv', encoding='utf-8')

In [ ]:
match_df = pd.read_csv('match_cluster.csv')

match_df = match_df.drop(['Unnamed: 0','link_odsp','adv_stats'], axis = 1)

england = match_df[match_df.country == 'england']
spain = match_df[match_df.country == 'spain']
germany = match_df[match_df.country == 'germany']
france = match_df[match_df.country == 'france']
italy = match_df[match_df.country == 'italy']