Rugby Analytics

  • Peadar Coyle
  • Alternate title - Probabilistic Programming applied to Sports Analytics

Who am I?

  • I'm a Data Analytics Professional based in Luxembourg
  • I currently work for Vodafone
  • My intellectual background is in Physics and Mathematics
  • I worked at Amazon in Supply Chain Analytics
  • I interned at http://import.io - where I got introduced to Data Science
  • I've made open source contributions to Pandas and Probabilistic Programming and Bayesian Methods for Hackers.
  • All opinions are my own!

Contents: Probabilistic Programming applied to Rugby

  • I'll discuss what probabilistic programming is, why should you care and how to use PyMC from Python to implement these methods.
  • I'll be applying these methods to studying the problem of 'rugby sports analytics' particularly how to model the winning team in the recent Six Nations in Rugby.
  • I will discuss the framework and how I was able to quickly and easily produce an innovative and powerful model as a non-expert.

All Sports Commentary!

  • Attribution: Xkcd

How can statistics help with sports?

  • Well fundamentally a Rugby game is a simulatable event.
  • How do we generate a model to predict the outcome of a tournament?
  • How do we quantify our uncertainty in our model?

What influenced me on this?

Attribution: Quantopian blog

What's wrong with statistics

  • Models should not be built for mathematical convenience (e.g. normality assumption), but to most accurately model the data.

  • Pre-specified models, like frequentist statistics, make many assumptions that are all to easily violated.

"The purpose of computation is insight, not numbers." -- Richard Hamming

What is Bayesian Statistics?

  • At the core: formula to update our beliefs after having observed data (Bayes formula)
  • Implies that we have a prior belief about the world.
  • Updated beliefs after observing data is called posterior.
  • Beliefs are represented using random variables.

So what problem could I apply Bayesian models to?

Bayesian Rugby

I came across the following blog post on http://danielweitzenfeld.github.io/passtheroc/blog/2014/10/28/bayes-premier-league/

  • In this talk, I'm going to reproduce the first model described in the paper using pymc.
  • Since I am a rugby fan I decide to apply the results of the paper Bayesian Football to the Six Nations.

What they did?

  • Deriving our new measuring model and verifying that it works took some effort! But it was all worth it, because now we have:

  • Automatic weight estimations for each Zalando article, which saves workers time

  • A reliable way to know the accuracy of our estimations
  • And most importantly: our warehouse workers can now focus on getting your fashion to you as quickly as possible. That’s isn’t just saving money–that’s priceless.
  • Attribution: The Zalando tech blog

So why Bayesians?

  • Probabilistic Programming is a new paradigm.
  • Attributions: My friend Thomas Wiecki influenced a lot of my thinking on this.
  • I'm going to compare Blackbox Machine Learning with scikit-learn

Limitations of Machine learning

  • A big limitation of Machine Learning is that most of the models are black boxes.

Probabilistic Programming - what's the big deal?

  • We are able to use data and our prior beliefs to generate a model.
  • Generating a model is extremely powerful
  • We can tell a story, which appeals to our understanding of stories.

In [253]:
import os
import math
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc # I know folks are switching to "as pm" but I'm just not there yet
%matplotlib inline
import seaborn as sns
from IPython.core.pylabtools import figsize
import seaborn as sns
figsize(12, 12)

Six Nations Rugby

  • Rugby is a physical sport popular worldwide.
  • Six Nations consists of Italy, Ireland, Scotland, England, France and Wales
  • Game consists of scoring tries (similar to touch downs) or kicking the goal.
  • Average player is something like 100kg and 1.82m tall.
  • Paul O'Connell the Irish captain is Height: 6' 6" (1.98 m) Weight: 243 lbs (110 kg)

They compete for this!

  • Significant this year because the World Cup occurs in 2015.
  • Photo: Hostelrome

Motivation

  • Your estimate of the strength of a team depends on your estimates of the other strengths
  • Ireland are a stronger team than Italy for example - but by how much?
  • Source for Results 2014 are Wikipedia.
  • I handcrafted these results
  • Small data


In [220]:
DATA_DIR = os.path.join(os.getcwd(), 'data/')

In [221]:
#The results_2014 is a handcrafted results table from Wikipedia
data_file = DATA_DIR + 'results_2014.csv'
df = pd.read_csv(data_file, sep=',')
df.tail()


Out[221]:
home_team away_team home_score away_score
10 Scotland France 17 19
11 England Wales 29 18
12 Italy England 11 52
13 Wales Scotland 51 3
14 France Ireland 20 22

In [222]:
teams = df.home_team.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index
teams.head()


Out[222]:
team i
0 Wales 0
1 France 1
2 Ireland 2
3 Scotland 3
4 Italy 4
  • Now we need to do some merging and munging

In [223]:
df = pd.merge(df, teams, left_on='home_team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='away_team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_away'}).drop('team', 1)
df.head()


Out[223]:
home_team away_team home_score away_score i_home i_away
0 Wales Italy 23 15 0 4
1 France England 26 24 1 5
2 Ireland Scotland 28 6 2 3
3 Ireland Wales 26 3 2 0
4 Scotland England 0 20 3 5

In [224]:
observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values
home_team = df.i_home.values
away_team = df.i_away.values
num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)

Now we need to prepare the model for PyMC.


In [225]:
g = df.groupby('i_away')
att_starting_points = np.log(g.away_score.mean())
g = df.groupby('i_home')
def_starting_points = -np.log(g.away_score.mean())

What do we want to infer?

  • We want to infer the latent paremeters (every team's strength) that are generating the data we observe (the scorelines).
  • Moreover, we know that the scorelines are a noisy measurement of team strength, so ideally, we want a model that makes it easy to quantify our uncertainty about the underlying strengths.

While my MCMC gently samples

  • Often we don't know what the Bayesian Model is explicitly, so we have to 'estimate' the Bayesian Model'
  • If we can't solve something, approximate it.
  • Markov-Chain Monte Carlo (MCMC) instead draws samples from the posterior.
  • Fortunately, this algorithm can be applied to almost any model.

What do we want?

  • We want to quantify our uncertainty
  • We want to also use this to generate a model
  • We want the answers as distributions not point estimates

What assumptions do we know for our 'generative story'?

  • We know that the Six Nations in Rugby only has 6 teams.
  • We have data from last year!
  • We also know that in sports scoring is modelled as a Poisson distribution
  • Attribution: Wikipedia

The model.

The league is made up by a total of T= 6 teams, playing each other once in a season. We indicate the number of points scored by the home and the away team in the g-th game of the season (15 games) as $y_{g1}$ and $y_{g2}$ respectively.

The vector of observed counts $\mathbb{y} = (y_{g1}, y_{g2})$ is modelled as independent Poisson: $y_{gi}| \theta_{gj} \tilde\;\; Poisson(\theta_{gj})$ where the theta parameters represent the scoring intensity in the g-th game for the team playing at home (j=1) and away (j=2), respectively.

We model these parameters according to a formulation that has been used widely in the statistical literature, assuming a log-linear random effect model: $$log \theta_{g1} = home + att_{h(g)} + def_{a(g)} $$ $$log \theta_{g2} = att_{a(g)} + def_{h(g)}$$ the parameter home represents the advantage for the team hosting the game and we assume that this effect is constant for all the teams and throughout the season.

  • Key assumption home effect is an advantage in Sports
  • We know these things empirically from our 'domain specific' knowledge
  • Bayesian Models allow you to incorporate beliefs or knowledge into your model!

In addition, the scoring intensity is determined jointly by the attack and defense ability of the two teams involved, represented by the parameters att and def, respectively. In line with the Bayesian approach, we have to specify some suitable prior distributions for all the random parameters in our model. The variable $home$ is modelled as a fixed effect, assuming a standard flat prior distribution. We use the notation of describing the Normal distribution in terms of mean and the precision. $home \tilde\; Normal(0,0.0001)$</p>

Conversely, for each t = 1, ..., T, the team-specific effects are modelled as exchangeable from a common distribution: $att_t \tilde\; Normal(\mu_{att}, \tau_{att})$ and $def_t \tilde\; Normal(\mu_{def}, \tau_{def})$

Note that they're breaking out team strength into attacking and defending strength. A negative defense parameter will sap the mojo from the opposing team's attacking parameter.

I made two tweaks to the model. It didn't make sense to me to have a $\mu_{att}$ when we're enforcing the sum-to-zero constraint by subtracting the mean anyway. So I eliminated $\mu_{att}$ and $\mu_{def}$

Also because of the sum-to-zero constraint, it seemed to me that we needed an intercept term in the log-linear model, capturing the average goals scored per game by the away team. This we model with the following hyperprior. $$intercept \tilde\; Normal(0, 0.001)$$

  • The hyper-priors on the attack and defense parameters are also flat
  • $\mu_att \tilde\; Normal(0,0.001)$
  • $\mu_def \tilde\; Normal(0,0.001)$
  • $\tau_att \tilde\; \Gamma(0.1,0.1)$
  • $\tau_def \tilde\; \Gamma(0.1,0.1)$

Digression: Why the flat priors were picked (our hyperpriors are flat)

  • There are lots of different ways of picking priors - see the literature
  • I learned from this that it didn't make any difference to the results picking a uniform prior versus a flat prior

Often it simply doesn't matter much what prior you use.

Definition: A prior distribution is non-informative if the prior is “flat” relative to the likelihood function.

  • We are trying to have a reasonable model and let inference happen from the data set.
  • ##Key takeaway - Often in Bayesian modelling it doesn't matter what your priors are. Let the data generate the story :)

In [226]:
#hyperpriors
home = pymc.Normal('home', 0, .0001, value=0)
tau_att = pymc.Gamma('tau_att', .1, .1, value=10)
tau_def = pymc.Gamma('tau_def', .1, .1, value=10)
intercept = pymc.Normal('intercept', 0, .0001, value=0)
#team-specific parameters
atts_star = pymc.Normal("atts_star", 
                        mu=0, 
                        tau=tau_att, 
                        size=num_teams, 
                        value=att_starting_points.values)
defs_star = pymc.Normal("defs_star", 
                        mu=0, 
                        tau=tau_def, 
                        size=num_teams, 
                        value=def_starting_points.values)

In [227]:
# trick to code the sum to zero constraint
@pymc.deterministic
def atts(atts_star=atts_star):
    atts = atts_star.copy()
    atts = atts - np.mean(atts_star)
    return atts

@pymc.deterministic
def defs(defs_star=defs_star):
    defs = defs_star.copy()
    defs = defs - np.mean(defs_star)
    return defs

@pymc.deterministic
def home_theta(home_team=home_team, 
               away_team=away_team, 
               home=home, 
               atts=atts, 
               defs=defs, 
               intercept=intercept): 
    return np.exp(intercept + 
                  home + 
                  atts[home_team] + 
                  defs[away_team])
  
@pymc.deterministic
def away_theta(home_team=home_team, 
               away_team=away_team, 
               home=home, 
               atts=atts, 
               defs=defs, 
               intercept=intercept): 
    return np.exp(intercept + 
                  atts[away_team] + 
                  defs[home_team])

Let us run the model!

  • We specify the priors as Gamma distributions

In [228]:
home_points = pymc.Poisson('home_points', 
                          mu=home_theta, 
                          value=observed_home_goals, 
                          observed=True)
away_points = pymc.Poisson('away_points', 
                          mu=away_theta, 
                          value=observed_away_goals, 
                          observed=True)

mcmc = pymc.MCMC([home, intercept, tau_att, tau_def, 
                  home_theta, away_theta, 
                  atts_star, defs_star, atts, defs, 
                  home_points, away_points])
map_ = pymc.MAP( mcmc )
map_.fit()

mcmc.sample(200000, 40000, 20)


 [-----------------100%-----------------] 200000 of 200000 complete in 73.6 sec

Diagnostics

Let's see if/how the model converged. The home parameter looks good, and indicates that home field advantage amounts to goals per game at the intercept. We can see that it converges just like the model for the Premier League in the other tutorial. I wonder and this is left as a question if all field sports have models of this form that converge.


In [229]:
pymc.Matplot.plot(home)


Plotting home
  • We can see in this analysis that home advantage gives about 0.55 points advantage.
  • We also see not too much auto-correlation, so this looks quite good plot wise.
  • We see here how probabilistic programming allows us to quantify our uncertainty about certain parameters.

In [230]:
pymc.Matplot.plot(atts)
# We can plot all of the parameters, just to see.


Plotting atts_0
Plotting atts_1
Plotting atts_2
Plotting atts_3
Plotting atts_4
Plotting atts_5

In [231]:
observed_season = DATA_DIR + 'table_2014.csv'
df_observed = pd.read_csv(observed_season)
df_observed.loc[df_observed.QR.isnull(), 'QR'] = ''
df_avg = pd.DataFrame({'avg_att': atts.stats()['mean'],
                       'avg_def': defs.stats()['mean']}, 
                      index=teams.team.values)

In [232]:
df_new = df_avg.merge(df_observed, left_index=True, right_on='team', how='left')

#Bring in the new csv file
df_results = pd.read_csv('df_avg_data_edited.csv', sep=';')

In [233]:
df_results.tail()


Out[233]:
avg_att avg_def points team position QR
1 0.096729 0.150764 6 France 4 NaN
2 0.179363 -0.560665 8 Ireland 1 winners
3 -0.549843 0.289372 2 Scotland 5 NaN
4 -0.251418 0.515101 0 Italy 6 wooden_spoon
5 0.399824 -0.255811 8 England 2 triple_crown

In [254]:
def fig3():
    fig, ax = plt.subplots(figsize=(12,12))
    for outcome in ['winners', 'wooden_spoon', 'triple_crown', 'NaN']:
        ax.plot(df_results.avg_att[df_results.QR == outcome], 
    df_results.avg_def[df_results.QR == outcome], 'o', label=outcome)
    for label, x, y in zip(df_results.team.values, df_results.avg_att.values, df_results.avg_def.values):
        ax.annotate(label, xy=(x,y), xytext = (-5,5), textcoords = 'offset points')
        ax.set_title('Attack vs Defense avg effect: 13-14 Six Nations')
        ax.set_xlabel('Avg attack effect')
        ax.set_ylabel('Avg defense effect')
        ax.legend()

In [255]:
fig3()


Simulating a season

We would like to now simulate a season. Just to see what happens.


In [235]:
def simulate_season():
    """
    Simulate a season once, using one random draw from the mcmc chain. 
    """
    num_samples = atts.trace().shape[0]
    draw = np.random.randint(0, num_samples)
    atts_draw = pd.DataFrame({'att': atts.trace()[draw, :],})
    defs_draw = pd.DataFrame({'def': defs.trace()[draw, :],})
    home_draw = home.trace()[draw]
    intercept_draw = intercept.trace()[draw]
    season = df.copy()
    season = pd.merge(season, atts_draw, left_on='i_home', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_home', right_index=True)
    season = season.rename(columns = {'att': 'att_home', 'def': 'def_home'})
    season = pd.merge(season, atts_draw, left_on='i_away', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_away', right_index=True)
    season = season.rename(columns = {'att': 'att_away', 'def': 'def_away'})
    season['home'] = home_draw
    season['intercept'] = intercept_draw
    season['home_theta'] = season.apply(lambda x: math.exp(x['intercept'] + 
                                                           x['home'] + 
                                                           x['att_home'] + 
                                                           x['def_away']), axis=1)
    season['away_theta'] = season.apply(lambda x: math.exp(x['intercept'] + 
                                                           x['att_away'] + 
                                                           x['def_home']), axis=1)
    season['home_goals'] = season.apply(lambda x: np.random.poisson(x['home_theta']), axis=1)
    season['away_goals'] = season.apply(lambda x: np.random.poisson(x['away_theta']), axis=1)
    season['home_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] > x['away_goals'] else 
                                                    'loss' if x['home_goals'] < x['away_goals'] else 'draw', axis=1)
    season['away_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] < x['away_goals'] else 
                                                    'loss' if x['home_goals'] > x['away_goals'] else 'draw', axis=1)
    season = season.join(pd.get_dummies(season.home_outcome, prefix='home'))
    season = season.join(pd.get_dummies(season.away_outcome, prefix='away'))
    return season


def create_season_table(season):
    """
    Using a season dataframe output by simulate_season(), create a summary dataframe with wins, losses, goals for, etc.
    
    """
    g = season.groupby('i_home')    
    home = pd.DataFrame({'home_goals': g.home_goals.sum(),
                         'home_goals_against': g.away_goals.sum(),
                         'home_wins': g.home_win.sum(),
                         'home_losses': g.home_loss.sum()
                         })
    g = season.groupby('i_away')    
    away = pd.DataFrame({'away_goals': g.away_goals.sum(),
                         'away_goals_against': g.home_goals.sum(),
                         'away_wins': g.away_win.sum(),
                         'away_losses': g.away_loss.sum()
                         })
    df = home.join(away)
    df['wins'] = df.home_wins + df.away_wins
    df['losses'] = df.home_losses + df.away_losses
    df['points'] = df.wins * 2
    df['gf'] = df.home_goals + df.away_goals
    df['ga'] = df.home_goals_against + df.away_goals_against
    df['gd'] = df.gf - df.ga
    df = pd.merge(teams, df, left_on='i', right_index=True)
    df = df.sort_index(by='points', ascending=False)
    df = df.reset_index()
    df['position'] = df.index + 1
    df['champion'] = (df.position == 1).astype(int)
    df['relegated'] = (df.position > 5).astype(int)
    return df  
    
def simulate_seasons(n=100):
    dfs = []
    for i in range(n):
        s = simulate_season()
        t = create_season_table(s)
        t['iteration'] = i
        dfs.append(t)
    return pd.concat(dfs, ignore_index=True)

Simulation

  • We are going to simulate 1000 seasons

In [236]:
simuls = simulate_seasons(1000)

In [237]:
def fig1():
    ax = simuls.points[simuls.team == 'Ireland'].hist()
    median = simuls.points[simuls.team == 'Ireland'].median()
    ax.set_title('Ireland: 2015 points, 1000 simulations')
    ax.plot([median, median], ax.get_ylim())
    plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1]-10))

In [238]:
fig1()


  • So what have we learned so far, we've got 1000 simulations of Ireland and their median points in the table is 8.
  • In Rugby you get 2 points per win, and there are 5 games per year. So this model predicted that Ireland would win most of the time 4 games.

In [239]:
def fig2():
    ax = simuls.gf[simuls.team == 'Ireland'].hist(figsize=(7,5))
    median = simuls.gf[simuls.team == 'Ireland'].median()
    ax.set_title('Ireland: 2015 scores for, 1000 simulations')
    ax.plot([median, median], ax.get_ylim())
    plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1]-10))

In [240]:
fig2()


What happened in reality?

  • Well Ireland actually scored 119 points, so the model over predicted this!
  • We call this 'shrinkage' in the literature.
  • All models are wrong, but some are useful

What are the predictions of the model?

  • So let us look at the winning team on average.
  • We do a simulation and we'll assign probability of 'winning' to the team
  • We used the MCMC to do this.

In [241]:
g = simuls.groupby('team')
df_champs = pd.DataFrame({'percent_champs': g.champion.mean()})
df_champs = df_champs.sort_index(by='percent_champs')
df_champs = df_champs[df_champs.percent_champs > .05]
df_champs = df_champs.reset_index()

In [242]:
fig, ax = plt.subplots(figsize=(8,6))
ax.barh(df_champs.index.values, df_champs.percent_champs.values)

for i, row in df_champs.iterrows():
    label = "{0:.1f}%".format(100 * row['percent_champs'])
    ax.annotate(label, xy=(row['percent_champs'], i), xytext = (3, 10), textcoords = 'offset points')
ax.set_ylabel('Team')
ax.set_title('% of Simulated Seasons In Which Team Finished Winners')
_= ax.set_yticks(df_champs.index + .5)
_= ax.set_yticklabels(df_champs['team'].values)


Unfortunately it seems that in most of the Universes England come top of the Six Nations. And as an Irish man this is firm proof that I put Mathematical rigour before patriotism :) This is a reasonable result, and I hope it proved a nice example of Bayesian models in Rugby Analytics.

What actually happened

We need to investigate like 'scientists' what actually happened.


In [243]:
pd.read_csv('data/results_2015.csv', sep='\t', header=0, 
            names =['Rank','Team','Games','Wins','Draws',
                    'Losses','Points_For','Points_Against','Points'])


Out[243]:
Rank Team Games Wins Draws Losses Points_For Points_Against Points
0 1 Ireland 5 4 0 1 119 56 8
1 2 England 5 4 0 1 157 100 8
2 3 Wales 5 4 0 1 146 93 8
3 4 France 5 2 0 3 103 101 4
4 5 Italy 5 1 0 4 62 182 2
5 6 Scotland 5 0 0 5 73 128 0

Ireland won the Six Nations!

  • The model incorrectly predicted that England would come out on top.
  • Ireland actually won by points difference of 6 points.
  • It really came down to the wire!

In [244]:
from IPython.display import Image
Image(filename='onedoesnotsimply.jpg')


Out[244]:

Conclusion

  • 'All models are wrong, some are useful'
  • Model correctly predicts that Ireland would win 4 games
  • Model incorrectly predicted that England would come out on top
  • In reality it was very close
  • Recommendation: Don't use this model to bet on the Six Nations next years

Want to learn more?


In [245]:
from IPython.display import Image
Image(filename='the-most-interesting-man-in-the-world-meme-generator-i-don-t-give-many-talks-but-when-i-do-i-do-q-a-e38222 copy.jpg')


Out[245]: