I came across the following blog post on http://danielweitzenfeld.github.io/passtheroc/blog/2014/10/28/bayes-premier-league/, which builds a Bayesian hierarchical model of English Premier League results, and decided to apply the same approach to the Six Nations rugby championship.
In [253]:
import os
import math
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pymc # I know folks are switching to "as pm" but I'm just not there yet
%matplotlib inline
import seaborn as sns
from IPython.core.pylabtools import figsize
figsize(12, 12)
In [220]:
DATA_DIR = os.path.join(os.getcwd(), 'data/')
In [221]:
# results_2014.csv is a hand-crafted results table, transcribed from Wikipedia
data_file = DATA_DIR + 'results_2014.csv'
df = pd.read_csv(data_file, sep=',')
df.tail()
Out[221]:
In [222]:
teams = df.home_team.unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index
teams.head()
Out[222]:
In [223]:
df = pd.merge(df, teams, left_on='home_team', right_on='team', how='left')
df = df.rename(columns={'i': 'i_home'}).drop('team', axis=1)
df = pd.merge(df, teams, left_on='away_team', right_on='team', how='left')
df = df.rename(columns={'i': 'i_away'}).drop('team', axis=1)
df.head()
Out[223]:
In [224]:
observed_home_goals = df.home_score.values
observed_away_goals = df.away_score.values
home_team = df.i_home.values
away_team = df.i_away.values
num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)
Now we need to prepare the model for PyMC.
In [225]:
# Starting values for the team-specific effects: attack from the log of a
# team's mean score away from home, defence from the negative log of the
# mean score it concedes at home.
g = df.groupby('i_away')
att_starting_points = np.log(g.away_score.mean())
g = df.groupby('i_home')
def_starting_points = -np.log(g.away_score.mean())
The league is made up of a total of T = 6 teams, each playing every other team once in a season. We denote the number of points scored by the home and the away team in the g-th game of the season (15 games) by $y_{g1}$ and $y_{g2}$ respectively.
The vector of observed counts $\mathbf{y} = (y_{g1}, y_{g2})$ is modelled as independent Poisson: $y_{gj} \mid \theta_{gj} \sim Poisson(\theta_{gj})$, where the theta parameters represent the scoring intensity in the g-th game for the team playing at home (j = 1) and away (j = 2), respectively.
We model these parameters according to a formulation that has been used widely in the statistical literature, assuming a log-linear random effect model: $$\log \theta_{g1} = home + att_{h(g)} + def_{a(g)}$$ $$\log \theta_{g2} = att_{a(g)} + def_{h(g)}$$ The parameter $home$ represents the advantage for the team hosting the game, and we assume that this effect is constant for all teams and throughout the season.
In addition, the scoring intensity is determined jointly by the attack and defense abilities of the two teams involved, represented by the parameters $att$ and $def$ respectively. In line with the Bayesian approach, we have to specify suitable prior distributions for all the random parameters in our model. The variable $home$ is modelled as a fixed effect, assuming an essentially flat prior. Note that we describe the Normal distribution in terms of its mean and precision, the reciprocal of the variance: $home \sim Normal(0, 0.0001)$
Conversely, for each $t = 1, \ldots, T$, the team-specific effects are modelled as exchangeable draws from common distributions: $att_t \sim Normal(\mu_{att}, \tau_{att})$ and $def_t \sim Normal(\mu_{def}, \tau_{def})$
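(An aside, not part of the model: a quick sanity check of what a precision of 0.0001 actually means.)
# Precision (tau) is the reciprocal of variance: tau = 1 / sigma**2,
# so Normal(0, 0.0001) in precision form has sd = 100 -- effectively flat.
import numpy as np
print(1 / np.sqrt(0.0001))  # 100.0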
Note that they're breaking out team strength into attacking and defending strength. A negative defense parameter will sap the mojo from the opposing team's attacking parameter.
I made two tweaks to the model. It didn't make sense to me to have a $\mu_{att}$ when we're enforcing the sum-to-zero constraint by subtracting the mean anyway, so I eliminated $\mu_{att}$ and $\mu_{def}$.
Also because of the sum-to-zero constraint, it seemed to me that we needed an intercept term in the log-linear model, capturing the average goals scored per game by the away team. We model this with the following prior (matching the code below): $$intercept \sim Normal(0, 0.0001)$$
Often it simply doesn't matter much which weakly informative prior you use: with a reasonable amount of data, the likelihood dominates and the posterior is largely insensitive to the exact choice.
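Here's a toy illustration (made-up numbers, not model output) of why the shared mean drops out once we centre the effects:
# Any shared offset (10.0 here, standing in for mu_att) is absorbed when we
# subtract the mean, so only the relative team strengths are identified.
import numpy as np
raw = np.array([0.5, 1.2, -0.3]) + 10.0
centered = raw - raw.mean()
print(centered, centered.sum())  # sums to ~0; the offset is gone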
In [226]:
#hyperpriors
home = pymc.Normal('home', 0, .0001, value=0)
tau_att = pymc.Gamma('tau_att', .1, .1, value=10)
tau_def = pymc.Gamma('tau_def', .1, .1, value=10)
intercept = pymc.Normal('intercept', 0, .0001, value=0)

#team-specific parameters
atts_star = pymc.Normal("atts_star",
                        mu=0,
                        tau=tau_att,
                        size=num_teams,
                        value=att_starting_points.values)
defs_star = pymc.Normal("defs_star",
                        mu=0,
                        tau=tau_def,
                        size=num_teams,
                        value=def_starting_points.values)
In [227]:
# trick to code the sum to zero constraint
@pymc.deterministic
def atts(atts_star=atts_star):
    atts = atts_star.copy()
    atts = atts - np.mean(atts_star)
    return atts

@pymc.deterministic
def defs(defs_star=defs_star):
    defs = defs_star.copy()
    defs = defs - np.mean(defs_star)
    return defs

@pymc.deterministic
def home_theta(home_team=home_team,
               away_team=away_team,
               home=home,
               atts=atts,
               defs=defs,
               intercept=intercept):
    return np.exp(intercept +
                  home +
                  atts[home_team] +
                  defs[away_team])

@pymc.deterministic
def away_theta(home_team=home_team,
               away_team=away_team,
               home=home,
               atts=atts,
               defs=defs,
               intercept=intercept):
    return np.exp(intercept +
                  atts[away_team] +
                  defs[home_team])
In [228]:
home_points = pymc.Poisson('home_points',
                           mu=home_theta,
                           value=observed_home_goals,
                           observed=True)
away_points = pymc.Poisson('away_points',
                           mu=away_theta,
                           value=observed_away_goals,
                           observed=True)

mcmc = pymc.MCMC([home, intercept, tau_att, tau_def,
                  home_theta, away_theta,
                  atts_star, defs_star, atts, defs,
                  home_points, away_points])
map_ = pymc.MAP(mcmc)
map_.fit()  # initialise the chain at the MAP estimate
mcmc.sample(200000, 40000, 20)  # 200k iterations, 40k burn-in, thin by 20
Let's see if/how the model converged. The home parameter looks good, and indicates that playing at home is worth a consistent boost in scoring intensity. We can see that it converges just like the model for the Premier League in the other tutorial. I wonder, and leave this as an open question, whether all field sports have models of this form that converge.
In [229]:
pymc.Matplot.plot(home)
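Beyond eyeballing the traces, PyMC2 ships a Geweke diagnostic we can use as a quick numerical check (a sketch, not part of the original analysis):
# Geweke z-scores compare the means of early chain segments to the late part;
# values mostly within +/-2 suggest the chain has settled.
scores = pymc.geweke(home.trace())
pymc.Matplot.geweke_plot(scores, 'home')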
In [230]:
# We can plot all of the parameters, just to see.
pymc.Matplot.plot(atts)
In [231]:
observed_season = DATA_DIR + 'table_2014.csv'
df_observed = pd.read_csv(observed_season)
df_observed.loc[df_observed.QR.isnull(), 'QR'] = ''
df_avg = pd.DataFrame({'avg_att': atts.stats()['mean'],
                       'avg_def': defs.stats()['mean']},
                      index=teams.team.values)
In [232]:
df_new = df_avg.merge(df_observed, left_index=True, right_on='team', how='left')
#Bring in the new csv file
df_results = pd.read_csv('df_avg_data_edited.csv', sep=';')
In [233]:
df_results.tail()
Out[233]:
In [254]:
def fig3():
    fig, ax = plt.subplots(figsize=(12, 12))
    for outcome in ['winners', 'wooden_spoon', 'triple_crown', 'NaN']:
        ax.plot(df_results.avg_att[df_results.QR == outcome],
                df_results.avg_def[df_results.QR == outcome], 'o', label=outcome)
    for label, x, y in zip(df_results.team.values, df_results.avg_att.values, df_results.avg_def.values):
        ax.annotate(label, xy=(x, y), xytext=(-5, 5), textcoords='offset points')
    ax.set_title('Attack vs Defense avg effect: 13-14 Six Nations')
    ax.set_xlabel('Avg attack effect')
    ax.set_ylabel('Avg defense effect')
    ax.legend()
In [255]:
fig3()
In [235]:
def simulate_season():
    """
    Simulate a season once, using one random draw from the mcmc chain.
    """
    num_samples = atts.trace().shape[0]
    draw = np.random.randint(0, num_samples)
    atts_draw = pd.DataFrame({'att': atts.trace()[draw, :]})
    defs_draw = pd.DataFrame({'def': defs.trace()[draw, :]})
    home_draw = home.trace()[draw]
    intercept_draw = intercept.trace()[draw]
    season = df.copy()
    season = pd.merge(season, atts_draw, left_on='i_home', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_home', right_index=True)
    season = season.rename(columns={'att': 'att_home', 'def': 'def_home'})
    season = pd.merge(season, atts_draw, left_on='i_away', right_index=True)
    season = pd.merge(season, defs_draw, left_on='i_away', right_index=True)
    season = season.rename(columns={'att': 'att_away', 'def': 'def_away'})
    season['home'] = home_draw
    season['intercept'] = intercept_draw
    season['home_theta'] = season.apply(lambda x: math.exp(x['intercept'] +
                                                           x['home'] +
                                                           x['att_home'] +
                                                           x['def_away']), axis=1)
    season['away_theta'] = season.apply(lambda x: math.exp(x['intercept'] +
                                                           x['att_away'] +
                                                           x['def_home']), axis=1)
    season['home_goals'] = season.apply(lambda x: np.random.poisson(x['home_theta']), axis=1)
    season['away_goals'] = season.apply(lambda x: np.random.poisson(x['away_theta']), axis=1)
    season['home_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] > x['away_goals'] else
                                          'loss' if x['home_goals'] < x['away_goals'] else 'draw', axis=1)
    season['away_outcome'] = season.apply(lambda x: 'win' if x['home_goals'] < x['away_goals'] else
                                          'loss' if x['home_goals'] > x['away_goals'] else 'draw', axis=1)
    season = season.join(pd.get_dummies(season.home_outcome, prefix='home'))
    season = season.join(pd.get_dummies(season.away_outcome, prefix='away'))
    return season
def create_season_table(season):
    """
    Using a season dataframe output by simulate_season(), create a summary
    dataframe with wins, losses, goals for, etc.
    """
    g = season.groupby('i_home')
    home = pd.DataFrame({'home_goals': g.home_goals.sum(),
                         'home_goals_against': g.away_goals.sum(),
                         'home_wins': g.home_win.sum(),
                         'home_losses': g.home_loss.sum()})
    g = season.groupby('i_away')
    away = pd.DataFrame({'away_goals': g.away_goals.sum(),
                         'away_goals_against': g.home_goals.sum(),
                         'away_wins': g.away_win.sum(),
                         'away_losses': g.away_loss.sum()})
    df = home.join(away)
    df['wins'] = df.home_wins + df.away_wins
    df['losses'] = df.home_losses + df.away_losses
    # Each team plays num_teams - 1 games; anything that isn't a win or a
    # loss is a draw. Six Nations scoring: 2 points for a win, 1 for a draw.
    df['draws'] = (num_teams - 1) - df.wins - df.losses
    df['points'] = df.wins * 2 + df.draws
    df['gf'] = df.home_goals + df.away_goals
    df['ga'] = df.home_goals_against + df.away_goals_against
    df['gd'] = df.gf - df.ga
    df = pd.merge(teams, df, left_on='i', right_index=True)
    df = df.sort_values(by='points', ascending=False)
    df = df.reset_index()
    df['position'] = df.index + 1
    df['champion'] = (df.position == 1).astype(int)
    df['relegated'] = (df.position > 5).astype(int)  # the wooden spoon; there's no relegation in the Six Nations
    return df
def simulate_seasons(n=100):
    dfs = []
    for i in range(n):
        s = simulate_season()
        t = create_season_table(s)
        t['iteration'] = i
        dfs.append(t)
    return pd.concat(dfs, ignore_index=True)
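Before simulating at scale, a quick sanity check of a single draw (a sketch; the columns are the ones created by simulate_season above):
season = simulate_season()
season[['home_team', 'away_team', 'home_goals', 'away_goals', 'home_outcome']].head()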
In [236]:
simuls = simulate_seasons(1000)
In [237]:
def fig1():
    ax = simuls.points[simuls.team == 'Ireland'].hist()
    median = simuls.points[simuls.team == 'Ireland'].median()
    ax.set_title('Ireland: 2015 points, 1000 simulations')
    ax.plot([median, median], ax.get_ylim())
    plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1] - 10))
In [238]:
fig1()
In [239]:
def fig2():
    ax = simuls.gf[simuls.team == 'Ireland'].hist(figsize=(7, 5))
    median = simuls.gf[simuls.team == 'Ireland'].median()
    ax.set_title('Ireland: 2015 scores for, 1000 simulations')
    ax.plot([median, median], ax.get_ylim())
    plt.annotate('Median: %s' % median, xy=(median + 1, ax.get_ylim()[1] - 10))
In [240]:
fig2()
In [241]:
g = simuls.groupby('team')
df_champs = pd.DataFrame({'percent_champs': g.champion.mean()})
df_champs = df_champs.sort_values(by='percent_champs')
df_champs = df_champs[df_champs.percent_champs > .05]
df_champs = df_champs.reset_index()
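A caveat of mine on reading the chart below: with 1,000 simulated seasons, each estimated probability p carries a Monte Carlo standard error of roughly sqrt(p(1-p)/1000), so the percentages are only good to a point or two.
# Rough standard error of a championship probability estimated from
# n independent simulated seasons.
import math
def mc_standard_error(p, n=1000):
    return math.sqrt(p * (1 - p) / n)

print(mc_standard_error(0.5))  # ~0.016, i.e. about 1.6 percentage points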
In [242]:
fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(df_champs.index.values, df_champs.percent_champs.values)
for i, row in df_champs.iterrows():
    label = "{0:.1f}%".format(100 * row['percent_champs'])
    ax.annotate(label, xy=(row['percent_champs'], i), xytext=(3, 10), textcoords='offset points')
ax.set_ylabel('Team')
ax.set_title('% of Simulated Seasons In Which Team Finished Winners')
_ = ax.set_yticks(df_champs.index + .5)
_ = ax.set_yticklabels(df_champs['team'].values)
Unfortunately it seems that in most of these simulated universes England come top of the Six Nations. As an Irishman, I take this as firm proof that I put mathematical rigour before patriotism :) It's a reasonable result, and I hope it made for a nice example of Bayesian modelling in rugby analytics.
In [243]:
pd.read_csv('data/results_2015.csv', sep='\t', header=0,
            names=['Rank', 'Team', 'Games', 'Wins', 'Draws',
                   'Losses', 'Points_For', 'Points_Against', 'Points'])
Out[243]:
In [244]:
from IPython.display import Image
Image(filename='onedoesnotsimply.jpg')
Out[244]:
Probabilistic Programming & Bayesian Methods for Hackers -- an IPython Notebook book on Bayesian stats using PyMC2
Doing Bayesian Data Analysis -- a great book by Kruschke.
In [245]:
from IPython.display import Image
Image(filename='the-most-interesting-man-in-the-world-meme-generator-i-don-t-give-many-talks-but-when-i-do-i-do-q-a-e38222 copy.jpg')
Out[245]: