In this IPython notebook I take an empirical Bayes approach to analyzing just how good Tim Howard's record during the World Cup was compared to that of other goalies.
The actual Bayesian analysis is pretty straightforward, but getting the data is not. As a result, most of this notebook deals with scraping and preparing the data. Unless you happen to be interested in how to scrape data, feel free to skip ahead to the actual analysis.
In [1]:
%pylab inline
In [216]:
%ls data
In [3]:
# grab data! First focus on world cup data from FIFA. Tabular form
import pandas as pd
goals = pd.read_csv('data/goals.csv')
goals.head()
Out[3]:
In [217]:
goals.iloc[:, 0][0:5]
Out[217]:
In [5]:
goals.describe()
Out[5]:
Unfortunately this data isn't usable because it combines the stats for both teams! How lame. It looks like we'll have to scrape the data from all ~55 individual games. We can do this by extracting the URL for each match from a saved FIFA results page and then scraping each match's statistics page.
In [218]:
# from bs4 import BeautifulSoup
with open('data/2014 FIFA World Cup Brazil™ - Statistics - Matches - Top attacks - FIFA.com.html', 'r') as f:
raw_page = f.read()
# big_page_soup = BeautifulSoup(raw_page)
In [219]:
import re
example_url = 'http://www.fifa.com/worldcup/matches/round=255931/match=300186456/index.html#nosticky'
url_re = r'http://www.fifa.com/worldcup/matches/round=(\d{6})/match=(\d{9})/index.html'
game_urls = re.findall(url_re, raw_page)
len(game_urls)
Out[219]:
In [220]:
game_urls[:5]
Out[220]:
In [46]:
# Scrape statistics from each game url
import requests
from bs4 import BeautifulSoup
test_url = 'http://www.fifa.com/worldcup/matches/round=255931/match=300186456/statistics.html'
def extract_stats(url):
    """Extract all match statistics from a FIFA statistics page."""
    r = requests.get(url).text
    soup = BeautifulSoup(r)
    # print(soup)
    # find teams
    header = soup.find('div', 'navbar-matchheader')
    home = header.find('div', 't home').find('span', 't-nTri').text  # or use 't-nText' for full name
    away = header.find('div', 't away').find('span', 't-nTri').text
    # find stats
    stats_tables = soup.find('div', 'general-stats').find_all('div', 'teamstats-wholematch')
    data = {}
    # extract data
    for stats in stats_tables:
        # find rows (empty list if the table is missing)
        rows = stats.find_all('tr') if stats else []
        # find stats in each row
        for row in rows:
            try:
                name = row.find('td', 'stats-name').text
                home_stats = row.find('td', attrs={'data-statref': 'home'}).text
                away_stats = row.find('td', attrs={'data-statref': 'away'}).text
                # append data
                data[name] = {home: home_stats, away: away_stats}
            except AttributeError:
                print('error on ' + str(row))
    return data
test_soup = extract_stats(test_url)
test_soup
Out[46]:
In [240]:
# Loop through all games and extract stats
import time
all_stats = {}
for round_id, match_id in game_urls:
    url = ('http://www.fifa.com/worldcup/matches/round=' +
           str(round_id) + '/match=' + str(match_id) + '/statistics.html')
    stats_dict = extract_stats(url)
    all_stats['{}:{}'.format(round_id, match_id)] = stats_dict
    time.sleep(0.1)
# all_stats
In [48]:
# save stats for later use
import json
with open('data/all_stats.json', 'w') as f:
    json.dump(all_stats, f)
In [222]:
# Reformat data of interest!
# We care about a goalies chance of saving an on target goal
# So attempts that are on target, ex blocks
# The other thing is that because we care about goalies,
# we need to switch the team that we are tracking the statistics for
def get_goalie_stats(game, t1, t2):
    """Extract savable attacks and saves for the goalie on t1."""
    saves = int(game['Saves'][t2])
    savable_attacks = int(game['On-Target'][t2]) - int(game['Blocked'][t2])
    return savable_attacks, saves
goalie_stats = {}
for game_no, (ids, game) in enumerate(all_stats.items()):
    if 'Goals' in game:
        teams = list(game['Goals'].keys())  # list() so the keys are indexable
        t1 = teams[0]
        t2 = teams[1]
        t1_stats = get_goalie_stats(game, t1, t2)
        t2_stats = get_goalie_stats(game, t2, t1)
        goalie_stats[t1] = (goalie_stats.get(t1, (0, 0))[0] + t1_stats[0],
                            goalie_stats.get(t1, (0, 0))[1] + t1_stats[1])
        goalie_stats[t2] = (goalie_stats.get(t2, (0, 0))[0] + t2_stats[0],
                            goalie_stats.get(t2, (0, 0))[1] + t2_stats[1])
goalie_stats
Out[222]:
In [223]:
# Simple save percentages (no prior)
{k:float(saves)/savables for k, (savables, saves) in goalie_stats.items()}
Out[223]:
In [225]:
# number of non-saves
# {k:savables - saves for k, (savables, saves) in goalie_stats.items()}
Here I will model a goalie's probability of making a save as a Bernoulli random variable with a prior given by the beta distribution.
First, I set the prior by choosing the maximum likelihood estimates of the parameters of the beta distribution using all of the data. Modeling many binary outcomes with a beta distribution as a conjugate prior means I am using the beta-binomial distribution.
Then, I look at each goalie's performance by using their results to calculate a posterior probability that they will save an on-target, un-blocked attempt.
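The conjugate update behind that last step is simple enough to sketch directly (with made-up numbers, not the scraped data): a Beta(a, b) prior combined with s saves out of n savable attempts gives a Beta(a + s, b + n - s) posterior.

```python
def beta_posterior(a, b, saves, savable_attempts):
    """Conjugate Beta-Bernoulli update: start from a Beta(a, b) prior and
    observe `saves` successes out of `savable_attempts` trials."""
    return a + saves, b + (savable_attempts - saves)

# Hypothetical goalie: Beta(2, 1) prior, 12 saves on 15 savable attempts
a_post, b_post = beta_posterior(2.0, 1.0, 12, 15)
print(a_post, b_post)              # posterior is Beta(14, 4)
print(a_post / (a_post + b_post))  # posterior mean save probability
```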
In [65]:
total_savables = float(sum([a for a, b in goalie_stats.values()]))
total_savables
Out[65]:
In [66]:
total_saves = float(sum([b for a, b in goalie_stats.values()]))
total_saves
Out[66]:
In [226]:
# I expect the prior to have an expected mean close to the fraction of
# total saves divided by total savables
total_frac = total_saves/total_savables
total_frac
Out[226]:
In [179]:
# First, I will form the log-likelihood for the beta-binomial distribution.
# Luckily it has a nice closed form solution
from scipy.special import betaln, gammaln, comb
def nllf(a, b):
    result = 0
    for savables, saves in goalie_stats.values():
        # result += (log(comb(savables, saves)) + betaln(saves + a, savables - saves + b) - betaln(a, b))
        # using gammaln for the binomial coefficient should be more numerically precise
        result += (gammaln(savables + 1) - gammaln(saves + 1) - gammaln(savables - saves + 1) +
                   betaln(saves + a, savables - saves + b) - betaln(a, b))
    return -result
# test
nllf(total_frac, 1.0-total_frac)
Out[179]:
In [180]:
nllf(total_saves, total_savables)
Out[180]:
In [181]:
nllf(0.5, 0.5)
Out[181]:
In [190]:
# Now we'll estimate the values of a and b via maximizing the log-likelihood
from scipy.optimize import minimize
def nllf_log_args(arg_array):
    return nllf(exp(arg_array[0]), exp(arg_array[1]))
result = minimize(nllf_log_args, [log(0.5), log(0.5)], tol=10**-3)
result
Out[190]:
In [227]:
# Calculate the prior parameters in more usable forms
a, b = exp(result.x)
a
Out[227]:
In [228]:
b
Out[228]:
In [229]:
# Calculated the expected value of the prior
prior_mean = a/(a+b)
prior_mean
Out[229]:
In [204]:
# Plot the pdf of the prior distribution
# Make a function for repeat use
from scipy.stats import beta
def bayes_goalie_chart(posterior_1_country, posterior_2_country=None, chart_title=None):
    x = arange(0, 1, 0.01)
    if posterior_1_country:
        plot(x, beta(*posteriors[posterior_1_country]).pdf(x),
             label="{} Posterior".format(posterior_1_country))
    if posterior_2_country is not None:
        plot(x, beta(*posteriors[posterior_2_country]).pdf(x),
             label="{} Posterior".format(posterior_2_country))
    else:
        plot(x, beta(a, b).pdf(x), label="Prior")
    # Other nice formatting stuff
    title(chart_title)
    legend(loc="best")
    xlabel("Probability")
    ylabel("Density")
    show()
bayes_goalie_chart(None, chart_title="Probability of a Save")
In [122]:
# Calculate posterior distributions
posteriors = {k:(a + saves, b + savables - saves) for k, (savables, saves) in goalie_stats.items()}
posteriors
Out[122]:
In [124]:
posterior_means = {k:saves/(saves + misses) for k, (saves, misses) in posteriors.items()}
posterior_means
Out[124]:
In [215]:
# Calculate the 'rank' of how good a goalie is
ranked_posterior_means = sorted(posterior_means.values(), reverse=True)
posterior_rank = {k: ranked_posterior_means.index(v) + 1 for k, v in posterior_means.items()}
posterior_rank
Out[215]:
Aww, so it turns out that according to this analysis the USA is not #1. Tim Howard was good, but it's hard to come out on top of Costa Rica, which only allowed two goals.
Let's look at a handful of countries to get a feel for the data. Why not start with Brazil, who seemed to have the worst performance in the World Cup?
In [207]:
bayes_goalie_chart('BRA', chart_title="Brazil's Posterior Probability of a Save")
And how did the USA compare to the prior?
In [237]:
bayes_goalie_chart('USA', chart_title="USA's Posterior Probability of a Save")
And what does it look like when we compare the two?
In [238]:
bayes_goalie_chart('USA', posterior_2_country='BRA', chart_title="USA vs BRA")
How does the USA posterior compare to Costa Rica's?
In [210]:
bayes_goalie_chart('USA', posterior_2_country='CRC', chart_title="USA vs CRC")
Pretty similar.
Since I did this analysis in the Netherlands, I thought I'd also compare the USA to the Netherlands' performance.
In [211]:
bayes_goalie_chart('USA', posterior_2_country='NED', chart_title="USA vs NED")
And how different are all of the goalies from each other?
In [235]:
x = arange(0, 1, 0.01)
for a_post, b_post in posteriors.values():  # avoid shadowing the prior's a and b
    plot(x, beta(a_post, b_post).pdf(x), color='b', alpha=0.25)
title('All Posteriors')
show()
In [ ]: