Reflecting on 2017, I decided to return to my most popular blog topic (at least by the number of emails I get). Last time, I built a crude statistical model to predict the result of football matches. I even presented a webinar on the subject here (it's free to sign up). During the presentation, I described a coefficient in the model that accounts for the fact that the home team tends to score more goals than the away team. This is called the home advantage or home field advantage and can probably be explained by a combination of physcological (e.g. familiarity with surroundings) and physical factors (e.g. travel). It occurs in various sports, including American football, baseball, basketball and soccer. Sticking to soccer/football, I mentioned in my talk how it would be interesting to see how this effect varies around the world. In which countries do the home teams enjoy the greatest advantage?

We're going to use the same statistcal model as last time, so there won't be any new statistical features developed in this post. Instead, it will focus on retrieving the appropriate goals data for even the most obscure leagues in the world (yes, even the Irish Premier Division) and then interactively visualising the results with D3. The full code can be found in the accompanying Jupyter notebook.

Calculating Home Field Advantage

The first consideration should probably be how to calculate home advantage. The traditional approach is to look at team matchups and check whether teams achieved better, equal or worse results at home than away. For example, let's imagine Chlesea beat Arsenal 2-0 at home and drew 1-1 away. That would be recored as a better home result (+2 goals versus 0). This process is repeated for every opponent and so you can actually construct a trinomial distribution and test whether there was a statistically significant home field effect. This works for balanced leagues, where team play each other an equal number of times home and away. While this holds for Europe's most famous leagues (e.g. EPL, La Liga), there are various leagues where teams play each other threes times (e.g. Ireland, Montenegro, Tajikistan aka The Big Leagues) or even just once (e.g Libya and to a lesser extent MLS (balanced for teams within the same conference)). There's also issues with postponements and abandonments rendering some leagues slightly unbalanced (e.g. Sri Lanka). For those reasons, we'll opt for a different (though not necessarily better) approach.

In the previous post, we built a model for the EPL 2016/17 season, using the number of goals scored in the past to predict future results. Looking at the model coefficients again, you see the home coefficient has a value of approximately 0.3. By taking the exponent of this value ($exp^{0.3}=1.35$), it tells us that the home team are generally 1.35 times more likely to score than the away team. In case you don't recall, the model accounts for team strength/weakness by including coefficients for each team (e.g 0.07890 and -0.96194 for Chelsea and Sunderland, respectively).

Let's see how this value compares with the lower divisions in England over the past 10 years. We'll pull the data from football-data.co.uk, which can loaded in directly using the url link for each csv file. First, we'll design a function that will take a dataframe of match results as an input and return the home field advantage (plus confidence interval limits) for that league.


In [11]:
# importing the tools required for the Poisson regression model
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn

def get_home_team_advantage(goals_df, pval=0.05):
    
    # extract relevant columns
    model_goals_df = goals_df[['HomeTeam','AwayTeam','FTHG','FTAG']]
    # rename goal columns
    model_goals_df = model_goals_df.rename(columns={'FTHG': 'HomeGoals', 'FTAG': 'AwayGoals'})

    # reformat dataframe for the model
    goal_model_data = pd.concat([model_goals_df[['HomeTeam','AwayTeam','HomeGoals']].assign(home=1).rename(
                columns={'HomeTeam':'team', 'AwayTeam':'opponent','HomeGoals':'goals'}),
               model_goals_df[['AwayTeam','HomeTeam','AwayGoals']].assign(home=0).rename(
                columns={'AwayTeam':'team', 'HomeTeam':'opponent','AwayGoals':'goals'})])

    # build poisson model
    poisson_model = smf.glm(formula="goals ~ home + team + opponent", data=goal_model_data, 
                            family=sm.families.Poisson()).fit()
    # output model parameters
    poisson_model.summary()
    
    return np.concatenate((np.array([poisson_model.params['home']]), 
                    poisson_model.conf_int(alpha=pval).values[-1]))


/home/david/anaconda2/lib/python2.7/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools

I've essentially combined various parts of the previous post into one convenient function. If it looks a little strange, then I suggest you consult the original post. Okay, we're ready to start calculating some home advantage scores.


In [12]:
# home field advantage for EPL 2016/17 season
get_home_team_advantage(pd.read_csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv"))


Out[12]:
array([ 0.2838454,  0.16246  ,  0.4052308])

It's as easy as that. Feed a url from football-data.co.uk into the function and it'll quickly tell you the statistical advantage enjoyed by home teams in that league. Note that the latter two values repesent the left and right limit of the 95% confidence interval around the mean value. The first value in the array is actually just the log of the number of goals scored by the home team divided by the total number of away goals.


In [13]:
temp_goals_df = pd.read_csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv")
[np.exp(get_home_team_advantage(temp_goals_df)[0]),
 np.sum(temp_goals_df['FTHG'])/float(np.sum(temp_goals_df['FTAG']))]


Out[13]:
[1.3282275711159723, 1.3282275711159737]

The goals ratio calculation is obviously much simpler and definitely more intuitive. But it doesn't allow me to reference my previous post as much (link link link) and it fails to provide any uncertainty around the headline figure. Let's plot the home advantage figure for the top 5 divisions of the English league pyramid for since 2005. You can remove those hugely informative confidence interval bars by unticking the checkbox.


In [14]:
division_results = []
for division in range(5):
    year_results = []
    for year in range(2005,2017):
        if division==4:
            division_string = 'C'
        else:
            division_string = str(division)
        url = "http://www.football-data.co.uk/mmz4281/"+str(year)[-2:]+str(year+1)[-2:]+"/E"+division_string+".csv"
        print(url)
        year_results.append(np.concatenate((np.array([year, division]), get_home_team_advantage(pd.read_csv(url)))))
    division_results.append(np.vstack(year_results))


http://www.football-data.co.uk/mmz4281/0506/E0.csv
http://www.football-data.co.uk/mmz4281/0607/E0.csv
http://www.football-data.co.uk/mmz4281/0708/E0.csv
http://www.football-data.co.uk/mmz4281/0809/E0.csv
http://www.football-data.co.uk/mmz4281/0910/E0.csv
http://www.football-data.co.uk/mmz4281/1011/E0.csv
http://www.football-data.co.uk/mmz4281/1112/E0.csv
http://www.football-data.co.uk/mmz4281/1213/E0.csv
http://www.football-data.co.uk/mmz4281/1314/E0.csv
http://www.football-data.co.uk/mmz4281/1415/E0.csv
http://www.football-data.co.uk/mmz4281/1516/E0.csv
http://www.football-data.co.uk/mmz4281/1617/E0.csv
http://www.football-data.co.uk/mmz4281/0506/E1.csv
http://www.football-data.co.uk/mmz4281/0607/E1.csv
http://www.football-data.co.uk/mmz4281/0708/E1.csv
http://www.football-data.co.uk/mmz4281/0809/E1.csv
http://www.football-data.co.uk/mmz4281/0910/E1.csv
http://www.football-data.co.uk/mmz4281/1011/E1.csv
http://www.football-data.co.uk/mmz4281/1112/E1.csv
http://www.football-data.co.uk/mmz4281/1213/E1.csv
http://www.football-data.co.uk/mmz4281/1314/E1.csv
http://www.football-data.co.uk/mmz4281/1415/E1.csv
http://www.football-data.co.uk/mmz4281/1516/E1.csv
http://www.football-data.co.uk/mmz4281/1617/E1.csv
http://www.football-data.co.uk/mmz4281/0506/E2.csv
http://www.football-data.co.uk/mmz4281/0607/E2.csv
http://www.football-data.co.uk/mmz4281/0708/E2.csv
http://www.football-data.co.uk/mmz4281/0809/E2.csv
http://www.football-data.co.uk/mmz4281/0910/E2.csv
http://www.football-data.co.uk/mmz4281/1011/E2.csv
http://www.football-data.co.uk/mmz4281/1112/E2.csv
http://www.football-data.co.uk/mmz4281/1213/E2.csv
http://www.football-data.co.uk/mmz4281/1314/E2.csv
http://www.football-data.co.uk/mmz4281/1415/E2.csv
http://www.football-data.co.uk/mmz4281/1516/E2.csv
http://www.football-data.co.uk/mmz4281/1617/E2.csv
http://www.football-data.co.uk/mmz4281/0506/E3.csv
http://www.football-data.co.uk/mmz4281/0607/E3.csv
http://www.football-data.co.uk/mmz4281/0708/E3.csv
http://www.football-data.co.uk/mmz4281/0809/E3.csv
http://www.football-data.co.uk/mmz4281/0910/E3.csv
http://www.football-data.co.uk/mmz4281/1011/E3.csv
http://www.football-data.co.uk/mmz4281/1112/E3.csv
http://www.football-data.co.uk/mmz4281/1213/E3.csv
http://www.football-data.co.uk/mmz4281/1314/E3.csv
http://www.football-data.co.uk/mmz4281/1415/E3.csv
http://www.football-data.co.uk/mmz4281/1516/E3.csv
http://www.football-data.co.uk/mmz4281/1617/E3.csv
http://www.football-data.co.uk/mmz4281/0506/EC.csv
http://www.football-data.co.uk/mmz4281/0607/EC.csv
http://www.football-data.co.uk/mmz4281/0708/EC.csv
http://www.football-data.co.uk/mmz4281/0809/EC.csv
http://www.football-data.co.uk/mmz4281/0910/EC.csv
http://www.football-data.co.uk/mmz4281/1011/EC.csv
http://www.football-data.co.uk/mmz4281/1112/EC.csv
http://www.football-data.co.uk/mmz4281/1213/EC.csv
http://www.football-data.co.uk/mmz4281/1314/EC.csv
http://www.football-data.co.uk/mmz4281/1415/EC.csv
http://www.football-data.co.uk/mmz4281/1516/EC.csv
http://www.football-data.co.uk/mmz4281/1617/EC.csv

In [15]:
from ipywidgets import interact, Checkbox

def plot_func(freq):
    fig, ax1 = plt.subplots(1, 1,figsize=(7, 5))
    for div, (div_name, div_col) in enumerate(zip(['EPL', 'Championship', 'League 1', 'League 2', 'Conference'],
                               ["#9b5369", "#7c8163", "#c1a381", "#d9bcc0", "#F67280"])):
        if freq:
            ax1.errorbar(division_results[div][:,0], division_results[div][:,2],
                         yerr= (division_results[div][:,4] - division_results[div][:,2]),
                          linestyle='-', marker='o',label=div_name, color = div_col)
        else:
            ax1.plot(division_results[div][:,0], division_results[div][:,2],
                          linestyle='-', marker='o',label=div_name, color = div_col)
    #[str(int(item.get_text())+1)[-2:] for item in ax1.get_xticklabels()]
    ax1.set_xticks([2005, 2007, 2009, 2011, 2013, 2015])
    ax1.set_xlabel('Season', fontsize=12)
    ax1.set_ylabel('Home Advantage Score', fontsize=12)
    ax1.set_ylim([-0.05, 0.6])
    ax1.set_xlim([2004.5, 2016.5])
    ax1.set_xticklabels(['05/06', '07/08', '09/10', '11/12', '13/14', '15/16'])
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()
    
interact(plot_func, freq = Checkbox(
    value=True,
    description='Error Bars',
    disabled=False))


Out[15]:
<function __main__.plot_func>

It's probably more apparent without those hugely informative confidence interval bars, but it seems that the home advantage score decreases slightly as you move down the pyramid (analysis by Sky Sports produced something similar). This might make sense for two reasons. Firstly, bigger teams generally have larger stadiums and more supporters, which could strengthen the home field advantage. Secondly, as you go down the leagues, I suspect the quality gap between teams narrows. Taking it to an extreme, when I used to play Sunday league football, it didn't really matter where we played... we still lost. In that sense, one must be careful comparing the home advantage between leagues, as it will be affected by the relative team strengths within those leagues. For example, a league with a very dominant team (or teams) will record a lower home advantage score, as that dominant team will score goals home and away with little difference (Man Utd would probably beat Cork City 6-0 at Old Trafford and Turners Cross!).

Having warned about the dangers of comparing different leagues with this approach, let's now compare the top five leagues in Europe over the same time period as before.


In [16]:
country_results = []
for country in ['E0', 'SP1', 'I1', 'D1', 'F1']:
    year_results = []
    for year in range(2005,2017):
        url = "http://www.football-data.co.uk/mmz4281/"+str(year)[-2:]+str(year+1)[-2:]+ "/" + country +".csv"
        print(url)
        year_results.append(np.concatenate((np.array([year, division]), 
                                            get_home_team_advantage(pd.read_csv(url)))))
    country_results.append(np.vstack(year_results))


http://www.football-data.co.uk/mmz4281/0506/E0.csv
http://www.football-data.co.uk/mmz4281/0607/E0.csv
http://www.football-data.co.uk/mmz4281/0708/E0.csv
http://www.football-data.co.uk/mmz4281/0809/E0.csv
http://www.football-data.co.uk/mmz4281/0910/E0.csv
http://www.football-data.co.uk/mmz4281/1011/E0.csv
http://www.football-data.co.uk/mmz4281/1112/E0.csv
http://www.football-data.co.uk/mmz4281/1213/E0.csv
http://www.football-data.co.uk/mmz4281/1314/E0.csv
http://www.football-data.co.uk/mmz4281/1415/E0.csv
http://www.football-data.co.uk/mmz4281/1516/E0.csv
http://www.football-data.co.uk/mmz4281/1617/E0.csv
http://www.football-data.co.uk/mmz4281/0506/SP1.csv
http://www.football-data.co.uk/mmz4281/0607/SP1.csv
http://www.football-data.co.uk/mmz4281/0708/SP1.csv
http://www.football-data.co.uk/mmz4281/0809/SP1.csv
http://www.football-data.co.uk/mmz4281/0910/SP1.csv
http://www.football-data.co.uk/mmz4281/1011/SP1.csv
http://www.football-data.co.uk/mmz4281/1112/SP1.csv
http://www.football-data.co.uk/mmz4281/1213/SP1.csv
http://www.football-data.co.uk/mmz4281/1314/SP1.csv
http://www.football-data.co.uk/mmz4281/1415/SP1.csv
http://www.football-data.co.uk/mmz4281/1516/SP1.csv
http://www.football-data.co.uk/mmz4281/1617/SP1.csv
http://www.football-data.co.uk/mmz4281/0506/I1.csv
http://www.football-data.co.uk/mmz4281/0607/I1.csv
http://www.football-data.co.uk/mmz4281/0708/I1.csv
http://www.football-data.co.uk/mmz4281/0809/I1.csv
http://www.football-data.co.uk/mmz4281/0910/I1.csv
http://www.football-data.co.uk/mmz4281/1011/I1.csv
http://www.football-data.co.uk/mmz4281/1112/I1.csv
http://www.football-data.co.uk/mmz4281/1213/I1.csv
http://www.football-data.co.uk/mmz4281/1314/I1.csv
http://www.football-data.co.uk/mmz4281/1415/I1.csv
http://www.football-data.co.uk/mmz4281/1516/I1.csv
http://www.football-data.co.uk/mmz4281/1617/I1.csv
http://www.football-data.co.uk/mmz4281/0506/D1.csv
http://www.football-data.co.uk/mmz4281/0607/D1.csv
http://www.football-data.co.uk/mmz4281/0708/D1.csv
http://www.football-data.co.uk/mmz4281/0809/D1.csv
http://www.football-data.co.uk/mmz4281/0910/D1.csv
http://www.football-data.co.uk/mmz4281/1011/D1.csv
http://www.football-data.co.uk/mmz4281/1112/D1.csv
http://www.football-data.co.uk/mmz4281/1213/D1.csv
http://www.football-data.co.uk/mmz4281/1314/D1.csv
http://www.football-data.co.uk/mmz4281/1415/D1.csv
http://www.football-data.co.uk/mmz4281/1516/D1.csv
http://www.football-data.co.uk/mmz4281/1617/D1.csv
http://www.football-data.co.uk/mmz4281/0506/F1.csv
http://www.football-data.co.uk/mmz4281/0607/F1.csv
http://www.football-data.co.uk/mmz4281/0708/F1.csv
http://www.football-data.co.uk/mmz4281/0809/F1.csv
http://www.football-data.co.uk/mmz4281/0910/F1.csv
http://www.football-data.co.uk/mmz4281/1011/F1.csv
http://www.football-data.co.uk/mmz4281/1112/F1.csv
http://www.football-data.co.uk/mmz4281/1213/F1.csv
http://www.football-data.co.uk/mmz4281/1314/F1.csv
http://www.football-data.co.uk/mmz4281/1415/F1.csv
http://www.football-data.co.uk/mmz4281/1516/F1.csv
http://www.football-data.co.uk/mmz4281/1617/F1.csv

In [17]:
def plot_func(freq):
    fig, ax1 = plt.subplots(1, 1,figsize=(7, 5))
    for div, (div_name, div_col) in enumerate(zip(['EPL', 'La Liga', 'Serie A', 'Bundesliga 1', 'Ligue 1'],
                                   ["#9b5369", "#A1D9FF", "#CA82F8", "#ED93CB", "#78B7BB"])):
        if freq:
            ax1.errorbar(country_results[div][:,0], country_results[div][:,2],
                         yerr= (country_results[div][:,4] - country_results[div][:,2]),
                          linestyle='-', marker='o',label=div_name, color = div_col)
        else:
            ax1.plot(country_results[div][:,0], country_results[div][:,2],
                          linestyle='-', marker='o',label=div_name, color = div_col)

    ax1.set_xticks([2005, 2007, 2009, 2011, 2013, 2015])
    ax1.set_ylim([-0.05, 0.6])
    ax1.set_xlim([2004.5, 2016.5])
    ax1.set_xlabel('Season', fontsize=12)
    ax1.set_ylabel('Home Advantage Score', fontsize=12)
    ax1.set_xticklabels(['05/06', '07/08', '09/10', '11/12', '13/14', '15/16'])
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()
    
interact(plot_func, freq = Checkbox(
    value=True,
    description='Error Bars',
    disabled=False))


Honestly, there's not much going on there. With the poissble exception of the Spanish La Liga since 2010, the home field advantage enjoyed by the teams in each league is broadly similar (and that's before we bring in the idea of confidence intervals and hypothesis testing).

Home Advantage Around the World

To find more interesting contrasts, we must venture to crappier and more corrupt leagues. My hunch is that home advantage would be negligible in countries where the overall quality (team, infastructure, etc.) is very low. And by low, I mean leagues worse than the Irish Premier Division (yes, they exist). Unfortunately, the historical results for such leagues are not available on football-data.co.uk. Instead, we'll scrape the data off betexplorer. I'm extremely impressed by the breadth of this site. You can even retrieve past results for the French overseas department of Réunion. Fun fact: Dimtri Payet spent the 2004 season at AS Excelsior of the Réunion Premier League.

We'll use Scrapy to pull the appropriate information off the website. If you've never used Scrapy before, then you should check out this post. I won't spend too long on this part, but you can find the full code on here.


In [1]:
import scrapy
import re # for text parsing
import logging
import pandas as pd

class FootballSpider(scrapy.Spider):
    name = 'footballSpider'
    # page to scrape
    start_urls = pd.read_csv(
        'https://raw.githubusercontent.com/dashee87/blogScripts/master/files/league_links.csv')['link'].tolist() + \
    pd.read_csv("league_links_.csv")['link2'].dropna(axis=0, how='all').tolist()
    # if you want to impose a delay between sucessive scrapes
#   download_delay = 1.0 

    def parse(self, response):
        self.logger.info('Scraping page: %s', response.url)
        country_league = response.css('.list-breadcrumb__item__in::text').extract()
        
        for num, (hometeam, awayteam, match_result, date) in \
            enumerate(zip(response.css('.in-match span:nth-child(1)'), 
                                 response.css('.in-match span:nth-child(2)'),
                                 response.css('td.h-text-center'),
                                 response.css('.h-text-no-wrap::text').extract())):
            yield {'country':country_league[2], 'league': country_league[3],
                   'HomeTeam': hometeam.css('::text').extract_first(), 
                   'AwayTeam':awayteam.css('::text').extract_first(), 
                   'FTHG':  re.sub(':.*', '', match_result.css('::text').extract_first()), 
                   'FTAG':  re.sub('.*:', '', match_result.css('::text').extract_first()),
                   'awarded': ' ' in match_result.css('::text').extract_first() or
                   'AWA.' in match_result.css('::text').extract_first(),
                   'date':date}

In [2]:
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
'FEED_FORMAT': 'json',
'FEED_URI': 'all_league_goals_.json'
})

# minimising the information presented on the scrapy log
logging.getLogger('scrapy').setLevel(logging.WARNING)
process.crawl(FootballSpider)
process.start()


2018-01-09 15:26:24 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2018-01-09 15:26:24 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'all_league_goals_.json', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
2018-01-09 15:26:25 [footballSpider] INFO: Scraping page: http://www.betexplorer.com/soccer/albania/super-league-2016-2017/results/
2018-01-09 15:26:25 [footballSpider] INFO: Scraping page: http://www.betexplorer.com/soccer/andorra/primera-divisio-2016-2017/results/?stage=dp9brpEB

You don't actually need to run your own spider, as I've shared the output to my GitHub account. We can import the json file in directly using pandas.


In [22]:
all_league_goals = pd.read_json(
    "https://raw.githubusercontent.com/dashee87/blogScripts/master/files/all_league_goals.json")
# reorder the columns to it a bit more logical
all_league_goals = all_league_goals[['country', 'league', 'date', 'HomeTeam', 
                                     'AwayTeam', 'FTHG', 'FTAG', 'awarded']]
all_league_goals.head()


Out[22]:
country league date HomeTeam AwayTeam FTHG FTAG awarded
0 Albania Super League 2016/2017 2017-05-27 Korabi Peshkopi Flamurtari 0 3 False
1 Albania Super League 2016/2017 2017-05-27 Laci Teuta 2 1 False
2 Albania Super League 2016/2017 2017-05-27 Luftetari Gjirokastra Kukesi 1 0 False
3 Albania Super League 2016/2017 2017-05-27 Skenderbeu Partizani 2 2 False
4 Albania Super League 2016/2017 2017-05-27 Vllaznia KF Tirana 0 0 False

Hopefully, that's all relatively clear. You'll notice that it's very similar to the format used by football-data, which means that we can feed this dataframe into the get_home_team_advantage function. Sometimes, matches are awarded due to one team fielding an ineligible player or crowd trouble. We should probably exclude such matches from the home field advantage calculations.


In [25]:
# little bit of data cleansing to remove fixtures that were abandoned/awarded/postponed
all_league_goals = all_league_goals[~all_league_goals['awarded']]
all_league_goals = all_league_goals[all_league_goals['FTAG']!='POSTP.']
all_league_goals = all_league_goals[all_league_goals['FTAG']!='CAN.']
all_league_goals[['FTAG', 'FTHG']] = all_league_goals[['FTAG', 'FTHG']].astype(int)

We're ready to put it all together. I'll omit the code (though it can be found here), but we'll loop through each country and league combination (just in case you decide to include multiple leagues from the same country) and calculate the home advantage score, plus its confidence limits as well as some other information for each league (number of teams, average number of goals in each match). I've converted the pandas output to a datatables table that you can interactively filter and sort.


In [27]:
home_advantage_country = pd.DataFrame(all_league_goals.assign(match_goals = all_league_goals['FTHG'] +  
                                      all_league_goals['FTHG']).groupby(['country','league']).agg(
        {'HomeTeam':['size','nunique'], 'match_goals':'mean'}).to_records())
home_advantage_country.columns = ['country', 'league', 'num_games', 'num_teams', 'avg_goals']
temp_set = []
for i in range(home_advantage_country.shape[0]):
    temp_set.append(get_home_team_advantage(all_league_goals[(
                    all_league_goals['country']==home_advantage_country['country'][i]) & (
                all_league_goals['league']==home_advantage_country['league'][i])]))
temp_set = pd.DataFrame(temp_set,columns= ['home_advantage_score', 'left_tail', 'right_tail'])
home_advantage_country = pd.concat([home_advantage_country, temp_set], axis=1).sort_values('home_advantage_score', 
                                            ascending=False).reset_index(drop=True)
home_advantage_country.index = home_advantage_country.index + 1
# if you want display more/less rows than the default option
pd.options.display.max_rows = 40
home_advantage_country.assign(avg_goals= pd.Series.round(home_advantage_country['avg_goals'], 3),
                             home_advantage_score= pd.Series.round(home_advantage_country['home_advantage_score'], 3),
                             left_tail= pd.Series.round(home_advantage_country['left_tail'], 3),
                             right_tail= pd.Series.round(home_advantage_country['right_tail'], 3))


Out[27]:
country league num_games num_teams avg_goals home_advantage_score left_tail right_tail
1 Nigeria Premier League 2017 379 20 3.087 1.195 1.027 1.363
2 Haiti Championnat National 2017 237 16 2.329 0.741 0.533 0.949
3 Algeria Ligue 1 2016/2017 238 16 2.790 0.698 0.512 0.884
4 Ghana Premier League 2017 238 16 2.924 0.676 0.494 0.857
5 Bolivia Liga de Futbol Prof 2016/2017 132 12 4.470 0.624 0.431 0.818
6 Guatemala Liga Nacional 2016/2017 264 12 2.803 0.620 0.448 0.792
7 Benin Championnat National 2017 162 19 2.272 0.571 0.330 0.811
8 USA MLS 2017 374 22 3.749 0.538 0.416 0.660
9 Peru Primera Division 2017 238 16 3.361 0.520 0.359 0.680
10 Indonesia Liga 1 2017 304 18 3.618 0.515 0.378 0.651
11 Togo Championnat National 2016/2017 181 14 2.409 0.510 0.293 0.726
12 Uzbekistan Professional Football League 2017 233 16 3.193 0.503 0.338 0.668
13 Mozambique Mocambola 2017 240 16 2.325 0.501 0.310 0.692
14 Angola Girabola 2017 239 16 2.678 0.499 0.321 0.678
15 Greece Super League 2016/2017 240 16 2.883 0.499 0.328 0.671
16 Tunisia Ligue Professionnelle 1 2016/2017 112 16 2.607 0.495 0.231 0.759
17 Albania Super League 2016/2017 180 10 2.333 0.488 0.269 0.707
18 Sudan Premier League 2017 306 18 2.791 0.486 0.332 0.639
19 Tanzania Ligi Kuu Bara 2016/2017 239 16 2.435 0.480 0.294 0.665
20 Colombia Liga Aguila 2017 400 20 2.635 0.465 0.328 0.603
... ... ... ... ... ... ... ... ...
141 Djibouti Division 1 2016/2017 90 10 4.067 0.045 -0.163 0.252
142 Uruguay Primera Division 2017 240 16 2.775 0.034 -0.120 0.187
143 Gambia GFA League 2016/2017 131 12 1.969 0.032 -0.218 0.281
144 Canada CSL 2017 56 8 4.357 0.025 -0.228 0.277
145 Armenia Premier League 2016/2017 90 6 2.222 0.020 -0.258 0.299
146 Panama LPF 2016/2017 180 10 2.089 0.005 -0.197 0.208
147 Kuwait Premier League 2016/2017 210 15 3.048 -0.000 -0.155 0.155
148 Mauritius Mauritian League 2016/2017 179 10 2.916 -0.001 -0.173 0.170
149 Andorra Primera Divisió 2016/2017 83 8 3.205 -0.005 -0.245 0.236
150 Latvia SynotTip Virslīga 2017 96 8 2.417 -0.006 -0.264 0.251
151 Libya Premier League 2017 83 28 2.337 -0.015 -0.324 0.293
152 Dominican Republic LDF 2017 90 10 2.489 -0.021 -0.280 0.239
153 Cambodia C-League 2017 132 12 3.803 -0.031 -0.205 0.142
154 Paraguay Primera Division 2017 264 12 2.492 -0.033 -0.184 0.119
155 Jordan Premier League 2016/2017 132 12 2.197 -0.034 -0.262 0.194
156 Bahrain Premier League 2016/2017 90 10 2.467 -0.048 -0.310 0.213
157 Pakistan Premier League 2014/2015 132 12 2.258 -0.065 -0.288 0.159
158 Liberia LFA First Division 2016/2017 125 12 2.160 -0.089 -0.323 0.145
159 Somalia Nation Link Telecom Championship 2016/2017 90 10 2.756 -0.153 -0.396 0.090
160 Maldives Dhivehi Premier League 2017 56 8 3.964 -0.370 -0.782 0.042

160 rows × 8 columns

Focusing on the home_advantage_score column, teams in Nigeria by far enjoy the greatest benefit from playing at home (score = 1.195). In other words, home teams scored 3.3 (= $e^{1.195}$) times more goals than their opponents. This isn't new information and can be attributed to a combination of corruption (e.g. bribing referees) and violent fans. In fact, my motivation for this post was to identify more football corruption hotspots. Alas, when it comes to home turf invincibility, it seems Nigeria are the World Cup winners.

Fifteen leagues have a negative home_advantage_score, meaning that visiting teams actually scored more goals than their hosts- though none was statistically significant. By some distance, the Maldives records the most negative score. Luckily, I've twice researched this beautiful archipelago and I'm aware that all matches in the Dhiveli Premier League are played at the national stadium in Malé (much like the Gibraltar Premier League). So it would make sense that there's no particular advantage gained by the home team. Libya is another interesting example. Owing to security issues, all matches in the Libyan Premier League are played in neutral venues with no spectators present. Quite fittingly, it returned a home advantage score just off zero. Generally speaking, the leagues with near zero home advantage come from small countries (minimal inconvenience for travelling teams) with a small number of teams and they tend to share stadiums.

If you sort the avg_goals column, you'll see that Bolivia is the place to be for goals (average = 4.47). But rather than sifting through that table or explaining the results with words, the most intuitive way to illustrate this type of data is with a map of world. This might also help to clarify whether there's any geographical influence on the home advantage effect. Again, I won't go into the details (an appendix can be found in the Jupyter notebook), but I built a map using the JavaScript library, D3. And by built I mean I adapted the code from this post and this post. Though a little outdated now, I found this post quite useful too. Finally, I think this post shows off quite well what you can do with maps using D3.

And here it is! The country colour represents its home_advantage_score. You can zoom in and out and hover over a country to reveal a nice informative overlay; use the radio buttons to switch between home advantage and goals scored. I recommend viewing it on desktop (mobile's a bit jumpy) and on Chrome (sometimes have security issues with Firefox).

It's not scientifically rigorous (not in academia any more, baby!), but there's evidence for some geographical trends. For example, it appears that home advantage is stronger in Africa and South America compared to Western and Central Europe, with the unstable warzones of Libya, Somalia and Paraguay (?) being notable exceptions. As for average goals, Europe boasts stonger colours compared to Africa, though South East Asia seems to be the global hotspot for goals. North America is also quite dark, but you can debate whether Canada should be coloured grey, as the best Canadian teams belong to the American soccer system.

Conclusion

Using a previously described model and some JavaScript, this post explored the so called home advantage in football leagues all over the world (including Réunion). I don't think it uncovered anything particularly amazing: different leagues have different properties and don't bet on the away team in the Nigerian league. You can play around with the Python code here. Thanks for reading!

Appendix

This section is intended more than anything else as a reminder to myself if I ever want to build a map with D3 again.

We need to write the country league data to a csv that will then be loaded into the d3 map. To improve readability, the values are rounded to 3 decimal places. The country outlines in the map (see below) will be coloured according to their average goals or home advantage score. Rather than matching on country name (which can be fickle- is it Democratic Republic of Congo or DR Congo or even Congo-Kinshasa?), we'll append a column for the country ISO 3166-1 alpha-3 code. I'd like to say I scraped some page here, but it was mostly a manual job. After creating some new columns for the column ranking, the file is written to the local directory (alternatively, you can view it here).


In [227]:
home_advantage_country.assign(avg_goals= pd.Series.round(home_advantage_country['avg_goals'], 3),
                             home_advantage_score= pd.Series.round(home_advantage_country['home_advantage_score'], 3),
                             left_tail= pd.Series.round(home_advantage_country['left_tail'], 3),
                             right_tail= pd.Series.round(home_advantage_country['right_tail'], 3)).merge(
    pd.read_csv("https://raw.githubusercontent.com/dashee87/blogScripts/master/files/league_links.csv", 
                usecols = ['country', 'countryCode'], encoding='latin-1'), 
    on="country", how="inner").reset_index().rename(columns={"index": "home_adv_rank"}).sort_values(
    'avg_goals', ascending=False).reset_index(drop=True).reset_index().rename(columns={"index": "avg_goals_rank"}).to_csv(
    "home_advantage.csv", encoding='utf-8', index=False)

Shapefiles and TopoJSON

Generating the world map required a little bit of command line, python and a whole lot of JavaScript (specifically D3). The command line was used to convert the shapefiles into geojson files (see ogr2ogr and finally into the topojson format. The main reason for the last step is that it drastically reduces the file size, which should improve its onsite loading (though it could also affect the quality of the map).

My particular map was complicated by the fact that some sovereign states are composed of several countries that organise their own national competitions. If that sounds weird, think of the United Kingdom. It's a member of the UN and a sovereign state in its own right (despite what Brexiteers may say). But there's no UK (or British) Premier League; there's the English/Welsh/Scottish/Northern Irish Premier League/Premiership. Similarly, Reunion is part of France but has its own football league. Then again, the Basque country is recognised as a nation within Spain, but has no internationally recognised national league. In summary, it's complicated.

Political realities aside, we need to get the geojson file for all of the countries in the world (see all.geojson available here). We must remove the United Kingdom, France and a few others. To reduce the file size, I also removed some country information that wasn't relevant for my purposes (population, GDP, etc.). The geojson files for England, Scotland, Reunion, etc. were a little harder to track down. The shapefile containing those country subdivisions can be downloaded here, which can be converted into geojson files with ogr2ogr. Unfortunately, that file contains various subdivisions that don't correspond to actual football leagues (e.g. Belgium is split into the Flemish and Walloon regions). That means we need append the subdivisions we do want to higher level geojson file, which I did my manipulating the two json files in Python.


In [229]:
import json

all_nations = json.load(open('all.geojson'))
# command line: ogr2ogr -f GEOJson -where "ADM0_A3 IN ('GBR', 'FRA','NLD', 'USA')" \
# football_subdivisions.json ne_10m_admin_0_map_units.shp
subdivisions = json.load(open('football_subdivisions.json'))

In [210]:
all_nations_tidy = {}
all_nations_tidy['type']= all_nations['type']
all_nations_tidy['crs'] = all_nations['crs']
all_nations_tidy['features'] = []

for country_features in all_nations['features']:
    # skip UK, France, US, Netherlands and Antartica
    if (country_features['properties']['ADM0_A3'] in ['GBR', 'FRA', 'USA', 'NLD', 'ATA'] or 
    # skip minor islands and territories with populations less than 1000
    country_features['properties']['POP_EST']<1000) and \
    # don't want to exclude Western Sahara
    country_features['properties']['NAME_LONG'] != 'Western Sahara':
        continue
    if True:
        all_nations_tidy['features'].append({'properties': {'country': country_features['properties']['NAME_LONG'],
                                   'countryCode': country_features['properties']['ADM0_A3']},
                                   'geometry': country_features['geometry']})
    
for subdiv_features in subdivisions['features']:
    all_nations_tidy['features'].append({'properties': {'country': subdiv_features['properties']['NAME_LONG'],
                               'countryCode': subdiv_features['properties']['BRK_A3']},
                               'geometry': subdiv_features['geometry']})

with open('countries.json', 'w') as outfile:
    json.dump(all_nations_tidy, outfile)

We now have a json file containing all the subdivisions and country outlines we want, but it's quite large (>20MB), which you may not want to load on the page. The good news is that we can convert the file to an alternative cartographical format called topojson. It should preserve most of the information (i.e the borders) and reduces the file size significantly by creating efficiencies and removing redundancies (e.g. shared borders). If you've installed topojson, then it's as simple as running this command geo2topo -q 1e4 countries.json>world_topo.json.