NTDS - Elite Football Transfers Analysis

This Notebook presents the main code used to perform our analysis of the players transfer in elite Football. We have scraped data about the elite football leagues in the world, their clubs and players from Transfermarkt.com in this Scraper Notebook.

N. Quinteiro, F. Mourad, G. Abi-Younes


In [1]:
from urllib.request import *
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import scipy.stats as stats
import pylab as pl
import requests
from geopy.geocoders import Nominatim
import json
from pygsp import graphs, filters, plotting
import os
import time
from tempfile import TemporaryFile
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
from matplotlib.ticker import FuncFormatter
from collections import Counter
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
import networkx as nx
from pylab import rcParams
plotting.BACKEND = 'matplotlib'
%matplotlib inline


2018-01-22 18:46:50,836:[ERROR](pygsp.plotting.<module>): Cannot import packages for pyqtgraph: Traceback (most recent call last):
  File "/home/nathan/anaconda3/lib/python3.5/site-packages/pygsp/plotting.py", line 43, in <module>
    import pyqtgraph as qtg
  File "/home/nathan/anaconda3/lib/python3.5/site-packages/pyqtgraph/__init__.py", line 13, in <module>
    from .Qt import QtGui
  File "/home/nathan/anaconda3/lib/python3.5/site-packages/pyqtgraph/Qt.py", line 133, in <module>
    from PyQt4 import QtGui, QtCore, uic
ImportError: cannot import name 'QtGui'


In [2]:
matplotlib.rcParams.update({'font.size': 22})
matplotlib.rcParams.update({'font.family': 'DejaVu Sans'})
matplotlib.rcParams.update({'font.weight': 'normal'})

Visualizing graphs

If you want to visualize the graphs without executing the Notebook, use this nvViewer LINK

Folium

pip install folium

In [3]:
import folium
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap

GMAPS API

Quick summary to install GMAPS API if you need to run the map.

jupyter nbextension enable --py --sys-prefix widgetsnbextension
pip install gmaps
jupyter nbextension enable --py --sys-prefix gmaps

You might need to restart jupyter after installing.

See the installation instructions

DO NOT SHARE THE FILE gmapsKey in the folder ../scraper/data/gmapsKey


In [4]:
with open("scraper/data/gmapsKey", "r") as infile:
    GMAPKEY = json.load(infile)["key"]
    
import gmaps
import gmaps.datasets
gmaps.configure(api_key=GMAPKEY)

Data Scraping

All the data scraping process is done in the Scraper Notebook. The data was parsed and processed to generate the json files in the scrape data folder (scrape/data). It contains every infos required to run this Notebook.

The entire scraped data is heavy (>2GB) and is therefore not present in this repo. If you need hit, either execute the Scraper Notebook, or contact Nathan Quinteiro (nathan.quinteiro@epfl.ch)

Displaying clubs

All the methods used to obtains the clubs geographic location are commented in order to avoid redoing multiple requests. The locations obtained using requests were saved in a json file and are loaded in this Notebook.

Loading clubs data


In [5]:
'''
with open("scraper/data/clubs.json", "r") as in_file:
    clubs = json.load(in_file)
print("There are " + str(len(clubs)) + " clubs")
clubs[0]["href"]
'''


Out[5]:
'\nwith open("scraper/data/clubs.json", "r") as in_file:\n    clubs = json.load(in_file)\nprint("There are " + str(len(clubs)) + " clubs")\nclubs[0]["href"]\n'

In [6]:
# Create a club dictionnary to faster access. Save it as json to reload it faster.
'''
clubDict = {}
for club in clubs:
    club_id = club["href"].split("/")[4]
    if club_id not in clubDict:
        clubDict[club_id] = club
        
with open("scraper/data/dictClubs.json", "w") as out:
    json.dump(clubDict, out)
'''

with open("scraper/data/dictClubs.json", "r") as in_file:
    clubDict = json.load(in_file)

Retrieve clubs coordinates using stadium and GoogleMap requests


In [7]:
'''GOOGLE_MAPS_API_URL = 'http://maps.googleapis.com/maps/api/geocode/json'

# For each club, we use its name and the stadium name to make a GoogleMap query and retrieve the stadium's coordinates
for club_id in clubDict.keys():
    club = clubDict[club_id]
    if "position" not in club:
        
        params = {
            'address': club["stadium"] + " " + club["name"]
        }
        # Do the request and get the response data
        req = requests.get(GOOGLE_MAPS_API_URL, params=params)
        res = req.json()
        # Use the first result
        if len(res['results']) > 0:
            result = res['results'][0]
            club["position"] = (result['geometry']['location']['lat'], result['geometry']['location']['lng'])
'''


Out[7]:
'GOOGLE_MAPS_API_URL = \'http://maps.googleapis.com/maps/api/geocode/json\'\n\n# For each club, we use its name and the stadium name to make a GoogleMap query and retrieve the stadium\'s coordinates\nfor club_id in clubDict.keys():\n    club = clubDict[club_id]\n    if "position" not in club:\n        \n        params = {\n            \'address\': club["stadium"] + " " + club["name"]\n        }\n        # Do the request and get the response data\n        req = requests.get(GOOGLE_MAPS_API_URL, params=params)\n        res = req.json()\n        # Use the first result\n        if len(res[\'results\']) > 0:\n            result = res[\'results\'][0]\n            club["position"] = (result[\'geometry\'][\'location\'][\'lat\'], result[\'geometry\'][\'location\'][\'lng\'])\n'

In [8]:
'''
# For each club, we use its name and the stadium name to make a GoogleMap query and retrieve the stadium's coordinates
for club_id in clubDict.keys():
    club = clubDict[club_id]
    if "position" not in club:
        
        params = {
            'address': club["stadium"]
        }
        # Do the request and get the response data
        req = requests.get(GOOGLE_MAPS_API_URL, params=params)
        res = req.json()
        # Use the first result
        if len(res['results']) > 0:
            result = res['results'][0]
            club["position"] = (result['geometry']['location']['lat'], result['geometry']['location']['lng'])
'''


Out[8]:
'\n# For each club, we use its name and the stadium name to make a GoogleMap query and retrieve the stadium\'s coordinates\nfor club_id in clubDict.keys():\n    club = clubDict[club_id]\n    if "position" not in club:\n        \n        params = {\n            \'address\': club["stadium"]\n        }\n        # Do the request and get the response data\n        req = requests.get(GOOGLE_MAPS_API_URL, params=params)\n        res = req.json()\n        # Use the first result\n        if len(res[\'results\']) > 0:\n            result = res[\'results\'][0]\n            club["position"] = (result[\'geometry\'][\'location\'][\'lat\'], result[\'geometry\'][\'location\'][\'lng\'])\n'

Manual clean up

Check the clubs for which the coordinates couldn't be retrieved


In [9]:
'''
for club_id in clubDict.keys():
    club = clubDict[club_id]
    if "position" not in club:
        print(str(club_id) + ": " + club["name"])
'''


Out[9]:
'\nfor club_id in clubDict.keys():\n    club = clubDict[club_id]\n    if "position" not in club:\n        print(str(club_id) + ": " + club["name"])\n'

Change manually the stadiums that couldn't be found with GoogleMaps. We then rerun the google map coordinate retrieval


In [10]:
'''
clubs[93]["stadium"] = 'S.B.V. Excelsior'
clubs[170]["stadium"] = 'Başakşehir Fatih Terim Stadium'
clubs[208]["stadium"] = 'Lokeren Daknamstadion'
clubs[209]["stadium"] = 'KAS Eupen'
clubs[221]["stadium"] = 'Obolon Arena'
clubs[235]["stadium"] = 'FC Vysočina Jihlava'
clubs[20]["stadium"] = "Juventus Stadium"
clubs[261]["stadium"] = "Estadio León"
clubs[358]["stadium"] = "Estadio León"
'''


Out[10]:
'\nclubs[93]["stadium"] = \'S.B.V. Excelsior\'\nclubs[170]["stadium"] = \'Başakşehir Fatih Terim Stadium\'\nclubs[208]["stadium"] = \'Lokeren Daknamstadion\'\nclubs[209]["stadium"] = \'KAS Eupen\'\nclubs[221]["stadium"] = \'Obolon Arena\'\nclubs[235]["stadium"] = \'FC Vysočina Jihlava\'\nclubs[20]["stadium"] = "Juventus Stadium"\nclubs[261]["stadium"] = "Estadio León"\nclubs[358]["stadium"] = "Estadio León"\n'

In [11]:
'''
# Example to realod a clubs location with modified stadium name
params = {
    'address': clubs[358]["stadium"] + " " + clubs[358]["name"]
}
# Do the request and get the response data
req = requests.get(GOOGLE_MAPS_API_URL, params=params)
res = req.json()
# Use the first result
if len(res['results']) > 0:
    result = res['results'][0]
    clubs[358]["position"] = (result['geometry']['location']['lat'], result['geometry']['location']['lng'])
'''


Out[11]:
'\n# Example to realod a clubs location with modified stadium name\nparams = {\n    \'address\': clubs[358]["stadium"] + " " + clubs[358]["name"]\n}\n# Do the request and get the response data\nreq = requests.get(GOOGLE_MAPS_API_URL, params=params)\nres = req.json()\n# Use the first result\nif len(res[\'results\']) > 0:\n    result = res[\'results\'][0]\n    clubs[358]["position"] = (result[\'geometry\'][\'location\'][\'lat\'], result[\'geometry\'][\'location\'][\'lng\'])\n'

Save/Load Clubs with location data

Save the clubs with data in order to reload them without making requests.


In [12]:
'''
with open("scraper/data/clubs_with_location.json", "w") as out:
    json.dump(clubs, out)
'''
with open("scraper/data/clubs_with_location.json", "r") as in_file:
    clubs = json.load(in_file)

Display clubs on map

Display a map using Folium and add a marker for each of the clubs.

This gives us a quick overview of all the clubs we have info about, and let us check wether the coordinates have been correctly obtained.


In [13]:
m = folium.Map(location=[52, 15], zoom_start=4)
marker_cluster = MarkerCluster([]).add_to(m)
for club in clubs:
    folium.Marker(
        location=club["position"],
        popup=club["name"],
        icon=None,
    ).add_to(marker_cluster)
m


Out[13]:

Loading Players data

We now load the players data scraped, containin all infos about the players (name, nationality, age, current team) and all their transfers (from which club to which clubs, and the amount).


In [14]:
with open("scraper/data/players_ref.json", "r") as in_file:
    players_ref_list = json.load(in_file)

In [15]:
print ("We have {} players".format(len(players_ref_list)))


We have 12075 players

Loading all players data and transfers

All players infos and transfers are loaded from the data scraped, and saved as a panda DataFrame. The DataFrame is saved as a json file to be loaded faster when restarting the Notebook.


In [16]:
playersDF = pd.read_json("scraper/data/players.json")
playersDF.head()


Out[16]:
birthdate current_club href name nationality number player_id position transfers
0 May 11, 1992 Chelsea FC /thibaut-courtois/profil/spieler/108390 Thibaut Courtois Belgium #13 108390 Goalkeeper [{'to': '631', 'player': '108390', 'from': '13...
1 Sep 28, 1981 Chelsea FC /willy-caballero/profil/spieler/19948 Willy Caballero Argentina #1 19948 Goalkeeper [{'to': '631', 'player': '19948', 'from': '281...
10 Feb 8, 1996 Chelsea FC /kenedy/profil/spieler/281404 Kenedy Brazil #16 281404 Defence - Left-Back [{'to': '631', 'player': '281404', 'from': '10...
100 Feb 21, 1993 Everton FC /davy-klaassen/profil/spieler/182932 Davy Klaassen Netherlands #20 182932 Midfield - Attacking Midfield [{'to': '29', 'player': '182932', 'from': '610...
1000 Oct 8, 1986 Chievo Verona /fabrizio-cacciatore/profil/spieler/85605 Fabrizio Cacciatore Italy #29 85605 Defence - Right-Back [{'to': '862', 'player': '85605', 'from': '103...

Data exploration and preprocessing functions

From the example above, one concern one can raise is the fact that the transfer fee is represented in a way that combines digits and characters. We want to cast the fee of transfer from string to int. All currencies are expressed in euros. We have found the below relations:
Mill. is used for millions
Th. is used for thousands
- or ? is used to mark missing entry
Free Transfer is used to mark 0 value
Loan fee:amount is used to mark the value under a loan


In [17]:
missing_fees = 0
# This function will transform the fees of a player from a string to a number so that it can be used as a weight for the edges
def fix_fee(player_fee):
    # missing_fees is to count the number of 
    global missing_fees
    try:
        player_fee = player_fee.replace(":",": ").replace(",",".")
    except:
        pass
    if re.search(r'\d+.\d+', player_fee) is not None:
        fee = float(re.search(r'\d+.\d+', player_fee).group())
    else:
        fee = 0
    
    if fee > 0:
        if "Mill." in player_fee:
            processed_fee = int(fee*1000000)
        elif "Th." in player_fee:
            processed_fee = int(fee*1000)
        else:
            processed_fee = int(fee)
    else:
        missing_fees += 1
        processed_fee = 0
    
    return processed_fee

Build graph of transfers


In [18]:
# We need to reduce the number of nodes to actual number of clubs. However, this will make us lose the indexing
width = len(clubs)
print ("The number of clubs that are valued above 200 Million Euros are {}".format(len(clubs)))


The number of clubs that are valued above 200 Million Euros are 418

Build the graph ranked by value of club

We would like to build a graph that displays the interaction among clubs. An interaction is decided by if a player goes from one club to another (or vice versa). We have reasons to believe that clubs that have higher market values may tend to have more transfers among them, and clubs with smaller market values tend to interact more among each other. For this reason, we will build an adjacency matrix to build a graph sorted by the club's value.


In [19]:
# Our adjacency matrix should be 418x418
W = np.zeros((width,width))

In [20]:
# Create a list of tuple consisting of club id and market_value
club_pairs = []
for club in clubs:
    club_id = int(club['href'].split("/")[-3])
    club_market_value = int(fix_fee(club["market_value"]))
    club_pairs.append(tuple((club_market_value,club_id)))

# Sort list based on market value
club_pairs = sorted(club_pairs, key=lambda tup: tup[0])

club_ids = []

# Create a new list with the sorted club_ids based on their market value
for club in club_pairs:
    club_ids.append(club[1])

In [21]:
# Create a dictionary linking club ids to clubs data
clubs_ids_dict = {}
for club in clubs:
    club_id = int(club['href'].split("/")[-3])
    club["spending"] = 0
    club["earning"] = 0
    clubs_ids_dict[club_id] = club

In [22]:
# Check the correctness of the dictionnaries
print(playersDF.loc[1]["current_club"])

transfer = playersDF.loc[1]["transfers"][0]

from_club = int(transfer["from"])
to_club = int(transfer["to"])

print(clubs_ids_dict[from_club]["name"])
print(clubs_ids_dict[to_club]["name"])


Chelsea FC
Manchester City
Chelsea FC

There are some players that are being transfered from one club to another, with the clubs being small (smaller than our criteria). The total number of clubs found was equal to 7200, however, we are only interested in the big clubs that we found were only 418.
We will add a constraint when adding the node.


In [23]:
# Fill the Adjacency matrix with one (W[i][j] == 1) when transfer from one club to another
# Where the two clubs match our 200 million euros valuation criteria
for idx, player in playersDF.iterrows():
    try:
        if player["transfers"] is not None:
            for transfer in player["transfers"]:
                from_club = int(transfer["from"])
                to_club = int(transfer["to"])
                
                if(from_club in club_ids and to_club in club_ids):
                    # Add link between the two clubs 
                    W[club_ids.index(from_club)][club_ids.index(to_club)] = 1
                    W[club_ids.index(to_club)][club_ids.index(from_club)] = 1
                    
                    #Add amount to the to and from clubs spending and earning respectivelly
                    clubs_ids_dict[to_club]["spending"] += transfer["amount"]
                    clubs_ids_dict[from_club]["earning"] += transfer["amount"]
    except:
        print("problem with " + player["name"] + " " + str(idx))

In [24]:
# Some transfers are within the same club. When a player contract came to an end.
number_of_inner_transfers = 0
for i in range(W.shape[0]):
    if(W[i][i] == 1):
        number_of_inner_transfers += 1
        
print ("There have been {} innertransfers".format(number_of_inner_transfers))


There have been 2 innertransfers

In [25]:
# Set all the transfers within the same club to zero -> no self-loops
# Ignore innertransfers
np.fill_diagonal(W, 0)

In [26]:
# Create a temporary matrix so we wouldn't need to recompute W if we later need to go back to our result
np.save('adj_matrix.npy', W)

Load directly filled matrix


In [27]:
# Load W matrix from pre-saved model
W = np.load('adj_matrix.npy')

In [28]:
# Build graph from G
G = graphs.Graph(W)
print('{} nodes, {} edges'.format(G.N, G.Ne))


418 nodes, 6714 edges

In [29]:
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(8, 8)
plt.hist(G.d)
plt.title('Degree distribution of my graph');



In [30]:
G.compute_laplacian('combinatorial')

fig = matplotlib.pyplot.gcf()
fig.set_size_inches(12, 12)
plt.spy(G.L, markersize=1)

plt.title("Transfers interactions - clubs ordered by wealth")


Out[30]:
Text(0.5,1.05,'Transfers interactions - clubs ordered by wealth')

We can see from the graph above that clubs do interact with each other. However there seems to be a formation of a cluster where interactions are more dense, that is the lowest 100 clubs (from a market value perspective) tend to interact more with each other. As the market value of the club is higher, we notice a lower interaction between clubs. Contrary to how media only covers the story of big football players going from a big club to another big one, we can clearly see that the majority of transfers are done in the smaller ones. This reason might be due to the high valuation of the player that is already in the big clubs. In fact, this finding explains why media might want to cover the transfer of big players, since we can see that these transfers are actually rare events that may lead to higher click through rate.


In [31]:
'''
G.set_coordinates('spring')
plotting.plot_graph(G)
'''


Out[31]:
"\nG.set_coordinates('spring')\nplotting.plot_graph(G)\n"

Geographic analysis of transfers amount

We use Heatmaps to show where the money is spent and earned during the big clubs transfers.

We create three heatmaps, one that represents the spendings, one the earning and one the benefits (earning - spending) of the clubs, and use the clubs location on the map to create our heatmaps.


In [32]:
locData = []
for club in clubs:
    locData.append([club["position"][0], club["position"][1]])
    
spendingData = []
earningData = []
benefitData = []
for club in clubs:
    spendingData.append(club["spending"])
    earningData.append(club["earning"])
    benefit = club["earning"]-club["spending"]
    benefitData.append(max(0, benefit))

In [33]:
maxSpent = max(list(map(lambda x: x, spendingData)))

In [34]:
maxEarned = max(list(map(lambda x: x, earningData)))

In [35]:
maxBenefit = max(list(map(lambda x: x, benefitData)))

Spending Heatmap

We now display a Heatmap showing where are the clubs that spent the most money in transfers overall.


In [36]:
data = np.concatenate((locData, np.expand_dims(spendingData, axis=1)), axis=1)
m = folium.Map(tiles='stamentoner', zoom_start=12)
HeatMap(data.tolist(), radius=10, min_opacity=0, max_val=maxSpent/10000, blur=10).add_to(m)
m


Out[36]:

In [37]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locData, weights=spendingData, max_intensity=maxSpent, point_radius=15))
fig


Earning Heatmap

We now display a Heatmap showing where are the clubs that earned the most money from the transfers overall.


In [38]:
data = np.concatenate((locData, np.expand_dims(earningData, axis=1)), axis=1)
m = folium.Map(tiles='stamentoner', zoom_start=12)
HeatMap(data.tolist(), radius=10, min_opacity=0, max_val=maxEarned/10000, blur=10).add_to(m)
m


Out[38]:

In [39]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locData, weights=earningData, max_intensity=maxEarned, point_radius=15))
fig


Benefit Heatmap

We now display a Heatmap showing where are the clubs that made the biggest benefits on the transfers overall.


In [40]:
data = np.concatenate((locData, np.expand_dims(benefitData, axis=1)), axis=1)
m = folium.Map(tiles='stamentoner', zoom_start=12)
HeatMap(data.tolist(), radius=10, min_opacity=0, max_val=maxBenefit/10000, blur=10).add_to(m)
m


Out[40]:

In [41]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locData, weights=benefitData, max_intensity=maxBenefit, point_radius=15))
fig


Compute eigenvalues and eigenvectors


In [42]:
# Build graph using networkX and adjacency matrix W
G=nx.from_numpy_matrix(W)
laplacian = nx.normalized_laplacian_matrix(G)

In [43]:
eigenvalues, eigenvectors = scipy.sparse.linalg.eigs(laplacian, k = 10, which = 'SM')
print (eigenvalues)
print (eigenvectors.shape)


[  5.50319772e-17+0.j   0.00000000e+00+0.j   2.26177619e-01+0.j
   2.54249921e-01+0.j   2.82996063e-01+0.j   2.30452225e-16+0.j
  -1.34781882e-18+0.j   2.82249995e-17+0.j  -1.15552658e-16+0.j
   0.00000000e+00+0.j]
(418, 10)

In [44]:
plt.plot(eigenvalues, '.-', markersize=15);


/home/nathan/anaconda3/lib/python3.5/site-packages/numpy/core/numeric.py:531: ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)

Build graph based on geographic locations

We have reasons to believe that players in a certain country, will tend to transfer to another club in the same country. For this reason, we would like to build our adjacency matrix with the clubs in the same country be next to each other in the indexing.


In [45]:
# Our adjacency matrix should be 418x418
W = np.zeros((width,width))
W_weighted = np.zeros((width,width))

In [46]:
club_ids = []
for club in clubs:
    club_ids.append(int(club['href'].split("/")[-3]))
    
club_ids = sorted(club_ids)

In [47]:
with open("scraper/data/clubs_leagues.json", "r") as infile:
    clubLeagues = json.load(infile)

In [48]:
club_pairs = []
count_exception = 0
i = 0
for club in clubs:
    club_id = int(club['href'].split("/")[-3])
    club_pairs.append(tuple((clubLeagues[str(club_id)],club_id)))

In [49]:
# Sort club pairs based on the league name
club_pairs = sorted(club_pairs, key=lambda tup: tup[0])

club_ids = []
clubs_position = []

# Dictionary of clubs position
clubsPositionDict = {}
for club in clubs:
    club_id = int(club['href'].split("/")[-3])
    clubsPositionDict[club_id] = [club["position"][1], club["position"][0]]


# club_ids are now sorted based on the country
for club in club_pairs:
    club_ids.append(club[1])
    clubs_position.append(clubsPositionDict[club[1]])

In [50]:
# Fill the Adjacency matrix with one when transfer from one club to another
# Where the two clubs match our 200 million euros valuation criteria
for idx, player in playersDF.iterrows():
    try:
        for transfer in player["transfers"]:
            from_club = int(transfer["from"])
            to_club = int(transfer["to"])
            if(from_club in club_ids and to_club in club_ids and to_club != from_club):
                W[club_ids.index(from_club)][club_ids.index(to_club)] = 1
                W[club_ids.index(to_club)][club_ids.index(from_club)] = 1
                W_weighted[club_ids.index(from_club)][club_ids.index(to_club)] += transfer["amount"]
                W_weighted[club_ids.index(to_club)][club_ids.index(from_club)] += transfer["amount"]
    except:
        pass

In [51]:
# Set all the transfers within the same club to zero -> no self-loops
np.fill_diagonal(W, 0)

In [52]:
# Create a temporary matrix so we wouldn't need to recompute W if we later need to go back to our result
np.save('adj_matrix_sorted_country.npy', W)

In [53]:
# Load W matrix from pre-saved model
W = np.load('adj_matrix_sorted_country.npy')

In [54]:
# Build graph from G
G = graphs.Graph(W)
print('{} nodes, {} edges'.format(G.N, G.Ne))


418 nodes, 6714 edges

In [55]:
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(12, 12)

G.compute_laplacian('combinatorial')

plt.spy(G.L, markersize=1)
plt.title("Transfers interactions - clubs ordered by leagues")


Out[55]:
Text(0.5,1.05,'Transfers interactions - clubs ordered by leagues')

We notice a lot of transfer between clubs in ~25-50 and ~360-280. After checking, we realize these are clubs of first and seconde Brazilian division, which makes sense.


In [56]:
print("Lot of transfers between : " + club_pairs[30][0] + " and " + club_pairs[370][0])


Lot of transfers between : Campeonato Brasileiro Série A and Serie B

Seeing transfers amount on the map

We now display the graph using transfers amount as edges color and clubs location for the nodes. The darker edges indicates huge transfer between clubs.


In [57]:
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(14, 8)

g1 = nx.from_numpy_matrix(W_weighted)

edges,weights = zip(*nx.get_edge_attributes(g1,'weight').items())
maxWeight = max(weights)
weights = list(map(lambda x: x/(maxWeight+1), weights))

nx.draw(g1, clubs_position, node_color='#5566FF', node_size=8, edge_color=weights, width=0.1, edge_cmap=plt.cm.plasma)
plt.title("Weighted and located world transfers")
plt.show()


We observe that the transfers between the South American (Bottom left) and European (Top center) clubs represents the biggest part of the intercontinental transfer market

Data exploration


In [58]:
# The average transfers of a player within these clubs
# Compute the number of interactions (Number of edges/2) over the number of players
print ("The average transfer of a player within these clubs is {}".format(G.Ne/(2*len(players_ref_list))))


The average transfer of a player within these clubs is 0.2780124223602484

In [59]:
player_nationalities = []
birthdays_month = []
counter = 0

for idx, player in playersDF.iterrows():
    try:
        player_nationalities.append(player['nationality'])
        bday = player["birthdate"].replace(",","").strip()
        datetime_object = datetime.strptime(bday, '%b %d %Y')
        birthdays_month.append(datetime_object.month)
    except:
        player_nationalities.append("N/A")
        # Mark not available birthday month by 13
        birthdays_month.append("13")

Players nationalities


In [60]:
# Sort nationalities by most common
nationalities_counter = Counter(player_nationalities).most_common(20)
index = []
iterator = 1
nationalities = []
nationalities_count = []
for nationality in nationalities_counter:
    index.append(iterator)
    iterator += 1
    nationalities.append(nationality[0])
    nationalities_count.append(nationality[1])

In [61]:
rcParams['figure.figsize'] = 30, 10
plt.bar(index, nationalities_count, align='center')
plt.xticks(index, nationalities,rotation=90)
plt.show()


Clubs per leagues


In [62]:
# Visualize how many clubs each leagues has
clubs_country = []
for club in club_pairs:
    clubs_country.append(club[0])

In [63]:
clubs_country = Counter(clubs_country).most_common()

index = []
iterator = 1
countries = []
country_count = []

for country in clubs_country:
    index.append(iterator)
    iterator += 1
    countries.append(country[0])
    country_count.append(country[1])

In [64]:
rcParams['figure.figsize'] = 30, 10
plt.bar(index, country_count, align='center')
plt.xticks(index, countries,rotation=90)
plt.show()


From the two graphs before: The one showing the distribution of nationalities of football players and the one showing the number of clubs per country. We notice that the distribution is similar, i.e. Brazilian leagues (Campeonato Brasileiro Serie A and Serie B) that has the highest number of football players also has the highest numbers of clubs. This implication reinforce our assumption that the transfers might be dependent on geographical locations that has been shown in previous graph.

Player birthday distribution


In [65]:
birthdays_month = list(map(int, birthdays_month))
player_birthdays_month = Counter(birthdays_month).most_common()

player_birthdays_month = sorted(player_birthdays_month)

index = []
iterator = 1
month_count = []

for country in player_birthdays_month:
    index.append(iterator)
    iterator += 1
    month_count.append(country[1])

In [66]:
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "N/A"]

In [67]:
rcParams['figure.figsize'] = 30, 10
plt.bar(index, month_count, align='center')
plt.xticks(index, months,rotation=90)
plt.show()


If we compare it to a standard birth distribution in the month (EU 27 nations birth distribution by month 2000 to 2010), we observe that the footballer birthdate distribution is clearly skewed.

According to the book "Outliers" by Malcolm Gladwell, the distribution of professional athletes should be skewed. It is linked with the fact that when children are less than 10 years old, a month's difference in age makes a different from a performance point of view. The players who were born in the early phase of the year had a physical advantage over the other kids and had higher chances of being selected into clubs (usually the player selection is at a young age, less than or equal to 10 years old). This selection would cause a player to train more than the rest and hence higher his chances of becoming a professional atheles (hence professional football player). The arguments given by Malcolm Gladwell are shown to hold, specially since the distribution of birth is equally distributed with respect to month of birth.

Transfers amount over time

We now take a look at how much money those transfers represents over the year in order to get an idea of how much the spending increased over the last 5 years in the elite leagues.

This will give us an idea of the transfers increase but since we have only the data of the currently active players, we might miss some transfers of players not currently active in those leagues.


In [68]:
years = np.linspace(2013, 2017, 5)
amounts = np.zeros(5)

for idx, player in playersDF.iterrows():
    if player["transfers"] is not None:
        for transfer in player["transfers"]:
            year = int(transfer["date"].split(" ")[-1])
            if year in years:
                amounts[year - 2013] += transfer["amount"]

In [69]:
def billions(x, pos):    
    return '%1.1fB€' % (x*1e-9)

formatter = FuncFormatter(billions)

fig, ax = plt.subplots()
ax.yaxis.set_major_formatter(formatter)
plt.bar(years, amounts)
plt.show()


We observe as expected a constant and significant increase of the transfer total value over the last 5 years.

Conclusion

We have examined clubs that are valued above 200 million euros. Three important findings were present:

  1. Clubs with lower market value tend to interact more with each other. However, Clubs with higher market value have less interaction/transfers between each other. This may be due to the fact that a player has a high value that wouldn't explain the economics of buying him out.
  2. European top clubs tends to spend more money and not make benefit on the transfer market.
  3. Interactions are clustered by geographics. That is, clubs in the same country tend to interact more with each other.
  4. The month distribution of players's birthdays is actually skewed and there are more players that are born at the beginning of the year than at the end. This is explained by Malcolm Gladwell in his book "Outliers", where he states that individuals born at the beginning of the year have an innate advantage during the player selection at a young age (10 years old)
  5. According to our data, and as expected, the transfers amount in the elite football has strongly increased over the last 5 years, from 5.5 Billions euros in 2013 to 10.5 Billions euros in 2017