This Notebook presents the main code used to perform our analysis of the players transfer in elite Football. We have scraped data about the elite football leagues in the world, their clubs and players from Transfermarkt.com in this Scraper Notebook.
N. Quinteiro, F. Mourad, G. Abi-Younes
In [1]:
from urllib.request import *
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import scipy.stats as stats
import pylab as pl
import requests
from geopy.geocoders import Nominatim
import json
from pygsp import graphs, filters, plotting
import os
import time
from tempfile import TemporaryFile
from datetime import datetime
import matplotlib
import matplotlib.pyplot as plt
import networkx as nx
from matplotlib.ticker import FuncFormatter
from collections import Counter
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
import networkx as nx
from pylab import rcParams
plotting.BACKEND = 'matplotlib'
%matplotlib inline
In [2]:
matplotlib.rcParams.update({'font.size': 22})
matplotlib.rcParams.update({'font.family': 'DejaVu Sans'})
matplotlib.rcParams.update({'font.weight': 'normal'})
If you want to visualize the graphs without executing the Notebook, use this nvViewer LINK
In [3]:
import folium
from folium.plugins import MarkerCluster
from folium.plugins import HeatMap
Quick summary to install GMAPS API if you need to run the map.
jupyter nbextension enable --py --sys-prefix widgetsnbextension
pip install gmaps
jupyter nbextension enable --py --sys-prefix gmaps
You might need to restart jupyter after installing.
See the installation instructions
DO NOT SHARE THE FILE gmapsKey in the folder ../scraper/data/gmapsKey
In [4]:
with open("scraper/data/gmapsKey", "r") as infile:
GMAPKEY = json.load(infile)["key"]
import gmaps
import gmaps.datasets
gmaps.configure(api_key=GMAPKEY)
All the data scraping process is done in the Scraper Notebook. The data was parsed and processed to generate the json files in the scrape data folder (scrape/data). It contains every infos required to run this Notebook.
The entire scraped data is heavy (>2GB) and is therefore not present in this repo. If you need hit, either execute the Scraper Notebook, or contact Nathan Quinteiro (nathan.quinteiro@epfl.ch)
In [5]:
'''
with open("scraper/data/clubs.json", "r") as in_file:
clubs = json.load(in_file)
print("There are " + str(len(clubs)) + " clubs")
clubs[0]["href"]
'''
Out[5]:
In [6]:
# Create a club dictionnary to faster access. Save it as json to reload it faster.
'''
clubDict = {}
for club in clubs:
club_id = club["href"].split("/")[4]
if club_id not in clubDict:
clubDict[club_id] = club
with open("scraper/data/dictClubs.json", "w") as out:
json.dump(clubDict, out)
'''
with open("scraper/data/dictClubs.json", "r") as in_file:
clubDict = json.load(in_file)
In [7]:
'''GOOGLE_MAPS_API_URL = 'http://maps.googleapis.com/maps/api/geocode/json'
# For each club, we use its name and the stadium name to make a GoogleMap query and retrieve the stadium's coordinates
for club_id in clubDict.keys():
club = clubDict[club_id]
if "position" not in club:
params = {
'address': club["stadium"] + " " + club["name"]
}
# Do the request and get the response data
req = requests.get(GOOGLE_MAPS_API_URL, params=params)
res = req.json()
# Use the first result
if len(res['results']) > 0:
result = res['results'][0]
club["position"] = (result['geometry']['location']['lat'], result['geometry']['location']['lng'])
'''
Out[7]:
In [8]:
'''
# For each club, we use its name and the stadium name to make a GoogleMap query and retrieve the stadium's coordinates
for club_id in clubDict.keys():
club = clubDict[club_id]
if "position" not in club:
params = {
'address': club["stadium"]
}
# Do the request and get the response data
req = requests.get(GOOGLE_MAPS_API_URL, params=params)
res = req.json()
# Use the first result
if len(res['results']) > 0:
result = res['results'][0]
club["position"] = (result['geometry']['location']['lat'], result['geometry']['location']['lng'])
'''
Out[8]:
In [9]:
'''
for club_id in clubDict.keys():
club = clubDict[club_id]
if "position" not in club:
print(str(club_id) + ": " + club["name"])
'''
Out[9]:
Change manually the stadiums that couldn't be found with GoogleMaps. We then rerun the google map coordinate retrieval
In [10]:
'''
clubs[93]["stadium"] = 'S.B.V. Excelsior'
clubs[170]["stadium"] = 'Başakşehir Fatih Terim Stadium'
clubs[208]["stadium"] = 'Lokeren Daknamstadion'
clubs[209]["stadium"] = 'KAS Eupen'
clubs[221]["stadium"] = 'Obolon Arena'
clubs[235]["stadium"] = 'FC Vysočina Jihlava'
clubs[20]["stadium"] = "Juventus Stadium"
clubs[261]["stadium"] = "Estadio León"
clubs[358]["stadium"] = "Estadio León"
'''
Out[10]:
In [11]:
'''
# Example to realod a clubs location with modified stadium name
params = {
'address': clubs[358]["stadium"] + " " + clubs[358]["name"]
}
# Do the request and get the response data
req = requests.get(GOOGLE_MAPS_API_URL, params=params)
res = req.json()
# Use the first result
if len(res['results']) > 0:
result = res['results'][0]
clubs[358]["position"] = (result['geometry']['location']['lat'], result['geometry']['location']['lng'])
'''
Out[11]:
In [12]:
'''
with open("scraper/data/clubs_with_location.json", "w") as out:
json.dump(clubs, out)
'''
with open("scraper/data/clubs_with_location.json", "r") as in_file:
clubs = json.load(in_file)
In [13]:
m = folium.Map(location=[52, 15], zoom_start=4)
marker_cluster = MarkerCluster([]).add_to(m)
for club in clubs:
folium.Marker(
location=club["position"],
popup=club["name"],
icon=None,
).add_to(marker_cluster)
m
Out[13]:
In [14]:
with open("scraper/data/players_ref.json", "r") as in_file:
players_ref_list = json.load(in_file)
In [15]:
print ("We have {} players".format(len(players_ref_list)))
In [16]:
playersDF = pd.read_json("scraper/data/players.json")
playersDF.head()
Out[16]:
From the example above, one concern one can raise is the fact that the transfer fee is represented in a way that combines digits and characters. We want to cast the fee of transfer from string to int. All currencies are expressed in euros. We have found the below relations:
Mill. is used for millions
Th. is used for thousands
- or ? is used to mark missing entry
Free Transfer is used to mark 0 value
Loan fee:amount is used to mark the value under a loan
In [17]:
missing_fees = 0
# This function will transform the fees of a player from a string to a number so that it can be used as a weight for the edges
def fix_fee(player_fee):
# missing_fees is to count the number of
global missing_fees
try:
player_fee = player_fee.replace(":",": ").replace(",",".")
except:
pass
if re.search(r'\d+.\d+', player_fee) is not None:
fee = float(re.search(r'\d+.\d+', player_fee).group())
else:
fee = 0
if fee > 0:
if "Mill." in player_fee:
processed_fee = int(fee*1000000)
elif "Th." in player_fee:
processed_fee = int(fee*1000)
else:
processed_fee = int(fee)
else:
missing_fees += 1
processed_fee = 0
return processed_fee
In [18]:
# We need to reduce the number of nodes to actual number of clubs. However, this will make us lose the indexing
width = len(clubs)
print ("The number of clubs that are valued above 200 Million Euros are {}".format(len(clubs)))
We would like to build a graph that displays the interaction among clubs. An interaction is decided by if a player goes from one club to another (or vice versa). We have reasons to believe that clubs that have higher market values may tend to have more transfers among them, and clubs with smaller market values tend to interact more among each other. For this reason, we will build an adjacency matrix to build a graph sorted by the club's value.
In [19]:
# Our adjacency matrix should be 418x418
W = np.zeros((width,width))
In [20]:
# Create a list of tuple consisting of club id and market_value
club_pairs = []
for club in clubs:
club_id = int(club['href'].split("/")[-3])
club_market_value = int(fix_fee(club["market_value"]))
club_pairs.append(tuple((club_market_value,club_id)))
# Sort list based on market value
club_pairs = sorted(club_pairs, key=lambda tup: tup[0])
club_ids = []
# Create a new list with the sorted club_ids based on their market value
for club in club_pairs:
club_ids.append(club[1])
In [21]:
# Create a dictionary linking club ids to clubs data
clubs_ids_dict = {}
for club in clubs:
club_id = int(club['href'].split("/")[-3])
club["spending"] = 0
club["earning"] = 0
clubs_ids_dict[club_id] = club
In [22]:
# Check the correctness of the dictionnaries
print(playersDF.loc[1]["current_club"])
transfer = playersDF.loc[1]["transfers"][0]
from_club = int(transfer["from"])
to_club = int(transfer["to"])
print(clubs_ids_dict[from_club]["name"])
print(clubs_ids_dict[to_club]["name"])
There are some players that are being transfered from one club to another, with the clubs being small (smaller than our criteria). The total number of clubs found was equal to 7200, however, we are only interested in the big clubs that we found were only 418.
We will add a constraint when adding the node.
In [23]:
# Fill the Adjacency matrix with one (W[i][j] == 1) when transfer from one club to another
# Where the two clubs match our 200 million euros valuation criteria
for idx, player in playersDF.iterrows():
try:
if player["transfers"] is not None:
for transfer in player["transfers"]:
from_club = int(transfer["from"])
to_club = int(transfer["to"])
if(from_club in club_ids and to_club in club_ids):
# Add link between the two clubs
W[club_ids.index(from_club)][club_ids.index(to_club)] = 1
W[club_ids.index(to_club)][club_ids.index(from_club)] = 1
#Add amount to the to and from clubs spending and earning respectivelly
clubs_ids_dict[to_club]["spending"] += transfer["amount"]
clubs_ids_dict[from_club]["earning"] += transfer["amount"]
except:
print("problem with " + player["name"] + " " + str(idx))
In [24]:
# Some transfers are within the same club. When a player contract came to an end.
number_of_inner_transfers = 0
for i in range(W.shape[0]):
if(W[i][i] == 1):
number_of_inner_transfers += 1
print ("There have been {} innertransfers".format(number_of_inner_transfers))
In [25]:
# Set all the transfers within the same club to zero -> no self-loops
# Ignore innertransfers
np.fill_diagonal(W, 0)
In [26]:
# Create a temporary matrix so we wouldn't need to recompute W if we later need to go back to our result
np.save('adj_matrix.npy', W)
In [27]:
# Load W matrix from pre-saved model
W = np.load('adj_matrix.npy')
In [28]:
# Build graph from G
G = graphs.Graph(W)
print('{} nodes, {} edges'.format(G.N, G.Ne))
In [29]:
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(8, 8)
plt.hist(G.d)
plt.title('Degree distribution of my graph');
In [30]:
G.compute_laplacian('combinatorial')
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(12, 12)
plt.spy(G.L, markersize=1)
plt.title("Transfers interactions - clubs ordered by wealth")
Out[30]:
We can see from the graph above that clubs do interact with each other. However there seems to be a formation of a cluster where interactions are more dense, that is the lowest 100 clubs (from a market value perspective) tend to interact more with each other. As the market value of the club is higher, we notice a lower interaction between clubs. Contrary to how media only covers the story of big football players going from a big club to another big one, we can clearly see that the majority of transfers are done in the smaller ones. This reason might be due to the high valuation of the player that is already in the big clubs. In fact, this finding explains why media might want to cover the transfer of big players, since we can see that these transfers are actually rare events that may lead to higher click through rate.
In [31]:
'''
G.set_coordinates('spring')
plotting.plot_graph(G)
'''
Out[31]:
We use Heatmaps to show where the money is spent and earned during the big clubs transfers.
We create three heatmaps, one that represents the spendings, one the earning and one the benefits (earning - spending) of the clubs, and use the clubs location on the map to create our heatmaps.
In [32]:
locData = []
for club in clubs:
locData.append([club["position"][0], club["position"][1]])
spendingData = []
earningData = []
benefitData = []
for club in clubs:
spendingData.append(club["spending"])
earningData.append(club["earning"])
benefit = club["earning"]-club["spending"]
benefitData.append(max(0, benefit))
In [33]:
maxSpent = max(list(map(lambda x: x, spendingData)))
In [34]:
maxEarned = max(list(map(lambda x: x, earningData)))
In [35]:
maxBenefit = max(list(map(lambda x: x, benefitData)))
In [36]:
data = np.concatenate((locData, np.expand_dims(spendingData, axis=1)), axis=1)
m = folium.Map(tiles='stamentoner', zoom_start=12)
HeatMap(data.tolist(), radius=10, min_opacity=0, max_val=maxSpent/10000, blur=10).add_to(m)
m
Out[36]:
In [37]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locData, weights=spendingData, max_intensity=maxSpent, point_radius=15))
fig
In [38]:
data = np.concatenate((locData, np.expand_dims(earningData, axis=1)), axis=1)
m = folium.Map(tiles='stamentoner', zoom_start=12)
HeatMap(data.tolist(), radius=10, min_opacity=0, max_val=maxEarned/10000, blur=10).add_to(m)
m
Out[38]:
In [39]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locData, weights=earningData, max_intensity=maxEarned, point_radius=15))
fig
In [40]:
data = np.concatenate((locData, np.expand_dims(benefitData, axis=1)), axis=1)
m = folium.Map(tiles='stamentoner', zoom_start=12)
HeatMap(data.tolist(), radius=10, min_opacity=0, max_val=maxBenefit/10000, blur=10).add_to(m)
m
Out[40]:
In [41]:
fig = gmaps.figure()
fig.add_layer(gmaps.heatmap_layer(locData, weights=benefitData, max_intensity=maxBenefit, point_radius=15))
fig
In [42]:
# Build graph using networkX and adjacency matrix W
G=nx.from_numpy_matrix(W)
laplacian = nx.normalized_laplacian_matrix(G)
In [43]:
eigenvalues, eigenvectors = scipy.sparse.linalg.eigs(laplacian, k = 10, which = 'SM')
print (eigenvalues)
print (eigenvectors.shape)
In [44]:
plt.plot(eigenvalues, '.-', markersize=15);
We have reasons to believe that players in a certain country, will tend to transfer to another club in the same country. For this reason, we would like to build our adjacency matrix with the clubs in the same country be next to each other in the indexing.
In [45]:
# Our adjacency matrix should be 418x418
W = np.zeros((width,width))
W_weighted = np.zeros((width,width))
In [46]:
club_ids = []
for club in clubs:
club_ids.append(int(club['href'].split("/")[-3]))
club_ids = sorted(club_ids)
In [47]:
with open("scraper/data/clubs_leagues.json", "r") as infile:
clubLeagues = json.load(infile)
In [48]:
club_pairs = []
count_exception = 0
i = 0
for club in clubs:
club_id = int(club['href'].split("/")[-3])
club_pairs.append(tuple((clubLeagues[str(club_id)],club_id)))
In [49]:
# Sort club pairs based on the league name
club_pairs = sorted(club_pairs, key=lambda tup: tup[0])
club_ids = []
clubs_position = []
# Dictionary of clubs position
clubsPositionDict = {}
for club in clubs:
club_id = int(club['href'].split("/")[-3])
clubsPositionDict[club_id] = [club["position"][1], club["position"][0]]
# club_ids are now sorted based on the country
for club in club_pairs:
club_ids.append(club[1])
clubs_position.append(clubsPositionDict[club[1]])
In [50]:
# Fill the Adjacency matrix with one when transfer from one club to another
# Where the two clubs match our 200 million euros valuation criteria
for idx, player in playersDF.iterrows():
try:
for transfer in player["transfers"]:
from_club = int(transfer["from"])
to_club = int(transfer["to"])
if(from_club in club_ids and to_club in club_ids and to_club != from_club):
W[club_ids.index(from_club)][club_ids.index(to_club)] = 1
W[club_ids.index(to_club)][club_ids.index(from_club)] = 1
W_weighted[club_ids.index(from_club)][club_ids.index(to_club)] += transfer["amount"]
W_weighted[club_ids.index(to_club)][club_ids.index(from_club)] += transfer["amount"]
except:
pass
In [51]:
# Set all the transfers within the same club to zero -> no self-loops
np.fill_diagonal(W, 0)
In [52]:
# Create a temporary matrix so we wouldn't need to recompute W if we later need to go back to our result
np.save('adj_matrix_sorted_country.npy', W)
In [53]:
# Load W matrix from pre-saved model
W = np.load('adj_matrix_sorted_country.npy')
In [54]:
# Build graph from G
G = graphs.Graph(W)
print('{} nodes, {} edges'.format(G.N, G.Ne))
In [55]:
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(12, 12)
G.compute_laplacian('combinatorial')
plt.spy(G.L, markersize=1)
plt.title("Transfers interactions - clubs ordered by leagues")
Out[55]:
We notice a lot of transfer between clubs in ~25-50 and ~360-280. After checking, we realize these are clubs of first and seconde Brazilian division, which makes sense.
In [56]:
print("Lot of transfers between : " + club_pairs[30][0] + " and " + club_pairs[370][0])
In [57]:
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(14, 8)
g1 = nx.from_numpy_matrix(W_weighted)
edges,weights = zip(*nx.get_edge_attributes(g1,'weight').items())
maxWeight = max(weights)
weights = list(map(lambda x: x/(maxWeight+1), weights))
nx.draw(g1, clubs_position, node_color='#5566FF', node_size=8, edge_color=weights, width=0.1, edge_cmap=plt.cm.plasma)
plt.title("Weighted and located world transfers")
plt.show()
We observe that the transfers between the South American (Bottom left) and European (Top center) clubs represents the biggest part of the intercontinental transfer market
In [58]:
# The average transfers of a player within these clubs
# Compute the number of interactions (Number of edges/2) over the number of players
print ("The average transfer of a player within these clubs is {}".format(G.Ne/(2*len(players_ref_list))))
In [59]:
player_nationalities = []
birthdays_month = []
counter = 0
for idx, player in playersDF.iterrows():
try:
player_nationalities.append(player['nationality'])
bday = player["birthdate"].replace(",","").strip()
datetime_object = datetime.strptime(bday, '%b %d %Y')
birthdays_month.append(datetime_object.month)
except:
player_nationalities.append("N/A")
# Mark not available birthday month by 13
birthdays_month.append("13")
In [60]:
# Sort nationalities by most common
nationalities_counter = Counter(player_nationalities).most_common(20)
index = []
iterator = 1
nationalities = []
nationalities_count = []
for nationality in nationalities_counter:
index.append(iterator)
iterator += 1
nationalities.append(nationality[0])
nationalities_count.append(nationality[1])
In [61]:
rcParams['figure.figsize'] = 30, 10
plt.bar(index, nationalities_count, align='center')
plt.xticks(index, nationalities,rotation=90)
plt.show()
In [62]:
# Visualize how many clubs each leagues has
clubs_country = []
for club in club_pairs:
clubs_country.append(club[0])
In [63]:
clubs_country = Counter(clubs_country).most_common()
index = []
iterator = 1
countries = []
country_count = []
for country in clubs_country:
index.append(iterator)
iterator += 1
countries.append(country[0])
country_count.append(country[1])
In [64]:
rcParams['figure.figsize'] = 30, 10
plt.bar(index, country_count, align='center')
plt.xticks(index, countries,rotation=90)
plt.show()
From the two graphs before: The one showing the distribution of nationalities of football players and the one showing the number of clubs per country. We notice that the distribution is similar, i.e. Brazilian leagues (Campeonato Brasileiro Serie A and Serie B) that has the highest number of football players also has the highest numbers of clubs. This implication reinforce our assumption that the transfers might be dependent on geographical locations that has been shown in previous graph.
In [65]:
birthdays_month = list(map(int, birthdays_month))
player_birthdays_month = Counter(birthdays_month).most_common()
player_birthdays_month = sorted(player_birthdays_month)
index = []
iterator = 1
month_count = []
for country in player_birthdays_month:
index.append(iterator)
iterator += 1
month_count.append(country[1])
In [66]:
months = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December", "N/A"]
In [67]:
rcParams['figure.figsize'] = 30, 10
plt.bar(index, month_count, align='center')
plt.xticks(index, months,rotation=90)
plt.show()
If we compare it to a standard birth distribution in the month (EU 27 nations birth distribution by month 2000 to 2010), we observe that the footballer birthdate distribution is clearly skewed.
According to the book "Outliers" by Malcolm Gladwell, the distribution of professional athletes should be skewed. It is linked with the fact that when children are less than 10 years old, a month's difference in age makes a different from a performance point of view. The players who were born in the early phase of the year had a physical advantage over the other kids and had higher chances of being selected into clubs (usually the player selection is at a young age, less than or equal to 10 years old). This selection would cause a player to train more than the rest and hence higher his chances of becoming a professional atheles (hence professional football player). The arguments given by Malcolm Gladwell are shown to hold, specially since the distribution of birth is equally distributed with respect to month of birth.
We now take a look at how much money those transfers represents over the year in order to get an idea of how much the spending increased over the last 5 years in the elite leagues.
This will give us an idea of the transfers increase but since we have only the data of the currently active players, we might miss some transfers of players not currently active in those leagues.
In [68]:
years = np.linspace(2013, 2017, 5)
amounts = np.zeros(5)
for idx, player in playersDF.iterrows():
if player["transfers"] is not None:
for transfer in player["transfers"]:
year = int(transfer["date"].split(" ")[-1])
if year in years:
amounts[year - 2013] += transfer["amount"]
In [69]:
def billions(x, pos):
return '%1.1fB€' % (x*1e-9)
formatter = FuncFormatter(billions)
fig, ax = plt.subplots()
ax.yaxis.set_major_formatter(formatter)
plt.bar(years, amounts)
plt.show()
We observe as expected a constant and significant increase of the transfer total value over the last 5 years.
We have examined clubs that are valued above 200 million euros. Three important findings were present: