This Jupyter Notebook file has been created by George Claireaux for the Data Bootcamp course within NYU Stern.
This file is provided 100% open-source and anyone is welcome to use and edit it as such.
The rest of the document will be an instructional collection, formatting and analysis of various data relating to the "Python Premiership Project" ©.
By collecting a vast amount of data on football fixtures in England since the leagues began in 1880, this project aims to shed new light onto the workings of sports economics.
In [1]:
import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
import matplotlib.pyplot as plt # pyplot module
import datetime as dt # date and time module
import numpy as np # NumPy module
import seaborn as sns # seaborn module
import time
from dateutil.parser import parse
from pandas_datareader import data as web
# Scientific libraries
from numpy import arange,array,ones
from scipy import stats
# plotly imports
from plotly.offline import iplot, iplot_mpl # plotting functions
from plotly import tools
import plotly.graph_objs as go # ditto
import plotly.plotly as py
import plotly # just to print version and init notebook
import cufflinks as cf # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)
# these lines make our graphics show up in the notebook
%matplotlib inline
plotly.offline.init_notebook_mode(connected=True)
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Seaborn version: ', sns.__version__)
print('Plotly version: ', plotly.__version__)
print('Today: ', dt.date.today())
This is where we'll pull in match data from the English leagues dating back to the 1880's.
We'll read these into seperate dataframes, each one containing all the results from one league and year, e.g - League 2, 1997.
These will all be contained within one massive dictionary and accessed via the appropriate key, e.g - [1997 div2].
All original data is available here and here.
In [2]:
# Creating a list of the all the years in the correct format (e.g - 1997-98)
finalYear = 2016 # adjustable for future update of project
seasonList = [] # create empty list
for x in range(1888,finalYear):
seasonList.append(x) # append all years up to finalYear to the list
seasonList = [str(i) for i in seasonList] # convert all years in list to strings
for x in range(len(seasonList)): # converting strings to correct format
if seasonList[x][2:4] != '99':
if seasonList[x][-2] == '0' and seasonList[x][-1] != '9':
seasonList[x] = (seasonList[x]+'-0'+str(int(seasonList[x][2:4])+1)) # catching error prone 0X-0X years
else:
seasonList[x] = (seasonList[x]+'-'+str(int(seasonList[x][2:4])+1)) # majority of seasons
elif seasonList[x][2:4] == '99':
seasonList[x] = (seasonList[x]+'-'+'00') # catching error prone 99-00 years
else:
print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
# remove any unwanted entries (such as wartime years where leagues didn't occur)
seasonList.remove('1915-16')
seasonList.remove('1916-17')
seasonList.remove('1917-18')
seasonList.remove('1918-19')
seasonList.remove('1940-41')
seasonList.remove('1941-42')
seasonList.remove('1942-43')
seasonList.remove('1943-44')
seasonList.remove('1944-45')
seasonList.remove('1945-46')
# creating an empty dictionary to store all season dataframes in
fixtureDict = {}
In [3]:
# Defining some functions for use within the next cell
# creating a function that will finalise the url, read the csv to a DataFrame and add it to the Dict
def finaliseCSV (url,urlEnd,division,year):
finalURL = url + urlEnd
if year > 1999:
df = pd.read_csv(finalURL, usecols=['Date','HomeTeam','AwayTeam','FTHG','FTAG','HTHG','HTAG'], index_col='Date')
else:
df = pd.read_csv(finalURL, index_col='Date')
fixtureDict[str(year)+' '+division] = df
# running into problems with overloading urllib when trying to run this all at once so making it a controllable function
def readDivisions(year,ur):
if year < 1992: # before 1992, division 1 was always the top league
finaliseCSV(ur,'1-division1.csv','div1',year)
if year > 1891: # in 1892, division 2 was introduced
finaliseCSV(ur,'2-division2.csv','div2',year)
if (year > 1919 and year < 1921) or year > 1957: # making sure we only look for division 3 in the right years
finaliseCSV(ur,'3-division3.csv','div3',year)
if year > 1957:
finaliseCSV(ur,'4-division4.csv','div4',year) # division 4 was introduced in 1958
elif year > 1920 and year < 1958: # between these years there were two divisions 3s, north and south
finaliseCSV(ur,'3a-division3n.csv','div3 north',year)
finaliseCSV(ur,'3b-division3s.csv','div3 south',year)
elif year < 2000: # for the rest of the years up to 2000, introduction of the premier league
finaliseCSV(ur,'1-premierleague.csv','div1',year)
finaliseCSV(ur,'2-division1.csv','div2',year)
finaliseCSV(ur,'3-division2.csv','div3',year)
if year > 1992: # 1992 had no fourth division but the rest do
finaliseCSV(ur,'4-division3.csv','div4',year)
else: # for years 2000+ (getting data from different website)
finaliseCSV(ur,'E0.csv','div1',year)
finaliseCSV(ur,'E1.csv','div2',year)
finaliseCSV(ur,'E2.csv','div3',year)
finaliseCSV(ur,'E3.csv','div4',year)
if year > 2004: # introduction of the conference (lowest professional league currently)
finaliseCSV(ur,'EC.csv','div5',year)
In [4]:
# Number of leagues in existence and naming conventions of said leagues has changed over the years.
# So we'll nest our data mining loops within 'if' statements to make sure we get everything correctly (function readDivisions)
# PLEASE NOTE: This cell will take a fair amount of time to run (dependent on internet connection) so be warned!
# ALSO BE ADVISED: You may get this error when running this cell ---> "<urlopen error [Errno 11002] getaddrinfo failed>"
# If this happens, run the code in the TEN cells below instead (one by one, not at the same time)
# main 'for' loop that calls the appropriate function above for each season in the seasonList
for x in range(len(seasonList)):
if seasonList[x][0:2] == '18' or seasonList[x][0:2] == '19': # getting season data up to 2000 from one site
u = 'https://raw.githubusercontent.com/footballcsv/eng-england/master/'
u = u + seasonList[x][0:3] + '0s/' + seasonList[x] + '/' #building up the start of the url
readDivisions(int(seasonList[x][0:4]),u) # running the function to make sure we choose the right leagues for the year
elif seasonList[x][0:2] == '20': # getting more recent data from different site
u = 'http://www.football-data.co.uk/mmz4281/'
u = u + seasonList[x][2:4] + seasonList[x][5:7] + '/'
readDivisions(int(seasonList[x][0:4]),u)
else:
print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
#print('All Done') # debug line to certify code complete
In [5]:
# Now that we've read all the fixtures into DataFrames, we just need to tidy them up
# Here we're going to run a large 'for' loop that will edit all DataFrames into a single format
def is_date(string): # creating a quick function that checks if a string is a date
try:
parse(string)
return True
except ValueError:
return False
for x in (fixtureDict):
# This block of code is going to convert all dates into the same YYYY-MM-DD format
if type(fixtureDict[x].index[0]) == str and is_date(fixtureDict[x].index[0]):
date = fixtureDict[x].index
date = list(date) # creating a new variable based off the original index and converting it to a list
for y in range(len(fixtureDict[x].index)):
date[y] = pd.to_datetime(fixtureDict[x].index[y]) # assigning each date within the index to an element in date
fixtureDict[x]['Date'] = date
fixtureDict[x] = fixtureDict[x].set_index('Date') # assigning our date list as the new index for each DataFrame
# This block of code is going to convert every DataFrame's columns into the same format
# This format will be (index)Date , Team 1 , Team 2 , FT T1 , FT T2 , - , HT T1 , HT T2
# Team 1 (T1) is home team, Team 2 (T2) is away team, FT = Full-Time Score, HT = Half-Time Score
if fixtureDict[x].columns[0] == 'Round':
del fixtureDict[x]['Round'] # deleting the redundant 'Round' column
# renaming the headers to new format
fixtureDict[x].rename(columns={'HomeTeam': 'Team 1', 'AwayTeam': 'Team 2', 'FTHG': 'FT T1', 'FTAG': 'FT T2'}, inplace=True)
fixtureDict[x].rename(columns={'HTHG': 'HT T1', 'HTAG': 'HT T2'}, inplace=True)
if fixtureDict[x].columns[2] == 'FT': # specifically the old data
fixtureDict[x]['FT T1'] = fixtureDict[x]['FT']
fixtureDict[x]['FT T2'] = fixtureDict[x]['FT']
fixtureDict[x]['HT T1'] = fixtureDict[x]['HT']
fixtureDict[x]['HT T2'] = fixtureDict[x]['HT'] # creating new columns to split up the score line based on team
del fixtureDict[x]['FT']
del fixtureDict[x]['HT'] # deleting redundant columns
fixtureDict[x] = fixtureDict[x][['Team 1','Team 2','FT T1','FT T2','HT T1','HT T2']] # setting correct order
# Here we're reformatting the scorelines to be seperated by team (e.g '2' and '0' rather than '2-0')
if int(x[0:4]) < 2000: # seperating the data so we know which DataFrame has split goals by team
FTT1 = list(fixtureDict[x]['FT T1'])
FTT2 = list(fixtureDict[x]['FT T2'])
HTT1 = list(fixtureDict[x]['HT T1'])
HTT2 = list(fixtureDict[x]['HT T2']) # creating new flexible lists to use from original data
for y in range(len(fixtureDict[x].index)): # 'for' loop to cycle through each element in column
FTT1[y] = int(FTT1[y][0])
FTT2[y] = int(FTT2[y][-1]) # splitting up the scores into seperate lists for each team
if str(HTT1[y]) != 'nan' and len(HTT1[y]) > 1:
HTT1[y] = int(HTT1[y][0])
HTT2[y] = int(HTT2[y][-1]) # splitting up the halftime scores (if they exist)
elif str(HTT1[y]) == 'nan':
HTT1[y] = '-'
HTT2[y] = '-' # changing any NaN values to '-' for concurrent data
fixtureDict[x]['FT T1'] = FTT1
fixtureDict[x]['FT T2'] = FTT2
fixtureDict[x]['HT T1'] = HTT1
fixtureDict[x]['HT T2'] = HTT2 # assigning the new lists into the actual columns
# Deleting index name ('Date')
del fixtureDict[x].index.name
# Fixing dataframes with no dates by resetting the index
if fixtureDict[x].index[0] == '-':
fixtureDict[x].reset_index(inplace=True)
if fixtureDict[x].columns[0] == 'index':
fixtureDict[x].drop('index', axis=1, inplace=True)
#print('All Done') # debug line to certify code complete
In [6]:
# removing erroneous NaN entries in fixture tables
# this is being a pain so put it in a seperate cell
def fixNaN():
for y in fixtureDict[x].index:
if str(y) == 'NaT':
#print(str(y))
num = 0 - len(fixtureDict[x].ix[y])
#print(num)
fixtureDict[x] = fixtureDict[x].ix[:num]
for x in (fixtureDict):
try:
fixNaN()
except:
pass
#print('All Done') # debug line to certify code complete
Over the next few cells, we will be importing data sets to be used in comparison with the football results.
Examples of this data include transfer market spendings, as well as average player age.
The more data of this sort, the better the final project will be, as it will allow more analysis of what drives football success or failure.
In [7]:
# Reading in Premiership (div1) data from 1992 - 2015 that includes:
# No. of players in squad
# Average player age
# No. of foreign players in squad
# Total market value (million € euros)
# Average market value (million € euros)
premDataDict = {}
# Reading in the data from my site where I'm hosting the (relatively) clean csv files
for x in range(0,16): # 2000-2015
if x < 10:
url = 'http://claireaux.co.uk/project/premData0'
url = url + str(x) + '.csv'
df = pd.read_csv(url, decimal=',', index_col='Team')
del df.index.name # deleting index name ('Team')
premDataDict['200'+str(x)] = df
elif x > 9:
url = 'http://claireaux.co.uk/project/premData'
url = url + str(x) + '.csv'
df = pd.read_csv(url, decimal=',', index_col='Team')
del df.index.name # deleting index name ('Team')
premDataDict['20'+str(x)] = df
for x in range(92,100): # 1992-1999
url = 'http://claireaux.co.uk/project/premData'
url = url + str(x) + '.csv'
df = pd.read_csv(url, decimal=',', index_col='Team')
del df.index.name # deleting index name ('Team')
premDataDict['19'+str(x)] = df
# Here we're reformatting the final two columns to give a float rather than a value like "12,15 Mill. ?"
def listerTwister(aList): # function to make this easier
for x in range(len(premDataDict[y].index)): # 'for' loop to cycle through each element in column
if 'Th' in str(aList[x]):
aList[x] = '0.' + str(aList[x][:4])
aList[x] = float(aList[x])
else:
aList[x] = str(aList[x][:-8])
aList[x] = aList[x].replace(',','.')
if len(aList[x]) > 1:
aList[x] = float(aList[x]) # formatting 'Total Market Value' column list
return aList
for y in premDataDict:
myList1 = list(premDataDict[y]['Total market value (mil. euros)']) # creating new flexible lists from original data
myList2 = list(premDataDict[y]['Average Market value (mil. euros)']) # creating new flexible lists from original data
premDataDict[y]['Total market value (mil. euros)'] = listerTwister(myList1) # assigning new lists into the actual columns
premDataDict[y]['Average Market value (mil. euros)'] = listerTwister(myList2) # assigning new lists into the actual columns
#print('All Done') # debug line to certify code complete
In [8]:
# Reading in transfer data from 1980 - 2015 that includes:
# Competition given club played in
# Transfer expenditure (million € euros)
# No. of transfer arrivals
# Transfer income (million € euros)
# No. of transfer departures
# Balance i.e profit/loss calculated as [income - expenditure]
transferDataDict = {}
# Defining a function to make reading in the data easier
def readTransfer(year):
url = 'http://claireaux.co.uk/project/transferData'
if year < 10:
url = url + '0' + str(year) + '.csv'
else:
url = url + str(year) + '.csv' # slight alterations based on year
df = pd.read_csv(url, index_col='Club(s)') # reading csv making the club the index
del df.index.name # deleting index name ('Club(s)')
df.drop('#',axis=1,inplace=True) # dropping the column that was previously an index-like counter
df.rename(columns={'Expenditures': 'Expenditures (mil. euros)', 'Income': 'Income (mil. euros)',
'Balance': 'Balance (mil. euros)'}, inplace=True)
if year < 10:
transferDataDict['200'+str(year)] = df # adding dataframe into Dictionary
elif year < 20:
transferDataDict['20'+str(year)] = df # adding dataframe into Dictionary
else:
transferDataDict['19'+str(year)] = df # adding dataframe into Dictionary
# Reading in the data from my site where I'm hosting the (relatively) clean csv files
for x in range(0,16): # 2000-2015
readTransfer(x)
for x in range(80,100): # 1980-1999
readTransfer(x)
# Here we're reformatting the monetary columns to give a float rather than a value like "12,15 Mill. ?"
def listerTwisterTwo(aList): # function to make this easier
for x in range(len(transferDataDict[y].index)): # 'for' loop to cycle through each element in column
if len(aList[x]) < 2:
aList[x] = float(0)
elif 'Th' in str(aList[x]):
aList[x] = str(aList[x][:-6])
aList[x] = aList[x].replace(',','.')
if aList[x][0] == '-':
aList[x] = '-0.' + aList[x][1:]
else:
aList[x] = '0.' + aList[x]
aList[x] = float(aList[x])
else:
aList[x] = str(aList[x][:-8])
aList[x] = aList[x].replace(',','.')
if len(aList[x]) > 1:
aList[x] = float(aList[x]) # formatting 'Total Market Value' column list
return aList # return the new formatted column in our list variable
for y in transferDataDict:
a = list(transferDataDict[y]['Expenditures (mil. euros)']) # creating new flexible lists from original data
b = list(transferDataDict[y]['Income (mil. euros)']) # creating new flexible lists from original data
c = list(transferDataDict[y]['Balance (mil. euros)']) # creating new flexible lists from original data
transferDataDict[y]['Expenditures (mil. euros)'] = listerTwisterTwo(a) # assigning new lists into the actual columns
transferDataDict[y]['Income (mil. euros)'] = listerTwisterTwo(b) # assigning new lists into the actual columns
transferDataDict[y]['Balance (mil. euros)'] = listerTwisterTwo(c) # assigning new lists into the actual columns
#print('All Done') # debug line to certify code complete
Now that our main portion of data is collected, stored and concurrently formatted; it's time to use it!
From here, we can employ these football results to discover a host of insights into the workings of the sport.
As it stands however, our fixture DataFrames aren't incredibly useful just on their own.
In this next cell, we're going to create final league standings for each season using all of our current data.
In [9]:
# All of the fixture data has been stored by Year/Division in identically formatted DataFrames, within a HUGE dictionary
# Here, we are going to define a function that takes a given Year/Division and works out the final league standings
# The final league table will then be returned as a new DataFrame
# TODO: add the ability to produce the full table IF full = True
# sea = season ('1969-70') , div = division ('div3') , full = boolean (True or False)
# the 'full' bool controls whether extra columns are added that split goals and points into home and away performance
def createFinalTable (sea, div, full):
teamList = [] # creating an empty list that will hold one copy of each team within the league
zeroList = [] # creating empty list that will contain as many zeroes as there are teams
ref = sea[0:4] + ' ' + div # creating a reference based on function inputs
for y in range(len(fixtureDict[ref].index)): # for loop to cycle through each row and build Team List
if fixtureDict[ref]['Team 1'][y] not in teamList: # checking if team is already in list
teamList.append(fixtureDict[ref]['Team 1'][y]) # if not, add it
zeroList.append(0)
# We're going to create 8 Dicts, one for games played, one for number of wins, one for number of draws...
# one for number of losses, one for goals for, one for goals against, one for goal difference...
# and one for Points
# The keys will be the team names, and the values will start at zero
playedDict = dict(zip(teamList, zeroList))
winDict = dict(zip(teamList, zeroList))
drawDict = dict(zip(teamList, zeroList))
lossDict = dict(zip(teamList, zeroList))
forGoalsDict = dict(zip(teamList, zeroList))
badGoalsDict = dict(zip(teamList, zeroList))
diffGoalsDict = dict(zip(teamList, zeroList))
pointsDict = dict(zip(teamList, zeroList))
for y in range(len(fixtureDict[ref].index)): # for loop to cycle through each row and build up the dictionaries
team1 = fixtureDict[ref]['Team 1'][y] # making it easier to reference current Team 1
team2 = fixtureDict[ref]['Team 2'][y] # making it easier to reference current Team 2
goals1 = fixtureDict[ref]['FT T1'][y] # making it easier to reference Team 1 score
goals2 = fixtureDict[ref]['FT T2'][y] # making it easier to reference Team 2 score
playedDict[team1] += 1 # notch up another game played for both teams
playedDict[team2] += 1 # notch up another game played for both teams
forGoalsDict[team1] += goals1 # add goals scored by Team 1 to Team 1's 'goals for'
badGoalsDict[team2] += goals1 # add goals scored by Team 1 to Team 2's 'goals against'
forGoalsDict[team2] += goals2 # add goals scored by Team 2 to Team 2's 'goals for'
badGoalsDict[team1] += goals2 # add goals scored by Team 2 to Team 1's 'goals against'
diffGoalsDict[team1] = forGoalsDict[team1] - badGoalsDict[team1] # defining goal difference each time (for - against)
diffGoalsDict[team2] = forGoalsDict[team2] - badGoalsDict[team2] # defining goal difference each time (for - against)
if goals1 > goals2 : # if Team 1 has more goals than Team 2...
# then that's a win for Team 1, and a loss for Team 2
winDict[team1] += 1 # increasing Team 1's win count
lossDict[team2] += 1 # increasing Team 2's loss count
pointsDict[team1] += 3 # give Team 1 three points for the win
elif goals1 == goals2 : # if Team 1 has equal goals as Team 2...
# then that's a draw for Team 1, and a draw for Team 2
drawDict[team1] += 1 # increasing Team 1's draw count
drawDict[team2] += 1 # increasing Team 2's draw count
pointsDict[team1] += 1 # give Team 1 one point for the draw
pointsDict[team2] += 1 # give Team 2 one point for the draw
elif goals1 < goals2 : # if Team 1 has less goals than Team 2...
# then that's a loss for Team 1, and a win for Team 2
lossDict[team1] += 1 # increasing Team 1's loss count
winDict[team2] += 1 # increasing Team 2's win count
pointsDict[team2] += 3 # give Team 2 three points for the win
elif str(goals1) == 'nan':
print(ref)
print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
# Now that we have dictionaries for every row necessary in the final league table, let's construct the DataFrame
columnList = ('Games Played','Wins','Draws','Losses','Goals For','Goals Against','Goal Difference','Points')
tableList =(playedDict,winDict,drawDict,lossDict,forGoalsDict,badGoalsDict,diffGoalsDict,pointsDict)
tableDict = dict(zip(columnList, tableList))
leagueTable = pd.DataFrame(tableDict)
leagueTable['Team'] = leagueTable.index # change the Team from being index to its own column
leagueTable = leagueTable[['Team','Games Played','Wins','Draws','Losses','Goals For',
'Goals Against','Goal Difference','Points']] # arrange columns in correct order
leagueTable.sort_values(by='Points', ascending=False, inplace=True) # sort by points
newIndex = []
for z in range(len(leagueTable.index)): # only after sorting do we assign numbered index to show place in league table
newIndex.append(z+1) # +1 so we start at 1 instead of 0
leagueTable[''] = newIndex
leagueTable.set_index('', inplace=True) # setting new index in place
del leagueTable.index.name # delete the name
return leagueTable # return our completed dataFrame
In [10]:
# In this cell, we'll run through every fixture list in fixtureDict and create a league table for each
# These will all be stored in a Dictionary and accessed by the same pattern of key (e.g - 1954 div1)
seasonDict = {} # creating the empty dictionary that will hold every single league table
#seasonFullDict = {} # for the full versions of the tables
# 'for' loop to cycle through each table of results and run it through the function in the cell above
for x in (fixtureDict):
season = x[0:4]
division = x[5:]
seasonDict[season + ' ' + division] = createFinalTable(season, division, False)
#print('All Done') # debug line to certify code complete
In [11]:
seasonDict['1936 div2']
Out[11]:
In [12]:
seasonDict['1994 div4']
Out[12]:
Here we'll graphically represent, in the form of pie charts, the 'best' teams since the leagues began.
We'll do this with a simple Formula 1 (think Mario Kart) inspired scoring system.
1st: 25 points, 2nd: 18, 3rd: 15, 4th: 12, 5th: 10, 6th: 8, 7th: 6, 8th: 4, 9th: 2, 10th: 1, 11th+: 0
In [13]:
# This cell is going to run through all of the leagues and build dictionaries containing tallies for every team involved.
# These can then be utilised to build charts displaying how teams have performed since 1880.
scoringDict = {1:25,2:18,3:15,4:12,5:10,6:8,7:6,8:4,9:2,10:1} # creating a dictionary for the scoring
div1TallyDict = {}
div2TallyDict = {}
div3TallyDict = {}
div4TallyDict = {} # empty dictionaries for team:score
for x in seasonDict: # cycle through league tables
if x[-4:] == 'div1': # dealing with Premiership
for y in seasonDict[x].index:
if y < 11: # only go 1st to 10th, 11th and higher yields no score
try: # Use scoringDict to get new score value and add it to current overall score for the team
div1TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
except KeyError: # Initially div1TallyDict has no value, so it throws an error
div1TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
if x[-4:] == 'div2': # dealing with Championship
for y in seasonDict[x].index:
if y < 11: # only go 1st to 10th, 11th and higher yields no score
try: # Use scoringDict to get new score value and add it to current overall score for the team
div2TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
except KeyError: # Initially div2TallyDict has no value, so it throws an error
div2TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
if x[-4:] == 'div3': # dealing with 3rd Division
for y in seasonDict[x].index:
if y < 11: #only go 1st to 10th, 11th and higher yields no score
try: # Use scoringDict to get new score value and add it to current overall score for the team
div3TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
except KeyError: # Initially div3TallyDict has no value, so it throws an error
div3TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
if x[-4:] == 'div4': # dealing with 4th Division
for y in seasonDict[x].index:
if y < 11: #only go 1st to 10th, 11th and higher yields no score
try: # Use scoringDict to get new score value and add it to current overall score for the team
div4TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
except KeyError: # Initially div4TallyDict has no value, so it throws an error
div4TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
# defining a function that can run through a tallyDict and fix for the different team names used
# e.g "Man United" and "Manchester United", "Birmingham" and "Birmingham City"
def fixTallyDict(tallyDict):
lista = [] # empty list to store team names
for x in sorted(tallyDict):
lista.append(x) # build up list with team names in alphabetical order
for x in range(len(lista)):
y=x+1 # using y to distinguish entry AFTER x
if y < len(lista):
if lista[x] == "Leeds": # SPECIAL CASE: Leeds
if lista[y] == 'Leeds United':
tallyDict['Leeds'] += tallyDict['Leeds United']
del tallyDict['Leeds United']
elif lista[y] == 'Leeds City':
tallyDict['Leeds'] += tallyDict['Leeds City']
tallyDict['Leeds'] += tallyDict['Leeds United']
del tallyDict['Leeds City']
del tallyDict['Leeds United']
elif lista[x] in lista[y]: # Main generic checker
tallyDict[lista[x]] += tallyDict[lista[y]]
del tallyDict[lista[y]]
elif lista[x] == "Man City": # SPECIAL CASE: Man City
if lista[y] == 'Manchester City':
tallyDict['Man City'] += tallyDict['Manchester City']
del tallyDict['Manchester City']
elif lista[x] == "Man United": # SPECIAL CASE: Man Utd
tallyDict['Man United'] += tallyDict['Manchester United']
del tallyDict['Manchester United']
elif lista[x] == "Nott'm Forest": # SPECIAL CASE: Nottingham Forest
if lista[y] == 'Nottingham Forest':
tallyDict["Nott'm Forest"] += tallyDict['Nottingham Forest']
del tallyDict['Nottingham Forest']
elif lista[x] == "QPR": # SPECIAL CASE: QPR
tallyDict["QPR"] += tallyDict['Queens Park Rangers']
del tallyDict['Queens Park Rangers']
elif lista[x] == "Sheffield Wednesday": # SPECIAL CASE: Sheff Weds
tallyDict["Sheffield Wednesday"] += tallyDict['Sheffield Weds']
del tallyDict['Sheffield Weds']
elif lista[x] == "Wolverhampton Wanderers": # SPECIAL CASE: Wolves
if lista[y] == 'Wolves':
tallyDict["Wolverhampton Wanderers"] += tallyDict['Wolves']
del tallyDict['Wolves']
return(tallyDict) # return fixed Dictionary
div1TallyDict = fixTallyDict(div1TallyDict)
div2TallyDict = fixTallyDict(div2TallyDict)
div3TallyDict = fixTallyDict(div3TallyDict)
div4TallyDict = fixTallyDict(div4TallyDict) # updating tallyDicts to fix them all with new function
#print('All Done') # debug line to certify code complete
In [14]:
sns.set()
# Using the TallyDicts to construct 'cropped' DataFrames
minScore = 300 # Specify minScore (this is normalised for Div2 and adjusted for the others)
# Create DataFrames using keys as 'Team' and values as 'Greatness Score'
div1TallyFrame = pd.DataFrame({"Team": list(div1TallyDict.keys()), "'Greatness' Score": list(div1TallyDict.values())})
div1TallyFrame = div1TallyFrame[div1TallyFrame["'Greatness' Score"] > minScore*1.5] # Keep only the best teams
div2TallyFrame = pd.DataFrame({"Team": list(div2TallyDict.keys()), "'Greatness' Score": list(div2TallyDict.values())})
div2TallyFrame = div2TallyFrame[div2TallyFrame["'Greatness' Score"] > minScore] # Keep only the best teams
div3TallyFrame = pd.DataFrame({"Team": list(div3TallyDict.keys()), "'Greatness' Score": list(div3TallyDict.values())})
div3TallyFrame = div3TallyFrame[div3TallyFrame["'Greatness' Score"] > minScore*0.46] # Keep only the best teams
div4TallyFrame = pd.DataFrame({"Team": list(div4TallyDict.keys()), "'Greatness' Score": list(div4TallyDict.values())})
div4TallyFrame = div4TallyFrame[div4TallyFrame["'Greatness' Score"] > minScore*0.45] # Keep only the best teams
# Creating the figure to display Pie Charts of the all-time greatest teams
fig = {
'data': [
{'labels': div1TallyFrame['Team'],'values': div1TallyFrame["'Greatness' Score"], 'type': 'pie','name': 'Div 1',
'domain': {'x': [.1, .33], # Domain is a fiddly variable
'y': [1, .34]},"hole": .35, 'hoverinfo':'label+value','textinfo':'label+value',},
{'labels': div2TallyFrame['Team'],'values': div2TallyFrame["'Greatness' Score"],'type': 'pie','name': 'Div 2',
'domain': {'x': [.9, .67], # Domain is a fiddly variable
'y': [1, .34]}, "hole": .35,'hoverinfo':'label+value', 'textinfo':'label+value',},
{'labels': div3TallyFrame['Team'],'values': div3TallyFrame["'Greatness' Score"],'type': 'pie', 'name': 'Div 3',
'domain': {'x': [.3, .15], # Domain is a fiddly variable
'y': [0, .1]},"hole": .45, 'hoverinfo':'label+value','textinfo':'label', },
{'labels': div4TallyFrame['Team'], 'values': div4TallyFrame["'Greatness' Score"],'type': 'pie','name': 'Div 4',
'domain': {'x': [.8, .67], # Domain is a fiddly variable
'y': [0, .1]},"hole": .5, 'hoverinfo':'label+value','textinfo':'label', }
],
# setting annotations to denote which league each chart is
'layout': {'title': "All-Time Team 'Greatness' Ranking by League",'showlegend': False, "annotations": [
{"font": {"size": 20},"showarrow": False,"text": "1st","x": 0.195,"y": 0.72 },
{"font": {"size": 20}, "showarrow": False,"text": "2nd","x": 0.81,"y": 0.72},
{"font": { "size": 20},"showarrow": False,"text": "3rd","x": 0.2,"y": 0.01},
{ "font": {"size": 20}, "showarrow": False,"text": "4th","x": 0.755,"y": 0.01}
]
}
}
iplot(fig) # plot the figure
# TO DO: EDIT THE COLOURS OF THE PIE CHART TO CORRELATE WITH TEAM COLOURS (e.g Liverpool red, Chelsea blue)
Analysing these pie charts above, we can see that Liverpool hold the all-time top spot for the highest division (currently the premiership) with Manchester United in a very close 2nd. The, perhaps surprising, teams to make it are Aston Villa and Sunderland, who have not been hugely successful recently but used to enjoy a great deal of success.
Moving over to Division 2 (or the secondary league, currently the Championship), the scores are a lot lower. This is most likely due to the fact that once you finish near the top (or bottom) of this league you move up or down to another league, and thus it's impossible to finish high in the 2nd Division season after season. With that said, it's clear that Birmingham and Leicester have seen the most success in the second tier, Leicester interestingly enough having won the 2nd division very recently and then shockingly won the Premier league just last season.
Below these, I've also included some smaller pies to show league 3 and 4.
Using our organised data, we can now start to explore relationships and answer some specular questions.
Here, the relationship between transfer spending and league performance will be explored.
Hopefully, these answers will provide insights as yet untold into the economics of football.
In [15]:
CPI = pd.read_csv('http://claireaux.co.uk/project/cpiVALUES.csv') # reading CPI csv to get inflation values
# Function that allows adjusting old euro values for inflation
def adjustInflation(value,yearFrom,yearTo):
indexTo = float(CPI[yearTo])
indexFrom = float(CPI[yearFrom])
change = (((indexTo - indexFrom) / indexFrom) + 1)
newValue = value*change
return(newValue)
# Function that builds a dataframe containing teams as index, league position for the given season...
# ...and various transfer related data
# This also includes some code that accounts for the different naming conventions between data sources
def posExpBuilder(season) :
df = seasonDict[season]
df2 = transferDataDict[season[0:4]] # creating dataframes for specified year to use
leaguePositions = {} # creating empty dict for league position: team
expenditures = {} # creating empty dict for expenditure: team
incomes = {} # creating empty dict for income: team
averageNewPlayerCosts = {} # creating empty dict for average cost of each new player: team
averageSoldPlayerPrices = {} # creating empty dict for average price of each sold player: team
balances = {} # creating empty dict for income - expenditures: team
for x in range(len(df.index)): # for loop to cycle through first dataframe
y=x+1
leaguePositions[df.iloc[x]['Team']] = y # setting league positions dictionary values (e.g Arsenal: 2)
for x in range(len(df2.index)): # cycling through second Dataframe
# creating adjusted variables for inflation
adjustedExpense = adjustInflation((df2.iloc[x]['Expenditures (mil. euros)']),season[0:4],'2016')
adjustedIncome = adjustInflation((df2.iloc[x]['Income (mil. euros)']),season[0:4],'2016')
playerCost = (df2.iloc[x]['Expenditures (mil. euros)']) / (df2.iloc[x]['Arrivals'])
adjustedPlayerCost = adjustInflation(playerCost,season[0:4],'2016')
playerPrice = (df2.iloc[x]['Income (mil. euros)']) / (df2.iloc[x]['Departures'])
adjustedPlayerPrice = adjustInflation(playerPrice,season[0:4],'2016')
adjustedBalance = adjustInflation((df2.iloc[x]['Balance (mil. euros)']),season[0:4],'2016')
# appending dictionaries for calculated adjusted values
expenditures[df2.index[x]] = adjustedExpense
incomes[df2.index[x]] = adjustedIncome
averageNewPlayerCosts[df2.index[x]] = adjustedPlayerCost
averageSoldPlayerPrices[df2.index[x]] = adjustedPlayerPrice
balances[df2.index[x]] = adjustedBalance
# making a list of the new dictionaries
working = (leaguePositions,expenditures,incomes,averageNewPlayerCosts,averageSoldPlayerPrices,balances)
returnFrame = pd.DataFrame.from_records(working) # creating the final frame from this new list
returnFrame = returnFrame.transpose() # transpose to look nicer
names = [] # empty names list to be used below
# for loop to cycle through our new final dataframe accounting for the different conventions from combining different data
for x in range(len(returnFrame.index)):
y = x+1
if y < len(returnFrame.index):
if returnFrame.index[x] in returnFrame.index[y]:
returnFrame[1][x] = returnFrame[1][y] # grabbing the column[1] value for ('TEAM' FC) and putting it into ('TEAM')
returnFrame[2][x] = returnFrame[2][y] # grabbing the column[2] value for ('TEAM' FC) and putting it into ('TEAM')
returnFrame[3][x] = returnFrame[3][y] # grabbing the column[3] value for ('TEAM' FC) and putting it into ('TEAM')
returnFrame[4][x] = returnFrame[4][y] # grabbing the column[4] value for ('TEAM' FC) and putting it into ('TEAM')
returnFrame[5][x] = returnFrame[5][y] # grabbing the column[5] value for ('TEAM' FC) and putting it into ('TEAM')
if str(returnFrame[0][x]) == 'nan':
names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
elif str(returnFrame[1][x]) == 'nan':
names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
elif str(returnFrame[2][x]) == 'nan':
names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
elif str(returnFrame[3][x]) == 'nan':
names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
elif str(returnFrame[4][x]) == 'nan':
names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
elif str(returnFrame[5][x]) == 'nan':
names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
returnFrame.drop(names,inplace=True) # deleting all those teams that were in the list
returnFrame[0] = returnFrame[0].astype(int,raise_on_error=False) # change league standing column dataType to int
returnFrame.rename(columns={0: 'Position', 1: 'Expenditure', 2: 'Income',
3: 'Average Cost of New Player', 4: 'Average Price of Sold Player',
5: 'Balance'}, inplace=True) # renaming columns appropriately
return(returnFrame) # return the dataFrame
In [16]:
# Some general functions for graphing etc.
def roundTo3(x): # small function to round numbers to 3 decimal places
return(float("{0:.3f}".format(x)))
# function that takes given x and y values and calculates pearsonr and a significance value for p
def textMaker(xi, y):
bond = (stats.pearsonr(xi,y))
p = bond[1]
if p < 0.01:
p = '< 1%'
elif p < 0.05:
p = '< 5%'
elif p < 0.1:
p = '< 10%'
else:
p = 'not significant'
return("Pearson's r: " + str(roundTo3(bond[0])) + ', p value:' + p)
# Generate linear line fit using regression
def fitLine(y):
slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)
return(slope*xi+intercept)
positionTransferDict1 = {}
positionTransferDict2 = {} # Creating empty dicts for use
# Top league (Premiership [div 1])
for x in range(1980,2016):
season = str(x) + ' div1'
positionTransferDict1[str(x)] = posExpBuilder(season) # running all data through function in previous cell
# Secondary league (Championship/league 1 [div 2])
for x in range(1980,2016):
season = str(x) + ' div2'
positionTransferDict2[str(x)] = posExpBuilder(season) # running all data through function in previous cell
positionTransferList1 = []
positionTransferList2 = []
resultList = [] # creating empty lists
for x in positionTransferDict1:
positionTransferList1.append(positionTransferDict1[x])
for x in positionTransferDict2:
positionTransferList2.append(positionTransferDict2[x]) # populating lists with dataframes
result = pd.concat(positionTransferList1)
result2 = pd.concat(positionTransferList2) # concatenating all dataframes in each list to make one big dataframe
result['div'] = 'div1'
result2['div'] = 'div2'
resultList = [result,result2]
result3 = pd.concat(resultList) # as yet unused 3rd dataframe that combines both div1 and div2
In [17]:
xi = result['Position'] # using league position as x values
A = array([ xi, ones(22)]) # array for use in stats calcs
y1=result['Expenditure']
y2=result['Income']
# Creating the dataset, and generating the plot
trace1 = go.Scatter(x=xi, y=y1, mode='markers', marker=go.Marker(color='turquoise', opacity=0.5), name='Spending Data')
trace2 = go.Scatter(x=xi, y=fitLine(y1), mode='lines', marker=go.Marker(color='blue', size=100), name='Spending Fit')
trace3 = go.Scatter(x=xi, y=y2, mode='markers', marker=go.Marker(color='violet', opacity=0.5), name='Income Data')
trace4 = go.Scatter(x=xi, y=fitLine(y2), mode='lines', marker=go.Marker(color='red', size=100), name='Income Fit')
# Annotations containg important statistical stuff
pearson1 = go.Annotation(x=.3, y=.8, text = textMaker(xi, y1), showarrow=False, font=go.Font(size=12))
pearson2 = go.Annotation(x=.85, y=.8, text = textMaker(xi, y2), showarrow=False, font=go.Font(size=12))
annotationList = [pearson1,pearson2]
# Create fig with two subplots side by side
fig = tools.make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(
'League performance v.s transfer spending', 'League performance v.s transfer income'))
# Update layout for annotations and titles
fig['layout'].update(annotations=annotationList,title='League performance v.s transfer spending' +
' League performance v.s transfer income .',)
# Add all the data traces to the plots
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)
# Update axis layouts
fig['layout']['xaxis1'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
fig['layout']['yaxis1'].update(zerolinecolor='rgb(210,210,210)', gridcolor='rgb(210,210,210)',
title="Million EUR (inflation adjusted)")
fig['layout']['xaxis2'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
# Plot figure
iplot(fig)
Now that we've visualised in pie charts the greatest teams, it's time to explore how a team can achieve such 'greatness'.
These two graphs explore the correlation between League Performance and Transfer Spending/Income.
For both cases, we see a highly significant (to the 1% level) negative correlation (i.e the less you spend/earn, the worse your final league standing).
Admittedly, the correlation is somewhat weak, especially in the transfer income case, but the regression lines show it definitely exists.
This result may seem like something obvious from pure common sense, but this data spanning 35 years proves it empirically.
It also proves that the correlation may not be quite as strong as one would imagine without the data.
In [18]:
xi = result['Position'] # using league position as x values
A = array([ xi, ones(22)]) # array for use in stats calcs
y1=result['Average Cost of New Player']
y2=result['Average Price of Sold Player']
# Creating the dataset, and generating the plot
trace1 = go.Scatter(x=xi, y=y1, mode='markers', marker=go.Marker(color='turquoise', opacity=0.5), name='Average Cost')
trace2 = go.Scatter(x=xi, y=fitLine(y1), mode='lines', marker=go.Marker(color='blue', size=100), name='Inbound Fit')
trace3 = go.Scatter(x=xi, y=y2, mode='markers', marker=go.Marker(color='violet', opacity=0.5), name='Average Price')
trace4 = go.Scatter(x=xi, y=fitLine(y2), mode='lines', marker=go.Marker(color='red', size=100), name='Outbound Fit')
# Annotations containg important statistical stuff
pearson1 = go.Annotation(x=.3, y=.8, text = textMaker(xi, y1), showarrow=False, font=go.Font(size=12))
pearson2 = go.Annotation(x=.85, y=.8, text = textMaker(xi, y2), showarrow=False, font=go.Font(size=12))
annotationList = [pearson1,pearson2]
# Create fig with two subplots side by side
fig = tools.make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(
'League performance v.s transfer spending', 'League performance v.s transfer income'))
# Update layout for annotations and titles
fig['layout'].update(annotations=annotationList,title='Performance v.s Average Spent per New Player' +
' Performance v.s Average Price per Sold Player .',)
# Add all the data traces to the plots
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)
# Update axis layouts
fig['layout']['xaxis1'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
fig['layout']['yaxis1'].update(zerolinecolor='rgb(210,210,210)', gridcolor='rgb(210,210,210)',
title="Million EUR (inflation adjusted)")
fig['layout']['xaxis2'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
# Plot figure
iplot(fig)
These two graphs are similar to the ones above, except we're comparing on an individual basis.
In each case, the y value depicts the average spent/earned from every incoming/outgoing player.
This is calculated simply by dividing the total expenditure by number of arrivals (or total income by number of departures).
Again, correlation is significant to the 1% level, and we see negative correlation across both.
At a value of -0.31, it is clear that purchasing high value players seems to give the highest payoff seen so far in terms of performance.
The average price received from sold players has only a very small negative correlation and thus I would not say it matters too much.
Based on the data studied, in order for teams to improve their performance, spending large amounts in concentration has been shown to provide the best outcome (i.e purchasing a few high value players > many low value players > spending less).
Ideally, I would have built around 50 figures for comparison of all the values stored within DataFrames, but tackling seaborn and plotly turned out to be more time-consuming than I had imagined and thus I don't believe I have done my level of data collection justice.
I plan to continue to work on this notebook in my spare time going forward, but for now this is my submitted and final project.