The Python Premiership Project

Details

This Jupyter Notebook file has been created by George Claireaux for the Data Bootcamp course within NYU Stern.
This file is provided 100% open-source and anyone is welcome to use and edit it as such.
The rest of the document will be an instructional collection, formatting and analysis of various data relating to the "Python Premiership Project" ©.
By collecting a vast amount of data on football fixtures in England since the leagues began in 1880, this project aims to shed new light onto the workings of sports economics.

Import Packages


In [1]:
import sys                             # system module
import pandas as pd                    # data package
import matplotlib as mpl               # graphics package
import matplotlib.pyplot as plt        # pyplot module
import datetime as dt                  # date and time module
import numpy as np                     # NumPy module
import seaborn as sns                  # seaborn module
import time
from dateutil.parser import parse
from pandas_datareader import data as web

# Scientific libraries
from numpy import arange,array,ones
from scipy import stats

# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
from plotly import tools
import plotly.graph_objs as go               # ditto
import plotly.plotly as py
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

# these lines make our graphics show up in the notebook
%matplotlib inline             
plotly.offline.init_notebook_mode(connected=True)

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Seaborn version: ', sns.__version__)
print('Plotly version: ', plotly.__version__)
print('Today: ', dt.date.today())


/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:878: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
Python version: 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version:  0.19.0
Matplotlib version:  1.5.3
Seaborn version:  0.6.0
Plotly version:  1.12.11
Today:  2016-12-22

Reading in the Bulk of our Data

This is where we'll pull in match data from the English leagues dating back to the 1880's.
We'll read these into seperate dataframes, each one containing all the results from one league and year, e.g - League 2, 1997.
These will all be contained within one massive dictionary and accessed via the appropriate key, e.g - [1997 div2].
All original data is available here and here.

Initial List Creation


In [2]:
# Creating a list of the all the years in the correct format (e.g - 1997-98)

finalYear = 2016 # adjustable for future update of project

seasonList = [] # create empty list
for x in range(1888,finalYear):
    seasonList.append(x) # append all years up to finalYear to the list
seasonList = [str(i) for i in seasonList] # convert all years in list to strings

for x in range(len(seasonList)): # converting strings to correct format
    if seasonList[x][2:4] != '99':
        if seasonList[x][-2] == '0' and seasonList[x][-1] != '9':
            seasonList[x] = (seasonList[x]+'-0'+str(int(seasonList[x][2:4])+1)) # catching error prone 0X-0X years
        else:
            seasonList[x] = (seasonList[x]+'-'+str(int(seasonList[x][2:4])+1)) # majority of seasons
    elif seasonList[x][2:4] == '99':
        seasonList[x] = (seasonList[x]+'-'+'00') # catching error prone 99-00 years
    else: 
        print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
        
# remove any unwanted entries (such as wartime years where leagues didn't occur)
seasonList.remove('1915-16')
seasonList.remove('1916-17')
seasonList.remove('1917-18')
seasonList.remove('1918-19')
seasonList.remove('1940-41')
seasonList.remove('1941-42')
seasonList.remove('1942-43')
seasonList.remove('1943-44')
seasonList.remove('1944-45')
seasonList.remove('1945-46')

# creating an empty dictionary to store all season dataframes in
fixtureDict = {}

Function Definitions


In [3]:
# Defining some functions for use within the next cell

# creating a function that will finalise the url, read the csv to a DataFrame and add it to the Dict
def finaliseCSV (url,urlEnd,division,year):
    finalURL = url + urlEnd
    if year > 1999:
        df = pd.read_csv(finalURL, usecols=['Date','HomeTeam','AwayTeam','FTHG','FTAG','HTHG','HTAG'], index_col='Date')
    else:
        df = pd.read_csv(finalURL, index_col='Date')
    fixtureDict[str(year)+' '+division] = df
    
# running into problems with overloading urllib when trying to run this all at once so making it a controllable function
def readDivisions(year,ur):
    if year < 1992: # before 1992, division 1 was always the top league
        finaliseCSV(ur,'1-division1.csv','div1',year)
        if year > 1891: # in 1892, division 2 was introduced
            finaliseCSV(ur,'2-division2.csv','div2',year)
            if (year > 1919 and year < 1921) or year > 1957: # making sure we only look for division 3 in the right years
                finaliseCSV(ur,'3-division3.csv','div3',year)
                if year > 1957:
                    finaliseCSV(ur,'4-division4.csv','div4',year) # division 4 was introduced in 1958
            elif year > 1920 and year < 1958: # between these years there were two divisions 3s, north and south
                finaliseCSV(ur,'3a-division3n.csv','div3 north',year)
                finaliseCSV(ur,'3b-division3s.csv','div3 south',year)
    elif year < 2000: # for the rest of the years up to 2000, introduction of the premier league
        finaliseCSV(ur,'1-premierleague.csv','div1',year)
        finaliseCSV(ur,'2-division1.csv','div2',year)
        finaliseCSV(ur,'3-division2.csv','div3',year)
        if year > 1992: # 1992 had no fourth division but the rest do
            finaliseCSV(ur,'4-division3.csv','div4',year)
    else: # for years 2000+ (getting data from different website)
        finaliseCSV(ur,'E0.csv','div1',year)
        finaliseCSV(ur,'E1.csv','div2',year)
        finaliseCSV(ur,'E2.csv','div3',year)
        finaliseCSV(ur,'E3.csv','div4',year)
        if year > 2004: # introduction of the conference (lowest professional league currently)
            finaliseCSV(ur,'EC.csv','div5',year)

Major Data Collection


In [4]:
# Number of leagues in existence and naming conventions of said leagues has changed over the years.
# So we'll nest our data mining loops within 'if' statements to make sure we get everything correctly (function readDivisions)
# PLEASE NOTE: This cell will take a fair amount of time to run (dependent on internet connection) so be warned!
# ALSO BE ADVISED: You may get this error when running this cell --->  "<urlopen error [Errno 11002] getaddrinfo failed>"
# If this happens, run the code in the TEN cells below instead (one by one, not at the same time)

# main 'for' loop that calls the appropriate function above for each season in the seasonList
for x in range(len(seasonList)):
    
    if seasonList[x][0:2] == '18' or seasonList[x][0:2] == '19': # getting season data up to 2000 from one site
        u = 'https://raw.githubusercontent.com/footballcsv/eng-england/master/'
        u = u + seasonList[x][0:3] + '0s/' + seasonList[x] + '/' #building up the start of the url
        readDivisions(int(seasonList[x][0:4]),u) # running the function to make sure we choose the right leagues for the year
                    
    elif seasonList[x][0:2] == '20': # getting more recent data from different site
        u = 'http://www.football-data.co.uk/mmz4281/'
        u = u + seasonList[x][2:4] + seasonList[x][5:7] + '/'
        readDivisions(int(seasonList[x][0:4]),u)
        
    else:
        print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
        
#print('All Done') # debug line to certify code complete

DataFrame Formatting


In [5]:
# Now that we've read all the fixtures into DataFrames, we just need to tidy them up
# Here we're going to run a large 'for' loop that will edit all DataFrames into a single format

def is_date(string): # creating a quick function that checks if a string is a date
    try: 
        parse(string)
        return True
    except ValueError:
        return False

for x in (fixtureDict):
    
    # This block of code is going to convert all dates into the same YYYY-MM-DD format
    if type(fixtureDict[x].index[0]) == str and is_date(fixtureDict[x].index[0]):
        date = fixtureDict[x].index 
        date = list(date) # creating a new variable based off the original index and converting it to a list
        for y in range(len(fixtureDict[x].index)):
            date[y] = pd.to_datetime(fixtureDict[x].index[y]) # assigning each date within the index to an element in date
        fixtureDict[x]['Date'] = date
        fixtureDict[x] = fixtureDict[x].set_index('Date') # assigning our date list as the new index for each DataFrame
        
    # This block of code is going to convert every DataFrame's columns into the same format
    # This format will be (index)Date , Team 1 , Team 2 , FT T1 , FT T2 , - , HT T1 , HT T2
    # Team 1 (T1) is home team, Team 2 (T2) is away team, FT = Full-Time Score, HT = Half-Time Score
    if fixtureDict[x].columns[0] == 'Round':
        del fixtureDict[x]['Round'] # deleting the redundant 'Round' column
    # renaming the headers to new format
    fixtureDict[x].rename(columns={'HomeTeam': 'Team 1', 'AwayTeam': 'Team 2', 'FTHG': 'FT T1', 'FTAG': 'FT T2'}, inplace=True)
    fixtureDict[x].rename(columns={'HTHG': 'HT T1', 'HTAG': 'HT T2'}, inplace=True)
    if fixtureDict[x].columns[2] == 'FT': # specifically the old data
        fixtureDict[x]['FT T1'] = fixtureDict[x]['FT']
        fixtureDict[x]['FT T2'] = fixtureDict[x]['FT']
        fixtureDict[x]['HT T1'] = fixtureDict[x]['HT']
        fixtureDict[x]['HT T2'] = fixtureDict[x]['HT'] # creating new columns to split up the score line based on team
        del fixtureDict[x]['FT']
        del fixtureDict[x]['HT'] # deleting redundant columns
        fixtureDict[x] = fixtureDict[x][['Team 1','Team 2','FT T1','FT T2','HT T1','HT T2']] # setting correct order
        
    # Here we're reformatting the scorelines to be seperated by team (e.g '2' and '0' rather than '2-0')
    if int(x[0:4]) < 2000: # seperating the data so we know which DataFrame has split goals by team
        FTT1 = list(fixtureDict[x]['FT T1'])
        FTT2 = list(fixtureDict[x]['FT T2'])
        HTT1 = list(fixtureDict[x]['HT T1'])
        HTT2 = list(fixtureDict[x]['HT T2']) # creating new flexible lists to use from original data
        for y in range(len(fixtureDict[x].index)): # 'for' loop to cycle through each element in column
            FTT1[y] = int(FTT1[y][0])
            FTT2[y] = int(FTT2[y][-1]) # splitting up the scores into seperate lists for each team
            if str(HTT1[y]) != 'nan' and len(HTT1[y]) > 1:
                HTT1[y] = int(HTT1[y][0])
                HTT2[y] = int(HTT2[y][-1]) # splitting up the halftime scores (if they exist)
            elif str(HTT1[y]) == 'nan':
                HTT1[y] = '-'
                HTT2[y] = '-' # changing any NaN values to '-' for concurrent data
        fixtureDict[x]['FT T1'] = FTT1
        fixtureDict[x]['FT T2'] = FTT2
        fixtureDict[x]['HT T1'] = HTT1
        fixtureDict[x]['HT T2'] = HTT2 # assigning the new lists into the actual columns
        
    # Deleting index name ('Date')
    del fixtureDict[x].index.name
        
    # Fixing dataframes with no dates by resetting the index
    if fixtureDict[x].index[0] == '-':
        fixtureDict[x].reset_index(inplace=True)
    if fixtureDict[x].columns[0] == 'index':
        fixtureDict[x].drop('index', axis=1, inplace=True)
        
#print('All Done') # debug line to certify code complete

In [6]:
# removing erroneous NaN entries in fixture tables
# this is being a pain so put it in a seperate cell

def fixNaN():
    for y in fixtureDict[x].index:
         if str(y) == 'NaT':
            #print(str(y))
            num = 0 - len(fixtureDict[x].ix[y])
            #print(num)
            fixtureDict[x] = fixtureDict[x].ix[:num]
            
for x in (fixtureDict):    
    try:
        fixNaN()
    except: 
        pass

#print('All Done') # debug line to certify code complete

Importing Comparative Data

Over the next few cells, we will be importing data sets to be used in comparison with the football results.
Examples of this data include transfer market spendings, as well as average player age.
The more data of this sort, the better the final project will be, as it will allow more analysis of what drives football success or failure.


In [7]:
# Reading in Premiership (div1) data from 1992 - 2015 that includes:
#    No. of players in squad
#    Average player age
#    No. of foreign players in squad
#    Total market value (million € euros)
#    Average market value (million € euros)

premDataDict = {}

# Reading in the data from my site where I'm hosting the (relatively) clean csv files
for x in range(0,16): # 2000-2015
    if x < 10:
        url = 'http://claireaux.co.uk/project/premData0'
        url = url + str(x) + '.csv'
        df = pd.read_csv(url, decimal=',', index_col='Team')
        del df.index.name # deleting index name ('Team')
        premDataDict['200'+str(x)] = df
    elif x > 9:
        url = 'http://claireaux.co.uk/project/premData'
        url = url + str(x) + '.csv'
        df = pd.read_csv(url, decimal=',', index_col='Team')
        del df.index.name # deleting index name ('Team')
        premDataDict['20'+str(x)] = df
        
for x in range(92,100): # 1992-1999
    url = 'http://claireaux.co.uk/project/premData'
    url = url + str(x) + '.csv'
    df = pd.read_csv(url, decimal=',', index_col='Team')
    del df.index.name # deleting index name ('Team')
    premDataDict['19'+str(x)] = df
        
# Here we're reformatting the final two columns to give a float rather than a value like "12,15 Mill. ?"
def listerTwister(aList): # function to make this easier
    for x in range(len(premDataDict[y].index)): # 'for' loop to cycle through each element in column
        if 'Th' in str(aList[x]):
            aList[x] = '0.' + str(aList[x][:4])
            aList[x] = float(aList[x])
        else:
            aList[x] = str(aList[x][:-8])
            aList[x] = aList[x].replace(',','.')
            if len(aList[x]) > 1:
                aList[x] = float(aList[x]) # formatting 'Total Market Value' column list
    return aList

for y in premDataDict:
    myList1 = list(premDataDict[y]['Total market value (mil. euros)']) # creating new flexible lists from original data
    myList2 = list(premDataDict[y]['Average Market value (mil. euros)']) # creating new flexible lists from original data
    premDataDict[y]['Total market value (mil. euros)'] = listerTwister(myList1) # assigning new lists into the actual columns
    premDataDict[y]['Average Market value (mil. euros)'] = listerTwister(myList2) # assigning new lists into the actual columns
    
#print('All Done') # debug line to certify code complete

In [8]:
# Reading in transfer data from 1980 - 2015 that includes:
#    Competition given club played in
#    Transfer expenditure (million € euros)
#    No. of transfer arrivals
#    Transfer income (million € euros)
#    No. of transfer departures
#    Balance i.e profit/loss calculated as [income - expenditure]

transferDataDict = {}

# Defining a function to make reading in the data easier
def readTransfer(year):
    url = 'http://claireaux.co.uk/project/transferData'
    if year < 10:
        url = url + '0' + str(year) + '.csv'
    else:
        url = url + str(year) + '.csv' # slight alterations based on year
    df = pd.read_csv(url, index_col='Club(s)') # reading csv making the club the index
    del df.index.name # deleting index name ('Club(s)')
    df.drop('#',axis=1,inplace=True) # dropping the column that was previously an index-like counter
    df.rename(columns={'Expenditures': 'Expenditures (mil. euros)', 'Income': 'Income (mil. euros)',
                       'Balance': 'Balance (mil. euros)'}, inplace=True)
    if year < 10:
        transferDataDict['200'+str(year)] = df # adding dataframe into Dictionary
    elif year < 20:
        transferDataDict['20'+str(year)] = df # adding dataframe into Dictionary
    else:
        transferDataDict['19'+str(year)] = df # adding dataframe into Dictionary

# Reading in the data from my site where I'm hosting the (relatively) clean csv files
for x in range(0,16): # 2000-2015
    readTransfer(x)
for x in range(80,100): # 1980-1999
    readTransfer(x)
    
# Here we're reformatting the monetary columns to give a float rather than a value like "12,15 Mill. ?"
def listerTwisterTwo(aList): # function to make this easier
    for x in range(len(transferDataDict[y].index)): # 'for' loop to cycle through each element in column
        
        if len(aList[x]) < 2:
            aList[x] = float(0)
            
        elif 'Th' in str(aList[x]):
            aList[x] = str(aList[x][:-6])
            aList[x] = aList[x].replace(',','.')
            if aList[x][0] == '-':
                aList[x] = '-0.' + aList[x][1:]
            else:
                aList[x] = '0.' + aList[x]
            aList[x] = float(aList[x])
            
        else:
            aList[x] = str(aList[x][:-8])
            aList[x] = aList[x].replace(',','.')
            if len(aList[x]) > 1:
                aList[x] = float(aList[x]) # formatting 'Total Market Value' column list
                
    return aList # return the new formatted column in our list variable

for y in transferDataDict:
    a = list(transferDataDict[y]['Expenditures (mil. euros)']) # creating new flexible lists from original data
    b = list(transferDataDict[y]['Income (mil. euros)']) # creating new flexible lists from original data
    c = list(transferDataDict[y]['Balance (mil. euros)']) # creating new flexible lists from original data
    transferDataDict[y]['Expenditures (mil. euros)'] = listerTwisterTwo(a) # assigning new lists into the actual columns
    transferDataDict[y]['Income (mil. euros)'] = listerTwisterTwo(b) # assigning new lists into the actual columns
    transferDataDict[y]['Balance (mil. euros)'] = listerTwisterTwo(c) # assigning new lists into the actual columns
    
#print('All Done') # debug line to certify code complete

Utilising the Data

Now that our main portion of data is collected, stored and concurrently formatted; it's time to use it!
From here, we can employ these football results to discover a host of insights into the workings of the sport.
As it stands however, our fixture DataFrames aren't incredibly useful just on their own.
In this next cell, we're going to create final league standings for each season using all of our current data.


In [9]:
# All of the fixture data has been stored by Year/Division in identically formatted DataFrames, within a HUGE dictionary
# Here, we are going to define a function that takes a given Year/Division and works out the final league standings
# The final league table will then be returned as a new DataFrame

# TODO: add the ability to produce the full table IF full = True

# sea = season ('1969-70') , div = division ('div3') , full = boolean (True or False)
# the 'full' bool controls whether extra columns are added that split goals and points into home and away performance
def createFinalTable (sea, div, full):
    
    teamList = [] # creating an empty list that will hold one copy of each team within the league
    zeroList = [] # creating empty list that will contain as many zeroes as there are teams
    ref = sea[0:4] + ' ' + div # creating a reference based on function inputs
        
    for y in range(len(fixtureDict[ref].index)): # for loop to cycle through each row and build Team List
        if fixtureDict[ref]['Team 1'][y] not in teamList: # checking if team is already in list
            teamList.append(fixtureDict[ref]['Team 1'][y]) # if not, add it
            zeroList.append(0)
            
    # We're going to create 8 Dicts, one for games played, one for number of wins, one for number of draws... 
    # one for number of losses, one for goals for, one for goals against, one for goal difference...
    # and one for Points
    # The keys will be the team names, and the values will start at zero
    playedDict = dict(zip(teamList, zeroList))
    winDict = dict(zip(teamList, zeroList))
    drawDict = dict(zip(teamList, zeroList))
    lossDict = dict(zip(teamList, zeroList))
    forGoalsDict = dict(zip(teamList, zeroList))
    badGoalsDict = dict(zip(teamList, zeroList))
    diffGoalsDict = dict(zip(teamList, zeroList))
    pointsDict = dict(zip(teamList, zeroList))
    
    for y in range(len(fixtureDict[ref].index)): # for loop to cycle through each row and build up the dictionaries
        
        team1 = fixtureDict[ref]['Team 1'][y] # making it easier to reference current Team 1
        team2 = fixtureDict[ref]['Team 2'][y] # making it easier to reference current Team 2
        goals1 = fixtureDict[ref]['FT T1'][y] # making it easier to reference Team 1 score
        goals2 = fixtureDict[ref]['FT T2'][y] # making it easier to reference Team 2 score
        
        playedDict[team1] += 1 # notch up another game played for both teams
        playedDict[team2] += 1 # notch up another game played for both teams
        forGoalsDict[team1] += goals1 # add goals scored by Team 1 to Team 1's 'goals for'
        badGoalsDict[team2] += goals1 # add goals scored by Team 1 to Team 2's 'goals against'
        forGoalsDict[team2] += goals2 # add goals scored by Team 2 to Team 2's 'goals for'
        badGoalsDict[team1] += goals2 # add goals scored by Team 2 to Team 1's 'goals against'
        diffGoalsDict[team1] =  forGoalsDict[team1] - badGoalsDict[team1] # defining goal difference each time (for - against)
        diffGoalsDict[team2] =  forGoalsDict[team2] - badGoalsDict[team2] # defining goal difference each time (for - against)
        
        if goals1 > goals2 : # if Team 1 has more goals than Team 2...            
            # then that's a win for Team 1, and a loss for Team 2
            winDict[team1] += 1 # increasing Team 1's win count
            lossDict[team2] += 1 # increasing Team 2's loss count
            pointsDict[team1] += 3 # give Team 1 three points for the win
        elif goals1 == goals2 : # if Team 1 has equal goals as Team 2...
            # then that's a draw for Team 1, and a draw for Team 2
            drawDict[team1] += 1 # increasing Team 1's draw count
            drawDict[team2] += 1 # increasing Team 2's draw count
            pointsDict[team1] += 1 # give Team 1 one point for the draw
            pointsDict[team2] += 1 # give Team 2 one point for the draw
        elif goals1 < goals2 : # if Team 1 has less goals than Team 2...
            # then that's a loss for Team 1, and a win for Team 2
            lossDict[team1] += 1 # increasing Team 1's loss count
            winDict[team2] += 1 # increasing Team 2's win count
            pointsDict[team2] += 3 # give Team 2 three points for the win
        elif str(goals1) == 'nan':
            print(ref)
            print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
    
    # Now that we have dictionaries for every row necessary in the final league table, let's construct the DataFrame
    columnList = ('Games Played','Wins','Draws','Losses','Goals For','Goals Against','Goal Difference','Points')
    tableList =(playedDict,winDict,drawDict,lossDict,forGoalsDict,badGoalsDict,diffGoalsDict,pointsDict)
    tableDict = dict(zip(columnList, tableList))
    leagueTable = pd.DataFrame(tableDict)
    leagueTable['Team'] = leagueTable.index # change the Team from being index to its own column
    leagueTable = leagueTable[['Team','Games Played','Wins','Draws','Losses','Goals For',
                               'Goals Against','Goal Difference','Points']] # arrange columns in correct order
    leagueTable.sort_values(by='Points', ascending=False, inplace=True) # sort by points
    newIndex = []
    for z in range(len(leagueTable.index)): # only after sorting do we assign numbered index to show place in league table
        newIndex.append(z+1) # +1 so we start at 1 instead of 0
    leagueTable[''] = newIndex
    leagueTable.set_index('', inplace=True) # setting new index in place
    del leagueTable.index.name # delete the name
    return leagueTable # return our completed dataFrame

In [10]:
# In this cell, we'll run through every fixture list in fixtureDict and create a league table for each
# These will all be stored in a Dictionary and accessed by the same pattern of key (e.g - 1954 div1)

seasonDict = {} # creating the empty dictionary that will hold every single league table
#seasonFullDict = {} # for the full versions of the tables

# 'for' loop to cycle through each table of results and run it through the function in the cell above
for x in (fixtureDict):
    season = x[0:4]
    division = x[5:]
    seasonDict[season + ' ' + division] = createFinalTable(season, division, False)
    
#print('All Done') # debug line to certify code complete

Example league table, 1936 Division 2


In [11]:
seasonDict['1936 div2']


Out[11]:
Team Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points
1 Leicester City 42 24 8 10 89 57 32 80
2 Blackpool 42 24 7 11 88 53 35 79
3 Bury 42 22 8 12 74 55 19 74
4 Newcastle United 42 22 5 15 80 56 24 71
5 West Ham United 42 19 11 12 73 55 18 68
6 Plymouth Argyle 42 18 13 11 71 53 18 67
7 Sheffield United 42 18 10 14 66 54 12 64
8 Coventry City 42 17 11 14 66 54 12 62
9 Tottenham Hotspur 42 17 9 16 88 66 22 60
10 Aston Villa 42 16 12 14 82 70 12 60
11 Burnley 42 16 10 16 57 61 -4 58
12 Blackburn Rovers 42 16 10 16 70 62 8 58
13 Fulham 42 15 13 14 71 61 10 58
14 Barnsley 42 16 9 17 50 64 -14 57
15 Chesterfield 42 16 8 18 84 89 -5 56
16 Swansea City 42 15 7 20 50 65 -15 52
17 Norwich City 42 14 8 20 63 71 -8 50
18 Nottingham Forest 42 12 10 20 68 90 -22 46
19 Bradford Park Avenue 42 12 9 21 52 88 -36 45
20 Southampton 42 11 12 19 53 77 -24 45
21 Bradford City 42 9 12 21 54 94 -40 39
22 Doncaster Rovers 42 7 10 25 30 84 -54 31

Example league table, 1994 Division 4


In [12]:
seasonDict['1994 div4']


Out[12]:
Team Games Played Wins Draws Losses Goals For Goals Against Goal Difference Points
1 Carlisle 42 27 10 5 68 31 37 91
2 Walsall 42 25 10 7 77 40 37 85
3 Chesterfield 42 23 12 7 62 37 25 81
4 Bury 42 23 10 9 73 38 35 79
5 Preston 42 19 10 13 58 41 17 67
6 Mansfield 42 18 11 13 84 59 25 65
7 Fulham 42 16 14 12 60 54 6 62
8 Scunthorpe 42 18 8 16 68 63 5 62
9 Doncaster 42 17 10 15 58 43 15 61
10 Colchester 42 16 10 16 56 64 -8 58
11 Barnet 42 15 11 16 56 63 -7 56
12 Lincoln 42 15 11 16 54 56 -2 56
13 Torquay 42 14 13 15 54 57 -3 55
14 Wigan 42 14 10 18 53 60 -7 52
15 Rochdale 42 12 14 16 44 67 -23 50
16 Hereford 42 12 13 17 45 62 -17 49
17 Northampton 42 10 14 18 45 67 -22 44
18 Hartlepool 42 11 10 21 43 69 -26 43
19 Gillingham 42 10 11 21 46 64 -18 41
20 Darlington 42 11 8 23 43 57 -14 41
21 Scarborough 42 8 10 24 49 70 -21 34
22 Exeter 42 8 10 24 36 70 -34 34

The All-Time Greats

Here we'll graphically represent, in the form of pie charts, the 'best' teams since the leagues began.
We'll do this with a simple Formula 1 (think Mario Kart) inspired scoring system.
1st: 25 points, 2nd: 18, 3rd: 15, 4th: 12, 5th: 10, 6th: 8, 7th: 6, 8th: 4, 9th: 2, 10th: 1, 11th+: 0


In [13]:
# This cell is going to run through all of the leagues and build dictionaries containing tallies for every team involved.
# These can then be utilised to build charts displaying how teams have performed since 1880.

scoringDict = {1:25,2:18,3:15,4:12,5:10,6:8,7:6,8:4,9:2,10:1} # creating a dictionary for the scoring
div1TallyDict = {}
div2TallyDict = {}
div3TallyDict = {}
div4TallyDict = {} # empty dictionaries for team:score

for x in seasonDict: # cycle through league tables
    if x[-4:] == 'div1': # dealing with Premiership
        for y in seasonDict[x].index:
            if y < 11: # only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div1TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div1TallyDict has no value, so it throws an error
                        div1TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
    if x[-4:] == 'div2': # dealing with Championship
        for y in seasonDict[x].index:
            if y < 11: # only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div2TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div2TallyDict has no value, so it throws an error
                        div2TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
    if x[-4:] == 'div3': # dealing with 3rd Division
        for y in seasonDict[x].index:
            if y < 11: #only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div3TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div3TallyDict has no value, so it throws an error
                        div3TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
    if x[-4:] == 'div4': # dealing with 4th Division
        for y in seasonDict[x].index:
            if y < 11: #only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div4TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div4TallyDict has no value, so it throws an error
                        div4TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score

# defining a function that can run through a tallyDict and fix for the different team names used
# e.g "Man United" and "Manchester United", "Birmingham" and "Birmingham City"
def fixTallyDict(tallyDict):
    lista = [] # empty list to store team names
    for x in sorted(tallyDict):
        lista.append(x) # build up list with team names in alphabetical order
    for x in range(len(lista)):
        y=x+1 # using y to distinguish entry AFTER x
        if y < len(lista):
            if lista[x] == "Leeds": # SPECIAL CASE: Leeds
                if lista[y] == 'Leeds United':
                    tallyDict['Leeds'] += tallyDict['Leeds United']
                    del tallyDict['Leeds United']
                elif lista[y] == 'Leeds City':
                    tallyDict['Leeds'] += tallyDict['Leeds City']
                    tallyDict['Leeds'] += tallyDict['Leeds United']
                    del tallyDict['Leeds City']
                    del tallyDict['Leeds United']
            elif lista[x] in lista[y]: # Main generic checker
                tallyDict[lista[x]] += tallyDict[lista[y]] 
                del tallyDict[lista[y]]
            elif lista[x] == "Man City": # SPECIAL CASE: Man City
                if lista[y] == 'Manchester City':
                    tallyDict['Man City'] += tallyDict['Manchester City']
                    del tallyDict['Manchester City']
            elif lista[x] == "Man United": # SPECIAL CASE: Man Utd
                tallyDict['Man United'] += tallyDict['Manchester United']
                del tallyDict['Manchester United']
            elif lista[x] == "Nott'm Forest": # SPECIAL CASE: Nottingham Forest
                if lista[y] == 'Nottingham Forest':
                    tallyDict["Nott'm Forest"] += tallyDict['Nottingham Forest']
                    del tallyDict['Nottingham Forest']
            elif lista[x] == "QPR": # SPECIAL CASE: QPR
                tallyDict["QPR"] += tallyDict['Queens Park Rangers']
                del tallyDict['Queens Park Rangers']
            elif lista[x] == "Sheffield Wednesday": # SPECIAL CASE: Sheff Weds
                tallyDict["Sheffield Wednesday"] += tallyDict['Sheffield Weds']
                del tallyDict['Sheffield Weds']
            elif lista[x] == "Wolverhampton Wanderers": # SPECIAL CASE: Wolves
                if lista[y] == 'Wolves':
                    tallyDict["Wolverhampton Wanderers"] += tallyDict['Wolves']
                    del tallyDict['Wolves']
    return(tallyDict) # return fixed Dictionary

div1TallyDict = fixTallyDict(div1TallyDict)
div2TallyDict = fixTallyDict(div2TallyDict)
div3TallyDict = fixTallyDict(div3TallyDict)
div4TallyDict = fixTallyDict(div4TallyDict) # updating tallyDicts to fix them all with new function

#print('All Done') # debug line to certify code complete

In [14]:
sns.set()

# Using the TallyDicts to construct 'cropped' DataFrames
minScore = 300 # Specify minScore (this is normalised for Div2 and adjusted for the others)
# Create DataFrames using keys as 'Team' and values as 'Greatness Score'
div1TallyFrame = pd.DataFrame({"Team": list(div1TallyDict.keys()), "'Greatness' Score": list(div1TallyDict.values())})
div1TallyFrame = div1TallyFrame[div1TallyFrame["'Greatness' Score"] > minScore*1.5] # Keep only the best teams
div2TallyFrame = pd.DataFrame({"Team": list(div2TallyDict.keys()), "'Greatness' Score": list(div2TallyDict.values())})
div2TallyFrame = div2TallyFrame[div2TallyFrame["'Greatness' Score"] > minScore] # Keep only the best teams
div3TallyFrame = pd.DataFrame({"Team": list(div3TallyDict.keys()), "'Greatness' Score": list(div3TallyDict.values())})
div3TallyFrame = div3TallyFrame[div3TallyFrame["'Greatness' Score"] > minScore*0.46] # Keep only the best teams
div4TallyFrame = pd.DataFrame({"Team": list(div4TallyDict.keys()), "'Greatness' Score": list(div4TallyDict.values())})
div4TallyFrame = div4TallyFrame[div4TallyFrame["'Greatness' Score"] > minScore*0.45] # Keep only the best teams

# Creating the figure to display Pie Charts of the all-time greatest teams
fig = {
    'data': [
        {'labels': div1TallyFrame['Team'],'values': div1TallyFrame["'Greatness' Score"], 'type': 'pie','name': 'Div 1',
            'domain': {'x': [.1, .33], # Domain is a fiddly variable
                       'y': [1, .34]},"hole": .35, 'hoverinfo':'label+value','textinfo':'label+value',},
        {'labels': div2TallyFrame['Team'],'values': div2TallyFrame["'Greatness' Score"],'type': 'pie','name': 'Div 2',
            'domain': {'x': [.9, .67], # Domain is a fiddly variable
                       'y': [1, .34]}, "hole": .35,'hoverinfo':'label+value', 'textinfo':'label+value',},
        {'labels': div3TallyFrame['Team'],'values': div3TallyFrame["'Greatness' Score"],'type': 'pie', 'name': 'Div 3',
            'domain': {'x': [.3, .15], # Domain is a fiddly variable
                       'y': [0, .1]},"hole": .45, 'hoverinfo':'label+value','textinfo':'label', },
        {'labels': div4TallyFrame['Team'], 'values': div4TallyFrame["'Greatness' Score"],'type': 'pie','name': 'Div 4',
            'domain': {'x': [.8, .67], # Domain is a fiddly variable
                       'y': [0, .1]},"hole": .5, 'hoverinfo':'label+value','textinfo':'label', }
    ],
    # setting annotations to denote which league each chart is
    'layout': {'title': "All-Time Team 'Greatness' Ranking by League",'showlegend': False, "annotations": [
            {"font": {"size": 20},"showarrow": False,"text": "1st","x": 0.195,"y": 0.72 },
            {"font": {"size": 20}, "showarrow": False,"text": "2nd","x": 0.81,"y": 0.72},
            {"font": { "size": 20},"showarrow": False,"text": "3rd","x": 0.2,"y": 0.01},
            { "font": {"size": 20}, "showarrow": False,"text": "4th","x": 0.755,"y": 0.01} 
        ]
    }
}
iplot(fig) # plot the figure

# TO DO: EDIT THE COLOURS OF THE PIE CHART TO CORRELATE WITH TEAM COLOURS (e.g Liverpool red, Chelsea blue)


/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:878: UserWarning:

axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.

Analysing these pie charts above, we can see that Liverpool hold the all-time top spot for the highest division (currently the premiership) with Manchester United in a very close 2nd. The, perhaps surprising, teams to make it are Aston Villa and Sunderland, who have not been hugely successful recently but used to enjoy a great deal of success.
Moving over to Division 2 (or the secondary league, currently the Championship), the scores are a lot lower. This is most likely due to the fact that once you finish near the top (or bottom) of this league you move up or down to another league, and thus it's impossible to finish high in the 2nd Division season after season. With that said, it's clear that Birmingham and Leicester have seen the most success in the second tier, Leicester interestingly enough having won the 2nd division very recently and then shockingly won the Premier league just last season.
Below these, I've also included some smaller pies to show league 3 and 4.

Does transfer spending correlate with performance?

Using our organised data, we can now start to explore relationships and answer some specular questions.
Here, the relationship between transfer spending and league performance will be explored.
Hopefully, these answers will provide insights as yet untold into the economics of football.


In [15]:
CPI = pd.read_csv('http://claireaux.co.uk/project/cpiVALUES.csv') # reading CPI csv to get inflation values

# Function that allows adjusting old euro values for inflation
def adjustInflation(value,yearFrom,yearTo):
    indexTo = float(CPI[yearTo])
    indexFrom = float(CPI[yearFrom])
    change = (((indexTo - indexFrom) / indexFrom) + 1)
    newValue = value*change
    return(newValue)

# Function that builds a dataframe containing teams as index, league position for the given season...
# ...and various transfer related data
# This also includes some code that accounts for the different naming conventions between data sources
def posExpBuilder(season) :
    df = seasonDict[season]
    df2 = transferDataDict[season[0:4]] # creating dataframes for specified year to use
    leaguePositions = {} # creating empty dict for league position: team
    expenditures = {} # creating empty dict for expenditure: team
    incomes = {} # creating empty dict for income: team
    averageNewPlayerCosts = {} # creating empty dict for average cost of each new player: team
    averageSoldPlayerPrices = {} # creating empty dict for average price of each sold player: team
    balances = {} # creating empty dict for income - expenditures: team
    for x in range(len(df.index)): # for loop to cycle through first dataframe
        y=x+1
        leaguePositions[df.iloc[x]['Team']] = y # setting league positions dictionary values (e.g Arsenal: 2)
    for x in range(len(df2.index)): # cycling through second Dataframe
        # creating adjusted variables for inflation
        adjustedExpense = adjustInflation((df2.iloc[x]['Expenditures (mil. euros)']),season[0:4],'2016')
        adjustedIncome = adjustInflation((df2.iloc[x]['Income (mil. euros)']),season[0:4],'2016')
        playerCost = (df2.iloc[x]['Expenditures (mil. euros)']) / (df2.iloc[x]['Arrivals'])
        adjustedPlayerCost = adjustInflation(playerCost,season[0:4],'2016')
        playerPrice = (df2.iloc[x]['Income (mil. euros)']) / (df2.iloc[x]['Departures'])
        adjustedPlayerPrice = adjustInflation(playerPrice,season[0:4],'2016')
        adjustedBalance = adjustInflation((df2.iloc[x]['Balance (mil. euros)']),season[0:4],'2016')
        # appending dictionaries for calculated adjusted values
        expenditures[df2.index[x]] = adjustedExpense
        incomes[df2.index[x]] = adjustedIncome
        averageNewPlayerCosts[df2.index[x]] = adjustedPlayerCost
        averageSoldPlayerPrices[df2.index[x]] = adjustedPlayerPrice
        balances[df2.index[x]] = adjustedBalance
    
    # making a list of the new dictionaries
    working = (leaguePositions,expenditures,incomes,averageNewPlayerCosts,averageSoldPlayerPrices,balances)
    returnFrame = pd.DataFrame.from_records(working) # creating the final frame from this new list
    returnFrame = returnFrame.transpose() # transpose to look nicer
    
    names = [] # empty names list to be used below
    # for loop to cycle through our new final dataframe accounting for the different conventions from combining different data
    for x in range(len(returnFrame.index)): 
        y = x+1
        if y < len(returnFrame.index):
            if returnFrame.index[x] in returnFrame.index[y]:
                returnFrame[1][x] = returnFrame[1][y] # grabbing the column[1] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[2][x] = returnFrame[2][y] # grabbing the column[2] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[3][x] = returnFrame[3][y] # grabbing the column[3] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[4][x] = returnFrame[4][y] # grabbing the column[4] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[5][x] = returnFrame[5][y] # grabbing the column[5] value for ('TEAM' FC) and putting it into ('TEAM')
        if str(returnFrame[0][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[1][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[2][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[3][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[4][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[5][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
    
    returnFrame.drop(names,inplace=True) # deleting all those teams that were in the list
    returnFrame[0] = returnFrame[0].astype(int,raise_on_error=False) # change league standing column dataType to int
    returnFrame.rename(columns={0: 'Position', 1: 'Expenditure', 2: 'Income',
                               3: 'Average Cost of New Player', 4: 'Average Price of Sold Player',
                               5: 'Balance'}, inplace=True) # renaming columns appropriately
    return(returnFrame) # return the dataFrame

In [16]:
# Some general functions for graphing etc.
def roundTo3(x): # small function to round numbers to 3 decimal places
    return(float("{0:.3f}".format(x)))

# function that takes given x and y values and calculates pearsonr and a significance value for p
def textMaker(xi, y): 
    bond = (stats.pearsonr(xi,y))
    p = bond[1]
    if p < 0.01:
        p = '< 1%'
    elif p < 0.05:
        p = '< 5%'
    elif p < 0.1:
        p = '< 10%'
    else:
        p = 'not significant'
    return("Pearson's r: " + str(roundTo3(bond[0])) + ', p value:' + p)

# Generate linear line fit using regression
def fitLine(y):
    slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)
    return(slope*xi+intercept)

positionTransferDict1 = {}
positionTransferDict2 = {} # Creating empty dicts for use

# Top league (Premiership [div 1])
for x in range(1980,2016):
    season = str(x) + ' div1'
    positionTransferDict1[str(x)] = posExpBuilder(season) # running all data through function in previous cell
    
# Secondary league (Championship/league 1 [div 2])
for x in range(1980,2016):
    season = str(x) + ' div2'
    positionTransferDict2[str(x)] = posExpBuilder(season) # running all data through function in previous cell
    
positionTransferList1 = []
positionTransferList2 = []
resultList = [] # creating empty lists

for x in positionTransferDict1:
    positionTransferList1.append(positionTransferDict1[x])
for x in positionTransferDict2:
    positionTransferList2.append(positionTransferDict2[x]) # populating lists with dataframes

result = pd.concat(positionTransferList1)
result2 = pd.concat(positionTransferList2) # concatenating all dataframes in each list to make one big dataframe

result['div'] = 'div1'
result2['div'] = 'div2'
resultList = [result,result2]
result3 = pd.concat(resultList) # as yet unused 3rd dataframe that combines both div1 and div2


/Users/sglyon/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:32: RuntimeWarning:

invalid value encountered in double_scalars


In [17]:
xi = result['Position'] # using league position as x values
A = array([ xi, ones(22)]) # array for use in stats calcs
y1=result['Expenditure']
y2=result['Income']

# Creating the dataset, and generating the plot
trace1 = go.Scatter(x=xi, y=y1,  mode='markers',  marker=go.Marker(color='turquoise', opacity=0.5), name='Spending Data')
trace2 = go.Scatter(x=xi, y=fitLine(y1), mode='lines', marker=go.Marker(color='blue', size=100), name='Spending Fit')
trace3 = go.Scatter(x=xi, y=y2, mode='markers', marker=go.Marker(color='violet', opacity=0.5), name='Income Data')
trace4 = go.Scatter(x=xi, y=fitLine(y2), mode='lines', marker=go.Marker(color='red', size=100), name='Income Fit')

# Annotations containg important statistical stuff
pearson1 = go.Annotation(x=.3, y=.8, text = textMaker(xi, y1), showarrow=False, font=go.Font(size=12))
pearson2 = go.Annotation(x=.85, y=.8, text = textMaker(xi, y2), showarrow=False, font=go.Font(size=12))
annotationList = [pearson1,pearson2]

# Create fig with two subplots side by side
fig = tools.make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(
        'League performance v.s transfer spending', 'League performance v.s transfer income'))
# Update layout for annotations and titles
fig['layout'].update(annotations=annotationList,title='League performance v.s transfer spending' +
                     '          League performance v.s transfer income             .',)

# Add all the data traces to the plots
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

# Update axis layouts
fig['layout']['xaxis1'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
fig['layout']['yaxis1'].update(zerolinecolor='rgb(210,210,210)', gridcolor='rgb(210,210,210)', 
                               title="Million EUR (inflation adjusted)")
fig['layout']['xaxis2'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")

# Plot figure
iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y1 ]

Now that we've visualised in pie charts the greatest teams, it's time to explore how a team can achieve such 'greatness'.
These two graphs explore the correlation between League Performance and Transfer Spending/Income.
For both cases, we see a highly significant (to the 1% level) negative correlation (i.e the less you spend/earn, the worse your final league standing).
Admittedly, the correlation is somewhat weak, especially in the transfer income case, but the regression lines show it definitely exists.
This result may seem like something obvious from pure common sense, but this data spanning 35 years proves it empirically.
It also proves that the correlation may not be quite as strong as one would imagine without the data.


In [18]:
xi = result['Position'] # using league position as x values
A = array([ xi, ones(22)]) # array for use in stats calcs
y1=result['Average Cost of New Player']
y2=result['Average Price of Sold Player']

# Creating the dataset, and generating the plot
trace1 = go.Scatter(x=xi, y=y1,  mode='markers',  marker=go.Marker(color='turquoise', opacity=0.5), name='Average Cost')
trace2 = go.Scatter(x=xi, y=fitLine(y1), mode='lines', marker=go.Marker(color='blue', size=100), name='Inbound Fit')
trace3 = go.Scatter(x=xi, y=y2, mode='markers', marker=go.Marker(color='violet', opacity=0.5), name='Average Price')
trace4 = go.Scatter(x=xi, y=fitLine(y2), mode='lines', marker=go.Marker(color='red', size=100), name='Outbound Fit')

# Annotations containg important statistical stuff
pearson1 = go.Annotation(x=.3, y=.8, text = textMaker(xi, y1), showarrow=False, font=go.Font(size=12))
pearson2 = go.Annotation(x=.85, y=.8, text = textMaker(xi, y2), showarrow=False, font=go.Font(size=12))
annotationList = [pearson1,pearson2]

# Create fig with two subplots side by side
fig = tools.make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(
        'League performance v.s transfer spending', 'League performance v.s transfer income'))
# Update layout for annotations and titles
fig['layout'].update(annotations=annotationList,title='Performance v.s Average Spent per New Player' +
                     '      Performance v.s Average Price per Sold Player        .',)

# Add all the data traces to the plots
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

# Update axis layouts
fig['layout']['xaxis1'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
fig['layout']['yaxis1'].update(zerolinecolor='rgb(210,210,210)', gridcolor='rgb(210,210,210)', 
                               title="Million EUR (inflation adjusted)")
fig['layout']['xaxis2'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")

# Plot figure
iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y1 ]

These two graphs are similar to the ones above, except we're comparing on an individual basis.
In each case, the y value depicts the average spent/earned from every incoming/outgoing player.
This is calculated simply by dividing the total expenditure by number of arrivals (or total income by number of departures).
Again, correlation is significant to the 1% level, and we see negative correlation across both.
At a value of -0.31, it is clear that purchasing high value players seems to give the highest payoff seen so far in terms of performance.
The average price received from sold players has only a very small negative correlation and thus I would not say it matters too much.

Conclusions

Based on the data studied, in order for teams to improve their performance, spending large amounts in concentration has been shown to provide the best outcome (i.e purchasing a few high value players > many low value players > spending less).
Ideally, I would have built around 50 figures for comparison of all the values stored within DataFrames, but tackling seaborn and plotly turned out to be more time-consuming than I had imagined and thus I don't believe I have done my level of data collection justice.
I plan to continue to work on this notebook in my spare time going forward, but for now this is my submitted and final project.