The Python Premiership Project

Details

This Jupyter Notebook file has been created by George Claireaux for the Data Bootcamp course within NYU Stern.
This file is provided 100% open-source and anyone is welcome to use and edit it as such.
The rest of the document will be an instructional collection, formatting and analysis of various data relating to the "Python Premiership Project" ©.
By collecting a vast amount of data on football fixtures in England since the leagues began in 1880, this project aims to shed new light onto the workings of sports economics.

Import Packages



In [1]:

    
import sys                             # system module
import pandas as pd                    # data package
import matplotlib as mpl               # graphics package
import matplotlib.pyplot as plt        # pyplot module
import datetime as dt                  # date and time module
import numpy as np                     # NumPy module
import seaborn as sns                  # seaborn module
import time
from dateutil.parser import parse
from pandas_datareader import data as web

# Scientific libraries
from numpy import arange,array,ones
from scipy import stats

# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
from plotly import tools
import plotly.graph_objs as go               # ditto
import plotly.plotly as py
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

# these lines make our graphics show up in the notebook
%matplotlib inline             
plotly.offline.init_notebook_mode(connected=True)

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Seaborn version: ', sns.__version__)
print('Plotly version: ', plotly.__version__)
print('Today: ', dt.date.today())









    



/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:878: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))






    











    











    











    



Python version: 3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version:  0.19.0
Matplotlib version:  1.5.3
Seaborn version:  0.6.0
Plotly version:  1.12.11
Today:  2016-12-22

Reading in the Bulk of our Data

This is where we'll pull in match data from the English leagues dating back to the 1880's.
We'll read these into seperate dataframes, each one containing all the results from one league and year, e.g - League 2, 1997.
These will all be contained within one massive dictionary and accessed via the appropriate key, e.g - [1997 div2].
All original data is available here and here.

Initial List Creation



In [2]:

    
# Creating a list of the all the years in the correct format (e.g - 1997-98)

finalYear = 2016 # adjustable for future update of project

seasonList = [] # create empty list
for x in range(1888,finalYear):
    seasonList.append(x) # append all years up to finalYear to the list
seasonList = [str(i) for i in seasonList] # convert all years in list to strings

for x in range(len(seasonList)): # converting strings to correct format
    if seasonList[x][2:4] != '99':
        if seasonList[x][-2] == '0' and seasonList[x][-1] != '9':
            seasonList[x] = (seasonList[x]+'-0'+str(int(seasonList[x][2:4])+1)) # catching error prone 0X-0X years
        else:
            seasonList[x] = (seasonList[x]+'-'+str(int(seasonList[x][2:4])+1)) # majority of seasons
    elif seasonList[x][2:4] == '99':
        seasonList[x] = (seasonList[x]+'-'+'00') # catching error prone 99-00 years
    else: 
        print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
        
# remove any unwanted entries (such as wartime years where leagues didn't occur)
seasonList.remove('1915-16')
seasonList.remove('1916-17')
seasonList.remove('1917-18')
seasonList.remove('1918-19')
seasonList.remove('1940-41')
seasonList.remove('1941-42')
seasonList.remove('1942-43')
seasonList.remove('1943-44')
seasonList.remove('1944-45')
seasonList.remove('1945-46')

# creating an empty dictionary to store all season dataframes in
fixtureDict = {}

Function Definitions



In [3]:

    
# Defining some functions for use within the next cell

# creating a function that will finalise the url, read the csv to a DataFrame and add it to the Dict
def finaliseCSV (url,urlEnd,division,year):
    finalURL = url + urlEnd
    if year > 1999:
        df = pd.read_csv(finalURL, usecols=['Date','HomeTeam','AwayTeam','FTHG','FTAG','HTHG','HTAG'], index_col='Date')
    else:
        df = pd.read_csv(finalURL, index_col='Date')
    fixtureDict[str(year)+' '+division] = df
    
# running into problems with overloading urllib when trying to run this all at once so making it a controllable function
def readDivisions(year,ur):
    if year < 1992: # before 1992, division 1 was always the top league
        finaliseCSV(ur,'1-division1.csv','div1',year)
        if year > 1891: # in 1892, division 2 was introduced
            finaliseCSV(ur,'2-division2.csv','div2',year)
            if (year > 1919 and year < 1921) or year > 1957: # making sure we only look for division 3 in the right years
                finaliseCSV(ur,'3-division3.csv','div3',year)
                if year > 1957:
                    finaliseCSV(ur,'4-division4.csv','div4',year) # division 4 was introduced in 1958
            elif year > 1920 and year < 1958: # between these years there were two divisions 3s, north and south
                finaliseCSV(ur,'3a-division3n.csv','div3 north',year)
                finaliseCSV(ur,'3b-division3s.csv','div3 south',year)
    elif year < 2000: # for the rest of the years up to 2000, introduction of the premier league
        finaliseCSV(ur,'1-premierleague.csv','div1',year)
        finaliseCSV(ur,'2-division1.csv','div2',year)
        finaliseCSV(ur,'3-division2.csv','div3',year)
        if year > 1992: # 1992 had no fourth division but the rest do
            finaliseCSV(ur,'4-division3.csv','div4',year)
    else: # for years 2000+ (getting data from different website)
        finaliseCSV(ur,'E0.csv','div1',year)
        finaliseCSV(ur,'E1.csv','div2',year)
        finaliseCSV(ur,'E2.csv','div3',year)
        finaliseCSV(ur,'E3.csv','div4',year)
        if year > 2004: # introduction of the conference (lowest professional league currently)
            finaliseCSV(ur,'EC.csv','div5',year)

Major Data Collection



In [4]:

    
# Number of leagues in existence and naming conventions of said leagues has changed over the years.
# So we'll nest our data mining loops within 'if' statements to make sure we get everything correctly (function readDivisions)
# PLEASE NOTE: This cell will take a fair amount of time to run (dependent on internet connection) so be warned!
# ALSO BE ADVISED: You may get this error when running this cell --->  "<urlopen error [Errno 11002] getaddrinfo failed>"
# If this happens, run the code in the TEN cells below instead (one by one, not at the same time)

# main 'for' loop that calls the appropriate function above for each season in the seasonList
for x in range(len(seasonList)):
    
    if seasonList[x][0:2] == '18' or seasonList[x][0:2] == '19': # getting season data up to 2000 from one site
        u = 'https://raw.githubusercontent.com/footballcsv/eng-england/master/'
        u = u + seasonList[x][0:3] + '0s/' + seasonList[x] + '/' #building up the start of the url
        readDivisions(int(seasonList[x][0:4]),u) # running the function to make sure we choose the right leagues for the year
                    
    elif seasonList[x][0:2] == '20': # getting more recent data from different site
        u = 'http://www.football-data.co.uk/mmz4281/'
        u = u + seasonList[x][2:4] + seasonList[x][5:7] + '/'
        readDivisions(int(seasonList[x][0:4]),u)
        
    else:
        print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
        
#print('All Done') # debug line to certify code complete

DataFrame Formatting



In [5]:

    
# Now that we've read all the fixtures into DataFrames, we just need to tidy them up
# Here we're going to run a large 'for' loop that will edit all DataFrames into a single format

def is_date(string): # creating a quick function that checks if a string is a date
    try: 
        parse(string)
        return True
    except ValueError:
        return False

for x in (fixtureDict):
    
    # This block of code is going to convert all dates into the same YYYY-MM-DD format
    if type(fixtureDict[x].index[0]) == str and is_date(fixtureDict[x].index[0]):
        date = fixtureDict[x].index 
        date = list(date) # creating a new variable based off the original index and converting it to a list
        for y in range(len(fixtureDict[x].index)):
            date[y] = pd.to_datetime(fixtureDict[x].index[y]) # assigning each date within the index to an element in date
        fixtureDict[x]['Date'] = date
        fixtureDict[x] = fixtureDict[x].set_index('Date') # assigning our date list as the new index for each DataFrame
        
    # This block of code is going to convert every DataFrame's columns into the same format
    # This format will be (index)Date , Team 1 , Team 2 , FT T1 , FT T2 , - , HT T1 , HT T2
    # Team 1 (T1) is home team, Team 2 (T2) is away team, FT = Full-Time Score, HT = Half-Time Score
    if fixtureDict[x].columns[0] == 'Round':
        del fixtureDict[x]['Round'] # deleting the redundant 'Round' column
    # renaming the headers to new format
    fixtureDict[x].rename(columns={'HomeTeam': 'Team 1', 'AwayTeam': 'Team 2', 'FTHG': 'FT T1', 'FTAG': 'FT T2'}, inplace=True)
    fixtureDict[x].rename(columns={'HTHG': 'HT T1', 'HTAG': 'HT T2'}, inplace=True)
    if fixtureDict[x].columns[2] == 'FT': # specifically the old data
        fixtureDict[x]['FT T1'] = fixtureDict[x]['FT']
        fixtureDict[x]['FT T2'] = fixtureDict[x]['FT']
        fixtureDict[x]['HT T1'] = fixtureDict[x]['HT']
        fixtureDict[x]['HT T2'] = fixtureDict[x]['HT'] # creating new columns to split up the score line based on team
        del fixtureDict[x]['FT']
        del fixtureDict[x]['HT'] # deleting redundant columns
        fixtureDict[x] = fixtureDict[x][['Team 1','Team 2','FT T1','FT T2','HT T1','HT T2']] # setting correct order
        
    # Here we're reformatting the scorelines to be seperated by team (e.g '2' and '0' rather than '2-0')
    if int(x[0:4]) < 2000: # seperating the data so we know which DataFrame has split goals by team
        FTT1 = list(fixtureDict[x]['FT T1'])
        FTT2 = list(fixtureDict[x]['FT T2'])
        HTT1 = list(fixtureDict[x]['HT T1'])
        HTT2 = list(fixtureDict[x]['HT T2']) # creating new flexible lists to use from original data
        for y in range(len(fixtureDict[x].index)): # 'for' loop to cycle through each element in column
            FTT1[y] = int(FTT1[y][0])
            FTT2[y] = int(FTT2[y][-1]) # splitting up the scores into seperate lists for each team
            if str(HTT1[y]) != 'nan' and len(HTT1[y]) > 1:
                HTT1[y] = int(HTT1[y][0])
                HTT2[y] = int(HTT2[y][-1]) # splitting up the halftime scores (if they exist)
            elif str(HTT1[y]) == 'nan':
                HTT1[y] = '-'
                HTT2[y] = '-' # changing any NaN values to '-' for concurrent data
        fixtureDict[x]['FT T1'] = FTT1
        fixtureDict[x]['FT T2'] = FTT2
        fixtureDict[x]['HT T1'] = HTT1
        fixtureDict[x]['HT T2'] = HTT2 # assigning the new lists into the actual columns
        
    # Deleting index name ('Date')
    del fixtureDict[x].index.name
        
    # Fixing dataframes with no dates by resetting the index
    if fixtureDict[x].index[0] == '-':
        fixtureDict[x].reset_index(inplace=True)
    if fixtureDict[x].columns[0] == 'index':
        fixtureDict[x].drop('index', axis=1, inplace=True)
        
#print('All Done') # debug line to certify code complete



In [6]:

    
# removing erroneous NaN entries in fixture tables
# this is being a pain so put it in a seperate cell

def fixNaN():
    for y in fixtureDict[x].index:
         if str(y) == 'NaT':
            #print(str(y))
            num = 0 - len(fixtureDict[x].ix[y])
            #print(num)
            fixtureDict[x] = fixtureDict[x].ix[:num]
            
for x in (fixtureDict):    
    try:
        fixNaN()
    except: 
        pass

#print('All Done') # debug line to certify code complete

Importing Comparative Data

Over the next few cells, we will be importing data sets to be used in comparison with the football results.
Examples of this data include transfer market spendings, as well as average player age.
The more data of this sort, the better the final project will be, as it will allow more analysis of what drives football success or failure.



In [7]:

    
# Reading in Premiership (div1) data from 1992 - 2015 that includes:
#    No. of players in squad
#    Average player age
#    No. of foreign players in squad
#    Total market value (million € euros)
#    Average market value (million € euros)

premDataDict = {}

# Reading in the data from my site where I'm hosting the (relatively) clean csv files
for x in range(0,16): # 2000-2015
    if x < 10:
        url = 'http://claireaux.co.uk/project/premData0'
        url = url + str(x) + '.csv'
        df = pd.read_csv(url, decimal=',', index_col='Team')
        del df.index.name # deleting index name ('Team')
        premDataDict['200'+str(x)] = df
    elif x > 9:
        url = 'http://claireaux.co.uk/project/premData'
        url = url + str(x) + '.csv'
        df = pd.read_csv(url, decimal=',', index_col='Team')
        del df.index.name # deleting index name ('Team')
        premDataDict['20'+str(x)] = df
        
for x in range(92,100): # 1992-1999
    url = 'http://claireaux.co.uk/project/premData'
    url = url + str(x) + '.csv'
    df = pd.read_csv(url, decimal=',', index_col='Team')
    del df.index.name # deleting index name ('Team')
    premDataDict['19'+str(x)] = df
        
# Here we're reformatting the final two columns to give a float rather than a value like "12,15 Mill. ?"
def listerTwister(aList): # function to make this easier
    for x in range(len(premDataDict[y].index)): # 'for' loop to cycle through each element in column
        if 'Th' in str(aList[x]):
            aList[x] = '0.' + str(aList[x][:4])
            aList[x] = float(aList[x])
        else:
            aList[x] = str(aList[x][:-8])
            aList[x] = aList[x].replace(',','.')
            if len(aList[x]) > 1:
                aList[x] = float(aList[x]) # formatting 'Total Market Value' column list
    return aList

for y in premDataDict:
    myList1 = list(premDataDict[y]['Total market value (mil. euros)']) # creating new flexible lists from original data
    myList2 = list(premDataDict[y]['Average Market value (mil. euros)']) # creating new flexible lists from original data
    premDataDict[y]['Total market value (mil. euros)'] = listerTwister(myList1) # assigning new lists into the actual columns
    premDataDict[y]['Average Market value (mil. euros)'] = listerTwister(myList2) # assigning new lists into the actual columns
    
#print('All Done') # debug line to certify code complete



In [8]:

    
# Reading in transfer data from 1980 - 2015 that includes:
#    Competition given club played in
#    Transfer expenditure (million € euros)
#    No. of transfer arrivals
#    Transfer income (million € euros)
#    No. of transfer departures
#    Balance i.e profit/loss calculated as [income - expenditure]

transferDataDict = {}

# Defining a function to make reading in the data easier
def readTransfer(year):
    url = 'http://claireaux.co.uk/project/transferData'
    if year < 10:
        url = url + '0' + str(year) + '.csv'
    else:
        url = url + str(year) + '.csv' # slight alterations based on year
    df = pd.read_csv(url, index_col='Club(s)') # reading csv making the club the index
    del df.index.name # deleting index name ('Club(s)')
    df.drop('#',axis=1,inplace=True) # dropping the column that was previously an index-like counter
    df.rename(columns={'Expenditures': 'Expenditures (mil. euros)', 'Income': 'Income (mil. euros)',
                       'Balance': 'Balance (mil. euros)'}, inplace=True)
    if year < 10:
        transferDataDict['200'+str(year)] = df # adding dataframe into Dictionary
    elif year < 20:
        transferDataDict['20'+str(year)] = df # adding dataframe into Dictionary
    else:
        transferDataDict['19'+str(year)] = df # adding dataframe into Dictionary

# Reading in the data from my site where I'm hosting the (relatively) clean csv files
for x in range(0,16): # 2000-2015
    readTransfer(x)
for x in range(80,100): # 1980-1999
    readTransfer(x)
    
# Here we're reformatting the monetary columns to give a float rather than a value like "12,15 Mill. ?"
def listerTwisterTwo(aList): # function to make this easier
    for x in range(len(transferDataDict[y].index)): # 'for' loop to cycle through each element in column
        
        if len(aList[x]) < 2:
            aList[x] = float(0)
            
        elif 'Th' in str(aList[x]):
            aList[x] = str(aList[x][:-6])
            aList[x] = aList[x].replace(',','.')
            if aList[x][0] == '-':
                aList[x] = '-0.' + aList[x][1:]
            else:
                aList[x] = '0.' + aList[x]
            aList[x] = float(aList[x])
            
        else:
            aList[x] = str(aList[x][:-8])
            aList[x] = aList[x].replace(',','.')
            if len(aList[x]) > 1:
                aList[x] = float(aList[x]) # formatting 'Total Market Value' column list
                
    return aList # return the new formatted column in our list variable

for y in transferDataDict:
    a = list(transferDataDict[y]['Expenditures (mil. euros)']) # creating new flexible lists from original data
    b = list(transferDataDict[y]['Income (mil. euros)']) # creating new flexible lists from original data
    c = list(transferDataDict[y]['Balance (mil. euros)']) # creating new flexible lists from original data
    transferDataDict[y]['Expenditures (mil. euros)'] = listerTwisterTwo(a) # assigning new lists into the actual columns
    transferDataDict[y]['Income (mil. euros)'] = listerTwisterTwo(b) # assigning new lists into the actual columns
    transferDataDict[y]['Balance (mil. euros)'] = listerTwisterTwo(c) # assigning new lists into the actual columns
    
#print('All Done') # debug line to certify code complete

Utilising the Data

Now that our main portion of data is collected, stored and concurrently formatted; it's time to use it!
From here, we can employ these football results to discover a host of insights into the workings of the sport.
As it stands however, our fixture DataFrames aren't incredibly useful just on their own.
In this next cell, we're going to create final league standings for each season using all of our current data.



In [9]:

    
# All of the fixture data has been stored by Year/Division in identically formatted DataFrames, within a HUGE dictionary
# Here, we are going to define a function that takes a given Year/Division and works out the final league standings
# The final league table will then be returned as a new DataFrame

# TODO: add the ability to produce the full table IF full = True

# sea = season ('1969-70') , div = division ('div3') , full = boolean (True or False)
# the 'full' bool controls whether extra columns are added that split goals and points into home and away performance
def createFinalTable (sea, div, full):
    
    teamList = [] # creating an empty list that will hold one copy of each team within the league
    zeroList = [] # creating empty list that will contain as many zeroes as there are teams
    ref = sea[0:4] + ' ' + div # creating a reference based on function inputs
        
    for y in range(len(fixtureDict[ref].index)): # for loop to cycle through each row and build Team List
        if fixtureDict[ref]['Team 1'][y] not in teamList: # checking if team is already in list
            teamList.append(fixtureDict[ref]['Team 1'][y]) # if not, add it
            zeroList.append(0)
            
    # We're going to create 8 Dicts, one for games played, one for number of wins, one for number of draws... 
    # one for number of losses, one for goals for, one for goals against, one for goal difference...
    # and one for Points
    # The keys will be the team names, and the values will start at zero
    playedDict = dict(zip(teamList, zeroList))
    winDict = dict(zip(teamList, zeroList))
    drawDict = dict(zip(teamList, zeroList))
    lossDict = dict(zip(teamList, zeroList))
    forGoalsDict = dict(zip(teamList, zeroList))
    badGoalsDict = dict(zip(teamList, zeroList))
    diffGoalsDict = dict(zip(teamList, zeroList))
    pointsDict = dict(zip(teamList, zeroList))
    
    for y in range(len(fixtureDict[ref].index)): # for loop to cycle through each row and build up the dictionaries
        
        team1 = fixtureDict[ref]['Team 1'][y] # making it easier to reference current Team 1
        team2 = fixtureDict[ref]['Team 2'][y] # making it easier to reference current Team 2
        goals1 = fixtureDict[ref]['FT T1'][y] # making it easier to reference Team 1 score
        goals2 = fixtureDict[ref]['FT T2'][y] # making it easier to reference Team 2 score
        
        playedDict[team1] += 1 # notch up another game played for both teams
        playedDict[team2] += 1 # notch up another game played for both teams
        forGoalsDict[team1] += goals1 # add goals scored by Team 1 to Team 1's 'goals for'
        badGoalsDict[team2] += goals1 # add goals scored by Team 1 to Team 2's 'goals against'
        forGoalsDict[team2] += goals2 # add goals scored by Team 2 to Team 2's 'goals for'
        badGoalsDict[team1] += goals2 # add goals scored by Team 2 to Team 1's 'goals against'
        diffGoalsDict[team1] =  forGoalsDict[team1] - badGoalsDict[team1] # defining goal difference each time (for - against)
        diffGoalsDict[team2] =  forGoalsDict[team2] - badGoalsDict[team2] # defining goal difference each time (for - against)
        
        if goals1 > goals2 : # if Team 1 has more goals than Team 2...            
            # then that's a win for Team 1, and a loss for Team 2
            winDict[team1] += 1 # increasing Team 1's win count
            lossDict[team2] += 1 # increasing Team 2's loss count
            pointsDict[team1] += 3 # give Team 1 three points for the win
        elif goals1 == goals2 : # if Team 1 has equal goals as Team 2...
            # then that's a draw for Team 1, and a draw for Team 2
            drawDict[team1] += 1 # increasing Team 1's draw count
            drawDict[team2] += 1 # increasing Team 2's draw count
            pointsDict[team1] += 1 # give Team 1 one point for the draw
            pointsDict[team2] += 1 # give Team 2 one point for the draw
        elif goals1 < goals2 : # if Team 1 has less goals than Team 2...
            # then that's a loss for Team 1, and a win for Team 2
            lossDict[team1] += 1 # increasing Team 1's loss count
            winDict[team2] += 1 # increasing Team 2's win count
            pointsDict[team2] += 3 # give Team 2 three points for the win
        elif str(goals1) == 'nan':
            print(ref)
            print('Oops! Something messed up...') # this shouldn't happen but it's an error catcher
    
    # Now that we have dictionaries for every row necessary in the final league table, let's construct the DataFrame
    columnList = ('Games Played','Wins','Draws','Losses','Goals For','Goals Against','Goal Difference','Points')
    tableList =(playedDict,winDict,drawDict,lossDict,forGoalsDict,badGoalsDict,diffGoalsDict,pointsDict)
    tableDict = dict(zip(columnList, tableList))
    leagueTable = pd.DataFrame(tableDict)
    leagueTable['Team'] = leagueTable.index # change the Team from being index to its own column
    leagueTable = leagueTable[['Team','Games Played','Wins','Draws','Losses','Goals For',
                               'Goals Against','Goal Difference','Points']] # arrange columns in correct order
    leagueTable.sort_values(by='Points', ascending=False, inplace=True) # sort by points
    newIndex = []
    for z in range(len(leagueTable.index)): # only after sorting do we assign numbered index to show place in league table
        newIndex.append(z+1) # +1 so we start at 1 instead of 0
    leagueTable[''] = newIndex
    leagueTable.set_index('', inplace=True) # setting new index in place
    del leagueTable.index.name # delete the name
    return leagueTable # return our completed dataFrame



In [10]:

    
# In this cell, we'll run through every fixture list in fixtureDict and create a league table for each
# These will all be stored in a Dictionary and accessed by the same pattern of key (e.g - 1954 div1)

seasonDict = {} # creating the empty dictionary that will hold every single league table
#seasonFullDict = {} # for the full versions of the tables

# 'for' loop to cycle through each table of results and run it through the function in the cell above
for x in (fixtureDict):
    season = x[0:4]
    division = x[5:]
    seasonDict[season + ' ' + division] = createFinalTable(season, division, False)
    
#print('All Done') # debug line to certify code complete

Example league table, 1936 Division 2



In [11]:

    
seasonDict['1936 div2']









    Out[11]:






  
    
      
      Team
      Games Played
      Wins
      Draws
      Losses
      Goals For
      Goals Against
      Goal Difference
      Points
    
  
  
    
      1
      Leicester City
      42
      24
      8
      10
      89
      57
      32
      80
    
    
      2
      Blackpool
      42
      24
      7
      11
      88
      53
      35
      79
    
    
      3
      Bury
      42
      22
      8
      12
      74
      55
      19
      74
    
    
      4
      Newcastle United
      42
      22
      5
      15
      80
      56
      24
      71
    
    
      5
      West Ham United
      42
      19
      11
      12
      73
      55
      18
      68
    
    
      6
      Plymouth Argyle
      42
      18
      13
      11
      71
      53
      18
      67
    
    
      7
      Sheffield United
      42
      18
      10
      14
      66
      54
      12
      64
    
    
      8
      Coventry City
      42
      17
      11
      14
      66
      54
      12
      62
    
    
      9
      Tottenham Hotspur
      42
      17
      9
      16
      88
      66
      22
      60
    
    
      10
      Aston Villa
      42
      16
      12
      14
      82
      70
      12
      60
    
    
      11
      Burnley
      42
      16
      10
      16
      57
      61
      -4
      58
    
    
      12
      Blackburn Rovers
      42
      16
      10
      16
      70
      62
      8
      58
    
    
      13
      Fulham
      42
      15
      13
      14
      71
      61
      10
      58
    
    
      14
      Barnsley
      42
      16
      9
      17
      50
      64
      -14
      57
    
    
      15
      Chesterfield
      42
      16
      8
      18
      84
      89
      -5
      56
    
    
      16
      Swansea City
      42
      15
      7
      20
      50
      65
      -15
      52
    
    
      17
      Norwich City
      42
      14
      8
      20
      63
      71
      -8
      50
    
    
      18
      Nottingham Forest
      42
      12
      10
      20
      68
      90
      -22
      46
    
    
      19
      Bradford Park Avenue
      42
      12
      9
      21
      52
      88
      -36
      45
    
    
      20
      Southampton
      42
      11
      12
      19
      53
      77
      -24
      45
    
    
      21
      Bradford City
      42
      9
      12
      21
      54
      94
      -40
      39
    
    
      22
      Doncaster Rovers
      42
      7
      10
      25
      30
      84
      -54
      31

Example league table, 1994 Division 4



In [12]:

    
seasonDict['1994 div4']









    Out[12]:






  
    
      
      Team
      Games Played
      Wins
      Draws
      Losses
      Goals For
      Goals Against
      Goal Difference
      Points
    
  
  
    
      1
      Carlisle
      42
      27
      10
      5
      68
      31
      37
      91
    
    
      2
      Walsall
      42
      25
      10
      7
      77
      40
      37
      85
    
    
      3
      Chesterfield
      42
      23
      12
      7
      62
      37
      25
      81
    
    
      4
      Bury
      42
      23
      10
      9
      73
      38
      35
      79
    
    
      5
      Preston
      42
      19
      10
      13
      58
      41
      17
      67
    
    
      6
      Mansfield
      42
      18
      11
      13
      84
      59
      25
      65
    
    
      7
      Fulham
      42
      16
      14
      12
      60
      54
      6
      62
    
    
      8
      Scunthorpe
      42
      18
      8
      16
      68
      63
      5
      62
    
    
      9
      Doncaster
      42
      17
      10
      15
      58
      43
      15
      61
    
    
      10
      Colchester
      42
      16
      10
      16
      56
      64
      -8
      58
    
    
      11
      Barnet
      42
      15
      11
      16
      56
      63
      -7
      56
    
    
      12
      Lincoln
      42
      15
      11
      16
      54
      56
      -2
      56
    
    
      13
      Torquay
      42
      14
      13
      15
      54
      57
      -3
      55
    
    
      14
      Wigan
      42
      14
      10
      18
      53
      60
      -7
      52
    
    
      15
      Rochdale
      42
      12
      14
      16
      44
      67
      -23
      50
    
    
      16
      Hereford
      42
      12
      13
      17
      45
      62
      -17
      49
    
    
      17
      Northampton
      42
      10
      14
      18
      45
      67
      -22
      44
    
    
      18
      Hartlepool
      42
      11
      10
      21
      43
      69
      -26
      43
    
    
      19
      Gillingham
      42
      10
      11
      21
      46
      64
      -18
      41
    
    
      20
      Darlington
      42
      11
      8
      23
      43
      57
      -14
      41
    
    
      21
      Scarborough
      42
      8
      10
      24
      49
      70
      -21
      34
    
    
      22
      Exeter
      42
      8
      10
      24
      36
      70
      -34
      34

The All-Time Greats

Here we'll graphically represent, in the form of pie charts, the 'best' teams since the leagues began.
We'll do this with a simple Formula 1 (think Mario Kart) inspired scoring system.
1st: 25 points, 2nd: 18, 3rd: 15, 4th: 12, 5th: 10, 6th: 8, 7th: 6, 8th: 4, 9th: 2, 10th: 1, 11th+: 0



In [13]:

    
# This cell is going to run through all of the leagues and build dictionaries containing tallies for every team involved.
# These can then be utilised to build charts displaying how teams have performed since 1880.

scoringDict = {1:25,2:18,3:15,4:12,5:10,6:8,7:6,8:4,9:2,10:1} # creating a dictionary for the scoring
div1TallyDict = {}
div2TallyDict = {}
div3TallyDict = {}
div4TallyDict = {} # empty dictionaries for team:score

for x in seasonDict: # cycle through league tables
    if x[-4:] == 'div1': # dealing with Premiership
        for y in seasonDict[x].index:
            if y < 11: # only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div1TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div1TallyDict has no value, so it throws an error
                        div1TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
    if x[-4:] == 'div2': # dealing with Championship
        for y in seasonDict[x].index:
            if y < 11: # only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div2TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div2TallyDict has no value, so it throws an error
                        div2TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
    if x[-4:] == 'div3': # dealing with 3rd Division
        for y in seasonDict[x].index:
            if y < 11: #only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div3TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div3TallyDict has no value, so it throws an error
                        div3TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score
    if x[-4:] == 'div4': # dealing with 4th Division
        for y in seasonDict[x].index:
            if y < 11: #only go 1st to 10th, 11th and higher yields no score
                    try: # Use scoringDict to get new score value and add it to current overall score for the team
                        div4TallyDict[seasonDict[x]['Team'][y]] += scoringDict[y]
                    except KeyError: # Initially div4TallyDict has no value, so it throws an error
                        div4TallyDict[seasonDict[x]['Team'][y]] = scoringDict[y] # In that case, set equal to the first score

# defining a function that can run through a tallyDict and fix for the different team names used
# e.g "Man United" and "Manchester United", "Birmingham" and "Birmingham City"
def fixTallyDict(tallyDict):
    lista = [] # empty list to store team names
    for x in sorted(tallyDict):
        lista.append(x) # build up list with team names in alphabetical order
    for x in range(len(lista)):
        y=x+1 # using y to distinguish entry AFTER x
        if y < len(lista):
            if lista[x] == "Leeds": # SPECIAL CASE: Leeds
                if lista[y] == 'Leeds United':
                    tallyDict['Leeds'] += tallyDict['Leeds United']
                    del tallyDict['Leeds United']
                elif lista[y] == 'Leeds City':
                    tallyDict['Leeds'] += tallyDict['Leeds City']
                    tallyDict['Leeds'] += tallyDict['Leeds United']
                    del tallyDict['Leeds City']
                    del tallyDict['Leeds United']
            elif lista[x] in lista[y]: # Main generic checker
                tallyDict[lista[x]] += tallyDict[lista[y]] 
                del tallyDict[lista[y]]
            elif lista[x] == "Man City": # SPECIAL CASE: Man City
                if lista[y] == 'Manchester City':
                    tallyDict['Man City'] += tallyDict['Manchester City']
                    del tallyDict['Manchester City']
            elif lista[x] == "Man United": # SPECIAL CASE: Man Utd
                tallyDict['Man United'] += tallyDict['Manchester United']
                del tallyDict['Manchester United']
            elif lista[x] == "Nott'm Forest": # SPECIAL CASE: Nottingham Forest
                if lista[y] == 'Nottingham Forest':
                    tallyDict["Nott'm Forest"] += tallyDict['Nottingham Forest']
                    del tallyDict['Nottingham Forest']
            elif lista[x] == "QPR": # SPECIAL CASE: QPR
                tallyDict["QPR"] += tallyDict['Queens Park Rangers']
                del tallyDict['Queens Park Rangers']
            elif lista[x] == "Sheffield Wednesday": # SPECIAL CASE: Sheff Weds
                tallyDict["Sheffield Wednesday"] += tallyDict['Sheffield Weds']
                del tallyDict['Sheffield Weds']
            elif lista[x] == "Wolverhampton Wanderers": # SPECIAL CASE: Wolves
                if lista[y] == 'Wolves':
                    tallyDict["Wolverhampton Wanderers"] += tallyDict['Wolves']
                    del tallyDict['Wolves']
    return(tallyDict) # return fixed Dictionary

div1TallyDict = fixTallyDict(div1TallyDict)
div2TallyDict = fixTallyDict(div2TallyDict)
div3TallyDict = fixTallyDict(div3TallyDict)
div4TallyDict = fixTallyDict(div4TallyDict) # updating tallyDicts to fix them all with new function

#print('All Done') # debug line to certify code complete



In [14]:

    
sns.set()

# Using the TallyDicts to construct 'cropped' DataFrames
minScore = 300 # Specify minScore (this is normalised for Div2 and adjusted for the others)
# Create DataFrames using keys as 'Team' and values as 'Greatness Score'
div1TallyFrame = pd.DataFrame({"Team": list(div1TallyDict.keys()), "'Greatness' Score": list(div1TallyDict.values())})
div1TallyFrame = div1TallyFrame[div1TallyFrame["'Greatness' Score"] > minScore*1.5] # Keep only the best teams
div2TallyFrame = pd.DataFrame({"Team": list(div2TallyDict.keys()), "'Greatness' Score": list(div2TallyDict.values())})
div2TallyFrame = div2TallyFrame[div2TallyFrame["'Greatness' Score"] > minScore] # Keep only the best teams
div3TallyFrame = pd.DataFrame({"Team": list(div3TallyDict.keys()), "'Greatness' Score": list(div3TallyDict.values())})
div3TallyFrame = div3TallyFrame[div3TallyFrame["'Greatness' Score"] > minScore*0.46] # Keep only the best teams
div4TallyFrame = pd.DataFrame({"Team": list(div4TallyDict.keys()), "'Greatness' Score": list(div4TallyDict.values())})
div4TallyFrame = div4TallyFrame[div4TallyFrame["'Greatness' Score"] > minScore*0.45] # Keep only the best teams

# Creating the figure to display Pie Charts of the all-time greatest teams
fig = {
    'data': [
        {'labels': div1TallyFrame['Team'],'values': div1TallyFrame["'Greatness' Score"], 'type': 'pie','name': 'Div 1',
            'domain': {'x': [.1, .33], # Domain is a fiddly variable
                       'y': [1, .34]},"hole": .35, 'hoverinfo':'label+value','textinfo':'label+value',},
        {'labels': div2TallyFrame['Team'],'values': div2TallyFrame["'Greatness' Score"],'type': 'pie','name': 'Div 2',
            'domain': {'x': [.9, .67], # Domain is a fiddly variable
                       'y': [1, .34]}, "hole": .35,'hoverinfo':'label+value', 'textinfo':'label+value',},
        {'labels': div3TallyFrame['Team'],'values': div3TallyFrame["'Greatness' Score"],'type': 'pie', 'name': 'Div 3',
            'domain': {'x': [.3, .15], # Domain is a fiddly variable
                       'y': [0, .1]},"hole": .45, 'hoverinfo':'label+value','textinfo':'label', },
        {'labels': div4TallyFrame['Team'], 'values': div4TallyFrame["'Greatness' Score"],'type': 'pie','name': 'Div 4',
            'domain': {'x': [.8, .67], # Domain is a fiddly variable
                       'y': [0, .1]},"hole": .5, 'hoverinfo':'label+value','textinfo':'label', }
    ],
    # setting annotations to denote which league each chart is
    'layout': {'title': "All-Time Team 'Greatness' Ranking by League",'showlegend': False, "annotations": [
            {"font": {"size": 20},"showarrow": False,"text": "1st","x": 0.195,"y": 0.72 },
            {"font": {"size": 20}, "showarrow": False,"text": "2nd","x": 0.81,"y": 0.72},
            {"font": { "size": 20},"showarrow": False,"text": "3rd","x": 0.2,"y": 0.01},
            { "font": {"size": 20}, "showarrow": False,"text": "4th","x": 0.755,"y": 0.01} 
        ]
    }
}
iplot(fig) # plot the figure

# TO DO: EDIT THE COLOURS OF THE PIE CHART TO CORRELATE WITH TEAM COLOURS (e.g Liverpool red, Chelsea blue)









    



/Users/sglyon/anaconda3/lib/python3.5/site-packages/matplotlib/__init__.py:878: UserWarning:

axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.

Analysing these pie charts above, we can see that Liverpool hold the all-time top spot for the highest division (currently the premiership) with Manchester United in a very close 2nd. The, perhaps surprising, teams to make it are Aston Villa and Sunderland, who have not been hugely successful recently but used to enjoy a great deal of success.
Moving over to Division 2 (or the secondary league, currently the Championship), the scores are a lot lower. This is most likely due to the fact that once you finish near the top (or bottom) of this league you move up or down to another league, and thus it's impossible to finish high in the 2nd Division season after season. With that said, it's clear that Birmingham and Leicester have seen the most success in the second tier, Leicester interestingly enough having won the 2nd division very recently and then shockingly won the Premier league just last season.
Below these, I've also included some smaller pies to show league 3 and 4.

Does transfer spending correlate with performance?

Using our organised data, we can now start to explore relationships and answer some specular questions.
Here, the relationship between transfer spending and league performance will be explored.
Hopefully, these answers will provide insights as yet untold into the economics of football.



In [15]:

    
CPI = pd.read_csv('http://claireaux.co.uk/project/cpiVALUES.csv') # reading CPI csv to get inflation values

# Function that allows adjusting old euro values for inflation
def adjustInflation(value,yearFrom,yearTo):
    indexTo = float(CPI[yearTo])
    indexFrom = float(CPI[yearFrom])
    change = (((indexTo - indexFrom) / indexFrom) + 1)
    newValue = value*change
    return(newValue)

# Function that builds a dataframe containing teams as index, league position for the given season...
# ...and various transfer related data
# This also includes some code that accounts for the different naming conventions between data sources
def posExpBuilder(season) :
    df = seasonDict[season]
    df2 = transferDataDict[season[0:4]] # creating dataframes for specified year to use
    leaguePositions = {} # creating empty dict for league position: team
    expenditures = {} # creating empty dict for expenditure: team
    incomes = {} # creating empty dict for income: team
    averageNewPlayerCosts = {} # creating empty dict for average cost of each new player: team
    averageSoldPlayerPrices = {} # creating empty dict for average price of each sold player: team
    balances = {} # creating empty dict for income - expenditures: team
    for x in range(len(df.index)): # for loop to cycle through first dataframe
        y=x+1
        leaguePositions[df.iloc[x]['Team']] = y # setting league positions dictionary values (e.g Arsenal: 2)
    for x in range(len(df2.index)): # cycling through second Dataframe
        # creating adjusted variables for inflation
        adjustedExpense = adjustInflation((df2.iloc[x]['Expenditures (mil. euros)']),season[0:4],'2016')
        adjustedIncome = adjustInflation((df2.iloc[x]['Income (mil. euros)']),season[0:4],'2016')
        playerCost = (df2.iloc[x]['Expenditures (mil. euros)']) / (df2.iloc[x]['Arrivals'])
        adjustedPlayerCost = adjustInflation(playerCost,season[0:4],'2016')
        playerPrice = (df2.iloc[x]['Income (mil. euros)']) / (df2.iloc[x]['Departures'])
        adjustedPlayerPrice = adjustInflation(playerPrice,season[0:4],'2016')
        adjustedBalance = adjustInflation((df2.iloc[x]['Balance (mil. euros)']),season[0:4],'2016')
        # appending dictionaries for calculated adjusted values
        expenditures[df2.index[x]] = adjustedExpense
        incomes[df2.index[x]] = adjustedIncome
        averageNewPlayerCosts[df2.index[x]] = adjustedPlayerCost
        averageSoldPlayerPrices[df2.index[x]] = adjustedPlayerPrice
        balances[df2.index[x]] = adjustedBalance
    
    # making a list of the new dictionaries
    working = (leaguePositions,expenditures,incomes,averageNewPlayerCosts,averageSoldPlayerPrices,balances)
    returnFrame = pd.DataFrame.from_records(working) # creating the final frame from this new list
    returnFrame = returnFrame.transpose() # transpose to look nicer
    
    names = [] # empty names list to be used below
    # for loop to cycle through our new final dataframe accounting for the different conventions from combining different data
    for x in range(len(returnFrame.index)): 
        y = x+1
        if y < len(returnFrame.index):
            if returnFrame.index[x] in returnFrame.index[y]:
                returnFrame[1][x] = returnFrame[1][y] # grabbing the column[1] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[2][x] = returnFrame[2][y] # grabbing the column[2] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[3][x] = returnFrame[3][y] # grabbing the column[3] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[4][x] = returnFrame[4][y] # grabbing the column[4] value for ('TEAM' FC) and putting it into ('TEAM')
                returnFrame[5][x] = returnFrame[5][y] # grabbing the column[5] value for ('TEAM' FC) and putting it into ('TEAM')
        if str(returnFrame[0][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[1][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[2][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[3][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[4][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
        elif str(returnFrame[5][x]) == 'nan':
            names.append(str(returnFrame.index[x])) # adding teams that now contain a NaN value to list for deletion
    
    returnFrame.drop(names,inplace=True) # deleting all those teams that were in the list
    returnFrame[0] = returnFrame[0].astype(int,raise_on_error=False) # change league standing column dataType to int
    returnFrame.rename(columns={0: 'Position', 1: 'Expenditure', 2: 'Income',
                               3: 'Average Cost of New Player', 4: 'Average Price of Sold Player',
                               5: 'Balance'}, inplace=True) # renaming columns appropriately
    return(returnFrame) # return the dataFrame



In [16]:

    
# Some general functions for graphing etc.
def roundTo3(x): # small function to round numbers to 3 decimal places
    return(float("{0:.3f}".format(x)))

# function that takes given x and y values and calculates pearsonr and a significance value for p
def textMaker(xi, y): 
    bond = (stats.pearsonr(xi,y))
    p = bond[1]
    if p < 0.01:
        p = '< 1%'
    elif p < 0.05:
        p = '< 5%'
    elif p < 0.1:
        p = '< 10%'
    else:
        p = 'not significant'
    return("Pearson's r: " + str(roundTo3(bond[0])) + ', p value:' + p)

# Generate linear line fit using regression
def fitLine(y):
    slope, intercept, r_value, p_value, std_err = stats.linregress(xi,y)
    return(slope*xi+intercept)

positionTransferDict1 = {}
positionTransferDict2 = {} # Creating empty dicts for use

# Top league (Premiership [div 1])
for x in range(1980,2016):
    season = str(x) + ' div1'
    positionTransferDict1[str(x)] = posExpBuilder(season) # running all data through function in previous cell
    
# Secondary league (Championship/league 1 [div 2])
for x in range(1980,2016):
    season = str(x) + ' div2'
    positionTransferDict2[str(x)] = posExpBuilder(season) # running all data through function in previous cell
    
positionTransferList1 = []
positionTransferList2 = []
resultList = [] # creating empty lists

for x in positionTransferDict1:
    positionTransferList1.append(positionTransferDict1[x])
for x in positionTransferDict2:
    positionTransferList2.append(positionTransferDict2[x]) # populating lists with dataframes

result = pd.concat(positionTransferList1)
result2 = pd.concat(positionTransferList2) # concatenating all dataframes in each list to make one big dataframe

result['div'] = 'div1'
result2['div'] = 'div2'
resultList = [result,result2]
result3 = pd.concat(resultList) # as yet unused 3rd dataframe that combines both div1 and div2









    



/Users/sglyon/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:32: RuntimeWarning:

invalid value encountered in double_scalars



In [17]:

    
xi = result['Position'] # using league position as x values
A = array([ xi, ones(22)]) # array for use in stats calcs
y1=result['Expenditure']
y2=result['Income']

# Creating the dataset, and generating the plot
trace1 = go.Scatter(x=xi, y=y1,  mode='markers',  marker=go.Marker(color='turquoise', opacity=0.5), name='Spending Data')
trace2 = go.Scatter(x=xi, y=fitLine(y1), mode='lines', marker=go.Marker(color='blue', size=100), name='Spending Fit')
trace3 = go.Scatter(x=xi, y=y2, mode='markers', marker=go.Marker(color='violet', opacity=0.5), name='Income Data')
trace4 = go.Scatter(x=xi, y=fitLine(y2), mode='lines', marker=go.Marker(color='red', size=100), name='Income Fit')

# Annotations containg important statistical stuff
pearson1 = go.Annotation(x=.3, y=.8, text = textMaker(xi, y1), showarrow=False, font=go.Font(size=12))
pearson2 = go.Annotation(x=.85, y=.8, text = textMaker(xi, y2), showarrow=False, font=go.Font(size=12))
annotationList = [pearson1,pearson2]

# Create fig with two subplots side by side
fig = tools.make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(
        'League performance v.s transfer spending', 'League performance v.s transfer income'))
# Update layout for annotations and titles
fig['layout'].update(annotations=annotationList,title='League performance v.s transfer spending' +
                     '          League performance v.s transfer income             .',)

# Add all the data traces to the plots
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

# Update axis layouts
fig['layout']['xaxis1'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
fig['layout']['yaxis1'].update(zerolinecolor='rgb(210,210,210)', gridcolor='rgb(210,210,210)', 
                               title="Million EUR (inflation adjusted)")
fig['layout']['xaxis2'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")

# Plot figure
iplot(fig)









    



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y1 ]

Now that we've visualised in pie charts the greatest teams, it's time to explore how a team can achieve such 'greatness'.
These two graphs explore the correlation between League Performance and Transfer Spending/Income.
For both cases, we see a highly significant (to the 1% level) negative correlation (i.e the less you spend/earn, the worse your final league standing).
Admittedly, the correlation is somewhat weak, especially in the transfer income case, but the regression lines show it definitely exists.
This result may seem like something obvious from pure common sense, but this data spanning 35 years proves it empirically.
It also proves that the correlation may not be quite as strong as one would imagine without the data.



In [18]:

    
xi = result['Position'] # using league position as x values
A = array([ xi, ones(22)]) # array for use in stats calcs
y1=result['Average Cost of New Player']
y2=result['Average Price of Sold Player']

# Creating the dataset, and generating the plot
trace1 = go.Scatter(x=xi, y=y1,  mode='markers',  marker=go.Marker(color='turquoise', opacity=0.5), name='Average Cost')
trace2 = go.Scatter(x=xi, y=fitLine(y1), mode='lines', marker=go.Marker(color='blue', size=100), name='Inbound Fit')
trace3 = go.Scatter(x=xi, y=y2, mode='markers', marker=go.Marker(color='violet', opacity=0.5), name='Average Price')
trace4 = go.Scatter(x=xi, y=fitLine(y2), mode='lines', marker=go.Marker(color='red', size=100), name='Outbound Fit')

# Annotations containg important statistical stuff
pearson1 = go.Annotation(x=.3, y=.8, text = textMaker(xi, y1), showarrow=False, font=go.Font(size=12))
pearson2 = go.Annotation(x=.85, y=.8, text = textMaker(xi, y2), showarrow=False, font=go.Font(size=12))
annotationList = [pearson1,pearson2]

# Create fig with two subplots side by side
fig = tools.make_subplots(rows=1, cols=2, shared_yaxes=True, subplot_titles=(
        'League performance v.s transfer spending', 'League performance v.s transfer income'))
# Update layout for annotations and titles
fig['layout'].update(annotations=annotationList,title='Performance v.s Average Spent per New Player' +
                     '      Performance v.s Average Price per Sold Player        .',)

# Add all the data traces to the plots
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

# Update axis layouts
fig['layout']['xaxis1'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")
fig['layout']['yaxis1'].update(zerolinecolor='rgb(210,210,210)', gridcolor='rgb(210,210,210)', 
                               title="Million EUR (inflation adjusted)")
fig['layout']['xaxis2'].update(zerolinecolor='rgb(255,255,255)', gridcolor='rgb(255,255,255)', title="Final League Position")

# Plot figure
iplot(fig)









    



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y1 ]

These two graphs are similar to the ones above, except we're comparing on an individual basis.
In each case, the y value depicts the average spent/earned from every incoming/outgoing player.
This is calculated simply by dividing the total expenditure by number of arrivals (or total income by number of departures).
Again, correlation is significant to the 1% level, and we see negative correlation across both.
At a value of -0.31, it is clear that purchasing high value players seems to give the highest payoff seen so far in terms of performance.
The average price received from sold players has only a very small negative correlation and thus I would not say it matters too much.

Conclusions

Based on the data studied, in order for teams to improve their performance, spending large amounts in concentration has been shown to provide the best outcome (i.e purchasing a few high value players > many low value players > spending less).
Ideally, I would have built around 50 figures for comparison of all the values stored within DataFrames, but tackling seaborn and plotly turned out to be more time-consuming than I had imagined and thus I don't believe I have done my level of data collection justice.
I plan to continue to work on this notebook in my spare time going forward, but for now this is my submitted and final project.

	Team	Games Played	Wins	Draws	Losses	Goals For	Goals Against	Goal Difference	Points
1	Leicester City	42	24	8	10	89	57	32	80
2	Blackpool	42	24	7	11	88	53	35	79
3	Bury	42	22	8	12	74	55	19	74
4	Newcastle United	42	22	5	15	80	56	24	71
5	West Ham United	42	19	11	12	73	55	18	68
6	Plymouth Argyle	42	18	13	11	71	53	18	67
7	Sheffield United	42	18	10	14	66	54	12	64
8	Coventry City	42	17	11	14	66	54	12	62
9	Tottenham Hotspur	42	17	9	16	88	66	22	60
10	Aston Villa	42	16	12	14	82	70	12	60
11	Burnley	42	16	10	16	57	61	-4	58
12	Blackburn Rovers	42	16	10	16	70	62	8	58
13	Fulham	42	15	13	14	71	61	10	58
14	Barnsley	42	16	9	17	50	64	-14	57
15	Chesterfield	42	16	8	18	84	89	-5	56
16	Swansea City	42	15	7	20	50	65	-15	52
17	Norwich City	42	14	8	20	63	71	-8	50
18	Nottingham Forest	42	12	10	20	68	90	-22	46
19	Bradford Park Avenue	42	12	9	21	52	88	-36	45
20	Southampton	42	11	12	19	53	77	-24	45
21	Bradford City	42	9	12	21	54	94	-40	39
22	Doncaster Rovers	42	7	10	25	30	84	-54	31

	Team	Games Played	Wins	Draws	Losses	Goals For	Goals Against	Goal Difference	Points
1	Carlisle	42	27	10	5	68	31	37	91
2	Walsall	42	25	10	7	77	40	37	85
3	Chesterfield	42	23	12	7	62	37	25	81
4	Bury	42	23	10	9	73	38	35	79
5	Preston	42	19	10	13	58	41	17	67
6	Mansfield	42	18	11	13	84	59	25	65
7	Fulham	42	16	14	12	60	54	6	62
8	Scunthorpe	42	18	8	16	68	63	5	62
9	Doncaster	42	17	10	15	58	43	15	61
10	Colchester	42	16	10	16	56	64	-8	58
11	Barnet	42	15	11	16	56	63	-7	56
12	Lincoln	42	15	11	16	54	56	-2	56
13	Torquay	42	14	13	15	54	57	-3	55
14	Wigan	42	14	10	18	53	60	-7	52
15	Rochdale	42	12	14	16	44	67	-23	50
16	Hereford	42	12	13	17	45	62	-17	49
17	Northampton	42	10	14	18	45	67	-22	44
18	Hartlepool	42	11	10	21	43	69	-26	43
19	Gillingham	42	10	11	21	46	64	-18	41
20	Darlington	42	11	8	23	43	57	-14	41
21	Scarborough	42	8	10	24	49	70	-21	34
22	Exeter	42	8	10	24	36	70	-34	34