LFC Data Analysis: The LFC Goal Machine

This notebook analyses Liverpool FC's goal scorers, in particular exploring a scatter plot of each player's age against top level league goals scored. The plot is available as an interactive web app called the LFC Goal Machine here.

The notebook was used originally to explore and merge the input data, generate the required additional data and prototype the application. The notebook contains the key algorithms, some interesting plots and describes how the lfcgm app was built and deployed.

The project uses IPython Notebook, Python, pandas, matplotlib, NumPy, ggplot, Spyre, Heroku and the Heroku SciPy buildpack.

Notebook Structure

Prepare to load the data by defining the location of the key input csv files and the location of the output csv files
Load the csv files and munge the data e.g. add player age at season midpoint; generate a merged dataframe with the key data
Analyse the data and define key algorithms e.g. to produce the lfcgm graph
Describe how the Spyre web application is built
Generate key additional data e.g. list of players for the application dropdown list
Describe how the app is deployed and run

Notebook Change Log



In [383]:

    
%%html
<!-- left align the change log table in next cell -->
<style>
table {float:left}
</style>

Date	Change Description
21st February 2016	Initial baseline
30th October 2016	Added season 2015-16
12th October 2017	Added season 2016-17

Set-up

Import the python modules needed for the analysis.



In [384]:

    
# __future__ imports must come first if this cell is ever run as a plain script
from __future__ import division
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import sys
import collections
import os
from ggplot import *
from ggplot import __version__ as ggplot_version
from datetime import datetime

# enable inline plotting
%matplotlib inline

Print version numbers.



In [289]:

    
print 'python version: {}'.format(sys.version)
print 'pandas version: {}'.format(pd.__version__)
print 'matplotlib version: {}'.format(mpl.__version__)
print 'numpy version: {}'.format(np.__version__)
print 'ggplot version: {}'.format(ggplot_version)









    



python version: 2.7.11 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
pandas version: 0.18.0
matplotlib version: 1.5.1
numpy version: 1.11.1
ggplot version: 0.6.8

Prepare To Load The Data

Define name and location of csv files



In [290]:

    
# define key seasons
SEASON_END = '2016-2017' # most recent season
SEASON_START = '1892-1893' # first season in existence
LG_SEASON_START = '1893-1894' # first league season (2nd Division)
PLAYERS_CSV_MONTH = 'October' # month of players csv extract
PLAYERS_CSV_YEAR = '2017' # year of players csv extract

# define input csv files

# define scorers CSV file name (and check exists)
SCORERS_PREFIX = 'lfc_scorers'
SCORERS_CSV_FILE = '{}_{}_{}.csv'.format(SCORERS_PREFIX, SEASON_START, SEASON_END)
LFC_SCORERS_CSV_FILE = os.path.relpath('data/{}'.format(SCORERS_CSV_FILE))
assert os.path.isfile(LFC_SCORERS_CSV_FILE) 
print 'LFC scorers csv file is: {}'.format(LFC_SCORERS_CSV_FILE) 

# define squads CSV file name (and check exists)
SQUADS_PREFIX = 'lfc_squads'
SQUADS_CSV_FILE = '{}_{}_{}.csv'.format(SQUADS_PREFIX, SEASON_START, SEASON_END)
LFC_SQUADS_CSV_FILE = os.path.relpath('data/{}'.format(SQUADS_CSV_FILE))
assert os.path.isfile(LFC_SQUADS_CSV_FILE) 
print 'LFC squads csv file is: {}'.format(LFC_SQUADS_CSV_FILE)

# define player appearances CSV file name (and check exists)
APPS_PREFIX = 'lfc_apps'
APPS_CSV_FILE = '{}_{}_{}.csv'.format(APPS_PREFIX, SEASON_START, SEASON_END)
LFC_APPS_CSV_FILE = os.path.relpath('data/{}'.format(APPS_CSV_FILE))
assert os.path.isfile(LFC_APPS_CSV_FILE) 
print 'LFC appearances csv file is: {}'.format(LFC_APPS_CSV_FILE)

# define league CSV file name (and check exists)
LEAGUE_PREFIX = 'lfc_league'
LEAGUE_CSV_FILE = '{}_{}_{}.csv'.format(LEAGUE_PREFIX, LG_SEASON_START, SEASON_END)
LFC_LEAGUE_CSV_FILE = os.path.relpath('data/{}'.format(LEAGUE_CSV_FILE))
assert os.path.isfile(LFC_LEAGUE_CSV_FILE) 
print 'LFC league csv file is: {}'.format(LFC_LEAGUE_CSV_FILE)
                                          
# define players CSV file name (and check exists)
PLAYERS_PREFIX = 'lfc_players'
PLAYERS_CSV_FILE_UPDATED = '{}_{}{}_upd.csv'.format(PLAYERS_PREFIX, PLAYERS_CSV_MONTH, PLAYERS_CSV_YEAR)
LFC_PLAYERS_CSV_FILE_UPDATED = os.path.relpath('data/{}'.format(PLAYERS_CSV_FILE_UPDATED))
assert os.path.isfile(LFC_PLAYERS_CSV_FILE_UPDATED)
print 'LFC players csv file is: {}'.format(LFC_PLAYERS_CSV_FILE_UPDATED)

# define generated csv files (this is the data used by the lfcgm app)

# define scorers in top league with position and age CSV file name
SCORERS_TL_POS_AGE_CSV_FILE = 'lfc_scorers_tl_pos_age.csv'
LFC_SCORERS_TL_POS_AGE_CSV_FILE = os.path.relpath('data/{}'.format(SCORERS_TL_POS_AGE_CSV_FILE))
print 'LFC scorers in top league with position and age is: {}'.format(LFC_SCORERS_TL_POS_AGE_CSV_FILE)

# define dropdown CSV file name (forward slash, consistent with the other paths)
LFCGM_DROPDOWN = os.path.relpath('data/lfcgm_app_dropdown.csv')
print 'LFC goal machine dropdown is: {}'.format(LFCGM_DROPDOWN)



LFC scorers csv file is: data\lfc_scorers_1892-1893_2016-2017.csv
LFC squads csv file is: data\lfc_squads_1892-1893_2016-2017.csv
LFC appearances csv file is: data\lfc_apps_1892-1893_2016-2017.csv
LFC league csv file is: data\lfc_league_1893-1894_2016-2017.csv
LFC players csv file is: data\lfc_players_October2017_upd.csv
LFC scorers in top league with position and age is: data\lfc_scorers_tl_pos_age.csv
LFC goal machine dropdown is: data\lfcgm_app_dropdown.csv
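The file-name pattern used in this cell (prefix, first season, last season) could be wrapped in a small helper. The sketch below is only an illustration: `data_csv_path` is a hypothetical name that does not appear in the notebook, and it uses `os.path.join` so the separator follows the platform.

```python
import os


def data_csv_path(prefix, season_start, season_end, data_dir='data'):
    """Build '<data_dir>/<prefix>_<start>_<end>.csv' (hypothetical helper)."""
    file_name = '{}_{}_{}.csv'.format(prefix, season_start, season_end)
    return os.path.join(data_dir, file_name)


# same prefix and seasons as the scorers file defined in the notebook
scorers_path = data_csv_path('lfc_scorers', '1892-1893', '2016-2017')
print(scorers_path)
```

The existence check (`assert os.path.isfile(...)`) is left out of the helper so it can also name files that have not been generated yet.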




Load the LFC data into dataframes and munge

Create a dataframe of scorers in top level league seasons

Data source: lfchistory.net



In [291]:

    
print 'Loading LFC scorers csv from {}'.format(LFC_SCORERS_CSV_FILE)
dflfc_scorers = pd.read_csv(LFC_SCORERS_CSV_FILE)

# sort by season, then league goals
dflfc_scorers = dflfc_scorers.sort_values(['season', 'league'], ascending=[False, False])
dflfc_scorers.shape









    



Loading LFC scorers csv from data\lfc_scorers_1892-1893_2016-2017.csv






    Out[291]:





(1429, 3)



In [292]:

    
dflfc_scorers.head()









    Out[292]:

         season             player  league
1416  2016-2017  Philippe Coutinho      13
1417  2016-2017         Sadio Mane      13
1418  2016-2017    Roberto Firmino      11
1419  2016-2017       Adam Lallana       8
1420  2016-2017       Divock Origi       7



In [293]:

    
dflfc_scorers.tail()









    Out[293]:

      season            player  league
5  1892-1893  Jonathan Cameron       4
6  1892-1893       Jim McBride       4
7  1892-1893      Hugh McQueen       3
8  1892-1893         Joe McQue       2
9  1892-1893         Own goals       1



In [294]:

    
# note that scorers includes own goals
dflfc_scorers[dflfc_scorers.player == 'Own goals'].head()

Filter out seasons in which LFC weren't in the top level



In [295]:

    
# note: war years already excluded in input files
LANCS_YRS = ['1892-1893']
SECOND_DIV_YRS = ['1893-1894', '1895-1896', '1904-1905', '1954-1955', 
                  '1955-1956', '1956-1957', '1957-1958', '1958-1959', 
                  '1959-1960', '1960-1961', '1961-1962']

NOT_TOP_LEVEL_YRS = LANCS_YRS + SECOND_DIV_YRS
dflfc_scorers_tl = dflfc_scorers[~dflfc_scorers.season.isin(NOT_TOP_LEVEL_YRS)].copy()
dflfc_scorers_tl.shape









    Out[295]:





(1281, 3)
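The filter above combines `isin()` with the `~` operator. A minimal sketch of the same idea on toy data (the goal counts here are illustrative only, not taken from the scorers file):

```python
import pandas as pd

# toy data; the goal counts are illustrative only
df = pd.DataFrame({'season': ['1892-1893', '1893-1894', '1894-1895', '2016-2017'],
                   'league': [4, 10, 4, 13]})
not_top_level = ['1892-1893', '1893-1894']

# isin() builds a boolean mask; ~ inverts it so only the other seasons are kept
df_top = df[~df.season.isin(not_top_level)].copy()
print(df_top.season.tolist())
```

Taking a `.copy()` of the filtered frame, as the notebook does, means later column assignments act on an independent dataframe rather than a view of the original.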



In [299]:

    
## check number of top level seasons aligns with http://www.lfchistory.net/Stats/LeagueOverall
## expect 102 total for top level seasons from 1894-95 to 2016-17
print 'the number of seasons is {}'.format(len(dflfc_scorers_tl.season.unique()))
assert len(dflfc_scorers_tl.season.unique()) == 102









    



the number of seasons is 102



In [300]:

    
# show most league goals in a season in top level
# cross-check with http://en.wikipedia.org/wiki/List_of_Liverpool_F.C._records_and_statistics#Goalscorers
# expect 101 in 2013-14
assert dflfc_scorers_tl[['season', 'league']].groupby(['season'])\
            .sum().sort_values('league', ascending=False).head(1).reset_index().values.tolist()[0] == ['2013-2014', 101]
dflfc_scorers_tl[['season', 'league']].groupby(['season']).sum().sort_values('league', ascending=False).head(1)



In [301]:

    
# remove OG
dflfc_scorers_tl = dflfc_scorers_tl[dflfc_scorers_tl.player != 'Own goals']
dflfc_scorers_tl.shape









    Out[301]:





(1210, 3)



In [302]:

    
# check 2016-17
dflfc_scorers_tl[dflfc_scorers_tl.season == '2016-2017'].head(10)









    Out[302]:

         season               player  league
1416  2016-2017    Philippe Coutinho      13
1417  2016-2017           Sadio Mane      13
1418  2016-2017      Roberto Firmino      11
1419  2016-2017         Adam Lallana       8
1420  2016-2017         Divock Origi       7
1421  2016-2017         James Milner       7
1422  2016-2017  Georginio Wijnaldum       6
1423  2016-2017             Emre Can       5
1424  2016-2017     Daniel Sturridge       3
1425  2016-2017         Dejan Lovren       2

Create dataframe of squads giving the position of each player

Data source: lfchistory.net



In [303]:

    
print 'Loading LFC squads csv from {}'.format(LFC_SQUADS_CSV_FILE)
dflfc_squads = pd.read_csv(LFC_SQUADS_CSV_FILE)
dflfc_squads.shape



Loading LFC squads csv from data\lfc_squads_1892-1893_2016-2017.csv






    Out[303]:





(2977, 3)



In [304]:

    
dflfc_squads.head()









    Out[304]:

      season          player    position
0  1892-1893     Sydney Ross  Goalkeeper
1  1892-1893    Billy McOwen  Goalkeeper
2  1892-1893     Jim McBride    Defender
3  1892-1893  John McCartney    Defender
4  1892-1893   Andrew Hannah    Defender



In [305]:

    
dflfc_squads.tail()









    Out[305]:

         season            player position
2972  2016-2017      Ben Woodburn  Striker
2973  2016-2017      Divock Origi  Striker
2974  2016-2017        Sadio Mane  Striker
2975  2016-2017  Daniel Sturridge  Striker
2976  2016-2017    Rhian Brewster  Striker

Create dataframe of LFC's league position

Data source: lfchistory.net



In [306]:

    
print 'Loading LFC league csv from {}'.format(LFC_LEAGUE_CSV_FILE)
dflfc_league = pd.read_csv(LFC_LEAGUE_CSV_FILE)
dflfc_league.shape



Loading LFC league csv from data\lfc_league_1893-1894_2016-2017.csv






    Out[306]:





(113, 18)



In [307]:

    
dflfc_league.head()









    Out[307]:

      Season        League  Pos  PLD  HW  HD  HL  HF  HA  AW  AD  AL  AF  AA  PTS   GF  GA   GD
0  1893-1894  2nd Division    1   28  14   0   0  46   6   8   6   0  31  12   50   77  18   59
1  1894-1895  1st Division   15   30   6   4   5  38  28   1   4  10  13  42   22   51  70  -19
2  1895-1896  2nd Division    1   30  14   1   0  65  11   8   1   6  41  21   46  106  32   74
3  1896-1897  1st Division    5   30   7   6   2  25  10   5   3   7  21  28   33   46  38    8
4  1897-1898  1st Division    9   30   7   4   4  27  16   4   2   9  21  29   28   48  45    3



In [308]:

    
dflfc_league.tail()









    Out[308]:

        Season          League  Pos  PLD  HW  HD  HL  HF  HA  AW  AD  AL  AF  AA  PTS   GF  GA  GD
108  2012-2013  Premier League    7   38   9   6   4  33  16   7   7   5  38  27   61   71  43  28
109  2013-2014  Premier League    2   38  16   1   2  53  18  10   5   4  48  32   84  101  50  51
110  2014-2015  Premier League    6   38  10   5   4  30  20   8   3   8  22  28   62   52  48   4
111  2015-2016  Premier League    8   38   8   8   3  33  22   8   4   7  30  28   60   63  50  13
112  2016-2017  Premier League    4   38  12   5   2  45  18  10   5   4  33  24   76   78  42  36

Create merged dataframe, combining scorers in top league level season with squad position



In [309]:

    
# inner join on the two shared columns, season and player
dflfc_scorers_tl_pos = dflfc_scorers_tl.merge(dflfc_squads, on=['season', 'player'])
dflfc_scorers_tl_pos.shape









    Out[309]:





(1210, 4)
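The merge is an inner join on the shared columns, so a scorer with no matching squad row would silently drop out; the unchanged row count (1210) suggests none did. One way to make that check explicit is a left join with `indicator=True`, sketched below. The two matched rows reuse values shown in this notebook; 'Unknown Player' is a hypothetical entry added only to trigger a mismatch.

```python
import pandas as pd

scorers = pd.DataFrame({'season': ['2016-2017', '2016-2017', '2016-2017'],
                        'player': ['Sadio Mane', 'Adam Lallana', 'Unknown Player'],
                        'league': [13, 8, 1]})
squads = pd.DataFrame({'season': ['2016-2017', '2016-2017'],
                       'player': ['Sadio Mane', 'Adam Lallana'],
                       'position': ['Striker', 'Midfielder']})

# a left join keeps every scorer; the _merge column records where each row matched
checked = scorers.merge(squads, on=['season', 'player'], how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched.player.tolist())
```

An empty `unmatched` frame would confirm that every scorer has a squad position before the inner join is relied on.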



In [310]:

    
dflfc_scorers_tl_pos.head()









    Out[310]:

      season             player  league    position
0  2016-2017  Philippe Coutinho      13  Midfielder
1  2016-2017         Sadio Mane      13     Striker
2  2016-2017    Roberto Firmino      11     Striker
3  2016-2017       Adam Lallana       8  Midfielder
4  2016-2017       Divock Origi       7     Striker



In [311]:

    
dflfc_scorers_tl_pos.tail()









    Out[311]:

         season          player  league    position
1205  1894-1895    Frank Becton       4     Striker
1206  1894-1895       Neil Kerr       3  Midfielder
1207  1894-1895    Hugh McQueen       2  Midfielder
1208  1894-1895       Joe McQue       1    Defender
1209  1894-1895  Patrick Gordon       1  Midfielder

Create a dataframe of players with birthdate and country of birth



In [312]:

    
print 'Loading LFC players csv from {}'.format(LFC_PLAYERS_CSV_FILE_UPDATED)
dflfc_players = pd.read_csv(LFC_PLAYERS_CSV_FILE_UPDATED, parse_dates=['birthdate'])
assert dflfc_players.birthdate.dtypes == 'datetime64[ns]'
dflfc_players.shape



Loading LFC players csv from data\lfc_players_October2017_upd.csv






    Out[312]:





(786, 3)



In [313]:

    
dflfc_players.head()









    Out[313]:

          player   birthdate   country
0    Gary Ablett  1965-11-19   England
1   Alan A'Court  1934-09-30   England
2   Charlie Adam  1985-12-10  Scotland
3   Daniel Agger  1984-12-12   Denmark
4  Andrew Aitken  1909-08-25   England



In [314]:

    
dflfc_players.tail()









    Out[314]:

              player   birthdate      country
781        Ron Yeats  1937-11-15     Scotland
782      Samed Yesil  1994-05-25      Germany
783    Tommy Younger  1930-04-10     Scotland
784      Bolo Zenden  1976-08-15  Netherlands
785  Christian Ziege  1972-02-01      Germany

Create merged dataframe of players, combining scorers in top league level season with squad position and age

Add each player's age to the dflfc_scorers_tl_pos dataframe



In [315]:

    
def age_at_season(row):
    """Return player's age at mid-point of season, assumed to be 1st Jan.
    
        row.player -> player's name
        row.season -> season, e.g. '2016-2017'
        
        uses dflfc_players to look-up birthdate, keyed on player
         - return average age if player is missing from dflfc_players
    """
    
    AVERAGE_AGE = 26.5
    
    # mid-point is 1st January of the year in which the season ends
    mid_point = '01 January {}'.format(row.season[-4:])
    try:
        dob = dflfc_players[dflfc_players.player == row.player].birthdate.values[0]
    except IndexError:
        # no matching row in dflfc_players, so use average age
        print 'error: age not found for player {} in season {}, using average age {}'.format(row.player, 
                                                                                             row.season, 
                                                                                             AVERAGE_AGE)
        return AVERAGE_AGE
    return round((pd.Timestamp(mid_point) - dob).days/365.0, 1)
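The age arithmetic can be checked in isolation with the standard library alone. This is a sketch rather than the notebook's function: `age_at_midpoint` is a hypothetical name, and Robbie Fowler's birthdate (9 April 1975) is an assumption supplied here, not read from the players file.

```python
from datetime import date


def age_at_midpoint(birthdate, season):
    """Age in years (1 decimal place) on 1st January of the season's second year."""
    mid_point = date(int(season[-4:]), 1, 1)
    return round((mid_point - birthdate).days / 365.0, 1)


# birthdate assumed, not taken from the notebook's data
fowler_age = age_at_midpoint(date(1975, 4, 9), '1994-1995')
print(fowler_age)  # 19.7
```

Dividing by 365.0 ignores leap days, which is the same simplification the notebook makes; at one decimal place the difference rarely shows.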



In [316]:

    
# add age column
# no errors expected
dflfc_scorers_tl_pos['age'] = dflfc_scorers_tl_pos.apply(age_at_season, axis=1)



In [317]:

    
dflfc_scorers_tl_pos_age = dflfc_scorers_tl_pos.copy()



In [318]:

    
dflfc_scorers_tl_pos_age.head()









    Out[318]:

      season             player  league    position   age
0  2016-2017  Philippe Coutinho      13  Midfielder  24.6
1  2016-2017         Sadio Mane      13     Striker  24.7
2  2016-2017    Roberto Firmino      11     Striker  25.3
3  2016-2017       Adam Lallana       8  Midfielder  28.7
4  2016-2017       Divock Origi       7     Striker  21.7

Save the new dataframe

This is the key dataframe used in the plot of age vs league goals.



In [319]:

    
dflfc_scorers_tl_pos_age.to_csv(LFC_SCORERS_TL_POS_AGE_CSV_FILE, header=True, sep=',')
assert os.path.isfile(LFC_SCORERS_TL_POS_AGE_CSV_FILE)
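Because `to_csv` writes the dataframe index as an unnamed first column by default, code that reads the file back probably wants `index_col=0`. The round trip below is a sketch using a temporary directory and a hypothetical file name; the single row reuses values shown above.

```python
import os
import tempfile

import pandas as pd

columns = ['season', 'player', 'league', 'position', 'age']
df = pd.DataFrame({'season': ['2016-2017'], 'player': ['Sadio Mane'],
                   'league': [13], 'position': ['Striker'], 'age': [24.7]},
                  columns=columns)

# hypothetical file name inside a throwaway directory
path = os.path.join(tempfile.mkdtemp(), 'scorers_demo.csv')
df.to_csv(path, header=True, sep=',')

# index_col=0 turns the unnamed first column back into the index
df_back = pd.read_csv(path, index_col=0)
print(list(df_back.columns))
```

Without `index_col=0` the reloaded frame would carry an extra `Unnamed: 0` column alongside the five data columns.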

Create dataframe of players' league appearances



In [320]:

    
# read the appearance file
print 'Loading LFC appearances csv from {}'.format(LFC_APPS_CSV_FILE)
dflfc_lgapps = pd.read_csv(LFC_APPS_CSV_FILE)
print dflfc_lgapps.shape
dflfc_lgapps.head()









    



Loading LFC appearances csv from data\lfc_apps_1892-1893_2016-2017.csv
(2608, 3)






    Out[320]:

      season          player  lgapp
0  1892-1893   Andrew Hannah     22
1  1892-1893   Duncan McLean     22
2  1892-1893      Tom Wyllie     22
3  1892-1893  Malcolm McVean     21
4  1892-1893     John Miller     21

Create merged dataframe of players, combining scorers in top league level season with squad position, age and appearances



In [321]:

    
dflfc_scorers_tl_pos_age_apps = dflfc_scorers_tl_pos_age.merge(dflfc_lgapps)
print dflfc_scorers_tl_pos_age_apps.shape
print dflfc_scorers_tl_pos.shape









    



(1210, 6)
(1210, 5)



In [322]:

    
dflfc_scorers_tl_pos_age_apps.head()









    Out[322]:

      season             player  league    position   age  lgapp
0  2016-2017  Philippe Coutinho      13  Midfielder  24.6     31
1  2016-2017         Sadio Mane      13     Striker  24.7     27
2  2016-2017    Roberto Firmino      11     Striker  25.3     35
3  2016-2017       Adam Lallana       8  Midfielder  28.7     31
4  2016-2017       Divock Origi       7     Striker  21.7     34



In [323]:

    
dflfc_scorers_tl_pos_age_apps.tail()









    Out[323]:

         season          player  league    position   age  lgapp
1205  1894-1895    Frank Becton       4     Striker  21.2      5
1206  1894-1895       Neil Kerr       3  Midfielder  23.7     12
1207  1894-1895    Hugh McQueen       2  Midfielder  27.3     12
1208  1894-1895       Joe McQue       1    Defender  21.8     29
1209  1894-1895  Patrick Gordon       1  Midfielder  24.9      5




Analyse the data

Ask a question and find the answer!

Create a function to plot player's age vs top level league goals



In [324]:

    
def ggplot_age_vs_lgoals(df, players):
    """Return ggplot of Age vs League Goals for given list of players in dataframe.

       Given the low number of points, ggplot's geom_smooth uses
       the loess method with default span."""
    TITLE = 'LFCGM Age vs League Goals'
    XLABEL = 'Age at Midpoint of Season'
    YLABEL = 'League Goals per Season'
    EXEMPLAR_PLAYERS = ['Ian Rush', 'Kenny Dalglish', 'Roger Hunt', 'David Johnson',
                        'Harry Chambers', 'John Toshack', 'John Barnes', 'Kevin Keegan']
    EXEMPLAR_TITLE = 'LFCGM Example Plot, The Champions: Age vs League Goals'

    # if players list is empty then set the default exemplar options
    if not players:
        players = EXEMPLAR_PLAYERS
        TITLE = EXEMPLAR_TITLE

    # filter dataframe for given players and plot
    this_df = df[df.player.isin(players)]
    this_plot = ggplot(this_df, aes(x='age', y='league', color='player', shape='player')) + \
                    geom_point() + \
                    geom_smooth(se=False) + \
                    xlab(XLABEL) + \
                    ylab(YLABEL) + \
                    scale_y_discrete(limits=(0, this_df.league.max() + 1)) + \
                    ggtitle(TITLE)
    return this_plot



In [325]:

    
# show the default plot, showing the champions (see below for more info on 'The Champions')
players = []
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)









    












    Out[325]:





<ggplot: (15438073)>



In [326]:

    
# show type of ggplot.draw() object
# Note that Spyre (v0.2) can only handle matplotlib object or pyplot figure
players = []
ggp = ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
type(ggp.draw())









    Out[326]:





matplotlib.figure.Figure
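Since the object handed on is a plain `matplotlib.figure.Figure`, a similar chart can be built without ggplot. The sketch below is a swapped-in alternative, not the notebook's method: it joins each player's points with straight segments instead of loess smoothing, `plot_age_vs_goals` is a hypothetical name, and the four data points are copied from the midfielder and defender tables in this notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

# (age, league goals) pairs copied from tables in this notebook
data = {'Billy Liddell': [(28.0, 17), (30.0, 19)],
        'Phil Neal': [(25.9, 7), (31.9, 8)]}


def plot_age_vs_goals(points_by_player, title='Age vs League Goals'):
    """Return a matplotlib Figure with one line per player (no smoothing)."""
    fig, ax = plt.subplots()
    for player, points in sorted(points_by_player.items()):
        ages, goals = zip(*sorted(points))
        ax.plot(ages, goals, marker='o', label=player)
    ax.set_xlabel('Age at Midpoint of Season')
    ax.set_ylabel('League Goals per Season')
    ax.set_title(title)
    ax.legend()
    return fig


fig = plot_age_vs_goals(data)
```

Returning the figure rather than calling `plt.show()` keeps the function usable from a web app as well as from a notebook.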

Show some interesting plots

Early Riser



In [327]:

    
# show all players scoring 20 or more league goals in a season when under 20 years old
dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.league >= 20) & 
                          (dflfc_scorers_tl_pos_age.age < 20)]









    Out[327]:

        season         player  league position   age
285  1994-1995  Robbie Fowler      25  Striker  19.7



In [328]:

    
# produce plot for the player known as 'God'
players = ['Robbie Fowler']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)









    












    Out[328]:





<ggplot: (15442324)>

Late Flourish



In [329]:

    
# show all players scoring 20 or more league goals in a season when over 30 years old
df_late = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.league >= 20) & 
                                   (dflfc_scorers_tl_pos_age.age > 30)]
df_late









    Out[329]:

         season          player  league position   age
360   1988-1989   John Aldridge      21  Striker  30.3
766   1946-1947     Jack Balmer      24  Striker  30.9
814   1934-1935  Gordon Hodgson      27  Striker  30.7
922   1925-1926    Dick Forshaw      27  Striker  30.4
1067  1908-1909      Ronald Orr      20  Striker  32.4



In [330]:

    
players = df_late.player.values
print sorted(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Dick Forshaw', 'Gordon Hodgson', 'Jack Balmer', 'John Aldridge', 'Ronald Orr']






    












    Out[330]:





<ggplot: (15265870)>

All time career top scorers



In [331]:

    
# show players scoring most league goals over their career
df_top = dflfc_scorers_tl_pos_age[['player', 'league']].groupby('player').sum()
df_top = df_top.sort_values('league', ascending=False).head(12)
df_top









    Out[331]:

                league
player
Gordon Hodgson     233
Ian Rush           229
Roger Hunt         167
Harry Chambers     135
Robbie Fowler      128
Steven Gerrard     120
Kenny Dalglish     118
Michael Owen       118
Dick Forshaw       116
Jack Parkinson     103
Sam Raybould       101
Jack Balmer         98



In [332]:

    
players = df_top[df_top.league >= 120].index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Gordon Hodgson' 'Ian Rush' 'Roger Hunt' 'Harry Chambers' 'Robbie Fowler'
 'Steven Gerrard']






    












    Out[332]:





<ggplot: (11773251)>

Elite 30



In [333]:

    
# show the highest league goal totals in a single season
df_elite = dflfc_scorers_tl_pos_age[['season', 'player', 'league']].sort_values('league', ascending=False)
df_elite.head(10)









    Out[333]:

         season          player  league
861   1930-1931  Gordon Hodgson      36
417   1983-1984        Ian Rush      32
1128  1902-1903    Sam Raybould      31
666   1963-1964      Roger Hunt      31
44    2013-2014     Luis Suarez      31
1057  1909-1910  Jack Parkinson      30
380   1986-1987        Ian Rush      30
887   1928-1929  Gordon Hodgson      30
639   1965-1966      Roger Hunt      29
275   1995-1996   Robbie Fowler      28



In [334]:

    
# note: the filter is strictly greater than 30, so the 30-goal seasons are excluded
players = df_elite[df_elite.league > 30].player.unique()
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Gordon Hodgson' 'Ian Rush' 'Sam Raybould' 'Roger Hunt' 'Luis Suarez']






    












    Out[334]:





<ggplot: (11779030)>

A Striking Trio

For a discussion of Liverpool's best ever trio see Terry's blog A Striking Trio



In [335]:

    
# show best total for a striking trio in the league
df_trio = dflfc_scorers_tl_pos_age[['season', 'league']].groupby('season').head(3).groupby('season').sum()
df_trio.sort_values('league', ascending=False).head(10)
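The trio total relies on the rows already being sorted by league goals within each season, so that `groupby('season').head(3)` picks the three highest scorers. A minimal sketch of that top-N-per-group pattern follows; the first three rows are the 1963-1964 values shown in this notebook, while 'Fourth Scorer' and its 6 goals are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'season': ['1963-1964'] * 4,
                   'player': ['Fourth Scorer', 'Roger Hunt', 'Alf Arrowsmith', 'Ian St John'],
                   'league': [6, 31, 15, 21]})

# sort first: head(3) takes the first three rows of each group as they appear
df = df.sort_values(['season', 'league'], ascending=[False, False])
top3 = df.groupby('season').head(3)
trio_totals = top3.groupby('season')['league'].sum()
print(trio_totals['1963-1964'])  # 31 + 21 + 15 = 67
```

If the input were not sorted, `head(3)` would still return three rows per season, just not necessarily the top three, so the explicit sort is worth keeping next to the aggregation.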



In [336]:

    
TOP_TRIO = ['1963-1964']
df_trio_players = dflfc_scorers_tl_pos_age[['season', 'player', 'league']]\
                                            [dflfc_scorers_tl_pos_age.season.isin(TOP_TRIO)].groupby('season').head(3)
df_trio_players









    Out[336]:

        season          player  league
666  1963-1964      Roger Hunt      31
667  1963-1964     Ian St John      21
668  1963-1964  Alf Arrowsmith      15



In [337]:

    
players = df_trio_players.player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    












    Out[337]:





<ggplot: (17379058)>

A Striking Duo



In [338]:

    
# show best total for a striking duo in the league
df_duo = dflfc_scorers_tl_pos_age[['season', 'league']].groupby('season').head(2).groupby('season').sum()
df_duo.sort_values('league', ascending=False).head(10)



In [339]:

    
TOP_DUO = ['1963-1964', '2013-2014']
df_duo_players = dflfc_scorers_tl_pos_age[['season', 'player', 'league']]\
                                            [dflfc_scorers_tl_pos_age.season.isin(TOP_DUO)].groupby('season').head(2)
df_duo_players









    Out[339]:

        season            player  league
44   2013-2014       Luis Suarez      31
45   2013-2014  Daniel Sturridge      21
666  1963-1964        Roger Hunt      31
667  1963-1964       Ian St John      21



In [340]:

    
# plot first of TOP_DUO seasons
players = df_duo_players[df_duo_players.season == TOP_DUO[0]].player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    












    Out[340]:





<ggplot: (14873920)>



In [341]:

    
# plot second of TOP_DUO seasons
players = df_duo_players[df_duo_players.season == TOP_DUO[1]].player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    












    Out[341]:





<ggplot: (11781759)>

Performance of Liverpool players who went on to be managers



In [342]:

    
# produce list of managers - ref: http://www.lfchistory.net/Managers/
MANAGERS = ['William Barclay', 'Tom Watson', 'David Ashworth', 'Matt McQueen', 'George Patterson',\
    'George Kay', 'Don Welsh', 'Phil Taylor', 'Bill Shankly', 'Bob Paisley', 'Joe Fagan',\
    'Kenny Dalglish', 'Graeme Souness', 'Roy Evans', 'Gerard Houllier',\
    'Rafael Benitez', 'Roy Hodgson', 'Kenny Dalglish', 'Brendan Rodgers', 'Jurgen Klopp']
# excludes Ronnie Moran who was temporary manager in 1991



In [343]:

    
# total top level league goals for players who went on to be managers
df_mgrs = dflfc_scorers_tl_pos_age[['player', 'league']][dflfc_scorers_tl_pos_age.player.isin(MANAGERS)]\
                                        .groupby('player').sum().sort_values('league', ascending=False)
df_mgrs









    Out[343]:

                league
player
Kenny Dalglish     118
Graeme Souness      38
Phil Taylor         32
Bob Paisley         10



In [344]:

    
players = df_mgrs.index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Kenny Dalglish' 'Graeme Souness' 'Phil Taylor' 'Bob Paisley']






    












    Out[344]:





<ggplot: (15308681)>

Top midfielders



In [345]:

    
# show midfielders who have scored more than 15 league goals in a season
df_mids = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Midfielder') &
                                    (dflfc_scorers_tl_pos_age.league > 15)].sort_values('league', ascending=False)
df_mids









    Out[345]:

        season          player  league    position   age
712  1951-1952   Billy Liddell      19  Midfielder  30.0
405  1984-1985       John Wark      18  Midfielder  27.4
430  1982-1983  Kenny Dalglish      18  Midfielder  31.9
735  1949-1950   Billy Liddell      17  Midfielder  28.0
851  1931-1932   Gordon Gunson      17  Midfielder  27.5
108  2008-2009  Steven Gerrard      16  Midfielder  28.6
339  1990-1991     John Barnes      16  Midfielder  27.2
888  1928-1929      Dick Edmed      16  Midfielder  24.9



In [346]:

    
players = df_mids.sort_values('league', ascending=False).player.unique()
print len(players), players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



7 ['Billy Liddell' 'John Wark' 'Kenny Dalglish' 'Gordon Gunson'
 'Steven Gerrard' 'John Barnes' 'Dick Edmed']






    












    Out[346]:





<ggplot: (17754705)>

Top Defenders



In [347]:

    
# show defenders who have scored more than 6 league goals in a season
df_defs = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Defender') &
                                    (dflfc_scorers_tl_pos_age.league > 6)].sort_values('league', ascending=False)
df_defs









    Out[347]:






  
    
      
        season           player  league  position   age
590  1969-1970     Chris Lawler      10  Defender  26.2
432  1982-1983        Phil Neal       8  Defender  31.9
5    2016-2017     James Milner       7  Defender  31.0
48   2013-2014    Martin Skrtel       7  Defender  29.1
200  2001-2002  John Arne Riise       7  Defender  21.3
504  1976-1977        Phil Neal       7  Defender  25.9



In [348]:

    
players = df_defs.sort_values('league', ascending=False).player.unique()
print len(players), players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



5 ['Chris Lawler' 'Phil Neal' 'James Milner' 'Martin Skrtel'
 'John Arne Riise']






    












    Out[348]:





<ggplot: (15410717)>

Peak Performance



In [349]:

    
# show player with top score in a season, Gordon Hodgson
top_player = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.league == max(dflfc_scorers_tl_pos_age.league)]
top_player









    Out[349]:






  
    
      
        season          player  league position   age
861  1930-1931  Gordon Hodgson      36  Striker  26.7



In [350]:

    
players = top_player.player.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Gordon Hodgson']






    












    Out[350]:





<ggplot: (11765147)>

Rocket Men

Show the players who scored 50 or more goals across 3 or more consecutive seasons, with the number of goals rising (or at least not falling) each season.



In [351]:

    
# create dataframe ordered by player and season
df = dflfc_scorers_tl_pos_age.groupby(['player', 'season']).sum()
df.head(12)









    Out[351]:






  
    
      
      
                           league   age
player          season
Abel Xavier     2001-2002       1  29.1
Abraham Hartley 1897-1898       1  25.9
Adam Lallana    2014-2015       5  26.7
                2015-2016       4  27.7
                2016-2017       8  28.7
Alan A'Court    1952-1953       2  18.3
                1953-1954       3  19.3
                1962-1963       2  28.3
Alan Arnell     1953-1954       1  20.1
Alan Hansen     1978-1979       1  23.6
                1979-1980       4  24.6
                1980-1981       1  25.6



In [352]:

    
def linefit(x, y):
    """Return gradient and intercept of straight line of best fit for given x and y arrays."""
    gradient, intercept = np.polyfit(x, y, 1)
    return gradient, intercept



In [353]:

    
# test linefit()
# using y = 2x^2 + 6
x=np.array([-1, 0, 1, 2])
print x
y=2*x*x + 6
print y
print plt.plot(x, y)
gradient, intercept = linefit(x, y)
print np.round(gradient, 1), np.round(intercept, 1)
print plt.plot(x, gradient*x + intercept)









    



[-1  0  1  2]
[ 8  6  8 14]
[<matplotlib.lines.Line2D object at 0x000000000E2D37B8>]
2.0 8.0
[<matplotlib.lines.Line2D object at 0x000000000B5B0DA0>]
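As a cross-check on linefit(), the least-squares gradient and intercept can also be computed from the closed-form formulas: gradient = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2) and intercept = mean_y - gradient * mean_x. The following is a minimal sketch in plain Python using the same test points as the cell above; the helper name linefit_closed_form is hypothetical and is not part of the notebook or the app.

```python
def linefit_closed_form(x, y):
    """Return (gradient, intercept) of the least-squares line through the points."""
    n = len(x)
    x_mean = sum(x) / float(n)
    y_mean = sum(y) / float(n)
    # gradient = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    gradient = sxy / sxx
    intercept = y_mean - gradient * x_mean
    return gradient, intercept


# same test points as the cell above: y = 2x^2 + 6
x = [-1, 0, 1, 2]
y = [2 * xi * xi + 6 for xi in x]
gradient, intercept = linefit_closed_form(x, y)
print('{} {}'.format(round(gradient, 1), round(intercept, 1)))  # 2.0 8.0
```

This agrees with the 2.0 and 8.0 returned by np.polyfit above.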



In [354]:

    
# Show the players who scored 50 or more goals across 3 or more consecutive seasons, with goals rising (or level) each season.
MIN_SEASONS = 3
MIN_TOTAL_GOALS = 50
p_prev = None # previous player
l_prev = None # previous league goals
Lg = [] # List of consecutive goals
La = [] # List of consecutive ages
Ls = [] # List of consecutive seasons

# iterate through dataframe 
# for each row of (player, season) (league goals, age)
for (p, s), (l, a) in df.iterrows():
    if p != p_prev:
        # new player, so check previous
        if len(Lg) >= MIN_SEASONS and sum(Lg) >= MIN_TOTAL_GOALS:
            grad, intercept = linefit(np.array(range(len(Lg))), np.array(Lg))
            print 'Rocket man: {}, goals={}, start_season={}, start_age={}, goals={}, grad={}'\
                                .format(p_prev, Lg, Ls[0], La[0], sum(Lg), np.round(grad, 2))
            
        #print 'new p', p
        l_prev = None
        Lg = []
        La = []
        Ls = []
        
    # print p, s, l, a #player, season, league, age
    #print l, l_prev, Lg
    if l_prev is None or l >= l_prev:  # first row for a player always starts a run
        #print '\t', l, 'greater than', l_prev, Lg
        Lg.append(l)
        La.append(a)
        Ls.append(s)
    else:
        if len(Lg) >= MIN_SEASONS and sum(Lg) >= MIN_TOTAL_GOALS:
            grad, intercept = linefit(np.array(range(len(Lg))), np.array(Lg))
            print 'Rocket man: {}, goals={}, start_season={}, start_age={}, goals={}, grad={}'\
                                .format(p_prev, Lg, Ls[0], La[0], sum(Lg), np.round(grad, 2))
        Lg = [l]
        La = [a]
        Ls = [s]
    
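    l_prev = l
    p_prev = p

# check the final player, who is never followed by a 'new player' row
if len(Lg) >= MIN_SEASONS and sum(Lg) >= MIN_TOTAL_GOALS:
    grad, intercept = linefit(np.array(range(len(Lg))), np.array(Lg))
    print 'Rocket man: {}, goals={}, start_season={}, start_age={}, goals={}, grad={}'\
                        .format(p_prev, Lg, Ls[0], La[0], sum(Lg), np.round(grad, 2))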









    



Rocket man: Berry Nieuwenhuys, goals=[9.0, 10.0, 10.0, 13.0, 13.0, 14.0], start_season=1933-1934, start_age=22.2, goals=69.0, grad=1.06
Rocket man: Dick Forshaw, goals=[7.0, 9.0, 17.0, 19.0], start_season=1919-1920, start_age=24.4, goals=52.0, grad=4.4
Rocket man: Dick Forshaw, goals=[5.0, 19.0, 27.0], start_season=1923-1924, start_age=28.4, goals=51.0, grad=11.0
Rocket man: Gordon Hodgson, goals=[4.0, 16.0, 23.0, 30.0], start_season=1925-1926, start_age=21.7, goals=73.0, grad=8.5
Rocket man: Gordon Hodgson, goals=[24.0, 24.0, 27.0], start_season=1932-1933, start_age=28.7, goals=75.0, grad=1.5
Rocket man: Ian Rush, goals=[17.0, 24.0, 32.0], start_season=1981-1982, start_age=20.2, goals=73.0, grad=7.5
Rocket man: Ian Rush, goals=[14.0, 22.0, 30.0], start_season=1984-1985, start_age=23.2, goals=66.0, grad=8.0
Rocket man: Luis Suarez, goals=[4.0, 11.0, 23.0, 31.0], start_season=2010-2011, start_age=24.0, goals=69.0, grad=9.3
Rocket man: Michael Owen, goals=[11.0, 16.0, 19.0, 19.0], start_season=1999-2000, start_age=20.1, goals=65.0, grad=2.7
Rocket man: Robbie Fowler, goals=[12.0, 25.0, 28.0], start_season=1993-1994, start_age=18.7, goals=65.0, grad=8.0
Rocket man: Terry McDermott, goals=[1.0, 1.0, 4.0, 8.0, 11.0, 13.0, 14.0], start_season=1975-1976, start_age=24.1, goals=52.0, grad=2.5
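The loop above compares adjacent rows of the grouped dataframe, so two rows for the same player count as "consecutive" even if the player did not score in the seasons between them. The sketch below shows one way the same search could be written as a function that also checks that the seasons really follow one another. It is an illustration under stated assumptions, not the notebook's method: the names start_year and rising_runs are hypothetical, the explicit season check is an addition, and the input is a small list of rows copied from tables in this notebook.

```python
def start_year(season):
    """Return the first year of a season string such as '1993-1994'."""
    return int(season.split('-')[0])


def rising_runs(rows, min_seasons=3, min_total=50):
    """Return (player, seasons, goals) for each qualifying run.

    rows is a list of (player, season, goals) tuples sorted by player, season.
    A run is extended only while the season follows directly on from the
    previous one and the goals do not fall.
    """
    runs = []
    current = []  # rows in the run being built

    def flush():
        goals = [g for _, _, g in current]
        if len(current) >= min_seasons and sum(goals) >= min_total:
            runs.append((current[0][0], [s for _, s, _ in current], goals))

    for player, season, goals in rows:
        if current:
            prev_player, prev_season, prev_goals = current[-1]
            extends = (player == prev_player
                       and start_year(season) == start_year(prev_season) + 1
                       and goals >= prev_goals)
            if not extends:
                flush()
                current = []
        current.append((player, season, goals))
    if current:
        flush()  # the final run is checked too
    return runs


# rows copied from tables in this notebook
rows = [
    ('Adam Lallana', '2014-2015', 5),
    ('Adam Lallana', '2015-2016', 4),
    ('Adam Lallana', '2016-2017', 8),
    ("Alan A'Court", '1952-1953', 2),
    ("Alan A'Court", '1953-1954', 3),
    ("Alan A'Court", '1962-1963', 2),
    ('Robbie Fowler', '1993-1994', 12),
    ('Robbie Fowler', '1994-1995', 25),
    ('Robbie Fowler', '1995-1996', 28),
]
runs = rising_runs(rows)
print(runs)
```

With these rows only Robbie Fowler's 1993-1994 to 1995-1996 run (12, 25, 28) qualifies, matching the corresponding line in the output above.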

Top 5 Rocket Men (sorted by gradient of line of best fit) are

Dick Forshaw, 11
Luis Suarez, 9.3
Gordon Hodgson, 8.5
Ian Rush, 8.0
Robbie Fowler, 8.0



In [355]:

    
# show example graph of the rocket portion of the player's career e.g. Robbie Fowler
p = 'Robbie Fowler'
Lg = [12.0, 25.0, 28.0]
# filter on the same dataframe throughout (previously mixed with dflfc_scorers_tl_pos)
dfp = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.player == p) &
                               (dflfc_scorers_tl_pos_age.league.isin(Lg))]
print dfp
print ggplot_age_vs_lgoals(dfp, [p])









    



        season         player  league position   age
275  1995-1996  Robbie Fowler      28  Striker  20.7
285  1994-1995  Robbie Fowler      25  Striker  19.7
295  1993-1994  Robbie Fowler      12  Striker  18.7






    












    



<ggplot: (17157843)>

Striking Nostalgia



In [356]:

    
# Just a few of my early favourites
players = ['Kevin Keegan', 'Kenny Dalglish', 'Steve Heighway']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)









    












    Out[356]:





<ggplot: (15506674)>

Highest scoring midfielders over career



In [357]:

    
df = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Midfielder')]\
                            .groupby('player').sum()
print df[df.league > 50].sort_values('league', ascending=False)['league']
players = df[df.league > 50].sort_values('league', ascending=False).index.unique()
print len(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



player
Steven Gerrard       120
Billy Liddell         96
Berry Nieuwenhuys     74
Arthur Goddard        65
Jack Cox              62
John Barnes           62
Terry McDermott       54
Name: league, dtype: int64
7






    












    Out[357]:





<ggplot: (14888687)>

Highest scoring defenders over career



In [358]:

    
df = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Defender')]\
                            .groupby('player').sum()
print df[df.league > 20].sort_values('league', ascending=False)['league']
players = df[df.league > 20].sort_values('league', ascending=False).index.unique()
print len(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



player
Chris Lawler        41
Phil Neal           41
Tommy Smith         36
Donald Mackinlay    28
Steve Nicol         23
Sami Hyypia         22
John Arne Riise     21
Name: league, dtype: int64
7






    












    Out[358]:





<ggplot: (11768254)>

The Champions



In [359]:

    
# create list of seasons when LFC were champions
CHAMPS = ['1900-1901', '1905-1906', '1921-1922', '1922-1923', '1946-1947', '1963-1964',\
          '1965-1966', '1972-1973', '1975-1976', '1976-1977', '1978-1979', '1979-1980',\
          '1981-1982', '1982-1983', '1983-1984', '1985-1986', '1987-1988', '1989-1990']
print len(CHAMPS)



In [360]:

    
# show total goals over career in title winning teams
df_champs = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season.isin(CHAMPS)][['league', 'player']].groupby('player').sum()\
                            .sort_values('league', ascending=False).head(12)
df_champs









    Out[360]:






  
    
      
                 league
player
Ian Rush            113
Kenny Dalglish       78
Roger Hunt           60
David Johnson        44
Harry Chambers       41
John Toshack         39
John Barnes          37
Kevin Keegan         37
Dick Forshaw         36
Terry McDermott      35
Ray Kennedy          34
Phil Neal            31
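The filter, group and sort steps behind this table can be illustrated without pandas. The sketch below is a plain-Python stand-in for the isin / groupby / sum / sort_values chain, using a handful of rows copied from tables in this notebook and only a subset of CHAMPS, so Ian Rush's total here is lower than his figure in the table.

```python
from collections import defaultdict

# title-winning seasons, a subset of CHAMPS above
champs = set(['1963-1964', '1965-1966', '1982-1983', '1983-1984', '1985-1986'])

# (season, player, league goals) rows copied from tables in this notebook
rows = [
    ('1983-1984', 'Ian Rush', 32),
    ('1986-1987', 'Ian Rush', 30),   # not a title season, so it is ignored
    ('1982-1983', 'Ian Rush', 24),
    ('1985-1986', 'Ian Rush', 22),
    ('1963-1964', 'Roger Hunt', 31),
    ('1965-1966', 'Roger Hunt', 29),
]

totals = defaultdict(int)
for season, player, goals in rows:
    if season in champs:          # same test as season.isin(CHAMPS)
        totals[player] += goals   # same effect as groupby('player').sum()

# highest total first, as sort_values('league', ascending=False) does
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranked)  # [('Ian Rush', 78), ('Roger Hunt', 60)]
```

Roger Hunt's 60 matches the table because both of his title-winning seasons are in the sample rows.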



In [361]:

    
# plot top 8
players = df_champs.index.values[:8]
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Ian Rush' 'Kenny Dalglish' 'Roger Hunt' 'David Johnson' 'Harry Chambers'
 'John Toshack' 'John Barnes' 'Kevin Keegan']






    












    Out[361]:





<ggplot: (15628124)>



In [362]:

    
# show highest scorers in a title winning season
dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season.isin(CHAMPS)].sort_values('league', ascending=False).head(12)









    Out[362]:






  
    
      
         season           player  league position   age
417   1983-1984         Ian Rush      32  Striker  22.2
666   1963-1964       Roger Hunt      31  Striker  25.5
639   1965-1966       Roger Hunt      29  Striker  27.5
370   1987-1988    John Aldridge      26  Striker  29.3
767   1946-1947  Albert Stubbins      24  Striker  27.5
766   1946-1947      Jack Balmer      24  Striker  30.9
1101  1905-1906       Joe Hewitt      24  Striker  24.7
429   1982-1983         Ian Rush      24  Striker  21.2
348   1989-1990      John Barnes      22  Striker  26.2
391   1985-1986         Ian Rush      22  Striker  24.2
952   1922-1923   Harry Chambers      22  Striker  26.1
667   1963-1964      Ian St John      21  Striker  25.6

European Cup Winning Team, May 1977



In [363]:

    
players = ['Ray Clemence', 'Phil Neal', 'Joey Jones', 'Tommy Smith',
           'Ray Kennedy', 'Emlyn Hughes', 'Kevin Keegan', 'Jimmy Case',
           'Steve Heighway', 'Ian Callaghan', 'Terry McDermott']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)









    












    Out[363]:





<ggplot: (17249652)>

Best goals per game



In [364]:

    
# calculate GPG for each player season
dflfc_scorers_tl_pos_age_apps['GPG'] = (dflfc_scorers_tl_pos_age_apps.league/dflfc_scorers_tl_pos_age_apps.lgapp).round(3)



In [365]:

    
dflfc_scorers_tl_pos_age_apps.head()









    Out[365]:






  
    
      
      season             player  league    position   age  lgapp    GPG
0  2016-2017  Philippe Coutinho      13  Midfielder  24.6     31  0.419
1  2016-2017         Sadio Mane      13     Striker  24.7     27  0.481
2  2016-2017    Roberto Firmino      11     Striker  25.3     35  0.314
3  2016-2017       Adam Lallana       8  Midfielder  28.7     31  0.258
4  2016-2017       Divock Origi       7     Striker  21.7     34  0.206



In [366]:

    
# show best GPG per season where appearance > 10
dflfc_scorers_tl_pos_age_apps[dflfc_scorers_tl_pos_age_apps.lgapp > 10].sort_values('GPG', ascending=False).head(15)









    Out[366]:






  
    
      
         season           player  league    position   age  lgapp    GPG
1057  1909-1910   Jack Parkinson      30     Striker  26.3     31  0.968
44    2013-2014      Luis Suarez      31     Striker  27.0     33  0.939
1128  1902-1903     Sam Raybould      31     Striker  27.6     33  0.939
861   1930-1931   Gordon Hodgson      36     Striker  26.7     40  0.900
922   1925-1926     Dick Forshaw      27     Striker  30.4     32  0.844
997   1914-1915      Fred Pagnam      24     Striker  23.3     29  0.828
97    2009-2010  Fernando Torres      18     Striker  25.8     22  0.818
814   1934-1935   Gordon Hodgson      27     Striker  30.7     34  0.794
887   1928-1929   Gordon Hodgson      30     Striker  24.7     38  0.789
639   1965-1966       Roger Hunt      29     Striker  27.5     37  0.784
417   1983-1984         Ian Rush      32     Striker  22.2     41  0.780
903   1927-1928    Willie Devlin      14     Striker  28.4     18  0.778
666   1963-1964       Roger Hunt      31     Striker  25.5     41  0.756
668   1963-1964   Alf Arrowsmith      15  Midfielder  21.1     20  0.750
275   1995-1996    Robbie Fowler      28     Striker  20.7     38  0.737



In [367]:

    
# show best Career GPG (CGPG) per career where appearance > 50
df_gpg = dflfc_scorers_tl_pos_age_apps[['player', 'league', 'lgapp']].groupby('player').sum()
df_gpg['CGPG'] = (df_gpg.league/df_gpg.lgapp).round(3) # career goals per game
df_gpg['CMPG'] = (df_gpg.lgapp*90/df_gpg.league).round(3) # career minutes per goal (assume all apps = 90 mins)
df_gpg[df_gpg.lgapp > 50].sort_values('CGPG', ascending=False).head(12)









    Out[367]:






  
    
      
                  league  lgapp   CGPG     CMPG
player
Gordon Hodgson       233    358  0.651  138.283
Fernando Torres       65    102  0.637  141.231
Luis Suarez           69    110  0.627  143.478
Jimmy Smith           38     61  0.623  144.474
John Aldridge         50     83  0.602  149.400
Tom Reid              30     51  0.588  153.000
Jack Parkinson       103    178  0.579  155.534
Roger Hunt           167    295  0.566  158.982
Sam Raybould         101    179  0.564  159.505
Michael Owen         118    216  0.546  164.746
Daniel Sturridge      46     89  0.517  174.130
Ian Rush             229    462  0.496  181.572
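The two career rates are simple ratios: CGPG is goals divided by appearances, and CMPG is appearances times 90 divided by goals (treating every appearance as a full 90 minutes, as the cell comment states). The sketch below reproduces the arithmetic for two rows of the table in plain Python; the function name career_rates is hypothetical.

```python
def career_rates(goals, apps, minutes_per_app=90):
    """Return (goals per game, minutes per goal), both rounded to 3 places."""
    cgpg = round(goals / float(apps), 3)
    cmpg = round(apps * minutes_per_app / float(goals), 3)
    return cgpg, cmpg


hodgson = career_rates(233, 358)  # Gordon Hodgson: 233 goals in 358 appearances
torres = career_rates(65, 102)    # Fernando Torres: 65 goals in 102 appearances
print(hodgson)  # (0.651, 138.283)
print(torres)   # (0.637, 141.231)
```

Because substitute appearances are counted as 90 minutes, CMPG is best read as an upper bound on the true minutes per goal.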



In [368]:

    
# plot top 6 goal scorers with best Career GPG
players = df_gpg[df_gpg.lgapp > 50].sort_values('CGPG', ascending=False).head(6).index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))









    



['Gordon Hodgson' 'Fernando Torres' 'Luis Suarez' 'Jimmy Smith'
 'John Aldridge' 'Tom Reid']






    












    Out[368]:





<ggplot: (17755680)>

Note on the variable number of games per season

Note that the number of league games has varied over the top level seasons.



In [369]:

    
# show number of different total games
print dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])].PLD.unique()









    



[30 34 38 42 40]



In [370]:

    
# show number of seasons for each total
dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])][['PLD', 'Season']].groupby('PLD').count()



In [371]:

    
# show the seasons for each total
dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])][['PLD', 'Season']]\
                        .groupby('PLD')['Season'].apply(lambda x: ','.join(x))









    Out[371]:





PLD
30                        1894-1895,1896-1897,1897-1898
34    1898-1899,1899-1900,1900-1901,1901-1902,1902-1...
38    1905-1906,1906-1907,1907-1908,1908-1909,1909-1...
40                                            1987-1988
42    1919-1920,1920-1921,1921-1922,1922-1923,1923-1...
Name: Season, dtype: object



In [372]:

    
# confirm that season 1939-1940 is not included in analysis
len(dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season == '1939-1940'])









    Out[372]:





0




Building The Spyre App

Useful reference material:

How to develop a Spyre app, including tutorials - https://github.com/adamhajari/spyre.

Spyre is a web app framework for providing a simple user interface for Python data projects. In simple terms the app involves:

creating a user interface to capture the list of players.
calling the ggplot_age_vs_lgoals() function to plot the players' graph.

See lfcgm_app.py in the lfcgm github repo for the app source code.

I prototyped using checkboxes for the Spyre app input but decided that multiple dropdowns were more elegant for my use case.

Note: ggplot does not plot single points correctly, so I decided to restrict the list of players to those who've scored in more than 1 season. For more information on the issue see this stackoverflow question.



In [373]:

    
# create dataframe of all players who have scored 1 or more goals in more than 1 season
df_pgoals = dflfc_scorers_tl_pos_age[['player', 'age']].groupby('player').count()
df_pgoals.columns = ['goal_tot']
print df_pgoals.head(10)

print '\nThere are {} goal scorers.'.format(len(df_pgoals))
print "That's {} data points.".format(df_pgoals.sum().values[0])
print 'Of these {} have scored more than once.'.format(df_pgoals[df_pgoals.goal_tot > 1].count().values[0])
print "That's {} data points.".format(df_pgoals[df_pgoals.goal_tot > 1].sum().values[0])

print '\nHere are the first of those scoring more than once...'
df_pgoals_gt1 = df_pgoals[df_pgoals.goal_tot > 1]
df_pgoals_gt1.head()









    



                 goal_tot
player                   
Abel Xavier             1
Abraham Hartley         1
Adam Lallana            3
Alan A'Court            3
Alan Arnell             1
Alan Hansen             5
Alan Kennedy            7
Alan Scott              1
Alan Waddle             1
Albert Pearson          1

There are 384 goal scorers.
That's 1210 data points.
Of these 240 have scored more than once.
That's 1066 data points.

Here are the first of those scoring more than once...






    Out[373]:






  
    
      
                 goal_tot
player
Adam Lallana            3
Alan A'Court            3
Alan Hansen             5
Alan Kennedy            7
Albert Stubbins         6
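Note that goal_tot counts the number of seasons in which a player scored, not the number of goals, because the dataframe has one row per player per scoring season. The sketch below shows the same count-and-filter idea in plain Python, using (player, season) pairs copied from the grouped table earlier in this notebook; the variable names are illustrative.

```python
from collections import Counter

# (player, season) pairs copied from the grouped table earlier in this notebook
rows = [
    ('Abel Xavier', '2001-2002'),
    ('Abraham Hartley', '1897-1898'),
    ('Adam Lallana', '2014-2015'),
    ('Adam Lallana', '2015-2016'),
    ('Adam Lallana', '2016-2017'),
    ("Alan A'Court", '1952-1953'),
    ("Alan A'Court", '1953-1954'),
    ("Alan A'Court", '1962-1963'),
    ('Alan Arnell', '1953-1954'),
]

# one row per scoring season, so counting rows counts seasons
seasons_scored = Counter(player for player, _ in rows)

# keep only players with more than one scoring season, as the dropdown does
dropdown_players = sorted(p for p, n in seasons_scored.items() if n > 1)
print(dropdown_players)  # ['Adam Lallana', "Alan A'Court"]
```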



In [374]:

    
# produce dataframe of players for the Spyre dropdown and save to csv
# this csv will be read by the app
player_dd = df_pgoals_gt1.index.values
player_dd

# create dropdown dataframe (with label and value) and save to csv
df_dropdown = pd.DataFrame(player_dd, player_dd).reset_index()
df_dropdown.columns = (['label', 'value'])
print df_dropdown.head()

# save dropdown label and value to csv
#LFCGM_DROPDOWN = os.path.relpath('data\lfcgm_app_dropdown.csv')
df_dropdown[['label', 'value']].to_csv(LFCGM_DROPDOWN, index=False)
assert os.path.isfile(LFCGM_DROPDOWN)









    



             label            value
0     Adam Lallana     Adam Lallana
1     Alan A'Court     Alan A'Court
2      Alan Hansen      Alan Hansen
3     Alan Kennedy     Alan Kennedy
4  Albert Stubbins  Albert Stubbins



In [375]:

    
# show players added in latest season
LATEST_SEASON = SEASON_END
scorers_for_latest_season_set = set(dflfc_scorers_tl[dflfc_scorers_tl.season == LATEST_SEASON].player.values)
this_season_scorers_df = dflfc_scorers_tl[dflfc_scorers_tl.player.isin(scorers_for_latest_season_set)]\
                                            [['player', 'season']].groupby('player').count()

players_added = this_season_scorers_df[this_season_scorers_df.season == 2].index.values
print 'there are {} new players added in {}; these are:'.format(len(players_added), LATEST_SEASON)
', '.join(players_added)









    



there are 3 new players added in 2016-2017; these are:






    Out[375]:





'Divock Origi, James Milner, Roberto Firmino'
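The test for "newly added" players relies on the dropdown rule: a player joins the dropdown in the season that becomes their second scoring season. So the cell keeps players who scored in the latest season and have exactly 2 scoring seasons in total. A minimal plain-Python sketch of that logic follows; the player names and rows are hypothetical and for illustration only.

```python
from collections import Counter

LATEST = '2016-2017'

# hypothetical (player, season) rows, for illustration only
rows = [
    ('Player A', '2015-2016'), ('Player A', '2016-2017'),  # 2nd scoring season
    ('Player B', '2014-2015'), ('Player B', '2015-2016'),
    ('Player B', '2016-2017'),                             # 3rd scoring season
    ('Player C', '2016-2017'),                             # 1st scoring season
    ('Player D', '2013-2014'), ('Player D', '2014-2015'),  # no goals in LATEST
]

# players who scored in the latest season
latest_scorers = set(p for p, s in rows if s == LATEST)

# number of scoring seasons for each of those players
counts = Counter(p for p, _ in rows if p in latest_scorers)

# exactly 2 scoring seasons means the player has just crossed the threshold
newly_added = sorted(p for p, n in counts.items() if n == 2)
print(newly_added)  # ['Player A']
```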



In [376]:

    
# read the dropdown csv and create dropdown_dict in the format that the Spyre app will use
dropdown_df = pd.read_csv(LFCGM_DROPDOWN)
print dropdown_df.head(), '\n'

# create dict in format Spyre app will use
dropdown_dict = dropdown_df.to_dict(orient='records')

# check length of dict (expect 240)
# 233 
# + 4 players added in 2015-16
# + 3 players added in 2016-17
print len(dropdown_dict)

# show the players added in the latest season
dropdown_df[dropdown_df.value.isin(players_added)]









    



             label            value
0     Adam Lallana     Adam Lallana
1     Alan A'Court     Alan A'Court
2      Alan Hansen      Alan Hansen
3     Alan Kennedy     Alan Kennedy
4  Albert Stubbins  Albert Stubbins 

240






    Out[376]:






  
    
      
               label            value
64      Divock Origi     Divock Origi
115     James Milner     James Milner
194  Roberto Firmino  Roberto Firmino
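For reference, to_dict(orient='records') yields a list of dicts, one per row, each with 'label' and 'value' keys. The sketch below builds the same shape by hand and places it in a dropdown input dict of the kind shown in the Spyre tutorials. It is an assumption-laden illustration: the key 'player1', the label 'Player 1' and the action id 'plot' are hypothetical and are not taken from lfcgm_app.py, and the input dict layout should be checked against the Spyre documentation.

```python
# first three players from the dropdown table above
players = ['Adam Lallana', "Alan A'Court", 'Alan Hansen']

# same shape as dropdown_df.to_dict(orient='records')
options = [{'label': name, 'value': name} for name in players]

# one dropdown input built from those options
# ('player1', 'Player 1' and 'plot' are hypothetical names)
dropdown_input = {
    'type': 'dropdown',
    'label': 'Player 1',
    'options': options,
    'key': 'player1',
    'action_id': 'plot',
}
print(len(dropdown_input['options']))  # 3
```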




A final bit of data analysis to help find interesting plots...



In [377]:

    
# create grouper function to iterate in chunks of n
# ref: http://stackoverflow.com/questions/8991506/iterate-an-iterator-by-chunks-of-n-in-python
import string
from itertools import izip_longest
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)



In [378]:

    
# test grouper
tmp = string.ascii_lowercase
for g in grouper(4, tmp, None):
    print g









    



('a', 'b', 'c', 'd')
('e', 'f', 'g', 'h')
('i', 'j', 'k', 'l')
('m', 'n', 'o', 'p')
('q', 'r', 's', 't')
('u', 'v', 'w', 'x')
('y', 'z', None, None)
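The grouper above depends on itertools.izip_longest, which exists only in Python 2; in Python 3 the same function is named zip_longest. A minimal sketch of a version that runs on both is shown below. The trick is unchanged: the list holds n references to one iterator, so each output tuple consumes n items in turn.

```python
try:
    from itertools import zip_longest                    # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest    # Python 2


def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    # n references to the same iterator, so each tuple takes the next n items
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


groups = list(grouper(3, 'ABCDEFG', 'x'))
print(groups)  # [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```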



In [379]:

    
# ggplot all players in group of 4
for players in grouper(4, player_dd, None):
    print players
    df = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.player.isin(players)][['age', 'league', 'player']]
    #print df
    print ggplot(df, aes(x='age', y='league', color='player')) + \
        geom_point() + \
        geom_smooth(se=False) + facet_grid('player')
    
    break # remove break to see all groups









    



('Adam Lallana', "Alan A'Court", 'Alan Hansen', 'Alan Kennedy')






    












    



<ggplot: (15605140)>

Use of ggplot

I was keen to use ggplot because I intend to build an LFC Goal Machine equivalent in R Studio Shiny. Spyre is meant to make it easy to build Shiny-like apps so this should be straightforward, right? Well, sort of.

I've used ggplot in R and it is great. Yhat's python-ggplot is very good but still a work in progress. I was able to get ggplot to produce the scatter plot and line of best fit, but it was frustrating that I couldn't get it to cope with single points, add annotations or plot only integer values (as half goals don't make much sense!). But ggplot was good enough.

Spyre App Deployment on Heroku

Reference Material:

How to deploy a Heroku python app - https://devcenter.heroku.com/categories/python.
How to deploy a Spyre app on Heroku - http://adamhajari.github.io/2015/04/21/deploying-a-spyre-app-on-heroku.html.

The biggest challenge was fitting the lfcgm app in Heroku's slug limit of 300 MB. lfcgm uses spyre, ggplot and pandas. These packages pull in other packages. Spyre uses numpy, pandas, cherrypy, jinja and matplotlib. Ggplot uses matplotlib, pandas, numpy, scipy, statsmodels and patsy. Pandas uses numpy, python-dateutil and pytz. This means that the size of the app and supporting packages is quite large. The biggest issue was scipy which needs to be built on Heroku.

I use the excellent anaconda for my python data analysis and development. So my first stab at building the Heroku app used Kenneth Reitz' Heroku mini-conda buildpack. Unfortunately the app slug size with this buildpack was much too big.

Plan B (after lots of trial and error) used Brandon Liu's scipy buildpack, which builds numpy and scipy. This allowed me to produce a compressed slug size of ~156 MB (phew).

The Heroku app's Procfile and requirements.txt are in the lfcgm github repo.

Running the App

The app is available at lfcgm.herokuapp.com.

It is also available at lfcgm.lfcsorted.com. See here for guidance on setting up a custom domain.

App Management

As an aside, Heroku has some great addons for managing the app e.g. papertrail for log management.

App Data

The LFC Goal Machine app uses the following data files:

data/lfcgm_app_dropdown.csv (used to build the dropdown list of players)
data/lfc_scorers_tl_pos_age.csv (used to build the pandas dataframe of LFC scorers in top level league)

The data structure of these files is described in this notebook. However the data is not in the lfcgm github repository because the data is owned by lfchistory.net.

App Startup Times

Please be patient if the lfcgm app takes several seconds to wake-up. The entry level 'heroku dyno' has some key limitations:

the dyno sleeps after a period of inactivity.
the dyno must sleep for 6 hours in each 24-hour period.

If the lfcgm app doesn't respond please try again later.

lfcgm github repo

For more information, including links to the R version of lfcgm, see the lfcgm github repo.



In [ ]:

	season	player	league
1416	2016-2017	Philippe Coutinho	13
1417	2016-2017	Sadio Mane	13
1418	2016-2017	Roberto Firmino	11
1419	2016-2017	Adam Lallana	8
1420	2016-2017	Divock Origi	7

	season	player	league
5	1892-1893	Jonathan Cameron	4
6	1892-1893	Jim McBride	4
7	1892-1893	Hugh McQueen	3
8	1892-1893	Joe McQue	2
9	1892-1893	Own goals	1

	season	player	league
1426	2016-2017	Own goals	1
1411	2015-2016	Own goals	1
1388	2014-2015	Own goals	4
1374	2013-2014	Own goals	5
1357	2012-2013	Own goals	4

	season	player	position
0	1892-1893	Sydney Ross	Goalkeeper
1	1892-1893	Billy McOwen	Goalkeeper
2	1892-1893	Jim McBride	Defender
3	1892-1893	John McCartney	Defender
4	1892-1893	Andrew Hannah	Defender

	season	player	position
2972	2016-2017	Ben Woodburn	Striker
2973	2016-2017	Divock Origi	Striker
2974	2016-2017	Sadio Mane	Striker
2975	2016-2017	Daniel Sturridge	Striker
2976	2016-2017	Rhian Brewster	Striker

	Season	League	Pos	PLD	HW	HD	HL	HF	HA	AW	AD	AL	AF	AA	PTS	GF	GA	GD
0	1893-1894	2nd Division	1	28	14	0	0	46	6	8	6	0	31	12	50	77	18	59
1	1894-1895	1st Division	15	30	6	4	5	38	28	1	4	10	13	42	22	51	70	-19
2	1895-1896	2nd Division	1	30	14	1	0	65	11	8	1	6	41	21	46	106	32	74
3	1896-1897	1st Division	5	30	7	6	2	25	10	5	3	7	21	28	33	46	38	8
4	1897-1898	1st Division	9	30	7	4	4	27	16	4	2	9	21	29	28	48	45	3

	Season	League	Pos	PLD	HW	HD	HL	HF	HA	AW	AD	AL	AF	AA	PTS	GF	GA	GD
108	2012-2013	Premier League	7	38	9	6	4	33	16	7	7	5	38	27	61	71	43	28
109	2013-2014	Premier League	2	38	16	1	2	53	18	10	5	4	48	32	84	101	50	51
110	2014-2015	Premier League	6	38	10	5	4	30	20	8	3	8	22	28	62	52	48	4
111	2015-2016	Premier League	8	38	8	8	3	33	22	8	4	7	30	28	60	63	50	13
112	2016-2017	Premier League	4	38	12	5	2	45	18	10	5	4	33	24	76	78	42	36

	season	player	league	position
0	2016-2017	Philippe Coutinho	13	Midfielder
1	2016-2017	Sadio Mane	13	Striker
2	2016-2017	Roberto Firmino	11	Striker
3	2016-2017	Adam Lallana	8	Midfielder
4	2016-2017	Divock Origi	7	Striker

	season	player	league	position
1205	1894-1895	Frank Becton	4	Striker
1206	1894-1895	Neil Kerr	3	Midfielder
1207	1894-1895	Hugh McQueen	2	Midfielder
1208	1894-1895	Joe McQue	1	Defender
1209	1894-1895	Patrick Gordon	1	Midfielder

	player	birthdate	country
0	Gary Ablett	1965-11-19	England
1	Alan A'Court	1934-09-30	England
2	Charlie Adam	1985-12-10	Scotland
3	Daniel Agger	1984-12-12	Denmark
4	Andrew Aitken	1909-08-25	England

	player	birthdate	country
781	Ron Yeats	1937-11-15	Scotland
782	Samed Yesil	1994-05-25	Germany
783	Tommy Younger	1930-04-10	Scotland
784	Bolo Zenden	1976-08-15	Netherlands
785	Christian Ziege	1972-02-01	Germany

	season	player	league	position	age
360	1988-1989	John Aldridge	21	Striker	30.3
766	1946-1947	Jack Balmer	24	Striker	30.9
814	1934-1935	Gordon Hodgson	27	Striker	30.7
922	1925-1926	Dick Forshaw	27	Striker	30.4
1067	1908-1909	Ronald Orr	20	Striker	32.4

	league
player
Gordon Hodgson	233
Ian Rush	229
Roger Hunt	167
Harry Chambers	135
Robbie Fowler	128
Steven Gerrard	120
Kenny Dalglish	118
Michael Owen	118
Dick Forshaw	116
Jack Parkinson	103
Sam Raybould	101
Jack Balmer	98

	season	player	league
861	1930-1931	Gordon Hodgson	36
417	1983-1984	Ian Rush	32
1128	1902-1903	Sam Raybould	31
666	1963-1964	Roger Hunt	31
44	2013-2014	Luis Suarez	31
1057	1909-1910	Jack Parkinson	30
380	1986-1987	Ian Rush	30
887	1928-1929	Gordon Hodgson	30
639	1965-1966	Roger Hunt	29
275	1995-1996	Robbie Fowler	28

	league
player
Kenny Dalglish	118
Graeme Souness	38
Phil Taylor	32
Bob Paisley	10

	season	player	league	position	age
712	1951-1952	Billy Liddell	19	Midfielder	30.0
405	1984-1985	John Wark	18	Midfielder	27.4
430	1982-1983	Kenny Dalglish	18	Midfielder	31.9
735	1949-1950	Billy Liddell	17	Midfielder	28.0
851	1931-1932	Gordon Gunson	17	Midfielder	27.5
108	2008-2009	Steven Gerrard	16	Midfielder	28.6
339	1990-1991	John Barnes	16	Midfielder	27.2
888	1928-1929	Dick Edmed	16	Midfielder	24.9

	season	player	league	position	age
590	1969-1970	Chris Lawler	10	Defender	26.2
432	1982-1983	Phil Neal	8	Defender	31.9
5	2016-2017	James Milner	7	Defender	31.0
48	2013-2014	Martin Skrtel	7	Defender	29.1
200	2001-2002	John Arne Riise	7	Defender	21.3
504	1976-1977	Phil Neal	7	Defender	25.9

player	season	league	age
Abel Xavier	2001-2002	1	29.1
Abraham Hartley	1897-1898	1	25.9
Adam Lallana	2014-2015	5	26.7
	2015-2016	4	27.7
	2016-2017	8	28.7
Alan A'Court	1952-1953	2	18.3
	1953-1954	3	19.3
	1962-1963	2	28.3
Alan Arnell	1953-1954	1	20.1
Alan Hansen	1978-1979	1	23.6
	1979-1980	4	24.6
	1980-1981	1	25.6
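
The age values above can be reproduced from the birthdates in the players dataframe. The sketch below is a minimal illustration, not the notebook's own code. It assumes the season midpoint is 1 January of the season's second year and that a year is 365.25 days long. The function name age_at_season_midpoint is hypothetical. Under those assumptions it matches the three ages shown for Alan A'Court (born 1934-09-30).

```python
from datetime import date

DAYS_PER_YEAR = 365.25  # assumption: average year length, allowing for leap years


def age_at_season_midpoint(birthdate, season):
    """Return age in years (1 decimal place) at the assumed season midpoint.

    season is a string such as '1952-1953'. The midpoint is assumed to be
    1 January of the second year; the notebook may define it differently.
    """
    second_year = int(season.split('-')[1])
    midpoint = date(second_year, 1, 1)
    return round((midpoint - birthdate).days / DAYS_PER_YEAR, 1)


# Alan A'Court, born 1934-09-30, in season 1952-1953
print(age_at_season_midpoint(date(1934, 9, 30), '1952-1953'))  # 18.3
```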

player	league	lgapp	CGPG	CMPG
Gordon Hodgson	233	358	0.651	138.283
Fernando Torres	65	102	0.637	141.231
Luis Suarez	69	110	0.627	143.478
Jimmy Smith	38	61	0.623	144.474
John Aldridge	50	83	0.602	149.400
Tom Reid	30	51	0.588	153.000
Jack Parkinson	103	178	0.579	155.534
Roger Hunt	167	295	0.566	158.982
Sam Raybould	101	179	0.564	159.505
Michael Owen	118	216	0.546	164.746
Daniel Sturridge	46	89	0.517	174.130
Ian Rush	229	462	0.496	181.572
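
The CGPG and CMPG columns are consistent with career goals per game (league / lgapp) and career minutes per goal, with each league appearance counted as a full 90 minutes. That reading is inferred from the figures in the table rather than stated in the output, so treat the sketch below as an assumption. The function name career_rates is hypothetical.

```python
MINUTES_PER_GAME = 90.0  # assumption: every appearance counts as a full 90 minutes


def career_rates(goals, appearances):
    """Return (goals per game, minutes per goal), each rounded to 3 places."""
    cgpg = goals / float(appearances)
    cmpg = MINUTES_PER_GAME * appearances / goals
    return round(cgpg, 3), round(cmpg, 3)


# Gordon Hodgson: 233 league goals in 358 league appearances
print(career_rates(233, 358))  # (0.651, 138.283)
```

Because substitute appearances are counted as full games under this assumption, CMPG is likely to overstate the minutes per goal for players who often came off the bench.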
