LFC Data Analysis: The LFC Goal Machine

This notebook analyses Liverpool FC's goal scorers, in particular by exploring a scatter plot of each player's age against the top-level league goals he scored. The plot is available as an interactive web app called the LFC Goal Machine here.

The notebook was originally used to explore and merge the input data, generate the additional data the app needs, and prototype the application. It contains the key algorithms and some interesting plots, and describes how the lfcgm app was built and deployed.

The project uses IPython Notebook, Python, pandas, matplotlib, NumPy, ggplot, Spyre, Heroku and the Heroku SciPy buildpack.

Notebook Structure

  1. Prepare to load the data by defining the location of the key input csv files and the location of the output csv files
  2. Load the csv files and munge the data e.g. add player age at season midpoint; generate a merged dataframe with the key data
  3. Analyse the data and define key algorithms e.g. to produce the lfcgm graph
  4. Describe how the Spyre web application is built
  5. Generate key additional data e.g. list of players for the application dropdown list
  6. Describe how the app is deployed and run

Notebook Change Log


In [383]:
%%html
<!-- left align the change log table in next cell -->
<style>
table {float:left}
</style>


Date                 Change Description
21st February 2016   Initial baseline
30th October 2016    Added season 2015-16
12th October 2017    Added season 2016-17

Set-up

Import the python modules needed for the analysis.


In [384]:
from __future__ import division  # __future__ imports belong at the top of the cell

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import sys
import collections
import os
from ggplot import *
from ggplot import __version__ as ggplot_version
from datetime import datetime

# enable inline plotting
%matplotlib inline

Print version numbers.


In [289]:
print 'python version: {}'.format(sys.version)
print 'pandas version: {}'.format(pd.__version__)
print 'matplotlib version: {}'.format(mpl.__version__)
print 'numpy version: {}'.format(np.__version__)
print 'ggplot version: {}'.format(ggplot_version)


python version: 2.7.11 |Anaconda custom (64-bit)| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
pandas version: 0.18.0
matplotlib version: 1.5.1
numpy version: 1.11.1
ggplot version: 0.6.8

Prepare To Load The Data

Define name and location of csv files


In [290]:
# define key seasons
SEASON_END = '2016-2017' # most recent season
SEASON_START = '1892-1893' # first season in existence
LG_SEASON_START = '1893-1894' # first league season (2nd Division)
PLAYERS_CSV_MONTH = 'October' # month of players csv extract
PLAYERS_CSV_YEAR = '2017' # year of players csv extract

# define input csv files

# define scorers CSV file name (and check exists)
SCORERS_PREFIX = 'lfc_scorers'
SCORERS_CSV_FILE = '{}_{}_{}.csv'.format(SCORERS_PREFIX, SEASON_START, SEASON_END)
LFC_SCORERS_CSV_FILE = os.path.relpath('data/{}'.format(SCORERS_CSV_FILE))
assert os.path.isfile(LFC_SCORERS_CSV_FILE) 
print 'LFC scorers csv file is: {}'.format(LFC_SCORERS_CSV_FILE) 

# define squads CSV file name (and check exists)
SQUADS_PREFIX = 'lfc_squads'
SQUADS_CSV_FILE = '{}_{}_{}.csv'.format(SQUADS_PREFIX, SEASON_START, SEASON_END)
LFC_SQUADS_CSV_FILE = os.path.relpath('data/{}'.format(SQUADS_CSV_FILE))
assert os.path.isfile(LFC_SQUADS_CSV_FILE) 
print 'LFC squads csv file is: {}'.format(LFC_SQUADS_CSV_FILE)

# define player appearances CSV file name (and check exists)
APPS_PREFIX = 'lfc_apps'
APPS_CSV_FILE = '{}_{}_{}.csv'.format(APPS_PREFIX, SEASON_START, SEASON_END)
LFC_APPS_CSV_FILE = os.path.relpath('data/{}'.format(APPS_CSV_FILE))
assert os.path.isfile(LFC_APPS_CSV_FILE) 
print 'LFC appearances csv file is: {}'.format(LFC_APPS_CSV_FILE)

# define league CSV file name (and check exists)
LEAGUE_PREFIX = 'lfc_league'
LEAGUE_CSV_FILE = '{}_{}_{}.csv'.format(LEAGUE_PREFIX, LG_SEASON_START, SEASON_END)
LFC_LEAGUE_CSV_FILE = os.path.relpath('data/{}'.format(LEAGUE_CSV_FILE))
assert os.path.isfile(LFC_LEAGUE_CSV_FILE) 
print 'LFC league csv file is: {}'.format(LFC_LEAGUE_CSV_FILE)
                                          
# define players CSV file name (and check exists)
PLAYERS_PREFIX = 'lfc_players'
PLAYERS_CSV_FILE_UPDATED = '{}_{}{}_upd.csv'.format(PLAYERS_PREFIX, PLAYERS_CSV_MONTH, PLAYERS_CSV_YEAR)
LFC_PLAYERS_CSV_FILE_UPDATED = os.path.relpath('data/{}'.format(PLAYERS_CSV_FILE_UPDATED))
assert os.path.isfile(LFC_PLAYERS_CSV_FILE_UPDATED) 
print 'LFC players csv file is: {}'.format(LFC_PLAYERS_CSV_FILE_UPDATED)

# define generated csv files (this is the data used by the lfcgm app)

# define scorers in top league with position and age CSV file name
SCORERS_TL_POS_AGE_CSV_FILE = 'lfc_scorers_tl_pos_age.csv'
LFC_SCORERS_TL_POS_AGE_CSV_FILE = os.path.relpath('data/{}'.format(SCORERS_TL_POS_AGE_CSV_FILE))
print 'LFC scorers in top league with position and age is: {}'.format(LFC_SCORERS_TL_POS_AGE_CSV_FILE)

# define dropdown CSV file name (forward slash keeps the path portable)
LFCGM_DROPDOWN = os.path.relpath('data/lfcgm_app_dropdown.csv')
print 'LFC goal machine dropdown is: {}'.format(LFCGM_DROPDOWN)


LFC scorers csv file is: data\lfc_scorers_1892-1893_2016-2017.csv
LFC squads csv file is: data\lfc_squads_1892-1893_2016-2017.csv
LFC appearances csv file is: data\lfc_apps_1892-1893_2016-2017.csv
LFC league csv file is: data\lfc_league_1893-1894_2016-2017.csv
LFC players csv file is: data\lfc_players_October2017_upd.csv
LFC scorers in top league with position and age is: data\lfc_scorers_tl_pos_age.csv
LFC goal machine dropdown is: data\lfcgm_app_dropdown.csv
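
The scorers, squads, appearances and league file names above all follow the same prefix_start_end.csv pattern, so the repeated formatting could be folded into one helper. The sketch below is only an illustration: csv_path and its data_dir parameter are hypothetical names that do not appear in the notebook, and the existence check is left out so the snippet runs without the data files.

```python
import os


def csv_path(prefix, season_start, season_end, data_dir='data'):
    """Build the path of a season-range csv file (hypothetical helper)."""
    file_name = '{}_{}_{}.csv'.format(prefix, season_start, season_end)
    # os.path.join uses the separator of the current platform
    return os.path.join(data_dir, file_name)


scorers_csv = csv_path('lfc_scorers', '1892-1893', '2016-2017')
league_csv = csv_path('lfc_league', '1893-1894', '2016-2017')
```

In the notebook itself each path would still be followed by an os.path.isfile assertion, as in the cell above.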

Load the LFC data into dataframes and munge

Create a dataframe of scorers in top level league seasons

Data source: lfchistory.net


In [291]:
print 'Loading LFC scorers csv from {}'.format(LFC_SCORERS_CSV_FILE)
dflfc_scorers = pd.read_csv(LFC_SCORERS_CSV_FILE)

# sort by season, then league goals
dflfc_scorers = dflfc_scorers.sort_values(['season', 'league'], ascending=([False, False]))
dflfc_scorers.shape


Loading LFC scorers csv from data\lfc_scorers_1892-1893_2016-2017.csv
Out[291]:
(1429, 3)

In [292]:
dflfc_scorers.head()


Out[292]:
season player league
1416 2016-2017 Philippe Coutinho 13
1417 2016-2017 Sadio Mane 13
1418 2016-2017 Roberto Firmino 11
1419 2016-2017 Adam Lallana 8
1420 2016-2017 Divock Origi 7

In [293]:
dflfc_scorers.tail()


Out[293]:
season player league
5 1892-1893 Jonathan Cameron 4
6 1892-1893 Jim McBride 4
7 1892-1893 Hugh McQueen 3
8 1892-1893 Joe McQue 2
9 1892-1893 Own goals 1

In [294]:
# note that scorers includes own goals
dflfc_scorers[dflfc_scorers.player == 'Own goals'].head()


Out[294]:
season player league
1426 2016-2017 Own goals 1
1411 2015-2016 Own goals 1
1388 2014-2015 Own goals 4
1374 2013-2014 Own goals 5
1357 2012-2013 Own goals 4

Filter out seasons when LFC weren't in the top level

In [295]:
# note: war years already excluded in input files
LANCS_YRS = ['1892-1893']
SECOND_DIV_YRS = ['1893-1894', '1895-1896', '1904-1905', '1954-1955',
                  '1955-1956', '1956-1957', '1957-1958', '1958-1959',
                  '1959-1960', '1960-1961', '1961-1962']

NOT_TOP_LEVEL_YRS = LANCS_YRS + SECOND_DIV_YRS
dflfc_scorers_tl = dflfc_scorers[~dflfc_scorers.season.isin(NOT_TOP_LEVEL_YRS)].copy()
dflfc_scorers_tl.shape


Out[295]:
(1281, 3)
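
The filter above combines isin() with the ~ operator. As a minimal sketch of the same pattern on a tiny table (the player names and goal counts below are invented for illustration and are not taken from the LFC data):

```python
import pandas as pd

# Hypothetical miniature of the scorers table (values invented)
scorers = pd.DataFrame({
    'season': ['1892-1893', '1893-1894', '1894-1895', '1894-1895'],
    'player': ['Player A', 'Player B', 'Player C', 'Player D'],
    'league': [4, 7, 3, 1],
})

not_top_level = ['1892-1893', '1893-1894']

# isin() builds a boolean mask; ~ inverts it so only top-level seasons remain.
# .copy() detaches the result, so later changes do not touch `scorers`.
top_level = scorers[~scorers.season.isin(not_top_level)].copy()
```

Only the two 1894-1895 rows survive, and the original table keeps all four rows.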

In [299]:
## check number of top level seasons aligns with http://www.lfchistory.net/Stats/LeagueOverall
## expect 102 total for top level seasons from 1894-95 to 2016-17
print 'the number of seasons is {}'.format(len(dflfc_scorers_tl.season.unique()))
assert len(dflfc_scorers_tl.season.unique()) == 102


the number of seasons is 102

In [300]:
# show most league goals in a season in top level
# cross-check with http://en.wikipedia.org/wiki/List_of_Liverpool_F.C._records_and_statistics#Goalscorers
# expect 101 in 2013-14
assert dflfc_scorers_tl[['season', 'league']].groupby(['season'])\
            .sum().sort_values('league', ascending=False).head(1).reset_index().values.tolist()[0] == ['2013-2014', 101]
dflfc_scorers_tl[['season', 'league']].groupby(['season']).sum().sort_values('league', ascending=False).head(1)


Out[300]:
league
season
2013-2014 101

In [301]:
# remove own goals (OG)
dflfc_scorers_tl = dflfc_scorers_tl[dflfc_scorers_tl.player != 'Own goals']
dflfc_scorers_tl.shape


Out[301]:
(1210, 3)

In [302]:
# check 2016-17
dflfc_scorers_tl[dflfc_scorers_tl.season == '2016-2017'].head(10)


Out[302]:
season player league
1416 2016-2017 Philippe Coutinho 13
1417 2016-2017 Sadio Mane 13
1418 2016-2017 Roberto Firmino 11
1419 2016-2017 Adam Lallana 8
1420 2016-2017 Divock Origi 7
1421 2016-2017 James Milner 7
1422 2016-2017 Georginio Wijnaldum 6
1423 2016-2017 Emre Can 5
1424 2016-2017 Daniel Sturridge 3
1425 2016-2017 Dejan Lovren 2

Create dataframe of squads giving the position of each player

Data source: lfchistory.net


In [303]:
print 'Loading LFC squads csv from {}'.format(LFC_SQUADS_CSV_FILE)
dflfc_squads = pd.read_csv(LFC_SQUADS_CSV_FILE)
dflfc_squads.shape


Loading LFC squads csv from data\lfc_squads_1892-1893_2016-2017.csv
Out[303]:
(2977, 3)

In [304]:
dflfc_squads.head()


Out[304]:
season player position
0 1892-1893 Sydney Ross Goalkeeper
1 1892-1893 Billy McOwen Goalkeeper
2 1892-1893 Jim McBride Defender
3 1892-1893 John McCartney Defender
4 1892-1893 Andrew Hannah Defender

In [305]:
dflfc_squads.tail()


Out[305]:
season player position
2972 2016-2017 Ben Woodburn Striker
2973 2016-2017 Divock Origi Striker
2974 2016-2017 Sadio Mane Striker
2975 2016-2017 Daniel Sturridge Striker
2976 2016-2017 Rhian Brewster Striker

Create dataframe of LFC's league position

Data source: lfchistory.net


In [306]:
print 'Loading LFC league csv from {}'.format(LFC_LEAGUE_CSV_FILE)
dflfc_league = pd.read_csv(LFC_LEAGUE_CSV_FILE)
dflfc_league.shape


Loading LFC league csv from data\lfc_league_1893-1894_2016-2017.csv
Out[306]:
(113, 18)

In [307]:
dflfc_league.head()


Out[307]:
Season League Pos PLD HW HD HL HF HA AW AD AL AF AA PTS GF GA GD
0 1893-1894 2nd Division 1 28 14 0 0 46 6 8 6 0 31 12 50 77 18 59
1 1894-1895 1st Division 15 30 6 4 5 38 28 1 4 10 13 42 22 51 70 -19
2 1895-1896 2nd Division 1 30 14 1 0 65 11 8 1 6 41 21 46 106 32 74
3 1896-1897 1st Division 5 30 7 6 2 25 10 5 3 7 21 28 33 46 38 8
4 1897-1898 1st Division 9 30 7 4 4 27 16 4 2 9 21 29 28 48 45 3

In [308]:
dflfc_league.tail()


Out[308]:
Season League Pos PLD HW HD HL HF HA AW AD AL AF AA PTS GF GA GD
108 2012-2013 Premier League 7 38 9 6 4 33 16 7 7 5 38 27 61 71 43 28
109 2013-2014 Premier League 2 38 16 1 2 53 18 10 5 4 48 32 84 101 50 51
110 2014-2015 Premier League 6 38 10 5 4 30 20 8 3 8 22 28 62 52 48 4
111 2015-2016 Premier League 8 38 8 8 3 33 22 8 4 7 30 28 60 63 50 13
112 2016-2017 Premier League 4 38 12 5 2 45 18 10 5 4 33 24 76 78 42 36

Create merged dataframe, combining scorers in top league level season with squad position


In [309]:
dflfc_scorers_tl_pos = pd.DataFrame.merge(dflfc_scorers_tl, dflfc_squads)
dflfc_scorers_tl_pos.shape


Out[309]:
(1210, 4)
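
Called without an `on` argument, merge joins on every column name the two frames share (here season and player) and keeps only matching rows. The row count staying at 1210 suggests every scorer found exactly one squad row. A hedged way to make that check explicit is sketched below; the rows are invented, and the use of how='left' with indicator=True is a suggestion rather than what the notebook does.

```python
import pandas as pd

# Invented rows, shaped like the scorers and squads tables
scorers = pd.DataFrame({
    'season': ['2016-2017', '2016-2017'],
    'player': ['Player A', 'Player B'],
    'league': [13, 11],
})
squads = pd.DataFrame({
    'season': ['2016-2017', '2016-2017', '2016-2017'],
    'player': ['Player A', 'Player B', 'Player C'],
    'position': ['Midfielder', 'Striker', 'Defender'],
})

# A left join keeps every scorer; the _merge column records whether a squad row matched
merged = scorers.merge(squads, on=['season', 'player'], how='left', indicator=True)
unmatched = merged[merged['_merge'] != 'both']

# Drop the bookkeeping column once the check is done
merged = merged.drop('_merge', axis=1)
```

An empty `unmatched` frame means no scorer is missing a position; with the default inner join such rows would be dropped silently.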

In [310]:
dflfc_scorers_tl_pos.head()


Out[310]:
season player league position
0 2016-2017 Philippe Coutinho 13 Midfielder
1 2016-2017 Sadio Mane 13 Striker
2 2016-2017 Roberto Firmino 11 Striker
3 2016-2017 Adam Lallana 8 Midfielder
4 2016-2017 Divock Origi 7 Striker

In [311]:
dflfc_scorers_tl_pos.tail()


Out[311]:
season player league position
1205 1894-1895 Frank Becton 4 Striker
1206 1894-1895 Neil Kerr 3 Midfielder
1207 1894-1895 Hugh McQueen 2 Midfielder
1208 1894-1895 Joe McQue 1 Defender
1209 1894-1895 Patrick Gordon 1 Midfielder

Create a dataframe of players with birthdate and country of birth


In [312]:
print 'Loading LFC players csv from {}'.format(LFC_PLAYERS_CSV_FILE_UPDATED)
dflfc_players = pd.read_csv(LFC_PLAYERS_CSV_FILE_UPDATED, parse_dates=['birthdate'])
assert dflfc_players.birthdate.dtypes == 'datetime64[ns]'
dflfc_players.shape


Loading LFC players csv from data\lfc_players_October2017_upd.csv
Out[312]:
(786, 3)

In [313]:
dflfc_players.head()


Out[313]:
player birthdate country
0 Gary Ablett 1965-11-19 England
1 Alan A'Court 1934-09-30 England
2 Charlie Adam 1985-12-10 Scotland
3 Daniel Agger 1984-12-12 Denmark
4 Andrew Aitken 1909-08-25 England

In [314]:
dflfc_players.tail()


Out[314]:
player birthdate country
781 Ron Yeats 1937-11-15 Scotland
782 Samed Yesil 1994-05-25 Germany
783 Tommy Younger 1930-04-10 Scotland
784 Bolo Zenden 1976-08-15 Netherlands
785 Christian Ziege 1972-02-01 Germany

Create merged dataframe of players, combining scorers in top league level season with squad position and age

Add each player's age to the dflfc_scorers_tl_pos dataframe


In [315]:
def age_at_season(row):
    """Return player's age at mid-point of season, assumed to be 1st Jan.

        row.player -> player's name
        row.season -> season, e.g. '2016-2017'

        uses dflfc_players to look up birthdate, keyed on player
         - return average age if player is missing from dflfc_players
    """

    AVERAGE_AGE = 26.5

    # mid-point is 1st January of the season's second year
    mid_point = '01 January {}'.format(row.season[-4:])
    try:
        dob = dflfc_players[dflfc_players.player == row.player].birthdate.values[0]
    except IndexError:
        # no matching row in dflfc_players, so use average age
        print 'error: age not found for player {} in season {}, using average age {}'.format(row.player,
                                                                                             row.season,
                                                                                             AVERAGE_AGE)
        return AVERAGE_AGE
    return round((pd.Timestamp(mid_point) - dob).days/365.0, 1)
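
The age arithmetic can be checked by hand with the standard library alone. The sketch below is an illustration, not the notebook's function: age_at_midpoint is a hypothetical name, the birthdate is Gary Ablett's from the players table above, and the season string is chosen only as an example.

```python
from datetime import date


def age_at_midpoint(birthdate, season):
    """Age in years (1 decimal place) on 1 January of the season's second year."""
    mid_point = date(int(season[-4:]), 1, 1)
    # same approximation as the notebook: days divided by 365.0
    return round((mid_point - birthdate).days / 365.0, 1)


days = (date(1988, 1, 1) - date(1965, 11, 19)).days
age = age_at_midpoint(date(1965, 11, 19), '1987-1988')
```

There are 8078 days between the two dates, and 8078 / 365 is about 22.13, so the function returns 22.1. Dividing by 365.0 ignores leap days, which is harmless at one decimal place for most playing ages.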

In [316]:
# add age column
# no errors expected
dflfc_scorers_tl_pos['age'] = dflfc_scorers_tl_pos.apply(age_at_season, axis=1)

In [317]:
dflfc_scorers_tl_pos_age = dflfc_scorers_tl_pos.copy()

In [318]:
dflfc_scorers_tl_pos_age.head()


Out[318]:
season player league position age
0 2016-2017 Philippe Coutinho 13 Midfielder 24.6
1 2016-2017 Sadio Mane 13 Striker 24.7
2 2016-2017 Roberto Firmino 11 Striker 25.3
3 2016-2017 Adam Lallana 8 Midfielder 28.7
4 2016-2017 Divock Origi 7 Striker 21.7

Save the new dataframe

This is the key dataframe used in the plot of age vs league goals.


In [319]:
dflfc_scorers_tl_pos_age.to_csv(LFC_SCORERS_TL_POS_AGE_CSV_FILE, header=True, sep=',')
assert os.path.isfile(LFC_SCORERS_TL_POS_AGE_CSV_FILE)

Create dataframe of players' league appearances


In [320]:
# read the appearance file
print 'Loading LFC appearances csv from {}'.format(LFC_APPS_CSV_FILE)
dflfc_lgapps = pd.read_csv(LFC_APPS_CSV_FILE)
print dflfc_lgapps.shape
dflfc_lgapps.head()


Loading LFC appearances csv from data\lfc_apps_1892-1893_2016-2017.csv
(2608, 3)
Out[320]:
season player lgapp
0 1892-1893 Andrew Hannah 22
1 1892-1893 Duncan McLean 22
2 1892-1893 Tom Wyllie 22
3 1892-1893 Malcolm McVean 21
4 1892-1893 John Miller 21

Create merged dataframe of players, combining scorers in top league level season with squad position, age and appearances


In [321]:
dflfc_scorers_tl_pos_age_apps = dflfc_scorers_tl_pos_age.merge(dflfc_lgapps)
print dflfc_scorers_tl_pos_age_apps.shape
print dflfc_scorers_tl_pos.shape


(1210, 6)
(1210, 5)

In [322]:
dflfc_scorers_tl_pos_age_apps.head()


Out[322]:
season player league position age lgapp
0 2016-2017 Philippe Coutinho 13 Midfielder 24.6 31
1 2016-2017 Sadio Mane 13 Striker 24.7 27
2 2016-2017 Roberto Firmino 11 Striker 25.3 35
3 2016-2017 Adam Lallana 8 Midfielder 28.7 31
4 2016-2017 Divock Origi 7 Striker 21.7 34

In [323]:
dflfc_scorers_tl_pos_age_apps.tail()


Out[323]:
season player league position age lgapp
1205 1894-1895 Frank Becton 4 Striker 21.2 5
1206 1894-1895 Neil Kerr 3 Midfielder 23.7 12
1207 1894-1895 Hugh McQueen 2 Midfielder 27.3 12
1208 1894-1895 Joe McQue 1 Defender 21.8 29
1209 1894-1895 Patrick Gordon 1 Midfielder 24.9 5

Analyse the data

Ask a question and find the answer!

Create a function to plot player's age vs top level league goals


In [324]:
def ggplot_age_vs_lgoals(df, players):
    """Return ggplot of Age vs League Goals for given list of players in dataframe.

       Given the low number of points, ggplot's geom_smooth uses
       the loess method with default span."""
    TITLE = 'LFCGM Age vs League Goals'
    XLABEL = 'Age at Midpoint of Season'
    YLABEL = 'League Goals per Season'
    EXEMPLAR_PLAYERS = ['Ian Rush', 'Kenny Dalglish', 'Roger Hunt', 'David Johnson',
                        'Harry Chambers', 'John Toshack', 'John Barnes', 'Kevin Keegan']
    EXEMPLAR_TITLE = 'LFCGM Example Plot, The Champions: Age vs League Goals'

    # if players list is empty then set the default exemplar options
    if not players:
        players = EXEMPLAR_PLAYERS
        TITLE = EXEMPLAR_TITLE

    # filter dataframe for given players and plot
    this_df = df[df.player.isin(players)]
    this_plot = ggplot(this_df, aes(x='age', y='league', color='player', shape='player')) + \
                    geom_point() + \
                    geom_smooth(se=False) + \
                    xlab(XLABEL) + \
                    ylab(YLABEL) + \
                    scale_y_discrete(limits=(0, this_df.league.max() + 1)) + \
                    ggtitle(TITLE)
    return this_plot

In [325]:
# show the default plot, showing the champions (see below for more info on 'The Champions')
players = []
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)


Out[325]:
<ggplot: (15438073)>

In [326]:
# show type of ggplot.draw() object
# Note that Spyre (v0.2) can only handle matplotlib object or pyplot figure
players = []
ggp = ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
type(ggp.draw())


Out[326]:
matplotlib.figure.Figure

Show some interesting plots

Early Riser


In [327]:
# show all players scoring 20 or more league goals in a season when under 20 years old
dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.league >= 20) & 
                          (dflfc_scorers_tl_pos_age.age < 20)]


Out[327]:
season player league position age
285 1994-1995 Robbie Fowler 25 Striker 19.7

In [328]:
# produce plot for player known as 'god'
players = ['Robbie Fowler']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)


Out[328]:
<ggplot: (15442324)>

Late Flourish


In [329]:
# show all players scoring 20 or more league goals in a season when over 30 years old
df_late = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.league >= 20) & 
                                   (dflfc_scorers_tl_pos_age.age > 30)]
df_late


Out[329]:
season player league position age
360 1988-1989 John Aldridge 21 Striker 30.3
766 1946-1947 Jack Balmer 24 Striker 30.9
814 1934-1935 Gordon Hodgson 27 Striker 30.7
922 1925-1926 Dick Forshaw 27 Striker 30.4
1067 1908-1909 Ronald Orr 20 Striker 32.4

In [330]:
players = df_late.player.values
print sorted(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Dick Forshaw', 'Gordon Hodgson', 'Jack Balmer', 'John Aldridge', 'Ronald Orr']
Out[330]:
<ggplot: (15265870)>

All time career top scorers


In [331]:
# show players scoring most league goals over their career
df_top = dflfc_scorers_tl_pos_age[['player', 'league']].groupby('player').sum()
df_top = df_top.sort_values('league', ascending=False).head(12)
df_top


Out[331]:
league
player
Gordon Hodgson 233
Ian Rush 229
Roger Hunt 167
Harry Chambers 135
Robbie Fowler 128
Steven Gerrard 120
Kenny Dalglish 118
Michael Owen 118
Dick Forshaw 116
Jack Parkinson 103
Sam Raybould 101
Jack Balmer 98
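
Career totals come from grouping the per-season rows by player and summing the league column. A minimal sketch of that aggregation on invented rows follows (player names and goal counts are hypothetical):

```python
import pandas as pd

# Invented per-season rows: one row per player per season
df = pd.DataFrame({
    'player': ['Player A', 'Player B', 'Player A', 'Player C', 'Player B'],
    'league': [10, 7, 12, 3, 9],
})

# Sum each player's seasons, then rank from highest to lowest
career = df.groupby('player')['league'].sum().sort_values(ascending=False)
top_two = career.head(2)
```

Because the grouping key is the player's name, two different players sharing a name would be added together, so the totals are only as reliable as the names are unique.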

In [332]:
players = df_top[df_top.league >= 120].index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Gordon Hodgson' 'Ian Rush' 'Roger Hunt' 'Harry Chambers' 'Robbie Fowler'
 'Steven Gerrard']
Out[332]:
<ggplot: (11773251)>

Elite 30


In [333]:
# show players scoring >=30 league goals in a season
df_elite = dflfc_scorers_tl_pos_age[['season', 'player', 'league']].sort_values('league', ascending=False)
df_elite.head(10)


Out[333]:
season player league
861 1930-1931 Gordon Hodgson 36
417 1983-1984 Ian Rush 32
1128 1902-1903 Sam Raybould 31
666 1963-1964 Roger Hunt 31
44 2013-2014 Luis Suarez 31
1057 1909-1910 Jack Parkinson 30
380 1986-1987 Ian Rush 30
887 1928-1929 Gordon Hodgson 30
639 1965-1966 Roger Hunt 29
275 1995-1996 Robbie Fowler 28

In [334]:
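# plot players with a season of more than 30 league goals (a 30-goal season alone does not qualify)
players = df_elite[df_elite.league > 30].player.unique()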
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Gordon Hodgson' 'Ian Rush' 'Sam Raybould' 'Roger Hunt' 'Luis Suarez']
Out[334]:
<ggplot: (11779030)>

A Striking Trio

  • For a discussion of Liverpool's best ever trio see Terry's blog A Striking Trio

In [335]:
# show best total for a striking trio in the league
df_trio = dflfc_scorers_tl_pos_age[['season', 'league']].groupby('season').head(3).groupby('season').sum()
df_trio.sort_values('league', ascending=False).head(10)


Out[335]:
league
season
1963-1964 67
2013-2014 65
1909-1910 60
1930-1931 60
1946-1947 58
1987-1988 56
1931-1932 56
1934-1935 56
1922-1923 55
1928-1929 55
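
The trio total relies on the dataframe already being sorted by league goals within each season, because groupby('season').head(3) simply takes the first three rows of each group. A sketch that makes the ordering explicit is shown below, using invented season labels and goal counts:

```python
import pandas as pd

# Invented rows, deliberately unsorted
df = pd.DataFrame({
    'season': ['S1', 'S2', 'S1', 'S2', 'S1', 'S1', 'S2'],
    'league': [5, 20, 30, 8, 12, 9, 15],
})

# Sort so that the highest scorers come first within each season
ordered = df.sort_values(['season', 'league'], ascending=[True, False])

# Keep the top three rows per season, then add up their goals
trio = ordered.groupby('season').head(3).groupby('season')['league'].sum()
```

Season S1 keeps 30, 12 and 9 for a total of 51, and S2 keeps 20, 15 and 8 for a total of 43. Without the explicit sort, the result would depend on the row order of the input.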

In [336]:
TOP_TRIO = ['1963-1964']
df_trio_players = dflfc_scorers_tl_pos_age[['season', 'player', 'league']]\
                                            [dflfc_scorers_tl_pos_age.season.isin(TOP_TRIO)].groupby('season').head(3)
df_trio_players


Out[336]:
season player league
666 1963-1964 Roger Hunt 31
667 1963-1964 Ian St John 21
668 1963-1964 Alf Arrowsmith 15

In [337]:
players = df_trio_players.player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


Out[337]:
<ggplot: (17379058)>

A Striking Duo


In [338]:
# show best total for a striking duo in the league
df_duo = dflfc_scorers_tl_pos_age[['season', 'league']].groupby('season').head(2).groupby('season').sum()
df_duo.sort_values('league', ascending=False).head(10)


Out[338]:
league
season
2013-2014 52
1963-1964 52
1930-1931 50
1909-1910 48
1946-1947 48
1934-1935 46
1928-1929 46
1925-1926 44
1962-1963 43
1931-1932 43

In [339]:
TOP_DUO = ['1963-1964', '2013-2014']
df_duo_players = dflfc_scorers_tl_pos_age[['season', 'player', 'league']]\
                                            [dflfc_scorers_tl_pos_age.season.isin(TOP_DUO)].groupby('season').head(2)
df_duo_players


Out[339]:
season player league
44 2013-2014 Luis Suarez 31
45 2013-2014 Daniel Sturridge 21
666 1963-1964 Roger Hunt 31
667 1963-1964 Ian St John 21

In [340]:
# plot first of TOP_DUO seasons
players = df_duo_players[df_duo_players.season == TOP_DUO[0]].player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


Out[340]:
<ggplot: (14873920)>

In [341]:
# plot second of TOP_DUO seasons
players = df_duo_players[df_duo_players.season == TOP_DUO[1]].player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


Out[341]:
<ggplot: (11781759)>

Performance of Liverpool players who went on to be managers


In [342]:
# produce list of managers - ref: http://www.lfchistory.net/Managers/
MANAGERS = ['William Barclay', 'Tom Watson', 'David Ashworth', 'Matt McQueen', 'George Patterson',\
    'George Kay', 'Don Welsh', 'Phil Taylor', 'Bill Shankly', 'Bob Paisley', 'Joe Fagan',\
    'Kenny Dalglish', 'Graeme Souness', 'Roy Evans', 'Gerard Houllier',\
    'Rafael Benitez', 'Roy Hodgson', 'Kenny Dalglish', 'Brendan Rodgers', 'Jurgen Klopp']
# excludes Ronnie Moran who was temporary manager in 1991

In [343]:
# total top-level league goals of players who went on to manage LFC
df_mgrs = dflfc_scorers_tl_pos_age[['player', 'league']][dflfc_scorers_tl_pos_age.player.isin(MANAGERS)]\
                                        .groupby('player').sum().sort_values('league', ascending=False)
df_mgrs


Out[343]:
league
player
Kenny Dalglish 118
Graeme Souness 38
Phil Taylor 32
Bob Paisley 10

In [344]:
players = df_mgrs.index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Kenny Dalglish' 'Graeme Souness' 'Phil Taylor' 'Bob Paisley']
Out[344]:
<ggplot: (15308681)>

Top midfielders


In [345]:
# show midfielders who have scored more than 15 league goals in a season
df_mids = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Midfielder') &
                                    (dflfc_scorers_tl_pos_age.league > 15)].sort_values('league', ascending=False)
df_mids


Out[345]:
season player league position age
712 1951-1952 Billy Liddell 19 Midfielder 30.0
405 1984-1985 John Wark 18 Midfielder 27.4
430 1982-1983 Kenny Dalglish 18 Midfielder 31.9
735 1949-1950 Billy Liddell 17 Midfielder 28.0
851 1931-1932 Gordon Gunson 17 Midfielder 27.5
108 2008-2009 Steven Gerrard 16 Midfielder 28.6
339 1990-1991 John Barnes 16 Midfielder 27.2
888 1928-1929 Dick Edmed 16 Midfielder 24.9

In [346]:
players = df_mids.sort_values('league', ascending=False).player.unique()
print len(players), players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


7 ['Billy Liddell' 'John Wark' 'Kenny Dalglish' 'Gordon Gunson'
 'Steven Gerrard' 'John Barnes' 'Dick Edmed']
Out[346]:
<ggplot: (17754705)>

Top Defenders


In [347]:
# show defenders who have scored more than 6 league goals in a season
df_defs = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Defender') &
                                    (dflfc_scorers_tl_pos_age.league > 6)].sort_values('league', ascending=False)
df_defs


Out[347]:
season player league position age
590 1969-1970 Chris Lawler 10 Defender 26.2
432 1982-1983 Phil Neal 8 Defender 31.9
5 2016-2017 James Milner 7 Defender 31.0
48 2013-2014 Martin Skrtel 7 Defender 29.1
200 2001-2002 John Arne Riise 7 Defender 21.3
504 1976-1977 Phil Neal 7 Defender 25.9

In [348]:
players = df_defs.sort_values('league', ascending=False).player.unique()
print len(players), players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


5 ['Chris Lawler' 'Phil Neal' 'James Milner' 'Martin Skrtel'
 'John Arne Riise']
Out[348]:
<ggplot: (15410717)>

Peak Performance


In [349]:
# show player with top score in a season, Gordon Hodgson
top_player = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.league == max(dflfc_scorers_tl_pos_age.league)]
top_player


Out[349]:
season player league position age
861 1930-1931 Gordon Hodgson 36 Striker 26.7

In [350]:
players = top_player.player.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Gordon Hodgson']
Out[350]:
<ggplot: (11765147)>

Rocket Men

Show the players who scored 50 or more goals in total across a run of 3 or more consecutive seasons in which their goal tally never fell from one season to the next.


In [351]:
# create dataframe ordered by player and season
df = dflfc_scorers_tl_pos_age.groupby(['player', 'season']).sum()
df.head(12)


Out[351]:
league age
player season
Abel Xavier 2001-2002 1 29.1
Abraham Hartley 1897-1898 1 25.9
Adam Lallana 2014-2015 5 26.7
2015-2016 4 27.7
2016-2017 8 28.7
Alan A'Court 1952-1953 2 18.3
1953-1954 3 19.3
1962-1963 2 28.3
Alan Arnell 1953-1954 1 20.1
Alan Hansen 1978-1979 1 23.6
1979-1980 4 24.6
1980-1981 1 25.6

In [352]:
def linefit(x, y):
    """Return gradient and intercept of straight line of best fit for given x and y arrays."""
    gradient, intercept = np.polyfit(x, y, 1)
    return gradient, intercept
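
For a degree-1 fit, np.polyfit returns the ordinary least-squares line, which also has a simple closed form: gradient = Sxy / Sxx and intercept = mean(y) - gradient * mean(x). The sketch below works that through in plain Python for four points on y = 2x^2 + 6; linefit_closed_form is a hypothetical name used only for this check.

```python
def linefit_closed_form(x, y):
    """Least-squares gradient and intercept from the closed-form formulas."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxy: co-variation of x and y; Sxx: variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    gradient = sxy / sxx
    intercept = mean_y - gradient * mean_x
    return gradient, intercept


x = [-1, 0, 1, 2]
y = [2 * xi * xi + 6 for xi in x]   # [8, 6, 8, 14]
gradient, intercept = linefit_closed_form(x, y)
```

Here mean(x) = 0.5, mean(y) = 9, Sxy = 10 and Sxx = 5, giving a gradient of 2 and an intercept of 8.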

In [353]:
# test linefit()
# using y = 2x^2 + 6
x=np.array([-1, 0, 1, 2])
print x
y=2*x*x + 6
print y
print plt.plot(x, y)
gradient, intercept = linefit(x, y)
print np.round(gradient, 1), np.round(intercept, 1)
print plt.plot(x, gradient*x + intercept)


[-1  0  1  2]
[ 8  6  8 14]
[<matplotlib.lines.Line2D object at 0x000000000E2D37B8>]
2.0 8.0
[<matplotlib.lines.Line2D object at 0x000000000B5B0DA0>]

In [354]:
# Show the players who scored 50 or more goals in total across a run of 3 or more
# consecutive seasons in which their goal tally never fell from one season to the next.
MIN_SEASONS = 3
MIN_TOTAL_GOALS = 50
p_prev = None # previous player
l_prev = None # previous league goals
Lg = [] # List of consecutive goals
La = [] # List of consecutive ages
Ls = [] # List of consecutive seasons

# iterate through dataframe 
# for each row of (player, season) (league goals, age)
for (p, s), (l, a) in df.iterrows():
    if p != p_prev:
        # new player, so check previous
        if len(Lg) >= MIN_SEASONS and sum(Lg) >= MIN_TOTAL_GOALS:
            grad, intercept = linefit(np.array(range(len(Lg))), np.array(Lg))
            print 'Rocket man: {}, goals={}, start_season={}, start_age={}, goals={}, grad={}'\
                                .format(p_prev, Lg, Ls[0], La[0], sum(Lg), np.round(grad, 2))
            
        #print 'new p', p
        l_prev = None
        Lg = []
        La = []
        Ls = []
        
    # print p, s, l, a #player, season, league, age
    #print l, l_prev, Lg
    # first season for a player (l_prev is None) always starts a new run
    if l_prev is None or l >= l_prev:
        #print '\t', l, 'greater than', l_prev, Lg
        Lg.append(l)
        La.append(a)
        Ls.append(s)
    else:
        if len(Lg) >= MIN_SEASONS and sum(Lg) >= MIN_TOTAL_GOALS:
            grad, intercept = linefit(np.array(range(len(Lg))), np.array(Lg))
            print 'Rocket man: {}, goals={}, start_season={}, start_age={}, goals={}, grad={}'\
                                .format(p_prev, Lg, Ls[0], La[0], sum(Lg), np.round(grad, 2))
        Lg = [l]
        La = [a]
        Ls = [s]
    
    l_prev = l
    p_prev = p


Rocket man: Berry Nieuwenhuys, goals=[9.0, 10.0, 10.0, 13.0, 13.0, 14.0], start_season=1933-1934, start_age=22.2, goals=69.0, grad=1.06
Rocket man: Dick Forshaw, goals=[7.0, 9.0, 17.0, 19.0], start_season=1919-1920, start_age=24.4, goals=52.0, grad=4.4
Rocket man: Dick Forshaw, goals=[5.0, 19.0, 27.0], start_season=1923-1924, start_age=28.4, goals=51.0, grad=11.0
Rocket man: Gordon Hodgson, goals=[4.0, 16.0, 23.0, 30.0], start_season=1925-1926, start_age=21.7, goals=73.0, grad=8.5
Rocket man: Gordon Hodgson, goals=[24.0, 24.0, 27.0], start_season=1932-1933, start_age=28.7, goals=75.0, grad=1.5
Rocket man: Ian Rush, goals=[17.0, 24.0, 32.0], start_season=1981-1982, start_age=20.2, goals=73.0, grad=7.5
Rocket man: Ian Rush, goals=[14.0, 22.0, 30.0], start_season=1984-1985, start_age=23.2, goals=66.0, grad=8.0
Rocket man: Luis Suarez, goals=[4.0, 11.0, 23.0, 31.0], start_season=2010-2011, start_age=24.0, goals=69.0, grad=9.3
Rocket man: Michael Owen, goals=[11.0, 16.0, 19.0, 19.0], start_season=1999-2000, start_age=20.1, goals=65.0, grad=2.7
Rocket man: Robbie Fowler, goals=[12.0, 25.0, 28.0], start_season=1993-1994, start_age=18.7, goals=65.0, grad=8.0
Rocket man: Terry McDermott, goals=[1.0, 1.0, 4.0, 8.0, 11.0, 13.0, 14.0], start_season=1975-1976, start_age=24.1, goals=52.0, grad=2.5

Top 5 Rocket Men (sorted by gradient of line of best fit) are

  • Dick Forshaw, 11
  • Luis Suarez, 9.3
  • Gordon Hodgson, 8.5
  • Ian Rush, 8.0
  • Robbie Fowler, 8.0
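
The run detection can also be written as a small pure function, which makes it easier to test and guarantees that the final run is not lost when the data ends. This is a sketch under stated assumptions rather than the notebook's code: rising_runs and rocket_runs are hypothetical names, and the sample sequence is Ian Rush's six seasons from 1981-1982 as listed in the output above.

```python
def rising_runs(goals):
    """Split a list of season goal tallies into runs that never decrease."""
    runs = []
    current = []
    for g in goals:
        if current and g < current[-1]:
            # tally fell, so close the current run and start a new one
            runs.append(current)
            current = []
        current.append(g)
    if current:
        # flush the final run once the seasons are exhausted
        runs.append(current)
    return runs


def rocket_runs(goals, min_seasons=3, min_total=50):
    """Keep only the runs that are long enough and contain enough goals."""
    return [run for run in rising_runs(goals)
            if len(run) >= min_seasons and sum(run) >= min_total]


rush = [17, 24, 32, 14, 22, 30]
rush_rockets = rocket_runs(rush)
```

For this sequence the function returns the two runs [17, 24, 32] and [14, 22, 30], matching the two Ian Rush lines in the output.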

In [355]:
# show example graph of the rocket portion of the player's career e.g. Robbie Fowler
p = 'Robbie Fowler'
Lg = [12.0, 25.0, 28.0]
dfp = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.player == p) &
                               (dflfc_scorers_tl_pos_age.league.isin(Lg))]
print dfp
print ggplot_age_vs_lgoals(dfp, [p])


        season         player  league position   age
275  1995-1996  Robbie Fowler      28  Striker  20.7
285  1994-1995  Robbie Fowler      25  Striker  19.7
295  1993-1994  Robbie Fowler      12  Striker  18.7
<ggplot: (17157843)>

Striking Nostalgia

In [356]:
# Just a few of my early favourites
players = ['Kevin Keegan', 'Kenny Dalglish', 'Steve Heighway']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)


Out[356]:
<ggplot: (15506674)>

Highest scoring midfielders over career


In [357]:
df = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Midfielder')]\
                            .groupby('player').sum()
print df[df.league > 50].sort_values('league', ascending=False)['league']
players = df[df.league > 50].sort_values('league', ascending=False).index.unique()
print len(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


player
Steven Gerrard       120
Billy Liddell         96
Berry Nieuwenhuys     74
Arthur Goddard        65
Jack Cox              62
John Barnes           62
Terry McDermott       54
Name: league, dtype: int64
7
Out[357]:
<ggplot: (14888687)>

Highest scoring defenders over career


In [358]:
df = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Defender')]\
                            .groupby('player').sum()
print df[df.league > 20].sort_values('league', ascending=False)['league']
players = df[df.league > 20].sort_values('league', ascending=False).index.unique()
print len(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


player
Chris Lawler        41
Phil Neal           41
Tommy Smith         36
Donald Mackinlay    28
Steve Nicol         23
Sami Hyypia         22
John Arne Riise     21
Name: league, dtype: int64
7
Out[358]:
<ggplot: (11768254)>
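
The midfielder and defender cells above differ only in the position and the goal threshold, so the pattern can be wrapped in a small helper. This is a minimal sketch: `top_scorers_by_position` is a hypothetical name, and the toy dataframe only mimics the column layout of dflfc_scorers_tl_pos_age with made-up values.

```python
import pandas as pd

def top_scorers_by_position(df, position, min_goals):
    """Career league goals for players listed in `position` who scored more than `min_goals`."""
    totals = df[df.position == position].groupby('player')['league'].sum()
    return totals[totals > min_goals].sort_values(ascending=False)

# toy rows in the same shape as dflfc_scorers_tl_pos_age (values are illustrative only)
toy = pd.DataFrame({
    'season': ['s1', 's2', 's1', 's2', 's1'],
    'player': ['A', 'A', 'B', 'B', 'C'],
    'league': [15, 10, 3, 4, 30],
    'position': ['Defender', 'Defender', 'Defender', 'Defender', 'Striker'],
})
result = top_scorers_by_position(toy, 'Defender', 20)
print(result)
```

Position is recorded on each season row, so the totals only count seasons in which the player is listed in that position (John Barnes, for example, appears in the midfielder totals above but is listed as a Striker for 1989-1990 further down).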

The Champions


In [359]:
# create list of seasons when LFC were champions
CHAMPS = ['1900-1901', '1905-1906', '1921-1922', '1922-1923', '1946-1947', '1963-1964',\
          '1965-1966', '1972-1973', '1975-1976', '1976-1977', '1978-1979', '1979-1980',\
          '1981-1982', '1982-1983', '1983-1984', '1985-1986', '1987-1988', '1989-1990']
print len(CHAMPS)


18

In [360]:
# show total goals over career in title winning teams
df_champs = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season.isin(CHAMPS)][['league', 'player']].groupby('player').sum()\
                            .sort_values('league', ascending=False).head(12)
df_champs


Out[360]:
league
player
Ian Rush 113
Kenny Dalglish 78
Roger Hunt 60
David Johnson 44
Harry Chambers 41
John Toshack 39
John Barnes 37
Kevin Keegan 37
Dick Forshaw 36
Terry McDermott 35
Ray Kennedy 34
Phil Neal 31

In [361]:
# plot top 8
players = df_champs.index.values[:8]
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Ian Rush' 'Kenny Dalglish' 'Roger Hunt' 'David Johnson' 'Harry Chambers'
 'John Toshack' 'John Barnes' 'Kevin Keegan']
Out[361]:
<ggplot: (15628124)>

In [362]:
# show highest scorers in a title winning season
dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season.isin(CHAMPS)].sort_values('league', ascending=False).head(12)


Out[362]:
season player league position age
417 1983-1984 Ian Rush 32 Striker 22.2
666 1963-1964 Roger Hunt 31 Striker 25.5
639 1965-1966 Roger Hunt 29 Striker 27.5
370 1987-1988 John Aldridge 26 Striker 29.3
767 1946-1947 Albert Stubbins 24 Striker 27.5
766 1946-1947 Jack Balmer 24 Striker 30.9
1101 1905-1906 Joe Hewitt 24 Striker 24.7
429 1982-1983 Ian Rush 24 Striker 21.2
348 1989-1990 John Barnes 22 Striker 26.2
391 1985-1986 Ian Rush 22 Striker 24.2
952 1922-1923 Harry Chambers 22 Striker 26.1
667 1963-1964 Ian St John 21 Striker 25.6

European Cup Winning Team, May 1977


In [363]:
players = ['Ray Clemence', 'Phil Neal', 'Joey Jones', 'Tommy Smith',
           'Ray Kennedy', 'Emlyn Hughes', 'Kevin Keegan', 'Jimmy Case',
           'Steve Heighway', 'Ian Callaghan', 'Terry McDermott']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)


Out[363]:
<ggplot: (17249652)>

Best goals per game


In [364]:
# calculate GPG for each player season
dflfc_scorers_tl_pos_age_apps['GPG'] = (dflfc_scorers_tl_pos_age_apps.league/dflfc_scorers_tl_pos_age_apps.lgapp).round(3)

In [365]:
dflfc_scorers_tl_pos_age_apps.head()


Out[365]:
season player league position age lgapp GPG
0 2016-2017 Philippe Coutinho 13 Midfielder 24.6 31 0.419
1 2016-2017 Sadio Mane 13 Striker 24.7 27 0.481
2 2016-2017 Roberto Firmino 11 Striker 25.3 35 0.314
3 2016-2017 Adam Lallana 8 Midfielder 28.7 31 0.258
4 2016-2017 Divock Origi 7 Striker 21.7 34 0.206

In [366]:
# show best GPG per season where appearance > 10
dflfc_scorers_tl_pos_age_apps[dflfc_scorers_tl_pos_age_apps.lgapp > 10].sort_values('GPG', ascending=False).head(15)


Out[366]:
season player league position age lgapp GPG
1057 1909-1910 Jack Parkinson 30 Striker 26.3 31 0.968
44 2013-2014 Luis Suarez 31 Striker 27.0 33 0.939
1128 1902-1903 Sam Raybould 31 Striker 27.6 33 0.939
861 1930-1931 Gordon Hodgson 36 Striker 26.7 40 0.900
922 1925-1926 Dick Forshaw 27 Striker 30.4 32 0.844
997 1914-1915 Fred Pagnam 24 Striker 23.3 29 0.828
97 2009-2010 Fernando Torres 18 Striker 25.8 22 0.818
814 1934-1935 Gordon Hodgson 27 Striker 30.7 34 0.794
887 1928-1929 Gordon Hodgson 30 Striker 24.7 38 0.789
639 1965-1966 Roger Hunt 29 Striker 27.5 37 0.784
417 1983-1984 Ian Rush 32 Striker 22.2 41 0.780
903 1927-1928 Willie Devlin 14 Striker 28.4 18 0.778
666 1963-1964 Roger Hunt 31 Striker 25.5 41 0.756
668 1963-1964 Alf Arrowsmith 15 Midfielder 21.1 20 0.750
275 1995-1996 Robbie Fowler 28 Striker 20.7 38 0.737

In [367]:
# show best Career GPG (CGPG) per career where appearance > 50
df_gpg = dflfc_scorers_tl_pos_age_apps[['player', 'league', 'lgapp']].groupby('player').sum()
df_gpg['CGPG'] = (df_gpg.league/df_gpg.lgapp).round(3) # career goals per game
df_gpg['CMPG'] = (df_gpg.lgapp*90/df_gpg.league).round(3) # career minutes per goal (assume all apps = 90 mins)
df_gpg[df_gpg.lgapp > 50].sort_values('CGPG', ascending=False).head(12)


Out[367]:
league lgapp CGPG CMPG
player
Gordon Hodgson 233 358 0.651 138.283
Fernando Torres 65 102 0.637 141.231
Luis Suarez 69 110 0.627 143.478
Jimmy Smith 38 61 0.623 144.474
John Aldridge 50 83 0.602 149.400
Tom Reid 30 51 0.588 153.000
Jack Parkinson 103 178 0.579 155.534
Roger Hunt 167 295 0.566 158.982
Sam Raybould 101 179 0.564 159.505
Michael Owen 118 216 0.546 164.746
Daniel Sturridge 46 89 0.517 174.130
Ian Rush 229 462 0.496 181.572
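
The two career columns are simple ratios, and the arithmetic can be checked by hand against the table. The sketch below uses a hypothetical helper, `career_rates`, with the same 90 minutes per appearance assumption as the cell above.

```python
def career_rates(goals, apps, minutes_per_app=90):
    """Return (career goals per game, career minutes per goal), rounded to 3 d.p."""
    cgpg = round(goals / float(apps), 3)
    cmpg = round(apps * minutes_per_app / float(goals), 3)
    return cgpg, cmpg

print(career_rates(233, 358))  # Gordon Hodgson
print(career_rates(65, 102))   # Fernando Torres
```

Every player in the dataframe has scored at least once, so the division by goals is safe here; a more general version would need to guard against zero goals.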

In [368]:
# plot top 6 goal scorers with best Career GPG
players = df_gpg[df_gpg.lgapp > 50].sort_values('CGPG', ascending=False).head(6).index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))


['Gordon Hodgson' 'Fernando Torres' 'Luis Suarez' 'Jimmy Smith'
 'John Aldridge' 'Tom Reid']
Out[368]:
<ggplot: (17755680)>

Note on the variable number of games per season

The number of league games per top level season has varied between 30 and 42, so season goal totals are not directly comparable across eras.


In [369]:
# show number of different total games
print dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])].PLD.unique()


[30 34 38 42 40]

In [370]:
# show number of seasons for each total
dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])][['PLD', 'Season']].groupby('PLD').count()


Out[370]:
Season
PLD
30 3
34 6
38 35
40 1
42 57

In [371]:
# show the seasons for each total
dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])][['PLD', 'Season']]\
                        .groupby('PLD')['Season'].apply(lambda x: ','.join(x))


Out[371]:
PLD
30                        1894-1895,1896-1897,1897-1898
34    1898-1899,1899-1900,1900-1901,1901-1902,1902-1...
38    1905-1906,1906-1907,1907-1908,1908-1909,1909-1...
40                                            1987-1988
42    1919-1920,1920-1921,1921-1922,1922-1923,1923-1...
Name: Season, dtype: object

In [372]:
# confirm that season 1939-1940 is not included in analysis
len(dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season == '1939-1940'])


Out[372]:
0

Building The Spyre App

Useful reference material:

Spyre is a web app framework for providing a simple user interface for Python data projects. In simple terms the app involves:

  1. creating a user interface to capture the list of players.
  2. calling the ggplot_age_vs_lgoals() function to plot the players' graph.

See lfcgm_app.py in the lfcgm github repo for the app source code.

Create the dropdown list of players for the Spyre app

I prototyped using checkboxes for the Spyre app input but decided multiple dropdowns were more elegant for my use case.

Note: ggplot does not plot single points correctly, so I decided to restrict the list of players to those who've scored in more than 1 season. For more information on the issue see this stackoverflow question.


In [373]:
# count the scoring seasons for each player (goal_tot = number of seasons with 1 or more goals)
df_pgoals = dflfc_scorers_tl_pos_age[['player', 'age']].groupby('player').count()
df_pgoals.columns = ['goal_tot']
print df_pgoals.head(10)

print '\nThere are {} goal scorers.'.format(len(df_pgoals))
print "That's {} data points.".format(df_pgoals.sum().values[0])
print 'Of these {} have scored more than once.'.format(df_pgoals[df_pgoals.goal_tot > 1].count().values[0])
print "That's {} data points.".format(df_pgoals[df_pgoals.goal_tot > 1].sum().values[0])

print '\nHere are the first of those scoring more than once...'
df_pgoals_gt1 = df_pgoals[df_pgoals.goal_tot > 1]
df_pgoals_gt1.head()


                 goal_tot
player                   
Abel Xavier             1
Abraham Hartley         1
Adam Lallana            3
Alan A'Court            3
Alan Arnell             1
Alan Hansen             5
Alan Kennedy            7
Alan Scott              1
Alan Waddle             1
Albert Pearson          1

There are 384 goal scorers.
That's 1210 data points.
Of these 240 have scored more than once.
That's 1066 data points.

Here are the first of those scoring more than once...
Out[373]:
goal_tot
player
Adam Lallana 3
Alan A'Court 3
Alan Hansen 5
Alan Kennedy 7
Albert Stubbins 6

In [374]:
# produce dataframe of players for the Spyre dropdown and save to csv
# this csv will be read by the app
player_dd = df_pgoals_gt1.index.values
player_dd

# create dropdown dataframe (with label and value) and save to csv
df_dropdown = pd.DataFrame(player_dd, player_dd).reset_index()
df_dropdown.columns = (['label', 'value'])
print df_dropdown.head()

# save dropdown label and value to csv
#LFCGM_DROPDOWN = os.path.relpath('data\lfcgm_app_dropdown.csv')
df_dropdown[['label', 'value']].to_csv(LFCGM_DROPDOWN, index=False)
assert os.path.isfile(LFCGM_DROPDOWN)


             label            value
0     Adam Lallana     Adam Lallana
1     Alan A'Court     Alan A'Court
2      Alan Hansen      Alan Hansen
3     Alan Kennedy     Alan Kennedy
4  Albert Stubbins  Albert Stubbins

In [375]:
# show players added to the dropdown in latest season, i.e. those whose 2nd scoring season is the latest
LATEST_SEASON = SEASON_END
scorers_for_latest_season_set = set(dflfc_scorers_tl[dflfc_scorers_tl.season == LATEST_SEASON].player.values)
this_season_scorers_df = dflfc_scorers_tl[dflfc_scorers_tl.player.isin(scorers_for_latest_season_set)]\
                                            [['player', 'season']].groupby('player').count()

players_added = this_season_scorers_df[this_season_scorers_df.season == 2].index.values
print 'there are {} new players added in {}; these are:'.format(len(players_added), LATEST_SEASON)
', '.join(players_added)


there are 3 new players added in 2016-2017; these are:
Out[375]:
'Divock Origi, James Milner, Roberto Firmino'

In [376]:
# read dropdown.csv and create dropdown_dict in the format that the Spyre app will use
dropdown_df = pd.read_csv(LFCGM_DROPDOWN)
print dropdown_df.head(), '\n'

# create dict in format Spyre app will use
dropdown_dict = dropdown_df.to_dict(orient='records')

# check length of dict (expect 240)
# 233 
# + 4 players added in 2015-16
# + 3 players added in 2016-17
print len(dropdown_dict)

# show the new players added in the latest season
dropdown_df[dropdown_df.value.isin(players_added)]


             label            value
0     Adam Lallana     Adam Lallana
1     Alan A'Court     Alan A'Court
2      Alan Hansen      Alan Hansen
3     Alan Kennedy     Alan Kennedy
4  Albert Stubbins  Albert Stubbins 

240
Out[376]:
label value
64 Divock Origi Divock Origi
115 James Milner James Milner
194 Roberto Firmino Roberto Firmino
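
To make the target format concrete, the sketch below builds the records list from the first rows printed above and drops it into a dropdown input definition. The input dict is an assumption about how the app might be configured: the 'key' and 'action_id' values are made up for illustration, and the real definition is in lfcgm_app.py.

```python
import pandas as pd

# first rows of the dropdown csv, as printed above
dropdown_df = pd.DataFrame({
    'label': ['Adam Lallana', "Alan A'Court", 'Alan Hansen'],
    'value': ['Adam Lallana', "Alan A'Court", 'Alan Hansen'],
})

# one dict per row: [{'label': ..., 'value': ...}, ...]
dropdown_dict = dropdown_df.to_dict(orient='records')

# hypothetical Spyre-style input definition; 'key' and 'action_id' are made up
player_input = {
    'type': 'dropdown',
    'label': 'Player 1',
    'options': dropdown_dict,
    'key': 'player1',
    'action_id': 'plot',
}
print(player_input['options'][0])
```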

Explore 'facet grid' of individual player graphs for those players in the dropdown

A final bit of data analysis to help find interesting plots...


In [377]:
# create grouper function to iterate in chunks of n
# ref: http://stackoverflow.com/questions/8991506/iterate-an-iterator-by-chunks-of-n-in-python
import string
from itertools import izip_longest
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

In [378]:
# test grouper
tmp = string.ascii_lowercase
for g in grouper(4, tmp, None):
    print g


('a', 'b', 'c', 'd')
('e', 'f', 'g', 'h')
('i', 'j', 'k', 'l')
('m', 'n', 'o', 'p')
('q', 'r', 's', 't')
('u', 'v', 'w', 'x')
('y', 'z', None, None)
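
The cells above are Python 2. For reference, here is a sketch of the same grouper under Python 3, where `izip_longest` is named `zip_longest`. It also shows one way to drop the `None` padding from the final group before passing it on.

```python
from itertools import zip_longest  # named izip_longest in Python 2
import string

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

groups = list(grouper(4, string.ascii_lowercase))
print(groups[-1])

# drop the padding from the last group
last = [p for p in groups[-1] if p is not None]
print(last)
```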

In [379]:
# ggplot all players in groups of 4
for players in grouper(4, player_dd, None):
    print players
    df = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.player.isin(players)][['age', 'league', 'player']]
    #print df
    print ggplot(df, aes(x='age', y='league', color='player')) + \
        geom_point() + \
        geom_smooth(se=False) + facet_grid('player')
    
    break # remove break to see all groups


('Adam Lallana', "Alan A'Court", 'Alan Hansen', 'Alan Kennedy')
<ggplot: (15605140)>

Use of ggplot

I was keen to use ggplot because I intend to build an LFC Goal Machine equivalent in RStudio Shiny. Spyre is meant to make it easy to build Shiny-like apps so this should be straightforward, right? Well, sort of.

I've used ggplot in R and it is great. Yhat's python-ggplot is very good but still a work in progress. I was able to get ggplot to produce the scatter plot and line of best fit, but it was frustrating that I couldn't get it to cope with single points, add annotations or plot only integer values (as half goals don't make much sense!). But ggplot was good enough.
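
For comparison only, the three things I struggled with are available in plain matplotlib. The sketch below is not what the app uses (the app uses ggplot); it swaps in matplotlib to plot Robbie Fowler's first three seasons with integer-only goal ticks and a text annotation. The annotation text and its position are arbitrary choices.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# Robbie Fowler's first three seasons, from the table earlier in the notebook
ages = [18.7, 19.7, 20.7]
goals = [12, 25, 28]

fig, ax = plt.subplots()
ax.scatter(ages, goals)                                # scatter also copes with a single point
ax.yaxis.set_major_locator(MaxNLocator(integer=True))  # integer goal ticks only
ax.annotate('1995-1996', xy=(20.7, 28), xytext=(19.5, 27))  # simple text annotation
ax.set_xlabel('age')
ax.set_ylabel('league')
```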

Spyre App Deployment on Heroku

Reference Material:

The biggest challenge was fitting the lfcgm app within Heroku's slug limit of 300 MB. lfcgm uses spyre, ggplot and pandas, and these packages pull in others: Spyre uses numpy, pandas, cherrypy, jinja and matplotlib; ggplot uses matplotlib, pandas, numpy, scipy, statsmodels and patsy; pandas uses numpy, python-dateutil and pytz. As a result the app and its supporting packages are quite large. The biggest issue was scipy, which needs to be built on Heroku.

I use the excellent anaconda for my python data analysis and development. So my first stab at building the Heroku app used Kenneth Reitz' Heroku mini-conda buildpack. Unfortunately the app slug size with this buildpack was much too big.

Plan B (after lots of trial and error) used Brandon Liu's scipy buildpack, which builds numpy and scipy. This gave a compressed slug size of ~156 MB (phew).

The Heroku app's Procfile and requirements.txt are in the lfcgm github repo.
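
One deployment detail worth sketching: Heroku tells a web process which port to listen on through the PORT environment variable. The helper below, `resolve_port`, is a hypothetical name, and the commented launch call is an assumption about how the app's entry point might use it rather than code copied from lfcgm_app.py.

```python
import os

def resolve_port(default=8080):
    """Port from the PORT environment variable (set by Heroku), else a local default."""
    return int(os.environ.get('PORT', default))

port = resolve_port()
print(port)

# hypothetical use in the app's entry point (not copied from lfcgm_app.py):
# app.launch(host='0.0.0.0', port=port)
```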

Running the App

The app is available at lfcgm.herokuapp.com.

It is also available at lfcgm.lfcsorted.com. See here for guidance on setting up a custom domain.

App Management

As an aside, Heroku has some great addons for managing the app e.g. papertrail for log management.

App Data

The LFC Goal Machine app uses the following data files:

  • data/lfcgm_app_dropdown.csv (used to build the dropdown list of players)
  • data/lfc_scorers_tl_pos_age.csv (used to build the pandas dataframe of LFC scorers in top level league)

The data structure of these files is described in this notebook. However the data is not in the lfcgm github repository because the data is owned by lfchistory.net.

App Startup Times

Please be patient if the lfcgm app takes several seconds to wake up. The entry level 'heroku dyno' has some key limitations:

  • the dyno sleeps after a period of inactivity.
  • the dyno must sleep for 6 hours in each 24-hour period.

If the lfcgm app doesn't respond please try again later.

lfcgm github repo

For more information, including links to the R version of lfcgm, see the lfcgm github repo.

