This notebook analyses Liverpool FC's goal scorers, in particular exploring a scatter plot of each player's age against top level league goals scored. The plot is available as an interactive web app called the LFC Goal Machine here.
The notebook was used originally to explore and merge the input data, generate the required additional data and prototype the application. The notebook contains the key algorithms, some interesting plots and describes how the lfcgm app was built and deployed.
The project uses IPython Notebook, Python, pandas, matplotlib, numpy, ggplot, Spyre, Heroku and the Heroku scipy buildpack.
In [383]:
%%html
<!-- left align the change log table in next cell -->
<style>
table {float:left}
</style>
Date | Change Description |
---|---|
21st February 2016 | Initial baseline |
30th October 2016 | Added season 2015-16 |
12th October 2017 | Added season 2016-17 |
Import the python modules needed for the analysis.
In [384]:
from __future__ import division
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import sys
import collections
import os
from ggplot import *
from ggplot import __version__ as ggplot_version
from datetime import datetime
# enable inline plotting
%matplotlib inline
Print version numbers.
In [289]:
print 'python version: {}'.format(sys.version)
print 'pandas version: {}'.format(pd.__version__)
print 'matplotlib version: {}'.format(mpl.__version__)
print 'numpy version: {}'.format(np.__version__)
print 'ggplot version: {}'.format(ggplot_version)
In [290]:
# define key seasons
SEASON_END = '2016-2017' # most recent season
SEASON_START = '1892-1893' # first season in existence
LG_SEASON_START = '1893-1894' # first league season (2nd Division)
PLAYERS_CSV_MONTH = 'October' # month of players csv extract
PLAYERS_CSV_YEAR = '2017' # year of players csv extract
# define input csv files
# define scorers CSV file name (and check exists)
SCORERS_PREFIX = 'lfc_scorers'
SCORERS_CSV_FILE = '{}_{}_{}.csv'.format(SCORERS_PREFIX, SEASON_START, SEASON_END)
LFC_SCORERS_CSV_FILE = os.path.relpath('data/{}'.format(SCORERS_CSV_FILE))
assert os.path.isfile(LFC_SCORERS_CSV_FILE)
print 'LFC scorers csv file is: {}'.format(LFC_SCORERS_CSV_FILE)
# define squads CSV file name (and check exists)
SQUADS_PREFIX = 'lfc_squads'
SQUADS_CSV_FILE = '{}_{}_{}.csv'.format(SQUADS_PREFIX, SEASON_START, SEASON_END)
LFC_SQUADS_CSV_FILE = os.path.relpath('data/{}'.format(SQUADS_CSV_FILE))
assert os.path.isfile(LFC_SQUADS_CSV_FILE)
print 'LFC squads csv file is: {}'.format(LFC_SQUADS_CSV_FILE)
# define player appearances CSV file name (and check exists)
APPS_PREFIX = 'lfc_apps'
APPS_CSV_FILE = '{}_{}_{}.csv'.format(APPS_PREFIX, SEASON_START, SEASON_END)
LFC_APPS_CSV_FILE = os.path.relpath('data/{}'.format(APPS_CSV_FILE))
assert os.path.isfile(LFC_APPS_CSV_FILE)
print 'LFC appearances csv file is: {}'.format(LFC_APPS_CSV_FILE)
# define league CSV file name (and check exists)
LEAGUE_PREFIX = 'lfc_league'
LEAGUE_CSV_FILE = '{}_{}_{}.csv'.format(LEAGUE_PREFIX, LG_SEASON_START, SEASON_END)
LFC_LEAGUE_CSV_FILE = os.path.relpath('data/{}'.format(LEAGUE_CSV_FILE))
assert os.path.isfile(LFC_LEAGUE_CSV_FILE)
print 'LFC league csv file is: {}'.format(LFC_LEAGUE_CSV_FILE)
# define players CSV file name (and check exists)
PLAYERS_PREFIX = 'lfc_players'
PLAYERS_CSV_FILE_UPDATED = '{}_{}{}_upd.csv'.format(PLAYERS_PREFIX, PLAYERS_CSV_MONTH, PLAYERS_CSV_YEAR)
LFC_PLAYERS_CSV_FILE_UPDATED = os.path.relpath('data/{}'.format(PLAYERS_CSV_FILE_UPDATED))
assert os.path.isfile(LFC_PLAYERS_CSV_FILE_UPDATED)
print 'LFC players csv file is: {}'.format(LFC_PLAYERS_CSV_FILE_UPDATED)
# define generated csv files (this is the data used by the lfcgm app)
# define scorers in top league with position and age CSV file name
SCORERS_TL_POS_AGE_CSV_FILE = 'lfc_scorers_tl_pos_age.csv'
LFC_SCORERS_TL_POS_AGE_CSV_FILE = os.path.relpath('data/lfc_scorers_tl_pos_age.csv')
print 'LFC scorers in top league with position and age is: {}'.format(LFC_SCORERS_TL_POS_AGE_CSV_FILE)
# define dropdown CSV file name
LFCGM_DROPDOWN = os.path.relpath('data/lfcgm_app_dropdown.csv')
print 'LFC goal machine dropdown is: {}'.format(LFCGM_DROPDOWN)
In [291]:
print 'Loading LFC scorers csv from {}'.format(LFC_SCORERS_CSV_FILE)
dflfc_scorers = pd.read_csv(LFC_SCORERS_CSV_FILE)
# sort by season, then league goals
dflfc_scorers = dflfc_scorers.sort_values(['season', 'league'], ascending=([False, False]))
dflfc_scorers.shape
Out[291]:
In [292]:
dflfc_scorers.head()
Out[292]:
In [293]:
dflfc_scorers.tail()
Out[293]:
In [294]:
# note that scorers includes own goals
dflfc_scorers[dflfc_scorers.player == 'Own goals'].head()
Out[294]:
In [295]:
# note: war years already excluded in input files
LANCS_YRS = ['1892-1893']
SECOND_DIV_YRS = ['1893-1894', '1895-1896', '1904-1905', '1954-1955',
                  '1955-1956', '1956-1957', '1957-1958', '1958-1959',
                  '1959-1960', '1960-1961', '1961-1962']
NOT_TOP_LEVEL_YRS = LANCS_YRS + SECOND_DIV_YRS
dflfc_scorers_tl = dflfc_scorers[~dflfc_scorers.season.isin(NOT_TOP_LEVEL_YRS)].copy()
dflfc_scorers_tl.shape
Out[295]:
In [299]:
## check number of top level seasons aligns with http://www.lfchistory.net/Stats/LeagueOverall
## expect 102 total for top level seasons from 1894-95 to 2016-17
print 'the number of seasons is {}'.format(len(dflfc_scorers_tl.season.unique()))
assert len(dflfc_scorers_tl.season.unique()) == 102
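The expected total of 102 top level seasons can be cross-checked with simple arithmetic. This is only a sketch: the wartime gaps (1915-16 to 1918-19 and 1939-40 to 1945-46) are my assumption about when no league seasons were played, not something read from the input files, and the variable names are illustrative.

```python
# seasons are identified by their end year: 1894-95 -> 1895, 2016-17 -> 2017
FIRST_END_YEAR, LAST_END_YEAR = 1895, 2017
all_slots = LAST_END_YEAR - FIRST_END_YEAR + 1        # 123 possible seasons

# assumption: no league seasons in 1915-16 to 1918-19 and 1939-40 to 1945-46
war_seasons = len(range(1916, 1920)) + len(range(1940, 1947))   # 4 + 7

# second division seasons listed in the notebook that fall inside the range
second_div_end_years = [1896, 1905] + list(range(1955, 1963))   # 2 + 8

top_level_seasons = all_slots - war_seasons - len(second_div_end_years)
```

Under those assumptions the count comes to 123 - 11 - 10 = 102, which matches the assert above.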
In [300]:
# show most league goals in a season in top level
# cross-check with http://en.wikipedia.org/wiki/List_of_Liverpool_F.C._records_and_statistics#Goalscorers
# expect 101 in 2013-14
assert dflfc_scorers_tl[['season', 'league']].groupby(['season'])\
.sum().sort_values('league', ascending=False).head(1).reset_index().values.tolist()[0] == ['2013-2014', 101]
dflfc_scorers_tl[['season', 'league']].groupby(['season']).sum().sort_values('league', ascending=False).head(1)
Out[300]:
In [301]:
# remove OG
dflfc_scorers_tl = dflfc_scorers_tl[dflfc_scorers_tl.player != 'Own goals']
dflfc_scorers_tl.shape
Out[301]:
In [302]:
# check 2016-17
dflfc_scorers_tl[dflfc_scorers_tl.season == '2016-2017'].head(10)
Out[302]:
In [303]:
print 'Loading LFC squads csv from {}'.format(LFC_SQUADS_CSV_FILE)
dflfc_squads = pd.read_csv(LFC_SQUADS_CSV_FILE)
dflfc_squads.shape
Out[303]:
In [304]:
dflfc_squads.head()
Out[304]:
In [305]:
dflfc_squads.tail()
Out[305]:
In [306]:
print 'Loading LFC league csv from {}'.format(LFC_LEAGUE_CSV_FILE)
dflfc_league = pd.read_csv(LFC_LEAGUE_CSV_FILE)
dflfc_league.shape
Out[306]:
In [307]:
dflfc_league.head()
Out[307]:
In [308]:
dflfc_league.tail()
Out[308]:
In [309]:
dflfc_scorers_tl_pos = pd.DataFrame.merge(dflfc_scorers_tl, dflfc_squads)
dflfc_scorers_tl_pos.shape
Out[309]:
In [310]:
dflfc_scorers_tl_pos.head()
Out[310]:
In [311]:
dflfc_scorers_tl_pos.tail()
Out[311]:
In [312]:
print 'Loading LFC players csv from {}'.format(LFC_PLAYERS_CSV_FILE_UPDATED)
dflfc_players = pd.read_csv(LFC_PLAYERS_CSV_FILE_UPDATED, parse_dates=['birthdate'])
assert dflfc_players.birthdate.dtypes == 'datetime64[ns]'
dflfc_players.shape
Out[312]:
In [313]:
dflfc_players.head()
Out[313]:
In [314]:
dflfc_players.tail()
Out[314]:
Add players age to the dflfc_scorers_tl_pos dataframe
In [315]:
def age_at_season(row):
    """Return player's age at mid-point of season, assumed to be 1st Jan.
    row.player -> player's name
    row.season -> season
    uses dflfc_players to look up birthdate, keyed on player
    - return average age if player is missing from dflfc_players
    """
    AVERAGE_AGE = 26.5
    mid_point = '01 January {}'.format(row.season[-4:])
    try:
        dob = dflfc_players[dflfc_players.player == row.player].birthdate.values[0]
    except IndexError:
        # use average age if player's birthdate not available
        print 'error: age not found for player {} in season {}, using average age {}'.format(row.player,
                                                                                             row.season,
                                                                                             AVERAGE_AGE)
        return AVERAGE_AGE
    return round((pd.Timestamp(mid_point) - dob).days/365.0, 1)
In [316]:
# add age column
# no errors expected
dflfc_scorers_tl_pos['age'] = dflfc_scorers_tl_pos.apply(lambda row: age_at_season(row), axis=1)
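To make the age calculation concrete, here is a minimal standalone sketch of the same idea using only the standard library. The function name and the example birthdate are hypothetical; the notebook itself looks the birthdate up in dflfc_players.

```python
from datetime import date

def age_at_mid_season(birthdate, season):
    """Return age in years (1 d.p.) on 1 January of the season's second year.

    birthdate: datetime.date; season: string such as '1993-1994'.
    """
    mid_point = date(int(season[-4:]), 1, 1)
    return round((mid_point - birthdate).days / 365.0, 1)

# hypothetical player born 1 July 1975, in season 1993-1994
example_age = age_at_mid_season(date(1975, 7, 1), '1993-1994')
```

Dividing by 365.0 ignores leap days, so the result is an approximation, which is fine at one decimal place.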
In [317]:
dflfc_scorers_tl_pos_age = dflfc_scorers_tl_pos.copy()
In [318]:
dflfc_scorers_tl_pos_age.head()
Out[318]:
In [319]:
dflfc_scorers_tl_pos_age.to_csv(LFC_SCORERS_TL_POS_AGE_CSV_FILE, header=True, sep=',')
assert os.path.isfile(LFC_SCORERS_TL_POS_AGE_CSV_FILE)
In [320]:
# read the appearance file
print 'Loading LFC appearances csv from {}'.format(LFC_APPS_CSV_FILE)
dflfc_lgapps = pd.read_csv(LFC_APPS_CSV_FILE)
print dflfc_lgapps.shape
dflfc_lgapps.head()
Out[320]:
In [321]:
dflfc_scorers_tl_pos_age_apps = dflfc_scorers_tl_pos_age.merge(dflfc_lgapps)
print dflfc_scorers_tl_pos_age_apps.shape
print dflfc_scorers_tl_pos.shape
In [322]:
dflfc_scorers_tl_pos_age_apps.head()
Out[322]:
In [323]:
dflfc_scorers_tl_pos_age_apps.tail()
Out[323]:
In [324]:
def ggplot_age_vs_lgoals(df, players):
    """Return ggplot of Age vs League Goals for given list of players in dataframe.
    Given the low number of points, ggplot's geom_smooth uses
    the loess method with default span."""
    TITLE = 'LFCGM Age vs League Goals'
    XLABEL = 'Age at Midpoint of Season'
    YLABEL = 'League Goals per Season'
    EXEMPLAR_PLAYERS = ['Ian Rush', 'Kenny Dalglish', 'Roger Hunt', 'David Johnson',
                        'Harry Chambers', 'John Toshack', 'John Barnes', 'Kevin Keegan']
    EXEMPLAR_TITLE = 'LFCGM Example Plot, The Champions: Age vs League Goals'
    # if players list is empty then set the default exemplar options
    if not players:
        players = EXEMPLAR_PLAYERS
        TITLE = EXEMPLAR_TITLE
    # filter dataframe for given players and plot
    this_df = df[df.player.isin(players)]
    this_plot = ggplot(this_df, aes(x='age', y='league', color='player', shape='player')) + \
                geom_point() + \
                geom_smooth(se=False) + \
                xlab(XLABEL) + \
                ylab(YLABEL) + \
                scale_y_discrete(limits=(0, this_df.league.max() + 1)) + \
                ggtitle(TITLE)
    return this_plot
In [325]:
# show the default plot, showing the champions (see below for more info on 'The Champions')
players = []
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
Out[325]:
In [326]:
# show type of ggplot.draw() object
# Note that Spyre (v0.2) can only handle matplotlib object or pyplot figure
players = []
ggp = ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
type(ggp.draw())
Out[326]:
Early Riser
In [327]:
# show all players scoring 20 or more league goals in a season when under 20 years old
dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.league >= 20) &
(dflfc_scorers_tl_pos_age.age < 20)]
Out[327]:
In [328]:
# produce plot for player known as 'god'
players = ['Robbie Fowler']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
Out[328]:
Late Flourish
In [329]:
# show all players scoring 20 or more league goals in a season when over 30 years old
df_late = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.league >= 20) &
(dflfc_scorers_tl_pos_age.age > 30)]
df_late
Out[329]:
In [330]:
players = df_late.player.values
print sorted(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[330]:
All time career top scorers
In [331]:
# show players scoring most league goals over their career
df_top = dflfc_scorers_tl_pos_age[['player', 'league']].groupby('player').sum()
df_top = df_top.sort_values('league', ascending=False).head(12)
df_top
Out[331]:
In [332]:
players = df_top[df_top.league >= 120].index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[332]:
Elite 30
In [333]:
# show the highest league goal tallies in a season (the plot below selects tallies above 30)
df_elite = dflfc_scorers_tl_pos_age[['season', 'player', 'league']].sort_values('league', ascending=False)
df_elite.head(10)
Out[333]:
In [334]:
players = df_elite[df_elite.league > 30].player.unique()
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[334]:
A Striking Trio
In [335]:
# show best total for a striking trio in the league
df_trio = dflfc_scorers_tl_pos_age[['season', 'league']].groupby('season').head(3).groupby('season').sum()
df_trio.sort_values('league', ascending=False).head(10)
Out[335]:
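The trio total relies on the rows already being sorted by league goals within each season, so that head(3) picks each season's top three scorers. A minimal sketch on toy data (hypothetical players and tallies) shows the pattern:

```python
import pandas as pd

# toy data, already sorted by season, then by goals (descending)
toy = pd.DataFrame({
    'season': ['2001-2002'] * 4 + ['2000-2001'] * 4,
    'player': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
    'league': [20, 10, 5, 1, 15, 15, 12, 9],
})

# head(3) keeps the first three rows of each season, i.e. the top three scorers
top3 = toy.groupby('season').head(3)
trio_totals = top3.groupby('season')['league'].sum().sort_values(ascending=False)
```

If the input were not sorted, head(3) would simply return the first three rows it finds for each season, so sorting first is what makes this a "top three".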
In [336]:
TOP_TRIO = ['1963-1964']
df_trio_players = dflfc_scorers_tl_pos_age[['season', 'player', 'league']]\
[dflfc_scorers_tl_pos_age.season.isin(TOP_TRIO)].groupby('season').head(3)
df_trio_players
Out[336]:
In [337]:
players = df_trio_players.player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[337]:
A Striking Duo
In [338]:
# show best total for a striking duo in the league
df_duo = dflfc_scorers_tl_pos_age[['season', 'league']].groupby('season').head(2).groupby('season').sum()
df_duo.sort_values('league', ascending=False).head(10)
Out[338]:
In [339]:
TOP_DUO = ['1963-1964', '2013-2014']
df_duo_players = dflfc_scorers_tl_pos_age[['season', 'player', 'league']]\
[dflfc_scorers_tl_pos_age.season.isin(TOP_DUO)].groupby('season').head(2)
df_duo_players
Out[339]:
In [340]:
# plot first of TOP_DUO seasons
players = df_duo_players[df_duo_players.season == TOP_DUO[0]].player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[340]:
In [341]:
# plot second of TOP_DUO seasons
players = df_duo_players[df_duo_players.season == TOP_DUO[1]].player.values
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[341]:
Performance of Liverpool players who went on to be managers
In [342]:
# produce list of managers - ref: http://www.lfchistory.net/Managers/
MANAGERS = ['William Barclay', 'Tom Watson', 'David Ashworth', 'Matt McQueen', 'George Patterson',\
'George Kay', 'Don Welsh', 'Phil Taylor', 'Bill Shankly', 'Bob Paisley', 'Joe Fagan',\
'Kenny Dalglish', 'Graeme Souness', 'Roy Evans', 'Gerard Houllier',\
'Rafael Benitez', 'Roy Hodgson', 'Kenny Dalglish', 'Brendan Rodgers', 'Jurgen Klopp']
# excludes Ronnie Moran who was temporary manager in 1991
In [343]:
# produce list of players (who scored in more than 1 season at top level) who were managers
df_mgrs = dflfc_scorers_tl_pos_age[['player', 'league']][dflfc_scorers_tl_pos_age.player.isin(MANAGERS)]\
.groupby('player').sum().sort_values('league', ascending=False)
df_mgrs
Out[343]:
In [344]:
players = df_mgrs.index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[344]:
Top midfielders
In [345]:
# show midfielders who have scored more than 15 league goals in a season
df_mids = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Midfielder') &
(dflfc_scorers_tl_pos_age.league > 15)].sort_values('league', ascending=False)
df_mids
Out[345]:
In [346]:
players = df_mids.sort_values('league', ascending=False).player.unique()
print len(players), players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[346]:
Top Defenders
In [347]:
# show defenders who have scored more than 6 league goals in a season
df_defs = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Defender') &
(dflfc_scorers_tl_pos_age.league > 6)].sort_values('league', ascending=False)
df_defs
Out[347]:
In [348]:
players = df_defs.sort_values('league', ascending=False).player.unique()
print len(players), players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[348]:
Peak Performance
In [349]:
# show player with top score in a season, Gordon Hodgson
top_player = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.league == max(dflfc_scorers_tl_pos_age.league)]
top_player
Out[349]:
In [350]:
players = top_player.player.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[350]:
Rocket Men
Show the players who scored 50 or more goals in total across 3 or more consecutive scoring seasons, with a goal tally that did not fall from one season to the next.
In [351]:
# create dataframe ordered by player and season
df = dflfc_scorers_tl_pos_age.groupby(['player', 'season']).sum()
df.head(12)
Out[351]:
In [352]:
def linefit(x, y):
    """Return gradient and intercept of straight line of best fit for given x and y arrays."""
    gradient, intercept = np.polyfit(x, y, 1)
    return gradient, intercept
In [353]:
# test linefit()
# using y = 2x^2 + 6
x=np.array([-1, 0, 1, 2])
print x
y=2*x*x + 6
print y
print plt.plot(x, y)
gradient, intercept = linefit(x, y)
print np.round(gradient, 1), np.round(intercept, 1)
print plt.plot(x, gradient*x + intercept)
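As a cross-check on the test above, the least squares line can be worked out by hand for the same four points. The sketch below uses the closed-form formulas (gradient = Sxy / Sxx, intercept = mean(y) - gradient * mean(x)) and compares them with np.polyfit; the variable names are illustrative.

```python
import numpy as np

x = np.array([-1, 0, 1, 2])
y = 2 * x * x + 6            # [8, 6, 8, 14]

# closed-form least squares for a straight line
sxy = np.sum((x - x.mean()) * (y - y.mean()))   # 10.0
sxx = np.sum((x - x.mean()) ** 2)               # 5.0
gradient = sxy / sxx
intercept = y.mean() - gradient * x.mean()

# np.polyfit with degree 1 should agree with the hand calculation
fit_gradient, fit_intercept = np.polyfit(x, y, 1)
```

For these points the best fit line is y = 2x + 8, even though the underlying curve is quadratic.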
In [354]:
# Show the players who scored 50 or more goals across 3 or more consecutive scoring seasons,
# with a goal tally that did not fall from one season to the next.
MIN_SEASONS = 3
MIN_TOTAL_GOALS = 50
def report_if_rocket_man(player, Lg, La, Ls):
    """Print the run if it is long enough and has enough goals in total."""
    if len(Lg) >= MIN_SEASONS and sum(Lg) >= MIN_TOTAL_GOALS:
        grad, intercept = linefit(np.array(range(len(Lg))), np.array(Lg))
        print 'Rocket man: {}, goals={}, start_season={}, start_age={}, total={}, grad={}'\
              .format(player, Lg, Ls[0], La[0], sum(Lg), np.round(grad, 2))
p_prev = None # previous player
l_prev = None # previous league goals
Lg = [] # List of consecutive goals
La = [] # List of consecutive ages
Ls = [] # List of consecutive seasons
# iterate through dataframe
# for each row of (player, season) (league goals, age)
for (p, s), (l, a) in df.iterrows():
    if p != p_prev or l < l_prev:
        # new player, or goals fell: check the run that just ended and start a new one
        report_if_rocket_man(p_prev, Lg, La, Ls)
        Lg = []
        La = []
        Ls = []
    Lg.append(l)
    La.append(a)
    Ls.append(s)
    l_prev = l
    p_prev = p
# check the final run, which the loop itself never reports
report_if_rocket_man(p_prev, Lg, La, Ls)
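The run detection can also be sketched as a small standalone function for a single player, which makes it easier to test on toy data. The function name, the season labels and the goal tallies below are hypothetical; the thresholds mirror MIN_SEASONS and MIN_TOTAL_GOALS.

```python
def rising_runs(goals_by_season, min_seasons=3, min_total=50):
    """Return runs of consecutive seasons where goals never fall, that are
    at least min_seasons long and total at least min_total goals.

    goals_by_season: list of (season, goals) tuples in season order.
    """
    runs, current = [], []
    for season, goals in goals_by_season:
        if current and goals < current[-1][1]:
            runs.append(current)      # goals fell, so the run ends here
            current = []
        current.append((season, goals))
    runs.append(current)              # do not forget the final run
    return [run for run in runs
            if len(run) >= min_seasons and sum(g for _, g in run) >= min_total]

# hypothetical career: a rising run of 12, 25, 28 goals, then a drop
career = [('S1', 12), ('S2', 25), ('S3', 28), ('S4', 18), ('S5', 9)]
rocket_runs = rising_runs(career)
```

Only the first three seasons qualify here: the run is three seasons long and totals 65 goals, while the later runs are too short.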
Top 5 Rocket Men (sorted by gradient of line of best fit) are
In [355]:
# show example graph of the rocket portion of the player's career e.g. Robbie Fowler
p = 'Robbie Fowler'
Lg = [12.0, 25.0, 28.0]
dfp = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.player == p) &
                               (dflfc_scorers_tl_pos_age.league.isin(Lg))]
print dfp
print ggplot_age_vs_lgoals(dfp, [p])
In [356]:
# Just a few of my early favourites
players = ['Kevin Keegan', 'Kenny Dalglish', 'Steve Heighway']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
Out[356]:
Highest scoring midfielders over career
In [357]:
df = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Midfielder')]\
.groupby('player').sum()
print df[df.league > 50].sort_values('league', ascending=False)['league']
players = df[df.league > 50].sort_values('league', ascending=False).index.unique()
print len(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[357]:
Highest scoring defenders over career
In [358]:
df = dflfc_scorers_tl_pos_age[(dflfc_scorers_tl_pos_age.position == 'Defender')]\
.groupby('player').sum()
print df[df.league > 20].sort_values('league', ascending=False)['league']
players = df[df.league > 20].sort_values('league', ascending=False).index.unique()
print len(players)
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[358]:
The Champions
In [359]:
# create list of seasons when LFC were champions
CHAMPS = ['1900-1901', '1905-1906', '1921-1922', '1922-1923', '1946-1947', '1963-1964',\
'1965-1966', '1972-1973', '1975-1976', '1976-1977', '1978-1979', '1979-1980',\
'1981-1982', '1982-1983', '1983-1984', '1985-1986', '1987-1988', '1989-1990']
print len(CHAMPS)
In [360]:
# show total goals over career in title winning teams
df_champs = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season.isin(CHAMPS)][['league', 'player']].groupby('player').sum()\
.sort_values('league', ascending=False).head(12)
df_champs
Out[360]:
In [361]:
# plot top 8
players = df_champs.index.values[:8]
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[361]:
In [362]:
# show highest scorers in a title winning season
dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season.isin(CHAMPS)].sort_values('league', ascending=False).head(12)
Out[362]:
European Cup Winning Team, May 1977
In [363]:
players = ['Ray Clemence', 'Phil Neal', 'Joey Jones', 'Tommy Smith',
'Ray Kennedy', 'Emlyn Hughes', 'Kevin Keegan', 'Jimmy Case',
'Steve Heighway', 'Ian Callaghan', 'Terry McDermott']
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, players)
Out[363]:
Best goals per game
In [364]:
# calculate GPG for each player season
dflfc_scorers_tl_pos_age_apps['GPG'] = (dflfc_scorers_tl_pos_age_apps.league/dflfc_scorers_tl_pos_age_apps.lgapp).round(3)
In [365]:
dflfc_scorers_tl_pos_age_apps.head()
Out[365]:
In [366]:
# show best GPG per season where appearance > 10
dflfc_scorers_tl_pos_age_apps[dflfc_scorers_tl_pos_age_apps.lgapp > 10].sort_values('GPG', ascending=False).head(15)
Out[366]:
In [367]:
# show best Career GPG (CGPG) per career where appearance > 50
df_gpg = dflfc_scorers_tl_pos_age_apps[['player', 'league', 'lgapp']].groupby('player').sum()
df_gpg['CGPG'] = (df_gpg.league/df_gpg.lgapp).round(3) # career goals per game
df_gpg['CMPG'] = (df_gpg.lgapp*90/df_gpg.league).round(3) # career minutes per goal (assume all apps = 90 mins)
df_gpg[df_gpg.lgapp > 50].sort_values('CGPG', ascending=False).head(12)
Out[367]:
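The career figures are computed from summed goals and appearances rather than by averaging the per-season GPG values, and the two can differ. A minimal sketch on toy data (two hypothetical players) illustrates this:

```python
import pandas as pd

# toy data: two hypothetical players, two seasons each
toy = pd.DataFrame({
    'player': ['A', 'A', 'B', 'B'],
    'league': [10, 20, 5, 5],      # league goals
    'lgapp':  [30, 30, 40, 10],    # league appearances
})

career = toy.groupby('player')[['league', 'lgapp']].sum()
career['CGPG'] = (career.league / career.lgapp).round(3)        # career goals per game
career['CMPG'] = (career.lgapp * 90 / career.league).round(3)   # minutes per goal, assuming 90 mins per appearance

# for comparison: the mean of the per-season GPG values
toy['GPG'] = toy.league / toy.lgapp
mean_season_gpg = toy.groupby('player')['GPG'].mean()
```

Player B has a career GPG of 0.2 (10 goals in 50 games), but the mean of the two season values is 0.3125, because the short 10-game season counts as much as the 40-game one in a simple average.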
In [368]:
# plot top 6 goal scorers with best Career GPG
players = df_gpg[df_gpg.lgapp > 50].sort_values('CGPG', ascending=False).head(6).index.values
print players
ggplot_age_vs_lgoals(dflfc_scorers_tl_pos_age, list(players))
Out[368]:
Note that the number of league games has varied over the top level seasons.
In [369]:
# show number of different total games
print dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])].PLD.unique()
In [370]:
# show number of seasons for each total
dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])][['PLD', 'Season']].groupby('PLD').count()
Out[370]:
In [371]:
# show the seasons for each total
dflfc_league[dflfc_league.League.isin(['1st Division', 'Premier League'])][['PLD', 'Season']]\
.groupby('PLD')['Season'].apply(lambda x: ','.join(x))
Out[371]:
In [372]:
# confirm that season 1939-1940 is not included in analysis
len(dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.season == '1939-1940'])
Out[372]:
Useful reference material:
Spyre is a web app framework for providing a simple user interface for Python data projects. In simple terms the app involves:
See lfcgm_app.py in the lfcgm github repo for the app source code.
Note: ggplot does not plot single points correctly, so I decided to restrict the list of players to those who've scored in more than 1 season. For more information on the issue see this stackoverflow question.
In [373]:
# count the seasons in which each player scored 1 or more goals
# note: goal_tot holds the number of scoring seasons (data points), not the number of goals
df_pgoals = dflfc_scorers_tl_pos_age[['player', 'age']].groupby('player').count()
df_pgoals.columns = ['goal_tot']
print df_pgoals.head(10)
print '\nThere are {} goal scorers.'.format(len(df_pgoals))
print "That's {} data points.".format(df_pgoals.sum().values[0])
print 'Of these {} have scored in more than one season.'.format(df_pgoals[df_pgoals.goal_tot > 1].count().values[0])
print "That's {} data points.".format(df_pgoals[df_pgoals.goal_tot > 1].sum().values[0])
print '\nHere are the first of those scoring in more than one season...'
df_pgoals_gt1 = df_pgoals[df_pgoals.goal_tot > 1]
df_pgoals_gt1.head()
Out[373]:
In [374]:
# produce dataframe of players for the Spyre dropdown and save to csv
# this csv will be read by the app
player_dd = df_pgoals_gt1.index.values
player_dd
# create dropdown dataframe (with label and value)
df_dropdown = pd.DataFrame({'label': player_dd, 'value': player_dd}, columns=['label', 'value'])
print df_dropdown.head()
# save dropdown label and value to csv
df_dropdown[['label', 'value']].to_csv(LFCGM_DROPDOWN, index=False)
assert os.path.isfile(LFCGM_DROPDOWN)
In [375]:
# show players added in latest season
LATEST_SEASON = SEASON_END
scorers_for_latest_season_set = set(dflfc_scorers_tl[dflfc_scorers_tl.season == LATEST_SEASON].player.values)
this_season_scorers_df = dflfc_scorers_tl[dflfc_scorers_tl.player.isin(scorers_for_latest_season_set)]\
[['player', 'season']].groupby('player').count()
players_added = this_season_scorers_df[this_season_scorers_df.season == 2].index.values
print 'there are {} new players added in {}; these are:'.format(len(players_added), LATEST_SEASON)
', '.join(players_added)
Out[375]:
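The "players added" logic is easy to misread, so here is a minimal sketch of the same idea without pandas. A player joins the dropdown in the latest season if that season is one of exactly two in which they have scored, since the dropdown only lists players with more than one scoring season. The records below are hypothetical.

```python
from collections import Counter

# hypothetical (player, season) scoring records for top level seasons
records = [('A', '2014-2015'), ('A', '2015-2016'), ('A', '2016-2017'),
           ('B', '2015-2016'), ('B', '2016-2017'),
           ('C', '2016-2017'),
           ('D', '2013-2014'), ('D', '2014-2015')]
latest = '2016-2017'

seasons_scored = Counter(player for player, _ in records)
latest_scorers = {player for player, season in records if season == latest}

# exactly 2 scoring seasons, one of them the latest: the player has just qualified
players_added = sorted(p for p in latest_scorers if seasons_scored[p] == 2)
```

Here only B is newly added: A was already in the list, C has scored in just one season, and D did not score in the latest season.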
In [376]:
# read dropdown csv and create dropdown_dict in the format that the Spyre app will use
dropdown_df = pd.read_csv(LFCGM_DROPDOWN)
print dropdown_df.head(), '\n'
# create dict in format Spyre app will use
dropdown_dict = dropdown_df.to_dict(orient='records')
# check length of the list of dropdown records (expect 240)
# 233
# + 4 players added in 2015-16
# + 3 players added in 2016-17
print len(dropdown_dict)
# show the new players
dropdown_df[dropdown_df.value.isin(players_added)]
Out[376]:
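For reference, to_dict(orient='records') turns each row into its own dictionary, giving a list with one label/value pair per player. A minimal sketch with two player names:

```python
import pandas as pd

players = ['Ian Rush', 'Roger Hunt']
dropdown_df = pd.DataFrame({'label': players, 'value': players}, columns=['label', 'value'])

# one dict per row, keyed by column name
dropdown_options = dropdown_df.to_dict(orient='records')
```

Despite the name dropdown_dict used above, the result is a list of dicts, which is why its length equals the number of players.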
A final bit of data analysis to help find interesting plots...
In [377]:
# create grouper function to iterate in chunks of n
# ref: http://stackoverflow.com/questions/8991506/iterate-an-iterator-by-chunks-of-n-in-python
import string
from itertools import izip_longest
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
In [378]:
# test grouper
tmp = string.ascii_lowercase
for g in grouper(4, tmp, None):
    print g
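This notebook runs on Python 2, where the function is called izip_longest. For anyone re-running the idea on Python 3, a minimal equivalent sketch uses itertools.zip_longest instead. The trick is that the same iterator object is repeated n times, so each output tuple consumes n consecutive items.

```python
from itertools import zip_longest

def grouper(n, iterable, fillvalue=None):
    """grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"""
    args = [iter(iterable)] * n       # n references to one shared iterator
    return zip_longest(*args, fillvalue=fillvalue)

chunks = list(grouper(3, 'ABCDEFG', 'x'))
```

The final chunk is padded with the fill value, so code that consumes the chunks needs to tolerate it (for example None entries in the last group of players).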
In [379]:
# ggplot all players in groups of 4
for players in grouper(4, player_dd, None):
    print players
    df = dflfc_scorers_tl_pos_age[dflfc_scorers_tl_pos_age.player.isin(players)][['age', 'league', 'player']]
    #print df
    print ggplot(df, aes(x='age', y='league', color='player')) + \
          geom_point() + \
          geom_smooth(se=False) + facet_grid('player')
    break # remove break to see all groups
I was keen to use ggplot because I intend to build an LFC Goal Machine equivalent in RStudio Shiny. Spyre is meant to make it easy to build Shiny-like apps so this should be straightforward, right? Well, sort of.
I've used ggplot in R and it is great. Yhat's python-ggplot is very good but still a work in progress. I was able to get ggplot to produce the scatter plot and line of best fit, but it was frustrating that I couldn't get it to cope with single points, add annotations or plot only integer values (as half goals don't make much sense!). But ggplot was good enough.
Reference Material:
The biggest challenge was fitting the lfcgm app in Heroku's slug limit of 300 MB. lfcgm uses spyre, ggplot and pandas. These packages pull in other packages. Spyre uses numpy, pandas, cherrypy, jinja and matplotlib. Ggplot uses matplotlib, pandas, numpy, scipy, statsmodels and patsy. Pandas uses numpy, python-dateutil and pytz. This means that the size of the app and supporting packages is quite large. The biggest issue was scipy which needs to be built on Heroku.
I use the excellent anaconda for my python data analysis and development. So my first stab at building the Heroku app used Kenneth Reitz's Heroku mini-conda buildpack. Unfortunately the app slug size with this buildpack was much too big.
Plan B (after lots of trial and error) used Brandon Liu's scipy buildpack, which builds numpy and scipy. This allowed me to produce a compressed slug size of ~156 MB (phew).
The Heroku app's Procfile and requirements.txt are in the lfcgm github repo.
The app is available at lfcgm.herokuapp.com.
It is also available at lfcgm.lfcsorted.com. See here for guidance on setting up a custom domain.
As an aside, Heroku has some great addons for managing the app e.g. papertrail for log management.
The LFC Goal Machine app uses the following data files:
The data structure of these files is described in this notebook. However the data is not in the lfcgm github repository because the data is owned by lfchistory.net.
Please be patient if the lfcgm app takes several seconds to wake up. The entry level 'heroku dyno' has some key limitations:
If the lfcgm app doesn't respond please try again later.
For more information, including links to the R version of lfcgm, see the lfcgm github repo.