LFC Data Analysis: The LFC Goal Machine (R version)

This is the R version of The LFC Goal machine. For a full description of the project see the lfcgm github repository. Note that the R version files are in the Rversion folder.

The notebook describes how the app was converted to run as an RShiny app.

The project uses Jupyter Notebook, R, R ggplot, R studio, R shiny, R dplyr.

Notebook Change Log

Last updated: 14th October 2017

Version: 2.2.0

Change Description:

  • Oct 2016, Add data files for season 2015-16 and check
  • Oct 2017, Add data files for season 2016-17 and check

Summary

The initial aim was simply to convert the web app that was developed using python, spyre and ggplot to an equivalent version using R, shiny and ggplot.

This was very straight-forward with Rshiny offering a richer UI than python spyre. The R files for the equivalent version are in the lfcgmRv1 folder. Note that R's ggplot handles a player with 2 datapoints differently from python's ggplot. THerefore a tweak was required to the R ggplot function to cope with players with only 2 data points.

The Rversion folder contains the latest R version with an enhanced (simplified) user interface. This takes advantage of Rshiny's selectizeInput with multiple inputs.


In [ ]:

Set-up

Import the R libraries needed for the analysis.


In [35]:
library(ggplot2)
library(dplyr)

Load the LFC data into R dataframes and munge

Create dataframe of players, combining scorers in top league level season with squad position and age


In [36]:
dflfcgm <- read.csv('data/lfc_scorers_tl_pos_age.csv', header=TRUE)

In [38]:
# check demension of dataframe, expect 1210 rows
dim(dflfcgm)


Out[38]:
  1. 1210
  2. 6

In [39]:
# check head includes latest season: 2016-17, top scorer is Phil Coutinho
head(dflfcgm)


Out[39]:
Xseasonplayerleaguepositionage
102016-2017Philippe Coutinho13Midfielder24.6
212016-2017Sadio Mane13Striker24.7
322016-2017Roberto Firmino11Striker25.3
432016-2017Adam Lallana8Midfielder28.7
542016-2017Divock Origi7Striker21.7
652016-2017James Milner7Defender31

In [40]:
# check column names
colnames(dflfcgm)


Out[40]:
  1. "X"
  2. "season"
  3. "player"
  4. "league"
  5. "position"
  6. "age"

In [41]:
# drop column X (the original row index from the file)
drops <- c('X')
dflfcgm <- dflfcgm[ , !(names(dflfcgm) %in% drops)]

In [42]:
# check dimensions, should be 5 columns
dim(dflfcgm)


Out[42]:
  1. 1210
  2. 5

In [43]:
head(dflfcgm)


Out[43]:
seasonplayerleaguepositionage
12016-2017Philippe Coutinho13Midfielder24.6
22016-2017Sadio Mane13Striker24.7
32016-2017Roberto Firmino11Striker25.3
42016-2017Adam Lallana8Midfielder28.7
52016-2017Divock Origi7Striker21.7
62016-2017James Milner7Defender31

In [44]:
# check tail, should show 1894-95
tail(dflfcgm)


Out[44]:
seasonplayerleaguepositionage
12051894-1895Malcolm McVean5Striker23.8
12061894-1895Frank Becton4Striker21.2
12071894-1895Neil Kerr3Midfielder23.7
12081894-1895Hugh McQueen2Midfielder27.3
12091894-1895Joe McQue1Defender21.8
12101894-1895Patrick Gordon1Midfielder24.9

In [45]:
summary(dflfcgm)
str(dflfcgm)


Out[45]:
       season                player         league                 position  
 1911-1912:  17   Steven Gerrard:  16   Min.   : 1.000   Centre-forward:  1  
 2015-2016:  17   Ian Rush      :  14   1st Qu.: 1.000   Defender      :303  
 1897-1898:  16   Ian Callaghan :  13   Median : 3.000   Half-back     :  2  
 1903-1904:  16   Ronnie Whelan :  13   Mean   : 5.423   Left-winger   :  7  
 1991-1992:  16   Tommy Smith   :  12   3rd Qu.: 7.000   Midfielder    :443  
 2008-2009:  16   Arthur Goddard:  11   Max.   :36.000   Right-winger  :  5  
 (Other)  :1112   (Other)       :1131                    Striker       :449  
      age       
 Min.   :17.10  
 1st Qu.:23.43  
 Median :25.90  
 Mean   :26.17  
 3rd Qu.:28.60  
 Max.   :36.00  
                
'data.frame':	1210 obs. of  5 variables:
 $ season  : Factor w/ 102 levels "1894-1895","1896-1897",..: 102 102 102 102 102 102 102 102 102 102 ...
 $ player  : Factor w/ 384 levels "Abel Xavier",..: 304 325 316 3 105 179 138 115 82 99 ...
 $ league  : int  13 13 11 8 7 7 6 5 3 2 ...
 $ position: Factor w/ 7 levels "Centre-forward",..: 5 7 7 5 7 2 5 5 7 2 ...
 $ age     : num  24.6 24.7 25.3 28.7 21.7 31 26.2 23 27.4 27.5 ...

Create dataframe the dropdown list of players for app


In [46]:
dflfcgm_dd <- read.csv('data/lfcgm_app_dropdown.csv', header=TRUE)

In [47]:
# check dimensions, expect 240 rows
dim(dflfcgm_dd)


Out[47]:
  1. 240
  2. 2

In [13]:
# check head, expect Adam Lallana at top (added in 2015-16)
head(dflfcgm_dd)


Out[13]:
labelvalue
1Adam LallanaAdam Lallana
2Alan A'CourtAlan A'Court
3Alan HansenAlan Hansen
4Alan KennedyAlan Kennedy
5Albert StubbinsAlbert Stubbins
6Alberto MorenoAlberto Moreno

In [62]:
# check dataframe includes new additions e.g. Divock Origi added 2016-17
dflfcgm_dd[dflfcgm_dd$value %in% c("Divock Origi", "Roberto Firmino","James Milner"),]


Out[62]:
labelvalue
65Divock OrigiDivock Origi
116James MilnerJames Milner
195Roberto FirminoRoberto Firmino

In [ ]:

Analyse the data

Ask a question and find the answer!

Create an R function to plot player's age vs top level league goals

This function will be at the heart of server.UI.


In [57]:
# start with a basic ggplot plot

# create a filter of the dataframe and produce a plot of the data points
df <- dflfcgm[dflfcgm$player %in% c('Luis Suarez', 'Steven Gerrard'),]
head(df)
ggplot(df, aes(x=age, y=league, color=player, shape=player)) + geom_point()  + ggtitle('test plot: StevieG and Suarez')


Out[57]:
seasonplayerleaguepositionage
302014-2015Steven Gerrard9Midfielder34.6
452013-2014Luis Suarez31Striker27
472013-2014Steven Gerrard13Midfielder33.6
572012-2013Luis Suarez23Striker26
592012-2013Steven Gerrard9Midfielder32.6
722011-2012Luis Suarez11Striker25

In [15]:
# add the line of best fit

# create a filter of the dataframe and produce a plot of the data points
df <- dflfcgm[dflfcgm$player %in% c('Luis Suarez', 'Steven Gerrard'),]
ggplot(df, aes(x=age, y=league, color=player, shape=player)) + 
            geom_point() +
            geom_smooth(se=FALSE) + 
            ggtitle('test plot: StevieG and Suarez')


geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
Warning message:
In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, : span too small.   fewer data values than degrees of freedom.Warning message:
In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, : pseudoinverse used at 23.985Warning message:
In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, : neighborhood radius 2.015Warning message:
In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, : reciprocal condition number  0Warning message:
In simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, : There are other near singularities as well. 4.0602

In [16]:
ggplot_age_vs_lgoals <- function(df, players) {
  # Return ggplot of Age vs League Goals for given players in dataframe.
  #
  #  Given the low number of points, ggplot's geom_smooth uses
  #  the loess method with default span.
  TITLE <- 'LFCGM Age vs League Goals'
  XLABEL <- 'Age at Midpoint of Season'
  YLABEL <- 'League Goals per Season'
  EXEMPLAR_PLAYERS <- c('Ian Rush', 'Kenny Dalglish', 'Roger Hunt', 'David Johnson', 
                        'Harry Chambers', 'John Toshack', 'John Barnes', 'Kevin Keegan')
  EXEMPLAR_TITLE <- 'LFCGM Example Plot, The Champions: Age vs League Goals'
  
  # if players vector is empty then set the default exemplar options
  if (length(players) == 0) {
    players <- EXEMPLAR_PLAYERS
    TITLE <- EXEMPLAR_TITLE
  } else {
    title <- TITLE
  }
  
  # create dataframes to plot...
  # filter those players with only 2 points and those with more than 2
  this_df <- df[df$player %in% players, ]
  this_dfeq2 <- this_df %>% group_by(player) %>% filter(n()==2)
  this_dfgt2 <- this_df %>% group_by(player) %>% filter(n()>2) 

  # produce the plot and return it
  this_plot <- ggplot(this_df, aes(x=age, y=league, color=player, shape=player)) + 
    geom_point(size=2) + 
    geom_line(data=this_dfeq2, size=0.1) +
    geom_smooth(data=this_dfgt2, se=FALSE, size=0.1) + 
    xlab(XLABEL) + 
    ylab(YLABEL) + 
    ggtitle(TITLE) + 
    scale_shape_manual(values=0:length(players)) +
    theme(legend.text=element_text(size=10))
  return (this_plot)
}

Show first few plots from original lfcgm data analysis

Ref: http://www.lfcsorted.com/2016/03/the-lfc-goal-machine-graphic-detail.html


In [17]:
# show default plot
players = c()
plt <- ggplot_age_vs_lgoals(dflfcgm, players)
suppressWarnings(print(plt))


geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

In [18]:
# produce plot for player known as 'god'
players = c('Robbie Fowler')
plt <- ggplot_age_vs_lgoals(dflfcgm, players)
suppressWarnings(print(plt))


geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

In [19]:
# check plot for a player with only 2 points
dflfcgm[dflfcgm$player == 'Andy Carroll',]

players = c('Andy Carroll')
plt <- ggplot_age_vs_lgoals(dflfcgm, players)
suppressWarnings(print(plt))


Out[19]:
seasonplayerleaguepositionage
752011-2012Andy Carroll4Striker23
932010-2011Andy Carroll2Striker22

In [20]:
# show all players scoring more than 20 goals when over 30 years old
df_late <- dflfcgm[(dflfcgm$league >= 20) & 
                   (dflfcgm$age > 30),]
df_late
players = df_late$player
players
plt <- ggplot_age_vs_lgoals(dflfcgm, players)
suppressWarnings(print(plt))


Out[20]:
seasonplayerleaguepositionage
3611988-1989John Aldridge21Striker30.3
7671946-1947Jack Balmer24Striker30.9
8151934-1935Gordon Hodgson27Striker30.7
9231925-1926Dick Forshaw27Striker30.4
10681908-1909Ronald Orr20Striker32.4
Out[20]:
  1. John Aldridge
  2. Jack Balmer
  3. Gordon Hodgson
  4. Dick Forshaw
  5. Ronald Orr
geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

In [21]:
# plot second of TOP_DUO seasons
players = c('Daniel Sturridge', 'Luis Suarez')
plt <- ggplot_age_vs_lgoals(dflfcgm, players)
suppressWarnings(print(plt))


geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

In [63]:
# check plot for a recent player
players = c('Roberto Firmino')
plt <- ggplot_age_vs_lgoals(dflfcgm, players)
suppressWarnings(print(plt))


Investigate the dropdown list of players (as used in v1.0.0)

and prototype preparing the player input data for UI.R


In [66]:
dim(dflfcgm_dd)


Out[66]:
  1. 240
  2. 2

In [67]:
str(dflfcgm_dd)


'data.frame':	240 obs. of  2 variables:
 $ label: Factor w/ 240 levels "Adam Lallana",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ value: Factor w/ 240 levels "Adam Lallana",..: 1 2 3 4 5 6 7 8 9 10 ...

In [68]:
head(dflfcgm_dd)


Out[68]:
labelvalue
1Adam LallanaAdam Lallana
2Alan A'CourtAlan A'Court
3Alan HansenAlan Hansen
4Alan KennedyAlan Kennedy
5Albert StubbinsAlbert Stubbins
6Alberto MorenoAlberto Moreno

In [69]:
tail(dflfcgm_dd)


Out[69]:
labelvalue
235Walter WadsworthWalter Wadsworth
236Willie DevlinWillie Devlin
237Willie FaganWillie Fagan
238Willie StevensonWillie Stevenson
239Xabi AlonsoXabi Alonso
240Yossi BenayounYossi Benayoun

In [70]:
length(dflfcgm_dd$value)


Out[70]:
240

In [71]:
dd_players = dflfcgm_dd$value

In [72]:
dd_players[1:5]


Out[72]:
  1. Adam Lallana
  2. Alan A'Court
  3. Alan Hansen
  4. Alan Kennedy
  5. Albert Stubbins

In [73]:
class(dd_players)


Out[73]:
"factor"

In [74]:
# create vector of strings containing the list of players for the input dropdowns
dd_p <- levels(dd_players)[1:5]

In [75]:
class(dd_p)


Out[75]:
"character"

In [76]:
dd_p


Out[76]:
  1. "Adam Lallana"
  2. "Alan A'Court"
  3. "Alan Hansen"
  4. "Alan Kennedy"
  5. "Albert Stubbins"

In [77]:
# add default 'empty' value to beginning of the player dropdown list
EMPTY = '<Select Player>'
print(EMPTY)
dd_p = c(EMPTY, dd_p)
print(dd_p)


[1] "<Select Player>"
[1] "<Select Player>" "Adam Lallana"    "Alan A'Court"    "Alan Hansen"    
[5] "Alan Kennedy"    "Albert Stubbins"

In [78]:
# investigate generating the dropdown list using lapply
p <- c()
lapply(1:8, function(i) {
  p <- c(p, paste0('dd', i))
})


Out[78]:
  1. "dd1"
  2. "dd2"
  3. "dd3"
  4. "dd4"
  5. "dd5"
  6. "dd6"
  7. "dd7"
  8. "dd8"

In [ ]: