Redcard Exploratory Data Analysis

This dataset is taken from a fantastic paper that looks to see how analytical choices made by different data science teams on the same dataset in an attempt to answer the same research question affect the final outcome.

Many analysts, one dataset: Making transparent how variations in analytical choices affect results

The data can be found here.

The Task

Do an Exploratory Data Analysis on the redcard dataset. Keeping in mind the question is the following: Are soccer referees more likely to give red cards to dark-skin-toned players than light-skin-toned players?

Before plotting/joining/doing something, have a question or hypothesis that you want to investigate
Draw a plot of what you want to see on paper to sketch the idea
Write it down, then make the plan on how to get there
How do you know you aren't fooling yourself
What else can I check if this is actually true?
What evidence could there be that it's wrong?



In [1]:

    
%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time

About the Data

The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website.

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). We created a dataset of playerâreferee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.

-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit

Variable Name:	Variable Description:
playerShort	short player ID
player	player name
club	player club
leagueCountry	country of player club (England, Germany, France, and Spain)
height	player height (in cm)
weight	player weight (in kg)
position	player position
games	number of games in the player-referee dyad
goals	number of goals in the player-referee dyad
yellowCards	number of yellow cards player received from the referee
yellowReds	number of yellow-red cards player received from the referee
redCards	number of red cards player received from the referee
photoID	ID of player photo (if available)
rater1	skin rating of photo by rater 1
rater2	skin rating of photo by rater 2
refNum	unique referee ID number (referee name removed for anonymizing purposes)
refCountry	unique referee country ID number
meanIAT	mean implicit bias score (using the race IAT) for referee country
nIAT	sample size for race IAT in that particular country
seIAT	standard error for mean estimate of race IAT
meanExp	mean explicit bias score (using a racial thermometer task) for referee country
nExp	sample size for explicit bias in that particular country
seExp	standard error for mean estimate of explicit bias measure



In [2]:

    
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")



In [3]:

    
def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)

Tidy Dyads and Starting Joins



In [4]:

    
clean_players = load_subgroup("cleaned_players.csv.gz")
players = load_subgroup("raw_players.csv.gz", )
countries = load_subgroup("raw_countries.csv.gz")
referees = load_subgroup("raw_referees.csv.gz")
agg_dyads = pd.read_csv("raw_dyads.csv.gz", compression='gzip', index_col=[0, 1])



In [5]:

    
agg_dyads.head(10)









    Out[5]:






  
    
      
      
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      redCards
    
    
      refNum
      playerShort
      
      
      
      
      
      
      
      
    
  
  
    
      1
      lucas-wilchez
      0
      0
      0
      0
      1
      1
      0
      0
    
    
      2
      john-utaka
      1
      0
      0
      0
      1
      1
      0
      0
    
    
      3
      abdon-prats
      1
      0
      0
      1
      1
      0
      0
      0
    
    
      pablo-mari
      0
      0
      1
      0
      1
      0
      0
      0
    
    
      ruben-pena
      0
      0
      1
      0
      1
      0
      0
      0
    
    
      4
      aaron-hughes
      0
      0
      0
      0
      1
      1
      0
      0
    
    
      aleksandar-kolarov
      0
      0
      1
      0
      1
      0
      0
      0
    
    
      alexander-tettey
      0
      0
      0
      0
      1
      1
      0
      0
    
    
      anders-lindegaard
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      andreas-beck
      0
      0
      1
      0
      1
      0
      0
      0



In [6]:

    
# Test if the number of games is equal to the victories + ties + defeats in the dataset



In [7]:

    
all(agg_dyads['games'] == agg_dyads.victories + agg_dyads.ties + agg_dyads.defeats)









    Out[7]:





True



In [8]:

    
# Sanity check passes



In [9]:

    
len(agg_dyads.reset_index().set_index('playerShort'))









    Out[9]:





146028



In [10]:

    
agg_dyads['totalRedCards'] = agg_dyads['yellowReds'] + agg_dyads['redCards']
agg_dyads.rename(columns={'redCards': 'strictRedCards'}, inplace=True)



In [11]:

    
agg_dyads.head()









    Out[11]:






  
    
      
      
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      strictRedCards
      totalRedCards
    
    
      refNum
      playerShort
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      lucas-wilchez
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      2
      john-utaka
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      3
      abdon-prats
      1
      0
      0
      1
      1
      0
      0
      0
      0
    
    
      pablo-mari
      0
      0
      1
      0
      1
      0
      0
      0
      0
    
    
      ruben-pena
      0
      0
      1
      0
      1
      0
      0
      0
      0

Remove records that come from players who don't have a skintone rating

There are a couple of ways to do this -- set operations and joins are two ways demonstrated below:



In [12]:

    
clean_players.head()









    Out[12]:






  
    
      
      height
      weight
      skintone
      position_agg
      weightclass
      heightclass
      skintoneclass
      age_years
    
    
      playerShort
      
      
      
      
      
      
      
      
    
  
  
    
      aaron-hughes
      182.0
      71.0
      0.125
      Defense
      low_weight
      mid_height
      [0, 0.125]
      33.149897
    
    
      aaron-hunt
      183.0
      73.0
      0.125
      Forward
      low_weight
      mid_height
      [0, 0.125]
      26.327173
    
    
      aaron-lennon
      165.0
      63.0
      0.250
      Midfield
      vlow_weight
      vlow_height
      (0.125, 0.25]
      25.713895
    
    
      aaron-ramsey
      178.0
      76.0
      0.000
      Midfield
      mid_weight
      low_height
      [0, 0.125]
      22.017796
    
    
      abdelhamid-el-kaoutari
      180.0
      73.0
      0.250
      Defense
      low_weight
      low_height
      (0.125, 0.25]
      22.795346



In [13]:

    
agg_dyads.head()









    Out[13]:






  
    
      
      
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      strictRedCards
      totalRedCards
    
    
      refNum
      playerShort
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      lucas-wilchez
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      2
      john-utaka
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      3
      abdon-prats
      1
      0
      0
      1
      1
      0
      0
      0
      0
    
    
      pablo-mari
      0
      0
      1
      0
      1
      0
      0
      0
      0
    
    
      ruben-pena
      0
      0
      1
      0
      1
      0
      0
      0
      0



In [14]:

    
agg_dyads.reset_index().head()









    Out[14]:






  
    
      
      refNum
      playerShort
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      strictRedCards
      totalRedCards
    
  
  
    
      0
      1
      lucas-wilchez
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      1
      2
      john-utaka
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      2
      3
      abdon-prats
      1
      0
      0
      1
      1
      0
      0
      0
      0
    
    
      3
      3
      pablo-mari
      0
      0
      1
      0
      1
      0
      0
      0
      0
    
    
      4
      3
      ruben-pena
      0
      0
      1
      0
      1
      0
      0
      0
      0



In [15]:

    
agg_dyads.reset_index().set_index('playerShort').head()









    Out[15]:






  
    
      
      refNum
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      strictRedCards
      totalRedCards
    
    
      playerShort
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      lucas-wilchez
      1
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      john-utaka
      2
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      abdon-prats
      3
      1
      0
      0
      1
      1
      0
      0
      0
      0
    
    
      pablo-mari
      3
      0
      0
      1
      0
      1
      0
      0
      0
      0
    
    
      ruben-pena
      3
      0
      0
      1
      0
      1
      0
      0
      0
      0



In [16]:

    
player_dyad = (clean_players.merge(agg_dyads.reset_index().set_index('playerShort'),
                                   left_index=True,
                                   right_index=True))



In [17]:

    
player_dyad.head()









    Out[17]:






  
    
      
      height
      weight
      skintone
      position_agg
      weightclass
      heightclass
      skintoneclass
      age_years
      refNum
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      strictRedCards
      totalRedCards
    
    
      playerShort
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      aaron-hughes
      182.0
      71.0
      0.125
      Defense
      low_weight
      mid_height
      [0, 0.125]
      33.149897
      4
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      aaron-hughes
      182.0
      71.0
      0.125
      Defense
      low_weight
      mid_height
      [0, 0.125]
      33.149897
      66
      0
      0
      1
      0
      1
      0
      0
      0
      0
    
    
      aaron-hughes
      182.0
      71.0
      0.125
      Defense
      low_weight
      mid_height
      [0, 0.125]
      33.149897
      77
      0
      0
      13
      8
      26
      5
      0
      0
      0
    
    
      aaron-hughes
      182.0
      71.0
      0.125
      Defense
      low_weight
      mid_height
      [0, 0.125]
      33.149897
      163
      0
      0
      1
      1
      2
      0
      0
      0
      0
    
    
      aaron-hughes
      182.0
      71.0
      0.125
      Defense
      low_weight
      mid_height
      [0, 0.125]
      33.149897
      194
      2
      0
      3
      5
      16
      8
      0
      0
      0



In [18]:

    
clean_dyads = (agg_dyads.reset_index()[agg_dyads.reset_index()
                                   .playerShort
                                   .isin(set(clean_players.index))
                                  ]).set_index(['refNum', 'playerShort'])



In [19]:

    
clean_dyads.head()









    Out[19]:






  
    
      
      
      yellowCards
      yellowReds
      victories
      ties
      games
      defeats
      goals
      strictRedCards
      totalRedCards
    
    
      refNum
      playerShort
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      lucas-wilchez
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      2
      john-utaka
      1
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      4
      aaron-hughes
      0
      0
      0
      0
      1
      1
      0
      0
      0
    
    
      aleksandar-kolarov
      0
      0
      1
      0
      1
      0
      0
      0
      0
    
    
      alexander-tettey
      0
      0
      0
      0
      1
      1
      0
      0
      0



In [20]:

    
clean_dyads.shape, agg_dyads.shape, player_dyad.shape









    Out[20]:





((124621, 9), (146028, 9), (124621, 18))

Disaggregate

The dyads are currently an aggregated metric summarizing all times a particular referee-player pair play were matched. To properly handle the data, we have to disaggregate the data into a tidy/long format. This means that each game is a row.



In [21]:

    
# inspired by https://github.com/mathewzilla/redcard/blob/master/Crowdstorming_visualisation.ipynb
colnames = ['games', 'totalRedCards']
j = 0
out = [0 for _ in range(sum(clean_dyads['games']))]

for index, row in clean_dyads.reset_index().iterrows():
    n = row['games']
    d = row['totalRedCards']
    ref = row['refNum']
    player = row['playerShort']
    for _ in range(n):
        row['totalRedCards'] = 1 if (d-_) > 0 else 0
        rowlist=list([ref, player, row['totalRedCards']])
        out[j] = rowlist
        j += 1

tidy_dyads = pd.DataFrame(out, columns=['refNum', 'playerShort', 'redcard'],).set_index(['refNum', 'playerShort'])



In [22]:

    
# 3092
tidy_dyads.redcard.sum()









    Out[22]:





3092



In [23]:

    
# Notice this is longer than before
clean_dyads.games.sum()









    Out[23]:





373067



In [24]:

    
tidy_dyads.shape









    Out[24]:





(373067, 1)



In [25]:

    
# Ok, this is a bit crazy... tear it apart and figure out what each piece is doing if it's not clear
clean_referees = (referees.reset_index()[referees.reset_index()
                                                 .refNum.isin(tidy_dyads.reset_index().refNum
                                                                                       .unique())
                                        ]).set_index('refNum')



In [26]:

    
clean_referees.shape, referees.shape









    Out[26]:





((2978, 1), (3147, 1))



In [27]:

    
clean_countries = (countries.reset_index()[countries.reset_index()
                                           .refCountry
                                           .isin(clean_referees.refCountry
                                                 .unique())
                                          ].set_index('refCountry'))



In [28]:

    
clean_countries.shape, countries.shape









    Out[28]:





((160, 7), (161, 7))



In [29]:

    
tidy_dyads.head()









    Out[29]:






  
    
      
      
      redcard
    
    
      refNum
      playerShort
      
    
  
  
    
      1
      lucas-wilchez
      0
    
    
      2
      john-utaka
      0
    
    
      4
      aaron-hughes
      0
    
    
      aleksandar-kolarov
      0
    
    
      alexander-tettey
      0



In [30]:

    
tidy_dyads.to_csv("cleaned_dyads.csv.gz", compression='gzip')



In [31]:

    
tidy_dyads.shape









    Out[31]:





(373067, 1)



In [ ]:

		yellowCards	yellowReds	victories	ties	games	defeats	goals	redCards
refNum	playerShort
1	lucas-wilchez	0	0	0	0	1	1	0	0
2	john-utaka	1	0	0	0	1	1	0	0
3	abdon-prats	1	0	0	1	1	0	0	0
	pablo-mari	0	0	1	0	1	0	0	0
	ruben-pena	0	0	1	0	1	0	0	0
4	aaron-hughes	0	0	0	0	1	1	0	0
	aleksandar-kolarov	0	0	1	0	1	0	0	0
	alexander-tettey	0	0	0	0	1	1	0	0
	anders-lindegaard	0	0	0	1	1	0	0	0
	andreas-beck	0	0	1	0	1	0	0	0

	height	weight	skintone	position_agg	weightclass	heightclass	skintoneclass	age_years
playerShort
aaron-hughes	182.0	71.0	0.125	Defense	low_weight	mid_height	[0, 0.125]	33.149897
aaron-hunt	183.0	73.0	0.125	Forward	low_weight	mid_height	[0, 0.125]	26.327173
aaron-lennon	165.0	63.0	0.250	Midfield	vlow_weight	vlow_height	(0.125, 0.25]	25.713895
aaron-ramsey	178.0	76.0	0.000	Midfield	mid_weight	low_height	[0, 0.125]	22.017796
abdelhamid-el-kaoutari	180.0	73.0	0.250	Defense	low_weight	low_height	(0.125, 0.25]	22.795346