Exploratory Data Analysis

This dataset comes from a fantastic paper that examines how the analytical choices made by different data science teams, all working on the same dataset to answer the same research question, affect the final outcome.

Many analysts, one dataset: Making transparent how variations in analytical choices affect results

The data can be found here.

The Task

Take the dataframe dfd (defined below) and do an Exploratory Data Analysis on it, keeping the following question in mind: Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?

Please work on your own -- and submit a pull-request when you have completed it. We are interested in how the Data Scientists approach EDA and will compile the results into a larger document.

No need to be elaborate w/ explanation; bullet-points are fine.



In [1]:

    
%matplotlib inline
%config InlineBackend.figure_format='retina'
# %load_ext autoreload
# the "1" means: always reload modules marked with "%aimport"
# %autoreload 1

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
# import mplsvds
import mpld3
import numpy as np
import pandas as pd
import os, sys
# from tqdm import tqdm
import warnings

sns.set_context("notebook", font_scale=1.3)



In [2]:

    
d = pd.read_csv('../data/CrowdstormingDataJuly1st.csv')

def gen_features(df):
#     all_reds = df['yellowReds']+df['redCards']
#     all_cards = all_reds + df.yellowCards
    return df.assign(
                avg_skintone=(df['rater1']+df['rater2'])/2,
#                 all_reds=all_reds,
#                 all_cards=all_cards,
#                 frac_all_red=all_reds/all_cards,
#                 frac_red=df.redCards/all_cards
          ).rename(columns={'yellowReds': 'yellowRedCards'})

dfd = d.pipe(gen_features)



In [2]:

    
# # The code from this cell comes from: https://osf.io/w7tds/

# import pandas as pd #for dealing with csv import
# import os # for joining paths and filenames sensibly

# # print "loading datafiles"
# # nb new datase
# filename = os.path.join('../data/redcard/','CrowdstormingDataJuly1st.csv') 
# df = pd.read_csv(filename)

# #add new vars
# df['skintone']=(df['rater1']+df['rater2'])/2
# df['allreds']=df['yellowReds']+df['redCards']
# df['allredsStrict']=df['redCards']
# df['refCount']=0

# #add a column which tracks how many games each ref is involved in
# refs=pd.unique(df['refNum'].values.ravel()) #list all unique ref IDs

# #for each ref, count their dyads
# for r in refs:
#     df['refCount'][df['refNum']==r]=len(df[df['refNum']==r])    

# colnames=list(df.columns)

# j = 0
# out = [0 for _ in range(sum(df['games']))]

# for _, row in df.iterrows():
#         n = row['games']
#         c = row['allreds']
#         d = row['allredsStrict']
        
#         #row['games'] = 1        
        
#         for _ in range(n):
#                 row['allreds'] = 1 if (c-_) > 0 else 0
#                 row['allredsStrict'] = 1 if (d-_) > 0 else 0
#                 rowlist=list(row)  #convert from pandas Series to prevent overwriting previous values of out[j]
#                 out[j] = rowlist
#                 j += 1


# pd.DataFrame(out, columns=colnames).to_csv('../data/redcard/crowdstorm_disaggregated.csv', index=False)
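
The commented-out cell above expands each dyad into one row per game by looping with iterrows. A vectorized sketch of the same idea follows. It assumes a unique index and integer columns named games, allreds and allredsStrict as in that cell; the toy frame is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

def disaggregate(df):
    """Expand each player-referee dyad into one row per game.

    Assumes df has a unique index and integer columns 'games',
    'allreds' and 'allredsStrict' (the names used in the cell above).
    """
    # Repeat every dyad row once per game played.
    out = df.loc[df.index.repeat(df['games'].to_numpy())]
    # Position of each game within its dyad: 0, 1, ..., games - 1.
    k = out.groupby(level=0).cumcount().to_numpy()
    out = out.reset_index(drop=True)
    # The first `allreds` games of a dyad get a 1, the rest get 0.
    out['allreds'] = (k < out['allreds'].to_numpy()).astype(int)
    out['allredsStrict'] = (k < out['allredsStrict'].to_numpy()).astype(int)
    return out

# Toy data, made up for illustration only.
toy = pd.DataFrame({'games': [3, 1], 'allreds': [2, 0], 'allredsStrict': [1, 0]})
long = disaggregate(toy)
```

As in the original loop, the games column is left unchanged, and the total number of red cards per dyad is preserved.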

Data Structure

The dataset is available as a list of 146,028 player-referee dyads and includes details about the players, details about the referees, and details about the interactions between them. A summary of the variables of interest can be seen below. A detailed description of all variables can be found in the README file on the project website: https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit

Variable Name:	Variable Description:
playerShort	short player ID
player	player name
club	player club
leagueCountry	country of player club (England, Germany, France, and Spain)
height	player height (in cm)
weight	player weight (in kg)
position	player position
games	number of games in the player-referee dyad
goals	number of goals in the player-referee dyad
yellowCards	number of yellow cards player received from the referee
yellowReds	number of yellow-red cards player received from the referee
redCards	number of red cards player received from the referee
photoID	ID of player photo (if available)
rater1	skin rating of photo by rater 1
rater2	skin rating of photo by rater 2
refNum	unique referee ID number (referee name removed for anonymizing purposes)
refCountry	unique referee country ID number
meanIAT	mean implicit bias score (using the race IAT) for referee country
nIAT	sample size for race IAT in that particular country
seIAT	standard error for mean estimate of race IAT
meanExp	mean explicit bias score (using a racial thermometer task) for referee country
nExp	sample size for explicit bias in that particular country
seExp	standard error for mean estimate of explicit bias measure



In [80]:

    
dfd.shape









    Out[80]:





(146028, 30)



In [184]:

    
dfd.describe()









    Out[184]:






  
    
      
	height	weight	games	victories	ties	defeats	goals	yellowCards	yellowRedCards	redCards	...	rater2	refNum	refCountry	meanIAT	nIAT	seIAT	meanExp	nExp	seExp	avg_skintone
count	145765.000000	143785.000000	146028.000000	146028.000000	146028.000000	146028.000000	146028.000000	146028.000000	146028.000000	146028.000000	...	124621.000000	146028.000000	146028.000000	145865.000000	1.458650e+05	1.458650e+05	145865.000000	1.458650e+05	145865.000000	124621.000000
mean	181.935938	76.075662	2.921166	1.278344	0.708241	0.934581	0.338058	0.385364	0.011381	0.012559	...	0.302862	1534.827444	29.642842	0.346276	1.969741e+04	6.310849e-04	0.452026	2.044023e+04	0.002994	0.283559
std	6.738726	7.140906	3.413633	1.790725	1.116793	1.383059	0.906481	0.795333	0.107931	0.112889	...	0.293020	918.736625	27.496189	0.032246	1.271262e+05	4.735857e-03	0.217469	1.306157e+05	0.019723	0.288517
min	161.000000	54.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	1.000000	1.000000	-0.047254	2.000000e+00	2.235373e-07	-1.375000	2.000000e+00	0.000001	0.000000
25%	177.000000	71.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	641.000000	7.000000	0.334684	1.785000e+03	5.454025e-05	0.336101	1.897000e+03	0.000225	0.000000
50%	182.000000	76.000000	2.000000	1.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	...	0.250000	1604.000000	21.000000	0.336628	2.882000e+03	1.508847e-04	0.356446	3.011000e+03	0.000586	0.250000
75%	187.000000	81.000000	3.000000	2.000000	1.000000	1.000000	0.000000	1.000000	0.000000	0.000000	...	0.500000	2345.000000	44.000000	0.369894	7.749000e+03	2.294896e-04	0.588297	7.974000e+03	0.001002	0.375000
max	203.000000	100.000000	47.000000	29.000000	14.000000	18.000000	23.000000	14.000000	3.000000	2.000000	...	1.000000	3147.000000	161.000000	0.573793	1.975803e+06	2.862871e-01	1.800000	2.029548e+06	1.060660	1.000000
    
  

8 rows × 21 columns

EDA

Dataset

From one season (2012-2013)
1st division males
England, France, Germany, Spain

How do we operationalize the question of referees giving more red cards to dark skinned players?

Counterfactual: if the player were lighter, would the ref have been more likely to give a yellow card or no card for the same offense under the same conditions?
Regression: after accounting for confounding, darker skin tone has a positive coefficient in a regression on the proportion of red cards among all cards
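
A minimal sketch of the regression framing above, assuming player-level arrays for skin tone, one stand-in confounder (cards per game) and the fraction of red cards. The numbers are made up and constructed so that the true coefficients are known. A real analysis would likely need a count or logistic model and referee-level effects, so this only shows the shape of the idea.

```python
import numpy as np

# Toy player-level data, made up for illustration only.
skintone = np.array([0.0, 0.0, 0.25, 0.25, 0.5, 0.75, 1.0, 1.0])
cards_per_game = np.array([0.10, 0.20, 0.15, 0.05, 0.20, 0.10, 0.15, 0.05])
# Construct the outcome so the true coefficients are known exactly.
frac_red = 0.02 + 0.05 * skintone + 0.10 * cards_per_game

# Design matrix: intercept, skin tone, and one stand-in confounder.
X = np.column_stack([np.ones_like(skintone), skintone, cards_per_game])
coef, *_ = np.linalg.lstsq(X, frac_red, rcond=None)
intercept, b_skintone, b_cards = coef
```

A positive b_skintone after adjusting for the other columns is what the regression bullet would count as evidence, subject to the confounding caveats listed under potential issues.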

Potential issues

How to combine rater1 and rater2? Average them? What if they disagree? Throw it out?
Is data imbalanced, i.e. red cards are very rare?
Is data biased, i.e. players have different amounts of play time?
How do I know I've accounted for all forms of confounding?
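
One possible way to handle the rater1/rater2 question, sketched on made-up ratings: measure the gap between raters and keep the average only when they are close. The 0.25 threshold (max_gap) is a hypothetical choice, and the step size of 0.25 is inferred from the summary statistics above rather than documented.

```python
import numpy as np
import pandas as pd

# Toy ratings on a 0-1 scale in steps of 0.25 (made up for illustration).
ratings = pd.DataFrame({
    'rater1': [0.0, 0.25, 0.5, 1.0, np.nan],
    'rater2': [0.0, 0.50, 0.5, 0.25, np.nan],
})

# Absolute disagreement between the two raters (NaN if a rating is missing).
gap = (ratings['rater1'] - ratings['rater2']).abs()

# Hypothetical rule: keep the average only when raters differ by at most one step.
max_gap = 0.25
ratings['avg_skintone'] = (
    ratings[['rater1', 'rater2']].mean(axis=1).where(gap <= max_gap)
)

# Share of rated players on whom both raters agree exactly.
share_exact = (gap == 0).sum() / gap.notna().sum()
```

Rows where the raters disagree by more than the threshold end up as NaN, so they can be counted before deciding whether to drop them.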

First, is there systematic discrimination across all refs?

Exploration/hypotheses:

Distribution of games played
red cards vs games played
Reds per game played vs total cards per game played by skin color
Distribution of # red, # yellow, total cards, and fraction red per game played for all players by avg skin color
How many refs did players encounter?
Do some clubs play more aggressively and get carded more? Or are some more reserved and get carded less?
Does carding vary by leagueCountry?
Do high scorers get more slack (fewer cards) for the same position?
Are there some referees that give more red/yellow cards than others?
How consistent are the raters? Check with Cohen's kappa.
How do red cards vary by position? E.g. do defenders get more?
Do players with more games get more cards, and is there a difference across skin color?
Is there an indication of bias depending on refCountry?
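
For the rater-consistency check in the list above, a small hand-rolled sketch of unweighted Cohen's kappa. The function name cohens_kappa and the toy ratings are illustrative. Because the ratings are ordinal, a weighted kappa may be more appropriate; this version treats each rating level as an unordered category.

```python
import pandas as pd

def cohens_kappa(a, b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    pair = pd.DataFrame({'a': a, 'b': b}).dropna()
    # Observed agreement: share of items given the same label by both raters.
    p_obs = (pair['a'] == pair['b']).mean()
    # Chance agreement: sum over labels of the product of marginal frequencies.
    pa = pair['a'].value_counts(normalize=True)
    pb = pair['b'].value_counts(normalize=True)
    p_exp = pa.mul(pb, fill_value=0).sum()
    # Undefined (division by zero) if both raters always use one identical label.
    return (p_obs - p_exp) / (1 - p_exp)

# Toy ratings, made up for illustration only.
kappa = cohens_kappa([0, 0, 0.5, 0.5, 1, 1], [0, 0, 0.5, 1, 1, 1])
perfect = cohens_kappa([0, 1, 0, 1], [0, 1, 0, 1])
```

On the real data this would be called as cohens_kappa(dfd.rater1, dfd.rater2), ideally on one row per player so that players with many dyads are not over-counted.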

Observations so far

Mostly white players in these leagues. Makes it more difficult to see discrimination.



In [4]:

    
games = dfd.groupby('playerShort').games.sum()
sns.distplot(games)









    









    Out[4]:





<matplotlib.axes._subplots.AxesSubplot at 0x11d818ef0>



In [69]:

    
# Select the numeric columns before summing so string columns are not aggregated.
games_red = dfd.groupby('playerShort')[['games','redCards']].sum()
sns.jointplot(x='games', y='redCards', data=games_red, kind='hex', xlim=(-5,800), ylim=(-.5,10))









    Out[69]:





<seaborn.axisgrid.JointGrid at 0x122c83940>

The plot below is incorrect: it groups by skin tone without first aggregating up to players, so it averages over player-referee dyads. There are many more dyads than players, which drives the average down...

TODO: Still, why is 0.625 much lower than the rest? ANSWER: Fewer observations in that bin.



In [6]:

    
red_skin = dfd.groupby('avg_skintone').redCards.agg(['mean','sem','count'])
red_skin.plot.bar(y='mean', yerr='sem')









    Out[6]:





<matplotlib.axes._subplots.AxesSubplot at 0x11007d400>



In [7]:

    
red_skin['count'].plot.bar()









    Out[7]:





<matplotlib.axes._subplots.AxesSubplot at 0x11a5b3fd0>
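
The note above the skin-tone bar chart says the dyads should first be aggregated up to players. A sketch of that two-step grouping on made-up rows, with column names following dfd:

```python
import pandas as pd

# Toy dyad-level rows, made up for illustration; column names follow dfd.
dyads = pd.DataFrame({
    'playerShort': ['a', 'a', 'a', 'b', 'c', 'c'],
    'avg_skintone': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'games': [10, 5, 5, 20, 10, 10],
    'redCards': [1, 0, 0, 0, 1, 1],
})

# Step 1: collapse dyads to one row per player.
players = dyads.groupby('playerShort').agg(
    {'avg_skintone': 'mean', 'games': 'sum', 'redCards': 'sum'})

# Step 2: only then summarise red cards by skin tone.
by_tone = players.groupby('avg_skintone')['redCards'].agg(['mean', 'sem', 'count'])

# For comparison: the dyad-level mean, which is pulled down by the many dyads.
dyad_level = dyads.groupby('avg_skintone')['redCards'].mean()
```

In this toy example the player-level mean for skin tone 0.0 is 0.5 red cards per player, while the dyad-level mean is 0.25 per dyad, which illustrates the downward pull described in the note.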

Distribution of cards



In [10]:

    
# Most refs encountered (idxmax returns the index label)
dfd.groupby(['playerShort','refNum']).size().idxmax()









    Out[10]:





('aaron-hughes', 4)
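
If each row of dfd is one player-referee dyad, as the Data Structure section describes, the sizes computed in the cell above are presumably all 1, so the result is just the first pair in the index. A sketch of what the comment seems to intend, counting distinct referees per player on made-up rows:

```python
import pandas as pd

# Toy dyads, made up for illustration; column names follow dfd.
dyads = pd.DataFrame({
    'playerShort': ['a', 'a', 'a', 'b', 'b', 'c'],
    'refNum': [1, 2, 3, 1, 2, 3],
})

# Every (player, referee) pair appears once, so pair sizes carry no information.
pair_sizes = dyads.groupby(['playerShort', 'refNum']).size()

# Number of distinct referees each player encountered.
refs_per_player = dyads.groupby('playerShort')['refNum'].nunique()
most_refs_player = refs_per_player.idxmax()
```

The refs_per_player series can also be plotted directly to answer the earlier question of how many refs players encountered.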



In [11]:

    
def add_totals(df):
    """Add per-player card totals and red-card fractions."""
    # Sum every column whose name contains 'Cards' (yellow, yellow-red, red).
    total = df.filter(like='Cards').sum(axis=1)
    all_red = df.yellowRedCards + df.redCards
    assert (all_red<=total).all()
    df = df.assign(
            total_cards=total,
            all_red=all_red,
            # Players with no cards at all get NaN here (0/0).
            frac_red = df.redCards/total,
            frac_all_red = all_red/total
        )
    assert df.frac_red.dropna().between(0,1).all()
    assert (all_red>=df.redCards).all()
    return df
        
# Aggregate dyads up to one row per player before deriving totals.
totals = (dfd.groupby('playerShort').agg({
    'yellowCards': 'sum',
    'redCards': 'sum',
    'yellowRedCards': 'sum',
    'games': 'sum',
    'avg_skintone': 'mean'
    })
    .pipe(add_totals)
)
totals.head()









    Out[11]:






  
    
      
	avg_skintone	redCards	yellowCards	games	yellowRedCards	all_red	frac_all_red	frac_red	total_cards
playerShort
aaron-hughes	0.125	0	19	654	0	0	0.000000	0.000000	19
aaron-hunt	0.125	1	42	336	0	1	0.023256	0.023256	43
aaron-lennon	0.250	0	11	412	0	0	0.000000	0.000000	11
aaron-ramsey	0.000	1	31	260	0	1	0.031250	0.031250	32
abdelhamid-el-kaoutari	0.250	2	8	124	4	6	0.428571	0.142857	14



In [12]:

    
totals.shape









    Out[12]:





(2053, 9)

albelda received almost 200 yellow cards (197), which seems implausibly high for a single season!!



In [13]:

    
totals.describe()









    Out[13]:






  
    
      
	avg_skintone	redCards	yellowCards	games	yellowRedCards	all_red	frac_all_red	frac_red	total_cards
count	1585.000000	2053.000000	2053.000000	2053.000000	2053.000000	2053.000000	1987.000000	1987.000000	2053.000000
mean	0.289511	0.893327	27.410619	207.779834	0.809547	1.702874	0.060798	0.037183	29.113492
std	0.290899	1.257267	24.076074	141.604438	1.293358	2.121637	0.088147	0.081542	25.583736
min	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	0.000000	0.000000	10.000000	103.000000	0.000000	0.000000	0.000000	0.000000	10.000000
50%	0.250000	0.000000	21.000000	180.000000	0.000000	1.000000	0.045455	0.011765	23.000000
75%	0.375000	1.000000	39.000000	286.000000	1.000000	2.000000	0.083333	0.047619	41.000000
max	1.000000	13.000000	197.000000	914.000000	12.000000	19.000000	1.000000	1.000000	206.000000



In [14]:

    
# idxmax returns the player label of the maximum; argmax returns a position in recent pandas.
totals.yellowCards.idxmax(),totals.yellowRedCards.idxmax(),totals.redCards.idxmax()









    Out[14]:





('albelda', 'sergio-ramos', 'cyril-jeunechamp')



In [15]:

    
sns.distplot(totals.yellowCards)









    









    Out[15]:





<matplotlib.axes._subplots.AxesSubplot at 0x11b390cf8>



In [65]:

    
# These colors suck; upgrade to Matplotlib 2
r,y = totals[['redCards','yellowRedCards']].apply(np.histogram, bins=range(12))
pd.DataFrame({'red': r[0], 'yellowRed': y[0]}, index=r[1][:-1]).plot.bar(color=['red','yellow'])









    Out[65]:





<matplotlib.axes._subplots.AxesSubplot at 0x12ee0ba20>



In [37]:

    
sns.distplot(totals.frac_red.dropna(), kde=False)









    Out[37]:





<matplotlib.axes._subplots.AxesSubplot at 0x12d402470>



In [106]:

    
totals.hist('frac_red', by='skincolor')









    Out[106]:





array([[<matplotlib.axes._subplots.AxesSubplot object at 0x129f34780>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11cdf0e80>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x12ae8b160>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x12b85b668>]], dtype=object)



In [67]:

    
totals.plot.scatter('games', 'all_red')









    Out[67]:





<matplotlib.axes._subplots.AxesSubplot at 0x12ee3aac8>

Reds per game played by skin tone

Results

Identified problem with preprocessing and fixed. EDA success!
I don't see convincing discrimination at a systemic level yet. Darker-skinned players may tend to get more reds when they get more cards, but the relationships are noisy.

Needs iteration!



In [19]:

    
totals['red_per_game'] = totals.redCards/totals.games
sns.boxplot(x='avg_skintone', y='red_per_game', data=totals)









    









    Out[19]:





<matplotlib.axes._subplots.AxesSubplot at 0x11d4a9eb8>



In [85]:

    
totals.hist('total_per_game', bins=np.arange(0, .4, .05), 
            by=totals.skincolor, layout=(3,1), 
            figsize=(5,8), sharex=True)









    Out[85]:





array([<matplotlib.axes._subplots.AxesSubplot object at 0x11c6d15c0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x1228c9ac8>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x12228ac18>], dtype=object)



In [20]:

    
totals['total_per_game'] = totals.total_cards / totals.games
sns.distplot(totals.total_per_game)
sns.jointplot(x='total_per_game', y='red_per_game', data=totals[totals.games>20])









    









    Out[20]:





<seaborn.axisgrid.JointGrid at 0x120e4ba20>



In [71]:

    
totals['skincolor'] = pd.cut(totals.avg_skintone,[0,.3,.6,1], 
                             labels=['white','milk chocolate','dark chocolate'], 
                             include_lowest=True)
# Players with at least one card; assumed intent of the original `totals[totals]`, unused below.
have_card = totals[totals.total_cards > 0]
(sns.FacetGrid(totals, hue='skincolor', size=8, aspect=1.5)
 .map(sns.regplot, 'total_per_game', 'red_per_game', lowess=True)
 .add_legend()
 .ax.set(title='Red vs total cards per game by skin color',
         xlim=(0,1), ylim=(-0.01,.16))
);



In [97]:

    
(sns.FacetGrid(totals, hue='skincolor', size=8, aspect=1.5)
 .map(sns.regplot, 'total_cards', 'all_red', lowess=True)
 .add_legend()
 .ax.set(title='All red vs total cards by skin color',
         xlim=(0,200), ylim=(0,20))
);



In [107]:

    
(sns.FacetGrid(totals[totals.total_cards>10], hue='skincolor', size=8, aspect=1.5)
 .map(sns.regplot, 'total_per_game', 'frac_all_red', lowess=True)
 .add_legend()
 .ax.set(title='Frac red vs cards per game by skin color',
#          xlim=(0,200), ylim=(0,20)
        )
);
