This dataset comes from a fantastic paper that examines how the analytical choices made by different data science teams, all working on the same dataset to answer the same research question, affect the final outcome.
Many analysts, one dataset: Making transparent how variations in analytical choices affect results
The data can be found here.
Take the dataframe dfd (built below) and do an Exploratory Data Analysis on it, keeping in mind the research question: Are soccer referees more likely to give red cards to dark skin toned players than to light skin toned players?
Please work on your own -- and submit a pull-request when you have completed it. We are interested in how the Data Scientists approach EDA and will compile the results into a larger document.
No need to be elaborate w/ explanation; bullet-points are fine.
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
# %load_ext autoreload
# the "1" means: always reload modules marked with "%aimport"
# %autoreload 1
from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
# import mplsvds
import mpld3
import numpy as np
import pandas as pd
import os, sys
# from tqdm import tqdm
import warnings
sns.set_context("notebook", font_scale=1.3)
In [2]:
d = pd.read_csv('../data/CrowdstormingDataJuly1st.csv')
def gen_features(df):
    # all_reds = df['yellowReds']+df['redCards']
    # all_cards = all_reds + df.yellowCards
    return df.assign(
        avg_skintone=(df['rater1']+df['rater2'])/2,
        # all_reds=all_reds,
        # all_cards=all_cards,
        # frac_all_red=all_reds/all_cards,
        # frac_red=df.redCards/all_cards
    ).rename(columns={'yellowReds': 'yellowRedCards'})
dfd = d.pipe(gen_features)
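A possible sanity check on avg_skintone, sketched on made-up data rather than the real file: the variable table lists photoID as present only "if available", so some players may have no ratings, and the average of two missing ratings is itself missing. The toy frame and the names n_missing, rated, and disagree are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw data: player 'b' has no ratings.
toy = pd.DataFrame({
    'playerShort': ['a', 'b', 'c'],
    'rater1': [0.25, np.nan, 0.75],
    'rater2': [0.50, np.nan, 0.75],
})

# Same feature as gen_features: NaN in either rater propagates to the average.
toy = toy.assign(avg_skintone=(toy['rater1'] + toy['rater2']) / 2)

# How many rows have no usable skin tone?
n_missing = toy.avg_skintone.isnull().sum()

# Among rated rows, how often do the two raters disagree?
rated = toy.dropna(subset=['avg_skintone'])
disagree = (rated.rater1 != rated.rater2).mean()
```

On the real data the same two numbers would indicate how many dyads drop out of any skin-tone analysis and how noisy the rating itself is.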
In [2]:
# # The code from this cell comes from: https://osf.io/w7tds/
# import pandas as pd #for dealing with csv import
# import os # for joining paths and filenames sensibly
# # print "loading datafiles"
# # nb new datase
# filename = os.path.join('../data/redcard/','CrowdstormingDataJuly1st.csv')
# df = pd.read_csv(filename)
# #add new vars
# df['skintone']=(df['rater1']+df['rater2'])/2
# df['allreds']=df['yellowReds']+df['redCards']
# df['allredsStrict']=df['redCards']
# df['refCount']=0
# #add a column which tracks how many games each ref is involved in
# refs=pd.unique(df['refNum'].values.ravel()) #list all unique ref IDs
# #for each ref, count their dyads
# for r in refs:
#     df['refCount'][df['refNum']==r]=len(df[df['refNum']==r])
# colnames=list(df.columns)
# j = 0
# out = [0 for _ in range(sum(df['games']))]
# for _, row in df.iterrows():
#     n = row['games']
#     c = row['allreds']
#     d = row['allredsStrict']
#     #row['games'] = 1
#     for _ in range(n):
#         row['allreds'] = 1 if (c-_) > 0 else 0
#         row['allredsStrict'] = 1 if (d-_) > 0 else 0
#         rowlist=list(row) #convert from pandas Series to prevent overwriting previous values of out[j]
#         out[j] = rowlist
#         j += 1
# pd.DataFrame(out, columns=colnames).to_csv('../data/redcard/crowdstorm_disaggregated.csv', index=False)
The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website. -- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit
Variable Name: | Variable Description: |
---|---|
playerShort | short player ID |
player | player name |
club | player club |
leagueCountry | country of player club (England, Germany, France, and Spain) |
height | player height (in cm) |
weight | player weight (in kg) |
position | player position |
games | number of games in the player-referee dyad |
goals | number of goals in the player-referee dyad |
yellowCards | number of yellow cards player received from the referee |
yellowReds | number of yellow-red cards player received from the referee |
redCards | number of red cards player received from the referee |
photoID | ID of player photo (if available) |
rater1 | skin rating of photo by rater 1 |
rater2 | skin rating of photo by rater 2 |
refNum | unique referee ID number (referee name removed for anonymizing purposes) |
refCountry | unique referee country ID number |
meanIAT | mean implicit bias score (using the race IAT) for referee country |
nIAT | sample size for race IAT in that particular country |
seIAT | standard error for mean estimate of race IAT |
meanExp | mean explicit bias score (using a racial thermometer task) for referee country |
nExp | sample size for explicit bias in that particular country |
seExp | standard error for mean estimate of explicit bias measure |
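Since each row is described as one player-referee dyad, one quick check is that no (playerShort, refNum) pair repeats, and a related count is how many distinct referees each player met. A minimal sketch on toy data; the values are made up, and the names dyad_sizes, repeated, and refs_per_player are illustrative assumptions.

```python
import pandas as pd

# Hypothetical dyad table; the last row deliberately repeats the ('b', 1) dyad.
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'b', 'b'],
    'refNum': [1, 2, 1, 1],
    'games': [3, 1, 2, 4],
})

# Rows per dyad: every value should be 1 if the data really is one row per dyad.
dyad_sizes = toy.groupby(['playerShort', 'refNum']).size()
repeated = dyad_sizes[dyad_sizes > 1]

# Number of distinct referees each player encountered.
refs_per_player = toy.groupby('playerShort').refNum.nunique()
```

An empty repeated on the real dfd would confirm the one-row-per-dyad reading of the description above.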
In [80]:
dfd.shape
Out[80]:
In [184]:
dfd.describe()
Out[184]:
Dataset
How do we operationalize the question of referees giving more red cards to dark skinned players?
Potential issues
First, is there systematic discrimination across all refs?
Exploration/hypotheses:
Observations so far
In [4]:
games = dfd.groupby('playerShort').games.sum()
sns.distplot(games)
Out[4]:
In [69]:
games_red = dfd.groupby('playerShort')[['games','redCards']].sum()
sns.jointplot(x='games', y='redCards', data=games_red, kind='hex', xlim=(-5,800), ylim=(-.5,10))
Out[69]:
The plot below is incorrect: I'm grouping over skin tone without first aggregating up to players, so it averages over player-referee dyads. There are many more dyads than players, which drives the average down.
TODO: Still, why is 0.625 much lower than the rest? ANSWER: Fewer in that bin.
In [6]:
red_skin = dfd.groupby('avg_skintone').redCards.agg(['mean','sem','count'])
red_skin.plot.bar(y='mean', yerr='sem')
Out[6]:
In [7]:
red_skin['count'].plot.bar()
Out[7]:
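The dyad-versus-player averaging issue noted above can be illustrated on a toy table. This is only a sketch with made-up numbers, assuming the same column names as dfd; toy, dyad_mean, and player_mean are hypothetical names.

```python
import pandas as pd

# Hypothetical dyads: player 'a' has three dyads, 'b' one, 'c' two.
toy = pd.DataFrame({
    'playerShort': ['a', 'a', 'a', 'b', 'c', 'c'],
    'avg_skintone': [0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
    'games': [10, 5, 5, 20, 10, 10],
    'redCards': [1, 0, 0, 0, 1, 1],
})

# Dyad-level mean: each player-referee row counts once.
dyad_mean = toy.groupby('avg_skintone').redCards.mean()

# Player-level mean: sum each player's dyads first, then average over players.
players = toy.groupby('playerShort').agg(
    {'avg_skintone': 'mean', 'games': 'sum', 'redCards': 'sum'})
player_mean = players.groupby('avg_skintone').redCards.mean()
```

Here the dyad-level means are 0.25 and 1.0 red cards, while the player-level means are 0.5 and 2.0, because each player's cards are spread over several dyads.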
In [10]:
# Player who encountered the largest number of distinct refs
dfd.groupby('playerShort').refNum.nunique().idxmax()
Out[10]:
In [11]:
def add_totals(df):
    total = df.filter(like='Cards').sum(axis=1)
    all_red = df.yellowRedCards + df.redCards
    assert (all_red<=total).all()
    df = df.assign(
        total_cards=total,
        all_red=all_red,
        frac_red = df.redCards/total,
        frac_all_red = all_red/total
    )
    assert df.frac_red.dropna().between(0,1).all()
    assert (all_red>=df.redCards).all()
    return df

totals = (dfd.groupby('playerShort').agg({
        'yellowCards': 'sum',
        'redCards': 'sum',
        'yellowRedCards': 'sum',
        'games': 'sum',
        'avg_skintone': 'mean'
    })
    .pipe(add_totals)
)
totals.head()
Out[11]:
In [12]:
totals.shape
Out[12]:
albelda got almost 200 yellow cards in one season!!
In [13]:
totals.describe()
Out[13]:
In [14]:
# idxmax returns the player label; argmax returns an integer position in recent pandas
totals.yellowCards.idxmax(),totals.yellowRedCards.idxmax(),totals.redCards.idxmax()
Out[14]:
In [15]:
sns.distplot(totals.yellowCards)
Out[15]:
In [65]:
# These colors suck; upgrade to Matplotlib 2
r,y = totals[['redCards','yellowRedCards']].apply(np.histogram, bins=range(12))
pd.DataFrame({'red': r[0], 'yellowRed': y[0]}, index=r[1][:-1]).plot.bar(color=['red','yellow'])
Out[65]:
In [37]:
sns.distplot(totals.frac_red.dropna(), kde=False)
Out[37]:
In [106]:
totals.hist('frac_red', by='skincolor')
Out[106]:
In [67]:
totals.plot.scatter('games', 'all_red')
Out[67]:
In [19]:
totals['red_per_game'] = totals.redCards/totals.games
sns.boxplot(x='avg_skintone', y='red_per_game', data=totals)
Out[19]:
In [85]:
totals.hist('total_per_game', bins=np.arange(0, .4, .05),
            by=totals.skincolor, layout=(3,1),
            figsize=(5,8), sharex=True)
Out[85]:
In [20]:
totals['total_per_game'] = totals.total_cards / totals.games
sns.distplot(totals.total_per_game)
sns.jointplot('total_per_game', 'red_per_game', data=totals[totals.games>20])
Out[20]:
In [71]:
totals['skincolor'] = pd.cut(totals.avg_skintone,[0,.3,.6,1],
                             labels=['white','milk chocolate','dark chocolate'],
                             include_lowest=True)
# Assumed intent: players with at least one card (the original mask was incomplete; unused below)
have_card = totals[totals.total_cards > 0]
(sns.FacetGrid(totals, hue='skincolor', size=8, aspect=1.5)
    .map(sns.regplot, 'total_per_game', 'red_per_game', lowess=True)
    .add_legend()
    .ax.set(title='Red vs total cards per game by skin color',
            xlim=(0,1), ylim=(-0.01,.16))
);
In [97]:
(sns.FacetGrid(totals, hue='skincolor', size=8, aspect=1.5)
    .map(sns.regplot, 'total_cards', 'all_red', lowess=True)
    .add_legend()
    .ax.set(title='All red vs total cards by skin color',
            xlim=(0,200), ylim=(0,20))
);
In [107]:
(sns.FacetGrid(totals[totals.total_cards>10], hue='skincolor', size=8, aspect=1.5)
    .map(sns.regplot, 'total_per_game', 'frac_all_red', lowess=True)
    .add_legend()
    .ax.set(title='Frac red vs cards per game by skin color',
            # xlim=(0,200), ylim=(0,20)
           )
);