This dataset comes from a fantastic paper that examines how the analytical choices made by different data-science teams, each working on the same dataset to answer the same research question, affect the final outcome.
Many analysts, one dataset: Making transparent how variations in analytical choices affect results
The data can be found here.
Do an exploratory data analysis on the redcard dataset, keeping in mind the central question: Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)
import missingno as msno
import pandas_profiling
from sklearn.datasets import make_blobs
import time
The dataset is available as a list of 146,028 player-referee dyads and includes player details, referee details, and details about the player-referee interactions. A summary of the variables of interest is given below. A detailed description of all variables can be found in the README file on the project website.
From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). We created a dataset of player-referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.
-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit
Variable Name | Variable Description |
---|---|
playerShort | short player ID |
player | player name |
club | player club |
leagueCountry | country of player club (England, Germany, France, and Spain) |
height | player height (in cm) |
weight | player weight (in kg) |
position | player position |
games | number of games in the player-referee dyad |
goals | number of goals in the player-referee dyad |
yellowCards | number of yellow cards player received from the referee |
yellowReds | number of yellow-red cards player received from the referee |
redCards | number of red cards player received from the referee |
photoID | ID of player photo (if available) |
rater1 | skin rating of photo by rater 1 |
rater2 | skin rating of photo by rater 2 |
refNum | unique referee ID number (referee name removed for anonymizing purposes) |
refCountry | unique referee country ID number |
meanIAT | mean implicit bias score (using the race IAT) for referee country |
nIAT | sample size for race IAT in that particular country |
seIAT | standard error for mean estimate of race IAT |
meanExp | mean explicit bias score (using a racial thermometer task) for referee country |
nExp | sample size for explicit bias in that particular country |
seExp | standard error for mean estimate of explicit bias measure |
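If you want to peek at the raw dyad-level file before any splitting, something like the following works (the filename below is an assumption -- use whatever name your download from the project page has):

In [ ]:
# Peek at the raw dyad-level file (filename assumed -- adjust to your download)
raw = pd.read_csv("CrowdstormingDataJuly1st.csv")
print(raw.shape)  # expect 146,028 rows, one per player-referee dyad
raw[['playerShort', 'refNum', 'games', 'yellowCards', 'redCards']].head()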
Hadley Wickham's concept of a tidy dataset can be summarized as:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
A longer paper describing this can be found in this pdf.
Having datasets in this form allows for much simpler analyses. So the first step is to try and clean up the dataset into a tidy dataset.
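As a toy illustration of the first two rules, a wide table with one column per season is not tidy; pd.melt reshapes it so that each variable forms a column and each observation forms a row (the data here is made up):

In [ ]:
# Made-up wide table: one column per season is not tidy
untidy = pd.DataFrame({'player': ['player-a', 'player-b'],
                       'goals_2012': [3, 7],
                       'goals_2013': [5, 2]})
# Tidy form: every row is one (player, season) observation
pd.melt(untidy, id_vars='player', var_name='season', value_name='goals')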
The first step that I am going to take is to break up the dataset into its different observational units. By that I mean having separate tables (or dataframes) for each type of observational unit -- for example players, referees, referee countries, and the player-referee dyads.
In [2]:
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test failed!!! Double-check.")
In [3]:
def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)
In [4]:
# players = load_subgroup("../data/redcard/raw_players.csv.gz")
players = load_subgroup("raw_players.csv.gz")
In [5]:
players.head()
Out[5]:
In [6]:
players.shape
Out[6]:
In [7]:
ls
In [8]:
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
In [9]:
msno.bar(players.sample(500),
figsize=(16, 7),)
Interesting! It appears that a significant fraction of players don't have a skintone rating from at least one rater. Are they missing from both? Since this is the feature we are supposed to be answering against, it's important to figure this out early. Let's see if there are correlated patterns in how the data is missing for this table:
In [10]:
msno.heatmap(players.sample(500),
figsize=(16, 7),)
In [11]:
print("All players:", len(players))
print("rater1 nulls:", len(players[(players.rater1.isnull())]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))
In [12]:
# modifying dataframe
players = players[players.rater1.notnull()]
players.shape[0]
Out[12]:
In [13]:
2053-1585
Out[13]:
We've removed 468 players from the table who had no skintone rating.
Let's look again at the missing data in this table.
In [14]:
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
In [15]:
msno.bar(players.sample(500),
figsize=(16, 7),)
Each remaining player in the table has 2 skintone ratings -- are they sufficiently close that they can be combined?
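Before averaging them, it's worth quantifying how well the two raters agree (a quick check; both ratings are on the same scale):

In [ ]:
# How often do the two raters agree exactly, and how correlated are they?
print("Exact agreement: {:.1%}".format((players.rater1 == players.rater2).mean()))
print("Pearson correlation: {:.2f}".format(players.rater1.corr(players.rater2)))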
In [16]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()
In [17]:
players.head()
Out[17]:
In [18]:
# modifying dataframe
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
players.head()
Out[18]:
In [19]:
sns.distplot(players.skintone, kde=False);
In [20]:
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()
Intuitively, the different positions on the field probably have different red-card rates, but the position data we have is very granular.
Recommendation:
I chose to group the position types by their primary roles (you can disagree with my categorization and do it differently if you like).
In [21]:
position_types = players.position.unique()
position_types
Out[21]:
In [22]:
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'
# modifying dataframe -- adding the aggregated position categorical position_agg
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"
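The same aggregation can be written as a single explicit mapping, which is equivalent to the .loc assignments above and a bit easier to audit:

In [ ]:
# Equivalent alternative: one explicit position -> role mapping
position_map = {pos: 'Defense' for pos in defense}
position_map.update({pos: 'Midfield' for pos in midfield})
position_map.update({pos: 'Forward' for pos in forward})
position_map[keeper] = 'Keeper'
players['position_agg'] = players['position'].map(position_map)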
In [23]:
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()
In [24]:
sns.distplot(players.height.dropna());
In [25]:
sns.distplot(players.weight.dropna());
In [26]:
from pandas.plotting import scatter_matrix  # moved here from pandas.tools.plotting in newer pandas
In [27]:
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);
In [28]:
# Perhaps you want to see a particular relationship more clearly
In [29]:
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax)
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()
There aren't obvious outliers in the height vs weight relationship. Things that I check for:
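For example, a couple of quick range checks can catch unit mix-ups or data-entry errors (the thresholds below are illustrative guesses, not part of the original analysis):

In [ ]:
# Illustrative sanity checks -- thresholds are rough guesses, not from the paper
print(players[(players.height < 150) | (players.height > 210)][['height', 'weight']])
print(players[(players.weight < 50) | (players.weight > 110)][['height', 'weight']])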
In [30]:
weight_categories = ["vlow_weight",
"low_weight",
"mid_weight",
"high_weight",
"vhigh_weight",
]
players['weightclass'] = pd.qcut(players['weight'],
                                 len(weight_categories),
                                 labels=weight_categories)
In [31]:
players.head()
Out[31]:
In [32]:
height_categories = ["vlow_height",
"low_height",
"mid_height",
"high_height",
"vhigh_height",
]
players['heightclass'] = pd.qcut(players['height'],
                                 len(height_categories),
                                 labels=height_categories)
In [33]:
players['skintoneclass'] = pd.qcut(players['skintone'], 3)
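A quick sanity check on the new categorical columns -- quantile bins should come out roughly equal-sized by construction:

In [ ]:
# Quantile-based bins should be roughly equal-sized
print(players.weightclass.value_counts())
print(players.heightclass.value_counts())
print(players.skintoneclass.value_counts())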
There is a library that gives a high-level overview -- https://github.com/JosPolfliet/pandas-profiling
In [34]:
pandas_profiling.ProfileReport(players)
Out[34]:
Notice a few things -- the report automatically flags highly correlated features. Is that what we want? In this case, no: we want to keep the skintone column, and the fact that it's highly correlated with rater1 and rater2 (1) makes sense, but (2) which columns to keep should be a conscious choice.
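To see that correlation directly rather than relying on the report:

In [ ]:
# skintone is the mean of rater1 and rater2, so high correlation is expected
players[['rater1', 'rater2', 'skintone']].corr()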
In [35]:
players.birthday.head()
Out[35]:
In [36]:
# modifying dataframe
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25
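A quick check that every birthday parsed and the resulting ages look plausible:

In [ ]:
# Any unparsed birth dates? Do the ages fall in a plausible range for professional players?
print("Unparsed birth dates:", players.birth_date.isnull().sum())
players.age_years.describe()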
In [37]:
sns.distplot(players.age_years);
In [38]:
players.head()
Out[38]:
In [39]:
players_cleaned_variables = players.columns.tolist()
players_cleaned_variables
Out[39]:
In [40]:
players_cleaned_variables = [#'birthday',
'height',
'weight',
# 'position',
# 'photoID',
# 'rater1',
# 'rater2',
'skintone',
'position_agg',
'weightclass',
'heightclass',
'skintoneclass',
# 'birth_date',
'age_years']
In [41]:
pandas_profiling.ProfileReport(players[players_cleaned_variables])
Out[41]:
In [42]:
players[players_cleaned_variables].to_csv("cleaned_players.csv.gz", compression='gzip')
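As a final check, we can read the cleaned file back, mirroring the round-trip test in save_subgroup. The categorical and interval columns come back as plain strings, so we compare shapes and eyeball a few rows rather than expecting exact dtype equality:

In [ ]:
# Read the cleaned file back; categorical/interval columns return as plain strings
check = pd.read_csv("cleaned_players.csv.gz", compression='gzip', index_col=0)
print(check.shape == players[players_cleaned_variables].shape)
check.head()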
In [ ]: