In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
In [2]:
import pandas as pd
In [3]:
%%time
df = pd.read_csv('../data/redcard/crowdstorm_disaggregated.csv.gz', compression='gzip')
The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website. -- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit
Variable Name: | Variable Description: |
---|---|
playerShort | short player ID |
player | player name |
club | player club |
leagueCountry | country of player club (England, Germany, France, and Spain) |
height | player height (in cm) |
weight | player weight (in kg) |
position | player position |
games | number of games in the player-referee dyad |
goals | number of goals in the player-referee dyad |
yellowCards | number of yellow cards player received from the referee |
yellowReds | number of yellow-red cards player received from the referee |
redCards | number of red cards player received from the referee |
photoID | ID of player photo (if available) |
rater1 | skin rating of photo by rater 1 |
rater2 | skin rating of photo by rater 2 |
refNum | unique referee ID number (referee name removed for anonymizing purposes) |
refCountry | unique referee country ID number |
meanIAT | mean implicit bias score (using the race IAT) for referee country |
In [4]:
# how many records are there?
df.shape
Out[4]:
In [5]:
# what do the entries in the table look like?
df.sample(100)
Out[5]:
In [6]:
import missingno as msno
In [7]:
(df.isnull().sum(axis=0)/float(len(df))).sort_values()
Out[7]:
In [8]:
msno.matrix(df.sample(1000))
In [9]:
msno.bar(df)
In [10]:
msno.heatmap(df)
The data is mostly there.
The most frequently missing fields are photoID, rater1, rater2, and skintone. They are missing for about 12.5% of the rows, and their missingness is highly correlated.
position, weight, and height are the next most commonly missing fields, at rates of about 10%, 1%, and 0.1% respectively; their missingness is moderately correlated.
meanIAT, nIAT, seIAT, meanExp, nExp, and seExp are each missing in about 0.05% of the rows, and their missingness is highly correlated.
Assumption for the analysis below: ignore rows with missing data.
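The missing-data rates and their correlation can also be checked directly in pandas. The following is a minimal sketch on a small made-up frame (the values are hypothetical, not taken from the dataset). It assumes that "correlated missingness" means the Pearson correlation between per-column null indicators, which is what a nullity heatmap conventionally shows.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values, not the real data): rater1 and rater2 are
# missing in the same rows, height is missing in a different row.
toy = pd.DataFrame({
    'rater1': [0.25, np.nan, 0.5, np.nan, 0.0, 1.0],
    'rater2': [0.25, np.nan, 0.75, np.nan, 0.0, 1.0],
    'height': [180.0, 175.0, np.nan, 190.0, 185.0, 178.0],
})

# Fraction of missing values per column, smallest first.
missing_rate = toy.isnull().mean().sort_values()

# Correlation between the null indicators of each pair of columns.
null_corr = toy.isnull().astype(int).corr()

print(missing_rate)
print(null_corr)
```

Columns that are always missing together, like rater1 and rater2 here, get a nullity correlation of 1.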
In [11]:
df_nona = df.dropna()
In [12]:
df_nona.shape
Out[12]:
In [13]:
len(df_nona) / float(len(df))
Out[13]:
In [14]:
df_nona.describe()
Out[14]:
In [15]:
df_focus = df_nona[['rater1', 'rater2', 'redCards', 'skintone', 'allreds', 'allredsStrict']]
In [16]:
df_focus.describe()
Out[16]:
In [17]:
r1_vs_r2 = df_focus.groupby(['rater1', 'rater2']).size().unstack()
In [18]:
r1_vs_r2
Out[18]:
In [19]:
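plt.imshow(r1_vs_r2.values)
shape = r1_vs_r2.shape
# after unstack(), rows (index) are rater1 values and columns are rater2 values;
# imshow draws rows along the y axis and columns along the x axis
plt.yticks(np.arange(shape[0]), r1_vs_r2.index.values)
plt.ylabel('rater1')
plt.xticks(np.arange(shape[1]), r1_vs_r2.columns.values)
plt.xlabel('rater2')
plt.colorbar()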
Out[19]:
In [20]:
def compare_cols(col0, col1):
    data = df_focus.groupby([col0, col1]).size().unstack()
    shape = data.shape
    plt.imshow(data.values)
    plt.yticks(np.arange(shape[0]), data.index.values)
    plt.ylabel(col0)
    plt.xticks(np.arange(shape[1]), data.columns.values)
    plt.xlabel(col1)
    plt.colorbar()
In [21]:
compare_cols('rater1', 'skintone')
In [22]:
compare_cols('rater2', 'skintone')
skintone, rater1, rater2 all seem pretty highly correlated
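The visual impression of correlation can be backed by a number. This is a minimal sketch on made-up ratings, not the real data. The `skintone` column here is assumed, purely for illustration, to be the mean of the two ratings; the notebook does not state how the real column is derived.

```python
import pandas as pd

# Toy ratings (made-up values). 'skintone' is ASSUMED to be the mean of
# rater1 and rater2; this is an illustration, not the dataset's definition.
toy = pd.DataFrame({
    'rater1': [0.0, 0.25, 0.25, 0.5, 0.75, 1.0],
    'rater2': [0.0, 0.25, 0.5, 0.5, 1.0, 1.0],
})
toy['skintone'] = toy[['rater1', 'rater2']].mean(axis=1)

# Pairwise Pearson correlation between the three columns.
pearson = toy.corr()
print(pearson.round(3))
```

On the real data the equivalent call would be `df_focus[['rater1', 'rater2', 'skintone']].corr()`.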
In [23]:
aggs = ['count', 'median', 'sum', 'mean', 'var']
In [24]:
df_focus_rater1 = df_focus.groupby(['rater1']).agg(aggs)
In [25]:
data = df_focus_rater1['redCards']['count']
x = data.index.values
y = data.values
plt.bar(np.arange(len(x)), y)
plt.xticks(np.arange(len(x)), x)
plt.title('Interaction counts by rating from rater1')
plt.ylim(0, None)
Out[25]:
In [26]:
error_of_mean = np.sqrt(df_focus_rater1['redCards']['var'].values/df_focus_rater1['redCards']['count'].values)
In [27]:
plt.plot([np.arange(len(x)), np.arange(len(x))], [-error_of_mean, +error_of_mean], color='red', lw=4)
plt.show()
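The quantity computed above, sqrt(var / count), is the standard error of the mean for each group. As a sanity check, the sketch below uses made-up toy values (not the real data) to compare the manual formula with the built-in groupby `sem()` in pandas. Both use the sample variance (ddof=1) by default.

```python
import numpy as np
import pandas as pd

# Toy data (made-up): two rating groups with four interactions each.
toy = pd.DataFrame({
    'rater1':   [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    'redCards': [0,   0,   1,   0,   0,   1,   1,   0],
})
grouped = toy.groupby('rater1')['redCards']

# Manual standard error of the mean: sqrt(sample variance / group size).
manual_sem = np.sqrt(grouped.var() / grouped.count())

# Built-in equivalent.
builtin_sem = grouped.sem()

print(manual_sem)
print(builtin_sem)
```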
In [28]:
def df_col_agg_bar(ax, df, title, col, agg):
    title = '{} {}({})'.format(title, agg, col)
    data = df[col][agg]
    x = data.index.values
    y = data.values
    ax.bar(np.arange(len(x)), y)
    if agg == 'mean':
        error_of_mean = np.sqrt(df[col]['var'].values/df[col]['count'].values)
        ax.plot([np.arange(len(x)), np.arange(len(x))],
                [y - error_of_mean, y + error_of_mean], color='red', lw=4)
    ax.xaxis.set_ticks(np.arange(len(x)))
    ax.xaxis.set_ticklabels(x)
    ax.set_title(title)
    ax.set_ylim(0, None)
In [29]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4))
data = df_focus.groupby(['rater1']).agg(aggs)
df_col_agg_bar(axes[0], data, 'groupby(rater1)', 'redCards', 'count')
df_col_agg_bar(axes[1], data, 'groupby(rater1)', 'allreds', 'count')
df_col_agg_bar(axes[2], data, 'groupby(rater1)', 'allredsStrict', 'count')
In [30]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4))
data = df_focus.groupby(['rater1']).agg(aggs)
df_col_agg_bar(axes[0], data, 'groupby(rater1)', 'redCards', 'sum')
df_col_agg_bar(axes[1], data, 'groupby(rater1)', 'allreds', 'sum')
df_col_agg_bar(axes[2], data, 'groupby(rater1)', 'allredsStrict', 'sum')
In [31]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4), sharey=True)
data = df_focus.groupby(['rater1']).agg(aggs)
df_col_agg_bar(axes[0], data, 'groupby(rater1)', 'redCards', 'sum')
df_col_agg_bar(axes[1], data, 'groupby(rater1)', 'allreds', 'sum')
df_col_agg_bar(axes[2], data, 'groupby(rater1)', 'allredsStrict', 'sum')
In [32]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4), sharey=True)
data = df_focus.groupby(['rater1']).agg(aggs)
df_col_agg_bar(axes[0], data, 'groupby(rater1)', 'redCards', 'mean')
df_col_agg_bar(axes[1], data, 'groupby(rater1)', 'allreds', 'mean')
df_col_agg_bar(axes[2], data, 'groupby(rater1)', 'allredsStrict', 'mean')
In [33]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4), sharey=True)
data = df_focus.groupby(['rater2']).agg(aggs)
df_col_agg_bar(axes[0], data, 'groupby(rater2)', 'redCards', 'mean')
df_col_agg_bar(axes[1], data, 'groupby(rater2)', 'allreds', 'mean')
df_col_agg_bar(axes[2], data, 'groupby(rater2)', 'allredsStrict', 'mean')
In [34]:
fig, axes = plt.subplots(ncols=3, figsize=(16, 4), sharey=True)
data = df_focus.groupby(['skintone']).agg(aggs)
df_col_agg_bar(axes[0], data, 'groupby(skintone)', 'redCards', 'mean')
df_col_agg_bar(axes[1], data, 'groupby(skintone)', 'allreds', 'mean')
df_col_agg_bar(axes[2], data, 'groupby(skintone)', 'allredsStrict', 'mean')
What is going on with the 0.625 group?
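One way to start looking into a surprising group is to check how many rows it contains and which rating pairs feed into it, since a small group gives a noisy mean. The sketch below runs on made-up toy rows, not the real data. It assumes, as a guess that the notebook does not confirm, that `skintone` is the mean of rater1 and rater2.

```python
import pandas as pd

# Toy rows (made-up values). 'skintone' is ASSUMED to be the mean of the
# two ratings; this is a guess, not something stated in the notebook.
toy = pd.DataFrame({
    'rater1':   [0.0, 0.25, 0.25, 0.5, 0.5, 0.75, 1.0],
    'rater2':   [0.0, 0.25, 0.25, 0.5, 0.75, 0.75, 1.0],
    'redCards': [0, 0, 1, 0, 1, 0, 0],
})
toy['skintone'] = toy[['rater1', 'rater2']].mean(axis=1)

# How many rows fall into each skintone group?
group_sizes = toy.groupby('skintone').size()

# Which (rater1, rater2) pairs produce the group of interest?
pairs = toy[toy['skintone'] == 0.625].groupby(['rater1', 'rater2']).size()

print(group_sizes)
print(pairs)
```

On the real data, the same two steps applied to `df_focus` would show whether the 0.625 group is small or dominated by a single rating pair.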
In [ ]: