Redcard Exploratory Data Analysis

This dataset is taken from a fantastic paper that looks to see how analytical choices made by different data science teams on the same dataset in an attempt to answer the same research question affect the final outcome.

Many analysts, one dataset: Making transparent how variations in analytical choices affect results

The data can be found here.

The Task

Do an Exploratory Data Analysis on the redcard dataset. Keeping in mind the question is the following: Are soccer referees more likely to give red cards to dark-skin-toned players than light-skin-toned players?

  • Before plotting/joining/doing something, have a question or hypothesis that you want to investigate
  • Draw a plot of what you want to see on paper to sketch the idea
  • Write it down, then make the plan on how to get there
  • How do you know you aren't fooling yourself
  • What else can I check if this is actually true?
  • What evidence could there be that it's wrong?

In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'

from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)

import missingno as msno
import pandas_profiling

from sklearn.datasets import make_blobs
import time

About the Data

The dataset is available as a list with 146,028 dyads of players and referees and includes details from players, details from referees and details regarding the interactions of player-referees. A summary of the variables of interest can be seen below. A detailed description of all variables included can be seen in the README file on the project website.

From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). We created a dataset of player–referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.

-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit

Variable Name: Variable Description:
playerShort short player ID
player player name
club player club
leagueCountry country of player club (England, Germany, France, and Spain)
height player height (in cm)
weight player weight (in kg)
position player position
games number of games in the player-referee dyad
goals number of goals in the player-referee dyad
yellowCards number of yellow cards player received from the referee
yellowReds number of yellow-red cards player received from the referee
redCards number of red cards player received from the referee
photoID ID of player photo (if available)
rater1 skin rating of photo by rater 1
rater2 skin rating of photo by rater 2
refNum unique referee ID number (referee name removed for anonymizing purposes)
refCountry unique referee country ID number
meanIAT mean implicit bias score (using the race IAT) for referee country
nIAT sample size for race IAT in that particular country
seIAT standard error for mean estimate of race IAT
meanExp mean explicit bias score (using a racial thermometer task) for referee country
nExp sample size for explicit bias in that particular country
seExp standard error for mean estimate of explicit bias measure

Tidy Data

Hadley Wickham's concept of a tidy dataset summarized as:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table

A longer paper describing this can be found in this pdf.

Having datasets in this form allows for much simpler analyses. So the first step is to try and clean up the dataset into a tidy dataset.

The first step that I am going to take is to break up the dataset into the different observational units. By that I'm going to have separate tables (or dataframes) for:

  • players
  • clubs
  • referees
  • countries
  • dyads

In [2]:
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test!!! Double-check.")

In [3]:
def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)

Players


In [4]:
# players = load_subgroup("../data/redcard/raw_players.csv.gz")
players = load_subgroup("raw_players.csv.gz")

In [5]:
players.head()


Out[5]:
weight birthday rater1 height position photoID rater2
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25

In [6]:
players.shape


Out[6]:
(2053, 7)

In [7]:
ls


anaconda2_410@  anaconda3_431@    raw_countries.csv.gz  raw_players.csv.gz
anaconda3_410@  raw_clubs.csv.gz  raw_dyads.csv.gz      raw_referees.csv.gz

Visualize the missing-ness of data

https://github.com/ResidentMario/missingno


In [8]:
msno.matrix(players.sample(500),
            figsize=(16, 7),
            width_ratios=(15, 1))



In [9]:
msno.bar(players.sample(500),
            figsize=(16, 7),)


Interesting! It appears that a significant fraction of players don't have a skintone rating from at least one rater. Are they missing in both? Since this is the feature we are supposed to be answering against -- it's an important thing to figure out early! Let's see if there's correlated patterns in how the data is missing for this table:


In [10]:
msno.heatmap(players.sample(500),
            figsize=(16, 7),)



In [11]:
print("All players:", len(players))
print("rater1 nulls:", len(players[(players.rater1.isnull())]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))


All players: 2053
rater1 nulls: 468
rater2 nulls: 468
Both nulls: 468

In [12]:
# modifying dataframe
players = players[players.rater1.notnull()]
players.shape[0]


Out[12]:
1585

In [13]:
2053-1585


Out[13]:
468

We've removed 468 players from the table who had no skintone rating.

Let's look again at the missing data in this table.


In [14]:
msno.matrix(players.sample(500),
            figsize=(16, 7),
            width_ratios=(15, 1))



In [15]:
msno.bar(players.sample(500),
            figsize=(16, 7),)


Each remaining player in the table has 2 skintone ratings -- are they sufficiently close that they can be combined?


In [16]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()


Create useful new columns

This looks correlated enough to me -- let's combine the rater's skintone ratings into a new column that is the average rating.


In [17]:
players.head()


Out[17]:
weight birthday rater1 height position photoID rater2
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25

In [18]:
# modifying dataframe
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
players.head()


Out[18]:
weight birthday rater1 height position photoID rater2 skintone
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00 0.125
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25 0.125
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25 0.250
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00 0.000
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25 0.250

Visualize distributions of univariate features

What is the skintone distribution?


In [19]:
sns.distplot(players.skintone, kde=False);


Positions

Might the player's position correlate with the baseline susceptibility to redcards? Likely that a defender would have a higher rate than a keeper, for example.


In [20]:
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()


Create higher level categories

Intuitively, the different positions in the field probably have different redcard rates, but we have data that's very granular.

Recommendation:

  • create a new column
  • Don't overwrite the original data in case you need it or decide later that the higher level category is not useful

I chose to split up the position types by their primary roles (you can disagree with my categorization and do it differently if you feel).


In [21]:
position_types = players.position.unique()
position_types


Out[21]:
array(['Center Back', 'Attacking Midfielder', 'Right Midfielder',
       'Center Midfielder', 'Goalkeeper', 'Defensive Midfielder',
       'Left Fullback', nan, 'Left Midfielder', 'Right Fullback',
       'Center Forward', 'Left Winger', 'Right Winger'], dtype=object)

In [22]:
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'

# modifying dataframe -- adding the aggregated position categorical position_agg
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"

In [23]:
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()


Height and Weight

It's possible that size might correspond with redcards.


In [24]:
sns.distplot(players.height.dropna());



In [25]:
sns.distplot(players.weight.dropna());


Examine pair-wise relationships

Take a look at measures that will let you quickly see if there are problems or opportunities in the data.


In [26]:
from pandas.tools.plotting import scatter_matrix

In [27]:
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);



In [28]:
# Perhaps you want to see a particular relationship more clearly

In [29]:
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax)
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()


There aren't obvious outliers in the height vs weight relationship. Things that I check for:

  • Are there zeros?
  • Are there strange correlations?
  • Are there separate clusters (something that could come from recording the height in inches for some and in cm for others, for example)?

Create quantile bins for continuous variables


In [30]:
weight_categories = ["vlow_weight",
                     "low_weight",
                     "mid_weight",
                     "high_weight",
                     "vhigh_weight",
                    ]

players['weightclass'] = pd.qcut(players['weight'],
                                 len(weight_categories),
                                 weight_categories)

In [31]:
players.head()


Out[31]:
weight birthday rater1 height position photoID rater2 skintone position_agg weightclass
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00 0.125 Defense low_weight
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25 0.125 Forward low_weight
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25 0.250 Midfield vlow_weight
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00 0.000 Midfield mid_weight
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25 0.250 Defense low_weight

In [32]:
height_categories = ["vlow_height",
                     "low_height",
                     "mid_height",
                     "high_height",
                     "vhigh_height",
                    ]

players['heightclass'] = pd.qcut(players['height'],
                                 len(height_categories),
                                 height_categories)

In [33]:
players['skintoneclass'] = pd.qcut(players['skintone'], 3)

Pandas profiling

There is a library that gives a high level overview -- https://github.com/JosPolfliet/pandas-profiling


In [34]:
pandas_profiling.ProfileReport(players)


Out[34]:

Overview

Dataset info

Number of variables 13
Number of observations 1585
Total Missing (%) 1.7%
Total size in memory 128.7 KiB
Average record size in memory 83.1 B

Variables types

Numeric 3
Categorical 6
Date 0
Text (Unique) 2
Rejected 2

Warnings

  • birthday has a high cardinality: 1389 distinct values Warning
  • position has 152 / 9.6% missing values Missing
  • position_agg has 152 / 9.6% missing values Missing
  • rater1 has 626 / 39.5% zeros
  • rater2 is highly correlated with rater1 (ρ = 0.92361) Rejected
  • skintone is highly correlated with rater2 (ρ = 0.98072) Rejected
  • weight has 21 / 1.3% missing values Missing
  • weightclass has 21 / 1.3% missing values Missing

Variables

birthday
Categorical

Distinct count 1389
Unique (%) 87.6%
Missing (%) 0.0%
Missing (n) 0
04.08.1985
 
4
25.05.1986
 
3
03.08.1987
 
3
Other values (1386)
1575
Value Count Frequency (%)  
04.08.1985 4 0.3%
 
25.05.1986 3 0.2%
 
03.08.1987 3 0.2%
 
18.04.1983 3 0.2%
 
25.04.1988 3 0.2%
 
10.04.1986 3 0.2%
 
17.03.1986 3 0.2%
 
20.02.1981 3 0.2%
 
01.01.1993 3 0.2%
 
31.05.1989 3 0.2%
 
Other values (1379) 1554 98.0%
 

height
Numeric

Distinct count 43
Unique (%) 2.7%
Missing (%) 0.2%
Missing (n) 3
Infinite (%) 0.0%
Infinite (n) 0
Mean 182.04
Minimum 161
Maximum 203
Zeros (%) 0.0%

Quantile statistics

Minimum 161
5-th percentile 170.05
Q1 178
Median 182
Q3 187
95-th percentile 193
Maximum 203
Range 42
Interquartile range 9

Descriptive statistics

Standard deviation 6.7429
Coef of variation 0.03704
Kurtosis -0.35285
Mean 182.04
MAD 5.495
Skewness -0.060692
Sum 287990
Variance 45.467
Memory size 12.5 KiB
Value Count Frequency (%)  
180.0 125 7.9%
 
183.0 102 6.4%
 
188.0 94 5.9%
 
178.0 91 5.7%
 
185.0 91 5.7%
 
184.0 86 5.4%
 
186.0 78 4.9%
 
182.0 67 4.2%
 
190.0 66 4.2%
 
179.0 64 4.0%
 
Other values (32) 718 45.3%
 

Minimum 5 values

Value Count Frequency (%)  
161.0 1 0.1%
 
162.0 1 0.1%
 
163.0 1 0.1%
 
164.0 2 0.1%
 
165.0 3 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
198.0 4 0.3%
 
199.0 2 0.1%
 
201.0 1 0.1%
 
202.0 2 0.1%
 
203.0 1 0.1%
 

heightclass
Categorical

Distinct count 6
Unique (%) 0.4%
Missing (%) 0.2%
Missing (n) 3
vlow_height
344
low_height
329
mid_height
319
Other values (2)
590
Value Count Frequency (%)  
vlow_height 344 21.7%
 
low_height 329 20.8%
 
mid_height 319 20.1%
 
high_height 317 20.0%
 
vhigh_height 273 17.2%
 
(Missing) 3 0.2%
 

photoID
Categorical, Unique

First 3 values
14005.jpg
12551.jpg
986.jpg
Last 3 values
166498.jpg
166489.jpg
90196.jpg

First 10 values

Value Count Frequency (%)  
100063.jpg 1 0.1%
 
100350.jpg 1 0.1%
 
10037.jpg 1 0.1%
 
10038.jpg 1 0.1%
 
100459.jpg 1 0.1%
 

Last 10 values

Value Count Frequency (%)  
9910.jpg 1 0.1%
 
9911.jpg 1 0.1%
 
99196.jpg 1 0.1%
 
9922.jpg 1 0.1%
 
9975.jpg 1 0.1%
 

playerShort
Categorical, Unique

First 3 values
shane-long
filip-malbasic
alou-diarra
Last 3 values
emir-spahic
christofer-heimeroth
pulido_4

First 10 values

Value Count Frequency (%)  
aaron-hughes 1 0.1%
 
aaron-hunt 1 0.1%
 
aaron-lennon 1 0.1%
 
aaron-ramsey 1 0.1%
 
abdelhamid-el-kaoutari 1 0.1%
 

Last 10 values

Value Count Frequency (%)  
zoltan-gera 1 0.1%
 
zoltan-stieber 1 0.1%
 
zoumana-camara 1 0.1%
 
zubikarai 1 0.1%
 
zurutuza 1 0.1%
 

position
Categorical

Distinct count 13
Unique (%) 0.9%
Missing (%) 9.6%
Missing (n) 152
Center Back
248
Center Forward
191
Defensive Midfielder
172
Other values (9)
822
Value Count Frequency (%)  
Center Back 248 15.6%
 
Center Forward 191 12.1%
 
Defensive Midfielder 172 10.9%
 
Goalkeeper 161 10.2%
 
Attacking Midfielder 127 8.0%
 
Left Fullback 117 7.4%
 
Right Fullback 109 6.9%
 
Left Midfielder 76 4.8%
 
Center Midfielder 60 3.8%
 
Right Midfielder 60 3.8%
 
Other values (2) 112 7.1%
 
(Missing) 152 9.6%
 

position_agg
Categorical

Distinct count 5
Unique (%) 0.3%
Missing (%) 9.6%
Missing (n) 152
Defense
646
Forward
430
Midfield
196
Value Count Frequency (%)  
Defense 646 40.8%
 
Forward 430 27.1%
 
Midfield 196 12.4%
 
Keeper 161 10.2%
 
(Missing) 152 9.6%
 

rater1
Numeric

Distinct count 5
Unique (%) 0.3%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.26861
Minimum 0
Maximum 1
Zeros (%) 39.5%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0.25
Q3 0.5
95-th percentile 1
Maximum 1
Range 1
Interquartile range 0.5

Descriptive statistics

Standard deviation 0.29658
Coef of variation 1.1041
Kurtosis 0.21394
Mean 0.26861
MAD 0.22512
Skewness 1.0645
Sum 425.75
Variance 0.087958
Memory size 12.5 KiB
Value Count Frequency (%)  
0.0 626 39.5%
 
0.25 551 34.8%
 
0.5 170 10.7%
 
0.75 140 8.8%
 
1.0 98 6.2%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 626 39.5%
 
0.25 551 34.8%
 
0.5 170 10.7%
 
0.75 140 8.8%
 
1.0 98 6.2%
 

Maximum 5 values

Value Count Frequency (%)  
0.0 626 39.5%
 
0.25 551 34.8%
 
0.5 170 10.7%
 
0.75 140 8.8%
 
1.0 98 6.2%
 

rater2
Highly correlated

This variable is highly correlated with rater1 and should be ignored for analysis

Correlation 0.92361

skintone
Highly correlated

This variable is highly correlated with rater2 and should be ignored for analysis

Correlation 0.98072

skintoneclass
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
[0, 0.125]
651
(0.125, 0.25]
472
(0.25, 1]
462
Value Count Frequency (%)  
[0, 0.125] 651 41.1%
 
(0.125, 0.25] 472 29.8%
 
(0.25, 1] 462 29.1%
 

weight
Numeric

Distinct count 45
Unique (%) 2.9%
Missing (%) 1.3%
Missing (n) 21
Infinite (%) 0.0%
Infinite (n) 0
Mean 76.035
Minimum 54
Maximum 100
Zeros (%) 0.0%

Quantile statistics

Minimum 54
5-th percentile 65
Q1 71
Median 76
Q3 81
95-th percentile 88
Maximum 100
Range 46
Interquartile range 10

Descriptive statistics

Standard deviation 7.1056
Coef of variation 0.093452
Kurtosis -0.075715
Mean 76.035
MAD 5.6862
Skewness 0.20112
Sum 118920
Variance 50.489
Memory size 12.5 KiB
Value Count Frequency (%)  
75.0 95 6.0%
 
70.0 93 5.9%
 
76.0 91 5.7%
 
72.0 89 5.6%
 
74.0 85 5.4%
 
78.0 84 5.3%
 
80.0 81 5.1%
 
77.0 79 5.0%
 
73.0 74 4.7%
 
71.0 63 4.0%
 
Other values (34) 730 46.1%
 

Minimum 5 values

Value Count Frequency (%)  
54.0 1 0.1%
 
55.0 1 0.1%
 
56.0 2 0.1%
 
58.0 1 0.1%
 
59.0 3 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
94.0 5 0.3%
 
95.0 5 0.3%
 
96.0 4 0.3%
 
97.0 2 0.1%
 
100.0 2 0.1%
 

weightclass
Categorical

Distinct count 6
Unique (%) 0.4%
Missing (%) 1.3%
Missing (n) 21
vlow_weight
361
mid_weight
349
low_weight
311
Other values (2)
543
Value Count Frequency (%)  
vlow_weight 361 22.8%
 
mid_weight 349 22.0%
 
low_weight 311 19.6%
 
vhigh_weight 293 18.5%
 
high_weight 250 15.8%
 
(Missing) 21 1.3%
 

Sample

weight birthday rater1 height position photoID rater2 skintone position_agg weightclass heightclass skintoneclass
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00 0.125 Defense low_weight mid_height [0, 0.125]
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25 0.125 Forward low_weight mid_height [0, 0.125]
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25 0.250 Midfield vlow_weight vlow_height (0.125, 0.25]
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00 0.000 Midfield mid_weight low_height [0, 0.125]
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25 0.250 Defense low_weight low_height (0.125, 0.25]

Notice a few things -- it automatically finds highly correlated features -- is that what we want?

In this case no -- we want the skintone column, and the fact that it's highly correlated with rater1 and rater2, 1, makes sense, but 2, should be consciously selected.

Question -- What to do with birthday column?

Some points to consider:

  • this is a career-long dataset of redcards as of 2012-2013 season
  • the age of the player at 2012 does not (necessarily) correspond to the date of receiving a redcard

In [35]:
players.birthday.head()


Out[35]:
playerShort
aaron-hughes              08.11.1979
aaron-hunt                04.09.1986
aaron-lennon              16.04.1987
aaron-ramsey              26.12.1990
abdelhamid-el-kaoutari    17.03.1990
Name: birthday, dtype: object

In [36]:
# modifying dataframe
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25

In [37]:
sns.distplot(players.age_years);


Select variables to (possibly) use


In [38]:
players.head()


Out[38]:
weight birthday rater1 height position photoID rater2 skintone position_agg weightclass heightclass skintoneclass birth_date age_years
playerShort
aaron-hughes 71.0 08.11.1979 0.25 182.0 Center Back 3868.jpg 0.00 0.125 Defense low_weight mid_height [0, 0.125] 1979-11-08 33.149897
aaron-hunt 73.0 04.09.1986 0.00 183.0 Attacking Midfielder 20136.jpg 0.25 0.125 Forward low_weight mid_height [0, 0.125] 1986-09-04 26.327173
aaron-lennon 63.0 16.04.1987 0.25 165.0 Right Midfielder 13515.jpg 0.25 0.250 Midfield vlow_weight vlow_height (0.125, 0.25] 1987-04-16 25.713895
aaron-ramsey 76.0 26.12.1990 0.00 178.0 Center Midfielder 94953.jpg 0.00 0.000 Midfield mid_weight low_height [0, 0.125] 1990-12-26 22.017796
abdelhamid-el-kaoutari 73.0 17.03.1990 0.25 180.0 Center Back 124913.jpg 0.25 0.250 Defense low_weight low_height (0.125, 0.25] 1990-03-17 22.795346

In [39]:
players_cleaned_variables = players.columns.tolist()
players_cleaned_variables


Out[39]:
['weight',
 'birthday',
 'rater1',
 'height',
 'position',
 'photoID',
 'rater2',
 'skintone',
 'position_agg',
 'weightclass',
 'heightclass',
 'skintoneclass',
 'birth_date',
 'age_years']

In [40]:
players_cleaned_variables = [#'birthday',
                             'height',
                             'weight',
#                              'position',
#                              'photoID',
#                              'rater1',
#                              'rater2',
                             'skintone',
                             'position_agg',
                             'weightclass',
                             'heightclass',
                             'skintoneclass',
#                              'birth_date',
                             'age_years']

In [41]:
pandas_profiling.ProfileReport(players[players_cleaned_variables])


Out[41]:

Overview

Dataset info

Number of variables 9
Number of observations 1585
Total Missing (%) 1.4%
Total size in memory 79.1 KiB
Average record size in memory 51.1 B

Variables types

Numeric 4
Categorical 4
Date 0
Text (Unique) 1
Rejected 0

Warnings

  • position_agg has 152 / 9.6% missing values Missing
  • skintone has 424 / 26.8% zeros
  • weight has 21 / 1.3% missing values Missing
  • weightclass has 21 / 1.3% missing values Missing

Variables

age_years
Numeric

Distinct count 1389
Unique (%) 87.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 26.652
Minimum 16.282
Maximum 41.626
Zeros (%) 0.0%

Quantile statistics

Minimum 16.282
5-th percentile 19.766
Q1 23.239
Median 26.461
Q3 29.815
95-th percentile 33.933
Maximum 41.626
Range 25.344
Interquartile range 6.5763

Descriptive statistics

Standard deviation 4.4577
Coef of variation 0.16725
Kurtosis -0.48301
Mean 26.652
MAD 3.6731
Skewness 0.25118
Sum 42244
Variance 19.871
Memory size 12.5 KiB
Value Count Frequency (%)  
27.4113620808 4 0.3%
 
20.0 3 0.2%
 
31.8138261465 3 0.2%
 
26.7296372348 3 0.2%
 
25.6016427105 3 0.2%
 
31.8631074606 3 0.2%
 
24.6872005476 3 0.2%
 
23.5893223819 3 0.2%
 
25.4154688569 3 0.2%
 
26.6064339493 3 0.2%
 
Other values (1379) 1554 98.0%
 

Minimum 5 values

Value Count Frequency (%)  
16.2819986311 1 0.1%
 
17.0622861054 1 0.1%
 
17.4702258727 1 0.1%
 
17.6290212183 1 0.1%
 
17.749486653 1 0.1%
 

Maximum 5 values

Value Count Frequency (%)  
39.1950718686 1 0.1%
 
39.3018480493 1 0.1%
 
40.2381930185 1 0.1%
 
40.8405201916 1 0.1%
 
41.6262833676 1 0.1%
 

height
Numeric

Distinct count 43
Unique (%) 2.7%
Missing (%) 0.2%
Missing (n) 3
Infinite (%) 0.0%
Infinite (n) 0
Mean 182.04
Minimum 161
Maximum 203
Zeros (%) 0.0%

Quantile statistics

Minimum 161
5-th percentile 170.05
Q1 178
Median 182
Q3 187
95-th percentile 193
Maximum 203
Range 42
Interquartile range 9

Descriptive statistics

Standard deviation 6.7429
Coef of variation 0.03704
Kurtosis -0.35285
Mean 182.04
MAD 5.495
Skewness -0.060692
Sum 287990
Variance 45.467
Memory size 12.5 KiB
Value Count Frequency (%)  
180.0 125 7.9%
 
183.0 102 6.4%
 
188.0 94 5.9%
 
178.0 91 5.7%
 
185.0 91 5.7%
 
184.0 86 5.4%
 
186.0 78 4.9%
 
182.0 67 4.2%
 
190.0 66 4.2%
 
179.0 64 4.0%
 
Other values (32) 718 45.3%
 

Minimum 5 values

Value Count Frequency (%)  
161.0 1 0.1%
 
162.0 1 0.1%
 
163.0 1 0.1%
 
164.0 2 0.1%
 
165.0 3 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
198.0 4 0.3%
 
199.0 2 0.1%
 
201.0 1 0.1%
 
202.0 2 0.1%
 
203.0 1 0.1%
 

heightclass
Categorical

Distinct count 6
Unique (%) 0.4%
Missing (%) 0.2%
Missing (n) 3
vlow_height
344
low_height
329
mid_height
319
Other values (2)
590
Value Count Frequency (%)  
vlow_height 344 21.7%
 
low_height 329 20.8%
 
mid_height 319 20.1%
 
high_height 317 20.0%
 
vhigh_height 273 17.2%
 
(Missing) 3 0.2%
 

playerShort
Categorical, Unique

First 3 values
shane-long
filip-malbasic
alou-diarra
Last 3 values
emir-spahic
christofer-heimeroth
pulido_4

First 10 values

Value Count Frequency (%)  
aaron-hughes 1 0.1%
 
aaron-hunt 1 0.1%
 
aaron-lennon 1 0.1%
 
aaron-ramsey 1 0.1%
 
abdelhamid-el-kaoutari 1 0.1%
 

Last 10 values

Value Count Frequency (%)  
zoltan-gera 1 0.1%
 
zoltan-stieber 1 0.1%
 
zoumana-camara 1 0.1%
 
zubikarai 1 0.1%
 
zurutuza 1 0.1%
 

position_agg
Categorical

Distinct count 5
Unique (%) 0.3%
Missing (%) 9.6%
Missing (n) 152
Defense
646
Forward
430
Midfield
196
Value Count Frequency (%)  
Defense 646 40.8%
 
Forward 430 27.1%
 
Midfield 196 12.4%
 
Keeper 161 10.2%
 
(Missing) 152 9.6%
 

skintone
Numeric

Distinct count 9
Unique (%) 0.6%
Missing (%) 0.0%
Missing (n) 0
Infinite (%) 0.0%
Infinite (n) 0
Mean 0.28951
Minimum 0
Maximum 1
Zeros (%) 26.8%

Quantile statistics

Minimum 0
5-th percentile 0
Q1 0
Median 0.25
Q3 0.375
95-th percentile 1
Maximum 1
Range 1
Interquartile range 0.375

Descriptive statistics

Standard deviation 0.2909
Coef of variation 1.0048
Kurtosis 0.23468
Mean 0.28951
MAD 0.22555
Skewness 1.0843
Sum 458.88
Variance 0.084622
Memory size 12.5 KiB
Value Count Frequency (%)  
0.25 472 29.8%
 
0.0 424 26.8%
 
0.125 227 14.3%
 
0.5 116 7.3%
 
0.75 104 6.6%
 
1.0 92 5.8%
 
0.375 77 4.9%
 
0.875 40 2.5%
 
0.625 33 2.1%
 

Minimum 5 values

Value Count Frequency (%)  
0.0 424 26.8%
 
0.125 227 14.3%
 
0.25 472 29.8%
 
0.375 77 4.9%
 
0.5 116 7.3%
 

Maximum 5 values

Value Count Frequency (%)  
0.5 116 7.3%
 
0.625 33 2.1%
 
0.75 104 6.6%
 
0.875 40 2.5%
 
1.0 92 5.8%
 

skintoneclass
Categorical

Distinct count 3
Unique (%) 0.2%
Missing (%) 0.0%
Missing (n) 0
[0, 0.125]
651
(0.125, 0.25]
472
(0.25, 1]
462
Value Count Frequency (%)  
[0, 0.125] 651 41.1%
 
(0.125, 0.25] 472 29.8%
 
(0.25, 1] 462 29.1%
 

weight
Numeric

Distinct count 45
Unique (%) 2.9%
Missing (%) 1.3%
Missing (n) 21
Infinite (%) 0.0%
Infinite (n) 0
Mean 76.035
Minimum 54
Maximum 100
Zeros (%) 0.0%

Quantile statistics

Minimum 54
5-th percentile 65
Q1 71
Median 76
Q3 81
95-th percentile 88
Maximum 100
Range 46
Interquartile range 10

Descriptive statistics

Standard deviation 7.1056
Coef of variation 0.093452
Kurtosis -0.075715
Mean 76.035
MAD 5.6862
Skewness 0.20112
Sum 118920
Variance 50.489
Memory size 12.5 KiB
Value Count Frequency (%)  
75.0 95 6.0%
 
70.0 93 5.9%
 
76.0 91 5.7%
 
72.0 89 5.6%
 
74.0 85 5.4%
 
78.0 84 5.3%
 
80.0 81 5.1%
 
77.0 79 5.0%
 
73.0 74 4.7%
 
71.0 63 4.0%
 
Other values (34) 730 46.1%
 

Minimum 5 values

Value Count Frequency (%)  
54.0 1 0.1%
 
55.0 1 0.1%
 
56.0 2 0.1%
 
58.0 1 0.1%
 
59.0 3 0.2%
 

Maximum 5 values

Value Count Frequency (%)  
94.0 5 0.3%
 
95.0 5 0.3%
 
96.0 4 0.3%
 
97.0 2 0.1%
 
100.0 2 0.1%
 

weightclass
Categorical

Distinct count 6
Unique (%) 0.4%
Missing (%) 1.3%
Missing (n) 21
vlow_weight
361
mid_weight
349
low_weight
311
Other values (2)
543
Value Count Frequency (%)  
vlow_weight 361 22.8%
 
mid_weight 349 22.0%
 
low_weight 311 19.6%
 
vhigh_weight 293 18.5%
 
high_weight 250 15.8%
 
(Missing) 21 1.3%
 

Sample

height weight skintone position_agg weightclass heightclass skintoneclass age_years
playerShort
aaron-hughes 182.0 71.0 0.125 Defense low_weight mid_height [0, 0.125] 33.149897
aaron-hunt 183.0 73.0 0.125 Forward low_weight mid_height [0, 0.125] 26.327173
aaron-lennon 165.0 63.0 0.250 Midfield vlow_weight vlow_height (0.125, 0.25] 25.713895
aaron-ramsey 178.0 76.0 0.000 Midfield mid_weight low_height [0, 0.125] 22.017796
abdelhamid-el-kaoutari 180.0 73.0 0.250 Defense low_weight low_height (0.125, 0.25] 22.795346

In [42]:
players[players_cleaned_variables].to_csv("cleaned_players.csv.gz", compression='gzip')

Challenge: Do the same process we did above but for the Referee, Clubs, and Country dataframes


In [ ]: