This dataset comes from a fantastic paper that examines how the analytical choices made by different data-science teams, each working on the same dataset to answer the same research question, affect the final outcome.
Many analysts, one dataset: Making transparent how variations in analytical choices affect results
The data can be found here.
Do an exploratory data analysis on the redcard dataset, keeping in mind the central question: Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?
In [1]:
%matplotlib inline
%config InlineBackend.figure_format='retina'
from __future__ import absolute_import, division, print_function
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.pyplot import GridSpec
import seaborn as sns
import numpy as np
import pandas as pd
import os, sys
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')
sns.set_context("poster", font_scale=1.3)
import missingno as msno
import pandas_profiling
from sklearn.datasets import make_blobs
import time
The dataset is available as a list of 146,028 player-referee dyads and includes player details, referee details, and details about the player-referee interactions. A summary of the variables of interest is given below. A detailed description of all variables can be found in the README file on the project website.
From a company for sports statistics, we obtained data and profile photos from all soccer players (N = 2,053) playing in the first male divisions of England, Germany, France and Spain in the 2012-2013 season and all referees (N = 3,147) that these players played under in their professional career (see Figure 1). We created a dataset of player-referee dyads including the number of matches players and referees encountered each other and our dependent variable, the number of red cards given to a player by a particular referee throughout all matches the two encountered each other.
-- https://docs.google.com/document/d/1uCF5wmbcL90qvrk_J27fWAvDcDNrO9o_APkicwRkOKc/edit
Variable Name | Variable Description |
---|---|
playerShort | short player ID |
player | player name |
club | player club |
leagueCountry | country of player club (England, Germany, France, and Spain) |
height | player height (in cm) |
weight | player weight (in kg) |
position | player position |
games | number of games in the player-referee dyad |
goals | number of goals in the player-referee dyad |
yellowCards | number of yellow cards player received from the referee |
yellowReds | number of yellow-red cards player received from the referee |
redCards | number of red cards player received from the referee |
photoID | ID of player photo (if available) |
rater1 | skin rating of photo by rater 1 |
rater2 | skin rating of photo by rater 2 |
refNum | unique referee ID number (referee name removed for anonymizing purposes) |
refCountry | unique referee country ID number |
meanIAT | mean implicit bias score (using the race IAT) for referee country |
nIAT | sample size for race IAT in that particular country |
seIAT | standard error for mean estimate of race IAT |
meanExp | mean explicit bias score (using a racial thermometer task) for referee country |
nExp | sample size for explicit bias in that particular country |
seExp | standard error for mean estimate of explicit bias measure |
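If you want to peek at the raw dyad-level file before any splitting, something like the following works (the filename below is an assumption -- use whatever name your download from the project page has):

In [ ]:
# Peek at the raw dyad-level file (filename assumed -- adjust to your download)
raw = pd.read_csv("CrowdstormingDataJuly1st.csv")
print(raw.shape)  # expect 146,028 rows, one per player-referee dyad
raw[['playerShort', 'refNum', 'games', 'yellowCards', 'redCards']].head()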
Hadley Wickham's concept of a tidy dataset can be summarized as:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table
A longer paper describing this can be found in this pdf.
Having datasets in this form allows for much simpler analyses. So the first step is to try and clean up the dataset into a tidy dataset.
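As a toy illustration of the first two rules, a wide table with one column per season is not tidy; pd.melt reshapes it so that each variable forms a column and each observation forms a row (the data here is made up):

In [ ]:
# Made-up wide table: one column per season is not tidy
untidy = pd.DataFrame({'player': ['player-a', 'player-b'],
                       'goals_2012': [3, 7],
                       'goals_2013': [5, 2]})
# Tidy form: every row is one (player, season) observation
pd.melt(untidy, id_vars='player', var_name='season', value_name='goals')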
The first step that I am going to take is to break up the dataset into its different observational units. By that I mean having separate tables (or dataframes) for each type of observational unit -- for example players, referees, referee countries, and the player-referee dyads.
In [2]:
def save_subgroup(dataframe, g_index, subgroup_name, prefix='raw_'):
    save_subgroup_filename = "".join([prefix, subgroup_name, ".csv.gz"])
    dataframe.to_csv(save_subgroup_filename, compression='gzip', encoding='UTF-8')
    test_df = pd.read_csv(save_subgroup_filename, compression='gzip', index_col=g_index, encoding='UTF-8')
    # Test that we recover what we send in
    if dataframe.equals(test_df):
        print("Test-passed: we recover the equivalent subgroup dataframe.")
    else:
        print("Warning -- equivalence test failed!!! Double-check.")
In [3]:
def load_subgroup(filename, index_col=[0]):
    return pd.read_csv(filename, compression='gzip', index_col=index_col)
In [4]:
# players = load_subgroup("../data/redcard/raw_players.csv.gz")
players = load_subgroup("raw_players.csv.gz")
In [5]:
players.head()
Out[5]:
In [6]:
players.shape
Out[6]:
In [7]:
ls
In [8]:
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
In [9]:
msno.bar(players.sample(500),
figsize=(16, 7),)
Interesting! It appears that a significant fraction of players don't have a skintone rating from at least one rater. Are they missing from both? Since this is the feature we are supposed to be answering against, it's important to figure this out early. Let's see if there are correlated patterns in how the data is missing for this table:
In [10]:
msno.heatmap(players.sample(500),
figsize=(16, 7),)
In [11]:
print("All players:", len(players))
print("rater1 nulls:", len(players[(players.rater1.isnull())]))
print("rater2 nulls:", len(players[players.rater2.isnull()]))
print("Both nulls:", len(players[(players.rater1.isnull()) & (players.rater2.isnull())]))
In [12]:
# modifying dataframe
players = players[players.rater1.notnull()]
players.shape[0]
Out[12]:
In [13]:
2053-1585
Out[13]:
We've removed 468 players from the table who had no skintone rating.
Let's look again at the missing data in this table.
In [14]:
msno.matrix(players.sample(500),
figsize=(16, 7),
width_ratios=(15, 1))
In [15]:
msno.bar(players.sample(500),
figsize=(16, 7),)
Each remaining player in the table has 2 skintone ratings -- are they sufficiently close that they can be combined?
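Before averaging them, it's worth quantifying how well the two raters agree (a quick check; both ratings are on the same scale):

In [ ]:
# How often do the two raters agree exactly, and how correlated are they?
print("Exact agreement: {:.1%}".format((players.rater1 == players.rater2).mean()))
print("Pearson correlation: {:.2f}".format(players.rater1.corr(players.rater2)))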
In [16]:
fig, ax = plt.subplots(figsize=(12, 8))
sns.heatmap(pd.crosstab(players.rater1, players.rater2), cmap='Blues', annot=True, fmt='d', ax=ax)
ax.set_title("Correlation between Rater 1 and Rater 2\n")
fig.tight_layout()
In [17]:
players.head()
Out[17]:
In [18]:
# modifying dataframe
players['skintone'] = players[['rater1', 'rater2']].mean(axis=1)
players.head()
Out[18]:
In [19]:
sns.distplot(players.skintone, kde=False);
In [20]:
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players.position.value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("Position")
ax.set_xlabel("Counts")
fig.tight_layout()
Intuitively, the different positions on the field probably have different red-card rates, but the position data we have is very granular.
Recommendation:
I chose to group the position types by their primary roles (you can disagree with my categorization and do it differently if you like).
In [21]:
position_types = players.position.unique()
position_types
Out[21]:
In [22]:
defense = ['Center Back','Defensive Midfielder', 'Left Fullback', 'Right Fullback', ]
midfield = ['Right Midfielder', 'Center Midfielder', 'Left Midfielder',]
forward = ['Attacking Midfielder', 'Left Winger', 'Right Winger', 'Center Forward']
keeper = 'Goalkeeper'
# modifying dataframe -- adding the aggregated position categorical position_agg
players.loc[players['position'].isin(defense), 'position_agg'] = "Defense"
players.loc[players['position'].isin(midfield), 'position_agg'] = "Midfield"
players.loc[players['position'].isin(forward), 'position_agg'] = "Forward"
players.loc[players['position'].eq(keeper), 'position_agg'] = "Keeper"
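The same aggregation can be written as a single explicit mapping, which is equivalent to the .loc assignments above and a bit easier to audit:

In [ ]:
# Equivalent alternative: one explicit position -> role mapping
position_map = {pos: 'Defense' for pos in defense}
position_map.update({pos: 'Midfield' for pos in midfield})
position_map.update({pos: 'Forward' for pos in forward})
position_map[keeper] = 'Keeper'
players['position_agg'] = players['position'].map(position_map)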
In [23]:
MIDSIZE = (12, 8)
fig, ax = plt.subplots(figsize=MIDSIZE)
players['position_agg'].value_counts(dropna=False, ascending=True).plot(kind='barh', ax=ax)
ax.set_ylabel("position_agg")
ax.set_xlabel("Counts")
fig.tight_layout()
In [24]:
sns.distplot(players.height.dropna());
In [25]:
sns.distplot(players.weight.dropna());
In [26]:
from pandas.plotting import scatter_matrix  # moved here from pandas.tools.plotting in newer pandas
In [27]:
fig, ax = plt.subplots(figsize=(10, 10))
scatter_matrix(players[['height', 'weight', 'skintone']], alpha=0.2, diagonal='hist', ax=ax);
In [28]:
# Perhaps you want to see a particular relationship more clearly
In [29]:
fig, ax = plt.subplots(figsize=MIDSIZE)
sns.regplot('weight', 'height', data=players, ax=ax)
ax.set_ylabel("Height [cm]")
ax.set_xlabel("Weight [kg]")
fig.tight_layout()
There aren't obvious outliers in the height vs weight relationship. Things that I check for:
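For example, a couple of quick range checks can catch unit mix-ups or data-entry errors (the thresholds below are illustrative guesses, not part of the original analysis):

In [ ]:
# Illustrative sanity checks -- thresholds are rough guesses, not from the paper
print(players[(players.height < 150) | (players.height > 210)][['height', 'weight']])
print(players[(players.weight < 50) | (players.weight > 110)][['height', 'weight']])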
In [30]:
weight_categories = ["vlow_weight",
"low_weight",
"mid_weight",
"high_weight",
"vhigh_weight",
]
players['weightclass'] = pd.qcut(players['weight'],
                                 len(weight_categories),
                                 labels=weight_categories)
In [31]:
players.head()
Out[31]:
In [32]:
height_categories = ["vlow_height",
"low_height",
"mid_height",
"high_height",
"vhigh_height",
]
players['heightclass'] = pd.qcut(players['height'],
                                 len(height_categories),
                                 labels=height_categories)
In [33]:
players['skintoneclass'] = pd.qcut(players['skintone'], 3)
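A quick sanity check on the new categorical columns -- quantile bins should come out roughly equal-sized by construction:

In [ ]:
# Quantile-based bins should be roughly equal-sized
print(players.weightclass.value_counts())
print(players.heightclass.value_counts())
print(players.skintoneclass.value_counts())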
There is a library that gives a high-level overview -- https://github.com/JosPolfliet/pandas-profiling
In [34]:
pandas_profiling.ProfileReport(players)
Out[34]:
Notice a few things -- the report automatically flags highly correlated features. Is that what we want? In this case, no: we want to keep the skintone column, and the fact that it's highly correlated with rater1 and rater2 (1) makes sense, but (2) which columns to keep should be a conscious choice.
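To see that correlation directly rather than relying on the report:

In [ ]:
# skintone is the mean of rater1 and rater2, so high correlation is expected
players[['rater1', 'rater2', 'skintone']].corr()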
In [35]:
players.birthday.head()
Out[35]:
In [36]:
# modifying dataframe
players['birth_date'] = pd.to_datetime(players.birthday, format='%d.%m.%Y')
players['age_years'] = ((pd.to_datetime("2013-01-01") - players['birth_date']).dt.days)/365.25
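A quick check that every birthday parsed and the resulting ages look plausible:

In [ ]:
# Any unparsed birth dates? Do the ages fall in a plausible range for professional players?
print("Unparsed birth dates:", players.birth_date.isnull().sum())
players.age_years.describe()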
In [37]:
sns.distplot(players.age_years);
In [38]:
players.head()
Out[38]:
In [39]:
players_cleaned_variables = players.columns.tolist()
players_cleaned_variables
Out[39]:
In [40]:
players_cleaned_variables = [#'birthday',
'height',
'weight',
# 'position',
# 'photoID',
# 'rater1',
# 'rater2',
'skintone',
'position_agg',
'weightclass',
'heightclass',
'skintoneclass',
# 'birth_date',
'age_years']
In [41]:
pandas_profiling.ProfileReport(players[players_cleaned_variables])
Out[41]:
In [42]:
players[players_cleaned_variables].to_csv("cleaned_players.csv.gz", compression='gzip')
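As a final check, we can read the cleaned file back, mirroring the round-trip test in save_subgroup. The categorical and interval columns come back as plain strings, so we compare shapes and eyeball a few rows rather than expecting exact dtype equality:

In [ ]:
# Read the cleaned file back; categorical/interval columns return as plain strings
check = pd.read_csv("cleaned_players.csv.gz", compression='gzip', index_col=0)
print(check.shape == players[players_cleaned_variables].shape)
check.head()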
In [ ]: